CUDA
# Check GPU, driver version, and CUDA runtime version
nvidia-smi
Wed Feb 4 17:54:46 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20 Driver Version: 570.133.20 CUDA Version: 13.0 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L40S On | 00000000:36:00.0 Off | 0 |
| N/A 23C P8 32W / 350W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
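For scripting, `nvidia-smi` also has a machine-readable query mode. The sketch below shows the query and parses one CSV line of its output with awk; a captured sample line is piped in so the parsing step is reproducible on a machine without a GPU:

```shell
# Compact query (requires the NVIDIA driver):
#   nvidia-smi --query-gpu=name,memory.total,utilization.gpu --format=csv,noheader
# Parse one CSV line of that output; a captured sample line is used here:
echo "NVIDIA L40S, 46068 MiB, 0 %" \
  | awk -F', ' '{printf "gpu=%s mem=%s util=%s\n", $1, $2, $3}'
```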
# Check whether the CUDA Toolkit is installed
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:58:59_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0
# List installed NVIDIA packages
dpkg -l | grep -i nvidia
ii cuda-nsight-compute-13-0 13.0.2-1 amd64 NVIDIA Nsight Compute
ii cuda-nvtx-13-0 13.0.85-1 amd64 NVIDIA Tools Extension
hi libnccl-dev 2.28.3-1+cuda13.0 amd64 NVIDIA Collective Communication Library (NCCL) Development Files
hi libnccl2 2.28.3-1+cuda13.0 amd64 NVIDIA Collective Communication Library (NCCL) Runtime
ii nsight-compute-2025.3.1 2025.3.1.4-1 amd64 NVIDIA Nsight Compute
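In `dpkg -l` output the first status letter is the desired state and the second the current state, so `ii` means installed and `hi` means held at the installed version (the NCCL packages above are pinned; `apt-mark showhold` lists holds directly). A quick filter for held packages, run here against a sample of the listing:

```shell
# First-column letters: desired state (i=install, h=hold) + current state (i=installed);
# "hi" therefore means held at the installed version.
printf 'hi  libnccl2  2.28.3-1+cuda13.0\nii  cuda-nvtx-13-0  13.0.85-1\n' \
  | awk '$1 == "hi" {print $2 " (held)"}'
```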
# GPU architecture / VRAM / PCIe details
nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Wed Feb 4 17:58:13 2026
Driver Version : 570.133.20
CUDA Version : 13.0
Attached GPUs : 1
GPU 00000000:36:00.0
Product Name : NVIDIA L40S
Product Brand : NVIDIA
Product Architecture : Ada Lovelace
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : None
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1323723031205
GPU UUID : GPU-63ecdaf3-b7dc-2190-9912-795023d341e6
Minor Number : 2
VBIOS Version : 95.02.66.00.02
MultiGPU Board : No
Board ID : 0x3600
Board Part Number : 900-2G133-0080-000
GPU Part Number : 26B9-896-A1
FRU Part Number : N/A
Platform Info
Chassis Serial Number : N/A
Slot Number : N/A
Tray Index : N/A
Host ID : N/A
Peer Type : N/A
Module Id : 1
GPU Fabric GUID : N/A
Inforom Version
Image Version : G133.0242.00.03
OEM Object : 2.1
ECC Object : 6.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU C2C Mode : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
vGPU Heterogeneous Mode : N/A
GPU Reset Status
Reset Required : Requested functionality has been deprecated
Drain and Reset Recommended : Requested functionality has been deprecated
GPU Recovery Action : None
GSP Firmware Version : 570.133.20
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x36
Device : 0x00
Domain : 0x0000
Base Classcode : 0x3
Sub Classcode : 0x2
Device Id : 0x26B910DE
Bus Id : 00000000:36:00.0
Sub System Id : 0x185110DE
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Device Current : 1
Device Max : 4
Host Max : 4
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 350 KB/s
Rx Throughput : 350 KB/s
Atomic Caps Outbound : N/A
Atomic Caps Inbound : N/A
Fan Speed : N/A
Performance State : P8
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
Sparse Operation Mode : N/A
FB Memory Usage
Total : 46068 MiB
Reserved : 600 MiB
Used : 0 MiB
Free : 45469 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
GPU : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
DRAM Encryption Mode
Current : N/A
Pending : N/A
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable Parity : 0
SRAM Uncorrectable SEC-DED : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable Parity : 0
SRAM Uncorrectable SEC-DED : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
SRAM Threshold Exceeded : No
Aggregate Uncorrectable SRAM Sources
SRAM L2 : 0
SRAM SM : 0
SRAM Microcontroller : 0
SRAM PCIE : 0
SRAM Other : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 192 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 24 C
GPU T.Limit Temp : 64 C
GPU Shutdown T.Limit Temp : -5 C
GPU Slowdown T.Limit Temp : -2 C
GPU Max Operating T.Limit Temp : 0 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating T.Limit Temp : N/A
GPU Power Readings
Average Power Draw : 32.52 W
Instantaneous Power Draw : 32.52 W
Current Power Limit : 350.00 W
Requested Power Limit : 350.00 W
Default Power Limit : 350.00 W
Min Power Limit : 100.00 W
Max Power Limit : 350.00 W
GPU Memory Power Readings
Average Power Draw : N/A
Instantaneous Power Draw : N/A
Module Power Readings
Average Power Draw : N/A
Instantaneous Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Power Smoothing : N/A
Workload Power Profiles
Requested Profiles : N/A
Enforced Profiles : N/A
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 1185 MHz
Applications Clocks
Graphics : 2520 MHz
Memory : 9001 MHz
Default Applications Clocks
Graphics : 2520 MHz
Memory : 9001 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 2520 MHz
SM : 2520 MHz
Memory : 9001 MHz
Video : 1965 MHz
Max Customer Boost Clocks
Graphics : 2520 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Fabric
State : N/A
Status : N/A
CliqueId : N/A
ClusterUUID : N/A
Health
Bandwidth : N/A
Route Recovery in progress : N/A
Route Unhealthy : N/A
Access Timeout Recovery : N/A
Processes : None
Capabilities
EGM : disabled
# Key summary only (recommended)
nvidia-smi -L
GPU 0: NVIDIA L40S (UUID: GPU-63ecdaf3-b7dc-2190-9912-795023d341e6)
uname -a
Linux cci-1629dd8f-ac56-446b-9ab7-03f479be1a73-0 5.15.0-126-generic #136+zetyun SMP Fri Aug 8 07:03:31 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
model deploy
export HF_ENDPOINT=http://hfmirror.mas.zetyun.cn:8082
hf download BAAI/bge-large-zh-v1.5 --local-dir /mnt/nas/models/BAAI/bge-large-zh-v1.5
hf download BAAI/bge-reranker-large --local-dir /mnt/nas/models/BAAI/bge-reranker-large
hf download Qwen/Qwen2.5-7B-Instruct --local-dir /mnt/nas/models/Qwen/Qwen2.5-7B-Instruct
Downloading (incomplete total...): 0%| | 1.68M/7.81G [00:00<06:49, 19.1MB/s]
Still waiting to acquire lock on /mnt/nas/models/Qwen/Qwen2.5-7B-Instruct/.cache/huggingface/.gitignore.lock (elapsed: 0.1 seconds)
Fetching 14 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:39<00:00, 2.84s/it]
Download complete: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15.2G/15.2G [00:39<00:00, 449MB/s]
/mnt/nas/models/Qwen/Qwen2.5-7B-Instruct
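The three `hf download` calls above can be folded into one loop; the target directory is derived from the repo ID. The `hf` invocation itself is left commented so the sketch runs without network access or the mirror:

```shell
# Mirror endpoint and target layout taken from the commands above
export HF_ENDPOINT=http://hfmirror.mas.zetyun.cn:8082
for repo in BAAI/bge-large-zh-v1.5 BAAI/bge-reranker-large Qwen/Qwen2.5-7B-Instruct; do
  dest="/mnt/nas/models/$repo"
  echo "would fetch $repo -> $dest"
  # hf download "$repo" --local-dir "$dest"
done
```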
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# If you'd prefer that conda's base environment not be activated on startup,
# run the following command when conda is activated:
#   conda config --set auto_activate_base false
# Note: You can undo this later by running `conda init --reverse $SHELL`
ls /root/public/datasets/
A1C AIGym AlayaNeW EdBianchi Jize1 RUC-AIBOX Salesforce agentica-org fka gy65896 kaist-ai multimolecule openvla sanjion9 togethercomputer yentinglin
AI-MO AgentGym BytedTsinghua-SIA JAYASUDHA-S-V PRIME-RL SWE-bench SamuelYang bookcorpus google hongchi lerobot nateraw rajpurkar swordfaith wtcherr zixianma
OS Version      : Ubuntu 24.04.3 LTS
Kernel Version  : 5.15.0-126-generic
Hostname        : cci-1629dd8f-ac56-446b-9ab7-03f479be1a73-0
IP Address      : 172.16.62.247
CPU Model       : Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
CPU Cores       : 10C
Memory Usage    : 7914 MB / 81920 MB (9.66%)
GPU Information : NVIDIA L40S × 1
CUDA Version    : 13.0
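A summary like the banner above can be regenerated on any Linux host (values differ per machine; the GPU line is only printed when `nvidia-smi` is present):

```shell
echo "Kernel Version : $(uname -r)"
echo "Hostname       : $(hostname)"
echo "CPU Cores      : $(nproc)C"
echo "Memory Usage   : $(free -m | awk '/^Mem:/ {printf "%d MB / %d MB", $3, $2}')"
# GPU line only if the NVIDIA driver is installed:
command -v nvidia-smi >/dev/null && echo "GPU            : $(nvidia-smi -L | head -n1)" || true
```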
python env setup
conda create -n bge python=3.12 -y
conda activate bge
pip install torch --index-url https://download.pytorch.org/whl/cu121
Looking in indexes: https://download.pytorch.org/whl/cu121
Collecting torch
Downloading https://download.pytorch.org/whl/cu121/torch-2.5.1%2Bcu121-cp312-cp312-linux_x86_64.whl (780.4 MB)
━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77.1/780.4 MB 940.1 kB/s eta 0:12:29
# check version
python3 -c "import torch; print('PyTorch version:', torch.__version__)"
# PyTorch version: 2.5.1+cu121
# verify cuda
python - << 'EOF'
import torch
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
EOF
# /home/jean/miniconda3/envs/bge/lib/python3.12/site-packages/torch/_subclasses/functional_tensor.py:295: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
#   cpu = _conversion_method_template(device=torch.device("cpu"))
# (warning is harmless for this check; `pip install numpy` silences it)
# True
# NVIDIA L40S
pip install -U \
transformers \
huggingface_hub \
accelerate \
safetensors \
sentencepiece \
einops
# verify
python3 -c "import transformers; print(transformers.__version__)"
# deploy
uvicorn app:app --host 0.0.0.0 --port 13300 --workers 1
`torch_dtype` is deprecated! Use `dtype` instead!
Loading weights: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 391/391 [00:00<00:00, 2389.74it/s, Materializing param=pooler.dense.weight]
BertModel LOAD REPORT from: /home/jean/models/BAAI/bge-large-zh-v1.5
Key | Status | Details
------------------------+------------+--------
embeddings.position_ids | UNEXPECTED |
Notes:
- UNEXPECTED: can be ignored when loading across tasks/architectures; not OK if you expect an identical architecture.
Loading weights: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 393/393 [00:00<00:00, 662.98it/s, Materializing param=roberta.encoder.layer.23.output.dense.weight]
XLMRobertaForSequenceClassification LOAD REPORT from: /home/jean/models/BAAI/bge-reranker-large
Key | Status | Details
--------------------------------+------------+--------
roberta.embeddings.position_ids | UNEXPECTED |
Notes:
- UNEXPECTED: can be ignored when loading across tasks/architectures; not OK if you expect an identical architecture.
INFO: Started server process [17095]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:13300 (Press CTRL+C to quit)
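With the service up, it can be exercised over HTTP. The `/embed` path and `texts` field below are assumptions about what `app.py` exposes, not confirmed routes; the JSON body is validated locally so the sketch runs without the server, and the actual request is left as a commented `curl`:

```shell
# Hypothetical embedding request; /embed and "texts" are assumptions about app.py.
BODY='{"texts": ["retrieval test sentence"]}'
echo "$BODY" | python3 -m json.tool >/dev/null && echo "payload ok"
# curl -s -X POST http://127.0.0.1:13300/embed -H 'Content-Type: application/json' -d "$BODY"
```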
pip install -U vllm
# deploy
python -m vllm.entrypoints.openai.api_server \
--model /mnt/nas/models/Qwen/Qwen2.5-7B-Instruct \
--dtype float16 \
--gpu-memory-utilization 0.9 \
--port 13301
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:325]
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:325] █ █ █▄ ▄█
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:325] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.15.1
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:325] █▄█▀ █ █ █ █ model /mnt/nas/models/Qwen/Qwen2.5-7B-Instruct
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:325] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:325]
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:261] non-default args: {'port': 13301, 'model': '/mnt/nas/models/Qwen/Qwen2.5-7B-Instruct', 'dtype': 'float16'}
(APIServer pid=3403) INFO 02-05 10:23:51 [model.py:541] Resolved architecture: Qwen2ForCausalLM
(APIServer pid=3403) WARNING 02-05 10:23:51 [model.py:1885] Casting torch.bfloat16 to torch.float16.
(APIServer pid=3403) INFO 02-05 10:23:51 [model.py:1561] Using max model len 32768
(APIServer pid=3403) INFO 02-05 10:23:51 [scheduler.py:226] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=3403) INFO 02-05 10:23:51 [vllm.py:624] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=3669) INFO 02-05 10:23:57 [core.py:96] Initializing a V1 LLM engine (v0.15.1) with config: model='/mnt/nas/models/Qwen/Qwen2.5-7B-Instruct', speculative_config=None, tokenizer='/mnt/nas/models/Qwen/Qwen2.5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/mnt/nas/models/Qwen/Qwen2.5-7B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 
'vllm::unified_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=3669) INFO 02-05 10:23:58 [parallel_state.py:1212] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.16.246.55:35053 backend=nccl
(EngineCore_DP0 pid=3669) INFO 02-05 10:23:58 [parallel_state.py:1423] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(EngineCore_DP0 pid=3669) INFO 02-05 10:23:59 [gpu_model_runner.py:4033] Starting to load model /mnt/nas/models/Qwen/Qwen2.5-7B-Instruct...
(EngineCore_DP0 pid=3669) INFO 02-05 10:24:17 [cuda.py:364] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:28<01:26, 28.97s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:58<00:58, 29.17s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [01:19<00:25, 25.62s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:45<00:00, 25.76s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:45<00:00, 26.41s/it]
(EngineCore_DP0 pid=3669)
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:03 [default_loader.py:291] Loading weights took 105.94 seconds
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:04 [gpu_model_runner.py:4130] Model loading took 14.25 GiB memory and 123.992349 seconds
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:08 [backends.py:812] Using cache directory: /home/jean/.cache/vllm/torch_compile_cache/3a10497c3c/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:08 [backends.py:872] Dynamo bytecode transform time: 4.21 s
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:15 [backends.py:302] Cache the graph of compile range (1, 2048) for later use
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:19 [backends.py:319] Compiling a graph for compile range (1, 2048) takes 7.78 s
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:19 [monitor.py:34] torch.compile takes 11.99 s in total
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:20 [gpu_worker.py:356] Available KV cache memory: 24.27 GiB
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:20 [kv_cache_utils.py:1307] GPU KV cache size: 454,512 tokens
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:20 [kv_cache_utils.py:1312] Maximum concurrency for 32,768 tokens per request: 13.87x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:02<00:00, 20.37it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:01<00:00, 23.62it/s]
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:25 [gpu_model_runner.py:5063] Graph capturing finished in 5 secs, took 0.49 GiB
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:25 [core.py:272] init engine (profile, create kv cache, warmup model) took 21.19 seconds
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:25 [vllm.py:624] Asynchronous scheduling is enabled.
(APIServer pid=3403) INFO 02-05 10:26:26 [api_server.py:665] Supported tasks: ['generate']
(APIServer pid=3403) WARNING 02-05 10:26:26 [model.py:1371] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=3403) INFO 02-05 10:26:26 [serving.py:177] Warming up chat template processing...
(APIServer pid=3403) INFO 02-05 10:26:26 [hf.py:310] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=3403) INFO 02-05 10:26:26 [serving.py:212] Chat template warmup completed in 238.9ms
(APIServer pid=3403) INFO 02-05 10:26:26 [api_server.py:946] Starting vLLM API server 0 on http://0.0.0.0:13301
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:38] Available routes are:
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=3403) INFO: Started server process [3403]
(APIServer pid=3403) INFO: Waiting for application startup.
(APIServer pid=3403) INFO: Application startup complete.
(APIServer pid=3403) INFO: 172.31.41.115:55780 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3403) INFO 02-05 10:28:06 [loggers.py:257] Engine 000: Avg prompt throughput: 3.1 tokens/s, Avg generation throughput: 4.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3403) INFO 02-05 10:28:16 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=53914) INFO: 172.31.41.115:19077 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=53914) INFO 02-10 17:02:55 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 26.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
(APIServer pid=53914) INFO 02-10 17:03:05 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
(APIServer pid=53914) INFO 02-10 17:03:35 [loggers.py:257] Engine 000: Avg prompt throughput: 331.3 tokens/s, Avg generation throughput: 6.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 33.3%
(APIServer pid=53914) INFO: 172.31.41.114:53944 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=53914) INFO 02-10 17:03:45 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 25.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 33.3%
(APIServer pid=53914) INFO 02-10 17:03:55 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 33.3%
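The vLLM server exposes an OpenAI-compatible API (`/v1/chat/completions` appears in the route list above); since no `--served-model-name` was passed, the model name in the request must match the served path. The payload is built and validated locally; the `curl` call is commented out because it needs the running server:

```shell
MODEL=/mnt/nas/models/Qwen/Qwen2.5-7B-Instruct
BODY=$(printf '{"model": "%s", "messages": [{"role": "user", "content": "hello"}], "max_tokens": 32}' "$MODEL")
echo "$BODY" | python3 -m json.tool >/dev/null && echo "payload ok"
# curl -s http://127.0.0.1:13301/v1/chat/completions -H 'Content-Type: application/json' -d "$BODY"
```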
pip list | grep -E 'torch|transformers|safetensors'
torch 2.9.1
torchaudio 2.9.1
torchvision 0.24.1
transformers 4.57.6
safetensors 0.7.0
linux
# List USB devices on the host
lsusb
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 001 Device 002: ID 8087:0a2b Intel Corp. Bluetooth wireless interface
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 002 Device 002: ID 1058:25ee Western Digital Technologies, Inc. My Book 25EE
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 005 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 006 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
# List PCI devices
lspci
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers (rev 05)
00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 05)
00:01.1 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x8) (rev 05)
00:01.2 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x4) (rev 05)
00:02.0 Display controller: Intel Corporation HD Graphics 630 (rev 04)
00:08.0 System peripheral: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model
00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Thermal Subsystem (rev 31)
00:15.0 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #0 (rev 31)
00:15.1 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #1 (rev 31)
00:15.2 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #2 (rev 31)
00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #1 (rev f1)
00:1c.1 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #2 (rev f1)
00:1c.2 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #3 (rev f1)
00:1c.4 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #5 (rev f1)
00:1d.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #9 (rev f1)
00:1d.4 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #13 (rev f1)
00:1e.0 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO UART #0 (rev 31)
00:1f.0 ISA bridge: Intel Corporation HM175 Chipset LPC/eSPI Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
00:1f.3 Audio device: Intel Corporation CM238 HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Polaris 22 XT [Radeon RX Vega M GH] (rev c0)
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Polaris 22 HDMI Audio
02:00.0 USB controller: ASMedia Technology Inc. ASM2142/ASM3142 USB 3.1 Host Controller
03:00.0 SD Host controller: O2 Micro, Inc. SD/MMC Card Reader Controller (rev 01)
05:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
06:00.0 Network controller: Intel Corporation Wireless 8265 / 8275 (rev 78)
07:00.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
08:00.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
08:01.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
08:02.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
08:04.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
3d:00.0 USB controller: Intel Corporation JHL6540 Thunderbolt 3 USB Controller (C step) [Alpine Ridge 4C 2016] (rev 02)
72:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
73:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
# Install pciutils (Debian/Ubuntu)
sudo apt install pciutils
# Install pciutils (RHEL/CentOS/Fedora)
sudo dnf install pciutils
# Show all graphics cards
lspci | grep -E "VGA|3D controller"
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Polaris 22 XT [Radeon RX Vega M GH] (rev c0)
# Install lshw (hardware inventory tool)
sudo apt install lshw # Debian/Ubuntu
sudo dnf install lshw # RHEL/CentOS
# Show detailed GPU information
sudo lshw -C display
*-display
description: VGA compatible controller
product: Polaris 22 XT [Radeon RX Vega M GH]
vendor: Advanced Micro Devices, Inc. [AMD/ATI]
physical id: 0
bus info: pci@0000:01:00.0
logical name: /dev/fb0
version: c0
width: 64 bits
clock: 33MHz
capabilities: pm pciexpress msi vga_controller bus_master cap_list rom fb
configuration: depth=32 driver=amdgpu latency=0 mode=1920x1080 resolution=1920,1080 visual=truecolor xres=1920 yres=1080
resources: iomemory:200-1ff iomemory:210-20f irq:186 memory:2000000000-20ffffffff memory:2100000000-21001fffff ioport:e000(size=256) memory:db500000-db53ffff memory:c0000-dffff
*-display
description: Display controller
product: HD Graphics 630
vendor: Intel Corporation
physical id: 2
bus info: pci@0000:00:02.0
version: 04
width: 64 bits
clock: 33MHz
capabilities: pciexpress msi pm bus_master cap_list
configuration: driver=i915 latency=0
resources: iomemory:2f0-2ef iomemory:2f0-2ef irq:185 memory:2ffe000000-2ffeffffff memory:2fa0000000-2fafffffff ioport:f000(size=64)
# Show detailed memory information (including the DDR part number)
sudo dmidecode -t memory
# dmidecode 3.5
Getting SMBIOS data from sysfs.
SMBIOS 3.1.1 present.
Handle 0x002F, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: None
Maximum Capacity: 32 GB
Error Information Handle: Not Provided
Number Of Devices: 2
Handle 0x0030, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x002F
Error Information Handle: Not Provided
Total Width: 64 bits
Data Width: 64 bits
Size: 16 GB
Form Factor: SODIMM
Set: None
Locator: ChannelA-DIMM0
Bank Locator: BANK 0
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 2667 MT/s
Manufacturer: Samsung
Serial Number: 37385824
Asset Tag: 9876543210
Part Number: M471A2K43CB1-CTD
Rank: 2
Configured Memory Speed: 2667 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Handle 0x0031, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x002F
Error Information Handle: Not Provided
Total Width: 64 bits
Data Width: 64 bits
Size: 16 GB
Form Factor: SODIMM
Set: None
Locator: ChannelB-DIMM0
Bank Locator: BANK 2
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 2667 MT/s
Manufacturer: Samsung
Serial Number: 37385598
Asset Tag: 9876543210
Part Number: M471A2K43CB1-CTD
Rank: 2
Configured Memory Speed: 2667 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
# Condensed output: extract only the key memory fields (DDR type + size + speed)
sudo dmidecode -t memory | grep -E 'Type:|Size:|Speed:|Manufacturer:'
Error Correction Type: None
Size: 16 GB
Type: DDR4
Speed: 2667 MT/s
Manufacturer: Samsung
Configured Memory Speed: 2667 MT/s
Size: 16 GB
Type: DDR4
Speed: 2667 MT/s
Manufacturer: Samsung
Configured Memory Speed: 2667 MT/s
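The grep above still interleaves fields from different DIMMs. A small awk sketch (the helper name `summarize_dimms` is ours) can fold them into one line per module; it is fed a few lines captured from the dmidecode run above so it runs without root, and in real use you would pipe `sudo dmidecode -t memory` into it.

```shell
# Condense `dmidecode -t memory` output to one line per DIMM:
#   LOCATOR: SIZE TYPE @ SPEED (PART-NUMBER)
# Real use: sudo dmidecode -t memory | summarize_dimms
summarize_dimms() {
  awk -F': *' '
    /Locator:/ && $1 !~ /Bank/ { loc   = $2 }
    /^[ \t]*Size:/             { size  = $2 }
    /^[ \t]*Type:/             { type  = $2 }   # "Type Detail:" does not match
    /^[ \t]*Speed:/            { speed = $2 }   # "Configured Memory Speed:" does not match
    /Part Number:/             { print loc ": " size " " type " @ " speed " (" $2 ")" }
  '
}

# Sample captured from the dmidecode output above:
summarize_dimms <<'EOF'
    Size: 16 GB
    Locator: ChannelA-DIMM0
    Type: DDR4
    Speed: 2667 MT/s
    Part Number: M471A2K43CB1-CTD
EOF
# prints: ChannelA-DIMM0: 16 GB DDR4 @ 2667 MT/s (M471A2K43CB1-CTD)
```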
# Summarized hardware view
sudo lshw -short -C memory
H/W path Device Class Description
=================================================================
/0/0 memory 64KiB BIOS
/0/2f memory 32GiB System Memory
/0/2f/0 memory 16GiB SODIMM DDR4 Synchronous Unbuffered (Unregistered) 2667 MHz (0.4 ns)
/0/2f/1 memory 16GiB SODIMM DDR4 Synchronous Unbuffered (Unregistered) 2667 MHz (0.4 ns)
/0/34 memory 256KiB L1 cache
/0/35 memory 1MiB L2 cache
/0/36 memory 8MiB L3 cache
/0/100/1f.2 memory Memory controller
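As a root-free cross-check of the 32 GiB total reported above, Linux also exposes the installed memory in `/proc/meminfo` (in kB), which a one-line awk can convert:

```shell
# Total RAM without root (Linux only): /proc/meminfo reports kB
awk '/^MemTotal:/ { printf "%.1f GiB\n", $2 / 1024 / 1024 }' /proc/meminfo
```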
NVIDIA GPU-specific commands
# Basic view
nvidia-smi
# Live monitoring (refresh every second)
watch -n 1 nvidia-smi
# The output includes GPU index, memory usage, process list, driver version, CUDA version, and other key data.
# Install nvtop
sudo apt install nvtop  # Debian/Ubuntu
sudo dnf install nvtop  # RHEL/CentOS
# Run
nvtop
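For scripting rather than watching, nvidia-smi can emit machine-readable CSV via `--query-gpu`/`--format`. Below is a sketch (the helper name `flag_busy_gpus` is ours) that flags GPUs using more than half their memory; it is fed captured sample lines so it runs without a GPU, and in real use you would pipe the nvidia-smi query shown in the comment into it.

```shell
# Machine-readable GPU stats:
#   nvidia-smi --query-gpu=index,name,memory.used,memory.total \
#              --format=csv,noheader,nounits
# Flag GPUs using more than half their memory:
flag_busy_gpus() {
  awk -F', *' '$3 > $4 / 2 { print "GPU " $1 " (" $2 "): " $3 "/" $4 " MiB used" }'
}

# Sample lines in the same CSV shape (index, name, used MiB, total MiB):
flag_busy_gpus <<'EOF'
0, NVIDIA L40S, 30000, 46068
1, NVIDIA L40S, 0, 46068
EOF
# prints: GPU 0 (NVIDIA L40S): 30000/46068 MiB used
```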
AMD GPU-specific commands
# Install radeontop: live monitoring of AMD GPU utilization, VRAM, temperature, etc.
sudo apt install radeontop  # Debian/Ubuntu
# Run
sudo radeontop
Collecting data, please wait....
vLLM API Service
🚀 Running the vLLM API server in the background
✅ Option 1: nohup (simplest, recommended for quick use)
nohup python -m vllm.entrypoints.openai.api_server \
    --model /mnt/nas/models/Qwen/Qwen2.5-VL-7B-Instruct \
    --dtype float16 \
    --gpu-memory-utilization 0.9 \
    --port 9001 \
    > vllm.log 2>&1 &
Management commands:
```bash
# Tail the log
tail -f vllm.log
# Check the process
ps aux | grep vllm
# Stop the service
pkill -f "vllm.entrypoints.openai.api_server"
# or
kill $(lsof -t -i:9001)
```
✅ Option 2: screen (detachable interactive session)
# 1. Create a session
screen -S vllm-server
# 2. Run the command inside the session
python -m vllm.entrypoints.openai.api_server \
    --model /mnt/nas/models/Qwen/Qwen2.5-VL-7B-Instruct \
    --dtype float16 \
    --gpu-memory-utilization 0.9 \
    --port 9001
# 3. Press Ctrl+A then D to detach
# 4. Reattach to the session
screen -r vllm-server
# 5. List sessions
screen -ls
✅ Option 3: tmux (more powerful session management)
# 1. Create a session
tmux new -s vllm-server
# 2. Run the command (same as above)
# 3. Press Ctrl+B then D to detach
# 4. Reattach
tmux attach -t vllm-server
# 5. List sessions
tmux ls
✅ Option 4: systemd service (recommended for production)
1. Create the service file:
```bash
sudo vim /etc/systemd/system/vllm-qwen.service
```
2. Write the following content:
```ini
[Unit]
Description=vLLM Qwen2.5-VL API Server
After=network.target

[Service]
Type=simple
User=your_username
WorkingDirectory=/home/your_username
ExecStart=/usr/bin/python -m vllm.entrypoints.openai.api_server \
    --model /mnt/nas/models/Qwen/Qwen2.5-VL-7B-Instruct \
    --dtype float16 \
    --gpu-memory-utilization 0.9 \
    --port 9001
Restart=always
RestartSec=10
StandardOutput=append:/var/log/vllm/qwen.log
StandardError=append:/var/log/vllm/qwen.err

# GPU environment variables (if needed)
Environment="CUDA_VISIBLE_DEVICES=0"
Environment="TRANSFORMERS_CACHE=/mnt/nas/cache"

[Install]
WantedBy=multi-user.target
```
3. Enable and start the service:
```bash
# Create the log directory
sudo mkdir -p /var/log/vllm
sudo chown your_username:your_username /var/log/vllm
# Reload systemd
sudo systemctl daemon-reload
# Enable start on boot
sudo systemctl enable vllm-qwen
# Start the service
sudo systemctl start vllm-qwen
# Check status
sudo systemctl status vllm-qwen
# Follow the logs
sudo journalctl -u vllm-qwen -f
# Stop the service
sudo systemctl stop vllm-qwen
# Restart the service
sudo systemctl restart vllm-qwen
```
📋 Option comparison
| Option | Pros | Cons | Best for |
|---|---|---|---|
| nohup | Quick and simple, no extra tools | Cannot reattach to the session | Quick tests / temporary deployment |
| screen | Reattachable, easy to learn | Relatively basic feature set | Development and debugging |
| tmux | Powerful, multi-window support | Keybindings to learn | Long-term development |
| systemd | Start on boot, auto-restart, log management | Slightly more setup | Production ✅ |
⚠️ Important notes
1. Check GPU memory
# Check GPU state before starting
nvidia-smi
# Monitor memory usage
watch -n 1 nvidia-smi
2. Check for port conflicts
# Check whether port 9001 is already in use
lsof -i:9001
# or
netstat -tlnp | grep 9001
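On minimal images where neither lsof nor netstat is installed, bash alone can probe a local TCP port through its `/dev/tcp` pseudo-device (a bash feature, not POSIX sh); the helper name `port_in_use` below is ours:

```shell
# Return 0 if something is listening on 127.0.0.1:PORT (bash only).
# The subshell closes fd 3 automatically when it exits.
port_in_use() {
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

port_in_use 9001 && echo "port 9001 is in use" || echo "port 9001 is free"
```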
3. Verify the service is working
# Health check
curl http://localhost:9001/health
# Test inference
# Note: by default vLLM serves the model under the value passed to --model
# (here the full path); pass --served-model-name to expose a short name like this one.
curl http://localhost:9001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-VL-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
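The completion comes back as OpenAI-style JSON, so the reply text is buried in `choices[0].message.content`. A sketch for pulling it out (the helper name `extract_reply` is ours), fed a trimmed sample response so it runs without the server; in real use pipe the curl output into it, or use `jq -r '.choices[0].message.content'` if jq is installed:

```shell
# Extract the assistant reply from an OpenAI-compatible chat response.
# Real use: curl -s http://localhost:9001/v1/chat/completions ... | extract_reply
extract_reply() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Trimmed sample response in the OpenAI-compatible schema:
extract_reply <<'EOF'
{"choices": [{"message": {"role": "assistant", "content": "Hello! How can I help?"}}]}
EOF
# prints: Hello! How can I help?
```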
4. Troubleshooting common issues
| Issue | Fix |
|---|---|
| CUDA out of memory | Lower --gpu-memory-utilization to 0.8 or 0.7 |
| Model fails to load | Check path permissions: ls -la /mnt/nas/models/Qwen/ |
| Port already in use | Switch ports (--port 9002) or kill the occupying process |
| Service exits unexpectedly | Check the logs: tail -f vllm.log or journalctl -u vllm-qwen |
5. Suggested environment variables (optional)
export CUDA_VISIBLE_DEVICES=0
export TRANSFORMERS_CACHE=/mnt/nas/cache
export HF_HOME=/mnt/nas/cache/huggingface
export VLLM_LOGGING_LEVEL=INFO
🎯 Recommended choices
- Development/testing → screen or tmux (easy to debug)
- Production deployment → systemd (stable and reliable, auto-restart)
- Quick verification → nohup (simplest)