CUDA

# Check GPU, driver, and the driver's max supported CUDA version
nvidia-smi
Wed Feb  4 17:54:46 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 13.0     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:36:00.0 Off |                    0 |
| N/A   23C    P8             32W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

# Check whether the CUDA Toolkit (nvcc) is installed
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:58:59_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0
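Note that `nvidia-smi` reports the highest CUDA version the *driver* supports, while `nvcc` reports the installed *toolkit*; the two can legitimately differ. A small sketch (with a hypothetical helper name) that pulls both numbers out of the `nvidia-smi` banner line shown above:

```python
import re

def parse_smi_banner(line: str) -> dict:
    """Extract driver version and max supported CUDA version from the nvidia-smi banner."""
    m = re.search(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", line)
    if not m:
        raise ValueError("not an nvidia-smi banner line")
    return {"driver": m.group(1), "cuda": m.group(2)}

banner = "| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 13.0     |"
print(parse_smi_banner(banner))
# {'driver': '570.133.20', 'cuda': '13.0'}
```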

# List installed NVIDIA packages
dpkg -l | grep -i nvidia
ii  cuda-nsight-compute-13-0         13.0.2-1                                amd64        NVIDIA Nsight Compute
ii  cuda-nvtx-13-0                   13.0.85-1                               amd64        NVIDIA Tools Extension
hi  libnccl-dev                      2.28.3-1+cuda13.0                       amd64        NVIDIA Collective Communication Library (NCCL) Development Files
hi  libnccl2                         2.28.3-1+cuda13.0                       amd64        NVIDIA Collective Communication Library (NCCL) Runtime
ii  nsight-compute-2025.3.1          2025.3.1.4-1                            amd64        NVIDIA Nsight Compute

# Detailed query: GPU architecture / memory / PCIe
nvidia-smi -q

==============NVSMI LOG==============

Timestamp                                 : Wed Feb  4 17:58:13 2026
Driver Version                            : 570.133.20
CUDA Version                              : 13.0

Attached GPUs                             : 1
GPU 00000000:36:00.0
    Product Name                          : NVIDIA L40S
    Product Brand                         : NVIDIA
    Product Architecture                  : Ada Lovelace
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1323723031205
    GPU UUID                              : GPU-63ecdaf3-b7dc-2190-9912-795023d341e6
    Minor Number                          : 2
    VBIOS Version                         : 95.02.66.00.02
    MultiGPU Board                        : No
    Board ID                              : 0x3600
    Board Part Number                     : 900-2G133-0080-000
    GPU Part Number                       : 26B9-896-A1
    FRU Part Number                       : N/A
    Platform Info
        Chassis Serial Number             : N/A
        Slot Number                       : N/A
        Tray Index                        : N/A
        Host ID                           : N/A
        Peer Type                         : N/A
        Module Id                         : 1
        GPU Fabric GUID                   : N/A
    Inforom Version
        Image Version                     : G133.0242.00.03
        OEM Object                        : 2.1
        ECC Object                        : 6.16
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU C2C Mode                          : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
        vGPU Heterogeneous Mode           : N/A
    GPU Reset Status
        Reset Required                    : Requested functionality has been deprecated
        Drain and Reset Recommended       : Requested functionality has been deprecated
    GPU Recovery Action                   : None
    GSP Firmware Version                  : 570.133.20
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x36
        Device                            : 0x00
        Domain                            : 0x0000
        Base Classcode                    : 0x3
        Sub Classcode                     : 0x2
        Device Id                         : 0x26B910DE
        Bus Id                            : 00000000:36:00.0
        Sub System Id                     : 0x185110DE
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 1
                Device Current            : 1
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 350 KB/s
        Rx Throughput                     : 350 KB/s
        Atomic Caps Outbound              : N/A
        Atomic Caps Inbound               : N/A
    Fan Speed                             : N/A
    Performance State                     : P8
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    Sparse Operation Mode                 : N/A
    FB Memory Usage
        Total                             : 46068 MiB
        Reserved                          : 600 MiB
        Used                              : 0 MiB
        Free                              : 45469 MiB
    BAR1 Memory Usage
        Total                             : 65536 MiB
        Used                              : 1 MiB
        Free                              : 65535 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        GPU                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    DRAM Encryption Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable Parity     : 0
            SRAM Uncorrectable SEC-DED    : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable Parity     : 0
            SRAM Uncorrectable SEC-DED    : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
            SRAM Threshold Exceeded       : No
        Aggregate Uncorrectable SRAM Sources
            SRAM L2                       : 0
            SRAM SM                       : 0
            SRAM Microcontroller          : 0
            SRAM PCIE                     : 0
            SRAM Other                    : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 192 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 24 C
        GPU T.Limit Temp                  : 64 C
        GPU Shutdown T.Limit Temp         : -5 C
        GPU Slowdown T.Limit Temp         : -2 C
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating T.Limit Temp : N/A
    GPU Power Readings
        Average Power Draw                : 32.52 W
        Instantaneous Power Draw          : 32.52 W
        Current Power Limit               : 350.00 W
        Requested Power Limit             : 350.00 W
        Default Power Limit               : 350.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 350.00 W
    GPU Memory Power Readings 
        Average Power Draw                : N/A
        Instantaneous Power Draw          : N/A
    Module Power Readings
        Average Power Draw                : N/A
        Instantaneous Power Draw          : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Power Smoothing                       : N/A
    Workload Power Profiles
        Requested Profiles                : N/A
        Enforced Profiles                 : N/A
    Clocks
        Graphics                          : 210 MHz
        SM                                : 210 MHz
        Memory                            : 405 MHz
        Video                             : 1185 MHz
    Applications Clocks
        Graphics                          : 2520 MHz
        Memory                            : 9001 MHz
    Default Applications Clocks
        Graphics                          : 2520 MHz
        Memory                            : 9001 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2520 MHz
        SM                                : 2520 MHz
        Memory                            : 9001 MHz
        Video                             : 1965 MHz
    Max Customer Boost Clocks
        Graphics                          : 2520 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Fabric
        State                             : N/A
        Status                            : N/A
        CliqueId                          : N/A
        ClusterUUID                       : N/A
        Health
            Bandwidth                     : N/A
            Route Recovery in progress    : N/A
            Route Unhealthy               : N/A
            Access Timeout Recovery       : N/A
    Processes                             : None
    Capabilities
        EGM                               : disabled
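The FB memory figures above are self-consistent: reserved + used + free equals the total to within per-field rounding, and the 46068 MiB total corresponds to roughly 45 GiB visible on this ECC-enabled board. A quick arithmetic check:

```python
# MiB values copied from the "FB Memory Usage" section above
total, reserved, used, free = 46068, 600, 0, 45469

# Fields are rounded independently, so allow an off-by-one
assert abs(total - (reserved + used + free)) <= 1

print(f"{total} MiB = {total / 1024:.2f} GiB")  # 44.99 GiB
```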


# Key summary only (recommended)
nvidia-smi -L
GPU 0: NVIDIA L40S (UUID: GPU-63ecdaf3-b7dc-2190-9912-795023d341e6)

uname -a
Linux cci-1629dd8f-ac56-446b-9ab7-03f479be1a73-0 5.15.0-126-generic #136+zetyun SMP Fri Aug 8 07:03:31 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

model deploy

export HF_ENDPOINT=http://hfmirror.mas.zetyun.cn:8082
hf download BAAI/bge-large-zh-v1.5 --local-dir /mnt/nas/models/BAAI/bge-large-zh-v1.5
hf download BAAI/bge-reranker-large --local-dir /mnt/nas/models/BAAI/bge-reranker-large
hf download Qwen/Qwen2.5-7B-Instruct --local-dir /mnt/nas/models/Qwen/Qwen2.5-7B-Instruct

Still waiting to acquire lock on /mnt/nas/models/Qwen/Qwen2.5-7B-Instruct/.cache/huggingface/.gitignore.lock (elapsed: 0.1 seconds)
Fetching 14 files: 100%|████████████████████████████████| 14/14 [00:39<00:00,  2.84s/it]
Download complete: 100%|████████████████████████████████| 15.2G/15.2G [00:39<00:00, 383MB/s]
/mnt/nas/models/Qwen/Qwen2.5-7B-Instruct

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh


# If you'd prefer that conda's base environment not be activated on startup,
#   run the following command when conda is activated:

#conda config --set auto_activate_base false

#Note: You can undo this later by running `conda init --reverse $SHELL`

ls /root/public/datasets/
A1C    AIGym     AlayaNeW           EdBianchi      Jize1     RUC-AIBOX  Salesforce  agentica-org  fka     gy65896  kaist-ai  multimolecule  openvla    sanjion9    togethercomputer  yentinglin
AI-MO  AgentGym  BytedTsinghua-SIA  JAYASUDHA-S-V  PRIME-RL  SWE-bench  SamuelYang  bookcorpus    google  hongchi  lerobot   nateraw        rajpurkar  swordfaith  wtcherr           zixianma

OS Version      : Ubuntu 24.04.3 LTS
Kernel Version  : 5.15.0-126-generic
Hostname        : cci-1629dd8f-ac56-446b-9ab7-03f479be1a73-0
IP Address      : 172.16.62.247
CPU Model       : Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
CPU Cores       : 10C
Memory Usage    : 7914 MB / 81920 MB (9.66%)
GPU Information : NVIDIA L40S × 1
CUDA Version    : 13.0

python env setup

conda create -n bge python=3.12 -y
conda activate bge
# cu121 wheels bundle their own CUDA runtime; the 570.x driver (CUDA 13.0) is backward compatible
pip install torch --index-url https://download.pytorch.org/whl/cu121
Looking in indexes: https://download.pytorch.org/whl/cu121
Collecting torch
  Downloading https://download.pytorch.org/whl/cu121/torch-2.5.1%2Bcu121-cp312-cp312-linux_x86_64.whl (780.4 MB)
     ━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77.1/780.4 MB 940.1 kB/s eta 0:12:29

# check version
python3 -c "import torch; print('PyTorch version:', torch.__version__)"
# PyTorch version: 2.5.1+cu121
# verify cuda
python - << 'EOF'
import torch
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
EOF
#/home/jean/miniconda3/envs/bge/lib/python3.12/site-packages/torch/_subclasses/functional_tensor.py:295: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
#  cpu = _conversion_method_template(device=torch.device("cpu"))
#  (harmless here: torch works without NumPy; `pip install numpy` silences the warning)
# True
# NVIDIA L40S

pip install -U \
  transformers \
  huggingface_hub \
  accelerate \
  safetensors \
  sentencepiece \
  einops

# verify
python3 -c "import transformers; print(transformers.__version__)"

# deploy
uvicorn app:app --host 0.0.0.0 --port 13300 --workers 1
`torch_dtype` is deprecated! Use `dtype` instead!
Loading weights: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 391/391 [00:00<00:00, 2389.74it/s, Materializing param=pooler.dense.weight]
BertModel LOAD REPORT from: /home/jean/models/BAAI/bge-large-zh-v1.5
Key                     | Status     | Details
------------------------+------------+--------
embeddings.position_ids | UNEXPECTED |        

Notes:
- UNEXPECTED    :can be ignored when loading from different task/architecture; not ok if you expect identical arch.
Loading weights: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 393/393 [00:00<00:00, 662.98it/s, Materializing param=roberta.encoder.layer.23.output.dense.weight]
XLMRobertaForSequenceClassification LOAD REPORT from: /home/jean/models/BAAI/bge-reranker-large
Key                             | Status     | Details
--------------------------------+------------+--------
roberta.embeddings.position_ids | UNEXPECTED |        

Notes:
- UNEXPECTED    :can be ignored when loading from different task/architecture; not ok if you expect identical arch.
INFO:     Started server process [17095]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:13300 (Press CTRL+C to quit)
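The `app.py` served here isn't shown, but bge embedding outputs are typically L2-normalized so that a plain dot product equals cosine similarity, and the reranker's raw logit is usually squashed through a sigmoid. A pure-Python sketch of that post-processing step (illustrative only, not the actual `app.py`):

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Dot product; equals cosine similarity when both inputs are L2-normalized."""
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    """Map a reranker logit to a (0, 1) relevance score."""
    return 1 / (1 + math.exp(-x))

q = l2_normalize([0.3, 0.4, 0.0])
d = l2_normalize([0.3, 0.4, 0.0])
print(round(cosine(q, d), 4))   # 1.0 (identical vectors)
print(round(sigmoid(0.0), 2))   # 0.5 (a zero logit is "undecided")
```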

pip install -U vllm

# deploy
python -m vllm.entrypoints.openai.api_server \
  --model /mnt/nas/models/Qwen/Qwen2.5-7B-Instruct \
  --dtype float16 \
  --gpu-memory-utilization 0.9 \
  --port 13301
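Once the server is up, it exposes the OpenAI-compatible `/v1/chat/completions` route on port 13301; the `model` field must match the served model name, which defaults to the path passed to `--model`. A hedged sketch of building the request body (host is whatever the server runs on):

```python
import json

payload = {
    "model": "/mnt/nas/models/Qwen/Qwen2.5-7B-Instruct",  # served name defaults to the model path
    "messages": [{"role": "user", "content": "你好"}],
    "max_tokens": 64,
}
body = json.dumps(payload, ensure_ascii=False)
print(body)
# POST this to http://<host>:13301/v1/chat/completions (e.g. with curl or the openai client)
```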

(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:325] 
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:325]        █     █     █▄   ▄█
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:325]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.15.1
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:325]   █▄█▀ █     █     █     █  model   /mnt/nas/models/Qwen/Qwen2.5-7B-Instruct
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:325]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:325] 
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:261] non-default args: {'port': 13301, 'model': '/mnt/nas/models/Qwen/Qwen2.5-7B-Instruct', 'dtype': 'float16'}
(APIServer pid=3403) INFO 02-05 10:23:51 [model.py:541] Resolved architecture: Qwen2ForCausalLM
(APIServer pid=3403) WARNING 02-05 10:23:51 [model.py:1885] Casting torch.bfloat16 to torch.float16.
(APIServer pid=3403) INFO 02-05 10:23:51 [model.py:1561] Using max model len 32768
(APIServer pid=3403) INFO 02-05 10:23:51 [scheduler.py:226] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=3403) INFO 02-05 10:23:51 [vllm.py:624] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=3669) INFO 02-05 10:23:57 [core.py:96] Initializing a V1 LLM engine (v0.15.1) with config: model='/mnt/nas/models/Qwen/Qwen2.5-7B-Instruct', speculative_config=None, tokenizer='/mnt/nas/models/Qwen/Qwen2.5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/mnt/nas/models/Qwen/Qwen2.5-7B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 
'vllm::unified_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=3669) INFO 02-05 10:23:58 [parallel_state.py:1212] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.16.246.55:35053 backend=nccl
(EngineCore_DP0 pid=3669) INFO 02-05 10:23:58 [parallel_state.py:1423] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(EngineCore_DP0 pid=3669) INFO 02-05 10:23:59 [gpu_model_runner.py:4033] Starting to load model /mnt/nas/models/Qwen/Qwen2.5-7B-Instruct...
(EngineCore_DP0 pid=3669) INFO 02-05 10:24:17 [cuda.py:364] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:28<01:26, 28.97s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:58<00:58, 29.17s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [01:19<00:25, 25.62s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:45<00:00, 25.76s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:45<00:00, 26.41s/it]
(EngineCore_DP0 pid=3669) 
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:03 [default_loader.py:291] Loading weights took 105.94 seconds
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:04 [gpu_model_runner.py:4130] Model loading took 14.25 GiB memory and 123.992349 seconds
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:08 [backends.py:812] Using cache directory: /home/jean/.cache/vllm/torch_compile_cache/3a10497c3c/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:08 [backends.py:872] Dynamo bytecode transform time: 4.21 s
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:15 [backends.py:302] Cache the graph of compile range (1, 2048) for later use
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:19 [backends.py:319] Compiling a graph for compile range (1, 2048) takes 7.78 s
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:19 [monitor.py:34] torch.compile takes 11.99 s in total
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:20 [gpu_worker.py:356] Available KV cache memory: 24.27 GiB
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:20 [kv_cache_utils.py:1307] GPU KV cache size: 454,512 tokens
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:20 [kv_cache_utils.py:1312] Maximum concurrency for 32,768 tokens per request: 13.87x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:02<00:00, 20.37it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:01<00:00, 23.62it/s]
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:25 [gpu_model_runner.py:5063] Graph capturing finished in 5 secs, took 0.49 GiB
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:25 [core.py:272] init engine (profile, create kv cache, warmup model) took 21.19 seconds
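The KV cache numbers in the log are self-consistent. Assuming Qwen2.5-7B's published geometry (28 layers, 4 KV heads via GQA, head dim 128 — stated here as an assumption, not read from the log) and fp16 KV entries, the per-token cost reproduces both the 24.27 GiB figure and the 13.87x concurrency:

```python
layers, kv_heads, head_dim = 28, 4, 128          # Qwen2.5-7B config (assumed)
bytes_per_token = 2 * 2 * layers * kv_heads * head_dim  # 2 bytes fp16 x (K + V)
print(bytes_per_token)                           # 57344 bytes/token

kv_tokens = 454_512                              # "GPU KV cache size" from the log
kv_gib = kv_tokens * bytes_per_token / 2**30
print(round(kv_gib, 2))                          # 24.27, matching "Available KV cache memory"
print(round(kv_tokens / 32_768, 2))              # 13.87, matching "Maximum concurrency"
```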
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:25 [vllm.py:624] Asynchronous scheduling is enabled.
(APIServer pid=3403) INFO 02-05 10:26:26 [api_server.py:665] Supported tasks: ['generate']
(APIServer pid=3403) WARNING 02-05 10:26:26 [model.py:1371] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=3403) INFO 02-05 10:26:26 [serving.py:177] Warming up chat template processing...
(APIServer pid=3403) INFO 02-05 10:26:26 [hf.py:310] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=3403) INFO 02-05 10:26:26 [serving.py:212] Chat template warmup completed in 238.9ms
(APIServer pid=3403) INFO 02-05 10:26:26 [api_server.py:946] Starting vLLM API server 0 on http://0.0.0.0:13301
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:38] Available routes are:
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=3403) INFO:     Started server process [3403]
(APIServer pid=3403) INFO:     Waiting for application startup.
(APIServer pid=3403) INFO:     Application startup complete.
(APIServer pid=3403) INFO:     172.31.41.115:55780 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3403) INFO 02-05 10:28:06 [loggers.py:257] Engine 000: Avg prompt throughput: 3.1 tokens/s, Avg generation throughput: 4.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3403) INFO 02-05 10:28:16 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=53914) INFO:     172.31.41.115:19077 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=53914) INFO 02-10 17:02:55 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 26.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
(APIServer pid=53914) INFO 02-10 17:03:05 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
(APIServer pid=53914) INFO 02-10 17:03:35 [loggers.py:257] Engine 000: Avg prompt throughput: 331.3 tokens/s, Avg generation throughput: 6.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 33.3%
(APIServer pid=53914) INFO:     172.31.41.114:53944 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=53914) INFO 02-10 17:03:45 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 25.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 33.3%
(APIServer pid=53914) INFO 02-10 17:03:55 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 33.3%
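The periodic `loggers.py` lines above follow a fixed format, so the throughput and queue metrics can be scraped for plotting or alerting. A minimal sketch; the regex is inferred from the log lines shown here and is not a stable vLLM interface:

```python
import re
from typing import Optional

# Matches the metric fields in vLLM's periodic "Engine 000: ..." log lines
# (pattern inferred from the lines above; not a stable vLLM interface).
METRIC_RE = re.compile(
    r"Avg prompt throughput: (?P<prompt>[\d.]+) tokens/s, "
    r"Avg generation throughput: (?P<gen>[\d.]+) tokens/s, "
    r"Running: (?P<running>\d+) reqs, Waiting: (?P<waiting>\d+) reqs, "
    r"GPU KV cache usage: (?P<kv>[\d.]+)%"
)

def parse_engine_log(line: str) -> Optional[dict]:
    """Extract throughput/queue metrics from one logger line, or None if absent."""
    m = METRIC_RE.search(line)
    if m is None:
        return None
    return {k: float(v) for k, v in m.groupdict().items()}
```

Feeding it one of the lines above yields e.g. `{"prompt": 331.3, "gen": 6.5, "running": 1.0, "waiting": 0.0, "kv": 2.0}`.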

pip list | grep -E "torch|transformers|safetensors"
torch                             2.9.1
torchaudio                        2.9.1
torchvision                       0.24.1
transformers                      4.57.6
safetensors                       0.7.0
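When debugging version mismatches it can help to snapshot these versions programmatically rather than eyeball them. A small sketch that parses `pip list`-style output into a dict (the pairing of torch 2.9.x with torchvision 0.24.x follows the upstream compatibility matrix):

```python
def parse_pip_list(text: str) -> dict:
    """Parse `pip list` output (one "name  version" pair per line) into a dict."""
    pkgs = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip headers/separator lines
            name, version = parts
            pkgs[name] = version
    return pkgs
```

With the output above, `parse_pip_list(...)["torch"]` returns `"2.9.1"`, which can then be asserted against the torchvision/torchaudio versions you expect.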

Linux

# List USB devices attached to the host
lsusb
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 001 Device 002: ID 8087:0a2b Intel Corp. Bluetooth wireless interface
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 002 Device 002: ID 1058:25ee Western Digital Technologies, Inc. My Book 25EE
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 005 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 006 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
# List PCI devices
lspci
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers (rev 05)
00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 05)
00:01.1 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x8) (rev 05)
00:01.2 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x4) (rev 05)
00:02.0 Display controller: Intel Corporation HD Graphics 630 (rev 04)
00:08.0 System peripheral: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model
00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Thermal Subsystem (rev 31)
00:15.0 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #0 (rev 31)
00:15.1 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #1 (rev 31)
00:15.2 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #2 (rev 31)
00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #1 (rev f1)
00:1c.1 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #2 (rev f1)
00:1c.2 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #3 (rev f1)
00:1c.4 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #5 (rev f1)
00:1d.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #9 (rev f1)
00:1d.4 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #13 (rev f1)
00:1e.0 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO UART #0 (rev 31)
00:1f.0 ISA bridge: Intel Corporation HM175 Chipset LPC/eSPI Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
00:1f.3 Audio device: Intel Corporation CM238 HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Polaris 22 XT [Radeon RX Vega M GH] (rev c0)
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Polaris 22 HDMI Audio
02:00.0 USB controller: ASMedia Technology Inc. ASM2142/ASM3142 USB 3.1 Host Controller
03:00.0 SD Host controller: O2 Micro, Inc. SD/MMC Card Reader Controller (rev 01)
05:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
06:00.0 Network controller: Intel Corporation Wireless 8265 / 8275 (rev 78)
07:00.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
08:00.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
08:01.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
08:02.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
08:04.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
3d:00.0 USB controller: Intel Corporation JHL6540 Thunderbolt 3 USB Controller (C step) [Alpine Ridge 4C 2016] (rev 02)
72:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
73:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983

# Install the tool (Debian/Ubuntu)
sudo apt install pciutils
# Install the tool (RHEL/CentOS/Fedora)
sudo dnf install pciutils

# List all GPUs (headless GPUs show as "3D controller"; some iGPUs as "Display controller")
lspci | grep -E "VGA|3D controller"

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Polaris 22 XT [Radeon RX Vega M GH] (rev c0)

# Install lshw for detailed hardware info
sudo apt install lshw  # Debian/Ubuntu
sudo dnf install lshw  # RHEL/CentOS

# Show detailed display adapter info
sudo lshw -C display

*-display                 
       description: VGA compatible controller
       product: Polaris 22 XT [Radeon RX Vega M GH]
       vendor: Advanced Micro Devices, Inc. [AMD/ATI]
       physical id: 0
       bus info: pci@0000:01:00.0
       logical name: /dev/fb0
       version: c0
       width: 64 bits
       clock: 33MHz
       capabilities: pm pciexpress msi vga_controller bus_master cap_list rom fb
       configuration: depth=32 driver=amdgpu latency=0 mode=1920x1080 resolution=1920,1080 visual=truecolor xres=1920 yres=1080
       resources: iomemory:200-1ff iomemory:210-20f irq:186 memory:2000000000-20ffffffff memory:2100000000-21001fffff ioport:e000(size=256) memory:db500000-db53ffff memory:c0000-dffff
  *-display
       description: Display controller
       product: HD Graphics 630
       vendor: Intel Corporation
       physical id: 2
       bus info: pci@0000:00:02.0
       version: 04
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress msi pm bus_master cap_list
       configuration: driver=i915 latency=0
       resources: iomemory:2f0-2ef iomemory:2f0-2ef irq:185 memory:2ffe000000-2ffeffffff memory:2fa0000000-2fafffffff ioport:f000(size=64)

# Show detailed memory info (including DDR type and part number)
sudo dmidecode -t memory
# dmidecode 3.5
Getting SMBIOS data from sysfs.
SMBIOS 3.1.1 present.

Handle 0x002F, DMI type 16, 23 bytes
Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: None
        Maximum Capacity: 32 GB
        Error Information Handle: Not Provided
        Number Of Devices: 2

Handle 0x0030, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x002F
        Error Information Handle: Not Provided
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 16 GB
        Form Factor: SODIMM
        Set: None
        Locator: ChannelA-DIMM0
        Bank Locator: BANK 0
        Type: DDR4
        Type Detail: Synchronous Unbuffered (Unregistered)
        Speed: 2667 MT/s
        Manufacturer: Samsung
        Serial Number: 37385824
        Asset Tag: 9876543210
        Part Number: M471A2K43CB1-CTD    
        Rank: 2
        Configured Memory Speed: 2667 MT/s
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V

Handle 0x0031, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x002F
        Error Information Handle: Not Provided
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 16 GB
        Form Factor: SODIMM
        Set: None
        Locator: ChannelB-DIMM0
        Bank Locator: BANK 2
        Type: DDR4
        Type Detail: Synchronous Unbuffered (Unregistered)
        Speed: 2667 MT/s
        Manufacturer: Samsung
        Serial Number: 37385598
        Asset Tag: 9876543210
        Part Number: M471A2K43CB1-CTD    
        Rank: 2
        Configured Memory Speed: 2667 MT/s
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V

# Condensed output: extract only the key memory fields (DDR type, size, speed, vendor)
sudo dmidecode -t memory | grep -E 'Type:|Size:|Speed:|Manufacturer:'
Error Correction Type: None
Size: 16 GB
Type: DDR4
Speed: 2667 MT/s
Manufacturer: Samsung
Configured Memory Speed: 2667 MT/s
Size: 16 GB
Type: DDR4
Speed: 2667 MT/s
Manufacturer: Samsung
Configured Memory Speed: 2667 MT/s

# Compact tree view of the memory hardware
sudo lshw -short -C memory
H/W path               Device          Class          Description
=================================================================
/0/0                                   memory         64KiB BIOS
/0/2f                                  memory         32GiB System Memory
/0/2f/0                                memory         16GiB SODIMM DDR4 Synchronous Unbuffered (Unregistered) 2667 MHz (0.4 ns)
/0/2f/1                                memory         16GiB SODIMM DDR4 Synchronous Unbuffered (Unregistered) 2667 MHz (0.4 ns)
/0/34                                  memory         256KiB L1 cache
/0/35                                  memory         1MiB L2 cache
/0/36                                  memory         8MiB L3 cache
/0/100/1f.2                            memory         Memory controller
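The two DDR4-2667 SODIMMs above sit in separate channels (ChannelA-DIMM0 and ChannelB-DIMM0), i.e. dual channel. Theoretical peak bandwidth is transfer rate x 8 bytes per transfer (64-bit bus) x number of channels; a quick sanity calculation:

```python
def peak_bandwidth_gbs(mt_per_s: float, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (64-bit = 8-byte bus per channel)."""
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9

# DDR4-2667, dual channel (ChannelA-DIMM0 + ChannelB-DIMM0 above)
print(f"{peak_bandwidth_gbs(2667, channels=2):.1f} GB/s")  # ~42.7 GB/s
```

This is the theoretical ceiling; sustained copy bandwidth measured with e.g. STREAM is typically lower.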

NVIDIA GPU-specific commands

# Basic view
nvidia-smi
# Live monitoring (refresh every second)
watch -n 1 nvidia-smi

# The output includes GPU index, memory usage, process list, driver version, CUDA version, and other key data.
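For scripted monitoring, `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` produces machine-readable CSV that is easy to parse. A sketch; the sample line in the usage below mirrors the idle L40S shown earlier, and the exact field set is an assumption you can adjust:

```python
import csv
import io
import subprocess
from typing import Optional

# Fields passed to nvidia-smi's --query-gpu flag.
QUERY = "index,name,utilization.gpu,memory.used,memory.total"

def query_gpus(raw: Optional[str] = None) -> list:
    """Parse `nvidia-smi --query-gpu` CSV output; call nvidia-smi when raw is None."""
    if raw is None:
        raw = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
    gpus = []
    for row in csv.reader(io.StringIO(raw)):
        idx, name, util, used, total = (c.strip() for c in row)
        gpus.append({
            "index": int(idx),
            "name": name,
            "util_pct": int(util),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
        })
    return gpus
```

Usage: `query_gpus("0, NVIDIA L40S, 0, 0, 46068\n")` returns one dict per GPU; calling `query_gpus()` with no argument runs nvidia-smi itself.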

# Install nvtop (htop-style GPU monitor)
sudo apt install nvtop  # Debian/Ubuntu
sudo dnf install nvtop  # RHEL/CentOS
# Run
nvtop

AMD GPU-specific commands

# Install radeontop to monitor AMD GPU utilization, VRAM, temperature, etc. in real time
sudo apt install radeontop  # Debian/Ubuntu
# Run
sudo radeontop

Collecting data, please wait....

vLLM API Service

🚀 Running the vLLM API server in the background

✅ Option 1: nohup (simplest, recommended for quick use)

nohup python -m vllm.entrypoints.openai.api_server \
  --model /mnt/nas/models/Qwen/Qwen2.5-VL-7B-Instruct \
  --dtype float16 \
  --gpu-memory-utilization 0.9 \
  --port 9001 \
  > vllm.log 2>&1 &

Management commands:

# View logs
tail -f vllm.log

# Check the process
ps aux | grep vllm

# Stop the service
pkill -f "vllm.entrypoints.openai.api_server"
# or kill by port
kill $(lsof -t -i:9001)


✅ Option 2: screen (detachable interactive session)

# 1. Create a session
screen -S vllm-server

# 2. Run the command inside the session
python -m vllm.entrypoints.openai.api_server \
  --model /mnt/nas/models/Qwen/Qwen2.5-VL-7B-Instruct \
  --dtype float16 \
  --gpu-memory-utilization 0.9 \
  --port 9001

# 3. Press Ctrl+A then D to detach

# 4. Reattach to the session
screen -r vllm-server

# 5. List sessions
screen -ls

✅ Option 3: tmux (more powerful session management)

# 1. Create a session
tmux new -s vllm-server

# 2. Run the same command as above

# 3. Press Ctrl+B then D to detach

# 4. Reattach
tmux attach -t vllm-server

# 5. List sessions
tmux ls

✅ Option 4: systemd service (recommended for production)

1. Create the service file:

sudo vim /etc/systemd/system/vllm-qwen.service

2. Write the following content:

[Unit]
Description=vLLM Qwen2.5-VL API Server
After=network.target

[Service]
Type=simple
User=your_username
WorkingDirectory=/home/your_username
ExecStart=/usr/bin/python -m vllm.entrypoints.openai.api_server \
  --model /mnt/nas/models/Qwen/Qwen2.5-VL-7B-Instruct \
  --dtype float16 \
  --gpu-memory-utilization 0.9 \
  --port 9001
Restart=always
RestartSec=10
StandardOutput=append:/var/log/vllm/qwen.log
StandardError=append:/var/log/vllm/qwen.err

# GPU environment variables (if needed)
Environment="CUDA_VISIBLE_DEVICES=0"
Environment="TRANSFORMERS_CACHE=/mnt/nas/cache"

[Install]
WantedBy=multi-user.target

3. Enable and start the service:

# Create the log directory
sudo mkdir -p /var/log/vllm
sudo chown your_username:your_username /var/log/vllm

# Reload systemd
sudo systemctl daemon-reload

# Enable start on boot
sudo systemctl enable vllm-qwen

# Start the service
sudo systemctl start vllm-qwen

# Check status
sudo systemctl status vllm-qwen

# Follow logs
sudo journalctl -u vllm-qwen -f

# Stop the service
sudo systemctl stop vllm-qwen

# Restart the service
sudo systemctl restart vllm-qwen


📋 Option comparison

| Option | Pros | Cons | Best for |
|--------|------|------|----------|
| nohup | Simple and quick, no extra tools | Cannot reattach to the session | Quick tests / temporary deployments |
| screen | Reattachable, easy to pick up | Relatively basic feature set | Development and debugging |
| tmux | Powerful, multi-window support | Keybindings take some learning | Long-term development |
| systemd | Start on boot, auto-restart, log management | Slightly more configuration | Production |

⚠️ Important notes

1. GPU memory check

# Check GPU state before starting
nvidia-smi

# Monitor memory usage
watch -n 1 nvidia-smi

2. Port availability check

# Check whether port 9001 is already in use
lsof -i:9001
# or
netstat -tlnp | grep 9001
# or, on systems without netstat
ss -tlnp | grep 9001
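The same check can be done from Python with the standard library, e.g. inside a launcher script, by attempting a TCP connection. A minimal sketch; it only detects listeners reachable on the given host:

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 on success, an errno (e.g. ECONNREFUSED) otherwise
        return s.connect_ex((host, port)) == 0
```

E.g. `port_in_use(9001)` before launching the server, to fail fast instead of letting vLLM die on bind.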

3. Test whether the service is up

# Health check
curl http://localhost:9001/health

# Test inference
# (note: unless --served-model-name is set, vLLM names the model after the
# full --model path, so the "model" field may need to be that path)
curl http://localhost:9001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-VL-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
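Model loading can take minutes, so scripts should poll /health until it answers before sending traffic. A hedged sketch using only the standard library (URL and timeouts are examples, not fixed values):

```python
import time
import urllib.error
import urllib.request

def wait_for_health(url: str, timeout: float = 300.0, interval: float = 2.0) -> bool:
    """Poll `url` until it answers HTTP 200, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after the interval
        time.sleep(interval)
    return False

# e.g. wait_for_health("http://localhost:9001/health")
```

Returning False after the timeout lets the caller decide whether to inspect vllm.log or restart the unit.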

4. Troubleshooting common issues

| Problem | Fix |
|---------|-----|
| CUDA out of memory | Lower `--gpu-memory-utilization` to 0.8 or 0.7 |
| Model fails to load | Check path permissions: `ls -la /mnt/nas/models/Qwen/` |
| Port already in use | Switch ports (`--port 9002`) or kill the occupying process |
| Service exits unexpectedly | Check logs: `tail -f vllm.log` or `journalctl -u vllm-qwen` |

5. Suggested environment variables (optional)

export CUDA_VISIBLE_DEVICES=0
export TRANSFORMERS_CACHE=/mnt/nas/cache
export HF_HOME=/mnt/nas/cache/huggingface
export VLLM_LOGGING_LEVEL=INFO

🎯 Recommended choice

