CUDA
# Check GPU + driver + CUDA runtime
nvidia-smi
Wed Feb  4 17:54:46 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 13.0     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:36:00.0 Off |                    0 |
| N/A   23C    P8             32W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
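For scripting, the same facts can be pulled as CSV with standard `nvidia-smi` query flags (field names per `nvidia-smi --help-query-gpu`; the example output is what this box should report, compute capability 8.9 being the Ada L40S):
# one-line machine-readable summary
nvidia-smi --query-gpu=name,driver_version,memory.total,compute_cap --format=csv,noheader
# e.g. NVIDIA L40S, 570.133.20, 46068 MiB, 8.9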
# Check whether the CUDA Toolkit is installed
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:58:59_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0
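Note: the `CUDA Version: 13.0` in `nvidia-smi` is the maximum runtime the driver supports, while `nvcc` reports the toolkit actually installed; they merely happen to match here. To see which toolkits are on disk (typical layout; paths may differ on this box):
ls -d /usr/local/cuda*
# e.g. /usr/local/cuda  /usr/local/cuda-13.0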
# List installed NVIDIA packages
dpkg -l | grep -i nvidia
ii cuda-nsight-compute-13-0 13.0.2-1 amd64 NVIDIA Nsight Compute
ii cuda-nvtx-13-0 13.0.85-1 amd64 NVIDIA Tools Extension
hi libnccl-dev 2.28.3-1+cuda13.0 amd64 NVIDIA Collective Communication Library (NCCL) Development Files
hi libnccl2 2.28.3-1+cuda13.0 amd64 NVIDIA Collective Communication Library (NCCL) Runtime
ii nsight-compute-2025.3.1 2025.3.1.4-1 amd64 NVIDIA Nsight Compute
# GPU architecture / VRAM / PCIe
nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Wed Feb 4 17:58:13 2026
Driver Version : 570.133.20
CUDA Version : 13.0
Attached GPUs : 1
GPU 00000000:36:00.0
Product Name : NVIDIA L40S
Product Brand : NVIDIA
Product Architecture : Ada Lovelace
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : None
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1323723031205
GPU UUID : GPU-63ecdaf3-b7dc-2190-9912-795023d341e6
Minor Number : 2
VBIOS Version : 95.02.66.00.02
MultiGPU Board : No
Board ID : 0x3600
Board Part Number : 900-2G133-0080-000
GPU Part Number : 26B9-896-A1
FRU Part Number : N/A
Platform Info
Chassis Serial Number : N/A
Slot Number : N/A
Tray Index : N/A
Host ID : N/A
Peer Type : N/A
Module Id : 1
GPU Fabric GUID : N/A
Inforom Version
Image Version : G133.0242.00.03
OEM Object : 2.1
ECC Object : 6.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU C2C Mode : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
vGPU Heterogeneous Mode : N/A
GPU Reset Status
Reset Required : Requested functionality has been deprecated
Drain and Reset Recommended : Requested functionality has been deprecated
GPU Recovery Action : None
GSP Firmware Version : 570.133.20
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x36
Device : 0x00
Domain : 0x0000
Base Classcode : 0x3
Sub Classcode : 0x2
Device Id : 0x26B910DE
Bus Id : 00000000:36:00.0
Sub System Id : 0x185110DE
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Device Current : 1
Device Max : 4
Host Max : 4
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 350 KB/s
Rx Throughput : 350 KB/s
Atomic Caps Outbound : N/A
Atomic Caps Inbound : N/A
Fan Speed : N/A
Performance State : P8
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
Sparse Operation Mode : N/A
FB Memory Usage
Total : 46068 MiB
Reserved : 600 MiB
Used : 0 MiB
Free : 45469 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
GPU : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
DRAM Encryption Mode
Current : N/A
Pending : N/A
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable Parity : 0
SRAM Uncorrectable SEC-DED : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable Parity : 0
SRAM Uncorrectable SEC-DED : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
SRAM Threshold Exceeded : No
Aggregate Uncorrectable SRAM Sources
SRAM L2 : 0
SRAM SM : 0
SRAM Microcontroller : 0
SRAM PCIE : 0
SRAM Other : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 192 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 24 C
GPU T.Limit Temp : 64 C
GPU Shutdown T.Limit Temp : -5 C
GPU Slowdown T.Limit Temp : -2 C
GPU Max Operating T.Limit Temp : 0 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating T.Limit Temp : N/A
GPU Power Readings
Average Power Draw : 32.52 W
Instantaneous Power Draw : 32.52 W
Current Power Limit : 350.00 W
Requested Power Limit : 350.00 W
Default Power Limit : 350.00 W
Min Power Limit : 100.00 W
Max Power Limit : 350.00 W
GPU Memory Power Readings
Average Power Draw : N/A
Instantaneous Power Draw : N/A
Module Power Readings
Average Power Draw : N/A
Instantaneous Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Power Smoothing : N/A
Workload Power Profiles
Requested Profiles : N/A
Enforced Profiles : N/A
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 1185 MHz
Applications Clocks
Graphics : 2520 MHz
Memory : 9001 MHz
Default Applications Clocks
Graphics : 2520 MHz
Memory : 9001 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 2520 MHz
SM : 2520 MHz
Memory : 9001 MHz
Video : 1965 MHz
Max Customer Boost Clocks
Graphics : 2520 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Fabric
State : N/A
Status : N/A
CliqueId : N/A
ClusterUUID : N/A
Health
Bandwidth : N/A
Route Recovery in progress : N/A
Route Unhealthy : N/A
Access Timeout Recovery : N/A
Processes : None
Capabilities
EGM : disabled
# Key summary only (recommended)
nvidia-smi -L
GPU 0: NVIDIA L40S (UUID: GPU-63ecdaf3-b7dc-2190-9912-795023d341e6)
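The fields the full `-q` dump was run for (VRAM, PCIe link, temperature, power) can also be queried directly, which is easier to grep or log:
nvidia-smi --query-gpu=name,memory.total,pcie.link.gen.current,pcie.link.width.current,temperature.gpu,power.draw --format=csv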
uname -a
Linux cci-1629dd8f-ac56-446b-9ab7-03f479be1a73-0 5.15.0-126-generic #136+zetyun SMP Fri Aug 8 07:03:31 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
model deploy
export HF_ENDPOINT=http://hfmirror.mas.zetyun.cn:8082
hf download BAAI/bge-large-zh-v1.5 --local-dir /mnt/nas/models/BAAI/bge-large-zh-v1.5
hf download BAAI/bge-reranker-large --local-dir /mnt/nas/models/BAAI/bge-reranker-large
hf download Qwen/Qwen2.5-7B-Instruct --local-dir /mnt/nas/models/Qwen/Qwen2.5-7B-Instruct
Fetching 14 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:39<00:00, 2.84s/it]
Download complete: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15.2G/15.2G [00:39<00:00, 449MB/s]
/mnt/nas/models/Qwen/Qwen2.5-7B-Instruct
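Sanity-check the downloaded snapshot before pointing a server at it; per the vLLM load log further down, this model ships as 4 safetensors shards (~15.2 GB total):
ls -lh /mnt/nas/models/Qwen/Qwen2.5-7B-Instruct
# expect config.json, tokenizer files, and 4 model-*.safetensors shards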
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# If you'd prefer that conda's base environment not be activated on startup,
# run the following command when conda is activated:
#   conda config --set auto_activate_base false
# Note: you can undo this later by running `conda init --reverse $SHELL`
ls /root/public/datasets/
A1C AIGym AlayaNeW EdBianchi Jize1 RUC-AIBOX Salesforce agentica-org fka gy65896 kaist-ai multimolecule openvla sanjion9 togethercomputer yentinglin
AI-MO AgentGym BytedTsinghua-SIA JAYASUDHA-S-V PRIME-RL SWE-bench SamuelYang bookcorpus google hongchi lerobot nateraw rajpurkar swordfaith wtcherr zixianma
OS Version      : Ubuntu 24.04.3 LTS
Kernel Version  : 5.15.0-126-generic
Hostname        : cci-1629dd8f-ac56-446b-9ab7-03f479be1a73-0
IP Address      : 172.16.62.247
CPU Model       : Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
CPU Cores       : 10C
Memory Usage    : 7914 MB / 81920 MB (9.66%)
GPU Information : NVIDIA L40S × 1
CUDA Version    : 13.0
python env setup
conda create -n bge python=3.12 -y
conda activate bge
pip install torch --index-url https://download.pytorch.org/whl/cu121
Looking in indexes: https://download.pytorch.org/whl/cu121
Collecting torch
Downloading https://download.pytorch.org/whl/cu121/torch-2.5.1%2Bcu121-cp312-cp312-linux_x86_64.whl (780.4 MB)
━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77.1/780.4 MB 940.1 kB/s eta 0:12:29
# check version
python3 -c "import torch; print('PyTorch 版本:', torch.__version__)"
# PyTorch 版本: 2.5.1+cu121
# verify cuda
python - << 'EOF'
import torch
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
EOF
#/home/jean/miniconda3/envs/bge/lib/python3.12/site-packages/torch/_subclasses/functional_tensor.py:295: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
# cpu = _conversion_method_template(device=torch.device("cpu"))
# True
# NVIDIA L40S
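The `Failed to initialize NumPy` warning above just means numpy is not in this env yet; installing it silences the warning:
pip install numpy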
pip install -U \
transformers \
huggingface_hub \
accelerate \
safetensors \
sentencepiece \
einops
# verify
python3 -c "import transformers; print(transformers.__version__)"
# deploy
uvicorn app:app --host 0.0.0.0 --port 13300 --workers 1
`torch_dtype` is deprecated! Use `dtype` instead!
Loading weights: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 391/391 [00:00<00:00, 2389.74it/s, Materializing param=pooler.dense.weight]
BertModel LOAD REPORT from: /home/jean/models/BAAI/bge-large-zh-v1.5
Key | Status | Details
------------------------+------------+--------
embeddings.position_ids | UNEXPECTED |
Notes:
- UNEXPECTED :can be ignored when loading from different task/architecture; not ok if you expect identical arch.
Loading weights: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 393/393 [00:00<00:00, 662.98it/s, Materializing param=roberta.encoder.layer.23.output.dense.weight]
XLMRobertaForSequenceClassification LOAD REPORT from: /home/jean/models/BAAI/bge-reranker-large
Key | Status | Details
--------------------------------+------------+--------
roberta.embeddings.position_ids | UNEXPECTED |
Notes:
- UNEXPECTED :can be ignored when loading from different task/architecture; not ok if you expect identical arch.
INFO: Started server process [17095]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:13300 (Press CTRL+C to quit)
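With the server up, a quick smoke test (again assuming the hypothetical /embed route from the sketch above; substitute the app's real route):
curl -s http://127.0.0.1:13300/embed \
  -H 'Content-Type: application/json' \
  -d '{"texts": ["sample query"]}'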
pip install -U vllm
# deploy
python -m vllm.entrypoints.openai.api_server \
--model /mnt/nas/models/Qwen/Qwen2.5-7B-Instruct \
--dtype float16 \
--gpu-memory-utilization 0.9 \
--port 13301
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:325]
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:325] █ █ █▄ ▄█
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:325] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.15.1
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:325] █▄█▀ █ █ █ █ model /mnt/nas/models/Qwen/Qwen2.5-7B-Instruct
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:325] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:325]
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:261] non-default args: {'port': 13301, 'model': '/mnt/nas/models/Qwen/Qwen2.5-7B-Instruct', 'dtype': 'float16'}
(APIServer pid=3403) INFO 02-05 10:23:51 [model.py:541] Resolved architecture: Qwen2ForCausalLM
(APIServer pid=3403) WARNING 02-05 10:23:51 [model.py:1885] Casting torch.bfloat16 to torch.float16.
(APIServer pid=3403) INFO 02-05 10:23:51 [model.py:1561] Using max model len 32768
(APIServer pid=3403) INFO 02-05 10:23:51 [scheduler.py:226] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=3403) INFO 02-05 10:23:51 [vllm.py:624] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=3669) INFO 02-05 10:23:57 [core.py:96] Initializing a V1 LLM engine (v0.15.1) with config: model='/mnt/nas/models/Qwen/Qwen2.5-7B-Instruct', speculative_config=None, tokenizer='/mnt/nas/models/Qwen/Qwen2.5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/mnt/nas/models/Qwen/Qwen2.5-7B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=3669) INFO 02-05 10:23:58 [parallel_state.py:1212] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.16.246.55:35053 backend=nccl
(EngineCore_DP0 pid=3669) INFO 02-05 10:23:58 [parallel_state.py:1423] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(EngineCore_DP0 pid=3669) INFO 02-05 10:23:59 [gpu_model_runner.py:4033] Starting to load model /mnt/nas/models/Qwen/Qwen2.5-7B-Instruct...
(EngineCore_DP0 pid=3669) INFO 02-05 10:24:17 [cuda.py:364] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:28<01:26, 28.97s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:58<00:58, 29.17s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [01:19<00:25, 25.62s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:45<00:00, 25.76s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:45<00:00, 26.41s/it]
(EngineCore_DP0 pid=3669)
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:03 [default_loader.py:291] Loading weights took 105.94 seconds
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:04 [gpu_model_runner.py:4130] Model loading took 14.25 GiB memory and 123.992349 seconds
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:08 [backends.py:812] Using cache directory: /home/jean/.cache/vllm/torch_compile_cache/3a10497c3c/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:08 [backends.py:872] Dynamo bytecode transform time: 4.21 s
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:15 [backends.py:302] Cache the graph of compile range (1, 2048) for later use
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:19 [backends.py:319] Compiling a graph for compile range (1, 2048) takes 7.78 s
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:19 [monitor.py:34] torch.compile takes 11.99 s in total
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:20 [gpu_worker.py:356] Available KV cache memory: 24.27 GiB
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:20 [kv_cache_utils.py:1307] GPU KV cache size: 454,512 tokens
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:20 [kv_cache_utils.py:1312] Maximum concurrency for 32,768 tokens per request: 13.87x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:02<00:00, 20.37it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:01<00:00, 23.62it/s]
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:25 [gpu_model_runner.py:5063] Graph capturing finished in 5 secs, took 0.49 GiB
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:25 [core.py:272] init engine (profile, create kv cache, warmup model) took 21.19 seconds
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:25 [vllm.py:624] Asynchronous scheduling is enabled.
(APIServer pid=3403) INFO 02-05 10:26:26 [api_server.py:665] Supported tasks: ['generate']
(APIServer pid=3403) WARNING 02-05 10:26:26 [model.py:1371] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=3403) INFO 02-05 10:26:26 [serving.py:177] Warming up chat template processing...
(APIServer pid=3403) INFO 02-05 10:26:26 [hf.py:310] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=3403) INFO 02-05 10:26:26 [serving.py:212] Chat template warmup completed in 238.9ms
(APIServer pid=3403) INFO 02-05 10:26:26 [api_server.py:946] Starting vLLM API server 0 on http://0.0.0.0:13301
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:38] Available routes are:
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=3403) INFO: Started server process [3403]
(APIServer pid=3403) INFO: Waiting for application startup.
(APIServer pid=3403) INFO: Application startup complete.
(APIServer pid=3403) INFO: 172.31.41.115:55780 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3403) INFO 02-05 10:28:06 [loggers.py:257] Engine 000: Avg prompt throughput: 3.1 tokens/s, Avg generation throughput: 4.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3403) INFO 02-05 10:28:16 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
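The /v1/chat/completions route listed above speaks the OpenAI chat format, with the served model name from the engine config; it can be exercised with, e.g.:
curl -s http://127.0.0.1:13301/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "/mnt/nas/models/Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Introduce yourself briefly."}],
        "max_tokens": 128
      }'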
pip list | grep torch
torch 2.9.1
torchaudio 2.9.1
torchvision 0.24.1
transformers 4.57.6
safetensors 0.7.0
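To confirm which CUDA build this newer torch was compiled against and that it still sees the L40S:
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"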