CUDA
# Check GPU + driver + CUDA runtime
nvidia-smi
Wed Feb  4 17:54:46 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 13.0     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:36:00.0 Off |                    0 |
| N/A   23C    P8             32W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
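For scripting, the same facts can be pulled as CSV with standard `nvidia-smi` query flags (field names per `nvidia-smi --help-query-gpu`; the example output is what this box should report, compute capability 8.9 being the Ada L40S):
# one-line machine-readable summary
nvidia-smi --query-gpu=name,driver_version,memory.total,compute_cap --format=csv,noheader
# e.g. NVIDIA L40S, 570.133.20, 46068 MiB, 8.9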
# Check whether the CUDA Toolkit is installed
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:58:59_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0
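Note: the `CUDA Version: 13.0` in `nvidia-smi` is the maximum runtime the driver supports, while `nvcc` reports the toolkit actually installed; they merely happen to match here. To see which toolkits are on disk (typical layout; paths may differ on this box):
ls -d /usr/local/cuda*
# e.g. /usr/local/cuda  /usr/local/cuda-13.0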
# List installed NVIDIA packages
dpkg -l | grep -i nvidia
ii cuda-nsight-compute-13-0 13.0.2-1 amd64 NVIDIA Nsight Compute
ii cuda-nvtx-13-0 13.0.85-1 amd64 NVIDIA Tools Extension
hi libnccl-dev 2.28.3-1+cuda13.0 amd64 NVIDIA Collective Communication Library (NCCL) Development Files
hi libnccl2 2.28.3-1+cuda13.0 amd64 NVIDIA Collective Communication Library (NCCL) Runtime
ii nsight-compute-2025.3.1 2025.3.1.4-1 amd64 NVIDIA Nsight Compute
# GPU architecture / VRAM / PCIe
nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Wed Feb 4 17:58:13 2026
Driver Version : 570.133.20
CUDA Version : 13.0
Attached GPUs : 1
GPU 00000000:36:00.0
Product Name : NVIDIA L40S
Product Brand : NVIDIA
Product Architecture : Ada Lovelace
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : None
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1323723031205
GPU UUID : GPU-63ecdaf3-b7dc-2190-9912-795023d341e6
Minor Number : 2
VBIOS Version : 95.02.66.00.02
MultiGPU Board : No
Board ID : 0x3600
Board Part Number : 900-2G133-0080-000
GPU Part Number : 26B9-896-A1
FRU Part Number : N/A
Platform Info
Chassis Serial Number : N/A
Slot Number : N/A
Tray Index : N/A
Host ID : N/A
Peer Type : N/A
Module Id : 1
GPU Fabric GUID : N/A
Inforom Version
Image Version : G133.0242.00.03
OEM Object : 2.1
ECC Object : 6.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU C2C Mode : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
vGPU Heterogeneous Mode : N/A
GPU Reset Status
Reset Required : Requested functionality has been deprecated
Drain and Reset Recommended : Requested functionality has been deprecated
GPU Recovery Action : None
GSP Firmware Version : 570.133.20
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x36
Device : 0x00
Domain : 0x0000
Base Classcode : 0x3
Sub Classcode : 0x2
Device Id : 0x26B910DE
Bus Id : 00000000:36:00.0
Sub System Id : 0x185110DE
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Device Current : 1
Device Max : 4
Host Max : 4
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 350 KB/s
Rx Throughput : 350 KB/s
Atomic Caps Outbound : N/A
Atomic Caps Inbound : N/A
Fan Speed : N/A
Performance State : P8
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
Sparse Operation Mode : N/A
FB Memory Usage
Total : 46068 MiB
Reserved : 600 MiB
Used : 0 MiB
Free : 45469 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
GPU : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
DRAM Encryption Mode
Current : N/A
Pending : N/A
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable Parity : 0
SRAM Uncorrectable SEC-DED : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable Parity : 0
SRAM Uncorrectable SEC-DED : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
SRAM Threshold Exceeded : No
Aggregate Uncorrectable SRAM Sources
SRAM L2 : 0
SRAM SM : 0
SRAM Microcontroller : 0
SRAM PCIE : 0
SRAM Other : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 192 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 24 C
GPU T.Limit Temp : 64 C
GPU Shutdown T.Limit Temp : -5 C
GPU Slowdown T.Limit Temp : -2 C
GPU Max Operating T.Limit Temp : 0 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating T.Limit Temp : N/A
GPU Power Readings
Average Power Draw : 32.52 W
Instantaneous Power Draw : 32.52 W
Current Power Limit : 350.00 W
Requested Power Limit : 350.00 W
Default Power Limit : 350.00 W
Min Power Limit : 100.00 W
Max Power Limit : 350.00 W
GPU Memory Power Readings
Average Power Draw : N/A
Instantaneous Power Draw : N/A
Module Power Readings
Average Power Draw : N/A
Instantaneous Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Power Smoothing : N/A
Workload Power Profiles
Requested Profiles : N/A
Enforced Profiles : N/A
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 1185 MHz
Applications Clocks
Graphics : 2520 MHz
Memory : 9001 MHz
Default Applications Clocks
Graphics : 2520 MHz
Memory : 9001 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 2520 MHz
SM : 2520 MHz
Memory : 9001 MHz
Video : 1965 MHz
Max Customer Boost Clocks
Graphics : 2520 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Fabric
State : N/A
Status : N/A
CliqueId : N/A
ClusterUUID : N/A
Health
Bandwidth : N/A
Route Recovery in progress : N/A
Route Unhealthy : N/A
Access Timeout Recovery : N/A
Processes : None
Capabilities
EGM : disabled
# Key summary only (recommended)
nvidia-smi -L
GPU 0: NVIDIA L40S (UUID: GPU-63ecdaf3-b7dc-2190-9912-795023d341e6)
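The fields the full `-q` dump was run for (VRAM, PCIe link, temperature, power) can also be queried directly, which is easier to grep or log:
nvidia-smi --query-gpu=name,memory.total,pcie.link.gen.current,pcie.link.width.current,temperature.gpu,power.draw --format=csv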
uname -a
Linux cci-1629dd8f-ac56-446b-9ab7-03f479be1a73-0 5.15.0-126-generic #136+zetyun SMP Fri Aug 8 07:03:31 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
model deploy
export HF_ENDPOINT=http://hfmirror.mas.zetyun.cn:8082
hf download BAAI/bge-large-zh-v1.5 --local-dir /mnt/nas/models/BAAI/bge-large-zh-v1.5
hf download BAAI/bge-reranker-large --local-dir /mnt/nas/models/BAAI/bge-reranker-large
hf download Qwen/Qwen2.5-7B-Instruct --local-dir /mnt/nas/models/Qwen/Qwen2.5-7B-Instruct
Fetching 14 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:39<00:00, 2.84s/it]
Download complete: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15.2G/15.2G [00:39<00:00, 449MB/s]
/mnt/nas/models/Qwen/Qwen2.5-7B-Instruct
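Sanity-check the downloaded snapshot before pointing a server at it; per the vLLM load log further down, this model ships as 4 safetensors shards (~15.2 GB total):
ls -lh /mnt/nas/models/Qwen/Qwen2.5-7B-Instruct
# expect config.json, tokenizer files, and 4 model-*.safetensors shards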
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# If you'd prefer that conda's base environment not be activated on startup,
# run the following command when conda is activated:
#   conda config --set auto_activate_base false
# Note: you can undo this later by running `conda init --reverse $SHELL`
ls /root/public/datasets/
A1C AIGym AlayaNeW EdBianchi Jize1 RUC-AIBOX Salesforce agentica-org fka gy65896 kaist-ai multimolecule openvla sanjion9 togethercomputer yentinglin
AI-MO AgentGym BytedTsinghua-SIA JAYASUDHA-S-V PRIME-RL SWE-bench SamuelYang bookcorpus google hongchi lerobot nateraw rajpurkar swordfaith wtcherr zixianma
OS Version      : Ubuntu 24.04.3 LTS
Kernel Version  : 5.15.0-126-generic
Hostname        : cci-1629dd8f-ac56-446b-9ab7-03f479be1a73-0
IP Address      : 172.16.62.247
CPU Model       : Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
CPU Cores       : 10C
Memory Usage    : 7914 MB / 81920 MB (9.66%)
GPU Information : NVIDIA L40S × 1
CUDA Version    : 13.0
python env setup
conda create -n bge python=3.12 -y
conda activate bge
pip install torch --index-url https://download.pytorch.org/whl/cu121
Looking in indexes: https://download.pytorch.org/whl/cu121
Collecting torch
Downloading https://download.pytorch.org/whl/cu121/torch-2.5.1%2Bcu121-cp312-cp312-linux_x86_64.whl (780.4 MB)
━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77.1/780.4 MB 940.1 kB/s eta 0:12:29
# check version
python3 -c "import torch; print('PyTorch 版本:', torch.__version__)"
# PyTorch 版本: 2.5.1+cu121
# verify cuda
python - << 'EOF'
import torch
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
EOF
#/home/jean/miniconda3/envs/bge/lib/python3.12/site-packages/torch/_subclasses/functional_tensor.py:295: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
# cpu = _conversion_method_template(device=torch.device("cpu"))
# True
# NVIDIA L40S
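The `Failed to initialize NumPy` warning above just means numpy is not in this env yet; installing it silences the warning:
pip install numpy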
pip install -U \
transformers \
huggingface_hub \
accelerate \
safetensors \
sentencepiece \
einops
# verify
python3 -c "import transformers; print(transformers.__version__)"
# deploy
uvicorn app:app --host 0.0.0.0 --port 13300 --workers 1
`torch_dtype` is deprecated! Use `dtype` instead!
Loading weights: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 391/391 [00:00<00:00, 2389.74it/s, Materializing param=pooler.dense.weight]
BertModel LOAD REPORT from: /home/jean/models/BAAI/bge-large-zh-v1.5
Key | Status | Details
------------------------+------------+--------
embeddings.position_ids | UNEXPECTED |
Notes:
- UNEXPECTED :can be ignored when loading from different task/architecture; not ok if you expect identical arch.
Loading weights: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 393/393 [00:00<00:00, 662.98it/s, Materializing param=roberta.encoder.layer.23.output.dense.weight]
XLMRobertaForSequenceClassification LOAD REPORT from: /home/jean/models/BAAI/bge-reranker-large
Key | Status | Details
--------------------------------+------------+--------
roberta.embeddings.position_ids | UNEXPECTED |
Notes:
- UNEXPECTED :can be ignored when loading from different task/architecture; not ok if you expect identical arch.
INFO: Started server process [17095]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:13300 (Press CTRL+C to quit)
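With the server up, a quick smoke test (again assuming the hypothetical /embed route from the sketch above; substitute the app's real route):
curl -s http://127.0.0.1:13300/embed \
  -H 'Content-Type: application/json' \
  -d '{"texts": ["sample query"]}'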
pip install -U vllm
# deploy
python -m vllm.entrypoints.openai.api_server \
--model /mnt/nas/models/Qwen/Qwen2.5-7B-Instruct \
--dtype float16 \
--gpu-memory-utilization 0.9 \
--port 13301
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:325]
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:325] █ █ █▄ ▄█
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:325] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.15.1
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:325] █▄█▀ █ █ █ █ model /mnt/nas/models/Qwen/Qwen2.5-7B-Instruct
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:325] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:325]
(APIServer pid=3403) INFO 02-05 10:23:44 [utils.py:261] non-default args: {'port': 13301, 'model': '/mnt/nas/models/Qwen/Qwen2.5-7B-Instruct', 'dtype': 'float16'}
(APIServer pid=3403) INFO 02-05 10:23:51 [model.py:541] Resolved architecture: Qwen2ForCausalLM
(APIServer pid=3403) WARNING 02-05 10:23:51 [model.py:1885] Casting torch.bfloat16 to torch.float16.
(APIServer pid=3403) INFO 02-05 10:23:51 [model.py:1561] Using max model len 32768
(APIServer pid=3403) INFO 02-05 10:23:51 [scheduler.py:226] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=3403) INFO 02-05 10:23:51 [vllm.py:624] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=3669) INFO 02-05 10:23:57 [core.py:96] Initializing a V1 LLM engine (v0.15.1) with config: model='/mnt/nas/models/Qwen/Qwen2.5-7B-Instruct', speculative_config=None, tokenizer='/mnt/nas/models/Qwen/Qwen2.5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/mnt/nas/models/Qwen/Qwen2.5-7B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=3669) INFO 02-05 10:23:58 [parallel_state.py:1212] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.16.246.55:35053 backend=nccl
(EngineCore_DP0 pid=3669) INFO 02-05 10:23:58 [parallel_state.py:1423] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(EngineCore_DP0 pid=3669) INFO 02-05 10:23:59 [gpu_model_runner.py:4033] Starting to load model /mnt/nas/models/Qwen/Qwen2.5-7B-Instruct...
(EngineCore_DP0 pid=3669) INFO 02-05 10:24:17 [cuda.py:364] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:28<01:26, 28.97s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:58<00:58, 29.17s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [01:19<00:25, 25.62s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:45<00:00, 25.76s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:45<00:00, 26.41s/it]
(EngineCore_DP0 pid=3669)
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:03 [default_loader.py:291] Loading weights took 105.94 seconds
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:04 [gpu_model_runner.py:4130] Model loading took 14.25 GiB memory and 123.992349 seconds
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:08 [backends.py:812] Using cache directory: /home/jean/.cache/vllm/torch_compile_cache/3a10497c3c/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:08 [backends.py:872] Dynamo bytecode transform time: 4.21 s
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:15 [backends.py:302] Cache the graph of compile range (1, 2048) for later use
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:19 [backends.py:319] Compiling a graph for compile range (1, 2048) takes 7.78 s
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:19 [monitor.py:34] torch.compile takes 11.99 s in total
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:20 [gpu_worker.py:356] Available KV cache memory: 24.27 GiB
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:20 [kv_cache_utils.py:1307] GPU KV cache size: 454,512 tokens
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:20 [kv_cache_utils.py:1312] Maximum concurrency for 32,768 tokens per request: 13.87x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:02<00:00, 20.37it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:01<00:00, 23.62it/s]
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:25 [gpu_model_runner.py:5063] Graph capturing finished in 5 secs, took 0.49 GiB
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:25 [core.py:272] init engine (profile, create kv cache, warmup model) took 21.19 seconds
(EngineCore_DP0 pid=3669) INFO 02-05 10:26:25 [vllm.py:624] Asynchronous scheduling is enabled.
(APIServer pid=3403) INFO 02-05 10:26:26 [api_server.py:665] Supported tasks: ['generate']
(APIServer pid=3403) WARNING 02-05 10:26:26 [model.py:1371] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=3403) INFO 02-05 10:26:26 [serving.py:177] Warming up chat template processing...
(APIServer pid=3403) INFO 02-05 10:26:26 [hf.py:310] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=3403) INFO 02-05 10:26:26 [serving.py:212] Chat template warmup completed in 238.9ms
(APIServer pid=3403) INFO 02-05 10:26:26 [api_server.py:946] Starting vLLM API server 0 on http://0.0.0.0:13301
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:38] Available routes are:
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=3403) INFO 02-05 10:26:26 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=3403) INFO: Started server process [3403]
(APIServer pid=3403) INFO: Waiting for application startup.
(APIServer pid=3403) INFO: Application startup complete.
(APIServer pid=3403) INFO: 172.31.41.115:55780 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=3403) INFO 02-05 10:28:06 [loggers.py:257] Engine 000: Avg prompt throughput: 3.1 tokens/s, Avg generation throughput: 4.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3403) INFO 02-05 10:28:16 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
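The /v1/chat/completions route listed above speaks the OpenAI chat format, with the served model name from the engine config; it can be exercised with, e.g.:
curl -s http://127.0.0.1:13301/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "/mnt/nas/models/Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Introduce yourself briefly."}],
        "max_tokens": 128
      }'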
pip list | grep torch
torch 2.9.1
torchaudio 2.9.1
torchvision 0.24.1
transformers 4.57.6
safetensors 0.7.0
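To confirm which CUDA build this newer torch was compiled against and that it still sees the L40S:
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"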