A Transformer Experiment Roadmap for the Jetson Nano

The Jetson Nano is meant for learning, validation, and edge prototyping; it is not a machine for training large models. Its key constraints:

See ../../iot/jetson/index.md for details.

1. What it can do

Suitable for:

Not suitable for:

2. Environment check

python3 --version
/usr/local/cuda/bin/nvcc --version
cat /usr/local/cuda/version.txt
python3 -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
tegrastats

If PyTorch reports that CUDA is unavailable, do not start training yet. Either fall back to the CPU or reinstall a PyTorch wheel built for JetPack 4.x.
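The fallback rule above can be wrapped in a small helper so that every later script picks its device the same way. This is a minimal sketch, not part of the original checks: the function name `pick_device` is hypothetical, and the code only relies on `torch.cuda.is_available()` and `torch.cuda.get_device_name()`.

```python
import importlib.util


def pick_device():
    """Return (device, reason); fall back to the CPU whenever CUDA is unusable."""
    if importlib.util.find_spec("torch") is None:
        return "cpu", "PyTorch is not installed"
    try:
        import torch
    except (ImportError, OSError) as exc:
        # A wheel that does not match the installed JetPack often fails at import time.
        return "cpu", "PyTorch failed to import: %s" % exc
    if torch.cuda.is_available():
        return "cuda", "CUDA available on " + torch.cuda.get_device_name(0)
    return "cpu", "PyTorch %s found, but CUDA is not usable" % torch.__version__


device, reason = pick_device()
print(device, "-", reason)
```

If the printed device is `cpu` on a Nano, that is usually the signal to check the installed wheel before going further.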

3. Minimal training configuration

Start with the following parameters:

config = {
    "vocab_size": 128,
    "d_model": 64,
    "num_heads": 2,
    "d_ff": 128,
    "num_layers": 1,
    "seq_len": 32,
    "batch_size": 1,
    "learning_rate": 3e-4,
    "max_steps": 200,
}

Once that runs end to end, try:

config = {
    "vocab_size": 256,
    "d_model": 128,
    "num_heads": 4,
    "d_ff": 256,
    "num_layers": 2,
    "seq_len": 64,
    "batch_size": 1,
    "learning_rate": 3e-4,
    "max_steps": 500,
}
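To get a feel for how small these two configurations are, the parameter count can be estimated by hand. The sketch below is an estimate under stated assumptions, not a measurement: it assumes token and learned position embeddings, standard attention and feed-forward layers with biases, two LayerNorms per layer, an untied output head, and a position table of exactly `seq_len` entries. The function name `count_params` is hypothetical.

```python
def count_params(cfg, max_len=None):
    """Estimate trainable parameters for a small decoder-style Transformer."""
    d, d_ff, vocab = cfg["d_model"], cfg["d_ff"], cfg["vocab_size"]
    # Assumption: the position table has exactly seq_len entries unless told otherwise.
    max_len = cfg["seq_len"] if max_len is None else max_len
    embeddings = vocab * d + max_len * d
    attention = 4 * d * d + 4 * d            # Q, K, V and output projections, with biases
    feed_forward = 2 * d * d_ff + d_ff + d   # two linear layers, with biases
    layer_norms = 2 * 2 * d                  # two LayerNorms, each with weight and bias
    per_layer = attention + feed_forward + layer_norms
    head = d * vocab + vocab                 # untied output projection
    return embeddings + cfg["num_layers"] * per_layer + head


small = {"vocab_size": 128, "d_model": 64, "d_ff": 128, "num_layers": 1, "seq_len": 32}
larger = {"vocab_size": 256, "d_model": 128, "d_ff": 256, "num_layers": 2, "seq_len": 64}

for name, cfg in (("small", small), ("larger", larger)):
    n = count_params(cfg)
    print(name, n, "parameters,", round(n * 4 / 1e6, 2), "MB as float32 weights")
```

Under these assumptions the first configuration has about 52 thousand parameters and the second about 339 thousand. Note that `num_heads` does not change the count, since the heads split `d_model` rather than add to it.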

4. A Jetson-friendly tiny Transformer example

This is a training skeleton for learning character-level language modeling. It is not meant to produce a good model.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDecoderLM(nn.Module):
    def __init__(self, vocab_size=128, d_model=64, nhead=2, num_layers=1, max_len=128):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
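        # Encoder layers plus a causal mask act as a decoder-only block.
        # batch_first requires PyTorch 1.9 or newer.
        layer = nn.TransformerEncoderLayer(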
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=d_model * 2,
            dropout=0.1,
            batch_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.max_len = max_len

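    def forward(self, idx):
        bsz, seq_len = idx.shape
        # The learned position table only covers max_len positions.
        assert seq_len <= self.max_len, "sequence is longer than max_len"
        pos = torch.arange(seq_len, device=idx.device).unsqueeze(0)
        x = self.token_emb(idx) + self.pos_emb(pos)
        # Causal mask: True above the diagonal blocks attention to future tokens.
        mask = torch.triu(torch.ones(seq_len, seq_len, device=idx.device), diagonal=1).bool()
        x = self.blocks(x, mask=mask)
        return self.lm_head(x)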

def make_batch(data, seq_len=32, batch_size=1, device="cpu"):
    # Pick random window starts; y holds the next token for each position in x.
    starts = torch.randint(0, len(data) - seq_len - 1, (batch_size,)).tolist()
    x = torch.stack([data[i:i+seq_len] for i in starts]).to(device)
    y = torch.stack([data[i+1:i+seq_len+1] for i in starts]).to(device)
    return x, y

text = "Transformer is attention. Jetson Nano can train tiny models. " * 100
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
encoded = torch.tensor([stoi[ch] for ch in text], dtype=torch.long)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyDecoderLM(vocab_size=len(chars), d_model=64, nhead=2, num_layers=1).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(200):
    x, y = make_batch(encoded, seq_len=32, batch_size=1, device=device)
    logits = model(x)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 20 == 0:
        print(step, loss.item())
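To make the input/target relationship used by `make_batch` concrete, here is a list-based sketch that needs no PyTorch. It reuses the `stoi` idea from the script; the names `itos`, `encode`, `decode`, and `make_pair` are hypothetical helpers introduced only for this illustration.

```python
text = "Jetson Nano"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> character


def encode(s):
    return [stoi[ch] for ch in s]


def decode(ids):
    return "".join(itos[i] for i in ids)


def make_pair(ids, start, seq_len):
    """Return an input window and its target: the same window shifted by one token."""
    x = ids[start:start + seq_len]
    y = ids[start + 1:start + seq_len + 1]
    return x, y


ids = encode(text)
x, y = make_pair(ids, start=0, seq_len=4)
print(repr(decode(x)), "->", repr(decode(y)))
```

Each position in `y` is the character that follows the corresponding position in `x`, which is what the cross-entropy loss in the training loop compares the logits against.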

5. CUDA and matrix operation experiments

The matrix size 10242048 used in Char02.txt may be too large for the Nano. Start with:

matrix_size = 256
block_size = 64

Then scale up gradually to:

matrix_size = 512
block_size = 64
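Before running a kernel on the GPU, it can help to check the tiling logic on the CPU with a tiny matrix. The sketch below assumes that `block_size` refers to the edge length of a square tile, which the source does not state explicitly. All function names are hypothetical, and the pure-Python loops are for checking correctness only, not for speed.

```python
def matmul_naive(a, b):
    """Reference result: plain triple loop over square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matmul_tiled(a, b, block_size):
    """Same product, computed tile by tile (tiles may be ragged at the edges)."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i0 in range(0, n, block_size):
        for j0 in range(0, n, block_size):
            for k0 in range(0, n, block_size):
                for i in range(i0, min(i0 + block_size, n)):
                    for j in range(j0, min(j0 + block_size, n)):
                        acc = c[i][j]
                        for k in range(k0, min(k0 + block_size, n)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c


def float32_bytes(matrix_size, num_matrices=3):
    """Memory for A, B and C stored as float32 (4 bytes per element)."""
    return num_matrices * matrix_size * matrix_size * 4


n = 8
a = [[i + j for j in range(n)] for i in range(n)]
b = [[(i * j) % 5 for j in range(n)] for i in range(n)]
print(matmul_tiled(a, b, block_size=4) == matmul_naive(a, b))
print(float32_bytes(256), float32_bytes(512))
```

Doubling `matrix_size` quadruples the memory and multiplies the arithmetic by eight, which is why stepping from 256 to 512 is a reasonable increment on a small device.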

Monitoring commands:

tegrastats
cat /sys/devices/gpu.0/load
cat /sys/devices/gpu.0/devfreq/57000000.gpu/cur_freq
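The GPU load file can also be polled from Python, for example inside a training loop. This is a sketch under an assumption worth verifying on your own board: that the raw value is reported in tenths of a percent (so `505` means 50.5%). The helper names are hypothetical, and the function returns `None` on machines where the path does not exist.

```python
import os

# Path taken from the monitoring commands above.
GPU_LOAD_PATH = "/sys/devices/gpu.0/load"


def parse_gpu_load(raw, scale=10.0):
    """Convert the raw sysfs text to a percentage.

    scale=10.0 encodes the assumption that the file reports tenths of a percent.
    """
    return int(raw.strip()) / scale


def read_gpu_load(path=GPU_LOAD_PATH):
    """Return the current GPU load in percent, or None if the file is absent."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return parse_gpu_load(f.read())


print(read_gpu_load())
```

Comparing the printed value with the GPU figure shown by `tegrastats` under the same workload is a quick way to confirm the scale.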

6. Choosing models for inference

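Model                      Recommendation on the Jetson Nano
-------------------------  --------------------------------------------------------------
sshleifer/tiny-gpt2        Best choice for validating the Hugging Face workflow
distilgpt2                 Worth trying for CPU/GPU inference; speed is limited
Qwen2.5-0.5B               Use quantized inference or the cloud; do not train on the Nano
paraphrase-MiniLM-L6-v2    Usable for small-scale embedding / retrieval
t5-small                   Usable for short-text seq2seq demos; be cautious about training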
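The advice to quantize the largest model in the list follows from simple arithmetic on weight storage. The sketch below uses the nominal 0.5 billion parameters implied by the name Qwen2.5-0.5B, which is an approximation rather than an exact count. It covers weights only, ignoring activations, the KV cache, and framework overhead. The names `BYTES_PER_PARAM` and `weight_gib` are hypothetical.

```python
# Storage per parameter for common weight formats.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gib(num_params, dtype):
    """Memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 3)


NOMINAL_PARAMS = 0.5e9  # assumption: taken from the "0.5B" in the model name

for dtype in ("fp32", "fp16", "int8", "int4"):
    print(dtype, round(weight_gib(NOMINAL_PARAMS, dtype), 2), "GiB")
```

Because the Nano's memory (4 GB, or 2 GB on the smaller board) is shared between the CPU and the GPU, full-precision weights of this size leave little room for anything else, while 8-bit or 4-bit weights fit more comfortably.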

7. Experiment habits

8. When to switch machines

Move to a desktop GPU, a cloud GPU, or remote training when any of the following occurs:

