PyTorch FAQ

2021-05-23

1. Setting CUDA_VISIBLE_DEVICES has no effect and GPU 0 is always used?

1. Set the environment variables in the script before import torch runs.
2. Or set them on the command line when launching: CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=3 python train.py
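The first option can be sketched in Python as below. This is a minimal illustration, not the only way to do it: select_gpus and visible_gpu_count are hypothetical helper names, and the GPU id 3 is simply taken from the command line above.

```python
import os


def select_gpus(gpu_ids):
    """Restrict this process to the given physical GPUs (hypothetical helper)."""
    # Number devices by PCI bus position so the ids match nvidia-smi.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)


# Call this at the very top of the entry script, before torch is imported.
select_gpus([3])


def visible_gpu_count():
    # torch is imported only after the environment variables are in place.
    import torch

    return torch.cuda.device_count()
```

Inside the process the selected devices are renumbered from zero, so physical GPU 3 is addressed as cuda:0.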

2. RuntimeError: CUDA error: device-side assert triggered

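CUDA kernels are launched asynchronously, so the stack trace often points at a later, unrelated line. Set the following environment variable to make kernel launches synchronous, so that the error is reported at the line of code that actually triggered it.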

import os
# Set this before any CUDA work starts; it slows training, so use it only while debugging.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

3. RuntimeError: DataLoader worker (pid xxx) is killed by signal: Aborted.
what(): CUDA error: initialization error

See the related PyTorch GitHub issue (ref).
Most of the articles found through Google suspect a memory problem.

4. GPU OOM when resuming training

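This can happen when resuming training under DDP. By default torch.load restores each tensor to the device it was saved from, so every process loads the checkpoint onto the same GPU, which eventually runs out of memory. Solution: use map_location to specify which GPU each process loads onto.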

checkpoint = torch.load(checkpoint_path, map_location='cuda:{}'.format(opts.local_rank))
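A slightly fuller sketch of the same idea, wrapped in helpers so each DDP process loads onto its own GPU. The names device_for_rank and load_checkpoint are hypothetical, and local_rank is assumed to come from the launcher as in the line above.

```python
def device_for_rank(local_rank):
    """Device string for the GPU owned by this process."""
    return "cuda:{}".format(local_rank)


def load_checkpoint(checkpoint_path, local_rank):
    # torch is imported here only to keep device_for_rank dependency-free.
    import torch

    # Without map_location, tensors go back to the GPU they were saved from.
    return torch.load(checkpoint_path, map_location=device_for_rank(local_rank))
```

Another common option is map_location="cpu", moving the tensors to the GPU afterwards, which avoids the extra GPU allocation during loading altogether.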

5. cuDNN and PyTorch version mismatch

A matching build can be downloaded and installed from torch_stable.html.

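6. unrecognized arguments: --local_rank, caused by the upgrade to torch 2.0. Fix:

Replace python -m torch.distributed.launch xxx with torchrun xxx. torchrun does not pass a local rank argument to the script; it sets the LOCAL_RANK environment variable instead, so the script should read the rank from there.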
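On the script side, one way to stay compatible with both launchers is sketched below. It assumes the script parses its options with argparse; parse_args is a hypothetical helper name. The parser accepts either spelling of the flag (torch.distributed.launch passes the dashed --local-rank from 2.0 onwards) and falls back to the LOCAL_RANK environment variable set by torchrun.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Accept both the old underscored flag and the dashed one used since torch 2.0.
    # When neither is given (torchrun), fall back to the LOCAL_RANK variable.
    parser.add_argument(
        "--local_rank",
        "--local-rank",
        dest="local_rank",
        type=int,
        default=int(os.environ.get("LOCAL_RANK", 0)),
    )
    return parser.parse_args(argv)
```

Calling parse_args() with no arguments reads sys.argv as usual; the default of 0 when nothing is set is an assumption that suits single-process runs.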

Copyright: 人工智能炼丹师

Link to this article: https://jefxiong.cn/index.php/archives/155.html

License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)