Posts tagged PyTorch - 人工智能炼丹师 - AIGC论文速读

Jefxiong

2021-05-23
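PyTorch FAQ

1. Setting CUDA_VISIBLE_DEVICES has no effect and GPU 0 is always used?
   1. Set the environment variable before import torch.
   2. Or set it on the command line:
      CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=3 python train.py

2. RuntimeError: CUDA error: device-side assert triggered
   Set the following environment variable so that the error points to a more specific line of code:
      import os
      os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

3. RuntimeError: DataLoader worker (pid xxx) is killed by signal: Aborted. what(): CUDA error: initialization error
   Reference: pytorch github issue.
   Most of the articles found through Google suspect a memory problem. What was tried:
   - Changing pin_memory: no effect.
   - Enlarging shared memory with mount -o remount,size=32g /dev/shm: no effect.
   - Reducing num_workers (16 -> 8): no effect.
   - Setting num_workers to 0 makes the error go away, but it is certainly not the optimal solution.

4. GPU OOM when resuming training
   This can happen when resuming training under DDP. The cause is that torch.load in every process loads the checkpoint onto the same GPU, which eventually runs out of memory.
   Solution: use map_location to specify which GPU each process loads onto:
      checkpoint = torch.load(checkpoint_path, map_location='cuda:{}'.format(opts.local_rank))

5. cuDNN and PyTorch versions do not match
   A matching build can be downloaded and installed from torch_stable.html.

6. unrecognized arguments: --local_rank
   This is caused by the upgrade to torch 2.0. Fix: replace python -m torch.distributed.launch xxx with torchrun xxx.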
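Items 1, 2, 4 and 6 can be combined into one small sketch. It is only an illustration under a few assumptions: the helper names (configure_cuda_env, local_rank_from_env, checkpoint_map_location, load_checkpoint) are hypothetical and not from the original post, and the script is assumed to be started with torchrun, which exports the local rank through the LOCAL_RANK environment variable rather than a --local_rank argument. torch is imported lazily so that the environment variables are already in place when it loads.

```python
import os


def configure_cuda_env(visible_devices: str, blocking: bool = False) -> None:
    """Set CUDA-related environment variables (hypothetical helper).

    Must be called before `import torch` for the settings to take effect.
    """
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = visible_devices
    if blocking:
        # Synchronous kernel launches make device-side asserts
        # surface closer to the Python line that caused them.
        os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def local_rank_from_env(default: int = 0) -> int:
    """Read the local rank that torchrun exports as LOCAL_RANK."""
    return int(os.environ.get("LOCAL_RANK", default))


def checkpoint_map_location(local_rank: int) -> str:
    """Build the map_location string so each process loads onto its own GPU."""
    return "cuda:{}".format(local_rank)


def load_checkpoint(checkpoint_path: str, local_rank: int):
    """Load a checkpoint onto this process's GPU instead of a shared one."""
    import torch  # lazy import: environment variables above are already set

    return torch.load(
        checkpoint_path,
        map_location=checkpoint_map_location(local_rank),
    )


if __name__ == "__main__":
    # Assumed device list; adjust to the GPUs actually available.
    configure_cuda_env("0,1")
    rank = local_rank_from_env()
    print("this process would load checkpoints onto", checkpoint_map_location(rank))
```

When launched as torchrun --nproc_per_node=2 train.py, each process reads its own LOCAL_RANK, so a later call such as load_checkpoint(checkpoint_path, rank) places the tensors on that process's GPU and no single card receives every copy.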