Jefxiong

Algorithm Basics, PyTorch

PyTorch DDP

人工智能炼丹师

2022-02-19 / 1 comment / 2,259 views

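1. General workflow for using PyTorch DDP

Initialize the process group with torch.distributed.init_process_group
Wrap the model with torch.nn.parallel.DistributedDataParallel to create the distributed model
Create the DataLoader with a torch.utils.data.distributed.DistributedSampler as its sampler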
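
The three steps above can be sketched as follows. This is a minimal illustration, not the author's training script: it assumes a single CPU process with the gloo backend and a temporary file store so it can run anywhere, and the function name run_ddp_demo, the toy linear model, and the random dataset are all hypothetical. In a real multi-GPU job the backend would typically be nccl, and rank and world_size would come from the launcher.

```python
import os
import tempfile

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def run_ddp_demo(rank=0, world_size=1, epochs=2, init_method=None):
    """Hypothetical helper: runs a tiny DDP training loop and returns the step count."""
    if init_method is None:
        # Assumption: a file-based rendezvous keeps the demo free of port conflicts.
        init_method = "file://" + os.path.join(tempfile.mkdtemp(), "ddp_store")

    # Step 1: initialize the process group.
    dist.init_process_group("gloo", init_method=init_method,
                            rank=rank, world_size=world_size)
    try:
        # Step 2: wrap the model in DistributedDataParallel.
        model = DDP(torch.nn.Linear(4, 1))

        # Step 3: build the DataLoader on top of a DistributedSampler.
        dataset = TensorDataset(torch.randn(32, 4), torch.randn(32, 1))
        sampler = DistributedSampler(dataset, num_replicas=world_size,
                                     rank=rank, shuffle=True)
        loader = DataLoader(dataset, batch_size=8, sampler=sampler)

        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        steps = 0
        for epoch in range(epochs):
            # Reshuffle differently each epoch; without this call every
            # epoch iterates the data in the same order.
            sampler.set_epoch(epoch)
            for x, y in loader:
                optimizer.zero_grad()
                loss = torch.nn.functional.mse_loss(model(x), y)
                loss.backward()  # gradients are all-reduced across ranks here
                optimizer.step()
                steps += 1
        return steps
    finally:
        dist.destroy_process_group()
```

Calling sampler.set_epoch(epoch) at the start of each epoch is easy to forget; it is what makes the shuffle order change between epochs.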

2. Error messages seen during PyTorch DDP training

[W reducer.cpp:346] Warning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.

Expected to mark a variable ready only once.

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one.

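3. Things to watch for in multi-GPU DDP training

Set a different random seed for each process
Use SyncBN
Use DistributedSampler
Manage logging, file creation, directory creation, and so on across multiple processes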
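
A few of the items above can be sketched as small helpers. This is only an illustration under stated assumptions: the helper names (seed_for_rank, to_sync_bn, setup_logger, make_output_dir) are hypothetical, and offsetting a base seed by the rank is just one common way to get per-process seeds. Different seeds do not make the replicas diverge at start-up, because DDP broadcasts rank 0's parameters to the other ranks when the wrapper is constructed.

```python
import logging
import os
import random

import torch


def seed_for_rank(base_seed, rank):
    """Give each process its own seed so augmentation and dropout differ per rank."""
    seed = base_seed + rank
    random.seed(seed)
    torch.manual_seed(seed)
    return seed


def to_sync_bn(model):
    """Replace BatchNorm layers with SyncBatchNorm; do this before wrapping in DDP."""
    return torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)


def setup_logger(rank):
    """Let only rank 0 log at INFO level so messages are not repeated per process."""
    logger = logging.getLogger(f"ddp.rank{rank}")
    logger.setLevel(logging.INFO if rank == 0 else logging.WARNING)
    return logger


def make_output_dir(path, rank):
    """Create the output directory from rank 0 only.

    Other ranks should wait (for example at torch.distributed.barrier())
    before they read from or write to the directory.
    """
    if rank == 0:
        os.makedirs(path, exist_ok=True)
    return path
```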

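4. DDP hangs on a single machine with multiple GPUs (hang/stuck/deadlock)

Symptom: GPU utilization sits at 100%, but the program produces no output.

(1) Limit the number of threads per process (did not help)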

torch.set_num_threads(4)

(2) Set find_unused_parameters to True

model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[int(args.local_rank)], output_device=int(args.local_rank), find_unused_parameters=True) 
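# find_unused_parameters: tells DDP to find parameters that were not used to compute the loss, so the gradient reduction does not wait for them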
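
To make the setting concrete, here is a small sketch of a model whose second head is skipped on some iterations. Everything in it is hypothetical (the TwoHeadNet model, the run_unused_branch_demo helper, the toy loss), and it assumes a single CPU process with the gloo backend and a temporary file store purely so that it is runnable; the line above shows the GPU form with device_ids.

```python
import os
import tempfile

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


class TwoHeadNet(torch.nn.Module):
    """Hypothetical model: head_b only takes part in forward when use_b is True."""

    def __init__(self):
        super().__init__()
        self.backbone = torch.nn.Linear(4, 8)
        self.head_a = torch.nn.Linear(8, 1)
        self.head_b = torch.nn.Linear(8, 1)

    def forward(self, x, use_b=False):
        h = torch.relu(self.backbone(x))
        out = self.head_a(h)
        if use_b:
            out = out + self.head_b(h)
        return out


def run_unused_branch_demo(steps=3):
    store = os.path.join(tempfile.mkdtemp(), "ddp_store")
    dist.init_process_group("gloo", init_method=f"file://{store}",
                            rank=0, world_size=1)
    try:
        # With find_unused_parameters=True, DDP inspects the autograd graph
        # after each forward pass and does not wait for gradients of
        # parameters that did not contribute to the output.
        model = DDP(TwoHeadNet(), find_unused_parameters=True)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        losses = []
        for step in range(steps):
            optimizer.zero_grad()
            # head_b is unused on even steps and used on odd steps.
            out = model(torch.randn(16, 4), use_b=(step % 2 == 1))
            loss = out.pow(2).mean()
            loss.backward()
            optimizer.step()
            losses.append(loss.item())
        return losses
    finally:
        dist.destroy_process_group()
```

The extra graph traversal costs some time on every iteration, so it is usually worth enabling the flag only when the model really has conditionally used parameters.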

(3) Check for NCCL problems

# Set these environment variables to see whether more information is printed; for more environment variables, see
# https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
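
When the job is started from Python rather than a shell, the same two variables can be set in the script. A minimal sketch, assuming it runs before torch.distributed.init_process_group is called so that NCCL sees the values when it initializes; the helper name enable_nccl_debug is hypothetical.

```python
import os


def enable_nccl_debug(env=os.environ):
    """Set the NCCL debug variables shown above; call before the process group is created."""
    env["NCCL_DEBUG"] = "INFO"
    env["NCCL_DEBUG_SUBSYS"] = "ALL"
    return env
```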

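(4) On some GPUs, the outputs of certain branches do not take part in the loss computation

Error message: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one.

Because samples are filtered, the loss is computed only when a sample meets certain conditions (for example, in a regression task, only when pred and gt are within a certain range of each other). As a result, some GPUs skip the gradient computation that the other GPUs perform, and the program hangs.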
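
One possible workaround, which is an assumption here and not something the original post prescribes, is to apply the filter as a mask inside the loss instead of skipping the loss altogether. The loss then stays connected to the model output on every GPU, so each rank still runs backward and contributes (possibly zero) gradients. The function name masked_l1_loss and the threshold max_abs_err are hypothetical.

```python
import torch


def masked_l1_loss(pred, gt, max_abs_err=1.0):
    """L1 loss over samples with |pred - gt| < max_abs_err.

    If no sample passes the filter, the result is 0 but still depends on
    pred, so backward() produces zero gradients instead of none.
    """
    err = (pred - gt).abs()
    # The mask is a constant with respect to autograd.
    mask = (err.detach() < max_abs_err).to(pred.dtype)
    # clamp avoids dividing by zero when every sample is filtered out.
    return (err * mask).sum() / mask.sum().clamp(min=1.0)
```

Usage: replace a pattern like "if mask.any(): loss = ..." with an unconditional call such as loss = masked_l1_loss(pred, gt), so that every rank takes the same path through the model and the loss.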

Copyright: 人工智能炼丹师

Article link: https://jefxiong.cn/index.php/archives/422.html

License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
