
PyTorch distributed training hangs

HackerTom 2024-06-19 13:56:27

While running MoCoGAN-HD with single-node multi-GPU training on a lab machine with A100 GPUs, I found that it hung right after main_worker printed the log below and made no further progress. Pressing Ctrl + C manually:

------------ Options -------------
G_step: 5
batchSize: 4
beta1: 0.5
beta2: 0.999
checkpoints_dir: checkpoints/my_dataset
cross_domain: True
dataroot: /home/itom/data3/my_dataset/train-frames
display_freq: 100
dist_backend: nccl
dist_url: tcp://localhost:10003
gpu: None
h_dim: 384
img_g_weights: pretrained/checkpoint.pkl
isPCA: False
isTrain: True
l_len: 256
latent_dimension: 512
load_pretrain_epoch: -1
load_pretrain_path: pretrained_models
lr: 0.0001
moco_m: 0.999
moco_t: 0.07
multiprocessing_distributed: True
n_frames_G: 16
n_mlp: 8
n_pca: 384
name: my_dataset
nc: 3
norm_D_3d: instance
num_D: 2
print_freq: 5
q_len: 4096
rank: 0
resize_style_gan_size: None
save_epoch_freq: 10
save_latest_freq: 1000
save_pca_path: pca_stats/my_dataset
sg2_ada: False
style_gan_size: [512, 256]
time_step: 5
total_epoch: 500
video_frame_size: 128
w_match: 1.0
w_residual: 0.2
workers: 8
world_size: 1
-------------- End ----------------
Use GPU: 0 for training
Use GPU: 1 for training
^CTraceback (most recent call last):
  File "train_sg2_ada.py", line 243, in <module>
    main()
  File "train_sg2_ada.py", line 48, in main
    args=(ngpus_per_node, args))
  File "/home/itom/miniconda3/envs/py37_pt171/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/itom/miniconda3/envs/py37_pt171/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/itom/miniconda3/envs/py37_pt171/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 77, in join
    timeout=timeout,
  File "/home/itom/miniconda3/envs/py37_pt171/lib/python3.7/multiprocessing/connection.py", line 921, in wait
    ready = selector.select(timeout)
  File "/home/itom/miniconda3/envs/py37_pt171/lib/python3.7/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt


^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/itom/miniconda3/envs/py37_pt171/lib/python3.7/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt

Adding manual print statements showed it was stuck inside dist.init_process_group. At that point nvidia-smi showed some GPU memory in use, but only a bit over 2000 MB, so initialization clearly had not finished. A classmate suggested switching to the gloo backend; the first attempt was to set environment variables in the shell:

PL_TORCH_DISTRIBUTED_BACKEND=gloo \
CUDA_HOME=/usr/local/cuda \
python -W ignore train.py ...

No effect, still hung (PL_TORCH_DISTRIBUTED_BACKEND appears to be a PyTorch Lightning setting, and this codebase calls torch.distributed directly, so it presumably never reads that variable). Then I noticed that train_options.py has a --dist_backend option whose default is nccl, so I specified it on the command line instead:

CUDA_HOME=/usr/local/cuda \
python -W ignore train.py --dist_backend gloo ...

That worked.
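
For context, the call chain in this kind of mp.spawn trainer usually looks roughly like the sketch below. This is a simplification under my own assumptions, not the actual MoCoGAN-HD code; the point it illustrates is that the --dist_backend value is forwarded straight into dist.init_process_group, which is exactly the call that was hanging:

import argparse
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def main_worker(gpu, ngpus_per_node, args):
    print('Use GPU: {} for training'.format(gpu))   # the last log line seen before the hang
    args.rank = args.rank * ngpus_per_node + gpu
    # this is where the run blocked with backend='nccl';
    # passing --dist_backend gloo changes args.dist_backend and unblocks it
    dist.init_process_group(backend=args.dist_backend,
                            init_method=args.dist_url,   # e.g. tcp://localhost:10003
                            world_size=args.world_size * ngpus_per_node,
                            rank=args.rank)
    torch.cuda.set_device(gpu)
    # ... build models, wrap them in DistributedDataParallel, train ...

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--dist_backend', default='nccl')   # the option that mattered here
    parser.add_argument('--dist_url', default='tcp://localhost:10003')
    parser.add_argument('--world_size', type=int, default=1)
    parser.add_argument('--rank', type=int, default=0)
    args = parser.parse_args()
    ngpus_per_node = torch.cuda.device_count()
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))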

Its code is based on rosinality/stylegan2-pytorch, so if you hit the same hang with that codebase, this workaround may be worth trying too.
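
If you want to check whether NCCL itself is usable on the machine, rather than only working around it with gloo, a tiny standalone script along these lines is usually enough. It is a generic check I am sketching here, not part of the MoCoGAN-HD repo; running it with NCCL_DEBUG=INFO set in the environment makes NCCL print its own diagnostics:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size, backend):
    dist.init_process_group(backend=backend,
                            init_method='tcp://localhost:10004',  # any free port
                            world_size=world_size,
                            rank=rank)
    torch.cuda.set_device(rank)
    # nccl requires CUDA tensors; gloo also accepts CPU tensors
    t = torch.ones(1, device='cuda:{}'.format(rank))
    dist.all_reduce(t)   # hangs here too if the backend is broken on this machine
    print('rank {} ok, all_reduce result = {}'.format(rank, t.item()))
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = 2
    # try 'nccl' first; if it hangs, rerun with 'gloo' to confirm the workaround
    mp.spawn(worker, args=(world_size, 'nccl'), nprocs=world_size)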
