PyTorch distributed training hangs
While running MoCoGAN-HD in single-node multi-GPU mode on a lab machine with A100 GPUs, I found that training hung right after main_worker printed the log below. Pressing Ctrl + C by hand gives:
------------ Options -------------
G_step: 5
batchSize: 4
beta1: 0.5
beta2: 0.999
checkpoints_dir: checkpoints/my_dataset
cross_domain: True
dataroot: /home/itom/data3/my_dataset/train-frames
display_freq: 100
dist_backend: nccl
dist_url: tcp://localhost:10003
gpu: None
h_dim: 384
img_g_weights: pretrained/checkpoint.pkl
isPCA: False
isTrain: True
l_len: 256
latent_dimension: 512
load_pretrain_epoch: -1
load_pretrain_path: pretrained_models
lr: 0.0001
moco_m: 0.999
moco_t: 0.07
multiprocessing_distributed: True
n_frames_G: 16
n_mlp: 8
n_pca: 384
name: my_dataset
nc: 3
norm_D_3d: instance
num_D: 2
print_freq: 5
q_len: 4096
rank: 0
resize_style_gan_size: None
save_epoch_freq: 10
save_latest_freq: 1000
save_pca_path: pca_stats/my_dataset
sg2_ada: False
style_gan_size: [512, 256]
time_step: 5
total_epoch: 500
video_frame_size: 128
w_match: 1.0
w_residual: 0.2
workers: 8
world_size: 1
-------------- End ----------------
Use GPU: 0 for training
Use GPU: 1 for training
^CTraceback (most recent call last):
  File "train_sg2_ada.py", line 243, in <module>
    main()
  File "train_sg2_ada.py", line 48, in main
    args=(ngpus_per_node, args))
  File "/home/itom/miniconda3/envs/py37_pt171/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/itom/miniconda3/envs/py37_pt171/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/itom/miniconda3/envs/py37_pt171/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 77, in join
    timeout=timeout,
  File "/home/itom/miniconda3/envs/py37_pt171/lib/python3.7/multiprocessing/connection.py", line 921, in wait
    ready = selector.select(timeout)
  File "/home/itom/miniconda3/envs/py37_pt171/lib/python3.7/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/itom/miniconda3/envs/py37_pt171/lib/python3.7/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
By adding print statements by hand, I found it was stuck at dist.init_process_group. At that point nvidia-smi showed the GPUs were in use, but only around 2 GB each, so initialization clearly had not finished. A classmate suggested switching to the gloo backend; my first attempt was to set environment variables in the shell:
PL_TORCH_DISTRIBUTED_BACKEND=gloo
CUDA_HOME=/usr/local/cuda
python -W ignore train.py ...
That had no effect; it still hung, which makes sense in hindsight: PL_TORCH_DISTRIBUTED_BACKEND is a PyTorch Lightning setting, and this script calls torch.distributed directly, so it never reads that variable. Then I noticed that train_options.py has a --dist_backend option whose default is nccl, so I specified it on the command line instead:
CUDA_HOME=/usr/local/cuda
python -W ignore train.py --dist_backend gloo ...
That worked.
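Why does the command-line flag work when the environment variable does not? The backend string from --dist_backend is passed straight into dist.init_process_group inside each spawned worker, so nothing else can override it. Below is a minimal sketch of that pattern, assuming the standard torch.multiprocessing.spawn layout this kind of training script uses; apart from dist_backend, dist_url, world_size and rank (which appear in the options dump above), the names are illustrative rather than copied from the MoCoGAN-HD code.

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def main_worker(gpu, ngpus_per_node, args):
    # Each spawned process announces its GPU, then joins the process group.
    print('Use GPU: {} for training'.format(gpu))
    # With backend='nccl' this call hung on the machine above;
    # with backend='gloo' it returned normally.
    dist.init_process_group(backend=args.dist_backend,   # value of --dist_backend (default nccl)
                            init_method=args.dist_url,   # e.g. tcp://localhost:10003
                            world_size=ngpus_per_node * args.world_size,
                            rank=args.rank * ngpus_per_node + gpu)
    torch.cuda.set_device(gpu)
    # ... build the models, wrap them in DistributedDataParallel, train ...

def main(args):
    ngpus_per_node = torch.cuda.device_count()
    # One worker per GPU; args comes from the argparse options in train_options.py,
    # so only the --dist_backend flag (not PL_TORCH_DISTRIBUTED_BACKEND) reaches it.
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))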
The code is based on rosinality/stylegan2-pytorch, so if you run into the same hang with that codebase, this workaround may be worth trying as well.
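If you want to tell whether the hang comes from the project or from the NCCL setup on the machine itself, a throwaway script that does nothing but initialize the process group is enough to check. This is a sketch under the same assumptions (the file name check_dist.py is arbitrary, and the port just reuses the dist_url from above):

import sys
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size, backend):
    if backend == 'nccl':
        torch.cuda.set_device(rank)
    # If this call never returns for 'nccl' but does for 'gloo',
    # the problem is NCCL initialization itself, not the training code.
    dist.init_process_group(backend=backend,
                            init_method='tcp://localhost:10003',
                            world_size=world_size,
                            rank=rank)
    print('rank {} initialized with backend {}'.format(rank, backend))
    dist.barrier()
    dist.destroy_process_group()

if __name__ == '__main__':
    backend = sys.argv[1] if len(sys.argv) > 1 else 'nccl'
    mp.spawn(worker, nprocs=2, args=(2, backend))

Run it twice, e.g. python check_dist.py nccl and python check_dist.py gloo; if the nccl run also hangs at init_process_group here, the issue is with NCCL on the machine (drivers, fabric, peer-to-peer settings) rather than with MoCoGAN-HD.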