Training stuck at the beginning #30

Open
mumukawayi opened this issue Sep 20, 2023 · 1 comment

mumukawayi commented Sep 20, 2023

Thanks for this great work! I was trying to follow the training procedure, but training seems to get stuck at the beginning. It keeps showing the following for more than ten minutes and does not proceed any further:
`Training for 25000 kimg...

tick 0 kimg 0.0 time 1m 20s sec/tick 5.5 sec/kimg 1372.65 maintenance 74.8 cpumem 5.39 gpumem 21.01 reserved 22.00 augment 0.000`
By the way, I was trying to resume from "afhqcats512-128.pkl". Could anybody give me some advice on how to move on?

@dunbar12138
Owner

Hi, I'm not sure about the problem based on the provided information. However, if you are training with multiple GPUs, you might want to try setting the environment variable NCCL_P2P_DISABLE=1.
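For anyone trying this suggestion, here is a minimal sketch of one way to pass the variable to a training process from Python. The helper name `env_without_p2p` and the stand-in child command are hypothetical, not part of this repository; only the `NCCL_P2P_DISABLE` variable comes from the comment above. The variable has to be in the environment before the training process creates its NCCL communicators, so it is set on the child process at launch.

```python
import os
import subprocess
import sys


def env_without_p2p(base=None):
    """Return a copy of the environment with NCCL peer-to-peer transport disabled.

    Hypothetical helper for illustration; it does not modify os.environ.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_P2P_DISABLE"] = "1"
    return env


# Stand-in child process that only echoes the variable.
# Replace `child` with your actual training launch command.
child = [sys.executable, "-c", "import os; print(os.environ['NCCL_P2P_DISABLE'])"]
result = subprocess.run(
    child, env=env_without_p2p(), capture_output=True, text=True, check=True
)
print(result.stdout.strip())  # prints "1"
```

Exporting the variable in the shell before launching training should have the same effect. Whether it resolves the hang depends on the cause, which is not established in this thread.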
