Training failed because the loss is NaN #3
Comments
Are you using the latest version of master? This problem occurred in earlier versions. @JoeyHeisenberg
Sorry for the late response. I cloned it just last week; here is the result of `git log`.
Would it be possible for you to add the following code just above the last line of loss_function.py?
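A minimal sketch of the kind of diagnostic addition being suggested for loss_function.py (the helper name `debug_loss_terms` and the loss-term names in the usage comment are assumptions for illustration, not this repo's actual code):

```python
import torch

def debug_loss_terms(**terms):
    """Print each loss component and flag non-finite values.
    Intended to be called just above the final return of the loss function,
    to pinpoint which term goes NaN first."""
    for name, t in terms.items():
        t = t.detach()
        print(f"{name}: value={t.float().mean().item():.6f} "
              f"finite={bool(torch.isfinite(t).all().item())}")

# Hypothetical usage inside loss_function.py, just above the last line:
# debug_loss_terms(mel_loss=mel_loss, gate_loss=gate_loss, kl_loss=kl_loss)
```

Printing each term separately matters because, as the log later shows, the grad norm goes NaN one step before the loss does.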
I made a small correction. You can try with that. @JoeyHeisenberg
@Jeevesh8 I added the print code and ran it; here are the results. I also pulled the latest code, but I got this error. Here is hparams.py: `def create_hparams(hparams_string=None, verbose=False):`
That error has been removed in the latest code now, @JoeyHeisenberg. You can try again.
I reduced the learning rate from 1e-3 to 1e-4 and reduced the batch size to 16; so far so good. @Jeevesh8
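The fix reported here amounts to two hyperparameter changes; written out as a plain Python fragment (the key names mirror this repo's hparams.py):

```python
# Hyperparameter overrides reported in this thread to avoid the NaN loss:
hparams_overrides = dict(
    learning_rate=1e-4,  # was 1e-3
    batch_size=16,       # was 24
)
```

Lowering the learning rate shrinks the enormous early gradient steps visible in the log; a smaller batch is a secondary, memory-side change.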
Great, @JoeyHeisenberg! In which languages are you training, if I may ask? Would you mind sharing the learned weights? I currently don't have access to much compute, so I can't train my own.
Thanks for helping. The model is still training; it's trained on a Chinese dataset and an English dataset. Sorry, I can't share the learned weights because I use the company's private dataset and they don't allow us to share it. Maybe you can try open-source data like https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar (CSMSC, Chinese TTS data recorded by one speaker) and some speech recognition datasets.
Okay. Thanks :) |
@JoeyHeisenberg you can use `clvc-infer-gh.ipynb` to produce wav files.
It didn't synthesize reasonable wavs (with my dataset), and I'm stuck on another project. I'll send the results if I make any progress.
Could you at least hear some words, @JoeyHeisenberg? I could write code to check whether any probability distribution has collapsed to its mode, etc. You can use that, or check yourself. Let me know.
@JoeyHeisenberg how did you set up multi-GPU training on a single system?
@JoeyHeisenberg You can try training further with reduced learning rate. |
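One generic way to resume training with a reduced learning rate after loading a checkpoint (a PyTorch sketch; `lower_lr` is illustrative, not a function in this repo):

```python
import torch

def lower_lr(optimizer, factor=0.1):
    """Scale every param group's learning rate in place, e.g. before
    resuming training from a checkpoint that started to diverge."""
    for group in optimizer.param_groups:
        group["lr"] *= factor
    return [g["lr"] for g in optimizer.param_groups]
```

With `factor=0.1` this reproduces the 1e-3 to 1e-4 drop reported earlier in the thread, without editing hparams.py.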
@singhaki you can do like this |
@Jeevesh8 I tried it, but it gets stuck for hours after "Done initializing distributed".
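A hang right after "Done initializing distributed" is often a rendezvous misconfiguration: the processes initialize but then block on the first collective op. A sketch of the environment variables `torch.distributed` reads when `init_method='env://'` is used (the helper itself is illustrative, not from this repo):

```python
import os

def distributed_env(rank, world_size, master_addr="127.0.0.1",
                    master_port="29500"):
    """Environment variables torch.distributed reads with
    init_method='env://'. A mismatch here (wrong world_size, a port
    blocked by a firewall, or fewer processes launched than
    world_size) commonly causes training to hang right after
    distributed initialization."""
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": master_port,
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
    }
```

Setting `NCCL_DEBUG=INFO` in the environment when launching can also surface why a collective op is stalling.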
@JoeyHeisenberg Can you show your mel target and mel predicted in tensorboard images tab? |
@singhaki I ran it on 3 GPUs. @Jeevesh8 Here are the mel target and the mel predicted:
@JoeyHeisenberg The mel-specs seem close. 1.) Can you upload your log directory to Google Drive and share it? 2.) Also, can you tell whether these mel-spec alignments correspond to the cross-lingual inference case or the same-language case?
@JoeyHeisenberg Also, please let me know what happens when you train further with a lower learning rate.
@JoeyHeisenberg please make sure all 3 points here are true. |
I checked the audio, and the data scale is int16, as follows. About the silent parts at the beginning and end: I actually set "start0" and "end0" for them, so it should be fine. Now I'm facing two problems. Here are two samples from train.txt; all phonemes are in symbols.py, and I didn't change other code. Maybe I made some mistakes in text processing; I will check the code to fix these problems, but recently I have to do another project first. I will let you know if I make progress, and thank you very much for your help. @Jeevesh8
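A small sketch of the kind of audio sanity check being described, for a waveform loaded with e.g. `scipy.io.wavfile.read` (the helper `check_audio` is illustrative, not from this repo):

```python
import numpy as np

def check_audio(data):
    """Sanity-check a loaded waveform array: confirm int16 scale and
    report the peak amplitude. Float data in [-1, 1] fed to code that
    expects int16 (or vice versa) is a classic source of exploding or
    NaN losses."""
    return {
        "dtype": str(data.dtype),
        "is_int16": data.dtype == np.int16,
        "peak": int(np.abs(data).max()) if data.size else 0,
    }
```

Running this over a few files from train.txt quickly confirms whether every clip shares the same dtype and a sensible peak level.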
@JoeyHeisenberg Hello! I want to ask: is the multilingual model you trained effective?
I trained the network on PyTorch 1.5 with 3 GPUs, and training failed because the loss is NaN.
hparams:

```python
################################
# Optimization Hyperparameters #
################################
use_saved_learning_rate=False,
learning_rate=1e-3,
weight_decay=1e-6,
grad_clip_thresh=1.0,
batch_size=24,
mask_padding=True,  # set model's padded outputs to padded values
```
`nvidia-smi` when it failed:
![image](https://user-images.githubusercontent.com/33407667/83220353-5242b280-a1a5-11ea-8ec6-3fa35f1af6d9.png)
```
FP16 Run: False
Dynamic Loss Scaling: True
Distributed Run: True
cuDNN Enabled: True
cuDNN Benchmark: False
Initializing Distributed
Done initializing distributed
Epoch: 0
Train loss 0 17732198.000000 Grad Norm 2364579774464.000000 3.59s/it
/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:102: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn("torch.distributed.reduce_op is deprecated, please use "
Validation loss 0: 1936390491.411058
Saving model and optimizer state at iteration 0 to outdir/checkpoint_0
Train loss 1 3056206.750000 Grad Norm 110319550464.000000 2.15s/it
Train loss 2 3184566.750000 Grad Norm 245978316800.000000 2.15s/it
Train loss 3 4360.366699 Grad Norm 405133.906250 1.97s/it
Train loss 4 511.444702 Grad Norm 10648.502930 1.79s/it
Train loss 5 162.903091 Grad Norm 1219.052490 1.82s/it
Train loss 6 76.945312 Grad Norm 334.356018 1.77s/it
Train loss 7 41.395836 Grad Norm 134.526993 2.25s/it
Train loss 8 27.028851 Grad Norm nan 1.94s/it
Train loss 9 nan Grad Norm nan 1.90s/it
Train loss 10 nan Grad Norm nan 2.47s/it
Train loss 11 nan Grad Norm nan 1.82s/it
Train loss 12 nan Grad Norm nan 2.03s/it
```
```
Train loss 982 nan Grad Norm nan 2.11s/it
Epoch: 1
Train loss 983 nan Grad Norm nan 2.10s/it
Train loss 984 nan Grad Norm nan 1.98s/it
Train loss 985 nan Grad Norm nan 2.05s/it
Train loss 986 nan Grad Norm nan 1.95s/it
Train loss 987 nan Grad Norm nan 1.87s/it
Train loss 988 nan Grad Norm nan 1.91s/it
Train loss 989 nan Grad Norm nan 1.77s/it
Train loss 990 nan Grad Norm nan 2.19s/it
Train loss 991 nan Grad Norm nan 1.92s/it
Train loss 992 nan Grad Norm nan 1.88s/it
Train loss 993 nan Grad Norm nan 2.35s/it
Train loss 994 nan Grad Norm nan 1.84s/it
Train loss 995 nan Grad Norm nan 2.05s/it
Train loss 996 nan Grad Norm nan 1.93s/it
Train loss 997 nan Grad Norm nan 2.42s/it
Train loss 998 nan Grad Norm nan 2.27s/it
Train loss 999 nan Grad Norm nan 2.28s/it
Train loss 1000 nan Grad Norm nan 1.68s/it
Validation loss 1000: nan
Traceback (most recent call last):
  File "train.py", line 292, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 250, in train
    hparams.distributed_run, rank)
  File "train.py", line 147, in validate
    logger.log_validation(val_loss, model, y, y_pred, iteration)
  File "/data/glusterfs_speech_tts/public_data/11104653/multiLingual_voice_cloning/Cross-Lingual-Voice-Cloning/logger.py", line 27, in log_validation
    self.add_histogram(tag, value.data.cpu().numpy(), iteration)
  File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/utils/tensorboard/writer.py", line 425, in add_histogram
    histogram(tag, values, bins, max_bins=max_bins), global_step, walltime)
  File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/utils/tensorboard/summary.py", line 226, in histogram
    hist = make_histogram(values.astype(float), bins, max_bins)
  File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/utils/tensorboard/summary.py", line 264, in make_histogram
    raise ValueError('The histogram is empty, please file a bug report.')
ValueError: The histogram is empty, please file a bug report.
```
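The cascade in the log (finite losses, then a NaN grad norm at step 8, then NaN everywhere) can be cut off by refusing to apply non-finite updates. A minimal sketch of such a guard for a generic PyTorch training step (the function `safe_step` and its interface are assumptions, not part of this repo's train.py):

```python
import math
import torch

def safe_step(model, optimizer, loss, grad_clip_thresh=1.0):
    """Backprop and step only if the loss and the clipped grad norm are
    finite; otherwise skip the update so one bad batch can't poison the
    weights and turn every later step into NaN."""
    optimizer.zero_grad()
    if not torch.isfinite(loss):
        return False  # skip: loss already NaN/inf
    loss.backward()
    grad_norm = torch.nn.utils.clip_grad_norm_(
        model.parameters(), grad_clip_thresh)
    if not math.isfinite(grad_norm):
        return False  # skip: exploding/NaN gradients
    optimizer.step()
    return True
```

This complements, rather than replaces, the lower learning rate that resolved the issue in this thread: skipping steps hides the symptom, while a sane learning rate removes the cause.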