
Train failed because the loss is nan #3

Open
JoeyHeisenberg opened this issue May 29, 2020 · 27 comments

@JoeyHeisenberg

I trained the network on PyTorch 1.5 with 3 GPUs, and training failed because the loss is NaN.

hparams:

```python
################################
# Optimization Hyperparameters #
################################
use_saved_learning_rate=False,
learning_rate=1e-3,
weight_decay=1e-6,
grad_clip_thresh=1.0,
batch_size=24,
mask_padding=True,  # set model's padded outputs to padded values
```

nvidia-smi output when it failed:
image

```
FP16 Run: False
Dynamic Loss Scaling: True
Distributed Run: True
cuDNN Enabled: True
cuDNN Benchmark: False
Initializing Distributed
Done initializing distributed
Epoch: 0
Train loss 0 17732198.000000 Grad Norm 2364579774464.000000 3.59s/it
/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:102: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn("torch.distributed.reduce_op is deprecated, please use "
/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:102: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn("torch.distributed.reduce_op is deprecated, please use "
/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:102: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn("torch.distributed.reduce_op is deprecated, please use "
Validation loss 0: 1936390491.411058
Saving model and optimizer state at iteration 0 to outdir/checkpoint_0
Train loss 1 3056206.750000 Grad Norm 110319550464.000000 2.15s/it
Train loss 2 3184566.750000 Grad Norm 245978316800.000000 2.15s/it
Train loss 3 4360.366699 Grad Norm 405133.906250 1.97s/it
Train loss 4 511.444702 Grad Norm 10648.502930 1.79s/it
Train loss 5 162.903091 Grad Norm 1219.052490 1.82s/it
Train loss 6 76.945312 Grad Norm 334.356018 1.77s/it
Train loss 7 41.395836 Grad Norm 134.526993 2.25s/it
Train loss 8 27.028851 Grad Norm nan 1.94s/it
Train loss 9 nan Grad Norm nan 1.90s/it
Train loss 10 nan Grad Norm nan 2.47s/it
Train loss 11 nan Grad Norm nan 1.82s/it
Train loss 12 nan Grad Norm nan 2.03s/it
```

```
Train loss 982 nan Grad Norm nan 2.11s/it
Epoch: 1
Train loss 983 nan Grad Norm nan 2.10s/it
Train loss 984 nan Grad Norm nan 1.98s/it
Train loss 985 nan Grad Norm nan 2.05s/it
Train loss 986 nan Grad Norm nan 1.95s/it
Train loss 987 nan Grad Norm nan 1.87s/it
Train loss 988 nan Grad Norm nan 1.91s/it
Train loss 989 nan Grad Norm nan 1.77s/it
Train loss 990 nan Grad Norm nan 2.19s/it
Train loss 991 nan Grad Norm nan 1.92s/it
Train loss 992 nan Grad Norm nan 1.88s/it
Train loss 993 nan Grad Norm nan 2.35s/it
Train loss 994 nan Grad Norm nan 1.84s/it
Train loss 995 nan Grad Norm nan 2.05s/it
Train loss 996 nan Grad Norm nan 1.93s/it
Train loss 997 nan Grad Norm nan 2.42s/it
Train loss 998 nan Grad Norm nan 2.27s/it
Train loss 999 nan Grad Norm nan 2.28s/it
Train loss 1000 nan Grad Norm nan 1.68s/it
Validation loss 1000: nan
Traceback (most recent call last):
  File "train.py", line 292, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 250, in train
    hparams.distributed_run, rank)
  File "train.py", line 147, in validate
    logger.log_validation(val_loss, model, y, y_pred, iteration)
  File "/data/glusterfs_speech_tts/public_data/11104653/multiLingual_voice_cloning/Cross-Lingual-Voice-Cloning/logger.py", line 27, in log_validation
    self.add_histogram(tag, value.data.cpu().numpy(), iteration)
  File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/utils/tensorboard/writer.py", line 425, in add_histogram
    histogram(tag, values, bins, max_bins=max_bins), global_step, walltime)
  File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/utils/tensorboard/summary.py", line 226, in histogram
    hist = make_histogram(values.astype(float), bins, max_bins)
  File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/utils/tensorboard/summary.py", line 264, in make_histogram
    raise ValueError('The histogram is empty, please file a bug report.')
ValueError: The histogram is empty, please file a bug report.
```
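The log above shows Grad Norm turning nan at iteration 8, one step before the loss does, which is typical of a gradient blow-up. One common mitigation is to skip the optimizer step whenever the clipped gradient norm is non-finite; the snippet below is a generic sketch of that guard (assumed variable names `model`, `optimizer`, `hparams`), not this repository's actual training loop:

```python
import math
import torch

# Hypothetical guard around the update step: clip first, then skip the step
# if the resulting gradient norm is NaN/inf so one bad batch cannot poison the weights.
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), hparams.grad_clip_thresh)
if math.isfinite(grad_norm):
    optimizer.step()
else:
    print(f"Skipping step: non-finite grad norm {grad_norm}")
optimizer.zero_grad()
```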

@Jeevesh8
Member

Are you using the latest version of master? This problem occurred in earlier versions. @JoeyHeisenberg

@JoeyHeisenberg
Author

Sorry for the late response. I cloned it just last week; here is the output of `git log`:
image
@Jeevesh8

@Jeevesh8
Member

Jeevesh8 commented Jun 1, 2020

Would it be possible for you to add the following code just above the last line of loss_function.py, and see which loss becomes NaN first?

```python
print("Mel Loss:- ", mel_loss)
print("gate_loss :- ", gate_loss)
print("speaker_loss :- ", speaker_loss)
print("kl loss:- ", kl_loss)
print("Total Loss:- ", (mel_loss + gate_loss) + 0.02*speaker_loss + kl_loss)
```
Also, you can try reducing the learning rate, and reducing hparams.mcn to 1 or 2.
It would also be helpful if you could check whether the same thing happens on a single GPU. Please attach your entire hparams.py file too, if possible.
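If it helps, the same check can be scripted; below is a minimal sketch using the loss names from the prints above (torch.isfinite flags both NaN and inf). This is an illustrative helper, not code from the repo:

```python
import torch

# Hypothetical helper: report the first loss term that goes non-finite,
# assuming mel_loss, gate_loss, speaker_loss and kl_loss are in scope
# at the end of loss_function.py as in the prints above.
for name, value in [("mel_loss", mel_loss), ("gate_loss", gate_loss),
                    ("speaker_loss", speaker_loss), ("kl_loss", kl_loss)]:
    if not torch.isfinite(value).all():
        print(f"{name} is the first non-finite term this step: {value}")
        break
```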
@JoeyHeisenberg

@Jeevesh8
Member

Jeevesh8 commented Jun 1, 2020

I made a little correction. You can try with that. @JoeyHeisenberg

@JoeyHeisenberg
Author

@Jeevesh8 I added the "print" code and ran it; here are the results:
image

and I pulled the latest code, but got this error:

```
Traceback (most recent call last):
  File "train.py", line 292, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 216, in train
    y_pred = model(x)
  File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/glusterfs_speech_tts/public_data/11104653/multiLingual_voice_cloning/Cross-Lingual-Voice-Cloning/model.py", line 562, in forward
    encoder_outputs, mels, memory_lengths=text_lengths, speaker=speaker, lang=lang)
  File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/glusterfs_speech_tts/public_data/11104653/multiLingual_voice_cloning/Cross-Lingual-Voice-Cloning/model.py", line 431, in forward
    residual_encoding = self.residual_encoder(decoder_inputs)
  File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/glusterfs_speech_tts/public_data/11104653/multiLingual_voice_cloning/Cross-Lingual-Voice-Cloning/residual_encoder.py", line 112, in forward
    self.calc_q_tilde(z_l)
  File "/data/glusterfs_speech_tts/public_data/11104653/multiLingual_voice_cloning/Cross-Lingual-Voice-Cloning/residual_encoder.py", line 98, in calc_q_tilde
    ans = p_zl_givn_yl * self.y_l.probs
RuntimeError: expected device cuda:0 but got device cpu
```

Here is the hparams.py:

```python
import tensorflow as tf
from text import symbols


def create_hparams(hparams_string=None, verbose=False):
    """Create model hyperparameters. Parse nondefault from given string."""

    hparams = tf.contrib.training.HParams(
    ################################
    # Experiment Parameters        #
    ################################
    epochs=500,
    iters_per_checkpoint=1000,
    seed=1234,
    dynamic_loss_scaling=True,
    fp16_run=False,
    distributed_run=True,
    dist_backend="nccl",
    dist_url="tcp://localhost:54321",
    cudnn_enabled=True,
    cudnn_benchmark=False,
    ignore_layers=['embedding.weight'],

    ################################
    # Data Parameters             #
    ################################
    load_mel_from_disk=False,
    training_files='./filelists/train.txt',
    validation_files='./filelists/valid.txt',
    text_cleaners=['basic_cleaners'],

    ################################
    # Audio Parameters             #
    ################################
    max_wav_value=32768.0,
    sampling_rate=16000,
    filter_length=1280,
    hop_length=320,
    win_length=1280,
    n_mel_channels=80,
    mel_fmin=80.0,
    mel_fmax=7600.0,

    ################################
    # Model Parameters             #
    ################################
    n_symbols=len(symbols),
    symbols_embedding_dim=512,

    # Encoder parameters
    encoder_kernel_size=5,
    encoder_n_convolutions=3,
    encoder_embedding_dim=512,

    # Decoder parameters
    n_frames_per_step=1,  # currently only 1 is supported
    decoder_rnn_dim=1024,
    prenet_dim=256,
    max_decoder_steps=1000,
    gate_threshold=0.5,
    p_attention_dropout=0.1,
    p_decoder_dropout=0.1,

    # Attention parameters
    attention_rnn_dim=1024,
    attention_dim=128,

    # Location Layer parameters
    attention_location_n_filters=32,
    attention_location_kernel_size=31,

    # Mel-post processing network parameters
    postnet_embedding_dim=512,
    postnet_kernel_size=5,
    postnet_n_convolutions=5,

    ################################
    # Optimization Hyperparameters #
    ################################
    use_saved_learning_rate=False,
    learning_rate=1e-3,
    weight_decay=1e-6,
    grad_clip_thresh=1.0,
    batch_size=24,
    mask_padding=True,  # set model's padded outputs to padded values

    ###############################
    # Speaker and Lang Embeddings #
    ###############################
    speaker_embedding_dim=64,
    lang_embedding_dim=3,
    n_langs=2,
    n_speakers=7,

    ###############################
    ## Speaker Classifier Params ##
    ###############################
    hidden_sc_dim=256,

    ##############################
    ## Residual Encoder Params  ##
    ##############################
    residual_encoding_dim=32,          # 16 for q(z_l|X) and 16 for q(z_o|X)
    dim_yo=7,                          #(==n_speakers) dim(y_{o})
    dim_yl=10,                         #K
    mcn=8                              # n for monte carlo sampling of q(z_l|X)and q(z_o|X)
    )

    if hparams_string:
        tf.logging.info('Parsing command line hparams: %s', hparams_string)
        hparams.parse(hparams_string)

    if verbose:
        tf.logging.info('Final parsed hparams: %s', hparams.values())

    return hparams
```

@Jeevesh8
Member

Jeevesh8 commented Jun 1, 2020

That error has been removed from the latest code now, @JoeyHeisenberg. You can try again.
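For anyone hitting the same RuntimeError on an older checkout, the generic shape of the fix is to move the CPU-side tensor onto the GPU before combining it. A minimal sketch, assuming calc_q_tilde multiplies the two tensors as the traceback suggests (this is not necessarily the repo's actual patch):

```python
# Hypothetical fix: self.y_l.probs was created on the CPU while
# p_zl_givn_yl lives on cuda:0, so align devices before the elementwise product.
ans = p_zl_givn_yl * self.y_l.probs.to(p_zl_givn_yl.device)
```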

@JoeyHeisenberg
Author

I reduced the learning rate from 1e-3 to 1e-4 and reduced the batch size to 16; so far so good. @Jeevesh8
image

@Jeevesh8
Member

Jeevesh8 commented Jun 1, 2020

Great @JoeyHeisenberg! Which languages are you training on, if I may ask? Would you mind sharing the learned weights? I currently don't have access to much compute, so I can't train my own.

@JoeyHeisenberg
Author

Thanks for helping. The model is still training; it is being trained on a Chinese dataset and an English dataset. Sorry that I can't share the learned weights: I use the company's private dataset, and they don't allow us to share it. Maybe you can try open-source data like https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar (CSMSC, Chinese TTS data recorded by one speaker) and some speech recognition datasets.

@JoeyHeisenberg
Author

image

@Jeevesh8
Member

Jeevesh8 commented Jun 2, 2020

Okay. Thanks :)
But please let me know after training whether the results turned out well, because so far I have only run on a dummy dataset on Colab. 👍
Also, would it be possible for you to share some compute resources? I have my own dataset of some Indic languages, but the estimated training time on Colab's GPU is around one month. We are a start-up, so if you/your company want to collaborate, we can discuss it. @JoeyHeisenberg

@Jeevesh8 Jeevesh8 closed this as completed Jun 3, 2020
@JoeyHeisenberg
Author

The loss seems very large, and the alignment is bad.
image
image
image

I tried to generate some wavs but failed
image

@Jeevesh8
Member

Jeevesh8 commented Jun 8, 2020

@JoeyHeisenberg you can use clvc-infer-gh.ipynb to produce wav files

@Jeevesh8 Jeevesh8 reopened this Jun 8, 2020
@JoeyHeisenberg
Author

It didn't synthesize reasonable wavs (with my dataset), and I'm tied up with another project. I'll send the results if I make any progress.

@Jeevesh8
Member

Jeevesh8 commented Jun 9, 2020

Could you at least hear some words @JoeyHeisenberg? I could write code to check whether any probability distribution has collapsed to its mode, etc., and you can use that, or you can check yourself. Let me know.
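For example, a collapse check for a categorical distribution such as the residual encoder's q(y_l|X) could look like the sketch below (a hypothetical helper, not code from this repo): a collapsed distribution puts nearly all its mass on a single class, so its entropy is close to zero and its max probability is close to 1.

```python
import math
import torch

def collapse_report(probs, eps=1e-8):
    """probs: tensor of shape (batch, n_classes) whose rows sum to 1."""
    entropy = -(probs * (probs + eps).log()).sum(dim=-1)   # per-sample entropy
    max_prob = probs.max(dim=-1).values                     # mass on the mode
    uniform_entropy = math.log(probs.size(-1))              # entropy if nothing collapsed
    print(f"mean entropy: {entropy.mean().item():.4f} (uniform would be {uniform_entropy:.4f})")
    print(f"mean max prob: {max_prob.mean().item():.4f} (collapse if close to 1.0)")

# e.g. collapse_report(model.residual_encoder.y_l.probs)
# (the y_l.probs attribute name is taken from the traceback earlier in this thread)
```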

@JoeyHeisenberg
Author

JoeyHeisenberg commented Jun 9, 2020

I loaded the checkpoint-97000 model, and it didn't synthesize the wavs. Here are the figures of mel_outputs, mel_outputs_postnet, and alignments:

image

Here is the val loss; it is stuck at around 2.0+:
image

@akashicMarga

@JoeyHeisenberg how did you set up multi-GPU training on a single system?

@Jeevesh8
Member

Jeevesh8 commented Jun 9, 2020

@JoeyHeisenberg You can try training further with a reduced learning rate.

@Jeevesh8 Jeevesh8 reopened this Jun 9, 2020
@Jeevesh8
Member

Jeevesh8 commented Jun 9, 2020

@singhaki you can do it like this

@akashicMarga

@Jeevesh8 I tried it, but it gets stuck for hours after "Done initializing distributed".

@Jeevesh8
Member

Jeevesh8 commented Jun 9, 2020

@JoeyHeisenberg Can you show your mel target and mel predicted in the TensorBoard Images tab?

@JoeyHeisenberg
Author

@singhaki I set it up on 3 GPUs:

```
CUDA_VISIBLE_DEVICES=0,1,2 nohup python -u multiproc train.py --output_directory=outdir --log_directory=logdir --n_gpus=3 --hparams=distributed_run=True > log_tacotron2_v2.file 2>&1 &
```

@Jeevesh8 Here are the mel target and mel predicted:
image

@Jeevesh8
Member

@JoeyHeisenberg The mel-specs seem close.

1.) Can you upload your log directory to Google Drive and share it? OR tell me whether the attention maps (in the Images tab only) are improving during the epochs when the loss is constant?
(I read an issue on nvidia/tacotron2 where they said that although the loss becomes constant after some epochs, the alignments still keep improving; I will mention that issue here when I find it.)

2.) Also, can you tell whether these mel-spec alignments correspond to inference in the cross-lingual case or the same-language case?

@Jeevesh8
Member

@JoeyHeisenberg Also, please let me know what happens when you train further with a lower learning rate.

@Jeevesh8 Jeevesh8 self-assigned this Jun 10, 2020
@Jeevesh8 Jeevesh8 pinned this issue Jun 10, 2020
@Jeevesh8
Member

@JoeyHeisenberg please make sure all 3 points here are true.

@JoeyHeisenberg
Author

I checked the audio, and the data scale is int16, as follows:
image
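For reference, a quick way to run this kind of check is sketched below (using scipy.io.wavfile and a hypothetical file path; this is not the code shown in the screenshot):

```python
# Minimal sanity check that a training wav matches the hparams assumptions
# (sampling_rate=16000, int16 samples, max_wav_value=32768.0).
from scipy.io import wavfile

sr, data = wavfile.read("path/to/000718.wav")  # hypothetical path to one training clip
print(sr, data.dtype, data.min(), data.max())
assert sr == 16000 and data.dtype == "int16"
```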

As for the silent parts at the beginning and end, I actually set "start0" and "end0" for them, so that should be fine.

Now I'm facing two problems:
1. On TensorBoard, the predicted mel-spec looks the same as the target mel-spec, but the alignment doesn't look right (same-language case). I have already trained 96000 steps with batch size 24; maybe I should train further with a lower lr, as you mentioned.
2. I cannot synthesize normal audio with my WaveGAN.

Here are two samples from train.txt. All phonemes are added to symbols.py, and I didn't change any other code.
Chinese:
```
000718.wav|start0 ou2 er2 m ai3 i4 x ie1 z iy1 l iao4 h uo4 i4 sh uang1 ua4 z iy5 sp1 d eng3 c uen2 g ou4 l e5 sp1 h uei4 q v4 m ai3 i2 t ao4 x in1 i1 sh ang5 end0|0|0
```
English:
```
004291.wav|start0 T EY1 K IH0 NG AH1 P S M OW1 K IH0 NG R AE1 NG K S HH AY1 IH0 N AE1 K SH AH0 N Z P IY1 P AH0 L W IH1 SH DH EY1 K UH1 D S T AH1 B AW1 T end0|6|1
```
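For reference, the columns in these lines appear to be audio path | phoneme string | speaker id | language id (an assumption based on the two samples and the n_speakers/n_langs hparams above, not something stated in the repo). A minimal parsing sketch with a made-up line:

```python
# Hypothetical filelist line shaped like the samples above.
line = "clip.wav|start0 n i3 h ao3 end0|0|0"
audio_path, phonemes, speaker_id, lang_id = line.strip().split("|")
print(audio_path, phonemes.split(), int(speaker_id), int(lang_id))
```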

Maybe I made some mistakes in the text processing; I will check the code to fix these problems, but I have to work on another project first. I will let you know if I make any progress. Thank you very much for your help. @Jeevesh8

@Jeevesh8 Jeevesh8 unpinned this issue Sep 12, 2020
@c9412600

@JoeyHeisenberg Hello! I want to ask whether the multilingual model you trained turned out to be effective?
