
Train failed because the loss is nan #3

Open
JoeyHeisenberg opened this issue May 29, 2020 · 27 comments

@JoeyHeisenberg

I trained the network on PyTorch 1.5 with 3 GPUs, and training failed because the loss is NaN.

hparams:

```python
################################
# Optimization Hyperparameters #
################################
use_saved_learning_rate=False,
learning_rate=1e-3,
weight_decay=1e-6,
grad_clip_thresh=1.0,
batch_size=24,
mask_padding=True,  # set model's padded outputs to padded values
```

nvidia-smi output when it failed:
image

```
FP16 Run: False
Dynamic Loss Scaling: True
Distributed Run: True
cuDNN Enabled: True
cuDNN Benchmark: False
Initializing Distributed
Done initializing distributed
Epoch: 0
Train loss 0 17732198.000000 Grad Norm 2364579774464.000000 3.59s/it
/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:102: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn("torch.distributed.reduce_op is deprecated, please use "
/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:102: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn("torch.distributed.reduce_op is deprecated, please use "
/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:102: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn("torch.distributed.reduce_op is deprecated, please use "
Validation loss 0: 1936390491.411058
Saving model and optimizer state at iteration 0 to outdir/checkpoint_0
Train loss 1 3056206.750000 Grad Norm 110319550464.000000 2.15s/it
Train loss 2 3184566.750000 Grad Norm 245978316800.000000 2.15s/it
Train loss 3 4360.366699 Grad Norm 405133.906250 1.97s/it
Train loss 4 511.444702 Grad Norm 10648.502930 1.79s/it
Train loss 5 162.903091 Grad Norm 1219.052490 1.82s/it
Train loss 6 76.945312 Grad Norm 334.356018 1.77s/it
Train loss 7 41.395836 Grad Norm 134.526993 2.25s/it
Train loss 8 27.028851 Grad Norm nan 1.94s/it
Train loss 9 nan Grad Norm nan 1.90s/it
Train loss 10 nan Grad Norm nan 2.47s/it
Train loss 11 nan Grad Norm nan 1.82s/it
Train loss 12 nan Grad Norm nan 2.03s/it
```

```
Train loss 982 nan Grad Norm nan 2.11s/it
Epoch: 1
Train loss 983 nan Grad Norm nan 2.10s/it
Train loss 984 nan Grad Norm nan 1.98s/it
Train loss 985 nan Grad Norm nan 2.05s/it
Train loss 986 nan Grad Norm nan 1.95s/it
Train loss 987 nan Grad Norm nan 1.87s/it
Train loss 988 nan Grad Norm nan 1.91s/it
Train loss 989 nan Grad Norm nan 1.77s/it
Train loss 990 nan Grad Norm nan 2.19s/it
Train loss 991 nan Grad Norm nan 1.92s/it
Train loss 992 nan Grad Norm nan 1.88s/it
Train loss 993 nan Grad Norm nan 2.35s/it
Train loss 994 nan Grad Norm nan 1.84s/it
Train loss 995 nan Grad Norm nan 2.05s/it
Train loss 996 nan Grad Norm nan 1.93s/it
Train loss 997 nan Grad Norm nan 2.42s/it
Train loss 998 nan Grad Norm nan 2.27s/it
Train loss 999 nan Grad Norm nan 2.28s/it
Train loss 1000 nan Grad Norm nan 1.68s/it
Validation loss 1000: nan
Traceback (most recent call last):
  File "train.py", line 292, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 250, in train
    hparams.distributed_run, rank)
  File "train.py", line 147, in validate
    logger.log_validation(val_loss, model, y, y_pred, iteration)
  File "/data/glusterfs_speech_tts/public_data/11104653/multiLingual_voice_cloning/Cross-Lingual-Voice-Cloning/logger.py", line 27, in log_validation
    self.add_histogram(tag, value.data.cpu().numpy(), iteration)
  File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/utils/tensorboard/writer.py", line 425, in add_histogram
    histogram(tag, values, bins, max_bins=max_bins), global_step, walltime)
  File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/utils/tensorboard/summary.py", line 226, in histogram
    hist = make_histogram(values.astype(float), bins, max_bins)
  File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/utils/tensorboard/summary.py", line 264, in make_histogram
    raise ValueError('The histogram is empty, please file a bug report.')
ValueError: The histogram is empty, please file a bug report.
```
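The log above shows Grad Norm turning nan at iteration 8, one step before the loss does, which is typical of a gradient blow-up. One common mitigation is to skip the optimizer step whenever the clipped gradient norm is non-finite; the snippet below is a generic sketch of that guard (assumed variable names `model`, `optimizer`, `hparams`), not this repository's actual training loop:

```python
import math
import torch

# Hypothetical guard around the update step: clip first, then skip the step
# if the resulting gradient norm is NaN/inf so one bad batch cannot poison the weights.
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), hparams.grad_clip_thresh)
if math.isfinite(grad_norm):
    optimizer.step()
else:
    print(f"Skipping step: non-finite grad norm {grad_norm}")
optimizer.zero_grad()
```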

@Jeevesh8
Member

Are you using the latest version of master? This problem occurred in earlier versions. @JoeyHeisenberg

@JoeyHeisenberg
Author

Sorry for the late response. I cloned it just last week; here is the output of `git log`:
image
@Jeevesh8

@Jeevesh8
Member

Jeevesh8 commented Jun 1, 2020

Would it be possible for you to add the following code just above the last line of loss_function.py, and see which loss becomes NaN first?

```python
print("Mel Loss:- ", mel_loss)
print("gate_loss :- ", gate_loss)
print("speaker_loss :- ", speaker_loss)
print("kl loss:- ", kl_loss)
print("Total Loss:- ", (mel_loss + gate_loss) + 0.02*speaker_loss + kl_loss)
```
Also, you can try reducing the learning rate, and reducing hparams.mcn to 1 or 2.
It would also be helpful if you could check whether the same thing happens on a single GPU. Please attach your entire hparams.py file too, if possible.
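If it helps, the same check can be scripted; below is a minimal sketch using the loss names from the prints above (torch.isfinite flags both NaN and inf). This is an illustrative helper, not code from the repo:

```python
import torch

# Hypothetical helper: report the first loss term that goes non-finite,
# assuming mel_loss, gate_loss, speaker_loss and kl_loss are in scope
# at the end of loss_function.py as in the prints above.
for name, value in [("mel_loss", mel_loss), ("gate_loss", gate_loss),
                    ("speaker_loss", speaker_loss), ("kl_loss", kl_loss)]:
    if not torch.isfinite(value).all():
        print(f"{name} is the first non-finite term this step: {value}")
        break
```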
@JoeyHeisenberg

@Jeevesh8
Member

Jeevesh8 commented Jun 1, 2020

I made a little correction. You can try with that. @JoeyHeisenberg

@JoeyHeisenberg
Author

@Jeevesh8 I added the "print" code and ran it; here are the results:
image

and I pulled the latest code, but got this error:

```
Traceback (most recent call last):
  File "train.py", line 292, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 216, in train
    y_pred = model(x)
  File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/glusterfs_speech_tts/public_data/11104653/multiLingual_voice_cloning/Cross-Lingual-Voice-Cloning/model.py", line 562, in forward
    encoder_outputs, mels, memory_lengths=text_lengths, speaker=speaker, lang=lang)
  File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/glusterfs_speech_tts/public_data/11104653/multiLingual_voice_cloning/Cross-Lingual-Voice-Cloning/model.py", line 431, in forward
    residual_encoding = self.residual_encoder(decoder_inputs)
  File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/glusterfs_speech_tts/public_data/11104653/multiLingual_voice_cloning/Cross-Lingual-Voice-Cloning/residual_encoder.py", line 112, in forward
    self.calc_q_tilde(z_l)
  File "/data/glusterfs_speech_tts/public_data/11104653/multiLingual_voice_cloning/Cross-Lingual-Voice-Cloning/residual_encoder.py", line 98, in calc_q_tilde
    ans = p_zl_givn_yl * self.y_l.probs
RuntimeError: expected device cuda:0 but got device cpu
```

Here is the hparams.py:

```python
import tensorflow as tf
from text import symbols


def create_hparams(hparams_string=None, verbose=False):
    """Create model hyperparameters. Parse nondefault from given string."""

    hparams = tf.contrib.training.HParams(
    ################################
    # Experiment Parameters        #
    ################################
    epochs=500,
    iters_per_checkpoint=1000,
    seed=1234,
    dynamic_loss_scaling=True,
    fp16_run=False,
    distributed_run=True,
    dist_backend="nccl",
    dist_url="tcp://localhost:54321",
    cudnn_enabled=True,
    cudnn_benchmark=False,
    ignore_layers=['embedding.weight'],

    ################################
    # Data Parameters             #
    ################################
    load_mel_from_disk=False,
    training_files='./filelists/train.txt',
    validation_files='./filelists/valid.txt',
    text_cleaners=['basic_cleaners'],

    ################################
    # Audio Parameters             #
    ################################
    max_wav_value=32768.0,
    sampling_rate=16000,
    filter_length=1280,
    hop_length=320,
    win_length=1280,
    n_mel_channels=80,
    mel_fmin=80.0,
    mel_fmax=7600.0,

    ################################
    # Model Parameters             #
    ################################
    n_symbols=len(symbols),
    symbols_embedding_dim=512,

    # Encoder parameters
    encoder_kernel_size=5,
    encoder_n_convolutions=3,
    encoder_embedding_dim=512,

    # Decoder parameters
    n_frames_per_step=1,  # currently only 1 is supported
    decoder_rnn_dim=1024,
    prenet_dim=256,
    max_decoder_steps=1000,
    gate_threshold=0.5,
    p_attention_dropout=0.1,
    p_decoder_dropout=0.1,

    # Attention parameters
    attention_rnn_dim=1024,
    attention_dim=128,

    # Location Layer parameters
    attention_location_n_filters=32,
    attention_location_kernel_size=31,

    # Mel-post processing network parameters
    postnet_embedding_dim=512,
    postnet_kernel_size=5,
    postnet_n_convolutions=5,

    ################################
    # Optimization Hyperparameters #
    ################################
    use_saved_learning_rate=False,
    learning_rate=1e-3,
    weight_decay=1e-6,
    grad_clip_thresh=1.0,
    batch_size=24,
    mask_padding=True,  # set model's padded outputs to padded values

    ###############################
    # Speaker and Lang Embeddings #
    ###############################
    speaker_embedding_dim=64,
    lang_embedding_dim=3,
    n_langs=2,
    n_speakers=7,

    ###############################
    ## Speaker Classifier Params ##
    ###############################
    hidden_sc_dim=256,

    ##############################
    ## Residual Encoder Params  ##
    ##############################
    residual_encoding_dim=32,          # 16 for q(z_l|X) and 16 for q(z_o|X)
    dim_yo=7,                          #(==n_speakers) dim(y_{o})
    dim_yl=10,                         #K
    mcn=8                              # n for monte carlo sampling of q(z_l|X)and q(z_o|X)
    )

    if hparams_string:
        tf.logging.info('Parsing command line hparams: %s', hparams_string)
        hparams.parse(hparams_string)

    if verbose:
        tf.logging.info('Final parsed hparams: %s', hparams.values())

    return hparams
```

@Jeevesh8
Member

Jeevesh8 commented Jun 1, 2020

That error has been removed from the latest code now, @JoeyHeisenberg. You can try again.
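For anyone hitting the same RuntimeError on an older checkout, the generic shape of the fix is to move the CPU-side tensor onto the GPU before combining it. A minimal sketch, assuming calc_q_tilde multiplies the two tensors as the traceback suggests (this is not necessarily the repo's actual patch):

```python
# Hypothetical fix: self.y_l.probs was created on the CPU while
# p_zl_givn_yl lives on cuda:0, so align devices before the elementwise product.
ans = p_zl_givn_yl * self.y_l.probs.to(p_zl_givn_yl.device)
```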

@JoeyHeisenberg
Author

I reduced the learning rate from 1e-3 to 1e-4 and reduced the batch size to 16; so far so good. @Jeevesh8
image

@Jeevesh8
Member

Jeevesh8 commented Jun 1, 2020

Great @JoeyHeisenberg! Which languages are you training on, if I may ask? Would you mind sharing the learned weights? I currently don't have access to much compute, so I can't train my own.

@JoeyHeisenberg
Author

Thanks for helping. The model is still training; it is being trained on a Chinese dataset and an English dataset. Sorry that I can't share the learned weights: I use the company's private dataset, and they don't allow us to share it. Maybe you can try open-source data like https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar (CSMSC, Chinese TTS data recorded by one speaker) and some speech recognition datasets.

@JoeyHeisenberg
Author

image

@Jeevesh8
Member

Jeevesh8 commented Jun 2, 2020

Okay. Thanks :)
But please let me know after training whether the results turned out well, because so far I have only run on a dummy dataset on Colab. 👍
Also, would it be possible for you to share some compute resources? I have my own dataset of some Indic languages, but the estimated training time on Colab's GPU is around one month. We are a start-up, so if you/your company want to collaborate, we can discuss it. @JoeyHeisenberg

@Jeevesh8 Jeevesh8 closed this as completed Jun 3, 2020
@JoeyHeisenberg
Author

The loss seems very large, and the alignment is bad.
image
image
image

I tried to generate some wavs but failed
image

@Jeevesh8
Member

Jeevesh8 commented Jun 8, 2020

@JoeyHeisenberg you can use clvc-infer-gh.ipynb to produce wav files

@Jeevesh8 Jeevesh8 reopened this Jun 8, 2020
@JoeyHeisenberg
Author

It didn't synthesize reasonable wavs (with my dataset), and I'm tied up with another project. I'll send the results if I make any progress.

@Jeevesh8
Member

Jeevesh8 commented Jun 9, 2020

Could you at least hear some words @JoeyHeisenberg? I could write code to check whether any probability distribution has collapsed to its mode, etc., and you can use that, or you can check yourself. Let me know.
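For example, a collapse check for a categorical distribution such as the residual encoder's q(y_l|X) could look like the sketch below (a hypothetical helper, not code from this repo): a collapsed distribution puts nearly all its mass on a single class, so its entropy is close to zero and its max probability is close to 1.

```python
import math
import torch

def collapse_report(probs, eps=1e-8):
    """probs: tensor of shape (batch, n_classes) whose rows sum to 1."""
    entropy = -(probs * (probs + eps).log()).sum(dim=-1)   # per-sample entropy
    max_prob = probs.max(dim=-1).values                     # mass on the mode
    uniform_entropy = math.log(probs.size(-1))              # entropy if nothing collapsed
    print(f"mean entropy: {entropy.mean().item():.4f} (uniform would be {uniform_entropy:.4f})")
    print(f"mean max prob: {max_prob.mean().item():.4f} (collapse if close to 1.0)")

# e.g. collapse_report(model.residual_encoder.y_l.probs)
# (the y_l.probs attribute name is taken from the traceback earlier in this thread)
```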

@JoeyHeisenberg
Author

JoeyHeisenberg commented Jun 9, 2020

I loaded the checkpoint-97000 model, and it didn't synthesize the wavs. Here are the figures of mel_outputs, mel_outputs_postnet, and alignments:

image

Here is the val loss; it is stuck at around 2.0+:
image

@akashicMarga

@JoeyHeisenberg how did you set up multi-GPU training on a single system?

@Jeevesh8
Member

Jeevesh8 commented Jun 9, 2020

@JoeyHeisenberg You can try training further with a reduced learning rate.

@Jeevesh8 Jeevesh8 reopened this Jun 9, 2020
@Jeevesh8
Member

Jeevesh8 commented Jun 9, 2020

@singhaki you can do it like this

@akashicMarga

@Jeevesh8 I tried it, but it gets stuck for hours after "Done initializing distributed".

@Jeevesh8
Member

Jeevesh8 commented Jun 9, 2020

@JoeyHeisenberg Can you show your mel target and mel predicted in the TensorBoard Images tab?

@JoeyHeisenberg
Author

@singhaki I set it up on 3 GPUs:

```
CUDA_VISIBLE_DEVICES=0,1,2 nohup python -u multiproc train.py --output_directory=outdir --log_directory=logdir --n_gpus=3 --hparams=distributed_run=True > log_tacotron2_v2.file 2>&1 &
```

@Jeevesh8 Here are the mel target and mel predicted:
image

@Jeevesh8
Member

@JoeyHeisenberg The mel-specs seem close.

1.) Can you upload your log directory to Google Drive and share it? OR tell me whether the attention maps (in the Images tab only) are improving during the epochs when the loss is constant?
(I read an issue on nvidia/tacotron2 where they said that although the loss becomes constant after some epochs, the alignments still keep improving; I will mention that issue here when I find it.)

2.) Also, can you tell whether these mel-spec alignments correspond to inference in the cross-lingual case or the same-language case?

@Jeevesh8
Member

@JoeyHeisenberg Also, please let me know what happens when you train further with a lower learning rate.

@Jeevesh8 Jeevesh8 self-assigned this Jun 10, 2020
@Jeevesh8 Jeevesh8 pinned this issue Jun 10, 2020
@Jeevesh8
Member

@JoeyHeisenberg please make sure all 3 points here are true.

@JoeyHeisenberg
Author

I checked the audio, and the data scale is int16, as follows:
image
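For reference, a quick way to run this kind of check is sketched below (using scipy.io.wavfile and a hypothetical file path; this is not the code shown in the screenshot):

```python
# Minimal sanity check that a training wav matches the hparams assumptions
# (sampling_rate=16000, int16 samples, max_wav_value=32768.0).
from scipy.io import wavfile

sr, data = wavfile.read("path/to/000718.wav")  # hypothetical path to one training clip
print(sr, data.dtype, data.min(), data.max())
assert sr == 16000 and data.dtype == "int16"
```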

As for the silent parts at the beginning and end, I actually set "start0" and "end0" for them, so that should be fine.

Now I'm facing two problems:
1. On TensorBoard, the predicted mel-spec looks the same as the target mel-spec, but the alignment doesn't look right (same-language case). I have already trained 96000 steps with batch size 24; maybe I should train further with a lower lr, as you mentioned.
2. I cannot synthesize normal audio with my WaveGAN.

Here are two samples from train.txt. All phonemes are added to symbols.py, and I didn't change any other code.
Chinese:
```
000718.wav|start0 ou2 er2 m ai3 i4 x ie1 z iy1 l iao4 h uo4 i4 sh uang1 ua4 z iy5 sp1 d eng3 c uen2 g ou4 l e5 sp1 h uei4 q v4 m ai3 i2 t ao4 x in1 i1 sh ang5 end0|0|0
```
English:
```
004291.wav|start0 T EY1 K IH0 NG AH1 P S M OW1 K IH0 NG R AE1 NG K S HH AY1 IH0 N AE1 K SH AH0 N Z P IY1 P AH0 L W IH1 SH DH EY1 K UH1 D S T AH1 B AW1 T end0|6|1
```
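For reference, the columns in these lines appear to be audio path | phoneme string | speaker id | language id (an assumption based on the two samples and the n_speakers/n_langs hparams above, not something stated in the repo). A minimal parsing sketch with a made-up line:

```python
# Hypothetical filelist line shaped like the samples above.
line = "clip.wav|start0 n i3 h ao3 end0|0|0"
audio_path, phonemes, speaker_id, lang_id = line.strip().split("|")
print(audio_path, phonemes.split(), int(speaker_id), int(lang_id))
```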

Maybe I made some mistakes in the text processing; I will check the code to fix these problems, but I have to work on another project first. I will let you know if I make any progress. Thank you very much for your help. @Jeevesh8

@Jeevesh8 Jeevesh8 unpinned this issue Sep 12, 2020
@c9412600

@JoeyHeisenberg Hello! I want to ask whether the multilingual model you trained turned out to be effective?
