TPU Estimator Crashing #12

Open
captain-pool opened this issue Jun 20, 2019 · 2 comments

Comments

@captain-pool
Owner

TensorFlow version: tensorflow==2.0.0b0
TensorFlow Datasets version: tfds-nightly==1.0.2.dev201906090105
TensorFlow Hub version: tf-hub-nightly==0.5.0.dev201905270046

Issue

The code raises
End of sequence [[node input_pipeline_task0/while/IteratorGetNext (defined at image_retraining_tpu.py:139) ]]
for all values of max_steps passed to TPUEstimator.train(...).

Reproduce the issue

$ python3 image_retraining_tpu.py --tpu [TPU_NAME] \
--use_tpu --use_compat --data_dir gs://[BUCKET_NAME]/data_dir \
--model_dir gs://[BUCKET_NAME]/model_dir --batch_size=32 \
--iterations=8 --max_steps=8

The same error is raised for each of the following:

$ python3 image_retraining_tpu.py --tpu [TPU_NAME] \
--use_tpu --use_compat --data_dir gs://[BUCKET_NAME]/data_dir \
--model_dir gs://[BUCKET_NAME]/model_dir --batch_size=32 \
--iterations=8 --max_steps=4
$ python3 image_retraining_tpu.py --tpu [TPU_NAME] \
--use_tpu --use_compat --data_dir gs://[BUCKET_NAME]/data_dir \
--model_dir gs://[BUCKET_NAME]/model_dir --batch_size=32 \
--iterations=8 --max_steps=100
$ python3 image_retraining_tpu.py --tpu [TPU_NAME] \
--use_tpu --use_compat --data_dir gs://[BUCKET_NAME]/data_dir \
--model_dir gs://[BUCKET_NAME]/model_dir --batch_size=32 \
--iterations=8 --max_steps=500
$ python3 image_retraining_tpu.py --tpu [TPU_NAME] \
--use_tpu --use_compat --data_dir gs://[BUCKET_NAME]/data_dir \
--model_dir gs://[BUCKET_NAME]/model_dir --batch_size=32 \
--iterations=8 --max_steps=1000

Line 139

classifier.train(
    input_fn=lambda params: input_fn(
        mode=tf.estimator.ModeKeys.TRAIN,
        **params),
    max_steps=FLAGS.max_steps)

Log file

The error starts at line 230 of the attached output.log.

CC: @srjoglekar246 @vbardiovskyg

@srjoglekar246
Collaborator

This looks like a bug in TPUEstimator. As far as I understand this part of the docs, the Estimator API handles the OutOfRange error from the input function by stopping iteration rather than raising an exception. TPUEstimator doesn't seem to behave that way yet.
Can you open an issue on TF to cross-check?
Also, does the script work with the try...except block?
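
To make the two behaviours above concrete, here is a minimal plain-Python sketch. It is only an analogy, not TensorFlow code: StopIteration stands in for the OutOfRange / "End of sequence" error, and finite_batches, train_stop_on_exhaustion and train_propagate_exhaustion are hypothetical names that do not appear in image_retraining_tpu.py. The first loop treats an exhausted input as a normal stop signal, which is how the docs seem to describe Estimator. The second lets the error escape, which is roughly what the report above shows for TPUEstimator.

```python
from typing import Iterator


def finite_batches(num_batches: int) -> Iterator[int]:
    """Hypothetical stand-in for a non-repeating input pipeline."""
    return iter(range(num_batches))


def train_stop_on_exhaustion(batches: Iterator[int], max_steps: int) -> int:
    """Run up to max_steps steps; treat end of input as a clean stop."""
    steps_done = 0
    while steps_done < max_steps:
        try:
            next(batches)  # plays the role of IteratorGetNext
        except StopIteration:  # plays the role of the OutOfRange error
            break
        steps_done += 1
    return steps_done


def train_propagate_exhaustion(batches: Iterator[int], max_steps: int) -> int:
    """Run up to max_steps steps; let end of input surface as an error."""
    steps_done = 0
    while steps_done < max_steps:
        try:
            next(batches)
        except StopIteration as exc:
            raise RuntimeError("End of sequence") from exc
        steps_done += 1
    return steps_done


if __name__ == "__main__":
    # Five batches available, eight steps requested.
    print(train_stop_on_exhaustion(finite_batches(5), max_steps=8))
    try:
        train_propagate_exhaustion(finite_batches(5), max_steps=8)
    except RuntimeError as err:
        print("training crashed:", err)
```

In this sketch the first loop finishes early without an error, while the second fails as soon as the input runs out. Whether the real input pipeline here actually runs out of data is an assumption that the thread does not confirm.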

@captain-pool
Owner Author

captain-pool commented Jun 22, 2019

No, it doesn't. Oddly enough, the code doesn't stop running either. It keeps reporting that the TPU is healthy and tries to refresh the token, and it never exits, even though there is no more code to execute.
