Fix for GPU regression failure #2636

Merged
26 commits merged on Oct 3, 2023

Conversation

agunapal
Collaborator

@agunapal agunapal commented Sep 29, 2023

Description

This PR fixes the GPU regression runs by:

  • Fixing the DCGAN create_mar script
  • Skipping 2 tests in the sm_mme tests
  • Skipping 1 test in torch_compile

Fixes #(issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

Passing GPU regression run on a custom runner
Passing Docker regression run

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@codecov

codecov bot commented Sep 29, 2023

Codecov Report

Merging #2636 (49c4d73) into master (6c82e99) will not change coverage.
The diff coverage is n/a.

❗ Current head 49c4d73 differs from pull request most recent head 6e7b3da. Consider uploading reports for the commit 6e7b3da to get more accurate results

@@           Coverage Diff           @@
##           master    #2636   +/-   ##
=======================================
  Coverage   71.34%   71.34%           
=======================================
  Files          85       85           
  Lines        3905     3905           
  Branches       58       58           
=======================================
  Hits         2786     2786           
  Misses       1115     1115           
  Partials        4        4           

Member

@msaroufim msaroufim left a comment

Can we please merge the fix for DCGAN (or skip the test) ASAP, and also skip the OOM test? These failures have been blocking us from getting a useful signal on regression tests for weeks now.

@msaroufim msaroufim self-requested a review October 1, 2023 04:07

@agunapal agunapal changed the title from "(WIP)Fix for regression failure" to "Fix for regression failure" Oct 3, 2023
@agunapal
Collaborator Author

agunapal commented Oct 3, 2023

Hi @msaroufim, I've fixed the DCGAN test and skipped 2 tests in sm_mme.
The only remaining failure is in the torch_compile test. Can you please check?

@agunapal agunapal changed the title from "Fix for regression failure" to "(WIP)Fix for regression failure" Oct 3, 2023
@agunapal
Collaborator Author

agunapal commented Oct 3, 2023

The test passes locally but not on the runner, so I am skipping it.


test_torch_compile.py::TestTorchCompile::test_archive_model_artifacts PASSED                                                                                                       [ 20%]
test_torch_compile.py::TestTorchCompile::test_start_torchserve PASSED                                                                                                              [ 40%]
test_torch_compile.py::TestTorchCompile::test_server_status PASSED                                                                                                                 [ 60%]
test_torch_compile.py::TestTorchCompile::test_registered_model PASSED                                                                                                              [ 80%]
test_torch_compile.py::TestTorchCompile::test_serve_inference PASSED                                                                                                               [100%]

==================================================================================== warnings summary ====================================================================================
test_torch_compile.py:11
  /home/ubuntu/serve/test/pytest/test_torch_compile.py:11: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    from pkg_resources import packaging

../../../anaconda3/envs/torchserve/lib/python3.10/site-packages/pkg_resources/__init__.py:2871
../../../anaconda3/envs/torchserve/lib/python3.10/site-packages/pkg_resources/__init__.py:2871
  /home/ubuntu/anaconda3/envs/torchserve/lib/python3.10/site-packages/pkg_resources/__init__.py:2871: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('ruamel')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================================= 5 passed, 3 warnings in 29.98s =============================================================================
(torchserve) ubuntu@ip-172-31-11-40:~/serve/test/pytest$ pip list | grep -i torch
intel-extension-for-pytorch 2.0.100
torch                       2.1.0+cu121
torch-model-archiver        0.8.2b20230930
torch-workflow-archiver     0.2.10b20230930
torchaudio                  2.1.0+cu121
torchdata                   0.7.0
torchpippy                  0.1.1
torchserve                  0.8.2b20230930
torchtext                   0.16.0+cpu
torchvision                 0.16.0+cu121

Failing run
https://github.com/pytorch/serve/actions/runs/6385270530/job/17333639086?pr=2636
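
For a test that passes locally but fails on the runner, one option is to skip it only when running under CI instead of everywhere. The following is a minimal sketch of that idea, not the change made in this PR. It assumes the runner executes the tests inside a GitHub Actions job, where the GITHUB_ACTIONS environment variable is set to "true"; the helper name running_on_github_actions is hypothetical.

```python
import os
from typing import Mapping, Optional


def running_on_github_actions(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the process appears to run inside a GitHub Actions job.

    Assumption: GitHub Actions sets GITHUB_ACTIONS=true for every workflow
    step, including steps on self-hosted runners. `env` can be passed
    explicitly so the check is easy to exercise without touching os.environ.
    """
    env = os.environ if env is None else env
    return env.get("GITHUB_ACTIONS", "").lower() == "true"


if __name__ == "__main__":
    # Report what the helper sees in the current environment.
    print("on GitHub Actions:", running_on_github_actions())
```

With pytest, the result could feed a conditional skip such as pytest.mark.skipif(running_on_github_actions(), reason="fails on the GPU runner"), so the test would still run on a developer machine. The reason string here is illustrative.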

@agunapal agunapal changed the title (WIP)Fix for regression failure Fix for GPU regression failure Oct 3, 2023
@agunapal agunapal requested a review from lxning October 3, 2023 01:00
@msaroufim msaroufim added this pull request to the merge queue Oct 3, 2023
@msaroufim
Member

msaroufim commented Oct 3, 2023

The torch.compile test failure is this upstream issue in core: pytorch/pytorch#103417. The fix will involve updating some setup instructions for the runner, since the issue points to a problem with how the NVIDIA drivers are installed.
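
Since the comment above points at the driver installation on the runner, a quick sanity check before running GPU tests might look like the sketch below. This is an illustration only, not part of the PR and not the fix for the upstream issue. It assumes that a working driver install exposes the nvidia-smi command on PATH and that the command exits with status 0; the helper name driver_tool_responds is hypothetical.

```python
import shutil
import subprocess
from typing import Sequence


def driver_tool_responds(cmd: Sequence[str] = ("nvidia-smi",),
                         timeout: float = 30.0) -> bool:
    """Return True if `cmd` is found and exits with status 0.

    Assumption: a usable NVIDIA driver install ships `nvidia-smi`, and the
    tool exits non-zero (or is missing) when the driver is broken. This does
    not prove that torch.compile will work; it only rules out the most basic
    driver problems.
    """
    if shutil.which(cmd[0]) is None:
        return False  # tool is not installed or not on PATH
    try:
        result = subprocess.run(list(cmd), capture_output=True,
                                timeout=timeout, check=False)
    except (OSError, subprocess.TimeoutExpired):
        return False  # could not start the tool, or it hung
    return result.returncode == 0


if __name__ == "__main__":
    print("nvidia-smi responds:", driver_tool_responds())
```

A check like this could run as an early step on the runner so that a driver problem is reported as such instead of surfacing as a test failure.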

Merged via the queue into master with commit 1f1ab2b Oct 3, 2023
12 of 13 checks passed
@agunapal agunapal deleted the issues/fix_regression_failure branch October 3, 2023 21:39