Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor PT2 code changes #2222

Merged
merged 27 commits into from
Apr 19, 2023
Merged

Refactor PT2 code changes #2222

merged 27 commits into from
Apr 19, 2023

Conversation

msaroufim
Copy link
Member

@msaroufim msaroufim commented Apr 6, 2023

Description

Note: the link check lint keeps timing out so i might ignore it for now

Now that PT 2.0 was official released there's a few updates we needed to make

  1. Gate behavior behind checks for PT 2.0 version before using compile
  2. Refactor a lot of our optimization utils: ipex, onnx, pt2 into standalone functions - I can't quite put them in seperate files since imports need to happen in the same file in python
  3. Using the new config.yaml and deprecated the compile.json - I updated a test util to also leverage it
  4. Added a few tests - something I'm not too happy with is there seems to be some code duplication in our pytest folder with many ways of running a test, a mock handler can't take in extra files and I don't have bandwidth right now to fix all these issues
  5. The yaml config object is not guaranteed to have the config but should check for it in model initialization
  6. I check in logs if a compilation actually occurs
  7. Added a few misc warnings and info calls that are useful
  8. I checked that ONNX tests still pass, if we're merging all the distributed PRs this week I'd like to just add the ONNX dependency as well

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

  • Tested in CI

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@msaroufim
Copy link
Member Author

@morgandu FYI this is how we're gonna do the configs moving forward

@codecov
Copy link

codecov bot commented Apr 13, 2023

Codecov Report

Merging #2222 (4922e66) into master (2fa042b) will increase coverage by 0.05%.
The diff coverage is 40.81%.

❗ Current head 4922e66 differs from pull request most recent head b3b48dd. Consider uploading reports for the commit b3b48dd to get more accurate results

@@            Coverage Diff             @@
##           master    #2222      +/-   ##
==========================================
+ Coverage   71.41%   71.47%   +0.05%     
==========================================
  Files          73       73              
  Lines        3348     3341       -7     
  Branches       57       57              
==========================================
- Hits         2391     2388       -3     
+ Misses        954      950       -4     
  Partials        3        3              
Impacted Files Coverage Δ
ts/utils/util.py 72.60% <20.00%> (+0.80%) ⬆️
ts/torch_handler/base_handler.py 55.50% <43.18%> (+0.52%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@msaroufim
Copy link
Member Author

Chatted with @morgandu offline the Kokoro TPU failure is expected since we're making a breaking change and getting rid of the compile.json file

ts/torch_handler/base_handler.py Outdated Show resolved Hide resolved
ts/torch_handler/base_handler.py Show resolved Hide resolved
ts/torch_handler/base_handler.py Outdated Show resolved Hide resolved


@pytest.mark.skipif(PT_2_AVAILABLE == False, reason="torch version is < 2.0.0")
class TestTorchCompile:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the functions such as test_start_torchserve, test_server_status, test_registered_model, and test_serve_inference can be generalized to be shared with the other test cases. This work can be done later.


`{"pt2" : "inductor"}`
`pt2: "inductor"`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this pr cover the feature enable/disable torch.compile flag?

Copy link
Member Author

@msaroufim msaroufim Apr 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No we discussed it verbally a few weeks ago to make that happen we'd need to be able to load a new config dynamically and not just package it with archiver. Right now users will see logs that compilation failed, they will fallback to non compiled model. Compilation will only be attempted if pt2 flag shows up in yaml so default behavior is unchanged

Copy link
Collaborator

@agunapal agunapal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise

@msaroufim
Copy link
Member Author

GitHub is acting up in CI. Whoever sees this tomorrow please merge

@msaroufim
Copy link
Member Author

msaroufim commented Apr 19, 2023

@agunapal does the docker 3.10 CI failure look legit to you here? https://github.com/pytorch/serve/actions/runs/4739570657/jobs/8414522301?pr=2222

@agunapal
Copy link
Collaborator

@msaroufim the 3.8 failure seems like a network issue. I am not sure about the 3.10 one. I guess we need to add a retry action for this . This is the first time i have seen it fail

@msaroufim msaroufim merged commit cf7544b into master Apr 19, 2023
@msaroufim msaroufim deleted the msaroufim/pt2changes branch April 19, 2023 05:32
@msaroufim
Copy link
Member Author

@agunapal btw regarding network failures I've noticed that if you make lots of network queries in a row even in the same github action then things start to fail,, I wonder if github intentionall has some rate limiting to avoid spam

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants