
Changes to support torch._export.aot_compile #2832

Merged · 25 commits · Dec 21, 2023

Conversation

@agunapal (Collaborator) commented Dec 5, 2023

Description

This PR

  • Adds support for torch._export.aot_compile
  • Includes a ResNet18 example using max-autotune and dynamic_shapes (a sketch of the export flow follows below)
  • Tested output equivalence across multiple loads
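
A minimal sketch of the export flow, for reference (assumptions: the exact script in this PR may differ; this relies on the `torch._export.aot_compile` and `torch.export.Dim` APIs from the PyTorch 2.2 nightlies, and the output filename is hypothetical):

```python
import torch
from torch.export import Dim
from torchvision.models import resnet18

model = resnet18(weights=None).eval().to("cuda")
example_inputs = (torch.randn(2, 3, 224, 224, device="cuda"),)

# Dynamic batch dimension (see the dynamic_shapes discussion further down).
batch = Dim("batch", min=2, max=32)

# aot_compile lowers the exported graph through AOT Inductor and returns the
# path of the generated .so, which is what gets archived for serving.
so_path = torch._export.aot_compile(
    model,
    example_inputs,
    dynamic_shapes={"x": {0: batch}},
    options={
        "aot_inductor.output_path": "resnet18_pt2.so",  # hypothetical output name
        "max_autotune": True,
    },
)
```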

Comparison with torch.compile


| Model | Mode | Model loading (ms) | First inference time (ms) |
| --- | --- | --- | --- |
| ResNet 18 | compile + max-autotune | 4111 | 25918 |
| VGG 16 | compile + max-autotune | 5284 | 64960 |
| ResNet 18 | torch._export.aot_compile + max-autotune | 15876 | 2704 |
| VGG 16 | torch._export.aot_compile + max-autotune | 16042 | 2794 |

Fixes #(issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

```
pytest -v test_torch_export.py
================================================================================== test session starts ===================================================================================
platform linux -- Python 3.10.13, pytest-7.3.1, pluggy-1.0.0 -- /home/ubuntu/anaconda3/envs/ts_sam_Dec1/bin/python
cachedir: .pytest_cache
rootdir: /home/ubuntu/serve
plugins: cov-4.1.0, mock-3.12.0
collected 1 item                                                                                                                                                                         

test_torch_export.py::test_torch_export_aot_compile PASSED                                                                                                                         [100%]

==================================================================================== warnings summary ====================================================================================
test_torch_export.py:4
  /home/ubuntu/serve/test/pytest/test_torch_export.py:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    from pkg_resources import packaging

../../../anaconda3/envs/ts_sam_Dec1/lib/python3.10/site-packages/pkg_resources/__init__.py:2868
  /home/ubuntu/anaconda3/envs/ts_sam_Dec1/lib/python3.10/site-packages/pkg_resources/__init__.py:2868: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

../../../anaconda3/envs/ts_sam_Dec1/lib/python3.10/site-packages/pkg_resources/__init__.py:2868
  /home/ubuntu/anaconda3/envs/ts_sam_Dec1/lib/python3.10/site-packages/pkg_resources/__init__.py:2868: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google.logging')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

../../../anaconda3/envs/ts_sam_Dec1/lib/python3.10/site-packages/pkg_resources/__init__.py:2348
  /home/ubuntu/anaconda3/envs/ts_sam_Dec1/lib/python3.10/site-packages/pkg_resources/__init__.py:2348: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(parent)

../../../anaconda3/envs/ts_sam_Dec1/lib/python3.10/site-packages/pkg_resources/__init__.py:2868
../../../anaconda3/envs/ts_sam_Dec1/lib/python3.10/site-packages/pkg_resources/__init__.py:2868
  /home/ubuntu/anaconda3/envs/ts_sam_Dec1/lib/python3.10/site-packages/pkg_resources/__init__.py:2868: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('ruamel')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

../../../anaconda3/envs/ts_sam_Dec1/lib/python3.10/site-packages/transformers/utils/generic.py:441
  /home/ubuntu/anaconda3/envs/ts_sam_Dec1/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
    _torch_pytree._register_pytree_node(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================================= 1 passed, 7 warnings in 45.19s =============================================================================
```

Checking equivalence when loading multiple times

Used the following code

```python
import pickle

import torch

FILE1 = "test_load.pkl"
FILE2 = "test_load1.pkl"

print("Reading output 1")
with open(FILE1, "rb") as f:
    output1 = pickle.load(f)

print("Reading output 2")
with open(FILE2, "rb") as f:
    output2 = pickle.load(f)

print("Checking if output1 == output2", torch.equal(output1, output2))
```

results in

```
python test_equivalance.py
Reading output 1
Reading output 2
Checking if output1 == output2 True
```

  • Model loading logs:
```
2023-12-13T20:03:15,259 [INFO ] W-9000-res18-pt2_1.0-stdout MODEL_LOG - torch._export is an experimental feature! Succesfully loaded torch exported model.
2023-12-13T20:03:15,259 [INFO ] W-9000-res18-pt2_1.0-stdout MODEL_LOG - export is not a supported backend
2023-12-13T20:03:15,264 [INFO ] W-9000-res18-pt2_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 17378
2023-12-13T20:03:15,264 [DEBUG] W-9000-res18-pt2_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-res18-pt2_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2023-12-13T20:03:15,265 [INFO ] W-9000-res18-pt2_1.0 TS_METRICS - WorkerLoadTime.Milliseconds:19913.0|#WorkerName:W-9000-res18-pt2_1.0,Level:Host|#hostname:ip-172-31-11-40,timestamp:1702497795
2023-12-13T20:03:15,265 [INFO ] W-9000-res18-pt2_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:3.0|#Level:Host|#hostname:ip-172-31-11-40,timestamp:1702497795
```
  • Model loaded on 4 GPUs (screenshot: 2023-12-18 at 3:29 PM)

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@agunapal changed the title from "(WIP) Changes to support torch._export.aot_compile" to "Changes to support torch._export.aot_compile" on Dec 12, 2023
@chauhang (Contributor):

@ankithagunapal Thanks for adding this support. Why is the model load taking so much longer with AOT compile?

@agunapal (Collaborator, Author):

> @ankithagunapal Thanks for adding this support. Why is the model load taking so much longer with AOT compile?

@chauhang Yet to run the profiler on the loading of the .so file. Will check and follow up.

Review threads on examples/pt2/README.md (outdated, resolved):

Install PyTorch 2.2 nightlies by running
```
chmod +x install_pytorch_nightlies.sh
```
Member:
We don't need this script; install_dependencies.py has a nightly flag.

The model is saved with a `.so` extension.
Here we export the model with torch.export and compile it ahead of time with AOT Inductor in `max_autotune` mode.
This also makes use of `dynamic_shapes` to support batch sizes from 1 to 32.
In the code, the minimum batch_size is set to 2 instead of 1. You can find an explanation for this [here](https://pytorch.org/docs/main/export.html#expressing-dynamism).
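For context on that last point (a hedged aside; the linked "Expressing Dynamism" docs are the authoritative explanation): torch.export specializes dimensions of size 0 and 1, so a dimension that is meant to be dynamic is declared with a minimum of at least 2, along the lines of:

```python
from torch.export import Dim

# Sizes 0 and 1 are specialized by torch.export, so the dynamic batch
# dimension is declared with min=2 even though batch size 1 is common
# for low-latency serving.
batch_dim = Dim("batch", min=2, max=32)
```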
Member:
Can you bring in the core idea? Batch size 1 is what people choose if they want low latency.

Collaborator (Author):
I read the doc, but I don't understand it, tbh.

```
@@ -151,6 +155,12 @@ def initialize(self, context):
self.map_location = "cpu"
self.device = torch.device(self.map_location)

TORCH_EXPORT_AVAILABLE = False
```
Member:
Set these global variables inside the packaging if condition instead.

Collaborator (Author):
I looked at the logic again. This seems like the best place.

```
@@ -53,6 +53,10 @@
)
PT2_AVAILABLE = False

if packaging.version.parse(torch.__version__) > packaging.version.parse("2.1.1"):
PT220_AVAILABLE = True
```
Member:
Does this mean PyTorch 2.2? Can we call this PT2_2-0? That should be easier to read.

@agunapal (Collaborator, Author), Dec 19, 2023:
Python doesn't allow `-` in identifiers, so PT2_2-0 doesn't work.

```
@@ -180,6 +190,13 @@ def initialize(self, context):
self.model = setup_ort_session(self.model_pt_path, self.map_location)
logger.info("Succesfully setup ort session")

elif self.model_pt_path.endswith(".so") and TORCH_EXPORT_AVAILABLE:
```
Member:
The archiver docstrings need an update too, to make it clear it also supports a .so file.

Collaborator (Author):
Not sure which one you are specifically referring to? This doesn't talk about the format of the file:

```
usage: torch-model-archiver [-h] --model-name MODEL_NAME [--serialized-file SERIALIZED_FILE] [--model-file MODEL_FILE] --handler HANDLER [--extra-files EXTRA_FILES]
                            [--runtime {python,python3}] [--export-path EXPORT_PATH] [--archive-format {tgz,no-archive,zip-store,default}] [-f] -v VERSION [-r REQUIREMENTS_FILE]
                            [-c CONFIG_FILE]
torch-model-archiver: error: the following arguments are required: --model-name, --handler, -v/--version
```

@msaroufim (Member) left a comment:
Looks mostly good; a few minor things left to do. Also, it's still not super clear how the timing measurements were obtained; I'd add some log.info statements.

agunapal and others added 3 commits on December 13, 2023, each co-authored by Mark Saroufim <marksaroufim@fb.com>.
@agunapal (Collaborator, Author):
> Looks mostly good; a few minor things left to do. Also, it's still not super clear how the timing measurements were obtained; I'd add some log.info statements.

@msaroufim Thanks. I have addressed most of the feedback. However, I'm wondering how we should address support for CPU. Does aot_compile on CPU make sense?

@msaroufim (Member):
CPU support is a completely valid scenario; Inductor codegens native C++ that the Intel team has been optimizing.

```
if hasattr(self, "model_yaml_config") and "pt2" in self.model_yaml_config:
pt2_value = self.model_yaml_config["pt2"]
if pt2_value == "export" and PT220_AVAILABLE:
USE_TORCH_EXPORT = True
```
Member:
why do you need this bool flag?

```
elif self.model_pt_path.endswith(".so") and USE_TORCH_EXPORT:
# Set cuda stream to the gpu_id of the backend worker
if torch.cuda.is_available() and properties.get("gpu_id") is not None:
torch.cuda.set_stream(torch.cuda.Stream(int(properties.get("gpu_id"))))
```
Member:
Could you add a comment here? Why are we launching things on specific streams?

@msaroufim (Member) left a comment:
Stamping to unblock, but this needs a few more things before merge:

  1. This should indeed work with CPU, so add a test
  2. Get rid of the install-nightlies script and instead use the good old install_dependencies.py script
  3. Add a comment on the cuda streams section
  4. You don't need this flag USE_TORCH_EXPORT

@agunapal (Collaborator, Author):
> Stamping to unblock, but this needs a few more things before merge:
>
> 1. This should indeed work with CPU, so add a test
> 2. Get rid of the install-nightlies script and instead use the good old install_dependencies.py script
> 3. Add a comment on the cuda streams section
> 4. You don't need this flag USE_TORCH_EXPORT

  1. Tested on CPU. The pytest runs on both CPU and GPU. There is a limitation on batch_size on CPU; this is mentioned in the script.
  2. I have updated the instructions to include both. Most users of TorchServe already have the dependencies installed, and the torch version doesn't get overwritten; it needs to be uninstalled and then installed again. Hence the script is needed.
  3. Good catch. Turns out this was not needed; we just need torch.cuda.set_device (a sketch follows below).
  4. Refactored the code.
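
For item 3, a minimal sketch of the simpler device pinning (hypothetical snippet; `properties` stands in for the worker properties the handler receives, and the actual handler code may differ):

```python
import torch

properties = {"gpu_id": 0}  # hypothetical; supplied by the TorchServe worker context

if torch.cuda.is_available() and properties.get("gpu_id") is not None:
    # Pin this worker to its assigned GPU; no explicit stream handling is needed.
    torch.cuda.set_device(int(properties.get("gpu_id")))
```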

Also, I changed the config to the following (in case other options are added under export in the future):

```yaml
pt2:
  export:
    aot_compile: true
```
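
A minimal sketch of how a handler might check this nested config (hypothetical parsing code; the actual handler implementation may differ):

```python
def wants_aot_compile(model_yaml_config: dict) -> bool:
    # model_yaml_config is the parsed model-config YAML available to the handler.
    pt2_config = model_yaml_config.get("pt2", {})
    if not isinstance(pt2_config, dict):
        return False
    export_config = pt2_config.get("export", {})
    return isinstance(export_config, dict) and bool(export_config.get("aot_compile", False))
```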

@agunapal added this pull request to the merge queue on Dec 21, 2023.
Merged via the queue into master with commit 426b4f7 on Dec 21, 2023. 13 checks passed.
@chauhang added this to the v0.10.0 milestone on Feb 27, 2024.