
Update to Large Model and docs requirements #2468

Merged: 10 commits into pytorch:master on Jul 26, 2023

Conversation

sekyondaMeta
Contributor

Description

Updates to large model inference doc per Issue #2438

Adding requirements to requirements.txt to fix the incomplete docs build per issue #2048

Fixes #(issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

Docs built locally and they work:

[Screenshot: docs built locally, 2023-07-17 12:51 PM]

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

codecov bot commented Jul 17, 2023

Codecov Report

Merging #2468 (75bbb2c) into master (31b42e8) will not change coverage.
The diff coverage is n/a.

❗ Current head 75bbb2c differs from pull request most recent head e8dabf0. Consider uploading reports for the commit e8dabf0 to get more accurate results

@@           Coverage Diff           @@
##           master    #2468   +/-   ##
=======================================
  Coverage   72.66%   72.66%           
=======================================
  Files          78       78           
  Lines        3669     3669           
  Branches       58       58           
=======================================
  Hits         2666     2666           
  Misses        999      999           
  Partials        4        4           



To reduce model latency we recommend:
* Pre-installing a model parallel library such as DeepSpeed on the container or host.
* Pre-downloading the model checkpoints. For example, if using HuggingFace, a pretrained model can be pre-downloaded via [Download_model.py](https://github.com/pytorch/serve/blob/75f66dc557b3b67a3ab56536a37d7aa21582cc04/examples/large_models/deepspeed/opt/Readme.md?plain=1#L7) (a pre-download sketch follows below).
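For illustration, a minimal pre-download sketch, assuming the checkpoint is hosted on the HuggingFace Hub and `huggingface_hub` is installed; the model id and cache directory below are placeholders, not the values used in the linked example:

```python
# Hypothetical pre-download sketch (placeholder model id and cache directory).
# Running this ahead of time keeps checkpoint downloads off the serving critical path.
from huggingface_hub import snapshot_download

model_path = snapshot_download(
    repo_id="facebook/opt-350m",                      # placeholder model id
    revision="main",
    cache_dir="model_store/opt-350m",                 # pre-populated before serving
    allow_patterns=["*.json", "*.bin", "*.txt", "*.model"],  # skip unneeded artifacts
)
print(f"Checkpoint cached at {model_path}")
```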
Member

For the link, can we please link to the section instead of the line?

#### Tune torchrun parameters
Users can tune torchrun parameters in the [model config YAML file](https://github.com/pytorch/serve/blob/2f1f52f553e83703b5c380c2570a36708ee5cafa/model-archiver/README.md?plain=1#L179). The supported parameters are defined [here](https://github.com/pytorch/serve/blob/2f1f52f553e83703b5c380c2570a36708ee5cafa/frontend/archive/src/main/java/org/pytorch/serve/archive/model/ModelConfig.java#L329). For example, by default, `OMP_NUMBER_THREADS` is 1. It can be modified in the YAML file.
You can tune the model config YAML file to get better performance in the following ways:
* Update the [responseTimeout](https://github.com/pytorch/serve/blob/5ee02e4f050c9b349025d87405b246e970ee710b/docs/configuration.md?plain=1#L216) if high model loading or inference latency causes response timeout.
Member

long model loading?

Collaborator

yes

Users can tune torchrun parameters in the [model config YAML file](https://github.com/pytorch/serve/blob/2f1f52f553e83703b5c380c2570a36708ee5cafa/model-archiver/README.md?plain=1#L179). The supported parameters are defined [here](https://github.com/pytorch/serve/blob/2f1f52f553e83703b5c380c2570a36708ee5cafa/frontend/archive/src/main/java/org/pytorch/serve/archive/model/ModelConfig.java#L329). For example, by default, `OMP_NUMBER_THREADS` is 1. It can be modified in the YAML file.
You can tune the model config YAML file to get better performance in the following ways:
* Update the [responseTimeout](https://github.com/pytorch/serve/blob/5ee02e4f050c9b349025d87405b246e970ee710b/docs/configuration.md?plain=1#L216) if high model loading or inference latency causes response timeout.
* Tune the [torchrun parameters](https://github.com/pytorch/serve/blob/2f1f52f553e83703b5c380c2570a36708ee5cafa/model-archiver/README.md?plain=1#L179). The supported parameters are defined [here](https://github.com/pytorch/serve/blob/2f1f52f553e83703b5c380c2570a36708ee5cafa/frontend/archive/src/main/java/org/pytorch/serve/archive/model/ModelConfig.java#L329). For example, by default, `OMP_NUMBER_THREADS` is 1. This can be modified in the YAML file (see the config sketch below).
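To make this concrete, an illustrative model config YAML sketch (not taken from this PR): the keys follow the model-archiver README and ModelConfig.java linked above, and the values are placeholders.

```yaml
# Illustrative model-config.yaml sketch; values are placeholders, not defaults.
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
responseTimeout: 1200        # raise if model loading or inference is slow
deviceType: "gpu"
torchrun:
    nproc-per-node: 4        # number of worker processes torchrun launches
    OMP_NUMBER_THREADS: 2    # overrides the default of 1 mentioned above
```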
Member

make it clearer that torchrun is a job launcher

Contributor Author

Will do

return ["hello world "]
```
Client side receives the chunked data.
```python
Member

Add the import statements for test_utils here so this code can actually run

Contributor Author

Will do.

if type(data) is list:
    for i in range(3):
        send_intermediate_predict_response(["intermediate_response"], context.request_ids, "Intermediate Prediction success", 200, context)
    return ["hello world "]
Member

Is this line needed? It is confusing why intermediate responses and a handle response both exist.

if chunk:
    prediction.append(chunk.decode("utf-8"))

assert str(" ".join(prediction)) == "hello hello hello hello world "
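For reference, a minimal client-side sketch using plain `requests`; the URL, model name, and payload are placeholders, and the actual test in this PR drives TorchServe through the repo's test_utils helpers instead:

```python
# Hypothetical streaming-client sketch; endpoint, model name, and payload are
# placeholders for illustration only.
import requests

prediction = []
with requests.post(
    "http://localhost:8080/predictions/echo_stream",
    data="dummy input",
    stream=True,
) as response:
    # iter_content with chunk_size=None yields data in whatever chunks arrive.
    for chunk in response.iter_content(chunk_size=None):
        if chunk:
            prediction.append(chunk.decode("utf-8"))

print(" ".join(prediction))  # e.g. "hello hello hello hello world " for the echo example
```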
Member

This example is confusing; can we have a plausible response back from a real model? The echo_stream model doesn't make it clear why this feature is useful.


#### GRPC Server Side Streaming

TorchServe's [GRPC API](grpc_api.md) adds server-side streaming of the inference API "StreamPredictions", which allows a sequence of inference responses to be sent over the same GRPC stream. This API is only recommended for use cases where the inference latency of the full response is high and intermediate inference results are sent to the client. An example is LLMs for generative applications, where generating "n" tokens can have high latency; in this case the user can receive each generated token as soon as it is ready, until the full response completes. This API automatically forces the batchSize to be one.
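For orientation, a hedged client-side sketch of how such a stream could be consumed, assuming stubs generated from TorchServe's inference proto expose `StreamPredictions` as a server-streaming RPC on `InferenceAPIsServiceStub`; the model name and payload are placeholders:

```python
# Hypothetical gRPC streaming-client sketch. Assumes inference_pb2/inference_pb2_grpc
# were generated from TorchServe's inference proto and that StreamPredictions is a
# server-streaming RPC, per the paragraph above; model name and payload are placeholders.
import grpc

import inference_pb2
import inference_pb2_grpc


def stream_predict(model_name: str, payload: bytes, target: str = "localhost:7070"):
    with grpc.insecure_channel(target) as channel:
        stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)
        request = inference_pb2.PredictionsRequest(
            model_name=model_name, input={"data": payload}
        )
        # Each response in the stream carries one intermediate (or final) chunk.
        for response in stub.StreamPredictions(request):
            yield response.prediction.decode("utf-8")


if __name__ == "__main__":
    for chunk in stream_predict("echo_stream", b"dummy input"):
        print(chunk, end="")
```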
Member

Reading the justification here, I'm not sure I follow what value this adds on top of the HTTP response streaming work.

Contributor Author

I think the idea was that you may already be using (or have already decided to use) one instead of the other. That said, I do see what you mean; I will think of a better way to phrase this.

@msaroufim self-requested a review on July 23, 2023 18:05

@msaroufim merged commit 61f1c41 into pytorch:master on Jul 26, 2023
13 checks passed