
support inf2 neuronx transformer continuous batching #2803

Merged
merged 55 commits into master from feat/inf2_cb on Feb 27, 2024

Conversation

lxning
Collaborator

@lxning lxning commented Nov 21, 2023

Description

Please read our CONTRIBUTING.md prior to creating your first pull request.

Please include a summary of the feature or issue being fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes #(issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

  • Test

Run inference:

```bash
python examples/large_models/utils/test_llm_streaming_response.py -m llama-2-13b -o 50 -t 2 -n 4 --prompt-text "Today the weather is really nice and I am planning on " --prompt-randomize
```
Tasks are completed
payload={'prompt': 'Today the weather is really nice', 'max_new_tokens': 25}
, output=Today the weather is really nice. I am going to go out and enjoy it.
I am going to go to the park and play with my friends

payload={'prompt': 'Today the weather is really nice', 'max_new_tokens': 34}
, output=Today the weather is really nice. I am going to go out and enjoy it.
I am going to go to the park and play with my friends.
I am going to go to the

payload={'prompt': 'Today the weather is really nice and I am planning', 'max_new_tokens': 27}
, output=Today the weather is really nice and I am planning weather to go out and do some gardening. I have a lot of work to do in the garden. I have to weed

payload={'prompt': 'Today the', 'max_new_tokens': 23}
, output=Today the, the2018-2019 school year begins. We are excited to welcome our students

payload={'prompt': 'Today the weather is really', 'max_new_tokens': 25}
, output=Today the weather is reallyay bad. It is raining and the wind is blowing. I am not going to school today. I am going

payload={'prompt': 'Today the', 'max_new_tokens': 25}
, output=Today the, the2018-2019 school year begins. We are excited to welcome our students back to

payload={'prompt': 'Today', 'max_new_tokens': 43}
, output=Todaytix is the easiest way to buy tickets to the best shows in London. We’ve got tickets to all the top shows in London, from the West End to the Fringe, and everything in

payload={'prompt': 'Today the weather', 'max_new_tokens': 45}
, output=Today the weather, tomorrow the world.
The weather is a big deal in the UK. It’s a national obsession. It’s a topic of conversation that can be used to break the ice with strangers, to

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@lxning lxning changed the title [WIP] example inferentia2 llama2 continuous batching Example inferentia2 llama2 continuous batching Nov 22, 2023
@lxning lxning changed the title Example inferentia2 llama2 continuous batching [WIP]Example inferentia2 llama2 continuous batching Dec 12, 2023
Collaborator

@mreso mreso left a comment

Overall this looks good. I left a couple of comments. Please add at least an e2e test for this. We already have some example tests that you can refer to.

@@ -85,7 +82,7 @@ python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13
### Step 4: Package model artifacts

```bash
torch-model-archiver --model-name llama-2-13b --version 1.0 --handler inf2_handler.py -r requirements.txt --config-file model-config.yaml --archive-format no-archive
torch-model-archiver --model-name llama-2-13b --version 1.0 --handler /PATH/TO/inf2_handler.py -r requirements.txt --config-file /PATH/TO/model-config.yaml --archive-format no-archive
Collaborator

Better to set an env variable so users switch between the two choices in a single place and then just copy the commands.

"\n",
"# Install dependencies, now all commands run under serve dir\n",
"!cd serve\n",
"!git checkout feat/inf2_cb\n",
Collaborator

Is this still valid if merged into master?

Collaborator Author

I added a notice at the beginning of this cell: "This notebook demonstrates TorchServe continuous batching serving Llama-2-70b on Inferentia-2 inf2.48xlarge with DLAMI: Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20231226". Currently this notebook is needed for solutions architects (SA) to present the solution to customers.

"### Create model artifacts\n",
"\n",
"Note: run `mv model/models--meta-llama--Llama-2-70b-hf/snapshots/90052941a64de02075ca800b09fcea1bdaacb939/model.safetensors.index.json model/models--meta-llama--Llama-2-70b-hf/snapshots/90052941a64de02075ca800b09fcea1bdaacb939/model.safetensors.index.json.bkp`\n",
" if neuron sdk does not support safetensors"
Collaborator

"if" neuron sdk ....? On what does this depend?

Collaborator Author

@lxning lxning Jan 30, 2024

Neuron SDK support for the model safetensors format is still in beta.

model_class_name: "llama.model.LlamaForSampling"
tokenizer_class_name: "transformers.LlamaTokenizer"
amp: "bf16"
tp_degree: 24
Collaborator

Is this the minimal number of cores possible for this model size?

Collaborator Author

According to the inf2 team's guidance, set tp_degree to 24 on inf2.48xlarge and to 32 on trn1.32xlarge.

prefill_input_text, prefill_tokens, prefill_seq_ids, req_decode_seq_ids = inputs
results = {}
# Test if this is the beginning of a continuous batching
go_to_decode = True if len(req_decode_seq_ids) > 0 else False
Collaborator

Better to test this in 195 directly for clarity
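
For illustration, a minimal, self-contained sketch of the simplification being suggested (the names are taken from the handler snippet above; the referenced line 195 itself is not shown here):

```python
# Hypothetical standalone illustration; in the handler, req_decode_seq_ids
# comes from the batch the frontend sends in.
req_decode_seq_ids = [3, 7]

# Instead of the ternary
#   go_to_decode = True if len(req_decode_seq_ids) > 0 else False
# the list's truthiness can be tested directly where it is used:
if req_decode_seq_ids:
    print("decode step for an in-flight continuous batch")
else:
    print("prefill step: beginning of a new continuous batch")
```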

return prefill_input_text, prefill_tokens, prefill_seq_ids, req_decode_seq_ids

def inference(self, inputs):
prefill_input_text, prefill_tokens, prefill_seq_ids, req_decode_seq_ids = inputs
Collaborator

why not decode_seq_ids like prefill_seq_ids?

Collaborator Author

It is to highlight that these decode seq ids come from the frontend, by adding the prefix "req_". This is different from self.decode_seq_ids, which also includes the prefill seq ids.

z = torch.empty(x.shape[0], self.max_length, dtype=torch.int64)
for idx, item in enumerate(x):
pad = torch.zeros(self.max_length - len(x[idx]), dtype=torch.int)
z[idx] = torch.cat((x[idx], pad))
Collaborator

we should test if stacking the concatenated tensors is faster
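
As a rough sketch of the variant being suggested, assuming x is an iterable of 1-D int64 token tensors and self.max_length is the padded length (both taken from the snippet above); whether it is actually faster would need to be benchmarked:

```python
import torch


def pad_and_stack(x, max_length):
    # Right-pad each 1-D token tensor with zeros (keeping its dtype) and stack
    # the rows, instead of copying each padded row into a pre-allocated buffer.
    return torch.stack(
        [torch.cat((row, row.new_zeros(max_length - row.numel()))) for row in x]
    )


# Example with dummy token ids:
batch = [torch.tensor([1, 2, 3]), torch.tensor([4])]
print(pad_and_stack(batch, max_length=5).shape)  # torch.Size([2, 5])
```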

Collaborator Author

The pad is added along the same dimension. Why would stack be needed here?

)
else:
req_id = self._get_req_id(seq_id)
logger.warning(
Collaborator

In which cases can this occur?

Collaborator Author

This is to guard against scenarios in the frontend where the request might be deleted when the client disconnects.

logger = logging.getLogger(__name__)


class BaseNeuronXContinuousBatchingHandler(BaseHandler):
Collaborator

Better to at least write an e2e test for this handler. The e2e test should queue multiple requests and check for results to make sure ids are not mixed up. Better to test methods as well.
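
As a hedged sketch of the kind of check being asked for, not the test that was actually added: the endpoint URL, model name, and the assumption that each completion echoes its own prompt (as in the sample outputs above) are illustrative only.

```python
import concurrent.futures

import requests

# Assumed local TorchServe endpoint and model name for illustration.
PREDICTION_URL = "http://localhost:8080/predictions/llama-2-13b"


def send(prompt):
    # Stream the response and reassemble it so each prompt can be matched
    # against its own completion.
    with requests.post(
        PREDICTION_URL,
        json={"prompt": prompt, "max_new_tokens": 25},
        stream=True,
    ) as resp:
        resp.raise_for_status()
        text = b"".join(resp.iter_content(chunk_size=None)).decode("utf-8")
    return prompt, text


prompts = [f"Today the weather is really nice, request {i}" for i in range(8)]
with concurrent.futures.ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for prompt, text in pool.map(send, prompts):
        # If sequence ids were mixed up during continuous batching, a response
        # would start with (or contain) another request's prompt.
        assert text.startswith(prompt), (prompt, text)
```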

Collaborator Author

test_llm_streaming_response.py is a tool for manual e2e testing; it is a workaround for the inf2 environment dependency.

Collaborator Author

Or we can add test data in https://github.com/pytorch/serve/blob/master/test/postman/inference_stream_data.json once we have an inf2 regression-test CI job. Currently all of the inf1/inf2 examples use the nightly benchmark dashboard as the testing tool.

@lxning
Collaborator Author

lxning commented Jan 30, 2024

Overall this looks good. I left a couple of comments. Please add at least an e2e test for this. We already have some example tests that you can refer to.

This PR requires an inf2 environment to run an e2e test, and TorchServe only has an inf2 benchmark CI job. That's why I only posted the e2e test results that I ran manually on inf2.48xlarge. The notebook is an e2e test.

@mreso
Collaborator

mreso commented Feb 7, 2024

Thanks @lxning, it will be good to have an e2e test under test/pytest so we can repeat the test easily if necessary (e.g. when we make changes to the example). You can skip the test if no inf2 hardware is detected.
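
One possible way to express that skip, as a sketch; the device-path check is only an assumption about how Neuron cores are exposed on the host, and the test body is a placeholder rather than the test added in this PR.

```python
import os

import pytest

# Assumption: Neuron devices show up as /dev/neuron0, /dev/neuron1, ..., so the
# absence of /dev/neuron0 is treated as "no inf2 hardware on this host".
HAS_NEURON = os.path.exists("/dev/neuron0")


@pytest.mark.skipif(not HAS_NEURON, reason="Inferentia2 (Neuron) hardware not available")
def test_llama2_continuous_batching_e2e():
    # Placeholder body: the real test would archive the model, start TorchServe,
    # send several concurrent requests, and verify each response against its prompt.
    assert HAS_NEURON
```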

Collaborator

@mreso mreso left a comment

LGTM, left a couple of minor comments

### Step 8: Run inference

```bash
python test_stream_response.py
Collaborator

Path and filename have to be ../utils/test_llm_streaming_response.py

Collaborator Author

I keep the streamer section as in the original; updating the entire streamer section should be done in its own PR.

ts/handler_utils/utils.py (outdated; review thread resolved)
ts/handler_utils/utils.py (outdated; review thread resolved)
ts_scripts/spellcheck_conf/wordlist.txt (review thread resolved)
@lxning lxning added this pull request to the merge queue Feb 27, 2024
@chauhang chauhang added this to the v0.10.0 milestone Feb 27, 2024
Merged via the queue into master with commit 2818784 Feb 27, 2024
15 checks passed
@lxning lxning self-assigned this Feb 27, 2024
@lxning lxning added the enhancement (New feature or request) and example labels Feb 27, 2024
@lxning lxning added this to Done in v0.10.0 lifecycle Feb 27, 2024
@lxning lxning added the p0 (high priority) and documentation (Improvements or additions to documentation) labels Feb 27, 2024
muthuraj-i2i pushed a commit to muthuraj-i2i/serve that referenced this pull request Mar 1, 2024
* fmt
* fmt
* fmt
* add space
* fmt
* fmt
* fmt
* fmt
* fix regression test
* check key result
* fmt
* update folder
* fmt
* update key name
* add orjson
* update streamer
* add key text for streamer iterator
* update test_hf_batch_streamer output
* integrate split checkpoint in handler
* fmt
* fmt
* fmt
* fmt
* fmt
* fmt
* update notebook
* fmt
* add handler utils
* fix typo
* fmt
* fmt
* fmt
* fmt
* fmt
* Fix lint
* fix typo in notebook example
* enable authentication
* fmt
* fmt
* fmt
* update readme
* fix lint
* fmt
* update test data
* update test
* update test
* replace os.path with pathlib
* update test
* fmt