
support inf2 neuronx transformer continuous batching #2803

Merged
merged 55 commits into master from feat/inf2_cb on Feb 27, 2024

Conversation

lxning
Collaborator

@lxning lxning commented Nov 21, 2023

Description

Please read our CONTRIBUTING.md prior to creating your first pull request.

Please include a summary of the feature or issue being fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes #(issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

  • Test

Run inference:

```bash
python examples/large_models/utils/test_llm_streaming_response.py -m llama-2-13b -o 50 -t 2 -n 4 --prompt-text "Today the weather is really nice and I am planning on " --prompt-randomize
```
Tasks are completed
payload={'prompt': 'Today the weather is really nice', 'max_new_tokens': 25}
, output=Today the weather is really nice. I am going to go out and enjoy it.
I am going to go to the park and play with my friends

payload={'prompt': 'Today the weather is really nice', 'max_new_tokens': 34}
, output=Today the weather is really nice. I am going to go out and enjoy it.
I am going to go to the park and play with my friends.
I am going to go to the

payload={'prompt': 'Today the weather is really nice and I am planning', 'max_new_tokens': 27}
, output=Today the weather is really nice and I am planning weather to go out and do some gardening. I have a lot of work to do in the garden. I have to weed

payload={'prompt': 'Today the', 'max_new_tokens': 23}
, output=Today the, the2018-2019 school year begins. We are excited to welcome our students

payload={'prompt': 'Today the weather is really', 'max_new_tokens': 25}
, output=Today the weather is reallyay bad. It is raining and the wind is blowing. I am not going to school today. I am going

payload={'prompt': 'Today the', 'max_new_tokens': 25}
, output=Today the, the2018-2019 school year begins. We are excited to welcome our students back to

payload={'prompt': 'Today', 'max_new_tokens': 43}
, output=Todaytix is the easiest way to buy tickets to the best shows in London. We’ve got tickets to all the top shows in London, from the West End to the Fringe, and everything in

payload={'prompt': 'Today the weather', 'max_new_tokens': 45}
, output=Today the weather, tomorrow the world.
The weather is a big deal in the UK. It’s a national obsession. It’s a topic of conversation that can be used to break the ice with strangers, to

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@lxning lxning changed the title [WIP] example inferentia2 llama2 continuous batching Example inferentia2 llama2 continuous batching Nov 22, 2023
@lxning lxning changed the title Example inferentia2 llama2 continuous batching [WIP]Example inferentia2 llama2 continuous batching Dec 12, 2023
Collaborator

@mreso mreso left a comment

Overall this looks good. I left a couple of comments. Please add at least an e2e test for this. We already have some example tests that you can refer to.

@@ -85,7 +82,7 @@ python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13
### Step 4: Package model artifacts

```bash
torch-model-archiver --model-name llama-2-13b --version 1.0 --handler inf2_handler.py -r requirements.txt --config-file model-config.yaml --archive-format no-archive
torch-model-archiver --model-name llama-2-13b --version 1.0 --handler /PATH/TO/inf2_handler.py -r requirements.txt --config-file /PATH/TO/model-config.yaml --archive-format no-archive
Collaborator

Better to set an env variable so users switch between the two choices in a single place and then just copy the commands.

"\n",
"# Install dependencies, now all commands run under serve dir\n",
"!cd serve\n",
"!git checkout feat/inf2_cb\n",
Collaborator

Is this still valid if merged into master?

Collaborator Author

I added a notice at the beginning of this cell: "This notebook demonstrates TorchServe continuous batching serving Llama-2-70b on Inferentia-2 inf2.48xlarge with DLAMI: Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20231226". Currently this notebook is needed for solutions architects (SA) to present the solution to customers.

"### Create model artifacts\n",
"\n",
"Note: run `mv model/models--meta-llama--Llama-2-70b-hf/snapshots/90052941a64de02075ca800b09fcea1bdaacb939/model.safetensors.index.json model/models--meta-llama--Llama-2-70b-hf/snapshots/90052941a64de02075ca800b09fcea1bdaacb939/model.safetensors.index.json.bkp`\n",
" if neuron sdk does not support safetensors"
Collaborator

"if" neuron sdk ....? On what does this depend?

Collaborator Author

@lxning lxning Jan 30, 2024

Neuron SDK support for the model safetensors format is still in beta.

model_class_name: "llama.model.LlamaForSampling"
tokenizer_class_name: "transformers.LlamaTokenizer"
amp: "bf16"
tp_degree: 24
Collaborator

Is this the minimal number of cores possible for this model size?

Collaborator Author

According to the inf2 team's guidance, set tp_degree to 24 on inf2.48xlarge and to 32 on trn1.32xlarge.

prefill_input_text, prefill_tokens, prefill_seq_ids, req_decode_seq_ids = inputs
results = {}
# Test if this is the beginning of a continuous batching
go_to_decode = True if len(req_decode_seq_ids) > 0 else False
Collaborator

Better to test this in 195 directly for clarity
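
For illustration, a minimal, self-contained sketch of the simplification being suggested (the names are taken from the handler snippet above; the referenced line 195 itself is not shown here):

```python
# Hypothetical standalone illustration; in the handler, req_decode_seq_ids
# comes from the batch the frontend sends in.
req_decode_seq_ids = [3, 7]

# Instead of the ternary
#   go_to_decode = True if len(req_decode_seq_ids) > 0 else False
# the list's truthiness can be tested directly where it is used:
if req_decode_seq_ids:
    print("decode step for an in-flight continuous batch")
else:
    print("prefill step: beginning of a new continuous batch")
```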

return prefill_input_text, prefill_tokens, prefill_seq_ids, req_decode_seq_ids

def inference(self, inputs):
prefill_input_text, prefill_tokens, prefill_seq_ids, req_decode_seq_ids = inputs
Collaborator

why not decode_seq_ids like prefill_seq_ids?

Collaborator Author

It is to highlight that these decode seq ids come from the frontend, by adding the prefix "req_". This is different from self.decode_seq_ids, which also includes the prefill seq ids.

z = torch.empty(x.shape[0], self.max_length, dtype=torch.int64)
for idx, item in enumerate(x):
pad = torch.zeros(self.max_length - len(x[idx]), dtype=torch.int)
z[idx] = torch.cat((x[idx], pad))
Collaborator

we should test if stacking the concatenated tensors is faster
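
As a rough sketch of the variant being suggested, assuming x is an iterable of 1-D int64 token tensors and self.max_length is the padded length (both taken from the snippet above); whether it is actually faster would need to be benchmarked:

```python
import torch


def pad_and_stack(x, max_length):
    # Right-pad each 1-D token tensor with zeros (keeping its dtype) and stack
    # the rows, instead of copying each padded row into a pre-allocated buffer.
    return torch.stack(
        [torch.cat((row, row.new_zeros(max_length - row.numel()))) for row in x]
    )


# Example with dummy token ids:
batch = [torch.tensor([1, 2, 3]), torch.tensor([4])]
print(pad_and_stack(batch, max_length=5).shape)  # torch.Size([2, 5])
```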

Collaborator Author

The pad is added along the same dimension. Why would stack be needed here?

)
else:
req_id = self._get_req_id(seq_id)
logger.warning(
Collaborator

In which cases can this occur?

Collaborator Author

This is to guard against scenarios in the frontend where the request might be deleted when the client disconnects.

logger = logging.getLogger(__name__)


class BaseNeuronXContinuousBatchingHandler(BaseHandler):
Collaborator

Better to at least write an e2e test for this handler. The e2e test should queue multiple requests and check for results to make sure ids are not mixed up. Better to test methods as well.
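
As a hedged sketch of the kind of check being asked for, not the test that was actually added: the endpoint URL, model name, and the assumption that each completion echoes its own prompt (as in the sample outputs above) are illustrative only.

```python
import concurrent.futures

import requests

# Assumed local TorchServe endpoint and model name for illustration.
PREDICTION_URL = "http://localhost:8080/predictions/llama-2-13b"


def send(prompt):
    # Stream the response and reassemble it so each prompt can be matched
    # against its own completion.
    with requests.post(
        PREDICTION_URL,
        json={"prompt": prompt, "max_new_tokens": 25},
        stream=True,
    ) as resp:
        resp.raise_for_status()
        text = b"".join(resp.iter_content(chunk_size=None)).decode("utf-8")
    return prompt, text


prompts = [f"Today the weather is really nice, request {i}" for i in range(8)]
with concurrent.futures.ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for prompt, text in pool.map(send, prompts):
        # If sequence ids were mixed up during continuous batching, a response
        # would start with (or contain) another request's prompt.
        assert text.startswith(prompt), (prompt, text)
```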

Collaborator Author

test_llm_streaming_response.py is a tool for manual e2e testing; it is a workaround for the inf2 environment dependency.

Collaborator Author

Or we can add test data in https://github.com/pytorch/serve/blob/master/test/postman/inference_stream_data.json once we have an inf2 regression-test CI job. Currently all of the inf1/inf2 examples use the nightly benchmark dashboard as the testing tool.

@lxning
Collaborator Author

lxning commented Jan 30, 2024

Overall this looks good. I left a couple of comments. Please add at least an e2e test for this. We already have some example tests that you can refer to.

This PR requires an inf2 environment to run an e2e test, and TorchServe only has an inf2 benchmark CI job. That's why I only posted the e2e test results that I ran manually on inf2.48xlarge. The notebook is an e2e test.

@mreso
Collaborator

mreso commented Feb 7, 2024

Thanks @lxning, it will be good to have an e2e test under test/pytest so we can repeat the test easily if necessary (e.g. when we make changes to the example). You can skip the test if no inf2 hardware is detected.
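
One possible way to express that skip, as a sketch; the device-path check is only an assumption about how Neuron cores are exposed on the host, and the test body is a placeholder rather than the test added in this PR.

```python
import os

import pytest

# Assumption: Neuron devices show up as /dev/neuron0, /dev/neuron1, ..., so the
# absence of /dev/neuron0 is treated as "no inf2 hardware on this host".
HAS_NEURON = os.path.exists("/dev/neuron0")


@pytest.mark.skipif(not HAS_NEURON, reason="Inferentia2 (Neuron) hardware not available")
def test_llama2_continuous_batching_e2e():
    # Placeholder body: the real test would archive the model, start TorchServe,
    # send several concurrent requests, and verify each response against its prompt.
    assert HAS_NEURON
```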

Collaborator

@mreso mreso left a comment

LGTM, left a couple of minor comments

### Step 8: Run inference

```bash
python test_stream_response.py
Collaborator

Path and filename have to be ../utils/test_llm_streaming_response.py

Collaborator Author

I keep the streamer section as in the original; updating the entire streamer section should be done in its own PR.

ts/handler_utils/utils.py (outdated; review thread resolved)
ts/handler_utils/utils.py (outdated; review thread resolved)
ts_scripts/spellcheck_conf/wordlist.txt (review thread resolved)
@lxning lxning added this pull request to the merge queue Feb 27, 2024
@chauhang chauhang added this to the v0.10.0 milestone Feb 27, 2024
Merged via the queue into master with commit 2818784 Feb 27, 2024
15 checks passed
@lxning lxning self-assigned this Feb 27, 2024
@lxning lxning added the enhancement (New feature or request) and example labels Feb 27, 2024
@lxning lxning added this to Done in v0.10.0 lifecycle Feb 27, 2024
@lxning lxning added the p0 (high priority) and documentation (Improvements or additions to documentation) labels Feb 27, 2024
muthuraj-i2i pushed a commit to muthuraj-i2i/serve that referenced this pull request Mar 1, 2024
* fmt
* fmt
* fmt
* add space
* fmt
* fmt
* fmt
* fmt
* fix regression test
* check key result
* fmt
* update folder
* fmt
* update key name
* add orjson
* update streamer
* add key text for streamer iterator
* update test_hf_batch_streamer output
* integrate split checkpoint in handler
* fmt
* fmt
* fmt
* fmt
* fmt
* fmt
* update notebook
* fmt
* add handler utils
* fix typo
* fmt
* fmt
* fmt
* fmt
* fmt
* Fix lint
* fix typo in notebook example
* enable authentication
* fmt
* fmt
* fmt
* update readme
* fix lint
* fmt
* update test data
* update test
* update test
* replace os.path with pathlib
* update test
* fmt