
llama2 70b chat accelerate example #2494

Merged 16 commits into master on Aug 28, 2023
Conversation

@lxning (Collaborator) commented Jul 24, 2023

## Description

Please read our CONTRIBUTING.md prior to creating your first pull request.

Please include a summary of the feature or issue being fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes #(issue)

## Type of change

Please delete options that are not relevant.

- Bug fix (non-breaking change which fixes an issue)
- Breaking change (fix or feature that would cause existing functionality to not work as expected)
- New feature (non-breaking change which adds functionality)
- This change requires a documentation update

## Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

- [x] Test A

  ```bash
  torchserve --ncs --start --model-store model_store/ --models llama2-70b-chat --ts-config config.properties
  curl http://localhost:8080/predictions/llama2-70b-chat -T sample.txt
  ```

  Input: `how areyou`
  Response: `I'm fine,`

- [ ] Test B
Logs for Test B


## Checklist:

- [ ] Did you have fun?
- [ ] Have you added tests that prove your fix is effective or that this feature works?
- [ ] Has code been commented, particularly in hard-to-understand areas?
- [ ] Have you made corresponding changes to the documentation?

@codecov (bot) commented Jul 24, 2023

Codecov Report

Merging #2494 (d67037d) into master (683608b) will not change coverage.
The diff coverage is n/a.

❗ Current head d67037d differs from pull request most recent head 116b1a4. Consider uploading reports for the commit 116b1a4 to get more accurate results

```
@@           Coverage Diff           @@
##           master    #2494   +/-   ##
=======================================
  Coverage   72.64%   72.64%
=======================================
  Files          79       79
  Lines        3733     3733
  Branches       58       58
=======================================
  Hits         2712     2712
  Misses       1017     1017
  Partials        4        4
```


@lxning lxning self-assigned this Jul 24, 2023
@lxning lxning added the documentation and example labels Jul 24, 2023
@lxning lxning added this to the v0.9.0 milestone Jul 24, 2023
@lxning lxning changed the title from "[wip]llam2 accelerate example" to "llama2 70b chat accelerate example" Jul 24, 2023
```python
device_map="balanced",
low_cpu_mem_usage=True,
torch_dtype=torch.float16,
load_in_4bit=True,
```

Contributor:

We should run some tests comparing 4-bit vs. 8-bit quantization. For real-world deployment, 8-bit might give better results.
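
A minimal sketch of what that comparison could look like, assuming bitsandbytes is installed; the checkpoint name and helper are illustrative, not this PR's code:

```python
# Sketch only: the two quantized loading paths discussed above, loaded one
# at a time so they can be benchmarked on the same GPUs.
import torch
from transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-2-70b-chat-hf"  # assumed checkpoint for this example

def load_quantized(bits: int):
    """Load the model in 4-bit or 8-bit, mirroring the handler's kwargs."""
    quant_kwargs = {"load_in_4bit": True} if bits == 4 else {"load_in_8bit": True}
    return AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="balanced",
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16,
        trust_remote_code=True,
        **quant_kwargs,
    )

# Benchmark each variant separately, e.g. generation quality and latency:
# model = load_quantized(4)  # smaller memory footprint
# model = load_quantized(8)  # suggested here for real-world deployment
```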

```python
load_in_4bit=True,
trust_remote_code=True)
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.tokenizer.pad_token = self.tokenizer.eos_token
```

Contributor:

Please see the HF docs for the Llama 2 model for padding-token handling:

https://huggingface.co/docs/transformers/main/model_doc/llama2

Collaborator:

@lxning you can also follow this example for the pad-token settings; you would need to resize the token embeddings as well.
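
For reference, a small sketch of that suggestion (a dedicated pad token instead of reusing eos, then resizing the embeddings); the `[PAD]` literal is illustrative and this is not the merged code:

```python
# Sketch: dedicated pad token plus embedding resize, as suggested above.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)  # model_name as above
model = AutoModelForCausalLM.from_pretrained(model_name)

# Add a real pad token instead of aliasing it to eos, then grow the
# embedding matrix so the new token id has a row.
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
model.resize_token_embeddings(len(tokenizer))
```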


### Step 1: Download model permission

Follow [these instructions](https://huggingface.co/meta-llama/Llama-2-70b-hf) to get permission.
Contributor:

Suggest switching to the chat model instead, so that we can showcase prompt processing for the chat model.

@chauhang (Contributor) left a comment:

@lxning Thanks for getting this started. Please see llama-recipes for the prompt-handling part and update accordingly.
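
For context, the Llama 2 chat template that llama-recipes applies looks roughly like this; the helper name and example strings below are made up for illustration:

```python
# Sketch of the Llama 2 chat prompt format used by llama-recipes.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def format_chat_prompt(system_prompt: str, user_message: str) -> str:
    """Wrap a single-turn exchange in the Llama 2 chat template."""
    return f"{B_INST} {B_SYS}{system_prompt}{E_SYS}{user_message} {E_INST}"

prompt = format_chat_prompt("You are a helpful assistant.", "How are you")
# -> "[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\nHow are you [/INST]"
```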

```python
logger.info("Model %s loaded successfully", ctx.model_name)
self.initialized = True

def preprocess(self, requests):
```

Contributor:

Check the processing logic in llama-recipes

Contributor:

@HamidShojanazeri Please verify that the preprocessing logic is aligned with the Llama model's processing.
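
One way the preprocessing could line up with that format, as a hedged sketch reusing the `format_chat_prompt` helper sketched earlier (not the merged implementation):

```python
# Hypothetical preprocess aligned with the Llama 2 chat format: decode each
# request body, wrap it in the chat template, and tokenize as a padded batch.
def preprocess(self, requests):
    input_texts = []
    for req in requests:
        data = req.get("data") or req.get("body")
        if isinstance(data, (bytes, bytearray)):
            data = data.decode("utf-8")
        input_texts.append(format_chat_prompt("You are a helpful assistant.", data))
    return self.tokenizer(input_texts, padding=True, return_tensors="pt")
```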

```diff
@@ -0,0 +1 @@
+How are you
```
Contributor:

For the chat example, use a sample similar to chats.json.
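
For illustration, a sample in the spirit of chats.json might look like the following; the exact schema is an assumption based on the llama-recipes chat format (a list of dialogs, each a list of role/content turns):

```json
[
  [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How are you"}
  ]
]
```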

```python
inferences = self.tokenizer.batch_decode(
    outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
```

Collaborator:

@lxning shall we add response streaming as well? It sounds like a good place to show it.
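
As a rough sketch of what that could look like once streaming lands, using transformers' TextIteratorStreamer with TorchServe's send_intermediate_predict_response; the details here are assumptions, not this PR's code:

```python
# Sketch of streaming generation inside the handler; assumes the TorchServe
# response-streaming API discussed in this thread, and that the handler
# saved its context as self.context during initialize().
from threading import Thread

from transformers import TextIteratorStreamer
from ts.protocol.otf_message_handler import send_intermediate_predict_response

def inference(self, input_batch):
    streamer = TextIteratorStreamer(self.tokenizer, skip_special_tokens=True)
    generation_kwargs = dict(input_batch, streamer=streamer, max_new_tokens=256)
    # Run generate() on a worker thread so we can consume tokens as they arrive.
    Thread(target=self.model.generate, kwargs=generation_kwargs).start()

    full_text = ""
    for chunk in streamer:
        # Flush each decoded chunk back to the client as it is produced.
        send_intermediate_predict_response(
            [chunk], self.context.request_ids, "Intermediate response", 200, self.context
        )
        full_text += chunk
    return [full_text]
```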

Collaborator (Author):

We can update this after response streaming is merged.

@agunapal (Collaborator) left a comment:

Verified to be working.

```python
torch_dtype=torch.float16,
load_in_8bit=True,
trust_remote_code=True)
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
```

@agunapal agunapal dismissed chauhang's stale review on August 28, 2023 at 19:01:

Proceeding to unblock the branch cut. Suggested changes will be addressed in a subsequent PR.

@chauhang (Contributor) left a comment:

Approving; please address the outstanding issues in a follow-up PR.

@agunapal (Collaborator) commented:

Regression test failing because of not being able to download from the FAIR GAN zoo.

@agunapal agunapal merged commit 04e0b37 into master Aug 28, 2023
9 of 13 checks passed
Labels: documentation (Improvements or additions to documentation), example
Milestone: v0.9.0
Projects: v0.9.0 lifecycle (Awaiting triage)
Linked issues: none yet
Participants: 5