
Integrate vllm with example Lora and Mistral #3077

Merged: 17 commits merged into master on May 3, 2024

Conversation

@lxning (Collaborator) commented Apr 7, 2024

Description

This PR adds vLLM-backed examples for LoRA and Mistral under examples/large_models/vllm, along with a streaming-response test utility (test_llm_streaming_response.py) and a pytest example test (test_example_vllm.py).

Fixes #(issue)

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

  • Test Lora (a direct-request sketch follows the pytest results below)
cd serve/examples/large_models/vllm/lora
cat prompt.json
{
  "prompt": "A robot may not injure a human being",
  "max_new_tokens": 50,
  "temperature": 0.8,
  "logprobs": 1,
  "prompt_logprobs": 1,
  "max_tokens": 128,
  "adapter": "adapter_1"
}
python ../../utils/test_llm_streaming_response.py -m lora -o 50 -t 1 -n 1 --prompt-text "@prompt.json" --prompt-json
Tasks are completed
payload={'prompt': 'A robot may not injure a human being', 'max_new_tokens': 50, 'temperature': 0.8, 'logprobs': 1, 'prompt_logprobs': 1, 'max_tokens': 128, 'adapter': 'adapter_1'}
, output= or, through inaction, allow a human being to come to harm
  • Test Mistral
cd serve/examples/large_models/vllm/mistral
cat prompt.json
{
  "prompt": "A robot may not injure a human being",
  "max_new_tokens": 50,
  "temperature": 0.8,
  "logprobs": 1,
  "prompt_logprobs": 1,
  "max_tokens": 128
}
python ../../utils/test_llm_streaming_response.py -m mistral -o 50 -t 1 -n 1 --prompt-text "@prompt.json" --prompt-json
Tasks are completed
payload={'prompt': 'A robot may not injure a human being', 'max_new_tokens': 50, 'temperature': 0.8, 'logprobs': 1, 'prompt_logprobs': 1, 'max_tokens': 128}
, output= or, through inaction, allow a human being to come to harm.
  • Test pytest
pytest test_example_vllm.py::test_vllm_lora_mar
======================= test session starts ========================
platform linux -- Python 3.10.14, pytest-7.3.1, pluggy-1.4.0
rootdir: /home/ubuntu/serve
plugins: cov-4.1.0, mock-3.12.0
collected 1 item

test_example_vllm.py .                                       [100%]

================== 1 passed in 141.14s (0:02:21) ===================
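For reference, here is a minimal sketch of what the manual tests above amount to: POSTing the JSON prompt to TorchServe's inference API and streaming the generated text back. It assumes TorchServe is running locally on the default inference port (8080) with the example registered as `lora` (use `mistral` and drop the `adapter` field for the second example); the exact response framing depends on the handler, and the bundled test_llm_streaming_response.py utility remains the canonical way to run these checks.

```python
# Sketch only: send the example prompt.json payload to a locally running
# TorchServe instance and print the streamed chunks as they arrive.
import json

import requests

payload = {
    "prompt": "A robot may not injure a human being",
    "max_new_tokens": 50,
    "temperature": 0.8,
    "logprobs": 1,
    "prompt_logprobs": 1,
    "max_tokens": 128,
    "adapter": "adapter_1",  # omit for the mistral example
}

# /predictions/{model_name} is TorchServe's standard inference endpoint;
# the model name "lora" matches the example registration assumed above.
with requests.post(
    "http://localhost:8080/predictions/lora",
    data=json.dumps(payload),
    stream=True,
) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=None):
        print(chunk.decode("utf-8"), end="", flush=True)
```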

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@lxning lxning self-assigned this Apr 7, 2024
@lxning lxning changed the title from "[WIP] Lora example" to "integrate vllm with example Lora and Mistral" Apr 30, 2024
@lxning lxning changed the title from "integrate vllm with example Lora and Mistral" to "Integrate vllm with example Lora and Mistral" Apr 30, 2024
@lxning lxning requested review from agunapal and mreso April 30, 2024 19:45
@lxning lxning added the documentation (Improvements or additions to documentation) and enhancement (New feature or request) labels Apr 30, 2024
@lxning lxning added this to the v0.10.1 milestone Apr 30, 2024
@lxning lxning added this to In Review in v0.11.0 lifecycle Apr 30, 2024
@mreso (Collaborator) left a comment:
LGTM

"--prompt-json",
action=argparse.BooleanOptionalAction,
default=False,
help="Flag the imput prompt is a json format with prompt parameters",
Collaborator comment on the diff above:

Typo: "imput" should be "input" (see the corrected flag definition below).
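A minimal, self-contained sketch of the corrected flag; the option name and BooleanOptionalAction come from the snippet above, while the surrounding parser setup in test_llm_streaming_response.py is not reproduced here.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--prompt-json",
    action=argparse.BooleanOptionalAction,  # adds --prompt-json / --no-prompt-json
    default=False,
    help="Treat the input prompt as JSON containing prompt parameters",
)

# Example: "@prompt.json" is then parsed as a JSON payload rather than raw text.
args = parser.parse_args(["--prompt-json"])
print(args.prompt_json)  # True
```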

enable_lora: true
max_loras: 4
max_cpu_loras: 4
max_num_seqs: 16
Collaborator comment on the diff above:

vLLM uses paged attention, which typically allows for larger batch sizes. We need to figure out a way to saturate the engine, since setting batchSize == max_num_seqs will lead to underutilization.

Collaborator comment:

We could use a strategy similar to micro-batching so the engine always has enough requests available. Preferred would be an async mode that routes all requests straight to the backend and receives replies asynchronously (as discussed earlier); a sketch of that idea follows.
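A conceptual sketch of that async mode, not the handler in this PR: each incoming request is handed to vLLM's AsyncLLMEngine as soon as it arrives, so the engine's continuous batching stays saturated instead of being capped at a fixed frontend batch size. The model name and limits are illustrative, and the exact vLLM API may differ between versions.

```python
import asyncio

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams


async def handle_request(engine: AsyncLLMEngine, request_id: str, prompt: str) -> str:
    """Route a single request to the engine and collect its streamed output."""
    params = SamplingParams(temperature=0.8, max_tokens=128)
    text = ""
    async for output in engine.generate(prompt, params, request_id):
        text = output.outputs[0].text  # cumulative generation so far
    return text


async def main() -> None:
    # Illustrative model and limits; enable_lora/max_loras would mirror the YAML above.
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="mistralai/Mistral-7B-v0.1", max_num_seqs=16)
    )
    # Keep many requests in flight at once; vLLM's scheduler batches them internally.
    prompts = [f"prompt {i}" for i in range(32)]
    results = await asyncio.gather(
        *(handle_request(engine, str(i), p) for i, p in enumerate(prompts))
    )
    print(results[0])


if __name__ == "__main__":
    asyncio.run(main())
```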

)

model_archiver.generate_model_archive(config)
shutil.move(LORA_SRC_PATH / "model", mar_file_path)
Collaborator comment on the diff above:

If we move the files and then delete them, we can run the test only once before we need to put them back manually. Can we use symbolic links instead? (A sketch follows below.)
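A hypothetical alternative to the shutil.move above, following the reviewer's suggestion: expose the generated model directory via a symbolic link so the source tree is left untouched and the test can be re-run. The paths mirror the snippet above but are assumptions, not the test's actual fixtures.

```python
from pathlib import Path


def link_model_archive(src_dir: Path, mar_file_path: Path) -> None:
    """Point mar_file_path at the generated model dir instead of moving it."""
    # Remove a stale link from a previous run (broken symlinks report exists() == False,
    # so check is_symlink() as well).
    if mar_file_path.is_symlink() or mar_file_path.exists():
        mar_file_path.unlink()
    mar_file_path.symlink_to(src_dir.resolve(), target_is_directory=True)


# e.g. link_model_archive(LORA_SRC_PATH / "model", mar_file_path)
```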

@mreso mreso enabled auto-merge May 3, 2024 05:59
@mreso mreso added this pull request to the merge queue May 3, 2024
Merged via the queue into master with commit f2c26f3 May 3, 2024
9 of 12 checks passed
Labels: documentation (Improvements or additions to documentation), enhancement (New feature or request)
Linked issues: none yet
Participants: 3