IPEX LLM serving example (#3068)
* adding the files for ipex int8 serving of llms

* Update README.md

Fixed some markdowns

* Fix handler name

* Adding default PyTorch support

* Fixing some issues with handler, added test to verify smooth-quant

* adding auto_mixed_precision flag to config

* Removing min_new_tokens from generation config

* fix lint

* lint

* lint

* Fixing unit tests with different model that doesn't require license

* Fix lint error

* Fix lint error in test

* Adding requirements.txt

* adding datasets to the requirements

* upgrading the ipex version to 2.3.0 to match that of pytorch

* Skipping ipex llm tests if accelerate is not present

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-51-123.us-west-2.compute.internal>
Co-authored-by: lxning <23464292+lxning@users.noreply.github.com>
Co-authored-by: lxning <lninga@amazon.com>
Co-authored-by: Matthias Reso <13337103+mreso@users.noreply.github.com>
5 people committed May 16, 2024
1 parent d9fbb19 commit 34bc370
Showing 11 changed files with 1,202 additions and 1 deletion.
53 changes: 53 additions & 0 deletions examples/large_models/ipex_llm_int8/README.md
@@ -0,0 +1,53 @@
# Serving IPEX Optimized Models
This example shows how to serve IPEX-optimized LLMs, e.g. ```meta-llama/llama2-7b-hf``` from Hugging Face. To set up the Python environment for this example, see: https://github.com/intel/intel-extension-for-pytorch/blob/main/examples/cpu/inference/python/llm/README.md#3-environment-setup


1. Run the model archiver
```
torch-model-archiver --model-name llama2-7b --version 1.0 --handler llm_handler.py --config-file llama2-7b-int8-woq-config.yaml --archive-format no-archive
```

2. Move the model inside model_store
```
mkdir model_store
mv llama2-7b ./model_store
```

3. Start TorchServe
```
torchserve --ncs --start --model-store model_store --models llama2-7b
```

4. Check the model status
```
curl http://localhost:8081/models/llama2-7b
```

5. Send an inference request
```
curl http://localhost:8080/predictions/llama2-7b -T ./sample_text_0.txt
```
## Model Config
In addition to the usual TorchServe configuration, you need to enable IPEX-specific optimization arguments.

To enable IPEX, set ```ipex_enable=true``` in the ```config.properties``` file. If it is not enabled, the model runs on default PyTorch, with ```auto_mixed_precision``` if that flag is set. To enable ```auto_mixed_precision```, set ```auto_mixed_precision: true``` in the model-config file.

You can choose either the weight-only quantization or the SmoothQuant path for quantizing the model to ```INT8```. If the ```quant_with_amp``` flag is set to ```true```, a mix of ```INT8``` and ```bfloat16``` precisions is used; otherwise an ```INT8``` and ```FP32``` combination is used. If neither quantization approach is enabled, the model runs in ```bfloat16``` precision by default as long as ```quant_with_amp``` or ```auto_mixed_precision``` is set to ```true```.

There are three example config files: ```model-config-llama2-7b-int8-sq.yaml``` for quantizing with SmoothQuant, ```model-config-llama2-7b-int8-woq.yaml``` for weight-only quantization, and ```model-config-llama2-7b-bf16.yaml``` for running text generation in bfloat16 precision.
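
For orientation, here is a minimal sketch of what the bfloat16 model-config could look like. Only ```auto_mixed_precision``` and ```greedy``` are keys described in this README; the worker settings, the model name, and the placement of the flags under a ```handler``` section are assumptions, so treat the shipped ```model-config-llama2-7b-bf16.yaml``` as the authoritative reference.

```
# Hypothetical sketch only -- see model-config-llama2-7b-bf16.yaml for the actual file.
minWorkers: 1
maxWorkers: 1
responseTimeout: 1500          # CPU generation can be slow; allow a generous timeout

handler:
    model_name: "meta-llama/Llama-2-7b-hf"   # assumed model id, for illustration only
    auto_mixed_precision: true               # run text generation in bfloat16
    greedy: false                            # false => beam search with beam size 4
```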

### IPEX Weight Only Quantization
* ```weight_type```: weight data type for weight-only quantization. Options: ```INT8``` or ```INT4```.
* ```lowp_mode```: low-precision mode for weight-only quantization; it sets the data type used for computation.
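
As an illustration, a weight-only quantization config could look roughly like the sketch below. ```quant_with_amp```, ```weight_type```, ```lowp_mode```, and ```greedy``` are the knobs described in this README; the worker settings, the model name, the value strings, and the ```handler``` nesting are assumptions, so check ```model-config-llama2-7b-int8-woq.yaml``` for the exact schema.

```
# Hypothetical sketch only -- see model-config-llama2-7b-int8-woq.yaml for the actual file.
minWorkers: 1
maxWorkers: 1
responseTimeout: 1500

handler:
    model_name: "meta-llama/Llama-2-7b-hf"   # assumed model id, for illustration only
    quant_with_amp: true        # mix INT8 with bfloat16 instead of INT8 with FP32
    weight_type: "INT8"         # weight data type: INT8 or INT4
    lowp_mode: "BF16"           # assumed value format; data type used for computation
    greedy: true                # greedy decoding instead of beam search
```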

### IPEX Smooth Quantization

* ```calibration_dataset``` and ```calibration_split```: the dataset and split used to calibrate the model quantization
* ```num_calibration_iters```: number of calibration iterations
* ```alpha```: a floating-point number between 0.0 and 1.0. For more complex SmoothQuant configurations, explore the IPEX quantization recipes (https://github.com/intel/intel-extension-for-pytorch/blob/main/examples/cpu/inference/python/llm/single_instance/run_quantization.py)
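
Similarly, a SmoothQuant config could look roughly like this sketch. The calibration and ```alpha``` keys follow the list above (assuming the underscore form ```calibration_split```); the dataset name, iteration count, worker settings, and ```handler``` nesting are assumptions, so refer to ```model-config-llama2-7b-int8-sq.yaml``` for the exact schema.

```
# Hypothetical sketch only -- see model-config-llama2-7b-int8-sq.yaml for the actual file.
minWorkers: 1
maxWorkers: 1
responseTimeout: 1500

handler:
    model_name: "meta-llama/Llama-2-7b-hf"      # assumed model id, for illustration only
    quant_with_amp: true                        # INT8 + bfloat16 instead of INT8 + FP32
    calibration_dataset: "NeelNanda/pile-10k"   # assumed dataset; any small text corpus works
    calibration_split: "train"
    num_calibration_iters: 32                   # assumed number of calibration iterations
    alpha: 0.9                                  # smoothing factor between 0.0 and 1.0
    greedy: false                               # beam search with beam size 4
```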

Set ```greedy``` to ```true``` to perform greedy-search decoding. If set to ```false```, beam search with a beam size of 4 is performed by default.
3 changes: 3 additions & 0 deletions examples/large_models/ipex_llm_int8/config.properties
@@ -0,0 +1,3 @@
ipex_enable=true
cpu_launcher_enable=true
cpu_launcher_args=--node_id 0
