GPT fast example #2815

Merged
merged 10 commits into master on Dec 5, 2023

Conversation


@mreso mreso commented Dec 1, 2023

Description

This PR adds an example for gpt-fast

Fixes #(issue)
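
For reviewers who want to try the example locally, the rough flow is the usual TorchServe large-model workflow sketched below. This is a hedged sketch rather than the exact commands from the example README: the model name `gpt_fast`, the `model_store` directory, and the use of `handler.py`/`model_config.yaml` from `examples/large_models/gpt_fast/` are assumptions, and the model download and quantization steps are omitted.

```bash
# Package the example handler and its YAML config into a model archive
# (run from examples/large_models/gpt_fast/; paths/names are assumptions).
mkdir -p model_store
torch-model-archiver \
    --model-name gpt_fast \
    --version 1.0 \
    --handler handler.py \
    --config-file model_config.yaml \
    --archive-format no-archive \
    --export-path model_store

# Start TorchServe and register the archived model.
torchserve --start --ncs --model-store model_store --models gpt_fast
```

The pytest log below exercises essentially the same flow programmatically, registering the model through the management API with a single initial worker.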

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

  • pytest test/pytest/test_example_gpt_fast.py -k test_gpt_fast_mar -s
=========================================================================================================== test session starts ============================================================================================================
platform linux -- Python 3.10.13, pytest-7.3.1, pluggy-1.3.0
rootdir: /home/ubuntu/serve
plugins: mock-3.12.0, cov-4.1.0
collected 3 items / 2 deselected / 1 selected

test/pytest/test_example_gpt_fast.py TorchServe is not currently running.
TorchServe is not currently running.
['torchserve', '--start', '--model-store', '/tmp/pytest-of-ubuntu/pytest-98/work_dir0/model_store', '--no-config-snapshots']
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2023-12-02T05:02:53,312 [WARN ] main org.pytorch.serve.util.ConfigManager - Your torchserve instance can access any URL to load models. When deploying to production, make sure to limit the set of allowed_urls in config.properties
2023-12-02T05:02:53,314 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2023-12-02T05:02:53,366 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from /home/ubuntu/miniconda3/envs/gpt-fast/lib/python3.10/site-packages/ts/configs/metrics.yaml
2023-12-02T05:02:53,501 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.9.0
TS Home: /home/ubuntu/miniconda3/envs/gpt-fast/lib/python3.10/site-packages
Current directory: /home/ubuntu/serve
Temp directory: /tmp
Metrics config path: /home/ubuntu/miniconda3/envs/gpt-fast/lib/python3.10/site-packages/ts/configs/metrics.yaml
Number of GPUs: 1
Number of CPUs: 16
Max heap size: 15908 M
Python executable: /home/ubuntu/miniconda3/envs/gpt-fast/bin/python3.10
Config file: N/A
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: /tmp/pytest-of-ubuntu/pytest-98/work_dir0/model_store
Initial Models: N/A
Log dir: /home/ubuntu/serve/logs
Metrics dir: /home/ubuntu/serve/logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 1
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: false
Enable metrics API: true
Metrics mode: LOG
Disable system metrics: false
Workflow Store: /tmp/pytest-of-ubuntu/pytest-98/work_dir0/model_store
Model config: N/A
2023-12-02T05:02:53,507 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2023-12-02T05:02:53,526 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2023-12-02T05:02:53,580 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2023-12-02T05:02:53,580 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2023-12-02T05:02:53,581 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2023-12-02T05:02:53,581 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2023-12-02T05:02:53,582 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2023-12-02T05:02:53,839 [INFO ] epollEventLoopGroup-3-1 org.pytorch.serve.archive.model.ModelArchive - createTempDir /tmp/models/2a0ffd0db7f74765b5c6b2ba3c13b2dd
2023-12-02T05:02:53,841 [INFO ] epollEventLoopGroup-3-1 org.pytorch.serve.archive.model.ModelArchive - createSymbolicDir /tmp/models/2a0ffd0db7f74765b5c6b2ba3c13b2dd/gpt_fast_handler
2023-12-02T05:02:53,852 [DEBUG] epollEventLoopGroup-3-1 org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model gpt_fast_handler
2023-12-02T05:02:53,852 [DEBUG] epollEventLoopGroup-3-1 org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model gpt_fast_handler
2023-12-02T05:02:53,853 [INFO ] epollEventLoopGroup-3-1 org.pytorch.serve.wlm.ModelManager - Model gpt_fast_handler loaded.
2023-12-02T05:02:53,854 [DEBUG] epollEventLoopGroup-3-1 org.pytorch.serve.wlm.ModelManager - updateModel: gpt_fast_handler, count: 1
2023-12-02T05:02:53,876 [DEBUG] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/ubuntu/miniconda3/envs/gpt-fast/bin/python3.10, /home/ubuntu/miniconda3/envs/gpt-fast/lib/python3.10/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9000, --metrics-config, /home/ubuntu/miniconda3/envs/gpt-fast/lib/python3.10/site-packages/ts/configs/metrics.yaml]
2023-12-02T05:02:54,347 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493374
2023-12-02T05:02:54,350 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:19.849136352539062|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493374
2023-12-02T05:02:54,350 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:464.7647895812988|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493374
2023-12-02T05:02:54,350 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:95.9|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493374
2023-12-02T05:02:54,350 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:0.00868507903421921|#Level:Host,DeviceId:0|#hostname:ip-172-31-15-101,timestamp:1701493374
2023-12-02T05:02:54,351 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:2.0|#Level:Host,DeviceId:0|#hostname:ip-172-31-15-101,timestamp:1701493374
2023-12-02T05:02:54,351 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:0|#hostname:ip-172-31-15-101,timestamp:1701493374
2023-12-02T05:02:54,351 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:58461.4609375|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493374
2023-12-02T05:02:54,351 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:4439.125|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493374
2023-12-02T05:02:54,351 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:8.1|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493374
2023-12-02T05:02:55,804 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - s_name_part0=/tmp/.ts.sock, s_name_part1=9000, pid=868147
2023-12-02T05:02:55,805 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - Listening on port: /tmp/.ts.sock.9000
2023-12-02T05:02:55,812 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - Successfully loaded /home/ubuntu/miniconda3/envs/gpt-fast/lib/python3.10/site-packages/ts/configs/metrics.yaml.
2023-12-02T05:02:55,812 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - [PID]868147
2023-12-02T05:02:55,813 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - Torch worker started.
2023-12-02T05:02:55,813 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - Python runtime: 3.10.13
2023-12-02T05:02:55,813 [DEBUG] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-gpt_fast_handler_1.0 State change null -> WORKER_STARTED
2023-12-02T05:02:55,815 [INFO ] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9000
2023-12-02T05:02:55,820 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - Connection accepted: /tmp/.ts.sock.9000.
2023-12-02T05:02:55,823 [DEBUG] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD repeats 1 to backend at: 1701493375823
2023-12-02T05:02:55,825 [INFO ] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1701493375825
2023-12-02T05:02:55,837 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - model_name: gpt_fast_handler, batchSize: 1
2023-12-02T05:02:55,990 [WARN ] W-9000-gpt_fast_handler_1.0-stderr MODEL_LOG - /home/ubuntu/miniconda3/envs/gpt-fast/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
2023-12-02T05:02:55,990 [WARN ] W-9000-gpt_fast_handler_1.0-stderr MODEL_LOG -   _torch_pytree._register_pytree_node(
2023-12-02T05:02:57,048 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - Enabled tensor cores
2023-12-02T05:02:57,049 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - ONNX enabled
2023-12-02T05:02:57,049 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2023-12-02T05:02:57,050 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - Loading model ...
2023-12-02T05:03:01,948 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - Time to load model: 4.90 seconds
2023-12-02T05:03:01,961 [INFO ] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 6135
2023-12-02T05:03:01,962 [DEBUG] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-gpt_fast_handler_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2023-12-02T05:03:01,962 [INFO ] W-9000-gpt_fast_handler_1.0 TS_METRICS - WorkerLoadTime.Milliseconds:8103.0|#WorkerName:W-9000-gpt_fast_handler_1.0,Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493381
2023-12-02T05:03:01,962 [INFO ] W-9000-gpt_fast_handler_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:4.0|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493381
2023-12-02T05:03:01,968 [INFO ] epollEventLoopGroup-3-1 ACCESS_LOG - /127.0.0.1:39200 "POST /models?model_name=gpt_fast_handler&url=%2Ftmp%2Fpytest-of-ubuntu%2Fpytest-98%2Fwork_dir0%2Fmodel_store%2Fgpt_fast_handler&initial_workers=1&synchronous=true&batch_size=1 HTTP/1.1" 200 8136
2023-12-02T05:03:01,968 [INFO ] epollEventLoopGroup-3-1 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493381
2023-12-02T05:03:01,984 [INFO ] epollEventLoopGroup-3-2 TS_METRICS - ts_inference_requests_total.Count:1.0|#model_name:gpt_fast_handler,model_version:default|#hostname:ip-172-31-15-101,timestamp:1701493381
2023-12-02T05:03:01,984 [DEBUG] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd PREDICT repeats 1 to backend at: 1701493381984
2023-12-02T05:03:01,985 [INFO ] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1701493381985
2023-12-02T05:03:01,986 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - Backend received inference at: 1701493381
2023-12-02T05:03:03,050 [INFO ] W-9000-gpt_fast_handler_1.0 ACCESS_LOG - /127.0.0.1:44702 "POST /predictions/gpt_fast_handler HTTP/1.1" 200 1068
2023-12-02T05:03:03,051 [INFO ] W-9000-gpt_fast_handler_1.0 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493383
2023-12-02T05:03:04,965 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - Num tokens = 50
.2023-12-02T05:03:04,965 [INFO ] W-9000-gpt_fast_handler_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]HandlerTime.Milliseconds:2979.36|#ModelName:gpt_fast_handler,Level:Model|#type:GAUGE|#hostname:ip-172-31-15-101,1701493384,bc0a4b09-ccd8-46d8-a894-6a48f878f7b1, pattern=[METRICS]
2023-12-02T05:03:04,965 [INFO ] W-9000-gpt_fast_handler_1.0 TS_METRICS - ts_inference_latency_microseconds.Microseconds:2981114.159|#model_name:gpt_fast_handler,model_version:default|#hostname:ip-172-31-15-101,timestamp:1701493384
2023-12-02T05:03:04,966 [INFO ] W-9000-gpt_fast_handler_1.0 TS_METRICS - ts_queue_latency_microseconds.Microseconds:152.639|#model_name:gpt_fast_handler,model_version:default|#hostname:ip-172-31-15-101,timestamp:1701493384
2023-12-02T05:03:04,966 [DEBUG] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.job.RestJob - Waiting time ns: 152639, Backend time ns: 2981671148
2023-12-02T05:03:04,966 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_METRICS - HandlerTime.ms:2979.36|#ModelName:gpt_fast_handler,Level:Model|#hostname:ip-172-31-15-101,requestID:bc0a4b09-ccd8-46d8-a894-6a48f878f7b1,timestamp:1701493384
2023-12-02T05:03:04,966 [INFO ] W-9000-gpt_fast_handler_1.0 TS_METRICS - QueueTime.Milliseconds:0.0|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493384
2023-12-02T05:03:04,966 [INFO ] W-9000-gpt_fast_handler_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]PredictionTime.Milliseconds:2979.46|#ModelName:gpt_fast_handler,Level:Model|#type:GAUGE|#hostname:ip-172-31-15-101,1701493384,bc0a4b09-ccd8-46d8-a894-6a48f878f7b1, pattern=[METRICS]
2023-12-02T05:03:04,966 [INFO ] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 2977
2023-12-02T05:03:04,966 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_METRICS - PredictionTime.ms:2979.46|#ModelName:gpt_fast_handler,Level:Model|#hostname:ip-172-31-15-101,requestID:bc0a4b09-ccd8-46d8-a894-6a48f878f7b1,timestamp:1701493384
2023-12-02T05:03:04,967 [INFO ] W-9000-gpt_fast_handler_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:5.0|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493384
2023-12-02T05:03:04,974 [DEBUG] epollEventLoopGroup-3-3 org.pytorch.serve.wlm.ModelVersionedRefs - Removed model: gpt_fast_handler version: 1.0
2023-12-02T05:03:04,974 [DEBUG] epollEventLoopGroup-3-3 org.pytorch.serve.wlm.WorkerThread - W-9000-gpt_fast_handler_1.0 State change WORKER_MODEL_LOADED -> WORKER_SCALED_DOWN
2023-12-02T05:03:04,975 [WARN ] epollEventLoopGroup-3-3 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-gpt_fast_handler_1.0-stderr
2023-12-02T05:03:04,975 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - Frontend disconnected.
2023-12-02T05:03:04,975 [WARN ] epollEventLoopGroup-3-3 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-gpt_fast_handler_1.0-stdout
2023-12-02T05:03:04,975 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_SCALED_DOWN
2023-12-02T05:03:04,975 [DEBUG] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_SCALED_DOWN
2023-12-02T05:03:04,976 [DEBUG] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerThread - Shutting down the thread .. Scaling down.
2023-12-02T05:03:04,976 [DEBUG] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-gpt_fast_handler_1.0 State change WORKER_SCALED_DOWN -> WORKER_STOPPED
2023-12-02T05:03:04,976 [WARN ] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-gpt_fast_handler_1.0-stderr
2023-12-02T05:03:04,977 [WARN ] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-gpt_fast_handler_1.0-stdout
2023-12-02T05:03:04,977 [DEBUG] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerThread - Worker terminated due to scale-down call.
2023-12-02T05:03:04,996 [INFO ] epollEventLoopGroup-3-3 org.pytorch.serve.wlm.ModelManager - Model gpt_fast_handler unregistered.
2023-12-02T05:03:04,996 [INFO ] epollEventLoopGroup-3-3 ACCESS_LOG - /127.0.0.1:35060 "DELETE /models/gpt_fast_handler HTTP/1.1" 200 22
2023-12-02T05:03:04,997 [INFO ] epollEventLoopGroup-3-3 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493384


===================================================================================================== 1 passed, 2 deselected in 16.83s =====================================================================================================
  • Inference
2023-12-02T04:48:31,939 [INFO ] epollEventLoopGroup-3-4 TS_METRICS - ts_inference_requests_total.Count:1.0|#model_name:gpt-fast,model_version:default|#hostname:ip-172-31-15-101,timestamp:1701492511
2023-12-02T04:48:31,939 [DEBUG] W-9000-gpt-fast_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd PREDICT repeats 1 to backend at: 1701492511939
2023-12-02T04:48:31,940 [INFO ] W-9000-gpt-fast_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1701492511940
2023-12-02T04:48:31,940 [INFO ] W-9000-gpt-fast_1.0-stdout MODEL_LOG - Backend received inference at: 1701492511
2023-12-02T04:48:32,011 [INFO ] W-9000-gpt-fast_1.0 ACCESS_LOG - /127.0.0.1:53314 "PUT /predictions/gpt-fast HTTP/1.1" 200 72
2023-12-02T04:48:32,012 [INFO ] W-9000-gpt-fast_1.0 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701492512
2023-12-02T04:48:32,713 [INFO ] W-9000-gpt-fast_1.0-stdout MODEL_LOG - Num tokens = 50
2023-12-02T04:48:32,713 [INFO ] W-9000-gpt-fast_1.0 TS_METRICS - ts_inference_latency_microseconds.Microseconds:773600.846|#model_name:gpt-fast,model_version:default|#hostname:ip-172-31-15-101,timestamp:1701492512
2023-12-02T04:48:32,713 [INFO ] W-9000-gpt-fast_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]HandlerTime.Milliseconds:772.45|#ModelName:gpt-fast,Level:Model|#type:GAUGE|#hostname:ip-172-31-15-101,1701492512,ac2b9077-8456-4916-b7df-ed4644787aca, pattern=[METRICS]
2023-12-02T04:48:32,713 [INFO ] W-9000-gpt-fast_1.0 TS_METRICS - ts_queue_latency_microseconds.Microseconds:87.744|#model_name:gpt-fast,model_version:default|#hostname:ip-172-31-15-101,timestamp:1701492512
2023-12-02T04:48:32,713 [INFO ] W-9000-gpt-fast_1.0-stdout MODEL_METRICS - HandlerTime.ms:772.45|#ModelName:gpt-fast,Level:Model|#hostname:ip-172-31-15-101,requestID:ac2b9077-8456-4916-b7df-ed4644787aca,timestamp:1701492512
2023-12-02T04:48:32,713 [DEBUG] W-9000-gpt-fast_1.0 org.pytorch.serve.job.RestJob - Waiting time ns: 87744, Backend time ns: 773955376
2023-12-02T04:48:32,713 [INFO ] W-9000-gpt-fast_1.0 TS_METRICS - QueueTime.Milliseconds:0.0|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701492512
2023-12-02T04:48:32,714 [INFO ] W-9000-gpt-fast_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]PredictionTime.Milliseconds:772.56|#ModelName:gpt-fast,Level:Model|#type:GAUGE|#hostname:ip-172-31-15-101,1701492512,ac2b9077-8456-4916-b7df-ed4644787aca, pattern=[METRICS]
2023-12-02T04:48:32,714 [INFO ] W-9000-gpt-fast_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 770
2023-12-02T04:48:32,714 [INFO ] W-9000-gpt-fast_1.0-stdout MODEL_METRICS - PredictionTime.ms:772.56|#ModelName:gpt-fast,Level:Model|#hostname:ip-172-31-15-101,requestID:ac2b9077-8456-4916-b7df-ed4644787aca,timestamp:1701492512
2023-12-02T04:48:32,714 [INFO ] W-9000-gpt-fast_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:5.0|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701492512
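For reference, the PUT /predictions/gpt-fast call in the access log above can be reproduced with a plain HTTP request along these lines. The JSON field names are assumptions inferred from the "Num tokens = 50" log line; see the example README for the exact request format.

```bash
curl -X PUT http://127.0.0.1:8080/predictions/gpt-fast \
    -H "Content-Type: application/json" \
    -d '{"prompt": "The capital of France is", "max_new_tokens": 50}'
```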

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@mreso mreso changed the title from "(WIP) GPT fast example" to "GPT fast example" on Dec 2, 2023
@mreso mreso marked this pull request as ready for review December 2, 2023 05:05

@agunapal agunapal left a comment


LGTM overall.

  1. The model should point to the int8 model, since the README talks about doing the quantization.
  2. Should we add tokens/second in the handler, or maybe this can be done with a client-side script? (A rough client-side sketch follows below.)
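
On point 2, if the measurement ends up client-side, a minimal hypothetical script could look like the following (not part of this PR): the model name and request payload are assumptions, and splitting the response on whitespace is only a rough proxy for the real token count, which would need the model's tokenizer.

```python
# Hypothetical client-side tokens/second measurement for the gpt-fast example.
import time

import requests

URL = "http://127.0.0.1:8080/predictions/gpt-fast"  # assumed model name
payload = {"prompt": "The capital of France is", "max_new_tokens": 50}  # assumed fields

start = time.perf_counter()
response = requests.post(URL, json=payload)
elapsed = time.perf_counter() - start

response.raise_for_status()
num_tokens = len(response.text.split())  # rough proxy for generated token count
print(f"{num_tokens} tokens in {elapsed:.2f}s -> {num_tokens / elapsed:.1f} tokens/sec")
```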


@chauhang chauhang left a comment


@mreso Thanks for submitting this PR. I left a few review comments inline. It would be good to mention that the example has been tested on A10g, A100, and H100 GPUs.

Inline review threads (all resolved) were left on:

  • examples/large_models/gpt_fast/model_config.yaml
  • examples/large_models/gpt_fast/README.md
  • examples/large_models/gpt_fast/handler.py (three threads)

@agunapal agunapal left a comment


LGTM

@mreso mreso enabled auto-merge December 4, 2023 22:37
@mreso mreso requested a review from chauhang December 5, 2023 18:33

mreso commented Dec 5, 2023

> @mreso Thanks for submitting this PR. I left a few review comments inline. It would be good to mention that the example has been tested on A10g, A100, and H100 GPUs.

Thanks for reviewing, @chauhang. I've addressed your comments.

@mreso mreso added this pull request to the merge queue Dec 5, 2023
Merged via the queue into master with commit a1602ba Dec 5, 2023
13 checks passed
@chauhang chauhang added this to the v0.10.0 milestone Feb 27, 2024