GPT fast example #2815

Merged
merged 10 commits into master on Dec 5, 2023

Conversation


@mreso mreso commented Dec 1, 2023

Description

This PR adds an example for gpt-fast

Fixes #(issue)
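
For reviewers who want to try the example locally, the rough flow is the usual TorchServe large-model workflow sketched below. This is a hedged sketch rather than the exact commands from the example README: the model name `gpt_fast`, the `model_store` directory, and the use of `handler.py`/`model_config.yaml` from `examples/large_models/gpt_fast/` are assumptions, and the model download and quantization steps are omitted.

```bash
# Package the example handler and its YAML config into a model archive
# (run from examples/large_models/gpt_fast/; paths/names are assumptions).
mkdir -p model_store
torch-model-archiver \
    --model-name gpt_fast \
    --version 1.0 \
    --handler handler.py \
    --config-file model_config.yaml \
    --archive-format no-archive \
    --export-path model_store

# Start TorchServe and register the archived model.
torchserve --start --ncs --model-store model_store --models gpt_fast
```

The pytest log below exercises essentially the same flow programmatically, registering the model through the management API with a single initial worker.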

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

  • pytest test/pytest/test_example_gpt_fast.py -k test_gpt_fast_mar -s
=========================================================================================================== test session starts ============================================================================================================
platform linux -- Python 3.10.13, pytest-7.3.1, pluggy-1.3.0
rootdir: /home/ubuntu/serve
plugins: mock-3.12.0, cov-4.1.0
collected 3 items / 2 deselected / 1 selected

test/pytest/test_example_gpt_fast.py TorchServe is not currently running.
TorchServe is not currently running.
['torchserve', '--start', '--model-store', '/tmp/pytest-of-ubuntu/pytest-98/work_dir0/model_store', '--no-config-snapshots']
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2023-12-02T05:02:53,312 [WARN ] main org.pytorch.serve.util.ConfigManager - Your torchserve instance can access any URL to load models. When deploying to production, make sure to limit the set of allowed_urls in config.properties
2023-12-02T05:02:53,314 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2023-12-02T05:02:53,366 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from /home/ubuntu/miniconda3/envs/gpt-fast/lib/python3.10/site-packages/ts/configs/metrics.yaml
2023-12-02T05:02:53,501 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.9.0
TS Home: /home/ubuntu/miniconda3/envs/gpt-fast/lib/python3.10/site-packages
Current directory: /home/ubuntu/serve
Temp directory: /tmp
Metrics config path: /home/ubuntu/miniconda3/envs/gpt-fast/lib/python3.10/site-packages/ts/configs/metrics.yaml
Number of GPUs: 1
Number of CPUs: 16
Max heap size: 15908 M
Python executable: /home/ubuntu/miniconda3/envs/gpt-fast/bin/python3.10
Config file: N/A
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: /tmp/pytest-of-ubuntu/pytest-98/work_dir0/model_store
Initial Models: N/A
Log dir: /home/ubuntu/serve/logs
Metrics dir: /home/ubuntu/serve/logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 1
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: false
Enable metrics API: true
Metrics mode: LOG
Disable system metrics: false
Workflow Store: /tmp/pytest-of-ubuntu/pytest-98/work_dir0/model_store
Model config: N/A
2023-12-02T05:02:53,507 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2023-12-02T05:02:53,526 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2023-12-02T05:02:53,580 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2023-12-02T05:02:53,580 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2023-12-02T05:02:53,581 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2023-12-02T05:02:53,581 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2023-12-02T05:02:53,582 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2023-12-02T05:02:53,839 [INFO ] epollEventLoopGroup-3-1 org.pytorch.serve.archive.model.ModelArchive - createTempDir /tmp/models/2a0ffd0db7f74765b5c6b2ba3c13b2dd
2023-12-02T05:02:53,841 [INFO ] epollEventLoopGroup-3-1 org.pytorch.serve.archive.model.ModelArchive - createSymbolicDir /tmp/models/2a0ffd0db7f74765b5c6b2ba3c13b2dd/gpt_fast_handler
2023-12-02T05:02:53,852 [DEBUG] epollEventLoopGroup-3-1 org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model gpt_fast_handler
2023-12-02T05:02:53,852 [DEBUG] epollEventLoopGroup-3-1 org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model gpt_fast_handler
2023-12-02T05:02:53,853 [INFO ] epollEventLoopGroup-3-1 org.pytorch.serve.wlm.ModelManager - Model gpt_fast_handler loaded.
2023-12-02T05:02:53,854 [DEBUG] epollEventLoopGroup-3-1 org.pytorch.serve.wlm.ModelManager - updateModel: gpt_fast_handler, count: 1
2023-12-02T05:02:53,876 [DEBUG] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/ubuntu/miniconda3/envs/gpt-fast/bin/python3.10, /home/ubuntu/miniconda3/envs/gpt-fast/lib/python3.10/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9000, --metrics-config, /home/ubuntu/miniconda3/envs/gpt-fast/lib/python3.10/site-packages/ts/configs/metrics.yaml]
2023-12-02T05:02:54,347 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493374
2023-12-02T05:02:54,350 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:19.849136352539062|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493374
2023-12-02T05:02:54,350 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:464.7647895812988|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493374
2023-12-02T05:02:54,350 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:95.9|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493374
2023-12-02T05:02:54,350 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:0.00868507903421921|#Level:Host,DeviceId:0|#hostname:ip-172-31-15-101,timestamp:1701493374
2023-12-02T05:02:54,351 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:2.0|#Level:Host,DeviceId:0|#hostname:ip-172-31-15-101,timestamp:1701493374
2023-12-02T05:02:54,351 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:0|#hostname:ip-172-31-15-101,timestamp:1701493374
2023-12-02T05:02:54,351 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:58461.4609375|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493374
2023-12-02T05:02:54,351 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:4439.125|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493374
2023-12-02T05:02:54,351 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:8.1|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493374
2023-12-02T05:02:55,804 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - s_name_part0=/tmp/.ts.sock, s_name_part1=9000, pid=868147
2023-12-02T05:02:55,805 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - Listening on port: /tmp/.ts.sock.9000
2023-12-02T05:02:55,812 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - Successfully loaded /home/ubuntu/miniconda3/envs/gpt-fast/lib/python3.10/site-packages/ts/configs/metrics.yaml.
2023-12-02T05:02:55,812 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - [PID]868147
2023-12-02T05:02:55,813 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - Torch worker started.
2023-12-02T05:02:55,813 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - Python runtime: 3.10.13
2023-12-02T05:02:55,813 [DEBUG] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-gpt_fast_handler_1.0 State change null -> WORKER_STARTED
2023-12-02T05:02:55,815 [INFO ] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9000
2023-12-02T05:02:55,820 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - Connection accepted: /tmp/.ts.sock.9000.
2023-12-02T05:02:55,823 [DEBUG] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD repeats 1 to backend at: 1701493375823
2023-12-02T05:02:55,825 [INFO ] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1701493375825
2023-12-02T05:02:55,837 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - model_name: gpt_fast_handler, batchSize: 1
2023-12-02T05:02:55,990 [WARN ] W-9000-gpt_fast_handler_1.0-stderr MODEL_LOG - /home/ubuntu/miniconda3/envs/gpt-fast/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
2023-12-02T05:02:55,990 [WARN ] W-9000-gpt_fast_handler_1.0-stderr MODEL_LOG -   _torch_pytree._register_pytree_node(
2023-12-02T05:02:57,048 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - Enabled tensor cores
2023-12-02T05:02:57,049 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - ONNX enabled
2023-12-02T05:02:57,049 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2023-12-02T05:02:57,050 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - Loading model ...
2023-12-02T05:03:01,948 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - Time to load model: 4.90 seconds
2023-12-02T05:03:01,961 [INFO ] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 6135
2023-12-02T05:03:01,962 [DEBUG] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-gpt_fast_handler_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2023-12-02T05:03:01,962 [INFO ] W-9000-gpt_fast_handler_1.0 TS_METRICS - WorkerLoadTime.Milliseconds:8103.0|#WorkerName:W-9000-gpt_fast_handler_1.0,Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493381
2023-12-02T05:03:01,962 [INFO ] W-9000-gpt_fast_handler_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:4.0|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493381
2023-12-02T05:03:01,968 [INFO ] epollEventLoopGroup-3-1 ACCESS_LOG - /127.0.0.1:39200 "POST /models?model_name=gpt_fast_handler&url=%2Ftmp%2Fpytest-of-ubuntu%2Fpytest-98%2Fwork_dir0%2Fmodel_store%2Fgpt_fast_handler&initial_workers=1&synchronous=true&batch_size=1 HTTP/1.1" 200 8136
2023-12-02T05:03:01,968 [INFO ] epollEventLoopGroup-3-1 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493381
2023-12-02T05:03:01,984 [INFO ] epollEventLoopGroup-3-2 TS_METRICS - ts_inference_requests_total.Count:1.0|#model_name:gpt_fast_handler,model_version:default|#hostname:ip-172-31-15-101,timestamp:1701493381
2023-12-02T05:03:01,984 [DEBUG] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd PREDICT repeats 1 to backend at: 1701493381984
2023-12-02T05:03:01,985 [INFO ] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1701493381985
2023-12-02T05:03:01,986 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - Backend received inference at: 1701493381
2023-12-02T05:03:03,050 [INFO ] W-9000-gpt_fast_handler_1.0 ACCESS_LOG - /127.0.0.1:44702 "POST /predictions/gpt_fast_handler HTTP/1.1" 200 1068
2023-12-02T05:03:03,051 [INFO ] W-9000-gpt_fast_handler_1.0 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493383
2023-12-02T05:03:04,965 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - Num tokens = 50
.2023-12-02T05:03:04,965 [INFO ] W-9000-gpt_fast_handler_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]HandlerTime.Milliseconds:2979.36|#ModelName:gpt_fast_handler,Level:Model|#type:GAUGE|#hostname:ip-172-31-15-101,1701493384,bc0a4b09-ccd8-46d8-a894-6a48f878f7b1, pattern=[METRICS]
2023-12-02T05:03:04,965 [INFO ] W-9000-gpt_fast_handler_1.0 TS_METRICS - ts_inference_latency_microseconds.Microseconds:2981114.159|#model_name:gpt_fast_handler,model_version:default|#hostname:ip-172-31-15-101,timestamp:1701493384
2023-12-02T05:03:04,966 [INFO ] W-9000-gpt_fast_handler_1.0 TS_METRICS - ts_queue_latency_microseconds.Microseconds:152.639|#model_name:gpt_fast_handler,model_version:default|#hostname:ip-172-31-15-101,timestamp:1701493384
2023-12-02T05:03:04,966 [DEBUG] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.job.RestJob - Waiting time ns: 152639, Backend time ns: 2981671148
2023-12-02T05:03:04,966 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_METRICS - HandlerTime.ms:2979.36|#ModelName:gpt_fast_handler,Level:Model|#hostname:ip-172-31-15-101,requestID:bc0a4b09-ccd8-46d8-a894-6a48f878f7b1,timestamp:1701493384
2023-12-02T05:03:04,966 [INFO ] W-9000-gpt_fast_handler_1.0 TS_METRICS - QueueTime.Milliseconds:0.0|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493384
2023-12-02T05:03:04,966 [INFO ] W-9000-gpt_fast_handler_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]PredictionTime.Milliseconds:2979.46|#ModelName:gpt_fast_handler,Level:Model|#type:GAUGE|#hostname:ip-172-31-15-101,1701493384,bc0a4b09-ccd8-46d8-a894-6a48f878f7b1, pattern=[METRICS]
2023-12-02T05:03:04,966 [INFO ] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 2977
2023-12-02T05:03:04,966 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_METRICS - PredictionTime.ms:2979.46|#ModelName:gpt_fast_handler,Level:Model|#hostname:ip-172-31-15-101,requestID:bc0a4b09-ccd8-46d8-a894-6a48f878f7b1,timestamp:1701493384
2023-12-02T05:03:04,967 [INFO ] W-9000-gpt_fast_handler_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:5.0|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493384
2023-12-02T05:03:04,974 [DEBUG] epollEventLoopGroup-3-3 org.pytorch.serve.wlm.ModelVersionedRefs - Removed model: gpt_fast_handler version: 1.0
2023-12-02T05:03:04,974 [DEBUG] epollEventLoopGroup-3-3 org.pytorch.serve.wlm.WorkerThread - W-9000-gpt_fast_handler_1.0 State change WORKER_MODEL_LOADED -> WORKER_SCALED_DOWN
2023-12-02T05:03:04,975 [WARN ] epollEventLoopGroup-3-3 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-gpt_fast_handler_1.0-stderr
2023-12-02T05:03:04,975 [INFO ] W-9000-gpt_fast_handler_1.0-stdout MODEL_LOG - Frontend disconnected.
2023-12-02T05:03:04,975 [WARN ] epollEventLoopGroup-3-3 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-gpt_fast_handler_1.0-stdout
2023-12-02T05:03:04,975 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_SCALED_DOWN
2023-12-02T05:03:04,975 [DEBUG] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_SCALED_DOWN
2023-12-02T05:03:04,976 [DEBUG] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerThread - Shutting down the thread .. Scaling down.
2023-12-02T05:03:04,976 [DEBUG] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-gpt_fast_handler_1.0 State change WORKER_SCALED_DOWN -> WORKER_STOPPED
2023-12-02T05:03:04,976 [WARN ] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-gpt_fast_handler_1.0-stderr
2023-12-02T05:03:04,977 [WARN ] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-gpt_fast_handler_1.0-stdout
2023-12-02T05:03:04,977 [DEBUG] W-9000-gpt_fast_handler_1.0 org.pytorch.serve.wlm.WorkerThread - Worker terminated due to scale-down call.
2023-12-02T05:03:04,996 [INFO ] epollEventLoopGroup-3-3 org.pytorch.serve.wlm.ModelManager - Model gpt_fast_handler unregistered.
2023-12-02T05:03:04,996 [INFO ] epollEventLoopGroup-3-3 ACCESS_LOG - /127.0.0.1:35060 "DELETE /models/gpt_fast_handler HTTP/1.1" 200 22
2023-12-02T05:03:04,997 [INFO ] epollEventLoopGroup-3-3 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701493384


===================================================================================================== 1 passed, 2 deselected in 16.83s =====================================================================================================
  • Inference
2023-12-02T04:48:31,939 [INFO ] epollEventLoopGroup-3-4 TS_METRICS - ts_inference_requests_total.Count:1.0|#model_name:gpt-fast,model_version:default|#hostname:ip-172-31-15-101,timestamp:1701492511
2023-12-02T04:48:31,939 [DEBUG] W-9000-gpt-fast_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd PREDICT repeats 1 to backend at: 1701492511939
2023-12-02T04:48:31,940 [INFO ] W-9000-gpt-fast_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1701492511940
2023-12-02T04:48:31,940 [INFO ] W-9000-gpt-fast_1.0-stdout MODEL_LOG - Backend received inference at: 1701492511
2023-12-02T04:48:32,011 [INFO ] W-9000-gpt-fast_1.0 ACCESS_LOG - /127.0.0.1:53314 "PUT /predictions/gpt-fast HTTP/1.1" 200 72
2023-12-02T04:48:32,012 [INFO ] W-9000-gpt-fast_1.0 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701492512
2023-12-02T04:48:32,713 [INFO ] W-9000-gpt-fast_1.0-stdout MODEL_LOG - Num tokens = 50
2023-12-02T04:48:32,713 [INFO ] W-9000-gpt-fast_1.0 TS_METRICS - ts_inference_latency_microseconds.Microseconds:773600.846|#model_name:gpt-fast,model_version:default|#hostname:ip-172-31-15-101,timestamp:1701492512
2023-12-02T04:48:32,713 [INFO ] W-9000-gpt-fast_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]HandlerTime.Milliseconds:772.45|#ModelName:gpt-fast,Level:Model|#type:GAUGE|#hostname:ip-172-31-15-101,1701492512,ac2b9077-8456-4916-b7df-ed4644787aca, pattern=[METRICS]
2023-12-02T04:48:32,713 [INFO ] W-9000-gpt-fast_1.0 TS_METRICS - ts_queue_latency_microseconds.Microseconds:87.744|#model_name:gpt-fast,model_version:default|#hostname:ip-172-31-15-101,timestamp:1701492512
2023-12-02T04:48:32,713 [INFO ] W-9000-gpt-fast_1.0-stdout MODEL_METRICS - HandlerTime.ms:772.45|#ModelName:gpt-fast,Level:Model|#hostname:ip-172-31-15-101,requestID:ac2b9077-8456-4916-b7df-ed4644787aca,timestamp:1701492512
2023-12-02T04:48:32,713 [DEBUG] W-9000-gpt-fast_1.0 org.pytorch.serve.job.RestJob - Waiting time ns: 87744, Backend time ns: 773955376
2023-12-02T04:48:32,713 [INFO ] W-9000-gpt-fast_1.0 TS_METRICS - QueueTime.Milliseconds:0.0|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701492512
2023-12-02T04:48:32,714 [INFO ] W-9000-gpt-fast_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]PredictionTime.Milliseconds:772.56|#ModelName:gpt-fast,Level:Model|#type:GAUGE|#hostname:ip-172-31-15-101,1701492512,ac2b9077-8456-4916-b7df-ed4644787aca, pattern=[METRICS]
2023-12-02T04:48:32,714 [INFO ] W-9000-gpt-fast_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 770
2023-12-02T04:48:32,714 [INFO ] W-9000-gpt-fast_1.0-stdout MODEL_METRICS - PredictionTime.ms:772.56|#ModelName:gpt-fast,Level:Model|#hostname:ip-172-31-15-101,requestID:ac2b9077-8456-4916-b7df-ed4644787aca,timestamp:1701492512
2023-12-02T04:48:32,714 [INFO ] W-9000-gpt-fast_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:5.0|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1701492512
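For reference, the PUT /predictions/gpt-fast call in the access log above can be reproduced with a plain HTTP request along these lines. The JSON field names are assumptions inferred from the "Num tokens = 50" log line; see the example README for the exact request format.

```bash
curl -X PUT http://127.0.0.1:8080/predictions/gpt-fast \
    -H "Content-Type: application/json" \
    -d '{"prompt": "The capital of France is", "max_new_tokens": 50}'
```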

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@mreso mreso changed the title from "(WIP) GPT fast example" to "GPT fast example" on Dec 2, 2023
@mreso mreso marked this pull request as ready for review December 2, 2023 05:05

@agunapal agunapal left a comment


LGTM overall.

  1. The model should point to the int8 model, since the README talks about doing the quantization.
  2. Should we add tokens/second in the handler, or maybe this can be done with a client-side script? (A rough client-side sketch follows below.)
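
On point 2, if the measurement ends up client-side, a minimal hypothetical script could look like the following (not part of this PR): the model name and request payload are assumptions, and splitting the response on whitespace is only a rough proxy for the real token count, which would need the model's tokenizer.

```python
# Hypothetical client-side tokens/second measurement for the gpt-fast example.
import time

import requests

URL = "http://127.0.0.1:8080/predictions/gpt-fast"  # assumed model name
payload = {"prompt": "The capital of France is", "max_new_tokens": 50}  # assumed fields

start = time.perf_counter()
response = requests.post(URL, json=payload)
elapsed = time.perf_counter() - start

response.raise_for_status()
num_tokens = len(response.text.split())  # rough proxy for generated token count
print(f"{num_tokens} tokens in {elapsed:.2f}s -> {num_tokens / elapsed:.1f} tokens/sec")
```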


@chauhang chauhang left a comment


@mreso Thanks for submitting this PR. I left a few review comments inline. It would be good to mention that the example has been tested on A10g, A100, and H100 GPUs.

Inline review threads (all resolved) were left on:

  • examples/large_models/gpt_fast/model_config.yaml
  • examples/large_models/gpt_fast/README.md
  • examples/large_models/gpt_fast/handler.py (three threads)

@agunapal agunapal left a comment


LGTM

@mreso mreso enabled auto-merge December 4, 2023 22:37
@mreso mreso requested a review from chauhang December 5, 2023 18:33

mreso commented Dec 5, 2023

> @mreso Thanks for submitting this PR. I left a few review comments inline. It would be good to mention that the example has been tested on A10g, A100, and H100 GPUs.

Thanks for reviewing, @chauhang. I've addressed your comments.

@mreso mreso added this pull request to the merge queue Dec 5, 2023
Merged via the queue into master with commit a1602ba Dec 5, 2023
13 checks passed
@chauhang chauhang added this to the v0.10.0 milestone Feb 27, 2024