
support system_metrics_cmd in config.properties #3000

Merged
merged 5 commits into master via the merge queue on Mar 5, 2024

Conversation

lxning (Collaborator) commented Mar 4, 2024

Description

Adds support for a new system_metrics_cmd option in config.properties, which lets users specify a custom system metrics collection command (a Python script name plus arguments). When the option is not set, TorchServe falls back to its default collector, ts/metrics/metric_collector.py --gpu $CUDA_VISIBLE_DEVICES.

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

Please describe the unit or integration tests that you ran to verify your changes and a relevant result summary. Provide instructions so the results can be reproduced.
Please also list any relevant details of your test configuration.

  • Test case 1: use the TorchServe default system metrics command
torchserve --ncs --start --model-store model_store

Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: false
Enable metrics API: true
Metrics mode: LOG
Disable system metrics: false
Workflow Store: /home/ubuntu/serve/model_store
CPP log config: N/A
Model config: N/A
System metrics command: default
2024-03-05T17:32:43,601 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2024-03-05T17:32:43,620 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2024-03-05T17:32:43,675 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2024-03-05T17:32:43,676 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2024-03-05T17:32:43,677 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2024-03-05T17:32:43,677 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2024-03-05T17:32:43,678 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2024-03-05T17:32:45,250 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:5.0|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,251 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:261.5556411743164|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,251 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:731.0341110229492|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,251 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:73.6|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,252 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:0.00868507903421921|#Level:Host,DeviceId:0|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,252 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:2.0|#Level:Host,DeviceId:0|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,252 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:0.00868507903421921|#Level:Host,DeviceId:1|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,252 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:2.0|#Level:Host,DeviceId:1|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,252 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:0.00868507903421921|#Level:Host,DeviceId:2|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,253 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:2.0|#Level:Host,DeviceId:2|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,253 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:0.00868507903421921|#Level:Host,DeviceId:3|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,253 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:2.0|#Level:Host,DeviceId:3|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,253 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:0|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,253 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:1|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,254 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:2|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,254 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:3|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,254 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:375841.48828125|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,254 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:3571.5859375|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,254 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:1.8|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709659965
  • Test case 2: use a customized system metrics command
cat config.properties
inference_address=http://127.0.0.1:8080
management_address=http://127.0.0.1:8081

number_of_netty_threads=32
job_queue_size=1000

vmargs=-Xmx4g -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError
prefer_direct_buffer=True

default_response_timeout=300
unregister_model_timeout=300
install_py_dep_per_model=true
system_metrics_cmd=ts/metrics/metric_collector.py --gpu 0

torchserve --ncs --start --model-store model_store --ts-config config.properties

WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2024-03-05T17:40:58,612 [WARN ] main org.pytorch.serve.util.ConfigManager - Your torchserve instance can access any URL to load models. When deploying to production, make sure to limit the set of allowed_urls in config.properties
2024-03-05T17:40:58,614 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2024-03-05T17:40:58,665 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from /opt/conda/lib/python3.10/site-packages/ts/configs/metrics.yaml
2024-03-05T17:40:58,843 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.9.0
TS Home: /opt/conda/lib/python3.10/site-packages
Current directory: /home/ubuntu/serve
Temp directory: /tmp
Metrics config path: /opt/conda/lib/python3.10/site-packages/ts/configs/metrics.yaml
Number of GPUs: 4
Number of CPUs: 96
Max heap size: 4096 M
Python executable: /opt/conda/bin/python
Config file: config.properties
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: /home/ubuntu/serve/model_store
Initial Models: N/A
Log dir: /home/ubuntu/serve/logs
Metrics dir: /home/ubuntu/serve/logs
Netty threads: 32
Netty client threads: 0
Default workers per model: 4
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: True
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: true
Enable metrics API: true
Metrics mode: LOG
Disable system metrics: false
Workflow Store: /home/ubuntu/serve/model_store
CPP log config: N/A
Model config: N/A
System metrics command: ts/metrics/metric_collector.py --gpu 0
2024-03-05T17:40:58,849 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2024-03-05T17:40:58,868 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2024-03-05T17:40:58,918 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2024-03-05T17:40:58,919 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2024-03-05T17:40:58,920 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2024-03-05T17:40:58,920 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2024-03-05T17:40:58,921 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2024-03-05T17:40:59,144 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660459
2024-03-05T17:40:59,146 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:261.55556869506836|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660459
2024-03-05T17:40:59,146 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:731.0341835021973|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660459
2024-03-05T17:40:59,146 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:73.6|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660459
2024-03-05T17:40:59,147 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:375894.63671875|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660459
2024-03-05T17:40:59,147 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:3518.4375|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660459
2024-03-05T17:40:59,147 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:1.8|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660459
2024-03-05T17:41:59,140 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660519
2024-03-05T17:41:59,141 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:261.55555725097656|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660519
2024-03-05T17:41:59,141 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:731.0341949462891|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660519
2024-03-05T17:41:59,141 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:73.6|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660519
2024-03-05T17:41:59,141 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:375890.171875|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660519
2024-03-05T17:41:59,142 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:3523.39453125|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660519
2024-03-05T17:41:59,142 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:1.8|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660519

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

cc @namannandan

@lxning lxning self-assigned this Mar 5, 2024
@lxning lxning added the documentation, enhancement, metrics and p0 (high priority) labels Mar 5, 2024
@lxning lxning added this to the v0.10.0 milestone Mar 5, 2024
@lxning lxning requested a review from agunapal March 5, 2024 17:44
agunapal (Collaborator) left a comment

@lxning

Thanks. This should support something like this, where the user passes this as an extra file.

torch-model-archiver --model-name mnist --version 1.0 --model-file examples/image_classifier/mnist/mnist.py --serialized-file examples/image_classifier/mnist/mnist_cnn.pt --handler  examples/image_classifier/mnist/mnist_handler.py --extra-files "examples/custom_system_metrics/metric_collector.py"

and in the config.properties, I specify

system_metrics_cmd="metric_collector.py --gpu 0"

I tried this; currently it gives an error:

 can't open file '/home/ubuntu/anaconda3/envs/torchserve/lib/python3.10/site-packages/"metric_collector.py': [Errno 2] No such file or directory

lxning (Collaborator, Author) commented Mar 5, 2024

(quoted @agunapal's comment above in full)

Q1:

@lxning lxning closed this Mar 5, 2024
lxning (Collaborator, Author) commented Mar 5, 2024

@agunapal

Case 1: In the TorchServe architecture it is not valid to ship the system metrics script inside a model archive. System metrics are global, not model specific, and scripts packaged in a model archive are not visible to the frontend when TorchServe starts.

Case 2: By the config.properties format definition, system_metrics_cmd="metric_collector.py --gpu 0" is not the same value as system_metrics_cmd=metric_collector.py --gpu 0: the quotes are kept as part of the value, which is why the error above shows the frontend trying to open a file whose name literally begins with a quote character. The script also needs a path the frontend can resolve, e.g. an absolute path.

@lxning lxning reopened this Mar 5, 2024
agunapal (Collaborator) commented Mar 5, 2024

(quoted @lxning's reply above)

Confirmed this to be working.

System metrics command: /home/ubuntu/serve/examples/custom_system_metrics/metric_collector.py --gpu 0
2024-03-05T18:45:14,125 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2024-03-05T18:45:14,140 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: mnist.mar
2024-03-05T18:45:14,255 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model mnist
2024-03-05T18:45:14,255 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model mnist
2024-03-05T18:45:14,255 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model mnist loaded.
2024-03-05T18:45:14,255 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: mnist, count: 1
2024-03-05T18:45:14,263 [DEBUG] W-9000-mnist_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/ubuntu/anaconda3/envs/torchserve/bin/python, /home/ubuntu/anaconda3/envs/torchserve/lib/python3.10/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9000, --metrics-config, /home/ubuntu/anaconda3/envs/torchserve/lib/python3.10/site-packages/ts/configs/metrics.yaml]
2024-03-05T18:45:14,263 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2024-03-05T18:45:14,337 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2024-03-05T18:45:14,337 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2024-03-05T18:45:14,338 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2024-03-05T18:45:14,339 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2024-03-05T18:45:14,339 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2024-03-05T18:45:14,516 [WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.
2024-03-05T18:45:14,560 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:100.0|#Level:Host|#hostname:ip-172-31-12-209,timestamp:1709664314
2024-03-05T18:45:14,561 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:254.63021087646484|#Level:Host|#hostname:ip-172-31-12-209,timestamp:1709664314
2024-03-05T18:45:14,561 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:35.92073440551758|#Level:Host|#hostname:ip-172-31-12-209,timestamp:1709664314
2024-03-05T18:45:14,562 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:12.4|#Level:Host|#hostname:ip-172-31-12-209,timestamp:1709664314
2024-03-05T18:45:14,562 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:29909.2578125|#Level:Host|#hostname:ip-172-31-12-209,timestamp:1709664314
2024-03-05T18:45:14,562 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:1416.21484375|#Level:Host|#hostname:ip-172-31-12-209,timestamp:1709664314
2024-03-05T18:45:14,563 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:5.8|#Level:Host|#hostname:ip-172-31-12-209,timestamp:1709664314

Will send a separate PR to show how this can be used with a custom Python script.
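
For reference, here is a minimal sketch of what such a custom collector script might look like. This is an illustration only, not the follow-up PR: it assumes TorchServe runs the configured command on its metrics schedule and logs each line the script prints to stdout in the TS_METRICS format shown above, and the file name custom_metric_collector.py and its --gpu flag are hypothetical.

#!/usr/bin/env python3
# Hypothetical custom system metrics collector (illustration only).
# Assumed wiring in config.properties, e.g.:
#   system_metrics_cmd=/home/ubuntu/serve/examples/custom_system_metrics/custom_metric_collector.py --gpu 0
# TorchServe is assumed to invoke this command periodically and log whatever
# the script prints to stdout.
import argparse
import socket
import time

import psutil  # assumed available; the default collector also relies on it


def emit(name, unit, value, dimensions, hostname, timestamp):
    # Mirror the TS_METRICS line format shown in the logs above.
    print(f"{name}.{unit}:{value}|#{dimensions}|#hostname:{hostname},timestamp:{timestamp}")


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--gpu", type=int, default=0,
                        help="number of GPUs to report on; 0 skips GPU metrics")
    args = parser.parse_args()

    hostname = socket.gethostname()
    timestamp = int(time.time())

    # Host-level CPU, memory and disk metrics via psutil.
    emit("CPUUtilization", "Percent", psutil.cpu_percent(), "Level:Host", hostname, timestamp)
    mem = psutil.virtual_memory()
    emit("MemoryUsed", "Megabytes", mem.used / (1024 ** 2), "Level:Host", hostname, timestamp)
    emit("MemoryAvailable", "Megabytes", mem.available / (1024 ** 2), "Level:Host", hostname, timestamp)
    emit("MemoryUtilization", "Percent", mem.percent, "Level:Host", hostname, timestamp)
    disk = psutil.disk_usage("/")
    emit("DiskAvailable", "Gigabytes", disk.free / (1024 ** 3), "Level:Host", hostname, timestamp)
    emit("DiskUsage", "Gigabytes", disk.used / (1024 ** 3), "Level:Host", hostname, timestamp)
    emit("DiskUtilization", "Percent", disk.percent, "Level:Host", hostname, timestamp)

    if args.gpu > 0:
        # GPU metrics (e.g. via pynvml) would go here; passing --gpu 0,
        # as in the example above, disables them.
        pass


if __name__ == "__main__":
    main()

Pointing system_metrics_cmd at the script's absolute path, as in the confirmed configuration above, would then emit only host-level metrics, with GPU metrics disabled.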

agunapal (Collaborator) left a comment

LGTM.

nit: please edit the example as suggested to make it obvious

@@ -297,6 +297,7 @@ e.g. : To allow base URLs `https://s3.amazonaws.com/` and `https://torchserve.py
* For security reason, `use_env_allowed_urls=true` is required in config.properties to read `allowed_urls` from environment variable.
* `workflow_store` : Path of workflow store directory. Defaults to model store directory.
* `disable_system_metrics` : Disable collection of system metrics when set to "true". Default value is "false".
* `system_metrics_cmd`: The customized system metrics python script name with arguments. For example:`ts/metrics/metric_collector.py --gpu 0`. Default: empty which means TorchServe collects system metrics via "ts/metrics/metric_collector.py --gpu $CUDA_VISIBLE_DEVICES".
agunapal (Collaborator) commented on the documentation change:

Suggest changing the example to system_metrics_cmd=ts/metrics/metric_collector.py --gpu 0 to disable GPU metrics

@lxning lxning added this pull request to the merge queue Mar 5, 2024
Merged via the queue into master with commit 1ff1b3b Mar 5, 2024
24 of 29 checks passed
Labels: documentation, enhancement, metrics, p0 (high priority)
Milestone: v0.10.0
2 participants