
support system_metrics_cmd in config.properties #3000

Merged
merged 5 commits into master via the merge queue on Mar 5, 2024

Conversation

lxning (Collaborator) commented Mar 4, 2024

Description

Adds support for a new system_metrics_cmd option in config.properties, which lets users specify a custom system metrics collection command (a Python script name plus arguments). When the option is not set, TorchServe falls back to its default collector, ts/metrics/metric_collector.py --gpu $CUDA_VISIBLE_DEVICES.

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

Please describe the unit or integration tests that you ran to verify your changes and a relevant result summary. Provide instructions so the results can be reproduced.
Please also list any relevant details of your test configuration.

  • Test case 1: use the TorchServe default system metrics command
torchserve --ncs --start --model-store model_store

Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: false
Enable metrics API: true
Metrics mode: LOG
Disable system metrics: false
Workflow Store: /home/ubuntu/serve/model_store
CPP log config: N/A
Model config: N/A
System metrics command: default
2024-03-05T17:32:43,601 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2024-03-05T17:32:43,620 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2024-03-05T17:32:43,675 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2024-03-05T17:32:43,676 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2024-03-05T17:32:43,677 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2024-03-05T17:32:43,677 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2024-03-05T17:32:43,678 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2024-03-05T17:32:45,250 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:5.0|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,251 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:261.5556411743164|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,251 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:731.0341110229492|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,251 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:73.6|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,252 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:0.00868507903421921|#Level:Host,DeviceId:0|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,252 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:2.0|#Level:Host,DeviceId:0|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,252 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:0.00868507903421921|#Level:Host,DeviceId:1|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,252 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:2.0|#Level:Host,DeviceId:1|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,252 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:0.00868507903421921|#Level:Host,DeviceId:2|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,253 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:2.0|#Level:Host,DeviceId:2|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,253 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:0.00868507903421921|#Level:Host,DeviceId:3|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,253 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:2.0|#Level:Host,DeviceId:3|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,253 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:0|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,253 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:1|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,254 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:2|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,254 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:3|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,254 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:375841.48828125|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,254 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:3571.5859375|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709659965
2024-03-05T17:32:45,254 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:1.8|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709659965
  • Test case 2: use a customized system metrics command
cat config.properties
inference_address=http://127.0.0.1:8080
management_address=http://127.0.0.1:8081

number_of_netty_threads=32
job_queue_size=1000

vmargs=-Xmx4g -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError
prefer_direct_buffer=True

default_response_timeout=300
unregister_model_timeout=300
install_py_dep_per_model=true
system_metrics_cmd=ts/metrics/metric_collector.py --gpu 0

torchserve --ncs --start --model-store model_store --ts-config config.properties

WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2024-03-05T17:40:58,612 [WARN ] main org.pytorch.serve.util.ConfigManager - Your torchserve instance can access any URL to load models. When deploying to production, make sure to limit the set of allowed_urls in config.properties
2024-03-05T17:40:58,614 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2024-03-05T17:40:58,665 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from /opt/conda/lib/python3.10/site-packages/ts/configs/metrics.yaml
2024-03-05T17:40:58,843 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.9.0
TS Home: /opt/conda/lib/python3.10/site-packages
Current directory: /home/ubuntu/serve
Temp directory: /tmp
Metrics config path: /opt/conda/lib/python3.10/site-packages/ts/configs/metrics.yaml
Number of GPUs: 4
Number of CPUs: 96
Max heap size: 4096 M
Python executable: /opt/conda/bin/python
Config file: config.properties
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: /home/ubuntu/serve/model_store
Initial Models: N/A
Log dir: /home/ubuntu/serve/logs
Metrics dir: /home/ubuntu/serve/logs
Netty threads: 32
Netty client threads: 0
Default workers per model: 4
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: True
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: true
Enable metrics API: true
Metrics mode: LOG
Disable system metrics: false
Workflow Store: /home/ubuntu/serve/model_store
CPP log config: N/A
Model config: N/A
System metrics command: ts/metrics/metric_collector.py --gpu 0
2024-03-05T17:40:58,849 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2024-03-05T17:40:58,868 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2024-03-05T17:40:58,918 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2024-03-05T17:40:58,919 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2024-03-05T17:40:58,920 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2024-03-05T17:40:58,920 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2024-03-05T17:40:58,921 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2024-03-05T17:40:59,144 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660459
2024-03-05T17:40:59,146 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:261.55556869506836|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660459
2024-03-05T17:40:59,146 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:731.0341835021973|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660459
2024-03-05T17:40:59,146 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:73.6|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660459
2024-03-05T17:40:59,147 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:375894.63671875|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660459
2024-03-05T17:40:59,147 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:3518.4375|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660459
2024-03-05T17:40:59,147 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:1.8|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660459
2024-03-05T17:41:59,140 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660519
2024-03-05T17:41:59,141 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:261.55555725097656|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660519
2024-03-05T17:41:59,141 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:731.0341949462891|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660519
2024-03-05T17:41:59,141 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:73.6|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660519
2024-03-05T17:41:59,141 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:375890.171875|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660519
2024-03-05T17:41:59,142 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:3523.39453125|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660519
2024-03-05T17:41:59,142 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:1.8|#Level:Host|#hostname:ip-172-31-8-38,timestamp:1709660519

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

cc @namannandan

@lxning lxning self-assigned this Mar 5, 2024
@lxning lxning added the documentation, enhancement, metrics and p0 (high priority) labels Mar 5, 2024
@lxning lxning added this to the v0.10.0 milestone Mar 5, 2024
@lxning lxning requested a review from agunapal March 5, 2024 17:44
agunapal (Collaborator) left a comment

@lxning

Thanks. This should support something like this, where the user passes this as an extra file.

torch-model-archiver --model-name mnist --version 1.0 --model-file examples/image_classifier/mnist/mnist.py --serialized-file examples/image_classifier/mnist/mnist_cnn.pt --handler  examples/image_classifier/mnist/mnist_handler.py --extra-files "examples/custom_system_metrics/metric_collector.py"

and in the config.properties, I specify

system_metrics_cmd="metric_collector.py --gpu 0"

I tried this; currently it gives an error:

 can't open file '/home/ubuntu/anaconda3/envs/torchserve/lib/python3.10/site-packages/"metric_collector.py': [Errno 2] No such file or directory

lxning (Collaborator, Author) commented Mar 5, 2024

(quoted @agunapal's comment above in full)

Q1:

@lxning lxning closed this Mar 5, 2024
lxning (Collaborator, Author) commented Mar 5, 2024

@agunapal

Case 1: In the TorchServe architecture it is not valid to ship the system metrics script inside a model archive. System metrics are global, not model specific, and scripts packaged in a model archive are not visible to the frontend when TorchServe starts.

Case 2: By the config.properties format definition, system_metrics_cmd="metric_collector.py --gpu 0" is not the same value as system_metrics_cmd=metric_collector.py --gpu 0: the quotes are kept as part of the value, which is why the error above shows the frontend trying to open a file whose name literally begins with a quote character. The script also needs a path the frontend can resolve, e.g. an absolute path.

@lxning lxning reopened this Mar 5, 2024
agunapal (Collaborator) commented Mar 5, 2024

(quoted @lxning's reply above)

Confirmed this to be working.

System metrics command: /home/ubuntu/serve/examples/custom_system_metrics/metric_collector.py --gpu 0
2024-03-05T18:45:14,125 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2024-03-05T18:45:14,140 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: mnist.mar
2024-03-05T18:45:14,255 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model mnist
2024-03-05T18:45:14,255 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model mnist
2024-03-05T18:45:14,255 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model mnist loaded.
2024-03-05T18:45:14,255 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: mnist, count: 1
2024-03-05T18:45:14,263 [DEBUG] W-9000-mnist_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/ubuntu/anaconda3/envs/torchserve/bin/python, /home/ubuntu/anaconda3/envs/torchserve/lib/python3.10/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9000, --metrics-config, /home/ubuntu/anaconda3/envs/torchserve/lib/python3.10/site-packages/ts/configs/metrics.yaml]
2024-03-05T18:45:14,263 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2024-03-05T18:45:14,337 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2024-03-05T18:45:14,337 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2024-03-05T18:45:14,338 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2024-03-05T18:45:14,339 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2024-03-05T18:45:14,339 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2024-03-05T18:45:14,516 [WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.
2024-03-05T18:45:14,560 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:100.0|#Level:Host|#hostname:ip-172-31-12-209,timestamp:1709664314
2024-03-05T18:45:14,561 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:254.63021087646484|#Level:Host|#hostname:ip-172-31-12-209,timestamp:1709664314
2024-03-05T18:45:14,561 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:35.92073440551758|#Level:Host|#hostname:ip-172-31-12-209,timestamp:1709664314
2024-03-05T18:45:14,562 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:12.4|#Level:Host|#hostname:ip-172-31-12-209,timestamp:1709664314
2024-03-05T18:45:14,562 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:29909.2578125|#Level:Host|#hostname:ip-172-31-12-209,timestamp:1709664314
2024-03-05T18:45:14,562 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:1416.21484375|#Level:Host|#hostname:ip-172-31-12-209,timestamp:1709664314
2024-03-05T18:45:14,563 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:5.8|#Level:Host|#hostname:ip-172-31-12-209,timestamp:1709664314

Will send a separate PR to show how this can be used with a custom Python script.
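
For reference, here is a minimal sketch of what such a custom collector script might look like. This is an illustration only, not the follow-up PR: it assumes TorchServe runs the configured command on its metrics schedule and logs each line the script prints to stdout in the TS_METRICS format shown above, and the file name custom_metric_collector.py and its --gpu flag are hypothetical.

#!/usr/bin/env python3
# Hypothetical custom system metrics collector (illustration only).
# Assumed wiring in config.properties, e.g.:
#   system_metrics_cmd=/home/ubuntu/serve/examples/custom_system_metrics/custom_metric_collector.py --gpu 0
# TorchServe is assumed to invoke this command periodically and log whatever
# the script prints to stdout.
import argparse
import socket
import time

import psutil  # assumed available; the default collector also relies on it


def emit(name, unit, value, dimensions, hostname, timestamp):
    # Mirror the TS_METRICS line format shown in the logs above.
    print(f"{name}.{unit}:{value}|#{dimensions}|#hostname:{hostname},timestamp:{timestamp}")


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--gpu", type=int, default=0,
                        help="number of GPUs to report on; 0 skips GPU metrics")
    args = parser.parse_args()

    hostname = socket.gethostname()
    timestamp = int(time.time())

    # Host-level CPU, memory and disk metrics via psutil.
    emit("CPUUtilization", "Percent", psutil.cpu_percent(), "Level:Host", hostname, timestamp)
    mem = psutil.virtual_memory()
    emit("MemoryUsed", "Megabytes", mem.used / (1024 ** 2), "Level:Host", hostname, timestamp)
    emit("MemoryAvailable", "Megabytes", mem.available / (1024 ** 2), "Level:Host", hostname, timestamp)
    emit("MemoryUtilization", "Percent", mem.percent, "Level:Host", hostname, timestamp)
    disk = psutil.disk_usage("/")
    emit("DiskAvailable", "Gigabytes", disk.free / (1024 ** 3), "Level:Host", hostname, timestamp)
    emit("DiskUsage", "Gigabytes", disk.used / (1024 ** 3), "Level:Host", hostname, timestamp)
    emit("DiskUtilization", "Percent", disk.percent, "Level:Host", hostname, timestamp)

    if args.gpu > 0:
        # GPU metrics (e.g. via pynvml) would go here; passing --gpu 0,
        # as in the example above, disables them.
        pass


if __name__ == "__main__":
    main()

Pointing system_metrics_cmd at the script's absolute path, as in the confirmed configuration above, would then emit only host-level metrics, with GPU metrics disabled.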

agunapal (Collaborator) left a comment

LGTM.

nit: please edit the example as suggested to make it obvious

@@ -297,6 +297,7 @@ e.g. : To allow base URLs `https://s3.amazonaws.com/` and `https://torchserve.py
* For security reason, `use_env_allowed_urls=true` is required in config.properties to read `allowed_urls` from environment variable.
* `workflow_store` : Path of workflow store directory. Defaults to model store directory.
* `disable_system_metrics` : Disable collection of system metrics when set to "true". Default value is "false".
* `system_metrics_cmd`: The customized system metrics python script name with arguments. For example:`ts/metrics/metric_collector.py --gpu 0`. Default: empty which means TorchServe collects system metrics via "ts/metrics/metric_collector.py --gpu $CUDA_VISIBLE_DEVICES".
agunapal (Collaborator) commented on the documentation change:

Suggest changing the example to system_metrics_cmd=ts/metrics/metric_collector.py --gpu 0 to disable GPU metrics

@lxning lxning added this pull request to the merge queue Mar 5, 2024
Merged via the queue into master with commit 1ff1b3b Mar 5, 2024
24 of 29 checks passed
Labels: documentation, enhancement, metrics, p0 (high priority)
Milestone: v0.10.0
2 participants