Add configuration option to disable system metrics collection #2104

Merged: 8 commits, Feb 14, 2023

Conversation

@namannandan (Collaborator) commented Feb 2, 2023

Description

Add the configuration option disable_system_metrics, which disables system metrics collection. It can be set in a configuration file or through the environment variable TS_DISABLE_SYSTEM_METRICS.
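As a rough illustration of the two ways to set the option, the sketch below renders the property line and builds an environment for a TorchServe process. The helper names render_config and env_with_system_metrics_disabled are hypothetical, and the assumption that the environment variable accepts the same "true" value as the property is not confirmed by this PR description.

```python
import os


def render_config(disable_system_metrics=True):
    """Return the contents of a minimal properties file with the new option."""
    value = "true" if disable_system_metrics else "false"
    return f"disable_system_metrics={value}\n"


def env_with_system_metrics_disabled(base_env=None):
    """Return a copy of the environment with TS_DISABLE_SYSTEM_METRICS set.

    Assumption: the variable takes the same "true" value as the property.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["TS_DISABLE_SYSTEM_METRICS"] = "true"
    return env
```

The rendered text can be saved to a file and passed with --ts-config, as in the manual test in this description; the environment dictionary could be handed to a process launcher such as subprocess.run(..., env=env).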

Fixes #2052

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

  • Regression tests
test_metrics.py::test_collect_system_metrics_when_not_disabled PASSED [ 88%]
test_metrics.py::test_disable_system_metrics_using_config_properties PASSED [ 94%]
test_metrics.py::test_disable_system_metrics_using_environment_variable PASSED [100%]

cpu_regression_system_metrics_disable.log
gpu_regression_system_metrics_disable.log

  • Manual test with system metrics disabled: register a model and make inference requests.
$ cat logs/ts_metrics.log 
2023-02-02T00:56:55,064 - Requests2XX.Count:1|#Level:Host|#hostname:88665a372f4b.ant.amazon.com,timestamp:1675328214
2023-02-02T00:57:25,592 - Requests2XX.Count:1|#Level:Host|#hostname:88665a372f4b.ant.amazon.com,timestamp:1675328214
2023-02-02T00:57:29,198 - W-9001-resnet18_1.0.ms:3608|#Level:Host|#hostname:88665a372f4b.ant.amazon.com,timestamp:1675328249
2023-02-02T00:57:29,198 - W-9000-resnet18_1.0.ms:3609|#Level:Host|#hostname:88665a372f4b.ant.amazon.com,timestamp:1675328249
2023-02-02T00:57:29,199 - WorkerThreadTime.ms:24|#Level:Host|#hostname:88665a372f4b.ant.amazon.com,timestamp:1675328249
2023-02-02T00:57:29,199 - WorkerThreadTime.ms:24|#Level:Host|#hostname:88665a372f4b.ant.amazon.com,timestamp:1675328249
2023-02-02T00:58:45,775 - Requests2XX.Count:1|#Level:Host|#hostname:88665a372f4b.ant.amazon.com,timestamp:1675328214
2023-02-02T00:58:45,777 - QueueTime.ms:0|#Level:Host|#hostname:88665a372f4b.ant.amazon.com,timestamp:1675328325
2023-02-02T00:58:45,778 - WorkerThreadTime.ms:7|#Level:Host|#hostname:88665a372f4b.ant.amazon.com,timestamp:1675328325
2023-02-02T01:02:45,042 - Requests2XX.Count:1|#Level:Host|#hostname:88665a372f4b.ant.amazon.com,timestamp:1675328214
2023-02-02T01:02:45,042 - QueueTime.ms:0|#Level:Host|#hostname:88665a372f4b.ant.amazon.com,timestamp:1675328565

System metrics such as CPUUtilization and MemoryUtilization are neither collected nor reported.
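The manual check above could be automated along these lines. This is a minimal sketch, assuming the log line layout shown in the transcript (timestamp, " - ", then Name.Unit:value followed by "|"-separated dimensions). The function names are hypothetical, and SYSTEM_METRIC_NAMES lists only the two metrics named in this description, so it is illustrative rather than complete.

```python
# Only the two system metrics named in this PR description; extend as needed.
SYSTEM_METRIC_NAMES = {"CPUUtilization", "MemoryUtilization"}


def metric_name(log_line):
    """Extract the metric name from one ts_metrics.log line, or None."""
    _, sep, payload = log_line.partition(" - ")
    if not sep:
        return None
    # payload looks like "Requests2XX.Count:1|#Level:Host|#hostname:..."
    name_and_unit = payload.split("|", 1)[0].split(":", 1)[0]
    # Drop the trailing ".<unit>"; names such as "W-9001-resnet18_1.0"
    # contain dots themselves, so split on the last dot only.
    name, dot, _unit = name_and_unit.rpartition(".")
    return name if dot else name_and_unit


def system_metrics_found(lines):
    """Return the set of known system metric names that appear in the lines."""
    return {name for name in map(metric_name, lines) if name in SYSTEM_METRIC_NAMES}
```

With system metrics disabled, calling system_metrics_found on the lines of logs/ts_metrics.log would be expected to return an empty set.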

  • Manual test to verify config snapshot
$ cat custom_config.properties 
disable_system_metrics=true
$ ls ../model_store 
resnet-18-py-eager.mar
$ rm -rf logs
$ torchserve --start --model-store ../model_store --models all --ts-config custom_config.properties
.....
.....
Number of GPUs: 0
Number of CPUs: 12
Max heap size: 4096 M
Python executable: /Users/namannan/.pyenv/versions/3.8.13/bin/python
Config file: custom_config.properties
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: /Volumes/workplace/pytorch/model_store
Initial Models: all
Log dir: /Volumes/workplace/pytorch/serve/logs
Metrics dir: /Volumes/workplace/pytorch/serve/logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 12
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: false
Metrics report format: prometheus
Enable metrics API: true
Disable system metrics: true
Workflow Store: /Volumes/workplace/pytorch/model_store
Model config: N/A
.....
.....
$ curl -X GET "http://localhost:8081/models"
{
  "models": [
    {
      "modelName": "resnet-18-py-eager",
      "modelUrl": "resnet-18-py-eager.mar"
    }
  ]
}
$ torchserve --stop
$ torchserve --start --model-store ../model_store --models all
.....
.....
Number of GPUs: 0
Number of CPUs: 12
Max heap size: 4096 M
Python executable: /Users/namannan/.pyenv/versions/3.8.13/bin/python
Config file: logs/config/20230207163503771-shutdown.cfg
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: /Volumes/workplace/pytorch/model_store
Initial Models: all
Log dir: /Volumes/workplace/pytorch/serve/logs
Metrics dir: /Volumes/workplace/pytorch/serve/logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 12
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: false
Metrics report format: prometheus
Enable metrics API: true
Disable system metrics: true
Workflow Store: /Volumes/workplace/pytorch/model_store
Model config: N/A
2023-02-07T16:35:43,958 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Started restoring models from snapshot {
  "name": "20230207163503771-shutdown.cfg",
  "modelCount": 1,
  "created": 1675816503771,
  "models": {
    "resnet-18-py-eager": {
      "1.0": {
        "defaultVersion": true,
        "marName": "resnet-18-py-eager.mar",
        "minWorkers": 12,
        "maxWorkers": 12,
        "batchSize": 1,
        "maxBatchDelay": 100,
        "responseTimeout": 120
      }
    }
  }
}
.....
.....
$ curl -X GET "http://localhost:8081/models"
{
  "models": [
    {
      "modelName": "resnet-18-py-eager",
      "modelUrl": "resnet-18-py-eager.mar"
    }
  ]
}
$ torchserve --stop
$ torchserve --start --model-store ../model_store --models all
$ curl -X GET "http://localhost:8081/models"
{
  "models": [
    {
      "modelName": "resnet-18-py-eager",
      "modelUrl": "resnet-18-py-eager.mar"
    }
  ]
}
$ torchserve --stop
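To compare the startup output of the first run with that of the snapshot-restored run, the "Key: value" lines could be parsed as sketched below. The function name parse_startup_banner is hypothetical, and the parsing rule (keys made only of letters and spaces) is an assumption drawn from the transcript above, not a documented format.

```python
def parse_startup_banner(text):
    """Parse "Key: value" lines from TorchServe startup output into a dict.

    Lines whose key contains anything other than letters and spaces
    (log records, JSON from the snapshot dump) are skipped.
    """
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep and key.replace(" ", "").isalpha():
            settings[key.strip()] = value.strip()
    return settings


def setting_preserved(first_run, second_run, key="Disable system metrics"):
    """Check that a setting has the same value in both startup outputs."""
    first = parse_startup_banner(first_run)
    second = parse_startup_banner(second_run)
    return key in first and first.get(key) == second.get(key)
```

Applied to the two startup outputs in the transcript, setting_preserved would compare the "Disable system metrics" lines, both of which read true.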

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

codecov bot commented Feb 2, 2023

Codecov Report

Merging #2104 (d87fab7) into master (30ec515) will not change coverage.
The diff coverage is n/a.

❗ Current head d87fab7 differs from pull request most recent head cb9bc10. Consider uploading reports for the commit cb9bc10 to get more accurate results

@@           Coverage Diff           @@
##           master    #2104   +/-   ##
=======================================
  Coverage   53.36%   53.36%           
=======================================
  Files          71       71           
  Lines        3225     3225           
  Branches       56       56           
=======================================
  Hits         1721     1721           
  Misses       1504     1504           

@namannandan namannandan marked this pull request as ready for review February 2, 2023 19:31
Review thread on test/pytest/test_metrics.py (outdated, resolved)
@lxning (Collaborator) left a comment

  • Please attach the complete regression test log.
  • Check that the snapshot works, i.e., run "torchserve --start --model-store xxx --models all" twice.

Review thread on docs/configuration.md (outdated, resolved)
@namannandan (Collaborator, Author) commented

Added CPU and GPU regression logs in the summary. Also included the transcript for a manual test session to check that config snapshot works as expected.

@namannandan namannandan merged commit 7e65972 into pytorch:master Feb 14, 2023
Development

Successfully merging this pull request may close these issues.

Add a flag to disable system metrics collection
4 participants