Add configuration option to disable system metrics collection #2104

namannandan · 2023-02-02T18:50:38Z

Description

Support disabling system metrics collection using the configuration option disable_system_metrics via a configuration file or the environment variable TS_DISABLE_SYSTEM_METRICS.
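To illustrate the two ways of setting the option, here is a minimal Python sketch. It is not the TorchServe implementation (that logic lives in the Java frontend). The helper names `parse_properties` and `system_metrics_disabled` are hypothetical, and the precedence shown (environment variable overrides the file) is an assumption made for the example.

```python
from typing import Mapping


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def system_metrics_disabled(config_text: str, environ: Mapping[str, str]) -> bool:
    """Return True when system metrics collection should be skipped.

    Assumption for this sketch: the environment variable, when set,
    takes precedence over the value in the configuration file.
    """
    env_value = environ.get("TS_DISABLE_SYSTEM_METRICS")
    if env_value is not None:
        return env_value.strip().lower() == "true"
    file_value = parse_properties(config_text).get("disable_system_metrics", "false")
    return file_value.lower() == "true"
```

For example, `system_metrics_disabled("disable_system_metrics=true", {})` evaluates the file setting alone, while passing `os.environ` as the second argument would take the current environment into account.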

Fixes #2052

Type of change

New feature (non-breaking change which adds functionality)
This change requires a documentation update

Feature/Issue validation/testing

Regression tests

test_metrics.py::test_collect_system_metrics_when_not_disabled PASSED [ 88%]
test_metrics.py::test_disable_system_metrics_using_config_properties PASSED [ 94%]
test_metrics.py::test_disable_system_metrics_using_environment_variable PASSED [100%]

cpu_regression_system_metrics_disable.log
gpu_regression_system_metrics_disable.log

Manual test with system metrics disabled: register a model and make inference requests.

$ cat logs/ts_metrics.log 
2023-02-02T00:56:55,064 - Requests2XX.Count:1|#Level:Host|#hostname:88665a372f4b.ant.amazon.com,timestamp:1675328214
2023-02-02T00:57:25,592 - Requests2XX.Count:1|#Level:Host|#hostname:88665a372f4b.ant.amazon.com,timestamp:1675328214
2023-02-02T00:57:29,198 - W-9001-resnet18_1.0.ms:3608|#Level:Host|#hostname:88665a372f4b.ant.amazon.com,timestamp:1675328249
2023-02-02T00:57:29,198 - W-9000-resnet18_1.0.ms:3609|#Level:Host|#hostname:88665a372f4b.ant.amazon.com,timestamp:1675328249
2023-02-02T00:57:29,199 - WorkerThreadTime.ms:24|#Level:Host|#hostname:88665a372f4b.ant.amazon.com,timestamp:1675328249
2023-02-02T00:57:29,199 - WorkerThreadTime.ms:24|#Level:Host|#hostname:88665a372f4b.ant.amazon.com,timestamp:1675328249
2023-02-02T00:58:45,775 - Requests2XX.Count:1|#Level:Host|#hostname:88665a372f4b.ant.amazon.com,timestamp:1675328214
2023-02-02T00:58:45,777 - QueueTime.ms:0|#Level:Host|#hostname:88665a372f4b.ant.amazon.com,timestamp:1675328325
2023-02-02T00:58:45,778 - WorkerThreadTime.ms:7|#Level:Host|#hostname:88665a372f4b.ant.amazon.com,timestamp:1675328325
2023-02-02T01:02:45,042 - Requests2XX.Count:1|#Level:Host|#hostname:88665a372f4b.ant.amazon.com,timestamp:1675328214
2023-02-02T01:02:45,042 - QueueTime.ms:0|#Level:Host|#hostname:88665a372f4b.ant.amazon.com,timestamp:1675328565

System metrics such as CPUUtilization and MemoryUtilization are not collected or reported.
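The same check can be automated rather than done by eye. The sketch below assumes the log line layout seen in the excerpt above (`<timestamp> - <Name>.<Unit>:<value>|#...`). The function names are hypothetical, and `SYSTEM_METRIC_NAMES` deliberately lists only the two metrics named here, so it would need extending for a real check.

```python
import re

# Illustrative subset only: the two system metrics named in this description.
SYSTEM_METRIC_NAMES = {"CPUUtilization", "MemoryUtilization"}

# Matches "<timestamp> - <Name>.<Unit>:<value>|..." as in the log excerpt.
LINE_PATTERN = re.compile(r"^\S+ - (?P<name>[^:|]+):(?P<value>[^|]+)\|")


def metric_names(log_text: str) -> set:
    """Collect metric names from a ts_metrics.log style text."""
    names = set()
    for line in log_text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            # Drop the unit suffix, e.g. "Requests2XX.Count" -> "Requests2XX".
            names.add(match.group("name").rsplit(".", 1)[0])
    return names


def reported_system_metrics(log_text: str) -> set:
    """Return the system metric names that appear in the log, if any."""
    return metric_names(log_text) & SYSTEM_METRIC_NAMES
```

With system metrics disabled, `reported_system_metrics(open("logs/ts_metrics.log").read())` would be expected to return an empty set.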

Manual test to verify config snapshot

$ cat custom_config.properties 
disable_system_metrics=true
$ ls ../model_store 
resnet-18-py-eager.mar
$ rm -rf logs
$ torchserve --start --model-store ../model_store --models all --ts-config custom_config.properties
.....
.....
Number of GPUs: 0
Number of CPUs: 12
Max heap size: 4096 M
Python executable: /Users/namannan/.pyenv/versions/3.8.13/bin/python
Config file: custom_config.properties
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: /Volumes/workplace/pytorch/model_store
Initial Models: all
Log dir: /Volumes/workplace/pytorch/serve/logs
Metrics dir: /Volumes/workplace/pytorch/serve/logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 12
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: false
Metrics report format: prometheus
Enable metrics API: true
Disable system metrics: true
Workflow Store: /Volumes/workplace/pytorch/model_store
Model config: N/A
.....
.....
$ curl -X GET "http://localhost:8081/models"
{
  "models": [
    {
      "modelName": "resnet-18-py-eager",
      "modelUrl": "resnet-18-py-eager.mar"
    }
  ]
}
$ torchserve --stop
$ torchserve --start --model-store ../model_store --models all
.....
.....
Number of GPUs: 0
Number of CPUs: 12
Max heap size: 4096 M
Python executable: /Users/namannan/.pyenv/versions/3.8.13/bin/python
Config file: logs/config/20230207163503771-shutdown.cfg
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: /Volumes/workplace/pytorch/model_store
Initial Models: all
Log dir: /Volumes/workplace/pytorch/serve/logs
Metrics dir: /Volumes/workplace/pytorch/serve/logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 12
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: false
Metrics report format: prometheus
Enable metrics API: true
Disable system metrics: true
Workflow Store: /Volumes/workplace/pytorch/model_store
Model config: N/A
2023-02-07T16:35:43,958 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Started restoring models from snapshot {
  "name": "20230207163503771-shutdown.cfg",
  "modelCount": 1,
  "created": 1675816503771,
  "models": {
    "resnet-18-py-eager": {
      "1.0": {
        "defaultVersion": true,
        "marName": "resnet-18-py-eager.mar",
        "minWorkers": 12,
        "maxWorkers": 12,
        "batchSize": 1,
        "maxBatchDelay": 100,
        "responseTimeout": 120
      }
    }
  }
}
.....
.....
$ curl -X GET "http://localhost:8081/models"
{
  "models": [
    {
      "modelName": "resnet-18-py-eager",
      "modelUrl": "resnet-18-py-eager.mar"
    }
  ]
}
$ torchserve --stop
$ torchserve --start --model-store ../model_store --models all
$ curl -X GET "http://localhost:8081/models"
{
  "models": [
    {
      "modelName": "resnet-18-py-eager",
      "modelUrl": "resnet-18-py-eager.mar"
    }
  ]
}
$ torchserve --stop
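The key observation in the transcript above is that "Disable system metrics: true" still appears after restarting from the shutdown snapshot. As a minimal sketch, that comparison could be scripted by parsing the "Key: value" lines of two startup banners. The helpers `parse_startup_banner` and `flag_persisted` are hypothetical names introduced for this example only.

```python
def parse_startup_banner(banner: str) -> dict:
    """Parse "Key: value" lines, splitting on the first ": " only
    so that values containing colons (such as URLs) stay intact."""
    settings = {}
    for line in banner.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def flag_persisted(first_banner: str, second_banner: str,
                   key: str = "Disable system metrics") -> bool:
    """Return True if `key` is present in both banners with the same value."""
    first = parse_startup_banner(first_banner).get(key)
    second = parse_startup_banner(second_banner).get(key)
    return first is not None and first == second
```

Here `first_banner` would be the output captured from the start with `--ts-config custom_config.properties`, and `second_banner` the output from the later start that loads the snapshot.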

Checklist:

Did you have fun?
Have you added tests that prove your fix is effective or that this feature works?
Has code been commented, particularly in hard-to-understand areas?
Have you made corresponding changes to the documentation?

codecov · 2023-02-02T19:30:33Z

Codecov Report

Merging #2104 (d87fab7) into master (30ec515) will not change coverage.
The diff coverage is n/a.

❗ Current head d87fab7 differs from pull request most recent head cb9bc10. Consider uploading reports for the commit cb9bc10 to get more accurate results

@@           Coverage Diff           @@
##           master    #2104   +/-   ##
=======================================
  Coverage   53.36%   53.36%           
=======================================
  Files          71       71           
  Lines        3225     3225           
  Branches       56       56           
=======================================
  Hits         1721     1721           
  Misses       1504     1504

test/pytest/test_metrics.py

lxning

Please attach the complete regression test log.
Check that the snapshot works correctly (i.e., run "torchserve --start --model-store xxx --models all" twice).

docs/configuration.md

namannandan · 2023-02-08T00:48:03Z

Added CPU and GPU regression logs in the summary. Also included the transcript for a manual test session to check that config snapshot works as expected.

Naman Nandan added 3 commits February 1, 2023 15:13

Add configuration option to disable system metrics

5da233a

Update documentation for disable system metrics configuration

f9b232c

Fix linter permalink issue

add tests

8efee82

namannandan marked this pull request as ready for review February 2, 2023 19:31

namannandan requested review from lxning, mreso, msaroufim and agunapal February 2, 2023 19:32

msaroufim requested changes Feb 2, 2023

test/pytest/test_metrics.py (outdated; resolved)

Improve tests to check for system metrics

3a00c1a

namannandan requested a review from msaroufim February 3, 2023 19:19

msaroufim approved these changes Feb 3, 2023

Merge branch 'master' into naman-system-metrics-flag

207f119

lxning reviewed Feb 7, 2023

docs/configuration.md (outdated; resolved)

Revert permalink in documentation

875917d

namannandan requested a review from lxning February 8, 2023 00:48

msaroufim and others added 2 commits February 10, 2023 11:00

Merge branch 'master' into naman-system-metrics-flag

82f3b1a

Merge branch 'master' into naman-system-metrics-flag

cb9bc10

agunapal approved these changes Feb 14, 2023

namannandan merged commit 7e65972 into pytorch:master Feb 14, 2023
