Auto detect backend metrics not defined in metrics configuration #2769

Merged: 25 commits, Nov 21, 2023

@namannandan namannandan commented Nov 6, 2023

Description

Enable auto detection of backend metrics that are not defined in the metrics configuration file.

To maintain backwards compatibility with versions <=0.6.0, set a request ID for model load requests so that metrics updated in the handler's initialize method include both the ModelName and Level dimensions.

Fixes: #2747, #2772

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

  • Unit and integration tests included in this PR
  • Manual testing with custom metrics example updated in this PR
$ cd examples/custom_metrics
$ cat metrics.yaml
dimensions:
  - &model_name "ModelName"
  - &worker_name "WorkerName"
  - &level "Level"
  - &device_id "DeviceId"
  - &hostname "Hostname"

ts_metrics:
  counter:
    - name: Requests2XX
      unit: Count
      dimensions: [*level, *hostname]
    - name: Requests4XX
      unit: Count
      dimensions: [*level, *hostname]
    - name: Requests5XX
      unit: Count
      dimensions: [*level, *hostname]
    - name: ts_inference_requests_total
      unit: Count
      dimensions: ["model_name", "model_version", "hostname"]
    - name: ts_inference_latency_microseconds
      unit: Microseconds
      dimensions: ["model_name", "model_version", "hostname"]
    - name: ts_queue_latency_microseconds
      unit: Microseconds
      dimensions: ["model_name", "model_version", "hostname"]
  gauge:
    - name: QueueTime
      unit: Milliseconds
      dimensions: [*level, *hostname]
    - name: WorkerThreadTime
      unit: Milliseconds
      dimensions: [*level, *hostname]
    - name: WorkerLoadTime
      unit: Milliseconds
      dimensions: [*worker_name, *level, *hostname]
    - name: CPUUtilization
      unit: Percent
      dimensions: [*level, *hostname]
    - name: MemoryUsed
      unit: Megabytes
      dimensions: [*level, *hostname]
    - name: MemoryAvailable
      unit: Megabytes
      dimensions: [*level, *hostname]
    - name: MemoryUtilization
      unit: Percent
      dimensions: [*level, *hostname]
    - name: DiskUsage
      unit: Gigabytes
      dimensions: [*level, *hostname]
    - name: DiskUtilization
      unit: Percent
      dimensions: [*level, *hostname]
    - name: DiskAvailable
      unit: Gigabytes
      dimensions: [*level, *hostname]
    - name: GPUMemoryUtilization
      unit: Percent
      dimensions: [*level, *device_id, *hostname]
    - name: GPUMemoryUsed
      unit: Megabytes
      dimensions: [*level, *device_id, *hostname]
    - name: GPUUtilization
      unit: Percent
      dimensions: [*level, *device_id, *hostname]

model_metrics:
  # Dimension "Hostname" is automatically added for model metrics in the backend
  # "HandlerMethodTime" and "ExamplePercentMetric" metrics are not defined here to show auto-detection of backend metrics
  counter:
    - name: InferenceRequestCount
      unit: count
      dimensions: []
    - name: InitializeCallCount
      unit: count
      dimensions: [*model_name, *level]
    - name: PreprocessCallCount
      unit: count
      dimensions: [*model_name]
    - name: PostprocessCallCount
      unit: CallCount
      dimensions: [*model_name, *level]
  gauge:
    - name: HandlerTime
      unit: ms
      dimensions: [*model_name, *level]
    - name: PredictionTime
      unit: ms
      dimensions: [*model_name, *level]
    - name: RequestBatchSize
      unit: count
      dimensions: ["ModelName"]
    - name: SizeOfImage
      unit: kB
      dimensions: [*model_name, *level]

Note that HandlerMethodTime and ExamplePercentMetric are not defined in the metrics configuration file above.

$ torchserve --ncs --start --model-store model_store --ts-config examples/custom_metrics/config.properties --models mnist=mnist.mar
$ curl http://127.0.0.1:8080/predictions/mnist -T examples/image_classifier/mnist/test_data/0.png
$ curl http://127.0.0.1:8082/metrics
.....
.....
# HELP HandlerMethodTime Torchserve prometheus gauge metric with unit: Milliseconds
# TYPE HandlerMethodTime gauge
HandlerMethodTime{MethodName="preprocess",ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 1.291036605834961
.....
.....
# HELP ExamplePercentMetric Torchserve prometheus histogram metric with unit: Percent
# TYPE ExamplePercentMetric histogram
ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="0.005",} 0.0
ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="0.01",} 0.0
ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="0.025",} 0.0
ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="0.05",} 0.0
ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="0.075",} 0.0
ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="0.1",} 0.0
ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="0.25",} 0.0
ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="0.5",} 0.0
ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="0.75",} 0.0
ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="1.0",} 0.0
ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="2.5",} 0.0
ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="5.0",} 0.0
ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="7.5",} 0.0
ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="10.0",} 0.0
ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="+Inf",} 3.0
ExamplePercentMetric_count{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 3.0
ExamplePercentMetric_sum{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 150.0
.....
.....

As can be seen above, HandlerMethodTime and ExamplePercentMetric are auto-detected, registered with the Prometheus client, and updated.
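The auto-detection flow can be pictured with a small, self-contained sketch. This is not TorchServe code: `ToyMetricCache`, its fields, and its `update` method are hypothetical names used only to illustrate the idea that a metric missing from the configuration is registered the first time it is seen, using the type and dimension names carried by that first update.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMetricCache:
    """Hypothetical stand-in for a frontend metric cache (not TorchServe code)."""

    # metric name -> (metric type, dimension names), as if loaded from a config file
    registered: dict = field(default_factory=dict)
    # (metric name, dimension values) -> current value
    values: dict = field(default_factory=dict)

    def update(self, name, metric_type, dimensions, value):
        """Record a value, registering the metric first if it is unknown."""
        if name not in self.registered:
            # Auto-detection: adopt the type and dimension names seen on first update.
            self.registered[name] = (metric_type, tuple(dimensions))
        key = (name, tuple(dimensions.values()))
        if metric_type == "counter":
            # Counters accumulate.
            self.values[key] = self.values.get(key, 0.0) + value
        else:
            # Gauges keep the latest observation.
            self.values[key] = value


# Only HandlerTime is "configured"; HandlerMethodTime is not.
cache = ToyMetricCache(registered={"HandlerTime": ("gauge", ("ModelName", "Level"))})
dims = {"MethodName": "preprocess", "ModelName": "mnist", "Level": "Model"}
cache.update("HandlerMethodTime", "gauge", dims, 1.29)
```

After the call, `HandlerMethodTime` appears in `cache.registered` even though it was never listed up front, which mirrors what the Prometheus output above shows for the real server.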

@namannandan namannandan marked this pull request as ready for review November 7, 2023 21:53
Comment on lines 108 to 111
for (Dimension dimension : parsedMetric.getDimensions()) {
    dimensionNames.add(dimension.getName());
}

Collaborator

Why is it necessary to copy into a new list? Is the return value of getDimensions immutable?

Collaborator Author

@namannandan namannandan Nov 11, 2023

parsedMetric is an instance of Metric object but while creating a new IMetric object using MetricBuilder.build, we need to provide a list of dimension names, hence the new list.

No, the return value of getDimensions is not immutable.

Collaborator

Then why can't parsedMetric.getDimensions().add("Hostname") work?

Collaborator Author

parsedMetric is an instance of Metric. parsedMetric.getDimensions() returns a list of Dimension objects.

Therefore the new lists are being created above.

Also, the Metric class API returns references to private objects which should ideally be immutable. So, I believe it is better to create a new list like we are doing here.
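The second point, that a getter handing out a reference to private state invites accidental mutation, can be shown with a short sketch. It is written in Python for brevity although the class under review is Java, and `LeakyMetric` and its method are hypothetical names, not part of the TorchServe code base.

```python
class LeakyMetric:
    """Hypothetical class whose getter hands out its private list."""

    def __init__(self, dimension_names):
        self._dimension_names = list(dimension_names)

    def get_dimension_names(self):
        # Returns the very list object the instance keeps internally.
        return self._dimension_names


metric = LeakyMetric(["ModelName", "Level"])

# Mutating the returned reference silently changes the metric itself.
metric.get_dimension_names().append("Hostname")
leaked = list(metric.get_dimension_names())

# Copying first keeps the addition local to the caller.
safe_metric = LeakyMetric(["ModelName", "Level"])
names = list(safe_metric.get_dimension_names())
names.append("Hostname")
```

In the first case the metric now carries an extra dimension name that no one explicitly set on it; in the second, only the caller's copy changes.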

Collaborator

A deep copy takes time. The Metric object will eventually be garbage collected if it is used only once here and not referenced afterwards. A shallow copy of the dimension names is fine, and faster, if Metric is only used here to build the IMetric.

Collaborator Author

Added getDimensionNames and getDimensionValues utility functions to Metric class.

Comment on lines 324 to 331
List<String> dimensionValues = new ArrayList<String>();
for (Dimension dimension : parsedMetric.getDimensions()) {
    dimensionValues.add(dimension.getValue());
}
// Hostname is added as a dimension by default to backend metrics
dimensionValues.add(parsedMetric.getHostName());

this.metricCache

Collaborator

ditto

Comment on lines 131 to 135
# Backwards Compatibility with releases <=0.6.0
# Request ID is not set for model load requests
# TODO: UUID serves as a temporary request ID for model load requests
self.metrics_cache.set_request_ids(str(uuid.uuid4()))

Collaborator

This only initializes metrics_cache in the service; it does not deal with real model load metrics. The UUID should be set within the load function. As far as I know, there are no model loading metrics so far.

Collaborator Author

Moved logic for request id assignment for model load requests from model_service_worker.py to model_loader.py.

That's correct, we don't have any default metrics that we emit during model load time, but a custom handler can emit metrics in the initialize method. If the request id is not set, it can cause the metric to only have Level:Error as the default dimension that gets added instead of ModelName:<model_name> and Level:Model. Described the issue here: #2772
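A toy model of the behavior described above may help. Everything here is an assumption made for illustration: `default_dimensions` and `load_model` are hypothetical functions, and the rule they encode is taken only from the description in this comment, not from the TorchServe implementation.

```python
import uuid


def default_dimensions(model_name, request_id):
    """Toy rule mirroring the behavior described in the comment above."""
    if not request_id:
        # No request ID: only an error-level dimension is attached.
        return {"Level": "Error"}
    return {"ModelName": model_name, "Level": "Model"}


def load_model(model_name, request_id=None):
    """Hypothetical load step that falls back to a UUID as a temporary request ID."""
    request_id = request_id or str(uuid.uuid4())
    # A handler's initialize() could emit metrics at this point; in this toy
    # model they would pick up the dimensions returned below.
    return default_dimensions(model_name, request_id)


without_id = default_dimensions("mnist", None)
with_id = load_model("mnist")
```

Here `without_id` carries only `Level: Error`, while `with_id` carries `ModelName: mnist` and `Level: Model`, which is the difference the temporary UUID is meant to achieve.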

Collaborator

The UUID for model loading should be set in the load function, not here.

  this.hostName = hostName;
- this.dimensions = Arrays.asList(dimensions);
+ this.setDimensions(Arrays.asList(dimensions));

Collaborator

Why does the original "this.dimensions = Arrays.asList(dimensions);" need to be replaced with setDimensions?

Collaborator Author

This is because setDimensions now takes care of setting up the values of dimensionNames and dimensionsValues as well to support the getDimensionNames and getDimensionValues methods.
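The pattern described here, a single setter that keeps derived name and value lists in sync with the dimensions, might look roughly like the following. It is a Python stand-in for the Java class; the attribute names and the choice to return copies from the getters are assumptions, not details confirmed by this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    value: str


class Metric:
    """Python stand-in for the Java Metric class; field names are assumptions."""

    def __init__(self, dimensions):
        # Route construction through the setter so derived lists are always built.
        self.set_dimensions(dimensions)

    def set_dimensions(self, dimensions):
        # Derive names and values every time the dimensions are replaced,
        # so the three views cannot drift apart.
        self._dimensions = list(dimensions)
        self._dimension_names = [d.name for d in self._dimensions]
        self._dimension_values = [d.value for d in self._dimensions]

    def get_dimension_names(self):
        return list(self._dimension_names)

    def get_dimension_values(self):
        return list(self._dimension_values)


metric = Metric([Dimension("ModelName", "mnist"), Dimension("Level", "Model")])
metric.set_dimensions([Dimension("Level", "Host")])
```

Assigning the field directly in the constructor would leave the name and value lists unset, which is why the constructor calls the setter in this sketch.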

Collaborator

got it. Thanks

@namannandan namannandan added this pull request to the merge queue Nov 21, 2023
Merged via the queue into pytorch:master with commit 98c19bc Nov 21, 2023
13 checks passed
@chauhang chauhang added this to the v0.10.0 milestone Feb 27, 2024