
[connector/spanmetrics] Add maximum span duration metric #31885

Open
swar8080 opened this issue Mar 21, 2024 · 8 comments
Labels
connector/spanmetrics · enhancement · needs triage · Stale

Comments

@swar8080 (Contributor)

Component(s)

connector/spanmetrics

Is your feature request related to a problem? Please describe.

Hello, adding a duration_max span metric would be helpful for finding latency anomalies.

Describe the solution you'd like

The connector flushes metrics at a configurable interval, so maybe it could emit a gauge with the highest latency seen since the last flush.

For cumulative span metrics, I suppose it could flush a value of zero if no spans are seen in between flushes? For delta span metrics it would not emit anything in that case?

I'm also assuming a new configuration option would be useful for disabling this metric.
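For illustration only, the opt-out could be a new field on the connector's config struct. The field name and mapstructure tag below are hypothetical placeholders, not an existing option:

```go
// Hypothetical sketch of the proposed opt-out; the field name and tag are
// placeholders, not part of the connector's current Config struct.
type Config struct {
	// ... existing spanmetrics options ...

	// ExcludeMaxDuration disables the proposed duration_max gauge.
	ExcludeMaxDuration bool `mapstructure:"exclude_max_duration"`
}
```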

Describe alternatives you've considered

Let me know if there's an alternative.

Additional context

I can implement this change

@swar8080 added the enhancement and needs triage labels on Mar 21, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@crobert-1 (Member)

Span duration metrics are currently exported from the connector in the form of histograms with buckets that can be modified as users see fit. Is there a reason this is insufficient for your use case?

@swar8080 (Contributor, Author)

Hi @crobert-1, we are running the gateway collector set-up, where one span metrics configuration produces span metrics for dozens of applications. That makes it harder to pick a set of histogram buckets that works for all use cases. For example, a 30 second bucket may be sufficient for identifying long-running API call spans but not for a batch job's spans that take several minutes. We also don't get the exact latency that a max metric would give, only that the latency falls somewhere between two bucket boundaries.

That said, we're switching from an observability vendor that calculated the max automatically, so we may just be used to relying on max in cases where the imprecision of histograms would work fine in practice.

@crobert-1 (Member)

Would the idea be to then have a datapoint within the metric representing each application that's sending data in?

That generally makes sense to me, but one question I have is how much effort is required to allow the connector to export metrics that aren't histograms. From the README, it looks like metrics are all output as histograms, so I'm wondering what this change would entail from that perspective. (Correct me if I'm wrong here and it's already exporting gauges.)

@swar8080 (Contributor, Author)

> Would the idea be to then have a datapoint within the metric representing each application that's sending data in?

I was picturing it working like the existing calls counter span metrics. Just as calls creates a counter for each unique combination of span attributes seen (span name, span kind, status code, etc.), a corresponding maximum gauge metric could be created too. The gauge would track the maximum latency seen for spans with those same attributes within the metrics_flush_interval.
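As a rough sketch of that idea (the names are illustrative, not the connector's actual internals), the maximum could be tracked per metric key alongside the existing calls counter:

```go
// Illustrative only: the real connector keys its calls counters on the
// resource and span attributes; the same key could index a max tracker.
type metricKey string // e.g. a hash of service.name, span.name, span.kind, status.code

type maxTracker struct {
	// Highest duration (in seconds) seen per key since the last flush.
	maxSeconds map[metricKey]float64
}

func (t *maxTracker) observe(key metricKey, durationSeconds float64) {
	if current, ok := t.maxSeconds[key]; !ok || durationSeconds > current {
		t.maxSeconds[key] = durationSeconds
	}
}
```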

> That generally makes sense to me, but one question I have is how much effort is required to allow the connector to export metrics that aren't histograms

The control flow seems mostly the same as the calls metric's, adapted for gauge output and maximum tracking. That might still be a couple hundred lines of non-test code.

The tasks coming to mind are (a rough sketch follows the list):

  • Where the count for the calls metric is incremented, also determine the new maximum for those span attributes. Only exemplars for spans with the maximum latency would be kept.
  • When flushing metrics to the exporters, convert the struct used to track the maximums into gauge metrics.
  • Reset the maximums after flushing, so a new maximum is tracked for the next interval and the metric is omitted from flushing until a new matching span arrives.
  • Add configuration to opt out of the maximum metric.
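For the flush and reset steps, a minimal sketch using the collector's pmetric API could look roughly like this. The metric name, the maxTracker type from the earlier sketch, and the omitted attribute/timestamp handling are all assumptions, not the actual connector code:

```go
import "go.opentelemetry.io/collector/pdata/pmetric"

// buildMaxGauge converts the tracked maximums into gauge datapoints and then
// resets the tracker so the next flush interval starts fresh.
func (t *maxTracker) buildMaxGauge(sm pmetric.ScopeMetrics) {
	m := sm.Metrics().AppendEmpty()
	m.SetName("duration_max") // hypothetical metric name
	gauge := m.SetEmptyGauge()
	for key, maxSeconds := range t.maxSeconds {
		dp := gauge.DataPoints().AppendEmpty()
		dp.SetDoubleValue(maxSeconds)
		_ = key // the real code would copy the key's attributes and timestamps onto dp
	}
	// Reset so a key is only emitted again once a new matching span arrives.
	t.maxSeconds = make(map[metricKey]float64)
}
```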

@portertech (Contributor)

I suppose it would be too cumbersome to support specific service (and/or span type) explicit histogram bucket configuration?

@swar8080 (Contributor, Author)

swar8080 commented Apr 10, 2024

> I suppose it would be too cumbersome to support specific service (and/or span type) explicit histogram bucket configuration?

That's a workaround, but it creates more effort for an observability team to maintain and to explain to other teams. In our case we're using the centralized/gateway collector set-up instead of sidecars, so I suppose each custom request from another team would need a new filter, spanmetrics component, and pipeline set-up.


This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.
