
[connector/spanmetrics] Add maximum span duration metric #31885

Open
swar8080 opened this issue Mar 21, 2024 · 8 comments
Labels
connector/spanmetrics · enhancement · needs triage · Stale

Comments

@swar8080 (Contributor)

Component(s)

connector/spanmetrics

Is your feature request related to a problem? Please describe.

Hello, adding a duration_max span metric would be helpful for finding latency anomalies.

Describe the solution you'd like

The connector flushes metrics at a configurable interval, so maybe it could emit a gauge with the highest latency seen since the last flush.

For cumulative span metrics, I suppose it could flush a value of zero if no spans are seen in between flushes? For delta span metrics it would not emit anything in that case?

I'm also assuming a new configuration option would be useful for disabling this metric.
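For illustration only, the opt-out could be a new field on the connector's config struct. The field name and mapstructure tag below are hypothetical placeholders, not an existing option:

```go
// Hypothetical sketch of the proposed opt-out; the field name and tag are
// placeholders, not part of the connector's current Config struct.
type Config struct {
	// ... existing spanmetrics options ...

	// ExcludeMaxDuration disables the proposed duration_max gauge.
	ExcludeMaxDuration bool `mapstructure:"exclude_max_duration"`
}
```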

Describe alternatives you've considered

Let me know if there's an alternative.

Additional context

I can implement this change

@swar8080 added the enhancement and needs triage labels on Mar 21, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@crobert-1 (Member)

Span duration metrics are currently exported from the connector in the form of histograms with buckets that can be modified as users see fit. Is there a reason this is insufficient for your use case?

@swar8080 (Contributor, Author)

Hi @crobert-1, we are running the gateway collector set-up, where one span metrics configuration produces span metrics for dozens of applications. That makes it harder to pick a set of histogram buckets that works for all use cases. For example, a 30 second bucket may be sufficient for identifying long-running API call spans but not for a batch job's spans that take several minutes. We also don't get the exact latency that a max metric would give, only that the latency falls somewhere between two bucket boundaries.

That said, we're switching from an observability vendor that calculated the max automatically, so we may just be used to relying on max in cases where the imprecision of histograms would work fine in practice.

@crobert-1 (Member)

Would the idea be to then have a datapoint within the metric representing each application that's sending data in?

That generally makes sense to me, but one question I have is how much effort is required to allow the connector to export metrics that aren't histograms. From the README, it looks like metrics are all output as histograms, so I'm wondering what this change would entail from that perspective. (Correct me if I'm wrong here and it's already exporting gauges.)

@swar8080 (Contributor, Author)

> Would the idea be to then have a datapoint within the metric representing each application that's sending data in?

I was picturing it working like the existing calls counter span metrics. Just as calls creates a counter for each unique combination of span attributes seen (span name, span kind, status code, etc.), a corresponding maximum gauge metric could be created too. The gauge would track the maximum latency seen for spans with those same attributes within the metrics_flush_interval.
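As a rough sketch of that idea (the names are illustrative, not the connector's actual internals), the maximum could be tracked per metric key alongside the existing calls counter:

```go
// Illustrative only: the real connector keys its calls counters on the
// resource and span attributes; the same key could index a max tracker.
type metricKey string // e.g. a hash of service.name, span.name, span.kind, status.code

type maxTracker struct {
	// Highest duration (in seconds) seen per key since the last flush.
	maxSeconds map[metricKey]float64
}

func (t *maxTracker) observe(key metricKey, durationSeconds float64) {
	if current, ok := t.maxSeconds[key]; !ok || durationSeconds > current {
		t.maxSeconds[key] = durationSeconds
	}
}
```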

> That generally makes sense to me, but one question I have is how much effort is required to allow the connector to export metrics that aren't histograms

The control flow seems mostly the same as the calls metric's, adapted for gauge output and maximum tracking. That might still be a couple hundred lines of non-test code.

The tasks coming to mind are (a rough sketch follows the list):

  • Where the count for the calls metric is incremented, also determine the new maximum for those span attributes. Only exemplars for spans with the maximum latency would be kept.
  • When flushing metrics to the exporters, convert the struct used to track the maximums into gauge metrics.
  • Reset the maximums after flushing, so a new maximum is tracked for the next interval and the metric is omitted from flushing until a new matching span arrives.
  • Add configuration to opt out of the maximum metric.
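For the flush and reset steps, a minimal sketch using the collector's pmetric API could look roughly like this. The metric name, the maxTracker type from the earlier sketch, and the omitted attribute/timestamp handling are all assumptions, not the actual connector code:

```go
import "go.opentelemetry.io/collector/pdata/pmetric"

// buildMaxGauge converts the tracked maximums into gauge datapoints and then
// resets the tracker so the next flush interval starts fresh.
func (t *maxTracker) buildMaxGauge(sm pmetric.ScopeMetrics) {
	m := sm.Metrics().AppendEmpty()
	m.SetName("duration_max") // hypothetical metric name
	gauge := m.SetEmptyGauge()
	for key, maxSeconds := range t.maxSeconds {
		dp := gauge.DataPoints().AppendEmpty()
		dp.SetDoubleValue(maxSeconds)
		_ = key // the real code would copy the key's attributes and timestamps onto dp
	}
	// Reset so a key is only emitted again once a new matching span arrives.
	t.maxSeconds = make(map[metricKey]float64)
}
```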

@portertech (Contributor)

I suppose it would be too cumbersome to support specific service (and/or span type) explicit histogram bucket configuration?

@swar8080 (Contributor, Author)

swar8080 commented Apr 10, 2024

> I suppose it would be too cumbersome to support specific service (and/or span type) explicit histogram bucket configuration?

That's a workaround, but it creates more effort for an observability team to maintain and to explain to other teams. In our case we're using the centralized/gateway collector set-up instead of sidecars, so I suppose each custom request from another team would need a new filter, spanmetrics component, and pipeline set-up.


This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.
