Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[spanmetrics] Need a feature to add Event attributes to dimensions for spanmetrics #27451

Closed
aishyandapalli opened this issue Oct 5, 2023 · 24 comments

Comments

@aishyandapalli
Copy link
Contributor

aishyandapalli commented Oct 5, 2023

Component(s)

No response

Is your feature request related to a problem? Please describe.

Business Justification - We have exception data recorded as span events. There are exception.type and exception.message recorded as span event attributes. There should be a way to count the total number of errors for each service name, operation, span kind, status code, exception.type and exception.message. So basically there should be a way to add exception.type and exception.message dimensions to RED metrics.

We should also be able to calculate the latency metric for the list of dimensions - service name, operation, span kind, status code, exception.type and exception.message

We would love to see a feature where we can add the exception data from Events to spanmetrics dimensions.

Describe the solution you'd like

We want a way to add exception data from span Events to spanmetric dimensions.

Describe alternatives you've considered

No response

Additional context

No response

@aishyandapalli aishyandapalli added enhancement New feature or request needs triage New item requiring triage labels Oct 5, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Oct 5, 2023

Pinging code owners for connector/spanmetrics: @albertteoh. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@djaglowski
Copy link
Member

This makes sense to me. WDYT @albertteoh?

@vjsamuel
Copy link

vjsamuel commented Oct 9, 2023

@djaglowski @albertteoh , few things that we need to figure out is how do we handle span metrics for spans that have more than one exception recorded. however, I would agree that being able to generate additional labels to record errors as labels and generate exemplars for the same would make the metrics being generated very very useful.

@albertteoh
Copy link
Contributor

I just want to be clear that the ask is to add opinionated support specifically for the exception.type and exception.message (as described in semantic conventions) attributes of span events (if any).

If that's the ask, I'm generally supportive of this new feature.

Additional questions:

  • Should it be an opt-in feature or always on? I'm okay either way, but slightly prefer the former only because it can save some cpu cycles iterating over events, while also reducing the number of new metrics created if the user won't use these labels.
  • Are there any requirements around a span event's Name()?

how do we handle span metrics for spans that have more than one exception recorded

I'm thinking we could assign each set of exception.type and exception.message to a new metric key (new set of dimensions), so a single span could potentially result in updates on multiple timeseries within a single metric name (e.g. calls).

@vjsamuel
Copy link

  • I think its a great idea to have this as an opt in feature given that it will be expensive to traverse records.
  • the question on creating unique metric keys for type and exception resulting in multiple time series would be, is it a fair expectation to say that count(calls) should be count of all spans? if yes then when we record each exception, if there is more than one in a span then that premise would be broken. same goes for latency if we record the same duration for all exceptions seen, we would end up with skewed histograms.

would love to hear your thoughts around the latter.

@vjsamuel
Copy link

vjsamuel commented Oct 10, 2023

or should we create an exceptions or errors metric alone given that we don't have the notion of a latency for such spans? what we care for the most is a series denoting the trends of errors and exemplars for the same.

@albertteoh
Copy link
Contributor

is it a fair expectation to say that count(calls) should be count of all spans?

Yeah, great point, @vjsamuel.

given that we don't have the notion of a latency for such spans?

Why wouldn't exception spans have a notion of latency, or perhaps I've misunderstood what you meant? I can imagine it would be quite useful to know; say if a request times out. This requirement was also mentioned in the description:

"We should also be able to calculate the latency metric for the list of dimensions - service name, operation, span kind, status code, exception.type and exception.message."

One alternative is for the spanmetrics connector to only select the first exception.type/message it sees and ignore any other exception events present in the span. The thinking behind that is that usually for a single unit of work (span), there should typically be a single error, if any, that would cause it to fail. It would be unusual for there to be more than one error IMO.

I'm interested to get your thoughts on this @vjsamuel.

Also @aishyandapalli, do you have any thoughts around how you'd like to see metrics for multiple exceptions per span in your use case?

@djaglowski
Copy link
Member

I just want to be clear that the ask is to add opinionated support specifically for the exception.type and exception.message (as described in semantic conventions) attributes of span events (if any).

I am admittedly not very tuned in to the details of this connector, but I'd like to encourage that we look first for solutions which are not hardcoded to specific attributes. To that end, is there strictly a need to count by exception.type and exception.message, or could these be configurable dimensions in a more generic solution?

@vjsamuel
Copy link

@djaglowski https://opentelemetry.io/docs/specs/otel/trace/semantic_conventions/exceptions/ records the specification of exceptions. they must have type and message. while i am open to configurability, we must mandate at least type. message can get tricky if the message has uniqueness associated to it.

@djaglowski
Copy link
Member

@vjsamuel, are there events which are not exceptions, which could use the same functionality but based on different attributes?

@vjsamuel
Copy link

maybe yes? :) @aishyandapalli and i were razor focused on the errors aspect of things but there could be value.

@albertteoh
Copy link
Contributor

How will we represent events in metrics if we support the more general case. Presumably, each attribute key-value would be included as part of metric's dimensions which would break this invariant called out by @vjsamuel:

is it a fair expectation to say that count(calls) should be count of all spans?

Perhaps we could set this expectation as a condition of enabling event attributes in spanmetrics; or maybe there is a solution to this problem that I haven't thought of?

@vjsamuel
Copy link

lets maybe create two new metrics events and exceptions. the former can have free form attribute support + whatever is defined for the span attributes. the latter is razor focused on type and message. we can allow users to decide if they want just type or type and message. this will keep the feature separate, not diminish the meaning of the current metrics and also allow users to draw graphs out of things found in span events. thoughts?

@albertteoh
Copy link
Contributor

I like the idea of creating new metrics (sorry, you mentioned it earlier in this thread); would we also need events_duration, etc. equivalents?

What is the benefit of having a separate exceptions metric?

@vjsamuel
Copy link

vjsamuel commented Oct 11, 2023

span events dont have any duration attached to them specifically. so we would only do counters. keeping the metric separate is just making the representation prettier. nothing else. also exceptions have dedicated semantic conventions.

@vjsamuel
Copy link

@albertteoh any thoughts on how we should move forward?

@albertteoh
Copy link
Contributor

@albertteoh any thoughts on how we should move forward?

We can go ahead with adding this feature. I propose starting with the following spec:

  1. Opt-in via configuration: events.enabled (defaults to false).
  2. If enabled, it will produce a new metric called events which will add a dimension value per configured selected event attribute key-value.
  3. The events dimensions must be identical in configuration (and its semantics) to the existing dimensions configuration used to select "top-level" span attributes.
  4. Similar to dimensions, if an event's attributes are repeated in another event with a different value, it will result in a new timeseries. It will also contain all of the span's "top-level" dimensions. Example:
    1. If the following events are present in a span:
      1. { "event0": { "attr": "val0" } }
      2. { "event1": { "attr": "val1"} }
    2. The following two timeseries will be created:
      1. events{service.name="foo",..., attr="val0"}
      2. events{service.name="foo",..., attr="val1"}
  5. dimensions must be non-empty if events.enabled is true. An empty dimensions list should return a configuration error.

The following example demonstrates all of the above spec:

connectors:
  spanmetrics:
    dimensions:
      - name: http.method
        default: GET
    ...
    events:
      enabled: true
      dimensions:
        - name: attr
          default: val
        - name: exceptions.type
        - name: exceptions.message 

I've considered including exceptions, however, I'm not yet convinced it's worth creating a new metric for it. If we want a count of errors, users should use the status.code dimension. Adding a new exceptions metric also either introduces data redundancy/duplication, or requires logic in code to specifically prevent this.

However, I'm open to being convinced otherwise. 😄

Please let me know what you think of that proposed spec.

@aishyandapalli
Copy link
Contributor Author

Wow, this is such a detailed explanation. Thanks for that @albertteoh

@vjsamuel for final approval on the above logic

@crobert-1 crobert-1 removed the needs triage New item requiring triage label Oct 13, 2023
@vjsamuel
Copy link

@aishyandapalli, @albertteoh owns this :) he is the final approver. i think the proposal works and we should build it out quickly and submit a PR.

@aishyandapalli
Copy link
Contributor Author

@vjsamuel I meant for our use case :) I will start working on this

@vjsamuel
Copy link

@albertteoh is it possible today for calls and in future events to have exemplars attached to them? if not @aishyandapalli can add support for that as well.

@albertteoh
Copy link
Contributor

@albertteoh is it possible today for calls and in future events to have exemplars attached to them? if not @aishyandapalli can add support for that as well.

I think so, I can't see a reason why we shouldn't add exemplars to calls and events.

At the time when it was added to spanmetrics processor, the focus was just on histograms. cc @anneelisabethlelievre

Thank you very much @vjsamuel @aishyandapalli! 🙏🏼

djaglowski pushed a commit that referenced this issue Oct 31, 2023
… of configured event attributes for a span (#27811)

**Description:**
We have an events section for a span. The details for all the exceptions
like exception.type and exception.message are recorded as Events for a
span. Right now, we don't have a feature to add event attributes to span
metrics.

The idea of this PR is to develop a feature which adds a new metric
`events_total` with a default set of dimensions like `service_name,
span_name, span_kind, status_code`. We can configure to add additional
set of dimensions like `exception.type` and `exception.message` which
will be fetched from the Events section for a span

**Link to tracking Issue:**
[27451](#27451)

---------

Co-authored-by: Albert <26584478+albertteoh@users.noreply.github.com>
jmsnll pushed a commit to jmsnll/opentelemetry-collector-contrib that referenced this issue Nov 12, 2023
… of configured event attributes for a span (open-telemetry#27811)

**Description:**
We have an events section for a span. The details for all the exceptions
like exception.type and exception.message are recorded as Events for a
span. Right now, we don't have a feature to add event attributes to span
metrics.

The idea of this PR is to develop a feature which adds a new metric
`events_total` with a default set of dimensions like `service_name,
span_name, span_kind, status_code`. We can configure to add additional
set of dimensions like `exception.type` and `exception.message` which
will be fetched from the Events section for a span

**Link to tracking Issue:**
[27451](open-telemetry#27451)

---------

Co-authored-by: Albert <26584478+albertteoh@users.noreply.github.com>
RoryCrispin pushed a commit to ClickHouse/opentelemetry-collector-contrib that referenced this issue Nov 24, 2023
… of configured event attributes for a span (open-telemetry#27811)

**Description:**
We have an events section for a span. The details for all the exceptions
like exception.type and exception.message are recorded as Events for a
span. Right now, we don't have a feature to add event attributes to span
metrics.

The idea of this PR is to develop a feature which adds a new metric
`events_total` with a default set of dimensions like `service_name,
span_name, span_kind, status_code`. We can configure to add additional
set of dimensions like `exception.type` and `exception.message` which
will be fetched from the Events section for a span

**Link to tracking Issue:**
[27451](open-telemetry#27451)

---------

Co-authored-by: Albert <26584478+albertteoh@users.noreply.github.com>
Copy link
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@github-actions github-actions bot added the Stale label Dec 13, 2023
@djaglowski
Copy link
Member

Closed by #27811

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants