Refactor tail-sampling processor - Decision cache #31583

jpkrohling · 2024-03-05T11:58:21Z

The decision cache can be a regular cache, where the key is the trace ID and the value is the decision. We can use a ring buffer to limit the number of entries in this cache, but this can be an implementation detail. Whenever a new span arrives, we'd run it through the cache and immediately release or drop it based on a previous decision.
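A minimal sketch of the idea above, assuming a fixed-size ring buffer of keys sitting next to a map from trace ID to decision. The names `TraceID`, `Decision`, and `DecisionCache` are hypothetical and are not taken from the processor's code.

```go
package main

import "fmt"

// TraceID is a hypothetical stand-in for the collector's trace ID type.
type TraceID [16]byte

// Decision is the sampling outcome remembered for a trace.
type Decision bool

const (
	Drop Decision = false
	Keep Decision = true
)

// DecisionCache maps trace IDs to decisions. A ring buffer of keys records
// insertion order so the oldest entry is overwritten once the cache is full.
type DecisionCache struct {
	decisions map[TraceID]Decision
	ring      []TraceID
	next      int  // slot the next insertion writes to
	full      bool // true once every slot has been written
}

// NewDecisionCache assumes size > 0.
func NewDecisionCache(size int) *DecisionCache {
	return &DecisionCache{
		decisions: make(map[TraceID]Decision, size),
		ring:      make([]TraceID, size),
	}
}

// Put records a decision, evicting the oldest trace ID when at capacity.
func (c *DecisionCache) Put(id TraceID, d Decision) {
	if _, ok := c.decisions[id]; ok {
		c.decisions[id] = d // already tracked; keep its ring slot
		return
	}
	if c.full {
		delete(c.decisions, c.ring[c.next])
	}
	c.ring[c.next] = id
	c.decisions[id] = d
	c.next = (c.next + 1) % len(c.ring)
	if c.next == 0 {
		c.full = true
	}
}

// Get reports the cached decision for id, if there is one.
func (c *DecisionCache) Get(id TraceID) (Decision, bool) {
	d, ok := c.decisions[id]
	return d, ok
}

func main() {
	cache := NewDecisionCache(2)
	a, b, c := TraceID{1}, TraceID{2}, TraceID{3}
	cache.Put(a, Keep)
	cache.Put(b, Drop)
	cache.Put(c, Keep) // the cache is full, so this overwrites a

	for _, id := range []TraceID{a, b, c} {
		d, ok := cache.Get(id)
		fmt.Printf("trace %d: cached=%v keep=%v\n", id[0], ok, d)
	}
}
```

With this layout a span for an unknown trace ID falls through to the normal buffering path, while a hit lets the processor release or drop the span right away.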

github-actions · 2024-03-05T12:06:17Z

Pinging code owners for processor/tailsampling: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself.

kentquirk · 2024-03-05T15:42:49Z

Drop decisions only need a single bit -- "we dropped this", so there's benefit to splitting the decision cache into a Bloom or Cuckoo filter for drops and a separate cache for keep (attaching metadata to spans like sampling probability may require the keep cache to be more than a filter).

Cache size should be configurable.
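To illustrate the single-bit idea for drops, here is a rough Bloom filter sketch with a configurable number of bits. It is an assumption-laden illustration, not Refinery's or the collector's implementation: `DropFilter` and its methods are hypothetical names, and FNV-1a with double hashing is just one convenient choice from the Go standard library.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// DropFilter is a minimal Bloom filter over trace IDs. It answers either
// "definitely not dropped" or "probably dropped" using a fixed bit budget.
type DropFilter struct {
	bits   []uint64
	nbits  uint64
	hashes uint64
}

// NewDropFilter allocates a filter with nbits bits (rounded up to a multiple
// of 64) and k probes per trace ID. Both values are the configurable part.
func NewDropFilter(nbits, k uint64) *DropFilter {
	words := (nbits + 63) / 64
	return &DropFilter{bits: make([]uint64, words), nbits: words * 64, hashes: k}
}

// positions derives k bit positions from two base hashes (double hashing).
func (f *DropFilter) positions(traceID []byte) []uint64 {
	h := fnv.New64a()
	h.Write(traceID)
	h1 := h.Sum64()
	h.Write([]byte{0xff}) // extend the input to obtain a second hash
	h2 := h.Sum64() | 1   // odd step, so probes do not collapse onto one bit
	out := make([]uint64, f.hashes)
	for i := range out {
		out[i] = (h1 + uint64(i)*h2) % f.nbits
	}
	return out
}

// Add marks a trace ID as dropped.
func (f *DropFilter) Add(traceID []byte) {
	for _, p := range f.positions(traceID) {
		f.bits[p/64] |= uint64(1) << (p % 64)
	}
}

// MayContain returns false only if the trace ID was never added.
// A true result can be a false positive.
func (f *DropFilter) MayContain(traceID []byte) bool {
	for _, p := range f.positions(traceID) {
		if f.bits[p/64]&(uint64(1)<<(p%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	filter := NewDropFilter(1<<16, 4)
	dropped := []byte{0xde, 0xad, 0xbe, 0xef}
	filter.Add(dropped)
	fmt.Println("dropped trace seen before:", filter.MayContain(dropped))
}
```

A false positive here means a span is dropped even though its trace was never decided, which is why the filter size matters. The keep side would still need a real cache if metadata such as sampling probability has to travel with the decision.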

djluck · 2024-03-05T21:48:09Z

@kentquirk can you elaborate a bit more on why we'd need to capture sampling probability per-trace? My understanding is that by entering a trace id into the "sampled" cache we're stating that all spans for this trace should be sampled.

jpkrohling · 2024-03-06T09:29:56Z

@kentquirk, do you want to work on this one? I can assign this to you. I'd like to work on converting the metrics to use OTel, which could be helpful in adding observability to the cache.

kentquirk · 2024-03-06T15:01:57Z

@djluck You're correct that the probability applies to all spans in a trace, but sampling probability / adjusted count is of interest when analyzing the results. When I see a trace I want to know "how many traces like this occurred?" For a simple example, if I normally sample at 10% but keep all errors, I need to be able to multiply non-error traces by 10 in order to calculate the appropriate fraction of errors. To do that, I need to know what the sampling probability was. That data needs to be attached to the outgoing spans.

@jpkrohling You can assign it to me. It might end up getting split into a couple of pieces.
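The adjusted-count arithmetic described above can be worked through in a short sketch. The population numbers are made up for illustration, and `SampledTrace` and `ErrorFraction` are hypothetical names rather than anything the processor exposes.

```go
package main

import "fmt"

// SampledTrace is a kept trace plus the probability with which it was kept.
type SampledTrace struct {
	IsError     bool
	Probability float64 // in (0, 1]
}

// AdjustedCount is the number of original traces one kept trace represents.
func (t SampledTrace) AdjustedCount() float64 {
	return 1 / t.Probability
}

// ErrorFraction estimates the share of error traces in the original
// population by weighting every kept trace with its adjusted count.
func ErrorFraction(kept []SampledTrace) float64 {
	var errs, total float64
	for _, t := range kept {
		w := t.AdjustedCount()
		total += w
		if t.IsError {
			errs += w
		}
	}
	if total == 0 {
		return 0
	}
	return errs / total
}

func main() {
	// Hypothetical population: 1000 traces, 20 of them errors.
	// All errors are kept (probability 1); non-errors are sampled at 10%,
	// so roughly 98 of the 980 non-error traces survive.
	var kept []SampledTrace
	for i := 0; i < 20; i++ {
		kept = append(kept, SampledTrace{IsError: true, Probability: 1})
	}
	for i := 0; i < 98; i++ {
		kept = append(kept, SampledTrace{Probability: 0.1})
	}

	naive := 20.0 / float64(len(kept))
	fmt.Printf("naive error fraction:    %.1f%%\n", naive*100)
	fmt.Printf("adjusted error fraction: %.1f%%\n", ErrorFraction(kept)*100)
}
```

Counting kept traces directly gives 20 of 118, about 16.9%, while weighting each non-error trace by 10 recovers the original 20 of 1000, or 2.0%. That weighting is only possible if the probability is attached to the outgoing spans.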

github-actions · 2024-05-13T03:29:16Z

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

processor/tailsampling: @jpkrohling

See Adding Labels via Comments if you do not have permissions to add labels yourself.

jamesmoessis · 2024-05-21T00:46:14Z

Hi @kentquirk @jpkrohling this is certainly something I think this processor could use, and would help with early sampling decisions and preventing many spans from being stored in memory for longer than needed.

The discussions around Bloom/cuckoo filters are definitely on the right track. Looking through Refinery's code, it uses a dual cuckoo-filter mechanism which seems to be quite memory efficient, with a relatively low false-positive rate.

Another possible implementation is using an LRU cache. To make it twice as memory efficient you could also just store the right half of the trace id as a uint64.

As my team is in the midst of implementing tail sampling at quite a large scale, I'd be interested in taking a look at this in the coming weeks.
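The LRU suggestion with half-width keys could look roughly like the sketch below. It assumes a 16-byte trace ID and uses only the Go standard library; `rightHalf` and `KeepLRU` are hypothetical names, and a production version might well use an existing LRU package instead.

```go
package main

import (
	"container/list"
	"encoding/binary"
	"fmt"
)

// rightHalf returns the last 8 bytes of a 16-byte trace ID as a uint64,
// halving the key size compared with storing the full ID.
func rightHalf(id [16]byte) uint64 {
	return binary.BigEndian.Uint64(id[8:])
}

// KeepLRU remembers the most recently used "keep" trace IDs.
type KeepLRU struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[uint64]*list.Element // key -> node in order
}

// NewKeepLRU assumes capacity > 0.
func NewKeepLRU(capacity int) *KeepLRU {
	return &KeepLRU{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[uint64]*list.Element, capacity),
	}
}

// Add records a keep decision, evicting the least recently used key
// when the cache grows past its capacity.
func (c *KeepLRU) Add(id [16]byte) {
	key := rightHalf(id)
	if el, ok := c.items[key]; ok {
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(key)
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(uint64))
	}
}

// Contains reports whether the trace was kept, refreshing its recency on a hit.
func (c *KeepLRU) Contains(id [16]byte) bool {
	el, ok := c.items[rightHalf(id)]
	if ok {
		c.order.MoveToFront(el)
	}
	return ok
}

func main() {
	cache := NewKeepLRU(2)
	a, b, c := [16]byte{15: 1}, [16]byte{15: 2}, [16]byte{15: 3}
	cache.Add(a)
	cache.Add(b)
	cache.Contains(a) // touch a, so b becomes the least recently used
	cache.Add(c)      // evicts b
	fmt.Println(cache.Contains(a), cache.Contains(b), cache.Contains(c))
}
```

The trade-off is that two trace IDs sharing the same right half are treated as one trace, so the saving in memory comes with a small chance of a wrong "keep" hit.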

jpkrohling · 2024-05-21T15:38:35Z

I would prefer that someone from Honeycomb implement the cuckoo filter, to prevent any misunderstandings about the code ending up similar to Refinery's. And while I thought LRU would be the optimal choice except for its size, your suggestion of using only the right half of the IDs is wonderful!

jpkrohling · 2024-06-10T14:58:22Z

For now, I'll implement the cache as a circular buffer, and we can have alternative implementations right after that.

**Description:** Adds a simple LRU decision cache for sampled trace IDs. The design makes it easy to add another cache for non-sampled IDs. It saves no information other than the trace ID that is sampled, and it holds only the right half of the trace ID (as a uint64) in the cache. By default the cache remains a no-op; it becomes active only when the user configures the cache size.

**Link to tracking Issue:** #31583

**Testing:**
* unit tests on new code
* test in `processor_decision_test.go` to check that a trace that was sampled and cached, but whose span data was dropped, persists a "keep" decision.

**Documentation:** Added description to README
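The behaviour described here (no-op unless a size is configured, and a "keep" decision that outlives the buffered span data) can be sketched as follows. Everything in the block is hypothetical: `Config.SampledCacheSize`, `KeepCache`, and `Processor` are illustrative names, and the bounded set with arbitrary eviction is a stand-in for the LRU so the example stays short.

```go
package main

import "fmt"

// Config holds a hypothetical user setting; 0 leaves the cache disabled.
type Config struct {
	SampledCacheSize int
}

// KeepCache is the small surface the processor needs from a decision cache.
type KeepCache interface {
	Put(key uint64)
	Has(key uint64) bool
}

// nopCache is the default: it remembers nothing.
type nopCache struct{}

func (nopCache) Put(uint64)      {}
func (nopCache) Has(uint64) bool { return false }

// setCache is a bounded set standing in for the LRU described above.
type setCache struct {
	size int
	keys map[uint64]struct{}
}

func (c *setCache) Put(key uint64) {
	if _, ok := c.keys[key]; !ok && len(c.keys) >= c.size {
		for victim := range c.keys { // arbitrary eviction, not true LRU
			delete(c.keys, victim)
			break
		}
	}
	c.keys[key] = struct{}{}
}

func (c *setCache) Has(key uint64) bool {
	_, ok := c.keys[key]
	return ok
}

// Processor buffers spans until a decision is made for their trace.
type Processor struct {
	cache    KeepCache
	pending  map[uint64][]string // span names waiting for a decision
	Exported []string
}

// NewProcessor activates the cache only when a size is configured.
func NewProcessor(cfg Config) *Processor {
	var cache KeepCache = nopCache{}
	if cfg.SampledCacheSize > 0 {
		cache = &setCache{size: cfg.SampledCacheSize, keys: map[uint64]struct{}{}}
	}
	return &Processor{cache: cache, pending: map[uint64][]string{}}
}

// Consume forwards a span at once if its trace was already kept.
func (p *Processor) Consume(traceKey uint64, span string) {
	if p.cache.Has(traceKey) {
		p.Exported = append(p.Exported, span)
		return
	}
	p.pending[traceKey] = append(p.pending[traceKey], span)
}

// Keep releases the buffered spans, frees them, and remembers the decision.
func (p *Processor) Keep(traceKey uint64) {
	p.Exported = append(p.Exported, p.pending[traceKey]...)
	delete(p.pending, traceKey)
	p.cache.Put(traceKey)
}

func main() {
	p := NewProcessor(Config{SampledCacheSize: 100})
	p.Consume(42, "root")
	p.Keep(42)                 // span data for trace 42 is released and freed
	p.Consume(42, "late-span") // arrives after the data is gone
	fmt.Println(p.Exported)
}
```

With the cache enabled the late span is exported immediately. With `SampledCacheSize` left at 0 it would go back into the pending buffer and wait for a fresh decision.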


jpkrohling mentioned this issue Mar 5, 2024

Refactor tail-sampling processor #31580

Open

4 tasks

jpkrohling added the processor/tailsampling Tail sampling processor label Mar 5, 2024

jpkrohling self-assigned this Mar 5, 2024

djluck mentioned this issue Mar 5, 2024

Smarter waiting for late spans in tailsamplingprocessor #31498

Open

jpkrohling assigned kentquirk and unassigned jpkrohling Mar 11, 2024

github-actions bot mentioned this issue Mar 12, 2024

Weekly Report: 2024-03-05 - 2024-03-12 #31693

Closed

github-actions bot added the Stale label May 13, 2024

jpkrohling removed the Stale label May 16, 2024

jpkrohling assigned jpkrohling and unassigned kentquirk Jun 10, 2024

jamesmoessis mentioned this issue Jun 13, 2024

Decision cache for "keep" trace IDs #33533

Merged

jamesmoessis mentioned this issue Jun 24, 2024

[processor/tailsampling] Decision cache for non-sampled trace IDs #33722

Open
