
[exporterhelper] Batch sender produces inconsistent and suboptimal batch sizes #9952

Closed
carsonip opened this issue Apr 12, 2024 · 3 comments · Fixed by #10337
Labels
area:exporter, bug

Comments

@carsonip
Contributor

Describe the bug

From #8122 (comment)

In my experiments, the batch sender produces inconsistent batch sizes, which can be smaller than desired due to goroutine scheduling, even after #9761. The scenario I usually run into: with queue sender concurrency = batch sender concurrency limit = N, all N goroutines are blocked on send. When the send eventually returns, activeRequests first drops to N-1; a new consumer goroutine then comes in, increments it, sees that N-1+1 == concurrencyLimit, and sends its request off right away, causing an undesirably small request to be exported without batching.
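
Below is a minimal, runnable sketch of this pattern, assuming a toy batcher with an atomic activeRequests counter and a concurrency limit equal to the number of queue consumers. All identifiers are illustrative; this is not the actual exporterhelper code.

```go
// Toy reproduction: N queue consumers share one active batch; when it is
// exported, its waiters decrement activeRequests one by one, so a consumer
// that re-enters immediately can see the counter at the limit and export a
// batch holding a single request. Names are illustrative, not exporterhelper.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

type batch struct {
	items int
	done  chan struct{}
}

type batcher struct {
	mu             sync.Mutex
	active         *batch
	activeRequests atomic.Int64
	limit          int64
	sizes          chan int // exported batch sizes, recorded for inspection
}

func (b *batcher) flush(cur *batch) {
	b.sizes <- cur.items
	close(cur.done)
}

// send mirrors the flawed pattern: increment, merge, export once the counter
// reaches the concurrency limit, and let every waiter decrement on its own.
func (b *batcher) send() {
	n := b.activeRequests.Add(1)
	defer b.activeRequests.Add(-1)

	b.mu.Lock()
	b.active.items++
	cur := b.active
	if n >= b.limit {
		// If the previous batch just returned and its waiters have not all
		// decremented yet, this fires with only the current request merged.
		b.active = &batch{done: make(chan struct{})}
		b.mu.Unlock()
		time.Sleep(2 * time.Millisecond) // simulated export latency
		b.flush(cur)
		return
	}
	b.mu.Unlock()

	select {
	case <-cur.done:
	case <-time.After(100 * time.Millisecond): // stand-in for a flush timeout
		b.mu.Lock()
		if b.active == cur {
			b.active = &batch{done: make(chan struct{})}
			b.mu.Unlock()
			b.flush(cur)
			return
		}
		b.mu.Unlock()
		<-cur.done
	}
}

func main() {
	const consumers = 100 // matches sending_queue::num_consumers
	b := &batcher{
		active: &batch{done: make(chan struct{})},
		limit:  consumers,
		sizes:  make(chan int, 1024),
	}
	var wg sync.WaitGroup
	for i := 0; i < consumers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 5; j++ { // each consumer handles 5 queued requests
				b.send()
			}
		}()
	}
	wg.Wait()
	close(b.sizes)
	for s := range b.sizes {
		fmt.Println("exported batch size:", s)
	}
}
```

On a typical run this prints a large first batch followed by many much smaller ones, because the consumer that just exported re-enters send while the other waiters are still decrementing.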

I tried to fix it by resetting bs.activeRequests to 0 next to close(b.done). While that fixes sendMergeBatch, it does not work for sendMergeSplitBatch, since that path may be exporting something outside of activeBatch. There might be a way around this for merge-split batching, but I'm not sure it is worth the trouble and complexity to work around unfavorable goroutine scheduling.

Steps to reproduce

Use the batch sender with sending_queue num_consumers=100 and send events to it.

What did you expect to see?

The batch sender consistently produces batches of size 100.

What did you see instead?

The batch sender produces a first batch of size 100; after that, most batches have size 1.

What version did you use?

v0.97.0

What config did you use?

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: localhost:4317
      http:
        endpoint: localhost:4318

exporters:
  elasticsearch:
    endpoints: [ "http://localhost:9200" ]
    logs_index: foo
    user: ****
    password: ****
    sending_queue:
      enabled: true
      storage: file_storage/elasticsearchexporter
      num_consumers: 100
      queue_size: 10000000
    retry:
      enabled: true
      max_requests: 10000

extensions:
  file_storage/elasticsearchexporter:
    directory: /tmp/otelcol/file_storage/elasticsearchexporter

service:
  extensions: [file_storage/elasticsearchexporter]
  pipelines:
    logs:
      receivers: [otlp]
      processors: []
      exporters: [elasticsearch]
```

Environment

Linux Mint 21.3

Additional context

@TylerHelmuth
Member

/cc @dmitryax

@carsonip
Contributor Author

@dmitryax question: what's the reason behind the concurrency check in the first place? If it is just to avoid meaningless waits when all goroutines are blocked by the same batch, then #9891 should fix it. If it is to avoid waits when all goroutines are blocked by different batches, I believe it is inherently vulnerable to the issue I mentioned in the bug report, and may need a different solution.

@dmitryax
Member

dmitryax commented Jun 5, 2024

> If it is just to avoid meaningless waits when all goroutines are blocked by the same batch,

That's the main reason, yes.

> If it is to avoid waits when all goroutines are blocked by different batches

This is the secondary reason. We want to control the total number of spawned goroutines. Probably, we can introduce num_workers similar to the queue option. But that can wait until we see a strong need.

carsonip added a commit to carsonip/opentelemetry-collector that referenced this issue Jun 5, 2024
Drop the number of goroutines in batch at once to work around unfavorable goroutine scheduling.

Fixes open-telemetry#9952
dmitryax added a commit that referenced this issue Jun 12, 2024
…10337)

#### Description

- Correctly keep track of bs.activeRequests, which denotes the number of sends waiting for the next sender in the chain to return. This was already done before this PR, but in a way that is vulnerable to unfavorable goroutine scheduling.
- Decrease bs.activeRequests by the number of requests blocked by an activeBatch at once, which works around the "bug" mentioned in (1); see the sketch below.


#### Link to tracking issue
Fixes #9952

---------

Co-authored-by: Dmitrii Anoshin <anoshindx@gmail.com>
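
A minimal sketch of the idea in the PR description above, reusing the toy batcher from the earlier sketch in this issue (illustrative only, not the actual #10337 change): the goroutine that exports a batch subtracts that batch's whole request count from activeRequests in one atomic step, so the counter no longer lingers near the limit while the blocked waiters wake up and decrement one by one.

```go
// sendFixed sketches the approach from the PR description on the toy batcher
// above: the exporting goroutine releases the whole batch's contribution to
// activeRequests at once, and the blocked waiters no longer decrement
// individually after they wake up. (The flush-timeout path from the earlier
// sketch is omitted for brevity.)
func (b *batcher) sendFixed() {
	b.mu.Lock()
	b.active.items++
	cur := b.active
	n := b.activeRequests.Add(1) // count this request while it sits in the batch
	if n >= b.limit {
		b.active = &batch{done: make(chan struct{})}
		b.mu.Unlock()
		time.Sleep(2 * time.Millisecond) // simulated export latency
		// Drop the batch's requests from the counter in a single step, so a
		// consumer arriving right now cannot observe a stale, nearly-full count.
		b.activeRequests.Add(-int64(cur.items))
		b.flush(cur)
		return
	}
	b.mu.Unlock()
	<-cur.done // the exporting goroutine already decremented on our behalf
}
```

In this toy, each request contributes exactly one item, so cur.items equals the number of requests merged into the batch and the single bulk decrement balances all of their increments.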