[exporterhelper] batchsender causes indefinitely blocking in-flight requests to block collector shutdown #10166

Closed
carsonip opened this issue May 16, 2024 · 14 comments · Fixed by #10287
Labels: bug

Comments

@carsonip
Contributor

carsonip commented May 16, 2024

Describe the bug

With batchsender, in-flight requests that block indefinitely (e.g. due to retries) will block batchsender shutdown and, in turn, collector shutdown.

Batchsender shutdown is blocked by https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/batch_sender.go#L220, which in turn blocks queue sender and exporter shutdown in https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/common.go#L330C1-L334C33

This is in contrast to the case without batchsender, where retries are handled by retry_sender (which is bypassed by batchsender, since batchsender calls request.Export directly). On shutdown, retry_sender avoids retrying and returns experr.shutdownErr if downstream senders return an error.

Steps to reproduce

Write a request with Export that blocks indefinitely. Trigger collector shutdown. Observe that the collector hangs.
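For illustration, a minimal Go sketch of such a request. The blockingRequest type and its field are hypothetical; it only assumes the exporterhelper request-style API where a request exposes Export(ctx) and ItemsCount().

```go
package sketch

import "context"

// blockingRequest is a hypothetical request used only to reproduce the hang.
type blockingRequest struct {
	items int
}

func (r *blockingRequest) ItemsCount() int { return r.items }

// Export ignores its context and never returns, standing in for an exporter
// that is stuck (e.g. retrying forever against an unreachable endpoint).
// On shutdown, batchsender waits for this call to finish, so the collector hangs.
func (r *blockingRequest) Export(ctx context.Context) error {
	select {} // block forever
}
```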

What did you expect to see?

Collector should not hang on shutdown.

In-flight requests should be signaled that the collector is shutting down, and possibly return an experr.shutdownErr up the call chain, so that the collector can exit cleanly without removing the request from the persistent queue.
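For contrast, a sketch of an Export that honors cancellation, assuming shutdown (or the timeout sender) cancels the context passed down the sender chain; cooperativeRequest and tryExportOnce are hypothetical names, not real exporterhelper API.

```go
package sketch

import (
	"context"
	"errors"
	"time"
)

type cooperativeRequest struct{}

// tryExportOnce is a hypothetical single export attempt; here it always
// fails, standing in for an unreachable endpoint.
func (r *cooperativeRequest) tryExportOnce(ctx context.Context) error {
	return errors.New("endpoint unavailable")
}

// Export retries, but gives up as soon as the context is canceled, letting
// the call chain unwind so the item can stay in the persistent queue.
func (r *cooperativeRequest) Export(ctx context.Context) error {
	for {
		if err := r.tryExportOnce(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // propagate cancellation instead of hanging
		case <-time.After(time.Second): // back off before the next attempt
		}
	}
}
```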

What did you see instead?

Collector hangs as batchsender waits for all active requests to return.

What version did you use?

v0.100.0

What config did you use?

Environment

Additional context

Q:

  • Is it intentional to bypass retrysender and timeoutsender in batchsender? This was raised during code review.
  • Even if batchsender.Shutdown returns without polling for in-flight requests to end, so that be.ShutdownFunc.Shutdown(ctx) is called, how can the collector exit cleanly without losing events in the persistent queue? experr.shutdownErr is in an internal package, and contrib exporter code has no way to avoid queue item removal from the persistent queue aside from blocking queue Consume.
@dmitryax
Member

dmitryax commented May 29, 2024

Write a request with Export that blocks indefinitely

If the Export doesn't handle context cancellation, it's expected that it can block the collector even without the batch exporter.

@carsonip please see if you run into #10255, which is fixed by #10258.

@lahsivjar

@dmitryax One of the cases where this blocking happens is when the Export function relies on the context to learn about shutdown. For example: if a component flushes events in Export with infinite retries, relying on context cancellation to stop retrying, then the batch-sender will never terminate. This happens because:

  1. When the collector is shut down, the batch-sender is shut down before the component's ShutdownFunc is called (ref).
  2. The batchsender shutdown waits till ALL active requests are flushed (ref). However, if the endpoint is not available and the component's infinite retry logic is at work, the component will never know that shutdown was invoked: the context it has is the one passed to the batch sender's send call, and we don't cancel it (ref).

One option could be to drop retries from the component logic and rely on the retry-sender instead, but it looks like the retry-sender is bypassed when the batch-sender is in the pipeline (ref) -- I am not sure whether this is intentional, but I couldn't find it documented anywhere.

Another option could be to cancel the retry in the component itself, but that can't work either: the component will not know about the shutdown until its Shutdown function is called, which is blocked waiting for the batch sender's shutdown to finish.
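To make the ordering problem concrete, here is a rough sketch of the deadlock (the type, field names, and polling loop are illustrative, not the actual exporterhelper internals): the batch sender's Shutdown waits for in-flight requests, while the only thing that would make them return -- the component's own Shutdown -- runs afterwards.

```go
package sketch

import (
	"context"
	"sync/atomic"
	"time"
)

// batchSender is an illustrative stand-in for the real batch sender state.
type batchSender struct {
	stopped        atomic.Bool
	activeRequests atomic.Int64
}

// Shutdown waits until every in-flight request has returned. If an Export is
// stuck in an infinite retry loop that only stops once the component's own
// Shutdown cancels it, this loop never exits -- and the component's Shutdown
// is only called after this one returns, hence the deadlock.
func (bs *batchSender) Shutdown(ctx context.Context) error {
	bs.stopped.Store(true)
	for bs.activeRequests.Load() > 0 {
		time.Sleep(10 * time.Millisecond)
	}
	return nil
}
```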

@dmitryax
Member

@lahsivjar That's why we have the timeout configuration. If the timeout is disabled (set to 0), the collector shutdown is expected to get stuck in such cases.

@dmitryax
Member

dmitryax commented May 30, 2024

The batchsender shutdown waits till ALL active requests are flushed

Not all, but only those that are currently being batched. Once the current batch is delivered (or canceled due to timeout), the batch processor shutdown is complete and it behaves as a pass-through to allow the memory queue to flush.

@dmitryax
Member

If you don't have a timeout set, I believe it'll be exactly the same problem even without the batch processor, especially if the memory queue is enabled.

@lahsivjar

That's why we have the timeout configuration. If the timeout is disabled (set to 0), the collector shutdown is expected to get stuck in such cases.

Ah, that makes sense to me, thanks for the pointer. I do think the behaviour needs to be documented though -- I will open a PR for that.

@dmitryax What do you think about the interaction of retry-sender and batch-sender?

@lahsivjar

If you don't have a timeout set, I believe it'll be exactly the same problem even without the batch processor, especially if the memory queue is enabled.

Not quite: since the retry-sender is shut down before the batch sender, it will only send the request once.

@lahsivjar

lahsivjar commented May 30, 2024

That's why we have the timeout configuration. If the timeout is disabled (set to 0), the collector shutdown is expected to get stuck in such cases.

Hmm, it looks like the timeout sender is the last in the chain (ref). Since the batch sender doesn't forward to any next sender in the happy path (ref), I don't see how this will be enforced. Am I missing something here?

@dmitryax
Member

The timeout sender doesn't need to be shut down. It just cancels requests that take longer than the pre-defined timeout duration, which keeps any single request from running beyond that duration, assuming that the exporter handles context cancellation.
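Roughly, the timeout sender's effect can be pictured like this (a simplified sketch, not the actual exporterhelper code): each send is bounded by context.WithTimeout, so an Export that respects its context cannot run longer than the configured timeout, during shutdown or otherwise.

```go
package sketch

import (
	"context"
	"time"
)

// sendWithTimeout bounds a send with a per-request deadline. With timeout
// set to 0 the context is passed through unchanged, which is why a disabled
// timeout can leave shutdown stuck behind a hanging Export.
func sendWithTimeout(ctx context.Context, timeout time.Duration,
	send func(context.Context) error) error {
	if timeout > 0 {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, timeout)
		defer cancel()
	}
	return send(ctx)
}
```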

@dmitryax
Member

Since the batch sender doesn't forward to any next sender in the happy path (ref), I don't see how this will be enforced

I'm not sure I understand this. The batch sender always sends the requests (batched or not) to the next sender, which is typically the retry or the timeout sender.

@lahsivjar

The batch sender always sends the requests (batched or not) to the next sender, which is typically the retry or the timeout sender

Maybe I am missing something obvious here, but the code in send only calls bs.nextSender.send if bs.stopped.Load() returns true, that is, only after the batch sender's Shutdown has been called. That's why I think both the retry and timeout senders (which are next in the chain) are bypassed in the happy path, which calls sendMerge{SplitBatch, Batch} followed by Request#Export.
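For readers following along, a simplified paraphrase of that dispatch (the types and method bodies are illustrative stand-ins, not the real batch_sender.go): the next sender is only reached once the stopped flag is set by Shutdown; the happy path batches and eventually calls Request.Export directly, so the retry and timeout senders never see the request.

```go
package sketch

import (
	"context"
	"sync/atomic"
)

// Request and sender are illustrative stand-ins for the exporterhelper types.
type Request interface {
	Export(ctx context.Context) error
}

type sender interface {
	send(ctx context.Context, req Request) error
}

type batchSender struct {
	stopped    atomic.Bool
	nextSender sender
}

func (bs *batchSender) send(ctx context.Context, req Request) error {
	if bs.stopped.Load() {
		return bs.nextSender.send(ctx, req) // pass-through only after Shutdown
	}
	return bs.sendMergeBatch(ctx, req) // happy path bypasses nextSender
}

func (bs *batchSender) sendMergeBatch(ctx context.Context, req Request) error {
	// In the real code this merges into the active batch and flushes on
	// size/timeout; the flush ends in req.Export, not bs.nextSender.send.
	return req.Export(ctx)
}
```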

@dmitryax
Member

@lahsivjar you're right! Now I see your point. We are completely ignoring the next senders in the chain. Let me work on a fix

@carsonip
Contributor Author

@lahsivjar you're right! Now I see your point. We are completely ignoring the next senders in the chain. Let me work on a fix

Sorry if I didn't describe the problem clearly enough in the issue. Yes, using the next senders in the chain will fix it.

With the fix, retrySender will return a shutdownErr on shutdown (which wasn't possible from exporter code before, because it lives in an internal package), and shutdown can complete while retaining the entry in the persistent queue.
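A sketch of that shutdown behavior in the retry sender (simplified; names like stopCh and backoff are illustrative, and the real code wraps the error as experr.shutdownErr rather than using fmt.Errorf):

```go
package sketch

import (
	"context"
	"fmt"
	"time"
)

type Request interface {
	Export(ctx context.Context) error
}

// retrySender is a simplified stand-in for the real retry sender.
type retrySender struct {
	stopCh  chan struct{}
	backoff time.Duration
}

// send retries a failed export until it succeeds or shutdown is signaled.
// On shutdown it gives up without retrying and marks the failure as
// shutdown-related, so the queue sender keeps the item in the persistent
// queue instead of dropping it.
func (rs *retrySender) send(ctx context.Context, req Request) error {
	for {
		err := req.Export(ctx)
		if err == nil {
			return nil
		}
		select {
		case <-rs.stopCh:
			return fmt.Errorf("interrupted due to shutdown: %w", err)
		case <-time.After(rs.backoff):
			// back off, then retry
		}
	}
}
```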

@dmitryax
Member

dmitryax commented Jun 1, 2024

Submitted #10287

dmitryax added a commit that referenced this issue Jun 4, 2024
…10287)

This change fixes a bug where the retry and timeout logic was not applied when batching was enabled. The batch sender was ignoring the next senders in the chain.

Fixes #10166