Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[connector/failover] Failover connector erroneously flips back to lower priority pipelines #32094

Closed
sinkingpoint opened this issue Apr 2, 2024 · 6 comments · Fixed by #32216 or #32386
Labels
bug Something isn't working connector/failover

Comments

@sinkingpoint
Copy link
Contributor

Component(s)

connector/failover

What happened?

Description

The failover connector periodically retries higher priority pipelines that have failed, so that it can reinstate them as the stable pipeline should they start working again. We observe however that when it does so, it then reinstates the lower priority pipeline, even when the higher priority pipeline is working.

Steps to Reproduce

  1. Create a pipeline with two exporters, connected with a failover connector.
  2. Establish a job inserting logs so we can observe the output:
while true; do sleep 1; echo 'test' > logs; done
  1. Start two listeners on each of the receiving ports:
nc -l 127.0.0.1 4278 # the high priority exporter
nc -l 127.0.0.1 4279 # the low priority exporter
  1. Observe that logs correctly flow to the high priority exporter
  2. Shut down the high priority exporter
  3. Observe that logs correctly flow to the low priority exporter
  4. Restart the high priority exporter
nc -l 127.0.0.1 4278 # the high priority exporter
  1. Observe that logs correctly flow to the high priority exporter, but then after a few seconds fall back to the low priority exporter (and that it begins to flip flop back and forth)

Expected Result

The logs should be stably redirected to the high priority exporter once it comes back online

Actual Result

The logs flip flop between the high and low priority exporters

Investigation

Adding a bit more logging around pipeline decisions finds that the lower priority pipeline is being re-inserted at https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/connector/failoverconnector/internal/state/pipeline_selector.go#L105-L107

This is because the loop terminates for pipelines after, but including the current pipeline (https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/connector/failoverconnector/internal/state/pipeline_selector.go#L96-L98). This means that while the lower priority pipeline is active, it creates a job that makes it active again, even if we select a higher priority pipeline.

Collector version

master (failover connector isn't released yet)

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

receivers:
  namedpipe:
    path: ./logs

exporters:
  syslog/a:
    endpoint: localhost
    port: 4278
    retry_on_failure:
      enabled: false
    tls:
      insecure: true
  syslog/b:
    endpoint: localhost
    port: 4279
    retry_on_failure:
      enabled: false
    tls:
      insecure: true

connectors:
  failover:
    retry_interval: 10s
    retry_gap: 3s
    priority_levels:
      - [logs/a]
      - [logs/b]

service:
  pipelines:
    logs:
      receivers: [namedpipe]
      exporters: [failover]
    logs/a:
      receivers: [failover]
      exporters: [syslog/a]
    logs/b:
      receivers: [failover]
      exporters: [syslog/b]

Log output

No response

Additional context

No response

@sinkingpoint sinkingpoint added bug Something isn't working needs triage New item requiring triage labels Apr 2, 2024
Copy link
Contributor

github-actions bot commented Apr 2, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@sinkingpoint
Copy link
Contributor Author

I'm aware that the failover connector isn't official yet fwiw, but we're testing it internally so I figure I should create tickets for the issues we find

@sinkingpoint
Copy link
Contributor Author

sinkingpoint commented Apr 2, 2024

The fix is pretty simple - sinkingpoint@ad1b387 but it's 10pm and I'm too tired to write a proper test for it. I'll do so in the morning 😅

@djaglowski
Copy link
Member

we're testing it internally so I figure I should create tickets for the issues we find

Thank you! This is especially helpful for newer components.

@akats7
Copy link
Contributor

akats7 commented Apr 2, 2024

Hey @sinkingpoint, thanks for creating this issue

I did also notice this bug and was going to include the fix in my next PR.

I do think your fix might need another modification, as I think the result you'll get with your fix is that when the higher priority pipeline (lets call it level 1) comes back up, it will switch to level 1 but then will switch one more time to level 2 and then stay there.

After the new currentIndex is set it will proceed right into the next iteration of the loop and then the i > p.loadStable() check can happen before the index is set to be stable. Which means on the next ticker it will again set the currentIndex to i.

Something like this resolves this issue

select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if i > p.loadStable() {
				return
			}
			p.currentIndex.Store(int32(i))
		}
	}
}

Feel free to add a fix, or let me know if you'd like me to, thanks!

@sinkingpoint
Copy link
Contributor Author

Ah, that makes more sense. If you've got a PR going then I'm happy to let you deal with it there :)

@crobert-1 crobert-1 removed the needs triage New item requiring triage label Apr 8, 2024
djaglowski added a commit that referenced this issue Apr 15, 2024
Description:

This PR adds a bug fix that caused the pipeline selector to continue
switching between the stable and stable + 1 index after a new stable
index has been established.

Link to tracking Issue:

Resolves
#32094

Testing:

Additional test case added to check current index after stable check

---------

Co-authored-by: Daniel Jaglowski <jaglows3@gmail.com>
djaglowski pushed a commit that referenced this issue Apr 22, 2024
**Description:** <Describe what has changed.>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->

This PR adds the failover connector to the contrib distro and moves the
component to alpha state as all MVP functionality has been put in place.
This PR also adds a bug fix that caused the pipeline selector to
continue switching between the stable and stable + 1 index after a new
stable index has been established.

**Link to tracking Issue:** <Issue number if applicable>

Resolves #32094

**Testing:** <Describe what testing was performed and which tests were
added.>

Additional test case added to check current index after stable check
rimitchell pushed a commit to rimitchell/opentelemetry-collector-contrib that referenced this issue May 8, 2024
Description:

This PR adds a bug fix that caused the pipeline selector to continue
switching between the stable and stable + 1 index after a new stable
index has been established.

Link to tracking Issue:

Resolves
open-telemetry#32094

Testing:

Additional test case added to check current index after stable check

---------

Co-authored-by: Daniel Jaglowski <jaglows3@gmail.com>
rimitchell pushed a commit to rimitchell/opentelemetry-collector-contrib that referenced this issue May 8, 2024
**Description:** <Describe what has changed.>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->

This PR adds the failover connector to the contrib distro and moves the
component to alpha state as all MVP functionality has been put in place.
This PR also adds a bug fix that caused the pipeline selector to
continue switching between the stable and stable + 1 index after a new
stable index has been established.

**Link to tracking Issue:** <Issue number if applicable>

Resolves open-telemetry#32094

**Testing:** <Describe what testing was performed and which tests were
added.>

Additional test case added to check current index after stable check
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working connector/failover
Projects
None yet
4 participants