[connector/failover] Failover connector erroneously flips back to lower priority pipelines #32094

sinkingpoint · 2024-04-02T10:10:23Z

Component(s)

connector/failover

What happened?

Description

The failover connector periodically retries higher priority pipelines that have failed, so that it can reinstate them as the stable pipeline should they start working again. However, we observe that after it does so, it then reinstates the lower priority pipeline, even though the higher priority pipeline is working.

Steps to Reproduce

1. Create a pipeline with two exporters, connected by a failover connector.
2. Start a job that keeps inserting logs so we can observe the output:

while true; do sleep 1; echo 'test' > logs; done

3. Start a listener on each of the two receiving ports:

nc -l 127.0.0.1 4278 # the high priority exporter
nc -l 127.0.0.1 4279 # the low priority exporter

4. Observe that logs correctly flow to the high priority exporter.
5. Shut down the high priority exporter.
6. Observe that logs correctly flow to the low priority exporter.
7. Restart the high priority exporter:

nc -l 127.0.0.1 4278 # the high priority exporter

8. Observe that logs correctly flow to the high priority exporter at first, but after a few seconds fall back to the low priority exporter, and then begin to flip-flop back and forth.

Expected Result

The logs should be stably redirected to the high priority exporter once it comes back online.

Actual Result

The logs flip-flop between the high and low priority exporters.

Investigation

Adding a bit more logging around pipeline decisions shows that the lower priority pipeline is being re-inserted at https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/connector/failoverconnector/internal/state/pipeline_selector.go#L105-L107

This is because the loop only terminates for pipelines after the current one, so it still runs for the current pipeline itself (https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/connector/failoverconnector/internal/state/pipeline_selector.go#L96-L98). This means that while the lower priority pipeline is active, the loop creates a job that makes it active again, even if we have selected a higher priority pipeline in the meantime.

Collector version

master (failover connector isn't released yet)

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

receivers:
  namedpipe:
    path: ./logs

exporters:
  syslog/a:
    endpoint: localhost
    port: 4278
    retry_on_failure:
      enabled: false
    tls:
      insecure: true
  syslog/b:
    endpoint: localhost
    port: 4279
    retry_on_failure:
      enabled: false
    tls:
      insecure: true

connectors:
  failover:
    retry_interval: 10s
    retry_gap: 3s
    priority_levels:
      - [logs/a]
      - [logs/b]

service:
  pipelines:
    logs:
      receivers: [namedpipe]
      exporters: [failover]
    logs/a:
      receivers: [failover]
      exporters: [syslog/a]
    logs/b:
      receivers: [failover]
      exporters: [syslog/b]

Log output

No response

Additional context

No response

github-actions · 2024-04-02T10:10:37Z

Pinging code owners:

connector/failover: @akats7 @djaglowski @fatsheep9146

See Adding Labels via Comments if you do not have permissions to add labels yourself.

sinkingpoint · 2024-04-02T10:11:15Z

I'm aware that the failover connector isn't official yet fwiw, but we're testing it internally so I figure I should create tickets for the issues we find

sinkingpoint · 2024-04-02T10:17:06Z

The fix is pretty simple - sinkingpoint@ad1b387 but it's 10pm and I'm too tired to write a proper test for it. I'll do so in the morning 😅

djaglowski · 2024-04-02T13:26:24Z

"we're testing it internally so I figure I should create tickets for the issues we find"

Thank you! This is especially helpful for newer components.

akats7 · 2024-04-02T16:22:37Z

Hey @sinkingpoint, thanks for creating this issue.

I did also notice this bug and was going to include the fix in my next PR.

I do think your fix might need another modification, as I think the result you'll get with your fix is that when the higher priority pipeline (let's call it level 1) comes back up, it will switch to level 1 but then switch one more time to level 2 and stay there.

After the new currentIndex is set, the loop proceeds straight into its next iteration, and the i > p.loadStable() check can then happen before the index has been marked stable. This means that on the next tick it will again set currentIndex to i.

Something like this resolves the issue:

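		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Check only after the tick: by now a higher priority
			// pipeline may already have been marked stable.
			if i > p.loadStable() {
				return
			}
			p.currentIndex.Store(int32(i))
		}
	} // closes the enclosing retry loop (not shown)
} // closes the enclosing function (not shown)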

Feel free to add a fix, or let me know if you'd like me to, thanks!

sinkingpoint · 2024-04-03T09:57:54Z

Ah, that makes more sense. If you've got a PR going then I'm happy to let you deal with it there :)

Description: This PR fixes a bug that caused the pipeline selector to continue switching between the stable and stable + 1 index after a new stable index has been established.
Link to tracking Issue: Resolves #32094
Testing: Additional test case added to check current index after stable check
Co-authored-by: Daniel Jaglowski <jaglows3@gmail.com>

**Description:** This PR adds the failover connector to the contrib distro and moves the component to alpha state, as all MVP functionality has been put in place. This PR also fixes a bug that caused the pipeline selector to continue switching between the stable and stable + 1 index after a new stable index has been established.
**Link to tracking Issue:** Resolves #32094
**Testing:** Additional test case added to check current index after stable check


sinkingpoint added the bug and needs triage labels Apr 2, 2024

github-actions bot added the connector/failover label Apr 2, 2024

akats7 mentioned this issue Apr 8, 2024

set failover connector to alpha #32216

Merged

crobert-1 removed the needs triage New item requiring triage label Apr 8, 2024

github-actions bot mentioned this issue Apr 9, 2024

Weekly Report: 2024-04-02 - 2024-04-09 #32230

Closed

akats7 mentioned this issue Apr 15, 2024

Failover switch bug fix #32386

Merged

djaglowski closed this as completed in #32386 Apr 15, 2024
