
Error creating a collector pod when specifying prometheus labeldrop actions #958

Closed
jacobwalters opened this issue May 31, 2022 · 12 comments
Labels: area:target-allocator (Issues for target-allocator), bug (Something isn't working)


jacobwalters commented May 31, 2022

Describe the bug
I am seeing this error

2022/05/31 17:54:37 application run finished with error: failed to get config: cannot unmarshal the configuration: error reading receivers configuration for "prometheus": prometheus receiver failed to unmarshal yaml to prometheus config: labeldrop action requires only 'regex', and no other fields

This happens when I load a valid Prometheus configuration into the OpenTelemetryCollector resource.

I am specifying the labeldrop action like this:

metric_relabel_configs:
- regex: (.*id)
  action: labeldrop

The operator creates a ConfigMap called adot-collector in which the labeldrop action is transformed into this:

metric_relabel_configs:
- separator: ;
  regex: (.*id)
  replacement: $1
  action: labeldrop

Compared to how I defined it, the operator is adding the fields "separator" and "replacement". These added fields are what cause the error in the logs above.

I was able to work around the issue by removing the labeldrop rules from my config and redeploying.

Steps to reproduce

Create an OpenTelemetryCollector object in statefulset mode with the targetAllocator enabled. Configure the prometheus receiver to scrape a Prometheus endpoint and use the metric_relabel_configs field to drop a label. Observe the collector pod and the ConfigMap generated by the operator.

What did you expect to see?

I expected the opentelemetry-operator to create a collector pod successfully.

What did you see instead?

I saw that the opentelemetry-operator created the pod adot-collector-0, which went into a crash loop. I looked at the logs and saw this message:

2022/05/31 17:54:37 application run finished with error: failed to get config: cannot unmarshal the configuration: error reading receivers configuration for "prometheus": prometheus receiver failed to unmarshal yaml to prometheus config: labeldrop action requires only 'regex', and no other fields

When I looked at the ConfigMap created by the operator, the labeldrop action had additional keys ("replacement", "separator") that I had not added.

What version did you use?

Version: opentelemetry-collector-0.17.0 (the OpenTelemetry Helm charts)

What config did you use?

---
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: adot
  namespace: opentelemetry
  labels:
    app: opentelemetry-operator
spec:
  image: "aws-otel-collector:0.17.0"
  mode: statefulset
  serviceAccount: generic-collector
  targetAllocator:
    enabled: true
    image: target-allocator:0.1.0
  resources:
    requests:
      cpu: 1
      memory: 6G
    limits:
      cpu: 1
      memory: 6G
  volumeMounts:
  - name: etcd-secrets
    mountPath: /etc/prometheus/secrets/kube-etcd-client-certs
  volumes: 
  - name: etcd-secrets
    secret:
      defaultMode: 420
      optional: true
      secretName: cilium-etcd-secrets
  volumeClaimTemplates:
  - metadata:
      name: shard-volume
    spec:
      accessModes:
      - ReadWriteOnce
      storageClassName: ebs-all-zones
      resources:
        requests:
          storage: 1Gi
  config: |
    receivers:
      prometheus:
        config:
          global:
            scrape_interval: 60s
            scrape_timeout: 10s
          scrape_configs:
          - job_name: kubelet
            honor_labels: false
            honor_timestamps: false
            metrics_path: /metrics
            scheme: https
            bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
            tls_config:
              insecure_skip_verify: true
              ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            kubernetes_sd_configs:
            - role: endpoints
              namespaces:
                names:
                - kube-system
            relabel_configs:
            - action: keep
              regex: kubelet
              source_labels: [__meta_kubernetes_endpoints_name]
            - action: keep
              regex: https-metrics
              source_labels: [__meta_kubernetes_endpoint_port_name]
            metric_relabel_configs:
            - action: labeldrop
              regex: (.*id)
    exporters:
      awsprometheusremotewrite:
        timeout: 30s
        endpoint: https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-dummy-value/api/v1/remote_write
        aws_auth:
          region: "us-east-1"
          service: "aps"
        external_labels:
          clusterId: clusterId
          cluster: clustername
          env: poc
          __replica__: ${POD_NAME}
      logging:
        loglevel: warn
    processors:
      memory_limiter:
        check_interval: 1s
        limit_percentage: 70
        spike_limit_percentage: 20
      batch:
      batch/2:
        timeout: 10s
        send_batch_size: 10000
        send_batch_max_size: 12000
    extensions:
      health_check:
      pprof:
        endpoint: :1888
      zpages:
        endpoint: :55679
    service:
      extensions: [pprof, zpages, health_check]
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [memory_limiter, batch]
          exporters: [logging, awsprometheusremotewrite]

Environment
EKS 1.20 cluster


jacobwalters added the bug (Something isn't working) label on May 31, 2022
@Aneurysm9 (Member) commented

I think this issue needs to move to the operator repo as the config rewriting there appears to be the cause. The validation triggering the error here is in the Prometheus project and not something the receiver can impact.

@open-telemetry/operator-maintainers or @open-telemetry/operator-approvers are you able to transfer this issue?

@erichsueh3 (Contributor) commented

Updating with current findings: the separator and replacement fields are being added during YAML marshalling/unmarshalling in ReplaceConfig(), see here. I will be out this week, but can dive deeper into why this is happening afterwards.
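For reference, here is a minimal sketch of that round-trip. It is not the operator's ReplaceConfig() itself; it only assumes the prometheus/prometheus model/relabel package and gopkg.in/yaml.v2, and shows how the defaults get picked up when the user's relabel rule is unmarshalled and re-marshalled:

package main

import (
	"fmt"

	"github.com/prometheus/prometheus/model/relabel"
	"gopkg.in/yaml.v2"
)

func main() {
	// The relabel rule exactly as written in the OpenTelemetryCollector spec.
	in := []byte("action: labeldrop\nregex: (.*id)\n")

	// Config.UnmarshalYAML starts from DefaultRelabelConfig, so Separator (";")
	// and Replacement ("$1") are populated even though they were never specified.
	var cfg relabel.Config
	if err := yaml.Unmarshal(in, &cfg); err != nil {
		panic(err)
	}

	// Marshalling the struct back therefore emits those defaults, matching the
	// shape seen in the generated adot-collector ConfigMap above.
	out, err := yaml.Marshal(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
	// Expected output (roughly):
	//   separator: ;
	//   regex: (.*id)
	//   replacement: $1
	//   action: labeldrop
}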

@CoderPoet (Contributor) commented

It appears to be caused by the default value
(screenshot of the default value; see the DefaultRelabelConfig snippet quoted in the comments below)

dmitryax transferred this issue from open-telemetry/opentelemetry-collector-contrib on Jun 28, 2022
@anunarapureddy (Member) commented Mar 15, 2023

I am seeing the same issue when specifying a valid labelmap action in the scrape job.

Input - Specification:

relabel_configs:
  - separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    action: labelmap

Output: the operator creates a ConfigMap where the labelmap action is transformed into:

 - separator: ;
   regex: __meta_kubernetes_node_label_(.+)
   replacement: $1
   action: labelmap

Error Message:

Error: failed to get config: cannot unmarshal the configuration: 1 error(s) decoding:

* error decoding 'receivers': error reading receivers configuration for "prometheus": prometheus receiver failed to unmarshal yaml to prometheus config: "" is invalid 'replacement' for labelmap action

cc: @alolita
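For context, the error text also hints at what happens downstream: the ConfigMap written by the operator contains replacement: $1, yet the receiver complains that the replacement is an empty string. That is consistent with the collector expanding $1 as an environment-variable reference before the Prometheus receiver parses the config, which would also explain why the fix referenced below escapes $ signs. A minimal sketch, again only assuming prometheus/prometheus model/relabel and gopkg.in/yaml.v2, reproduces the same validation failure once the replacement arrives empty:

package main

import (
	"fmt"

	"github.com/prometheus/prometheus/model/relabel"
	"gopkg.in/yaml.v2"
)

func main() {
	// Assumed post-expansion input: what the Prometheus receiver presumably sees
	// after the collector's env-var expansion has consumed the "$1"; the
	// ConfigMap itself still contains "replacement: $1".
	in := []byte("action: labelmap\n" +
		"regex: __meta_kubernetes_node_label_(.+)\n" +
		"separator: ;\n" +
		"replacement: \"\"\n")

	var cfg relabel.Config
	err := yaml.Unmarshal(in, &cfg)
	fmt.Println(err)
	// Should print a validation error along the lines of:
	//   "" is invalid 'replacement' for labelmap action
}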

@utr1903 (Contributor) commented Mar 22, 2023

@anunarapureddy I am also having the same issue with labelmap

@anunarapureddy (Member) commented Mar 27, 2023

@CoderPoet @Aneurysm9 @erichsueh3 Is anyone actively looking into or working on a fix for this issue?

cc: @alolita

@alolita (Member) commented Mar 28, 2023

@open-telemetry/operator-maintainers can you please reassign this issue to @anunarapureddy? Thanks!

@Nitesh-vaidyanath commented Mar 29, 2023

Looks like I found the issue here: https://github.com/prometheus/prometheus/blob/main/model/relabel/relabel.go.

DefaultRelabelConfig = Config{
	Action:      Replace,
	Separator:   ";",
	Regex:       MustNewRegexp("(.*)"),
	Replacement: "$1",
}

metric_relabel_configs expects only the regex field, but the actual config is filled in from DefaultRelabelConfig, which causes the if condition below to return an error.

        metric_relabel_configs:
        - action: labeldrop
          regex: (.*id)
          
if c.Action == LabelDrop || c.Action == LabelKeep {
	if c.SourceLabels != nil ||
		c.TargetLabel != DefaultRelabelConfig.TargetLabel ||
		c.Modulus != DefaultRelabelConfig.Modulus ||
		c.Separator != DefaultRelabelConfig.Separator ||
		c.Replacement != DefaultRelabelConfig.Replacement {
		return fmt.Errorf("%s action requires only 'regex', and no other fields", c.Action)
	}
}

@Nitesh-vaidyanath commented

Looks like I found the issue here: https://github.com/prometheus/prometheus/blob/main/model/relabel/relabel.go.

DefaultRelabelConfig = Config{
	Action:      Replace,
	Separator:   ";",
	Regex:       MustNewRegexp("(.*)"),
	Replacement: "$1",
}

func (c *Config) UnmarshalYAML(unmarshal func(interface{}) error) error {
	*c = DefaultRelabelConfig

metric_relabel_configs expects only the regex field, but the actual config is filled in from DefaultRelabelConfig, which causes the if condition below to return an error.

        metric_relabel_configs:
        - action: labeldrop
          regex: (.*id)
          
if c.Action == LabelDrop || c.Action == LabelKeep {
	if c.SourceLabels != nil ||
		c.TargetLabel != DefaultRelabelConfig.TargetLabel ||
		c.Modulus != DefaultRelabelConfig.Modulus ||
		c.Separator != DefaultRelabelConfig.Separator ||
		c.Replacement != DefaultRelabelConfig.Replacement {
		return fmt.Errorf("%s action requires only 'regex', and no other fields", c.Action)
	}
}

https://github.com/prometheus/prometheus/blob/main/model/relabel/relabel.go


pavolloffay added the area:target-allocator (Issues for target-allocator) label on Mar 29, 2023
erichsueh3 removed their assignment on Mar 29, 2023
@matej-g (Contributor) commented Apr 19, 2023

Looks like the cause of the issue is clear. Is anyone working on a fix?

@jaronoff97 (Contributor) commented

@matej-g I believe @anunarapureddy is working on a fix; we were discussing it in Slack.

anunarapureddy added a commit to anunarapureddy/opentelemetry-operator that referenced this issue May 5, 2023
jaronoff97 added a commit that referenced this issue Jun 1, 2023
…r is enabled (#1712)

* Code fix for issues #1623, #1622, #958

* updated the PR with an alternative logic to fi the relabel config issue

* fixed codelint issues

* reverted the logic to update scrape_configs

* fixed unit test case failures

* fixed code lint issues

* changes to replace_config to handle unamrshalling issue when tarallocatorewrite flag is enabled

* fixed unit test case failure

* updated e2e test to cover labelkeep usecase

* updated unit test to escape $ signs

* updated unit test to cover missing and invalig configs

* resolved merge conflicts

* updated tests, refactored logic based on the new commits

* uncommented labeldrop e2e test

* added back target allocator attribute

* comment for unescaping TA config

* updated the README document to reflect the traslations

* updated the README document to reflect the traslations

* updated the TA config section in the README document

* addressed the review comment on README doc

---------

Co-authored-by: Jacob Aronoff <jaronoff97@users.noreply.github.com>
KitSutliff pushed a commit to KitSutliff/opentelemetry-operator that referenced this issue Jun 22, 2023
@jaronoff97 (Contributor) commented

This should have been resolved by #1712. Please let me know if you are still experiencing issues on the latest versions.

ItielOlenick pushed a commit to ItielOlenick/opentelemetry-operator that referenced this issue May 1, 2024