Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix issues related to prometheus relabel configs when target allocator is enabled #1712

Merged
merged 29 commits into from
Jun 1, 2023

Conversation

anunarapureddy
Copy link
Member

@anunarapureddy anunarapureddy commented May 5, 2023

Related to following issues: #1623, #1622, #958

Issue: Collector and Target Allocator pods fail to get created in certain scenarios when Target Allocator is enabled and Prometheus configuration contains labelmap, labeldrop, or labelkeep actions.

Cause:

The issue manifests itself in the following scenarios.

  1. First, when the configMap is built by the operator, the unmarshalling logic calls the Prometheus relabel config validation logic (relabel.go), which includes the additional defaults, including the $1 replacement field, appended to the original Prometheus configuration. Upon initialization, the collector pod loads the configMap and interprets $1 as an environment variable, leading to its replacement with an empty string. As a result, the same Prometheus relabel config validation logic fails when the config is unmarshalled in the collector code, as the replacement key is now replaced with an empty string due to env var substitution.

  2. The second scenario occurs when the Prometheus configuration explicitly includes $$ for labelmap, labeldrop, labelkeep or other actions that undergo regex validation in the Prometheus validation code (relabel.go). In this case, when the operator creates the collector configMap, the regex validation fails, causing the configMap to get created with an empty yaml file.

  3. In the third scenario, when Prometheus configuration explicitly includes $$ for labelmap, labeldrop, labelkeep or other actions, the TA ConfigMap is created with the $$ signs. Since the TA does not support environment variable substitution, the unmarshalling logic during the ConfigMap mount process in the TA pod will perform regex validation and cause the TA pod creation to fail.

Resolution:

Below is the guideline from OpenTelemetry documentation:

Note: Since the collector configuration supports env variable substitution $ characters in your prometheus configuration are interpreted as environment variables. If you want to use $ characters in your prometheus configuration, you must escape them using $$.

Key points considered in order to address the issue:

  • To use "$" characters in Prometheus configuration, they must be escaped using "$$".
  • The collector configuration supports environment variable substitution, but the TA does not.
  • The ConfigMap for the collector pod should have "$$" to prevent env var substitution, unless the user intentionally wants it to happen.
  • The TA ConfigMap should have a single "$", as it does not support env var substitution.

The operator was updated to modify the YAML file directly instead of going through the Prometheus unmarshalling logic to update the SD config. The TA ConfigMap creation process was modified to replace "$$" signs with a "$" to address the third scenario. These changes ensure that both the Collector and TA pods are created correctly, even when the Prometheus configuration includes labelmap, labeldrop, labelkeep, or other actions that undergo regex validation.

cc: @alolita @jaronoff97

@anunarapureddy anunarapureddy requested a review from a team as a code owner May 5, 2023 01:55
@kristinapathak
Copy link
Contributor

I've looked and thought about this but mostly have follow up questions.

For the issue with including $1 - my understanding is this is an issue due to the TA being enabled and adding back $1's which are not supported in the collector configuration:

Note: Since the collector configuration supports env variable substitution $ characters in your prometheus configuration are interpreted as environment variables. If you want to use $ characters in your prometheus configuration, you must escape them using $$.

While I wish they could be supported, I don't think we should be using the operator to make $ variables work (which I believe this PR is doing). My recommendation is instead to modify DefaultRelabelConfig to no longer set Replacement. This is not ideal, but because it's a public var, we can set it in the operator code to our desired default config and the prometheus code will use the new value. Then $1's will no longer be added back into the scrape config when the TA is enabled.

The second issue I see being addressed in this PR is prometheus not validating regex with $$. Is this only with TA enabled or regardless of TA enablement? And is this due to the code change of $->$$? I can't find anything in the prometheus code that explains why this wouldn't work.

I looked through the code in the prometheus receiver to find somewhere that $$ was checked or treated specially but couldn't find anything, so my assumption is that $$ is considered valid in the scrape config, just doesn't pass the regex checks for some reason. In other words, the same scrape config with $$ passed in the operator which causes a problem could be passed in a collector config directly with no problem. Is that true? I plan to look into this more but wanted to be sure to provide timely feedback.

My general takeaway is that there are multiple issues here, and I would rather detangle and solve them separately, if possible. Is there any other issue being addressed here that I've missed?

@anunarapureddy anunarapureddy marked this pull request as draft May 9, 2023 08:46
@anunarapureddy anunarapureddy marked this pull request as ready for review May 10, 2023 23:11
@anunarapureddy
Copy link
Member Author

@open-telemetry/operator-maintainers As a first-time contributor, I need approval from a maintainer to run workflows. Would someone be able to assist me with obtaining approval?

@anunarapureddy anunarapureddy marked this pull request as draft May 11, 2023 03:09
@anunarapureddy anunarapureddy marked this pull request as ready for review May 11, 2023 04:49
Copy link
Contributor

@jaronoff97 jaronoff97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, @kristinapathak and I definitely like this version more than the massaging you had previously. It's easier to understand and the overall design is more extendable. We had a few thoughts/questions to confirm what we had discussed. Thank you again for your contribution 🙇

Copy link
Contributor

@jaronoff97 jaronoff97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, @kristinapathak and I definitely like this version more than the massaging you had previously. It's easier to understand and the overall design is more extendable. We had a few thoughts/questions to confirm what we had discussed. Thank you again for your contribution 🙇

@TylerHelmuth
Copy link
Member

@anunarapureddy grab the latest changes from main to deal with the failing autoscale e2e test.

Copy link
Contributor

@kristinapathak kristinapathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Left a few very minor suggestions/questions.

@jaronoff97
Copy link
Contributor

@anunarapureddy we just added some more e2e tests to help us catch more of these in the future, and there's already one for labeldropping that you just need to uncomment to enable. (PR) Could you rebase and uncomment that test when you get a chance as well?

Copy link
Contributor

@swiatekm swiatekm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me overall. I think we're being too strict about requiring the Prometheus config to exist, but that can be relaxed in a future PR if need be. Overall, despite the verbosity (should we consider parsing permissively into structs instead?), I think this will greatly improve the reliability of the config transformations.

pkg/collector/reconcile/configmap.go Outdated Show resolved Hide resolved
Comment on lines +182 to +200
prometheusConfigProperty, ok := prometheus["config"]
if !ok {
return nil, errorNoComponent("prometheusConfig")
}

prometheusConfig, ok := prometheusConfigProperty.(map[interface{}]interface{})
if !ok {
return nil, errorNotAMap("prometheusConfig")
}

scrapeConfigsProperty, ok := prometheusConfig["scrape_configs"]
if !ok {
return nil, errorNoComponent("scrape_configs")
}

scrapeConfigs, ok := scrapeConfigsProperty.([]interface{})
if !ok {
return nil, errorNotAList("scrape_configs")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we error if the Prometheus config doesn't exist, or just return quietly? See my comment for the TA config unescaping as well.

@anunarapureddy
Copy link
Member Author

@jaronoff97 @swiatekm-sumo Addressed the review comments and I can work on creating a follow up PR to relax the restrictions on prometheus configuration.

Copy link
Contributor

@swiatekm swiatekm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@bryan-aguilar bryan-aguilar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to update the target allocator documentation in the README to clearly document the translation behavior?

@anunarapureddy
Copy link
Member Author

@bryan-aguilar That's a good suggestion. I can create a follow up PR to update the TA documentation.

@bryan-aguilar
Copy link
Contributor

@bryan-aguilar That's a good suggestion. I can create a follow up PR to update the TA documentation.

Do we predict further changes to this functionality after this PR? I worry about documentation drift and a possible loss of context if we don't lump documentation and functional changes in the same PR.

@anunarapureddy
Copy link
Member Author

anunarapureddy commented May 31, 2023

@bryan-aguilar That's a good suggestion. I can create a follow up PR to update the TA documentation.

Do we predict further changes to this functionality after this PR? I worry about documentation drift and a possible loss of context if we don't lump documentation and functional changes in the same PR.

Could you please review the updated README documentation which reflects the translations being done?

README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
pkg/collector/reconcile/config_replace.go Show resolved Hide resolved
Copy link
Contributor

@kristinapathak kristinapathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome! Thanks for putting in so much time and effort into getting $s working 🙂

Copy link
Contributor

@bryan-aguilar bryan-aguilar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding the documentation!

@jaronoff97
Copy link
Contributor

once tests pass, will merge this in :)

@jaronoff97 jaronoff97 merged commit e60c087 into open-telemetry:main Jun 1, 2023
17 checks passed
KitSutliff pushed a commit to KitSutliff/opentelemetry-operator that referenced this pull request Jun 22, 2023
…r is enabled (open-telemetry#1712)

* Code fix for issues open-telemetry#1623, open-telemetry#1622, open-telemetry#958

* updated the PR with an alternative logic to fi the relabel config issue

* fixed codelint issues

* reverted the logic to update scrape_configs

* fixed unit test case failures

* fixed code lint issues

* changes to replace_config to handle unamrshalling issue when tarallocatorewrite flag is enabled

* fixed unit test case failure

* updated e2e test to cover labelkeep usecase

* updated unit test to escape $ signs

* updated unit test to cover missing and invalig configs

* resolved merge conflicts

* updated tests, refactored logic based on the new commits

* uncommented labeldrop e2e test

* added back target allocator attribute

* comment for unescaping TA config

* updated the README document to reflect the traslations

* updated the README document to reflect the traslations

* updated the TA config section in the README document

* addressed the review comment on README doc

---------

Co-authored-by: Jacob Aronoff <jaronoff97@users.noreply.github.com>
ItielOlenick pushed a commit to ItielOlenick/opentelemetry-operator that referenced this pull request May 1, 2024
…r is enabled (open-telemetry#1712)

* Code fix for issues open-telemetry#1623, open-telemetry#1622, open-telemetry#958

* updated the PR with an alternative logic to fi the relabel config issue

* fixed codelint issues

* reverted the logic to update scrape_configs

* fixed unit test case failures

* fixed code lint issues

* changes to replace_config to handle unamrshalling issue when tarallocatorewrite flag is enabled

* fixed unit test case failure

* updated e2e test to cover labelkeep usecase

* updated unit test to escape $ signs

* updated unit test to cover missing and invalig configs

* resolved merge conflicts

* updated tests, refactored logic based on the new commits

* uncommented labeldrop e2e test

* added back target allocator attribute

* comment for unescaping TA config

* updated the README document to reflect the traslations

* updated the README document to reflect the traslations

* updated the TA config section in the README document

* addressed the review comment on README doc

---------

Co-authored-by: Jacob Aronoff <jaronoff97@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants