-
Notifications
You must be signed in to change notification settings - Fork 395
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix issues related to prometheus relabel configs when target allocator is enabled #1712
Conversation
I've looked and thought about this but mostly have follow up questions. For the issue with including
While I wish they could be supported, I don't think we should be using the operator to make The second issue I see being addressed in this PR is prometheus not validating regex with I looked through the code in the prometheus receiver to find somewhere that My general takeaway is that there are multiple issues here, and I would rather detangle and solve them separately, if possible. Is there any other issue being addressed here that I've missed? |
@open-telemetry/operator-maintainers As a first-time contributor, I need approval from a maintainer to run workflows. Would someone be able to assist me with obtaining approval? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, @kristinapathak and I definitely like this version more than the massaging you had previously. It's easier to understand and the overall design is more extendable. We had a few thoughts/questions to confirm what we had discussed. Thank you again for your contribution 🙇
pkg/collector/testdata/relabel_config_expected_with_sd_config.yaml
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, @kristinapathak and I definitely like this version more than the massaging you had previously. It's easier to understand and the overall design is more extendable. We had a few thoughts/questions to confirm what we had discussed. Thank you again for your contribution 🙇
@anunarapureddy grab the latest changes from main to deal with the failing autoscale e2e test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! Left a few very minor suggestions/questions.
@anunarapureddy we just added some more e2e tests to help us catch more of these in the future, and there's already one for labeldropping that you just need to uncomment to enable. (PR) Could you rebase and uncomment that test when you get a chance as well? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me overall. I think we're being too strict about requiring the Prometheus config to exist, but that can be relaxed in a future PR if need be. Overall, despite the verbosity (should we consider parsing permissively into structs instead?), I think this will greatly improve the reliability of the config transformations.
prometheusConfigProperty, ok := prometheus["config"] | ||
if !ok { | ||
return nil, errorNoComponent("prometheusConfig") | ||
} | ||
|
||
prometheusConfig, ok := prometheusConfigProperty.(map[interface{}]interface{}) | ||
if !ok { | ||
return nil, errorNotAMap("prometheusConfig") | ||
} | ||
|
||
scrapeConfigsProperty, ok := prometheusConfig["scrape_configs"] | ||
if !ok { | ||
return nil, errorNoComponent("scrape_configs") | ||
} | ||
|
||
scrapeConfigs, ok := scrapeConfigsProperty.([]interface{}) | ||
if !ok { | ||
return nil, errorNotAList("scrape_configs") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we error if the Prometheus config doesn't exist, or just return quietly? See my comment for the TA config unescaping as well.
@jaronoff97 @swiatekm-sumo Addressed the review comments and I can work on creating a follow up PR to relax the restrictions on prometheus configuration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to update the target allocator documentation in the README to clearly document the translation behavior?
@bryan-aguilar That's a good suggestion. I can create a follow up PR to update the TA documentation. |
Do we predict further changes to this functionality after this PR? I worry about documentation drift and a possible loss of context if we don't lump documentation and functional changes in the same PR. |
Could you please review the updated README documentation which reflects the translations being done? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome! Thanks for putting in so much time and effort into getting $
s working 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for adding the documentation!
once tests pass, will merge this in :) |
…r is enabled (open-telemetry#1712) * Code fix for issues open-telemetry#1623, open-telemetry#1622, open-telemetry#958 * updated the PR with an alternative logic to fi the relabel config issue * fixed codelint issues * reverted the logic to update scrape_configs * fixed unit test case failures * fixed code lint issues * changes to replace_config to handle unamrshalling issue when tarallocatorewrite flag is enabled * fixed unit test case failure * updated e2e test to cover labelkeep usecase * updated unit test to escape $ signs * updated unit test to cover missing and invalig configs * resolved merge conflicts * updated tests, refactored logic based on the new commits * uncommented labeldrop e2e test * added back target allocator attribute * comment for unescaping TA config * updated the README document to reflect the traslations * updated the README document to reflect the traslations * updated the TA config section in the README document * addressed the review comment on README doc --------- Co-authored-by: Jacob Aronoff <jaronoff97@users.noreply.github.com>
…r is enabled (open-telemetry#1712) * Code fix for issues open-telemetry#1623, open-telemetry#1622, open-telemetry#958 * updated the PR with an alternative logic to fi the relabel config issue * fixed codelint issues * reverted the logic to update scrape_configs * fixed unit test case failures * fixed code lint issues * changes to replace_config to handle unamrshalling issue when tarallocatorewrite flag is enabled * fixed unit test case failure * updated e2e test to cover labelkeep usecase * updated unit test to escape $ signs * updated unit test to cover missing and invalig configs * resolved merge conflicts * updated tests, refactored logic based on the new commits * uncommented labeldrop e2e test * added back target allocator attribute * comment for unescaping TA config * updated the README document to reflect the traslations * updated the README document to reflect the traslations * updated the TA config section in the README document * addressed the review comment on README doc --------- Co-authored-by: Jacob Aronoff <jaronoff97@users.noreply.github.com>
Related to following issues: #1623, #1622, #958
Issue: Collector and Target Allocator pods fail to get created in certain scenarios when Target Allocator is enabled and Prometheus configuration contains labelmap, labeldrop, or labelkeep actions.
Cause:
The issue manifests itself in the following scenarios.
First, when the configMap is built by the operator, the unmarshalling logic calls the Prometheus relabel config validation logic (relabel.go), which includes the additional defaults, including the $1 replacement field, appended to the original Prometheus configuration. Upon initialization, the collector pod loads the configMap and interprets $1 as an environment variable, leading to its replacement with an empty string. As a result, the same Prometheus relabel config validation logic fails when the config is unmarshalled in the collector code, as the replacement key is now replaced with an empty string due to env var substitution.
The second scenario occurs when the Prometheus configuration explicitly includes $$ for labelmap, labeldrop, labelkeep or other actions that undergo regex validation in the Prometheus validation code (relabel.go). In this case, when the operator creates the collector configMap, the regex validation fails, causing the configMap to get created with an empty yaml file.
In the third scenario, when Prometheus configuration explicitly includes $$ for labelmap, labeldrop, labelkeep or other actions, the TA ConfigMap is created with the $$ signs. Since the TA does not support environment variable substitution, the unmarshalling logic during the ConfigMap mount process in the TA pod will perform regex validation and cause the TA pod creation to fail.
Resolution:
Below is the guideline from OpenTelemetry documentation:
Key points considered in order to address the issue:
The operator was updated to modify the YAML file directly instead of going through the Prometheus unmarshalling logic to update the SD config. The TA ConfigMap creation process was modified to replace "$$" signs with a "$" to address the third scenario. These changes ensure that both the Collector and TA pods are created correctly, even when the Prometheus configuration includes labelmap, labeldrop, labelkeep, or other actions that undergo regex validation.
cc: @alolita @jaronoff97