This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

fix(repocachemanager): adding rate limit condition message #3543

Merged
merged 1 commit into fluxcd:master
Aug 24, 2021

Conversation

fdpeiter
Contributor


Recently we've started seeing several messages related to the Docker Hub pull limit in our clusters:

{"auth":"{map[]}","caller":"repocachemanager.go:226","canonical_name":"index.docker.io/connecteverything/nats-operator","component":"warmer","err":{"errors":[{"code":"TOOMANYREQUESTS","message":"You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit"}]},"ref":"docker.io/connecteverything/nats-operator:0.8.0","ts":"2021-08-17T15:50:10.734988918Z"}

This is unexpected, as there is a specific switch case for handling rate-limiting here. I checked the content of err.Error() and got this: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit.

I've added this error message as a new condition for the switch case. I've tested it already and got the expected behaviour:

{"auth":"{map[]}","caller":"repocachemanager.go:216","canonical_name":"index.docker.io/tfgco/will-iam","component":"warmer","ts":"2021-08-20T17:20:46.220746763Z","warn":"aborting image tag fetching due to rate limiting, will try again later"}

Signed-off-by: Felipe Peiter <felipe.peiter@wildlifestudios.com>
@kingdonb
Member

Thanks for making a contribution here! We can certainly get this merged, and CI will push out a pre-release image with the change. It may take some time to get another release scheduled, as we have just had a new minor release.

Can you say something about the impact of introducing this change on your cluster? It's not easy for me to follow what's going on here in the code. On the latest version of Flux, v1.24.0, I gather that these rate-limiting messages would be misclassified as a generic error, so I imagine that none of the rate-limiting mitigation that triggers backoff would kick in, but I'm not really sure what the direct impact or significance of warnAboutRateLimit is.

Is there any actual, quantified performance impact of travelling down the wrong code path here? From what you've seen, I'm assuming you've spent some time testing this change out locally. (Or is the result simply to correct a bad behavior in the logs?)

@fdpeiter
Contributor Author

fdpeiter commented Aug 23, 2021

Hello there, @kingdonb. Thanks for the quick response!

I actually did get the chance to dig a little bit into this. After implementing the change, it seems that only a single call to update the images is made, which greatly helps lower the number of pulls on Docker Hub:
[Screenshot: image pull volume after the change]

This graph shows how it was before, when we were hitting the TOOMANYREQUESTS exception:
[Screenshot: image pull volume before the change]

This particular application only fetched a single image from Docker Hub, but this behavior was more visible in applications without registry scan restrictions.

I do suspect that the default switch case doesn't cancel the scan after a failure, while the TOOMANYREQUESTS case does.
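To illustrate the suspicion, here is a rough, hypothetical sketch (not the actual flux code; fetchManifest, warmTags, and the counting are stand-ins): aborting the scan on the first rate-limit error caps the number of registry requests, while the generic path keeps issuing one request per remaining tag.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// fetchManifest is a hypothetical stand-in for the real registry call.
// It succeeds twice and then fails with Docker Hub's rate-limit error,
// to mimic hitting the pull limit partway through a scan.
func fetchManifest(calls *int) error {
	*calls++
	if *calls > 2 {
		return errors.New("toomanyrequests: You have reached your pull rate limit.")
	}
	return nil
}

// warmTags sketches the tag-fetching loop: with abortOnRateLimit the scan
// stops at the first 429; without it the loop keeps issuing a request per tag.
func warmTags(tags []string, abortOnRateLimit bool) int {
	calls := 0
	for range tags {
		if err := fetchManifest(&calls); err != nil {
			if abortOnRateLimit && strings.Contains(err.Error(), "toomanyrequests:") {
				fmt.Println("aborting image tag fetching due to rate limiting, will try again later")
				return calls // give up on the remaining tags until the next warm cycle
			}
			continue // generic error: just skip this tag and keep going
		}
	}
	return calls
}

func main() {
	tags := make([]string, 100) // pretend the repo has 100 tags
	fmt.Println("registry requests with the fix:   ", warmTags(tags, true))
	fmt.Println("registry requests without the fix:", warmTags(tags, false))
}
```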

Please let me know if you need more information.

@kingdonb
Member

kingdonb commented Aug 23, 2021

Thanks very much, that's exactly the kind of color I wanted to add to this report before merging this. So, the performance is affected because Flux fails to back off properly when the "too many requests" exception isn't properly handled as a rate limit.

I'd like to go ahead and merge this as soon as possible; however, I'm considering leaving the PR open for a couple of days to see whether anyone can poke holes in it. I'd be interested to know whether this is just something new that Docker Hub has started, and whether some other image repo host might be sending back its 429 errors in yet a third format that doesn't match this one, in case there are more cases we need to cover here. If someone else reports that this issue is affecting them in the same way, we can merge it more quickly and get a release out soon.

I don't want to wait too long, since many people could be affected if this impacts Docker Hub users; but if you have already remediated the issue in your own Flux deployments, I think I can delay a while to avoid spending half a day or a full day promoting a release when we might have another PR opened tomorrow that could be merged into the same release. I will mark this for the next release milestone and get back to you soon about it.

Thanks again for contributing this! Sorry for the poor experience for anyone who was impacted by this and finds this issue.

@kingdonb kingdonb added this to the 1.24.1 milestone Aug 23, 2021
@kingdonb kingdonb self-requested a review August 23, 2021 19:01
Member

@kingdonb kingdonb left a comment

The content of this change looks fine to me. 🎉 🌮 Congrats, first-time contributor to Flux 😄

@fdpeiter
Contributor Author

fdpeiter commented Aug 23, 2021

> Thanks very much, that's exactly the kind of color I wanted to add to this report before merging this. So, the performance is affected because Flux fails to back off properly when the "too many requests" exception isn't properly handled as a rate limit.

That sounds correct, as it doesn't seem to fetch all images after being rate-limited.

> I don't want to wait too long, since many people could be affected if this impacts Docker Hub users; but if you have already remediated the issue in your own Flux deployments, I think I can delay a while to avoid spending half a day or a full day promoting a release when we might have another PR opened tomorrow that could be merged into the same release. I will mark this for the next release milestone and get back to you soon about it.

That makes a lot of sense! I will roll this out to 40+ clusters this week, and I can bring updates if I find anything. No need to rush.

Thanks again!

@kingdonb kingdonb merged commit c1267f5 into fluxcd:master Aug 24, 2021
@kingdonb
Member

#3544 is open, tracking the release.

I will probably wait for your report about how it went before I merge this back to the 1.24.x branch and push the release out.

Thanks for coordinating this with me!

@fdpeiter
Contributor Author

Hey there, @kingdonb!

Results seem great. For comparison:
[Screenshot: log volume for the new rate-limit warning]
Those are logs related to "aborting image tag fetching due to rate limiting, will try again later".
[Screenshot: log volume for TOOMANYREQUESTS errors]
Those are logs for TOOMANYREQUESTS.

You can safely ignore those two peaks, as I did a memcached upgrade (which cleared the cache).

@kingdonb kingdonb mentioned this pull request Aug 29, 2021