This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

fix(repocachemanager): adding rate limit condition message #3543

Merged
merged 1 commit into fluxcd:master
Aug 24, 2021

Conversation

fdpeiter
Contributor


Recently we've started seeing several messages related to the Docker Hub pull limit in our clusters:

{"auth":"{map[]}","caller":"repocachemanager.go:226","canonical_name":"index.docker.io/connecteverything/nats-operator","component":"warmer","err":{"errors":[{"code":"TOOMANYREQUESTS","message":"You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit"}]},"ref":"docker.io/connecteverything/nats-operator:0.8.0","ts":"2021-08-17T15:50:10.734988918Z"}

This is unexpected, as there is a specific switch case for handling rate-limiting here. I checked the content of err.Error() and got this: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit.

I've added this error message as a new condition for the switch case. I've tested it already and got the expected behaviour:

{"auth":"{map[]}","caller":"repocachemanager.go:216","canonical_name":"index.docker.io/tfgco/will-iam","component":"warmer","ts":"2021-08-20T17:20:46.220746763Z","warn":"aborting image tag fetching due to rate limiting, will try again later"}

Signed-off-by: Felipe Peiter <felipe.peiter@wildlifestudios.com>
@kingdonb
Member

Thanks for making a contribution here! We can certainly get this merged, and CI will push out a pre-release image with the change. It may take some time to get another release scheduled, as we have just had a new minor release.

Can you say something about the impact of introducing this change on your cluster? It's not easy for me to follow what's going on here in the code. On the latest version of Flux, v1.24.0, I gather that these rate-limiting messages would be misclassified as a generic error, so I imagine that none of the rate-limiting mitigation that triggers backoff would kick in, but I'm not really sure what the direct impact or significance of warnAboutRateLimit is.

Is there any actual, quantified performance impact of travelling down the wrong code path here? From what you've seen, I'm assuming you've spent some time testing this change out locally. (Or is the result simply to correct a bad behavior in the logs?)

@fdpeiter
Contributor Author

fdpeiter commented Aug 23, 2021

Hello there, @kingdonb. Thanks for the quick response!

I actually did get the chance to dig a little bit into this. After implementing the change, it seems that only a single call to update the images is made, which greatly helps lower the number of pulls on Docker Hub:
[Screenshot: image pull volume after the change]

This graph shows how it was before, when we were hitting the TOOMANYREQUESTS exception:
[Screenshot: image pull volume before the change]

This particular application only fetched a single image from Docker Hub, but this behavior was more visible in applications without registry scan restrictions.

I do suspect that the default switch case doesn't cancel the scan after a failure, while the TOOMANYREQUESTS case does.
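To illustrate the suspicion, here is a rough, hypothetical sketch (not the actual flux code; fetchManifest, warmTags, and the counting are stand-ins): aborting the scan on the first rate-limit error caps the number of registry requests, while the generic path keeps issuing one request per remaining tag.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// fetchManifest is a hypothetical stand-in for the real registry call.
// It succeeds twice and then fails with Docker Hub's rate-limit error,
// to mimic hitting the pull limit partway through a scan.
func fetchManifest(calls *int) error {
	*calls++
	if *calls > 2 {
		return errors.New("toomanyrequests: You have reached your pull rate limit.")
	}
	return nil
}

// warmTags sketches the tag-fetching loop: with abortOnRateLimit the scan
// stops at the first 429; without it the loop keeps issuing a request per tag.
func warmTags(tags []string, abortOnRateLimit bool) int {
	calls := 0
	for range tags {
		if err := fetchManifest(&calls); err != nil {
			if abortOnRateLimit && strings.Contains(err.Error(), "toomanyrequests:") {
				fmt.Println("aborting image tag fetching due to rate limiting, will try again later")
				return calls // give up on the remaining tags until the next warm cycle
			}
			continue // generic error: just skip this tag and keep going
		}
	}
	return calls
}

func main() {
	tags := make([]string, 100) // pretend the repo has 100 tags
	fmt.Println("registry requests with the fix:   ", warmTags(tags, true))
	fmt.Println("registry requests without the fix:", warmTags(tags, false))
}
```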

Please let me know if you need more information.

@kingdonb
Member

kingdonb commented Aug 23, 2021

Thanks very much, that's exactly the kind of color I wanted to add to this report before merging this. So, the performance is affected because Flux fails to back off properly when the "too many requests" exception isn't properly handled as a rate limit.

I'd like to go ahead and merge this as soon as possible; however, I'm considering leaving the PR open for a couple of days to see whether anyone can poke holes in it. I'd be interested to know whether this is just something new that Docker Hub has started, and whether some other image repo host might be sending back its 429 errors in yet a third format that doesn't match this one, in case there are more cases we need to cover here. If someone else reports that this issue is affecting them in the same way, we can merge it more quickly and get a release out soon.

I don't want to wait too long, since many people could be affected if this impacts Docker Hub users; but if you have already remediated the issue in your own Flux deployments, I think I can delay a while to avoid spending half a day or a full day promoting a release when we might have another PR opened tomorrow that could be merged into the same release. I will mark this for the next release milestone and get back to you soon about it.

Thanks again for contributing this! Sorry for the poor experience for anyone who was impacted by this and finds this issue.

@kingdonb kingdonb added this to the 1.24.1 milestone Aug 23, 2021
@kingdonb kingdonb self-requested a review August 23, 2021 19:01
Member

@kingdonb kingdonb left a comment

The content of this change looks fine to me. 🎉 🌮 Congrats, first-time contributor to Flux 😄

@fdpeiter
Contributor Author

fdpeiter commented Aug 23, 2021

> Thanks very much, that's exactly the kind of color I wanted to add to this report before merging this. So, the performance is affected because Flux fails to back off properly when the "too many requests" exception isn't properly handled as a rate limit.

That sounds correct, as it doesn't seem to fetch all images after being rate-limited.

> I don't want to wait too long, since many people could be affected if this impacts Docker Hub users; but if you have already remediated the issue in your own Flux deployments, I think I can delay a while to avoid spending half a day or a full day promoting a release when we might have another PR opened tomorrow that could be merged into the same release. I will mark this for the next release milestone and get back to you soon about it.

That makes a lot of sense! I will roll this out to 40+ clusters this week, and I can bring updates if I find anything. No need to rush.

Thanks again!

@kingdonb kingdonb merged commit c1267f5 into fluxcd:master Aug 24, 2021
@kingdonb
Member

#3544 is open, tracking the release.

I will probably wait for your report about how it went before I merge this back to the 1.24.x branch and push the release out.

Thanks for coordinating this with me!

@fdpeiter
Contributor Author

Hey there, @kingdonb!

Results seem great. For comparison:
[Screenshot: log volume for the new rate-limit warning]
Those are logs related to "aborting image tag fetching due to rate limiting, will try again later".
[Screenshot: log volume for TOOMANYREQUESTS errors]
Those are logs for TOOMANYREQUESTS.

You can safely ignore those two peaks, as I did a memcached upgrade (which cleared the cache).

@kingdonb kingdonb mentioned this pull request Aug 29, 2021