Address new ACME v2 rate limit ("too many new orders recently") #217
base: master
Conversation
The ACME v2 API has a new rate limit of 300 new orders per account per 3 hours, which easily results in the "too many new orders recently" error for new or renewing domains. (see auto-ssl#213) As a first step to mitigate this new requirement, throttle the rate with which renewal requests are issued to a default 60 per hour (which leaves 40 for on-demand issues per hour).
With throttling, renewal jobs can take a significant time to complete (e.g. when a bunch of domains are renewing around the same time). To make sure we can spread out the renewal as evenly as possible throughout the period, and to ensure the renewal checks on different workers stay in sync, consider the duration of the actual job execution when re-scheduling the check.
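For illustration, a minimal sketch of how such a throttle might be wired up in `init_by_lua_block`, assuming the `renewals_per_hour` option introduced with this change follows the existing `auto_ssl:set` convention (the value below is the default mentioned in this PR):

```lua
-- Sketch only: assumes the renewals_per_hour option added by this PR.
auto_ssl = (require "resty.auto-ssl").new()

-- Throttle renewals to 60/hour, leaving headroom under Let's Encrypt's
-- 300 new orders / account / 3 hours limit for on-demand issuance.
auto_ssl:set("renewals_per_hour", 60)

auto_ssl:init()
```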
gohai force-pushed the branch from f161c9e to bc2a999
This is to increase the chances that environments using multiple servers won't have all instances renewing at the same time, which might tie up all orders available to the account, and thus prevent certificates for new domains from being issued.
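One way to picture the staggering described above (an illustrative sketch, not the actual diff): give each instance's recurring renewal check a random initial delay, so that servers start their renewal runs at different times.

```lua
-- Illustrative sketch (not the actual change): stagger the first renewal
-- check per instance so multiple servers don't renew in lockstep.
local check_interval = 86400  -- hypothetical: one renewal run per day

local function renewal_check(premature)
  if premature then return end
  -- ... run the renewal job here, then re-schedule the next check based on
  -- how long this run actually took ...
end

local ok, err = ngx.timer.at(math.random() * check_interval, renewal_check)
if not ok then
  ngx.log(ngx.ERR, "auto-ssl: failed to create renewal check timer: ", err)
end
```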
This implements the minimal functionality for considering a domain's past certificate failures inside allow_domain. We believe something like this to be necessary to comply with the new ACME v2 limit of 300 new orders/account/3h. A different approach that might be worth considering is saving the statistical information to (permanent) storage, like we do with certificates. This patch instead only uses a shm-based dictionary, meaning that the information might be stale in a multi-server setup. (Dictionary entries are created with an expiration date, and we're running the following patch to look for a certificate in storage before calling allow_domain, meaning that this shouldn't be a problem in practice: b1f9715)
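As a rough sketch of the shm-based bookkeeping described in this commit message (key layout and TTL below are made up for illustration, not the exact implementation):

```lua
-- Rough sketch of shm-backed failure records with an expiration
-- (key names and TTL are illustrative, not the actual implementation).
local dict = ngx.shared.auto_ssl_failures

local function record_failure(domain)
  local now = ngx.now()
  local ttl = 86400  -- hypothetical: let stale records expire after a day
  local num = (dict:get(domain .. ":num") or 0) + 1
  dict:set(domain .. ":num", num, ttl)
  dict:set(domain .. ":last", now, ttl)
  if num == 1 then
    dict:set(domain .. ":first", now, ttl)
  end
end
```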
gohai force-pushed the branch from 665f2d3 to 6e1ea73
Previously, renewals were attempted 30 days before domains expired, and then every day going forward - until either the renewal succeeded or the existing certificate expired. This poses an issue with the new ACME v2 rate limit ("too many new orders recently"), because large numbers of domains might be expiring (and thus renewing) at the same time - e.g. because they all got moved onto lua-resty-auto-ssl at the same time initially. To spread the renewal dates out more evenly, this patch rolls the dice on whether a domain should be renewed, with increasing odds that it will be, between 30 and 15 days before the domain expires. Renewals for domains expiring in less than 15 days are always attempted.
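In other words, the described behavior amounts to drawing a random renewal date inside the 30-to-15-day window and only renewing once that date has passed; a simplified sketch of that idea (helper names are hypothetical placeholders, and this is not the literal diff):

```lua
-- Simplified sketch of the described behavior (not the literal diff);
-- `expiry`, `domain` and `attempt_renewal` are hypothetical placeholders.
local day = 24 * 60 * 60
local now = ngx.now()
-- random "renewal date" between 30 and 15 days before expiry
local renew_at = math.random(expiry - 30 * day, expiry - 15 * day)

if expiry - now < 15 * day then
  attempt_renewal(domain)   -- less than 15 days left: always attempt
elseif now >= renew_at then
  attempt_renewal(domain)   -- odds of this increase as expiry approaches
end
```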
@GUI This is becoming a problem. Can you merge this soon?
I'm running into this issue as well, would be great to get this merged.
Thanks all. I'll try to review and release this weekend. Sorry for the delay!
Sorry to add another nagging voice to the pile, but we're also struggling with this problem. Is there anything anyone else can do to test/review this and help build confidence for the merge?
(Committer here) To provide a bit of additional context: we've been running these changes in production since January. They are not a guaranteed "solution" to the new rate limit introduced with ACME v2, but they change the existing code to issue renewal requests as steadily as possible, so that renewals won't unnecessarily push you over the limit at times. This is happening on a "best effort" basis, short of more radical changes, such as having lua-resty-auto-ssl rotate between multiple Let's Encrypt accounts etc. In addition, 6e1ea73 adds a generic mechanism that allows setting a policy for how frequently domains are retried for which we weren't able to get a certificate previously. We've been using this with an exponential backoff, where we ultimately waste only a single order per day on problematic domains - but the desired behavior probably differs among auto-ssl users, so this merely gives you the tools to implement it. (The one slightly ambiguous part IMHO: we're using 6e1ea73 together with b1f9715, a change to check if we already have a cert in permanent storage before calling allow_domain, which was previously rejected during review. Without the latter patch, you might have a situation in a multi-server setup where one server serves a certificate for a domain, while another still returns the fallback certificate because of per-domain throttling. With the latter patch, auto-ssl will always return a certificate once any of the servers has been able to get one.)
@gohai: There's lots of good stuff here, a big thanks for helping figure this out.
Everyone: Sorry for the delay in getting this merged. If you want to use @gohai's branch in the meantime, this all looks solid to me; I just want to mull over some different approaches before merging and releasing.
@gohai: I left some inline comments with some very initial thoughts, but I probably need to spend a bit more time giving all of this thought. Let me know if any of my initial questions didn't make sense or if you have any other thoughts. Happy to try and collaborate on this. Many thanks!
elseif now + (15 * 24 * 60 * 60) < cert["expiry"] then
  local rand_value = math.random(cert["expiry"] - (30 * 24 * 60 * 60), cert["expiry"] - (15 * 24 * 60 * 60))
  local rand_renewal_threshold = now
  if rand_value < rand_renewal_threshold then
    ngx.log(ngx.NOTICE, "auto-ssl: expiry date is more than 15 days out, randomly not picked for renewal: ", domain)
    renew_check_cert_unlock(domain, storage, local_lock, distributed_lock_value)
    return
  end
Is this increasing-likelihood approach needed with the other throttle mechanism provided by `renewals_per_hour` (60/hour)? I guess I'm just trying to think through the different scenarios with this and wondering how these two approaches interact.
If you have a big batch of domains that need to be renewed, would the `renewals_per_hour` logic not eventually get through them, just at a slower pace? Or is this logic here an attempt to prioritize the renewals, so that ones expiring sooner have their renewal attempted sooner? Would this mainly be relevant if you expect your renewals to take more than 30 days at the throttled 60/hour rate?
If this is mainly intended to prioritize renewals (but you still want to saturate the 60/hour rate), then I'm wondering if maybe a slightly simpler approach would be to sort the renewals by expiration (instead of the current approach of randomizing the renewal order). That being said, there may be some performance implications of fetching all of those expiration times before renewals occur, so this may not actually be feasible; I'd need to investigate further.
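For concreteness, the sorting idea might look roughly like this (hypothetical and simplified; `domains` would have to be a list of records with expiry timestamps already fetched from storage, which is exactly the performance concern mentioned above):

```lua
-- Hypothetical sketch: renew in order of expiration instead of random order.
-- `domains` is assumed to be a list of { name = ..., expiry = ... } records
-- fetched from storage beforehand.
table.sort(domains, function(a, b)
  return a.expiry < b.expiry
end)

for _, entry in ipairs(domains) do
  -- simplified stand-in for the per-domain renewal call
  renew_check_cert(entry.name)
end
```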
Great question about how this interacts with the throttle mechanism!
I implemented this after realizing that we had a bulk of many thousands of certificates attempting to renew on the same day, as this was the date when we had initially migrated them onto `lua-resty-auto-ssl`.
While this went on, at some point we exceeded the rate limit, and this caused an issue, as we weren't able to generate certificates for newly-registered customer domains. This issue likely wouldn't have happened if we had set our `renewals_per_hour` more conservatively (we must have used more than 40/h for ad-hoc certificate requests).
So the aim of this patch is really just to smooth out the renewals over many days, anticipating that this will make it less likely that we exceed the rate limit. (Example: I start using `lua-resty-auto-ssl` today and migrate 10k domains onto the system. Before this patch, all of these domains would be attempted for renewal simultaneously 60 days from now, a job which would take about 7 days to finish at 60/h. This could be just fine. This patch changes the behavior so that the renewals instead get spread over the next ~14 invocations, each of which should only take about 12 hours to finish.)
In case of the former, we'd be "maxing out" the 60/h for a week straight - a process which would reoccur every two months. In case of the latter, we'd be at 60/h for only half of the time, leaving a little more room the rest of the time for ad-hoc certificate requests to exceed 40/h. And over time, the renewals would inevitably spread out even more evenly, so that we'd ultimately be attempting renewal of roughly the same number of domains on average.
Does this make more sense to you after reading this?
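A quick sanity check of the numbers in that example (10k domains, 60 renewals/hour, spread over roughly 14 daily runs):

```lua
-- Back-of-the-envelope check of the example figures above.
local domains = 10000
local renewals_per_hour = 60

print(domains / renewals_per_hour / 24)   -- ~6.9 days if everything renews at once
print(domains / 14 / renewals_per_hour)   -- ~11.9 hours per run when spread over ~14 runs
```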
#### `get_failures`

The optional `get_failures` function accepts a domain name argument and can be used to retrieve statistics about failed certificate requests for that domain. The function returns a table with the fields `first` (timestamp of the first failure encountered), `last` (timestamp of the most recent failure encountered), and `num` (number of failures). If no failure has been encountered, the function returns `nil` instead.

Note: the statistics are only kept for as long as the nginx instance is running. Sharing across multiple servers (as in a load-balanced environment) is not implemented.

To make use of the `get_failures` function, add the following to the `http` configuration block:

```nginx
lua_shared_dict auto_ssl_failures 1m;
```

When this shm-based dictionary exists, `lua-resty-auto-ssl` will use it to update a record it keeps for the domain whenever a Let's Encrypt certificate request fails (for new domains as well as for renewals). When a certificate request is successful, `lua-resty-auto-ssl` deletes the record it has for the domain, so that future invocations return `nil`.

The `get_failures` function can be used inside `allow_domain` to implement per-domain rate limiting and similar rule sets.

*Example:*

```lua
auto_ssl:set("allow_domain", function(domain, auto_ssl, ssl_options, renewal)
  local failures = auto_ssl:get_failures(domain)
  -- only attempt one certificate request per hour
  if not failures or 3600 < ngx.now() - failures["last"] then
    return true
  else
    return false
  end
end)
```

#### `track_failure`

The optional `track_failure` function accepts a domain name argument and records a failure for this domain. This can be used to avoid repeated lookups of a domain in `allow_domain`.

*Example:*

```lua
auto_ssl:set("allow_domain", function(domain, auto_ssl, ssl_options, renewal)
  local failures = auto_ssl:get_failures(domain)
  -- only attempt one lookup or certificate request per hour
  if failures and ngx.now() - failures["last"] <= 3600 then
    return false
  end

  local allow
  -- (external lookup to check domain, e.g. via HTTP)
  if not allow then
    auto_ssl:track_failure(domain)
    return false
  else
    return true
  end
end)
```
This `get_failures` and `track_failure` logic to help throttle initial registrations is definitely a flexible solution that allows for some nifty logic in the `allow_domain` callback. I think I see how all this fits together (nice!), but I guess I'm just mulling over other options to deal with the same issue. A couple of things come to mind.

- You commented:

  > The one slightly ambiguous part IMHO: we're using 6e1ea73 together with b1f9715, a change to check if we already have a cert in permanent storage before calling allow_domain, which was previously rejected during review. Without the latter patch, you might have a situation in a multi-server setup where one server serves a certificate for a domain, while another still returns the fallback certificate because of per-domain throttling. With the latter patch, auto-ssl will always return a certificate once any of the servers has been able to get one.

  The ordering issues around this do make me wonder whether `allow_domain` is the right place to implement this logic, because in a way this seems like a slightly different concept than what `allow_domain` was originally intended to do. Perhaps we could introduce a different callback (e.g. `throttle_domain`) that would get called after the storage call (e.g. slide it into where you had proposed `allow_domain` get moved in b1f9715)? That would keep `allow_domain` being called before the storage call, but then allow this logic to be handled separately.
- I'm also wondering if we could integrate this logic in a more built-in fashion, so that users don't have to implement a custom `allow_domain` (or other callbacks) to handle this. Similar to how `renewals_per_hour` is just a simple global setting, I'm wondering if we could simplify this in some fashion and integrate the logic directly into the codebase, rather than relying on the callback. Or do your use-cases need more custom or per-domain logic for this type of thing?
I'd need to spend some more time investigating the feasibility of these ideas (and I realize you might have different use-cases), but these are just some initial thoughts.
Thank you for your review and thoughts, @GUI.
To your points:
- Separate callbacks would definitely work for us. We don't mind the Redis lookup to see if a certificate is in storage or not, but to accommodate users that do, having two makes a lot of sense to me (perhaps `_early` and `_late`?).

  Part of my thinking behind adding this to the implementer's "toolbox" in `allow_domain` was the realization that the `track_failure(domain)` mechanism could also be used to cache any "regular" lookup that `allow_domain` would do, if desired. (Example: we have to do an HTTP call to an internal web API to decide whether we should allow a new (sub-)domain.)
- I do see the beauty of having a simple setting and having everything handled behind the scenes by `lua-resty-auto-ssl`.
To be specific, we're currently using this exponential backoff algorithm in our setup. Unsure if something like this could be generalized to work for other users (without a callback?):
local failures = auto_ssl:get_failures(domain)
if failures then
  -- 1 failure:  next attempt after 3.75 min
  -- 2 failures: next attempt after 7.5 min
  -- 3 failures: next attempt after 15 min
  -- 4 failures: next attempt after 30 min
  -- 5 failures: next attempt after 60 min
  -- ...
  -- 10+ failures: next attempt after 24 hours
  local backoff = math.min(225 * 2 ^ (failures["num"] - 1), 86400)
  local elapsed = ngx.now() - failures["last"]
  if elapsed < backoff then
    ngx.log(ngx.NOTICE, domain .. ": " .. failures["num"] .. " cert failures, last attempt " .. math.floor(elapsed) .. "s ago, want to wait " .. backoff .. "s before trying again")
    return false
  end
end
Conceptually, the simplest mechanism to me would be one that (a) caches the results of `allow_domain`, (b) equally takes cert failures into consideration, and (c) uses permanent storage so that the tracking is automatically shared across multiple servers. (My patch only does (b) and gives the implementer the tools to implement (a).)
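For what it's worth, a heavily simplified sketch of what (b) plus (c) could look like with failure tracking kept directly in Redis (key names, TTL and connection details are made up for illustration; this is not part of the patch):

```lua
-- Hypothetical sketch: shared failure tracking in Redis so that all servers
-- see the same counters (key names, TTL and connection details are made up).
local redis = require "resty.redis"

local function shared_track_failure(domain)
  local red = redis:new()
  red:set_timeout(1000)  -- 1s

  local ok, err = red:connect("127.0.0.1", 6379)
  if not ok then
    ngx.log(ngx.ERR, "failure tracking: redis connect failed: ", err)
    return
  end

  red:incr("auto-ssl:failures:num:" .. domain)
  red:set("auto-ssl:failures:last:" .. domain, ngx.now())
  red:expire("auto-ssl:failures:num:" .. domain, 86400)
  red:expire("auto-ssl:failures:last:" .. domain, 86400)

  red:set_keepalive(10000, 100)
end
```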
Another approach that I didn't attempt, but which would also make sense to me: track each certificate request (success and failure), have one tunable for the maximum number of certificate requests per three hours (defaulting to Let's Encrypt's 300), and have algorithms adaptively throttle both renewals and (failing) ad-hoc certificate requests as needed. With additional tunables as needed, and using permanent storage (i.e. aggregated across multiple servers).
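A minimal sketch of what such a global order budget could look like; for brevity this uses a per-server shm counter and a fixed window, whereas the comment above suggests permanent storage aggregated across servers (all names here are assumptions, nothing like this is implemented):

```lua
-- Hypothetical sketch of a global "new orders per 3 hours" budget using an
-- shm dictionary (fixed window; names are assumptions, not existing options).
local max_orders_per_window = 300        -- Let's Encrypt: 300 new orders / account / 3h
local window_seconds = 3 * 60 * 60

local function order_budget_available()
  local dict = ngx.shared.auto_ssl_orders  -- would need its own lua_shared_dict
  -- incr with init = 0 and an expiry on first use approximates a fixed window
  local count, err = dict:incr("orders", 1, 0, window_seconds)
  if not count then
    ngx.log(ngx.ERR, "order budget: incr failed: ", err)
    return true  -- fail open rather than blocking issuance
  end
  return count <= max_orders_per_window
end
```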