Address new ACME v2 rate limit ("too many new orders recently") #217

Open · wants to merge 5 commits into `master`
Conversation

@gohai (Contributor) commented Jan 22, 2020

The ACME v2 API has a new rate limit of 300 new orders per account per 3 hours, which easily results in the "too many new orders recently" error for new or renewing domains. (see #213)

@gohai changed the title from *Throttle the rate of renewal requests (ACME v2 rate limit)* to *Address new ACME v2 rate limit ("too many new orders recently")* on Jan 22, 2020

As a first step to mitigate this new requirement, throttle the rate at which renewal requests are issued to a default of 60 per hour (which leaves 40 per hour for on-demand issuance).
With throttling, renewal jobs can take a significant time to complete (e.g. when a bunch of domains are renewing around the same time). To spread the renewals as evenly as possible throughout the period, and to ensure the renewal checks on different workers stay in sync, consider the duration of the actual job execution when re-scheduling the check (see the sketch below).
This is to increase the chances that environments using multiple servers won't have all instances renewing at the same time, which might tie up all orders available to the account, and thus prevent certificates for new domains from being issued.
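As an illustration of this rescheduling idea, a minimal sketch (the names `renew_all_domains` and `renewal_check_interval` are placeholders, not the patch's actual identifiers):

```lua
-- Sketch: account for the job's own runtime when scheduling the next
-- renewal check, so that throttled (slow) runs don't drift the schedule
-- and checks on different workers stay roughly in sync.
local function renewal_check(premature)
  if premature then return end

  local start_time = ngx.now()
  renew_all_domains() -- placeholder: runs the throttled renewals
  local elapsed = ngx.now() - start_time

  -- Subtract the job duration from the interval, never going negative.
  local ok, err = ngx.timer.at(math.max(0, renewal_check_interval - elapsed), renewal_check)
  if not ok then
    ngx.log(ngx.ERR, "auto-ssl: failed to schedule renewal check: ", err)
  end
end
```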
This implements the minimal functionality for considering a domain's past certificate failures inside allow_domain. We believe something like this is necessary to comply with the new ACME v2 limit of 300 new orders/account/3h.

A different approach which might be worth considering is saving the statistical information to (permanent) storage, like we do with certificates. This patch instead only uses a shm-based dictionary, meaning that information might be stale in a multi-server setup. (Dictionary entries are created with an expiration date, and we're running the following patch to look for a certificate in storage before calling allow_domain, meaning that this shouldn't be a problem in practice: b1f9715)
Previously, renewals were attempted 30 days before domains expired, and then every day going forward - until either the renewal succeeded or the existing certificate expired. This poses an issue with the new ACME v2 rate limit ("too many new orders recently"), because large numbers of domains might be expiring (and thus renewing) at the same time - e.g. because they all got moved onto lua-resty-auto-ssl at the same time initially.

To spread the renewal dates out more evenly, this patch rolls the dice on whether a domain should be renewed, with the odds increasing between 30 and 15 days before the domain expires. Renewals for domains expiring in less than 15 days are always attempted.
@adityapatadia commented

@GUI This is becoming a problem. Can you merge this soon?

@davidmerfield commented

I'm running into this issue as well - it would be great to get this merged.

@GUI (Collaborator) commented Jun 25, 2020

Thanks all. I'll try to review and release this weekend. Sorry for the delay!

@ErisDS commented Jul 7, 2020

Sorry to add another nagging voice to the pile, but we're also struggling with this problem.

Is there anything anyone else can do to test/review this and help build confidence for the merge?

@gohai (Contributor, Author) commented Jul 7, 2020

(Committer here) To provide a bit of additional context: we've been running these changes in production since January. They are not a guaranteed "solution" to the new rate limit introduced with ACME v2, but they change the existing code to issue renewal requests as steadily as possible, so that renewals don't unnecessarily push you over the limit. This is a "best effort" approach, short of more radical changes such as having lua-resty-auto-ssl rotate between multiple Let's Encrypt accounts.

In addition, 6e1ea73 adds a generic mechanism for setting a policy on how frequently to retry domains for which we weren't previously able to get a certificate. We've been using this with an exponential backoff, where we ultimately waste only a single order per day on problematic domains - but the desired behavior probably differs among auto-ssl users, so this merely gives you the tools to implement it.

(The one slightly ambiguous part IMHO: we're using 6e1ea73 together with b1f9715, a change to check if we already have a cert in permanent storage before calling allow_domain, which was previously rejected during review. Without the latter patch, you might have a situation in a multi-server setup where one server serves a certificate for a domain, while another still returns the fallback certificate because of per-domain throttling. With the latter patch, auto-ssl will always return a certificate once any of the servers has been able to obtain one.)
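For illustration, a minimal sketch of the ordering b1f9715 proposes (`storage_get_cert` and `issue_cert` are placeholder names, not lua-resty-auto-ssl's actual internals):

```lua
local function get_cert_for_request(auto_ssl_instance, domain)
  -- 1. Check permanent storage first: if any server has already obtained
  --    a certificate, serve it regardless of per-domain throttling.
  local cert = storage_get_cert(domain) -- placeholder storage lookup
  if cert then
    return cert
  end

  -- 2. Only then consult allow_domain, which may throttle the domain.
  local allow_domain = auto_ssl_instance:get("allow_domain")
  if not allow_domain(domain) then
    return nil -- caller serves the fallback certificate
  end

  -- 3. Attempt a new ACME order (this counts against the rate limit).
  return issue_cert(domain) -- placeholder issuance call
end
```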

@GUI (Collaborator) left a comment

@gohai: There's lots of good stuff here, a big thanks for helping figure this out.

Everyone: Sorry for the delays in getting this merged. If you want to use @gohai's branch, this all looks solid to me; I just want to mull over some different approaches before merging and releasing.

@gohai: I left some inline comments with very initial thoughts, but I probably need to spend a bit more time giving all of this more thought. Let me know if any of my initial questions didn't make sense, or if you have any other thoughts - happy to try and collaborate on this. Many thanks!

Comment on lines +138 to +145

```lua
elseif now + (15 * 24 * 60 * 60) < cert["expiry"] then
  local rand_value = math.random(cert["expiry"] - (30 * 24 * 60 * 60), cert["expiry"] - (15 * 24 * 60 * 60))
  local rand_renewal_threshold = now
  if rand_value < rand_renewal_threshold then
    ngx.log(ngx.NOTICE, "auto-ssl: expiry date is more than 15 days out, randomly not picked for renewal: ", domain)
    renew_check_cert_unlock(domain, storage, local_lock, distributed_lock_value)
    return
  end
```
@GUI (Collaborator) commented:

Is this increasing likelihood approach needed with the other throttle mechanisms provided by renewals_per_hour (60/hour)? I guess I'm just trying to think through the different scenarios with this and wondering how these two approaches interact.

If you have a big batch of domains that need to be renewed, would the renewals_per_hour logic not eventually get through the renewals, just at a slower pace? Or is this logic here an attempt to prioritize the renewals, so that ones expiring sooner have their renewal attempted sooner? Would this mainly be relevant if you expect your renewals to take more than 30 days at the throttled 60/hour rate?

If this is mainly intended to prioritize renewals (but you still want to saturate the 60/hour rate), then I'm wondering if maybe a slightly simpler approach would be to sort the renewals by expiration (instead of the current approach of randomizing the renewal order). That being said, there may be some performance implications of fetching all of those expiration times before renewals occur, so this may not actually be feasible; I'd need to investigate further.

@gohai (Author) commented Jul 20, 2020

Great question about how this interacts with the throttle mechanism!

I implemented this after realizing that we had a bulk of many thousands of certificates attempting to renew on the same day, as this was the date when we had initially migrated them onto lua-resty-auto-ssl.

While this went on, at some point we exceeded the rate limit, and this caused an issue, as we weren't able to generate certificates for newly-registered customer domains. This issue likely wouldn't have happened if we had set our renewals_per_hour more conservatively (we must have been using more than 40/h for ad-hoc certificate requests).

So the aim of this patch is really just to smooth out the renewals over many days, anticipating that this will make it less likely that we exceed the rate limit. (Example: I start using lua-resty-auto-ssl today and migrate 10k domains onto the system. Before this patch, all of these domains would simultaneously be attempted for renewal 60 days from now, a job which would take 7 days to finish at 60/h. This could be just fine. This patch changes the behavior so that the renewals instead get spread over the next 14 invocations, each of which should only take about 12 hours to finish.)

In the former case, we'd be "maxing out" the 60/h for a week straight - a process that would recur every two months. In the latter case, we'd be at 60/h for only half of the time, and for the rest we'd have a little more room for ad-hoc certificate requests to exceed 40/h. And over time, the renewals would inevitably spread out even more evenly, so that ultimately we'd be attempting renewal of roughly the same number of domains on average.
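For concreteness, the back-of-the-envelope arithmetic behind these figures (a plain Lua sketch, not part of the patch):

```lua
local domains = 10000
local renewals_per_hour = 60

-- All at once: one big renewal job
print(domains / renewals_per_hour)      -- ~167 hours, i.e. ~7 days at 60/h

-- Spread over 14 invocations of the renewal check
print(domains / 14 / renewals_per_hour) -- ~11.9 hours per invocation
```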

Does this make more sense to you after reading this?

Comment on lines +164 to +218
#### `get_failures`

The optional `get_failures` function accepts a domain name argument, and can be used to retrieve statistics about failed certificate requests for that domain. The function returns a table with the fields `first` (timestamp of the first failure encountered), `last` (timestamp of the most recent failure encountered), and `num` (number of failures). If no failure has been encountered, the function instead returns `nil`.

Note: the statistics are only kept for as long as the nginx instance is running. Sharing across multiple servers (as in a load-balanced environment) is not implemented.

To make use of the `get_failures` function, add the following to the `http` configuration block:

```nginx
lua_shared_dict auto_ssl_failures 1m;
```

When this shm-based dictionary exists, `lua-resty-auto-ssl` will use it to update a record it keeps for the domain whenever a Let's Encrypt certificate request fails (for both new domains and renewing ones). When a certificate request is successful, `lua-resty-auto-ssl` deletes the record it has for the domain, so that future invocations return `nil`.
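As a rough sketch of the bookkeeping this describes (the field names follow the `get_failures` description above; the JSON encoding and the 30-day expiration are assumptions, and the patch's actual implementation may differ):

```lua
local cjson = require "cjson"
local dict = ngx.shared.auto_ssl_failures

-- Record a failed certificate request for a domain.
local function track_failure(domain)
  local now = ngx.now()
  local raw = dict:get(domain)
  local record = raw and cjson.decode(raw) or { first = now, num = 0 }
  record["last"] = now
  record["num"] = record["num"] + 1
  -- assumed expiration: drop stale records after 30 days
  dict:set(domain, cjson.encode(record), 30 * 24 * 60 * 60)
end

-- Clear the record after a successful certificate request.
local function clear_failures(domain)
  dict:delete(domain)
end
```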

The `get_failures` function can be used inside `allow_domain` to implement per-domain rate-limiting, and similar rule sets.

*Example:*

```lua
auto_ssl:set("allow_domain", function(domain, auto_ssl, ssl_options, renewal)
  local failures = auto_ssl:get_failures(domain)
  -- only attempt one certificate request per hour
  if not failures or 3600 < ngx.now() - failures["last"] then
    return true
  else
    return false
  end
end)
```

#### `track_failure`

The optional `track_failure` function accepts a domain name argument and records a failure for this domain. This can be used to avoid repeated lookups of a domain in `allow_domain`.

*Example:*

```lua
auto_ssl:set("allow_domain", function(domain, auto_ssl, ssl_options, renewal)
  local failures = auto_ssl:get_failures(domain)
  -- only attempt one lookup or certificate request per hour
  if failures and ngx.now() - failures["last"] <= 3600 then
    return false
  end

  local allow
  -- (set `allow` via an external lookup to check the domain, e.g. via HTTP)
  if not allow then
    auto_ssl:track_failure(domain)
    return false
  else
    return true
  end
end)
```

@GUI (Collaborator) commented:

This get_failures and track_failure logic to help throttle initial registrations is definitely a flexible solution that allows for some nifty logic in the allow_domain callback. I think I see how all this fits together (nice!), but I guess I'm just mulling over other options to deal with the same issue. A couple of things come to mind.

  1. You commented:

    > The one slightly ambiguous part IMHO: we're using 6e1ea73 together with b1f9715, a change to check if we already have a cert in permanent storage before calling allow_domain, which was previously rejected during review. Without the latter patch, you might have a situation in a multi-server setup where one server serves a certificate for a domain, while another still returns the fallback certificate because of per-domain throttling. With the latter patch, auto-ssl will always return a certificate once any of the servers has been able to obtain one.

    The ordering issues around this do make me wonder whether allow_domain is the right place to implement this logic. In a way, this seems like a slightly different concept than what allow_domain was originally intended to do. Perhaps we could introduce a different callback (eg, throttle_domain, sketched below) that would get called after the storage call (eg, slide it into where you had proposed allow_domain get moved in b1f9715)? That would keep allow_domain being called before the storage call, but allow this logic to be handled separately.

  2. I'm also wondering if we could integrate this logic in a more built-in fashion so that users don't have to implement custom allow_domain (or other callbacks) to handle this. Similar to how renewals_per_hour is just a simple global setting, I'm wondering if we could simplify this in some fashion and integrate the logic directly into the codebase, rather than relying on the callback. Or do your use-cases need more custom or per-domain logic for this type of thing?

I'd need to spend some more time investigating the feasibility of these ideas (and I realize you might have different use-cases), but these are just some initial thoughts.
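To make the first idea concrete, a rough sketch of what such a callback could look like (`throttle_domain` is only the name proposed above, not an existing setting):

```lua
-- Hypothetical callback, invoked only after the storage lookup found no
-- existing certificate, so a cert obtained by any server is served before
-- any per-domain throttling applies.
auto_ssl:set("throttle_domain", function(domain, auto_ssl, ssl_options)
  local failures = auto_ssl:get_failures(domain)
  -- allow at most one new order attempt per hour for this domain
  return not failures or ngx.now() - failures["last"] > 3600
end)
```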

@gohai (Author) commented Jul 20, 2020

Thank you for your review and thoughts, @GUI.

To your points:

  1. Separate callbacks would definitely work for us. We don't mind the Redis lookup to see if a certificate is in storage or not, but to accommodate users that do, having two callbacks makes a lot of sense to me. (perhaps `_early` and `_late`?)

Part of my thinking behind adding this to the implementer's "toolbox" in allow_domain was the realization that one could also use the track_failure(domain) mechanism for caching any "regular" lookup that allow_domain would do, if desired. (Example: we have to make an HTTP call to an internal web API to decide whether we should allow a new (sub-)domain.)

  2. I do see the beauty of having a simple setting, and having everything handled behind the scenes by lua-resty-auto-ssl.

To be specific, we're currently using this exponential backoff algorithm in our setup. Unsure if something like this could be generalized to work for other users (without a callback?):

```lua
-- (excerpt from our allow_domain callback)
local failures = auto_ssl:get_failures(domain)
if failures then
  -- 1 failure:   next attempt after 3.75 min
  -- 2 failures:  next attempt after 7.5 min
  -- 3 failures:  next attempt after 15 min
  -- 4 failures:  next attempt after 30 min
  -- 5 failures:  next attempt after 60 min
  -- ...
  -- 10+ failures: next attempt after 24 hours
  local backoff = math.min(225 * 2 ^ (failures["num"] - 1), 86400)
  local elapsed = ngx.now() - failures["last"]
  if elapsed < backoff then
    ngx.log(ngx.NOTICE, domain .. ": " .. failures["num"] .. " cert failures, last attempt " .. math.floor(elapsed) .. "s ago, want to wait " .. backoff .. "s before trying again")
    return false
  end
end
```

Conceptually, the simplest mechanism to me would be one that (a) caches the results of allow_domain, (b) equally takes cert failures into consideration, and (c) uses permanent storage so that the tracking is automatically shared across multiple servers. (My patch only does (b), and gives the implementer the tools to implement (a).)

@gohai (Author) commented:

Another approach that I didn't attempt, but which would also make sense to me: track each certificate request (success and failure), have one tunable to set the maximum number of certificate requests per three hours (defaulting to LE's 300), and have algorithms adaptively throttle both renewals and (failing) ad-hoc certificate requests as needed - with additional tunables as needed, and using permanent storage (i.e. aggregated across multiple servers).
