Address new ACME v2 rate limit ("too many new orders recently") #217

Open
wants to merge 5 commits into
base: master
60 changes: 60 additions & 0 deletions README.md
@@ -161,6 +161,61 @@ auto_ssl:set("allow_domain", function(domain, auto_ssl, ssl_options, renewal)
end)
```

#### `get_failures`

The optional `get_failures` function accepts a domain name argument, and can be used to retrieve statistics about failed certificate requests concerning the domain. The function will return a table with fields `first` (timestamp of first failure encountered), `last` (timestamp of most recent failure encountered), `num` (number of failures). The function will instead return `nil` if no error has been encountered.

Note: the statistics are only kept for as long as the nginx instance is running. Sharing them across multiple servers (as in a load-balanced environment) is not implemented.

To make use of the `get_failures` function, add the following to the `http` configuration block:

```nginx
lua_shared_dict auto_ssl_failures 1m;
```

When this shm-based dictionary exists, `lua-resty-auto-ssl` will use it to update a record it keeps for the domain whenever a Let's Encrypt certificate request fails (for both new domains, as well as renewing ones). When a certificate request is successful, `lua-resty-auto-ssl` will delete the record it has for the domain, so that future invocations will return `nil`.

The `get_failures` function can be used inside `allow_domain` to implement per-domain rate-limiting, and similar rule sets.

*Example:*

```lua
auto_ssl:set("allow_domain", function(domain, auto_ssl, ssl_options, renewal)
  local failures = auto_ssl:get_failures(domain)
  -- only attempt one certificate request per hour
  if not failures or 3600 < ngx.now() - failures["last"] then
    return true
  else
    return false
  end
end)
```

#### `track_failure`

The optional `track_failure` function accepts a domain name argument and records a failure for this domain. This can be used to avoid repeated lookups of a domain in `allow_domain`.

*Example:*

```lua
auto_ssl:set("allow_domain", function(domain, auto_ssl, ssl_options, renewal)
  local failures = auto_ssl:get_failures(domain)
  -- only attempt one lookup or certificate request per hour
  if failures and ngx.now() - failures["last"] <= 3600 then
    return false
  end

  local allow
  -- (external lookup to check domain, e.g. via http)
  if not allow then
    auto_ssl:track_failure(domain)
    return false
  else
    return true
  end
end)
```

Comment on lines +164 to +218
Collaborator

This get_failures and track_failure logic to help throttle initial registrations is definitely a flexible solution that allows for some nifty logic in the allow_domain callback. I think I see how all this fits together (nice!), but I guess I'm just mulling over other options to deal with the same issue. A couple things come to mind.

  1. You commented:

    The one slightly ambiguous part IMHO: we're using 6e1ea73 together with b1f9715, a change to check if we already have a cert in permanent storage before calling allow_domain, which was previously rejected during review. Without the latter patch, you might have a situation in a multi-server setup where one server serves a certificate for a domain, while another still returns the fallback certificate because of per-domain throttling. With the latter patch, auto-ssl will always return a certificate if any of the servers has been able to get one.

    The ordering issues around this do make me wonder whether allow_domain is the right place to implement this logic, because in a way this seems like a slightly different concept than what allow_domain was originally intended to do. Perhaps we could introduce a different callback (e.g., throttle_domain) that would get called after the storage call (e.g., slide it into where you had proposed allow_domain get moved in b1f9715)? That would keep allow_domain being called before the storage call, but then allow this logic to be handled separately.

  2. I'm also wondering if we could integrate this logic in a more built-in fashion so that users don't have to implement custom allow_domain (or other callbacks) to handle this. Similar to how renewals_per_hour is just a simple global setting, I'm wondering if we could simplify this in some fashion and integrate this logic directly into the codebase, rather than relying on the callback. Or do your use-cases need more custom or per-domain logic for this type of thing?

I'd need to spend some more time investigating the feasibility of these ideas (and I realize you might have different use-cases), but these are just some initial thoughts.
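
The call order suggested in point 1 could be sketched roughly as follows. This is only an illustration of the ordering: `throttle_domain` is the hypothetical callback name from the suggestion above, and `get_cert`, `callbacks`, `storage`, and `issue` are stand-ins invented for this sketch, not lua-resty-auto-ssl internals.

```lua
-- Hypothetical call order: allow_domain -> storage lookup -> throttle_domain -> issue.
-- None of these names are real lua-resty-auto-ssl internals.
local function get_cert(domain, callbacks, storage, issue)
  -- allow_domain still runs first, before any storage access.
  if not callbacks.allow_domain(domain) then
    return nil, "domain not allowed"
  end

  -- A certificate that is already in storage is served regardless of throttling.
  local cert = storage[domain]
  if cert then
    return cert
  end

  -- Only new registrations are subject to the (optional) throttle callback.
  if callbacks.throttle_domain and callbacks.throttle_domain(domain) then
    return nil, "throttled"
  end

  return issue(domain)
end
```

With this ordering, a server that is throttling a domain would still serve a certificate that another server has already stored.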

Contributor Author

@gohai, Jul 20, 2020

Thank you for your review and thoughts, @GUI.

To your points:

  1. Separate callbacks would definitely work for us. We don't mind the Redis lookup to see if a certificate is in storage or not, but to accommodate users that do, having two makes a lot of sense to me. (perhaps _early and _late?)

Part of my thinking behind adding this to the implementer's "toolbox" in allow_domain was the realization that one could use the track_failure(domain) mechanism also for caching any "regular" lookup that allow_domain would do, if this is desired. (Example: we have to do an HTTP call to an internal web API to decide whether we should allow a new (sub-) domain)

  2. I do see the beauty of having a simple setting, and having everything be handled behind the scenes by lua-resty-auto-ssl.

To be specific, we're currently using this exponential backoff algorithm in our setup. Unsure if something like this could be generalized to work for other users (without a callback?):

  local failures = auto_ssl:get_failures(domain)
  if failures then
    -- 1 failures: next attempt after 3.75 min
    -- 2 failures: next attempt after 7.5 min
    -- 3 failures: next attempt after 15 min
    -- 4 failures: next attempt after 30 min
    -- 5 failures: next attempt after 60 min
    -- ...
    -- 10+ failures: next attempt after 24 hours
    local backoff = math.min(225 * 2 ^ (failures["num"] - 1), 86400)
    local elapsed = ngx.now() - failures["last"]
    if elapsed < backoff then
      ngx.log(ngx.NOTICE, domain .. ": " .. failures["num"] .. " cert failures, last attempt " .. math.floor(elapsed) .. "s ago, want to wait " .. backoff .. "s before trying again")
      return false
    end
  end

Conceptually, the simplest mechanism to me would be one that (a) caches the results of allow_domain, (b) equally takes into consideration cert failures, and (c) uses permanent storage so that the tracking is automatically shared across multiple servers. (My patch only does (b) and gives the implementer the tools to implement (a).)
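
For reference, the backoff schedule can be factored into a standalone function and checked outside of nginx. This is a minimal sketch: `backoff_seconds` and `may_retry` are names made up here, and `failures` is assumed to have the `first`/`last`/`num` shape that `get_failures` returns.

```lua
-- Standalone version of the backoff rule: 225 s after the first failure,
-- doubling with every further failure, capped at 24 hours.
local BASE_SECONDS = 225
local MAX_SECONDS = 86400

local function backoff_seconds(num_failures)
  return math.min(BASE_SECONDS * 2 ^ (num_failures - 1), MAX_SECONDS)
end

-- Returns true when a new attempt is allowed at time `now`.
-- `failures` is nil when no failure has been recorded for the domain.
local function may_retry(failures, now)
  if not failures then
    return true
  end
  return now - failures.last >= backoff_seconds(failures.num)
end

-- Print the waiting time in minutes for 1 to 10 failures.
for num = 1, 10 do
  print(num, backoff_seconds(num) / 60 .. " min")
end
```

Passing `now` in as an argument (instead of calling `ngx.now()` inside) keeps the rule easy to test.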

Contributor Author

Another approach that I didn't attempt, but which would also make sense to me: track each certificate request (success and failure), have one tunable to set the max number of certificate requests per three hours (defaulting to LE's 300), and have algorithms to adaptively throttle both renewals and (failing) ad-hoc certificate requests as needed, with additional tunables where necessary, and using permanent storage (i.e. aggregated across multiple servers).
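
The request-tracking part of that idea could look roughly like the sliding-window counter below. It is a sketch under assumptions: `new_order_limiter` and its methods are hypothetical names, and the timestamps live in a plain Lua table, whereas the approach described above would keep them in permanent storage shared by all servers.

```lua
-- Sliding-window counter: at most `limit` requests per `window` seconds.
-- `new_order_limiter` and its fields are hypothetical names for illustration.
local function new_order_limiter(limit, window)
  local stamps = {}   -- timestamps of recent requests, oldest first
  local limiter = {}

  -- Drops timestamps that have left the window, then returns the current count.
  function limiter.count(now)
    while stamps[1] and stamps[1] <= now - window do
      table.remove(stamps, 1)
    end
    return #stamps
  end

  -- Records a request at `now` if the limit allows it; returns true on success.
  function limiter.try_acquire(now)
    if limiter.count(now) >= limit then
      return false
    end
    stamps[#stamps + 1] = now
    return true
  end

  return limiter
end

-- Example: the "300 per three hours" figure mentioned above.
local limiter = new_order_limiter(300, 3 * 3600)
print(limiter.try_acquire(os.time()))
```

Both renewals and ad-hoc requests would call `try_acquire` before contacting the CA; the adaptive part (deciding how much of the budget renewals may use) is left out here.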

### `dir`
*Default:* `/etc/resty-auto-ssl`

@@ -183,6 +238,11 @@ How frequently (in seconds) all of the domains should be checked for certificate
auto_ssl:set("renew_check_interval", 172800)
```

### `renewals_per_hour`
*Default:* `60`

How many renewal requests to issue per hour at most. The ACME v2 protocol limits each account to 300 new orders per 3 hours. This setting will throttle the renewal job so that a sufficient margin remains available for new domains at all times. You might consider lowering this setting when the same Let's Encrypt account credentials are shared across multiple servers (in a load-balanced environment).
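
As a rough illustration of the margin this leaves, the sketch below works through the arithmetic. It assumes the limit of 300 new orders per 3 hours quoted above, that each renewal consumes one order, and that every server sharing the account renews at the full configured rate; the function names are made up for this example.

```lua
local ORDER_LIMIT = 300   -- new orders per account and window (figure quoted above)
local WINDOW_HOURS = 3

-- Orders left for new domains per window when `servers` servers share one account.
local function orders_left_for_new_domains(renewals_per_hour, servers)
  local used = renewals_per_hour * WINDOW_HOURS * servers
  return math.max(0, ORDER_LIMIT - used)
end

-- Minimum spacing between two renewal requests, in seconds.
local function min_renewal_seconds(renewals_per_hour)
  return 3600 / renewals_per_hour
end

print(orders_left_for_new_domains(60, 1))  -- one server at the default rate
print(orders_left_for_new_domains(60, 2))  -- two servers sharing the account
print(min_renewal_seconds(60))
```

Under these assumptions the default of `60` leaves 120 orders per 3 hours (about 40 per hour) for new domains on a single server, and none once two servers renew at the full rate.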

### `storage_adapter`
*Default:* `resty.auto-ssl.storage_adapters.file`<br>
*Options:* `resty.auto-ssl.storage_adapters.file`, `resty.auto-ssl.storage_adapters.redis`
62 changes: 62 additions & 0 deletions lib/resty/auto-ssl.lua
@@ -44,6 +44,10 @@ function _M.new(options)
options["renew_check_interval"] = 86400 -- 1 day
end

if not options["renewals_per_hour"] then
options["renewals_per_hour"] = 60
end

if not options["hook_server_port"] then
options["hook_server_port"] = 8999
end
@@ -95,4 +99,62 @@ function _M.hook_server(self)
server(self)
end

function _M.get_failures(self, domain)
  if not ngx.shared.auto_ssl_failures then
    ngx.log(ngx.ERR, "auto-ssl: dict auto_ssl_failures could not be found. Please add it to your configuration: `lua_shared_dict auto_ssl_failures 1m;`")
    return
  end

  local string = ngx.shared.auto_ssl_failures:get("domain:" .. domain)
  if string then
    local failures, json_err = self.storage.json_adapter:decode(string)
    if json_err then
      ngx.log(ngx.ERR, json_err, domain)
    end
    if failures then
      local mt = {
        __concat = function(op1, op2)
          return tostring(op1) .. tostring(op2)
        end,
        __tostring = function(f)
          return "first: " .. f["first"] .. ", last: " .. f["last"] .. ", num: " .. f["num"]
        end
      }
      setmetatable(failures, mt)
      return failures
    end
  end
end

function _M.track_failure(self, domain)
  if not ngx.shared.auto_ssl_failures then
    return
  end

  local failures
  local string = ngx.shared.auto_ssl_failures:get("domain:" .. domain)
  if string then
    failures = self.storage.json_adapter:decode(string)
  end
  if not failures then
    failures = {}
    failures["first"] = ngx.now()
    failures["last"] = failures["first"]
    failures["num"] = 1
  else
    failures["last"] = ngx.now()
    failures["num"] = failures["num"] + 1
  end
  string = self.storage.json_adapter:encode(failures)
  ngx.shared.auto_ssl_failures:set("domain:" .. domain, string, 2592000)
end

function _M.track_success(_, domain)
  if not ngx.shared.auto_ssl_failures then
    return
  end

  ngx.shared.auto_ssl_failures:delete("domain:" .. domain)
end

return _M
3 changes: 2 additions & 1 deletion lib/resty/auto-ssl/init_worker.lua
@@ -2,6 +2,7 @@ local random_seed = require "resty.auto-ssl.utils.random_seed"
local renewal_job = require "resty.auto-ssl.jobs.renewal"
local shell_blocking = require "shell-games"
local start_sockproc = require "resty.auto-ssl.utils.start_sockproc"
local timer_rand = math.random()

return function(auto_ssl_instance)
local base_dir = auto_ssl_instance:get("dir")
@@ -37,5 +38,5 @@ return function(auto_ssl_instance)
storage_adapter:setup_worker()
end

- renewal_job.spawn(auto_ssl_instance)
+ renewal_job.spawn(auto_ssl_instance, timer_rand)
end
40 changes: 37 additions & 3 deletions lib/resty/auto-ssl/jobs/renewal.lua
@@ -5,6 +5,8 @@ local shuffle_table = require "resty.auto-ssl.utils.shuffle_table"
local ssl_provider = require "resty.auto-ssl.ssl_providers.lets_encrypt"

local _M = {}
local min_renewal_seconds
local last_renewal

-- Based on lua-rest-upstream-healthcheck's lock:
-- https://github.com/openresty/lua-resty-upstream-healthcheck/blob/v0.03/lib/resty/upstream/healthcheck.lua#L423-L440
@@ -125,12 +127,22 @@ local function renew_check_cert(auto_ssl_instance, storage, domain)
end

-- If expiry date is known, attempt renewal if it's within 30 days.
-- Between 30 and 15 days out, only attempt renewal of a subset of domains (with
-- increasing likelihood of renewal being attempted).
if cert["expiry"] then
local now = ngx.now()
if now + (30 * 24 * 60 * 60) < cert["expiry"] then
ngx.log(ngx.NOTICE, "auto-ssl: expiry date is more than 30 days out, skipping renewal: ", domain)
renew_check_cert_unlock(domain, storage, local_lock, distributed_lock_value)
return
elseif now + (15 * 24 * 60 * 60) < cert["expiry"] then
local rand_value = math.random(cert["expiry"] - (30 * 24 * 60 * 60), cert["expiry"] - (15 * 24 * 60 * 60))
local rand_renewal_threshold = now
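-- Skip while the random point in the 30-to-15-day window is still in the
-- future, so the chance of attempting renewal grows as expiry approaches.
if rand_value > rand_renewal_threshold then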
ngx.log(ngx.NOTICE, "auto-ssl: expiry date is more than 15 days out, randomly not picked for renewal: ", domain)
renew_check_cert_unlock(domain, storage, local_lock, distributed_lock_value)
return
end
Comment on lines +138 to +145
Collaborator

Is this increasing likelihood approach needed with the other throttle mechanisms provided by renewals_per_hour (60/hour)? I guess I'm just trying to think through the different scenarios with this and wondering how these two approaches interact.

If you have a big batch of domains that need to be renewed, would the renewals_per_hour logic not eventually get through all of them, just at a slower pace? Or is this logic here an attempt to prioritize the renewals, so that ones expiring sooner have their renewal attempted sooner? Would this mainly be relevant if you expect your renewals to take more than 30 days at the throttled 60/hour rate?

If this is mainly intended to prioritize renewals (but you still want to saturate the 60/hour rate), then I'm wondering if maybe a slightly simpler approach would be to sort the renewals by expiration (instead of the current approach of randomizing the renewal order). That being said, there may be some performance implications of fetching all of those expiration times before renewals occur, so this may not actually be feasible, I'd need to investigate further.
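
The sort-by-expiration idea could be sketched as below. This is an illustration only: `sort_by_expiry` and the `expiry_of` lookup table are hypothetical, and the sketch assumes the expiry timestamps have already been fetched, which is exactly the part with the possible performance cost mentioned above.

```lua
-- Returns a copy of `domains` ordered so the earliest-expiring certificates come first.
-- `expiry_of` maps domain -> expiry timestamp; unknown expiries sort to the front.
local function sort_by_expiry(domains, expiry_of)
  local sorted = {}
  for i, domain in ipairs(domains) do
    sorted[i] = domain
  end
  table.sort(sorted, function(a, b)
    local ea = expiry_of[a] or 0
    local eb = expiry_of[b] or 0
    if ea ~= eb then
      return ea < eb
    end
    return a < b  -- deterministic order for equal expiry times
  end)
  return sorted
end

local order = sort_by_expiry(
  { "c.example", "a.example", "b.example" },
  { ["a.example"] = 300, ["b.example"] = 100, ["c.example"] = 200 }
)
print(table.concat(order, ", "))
```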

Contributor Author

@gohai, Jul 20, 2020

Great question about how this interacts with the throttle mechanism!

I implemented this after realizing that we had a bulk of many thousands of certificates attempting to renew on the same day, as this was the date when we initially migrated them onto lua-resty-auto-ssl.

While this went on, at some point we exceeded the rate limit, and this caused an issue, as we weren't able to generate certificates for newly registered customer domains. This issue likely wouldn't have happened if we had set our renewals_per_hour more conservatively (we must have used more than 40/h for ad-hoc certificate requests).

So the aim of this patch is really just to smooth out the renewals over many days, anticipating that this will make it less likely that we exceed the rate limit. (Example: I start out using lua-resty-auto-ssl today and migrate 10k domains onto the system. Before this patch, all of these domains would simultaneously be attempted for renewal 60 days from now, a job which would take 7 days to finish at 60/h. This could be just fine. This patch changes this behavior so that the renewals instead get spread over the next 14 invocations, each of which should only take 12 hours to finish.)

In case of the former, we'd be "maxing out" the 60/h for a week straight - a process which would reoccur every two months. In case of the latter, we'd be at 60/h for only half of the time, the rest of which we'd have a little bit more room for ad-hoc certificate requests to exceed 40/h. And over time, we'd inevitably spread them even more evenly, so that we'd be attempting renewal of the same number of domains on average ultimately.

Does this make more sense to you after reading this?
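
To make the intended behavior concrete, here is a simplified, self-contained sketch of the "increasing likelihood" rule. It is not the patch's exact code: `should_renew` is a made-up name, and the random source is passed in as `rand` (a function returning a number in [0, 1)) so the rule can be checked deterministically.

```lua
local DAY = 24 * 60 * 60

-- Decides whether a renewal should be attempted at time `now`.
-- `rand` is an injected function returning a number in [0, 1).
local function should_renew(now, expiry, rand)
  local window_start = expiry - 30 * DAY
  local window_end = expiry - 15 * DAY
  if now < window_start then
    return false   -- more than 30 days out: never attempt
  elseif now >= window_end then
    return true    -- 15 days or less: always attempt
  end
  -- Inside the window the chance grows linearly from 0 to 1.
  local chance = (now - window_start) / (window_end - window_start)
  return rand() < chance
end

-- Example: a certificate expiring in 20 days has a 2/3 chance per check.
local now = os.time()
print(should_renew(now, now + 20 * DAY, math.random))
```

Note that the chance applies per check, so with one check per day a domain entering the window is very likely to have been attempted well before the 15-day mark.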

end
end

@@ -181,9 +193,25 @@ local function renew_check_cert(auto_ssl_instance, storage, domain)
ngx.log(ngx.WARN, "auto-ssl: existing certificate is expired, deleting: ", domain)
storage:delete_cert(domain)
end

auto_ssl_instance:track_failure(domain)
else
auto_ssl_instance:track_success(domain)
end

renew_check_cert_unlock(domain, storage, local_lock, distributed_lock_value)

-- Throttle renewal requests based on renewals_per_hour setting.
if last_renewal and ngx.now() - last_renewal < min_renewal_seconds then
local to_sleep = min_renewal_seconds - (ngx.now() - last_renewal)
ngx.log(ngx.NOTICE, "auto-ssl: pausing renewal job for " .. to_sleep .. " seconds")
ngx.sleep(to_sleep)
end
if last_renewal then
last_renewal = last_renewal + min_renewal_seconds
else
last_renewal = ngx.now()
end
end

local function renew_all_domains(auto_ssl_instance)
@@ -199,6 +227,10 @@ local function renew_all_domains(auto_ssl_instance)
-- renewal attempts).
shuffle_table(domains)

-- Set up renewal request throttling.
min_renewal_seconds = 3600 / auto_ssl_instance:get("renewals_per_hour")
last_renewal = ngx.now()

for _, domain in ipairs(domains) do
renew_check_cert(auto_ssl_instance, storage, domain)
end
@@ -236,12 +268,14 @@ end
local function renew(premature, auto_ssl_instance)
if premature then return end

local start = ngx.now()
local renew_ok, renew_err = pcall(do_renew, auto_ssl_instance)
if not renew_ok then
ngx.log(ngx.ERR, "auto-ssl: failed to run do_renew cycle: ", renew_err)
end

- local timer_ok, timer_err = ngx.timer.at(auto_ssl_instance:get("renew_check_interval"), renew, auto_ssl_instance)
+ local delay = math.max(0, auto_ssl_instance:get("renew_check_interval") - (ngx.now() - start))
+ local timer_ok, timer_err = ngx.timer.at(delay, renew, auto_ssl_instance)
if not timer_ok then
if timer_err ~= "process exiting" then
ngx.log(ngx.ERR, "auto-ssl: failed to create timer: ", timer_err)
@@ -250,8 +284,8 @@ local function renew(premature, auto_ssl_instance)
end
end

- function _M.spawn(auto_ssl_instance)
- local ok, err = ngx.timer.at(auto_ssl_instance:get("renew_check_interval"), renew, auto_ssl_instance)
+ function _M.spawn(auto_ssl_instance, timer_rand)
+ local ok, err = ngx.timer.at(timer_rand * auto_ssl_instance:get("renew_check_interval"), renew, auto_ssl_instance)
if not ok then
ngx.log(ngx.ERR, "auto-ssl: failed to create timer: ", err)
return
3 changes: 3 additions & 0 deletions lib/resty/auto-ssl/ssl_certificate.lua
@@ -95,6 +95,9 @@ local function issue_cert(auto_ssl_instance, storage, domain)
cert, err = ssl_provider.issue_cert(auto_ssl_instance, domain)
if err then
ngx.log(ngx.ERR, "auto-ssl: issuing new certificate failed: ", err)
auto_ssl_instance:track_failure(domain)
else
auto_ssl_instance:track_success(domain)
end

issue_cert_unlock(domain, storage, local_lock, distributed_lock_value)
1 change: 1 addition & 0 deletions spec/config/nginx.conf.etlua
@@ -33,6 +33,7 @@ http {
allow_domain = function(domain)
return true
end,
renewals_per_hour = 3600,
}
<%- auto_ssl_pre_new or "" %>
auto_ssl = (require "resty.auto-ssl").new(options)