-
Notifications
You must be signed in to change notification settings - Fork 181
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Address new ACME v2 rate limit ("too many new orders recently") #217
base: master
Are you sure you want to change the base?
Changes from all commits
e9ab9d3
bc2a999
c89e4a7
6e1ea73
1ad9b3b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,6 +5,8 @@ local shuffle_table = require "resty.auto-ssl.utils.shuffle_table" | |
local ssl_provider = require "resty.auto-ssl.ssl_providers.lets_encrypt" | ||
|
||
local _M = {} | ||
local min_renewal_seconds | ||
local last_renewal | ||
|
||
-- Based on lua-rest-upstream-healthcheck's lock: | ||
-- https://github.com/openresty/lua-resty-upstream-healthcheck/blob/v0.03/lib/resty/upstream/healthcheck.lua#L423-L440 | ||
|
@@ -125,12 +127,22 @@ local function renew_check_cert(auto_ssl_instance, storage, domain) | |
end | ||
|
||
-- If expiry date is known, attempt renewal if it's within 30 days. | ||
-- Between 30 and 15 days out, only attempt renewal of a subset of domains (with | ||
-- increasing likelihood of renewal being attempted). | ||
if cert["expiry"] then | ||
local now = ngx.now() | ||
if now + (30 * 24 * 60 * 60) < cert["expiry"] then | ||
ngx.log(ngx.NOTICE, "auto-ssl: expiry date is more than 30 days out, skipping renewal: ", domain) | ||
renew_check_cert_unlock(domain, storage, local_lock, distributed_lock_value) | ||
return | ||
elseif now + (15 * 24 * 60 * 60) < cert["expiry"] then | ||
local rand_value = math.random(cert["expiry"] - (30 * 24 * 60 * 60), cert["expiry"] - (15 * 24 * 60 * 60)) | ||
local rand_renewal_threshold = now | ||
if rand_value < rand_renewal_threshold then | ||
ngx.log(ngx.NOTICE, "auto-ssl: expiry date is more than 15 days out, randomly not picked for renewal: ", domain) | ||
renew_check_cert_unlock(domain, storage, local_lock, distributed_lock_value) | ||
return | ||
end | ||
Comment on lines
+138
to
+145
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this increasing likelihood approach needed with the other throttle mechanisms provided by If you have a big batch of domains that need to be renewed, would the If this is mainly intended to prioritize renewals (but you still want to saturate the 60/hour rate), then I'm wondering if maybe a slightly simpler approach would be to sort the renewals by expiration (instead of the current approach of randomizing the renewal order). That being said, there may be some performance implications of fetching all of those expiration times before renewals occur, so this may not actually be feasible, I'd need to investigate further. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Great question about how this interacts with the throttle mechanism! I implemented this after realizing that we had a bulk of many-thousand of certificates attempting to renew at the same day, as this was the date when we migrated them onto While this went on, at some point we exceeded the rate limit, and this caused an issue, as we weren't able to generate certificates for newly-registered customer domains. This issue likely wouldn't happened if we had set our So the aim of this patch is really just to smoothen out the renewals over many days, anticipating that this will make it less likely that we exceed the rate limit. (Example: I am starting out using In case of the former, we'd be "maxing out" the 60/h for a week straight - a process which would reoccur every two months. In case of the latter, we'd be at 60/h for only half of the time, the rest of which we'd have a little bit more room for ad-hoc certificate requests to exceed 40/h. And over time, we'd inevitably spread them even more evenly, so that we'd be attempting renewal of the same number of domains on average ultimately. Does this make more sense to you after reading this? |
||
end | ||
end | ||
|
||
|
@@ -181,9 +193,25 @@ local function renew_check_cert(auto_ssl_instance, storage, domain) | |
ngx.log(ngx.WARN, "auto-ssl: existing certificate is expired, deleting: ", domain) | ||
storage:delete_cert(domain) | ||
end | ||
|
||
auto_ssl_instance:track_failure(domain) | ||
else | ||
auto_ssl_instance:track_success(domain) | ||
end | ||
|
||
renew_check_cert_unlock(domain, storage, local_lock, distributed_lock_value) | ||
|
||
-- Throttle renewal requests based on renewals_per_hour setting. | ||
if last_renewal and ngx.now() - last_renewal < min_renewal_seconds then | ||
local to_sleep = min_renewal_seconds - (ngx.now() - last_renewal) | ||
ngx.log(ngx.NOTICE, "auto-ssl: pausing renewal job for " .. to_sleep .. " seconds") | ||
ngx.sleep(to_sleep) | ||
end | ||
if last_renewal then | ||
last_renewal = last_renewal + min_renewal_seconds | ||
else | ||
last_renewal = ngx.now() | ||
end | ||
end | ||
|
||
local function renew_all_domains(auto_ssl_instance) | ||
|
@@ -199,6 +227,10 @@ local function renew_all_domains(auto_ssl_instance) | |
-- renewal attempts). | ||
shuffle_table(domains) | ||
|
||
-- Set up renewal request throttling. | ||
min_renewal_seconds = 3600 / auto_ssl_instance:get("renewals_per_hour") | ||
last_renewal = ngx.now() | ||
|
||
for _, domain in ipairs(domains) do | ||
renew_check_cert(auto_ssl_instance, storage, domain) | ||
end | ||
|
@@ -236,12 +268,14 @@ end | |
local function renew(premature, auto_ssl_instance) | ||
if premature then return end | ||
|
||
local start = ngx.now() | ||
local renew_ok, renew_err = pcall(do_renew, auto_ssl_instance) | ||
if not renew_ok then | ||
ngx.log(ngx.ERR, "auto-ssl: failed to run do_renew cycle: ", renew_err) | ||
end | ||
|
||
local timer_ok, timer_err = ngx.timer.at(auto_ssl_instance:get("renew_check_interval"), renew, auto_ssl_instance) | ||
local delay = math.max(0, auto_ssl_instance:get("renew_check_interval") - (ngx.now() - start)) | ||
local timer_ok, timer_err = ngx.timer.at(delay, renew, auto_ssl_instance) | ||
if not timer_ok then | ||
if timer_err ~= "process exiting" then | ||
ngx.log(ngx.ERR, "auto-ssl: failed to create timer: ", timer_err) | ||
|
@@ -250,8 +284,8 @@ local function renew(premature, auto_ssl_instance) | |
end | ||
end | ||
|
||
function _M.spawn(auto_ssl_instance) | ||
local ok, err = ngx.timer.at(auto_ssl_instance:get("renew_check_interval"), renew, auto_ssl_instance) | ||
function _M.spawn(auto_ssl_instance, timer_rand) | ||
local ok, err = ngx.timer.at(timer_rand * auto_ssl_instance:get("renew_check_interval"), renew, auto_ssl_instance) | ||
if not ok then | ||
ngx.log(ngx.ERR, "auto-ssl: failed to create timer: ", err) | ||
return | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This
get_failures
andtrack_failure
logic to help throttle initial registrations is definitely a flexible solution that allows for some nifty logic in theallow_domain
callback. I think I see how all this fits together (nice!), but I guess I'm just mulling over other options to deal with the same issue. A couple things come to mind.You commented:
The ordering issues around this does make me wondering about whether
allow_domain
is the right place to implement this logic. Because in a way, this seems like a slightly different concept than whatallow_domain
as originally intended to do. Perhaps we could introduce a different callback (eg,throttle_domain
) that would get called after the storage call (eg, slide it into where you had proposedallow_domain
get moved in b1f9715)? That would keepallow_domain
being called before the storage call, but then allow this logic to be handled separately.I'm also wondering if we could integrate this logic in a more built-in fashion so that users don't have to implement custom
allow_domain
(or other callbacks) to handle this. Similar to howrenewals_per_hour
is just a simple global setting, I'm wondering if we could simplify this in some fashion and integrate this logic directly into codebase, rather than relying on the callback. Or do your use-cases need more custom or per-domain logic for this type of thing?I'd need to spend some more time investigating the feasibility of these ideas (and I realize you might have different use-cases), but these are just some initial thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your review and thoughts, @GUI.
To your points:
_early
and_late
?)Part of my thinking behind adding this to the implementer's "toolbox" in
allow_domain
was the realization that one could use thetrack_failure(domain)
mechanism also for caching any "regular" lookup thatallow_domain
would do, if this is desired. (Example: we have to do a HTTP call to an internal web API to decide whether we should allow a new (sub-) domain)lua-resty-auto-ssl
.To be specific, we're currently using this exponential backoff algorithm in our setup. Unsure if something like this could be generalized to work for other users (without a callback?):
Conceptually, the simplest mechanism to me would be one that (a) caches the results of
allow_domain
, (b) equally takes into consideration cert failures, and (c) uses permanent storage so that the tracking automatically shared across multiple servers. (My patch only does (b) and gives the implementer the tools to implement (a).)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another approach that I didn't attempt, but which would make sense to me also: track each certificate request (success and failure), have one tunable to set the max number of certificate requests per three hours (defaulting to LE's
300
), and have algorithms to adaptively throttle both renewals, and also (failing) ad-hoc certificate requests as needed. With additional tunable as needed, and using permanent storage (i.e. aggregated across multiple servers).