-
Notifications
You must be signed in to change notification settings - Fork 169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: eigenda client returns 503 errors (for failover purpose) #828
base: master
Are you sure you want to change the base?
Conversation
api/errors.go
Outdated
// ErrorCode returns the error code for the API exception. | ||
ErrorCode() ErrorCode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just use the built-in http.StatusCode
type instead of defining our own enum?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea: b4b3441
I guess my main concern was whether we eventually switch to grpc error codes instead. In that case we might want ErrorCode to be a string for eg instead of an int. But in the meantime agree using stdlib is best.
Co-authored-by: Ethen <ethen@eigenlabs.org>
Co-authored-by: Ethen <ethen@eigenlabs.org>
// We set to unknown fault b/c disperser client returns a mix of 400s and 500s currently. | ||
// TODO: update disperser client to also return ErrorAPIGeneric errors | ||
Code: 0, | ||
Fault: api.ErrorFaultUnknown, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a very common returned error, rateLimited. It makes sense to just treat it as a 400 error.
return | ||
} | ||
|
||
// process response | ||
if *blobStatus == disperser.Failed { | ||
errChan <- fmt.Errorf("unable to disperse blob to eigenda (reply status %d): %w", blobStatus, err) | ||
// Don't think this state should be reachable. DisperseBlobAuthenticated should return an error instead of a failed status. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the only valid returned blob status is processing. I don't know how it changes in v2
// 1. means that there is a problem with EigenDA, so we return 503 to let the batcher failover to ethda | ||
// 2. means that there is a problem with Ethereum, so we return 500. | ||
// batcher would most likely resubmit another blob, which is not ideal but there isn't much to be done... | ||
// eigenDA v2 will have idempotency so one can just resubmit the same blob safely. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In eigenDA v2, it is impossible for users to resubmit the same blob. In that sense, it does not have idempotency. But it is effortless for disperser to send control signal to ask operators to download, so there is much waste. So in v2, if a client is trusting the disperser, it will fully delegate to the disperser for retrying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The user is able to resubmit the same blob to a different disperser, once it is permissionless.
errChan <- api.NewErrorAPIGeneric(http.StatusServiceUnavailable, | ||
fmt.Errorf("eigenda might be down. timed out waiting for blob to land onchain (request id=%s): %w", base64RequestID, ctx.Err())) | ||
} | ||
// but not else (otherwise it might be a problem with ethereum, so fallbacking to ethda wouldnt help) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we be more explicit, with else if (grpcdisperser.Confirmed)
and default to return something else. Maybe useful for debugging?
@@ -200,10 +237,12 @@ func (m *EigenDAClient) putBlob(ctx context.Context, rawData []byte, resultChan | |||
alreadyWaitingForDispersal = true | |||
} | |||
case grpcdisperser.BlobStatus_FAILED: | |||
errChan <- fmt.Errorf("EigenDA blob dispersal failed in processing, requestID=%s: %w", base64RequestID, err) | |||
// TODO: when exactly does this happen? I think only happens if ethereum reorged and the blob was lost |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- blob has expired, a client retrieve after 14 days. Sounds like 400 errors. (fine to ignore, since it would timeout already)
- internal logic error while requesting encoding (shouldn't happen), but it should return 503
- wait for blob finalization from confirmation and blob retry has exceeded its limit. Thinking a bit more, it is possible that a chain re-org triggered a Failed status. See code (https://github.com/Layr-Labs/eigenda/blob/master/disperser/batcher/finalizer.go#L179-L189). So we should be returning 500, not 503.
In any cases, EigenDA returned error into the failed status. So it makes sense that we return 500, i.e. internalServerError.
return | ||
case grpcdisperser.BlobStatus_INSUFFICIENT_SIGNATURES: | ||
errChan <- fmt.Errorf("EigenDA blob dispersal failed in processing with insufficient signatures, requestID=%s: %w", base64RequestID, err) | ||
// this might be a temporary condition where some eigenda nodes were temporarily offline, so we should retry | ||
errChan <- api.NewErrorAPIGeneric(http.StatusInternalServerError, fmt.Errorf("blob dispersal (requestID=%s) failed with insufficient signatures. please resubmit the blob.", base64RequestID)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be 503. Effectively, it means the service is down for this batch.
InsufficientSignatures | ||
// DISPERSING means that the blob is currently being dispersed to DA Nodes and being confirmed onchain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have not yet confirmed onchain
Finalized | ||
// INSUFFICIENT_SIGNATURES means that the confirmation threshold for the blob was not met | ||
// for at least one quorum. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think once a blob received an insufficient signature, it won't be retried.
Why are these changes needed?
Related to batcher failover project. See https://linear.app/eigenlabs/issue/EBE-60/design-fallback-mechanism-for-both-op-and-arb and ethereum-optimism/specs#434 for more details.
Basically we want eigenda-proxy to return 503 to batcher to signify "eigenda is down, failover and start submitting blobs to ethereum instead".
This PR makes the eigenda-client return 503s. eigenda-proxy will then use these, retry 3 times, and if it is still returning 503s, then return a 503 to the batcher.
The 2 sources of 503 errors that we have identified are:
TODO
Checks