
feat: eigenda client returns 503 errors (for failover purpose) #828

Open: samlaf wants to merge 9 commits into master
Conversation

samlaf (Contributor) commented Oct 22, 2024

Why are these changes needed?

Related to batcher failover project. See https://linear.app/eigenlabs/issue/EBE-60/design-fallback-mechanism-for-both-op-and-arb and ethereum-optimism/specs#434 for more details.

Basically, we want eigenda-proxy to return a 503 to the batcher to signify "eigenda is down, fail over and start submitting blobs to Ethereum instead".

This PR makes the eigenda-client return 503s. eigenda-proxy will then consume these, retry up to 3 times, and, if the client is still returning 503s, return a 503 to the batcher (see the sketch after the list below).

The 2 sources of 503 errors that we have identified are:

  1. disperseBlob call failed (disperser is down) - TODO: this is not implemented yet because the disperser client still doesn't return errors with error codes. I can either update this PR or make a follow-up one to do that.
  2. blob stuck in PROCESSING or DISPERSING status, and the call times out before reaching CONFIRMED state (landing onchain) - this is implemented
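
Rough sketch of the proxy-side flow described above (illustration only, not part of this PR's diff): PutBlob is the client's public method, but the StatusCode() accessor on the returned error and the helper itself are assumptions; see api/errors.go for the actual error shape.

package proxy // hypothetical caller-side package, for illustration only

import (
	"context"
	"errors"
	"net/http"

	"github.com/Layr-Labs/eigenda/api/clients"
)

// putWithFailover retries the EigenDA client a few times and only surfaces a
// 503 (failover signal) to the batcher if every attempt came back as a 503.
func putWithFailover(ctx context.Context, client *clients.EigenDAClient, data []byte) error {
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		_, err := client.PutBlob(ctx, data)
		if err == nil {
			return nil
		}
		lastErr = err
		// assumed accessor: the PR attaches an HTTP status code to the returned api error
		var apiErr interface{ StatusCode() int }
		if !errors.As(err, &apiErr) || apiErr.StatusCode() != http.StatusServiceUnavailable {
			return err // not a 503, so not an "eigenda is down" signal; don't fail over
		}
	}
	// still 503 after 3 attempts: bubble it up so the batcher fails over to Ethereum
	return lastErr
}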

TODO

  • implement 503s in disperserClient
  • Add documentation for how to set the different TIMEOUTS (see this discussion)

Checks

  • I've made sure the lint is passing in this PR.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; in that case, please comment that they are not relevant.
  • I've checked the new test coverage and the coverage percentage didn't drop.
  • Testing Strategy
    • Unit tests
    • Integration tests
    • This PR is not tested :(

api/errors.go (outdated), comment on lines 60 to 61:
// ErrorCode returns the error code for the API exception.
ErrorCode() ErrorCode


Why not just use the built-in http.StatusCode type instead of defining our own enum?

samlaf (Contributor, Author) replied:

Good idea: b4b3441

I guess my main concern was whether we eventually switch to gRPC error codes instead. In that case we might want ErrorCode to be a string, for example, instead of an int. But in the meantime I agree that using the stdlib is best.
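
For illustration, roughly the shape such an error type could take after b4b3441 (a sketch only; the actual type and field names in api/errors.go may differ):

package api

import "fmt"

// ErrorAPIGeneric sketches an API error carrying a stdlib HTTP status code
// (e.g. http.StatusServiceUnavailable) instead of a custom enum.
type ErrorAPIGeneric struct {
	Code int // stdlib HTTP status code
	Err  error
}

func (e *ErrorAPIGeneric) StatusCode() int { return e.Code }

func (e *ErrorAPIGeneric) Error() string {
	return fmt.Sprintf("api error (status %d): %v", e.Code, e.Err)
}

func (e *ErrorAPIGeneric) Unwrap() error { return e.Err }

// NewErrorAPIGeneric mirrors the constructor used in this PR's diff.
func NewErrorAPIGeneric(statusCode int, err error) *ErrorAPIGeneric {
	return &ErrorAPIGeneric{Code: statusCode, Err: err}
}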

samlaf requested a review from epociask on October 24, 2024 at 13:46
// We set to unknown fault b/c disperser client returns a mix of 400s and 500s currently.
// TODO: update disperser client to also return ErrorAPIGeneric errors
Code: 0,
Fault: api.ErrorFaultUnknown,
bxue-l2 (Contributor) commented Oct 25, 2024:

A very common returned error is rateLimited. It makes sense to just treat it as a 400 error.
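
Once the disperser client does return coded errors, the mapping could look roughly like this (a sketch with an assumed gRPC-to-HTTP mapping, not what the client does today; the rate-limited case is treated as a client fault so it does not trigger failover):

package clients

import (
	"net/http"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// httpCodeForDisperserError sketches how gRPC errors from the disperser could
// be mapped to HTTP-style codes for the proxy (hypothetical helper).
func httpCodeForDisperserError(err error) int {
	st, ok := status.FromError(err)
	if !ok {
		return http.StatusInternalServerError
	}
	switch st.Code() {
	case codes.ResourceExhausted: // rate limited: caller's fault, don't fail over
		return http.StatusTooManyRequests
	case codes.InvalidArgument:
		return http.StatusBadRequest
	case codes.Unavailable, codes.DeadlineExceeded: // disperser down or overloaded
		return http.StatusServiceUnavailable
	default:
		return http.StatusInternalServerError
	}
}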

return
}

// process response
if *blobStatus == disperser.Failed {
errChan <- fmt.Errorf("unable to disperse blob to eigenda (reply status %d): %w", blobStatus, err)
// Don't think this state should be reachable. DisperseBlobAuthenticated should return an error instead of a failed status.
Contributor comment:

The only valid returned blob status is PROCESSING. I don't know how this changes in v2.

// 1. means that there is a problem with EigenDA, so we return 503 to let the batcher failover to ethda
// 2. means that there is a problem with Ethereum, so we return 500.
// batcher would most likely resubmit another blob, which is not ideal but there isn't much to be done...
// eigenDA v2 will have idempotency so one can just resubmit the same blob safely.
Contributor comment:

In EigenDA v2, it is impossible for users to resubmit the same blob, so in that sense it does not have idempotency. But it is effortless for the disperser to send a control signal asking operators to download, so there is not much waste. So in v2, if a client trusts the disperser, it will fully delegate retrying to the disperser.

Contributor comment:

The user will be able to resubmit the same blob to a different disperser once dispersal is permissionless.

errChan <- api.NewErrorAPIGeneric(http.StatusServiceUnavailable,
fmt.Errorf("eigenda might be down. timed out waiting for blob to land onchain (request id=%s): %w", base64RequestID, ctx.Err()))
}
// but not else (otherwise it might be a problem with ethereum, so fallbacking to ethda wouldnt help)
Contributor comment:

Should we be more explicit, with an else if (grpcdisperser.Confirmed), and a default that returns something else? Maybe useful for debugging.
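
For example, roughly (a sketch of this suggestion meant to slot into the existing wait loop in putBlob; latestBlobStatus and the exact messages are assumed names, not necessarily what the PR uses):

case <-ctx.Done():
	// be explicit about which status we last saw before timing out
	switch latestBlobStatus {
	case grpcdisperser.BlobStatus_PROCESSING, grpcdisperser.BlobStatus_DISPERSING:
		// blob never landed onchain: likely an EigenDA problem, so 503 lets the batcher fail over
		errChan <- api.NewErrorAPIGeneric(http.StatusServiceUnavailable,
			fmt.Errorf("eigenda might be down. timed out waiting for blob to land onchain (request id=%s): %w", base64RequestID, ctx.Err()))
	case grpcdisperser.BlobStatus_CONFIRMED:
		// confirmed but not finalized: more likely an Ethereum-side issue, so 500 (no failover)
		errChan <- api.NewErrorAPIGeneric(http.StatusInternalServerError,
			fmt.Errorf("timed out waiting for confirmed blob to finalize (request id=%s): %w", base64RequestID, ctx.Err()))
	default:
		errChan <- api.NewErrorAPIGeneric(http.StatusInternalServerError,
			fmt.Errorf("timed out in unexpected blob status %v (request id=%s): %w", latestBlobStatus, base64RequestID, ctx.Err()))
	}
	return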

@@ -200,10 +237,12 @@ func (m *EigenDAClient) putBlob(ctx context.Context, rawData []byte, resultChan
alreadyWaitingForDispersal = true
}
case grpcdisperser.BlobStatus_FAILED:
errChan <- fmt.Errorf("EigenDA blob dispersal failed in processing, requestID=%s: %w", base64RequestID, err)
// TODO: when exactly does this happen? I think only happens if ethereum reorged and the blob was lost
Contributor comment:

  1. The blob has expired (a client retrieves it after 14 days). Sounds like a 400 error. (Fine to ignore, since it would time out already.)
  2. An internal logic error while requesting encoding (shouldn't happen), but it should return 503.
  3. Waiting for blob finalization after confirmation, and the blob retry has exceeded its limit. Thinking a bit more, it is possible that a chain re-org triggered a Failed status. See the code (https://github.com/Layr-Labs/eigenda/blob/master/disperser/batcher/finalizer.go#L179-L189). So we should be returning 500, not 503.

In any case, EigenDA turned the error into the FAILED status, so it makes sense that we return 500, i.e. internalServerError.
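
If that reasoning holds, the FAILED branch would end up looking roughly like this (a sketch only, mirroring the case style already in the diff; whether 500 or 503 is correct is exactly what is being discussed here):

case grpcdisperser.BlobStatus_FAILED:
	// dispersal failed inside EigenDA (expiry, encoding error, or a re-org per
	// batcher/finalizer.go), so return a plain 500 rather than a failover-triggering 503
	errChan <- api.NewErrorAPIGeneric(http.StatusInternalServerError,
		fmt.Errorf("EigenDA blob dispersal failed in processing, requestID=%s: %w", base64RequestID, err))
	return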

return
case grpcdisperser.BlobStatus_INSUFFICIENT_SIGNATURES:
errChan <- fmt.Errorf("EigenDA blob dispersal failed in processing with insufficient signatures, requestID=%s: %w", base64RequestID, err)
// this might be a temporary condition where some eigenda nodes were temporarily offline, so we should retry
errChan <- api.NewErrorAPIGeneric(http.StatusInternalServerError, fmt.Errorf("blob dispersal (requestID=%s) failed with insufficient signatures. please resubmit the blob.", base64RequestID))
Contributor comment:

I think it should be 503. Effectively, it means the service is down for this batch.

InsufficientSignatures
// DISPERSING means that the blob is currently being dispersed to DA Nodes and being confirmed onchain
Contributor comment:

have not yet confirmed onchain

Finalized
// INSUFFICIENT_SIGNATURES means that the confirmation threshold for the blob was not met
// for at least one quorum.
Contributor comment:

I think once a blob has received insufficient signatures, it won't be retried.
