
feat: eigenda client returns 503 errors (for failover purpose) #828

Open: samlaf wants to merge 9 commits into master
Conversation

samlaf (Contributor) commented Oct 22, 2024

Why are these changes needed?

Related to batcher failover project. See https://linear.app/eigenlabs/issue/EBE-60/design-fallback-mechanism-for-both-op-and-arb and ethereum-optimism/specs#434 for more details.

Basically, we want eigenda-proxy to return a 503 to the batcher to signify "eigenda is down, fail over and start submitting blobs to Ethereum instead".

This PR makes the eigenda-client return 503s. eigenda-proxy will then consume these, retry up to 3 times, and, if the client is still returning 503s, return a 503 to the batcher (see the sketch after the list below).

The 2 sources of 503 errors that we have identified are:

  1. disperseBlob call failed (disperser is down) - TODO: this is not implemented yet because the disperser client still doesn't return errors with error codes. I can either update this PR or make a follow-up one to do that.
  2. blob stuck in PROCESSING or DISPERSING status, and the call times out before reaching CONFIRMED state (landing onchain) - this is implemented
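
Rough sketch of the proxy-side flow described above (illustration only, not part of this PR's diff): PutBlob is the client's public method, but the StatusCode() accessor on the returned error and the helper itself are assumptions; see api/errors.go for the actual error shape.

package proxy // hypothetical caller-side package, for illustration only

import (
	"context"
	"errors"
	"net/http"

	"github.com/Layr-Labs/eigenda/api/clients"
)

// putWithFailover retries the EigenDA client a few times and only surfaces a
// 503 (failover signal) to the batcher if every attempt came back as a 503.
func putWithFailover(ctx context.Context, client *clients.EigenDAClient, data []byte) error {
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		_, err := client.PutBlob(ctx, data)
		if err == nil {
			return nil
		}
		lastErr = err
		// assumed accessor: the PR attaches an HTTP status code to the returned api error
		var apiErr interface{ StatusCode() int }
		if !errors.As(err, &apiErr) || apiErr.StatusCode() != http.StatusServiceUnavailable {
			return err // not a 503, so not an "eigenda is down" signal; don't fail over
		}
	}
	// still 503 after 3 attempts: bubble it up so the batcher fails over to Ethereum
	return lastErr
}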

TODO

  • implement 503s in disperserClient
  • Add documentation for how to set the different TIMEOUTS (see this discussion)

Checks

  • I've made sure the lint is passing in this PR.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; in that case, please comment that they are not relevant.
  • I've checked the new test coverage and the coverage percentage didn't drop.
  • Testing Strategy
    • Unit tests
    • Integration tests
    • This PR is not tested :(

api/errors.go (outdated), comment on lines 60 to 61:
// ErrorCode returns the error code for the API exception.
ErrorCode() ErrorCode


Why not just use the built-in http.StatusCode type instead of defining our own enum?

samlaf (Contributor, Author) replied:

Good idea: b4b3441

I guess my main concern was whether we eventually switch to gRPC error codes instead. In that case we might want ErrorCode to be a string, for example, instead of an int. But in the meantime I agree that using the stdlib is best.
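
For illustration, roughly the shape such an error type could take after b4b3441 (a sketch only; the actual type and field names in api/errors.go may differ):

package api

import "fmt"

// ErrorAPIGeneric sketches an API error carrying a stdlib HTTP status code
// (e.g. http.StatusServiceUnavailable) instead of a custom enum.
type ErrorAPIGeneric struct {
	Code int // stdlib HTTP status code
	Err  error
}

func (e *ErrorAPIGeneric) StatusCode() int { return e.Code }

func (e *ErrorAPIGeneric) Error() string {
	return fmt.Sprintf("api error (status %d): %v", e.Code, e.Err)
}

func (e *ErrorAPIGeneric) Unwrap() error { return e.Err }

// NewErrorAPIGeneric mirrors the constructor used in this PR's diff.
func NewErrorAPIGeneric(statusCode int, err error) *ErrorAPIGeneric {
	return &ErrorAPIGeneric{Code: statusCode, Err: err}
}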

samlaf requested a review from epociask on October 24, 2024 at 13:46
// We set to unknown fault b/c disperser client returns a mix of 400s and 500s currently.
// TODO: update disperser client to also return ErrorAPIGeneric errors
Code: 0,
Fault: api.ErrorFaultUnknown,
bxue-l2 (Contributor) commented Oct 25, 2024:

A very common returned error is rateLimited. It makes sense to just treat it as a 400 error.
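
Once the disperser client does return coded errors, the mapping could look roughly like this (a sketch with an assumed gRPC-to-HTTP mapping, not what the client does today; the rate-limited case is treated as a client fault so it does not trigger failover):

package clients

import (
	"net/http"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// httpCodeForDisperserError sketches how gRPC errors from the disperser could
// be mapped to HTTP-style codes for the proxy (hypothetical helper).
func httpCodeForDisperserError(err error) int {
	st, ok := status.FromError(err)
	if !ok {
		return http.StatusInternalServerError
	}
	switch st.Code() {
	case codes.ResourceExhausted: // rate limited: caller's fault, don't fail over
		return http.StatusTooManyRequests
	case codes.InvalidArgument:
		return http.StatusBadRequest
	case codes.Unavailable, codes.DeadlineExceeded: // disperser down or overloaded
		return http.StatusServiceUnavailable
	default:
		return http.StatusInternalServerError
	}
}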

return
}

// process response
if *blobStatus == disperser.Failed {
errChan <- fmt.Errorf("unable to disperse blob to eigenda (reply status %d): %w", blobStatus, err)
// Don't think this state should be reachable. DisperseBlobAuthenticated should return an error instead of a failed status.
Contributor comment:

The only valid returned blob status is PROCESSING. I don't know how this changes in v2.

// 1. means that there is a problem with EigenDA, so we return 503 to let the batcher failover to ethda
// 2. means that there is a problem with Ethereum, so we return 500.
// batcher would most likely resubmit another blob, which is not ideal but there isn't much to be done...
// eigenDA v2 will have idempotency so one can just resubmit the same blob safely.
Contributor comment:

In EigenDA v2, it is impossible for users to resubmit the same blob, so in that sense it does not have idempotency. But it is effortless for the disperser to send a control signal asking operators to download, so there is not much waste. So in v2, if a client trusts the disperser, it will fully delegate retrying to the disperser.

Contributor comment:

The user will be able to resubmit the same blob to a different disperser once dispersal is permissionless.

errChan <- api.NewErrorAPIGeneric(http.StatusServiceUnavailable,
fmt.Errorf("eigenda might be down. timed out waiting for blob to land onchain (request id=%s): %w", base64RequestID, ctx.Err()))
}
// but not else (otherwise it might be a problem with ethereum, so fallbacking to ethda wouldnt help)
Contributor comment:

Should we be more explicit, with an else if (grpcdisperser.Confirmed), and a default that returns something else? Maybe useful for debugging.
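
For example, roughly (a sketch of this suggestion meant to slot into the existing wait loop in putBlob; latestBlobStatus and the exact messages are assumed names, not necessarily what the PR uses):

case <-ctx.Done():
	// be explicit about which status we last saw before timing out
	switch latestBlobStatus {
	case grpcdisperser.BlobStatus_PROCESSING, grpcdisperser.BlobStatus_DISPERSING:
		// blob never landed onchain: likely an EigenDA problem, so 503 lets the batcher fail over
		errChan <- api.NewErrorAPIGeneric(http.StatusServiceUnavailable,
			fmt.Errorf("eigenda might be down. timed out waiting for blob to land onchain (request id=%s): %w", base64RequestID, ctx.Err()))
	case grpcdisperser.BlobStatus_CONFIRMED:
		// confirmed but not finalized: more likely an Ethereum-side issue, so 500 (no failover)
		errChan <- api.NewErrorAPIGeneric(http.StatusInternalServerError,
			fmt.Errorf("timed out waiting for confirmed blob to finalize (request id=%s): %w", base64RequestID, ctx.Err()))
	default:
		errChan <- api.NewErrorAPIGeneric(http.StatusInternalServerError,
			fmt.Errorf("timed out in unexpected blob status %v (request id=%s): %w", latestBlobStatus, base64RequestID, ctx.Err()))
	}
	return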

@@ -200,10 +237,12 @@ func (m *EigenDAClient) putBlob(ctx context.Context, rawData []byte, resultChan
alreadyWaitingForDispersal = true
}
case grpcdisperser.BlobStatus_FAILED:
errChan <- fmt.Errorf("EigenDA blob dispersal failed in processing, requestID=%s: %w", base64RequestID, err)
// TODO: when exactly does this happen? I think only happens if ethereum reorged and the blob was lost
Contributor comment:

  1. The blob has expired (a client retrieves it after 14 days). Sounds like a 400 error. (Fine to ignore, since it would time out already.)
  2. An internal logic error while requesting encoding (shouldn't happen), but it should return 503.
  3. Waiting for blob finalization after confirmation, and the blob retry has exceeded its limit. Thinking a bit more, it is possible that a chain re-org triggered a Failed status. See the code (https://github.com/Layr-Labs/eigenda/blob/master/disperser/batcher/finalizer.go#L179-L189). So we should be returning 500, not 503.

In any case, EigenDA turned the error into the FAILED status, so it makes sense that we return 500, i.e. internalServerError.
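
If that reasoning holds, the FAILED branch would end up looking roughly like this (a sketch only, mirroring the case style already in the diff; whether 500 or 503 is correct is exactly what is being discussed here):

case grpcdisperser.BlobStatus_FAILED:
	// dispersal failed inside EigenDA (expiry, encoding error, or a re-org per
	// batcher/finalizer.go), so return a plain 500 rather than a failover-triggering 503
	errChan <- api.NewErrorAPIGeneric(http.StatusInternalServerError,
		fmt.Errorf("EigenDA blob dispersal failed in processing, requestID=%s: %w", base64RequestID, err))
	return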

return
case grpcdisperser.BlobStatus_INSUFFICIENT_SIGNATURES:
errChan <- fmt.Errorf("EigenDA blob dispersal failed in processing with insufficient signatures, requestID=%s: %w", base64RequestID, err)
// this might be a temporary condition where some eigenda nodes were temporarily offline, so we should retry
errChan <- api.NewErrorAPIGeneric(http.StatusInternalServerError, fmt.Errorf("blob dispersal (requestID=%s) failed with insufficient signatures. please resubmit the blob.", base64RequestID))
Contributor comment:

I think it should be 503. Effectively, it means the service is down for this batch.

InsufficientSignatures
// DISPERSING means that the blob is currently being dispersed to DA Nodes and being confirmed onchain
Contributor comment:

have not yet confirmed onchain

Finalized
// INSUFFICIENT_SIGNATURES means that the confirmation threshold for the blob was not met
// for at least one quorum.
Contributor comment:

I think once a blob has received insufficient signatures, it won't be retried.
