Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Proof tx submission enhancement/bug #1574

Closed

Conversation

nodiesBlade
Copy link
Contributor

@nodiesBlade nodiesBlade commented Jul 17, 2023

Description

Whenever the evidence does not match the total proofs, the node tries to delete the evidence and submits a claim. It should never submit a proof whenever there is LESS than the number of relay proofs as that can cause an array index of out-bounds exception if the randomly selected index is more than what we have in the store. This can cause a node crash.

Example: Servicer starts off with 10 relays in the evidence store

1. Session ends
2. Servicer submits claim for 10 relays on chain
4. Servicer stops node and restarts evidence.db & evidence.db gets corrupted
5. Servicer will attempt to submit a proof, and psuedo-random selected index for merkle proof generation can result in AIOB

Solution:
If the number of relay proofs in the evidence store is less than the submitted claims total relay proofs then delete the evidence and don't submit a proof.


There is a potential race condition where the evidence is not sealed while submitting a claim, allowing for more relays to enter the evidence claim. Whenever a proof is generated via GenerateMerkleRoot, it will result in an incorrect Merkle proof from being generated due to a mismatch of how many relays are in the store and how many was submitted on chain. This can inadvertently happen more often as we accept session rollover relays

Example: Servicer starts off with 10 relays in the evidence store

1. Session ends
2. Servicer handles a relay, while also submitting the claim (RACE Condition, no proper locking mechanism here)
3. Servicer submits claim for 10 relays on chain
4. Servicer finishes handling the relay and evidence store now contains 11 relays
5. Servicer (after waiting a couple of blocks) submits a proof tx with a merkle proof for 11 relays, not 10
6. Validators will mark the TX as invalid due to invalid merkle proof.

Solution

  • In order to allow for a best-effort process to submit proof for the claim the servicer submitted, we will calculate the maxRelays based off:
    1. Use the claim's total proof if the proof count doesn't exceed the maximum relays per session.
    2. Otherwise, fallback to the maximum relays per session.

The GenerateMerkleRoot already accepts a maxRelays parameter which was introduced to fix the Chocalate Rain/Overservicing vulnerability. It discards relays above maxRelays with the following implementation

	if int64(len(ev.Proofs)) > maxRelays {
		ev.Proofs = ev.Proofs[:maxRelays]
		ev.NumOfProofs = maxRelays
	}

Note: The more ideal solution is to only submit the claim whenever we are done servicing for sure, but this involves locks and potential side effects that Pocket core was never really built with in mind

Note: Something tells me that we were missing a continue statement in the old code whenever the evidence does not match the total proofs in the claim. It shouldn't submit if there is a mismatch / if we deleted the evidence. My solution adds a continue for less than, and tries a best effort to submit proof if we do have enough relay proofs in our evidence store.

Summary generated by Reviewpad on 17 Jul 23 19:44 UTC

This pull request adds better proof validation to the proof.go file in the x/pocketcore/keeper package. It includes changes to handle a potential race condition where the evidence is not sealed while submitting a claim, allowing for more relays to enter the evidence claim. It also generates a merkle proof for the claim's total proof count if it doesn't exceed the maximum number of relays per session, otherwise it falls back to the max relays per session. Additionally, it includes validation of the level count on the claim by the total relays.

@reviewpad reviewpad bot added small Pull request is small waiting-for-review labels Jul 17, 2023
@nodiesBlade nodiesBlade changed the title Add Proof tx submission enhancement Add Proof tx submission enhancement/bug Jul 17, 2023
@Olshansk Olshansk added this to the Network Cost milestone Jul 17, 2023
@Olshansk Olshansk requested a review from msmania July 17, 2023 22:18
Copy link
Member

@Olshansk Olshansk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple comments - no blockers.

Appreciate the explanation in the issue. I understand it looks safe but trying to be "ultra careful" with any last minute changes.

Adding @msmania for a review as well.

x/pocketcore/keeper/proof.go Show resolved Hide resolved
@@ -50,13 +50,16 @@ func (k Keeper) SendProofTx(ctx sdk.Ctx, n client.Client, node *pc.PocketNode, p
ctx.Logger().Error(fmt.Sprintf("could not delete evidence is not sealed, could cause a relay leak: %s", err.Error()))
}
}
if evidence.NumOfProofs != claim.TotalProofs {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PoktBlade Have you looked into the difficulty of adding a unit test for this? I realize it's not easy, but just wanted to make sure that we at least put effort into it.

Copy link
Contributor Author

@nodiesBlade nodiesBlade Jul 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can look into it some more for sanity purposes.

In regards to some other testing, @jorgecuesta did run it live in his fleet, and saw that there weren't any invalid proofs being submitted on chain. This was a dramatic decrease (0 invalid proofs) vs what he saw before the changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: I wrote some integration tests last night based off TestClaimProtoTx, but it turns out the test for this was already broken.. The reason why we don't see them when running tests, these tests are skipped whenever -s flag is passed.

Anyhow.. will see if I can fix the initial broken tests (seems like a crashing error when submitting proofs, due to AIOB)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PoktBlade Bump on existing tests and/or new tests.

@@ -73,8 +76,24 @@ func (k Keeper) SendProofTx(ctx sdk.Ctx, n client.Client, node *pc.PocketNode, p
if !found {
ctx.Logger().Error(fmt.Sprintf("an error occurred creating the proof transaction with app %s not found with evidence %v", evidence.ApplicationPubKey, evidence))
}

// There is a potential race condition where the evidence is not sealed while submitting a claim
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any idea why this wasn't a major concern in the past?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally whenever the claim is sealed, relays stop coming in and so there is some non deterministic timing in which the race condition could happen. If relays keep coming in (i.e session rollover), the problem becomes more prominent.

We did find some on chain tx's to show that this did periodically happen even without session rollover though.

Copy link
Contributor Author

@nodiesBlade nodiesBlade Jul 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image
You can go here and filter the transactions by "Proof" and "Failed", and put the start date before ~ the July 10th and it will show some of the failed submissions even with session rollover disabled.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was running into some issues but reached out to poktscan here: https://discord.com/channels/854406364931686400/950799595171086377/1131367539667107872

Copy link
Contributor

@msmania msmania left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks good. Can you update the comments for better readability?

x/pocketcore/keeper/proof.go Outdated Show resolved Hide resolved
x/pocketcore/keeper/proof.go Outdated Show resolved Hide resolved
x/pocketcore/keeper/proof.go Outdated Show resolved Hide resolved
@@ -50,13 +50,16 @@ func (k Keeper) SendProofTx(ctx sdk.Ctx, n client.Client, node *pc.PocketNode, p
ctx.Logger().Error(fmt.Sprintf("could not delete evidence is not sealed, could cause a relay leak: %s", err.Error()))
}
}
if evidence.NumOfProofs != claim.TotalProofs {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PoktBlade Bump on existing tests and/or new tests.

@@ -73,8 +76,24 @@ func (k Keeper) SendProofTx(ctx sdk.Ctx, n client.Client, node *pc.PocketNode, p
if !found {
ctx.Logger().Error(fmt.Sprintf("an error occurred creating the proof transaction with app %s not found with evidence %v", evidence.ApplicationPubKey, evidence))
}

// There is a potential race condition where the evidence is not sealed while submitting a claim
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was running into some issues but reached out to poktscan here: https://discord.com/channels/854406364931686400/950799595171086377/1131367539667107872

x/pocketcore/keeper/proof.go Outdated Show resolved Hide resolved
x/pocketcore/keeper/proof.go Outdated Show resolved Hide resolved
x/pocketcore/keeper/proof.go Show resolved Hide resolved
@nodiesBlade
Copy link
Contributor Author

Added all the comments as suggested! Thanks for the review.

In regards to the tests, I haven't made any progress to them - don't think I will have the bandwidth moving forward to diagnose / fix the integration tests. The changes here are minimal and only guard rails for proof tx, not consensus. If needed, we can push an additional hot fix without the need for consensus upgrades.

Copy link
Member

@Olshansk Olshansk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to block this PR until tests are fixed & added per this comment: #1574 (comment)

@PoktBlade Could you do a handoff (whether it's to myself, a core team member or someone from the community) with regard to the status of the tests you investigated? For example:

  • Which tests exist for it (e.g. names & links)
  • Which ones are broken (e.g. names & links)
  • Insight/ideas into why certain tests are broken (e.g. a few bullet points)
  • Which ones would you add if you had capacity (e.g. a list)

@nodiesBlade
Copy link
Contributor Author

I'm handing it off as is. The PR is already merged into Poktscan's fork.

@nodiesBlade
Copy link
Contributor Author

Since I don't have time to dedicate more to this issue, please cherry pick or fork my changes and move forward.

Closing since it won't be merged in / blocked.

@jorgecuesta
Copy link
Contributor

@Olshansk all the request you ask about test are not fair to be handle by the community people trying to help if pocket has people that get a salary for this.
This is why then people hesitate to help on the project. I think u guys may need to figure out a way to work on v0 because has a lot of things that are not working and should be working like TESTs and that is pocket responsibility, not the community indeed.

@Olshansk
Copy link
Member

@Olshansk all the request you ask about test are not fair to be handle by the community people trying to help if pocket has people that get a salary for this. This is why then people hesitate to help on the project. I think u guys may need to figure out a way to work on v0 because has a lot of things that are not working and should be working like TESTs and that is pocket responsibility, not the community indeed.

Thanks for the feedback.

I will work on adding the tests myself this time and reach out about getting a budget for it in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
small Pull request is small waiting-for-review
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

5 participants