Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Debug hangs in CI #423

Merged
merged 5 commits into from
Oct 2, 2024
Merged

Debug hangs in CI #423

merged 5 commits into from
Oct 2, 2024

Conversation

sebastianburckhardt
Copy link
Member

We have seen a lot of hangs recently in the CI, which seem to not be caused by recent changes but look a bit like errors in the test framework. I created this PR as a way to further investigate and hopefully fix these hangs.

In this first step, I am trying to understand hangs in the QueriesCopyToTail test which I have seen repeatedly in CI. The hangs seem to have something to do with the checkpoint injection mechanism used by the test. To narrow this down, we revise this state machine and add lots of tracing.

@sebastianburckhardt
Copy link
Member Author

I am going to rerun the tests several times to see if I can flush out more CI failures.

@davidmrdavid davidmrdavid self-requested a review October 1, 2024 02:30
Copy link
Collaborator

@davidmrdavid davidmrdavid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I understand correctly, the hang is fixed by the cacheDebugger additions in FasterKV? Mostly around the takeFromOutput checks? Not sure I understand how that led to the CI to hang. It would help me if you could explain that. Thanks!

foreach ((string name, Task task) in scenarios.StartAllScenarios(includeTimers: true, includeLarge: true))
foreach ((string name, Task task) in scenarios.StartAllScenarios(includeTimers: false, includeLarge: true))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw timer tests failing spuriously (because of large time delays leading to unexpected results) so I did not think it was worth keeping them in there.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, you're seeing that timers make creating a testing spec difficult (as in it's difficult to assert what exactly will happen when). Ok, that makes sense

Comment on lines 2084 to 2092
takeValueFromOutput = (output.Val != null); // we have observed that src.Val is null sometimes, so if present, we use output instead
if (takeValueFromOutput)
{
this.cacheDebugger?.Record(key.Val, CacheDebugger.CacheEvent.SingleWriterCopyToTailFromOutput, output.Version, default, info.Address);
}
else
{
this.cacheDebugger?.Record(key.Val, CacheDebugger.CacheEvent.SingleWriterCopyToTail, src.Version, default, info.Address);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good place to elaborate a bit more on the significance of src.Val being null. Is this a bug in FASTER, or expected, or something else? Can you please add a small blurb about that here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's sort of what the comment on line 2084 is supposed to be.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that gives me enough context though. The way the comment is phrased ("we have observed...sometimes") makes me thing it is not expected that src.Val would be null, and if that's true then I think that's worth calling out explicitly. It would tell me whether or not there's something weird that FASTER is doing here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I actually also don't know. I can add a bit more text to say more explicitly that I don't really know what is supposed to be happening.

@sebastianburckhardt sebastianburckhardt added this to the 2.0.0 milestone Oct 2, 2024
@sebastianburckhardt sebastianburckhardt merged commit 55045a2 into main Oct 2, 2024
1 of 2 checks passed
@davidmrdavid davidmrdavid deleted the pr/debug-CI branch October 21, 2024 16:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants