Benchmark examples with codspeed #443

Merged: 19 commits into egraphs-good:main from codspeed, Oct 16, 2024

Conversation

saulshanabrook (Member) commented Oct 13, 2024

This PR adds benchmarking to CI by running all of our example files with codspeed.

It will comment on new PRs with how much they affect the performance of running the example files.

Hopefully, it will not only help us see how PRs affect performance but also give us profiling through codspeed to diagnose where things speed up or slow down.
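For concreteness, here is a minimal sketch of what such a harness can look like. The codspeed-criterion-compat crate is real (CodSpeed's drop-in replacement for criterion), but the egglog calls (`EGraph::default`, `parse_and_run_program`) and the tests/ layout are assumptions for illustration, not a copy of this PR's benches/examples.rs:

```rust
// Sketch of a codspeed benchmark harness over the example files.
use codspeed_criterion_compat::{criterion_group, criterion_main, Criterion};

fn run_example(source: &str) {
    // Assumed entry point: parse and run one example program end to end.
    let mut egraph = egglog::EGraph::default();
    egraph
        .parse_and_run_program(source)
        .expect("example should run without errors");
}

fn benchmark_examples(c: &mut Criterion) {
    // Register each .egg file under tests/ as its own named benchmark.
    for entry in std::fs::read_dir("tests").unwrap() {
        let path = entry.unwrap().path();
        if path.extension().and_then(|e| e.to_str()) != Some("egg") {
            continue;
        }
        let name = path.file_stem().unwrap().to_string_lossy().to_string();
        let source = std::fs::read_to_string(&path).unwrap();
        c.bench_function(&name, |b| b.iter(|| run_example(&source)));
    }
}

criterion_group!(benches, benchmark_examples);
criterion_main!(benches);
```

Registering each file as its own named benchmark is what produces the per-example entries in the report below.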

saulshanabrook requested a review from a team as a code owner October 13, 2024 01:07
saulshanabrook requested review from ajpal and removed request for a team October 13, 2024 01:07
saulshanabrook (Member, Author) commented:

I think this needs an admin on the egraphs-good org to approve the app in codspeed; I had to "request" adding it to this org.

I would also need admin permissions on this repo to add the CODSPEED_TOKEN secret, or someone who already has them can create one and add it here!

saulshanabrook (Member, Author) commented Oct 14, 2024

In response to some questions asked offline about the downsides of adding codspeed:

It introduces some more complexity overall in terms of depending on an external service, and will add some more noise to PRs about their performance impacts. It may also increase CI time, from maybe 2 minutes to 4 minutes. If we find this too detrimental, we can exclude some of our longer-running examples from benchmarking.

Performance is also not an exact science, so it has the potential to give false assurances, when in fact the impact might differ in another environment. It also means we are judging performance based on the example files, which might not be representative of performance-sensitive workloads.

Overall, though, I would judge the additional signal to be worth those costs, in helping us assess the impact of PRs on performance, especially those meant to make things faster. I think it will also encourage us to keep representative example files.

I have found codspeed invaluable in the Python package for diagnosing the impact of large changes on end-to-end performance, and it has helped me refine those changes so that the impacts are reasonable. Its built-in flamegraphs have also helped me narrow down where the slowdowns occur and spot upstream slowdowns from egglog changes. So adding this instrumentation closer to the source would help us get that information sooner.

codspeed-hq (bot) commented Oct 15, 2024

CodSpeed Performance Report

Congrats! CodSpeed is installed 🎉

🆕 87 new benchmarks were detected.

You will start to see performance impacts in the reports once the benchmarks are run from your default branch.

Detected benchmarks

  • antiunify (2.1 ms)
  • array (26.8 ms)
  • bdd (15.2 ms)
  • before-proofs (1.6 ms)
  • birewrite (1.3 ms)
  • bitwise (678.3 µs)
  • bool (1.4 ms)
  • calc (5.4 ms)
  • combinators (18.4 ms)
  • combined-nested (1 ms)
  • container-rebuild (2 ms)
  • cyk (11.2 ms)
  • cykjson (338.4 ms)
  • datatypes (582.1 µs)
  • delete (642.5 µs)
  • eggcc-extraction (5.5 s)
  • eqsat-basic (1.6 ms)
  • eqsolve (31.1 ms)
  • f64 (935.7 µs)
  • fail_wrong_assertion (1.3 ms)
  • fibonacci (1.6 ms)
  • fibonacci-demand (2.1 ms)
  • fusion (43.7 ms)
  • herbie (286.2 ms)
  • herbie-tutorial (12.7 ms)
  • i64 (350.2 µs)
  • include (1.2 ms)
  • integer_math (12.2 ms)
  • intersection (1.9 ms)
  • interval (2.7 ms)
  • knapsack (5.7 ms)
  • lambda (145 ms)
  • levenshtein-distance (15.1 ms)
  • list (4.8 ms)
  • map (650.5 µs)
  • math (36.7 ms)
  • math-microbenchmark (4.2 s)
  • matrix (11.5 ms)
  • merge-during-rebuild (1 ms)
  • merge-saturates (3 ms)
  • name-resolution (1.2 ms)
  • path (1 ms)
  • path-union (1.1 ms)
  • pathproof (1.5 ms)
  • points-to (2 ms)
  • primitives (523.4 µs)
  • prims (5.4 ms)
  • push-pop (648 µs)
  • rational (890.2 µs)
  • repro-define (693.2 µs)
  • repro-desugar-143 (8.6 ms)
  • repro-empty-query (618.4 µs)
  • repro-equal-constant (670 µs)
  • repro-equal-constant2 (654.4 µs)
  • repro-noteqbug (745.2 µs)
  • repro-primitive-query (681.2 µs)
  • repro-querybug (929.3 µs)
  • repro-querybug2 (669.1 µs)
  • repro-querybug3 (2 ms)
  • repro-querybug4 (711.8 µs)
  • repro-should-saturate (650.2 µs)
  • repro-silly-panic (918.8 µs)
  • repro-typechecking-schedule (462.1 µs)
  • repro-unsound (269.3 ms)
  • repro-unsound-htutorial (847.3 µs)
  • repro-vec-unequal (768.6 µs)
  • resolution (4.1 ms)
  • rw-analysis (39.8 ms)
  • schedule-demo (2.2 ms)
  • semi_naive_set_function (41.5 ms)
  • set (2.7 ms)
  • stratified (901.7 µs)
  • string (539.3 µs)
  • string_quotes (477.1 µs)
  • subsume (1.5 ms)
  • test-combined (1.3 ms)
  • test-combined-steps (2.9 ms)
  • towers-of-hanoi (5.3 ms)
  • tricky-type-checking (11.6 ms)
  • type-constraints-tests (578.4 µs)
  • typecheck (5.8 ms)
  • typeinfer (400.7 ms)
  • unification-points-to (7.8 ms)
  • unify (1.1 ms)
  • unstable-fn (5.7 ms)
  • until (2.7 ms)
  • vec (975.2 µs)

```diff
@@ -1293,7 +1293,6 @@ impl EGraph {
     filename.push(file.as_str());
     // append to file
     let mut f = File::options()
-        .write(true)
```
saulshanabrook (Member, Author) commented on this diff:

nit: update from the Rust upgrade (clippy's `ineffective_open_options` lint: `.append(true)` already implies write access, so the explicit `.write(true)` was redundant)
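As a side illustration (not code from this PR; the file name is hypothetical), the lint fires because append mode alone already opens the file for writing:

```rust
use std::fs::File;
use std::io::Write;

fn main() -> std::io::Result<()> {
    // .write(true).append(true) is redundant: append(true) already
    // implies write access, so write(true) is a no-op.
    let mut f = File::options().append(true).create(true).open("log.txt")?;
    writeln!(f, "appended line")?;
    Ok(())
}
```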

```diff
-    stack: &mut Vec<Value>,
+    stack: &mut [Value],
```
saulshanabrook (Member, Author) commented on this diff:

nit: update from the Rust upgrade (a `&mut [Value]` parameter is more general than `&mut Vec<Value>` when the function never resizes the vector)
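A quick illustration (not code from this PR; it uses i64 in place of the crate's Value type) of why the slice parameter is preferable:

```rust
// A &mut [T] parameter accepts any mutable contiguous buffer, not just a
// Vec, as long as the function never grows or shrinks the collection.
fn negate_all(stack: &mut [i64]) {
    for v in stack.iter_mut() {
        *v = -*v;
    }
}

fn main() {
    let mut from_vec = vec![1, 2, 3];
    negate_all(&mut from_vec); // works with a Vec...

    let mut from_array = [4, 5, 6];
    negate_all(&mut from_array); // ...and with a plain array
    assert_eq!(from_array, [-4, -5, -6]);
}
```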

channel = "1.74.0"
channel = "1.79.0"
saulshanabrook (Member, Author) commented on this diff:

had to update the toolchain for the codspeed dep

```diff
-    egraph.fact_directory = args.fact_directory.clone();
+    egraph.fact_directory.clone_from(&args.fact_directory);
```
saulshanabrook (Member, Author) commented on this diff:

nit: update from the Rust upgrade (clippy's `assigning_clones` lint: `clone_from` can reuse the destination's existing allocation)
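A quick illustration (not code from this PR) of what `clone_from` buys, using String in place of the fact_directory field:

```rust
fn main() {
    let src = String::from("fact-dir");
    let mut dst = String::with_capacity(64);
    let cap_before = dst.capacity();

    // dst = src.clone() would allocate a fresh buffer and drop the old one;
    // clone_from copies into dst's existing buffer when it is large enough.
    dst.clone_from(&src);
    assert_eq!(dst, "fact-dir");
    assert_eq!(dst.capacity(), cap_before); // allocation was reused
}
```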

```diff
-    .map(|s| s.clone())
+    .cloned()
```
saulshanabrook (Member, Author) commented on this diff:

nit: update from the Rust upgrade (clippy's `map_clone` lint: `.cloned()` is the idiomatic form of `.map(|s| s.clone())`)
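A quick illustration (not code from this PR):

```rust
fn main() {
    // .cloned() on an iterator over references is equivalent to
    // .map(|s| s.clone()), just shorter and clearer about intent.
    let names = [String::from("math"), String::from("lambda")];
    let copies: Vec<String> = names.iter().cloned().collect();
    assert_eq!(copies, names);
}
```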

saulshanabrook (Member, Author) commented:

This is ready for review now. There are lots of tradeoffs between how accurate we want the benchmarks to be and their overall running time, but I tried to find a good balance for now.

Currently, the benchmarks CI task takes ~9 minutes to run. Why does it take so long? CodSpeed instruments the code to measure CPU cycles, which gives more reliable benchmarks on flaky hardware like that in CI:

CodSpeed instruments your benchmarks to measure the performance of your code. A benchmark will be run only once and the CPU behavior will be simulated. This ensures that the measurement is as accurate as possible, taking into account not only the instructions executed but also the cache and memory access patterns. The simulation gives us an equivalent of the CPU cycles that includes cache and memory access.

To keep the timings reasonable, I reduced the number of times the eggcc-extraction schedules run from 5 to 2. At 2 runs, that benchmark alone takes ~4 minutes; if I bump it even to 3 runs, it takes 46 minutes, which seems too long. One open question is whether eggcc-extraction does similar enough work at 2 runs as opposed to 3. Here is a screenshot of the two profiles, the 3-run one on the left (taking around 10x as long) and the 2-run one on the right:

[Screenshot: side-by-side profiles of eggcc-extraction with 3 runs (left) and 2 runs (right)]

They look similar enough that performance differences measured on the shorter run should be reasonably representative.

Here is a rough outline of where time is spent in the benchmarks currently:

[Chart: rough breakdown of where time is spent across the benchmarks]

I hope that even if this isn't a perfect representation of production tasks, it can still be helpful as we iterate on PRs. If your PR is minor, you can always merge it without waiting for the benchmarks to finish.

saulshanabrook changed the title from "Benchmark all examples with codspeed" to "Benchmark examples with codspeed" Oct 15, 2024
saulshanabrook requested review from yihozhang and removed request for ajpal October 15, 2024 21:30
Three resolved review threads on benches/examples.rs (outdated).
saulshanabrook merged commit 34b5d7b into egraphs-good:main Oct 16, 2024
4 checks passed
saulshanabrook deleted the codspeed branch October 16, 2024 00:57
saulshanabrook (Member, Author) commented:

I had increased codspeed's "threshold", the cutoff for reporting a performance change as meaningful, to 10%. I brought it back down to 5% so it can alert us to more changes:

[Screenshot: CodSpeed settings with the threshold set back to 5%]

I think the main thing it affects is the comments on PRs. If you click on the report, you can see a granular breakdown of all changes.

Alex-Fischman (Contributor) commented:

@saulshanabrook Are you sure about this last change? The website says that the default is 10%, and 10% seems more in line with the variance that we're seeing.

saulshanabrook (Member, Author) commented:

Yeah, I'm open to that! It just seems like a 5% speedup on the long-running benchmarks is actually rather significant... And if we are trying to improve them, we might need many smaller performance improvements to get there. I don't see much variability at all on the larger benchmarks.
