Performance improvements for Random.Shuffle() #83305
Conversation
I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.
Tagging subscribers to this area: @dotnet/area-system-runtime
@wainwrightmark are you still working on this one?
Sorry, I sort of forgot about this. I think I was happy with this (apart from the small changes people have suggested, which I'll do shortly). I think this does provide a significant benefit and I can't think of any further significant improvements. If I do those small changes, could it then be merged? I'm afraid I haven't contributed here before and I don't know what the protocol is / who the decision makers are.
@wainwrightmark no problem at all -- different PRs work at different speeds here. 😄
I haven't followed the PR and don't own the area. Speaking generally, we certainly often take changes of this level of complexity to get this kind of perf improvement in potentially perf-sensitive APIs. @stephentoub any take here?
switch (n)
{
    case 2:
        return (479001600, 11); // 12 factorial
would it be worth adding some debug asserts to calculate and self-document these magic numbers?
I have done this. I may have gone a bit overboard though - the simplest way to check the threshold numbers is to check that using the next number up results in an overflow exception. How critical is performance with debug symbols?
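For concreteness, the kind of self-documenting assert being discussed might look something like this (a sketch only, not the PR's actual code):

using System.Diagnostics;

// Recompute 12! at debug time to document the magic number; checked(...)
// makes any overflow in the recomputation throw rather than wrap silently.
int factorial = 1;
for (int i = 2; i <= 12; i++)
    factorial = checked(factorial * i);
Debug.Assert(factorial == 479001600, "479001600 must equal 12 factorial");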
private static (int, int) CalculateBound(int n)
{
    int count;
    switch (n)
can this switch use fancy new pattern matching syntax to be more compact?
I tried this but I don't think relational patterns allow return statements.
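For reference, a switch expression produces its value directly rather than via return statements, which is the compact shape being discussed; whether it actually fits the real method is what was tried above. In this sketch only the case from the earlier snippet is real and the rest is placeholder:

using System;

static (int, int) CalculateBound(int n) => n switch
{
    2 => (479001600, 11), // 12 factorial, as in the snippet above
    _ => throw new ArgumentOutOfRangeException(nameof(n)), // placeholder
};

Console.WriteLine(CalculateBound(2));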
I didn't read the paper you linked -- does this suggest that further improvement is possible for the Next methods, beyond #79790? (which was based on Lemire, which that paper references)
Co-authored-by: Theodore Tsirpanis <teo@tsirpanis.gr>
Co-authored-by: Günther Foidl <gue@korporal.at>
Co-authored-by: Dan Moseley <danmose@microsoft.com>
// This does not prove that the shuffle is fair but will at least fail if the shuffle is obviously unfair.
// This is a "chi-squared goodness of fit" problem.
// The number of degrees of freedom is (len - 1)*(len - 1).
// The significance is 0.01 so this test should naturally fail for 1 in 100 seeds.
that sounds like a flaky test? These tests run thousands of times a day someplace or another. I was using a critical value to give a 10^-8 chance of failure. That was still enough to be basically certain of failure if there was a serious bug.
I've used a fixed seed so this test will always produce the same result no matter how many times you run it. It might falsely fail if the rng or the shuffle implementation changes but if that happens, one can change the seed and 99% of seeds should work.
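To make the discussion concrete, a fairness check of this general shape (the sizes and seed here are hypothetical, not the PR's actual test) could look like:

using System;
using System.Linq;

const int Len = 4, Trials = 10_000;
var counts = new int[Len, Len];
var random = new Random(12345); // fixed seed keeps the outcome deterministic

for (int t = 0; t < Trials; t++)
{
    int[] values = Enumerable.Range(0, Len).ToArray();
    random.Shuffle(values);
    for (int pos = 0; pos < Len; pos++)
        counts[values[pos], pos]++; // tally where each element lands
}

// Chi-squared goodness of fit: compare observed placement counts to the
// uniform expectation of Trials / Len per cell.
double expected = Trials / (double)Len;
double chiSquared = 0;
foreach (int observed in counts)
    chiSquared += (observed - expected) * (observed - expected) / expected;

// Compare chiSquared against the critical value for (Len - 1)^2 = 9 degrees
// of freedom at the chosen significance level; exceeding it flags unfairness.
Console.WriteLine(chiSquared);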
{
    int index = chooser.NextIndex();

// This pattern is faster than tuple deconstruction for large structs
is that something the compiler could/should improve?
It was mentioned here; my understanding is that it will be fixed after C# 11.
Thanks. Area owner @michaelgsharp or @tannergooding sign off on correctness etc.
I pulled down this PR, rebased it on main, built it, and tried it out. I'm not seeing the cited perf gains; in fact on larger inputs it's registering as a regression. What is the benchmark you're running? Here's mine:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Linq;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

public class Tests
{
    [Params(10, 100, 1000)]
    public int Length { get; set; }

    private int[] _values;

    [GlobalSetup]
    public void Setup() => _values = Enumerable.Range(0, Length).ToArray();

    [Benchmark]
    public void Shuffle() => Random.Shared.Shuffle(_values);
}
Is there any chance you could try with AggressiveInlining?
Thanks. I tried with AggressiveInlining on CalculateBounds, but it didn't change anything. I suspect what's changed between when you previously tested it and now is that dynamic PGO has been enabled by default. If I turn off dynamic PGO, I get this:
That suggests the main saving from this change was simply that it avoided a virtual call per Next, and dynamic PGO is now avoiding that virtual call anyway. But even if it weren't, it'd be simpler to avoid the virtual call per iteration in other ways, e.g. adding Shuffle to the internal Impl abstraction such that this could be a single virtual dispatch rather than one per item. Are you able to produce numbers that show this change has a significant benefit? If yes, great, can you share both the benchmark and the numbers you're seeing? If not, while I appreciate the efforts, it probably makes sense to instead close this.
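For anyone reproducing the comparison: assuming the DOTNET_TieredPGO environment variable is what controls dynamic PGO on the build under test, a BenchmarkDotNet config along these lines runs the same benchmark with it on and off:

using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

public class PgoComparisonConfig : ManualConfig
{
    public PgoComparisonConfig()
    {
        AddJob(Job.Default.WithId("PGO on"));
        AddJob(Job.Default
                  .WithEnvironmentVariable("DOTNET_TieredPGO", "0")
                  .WithId("PGO off"));
    }
}

// Apply with [Config(typeof(PgoComparisonConfig))] on the Tests class above.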
Aha. This explains a lot. Running your benchmarks (which were basically the same as mine) on the latest version I get:
which I think are pretty comparable to your results. And the explanation for the change makes sense to me. Sorry for the huge waste of time. Feel free to close this. Though it might be worth re-exploring it if C# ever gets both efficient checked multiplication and something like …
Not a waste of time. Thanks for your efforts.
Closes #82838
This improves the performance of shuffling a span by using a different method.
Brief Explanation
Shuffling a span of length n requires sequentially swapping each element with a random element from the span up to and including itself. For example, to shuffle a span of length 3:
The first swap is trivial - the first element has a 100% chance of being swapped with itself.
The second element is swapped with a random element of the first two elements so it is either swapped with the first element or left in place with equal probability.
The third element is swapped with a random one of the first three elements so it has a 1/3 chance of being swapped with the first element, 1/3 chance of being swapped with the second element and 1/3 chance of not being swapped.
You should be able to convince yourself that each of the six possible orderings is equally likely to be produced.
Note that the old implementation did this backwards - doing the big swaps first and ending with the two-element swap. Both directions are equally good at shuffling the span but will produce different values for different seeds.
The old implementation generated a new random number for each swap - incurring the cost not only of advancing the rng but also of the bias checks and potential rejection. This article about this is very informative.
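As a sketch, the one-random-number-per-swap approach described above (what the old implementation effectively did, modulo the direction of iteration) looks like this; the names are illustrative:

using System;

static void ShufflePerSwap<T>(Random random, Span<T> values)
{
    // Element i swaps with a uniform index in [0, i], so every
    // ordering of the span is equally likely.
    for (int i = 1; i < values.Length; i++)
    {
        int j = random.Next(i + 1);
        (values[i], values[j]) = (values[j], values[i]);
    }
}

int[] data = { 1, 2, 3, 4, 5 };
ShufflePerSwap(Random.Shared, data);
Console.WriteLine(string.Join(", ", data));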
The new implementation in this PR reduces the number of random numbers generated by grouping the swaps and only generating one random number for each group.
For example, the first twelve swaps (including the trivial one where the first element is always 'swapped' with itself) can be grouped together and represented by one random number in the range [0, 479001600) (12 factorial). This makes sense, as there are 12! ways to order 12 elements, so each possible random number represents one of those orderings.
We can deconstruct the random number into swaps by doing successive divmod operations - the randomly generated number has an equal chance of being 0 or 1 mod 2; then, after division by 2, it has an equal chance of being 0, 1, or 2 mod 3, and so on.
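A minimal sketch of that deconstruction for the first group (assuming a span of exactly 12 elements; not the PR's actual code):

using System;

int[] values = new int[12];
for (int k = 0; k < values.Length; k++) values[k] = k;

// One random number in [0, 12!) encodes the eleven non-trivial swaps.
int r = Random.Shared.Next(479001600); // 479001600 == 12!
for (int i = 1; i < 12; i++)
{
    int j = r % (i + 1); // uniform in [0, i]: the swap target for element i
    r /= i + 1;          // peel that choice off before decoding the next one
    (values[i], values[j]) = (values[j], values[i]);
}

Console.WriteLine(string.Join(", ", values));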
For longer spans we create additional groups - the next seven swaps are represented by a random number in the range [0, 253955520), the upper bound of which is 19! / 12!. Note that the group sizes are chosen to be as large as possible whilst still fitting into a 32-bit integer. It would be possible to get larger groups by using 64-bit integers but I have found that this leads to worse performance.
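To picture how the group bounds fall out, here is an illustrative sketch (not the PR's CalculateBound) that grows each group while the running product of swap counts still fits in a signed 32-bit integer:

using System;

static (int Bound, int NextStart) NextGroup(int start, int length)
{
    long bound = 1;
    int i = start;
    while (i < length && bound * (i + 1) <= int.MaxValue)
    {
        bound *= i + 1; // element i has (i + 1) possible swap targets
        i++;
    }
    return ((int)bound, i);
}

var (first, next) = NextGroup(1, 1000);
Console.WriteLine(first);                       // 479001600 == 12!
Console.WriteLine(NextGroup(next, 1000).Bound); // 253955520 == 19! / 12!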
The division and modulo operations involved are expensive but this process seems to be about twice as fast as the old implementation. (See benchmark results below)
Changes
I have changed the Random.Shuffle code to use the new method. This is a value-breaking change - Shuffle will now produce a different but still random ordering. I have updated the tests to reflect this.
I've also added a new test, Shuffle_Array_Fairness, which checks that shuffling isn't obviously unfair, but this test is not necessarily useful or needed.
Issues
Random doesn't seem to have a method to produce random unsigned integers, so I have to use signed integers instead, which makes some of the group sizes smaller and may be affecting performance.
Performance
I have found some optimizations since initially posting the issue and now the performance gains seem to be about 2x.