Performance improvements for Random.Shuffle() #83305

Closed

Conversation

wainwrightmark

Closes #82838

This improves the performance of shuffling a span by using a different method.

Brief Explanation

Shuffling a span of length n requires sequentially swapping each element with a random element from the span up to and including itself.

For example, to shuffle a span of length 3:
The first swap is trivial - the first element has a 100% chance of being swapped with itself.
The second element is swapped with a random element of the first two elements so it is either swapped with the first element or left in place with equal probability.
The third element is swapped with a random one of the first three elements so it has a 1/3 chance of being swapped with the first element, 1/3 chance of being swapped with the second element and 1/3 chance of not being swapped.

You should be able to convince yourself that each of the six possible orderings is equally likely to be produced.

Note that the old implementation did this backwards - doing the big swaps first and ending with the two-element swap. Both directions are equally good at shuffling the span but will produce different orderings for the same seed.
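
For reference, a minimal sketch of the forward shuffle described above (the unoptimized baseline, not this PR's grouped implementation):

```csharp
using System;

static void NaiveShuffle<T>(Random random, Span<T> span)
{
    for (int i = 1; i < span.Length; i++)
    {
        // Swap element i with a random element in [0, i] (possibly itself).
        int j = random.Next(i + 1);
        (span[i], span[j]) = (span[j], span[i]);
    }
}
```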

The old implementation generated a new random number for each swap - incurring the cost not only of advancing the RNG but also of the bias checks and potential rejection. This article on the topic is very informative.

The new implementation in this PR reduces the number of random numbers generated by grouping the swaps and only generating one random number for each group.

For example, the first twelve swaps (including the trivial one where the first element is always 'swapped' with itself) can be grouped together and represented by one random number in the range [0, 479001600) (12 factorial).
This makes sense as there are 12! ways to order 12 elements, so each possible random number represents one of those orderings.
We can deconstruct the random number into swaps by doing successive divmod operations - the randomly generated number has an equal chance of being 0 or 1 mod 2; then, after division by 2, it has an equal chance of being 0, 1, or 2 mod 3; and so on.
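
As an illustrative sketch of that deconstruction (not the PR's exact code; `random` and `span` stand in for the surrounding state):

```csharp
// Illustrative sketch: decompose one random number in [0, 12!) into the
// eleven non-trivial swaps covering the first twelve elements.
int r = random.Next(479001600); // 479001600 == 12!
for (int i = 1; i < 12; i++)
{
    int bound = i + 1;          // element i has i + 1 possible targets
    int j = r % bound;          // swap target in [0, i]
    r /= bound;
    (span[i], span[j]) = (span[j], span[i]);
}
```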

For longer spans we create additional groups - the next seven swaps are represented by a random number in the range [0, 253955520), the upper bound of which is 19! / 12!.
Note that the group sizes are chosen to be as large as possible while still fitting into a 32-bit integer. It would be possible to get larger groups by using 64-bit integers, but I have found that this leads to worse performance.
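
A hedged sketch of how those group bounds can be derived (the PR itself hardcodes them in a switch statement, as noted under Issues below):

```csharp
// Sketch: starting at element index `first`, multiply in successive factors
// (first + 1), (first + 2), ... while the product still fits in an Int32.
// For first = 1 this yields (479001600, 11), i.e. 12!; the next group,
// starting at element 12, yields (253955520, 7), i.e. 19! / 12!.
static (int bound, int count) NextGroup(int first)
{
    long bound = 1;
    int count = 0;
    while (bound * (first + count + 1) <= int.MaxValue)
    {
        bound *= first + count + 1;
        count++;
    }
    return ((int)bound, count);
}
```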

The division and modulo operations involved are expensive but this process seems to be about twice as fast as the old implementation. (See benchmark results below)

Changes

I have changed the Random.Shuffle code to use the new method.
This is a value-breaking change - Shuffle will now produce a different but still random ordering. I have updated the tests to reflect this.
I've also added a new test, Shuffle_Array_Fairness, which checks that shuffling isn't obviously unfair, though this test may not be strictly necessary.

Issues

  • C# doesn't have an efficient way to do checked multiplication so I am using a big switch statement instead to calculate the group sizes.
  • Random doesn't seem to have a method to produce random unsigned integers, so I have to use signed integers instead, which makes some of the group sizes smaller and may be affecting performance.

Performance

I have found some optimizations since initially posting the issue, and the performance gains now seem to be about 2x.

| Method           | Mean        | Error     | StdDev    | Ratio |
|------------------|-------------|-----------|-----------|-------|
| Shuffle_10_Old   | 104.85 ns   | 0.701 ns  | 0.656 ns  | 1.00  |
| Shuffle_100_Old  | 940.88 ns   | 5.155 ns  | 4.822 ns  | 1.00  |
| Shuffle_1000_Old | 9,508.29 ns | 62.160 ns | 55.103 ns | 1.00  |
| Shuffle_10_New   | 50.07 ns    | 0.372 ns  | 0.348 ns  | 0.48  |
| Shuffle_100_New  | 347.25 ns   | 2.303 ns  | 2.042 ns  | 0.37  |
| Shuffle_1000_New | 4,452.33 ns | 51.625 ns | 48.290 ns | 0.47  |

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Mar 11, 2023
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

@ghost

ghost commented Apr 3, 2023

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.


@danmoseley
Member

@wainwrightmark are you still working on this one?

@wainwrightmark
Author

Sorry, I sort of forgot about this. I think I was happy with this (apart from the small changes people have suggested which I'll do shortly). I think this does provide a significant benefit and I can't think of any significant improvements. If I do those small changes, could it then be merged? I'm afraid I haven't contributed here before and I don't know what the protocol is / who the decision makers are.

@danmoseley
Member

danmoseley commented Jun 12, 2023

@wainwrightmark no problem at all -- different PRs move at different speeds here. 😄

> If I do those small changes, could it then be merged?

I haven't followed the PR and don't own the area. Speaking generally, we certainly often take changes of this level of complexity to get this kind of perf improvement in potentially perf-sensitive APIs. @stephentoub any take here?

```csharp
switch (n)
{
    case 2:
        return (479001600, 11); // 12 factorial
```
Member

would it be worth adding some debug asserts to calculate and self-document these magic numbers?

@wainwrightmark
Author

wainwrightmark commented Jun 14, 2023

I have done this. I may have gone a bit overboard though - the simplest way to check the threshold numbers is to check that using the next number up results in an overflow exception. How critical is performance with debug symbols?
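
For illustration, an assert along the lines described might look like this hypothetical sketch (recomputing one magic number with checked arithmetic and confirming the next factor would overflow):

```csharp
using System.Diagnostics;

#if DEBUG
// Hypothetical sketch: recompute 12! and confirm that multiplying in a
// thirteenth factor would overflow a signed 32-bit integer.
int product = 1;
for (int factor = 2; factor <= 12; factor++)
{
    product = checked(product * factor);          // throws on overflow
}
Debug.Assert(product == 479001600);               // 479001600 == 12!
Debug.Assert((long)product * 13 > int.MaxValue);  // a 13th factor overflows
#endif
```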

```csharp
private static (int, int) CalculateBound(int n)
{
    int count;
    switch (n)
```
Member

can this switch use fancy new pattern matching syntax to be more compact?

Author

I tried this but I don't think relational patterns allow return statements.

@danmoseley
Copy link
Member

I didn't read the paper you linked -- does this suggest that further improvement is possible for the Next methods, beyond #79790? (which was based on Lemire, which that paper references)

wainwrightmark and others added 4 commits June 13, 2023 10:23
Co-authored-by: Theodore Tsirpanis <teo@tsirpanis.gr>
Co-authored-by: Günther Foidl <gue@korporal.at>
Co-authored-by: Dan Moseley <danmose@microsoft.com>
```csharp
// This does not prove that the shuffle is fair but will at least fail if the shuffle is obviously unfair.
// This is a "Chi Squared Goodness of fit" problem.
// The number of degrees of freedom is (len - 1)*(len - 1).
// The significance is 0.01 so this test should naturally fail for 1 in 100 seeds.
```
Member

that sounds like a flaky test? These tests run thousands of times a day someplace or another. I was using a critical value to give a 10**-8 chance of failure. That was still enough to be basically certain of failure if there was a serious bug.

Author

I've used a fixed seed so this test will always produce the same result no matter how many times you run it. It might falsely fail if the rng or the shuffle implementation changes but if that happens, one can change the seed and 99% of seeds should work.
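
For context, a simplified sketch of the kind of chi-squared check the test performs (the array length, trial count, and critical value here are illustrative, not the test's actual constants):

```csharp
using System;

// Shuffle a small array many times with a fixed seed, count how often each
// value lands at each index, and compare the chi-squared statistic against
// a critical value for (len - 1)^2 degrees of freedom.
const int len = 5;
const int trials = 100_000;
var random = new Random(12345);            // fixed seed: deterministic result
var counts = new int[len, len];

for (int t = 0; t < trials; t++)
{
    int[] values = new int[len];
    for (int i = 0; i < len; i++) values[i] = i;
    random.Shuffle(values);
    for (int pos = 0; pos < len; pos++)
        counts[values[pos], pos]++;
}

double expected = trials / (double)len;
double chiSquared = 0;
for (int v = 0; v < len; v++)
    for (int pos = 0; pos < len; pos++)
    {
        double diff = counts[v, pos] - expected;
        chiSquared += diff * diff / expected;
    }

// (len - 1)^2 = 16 degrees of freedom; the chi-squared critical value at
// significance 0.01 is about 32.0.
Console.WriteLine($"chi-squared = {chiSquared:F2}, fail if > 32.0");
```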

```csharp
{
    int index = chooser.NextIndex();

    // This pattern is faster than tuple deconstruction for large structs
```
Member

is that something the compiler could/should improve?

Author

It was mentioned here; my understanding is that it will be fixed after C# 11.

@danmoseley
Member

Thanks. Area owners @michaelgsharp or @tannergooding will need to sign off on correctness etc.

@stephentoub
Member

stephentoub commented Jul 5, 2023

I pulled down this PR, rebased it on main, built it, and tried it out. I'm not seeing the cited perf gains; in fact on larger inputs it's registering as a regression.

What is the benchmark you're running?

Here's mine:

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Linq;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

public class Tests
{
    [Params(10, 100, 1000)]
    public int Length { get; set; }

    private int[] _values;

    [GlobalSetup]
    public void Setup() => _values = Enumerable.Range(0, Length).ToArray();

    [Benchmark]
    public void Shuffle() => Random.Shared.Shuffle(_values);
}
```
| Method  | Toolchain         | Length | Mean        | Error     | StdDev    | Ratio |
|---------|-------------------|--------|-------------|-----------|-----------|-------|
| Shuffle | \main\corerun.exe | 10     | 75.80 ns    | 0.330 ns  | 0.275 ns  | 1.00  |
| Shuffle | \pr\corerun.exe   | 10     | 63.76 ns    | 0.436 ns  | 0.408 ns  | 0.84  |
| Shuffle | \main\corerun.exe | 100    | 528.79 ns   | 2.568 ns  | 2.402 ns  | 1.00  |
| Shuffle | \pr\corerun.exe   | 100    | 600.31 ns   | 5.574 ns  | 5.214 ns  | 1.14  |
| Shuffle | \main\corerun.exe | 1000   | 5,237.98 ns | 56.743 ns | 50.301 ns | 1.00  |
| Shuffle | \pr\corerun.exe   | 1000   | 5,693.31 ns | 33.543 ns | 31.376 ns | 1.09  |

@wainwrightmark
Author

> I pulled down this PR, rebased it on main, built it, and tried it out. I'm not seeing the cited perf gains; in fact on larger inputs it's registering as a regression.
>
> What is the benchmark you're running?
Is there any chance you could try with [MethodImpl(MethodImplOptions.AggressiveInlining)] on the CalculateBound method? I fear I may have been too eager to remove that.

@stephentoub
Member

Thanks. I tried with AggressiveInlining on CalculateBound, but it didn't change anything.

I suspect what's changed between when you previously tested it and now is that dynamic PGO has been enabled by default. If I turn off dynamic PGO, I get this:

| Method  | Toolchain         | Length | Mean        | Error     | StdDev    | Ratio |
|---------|-------------------|--------|-------------|-----------|-----------|-------|
| Shuffle | \main\corerun.exe | 10     | 90.64 ns    | 0.295 ns  | 0.276 ns  | 1.00  |
| Shuffle | \pr\corerun.exe   | 10     | 64.89 ns    | 0.410 ns  | 0.384 ns  | 0.72  |
| Shuffle | \main\corerun.exe | 100    | 691.90 ns   | 6.231 ns  | 5.828 ns  | 1.00  |
| Shuffle | \pr\corerun.exe   | 100    | 601.46 ns   | 4.492 ns  | 4.202 ns  | 0.87  |
| Shuffle | \main\corerun.exe | 1000   | 7,156.69 ns | 50.182 ns | 46.940 ns | 1.00  |
| Shuffle | \pr\corerun.exe   | 1000   | 6,191.82 ns | 33.938 ns | 30.085 ns | 0.87  |

That suggests the main savings from this change is simply that it was avoiding a virtual call per Next, and dynamic PGO is avoiding that virtual call. But even if it wasn't, it'd be simpler to avoid the virtual call per iteration in other ways, e.g. adding Shuffle to the internal Impl abstraction such that this could be a single virtual dispatch rather than one per item.
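
For illustration, a hedged sketch of that alternative (type and method names here are invented, not the actual System.Random internals):

```csharp
using System;

// Pushing the whole loop behind one virtual Shuffle call means the per-swap
// Next calls bind to a concrete, sealed type that the JIT can devirtualize,
// instead of paying one virtual dispatch per element.
internal abstract class Impl
{
    public abstract int Next(int maxValue);
    public abstract void Shuffle<T>(Span<T> span);
}

internal sealed class XorShiftImpl : Impl
{
    private uint _state = 2463534242;

    public override int Next(int maxValue)
    {
        // xorshift32 step; the modulo is biased, but fine for a sketch.
        _state ^= _state << 13;
        _state ^= _state >> 17;
        _state ^= _state << 5;
        return (int)(_state % (uint)maxValue);
    }

    public override void Shuffle<T>(Span<T> span)
    {
        // One virtual dispatch for the entire shuffle.
        for (int i = 1; i < span.Length; i++)
        {
            int j = Next(i + 1);
            (span[i], span[j]) = (span[j], span[i]);
        }
    }
}
```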

Are you able to produce numbers that show this change has a significant benefit? If yes, great, can you share both the benchmark and the numbers you're seeing. If not, while I appreciate the efforts, it probably makes sense to instead close this.

@wainwrightmark
Author

Aha. This explains a lot. Running your benchmarks (which were basically the same as mine) on the latest version I get:

| Method      | Length | Mean        | Error     | StdDev    | Ratio |
|-------------|--------|-------------|-----------|-----------|-------|
| Shuffle_Old | 10     | 49.02 ns    | 0.574 ns  | 0.536 ns  | 1.00  |
| Shuffle_New | 10     | 44.20 ns    | 0.360 ns  | 0.300 ns  | 0.90  |
| Shuffle_Old | 100    | 315.90 ns   | 3.672 ns  | 3.434 ns  | 1.00  |
| Shuffle_New | 100    | 274.48 ns   | 3.992 ns  | 3.734 ns  | 0.87  |
| Shuffle_Old | 1000   | 2,359.83 ns | 39.669 ns | 40.737 ns | 1.00  |
| Shuffle_New | 1000   | 2,574.29 ns | 28.813 ns | 26.952 ns | 1.09  |

which I think are pretty comparable to your results. And the explanation for the change makes sense to me.

Sorry for the huge waste of time. Feel free to close this. Though it might be worth re-exploring if C# ever gets both efficient checked multiplication and something like Random.NextUInt.

@stephentoub
Member

Not a waste of time. Thanks for your efforts.

@ghost ghost locked as resolved and limited conversation to collaborators Aug 5, 2023