choose_multiple_weighted returns unexpected probability of result #1476

Open
heavycharged opened this issue Jul 23, 2024 · 4 comments
Labels
A-degenerate Algorithm fails on some input X-bug Type: bug report

Comments

@heavycharged

Summary

I am trying to choose N values from a vector with weights. All weights are calculated as score.powf(4.).

What I expected

I expect values to be chosen with frequency proportional to their weights. Here are the weights after the powf function:

value=1 has weight=0.0000409600
value=2 has weight=0.0000000037
value=3 has weight=0.0000000207

As you can see, value 2 has the lowest weight, so I expect it to be chosen least frequently.

What I got

got following results: {2: 100000, 3: 1, 1: 99999}

As you can see, value 2 was chosen 100k times, more than value 3, which was chosen only once. Here are the weights of values 2 and 3:

2: 0.0000000037
3: 0.0000000207

Note

I noticed that this happens with very small floats. I also guess that the initial order matters. Neither of these is mentioned in the documentation, so this is either a docs issue or a bug. I also cannot rule out that I missed something, so maybe I am wrong.

Code sample

use std::collections::HashMap;

use rand::{prelude::SliceRandom, thread_rng};

fn main() {
    let values: Vec<(i32, f64)> =
        vec![(1, 0.080), (2, 0.0078), (3, 0.012)]
            .into_iter()
            .map(|v| (v.0, f64::powf(v.1, 4.0)))
            .collect();

    for v in &values {
        println!("value={} has weight={:.10}", v.0, v.1);
    }

    let mut counts = HashMap::new();

    for _ in 0..100_000 {
        // Draw 2 distinct values, weighted by each entry's second field.
        let winning: Vec<i32> = values
            .choose_multiple_weighted(&mut thread_rng(), 2, |a| a.1)
            .unwrap()
            .map(|v| v.0)
            .collect();

        for winner in winning {
            counts.entry(winner).and_modify(|v| *v += 1).or_insert(1);
        }
    }

    println!("got following results: {counts:?}");

    assert!(counts.get(&1).unwrap() > counts.get(&2).unwrap());
    assert!(counts.get(&1).unwrap() > counts.get(&3).unwrap());
    assert!(counts.get(&3).unwrap() > counts.get(&2).unwrap());
}
@heavycharged heavycharged added the X-bug Type: bug report label Jul 23, 2024
@dhardy
Member

dhardy commented Jul 23, 2024

I can reproduce this. More specifically,

got following results:
(1, 1): 0
(1, 2): 0
(1, 3): 1
(2, 1): 0
(2, 2): 0
(2, 3): 96875
(3, 1): 3124
(3, 2): 0
(3, 3): 0

@dhardy
Member

dhardy commented Jul 23, 2024

The problem is in sample_efraimidis_spirakis (https://doi.org/10.1016/j.ipl.2005.11.003), which calculates key = rng.random::<f64>().powf(1.0 / weight); usually the result rounds to 0 due to the very small weights used.
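
For illustration, here is a minimal standalone sketch (not taken from the library) of that underflow, using the weights from this issue; r is a fixed stand-in for the uniform sample:

fn main() {
    // r stands in for rng.random::<f64>(); any value in (0, 1) shows the effect.
    let r: f64 = 0.5;
    for &w in &[0.0000409600_f64, 0.0000000037, 0.0000000207] {
        // Efraimidis-Spirakis key: r^(1/w). With these tiny weights the
        // exponent 1/w is enormous, so the key underflows to exactly 0.0
        // and the items become indistinguishable.
        let key = r.powf(1.0 / w);
        println!("weight={:.10} -> key={:e}", w, key);
    }
}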

Note that our f64 random samples have limited precision. We have considered (and even implemented) algorithms for high-precision f64 values. This was not adopted since there is performance overhead and usually it's not necessary. Increasing precision could not be expected to solve this particular issue anyway.

More to the point, this algorithm does not appear to be suitable for the small weights used as inputs here. Pre-scaling the weights into a more suitable range may be a solution; however, a key point of the algorithm used is that it requires only a single iteration over the list of weights (which are, in this case, generated via a call to the weight function for each value). Without knowing the largest input weight in advance we would have to iterate twice over the weights (storing or regenerating them). At that point a different API may be better; or we could allow the user to supply the maximum weight (or a weight-scaling function) as a hint.

For this particular example, a partial fix is to scale by the inverse of the largest weight in the weight fn: choose_multiple_weighted(&mut thread_rng(), 2, |a| a.1 / largest_weight). The results are better but still likely wrong since key often rounds to zero for values 2 and 3:

got following results:
(1, 1): 0
(1, 2): 6
(1, 3): 58
(2, 1): 5272
(2, 2): 0
(2, 3): 0
(3, 1): 94664
(3, 2): 0
(3, 3): 0
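
Concretely, a minimal sketch of that pre-scaling workaround applied to the example above (largest_weight is found with an extra pass over the weights, which is exactly the second iteration discussed earlier):

use rand::{prelude::SliceRandom, thread_rng};

fn main() {
    let values: Vec<(i32, f64)> = vec![(1, 0.080), (2, 0.0078), (3, 0.012)]
        .into_iter()
        .map(|(v, s)| (v, s.powf(4.0)))
        .collect();

    // Extra pass: find the largest weight so it can be scaled to 1.0.
    let largest_weight = values.iter().map(|v| v.1).fold(f64::NEG_INFINITY, f64::max);

    // Same call as before, but with weights pre-scaled into a friendlier range.
    let winning: Vec<i32> = values
        .choose_multiple_weighted(&mut thread_rng(), 2, |a| a.1 / largest_weight)
        .unwrap()
        .map(|v| v.0)
        .collect();

    println!("{winning:?}");
}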

@TheIronBorn have you got any suggestions? It seems that the algorithm is just a poor choice for small weights.

@dhardy
Member

dhardy commented Jul 24, 2024

We could keep this algorithm while documenting its limitations.

In this case, the best (or at least easiest) alternative may be to use WeightedIndex, re-sampling in the case that identical samples are returned (assuming that there are enough items available and likely to be selected, this should be reasonably efficient).
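
For illustration, a minimal sketch of that re-sampling approach, assuming rand 0.8's WeightedIndex API (choose_distinct_weighted is a hypothetical helper, not part of rand):

use std::collections::HashSet;

use rand::distributions::{Distribution, WeightedIndex};
use rand::thread_rng;

// Draw indices from a WeightedIndex, discarding duplicates, until `amount`
// distinct indices have been collected.
fn choose_distinct_weighted(weights: &[f64], amount: usize) -> Vec<usize> {
    assert!(amount <= weights.len());
    let dist = WeightedIndex::new(weights).expect("weights must be finite and non-negative");
    let mut rng = thread_rng();
    let mut seen = HashSet::new();
    let mut chosen = Vec::with_capacity(amount);
    while chosen.len() < amount {
        let i = dist.sample(&mut rng);
        if seen.insert(i) {
            chosen.push(i);
        }
    }
    chosen
}

fn main() {
    // The weights from this issue: no powf in the key, so no underflow.
    let weights = [0.0000409600_f64, 0.0000000037, 0.0000000207];
    println!("{:?}", choose_distinct_weighted(&weights, 2));
}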

Choose-multiple-weighted is not a problem I've put much thought into; there are probably other algorithms.

@dhardy dhardy added the A-degenerate Algorithm fails on some input label Jul 26, 2024
@dhardy
Member

dhardy commented Jul 26, 2024

@teryror I see you have experience with this algorithm (#1365). Any thoughts here? Wikipedia suggests using ln(r) / w_i keys instead.
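
For reference, a minimal sketch (not from the thread) comparing the two key formulations on this issue's weights; since ln is monotonic, ranking items by ln(r)/w gives the same order as r^(1/w) but stays representable:

fn main() {
    // r stands in for a uniform sample in (0, 1); each item would normally
    // get its own independent r.
    let r: f64 = 0.5;
    for &w in &[0.0000409600_f64, 0.0000000037, 0.0000000207] {
        let key_pow = r.powf(1.0 / w); // underflows to 0.0 for all three weights
        let key_log = r.ln() / w; // finite and still distinguishes the weights
        println!("w={:.10}: pow_key={:e}, log_key={:e}", w, key_pow, key_log);
    }
}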

@dhardy dhardy mentioned this issue Jul 28, 2024
dhardy added a commit that referenced this issue Aug 12, 2024