`Poisson` variance deviation at high `lambda` #1515

JamboChen · 2024-10-19T10:53:56Z

Summary

Poisson distribution where, at high lambda values, the sample variance deviates noticeably from the mean.

During a Kolmogorov-Smirnov (KS) test, the Poisson distribution fails when lambda approaches MAX_LAMBDA. I conducted a simple test to examine the sample mean and variance, and observed the following results:

lambda: 1e1, mean: 1.00e1, var: 9.98e0
lambda: 1e10, mean: 1.00e10, var: 1.00e10
lambda: 1e12, mean: 1.00e12, var: 1.00e12
lambda: 1e14, mean: 1.00e14, var: 1.11e14
lambda: 1e15, mean: 1.00e15, var: 1.38e15
lambda: 1e16, mean: 1.00e16, var: 8.05e16
lambda: 1e17, mean: 1.00e17, var: 2.42e18
lambda: 1e18, mean: 1.00e18, var: 7.84e19
lambda: 2e19, mean: 1.84e19, var: 5.41e21

Mean: ΣX_i/n
Variance: Σ(X_i - λ)^2/(n-1)

For samples from a Poisson distribution, the sample mean and variance should be very close to lambda.

For reference, I also checked the sample variance of Poisson generators in NumPy and R (I haven't conducted a full goodness-of-fit test yet).

NumPy implementation for lam>=10 is based on the paper (The transformed rejection method for generating Poisson random variables)

lambda = 1e+01, mean = 1.00e+01, std = 1.00e+01
lambda = 1e+10, mean = 1.00e+10, std = 1.00e+10
lambda = 1e+12, mean = 1.00e+12, std = 9.99e+11
lambda = 1e+14, mean = 1.00e+14, std = 1.01e+14
lambda = 1e+15, mean = 1.00e+15, std = 1.04e+15
lambda = 1e+16, mean = 1.00e+16, std = 1.42e+16
lambda = 1e+17, mean = 1.00e+17, std = 1.64e+17
lambda = 1e+18, mean = 1.00e+18, std = 1.52e+18

Algorithm PTRD can be safely used for 10≤ μ ≤10^8. Algorithm PTRS for 10 ≤ μ ≤10^7.

R implementation is based on the paper (Computer generation of Poisson deviates from modified normal distributions)

lambda = 10, mean = 10, std = 10.0191
lambda = 1e+10, mean = 1e+10, std = 9.99373e+09
lambda = 1e+12, mean = 1e+12, std = 1.00086e+12
lambda = 1e+14, mean = 1e+14, std = 9.9847e+13
lambda = 1e+15, mean = 1e+15, std = 1.00205e+15
lambda = 1e+16, mean = 1e+16, std = 1.00097e+16
lambda = 1e+17, mean = 1e+17, std = 9.98917e+16
lambda = 1e+18, mean = 1e+18, std = 1.00145e+18

Code sample

use rand::{rngs::SmallRng, SeedableRng};
use rand_distr::{Distribution, Poisson};
const SAMPLES: usize = 1_000_000;
const MAX_LAMBDA: f64 = Poisson::<f64>::MAX_LAMBDA;

fn poisson_mean_var(lambda: f64) -> (f64, f64) {
    let mut rng = SmallRng::seed_from_u64(0);
    let dist = Poisson::new(lambda).unwrap();
    let samples = (0..SAMPLES)
        .map(|_| dist.sample(&mut rng))
        .collect::<Vec<f64>>();
    if samples.iter().any(|&x| x.is_nan()) {
        panic!("NaN in samples");
    }
    if samples.iter().any(|&x| x.is_infinite()) {
        panic!("Infinite in samples");
    }

    let mean = samples.iter().map(|&x| x / SAMPLES as f64).sum::<f64>();
    let var = samples.iter().map(|&x| (x - lambda).powi(2)).sum::<f64>() / (SAMPLES as f64 - 1.0);

    (mean, var)
}

fn main() {
    let lambdas = [10.0, 1e10, 1e12, 1e14, 1e15, 1e16, 1e17, 1e18, MAX_LAMBDA];
    for lambda in lambdas.iter() {
        let (mean, var) = poisson_mean_var(*lambda);
        println!("|{:.0e}| {:.2e}|{:.2e}|", lambda, mean, var);
    }
}

[dependencies]
rand = { path = "../rand" }
rand_distr = { path = "../rand/rand_distr"}

The text was updated successfully, but these errors were encountered:

JamboChen · 2024-10-20T18:32:13Z

I've experimentally implemented the PD algorithm (the one used by R) and replaced the rejection sampling method. It passed the lambda=1e9 test and significantly improved performance in my computer's "poisson/100" and "poisson/variable" benchmarks.

Additionally, I found that the test parameter 1.844E+19 in the Poisson KS test actually exceeds the i64 range, and the gamma_lr function may get stuck in a long loop with large numbers.

You can find my implementation here: Implementation Link

dhardy · 2024-10-21T08:28:05Z

We cannot accept straight translations from R code which is distributed under the GPLv2. But I assume this is same algorithm published here in Fortran?

1.844E+19 is approx 2^64, and is MAX_LAMBDA (introduced in #1498).

But the values of λ used in the test are 0.1, 1.0, 7.5 then 1e9 (disabled) and 1.844E+19 (disabled). Our implementation uses a different algorithm for λ < 12 and λ ≥ 12, so the latter isn't currently tested at all (and has no tests with "reasonable" parameters).

JamboChen · 2024-10-21T08:56:59Z

We cannot accept straight translations from R code which is distributed under the GPLv2. But I assume this is same algorithm published here in Fortran?

My implementation follows the textual description of the PD algorithm in case A (μ ≥ 10) from the referenced paper (page 11 of the PDF). In fact, the R implementation uses some goto statements, which, even if permissible from a licensing perspective, are difficult to directly translate.

Our implementation uses a different algorithm for λ < 12 and λ ≥ 12, so the latter isn't currently tested at all

Locally, I enabled the 1e9 test, and it passed. However, due to the slow performance of the gamma_lr function (it took 70 seconds to run on my machine), I haven't pushed this part of the changes yet.

Regarding the discrete KS test function, it may need to be generalized to accept a CDF function of the form f(u64) -> f64. Otherwise, most of the samples in the 1.844E+19 test will overflow when converted to i64. I made the corresponding changes locally (but have not pushed them yet), though it seems that the gamma_lr function is getting stuck in an infinite loop.

benjamin-lieser · 2024-10-21T09:02:11Z

I guess is should be fine to test with 1.844E+19 / 2, I do not see a current need for a f(u64) -> f64

dhardy · 2024-10-21T09:27:26Z

Further to this, we have a Poisson impl for Distribution<u64>. We should test this, which seems to be simple enough:

    for (seed, lambda) in parameters.into_iter().enumerate() {
        let dist = Poisson::new(lambda).unwrap();
        test_discrete::<f64, Poisson<f64>, _>(seed as u64, dist, |k| cdf(k, lambda));
        test_discrete::<u64, Poisson<f64>, _>(seed as u64, dist, |k| cdf(k, lambda));
    }

JamboChen added the X-bug Type: bug report label Oct 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`Poisson` variance deviation at high `lambda` #1515

`Poisson` variance deviation at high `lambda` #1515

JamboChen commented Oct 19, 2024 •

edited

Loading

JamboChen commented Oct 20, 2024

dhardy commented Oct 21, 2024

JamboChen commented Oct 21, 2024

benjamin-lieser commented Oct 21, 2024

dhardy commented Oct 21, 2024

Poisson variance deviation at high lambda #1515

Poisson variance deviation at high lambda #1515

Comments

JamboChen commented Oct 19, 2024 • edited Loading

Summary

Code sample

JamboChen commented Oct 20, 2024

dhardy commented Oct 21, 2024

JamboChen commented Oct 21, 2024

benjamin-lieser commented Oct 21, 2024

dhardy commented Oct 21, 2024

`Poisson` variance deviation at high `lambda` #1515

`Poisson` variance deviation at high `lambda` #1515

JamboChen commented Oct 19, 2024 •

edited

Loading