Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poisson variance deviation at high lambda #1515

Open
JamboChen opened this issue Oct 19, 2024 · 5 comments
Open

Poisson variance deviation at high lambda #1515

JamboChen opened this issue Oct 19, 2024 · 5 comments
Labels
X-bug Type: bug report

Comments

@JamboChen
Copy link
Contributor

JamboChen commented Oct 19, 2024

Summary

Poisson distribution where, at high lambda values, the sample variance deviates noticeably from the mean.

During a Kolmogorov-Smirnov (KS) test, the Poisson distribution fails when lambda approaches MAX_LAMBDA. I conducted a simple test to examine the sample mean and variance, and observed the following results:

lambda: 1e1, mean: 1.00e1, var: 9.98e0
lambda: 1e10, mean: 1.00e10, var: 1.00e10
lambda: 1e12, mean: 1.00e12, var: 1.00e12
lambda: 1e14, mean: 1.00e14, var: 1.11e14
lambda: 1e15, mean: 1.00e15, var: 1.38e15
lambda: 1e16, mean: 1.00e16, var: 8.05e16
lambda: 1e17, mean: 1.00e17, var: 2.42e18
lambda: 1e18, mean: 1.00e18, var: 7.84e19
lambda: 2e19, mean: 1.84e19, var: 5.41e21
  • Mean: ΣX_i/n
  • Variance: Σ(X_i - λ)^2/(n-1)

For samples from a Poisson distribution, the sample mean and variance should be very close to lambda.

For reference, I also checked the sample variance of Poisson generators in NumPy and R (I haven't conducted a full goodness-of-fit test yet).

NumPy implementation for lam>=10 is based on the paper (The transformed rejection method for generating Poisson random variables)

lambda = 1e+01, mean = 1.00e+01, std = 1.00e+01
lambda = 1e+10, mean = 1.00e+10, std = 1.00e+10
lambda = 1e+12, mean = 1.00e+12, std = 9.99e+11
lambda = 1e+14, mean = 1.00e+14, std = 1.01e+14
lambda = 1e+15, mean = 1.00e+15, std = 1.04e+15
lambda = 1e+16, mean = 1.00e+16, std = 1.42e+16
lambda = 1e+17, mean = 1.00e+17, std = 1.64e+17
lambda = 1e+18, mean = 1.00e+18, std = 1.52e+18

Algorithm PTRD can be safely used for 10≤ μ ≤10^8. Algorithm PTRS for 10 ≤ μ ≤10^7.

R implementation is based on the paper (Computer generation of Poisson deviates from modified normal distributions)

lambda = 10, mean = 10, std = 10.0191
lambda = 1e+10, mean = 1e+10, std = 9.99373e+09
lambda = 1e+12, mean = 1e+12, std = 1.00086e+12
lambda = 1e+14, mean = 1e+14, std = 9.9847e+13
lambda = 1e+15, mean = 1e+15, std = 1.00205e+15
lambda = 1e+16, mean = 1e+16, std = 1.00097e+16
lambda = 1e+17, mean = 1e+17, std = 9.98917e+16
lambda = 1e+18, mean = 1e+18, std = 1.00145e+18

Code sample

use rand::{rngs::SmallRng, SeedableRng};
use rand_distr::{Distribution, Poisson};
const SAMPLES: usize = 1_000_000;
const MAX_LAMBDA: f64 = Poisson::<f64>::MAX_LAMBDA;

fn poisson_mean_var(lambda: f64) -> (f64, f64) {
    let mut rng = SmallRng::seed_from_u64(0);
    let dist = Poisson::new(lambda).unwrap();
    let samples = (0..SAMPLES)
        .map(|_| dist.sample(&mut rng))
        .collect::<Vec<f64>>();
    if samples.iter().any(|&x| x.is_nan()) {
        panic!("NaN in samples");
    }
    if samples.iter().any(|&x| x.is_infinite()) {
        panic!("Infinite in samples");
    }

    let mean = samples.iter().map(|&x| x / SAMPLES as f64).sum::<f64>();
    let var = samples.iter().map(|&x| (x - lambda).powi(2)).sum::<f64>() / (SAMPLES as f64 - 1.0);

    (mean, var)
}

fn main() {
    let lambdas = [10.0, 1e10, 1e12, 1e14, 1e15, 1e16, 1e17, 1e18, MAX_LAMBDA];
    for lambda in lambdas.iter() {
        let (mean, var) = poisson_mean_var(*lambda);
        println!("|{:.0e}| {:.2e}|{:.2e}|", lambda, mean, var);
    }
}
[dependencies]
rand = { path = "../rand" }
rand_distr = { path = "../rand/rand_distr"}
@JamboChen JamboChen added the X-bug Type: bug report label Oct 19, 2024
@JamboChen
Copy link
Contributor Author

I've experimentally implemented the PD algorithm (the one used by R) and replaced the rejection sampling method. It passed the lambda=1e9 test and significantly improved performance in my computer's "poisson/100" and "poisson/variable" benchmarks.

Additionally, I found that the test parameter 1.844E+19 in the Poisson KS test actually exceeds the i64 range, and the gamma_lr function may get stuck in a long loop with large numbers.

You can find my implementation here: Implementation Link

@dhardy
Copy link
Member

dhardy commented Oct 21, 2024

We cannot accept straight translations from R code which is distributed under the GPLv2. But I assume this is same algorithm published here in Fortran?

1.844E+19 is approx 2^64, and is MAX_LAMBDA (introduced in #1498).

But the values of λ used in the test are 0.1, 1.0, 7.5 then 1e9 (disabled) and 1.844E+19 (disabled). Our implementation uses a different algorithm for λ < 12 and λ ≥ 12, so the latter isn't currently tested at all (and has no tests with "reasonable" parameters).

@JamboChen
Copy link
Contributor Author

We cannot accept straight translations from R code which is distributed under the GPLv2. But I assume this is same algorithm published here in Fortran?

My implementation follows the textual description of the PD algorithm in case A (μ ≥ 10) from the referenced paper (page 11 of the PDF). In fact, the R implementation uses some goto statements, which, even if permissible from a licensing perspective, are difficult to directly translate.

Our implementation uses a different algorithm for λ < 12 and λ ≥ 12, so the latter isn't currently tested at all

Locally, I enabled the 1e9 test, and it passed. However, due to the slow performance of the gamma_lr function (it took 70 seconds to run on my machine), I haven't pushed this part of the changes yet.

Regarding the discrete KS test function, it may need to be generalized to accept a CDF function of the form f(u64) -> f64. Otherwise, most of the samples in the 1.844E+19 test will overflow when converted to i64. I made the corresponding changes locally (but have not pushed them yet), though it seems that the gamma_lr function is getting stuck in an infinite loop.

@benjamin-lieser
Copy link
Collaborator

I guess is should be fine to test with 1.844E+19 / 2, I do not see a current need for a f(u64) -> f64

@dhardy
Copy link
Member

dhardy commented Oct 21, 2024

Further to this, we have a Poisson impl for Distribution<u64>. We should test this, which seems to be simple enough:

    for (seed, lambda) in parameters.into_iter().enumerate() {
        let dist = Poisson::new(lambda).unwrap();
        test_discrete::<f64, Poisson<f64>, _>(seed as u64, dist, |k| cdf(k, lambda));
        test_discrete::<u64, Poisson<f64>, _>(seed as u64, dist, |k| cdf(k, lambda));
    }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
X-bug Type: bug report
Projects
None yet
Development

No branches or pull requests

3 participants