running SNP-dists on large sample set #51

Open
DorothyTamYiLing opened this issue Dec 10, 2022 · 2 comments

@DorothyTamYiLing

Hi Teesmann,

First of all, thanks for writing this piece of software.

I am trying to run SNP-dists on a large sample set (an alignment of 3745 samples, each 4988504 b long). It has been running for more than 24 hours, and I wonder whether that is normal. How long do you think it will take to finish for an input of this size? I have stopped the run for now, as I would like a rough estimate of the run time first.

Thanks,
Dorothy

@kloetzl
Contributor

kloetzl commented Dec 10, 2022

Hi Dorothy,
The runtime of snp-dists scales quadratically with the number of samples. Say c is the time for a single pairwise comparison. snp-dists makes O(n^2) comparisons, so for your sample set the time is roughly 3745^2 * c. If c is 10 ms, that is still about 39 hours! To get a good estimate for c, I recommend you run the analysis on just 37 samples. Since 3745 / 37 is about 100, multiply the resulting time by 100^2 = 10'000 and you get the runtime for the whole dataset.
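The extrapolation above can be sketched in a few lines of Python. This is only a rough illustration under the quadratic-scaling assumption: the function names are made up for this example, and the 14-second subset timing is a hypothetical placeholder, not a measurement.

```python
def hours_from_pair_cost(n_samples: int, seconds_per_pair: float) -> float:
    """Estimated hours if runtime is n_samples^2 * seconds_per_pair."""
    return n_samples ** 2 * seconds_per_pair / 3600


def extrapolate_seconds(subset_seconds: float, subset_n: int, full_n: int) -> float:
    """Scale a measured subset runtime up to the full set, assuming O(n^2) growth."""
    return subset_seconds * (full_n / subset_n) ** 2


# Estimate for 3745 samples at an assumed 10 ms per pairwise comparison.
print(f"{hours_from_pair_cost(3745, 0.010):.1f} hours at 10 ms per pair")

# Hypothetical: suppose the 37-sample test run took 14 seconds.
full_seconds = extrapolate_seconds(14.0, subset_n=37, full_n=3745)
print(f"{full_seconds / 3600:.1f} hours extrapolated from the subset run")
```

Replace the 14.0 with the wall-clock time you actually measure for the 37-sample subset. The estimate ignores fixed costs such as reading the alignment, so treat it as an order-of-magnitude guide.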

If the resulting estimate is far too large, you can compute approximate distances using mash or phylonium.

Hope this helps,
Fabian

@DorothyTamYiLing
Author

Hi Fabian,

Thanks for the useful tips! I will give the calculation a go and maybe try to reduce the sample set too.

Thanks,
Dorothy
