Strange results on 23 and me genotype data and pretrained models #22
Comments
Hi Enabieva, The statistic of command:
So, did you fix the error, or can you share any advice? Also, if I want to convert per-marker ancestry to per-sample ancestry, how should I do that? Can I take a vote across the markers to get the result for the sample? Thank you. |
Dear vinhdc10998, |
Hello! I'm getting the same issue with my tests as well. I'm taking 10,000 random samples from UK Biobank and I'm also getting largely AFR ancestry for those random samples, while they should be largely European. Is it a bug or a problem with the labels? I found this in the README:
Thanks a lot! |
Hi all, Briefly, I had the same problem you are all finding. I was using my own reference (HGDP phased samples on the hg38 build) and a query file of Brazilian samples (from my own project) that are known to be admixed (EUR, AFR and NAM). I found that Gnomix was assigning nearly all windows to AFR, which was obviously a problem. I even compared the results with RFMix v2, with the previous version of Gnomix (called Xgmix), and with the global ancestry generated by the ADMIXTURE program (the latter being compatible with the results of RFMix v2). There is an explanation about extrapolation and interpolation in the post that kind of solved the problem for me (I still need to test the other chromosomes to see if it actually worked). I kept in the reference file only the markers present in my query file. That way my reference file contained only markers present in my query file (even though my query file was bigger, in terms of markers, than the reference file). After running Gnomix again, the program tells you which markers were extrapolated (in my case the ones at the boundaries), and I excluded those markers from my subsequent analysis (the loss was minimal compared with the number of markers left). With that in hand I compared again with RFMix v2 and the global ancestry generated by ADMIXTURE, and the results correlated really well, r2=0.9ish for the 3 ancestries (EUR, AFR and NAM)... Hope this helps you guys in some way. Cheers, |
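For anyone who wants to try the marker filtering described above, here is a minimal Python sketch. It is not Gnomix code: the function name `classify_markers` is hypothetical, and it assumes that "extrapolated" means a query position lying outside the range of positions the model or reference covers, while "interpolated" means a position inside that range that is absent from the reference. Positions are plain integers for a single chromosome.

```python
def classify_markers(query_pos, model_pos):
    """Split query positions into shared, interpolated, and extrapolated.

    Assumption (not taken from Gnomix itself): a query position is
    'extrapolated' when it falls outside the [min, max] range of the
    model positions, and 'interpolated' when it falls inside that range
    but is not one of the model positions.
    """
    model_set = set(model_pos)
    if not model_set:
        raise ValueError("model_pos must contain at least one position")
    lo, hi = min(model_set), max(model_set)

    shared, interpolated, extrapolated = [], [], []
    for pos in query_pos:
        if pos in model_set:
            shared.append(pos)          # present in both files
        elif lo < pos < hi:
            interpolated.append(pos)    # inside the covered range
        else:
            extrapolated.append(pos)    # beyond the boundaries
    return shared, interpolated, extrapolated


if __name__ == "__main__":
    # Toy positions, purely illustrative.
    shared, interp, extrap = classify_markers(
        query_pos=[5, 10, 15, 20, 30],
        model_pos=[10, 20, 25],
    )
    print("shared:", shared)
    print("interpolated:", interp)
    print("extrapolated:", extrap)
```

Keeping only the `shared` list in the reference, and dropping the `extrapolated` list from downstream analysis, mirrors the two filtering steps in the comment above.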
Great discussion. In general, when you see large overestimates of African ancestry, it is a sign that your reference SNP encodings (pre-trained models) do not match your sample SNP encodings. For example, your sample might be on a different build (b37 vs. b38) than the one the model was trained on. In any case, when your sample SNPs are defined differently than in the reference model, it will appear to the model that your samples have SNPs with unusual variants that were very rare, or even unobserved, in the training data. Since the ancestry with the most variant diversity is African (a consequence of the out-of-Africa bottleneck), these samples will be assigned African as the most likely match. If you simply cannot determine how your SNP definitions differ from those used in creating the model, you'll have to retrain your own model by obtaining public references (for instance from 1000 Genomes) and merging them with your data. It is this merging step that must be done carefully, because if your sample SNP variants are not matched to the references during the merge (same build, same strandedness), you will again see African ancestry everywhere when inferring ancestry on your samples. |
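One rough way to check for the encoding mismatch described above is to compare alleles at shared sites before running inference. The sketch below is illustrative only: `classify_alleles` and `summarize` are hypothetical helpers, sites are assumed to be biallelic SNPs keyed by `(chromosome, position)` with a `(ref, alt)` tuple as the value, and reading those tuples out of your VCF files is left to whatever parser you already use.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def classify_alleles(query, reference):
    """Compare one (ref, alt) pair from the query against the reference."""
    if query == reference:
        return "match"
    # A/T and C/G SNPs read the same on both strands, so a swap and a
    # strand flip cannot be told apart from the alleles alone.
    if COMPLEMENT.get(query[0]) == query[1] and set(query) == set(reference):
        return "ambiguous"
    if query == reference[::-1]:
        return "swapped"                # ref and alt exchanged
    flipped = tuple(COMPLEMENT.get(a, "N") for a in query)
    if flipped == reference:
        return "strand_flip"            # reported on the opposite strand
    if flipped == reference[::-1]:
        return "strand_flip_swapped"
    return "mismatch"


def summarize(query_sites, reference_sites):
    """Count how query sites relate to the reference sites."""
    counts = Counter()
    for key, alleles in query_sites.items():
        if key not in reference_sites:
            counts["absent_from_reference"] += 1
        else:
            counts[classify_alleles(alleles, reference_sites[key])] += 1
    return counts


if __name__ == "__main__":
    # Toy sites, purely illustrative.
    query = {("22", 100): ("A", "G"), ("22", 200): ("T", "C"),
             ("22", 300): ("A", "T"), ("22", 400): ("C", "T")}
    reference = {("22", 100): ("A", "G"), ("22", 200): ("A", "G"),
                 ("22", 300): ("T", "A")}
    print(summarize(query, reference))
```

As a rule of thumb rather than a guarantee, a large `absent_from_reference` count is what one would expect from a build mismatch, while many `strand_flip` or `swapped` sites point to an allele-encoding problem in the merge.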
Hi, I have followed this thread and used imputed data instead (Michigan Imputation Server, 1000G AMR). It was very helpful. Thank you! |
Any updates on this issue? I am using an hg38-to-hg19 liftover of a 100X whole-genome sequence VCF from Nebula and getting mostly African ancestry for a European individual. The images show chr2 for hg38 with Gnomix, for the hg38-to-hg19 liftover with Gnomix, and the 23andMe chromosome painting. I got these numbers during the run for the hg38 VCF:
I got these numbers during the run for the hg38-to-hg19 liftover VCF:
So many more SNPs are covered after the liftover, and fewer reference variants differ, but the chromosome 2 images still don't reflect the correct ancestry. |
I'm testing Gnomix on 23andMe genotype data using the pretrained models, and the results I get seem completely incorrect. The individual is of Eastern European and Jewish ancestry, but the predictions (a) consist of short segments, and (b) most of the segments, at least on Chr 22, are predicted to be African.
The statistics in pretrained_log file (for Chr 22) are:
Loading and processing query file...
What could be the source of the error?