PerformanceWarning message #6
Comments
Hi Guilherme! This is very strange indeed. I don't expect the performance warning to be the cause, since it only refers to the speed of that operation. As long as you weren't bothered by the time that step took, it should be fine. What I do think is the source of this issue is the reference panel (or potentially the genetic map file). Could you tell me a bit more about the reference file you're using, or even share it with me?
Hi Helgi, Please see the zipped file with the gmap and the reference file (which is a phased HGDP reference file for chr15). Guilherme
Great, thanks. I'll take a look when I find time.
Thanks @weekend37. I sent you an email with the files.
Hi @weekend37, did you get a chance to take a look at this issue? Thank you! :)
Hi Guilherme. Yes, I just finished looking at your files. It seems that you have a large mismatch of SNP positions between your reference and query files. The overlap spans about 2% of the reference file SNPs. Not only that, but the segments that are covered seem to be entirely different (see image). The result is a model trying to estimate the ancestry of segments it has never seen before. Please make sure that this is not a mistake. It could, for example, be caused by different builds, like 37 vs 38. If this is not a mistake and you only have 2% of the reference positions, I would recommend imputing the remaining SNPs in order to get sensible results. Those results will assume that the imputed SNPs are the ground truth and hence may be quite biased, but they will at least be sensible. -Helgi
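For anyone who wants to run the same kind of overlap check, here is a minimal sketch in plain Python. It is not part of Gnomix: the function name `overlap_summary` and the toy positions are made up for illustration, and it assumes you have already extracted the SNP positions of one chromosome from each file into lists of integers.

```python
def overlap_summary(ref_positions, query_positions):
    """Summarise how many reference SNP positions also occur in the query.

    Hypothetical helper for illustration; not a Gnomix function.
    """
    ref = set(ref_positions)
    query = set(query_positions)
    shared = ref & query
    return {
        "n_ref": len(ref),
        "n_query": len(query),
        "n_shared": len(shared),
        # Fraction of reference positions that the query also covers.
        "frac_ref_covered": len(shared) / len(ref) if ref else 0.0,
        # Span of the shared positions, or None when nothing overlaps.
        "shared_span": (min(shared), max(shared)) if shared else None,
    }


# Toy positions, made up for illustration only.
ref_pos = [100, 200, 300, 400, 500]
query_pos = [200, 450, 500, 900]
summary = overlap_summary(ref_pos, query_pos)
print(summary)
```

A very low `frac_ref_covered`, or a `shared_span` that covers only part of the chromosome, would point to the kind of mismatch described above.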
Thanks @weekend37! I am not sure if the 2% overlap could be due to the fact that my study used array genotype data while the reference is phased WGS data from HGDP... Regarding the genome build, they are both hg38... Guilherme
Yep, that makes total sense. The default base model for Gnomix is a logistic regression (a linear model). The default base models for XGMix and the rfmix equivalent are XGBoost and Random Forest, respectively, both of which are decision-tree-based models that handle missing data much more naturally. However, I bet they're still struggling a little, and no matter what you do, the estimates past the ~0.45x10^8 marker will never have any meaning, regardless of the model (almost like using a reference file from a different chromosome than the one in the query file). Now, if you decide to trim your query file to that marker so that we're at least dealing with the same segments, I can fix the overlap issue with a feature that I've been meaning to add: training the model only on the query SNPs. With both of those things done, you should have no issues. I'm positive that I can add that feature soon, but it would be a beta version. Let me know if that would be of interest to you.
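The two steps suggested above (trim the query to the covered segment, then keep only the query SNPs in the reference) can be sketched on plain position lists as follows. This is only an illustration under stated assumptions, not the Gnomix feature itself: `trim_query`, `restrict_reference`, and the toy positions are hypothetical, and the cutoff is just the ~0.45x10^8 value mentioned in the comment. Real files would be VCFs and would need a proper VCF tool for the actual filtering.

```python
def trim_query(query_positions, max_position):
    """Keep only query SNP positions at or below max_position (hypothetical helper)."""
    return sorted(p for p in set(query_positions) if p <= max_position)


def restrict_reference(ref_positions, query_positions):
    """Keep only reference SNP positions that also occur in the query (hypothetical helper)."""
    query = set(query_positions)
    return sorted(p for p in set(ref_positions) if p in query)


# Cutoff taken from the "~0.45x10^8" marker mentioned in the thread.
CUTOFF = 45_000_000

# Toy positions, made up for illustration only.
query_pos = [10_000_000, 30_000_000, 44_000_000, 60_000_000]
ref_pos = [10_000_000, 20_000_000, 44_000_000, 50_000_000]

trimmed_query = trim_query(query_pos, CUTOFF)
ref_on_query_snps = restrict_reference(ref_pos, trimmed_query)
print(trimmed_query)
print(ref_on_query_snps)
```

After both steps, the reference and the query describe the same segment and the same SNPs, which is the situation the proposed feature is meant to create.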
Feature request added in 11. |
Hi,
I am not sure if this is actually an issue that would compromise the analysis, but there is a message at the end of the analysis as follows:
Specifically, it is the PerformanceWarning
I am getting a really weird result when running gnomix...
I have an admixed sample (Native American, African and European ancestries), with a global ancestry estimate around 70% EUR, 25% AFR and 5% NAM...
The Gnomix output .msp (I've tested only one chromosome) assigns AFR to all windows in all samples, which is not correct...
Not sure if this PerformanceWarning might be the culprit of this...
Thanks,