PerformanceWarning message #6

guidebortoli · 2021-10-19T16:22:51Z

Hi,

I am not sure if this is actually an issue that would compromise the analysis, but there is a message at the end of the analysis as follows:



--------------------------------------------------------------------------------
-----------------------------------  Gnomix  -----------------------------------
--------------------------------------------------------------------------------
When using this software, please cite: 
Helgi Hilmarsson, Arvind S Kumar, Richa Rastogi, Carlos D Bustamante, 
Daniel Mas Montserrat, Alexander G Ioannidis: 
"High Resolution Ancestry Deconvolution for Next Generation Genomic Data" 
https://www.biorxiv.org/content/10.1101/2021.09.19.460980v1
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Launching in training mode...
Reading vcf file...
Getting genetic map info...
Getting sample map info...
Building founders...
Splitting sample map...
Running Simulation...
Training...
Reading data...
Building model...
Training base models...
100%|████████████████████████████████████████| 709/709 [01:04<00:00, 10.95it/s]                                      
Training smoother...
Fitting calibrator...
Evaluating model...
Re-training base models...
100%|████████████████████████████████████████| 709/709 [01:20<00:00,  8.83it/s]                                                            
Analyzing model performance...
Estimated train accuracy: 98.42%
Estimated val accuracy: 98.32%
Model, info and analysis saved at chr15/models/model_chm_chr15
--------------------------------------------------------------------------------
Launching inference...
Loading and processing query file...
- Number of SNPs from model: 1087854
- Number of SNPs from file: 30954
- Number of intersecting SNPs: 20036
- Percentage of model SNPs covered by query file: 1.8399999999999999%
Inferring ancestry on query data...
Phasing individual 493/493
Writing phased SNPs to disc...
/Users/debortoli/Downloads/arquivos_compactados/gnomix-main/src/utils.py:316: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`
  df[data_samples[i]] = genotype_person
Saving results...

Specifically, it is the PerformanceWarning

I am getting a really weird result when running gnomix...
I have an admixed sample (Native American, African and European ancestries), with a global ancestry estimate around 70% EUR, 25% AFR and 5% NAM...
The Gnomix .msp output (I've tested only one chromosome) assigns AFR to all windows in all samples, which is not correct...

Not sure if this PerformanceWarning might be the culprit of this...

Thanks,

weekend37 · 2021-10-19T16:41:38Z

Hi Guilherme! This is very strange indeed. I don't expect this to be because of the performance warning as that refers to the speed of that operation. So as long as you weren't bothered by the time that step took, that should be fine.
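For reference, the warning comes from pandas, not from the ancestry model: it fires when a DataFrame is grown one column at a time, as in the `df[data_samples[i]] = genotype_person` line shown in the log. A minimal sketch of the alternative the warning suggests, building all columns first and joining them once with `pd.concat(axis=1)`, is below. The function name `build_genotype_frame` and the toy data are hypothetical and are not taken from the Gnomix source.

```python
import numpy as np
import pandas as pd

def build_genotype_frame(sample_names, genotypes):
    """Assemble one column per sample with a single concat call.

    sample_names: list of column labels (hypothetical stand-in for data_samples).
    genotypes: 2-D array of shape (n_snps, n_samples).
    """
    # Collect every column first instead of inserting into the frame in a loop.
    columns = [
        pd.Series(genotypes[:, i], name=name)
        for i, name in enumerate(sample_names)
    ]
    # One concat along axis=1 avoids the fragmentation the warning describes.
    return pd.concat(columns, axis=1)

# Toy data standing in for phased genotypes (assumption, not real output).
rng = np.random.default_rng(0)
names = [f"sample_{i}" for i in range(500)]
geno = rng.integers(0, 2, size=(1000, 500))
df = build_genotype_frame(names, geno)
```

The resulting frame holds the same values as the column-by-column version; only the construction cost differs, which is why the warning should not affect the ancestry calls.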

Now, what I do think is the source of this issue is the reference panel (or potentially the genetic map file). Could you tell me a bit more about the reference file you're using, or even share it with me?

guidebortoli · 2021-10-19T17:36:53Z

Hi Helgi,

Please, see the zipped file with the gmap and the reference file (which is a phased HGDP reference file for the chr15).

Guilherme

weekend37 · 2021-10-19T17:40:25Z

Great, thanks. I'll take a look when I find time.
Also, feel free to include the query file if possible. Then I can actually recreate your issue.

guidebortoli · 2021-10-19T17:45:49Z

Thanks @weekend37 . I sent you an email with the files.

guidebortoli · 2021-10-21T18:56:04Z

Hi @weekend37 , did you get a chance to take a look at this issue?

Thank you! :)

weekend37 · 2021-10-21T19:22:18Z

Hi Guilherme. Yes, I just finished looking at your files. It seems like you have a large mismatch of SNP positions between your reference and query files. The overlap spans about 2% of the reference file SNPs. Not only that, but it seems as if the segments that are covered are entirely different (see image). The result is a model trying to estimate ancestry of segments it has never seen before.

Please make sure that this is not a mistake. This could for example be different builds like 37 vs 38.
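A mismatch like this can be spotted before training by comparing the two position lists directly. The sketch below is an illustration only: it assumes the SNP positions of one chromosome have already been read into arrays, and the function name `snp_overlap_report` is hypothetical rather than part of Gnomix. It reports both the share of positions in common and the position range each file spans, since either can reveal the problem.

```python
import numpy as np

def snp_overlap_report(model_pos, query_pos):
    """Summarize how well query SNP positions cover the model's positions.

    Both inputs are non-empty sequences of base-pair positions on one chromosome.
    """
    model_pos = np.unique(np.asarray(model_pos))
    query_pos = np.unique(np.asarray(query_pos))
    shared = np.intersect1d(model_pos, query_pos)
    return {
        "n_model": int(model_pos.size),
        "n_query": int(query_pos.size),
        "n_shared": int(shared.size),
        "pct_model_covered": 100.0 * shared.size / model_pos.size,
        "pct_query_used": 100.0 * shared.size / query_pos.size,
        # Very different ranges suggest the files cover different segments.
        "model_range": (int(model_pos.min()), int(model_pos.max())),
        "query_range": (int(query_pos.min()), int(query_pos.max())),
    }

# Toy positions (assumption): 100 model SNPs, 4 query SNPs, 2 in common.
report = snp_overlap_report(np.arange(0, 1000, 10), [5, 10, 20, 2000])

# The counts printed in the log above give the same coverage figure.
pct_from_log = round(100 * 20036 / 1087854, 2)  # 1.84
```

Note that matching positions do not guarantee matching builds or alleles; this only checks coordinates.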

If this is not a mistake and you only have 2% of the reference positions, I would recommend imputing the remaining SNPs in order to get sensible results. Those results will assume that the imputed SNPs are the ground truth and hence may be quite biased, but they will at least be sensible.

-Helgi

guidebortoli · 2021-10-21T19:49:40Z

Thanks @weekend37!
So, I have done the same analysis using XGMix, and the output was slightly better, although it was still overestimating the AFR ancestry...
I also ran rfmix2, which gave the best results so far, both when compared with the global ancestry from the Admixture program and in a quick admixture mapping analysis of a known marker related to the phenotype I am investigating (it is known to have a strong signal due to an allele frequency difference of almost 100% between AFR and EUR)...

I am not sure if the 2% overlap could be due to the fact that my study used array genotype data while the reference is phased WGS data from HGDP...

Regarding the genome build they are both hg38...

Guilherme

weekend37 · 2021-10-21T23:00:51Z

Yep. That makes total sense. The default base model for Gnomix is logistic regression (a linear model). The default XGMix base model and the rfmix equivalent are XGBoost and random forest, respectively; both are decision-tree-based models that handle missing data much more naturally.

However, I bet they're still struggling a little bit, and no matter what you do, your estimates past the ~0.45x10^8 marker will never have any meaning, regardless of the model (almost like using a reference file from a different chromosome than the one for the query file).

Now, if you decide to trim your query file to that marker so that we're at least dealing with the same segments, I can fix the overlap issue with a feature that I've been meaning to add: namely, training the model only on the query SNPs. With both of those things done, you should have no issues.
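The two steps described here, trimming the query at a position cutoff and keeping only the reference SNPs that the query also has, can be sketched with boolean masks. This is a rough illustration under stated assumptions, not the Gnomix feature itself: the function name `restrict_to_shared_region` and the toy positions are hypothetical, and for the data in this thread the cutoff would be roughly 45,000,000 (the ~0.45x10^8 marker).

```python
import numpy as np

def restrict_to_shared_region(ref_pos, query_pos, max_pos):
    """Return boolean masks selecting the SNPs to keep in each file.

    query_keep: query SNPs at or below the cutoff position.
    ref_keep: reference SNPs whose position also appears in the trimmed query.
    """
    ref_pos = np.asarray(ref_pos)
    query_pos = np.asarray(query_pos)
    query_keep = query_pos <= max_pos
    ref_keep = np.isin(ref_pos, query_pos[query_keep])
    return ref_keep, query_keep

# Toy positions (assumption) with a cutoff of 450.
ref_pos = np.array([100, 200, 300, 400, 500])
query_pos = np.array([200, 250, 400, 900])
ref_keep, query_keep = restrict_to_shared_region(ref_pos, query_pos, max_pos=450)

# Positions a model trained on the masked reference would see.
training_positions = ref_pos[ref_keep]
```

Query SNPs that survive the trim but are absent from the reference (position 250 in the toy data) would still be unused by the model.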

I'm positive that I can add that feature soon but that would be a Beta version. Let me know if that would be of interest to you.

weekend37 · 2021-11-06T18:29:23Z

Feature request added in 11.
Marking this as closed.

weekend37 closed this as completed Nov 6, 2021

guidebortoli mentioned this issue Apr 1, 2022

Strange results on 23 and me genotype data and pretrained models #22

Open
