Feature Request: Option to subset reference SNPs to query file positions before training #11

weekend37 · 2021-11-06T18:28:18Z

Subsetting the reference SNPs to the positions present in the query file before training would improve model performance in most cases and avoid disaster when the query data has not been imputed.
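As a rough illustration of the requested behaviour, here is a minimal sketch of subsetting a reference panel to the query positions with NumPy. It is not gnomix code: the function name `subset_reference_to_query` and the array layout (one row per SNP, one column per haplotype) are assumptions made for the example, and it matches on position only, for a single chromosome. A real implementation would probably also need to compare chromosome and alleles.

```python
import numpy as np


def subset_reference_to_query(ref_pos, ref_genotypes, query_pos):
    """Keep only the reference SNPs whose positions also occur in the query.

    Hypothetical helper, not part of gnomix.
    ref_pos:       1-D array of reference SNP positions, shape (n_snps,)
    ref_genotypes: 2-D array of reference genotypes, shape (n_snps, n_haplotypes)
    query_pos:     1-D array of query SNP positions
    """
    ref_pos = np.asarray(ref_pos)
    # Boolean mask over reference SNPs: True where the position is in the query.
    keep = np.isin(ref_pos, np.asarray(query_pos))
    return ref_pos[keep], np.asarray(ref_genotypes)[keep]


# Toy data: five reference SNPs, four query SNPs, three positions shared.
ref_pos = np.array([100, 250, 300, 475, 900])
ref_gt = np.array([[0, 1], [1, 1], [0, 0], [1, 0], [0, 1]])
query_pos = np.array([250, 475, 900, 1200])

sub_pos, sub_gt = subset_reference_to_query(ref_pos, ref_gt, query_pos)
print(sub_pos)        # [250 475 900]
print(sub_gt.shape)   # (3, 2)
```

Training on `sub_pos` and `sub_gt` instead of the full panel would mean the model only ever sees SNPs that the query file can actually supply.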

candevrivera2021 · 2024-01-31T18:50:13Z

Hi. I am running into this issue when training the model from scratch after filtering the SNPs in the query file, and I am not sure what to do.

After building the model, I get this error when running inference on the query samples.
Any advice is appreciated.

Launching inference...
Loading and processing query file...

Number of SNPs from model: 333651
Number of SNPs from file: 447132
Number of intersecting SNPs: 333651
Percentage of model SNPs covered by query file: 100.0%
Found 24392 (7.3106%) different reference variants. Adjusting...
Inferring ancestry on query data...
/users/cvergara/gnomix/src/utils.py:313: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
df[data_samples[i]] = genotype_person
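
The `PerformanceWarning` at the end of this log is pandas' own advice about building a DataFrame one column at a time; it is a warning rather than an error and does not by itself stop inference. The sketch below shows the pattern the message recommends, collecting the columns first and joining them once with `pd.concat(axis=1)`. It is not the actual code in `src/utils.py`: the function name `build_genotype_frame` and the toy inputs are assumptions, with only the name `data_samples` borrowed from the line shown in the warning.

```python
import numpy as np
import pandas as pd


def build_genotype_frame(data_samples, genotypes):
    """Build a samples-as-columns DataFrame in a single concat.

    Hypothetical helper, not part of gnomix.
    data_samples: list of sample names, length n_samples
    genotypes:    2-D array, shape (n_snps, n_samples)
    """
    genotypes = np.asarray(genotypes)
    # One Series per sample, named after the sample.
    columns = [
        pd.Series(genotypes[:, i], name=name)
        for i, name in enumerate(data_samples)
    ]
    # Join all columns at once instead of assigning df[name] = ... in a loop.
    return pd.concat(columns, axis=1)


data_samples = ["sample_A", "sample_B"]
genotypes = np.array([[0, 1], [1, 1], [2, 0]])
df = build_genotype_frame(data_samples, genotypes)
print(df.shape)  # (3, 2)
```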

weekend37 self-assigned this Nov 6, 2021

weekend37 mentioned this issue Nov 6, 2021

PerformanceWarning message #6

Closed
