Not an issue, but I am confused ... #6

Open
DRL opened this issue Aug 1, 2017 · 1 comment

DRL commented Aug 1, 2017

Hi AndreasHeger,

Problem:

  • I want to calculate whether certain annotation features (genes, repeats, etc.) are enriched or depleted in a particular subset of contigs in an assembly.

--workspace: BED file of all regions in genome (excluding regions composed of N's)
--segments: BED file of annotations in subset of contigs

contig_1001    21      792     RepeatMasker
contig_1001    27      34      dust
contig_1001    93      159     dust
contig_1001    246     255     dust
contig_1001    266     339     dust
contig_1001    415     422     dust

--annotations: BED file of annotations across the whole genome (same as above, but for the whole genome)

The output I get when running:

gat-run.py --ignore-segment-tracks --segments=segments.bed --annotations=annotations.bed --workspace=workspace.bed --num-samples=100 --log=gat.log --num-threads=8 > gat.out

is

track   annotation        observed  expected      CI95low       CI95high      stddev     fold    l2fold  pvalue      qvalue      track_nsegments  track_size  track_density  annotation_nsegments  annotation_size  annotation_density  overlap_nsegments  overlap_size  overlap_density  percent_overlap_nsegments_track  percent_overlap_size_track  percent_overlap_nsegments_annotation  percent_overlap_size_annotation
merged  ncrnas_predicted  2913      1709.1200     1300.0000     1994.0000     209.0009   1.7040  0.7689  1.0000e-02  1.0000e-02  62983            6935174     6.6911e+00     1025                  163283           1.5754e-01          30                 2913          2.8105e-03       0.0476                           0.0420                      2.9268                                1.7840
merged  gene              389744    170648.2000   163172.0000   177856.0000   5359.9760  2.2839  1.1915  1.0000e-02  1.0000e-02  62983            6935174     6.6911e+00     18574                 37934616         3.6599e+01          278                389744        3.7603e-01       0.4414                           5.6198                      1.4967                                1.0274
merged  tandem            368130    158513.4400   154952.0000   162625.0000   2399.6840  2.3224  1.2156  1.0000e-02  1.0000e-02  62983            6935174     6.6911e+00     47134                 4562430          4.4018e+00          4994               368130        3.5517e-01       7.9291                           5.3082                      10.5953                               8.0687
merged  RepeatMasker      1492404   610641.4800   602042.0000   620429.0000   6353.3404  2.4440  1.2892  1.0000e-02  1.0000e-02  62983            6935174     6.6911e+00     117147                21502336         2.0745e+01          8705               1492404       1.4399e+00       13.8212                          21.5193                     7.4308                                6.9407
merged  dust              3200967   1182955.4000  1172992.0000  1190872.0000  4343.2429  2.7059  1.4361  1.0000e-02  1.0000e-02  62983            6935174     6.6911e+00     382880                14706492         1.4189e+01          63463              3200967       3.0883e+00       100.7621                         46.1555                     16.5752                               21.7657

I am confused:

  • shouldn't percent_overlap_size_track and the related columns be 100% for all rows?

Thank you in advance.

cheers,

dom

@AndreasHeger
Owner

Good question. From memory, I think percent_overlap_size_track is the proportion of nucleotides in 'segments' that overlap annotations within the workspace.

It might well be a bug. Are your segments non-overlapping?
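
A quick way to check for overlapping segments is a sort-and-sweep over each contig. The sketch below runs on the six sample rows quoted in the question and treats coordinates as half-open, as in BED. The helper name `find_overlaps` is made up for illustration and is not part of GAT.

```python
from collections import defaultdict

# Sample rows copied from the segments excerpt in the question:
# (contig, start, end, name), BED-style half-open coordinates.
sample_rows = [
    ("contig_1001", 21, 792, "RepeatMasker"),
    ("contig_1001", 27, 34, "dust"),
    ("contig_1001", 93, 159, "dust"),
    ("contig_1001", 246, 255, "dust"),
    ("contig_1001", 266, 339, "dust"),
    ("contig_1001", 415, 422, "dust"),
]


def find_overlaps(rows):
    """Report intervals that overlap an earlier interval on the same contig.

    Hypothetical helper, not a GAT function. Each interval is compared
    against the earlier interval that reaches furthest to the right, so
    this flags overlapping intervals rather than listing every pair.
    """
    by_contig = defaultdict(list)
    for contig, start, end, name in rows:
        by_contig[contig].append((start, end, name))

    found = []
    for contig, intervals in by_contig.items():
        intervals.sort()
        furthest = intervals[0]
        for interval in intervals[1:]:
            # Half-open intervals overlap when the next start is
            # strictly less than the furthest end seen so far.
            if interval[0] < furthest[1]:
                found.append((contig, furthest, interval))
            if interval[1] > furthest[1]:
                furthest = interval
    return found


overlaps = find_overlaps(sample_rows)
for contig, first, second in overlaps:
    print(contig, first, "overlaps", second)
```

On the excerpt above, every dust interval falls inside the RepeatMasker interval 21-792, so at least in this sample the segments do overlap each other.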

There is also the --ignore-segment-tracks option, which merges all the segments. The 46% might mean that 46% of the nucleotides are in DUST segments, though I would then expect the total to be 100%. I need to go through the code to remember what happens.
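
As a sanity check on that reading, the four percent_overlap columns can be recomputed from the other columns of the dust row in the output above. This is only arithmetic on the reported numbers, assuming each percentage is the overlap count or size divided by the corresponding track or annotation total. It has not been checked against the GAT source.

```python
# Values copied from the "dust" row of the gat-run.py output in the question.
dust = {
    "track_nsegments": 62983,
    "track_size": 6935174,
    "annotation_nsegments": 382880,
    "annotation_size": 14706492,
    "overlap_nsegments": 63463,
    "overlap_size": 3200967,
}

# Percentages as reported in the same row.
reported = {
    "percent_overlap_nsegments_track": 100.7621,
    "percent_overlap_size_track": 46.1555,
    "percent_overlap_nsegments_annotation": 16.5752,
    "percent_overlap_size_annotation": 21.7657,
}

# Assumed definitions (inferred from the numbers, not from the GAT code).
recomputed = {
    "percent_overlap_nsegments_track":
        100.0 * dust["overlap_nsegments"] / dust["track_nsegments"],
    "percent_overlap_size_track":
        100.0 * dust["overlap_size"] / dust["track_size"],
    "percent_overlap_nsegments_annotation":
        100.0 * dust["overlap_nsegments"] / dust["annotation_nsegments"],
    "percent_overlap_size_annotation":
        100.0 * dust["overlap_size"] / dust["annotation_size"],
}

for column, value in recomputed.items():
    print(f"{column}: recomputed {value:.4f}, reported {reported[column]:.4f}")
```

If these definitions hold, percent_overlap_size_track is the share of segment nucleotides that overlap one particular annotation, so it only reaches 100% when every segment base lies inside that annotation. The nsegments figure can exceed 100% because overlap_nsegments (63463) is larger than track_nsegments (62983) in this row.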
