GitHub - MikeAxtell/CleaveLand4: CleaveLand4: Analysis of degradome data to find sliced miRNA and siRNA targets

LICENSE
CleaveLand4.pl

This program is free software: you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your
option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.

You should have received a copy of the GNU General Public License along
with this program. If not, see <http://www.gnu.org/licenses/>.

SYNOPSIS
CleaveLand4 : Finding evidence of sliced targets of small RNAs from
degradome data

AUTHOR
Michael J. Axtell, Penn State University, mja18@psu.edu

VERSION
4.5 : July 11, 2018

INSTALL
Dependencies - Required Perl Modules
Getopt::Std
Math::CDF

CleaveLand4.pl is a perl program, so it needs perl installed on your
system. It was developed on perl version 5.10.0, and hasn't been tested
on other versions (but there is no reason to suspect problems with other
perl 5.x versions). CleaveLand4.pl will not compile unless the above two
Perl modules are also installed in your perl's @INC. Getopt::Std is
pre-loaded into most (all?) Perl distros. But you may need to install
Math::CDF from CPAN. Only one Math::CDF function is used by CleaveLand4
('pbinom').

Dependencies - PATH executables
bowtie (version 0.12.x or 1.x)
bowtie-build
RNAplex (from Vienna RNA package)
GSTAr.pl (Version 1.0 or higher -- distributed with CleaveLand4)
R
samtools

All of the above must be executable from your PATH. Depending on the
mode of the CleaveLand4 run (see below), only a subset of these programs
may be required for a given run.

Installation
Except for the above dependencies, there is no "real" installation. If
the script is in your working directory, you can call it with

./CleaveLand4.pl

For convenience, you can add it to your PATH. e.g.

sudo mv CleaveLand4.pl /usr/bin/

GSTAr.pl expects to find perl in /usr/bin/perl .. if not, edit line 1
(the hashbang) accordingly.

USAGE
CleaveLand4.pl [options] > [out.txt]

Log and progress information goes to STDERR, and can be suppressed with
option -q (quiet mode).

Options
-h Print help message and quit

-v Print version and quit

-q Quiet mode .. no log/progress information to STDERR

-a Sort small RNA / transcript alignments by Allen et al. score* instead
of default MFEratio -- for GSTAr

-t Output in tabular format instead of the default verbose format

-r [float >0..1] Minimum Free Energy Ratio cutoff. Default: 0.65 -- for
GSTAr

-o [string] : Produce T-plots in the directory indicated by the string.
If the dir does not exist, it will be created

-d [string] : Path to degradome density file.

-e [string] : Path to FASTA-formatted degradome reads.

-g [string] : Path to GSTAr-created tabular formatted query-transcript
alignments.

-u [string] : Path to FASTA-formatted small RNA queries

-n [string] : Path to FASTA-formatted transcriptome

-p [float >0..1] : p-value for reporting. Default is 1 (no p-value
filtering).

-c [integer 0..4] : Maximum category for reporting. Default is 4 (all
categories reported).

*Allen et al. score: This is a score based on the position-specific
penalties described by Allen et al. (2005) Cell, 121:207-221 [PMID:
15851028]. Specifically, mismatched query bases and target-bulged bases
are penalized 1, and G-U wobbles are penalized 0.5. These penalties are
doubled within positions 2-13 of the query.
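
The penalty scheme above can be sketched in a few lines of Python. This is only an illustration, not GSTAr's code: the input representation (a list of (query_position, kind) pairs) is hypothetical, and it assumes that a target-bulged base is assigned to a neighbouring query position.

```python
# Hypothetical representation: each alignment defect is (query_position, kind),
# where kind is "mismatch", "bulge" (target-bulged base) or "wobble" (G-U pair).
BASE_PENALTY = {"mismatch": 1.0, "bulge": 1.0, "wobble": 0.5}


def allen_score(defects):
    """Sum the penalties, doubling any that fall at query positions 2-13."""
    score = 0.0
    for position, kind in defects:
        penalty = BASE_PENALTY[kind]
        if 2 <= position <= 13:
            penalty *= 2
        score += penalty
    return score


example = [(1, "mismatch"), (7, "wobble"), (15, "bulge")]
print(allen_score(example))  # 1 + (0.5 * 2) + 1 = 3.0
```

Lower scores indicate better alignments, which is why option -a sorts in ascending order.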

Modes
CleaveLand4 runs in one of four different modes. Each mode has a
required set of options and a disallowed set of options, as described
below:

Mode 1: Align degradome data and create degradome density file, perform
new small RNA query/transcriptome alignment with GSTAr, and analyze.
Required options: -e, -u, -n. Disallowed options: -d, -g.

Mode 2: Use existing degradome density file, perform new small RNA
query/transcriptome alignment with GSTAr, and analyze. Required options:
-d, -u, -n. Disallowed options: -e, -g.

Mode 3: Align degradome data and create degradome density file, use
existing GSTAr alignments, and analyze. Required options: -e, -n, -g.
Disallowed options: -d, -u.

Mode 4: Use existing degradome density file and existing GSTAr
alignments, and analyze. Required options: -d, -g. Disallowed options:
-e, -u.
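
As a quick reference, the required/disallowed rules above can be expressed as a lookup. The function name is hypothetical and this is only a sketch of the rules as written here, not the option handling in CleaveLand4.pl itself.

```python
# (required options, disallowed options) for each mode, as listed above.
MODES = {
    1: ({"e", "u", "n"}, {"d", "g"}),
    2: ({"d", "u", "n"}, {"e", "g"}),
    3: ({"e", "n", "g"}, {"d", "u"}),
    4: ({"d", "g"}, {"e", "u"}),
}


def cleaveland_mode(options):
    """Return the mode (1-4) implied by a collection of option letters."""
    opts = set(options)
    for mode, (required, disallowed) in MODES.items():
        if required <= opts and not (disallowed & opts):
            return mode
    raise ValueError("option combination does not match any mode: %s" % sorted(opts))


print(cleaveland_mode("eun"))  # 1
print(cleaveland_mode("dg"))   # 4
```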

METHODS
Degradome data --> transcriptome alignments --> degradome density file creation (modes 1 and 3)
Degradome data is aligned to the reference transcriptome using bowtie.
If needed, the bowtie indices for the transcript are built with
bowtie-build using default parameters. This results in the creation of
six files, each including "ebwt" in their suffix. Alignments may contain
zero or one mismatch, and only alignments to the forward strand of the
transcriptome are allowed. When a read has multiple valid alignments,
only one is randomly selected and reported. (The specific bowtie command used is
"bowtie -f -v 1 --best -k 1 --norc -S"). The alignment process uses
samtools to generate a sorted BAM alignment file from the bowtie output
stream.

After creation of the sorted BAM alignment file, the alignments are
parsed to quantify the density of observed 5' ends at each nt of the
transcriptome. The results are written to a 'degradome density' file in
the working directory. The BAM alignment file is deleted at the
completion of this process. The degradome density files contain the
position of the transcript, the number of 5' ends at that position, and
the degradome peak 'category'. Categories are determined as follows:

Category 4: Just one read at that position

Category 3: >1 read, but below or equal to the average* depth of
coverage on the transcript

Category 2: >1 read, above the average* depth, but not the maximum on
the transcript

Category 1: >1 read, equal to the maximum on the transcript, when there
is >1 position at maximum value

Category 0: >1 read, equal to the maximum on the transcript, when there
is just 1 position at the maximum value

* Note that the average does not include all of the 'zeroes' for
non-occupied positions within a transcript. Instead, it is the average
of all positions that have at least one read.
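
The category rules can be sketched as follows for a single transcript. This is an illustration rather than the CleaveLand4 code; in particular, the order in which the conditions are tested (maximum before average) is an assumption, and the function name and example counts are made up.

```python
def assign_categories(counts):
    """Map {position: number_of_5prime_ends} to {position: category}."""
    occupied = {pos: n for pos, n in counts.items() if n > 0}
    if not occupied:
        return {}
    # The average ignores unoccupied positions, as noted above.
    mean_depth = sum(occupied.values()) / len(occupied)
    max_depth = max(occupied.values())
    n_at_max = sum(1 for n in occupied.values() if n == max_depth)

    categories = {}
    for pos, n in occupied.items():
        if n == 1:
            categories[pos] = 4
        elif n == max_depth:
            categories[pos] = 0 if n_at_max == 1 else 1
        elif n > mean_depth:
            categories[pos] = 2
        else:
            categories[pos] = 3
    return categories


# Made-up counts: mean depth of occupied positions is (1 + 2 + 6 + 11) / 4 = 5
print(assign_categories({10: 1, 25: 2, 40: 6, 77: 11}))
# {10: 4, 25: 3, 40: 2, 77: 0}
```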

Small RNA --> transcriptome alignments with GSTAr (modes 1 and 2)
Potential target sites are generated with GSTAr.pl, which ships with
CleaveLand4. Options -r and -a are passed to GSTAr. By default,
potential target sites are sorted in descending order by MFE ratio. If
the -a switch is present, this is changed to ascending order based on
Allen et al. score. Note that GSTAr can be expected to take 90-120
seconds per query when analyzing a typically sized transcriptome.
GSTAr.pl is always called to output in tabular format by CleaveLand4.pl.
Only alignments to the reverse-comp strand of the transcriptome are
considered. Upon completion, a GSTAr alignment file is written to the
working directory. See the GSTAr documentation for more details on this
program.

Analysis (all modes)
After loading valid degradome density and GSTAr alignment files,
CleaveLand4 first checks to ensure that the transcriptome (as noted in
the headers) is the same. If so, analysis progresses. For each alignment
in the GSTAr alignment file, CleaveLand4 searches the degradome density
file to see if there are any degradome reads at the predicted slicing
site. If there are, a p-value is calculated. The p-value takes into
account both the noise in the degradome density file and the quality of
the small RNA-transcriptome alignment. First, the probability of observing a
degradome 'peak' of the given category by random chance is calculated.
The chance is the total number of peaks of the given category divided by
the effective transcriptome size*. Then, the quality of the alignment is
simply the rank of the alignment in the GSTAr alignment file (which is
either sorted by MFE ratio [default] or by Allen et al. score). The
p-value is calculated as the binomial probability of observing one or
more 'hits' in 'x' trials given probability 'c', where 'x' is the rank
of the alignment, and 'c' is the chance described above.

* The effective transcriptome size is the total number of bases in the
transcriptome - (n * mean_read_size), where 'n' is the number of
transcripts. This adjustment accounts for the fact that the very ends of
the transcripts could not possibly have any mapped 5' ends.
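
The calculation described above can be sketched as follows. The probability of one or more hits in 'x' independent trials with per-trial probability 'c' is 1 - (1 - c)^x; the sketch uses that closed form, whereas CleaveLand4 itself relies on Math::CDF's pbinom, so treat this as an equivalent illustration rather than the program's code. The function names are hypothetical. The example numbers come from the sample degradome density header shown later in this document, except for the transcript count of 33602, which is simply the value implied by that header's totals.

```python
import math


def effective_transcriptome_size(total_bases, n_transcripts, mean_read_size):
    """Total bases minus (number of transcripts * mean degradome read size)."""
    return total_bases - n_transcripts * mean_read_size


def degradome_pvalue(rank, peaks_in_category, effective_size):
    """P(one or more hits in `rank` trials) with c = peaks / effective size."""
    c = peaks_in_category / effective_size
    if c >= 1:
        return 1.0
    # 1 - (1 - c) ** rank, written to stay accurate when c is very small.
    return -math.expm1(rank * math.log1p(-c))


size = effective_transcriptome_size(51074197, 33602, 20)
print(size)  # 50402157
# Category 0 peak (16430 such peaks) for the 1st- and 10th-ranked alignments:
print(degradome_pvalue(1, 16430, size))
print(degradome_pvalue(10, 16430, size))
```

Because 'x' is the rank of the alignment, a lower-ranked alignment at a peak of the same category always receives a larger (less significant) p-value.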

Any hits with a p-value <= the cutoff specified by option -p AND a
category <= the cutoff specified by option -c are output to STDOUT. By
default, all hits are reported (option -p default is 1, option -c
default is 4).

INPUT FILE FORMAT REQUIREMENTS
Newlines
All files are assumed to have "\n" as newline characters. Files with
MS-DOS text encoding, or others, that do not conform to this assumption
will cause unexpected behavior and likely meaningless results.

Transcriptome (option -n)
This must be a multiline FASTA file. The headers should be short and
simple and devoid of whitespace (e.g. ">AT1G12345" is good, ">AT1G12345
| this is my favorite gene | it is awesome" is not). The filename of the
transcriptome file should also be devoid of whitespace.

Degradome reads (option -e)
This must be a multiline FASTA file. The reads are assumed to have
already been clipped to remove adapters. Furthermore, the reads must not have
been collapsed in any way. In other words, each read off the sequencer
should have an entry. Sequences that appear 50 times in the raw data
from 50 different reads should each have a line.

Finally, CleaveLand4 assumes that each degradome read represents the
5'-3' sequence of a transcript, and that the first nt of each read
represents the 5' end of an RNA.

Small RNA Queries (option -u)
This must be a multiline FASTA file with the full sequence of a given
small RNA on one line (e.g. each line is either a header beginning with
">" or the full-length sequence of the small RNA). The headers should be
short and simple and devoid of whitespace (e.g. ">ath-miR169a" is good,
">ath-miR169a MIMAT0000200 Arabidopsis thaliana miR169a" is not).
Sequences can have either T's or U's.

Note: miRBase often has several mature miRNAs with exactly the same
sequence, reflecting paralogous miRNA genes within a species. There is
no use querying the same sequence multiple times, so it is a good idea
to collapse the redundancy by query sequence when making a file of small
RNA queries.
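
One possible way to collapse a query file by sequence is sketched below. It is not part of CleaveLand4. It assumes one sequence line per entry (as required above), treats T and U as equivalent when comparing sequences, and keeps the first header seen for each distinct sequence. The query names and sequences in the example are made up.

```python
def collapse_queries(fasta_text):
    """Return FASTA text with one entry per distinct sequence."""
    seen = {}  # normalized sequence -> (header, original sequence)
    header = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
        elif header is not None:
            key = line.upper().replace("U", "T")
            seen.setdefault(key, (header, line))  # first header wins
            header = None
    return "".join(">%s\n%s\n" % (name, seq) for name, seq in seen.values())


queries = (
    ">q1\nUGAGCCAAGGAUGACUUGCCG\n"
    ">q2\nTGAGCCAAGGATGACTTGCCG\n"  # same sequence as q1, written with T's
    ">q3\nCAGCCAAGGAUGACUUGCCGA\n"
)
print(collapse_queries(queries), end="")
```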

Degradome density files (option -d)
Most of the time, these will be files created by previous runs of
CleaveLand4 that will have the suffix "_dd.txt". If you don't like the
alignment parameters that CleaveLand4 uses, you could create your own
degradome density files. The format specification is:

Header region: Lines begin with "#". Here is an example:

[line1]# CleaveLand4 degradome density

[line2]# Fri Sep 13 09:21:38 EDT 2013

[line3]# Degradome Reads:GSM278335.fasta

[line4]# Transcriptome:TAIR10_cdna_20110103_rgmupdated_cleand.fasta

[line5]# TranscriptomeCharacters:51074197

[line6]# Mean Degradome Read Size:20

[line7]# Estimated effective Transcriptome Size:50402157

[line8]# Category 0:16430

[line9]# Category 1:3456

[line10]# Category 2:95747

[line11]# Category 3:207062

[line12]# Category 4:78279

CleaveLand4 demands that the first line of a valid degradome density
file is "# CleaveLand4 degradome density". It also requires all other
header lines, except the date on line 2, to be present. All of this
information is required for analysis.

Data Region: Each transcript begins with two lines as follows:

[line1]@ID:AT1G50920.1

[line2]@LN:2394

The @ID: gives the name of the transcript, while the @LN: gives the
length of the transcript. After this, each line gives a one-based
position, the number of 5' ends at that position, and the degradome
category. The data lines are tab-delimited. Note that positions with zero
reads are NOT shown.
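
For readers who want to inspect or generate these files, a minimal reader for the format described above might look like this. It is a sketch under the stated format only, not the parser used by CleaveLand4; the function name and returned structure are hypothetical, and the two data lines in the example are invented for illustration.

```python
HEADER_KEYS = {
    "Degradome Reads",
    "Transcriptome",
    "TranscriptomeCharacters",
    "Mean Degradome Read Size",
    "Estimated effective Transcriptome Size",
}


def parse_degradome_density(lines):
    """Return (header, transcripts) from an iterable of text lines."""
    header, transcripts, current = {}, {}, None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            key, sep, value = line[1:].strip().partition(":")
            # Only keep recognized "key:value" lines; the date line is skipped.
            if sep and (key in HEADER_KEYS or key.startswith("Category ")):
                header[key] = value.strip()
        elif line.startswith("@ID:"):
            current = {"length": None, "positions": {}}
            transcripts[line[4:]] = current
        elif line.startswith("@LN:"):
            current["length"] = int(line[4:])
        else:
            position, count, category = line.split("\t")
            current["positions"][int(position)] = (int(count), int(category))
    return header, transcripts


sample = [
    "# CleaveLand4 degradome density\n",
    "# Fri Sep 13 09:21:38 EDT 2013\n",
    "# Category 0:16430\n",
    "@ID:AT1G50920.1\n",
    "@LN:2394\n",
    "118\t1\t4\n",  # invented data line: position, 5' ends, category
    "733\t9\t0\n",  # invented data line
]
header, transcripts = parse_degradome_density(sample)
print(header)
print(transcripts["AT1G50920.1"]["positions"][733])  # (9, 0)
```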

GSTAr query-transcriptome alignments (option -g)
These are files created by GSTAr. If they were created as part of a
CleaveLand4 run, they will have the suffix "_GSTAr.txt". They must be in
the 'tabular' format, and have a proper header as shown below:

[line1]# GSTAr version 1.0

[line2]# Thu Sep 12 13:56:58 EDT 2013

[line3]# Queries: test_mir.fasta

[line4]# Transcripts: TAIR10_cdna_20110103_rgmupdated_cleand.fasta

[line5]# Hit seed length required to initiate RNAplex analysis (option
-s): 7

[line6]# Minimum Free Energy Ratio cutoff (option -r): 0.65

[line7]# Sorted by: MFEratio

[line8]# Output Format: Tabular

It is strongly recommended NOT to try to produce these files by means
not involving CleaveLand4/GSTAr.

OUTPUT
Pretty format
By default, CleaveLand4 prints hits that pass the p-value and category
filters to STDOUT in a human-readable, verbose format that is
self-explanatory. A header (lines beginning with "#") is printed giving
basic information on the analysis.

Tabular format
If option -t is specified, any hits passing the p-value filter are
printed in a tab-delimited format. First, a header (lines beginning with
"#") is printed giving basic information on the analysis. After that, a
line giving the names of the columns is printed. Each subsequent line
gives information on a single hit. The format is similar to that of the
GSTAr alignments. Column information is:

1: SiteID: A unique name (within the scope of the output of a particular
run) for the putative slicing site. In the form
"[transcript]:[slice_site]". The output is sorted by SiteID.

2: Query: Name of query

3: Transcript: Name of transcript

4: TStart: One-based start position of the alignment within the
transcript

5: TStop: One-based stop position of the alignment within the transcript

6: TSLice: One-based position of the alignment opposite position 10 of
the query

7: MFEperfect: Minimum free energy of a perfectly matched site
(approximate)

8: MFEsite: Minimum free energy of the alignment in question

9: MFEratio: MFEsite / MFEperfect

10: AllenScore: Penalty score calculated per Allen et al. (2005) Cell,
121:207-221 [PMID: 15851028].

11: Paired: String representing paired positions in the query and
transcript. The format is Query5'-Query3',Transcript3'-Transcript5'.
Positions are one-based. Discrete blocks of pairing are separated by ;

12: Unpaired: String representing unpaired positions in the query and
transcript. The format is
Query5'-Query3',Transcript3'-Transcript5'[code]. Possible codes are
"UP5" (Unpaired region at 5' end of query), "UP3" (Unpaired region at 3'
end of query), "SIL" (symmetric internal loop), "AILt" (asymmetric
internal loop with more unpaired nts on the transcript side), "AILq"
(asymmetric internal loop with more unpaired nts on the query side),
"BULt" (Bulged on the transcript side), or "BULq" (bulged on the query
side). Positions are one-based. Discrete blocks of pairing are separated
by ;

13: Structure: Aligned secondary structure. The region before the "&"
represents the transcript, 5'-3', while the region after the "&"
represents the query, 5'-3'. "(" represents a transcript base that is
paired, ")" represents a query base that is paired, "." represents an
unpaired base, and "-" represents a gap inserted to facilitate
alignment.

14: Sequence: Aligned sequence. The region before the "&" represents the
transcript, 5'-3', while the region after the "&" represents the query,
5'-3'.

15: DegradomeCategory:

Category 4: Just one read at that position

Category 3: >1 read, but below or equal to the average* depth of coverage on the transcript

Category 2: >1 read, above the average* depth, but not the maximum on the transcript

Category 1: >1 read, equal to the maximum on the transcript, when there is >1 position at maximum value

Category 0: >1 read, equal to the maximum on the transcript, when there is just 1 position at the maximum value

16: DegradomePval: p-value for this degradome hit.

17: Tplot_file_path: File path for the T-plot of this hit.

* Note that the average does not include all of the 'zeroes' for
non-occupied positions within a transcript. Instead, it is the average
of all positions that have at least one read.

T-plots

If the user requests them by including the -o option, T-plots for each
hit that passes the p-value cutoff are created and written to the
directory specified by option -o. The black line on the plot shows all
of the degradome data, and the red dot shows the putative slicing site.
The title of each T-plot indicates the transcript ("T="), the query
("Q="), and the putative slicing site ("S="), as well as the category
and p-value.

Existing T-plot files in the -o directory with the same name will be
over-written without warning.
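
For downstream filtering, the tabular output (option -t) described in this section can be read with a few lines of Python. This sketch assumes that the column-name line uses exactly the names listed above (e.g. "SiteID", "DegradomeCategory", "DegradomePval"); check the header of your own output before relying on it. The function name and the example rows are made up, and the rows show only a subset of the columns.

```python
import csv


def read_tabular_hits(lines):
    """Yield one dict per hit, skipping blank lines and '#' header lines."""
    body = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    yield from csv.DictReader(body, delimiter="\t")


example = [
    "# CleaveLand4 tabular output (invented example)\n",
    "SiteID\tQuery\tDegradomeCategory\tDegradomePval\n",
    "tx1:100\tq1\t0\t0.001\n",
    "tx2:50\tq1\t4\t0.8\n",
]
hits = list(read_tabular_hits(example))
strong = [
    h for h in hits
    if int(h["DegradomeCategory"]) <= 2 and float(h["DegradomePval"]) <= 0.05
]
print([h["SiteID"] for h in strong])  # ['tx1:100']
```

With a real run, the same function can be given an open file handle, e.g. `read_tabular_hits(open("out.txt"))`.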

WARNINGS
Don't believe the hype - part 1
Under default settings, CleaveLand4 reports ALL putative slicing sites
with ANY degradome reads at all, regardless of the likelihood that a
given hit is due to random chance. Without any filtering, most of your
hits probably ARE due to random chance. This means that there will be
a great many hits of Category 4 (just one read) and/or at very high
p-values. Exercise skepticism when interpreting these results. More
confidence in the reality of a given slicing event can come from
restricting analysis to hits with low p-values and/or strong categories
(that is, low category numbers), and, even better, observing the slicing
event in multiple degradome libraries.

Don't believe the hype - part 2
The p-value calculation is built around the ASSUMPTION that the rank
order of alignments for a given query reflects their likelihood of being
functional. Under default settings, GSTAr will sort the alignments for
each query based on descending MFE ratio. Alternatively, when option -a
is specified for the GSTAr run, they will be sorted in ascending order
according to the Allen et al. score. The extent to which the p-values are
trustworthy is dictated directly by the extent to which you believe that
MFEratio or Allen et al. scores are predictive of function. If you don't
believe that, you should treat the p-values with due skepticism, and
focus instead on hits in strong categories (low category numbers) and on
reproducibility between libraries.

Not for whole genomes
Degradome alignment by CleaveLand4 only searches the top strand of the
transcriptome. Also, GSTAr holds the entire contents of the
transcripts.fasta file in memory to speed the isolation of
sub-sequences, and CleaveLand4 holds the entire contents of the
degradome density file in memory. This will be impractical in terms of
memory usage if a user attempts to load a whole genome. Similarly, GSTAr
will only search for pairing between the top strand of the
transcripts.fasta file, making it also impractical for a genome
analysis, where sites might be on either strand.

Temp files
CleaveLand4 writes several temp files during the course of a run. So,
don't mess with them during a run. In addition, it is a very bad idea to
have two CleaveLand4 runs operating concurrently from the same working
directory. CleaveLand4 will clean up all temp files at the conclusion of
a run.

Not too fast in modes 1 and 2 (and maybe 3).
GSTAr is a very fast intermolecular RNA-RNA hybridization calculator.
But when applied to whole transcriptomes, it is still very
time-consuming. When running in modes 1 or 2, plan on about 90-120
seconds per query during the GSTAr phase. Additionally, bowtie
alignments and index building (modes 1 or 3) can also be time-consuming.
Finally, requesting T-plots can also slow things down, especially if a
great number of hits are being returned.

No ambiguity codes
Query sequences with characters other than A, T, U, C, or G
(case-insensitive) will not be analyzed, and a warning will be sent to
the user. Transcript sub-sequences for potential query alignments will
be *silently* ignored if they contain any characters other than A, T, U,
C, or G (case-insensitive).

Small queries
GSTAr demands that query sequences be small (15-26 nts). Queries
that don't meet these size requirements will not be analyzed, and a
warning will be sent to the user.
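
To pre-screen a query file before a run, the size rule here and the character rule under "No ambiguity codes" can be combined into a single check. This is a hypothetical helper for illustration, not code taken from GSTAr.

```python
import re

# 15-26 characters, each one of A, T, U, C, or G (case-insensitive).
VALID_QUERY = re.compile(r"[ATUCG]{15,26}", re.IGNORECASE)


def query_is_analyzable(sequence):
    """True if the sequence meets the character and length rules above."""
    return VALID_QUERY.fullmatch(sequence) is not None


print(query_is_analyzable("UGAGCCAAGGAUGACUUGCCG"))  # True (21 nts)
print(query_is_analyzable("UGAGCCAAGGNUGACUUGCCG"))  # False (contains N)
print(query_is_analyzable("ACGUACGUACGUAC"))         # False (14 nts)
```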

No redundancy
For a GIVEN QUERY, GSTAr alignments are non-redundant in terms of the
slicing site of the alignment. However, a single query can have multiple
overlapping alignment patterns that have differing predicted slicing
sites. In addition, if multiple queries are similar in sequence, the
same alignment (in terms of putative slicing site) can be present
multiple times for different queries. If CleaveLand4 identifies more
than one alignment at the same putative slicing site, it will only
report the one with the best (lowest) p-value (subject to the maximum
allowed p-value, option -p). Therefore there should be no redundancy in
the putative slice sites returned by a given CleaveLand4 run.
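
The selection rule described above amounts to keeping, for each putative slicing site, the hit with the lowest p-value. A minimal sketch, assuming each hit is a dict with hypothetical keys "SiteID" and "pval" (the example hits are invented, and how CleaveLand4 breaks ties is not specified here):

```python
def best_hit_per_site(hits):
    """Keep only the lowest-p-value hit for each SiteID."""
    best = {}
    for hit in hits:
        site = hit["SiteID"]
        # On ties, the hit seen first is kept (an arbitrary choice).
        if site not in best or hit["pval"] < best[site]["pval"]:
            best[site] = hit
    return list(best.values())


hits = [
    {"SiteID": "tx1:100", "Query": "q1", "pval": 0.02},
    {"SiteID": "tx1:100", "Query": "q2", "pval": 0.004},
    {"SiteID": "tx2:50", "Query": "q1", "pval": 0.3},
]
for hit in best_hit_per_site(hits):
    print(hit["SiteID"], hit["Query"], hit["pval"])
# tx1:100 q2 0.004
# tx2:50 q1 0.3
```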

No reverse-compatibility
Degradome density files created by versions of CleaveLand prior to 4.0
are NOT compatible with CleaveLand 4. Sorry.

Change in category definitions
The categories used by CleaveLand4 differ slightly from those used in
CleaveLand3 and earlier. Specifically, categories 3 and 2 now rely upon
calculating the mean, not the median, level of coverage in the
transcript. In addition, transcript positions with zero coverage are no
longer included in the calculation. The effect of this is to make
category 2 hits much more rare, and category 3 hits much more common.

Slicing at position 10
CleaveLand4 only looks for evidence of slicing at position 10 relative
to the aligned small RNA. There is no ambiguity -- data at position 11
or 9 is not relevant to CleaveLand4. This is because, as far as I know,
there is no direct evidence showing Argonaute proteins cut anywhere
besides position 10. However, there IS clear evidence for isomirs:
lower-abundance variants of miRNAs with alternative 5' or 3' ends.
Isomirs with alternative 5' ends could certainly cause offset slicing.
If you wish to search for slicing at slightly 'off' locations with
CleaveLand4, you will need to explicitly query with the isomirs of
interest.