merged develop branch
eunjijunekim committed Nov 5, 2014
2 parents c0144ae + d2de030 commit 1f85052
Showing 83 changed files with 11,202 additions and 6,900 deletions.
12 changes: 1 addition & 11 deletions .gitignore
@@ -1,11 +1 @@
norm_scripts/run_normalization_head
norm_scripts/runall_normalization_head.pl
norm_scripts/run_normalization_aug14
norm_scripts/runall_normalization_aug14.pl
norm_scripts/predict_num_reads_blast.pl
norm_scripts/runall_head.pl
norm_scripts/run_normalization_sep14
norm_scripts/runall_normalization_sep14.pl
norm_scripts/runall_sam2genes.pl
norm_scripts/runall_quantify_genes.pl
norm_scripts/runall_sam2mappingstats_gnorm.pl
documentation.md
267 changes: 119 additions & 148 deletions README.md

Large diffs are not rendered by default.

75 changes: 75 additions & 0 deletions about_cfg.md
@@ -0,0 +1,75 @@
## CONFIGURATION FILE

###0. NORMALIZATION and DATA TYPE
####A. Normalization Type
PORT offers **Exon-Intron-Junction** level normalization and **Gene** level normalization. Select the normalization type by setting GENE_NORM and/or EXON_INTRON_JUNCTION_NORM to TRUE. At least one normalization type needs to be used.
####B. Data Type
#####i. STRANDED
Set STRANDED to TRUE if the data are stranded.<br>
#####ii. FWD or REV
If STRANDED is set to TRUE, strand information needs to be provided. Set FWD to TRUE if the forward read is in the same orientation as the transcripts/genes (sense), and set REV to TRUE if the reverse read is in the same orientation as the transcripts/genes (sense).<br>
Note that when a dUTP-based protocol (e.g. the Illumina TruSeq stranded protocol) is used, the strand information comes from the reverse read.
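
The rules in this section can be checked before a run. The sketch below is a minimal illustration in Python and is not part of PORT. It assumes the settings have already been read into a dictionary mapping option names to "TRUE"/"FALSE" strings (the parsing step and the function name `check_norm_and_strand` are hypothetical), and it assumes that exactly one of FWD and REV should be TRUE for stranded data.

```python
def check_norm_and_strand(cfg):
    """Return a list of problems found in the section 0 settings.

    cfg is assumed to map option names to "TRUE"/"FALSE" strings;
    options that are missing are treated as FALSE.
    """
    def is_true(key):
        return str(cfg.get(key, "FALSE")).strip().upper() == "TRUE"

    problems = []
    # At least one normalization type has to be selected.
    if not (is_true("GENE_NORM") or is_true("EXON_INTRON_JUNCTION_NORM")):
        problems.append("set GENE_NORM and/or EXON_INTRON_JUNCTION_NORM to TRUE")
    # Stranded data needs strand information.
    # Assumption: exactly one of FWD / REV should be TRUE.
    if is_true("STRANDED") and is_true("FWD") == is_true("REV"):
        problems.append("STRANDED is TRUE: set exactly one of FWD or REV to TRUE")
    return problems
```

For a dUTP-based stranded library with gene-level normalization, a dictionary with GENE_NORM, STRANDED, and REV set to "TRUE" would pass this check with an empty list of problems.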

========================================================================================================

###1. CLUSTER INFO
If you're using SGE (Sun Grid Engine) or LSF (Load Sharing Facility), simply set the corresponding cluster option (SGE_CLUSTER or LSF_CLUSTER) to TRUE. You may edit the queue names and max_jobs.<br>
If you're using a different cluster, use the OTHER_CLUSTER option and specify the required parameters.
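
As a rough illustration of the choice described above, the following Python sketch reports which cluster option is selected. It is not PORT code: the function name `selected_cluster` and the dictionary-based input are hypothetical, and the rule that exactly one of the three options is TRUE is an assumption rather than something the configuration file is known to enforce.

```python
def selected_cluster(cfg):
    """Return the name of the single cluster option set to TRUE.

    cfg is assumed to map option names to "TRUE"/"FALSE" strings.
    Assumption: exactly one of the three options should be TRUE.
    """
    options = ("SGE_CLUSTER", "LSF_CLUSTER", "OTHER_CLUSTER")
    chosen = [name for name in options
              if str(cfg.get(name, "FALSE")).strip().upper() == "TRUE"]
    if len(chosen) != 1:
        raise ValueError(
            "expected exactly one of %s to be TRUE, found %d"
            % (", ".join(options), len(chosen)))
    return chosen[0]
```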

========================================================================================================

###2. GENE INFO
Gene information files with the required column-name suffixes need to be provided. You may use the same file for [1] and [2].
####[1] Gene information file for [Gene Normalization]
Gene normalization requires an Ensembl gene info file. The gene info file must contain column names with these suffixes: name, chrom, strand, txStart, txEnd, exonStarts, exonEnds, name2, ensemblToGeneName.value.

Ensembl gene info files for mm9, hg19, and dm3 are available in the Normalization/norm_scripts directory:

mm9: /path/to/Normalization/norm_scripts/mm9_ensGenes.txt
hg19: /path/to/Normalization/norm_scripts/hg19_ensGenes.txt
dm3: /path/to/Normalization/norm_scripts/dm3_ensGenes.txt
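
Before using your own gene info file, it can help to confirm that its header carries every required suffix. The Python sketch below is one way to do that and is not part of PORT. It assumes the file is tab-delimited with the column names on the first line, optionally starting with `#`; the names `GENE_NORM_SUFFIXES` and `missing_suffixes` are hypothetical.

```python
# Suffixes required for gene normalization, as listed above.
GENE_NORM_SUFFIXES = ["name", "chrom", "strand", "txStart", "txEnd",
                      "exonStarts", "exonEnds", "name2",
                      "ensemblToGeneName.value"]


def missing_suffixes(header_line, required):
    """Return the required suffixes that no column name ends with.

    Assumption: header_line is the tab-delimited first line of the
    gene info file, possibly prefixed with '#'.
    """
    columns = header_line.rstrip("\r\n").lstrip("#").split("\t")
    columns = [c.strip() for c in columns]
    return [s for s in required
            if not any(c.endswith(s) for c in columns)]
```

To use it, read the first line of the file (for example with `open(path).readline()`) and pass it in together with the suffix list; an empty result means every required suffix was found. The same function can be called with a different suffix list for the other file types.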

####[2] Gene information file for [Exon-Intron-Junction Normalization]
The gene info file must contain column names with these suffixes: chrom, strand, txStart, txEnd, exonStarts, and exonEnds.

UCSC gene info files for mm9, hg19, and dm3 are available for download:

mm9: wget http://itmat.indexes.s3.amazonaws.com/mm9_ucsc_gene_info_header.txt
hg19: wget http://itmat.indexes.s3.amazonaws.com/hg19_ucsc_gene_info_header.txt
dm3: wget http://itmat.indexes.s3.amazonaws.com/dm3_ucsc_gene_info_header.txt

####[3] Annotation file for [Exon-Intron-Junction Normalization]
This file should be downloaded from the UCSC known-gene track. It must contain column names with these suffixes: name, chrom, exonStarts, exonEnds, geneSymbol, and description.

Annotation files for mm9 and hg19 are available in the Normalization/norm_scripts directory:

mm9: /path/to/Normalization/norm_scripts/ucsc_known_mm9
hg19: /path/to/Normalization/norm_scripts/ucsc_known_hg19

========================================================================================================

###3. FA and FAI
####[1] genome sequence one-line fasta file

UCSC genome fa files for mm9, hg19, and dm3 are available for download:

mm9: wget http://itmat.indexes.s3.amazonaws.com/mm9_genome_one-line-seqs.fa
hg19: wget http://itmat.indexes.s3.amazonaws.com/hg19_genome_one-line-seqs.fa
dm3: wget http://itmat.indexes.s3.amazonaws.com/dm3_genome_one-line-seqs.fa

For other organisms, follow the instructions [here](https://github.com/itmat/rum/wiki/Creating-indexes) to create indexes.
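
If you build a genome file yourself, a quick check that it really is in one-line format can catch problems early. The Python sketch below is an illustration only and is not part of PORT. It assumes that "one-line" means every `>` header is followed by exactly one line holding the whole sequence; the function name `is_one_line_fasta` is hypothetical.

```python
def is_one_line_fasta(lines):
    """Return True if every '>' header is followed by exactly one
    sequence line. Blank lines are ignored.

    lines can be any iterable of strings, e.g. an open file handle.
    """
    seq_lines = None  # sequence lines seen since the last header
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_lines == 0:
                return False  # previous header had no sequence
            seq_lines = 0
        else:
            if seq_lines is None:
                return False  # sequence appears before any header
            seq_lines += 1
            if seq_lines > 1:
                return False  # sequence is wrapped over several lines
    return seq_lines == 1
```

Because the function reads line by line, it can be given an open file handle directly and does not need to hold a whole genome in memory.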

####[2] index file
You can generate the index file (*.fai) using [samtools](http://samtools.sourceforge.net/) (samtools faidx &lt;ref.fa>).
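
To confirm that the index matches the genome you expect, you can read the sequence names and lengths back from the .fai file. The Python sketch below relies on the faidx layout of tab-delimited lines whose first two columns are the sequence name and its length; the function name `read_fai_lengths` is hypothetical and the code is not part of PORT.

```python
def read_fai_lengths(lines):
    """Map sequence name -> length from the lines of a .fai index.

    Each non-empty line is expected to be tab-delimited, with the
    sequence name in column 1 and its length in column 2.
    """
    lengths = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        lengths[fields[0]] = int(fields[1])
    return lengths
```

Passing an open handle on the .fai file returns a dictionary that can be compared against the chromosome names used in the gene info files.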

========================================================================================================

###4. DATA VISUALIZATION
Set SAM2COV to TRUE if you want to use sam2cov to generate coverage files. sam2cov only supports reads aligned with RUM or STAR; set the option for the aligner you used to TRUE. Make sure you have the latest version of sam2cov. At the moment, sam2cov assumes that the strand information (sense) comes from the reverse read for stranded data.

========================================================================================================

###5. CLEANUP
By default, the CLEANUP step deletes only the intermediate SAM files. Set DELETE_INT_SAM to FALSE if you wish to keep them. You can also convert SAM files to BAM files by setting CONVERT_SAM2BAM to TRUE, and compress coverage files by setting GZIP_COV to TRUE.

========================================================================================================