Analyzing RNASeq data without replicates #6

Open
rawnakhoque opened this issue Feb 20, 2017 · 8 comments

Comments

@rawnakhoque
Collaborator

rawnakhoque commented Feb 20, 2017

@singha53 @santina
Hi,
Here https://github.com/STAT540-UBC/team_Bloodies/tree/master/Data/RNA-seq/Normal/
is a GSE87195_rnaseq_ensT_all.csv file with RNA-seq count data for ~60,000 transcripts across 13 samples. Since I do not have replicates, I would like to perform only pairwise comparisons. Do I need to perform any statistical analysis before the comparison? Could you please suggest some tools or statistical approaches I could use at this point? I see that many of the cells contain zero values. Should I get rid of the zeros? Thanks.

@ppavlidis

Barging in here ...

What do you mean by "pairwise comparison"? (Pairs of what? Genes? Samples? Datasets? What kind of comparison?) It might help to put your question in terms of a goal like "find which genes are most variable" or "cluster samples".

Genes with no counts at all in any sample would not be much fun to analyze. Genes that have some samples with zero counts are a different story. In any case, wouldn't you want to do some exploratory data analysis first to determine how/if to filter? Looking at the CSV file is only a good first step. Can you think of some other ways to characterize the data to help you assess your question?
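As a rough illustration of the kind of first-pass characterization suggested above, here is a minimal sketch in plain Python. The counts are made-up toy values, not taken from GSE87195_rnaseq_ensT_all.csv, and the helper names are hypothetical. It computes the total reads per sample and the fraction of zero counts per transcript, then drops only the transcripts that have no reads in any sample.

```python
# Toy counts: rows are transcripts, columns are samples.
# All values are invented for illustration; they are not from the real dataset.
samples = ["HSC", "MPP", "CLP", "CMP"]
counts = {
    "tx1": [0, 0, 0, 0],
    "tx2": [5, 0, 12, 3],
    "tx3": [120, 98, 143, 110],
    "tx4": [0, 0, 1, 0],
}


def library_sizes(counts, n_samples):
    """Total number of reads per sample (column sums)."""
    return [sum(row[j] for row in counts.values()) for j in range(n_samples)]


def zero_fraction(counts):
    """Fraction of samples with a zero count, per transcript."""
    return {tx: sum(1 for c in row if c == 0) / len(row)
            for tx, row in counts.items()}


def drop_all_zero(counts):
    """Keep only transcripts with at least one read in some sample."""
    return {tx: row for tx, row in counts.items() if any(c > 0 for c in row)}


sizes = library_sizes(counts, len(samples))
zf = zero_fraction(counts)
filtered = drop_all_zero(counts)

print(dict(zip(samples, sizes)))  # reads per sample
print(zf)                         # how sparse each transcript is
print(sorted(filtered))           # transcripts that survive the filter
```

Transcripts that are zero in only some samples (like `tx2` and `tx4` here) are kept on purpose, since whether to filter them further is exactly what the exploratory step is meant to inform.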

@singha53-zz

@rawnakhoque Do the 13 samples correspond to 13 cell-types, or are there multiple samples per cell-type? Can you upload the paper associated with this dataset? I am interested in knowing what this dataset was used for. The following paper used LIMMA to compare each cell-type with all other cell-types (http://www.bloodjournal.org/content/115/26/5376), but they had multiple replicates per cell-type.

@rawnakhoque
Collaborator Author

Hi Paul,
Sorry for the unclear statement. I would like to perform pairwise comparisons of differential gene expression between the differentiated progenitors (the columns in the csv file) in the human "hematopoietic stem cell differentiation" process. The file has read counts at the transcript level. I looked at the manual for DESeq2, and paragraph 1.3.1 says that the package needs the raw counts of gene i in sample j. I was wondering how I could convert the reads from transcript level to gene level, or whether I need to convert the transcripts to genes at all.

@rawnakhoque
Collaborator Author

@singha53 The 13 samples contain 7 cell types, i.e. there are multiple samples for some cell types. I would like to compare HSC vs MPP, MPP vs CLP, MPP vs CMP, CLP vs MLP, CMP vs GMP, and CMP vs MEP (directly related populations). The paper associated with our analysis is here: https://github.com/STAT540-UBC/team_Bloodies/blob/master/Background%20papers/2016%20DNA%20Methylation%20Dynamics%20of%20Human%20Hematopoietic%20Stem%20Cell%20Differentiation.pdf
I am not sure if I can use LIMMA since I don't have replicates. Paragraph 2.11 of the edgeR user guide has some statements on data without replicates, but I am not quite clear about the procedure.

@ppavlidis

When you say "multiple samples for some cell types", isn't that replication? Glancing at the GEO record, it's not exactly clear which samples are true replicates, but naively only CLP and CMP seem to have no replicates at all. You should be able to proceed - I mean, it will generate results, but it's far from ideal. You'll probably get crummy p-values, but you'll generate a ranking - maybe it will turn out usable and better than just "fold change".
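For reference, the plain "fold change" baseline mentioned above can be sketched as follows. This is only an illustration with invented counts and hypothetical gene names. It is not what DESeq2, edgeR or limma compute: it scales each sample to counts per million, adds a pseudocount (an assumption made here so that zeros do not cause division by zero), and ranks genes by absolute log2 fold change.

```python
import math


def cpm(column):
    """Scale raw counts to counts per million for one sample."""
    total = sum(column.values())
    return {gene: 1e6 * count / total for gene, count in column.items()}


def log2_fold_change(a, b, pseudocount=1.0):
    """log2(b / a) on the CPM scale, with a pseudocount to handle zeros."""
    cpm_a, cpm_b = cpm(a), cpm(b)
    return {gene: math.log2((cpm_b[gene] + pseudocount) /
                            (cpm_a[gene] + pseudocount))
            for gene in a}


# Invented counts for two unreplicated samples (e.g. one HSC and one MPP column).
hsc = {"g1": 100, "g2": 50, "g3": 0, "g4": 850}
mpp = {"g1": 100, "g2": 400, "g3": 0, "g4": 500}

lfc = log2_fold_change(hsc, mpp)

# Rank genes by the size of the change, largest first.
ranking = sorted(lfc, key=lambda gene: abs(lfc[gene]), reverse=True)
print(ranking)
print(lfc)
```

A ranking like this ignores how noisy each gene is, which is why a model-based ranking may turn out better even when the p-values themselves are not trustworthy.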

About transcripts vs genes: If you're asking if DESeq2 requires using genes not transcripts, the answer is no, it doesn't (it's just numbers...). Whether you should collapse transcripts to genes depends on what you are trying to do. But if you want to combine them, you'd just sum the counts. The logic being that this is the total number of reads associated with the gene.
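A minimal sketch of that summing step, assuming a transcript-to-gene mapping is available. The `tx2gene` dictionary, the identifiers and the counts below are all hypothetical; a real mapping would have to come from an annotation source.

```python
# Hypothetical transcript -> gene mapping.
tx2gene = {
    "ENST_A1": "GENE_A",
    "ENST_A2": "GENE_A",
    "ENST_B1": "GENE_B",
}

# Invented transcript-level counts: one list entry per sample.
tx_counts = {
    "ENST_A1": [10, 0, 3],
    "ENST_A2": [5, 2, 0],
    "ENST_B1": [7, 7, 7],
}


def collapse_to_genes(tx_counts, tx2gene):
    """Sum transcript counts per gene, sample by sample."""
    gene_counts = {}
    for tx, row in tx_counts.items():
        gene = tx2gene[tx]
        if gene not in gene_counts:
            gene_counts[gene] = [0] * len(row)
        gene_counts[gene] = [g + c for g, c in zip(gene_counts[gene], row)]
    return gene_counts


gene_counts = collapse_to_genes(tx_counts, tx2gene)
print(gene_counts)
```

Each gene-level value is then the total number of reads assigned to any transcript of that gene in that sample.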

@singha53-zz

@rawnakhoque The paper groups cell-types based on their progenitor status, i.e. myeloid progenitors (CMP, GMP) vs. lymphoid progenitors (CLP, MLP0, MLP1, MLP2, MLP3), using DESeq2; see Figure 6. The following post addresses how to use DESeq2 with no biological replicates:
http://seqanswers.com/forums/showthread.php?t=31036 --> although it states this should only be used for exploratory purposes:
see post by Michael Love:
"Working without replicates
DESeq allows analysis of experiments with no biological replicates in one or even both of the conditions. While one may not want to draw strong conclusions from such an analysis, it may still be useful for exploration and hypothesis generation. If replicates are available only for one of the conditions, one might choose to assume that the variance-mean dependence estimated from the data for that condition holds as well for the unreplicated one. If neither condition has replicates, one can still perform an analysis based on the assumption that for most genes, there is no true differential expression, and that a valid mean-variance relationship can be estimated from treating the two samples as if they were replicates. A minority of differentially abundant genes will act as outliers; however, they will not have a severe impact on the gamma-family GLM fit, as the gamma distribution for low values of the shape parameter has a heavy right-hand tail. Some overestimation of the variance may be expected, which will make that approach conservative."

@rawnakhoque
Collaborator Author

@ppavlidis @singha53
Thanks for your help!

@santina
Contributor

santina commented Feb 26, 2017

You can explore the data without replicates, but you can't really make a proper statistical inference about the data. See the DESeq2 vignette, p. 57, under 5.8.

Thank you, Paul and Amrit, for following up on the question.
