-
Notifications
You must be signed in to change notification settings - Fork 3
/
03_msmc.Rmd
130 lines (97 loc) · 6.4 KB
/
03_msmc.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
---
title: "MSMC Analysis"
output: github_document
bibliography: bibliography.bib
---
```{r, echo=FALSE}
knitr::opts_chunk$set(fig.width=12, fig.height=8,
echo=FALSE, warning=FALSE, message=FALSE)
```
```{r}
library(tidyverse)
source("R/constants.R")
```
We used [msmc2](https://github.com/stschiff/msmc2) to estimate changes in effective population size over time from deep sequenced genomes. Two such deep sequenced genomes were available, one from Magnetic Island (MI-1-4) and one from Fitzroy Island (FI-1-3). Genome-wide read coverage for these was assessed using the [bedtools genomecov](https://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html) utility and is plotted below. Both genomes had a peak coverage depth of slightly less than 20x with a long tail of higher coverage regions.
Data was therefore prepared for msmc analysis as follows;
- The genome was masked using the [snpable](http://lh3lh3.users.sourceforge.net/snpable.shtml) suite of utilities. See [02_snpable.sh](hpc/msmc/02_snpable.sh) for details.
- Only contigs larger than 1Mb were included
- A mappability mask was generated using `makeMappabilityMask.py` from [msmc-tools](https://github.com/stschiff/msmc-tools)
- SNPs were called using the `bamCaller` python script assuming a mean genome coverage of 20x. See [04_genomecov.sh](hpc/msmc/04_genomecov.sh) for details
- Inputs for a single run were generated with the script `generate_multihetsep.py` from [msmc-tools](https://github.com/stschiff/msmc-tools)
- Inputs for bootstraps were generated using the script `multihetsep_bootstrap.py` from [msmc-tools](https://github.com/stschiff/msmc-tools). 100 bootstraps were generated by taking 20 random chunks (per chromosome) of size 500kb and assembling these into 20 "chromosomes".
- The [msmc2](https://github.com/stschiff/msmc2) program was run on each bootstrap using options appropriate for a single diploid sample (see script [07_bootstrap.sh](hpc/msmc/07_bootstrap.sh))
```{r}
read_coverage <- function(sample_id){
filename <- paste("hpc/msmc/",sample_id,".genomecov_summary",sep="")
read_tsv(filename,col_names = c("contig","depth","cov","contig_len","cov_fraction")) %>% add_column(sample=sample_id)
}
sample_ids <- c('FI-1-3','MI-1-4')
cov_stats <- do.call(rbind,lapply(sample_ids,read_coverage))
ggplot(cov_stats %>% filter(depth<100) %>% filter(sample %in% c('FI-1-3','MI-1-4')) %>% filter(depth>2),aes(x=depth,y=cov)) +
geom_line(aes(color=sample)) +
geom_vline(aes(xintercept=18.5)) +
geom_vline(aes(xintercept=19.5)) + xlab("Read Depth") + ylab("Genome Coverage at Specified Depth") + theme_minimal()
```
In order to turn msmc outputs into a demographic history we assumed a mutation rate of `r format(1.86e-8)` (see [02_mutation_rates.md](02_mutation_rates.md) ), and a generation time of 5 years. The inferred population history is shown below.
```{r}
mu <- 1.86e-8
gen <- 5
read_bootstraps <- function(pop,base_dir="hpc/msmc/bootstrap_results/"){
bs_files <- list.files(base_dir,paste(c(".*_",'within',pop,".*.final.txt"),collapse=""),full.names = TRUE)
data <- do.call(rbind,lapply(bs_files,function(fn){
read_tsv(fn, col_types = cols()) %>% add_column(pop=pop) %>% add_column(bootstrap = strsplit(basename(fn),"_")[[1]][1] )
})) %>% unite(bsid,bootstrap,pop,remove = FALSE)
data
}
bs_fi_data <- read_bootstraps("FI")
bs_mi_data <- read_bootstraps("MI")
bs_data <- rbind(bs_fi_data,bs_mi_data)
bs_data_av <- bs_data %>% group_by(pop,time_index) %>% summarise(left_time_boundary=mean(left_time_boundary),right_time_boundary=mean(right_time_boundary),lambda=mean(lambda))
ymax <- bs_data_av %>% mutate(y=(1/lambda)/mu) %>% pull(y) %>% max()
```
```{r}
bs_data_gbr <- rbind(bs_fi_data,bs_mi_data)
bs_data_gbr_av <- bs_data_gbr %>% group_by(pop,time_index) %>% summarise(left_time_boundary=mean(left_time_boundary),right_time_boundary=mean(right_time_boundary),lambda=mean(lambda))
# saveRDS(bs_data_gbr,"cache/bs_data_gbr.rds")
# saveRDS(bs_data_gbr_av,"cache/bs_data_gbr_av.rds")
ggplot(bs_data_gbr,aes(x=left_time_boundary/mu*gen,y=(1/lambda)/mu)) +
scale_x_log10(breaks=c(1e+4,1e+5,1e+6,1e+7), labels=c("10kya","100kya","1mya","10mya")) +
geom_step(aes(group=bsid,color=pop),alpha=0.04,size=1) +
geom_step(data=bs_data_gbr_av, size=1,aes(color=pop)) +
scale_color_manual(values = site_colors()[c("FI","MI")],labels=c("Flinders Island","Magnetic Island")) +
ylim(0,ymax*1.2)+
xlab("Years Ago") +
ylab(expression(paste("Effective Population Size ",N[e]))) +
theme_minimal() + theme(legend.title = element_blank(), legend.position = c(0.8,0.8)) +
theme(text=element_text(size=20), legend.title = element_blank())
```
# With Climate Data
We used climate data taken from [Bintanja and van de Wal 2008](https://www.nature.com/articles/nature07158) to provide an estimate of global averaged sea level. Local factors are not taken into account but generally account for differences on the order of ~1m which is roughly 2 orders of magnitude less than the largest changes shown.
```{r}
climate_data <- read_tsv("raw_data/nature07158-s2.txt",skip = 14)
library(cowplot)
nep <- ggplot(bs_data_gbr,aes(x=left_time_boundary/mu*gen,y=(1/lambda)/mu)) +
scale_x_log10(breaks=c(1e+4,1e+5,1e+6,1e+7), labels=c("10kya","100kya","1mya","10mya"), limits=c(1e4,3e6)) +
geom_step(aes(group=bsid,color=pop),alpha=0.04,size=1) +
geom_step(data=bs_data_gbr_av, size=1,aes(color=pop)) +
scale_color_manual(values = site_colors()[c("FI","MI")],labels=c("Flinders Island","Magnetic Island")) +
ylim(0,ymax*1.2)+
xlab("") +
ylab(expression(paste("Effective Population Size ",N[e]))) +
theme_minimal() + theme(legend.title = element_blank(), legend.position = c(0.9,0.8))# + xlim(1e3,1e+7)
nep
cp <- ggplot(climate_data %>% filter(Time>10),aes(x=Time*1e3,y=-Ice_tot)) +
geom_line() +
scale_x_log10(breaks=c(1e+4,1e+5,1e+6,1e+7), labels=c("10kya","100kya","1mya","10mya"), limits=c(1e4,3e6)) +
theme_minimal() +
xlab("") + ylab("Sea Level")
plot_grid(nep,cp, ncol = 1, align = "hv", axis = "lr", rel_heights = c(0.7,0.3))
```
Finally we write out the bootstrap averaged data in msmc format for later us in simulating data under these demographic histories. (See [05_sf2_thresholds.md](05_sf2_thresholds.md))
```{r}
for (p in c('MI','FI')){
outfile <- paste("hpc/ms/msmc_av_",p,".txt",sep = "")
pop_data <- bs_data_av %>% ungroup %>% filter(pop==p) %>% select(-pop) %>% as.data.frame()
write_tsv(path = outfile,pop_data)
}
```