How to quickly check the insert size distribution of a new sequencing library

After you get the reads, be they paired end or mate pairs, from the sequencing centre, you sometimes want to chek the distribution of the insert sizes. There are probably several ways to do this, but here is what we tend to use.

Needed: either the enitre read dataset, or a subset of it (see below) a reference genome, or assembly, with at least a certain fraction of contigs/scaffolds large than the expected insert size.

All programs are avaliable from the common bin folder unless otherwise indicated. We'll be using:

seqtk (optional)
bwa
samtools
your favorite plotting program

-- Step 1: - optional: creating subsets of the data

For example, to create files with 1 million randomly selected reads - still paired up:

seqtk sample ../read1.fastq.gz 1000000 >read1_1Mreads.fastq
seqtk sample ../read2.fastq.gz 1000000 >read2_1Mreads.fastq

To take 10% of the reads instead:

seqtk sample ../read1.fastq.gz 0.1 >read1_10%reads.fastq
seqtk sample ../read2.fastq.gz 0.1 >read2_10%reads.fastq

NOTE, seqtk will happily accept gzipped fastq files.

Step 2: indexing your reference:

I prefer to make a new folder close to where the reference file is:

mkdir bwa_index

Create a symlink (or 'soft link', see this page to the reference (optional, but this makes the next commands more readable).

ln -s path/to/reference.fasta .

NOTE the period . at the end!

Create the index:

bwa-0.6.2 index -a bwtsw reference.fasta >bwa-index.out 2>&1

(all messages from the program will go to the file bwa_index.out)

Step 3: aligning the reads:

Make a folder for the results and cd into it. Make symlinks to the read files and reference index (optional, but this makes the next commands more readable).

ln -s path/to/read1_1Mreads.fastq .
ln -s path/to/read2_1Mreads.fastq .
ln -s path/to/bwa_index .

Align reads

bwa-0.6.2 aln -t 24 bwa_index/reference.fasta read1_1Mreads.fastq >read1_1Mreads.sai 2> read1_1Mreads.bwa.err
bwa-0.6.2 aln -t 24 bwa_index/reference.fasta read2_1Mreads.fastq >read2_1Mreads.sai 2> read2_1Mreads.bwa.err

NOTE older Illumina data may require you to add the -I flag, check the documentation.

Generate sam file:

bwa-0.6.2 sampe bwa_index/reference.fasta \
read1_1Mreads.sai \
read2_1Mreads.sai \
read1_1Mreads.fastq \
read2_1Mreads.fastq \
1>read1+2_1Mreads.sam 2> read1+2_1Mreads.bwa.err

Some tools need a sorted, compressed bam file. You can generate this using:

module load samtools
cat read1+2_1Mreads.sam | samtools view -uS - | samtools sort - read1+2_1Mreads.sorted

The resulting file will be called read1+2_1Mreads.sorted.bam.

Alternatively, you can do it like this:

bwa-0.6.2 sampe bwa_index/reference.fasta \
read1_1Mreads.sai \
read2_1Mreads.sai \
read1_1Mreads.fastq \
read2_1Mreads.fastq \
2> read1+2_1Mreads.bwa.err | samtools view -uS - | samtools sort - read1+2_1Mreads.sorted

Now you could for example get some basic metrics on the mapping by running:

samtools flagstat 130301_R1+2_001_10Mreads.sorted.bam

Step 4: plotting.

To get the insert size distribution, extract the following from the sam file:

the lines describing the mapped reads (these start with the name of the read)
for these lines, the value of column 9, the insert size as determined by bwa, if it is positive and larger than 0

Back to the CEES-HPC wiki home page

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to quickly check the insert size distribution of a new sequencing library

How to quickly check the insert size distribution of a new sequencing library

Clone this wiki locally