Calling variants in sequencing data

This is an extremely simplified tutorial on how to call variants in NGS data. The only assumption is that you have paired-end sequencing data (file_1.fastq and file_2.fastq) of some species with a reference genome, where I assume yeast (sacCer3) below.


# All commands below are based on Linux (http://www.embl.de/~rausch/primer.pdf)

# Download the genome
wget 'http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/chromFa.tar.gz'
tar -xzf chromFa.tar.gz
cat *.fa > sacCer3.fa

# Install BWA http://bio-bwa.sourceforge.net/
# Create a genome index
bwa index -a bwtsw sacCer3.fa

# Map sequences to the genome
bwa aln sacCer3.fa file_1.fastq > aln_1.sai
bwa aln sacCer3.fa file_2.fastq > aln_2.sai
bwa sampe sacCer3.fa aln_1.sai aln_2.sai file_1.fastq file_2.fastq > aln.sam

# Install samtools http://samtools.sourceforge.net/
# Convert alignments to bam
samtools view -S -b aln.sam > aln.bam

# Sort BAM file (will create aln_sort.bam)
samtools sort aln.bam aln_sort

# Index BAM file
samtools index aln_sort.bam

# Call variants and convert results to VCF
# See: http://samtools.sourceforge.net/mpileup.shtml
# See: http://vcftools.sourceforge.net
samtools mpileup -uf sacCer3.fa aln_sort.bam | bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | perl vcfutils.pl varFilter -D10000 > var.flt.vcf
head -n 50 var.flt.vcf


# Annotate variants
# Please see Annovar: http://www.openbioinformatics.org/annovar/
# or snpEff: http://snpeff.sourceforge.net/
# Here is a short example showing Ensembl gene annotations with Annovar:

# Convert vcf to annovar format
perl convert2annovar.pl -format vcf4 var.flt.vcf > var.annovar

# Download the Ensembl gene annotation for sacCer3
perl annotate_variation.pl -buildver sacCer3 -downdb ensGene sacCer3db
perl annotate_variation.pl -buildver sacCer3 -downdb seq sacCer3db/sacCer3_seq
perl retrieve_seq_from_fasta.pl sacCer3db/sacCer3_ensGene.txt -seqdir sacCer3db/sacCer3_seq -format ensGene -outfile sacCer3db/sacCer3_ensGeneMrna.fa

# Annotate the SNVs
perl annotate_variation.pl -geneanno -buildver sacCer3 -dbtype ensGene var.annovar <path_to_annovar>/sacCer3db/


# Merging vcf files
# Please see VCFtools vcftools.sourceforge.net
# and tabix samtools.sourceforge.net/tabix.shtml for details.
# Here is a short example of how to merge two vcf files called var1.vcf and var2.vcf:

vcf-sort var1.vcf > var1.sort.vcf
bgzip var1.sort.vcf
tabix -p vcf var1.sort.vcf.gz
vcf-sort var2.vcf > var2.sort.vcf
bgzip var2.sort.vcf
tabix -p vcf var2.sort.vcf.gz
export PATH=${PATH}:<path_to_tabix>
perl -I<path_to_vcftools_perl_directory> vcf-merge var1.sort.vcf.gz var2.sort.vcf.gz > merged.vcf


# Further information:
# FASTQ: http://en.wikipedia.org/wiki/FASTQ_format
# SEQanswers: http://seqanswers.com
# Samtools: samtools.sourceforge.net
# VCFtools: vcftools.sourceforge.net
# Picard: http://picard.sourceforge.net/
# Annovar: http://www.openbioinformatics.org/annovar/
# snpEff: http://snpeff.sourceforge.net/



Valid HTML 4.01 Transitional

CSS ist valide!

*EMBL Disclaimer
Author: Tobias Rausch
Last change: 20th November 2012
Main Page of the EMBL-Heidelberg