7. Variant calling

7.1. Preface

In this section we will use our genome assembly based on the ancestor and call genetic variants in the evolved line [NIELSEN2011].

7.2. Overview

The part of the workflow we will work on in this section can be viewed in Fig. 7.1.

../_images/workflow6.png

Fig. 7.1 The part of the workflow we will work on in this section marked in red.

7.3. Learning outcomes

After studying this tutorial section you should be able to:

#. Use tools to call variants based on a reference genome. #, Be able to describe what influences the calling of variants.

7.4. Before we start

Lets see how our directory structure looks so far:

$ cd ~/analysis
$ ls -1F
assembly/
data/
kraken/
mappings/
multiqc_data/
trimmed/
trimmed-fastqc/

Attention

If you have not run the previous section on Read mapping, you can download the mapped data needed for this section here: Downloads. Download the file to the ~/analysis directory and decompress. Alternatively on the CLI try:

cd ~/analysis
wget -O mappings.tar.gz https://osf.io/g5at8/download
tar xvzf mappings.tar.gz

7.5. Installing necessary software

Tools we are going to use in this section and how to intall them if you not have done it yet.

# activate the env
$ conda create --yes -n var samtools bamtools freebayes bedtools vcflib rtg-tools bcftools matplotlib
$ conda activate var

7.6. Preprocessing

We first need to make an index of our reference genome as this is required by the SNP caller. Given a scaffold/contig file in fasta-format, e.g. scaffolds.fasta which is located in the directory assembly/, use SAMtools to do this:

$ samtools faidx assembly/scaffolds.fasta

Furthermore we need to pre-process our mapping files a bit further and create a bam-index file (.bai) for the bam-file we want to work with:

$ bamtools index -in mappings/evol1.sorted.dedup.q20.bam

Lets also create a new directory for the variants:

$ mkdir variants

7.7. Calling variants

7.7.1. Freebayes

We can call variants with a tool called freebayes. Given a reference genome scaffold file in fasta-format, e.g. scaffolds.fasta and the index in .fai format and a mapping file (.bam file) and a mapping index (.bai file), we can call variants with freebayes like so:

# Now we call variants and pipe the results into a new file
$ freebayes -p 1 -f assembly/scaffolds.fasta mappings/evol1.sorted.dedup.q20.bam > variants/evol1.freebayes.vcf
  • -p 1: specifies the ploidy level. E.Coli are haploid.

7.8. Post-processing

7.8.1. Understanding the output files (.vcf)

Lets look at a vcf-file:

# first 10 lines, which are part of the header
$ cat variants/evol1.freebayes.vcf | head
##fileformat=VCFv4.2
##fileDate=20200122
##source=freeBayes v1.3.1-dirty
##reference=assembly/scaffolds.fasta
##contig=<ID=NODE_1_length_348724_cov_30.410613,length=348724>
##contig=<ID=NODE_2_length_327290_cov_30.828326,length=327290>
##contig=<ID=NODE_3_length_312063_cov_30.523209,length=312063>
##contig=<ID=NODE_4_length_202800_cov_31.500777,length=202800>
##contig=<ID=NODE_5_length_164027_cov_28.935175,length=164027>
##contig=<ID=NODE_6_length_144088_cov_29.907986,length=144088>

Lets look at the variants:

# remove header lines and look at top 4 entires
$ cat variants/evol1.freebayes.vcf | grep -v '##' | head -4
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  unknown
NODE_1_length_348724_cov_30.410613      375     .       A       C       0       .       AB=0;ABP=0;AC=0;AF=0;AN=1;AO=3;CIGAR=1X;DP=21;DPB=21;DPRA=0;EPP=3.73412;EPPR=3.49285;GTI=0;LEN=1;MEANALT=1;MQM=44;MQMR=40.3333;NS=1;NUMALT=1;ODDS=63.5226;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=53;QR=414;RO=18;RPL=2;RPP=3.73412;RPPR=7.35324;RPR=1;RUN=1;SAF=3;SAP=9.52472;SAR=0;SRF=14;SRP=15.074;SRR=4;TYPE=snp       GT:DP:AD:RO:QR:AO:QA:GL      0:21:18,3:18:414:3:53:0,-29.6927
NODE_1_length_348724_cov_30.410613      393     .       T       A       0       .       AB=0;ABP=0;AC=0;AF=0;AN=1;AO=2;CIGAR=1X;DP=24;DPB=24;DPRA=0;EPP=7.35324;EPPR=6.56362;GTI=0;LEN=1;MEANALT=1;MQM=36;MQMR=42.9545;NS=1;NUMALT=1;ODDS=127.074;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=21;QR=717;RO=22;RPL=2;RPP=7.35324;RPPR=3.0103;RPR=0;RUN=1;SAF=2;SAP=7.35324;SAR=0;SRF=17;SRP=17.2236;SRR=5;TYPE=snp       GT:DP:AD:RO:QR:AO:QA:GL      0:24:22,2:22:717:2:21:0,-57.4754
NODE_1_length_348724_cov_30.410613      612     .       A       C       2.32041e-15     .       AB=0;ABP=0;AC=0;AF=0;AN=1;AO=3;CIGAR=1X;DP=48;DPB=48;DPRA=0;EPP=9.52472;EPPR=11.1654;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=296.374;PAIRED=1;PAIREDR=0.977778;PAO=0;PQA=0;PQR=0;PRO=0;QA=53;QR=1495;RO=45;RPL=0;RPP=9.52472;RPPR=3.44459;RPR=3;RUN=1;SAF=3;SAP=9.52472;SAR=0;SRF=19;SRP=5.37479;SRR=26;TYPE=snp  GT:DP:AD:RO:QR:AO:QA:GL      0:48:45,3:45:1495:3:53:0,-129.869

The fields in a vcf-file are described in he table (Table 7.1) below:

Table 7.1 The vcf-file format fields.

Col

Field

Description

1

CHROM

Chromosome name

2

POS

1-based position. For an indel, this is the position preceding the indel.

3

ID

Variant identifier. Usually the dbSNP rsID.

4

REF

Reference sequence at POS involved in the variant. For a SNP, it is a single base.

5

ALT

Comma delimited list of alternative seuqence(s).

6

QUAL

Phred-scaled probability of all samples being homozygous reference.

7

FILTER

Semicolon delimited list of filters that the variant fails to pass.

8

INFO

Semicolon delimited list of variant information.

9

FORMAT

Colon delimited list of the format of individual genotypes in the following fields.

10+

Sample(s)

Individual genotype information defined by FORMAT.

7.8.2. Statistics

Now we can use it to do some statistics and filter our variant calls.

First, to prepare out vcf-file for querying we need to index it with tabix:

# compress file
$ bgzip variants/evol1.freebayes.vcf
# index
$ tabix -p vcf variants/evol1.freebayes.vcf.gz
  • -p vcf: input format

We can get some quick stats with rtg vcfstats:

$ rtg vcfstats variants/evol1.freebayes.vcf.gz

Example output from rtg vcfstats:

Location                     : variants/evol1.freebayes.vcf.gz
Failed Filters               : 0
Passed Filters               : 35233
SNPs                         : 55
MNPs                         : 6
Insertions                   : 3
Deletions                    : 5
Indels                       : 0
Same as reference            : 35164
SNP Transitions/Transversions: 0.83 (25/30)
Total Haploid                : 69
Haploid SNPs                 : 55
Haploid MNPs                 : 6
Haploid Insertions           : 3
Haploid Deletions            : 5
Haploid Indels               : 0
Insertion/Deletion ratio     : 0.60 (3/5)
Indel/SNP+MNP ratio          : 0.13 (8/61)

However, we can also run BCFtools to extract more detailed statistics about our variant calls:

$ bcftools stats -F assembly/scaffolds.fasta -s - variants/evol1.freebayes.vcf.gz > variants/evol1.freebayes.vcf.gz.stats
  • -s -: list of samples for sample stats, “-” to include all samples

  • -F FILE: faidx indexed reference sequence file to determine INDEL context

Now we take the stats and make some plots (e.g. Fig. 7.2) which are particular of interest if having multiple samples, as one can easily compare them. However, we are only working with one here:

$ mkdir variants/plots
$ plot-vcfstats -p variants/plots/ variants/evol1.freebayes.vcf.gz.stats
  • -p: The output files prefix, add a slash at the end to create a new directory.

../_images/vcfstats.png

Fig. 7.2 Example of plot-vcfstats output.

7.8.3. Variant filtration

Variant filtration is a big topic in itself [OLSEN2015]. There is no consens yet and research on how to best filter variants is ongoing.

We will do some simple filtration procedures here. For one, we can filter out low quality reads.

Here, we only include variants that have quality > 30.

# use rtg vcfffilter
$ rtg vcffilter -q 30 -i variants/evol1.freebayes.vcf.gz -o variants/evol1.freebayes.q30.vcf.gz
  • -i FILE: input file

  • -o FILE: output file

  • -q FLOAT: minimal allowed quality in output.

or use vcflib:

# or use vcflib
$ zcat variants/evol1.freebayes.vcf.gz  | vcffilter -f "QUAL >= 30" | gzip > variants/evol1.freebayes.q30.vcf.gz
  • -f "QUAL >= 30": we only include variants that have been called with quality >= 30.

Quick stats for the filtered variants:

# look at stats for filtered
$ rtg vcfstats variants/evol1.freebayes.q30.vcf.gz

freebayes adds some extra information to the vcf-files it creates. This allows for some more detailed filtering. This strategy will NOT work on calls done with e.g. SAMtools/bcftools mpileup called variants. Here we filter, based on some recommendation form the developer of freebayes:

$ zcat variants/evol1.freebayes.vcf.gz | vcffilter -f "QUAL > 1 & QUAL / AO > 10 & SAF > 0 & SAR > 0 & RPR > 1 & RPL > 1" | bgzip > variants/evol1.freebayes.filtered.vcf.gz
  • QUAL > 1: removes really bad sites

  • QUAL / AO > 10: additional contribution of each obs should be 10 log units (~ Q10 per read)

  • SAF > 0 & SAR > 0: reads on both strands

  • RPR > 1 & RPL > 1: at least two reads “balanced” to each side of the site

$ tabix -p vcf variants/evol1.freebayes.filtered.vcf.gz

This strategy used here will do for our purposes. However, several more elaborate filtering strategies have been explored, e.g. here.

Todo

Look at the statistics. One ratio that is mentioned in the statistics is transition transversion ratio (ts/tv). Explain what this ratio is and why the observed ratio makes sense.

Todo

Call and filter variants for the second evolved strain, similarily to what ws described here for the first strain. Should you be unable to do it, check the code section: Code: Variant calling.

References

NIELSEN2011

Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genetics, 2011, 12:433-451

OLSEN2015

Olsen ND et al. Best practices for evaluating single nucleotide variant calling methods for microbial genomics. Front. Genet., 2015, 6:235.