3. Quality control

3.1. Preface

There are many sources of errors that can influence the quality of your sequencing run [ROBASKY2014]. In this quality control section we will use our skill on the command-line interface to deal with the task of investigating the quality and cleaning sequencing data [KIRCHNER2014].

Note

You will encounter some To-do sections at times. Write the solutions and answers into a text-file.

3.2. Overview

The part of the workflow we will work on in this section can be viewed in Fig. 3.1.

../_images/workflow4.png

Fig. 3.1 The part of the workflow we will work on in this section marked in red.

3.3. Learning outcomes

After studying this tutorial you should be able to:

  1. Describe the steps involved in pre-processing/cleaning sequencing data.
  2. Distinguish between a good and a bad sequencing run.
  3. Compute, investigate and evaluate the quality of sequence data from a sequencing experiment.

3.4. The data

First, we are going to download the data we will analyse. Open a shell/terminal.

# create a directory you work in
mkdir analysis

# change into the directory
cd analysis

# download the data
curl -O http://compbio.massey.ac.nz/data/203341/data.tar.gz

# uncompress it
tar -xvzf data.tar.gz

Note

Should the download fail, download manually from Downloads.

The data is from a paired-end sequencing run data (see Fig. 3.2) from an Illumina MiSeq [GLENN2011]. Thus, we have two files, one for each end of the read.

../_images/pairedend.png

Fig. 3.2 Illustration of single-end (SE) versus paired-end (PE) sequencing.

If you need to refresh how Illumina paired-end sequencing works have a look at the Illumina technology webpage and this video.

Attention

The data we are using is “almost” raw data as it came from the machine. This data has been post-processed in two ways already. All sequences that were identified as belonging to the PhiX genome have been removed. This process requires some skills we will learn in later sections. Illumina adapters have been removed as well already! The process is explained below but we are not going to do it.

3.4.1. Investigate the data

Make use of your newly developed skills on the command-line to investigate the files in data folder.

Todo

  1. Use the command-line to get some ideas about the file.
  2. What kind of files are we dealing with?
  3. How many sequence reads are in the file?
  4. Assume a genome size of 12MB. Calculate the coverage based on this formula: C = LN / G
  • C: Coverage
  • G: is the haploid genome length in bp
  • L: is the read length in bp (e.g. 2x100 paired-end = 200)
  • N: is the number of reads sequenced

3.5. The fastq file format

The data we receive from the sequencing is in fastq format. To remind us what this format entails, we can revisit the fastq wikipedia-page!

A useful tool to decode base qualities can be found here.

Todo

Explain briefly what the quality value represents.

3.6. The QC process

There are a few steps one need to do when getting the raw sequencing data from the sequencing facility:

  1. Remove PhiX sequences
  2. Adapter trimming
  3. Quality trimming of reads
  4. Quality assessment

3.7. PhiX genome

PhiX is a nontailed bacteriophage with a single-stranded DNA and a genome with 5386 nucleotides. PhiX is used as a quality and calibration control for sequencing runs. PhiX is often added at a low known concentration, spiked in the same lane along with the sample or used as a separate lane. As the concentration of the genome is known, one can calibrate the instruments. Thus, PhiX genomic sequences need to be removed before processing your data further as this constitutes a deliberate contamination [MUKHERJEE2015]. The steps involve mapping all reads to the “known” PhiX genome, and removing all of those sequence reads from the data.

However, your sequencing provider might not have used PhiX, thus you need to read the protocol carefully, or just do this step in any case.

Attention

We are not going to do this step here, as this has been already done. Please see the Read mapping section on how to map reads against a reference genome.

3.8. Adapter trimming

The process of sequencing DNA via Illumina technology requires the addition of some adapters to the sequences. These get sequenced as well and need to be removed as they are artificial and do not belong to the species we try to sequence.

Attention

The process of how to do this is explained here, however we are not going to do this as our sequences have been adapter-trimmed already.

First, we need to know the adapter sequences that were used during the sequencing of our samples. Normally, you should ask your sequencing provider, who should be providing this information to you. Illumina itself provides a document that describes the adapters used for their different technologies. Also the FastQC tool, we will be using later on, provides a collection of contaminants and adapters.

Second, we need a tool that takes a list of adapters and scans each sequence read and removes the adapters. Install a tool called fastq-mcf from the ea-utils suite of tools that is able to do this.

# install
conda install ea-utils

Using the tool together with a adapter/contaminants list in fasta-file (here denoted as adapters.fa):

fastq-mcf -o cleaned.R1.fq.gz -o cleaned.R2.fq.gz adapaters.fa infile_R1.fastq infile_R2.fastq
  • -o: Specifies the output-files. These are fastq-files for forward and reverse read, with adapters removed.

3.9. Quality assessment of sequencing reads (SolexaQA++)

To assess the sequence read quality of the Illumina run we make use of a program called SolexaQA++ [COX2010]. SolexaQA++ was originally developed to work with Solexa data (since bought by Illumina), but long since working with Illumina data. It produces nice graphics that intuitively show the quality of the sequences. it is also able to dynamically trim the bad quality ends off the reads.

From the webpage:

“SolexaQA calculates sequence quality statistics and creates visual representations of data quality for second-generation sequencing data. Originally developed for the Illumina system (historically known as “Solexa”), SolexaQA now also supports Ion Torrent and 454 data.”

3.9.1. Install SolexaQA++

Unfortunately, currently we cannot install SolexaQA++ with conda.

curl -O http://compbio.massey.ac.nz/data/203341/SolexaQA.tar.gz

# uncompress the archive
tar -xvzf SolexaQA.tar.gz

# make the file executable
chmod a+x SolexaQA/Linux_x64/SolexaQA++

# copy program to root folder
cp ./SolexaQA/Linux_x64/SolexaQA++ .

# run the program
./SolexaQA++

Note

Should the download fail, download manually from Downloads.

3.9.2. SolexaQA++ manual

SolexaQA++ has three modes that can be run. Type:

./SolexaQA++
SolexaQA++ v3.1.3
Released under GNU General Public License version 3
C++ version developed by Mauro Truglio (M.Truglio@massey.ac.nz)

Usage: SolexaQA++ <command> [options]

Command: analysis      quality analysis and graphs generation
         dynamictrim    trim reads using a chosen threshold
         lengthsort  sort reads by a chosen length

The three modes are: analysis, dynamictrim, and lengthsort:

analysis - the primary quality analysis and visualization tool. Designed to run on unmodified FASTQ files obtained directly from Illumina, Ion Torrent or 454 sequencers.

dynamictrim - a read trimmer that individually crops each read to its longest contiguous segment for which quality scores are greater than a user-supplied quality cutoff.

lengthsort - a program to separate high quality reads from low quality reads. LengthSort assigns trimmed reads to paired-end, singleton and discard files based on a user-defined length cutoff.

3.9.3. SolexaQA++ dynamic trimming

We will use SolexaQA++ dynamic trim the reads, to chop of nucleotides witha a bad quality score.

Todo

  1. Create a directory for the result-files –> trimmed/.
  2. Run SolexaQA++ dynamictrim with the untrimmed data and a probability cutoff of 0.01., and submit result-directory trimmed/.
  3. Investigate the result-files in trimmed/, e.g. do the file-sizes change to the original files?
  4. SolexaQA++ dynamictrim produces a graphical output. Explain what the graph shows. Find heklp on the SolexaQA++ website.

Hint

Should you not get 1 and/or 2 right, try the commands in Code: SolexaQA++ trimming.

3.9.4. SolexaQA++ analysis on trimmed data

Todo

  1. Create a directory for the result-files –> trimmed-solexaqa.
  2. Use SolexaQA++ to do the quality assessment with the trimmed data-set.
  3. Compare your results to the examples of a particularly bad MiSeq run (Fig. 3.6 to Fig. 3.6, taken from SolexaQA++ website). Write down your observations.
  4. What elements in these example figures (Fig. 3.3 to Fig. 3.6) indicate that the show a bad run? Write down your explanations.

Hint

Should you not get 1 and/or 2 it right, try the commands in Code: SolexaQA++ qc.

../_images/solexaqa_quality_bad.png

Fig. 3.3 SolexaQA++ example quality plot along reads of a bad MiSeq run

../_images/solexaqa_hist_bad.png

Fig. 3.4 SolexaQA++ example histogram plot of a bad MiSeq run.

../_images/solexaqa_cumulative_bad.png

Fig. 3.5 SolexaQA++ example cumulative plot of a bad MiSeq run.

../_images/solexaqa_heatmap_bad.png

Fig. 3.6 SolexaQA++ example quality heatmap of a bad MiSeq run.

3.10. Sickle for dynamic trimming (alternative to SolexaQA++)

Should the dynamic trimming not work with SolexaQA++, you can alternatively use Sickle.

source activate ngs
conda install sickle-trim

Now we are going to run the program on our paired-end data:

# create a new directory
mkdir trimmed

# sickle parameters:
sickle --help

# as we are dealing with paired-end data you will be using "sickle pe"
sickle pe --help

# run sickle like so:
sickle pe -g -t sanger -f data/ancestor-R1.fastq.gz -r data/ancestor-R2.fastq.gz -o trimmed/ancestor-R1.trimmed.fastq.gz -p trimmed/ancestor-R2.trimmed.fastq.gz

Hint

Should you be unable to run Sickle or SolexaQA++ at all to trim the data. You can download the trimmed dataset here. Unarchive and uncompress the files with tar -xvzf trimmed.tar.gz.

3.11. Quality assessment of sequencing reads (FastQC)

3.11.1. Installing FastQC

source activate ngs
conda install fastqc

# should now run the program
fastqc --help
            FastQC - A high throughput sequence QC analysis tool

SYNOPSIS

        fastqc seqfile1 seqfile2 .. seqfileN

    fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam]
           [-c contaminant file] seqfile1 .. seqfileN

DESCRIPTION

    FastQC reads a set of sequence files and produces from each one a quality
    control report consisting of a number of different modules, each one of
    which will help to identify a different potential type of problem in your
    data.

    If no files to process are specified on the command line then the program
    will start as an interactive graphical application.  If files are provided
    on the command line then the program will run with no user interaction
    required.  In this mode it is suitable for inclusion into a standardised
    analysis pipeline.

3.11.2. FastQC manual

FastQC is a very simple program to run that provides similar and additional information to SolexaQA++.

From the webpage:

“FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.”

The basic command looks like:

fastqc -o RESULT-DIR INPUT-FILE.[txt/fa/fq] ...
  • -o RESULT-DIR is the directory where the result files will be written
  • INPUT-FILE.[txt/fa/fq] is the sequence file to analyze, can be more than one file.

Hint

The result will be a HTML page per input file that can be opened in a web-browser.

3.11.3. Run FastQC on the untrimmed and trimmed data

Todo

  1. Create a directory for the results –> trimmed-fastqc
  2. Run FastQC on all trimmed files.
  3. Visit the FastQC website and read about sequencing QC reports for good and bad Illumina sequencing runs.
  4. Compare your results to these examples (Fig. 3.7 to Fig. 3.9) of a particularly bad run (taken from the FastQC website) and write down your observations with regards to your data.
  5. What elements in these example figures (Fig. 3.7 to Fig. 3.9) indicate that the example is from a bad run?

Hint

Should you not get it right, try the commands in Code: FastQC.

../_images/fastqc_bad1.png

Fig. 3.7 Quality score across bases.

../_images/fastqc_bad2.png

Fig. 3.8 Quality per tile.

../_images/fastqc_bad3.png

Fig. 3.9 GC distribution over all sequences.

References

[COX2010]Cox MP, Peterson DA and Biggs PJ. SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics, 2010, 11:485. DOI: 10.1186/1471-2105-11-485
[GLENN2011]Glenn T. Field guide to next-generation DNA sequencers. Molecular Ecology Resources (2011) 11, 759–769 doi: 10.1111/j.1755-0998.2011.03024.x
[KIRCHNER2014]Kirchner et al. Addressing challenges in the production and analysis of Illumina sequencing data. BMC Genomics (2011) 12:382
[MUKHERJEE2015]Mukherjee S, Huntemann M, Ivanova N, Kyrpides NC and Pati A. Large-scale contamination of microbial isolate genomes by Illumina PhiX control. Standards in Genomic Sciences, 2015, 10:18. DOI: 10.1186/1944-3277-10-18
[ROBASKY2014]Robasky et al. The role of replicates for error mitigation in next-generation sequencing. Nature Reviews Genetics (2014) 15, 56-62