3. Quality control¶
There are many sources of errors that can influence the quality of your sequencing run [ROBASKY2014]. In this quality control section we will use our skill on the command-line interface to deal with the task of investigating the quality and cleaning sequencing data [KIRCHNER2014].
You will encounter some To-do sections at times. Write the solutions and answers into a text-file.
3.3. Learning outcomes¶
After studying this tutorial you should be able to:
- Describe the steps involved in pre-processing/cleaning sequencing data.
- Distinguish between a good and a bad sequencing run.
- Compute, investigate and evaluate the quality of sequence data from a sequencing experiment.
3.4. The data¶
First, we are going to download the data we will analyse. Open a shell/terminal.
# create a directory you work in mkdir analysis # change into the directory cd analysis # download the data curl -O http://compbio.massey.ac.nz/data/203341/data.tar.gz # uncompress it tar -xvzf data.tar.gz
Should the download fail, download manually from Downloads.
The data we are using is “almost” raw data as it came from the machine. This data has been post-processed in two ways already. All sequences that were identified as belonging to the PhiX genome have been removed. This process requires some skills we will learn in later sections. Illumina adapters have been removed as well already! The process is explained below but we are not going to do it.
3.4.1. Investigate the data¶
Make use of your newly developed skills on the command-line to
investigate the files in
- Use the command-line to get some ideas about the file.
- What kind of files are we dealing with?
- How many sequence reads are in the file?
- Assume a genome size of 12MB. Calculate the coverage based on this formula:
C = LN / G
G: is the haploid genome length in bp
L: is the read length in bp (e.g. 2x100 paired-end = 200)
N: is the number of reads sequenced
3.5. The fastq file format¶
The data we receive from the sequencing is in
fastq format. To remind us what this format entails, we can revisit the fastq wikipedia-page!
A useful tool to decode base qualities can be found here.
Explain briefly what the quality value represents.
3.6. The QC process¶
There are a few steps one need to do when getting the raw sequencing data from the sequencing facility:
- Remove PhiX sequences
- Adapter trimming
- Quality trimming of reads
- Quality assessment
3.7. PhiX genome¶
PhiX is a nontailed bacteriophage with a single-stranded DNA and a genome with 5386 nucleotides. PhiX is used as a quality and calibration control for sequencing runs. PhiX is often added at a low known concentration, spiked in the same lane along with the sample or used as a separate lane. As the concentration of the genome is known, one can calibrate the instruments. Thus, PhiX genomic sequences need to be removed before processing your data further as this constitutes a deliberate contamination [MUKHERJEE2015]. The steps involve mapping all reads to the “known” PhiX genome, and removing all of those sequence reads from the data.
However, your sequencing provider might not have used PhiX, thus you need to read the protocol carefully, or just do this step in any case.
We are not going to do this step here, as this has been already done. Please see the Read mapping section on how to map reads against a reference genome.
3.8. Adapter trimming¶
The process of sequencing DNA via Illumina technology requires the addition of some adapters to the sequences. These get sequenced as well and need to be removed as they are artificial and do not belong to the species we try to sequence.
The process of how to do this is explained here, however we are not going to do this as our sequences have been adapter-trimmed already.
First, we need to know the adapter sequences that were used during the sequencing of our samples. Normally, you should ask your sequencing provider, who should be providing this information to you. Illumina itself provides a document that describes the adapters used for their different technologies. Also the FastQC tool, we will be using later on, provides a collection of contaminants and adapters.
# install conda install ea-utils
Using the tool together with a adapter/contaminants list in fasta-file (here denoted as
fastq-mcf -o cleaned.R1.fq.gz -o cleaned.R2.fq.gz adapaters.fa infile_R1.fastq infile_R2.fastq
-o: Specifies the output-files. These are fastq-files for forward and reverse read, with adapters removed.
3.9. Quality assessment of sequencing reads (SolexaQA++)¶
To assess the sequence read quality of the Illumina run we make use of a program called SolexaQA++ [COX2010]. SolexaQA++ was originally developed to work with Solexa data (since bought by Illumina), but long since working with Illumina data. It produces nice graphics that intuitively show the quality of the sequences. it is also able to dynamically trim the bad quality ends off the reads.
From the webpage:
“SolexaQA calculates sequence quality statistics and creates visual representations of data quality for second-generation sequencing data. Originally developed for the Illumina system (historically known as “Solexa”), SolexaQA now also supports Ion Torrent and 454 data.”
3.9.1. Install SolexaQA++¶
curl -O http://compbio.massey.ac.nz/data/203341/SolexaQA.tar.gz # uncompress the archive tar -xvzf SolexaQA.tar.gz # make the file executable chmod a+x SolexaQA/Linux_x64/SolexaQA++ # copy program to root folder cp ./SolexaQA/Linux_x64/SolexaQA++ . # run the program ./SolexaQA++
Should the download fail, download manually from Downloads.
3.9.2. SolexaQA++ manual¶
SolexaQA++ has three modes that can be run. Type:
SolexaQA++ v3.1.3 Released under GNU General Public License version 3 C++ version developed by Mauro Truglio (M.Truglio@massey.ac.nz) Usage: SolexaQA++ <command> [options] Command: analysis quality analysis and graphs generation dynamictrim trim reads using a chosen threshold lengthsort sort reads by a chosen length
The three modes are:
analysis - the primary quality analysis and visualization tool.
Designed to run on unmodified FASTQ files obtained directly from
Illumina, Ion Torrent or 454 sequencers.
dynamictrim - a read trimmer that individually crops each read to
its longest contiguous segment for which quality scores are greater than
a user-supplied quality cutoff.
lengthsort - a program to separate high quality reads from low
quality reads. LengthSort assigns trimmed reads to paired-end, singleton
and discard files based on a user-defined length cutoff.
3.9.3. SolexaQA++ dynamic trimming¶
We will use SolexaQA++ dynamic trim the reads, to chop of nucleotides witha a bad quality score.
- Create a directory for the result-files –> trimmed/.
- Run SolexaQA++
dynamictrimwith the untrimmed data and a probability cutoff of 0.05., and submit result-directory trimmed/.
- Investigate the result-files in trimmed/, e.g. do the file-sizes change to the original files?
dynamictrimproduces a graphical output. Explain what the graph shows. Find heklp on the SolexaQA++ website.
Should you not get 1 and/or 2 right, try the commands in Code: SolexaQA++ trimming.
3.9.4. SolexaQA++ analysis on trimmed data¶
- Create a directory for the result-files –> trimmed-solexaqa.
- Use SolexaQA++ to do the quality assessment with the trimmed data-set.
- Compare your results to the examples of a particularly bad MiSeq run (Fig. 3.6 to Fig. 3.6, taken from SolexaQA++ website). Write down your observations.
- What elements in these example figures (Fig. 3.3 to Fig. 3.6) indicate that the show a bad run? Write down your explanations.
Should you not get 1 and/or 2 it right, try the commands in Code: SolexaQA++ qc.
3.10. Sickle for dynamic trimming (alternative to SolexaQA++)¶
conda activate ngs conda install sickle-trim
Now we are going to run the program on our paired-end data:
# create a new directory mkdir trimmed # sickle parameters: sickle --help # as we are dealing with paired-end data you will be using "sickle pe" sickle pe --help # run sickle like so: sickle pe -g -t sanger -f data/ancestor-R1.fastq.gz -r data/ancestor-R2.fastq.gz -o trimmed/ancestor-R1.trimmed.fastq.gz -p trimmed/ancestor-R2.trimmed.fastq.gz
3.11. Quality assessment of sequencing reads (FastQC)¶
3.11.1. Installing FastQC¶
conda activate ngs conda install fastqc # should now run the program fastqc --help
FastQC - A high throughput sequence QC analysis tool SYNOPSIS fastqc seqfile1 seqfile2 .. seqfileN fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] [-c contaminant file] seqfile1 .. seqfileN DESCRIPTION FastQC reads a set of sequence files and produces from each one a quality control report consisting of a number of different modules, each one of which will help to identify a different potential type of problem in your data. If no files to process are specified on the command line then the program will start as an interactive graphical application. If files are provided on the command line then the program will run with no user interaction required. In this mode it is suitable for inclusion into a standardised analysis pipeline.
3.11.2. FastQC manual¶
From the webpage:
“FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.”
The basic command looks like:
fastqc -o RESULT-DIR INPUT-FILE.[txt/fa/fq] ...
-o RESULT-DIRis the directory where the result files will be written
INPUT-FILE.[txt/fa/fq]is the sequence file to analyze, can be more than one file.
The result will be a HTML page per input file that can be opened in a web-browser.
3.11.3. Run FastQC on the untrimmed and trimmed data¶
- Create a directory for the results –> trimmed-fastqc
- Run FastQC on all trimmed files.
- Visit the FastQC website and read about sequencing QC reports for good and bad Illumina sequencing runs.
- Compare your results to these examples (Fig. 3.7 to Fig. 3.9) of a particularly bad run (taken from the FastQC website) and write down your observations with regards to your data.
- What elements in these example figures (Fig. 3.7 to Fig. 3.9) indicate that the example is from a bad run?
Should you not get it right, try the commands in Code: FastQC.
|[COX2010]||Cox MP, Peterson DA and Biggs PJ. SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics, 2010, 11:485. DOI: 10.1186/1471-2105-11-485|
|[GLENN2011]||Glenn T. Field guide to next-generation DNA sequencers. Molecular Ecology Resources (2011) 11, 759–769 doi: 10.1111/j.1755-0998.2011.03024.x|
|[KIRCHNER2014]||Kirchner et al. Addressing challenges in the production and analysis of Illumina sequencing data. BMC Genomics (2011) 12:382|
|[MUKHERJEE2015]||Mukherjee S, Huntemann M, Ivanova N, Kyrpides NC and Pati A. Large-scale contamination of microbial isolate genomes by Illumina PhiX control. Standards in Genomic Sciences, 2015, 10:18. DOI: 10.1186/1944-3277-10-18|
|[ROBASKY2014]||Robasky et al. The role of replicates for error mitigation in next-generation sequencing. Nature Reviews Genetics (2014) 15, 56-62|