3. Quality control¶
There are many sources of errors that can influence the quality of your sequencing run [ROBASKY2014]. In this quality control section we will use our skill on the command-line interface to deal with the task of investigating the quality and cleaning sequencing data [KIRCHNER2014].
You will encounter some To-do sections at times. Write the solutions and answers into a text-file.
3.3. Learning outcomes¶
After studying this tutorial you should be able to:
- Describe the steps involved in pre-processing/cleaning sequencing data.
- Distinguish between a good and a bad sequencing run.
- Compute, investigate and evaluate the quality of sequence data from a sequencing experiment.
3.4. The data¶
First, we are going to download the data we will analyse. Open a shell/terminal.
# create a directory you work in mkdir analysis # change into the directory cd analysis # download the data curl -O http://compbio.massey.ac.nz/data/203341/data.tar.gz # uncompress it tar -xvzf data.tar.gz
Should the download fail, download manually from Downloads.
The data we are using is “almost” raw data as it came from the machine. This data has been post-processed in two ways already. All sequences that were identified as belonging to the PhiX genome have been removed. This process requires some skills we will learn in later sections. Illumina adapters have been removed as well already! The process is explained below but we are not going to do it.
3.4.1. Investigate the data¶
Make use of your newly developed skills on the command-line to
investigate the files in
- Use the command-line to get some ideas about the file.
- What kind of files are we dealing with?
- How many sequence reads are in the file?
- Assume a genome size of 12MB. Calculate the coverage based on this formula:
C = LN / G
G: is the haploid genome length in bp
L: is the read length in bp (e.g. 2x100 paired-end = 200)
N: is the number of reads sequenced
3.5. The fastq file format¶
The data we receive from the sequencing is in
fastq format. To remind us what this format entails, we can revisit the fastq wikipedia-page!
A useful tool to decode base qualities can be found here.
Explain briefly what the quality value represents.
3.6. The QC process¶
There are a few steps one need to do when getting the raw sequencing data from the sequencing facility:
- Remove PhiX sequences
- Adapter trimming
- Quality trimming of reads
- Quality assessment
3.7. PhiX genome¶
PhiX is a nontailed bacteriophage with a single-stranded DNA and a genome with 5386 nucleotides. PhiX is used as a quality and calibration control for sequencing runs. PhiX is often added at a low known concentration, spiked in the same lane along with the sample or used as a separate lane. As the concentration of the genome is known, one can calibrate the instruments. Thus, PhiX genomic sequences need to be removed before processing your data further as this constitutes a deliberate contamination [MUKHERJEE2015]. The steps involve mapping all reads to the “known” PhiX genome, and removing all of those sequence reads from the data.
However, your sequencing provider might not have used PhiX, thus you need to read the protocol carefully, or just do this step in any case.
We are not going to do this step here, as this has been already done. Please see the Read mapping section on how to map reads against a reference genome.
3.8. Adapter trimming¶
The process of sequencing DNA via Illumina technology requires the addition of some adapters to the sequences. These get sequenced as well and need to be removed as they are artificial and do not belong to the species we try to sequence. Generally speaking adapter trimming takes time.
The process of how to do this is explained here, however we are not going to do this as our sequences have been adapter-trimmed already.
First, we need to know the adapter sequences that were used during the sequencing of our samples. Normally, you should ask your sequencing provider, who should be providing this information to you. Illumina itself provides a document that describes the adapters used for their different technologies. Also the FastQC tool, we will be using later on, provides a collection of contaminants and adapters.
# install conda install ea-utils
Using the tool together with a adapter/contaminants list in fasta-file (here denoted as
fastq-mcf -o cleaned.R1.fq.gz -o cleaned.R2.fq.gz adapaters.fa infile_R1.fastq infile_R2.fastq
-o: Specifies the output-files. These are fastq-files for forward and reverse read, with adapters removed.
3.9. Sickle for dynamic trimming¶
We are using a simple program Sickle for dynamic trimming of our sequencing reads to remove bad quality called bases from our reads.
conda activate ngs conda install sickle-trim
Now we are going to run the program on our paired-end data:
# create a new directory mkdir trimmed # sickle parameters: sickle --help # as we are dealing with paired-end data you will be using "sickle pe" sickle pe --help # run sickle like this on the ancestor: sickle pe -g -t sanger -f data/ancestor-R1.fastq.gz -r data/ancestor-R2.fastq.gz -o trimmed/ancestor-R1.trimmed.fastq.gz -p trimmed/ancestor-R2.trimmed.fastq.gz -s trimmed/ancestor-singles.fastq.gz
- Run Sickle also on the evolved samples.
3.10. Quality assessment of sequencing reads (FastQC)¶
3.10.1. Installing FastQC¶
conda activate ngs conda install fastqc # should now run the program fastqc --help
FastQC - A high throughput sequence QC analysis tool SYNOPSIS fastqc seqfile1 seqfile2 .. seqfileN fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] [-c contaminant file] seqfile1 .. seqfileN DESCRIPTION FastQC reads a set of sequence files and produces from each one a quality control report consisting of a number of different modules, each one of which will help to identify a different potential type of problem in your data. If no files to process are specified on the command line then the program will start as an interactive graphical application. If files are provided on the command line then the program will run with no user interaction required. In this mode it is suitable for inclusion into a standardised analysis pipeline.
3.10.2. FastQC manual¶
FastQC is a very simple program to run that provides inforation about sequence read quality.
From the webpage:
“FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.”
The basic command looks like:
fastqc -o RESULT-DIR INPUT-FILE.[txt/fa/fq] ...
-o RESULT-DIRis the directory where the result files will be written
INPUT-FILE.[txt/fa/fq]is the sequence file to analyze, can be more than one file.
The result will be a HTML page per input file that can be opened in a web-browser.
3.10.3. Run FastQC on the untrimmed and trimmed data¶
- Create a directory for the results –> trimmed-fastqc
- Run FastQC on all trimmed files.
- Visit the FastQC website and read about sequencing QC reports for good and bad Illumina sequencing runs.
- Compare your results to these examples (Fig. 3.3 to Fig. 3.5) of a particularly bad run (taken from the FastQC website) and write down your observations with regards to your data.
- What elements in these example figures (Fig. 3.3 to Fig. 3.5) indicate that the example is from a bad run?
Should you not get it right, try the commands in Code: FastQC.
|[GLENN2011]||Glenn T. Field guide to next-generation DNA sequencers. Molecular Ecology Resources (2011) 11, 759–769 doi: 10.1111/j.1755-0998.2011.03024.x|
|[KIRCHNER2014]||Kirchner et al. Addressing challenges in the production and analysis of Illumina sequencing data. BMC Genomics (2011) 12:382|
|[MUKHERJEE2015]||Mukherjee S, Huntemann M, Ivanova N, Kyrpides NC and Pati A. Large-scale contamination of microbial isolate genomes by Illumina PhiX control. Standards in Genomic Sciences, 2015, 10:18. DOI: 10.1186/1944-3277-10-18|
|[ROBASKY2014]||Robasky et al. The role of replicates for error mitigation in next-generation sequencing. Nature Reviews Genetics (2014) 15, 56-62|