6. Taxonomic investigation¶
We want to investate if there are sequences of other species in our collection of sequenced DNA pieces. We hope that most of them are from our species that we try to study, i.e. the DNA that we have extracted and amplified. This might be a way of quality control, e.g. have the samples been contaminated? Lets investigate if we find sequences from other species in our sequence set.
We will use the tool Kraken to assign taxonomic classifications to our sequence reads. Let us see if we can id some sequences from other species.
You will encounter some To-do sections at times. Write the solutions and answers into a text-file.
We will be using a tool called Kraken [WOOD2014]. This tool uses k-mers to assign a taxonomic labels in form of NCBI Taxonomy to the sequence (if possible). The taxonomic label is assigned based on similar k-mer content of the sequence in question to the k-mer content of reference genome sequence. The result is a classification of the sequence in question to the most likely taxonomic label. If the k-mer content is not similar to any genomic sequence in the database used, it will not assign any taxonomic label.
6.4. Before we start¶
Lets see how our directory structure looks so far:
cd ~/analysis ls -1F
assembly/ data/ mappings/ SolexaQA/ SolexaQA++ trimmed/ trimmed-fastqc/ trimmed-solexaqa/
Use conda in the same fashion as before to install Kraken:
source activate ngs conda install kraken-all
Now we create a directory where we are going to do the analysis and we will change into that directory too.
# make sure you are in your analysis root folder cd ~/analysis # create dir mkdir kraken cd kraken
curl -O https://ccb.jhu.edu/software/kraken/dl/minikraken.tgz # alternatively we can use wget wget https://ccb.jhu.edu/software/kraken/dl/minikraken.tgz # once the download is finished, we need to extract the archive content # it will create a directory: "minikraken_20141208/" tar -xvzf minikraken.tgz
Should the download fail. Please find links to alternative locations on the Downloads page.
The "minikraken" database was created from bacteria, viral and archaea sequences. What are the implications for us when we are trying to classify our sequences?
Now that we have installed Kraken and downloaded and extracted the minikraken database, we can attempt to investigate the sequences we got back from the sequencing provider for other species as the one it should contain. We call the Kraken tool and specify the database and fasta-file with the sequences it should use. The general command structure looks like this:
kraken --only-classified-output --db minikraken_20141208 example.fa > example.kraken
However, we may have fastq-files, so we need to use
--fastq-input which tells Kraken that it is dealing with fastq-formated files.
Here, we are investigating one of the unmapped paired-end read files of the evolved line.
kraken --only-classified-output --db minikraken_20141208 --fastq-input ../mappings/evolved-6.sorted.unmapped.R1.fastq > evolved-6-R1.kraken
We are for now only interested in the portion of sequeuces that can be classified, that is why we supplied the option
--only-classified-output to the Kraken command.
This classification may take a while, depending on how many sequences we are going to classify. The resulting content of the file "evolved-6-R1.kraken" looks similar to the following example:
C M02810:197:000000000-AV55U:1:1101:10078:18384/1 2157 151 0:99 2157:1 0:1 2157:1 0:19 C M02810:197:000000000-AV55U:1:1101:10573:27304/1 364745 150 0:43 364745:1 0:76 C M02810:197:000000000-AV55U:1:1101:10852:5722/1 37665 151 0:33 37665:1 0:87 C M02810:197:000000000-AV55U:1:1101:11429:10330/1 374840 101 0:17 374840:1 0:4 374840:1 0:2 374840:1 0:45 C M02810:197:000000000-AV55U:1:1101:11705:24355/1 2157 151 0:1 2157:1 0:119
Each sequence classified by Kraken results in a single line of output. Output lines contain five tab-delimited fields; from left to right, they are:
C/U: one letter code indicating that the sequence was either classified or unclassified.
- The sequence ID, obtained from the FASTA/FASTQ header.
- The taxonomy ID Kraken used to label the sequence; this is 0 if the sequence is unclassified and otherwise should be the NCBI Taxonomy identifier.
- The length of the sequence in bp.
- A space-delimited list indicating the lowest common ancestor (in the
taxonomic tree) mapping of each k-mer in the sequence.
562:13 561:4 A:31 0:1 562:3would indicate that:
- the first 13 k-mers mapped to taxonomy ID #562
- the next 4 k-mers mapped to taxonomy ID #561
- the next 31 k-mers contained an ambiguous nucleotide
- the next k-mer was not in the database
- the last 3 k-mers mapped to taxonomy ID #562
6.7. Investigate taxa¶
We can use the webpage NCBI TaxIdentifier to quickly get the names to the taxonomy identifier. However, this is impractical as we are dealing potentially with many sequences. Kraken has some scripts that help us understand our results better.
First, we generate a sample-wide report of all taxa found.
This can be achieved with the tool
kraken-report --db minikraken_20141208 evolved-6-R1.kraken > evolved-6-R1.kraken.report
The first few lines of an example report are shown below.
0.00 0 0 U 0 unclassified 100.00 2665 47 - 1 root 54.56 1454 0 - 131567 cellular organisms 38.50 1026 1023 D 2157 Archaea 0.08 2 0 P 651137 Thaumarchaeota 0.08 2 0 - 651142 unclassified Thaumarchaeota 0.08 2 0 G 1048752 Candidatus Caldiarchaeum 0.08 2 2 S 311458 Candidatus Caldiarchaeum subterraneum 0.04 1 0 P 28890 Euryarchaeota 0.04 1 0 C 183967 Thermoplasmata
The output of kraken-report is tab-delimited, with one line per taxon. The fields of the output, from left-to-right, are as follows:
- Percentage of reads covered by the clade rooted at this taxon
- Number of reads covered by the clade rooted at this taxon
- Number of reads assigned directly to this taxon
- A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply "-".
- NCBI Taxonomy ID
- indented scientific name
If you want to compare the taxa content of different samples to another, one can create a report whose structure is always the same for all samples, disregarding which taxa are found (obviously the percentages and numbers will be different).
We can cerate such a report using the option
--show-zeros which will print out all taxa (instead of only those found).
We then sort the taxa according to taxa-ids (column 5), e.g.
sort -n -k5.
kraken-report *--show-zeros* --db minikraken_20141208 evolved-6-R1.kraken | **sort -n -k5** > evolved-6-R1.kraken.report.sorted
The report is not ordered according to taxa ids and contains all taxa in the database, even if they have not been found in our sample and are thus zero. The columns are the same as in the former report, however, we have more rows and they are now differently sorted, according to the NCBI Taxonomy id.
For every sequence in our sample and its predicted taxonomic identifier,
we can attach the taxonomic names with
kraken-translate --mpa-format --db minikraken_20141208 evolved-6-R1.kraken > evolved-6-R1.kraken.names
An example output looks like this:
M02810:197:000000000-AV55U:1:1101:10078:18384/1 d__Archaea M02810:197:000000000-AV55U:1:1101:10573:27304/1 d__Viruses|f__Baculoviridae|g__Betabaculovirus|s__Choristoneura_occidentalis_granulovirus M02810:197:000000000-AV55U:1:1101:10852:5722/1 d__Viruses|f__Phycodnaviridae|g__Phaeovirus|s__Ectocarpus_siliculosus_virus_1 M02810:197:000000000-AV55U:1:1101:11429:10330/1 d__Viruses|f__Microviridae|g__Microvirus|s__Enterobacteria_phage_phiX174_sensu_lato M02810:197:000000000-AV55U:1:1101:11705:24355/1 d__Archaea M02810:197:000000000-AV55U:1:1101:12206:13333/1 d__Viruses|o__Herpesvirales|f__Alloherpesviridae|g__Cyprinivirus|s__Cyprinid_herpesvirus_3 M02810:197:000000000-AV55U:1:1101:12336:11626/1 d__Viruses|f__Baculoviridae|g__Betabaculovirus|s__Choristoneura_occidentalis_granulovirus M02810:197:000000000-AV55U:1:1101:12707:5809/1 d__Archaea M02810:197:000000000-AV55U:1:1101:12741:21685/1 d__Viruses|f__Baculoviridae|g__Betabaculovirus|s__Choristoneura_occidentalis_granulovirus M02810:197:000000000-AV55U:1:1101:12767:27821/1 d__Archaea
Here, each sequence that got classified is present in one row. The nomenclature for the names is preceded with an letter according to its rank, e.g. (d)omain, (k)ingdom, (p)hylum, (c)lass, (o)rder, (f)amily (g)enus, or (s)pecies. Taxonomy assignments above the superkingdom (d) rank are represented as just root.
We use the Krona tools to create a nice interactive visualisation of the taxa content of our sample [ONDOV2011]. Fig. 6.2 shows an example (albeit an artificial one) snapshot of the visualisation Krona provides. Fig. 6.2 is a snapshot of the interactive web-page similar to the one we try to create.
Install Krona with:
source activate ngs conda install krona
First some house-keeping to make the Krona installation work. Do not worry to much about what is happening here.
# we delete a symbolic link that is not correct rm -rf ~/miniconda3/envs/ngs/opt/krona/taxonomy # we create a directory in our home where the krona database will live mkdir -p ~/krona/taxonomy # now we make a symbolic link to that directory ln -s ~/krona/taxonomy ~/miniconda3/envs/ngs/opt/krona/taxonomy # now we copy some scripts around, this is necessary as krona installation fails to do this cp -r ~/miniconda3/envs/ngs/opt/krona/scripts ~/miniconda3/envs/py3-kraken/bin/
We need to build a taxonomy database for Krona. However, if this fails we will skip this step and just download a pre-build one. Lets first try to build one.
Now, if this fails, we download a pre-build taxonomy database for krona.
# Download pre-build database curl -O http://compbio.massey.ac.nz/data/taxonomy.tab.gz # we unzip the file gzip -d taxonomy.tab.gz # we move the unzipped file to our taxonomy directory we specified in the step before. mv taxonomy.tab ~/krona/taxonomy
Should this also fail we can download a pre-build database on the Downloads page via a browser.
Now we use the tool
ktImportTaxonomy from the Krona tools to crate the html web-page:
cat evolved-6-R1.kraken | cut -f 2,3 > evolved-6-R1.kraken.krona ktImportTaxonomy evolved-6-R1.kraken.krona firefox taxonomy.krona.html
|[ONDOV2011]||Ondov BD, Bergman NH, and Phillippy AM. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics, 2011, 12(1):385.|
|[WOOD2014]||Wood DE and Steven L Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 2014, 15:R46. DOI: 10.1186/gb-2014-15-3-r46.|