6. Taxonomic investigation¶

6.1. Preface¶

We want to investate if there are sequences of other species in our collection of sequenced DNA pieces. We hope that most of them are from our species that we try to study, i.e. the DNA that we have extracted and amplified. This might be a way of quality control, e.g. have the samples been contaminated? Lets investigate if we find sequences from other species in our sequence set.

We will use the tool Kraken to assign taxonomic classifications to our sequence reads. Let us see if we can id some sequences from other species.

Note

You will encounter some To-do sections at times. Write the solutions and answers into a text-file.

6.2. Overview¶

The part of the workflow we will work on in this section can be viewed in Fig. 6.1.

6.3. Before we start¶

Lets see how our directory structure looks so far:

cd ~/analysis
ls -1F

assembly/
data/
mappings/
SolexaQA/
SolexaQA++
trimmed/
trimmed-fastqc/
trimmed-solexaqa/


6.4. Kraken¶

We will be using a tool called Kraken [WOOD2014]. This tool uses k-mers to assign a taxonomic labels in form of NCBI Taxonomy to the sequence (if possible). The taxonomic label is assigned based on similar k-mer content of the sequence in question to the k-mer content of reference genome sequence. The result is a classification of the sequence in question to the most likely taxonomic label. If the k-mer content is not similar to any genomic sequence in the database used, it will not assign any taxonomic label.

6.4.1. Installation¶

Use conda in the same fashion as before to install Kraken:

source activate ngs
conda install kraken-all


Now we create a directory where we are going to do the analysis and we will change into that directory too.

# make sure you are in your analysis root folder
cd ~/analysis

# create dir
mkdir kraken
cd kraken


Now we need to create or download a Kraken database that can be used to assign the taxonomic labels to sequences. We opt for downloading the pre-build “minikraken” database from the Kraken website:

curl -O https://ccb.jhu.edu/software/kraken/dl/minikraken.tgz

# alternatively we can use wget
wget https://ccb.jhu.edu/software/kraken/dl/minikraken.tgz

# once the download is finished, we need to extract the archive content
# it will create a directory: "minikraken_20141208/"
tar -xvzf minikraken.tgz


Attention

Note

The “minikraken” database was created from bacteria, viral and archaea sequences. What are the implications for us when we are trying to classify our sequences?

6.4.2. Usage¶

Now that we have installed Kraken and downloaded and extracted the minikraken database, we can attempt to investigate the sequences we got back from the sequencing provider for other species as the one it should contain. We call the Kraken tool and specify the database and fasta-file with the sequences it should use. The general command structure looks like this:

kraken --only-classified-output --db minikraken_20141208 example.fa > example.kraken


However, we may have fastq-files, so we need to use --fastq-input which tells Kraken that it is dealing with fastq-formated files. Here, we are investigating one of the unmapped paired-end read files of the evolved line.

kraken --only-classified-output --db minikraken_20141208 --fastq-input ../mappings/evolved-6.sorted.unmapped.R1.fastq > evolved-6-R1.kraken


Note

We are for now only interested in the portion of sequeuces that can be classified, that is why we supplied the option --only-classified-output to the Kraken command.

This classification may take a while, depending on how many sequences we are going to classify. The resulting content of the file “evolved-6-R1.kraken” looks similar to the following example:

C       M02810:197:000000000-AV55U:1:1101:10078:18384/1 2157    151     0:99 2157:1 0:1 2157:1 0:19
C       M02810:197:000000000-AV55U:1:1101:10573:27304/1 364745  150     0:43 364745:1 0:76
C       M02810:197:000000000-AV55U:1:1101:10852:5722/1  37665   151     0:33 37665:1 0:87
C       M02810:197:000000000-AV55U:1:1101:11429:10330/1 374840  101     0:17 374840:1 0:4 374840:1 0:2 374840:1 0:45
C       M02810:197:000000000-AV55U:1:1101:11705:24355/1 2157    151     0:1 2157:1 0:119



Each sequence classified by Kraken results in a single line of output. Output lines contain five tab-delimited fields; from left to right, they are:

1. C/U: one letter code indicating that the sequence was either classified or unclassified.
2. The sequence ID, obtained from the FASTA/FASTQ header.
3. The taxonomy ID Kraken used to label the sequence; this is 0 if the sequence is unclassified and otherwise should be the NCBI Taxonomy identifier.
4. The length of the sequence in bp.
5. A space-delimited list indicating the lowest common ancestor (in the taxonomic tree) mapping of each k-mer in the sequence. For example, 562:13 561:4 A:31 0:1 562:3 would indicate that:
• the first 13 k-mers mapped to taxonomy ID #562
• the next 4 k-mers mapped to taxonomy ID #561
• the next 31 k-mers contained an ambiguous nucleotide
• the next k-mer was not in the database
• the last 3 k-mers mapped to taxonomy ID #562

Note

The Kraken manual can be accessed here.

6.4.3. Investigate taxa¶

We can use the webpage NCBI TaxIdentifier to quickly get the names to the taxonomy identifier. However, this is impractical as we are dealing potentially with many sequences. Kraken has some scripts that help us understand our results better.

6.4.3.1. kraken-report¶

First, we generate a sample-wide report of all taxa found. This can be achieved with the tool kranken-report.

kraken-report --db minikraken_20141208 evolved-6-R1.kraken > evolved-6-R1.kraken.report


The first few lines of an example report are shown below.

  0.00  0       0       U       0       unclassified
100.00  2665    47      -       1       root
54.56  1454    0       -       131567    cellular organisms
38.50  1026    1023    D       2157        Archaea
0.08  2       0       P       651137        Thaumarchaeota
0.08  2       0       -       651142          unclassified Thaumarchaeota
0.08  2       0       G       1048752           Candidatus Caldiarchaeum
0.08  2       2       S       311458              Candidatus Caldiarchaeum subterraneum
0.04  1       0       P       28890         Euryarchaeota
0.04  1       0       C       183967          Thermoplasmata



The output of kraken-report is tab-delimited, with one line per taxon. The fields of the output, from left-to-right, are as follows:

1. Percentage of reads covered by the clade rooted at this taxon
2. Number of reads covered by the clade rooted at this taxon
3. Number of reads assigned directly to this taxon
4. A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply “-“.
5. NCBI Taxonomy ID
6. The indented scientific name

Note

If you want to compare the taxa content of different samples to another, one can create a report whose structure is always the same for all samples, disregarding which taxa are found (obviously the percentages and numbers will be different).

We can cerate such a report using the option --show-zeros which will print out all taxa (instead of only those found). We then sort the taxa according to taxa-ids (column 5), e.g. sort -n -k5.

kraken-report *--show-zeros* --db minikraken_20141208 evolved-6-R1.kraken | **sort -n -k5** > evolved-6-R1.kraken.report.sorted


The report is not ordered according to taxa ids and contains all taxa in the database, even if they have not been found in our sample and are thus zero. The columns are the same as in the former report, however, we have more rows and they are now differently sorted, according to the NCBI Taxonomy id.

6.4.3.2. kraken-translate¶

For every sequence in our sample and its predicted taxonomic identifier, we can attach the taxonomic names with kraken-translate.

kraken-translate --mpa-format --db minikraken_20141208 evolved-6-R1.kraken > evolved-6-R1.kraken.names


An example output looks like this:

M02810:197:000000000-AV55U:1:1101:10078:18384/1 d__Archaea
M02810:197:000000000-AV55U:1:1101:10573:27304/1 d__Viruses|f__Baculoviridae|g__Betabaculovirus|s__Choristoneura_occidentalis_granulovirus
M02810:197:000000000-AV55U:1:1101:10852:5722/1  d__Viruses|f__Phycodnaviridae|g__Phaeovirus|s__Ectocarpus_siliculosus_virus_1
M02810:197:000000000-AV55U:1:1101:11429:10330/1 d__Viruses|f__Microviridae|g__Microvirus|s__Enterobacteria_phage_phiX174_sensu_lato
M02810:197:000000000-AV55U:1:1101:11705:24355/1 d__Archaea
M02810:197:000000000-AV55U:1:1101:12206:13333/1 d__Viruses|o__Herpesvirales|f__Alloherpesviridae|g__Cyprinivirus|s__Cyprinid_herpesvirus_3
M02810:197:000000000-AV55U:1:1101:12336:11626/1 d__Viruses|f__Baculoviridae|g__Betabaculovirus|s__Choristoneura_occidentalis_granulovirus
M02810:197:000000000-AV55U:1:1101:12707:5809/1  d__Archaea
M02810:197:000000000-AV55U:1:1101:12741:21685/1 d__Viruses|f__Baculoviridae|g__Betabaculovirus|s__Choristoneura_occidentalis_granulovirus
M02810:197:000000000-AV55U:1:1101:12767:27821/1 d__Archaea



Here, each sequence that got classified is present in one row. The nomenclature for the names is preceded with an letter according to its rank, e.g. (d)omain, (k)ingdom, (p)hylum, (c)lass, (o)rder, (f)amily (g)enus, or (s)pecies. Taxonomy assignments above the superkingdom (d) rank are represented as just root.

6.5. Centrifuge¶

We can also use the successor to Kraken, a tool by the same group called Centrifuge [KIM2017]. This tool uses a novel indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, optimized specifically for the metagenomic classification problem to assign a taxonomic labels in form of NCBI Taxonomy to the sequence (if possible). The result is a classification of the sequence in question to the most likely taxonomic label. If the search sequence is not similar to any genomic sequence in the database used, it will not assign any taxonomic label.

6.5.1. Installation¶

Use conda in the same fashion as before to install Centrifuge:

source activate ngs
conda install centrifuge


Now we create a directory where we are going to do the analysis and we will change into that directory too.

# make sure you are in your analysis root folder
cd ~/analysis

# create dir
mkdir centrifuge
cd centrifuge


Now we need to create or download a Centrifuge database that can be used to assign the taxonomic labels to sequences. We opt for downloading the pre-build database from the Centrifuge website:

curl -O ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p_compressed.tar.gz

# alternatively we can use wget
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p_compressed.tar.gz

# once the download is finished, we need to extract the archive content
# It will extract a few files from the archive and may take a moment to finish.
tar -xvzf p_compressed.tar.gz


Attention

Note

The database we will be using was created from bacteria and archaea sequences only. What are the implications for us when we are trying to classify our sequences?

6.5.2. Usage¶

Now that we have installed Centrifuge and downloaded and extracted the pre-build database, we can attempt to investigate the sequences we got back from the sequencing provider for other species as the one it should contain. We call the Centrifuge tool and specify the database and fasta-file with the sequences it should use. The general command structure looks like this:

centrifuge -x p_compressed -U example.fa --report-file report.txt -S results.txt


However, if we do not have fastq-files we may have to use the -f option, which tells Centrifuge that it is dealing with a fasta-formated file. Here, we are investigating one of the unmapped paired-end read files of the evolved line.

centrifuge -x p_compressed -U ../mappings/evolved-6.sorted.unmapped.R1.fastq --report-file evolved-6-R1-report.txt -S evolved-6-R1-results.txt


This classification may take a moment, depending on how many sequences we are going to classify. The resulting content of the file evolved-6-R1-results.txt looks similar to the following example:

readID  seqID   taxID   score   2ndBestScore    hitLength       queryLength     numMatches
M02810:197:000000000-AV55U:1:1101:15316:8461    cid|1747        1747    1892    0       103     135     1
M02810:197:000000000-AV55U:1:1101:15563:3249    cid|161879      161879  18496   0       151     151     1
M02810:197:000000000-AV55U:1:1101:19743:5166    cid|564 564     10404   10404   117     151     2
M02810:197:000000000-AV55U:1:1101:19743:5166    cid|562 562     10404   10404   117     151     2



Each sequence classified by Centrifuge results in a single line of output. Output lines contain eight tab-delimited fields; from left to right, they are according to the Centrifuge website:

2. The sequence ID of the genomic sequence, where the read is classified.
3. The taxonomic ID of the genomic sequence in the second column.
4. The score for the classification, which is the weighted sum of hits.
5. The score for the next best classification.
6. A pair of two numbers: (1) an approximate number of base pairs of the read that match the genomic sequence and (2) the length of a read or the combined length of mate pairs.
7. A pair of two numbers: (1) an approximate number of base pairs of the read that match the genomic sequence and (2) the length of a read or the combined length of mate pairs.
8. The number of classifications for this read, indicating how many assignments were made.

6.5.3. Investigate taxa¶

6.5.3.1. Centrifuge report¶

The command above creates a Centrifuge report automatically for us. It contains an overview of the identified taxa and their abundances in your supplied sequences (normalised to genomic length):

name    taxID   taxRank genomeSize      numReads        numUniqueReads  abundance
Pseudomonas aeruginosa  287     species 22457305        1       0       0.0
Pseudomonas fluorescens 294     species 14826544        1       1       0.0
Pseudomonas putida      303     species 6888188 1       1       0.0
Ralstonia pickettii     329     species 6378979 3       2       0.0
Pseudomonas pseudoalcaligenes   330     species 4691662 1       1       0.0171143



Each line contains seven tab-delimited fields; from left to right, they are according to the Centrifuge website:

1. The name of a genome, or the name corresponding to a taxonomic ID (the second column) at a rank higher than the strain.
2. The taxonomic ID.
3. The taxonomic rank.
4. The length of the genome sequence.
5. The number of reads classified to this genomic sequence including multi-classified reads.
6. The number of reads uniquely classified to this genomic sequence.
7. The proportion of this genome normalized by its genomic length.

6.5.3.2. Kraken-like report¶

If we would like to generate a report as generated with the former tool Kraken, we can do it like this:

centrifuge-kreport -x p_compressed evolved-6-R1-results.txt > evolved-6-R1-kreport.txt

  0.00  0       0       U       0       unclassified
78.74  163     0       -       1       root
78.74  163     0       -       131567    cellular organisms
78.74  163     0       D       2           Bacteria
54.67  113     0       P       1224          Proteobacteria
36.60  75      0       C       1236            Gammaproteobacteria
31.18  64      0       O       91347             Enterobacterales
30.96  64      0       F       543                 Enterobacteriaceae
23.89  49      0       G       561                   Escherichia
23.37  48      48      S       562                     Escherichia coli
0.40  0       0       S       564                     Escherichia fergusonii
0.12  0       0       S       208962                  Escherichia albertii
3.26  6       0       G       570                   Klebsiella
3.14  6       6       S       573                     Klebsiella pneumoniae
0.12  0       0       S       548                     [Enterobacter] aerogenes
2.92  6       0       G       620                   Shigella
1.13  2       2       S       623                     Shigella flexneri
0.82  1       1       S       624                     Shigella sonnei
0.50  1       1       S       1813821                 Shigella sp. PAMC 28760
0.38  0       0       S       621                     Shigella boydii



This gives a similar (not the same) report as the Kraken tool. The report is tab-delimited, with one line per taxon. The fields of the output, from left-to-right, are as follows:

1. Percentage of reads covered by the clade rooted at this taxon
2. Number of reads covered by the clade rooted at this taxon
3. Number of reads assigned directly to this taxon
4. A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply “-“.
5. NCBI Taxonomy ID
6. The indented scientific name

6.6. Visualisation (Krona)¶

We use the Krona tools to create a nice interactive visualisation of the taxa content of our sample [ONDOV2011]. Fig. 6.2 shows an example (albeit an artificial one) snapshot of the visualisation Krona provides. Fig. 6.2 is a snapshot of the interactive web-page similar to the one we try to create.

6.6.1. Installation¶

Install Krona with:

source activate ngs
conda install krona


First some house-keeping to make the Krona installation work. Do not worry to much about what is happening here.

# we delete a symbolic link that is not correct
rm -rf ~/miniconda3/envs/ngs/opt/krona/taxonomy

# we create a directory in our home where the krona database will live
mkdir -p ~/krona/taxonomy

# now we make a symbolic link to that directory
ln -s ~/krona/taxonomy ~/miniconda3/envs/ngs/opt/krona/taxonomy


6.6.2. Build the taxonomy¶

We need to build a taxonomy database for Krona. However, if this fails we will skip this step and just download a pre-build one. Lets first try to build one.

ktUpdateTaxonomy.sh ~/krona/taxonomy


Now, if this fails, we download a pre-build taxonomy database for krona:

# Download pre-build database
curl -O http://compbio.massey.ac.nz/data/taxonomy.tab.gz

# we unzip the file
gzip -d taxonomy.tab.gz

# we move the unzipped file to our taxonomy directory we specified in the step before.
mv taxonomy.tab ~/krona/taxonomy


Attention

6.6.3. Visualise¶

Now, we use the tool ktImportTaxonomy from the Krona tools to create the html web-page. We first need build a two column file (read_id<tab>tax_id) as input to the ktImportTaxonomy tool. We will do this by cutting the columns out of either the Kraken or Centrifuge results:

# Kraken
cd kraken
cat evolved-6-R1.kraken | cut -f 2,3 > evolved-6-R1.kraken.krona
ktImportTaxonomy evolved-6-R1.kraken.krona
firefox taxonomy.krona.html

# Centrifuge
cd centrifuge
cat evolved-6-R1-results.txt | cut -f 1,3 > evolved-6-R1-results.krona
ktImportTaxonomy evolved-6-R1-results.krona
firefox taxonomy.krona.html


What happens here is that we extract the second and third column from the Kraken results. Afterwards, we input these to the Krona script, and open the resulting web-page in a bowser. Done!

References

 [KIM2017] Kim D, Song L, Breitwieser FP, Salzberg SL. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 2016 Dec;26(12):1721-1729
 [ONDOV2011] Ondov BD, Bergman NH, and Phillippy AM. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics, 2011, 12(1):385.
 [WOOD2014] Wood DE and Steven L Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 2014, 15:R46. DOI: 10.1186/gb-2014-15-3-r46.