8. Genome annotation

8.1. Preface

In this section you will predict genes and assess your assembly using Augustus and BUSCO, as well as Prokka.

Note

You will encounter some To-do sections at times. Write the solutions and answers into a text-file.

8.2. Overview

The part of the workflow we will work on in this section can be viewed in Fig. 8.1.

../_images/workflow1.png

Fig. 8.1 The part of the workflow we will work on in this section marked in red.

8.3. Learning outcomes

After studying this section of the tutorial you should be able to:

  1. Explain how annotation completeness is assessed using orthologues

  2. Use bioinformatics tools to perform gene prediction

  3. Use genome-viewing software to graphically explore genome annotations and NGS data overlays

8.4. Before we start

Lets see how our directory structure looks so far:

$ cd ~/analysis
$ ls -1F
assembly/
data/
kraken/
mappings/
multiqc_data/
trimmed/
trimmed-fastqc/
variants/

Attention

If you have not run the previous section Genome assembly, you can download the genome assembly needed for this section here: Downloads. Download the file to the ~/analysis directory and decompress. Alternatively on the CLI try:

cd ~/analysis
wget -O assembly.tar.gz https://osf.io/t2zpm/download
tar xvzf assembly.tar.gz

8.5. Installing the software

# activate the env
$ conda create --yes -n anno busco

This will install both the Augustus [STANKE2005] and the BUSCO [SIMAO2015] software, which we will use (separately) for gene prediction and assessment of assembly completeness, respectively.

Make a directory for the annotation results:

$ mkdir annotation
$ cd annotation

We need to get the database that BUSCO will use to assess orthologue presence absence in our genome annotation. BUSCO provides a command to list all available datasets and download datasets.

$ busco --list-datasets
INFO:   Downloading information on latest versions of BUSCO data...

################################################

Datasets available to be used with BUSCOv4 as of 2019/11/27:

bacteria_odb10
    - acidobacteria_odb10
    - actinobacteria_phylum_odb10
        - actinobacteria_class_odb10
            - corynebacteriales_odb10
            - micrococcales_odb10
            - propionibacteriales_odb10
            - streptomycetales_odb10
            - streptosporangiales_odb10
        - coriobacteriia_odb10
            - coriobacteriales_odb10
...

BUSCO will download the dataset when starting an analysis.

We also need to place the configuration file for this program in a location in which we have “write” privileges. Do this recursively for the entire config directory, placing it into your current annotation directory:

$ cp -r ~/miniconda3/envs/anno/config/ .

8.6. Assessment of orthologue presence and absence

BUSCO will assess orthologue presence absence using blastn, a rapid method of finding close matches in large databases (we will discuss this in lecture). It uses blastn to make sure that it does not miss any part of any possible coding sequences. To run the program, we give it

  • A fasta format input file

  • A name for the output files

  • The name of the lineage database against which we are assessing orthologue presence absence (that we downloaded above)

  • An indication of the type of annotation we are doing (genomic, as opposed to transcriptomic or previously annotated protein files).

  • The config file to use

$ busco  -i ../assembly/scaffolds.fasta -o my_anno -l bacteria_odb10 -m geno --config config/config.ini

Navigate into the output directory you created. There are many directories and files in there containing information on the orthologues that were found, but here we are only really interested in one: the summary statistics. This is located in the short_summary*.txt file. Look at this file. It will note the total number of orthologues found, the number expected, and the number missing. This gives an indication of your genome completeness.

Todo

Is it necessarily true that your assembly is incomplete if it is missing some orthologues? Why or why not?

8.7. Annotation with Augustus

We will use Augustus to perform gene prediction. This program implements a hidden markov model (HMM) to infer where genes lie in the assembly you have made. To run the program you need to give it:

  • Information as to whether you would like the genes called on both strands (or just the forward or reverse strands)

  • A “model” organism on which it can base it’s HMM parameters on (in this case we will use E.coli)

  • The location of the assembly file

  • A name for the output file, which will be a .gff (general feature format) file.

  • We will also tell it to display a progress bar as it moves through the genome assembly.

$ augustus --progress=true --strand=both --species=E_coli_K12 --AUGUSTUS_CONFIG_PATH=config ../assembly/scaffolds.fasta > augustus.gff

Note

Should the process of producing your annotation fail, you can download a annotation manually from Downloads. Remember to unzip the file.

8.8. Annotation with Prokka

Install Prokka:

$ conda create --yes -n prokka prokka
$ conda activate prokka

Run Prokka:

$ prokka --kingdom Bacteria --genus Escherichia --species coli --outdir annotation assembly/scaffolds.fasta

Your results will be in the annotation directory with the prefix PROKKA.

8.9. Interactive viewing

We will use the software IGV to view the assembly, the gene predictions you have made, and the variants that you have called, all in one window.

8.9.1. IGV

$ conda activate anno
$ conda install --yes igv

To run IGV type:

$ igv

This will open up a new window. Navigate to that window and open up your genome assembly:

  • Genomes -> Load Genome from File

  • Load your assembly (scaffolds.fasta), not your gff file.

Load the tracks:

  • File -> Load from File

  • Load your unzipped vcf file from section: Variant calling

  • Load your unzipped gff file from this section.

At this point you should be able to zoom in and out to see regions in which there are SNPs or other types of variants. You can also see the predicted genes. If you zoom in far enough, you can see the sequence (DNA and protein).

If you have time and interest, you can right click on the sequence and copy it. Open a new browser window and go to the blastn homepage. There, you can blast your gene of interest (GOI) and see if blast can assign a function to it.

The end goal of this lab will be for you to select a variant that you feel is interesting (e.g. due to the gene it falls near or within), and hypothesize as to why that mutation might have increased in frequency in these evolving populations.

References

SIMAO2015

Simao FA, Waterhouse RM, Ioannidis P, Kriventseva EV and Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics, 2015, Oct 1;31(19):3210-2

STANKE2005

Stanke M and Morgenstern B. AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Res, 2005, 33(Web Server issue): W465–W467.