8. Genome annotation¶
You will encounter some To-do sections at times. Write the solutions and answers into a text-file.
8.3. Learning outcomes¶
After studying this section of the tutorial you should be able to:
Explain how annotation completeness is assessed using orthologues
Use bioinformatics tools to perform gene prediction
Use genome-viewing software to graphically explore genome annotations and NGS data overlays
8.4. Before we start¶
Lets see how our directory structure looks so far:
$ cd ~/analysis $ ls -1F
assembly/ data/ kraken/ mappings/ multiqc_data/ trimmed/ trimmed-fastqc/ variants/
If you have not run the previous section Genome assembly, you can download the genome assembly needed for this section here: Downloads. Download the file to the
~/analysis directory and decompress. Alternatively on the CLI try:
cd ~/analysis wget -O assembly.tar.gz https://osf.io/t2zpm/download tar xvzf assembly.tar.gz
8.5. Installing the software¶
# activate the env $ conda create --yes -n anno busco
Make a directory for the annotation results:
$ mkdir annotation $ cd annotation
$ busco --list-datasets
INFO: Downloading information on latest versions of BUSCO data... ################################################ Datasets available to be used with BUSCOv4 as of 2019/11/27: bacteria_odb10 - acidobacteria_odb10 - actinobacteria_phylum_odb10 - actinobacteria_class_odb10 - corynebacteriales_odb10 - micrococcales_odb10 - propionibacteriales_odb10 - streptomycetales_odb10 - streptosporangiales_odb10 - coriobacteriia_odb10 - coriobacteriales_odb10 ...
BUSCO will download the dataset when starting an analysis.
We also need to place the configuration file for this program in a location in which we have “write” privileges. Do this recursively for the entire config directory, placing it into your current annotation directory:
$ cp -r ~/miniconda3/envs/anno/config/ .
8.6. Assessment of orthologue presence and absence¶
BUSCO will assess orthologue presence absence using blastn, a rapid method of finding close matches in large databases (we will discuss this in lecture). It uses blastn to make sure that it does not miss any part of any possible coding sequences. To run the program, we give it
A fasta format input file
A name for the output files
The name of the lineage database against which we are assessing orthologue presence absence (that we downloaded above)
An indication of the type of annotation we are doing (genomic, as opposed to transcriptomic or previously annotated protein files).
The config file to use
$ busco -i ../assembly/scaffolds.fasta -o my_anno -l bacteria_odb10 -m geno --config config/config.ini
Navigate into the output directory you created.
There are many directories and files in there containing information on the orthologues that were found, but here we are only really interested in one: the summary statistics.
This is located in the
Look at this file.
It will note the total number of orthologues found, the number expected, and the number missing.
This gives an indication of your genome completeness.
Is it necessarily true that your assembly is incomplete if it is missing some orthologues? Why or why not?
We will use Augustus to perform gene prediction. This program implements a hidden markov model (HMM) to infer where genes lie in the assembly you have made. To run the program you need to give it:
Information as to whether you would like the genes called on both strands (or just the forward or reverse strands)
A “model” organism on which it can base it’s HMM parameters on (in this case we will use E.coli)
The location of the assembly file
A name for the output file, which will be a .gff (general feature format) file.
We will also tell it to display a progress bar as it moves through the genome assembly.
$ augustus --progress=true --strand=both --species=E_coli_K12 --AUGUSTUS_CONFIG_PATH=config ../assembly/scaffolds.fasta > augustus.gff
Should the process of producing your annotation fail, you can download a annotation manually from Downloads. Remember to unzip the file.
$ conda create --yes -n prokka prokka $ conda activate prokka
$ prokka --kingdom Bacteria --genus Escherichia --species coli --outdir annotation assembly/scaffolds.fasta
Your results will be in the
annotation directory with the prefix
8.9. Interactive viewing¶
We will use the software IGV to view the assembly, the gene predictions you have made, and the variants that you have called, all in one window.
$ conda activate anno $ conda install --yes igv
To run IGV type:
This will open up a new window. Navigate to that window and open up your genome assembly:
Genomes -> Load Genome from File
Load your assembly (
scaffolds.fasta), not your gff file.
Load the tracks:
File -> Load from File
Load your unzipped
vcffile from section: Variant calling
Load your unzipped
gfffile from this section.
At this point you should be able to zoom in and out to see regions in which there are SNPs or other types of variants. You can also see the predicted genes. If you zoom in far enough, you can see the sequence (DNA and protein).
If you have time and interest, you can right click on the sequence and copy it. Open a new browser window and go to the blastn homepage. There, you can blast your gene of interest (GOI) and see if blast can assign a function to it.
The end goal of this lab will be for you to select a variant that you feel is interesting (e.g. due to the gene it falls near or within), and hypothesize as to why that mutation might have increased in frequency in these evolving populations.
Simao FA, Waterhouse RM, Ioannidis P, Kriventseva EV and Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics, 2015, Oct 1;31(19):3210-2
Stanke M and Morgenstern B. AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Res, 2005, 33(Web Server issue): W465–W467.