4. Genome assembly

4.1. Preface

In this section we will use our command-line skills to create a genome assembly from sequencing data.

Note

You will encounter some To-do sections at times. Write the solutions and answers into a text file.

4.2. Overview

The part of the workflow we will work on in this section can be viewed in Fig. 4.1.

4.3. Learning outcomes

After studying this tutorial you should be able to:

1. Compute and interpret a whole genome assembly.
2. Judge the quality of a genome assembly.

4.4. Before we start

Let's see how our directory structure looks so far:

cd ~/analysis
ls -1F

data/
trimmed/
trimmed-fastqc/


Due to the size of the data sets you may find that the assembly takes a lot of time to complete, especially on older hardware. To mitigate this problem we can randomly select a subset of sequences we are going to use at this stage of the tutorial. To do this we will install another program:

conda activate ngs
conda install seqtk


Now that seqtk has been installed, we are going to sample 10% of the original reads:

# change directory
cd ~/analysis
# create directory
mkdir sampled

seqtk sample -s11 trimmed/ancestor-R1.trimmed.fastq.gz 0.1 | gzip > sampled/ancestor-R1.trimmed.fastq.gz
seqtk sample -s11 trimmed/ancestor-R2.trimmed.fastq.gz 0.1 | gzip > sampled/ancestor-R2.trimmed.fastq.gz


If you work with the sampled reads, change the input directory from trimmed/ to sampled/ in the commands below.

Note

The -s option specifies the seed value for the random number generator. It needs to be the same for file 1 and file 2 so that the same reads are sampled from both files and read pairs stay together.
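One way to check that the two sampled files still contain matching pairs is to compare their read counts and read names. The following is a minimal sketch, assuming gzip-compressed FASTQ files with exactly four lines per record. The helper names count_reads and read_names are made up for illustration, and the toy files are created only so that the example runs on its own.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count reads in a gzip-compressed FASTQ file,
# assuming every record spans exactly four lines.
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Hypothetical helper: print read names from the header lines,
# without the leading "@" and without a trailing /1 or /2 mate suffix.
read_names() {
    gzip -dc "$1" | awk 'NR % 4 == 1 {
        name = $1
        sub(/^@/, "", name)
        sub(/\/[12]$/, "", name)
        print name
    }'
}

# Toy paired files (assumption: stand-ins for the sampled R1/R2 files).
tmp=$(mktemp -d)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGA\n+\nIIII\n' | gzip > "$tmp/R1.fastq.gz"
printf '@r1/2\nACGT\n+\nIIII\n@r2/2\nCCAT\n+\nIIII\n' | gzip > "$tmp/R2.fastq.gz"

n1=$(count_reads "$tmp/R1.fastq.gz")
n2=$(count_reads "$tmp/R2.fastq.gz")
echo "R1: $n1 reads, R2: $n2 reads"

# Pairs are in sync if both files have the same number of reads
# and list the same read names in the same order.
if [ "$n1" = "$n2" ] && \
   [ "$(read_names "$tmp/R1.fastq.gz")" = "$(read_names "$tmp/R2.fastq.gz")" ]; then
    pairs_ok=yes
else
    pairs_ok=no
fi
echo "pairs in sync: $pairs_ok"
```

On the real data the same helpers could be pointed at sampled/ancestor-R1.trimmed.fastq.gz and sampled/ancestor-R2.trimmed.fastq.gz. Comparing all names of large files can be slow, so restricting the comparison to the first few thousand names may be preferable.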

Note

Note that by reducing the number of reads that go into the assembly, we are losing information that could otherwise be used to build it. Thus, the assembly will likely be much worse than one built from the complete dataset.

4.5. Creating a genome assembly

We want to create a genome assembly for our ancestor. We are going to use the quality-trimmed forward and reverse reads and a program called SPAdes to build a genome assembly.

Todo

1. Discuss briefly why we are using the ancestral sequences to create a reference genome as opposed to the evolved line.

4.5.1. Installing the software

We are going to use a program called SPAdes for assembling our genome. In a recent evaluation of assembly software, SPAdes was found to be a good choice for fungal genomes [ABBAS2014]. It is also simple to install and use.

conda activate ngs
conda install spades


# change to your analysis root folder
cd ~/analysis

# first create an output directory for the assemblies
mkdir assembly

# to get help for spades and an overview of the parameters, type:
spades.py -h


The two files we need to submit to SPAdes are the two paired-end read files (forward and reverse).

spades.py -o assembly/spades-default/ -1 trimmed/ancestor-R1.trimmed.fastq.gz -2 trimmed/ancestor-R2.trimmed.fastq.gz


Todo

1. Run SPAdes with default parameters on the ancestor.
2. Run SPAdes a second time, but use the options suggested in the SPAdes manual section 3.4 for assembling 2x150bp paired-end reads (are fungi multicellular?). Use a different output directory, assembly/spades-150, for this run.

Hint

Should you not get it right, try the commands in Code: SPAdes assembly (trimmed data).

4.6. Assembly quality assessment

4.6.1. Assembly statistics

Quast (QUality ASsessment Tool) [GUREVICH2013] evaluates genome assemblies by computing various metrics, including:

• N50: the contig length such that all contigs of that length or longer together cover at least 50% of the assembly length
• NG50: like N50, but 50% of the reference genome length has to be covered instead of the assembly length
• NA50 and NGA50: like N50 and NG50, but aligned blocks are used instead of contigs
• misassemblies: misassembled and unaligned contigs or contig bases
• genes and operons covered
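As a worked example of the N50 definition, the following sketch computes N50 from a list of contig lengths with sort and awk. The helper names n50 and fasta_lengths and the toy data are illustrative assumptions; this is not how Quast itself is implemented.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read one contig length per line on stdin and print N50.
n50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }        # store lengths (longest first) and sum them
        END {
            for (i = 1; i <= NR; i++) {
                running += len[i]            # cumulative length of the i longest contigs
                if (running >= total / 2) {  # at least 50% of the assembly is covered
                    print len[i]
                    exit
                }
            }
        }'
}

# Hypothetical helper: print the length of every sequence in a FASTA file.
fasta_lengths() {
    awk '/^>/ { if (seen) print n; n = 0; seen = 1; next }
         { n += length($0) }
         END { if (seen) print n }' "$1"
}

# Toy assembly with contigs of 80, 70, 50, 40, 30 and 20 bases (290 in total).
# The two longest contigs sum to 150, which is at least 145, so N50 is 70.
toy_n50=$(printf '%s\n' 80 70 50 40 30 20 | n50)
echo "N50 = $toy_n50"

# The same calculation starting from a toy FASTA file (contigs of 10 and 4 bases).
tmp=$(mktemp -d)
printf '>c1\nACGTACGT\nAC\n>c2\nACGT\n' > "$tmp/toy.fasta"
fasta_n50=$(fasta_lengths "$tmp/toy.fasta" | n50)
echo "N50 (toy FASTA) = $fasta_n50"
```

Applied to a real assembly, the input would be a scaffolds.fasta file produced by SPAdes, and the result can be cross-checked against the value Quast reports.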

With Quast it is easy to compare these measures among several assemblies. The program can also be used on its website; here we install it locally.

conda install quast


Run Quast with both assembly scaffolds.fasta files to compare the results.

Note

Should you be unable to run SPAdes on the data, you can manually download the assembly from Downloads. Unarchive and uncompress the files with tar -xvzf assembly.tar.gz.

quast -o assembly/quast assembly/spades-default/scaffolds.fasta assembly/spades-150/scaffolds.fasta
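Quast writes its summary into the output directory in several formats, including a tab-separated report.tsv. The sketch below pulls a single metric out of such a table, assuming one metric per row with the metric name in the first column and one further column per assembly. The function name quast_metric is hypothetical, and the toy table only imitates the assumed layout with made-up values.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the row for one metric from a tab-separated report.
# Usage: quast_metric <report.tsv> <metric name>
quast_metric() {
    awk -F '\t' -v metric="$2" '$1 == metric' "$1"
}

# Toy table imitating the assumed layout (all values are made up for illustration).
tmp=$(mktemp -d)
printf 'Assembly\tasm_a\tasm_b\n# contigs\t12\t9\nN50\t1000\t2000\n' > "$tmp/report.tsv"

# Print the N50 row: metric name followed by one value per assembly.
quast_metric "$tmp/report.tsv" "N50"
```

With the real output this could be called as quast_metric assembly/quast/report.tsv "N50", which makes it easy to copy individual statistics into your notes.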


Todo

1. Compare the results of Quast with regard to the two different assemblies.
2. Which one do you prefer and why?

4.7. Compare the untrimmed data

Todo

1. To see if our trimming procedure has an influence on our assembly, run the same command you used on the trimmed data on the original untrimmed data.
2. Run Quast on the assembly and compare the statistics to those derived for the trimmed data set. Write down your observations.

Hint

Should you not get it right, try the commands in Code: SPAdes assembly (original data).

4.8. Assemblathon

Todo

Now that you know the basics of assembling a genome and judging its quality, play with the SPAdes parameters and the trimmed data to create the best assembly possible. We will compare the assemblies to find out who created the best one.

Todo

1. Once you have your final assembly, rename your assembly directory to spades_final, e.g. mv assembly/spades-default assembly/spades_final.
2. Write down in your notes the command used to create your final assembly.
3. Write down in your notes the assembly statistics derived through Quast.