Genome Assembly

Putting the pieces together

Author

Prof. Peter Kille & Dr. Sarah Christofides

Published

September 16, 2024

We will now assemble the genomes for the mystery prokaryotes in the datasets folder. You can either copy them over to mydata/Session2 or work directly from mydata/datasets/prokaryotes (I suggest sticking with whichever approach you used for QC and trimming). Whichever you choose to do, always remember to keep your folder system tidy. If you are struggling to remember how it’s arranged, grab some paper and sketch it out as a visual reminder.

As you go through each step of the assembly process, use loops to apply it to all the prokaryote genomes you’ve been given.

Pick one prokaryote genome and work through assembling it. Use the time later in the extension session to use loops to apply the process to all the prokaryote genomes.

Either way, write scripts rather than running the programmes by typing directly into the terminal. This means you have every step saved so it can be revisited whenever you need.

The bioinformatic process to assemble a genome

Below is a walkthrough on the steps necessary to assemble a mitochondrial or bacterial genome. Don’t get lost in the terminal with typing commands meaninglessly: step back and think about the bioinformatic process to get to the end goal.

We will use multiple bioinformatic packages to assemble a genome and provide assembly statistics

fastQC – Quality control of reads

fastp – adapter removal and trimming tool

Unicycler – genome assembler

Quast – genome assembly statistics

Bandage (optional) – view graphical fragment assembly (gfa) files

Remember that you can access the “help” option for almost all bioinformatics tools by executing the name of the tool with no flags/options, or adding -h or –help. After using a program, don’t forget to unload the module.

tmux is a terminal multiplexer. It lets you switch easily between several programs in one terminal, detach them (they keep running in the background) and reattach them to a different terminal. If you run a script in the terminal and then close the session/turn off the laptop, the script will be cancelled. To avoid this, we use tmux. This will allow you to run scripts and programs in the terminal, close the terminal/turn off your laptop, and the script will continue to run.

module load tmux/3.2a

Basic usage cheatsheet: https://tmuxcheatsheet.com/

Start a new session:

tmux new -s [session name]

Close a session:

Ctrl + b, then d

List available sessions:

tmux list

Re-join an existing session:

tmux a -t [session name]

Delete a session:

tmux kill-session -t [session name]

1. Quality checking data

You will often start with raw sequence data in FASTQ format. You first need to check the quality of the data before proceeding with the assembly; this can be achieved using fastQC.

2. Trimming and adapter removal

Raw sequence data may still contain fragments of the adapter sequences from the sequencing process; these artificial sequences need to be removed. Low quality bases that may occur toward the end of reads can also be trimmed to improve the overall sequence quality. These steps will be carried out using fastp. After this step you will have processed reads.

3. Re-assessing data quality

Following adaptor removal and trimming, we need to repeat the quality checking with fastQC, but this time using the processed reads.

Genome assembly

We will assemble the processed reads into an assembly using the assembler Unicycler. Unicycler is an assembly pipeline for bacterial (and mitochondrial) genomes. It can assemble Illumina-only read sets where it functions as a SPAdes-optimiser.

module load unicycler/v0.5.0

Basic usage:

unicycler -1 [processed fastq R1] -2 [processed fastq R2]  -t [thread number] -o [output directory]

Each assembly will be located in a different directory within the parent directory indicated by the -o option

Exercise: Unicycler

Write a script to assemble your genome(s) with Unicycler.

If working on more than one genome: rename assembly files and copy to new directory

A big part of bioinformatics is maintaining directory and file organisation. Each mitochondrial genome assembly output by Unicycler will have the same generic name, assembly.fasta, albeit located in a different folder. We need to rename these files to reflect the input data. Compose a loop to copy and rename these files to a new directory.

Assembly statistics

Once our assembly files have been renamed and copied to a single location, we can analyse them for quality statistics. Use Quast to report useful metrics such as assembly length, GC content, contig number, and N50 value. QUAST stands for QUality ASsessment Tool. It evaluates genome/metagenome assemblies by computing various metrics.

module load quast/5.2.0

Basic usage:

quast.py -t [thread number] -o [output directory] [input fasta file]

The output directory contains the report in multiple formats. To view the report on the terminal use the cat command which prints the contents of a file to the terminal (don’t forget to navigate to the file first as it’s in a different directory!):

cat [report.txt]
Exercise: Quast

Write a script to assess your genome assembly with Quast.

One of the files produced by Unicycler is an assembly graph (.gfa file extension). This file details the links between contigs that were produced during the assembly and can provide valuable information on the difficult-to-assemble regions of the genome. Bandage is a program for visualising de novo assembly graphs. By displaying connections which are not present in the contigs file, Bandage opens up new possibilities for analysing de novo assemblies. Download to your local laptop from the link: https://rrwick.github.io/Bandage/.

Worked Example of Assembly

Extension Work

Go to the mitochondrial files and apply the same process to them, using a loop.