module load tmux/3.4-pp5yxaf
We will now assemble the genomes for the mystery virus in the datasets folder. You can either copy them over to mydata/Session2 or work directly from mydata/datasets/virus (I suggest sticking with whichever approach you used for QC and trimming). Whichever you choose to do, always remember to keep your folder system tidy. If you are struggling to remember how it’s arranged, grab some paper and sketch it out as a visual reminder.
There are two ways to work through this session:
Option 1: As you go through each step of the assembly process, use loops to apply it to all the virus genomes you’ve been given.
Option 2: Pick one virus genome and work through assembling it, then use the time later in the extension session to apply the process to all the virus genomes with loops.
Either way, write scripts rather than running the programmes by typing directly into the terminal. This means you have every step saved so it can be revisited whenever you need.
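As a starting point for such a script, the sketch below loops over paired-end read files and derives a sample name from each forward-read file. The _R1.fastq.gz / _R2.fastq.gz naming pattern is an assumption made for illustration; adjust it to match your own files.

```shell
#!/usr/bin/env bash
# Sketch of a loop over paired-end samples.
# ASSUMPTION: reads are named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz.
# This pattern is hypothetical, so change it to match your data.

# Print one sample name per line for every R1 file in the given directory.
list_samples() {
    local dir="$1" r1
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue       # the pattern matched nothing
        basename "$r1" _R1.fastq.gz    # strip the directory and the suffix
    done
}

# Report the read pair that belongs to each sample.
reads_dir="${1:-mydata/datasets/virus}"
list_samples "$reads_dir" | while read -r sample; do
    echo "Sample ${sample}: ${sample}_R1.fastq.gz + ${sample}_R2.fastq.gz"
done
```

Once the loop prints the pairs you expect, replace the echo line with the command for the step you are working on.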
The bioinformatic process to assemble a genome
Below is a walkthrough of the steps needed to assemble a mitochondrial or bacterial genome. Don’t get lost typing commands into the terminal without understanding them: step back and think about the bioinformatic process that gets you to the end goal.
We will use several bioinformatic packages to assemble a genome and report assembly statistics:
FastQC – quality control of reads
fastp – adapter removal and trimming tool
Unicycler – genome assembler
QUAST – genome assembly statistics
Bandage (optional) – viewer for graphical fragment assembly (GFA) files
Remember that you can access the “help” option for almost all bioinformatics tools by running the name of the tool with no flags/options, or by adding -h or --help. After using a program, don’t forget to unload the module.
tmux is a terminal multiplexer. It lets you switch easily between several programs in one terminal, detach them (they keep running in the background) and reattach them to a different terminal. If you run a script directly in the terminal and then close the session or turn off your laptop, the script will be cancelled. Running it inside a tmux session avoids this: you can close the terminal or turn off your laptop and the script will continue to run.
Basic usage cheatsheet: https://tmuxcheatsheet.com/
Start a new session:
tmux new -s [session name]
Detach from a session (it keeps running in the background):
Ctrl + b, then d
List available sessions:
tmux ls
Re-join an existing session:
tmux a -t [session name]
Delete a session:
tmux kill-session -t [session name]
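The commands above can also be combined so that a script is launched directly inside a detached session. The sketch below is one way to do that; the function name, session name and script name are hypothetical, and the DRY_RUN switch is only there so you can preview the command before running it.

```shell
#!/usr/bin/env bash
# Sketch: launch a script inside a detached tmux session.
# ASSUMPTION: the session and script names passed in are hypothetical.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Start <script> in a new detached session called <session>.
start_in_tmux() {
    local session="$1" script="$2"
    # -d creates the session without attaching, so the prompt returns at once.
    run tmux new -d -s "$session" "bash $script"
}
```

For example, start_in_tmux assembly assemble_virus.sh starts the script in a session named assembly, and tmux a -t assembly re-joins it to check progress. The session closes by itself when the script finishes.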
1. Quality checking data
You will often start with raw sequence data in FASTQ format. Before proceeding with the assembly, check the quality of the data using FastQC.
2. Trimming and adapter removal
Raw sequence data may still contain fragments of the adapter sequences from the sequencing process; these artificial sequences need to be removed. Low quality bases that may occur toward the end of reads can also be trimmed to improve the overall sequence quality. These steps will be carried out using fastp. After this step you will have processed reads.
3. Re-assessing data quality
Following adapter removal and trimming, we need to repeat the quality checking with FastQC, but this time using the processed reads.
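Steps 1 to 3 can be sketched as a single function for one paired-end sample. The file naming, the output directory names (qc_raw, trimmed, qc_trimmed) and the choice of fastp options are assumptions for illustration and may differ from what you used in the QC and trimming session.

```shell
#!/usr/bin/env bash
# Sketch: quality check, trim, then re-check one paired-end sample.
# ASSUMPTIONS: reads are named <sample>_R1.fastq.gz / <sample>_R2.fastq.gz and
# the output directories qc_raw, trimmed and qc_trimmed are hypothetical.
# On the cluster, load the FastQC and fastp modules first.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

qc_and_trim() {
    local sample="$1" reads_dir="$2" threads="${3:-4}"
    local r1="${reads_dir}/${sample}_R1.fastq.gz"
    local r2="${reads_dir}/${sample}_R2.fastq.gz"
    local t1="trimmed/${sample}_R1.trimmed.fastq.gz"
    local t2="trimmed/${sample}_R2.trimmed.fastq.gz"

    # FastQC needs its output directory to exist already.
    mkdir -p qc_raw trimmed qc_trimmed

    # 1. Quality check the raw reads.
    run fastqc -t "$threads" -o qc_raw "$r1" "$r2"

    # 2. Remove adapters and trim low-quality bases from the read tails.
    run fastp -i "$r1" -I "$r2" -o "$t1" -O "$t2" \
        --detect_adapter_for_pe --cut_tail -w "$threads" \
        -j "trimmed/${sample}.fastp.json" -h "trimmed/${sample}.fastp.html"

    # 3. Quality check the processed reads.
    run fastqc -t "$threads" -o qc_trimmed "$t1" "$t2"
}
```

Calling qc_and_trim virusA mydata/datasets/virus 4 would process a sample called virusA (a hypothetical name). Comparing the reports in qc_raw and qc_trimmed shows what the trimming changed.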
Genome assembly
We will assemble the processed reads using Unicycler, an assembly pipeline for bacterial (and mitochondrial) genomes. For Illumina-only read sets it functions as a SPAdes optimiser.
module load py-unicycler/0.5.0-d6k3wzh
Basic usage:
unicycler -1 [processed fastq R1] -2 [processed fastq R2] -t [thread number] -o [output directory]
Unicycler writes each assembly into the directory given by the -o option, so give every genome its own output directory (for example, one sub-directory per genome inside a shared parent directory).
Write a script to assemble your genome(s) with Unicycler.
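If you are unsure where to start, the sketch below wraps the basic usage in a function that assembles one sample. The trimmed-read file names, the unicycler_out directory and the function name are hypothetical; only the Unicycler options come from the basic usage line above.

```shell
#!/usr/bin/env bash
# Sketch: assemble one sample with Unicycler.
# ASSUMPTIONS: processed reads are in trimmed/ and named
# <sample>_R1.trimmed.fastq.gz / <sample>_R2.trimmed.fastq.gz (hypothetical).
# On the cluster, load the module first: module load py-unicycler/0.5.0-d6k3wzh

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

assemble_sample() {
    local sample="$1" threads="${2:-4}"
    local r1="trimmed/${sample}_R1.trimmed.fastq.gz"
    local r2="trimmed/${sample}_R2.trimmed.fastq.gz"
    # Each sample gets its own output directory: unicycler_out/<sample>/
    run unicycler -1 "$r1" -2 "$r2" -t "$threads" -o "unicycler_out/${sample}"
}
```

Calling assemble_sample virusA 8 assembles one (hypothetically named) sample with 8 threads. To assemble several genomes, call the function inside a loop over your sample names.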
If working on more than one genome: rename assembly files and copy to new directory
A big part of bioinformatics is maintaining directory and file organisation. Each mitochondrial genome assembly output by Unicycler will have the same generic name, assembly.fasta, albeit located in a different folder. We need to rename these files to reflect the input data. Compose a loop to copy and rename these files to a new directory.
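One possible shape for that loop is sketched below. It assumes each Unicycler output sits in its own folder named after the sample (unicycler_out/<sample>/assembly.fasta); the directory names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: copy every Unicycler assembly into one directory, renamed after
# the folder it came from.
# ASSUMPTION: outputs live in <unicycler_dir>/<sample>/assembly.fasta and the
# default directory names below are hypothetical.

collect_assemblies() {
    local unicycler_dir="$1" dest_dir="$2" fasta sample
    mkdir -p "$dest_dir"
    for fasta in "$unicycler_dir"/*/assembly.fasta; do
        [ -e "$fasta" ] || continue                 # no assemblies found
        sample=$(basename "$(dirname "$fasta")")    # folder name = sample name
        cp "$fasta" "${dest_dir}/${sample}.fasta"   # copy, keeping the original
    done
}

collect_assemblies "${1:-unicycler_out}" "${2:-assemblies}"
```

Using cp rather than mv leaves the original Unicycler output untouched, so the loop can be re-run safely if something goes wrong.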
Assembly statistics
Once our assembly files have been renamed and copied to a single location, we can assess their quality. QUAST (QUality ASsessment Tool) evaluates genome and metagenome assemblies by computing metrics such as assembly length, GC content, contig number, and N50 value.
module load py-quast/5.2.0-hgcvroq
Basic usage:
quast.py -t [thread number] -o [output directory] [input fasta file]
The output directory contains the report in multiple formats. To view the report in the terminal, use the cat command, which prints the contents of a file to the terminal (don’t forget to navigate to the file first as it’s in a different directory!):
cat [report.txt]
Write a script to assess your genome assembly with Quast.
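A possible starting point is sketched below: it runs QUAST once on every FASTA file in a directory and then prints the text report. The directory names and the function name are hypothetical; the QUAST options are those from the basic usage line above.

```shell
#!/usr/bin/env bash
# Sketch: run QUAST on every renamed assembly, then print the text report.
# ASSUMPTION: the directory names you pass in (e.g. assemblies, quast_out)
# are hypothetical.
# On the cluster, load the module first: module load py-quast/5.2.0-hgcvroq

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

assess_assemblies() {
    local assemblies_dir="$1" out_dir="$2" threads="${3:-4}"
    # Collect the FASTA files; stop early if there are none.
    set -- "$assemblies_dir"/*.fasta
    if [ ! -e "$1" ]; then
        echo "No FASTA files found in ${assemblies_dir}" >&2
        return 1
    fi
    # QUAST accepts several FASTA files and reports them side by side.
    run quast.py -t "$threads" -o "$out_dir" "$@"
    # report.txt is the plain-text summary in the output directory.
    if [ -f "${out_dir}/report.txt" ]; then
        cat "${out_dir}/report.txt"
    fi
}
```

Calling assess_assemblies assemblies quast_out 4 gives one report with a column per genome, which makes the assemblies easy to compare.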
One of the files produced by Unicycler is an assembly graph (.gfa file extension). This file details the links between contigs that were produced during the assembly and can provide valuable information on the difficult-to-assemble regions of the genome. Bandage is a program for visualising de novo assembly graphs. By displaying connections which are not present in the contigs file, Bandage opens up new possibilities for analysing de novo assemblies. Download it to your laptop from https://rrwick.github.io/Bandage/.
Worked Example of Assembly
Go to the mitochondrial files and apply the same process to them, using a loop.