NGS Quality Control

First steps - making sure your data is good

Author

Dr. Sarah Christofides

Published

October 16, 2025

Working with Quality Control

Start off by logging onto the server and navigating to your mydata directory.

FASTA and FASTQ

The two commonest file types for sequencing data are FASTA and FASTQ. (To make life a little more complicated, both have more than one file extension. FASTAs can be .fasta, .fa or .fna; FASTQs can be .fastq or .fq)

A FASTA is a 2-line file format. Each sequence has a header line, which always starts with > and contains an identifier for the sequence. The next line contains the sequence itself. For example, a FASTA file with two sequences in it could look like this:

>Sequence 1 
ACGTGCTTCCGGTTTCAGGGTCA
>Sequence 2
GTACTTAACCTAAACTGGACTAA

A FASTQ is an extension of a FASTA that includes quality scores for each base of the sequence. It is a 4-line file format. The first line is still a header, but now begins with @, and the second line is the sequence. The third line is always a single + that separates the sequence from the quality scores on the fourth line. A FASTQ of those same two sequences could look like this:

@Sequence 1 
ACGTGCTTCCGGTTTCAGGGTCA
+
R%!JSQA(AD\@ASDIA&ASD&!N
@Sequence 2
GTACTTAACCTAAACTGGACTAA
+
&DJSA)ADFLA9DF5*J34AQBS
Exercise: looking at sequence files
  1. Copy the Session2 folder from classdata to mydata, and cd into it.

  2. Read the file subsample_Ill1.fasta using one (or more!) of the commands cat, less, head and tail.

  3. Using your method of choice, look at Illumina_R1.fastq. Can you see the difference?

Don’t forget to use tab completion!

Variables

Variables are a very important concept in any coding language. The idea is to assign a name to some object in your environment: a dataset, a number, some text, a filename, the output of a function… pretty much anything! The R fans among you will already be familiar with them in use cases like:

# Read in a dataset and save it as dframe1
dframe1<-read.csv(file.choose())

#Run a linear regression and save the outputs as mod1
mod1<-lm(dframe1$y ~ dframe1$x)

In the above examples, dframe1 and mod1 are variables because what they contain can vary: you could read in any dataset you like and call it dframe1. Likewise, you could run model after model and call the outputs mod1, overwriting the previous results each time.

In R, to ‘call’ (i.e. access) the contents of a variable you simply use its name. For example:

# Run the summary command on dframe1
summary(dframe1)

#Print the contents of mod1
mod1

Linux also makes extensive use of variables, but the syntax is a little different. Instead of <-, a variable is set with = (there are also other ways, which you will see shortly). However, to call a variable in Linux you use a dollar sign and curly brackets.

Exercise: set and call variables in Linux
  1. Create a variable hi to contain the words “Hello world”.
hi="Hello world"
  1. Try printing the variable (the print command in Linux is echo) just using its name as you would in R. Does it work? What happens?
echo hi
  1. Now try again using the Linux syntax. Can you see what is different?
echo ${hi}

In Linux, if you want to set a variable to contain the output of a command, you need a little bit of extra syntax:

mydir=$(pwd)

The above example runs the command pwd and saves the output to the variable mydir. Note that the command has to be in round brackets and have a dollar sign in front of it. Try this out for yourself, and then try calling the variable like you did in the previous exercise.

Rules for variables in Linux

  1. Variables can be set with the = sign.

  2. Text strings should be in “quotation marks”.

  3. To save the outputs of a command to a variable, the command needs to be in round brackets with a dollar sign in front.

  4. To call a variable, use its name in curly brackets with a dollar sign in front.

FastQC

In the pantheon of great bioinformatics software, FastQC has a special place. It’s free, simple to use and has been around for ages, and everyone uses it because it’s just so good. FastQC will take your FASTQ file and produce a nice easy-to-read report about it with loads of information.

To use FastQC on the server, we will first need to load the module. This is the equivalent of starting up a programme on your desktop computer. You can find out what modules are available on the server by typing the command

module avail

You should see that the list includes fastqc/v0.12.1. To load this module, simply type

module load fastqc/v0.12.1

Try this out now!

As mentioned before, FastQC is very easy to run. At its most basic (without any extra options), you simply type (substituting file.fastq for the name of the FASTQ file you want to check):

fastqc file.fastq

There is one extra option that is useful to include, and that is the -o flag. This stands for ‘output’ and tells FastQC what directory to put its output files in.

fastqc -o output_dir file.fastq
Exercise: using FastQC
  1. Make a directory called fastqc inside your Session2 folder.

  2. Try running FastQC on Illumina_R1.fastq, using your fastqc directory for the output.

  3. Use ls to see what output it has created.

  4. Using the MobaXterm sftp file transfer window or FileZilla, transfer the html file to your desktop and open it.

Reminder: if using FileZilla, the details are:

  • Host: sponsa.bios.cf.ac.uk (for masters students) or hawker.bios.cf.ac.uk (for Y3 students)
  • Username: your username
  • Password: your password
  • Port: Your assigned port number
  1. [Extension] Look up the help page using fastqc -h. What other options are available?

Running and Reusing Scripts

So far you have just been typing commands directly into the command line. This is great for exploring directories etc., but now you are starting to do some proper analysis you’ll soon find yourself asking questions like

  • what if I want to run this again?
  • what if I want to check exactly what I did?
  • what if I want to show someone else what I did?

These are excellent questions, and they can all be answered by writing SCRIPTS! If that sounds a bit daunting, a script is simply a text file that you write your commands into. By adding a special line of code called a shebang, the computer can run the commands in this text file as though you had typed them directly onto the command line.

The shebang is simply

#!/usr/bin/bash
Exercise: a simple script

Let’s create a file with nano and write a simple script in it. By convention, script files are given the extension .sh rather than .txt

nano test_nano.sh
  • type the shebang at the top #!/usr/bin/bash
  • Type ls on the next line
  • Exit using ctrl-X, save and return to command line
  • Now we need to make the file executable (so it can be run as a programme)
chmod a+x [script name]

this is shorthand for chmod (change modify) a(all)+(add)execute(x) [script name] – thus changing the permission to allow everyone to execute a script. For a very helpful guide to chmod see https://en.wikipedia.org/wiki/Chmod

Now run the program!

./test_nano.sh

What output do you get? It is what you expected?

Quality trimming using fastp

Having assessed the quality of our sequence data, the next thing we want to do is clean it up. There are many programmes available for the task, but we will use the one-stop-shop software fastp which has won many friends by being both comprehensive and easy to use. It has an excellent users’ guide at https://github.com/OpenGene/fastp.

The main things fastp does is to trim off bad bases and remove leftover adaptor sequences. It automatically identifies the adapters so you don’t need to tell it in advance what to look for.

Exercise: fastp

Create a new file using nano, copy the script below, make it executable and run it! We are going to use the forward and reverse reads for our sample to run fastp in paired-end mode.

Script:

#!/usr/bin/bash

module load fastp/0.23.4-rsrnyej

fastp -q 20 -u 10 --cut_right -i Illumina_R1.fastq -I Illumina_R2.fastq -o Illumina_trim_R1.fastq -O Illumina_trim_R2.fastq -

Look up the help page for fastp. What does each of the options given above mean?

Use ls to look at the output. How could the script be improved to make it tidier?

Exercise: checking it worked!
  1. Run FastQC on your trimmed Illumina_R1 file and compare it to the original. Can you see the difference?

  2. In the Session2 directory there is a sub-directory called looping, with five single-end Illumina samples in it. Can you write a script which will use loop over those files to do QC on them, then trim them and re-run the QC? Extra brownie points for keeping the output directories nice and tidy!

Pipes

Very often in Linux you want to use the output of one command as the input to the next. This can easily be done using the pipe (|) character. For example, suppose I want to count the number of sequences in a FASTA file. I can do this easily using a combination of the search command grep and the word count command wc.

cat subsample_Ill1.fasta | grep ">" | wc -l

Let’s break down what this command is doing. First of all, I access the whole content of subsample_Ill1.fasta using cat. However, rather than dumping it onto the screen I redirect that to become the input to grep. I tell grep to search for the symbol > and sift out only the lines which contain that symbol. Because this is a FASTA, I know that can only be the header lines. I then feed this output into the wc command, with the option -l to count lines rather than words.

Exercise: using the pipe
  1. Try using the command above to count the number of sequences in subsample_Ill1.fasta.

  2. What happens if you remove | wc -l from the end? Do you understand why?

Pipes are particularly useful as they allow you to run…. LOOPS!

Loops

Loops are one of the best things about working on a Linux system. They allow you to write a command once and then run it on any number of files. There are several kinds of loops which all work on the same principles, so we will focus on the while loop. Let’s look at a simple loop and break it down.

ls *.fasta | while read file; do wc -l ${file}; done

This starts by listing all the FASTA files in the directory Session2 (remember that * stands in for anything). Try running

ls *.fasta 

just to check you’re happy with that.

The next step is to use a pipe. Instead of just listing the FASTA files on the screen, we are going to use that as input for our loop. The loop starts while read file. This means that one at a time, every line of that input will take a turn at being substituted for the word ‘file’ in the command that will follow. You don’t have to call it ‘file’ - it could be banana or profiterole or socks; the only thing that matters is that you use the same word throughout the loop. Whatever word you choose, it becomes something known as a variable: literally, a value that can vary as it stands in for something else. while read sets the value of file, and then whenever we want to access its contents we can do so using ${file}.

Having set the loop in progress, a semicolon indicates that next comes the command we want to run for each value of our variable. In this case, I am using wc -l to count the total number of lines in the file. Another semicolon followed by ‘done’ is needed to finish the loop and tell it to go away and run.

Exercise: using loops
  1. Try running the loop above in the Session2 directory.

  2. Try changing ‘file’ to something else. Does it work?

  3. [Extension] Using what you’ve learned so far, can you make a loop that will only count the sequence header lines?

Creating a quality summary with multiqc

Multiqc will search through a parent and all daughter folders, extracting all the qc data and creating a graphical summary. Try it for your session folder.

 module load py-multiqc/1.15-hcx2o2j
 
 #run multiqc on all files in the current working directory
 multiqc .
 

use the -h flag to see what option multiqc provides.

Exercise: the main event!

You are now going to meet the dataset we’ll be exploring for the next few days. In ~/classdata there is a folder called datasets. Copy this folder across to your ~/mydata directory

cp -r ~/classdata/datasets ~/mydata/datasets

Inside this folder you will find three data sets: virus, prokaryote and mitochondrial. We will be using the virus dataset to clean, assemble, annotate and analyse genomes over the next few days. The prokaryote and mitochondrial datasets will be for you to play with in the extension sessions.

Your challenge now is to write script(s) to run fastQC, fastp, fastQC (again!) and multiqc on the virus files in the subdirectory raw_reads. You can either copy them over to mydata/Session2 or work directly from mydata/datasets/prokaryotes/raw_reads. There are multiple virus files, so you will need to use loops!

If you are feeling confident, try creating one script that will carry out all the processes on all the files. Remember to keep your file names and output directories tidy!

Extension Work

Do the same for the mitochondrial data!

End of NGS QC.

Time for coffee. You have earned yourselves cake!