RNAseq Processing

From Reads to Counts

Author

Prof. Peter Kille and Dr. Sarah Christofides

Published

January 7, 2025

RNASeq processing - The Theory

This presentation outlines the RNA sequencing (RNAseq) data analysis process, starting from raw paired-end sequence reads (fastq files). The scripts can be adapted to work with single-end data; key notes on how to make this adaptation are provided. Key steps include read trimming, aligning sequencing reads to a reference genome using STAR, quantifying read counts (respecting intron/exon boundaries), marking potential duplicate reads and generating feature counts. The count data thus generated can be used to produce Differentially Expressed Gene (DEG) lists by analysis with R packages such as DESeq2 or edgeR. Finally, differential gene expression analysis employs statistical tests and visualisations to identify genes with significantly altered expression levels.

If you are interested, Andres et al. (2013) provides an excellent overview of the processes involved.

Lecture slides for RNASeq Processing

1. Introduction

RNAseq, or transcriptomics, is a method for quantifying the expression of all genes simultaneously. Analogous to running a quantitative PCR for every gene at once, it comes with its own advantages and drawbacks, and special statistical measures have been developed over the years to make the results as accurate as possible.

The main steps in analysing a series of RNAseq fastq sample files (following quality filtering) are:

1.1 QC and Trimming

As with all Next Generation Sequencing (NGS) informatic processes, the first step in data analysis is to remove any poor-quality bases and any experimentally introduced sequences (adapters or primers). This workflow uses fastp, although programs such as Trimmomatic or Trim Galore would also do the job well. It is recommended to check the data quality and quantity before and after trimming: running FastQC on each read file before and after trimming, then generating a summary with MultiQC, provides a great combination.

1.2 Formatting Target Genome

When aligning to a genome, the target genome needs to be formatted to reflect the mapper being used, the read length and the genome annotation (using a GTF or GFF file) so that the mapper knows where the intron-exon junctions are. Ideally you should also use a ‘soft-masked’ genome so that common repeats are available to map to.

This is a RAM-intensive process and can take some time for larger genomes. For mainstream mappers, such as STAR and Bowtie, and for routine genomes such as human and mouse, appropriately formatted genomes are already available to download (just search the web to find them); this can save you a lot of time and computational resources. If you need to download and format your own genomes, here are some useful sites to download genomes from, in order of ‘usability’:

Mainstream and model genomes

Ensembl current_fasta - https://ftp.ensembl.org/pub/current_fasta/

Ensembl current_gtf - https://ftp.ensembl.org/pub/current_gtf/
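
As an illustration, the C. elegans genome and annotation used later in this workshop could be fetched from these Ensembl directories roughly as follows. This is a sketch only: the release number in the GTF file name (112 here) changes with each Ensembl release, so check the FTP listing first.

# Download a soft-masked genome and matching GTF from Ensembl (file names are illustrative)
wget https://ftp.ensembl.org/pub/current_fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna_sm.toplevel.fa.gz
wget https://ftp.ensembl.org/pub/current_gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.112.gtf.gz
# STAR needs uncompressed files for genome indexing
gunzip Caenorhabditis_elegans.WBcel235.dna_sm.toplevel.fa.gz Caenorhabditis_elegans.WBcel235.112.gtf.gz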

UK Darwin Tree of Life genomes

Genomes and annotations (gtf) https://projects.ensembl.org/darwin-tree-of-life/

NCBI genomes

Genomes and annotations https://www.ncbi.nlm.nih.gov/datasets/genome/

NCBI currently supports FTP and an API (allowing wget or curl to be used to download genomes), as well as a piece of software - a small local binary called datasets (this does not need to be installed, just downloaded). Examples of how to use these are given below:

FTP site: https://ftp.ncbi.nlm.nih.gov/genomes/

wget -O GCA_945859605.1.zip "https://api.ncbi.nlm.nih.gov/datasets/v2/genome/accession/GCA_945859605.1/download?include_annotation_type=GENOME_FASTA&include_annotation_type=GENOME_GFF&include_annotation_type=RNA_FASTA&include_annotation_type=CDS_FASTA&include_annotation_type=PROT_FASTA&include_annotation_type=SEQUENCE_REPORT&hydrated=FULLY_HYDRATED"

# Download the datasets software from NCBI - https://www.ncbi.nlm.nih.gov/datasets/docs/v2/command-line-tools/download-and-install/
# Now you can use datasets to download your data
datasets download genome accession GCA_945859605.1 --include gff3,rna,cds,protein,genome,seq-report
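
Both routes deliver a zip archive: the wget example above names it explicitly, while the datasets tool writes ncbi_dataset.zip by default. A minimal follow-up sketch:

# Unpack the archive; sequence and annotation files sit under ncbi_dataset/data/<accession>/
unzip ncbi_dataset.zip -d GCA_945859605.1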

1.3 Alignment (STAR)

STAR (Spliced Transcripts Alignment to a Reference) is an alignment package which functions similarly to standard genome aligners, but is designed for short RNA-derived reads that may span intron-exon junctions, while keeping compute requirements low. STAR outputs a BAM format file which contains the locations where all the reads in your dataset have aligned and which genes they cover.
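
Once an alignment has finished, it is worth glancing at STAR's mapping summary before moving on. A minimal sketch, assuming the output prefix used in the 3-star.sh script later in this document (the sample name is illustrative):

# STAR writes a per-sample summary alongside the BAM; check the uniquely mapped percentage
cat star/SRR12478185-unsort.Log.final.out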

1.4 Marking Duplicates

Duplicate reads can be generated both by library preparation methods and by over-clustering on the Illumina flow cell. The latter was more common with older Illumina platforms and is rarely seen on newer platforms loaded at the correct concentration, but as library preparations use ever lower input, duplication arising at that stage of the process is becoming more common. ‘False’ duplicates can be observed in libraries that are very deeply sequenced and contain limited cellular diversity (cell lines or tissue isolates), as abundant genes will be over-represented. If you want to check, the most robust way is to visualise the read coverage for a range of abundant genes: if the coverage is consistent then duplicates are not an issue, but if you observe specific amplicons which are over-represented (looking like stacks) then this is an issue you need to address. DO NOT use duplicate removal for 3’ UTR library methods. Here we’ll use the imaginatively named MarkDuplicates from Picard/GATK.
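
As a quick numerical check after duplicate marking (see 4-markduplicates.sh later in this document), samtools flagstat reports how many reads carry the duplicate flag. A minimal sketch, assuming the file names produced by that script and that the samtools module is loaded:

# The "duplicates" line gives the number of reads flagged as duplicates
samtools flagstat markdup/SRR12478185.markdup.bam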

1.5 Feature Counting

featureCounts is a simple package that takes the positions of mapped reads and outputs a file quantifying the expression of each gene or exon (depending on parameter choices). At this point raw read counts are hard to interpret directly, due to the differing amounts of sequencing achieved per sample and methodological biases.
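
Alongside the count table, featureCounts writes a .summary file recording how many reads were assigned to features and why the rest were not; MultiQC picks these up automatically, and they can also be inspected directly. A minimal sketch, assuming the file names produced by the 5-featurecounts.sh script later in this document:

# Proportion of assigned versus unassigned reads for one sample
cat featureCounts/SRR12478185.markdup.featurecount.summary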

Workshop Example

Workshop Example: Raw Data

We will start with the aligning and counting steps.

The data you will need for this exercise is derived from NCBI BioProject PRJNA658134, GEO dataset GSE156507. I have selected only the L3 stage samples and sub-sampled the data from the 60M reads deposited to ~10M reads so that it can be processed within the workshop. Samples included are:

~/classdata/Session5/RNAseq-Processing/rawdata
SRR12478185_1.fastq.gz    SRR12478185_2.fastq.gz
SRR12478186_1.fastq.gz    SRR12478186_2.fastq.gz
SRR12478187_1.fastq.gz    SRR12478187_2.fastq.gz
SRR12478191_1.fastq.gz    SRR12478191_2.fastq.gz
SRR12478192_1.fastq.gz    SRR12478192_2.fastq.gz
SRR12478193_1.fastq.gz    SRR12478193_2.fastq.gz

~/classdata/REFS/Celegens_genome_star/
Caenorhabditis_elegans.WBcel235.112.gtf
Caenorhabditis_elegans.WBcel235.dna_sm.toplevel.fa

This folder also contains lots of other index files that STAR needs to function, which you don’t need to touch! Note: most programs will accept fastq or fastq.gz without any changes, however STAR requires you to include the --readFilesCommand zcat parameter for gzipped input.

Software and scripts

We will be using scripts to run these steps. In the classdata/Session5/RNAseq-Processing/instance_scripts folder you will find the following scripts that you can use as a basis for your analysis; however, make sure you tune them to your own file structure and file names.

We’ll be using full-sized RNAseq samples, as heavily reduced data can cause the programs to flag it as bad data. In the classdata/Session5/RNAseq-Processing/rawdata folder there are four pairs of RNAseq files from a C. elegans RNAseq study. In the folder classdata/REFS there is a reference genome and a gtf file. Step 2, “star index genome”, has already been run for you (you don’t need to do this!)

Scripts

  • 1-QC.sh
  • 2-star_index_genome.sh (already done, don’t repeat!)
  • 3-star.sh
  • 4-markduplicates.sh
  • 5-featurecounts.sh

Exercises: RNAseq processing

Using the pre-made scripts provided, perform the steps on the four pairs of fastq files. There are examples of all of these scripts in the classdata/Session5 directory, which you should copy into your own folder. You will need to edit them to reflect your own working folder and filenames:

1.1 QC and trim your sample data

1.2 Format Genome (THIS HAS BEEN DONE FOR YOU)

1.3 Use the outputs from your QC script as inputs to run star.

1.4 Use the outputs from star to run mark duplicates to both remove and keep duplicates (this will generate two outputs).

1.5 Use the outputs from each of these to run featureCounts on the data.

Run multiQC on the processed directory and observe the summary.

These outputs are now ready to put into R and perform Differential Gene Expression Analysis. Things to remember:

  • Load any modules you will need!

  • Use the full path any time it could be ambiguous.

Scripts

1-QC.sh

#!/bin/bash

#variables
workdir=$(pwd)

rawdir=${workdir}/rawdata
#create directories for qc and trimming output
mkdir -p ${workdir}/rawqc
rawqcdir=${workdir}/rawqc
mkdir -p ${workdir}/trimdata
trimdir=${workdir}/trimdata
mkdir -p ${workdir}/trimqc
trimqcdir=${workdir}/trimqc

#load program
module load fastqc

for f in ${rawdir}/*_1.fastq.gz
do

R1=$(basename $f | cut -f1 -d.)
base=$(echo $R1 | sed 's/_1//')

fastqc -t 2 ${rawdir}/${base}_1.fastq.gz ${rawdir}/${base}_2.fastq.gz -o ${rawqcdir}

done
#unload program
module unload fastqc


#load program - fastp
module load fastp

#create loop
for f in ${rawdir}/*_1.fastq.gz
do

R1=$(basename $f | cut -f1 -d.)
base=$(echo $R1 | sed 's/_1//')

fastp -q 20 -u 10 --cut_right \
      -i ${rawdir}/${base}_1.fastq.gz \
      -I ${rawdir}/${base}_2.fastq.gz \
      -o ${trimdir}/${base}_trim_R1.fastq.gz \
      -O ${trimdir}/${base}_trim_R2.fastq.gz \
      -j ${trimdir}/${base}_trim.json \
      -h ${trimdir}/${base}_trim.html

done

#unload program
module unload fastp


#load program - fastqc (re-check the trimmed reads)
module load fastqc

#create loop
for f in ${rawdir}/*_1.fastq.gz
do

R1=$(basename $f | cut -f1 -d.)
base=$(echo $R1 | sed 's/_1//')

fastqc -t 2 ${trimdir}/${base}_trim_R1.fastq.gz ${trimdir}/${base}_trim_R2.fastq.gz -o ${trimqcdir}

done

#unload  program
module unload fastqc

module load py-multiqc

multiqc ${workdir}

module unload py-multiqc
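
If your data are single-end (as flagged in the introduction), the fastp call in the loop above shrinks to a single input and output. A minimal sketch, assuming raw files named ${base}.fastq.gz rather than the _1/_2 pairs used here:

# Single-end variant of the trimming step (file naming is illustrative)
fastp -q 20 -u 10 --cut_right \
      -i ${rawdir}/${base}.fastq.gz \
      -o ${trimdir}/${base}_trim.fastq.gz \
      -j ${trimdir}/${base}_trim.json \
      -h ${trimdir}/${base}_trim.html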

2-star_index_genome.sh


#!/bin/bash

# Load some modules
module load star/2.7.11a-cbbjioq

workdir=$(pwd)

## Change --sjdbOverhang to your read length minus 1 (e.g. 49 for 50 bp reads)

echo -e "\n\n I TOLD YOU NOT TO RUN THIS ONE NOW! \n\n (unless you're in the future and trying to run this for real, in which case you need to edit this script and remove the # characters from the command)"

#STAR    --runThreadN 8 \
#        --limitGenomeGenerateRAM 321563573 \
#        --runMode genomeGenerate \
#        --genomeDir  ~/classdata/REFS/Celegens_genome_star \
#        --genomeFastaFiles ~/classdata/REFS/Celegens_genome_star/Caenorhabditis_elegans.WBcel235.dna_sm.toplevel.fa \
#        --sjdbGTFfile ~/classdata/REFS/Celegens_genome_star/Caenorhabditis_elegans.WBcel235.112.gtf \
#        --sjdbOverhang 49

3-star.sh

#!/bin/bash

## Load some Modules
module load star/2.7.11b-conda

## Useful shortcuts
#variables
workdir=$(pwd)

rawdir=${workdir}/rawdata
trimdir=${workdir}/trimdata
mkdir -p ${workdir}/star
stardir=${workdir}/star
genomedir=~/classdata/REFS/Celegens_genome_star

for f in ${workdir}/rawdata/*_1.fastq.gz
do
R1=$(basename $f | cut -f1 -d.)
base=$(echo $R1 | sed 's/_1//')

# map forward and reverse reads to genome
STAR   --outMultimapperOrder Random \
       --outSAMmultNmax 1 \
       --runThreadN 8 \
       --runMode alignReads \
       --outSAMtype BAM Unsorted \
       --quantMode GeneCounts \
       --readFilesCommand zcat \
       --outFileNamePrefix ${stardir}/${base}-unsort. \
       --genomeDir ${genomedir} \
       --readFilesIn ${trimdir}/${base}_trim_R1.fastq.gz ${trimdir}/${base}_trim_R2.fastq.gz

done
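
For single-end data, the STAR call needs only one file after --readFilesIn. A minimal sketch of the adapted command, assuming trimmed single-end files named ${base}_trim.fastq.gz:

# Single-end variant: give STAR a single trimmed file per sample (file naming is illustrative)
STAR   --outMultimapperOrder Random \
       --outSAMmultNmax 1 \
       --runThreadN 8 \
       --runMode alignReads \
       --outSAMtype BAM Unsorted \
       --quantMode GeneCounts \
       --readFilesCommand zcat \
       --outFileNamePrefix ${stardir}/${base}-unsort. \
       --genomeDir ${genomedir} \
       --readFilesIn ${trimdir}/${base}_trim.fastq.gz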

4-markduplicates.sh

#!/bin/bash

#load some modules
module load picard/3.1.1-peglbgr
module load samtools/1.19.2-m76oqh7

## Useful shortcuts
workdir=$(pwd)

rawdir=${workdir}/rawdata
trimdir=${workdir}/trimdata
stardir=${workdir}/star
genomedir=~/classdata/REFS/Celegens_genome_star
mkdir -p ${workdir}/markdup
markdir=${workdir}/markdup


for f in ${workdir}/rawdata/*_1.fastq.gz
do
R1=$(basename $f | cut -f1 -d.)
base=$(echo $R1 | sed 's/_1//')

samtools sort -@ 8 -o $stardir/${base}.sorted.bam $stardir/${base}-unsort.Aligned.out.bam
samtools index $stardir/${base}.sorted.bam

##  MARK DUPLICATES  ##
picard MarkDuplicates I=$stardir/${base}.sorted.bam O=$markdir/${base}.markdup.bam M=$markdir/${base}.metrics.markdup.txt REMOVE_DUPLICATES=false VALIDATION_STRINGENCY=SILENT



## REMOVE DUPLICATES ##
picard MarkDuplicates I=${stardir}/${base}.sorted.bam O=${markdir}/${base}.rmdup.bam M=${markdir}/${base}.metrics.rmdup.txt REMOVE_DUPLICATES=true VALIDATION_STRINGENCY=SILENT

done

5-featurecounts.sh

#!/bin/bash

# Load some modules
module load subread/2.0.6-abbqxcc

## Useful shortcuts
workdir=$(pwd)

rawdir=${workdir}/rawdata
trimdir=${workdir}/trimdata
stardir=${workdir}/star
genomedir=~/classdata/REFS/Celegens_genome_star
markdir=${workdir}/markdup
mkdir -p ${workdir}/featureCounts
fcdir=${workdir}/featureCounts

for f in ${workdir}/rawdata/*_1.fastq.gz
do
R1=$(basename $f | cut -f1 -d.)
base=$(echo $R1 | sed 's/_1//')

featureCounts \
        -T 4 -p -F GTF -t exon -g gene_id \
        -a ${genomedir}/Caenorhabditis_elegans.WBcel235.112.gtf \
        -o ${fcdir}/${base}.markdup.featurecount \
        ${markdir}/${base}.markdup.bam

featureCounts \
        -T 4 -p -F GTF -t exon -g gene_id \
        -a ${genomedir}/Caenorhabditis_elegans.WBcel235.112.gtf \
        -o ${fcdir}/${base}.rmdup.featurecount \
        ${markdir}/${base}.rmdup.bam

done
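
For the R-based differential expression analysis it is often convenient to have all samples in a single count matrix. featureCounts accepts multiple BAM files in one call, so a hedged sketch re-using the shortcuts defined above is shown below; note that from subread 2.0.2 onwards the -p flag only declares the data as paired-end, and --countReadPairs is needed if you want fragments (read pairs) rather than individual reads counted.

# One combined count table across all duplicate-marked samples
featureCounts \
        -T 4 -p --countReadPairs -F GTF -t exon -g gene_id \
        -a ${genomedir}/Caenorhabditis_elegans.WBcel235.112.gtf \
        -o ${fcdir}/all_samples.markdup.featurecount \
        ${markdir}/*.markdup.bam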
