RNAseq Processing

From Reads to Counts

Author

Prof. Peter Kille and Dr. Sarah Christofides

Published

September 16, 2024

Transcriptomics: an introduction

RNASeq processing

Introduction

RNAseq, or transcriptomics, is a method for quantifying the expression rate for all genes simultaneously. Analogous to a quantitative PCR of one gene, it comes with its advantages and drawbacks, with special statistical measures developed over years to make the results as accurate as possible.

There are three main steps to analyse a series of RNAseq fastq sample files (following quality filtering):

Alignment

STAR – (Spliced Transcript Alignments to a Reference) is an alignment package which functions similarly to standard genome alignments, but is designed for short regions of RNA that could span intron-exon junctions and with low compute requirements. STAR outputs a bam format file which contains the locations where all the reads in your dataset have aligned and which genes they cover.

Counting

FeatureCounts is a simple package that takes the positions of mapped reads and outputs a file quantifying the expression of each gene or exon (based on parameter choices). At this point raw read counts are hard to interpret due to likely different levels of sequencing achieved per sample and methodological biases.

One common step prior to counting is marking duplicates that arise from data generation for further information, or so that they can be removed. Here we’ll use the imaginatively named MarkDuplicates from GATK.

Differential Gene Analysis

Contrasting the expression profile of the samples is typically done with one of two R packages: Deseq2 or EdgeR (the mac vs windows of the RNAseq fight), however a multitude of alternatives exist. These packages perform the normalization and statistical steps of contrasting samples as defined in a metadata file stating your experimental design (replicates, tissue type, treatment etc). The output here is a range of significant genes, ordinance and cluster analysis of sample similarity, and various quality control figures.

Following these three steps, there are an almost infinite number of tools and packages to look deeper into your data, find experimentally specific insights, and prior published data to contrast against.

Data

We will start with the aligning and counting steps.

The data you will need for this exercise are:

~/classdata/Session5/ RNAseq-Processing/fastq
SRR5222797_1.fastq    SRR5222797_2.fastq
SRR5222798_1.fastq    SRR5222798_2.fastq
SRR5222799_1.fastq    SRR5222799_2.fastq

~/classdata/REFS/STAR
Arabidopsis_thaliana.TAIR10.47.gtf
Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa

This folder contains lots of other index files for star to function that you don’t need to touch! Note: most programs will accept fastq or fastq.gz without any changes however star requires you to include the --readFilesCommand zcat parameter.

Software and scripts

We will be using scripts to run these steps. In the classdata/Session5/scripts folder you will find the following that you can use as a basis for your analysis, however make sure you’re tuning it to your own file structure and file names.

We’ll be using a full sized RNAseq sample otherwise it causes the programs to think it’s bad data. In the classdata/Day3 folder there four pairs of RNAseq files from an Arabidopsis RNAseq study. In the folder classdata/REFS there is a reference genome, and a gtf file. The step 2 “star index genome” has already been run for you (you don’t need to do this!)

Scripts

  • 1-QC.sh
  • 2-star_index_genome.sh (already done, don’t repeat!)
  • 3-star.sh
  • 4-markduplicates.sh
  • 5-featurecounts.sh
Exercises: RNAseq processing

Using the pre-made scripts provided, perform the steps on four pairs of fastq files. There are examples of all of these files in the classdata/Session5 directory which you should copy into your own folder. You will need to edit them to represent your own working folder and filenames:

  1. QC and trim your sample data

  2. Use the outputs from your QC script as inputs to run star.

  3. Use the outputs from star to run mark duplicates to both remove and keep duplicates (this will generate two outputs).

  4. Use the outputs to each of these to run featureCounts on the data.

  5. Run multiQC on the processed directory and observe the summary.

These outputs are now ready to put into R and perform Differential Gene Expression Analysis. Things to remember:

  • Load any modules you will need!

  • Use the full path any time it could be ambiguous.

Scripts

#!/bin/bash

## Load some Modules
module load fastqc/0.11.9
module load trimmomatic/0.39

## Useful shortcuts
export workingdir=~/mydata/Session5/RNAseq-Processing
export classdir=~/classdata

#list=("sample1" "sample2" "sample3")
list=("SRR5222797" "SRR5222798" "SRR5222799")

for i in ${list[@]}
do
## The commands you want to run

# fastqc the raw data
fastqc -t 4 $workingdir/fastq/${i}_1.fastq
fastqc -t 4 $workingdir/fastq/${i}_2.fastq

# Run trimmomatic
trimmomatic PE $workingdir/fastq/${i}_1.fastq $workingdir/fastq/${i}_2.fastq  -baseout $workingdir/fastq/${i}-trim.fastq ILLUMINACLIP:$classdir/REFS/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15

# fastqc the outputs
fastqc -t 4 $workingdir/fastq/${i}-trim_1P.fastq
fastqc -t 4 $workingdir/fastq/${i}-trim_2P.fastq

done

2-star_index_genome.sh

#!/bin/bash

# Load some modules
module load star/2.7.6a
 
## Useful shortcuts
export refdir=~/classdata/REFS

## Change --sjdbOverhang to length of your sequence data /2 minus 1


echo "\n\n I TOLD YOU NOT TO RUN THIS ONE NOW! \n\n (unless you're in the future and trying to run this for real, in which case you need to edit this script and remove the # characters from the command)"

STAR    --runThreadN 8 \
        --limitGenomeGenerateRAM 321563573 \
        --runMode genomeGenerate \
        --genomeDir  $refdir/ \
        --genomeFastaFiles $refdir/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa \
        --sjdbGTFfile $refdir/Arabidopsis_thaliana.TAIR10.47.gtf \
        --sjdbOverhang 49

3-star.sh

#!/bin/bash

## Load some Modules
module load star/2.7.6a

## Useful shortcuts
export workingdir=~/mydata/Session5/RNAseq-Processing
export refdir=~/classdata/REFS/STAR/

## The commands you want to run
mkdir $workingdir/star

#list=("sample1" "sample2" "sample3")
list=("SRR5222797" "SRR5222798" "SRR5222799")

for i in ${list[@]}
do
# map forward and reverse reads to genome
# If input data is gzipped (.fastq.gz) inculde the additional parameter:   --readFilesCommand zcat
STAR   --outMultimapperOrder Random \
       --outSAMmultNmax 1 \
       --runThreadN 4  \
       --runMode alignReads \
       --outSAMtype BAM Unsorted \
       --quantMode GeneCounts \
       --outFileNamePrefix $workingdir/star/${i}-unsort. \
       --genomeDir $refdir \
       --readFilesIn $workingdir/fastq/${i}-trim_1P.fastq $workingdir/fastq/${i}-trim_2P.fastq
done

4-markduplicates.sh

#!/bin/bash

#load some modules
module load picard/2.26.2
module load samtools/1.15.1

## Useful shortcuts
export workingdir=~/mydata/Session5/RNAseq-Processing

mkdir markdup

#list=("sample1" "sample2" "sample3")
list=("SRR5222797" "SRR5222798" "SRR5222799")

for i in ${list[@]}
do
samtools sort -@ 4 -o $workingdir/star/${i}.sorted.bam $workingdir/star/${i}-unsort.Aligned.out.bam

samtools index $workingdir/star/${i}-unsort.Aligned.out.bam

##  MARK DUPLICATES  ##
picard MarkDuplicates I=$workingdir/star/${i}.sorted.bam O=$workingdir/markdup/${i}.markdup.bam M=$workingdir/markdup/${i}.metrics.markdup.txt REMOVE_DUPLICATES=false VALIDATION_STRINGENCY=SILENT


## REMOVE DUPLICATES ##
picard MarkDuplicates I=$workingdir/star/${i}.sorted.bam O=$workingdir/markdup/${i}.rmdup.bam M=$workingdir/markdup/${i}.metrics.rmdup.txt REMOVE_DUPLICATES=true VALIDATION_STRINGENCY=SILENT

done

5-featurecounts.sh

#!/bin/bash

# Load some modules
module load subread/2.0.2

## Useful shortcuts
export workingdir=~/mydata/Session5/RNAseq-Processing
export refdir=~/classdata/REFS/STAR/


mkdir $workingdir/featureCounts

#list=("sample1" "sample2" "sample3")
list=("SRR5222797" "SRR5222798" "SRR5222799")



for i in ${list[@]}
do

featureCounts \
        -T 4 -p -F GTF -t exon -g gene_id \
        -a $refdir/Arabidopsis_thaliana.TAIR10.47.gtf \
        -o $workingdir/featureCounts/${i}.markdup.featurecount \
        $workingdir/markdup/${i}.markdup.bam

featureCounts \
        -T 4 -p -F GTF -t exon -g gene_id \
        -a $refdir/Arabidopsis_thaliana.TAIR10.47.gtf \
        -o $workingdir/featureCounts/${i}.rmdup.featurecount \
        $workingdir/markdup/${i}.rmdup.bam

done