RNAseq Processing

From Reads to Counts

Author

Prof. Peter Kille and Dr. Sarah Christofides

Published

October 18, 2024

Transcriptomics: an introduction

RNASeq processing

Introduction

RNAseq, or transcriptomics, is a method for quantifying the expression rate for all genes simultaneously. Analogous to a quantitative PCR of one gene, it comes with its advantages and drawbacks, with special statistical measures developed over years to make the results as accurate as possible.

There are three main steps to analyse a series of RNAseq fastq sample files (following quality filtering):

Alignment

STAR – (Spliced Transcript Alignments to a Reference) is an alignment package which functions similarly to standard genome alignments, but is designed for short regions of RNA that could span intron-exon junctions and with low compute requirements. STAR outputs a bam format file which contains the locations where all the reads in your dataset have aligned and which genes they cover.

Counting

FeatureCounts is a simple package that takes the positions of mapped reads and outputs a file quantifying the expression of each gene or exon (based on parameter choices). At this point raw read counts are hard to interpret due to likely different levels of sequencing achieved per sample and methodological biases.

One common step prior to counting is marking duplicates that arise from data generation for further information, or so that they can be removed. Here we’ll use the imaginatively named MarkDuplicates from GATK.

Differential Gene Analysis

Contrasting the expression profile of the samples is typically done with one of two R packages: Deseq2 or EdgeR (the mac vs windows of the RNAseq fight), however a multitude of alternatives exist. These packages perform the normalization and statistical steps of contrasting samples as defined in a metadata file stating your experimental design (replicates, tissue type, treatment etc). The output here is a range of significant genes, ordinance and cluster analysis of sample similarity, and various quality control figures.

Following these three steps, there are an almost infinite number of tools and packages to look deeper into your data, find experimentally specific insights, and prior published data to contrast against.

Data

We will start with the aligning and counting steps.

The data you will need for this exercise is derived from NCBI Bioproject PRJNA658134 Geo dataset GSE156507. I have selected only those L3 stage samples and sub-sampled the data from the 60M reads deposited to ~10M reads so that it can be processed with the workshop. Samples included are:

~/classdata/Session5/ RNAseq-Processing/rawdata
SRR12478185_1.fastq.gz    SRR12478185_2.fastq.gz
SRR12478186_1.fastq.gz    SRR12478186_2.fastq.gz
SRR12478187_1.fastq.gz    SRR12478187_2.fastq.gz
SRR12478191_1.fastq.gz    SRR12478191_2.fastq.gz
SRR12478192_1.fastq.gz    SRR12478192_2.fastq.gz
SRR12478193_1.fastq.gz    SRR12478193_2.fastq.gz

~/classdata/REFS/Celegens_genome_star/
Caenorhabditis_elegans.WBcel235.112.gtf
Caenorhabditis_elegans.WBcel235.dna_sm.toplevel.fa

This folder contains lots of other index files for star to function that you don’t need to touch! Note: most programs will accept fastq or fastq.gz without any changes however star requires you to include the --readFilesCommand zcat parameter.

Software and scripts

We will be using scripts to run these steps. In the classdata/Session5/RNAseq-Processing/instance_scripts folder you will find the following that you can use as a basis for your analysis, however, make sure you’re tuning it to your own file structure and file names.

We’ll be using a full sized RNAseq sample otherwise it causes the programs to think it’s bad data. In the classdata/Session3/RNAseq-Processing/rawdata folder there four pairs of RNAseq files from an C. elegans RNAseq study. In the folder classdata/REFS there is a reference genome, and a gtf file. The step 2 “star index genome” has already been run for you (you don’t need to do this!)

Scripts

  • 1-QC.sh
  • 2-star_index_genome.sh (already done, don’t repeat!)
  • 3-star.sh
  • 4-markduplicates.sh
  • 5-featurecounts.sh
Exercises: RNAseq processing

Using the pre-made scripts provided, perform the steps on four pairs of fastq files. There are examples of all of these files in the classdata/Session5 directory which you should copy into your own folder. You will need to edit them to represent your own working folder and filenames:

  1. QC and trim your sample data

  2. Use the outputs from your QC script as inputs to run star.

  3. Use the outputs from star to run mark duplicates to both remove and keep duplicates (this will generate two outputs).

  4. Use the outputs to each of these to run featureCounts on the data.

  5. Run multiQC on the processed directory and observe the summary.

These outputs are now ready to put into R and perform Differential Gene Expression Analysis. Things to remember:

  • Load any modules you will need!

  • Use the full path any time it could be ambiguous.

1-QC.sh

#!/bin/bash

#varibles
workdir=$(pwd)

rawdir=${workdir}/rawdata
#create directory for qc output
mkdir ${workdir}/rawqc
rawqcdir=${workdir}/rawqc
mkdir ${workdir}/trimdata
trimdir=${workdir}/trimdata
mkdir ${workdir}/trimqc
trimqcdir=${workdir}/trimqc

#load program
module load fastqc

for f in ${rawdir}/*_1.fastq.gz
do

R1=$(basename $f | cut -f1 -d.)
base=$(echo $R1 | sed 's/_1//')

fastqc -t 2 ${rawdir}/${base}_1.fastq.gz ${rawdir}/${base}_2.fastq.gz -o ${rawqcdir}

done
#unload program
module unload fastqc


#load program - fastp
module load fastp

#create loop
for f in ${rawdir}/*_1.fastq.gz
do

R1=$(basename $f | cut -f1 -d.)
base=$(echo $R1 | sed 's/_1//')

fastp -q 20 -u 10 --cut_right \
      -i ${rawdir}/${base}_1.fastq.gz \
      -I ${rawdir}/${base}_2.fastq.gz \
      -o ${trimdir}/${base}_trim_R1.fastq.gz \
      -O ${trimdir}/${base}_trim_R2.fastq.gz \
      -j ${trimdir}/${base}_trim.json \
      -h ${trimdir}/${base}_trim.html

done

#unload program
module unload fastp


#load program - fastp
module load fastqc

#create loop
for f in ${rawdir}/*_1.fastq.gz
do

R1=$(basename $f | cut -f1 -d.)
base=$(echo $R1 | sed 's/_1//')

fastqc -t 2 ${trimdir}/${base}_trim_R1.fastq.gz ${trimdir}/${base}_trim_R2.fastq.gz -o ${trimqcdir}

done

#unload  program
module unload fastqc

module load py-multiqc

multiqc ${workdir}

module unload py-multiqc

2-star_index_genome.sh


# Load some modules
module load star/2.7.11a-cbbjioq

workdir=$(pwd)

## Change --sjdbOverhang to length of your sequence data /2 minus 1

echo "\n\n I TOLD YOU NOT TO RUN THIS ONE NOW! \n\n (unless you're in the future and trying to run this for real, in which case you need to edit this script and remove the # characters from the command)"

STAR    --runThreadN 8 \
        --limitGenomeGenerateRAM 321563573 \
        --runMode genomeGenerate \
        --genomeDir  ~/classdata/REFS/Celegens_genome_star \
        --genomeFastaFiles ~/classdata/REFS/Celegens_genome_star/Caenorhabditis_elegans.WBcel235.dna_sm.toplevel.fa \
        --sjdbGTFfile ~/classdata/REFS/Celegens_genome_star/Caenorhabditis_elegans.WBcel235.112.gtf \
        --sjdbOverhang 49

3-star.sh

#!/bin/bash

## Load some Modules
module load star/2.7.11b-conda

## Useful shortcuts
#varibles
workdir=$(pwd)

rawdir=${workdir}/rawdata
trimdir=${workdir}/trimdata
mkdir ${workdir}/star
stardir=${workdir}/star
genomedir=~/classdata/REFS/Celegens_genome_star

for f in ${workdir}/rawdata/*_1.fastq.gz
do
R1=$(basename $f | cut -f1 -d.)
base=$(echo $R1 | sed 's/_1//')

# map forward and reverse reads to genome
STAR   --outMultimapperOrder Random \
       --outSAMmultNmax 1 \
       --runThreadN 8 \
       --runMode alignReads \
       --outSAMtype BAM Unsorted \
       --quantMode GeneCounts \
       --readFilesCommand zcat \
       --outFileNamePrefix ${stardir}/${base}-unsort. \
       --genomeDir ${genomedir} \
       --readFilesIn ${trimdir}/${base}_trim_R1.fastq.gz ${trimdir}/${base}_trim_R2.fastq.gz

done

4-markduplicates.sh

#!/bin/bash

#load some modules
module load picard/3.1.1-peglbgr
module load samtools/1.19.2-m76oqh7

## Useful shortcuts
workdir=$(pwd)

rawdir=${workdir}/rawdata
trimdir=${workdir}/trimdata
stardir=${workdir}/star
genomedir=~/classdata/REFS/Celegens_genome_star
mkdir ${workdir}/markdup
markdir=${workdir}/markdup


for f in ${workdir}/rawdata/*_1.fastq.gz
do
R1=$(basename $f | cut -f1 -d.)
base=$(echo $R1 | sed 's/_1//')

samtools sort -@ 8 -o $stardir/${base}.sorted.bam $stardir/${base}-unsort.Aligned.out.bam
samtools index $stardir/${base}.sorted.bam

##  MARK DUPLICATES  ##
picard MarkDuplicates I=$stardir/${base}.sorted.bam O=$markdir/${base}.markdup.bam M=$markdir/${base}.metrics.markdup.txt REMOVE_DUPLICATES=false VALIDATION_STRINGENCY=SILENT



## REMOVE DUPLICATES ##
picard MarkDuplicates I=${stardir}/${base}.sorted.bam O=${markdir}/${base}.rmdup.bam M=${markdir}/${base}.metrics.rmdup.txt REMOVE_DUPLICATES=true VALIDATION_STRINGENCY=SILENT

done

5-featurecounts.sh

#!/bin/bash

# Load some modules
module load subread/2.0.6-abbqxcc

## Useful shortcuts
workdir=$(pwd)

rawdir=${workdir}/rawdata
trimdir=${workdir}/trimdata
stardir=${workdir}/star
genomedir=~/classdata/REFS/Celegens_genome_star
markdir=${workdir}/markdup
mkdir ${workdir}/featureCounts
fcdir=${workdir}/featureCounts

for f in ${workdir}/rawdata/*_1.fastq.gz
do
R1=$(basename $f | cut -f1 -d.)
base=$(echo $R1 | sed 's/_1//')

featureCounts \
        -T 4 -p -F GTF -t exon -g gene_id \
        -a ${genomedir}/Caenorhabditis_elegans.WBcel235.112.gtf \
        -o ${fcdir}/${base}.markdup.featurecount \
        ${markdir}/${base}.markdup.bam

featureCounts \
        -T 4 -p -F GTF -t exon -g gene_id \
        -a ${genomedir}/Caenorhabditis_elegans.WBcel235.112.gtf \
        -o ${fcdir}/${base}.rmdup.featurecount \
        ${markdir}/${base}.rmdup.bam

done

Once you have looked at the RNAseq-Analysis workshop apply this Knowedge to the analyse your C. elegans RNAseq data. In the RNAseq-Processing folder your are surplied with an Rstudio script for Sartools Sartools-template-deseq2.r and a metadata file C_elegansL3_Cd_metadata.txt

the Metadata file describes teh various parameters of the experiment:

Sample_ID       Files   SRA     Stage   Chemical        Dose    StageDose       Replicate
L3_Cd_0_rep1    SRR12478185.markdup.featurecount        SRR12478185     L3      Cd      0       L3dose0 rep1
L3_Cd_0_rep2    SRR12478186.markdup.featurecount        SRR12478186     L3      Cd      0       L3dose0 rep2
L3_Cd_0_rep3    SRR12478187.markdup.featurecount        SRR12478187     L3      Cd      0       L3dose0 rep3
L3_Cd_20_rep1   SRR12478191.markdup.featurecount        SRR12478191     L3      Cd      20      L3dose20        rep1
L3_Cd_20_rep2   SRR12478192.markdup.featurecount        SRR12478192     L3      Cd      20      L3dose20        rep2
L3_Cd_20_rep3   SRR12478193.markdup.featurecount        SRR12478193     L3      Cd      20      L3dose20        rep3