~/classdata/Session5/ RNAseq-Processing/rawdata
SRR12478185_1.fastq.gz SRR12478185_2.fastq.gz
SRR12478186_1.fastq.gz SRR12478186_2.fastq.gz
SRR12478187_1.fastq.gz SRR12478187_2.fastq.gz
SRR12478191_1.fastq.gz SRR12478191_2.fastq.gz
SRR12478192_1.fastq.gz SRR12478192_2.fastq.gz
SRR12478193_1.fastq.gz SRR12478193_2.fastq.gz
~/classdata/REFS/Celegens_genome_star/
Caenorhabditis_elegans.WBcel235.112.gtf
Caenorhabditis_elegans.WBcel235.dna_sm.toplevel.fa
Transcriptomics: an introduction
RNASeq processing
Introduction
RNAseq, or transcriptomics, is a method for quantifying the expression rate for all genes simultaneously. Analogous to a quantitative PCR of one gene, it comes with its advantages and drawbacks, with special statistical measures developed over years to make the results as accurate as possible.
There are three main steps to analyse a series of RNAseq fastq sample files (following quality filtering):
Alignment
STAR – (Spliced Transcript Alignments to a Reference) is an alignment package which functions similarly to standard genome alignments, but is designed for short regions of RNA that could span intron-exon junctions and with low compute requirements. STAR outputs a bam format file which contains the locations where all the reads in your dataset have aligned and which genes they cover.
Counting
FeatureCounts is a simple package that takes the positions of mapped reads and outputs a file quantifying the expression of each gene or exon (based on parameter choices). At this point raw read counts are hard to interpret due to likely different levels of sequencing achieved per sample and methodological biases.
One common step prior to counting is marking duplicates that arise from data generation for further information, or so that they can be removed. Here we’ll use the imaginatively named MarkDuplicates from GATK.
Differential Gene Analysis
Contrasting the expression profile of the samples is typically done with one of two R packages: Deseq2 or EdgeR (the mac vs windows of the RNAseq fight), however a multitude of alternatives exist. These packages perform the normalization and statistical steps of contrasting samples as defined in a metadata file stating your experimental design (replicates, tissue type, treatment etc). The output here is a range of significant genes, ordinance and cluster analysis of sample similarity, and various quality control figures.
Following these three steps, there are an almost infinite number of tools and packages to look deeper into your data, find experimentally specific insights, and prior published data to contrast against.
Data
We will start with the aligning and counting steps.
The data you will need for this exercise is derived from NCBI Bioproject PRJNA658134 Geo dataset GSE156507. I have selected only those L3 stage samples and sub-sampled the data from the 60M reads deposited to ~10M reads so that it can be processed with the workshop. Samples included are:
This folder contains lots of other index files for star to function that you don’t need to touch! Note: most programs will accept fastq or fastq.gz without any changes however star requires you to include the --readFilesCommand zcat
parameter.
Software and scripts
We will be using scripts to run these steps. In the classdata/Session5/RNAseq-Processing/instance_scripts
folder you will find the following that you can use as a basis for your analysis, however, make sure you’re tuning it to your own file structure and file names.
We’ll be using a full sized RNAseq sample otherwise it causes the programs to think it’s bad data. In the classdata/Session3/RNAseq-Processing/rawdata folder there four pairs of RNAseq files from an C. elegans RNAseq study. In the folder classdata/REFS there is a reference genome, and a gtf file. The step 2 “star index genome” has already been run for you (you don’t need to do this!)
- 1-QC.sh
- 2-star_index_genome.sh (already done, don’t repeat!)
- 3-star.sh
- 4-markduplicates.sh
- 5-featurecounts.sh
Using the pre-made scripts provided, perform the steps on four pairs of fastq files. There are examples of all of these files in the classdata/Session5
directory which you should copy into your own folder. You will need to edit them to represent your own working folder and filenames:
QC and trim your sample data
Use the outputs from your QC script as inputs to run star.
Use the outputs from star to run mark duplicates to both remove and keep duplicates (this will generate two outputs).
Use the outputs to each of these to run featureCounts on the data.
Run multiQC on the processed directory and observe the summary.
These outputs are now ready to put into R and perform Differential Gene Expression Analysis. Things to remember:
Load any modules you will need!
Use the full path any time it could be ambiguous.
1-QC.sh
#!/bin/bash
#varibles
workdir=$(pwd)
rawdir=${workdir}/rawdata
#create directory for qc output
mkdir ${workdir}/rawqc
rawqcdir=${workdir}/rawqc
mkdir ${workdir}/trimdata
trimdir=${workdir}/trimdata
mkdir ${workdir}/trimqc
trimqcdir=${workdir}/trimqc
#load program
module load fastqc
for f in ${rawdir}/*_1.fastq.gz
do
R1=$(basename $f | cut -f1 -d.)
base=$(echo $R1 | sed 's/_1//')
fastqc -t 2 ${rawdir}/${base}_1.fastq.gz ${rawdir}/${base}_2.fastq.gz -o ${rawqcdir}
done
#unload program
module unload fastqc
#load program - fastp
module load fastp
#create loop
for f in ${rawdir}/*_1.fastq.gz
do
R1=$(basename $f | cut -f1 -d.)
base=$(echo $R1 | sed 's/_1//')
fastp -q 20 -u 10 --cut_right \
-i ${rawdir}/${base}_1.fastq.gz \
-I ${rawdir}/${base}_2.fastq.gz \
-o ${trimdir}/${base}_trim_R1.fastq.gz \
-O ${trimdir}/${base}_trim_R2.fastq.gz \
-j ${trimdir}/${base}_trim.json \
-h ${trimdir}/${base}_trim.html
done
#unload program
module unload fastp
#load program - fastp
module load fastqc
#create loop
for f in ${rawdir}/*_1.fastq.gz
do
R1=$(basename $f | cut -f1 -d.)
base=$(echo $R1 | sed 's/_1//')
fastqc -t 2 ${trimdir}/${base}_trim_R1.fastq.gz ${trimdir}/${base}_trim_R2.fastq.gz -o ${trimqcdir}
done
#unload program
module unload fastqc
module load py-multiqc
multiqc ${workdir}
module unload py-multiqc
2-star_index_genome.sh
# Load some modules
module load star/2.7.11a-cbbjioq
workdir=$(pwd)
## Change --sjdbOverhang to length of your sequence data /2 minus 1
echo "\n\n I TOLD YOU NOT TO RUN THIS ONE NOW! \n\n (unless you're in the future and trying to run this for real, in which case you need to edit this script and remove the # characters from the command)"
STAR --runThreadN 8 \
--limitGenomeGenerateRAM 321563573 \
--runMode genomeGenerate \
--genomeDir ~/classdata/REFS/Celegens_genome_star \
--genomeFastaFiles ~/classdata/REFS/Celegens_genome_star/Caenorhabditis_elegans.WBcel235.dna_sm.toplevel.fa \
--sjdbGTFfile ~/classdata/REFS/Celegens_genome_star/Caenorhabditis_elegans.WBcel235.112.gtf \
--sjdbOverhang 49
3-star.sh
#!/bin/bash
## Load some Modules
module load star/2.7.11b-conda
## Useful shortcuts
#varibles
workdir=$(pwd)
rawdir=${workdir}/rawdata
trimdir=${workdir}/trimdata
mkdir ${workdir}/star
stardir=${workdir}/star
genomedir=~/classdata/REFS/Celegens_genome_star
for f in ${workdir}/rawdata/*_1.fastq.gz
do
R1=$(basename $f | cut -f1 -d.)
base=$(echo $R1 | sed 's/_1//')
# map forward and reverse reads to genome
STAR --outMultimapperOrder Random \
--outSAMmultNmax 1 \
--runThreadN 8 \
--runMode alignReads \
--outSAMtype BAM Unsorted \
--quantMode GeneCounts \
--readFilesCommand zcat \
--outFileNamePrefix ${stardir}/${base}-unsort. \
--genomeDir ${genomedir} \
--readFilesIn ${trimdir}/${base}_trim_R1.fastq.gz ${trimdir}/${base}_trim_R2.fastq.gz
done
4-markduplicates.sh
#!/bin/bash
#load some modules
module load picard/3.1.1-peglbgr
module load samtools/1.19.2-m76oqh7
## Useful shortcuts
workdir=$(pwd)
rawdir=${workdir}/rawdata
trimdir=${workdir}/trimdata
stardir=${workdir}/star
genomedir=~/classdata/REFS/Celegens_genome_star
mkdir ${workdir}/markdup
markdir=${workdir}/markdup
for f in ${workdir}/rawdata/*_1.fastq.gz
do
R1=$(basename $f | cut -f1 -d.)
base=$(echo $R1 | sed 's/_1//')
samtools sort -@ 8 -o $stardir/${base}.sorted.bam $stardir/${base}-unsort.Aligned.out.bam
samtools index $stardir/${base}.sorted.bam
## MARK DUPLICATES ##
picard MarkDuplicates I=$stardir/${base}.sorted.bam O=$markdir/${base}.markdup.bam M=$markdir/${base}.metrics.markdup.txt REMOVE_DUPLICATES=false VALIDATION_STRINGENCY=SILENT
## REMOVE DUPLICATES ##
picard MarkDuplicates I=${stardir}/${base}.sorted.bam O=${markdir}/${base}.rmdup.bam M=${markdir}/${base}.metrics.rmdup.txt REMOVE_DUPLICATES=true VALIDATION_STRINGENCY=SILENT
done
5-featurecounts.sh
#!/bin/bash
# Load some modules
module load subread/2.0.6-abbqxcc
## Useful shortcuts
workdir=$(pwd)
rawdir=${workdir}/rawdata
trimdir=${workdir}/trimdata
stardir=${workdir}/star
genomedir=~/classdata/REFS/Celegens_genome_star
markdir=${workdir}/markdup
mkdir ${workdir}/featureCounts
fcdir=${workdir}/featureCounts
for f in ${workdir}/rawdata/*_1.fastq.gz
do
R1=$(basename $f | cut -f1 -d.)
base=$(echo $R1 | sed 's/_1//')
featureCounts \
-T 4 -p -F GTF -t exon -g gene_id \
-a ${genomedir}/Caenorhabditis_elegans.WBcel235.112.gtf \
-o ${fcdir}/${base}.markdup.featurecount \
${markdir}/${base}.markdup.bam
featureCounts \
-T 4 -p -F GTF -t exon -g gene_id \
-a ${genomedir}/Caenorhabditis_elegans.WBcel235.112.gtf \
-o ${fcdir}/${base}.rmdup.featurecount \
${markdir}/${base}.rmdup.bam
done
Once you have looked at the RNAseq-Analysis workshop apply this Knowedge to the analyse your C. elegans RNAseq data. In the RNAseq-Processing folder your are surplied with an Rstudio script for Sartools Sartools-template-deseq2.r
and a metadata file C_elegansL3_Cd_metadata.txt
the Metadata file describes teh various parameters of the experiment:
Sample_ID Files SRA Stage Chemical Dose StageDose Replicate
L3_Cd_0_rep1 SRR12478185.markdup.featurecount SRR12478185 L3 Cd 0 L3dose0 rep1
L3_Cd_0_rep2 SRR12478186.markdup.featurecount SRR12478186 L3 Cd 0 L3dose0 rep2
L3_Cd_0_rep3 SRR12478187.markdup.featurecount SRR12478187 L3 Cd 0 L3dose0 rep3
L3_Cd_20_rep1 SRR12478191.markdup.featurecount SRR12478191 L3 Cd 20 L3dose20 rep1
L3_Cd_20_rep2 SRR12478192.markdup.featurecount SRR12478192 L3 Cd 20 L3dose20 rep2
L3_Cd_20_rep3 SRR12478193.markdup.featurecount SRR12478193 L3 Cd 20 L3dose20 rep3