Data Processing : Generation of DEG Lists

From Counts to Gene Lists

Author

Prof Peter Kille

Published

January 7, 2025

Differential Gene Analysis

Contrasting the expression profile of the samples is typically done with one of two R packages: Deseq2 or EdgeR (the mac vs windows of the RNAseq fight), however a multitude of alternatives exist. These packages perform the normalization and statistical steps of contrasting samples as defined in a metadata file stating your experimental design (replicates, tissue type, treatment etc). The output here is a range of significant genes, ordinance and cluster analysis of sample similarity, and various quality control figures.

Following these steps, there are an almost infinite number of tools and packages to look deeper into your data, find experimentally specific insights, and prior published data to contrast against.

Key Statistical Concepts - Flase Discovery Rate (FDR)

The workshop will use tools that exploit statistical approaches to identify differential expressed genes (DEGs). You will not need to derive the mathematical formula underpinning these concepts or deploy the algorithums from first principles since the software you will use does this for you but it is worth knowing the background. There is an excellent academic from UNC-Chapel Hill, Prof Josh Starmer, who runs a youtube channel called StatQuest this deal with a large range of biological stats and has some great annotations to explain them, I am going to suggest you watch two of his videos to orientate you about the key statistical concepts for this session:

3.1-3.5 DEG pipelines

Processing data from counts to DEG lists requires a number of complex steps - rather than code these individually we recommned using a ‘wrapper’ program that executes the steps within an automated workflow. NCBI provides theis function for a selection of datasets archived in GEO - Geo2R Instrctions. For count data that we have genearted ourselveswe are recommending using a wrapper called SARTools. This wrapper supports simple 2 state analsysis but generates plots for data exploration, normalisation, dispersion, MA/Volcano Plots and DEG list. Also it generates a DESeq2 (or edgeR) object (dds object) that can be used for more complex analysis.

A simple Guide to SARTools

1. Download or copy the SARTools template script

#Download
wget https://github.com/PF2-pasteur-fr/SARTools/blob/master/template_script_DESeq2.r
# an example metadata file can be viewed in workshop example below

# copy script, metadata and feature counts from classdata
mkdir 
cp -R ~/classdata/Session5/RNAseq-Analysis/ ~/mydata/

2. Download Workshop data

Ensure you analysis folder has a sub-folder which is callled FeatureCounts that contains individual files containing your data and these are referenced in your metadata file - you do not need a subdirectory location as we will add this to the script.

Your data represents a set of nine human neuronal differential RNAseq samples have been sequenced, consisting of 6x control samples (3x two different individuals) and 3x a deletion mutant of the 1q21.1 cytogenetic region of the human genome, and 3x of a duplication of this region.

Deletions and alterations to this region has a range of impacts on neuropsychiatric disorders and is under active study. See the omim and the Wikipedia page is remarkably informative! 1q21.1_deletion_syndrome.

1q21.1 cytogenetic region of the human genome

3. Open your POSIT R-studio session

We recommend using R-studio either on your local machine or using the POSIT based server option for reproducibility.

Open SARTools script and edit the areas highlighted in red.

SARTools

These parameters are:

workDir - location of your working directory (easy way of identifying this in Rstudio is to navigate to your working directory in the files tab and then use More>Set As Working Directory to show your workdir in the console window).

projectName - Your choosen name for you output files

author - Your Name

targetFile - metadata file name

rawDir - Directory containing your count files

VarInt - The parameter you wish you analyse

conRef - The description within your choosen parameter that you wish to use as your control

N.B. Line 39 there is a parameter for your level is statistical significance - this is by default 0.5 you can chnage this if you need.

4. Now run the script

At the top right of your R script window there is a ‘Run’ pull down list - select this and highlight ‘Run all’.

5. Review Outputs

HTML file - this contains your data analsyis with comprehensive explainatory text. It provides key plots, parameters and summary data.

table - this folder contains tab separated varible files of your results (these can be downloaded and explored in excel), both as a complete list but also separate files for up and down regulated gens passing the parameter you defined.

figures - this folder contains the images used in the generation of the html output.

NB - if you change parameters and rerun the script it will overwrite the table and image directories so if you want to keep these make sure you rename name them before rerunning the script.

out.DEseq2/dds Object - This is a very useful object as many secondary analysis steps can use this object for further analysis.

We strongly recommend you save your R session when finished so you do not need to recalulate the dds object.

Dataset to use - Geo:GDS2565

Once you have looked at the RNAseq-Analysis workshop apply this Knowedge to the analyse your C. elegans RNAseq data. In the RNAseq-Processing folder your are surplied with an Rstudio script for Sartools Sartools-template-deseq2.r and a metadata file C_elegansL3_Cd_metadata.txt

the Metadata file describes the various parameters of the experiment:

Sample_ID       Files   SRA     Stage   Chemical        Dose    StageDose       Replicate
L3_Cd_0_rep1    SRR12478185.markdup.featurecount        SRR12478185     L3      Cd      0       L3dose0 rep1
L3_Cd_0_rep2    SRR12478186.markdup.featurecount        SRR12478186     L3      Cd      0       L3dose0 rep2
L3_Cd_0_rep3    SRR12478187.markdup.featurecount        SRR12478187     L3      Cd      0       L3dose0 rep3
L3_Cd_20_rep1   SRR12478191.markdup.featurecount        SRR12478191     L3      Cd      20      L3dose20        rep1
L3_Cd_20_rep2   SRR12478192.markdup.featurecount        SRR12478192     L3      Cd      20      L3dose20        rep2
L3_Cd_20_rep3   SRR12478193.markdup.featurecount        SRR12478193     L3      Cd      20      L3dose20        rep3