Refined Data Representation

Venn Diagrams and More

Author

Prof. Peter Kille and Dr. Sarah Christofides

Published

January 7, 2025

5.1 Venn diagrams

Looking for conserved or specific reponses between treatment types or time points can often reveal key bioilkoguical insights. The most strainghtforward way of doing this is to take a venn diagram based response to search for Gene ID that are shred or specific between groups. The simplest on line tool is Venny 2.1 which allows you to paste uptoo 4 lists and derive the overlapping and unique GeneID, export these list by selecting the relevant list within the graphical representation of the venn diagram and also customise representation (colour, lablers ect) and save it as a png image by ‘right hand clicking’ in venn diagram representation.

More informative representations can be generated by using programs such as Divenny Diagram where you upload the DEG list indicating if the ‘change’ in expression is UP or DOWN and then the intersections display conservation of up and down regulation as well as creating a attractive representation.

There are tools for creating venn diagram with more that 4 samples both on line with interactivenn and in R ggVennDiagram. Over 4 samples the classical venn diagram representations can become hard to interpret so you may want to consider a UpSet plot - ggVennDiagram supports UpSet plot generation.

5.2 Visualising Expression Patterns

Once an analysis has yielded a list of DEGs, it’s common to visualise them in a heatmap. There is a wonderful R package called ComplexHeatmap, which will do almost anything you can think of to a heatmap. We have create a script that you can copy into your workign directory and use to visualise your data.

cp ~/classdata/Session5/RNAseq-Analysis/Heatmaps.r ~/mydata/[workdir]

##Heatmaps Script

##Load libraries
#if (!requireNamespace("BiocManager", quietly = TRUE))
#  install.packages("BiocManager")
BiocManager::install("ComplexHeatmap", lib.loc="/home/sbi9srj/mydata/Rpackages")
library(ComplexHeatmap)

## Prepare your data for the heatmap
#Read in your Rdata that you saved previously
load("projectName.RData")

## Create a list of the output files from SARtools
tables <- list.files(path="tables", pattern="*.txt", full.names=TRUE, recursive=FALSE)
IDtoGene <- read.delim("/mnt/clusters/hawker/data/classdata/Bioinformatics/REFS/Homo_sapiens.GRCh38.100.map.txt", header=T)

# Now add column to tables with gene name
lapply(tables, function(x) {
  t <-read.table(x, header=TRUE)
  t.cols <- select(t, Id, log2FoldChange, pvalue, padj)
  t.annot <- merge(t.cols, IDtoGene, by='Id')
  write.table(t.annot, file=paste(x, ".annot.csv", sep=""), sep="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)
})

#################
## Read in data ##
#################

# We are going to draw a heatmap of all the genes differentially upregulated in the deletion compared to the control
# Files must be in the format with a header (samples) and gene names (rows)
infile <- data.frame(read.table("tables/DeletionvsControl.up.txt", sep="\t", header = TRUE, row.names= 1))
#Add in the name of each gene
infile$names <- IDtoGene$Gene[match(rownames(infile), IDtoGene$Id)]

#Click on 'infile' to see what it looks like. 
# The first set of columns are the raw reads for each gene in each sample
# The next set of columns are the normalised reads for each gene in each sample
# At the end are some columns with other useful information, e.g. the fold change and P values

# We want to use the read counts that have been normalised, so we will select just those columns
normVals <- infile[,grep("norm", names(infile))]

# Now we read in the metadata (the information about each sample)
# Metadata column names must match sample names!
x.meta = data.frame(read.table(targetFile, header = TRUE, row.names= 1))

# Ensure that the samples are in the same order in both files
normVals <- normVals[ , order(names(normVals))]
x.meta <- x.meta[order(row.names(x.meta)), ]

# Check that they are in the same order
all.equal(gsub("norm.","",names(normVals)), rownames(x.meta))

# Create a copy of the data into gene-scaled format 
scaleVals = t(apply(normVals, 1, function(x) {
  q10 = quantile(x, 0.1)
  q90 = quantile(x, 0.9)
  x[x < q10] = q10
  x[x > q90] = q90
  scale(x)
}))
colnames(scaleVals) = colnames(normVals)


## NOTE: You now have two datasets: 
##         -- normVals        (raw normalised count data)
##         -- scaleVals (thes same data, but each gene is on its own scale)
##
## Choose which to plot yourself!

#####################
## Complex Heatmap ##
#####################

# Plot a very basic heatmap
Heatmap(data.matrix(scaleVals))

# This is a lot of genes! Let's try to filter them down
# Try excluding genes with a fold change of less than 1
scaleValsFC1<-scaleVals[infile$FoldChange > 1,]

Heatmap(data.matrix(scaleValsFC1))

# Try just the 50 most expressed genes
scaleVals50<-scaleVals[order(infile$baseMean, decreasing = T)[1:50],]

Heatmap(data.matrix(scaleVals50))

# A fancier heatmap!
Heatmap(data.matrix(scaleVals50),
        cluster_columns = TRUE,
        col = colorRampPalette(c("white", "black", "red"))(50),
        heatmap_legend_param = list(
          title = "Scaled counts"),
        top_annotation = columnAnnotation(Type = x.meta$Type),
        row_labels = infile$names[rownames(infile) %in% rownames(scaleVals50)],
        row_names_gp = gpar(fontsize = 8),
        column_split = x.meta$Type,
        cluster_column_slices = FALSE,
        border = TRUE,
        clustering_distance_rows = "pearson",
        clustering_distance_columns = "pearson"
)

# Try commenting out different lines in the heatmap above. Can you work out what each option does?
# ComplexHeatmap has a very good reference manual at https://jokergoo.github.io/ComplexHeatmap-reference/book/
# If you want, try looking for more option to play with on there

# Try switching between scaled and non-scaled counts. Does it make a difference?

5.3 PCAexplorer

Exploring the multi-varient relationships between your rawdata or filtered lists provides a powerful approach to exploring the relationship between your samples. Both SARTools and Geo2R incorporate a simple PCA approach but this approach can be deployed to explore the full dimensionality of the data. PCAexplorer provides a R based shinny app (menu driven app) that allows you to explore PCA full functionality. You will need to create a raw data matrix and customised metadata file but these are easy to adapt from the SRATools metadata file and the SARTools ‘complete’ outout file - you can use raw or normalised counts. The documentation is very straightforward - have fun.

5.4 Extension: Useful Packages for Downstream Analysis

Use the list of ~200 most differentially expressed genes in gProfiler, or other online resources below, to investigate the functional connotations (More on this in the extension session).

Converting between common gene IDs

Whole dataset annotation (and ontologies)

Interaction Networks

Other visualisation tools

Whole packages

Pathway Analysis

Gene function repository - everything you want to know about a gene

GeneCard