Differential Expression and Annotation

Benilton S Carvalho

Tools for scRNA-Seq

  • scanpy
  • Monocle
  • Bioconductor
  • Seurat

Differential Expression

Methods

  • Preprocessing: data import, QC, quantification
  • Normalization: dropout, depth, batch effects
  • Dimensionality Reduction: PCA, UMAP
  • Clustering: cell type clusters
  • Differential Gene Expression

Seurat

  • Most used tool

  • Good documentation

  • Several tutorials

  • Many methods

  • Extendable

  • Gene Expression Dynamics

Statistical Analysis of Differences Between Clusters

  • Different types of hits
    • Quantitative: expression differs significantly between clusters
    • Qualitative: expression is predictive of cluster membership
  • Different types of markers
    • Global: Distinguish one cluster from all of the others
    • Local: Distinguish one cluster from a set of clusters

Comparing Gene Expression Across Groups

  • Suppose you have two groups of cells
  • You have expression levels for each cell within each group
  • Question: is gene expression significantly different between both groups?

Methods

  • Non-parametric: Wilcoxon rank sum test
  • Parametric: t-test, negative binomial
  • Classification: ROC
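  • Specialized: MAST

## Compare groups "g1" and "g2" of the "status" metadata column,
## keeping only genes that are higher in g1
FindMarkers(data, ident.1 = "g1", ident.2 = "g2",
            group.by = "status", test.use = "roc",
            only.pos = TRUE)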
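
As a rough illustration of what an ROC-based test measures, the sketch below computes the area under the ROC curve (AUC) for a single gene in base R. The data are simulated and the names g1, g2 and auc are assumptions made for this example; it is not Seurat's implementation.

```r
# Toy expression values for one gene in two groups (simulated, not real data)
set.seed(1)
g1 <- rnorm(30, mean = 2)   # cells in group "g1"
g2 <- rnorm(30, mean = 1)   # cells in group "g2"

# AUC: probability that a randomly chosen g1 cell has higher expression
# than a randomly chosen g2 cell (ties count as one half)
auc <- mean(outer(g1, g2, ">") + 0.5 * outer(g1, g2, "=="))

# The same value follows from the Wilcoxon rank sum statistic W
w <- unname(wilcox.test(g1, g2)$statistic)
auc_from_w <- w / (length(g1) * length(g2))

print(c(auc = auc, auc_from_w = auc_from_w))
```

An AUC near 0.5 means the gene does not separate the groups; values near 0 or 1 mean it separates them almost perfectly.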

Wilcoxon rank sum test

  • Challenge: scRNA-seq data does not follow a beautiful bell-shaped curve
  • Non-parametric: makes no assumptions about the shape of the data distribution
  • Rank-based: doesn’t use the magnitude of the expression levels but relies on the order of the data instead
  • Two-sample: compares two independent groups

Wilcoxon rank sum test: step-by-step

  • Combine: pool the expression levels from both groups
  • Rank: sort the values from smallest to largest, assigning ranks (1, 2, …); tied values share the average rank
  • Sum: sum the ranks for each group
  • Test: the test determines if the difference in rank sums is large enough to be unlikely due to chance alone (p-value)
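
The four steps above can be sketched in base R as follows. The counts are simulated and the variable names (group_a, group_b, w_manual) are assumptions made for this illustration; the manual statistic is compared with the one returned by wilcox.test().

```r
set.seed(42)
# Simulated counts for one gene in two groups (illustrative values only)
group_a <- rpois(20, lambda = 5)
group_b <- rpois(20, lambda = 9)

# Combine: pool the expression levels from both groups
pooled <- c(group_a, group_b)

# Rank: ties receive the average of the ranks they span
ranks <- rank(pooled)

# Sum: rank sum for each group
idx_a <- seq_along(group_a)
rank_sum_a <- sum(ranks[idx_a])
rank_sum_b <- sum(ranks[-idx_a])

# Test: wilcox.test() reports W, the rank sum of the first group
# minus its smallest possible value n_a * (n_a + 1) / 2
n_a <- length(group_a)
w_manual <- rank_sum_a - n_a * (n_a + 1) / 2

# exact = FALSE uses the normal approximation, which handles ties
res <- wilcox.test(group_a, group_b, exact = FALSE)
print(c(w_manual = w_manual, w_builtin = unname(res$statistic),
        p_value = res$p.value))
```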

t-test

  • Works better when data follow a bell-shaped distribution
  • But often performs well even when the data are not normally distributed, especially with many cells per group
  • Compares the average gene expression in the two groups
  • Assesses if the difference is likely due to chance or a real effect of the groups

t-test: step-by-step

  • Calculate the Means: Find the average gene expression for each group.
  • Measure the Spread: How much the individual measurements vary around the averages (SD).
  • Calculate the t-statistic: A number that combines the difference in means and the spread. Larger t-values suggest a bigger difference between groups.
  • Get the p-value: The probability of seeing a difference as large as (or larger than) the one you observed if there were actually no real effect of the groups.
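
A minimal sketch of these steps in base R, assuming the Welch (unequal variance) form of the test, which is the default of t.test(). The data are simulated and the names x, y and t_manual are hypothetical.

```r
set.seed(7)
# Simulated (log-scale) expression of one gene in two groups
x <- rnorm(25, mean = 3.0, sd = 1)
y <- rnorm(25, mean = 2.2, sd = 1)

# Calculate the means
mean_x <- mean(x)
mean_y <- mean(y)

# Measure the spread: squared standard error of each group mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Calculate the t-statistic: difference in means over its standard error
t_manual <- (mean_x - mean_y) / sqrt(vx + vy)

# Get the p-value: two-sided, with Welch-Satterthwaite degrees of freedom
df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
p_manual <- 2 * pt(-abs(t_manual), df)

# Compare with the built-in test
res <- t.test(x, y)
print(c(t_manual = t_manual, t_builtin = unname(res$statistic),
        p_manual = p_manual, p_builtin = res$p.value))
```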

Negative Binomial Test

  • Compares the expression between groups using count data
  • In scRNA-seq, we count how many reads (or UMIs) are assigned to each gene in each cell
  • Count data doesn’t behave like continuous measurements (e.g., heights, weights). It has unique properties:
    • Discrete: You can’t see a gene 0.5 times. Counts are whole numbers (0, 1, 2, etc.).
    • Overdispersed: The variation in counts is often larger than expected from a simple model such as the Poisson, in which the variance equals the mean. Think of it as the same gene varying across cells more than simple counting noise alone would explain.

Negative Binomial Distribution

The negative binomial distribution is a statistical model that’s well-suited for count data. It can handle:

  • Discrete nature: It only deals with whole numbers.
  • Overdispersion: It allows for extra variation in the data.
  • In scRNA-seq data, overdispersion reflects technical variability (e.g., library preparation, sequencing) combined with biological variability (differences across cells).
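
To make overdispersion concrete, the sketch below simulates Poisson and negative binomial counts with the same mean and compares their variances. The mean (10) and the dispersion parameter (size = 2) are arbitrary values chosen for this example.

```r
set.seed(123)
n_cells <- 5000
mu <- 10

# Poisson counts: the variance is expected to equal the mean
pois_counts <- rpois(n_cells, lambda = mu)

# Negative binomial counts with the same mean; the theoretical variance
# is mu + mu^2 / size, which is 60 for mu = 10 and size = 2
nb_counts <- rnbinom(n_cells, mu = mu, size = 2)

summary_table <- rbind(
  poisson = c(mean = mean(pois_counts), variance = var(pois_counts)),
  negative_binomial = c(mean = mean(nb_counts), variance = var(nb_counts))
)
print(summary_table)
```

Both samples contain only whole numbers, but the negative binomial sample spreads far more widely around the same mean.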

The Negative Binomial Test: Comparing Groups

In scRNA-seq, we often want to compare gene expression between groups (e.g., treated vs. control cells). The negative binomial test helps us do this by:

  • Modeling the Counts: It estimates the parameters of the negative binomial distribution for each group.
  • Testing for Differences: It assesses whether the differences in counts between groups are statistically significant.
  • p-values are often used to summarize the evidence of differences between groups.
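
One possible way to carry out such a comparison for a single gene is a likelihood ratio test between two negative binomial regression models, sketched below with glm.nb() from the MASS package. This is an illustration under assumed, simulated data (group means of 10 and 20, size = 2); it is not necessarily how any particular scRNA-seq tool implements its test.

```r
library(MASS)  # provides glm.nb() for negative binomial regression

set.seed(2024)
n_cells <- 200
group <- factor(rep(c("control", "treated"), each = n_cells))

# Simulated counts for one gene (assumed parameters, for illustration)
counts <- c(rnbinom(n_cells, mu = 10, size = 2),
            rnbinom(n_cells, mu = 20, size = 2))

# Modeling the counts: one model without and one with the group effect
fit_null <- glm.nb(counts ~ 1)
fit_full <- glm.nb(counts ~ group)

# Testing for differences: likelihood ratio statistic, 1 degree of freedom
lr_stat <- as.numeric(2 * (logLik(fit_full) - logLik(fit_null)))
p_value <- pchisq(lr_stat, df = 1, lower.tail = FALSE)

# Estimated log fold change (natural log scale) of treated vs. control
log_fc <- unname(coef(fit_full)["grouptreated"])
print(c(lr_stat = lr_stat, p_value = p_value, log_fc = log_fc))
```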

Simulation

  • 33k genes
  • 200 cells per group
  • 2 groups
  • 1k differentially expressed genes
  • baseline counts: 10
  • effect size: 5

DE Genes

Correction  Wilcoxon  t-test  Negative Binomial  MAST
p-value     0.91      0.97    0.97               0.09
FDR         0.49      0.72    0.76               0.00

Remember: we simulated data, so we know that there are 1,000 genes that are differentially expressed. The proportions above represent the fraction of these 1,000 genes that each method was able to detect.

not DE Genes

Correction  Wilcoxon  t-test  Negative Binomial  MAST
p-value     0.05      0.05    0.05               0.02
FDR         0.00      0.00    0.00               0.00

Remember: we simulated data, so we know that there are 32,000 genes that are not differentially expressed. The proportions above represent the fraction of these 32,000 genes that each method flagged as differentially expressed (false positives: the method made mistakes).
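
A scaled-down sketch of this kind of simulation, restricted to the Wilcoxon test and the t-test, is shown below. All settings are assumptions made for this example (1,000 genes, 100 of them differentially expressed, 50 cells per group, negative binomial counts with mean 10 versus 15 and size = 2), so the output is not expected to reproduce the tables above.

```r
set.seed(99)
n_genes <- 1000
n_de    <- 100     # the first n_de genes are truly differentially expressed
n_cells <- 50      # cells per group
is_de   <- seq_len(n_genes) <= n_de

mu_g1 <- rep(10, n_genes)          # baseline mean count
mu_g2 <- ifelse(is_de, 15, 10)     # shifted mean for DE genes only

p_wilcox <- numeric(n_genes)
p_ttest  <- numeric(n_genes)
for (g in seq_len(n_genes)) {
  x <- rnbinom(n_cells, mu = mu_g1[g], size = 2)
  y <- rnbinom(n_cells, mu = mu_g2[g], size = 2)
  p_wilcox[g] <- wilcox.test(x, y, exact = FALSE)$p.value
  p_ttest[g]  <- t.test(x, y)$p.value
}

# Fraction of genes called at 0.05, using raw or BH-adjusted p-values
summarise <- function(p) {
  fdr <- p.adjust(p, method = "BH")
  c(detected_raw  = mean(p[is_de] < 0.05),
    detected_fdr  = mean(fdr[is_de] < 0.05),
    false_pos_raw = mean(p[!is_de] < 0.05),
    false_pos_fdr = mean(fdr[!is_de] < 0.05))
}
results <- rbind(wilcoxon = summarise(p_wilcox), t_test = summarise(p_ttest))
print(round(results, 2))
```

The detected columns correspond to the "DE Genes" table and the false_pos columns to the "not DE Genes" table; adjusting the p-values can only reduce both.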

What about new methods?

Cell Type Annotation

Why annotate cell types?

  • Interpreting the findings of our analysis is the most difficult task in sc-data analysis
  • Understanding the biological state of each cluster is much harder than assigning cells to clusters
  • To do this, we need to “connect” our dataset to existing knowledge
  • One strategy is to compare the expression of our dataset to the expressions of curated existing datasets (references)
  • What tool do we use? SingleR
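
The idea of comparing a dataset to a labeled reference can be illustrated with a toy example: correlate each cell with each reference profile and assign the best-matching label. The sketch below uses Spearman correlation on simulated data; all names (ref_profiles, test_cells, the three cell types) are hypothetical, and this is a deliberately simplified illustration rather than SingleR's actual algorithm.

```r
set.seed(10)
n_genes <- 200
genes <- paste0("gene", seq_len(n_genes))

# Hypothetical reference: one expression profile per known cell type
ref_profiles <- cbind(
  T_cell   = rnorm(n_genes, mean = 5),
  B_cell   = rnorm(n_genes, mean = 5),
  Monocyte = rnorm(n_genes, mean = 5)
)
rownames(ref_profiles) <- genes

# Hypothetical test cells: noisy copies of the reference profiles
true_labels <- c("T_cell", "Monocyte", "B_cell", "T_cell")
test_cells <- sapply(true_labels, function(l) {
  ref_profiles[, l] + rnorm(n_genes, sd = 0.5)
})
colnames(test_cells) <- paste0("cell", seq_along(true_labels))

# Score every cell (rows) against every reference type (columns)
scores <- cor(test_cells, ref_profiles, method = "spearman")

# Assign each cell the label with the highest score
predicted <- colnames(scores)[max.col(scores, ties.method = "first")]
print(data.frame(cell = rownames(scores), predicted = predicted))
```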

Cell Type Annotation

  • SingleR pkg contains the statistical method for assignment
  • celldex pkg shares several reference (well curated) datasets
  • Most references are built from bulk RNA-seq and microarray
  • They are good enough for annotation of sc-data, provided that the references contain the cell types that are expected to be present in the test data
  • We’ll use a reference built from Blueprint and ENCODE data
  • Single-cell references can also be used

How to perform annotation?

## Load the references
library(celldex)
ref = BlueprintEncodeData()

## We could load a sc reference instead
## (MuraroPancreasData() is provided by the scRNAseq package)
## ref = scRNAseq::MuraroPancreasData()

How to perform annotation?

## Compare expression levels from my.data
##     to the reference
library(SingleR)
pred = SingleR(test = my.data, ref = ref,
               labels = ref$label.main)

## Number of cells assigned to each cell type
table(pred$labels)

Observing the results

plotScoreHeatmap(pred)