Normalization and Clustering of scRNA-Seq Data

Benilton S Carvalho

Problem

Imagine a bustling city (your tissue sample). Each building is a cell with a unique function. scRNA-seq lets us eavesdrop on the conversations (gene expression) happening in each building.

Normalization: Adjusting the Volume Controls

Noise in the system

  • Technical noise: Think of it like background chatter in a coffee shop, making it hard to hear individual conversations.
  • Biological variation: Some buildings are naturally louder (cells are more active), while others whisper (cells are less active).

Technical Variability in scRNA-Seq

  • Systematic differences in sequencing coverage between libraries are common in single-cell experiments
  • Common causes:
    • technical differences in cDNA capture
    • PCR amplification efficiency across cells
    • difficulty of achieving consistent library preparation with minimal starting material

Finding the Right Volume - Aims

Normalization is like adjusting the volume in each building.

  • Remove technical variation (background chatter).
  • Make expression levels comparable across cells (buildings).

Normalization

  • Normalization refers to a set of statistical methods that aim to mitigate the effects of these systematic differences on downstream analyses.
  • Example: When comparing two groups of cells, we want to identify differences that are due to the biology, and not differences that result from the fact that cells from Group A have more reads than cells from Group B.
  • Therefore, normalization is needed to improve the statistical power to detect biology-driven differences.

Finding the Right Volume - Methods

Normalization is like adjusting the volume in each building.

  • Total count scaling (adjust based on overall conversation volume).
  • Logarithmic transformation (amplify quiet conversations).
  • Library size (DESeq2).
  • Deconvolution.
  • Spike-ins.
  • Other advanced methods.

Normalization by Size Factor

  • Simplest strategy
  • Divides all counts for a given cell by a “size factor”
  • Assumption: any cell-specific bias affects all genes equally
  • The resulting ‘normalized counts’ can be used in downstream analyses
  • Size factor (median-of-ratios):
    • Compute the geometric mean of each gene across cells
    • Divide each cell’s counts by these gene-wise means
    • Take the median of the resulting ratios as that cell’s size factor
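
A minimal sketch of this median-of-ratios calculation in plain base R, on a toy count matrix (genes in rows, cells in columns; data and variable names are illustrative only):

counts <- matrix(rpois(30, lambda = 10) + 1, nrow = 6)  # 6 genes x 5 cells (+1 avoids zeros in the toy data)
geo_mean <- exp(rowMeans(log(counts)))                  # geometric mean of each gene across cells
ratios <- counts / geo_mean                             # gene-wise ratios for every cell
size_factors <- apply(ratios, 2, median)                # per-cell size factor = median of its ratios
normalized <- sweep(counts, 2, size_factors, "/")       # divide each cell's counts by its size factor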

Normalization by Deconvolution

  • Cells are first clustered into several groups
  • Normalization happens independently for each group
  • Size factors are used to scale counts and make different groups comparable
  • The assumption that most genes are not DE is more likely to hold within each group
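
This strategy is implemented, for instance, in the Bioconductor packages scran and scuttle. A minimal sketch, assuming sce is a SingleCellExperiment holding the raw counts:

library(scran)
library(scuttle)
clusters <- quickCluster(sce)                       # rough pre-clustering of the cells
sce <- computeSumFactors(sce, clusters = clusters)  # deconvolution-based size factors within each group
sce <- logNormCounts(sce)                           # apply the size factors and log-transform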

Normalization by Spike-in

  • Add the same amount of spike-in RNA to each cell
  • Equalize spike-in coverage across cells using size factors that are specific for spike-ins
  • Makes no assumption about the biology of the endogenous genes (e.g., that most genes are not DE)
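
A possible sketch with scran, assuming the spike-in counts are stored in sce as an alternative experiment named "ERCC" (the name is hypothetical):

library(scran)
library(scuttle)
sce <- computeSpikeFactors(sce, spikes = "ERCC")  # size factors computed from spike-in coverage only
sce <- logNormCounts(sce)                         # scale all counts by the spike-in size factors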

How to Normalize Data Using Seurat?

object <- NormalizeData(
  object,
  normalization.method = "LogNormalize",  # default method
  scale.factor = 10000                    # default scale factor
)
  • Feature counts for each cell are divided by the total counts for that cell
  • Later, they are multiplied by the scale.factor
  • Finally, this result is natural-log transformed
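
For intuition, the same computation can be written out by hand on a toy count matrix (data and variable names are illustrative; 10000 matches the default scale.factor):

counts <- matrix(rpois(20, lambda = 5), nrow = 4,
                 dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5)))
totals <- colSums(counts)                              # total counts per cell
lognorm <- log1p(sweep(counts, 2, totals, "/") * 1e4)  # divide by total, scale, natural log(x + 1)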

Dimension Reduction: A Primer

Introduction to Dimension Reduction

  • Objective: Reduce the number of dimensions while preserving essential data characteristics.
  • Why:
    • Simplifies visualization and interpretation.
    • Reduces computational burden.
    • Helps to identify and remove noise.

Principal Component Analysis (PCA)

  • Objective: Transform data into a set of orthogonal (uncorrelated) components.
  • Key Points:
    • Variance: Maximizes the variance captured in each component.
    • Components: Principal components (PCs) are ranked by the amount of variance they explain.
    • Linear: PCA is a linear transformation.

PCA Steps

  1. Standardize the Data: Ensure each gene has a mean of 0 and a standard deviation of 1.
  2. Covariance Matrix: Compute the covariance matrix to understand the data variance.
  3. Eigen Decomposition: Compute eigenvalues and eigenvectors from the covariance matrix.
  4. Principal Components: Select top eigenvectors as principal components.
  5. Transformation: Project data onto the principal components.
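
These steps can be reproduced on a toy matrix with base R (data and variable names are illustrative only):

expr <- matrix(rnorm(200), nrow = 20)  # 20 cells x 10 genes
expr_std <- scale(expr)                # 1. standardize: each gene gets mean 0, sd 1
cov_mat <- cov(expr_std)               # 2. covariance matrix (10 x 10)
eig <- eigen(cov_mat)                  # 3. eigenvalues and eigenvectors
pcs <- eig$vectors[, 1:2]              # 4. top eigenvectors are the principal components
scores <- expr_std %*% pcs             # 5. project the cells onto the PCs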

PCA in scRNA-seq

  • Preprocessing: Normalize and log-transform the gene expression matrix.
  • Application: Identify the most variable genes before applying PCA.
  • Interpretation: First few PCs often capture major biological variability (e.g., cell types, states).
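
In Seurat, this preprocessing typically amounts to two calls before RunPCA (shown in the next slide); a minimal sketch, assuming object already holds the log-normalized data:

object <- FindVariableFeatures(object)  # flag the most variable genes
object <- ScaleData(object)             # center and scale: each gene gets mean 0, sd 1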

Run PCA with Seurat

The function RunPCA performs PCA dimension reduction on the object. But the user must be careful about the number of features that will be used. Most of the time, we can use only the most variable features (why?) and reduce the computational burden significantly. It is also good practice to overwrite the object with the result of RunPCA, so that the reduction is stored together with the rest of the data.

object <- RunPCA(
  object,
  features = VariableFeatures(object)  # restrict PCA to the most variable genes
)

Visualizing PCs

We can use the DimPlot function to visualize the principal components (one dimension vs. another). On this plot, we search for the existence of clusters. It is important to remember that the PCs are sorted in decreasing order of variance (i.e., PC1 explains the most variance, PC2 the second most, and so on).

DimPlot(
  object,
  reduction = "pca"
)
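
By default the first two PCs are plotted; other pairs can be inspected with the dims argument (the values below are illustrative):

DimPlot(
  object,
  reduction = "pca",
  dims = c(3, 4)  # plot PC3 against PC4
)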

Visualizing PC Heatmap

Another option is to visualize each principal component in terms of gene-specific contributions. This visualization shows how each gene contributes to the variance explained by that component. A heatmap dominated by a single color indicates that all genes contribute in the same direction (the component is essentially capturing an overall average). We expect to identify blocks of cells and genes with diverging colors.

DimHeatmap(
  object,
  dims = 1:6  # which principal components to display (illustrative choice)
)

Uniform Manifold Approximation and Projection (UMAP)

  • Objective: Non-linear dimension reduction technique.
  • Key Points:
    • Local vs. Global Structure: Prioritizes local neighborhood structure while retaining much of the global structure.
    • Manifold Learning: Assumes data lies on a manifold of lower dimension.
    • Flexible: Effective for various data types, including scRNA-seq.

UMAP Steps

  1. Graph Construction: Construct a high-dimensional graph representation of the data.
  2. Optimization: Optimize the graph for a low-dimensional representation.
  3. Embedding: Use the optimized graph to embed the data in 2D or 3D.

UMAP in scRNA-seq

  • Preprocessing: Similar to PCA (normalize, log-transform, select variable genes).
  • Visualization: Produces visually interpretable clusters representing cell populations.
    • Parameters: n_neighbors (local structure) and min_dist (clustering tightness); see the tuning example after RunUMAP below.

Running UMAP with Seurat

Similarly to RunPCA, we use the RunUMAP function to obtain the UMAP projections for dimension reduction. The tuning parameter dims (which of the previously computed dimensions, e.g., principal components, to use) needs to be adjusted: too few dimensions produce strange, artificial patterns, while including more dimensions allows the cell clusters to emerge.

object <- RunUMAP(
  object,
  dims = 1:20  # principal components fed into UMAP
)
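
The n_neighbors and min_dist parameters mentioned earlier are exposed by Seurat as n.neighbors and min.dist; the values below are the package defaults and are shown only for illustration:

object <- RunUMAP(
  object,
  dims = 1:20,
  n.neighbors = 30,  # larger values emphasize global structure
  min.dist = 0.3     # smaller values pack cells within a cluster more tightly
)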

Visualizing UMAP Projections

We can use the DimPlot function again to visualize the UMAP projections (one dimension vs. another). On this plot, we search for the existence of clusters.

DimPlot(
  object,
  reduction = "umap"  # can be omitted
)

Comparison PCA vs UMAP

Aspect          PCA                           UMAP
Type            Linear                        Non-linear
Variance        Maximizes explained variance  Preserves local/global structure
Interpretation  Easier to interpret PCs       Better for complex structures
Speed           Faster                        Slower

Practical Tips

  • Choosing Method:
    • Use PCA for initial analysis and simple structures.
    • Use UMAP for complex, non-linear structures.
  • Interpretation:
    • Both methods require careful interpretation.
    • Visualizations should be supplemented with biological knowledge.

Clustering: Uncovering Neighborhoods

Uncovering Neighborhoods - Aim

Clustering is like identifying distinct neighborhoods within the city.

  • Group cells with similar gene expression profiles (similar conversations).

Clustering

  • Unsupervised (statistical learning) strategy to define groups of cells with similar expression patterns.
  • Simplifies interpretation.
  • Clusters should be treated as approximations of abstract biological concepts (cell types, states, etc).
  • These clusters are exploratory tools; we can create as many versions as we want (more clusters, fewer clusters).
  • There is no single “correct” clustering; asking whether a given set of cluster assignments is right or wrong is not meaningful.

Uncovering Neighborhoods - Methods

Clustering is like identifying distinct neighborhoods within the city.

  • K-means (predefine the number of neighborhoods).
  • Hierarchical clustering (build a family tree of cell relationships).
  • Graph-based clustering (connect cells based on shared conversations).

Hierarchical Clustering

  • Produces a dendrogram, useful for understanding the relationships between subpopulations.
  • It’s slow for large datasets.
  • Uses cell-to-cell distances, so it requires a lot of memory (the full distance matrix grows quadratically with the number of cells).
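
A toy sketch with base R illustrates the idea (and why the distance matrix becomes the bottleneck); data and variable names are illustrative only:

expr <- matrix(rnorm(300), nrow = 50)  # 50 genes x 6 cells of normalized expression
d <- dist(t(expr))                     # cell-to-cell distance matrix (quadratic in the number of cells)
hc <- hclust(d, method = "ward.D2")    # build the dendrogram
clusters <- cutree(hc, k = 2)          # cut the tree into 2 groups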

Graph-based Clustering

  • Used often in Seurat.
  • Uses information from neighbors (in higher dimensions) to create communities.
  • It’s fast (based on k-nearest neighbors, kNN).
  • It’s robust.
  • Doesn’t make assumptions about the shapes of the clusters.
  • Information about relationships beyond neighbors is lost.

Perform Clustering with Seurat - 1

  • Determine distances between cells
object <- FindNeighbors(
  object,
  dims = 1:20  # dimensions (e.g., PCs) used to build the neighbor graph
)

Computes the k.param nearest neighbors for a given dataset. Optionally (via compute.SNN), it also constructs a shared nearest neighbor graph by calculating the neighborhood overlap (Jaccard index) between every cell and its k.param nearest neighbors.

Perform Clustering with Seurat - 2

  • Perform graph-based clustering
object <- FindClusters(
  object  # the resolution argument (default 0.8) controls how fine-grained the clusters are
)

Identify clusters of cells by a shared nearest neighbor (SNN) modularity optimization based clustering algorithm.
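
Putting the pieces together, a minimal end-to-end sketch of the workflow covered here (parameter choices are illustrative only):

object <- NormalizeData(object)
object <- FindVariableFeatures(object)
object <- ScaleData(object)
object <- RunPCA(object, features = VariableFeatures(object))
object <- FindNeighbors(object, dims = 1:20)
object <- FindClusters(object)
object <- RunUMAP(object, dims = 1:20)
DimPlot(object, reduction = "umap", label = TRUE)  # clusters colored on the UMAP projection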

Recap

Concept     Analogy             Goal
scRNA-seq   City                Understand conversations (gene expression) in each building (cell).
Noise       Background chatter  Makes it hard to hear individual conversations.

Recap

Concept        Analogy         Goal
Normalization  Volume control  Adjust volume to make conversations comparable.
Clustering     Neighborhoods   Group buildings with similar conversations together.