Single-cell and spatial transcriptomics data analysis with Seurat in R
Yutong Wang, Graduate Group in Biostatistics, University of California, Berkeley
11/10/2021
1 Load packages
2 Seurat object
2.1 Download one 10X Genomics Visium dataset and load it into Seurat
2.2 Explore the Seurat object
2.2.1 Dimension of the active assay
2.2.2 Feature and sample names
2.2.3 Sample-level metadata
2.2.4 objects(e.g. Assay) together with feature-level metadata
3 Analysis pipeline in Seurat
3.1 Preprocess data
3.1.1 Quality control
3.1.2 Normalization
3.2 Downstream tasks (dimension reduction, data visualization, cluster annotation, differential expression)
3.3 Identify spatially variable genes
3.3.1 Moran’s I
3.3.2 marker-variogram
3.3.3 Other R packages to identify spatially variable genes
4 Convert data formats between R and Python
5 Acknowledgements
1 Load packages
Besides Seurat, we need to import some useful packages.
dplyr and ggplot2 are part of tidyverse, a readable and flexible R language for solving data science challenges. I personally prefer the coding style with tidyverse, but you may use base R too.
patchwork combines separate ggplots into the same graphic with easy access to control layouts.
limma is a Bioconductor package to analyze microarray data. It is not essential for our analysis, but may provide a more efficient implementation of the Wilcoxon rank sum test in differential expression analysis.
library(Seurat)
library(dplyr) # data manipulation
library(ggplot2)
library(patchwork)
install.packages(‘BiocManager’)
BiocManager::install(‘limma’)
library(limma) # optional
2 Seurat object
We first load one spatial transcriptomics dataset into Seurat, and then explore the Seurat object a bit for single-cell data storage and manipulation. One 10X Genomics Visium dataset will be analyzed with Seurat in this tutorial, and you may explore other dataset sources from various sequencing technologies, and other computational toolkits listed in this (non-exhaustive) resource spreadsheet.
2.1 Download one 10X Genomics Visium dataset and load it into Seurat
A spatial gene expression dataset of mouse brain serial section 2 (Sagittal-Posterior) collected by Space Ranger 1.1.0. will be analyzed throughout the tutorial. Both the gene expression matrix and spatial imaging data are necessary for the computational analysis.
For illustration purposes, I have downloaded them from 10x Genomics, and uploaded them to Google Drive. Alternatively, you can download the same files directly from 10x Genomics, or explore other spatial gene expression datasets out there.
The data files we will be using today include:
a (filtered) feature / cell matrix HDF5 file (.h5)
a spatial imaging data folder (.tar.gz)
specify your working directory
root_dir <- “~/Downloads/”
setwd(root_dir)
url prefix to download files
url_prefix <- “https://drive.google.com/uc?export=download&id=”
url IDs are extracted from Google Drive
spatial_folder_id <- “13B6ulZwz9XmWBD1RsnTHTMn3ySgIpFCY”
h5_mat_id <- “1nnch2ctksJ4rXYG5FzlPh3bIgnylTicV”
file names
spatial_folder_name <- “./V1_Mouse_Brain_Sagittal_Posterior_Section_2_spatial.tar.gz”
h5_mat_name <- “./V1_Mouse_Brain_Sagittal_Posterior_Section_2_filtered_feature_bc_matrix.h5”
download files from the resource URL to a destination path where the file is saved
download.file(url = paste0(url_prefix, spatial_folder_id),
destfile = file.path(spatial_folder_name))
download.file(url = paste0(url_prefix, h5_mat_id),
destfile = file.path(h5_mat_name))
extract (or list) contents from the tar archives
untar(spatial_folder_name)
untar(spatial_folder_name, list = TRUE) # list contents
[1] “spatial/” “spatial/tissue_lowres_image.png”
[3] “spatial/tissue_positions_list.csv” “spatial/aligned_fiducials.jpg”
[5] “spatial/tissue_hires_image.png” “spatial/scalefactors_json.json”
[7] “spatial/detected_tissue_image.jpg”
Load a 10x Genomics Visium Spatial Experiment into a Seurat object
brain_data <- Seurat::Load10X_Spatial(
The directory contains the read count matrix H5 file and the image data in a subdirectory called spatial
.
data.dir = root_dir,
filename = h5_mat_name,
assay = “Spatial”, # specify name of the initial assay
slice = “slice1”, # specify name of the stored image
filter.matrix = TRUE,
to.upper = FALSE
)
brain_data
An object of class Seurat
32285 features across 3289 samples within 1 assay
Active assay: Spatial (32285 features, 0 variable features)
2.2 Explore the Seurat object
A Seurat object serves as a container that contains both data (like the count matrix) and analysis (like dimension reduction or clustering results) for a single-cell dataset.
2.2.1 Dimension of the active assay
Extract the dimension of the active assay using dim, nrow, or ncol.
dim(x = brain_data) # the number of features (genes) by samples (spots)
[1] 32285 3289
nrow(x = brain_data) # the number of features
ncol(x = brain_data) # the number of samples
2.2.2 Feature and sample names
Extract the feature and sample names using rownames or colnames.
head(x = rownames(brain_data), n = 5)
[1] “Xkr4” “Gm1992” “Gm19938” “Gm37381” “Rp1”
tail(x = colnames(brain_data), n = 5)
2.2.3 Sample-level metadata
Sample-level metadata is stored as a data.frame, where each row correspond to one sample (e.g. cell or spot) and each column correspond to one sample-level metadata field. It can be accessed via [[ extract operator, the meta.data object, or the $ sigil ($ extracts one single column at a time). Row names in the metadata need to match the column names of the counts matrix.
class(brain_data[[]])
[1] “data.frame”
colnames(brain_data[[]]) # automatically calculated while creating the Seurat object
[1] “orig.ident” “nCount_Spatial” “nFeature_Spatial”
head(brain_data@meta.data)
orig.ident nCount_Spatial nFeature_Spatial
AAACAAGTATCTCCCA-1 SeuratProject 14786 4811
AAACACCAATAACTGC-1 SeuratProject 9009 2748
AAACAGCTTTCAGAAG-1 SeuratProject 7853 2665
AAACAGGGTCTATATT-1 SeuratProject 20932 5334
AAACAGTGTTCCTGGG-1 SeuratProject 42740 8115
AAACATTTCCCGGATT-1 SeuratProject 10320 3834
brain_data$nCount_Spatial[1:3]
AAACAAGTATCTCCCA-1 AAACACCAATAACTGC-1 AAACAGCTTTCAGAAG-1
14786 9009 7853
nFeature_Spatial: the number of unique genes in each sample
sum(brain_data n F e a t u r e S p a t i a l = = c o l S u m s ( b r a i n d a t a @ a s s a y s nFeature_Spatial == colSums(brain_data@assays nFeatureSpatial==colSums(braindata@assaysSpatial@counts > 0))
[1] 3289
nCount_Spatial: the total number of detected molecules in each sample
sum(brain_data n C o u n t S p a t i a l = = c o l S u m s ( b r a i n d a t a @ a s s a y s nCount_Spatial == colSums(brain_data@assays nCountSpatial==colSums(braindata@assaysSpatial@counts))
[1] 3289
2.2.4 objects(e.g. Assay) together with feature-level metadata
A vector of names of associated objects can be had with the names function
These can be passed to the double [[ extract operator to pull them from the Seurat object
names(x = brain_data)
[1] “Spatial” “slice1”
brain_data[[‘Spatial’]] # equivalent to: brain_data@assays$Spatial
Assay data with 32285 features for 3289 cells
First 10 features:
Xkr4, Gm1992, Gm19938, Gm37381, Rp1, Sox17, Gm37587, Gm37323, Mrpl15,
Lypla1
brain_data[[‘slice1’]] # equivalent to: brain_data@images$slice1
Spatial data from the VisiumV1 technology for 3289 samples
Associated assay: Spatial
Image key: slice1_
Each Seurat object consists of one or more Assay objects (basis unit of Seurat) as individual representations of single-cell expression data. Examples of the Assay objects include RNA, ADT in CITE-seq, or Spatial. Each Assay stores multiple slots, including raw (counts), normalized (data) and scaled data (scaled.data) as well as a vector of variable features (var.features) and feature-level metadata (meta.features).
brain_data@assays$Spatial@counts[5:10, 1:3]
6 x 3 sparse Matrix of class “dgCMatrix”
AAACAAGTATCTCCCA-1 AAACACCAATAACTGC-1 AAACAGCTTTCAGAAG-1
Rp1 . . .
Sox17 . . .
Gm37587 . . .
Gm37323 . . .
Mrpl15 1 . 1
Lypla1 1 . .
Feature-level metadata is associated with each individual assay. Feature-level metadata can be accessed through the double bracket [[ extract operator on the Assay objects, or the meta.features slot.
brain_data[[‘Spatial’]]@meta.features
head(brain_data[[‘Spatial’]][[]])
Other objects containing the single-cell data analysis results will be discussed later after going through the analysis pipeline.
3 Analysis pipeline in Seurat
The general steps to perform preprocessing, dimensiona reduction and clustering for spatial transcriptomics data are quite similar to the Seurat workflow analyzing single-cell RNA sequencing data.
3.1 Preprocess data
A standard preprocessing workflow includes the selection and filtration of cells based on quality control (QC) metrics, data normalization and (optional) scaling, and the detection of highly variable features.
3.1.1 Quality control
A few common QC metrics include
The number of unique genes detected in each sample (nFeature_Spatial).
The total number of molecules detected within a sample (nCount_Spatial).
The percentage of reads that map to the mitochondrial genome.
brain_data[[“percent.mt”]] <- PercentageFeatureSet(brain_data,
pattern = “^mt-”)
VlnPlot(
brain_data, features = c(“nFeature_Spatial”, “nCount_Spatial”, “percent.mt”),
pt.size = 0.1, ncol = 3) &
theme(axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank())
Jointly (rather than separately) consider the QC metrics when filtering
plot1 <- FeatureScatter(
brain_data, feature1 = “nCount_Spatial”, feature2 = “percent.mt”) + NoLegend()
plot2 <- FeatureScatter(
brain_data, feature1 = “nCount_Spatial”, feature2 = “nFeature_Spatial”) +
NoLegend()
plot1 + plot2
brain_subset <- subset(
brain_data,
subset = nFeature_Spatial < 8000 & nFeature_Spatial > 1000 &
nCount_Spatial < 50000 & percent.mt < 30)
print(paste(“Filter out”, ncol(brain_data) - ncol(brain_subset),
“samples because of the outlier QC metrics, with”, ncol(brain_subset),
“samples left.”))
[1] “Filter out 96 samples because of the outlier QC metrics, with 3193 samples left.”
3.1.2 Normalization
The variance of molecular counts expresses spatial heterogeneity which cannot be solely explained by technical noise. Satija Lab and Collaborators recommends normalization using SCTransform (Hafemeister and Satija, 2019) in order to account for technical bias while preserving true biological differences.
SpatialFeaturePlot(
brain_subset, features = c(“nFeature_Spatial”, “nCount_Spatial”, “percent.mt”)) &
theme(legend.position = “bottom”)
brain_norm <- SCTransform(brain_subset, assay = “Spatial”, verbose = FALSE)
names(brain_norm)
[1] “Spatial” “SCT” “slice1”
dim(brain_norm@assays$SCT@counts)
[1] 17980 3193
dim(brain_norm@assays$SCT@data)
[1] 17980 3193
dim(brain_norm@assays$SCT@scale.data)
[1] 3000 3193
SCTransform returns the Seurat object with a new assay called SCT, where its counts slot stores the corrected UMI counts, the data slot stores the log-normalized version of the corrected UMI counts, and scale.data slot stores the pearson residuals (normalized values) and is used as PCA input.
The scale.data slot in output assay are subset to contain only the variable genes with return.only.var.genes = TRUE by default.
3.2 Downstream tasks (dimension reduction, data visualization, cluster annotation, differential expression)
brain_obj <- RunPCA(brain_norm, assay = “SCT”, verbose = FALSE)
compute K nearest neighbors (KNN)
brain_obj <- FindNeighbors(brain_obj, reduction = “pca”, dims = 1:30)
Leiden algorithm for community detection
brain_obj <- FindClusters(brain_obj, verbose = FALSE)
PCA result is the default UMAP input, use dimensions 1:30 as input features
brain_obj <- RunUMAP(brain_obj, reduction = “pca”, dims = 1:30)
plot3 <- DimPlot(brain_obj, reduction = “umap”, label = TRUE) + NoLegend()
plot4 <- SpatialDimPlot(brain_obj, label = TRUE, label.size = 3) + NoLegend()
plot3 + plot4
The reductions object stores a dimensionality reduction taken out in Seurat; each slot in reductions consists of a cell embeddings matrix, a feature loadings matrix, and a projected feature loadings matrix.
brain_obj@reductions
$pca
A dimensional reduction object with key PC_
Number of dimensions: 50
Projected dimensional reduction calculated: FALSE
Jackstraw run: FALSE
Computed using assay: SCT
$umap
A dimensional reduction object with key UMAP_
Number of dimensions: 2
Projected dimensional reduction calculated: FALSE
Jackstraw run: FALSE
Computed using assay: SCT
Ordinary differential expression analysis does not take spatial information into account, and thus it is generally applicable for both scRNA-seq data and spatial transcriptomics data to explore cellular heterogeneity of cell types and states.
identity class of each sample
table(brain_obj@active.ident)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
414 404 301 282 227 210 210 195 191 160 140 129 113 74 70 43 30
find all markers of cluster 10
cluster10_markers <- FindMarkers(brain_obj, ident.1 = 10, min.pct = 0.25)
head(cluster10_markers, n = 5)
p_val avg_log2FC pct.1 pct.2 p_val_adj
Esr1 4.302916e-154 0.645829 0.471 0.021 7.736643e-150
Trh 1.336079e-140 2.235272 0.557 0.041 2.402270e-136
Ccn3 8.237450e-118 2.262304 1.000 0.253 1.481094e-113
Pcdh8 8.330506e-115 1.562598 0.979 0.237 1.497825e-110
Trhr 1.723452e-114 0.752990 0.543 0.048 3.098767e-110
VlnPlot(brain_obj, features = c(“Esr1”, “Trh”, “Ccn3”))
SpatialFeaturePlot(object = brain_obj,
features = rownames(cluster10_markers)[1:3],
alpha = c(0.1, 1), ncol = 3)
The following code chunk is not evaluated due to computational time constraints, but you are encouraged to give it a try if interested.
find markers for every cluster compared to all remaining cells,
report only the positive ones
this code chunk is not evaluated for now because of time constraints
brain_obj_markers <- FindAllMarkers(brain_obj, only.pos = TRUE, min.pct = 0.25,
logfc.threshold = 0.25)
brain_obj_markers %>%
group_by(cluster) %>%
slice_max(n = 2, order_by = avg_log2FC)
top10 <- brain_obj_markers %>%
group_by(cluster) %>%
top_n(n = 10, wt = avg_log2FC)
DoHeatmap(brain_obj, features = top10$gene) + NoLegend()
3.3 Identify spatially variable genes
Seurat is able to calculate two metrics in order for the users to identify spatially variable genes, namely Moran’s I statistic and marker-variogram. Spatial coordinates of the samples are incorporated to identify features with spatial heterogeneity in its expression.
3.3.1 Moran’s I
Moran’s I is a global metric measuring the correlation of gene expression values between local observed values and the average of neighboring values. We can interpret Moran’s I as a spatial autocorrelation metric similar to the Pearson correlation coefficient in the context of spatial statistics. We can either calculate the z-score under normality assumption or perform permutation test to see whether we can reject the null hypothesis of zero spatial autocorrelation.
Moran’s I=N∑i,jWij∑i∑jWij(xi−x¯)(xj−x¯)∑i(xi−x¯)2
, where N is the total number of spatial location units indexed by (i,j), and W is a weight matrix to be discussed below. Recall that the Pearson correlation coefficient is
r=∑ni=1(xi−x¯)(yi−y¯)∑ni=1(xi−x¯)2−−−−−−−−−−−√∑ni=1(yi−y¯)2−−−−−−−−−−−√
Fig: Interpretation of Moran’s I statistic, image source.
Fig: Interpretation of Moran’s I statistic, image source.
Our assumptions about the spatial neighboring relationship can be encoded with a weight matrix W, where Wi,j can be either contiguity-based, or distance-based (inverse distance). For instance, the following figure illustrates potential cases for higher-order contiguity, when a first-order contiguity is determined based on the existence of a shared boundary or border.
Fig: Examples of contiguity-based spatial neighbors, image source.
Fig: Examples of contiguity-based spatial neighbors, image source.
brain_moransi <- FindSpatiallyVariableFeatures(
brain_obj, assay = “SCT”,
features = VariableFeatures(brain_obj)[1:10],
selection.method = “moransi”)
Computing Moran’s I
For a more efficient implementation of the Morans I calculation,
(selection.method = ‘moransi’) please install the Rfast2 package
--------------------------------------------
install.packages(‘Rfast2’)
--------------------------------------------
After installation of Rfast2, Seurat will automatically use the more
efficient implementation (no further action necessary).
This message will be shown once per session
moransi_output_df <- brain_moransi@assays S C T @ m e t a . f e a t u r e s n a . e x c l u d e h e a d ( m o r a n s i o u t p u t d f [ o r d e r ( m o r a n s i o u t p u t d f SCT@meta.features %>% na.exclude head(moransi_output_df[order(moransi_output_df SCT@meta.featuresna.excludehead(moransioutputdf[order(moransioutputdfMoransI_observed, decreasing = T), ])
MoransI_observed MoransI_p.value moransi.spatially.variable
Pcp2 0.6926961 0 TRUE
Car8 0.5745839 0 TRUE
Ttr 0.5635015 0 TRUE
Mbp 0.4965477 0 TRUE
Plp1 0.4491421 0 TRUE
Ptgds 0.2900585 0 TRUE
moransi.spatially.variable.rank
Pcp2 1
Car8 2
Ttr 3
Mbp 4
Plp1 5
Ptgds 6
top_features_moransi <- head(
SpatiallyVariableFeatures(brain_moransi,
selection.method = “moransi”), 3)
SpatialFeaturePlot(brain_moransi,
features = top_features_moransi, ncol = 3, alpha = c(0.1, 1)) +
plot_annotation(
title = “Top 3 genes with the largest Moran’s I”,
subtitle = “among 10 top variable genes for illustration purposes”)
bottom_features_moransi <- tail(
SpatiallyVariableFeatures(brain_moransi,
selection.method = “moransi”), 3)
SpatialFeaturePlot(brain_moransi,
features = bottom_features_moransi, ncol = 3, alpha = c(0.1, 1)) +
plot_annotation(
title = “Bottom 3 genes with the smallest Moran’s I”,
subtitle = “among 10 top variable genes for illustration purposes”)
3.3.2 marker-variogram
A gene-specific empirical variogram v for gene g can be calculated as a function of pair-wise distance r between any two samples i and j. A permutation test can be performed to see whether such variogram v® is associated with the spatial distance r in trendsceek.
v®=E∀i,j, s.t. dist(i,j)=r[(xi−xj)2]
, where xi,xj stand for the gene expression of gene g in sample i,j respectively.
Fig: Common theoretical variogram models, image source.
Fig: Common theoretical variogram models, image source.
brain_variogram <- FindSpatiallyVariableFeatures(
brain_obj, assay = “SCT”,
features = VariableFeatures(brain_obj)[1:10],
selection.method = “markvariogram”)
variogram_output_df <- brain_variogram@assaysKaTeX parse error: Expected 'EOF', got '#' at position 36: …% na.exclude #̲ there are NA r…r.metric.5), ])
r.metric.5 markvariogram.spatially.variable
Ttr 0.08242632 TRUE
Pcp2 0.08359531 TRUE
Mbp 0.15537902 TRUE
Car8 0.16478140 TRUE
Plp1 0.25534753 TRUE
Ptgds 0.35968268 TRUE
markvariogram.spatially.variable.rank
Ttr 1
Pcp2 2
Mbp 3
Car8 4
Plp1 5
Ptgds 6
The column named r.metric.5 stores the expected values of mark-variogram at a given r (radius) value (r=5 by default) when the marks attached to different points are independent (i.e., there is no spatial autocorrelation).
top_features_variogram <- head(
SpatiallyVariableFeatures(brain_variogram,
selection.method = “markvariogram”), 3)
SpatialFeaturePlot(brain_variogram,
features = top_features_variogram, ncol = 3, alpha = c(0.1, 1)) +
plot_annotation(
title = “3 genes with the top spatially variable rank (by mark-variogram)”,
subtitle = “among 10 top variable genes for illustration purposes”)
bottom_features_variogram <- tail(
SpatiallyVariableFeatures(brain_variogram,
selection.method = “markvariogram”), 3)
SpatialFeaturePlot(brain_variogram,
features = bottom_features_variogram, ncol = 3, alpha = c(0.1, 1)) +
plot_annotation(
title = “3 genes with the bottom spatially variale rank (by mark-variogram)”,
subtitle = “among 10 top variable genes for illustration purposes”)
3.3.3 Other R packages to identify spatially variable genes
spatialDE
Bioconductor package: https://www.bioconductor.org/packages/release/bioc/html/spatialDE.html
vignette: https://www.bioconductor.org/packages/release/bioc/vignettes/spatialDE/inst/doc/spatialDE.html
trendsceek
GitHub repository with a demo: https://github.com/edsgard/trendsceek
SPARK together with its updated version SPARK-X
GitHub repository: https://xzhoulab.github.io/SPARK/
an analysis example: https://xzhoulab.github.io/SPARK/02_SPARK_Example/
MERINGUE
GitHub repository: https://github.com/JEFworks-Lab/MERINGUE
a vignette analyzing mouse olfactory bulb data: https://github.com/JEFworks-Lab/MERINGUE/blob/master/docs/mOB_analysis.md
4 Convert data formats between R and Python
To convert single-cell data objects between R (e.g. Seurat or SingleCellExperiment) and Python (e.g. AnnData), you can use either SeuratDisk or sceasy. Another option is to use reticulate, which could be a bit more involved and take more memory.
SeuratDisk vignette to convert between Seurat and AnnData: https://mojaveazure.github.io/seurat-disk/articles/convert-anndata.html
sceasy to convert single-cell datasets across different formats: https://github.com/cellgeni/sceasy
reticulate vignette to call Python from R and convert data formats: https://github.com/cellgeni/sceasy
5 Acknowledgements
The tutorial was inspired by the computational assignments I created together with Dr. Yun S. Song in a graduate course (CMPBIO 290: Algorithms for single-cell genomics) at University of California, Berkeley in Fall 2021, and it was largely based on many open-source materials, especially the Seurat tutorials from the Satija Lab. I would also like to thank Salwan Butrus for helpful feedback and suggestions.
References
Seurat essential commands list: https://satijalab.org/seurat/articles/essential_commands.html
Seurat GitHub Wiki: https://github.com/satijalab/seurat/wiki
Luecken M. K. and Theis F. J., Current best practices in single-cell RNA-seq analysis: a tutorial. Mol Syst Biol (2019) (doi: <10.15252/msb.20188746>)
Analysis, visualization, and integration of spatial datasets with Seurat: https://satijalab.org/seurat/articles/spatial_vignette.html
Seurat guided clustering tutorial for scRNA-seq data: https://satijalab.org/seurat/articles/pbmc3k_tutorial.html
sessionInfo()