miloDE_tutorials/miloDE__mouse_embryo.Rmd at main · MarioniLab/miloDE_tutorials · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
---
title: "miloDE: chimera mouse embryo, Tal1-"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{miloDE__mouse_embryo}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(knitr)

```

# Load libraries

```{r setup}

library(miloDE)

# library containing toy data
suppressMessages(library(MouseGastrulationData))

# analysis libraries
library(scuttle)
suppressMessages(library(miloR))
suppressMessages(library(uwot))
library(scran)
suppressMessages(library(dplyr))
library(reshape2)

# packages for gene cluster analysis
library(scWGCNA)
suppressMessages(library(Seurat))

# plotting libraries
library(ggplot2)
library(viridis)
library(ggpubr)

```

We will also show how we can parallel `de_test_neighbourhoods`. For this we need to load `BiocParallel` and enable multicore parallel evaluation.

```{r setup-parallel}

library(BiocParallel)

ncores = 4
mcparam = MulticoreParam(workers = ncores)
register(mcparam)

```


# Load data

We will use data from mouse gastrulation scRNA-seq atlas [Pijuan-Sala et al., 2019](https://pubmed.ncbi.nlm.nih.gov/30787436/). As a part of this project, using chimera mouse embryos, authors characterised the impact of Tal1 knock out on the mouse development. The most prominent phenotype was a loss of blood cells in Tal1- cells.

In this vignette, we will apply miloDE on cells contributing to blood lineage (endothelia and haematoendothelial progenitors (haem. prog-s.)) and assess whether we can detect and characterise more subtle phenotypes (i.e. DE).

Tal1-/+ chimer data are processed and directly available within `MouseGastrulationData` package. As condition ID, we will use slot `tomato` which indicates whether cells carry Tal1 KO.


```{r load-data, fig.width=8 , fig.cap="UMAPs, coloured by cell types"}


# load chimera Tal1
sce = suppressMessages(MouseGastrulationData::Tal1ChimeraData())

# downsample to few selected cell types
cts = c("Spinal cord" , "Haematoendothelial progenitors", "Endothelium" , "Blood progenitors 1" , "Blood progenitors 2")
sce = sce[, sce$celltype.mapped %in% cts]
# let's rename Haematoendothelial progenitors
sce$celltype.mapped[sce$celltype.mapped == "Haematoendothelial progenitors"] = "Haem. prog-s."

# delete row for tomato
sce = sce[!rownames(sce) == "tomato-td" , ]

# add logcounts
sce = logNormCounts(sce)

# update tomato field to be more interpretable
sce$tomato = sapply(sce$tomato , function(x) ifelse(isTRUE(x) , "Tal1_KO" , "WT"))

# for this exercise, we focus on 3000 highly variable genes (for computational efficiency)
dec.sce = modelGeneVar(sce)
hvg.genes = getTopHVGs(dec.sce, n = 3000)
sce = sce[hvg.genes , ]
# change rownames to symbol names
rowdata = as.data.frame(rowData(sce))
rownames(sce) = rowdata$SYMBOL

# add UMAPs
set.seed(32)
umaps = as.data.frame(uwot::umap(reducedDim(sce , "pca.corrected")))
# let's store UMAPs - we will use them for visualisation
reducedDim(sce , "UMAP") = umaps

# plot UMAPs, colored by cell types
umaps = cbind(as.data.frame(colData(sce)) , reducedDim(sce , "UMAP"))
names(EmbryoCelltypeColours)[names(EmbryoCelltypeColours) == "Haematoendothelial progenitors"] = "Haem. prog-s."
cols_ct = EmbryoCelltypeColours[names(EmbryoCelltypeColours) %in% unique(umaps$celltype.mapped)]

p = ggplot(umaps , aes(x = V1 , y = V2 , col = celltype.mapped)) +
  geom_point() +
  scale_color_manual(values = cols_ct) +
  facet_wrap(~tomato) +
  theme_bw() +
  labs(x = "UMAP-1", y = "UMAP-2")
p


```


# Assign neighbourhoods

## Estimate k -> neighbourhood size

`estimate_neighbourhood_sizes` allows to gauge how neighbourhood size distribution changes as a function of (order,k). It might be useful to run it first in order to determine optimal range that will return desired neighbourhood sizes.

```{r estimate-k, fig.width=6, fig.cap="Neighbourhood size distribution ~ k"}


stat_k = estimate_neighbourhood_sizes(sce, k_grid = seq(10,40,5) ,
                                      order = 2, prop = 0.1 , filtering = TRUE,
                                      reducedDim_name = "pca.corrected" , plot_stat = TRUE)

kable(stat_k , caption = "Neighbourhood size distribution ~ k")

```

We will use k=20, order=2 --> that returns an average neighbourhood size ~400 cells.

## Assign neighbourhoods

To assign neighbourhoods, use `assign_neighbourhoods`. Note that under the hood, there is a random sampling of index cells --> if you want to ensure the same neighbourhood assignment, please set seed prior to running this function.

Note that we set `filtering = TRUE` to achieve a refined assignment in which redundant neighbourhoods are discarded. Alternatively this can be done post hoc, using `filter_neighbourhoods`.

```{r assign-nhoods}


set.seed(32)
sce_milo = assign_neighbourhoods(sce , k = 20 , order = 2,
                                 filtering = TRUE , reducedDim_name = "pca.corrected" , verbose = F)


```

In total we get 42 nhoods. A neighbourhood assignment can be visualised using Milo plots, in which each circle corresponds to a neighbourhood, and edges between them represent shared cells. The center of each neighbourhood are coordinated in 2D latent space (e.g. UMAP) for the center cell of the neighbourhood. We can also colour each neighbourhood by provided metric. In this plot, we will annotate each neighbourhood with its enriched cell type, and colour neighbourhoods by assigned cell types.


```{r plot-nhoods-by-cts , fig.width=5, fig.cap="Neighbourhood plot, coloured by cell types"}


nhoods_sce = nhoods(sce_milo)
# assign cell types for nhoods
nhood_stat_ct = data.frame(Nhood = 1:ncol(nhoods_sce) , Nhood_center = colnames(nhoods_sce))
nhood_stat_ct = miloR::annotateNhoods(sce_milo , nhood_stat_ct , coldata_col = "celltype.mapped")
p = plot_milo_by_single_metric(sce_milo, nhood_stat_ct, colour_by = "celltype.mapped" ,
                                      layout = "UMAP" , size_range = c(1.5,3) , edge_width = c(0.2,0.5)) +
  scale_fill_manual(values = cols_ct , name = "Cell type")
p


```

# DE testing

## Calculate AUC per neighbourhood

Prior to DE testing, an optional step is to assess which neighbourhoods do not show any signs of perturbation and discard them prior to DE testing in order to facilitate the burden from multiple testing correction (by using `calc_AUC_per_neighbourhood`). To do so we build on [Augur](https://pubmed.ncbi.nlm.nih.gov/32690972/) that employs RF classifiers to separate cells between the conditions. As an output of `calc_AUC_per_neighbourhood`, we return data frame containing AUC of the per neighbourhood classifiers.
We suggest that AUC <= 0.5 corresponds to neighbourhoods that can be safely discarded from DE testing (however, you may use your own threshold if desired). In addition, if classifier can not be built due to very low number of cells in at least one of the conditions, AUC will be set to NaN.

Note that this part is rather computationally costly, and we recommend it to run if a substantial transcriptional regions are anticipated to be unperturbed.

```{r calc-auc-per-nhood, fig.width=5, fig.cap="Neighbourhood plot, coloured by AUC"}


stat_auc = suppressWarnings(calc_AUC_per_neighbourhood(sce_milo , sample_id = "sample" , condition_id = "tomato", min_n_cells_per_sample = 1, BPPARAM = mcparam))

p = plot_milo_by_single_metric(sce_milo, stat_auc, colour_by = "auc" ,
                                      layout = "UMAP" , size_range = c(1.5,3) , edge_width = c(0.2,0.5)) +
  scale_fill_viridis(name = "AUC")
p


```

We observe that AUCs in the neighbourhoods containing mostly Blood progenitors is NaN which is consistent with absense of these cells in Tal1- cells.
In addition, we observe higher AUCs in endothelial subregions, likely reflecting that these cells are more affected by Tal1 KO.

## DE testing

Let's proceed with DE testing within each neighbourhood. We will test neighbourhoods in which AUC is not NaN (i.e. neighbourhoods in which there are enough cells from both conditions).

```{r de-testing, fig.width=5 , fig.cap="Neighbourhood plot, coloured by whether DE testing is performed"}


de_stat = de_test_neighbourhoods(sce_milo ,
                             sample_id = "sample",
                             design = ~tomato,
                             covariates = c("tomato"),
                             subset_nhoods = stat_auc$Nhood[!is.na(stat_auc$auc)],
                             output_type = "SCE",
                             plot_summary_stat = TRUE,
                             layout = "UMAP", BPPARAM = mcparam ,
                             verbose = T)


```

# Analysis of miloDE results

## Get neighbourhood ranking by the extent of DE

One explanatory question a user might have is an overall scan of which transcriptional regions show noteworthy signs of DE. To do so ona neighbourhood level, we provide the function `rank_neighbourhoods_by_DE_magnitude`. Within this function, we calculate two metrics:

a) `n_DE_genes` - for each neighbourhood, we calculate how many genes are assigned as DE. Since we are doing it within each neighbourhood, we use `pval_corrected_across_genes` and we use default `pval.thresh=0.1` (can be changed). Note that there is no comparison between neighbourhoods here.


b) `n_specific_DE_genes`. We also might be interested which neighbourhoods differ from others more so than we would expect. To assess this, we are interested in which neighbourhoods contain genes, that are DE 'specifically' in those neighbourhoods. To calculate this, for each gene we now use z-transformation of `pval_corrected_across_nhoods`, and we identify the neighbourhoods in which z-normalised p-values are lower than a threshold (default `z.thresh=-3`). This would tell us that the gene is signifciantly DE in the neighbourhood *compared to most other neighbourhoods*. We do so for each gene, and then for each neighbourhood we calculate how mane genes have z-normalised p-values are lower than a threshold.


Note that for gene/neighbourhood combinations for which p-values are returned as NaNs (e.g. genes are not tested), for this function we set pvalues = 1. In other words, if a gene is only tested in few neighbourhoods to begin with, z-normalised p-value corrected across neighbourhoods is likely to be small for these neighbourhoods.


```{r rank-nhoods-by-DE-magnitude, fig.width=10 , fig.cap = "Neighbourhood plot, coloured by # of DE genes" , fig.height=5}

stat_de_magnitude = rank_neighbourhoods_by_DE_magnitude(de_stat)

p1 = plot_milo_by_single_metric(sce_milo, stat_de_magnitude, colour_by = "n_DE_genes" ,
                                      layout = "UMAP" , size_range = c(1.5,3) , edge_width = c(0.2,0.5)) +
  scale_fill_viridis(name = "# DE genes")
p2 = plot_milo_by_single_metric(sce_milo, stat_de_magnitude, colour_by = "n_specific_DE_genes" ,
                                      layout = "UMAP" , size_range = c(1.5,3) , edge_width = c(0.2,0.5)) +
  scale_fill_viridis(name = "# specific\nDE genes" , option = "inferno")
p = ggarrange(p1,p2)
p

```

Reassuringly, we observe that same regions show both higher AUC and number of DE expressed genes.

# Co-regulated programs

The fine neighbourhood resolution allows to cluster genes based on logFC vectors (across the neighbourhoods). One way is to employ WGCNA approach, originally designed to finding co-expressed genes. Instead of gene counts we will use corrected logFC (logFC set to 0 in the neighbourhoods in which it is not DE).

Below we outline a custom script to detect gene modules using WGCNA approach. We will use [scWGCNA](https://github.com/CFeregrino/scWGCNA), specifically designed to handle single-cell data.

Note that WGCNA algorithm is exclusive and sensitive to input parameters - nevertheless, we suggest it is useful to explore what DE patterns exist in your data.

## Gene modules using WGCNA

```{r sc-wgcna}


get_wgcna_modules = function(de_stat , subset_hoods = NULL ,
                             n_hoods_sig.thresh = 2 ,
                             npcs = 5 ,
                             pval.thresh = 0.1 ){
  require(scWGCNA)
  require(Seurat)
  require(dplyr)
  require(reshape2)

  set.seed(32)
  # subset hoods
  if (!is.null(subset_hoods)){
    de_stat = de_stat[de_stat$Nhood %in% subset_hoods , ]
  }

  # focus on genes that DE in at least 2 nhoods
  de_stat_per_gene = as.data.frame(de_stat %>% group_by(gene) %>% dplyr::summarise(n_hoods_sig = sum(pval_corrected_across_nhoods < pval.thresh , na.rm = TRUE)))
  genes_sig = de_stat_per_gene$gene[de_stat_per_gene$n_hoods_sig >= n_hoods_sig.thresh]

  de_stat = de_stat[de_stat$gene %in% genes_sig, ]
  de_stat = de_stat[order(de_stat$Nhood) , ]

  # discard neighbourhoods in which testing was not performed
  de_stat = de_stat[de_stat$test_performed , ]

  # for this analysis, set logFC to 0 and pvals to 1 if they are NaN
  de_stat$logFC[is.na(de_stat$logFC)] = 0
  de_stat$pval[is.na(de_stat$pval)] = 1
  de_stat$pval_corrected_across_genes[is.na(de_stat$pval_corrected_across_genes)] = 1
  de_stat$pval_corrected_across_nhoods[is.na(de_stat$pval_corrected_across_nhoods)] = 1

  # set logFC to 0 if pval_corrected_across_nhoods > pval.thresh
  de_stat$logFC[de_stat$pval_corrected_across_nhoods >= pval.thresh] = 0

  # move the object to Seurat
  de_stat = reshape2::dcast(data = de_stat, formula = gene~Nhood, value.var = "logFC")
  rownames(de_stat) = de_stat$gene
  de_stat = de_stat[,2:ncol(de_stat)]

  obj.seurat <- CreateSeuratObject(counts = de_stat)
  DefaultAssay(obj.seurat) <- "RNA"
  obj.seurat = FindVariableFeatures(obj.seurat)
  # scale
  obj.seurat[["RNA"]]@scale.data = as.matrix(obj.seurat[["RNA"]]@data)
  obj.seurat = RunPCA(obj.seurat , npcs = npcs)

  # run scwgcna
  clusters_scwgcna = run.scWGCNA(p.cells = obj.seurat,
                                 s.cells = obj.seurat,
                                 is.pseudocell = F,
                                 features = rownames(obj.seurat),
                                 less = TRUE , merging = TRUE)
  # compile stat
  clusters = lapply(1:length(clusters_scwgcna$module.genes) , function(i){
    out = data.frame(cluster = i , gene = clusters_scwgcna$module.genes[[i]] , n_genes = length(clusters_scwgcna$module.genes[[i]]))
    return(out)
  })
  clusters = do.call(rbind , clusters)
  # add colors
  genes_w_colors = clusters_scwgcna$dynamicCols
  genes_w_colors = data.frame(gene = names(genes_w_colors) , cluster_color = genes_w_colors)
  clusters = merge(clusters , genes_w_colors)

  return(clusters)
}


# for this vignette, for simplicity we will focus on genes that are DE in at least 4 neighbourhoods
modules_wgcna = suppressMessages(get_wgcna_modules(convert_de_stat(de_stat) , n_hoods_sig.thresh = 4))


```


## Gene module plots

We can visualise for which transcriptional regions different gene modules are relevant.

For each gene module, we can plot neighbourhood plot, in which neighbourhood colour corresponds to average logFC (across genes from the module) and neighbourhood size corresponds to fraction of genes from the module, that are DE in this neighbourhood.

```{r sc-wgcna-plot-modules, fig.cap = "Co-reulated modules", fig.width=10, fig.height=8}


plots = lapply(sort(unique(modules_wgcna$cluster)) , function(cluster){
  p = plot_DE_gene_set(sce_milo, de_stat , genes = modules_wgcna$gene[modules_wgcna$cluster == cluster],
                            layout = "UMAP" , size_range = c(0.5,3) ,
                            node_stroke = 0.3, edge_width = c(0.2,0.5)) +
    ggtitle(paste0("Module ", cluster, ", ", length(modules_wgcna$gene[modules_wgcna$cluster == cluster]) , " genes"))
  return(p)
})
p = ggarrange(plotlist = plots)
p


```

## Break down by cell types

We can also quantify the relevance of each module for different cell types.

```{r sc-wgcna-plot-modules-beeswarm, fig.width=10,fig.height=8 , fig.cap = "Co-regulated modules, break down by cell types."}


# we first need to assign cell type label to each neighbourhood - we have calculated it previously
# we will use data.frame format for de-stat; the conversion between SingleCellExperiment and data.frame formats can be done using `convert_de_stat`
de_stat_df = convert_de_stat(de_stat)
de_stat_df = merge(de_stat_df , nhood_stat_ct , by = c("Nhood" , "Nhood_center"))


plots = lapply(sort(unique(modules_wgcna$cluster)) , function(cluster){
  p = plot_beeswarm_gene_set(de_stat_df,
                             genes = modules_wgcna$gene[modules_wgcna$cluster == cluster],
                             nhoodGroup = "celltype.mapped") +
    ggtitle(paste0("Module ", cluster))
  return(p)
})
p = ggarrange(plotlist = plots)
p


```


## Visualising individual genes

We can also visualise DE results for individual genes, in which each neighbourhood is coloured by its logFC if significant.

Below we show couple examples of DE patterns of genes from modules 2 and 4.

## Module 2

```{r de-single-genes-mod-2 , fig.width=10,fig.height=15 , fig.cap = "Genes from module 2."}


set.seed(1020)
module = 2
n_genes = 6
genes = sample(modules_wgcna$gene[modules_wgcna$cluster == module] , n_genes)


plots = lapply(genes , function(gene){
  p = plot_DE_single_gene(sce_milo, de_stat , gene = gene , layout = "UMAP" , set_na_to_0 = TRUE) +
    ggtitle(gene)
  return(p)
})
p = ggarrange(plotlist = plots , ncol = 2, nrow = 3)
p


```

## Module 4

```{r de-single-genes-mod-4 , fig.width=10,fig.height=15 , fig.cap = "Genes from module 4."}


set.seed(1020)
module = 4
n_genes = 6
genes = sample(modules_wgcna$gene[modules_wgcna$cluster == module] , n_genes)


plots = lapply(genes , function(gene){
  p = plot_DE_single_gene(sce_milo, de_stat , gene = gene , layout = "UMAP" , set_na_to_0 = TRUE) +
    ggtitle(gene)
  return(p)
})
p = ggarrange(plotlist = plots, ncol=2, nrow=3)
p


```


# Session Info

```{r sessinf}
sessionInfo()
```