Cell2Fate analysis of mouse pancreas development dataset
This tutorial shows how to use cell2fate for RNA velocity analysis on mouse pancreas development dataset. Cell2fate is an RNA velocity model that linearizes the velocity differential equations which allows solving a biophysically accurate model in a fully Bayesian fashion, enhancing the interpretability of transcriptional dynamics and connecting them with spatial organisation of tissues.
Note!
Cell2Fate is an independent package, but is powered by scvi-tools. If you have questions about scvi-tools please visit https://discourse.scverse.org/c/help/scvi-tools/7.
Loading packages
[1]:
import cell2fate as c2f
import scanpy as sc
import numpy as np
import matplotlib.pyplot as plt
import os
data_name = 'Pancreas_with_cc'
Global seed set to 0
Loading single cell RNA sequencing data
You need to change this to suitable directories on your system:
[2]:
# Data name, where to download data and where to save results
data_path = '/nfs/team283/se14/cell2fate_tutorial_data/Pancreas_with_cc/'
results_path = '/nfs/team283/se14/cell2fate_tutorial_data/Pancreas_with_cc/'
[3]:
# Downloading data into specified directory:
os.system('cd ' + data_path + ' && wget -q https://cell2fate.cog.sanger.ac.uk/' + data_name + '/' + data_name + '_anndata.h5ad')
[3]:
0
Load the data and extract most variable genes (and optionally remove some clusters). We expect the adata object to contain spliced and unspliced raw UMI counts in adata.layers[‘spliced’] and adata.layers[‘unspliced’]. Spliced and unspliced counts can be for example be quantified with the velocyto method. Starsolo also has the option to run velocyto as part of the general mapping, demultiplexing and count quantification pipeline. For more information on spliced and unspliced count quantification this publication gives a good overview.
[4]:
adata = sc.read_h5ad(data_path + data_name + '_anndata.h5ad')
clusters_to_remove = []
adata = c2f.utils.get_training_data(adata, cells_per_cluster = 10**5, cluster_column = 'clusters',
remove_clusters = clusters_to_remove,
min_shared_counts = 20, n_var_genes= 3000)
Keeping at most 100000 cells per cluster
Filtered out 20801 genes that are detected 20 counts (shared).
Extracted 3000 highly variable genes.
[5]:
fig, ax = plt.subplots(1,1, figsize = (6, 4))
sc.pl.umap(adata, color = ['clusters'], s = 200, legend_loc='on data', show = False, ax = ax)
plt.savefig(results_path + data_name + 'UMAP_clusters.pdf')
We initialize the model, which includes automatically setting a maximal number of modules, based on the number of Louvain clusters (times 1.15) in the data.
Initializing and training the model
[6]:
c2f.Cell2fate_DynamicalModel.setup_anndata(adata, spliced_label='spliced', unspliced_label='unspliced')
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
[7]:
n_modules = c2f.utils.get_max_modules(adata)
Leiden clustering ...
Number of Leiden Clusters: 13
Maximal Number of Modules: 14
[8]:
mod = c2f.Cell2fate_DynamicalModel(adata, n_modules = n_modules)
Training the model:
[9]:
mod.train()
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [MIG-GPU-8391e223-da74-0458-e121-783edc78bf21/0/0]
Epoch 500/500: 100%|██████████| 500/500 [11:27<00:00, 1.38s/it, v_num=1, elbo_train=1.12e+7]
Exporting and visualizing posteriors
Exporting relevant model parameters to the anndata object:
[10]:
adata = mod.export_posterior(adata)
Sampling local variables, batch: 100%|██████████| 1/1 [00:14<00:00, 14.95s/it]
Sampling global variables, sample: 100%|██████████| 29/29 [00:12<00:00, 2.24it/s]
Warning: Saving ALL posterior samples. Specify "return_samples: False" to save just summary statistics.
One of the interesting parameter posteriors that was saved to the anndata object is the differentiation time:
[11]:
fig, ax = plt.subplots(1,2, figsize = (15, 5))
sc.pl.umap(adata, color = ['Time (hours)'], legend_loc = 'right margin',
size = 200, color_map = 'inferno', ncols = 2, show = False, ax = ax[0])
sc.pl.umap(adata, color = ['Time Uncertainty (sd)'], legend_loc = 'right margin',
size = 200, color_map = 'inferno', ncols = 2, show = False, ax = ax[1])
plt.savefig(results_path + data_name + 'UMAP_Time_nModules' + str(n_modules) + '.pdf')
Visualizing summary statistics and module activations
We can compute some module statistics to visualize the activity of the underlying modules:
[12]:
adata = mod.compute_module_summary_statistics(adata)
mod.plot_module_summary_statistics(adata, save = results_path + data_name + 'module_summary_stats_plot.pdf')
This is an alternative way to visualize module activation over time:
[13]:
mod.compare_module_activation(adata, chosen_modules = [1,2,3,4],
save = results_path + data_name + 'module_activation_comparison.pdf')
Visualizing velocities and module specific velocities
For large datasets (> 10000 cells) we recommend the scalable scVelo workflow for computing and plotting the velocity graph, which is used in this method:
[14]:
mod.compute_and_plot_total_velocity_scvelo(adata, save = results_path + data_name + 'total_velocity_plots.png', delete = False)
Computing total RNAvelocity ...
For smaller datasets (< 10000 cells) our own method that takes into account posterior uncertainty, when computing the velocity graph can be used:
[15]:
mod.compute_and_plot_total_velocity(adata, save = results_path + data_name + 'total_velocity_plots.png', delete = False)
Computing total RNAvelocity ...
The delete=False command saves the RNA velocity for each gene in both cases in an adata.layers slot.
For a visualization that is more similar to the scvelo plots:
[16]:
import scvelo as scv
fix, ax = plt.subplots(1, 1, figsize = (8, 6))
scv.pl.velocity_embedding_stream(adata, basis='umap', save = False, vkey='Velocity',
show = False, ax = ax, legend_fontsize = 13)
plt.savefig(results_path + data_name + 'total_velocity_plots.png')
After computing velocities, relative module activations for trajectories can be plotted.
[17]:
chosen_module = 0
mod.visualize_module_trajectories(adata, chosen_module)
Keyword arguments for plotting can also be defined to further adjust velocity graphs, such as the location of the legend (cluster names) and the cmap.
[18]:
chosen_module = 0
plotting_kwargs={"color": 'clusters', 'legend_fontsize': 10, 'legend_loc': 'right_margin', 'dpi': 300, 'cmap': 'RdYlGn'}
mod.visualize_module_trajectories(adata, chosen_module, plotting_kwargs=plotting_kwargs)
plt.savefig(results_path + data_name + 'module_0_velocities.png', bbox_inches='tight')
[19]:
adata
[19]:
AnnData object with n_obs × n_vars = 3696 × 3000
obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score', 'initial_size_spliced', 'initial_size_unspliced', 'initial_size', '_indices', '_scvi_batch', 'Time (hours)', 'Time Uncertainty (sd)', 'Module 0 Activation', 'Module 0 State', 'Module 1 Activation', 'Module 1 State', 'Module 2 Activation', 'Module 2 State', 'Module 3 Activation', 'Module 3 State', 'Module 4 Activation', 'Module 4 State', 'Module 5 Activation', 'Module 5 State', 'Module 6 Activation', 'Module 6 State', 'Module 7 Activation', 'Module 7 State', 'Module 8 Activation', 'Module 8 State', 'Module 9 Activation', 'Module 9 State', 'Module 10 Activation', 'Module 10 State', 'Module 11 Activation', 'Module 11 State', 'Module 12 Activation', 'Module 12 State', 'Module 13 Activation', 'Module 13 State', 'Module 14 Activation', 'Module 14 State', 'velocity_self_transition'
var: 'highly_variable_genes', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
uns: 'clusters_coarse_colors', 'clusters_colors', 'day_colors', 'neighbors', 'pca', '_scvi_uuid', '_scvi_manager_uuid', 'mod', 'Module 0 State_colors', 'Module 1 State_colors', 'Module 2 State_colors', 'Module 3 State_colors', 'Module 4 State_colors', 'Module 5 State_colors', 'Module 6 State_colors', 'Module 7 State_colors', 'Module 8 State_colors', 'Module 9 State_colors', 'Module 10 State_colors', 'Module 11 State_colors', 'Module 12 State_colors', 'Module 13 State_colors', 'Module 14 State_colors', 'velocity_graph', 'velocity_graph_neg', 'velocity_params', 'Velocity_graph'
obsm: 'X_pca', 'X_umap', 'velocity_umap', 'Velocity_umap'
layers: 'spliced', 'unspliced', 'velocity', 'Velocity'
obsp: 'distances', 'connectivities', 'binary'
Plotting technical variables and other downstream analyses
Technical variables usually show lower detection efficiency and higher noise (= lower overdispersion parameter) for unspliced counts:
[20]:
mod.plot_technical_variables(adata, save = results_path + data_name + 'technical_variables_overview_plot.pdf')
This is how to have a look at the various rate parameters the optimization converged to:
[21]:
print('A_mgON mean:', np.mean(mod.samples['post_sample_means']['A_mgON']))
print('gamma_g mean:', np.mean(mod.samples['post_sample_means']['gamma_g']))
print('beta_g mean:', np.mean(mod.samples['post_sample_means']['beta_g']))
print('lam_mi, all modules: \n \n', np.round(mod.samples['post_sample_means']['lam_mi'],2))
A_mgON mean: 0.09715694
gamma_g mean: 0.77538234
beta_g mean: 1.3485191
lam_mi, all modules:
[[[25.9 19.62]]
[[ 3.98 10.35]]
[[18.54 5.46]]
[[ 3.52 2.01]]
[[ 3.45 2.54]]
[[11.37 7.23]]
[[12.05 9.33]]
[[14.47 9.63]]
[[ 5.64 4.82]]
[[ 9.89 6.46]]
[[ 3.96 4.21]]
[[ 1.49 1.32]]
[[12.4 11.3 ]]
[[ 4.49 7.6 ]]
[[ 6.64 6.66]]]
This method returns orders the genes and TFs in each module from most to least enriched. And it also performs gene set enrichment analysis:
[22]:
tab, all_results = mod.get_module_top_features(adata, p_adj_cutoff=0.01, background = adata.var_names)
tab.to_csv(results_path + data_name + 'module_top_features_table.csv')
[23]:
tab
[23]:
| Module Number | Genes Ranked | TFs Ranked | Terms Ranked | |
|---|---|---|---|---|
| 0 | 0 | Akr1cl, Smtnl2, Ces1d, Rapsn, Adamts16, Adamts... | Hes1, Elf5, Zfp36l1, Rest, Sox9, Ehf, Sox5, Tc... | extracellular matrix organization (GO:0030198)... |
| 1 | 1 | Ska1, Aurkb, Kif11, Melk, Bub1, Casc5, 2810417... | Pou2f1, Dnmt1, Atf4, Nfyb, Dnajc21, Nr4a1, Zbt... | mitotic spindle organization (GO:0007052), mic... |
| 2 | 2 | Nptx2, Dll1, Eda, Grin3a, Reps2, Tmem171, Nfix... | Nfix, Tcf20, Dach2, Arid3a, Zbtb16, Foxa3, Neu... | |
| 3 | 3 | Chgb, Igfbp2, Grik2, Gadd45a, Glod5, Dock4, Me... | Neurog3, Ikzf5, Lmx1b, Zfp423, Smarcd2, Klf10,... | |
| 4 | 4 | Srrm4, Btbd11, Adamts18, Olfm1, Scube1, Chst9,... | Zfp423, D630045J12Rik, E2f1, Kcnip3, Zeb1, Pbx... | |
| 5 | 5 | Chst9, Bach2, Asic2, Akt3, Dgkb, Hpca, Fer, Sl... | Bach2, Nfatc2, Aff2, Plag1, Runx1t1, Ebf1, Ddi... | |
| 6 | 6 | Ncam2, Clstn2, Ebf1, Asic2, Sulf1, Ctnna3, Mar... | Ebf1, Znfx1, Rfx2, Runx1t1, Hivep1, St18, Zfp3... | |
| 7 | 7 | Ccl4, Tmem178b, Sdk1, Mb21d2, Acsl1, Apba1, Cc... | Purg, Rfx2, Zfp804a, Vdr, Zfp90, Creb3, Prdm16... | |
| 8 | 8 | Pik3c2g, Plac8, Filip1, Zbtb7c, Gm2694, Zfp804... | Zbtb7c, Zfp804a, Zfp174, Zbtb10, Mnx1, Zfp672,... | |
| 9 | 9 | Tshz3, Ctnna2, Ppp1r14c, Gm2115, Lrpprc, Fam15... | Tshz3, Mef2a, Zfp719, Mlxipl, Pdx1, Mnx1, Mllt... | |
| 10 | 10 | Pou6f2, Pcdh15, Gria2, Nlgn1, Irx2, Gpr179, Kc... | Irx2, Smarca1, Etv1, Arx, Esrrg, Ncoa7, Zeb1, ... | |
| 11 | 11 | Nell1, Iapp, Ins1, Ins2, Stmn2, Slc30a8, Gcg, ... | Hnf4a, Myt1l, Rora, Zfat, Zfp516, Zfhx2, Foxj3... | |
| 12 | 12 | Npy, Slc24a2, P2ry1, Sytl4, Arhgap36, Mapt, Sd... | Pcbp3, Rora, Mnx1, Pdx1, Mlxipl, Mllt11, Purg,... | |
| 13 | 13 | Zeb2, Uty, Zfp804a, Camkmt, Gpr173, Eml6, Mib1... | Zeb2, Zfp804a, Esrrg, Purg, Plag1, Aff2, Ebf1,... | |
| 14 | 14 | Zeb2, Uty, Camkmt, Zfp804a, Gpr173, Eml6, Mib1... | Zeb2, Zfp804a, Esrrg, Purg, Plag1, Aff2, Ebf1,... |
[24]:
all_results[0]
[24]:
| Gene_set | Term | P-value | Adjusted P-value | Old P-value | Old adjusted P-value | Odds Ratio | Combined Score | Genes | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | GO_Biological_Process_2021 | extracellular matrix organization (GO:0030198) | 7.729784e-07 | 0.000455 | 0 | 0 | 8.410380 | 118.359400 | ADAMTS16;COL27A1;FLRT2;ADAMTS1;SPP1;CAPG;LAMB1... |
| 1 | GO_Biological_Process_2021 | positive regulation of cell population prolife... | 8.096767e-07 | 0.000455 | 0 | 0 | 6.131783 | 86.008255 | TCF7L2;EMP2;LAMB1;LAMC1;S100B;BST2;GJA1;CDK6;E... |
| 2 | GO_Biological_Process_2021 | regulation of epithelial cell proliferation (G... | 2.013023e-06 | 0.000754 | 0 | 0 | 16.715467 | 219.237948 | TCF7L2;CDK6;LAMB1;LAMC1;SOX9;FGFR2;ZFP36L1 |
| 3 | GO_Biological_Process_2021 | positive regulation of epithelial cell prolife... | 1.448439e-05 | 0.004066 | 0 | 0 | 11.413130 | 127.170097 | NOTCH2;TCF7L2;CDH3;LAMB1;SOX9;LAMC1;FGFR2 |
| 4 | GO_Biological_Process_2021 | negative regulation of pri-miRNA transcription... | 3.596750e-05 | 0.005291 | 0 | 0 | inf | inf | REST;NFIB;SOX9 |
| 5 | GO_Biological_Process_2021 | regulation of mesenchymal stem cell differenti... | 3.596750e-05 | 0.005291 | 0 | 0 | inf | inf | REST;SOX9;SOX5 |
| 6 | GO_Biological_Process_2021 | membrane raft assembly (GO:0001765) | 3.596750e-05 | 0.005291 | 0 | 0 | inf | inf | ANXA2;EMP2;S100A10 |
| 7 | GO_Biological_Process_2021 | glial cell differentiation (GO:0010001) | 3.769270e-05 | 0.005291 | 0 | 0 | 40.236111 | 409.846798 | ERBB3;NFIB;ADGRG6;SOX9 |
| 8 | GO_Biological_Process_2021 | regulation of keratinocyte proliferation (GO:0... | 7.345821e-05 | 0.009166 | 0 | 0 | 30.166667 | 287.150283 | NOTCH2;CDH3;FGFR2;ZFP36L1 |
To plot the expression and RNA velocity of individual module marker genes:
[25]:
gene = 'Npy'
fig, ax = plt.subplots(1, 2, figsize = (15, 5))
sc.pl.umap(adata, color = [gene], s = 200, legend_loc='on data', show = False,
title = gene + ' expression', ax = ax[0])
sc.pl.umap(adata, color = [gene], s = 200, legend_loc='on data', show = False,
layer = 'velocity', title = gene + ' velocity', ax = ax[1])
[25]:
<AxesSubplot:title={'center':'Npy velocity'}, xlabel='UMAP1', ylabel='UMAP2'>