Cell2Fate analysis of mouse pancreas development dataset

This tutorial shows how to use cell2fate for RNA velocity analysis on mouse pancreas development dataset. Cell2fate is an RNA velocity model that linearizes the velocity differential equations which allows solving a biophysically accurate model in a fully Bayesian fashion, enhancing the interpretability of transcriptional dynamics and connecting them with spatial organisation of tissues.

Note!

Cell2Fate is an independent package, but is powered by scvi-tools. If you have questions about scvi-tools please visit https://discourse.scverse.org/c/help/scvi-tools/7.

Loading packages

[1]:
import cell2fate as c2f
import scanpy as sc
import numpy as np
import matplotlib.pyplot as plt
import os
data_name = 'Pancreas_with_cc'
Global seed set to 0

Loading single cell RNA sequencing data

You need to change this to suitable directories on your system:

[2]:
# Data name, where to download data and where to save results
data_path = '/nfs/team283/se14/cell2fate_tutorial_data/Pancreas_with_cc/'
results_path = '/nfs/team283/se14/cell2fate_tutorial_data/Pancreas_with_cc/'
[3]:
# Downloading data into specified directory:
os.system('cd ' + data_path + ' && wget -q https://cell2fate.cog.sanger.ac.uk/' + data_name + '/' + data_name + '_anndata.h5ad')
[3]:
0

Load the data and extract most variable genes (and optionally remove some clusters). We expect the adata object to contain spliced and unspliced raw UMI counts in adata.layers[‘spliced’] and adata.layers[‘unspliced’]. Spliced and unspliced counts can be for example be quantified with the velocyto method. Starsolo also has the option to run velocyto as part of the general mapping, demultiplexing and count quantification pipeline. For more information on spliced and unspliced count quantification this publication gives a good overview.

[4]:
adata = sc.read_h5ad(data_path + data_name + '_anndata.h5ad')
clusters_to_remove = []
adata =  c2f.utils.get_training_data(adata, cells_per_cluster = 10**5, cluster_column = 'clusters',
                                    remove_clusters = clusters_to_remove,
                                min_shared_counts = 20, n_var_genes= 3000)
Keeping at most 100000 cells per cluster
Filtered out 20801 genes that are detected 20 counts (shared).
Extracted 3000 highly variable genes.
[5]:
fig, ax = plt.subplots(1,1, figsize = (6, 4))
sc.pl.umap(adata, color = ['clusters'], s = 200, legend_loc='on data', show = False, ax = ax)
plt.savefig(results_path + data_name + 'UMAP_clusters.pdf')
../../_images/notebooks_publication_figures_cell2fate_PancreasWithCC_8_0.png

We initialize the model, which includes automatically setting a maximal number of modules, based on the number of Louvain clusters (times 1.15) in the data.

Initializing and training the model

[6]:
c2f.Cell2fate_DynamicalModel.setup_anndata(adata, spliced_label='spliced', unspliced_label='unspliced')
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
[7]:
n_modules = c2f.utils.get_max_modules(adata)
Leiden clustering ...
Number of Leiden Clusters: 13
Maximal Number of Modules: 14
[8]:
mod = c2f.Cell2fate_DynamicalModel(adata, n_modules = n_modules)

Training the model:

[9]:
mod.train()
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [MIG-GPU-8391e223-da74-0458-e121-783edc78bf21/0/0]
Epoch 500/500: 100%|██████████| 500/500 [11:27<00:00,  1.38s/it, v_num=1, elbo_train=1.12e+7]

Exporting and visualizing posteriors

Exporting relevant model parameters to the anndata object:

[10]:
adata = mod.export_posterior(adata)
Sampling local variables, batch: 100%|██████████| 1/1 [00:14<00:00, 14.95s/it]
Sampling global variables, sample: 100%|██████████| 29/29 [00:12<00:00,  2.24it/s]
Warning: Saving ALL posterior samples. Specify "return_samples: False" to save just summary statistics.

One of the interesting parameter posteriors that was saved to the anndata object is the differentiation time:

[11]:
fig, ax = plt.subplots(1,2, figsize = (15, 5))
sc.pl.umap(adata, color = ['Time (hours)'], legend_loc = 'right margin',
                size = 200, color_map = 'inferno', ncols = 2, show = False, ax = ax[0])
sc.pl.umap(adata, color = ['Time Uncertainty (sd)'], legend_loc = 'right margin',
                size = 200, color_map = 'inferno', ncols = 2, show = False, ax = ax[1])
plt.savefig(results_path + data_name + 'UMAP_Time_nModules' + str(n_modules) + '.pdf')
../../_images/notebooks_publication_figures_cell2fate_PancreasWithCC_19_0.png

Visualizing summary statistics and module activations

We can compute some module statistics to visualize the activity of the underlying modules:

[12]:
adata = mod.compute_module_summary_statistics(adata)
mod.plot_module_summary_statistics(adata, save = results_path + data_name + 'module_summary_stats_plot.pdf')
../../_images/notebooks_publication_figures_cell2fate_PancreasWithCC_21_0.png

This is an alternative way to visualize module activation over time:

[13]:
mod.compare_module_activation(adata, chosen_modules = [1,2,3,4],
                         save = results_path + data_name + 'module_activation_comparison.pdf')
../../_images/notebooks_publication_figures_cell2fate_PancreasWithCC_23_0.png

Visualizing velocities and module specific velocities

For large datasets (> 10000 cells) we recommend the scalable scVelo workflow for computing and plotting the velocity graph, which is used in this method:

[14]:
mod.compute_and_plot_total_velocity_scvelo(adata, save = results_path + data_name + 'total_velocity_plots.png', delete = False)
Computing total RNAvelocity ...
../../_images/notebooks_publication_figures_cell2fate_PancreasWithCC_25_2.png

For smaller datasets (< 10000 cells) our own method that takes into account posterior uncertainty, when computing the velocity graph can be used:

[15]:
mod.compute_and_plot_total_velocity(adata, save = results_path + data_name + 'total_velocity_plots.png', delete = False)
Computing total RNAvelocity ...
../../_images/notebooks_publication_figures_cell2fate_PancreasWithCC_27_1.png

The delete=False command saves the RNA velocity for each gene in both cases in an adata.layers slot.

For a visualization that is more similar to the scvelo plots:

[16]:
import scvelo as scv
fix, ax = plt.subplots(1, 1, figsize = (8, 6))
scv.pl.velocity_embedding_stream(adata, basis='umap', save = False, vkey='Velocity',
                                 show = False, ax = ax, legend_fontsize = 13)
plt.savefig(results_path + data_name + 'total_velocity_plots.png')
../../_images/notebooks_publication_figures_cell2fate_PancreasWithCC_30_0.png

After computing velocities, relative module activations for trajectories can be plotted.

[17]:
chosen_module = 0
mod.visualize_module_trajectories(adata, chosen_module)
../../_images/notebooks_publication_figures_cell2fate_PancreasWithCC_32_0.png

Keyword arguments for plotting can also be defined to further adjust velocity graphs, such as the location of the legend (cluster names) and the cmap.

[18]:
chosen_module = 0
plotting_kwargs={"color": 'clusters', 'legend_fontsize': 10, 'legend_loc': 'right_margin', 'dpi': 300, 'cmap': 'RdYlGn'}
mod.visualize_module_trajectories(adata, chosen_module, plotting_kwargs=plotting_kwargs)
plt.savefig(results_path + data_name + 'module_0_velocities.png', bbox_inches='tight')
../../_images/notebooks_publication_figures_cell2fate_PancreasWithCC_34_0.png
[19]:
adata
[19]:
AnnData object with n_obs × n_vars = 3696 × 3000
    obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score', 'initial_size_spliced', 'initial_size_unspliced', 'initial_size', '_indices', '_scvi_batch', 'Time (hours)', 'Time Uncertainty (sd)', 'Module 0 Activation', 'Module 0 State', 'Module 1 Activation', 'Module 1 State', 'Module 2 Activation', 'Module 2 State', 'Module 3 Activation', 'Module 3 State', 'Module 4 Activation', 'Module 4 State', 'Module 5 Activation', 'Module 5 State', 'Module 6 Activation', 'Module 6 State', 'Module 7 Activation', 'Module 7 State', 'Module 8 Activation', 'Module 8 State', 'Module 9 Activation', 'Module 9 State', 'Module 10 Activation', 'Module 10 State', 'Module 11 Activation', 'Module 11 State', 'Module 12 Activation', 'Module 12 State', 'Module 13 Activation', 'Module 13 State', 'Module 14 Activation', 'Module 14 State', 'velocity_self_transition'
    var: 'highly_variable_genes', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
    uns: 'clusters_coarse_colors', 'clusters_colors', 'day_colors', 'neighbors', 'pca', '_scvi_uuid', '_scvi_manager_uuid', 'mod', 'Module 0 State_colors', 'Module 1 State_colors', 'Module 2 State_colors', 'Module 3 State_colors', 'Module 4 State_colors', 'Module 5 State_colors', 'Module 6 State_colors', 'Module 7 State_colors', 'Module 8 State_colors', 'Module 9 State_colors', 'Module 10 State_colors', 'Module 11 State_colors', 'Module 12 State_colors', 'Module 13 State_colors', 'Module 14 State_colors', 'velocity_graph', 'velocity_graph_neg', 'velocity_params', 'Velocity_graph'
    obsm: 'X_pca', 'X_umap', 'velocity_umap', 'Velocity_umap'
    layers: 'spliced', 'unspliced', 'velocity', 'Velocity'
    obsp: 'distances', 'connectivities', 'binary'

Plotting technical variables and other downstream analyses

Technical variables usually show lower detection efficiency and higher noise (= lower overdispersion parameter) for unspliced counts:

[20]:
mod.plot_technical_variables(adata, save = results_path + data_name + 'technical_variables_overview_plot.pdf')
../../_images/notebooks_publication_figures_cell2fate_PancreasWithCC_37_0.png

This is how to have a look at the various rate parameters the optimization converged to:

[21]:
print('A_mgON mean:', np.mean(mod.samples['post_sample_means']['A_mgON']))
print('gamma_g mean:', np.mean(mod.samples['post_sample_means']['gamma_g']))
print('beta_g mean:', np.mean(mod.samples['post_sample_means']['beta_g']))
print('lam_mi, all modules: \n \n', np.round(mod.samples['post_sample_means']['lam_mi'],2))
A_mgON mean: 0.09715694
gamma_g mean: 0.77538234
beta_g mean: 1.3485191
lam_mi, all modules:

 [[[25.9  19.62]]

 [[ 3.98 10.35]]

 [[18.54  5.46]]

 [[ 3.52  2.01]]

 [[ 3.45  2.54]]

 [[11.37  7.23]]

 [[12.05  9.33]]

 [[14.47  9.63]]

 [[ 5.64  4.82]]

 [[ 9.89  6.46]]

 [[ 3.96  4.21]]

 [[ 1.49  1.32]]

 [[12.4  11.3 ]]

 [[ 4.49  7.6 ]]

 [[ 6.64  6.66]]]

This method returns orders the genes and TFs in each module from most to least enriched. And it also performs gene set enrichment analysis:

[22]:
tab, all_results = mod.get_module_top_features(adata, p_adj_cutoff=0.01, background = adata.var_names)
tab.to_csv(results_path + data_name + 'module_top_features_table.csv')
[23]:
tab
[23]:
Module Number Genes Ranked TFs Ranked Terms Ranked
0 0 Akr1cl, Smtnl2, Ces1d, Rapsn, Adamts16, Adamts... Hes1, Elf5, Zfp36l1, Rest, Sox9, Ehf, Sox5, Tc... extracellular matrix organization (GO:0030198)...
1 1 Ska1, Aurkb, Kif11, Melk, Bub1, Casc5, 2810417... Pou2f1, Dnmt1, Atf4, Nfyb, Dnajc21, Nr4a1, Zbt... mitotic spindle organization (GO:0007052), mic...
2 2 Nptx2, Dll1, Eda, Grin3a, Reps2, Tmem171, Nfix... Nfix, Tcf20, Dach2, Arid3a, Zbtb16, Foxa3, Neu...
3 3 Chgb, Igfbp2, Grik2, Gadd45a, Glod5, Dock4, Me... Neurog3, Ikzf5, Lmx1b, Zfp423, Smarcd2, Klf10,...
4 4 Srrm4, Btbd11, Adamts18, Olfm1, Scube1, Chst9,... Zfp423, D630045J12Rik, E2f1, Kcnip3, Zeb1, Pbx...
5 5 Chst9, Bach2, Asic2, Akt3, Dgkb, Hpca, Fer, Sl... Bach2, Nfatc2, Aff2, Plag1, Runx1t1, Ebf1, Ddi...
6 6 Ncam2, Clstn2, Ebf1, Asic2, Sulf1, Ctnna3, Mar... Ebf1, Znfx1, Rfx2, Runx1t1, Hivep1, St18, Zfp3...
7 7 Ccl4, Tmem178b, Sdk1, Mb21d2, Acsl1, Apba1, Cc... Purg, Rfx2, Zfp804a, Vdr, Zfp90, Creb3, Prdm16...
8 8 Pik3c2g, Plac8, Filip1, Zbtb7c, Gm2694, Zfp804... Zbtb7c, Zfp804a, Zfp174, Zbtb10, Mnx1, Zfp672,...
9 9 Tshz3, Ctnna2, Ppp1r14c, Gm2115, Lrpprc, Fam15... Tshz3, Mef2a, Zfp719, Mlxipl, Pdx1, Mnx1, Mllt...
10 10 Pou6f2, Pcdh15, Gria2, Nlgn1, Irx2, Gpr179, Kc... Irx2, Smarca1, Etv1, Arx, Esrrg, Ncoa7, Zeb1, ...
11 11 Nell1, Iapp, Ins1, Ins2, Stmn2, Slc30a8, Gcg, ... Hnf4a, Myt1l, Rora, Zfat, Zfp516, Zfhx2, Foxj3...
12 12 Npy, Slc24a2, P2ry1, Sytl4, Arhgap36, Mapt, Sd... Pcbp3, Rora, Mnx1, Pdx1, Mlxipl, Mllt11, Purg,...
13 13 Zeb2, Uty, Zfp804a, Camkmt, Gpr173, Eml6, Mib1... Zeb2, Zfp804a, Esrrg, Purg, Plag1, Aff2, Ebf1,...
14 14 Zeb2, Uty, Camkmt, Zfp804a, Gpr173, Eml6, Mib1... Zeb2, Zfp804a, Esrrg, Purg, Plag1, Aff2, Ebf1,...
[24]:
all_results[0]
[24]:
Gene_set Term P-value Adjusted P-value Old P-value Old adjusted P-value Odds Ratio Combined Score Genes
0 GO_Biological_Process_2021 extracellular matrix organization (GO:0030198) 7.729784e-07 0.000455 0 0 8.410380 118.359400 ADAMTS16;COL27A1;FLRT2;ADAMTS1;SPP1;CAPG;LAMB1...
1 GO_Biological_Process_2021 positive regulation of cell population prolife... 8.096767e-07 0.000455 0 0 6.131783 86.008255 TCF7L2;EMP2;LAMB1;LAMC1;S100B;BST2;GJA1;CDK6;E...
2 GO_Biological_Process_2021 regulation of epithelial cell proliferation (G... 2.013023e-06 0.000754 0 0 16.715467 219.237948 TCF7L2;CDK6;LAMB1;LAMC1;SOX9;FGFR2;ZFP36L1
3 GO_Biological_Process_2021 positive regulation of epithelial cell prolife... 1.448439e-05 0.004066 0 0 11.413130 127.170097 NOTCH2;TCF7L2;CDH3;LAMB1;SOX9;LAMC1;FGFR2
4 GO_Biological_Process_2021 negative regulation of pri-miRNA transcription... 3.596750e-05 0.005291 0 0 inf inf REST;NFIB;SOX9
5 GO_Biological_Process_2021 regulation of mesenchymal stem cell differenti... 3.596750e-05 0.005291 0 0 inf inf REST;SOX9;SOX5
6 GO_Biological_Process_2021 membrane raft assembly (GO:0001765) 3.596750e-05 0.005291 0 0 inf inf ANXA2;EMP2;S100A10
7 GO_Biological_Process_2021 glial cell differentiation (GO:0010001) 3.769270e-05 0.005291 0 0 40.236111 409.846798 ERBB3;NFIB;ADGRG6;SOX9
8 GO_Biological_Process_2021 regulation of keratinocyte proliferation (GO:0... 7.345821e-05 0.009166 0 0 30.166667 287.150283 NOTCH2;CDH3;FGFR2;ZFP36L1

To plot the expression and RNA velocity of individual module marker genes:

[25]:
gene = 'Npy'
fig, ax = plt.subplots(1, 2, figsize = (15, 5))
sc.pl.umap(adata, color = [gene], s = 200, legend_loc='on data', show = False,
           title = gene + ' expression', ax = ax[0])
sc.pl.umap(adata, color = [gene], s = 200, legend_loc='on data', show = False,
           layer = 'velocity', title = gene + ' velocity', ax = ax[1])
[25]:
<AxesSubplot:title={'center':'Npy velocity'}, xlabel='UMAP1', ylabel='UMAP2'>
../../_images/notebooks_publication_figures_cell2fate_PancreasWithCC_45_1.png