Cell2Fate analysis of mouse pancreas development dataset

This tutorial shows how to use cell2fate for RNA velocity analysis on mouse pancreas development dataset. Cell2fate is an RNA velocity model that linearizes the velocity differential equations which allows solving a biophysically accurate model in a fully Bayesian fashion, enhancing the interpretability of transcriptional dynamics and connecting them with spatial organisation of tissues.

Note!

Cell2Fate is an independent package, but is powered by scvi-tools. If you have questions about scvi-tools please visit https://discourse.scverse.org/c/help/scvi-tools/7.

Loading packages

[1]:

import cell2fate as c2f
import scanpy as sc
import numpy as np
import matplotlib.pyplot as plt
import os
data_name = 'Pancreas_with_cc'

Global seed set to 0

Loading single cell RNA sequencing data

You need to change this to suitable directories on your system:

[2]:

# Data name, where to download data and where to save results
data_path = '/nfs/team283/se14/cell2fate_tutorial_data/Pancreas_with_cc/'
results_path = '/nfs/team283/se14/cell2fate_tutorial_data/Pancreas_with_cc/'

[3]:

# Downloading data into specified directory:
os.system('cd ' + data_path + ' && wget -q https://cell2fate.cog.sanger.ac.uk/' + data_name + '/' + data_name + '_anndata.h5ad')

[3]:

Load the data and extract most variable genes (and optionally remove some clusters). We expect the adata object to contain spliced and unspliced raw UMI counts in adata.layers[‘spliced’] and adata.layers[‘unspliced’]. Spliced and unspliced counts can be for example be quantified with the velocyto method. Starsolo also has the option to run velocyto as part of the general mapping, demultiplexing and count quantification pipeline. For more information on spliced and unspliced count quantification this publication gives a good overview.

[4]:

adata = sc.read_h5ad(data_path + data_name + '_anndata.h5ad')
clusters_to_remove = []
adata =  c2f.utils.get_training_data(adata, cells_per_cluster = 10**5, cluster_column = 'clusters',
                                    remove_clusters = clusters_to_remove,
                                min_shared_counts = 20, n_var_genes= 3000)

Keeping at most 100000 cells per cluster
Filtered out 20801 genes that are detected 20 counts (shared).
Extracted 3000 highly variable genes.

[5]:

fig, ax = plt.subplots(1,1, figsize = (6, 4))
sc.pl.umap(adata, color = ['clusters'], s = 200, legend_loc='on data', show = False, ax = ax)
plt.savefig(results_path + data_name + 'UMAP_clusters.pdf')

../../_images/notebooks_publication_figures_cell2fate_PancreasWithCC_8_0.png

We initialize the model, which includes automatically setting a maximal number of modules, based on the number of Louvain clusters (times 1.15) in the data.

Initializing and training the model

[6]:

c2f.Cell2fate_DynamicalModel.setup_anndata(adata, spliced_label='spliced', unspliced_label='unspliced')

No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

[7]:

n_modules = c2f.utils.get_max_modules(adata)

Leiden clustering ...
Number of Leiden Clusters: 13
Maximal Number of Modules: 14

[8]:

mod = c2f.Cell2fate_DynamicalModel(adata, n_modules = n_modules)

Training the model:

[9]:

mod.train()

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [MIG-GPU-8391e223-da74-0458-e121-783edc78bf21/0/0]

Epoch 500/500: 100%|██████████| 500/500 [11:27<00:00,  1.38s/it, v_num=1, elbo_train=1.12e+7]

Exporting and visualizing posteriors

Exporting relevant model parameters to the anndata object:

[10]:

adata = mod.export_posterior(adata)

Sampling local variables, batch: 100%|██████████| 1/1 [00:14<00:00, 14.95s/it]
Sampling global variables, sample: 100%|██████████| 29/29 [00:12<00:00,  2.24it/s]
Warning: Saving ALL posterior samples. Specify "return_samples: False" to save just summary statistics.

One of the interesting parameter posteriors that was saved to the anndata object is the differentiation time:

[11]:

fig, ax = plt.subplots(1,2, figsize = (15, 5))
sc.pl.umap(adata, color = ['Time (hours)'], legend_loc = 'right margin',
                size = 200, color_map = 'inferno', ncols = 2, show = False, ax = ax[0])
sc.pl.umap(adata, color = ['Time Uncertainty (sd)'], legend_loc = 'right margin',
                size = 200, color_map = 'inferno', ncols = 2, show = False, ax = ax[1])
plt.savefig(results_path + data_name + 'UMAP_Time_nModules' + str(n_modules) + '.pdf')

../../_images/notebooks_publication_figures_cell2fate_PancreasWithCC_19_0.png

Visualizing summary statistics and module activations

We can compute some module statistics to visualize the activity of the underlying modules:

[12]:

adata = mod.compute_module_summary_statistics(adata)
mod.plot_module_summary_statistics(adata, save = results_path + data_name + 'module_summary_stats_plot.pdf')

../../_images/notebooks_publication_figures_cell2fate_PancreasWithCC_21_0.png

This is an alternative way to visualize module activation over time:

[13]:

mod.compare_module_activation(adata, chosen_modules = [1,2,3,4],
                         save = results_path + data_name + 'module_activation_comparison.pdf')

../../_images/notebooks_publication_figures_cell2fate_PancreasWithCC_23_0.png

Visualizing velocities and module specific velocities

For large datasets (> 10000 cells) we recommend the scalable scVelo workflow for computing and plotting the velocity graph, which is used in this method:

[14]:

mod.compute_and_plot_total_velocity_scvelo(adata, save = results_path + data_name + 'total_velocity_plots.png', delete = False)

Computing total RNAvelocity ...

../../_images/notebooks_publication_figures_cell2fate_PancreasWithCC_25_2.png

For smaller datasets (< 10000 cells) our own method that takes into account posterior uncertainty, when computing the velocity graph can be used:

[15]:

mod.compute_and_plot_total_velocity(adata, save = results_path + data_name + 'total_velocity_plots.png', delete = False)

Computing total RNAvelocity ...

../../_images/notebooks_publication_figures_cell2fate_PancreasWithCC_27_1.png

The delete=False command saves the RNA velocity for each gene in both cases in an adata.layers slot.

For a visualization that is more similar to the scvelo plots:

[16]:

import scvelo as scv
fix, ax = plt.subplots(1, 1, figsize = (8, 6))
scv.pl.velocity_embedding_stream(adata, basis='umap', save = False, vkey='Velocity',
                                 show = False, ax = ax, legend_fontsize = 13)
plt.savefig(results_path + data_name + 'total_velocity_plots.png')

../../_images/notebooks_publication_figures_cell2fate_PancreasWithCC_30_0.png

After computing velocities, relative module activations for trajectories can be plotted.

[17]:

chosen_module = 0
mod.visualize_module_trajectories(adata, chosen_module)

../../_images/notebooks_publication_figures_cell2fate_PancreasWithCC_32_0.png

Keyword arguments for plotting can also be defined to further adjust velocity graphs, such as the location of the legend (cluster names) and the cmap.

[18]:

chosen_module = 0
plotting_kwargs={"color": 'clusters', 'legend_fontsize': 10, 'legend_loc': 'right_margin', 'dpi': 300, 'cmap': 'RdYlGn'}
mod.visualize_module_trajectories(adata, chosen_module, plotting_kwargs=plotting_kwargs)
plt.savefig(results_path + data_name + 'module_0_velocities.png', bbox_inches='tight')

../../_images/notebooks_publication_figures_cell2fate_PancreasWithCC_34_0.png

[19]:

adata

[19]:

AnnData object with n_obs × n_vars = 3696 × 3000
    obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score', 'initial_size_spliced', 'initial_size_unspliced', 'initial_size', '_indices', '_scvi_batch', 'Time (hours)', 'Time Uncertainty (sd)', 'Module 0 Activation', 'Module 0 State', 'Module 1 Activation', 'Module 1 State', 'Module 2 Activation', 'Module 2 State', 'Module 3 Activation', 'Module 3 State', 'Module 4 Activation', 'Module 4 State', 'Module 5 Activation', 'Module 5 State', 'Module 6 Activation', 'Module 6 State', 'Module 7 Activation', 'Module 7 State', 'Module 8 Activation', 'Module 8 State', 'Module 9 Activation', 'Module 9 State', 'Module 10 Activation', 'Module 10 State', 'Module 11 Activation', 'Module 11 State', 'Module 12 Activation', 'Module 12 State', 'Module 13 Activation', 'Module 13 State', 'Module 14 Activation', 'Module 14 State', 'velocity_self_transition'
    var: 'highly_variable_genes', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
    uns: 'clusters_coarse_colors', 'clusters_colors', 'day_colors', 'neighbors', 'pca', '_scvi_uuid', '_scvi_manager_uuid', 'mod', 'Module 0 State_colors', 'Module 1 State_colors', 'Module 2 State_colors', 'Module 3 State_colors', 'Module 4 State_colors', 'Module 5 State_colors', 'Module 6 State_colors', 'Module 7 State_colors', 'Module 8 State_colors', 'Module 9 State_colors', 'Module 10 State_colors', 'Module 11 State_colors', 'Module 12 State_colors', 'Module 13 State_colors', 'Module 14 State_colors', 'velocity_graph', 'velocity_graph_neg', 'velocity_params', 'Velocity_graph'
    obsm: 'X_pca', 'X_umap', 'velocity_umap', 'Velocity_umap'
    layers: 'spliced', 'unspliced', 'velocity', 'Velocity'
    obsp: 'distances', 'connectivities', 'binary'

Plotting technical variables and other downstream analyses

Technical variables usually show lower detection efficiency and higher noise (= lower overdispersion parameter) for unspliced counts:

[20]:

mod.plot_technical_variables(adata, save = results_path + data_name + 'technical_variables_overview_plot.pdf')

../../_images/notebooks_publication_figures_cell2fate_PancreasWithCC_37_0.png

This is how to have a look at the various rate parameters the optimization converged to:

[21]:

print('A_mgON mean:', np.mean(mod.samples['post_sample_means']['A_mgON']))
print('gamma_g mean:', np.mean(mod.samples['post_sample_means']['gamma_g']))
print('beta_g mean:', np.mean(mod.samples['post_sample_means']['beta_g']))
print('lam_mi, all modules: \n \n', np.round(mod.samples['post_sample_means']['lam_mi'],2))

A_mgON mean: 0.09715694
gamma_g mean: 0.77538234
beta_g mean: 1.3485191
lam_mi, all modules:

 [[[25.9  19.62]]

 [[ 3.98 10.35]]

 [[18.54  5.46]]

 [[ 3.52  2.01]]

 [[ 3.45  2.54]]

 [[11.37  7.23]]

 [[12.05  9.33]]

 [[14.47  9.63]]

 [[ 5.64  4.82]]

 [[ 9.89  6.46]]

 [[ 3.96  4.21]]

 [[ 1.49  1.32]]

 [[12.4  11.3 ]]

 [[ 4.49  7.6 ]]

 [[ 6.64  6.66]]]

This method returns orders the genes and TFs in each module from most to least enriched. And it also performs gene set enrichment analysis:

[22]:

tab, all_results = mod.get_module_top_features(adata, p_adj_cutoff=0.01, background = adata.var_names)
tab.to_csv(results_path + data_name + 'module_top_features_table.csv')

[23]:

tab

[23]:

	Module Number	Genes Ranked	TFs Ranked	Terms Ranked
0	0	Akr1cl, Smtnl2, Ces1d, Rapsn, Adamts16, Adamts...	Hes1, Elf5, Zfp36l1, Rest, Sox9, Ehf, Sox5, Tc...	extracellular matrix organization (GO:0030198)...
1	1	Ska1, Aurkb, Kif11, Melk, Bub1, Casc5, 2810417...	Pou2f1, Dnmt1, Atf4, Nfyb, Dnajc21, Nr4a1, Zbt...	mitotic spindle organization (GO:0007052), mic...
2	2	Nptx2, Dll1, Eda, Grin3a, Reps2, Tmem171, Nfix...	Nfix, Tcf20, Dach2, Arid3a, Zbtb16, Foxa3, Neu...
3	3	Chgb, Igfbp2, Grik2, Gadd45a, Glod5, Dock4, Me...	Neurog3, Ikzf5, Lmx1b, Zfp423, Smarcd2, Klf10,...
4	4	Srrm4, Btbd11, Adamts18, Olfm1, Scube1, Chst9,...	Zfp423, D630045J12Rik, E2f1, Kcnip3, Zeb1, Pbx...
5	5	Chst9, Bach2, Asic2, Akt3, Dgkb, Hpca, Fer, Sl...	Bach2, Nfatc2, Aff2, Plag1, Runx1t1, Ebf1, Ddi...
6	6	Ncam2, Clstn2, Ebf1, Asic2, Sulf1, Ctnna3, Mar...	Ebf1, Znfx1, Rfx2, Runx1t1, Hivep1, St18, Zfp3...
7	7	Ccl4, Tmem178b, Sdk1, Mb21d2, Acsl1, Apba1, Cc...	Purg, Rfx2, Zfp804a, Vdr, Zfp90, Creb3, Prdm16...
8	8	Pik3c2g, Plac8, Filip1, Zbtb7c, Gm2694, Zfp804...	Zbtb7c, Zfp804a, Zfp174, Zbtb10, Mnx1, Zfp672,...
9	9	Tshz3, Ctnna2, Ppp1r14c, Gm2115, Lrpprc, Fam15...	Tshz3, Mef2a, Zfp719, Mlxipl, Pdx1, Mnx1, Mllt...
10	10	Pou6f2, Pcdh15, Gria2, Nlgn1, Irx2, Gpr179, Kc...	Irx2, Smarca1, Etv1, Arx, Esrrg, Ncoa7, Zeb1, ...
11	11	Nell1, Iapp, Ins1, Ins2, Stmn2, Slc30a8, Gcg, ...	Hnf4a, Myt1l, Rora, Zfat, Zfp516, Zfhx2, Foxj3...
12	12	Npy, Slc24a2, P2ry1, Sytl4, Arhgap36, Mapt, Sd...	Pcbp3, Rora, Mnx1, Pdx1, Mlxipl, Mllt11, Purg,...
13	13	Zeb2, Uty, Zfp804a, Camkmt, Gpr173, Eml6, Mib1...	Zeb2, Zfp804a, Esrrg, Purg, Plag1, Aff2, Ebf1,...
14	14	Zeb2, Uty, Camkmt, Zfp804a, Gpr173, Eml6, Mib1...	Zeb2, Zfp804a, Esrrg, Purg, Plag1, Aff2, Ebf1,...

[24]:

all_results[0]

[24]:

	Gene_set	Term	P-value	Adjusted P-value	Odds Ratio	Combined Score	Genes
0	GO_Biological_Process_2021	extracellular matrix organization (GO:0030198)	7.729784e-07	0.000455	8.410380	118.359400	ADAMTS16;COL27A1;FLRT2;ADAMTS1;SPP1;CAPG;LAMB1...
1	GO_Biological_Process_2021	positive regulation of cell population prolife...	8.096767e-07	0.000455	6.131783	86.008255	TCF7L2;EMP2;LAMB1;LAMC1;S100B;BST2;GJA1;CDK6;E...
2	GO_Biological_Process_2021	regulation of epithelial cell proliferation (G...	2.013023e-06	0.000754	16.715467	219.237948	TCF7L2;CDK6;LAMB1;LAMC1;SOX9;FGFR2;ZFP36L1
3	GO_Biological_Process_2021	positive regulation of epithelial cell prolife...	1.448439e-05	0.004066	11.413130	127.170097	NOTCH2;TCF7L2;CDH3;LAMB1;SOX9;LAMC1;FGFR2
4	GO_Biological_Process_2021	negative regulation of pri-miRNA transcription...	3.596750e-05	0.005291	inf	inf	REST;NFIB;SOX9
5	GO_Biological_Process_2021	regulation of mesenchymal stem cell differenti...	3.596750e-05	0.005291	inf	inf	REST;SOX9;SOX5
6	GO_Biological_Process_2021	membrane raft assembly (GO:0001765)	3.596750e-05	0.005291	inf	inf	ANXA2;EMP2;S100A10
7	GO_Biological_Process_2021	glial cell differentiation (GO:0010001)	3.769270e-05	0.005291	40.236111	409.846798	ERBB3;NFIB;ADGRG6;SOX9
8	GO_Biological_Process_2021	regulation of keratinocyte proliferation (GO:0...	7.345821e-05	0.009166	30.166667	287.150283	NOTCH2;CDH3;FGFR2;ZFP36L1

To plot the expression and RNA velocity of individual module marker genes:

[25]:

gene = 'Npy'
fig, ax = plt.subplots(1, 2, figsize = (15, 5))
sc.pl.umap(adata, color = [gene], s = 200, legend_loc='on data', show = False,
           title = gene + ' expression', ax = ax[0])
sc.pl.umap(adata, color = [gene], s = 200, legend_loc='on data', show = False,
           layer = 'velocity', title = gene + ' velocity', ax = ax[1])

[25]:

<AxesSubplot:title={'center':'Npy velocity'}, xlabel='UMAP1', ylabel='UMAP2'>

../../_images/notebooks_publication_figures_cell2fate_PancreasWithCC_45_1.png