Cell2Fate dynamical model analysis of PBMC dataset

[2]:
import cell2fate as c2f
import scanpy as sc
import numpy as np
import matplotlib.pyplot as plt
import os
Global seed set to 0

You need to change this to suitable directories on your system:

[3]:
# Data name, where to download data and where to save results
data_name = 'PBMC'
data_path = '/nfs/team283/aa16/data/fate_benchmarking/benchmarking_datasets/PBMC/pbmc68k.h5ad'
results_path = '/nfs/team283/aa16/cell2fate_paper_results/PBMC/'
[4]:
adata = sc.read_h5ad(data_path)

Load the data and extract most variable genes (and optionally remove some clusters).

[5]:
adata = sc.read_h5ad(data_path)
clusters_to_remove = []
adata = c2f.utils.get_training_data(adata, cells_per_cluster = 1000, cluster_column = 'clusters',
                                    remove_clusters = clusters_to_remove,
                                min_shared_counts = 0, n_var_genes= 3000)
# (For some reasons selecting min_shared_counts = 20 removes almost all genes in this dataset,
# hence we exceptionally set it to 0 here)
Keeping at most 1000 cells per cluster
Extracted 3000 highly variable genes.
[6]:
adata
[6]:
AnnData object with n_obs × n_vars = 9359 × 3000
    obs: 'celltype', 'clusters', 'initial_size_spliced', 'initial_size_unspliced', 'initial_size'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
    obsm: 'X_tsne', 'X_umap'
    layers: 'spliced', 'unspliced'
[7]:
fig, ax = plt.subplots(1,1, figsize = (6, 4))
sc.pl.umap(adata, color = ['clusters'], s = 200, legend_loc='right margin', show = False, ax = ax)
plt.savefig(results_path + data_name + 'UMAP_clusters.pdf')
../../_images/notebooks_publication_figures_cell2fate_PBMC_DynamicalModel_8_0.png
[8]:
adata
[8]:
AnnData object with n_obs × n_vars = 9359 × 3000
    obs: 'celltype', 'clusters', 'initial_size_spliced', 'initial_size_unspliced', 'initial_size'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
    uns: 'clusters_colors'
    obsm: 'X_tsne', 'X_umap'
    layers: 'spliced', 'unspliced'

We initialize the model

[9]:
c2f.Cell2fate_DynamicalModel.setup_anndata(adata, spliced_label='spliced', unspliced_label='unspliced')
[10]:
n_modules = c2f.utils.get_max_modules(adata)
Leiden clustering ...
WARNING: You’re trying to run this on 704 dimensions of `.X`, if you really want this, set `use_rep='X'`.
         Falling back to preprocessing with `sc.pp.pca` and default params.
2023-04-20 10:52:35.613580: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /software/gcc-8.2.0/lib64/
2023-04-20 10:52:35.613727: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /software/gcc-8.2.0/lib64/
2023-04-20 10:52:35.613742: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Number of Leiden Clusters: 137
Maximal Number of Modules: 157

Louvain clustering seems to give an excessively large number of clusters in this dataset, so instead choose number of modules, based on number of annotated clusters times 1.15:

[11]:
n_modules = 12
[12]:
mod = c2f.Cell2fate_DynamicalModel(adata, n_modules = n_modules)

Training the model:

[13]:
mod.train()
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Epoch 500/500: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [22:54<00:00,  2.75s/it, v_num=1, elbo_train=1.28e+6]

Exporting relevant model parameters to the anndata object:

[14]:
adata = mod.export_posterior(adata)
sample_kwargs['batch_size'] 9359
Sampling local variables, batch: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:34<00:00, 34.56s/it]
Sampling global variables, sample: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 29/29 [00:24<00:00,  1.18it/s]
Warning: Saving ALL posterior samples. Specify "return_samples: False" to save just summary statistics.

One of the interesting parameter posteriors that was saved to the anndata object is the differentiation time:

[15]:
fig, ax = plt.subplots(1,2, figsize = (15, 5))
sc.pl.umap(adata, color = ['Time (hours)'], legend_loc = 'right margin',
                size = 200, color_map = 'inferno', ncols = 2, show = False, ax = ax[0])
sc.pl.umap(adata, color = ['Time Uncertainty (sd)'], legend_loc = 'right margin',
                size = 200, color_map = 'inferno', ncols = 2, show = False, ax = ax[1],
              vmin = 0, vmax = 1.3)
plt.savefig(results_path + data_name + 'UMAP_Time_nModules' + str(n_modules) + '.pdf')
../../_images/notebooks_publication_figures_cell2fate_PBMC_DynamicalModel_21_0.png

Potentially more insightful is the Time coefficient of variation (standard deviation divided by mean):

[17]:
adata.obs['Time (CV)'] = adata.obs['Time Uncertainty (sd)']/adata.obs['Time (hours)']
fig, ax = plt.subplots(1,1, figsize = (7.5, 5))
sc.pl.umap(adata, color = ['Time (CV)'], legend_loc = 'right margin',
                size = 200, color_map = 'inferno', ncols = 2, show = False, ax = ax,
              vmin = 0, vmax = 1)
plt.savefig(results_path + data_name + 'UMAP_CV.pdf')
../../_images/notebooks_publication_figures_cell2fate_PBMC_DynamicalModel_23_0.png

We can compute some module statistics to visualize the activity of the underlying modules:

[15]:
adata = mod.compute_module_summary_statistics(adata)
mod.plot_module_summary_statistics(adata, save = results_path + data_name + 'module_summary_stats_plot.pdf')
../../_images/notebooks_publication_figures_cell2fate_PBMC_DynamicalModel_25_0.png

This is an alternative way to visualize module activation over time:

[16]:
mod.compare_module_activation(adata, chosen_modules = [1,2,3,4],
                         save = results_path + data_name + 'module_activation_comparison.pdf')
../../_images/notebooks_publication_figures_cell2fate_PBMC_DynamicalModel_27_0.png

Computing Velocity Graph:

[23]:
mod.compute_and_plot_total_velocity_scvelo(adata, save = results_path + data_name + 'total_velocity_plots.svg', delete = False)
Computing total RNAvelocity ...
../../_images/notebooks_publication_figures_cell2fate_PBMC_DynamicalModel_29_2.png