Multi-modality pipeline: analyzing single-cell multiome data (ATAC + Gene Expression)#

Introduction#

In this tutorial we will analyze single-cell multiome data from peripheral blood mononuclear cells (PBMCs). The dataset used in this tutorial can be downloaded from: http://renlab.sdsc.edu/kai/10x-Multiome/.

In addition to SnapATAC2, we will use the scanpy and scglue packages to perform the integration.
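The two h5ad files used below can be fetched directly from the link above. Below is a minimal sketch for downloading them, assuming the files are hosted under the names used in this tutorial:

import urllib.request

# Download the RNA and ATAC h5ad files used in this tutorial.
# The file names and URL layout are assumed to match the listing at
# http://renlab.sdsc.edu/kai/10x-Multiome/.
base_url = "http://renlab.sdsc.edu/kai/10x-Multiome/"
for fname in ["10x-Multiome-Pbmc10k-RNA.h5ad", "10x-Multiome-Pbmc10k-ATAC.h5ad"]:
    urllib.request.urlretrieve(base_url + fname, fname)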

[1]:
import scanpy as sc
import snapatac2 as snap
import anndata as ad
import scglue
import itertools

Analyze gene expression data#

[2]:
rna = sc.read_h5ad("10x-Multiome-Pbmc10k-RNA.h5ad")
rna
[2]:
AnnData object with n_obs × n_vars = 9631 × 29095
    obs: 'domain', 'cell_type'
    var: 'gene_ids', 'feature_types'

First we back up the raw counts in a separate layer of the AnnData object and then perform standard preprocessing.

[3]:
rna.layers["counts"] = rna.X.copy()
sc.pp.normalize_total(rna)
sc.pp.log1p(rna)
sc.pp.highly_variable_genes(rna)
rna = rna[:, rna.var.highly_variable]
sc.pp.scale(rna, max_value=10)
/home/kaizhang/data/software/miniconda3/lib/python3.8/site-packages/scanpy/preprocessing/_simple.py:843: UserWarning: Received a view of an AnnData. Making a copy.
  view_to_actual(adata)

We then perform dimensionality reduction using PCA, followed by UMAP embedding.

[4]:
sc.tl.pca(rna)
sc.pp.neighbors(rna)
sc.tl.umap(rna)
[5]:
sc.pl.umap(rna, color="cell_type")
../_images/tutorials_modality_9_0.png

Analyze chromatin accessibility data#

[6]:
atac = snap.read("10x-Multiome-Pbmc10k-ATAC.h5ad")
atac
[6]:
AnnData object with n_obs x n_vars = 9631 x 107194 backed at '10x-Multiome-Pbmc10k-ATAC.h5ad'
    obs: cells, domain, cell_type
    var: peaks, feature_types
    obsm: X_umap, X_spectral
    uns: spectral_eigenvalue
[7]:
snap.tl.spectral(atac, features=None, n_comps=30)
snap.tl.umap(atac)
Compute similarity matrix
Normalization
Perform decomposition
[8]:
snap.pl.umap(atac, color="cell_type", interactive=False)
/home/kaizhang/data/software/miniconda3/lib/python3.8/site-packages/polars/internals/frame.py:1675: UserWarning: setting a DataFrame by indexing is deprecated; Consider using DataFrame.with_column
  warnings.warn(
[8]:
../_images/tutorials_modality_13_1.png
[9]:
atac.close()

Perform data integration#

First we need to assign genomic coordinates to the genes in the scRNA-seq experiment. To do this, we need gene annotations in GTF format, which can be downloaded from GENCODE.
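If the annotation file is not already on disk, it can be fetched from the GENCODE release 40 downloads. Below is a minimal sketch; the URL follows the usual GENCODE FTP layout and should be adjusted if the release has moved:

import urllib.request

# Download the GENCODE v40 annotation (chr_patch_hapl_scaff) used below.
# The URL is an assumption based on the standard GENCODE FTP layout.
gtf_url = (
    "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/"
    "release_40/gencode.v40.chr_patch_hapl_scaff.annotation.gtf.gz"
)
urllib.request.urlretrieve(gtf_url, "gencode.v40.chr_patch_hapl_scaff.annotation.gtf.gz")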

[10]:
scglue.data.get_gene_annotation(
    rna, gtf="gencode.v40.chr_patch_hapl_scaff.annotation.gtf.gz",
    gtf_by="gene_name"
)
rna = rna[:, rna.var.dropna().index.to_numpy()]
rna.var.loc[:, ["chrom", "chromStart", "chromEnd"]].head()
[10]:
chrom chromStart chromEnd
genes
MEGF6 chr1 3487950.0 3611508.0
TNFRSF25 chr1 6460785.0 6466175.0
CAMTA1 chr1 6785453.0 7769706.0
SLC25A33 chr1 9539464.0 9585173.0
DHRS3 chr1 12567909.0 12618210.0
[11]:
# Re-open the ATAC data as an in-memory AnnData object (the backed
# SnapATAC2 object was closed above) and build a k-NN graph on the
# spectral embedding.
atac = ad.read("10x-Multiome-Pbmc10k-ATAC.h5ad")
sc.pp.neighbors(atac, use_rep="X_spectral")
atac
[11]:
AnnData object with n_obs × n_vars = 9631 × 107194
    obs: 'domain', 'cell_type'
    var: 'feature_types'
    uns: 'spectral_eigenvalue', 'neighbors'
    obsm: 'X_spectral', 'X_umap'
    obsp: 'distances', 'connectivities'
[12]:
# Peak names have the form "chrom:start-end"; split them into the
# BED-style coordinate columns (chrom, chromStart, chromEnd) used by scglue.
split = atac.var_names.str.split(r"[:-]")
atac.var["chrom"] = split.map(lambda x: x[0])
atac.var["chromStart"] = split.map(lambda x: x[1])
atac.var["chromEnd"] = split.map(lambda x: x[2])
atac.var.head()
[12]:
feature_types chrom chromStart chromEnd
peaks
chr1:816881-817647 Peaks chr1 816881 817647
chr1:819912-823500 Peaks chr1 819912 823500
chr1:825827-825889 Peaks chr1 825827 825889
chr1:826612-827979 Peaks chr1 826612 827979
chr1:841243-843059 Peaks chr1 841243 843059
[13]:
# Build the prior guidance graph connecting genes and peaks based on
# genomic overlap.
graph = scglue.genomics.rna_anchored_prior_graph(rna, atac)
graph
[13]:
<networkx.classes.multidigraph.MultiDiGraph at 0x2ba410869e50>
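The returned object is a standard networkx graph, so we can quickly sanity-check it before training. A minimal sketch using only the networkx API (the exact node and edge counts depend on the feature sets):

# Inspect the size of the guidance graph and verify that every feature
# of both modalities is represented as a node.
print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")
all_features = set(rna.var_names) | set(atac.var_names)
print(all_features.issubset(graph.nodes))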
[14]:
# Configure each modality for GLUE: negative binomial ("NB") likelihood,
# highly variable features only, the raw counts layer for RNA, and the
# pre-computed low-dimensional representations as encoder inputs.
scglue.models.configure_dataset(
    rna, "NB", use_highly_variable=True,
    use_layer="counts", use_rep="X_pca"
)

scglue.models.configure_dataset(
    atac, "NB", use_highly_variable=True,
    use_rep="X_spectral"
)
/home/kaizhang/.local/lib/python3.8/site-packages/anndata/compat/_overloaded_dict.py:106: ImplicitModificationWarning:

Trying to modify attribute `._uns` of view, initializing view as actual.

[15]:
# Restrict the guidance graph to the highly variable features of both modalities.
graph = graph.subgraph(itertools.chain(
    rna.var.query("highly_variable").index,
    atac.var.query("highly_variable").index
))
[16]:
# This is a patch for numpy >= 1.21 to fix a bug described in
# https://github.com/lmcinnes/pynndescent/issues/163
import pynndescent
import numba
import numpy as np
@numba.njit(fastmath=True)
def correct_alternative_cosine(ds):
    result = np.empty_like(ds)
    for i in range(ds.shape[0]):
        result[i] = 1.0 - np.power(2.0, ds[i])
    return result
pynn_dist_fns_fda = pynndescent.distances.fast_distance_alternatives
pynn_dist_fns_fda["cosine"]["correction"] = correct_alternative_cosine
pynn_dist_fns_fda["dot"]["correction"] = correct_alternative_cosine

We now start training the model.

[17]:
glue = scglue.models.fit_SCGLUE(
    {"rna": rna, "atac": atac}, graph,
    fit_kws={"directory": "glue"}
)
[INFO] fit_SCGLUE: Pretraining SCGLUE model...
[WARNING] SCGLUEModel: It is recommended that `use_rep` dimensionality be equal or larger than `latent_dim`.
[INFO] autodevice: Using GPU 4 as computation device.
[INFO] SCGLUEModel: Setting `graph_batch_size` = 12615
[INFO] SCGLUEModel: Setting `max_epochs` = 178
[INFO] SCGLUEModel: Setting `patience` = 15
[INFO] SCGLUEModel: Setting `reduce_lr_patience` = 8
[INFO] SCGLUETrainer: Using training directory: "glue/pretrain"
/home/kaizhang/data/software/miniconda3/lib/python3.8/site-packages/torch/distributions/negative_binomial.py:97: UserWarning:

Specified kernel cache directory could not be created! This disables kernel caching. Specified directory is /home/kaizhang/.cache/torch/kernels. This warning will appear only once per process. (Triggered internally at  ../aten/src/ATen/native/cuda/jit_utils.cpp:860.)

[INFO] SCGLUETrainer: [Epoch 10] train={'g_nll': 0.515, 'g_kl': 0.016, 'g_elbo': 0.53, 'x_rna_nll': 0.428, 'x_rna_kl': 0.009, 'x_rna_elbo': 0.437, 'x_atac_nll': 0.282, 'x_atac_kl': 0.002, 'x_atac_elbo': 0.283, 'dsc_loss': 0.689, 'vae_loss': 0.742, 'gen_loss': 0.707}, val={'g_nll': 0.513, 'g_kl': 0.016, 'g_elbo': 0.529, 'x_rna_nll': 0.433, 'x_rna_kl': 0.01, 'x_rna_elbo': 0.443, 'x_atac_nll': 0.279, 'x_atac_kl': 0.002, 'x_atac_elbo': 0.28, 'dsc_loss': 0.69, 'vae_loss': 0.744, 'gen_loss': 0.71}, 4.7s elapsed
[INFO] SCGLUETrainer: [Epoch 20] train={'g_nll': 0.483, 'g_kl': 0.015, 'g_elbo': 0.498, 'x_rna_nll': 0.425, 'x_rna_kl': 0.009, 'x_rna_elbo': 0.434, 'x_atac_nll': 0.277, 'x_atac_kl': 0.001, 'x_atac_elbo': 0.278, 'dsc_loss': 0.688, 'vae_loss': 0.733, 'gen_loss': 0.698}, val={'g_nll': 0.482, 'g_kl': 0.015, 'g_elbo': 0.497, 'x_rna_nll': 0.427, 'x_rna_kl': 0.009, 'x_rna_elbo': 0.436, 'x_atac_nll': 0.276, 'x_atac_kl': 0.002, 'x_atac_elbo': 0.278, 'dsc_loss': 0.69, 'vae_loss': 0.734, 'gen_loss': 0.699}, 47.1s elapsed
Epoch 00023: reducing learning rate of group 0 to 2.0000e-04.
Epoch 00023: reducing learning rate of group 0 to 2.0000e-04.
[INFO] LRScheduler: Learning rate reduction: step 1
[INFO] SCGLUETrainer: [Epoch 30] train={'g_nll': 0.477, 'g_kl': 0.015, 'g_elbo': 0.492, 'x_rna_nll': 0.424, 'x_rna_kl': 0.009, 'x_rna_elbo': 0.433, 'x_atac_nll': 0.276, 'x_atac_kl': 0.001, 'x_atac_elbo': 0.278, 'dsc_loss': 0.689, 'vae_loss': 0.731, 'gen_loss': 0.696}, val={'g_nll': 0.477, 'g_kl': 0.015, 'g_elbo': 0.492, 'x_rna_nll': 0.427, 'x_rna_kl': 0.009, 'x_rna_elbo': 0.436, 'x_atac_nll': 0.273, 'x_atac_kl': 0.001, 'x_atac_elbo': 0.274, 'dsc_loss': 0.686, 'vae_loss': 0.73, 'gen_loss': 0.696}, 47.6s elapsed
[INFO] SCGLUETrainer: [Epoch 40] train={'g_nll': 0.476, 'g_kl': 0.015, 'g_elbo': 0.491, 'x_rna_nll': 0.422, 'x_rna_kl': 0.009, 'x_rna_elbo': 0.431, 'x_atac_nll': 0.277, 'x_atac_kl': 0.001, 'x_atac_elbo': 0.278, 'dsc_loss': 0.69, 'vae_loss': 0.729, 'gen_loss': 0.695}, val={'g_nll': 0.477, 'g_kl': 0.015, 'g_elbo': 0.492, 'x_rna_nll': 0.425, 'x_rna_kl': 0.009, 'x_rna_elbo': 0.434, 'x_atac_nll': 0.275, 'x_atac_kl': 0.001, 'x_atac_elbo': 0.276, 'dsc_loss': 0.686, 'vae_loss': 0.73, 'gen_loss': 0.696}, 49.6s elapsed
Epoch 00042: reducing learning rate of group 0 to 2.0000e-05.
Epoch 00042: reducing learning rate of group 0 to 2.0000e-05.
[INFO] LRScheduler: Learning rate reduction: step 2
[INFO] SCGLUETrainer: [Epoch 50] train={'g_nll': 0.476, 'g_kl': 0.015, 'g_elbo': 0.491, 'x_rna_nll': 0.423, 'x_rna_kl': 0.009, 'x_rna_elbo': 0.432, 'x_atac_nll': 0.276, 'x_atac_kl': 0.001, 'x_atac_elbo': 0.277, 'dsc_loss': 0.689, 'vae_loss': 0.729, 'gen_loss': 0.695}, val={'g_nll': 0.476, 'g_kl': 0.015, 'g_elbo': 0.491, 'x_rna_nll': 0.422, 'x_rna_kl': 0.009, 'x_rna_elbo': 0.431, 'x_atac_nll': 0.274, 'x_atac_kl': 0.001, 'x_atac_elbo': 0.276, 'dsc_loss': 0.689, 'vae_loss': 0.726, 'gen_loss': 0.692}, 44.5s elapsed
Epoch 00056: reducing learning rate of group 0 to 2.0000e-06.
Epoch 00056: reducing learning rate of group 0 to 2.0000e-06.
[INFO] LRScheduler: Learning rate reduction: step 3
[INFO] SCGLUETrainer: [Epoch 60] train={'g_nll': 0.476, 'g_kl': 0.015, 'g_elbo': 0.491, 'x_rna_nll': 0.424, 'x_rna_kl': 0.009, 'x_rna_elbo': 0.433, 'x_atac_nll': 0.276, 'x_atac_kl': 0.001, 'x_atac_elbo': 0.278, 'dsc_loss': 0.689, 'vae_loss': 0.73, 'gen_loss': 0.696}, val={'g_nll': 0.476, 'g_kl': 0.015, 'g_elbo': 0.491, 'x_rna_nll': 0.424, 'x_rna_kl': 0.009, 'x_rna_elbo': 0.433, 'x_atac_nll': 0.278, 'x_atac_kl': 0.001, 'x_atac_elbo': 0.28, 'dsc_loss': 0.688, 'vae_loss': 0.733, 'gen_loss': 0.698}, 47.7s elapsed
2022-06-25 12:44:53,779 ignite.handlers.early_stopping.EarlyStopping INFO: EarlyStopping: Stop training
[INFO] EarlyStopping: Restoring checkpoint "56"...
[INFO] fit_SCGLUE: Estimating balancing weight...
[INFO] estimate_balancing_weight: Clustering cells...
[INFO] estimate_balancing_weight: Matching clusters...
[INFO] estimate_balancing_weight: Matching array shape = (17, 16)...
[INFO] estimate_balancing_weight: Estimating balancing weight...
[INFO] fit_SCGLUE: Fine-tuning SCGLUE model...
[WARNING] SCGLUEModel: It is recommended that `use_rep` dimensionality be equal or larger than `latent_dim`.
[INFO] SCGLUEModel: Setting `graph_batch_size` = 12615
[INFO] SCGLUEModel: Setting `align_burnin` = 30
[INFO] SCGLUEModel: Setting `max_epochs` = 178
[INFO] SCGLUEModel: Setting `patience` = 15
[INFO] SCGLUEModel: Setting `reduce_lr_patience` = 8
[INFO] SCGLUETrainer: Using training directory: "glue/fine-tune"
[INFO] SCGLUETrainer: [Epoch 10] train={'g_nll': 0.472, 'g_kl': 0.015, 'g_elbo': 0.486, 'x_rna_nll': 0.424, 'x_rna_kl': 0.009, 'x_rna_elbo': 0.432, 'x_atac_nll': 0.277, 'x_atac_kl': 0.001, 'x_atac_elbo': 0.279, 'dsc_loss': 0.691, 'vae_loss': 0.73, 'gen_loss': 0.696}, val={'g_nll': 0.472, 'g_kl': 0.015, 'g_elbo': 0.487, 'x_rna_nll': 0.43, 'x_rna_kl': 0.009, 'x_rna_elbo': 0.439, 'x_atac_nll': 0.275, 'x_atac_kl': 0.001, 'x_atac_elbo': 0.277, 'dsc_loss': 0.691, 'vae_loss': 0.735, 'gen_loss': 0.701}, 4.4s elapsed
[INFO] SCGLUETrainer: [Epoch 20] train={'g_nll': 0.469, 'g_kl': 0.014, 'g_elbo': 0.483, 'x_rna_nll': 0.424, 'x_rna_kl': 0.009, 'x_rna_elbo': 0.432, 'x_atac_nll': 0.276, 'x_atac_kl': 0.001, 'x_atac_elbo': 0.278, 'dsc_loss': 0.689, 'vae_loss': 0.729, 'gen_loss': 0.695}, val={'g_nll': 0.469, 'g_kl': 0.014, 'g_elbo': 0.483, 'x_rna_nll': 0.426, 'x_rna_kl': 0.008, 'x_rna_elbo': 0.434, 'x_atac_nll': 0.276, 'x_atac_kl': 0.001, 'x_atac_elbo': 0.277, 'dsc_loss': 0.695, 'vae_loss': 0.731, 'gen_loss': 0.697}, 5.3s elapsed
[INFO] SCGLUETrainer: [Epoch 30] train={'g_nll': 0.466, 'g_kl': 0.014, 'g_elbo': 0.48, 'x_rna_nll': 0.425, 'x_rna_kl': 0.008, 'x_rna_elbo': 0.433, 'x_atac_nll': 0.276, 'x_atac_kl': 0.001, 'x_atac_elbo': 0.278, 'dsc_loss': 0.687, 'vae_loss': 0.73, 'gen_loss': 0.696}, val={'g_nll': 0.466, 'g_kl': 0.014, 'g_elbo': 0.48, 'x_rna_nll': 0.427, 'x_rna_kl': 0.008, 'x_rna_elbo': 0.436, 'x_atac_nll': 0.273, 'x_atac_kl': 0.001, 'x_atac_elbo': 0.274, 'dsc_loss': 0.695, 'vae_loss': 0.729, 'gen_loss': 0.694}, 4.5s elapsed
[INFO] SCGLUETrainer: [Epoch 40] train={'g_nll': 0.464, 'g_kl': 0.014, 'g_elbo': 0.478, 'x_rna_nll': 0.423, 'x_rna_kl': 0.008, 'x_rna_elbo': 0.431, 'x_atac_nll': 0.277, 'x_atac_kl': 0.001, 'x_atac_elbo': 0.278, 'dsc_loss': 0.688, 'vae_loss': 0.728, 'gen_loss': 0.694}, val={'g_nll': 0.465, 'g_kl': 0.014, 'g_elbo': 0.478, 'x_rna_nll': 0.425, 'x_rna_kl': 0.008, 'x_rna_elbo': 0.433, 'x_atac_nll': 0.275, 'x_atac_kl': 0.001, 'x_atac_elbo': 0.276, 'dsc_loss': 0.695, 'vae_loss': 0.729, 'gen_loss': 0.694}, 4.2s elapsed
Epoch 00042: reducing learning rate of group 0 to 2.0000e-04.
Epoch 00042: reducing learning rate of group 0 to 2.0000e-04.
[INFO] LRScheduler: Learning rate reduction: step 1
[INFO] SCGLUETrainer: [Epoch 50] train={'g_nll': 0.463, 'g_kl': 0.013, 'g_elbo': 0.476, 'x_rna_nll': 0.423, 'x_rna_kl': 0.008, 'x_rna_elbo': 0.431, 'x_atac_nll': 0.276, 'x_atac_kl': 0.001, 'x_atac_elbo': 0.277, 'dsc_loss': 0.69, 'vae_loss': 0.727, 'gen_loss': 0.693}, val={'g_nll': 0.464, 'g_kl': 0.013, 'g_elbo': 0.477, 'x_rna_nll': 0.422, 'x_rna_kl': 0.008, 'x_rna_elbo': 0.43, 'x_atac_nll': 0.274, 'x_atac_kl': 0.001, 'x_atac_elbo': 0.276, 'dsc_loss': 0.685, 'vae_loss': 0.724, 'gen_loss': 0.69}, 6.4s elapsed
Epoch 00056: reducing learning rate of group 0 to 2.0000e-05.
Epoch 00056: reducing learning rate of group 0 to 2.0000e-05.
[INFO] LRScheduler: Learning rate reduction: step 2
[INFO] SCGLUETrainer: [Epoch 60] train={'g_nll': 0.463, 'g_kl': 0.013, 'g_elbo': 0.476, 'x_rna_nll': 0.423, 'x_rna_kl': 0.008, 'x_rna_elbo': 0.431, 'x_atac_nll': 0.276, 'x_atac_kl': 0.001, 'x_atac_elbo': 0.278, 'dsc_loss': 0.689, 'vae_loss': 0.728, 'gen_loss': 0.694}, val={'g_nll': 0.463, 'g_kl': 0.013, 'g_elbo': 0.477, 'x_rna_nll': 0.424, 'x_rna_kl': 0.008, 'x_rna_elbo': 0.432, 'x_atac_nll': 0.278, 'x_atac_kl': 0.001, 'x_atac_elbo': 0.28, 'dsc_loss': 0.696, 'vae_loss': 0.731, 'gen_loss': 0.696}, 4.5s elapsed
2022-06-25 12:55:43,212 ignite.handlers.early_stopping.EarlyStopping INFO: EarlyStopping: Stop training
[INFO] EarlyStopping: Restoring checkpoint "56"...
[18]:
# Project the cells of each modality into the shared GLUE embedding space.
rna.obsm["X_glue"] = glue.encode_data("rna", rna)
atac.obsm["X_glue"] = glue.encode_data("atac", atac)
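Optionally, the trained model can be saved to disk so the integration does not have to be recomputed. A minimal sketch; the file name glue.dill is arbitrary, and save/load_model are assumed to behave as described in the scglue documentation:

# Persist the trained GLUE model and reload it later if needed.
glue.save("glue.dill")
# glue = scglue.models.load_model("glue.dill")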
[19]:
# Concatenate the two modalities and compute a joint UMAP on the GLUE embedding.
combined = ad.concat([rna, atac])
sc.pp.neighbors(combined, use_rep="X_glue", metric="cosine")
sc.tl.umap(combined)
/home/kaizhang/.local/lib/python3.8/site-packages/anndata/_core/anndata.py:1828: UserWarning:

Observation names are not unique. To make them unique, call `.obs_names_make_unique`.

[20]:
sc.pl.umap(combined, color=["cell_type", "domain"], wspace=0.45)
../_images/tutorials_modality_28_0.png
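Finally, the jointly embedded data can be written to disk for downstream analyses. A minimal sketch using the standard AnnData API; the output file name is arbitrary:

# Make observation names unique across the two modalities (see the warning
# above) before writing the combined object with its "X_glue" embedding.
combined.obs_names_make_unique()
combined.write("combined.h5ad")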