Identify differentially accessible regions#

This tutorial assumes you have read the Standard Pipeline tutorial and have finished preprocessing, clustering and cell annotations.

[1]:

import snapatac2 as snap
import numpy as np
import polars as pl

snap.__version__

[1]:

'2.3.2dev1'

First let’s load the previously processed h5ad file.

[2]:

data = snap.read(snap.datasets.pbmc5k(type="annotated_h5ad"))
data

Updating file 'atac_pbmc_5k_annotated.h5ad' from 'https://osf.io/download/rvwhf/' to '/home/kaizhang/.cache/snapatac2'.

[2]:

AnnData object with n_obs x n_vars = 4436 x 6062095 backed at '/home/kaizhang/.cache/snapatac2/atac_pbmc_5k_annotated.h5ad'
    obs: 'tsse', 'n_fragment', 'frac_dup', 'frac_mito', 'doublet_probability', 'doublet_score', 'leiden', 'cell_type'
    var: 'count', 'selected'
    uns: 'cell_type_colors', 'scrublet_sim_doublet_score', 'spectral_eigenvalue', 'leiden_colors', 'reference_sequences'
    obsm: 'X_spectral', 'X_umap', 'insertion'
    obsp: 'distances'

[3]:

snap.pl.umap(data, color='cell_type', interactive=False)

[3]:

Peak calling at the cluster-level#

An important goal of single-cell ATAC-seq analysis is to identify genomic regions that are enriched with TN5 insertions, or “open chromatin” regions. Using snap.tl.call_peaks, we can easily identify open chromatin regions in different cell populations stratified by provided group information.

[4]:

%%time
snap.tl.call_peaks(data, groupby='cell_type')

2023-09-11 19:57:35 - INFO - Exporting data...
2023-09-11 20:01:43 - INFO - Calling peaks for 11 groups ...
2023-09-11 20:04:39 - INFO - Merging peaks...

CPU times: user 4min 22s, sys: 7.76 s, total: 4min 30s
Wall time: 7min 7s

snap.tl.call_peaks first calls peaks for individual groups and then merges overlapping peaks to create a list of fix-width non-overlapping peaks. The intermediate results are discarded by default but can be saved using the out_dir parameter.

Results of peak calling are stored in data.uns['peaks'] as a dataframe. Rows are merged peaks, and columns are cell clusters. Boolean values in the dataframe indicate whether a peak is present in a given cell cluster.

[5]:

data.uns['peaks'].head()

[5]:

shape: (5, 12)

Peaks	Memory B	DC	Naive/Intermediate B	CD4 TCM	NK	CD16 Mono	CD4 Naive	CD14 Mono	CD8 Naive	CD8 TEM	MAIT
str	bool	bool	bool	bool	bool	bool	bool	bool	bool	bool	bool
"chr1:180653-18…	false	false	false	false	true	false	false	true	false	false	false
"chr1:191594-19…	false	false	false	false	true	false	false	true	false	true	true
"chr1:267749-26…	false	true	false	true	false	false	false	true	false	false	false
"chr1:280500-28…	false	false	false	false	false	false	false	true	false	false	false
"chr1:585944-58…	false	false	true	false	false	false	false	false	false	false	false

Now, with the peak list, we can create a cell by peak matrix by snap.pp.make_peak_matrix.

[6]:

%%time
peak_mat = snap.pp.make_peak_matrix(data)
peak_mat

/projects/ps-renlab2/kai/software/micromamba/lib/python3.9/site-packages/anndata/_core/anndata.py:117: ImplicitModificationWarning:

Transforming to str index.

CPU times: user 59.3 s, sys: 1.48 s, total: 1min
Wall time: 13.8 s

[6]:

AnnData object with n_obs × n_vars = 4436 × 250241
    obs: 'tsse', 'n_fragment', 'frac_dup', 'frac_mito', 'doublet_probability', 'doublet_score', 'leiden', 'cell_type'

Finding marker regions#

In this section, we are going to use a quick-and-dirty method to identify marker regions for each cell type. The tl.marker_regions function aggregates signal across cells and utilizes z-scores to identify specifically enriched peaks.

[7]:

%%time
marker_peaks = snap.tl.marker_regions(peak_mat, groupby='cell_type', pvalue=0.01)

CPU times: user 3.92 s, sys: 142 ms, total: 4.07 s
Wall time: 4.07 s

[8]:

snap.pl.regions(peak_mat, groupby='cell_type', peaks=marker_peaks, interactive=False)

[8]:

[9]:

%%time
motifs = snap.tl.motif_enrichment(
    motifs=snap.datasets.cis_bp(unique=True),
    regions=marker_peaks,
    genome_fasta=snap.genome.hg38,
)

2023-09-11 20:05:07 - INFO - Fetching 58945 sequences ...
2023-09-11 20:05:32 - INFO - Computing enrichment ...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1165/1165 [15:45<00:00,  1.23it/s]

CPU times: user 1h 58min 50s, sys: 9.62 s, total: 1h 59min
Wall time: 16min 52s

[10]:

snap.pl.motif_enrichment(motifs, max_fdr=0.0001, height=1600, interactive=False)

[10]:

Regression-based differential test#

The tl.marker_regions above does not consider the variations across cells. To fully utilize the single-cell information, we can apply regression-based differential test method. First, Let’s select the peaks that are either present in naive B cells or memory B cells.

[11]:

group1 = "Naive/Intermediate B"
group2 = "Memory B"
naive_B = data.obs['cell_type'] == group1
memory_B = data.obs['cell_type'] == group2
peaks_selected = np.logical_or(
    data.uns["peaks"][group1].to_numpy(),
    data.uns["peaks"][group2].to_numpy(),
)

Perform differential test using tl.diff_test.

[12]:

%%time
diff_peaks = snap.tl.diff_test(
    peak_mat,
    cell_group1=naive_B,
    cell_group2=memory_B,
    features=peaks_selected,
)

2023-09-11 20:22:02 - INFO - Input contains 107629 features, now perform filtering with 'min_log_fc = 0.25' and 'min_pct = 0.05' ...
2023-09-11 20:22:14 - INFO - Testing 33352 features ...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33352/33352 [04:53<00:00, 113.58it/s]

CPU times: user 5min 4s, sys: 1.99 s, total: 5min 6s
Wall time: 5min 6s

Filter the results based on adjusted p-value or FDR.

[13]:

diff_peaks = diff_peaks.filter(pl.col('adjusted p-value') < 0.01)
diff_peaks.head()

[13]:

shape: (5, 4)

feature name	log2(fold_change)	p-value	adjusted p-value
str	f64	f64	f64
"chr17:3558489-…	6.417912	1.0078e-30	3.3612e-26
"chr5:156402576…	4.204133	4.0689e-23	6.7853e-19
"chr6:52497600-…	5.489535	8.6097e-23	9.5717e-19
"chr4:155758703…	-3.053255	3.1618e-21	2.6363e-17
"chr1:15949531-…	4.236554	4.2257e-20	2.8187e-16

[14]:

snap.pl.regions(
    peak_mat,
    groupby = 'cell_type',
    peaks = {
        group1: diff_peaks.filter(pl.col("log2(fold_change)") > 0)['feature name'].to_numpy(),
        group2: diff_peaks.filter(pl.col("log2(fold_change)") < 0)['feature name'].to_numpy(),
    },
    interactive = False,
)

[14]:

In the next example we are going to identify peaks that are accessible in memory B cells but not in the rest of cells. One way of doing this is to select random cells from each cluster to form the background, and then perform test between memory B cells and the background.

Here we randomly select 50 cells from other cell clusters, and set the direction = "positive" because we are interested in peaks that are more accessible in memory B cells.

[15]:

barcodes = np.array(data.obs_names)
background = []
for i in np.unique(data.obs['leiden']):
    if i != group2:
        cells = np.random.choice(
            barcodes[data.obs['leiden'] == i],
            size = 50,
            replace = False,
        )
        background.append(cells)
background = np.concatenate(background)

[16]:

diff_peaks = snap.tl.diff_test(
    peak_mat,
    cell_group1 = memory_B,
    cell_group2 = background,
    features = data.uns["peaks"][group2].to_numpy(),
    direction = "positive",
)

2023-09-11 20:27:13 - INFO - Input contains 83417 features, now perform filtering with 'min_log_fc = 0.25' and 'min_pct = 0.05' ...
2023-09-11 20:27:22 - INFO - Testing 18839 features ...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18839/18839 [03:22<00:00, 92.91it/s]

[17]:

diff_peaks = diff_peaks.filter(pl.col('adjusted p-value') < 0.01)
diff_peaks.head()

[17]:

shape: (5, 4)

feature name	log2(fold_change)	p-value	adjusted p-value
str	f64	f64	f64
"chr16:17301595…	3.051179	2.0011e-53	3.7698e-49
"chr14:10581787…	3.229134	1.2291e-52	1.1578e-48
"chr11:95715788…	3.382947	2.2691e-51	1.4249e-47
"chr9:37409048-…	2.603901	8.1013e-49	3.8155e-45
"chr16:88045815…	2.521774	3.2865e-48	1.2383e-44

[18]:

snap.pl.regions(
    peak_mat,
    groupby = 'cell_type',
    peaks = {
        group2: diff_peaks['feature name'].to_numpy(),
    },
    interactive = False,
)

[18]: