Identify differentially accessible regions#
This tutorial assumes you have read the Standard Pipeline tutorial and have finished preprocessing, clustering and cell annotations.
[1]:
import snapatac2 as snap
import numpy as np
import polars as pl
snap.__version__
[1]:
'2.3.2dev1'
First let’s load the previously processed h5ad file.
[2]:
data = snap.read(snap.datasets.pbmc5k(type="annotated_h5ad"))
data
Updating file 'atac_pbmc_5k_annotated.h5ad' from 'https://osf.io/download/rvwhf/' to '/home/kaizhang/.cache/snapatac2'.
[2]:
AnnData object with n_obs x n_vars = 4436 x 6062095 backed at '/home/kaizhang/.cache/snapatac2/atac_pbmc_5k_annotated.h5ad'
obs: 'tsse', 'n_fragment', 'frac_dup', 'frac_mito', 'doublet_probability', 'doublet_score', 'leiden', 'cell_type'
var: 'count', 'selected'
uns: 'cell_type_colors', 'scrublet_sim_doublet_score', 'spectral_eigenvalue', 'leiden_colors', 'reference_sequences'
obsm: 'X_spectral', 'X_umap', 'insertion'
obsp: 'distances'
[3]:
snap.pl.umap(data, color='cell_type', interactive=False)
[3]:

Peak calling at the cluster-level#
An important goal of single-cell ATAC-seq analysis is to identify genomic regions that are enriched with TN5 insertions, or “open chromatin” regions. Using snap.tl.call_peaks
, we can easily identify open chromatin regions in different cell populations stratified by provided group information.
[4]:
%%time
snap.tl.call_peaks(data, groupby='cell_type')
2023-09-11 19:57:35 - INFO - Exporting data...
2023-09-11 20:01:43 - INFO - Calling peaks for 11 groups ...
2023-09-11 20:04:39 - INFO - Merging peaks...
CPU times: user 4min 22s, sys: 7.76 s, total: 4min 30s
Wall time: 7min 7s
snap.tl.call_peaks
first calls peaks for individual groups and then merges overlapping peaks to create a list of fix-width non-overlapping peaks. The intermediate results are discarded by default but can be saved using the out_dir
parameter.
Results of peak calling are stored in data.uns['peaks']
as a dataframe. Rows are merged peaks, and columns are cell clusters. Boolean values in the dataframe indicate whether a peak is present in a given cell cluster.
[5]:
data.uns['peaks'].head()
[5]:
Peaks | Memory B | DC | Naive/Intermediate B | CD4 TCM | NK | CD16 Mono | CD4 Naive | CD14 Mono | CD8 Naive | CD8 TEM | MAIT |
---|---|---|---|---|---|---|---|---|---|---|---|
str | bool | bool | bool | bool | bool | bool | bool | bool | bool | bool | bool |
"chr1:180653-18… | false | false | false | false | true | false | false | true | false | false | false |
"chr1:191594-19… | false | false | false | false | true | false | false | true | false | true | true |
"chr1:267749-26… | false | true | false | true | false | false | false | true | false | false | false |
"chr1:280500-28… | false | false | false | false | false | false | false | true | false | false | false |
"chr1:585944-58… | false | false | true | false | false | false | false | false | false | false | false |
Now, with the peak list, we can create a cell by peak matrix by snap.pp.make_peak_matrix
.
[6]:
%%time
peak_mat = snap.pp.make_peak_matrix(data)
peak_mat
/projects/ps-renlab2/kai/software/micromamba/lib/python3.9/site-packages/anndata/_core/anndata.py:117: ImplicitModificationWarning:
Transforming to str index.
CPU times: user 59.3 s, sys: 1.48 s, total: 1min
Wall time: 13.8 s
[6]:
AnnData object with n_obs × n_vars = 4436 × 250241
obs: 'tsse', 'n_fragment', 'frac_dup', 'frac_mito', 'doublet_probability', 'doublet_score', 'leiden', 'cell_type'
Finding marker regions#
In this section, we are going to use a quick-and-dirty method to identify marker regions for each cell type. The tl.marker_regions
function aggregates signal across cells and utilizes z-scores to identify specifically enriched peaks.
[7]:
%%time
marker_peaks = snap.tl.marker_regions(peak_mat, groupby='cell_type', pvalue=0.01)
CPU times: user 3.92 s, sys: 142 ms, total: 4.07 s
Wall time: 4.07 s
[8]:
snap.pl.regions(peak_mat, groupby='cell_type', peaks=marker_peaks, interactive=False)
[8]:

[9]:
%%time
motifs = snap.tl.motif_enrichment(
motifs=snap.datasets.cis_bp(unique=True),
regions=marker_peaks,
genome_fasta=snap.genome.hg38,
)
2023-09-11 20:05:07 - INFO - Fetching 58945 sequences ...
2023-09-11 20:05:32 - INFO - Computing enrichment ...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1165/1165 [15:45<00:00, 1.23it/s]
CPU times: user 1h 58min 50s, sys: 9.62 s, total: 1h 59min
Wall time: 16min 52s
[10]:
snap.pl.motif_enrichment(motifs, max_fdr=0.0001, height=1600, interactive=False)
[10]:

Regression-based differential test#
The tl.marker_regions
above does not consider the variations across cells. To fully utilize the single-cell information, we can apply regression-based differential test method. First, Let’s select the peaks that are either present in naive B cells or memory B cells.
[11]:
group1 = "Naive/Intermediate B"
group2 = "Memory B"
naive_B = data.obs['cell_type'] == group1
memory_B = data.obs['cell_type'] == group2
peaks_selected = np.logical_or(
data.uns["peaks"][group1].to_numpy(),
data.uns["peaks"][group2].to_numpy(),
)
Perform differential test using tl.diff_test
.
[12]:
%%time
diff_peaks = snap.tl.diff_test(
peak_mat,
cell_group1=naive_B,
cell_group2=memory_B,
features=peaks_selected,
)
2023-09-11 20:22:02 - INFO - Input contains 107629 features, now perform filtering with 'min_log_fc = 0.25' and 'min_pct = 0.05' ...
2023-09-11 20:22:14 - INFO - Testing 33352 features ...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33352/33352 [04:53<00:00, 113.58it/s]
CPU times: user 5min 4s, sys: 1.99 s, total: 5min 6s
Wall time: 5min 6s
Filter the results based on adjusted p-value or FDR.
[13]:
diff_peaks = diff_peaks.filter(pl.col('adjusted p-value') < 0.01)
diff_peaks.head()
[13]:
feature name | log2(fold_change) | p-value | adjusted p-value |
---|---|---|---|
str | f64 | f64 | f64 |
"chr17:3558489-… | 6.417912 | 1.0078e-30 | 3.3612e-26 |
"chr5:156402576… | 4.204133 | 4.0689e-23 | 6.7853e-19 |
"chr6:52497600-… | 5.489535 | 8.6097e-23 | 9.5717e-19 |
"chr4:155758703… | -3.053255 | 3.1618e-21 | 2.6363e-17 |
"chr1:15949531-… | 4.236554 | 4.2257e-20 | 2.8187e-16 |
[14]:
snap.pl.regions(
peak_mat,
groupby = 'cell_type',
peaks = {
group1: diff_peaks.filter(pl.col("log2(fold_change)") > 0)['feature name'].to_numpy(),
group2: diff_peaks.filter(pl.col("log2(fold_change)") < 0)['feature name'].to_numpy(),
},
interactive = False,
)
[14]:

In the next example we are going to identify peaks that are accessible in memory B cells but not in the rest of cells. One way of doing this is to select random cells from each cluster to form the background, and then perform test between memory B cells and the background.
Here we randomly select 50 cells from other cell clusters, and set the direction = "positive"
because we are interested in peaks that are more accessible in memory B cells.
[15]:
barcodes = np.array(data.obs_names)
background = []
for i in np.unique(data.obs['leiden']):
if i != group2:
cells = np.random.choice(
barcodes[data.obs['leiden'] == i],
size = 50,
replace = False,
)
background.append(cells)
background = np.concatenate(background)
[16]:
diff_peaks = snap.tl.diff_test(
peak_mat,
cell_group1 = memory_B,
cell_group2 = background,
features = data.uns["peaks"][group2].to_numpy(),
direction = "positive",
)
2023-09-11 20:27:13 - INFO - Input contains 83417 features, now perform filtering with 'min_log_fc = 0.25' and 'min_pct = 0.05' ...
2023-09-11 20:27:22 - INFO - Testing 18839 features ...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18839/18839 [03:22<00:00, 92.91it/s]
[17]:
diff_peaks = diff_peaks.filter(pl.col('adjusted p-value') < 0.01)
diff_peaks.head()
[17]:
feature name | log2(fold_change) | p-value | adjusted p-value |
---|---|---|---|
str | f64 | f64 | f64 |
"chr16:17301595… | 3.051179 | 2.0011e-53 | 3.7698e-49 |
"chr14:10581787… | 3.229134 | 1.2291e-52 | 1.1578e-48 |
"chr11:95715788… | 3.382947 | 2.2691e-51 | 1.4249e-47 |
"chr9:37409048-… | 2.603901 | 8.1013e-49 | 3.8155e-45 |
"chr16:88045815… | 2.521774 | 3.2865e-48 | 1.2383e-44 |
[18]:
snap.pl.regions(
peak_mat,
groupby = 'cell_type',
peaks = {
group2: diff_peaks['feature name'].to_numpy(),
},
interactive = False,
)
[18]:
