Identify differentially accessible regions#

This tutorial assumes you have read the Standard Pipeline tutorial and have finished preprocessing, clustering and cell annotations.

[1]:

import snapatac2 as snap
import numpy as np
import polars as pl

snap.__version__

[1]:

'2.3.0.dev4'

First let’s load the previously processed h5ad file.

[2]:

data = snap.read(snap.datasets.pbmc5k(type="annotated_h5ad"))
data

Updating file 'atac_pbmc_5k_annotated.h5ad' from 'https://osf.io/download/rvwhf/' to '/home/kaizhang/.cache/snapatac2'.

[2]:

AnnData object with n_obs x n_vars = 4436 x 6062095 backed at '/home/kaizhang/.cache/snapatac2/atac_pbmc_5k_annotated.h5ad'
    obs: 'tsse', 'n_fragment', 'frac_dup', 'frac_mito', 'doublet_probability', 'doublet_score', 'leiden', 'cell_type'
    var: 'count', 'selected'
    uns: 'spectral_eigenvalue', 'cell_type_colors', 'reference_sequences', 'leiden_colors', 'scrublet_sim_doublet_score'
    obsm: 'X_spectral', 'X_umap', 'insertion'
    obsp: 'distances'

[3]:

snap.pl.umap(data, color='cell_type', width=700, interactive=False)

[3]:

Peak calling at the cluster-level#

An important goal of single-cell ATAC-seq analysis is to identify genomic regions that are enriched with TN5 insertions, or “open chromatin” regions. Using snap.tl.call_peaks, we can easily identify open chromatin regions in different cell populations stratified by provided group information.

[4]:

%%time
snap.tl.call_peaks(data, groupby='cell_type')

2023-04-13 14:49:04 - INFO - Exporting data...
2023-04-13 14:54:33 - INFO - Calling peaks for 11 groups ...
2023-04-13 15:09:47 - INFO - Merging peaks...

CPU times: user 5min 50s, sys: 11.6 s, total: 6min 2s
Wall time: 20min 45s

snap.tl.call_peaks first calls peaks for individual groups and then merges overlapping peaks to create a list of fix-width non-overlapping peaks. The intermediate results are discarded by default but can be saved using the out_dir parameter.

Results of peak calling are stored in data.uns['peaks'] as a dataframe. Rows are merged peaks, and columns are cell clusters. Boolean values in the dataframe indicate whether a peak is present in a given cell cluster.

[5]:

data.uns['peaks'].head()

[5]:

shape: (5, 12)

Peaks	CD4 Naive	CD8 Naive	Naive/Intermediate B	CD8 TEM	CD4 TCM	CD14 Mono	MAIT	Memory B	CD16 Mono	NK	DC
str	bool	bool	bool	bool	bool	bool	bool	bool	bool	bool	bool
"chr1:180653-18...	false	false	false	false	false	true	false	false	false	true	false
"chr1:191594-19...	false	false	false	true	false	true	true	false	false	true	false
"chr1:267749-26...	false	false	false	false	true	true	false	false	false	false	true
"chr1:280500-28...	false	false	false	false	false	true	false	false	false	false	false
"chr1:585944-58...	false	false	true	false	false	false	false	false	false	false	false

Now, with the peak list, we can create a cell by peak matrix by snap.pp.make_peak_matrix.

[6]:

%%time
peak_mat = snap.pp.make_peak_matrix(data, file="peak_matrix.h5ad")
peak_mat

CPU times: user 1min 22s, sys: 1.43 s, total: 1min 24s
Wall time: 12.7 s

[6]:

AnnData object with n_obs x n_vars = 4436 x 250241 backed at 'peak_matrix.h5ad'
    obs: 'tsse', 'n_fragment', 'frac_dup', 'frac_mito', 'doublet_probability', 'doublet_score', 'leiden', 'cell_type'

Finding marker regions#

In this section, we are going to use a quick-and-dirty method to identify marker regions for each cell type. The tl.marker_regions function aggregates signal across cells and utilizes z-scores to identify specifically enriched peaks.

[7]:

%%time
marker_peaks = snap.tl.marker_regions(peak_mat, groupby='cell_type', pvalue=0.01)

/projects/ps-renlab2/kai/dev/snapatac2/snapatac2-python/snapatac2/tools/_misc.py:82: FutureWarning:

X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.

CPU times: user 6.56 s, sys: 499 ms, total: 7.06 s
Wall time: 7.05 s

[8]:

snap.pl.regions(peak_mat, groupby='cell_type', peaks=marker_peaks, interactive=False)

[8]:

[9]:

%%time
motifs = snap.tl.motif_enrichment(
    motifs=snap.datasets.cis_bp(unique=True),
    regions=marker_peaks,
    genome_fasta=snap.genome.hg38,
)

2023-04-13 15:10:19 - INFO - Fetching 58945 sequences ...
2023-04-13 15:10:21 - INFO - Computing enrichment ...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1165/1165 [05:42<00:00,  3.40it/s]

CPU times: user 2h 3min 58s, sys: 17.1 s, total: 2h 4min 15s
Wall time: 6min 24s

[10]:

snap.pl.motif_enrichment(motifs, max_fdr=0.0001, height=1600, interactive=False)

[10]:

Regression-based differential test#

The tl.marker_regions above does not consider the variations across cells. To fully utilize the single-cell information, we can apply regression-based differential test method. First, Let’s select the peaks that are either present in naive B cells or memory B cells.

[11]:

group1 = "Naive/Intermediate B"
group2 = "Memory B"
naive_B = data.obs['cell_type'] == group1
memory_B = data.obs['cell_type'] == group2
peaks_selected = np.logical_or(
    data.uns["peaks"][group1].to_numpy(),
    data.uns["peaks"][group2].to_numpy(),
)

Perform differential test using tl.diff_test.

[12]:

%%time
diff_peaks = snap.tl.diff_test(
    peak_mat,
    cell_group1=naive_B,
    cell_group2=memory_B,
    features=peaks_selected,
)

2023-04-13 15:16:48 - INFO - Input contains 107629 features, now perform filtering with 'min_log_fc = 0.25' and 'min_pct = 0.05' ...
2023-04-13 15:17:02 - INFO - Testing 33352 features ...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33352/33352 [05:01<00:00, 110.59it/s]

CPU times: user 5min 16s, sys: 2.17 s, total: 5min 18s
Wall time: 5min 18s

Filter the results based on adjusted p-value or FDR.

[13]:

diff_peaks = diff_peaks.filter(pl.col('adjusted p-value') < 0.01)
diff_peaks.head()

[13]:

shape: (5, 4)

feature name	log2(fold_change)	p-value	adjusted p-value
str	f64	f64	f64
"chr17:3558489-...	6.417912	1.0078e-30	3.3612e-26
"chr5:156402576...	4.204133	4.0689e-23	6.7853e-19
"chr6:52497600-...	5.489535	8.6097e-23	9.5717e-19
"chr4:155758703...	-3.053255	3.1618e-21	2.6363e-17
"chr1:15949531-...	4.236554	4.2257e-20	2.8187e-16

[14]:

snap.pl.regions(
    peak_mat,
    groupby = 'cell_type',
    peaks = {
        group1: diff_peaks.filter(pl.col("log2(fold_change)") > 0)['feature name'].to_numpy(),
        group2: diff_peaks.filter(pl.col("log2(fold_change)") < 0)['feature name'].to_numpy(),
    },
    interactive = False,
)

/projects/ps-renlab2/kai/dev/snapatac2/snapatac2-python/snapatac2/tools/_misc.py:82: FutureWarning:

X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.

[14]:

In the next example we are going to identify peaks that are accessible in memory B cells but not in the rest of cells. One way of doing this is to select random cells from each cluster to form the background, and then perform test between memory B cells and the background.

Here we randomly select 50 cells from other cell clusters, and set the direction = "positive" because we are interested in peaks that are more accessible in memory B cells.

[15]:

barcodes = np.array(data.obs_names)
background = []
for i in np.unique(data.obs['leiden']):
    if i != group2:
        cells = np.random.choice(
            barcodes[data.obs['leiden'] == i],
            size = 50,
            replace = False,
        )
        background.append(cells)
background = np.concatenate(background)

[16]:

diff_peaks = snap.tl.diff_test(
    peak_mat,
    cell_group1 = memory_B,
    cell_group2 = background,
    features = data.uns["peaks"][group2].to_numpy(),
    direction = "positive",
)

2023-04-13 15:22:14 - INFO - Input contains 83417 features, now perform filtering with 'min_log_fc = 0.25' and 'min_pct = 0.05' ...
2023-04-13 15:22:25 - INFO - Testing 18700 features ...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18700/18700 [03:11<00:00, 97.69it/s]

[17]:

diff_peaks = diff_peaks.filter(pl.col('adjusted p-value') < 0.01)
diff_peaks.head()

[17]:

shape: (5, 4)

feature name	log2(fold_change)	p-value	adjusted p-value
str	f64	f64	f64
"chr14:10581787...	3.274067	2.5671e-57	4.8004e-53
"chr11:95715788...	3.520904	7.1180e-55	6.6553e-51
"chr9:37409048-...	2.972516	3.6258e-51	2.2601e-47
"chr22:41936117...	3.113602	1.7278e-49	8.0775e-46
"chr16:88045815...	2.417225	5.1147e-49	1.7541e-45

[18]:

snap.pl.regions(
    peak_mat,
    groupby = 'cell_type',
    peaks = {
        group2: diff_peaks['feature name'].to_numpy(),
    },
    interactive = False,
)

/projects/ps-renlab2/kai/dev/snapatac2/snapatac2-python/snapatac2/tools/_misc.py:82: FutureWarning:

X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.

[18]:

To Be Continued …#

[ ]: