Multi-sample Pipeline: analyzing snATAC-seq data of human colon samples#
Introduction#
In this tutorial, we will perform integrative analysis of snATAC-seq data of colon sample from multiple donors.
[1]:
import snapatac2 as snap
import numpy as np
snap.__version__
[1]:
'2.3.0.dev4'
Download the example dataset.
[2]:
files = snap.datasets.colon()
files
[2]:
[('colon_transverse_SM-A9VP4',
PosixPath('/home/kaizhang/.cache/snapatac2/colon_transverse.tar.untar/colon_transverse_SM-A9VP4_rep1_fragments.bed.gz')),
('colon_transverse_SM-BZ2ZS',
PosixPath('/home/kaizhang/.cache/snapatac2/colon_transverse.tar.untar/colon_transverse_SM-BZ2ZS_rep1_fragments.bed.gz')),
('colon_transverse_SM-A9HOW',
PosixPath('/home/kaizhang/.cache/snapatac2/colon_transverse.tar.untar/colon_transverse_SM-A9HOW_rep1_fragments.bed.gz')),
('colon_transverse_SM-CSSDA',
PosixPath('/home/kaizhang/.cache/snapatac2/colon_transverse.tar.untar/colon_transverse_SM-CSSDA_rep1_fragments.bed.gz')),
('colon_transverse_SM-ACCQ1',
PosixPath('/home/kaizhang/.cache/snapatac2/colon_transverse.tar.untar/colon_transverse_SM-ACCQ1_rep1_fragments.bed.gz'))]
First we use pp.import_data
to import the fragment files. Here we use the thresholds used in our previous paper to filter the cells.
[3]:
%%time
adatas = []
for name, fl in files:
data = snap.pp.import_data(
fl,
genome=snap.genome.hg38,
min_tsse=7,
min_num_fragments=1000,
file=name+'.h5ad'
)
adatas.append(data)
CPU times: user 7min 46s, sys: 7.77 s, total: 7min 54s
Wall time: 2min 13s
We then follow the standard procedures to add tile matrices, select features, and identify doublets. Note these functions can take either a single AnnData object or a list of AnnData objects.
[4]:
%%time
snap.pp.add_tile_matrix(adatas, bin_size=5000)
snap.pp.select_features(adatas, n_features=50000)
snap.pp.scrublet(adatas)
snap.pp.filter_doublets(adatas)
2023-04-13 19:51:54 - INFO - Selected 50000 features.
2023-04-13 19:51:54 - INFO - Selected 50000 features.
2023-04-13 19:51:54 - INFO - Selected 50000 features.
2023-04-13 19:51:54 - INFO - Selected 50000 features.
2023-04-13 19:51:55 - INFO - Selected 50000 features.
2023-04-13 19:53:08 - INFO - Detected doublet rate = 5.479%
2023-04-13 19:53:08 - INFO - Detected doublet rate = 4.786%
2023-04-13 19:53:08 - INFO - Detected doublet rate = 4.499%
2023-04-13 19:53:08 - INFO - Detected doublet rate = 3.494%
2023-04-13 19:53:08 - INFO - Detected doublet rate = 7.767%
CPU times: user 419 ms, sys: 1.67 s, total: 2.08 s
Wall time: 1min 41s
Creating AnnDataSet object#
We then create an AnnDataSet
object which contains links to individual data. The data are not loaded into memory so it can scale to very large dataset.
[5]:
%%time
data = snap.AnnDataSet(
adatas=[(name, adata) for (name, _), adata in zip(files, adatas)],
filename="colon.h5ads"
)
data
CPU times: user 340 ms, sys: 45 ms, total: 385 ms
Wall time: 473 ms
[5]:
AnnDataSet object with n_obs x n_vars = 39589 x 606219 backed at 'colon.h5ads'
contains 5 AnnData objects with keys: 'colon_transverse_SM-A9VP4', 'colon_transverse_SM-BZ2ZS', 'colon_transverse_SM-A9HOW', 'colon_transverse_SM-CSSDA', 'colon_transverse_SM-ACCQ1'
obs: 'sample'
uns: 'AnnDataSet', 'reference_sequences'
When merging multiple h5ad files, SnapATAC2 automatically adds a column to .obs['sample']
, indicating the origin of the cells. After merging, .obs_names
are no longer garanteed to be unique, as some barcodes may be shared across experiments. So it is advised to regenerate unique IDs by concatenating .obs_names
and sample IDs.
[6]:
data.obs_names = np.array(data.obs['sample']) + ":" + np.array(data.obs_names)
[7]:
snap.pp.select_features(data, n_features=50000)
2023-04-13 19:53:23 - INFO - Selected 50000 features.
[8]:
%%time
snap.tl.spectral(data)
CPU times: user 9min 17s, sys: 29.7 s, total: 9min 46s
Wall time: 33.9 s
We next perform UMAP embedding and visualize the result.
[9]:
%%time
snap.tl.umap(data)
CPU times: user 3min 40s, sys: 34.3 s, total: 4min 14s
Wall time: 2min 21s
[10]:
snap.pl.umap(data, color="sample", interactive=False)
[10]:
Batch correction#
From the UMAP plot above we can clearly see some donor/individual specific effects. Although these donor differences are interesting to study on their own, it obscures the clustering procedure for identifying shared cell states across individuals.
Here we apply two different approaches, Harmony and modified MNNCorrect, to remove donor specific differences.
[11]:
%%time
snap.pp.mnc_correct(data, batch="sample")
snap.pp.harmony(data, batch="sample", max_iter_harmony=20)
2023-04-13 19:56:48,174 - harmonypy - INFO - Iteration 1 of 20
2023-04-13 19:56:48 - INFO - Iteration 1 of 20
2023-04-13 19:57:05,421 - harmonypy - INFO - Iteration 2 of 20
2023-04-13 19:57:05 - INFO - Iteration 2 of 20
2023-04-13 19:57:16,576 - harmonypy - INFO - Iteration 3 of 20
2023-04-13 19:57:16 - INFO - Iteration 3 of 20
2023-04-13 19:57:21,863 - harmonypy - INFO - Converged after 3 iterations
2023-04-13 19:57:21 - INFO - Converged after 3 iterations
CPU times: user 17min 51s, sys: 1min 5s, total: 18min 56s
Wall time: 57.3 s
Visualizing the result of MNNCorrect.
[12]:
snap.tl.umap(data, use_rep="X_spectral_mnn")
snap.pl.umap(data, color="sample", interactive=False)
[12]:
Visualize the result of Harmony.
[13]:
snap.tl.umap(data, use_rep="X_spectral_harmony")
snap.pl.umap(data, color="sample", interactive=False)
[13]:
Both methods have effectively eliminated the unwanted effects, as can be seen. While we do not offer a recommendation for one method over the other, it’s worth noting that Harmony’s algorithm sometimes may significantly alter the underlying topology and generate artificial clusters.
Clustering#
[14]:
snap.pp.knn(data, use_rep="X_spectral_harmony")
[15]:
snap.tl.leiden(data)
[16]:
snap.pl.umap(data, color="leiden", interactive=False)
[16]:
AnnDataSet object IO#
Just like the AnnData object, AnnDataSet object is synchronized with the content of the HDF5 file. Therefore, there is no need to manually save the result. After the analysis is finished, simply close the file by:
[17]:
data.close()
data
[17]:
Closed AnnDataSet object
The AnnDataSet object is stored as a standard h5ad file, in which the links to individual anndata file was saved in uns['AnnDataSet']
and can be reopened by snap.read_dataset
.
[18]:
data = snap.read_dataset("colon.h5ads")
data
[18]:
AnnDataSet object with n_obs x n_vars = 39589 x 606219 backed at 'colon.h5ads'
contains 5 AnnData objects with keys: 'colon_transverse_SM-A9VP4', 'colon_transverse_SM-BZ2ZS', 'colon_transverse_SM-A9HOW', 'colon_transverse_SM-CSSDA', 'colon_transverse_SM-ACCQ1'
obs: 'sample', 'leiden'
var: 'count', 'selected'
uns: 'reference_sequences', 'spectral_eigenvalue', 'AnnDataSet'
obsm: 'X_umap', 'X_spectral_mnn', 'X_spectral_harmony', 'X_spectral'
obsp: 'distances'
Because the AnnDataSet object does not copy the underlying AnnData objects, if you move the component h5ad files then it won’t be able to find them. In this case, you can supply the new locations using the update_data_locations
parameter:
[19]:
data.close()
[20]:
data = snap.read_dataset(
"colon.h5ads",
update_data_locations = {"colon_transverse_SM-CSSDA": "colon_transverse_SM-CSSDA.h5ad"},
)
data
[20]:
AnnDataSet object with n_obs x n_vars = 39589 x 606219 backed at 'colon.h5ads'
contains 5 AnnData objects with keys: 'colon_transverse_SM-A9VP4', 'colon_transverse_SM-BZ2ZS', 'colon_transverse_SM-A9HOW', 'colon_transverse_SM-CSSDA', 'colon_transverse_SM-ACCQ1'
obs: 'sample', 'leiden'
var: 'count', 'selected'
uns: 'AnnDataSet', 'spectral_eigenvalue', 'reference_sequences'
obsm: 'X_spectral', 'X_spectral_harmony', 'X_umap', 'X_spectral_mnn'
obsp: 'distances'
[21]:
data.close()