AnnData is both a data structure and an on-disk file specification that facilitates the sharing of labeled data matrices.
The Python anndata package supports both in-memory and on-disk representations of AnnData objects. For a detailed description of the AnnData format, please read anndata’s documentation.
Despite being an excellent package, anndata falls short in its support for the on-disk representation, or backed mode, of AnnData objects. When a file is opened in backed mode, the in-memory snapshot and the on-disk data are not kept in sync with each other, causing inconsistent and unexpected behaviors. For example, in backed mode anndata only supports updates to the X slot of the AnnData object, which means any changes to other slots like obs will not be written to disk. This makes the backed mode very cumbersome to use and often leads to unexpected outcomes. Also, because it still reads all components except X into memory, it uses a lot of memory for large datasets.
To address these limitations, SnapATAC2 implements its own out-of-core AnnData object with the following key features:
AnnData is fully backed by the underlying hdf5 file. Any operations on the AnnData object will be reflected on the hdf5 file.
All elements are lazily loaded. No matter how large the file is, opening it consumes almost no memory. Matrix data can be accessed and processed in chunks, which keeps memory usage to a minimum.
An in-memory cache can be turned on to speed up repeated access to elements.
Featuring an AnnDataSet object to lazily concatenate multiple AnnData objects.
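To make the idea of chunked processing concrete, here is a minimal sketch in plain Python. It is not SnapATAC2's implementation: a temporary CSV file stands in for the hdf5 file, and the helper names iter_row_chunks and column_sums are hypothetical. The point is only that a statistic such as column sums can be accumulated while holding a few rows in memory at a time.

```python
import csv
import itertools
import os
import tempfile


def iter_row_chunks(path, chunk_size):
    """Yield lists of at most `chunk_size` numeric rows read from a CSV file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        while True:
            chunk = [[float(v) for v in row]
                     for row in itertools.islice(reader, chunk_size)]
            if not chunk:
                return
            yield chunk


def column_sums(path, chunk_size=2):
    """Accumulate column sums without holding the whole matrix in memory."""
    totals = None
    for chunk in iter_row_chunks(path, chunk_size):
        for row in chunk:
            if totals is None:
                totals = [0.0] * len(row)
            for j, value in enumerate(row):
                totals[j] += value
    return totals


# Write a small 5 x 3 matrix to a temporary file; it stands in for on-disk data.
rows = [[i * 3 + j for j in range(3)] for i in range(5)]
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    csv.writer(tmp).writerows(rows)

try:
    sums = column_sums(tmp.name, chunk_size=2)
finally:
    os.remove(tmp.name)

print(sums)
```

Only chunk_size rows are materialized at any moment, so peak memory depends on the chunk size rather than on the size of the file.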
2.2 A tutorial on using backed AnnData objects
In this section, we will learn the basics about SnapATAC2’s AnnData implementation.
2.2.1 Reading/opening an h5ad file
SnapATAC2 can open h5ad files in either in-memory mode or backed mode. By default, snapatac2.read opens an h5ad file in backed mode.
import snapatac2 as snap
adata = snap.read(snap.datasets.pbmc5k(type='h5ad'))
adata
AnnData object with n_obs x n_vars = 4363 x 6176550 backed at '/home/kaizhang/.cache/snapatac2/atac_pbmc_5k.h5ad'
obs: 'tsse', 'n_fragment', 'frac_dup', 'frac_mito', 'doublet_score', 'is_doublet', 'leiden'
var: 'selected'
uns: 'scrublet_threshold', 'reference_sequences', 'scrublet_sim_doublet_score', 'spectral_eigenvalue'
obsm: 'X_umap', 'X_spectral', 'insertion'
obsp: 'distances'
You can turn the backed mode off using backed=None, which will use the Python anndata package to read the file and create an in-memory AnnData object.
import snapatac2 as snap
adata = snap.read(snap.datasets.pbmc5k(type='h5ad'), backed=None)
adata
Updating file 'atac_pbmc_5k.h5ad' from 'https://data.mendeley.com/api/datasets/dr2z4jbcx3/draft/files/d90adfd1-b4b8-4dcd-8704-9ab19f104116?a=758c37e5-4832-4c91-af89-9a1a83a051b3' to '/home/kaizhang/.cache/snapatac2'.
The backed AnnData object in SnapATAC2 does not need to be saved as it is always in sync with the data on disk. However, if you have opened the h5ad file in write mode, it is important to remember to close the file using the AnnData.close method. Otherwise, the underlying hdf5 file might be corrupted.
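One way to make sure the file is always closed, even when an update raises an exception, is to wrap the object with contextlib.closing from the Python standard library, which works with any object that exposes a close method. The sketch below is only an illustration: run_and_close is a hypothetical helper, and StandInAnnData is a stand-in class used so that the example runs without an h5ad file. With a real backed object you would pass the AnnData returned by snapatac2.read instead.

```python
from contextlib import closing


def run_and_close(adata, action):
    """Run `action(adata)` and close the backing file even if it raises."""
    with closing(adata):
        return action(adata)


class StandInAnnData:
    """Hypothetical stand-in exposing only `close`, used to exercise the helper."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


demo = StandInAnnData()
result = run_and_close(demo, lambda a: "updated")
print(result, demo.closed)
```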
AnnData object with n_obs x n_vars = 3 x 4 backed at 'adata.h5ad'
obsm: 'matrix'
varm: 'another_matrix'
The matrices are now saved on the backing hdf5 file and will be cleared from the memory.
2.2.4 Accessing elements in a backed AnnData object
Slots in a backed AnnData object, e.g., AnnData.X and AnnData.obs, store references to the actual data. Accessing those slots does not automatically dereference them or load the data into memory. Instead, a lazy element is returned, as demonstrated in the example below:
adata.X
Array(f64) element, cache_enabled: no, cached: no
However, accessing a slot by key, for example adata.obsm['matrix'], automatically reads the data into memory.
To retrieve the lazy element from obsm, you can use:
adata.obsm.el('matrix')
Array(f64) element, cache_enabled: no, cached: no
Several useful methods have been implemented for lazy elements. For example, you can use the slicing operator to read the full data or a part of the data, e.g., adata.X[:] or adata.X[:2, :2].
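The behavior of a lazy element can be pictured with the small pure-Python sketch below. It is a conceptual model, not SnapATAC2's code: LazyElement, its loader argument, and the reads counter are all hypothetical names. Nothing is read until the element is sliced, and the optional cache avoids repeated reads.

```python
class LazyElement:
    """Hypothetical lazy element: data is read only when it is sliced."""

    def __init__(self, loader, cache_enabled=False):
        self._loader = loader              # callable that reads the data "from disk"
        self._cache_enabled = cache_enabled
        self._cache = None
        self.reads = 0                     # number of times the loader was called

    def _load(self):
        if self._cache is not None:
            return self._cache
        self.reads += 1
        data = self._loader()
        if self._cache_enabled:
            self._cache = data
        return data

    def __getitem__(self, key):
        # Accepts a row slice or a (row slice, column slice) pair only.
        # For simplicity the sketch loads everything and then slices; a real
        # backend would read just the requested region from the file.
        rows, cols = key if isinstance(key, tuple) else (key, slice(None))
        return [row[cols] for row in self._load()[rows]]

    @staticmethod
    def _yes_no(flag):
        return "yes" if flag else "no"

    def __repr__(self):
        return ("LazyElement, cache_enabled: {}, cached: {}".format(
            self._yes_no(self._cache_enabled),
            self._yes_no(self._cache is not None)))


matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

elem = LazyElement(lambda: matrix)
print(elem)              # creating the element has not read anything yet
full = elem[:]           # full read
part = elem[:2, :2]      # partial read: first two rows and columns

cached = LazyElement(lambda: matrix, cache_enabled=True)
cached[:]
cached[:2, :2]           # served from the cache, no second read
print(elem.reads, cached.reads)
```

Without the cache every slice triggers a read; with the cache turned on the second access reuses the data already in memory, at the cost of keeping it there.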
AnnDataSet object with n_obs x n_vars = 40 x 7 backed at 'dataset.h5ads'
contains 10 AnnData objects with keys: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
obs: 'id'
uns: 'AnnDataSet'
An AnnDataSet is just a special form of AnnData object. It inherits most of the methods from AnnData and carries its own annotations, such as obs, var, obsm, etc. It also gives you access to the component AnnData objects, as shown in the example below:
AnnDataSet can be subsetted in a way similar to AnnData objects. But there is one caveat: subsetting an AnnDataSet will not rearrange the rows across component AnnData objects.
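The consequence of this caveat is that the rows of the subset may not appear in the order you requested. The sketch below illustrates the idea in plain Python; it does not call SnapATAC2, the function name subset_without_moving is hypothetical, and it assumes that rows within each component keep the requested order. Global row indices are mapped to the component that owns them, and the resulting order is grouped by component.

```python
from bisect import bisect_right
from itertools import accumulate


def subset_without_moving(sizes, indices):
    """Split global row indices into per-component local indices.

    Returns (per_component, result_order): the local rows selected in each
    component, and the global indices in the order the subset stores them.
    """
    offsets = [0] + list(accumulate(sizes))      # start row of each component
    per_component = [[] for _ in sizes]
    for idx in indices:
        comp = bisect_right(offsets, idx) - 1    # component that owns this row
        per_component[comp].append(idx - offsets[comp])
    result_order = [offsets[c] + local
                    for c, rows in enumerate(per_component)
                    for local in rows]
    return per_component, result_order


# Three components with 4 rows each; rows are requested in a mixed order.
sizes = [4, 4, 4]
requested = [9, 0, 5, 1]
per_component, result_order = subset_without_moving(sizes, requested)
print(per_component)
print(result_order)
```

Here row 9 was requested first, but it lives in the last component, so it ends up last in the subset. If downstream code relies on the requested order, it has to account for this reordering.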
2.3.2 Converting AnnDataSet to AnnData
An in-memory AnnData object can be made from an AnnDataSet using the AnnDataSet.to_adata method.