2 AnnData – Annotated Data

2.1 Introduction

AnnData is both a data structure and an on-disk file specification that facilitates the sharing of labeled data matrices.

The Python anndata package supports both in-memory and on-disk representation of AnnData object. For detailed descriptions about the AnnData format, please read anndata’s documentation.

Despite being an excellent package, the anndata package falls short of its support for the on-disk representation or backed mode of AnnData object. When opened in the backed mode, the in-memory snapshot and on-disk data of AnnData are not in sync with each other, causing inconsistent and unexpected behaviors. For example in the backed mode, anndata only supports updates to the X slot in the AnnData object, which means any changes to other slots like obs will not be written to disk. This make the backed mode very cumbersome to use and often lead to unexpected outcomes. Also, as it still reads all other componenets except X into memory, it uses a lot of memory for large datasets.

To address these limitations, SnapATAC2 implements its own out-of-core AnnData object with the following key features:

AnnData is fully backed by the underlying hdf5 file. Any operations on the AnnData object will be reflected on the hdf5 file.
All elements are lazily loaded. No matter how large is the file, opening it consume almost zero memory. Matrix data can be accessed and processed by chunks, which keeps the memory usage to the minimum.
In-memory cache can be turned on to speed up the repetitive access of elements.
Featuring an AnnDataSet object to lazily concatenate multiple AnnData objects.

2.2 A tutorial on using backed AnnData objects

In this section, we will learn the basics about SnapATAC2’s AnnData implementation.

2.2.1 Reading/opening a h5ad file.

SnapATAC2 can open h5ad files in either in-memory mode or backed mode. By default, snapatac2.read open a h5ad file in backed mode.

import snapatac2 as snap
adata = snap.read(snap.datasets.pbmc5k(type='h5ad'))
adata

AnnData object with n_obs x n_vars = 4363 x 6176550 backed at '/home/kaizhang/.cache/snapatac2/atac_pbmc_5k.h5ad'
    obs: 'tsse', 'n_fragment', 'frac_dup', 'frac_mito', 'doublet_score', 'is_doublet', 'leiden'
    var: 'selected'
    uns: 'scrublet_threshold', 'reference_sequences', 'scrublet_sim_doublet_score', 'spectral_eigenvalue'
    obsm: 'X_umap', 'X_spectral', 'insertion'
    obsp: 'distances'

You can turn the backed mode off using backed=False, which will use the Python anndata package to read the file and create an in-memory AnnData object.

import snapatac2 as snap
adata = snap.read(snap.datasets.pbmc5k(type='h5ad'), backed=None)
adata

Updating file 'atac_pbmc_5k.h5ad' from 'https://data.mendeley.com/api/datasets/dr2z4jbcx3/draft/files/d90adfd1-b4b8-4dcd-8704-9ab19f104116?a=758c37e5-4832-4c91-af89-9a1a83a051b3' to '/home/kaizhang/.cache/snapatac2'.

AnnData object with n_obs × n_vars = 4363 × 6176550
    obs: 'tsse', 'n_fragment', 'frac_dup', 'frac_mito', 'doublet_score', 'is_doublet', 'leiden'
    var: 'selected'
    uns: 'reference_sequences', 'scrublet_sim_doublet_score', 'scrublet_threshold', 'spectral_eigenvalue'
    obsm: 'X_spectral', 'X_umap', 'insertion'
    obsp: 'distances'

2.2.2 Closing a backed AnnData object

The backed AnnData object in SnapATAC2 does not need to be saved as it is always in sync with the data on disk. However, if you have opened the h5ad file in write mode, it is important to remember to close the file using the AnnData.close method. Otherwise, the underlying hdf5 file might be corrupted.

adata = snap.read(snap.datasets.pbmc5k(type='h5ad'))
adata.close()
adata

Closed AnnData object

2.2.3 Creating a backed AnnData object

You can use the AnnData constructor to create a new AnnData object.

adata = snap.AnnData(filename='adata.h5ad')
adata

AnnData object with n_obs x n_vars = 0 x 0 backed at 'adata.h5ad'

You can then modify slots in the AnnData object.

import numpy as np
adata.X = np.ones((3, 4))
adata.obs_names = ["1", "2", "3"]
adata.var_names = ["a", "b", "c", "d"]
adata.obsm['matrix'] = np.ones((3, 10))
adata.varm['another_matrix'] = np.ones((4, 10))
adata

AnnData object with n_obs x n_vars = 3 x 4 backed at 'adata.h5ad'
    obsm: 'matrix'
    varm: 'another_matrix'

The matrices are now saved on the backing hdf5 file and will be cleared from the memory.

2.2.4 Accessing elements in a backed AnnData object

Slots in backed AnnData object, e.g., AnnData.X, AnnData.obs, store references to the actual data. Accessing those slots does not automatically perform dereferencing or load the data into memory. Instead, a lazy element will be returned, as demonstrated in the example below:

adata.X

Array(f64) element, cache_enabled: no, cached: no

However, asscessing the slots by keys will automatically read the data:

adata.obsm['matrix']

array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])

To retreive the lazy element from obsm, you can use:

adata.obsm.el('matrix')

Array(f64) element, cache_enabled: no, cached: no

Several useful methods haven been implemented for lazy elements. For example, you can use the slicing operator to read the full data or a part of the data:

adata.X[:]

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

adata.X[:2, :2]

array([[1., 1.],
       [1., 1.]])

You can also iterate over the chunks of the matrix using the chunked method:

for chunk, fr, to in adata.obsm.el('matrix').chunked(chunk_size=2):
    print("from row {} to {}: {}".format(fr, to - 1, chunk))

from row 0 to 1: [[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
from row 2 to 2: [[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

By default AnnData will read from the disk each time you request the data. This will incur a lot of IO overheads if you do this repetitively.

%%time
for _ in range(1000):
    adata.obsm['matrix']

CPU times: user 69.3 ms, sys: 1.78 ms, total: 71.1 ms
Wall time: 70.9 ms

One solution to this is to turn on the cache for the element you want to repetitively read from.

%%time
adata.obsm.el('matrix').enable_cache()
for _ in range(1000):
    adata.obsm['matrix']

CPU times: user 1.94 ms, sys: 3 µs, total: 1.95 ms
Wall time: 1.95 ms

The data will be cached the first time you request it and the subsequent calls will make use of the cached data.

2.2.5 Subsetting the AnnData

The backed AnnData object does not have “views”. Instead, you need to use the AnnData.subset method to create a new AnnData object.

adata_subset = adata.subset([0, 1], [0, 1], out="subset.h5ad")
adata_subset

AnnData object with n_obs x n_vars = 2 x 2 backed at 'subset.h5ad'
    obsm: 'matrix'
    varm: 'another_matrix'

You could also do this inplace without the out parameter:

adata_subset.subset([0])
adata_subset

AnnData object with n_obs x n_vars = 1 x 2 backed at 'subset.h5ad'
    obsm: 'matrix'
    varm: 'another_matrix'

2.2.6 Convert to in-memory representation

Finally, you can convert a backed AnnData to anndata’s in-memory AnnData object using:

adata.to_memory()

AnnData object with n_obs × n_vars = 3 × 4
    obsm: 'matrix'
    varm: 'another_matrix'

2.3 Combining multiple AnnData objects into a AnnDataSet object

Oftentimes you want to combine and deal with multiple h5ad files simultaniously. In this section you will learn how to do this efficiently.

First, let us create a bunch of AnnData objects.

def create_anndata(index: int):
    adata = snap.AnnData(
        X=np.ones((4, 7))*index,
        filename=str(index) + ".h5ad",
    )
    adata.var_names = [str(i) for i in range(7)]
    adata.obs_names = [str(i) for i in range(4)]
    adata.obsm['matrix'] = np.random.rand(4,50)
    return adata
list_of_anndata = [(str(i), create_anndata(i)) for i in range(10)]

We can then use the AnnDataSet constructor to horizontally concatenate all AnnData objects.

dataset = snap.AnnDataSet(
    adatas=list_of_anndata,
    filename="dataset.h5ads",
    add_key="id",
)
dataset

AnnDataSet object with n_obs x n_vars = 40 x 7 backed at 'dataset.h5ads'
contains 10 AnnData objects with keys: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
    obs: 'id'
    uns: 'AnnDataSet'

AnnDataSet is just a special form of AnnData objects. It inherits most of the methods from AnnData. It carries its own annotations, such as obs, var, obsm, etc. Besides, it grants you the access to component AnnData objects as well, as shown in the example below:

dataset.adatas.obsm['matrix']

array([[0.58093577, 0.26644723, 0.4864984 , ..., 0.71743106, 0.9752404 ,
        0.14249661],
       [0.84127877, 0.70899659, 0.70174468, ..., 0.55876489, 0.71954284,
        0.07213879],
       [0.59091501, 0.69258951, 0.49688508, ..., 0.67908653, 0.62798061,
        0.79639072],
       ...,
       [0.34293689, 0.60989453, 0.24111227, ..., 0.21526474, 0.0989522 ,
        0.57523261],
       [0.56398127, 0.2169606 , 0.998914  , ..., 0.02527175, 0.31128978,
        0.37318406],
       [0.11068251, 0.28360159, 0.51183271, ..., 0.40982264, 0.03265828,
        0.90912685]])

2.3.1 Subsetting an AnnDataSet object

AnnDataSet can be subsetted in a way similar to AnnData objects. But there is one caveat: subsetting an AnnDataSet will not rearrange the rows across component AnnData objects.

2.3.2 Converting AnnDataSet to AnnData

An in-memory AnnData can be made from AnnDataSet using:

dataset.to_adata()

AnnData object with n_obs × n_vars = 40 × 7
    obs: 'id'
    uns: 'AnnDataSet'