snapatac2.pp.add_tile_matrix

snapatac2.pp.add_tile_matrix(adata, *, bin_size=500, inplace=True, chunk_size=500, exclude_chroms=['chrM', 'chrY', 'M', 'Y'], min_frag_size=None, max_frag_size=None, counting_strategy='paired-insertion', value_type='target', summary_type='sum', file=None, backend='hdf5', n_jobs=8)

Generate cell by bin count matrix.

This function is used to generate and add a cell by bin count matrix to the AnnData object.

import_fragments must be run first in order to use this function.

Parameters:
  • adata (AnnData | list[AnnData]) – The (annotated) data matrix of shape n_obs x n_vars. Rows correspond to cells and columns to regions. adata can also be a list of AnnData objects when inplace=True. In this case, the function is applied to each AnnData object in parallel.

  • bin_size (int) – The size of consecutive genomic regions used to record the counts.

  • inplace (bool) – Whether to add the tile matrix to the AnnData object or return a new AnnData object.

  • chunk_size (int) – Increasing the chunk_size speeds up I/O but uses more memory.

  • exclude_chroms (list[str] | str | None) – A list of chromosomes to exclude.

  • min_frag_size (int | None) – Minimum fragment size to include.

  • max_frag_size (int | None) – Maximum fragment size to include.

  • counting_strategy (Literal['fragment', 'insertion', 'paired-insertion']) – The strategy used to compute feature counts. It must be one of the following: “fragment”, “insertion”, or “paired-insertion”. “fragment” means the feature counts are assigned based on the number of fragments that overlap with a region of interest. “insertion” means the feature counts are assigned based on the number of insertions that overlap with a region of interest. “paired-insertion” is similar to “insertion”, but it counts the pair of insertions of a fragment only once if both fall within the same region of interest [Miao24]. Note that this parameter has no effect if the input consists of single-end reads.

  • value_type (Literal['target', 'total', 'fraction']) – The type of value to use from .obsm['_values'], which is only available when data is imported using import_values. It must be one of the following: “target”, “total”, or “fraction”. “target” means the value is the number of records with positive measurements, e.g., the number of methylated bases. “total” means the value is the total number of measurements, e.g., methylated bases plus unmethylated bases. “fraction” means the value is the fraction of the records that are positive, e.g., the fraction of methylated bases.

  • summary_type (Literal['sum', 'mean']) – The type of summary to use when multiple values are found in a bin. This parameter is only used when .obsm['_values'] exists, which is created by import_values. It must be one of the following: “sum” or “mean”.

  • file (Path | None) – File name of the output file used to store the result. If provided, the result is saved to a backed AnnData; otherwise an in-memory AnnData is used. This has no effect when inplace=True.

  • backend (Literal['hdf5']) – The backend to use for storing the result. If None, the default backend will be used.

  • n_jobs (int) – Number of jobs to run in parallel when adata is a list. If n_jobs=-1, all CPUs will be used.
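
The three counting_strategy options described above can be illustrated with a small pure-Python sketch. This is not the library's implementation: the helper count_fragment is hypothetical, fragments are assumed to be half-open intervals [start, end), and the second insertion is assumed to sit at end - 1. It only shows how the same fragment contributes to 500 bp bins under each strategy.

```python
from collections import Counter


def count_fragment(start, end, bin_size, strategy="paired-insertion"):
    """Return {bin_index: count} for one paired-end fragment [start, end).

    Hypothetical helper for illustration only; it is not part of snapatac2.
    """
    left = start // bin_size        # bin containing the first insertion
    right = (end - 1) // bin_size   # bin containing the second insertion (assumed at end - 1)
    counts = Counter()
    if strategy == "fragment":
        # Every bin overlapped by the fragment receives one count.
        for b in range(left, right + 1):
            counts[b] += 1
    elif strategy == "insertion":
        # Each of the two insertions is counted separately.
        counts[left] += 1
        counts[right] += 1
    elif strategy == "paired-insertion":
        # Like "insertion", but a pair landing in the same bin counts once.
        counts[left] += 1
        if right != left:
            counts[right] += 1
    else:
        raise ValueError(f"unknown counting strategy: {strategy!r}")
    return dict(counts)


bin_size = 500
short_frag = (120, 380)    # both insertions fall in bin 0
long_frag = (450, 1220)    # insertions in bins 0 and 2, fragment also covers bin 1

for strategy in ("fragment", "insertion", "paired-insertion"):
    print(strategy,
          count_fragment(*short_frag, bin_size, strategy),
          count_fragment(*long_frag, bin_size, strategy))
# fragment {0: 1} {0: 1, 1: 1, 2: 1}
# insertion {0: 2} {0: 1, 2: 1}
# paired-insertion {0: 1} {0: 1, 2: 1}
```

Under these assumptions, "insertion" and "paired-insertion" differ only for fragments short enough to have both ends in one bin, while "fragment" is the only strategy that credits bins lying strictly between the two ends.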

Returns:

An annotated data matrix of shape n_obs x n_vars. Rows correspond to cells and columns to bins. If file=None, an in-memory AnnData is returned; otherwise a backed AnnData is returned. When inplace=True, the matrix is added to adata and nothing is returned.

Return type:

AnnData | ad.AnnData | None

Examples

>>> import snapatac2 as snap
>>> data = snap.pp.import_fragments(snap.datasets.pbmc500(downsample=True), chrom_sizes=snap.genome.hg38, sorted_by_barcode=False)
>>> snap.pp.add_tile_matrix(data, bin_size=500)
>>> print(data)
AnnData object with n_obs × n_vars = 585 × 6062095
    obs: 'n_fragment', 'frac_dup', 'frac_mito'
    uns: 'reference_sequences'
    obsm: 'fragment_paired'
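
The n_vars value in the output above can be reproduced with simple arithmetic. The sketch below assumes that each retained chromosome is tiled independently into bins of bin_size base pairs, that a final partial bin is kept, and that only the primary GRCh38 chromosomes are present. The helper n_tiles is hypothetical and not part of snapatac2; the chromosome lengths are the standard GRCh38 (hg38) values.

```python
# GRCh38 (hg38) primary chromosome lengths in base pairs.
HG38_SIZES = {
    "chr1": 248956422, "chr2": 242193529, "chr3": 198295559,
    "chr4": 190214555, "chr5": 181538259, "chr6": 170805979,
    "chr7": 159345973, "chr8": 145138636, "chr9": 138394717,
    "chr10": 133797422, "chr11": 135086622, "chr12": 133275309,
    "chr13": 114364328, "chr14": 107043718, "chr15": 101991189,
    "chr16": 90338345, "chr17": 83257441, "chr18": 80373285,
    "chr19": 58617616, "chr20": 64444167, "chr21": 46709983,
    "chr22": 50818468, "chrX": 156040895, "chrY": 57227415,
    "chrM": 16569,
}


def n_tiles(chrom_sizes, bin_size=500,
            exclude_chroms=("chrM", "chrY", "M", "Y")):
    """Count bins when every retained chromosome is tiled separately.

    Hypothetical helper for illustration; assumes a trailing partial bin
    is kept for each chromosome (ceiling division).
    """
    excluded = set(exclude_chroms or ())
    return sum(
        -(-size // bin_size)  # integer ceiling division
        for chrom, size in chrom_sizes.items()
        if chrom not in excluded
    )


print(n_tiles(HG38_SIZES, bin_size=500))  # 6062095
```

With the default exclude_chroms, chrM and chrY are dropped, and the remaining 23 chromosomes yield 6062095 bins, which matches n_vars in the example.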