snapatac2.pp.make_gene_matrix

snapatac2.pp.make_gene_matrix(adata, gene_anno, *, inplace=False, file=None, backend='hdf5', chunk_size=500, use_x=False, id_type='gene', transcript_name_key='transcript_name', transcript_id_key='transcript_id', gene_name_key='gene_name', gene_id_key='gene_id', min_frag_size=None, max_frag_size=None, counting_strategy='insertion')

Generate a cell-by-gene activity matrix.

Generate a cell-by-gene activity matrix by counting the Tn5 insertions in gene body regions. The result will be stored in a new file and a new AnnData object will be created.

import_data must be run first in order to use this function.

Parameters:
  • adata (AnnData | AnnDataSet) – The (annotated) data matrix of shape n_obs x n_vars. Rows correspond to cells and columns to regions.

  • gene_anno (Genome | Path) – Either a Genome object or the path of a gene annotation file in GFF or GTF format.

  • inplace (bool) – Whether to add the gene matrix to the AnnData object or return a new AnnData object.

  • file (Optional[Path]) – File name of the h5ad file used to store the result. This has no effect when inplace=True.

  • backend (Optional[Literal['hdf5']]) – The backend to use for storing the result. If None, the default backend will be used.

  • chunk_size (int) – Chunk size

  • use_x (bool) – If True, use the matrix stored in .X to compute the gene activity. Otherwise the .obsm['insertion'] is used.

  • id_type (Literal['gene', 'transcript']) – “gene” or “transcript”.

  • transcript_name_key (str) – The key of the transcript name in the gene annotation file.

  • transcript_id_key (str) – The key of the transcript id in the gene annotation file.

  • gene_name_key (str) – The key of the gene name in the gene annotation file.

  • gene_id_key (str) – The key of the gene id in the gene annotation file.

  • min_frag_size (Optional[int]) – Minimum fragment size to include.

  • max_frag_size (Optional[int]) – Maximum fragment size to include.

  • counting_strategy (Literal['fragment', 'insertion', 'paired-insertion']) – The strategy used to compute feature counts. It must be one of the following: “fragment”, “insertion”, or “paired-insertion”. “fragment” means the feature counts are assigned based on the number of fragments that overlap with a region of interest. “insertion” means the feature counts are assigned based on the number of insertions that overlap with a region of interest. “paired-insertion” is similar to “insertion”, but it counts the insertions only once if both insertions of a fragment fall within the same region of interest [Miao24]. Note that this parameter has no effect if the input consists of single-end reads.
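
To make the three counting strategies concrete, here is a minimal pure-Python sketch for a single region. It is not SnapATAC2's implementation. The helper `count_region`, the half-open `(start, end)` coordinates, and the choice of the first and last base of each fragment as its two insertion sites are all assumptions made for illustration.

```python
def count_region(fragments, region, strategy="insertion"):
    """Toy count of paired-end fragments for one region (hypothetical helper).

    fragments: list of (start, end) tuples, half-open coordinates (assumption).
    region:    (start, end) tuple, half-open coordinates (assumption).
    """
    r_start, r_end = region
    total = 0
    for f_start, f_end in fragments:
        # Assumed insertion sites: first and last base of the fragment.
        left, right = f_start, f_end - 1
        left_in = r_start <= left < r_end
        right_in = r_start <= right < r_end
        if strategy == "fragment":
            # One count per fragment that overlaps the region at all.
            total += int(f_start < r_end and f_end > r_start)
        elif strategy == "insertion":
            # One count per insertion site that falls inside the region.
            total += int(left_in) + int(right_in)
        elif strategy == "paired-insertion":
            # Like "insertion", but a fragment with both sites inside
            # the region contributes a single count.
            total += int(left_in or right_in)
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
    return total


fragments = [(100, 200), (120, 180), (150, 400), (50, 600)]
region = (90, 300)
for strategy in ("fragment", "insertion", "paired-insertion"):
    print(strategy, count_region(fragments, region, strategy))
```

In this toy input the last fragment spans the whole region without either end falling inside it, so it contributes only under “fragment”. The first two fragments have both ends inside the region, so each counts twice under “insertion” and once under “paired-insertion”.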

Returns:

An annotated data matrix of shape n_obs x n_vars. Rows correspond to cells and columns to genes. If file=None, an in-memory AnnData will be returned, otherwise a backed AnnData is returned.

Return type:

AnnData

Examples

>>> import snapatac2 as snap
>>> data = snap.pp.import_data(snap.datasets.pbmc500(downsample=True), chrom_sizes=snap.genome.hg38, sorted_by_barcode=False)
>>> gene_mat = snap.pp.make_gene_matrix(data, gene_anno=snap.genome.hg38)
>>> print(gene_mat)
AnnData object with n_obs × n_vars = 585 × 60606
    obs: 'n_fragment', 'frac_dup', 'frac_mito'
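
The `gene_name_key`, `gene_id_key`, `transcript_name_key`, and `transcript_id_key` parameters name attributes found in the ninth column of a GTF record. The sketch below shows what those keys refer to. It is a simplified illustration rather than SnapATAC2's parser: `parse_gtf_attributes` is a hypothetical helper, the sample record is only illustrative, and values containing semicolons are not handled.

```python
def parse_gtf_attributes(attr_field):
    """Parse a GTF attribute column into a dict (hypothetical helper).

    Assumes the usual GTF layout: `key "value"; key "value"; ...`.
    """
    attrs = {}
    for item in attr_field.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        # Split on the first space: key on the left, quoted value on the right.
        key, _, value = item.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


# Illustrative GTF record; the ninth tab-separated column holds the attributes.
record = (
    "chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\t"
    'gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; '
    'gene_name "DDX11L1";'
)
attrs = parse_gtf_attributes(record.split("\t")[8])

# With the default keys, these are the attributes the function would look up.
print(attrs["gene_name"], attrs["gene_id"])
```

If an annotation file stores the gene symbol under a different attribute name, that name is what would be passed as `gene_name_key`.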