snapatac2.pp.make_gene_matrix

snapatac2.pp.make_gene_matrix(adata, gene_anno, *, inplace=False, file=None, backend='hdf5', chunk_size=500, use_x=False, id_type='gene', upstream=2000, downstream=0, include_gene_body=True, transcript_name_key='transcript_name', transcript_id_key='transcript_id', gene_name_key='gene_name', gene_id_key='gene_id', min_frag_size=None, max_frag_size=None, counting_strategy='insertion')

Generate cell by gene activity matrix.

Generate cell by gene activity matrix by counting the Tn5 insertions in each gene’s regulatory domain. The regulatory domain is initially defined as the TSS, or as the whole gene body if include_gene_body=True. This domain is then extended by upstream base pairs on its upstream side and by downstream base pairs on its downstream side.

The result will be stored in a new file and a new AnnData object will be created. import_data must be run first in order to use this function.
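The regulatory domain described above can be sketched in plain Python. This is an illustrative sketch only, not SnapATAC2's actual implementation: the helper name regulatory_domain is hypothetical, and it assumes 0-based half-open gene coordinates, a strand-aware TSS (gene start on the + strand, gene end on the - strand), and clamping at position 0.

```python
from typing import Tuple


def regulatory_domain(
    start: int,
    end: int,
    strand: str,
    upstream: int = 2000,
    downstream: int = 0,
    include_gene_body: bool = True,
) -> Tuple[int, int]:
    """Return a half-open [lo, hi) interval for a gene's regulatory domain.

    Hypothetical helper for illustration; assumes 0-based, half-open
    gene coordinates [start, end).
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")

    if include_gene_body:
        # Start from the whole gene body.
        lo, hi = start, end
    else:
        # Start from the single-base TSS (assumed strand-aware).
        tss = start if strand == "+" else end - 1
        lo, hi = tss, tss + 1

    if strand == "+":
        # Upstream is toward smaller coordinates on the + strand.
        lo, hi = lo - upstream, hi + downstream
    else:
        # Upstream is toward larger coordinates on the - strand.
        lo, hi = lo - downstream, hi + upstream

    # Never extend past the beginning of the chromosome.
    return max(0, lo), hi


# Defaults: gene body plus 2000 bp upstream.
print(regulatory_domain(10000, 15000, "+"))
# TSS-only window of 1000 bp on each side.
print(regulatory_domain(10000, 15000, "+", upstream=1000, downstream=1000,
                        include_gene_body=False))
```

Under these assumptions, the default settings cover the gene body plus a 2 kb promoter-proximal window, while include_gene_body=False restricts the domain to a window around the TSS.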

Parameters:
  • adata (AnnData | AnnDataSet) – The (annotated) data matrix of shape n_obs x n_vars. Rows correspond to cells and columns to regions.

  • gene_anno (Genome | Path) – Either a Genome object or the path of a gene annotation file in GFF or GTF format.

  • inplace (bool) – Whether to add the gene matrix to the AnnData object or return a new AnnData object.

  • file (Optional[Path]) – File name of the h5ad file used to store the result. This has no effect when inplace=True.

  • backend (Optional[Literal['hdf5']]) – The backend to use for storing the result. If None, the default backend will be used.

  • chunk_size (int) – Chunk size

  • use_x (bool) – If True, use the matrix stored in .X to compute the gene activity. Otherwise, .obsm['insertion'] is used.

  • id_type (Literal['gene', 'transcript']) – “gene” or “transcript”.

  • upstream (int) – The number of base pairs upstream of the regulatory domain.

  • downstream (int) – The number of base pairs downstream of the regulatory domain.

  • include_gene_body (bool) – Whether to include the gene body in the regulatory domain. If False, the TSS is used as the regulatory domain.

  • transcript_name_key (str) – The key of the transcript name in the gene annotation file.

  • transcript_id_key (str) – The key of the transcript id in the gene annotation file.

  • gene_name_key (str) – The key of the gene name in the gene annotation file.

  • gene_id_key (str) – The key of the gene id in the gene annotation file.

  • min_frag_size (Optional[int]) – Minimum fragment size to include.

  • max_frag_size (Optional[int]) – Maximum fragment size to include.

  • counting_strategy (Literal['fragment', 'insertion', 'paired-insertion']) – The strategy to compute feature counts. It must be one of the following: “fragment”, “insertion”, or “paired-insertion”. “fragment” means the feature counts are assigned based on the number of fragments that overlap with a region of interest. “insertion” means the feature counts are assigned based on the number of insertions that overlap with a region of interest. “paired-insertion” is similar to “insertion”, but it counts the insertions only once if both insertions of a fragment fall within the same region of interest [Miao24]. Note that this parameter has no effect if the input consists of single-end reads.
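The three counting_strategy options can be illustrated with a small standalone sketch. It is not the library's implementation: the function count_in_region is hypothetical, fragments are assumed to be 0-based half-open (start, end) pairs from paired-end data, and the two insertion sites of a fragment are assumed to sit at start and end - 1.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]


def count_in_region(
    fragments: Iterable[Interval],
    region: Interval,
    strategy: str = "insertion",
) -> int:
    """Count signal in one region under a given strategy (illustrative only)."""
    r_start, r_end = region

    def inside(pos: int) -> bool:
        return r_start <= pos < r_end

    total = 0
    for f_start, f_end in fragments:
        # Assumed insertion sites: the two ends of the fragment.
        hits = int(inside(f_start)) + int(inside(f_end - 1))
        if strategy == "fragment":
            # One count per fragment that overlaps the region at all.
            total += int(f_start < r_end and f_end > r_start)
        elif strategy == "insertion":
            # One count per insertion site inside the region.
            total += hits
        elif strategy == "paired-insertion":
            # Like "insertion", but two sites of one fragment count once.
            total += min(hits, 1)
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
    return total


frags = [(90, 150), (110, 180), (50, 300), (300, 400), (120, 130)]
region = (100, 200)
for s in ("fragment", "insertion", "paired-insertion"):
    print(s, count_in_region(frags, region, s))
```

In this toy input, the fragment (50, 300) spans the region without placing an insertion inside it, so it contributes only under "fragment", while fragments lying entirely inside the region contribute two counts under "insertion" but one under "paired-insertion".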

Returns:

An annotated data matrix of shape n_obs x n_vars. Rows correspond to cells and columns to genes. If file=None, an in-memory AnnData is returned; otherwise, a backed AnnData is returned.

Return type:

AnnData

Examples

>>> import snapatac2 as snap
>>> data = snap.pp.import_data(snap.datasets.pbmc500(downsample=True), chrom_sizes=snap.genome.hg38, sorted_by_barcode=False)
>>> gene_mat = snap.pp.make_gene_matrix(data, gene_anno=snap.genome.hg38)
>>> print(gene_mat)
AnnData object with n_obs × n_vars = 585 × 60606
    obs: 'n_fragment', 'frac_dup', 'frac_mito'
>>> gene_mat = snap.pp.make_gene_matrix(data, gene_anno=snap.genome.hg38, upstream=1000, downstream=1000, include_gene_body=False)