snapatac2.pp.import_data

snapatac2.pp.import_data(fragment_file, *, file=None, genome=None, gene_anno=None, chrom_size=None, min_num_fragments=200, min_tsse=1, sorted_by_barcode=True, low_memory=True, whitelist=None, shift_left=0, shift_right=0, chunk_size=2000, tempdir=None, backend='hdf5', n_jobs=8)

Import dataset and compute QC metrics.

This function stores fragments as base-resolution Tn5 insertions in the resulting h5ad file (in .obsm['insertion']), along with the chromosome sizes (in .uns['reference_sequences']). It also computes several QC metrics, including TSS enrichment (TSSe), the number of unique fragments, the duplication rate, and the fraction of mitochondrial DNA reads. The .obsm['insertion'] matrix created in this step is required for downstream analysis such as tile matrix generation and peak calling.

Parameters:

fragment_file (Path | list[Path]) – File name of the fragment file. This can be a single file or a list of files.
file (Union[Path, list[Path], None]) – File name of the output h5ad file used to store the result. If provided, the result is saved to a backed AnnData; otherwise an in-memory AnnData is used. If fragment_file is a list of files, file must also be a list of files if provided.
genome (Optional[Genome]) – A Genome object, providing gene annotation and chromosome sizes. If not set, gene_anno and chrom_size must be provided. genome has lower priority than gene_anno and chrom_size.
gene_anno (Optional[Path]) – File name of the gene annotation file in GFF or GTF format. This is required if genome is not set. Setting gene_anno will override the annotations from the genome parameter.
chrom_size (Optional[dict[str, int]]) – A dictionary containing chromosome sizes, for example, {"chr1": 2393, "chr2": 2344, ...}. This is required if genome is not set. Setting chrom_size will override the chrom_size from the genome parameter.
min_num_fragments (int) – Minimum number of unique fragments a cell must have to be retained.
min_tsse (float) – Minimum TSS enrichment a cell must have to be retained.
sorted_by_barcode (bool) – Whether the fragment file has been sorted by cell barcodes. If sorted_by_barcode == True, this function uses a small, fixed amount of memory. If sorted_by_barcode == False and low_memory == False, all data will be kept in memory. See low_memory for more details.
low_memory (bool) – Whether to use the low memory mode when sorted_by_barcode == False. In this mode the records are first sorted by barcode and then processed in batches. The parameter has no effect when sorted_by_barcode == True.
whitelist (Union[Path, list[str], None]) – File name or a list of barcodes. If it is a file name, each line must contain a valid barcode. When provided, only barcodes in the whitelist will be retained.
shift_left (int) – Insertion site correction for the left end. This is set to 0 by default, as shift correction is usually done in the fragment file generation step.
shift_right (int) – Insertion site correction for the right end. Note this has no effect on single-end reads. For single-end reads, shift_right will be set using the value of shift_left. This is set to 0 by default, as shift correction is usually done in the fragment file generation step.
chunk_size (int) – Increasing the chunk_size may speed up I/O but will use more memory. The speed gain is usually not significant.
tempdir (Optional[Path]) – Location to store temporary files. If None, the system temporary directory will be used.
backend (Literal['hdf5']) – The storage backend. 'hdf5' is the only value accepted by the signature.
n_jobs (int) – Number of jobs to run in parallel when fragment_file is a list. If n_jobs=-1, all CPUs will be used.
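
The inputs described by chrom_size, whitelist and sorted_by_barcode can be prepared with the Python standard library. The following is a minimal sketch, not part of snapatac2: the helper names read_chrom_sizes, write_whitelist and is_grouped_by_barcode are hypothetical. It assumes a two-column, tab-separated chromosome sizes file and a tab-separated fragment file with the barcode in the fourth column, neither of which is specified on this page. It also assumes that "sorted by barcode" means that all records of a barcode are contiguous.

```python
import gzip
from pathlib import Path


def read_chrom_sizes(path):
    """Parse a two-column, tab-separated file into {chromosome: length}."""
    sizes = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            name, length = line.split("\t")[:2]
            sizes[name] = int(length)
    return sizes


def write_whitelist(barcodes, path):
    """Write one barcode per line, as the whitelist parameter describes."""
    path = Path(path)
    path.write_text("".join(f"{barcode}\n" for barcode in barcodes))
    return path


def is_grouped_by_barcode(fragment_path, barcode_column=3):
    """Return True if each barcode's records form one contiguous run.

    Assumption: the barcode is in column `barcode_column` (0-based).
    The set of barcodes seen so far is kept in memory.
    """
    opener = gzip.open if str(fragment_path).endswith(".gz") else open
    finished = set()  # barcodes whose run has already ended
    current = None
    with opener(fragment_path, "rt") as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue
            barcode = line.rstrip("\n").split("\t")[barcode_column]
            if barcode != current:
                if barcode in finished:
                    return False  # barcode reappears after its run ended
                if current is not None:
                    finished.add(current)
                current = barcode
    return True
```

The dictionary returned by read_chrom_sizes could then be passed as chrom_size, the path returned by write_whitelist as whitelist, and the result of is_grouped_by_barcode could inform the choice of sorted_by_barcode.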

Returns:

An annotated data matrix of shape n_obs x n_vars. Rows correspond to cells and columns to regions. If file=None, an in-memory AnnData will be returned, otherwise a backed AnnData is returned.
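
The choice between an in-memory and a backed result, and the rule that file must be a list when fragment_file is a list, can be sketched as follows. This is an illustration under stated assumptions rather than a recommended workflow: the helper names plan_outputs and run_import and the .h5ad naming scheme are hypothetical. The call itself uses only parameters listed in the signature on this page, and snapatac2 is imported only when run_import is invoked.

```python
from pathlib import Path


def plan_outputs(fragment_files, out_dir=None):
    """Return a `file` argument whose shape matches `fragment_files`.

    out_dir=None means file=None (in-memory result). Otherwise a single
    input yields one path and a list of inputs yields a list of paths.
    """
    if out_dir is None:
        return None

    def target(fragment_file):
        # Hypothetical naming scheme: keep the text before the first dot.
        stem = Path(fragment_file).name.split(".")[0]
        return Path(out_dir) / f"{stem}.h5ad"

    if isinstance(fragment_files, (list, tuple)):
        return [target(f) for f in fragment_files]
    return target(fragment_files)


def run_import(fragment_files, genome, out_dir=None, n_jobs=8):
    """Call import_data using only keyword arguments documented here."""
    import snapatac2 as snap  # lazy import; requires snapatac2 to be installed

    return snap.pp.import_data(
        fragment_files,
        file=plan_outputs(fragment_files, out_dir),
        genome=genome,
        n_jobs=n_jobs,
    )
```

For instance, run_import(["a.tsv.gz", "b.tsv.gz"], genome, out_dir="out") would request one backed h5ad file per input, where genome is a Genome object supplied by the caller, while omitting out_dir would keep the result in memory.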

Return type:

AnnData | ad.AnnData