snapatac2.pp.import_data#

snapatac2.pp.import_data(fragment_file, chrom_sizes, *, file=None, min_num_fragments=200, sorted_by_barcode=True, whitelist=None, chrM=['chrM', 'M'], shift_left=0, shift_right=0, chunk_size=2000, tempdir=None, backend='hdf5', n_jobs=8)[source]#

Import data fragment files and compute basic QC metrics.

A fragment refers to the sequence data originating from a distinct location in the genome. In single-ended sequencing, one read equates to a fragment. However, in paired-ended sequencing, a fragment is defined by a pair of reads. This function is designed to handle, store, and process input files with fragment data, further yielding a range of basic Quality Control (QC) metrics. These metrics include the total number of unique fragments, duplication rates, and the percentage of mitochondrial DNA detected.

How fragments are stored is dependent on the sequencing approach utilized. For single-ended sequencing, fragments are found in .obsm['fragment_single']. In contrast, for paired-ended sequencing, they are located in .obsm['fragment_paired'].

Diving deeper technically, the fragments are internally structured within a Compressed Sparse Row (CSR) matrix. In this configuration, each row signifies a specific cell, while each column represents a unique genomic position. Fragment starting positions dictate the column indices. Matrix values capture the lengths of the fragments for paired-end reads or the lengths of the reads for single-ended scenarios. It’s important to note that for single-ended reads, the values are signed, with the sign providing information on the fragment’s strand orientation. Additionally, it is worth noting that cells may harbor duplicate fragments, leading to the presence of duplicate column indices within the matrix. As a result, the matrix deviates from the standard CSR format, and it is not advisable to use the matrix for linear algebra operations.

../../_images/func%2Bimport_data.svg

Note

This function accepts both single-end and paired-end reads. If the records in the fragment file contain 6 columns with the last column representing the strand of the fragment, the fragments are considered single-ended. Otherwise, the fragments are considered paired-ended.

Parameters:
  • fragment_file (Path | list[Path]) – File name of the fragment file, optionally compressed with gzip or zstd. This can be a single file or a list of files. If it is a list of files, a separate AnnData object will be created for each file. A fragment file must contain at least 5 columns: chromosome, start, end, barcode, count. Optionally it can contain one more column indicating the strand of the fragment. When strand is provided, the fragments are considered single-ended.

  • chrom_sizes (Genome | dict[str, int]) – A Genome object or a dictionary containing chromosome sizes, for example, {"chr1": 2393, "chr2": 2344, ...}.

  • file (Union[Path, list[Path], None]) – File name of the output h5ad file used to store the result. If provided, result will be saved to a backed AnnData, otherwise an in-memory AnnData is used. If fragment_file is a list of files, file must also be a list of files if provided.

  • min_num_fragments (int) – Number of unique fragments threshold used to filter cells

  • sorted_by_barcode (bool) – Whether the fragment file has been sorted by cell barcodes. This function will be faster if sorted_by_barcode==True. Note the make_fragment_file will always sort the fragment file by barcode.

  • whitelist (Union[Path, list[str], None]) – File name or a list of barcodes. If it is a file name, each line must contain a valid barcode. When provided, only barcodes in the whitelist will be retained.

  • shift_left (int) – Insertion site correction for the left end. This is set to 0 by default, as shift correction is usually done in the fragment file generation step.

  • chrM (list[str]) – A list of chromosome names that are considered mitochondrial DNA. This is used to compute the fraction of mitochondrial DNA.

  • shift_right (int) – Insertion site correction for the right end. Note this has no effect on single-end reads. For single-end reads, shift_right will be set using the value of shift_left. This is set to 0 by default, as shift correction is usually done in the fragment file generation step.

  • chunk_size (int) – Increasing the chunk_size may speed up I/O but will use more memory. The speed gain is usually not significant.

  • tempdir (Optional[Path]) – Location to store temporary files. If None, system temporary directory will be used.

  • backend (Literal['hdf5']) – The backend.

  • n_jobs (int) – Number of jobs to run in parallel when fragment_file is a list. If n_jobs=-1, all CPUs will be used.

Returns:

An annotated data matrix of shape n_obs x n_vars. Rows correspond to cells and columns to regions. If file=None, an in-memory AnnData will be returned, otherwise a backed AnnData is returned.

Return type:

AnnData | ad.AnnData

Examples

>>> import snapatac2 as snap
>>> data = snap.pp.import_data(snap.datasets.pbmc500(downsample=True), chrom_sizes=snap.genome.hg38, sorted_by_barcode=False)
>>> print(data)
AnnData object with n_obs × n_vars = 585 × 0
    obs: 'n_fragment', 'frac_dup', 'frac_mito'
    uns: 'reference_sequences'
    obsm: 'fragment_paired'