snapatac2.pp.import_data#

snapatac2.pp.import_data(fragment_file, chrom_sizes, *, file=None, min_num_fragments=200, sorted_by_barcode=True, whitelist=None, chrM=['chrM', 'M'], shift_left=0, shift_right=0, chunk_size=2000, tempdir=None, backend='hdf5', n_jobs=8)[source]#

Import data fragment files and compute basic QC metrics.

A fragment refers to the sequence data originating from a distinct location in the genome. In single-ended sequencing, one read equates to a fragment. However, in paired-ended sequencing, a fragment is defined by a pair of reads. This function is designed to handle, store, and process input files with fragment data, further yielding a range of basic Quality Control (QC) metrics. These metrics include the total number of unique fragments, duplication rates, and the percentage of mitochondrial DNA detected.

How fragments are stored is dependent on the sequencing approach utilized. For single-ended sequencing, fragments are found in .obsm['fragment_single']. In contrast, for paired-ended sequencing, they are located in .obsm['fragment_paired'].

Diving deeper technically, the fragments are internally structured within a Compressed Sparse Row (CSR) matrix. In this configuration, each row signifies a specific cell, while each column represents a unique genomic position. Fragment starting positions dictate the column indices. Matrix values capture the lengths of the fragments for paired-end reads or the lengths of the reads for single-ended scenarios. It’s important to note that for single-ended reads, the values are signed, with the sign providing information on the fragment’s strand orientation. Additionally, it is worth noting that cells may harbor duplicate fragments, leading to the presence of duplicate column indices within the matrix. As a result, the matrix deviates from the standard CSR format, and it is not advisable to use the matrix for linear algebra operations.

../../_images/func%2Bimport_data.svg

Note

  • This function accepts both single-end and paired-end reads. If the records in the fragment file contain 6 columns with the last column representing the strand of the fragment, the fragments are considered single-ended. Otherwise, the fragments are considered paired-ended.

  • When file is not None, this function uses constant memory regardless of the size of the input file.

  • When sorted_by_barcode is False, this function will sort the fragment file first, during which temporary files will be created in tempdir. The size of temporary files is proportional to the number of records in the fragment file. For large fragment files, it is recommended to set tempdir to a location with sufficient space in order to avoid running out of disk space.

  • The QC metrics are computed only for reads that are included by the whitelist or chrom_sizes.

Warning

When the input to the function is a list of files, it employs multiprocessing to process these files concurrently. In this case, however, it is crucial to safeguard the entry point of the program by encapsulating the function call within if __name__ == '__main__':. This condition ensures that the module is being run as the main program and not being loaded as a module from another script. Without this protection, each subprocess might attempt to spawn its own subprocesses, leading to a cascade of process spawns—a situation that can cause the program to hang or crash due to infinite recursion. You don’t need to do this in Jupyter notebook as it automatically does that.

Parameters:
  • fragment_file (Path | list[Path]) – File name of the fragment file, optionally compressed with gzip or zstd. This can be a single file or a list of files. If it is a list of files, a separate AnnData object will be created for each file. A fragment file must contain at least 5 columns: chromosome, start, end, barcode, count. Optionally it can contain one more column indicating the strand of the fragment. When strand is provided, the fragments are considered single-ended.

  • chrom_sizes (Genome | dict[str, int]) – A Genome object or a dictionary containing chromosome sizes, for example, {"chr1": 2393, "chr2": 2344, ...}.

  • file (Union[Path, list[Path], None]) – File name of the output h5ad file used to store the result. If provided, result will be saved to a backed AnnData, otherwise an in-memory AnnData is used. If fragment_file is a list of files, file must also be a list of files if provided.

  • min_num_fragments (int) – Number of unique fragments threshold used to filter cells

  • sorted_by_barcode (bool) – Whether the fragment file has been sorted by cell barcodes. This function will be faster if sorted_by_barcode==True. Note the make_fragment_file will always sort the fragment file by barcode.

  • whitelist (Union[Path, list[str], None]) – File name or a list of barcodes. If it is a file name, each line must contain a valid barcode. When provided, only barcodes in the whitelist will be retained.

  • shift_left (int) – Insertion site correction for the left end. This is set to 0 by default, as shift correction is usually done in the fragment file generation step.

  • chrM (list[str]) – A list of chromosome names that are considered mitochondrial DNA. This is used to compute the fraction of mitochondrial DNA.

  • shift_right (int) – Insertion site correction for the right end. Note this has no effect on single-end reads. For single-end reads, shift_right will be set using the value of shift_left. This is set to 0 by default, as shift correction is usually done in the fragment file generation step.

  • chunk_size (int) – Increasing the chunk_size may speed up I/O but will use more memory. The speed gain is usually not significant.

  • tempdir (Optional[Path]) – Location to store temporary files. If None, system temporary directory will be used.

  • backend (Literal['hdf5']) – The backend.

  • n_jobs (int) – Number of jobs to run in parallel when fragment_file is a list. If n_jobs=-1, all CPUs will be used.

Returns:

An annotated data matrix of shape n_obs x n_vars. Rows correspond to cells and columns to regions. If file=None, an in-memory AnnData will be returned, otherwise a backed AnnData is returned.

Return type:

AnnData | ad.AnnData

Examples

>>> import snapatac2 as snap
>>> data = snap.pp.import_data(snap.datasets.pbmc500(downsample=True), chrom_sizes=snap.genome.hg38, sorted_by_barcode=False)
>>> print(data)
AnnData object with n_obs × n_vars = 585 × 0
    obs: 'n_fragment', 'frac_dup', 'frac_mito'
    uns: 'reference_sequences'
    obsm: 'fragment_paired'