snapatac2.pp.make_fragment_file

snapatac2.pp.make_fragment_file(bam_file, output_file, is_paired=True, barcode_tag=None, barcode_regex=None, umi_tag=None, umi_regex=None, shift_left=4, shift_right=-5, min_mapq=30, chunk_size=50000000, compression=None, compression_level=None, tempdir=None)

Convert a BAM file to a fragment file.

The conversion performs the following steps:

  1. Filtering: Remove reads that are unmapped, are not primary alignments, have MAPQ below min_mapq (30 by default), fail platform/vendor quality checks, or are optical duplicates. For paired-end sequencing, reads that are not properly aligned are also removed.

  2. Deduplication: Sort the reads by cell barcode and remove duplicated reads within each unique cell barcode.

  3. Output: Convert BAM records to fragments (if paired-end) or single-end reads.

The BAM file does not need to be sorted or filtered.
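The filtering step can be illustrated with a minimal sketch that works on raw SAM flag integers. This is not the library's implementation. The flag bit values come from the SAM specification, while the helper name passes_filter and the choice to treat both secondary (0x100) and supplementary (0x800) records as "not primary" are assumptions made for illustration.

```python
# SAM flag bits, as defined in the SAM specification.
PROPER_PAIR = 0x2
UNMAPPED = 0x4
SECONDARY = 0x100
QC_FAIL = 0x200
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def passes_filter(flag: int, mapq: int, is_paired: bool = True, min_mapq: int = 30) -> bool:
    """Hypothetical helper: return True if a read would survive step 1.

    Assumption: "not primary alignment" covers secondary and
    supplementary records.
    """
    rejected = UNMAPPED | SECONDARY | SUPPLEMENTARY | QC_FAIL | DUPLICATE
    if flag & rejected:
        return False
    if mapq < min_mapq:
        return False
    # Paired-end data additionally requires the "properly aligned" bit.
    if is_paired and not (flag & PROPER_PAIR):
        return False
    return True


# Flag 99 = paired, proper pair, mate reverse strand, first in pair.
print(passes_filter(99, 60))
# Same read marked as a duplicate (99 + 0x400).
print(passes_filter(99 | DUPLICATE, 60))
```

Passing is_paired=False skips the proper-pair check, mirroring the statement that this check applies only to paired-end sequencing.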

Note

  • When using barcode_regex or umi_regex, the regex must contain exactly one capturing group (parentheses group the regex between them) that matches the barcodes or UMIs. Writing the correct regex is tricky. You can test your regex online at https://regex101.com/.

  • BAM files produced by the 10X Genomics Cell Ranger pipeline are not supported, as they contain invalid BAM headers. Specifically, Cell Ranger ATAC <= 2.0 produces BAM files with no @VN tag in the header, and Cell Ranger ATAC >= 2.1 produces BAM files with an invalid @VN tag in the header. It is recommended to use the fragment files produced by Cell Ranger ATAC instead.

  • This function generates large temporary files in tempdir during sorting. For large files, it is recommended to set tempdir to a location with sufficient space to avoid running out of disk space.
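A regex intended for barcode_regex or umi_regex can also be checked locally before running the conversion. The sketch below uses Python's re module purely for illustration; this page does not confirm that the library's regex engine behaves identically for every pattern, so treat it as a quick sanity check. The pattern and read name are the ones given in the barcode_regex parameter description.

```python
import re

# Pattern and read name from the barcode_regex parameter description.
pattern = re.compile(r"(..:..:..:..):\w+$")
read_name = "A01535:24:HW2MMDSX2:2:1359:8513:3458:bd:69:Y6:10:TGATAGGTTG"

# The function requires exactly one capturing group.
assert pattern.groups == 1, "regex must contain exactly one capturing group"

match = pattern.search(read_name)
barcode = match.group(1) if match else None
print(barcode)  # bd:69:Y6:10
```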

Parameters:
  • bam_file (Path) – File name of the BAM file.

  • output_file (Path) – File name of the output fragment file.

  • is_paired (bool) – Indicate whether the BAM file contains paired-end reads.

  • barcode_tag (Optional[str]) – Extract barcodes from TAG fields of BAM records, e.g., barcode_tag="CB".

  • barcode_regex (Optional[str]) – Extract barcodes from read names of BAM records using regular expressions. Regular expressions should contain exactly one capturing group (parentheses group the regex between them) that matches the barcodes. For example, barcode_regex="(..:..:..:..):\w+$" extracts bd:69:Y6:10 from A01535:24:HW2MMDSX2:2:1359:8513:3458:bd:69:Y6:10:TGATAGGTTG. You can test your regex on this website: https://regex101.com/.

  • umi_tag (Optional[str]) – Extract UMI from TAG fields of BAM records.

  • umi_regex (Optional[str]) – Extract UMI from read names of BAM records using regular expressions. See barcode_regex for more details.

  • shift_left (int) – Insertion site correction for the left end. Note this has no effect on single-end reads.

  • shift_right (int) – Insertion site correction for the right end. Note this has no effect on single-end reads.

  • min_mapq (int | None) – Remove reads whose MAPQ is below this value.

  • chunk_size (int) – The size of data retained in memory when performing sorting. Larger chunk sizes result in faster sorting and greater memory usage.

  • compression (Optional[Literal['gzip', 'zstandard']]) – Compression type. If None, it is inferred from the suffix of the output file name.

  • compression_level (Optional[int]) – Compression level. 1-9 for gzip, 1-22 for zstandard. If None, it is set to 6 for gzip and 3 for zstandard.

  • tempdir (Optional[Path]) – Location to store temporary files. If None, the system temporary directory will be used.
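As a rough illustration of the shift_left and shift_right defaults, the sketch below applies them to a paired-end fragment's coordinates. It assumes the shifts are simply added to the fragment start and end respectively; that is an interpretation of "insertion site correction" and is not confirmed by this page. The function name shift_fragment is hypothetical.

```python
def shift_fragment(start: int, end: int, shift_left: int = 4, shift_right: int = -5):
    """Hypothetical helper: apply insertion site correction to one fragment.

    Assumption: shift_left is added to the start coordinate and
    shift_right is added to the end coordinate.
    """
    return start + shift_left, end + shift_right


# With the defaults, the fragment shrinks by 4 bp on the left and 5 bp on the right.
print(shift_fragment(100, 300))  # (104, 295)
```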

Returns:

Various statistics.

Return type:

PyFlagStat

See also

import_data
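Example

A minimal usage sketch, assuming snapatac2 is installed and that barcodes are stored in the CB tag. The helper names fragment_kwargs and run, and the file names in the usage note below, are hypothetical. The keyword arguments are those listed in the signature above.

```python
from pathlib import Path


def fragment_kwargs(bam_file, output_file, tempdir=None):
    """Hypothetical helper: collect keyword arguments for make_fragment_file."""
    return dict(
        bam_file=Path(bam_file),
        output_file=Path(output_file),
        is_paired=True,
        barcode_tag="CB",  # assumes barcodes are stored in the CB tag
        min_mapq=30,
        tempdir=None if tempdir is None else Path(tempdir),
    )


def run(bam_file, output_file, tempdir=None):
    """Hypothetical wrapper: convert a BAM file and return the statistics."""
    # Imported lazily so this sketch can be loaded without snapatac2 installed.
    import snapatac2 as snap

    return snap.pp.make_fragment_file(**fragment_kwargs(bam_file, output_file, tempdir))
```

Calling run("sample.bam", "sample.tsv.gz", tempdir="/scratch/tmp") would write a fragment file whose compression is inferred from the suffix, with temporary sort files placed in the given directory.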