snapatac2.pp.make_fragment_file#

snapatac2.pp.make_fragment_file(bam_file, output_file, is_paired=True, barcode_tag=None, barcode_regex=None, umi_tag=None, umi_regex=None, shift_left=4, shift_right=-5, min_mapq=30, chunk_size=50000000, chrM=['chrM', 'M'], source=None, compression=None, compression_level=None, tempdir=None)[source]#

Convert a BAM file to a fragment file.

Convert a BAM file to a fragment file by performing the following steps:

Filtering: remove reads that are unmapped, not primary alignment, mapq < 30, fails platform/vendor quality checks, or optical duplicate. For paired-end sequencing, it also removes reads that are not properly aligned.
Deduplicate: Sort the reads by cell barcodes and remove duplicated reads for each unique cell barcode.
Output: Convert BAM records to fragments (if paired-end) or single-end reads.

The bam file needn’t be sorted or filtered.

Note

When using barcode_regex or umi_regex, the regex must contain exactly one capturing group (Parentheses group the regex between them) that matches the barcodes or UMIs. Writting the correct regex is tricky. You can test your regex online at https://regex101.com/. BAM files produced by the 10X Genomics Cell Ranger pipeline are not supported, as they contain invalid BAM headers. Specifically, Cell Ranger ATAC <= 2.0 produces BAM files with no @VN tag in the header, and Cell Ranger ATAC >= 2.1 produces BAM files with invalid @VN tag in the header. It is recommended to use the fragment files produced by Cell Ranger ATAC instead.
This function generates large temporary files in tempdir during sorting. For large files, it is recommended to set tempdir to a location with sufficient space in order to avoid running out of disk space.

Parameters:

bam_file (Path) – File name of the BAM file.
output_file (Path) – File name of the output fragment file.
is_paired (bool) – Indicate whether the BAM file contain paired-end reads
barcode_tag (str | None) – Extract barcodes from TAG fields of BAM records, e.g., barcode_tag="CB".
barcode_regex (str | None) – Extract barcodes from read names of BAM records using regular expressions. Reguler expressions should contain exactly one capturing group (Parentheses group the regex between them) that matches the barcodes. For example, barcode_regex="(..:..:..:..):\w+$" extracts bd:69:Y6:10 from A01535:24:HW2MMDSX2:2:1359:8513:3458:bd:69:Y6:10:TGATAGGTTG. You can test your regex on this website: https://regex101.com/.
umi_tag (str | None) – Extract UMI from TAG fields of BAM records.
umi_regex (str | None) – Extract UMI from read names of BAM records using regular expressions. See barcode_regex for more details.
shift_left (int) – Insertion site correction for the left end. Note this has no effect on single-end reads.
shift_right (int) – Insertion site correction for the right end. Note this has no effect on single-end reads.
min_mapq (int | None) – Filter the reads based on MAPQ.
chunk_size (int) – The size of data retained in memory when performing sorting. Larger chunk sizes result in faster sorting and greater memory usage.
source (Optional[Literal['10x']]) – The source of the BAM file. This is used for vendor-specific processing. For example, the BAM files generated by 10X Genomics needs special processing to fix the malformed BAM headers. Currently the only supported source is “10x”.
chrM (list[str] | None) – A list of mitochondrial chromosome names, used to calculate QC metrics.
compression (Optional[Literal['gzip', 'zstandard']]) – Compression type. If None, it is inferred from the suffix.
compression_level (int | None) – Compression level. 1-9 for gzip, 1-22 for zstandard. If None, it is set to 6 for gzip and 3 for zstandard.
tempdir (Path | None) – Location to store temporary files. If None, system temporary directory will be used.

Returns:

A dictionary containing the following metrics:

”sequenced_reads”: number of reads in the input BAM file.
”sequenced_read_pairs”: number of read pairs in the input BAM file.
”frac_duplicates”: Fraction of high-quality read pairs that are deemed to be PCR duplicates. This metric is a measure of sequencing saturation and is a function of library complexity and sequencing depth. More specifically, this is the fraction of high-quality fragments with a valid barcode that align to the same genomic position as another read pair in the library.
”frac_confidently_mapped”: Fraction of sequenced reads or read pairs with mapping quality > 30.
”frac_unmapped”: Fraction of sequenced reads or read pairs that have a valid barcode but could not be mapped to the genome.
”frac_valid_barcode”: Fraction of reads or read pairs with barcodes that match the whitelist after error correction.
”frac_nonnuclear”: Fraction of sequenced read pairs that have a valid barcode and map to non-nuclear genome contigs, including mitochondria, with mapping quality > 30.
”frac_fragment_in_nucleosome_free_region”: Fraction of high-quality fragments smaller than 147 basepairs.
”frac_fragment_flanking_single_nucleosome”: Fraction of high-quality fragments between 147 and 294 basepairs.

Return type:

dict[str, float]