snapatac2.pp.make_fragment_file

snapatac2.pp.make_fragment_file(bam_file, output_file, is_paired=True, barcode_tag=None, barcode_regex=None, umi_tag=None, umi_regex=None, shift_left=4, shift_right=-5, min_mapq=30, chunk_size=50000000)

Convert a BAM file to a fragment file.

Convert a BAM file to a fragment file by performing the following steps:

  1. Filtering: remove reads that are unmapped, are not primary alignments, have a MAPQ below min_mapq (30 by default), fail platform/vendor quality checks, or are optical duplicates. For paired-end sequencing, reads that are not properly aligned are also removed.

  2. Deduplication: sort the reads by cell barcode and remove duplicate reads within each unique cell barcode.

  3. Output: Convert BAM records to fragments (if paired-end) or single-end reads.

Note that the BAM file does not need to be sorted or filtered beforehand.
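The filtering and deduplication steps can be sketched in plain Python. This is only an illustration of the logic described above, not the library's implementation. The `Read` record, the helper names, and the deduplication key (barcode plus coordinates) are assumptions made for the example; the flag constants are the standard SAM flag bits, and "not primary" is assumed here to mean the secondary-alignment bit.

```python
from collections import namedtuple

# Hypothetical record type for illustration; real BAM records carry more fields.
Read = namedtuple("Read", "barcode chrom start end flag mapq")

# Standard SAM flag bits.
PROPER_PAIR = 0x2
UNMAPPED = 0x4
SECONDARY = 0x100   # "not primary alignment"
QC_FAIL = 0x200     # fails platform/vendor quality checks
DUPLICATE = 0x400   # marked as PCR or optical duplicate


def passes_filter(read, is_paired=True, min_mapq=30):
    """Return True if the read survives the filters listed in step 1."""
    if read.flag & (UNMAPPED | SECONDARY | QC_FAIL | DUPLICATE):
        return False
    if read.mapq < min_mapq:
        return False
    if is_paired and not read.flag & PROPER_PAIR:
        return False
    return True


def deduplicate(reads):
    """Sort by barcode, then keep one read per (barcode, chrom, start, end).

    Assumption: two reads are duplicates when barcode and coordinates match.
    """
    seen = set()
    kept = []
    for r in sorted(reads, key=lambda r: (r.barcode, r.chrom, r.start, r.end)):
        key = (r.barcode, r.chrom, r.start, r.end)
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept


reads = [
    Read("AAAC", "chr1", 100, 250, 0x63, 60),  # properly paired, kept
    Read("AAAC", "chr1", 100, 250, 0x63, 60),  # duplicate within the same cell
    Read("AAAC", "chr1", 300, 420, 0x63, 10),  # MAPQ below the threshold
    Read("TTTG", "chr1", 100, 250, 0x63, 60),  # same coordinates, other cell
    Read("TTTG", "chr2", 500, 640, 0x4, 0),    # unmapped
]

fragments = deduplicate(r for r in reads if passes_filter(r))
```

Because duplicates are removed per barcode, the `TTTG` read at the same coordinates as the `AAAC` read is retained: identical fragments from different cells are not treated as duplicates.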

Parameters:
  • bam_file (Path) – File name of the BAM file.

  • output_file (Path) – File name of the output fragment file.

  • is_paired (bool) – Indicates whether the BAM file contains paired-end reads.

  • barcode_tag (Optional[str]) – Extract barcodes from TAG fields of BAM records, e.g., barcode_tag = "CB".

  • barcode_regex (Optional[str]) – Extract barcodes from read names of BAM records using regular expressions. The regular expression should contain exactly one capturing group (the part of the pattern enclosed in parentheses) that matches the barcode. For example, barcode_regex = "(..:..:..:..):\w+$" extracts bd:69:Y6:10 from A01535:24:HW2MMDSX2:2:1359:8513:3458:bd:69:Y6:10:TGATAGGTTG.

  • umi_tag (Optional[str]) – Extract UMI from TAG fields of BAM records.

  • umi_regex (Optional[str]) – Extract UMI from read names of BAM records using regular expressions. See barcode_regex for more details.

  • shift_left (int) – Insertion site correction for the left end. Note this has no effect on single-end reads.

  • shift_right (int) – Insertion site correction for the right end. Note this has no effect on single-end reads.

  • min_mapq (Optional[int]) – Minimum MAPQ; reads with a lower MAPQ are removed.

  • chunk_size (int) – The size of data retained in memory when performing sorting. Larger chunk sizes result in faster sorting and greater memory usage.
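To make two of the parameters more concrete, the sketch below uses Python's `re` module to show what the single capturing group in a barcode pattern captures, and shows one plausible reading of the insertion site correction. The pattern is written with an escaped `\w`. The library may use a different regular expression engine, and `shift_fragment` is a hypothetical helper that assumes the shifts are simply added to the fragment start and end; neither is taken from the library's code.

```python
import re

# Read name from the barcode_regex example in the parameter list.
read_name = "A01535:24:HW2MMDSX2:2:1359:8513:3458:bd:69:Y6:10:TGATAGGTTG"

# Exactly one capturing group: four two-character fields separated by colons,
# followed by a colon and a final run of word characters at the end of the name.
barcode_regex = r"(..:..:..:..):\w+$"

match = re.search(barcode_regex, read_name)
barcode = match.group(1) if match else None


def shift_fragment(start, end, shift_left=4, shift_right=-5):
    """Hypothetical helper: apply the insertion site correction.

    Assumption: shift_left is added to the start and shift_right to the end.
    """
    return start + shift_left, end + shift_right


shifted = shift_fragment(1000, 1200)
```

Here `barcode` is `"bd:69:Y6:10"`, and under the stated assumption the fragment `(1000, 1200)` becomes `(1004, 1195)` with the default shifts.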

Returns:

Various statistics.

Return type:

PyFlagStat
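Example:

A minimal usage sketch, assuming a paired-end BAM file whose cell barcodes are stored in the CB tag. The file names are placeholders rather than real files, and the keyword arguments simply repeat the documented signature and defaults. The import sits inside the function so that the snippet can be loaded without snapatac2 installed.

```python
from pathlib import Path

# Placeholder paths; replace them with real input and output files.
bam_file = Path("sample.bam")
output_file = Path("fragments.tsv.gz")

# Keyword arguments taken from the documented signature.
options = dict(
    is_paired=True,
    barcode_tag="CB",
    shift_left=4,
    shift_right=-5,
    min_mapq=30,
)


def run_conversion():
    """Convert the BAM file and return the statistics object."""
    import snapatac2 as snap  # imported lazily; requires snapatac2

    return snap.pp.make_fragment_file(bam_file, output_file, **options)
```

Calling `run_conversion()` writes the fragment file and returns the PyFlagStat object described above.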