Standard pipeline: analyzing 5K PBMC dataset from 10X genomics#

Introduction#

In this tutorial we will analyze single-cell ATAC-seq data from Peripheral blood mononuclear cells (PBMCs).

Import library and environment setup#

[1]:
import snapatac2 as snap

snap.__version__
[1]:
'2.5.0dev2'

To start, we need to download a fragment file. This can be achieved by calling the function below.

[2]:
# Input files
fragment_file = snap.datasets.pbmc5k()
fragment_file
[2]:
PosixPath('/home/zhangkaiLab/zhangkai33/.cache/snapatac2/atac_pbmc_5k.tsv.gz')

A fragment file refers to a file containing information about the fragments of DNA that are accessible and have been sequenced. Here are more information about the Fragment File.

If you do not have a fragment file for your experiment, you can make one from a BAM file, see pp.make_fragment_file().

Preprocessing#

We begin data preprocessing by importing fragment files and calculating basic quality control (QC) metrics using the pp.import_data() function.

This function compresses and stores the fragments in an AnnData object for later usage(To learn more about SnapATAC2’s anndata implementation, click here). During this process, various quality control measures, such as TSS enrichment and the number of unique fragments per cell, are computed and stored in the anndata as well.

When the file argument is not specified, the AnnData object is created and stored in the computer’s memory. However, if the file argument is provided, the AnnData object will be backed by an hdf5 file. In “backed” mode, pp.import_data() processes the data in chunks and streams the results to the disk, using only a small, fixed amount of memory. Therefore, it is recommended to specify the file parameter when working with large datasets and limited memory resources. Keep in mind that analyzing data in “backed” mode is slightly slower than in “memory” mode.

In this tutorial we will use the backed mode. To learn more about the differences between these two modes, click here.

[3]:
%%time
data = snap.pp.import_data(
    fragment_file,
    chrom_sizes=snap.genome.hg38,
    file="pbmc.h5ad",  # Optional
    sorted_by_barcode=False,
)
data
CPU times: user 6min 32s, sys: 9.15 s, total: 6min 41s
Wall time: 3min 17s
[3]:
AnnData object with n_obs x n_vars = 14232 x 0 backed at 'pbmc.h5ad'
    obs: 'n_fragment', 'frac_dup', 'frac_mito'
    uns: 'reference_sequences'
    obsm: 'fragment_paired'

pp.import_data() computes only basic QC metrics like the number of unique fragments per cell, fraction of duplicated reads and fraction of mitochondrial read. More advanced metrics can be computed by other functions.

We first use pl.frag_size_distr() to calculate and plot the size distribution of fragments in this dataset. The typical fragment size distribution in ATAC-seq data exhibits several key characteristics:

  1. Nucleosome-Free Region (NFR): The majority of fragments are short, typically around 80-300 base pairs (bp) in length. These short fragments represent regions of open chromatin where DNA is relatively accessible and not bound to nucleosomes. These regions are often referred to as nucleosome-free regions (NFRs) and correspond to regions of active transcriptional regulation.

  2. Mono-Nucleosome Peaks: There is often a peak in the fragment size distribution at around 147-200 bp, which corresponds to the size of a single nucleosome wrapped DNA. These fragments result from the cutting of DNA by the transposase enzyme when it encounters a nucleosome, causing the DNA to be protected and resulting in fragments of the same size.

  3. Di-Nucleosome Peaks: In addition to the mono-nucleosome peak, you may also observe a smaller peak at approximately 300-400 bp, corresponding to di-nucleosome fragments. These fragments occur when transposase cuts the DNA between two neighboring nucleosomes.

  4. Large Fragments: Some larger fragments, greater than 500 bp in size, may be observed. These fragments can result from various sources, such as the presence of longer stretches of open chromatin or technical artifacts.

[4]:
snap.pl.frag_size_distr(data, interactive=False)
2023-10-08 13:50:58 - INFO - Computing fragment size distribution...
[4]:
../_images/tutorials_pbmc_9_1.png

The plotting functions in SnapATAC2 can optionally return a plotly Figure object that can be further customized using plotly’s API. In the example below, we change the y-axis to log-scale.

[5]:
fig = snap.pl.frag_size_distr(data, show=False)
fig.update_yaxes(type="log")
fig.show()