snapatac2.pp.scrublet

snapatac2.pp.scrublet(adata, features='selected', n_comps=15, sim_doublet_ratio=2.0, expected_doublet_rate=0.1, n_neighbors=None, use_approx_neighbors=False, random_state=0, inplace=True, n_jobs=8, verbose=True)

Compute probability of being a doublet using the scrublet algorithm.

This function identifies doublets by generating simulated doublets, which it creates by randomly pairing the chromatin accessibility profiles of individual cells. The simulated doublets are then embedded alongside the original cells using the spectral embedding algorithm in this package. A k-nearest-neighbor classifier is trained to distinguish between the simulated doublets and the authentic cells. This trained classifier produces a “doublet score” for each cell. The doublet scores are then converted into probabilities using a Gaussian mixture model.
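The first and third steps above can be sketched in plain NumPy. This is only an illustration under stated assumptions, not the package's implementation: the helper names are hypothetical, two profiles are merged with an element-wise OR (an assumption that suits binary accessibility data), the raw profiles stand in for the spectral embedding, and the score is simply the fraction of simulated doublets among a cell's nearest neighbors. The Gaussian mixture step is not shown.

```python
import numpy as np

def simulate_doublets(X, sim_doublet_ratio=2.0, seed=0):
    """Create synthetic doublets by merging random pairs of binary cell profiles."""
    rng = np.random.default_rng(seed)
    n_cells = X.shape[0]
    n_sim = int(round(n_cells * sim_doublet_ratio))
    # Each simulated doublet is built from two randomly chosen parent cells.
    pairs = rng.integers(0, n_cells, size=(n_sim, 2))
    # Assumption: a region is open in the doublet if it is open in either parent.
    merged = np.logical_or(X[pairs[:, 0]], X[pairs[:, 1]])
    return merged.astype(X.dtype)

def knn_doublet_score(emb_obs, emb_sim, k):
    """Fraction of each observed cell's k nearest neighbors that are simulated."""
    emb_obs = np.asarray(emb_obs, dtype=float)
    emb_sim = np.asarray(emb_sim, dtype=float)
    emb = np.vstack([emb_obs, emb_sim])
    is_sim = np.concatenate([
        np.zeros(len(emb_obs), dtype=bool),
        np.ones(len(emb_sim), dtype=bool),
    ])
    # Brute-force squared Euclidean distances from observed cells to all points.
    dist = ((emb_obs[:, None, :] - emb[None, :, :]) ** 2).sum(axis=-1)
    # Observed cells occupy the first rows of emb, so the diagonal is "self".
    idx = np.arange(len(emb_obs))
    dist[idx, idx] = np.inf
    nearest = np.argsort(dist, axis=1)[:, :k]
    return is_sim[nearest].mean(axis=1)
```

A cell whose neighborhood is dominated by simulated doublets gets a score near 1, while a cell surrounded by other observed cells gets a score near 0.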

Parameters:
  • adata (AnnData | list[AnnData]) – The (annotated) data matrix of shape n_obs x n_vars. Rows correspond to cells and columns to regions. adata can also be a list of AnnData objects. In this case, the function will be applied to each AnnData object in parallel.

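  • features (str | ndarray | None) – Boolean index mask, where True means that the feature is kept and False means that it is removed. The signature also accepts a string (default 'selected'), presumably the name of a column in adata.var that holds such a mask, or None, presumably meaning that no features are removed.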

  • n_comps (int) – Number of components of the spectral embedding. 15 is usually sufficient. The algorithm is not sensitive to this parameter.

  • sim_doublet_ratio (float) – Number of doublets to simulate relative to the number of observed cells.

  • expected_doublet_rate (float) – Expected fraction of doublets among the observed cells.

  • n_neighbors (Optional[int]) – Number of neighbors used to construct the KNN graph of observed cells and simulated doublets. If None, this is set to round(0.5 * sqrt(n_cells)).

  • use_approx_neighbors (bool) – Whether to use approximate nearest-neighbor search.

  • random_state (int) – Seed for the random number generator.

  • inplace (bool) – Whether to update the AnnData object in place.

  • n_jobs (int) – Number of jobs to run in parallel.

  • verbose (bool) – Whether to print progress messages.
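The default for n_neighbors described above follows directly from the documented formula. The helper below is a hypothetical sketch that mirrors that formula; the name default_n_neighbors is not part of the package, and n_cells is used exactly as named in the parameter description.

```python
import math

def default_n_neighbors(n_cells: int) -> int:
    """Documented default when n_neighbors is None: round(0.5 * sqrt(n_cells))."""
    return round(0.5 * math.sqrt(n_cells))

# With 10,000 cells the documented default works out to 50 neighbors.
print(default_n_neighbors(10_000))
```

Because the default grows with the square root of the cell count, a hundredfold larger dataset uses only ten times as many neighbors.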

Returns:

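If inplace = True, the function returns None and updates adata with the following fields:
  • adata.obs["doublet_probability"]: probability of being a doublet

  • adata.obs["doublet_score"]: doublet score

If inplace = False, adata is left unchanged and the results are returned as a tuple of two arrays.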

Return type:

tuple[np.ndarray, np.ndarray] | None
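A minimal usage sketch, assuming snapatac2 is installed and adata has already been prepared for this function. Both helper names and the 0.5 threshold are illustrative assumptions, not part of the package or a recommended cutoff.

```python
import numpy as np

def run_scrublet_copy(adata):
    """Run scrublet without modifying adata (requires snapatac2)."""
    # Imported lazily so that flag_doublets below works without the package.
    import snapatac2 as snap
    # With inplace=False the documented return type is a tuple of two arrays.
    return snap.pp.scrublet(adata, inplace=False)

def flag_doublets(doublet_probability, threshold=0.5):
    """Boolean mask of cells whose doublet probability exceeds the threshold.

    The threshold of 0.5 is a hypothetical choice for illustration.
    """
    return np.asarray(doublet_probability) > threshold

# After an in-place run, the same mask could be derived from
# adata.obs["doublet_probability"].
mask = flag_doublets([0.1, 0.9, 0.5])
```

Keeping the thresholding separate from the scoring call makes it easy to inspect the probabilities before deciding which cells to drop.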