snapatac2.pp.select_features

snapatac2.pp.select_features(adata, n_features=500000, filter_lower_quantile=0.005, filter_upper_quantile=0.005, whitelist=None, blacklist=None, max_iter=1, inplace=True, n_jobs=8, verbose=True)

Perform feature selection by selecting the most accessible features across all cells. If max_iter > 1, iterative clustering and feature selection is performed instead.

Note

This function does not perform the actual subsetting. Instead, it produces a feature mask that other functions use to generate submatrices on the fly. Features that are zero in all cells are always removed, regardless of the filtering criteria. For more discussion about feature selection, see: kaizhang/SnapATAC2#116.
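As a rough illustration of what a mask without subsetting means, the sketch below applies a boolean mask to a small dense NumPy array only at the point of use. The array, the mask values, and the helper name `submatrix` are made up for this example; SnapATAC2's own functions read the mask from `.var['selected']` and work on their own matrix types.

```python
import numpy as np

# Toy count matrix: 3 cells (rows) x 4 features (columns). Hypothetical data.
counts = np.array([
    [1, 0, 3, 0],
    [0, 0, 2, 1],
    [2, 0, 0, 1],
])

# Features that are zero in all cells are always dropped.
nonzero = counts.sum(axis=0) > 0

# A hypothetical selection mask, analogous to the one stored in .var['selected'].
selected = np.array([True, True, False, True]) & nonzero


def submatrix(matrix, mask):
    """Return a column-subsetted copy; `matrix` itself is left unchanged."""
    return matrix[:, mask]


sub = submatrix(counts, selected)
print(counts.shape, sub.shape)
```

The full matrix keeps its original shape; only the consumer of the mask sees the reduced set of columns.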

Parameters:
  • adata (AnnData | AnnDataSet | list[AnnData]) – The (annotated) data matrix of shape n_obs x n_vars. Rows correspond to cells and columns to regions. adata can also be a list of AnnData objects. In this case, the function will be applied to each AnnData object in parallel.

  • n_features (int) – Number of features to keep. Note that the final number of features may be smaller than this number if there are not enough features that pass the filtering criteria.

  • filter_lower_quantile (float) – Lower quantile of the feature count distribution to filter out. For example, 0.005 means the bottom 0.5% of features, those with the lowest counts, will be removed.

  • filter_upper_quantile (float) – Upper quantile of the feature count distribution to filter out. For example, 0.005 means the top 0.5% of features, those with the highest counts, will be removed. Be aware that when the number of features is very large, the default value of 0.005 may risk removing too many features.

  • whitelist (Optional[Path]) – A user-provided BED file containing genome-wide whitelist regions. Non-zero features listed here will be kept regardless of the other filtering criteria. If a feature is present in both the whitelist and the blacklist, it will be kept.

  • blacklist (Optional[Path]) – A user-provided BED file containing genome-wide blacklist regions. Features that overlap these regions will be removed.

  • max_iter (int) – If greater than 1, this function will perform iterative clustering and feature selection based on variable features found using previous clustering results. This is similar to the procedure implemented in ArchR, but we do not recommend it; see kaizhang/SnapATAC2#111. The default value is 1, which means no iterative clustering is performed.

  • inplace (bool) – Whether to perform the computation in place or return the result.

  • n_jobs (int) – Number of parallel jobs to use when adata is a list.

  • verbose (bool) – Whether to print progress messages.
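To make the interaction of `n_features`, `filter_lower_quantile`, and `filter_upper_quantile` concrete, here is a minimal NumPy sketch of one plausible way to combine them: drop all-zero features, trim a fraction of the remaining features at each end of the count ranking, then keep the most accessible ones. The function name `sketch_select`, the rank-based trimming, and the tie handling are assumptions made for illustration and are not taken from the SnapATAC2 source; whitelist, blacklist, and `max_iter` are not modeled.

```python
import numpy as np


def sketch_select(feature_counts, n_features, lower_q=0.005, upper_q=0.005):
    """Hypothetical sketch: trim both tails by rank, then keep the top n_features."""
    counts = np.asarray(feature_counts)
    order = np.argsort(counts, kind="stable")  # feature indices, ascending by count
    order = order[counts[order] > 0]           # all-zero features are always removed
    n = order.size
    n_lower = int(lower_q * n)                 # number of features trimmed at the bottom
    n_upper = int(upper_q * n)                 # number of features trimmed at the top
    kept = order[n_lower:n - n_upper]          # survivors, still ascending by count
    top = kept[::-1][:n_features]              # most accessible survivors first
    mask = np.zeros(counts.size, dtype=bool)   # True means the feature is kept
    mask[top] = True
    return mask


# Toy per-feature counts (hypothetical): one all-zero feature, ten non-zero ones.
toy_counts = [0, 5, 1, 9, 3, 7, 2, 8, 4, 6, 10]
mask = sketch_select(toy_counts, n_features=3, lower_q=0.1, upper_q=0.1)
print(np.flatnonzero(mask).tolist())  # -> [3, 5, 7]
```

In this toy run the features with counts 1 and 10 are trimmed, and the three highest remaining counts (9, 8, 7) are selected. Asking for more features than survive the trimming simply returns all survivors, which mirrors the note on `n_features` above.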

Returns:

If inplace=False, returns a boolean mask over the features, where True means the feature is kept and False means it is removed. Otherwise, the mask is stored in .var['selected'] and nothing is returned.

Return type:

np.ndarray | None
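The two return modes can be exercised as in the sketch below. It assumes `adata` is a single AnnData object already prepared with SnapATAC2 (not a list), and the helper name `preview_then_store` and the value `n_features=250000` are chosen only for this example.

```python
import numpy as np


def preview_then_store(adata, n_features=250000):
    """Preview the mask with inplace=False, then store it with inplace=True."""
    # Imported lazily so this snippet can be loaded without SnapATAC2 installed.
    import snapatac2 as snap

    # inplace=False: a boolean mask is returned instead of being stored.
    mask = snap.pp.select_features(adata, n_features=n_features, inplace=False)
    print(f"{int(np.sum(mask))} of {len(mask)} features would be kept")

    # inplace=True (the default): returns None and writes .var['selected'].
    snap.pp.select_features(adata, n_features=n_features)
    return np.asarray(adata.var["selected"])
```

Previewing with `inplace=False` first is a cheap way to check how many features survive the quantile filters before committing the mask to the object.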