API
MAGIC
Markov Affinity-based Graph Imputation of Cells (MAGIC)
Authors: Scott Gigante <scott.gigante@yale.edu>, Daniel Dager <daniel.dager@yale.edu> (C) 2018 Krishnaswamy Lab GPLv2
-
class
magic.magic.
MAGIC
(knn=5, knn_max=None, decay=1, t=3, n_pca=100, solver='exact', knn_dist='euclidean', n_jobs=1, random_state=None, verbose=1)[source]¶ Bases:
sklearn.base.BaseEstimator
MAGIC operator which performs imputation and denoising.
Markov Affinity-based Graph Imputation of Cells (MAGIC) is an algorithm for denoising and transcript recovery of single cells applied to single-cell RNA sequencing data, as described in van Dijk et al., 2018 [1].
Parameters: - knn (int, optional, default: 5) – number of nearest neighbors from which to compute kernel bandwidth
- knn_max (int, optional, default: None) – maximum number of nearest neighbors with nonzero connection. If None, will be set to 3 * knn
- decay (int, optional, default: 1) – sets decay rate of kernel tails. If None, alpha decaying kernel is not used
- t (int, optional, default: 3) – power to which the diffusion operator is powered. This sets the level of diffusion. If ‘auto’, t is selected according to the Procrustes disparity of the diffused data
- n_pca (int, optional, default: 100) – Number of principal components to use for calculating neighborhoods. For extremely large datasets, using n_pca < 20 allows neighborhoods to be calculated in roughly log(n_samples) time.
- solver (str, optional, default: 'exact') – Which solver to use. “exact” uses the implementation described in van Dijk et al. (2018) [1]. “approximate” uses a faster implementation that performs imputation in the PCA space and then projects back to the gene space. Note, the “approximate” solver may return negative values.
- knn_dist (string, optional, default: 'euclidean') – Distance metric for building kNN graph. Recommended values: ‘euclidean’, ‘cosine’. Any metric from scipy.spatial.distance can be used. Custom distance functions of form f(x, y) = d are also accepted
- n_jobs (integer, optional, default: 1) – The number of jobs to use for the computation. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used
- random_state (integer or numpy.RandomState, optional, default: None) – The generator used to initialize random PCA. If an integer is given, it fixes the seed. Defaults to the global numpy random number generator.
- verbose (int or boolean, optional (default: 1)) – If True or > 0, print status messages
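The default rules for knn_max and n_jobs described above can be sketched in a few lines. This is only an illustration of the documented arithmetic, not the library's internal code; the helper names resolve_knn_max and resolve_n_jobs and the n_cpus argument are hypothetical.

```python
import os


def resolve_knn_max(knn, knn_max=None):
    """Hypothetical helper: fall back to 3 * knn when knn_max is None."""
    return 3 * knn if knn_max is None else knn_max


def resolve_n_jobs(n_jobs, n_cpus=None):
    """Hypothetical helper: turn an n_jobs value into a worker count.

    n_cpus is an illustrative argument; it defaults to the local CPU count.
    """
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all CPUs but one, and so on
        return max(1, n_cpus + 1 + n_jobs)
    return n_jobs
```

On a machine with 8 CPUs, n_jobs=-1 resolves to 8 workers and n_jobs=-2 to 7, matching the (n_cpus + 1 + n_jobs) rule.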
-
X
¶ Input data
Type: array-like, shape=[n_samples, n_features]
-
X_magic
¶ Output data
Type: array-like, shape=[n_samples, n_features]
-
graph
¶ The graph built on the input data
Type: graphtools.BaseGraph
Examples
>>> import magic
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> X = pd.read_csv("../../data/test_data.csv")
>>> X.shape
(500, 197)
>>> magic_operator = magic.MAGIC()
>>> X_magic = magic_operator.fit_transform(X, genes=['VIM', 'CDH1', 'ZEB1'])
>>> X_magic.shape
(500, 3)
>>> magic_operator.set_params(t=7)
MAGIC(a=15, k=5, knn_dist='euclidean', n_jobs=1, n_pca=100, random_state=None, t=7, verbose=1)
>>> X_magic = magic_operator.transform(genes=['VIM', 'CDH1', 'ZEB1'])
>>> X_magic.shape
(500, 3)
>>> X_magic = magic_operator.transform(genes="all_genes")
>>> X_magic.shape
(500, 197)
>>> plt.scatter(X_magic['VIM'], X_magic['CDH1'],
...             c=X_magic['ZEB1'], s=1, cmap='inferno')
>>> plt.show()
>>> magic.plot.animate_magic(X, gene_x='VIM', gene_y='CDH1',
...                          gene_color='ZEB1', operator=magic_operator)
>>> dremi = magic_operator.knnDREMI('VIM', 'CDH1', plot=True)
References
[1] Van Dijk D et al. (2018), Recovering Gene Interactions from Single-Cell Data Using Data Diffusion, Cell.
-
diff_op
¶ The diffusion operator calculated from the data
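To make the role of t concrete, the following toy sketch applies a small row-stochastic matrix to a one-gene data matrix t times. It assumes that diffusion amounts to multiplying the data by the operator raised to the power t; the 3-cell matrix, the matmul and diffuse helpers, and the data values are all made up for illustration, and the real operator is built from the kNN graph rather than written by hand.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def diffuse(diff_op, data, t=3):
    """Apply diff_op to data t times, i.e. compute diff_op^t @ data."""
    out = data
    for _ in range(t):
        out = matmul(diff_op, out)
    return out


# Toy operator over three cells: every row sums to 1 (a Markov matrix).
toy_diff_op = [
    [0.5, 0.5, 0.0],
    [0.25, 0.5, 0.25],
    [0.0, 0.5, 0.5],
]

# One gene measured in three cells; only the first cell has counts.
toy_data = [[4.0], [0.0], [0.0]]

smoothed = diffuse(toy_diff_op, toy_data, t=3)
```

Each step replaces a cell's value with a weighted average over its neighbors, so larger t spreads signal further across the graph and pulls the values closer together.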
-
fit
(X, graph=None)[source]¶ Computes the diffusion operator
Parameters: - X (array, shape=[n_samples, n_features]) – input data with n_samples samples and n_features dimensions. Accepted data types: numpy.ndarray, scipy.sparse.spmatrix, pd.DataFrame, anndata.AnnData.
- graph (graphtools.Graph, optional (default: None)) – If given, provides a precomputed kernel matrix with which to perform diffusion.
Returns: magic_operator – The estimator object
Return type: magic.MAGIC
-
fit_transform
(X, graph=None, **kwargs)[source]¶ Computes the diffusion operator and the denoised gene expression
Parameters: - X (array, shape=[n_samples, n_features]) – input data with n_samples samples and n_features dimensions. Accepted data types: numpy.ndarray, scipy.sparse.spmatrix, pd.DataFrame, anndata.AnnData.
- graph (graphtools.Graph, optional (default: None)) – If given, provides a precomputed kernel matrix with which to perform diffusion.
- genes (list or {"all_genes", "pca_only"}, optional (default: None)) – List of genes, either as integer indices or column names if input data is a pandas DataFrame. If “all_genes”, the entire smoothed matrix is returned. If “pca_only”, PCA on the smoothed data is returned. If None, the entire matrix is also returned, but a warning may be raised if the resultant matrix is very large.
- t_max (int, optional, default: 20) – maximum t to test if t is set to ‘auto’
- plot_optimal_t (boolean, optional, default: False) – If true and t is set to ‘auto’, plot the disparity used to select t
- ax (matplotlib.axes.Axes, optional) – If given and plot_optimal_t is true, plot will be drawn on the given axis.
Returns: X_magic – The gene expression values after diffusion
Return type: array, shape=[n_samples, n_genes]
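The accepted forms of the genes argument can be illustrated with a small column-selection helper. This is a sketch under assumptions, not the library's implementation: select_gene_indices is a hypothetical name, and the "pca_only" option is deliberately left out because it returns PCA coordinates rather than gene columns.

```python
def select_gene_indices(columns, genes=None):
    """Hypothetical helper: map a `genes` argument to column indices.

    columns is the ordered list of gene names in the input data.
    """
    if genes is None or genes == "all_genes":
        # Both forms select the entire smoothed matrix.
        return list(range(len(columns)))
    if genes == "pca_only":
        raise ValueError("'pca_only' is not a column selection")
    indices = []
    for gene in genes:
        if isinstance(gene, int):
            indices.append(gene)  # already an integer index
        else:
            indices.append(columns.index(gene))  # look up by column name
    return indices


toy_columns = ["VIM", "CDH1", "ZEB1", "GAPDH"]
by_name = select_gene_indices(toy_columns, ["VIM", "ZEB1"])
by_index = select_gene_indices(toy_columns, [1, 3])
```

Selecting by name only makes sense when the input carries column labels, which is why the documentation ties names to pandas DataFrame input.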
-
get_params
(deep=True)¶ Get parameters for this estimator.
Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators. Returns: params – Parameter names mapped to their values. Return type: dict
-
knnDREMI
(gene_x, gene_y, k=10, n_bins=20, n_mesh=3, n_jobs=1, plot=False, **kwargs)[source]¶ Calculate kNN-DREMI on MAGIC output
Calculates k-Nearest Neighbor conditional Density Resampled Estimate of Mutual Information as defined in Van Dijk et al, 2018. [1]
Note that kNN-DREMI, like DREMI, is not symmetric: swapping gene_x and gene_y generally changes the result. Here we are estimating I(Y|X).
Parameters: - gene_x (array-like, shape=[n_samples]) – Gene shown on the x axis (independent feature)
- gene_y (array-like, shape=[n_samples]) – Gene shown on the y axis (dependent feature)
- k (int, range=[0:n_samples), optional (default: 10)) – Number of neighbors
- n_bins (int, range=[0:inf), optional (default: 20)) – Number of bins for density resampling
- n_mesh (int, range=[0:inf), optional (default: 3)) – In each bin, density will be calculated around (n_mesh ** 2) points
- n_jobs (int, optional (default: 1)) – Number of threads used for kNN calculation
- plot (bool, optional (default: False)) – If True, DREMI creates plots of the data like those seen in Fig 5C/D of van Dijk et al. 2018 (doi:10.1016/j.cell.2018.05.061).
- **kwargs (additional arguments for scprep.stats.plot_knnDREMI) –
Returns: dremi – kNN conditional Density resampled estimate of mutual information
Return type: float
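Why the order of gene_x and gene_y matters can be seen on a toy discrete example. The sketch below is not the kNN-DREMI estimator; it computes plain conditional entropy on made-up samples to show that a quantity conditioned on X generally differs from the same quantity conditioned on Y. The function name and the sample values are illustrative only.

```python
from collections import Counter
from math import log2


def conditional_entropy(pairs):
    """Return H(second | first) in bits for a list of (first, second) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    marginal = Counter(first for first, _ in pairs)
    # H(B|A) = -sum over (a, b) of p(a, b) * log2 p(b | a)
    return -sum((count / n) * log2(count / marginal[a])
                for (a, _), count in joint.items())


# Toy samples of (x, y): y is ambiguous when x == 0, but x is fully
# determined by y.
samples = [(0, 0), (0, 1), (1, 2), (1, 2)]

h_y_given_x = conditional_entropy(samples)
h_x_given_y = conditional_entropy([(y, x) for x, y in samples])
```

Here knowing y pins down x exactly, while knowing x leaves y uncertain half the time, so the two directions give different values.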
-
set_params
(**params)[source]¶ Set the parameters on this estimator.
Any parameters not given as named arguments will be left at their current value.
Parameters: - knn (int, optional, default: 5) – number of nearest neighbors on which to build kernel
- decay (int, optional, default: 1) – sets decay rate of kernel tails. If None, alpha decaying kernel is not used
- t (int, optional, default: 3) – power to which the diffusion operator is powered. This sets the level of diffusion. If ‘auto’, t is selected according to the Procrustes disparity of the diffused data
- n_pca (int, optional, default: 100) – Number of principal components to use for calculating neighborhoods. For extremely large datasets, using n_pca < 20 allows neighborhoods to be calculated in roughly log(n_samples) time.
- knn_dist (string, optional, default: 'euclidean') – Distance metric for building kNN graph. Recommended values: ‘euclidean’, ‘cosine’. Any metric from scipy.spatial.distance can be used.
- n_jobs (integer, optional, default: 1) – The number of jobs to use for the computation. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used
- random_state (integer or numpy.RandomState, optional, default: None) – The generator used to initialize random PCA. If an integer is given, it fixes the seed. Defaults to the global numpy random number generator.
- verbose (int or boolean, optional (default: 1)) – If True or > 0, print status messages
Returns: Return type: self
-
transform
(X=None, genes=None, t_max=20, plot_optimal_t=False, ax=None)[source]¶ Computes the values of genes after diffusion
Parameters: - X (array, optional, shape=[n_samples, n_features]) – input data with n_samples samples and n_features dimensions. Not required, since MAGIC does not embed cells not given in the input matrix to MAGIC.fit(). Accepted data types: numpy.ndarray, scipy.sparse.spmatrix, pd.DataFrame, anndata.AnnData.
- genes (list or {"all_genes", "pca_only"}, optional (default: None)) – List of genes, either as integer indices or column names if input data is a pandas DataFrame. If “all_genes”, the entire smoothed matrix is returned. If “pca_only”, PCA on the smoothed data is returned. If None, the entire matrix is also returned, but a warning may be raised if the resultant matrix is very large.
- t_max (int, optional, default: 20) – maximum t to test if t is set to ‘auto’
- plot_optimal_t (boolean, optional, default: False) – If true and t is set to ‘auto’, plot the disparity used to select t
- ax (matplotlib.axes.Axes, optional) – If given and plot_optimal_t is true, plot will be drawn on the given axis.
Returns: X_magic – The gene expression values after diffusion
Return type: array, shape=[n_samples, n_genes]
Plotting
-
magic.plot.
animate_magic
(data, gene_x, gene_y, gene_color=None, t_max=20, delay=2, operator=None, filename=None, ax=None, figsize=None, s=1, cmap='inferno', interval=200, dpi=100, ipython_html='jshtml', verbose=False, **kwargs)[source]¶ Animate a gene-gene relationship with increased diffusion
Parameters: - data (array-like) – Input data matrix
- gene_x (int or str) – Gene to put on the x axis
- gene_y (int or str) – Gene to put on the y axis
- gene_color (int or str, optional (default: None)) – Gene to color by. If None, no color vector is used
- t_max (int, optional (default: 20)) – maximum value of t to include in the animation
- delay (int, optional (default: 2)) – number of frames to dwell on the first frame before applying MAGIC
- operator (magic.MAGIC, optional (default: None)) – precomputed MAGIC operator. If None, one is created.
- filename (str, optional (default: None)) – If not None, saves a .gif or .mp4 with the output
- ax (matplotlib.Axes or None, optional (default: None)) – axis on which to plot. If None, an axis is created
- figsize (tuple, optional (default: None)) – Tuple of floats for creation of new matplotlib figure. Only used if ax is None.
- s (int, optional (default: 1)) – Point size
- cmap (str or callable, optional (default: 'inferno')) – Matplotlib colormap
- interval (float, optional (default: 200)) – Time in milliseconds between frames
- dpi (int, optional (default: 100)) – Dots per inch (image quality) in saved animation
- ipython_html ({'html5', 'jshtml'}) – which html writer to use if using a Jupyter Notebook
- verbose (bool, optional (default: False)) – MAGIC operator verbosity
- **kwargs (arguments for MAGIC) –
Returns: Return type: A Matplotlib animation showing diffusion of an edge with increased t
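How t_max, delay and interval interact can be estimated with a short sketch. It assumes a frame layout that the documentation does not spell out: delay extra frames held at t = 0, followed by one frame per value of t from 0 to t_max. The function names frame_schedule and animation_seconds are hypothetical.

```python
def frame_schedule(t_max=20, delay=2):
    """Hypothetical layout: the value of t shown on each frame.

    Assumes `delay` dwell frames at t = 0, then t = 0, 1, ..., t_max.
    """
    return [0] * delay + list(range(t_max + 1))


def animation_seconds(t_max=20, delay=2, interval=200):
    """Approximate running time, with `interval` in milliseconds per frame."""
    return len(frame_schedule(t_max, delay)) * interval / 1000


schedule = frame_schedule()
seconds = animation_seconds()
```

Under these assumptions, raising delay lengthens the pause on the undiffused data, while raising interval slows the whole animation.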