UNCURL public functions

uncurl.max_variance_genes

uncurl.max_variance_genes(data, nbins=5, frac=0.2)

Identifies the highest-variance genes within each of several bins, where genes are binned by mean expression level.

Parameters:
  • data (array) – genes x cells
  • nbins (int) – number of bins to sort genes by mean expression level. Default: 5.
  • frac (float) – fraction of genes to return per bin, between 0 and 1. Default: 0.2
Returns:

list of gene indices (list of ints)
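The binning idea described above can be sketched in plain NumPy. This is an illustrative re-implementation under stated assumptions, not the library's source: the helper name top_variance_genes_by_bin is hypothetical, bins are taken as equal-sized groups of genes ordered by mean, and the per-bin count is rounded up.

```python
import numpy as np

def top_variance_genes_by_bin(data, nbins=5, frac=0.2):
    """Hypothetical helper: pick the highest-variance genes in each mean-sorted bin."""
    data = np.asarray(data, dtype=float)      # genes x cells
    means = data.mean(axis=1)
    variances = data.var(axis=1)
    order = np.argsort(means)                 # gene indices sorted by mean expression
    selected = []
    # Assumption: bins hold (nearly) equal numbers of genes.
    for bin_genes in np.array_split(order, nbins):
        if bin_genes.size == 0:
            continue
        n_keep = int(np.ceil(frac * bin_genes.size))
        # Sort this bin's genes by variance, highest first, and keep the top ones.
        by_variance = bin_genes[np.argsort(variances[bin_genes])[::-1]]
        selected.extend(int(g) for g in by_variance[:n_keep])
    return selected

# Synthetic counts: 50 genes x 30 cells, each gene with its own Poisson rate.
rng = np.random.default_rng(0)
rates = rng.uniform(0.5, 20.0, size=(50, 1))
counts = rng.poisson(lam=rates, size=(50, 30))
genes = top_variance_genes_by_bin(counts, nbins=5, frac=0.2)
```

With 50 genes, 5 bins, and frac=0.2, the sketch keeps 2 genes from each bin of 10.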

uncurl.qualNorm

uncurl.qualNorm(data, qualitative)

Generates starting points using binarized data. If qualitative data is missing for a given gene, all of its entries should be -1 in the qualitative matrix.

Parameters:
  • data (array) – 2d array of genes x cells
  • qualitative (array) – 2d array of numerical data - genes x clusters
Returns:

Array of starting positions for state estimation or clustering, with shape genes x clusters
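To make the -1 convention concrete, here is a rough sketch of how binarized data could be turned into starting points. Everything in it is an assumption for illustration: the helper name, the midpoint threshold used to binarize each gene, and the fallback to the overall gene mean for rows that are entirely -1. The library's actual procedure may differ.

```python
import numpy as np

def starting_points_from_qualitative(data, qualitative):
    """Hypothetical sketch: derive genes x clusters starting points."""
    data = np.asarray(data, dtype=float)                # genes x cells
    qualitative = np.asarray(qualitative, dtype=float)  # genes x clusters
    genes, clusters = qualitative.shape
    starts = np.zeros((genes, clusters))
    for g in range(genes):
        row = data[g]
        if np.all(qualitative[g] == -1):
            # No qualitative information for this gene (assumed fallback).
            starts[g, :] = row.mean()
            continue
        # Binarize the expression values at the midpoint of their range.
        threshold = (row.min() + row.max()) / 2.0
        high = row[row > threshold]
        low = row[row <= threshold]
        high_mean = high.mean() if high.size else row.mean()
        low_mean = low.mean() if low.size else row.mean()
        # Clusters marked "high" in the qualitative row get the high-group mean.
        qual_mid = (qualitative[g].min() + qualitative[g].max()) / 2.0
        starts[g] = np.where(qualitative[g] > qual_mid, high_mean, low_mean)
    return starts

data = np.array([[0, 1, 9, 10],
                 [5, 5, 5, 5]])
qualitative = np.array([[0, 1],      # gene 0: low in cluster 0, high in cluster 1
                        [-1, -1]])   # gene 1: no qualitative data
starts = starting_points_from_qualitative(data, qualitative)
```

The result has shape genes x clusters, so it can play the role of an initial-means array.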

uncurl.poisson_cluster

uncurl.poisson_cluster(data, k, init=None, max_iters=100)

Performs Poisson hard EM on the given data.

Parameters:
  • data (array) – A 2d array of genes x cells. Can be dense or sparse; for best performance, sparse matrices should be in CSC format.
  • k (int) – Number of clusters
  • init (array, optional) – Initial centers - genes x k array. Default: None, use kmeans++
  • max_iters (int, optional) – Maximum number of iterations. Default: 100
Returns:

a cells x 1 vector of cluster assignments, and a genes x k array of cluster means.

Return type:

a tuple of two arrays
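For readers unfamiliar with hard EM, the loop below is a minimal dense-only sketch of the technique: assign each cell to the cluster with the highest Poisson log-likelihood, then recompute each cluster mean from its members. The function name poisson_hard_em is hypothetical, explicit initial centers are required (no kmeans++), and this is not the library's implementation.

```python
import numpy as np

def poisson_hard_em(data, k, init, max_iters=100, eps=1e-10):
    """Hypothetical sketch of Poisson hard EM on a dense genes x cells array."""
    data = np.asarray(data, dtype=float)
    means = np.asarray(init, dtype=float).copy()       # genes x k
    assignments = np.full(data.shape[1], -1)
    for _ in range(max_iters):
        # Poisson log-likelihood per (cell, cluster), dropping the log(x!) term,
        # which does not depend on the cluster.
        log_means = np.log(means + eps)
        ll = data.T @ log_means - means.sum(axis=0)    # cells x k
        new_assignments = ll.argmax(axis=1)
        if np.array_equal(new_assignments, assignments):
            break                                      # assignments are stable
        assignments = new_assignments
        for c in range(k):
            members = data[:, assignments == c]
            if members.shape[1] > 0:                   # keep old mean if cluster is empty
                means[:, c] = members.mean(axis=1)
    return assignments, means

data = np.array([[20, 18, 22, 1, 0, 2],
                 [1, 2, 0, 19, 21, 20],
                 [5, 4, 6, 5, 6, 4]])
init = np.array([[15.0, 2.0],
                 [2.0, 15.0],
                 [5.0, 5.0]])
assignments, means = poisson_hard_em(data, k=2, init=init)
```

On this toy input the first three cells form one cluster and the last three form the other, mirroring the (assignments, means) tuple described above.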

uncurl.nb_cluster

uncurl.nb_cluster(data, k, P_init=None, R_init=None, assignments=None, means=None, max_iters=10)

Performs negative binomial clustering on the given data. If some genes have mean > variance, then these genes are fitted to a Poisson distribution.

Parameters:
  • data (array) – genes x cells
  • k (int) – number of clusters
  • P_init (array) – NB success prob param - genes x k. Default: random
  • R_init (array) – NB stopping param - genes x k. Default: random
  • assignments (array) – cells x 1 array of integers 0...k-1. Default: kmeans-pp (poisson)
  • means (array) – initial cluster means (for use with kmeans-pp to create initial assignments). Default: None
  • max_iters (int) – maximum number of iterations. Default: 10
Returns:

  • assignments (array) – 1d array of length cells, containing integers 0...k-1
  • P (array) – genes x k. Value is 0 for genes with mean > var
  • R (array) – genes x k. Value is inf for genes with mean > var
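The P = 0, R = inf convention for genes with mean > var follows naturally from a method-of-moments fit, sketched below. Assumptions: the negative binomial is parameterized so that mean = P·R/(1-P) and var = P·R/(1-P)², the fit is done per gene over all cells rather than per cluster, and genes with mean equal to variance are also treated as Poisson. The library may estimate these parameters differently; nb_moment_fit is a hypothetical name.

```python
import numpy as np

def nb_moment_fit(data):
    """Hypothetical sketch: per-gene NB parameters by matching mean and variance."""
    data = np.asarray(data, dtype=float)   # genes x cells
    mean = data.mean(axis=1)
    var = data.var(axis=1)
    # Poisson limit of the NB: P -> 0 and R -> inf.
    P = np.zeros_like(mean)
    R = np.full_like(mean, np.inf)
    over = var > mean                      # overdispersed genes only
    P[over] = 1.0 - mean[over] / var[over]
    R[over] = mean[over] ** 2 / (var[over] - mean[over])
    return P, R

data = np.array([[0, 0, 10, 10],   # mean 5, variance 25: overdispersed
                 [2, 3, 2, 3]])    # mean 2.5, variance 0.25: treated as Poisson
P, R = nb_moment_fit(data)
```

For the first gene, P = 0.8 and R = 1.25 reproduce the observed mean and variance; the second gene keeps the Poisson markers.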

uncurl.poisson_estimate_state

uncurl.poisson_estimate_state(data, clusters, init_means=None, init_weights=None, method='NoLips', max_iters=30, tol=1e-10, disp=False, inner_max_iters=100, normalize=True, initialization='tsvd', parallel=True, threads=4, max_assign_weight=0.75, run_w_first=True, constrain_w=False, regularization=0.0)

Uses a Poisson Convex Mixture model to estimate cell states and cell state mixing weights.

To lower computational costs, use a sparse matrix, set disp to False, and set tol to 0.

Parameters:
  • data (array) – genes x cells array or sparse matrix.
  • clusters (int) – number of mixture components
  • init_means (array, optional) – initial centers - genes x clusters. Default: from Poisson kmeans
  • init_weights (array, optional) – initial weights - clusters x cells, or assignments as produced by clustering. Default: from Poisson kmeans
  • method (str, optional) – optimization method. Current options are ‘NoLips’ and ‘L-BFGS-B’. Default: ‘NoLips’.
  • max_iters (int, optional) – maximum number of iterations. Default: 30
  • tol (float, optional) – if both M and W change by less than tol (RMSE), then the iteration is stopped. Default: 1e-10
  • disp (bool, optional) – whether or not to display optimization progress. Default: False
  • inner_max_iters (int, optional) – Number of iterations to run in the optimization subroutine for M and W. Default: 100
  • normalize (bool, optional) – True if the resulting W should sum to 1 for each cell. Default: True.
  • initialization (str, optional) – If initial means and weights are not provided, this describes how they are initialized. Options: ‘cluster’ (poisson cluster for means and weights), ‘kmpp’ (kmeans++ for means, random weights), ‘km’ (regular k-means), ‘tsvd’ (tsvd(50) + k-means). Default: tsvd.
  • parallel (bool, optional) – Whether to use parallel updates (sparse NoLips only). Default: True
  • threads (int, optional) – How many threads to use in the parallel computation. Default: 4
  • max_assign_weight (float, optional) – If using a clustering-based initialization, how much weight to assign to the max weight cluster. Default: 0.75
  • run_w_first (bool, optional) – Whether or not to optimize W first (if false, M will be optimized first). Default: True
  • constrain_w (bool, optional) – If True, then W is normalized after every iteration. Default: False
  • regularization (float, optional) – Regularization coefficient for M and W. Default: 0 (no regularization).
Returns:

  • M (array) – genes x clusters. State means
  • W (array) – clusters x cells. State mixing components for each cell
  • ll (float) – final log-likelihood
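As a rough picture of what M, W, and the log-likelihood mean here, the sketch below assumes each entry of the data is Poisson-distributed with rate (M @ W), evaluates the log-likelihood up to the constant log(x!) term, and normalizes W so each cell's weights sum to 1. Both function names are hypothetical, and the exact objective the library reports may differ.

```python
import numpy as np

def poisson_mixture_loglik(data, M, W, eps=1e-10):
    """Hypothetical sketch: Poisson log-likelihood of data under rate M @ W,
    omitting the log(x!) term, which does not depend on M or W."""
    rate = M @ W + eps
    return float(np.sum(data * np.log(rate) - rate))

def normalize_weights(W):
    """Scale each column (cell) of W so its weights sum to 1."""
    return W / W.sum(axis=0, keepdims=True)

M = np.array([[10.0, 1.0],
              [1.0, 10.0]])                           # genes x clusters
W_true = normalize_weights(np.array([[3.0, 1.0],
                                     [1.0, 3.0]]))    # clusters x cells
data = M @ W_true                                     # noise-free "expression"

W_uniform = np.full((2, 2), 0.5)
ll_true = poisson_mixture_loglik(data, M, W_true)
ll_uniform = poisson_mixture_loglik(data, M, W_uniform)
```

Because the Poisson likelihood peaks when the rate equals the observed value, the weights that generated the data score higher than uniform weights.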

uncurl.nb_estimate_state

uncurl.nb_estimate_state(data, clusters, R=None, init_means=None, init_weights=None, max_iters=10, tol=0.0001, disp=True, inner_max_iters=400, normalize=True)

Uses a Negative Binomial Mixture model to estimate cell states and cell state mixing weights.

If some of the genes do not fit a negative binomial distribution (mean > var), then the genes are discarded from the analysis.

Parameters:
  • data (array) – genes x cells
  • clusters (int) – number of mixture components
  • R (array, optional) – vector of length genes containing the dispersion estimates for each gene. Default: use nb_fit
  • init_means (array, optional) – initial centers - genes x clusters. Default: kmeans++ initializations
  • init_weights (array, optional) – initial weights - clusters x cells. Default: random(0,1)
  • max_iters (int, optional) – maximum number of iterations. Default: 10
  • tol (float, optional) – if both M and W change by less than tol (in RMSE), then the iteration is stopped. Default: 1e-4
  • disp (bool, optional) – whether or not to display optimization parameters. Default: True
  • inner_max_iters (int, optional) – Number of iterations to run in the scipy minimizer for M and W. Default: 400
  • normalize (bool, optional) – True if the resulting W should sum to 1 for each cell. Default: True.
Returns:

  • M (array) – genes x clusters. State centers
  • W (array) – clusters x cells. State mixing components for each cell
  • R (array) – 1 x genes. NB dispersion parameter for each gene
  • ll (float) – log-likelihood of final iteration
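The tol stopping rule ("both M and W change by less than tol in RMSE") can be written as a small check like the one below. The helper names are hypothetical, and whether the library compares with a strict or non-strict inequality is an assumption.

```python
import numpy as np

def rmse(a, b):
    """Root-mean-square difference between two equally shaped arrays."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def has_converged(M_old, M_new, W_old, W_new, tol=1e-4):
    """Hypothetical check: stop only when both M and W have moved less than tol."""
    return rmse(M_old, M_new) < tol and rmse(W_old, W_new) < tol

M_old = np.array([[1.0, 2.0], [3.0, 4.0]])
W_old = np.array([[0.5, 0.5], [0.5, 0.5]])

tiny_step = has_converged(M_old, M_old + 1e-6, W_old, W_old + 1e-6)
big_step = has_converged(M_old, M_old + 1e-6, W_old, W_old + 1e-2)
```

Here tiny_step is True, while big_step is False because W alone moved by more than tol.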

uncurl.mds

uncurl.mds(means, weights, d)

Dimensionality reduction using MDS.

Parameters:
  • means (array) – genes x clusters
  • weights (array) – clusters x cells
  • d (int) – desired dimensionality
Returns:

  • W_reduced (array) – array of shape (d, cells)
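One plausible way to get a (d, cells) array from means and weights is to embed the cluster means with classical MDS and then place each cell at the weighted combination of the cluster coordinates. The sketch below follows that assumption; the function names are hypothetical, and the library's exact procedure is not confirmed by this page.

```python
import numpy as np

def classical_mds(points, d):
    """Classical (Torgerson) MDS. points: n x features. Returns n x d coordinates."""
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    sq_norms = np.sum(points ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ sq_dists @ centering     # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:d]                # d largest eigenvalues
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))  # guard tiny negative values
    return eigvecs[:, top] * scale

def reduce_cells(means, weights, d):
    """Hypothetical sketch: embed clusters, then mix their coordinates per cell."""
    cluster_coords = classical_mds(np.asarray(means).T, d)   # clusters x d
    return cluster_coords.T @ np.asarray(weights)            # d x cells

means = np.array([[1.0, 0.0, 4.0],
                  [0.0, 2.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [2.0, 2.0, 2.0]])                    # genes x clusters
weights = np.array([[1.0, 0.0, 0.0, 0.5, 0.2],
                    [0.0, 1.0, 0.0, 0.5, 0.3],
                    [0.0, 0.0, 1.0, 0.0, 0.5]])        # clusters x cells
reduced = reduce_cells(means, weights, d=2)
```

Three clusters always fit exactly in two dimensions, so the embedding preserves the distances between cluster means, and a cell with all its weight on one cluster lands on that cluster's coordinates.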

uncurl.lineage

uncurl.lineage(means, weights, curve_function='poly', curve_dimensions=6)

Builds a lineage graph using a minimum spanning tree.

Parameters:
  • means (array) – genes x clusters - output of state estimation
  • weights (array) – clusters x cells - output of state estimation
  • curve_function (string) – either ‘poly’ or ‘fourier’. Default: ‘poly’
  • curve_dimensions (int) – number of parameters for the curve. Default: 6
Returns:

  • curve parameters – list of lists for each cluster
  • smoothed data in 2d space – 2 x cells
  • list of edges – pairs of cell indices
  • cell cluster assignments – list of ints
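The minimum spanning tree step can be illustrated with Prim's algorithm over Euclidean distances. This is a generic sketch of the technique on arbitrary points, with a hypothetical function name; it does not reproduce the curve fitting or the exact inputs the library passes to its tree construction.

```python
import numpy as np

def minimum_spanning_tree_edges(points):
    """Prim's algorithm. points: n x dims. Returns a list of (parent, child) edges."""
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True                       # grow the tree from node 0
    best_dist = dists[0].copy()             # cheapest known link into the tree
    best_parent = np.zeros(n, dtype=int)    # tree node providing that link
    edges = []
    for _ in range(n - 1):
        # Pick the node outside the tree that is closest to it.
        candidates = np.where(in_tree, np.inf, best_dist)
        j = int(np.argmin(candidates))
        edges.append((int(best_parent[j]), j))
        in_tree[j] = True
        # Update links that are cheaper through the newly added node.
        closer = dists[j] < best_dist
        best_dist = np.where(closer, dists[j], best_dist)
        best_parent = np.where(closer, j, best_parent)
    return edges

points = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [6.0, 0.0]])
edges = minimum_spanning_tree_edges(points)
```

For points laid out along a line, the tree simply links each point to its neighbor.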

uncurl.pseudotime

uncurl.pseudotime(starting_node, edges, fitted_vals)
Parameters:
  • starting_node (int) – index of the starting node
  • edges (list) – list of tuples (node1, node2)
  • fitted_vals (array) – output of lineage (2 x cells)
Returns:

A 1d array containing the pseudotime value of each cell.
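One natural reading of pseudotime on a lineage tree is the distance travelled along the edges from the starting node. The sketch below implements that reading; it is an assumption rather than the library's documented behavior. It also assumes the edges form a tree (so each cell has a unique path from the start) and leaves unreachable cells at infinity. The function name is hypothetical.

```python
from collections import deque

import numpy as np

def path_length_pseudotime(starting_node, edges, fitted_vals):
    """Hypothetical sketch: cumulative Euclidean path length from starting_node."""
    fitted_vals = np.asarray(fitted_vals, dtype=float)   # 2 x cells
    n_cells = fitted_vals.shape[1]
    neighbors = {i: [] for i in range(n_cells)}
    for a, b in edges:                                   # undirected edges
        neighbors[a].append(b)
        neighbors[b].append(a)
    times = np.full(n_cells, np.inf)
    times[starting_node] = 0.0
    queue = deque([starting_node])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if np.isinf(times[nxt]):                     # not visited yet
                step = np.linalg.norm(fitted_vals[:, nxt] - fitted_vals[:, node])
                times[nxt] = times[node] + step
                queue.append(nxt)
    return times

fitted = np.array([[0.0, 1.0, 3.0, 6.0],
                   [0.0, 0.0, 0.0, 0.0]])
edges = [(0, 1), (1, 2), (2, 3)]
from_start = path_length_pseudotime(0, edges, fitted)
from_middle = path_length_pseudotime(2, edges, fitted)
```

Starting from cell 0 the values increase along the chain; starting from cell 2 they grow in both directions away from it.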