Introduction
RedClust is a Julia package for Bayesian clustering of high-dimensional Euclidean data using pairwise dissimilarity information instead of the raw observations. It uses an MCMC sampler to generate posterior samples from the space of all possible clustering structures on the data, and implements the algorithm described in Natarajan et al. (2023).
Installation
The package can be installed by typing ]add RedClust into the Julia REPL, or by using Pkg:
using Pkg
Pkg.add("RedClust")
Basic example
using RedClust
# Generate data (example values for N, K, data_σ, and data_dim)
N = 100 # number of observations
K = 10 # number of clusters
data_σ = 0.25 # spread (σ) of the mixture kernel
data_dim = 10 # dimension of the data
points, distM, clusts, probs, oracle_coclustering =
generatemixture(N, K; α = 10, σ = data_σ, dim = data_dim)
# Let RedClust choose the best prior hyperparameters
params = fitprior(points, "k-means", false)
# Set the MCMC options
options = MCMCOptionsList(numiters = 5000)
data = MCMCData(points)
# Run the sampler
result = runsampler(data, options, params)
# Get a point estimate
pointestimate, index = getpointestimate(result)
# Summary of MCMC and point estimate
summarise(result)
summarise(pointestimate, clusts)
A more elaborate example can be found in the Example section. Examples from the paper and its supplementary material can be found in the 'examples' branch of the GitHub repository for RedClust.jl.
Model
RedClust implements the model described in Natarajan et al. (2023). The key features are:
- The use of a random partition model with an unknown number of clusters $K$, which allows for posterior inference on $K$. That is, there is a prior distribution on the space of all possible clustering structures with any number of clusters from one to the number of observations, and the number of clusters is an object of inference to be determined by MCMC sampling.
- The pairwise dissimilarities between observations are assumed to be mutually independent given the clustering assignment; that is, the clustering likelihood is a composite or pseudo-likelihood.
- The clustering likelihood comprises a cohesive part and a repulsive part. The repulsive component imposes a strong identifiability constraint on the clustering: clusters must not only consist of points with similar small distances among themselves, but these points must also have similar distances to points in other clusters.
The prior on the clustering structure can be chosen according to the application in question. The current version of RedClust implements a microclustering prior (Miller et al., 2015; Betancourt et al., 2022). This means that the partition is generated by drawing cluster sizes $n_1, \ldots, n_K$ from a random distribution $\nu$ (conditional upon $n_1 + \ldots + n_K = n$, where $n$ is the number of observations), and cluster labels are given by a uniform random permutation of
\[\underbrace{1, \ldots, 1}_{n_1 \text{ times}}, \ldots, \underbrace{K, \ldots, K}_{n_K \text{ times}}\]
RedClust implements a microclustering prior with $\nu$ a shifted negative binomial distribution with random parameters $r$ and $p$, which are sampled from a Gamma and a Beta distribution respectively:
\[\begin{align*} r &\sim \mathrm{Gamma}(\eta, \sigma)\\ p &\sim \mathrm{Beta}(u, v) \end{align*}\]
This results in the following prior on the partition $\rho_n$, where $K$ denotes the number of clusters in the partition:
\[\pi(\rho_n \mid r, p) \propto K! p^{n-K}(1-p)^{rK}\Gamma(r)^{-K}\prod_{k=1}^K n_k \Gamma(n_k+r-1)\]
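As a reading aid, here is a minimal sketch of drawing a partition from this prior; it is not the RedClust implementation. The function name samplepartition, the hyperparameter values, and the Gamma/negative-binomial conventions from Distributions.jl are illustrative assumptions: $r$ and $p$ are sampled, cluster sizes are drawn from the shifted negative binomial conditioned on summing to $n$ (here by simple rejection), and labels are assigned by a uniform random permutation.
using Distributions, Random

# Hypothetical sketch: draw a partition of n points from the microclustering prior.
function samplepartition(n; η = 1.0, σ = 1.0, u = 1.0, v = 1.0)
    r = rand(Gamma(η, 1 / σ))           # Gamma(shape, scale) in Distributions.jl; σ treated as a rate here
    p = rand(Beta(u, v))
    ν = NegativeBinomial(r, 1 - p)      # one common convention for the (r, p) parameterisation
    # Rejection step: redraw the cluster sizes until they sum exactly to n
    sizes = Int[]
    while sum(sizes) != n
        sizes = Int[]
        while sum(sizes) < n
            push!(sizes, 1 + rand(ν))   # shift by 1 so that every cluster is non-empty
        end
    end
    # Cluster labels (1,...,1, ..., K,...,K) in a uniformly random order
    return shuffle(reduce(vcat, [fill(k, nk) for (k, nk) in enumerate(sizes)]))
end

samplepartition(20)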
The clustering likelihood involves cluster-specific cohesion parameters $\lambda_k$ and inter-cluster repulsion parameters $\theta_{kt}$, which are given Gamma priors
\[\begin{align*} \lambda_k &\overset{\mathrm{iid}}{\sim} \mathrm{Gamma}(\alpha, \beta), \quad 1 \leq k \leq K\\ \theta_{kt} &\overset{\mathrm{iid}}{\sim} \mathrm{Gamma}(\zeta, \gamma), \quad 1 \leq k < t \leq K \end{align*}\]
The likelihood of the dissimilarity matrix given the partition and these parameters is then
\[\pi(D \mid \rho_n, \boldsymbol{\lambda}, \boldsymbol{\theta}, r, p) = \left[ \prod_{k=1}^K \prod_{\substack{i, j \in C_k \\ i \neq j}} \frac{D_{ij}^{\delta_1 - 1}\lambda_k^{\delta_1}}{\Gamma(\delta_1)} \exp(-\lambda_k D_{ij}) \right] \left[\prod_{\substack{t, k = 1 \\ t \neq k}}^{K} \prod_{\substack{i \in C_k \\ j \in C_t}} \frac{D_{ij}^{\delta_2-1}\theta_{kt}^{\delta_2}}{\Gamma(\delta_2)} \exp(-\theta_{kt}D_{ij}) \right]\]
where $\boldsymbol D$ is the matrix of pairwise dissimilarities between observations.
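The following hedged sketch evaluates the log of this composite likelihood for given inputs; it is not part of the RedClust API. The function name and arguments are hypothetical: a symmetric dissimilarity matrix D, integer cluster labels clusts, a vector λ of cohesion rates, a symmetric matrix θ of repulsion rates, and the shape parameters δ1 and δ2. Ordered pairs are counted exactly as written in the two products above.
using SpecialFunctions: loggamma

# Hypothetical sketch: log composite likelihood of D given a partition and its parameters.
function logcompositelikelihood(D, clusts, λ, θ, δ1, δ2)
    n = length(clusts)
    ll = 0.0
    for i in 1:n, j in 1:n
        i == j && continue
        k, t = clusts[i], clusts[j]
        if k == t   # cohesive part: distances within cluster k
            ll += (δ1 - 1) * log(D[i, j]) + δ1 * log(λ[k]) - loggamma(δ1) - λ[k] * D[i, j]
        else        # repulsive part: distances between clusters k and t (θ assumed symmetric)
            ll += (δ2 - 1) * log(D[i, j]) + δ2 * log(θ[k, t]) - loggamma(δ2) - θ[k, t] * D[i, j]
        end
    end
    return ll
end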
Point estimation
A clustering point-estimate $\boldsymbol c^*$ can be determined in a number of ways.
- Choosing the sample with the maximum likelihood or the maximum posterior density. This corresponds to an approximate MLE or MAP estimate, with the search restricted to the MCMC samples.
- Searching for a clustering that minimises the posterior expectation of a loss function $\ell$ such as the Binder loss (Binder, 1978), the Variation of Information distance (Meilă, 2007; Wade and Ghahramani, 2018), or the Normalised Information Distance (Kraskov et al., 2005). That is,
\[\boldsymbol c^* = \argmin_{\boldsymbol c} \mathbb{E}_{\boldsymbol c'}[\ell(\boldsymbol c, \boldsymbol c')]\]
- A naive method is to restrict the search space to those clusterings visited by the MCMC sampler. This method is implemented in RedClust in the function getpointestimate.
- A better method is the SALSO algorithm (Dahl et al., 2022), implemented in the R package salso, which heuristically searches the space of all possible clusterings. You can use the RCall package in Julia to make calls to the salso package in R if you have R and salso installed; a minimal sketch is given below.
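A hedged sketch of the RCall route: it assumes that R and the salso package are installed, and that the sampled cluster label vectors are available from the result object of the basic example above (here read from result.clusts, one label vector per MCMC iteration; check the field name in your RedClust version).
using RCall

# Build a (samples × observations) matrix of cluster labels, one row per MCMC sample
clustsamples = permutedims(reduce(hcat, result.clusts))
@rput clustsamples
R"""
library(salso)
salso_estimate <- salso(clustsamples)
"""
@rget salso_estimate   # point estimate as a vector of cluster labels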
Bibliography
Betancourt, B., Zanella, G. and Steorts, R. C. (2022), ‘Random partition models for microclustering tasks’, Journal of the American Statistical Association 117(539), 1215–1227. DOI: 10.1080/01621459.2020.1841647.
Binder, D. A. (1978), ‘Bayesian cluster analysis’, Biometrika 65(1), 31–38. DOI: 10.2307/2335273.
Dahl, D. B., Johnson, D. J. and Müller, P. (2022), ‘Search algorithms and loss functions for Bayesian clustering’, Journal of Computational and Graphical Statistics 0(0), 1–13. DOI: 10.1080/10618600.2022.2069779.
Kraskov, A., Stögbauer, H., Andrzejak, R. G. and Grassberger, P. (2005), ‘Hierarchical clustering using mutual information’, Europhysics Letters (EPL) 70(2), 278–284. DOI: 10.1209/epl/i2004-10483-y.
Meilă, M. (2007), ‘Comparing clusterings—an information based distance’, Journal of Multivariate Analysis 98(5), 873–895. DOI: 10.1016/j.jmva.2006.11.013.
Miller, J., Betancourt, B., Zaidi, A., Wallach, H. and Steorts, R. C. (2015), ‘Microclustering: When the cluster sizes grow sublinearly with the size of the data set’, arXiv 1512.00792.
Natarajan, A., De Iorio, M., Heinecke, A., Mayer, E. and Glenn, S. (2023), ‘Cohesion and Repulsion in Bayesian Distance Clustering’, Journal of the American Statistical Association. DOI: 10.1080/01621459.2023.2191821.
Wade, S. and Ghahramani, Z. (2018), ‘Bayesian Cluster Analysis: Point Estimation and Credible Balls (with Discussion)’, Bayesian Analysis 13(2), 559–626. DOI: 10.1214/17-BA1073.