Public API
Main functions
RedClust.fitprior — Function

```julia
fitprior(data, algo, diss = false;
    Kmin = 1,
    Kmax = Int(floor(size(data)[end] / 2)),
    verbose = true)
```
Determines the best prior hyperparameters from the data. A notional clustering is obtained using k-means or k-medoids and the elbow method, and the distances are split into within-cluster distances and inter-cluster distances based on the notional clustering. These distances are then used to fit the prior hyperparameters using MLE and empirical Bayes sampling.
Required Arguments
- `data::Union{Vector{Vector{Float64}}, Matrix{Float64}}`: either a vector of (possibly multi-dimensional) observations, a matrix with each column an observation, or a square matrix of pairwise dissimilarities.
- `algo::String`: must be one of `"k-means"` or `"k-medoids"`.
Optional Arguments
- `diss::Bool`: if true, `data` will be assumed to be a pairwise dissimilarity matrix.
- `Kmin::Integer`: minimum number of clusters to scan for the elbow method.
- `Kmax::Integer`: maximum number of clusters to scan for the elbow method. If left unspecified, it is set to half the number of observations.
- `verbose::Bool`: if false, disables all info messages and progress bars.
Returns
An object of type `PriorHyperparamsList`.
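A minimal usage sketch (the data here is illustrative):

```julia
using RedClust

# 100 two-dimensional observations, stored as a vector of vectors
data = [randn(2) for _ in 1:100]

# Fit the prior hyperparameters using a notional k-means clustering,
# scanning K = 1 to 10 with the elbow method
params = fitprior(data, "k-means"; Kmax = 10)
```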
RedClust.fitprior2 — Function

```julia
fitprior2(data, algo, diss = false;
    Kmin = 1,
    Kmax = Int(floor(size(data)[end] / 2)),
    verbose = true)
```
Determines the best prior hyperparameters from the data. Uses the same method as `fitprior` to obtain values for $\sigma$, $\eta$, $u$, and $v$, but derives values for the cluster-specific parameters by considering within-cluster and cross-cluster distances over clusterings with $K$ clusters for all values of $K \in [K_\mathrm{min}, K_\mathrm{max}]$, weighted by the prior predictive distribution of $K$ in that range.
Required Arguments
- `data::Union{Vector{Vector{Float64}}, Matrix{Float64}}`: either a vector of (possibly multi-dimensional) observations, a matrix with each column an observation, or a square matrix of pairwise dissimilarities.
- `algo::String`: must be one of `"k-means"` or `"k-medoids"`.
Optional Arguments
- `diss::Bool`: if true, `data` will be assumed to be a pairwise dissimilarity matrix.
- `Kmin::Integer`: minimum number of clusters to scan for the elbow method.
- `Kmax::Integer`: maximum number of clusters to scan for the elbow method. If left unspecified, it is set to half the number of observations.
- `verbose::Bool`: if false, disables all info messages and progress bars.
Returns
An object of type `PriorHyperparamsList`.
RedClust.sampleK — Method

```julia
sampleK(params::PriorHyperparamsList, numsamples::Int, n::Int)::Vector{Int}
sampleK(η::Real, σ::Real, u::Real, v::Real, numsamples::Int, n::Int)
```
Returns a vector of length `numsamples` containing samples of $K$ (the number of clusters) generated from its marginal prior predictive distribution as inferred from `params`. The parameter `n` is the number of observations in the model.
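For example, to inspect the prior predictive distribution of $K$ implied by the fitted hyperparameters (continuing from `params` above):

```julia
# Draw 1000 prior predictive samples of K for a model with 100 observations
Ks = sampleK(params, 1000, 100)
```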
RedClust.sampledist — Function

```julia
sampledist(params::PriorHyperparamsList, type::String, numsamples = 1)::Vector{Float64}
```

Generate a vector of samples of length `numsamples` from the prior predictive distribution on the distances, as encapsulated in `params`. `type` must be either `"intercluster"` or `"intracluster"`.
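For example:

```julia
# Prior predictive distances within and between clusters
d_within  = sampledist(params, "intracluster", 500)
d_between = sampledist(params, "intercluster", 500)
```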
RedClust.runsampler — Function

```julia
runsampler(data,
    options = MCMCOptionsList(),
    params = PriorHyperparamsList();
    verbose = true) -> MCMCResult
```
Runs the MCMC sampler on the data.
Arguments
- `data::MCMCData`: contains the distance matrix.
- `options::MCMCOptionsList`: contains the number of iterations, burn-in, etc.
- `params::PriorHyperparamsList`: contains the prior hyperparameters for the model.
- `verbose::Bool`: if false, disables all info messages and progress bars.
Returns
A struct of type `MCMCResult` containing the MCMC samples, convergence diagnostics, and summary statistics.
See also
`MCMCData`, `MCMCOptionsList`, `fitprior`, `PriorHyperparamsList`, `MCMCResult`.
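Putting the pieces together, an end-to-end run might look like this (a sketch; the data and option values are illustrative):

```julia
using RedClust

points = [randn(2) for _ in 1:100]           # illustrative data
params = fitprior(points, "k-means")         # prior hyperparameters
options = MCMCOptionsList(numiters = 10000)  # sampler options
result = runsampler(MCMCData(points), options, params)
```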
Point estimation
RedClust.binderloss — Method

```julia
binderloss(a::ClustLabelVector, b::ClustLabelVector;
    normalised = true) -> Float64
```

Computes the Binder loss between two clusterings. If `normalised = true`, the result is equal to one minus the Rand index between `a` and `b`.
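For example, on two small label vectors:

```julia
a = [1, 1, 2, 2, 3]
b = [1, 1, 1, 2, 2]
binderloss(a, b)                      # one minus the Rand index
binderloss(a, b; normalised = false)  # unnormalised Binder loss
```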
RedClust.getpointestimate — Method

```julia
getpointestimate(samples::MCMCResult; method = "MAP", loss = "VI")
```
Computes a point estimate from a vector of samples of cluster allocations by searching for a minimiser of the posterior expectation of some loss function.
Arguments
- `method::String`: must be one of the following -
    - `"MAP"`: maximum a posteriori.
    - `"MLE"`: maximum likelihood estimation.
    - `"MPEL"`: minimum posterior expected loss.
  `"MAP"` and `"MLE"` search among the MCMC samples for the clustering with the maximum log posterior and log likelihood respectively. `"MPEL"` searches for a clustering that minimises the posterior expected loss of some loss function specified by the `loss` argument. The search space is the set of samples in `samples`.
- `loss`: determines the loss function used for the `"MPEL"` method. Must be either a string or a function. If specified as a string, it must be one of `"binder"` (Binder loss), `"omARI"` (one minus the Adjusted Rand Index), `"VI"` (Variation of Information distance), or `"ID"` (Information Distance). If specified as a function, it must have a method defined for `(x::Vector{Int}, y::Vector{Int}) -> Real`.
Returns
A tuple `(clust, i)` where `clust` is a clustering allocation in `samples` and `i` is its sample index.
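Continuing from the sampler run above, an illustrative call:

```julia
# MAP estimate (the default)
pointestimate, index = getpointestimate(result)

# Minimum posterior expected Binder loss instead
pointestimate, index = getpointestimate(result; method = "MPEL", loss = "binder")
```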
RedClust.infodist — Method

```julia
infodist(a::ClustLabelVector, b::ClustLabelVector;
    normalised = true) -> Float64
```
Computes the information distance between two clusterings. The information distance is defined as
\[d_{\mathrm{ID}}(a, b) = \max \{H(A), H(B)\} - I(A, B)\]
where $A$ and $B$ are the cluster membership probability functions for $a$ and $b$ respectively, $H$ denotes the entropy of a distribution, and $I$ denotes the mutual information between two distributions. The information distance has range $[0, \log N]$ where $N$ is the number of observations. If `normalised = true`, the result is scaled to the range $[0, 1]$.
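Using the same illustrative label vectors `a` and `b` as above:

```julia
infodist(a, b)                      # scaled to [0, 1]
infodist(a, b; normalised = false)  # in [0, log N]
```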
Summary and clustering evaluation
RedClust.evaluateclustering — Method

```julia
evaluateclustering(clusts::ClustLabelVector, truth::ClustLabelVector)
```

Returns a named tuple of values quantifying the accuracy of the clustering assignments in `clusts` with respect to the ground truth clustering assignments in `truth`.
Return values
- `nbloss`: Normalised Binder loss (= 1 - Rand index). Lower is better.
- `ari`: Adjusted Rand index. Higher is better.
- `vi`: Variation of Information distance ($\in [0, \log N]$). Lower is better.
- `nvi`: Normalised VI distance ($\in [0, 1]$).
- `id`: Information Distance ($\in [0, \log N]$). Lower is better.
- `nid`: Normalised Information Distance ($\in [0, 1]$).
- `nmi`: Normalised Mutual Information ($\in [0, 1]$). Higher is better.
RedClust.summarise — Method

```julia
summarise([io::IO], clusts::ClustLabelVector, truth::ClustLabelVector, printoutput = true) -> String
```

Prints a summary of the clustering accuracy of `clusts` with respect to the ground truth in `truth`. The output is printed to the output stream `io`, which defaults to `stdout` if not provided.
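For example, comparing a point estimate against ground-truth labels (`truth` here is an illustrative name):

```julia
metrics = evaluateclustering(pointestimate, truth)
metrics.ari                      # the adjusted Rand index, for instance
summarise(pointestimate, truth)  # print a readable summary to stdout
```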
Convenience functions
RedClust.adjacencymatrix — Method

```julia
adjacencymatrix(clusts::ClustLabelVector) -> Matrix{Bool}
```

Returns the `n` × `n` adjacency matrix corresponding to the given cluster label vector `clusts`, where `n = length(clusts)`.
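For example:

```julia
A = adjacencymatrix([1, 1, 2])  # A[i, j] is true iff points i and j share a cluster
```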
RedClust.generatemixture — Method

```julia
generatemixture(N, K; [α, dim, radius, σ, rng])
```

Generates a multivariate Normal mixture, with kernel weights generated from a Dirichlet prior. The kernels are centred at the vertices of a `dim`-dimensional simplex with edge length `radius`.
Required Arguments
- `N::Integer`: number of observations to generate.
- `K::Integer`: number of mixture kernels.
Optional Arguments
- `α::Float64 = K`: parameter for the Dirichlet prior.
- `dim::Integer = K`: dimension of the observations.
- `radius::Float64 = 1`: radius of the simplex whose vertices are the kernel means.
- `σ::Float64 = 0.1`: variance of each kernel.
- `rng`: a random number generator to use, or an integer to seed the default random number generator with. If not provided, the default RNG provided by the Random.jl package will be used with default seeding.
Returns
A named tuple containing the following fields:

- `points::Vector{Vector{Float64}}`: a vector of `N` observations.
- `distancematrix::Matrix{Float64}`: an `N` × `N` matrix of pairwise Euclidean distances between the observations.
- `clusts::ClustLabelVector`: vector of `N` cluster assignments.
- `probs::Vector{Float64}`: vector of `K` cluster weights generated from the Dirichlet prior, used to generate the observations.
- `oracle_coclustering::Matrix{Float64}`: `N` × `N` matrix of co-clustering probabilities, calculated assuming full knowledge of the cluster centres and cluster weights.
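A brief usage sketch, unpacking the returned tuple positionally:

```julia
# N = 100 points from K = 5 kernels, seeding the default RNG with 42
pnts, distmatrix, clusts, probs, oracle = generatemixture(100, 5; α = 10, dim = 10, rng = 42)
```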
RedClust.makematrix — Method

```julia
makematrix(x::AbstractVector{<:AbstractVector}) -> Matrix
```

Convert a vector of vectors into a matrix, where each vector becomes a column in the matrix.
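For example:

```julia
makematrix([[1.0, 2.0], [3.0, 4.0]])  # 2×2 matrix with columns [1.0, 2.0] and [3.0, 4.0]
```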
RedClust.sortlabels — Method

```julia
sortlabels(x::ClustLabelVector) -> ClustLabelVector
```

Returns a cluster label vector `y` such that `x` and `y` have the same adjacency structure and the labels in `y` occur in sorted ascending order.
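An illustrative sketch of the intended behaviour (the exact relabelling shown is an assumption; any relabelling with the same adjacency structure and ascending labels satisfies the contract):

```julia
sortlabels([3, 3, 1, 2])  # e.g. [1, 1, 2, 3]; adjacency structure is unchanged
```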
Example datasets
RedClust.example_dataset — Method

```julia
example_dataset(n::Int)
```

Returns a named tuple containing the dataset from the $n^{\mathrm{th}}$ simulated example in the main paper. This dataset was generated using the following code in Julia v1.8.1:
```julia
using RedClust
using Random: seed!
seed!(44)
K = 10 # Number of clusters
N = 100 # Number of points
data_σ = [0.25, 0.2, 0.18][n] # Variance of the normal kernel
data_dim = [10, 50, 10][n] # Data dimension
data = generatemixture(N, K; α = 10, σ = data_σ, dim = data_dim);
```
Note however that the above code may produce different results on your computer because the random number generator in Julia is not meant for reproducible results across different computers, different versions of Julia, or different versions of the Random.jl package, even with appropriate seeding. Therefore the datasets have been included with this package, and it is recommended to access them via this function.
See also `generatemixture`.
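A usage sketch, assuming the returned tuple mirrors the output of `generatemixture`:

```julia
pnts, distmatrix, clusts, probs, oracle = example_dataset(1)
```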
RedClust.example_datasets — Method

Return a read-only handle to an HDF5 file that contains the example datasets from the main paper. You must remember to close the file once you are done reading from it. This function is provided for reproducibility purposes only; it is recommended to read the datasets via the convenience function `example_dataset`.
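A sketch of safe usage, assuming the handle behaves like a standard HDF5.jl file object:

```julia
using HDF5
f = example_datasets()
try
    # ... read datasets from the file handle here ...
finally
    close(f)  # always release the handle
end
```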
Types
RedClust.ClustLabelVector — Type

Alias for `Vector{Int}`.
RedClust.MCMCData — Type

```julia
MCMCData(points::AbstractVector{<:AbstractVector{<:Float64}})
MCMCData(dissimilaritymatrix::AbstractMatrix{Float64})
```

Contains the pairwise dissimilarities for the MCMC sampler.
RedClust.MCMCOptionsList — Type

```julia
MCMCOptionsList(;
    numiters = 5000,
    burnin = floor(0.2 * numiters),
    thin = 1,
    numGibbs = 5,
    numMH = 1)
```

List of options for running the MCMC.
Constructor arguments

- `numiters::Integer`: number of iterations to run.
- `burnin::Integer`: number of iterations to discard as burn-in.
- `thin::Integer`: will keep every `thin` samples.
- `numGibbs::Integer`: number of intermediate Gibbs scans in the split-merge step.
- `numMH::Integer`: number of split-merge steps per MCMC iteration.
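For example:

```julia
# Run 20000 iterations, discard the first 5000, and keep every 5th sample
options = MCMCOptionsList(numiters = 20000, burnin = 5000, thin = 5)
```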
RedClust.MCMCResult — Type

Struct containing MCMC samples.
Fields
- `options::MCMCOptionsList`: options passed to the sampler.
- `params::PriorHyperparamsList`: prior hyperparameters used by the sampler.
- `clusts::Vector{ClustLabelVector}`: contains the clustering allocations. `clusts[i]` is a vector containing the clustering allocation from the `i`th sample.
- `posterior_coclustering::Matrix{Float64}`: the posterior coclustering matrix.
- `K::Vector{Int}`: posterior samples of $K$, i.e., the number of clusters. `K[i]` is the number of clusters in `clusts[i]`.
- `r::Vector{Float64}`: posterior samples of the parameter $r$.
- `p::Vector{Float64}`: posterior samples of the parameter $p$.
- `K_mean`, `r_mean`, `p_mean`: posterior mean of `K`, `r`, and `p` respectively.
- `K_variance`, `r_variance`, `p_variance`: posterior variance of `K`, `r`, and `p` respectively.
- `K_acf::Vector{Float64}`, `r_acf::Vector{Float64}`, `p_acf::Vector{Float64}`: autocorrelation function for `K`, `r`, and `p` respectively.
- `K_iac`, `r_iac`, `p_iac`: integrated autocorrelation coefficient for `K`, `r`, and `p` respectively.
- `K_ess::Float64`, `r_ess::Float64`, `p_ess::Float64`: effective sample size for `K`, `r`, and `p` respectively.
- `loglik::Vector{Float64}`: log-likelihood for each sample.
- `logposterior::Vector{Float64}`: a function proportional to the log-posterior for each sample, with constant of proportionality equal to the normalising constant of the partition prior.
- `splitmerge_splits`: Boolean vector indicating the iterations when a split proposal was used in the split-merge step. Has length `numMH * numiters` (see `MCMCOptionsList`).
- `splitmerge_acceptance_rate`: acceptance rate of the split-merge proposals.
- `r_acceptances`: Boolean vector indicating the iterations (including burn-in and the thinned-out iterations) where the Metropolis-Hastings proposal for `r` was accepted.
- `r_acceptance_rate`: Metropolis-Hastings acceptance rate for `r`.
- `runtime`: total runtime for all iterations.
- `mean_iter_time`: average time taken for each iteration.
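For example, a few diagnostics can be read directly off the result of a sampler run:

```julia
result.K_mean, result.K_ess        # posterior mean and effective sample size of K
result.splitmerge_acceptance_rate  # acceptance rate of split-merge proposals
```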
RedClust.PriorHyperparamsList — Type

```julia
PriorHyperparamsList(; [kwargs])
```

Contains the prior hyperparameters for the model. It is recommended to set the values using the `fitprior` function. Do not set these values manually unless you know what you are doing.
Constructor arguments
- `δ1::Float64 = 1`: the parameter $\delta_1$.
- `α::Float64 = 1`: the shape parameter $\alpha$ in the prior for each $\lambda_k$.
- `β::Float64 = 1`: the rate parameter $\beta$ in the prior for each $\lambda_k$.
- `δ2::Float64 = 1`: the parameter $\delta_2$.
- `ζ::Float64 = 1`: the shape parameter $\zeta$ in the prior for each $\theta_{kt}$.
- `γ::Float64 = 1`: the rate parameter $\gamma$ in the prior for each $\theta_{kt}$.
- `η::Float64 = 1`: the shape parameter $\eta$ in the prior for $r$.
- `σ::Float64 = 1`: the rate parameter $\sigma$ in the prior for $r$.
- `u::Float64 = 1`: the parameter $u$ in the prior for $p$.
- `v::Float64 = 1`: the parameter $v$ in the prior for $p$.
- `proposalsd_r::Float64 = 1`: standard deviation of the truncated Normal proposal when sampling `r` via a Metropolis-Hastings step.
- `K_initial::Int = 1`: initial number of clusters to start the MCMC.
- `repulsion::Bool = true`: whether to use the repulsive component of the likelihood.
- `maxK::Int = 0`: maximum number of clusters to allow. If set to 0 then no maximum is imposed.
See also `fitprior`.