Public API
Main functions
RedClust.fitprior — Function

```julia
fitprior(data, algo, diss = false;
    Kmin = 1,
    Kmax = Int(floor(size(data)[end] / 2)),
    verbose = true)
```
Determines the best prior hyperparameters from the data. A notional clustering is obtained using k-means or k-medoids and the elbow method, and the distances are split into within-cluster distances and inter-cluster distances based on the notional clustering. These distances are then used to fit the prior hyperparameters using MLE and empirical Bayes sampling.
Required Arguments
- `data::Union{Vector{Vector{Float64}}, Matrix{Float64}}`: either a vector of (possibly multi-dimensional) observations, a matrix with each column an observation, or a square matrix of pairwise dissimilarities.
- `algo::String`: must be one of `"k-means"` or `"k-medoids"`.
Optional Arguments
- `diss::Bool`: if true, `data` will be assumed to be a pairwise dissimilarity matrix.
- `Kmin::Integer`: minimum number of clusters to scan for the elbow method.
- `Kmax::Integer`: maximum number of clusters to scan for the elbow method. If left unspecified, it is set to half the number of observations.
- `verbose::Bool`: if false, disables all info messages and progress bars.
Returns
An object of type `PriorHyperparamsList`.
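A minimal usage sketch (the data here is illustrative):

```julia
using RedClust

# 100 two-dimensional observations, stored as a vector of vectors
data = [randn(2) for _ in 1:100]

# Fit the prior hyperparameters using a notional k-means clustering,
# scanning K = 1 to 10 with the elbow method
params = fitprior(data, "k-means"; Kmax = 10)
```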
RedClust.fitprior2 — Function

```julia
fitprior2(data, algo, diss = false;
    Kmin = 1,
    Kmax = Int(floor(size(data)[end] / 2)),
    verbose = true)
```
Determines the best prior hyperparameters from the data. Uses the same method as `fitprior` to obtain values for $\sigma$, $\eta$, $u$, and $v$, but derives values for the cluster-specific parameters by considering within-cluster and cross-cluster distances over clusterings with $K$ clusters for all values of $K \in [K_\mathrm{min}, K_\mathrm{max}]$, weighted by the prior predictive distribution of $K$ in that range.
Required Arguments
- `data::Union{Vector{Vector{Float64}}, Matrix{Float64}}`: either a vector of (possibly multi-dimensional) observations, a matrix with each column an observation, or a square matrix of pairwise dissimilarities.
- `algo::String`: must be one of `"k-means"` or `"k-medoids"`.
Optional Arguments
- `diss::Bool`: if true, `data` will be assumed to be a pairwise dissimilarity matrix.
- `Kmin::Integer`: minimum number of clusters to scan for the elbow method.
- `Kmax::Integer`: maximum number of clusters to scan for the elbow method. If left unspecified, it is set to half the number of observations.
- `verbose::Bool`: if false, disables all info messages and progress bars.
Returns
An object of type `PriorHyperparamsList`.
RedClust.sampleK — Method

```julia
sampleK(params::PriorHyperparamsList, numsamples::Int, n::Int)::Vector{Int}
sampleK(η::Real, σ::Real, u::Real, v::Real, numsamples::Int, n::Int)
```
Returns a vector of length `numsamples` containing samples of $K$ (the number of clusters) generated from its marginal prior predictive distribution as inferred from `params`. The parameter `n` is the number of observations in the model.
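For example, to inspect the prior predictive distribution of $K$ implied by the fitted hyperparameters (continuing from `params` above):

```julia
# Draw 1000 prior predictive samples of K for a model with 100 observations
Ks = sampleK(params, 1000, 100)
```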
RedClust.sampledist — Function

```julia
sampledist(params::PriorHyperparamsList, type::String, numsamples = 1)::Vector{Float64}
```

Generate a vector of samples of length `numsamples` from the prior predictive distribution on the distances, as encapsulated in `params`. `type` must be either `"intercluster"` or `"intracluster"`.
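For example:

```julia
# Prior predictive distances within and between clusters
d_within  = sampledist(params, "intracluster", 500)
d_between = sampledist(params, "intercluster", 500)
```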
RedClust.runsampler — Function

```julia
runsampler(data,
    options = MCMCOptionsList(),
    params = PriorHyperparamsList();
    verbose = true) -> MCMCResult
```
Runs the MCMC sampler on the data.
Arguments
- `data::MCMCData`: contains the distance matrix.
- `options::MCMCOptionsList`: contains the number of iterations, burn-in, etc.
- `params::PriorHyperparamsList`: contains the prior hyperparameters for the model.
- `verbose::Bool`: if false, disables all info messages and progress bars.
Returns
A struct of type `MCMCResult` containing the MCMC samples, convergence diagnostics, and summary statistics.
See also
`MCMCData`, `MCMCOptionsList`, `fitprior`, `PriorHyperparamsList`, `MCMCResult`.
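Putting the pieces together, an end-to-end run might look like this (a sketch; the data and option values are illustrative):

```julia
using RedClust

points = [randn(2) for _ in 1:100]           # illustrative data
params = fitprior(points, "k-means")         # prior hyperparameters
options = MCMCOptionsList(numiters = 10000)  # sampler options
result = runsampler(MCMCData(points), options, params)
```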
Point estimation
RedClust.binderloss — Method

```julia
binderloss(a::ClustLabelVector, b::ClustLabelVector;
    normalised = true) -> Float64
```

Computes the Binder loss between two clusterings. If `normalised = true`, the result is equal to one minus the Rand index between `a` and `b`.
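For example, on two small label vectors:

```julia
a = [1, 1, 2, 2, 3]
b = [1, 1, 1, 2, 2]
binderloss(a, b)                      # one minus the Rand index
binderloss(a, b; normalised = false)  # unnormalised Binder loss
```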
RedClust.getpointestimate — Method

```julia
getpointestimate(samples::MCMCResult; method = "MAP", loss = "VI")
```
Computes a point estimate from a vector of samples of cluster allocations by searching for a minimiser of the posterior expectation of some loss function.
Arguments
- `method::String`: must be one of the following -
    - `"MAP"`: maximum a posteriori.
    - `"MLE"`: maximum likelihood estimation.
    - `"MPEL"`: minimum posterior expected loss.
  `"MAP"` and `"MLE"` search among the MCMC samples for the clustering with the maximum log posterior and log likelihood respectively. `"MPEL"` searches for a clustering that minimises the posterior expected loss of some loss function specified by the `loss` argument. The search space is the set of samples in `samples`.
- `loss`: determines the loss function used for the `"MPEL"` method. Must be either a string or a function. If specified as a string, it must be one of `"binder"` (Binder loss), `"omARI"` (one minus the Adjusted Rand Index), `"VI"` (Variation of Information distance), or `"ID"` (Information Distance). If specified as a function, it must have a method defined for `(x::Vector{Int}, y::Vector{Int}) -> Real`.
Returns
A tuple `(clust, i)` where `clust` is a clustering allocation in `samples` and `i` is its sample index.
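Continuing from the sampler run above, an illustrative call:

```julia
# MAP estimate (the default)
pointestimate, index = getpointestimate(result)

# Minimum posterior expected Binder loss instead
pointestimate, index = getpointestimate(result; method = "MPEL", loss = "binder")
```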
RedClust.infodist — Method

```julia
infodist(a::ClustLabelVector, b::ClustLabelVector;
    normalised = true) -> Float64
```
Computes the information distance between two clusterings. The information distance is defined as
\[d_{\mathrm{ID}}(a, b) = \max \{H(A), H(B)\} - I(A, B)\]
where $A$ and $B$ are the cluster membership probability functions for $a$ and $b$ respectively, $H$ denotes the entropy of a distribution, and $I$ denotes the mutual information between two distributions. The information distance has range $[0, \log N]$ where $N$ is the number of observations. If `normalised = true`, the result is scaled to the range $[0, 1]$.
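Using the same illustrative label vectors `a` and `b` as above:

```julia
infodist(a, b)                      # scaled to [0, 1]
infodist(a, b; normalised = false)  # in [0, log N]
```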
Summary and clustering evaluation
RedClust.evaluateclustering — Method

```julia
evaluateclustering(clusts::ClustLabelVector, truth::ClustLabelVector)
```

Returns a named tuple of values quantifying the accuracy of the clustering assignments in `clusts` with respect to the ground truth clustering assignments in `truth`.
Return values
- `nbloss`: Normalised Binder loss (= 1 - Rand index). Lower is better.
- `ari`: Adjusted Rand index. Higher is better.
- `vi`: Variation of Information distance ($\in [0, \log N]$). Lower is better.
- `nvi`: Normalised VI distance ($\in [0, 1]$).
- `id`: Information Distance ($\in [0, \log N]$). Lower is better.
- `nid`: Normalised Information Distance ($\in [0, 1]$).
- `nmi`: Normalised Mutual Information ($\in [0, 1]$). Higher is better.
RedClust.summarise — Method

```julia
summarise([io::IO], clusts::ClustLabelVector, truth::ClustLabelVector, printoutput = true) -> String
```

Prints a summary of the clustering accuracy of `clusts` with respect to the ground truth in `truth`. The output is printed to the output stream `io`, which defaults to `stdout` if not provided.
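For example, comparing a point estimate against ground-truth labels (`truth` here is an illustrative name):

```julia
metrics = evaluateclustering(pointestimate, truth)
metrics.ari                      # the adjusted Rand index, for instance
summarise(pointestimate, truth)  # print a readable summary to stdout
```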
Convenience functions
RedClust.adjacencymatrix — Method

```julia
adjacencymatrix(clusts::ClustLabelVector) -> Matrix{Bool}
```

Returns the `n` × `n` adjacency matrix corresponding to the given cluster label vector `clusts`, where `n = length(clusts)`.
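For example:

```julia
A = adjacencymatrix([1, 1, 2])  # A[i, j] is true iff points i and j share a cluster
```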
RedClust.generatemixture — Method

```julia
generatemixture(N, K; [α, dim, radius, σ, rng])
```

Generates a multivariate Normal mixture, with kernel weights generated from a Dirichlet prior. The kernels are centred at the vertices of a `dim`-dimensional simplex with edge length `radius`.
Required Arguments
- `N::Integer`: number of observations to generate.
- `K::Integer`: number of mixture kernels.
Optional Arguments
- `α::Float64 = K`: parameter for the Dirichlet prior.
- `dim::Integer = K`: dimension of the observations.
- `radius::Float64 = 1`: radius of the simplex whose vertices are the kernel means.
- `σ::Float64 = 0.1`: variance of each kernel.
- `rng`: a random number generator to use, or an integer to seed the default random number generator with. If not provided, the default RNG provided by the Random.jl package will be used with default seeding.
Returns
A named tuple containing the following fields:

- `points::Vector{Vector{Float64}}`: a vector of `N` observations.
- `distancematrix::Matrix{Float64}`: an `N` × `N` matrix of pairwise Euclidean distances between the observations.
- `clusts::ClustLabelVector`: vector of `N` cluster assignments.
- `probs::Vector{Float64}`: vector of `K` cluster weights generated from the Dirichlet prior, used to generate the observations.
- `oracle_coclustering::Matrix{Float64}`: `N` × `N` matrix of co-clustering probabilities, calculated assuming full knowledge of the cluster centres and cluster weights.
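A brief usage sketch, unpacking the returned tuple positionally:

```julia
# N = 100 points from K = 5 kernels, seeding the default RNG with 42
pnts, distmatrix, clusts, probs, oracle = generatemixture(100, 5; α = 10, dim = 10, rng = 42)
```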
RedClust.makematrix — Method

```julia
makematrix(x::AbstractVector{<:AbstractVector}) -> Matrix
```

Convert a vector of vectors into a matrix, where each vector becomes a column in the matrix.
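For example:

```julia
makematrix([[1.0, 2.0], [3.0, 4.0]])  # 2×2 matrix with columns [1.0, 2.0] and [3.0, 4.0]
```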
RedClust.sortlabels — Method

```julia
sortlabels(x::ClustLabelVector) -> ClustLabelVector
```

Returns a cluster label vector `y` such that `x` and `y` have the same adjacency structure and the labels in `y` occur in sorted ascending order.
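An illustrative sketch of the intended behaviour (the exact relabelling shown is an assumption; any relabelling with the same adjacency structure and ascending labels satisfies the contract):

```julia
sortlabels([3, 3, 1, 2])  # e.g. [1, 1, 2, 3]; adjacency structure is unchanged
```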
Example datasets
RedClust.example_dataset — Method

```julia
example_dataset(n::Int)
```

Returns a named tuple containing the dataset from the $n^{\mathrm{th}}$ simulated example in the main paper. This dataset was generated using the following code in Julia v1.8.1:
```julia
using RedClust
using Random: seed!
seed!(44)
K = 10 # Number of clusters
N = 100 # Number of points
data_σ = [0.25, 0.2, 0.18][n] # Variance of the normal kernel
data_dim = [10, 50, 10][n] # Data dimension
data = generatemixture(N, K; α = 10, σ = data_σ, dim = data_dim);
```
Note however that the above code may produce different results on your computer because the random number generator in Julia is not meant for reproducible results across different computers, different versions of Julia, or different versions of the Random.jl package, even with appropriate seeding. Therefore the datasets have been included with this package, and it is recommended to access them via this function.
See also `generatemixture`.
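A usage sketch, assuming the returned tuple mirrors the output of `generatemixture`:

```julia
pnts, distmatrix, clusts, probs, oracle = example_dataset(1)
```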
RedClust.example_datasets — Method

Return a read-only handle to an HDF5 file that contains the example datasets from the main paper. You must remember to close the file once you are done reading from it. This function is provided for reproducibility purposes only; it is recommended to read the datasets via the convenience function `example_dataset`.
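A sketch of safe usage, assuming the handle behaves like a standard HDF5.jl file object:

```julia
using HDF5
f = example_datasets()
try
    # ... read datasets from the file handle here ...
finally
    close(f)  # always release the handle
end
```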
Types
RedClust.ClustLabelVector — Type

Alias for `Vector{Int}`.
RedClust.MCMCData — Type

```julia
MCMCData(points::AbstractVector{<:AbstractVector{<:Float64}})
MCMCData(dissimilaritymatrix::AbstractMatrix{Float64})
```

Contains the pairwise dissimilarities for the MCMC sampler.
RedClust.MCMCOptionsList — Type

```julia
MCMCOptionsList(;
    numiters = 5000,
    burnin = floor(0.2 * numiters),
    thin = 1,
    numGibbs = 5,
    numMH = 1)
```

List of options for running the MCMC.
Constructor arguments

- `numiters::Integer`: number of iterations to run.
- `burnin::Integer`: number of iterations to discard as burn-in.
- `thin::Integer`: will keep every `thin` samples.
- `numGibbs::Integer`: number of intermediate Gibbs scans in the split-merge step.
- `numMH::Integer`: number of split-merge steps per MCMC iteration.
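For example:

```julia
# Run 20000 iterations, discard the first 5000, and keep every 5th sample
options = MCMCOptionsList(numiters = 20000, burnin = 5000, thin = 5)
```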
RedClust.MCMCResult — Type

Struct containing MCMC samples.
Fields
- `options::MCMCOptionsList`: options passed to the sampler.
- `params::PriorHyperparamsList`: prior hyperparameters used by the sampler.
- `clusts::Vector{ClustLabelVector}`: contains the clustering allocations. `clusts[i]` is a vector containing the clustering allocation from the `i`th sample.
- `posterior_coclustering::Matrix{Float64}`: the posterior coclustering matrix.
- `K::Vector{Int}`: posterior samples of $K$, i.e., the number of clusters. `K[i]` is the number of clusters in `clusts[i]`.
- `r::Vector{Float64}`: posterior samples of the parameter $r$.
- `p::Vector{Float64}`: posterior samples of the parameter $p$.
- `K_mean`, `r_mean`, `p_mean`: posterior mean of `K`, `r`, and `p` respectively.
- `K_variance`, `r_variance`, `p_variance`: posterior variance of `K`, `r`, and `p` respectively.
- `K_acf::Vector{Float64}`, `r_acf::Vector{Float64}`, `p_acf::Vector{Float64}`: autocorrelation function for `K`, `r`, and `p` respectively.
- `K_iac`, `r_iac`, `p_iac`: integrated autocorrelation coefficient for `K`, `r`, and `p` respectively.
- `K_ess::Float64`, `r_ess::Float64`, `p_ess::Float64`: effective sample size for `K`, `r`, and `p` respectively.
- `loglik::Vector{Float64}`: log-likelihood for each sample.
- `logposterior::Vector{Float64}`: a function proportional to the log-posterior for each sample, with constant of proportionality equal to the normalising constant of the partition prior.
- `splitmerge_splits`: Boolean vector indicating the iterations when a split proposal was used in the split-merge step. Has length `numMH * numiters` (see `MCMCOptionsList`).
- `splitmerge_acceptance_rate`: acceptance rate of the split-merge proposals.
- `r_acceptances`: Boolean vector indicating the iterations (including burn-in and the thinned-out iterations) where the Metropolis-Hastings proposal for `r` was accepted.
- `r_acceptance_rate`: Metropolis-Hastings acceptance rate for `r`.
- `runtime`: total runtime for all iterations.
- `mean_iter_time`: average time taken for each iteration.
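For example, a few diagnostics can be read directly off the result of a sampler run:

```julia
result.K_mean, result.K_ess        # posterior mean and effective sample size of K
result.splitmerge_acceptance_rate  # acceptance rate of split-merge proposals
```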
RedClust.PriorHyperparamsList — Type

```julia
PriorHyperparamsList(; [kwargs])
```

Contains the prior hyperparameters for the model. It is recommended to set the values using the `fitprior` function. Do not set these values manually unless you know what you are doing.
Constructor arguments
- `δ1::Float64 = 1`: the parameter $\delta_1$.
- `α::Float64 = 1`: the shape parameter $\alpha$ in the prior for each $\lambda_k$.
- `β::Float64 = 1`: the rate parameter $\beta$ in the prior for each $\lambda_k$.
- `δ2::Float64 = 1`: the parameter $\delta_2$.
- `ζ::Float64 = 1`: the shape parameter $\zeta$ in the prior for each $\theta_{kt}$.
- `γ::Float64 = 1`: the rate parameter $\gamma$ in the prior for each $\theta_{kt}$.
- `η::Float64 = 1`: the shape parameter $\eta$ in the prior for $r$.
- `σ::Float64 = 1`: the rate parameter $\sigma$ in the prior for $r$.
- `u::Float64 = 1`: the parameter $u$ in the prior for $p$.
- `v::Float64 = 1`: the parameter $v$ in the prior for $p$.
- `proposalsd_r::Float64 = 1`: standard deviation of the truncated Normal proposal when sampling `r` via a Metropolis-Hastings step.
- `K_initial::Int = 1`: initial number of clusters to start the MCMC.
- `repulsion::Bool = true`: whether to use the repulsive component of the likelihood.
- `maxK::Int = 0`: maximum number of clusters to allow. If set to 0 then no maximum is imposed.
See also `fitprior`.