Scaling Average-Linkage via Sparse Cluster Embeddings

Published in Asian Conference on Machine Learning (ACML), 2021

This paper gives a subquadratic algorithm for average-linkage hierarchical clustering on Euclidean data which approximates each merge up to a constant factor. This is enabled by designing a cluster embedding, i.e. a map which assigns each cluster (many points) to a single point in an embedding space such that the distance in the embedding space approximates the average linkage between clusters in the original space. This allows us to apply approximate nearest neighbor search techniques (e.g. locality sensitive hashing) to get a subquadratic time algorithm). We give an experimental evaluation of our algorithm, showing that it is more accurate than the theory suggests and that it has a significant speedup over standard methods even for moderately large inputs.