Abstract 
In this paper we give a fast algorithm to solve the distributed Kmeans problem. This problem is as follows: we have a group of machines that each has a dataset of points in a real vector space. The goal is to choose a set of K vectors that (approximately) minimises the classic Kmeans objective function relative to the union of all the datasets. We consider the broadcast model  in which, at any time, one machine is broadcasting data to all others. We give a distributed algorithm that needs only to broadcast a small amount of data. The amount of data which needs to be broadcast by our algorithm is far smaller than the previous state of the art, whilst our theoretical guarantee of the quality of our solution is essentially the same. In addition to our theoretical performance guarantees we perform experiments with very promising preliminary results. 
Authors 
 Stephen Pasteris (UCL)
 Ting He (PSU)
 Mark Herbster (UCL)
 Richard Tomsett (IBM UK)

Date 
Sep2020 
Venue 
4th Annual Fall Meeting of the DAIS ITA, 2020 
