Coresets Learning via Distributed Clustering and Local Gradients

Military / Coalition Issue

In military operations it will often be the case that each mobile device in the coalition network has collected its own set of data from its surroundings, whilst bandwidth limitations prohibit the direct aggregation of the datasets onto a central server. Often, some machine learning task will have to be performed using the distributed dataset as a whole. This project considers how to do such tasks while broadcasting (between the mobile devices) only a small amount of information.

Core idea and key achievements

This project consists of 3 papers about performing machine learning on a distributed dataset without broadcasting much information between devices.

The first paper in on the compression (i.e. summarisation) of a local dataset (the compressed datasets can then be easily transmitted to a central server and aggregated). It is assumed that the datasets are in the form of a set of vectors. The compression is comprised of the set of centres of the regions of a Voronoi diagram along with the number of points in the corresponding regions and, for each corresponding region, a vector representing the gradient of a linear approximation of the “probability” density of the points in the region. After transmission, the dataset can then be recovered approximately by sampling from the linear distributions.

The second paper builds an approximation of the aggregated dataset where the approximation is a Voronoi diagram along with the number of points in each region. Our distributed algorithm is an approximation (to any degree of accuracy – higher accuracies requiring more information to be broadcast) of K-means++ whilst requiring only a very small amount of data to be broadcast.

The third paper considers building a classifier via online-to-batch conversion of an online learning algorithm. Our distributed algorithm exactly implements the classic online-to-batch conversion meta-algorithm but only needs to broadcast the mistakes made by the online learning algorithm – a quantity that scales linearly with the bound on the performance of the resulting classifier.

Implications for Defence

These techniques will allow machine learning tasks over distributed datasets to be performed when we have bandwidth constraints in a wireless network of devices.

Readiness & alternative Defence uses

All 3 algorithms where coded up in Python as part of the project.

Resources and references

(none)

Organisations

UCL, IBM (US and UK), PSU