Abstract |
Motivated by the need of solving machine learning problems over distributed datasets, we explore the use of coreset to re- duce the communication overhead. Coreset is a summary of the original dataset in the form of a small weighted set in the same sample space. Compared to other data summaries, coreset has the advantage that it can be used as a proxy of the original dataset, potentially for different applications. How- ever, existing coreset construction algorithms are each tailor- made for a specific machine learning problem, and thus to support diverse machine learning problems, one has to col- lect many coresets of different types, defeating the purpose of saving communication overhead. We resolve this dilemma by developing a coreset construction algorithm based on k- means/median clustering, that gives provably good approxi- mation for a broad range of machine learning problems with sufficiently continuous cost functions. Through evaluations on diverse datasets and machine learning problems, we verify the universally good performance of the proposed algorithm. |