A Universally Good Coreset for Distributed Machine Learning

Abstract Motivated by the need to solve machine learning problems over distributed datasets, we explore the use of coresets to reduce communication overhead. A coreset is a summary of the original dataset in the form of a small weighted set in the same sample space. Compared to other data summaries, a coreset has the advantage that it can be used as a proxy for the original dataset, potentially for different applications. However, existing coreset construction algorithms are each tailor-made for a specific machine learning problem, so supporting diverse machine learning problems requires collecting many coresets of different types, which defeats the purpose of saving communication overhead. We resolve this dilemma by developing a coreset construction algorithm based on k-means/median clustering that gives a provably good approximation for a broad range of machine learning problems with sufficiently continuous cost functions. Through evaluations on diverse datasets and machine learning problems, we verify the universally good performance of the proposed algorithm.
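To make the idea of a clustering-based coreset concrete, the following is a minimal sketch, not the authors' algorithm. It assumes that the coreset points are the centers found by plain Lloyd-style k-means and that each center is weighted by the number of data points assigned to it; the abstract does not state these details. The names `kmeans_coreset`, `weighted_mean`, and `sq_dist` are hypothetical and chosen only for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(p, centers):
    """Index of the center closest to p (ties go to the lowest index)."""
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))


def kmeans_coreset(points, k, iters=20, seed=0):
    """Hypothetical helper: return (centers, weights).

    Assumption: the coreset consists of k-means centers, each weighted by
    the size of its cluster. This is an illustrative choice only.
    """
    rng = random.Random(seed)
    centers = [tuple(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Move each center to the mean of its cluster; an empty cluster
        # keeps its previous center.
        centers = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    # Weight of each coreset point = number of data points it represents.
    weights = [0] * k
    for p in points:
        weights[nearest(p, centers)] += 1
    return centers, weights


def weighted_mean(centers, weights):
    """Mean of the dataset as estimated from the weighted coreset alone."""
    total = sum(weights)
    dim = len(centers[0])
    return tuple(
        sum(w * c[d] for c, w in zip(centers, weights)) / total
        for d in range(dim)
    )
```

In a distributed setting, a node would transmit only `centers` and `weights` instead of the raw points, and the receiver would run its learning task on the weighted set. The `weighted_mean` helper shows the simplest such task: once k-means has converged, the weighted mean of the centers matches the mean of the original data.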
Authors
  • Hanlin Lu (PSU)
  • Ming-Ju Li (PSU)
  • Ting He (PSU)
  • Shiqiang Wang (IBM US)
  • Vijaykrishnan Narayanan (PSU)
  • Kevin Chan (ARL)
Date Sep-2018
Venue 2nd Annual Fall Meeting of the DAIS ITA, 2018