A Universally Good Coreset for Distributed Machine Learning

Abstract Motivated by the need to solve machine learning problems over distributed datasets, we explore the use of coresets to reduce the communication overhead. A coreset is a summary of the original dataset in the form of a small weighted set in the same sample space. Compared to other data summaries, a coreset has the advantage that it can be used as a proxy for the original dataset, potentially for different applications. However, existing coreset construction algorithms are each tailor-made for a specific machine learning problem. To support diverse machine learning problems, one therefore has to collect many coresets of different types, which defeats the purpose of saving communication overhead. We resolve this dilemma by developing a coreset construction algorithm based on k-means/median clustering that gives a provably good approximation for a broad range of machine learning problems with sufficiently continuous cost functions. Through evaluations on diverse datasets and machine learning problems, we verify the universally good performance of the proposed algorithm.
  • Hanlin Lu (PSU)
  • Ming-Ju Li (PSU)
  • Ting He (PSU)
  • Shiqiang Wang (IBM US)
  • Vijaykrishnan Narayanan (PSU)
  • Kevin Chan (ARL)
Date Sep-2018
Venue 2nd Annual Fall Meeting of the DAIS ITA, 2018
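
The abstract describes a coreset as a small weighted set that stands in for the full dataset, built here from k-means/median clustering. The following is a minimal illustrative sketch of that general idea, not the authors' algorithm, whose details are not given in this listing. It assumes that the coreset consists of k-means cluster centers, each weighted by the number of points in its cluster, and it uses a plain Lloyd iteration. The names kmeans_coreset and weighted_cost, and the squared-distance cost, are hypothetical choices made for illustration.

```python
import numpy as np

def kmeans_coreset(points, k, n_iter=50, seed=0):
    """Illustrative coreset: k-means centers weighted by cluster size.

    Assumption: this mirrors the abstract's description only loosely;
    it is not the construction from the paper.
    """
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Initialize centers with k distinct data points (copy, not a view).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest center (squared Euclidean distance).
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each non-empty cluster's center to the mean of its members.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    # Weight of a center = number of points assigned to it.
    weights = np.bincount(labels, minlength=k).astype(float)
    keep = weights > 0  # drop any cluster that ended up empty
    return centers[keep], weights[keep]

def weighted_cost(points, weights, x):
    """Hypothetical cost: weighted sum of squared distances to a model point x."""
    return float((weights * ((points - x) ** 2).sum(axis=1)).sum())

# Usage: compare the cost on the full data with the cost on the coreset.
rng = np.random.default_rng(1)
data = rng.normal(size=(2000, 2))
core_pts, core_w = kmeans_coreset(data, k=20)
x = np.array([0.5, -0.5])
full = weighted_cost(data, np.ones(len(data)), x)
approx = weighted_cost(core_pts, core_w, x)
print(f"full cost: {full:.1f}, coreset cost: {approx:.1f}")
```

In a distributed setting, only the centers and weights would need to be transmitted instead of the raw points. The gap between the two printed costs depends on k and on how smooth the cost function is.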