In recent years, we have observed a dramatic growth of data generation in edge-based machine learning applications. Motivated by the need of solving machine learning problem over distributed datasets, we would like to reduce the size of datasets as well as minimizing the machine learning performance degradation. Suppose we are given a dataset P, it could be represented by a data cube with three dimensions: cardinality n, number of features d and number of precision bits b. In this dissertation, we will explore different data reduction techniques to reduce these three dimensions and make three steps toward reducing the total size of the dataset.
In our first step, we consider using coreset to reduce the cardinality of the collected dataset. Coreset is a small weighted dataset, functioning as a proxy of the original dataset. However, existing coreset construction algorithms are each tailor-made for a specific machine learning problem. That is, we are required to construct different coresets to support different machine learning models. In our first step, we resolve this dilemma by developing robust coreset construction algorithms based on k-clustering algorithms. Our solution is proved to give a guaranteed approximation for a broad range of machine learning problems with sufficiently continuous cost functions.
In our second step, we propose the first framework to incorporate quantization techniques into the process of coreset construction. Specifically, we theoretically analyze the ML error caused by a combination of coreset construction techniques and quantization techniques. Based on that, we formulate an optimization problem to minimize the ML error under a fixed budget of communication cost. To improve the scalability for large datasets, we identify two proxies of the original objective function, for which efficient algorithms are developed. For the case of data on multiple nodes, we further design a novel algorithm to allocate the communication budgets to different nodes while minimizing the overall ML error.
As our third step, we consider the problem of solving edge-based k-means on a large dataset in high dimensional space. In this application scenario, data sources offload machine learning computation to nearby edge servers under limited communication budget and computation power. To solve this problem, we propose to construct small data summaries with fewer data samples (by techniques for Cardinality Reduction (CR)), fewer features (by techniques for Dimensionality Reduction (DR)) and fewer precision bits (by techniques for Quantization (QT)). By analyzing the complexity, the communication cost, and the approximation error of k-means algorithms based on state-of-the-art data reduction methods, we show that: (i) it is possible to achieve a near-optimal approximation at a near-linear complexity and a constant communication cost, (ii) the order of applying DR and CR leads to a tradeoff between the complexity and the communication cost, (iii) combining DR/CR methods with a properly selected quantizer can further reduce the communication cost without compromising the other performance metrics.
At last, in each step, the effectiveness of our analysis is verified through extensive experiments on multiple real datasets and different machine learning problems.