Managing Training Data from Untrusted Partners Using Self-generating Policies

Abstract	When training data for machine learning is obtained from many different sources, not all of which may be trusted, it is difficult to determine which training data to accept and which to reject. A policy-based approach to data curation, in which the policies are generated after examining the properties of the offered data, can provide a way to accept only selected data for creating a machine learning model. In this paper, we discuss the challenges associated with generating policies that can manage training data from different sources. An efficient policy generation scheme must determine the order in which information is received, assess the trustworthiness of each partner, quickly decide which data subsets can add value to a complex model, and address several other issues. After providing an overview of the challenges, we propose approaches to solve them and study the properties of those approaches.
Authors	Dinesh Verma (IBM US); Seraphin Calo (IBM US); Shonda Witherspoon (IBM US); Irene Manatos (IBM US); Elisa Bertino (Purdue); Amani Abu Jabal (Purdue); Geeth de Mel (IBM UK); Ananthram Swami (ARL); Greg Cirincione (ARL); Gavin Pearson (Dstl)
Date	Apr-2019
Venue	SPIE - Defense + Commercial Sensing 2019