Data Distribution and Scheduling for Distributed Analytics Tasks

Abstract We consider a distributed analytics system of interconnected machines. Each analytics task runs on a single machine but may require data stored on multiple other machines. Every task requires a given amount of data and must receive all of it before a specified deadline. In the scenario we consider, each machine has limited storage, so the data for a task usually cannot all be placed on the machine that executes it. We study how to distribute the data across the machines in the system without violating the bandwidth and storage constraints, while ensuring that the data transfer deadlines are met. We prove that solving this problem is equivalent to solving a max-flow problem on a specially constructed graph, and we present an algorithm that solves it using standard max-flow algorithms.
Authors
  • Stephen Pasteris (UCL)
  • Shiqiang Wang (IBM US)
  • Christian Makaya (IBM US)
  • Kevin Chan (ARL)
  • Mark Herbster (UCL)
Date Sep-2017
Venue 1st Annual Fall Meeting of the DAIS ITA, 2017
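
The max-flow reduction mentioned in the abstract can be illustrated with a small sketch. The graph below is a simplified construction assumed for illustration only; it is not the construction from the paper, which the abstract does not spell out. The assumptions are: each task needs `demand[task]` units of its own data, machine `m` can store at most `storage[m]` units, and at most `bandwidth[(m, task)] * deadline[task]` units can travel from machine `m` to the task before its deadline (contention between tasks for a shared link is ignored). All function and parameter names are hypothetical. Under these assumptions, a placement exists exactly when the maximum flow equals the total demand.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp max-flow. `capacity` maps u -> {v: capacity of edge u->v}.

    Returns (flow value, residual graph)."""
    # Build the residual graph, adding reverse edges with capacity 0.
    residual = {source: {}, sink: {}}
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)

    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total, residual

        # Find the bottleneck along the path, then push that much flow.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


def placement_feasible(demand, storage, bandwidth, deadline):
    """Assumed simplified model: source -> task -> machine -> sink.

    Returns (feasible, placement), where placement[(machine, task)] is the
    amount of the task's data stored on that machine."""
    capacity = {"s": {}}
    for task, needed in demand.items():
        capacity["s"][("task", task)] = needed  # data the task must receive
        capacity[("task", task)] = {
            # at most bandwidth * deadline units can arrive in time
            ("machine", m): bandwidth.get((m, task), 0) * deadline[task]
            for m in storage
        }
    for m, space in storage.items():
        capacity[("machine", m)] = {"t": space}  # storage limit of machine m

    value, residual = max_flow(capacity, "s", "t")

    placement = {}
    for task in demand:
        for m in storage:
            edge_cap = capacity[("task", task)][("machine", m)]
            sent = edge_cap - residual[("task", task)][("machine", m)]
            if sent > 0:
                placement[(m, task)] = sent
    return value == sum(demand.values()), placement


# Hypothetical example instance (values chosen for illustration only).
demand = {"A": 6, "B": 4}
storage = {"m1": 5, "m2": 5}
bandwidth = {("m1", "A"): 2, ("m2", "A"): 2, ("m1", "B"): 1, ("m2", "B"): 3}
deadline = {"A": 2, "B": 1}
feasible, placement = placement_feasible(demand, storage, bandwidth, deadline)
print(feasible, placement)
```

Edmonds-Karp is used here only because it is short and self-contained; any standard max-flow algorithm would serve. Data held on the machine that executes the task could be modelled in this sketch by giving that machine-task pair a bandwidth large enough never to bind.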