Learning to Simplify Distributed Systems Management

Abstract	Managing large-scale distributed systems is a difficult task. System administrators are responsible for the upkeep and maintenance of numerous components with complex dependencies. With the shift to microservices-based architectures, these systems can consist of hundreds to thousands of interconnected nodes. To cope with this difficulty, administrators rely on analyzing logs and metrics collected from the different services. However, the sheer number of metrics available in large systems introduces complexity and scaling issues of its own. To address these issues, we present Minerva, an unsupervised Machine Learning (ML) framework for network diagnosis. Minerva is composed of a multi-stage pipeline whose components can operate individually or together to perform various management tasks. Our system offers a unified and extensible framework for managing the complexity of large networks, and gives administrators a Swiss Army knife for diagnosing the overall health of their systems. To demonstrate the feasibility of Minerva, we evaluate its performance on a production-scale system. We present use cases for the management tools that Minerva provides, and show how these tools can be used to make strong inferences about the system using unsupervised techniques.
Authors	Christopher Streifer, Ramya Raghavendra (IBM US), Theophilus Benson, Mudhakar Srivatsa (IBM US)
Date	Dec-2018
Venue	IEEE Big Data 2018