Hierarchical Bandit Problems

Abstract	In a coalition setting the goal of online learning is to learn the data sources and analytics software best suited for different tasks. We model this as a multi-arm bandit problem where the goal is to identify/learn the best source/software (arm). Challenges include scaling up to large numbers of arms, dealing with a geographic distribution of arms, and differing policies governing different sets of arms. Thus we propose a hierarchical bandit formulation where arms are partitioned into sets with each set managed by an agent tasked with learning the best arm within its set. In addition, there is a learner tasked with determining which agent is handling the best arm. The goal is to develop algorithms operating at both layers of the hierarchy so as to maximize the cumulative reward that the learner incurs. We consider two settings, the stochastic setting where arms take their rewards from a distribution and a non-stochastic setting where an adversary decides on the reward. In the stochastic case we develop a two layer algorithm that exhibits good performance as shown through simulation. In the non-stochastic case we develop performance bounds that account for the structure of the hierarchy.
Authors	Lin Yang (UMass) Yu-Zhen Chen (UMass) M Hajiesmili (UMass) Don Towsley (UMass) Mark Herbster (UCL) Stehen Pasteris (UCL)
Date	Sep-2020
Venue	4th Annual Fall Meeting of the DAIS ITA, 2020