Abstract
Reinforcement learning (RL) has been widely applied to communication and computer systems. The mathematical foundation of RL is the Markov Decision Process (MDP), in which a Markov process with defined states models the system, and the actions taken affect the state transitions and the corresponding rewards. Deep RL can produce the optimal action policy that maximizes the long-term reward. However, a key limitation of RL is that the system under consideration often does not satisfy the required Markov property, making the MDP model inexact and the derived policy suboptimal. To overcome this shortcoming, we consider a class of non-Markovian Decision Processes (nMDPs) in which the evolution of the underlying process depends on the system history over a finite number of previous time slots (e.g., the job arrival process depends on the actual job arrivals in a fixed number of past time slots). Furthermore, the model parameters of the nMDP exhibit mild periodic characteristics. Such periodic properties ensure that the system has a finite number of optimal action policies. A deep-learning method is developed to obtain the optimal policies for the nMDP. The method is applied to achieve the optimal allocation of distributed resources to arriving computational tasks in a network setting similar to a software-defined network (SDN). Evaluation results show that the proposed technique is effective and obtains near-optimal solutions when compared with upper-bound results produced by exhaustive search with perfect (but impractical) knowledge of future arrivals and task processing times.