FedCE: Federated Certainty Equivalence Control for Linear Gaussian Systems
Decentralized multi-agent systems are ubiquitous, with applications ranging from decentralized control of robots and drones to autonomous vehicles and non-cooperative games. Extensive research has focused on decentralized multi-agent systems with known system dynamics, exploring frameworks such as decentralized optimal control, multi-agent planning, and non-cooperative game theory. Realistically, however, the environment model is often only partially known or entirely unknown. Multi-Agent Reinforcement Learning (MARL) addresses the broader setting of multi-agent sequential decision-making in which agents lack complete knowledge of the environment model; in such conditions, agents learn the environment by interacting with the system and gathering rewards.
We focus on the decentralized LQ problem, a multi-agent learning problem with linear system dynamics and a quadratic cost function. In particular, we consider a scenario with two agents, each of which partially observes the system state. The system is driven by the collective action of the two agents, i.e., the sum of their individual actions. LQ systems are an attractive learning benchmark because of their theoretical tractability and their relevance across diverse engineering domains. Our particular problem arises, for instance, when robots with decoupled dynamics and observations are tasked with collaborating to produce coupled behavior.

Studying this setting led us to develop FedCE, a collaborative learning policy that allows agents to explore and exploit efficiently. FedCE partitions time into exploration and exploitation intervals, with durations carefully designed to achieve both accurate system model learning and low regret. During each exploration interval, the agents use Least Squares Estimation (LSE) to obtain local partial estimates of the system model, which they exchange at the end of the interval. Both agents then compute Certainty Equivalence controllers and apply them throughout the subsequent exploitation interval. As time progresses and the model estimates improve, the length of the exploration intervals shrinks relative to the exploitation intervals, so communication between agents decreases over time. We analyze the regret of FedCE and show that it scales as O(√T) over a time horizon T.
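To make the interplay of these pieces concrete, the following is a minimal single-trajectory sketch in Python (numpy/scipy). It is a sketch under simplifying assumptions, not the paper's algorithm: it assumes full-state feedback, known cost matrices Q and R, and a single pooled least-squares step standing in for the agents' local LSE computations plus the estimate exchange; the system matrices, epoch schedule, noise level, and minimum exploration length are illustrative choices.

    import numpy as np
    from scipy.linalg import solve_discrete_are

    rng = np.random.default_rng(0)

    # Illustrative 2-state system: x_{t+1} = A x_t + B (u1 + u2) + w_t,
    # where the input is the sum of the two agents' actions.
    A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
    B_true = np.array([[1.0], [0.5]])
    Q, R = np.eye(2), np.eye(1)

    def ce_controller(A_hat, B_hat):
        # Certainty Equivalence LQR gain computed from the current estimates.
        P = solve_discrete_are(A_hat, B_hat, Q, R)
        return np.linalg.solve(R + B_hat.T @ P @ B_hat, B_hat.T @ P @ A_hat)

    x = np.zeros(2)
    X, U, Xnext = [], [], []   # data pooled across exploration intervals

    for epoch in range(1, 6):
        # Exploration interval: length grows like the square root of the
        # exploitation length (floored at 6 samples so the first LSE is
        # well-posed), so its relative share shrinks over time.
        explore_len = max(6, int(np.ceil(np.sqrt(2 ** epoch))))
        for _ in range(explore_len):
            u = rng.normal(size=1)   # excitation input (joint probe of both agents)
            x_next = A_true @ x + B_true @ u + 0.1 * rng.normal(size=2)
            X.append(x); U.append(u); Xnext.append(x_next)
            x = x_next

        # Least-squares estimate of [A B] from the pooled data (stand-in for
        # the agents' local LSE steps plus the end-of-interval exchange).
        Z = np.hstack([np.array(X), np.array(U)])
        Theta, *_ = np.linalg.lstsq(Z, np.array(Xnext), rcond=None)
        A_hat, B_hat = Theta.T[:, :2], Theta.T[:, 2:]

        # Exploitation interval: both agents apply the CE controller computed
        # from the shared estimates; intervals double in length each epoch.
        K = ce_controller(A_hat, B_hat)
        for _ in range(2 ** epoch):
            u = -K @ x   # joint CE action
            x = A_true @ x + B_true @ u + 0.1 * rng.normal(size=2)

The schedule in the sketch reflects the intuition behind the regret bound: if each exploitation interval has length T_k, an exploration interval of length on the order of √T_k incurs per-epoch exploration cost that sums to order √T over the horizon, which is consistent with the O(√T) scaling stated above.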