
Bio: Shicong Cen is a final-year Ph.D. candidate in the Department of Electrical and Computer Engineering at Carnegie Mellon University, advised by Prof. Yuejie Chi.

Shicong’s research focuses on efficient learning in reinforcement learning and game theory. By developing an in-depth theoretical understanding of policy optimization methods, he aims to turn heuristics into rigorous principles and to inspire the design of new algorithms with provably fast convergence.

Shicong obtained his bachelor’s degree in Information and Computing Science from the School of Mathematical Sciences, Peking University, in 2019.

Talk Title: Algorithmic Foundations of Policy Optimization in Reinforcement Learning and Multi-agent Systems

Abstract: Policy optimization methods have been integral to the empirical success of contemporary reinforcement learning (RL). A recent flurry of works has endeavored to develop policy optimization methods with provable convergence guarantees. Our work contributes to this line of research by establishing non-asymptotic convergence guarantees and improving the dependencies on salient problem parameters.

We start with policy optimization for tabular single-agent RL, where the intrinsic non-concavity poses a significant challenge to obtaining convergence guarantees. By combining natural policy gradient (NPG) methods with entropy regularization, we deliver the first policy optimization method that provably converges linearly to the globally optimal policy at a dimension-free rate, assuming access to exact policy evaluation. Going beyond entropy regularization, we design a novel policy optimization method that accommodates various choices of regularizers by leveraging Bregman divergences and mirror descent, and that provably works even when the regularizer lacks strong convexity and smoothness.
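
As a rough illustration of the entropy-regularized NPG approach described above, here is a minimal sketch (not the speaker's code; the transition tensor P, reward table r, regularization weight tau, and step size eta are illustrative placeholders) of the multiplicative update form on a tabular discounted MDP with exact soft policy evaluation:

```python
import numpy as np

def soft_q_eval(P, r, pi, gamma, tau):
    """Exact entropy-regularized (soft) policy evaluation via a linear solve.
    P: transitions of shape (S, A, S); r: rewards of shape (S, A); pi: policy of shape (S, A)."""
    S, A = r.shape
    r_pi = np.sum(pi * (r - tau * np.log(pi)), axis=1)      # per-state regularized reward
    P_pi = np.einsum("sa,sap->sp", pi, P)                   # state-to-state transitions under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)     # soft value: V = (I - gamma*P_pi)^{-1} r_pi
    return r + gamma * np.einsum("sap,p->sa", P, V)         # soft Q-function

def entropy_regularized_npg(P, r, gamma=0.9, tau=0.1, eta=None, T=100):
    """Sketch of entropy-regularized NPG with exact policy evaluation (tabular softmax policies)."""
    S, A = r.shape
    if eta is None:
        eta = (1 - gamma) / tau  # illustrative step size, so that eta * tau / (1 - gamma) <= 1
    pi = np.full((S, A), 1.0 / A)  # uniform initialization
    for _ in range(T):
        Q = soft_q_eval(P, r, pi, gamma, tau)
        # multiplicative update: pi_new ∝ pi^{1 - eta*tau/(1-gamma)} * exp(eta * Q / (1 - gamma))
        logits = (1 - eta * tau / (1 - gamma)) * np.log(pi) + eta * Q / (1 - gamma)
        logits -= logits.max(axis=1, keepdims=True)  # shift for numerical stability
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)
    return pi
```

With the largest admissible step size in this sketch, eta = (1 - gamma) / tau, the update reduces to soft policy iteration, pi ∝ exp(Q / tau).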

In the presence of multiple agents, the goal of policy optimization is to find an approximate Nash equilibrium (NE), preferably through an independent learning process. Notably, existing analyses of learning NE in competitive games fall short of characterizing the last-iterate convergence of the learning algorithm in a dimension-free manner. We make progress by designing novel extragradient policy optimization methods that converge to the quantal response equilibrium (QRE) of competitive games at a dimension-free linear rate, enabling the learning of approximate NE without the additional assumptions typically required in prior results. We further extend our algorithms and the accompanying analysis to competitive Markov games and multi-player zero-sum polymatrix games.
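
To make the extragradient idea concrete, here is a minimal sketch for a two-player zero-sum matrix game with entropy regularization; this is an illustrative predictor-corrector scheme in multiplicative-weights form, and the exact update schedule and step-size conditions of the speaker's methods may differ:

```python
import numpy as np

def extragradient_qre(A, tau=0.1, eta=0.05, T=1000):
    """Sketch: extragradient-style multiplicative-weights updates toward the QRE of a
    zero-sum matrix game, where the row player maximizes x^T A y + tau*H(x) and the
    column player minimizes x^T A y - tau*H(y)."""
    m, n = A.shape
    x, y = np.full(m, 1.0 / m), np.full(n, 1.0 / n)  # uniform mixed strategies

    def step(p, grad):
        # entropy-regularized multiplicative-weights step: p_new ∝ p^{1 - eta*tau} * exp(eta * grad)
        logits = (1 - eta * tau) * np.log(p) + eta * grad
        logits -= logits.max()  # shift for numerical stability
        q = np.exp(logits)
        return q / q.sum()

    for _ in range(T):
        # predictor (midpoint) step from the current iterates
        x_bar = step(x, A @ y)
        y_bar = step(y, -A.T @ x)
        # corrector step from the same base iterates, using gradients at the midpoint
        x = step(x, A @ y_bar)
        y = step(y, -A.T @ x_bar)
    return x, y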
