Introduction

This document summarizes what I have learnt about the intersection of Control Theory and Reinforcement Learning (RL). The topic was introduced to me through the work of Ben Recht, who proposed the name Actionable Intelligence for this exciting new field. To begin, this is the introductory headline of the field:

Actionable Intelligence is the study of how to use past data to enhance the future manipulation of a dynamical system

Some Computer Science folks at my university said that Deep RL is the key to achieving The Singularity (lol), and yes, there is strong evidence that RL has defeated humans at Go, Atari games, etc. To be honest, I doubted that audacious claim. At the time, I thought that Deep RL achieved these results in perfect environments with no measurement error and no hidden game state, so how can we be so sure it will achieve the same in the dynamic and imperfect environments of industrial systems?

Ben Recht's tour indeed cleared up my doubts: there is a whole pool of problems left to solve. Why is it so hard to optimally control a system with unknown dynamics? Shouldn't RL be exactly the tool that solves this kind of problem? The answer is simple: the stakes could not be higher if we want to guarantee safe execution and valuable gains (both encoded in the reward function) from a series of actions taken under uncertainty. After all, RL is about finding (randomly or heuristically) a series of actions u that maximizes the reward function r.

A test-bed for everyone: the Linear Quadratic Regulator (LQR)

As Prof. Recht put it:

If a machine learning algorithm does crazy things when restricted to linear models, it's going to do crazy things on complex nonlinear models too.

And conversely, if your ML algorithm gets good results on nonlinear models, it should at least get acceptable results on linear models too. Otherwise, it is unreasonable to trust your model. Therefore, LQR comes to save the day as a test-bed for every new idea that pops out of your head. LQR is a special case of the optimal control problem in which we know the state transition function and we design the cost function to represent our task:

$$ \min \; \mathbb{E}_e\left[\sum_{t=1}^{T} C_t(x_t, u_t)\right] $$
$$ x_{t+1} = f_t(x_t, u_t, e_t) $$
$$ u_t = \pi_t(\tau_t) $$

With LQR (with continuous states and actions), the state transition function is linear and the cost becomes quadratic:

$$ \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} x_t^T Q x_t + u_t^T R u_t + x_T^T P_T x_T\right] $$
$$ x_{t+1} = A x_t + B u_t + e_t $$
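To make the objective concrete, here is a minimal sketch in Python/NumPy (the matrices A, B, Q, R, the gain K, the horizon, and the noise level are all made-up placeholders) that rolls out the noisy linear dynamics under a fixed linear feedback policy and reports the average quadratic cost:

```python
import numpy as np

def rollout_cost(A, B, Q, R, K, T=200, noise_std=0.1, seed=0):
    """Simulate x_{t+1} = A x_t + B u_t + e_t under u_t = -K x_t and
    return the average quadratic cost (1/T) * sum(x'Qx + u'Ru)."""
    rng = np.random.default_rng(seed)
    x = np.ones(A.shape[0])                 # arbitrary initial state
    total = 0.0
    for _ in range(T):
        u = -K @ x                          # linear state-feedback policy
        total += x @ Q @ x + u @ R @ u      # stage cost x'Qx + u'Ru
        x = A @ x + B @ u + noise_std * rng.standard_normal(A.shape[0])
    return total / T

# Toy double-integrator-like system (all values are placeholders).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])
K = np.array([[1.0, 1.7]])                  # an arbitrary stabilizing gain, not the optimum
print(rollout_cost(A, B, Q, R, K))
```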

Intuitively, you must find a series of actions $ u_t $ that minimizes the quadratic cost in expectation over the noise. The next section will introduce some approaches to finding the optimal (or not-so-optimal) policy and how fast they converge to the minimum cost given a fixed amount of samples. The question that arises is how to balance optimal cost against computation, processing as few samples as possible. I have not implemented any RL approaches to solve LQR yet, but it seems pretty straightforward to do. However, Prof. Recht mentioned that LQR is not easy, and that "magic tends to happen in LQR". I will not know whether this is true until I get my hands dirty evaluating these approaches.

Approaches to solving LQR with RL

In the above LQR problem, we design the cost function based on our demands, and if we fully know the state transition function, we can solve LQR directly using batch optimization or dynamic programming. The solution is described below. For the finite-time-horizon, discrete-time LQR:

$$ u_t = -K_t x_t $$
where
$$ K_t = (B^T P_{t+1} B + R)^{-1} B^T P_{t+1} A $$
and
$$ P_t = Q + A^T P_{t+1} A - A^T P_{t+1} B (R + B^T P_{t+1} B)^{-1} B^T P_{t+1} A $$
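A minimal sketch of this backward recursion in NumPy, assuming A, B, Q, R and a terminal cost matrix P_T are given:

```python
import numpy as np

def finite_horizon_lqr(A, B, Q, R, P_T, T):
    """Backward Riccati recursion: returns gains K_0, ..., K_{T-1} for u_t = -K_t x_t."""
    P = P_T
    gains = [None] * T
    for t in reversed(range(T)):
        # K_t = (R + B' P_{t+1} B)^{-1} B' P_{t+1} A  (solve, rather than inverting explicitly)
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # P_t = Q + A' P_{t+1} A - A' P_{t+1} B (R + B' P_{t+1} B)^{-1} B' P_{t+1} A
        P = Q + A.T @ P @ A - A.T @ P @ B @ K
        gains[t] = K
    return gains

# e.g. gains = finite_horizon_lqr(A, B, Q, R, P_T=Q, T=50)   # P_T = Q is an arbitrary choice
```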

When we know the state transition, we know A and B in the linear equation. The optimal solution is found by recursively computing $ P_t $ backward in time from the terminal condition $ P_T $, and from the resulting gains we obtain the series $ u_t $. As the horizon T approaches infinity, the optimal solution becomes static. Solving the discrete algebraic Riccati equation for P, we get:

$$ P = Q + A^T P A - A^T P B (R + B^T P B)^{-1} B^T P A $$
$$ K = (B^T P B + R)^{-1} B^T P A $$
with $ u_t = -K x_t $.
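In practice you rarely need to iterate this fixed point by hand: SciPy ships a solver for the discrete algebraic Riccati equation. A minimal sketch, reusing the placeholder matrices from the earlier rollout example:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.0, 0.1], [0.0, 1.0]])      # same placeholder system as before
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])

# Solve P = Q + A'PA - A'PB (R + B'PB)^{-1} B'PA, then read off the static gain.
P = solve_discrete_are(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
print(K)                                     # u_t = -K x_t for all t
```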

As we can see, the optimal gain K does not depend on the noise variance: more noise increases the cost incurred, but it does not change the optimal controller. In the infinite-horizon setting, the solution u is time-invariant. The question is: when we do not know the state transition function, i.e. we do not know A and B in advance, how can we solve LQR for optimal control? This is where RL shines. We have these roads to follow:

  1. Identify everything (the most masochistic way lol) about the system dynamics using physics and finite element methods.
  2. Identify a coarse model of the system dynamics.
  3. Be blunt :) and map sensor data directly to control outputs (e.g., with a neural network).

Model-based approaches

I personally like the first approach, because that's what the Control Theory folks usually do anyway :). They call this approach System Identification. However, in the ML manner, we do not use the glory of physics to model the system. What we do instead is try to learn the state transition function, which means fitting the matrices A and B with a loss function of our choice. For example, using an L2 loss to fit A and B:

$$ \hat{A},\hat{B}=\underset{A,B}{\arg\min}\sum_{t=1}^{T-1}\|x_{t+1}-Ax_t-Bu_t\|^2 $$

Then we plug these estimated matrices into the state transition function, with a residual error $v_t$:

$$ x_{t+1}=\hat{A}x_t+\hat{B}u_t+v_t $$
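Here is a minimal sketch of that least-squares fit in NumPy, assuming we have already logged a trajectory of states X and inputs U from the real system (for example by injecting random exploratory inputs):

```python
import numpy as np

def fit_linear_dynamics(X, U):
    """Least-squares fit of x_{t+1} ~ A x_t + B u_t.
    X: (T+1, n) array of observed states, U: (T, m) array of applied inputs."""
    Z = np.hstack([X[:-1], U])               # regressors [x_t, u_t]
    Y = X[1:]                                # targets x_{t+1}
    Theta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    n = X.shape[1]
    A_hat, B_hat = Theta[:n].T, Theta[n:].T  # unstack the solution into A_hat, B_hat
    return A_hat, B_hat
```

The certainty-equivalent (or "nominal") controller then simply feeds $ \hat{A}, \hat{B} $ into the Riccati solver from the previous section as if they were the true dynamics.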

Actually, treating $ \hat{A}, \hat{B} $ as the true dynamics matrices does not feel right. So far, this approach finds the solution u purely from the estimates $ \hat{A}, \hat{B} $. Here is the burning question: how can we model the uncertainty (errors) of our estimates and then use it to design a solution u that controls the system robustly?

To answer that question, there is recent work from (again) Prof. Recht's group on Coarse-ID control addressing exactly this problem. These works will be discussed in the next posts.

Model-free approaches

This section is huge! I cannot cover it in a few lines; it could fill several intensive RL courses. And yes, this is the realm of RL that everyone likes :). This section serves as an introductory point for the next few posts. Basically, as the name suggests, we do not care about the system dynamics; what we do is drive down the cost by:

  • Iteratively refining an estimate of the cost-to-go from measured system states and choosing the action $ u_t $ that looks best under the current estimate at each time step: Approximate Dynamic Programming.
  • Defining a parametrized distribution $ p(u;\vartheta) $ over actions and iteratively refining the parameters $ \vartheta $ by gradient methods, so as to minimize the cost (equivalently, maximize the reward) expected over that distribution: Direct Policy Search (a minimal random-search sketch follows this list).
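To give a flavour of direct policy search on LQR, here is a minimal sketch of a two-point random search over a static feedback gain K (a crude stand-in for a policy-gradient method; the step size, smoothing radius, and iteration count are arbitrary and would need tuning). It only ever queries sampled costs, never A or B:

```python
import numpy as np

def random_search_lqr(cost_fn, K0, iters=200, step=0.05, smooth=0.1, seed=1):
    """Crude two-point random search over a static feedback gain K.
    cost_fn(K) returns a sampled (noisy) average cost for the policy u_t = -K x_t."""
    rng = np.random.default_rng(seed)
    K = K0.copy()
    for _ in range(iters):
        delta = rng.standard_normal(K.shape)            # random search direction
        c_plus = cost_fn(K + smooth * delta)
        c_minus = cost_fn(K - smooth * delta)
        # Finite-difference estimate of the directional derivative; step downhill.
        K -= step * (c_plus - c_minus) / (2 * smooth) * delta
    return K

# e.g. K = random_search_lqr(lambda K: rollout_cost(A, B, Q, R, K),
#                            K0=np.array([[1.0, 1.7]]))  # start from a stabilizing guess
```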

For now, we will not go into the details of these methods. PID is also a model-free method that finds u based only on the error, but can PID learn to compensate for poor models and changing conditions? This leads us to a burning question, stretching from PID to RL:

How well must we understand a system in order to control it optimally?
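For comparison, a textbook discrete-time PID controller is only a few lines and needs no model at all, just the tracking error; the gains below are arbitrary placeholders that would normally be tuned:

```python
class PID:
    """Discrete-time PID: u = kp*e + ki*integral(e) + kd*de/dt, model-free by construction."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, error):
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# controller = PID(kp=1.0, ki=0.1, kd=0.05, dt=0.01)     # placeholder gains
# u = controller.update(setpoint - measurement)
```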


In the next few posts, we will discuss how to balance the optimality of the policy against the computational and sample cost of finding it, and how quickly these methods find a solution that drives the system to stability.