Overview of RL

RL has applications in many different fields, ranging from Optimal Control and Reward Systems to Operations Research and Economics. It is a large, distinct branch of Machine Learning alongside Supervised and Unsupervised Learning. Several aspects are important in the formulation of RL:

  • No supervisor, only a reward signal serves as heuristic guidance.
  • Feedback (the reward) is often delayed, not instantaneous.
  • The data is sequential and time-dependent, not i.i.d.
  • The agent's actions affect the subsequent data it observes.

Using RL, we try to build an agent that takes an action $ A_t $ based on the reward signal $ R_t $ and observation $ O_t $ at each time step $t$. The environment receives action $ A_t $ and emits reward $ R_{t+1} $ and observation $ O_{t+1} $ for the next time step:

(Figure: the agent-environment interaction loop.)

Both the agent and the environment have their own internal states, $ S_t^a $ and $ S_t^e $ respectively. We can formulate the agent state as a function of the history, $ S_t^a=f(H_t) $.
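To make this loop concrete, here is a minimal sketch of one interaction loop in Python; the `Environment` and `Agent` classes are hypothetical toy placeholders, not any particular library's API.

```python
import random

class Environment:
    """Toy environment: returns a random reward and observation for any action."""
    def step(self, action):
        reward = random.random()             # R_{t+1}
        observation = random.randint(0, 4)   # O_{t+1}
        return reward, observation

class Agent:
    """Toy agent: chooses a random action given the latest observation."""
    def act(self, observation):
        return random.choice([0, 1])         # A_t

env, agent = Environment(), Agent()
observation = 0
for t in range(5):
    action = agent.act(observation)          # act based on O_t (and, implicitly, R_t)
    reward, observation = env.step(action)   # environment emits R_{t+1} and O_{t+1}
    print(t, action, reward, observation)
```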

Markov Decision Process (MDP)

MDP serves as an effective mathematical foundation for RL: it formally describes an RL environment. Almost all RL problems can be formalised as MDPs:

  • Optimal Control can be formulated with continuous MDPs
  • Partially Observable problems can be converted into MDPs
  • Bandits are MDPs with one state.

MDP fits the RL setting when the agent has all information about the environment (full observability): the environment is stochastic, the data is sequential and time-dependent, and actions affect subsequent data. This document introduces MDP definitions and formulations; the next posts will discuss how to solve MDPs and apply them to RL problems.

Markov Process (Markov Chain)

A Markov Process is a memoryless random process, i.e. a sequence of random states $ S_t $ with the property:

$$ P[S_{t+1}|S_t]=P[S_{t+1}|S_1,...,S_t] $$

The property means that the state captures all relevant information from the history, so the history can be discarded, assuming the agent has full observability, $ O_t=S_t^a=S_t^e $. Intuitively, the environment state is Markov by definition, hence the agent state is also Markov when the system is fully observable. Put differently, if the current state $ S_t $ summarizes the fully observed history $H_t$, then $ S_t = f(H_t) $ can be expressed with no loss of information, hence the Markov property holds ($S_{t+1}$ depends only on $S_t$).

If we have a finite number of states $ S_1,...,S_n $, the state transition probabilities can be characterized by an $n \times n$ matrix $P$, where each row sums to 1:

$$ \begin{bmatrix} P_{11} & ... & P_{1n}\\ ... & & \\ P_{n1} & ... & P_{nn} \end{bmatrix} $$
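For illustration, here is a sketch of a hypothetical 3-state Markov chain encoded as such a matrix and sampled with NumPy; the numbers are made up for the example.

```python
import numpy as np

# Rows index the current state s, columns the next state s'; each row sums to 1.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.5, 0.3],
    [0.0, 0.0, 1.0],   # state 2 is absorbing (terminal)
])

rng = np.random.default_rng(0)
state, trajectory = 0, [0]
for _ in range(10):
    state = rng.choice(3, p=P[state])   # sample S_{t+1} ~ P[. | S_t]
    trajectory.append(int(state))
print(trajectory)
```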

Markov Reward Process (MRP)

Basically, an MRP is a Markov chain with a value attached to each state; the value measures how good it is to be in that state. To be concrete, an MRP is a tuple $ (S,P,R,\gamma) $:

  • S is a finite set of states
  • P is the state transition probability matrix, $ P_{ss'} = P[S_{t+1}=s'|S_t=s] $
  • R is an (immediate) reward function for each state, $ R_s=E[R_{t+1}|S_t=s] $
  • $ \gamma $ is a discount factor ranging from 0 to 1.

The value function $ v(s) $ gives the value of each state: the expected return starting from state $s$:

$$ v(s)= E[G_t|S_t=s] $$ where return G is $$ G_t = R_{t+1}+\gamma R_{t+2}+...=\sum_{k=0}^\infty \gamma^k R_{t+k+1} $$

Intuitively, the value function averages over all possible reward paths from state $s$ to the terminal state of the MRP (a terminal state avoids infinite returns), where each path has its own return $G$. The factor $ \gamma $ discounts $G$, which also avoids an infinite return when state sequences are infinite, and represents the uncertainty of far-future rewards by favoring immediate rewards.
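As a worked example, the return for a short made-up reward sequence with $\gamma = 0.9$ follows directly from the definition of $G_t$:

```python
# Rewards R_{t+1}, R_{t+2}, ... along one sampled path (hypothetical values).
rewards = [1.0, 0.0, 2.0, 1.0]
gamma = 0.9

# G_t = sum_k gamma^k * R_{t+k+1}
G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)   # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```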

The value function can be rewritten in the form of the Bellman equation:

$$ v(s)=E[R_{t+1}+\gamma v(S_{t+1})|S_t=s] $$ or $$ v(s)=R_s+ \gamma \sum_{s' \in S}P_{ss'}v(s') $$

To solve for $v(s)$, we can write the Bellman equation in matrix form:

$$ v = R + \gamma Pv $$ and solve for $v$: $$ v=(I-\gamma P)^{-1}R $$
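A minimal sketch of this direct solve with NumPy, reusing the hypothetical 3-state transition matrix above together with a made-up reward vector:

```python
import numpy as np

P = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.5, 0.3],
    [0.0, 0.0, 1.0],
])
R = np.array([-1.0, -2.0, 0.0])   # immediate reward R_s for each state (made up)
gamma = 0.9

# v = (I - gamma P)^{-1} R, computed as a linear solve rather than an explicit inverse
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
```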

Markov Decision Process (MDP)

An MDP is a Markov reward process with decisions: the agent acts on the environment, and the environment then stochastically transitions to the next state $s'$. Note that all states in an MDP are Markov. An MDP is a tuple $ (S,A,P,R,\gamma) $:

  • S is a finite set of states
  • A is a finite set of actions
  • P is the state transition probability matrix, $ P_{ss'}^a = P[S_{t+1}=s'|S_t=s,A_t=a] $
  • R is an (immediate) reward function for each state-action pair, $ R_s^a=E[R_{t+1}|S_t=s,A_t=a] $
  • $ \gamma $ is a discount factor ranging from 0 to 1.

To extend an MRP, an MDP defines a policy: a distribution over the agent's actions in state $s$:

$$ \pi(a|s)=P[A_t=a|S_t=s] $$

Policies depend only on the current state $s$ (not the history) and are stationary (time-independent). An MDP also redefines the state-value function and the action-value function, which are the expected return starting from state $s$ (and, for $q_\pi$, additionally taking action $a$) and then following the policy $\pi$:

$$ v_\pi(s)=E_\pi[G_t|S_t=s] $$ $$ q_\pi(s,a) = E_\pi[G_t|S_t=s,A_t=a] $$

We then rewrite these equations as Bellman Expectation equations:

$$ v_\pi(s) = E_\pi[R_{t+1}+\gamma v_\pi(S_{t+1})|S_t=s] $$ $$ q_\pi(s,a) = E_\pi[R_{t+1}+\gamma q_\pi(S_{t+1},A_{t+1})|S_t=s,A_t=a] $$

After simple transformations, we can see how they relate to each other:

$$ v_\pi(s) = \sum_{a \in A}\pi(a|s)q_\pi(s,a) $$ $$ q_\pi(s,a) = R_s^a+\gamma \sum_{s' \in S}P_{ss'}^av_\pi(s')$$

Substituting one into the other gives the Bellman Expectation equations for $v_\pi$ and $q_\pi$ in terms of themselves:

$$ v_\pi(s) =\sum_{a \in A}\pi(a|s)(R_s^a+\gamma \sum_{s' \in S}P_{ss'}^av_\pi(s')) $$ $$ q_\pi(s,a) = R_s^a+\gamma \sum_{s' \in S}P_{ss'}^a\sum_{a' \in A}\pi(a'|s')q_\pi(s',a') $$

We can solve for the MDP state-value function using the Bellman Expectation equation in matrix form:

$$ v_\pi = (I - \gamma P^\pi)^{-1}R^\pi $$ where $$ P_{ss'}^\pi = \sum_{a \in A}\pi(a|s)P_{ss'}^a $$ $$ R_{s}^\pi = \sum_{a \in A}\pi(a|s)R_{s}^a $$
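As a sketch of how a fixed policy collapses the MDP into an MRP, the following builds $P^\pi$ and $R^\pi$ for a hypothetical 3-state, 2-action MDP under a uniform random policy (all quantities are made up) and then solves for $v_\pi$ exactly:

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))   # P[a, s, s']
R = np.array([[0.0, 1.0],
              [2.0, 0.0],
              [0.0, 0.0]])                                          # R[s, a]
pi = np.full((n_states, n_actions), 1.0 / n_actions)                # pi(a|s), uniform

# P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}   and   R^pi_s = sum_a pi(a|s) R^a_s
P_pi = np.einsum('sa,asp->sp', pi, P)
R_pi = (pi * R).sum(axis=1)

# Exact solve of the Bellman Expectation equation in matrix form
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
print(v_pi)
```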

Computing the state-value function directly is computationally expensive: $O(n^3)$ for $n$ states. For large state spaces we must use iterative methods, e.g. Dynamic Programming, Monte-Carlo evaluation, Temporal-Difference learning, etc.
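For large state spaces, a common alternative is to apply the Bellman Expectation backup repeatedly until it converges; here is a minimal dynamic-programming sketch, reusing `P_pi`, `R_pi`, `gamma`, and `n_states` from the previous snippet:

```python
# Iterative policy evaluation: v <- R^pi + gamma * P^pi v, repeated to a fixed point.
v = np.zeros(n_states)
for _ in range(10_000):
    v_new = R_pi + gamma * P_pi @ v
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new
print(v)   # converges to the same v_pi as the direct solve
```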

Optimal Policy for MDP

We define the optimal state-value function and action-value function as the maximum of these functions over all policies:

$$ v_*(s) = \underset{\pi}{max}v_\pi(s) $$ $$ q_*(s,a) = \underset{\pi}{max}q_\pi(s,a) $$

The optimal value function specifies the best possible performance in the MDP. We rely on the following theorem about MDPs to find optimal policies:

  • There exists an optimal policy that is better than or equal to all other policies, $ \pi_* \geq \pi, \forall \pi $
  • There may be multiple policies that achieve the optimal state-value and action-value functions.

If we know the optimal $q_*(s,a)$, we immediately have an optimal policy by acting greedily, $ \pi_*(s) = \underset{a}{argmax}\,q_*(s,a) $. We can solve recursively for the optimal state-value and action-value functions through the Bellman optimality equations:

$$ v_*(s) = \underset{a}{max}\left(R_s^a+\gamma \sum_{s' \in S}P_{ss'}^av_*(s')\right) $$ $$ q_*(s,a) = R_s^a+\gamma \sum_{s' \in S}P_{ss'}^a\underset{a'}{max}\,q_*(s',a')$$

Again, the Bellman optimality equations are non-linear and have no closed-form solution. We must use iterative methods to solve them, e.g. Value Iteration, Policy Iteration, Q-learning, Sarsa.
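As one example, here is a minimal value-iteration sketch for the same hypothetical MDP layout ($P[a,s,s']$ and $R[s,a]$ as above); it applies the Bellman optimality backup until convergence and then reads off a greedy policy:

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))   # P[a, s, s']
R = np.array([[0.0, 1.0],
              [2.0, 0.0],
              [0.0, 0.0]])                                          # R[s, a] (made up)

# Value iteration: v(s) <- max_a ( R[s,a] + gamma * sum_{s'} P[a,s,s'] v(s') )
v = np.zeros(n_states)
for _ in range(10_000):
    q = R + gamma * np.einsum('asp,p->sa', P, v)   # q[s, a]
    v_new = q.max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-6:
        break
    v = v_new

policy = q.argmax(axis=1)   # greedy (deterministic) policy w.r.t. q_*
print(v, policy)
```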