Mapping MDP to games and find optimal policies
From game to MDP realization
In the last blog, I have introduced MDP, which is a general environment mathematical-based abstraction, can be represented to almost every perfect environments (no system or random uncertainties). Games or simulation-based environments can be considered perfect environment that we can realize MDP on these environments. Note that there are many other environment abstractions beside MDP that RL can be applied e.g LQR, Partial Observable MDP.
In this blog, I use Easy21 as an test-bed game environment for many RL algorithms (agents) to learn optimal policies. Easy21 is similar to the BlackJack example in Sutton and Barto RL book, however, the rules are different and non standard. Easy21 environment is also an exercise of David Silver - UCL Course on RL, more about its game specification can be found here.
We can define the state $s$ of Easy21 MDP as vector of accumulated player score and initial dealer score (first black card draw by dealer). As you can see, player score must be in range from 1 to 21 and initial dealer score is in range from 1 to 10. Therefore, we will have total of 21x10 states, and hence we can apply Value Iteration (VI) or Policy Iteration (PI) without worrying about tractability of the algorithm. However, that would be the case only when we can realize the state transition probability matrix of this MDP easily, in this case, it does not. This is also the reason I choose this environment to demonstrate that some environments may have low number of states but are difficult to realize its dynamic and we have to use model-free RL methods to learn its optimal policy.
To explain why, we will attempt to compute 210x210 state transition matrix $P$ of Easy21. Let us consider the probability to transition from state $(10, 5)$ to $(11, 5)$ (remember that initial dealer score is unchange), how many ways can we accumulate score to get from 10 to 11? If we only add up the score, the answer is 1 and the transition probability is the probability that we draw card number one $P_{s,s'}=\frac{1}{10}$. However, in Easy21 we can add up or minus the score by drawing black or red card respectively, hence there are infinite ways to accumulate score from 10 to 11 (e.g 10 + 1, 10 - 3 + 4, etc) and transition probability is $P_{s, s'}=\sum\prod P_i$. Therefore, computing state transition matrix $P$ becomes intractable. In general, these are the informal steps that we usually do in MDP realization:
- Define state space $S$. In games, it's usually the same type with observation $O$ or direct observable output from environment.
- Infer action space $A$ from each state $s \in S$. (e.g in Easy21, at each state we can "hit" or "stick"
- Compute state transition probability matrix $P$ from enviroment dynamic.
- Three ways to define reward function: terminal rewards, intermediate rewards or both. It's depend on the requirements of the RL problem.
It is clear that we cannot apply VI or PI to solve optimal policy for this game. In next sections, I will introduce some popular value-based model-free RL algorithm with different schemes to learn Easy21 optimal policies. In the next blogs, policy-based RL algorithms will be introduced and evaluated.
Approaches to find optimal policies
In this section, I implement Monte-Carlo Control and TD($\lambda$) Sarsa algorithm to learn optimal policies of Easy21 under two different settings: table lookup $Q$ function and downsampling $Q$ function approximation. In addition, for balancing between exploration and exploitation, I update policy by $\epsilon$-greedy with GLIE schedule. Pseudocode for algorithm is presented in each section. If you are interested in actual implementation, it can be found in my repository.
Table lookup Q: Every-visit Monte-Carlo (MC) Control
Although the state-space of Easy21 is small enough to sweep full backup to calculate $Q$ at each state-action pair, we do not know the state transition matrix. In this case, we sample the series of state-action-reward tuple $(s, a, r, s', a')$ by letting RL agent interacts with environment based on its current policy. In MC Control, we update $Q(s_t, a_t)$ by adding error between new return $G_t$ and current $Q(s_t, a_t)$ scaled by a regulated step size $\alpha_t$:
$$ Q(s_t, a_t) = Q(s_t, a_t) + \alpha_t(G_t - Q(s_t, a_t)) $$ where $$ G_t = \sum_{\tau=t}^T\gamma^{\tau-t}r_{\tau} $$ $$ \alpha_t = \frac{1}{N(s_t, a_t)} $$Intuitively, each $Q(s_t, a_t)$ is an average of return $G_t$ as running sum so far through many episodes, the average can add new member easily by scaling new error with $\alpha_t$. We define average return as running sum like this to be compatible with iterative algorithm. Note that $N(s_t, a_t)$ counts every time agent visit state $s_t$ and do action $a_t$. After each episode, we can update policy following new action function $Q$ by $\epsilon$-greedy scheme:
$$ \epsilon = \frac{N_0}{N_0 + N(s_t)}$$ $$ \pi(a|s) = \begin{cases} \frac{\epsilon}{m} + 1 - \epsilon & a^* = argmax_{a \in A}Q(s,a) \\ \frac{\epsilon}{m} & otherwise \end{cases} $$Updating policy with $\epsilon$-greedy scheme will ensure that $Q$ function does not converge to local minima by only exploiting. The convergence properties will be dicussed in next sections. Pseudocode for MC Control can be seen as below:
I run MC Control on Easy21 enviroment implemented as an extension of OpenAI gym framework. Source code of Easy21 enviroment can be found here After 500000 episodes, the value function $V$ starts to converge as shown below. Note that I derive value function $V$ from $Q$ by $V(s) = max_aQ(s,a)$, the policy is also derived by the same manner.
.
Table lookup Q: TD($ \lambda $) Sarsa
Like MC Control, TD($ \lambda $) Sarsa also learns optimal policy by sampling experiences. However, we does not add scaled error between full return $G_t$ and current $Q(s_t, a_t)$. Instead, intuitively, we move $Q(s_t, a_t)$ toward new bootstrapped $Q$ by applying Bellman equation. We do not need to wait for reaching terminal state of each episode to update $Q(s_t, a_t)$, we update on-the-go whenever we reach state $s_t$ and do action $a_t$. This scheme is more suitable for agents to learn continuous tasks in real world (may have no terminal state or needing immediate reacts).
$$ Q(s_t, a_t) = Q(s_t, a_t) + \alpha_t(q_t^{\lambda} - Q(s_t, a_t)) $$ where $q_t^{\lambda}$ is a weighted sum n-steps returns $q_t^{(n)}$: $$ q_t^{\lambda} = (1 - \lambda)\sum_{n=1}^{\infty}\lambda^{n-1}q_t^{(n)} $$We can think $q_t^{\lambda}$ as an average of close to far bootstrapped returns, it combines the characteristics of MC update and TD(0) update. $q_t^{\lambda}$ conveys how good the state-action pair that agent estimates combines with true sample from enviroment. We can see it more clearly in pseudocode of Backward view Td($\lambda$) Sarsa:
I also run TD($\lambda$) Sarsa on Easy21 enviroment with 500000 episodes and $\lambda = 0$, the value function $V$ starts to converge as shown below. I also measure mean square error between $Q$ function learnt from Sarsa and "true" $Q$ function from MC control.
Approximated Q: TD($ \lambda $) Sarsa
In the case where the state space of MDP is very large, we could approximate $Q$ by a model. For instance, linear approximation of $Q$ would be features dot model parammeters $Q(s, a) = \phi(s, a)^T\theta$. I would like to apply linear approximation on $Q$ of this simple enviroment to demonstrate how linear function approximation is done. The feature is chose as description in Easy21 exercise. With function approximation, we do not update $Q$ directly, because model parammeters now represented the function $Q$. We update model parameters as follow:
$$ \theta = \theta + \alpha_t(q_t^{\lambda} - Q(s_t, a_t, \theta))\nabla_{\theta}Q(s_t, a_t, \theta) $$Pseudo code for function approximation of TD($\lambda$) Sarsa is the same as table lookup Q, we just need to replace update rule as specified.
We will evaluate this algorithm by measuring mean square error between $Q$ function learnt and "true" $Q$ function from MC control.
Discussion
We usually only use MC Control as a base-line for computing optimal $Q$ function for any MDP. MC Control has great learning characteristics:
- Unbiased estimation of Q (average of returns is unbiased).
- For large enough episodes, $Q$ function computed by MC Control will converge to minimum mean square error of true $Q$ function.
However, MC Control $Q$ estimation has large variance, it also needs large amount of samples to start converging and it's not suitable for online learning. In reality, one episode may never end, TD($\lambda$) Sarsa is the better option for online learning, because TD($\lambda$) Sarsa uses bootstrapped estimation of current state Q (backward view TD no need to wait for end of episode). TD($\lambda$) $Q$ estimation converges to maximum likelihood estimation of environment MDP and has much lower variance compared to MC control.
When the state space is large, it is essential that we must downsample approximation state space to ensure tractability. However, as can be seen in evaluation, in exchange of speed as expected, approximated $Q$ function is hard to converge and introduces variance. Downsampling MDP like this may not capture the true characteristics of environment.
Next blogs I will present about Policy-based algorithms in some openAI gym environment. These algorithms are promising for large state space with uncertainties in real world application.