OwenZhu's Blog

Lecture Notes for Reinforcement Learning (MDP)

2018/01/19

Reference: David Silver, UCL reinforcement learning, lecture 2; CS 294 Deep Reinforcement Learning, Fall 2017

Markov Process (or Markov Chain)

Here we assume that the environment is fully observable, which means “the current state completely characterises the process”.

A Markov process is a tuple $\left< S, P \right>$, where $S$ denotes the state space (discrete or continuous) and $P$ denotes the state transition probability matrix.

Sometimes we use the transition operator $T$ instead of $P$, but why is it called an operator? Suppose we have a vector of probabilities $\vec{\mu}_t$ (meaning we are not completely sure which state we are in), with $\mu_{t,i} = p(s_t = i)$. Then $\vec{\mu}_{t+1}$ can be obtained as $T\vec{\mu}_t$, so $T$ is a linear operator. If the states are continuous, it is still a linear operator, just an infinitely large one.
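To make this concrete, here is a minimal numpy sketch (the 3-state chain and its transition matrix are made up for illustration): applying $T$ to a distribution over states is just a matrix-vector product.

```python
import numpy as np

# Hypothetical 3-state Markov chain. Column-stochastic convention:
# T[i, j] = p(s_{t+1} = i | s_t = j), so mu_{t+1} = T @ mu_t.
T = np.array([[0.9, 0.2, 0.0],
              [0.1, 0.7, 0.3],
              [0.0, 0.1, 0.7]])

mu = np.array([1.0, 0.0, 0.0])   # start in state 0 with certainty
for t in range(5):
    mu = T @ mu                  # one application of the transition operator
    print(f"t={t + 1}, mu={mu}")
```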

Therefore, an agent has nothing to do in a Markov process: it simply moves from one state to another according to the transition probabilities.

Markov Reward Process

A Markov process becomes a Markov reward process when a reward function $R$ and a discount factor $\gamma$ are added to it. The agent walks around (its movement still depends entirely on the environment) and the environment gives it a signal (a reward) at each step.
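For a finite MRP, assuming the standard value function $v(s) = \mathbb{E}[G_t \mid S_t = s]$ with return $G_t = R_{t+1} + \gamma R_{t+2} + \dots$, the values satisfy $v = R + \gamma P v$ and can be solved for directly. A minimal sketch (the 3-state MRP below is made up for illustration):

```python
import numpy as np

# Hypothetical 3-state MRP. Row-stochastic convention here:
# P[i, j] = p(s_{t+1} = j | s_t = i), R[i] = expected immediate reward in state i.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([1.0, 2.0, 0.0])
gamma = 0.9

# Bellman equation for an MRP: v = R + gamma * P v  =>  solve (I - gamma * P) v = R
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
```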

Markov Decision Process

  • Bellman Expectation Equation

    The state-value function:

    $$v_\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \right]$$

    The action-value function:

    $$q_\pi(s, a) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s,\, A_t = a \right]$$

    We can rewrite the Bellman expectation equation by substituting $q_\pi(s, a)$ for $v_\pi(s)$, or $v_\pi(s)$ for $q_\pi(s, a)$, on the right-hand side. It is very important to understand the relationship between $v_\pi(s)$ and $q_\pi(s, a)$ here; it appears everywhere in the following lectures. As with the Markov reward process, we can find a direct closed-form solution of the Bellman expectation equation given $\gamma$, $R_\pi$ and $P_\pi$ (see the sketch after this list).

  • Bellman Optimality Equation

    In the same way, we can rewrite the Bellman optimality equation in terms of $v_*$ or $q_*$. The action $a$ in $q_*(s, a)$ is treated as an input; after taking it, the agent follows $\pi^*$ and collects all subsequent returns. $v_*(s)$ equals $q_*(s, a)$ only when $a$ is the optimal action $a^*$. There is no closed-form solution of the Bellman optimality equation: the $\max$ operator makes it nonlinear. Instead, we use iterative approaches such as policy iteration, value iteration, Q-learning, and SARSA (a value-iteration sketch also follows this list).
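Here is a minimal sketch of both ideas (the 3-state, 2-action MDP, its transition tensor `P`, rewards `R`, and the uniform random policy below are all made up for illustration): the Bellman expectation equation is linear in $v_\pi$, so for a fixed policy one linear solve suffices, while the Bellman optimality equation contains a $\max$ and is handled here by value iteration.

```python
import numpy as np

# Hypothetical tabular MDP with 3 states and 2 actions (made up for illustration).
# P[a, s, s2] = p(s2 | s, a), R[s, a] = expected immediate reward, gamma = discount.
P = np.array([
    [[0.8, 0.2, 0.0],   # action 0
     [0.0, 0.8, 0.2],
     [0.0, 0.0, 1.0]],
    [[0.2, 0.8, 0.0],   # action 1
     [0.0, 0.2, 0.8],
     [0.0, 0.0, 1.0]],
])
R = np.array([[0.0, -0.1],
              [0.0, -0.1],
              [1.0,  1.0]])
gamma = 0.9
n_states, n_actions = R.shape

# Bellman expectation equation, closed form for a fixed policy pi:
# v_pi = (I - gamma * P_pi)^{-1} R_pi, with P_pi and R_pi averaged over pi(a|s).
pi = np.full((n_states, n_actions), 0.5)          # uniform random policy
P_pi = np.einsum('sa,ast->st', pi, P)             # P_pi(s, s') = sum_a pi(a|s) P(s'|s, a)
R_pi = np.einsum('sa,sa->s', pi, R)               # R_pi(s)     = sum_a pi(a|s) R(s, a)
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
print("v_pi:", v_pi)

# Bellman optimality equation, no closed form, so iterate (value iteration):
# v(s) <- max_a [ R(s, a) + gamma * sum_{s'} P(s' | s, a) v(s') ]
v = np.zeros(n_states)
for _ in range(1000):
    q = R + gamma * np.einsum('ast,t->sa', P, v)  # one-step lookahead for every (s, a)
    v_new = q.max(axis=1)                         # greedy (max) backup
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new
print("v*:", v, "greedy policy:", q.argmax(axis=1))
```

The contrast is the point: the expectation equation is linear, so a single linear solve is enough, while the $\max$ in the optimality backup forces an iterative fixed-point computation.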
