OwenZhu's Blog

Lecture Notes for Reinforcement Learning (MDP)

2018/01/19

Reference: David Silver, UCL reinforcement learning, lecture 2; CS 294 Deep Reinforcement Learning, Fall 2017

Markov Process (or Markov Chain)

Here we assume that the environment is fully observable, which means “the current state completely characterises the process”.

A Markov process is a tuple $\left< S, P \right>$, where $S$ denotes the state space (discrete or continuous) and $P$ denotes the state transition probability matrix.

Sometimes we use the transition operator $T$ instead of $P$, but why is it called an operator? Suppose we have a vector of probabilities $\vec{\mu}_t$ (meaning we are not completely sure which state we are in), with $\mu_{t,i} = p(s_t = i)$. Then $\vec{\mu}_{t+1}$ can be obtained as $T\vec{\mu}_t$, so $T$ is a linear operator. If the states are continuous, it is still a linear operator, just an infinitely large one.
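To make this concrete, here is a minimal numpy sketch (the 3-state chain and its transition matrix are made up for illustration): applying $T$ to a distribution over states is just a matrix-vector product.

```python
import numpy as np

# Hypothetical 3-state Markov chain. Column-stochastic convention:
# T[i, j] = p(s_{t+1} = i | s_t = j), so mu_{t+1} = T @ mu_t.
T = np.array([[0.9, 0.2, 0.0],
              [0.1, 0.7, 0.3],
              [0.0, 0.1, 0.7]])

mu = np.array([1.0, 0.0, 0.0])   # start in state 0 with certainty
for t in range(5):
    mu = T @ mu                  # one application of the transition operator
    print(f"t={t + 1}, mu={mu}")
```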

Therefore, an agent has nothing to do in a Markov process: it simply moves from one state to another according to the transition probabilities.

Markov Reward Process

A Markov process becomes a Markov reward process when a reward function $R$ and a discount factor $\gamma$ are added to it. The agent walks around (its movement still depends entirely on the environment) and the environment gives it a signal (a reward) at each step.
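For a finite MRP, assuming the standard value function $v(s) = \mathbb{E}[G_t \mid S_t = s]$ with return $G_t = R_{t+1} + \gamma R_{t+2} + \dots$, the values satisfy $v = R + \gamma P v$ and can be solved for directly. A minimal sketch (the 3-state MRP below is made up for illustration):

```python
import numpy as np

# Hypothetical 3-state MRP. Row-stochastic convention here:
# P[i, j] = p(s_{t+1} = j | s_t = i), R[i] = expected immediate reward in state i.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([1.0, 2.0, 0.0])
gamma = 0.9

# Bellman equation for an MRP: v = R + gamma * P v  =>  solve (I - gamma * P) v = R
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
```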

Markov Decision Process

  • Bellman Expectation Equation

    The state-value function:

    $$v_\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \right]$$

    The action-value function:

    $$q_\pi(s, a) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s,\, A_t = a \right]$$

    We can rewrite the Bellman expectation equation by substituting $q_\pi(s, a)$ for $v_\pi(s)$, or $v_\pi(s)$ for $q_\pi(s, a)$, on the right-hand side. It is very important to understand the relationship between $v_\pi(s)$ and $q_\pi(s, a)$ here; it appears everywhere in the following lectures. As with the Markov reward process, we can find a direct closed-form solution of the Bellman expectation equation given $\gamma$, $R_\pi$ and $P_\pi$ (see the sketch after this list).

  • Bellman Optimality Equation

    In the same way, we can rewrite the Bellman optimality equation in terms of $v_*$ or $q_*$. The action $a$ in $q_*(s, a)$ is treated as an input; after taking it, the agent follows $\pi^*$ and collects all subsequent returns. $v_*(s)$ equals $q_*(s, a)$ only when $a$ is the optimal action $a^*$. There is no closed-form solution of the Bellman optimality equation: the $\max$ operator makes it nonlinear. Instead, we use iterative approaches such as policy iteration, value iteration, Q-learning, and SARSA (a value-iteration sketch also follows this list).
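Here is a minimal sketch of both ideas (the 3-state, 2-action MDP, its transition tensor `P`, rewards `R`, and the uniform random policy below are all made up for illustration): the Bellman expectation equation is linear in $v_\pi$, so for a fixed policy one linear solve suffices, while the Bellman optimality equation contains a $\max$ and is handled here by value iteration.

```python
import numpy as np

# Hypothetical tabular MDP with 3 states and 2 actions (made up for illustration).
# P[a, s, s2] = p(s2 | s, a), R[s, a] = expected immediate reward, gamma = discount.
P = np.array([
    [[0.8, 0.2, 0.0],   # action 0
     [0.0, 0.8, 0.2],
     [0.0, 0.0, 1.0]],
    [[0.2, 0.8, 0.0],   # action 1
     [0.0, 0.2, 0.8],
     [0.0, 0.0, 1.0]],
])
R = np.array([[0.0, -0.1],
              [0.0, -0.1],
              [1.0,  1.0]])
gamma = 0.9
n_states, n_actions = R.shape

# Bellman expectation equation, closed form for a fixed policy pi:
# v_pi = (I - gamma * P_pi)^{-1} R_pi, with P_pi and R_pi averaged over pi(a|s).
pi = np.full((n_states, n_actions), 0.5)          # uniform random policy
P_pi = np.einsum('sa,ast->st', pi, P)             # P_pi(s, s') = sum_a pi(a|s) P(s'|s, a)
R_pi = np.einsum('sa,sa->s', pi, R)               # R_pi(s)     = sum_a pi(a|s) R(s, a)
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
print("v_pi:", v_pi)

# Bellman optimality equation, no closed form, so iterate (value iteration):
# v(s) <- max_a [ R(s, a) + gamma * sum_{s'} P(s' | s, a) v(s') ]
v = np.zeros(n_states)
for _ in range(1000):
    q = R + gamma * np.einsum('ast,t->sa', P, v)  # one-step lookahead for every (s, a)
    v_new = q.max(axis=1)                         # greedy (max) backup
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new
print("v*:", v, "greedy policy:", q.argmax(axis=1))
```

The contrast is the point: the expectation equation is linear, so a single linear solve is enough, while the $\max$ in the optimality backup forces an iterative fixed-point computation.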
