Sequential decision problems
Deals with both fully observable and partially observable environments.
Markovian assumption: the transition probability depends only on the current state $s$, not on the history of earlier states.
Markov decision process (MDP): A sequential decision problem for a fully observable stochastic environment with Markovian transition and additive rewards.
MDP: $(S, A, T, R)$
A set of states, $S$, with initial state $s_0$
A set of actions, $A$
A transition model, $T$, defined by $P(s'|s,a)$; outcomes of actions are stochastic.
A reward function, $R(s)$ (sometimes, more generally, $R(s,a,s')$). The utility function depends on the sequence of states, while each individual state has a corresponding reward $R(s)$.
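As a concrete illustration, here is a minimal sketch of how the tuple $(S, A, T, R)$ could be represented in code; the two-state example, the action names, and all probabilities and rewards below are illustrative assumptions, not part of the notes.

```python
# Minimal MDP container; every concrete value below is an illustrative assumption.
from dataclasses import dataclass

@dataclass
class MDP:
    states: list        # S
    actions: list       # A
    transitions: dict   # T: (s, a) -> {s': P(s' | s, a)}
    rewards: dict       # R: s -> R(s)
    gamma: float = 0.9  # discount factor

# Toy two-state example with one stochastic action outcome
mdp = MDP(
    states=["s0", "s1"],
    actions=["stay", "go"],
    transitions={
        ("s0", "stay"): {"s0": 1.0},
        ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "go"):   {"s1": 1.0},
    },
    rewards={"s0": 0.0, "s1": 1.0},
)
```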
A policy, $\pi(s)$, is a solution to an MDP: a function from states to actions, and it is not unique. It maps whatever state the system is in to an appropriate action.
Policies differ in how good they are (the quality of the policy), measured by the expected return: the expected utility of the possible state sequences generated by the policy.
Optimal policy $\pi^{*}$: the policy that generates the highest expected utility.
Ways to evaluate
Finite horizon:
- fixed time $N$ and terminate,
- return is usually the sum of rewards over the sequence, $R(s_0) + R(s_1) + \dots + R(s_N)$,
- optimal action at a state can change depending on how many steps remain,
- optimal policy is non-stationary.
Infinite horizon:
- no fixed time limit,
- the optimal action depends only on the current state: the optimal policy is stationary,
- comparing utilities is difficult: additive utilities can be infinite,
- discounted rewards: with rewards bounded by $\pm R_{max}$ and $\gamma < 1$, the discounted utility stays bounded (see the derivation after this list).
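Why the discounted utility is bounded, assuming $|R(s)| \le R_{max}$ and $0 \le \gamma < 1$:

$$\left|\sum_{t=0}^{\infty} \gamma^{t} R(s_t)\right| \le \sum_{t=0}^{\infty} \gamma^{t} R_{max} = \frac{R_{max}}{1-\gamma}$$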
Expected utility of executing $\pi$ starting from $s$: $U^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t)\right]$. Optimal policy: $\pi^{*} = \arg\max_{\pi} U^{\pi}(s)$. The number of policies $\pi$ is finite if the number of states and actions is finite.
$n$ states and $a$ actions lead to $a^n$ policies.
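A tiny worked check of the $a^n$ count (the state and action names here are made up for illustration): with $n = 3$ states and $a = 2$ actions there are $2^3 = 8$ deterministic policies.

```python
# Enumerate all deterministic policies for a toy MDP: one action choice per state.
from itertools import product

states = ["s0", "s1", "s2"]   # n = 3
actions = ["left", "right"]   # a = 2

policies = [dict(zip(states, choice)) for choice in product(actions, repeat=len(states))]
print(len(policies))          # 8 == 2 ** 3
```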
Value function
The value of the optimal policy is called the value function, $U^{\pi^{*}}(s)$.
Given the value function, we can compute an optimal policy as: $\pi^{*}(s) = \arg\max_{a} \sum_{s'} P(s'|s,a)\, U^{\pi^{*}}(s')$.
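A minimal sketch of this one-step greedy extraction, assuming the dictionary-based transition model from the earlier MDP sketch (`transitions[(s, a)]` maps next states to probabilities) and a value estimate `U` given as a dict:

```python
def greedy_policy(states, actions, transitions, U):
    """Extract pi(s) = argmax_a sum_{s'} P(s'|s,a) * U(s') for every state."""
    return {
        s: max(actions,
               key=lambda a: sum(p * U[s2] for s2, p in transitions[(s, a)].items()))
        for s in states
    }
```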
Value iteration
What if the agent performs suboptimally on purpose? Is the agent still rational? (e.g., taking a longer route to reduce the probability of a large utility loss)
State variables: factored representation; each variable can take various values.
State: atomic representation; a state is an assignment of a specific value to every state variable.
Fixed-point algorithm: run the update many times and it will eventually converge to a fixed result (when $\gamma < 1$). The algorithm terminates when the utility function no longer changes.
Applying an operator to a function maps it to another function.
Function: a mapping from input to output. For finite domains, a function is a vector.
Operator: takes a function as input and gives another function as output.
Contraction: the distance between two functions shrinks when the operator is applied.
Max norm: $||U|| = \max_s |U(s)|$
Bellman operator $B: \mathbb{R}^{|S|} \rightarrow \mathbb{R}^{|S|}$, i.e., it maps a value vector to a value vector: $B(V)(s) = R(s) + \gamma \max_a \sum_{s'} P(s'|s,a)\, V(s')$.
In matrix form, $T$ is the transition matrix and $V$ is the value vector.
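A sketch of one application of the Bellman operator in this matrix/vector view, using NumPy; the array shapes (`T` indexed as `T[a, s, s']`, `R` and `V` as length-$|S|$ vectors) are assumptions made for the sketch, not notation fixed by the notes:

```python
import numpy as np

def bellman_operator(T, R, V, gamma=0.9):
    """B(V)(s) = R(s) + gamma * max_a sum_s' T[a, s, s'] * V[s'].

    T: (|A|, |S|, |S|) transition tensor, R: (|S|,) rewards, V: (|S|,) values.
    """
    Q = T @ V                        # Q[a, s] = expected next-state value, shape (|A|, |S|)
    return R + gamma * Q.max(axis=0)
```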
Lemma 1: the Bellman operator is a contraction: $||B(V) - B(V')|| \le \gamma\, ||V - V'||$.
Lemma 2: value iteration converges to the optimal value function $V^{*}$ ($= V^{\pi^{*}}$) from any initial estimate $V$.
Fixed-point property: $\forall t,\ B^{t}(V^{*}) = V^{*}$, i.e., applying $B$ to the optimal value function leaves it unchanged.
After $N$ iterations: $||B^{N}(V) - V^{*}|| = ||B^{N}(V) - B^{N}(V^{*})|| \le \gamma^{N}\, ||V - V^{*}||$,
which is equivalent to: $||V_{N} - V^{*}|| \le \gamma^{N}\, ||V_{0} - V^{*}||$.
When $N$ is large enough, the right side of the inequality approaches zero (since $\gamma < 1$), so the value vector converges to $V^{*}$.
$B^{t}(V)$: apply the Bellman operator to $V$ $t$ times.
Time needed to converge:
The max initial error is $||V_{0} - V^{*}|| \le 2R_{max}/(1-\gamma)$, since utilities are bounded by $\pm R_{max}/(1-\gamma)$.
After $N$ iterations the error is at most $\gamma^{N} \cdot 2R_{max}/(1-\gamma)$.
The termination condition is for this error to be small enough: requiring $\gamma^{N} \cdot 2R_{max}/(1-\gamma) \le \epsilon$ gives $N = \lceil \log(2R_{max}/(\epsilon(1-\gamma))) / \log(1/\gamma) \rceil$.
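A minimal value-iteration loop built on the bound above (NumPy, same assumed array shapes as the Bellman-operator sketch); it simply runs the $N$ updates given by the formula:

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, eps=1e-4):
    """Run N Bellman updates, with N chosen so that
    gamma^N * 2 * R_max / (1 - gamma) <= eps."""
    R_max = max(np.max(np.abs(R)), 1e-12)   # guard against all-zero rewards
    N = int(np.ceil(np.log(2 * R_max / (eps * (1 - gamma))) / np.log(1 / gamma)))
    V = np.zeros(R.shape[0])
    for _ in range(N):
        V = R + gamma * (T @ V).max(axis=0)  # one Bellman update
    return V
```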
Approximate Policy Evaluation
Proof: by the contraction property, $||V_{i+1} - V^{*}|| = ||B(V_{i}) - B(V^{*})|| \le \gamma\, ||V_{i} - V^{*}|| \le \gamma\,(||V_{i} - V_{i+1}|| + ||V_{i+1} - V^{*}||)$, hence $||V_{i+1} - V^{*}|| \le \frac{\gamma}{1-\gamma}\, ||V_{i+1} - V_{i}||$.
So the termination condition with a discount factor $\gamma$ is: stop when $||V_{i+1} - V_{i}|| < \epsilon(1-\gamma)/\gamma$, which guarantees $||V_{i+1} - V^{*}|| < \epsilon$.
All MDPs can be solved by value iteration in theory, but in practice they cannot be solved when there are too many states.
Policy iteration
Problem with VI: it is an exact algorithm and it scales poorly.
VI computes the utility of each state exactly, but to get an optimal policy we only need to know whether one action is clearly better than the others. So PI works with an ordering of policies instead.
Policy evaluation: given a policy $\pi_i$, calculate $U_i = U^{\pi_i}$, the utility if $\pi_i$ is executed.
Policy improvement: calculate a new policy $\pi_{i+1}$ using one-step look-ahead based on $U_i$.
Terminate when there is no change in the policy (must terminate since the number of policies is finite).
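A sketch of the two alternating steps, with exact policy evaluation done by solving the linear system $V^{\pi} = R + \gamma T_{\pi} V^{\pi}$ (NumPy, same assumed array shapes as before):

```python
import numpy as np

def policy_iteration(T, R, gamma=0.9):
    """Alternate exact policy evaluation and greedy improvement until the policy is stable."""
    n_actions, n_states, _ = T.shape
    pi = np.zeros(n_states, dtype=int)            # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * T_pi) V = R for the current policy
        T_pi = T[pi, np.arange(n_states), :]      # (|S|, |S|), row s = T[pi[s], s, :]
        V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)
        # Policy improvement: one-step look-ahead on V
        pi_new = (T @ V).argmax(axis=0)
        if np.array_equal(pi_new, pi):            # no change in the policy -> terminate
            return pi, V
        pi = pi_new
```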
Improving policy iteration
Modified policy iteration
Do only $k$ iterations instead of running policy evaluation to convergence.
Evaluate approximately.
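A sketch of that approximate evaluation step: $k$ sweeps of the fixed-policy update $V \leftarrow R + \gamma T_{\pi} V$ instead of an exact solve (same assumed shapes as before):

```python
import numpy as np

def approximate_policy_evaluation(T, R, pi, V, gamma=0.9, k=10):
    """k sweeps of V <- R + gamma * T_pi V (no solve to full convergence)."""
    T_pi = T[pi, np.arange(R.shape[0]), :]   # (|S|, |S|) rows for the chosen actions
    for _ in range(k):
        V = R + gamma * (T_pi @ V)
    return V
```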
Asynchronous policy iteration
Pick only a subset of states for either policy improvement or for updating in policy evaluation.
Converges as long as every state continues to be updated.
Generalized policy iteration (a general scheme of solution)
Update value according to policy, and improve policy WRT the value function.
VI, PI, Async-PI are all special cases of GPI.
Online search
VI/PI iterate through all states, and the number of states is exponential in the number of state variables.
To handle the curse of dimensionality:
- Use function approximation: a linear function of features (see the form sketched after this list) or deep neural networks.
- Do online search with sampling.
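For the linear-function-of-features option, a common parametric form (the features $f_i$ and weights $\theta_i$ here are generic placeholders, not specified in the notes) is:

$$\hat{U}_{\theta}(s) = \theta_{1} f_{1}(s) + \theta_{2} f_{2}(s) + \dots + \theta_{n} f_{n}(s)$$

so the value of a state is summarized by $n$ features instead of being stored separately for every state.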
Some other resources
https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/lecture-notes/lecture5-2pp.pdf
https://www.cs.cmu.edu/~arielpro/15381f16/c_slides/781f16-11.pdf