Exercise 4.6: Changes in Policy Iteration algorithm for epsilon-soft policies #97

@MoeenTB

I think the answer could be a bit more detailed. What I have in mind is not very qualitative, but it goes something like this:

Modified Step 3:
The second line of the inner loop:
The policy should become stochastic, so the assignment to $\pi(s)$ is replaced by an assignment to $\pi(\cdot|s)$. Let $a^* = \arg\max\limits_{a} \sum_{s', r} p(s', r|s, a) [r + \gamma V(s')]$ be the greedy action. Then $\pi(a|s) = \dfrac{\epsilon}{|A(s)|}$ for all $a \in A(s)$ other than $a^*$, and $\pi(a^*|s) = 1 - \dfrac{(|A(s)| - 1) \epsilon}{|A(s)|}$.
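
A minimal sketch of this improvement step for a single state, in case it helps. The function name `epsilon_soft_greedy` and the argument `q_values` are hypothetical: `q_values[a]` is assumed to already hold the one-step lookahead $\sum_{s', r} p(s', r|s, a) [r + \gamma V(s')]$ for action $a$, and ties are broken by taking the lowest action index.

```python
def epsilon_soft_greedy(q_values, epsilon):
    """Return pi(.|s) for one state: epsilon/|A| on every non-greedy
    action and 1 - (|A| - 1) * epsilon / |A| on the greedy action.

    q_values[a] is assumed to be the one-step lookahead value of action a.
    """
    n = len(q_values)
    # Greedy action; ties go to the lowest index (an arbitrary choice here).
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[greedy] = 1.0 - (n - 1) * epsilon / n
    return probs


# Three actions, the second one has the highest lookahead value.
probs = epsilon_soft_greedy([1.0, 3.0, 2.0], epsilon=0.3)
print(probs)  # roughly [0.1, 0.8, 0.1], up to floating-point rounding
```

Note that the greedy probability equals $1 - \epsilon + \dfrac{\epsilon}{|A(s)|}$, which is the same quantity written differently.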

Modified Step 2:
The second line of the inner loop:
$V(s) \gets \sum_{a} \pi(a|s) \sum_{s', r} p(s', r|s, a) [r + \gamma V(s')]$
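
The update above could be sketched as follows. The data layout is an assumption made only for illustration: `P[s][a]` is a list of `(probability, next_state, reward)` triples for each non-terminal state, `pi[s][a]` is the action probability, and terminal states simply keep value 0. The name `evaluate_policy` is hypothetical as well.

```python
def evaluate_policy(P, pi, n_states, gamma=0.9, theta=1e-10):
    """In-place iterative policy evaluation for a stochastic policy.

    P[s][a]  -> list of (prob, next_state, reward), non-terminal s only
    pi[s][a] -> probability of taking action a in state s
    """
    V = [0.0] * n_states  # terminal states are never updated and stay at 0
    while True:
        delta = 0.0
        for s in P:
            v_old = V[s]
            # Expectation over actions under pi, then over (s', r) under p.
            V[s] = sum(
                pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            return V


# Toy MDP: state 0 is non-terminal, state 1 is terminal.
# Action 0 stays in state 0 with reward 1; action 1 ends the episode with reward 0.
P = {0: {0: [(1.0, 0, 1.0)], 1: [(1.0, 1, 0.0)]}}
pi = {0: [0.5, 0.5]}
V = evaluate_policy(P, pi, n_states=2, gamma=0.9)
print(V[0])  # close to 0.5 / 0.55, the fixed point of V = 0.5 * (1 + 0.9 * V)
```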

Modified Step 1:
$\pi$ should satisfy the $\epsilon$-soft criterion when it is initialized as well, i.e. $\pi(a|s) \geq \dfrac{\epsilon}{|A(s)|}$ for all $s$ and all $a \in A(s)$. One way to make sure of that is to set $\pi(a|s)$ for all but one $a \in A(s)$ to $\dfrac{\epsilon}{|A(s)|}$ and $\pi(a|s)$ for the remaining action to $1 - \dfrac{(|A(s)| - 1) \epsilon}{|A(s)|}$.
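
A small sketch of such an initialization together with a check of the $\epsilon$-soft condition. Both helper names (`init_epsilon_soft`, `is_epsilon_soft`) and the `favored` parameter are hypothetical. The check also shows that the uniform random policy would be a valid starting point for any $\epsilon \le 1$, while a deterministic policy would not.

```python
def init_epsilon_soft(n_actions, epsilon, favored=0):
    """Give every action epsilon/|A| and put the remaining mass on `favored`."""
    base = epsilon / n_actions
    probs = [base] * n_actions
    probs[favored] = 1.0 - (n_actions - 1) * base
    return probs


def is_epsilon_soft(probs, epsilon, tol=1e-12):
    """True if probs sums to 1 and every entry is at least epsilon/|A|."""
    floor = epsilon / len(probs)
    return abs(sum(probs) - 1.0) < tol and all(p >= floor - tol for p in probs)


pi0 = init_epsilon_soft(n_actions=4, epsilon=0.2)
uniform = [0.25] * 4
deterministic = [1.0, 0.0, 0.0, 0.0]

print(is_epsilon_soft(pi0, 0.2))            # True
print(is_epsilon_soft(uniform, 0.2))        # True
print(is_epsilon_soft(deterministic, 0.2))  # False
```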
