Exercise 4.6: Changes in Policy Iteration algorithm for epsilon-soft policies

I think the answer can be a bit more detailed. What I have in mind is less qualitative and goes something like this:

Modified Step 3:
The second line of the inner loop:
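$\pi(s)$ should become stochastic, i.e. a distribution $\pi(a|s)$ satisfying $\pi(a|s) = \dfrac{\epsilon}{|A(s)|}$ for all $a \in A(s)$ other than $a^* = \text{arg}\max\limits_{a} \sum_{s', r} p(s', r|s, a) [r + \gamma V(s')]$, and $\pi(a^*|s) = 1 - \dfrac{(|A(s)| - 1) \epsilon}{|A(s)|} = 1 - \epsilon + \dfrac{\epsilon}{|A(s)|}$ for the greedy action $a^*$ (ties broken arbitrarily).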
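As a rough illustration of the modified improvement step, here is a minimal Python sketch. The function name `epsilon_soft_greedy` and the list `action_values` are hypothetical; `action_values[a]` is assumed to already hold the one-step lookahead $\sum_{s', r} p(s', r|s, a) [r + \gamma V(s')]$ for a single state.

```python
def epsilon_soft_greedy(action_values, epsilon):
    """Return epsilon-soft action probabilities that are greedy w.r.t. action_values.

    action_values: list of one-step lookahead values, one per action (assumed given).
    epsilon: exploration parameter in (0, 1].
    """
    n = len(action_values)
    # Index of the greedy action; ties go to the first maximizer.
    greedy = max(range(n), key=lambda a: action_values[a])
    # Every action gets the minimum probability epsilon / |A(s)| ...
    probs = [epsilon / n] * n
    # ... and the greedy action receives all the remaining probability mass.
    probs[greedy] = 1.0 - (n - 1) * epsilon / n
    return probs


# Example with three actions, where the second one has the highest value.
probs = epsilon_soft_greedy([1.0, 3.0, 2.0], epsilon=0.1)
```

The returned probabilities sum to 1, and each is at least $\epsilon / |A(s)|$, so the result stays $\epsilon$-soft.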

Modified Step 2:
The second line of the inner loop:
$V(s) \gets \sum_{a} \pi(a|s) \sum_{s', r} p(s', r|s, a) [r + \gamma V(s')]$
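
The modified evaluation update could be sketched as follows. This is only an illustration under assumptions: the function `evaluate_policy`, the dictionary layout `model[s][a] = [(prob, next_state, reward), ...]`, and the two-state toy MDP are all made up for the example and are not part of the exercise.

```python
def evaluate_policy(model, policy, gamma=0.9, theta=1e-10):
    """Iterative policy evaluation for a stochastic policy pi(a|s).

    model[s][a]  : list of (prob, next_state, reward) triples, i.e. p(s', r | s, a).
    policy[s][a] : probability of taking action a in state s.
    """
    V = {s: 0.0 for s in model}
    while True:
        delta = 0.0
        for s in model:
            v_old = V[s]
            # Expectation over actions (weighted by pi) of the usual one-step backup.
            V[s] = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for a, outcomes in model[s].items()
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            return V


# Hypothetical toy MDP: state 1 is absorbing with zero reward.
model = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 0.0)]},
}
# An epsilon-soft policy with epsilon = 1 in state 0 (uniform over both actions).
policy = {0: {"stay": 0.5, "go": 0.5}, 1: {"stay": 1.0}}
V = evaluate_policy(model, policy)
```

For this toy problem the fixed point can be checked by hand: $V(0) = 0.5 \cdot 0.9\, V(0) + 0.5 \cdot 1$, so $V(0) = 0.5 / 0.55$.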

Modified Step 1:
The policy should satisfy the $\epsilon$-soft criterion, $\pi(a|s) \geq \dfrac{\epsilon}{|A(s)|}$ for all $a \in A(s)$, when being initialized as well. One way to ensure that is to set $\pi(a|s)$ for all but one $a \in A(s)$ to $\dfrac{\epsilon}{|A(s)|}$ and $\pi(a|s)$ for the remaining action to $1 - \dfrac{(|A(s)| - 1) \epsilon}{|A(s)|}$.
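
One possible way to write this initialization is sketched below. The name `init_epsilon_soft_policy` and the dictionary `actions_per_state` are hypothetical, and always favouring the first listed action is an arbitrary choice made only for the example.

```python
def init_epsilon_soft_policy(actions_per_state, epsilon):
    """Build an epsilon-soft policy as a nested dict policy[s][a] = pi(a|s).

    actions_per_state: dict mapping each state to a non-empty list of its actions.
    """
    policy = {}
    for s, actions in actions_per_state.items():
        n = len(actions)
        favoured = actions[0]  # arbitrary choice; any action would do
        # All actions start at the minimum allowed probability epsilon / |A(s)|.
        policy[s] = {a: epsilon / n for a in actions}
        # The favoured action takes the remaining probability mass.
        policy[s][favoured] = 1.0 - (n - 1) * epsilon / n
    return policy


policy = init_epsilon_soft_policy(
    {"s0": ["left", "right"], "s1": ["left", "right", "stay"]},
    epsilon=0.2,
)
```

With $\epsilon = 0.2$ and two actions, the favoured action gets probability $0.9$ and the other gets $0.1$.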
