Study of the first section of the Sutton book (month 7)

3 minute read

The goal of this month is to fully read the first section of the Sutton book (8 chapters) and to fully understand the basics of reinforcement learning. Once we have enough background, it will be easier to follow good practices and the state of the art in reinforcement learning. Additionally, knowing more algorithms means having more tools at hand, so we can achieve better results in each of the exercises and projects to come.

Additionally, to demonstrate this improvement in know-how, the mountain car exercise developed in previous posts will be revisited and redesigned to achieve better results, either by modifying the currently proposed algorithm or by using a more suitable one.

Readings

Since this month's work consisted of learning from the Sutton book referenced in the resources section, this section gathers a summary of headlines with the key lessons learned from the reading. (Note that some lessons had already been learned in previous iterations, and also that the book itself provides a summary of the lessons of each chapter in its last section.)

CHAPTER 1

  • Subsection 1.5: Introduction to reinforcement learning with the tic-tac-toe example.

CHAPTER 2

  • Subsection 2.1: The importance of balancing exploration and exploitation, and some simple methods to do so.

  • Subsections 2.2, 2.3 and 2.5: Introduction to stationary vs nonstationary problems.

  • Subsection 2.3: The importance of ε-greedy action selection in an RL policy to increase the average reward by taking advantage of exploration (a minimal sketch is given after this list).

  • Subsection 2.6: The importance of the initial values assigned to the estimates (optimistic initial values encourage early exploration).

  • Subsection 2.7: UCB (upper confidence bound) action selection as an example of a deterministic exploration strategy.

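As a quick reference for the ε-greedy idea from subsections 2.3 and 2.7, here is a minimal sketch of ε-greedy action selection with incremental sample-average updates for a simple bandit setting. The function and variable names (q_estimates, counts, etc.) are our own, not the book's pseudocode.

```python
import random

def epsilon_greedy_action(q_estimates, epsilon=0.1):
    """With probability epsilon explore (random action), otherwise exploit (greedy action)."""
    if random.random() < epsilon:
        return random.randrange(len(q_estimates))
    return max(range(len(q_estimates)), key=lambda a: q_estimates[a])

def update_estimate(q_estimates, counts, action, reward):
    """Incremental sample-average update: Q(a) <- Q(a) + (R - Q(a)) / N(a)."""
    counts[action] += 1
    q_estimates[action] += (reward - q_estimates[action]) / counts[action]
```

For a nonstationary problem (subsections 2.2-2.5), the 1/N(a) step size would typically be replaced by a constant step size α so that recent rewards weigh more.
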
CHAPTER 3

  • Subsection 3.2: It is not good practice to give the agent higher rewards when an action merely brings it closer to the goal, instead of rewarding only those actions whose next state actually is the goal.

  • Subsections 3.5 and 3.6: Introduction to the Bellman equation and to the value and quality (action-value) concepts (already known, but included in this summary due to their importance and the good explanation in the book); the equation is reproduced after this list.

  • Subsection 3.7: The Bellman equation is not feasible to solve exactly in many real situations, so approximation may be needed.

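For reference, the Bellman equation for the state-value function under a policy π, in the book's notation:

```latex
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \, v_\pi(s') \right]
```

An analogous recursion holds for the action-value (quality) function q_π(s, a).
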
CHAPTER 4

Concepts of policy evaluation, value iteration and policy iteration. For a better understanding of the difference between value iteration and policy iteration, this forum discussion may be useful; a small value-iteration sketch is also included below.

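To make the value-iteration side of that distinction concrete, here is a minimal sketch for a small finite MDP. The transition representation (a dictionary mapping (state, action) pairs to lists of (probability, next state, reward) tuples) is our own assumption, not the book's pseudocode.

```python
def value_iteration(states, actions, transitions, gamma=0.9, theta=1e-6):
    """transitions[(s, a)] -> list of (probability, next_state, reward) tuples."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Back up each state with the best one-step lookahead value.
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        # Stop once no state value changed by more than theta.
        if delta < theta:
            return V
```

Policy iteration, by contrast, alternates a full policy-evaluation sweep with a greedy policy-improvement step, while value iteration folds the improvement into every backup.
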
CHAPTER 5

  • Subsections 5.1, 5.2 and 5.3: Introduction to the Monte Carlo algorithm and the importance of exploration to discover the optimal action when the model is unknown a priori.

  • Subsections 5.1, 5.2 and 5.3: The importance of ε-greedy and ε-soft policies when using the Monte Carlo method (see the sketch after this list).

  • Subsection 5.5: Off-policy vs. on-policy methods and an introduction to importance sampling (ordinary and weighted).

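As a companion to subsections 5.1-5.3, here is a minimal first-visit Monte Carlo prediction sketch. It assumes episodes are stored as lists of (state, reward received after leaving that state) pairs; that representation is our own, not the book's.

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """Average the first-visit returns observed for each state across episodes."""
    returns = defaultdict(list)
    for episode in episodes:  # episode: [(S_t, R_{t+1}), ...]
        G = 0.0
        # Walk the episode backwards, accumulating the discounted return G.
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            G = gamma * G + reward
            # Record G only if this is the first visit to the state in the episode.
            if state not in (s for s, _ in episode[:t]):
                returns[state].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```

For control (rather than prediction), the same averaging is applied to (state, action) pairs and combined with an ε-soft policy so that every action keeps being explored.
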
CHAPTER 6

  • Subsections 6.1, 6.2, 6.3 and 6.4: TD(0) vs DP and Monte Carlo methods.

  • Subsections 6.4 onward: Sarsa (and the following takeaway: if St+1 is terminal, then Q(St+1, ·) = 0), Expected Sarsa, Q-learning and double learning; a minimal sketch of the Sarsa and Q-learning updates follows the note below.

Note that this chapter contains some especially interesting examples and exercises to revisit (random walk, cliff walking, driving home, windy gridworld, etc.), each with a suggested algorithm to apply.

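Here is a minimal sketch of the one-step Sarsa and Q-learning updates mentioned above, including the terminal-state takeaway (the bootstrap term is dropped when S' is terminal). Q is assumed to be a collections.defaultdict(float) keyed by (state, action); that representation and the default parameters are our own.

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.99, terminal=False):
    """On-policy one-step Sarsa: Q(S,A) += alpha * [R + gamma*Q(S',A') - Q(S,A)].
    If S' is terminal, Q(S',A') is taken as 0, so the target is just R."""
    target = r if terminal else r + gamma * Q[(s2, a2)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.99, terminal=False):
    """Off-policy one-step Q-learning: the target bootstraps on max_a Q(S', a)."""
    target = r if terminal else r + gamma * max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```
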
CHAPTER 7

  • Subsection 7.0 (chapter introduction): Introduction to the concept of n-step bootstrapping.

  • Subsections 7.2 and 7.3: n-step Sarsa and n-step Expected Sarsa, derived from n-step TD prediction, an intermediate algorithm between Monte Carlo and TD(0); the n-step return is given after this list.

  • Subsection 7.5: The n-step tree backup algorithm.

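For reference, the n-step return these methods bootstrap on (state-value form, in the book's notation); n-step Sarsa uses the same structure with Q in place of V:

```latex
G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V_{t+n-1}(S_{t+n}),
\qquad n \ge 1, \; 0 \le t < T - n
```
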
CHAPTER 8

  • Subsection 8.1: Consolidation of the model-free vs. model-based and learning vs. planning paradigms, plus an introduction to distribution models vs. sample models.

  • Subsections 8.2 and 8.3: Dyna-Q and Dyna-Q+: how to apply planning and direct RL learning simultaneously (see the sketch after this list).

  • Subsection 8.4: Prioritized sweeping.

  • Subsection 8.5: A useful diagram to recap some of the learned algorithms and their classification (Figure 8.6: backup diagrams for all the one-step updates considered in the book).

  • Subsections 8.6 and 8.7: Introduction to trajectory sampling, with real-time dynamic programming (RTDP) as an example.

  • Subsection 8.8: Background planning vs decision-time planning.

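To recap the Dyna-Q idea from subsections 8.2 and 8.3, here is a minimal sketch of one Dyna-Q iteration: a direct Q-learning update from real experience, a deterministic tabular model update, and n planning updates replayed from the model. The structure and names are our own, not the book's pseudocode, and the Dyna-Q+ exploration bonus is omitted. As above, Q is assumed to be a defaultdict(float) keyed by (state, action).

```python
import random

def dyna_q_step(Q, model, s, a, r, s2, actions, n_planning=10,
                alpha=0.1, gamma=0.95, terminal=False):
    """One Dyna-Q iteration: direct RL update, model learning, then planning."""
    # (a) Direct RL: one-step Q-learning update from the real transition.
    target = r if terminal else r + gamma * max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    # (b) Model learning: remember the observed transition (deterministic model).
    model[(s, a)] = (r, s2, terminal)
    # (c) Planning: replay n simulated transitions from previously seen (s, a) pairs.
    for _ in range(n_planning):
        (ps, pa), (pr, ps2, pterm) = random.choice(list(model.items()))
        ptarget = pr if pterm else pr + gamma * max(Q[(ps2, b)] for b in actions)
        Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
```
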

BONUS
While reading the Sutton book, we found a really interesting tool for understanding how a machine learning algorithm actually learns, and its application to reinforcement learning algorithms. Check it out!

Lab work