Week 6-7. Q-learning - completed
The problem has been solved with the Q-learning algorithm, parameterized with a discrete action space of size 16 (4 bins for linear velocity and 4 for angular velocity) and a discrete observation space of size 10. To encourage exploration, the ε-greedy technique has been applied with an initial epsilon of 0.99999 and a per-step decay factor of 0.99999. The algorithm updates the Q-table after each iteration, i.e. at every step.

It has been necessary to modify the reward function to make its calculation simpler and lighter. This has been done by removing the computation of the red line's centerline and replacing it with the offset of the red line's edges with respect to the center of the screen, which is a sufficient estimate to compute a reward proportional to the deviation from the center of the road. This in turn makes the observation calculation lighter, since both the reward and the observation are based on the same error calculation.

A starting point for each episode is also chosen at random from a list of starting points provided as input. These points are the beginnings of curves, both left and right, tight and open. Straight lines are learned implicitly, since in most cases a straight follows a curve and straights are easier for the algorithm to learn.

It has been observed that the car still oscillates on both curves and straights, but the situation has improved, since in the reward function abrupt steering changes are penalized by subtracting the difference between the angular velocity chosen at the current step and the angular velocity chosen at the previous step.
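As a minimal sketch of the tabular setup described above, the following Python fragment shows a Q-table over the 10 discrete observations and 16 discrete actions, ε-greedy action selection, a per-step Q-update, and per-step epsilon decay. The learning rate, discount factor, and the `env.step` interface are assumptions for illustration, not values or APIs taken from the project.

```python
import numpy as np

N_OBS, N_ACTIONS = 10, 16              # discrete observation and action space sizes
ALPHA, GAMMA = 0.1, 0.95               # learning rate and discount factor (assumed values)
epsilon, EPSILON_DECAY = 0.99999, 0.99999

q_table = np.zeros((N_OBS, N_ACTIONS))

def choose_action(state):
    """ε-greedy selection over the 16 discretized (linear, angular) velocity pairs."""
    if np.random.random() < epsilon:
        return np.random.randint(N_ACTIONS)        # explore
    return int(np.argmax(q_table[state]))          # exploit

def q_update(state, action, reward, next_state):
    """Tabular Q-learning update, applied after every step."""
    best_next = np.max(q_table[next_state])
    q_table[state, action] += ALPHA * (reward + GAMMA * best_next - q_table[state, action])

def train_step(env, state):
    """One interaction step: act, observe, update the Q-table, decay epsilon."""
    global epsilon
    action = choose_action(state)
    next_state, reward, done = env.step(action)    # hypothetical environment API
    q_update(state, action, reward, next_state)
    epsilon *= EPSILON_DECAY                       # per-step exploration decay
    return next_state, done
```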
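The simplified error and reward calculation could look roughly like the sketch below: the deviation is estimated from the left and right edges of the red line relative to the screen center, and the difference between consecutive angular velocities is subtracted as a smoothness penalty. The exact scaling, the penalty coefficient, and the value returned when the line is lost are assumptions, not the project's actual constants.

```python
import numpy as np

def line_center_error(mask_row, image_width):
    """Estimate deviation from the road center using the red line's edges in one
    image row, instead of computing the full centerline of the line.
    `mask_row` is a 1-D boolean array marking red pixels in that row."""
    red_idx = np.flatnonzero(mask_row)
    if red_idx.size == 0:
        return None                                # line lost
    left, right = red_idx[0], red_idx[-1]
    line_center = (left + right) / 2.0
    return line_center - image_width / 2.0         # signed offset from screen center

def compute_reward(error, w_prev, w_curr, image_width, smooth_coeff=0.5):
    """Reward proportional to the (negative) deviation from the road center,
    minus a penalty for abrupt changes in the chosen angular velocity."""
    if error is None:
        return -10.0                               # assumed penalty for losing the line
    centering = 1.0 - abs(error) / (image_width / 2.0)
    smoothness_penalty = smooth_coeff * abs(w_curr - w_prev)
    return centering - smoothness_penalty
```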
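Episode resets can then draw the starting pose at random from the provided list of curve starts. The sketch below assumes hypothetical pose values and a hypothetical `set_car_pose` / `get_observation` interface on the environment.

```python
import random

# Hypothetical (x, y, yaw) poses at the start of left/right, tight and open curves.
START_POINTS = [
    (2.5, -1.0, 0.0),
    (7.8,  3.2, 1.57),
]

def reset_episode(env):
    """Place the car at a randomly chosen curve start before each episode."""
    start_pose = random.choice(START_POINTS)
    env.set_car_pose(*start_pose)       # hypothetical helper on the environment
    return env.get_observation()        # hypothetical: initial discrete observation
```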