
Finished Q-Learning 1D


Decreased speed discretisation

The agent accelerates by increasing its speed in steps of 0.1. If the speed is discretised too coarsely, the agent may not notice these small changes in velocity, so the best value for the discretisation step is 0.1.
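A minimal sketch of this idea, assuming the speed is simply rounded to the nearest bin (the helper name is illustrative, not the exact implementation):

def discretise_speed(speed, bin_size=0.1):
    """Round a continuous speed to the nearest discretisation bin."""
    return round(speed / bin_size) * bin_size

# A single 0.1 acceleration step always lands in a different bin.
print(discretise_speed(0.14))  # 0.1
print(discretise_speed(0.24))  # 0.2

With a bin size of 0.1, every acceleration step moves the speed into a new bin, so no change in velocity goes unnoticed.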

Fixed RLidar

For the lidar to work, the agent had to be moving: while it was static, the rays detected their own origin as an obstacle. To solve this problem, noise is added to the way the origin of the rays is calculated.

import numpy as np

def add_origin_perturbation(origin, magnitude=0.001):
    """Add small random noise to the lidar ray origin (x and y only)."""
    perturbation = np.random.uniform(-magnitude, magnitude, size=2)
    return origin + np.array([perturbation[0], perturbation[1], 0])

Example and data analysis

This graph shows how the agent converges to a solution after a number of steps, even from different positions.

In the video, two phases can be distinguished: in the first, the agent is placed at a random position; in the second, it lines up between the two cars. example video

Deeper explanation

Rewards

The rewards used were as follows:

import numpy as np

def compute_reward(distance_90, distance_270, speed):
    """Reward based on the two lateral lidar readings and the current speed.
    (The function signature is inferred from the variables used below.)"""
    # No reading on one side: no reward
    if np.isinf(distance_90) or np.isinf(distance_270):
        return 0

    distance_difference = abs(distance_90 - distance_270)

    # fixes infinite rewards
    if distance_difference < 0.1:
        distance_difference = 0.1

    # centred and static
    if distance_difference <= 0.17 and abs(speed) < 0.001:
        reward = 200
    # penalty if a crash is likely
    elif distance_difference > 4.5 or abs(speed) > 12:
        reward = -10
    # max standard reward = 10 (1/0.1, given the clamp above)
    else:
        reward = 1 / distance_difference

    return reward

The reward is based on three parts: centred and static, crash prediction, and the standard distance-based reward.

Discretisation

One of the major problems was the discretisation needed to obtain a finite state space. The agent must know at all times what is going on in its environment, but if the discretisation is too coarse, information can be lost because the state is not updated. This error was detected and solved by tuning the discretisation values.

Parameters

The exploration factor starts close to 1 and decays exponentially towards 0, converging at around 450 episodes. The learning rate is 0.2, and the discount factor is 0.9 to give weight to future rewards.
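A minimal sketch of how these values could fit together, assuming an exponential schedule for the exploration factor and a Q-table stored as a dictionary of per-state action-value arrays (the decay constant 0.01 is chosen so that the factor is close to 0 around episode 450; names are illustrative):

import numpy as np

ALPHA = 0.2      # learning rate
GAMMA = 0.9      # discount factor, weights future rewards
EPS_START = 0.99 # exploration factor starts close to 1

def epsilon(episode, decay=0.01):
    """Assumed exponential decay schedule for the exploration factor."""
    return EPS_START * np.exp(-decay * episode)

def q_update(q_table, state, action, reward, next_state):
    """Tabular Q-learning update; q_table maps a state to an array of action values."""
    best_next = np.max(q_table[next_state])
    q_table[state][action] += ALPHA * (reward + GAMMA * best_next - q_table[state][action])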

State space

For the state space, we started with only the distance difference, but eventually incorporated the current velocity so that the agent could remain static without oscillating.
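A possible sketch of how such a state could be built, assuming both quantities are binned and used as a Q-table key (the bin size and helper name are illustrative):

def build_state(distance_90, distance_270, speed, bin_size=0.1):
    """Discretise the lidar distance difference and the speed into a hashable state."""
    distance_difference = abs(distance_90 - distance_270)
    d = round(round(distance_difference / bin_size) * bin_size, 1)
    v = round(round(speed / bin_size) * bin_size, 1)
    return (d, v)

# Almost centred and nearly static -> (0.1, 0.0)
print(build_state(distance_90=2.05, distance_270=1.98, speed=0.004))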
