Upgrading DQN

Plotting upgrades

The program has been modified to add two new graphs, the final horizontal and vertical distance, both of which should converge towards 0. Since many previous tests were run without this data, a '-d' flag has been added to display the new plots. The graphs are still updated in real time, so progress can be followed during training.
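As a rough sketch of how this could be wired up (the argument parsing and plotting code below are illustrative, not the exact training script; only the '-d' flag name comes from the project, and a reward plot is assumed as the pre-existing graph):

import argparse
import matplotlib.pyplot as plt

parser = argparse.ArgumentParser()
parser.add_argument("-d", "--distances", action="store_true",
                    help="also plot the final horizontal/vertical distance per episode")
args = parser.parse_args()

plt.ion()  # interactive mode so the figures refresh in real time
fig, axes = plt.subplots(1, 3 if args.distances else 1, squeeze=False)

def update_plots(rewards, h_dists, v_dists):
    """Redraw the live plots after each episode."""
    axes[0][0].clear()
    axes[0][0].plot(rewards)
    axes[0][0].set_title("Episode reward")
    if args.distances:
        axes[0][1].clear()
        axes[0][1].plot(h_dists)
        axes[0][1].set_title("Final horizontal distance")
        axes[0][2].clear()
        axes[0][2].plot(v_dists)
        axes[0][2].set_title("Final vertical distance")
    plt.pause(0.01)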

For cleaner code, these are the new debug variables:

class DQNAgent:
    def __init__(self, state_size, action_size, gamma=0.999, alpha=0.0001, epsilon=1.0, min_epsilon=0.001, decay_rate=0.999):
        #.....

        # DEBUG: counters used by the real-time plots
        self.reward = 0      # accumulated reward of the current episode
        self.loss = 0        # accumulated training loss
        self.steps = 0       # steps taken in the current episode
        self.med_dist = 0    # distance to the parking target, tracked for the plots
        self.episodes = 0    # total number of episodes
        self.episode = 0     # current episode counter
        self.n_achieved = 0  # number of episodes where the parking was achieved
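These fields are only bookkeeping for the plots. As a purely hypothetical sketch of where they would be rolled over at the end of an episode (the helper name and the exact bookkeeping are assumptions, not code from the project):

def end_of_episode(agent, final_dist, achieved):
    """Hypothetical helper: roll the per-episode debug counters over."""
    agent.episode += 1
    agent.n_achieved += int(achieved)
    # running mean of the final distance to the target across episodes
    agent.med_dist += (final_dist - agent.med_dist) / agent.episode
    summary = {
        "reward": agent.reward,
        "mean_loss": agent.loss / max(agent.steps, 1),
        "mean_final_dist": agent.med_dist,
        "success_rate": agent.n_achieved / agent.episode,
    }
    agent.reward, agent.loss, agent.steps = 0, 0, 0  # reset per-episode accumulators
    return summary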

Desalignment class

The code has been cleaned up by moving the vehicle misalignment logic into its own class. It now takes a boolean that indicates whether the misalignment should also include an angular offset. This was added because many previous training runs did not include that offset, so those runs can still be reused.

import numpy as np


class Desalignment:
    def __init__(self, env, max_align_steps):
        self.env = env
        self.max_align_steps = max_align_steps

    def reset(self, observation, rotate=False):
        """Resets the misalignment parameters."""
        self.moved = False
        self.rotate = rotate
        self.n_steps = 0
        self.accelerate = True
        self.first_action = np.array([0.0, 0.0])
        self.random_offset = np.random.choice([-2, 0, 2])
        self.random_rotation = np.random.choice([-2, 0, 2])
        self.target = observation["ego_vehicle_state"]["position"][0] + self.random_offset

    def move_to_random_position(self, current_position, target_position, accelerate, steps, first_act):
        """Moves the vehicle towards a target position."""
        distance = target_position - current_position
        action = 0

        if accelerate:
            # TRAINED action = 10
            action = 15 if distance > 0 else -15

        # Brake (reverse the first action) once close enough or out of steps
        if abs(distance) < 0.25 or steps == self.max_align_steps:
            action = -first_act

        return np.array([action, 0.0])

    def run(self, observation, parking_target):
        """Moves the vehicle to a random position."""
        if not self.moved:
            action = self.move_to_random_position(
                observation["ego_vehicle_state"]["position"][0], self.target, self.accelerate, self.n_steps, self.first_action[0]
            )
            self.accelerate = False

            if action[0] + self.first_action[0] == 0:
                self.moved = True

            if self.n_steps == 0:
                self.first_action = action

            observation, _, terminated, _, _ = self.env.step((action[0], action[1]), parking_target)
            self.n_steps += 1
            return observation, terminated

        elif self.n_steps <= self.max_align_steps:
            default_rot = 0.0
            if self.rotate:
                default_rot = self.random_rotation
                self.rotate = False
            observation, _, terminated, _, _ = self.env.step((0.0, default_rot), parking_target)
            self.n_steps += 1
            return observation, terminated

        else:
            return observation, False

    def is_desaligned(self):
        """Returns True while the misalignment is still in progress, False once it has finished."""
        return self.n_steps <= self.max_align_steps
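A minimal usage sketch, assuming the environment wrapper and constants used elsewhere in the project (env, parking_target, MAX_ALIGN_STEPS) and a Gymnasium-style reset; it only illustrates how the class is meant to be driven from the episode loop:

desalignment = Desalignment(env, MAX_ALIGN_STEPS)

observation, _ = env.reset()                   # assumed Gymnasium-style reset
desalignment.reset(observation, rotate=True)   # rotate=True also applies the angular offset

# Push the vehicle away from the ideal pose before the agent takes over.
while desalignment.is_desaligned():
    observation, terminated = desalignment.run(observation, parking_target)
    if terminated:
        break
# From here on, the DQN agent acts on the misaligned vehicle.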

More testing

Most training sessions take between 3 and 6 hours, and almost all of them have been run with only 3 possible random starting positions. The best result achieved so far is shown below:

video

Upgrading DQN

All the improvements discussed so far do not directly affect training; they concern code cleanup and the interpretation of results. Regarding the DQN itself, several aspects have been improved. The rewards are still not in their final version: since parking is divided into different phases, the reward must also be staged, otherwise local minima will appear. The way agent alignment is rewarded has been modified; both orientation and horizontal distance are now taken into account, since an alignment far from the ideal position is meaningless. It was also observed that in certain situations the vehicle got very close to the edge of the road (there are no walls). A wall has therefore been simulated through the rewards: if the agent comes closer than 1 m to the right of its ideal position, it is counted as a penalty and the episode is terminated (as if it had crashed into a wall).

distance_reward = 1 / (1 + np.exp(3 * (dist_to_target - 1)))
if dist_to_target > 7.5 or target_pose[1] < -1.5:  # emulates the horizontal obstacle (simulated wall)
    distance_reward = -5  # maximum distance penalty
    print(f"Terminated due to distance, HOR DIST: {target_pose[1]}")

if horizontal_dist < 0.3:
    orientation_reward = -(((5 * np.pi) / 12) * orient_diff) + (0.1 / horizontal_dist)
    orientation_reward = max(-0.5, min(orientation_reward, 1))  # keep within range [-0.5, 1]
    print(f"ORIENT_DIFF: {orient_diff} HOR DIST: {target_pose[1]} REWARD: {orientation_reward}")
else:
    orientation_reward = 0

if orient_diff < 0.1 and horizontal_dist < 0.15 and vertical_dist < 0.25 and abs(speed) < 0.25:
    stopping_bonus = (MAX_STEPS - MAX_ALIGN_STEPS - self.step_number) * 2  # maximum bonus for stopping early
    print("ACHIEVED!!")
else:
    stopping_bonus = 0
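To get a feel for the shaping above, the small standalone snippet below evaluates the same sigmoid distance term and the clamped orientation term over a few values (the guard against division by zero is added here only for this standalone check and is not part of the original code):

import numpy as np

def distance_reward(dist_to_target):
    """Sigmoid shaping: ~0.95 at 0 m, 0.5 at 1 m, ~0.05 at 2 m."""
    return 1 / (1 + np.exp(3 * (dist_to_target - 1)))

def orientation_reward(orient_diff, horizontal_dist):
    """Only active once the vehicle is nearly aligned horizontally (< 0.3 m)."""
    if horizontal_dist >= 0.3:
        return 0.0
    reward = -((5 * np.pi) / 12) * orient_diff + 0.1 / max(horizontal_dist, 1e-3)
    return max(-0.5, min(reward, 1.0))  # clamp to [-0.5, 1]

for d in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(f"dist={d:.2f} m -> distance_reward={distance_reward(d):.3f}")
for o in (0.0, 0.5, 1.0):
    print(f"orient_diff={o:.2f} rad -> orientation_reward={orientation_reward(o, 0.2):.3f}")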