Upgrading DQN
Plotting upgrades
The program has been modified to add two new graphs, the final horizontal and vertical distance, both of which should converge close to 0. Since many previous tests were done without this data, a '-d' flag has been added that displays these new plots. The graph is still updated in real time, so you can watch the progress.
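As a rough idea of how the flag could be wired in (the actual argument parsing in the project may differ, so take this as a sketch rather than the real code):

import argparse

parser = argparse.ArgumentParser(description="DQN parking training")
parser.add_argument("-d", "--distance-plots", action="store_true",
                    help="also show the final horizontal/vertical distance plots")
args = parser.parse_args()

show_distance_plots = args.distance_plots  # older trainings simply omit -d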
For cleaner code, these are the variables:
class DQNAgent:
    def __init__(self, state_size, action_size, gamma=0.999, alpha=0.0001, epsilon=1.0, min_epsilon=0.001, decay_rate=0.999):
        # .....
        # DEBUG: statistics tracked for the real-time plots
        self.reward = 0
        self.loss = 0
        self.steps = 0
        self.med_dist = 0
        self.episodes = 0
        self.episode = 0
        self.n_achieved = 0
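These counters feed the real-time figure. A minimal, hypothetical update routine (the matplotlib usage and the exact variables plotted here are assumptions, not the project's actual plotting code) could look like this:

import matplotlib.pyplot as plt

plt.ion()                                   # interactive mode so the figure refreshes live
fig, (ax_reward, ax_dist) = plt.subplots(2, 1)
rewards, final_h_dists, final_v_dists = [], [], []

def update_plots(agent, final_h_dist, final_v_dist, show_distance_plots=True):
    """Append the latest episode statistics and redraw the live figure."""
    rewards.append(agent.reward)
    ax_reward.clear()
    ax_reward.plot(rewards)
    ax_reward.set_title("Episode reward")
    if show_distance_plots:                 # only when the -d flag is given
        final_h_dists.append(final_h_dist)
        final_v_dists.append(final_v_dist)
        ax_dist.clear()
        ax_dist.plot(final_h_dists, label="final horizontal distance")
        ax_dist.plot(final_v_dists, label="final vertical distance")
        ax_dist.legend()
    plt.pause(0.01)                         # give the GUI loop time to redraw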
Desalignment class
The code has been cleaned up by creating a class that misaligns the vehicle. It now takes a boolean that indicates whether the misalignment also includes angular velocity. This was done because many previous trainings did not include it, so those trainings can still be reused.
import numpy as np

class Desalignment:
    def __init__(self, env, max_align_steps):
        self.env = env
        self.max_align_steps = max_align_steps

    def reset(self, observation, rotate=False):
        """Reset the misalignment parameters."""
        self.moved = False
        self.rotate = rotate
        self.n_steps = 0
        self.accelerate = True
        self.first_action = np.array([0.0, 0.0])
        self.random_offset = np.random.choice([-2, 0, 2])
        self.random_rotation = np.random.choice([-2, 0, 2])
        self.target = observation["ego_vehicle_state"]["position"][0] + self.random_offset

    def move_to_random_position(self, current_position, target_position, accelerate, steps, first_act):
        """Move the vehicle towards a target position."""
        distance = target_position - current_position
        action = 0
        if accelerate:
            # TRAINED action = 10
            action = 15 if distance > 0 else -15
        if abs(distance) < 0.25 or steps == self.max_align_steps:
            action = -first_act
        return np.array([action, 0.0])

    def run(self, observation, parking_target):
        """Move the vehicle to a random position."""
        if not self.moved:
            action = self.move_to_random_position(
                observation["ego_vehicle_state"]["position"][0], self.target, self.accelerate, self.n_steps, self.first_action[0]
            )
            self.accelerate = False
            if action[0] + self.first_action[0] == 0:
                self.moved = True
            if self.n_steps == 0:
                self.first_action = action
            observation, _, terminated, _, _ = self.env.step((action[0], action[1]), parking_target)
            self.n_steps += 1
            return observation, terminated
        elif self.n_steps <= self.max_align_steps:
            default_rot = 0.0
            if self.rotate:
                default_rot = self.random_rotation
                self.rotate = False
            observation, _, terminated, _, _ = self.env.step((0.0, default_rot), parking_target)
            self.n_steps += 1
            return observation, terminated
        else:
            return observation, False

    def is_desaligned(self):
        """Return True while the misalignment phase is still in progress, False once it has finished."""
        return self.n_steps <= self.max_align_steps
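For context, this is roughly how the class could be driven at the start of an episode (the environment interface and variable names here are assumptions for illustration, not the project's exact training loop):

desalignment = Desalignment(env, max_align_steps=MAX_ALIGN_STEPS)
desalignment.reset(observation, rotate=True)   # rotate=True also injects angular velocity

while desalignment.is_desaligned():
    observation, terminated = desalignment.run(observation, parking_target)
    if terminated:
        break
# From here on, the DQN agent acts on the perturbed observation.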
More testing
Most training sessions take between 3 and 6 hours, and almost all of them have been run with only 3 possible random positions. The best result achieved so far is this:
Upgrading DQN
All the improvements discussed so far do not directly influence training; they concern code cleanup and the interpretation of results. With respect to the DQN itself, several aspects have been improved. The rewards are still not in their final version: since the parking manoeuvre is divided into different phases, the rewards must be staged as well, otherwise local minima will be generated. The way agent alignment is rewarded has been modified: both orientation and horizontal distance are now taken into account, since an alignment far from the ideal position is meaningless. It has also been observed that in certain situations the vehicle got very close to the edge of the road (there are no walls). A wall has therefore been simulated through the rewards: if the agent comes closer than 1 m to the right of its ideal position, it receives a penalty and the episode is terminated (crash into the wall).
distance_reward = 1 / (1 + np.exp(3 * (dist_to_target - 1)))
if dist_to_target > 7.5 or target_pose[1] < -1.5:  # emulates the horizontal obstacle (simulated wall)
    distance_reward = -5  # maximum distance penalty
    print(f"Terminated by distance, HOR DIST: {target_pose[1]}")

if horizontal_dist < 0.3:
    orientation_reward = -(((5 * np.pi) / 12) * orient_diff) + (0.1 / horizontal_dist)
    orientation_reward = max(-0.5, min(orientation_reward, 1))  # keep within the range [-0.5, 1]
    print(f"ORIENT_DIFF: {orient_diff} HOR DIST: {target_pose[1]} REWARD: {orientation_reward}")
else:
    orientation_reward = 0

if orient_diff < 0.1 and horizontal_dist < 0.15 and vertical_dist < 0.25 and abs(speed) < 0.25:
    stopping_bonus = (MAX_STEPS - MAX_ALIGN_STEPS - self.step_number) * 2  # maximum bonus for stopping
    print("ACHIEVED!!")
else:
    stopping_bonus = 0
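To get a feel for the distance term, evaluating the sigmoid 1 / (1 + e^(3·(d − 1))) at a few distances (a quick illustrative check, not part of the training code) shows that most of the signal is concentrated in the last metre before the target:

import numpy as np

for d in [0.0, 0.5, 1.0, 2.0, 5.0]:
    r = 1 / (1 + np.exp(3 * (d - 1)))
    print(f"dist_to_target = {d:.1f} m -> distance_reward = {r:.3f}")
# prints roughly 0.953, 0.818, 0.500, 0.047 and 0.000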