Q-Learning
LearningAgent Class
The LearningAgent class implements a Q-learning agent optimized for parking tasks, using LiDAR readings and a fixed angular velocity. The agent follows the Q-learning algorithm, a reinforcement learning technique.
1. Class Structure of LearningAgent
Main Attributes:
epsilon: The probability of exploring new actions instead of exploiting the current knowledge. It controls the exploration-exploitation tradeoff in the epsilon-greedy policy.
alpha: The learning rate, which determines how much the new information influences the update of the Q-values.
gamma: The discount factor, which specifies the weight of future rewards compared to immediate rewards.
q_table: A table that stores the Q-values, which represent the quality of actions taken in different states. Initially, the Q-values are unknown.
actions: The possible actions the agent can take. In this case, the linear accelerations are [-2, -1, 0, 1, 2]. (A short instantiation sketch follows this list.)
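As a quick illustration of these attributes, here is a minimal sketch (assuming the LearningAgent class from the listing below) that instantiates the agent with its default hyperparameters:

agent = LearningAgent(epsilon=0.2, alpha=0.1, gamma=0.99)
print(agent.actions)  # [-2, -1, 0, 1, 2]
print(agent.q_table)  # {} -- empty until states are visited, since the Q-values start unknown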
2. The discretize() Method
This method converts continuous values into discrete values to simplify the state space.
The discretization is important because Q-learning works with discrete states, not continuous ones. By discretizing the environment data (like distances), the agent can map continuous sensor readings to a manageable set of states that can be stored and updated in the Q-table.
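For example, with the default step of 0.25 and max_value of 10.0 used in the listing below, discretize behaves roughly as in this sketch (the input values are arbitrary):

agent = LearningAgent()
agent.discretize(3.17)           # -> 3.25 (rounded to the nearest multiple of 0.25)
agent.discretize(-11.0)          # -> -10.0 (clipped to [-10.0, 10.0] before rounding)
agent.discretize(0.4, step=0.5)  # -> 0.5 (a custom step size)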
3. The get_state() Method
This method extracts and discretizes the state of the environment, based on LiDAR data and the vehicle’s position.
The state is represented as the difference between the obstacle distances in front of and behind the vehicle, taken from the 90° and 270° LiDAR readings. This difference is discretized so that it becomes a manageable value for the Q-learning process.
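For instance, assuming the default discretization step of 0.25: if the raw distances at the 90° and 270° LiDAR indices are 2.4 m and 3.1 m, they are first discretized to 2.5 and 3.0, and their discretized difference, 0.5, becomes the state.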
4. The choose_action() Method
This method selects an action based on the epsilon-greedy policy.
During exploration, with probability epsilon, the agent chooses a random action to discover new possibilities. During exploitation, with probability 1 - epsilon, the agent selects the action that maximizes the Q-value for the current state.
If a state is encountered for the first time, the Q-values for that state are initialized to zero, ensuring that the agent can start learning even in new situations.
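A minimal sketch of this behavior, assuming the LearningAgent class from the listing below and a hypothetical state whose Q-values have been filled in by hand:

agent = LearningAgent(epsilon=0.2)
# Hypothetical Q-values for state 0.5 (normally filled in by learn())
agent.q_table[0.5] = {-2: 0.1, -1: 0.3, 0: 0.0, 1: -0.2, 2: 0.05}

# With probability 0.8 this returns -1, the action with the highest Q-value;
# with probability 0.2 it returns a random action from [-2, -1, 0, 1, 2].
action = agent.choose_action(0.5)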
import numpy as np


class LearningAgent:
    """Q-learning agent optimized for parking, using LiDAR with a fixed angular velocity."""

    def __init__(self, epsilon=0.2, alpha=0.1, gamma=0.99):
        self.epsilon = epsilon  # Exploration probability for epsilon-greedy
        self.alpha = alpha      # Learning rate
        self.gamma = gamma      # Discount factor
        self.q_table = {}       # Q-table storing the state-action values
        self.actions = [-2, -1, 0, 1, 2]  # Possible linear accelerations

    def discretize(self, value, step=0.25, max_value=10.0):
        """Discretize a continuous value to the nearest multiple of 'step'.

        Args:
            value (float): Continuous value to discretize.
            step (float): Size of the discretization interval.
            max_value (float): Maximum limit (larger values are clipped).

        Returns:
            float: Value discretized to the nearest multiple of 'step'.
        """
        # Clip the value to [-max_value, max_value]
        value = min(max(value, -max_value), max_value)
        # Round to the nearest multiple of step
        return round(value / step) * step

    def get_state(self, observation):
        """Extract and discretize the state based on the LiDAR and the vehicle's position."""
        lidar_data = observation["lidar_point_cloud"]["point_cloud"]
        car_pose = np.array(observation["ego_vehicle_state"]["position"])
        # Assign inf to missing obstacles.
        lidar_data[np.all(lidar_data == [0, 0, 0], axis=1)] = float('inf')
        relative_points = lidar_data - car_pose
        distances = np.linalg.norm(relative_points, axis=1)
        lidar_length = len(distances)
        index_90 = lidar_length // 4
        index_270 = (3 * lidar_length) // 4
        distance_90 = self.discretize(distances[index_90])
        distance_270 = self.discretize(distances[index_270])
        distance_difference = self.discretize(distance_270 - distance_90)
        return float(distance_difference)

    def choose_action(self, state):
        """Select an action based on the epsilon-greedy policy."""
        if np.random.rand() < self.epsilon:
            # Explore: choose a random action
            return np.random.choice(self.actions)
        # Exploit: choose the best known action
        if state not in self.q_table:
            # Initialize the Q-values for the actions in this state if they do not exist
            self.q_table[state] = {action: 0.0 for action in self.actions}
        # Return the action with the highest Q-value in this state
        return max(self.q_table[state], key=self.q_table[state].get)

    def act(self, observation):
        """Generate an action based on the environment observation."""
        state = self.get_state(observation)
        action = self.choose_action(state)
        return np.array([action, 0.0])

    def learn(self, state, action, reward, next_state):
        """Update the Q-table according to the Q-learning formula."""
        # Initialize the states in the Q-table if they are not present
        if state not in self.q_table:
            self.q_table[state] = {action: 0.0 for action in self.actions}
        if next_state not in self.q_table:
            self.q_table[next_state] = {action: 0.0 for action in self.actions}
        # Compute the maximum future Q-value
        max_future_q = max(self.q_table[next_state].values())
        # Update the Q-table
        current_q = self.q_table[state].get(action, 0.0)
        self.q_table[state][action] = current_q + self.alpha * (reward + self.gamma * max_future_q - current_q)
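The update performed in learn() is the standard Q-learning rule:

Q(s, a) ← Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))

where s' is the next state and max_a' Q(s', a') corresponds to max_future_q in the code.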
ParkingAgent Class
Overview of the _compute_parking_reward Method
The goal of this method is to compute a scalar reward value based on the LiDAR readings and the agent’s current position. It uses the LiDAR points from two key directions (90° and 270°) to calculate the proximity of obstacles.
Reward logic
First, the relative coordinates of the obstacles are obtained. Then the measurements at 90° and 270° are selected (this will need to be revised, since the LiDAR scan starts at 0 radians in absolute coordinates). Once these measurements are available, their difference is calculated and used to derive the reward. Since we want to minimize the difference between the two distances, the reward is the inverse of that difference, so the smaller the difference, the larger the reward; the result is multiplied by 2 so that it stands out more clearly against unexplored states, whose values start at 0.0.
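As a worked example with hypothetical distances: if the 90° reading is 2.0 m and the 270° reading is 3.0 m, the difference is 1.0 m and the reward is (1 / 1.0) * 2 = 2.0. The code below additionally returns a fixed reward of 5.0 when the difference drops below 0.5 m, and a penalty of -10.0 when either reading is missing (infinite distance).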
def _compute_parking_reward(self, lidar_data: np.ndarray, car_pose: np.ndarray) -> float:
    """Compute the reward based on the LiDAR measurements at 90° and 270°.

    Args:
        lidar_data (np.ndarray): LiDAR point data around the vehicle.
        car_pose (np.ndarray): Current position of the agent (absolute coordinates).

    Returns:
        float: Computed reward.
    """
    lidar_length = len(lidar_data)
    index_90 = lidar_length // 4
    index_270 = (3 * lidar_length) // 4
    # Assign 'inf' to points where there is no obstacle ([0, 0, 0]).
    lidar_data[np.all(lidar_data == [0, 0, 0], axis=1)] = float('inf')
    relative_lidar = lidar_data - car_pose
    distances = np.linalg.norm(relative_lidar, axis=1)
    distance_90 = distances[index_90]
    distance_270 = distances[index_270]
    # If either measurement is missing:
    if np.isinf(distance_90) or np.isinf(distance_270):
        return -10.0
    distance_difference = abs(distance_90 - distance_270)
    if distance_difference < 0.5:
        reward = 5.0
    else:
        reward = (1 / distance_difference) * 2
    return reward
Example:
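Below is a minimal usage sketch with synthetic data; the 8-point LiDAR array, the vehicle position, and the bare ParkingAgent() construction are all hypothetical and only serve to illustrate the computation (the real constructor may require environment-specific arguments):

import numpy as np

# Hypothetical 8-point LiDAR scan in absolute coordinates; [0, 0, 0] would mark "no obstacle".
lidar_data = np.array([
    [12.0, 10.0, 0.0],
    [11.0, 11.5, 0.0],
    [10.0, 12.5, 0.0],  # index 2 = lidar_length // 4, the "90°" reading
    [9.0, 11.5, 0.0],
    [8.0, 10.0, 0.0],
    [9.0, 8.5, 0.0],
    [10.0, 7.0, 0.0],   # index 6 = 3 * lidar_length // 4, the "270°" reading
    [11.0, 8.5, 0.0],
], dtype=float)
car_pose = np.array([10.0, 10.0, 0.0])

agent = ParkingAgent()
reward = agent._compute_parking_reward(lidar_data, car_pose)
# distance_90 = 2.5, distance_270 = 3.0, difference = 0.5 -> reward = (1 / 0.5) * 2 = 4.0
print(reward)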