
Week 19: Reading information


I’ve been reading about self-driving. In particular, I’ve read the following paper:

From Pixels to Actions: Learning to Drive a Car with Deep Neural Networks

In this paper, the authors analyze an end-to-end neural network that predicts a car’s steering actions on a highway from images taken by a single car-mounted camera. They focus their analysis on several aspects: the input data format, the temporal dependencies between consecutive inputs, and the origin of the data. For the task at hand, regression networks outperform their classifier counterparts, and there is only a small difference between networks that use coloured images and those that use grayscale images as input. Regarding temporal dependencies, feeding the network three concatenated images yields a significant decrease of 30% in mean squared error. Regarding the origin of the data, using simulation data they are able to train networks whose performance is comparable to networks trained on real-life datasets. They also qualitatively demonstrate that the standard metrics used to evaluate networks do not necessarily reflect a system’s driving behaviour accurately: a promising confusion matrix may result in poor driving behaviour, while a very ill-looking confusion matrix may result in good driving behaviour.

The main architecture is a variation of either the NVIDIA, AlexNet, or VGG19 architecture. For the AlexNet architecture, they removed the dropout of the final two dense layers and reduced their sizes to 500 and 200 neurons, as this resulted in better performance. The output layer of the network depends on its type (regression or classification) and, for a classification network, on the number of classes. For the classification type, they quantize the steering angle measurements into discrete values, which represent the class labels. This quantization is needed as input when training a classifier network and allows the data to be balanced through sample weighting. The weighting acts as a coefficient for the network’s learning rate on each sample. A sample’s weight is directly related to the class it belongs to when quantized. These class weights are defined as 1 divided by the number of samples in the training set that belong to that class, multiplied by a constant so that the smallest class weight is equal to 1. Sample weighting is done for both classifier networks and regression networks.
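This weighting scheme is easy to reproduce. Below is a minimal NumPy sketch (my own illustration, not the authors’ code), assuming the quantized class labels are available as a 1-D array:

```python
import numpy as np

def compute_class_weights(class_labels):
    """Weight of a class = 1 / its sample count, rescaled so that the
    smallest class weight is exactly 1 (the constant from the paper)."""
    classes, counts = np.unique(class_labels, return_counts=True)
    weights = (1.0 / counts) * counts.max()
    return dict(zip(classes, weights))

# Each sample inherits the weight of the class it was quantized into:
labels = np.array([0, 0, 0, 1, 1, 2])           # toy quantized labels
class_weights = compute_class_weights(labels)   # {0: 1.0, 1: 1.5, 2: 3.0}
sample_weights = np.array([class_weights[c] for c in labels])
```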

They train and evaluate different networks on the Comma.ai dataset, which consists of 7.25 hours of driving, most of it done on highways and during daytime. Images are captured at 20 Hz, which results in approximately 552,000 images. They discarded the few sequences recorded at night due to their high imbalance compared to the daytime ones, and limit themselves to only considering images captured while driving on highways. The data is then split into two mutually exclusive partitions: a training set of 161,500 images and a validation set of 10,700 images.

They evaluate the performance of their networks using the following metrics: accuracy, mean class accuracy (MCA), mean absolute error (MAE), and mean squared error (MSE). They base their conclusions on the MSE metric, since it takes the magnitude of the error into account and assigns a higher loss to large errors than MAE does. This is desirable since it may lead to better driving behaviour: they assume that it is easier for the system to recover from many small mistakes than from a few big ones, as a large prediction error could result in a big sudden change of the steering wheel angle.
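These four metrics are straightforward to compute. A small sketch (my own, assuming the continuous angles plus their quantized class labels are given as NumPy arrays):

```python
import numpy as np

def evaluate(y_true, y_pred, cls_true, cls_pred):
    """The four metrics from the paper (my sketch, not the authors' code).

    y_true / y_pred: continuous steering angles; cls_true / cls_pred:
    their quantized class labels (regression outputs are discretized
    first, so accuracy and MCA can be computed for them too).
    """
    mae = np.mean(np.abs(y_true - y_pred))
    mse = np.mean((y_true - y_pred) ** 2)
    accuracy = np.mean(cls_true == cls_pred)
    # MCA averages the per-class accuracies, so rare classes count as
    # much as frequent ones.
    mca = np.mean([np.mean(cls_pred[cls_true == c] == c)
                   for c in np.unique(cls_true)])
    return {'accuracy': accuracy, 'MCA': mca, 'MAE': mae, 'MSE': mse}
```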

In the first experiment (quantization granularity), they look into the influence that the specifications of the class quantization procedure have on the system’s performance. These specifications consist of the number of classes and the mapping from the input range to those classes. They compare classifier networks with varying degrees of input measurement granularity, and also compare them to regression networks, which can be seen as having infinitely many classes, although they use a different loss function. The experiment contrasts a coarse-grained quantization scheme with 7 classes against a finer-grained scheme with 17 classes. For regression networks, the only difference between the 7- and 17-class settings lies in the class weighting: each sample is given a weight based on its relative occurrence in the 7 or 17 classes. Also, to be able to compare regression against classification, the predicted regression outputs were discretized into 7 and 17 classes to calculate the MCA in the same way as for the classification networks. The coarse-grained scheme scores better on the accuracy and MCA metrics. Regression networks significantly outperform classifier networks on the MAE and MSE metrics, which are the most important ones. Finally, they notice that class weighting does not have a significant impact on the performance of regression networks.
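The exact bin edges of the quantization are not detailed above, but the idea can be sketched with uniform bins over a symmetric steering range (the range and the uniformity are my assumptions):

```python
import numpy as np

def quantize_angles(angles, n_classes=7, max_angle=90.0):
    """Map continuous steering angles to class labels 0..n_classes-1.

    Uniform bins over [-max_angle, max_angle] are an assumption; the
    paper only states that 7- and 17-class schemes are compared.
    """
    edges = np.linspace(-max_angle, max_angle, n_classes + 1)[1:-1]
    return np.digitize(angles, edges)

angles = np.array([-45.0, -2.0, 0.5, 30.0])
print(quantize_angles(angles, n_classes=7))    # coarse-grained labels
print(quantize_angles(angles, n_classes=17))   # finer-grained labels
```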

The second experiment is about the image colour scheme. They observed no significant difference in performance between networks that use coloured (RGB) images and those that use grayscale images as input. This suggests that, for the task at hand, the system is not able to take much advantage of the colour information.
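For reference, converting the input to grayscale is a one-liner with OpenCV; the only architectural change is that the first convolution layer then expects one channel instead of three:

```python
import cv2
import numpy as np

rgb_image = np.zeros((160, 320, 3), dtype=np.uint8)   # placeholder frame
gray = cv2.cvtColor(rgb_image, cv2.COLOR_RGB2GRAY)    # shape (160, 320)
gray = gray[..., np.newaxis]                          # (160, 320, 1) for the network
```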

They also evaluate methods that enable the system to take advantage of information that co-occurs in consecutive inputs. This could lead to a significant increase in performance, since the input images are obtained from successive frames of a video, which introduces temporal consistency.

In the first method (stacked frames), they concatenate multiple subsequent input images to create a stacked image, and feed this stacked image to the network as a single input. This means that for image i_t at time/frame t, the images i_{t-1}, i_{t-2}, … are concatenated to it. To measure the influence of this stacked input, the input size must be the only variable; for this reason, the images are concatenated in the depth (channel) dimension and not in a new, fourth dimension. For example, stacking two previous images onto the current RGB image of 160x320x3 pixels changes its size to 160x320x9 pixels. By doing this, the architecture stays the same, since the first layer remains a 2D convolution layer. They expect that, by taking advantage of the temporal information between consecutive inputs, the network should outperform networks that make independent predictions from single images. They compare single images to stacks built from 2, 5, or 8 previous images (3, 6, or 9 frames in total). The results show that feeding the network stacked frames increases the performance on all metrics. Looking at MSE, there is a significant decrease of about 30% when comparing single images to stacks of 3 frames. They assume that this is because the network can make a prediction based on the average information of multiple images: for a single image, the predicted value may be too high or too low, while for concatenated images the combined information can cancel out, giving a better ‘averaged’ prediction. Increasing the number of concatenated images only leads to small improvements with diminishing returns. Assuming that the network averages the images in some way, they do not want to increase this number further because the network would lose responsiveness. Based on these observations, the configuration with 3 concatenated frames is preferable in their setting: it offers a significant boost in performance while the system remains relatively responsive.
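Stacking along the channel axis can be sketched in a couple of NumPy lines (my own illustration):

```python
import numpy as np

def stack_frames(frames):
    """Concatenate consecutive frames in the depth (channel) dimension:
    three 160x320x3 RGB frames become one 160x320x9 input, so the first
    layer of the network can remain an ordinary 2D convolution."""
    return np.concatenate(frames, axis=-1)

# Current frame plus its two predecessors -> a 3-frame stack:
frames = [np.zeros((160, 320, 3), dtype=np.uint8) for _ in range(3)]
print(stack_frames(frames).shape)   # (160, 320, 9)
```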

In the second technique, they modify the architecture to include recurrent neural network layers, using Long Short-Term Memory (LSTM) layers. These layers allow the network to capture the temporal information between consecutive inputs. The networks are trained on an input vector that consists of the input image and a number of preceding images, just like the stacked frames. Together with their training methodology, this results in a time window; due to the randomization in training, this is not a sliding window but a window at a random point in time for every input sample. They compared many variations of the NVIDIA architecture: configurations where one or both of the two dense layers are changed to LSTM layers, one where an LSTM layer is added after the dense layers, and one where the output layer is changed to an LSTM. Training these networks from scratch led to very poor performance. This might be caused by the fact that, while an LSTM offers increased capabilities, it also has more parameters that need to be learned; they hypothesize that their dataset is too small for this, especially without data augmentation. Therefore, when they create an LSTM network, they load a pretrained network: the NVIDIA network variant from their granularity experiment with the corresponding output type. Depending on the exact architecture of the LSTM network, the weights of corresponding layers are copied, while the weights of non-corresponding layers are initialized as usual. The weights of the convolutional layers are frozen, as they have already been trained to detect the important features, and this reduces the training time. The results show that the incorporation of LSTM layers neither increased nor reduced the network’s performance.
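A rough Keras sketch of this transfer-learning setup (the layer sizes, the file name, and the window length of 3 frames are my assumptions, not the paper’s exact configuration):

```python
from keras.models import Sequential, load_model
from keras.layers import Conv2D, Flatten, Dense, LSTM, TimeDistributed

# Hypothetical file containing the pretrained NVIDIA regression variant.
pretrained = load_model('nvidia_regression.h5')

model = Sequential()
# Run the convolutional feature extractor on every frame of the window.
model.add(TimeDistributed(Conv2D(24, (5, 5), strides=(2, 2), activation='relu'),
                          input_shape=(3, 160, 320, 3)))   # window of 3 frames
model.add(TimeDistributed(Conv2D(36, (5, 5), strides=(2, 2), activation='relu')))
model.add(TimeDistributed(Flatten()))
model.add(LSTM(100))    # recurrent layer replacing a dense layer
model.add(Dense(1))     # regression output: the steering angle

# Copy the weights of the corresponding convolutional layers and freeze
# them; non-corresponding layers keep their fresh initialization.
# (Assumes pretrained.layers[:2] are the matching Conv2D layers.)
for new_layer, old_layer in zip(model.layers[:2], pretrained.layers[:2]):
    new_layer.set_weights(old_layer.get_weights())
    new_layer.trainable = False

model.compile(optimizer='adam', loss='mse')
```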

A last aspect they investigate is the origin of the data: they look into the advantages of a simulator over a real-world dataset and the uses of such a simulator. A simulator brings many advantages; for example, data gathering is easy, cheap, and can be automated. First, the Udacity simulator is used to generate three datasets. The first dataset is gathered by manually driving around the first test track in the simulator. The second dataset consists of recovery cases only: it is gathered by deviating from the road and then recording the recovery back to the middle of the road. A third, validation dataset is gathered by driving around the track in the same way as for the first dataset. The NVIDIA architecture with a regression output is used, and no sample weighting is applied during training.

The first experiment tests the performance of a network trained solely on the first dataset. The metrics are comparable to other runs on the real dataset and, as the confusion matrix has a dense diagonal, good real-time driving performance is expected. When driving in the simulator, the network starts off quite well and stays nicely in the middle of the road. When it encounters a more difficult sharp turn, however, the network slightly mispredicts some frames. The car deviates from the middle of the road and is not able to recover from its mispredictions, eventually going completely off-track. They conclude that, despite promising performance on the traditional metrics, the system fails to keep the car on the road.

The second experiment evaluates the influence of adding recovery data. First, a new network is trained solely on the recovery dataset. Its confusion matrix is focused on steering sharply to the left or right; as it does not look very promising and the MCA is very low, this network is expected to perform poorly during real-time driving. Despite the low scores on these metrics, the network manages to keep the car on track. The car does not stay exactly in the middle of the road, however: it diverts from the centre, recovers back towards the middle, diverts towards the other side and back to the middle again, and so on. The car thus wobbles softly during the straight parts of the track, but handles the sharp turns surprisingly well.

A third network is trained on both datasets and has a confusion matrix similar to the first network’s. In the simulator it performs quite well, driving smoothly in the middle of the lane on the straight parts as well as in sharp turns. They conclude that recovery cases have a significant impact on the system’s driving behaviour: adding them improves the driving performance of the system while its performance on the standard metrics deteriorates. This again suggests that the standard metrics might not be a good tool to accurately assess a network’s driving behaviour.

Finally, GTA V is integrated as a more realistic simulation platform, and it provides a dataset of 600k images, split into 430k training images and 58k validation images. An NVIDIA and an AlexNet regression network are trained on this dataset with sample weights based on 17 classes. The network shows performance metrics similar to the NVIDIA regression network trained on the real-world dataset. They evaluate real-time driving performance on an easy, non-urban road with clear lane markings. The network performs quite well and stays around the centre of the lane. When approaching a road with vague lane markings, such as a small bridge, the car deviates towards the opposite lane. When it reaches a three-way crossing, the network cannot decide whether to go left or right, as it was trained equally on both cases; because of this, it drives straight and goes off-road. In an urban environment, the network struggles with the same problem, resulting in poor real-time performance. Once more, the current metrics are not always representative of real-time driving performance.