This week, I read some papers about Deep Learning for Steering Autonomous Vehicles. Some of these papers are:
End-to-End Deep Learning for Steering Autonomous Vehicles Considering Temporal Dependencies: In this work, the authors propose a Convolutional Long Short-Term Memory Recurrent Neural Network (C-LSTM) that is end-to-end trainable and learns both the visual and the dynamic temporal dependencies of driving. To train and validate the proposed method, they used the publicly available Comma.ai dataset. The proposed system comprises a front-facing RGB camera and a composite neural network, consisting of a CNN and an LSTM, that estimates the steering wheel angle from the camera input. Camera images are processed frame by frame by the CNN. The CNN is pre-trained on the ImageNet dataset, which contains 1.2 million images of approximately 1000 different classes, so it can recognize a generic set of features and a wide variety of objects with high precision. The trained network is then transferred from that broad domain to the specific domain of driving scene images. The LSTM then processes a sliding window of w fixed-length feature vectors from the CNN. In turn, the LSTM layers learn to recognize temporal dependencies leading to a steering decision Yt based on the inputs from Xt−w to Xt. Small values of w lead to faster reactions, but the network learns only short-term dependencies and becomes more susceptible to individually misclassified frames, whereas large values of w lead to smoother behavior, and hence more stable steering predictions, but increase the chance of learning wrong long-term dependencies. The sliding window concept allows the network to learn to recognize different steering angles from the same frame Xi, but at different temporal states of the LSTM layers. For the domain-specific training, the classification layer of the CNN is re-initialized and trained on camera road data. Training of the LSTM layers is conducted in a many-to-one fashion: the network learns the steering decisions that are associated with intervals of driving.
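As a rough illustration of this CNN + sliding-window LSTM arrangement (my own sketch, not the authors' code; the MobileNetV2 backbone, the window length w = 10, the input size and the layer widths are all assumptions), a Keras model could look like this:

import tensorflow as tf

w = 10                       # sliding-window length (assumption)
frame_shape = (224, 224, 3)  # input frame size expected by the backbone (assumption)

# Pre-trained CNN (ImageNet weights) used as a frozen per-frame feature extractor.
backbone = tf.keras.applications.MobileNetV2(
    input_shape=frame_shape, include_top=False, pooling='avg', weights='imagenet')
backbone.trainable = False

model = tf.keras.Sequential([
    # Apply the same CNN to each of the w frames in the window.
    tf.keras.layers.TimeDistributed(backbone, input_shape=(w,) + frame_shape),
    # The LSTM learns temporal dependencies across the window (many-to-one).
    tf.keras.layers.LSTM(64),
    # Single continuous output: the steering wheel angle for the last frame.
    tf.keras.layers.Dense(1)
])

model.compile(optimizer='adam', loss='mse')
model.summary()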
Reactive Ground Vehicle Control via Deep Networks: They present a deep-learning-based reactive controller that uses a simple network architecture and requires few training images. Despite its simple structure and small size, their network architecture, called ControlNet, outperforms more complex networks in multiple environments using different robot platforms. They evaluate ControlNet in structured indoor environments and unstructured outdoor environments. This paper focuses on the low-level task of reactive control, where the robot must avoid obstacles that were not present during map construction, such as dynamic obstacles and items added to the environment afterwards. ControlNet abstracts RGB images into control commands: turn left, turn right, and go straight. ControlNet's architecture consists of alternating convolutional and max-pooling layers, followed by two fully connected layers. The convolutional and pooling layers extract geometric information about the environment, while the fully connected layers act as a general classifier. A long short-term memory (LSTM) layer incorporates temporal information, allowing the robot to continue moving in the same direction over several frames. ControlNet has 63,223 trainable parameters.
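A minimal sketch in the spirit of ControlNet as summarized above (filter counts, frame size and sequence length are illustrative assumptions, not the published 63,223-parameter configuration): alternating convolution/max-pooling blocks, an LSTM for temporal context, and a 3-way classifier over turn left / turn right / go straight.

import tensorflow as tf

seq_len = 5                # number of consecutive frames fed to the LSTM (assumption)
frame_shape = (64, 64, 3)  # input image size (assumption)

# Convolution + pooling blocks extract geometric information from each frame.
cnn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation='relu', input_shape=frame_shape),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(16, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
])

model = tf.keras.Sequential([
    tf.keras.layers.TimeDistributed(cnn, input_shape=(seq_len,) + frame_shape),
    tf.keras.layers.LSTM(32),                        # temporal smoothing across frames
    tf.keras.layers.Dense(32, activation='relu'),    # fully connected classifier head
    tf.keras.layers.Dense(3, activation='softmax')   # left / right / straight
])

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])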
Explaining How a Deep Neural Network Trained with End-to-End Learning Steers a Car: NVIDIA has created a neural-network-based system, known as PilotNet, which outputs steering angles given images of the road ahead. PilotNet is trained using road images paired with the steering angles generated by a human driving a data-collection car, so it derives the necessary domain knowledge by observing human drivers. Road tests demonstrated that PilotNet can successfully perform lane keeping in a wide variety of driving conditions, regardless of whether lane markings are present or not. The PilotNet training data contains single images sampled from video from a front-facing camera in the car, paired with the corresponding steering command (1/r), where r is the turning radius of the vehicle. The training data is augmented with additional image/steering-command pairs that simulate the vehicle in different off-center and off-orientation positions. The PilotNet network consists of 9 layers: a normalization layer, 5 convolutional layers, and 3 fully connected layers. The input image is split into YUV planes and passed to the network. The central idea in discerning the salient objects is to find the parts of the image that correspond to locations where the feature maps have the greatest activations. The activations of the higher-level maps become masks for the activations of lower levels using the following algorithm: (1) in each layer, the activations of the feature maps are averaged; (2) the topmost averaged map is scaled up to the size of the map of the layer below; (3) the up-scaled averaged map from the upper level is then multiplied with the averaged map of the layer below; (4) the resulting intermediate mask is scaled up to the size of the maps of the next layer below, in the same way as in Step (2); (5) the up-scaled intermediate mask is again multiplied with the averaged map of that layer; (6) Steps (4) and (5) are repeated until the input is reached. The last mask, which has the size of the input image, is normalized to the range 0.0 to 1.0 and becomes the final visualization mask, showing which regions of the input image contribute most to the output of the network.
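The mask-propagation steps (1)-(6) can be written down compactly. The following NumPy/OpenCV sketch is my own rendering of the procedure described above, assuming feature_maps is a list of per-layer activation arrays of shape (height, width, channels), ordered from the layer closest to the input up to the topmost layer:

import numpy as np
import cv2  # used here only to up-scale the averaged maps

def visualization_mask(feature_maps, input_hw):
    # (1) average each layer's feature maps over the channel axis
    averaged = [fm.mean(axis=-1) for fm in feature_maps]

    # (2) start from the topmost averaged map
    mask = averaged[-1]
    for lower in reversed(averaged[:-1]):
        # (2)/(4) scale the current mask up to the size of the map below
        mask = cv2.resize(mask, (lower.shape[1], lower.shape[0]))
        # (3)/(5) multiply with the averaged map of the layer below
        mask = mask * lower

    # (6) finally scale up to the input-image resolution...
    mask = cv2.resize(mask, (input_hw[1], input_hw[0]))
    # ...and normalize to the range [0.0, 1.0]
    mask = mask - mask.min()
    if mask.max() > 0:
        mask = mask / mask.max()
    return mask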
Deep Steering: Learning End-to-End Driving Model from Spatial and Temporal Visual Cues: In this work they focus on a vision-based model that directly maps raw input images to steering angles using deep networks. First, the model is learned and evaluated on real human driving videos that are time-synchronized with other vehicle sensors, unlike many prior models trained on synthetic data from racing games. Second, state-of-the-art models, such as PilotNet, mostly predict the wheel angle independently for each video frame, which contradicts the common understanding of driving as a stateful process. Instead, their proposed model combines spatial and temporal cues, jointly exploiting instantaneous monocular camera observations and the vehicle's historical states. In practice, this is accomplished by inserting carefully designed recurrent units (e.g., LSTM and Conv-LSTM) at appropriate network layers. Third, to facilitate the interpretability of the learned model, they use a visual back-propagation scheme for discovering and visualizing the image regions that crucially influence the final steering prediction.
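For reference, this is roughly what inserting a convolutional recurrent unit looks like in Keras (a sketch under my own assumptions about sizes, not the paper's network). Unlike an LSTM applied to flattened CNN features, a ConvLSTM2D layer keeps the spatial structure of the frames while modeling time:

import tensorflow as tf

seq_len = 5                 # number of past frames (assumption)
frame_shape = (66, 200, 3)  # input image size (assumption)

model = tf.keras.Sequential([
    # ConvLSTM2D processes a sequence of frames while preserving spatial structure.
    tf.keras.layers.ConvLSTM2D(16, kernel_size=3, strides=2, activation='tanh',
                               input_shape=(seq_len,) + frame_shape),
    tf.keras.layers.Conv2D(32, 3, strides=2, activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)  # steering angle
])

model.compile(optimizer='adam', loss='mse')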
Agile Autonomous Driving using End-to-End Deep Imitation Learning: They present an end-to-end imitation learning system for agile, off-road autonomous driving using only low-cost on-board sensors. By imitating a model predictive controller equipped with advanced sensors, they train a deep neural network control policy to map raw, high-dimensional observations to continuous steering and throttle commands. Compared with recent approaches to similar tasks, their method requires neither state estimation nor on-the-fly planning to navigate the vehicle. Their approach relies on, and experimentally validates, recent imitation learning theory.
In addition, I followed the TensorFlow convolutional neural networks tutorial. In this tutorial, I've learnt how to use the layers module to build a convolutional neural network model that recognizes the handwritten digits of the MNIST dataset. As the model trains, it prints log output like the following:
INFO:tensorflow:loss = 2.36026, step = 1
INFO:tensorflow:probabilities = [[ 0.07722801 0.08618255 0.09256398, ...]]
...
INFO:tensorflow:loss = 2.13119, step = 101
INFO:tensorflow:global_step/sec: 5.44132
...
INFO:tensorflow:Saving checkpoints for 20000 into /tmp/mnist_convnet_model/model.ckpt.
INFO:tensorflow:Loss for final step: 0.14782684.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-06-01-15:31:44
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/mnist_convnet_model/model.ckpt-20000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-06-01-15:31:53
INFO:tensorflow:Saving dict for global step 20000: accuracy = 0.9695, global_step = 20000, loss = 0.10200113
{'loss': 0.10200113, 'global_step': 20000, 'accuracy': 0.9695}
Here, I’ve achieved an accuracy of 96.95% on our test data set.