Weeks 14-18: TensorFlow & Caffe working with GPU - Comparison
These weeks I have finally integrated the CPMs TensorFlow implementation. Now the humanpose component can estimate poses with both frameworks, Caffe and TensorFlow. Switching between them is as easy as changing the `Framework` parameter in the brand new `humanpose.yml` configuration file. The configuration file format has been changed to YAML to keep up with the latest JdeRobot updates. This change only affects the Camera object code, which now depends on the `comm` and `config` libraries (installed along with JdeRobot). These libraries provide a new level of abstraction, so there is no need to use Ice directly to establish communication with the drivers. Besides the framework to use, the YAML file specifies a number of parameters shared between Caffe and TensorFlow (boxsize, limb colors…), as well as the path to each model.
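As a rough illustration of how a single parameter can select the backend, here is a minimal sketch. The dictionary stands in for the parsed contents of `humanpose.yml`; only `Framework` and the boxsize are mentioned in this post, so every other key, path, class and function name below is hypothetical and not taken from the actual component.

```python
# Hypothetical stand-in for the parsed humanpose.yml. Only "Framework" and
# the boxsize are named in the post; the other keys and paths are made up.
config = {
    "Framework": "caffe",
    "boxsize": 192,
    "caffe_models": {"pose": "path/to/caffe/pose_model"},
    "tensorflow_models": {"pose": "path/to/tensorflow/pose_model"},
}


class CaffeEstimator:
    """Placeholder for a Caffe-backed pose estimator (hypothetical)."""

    def __init__(self, models, boxsize):
        self.models = models
        self.boxsize = boxsize


class TensorFlowEstimator:
    """Placeholder for a TensorFlow-backed pose estimator (hypothetical)."""

    def __init__(self, models, boxsize):
        self.models = models
        self.boxsize = boxsize


# Maps the Framework value to (estimator class, config key with its models).
ESTIMATORS = {
    "caffe": (CaffeEstimator, "caffe_models"),
    "tensorflow": (TensorFlowEstimator, "tensorflow_models"),
}


def build_estimator(cfg):
    """Pick the estimator class from the Framework parameter."""
    name = cfg["Framework"].lower()
    if name not in ESTIMATORS:
        raise ValueError("Unknown framework: %r" % cfg["Framework"])
    estimator_cls, models_key = ESTIMATORS[name]
    # The shared parameters (here just boxsize) go to either backend.
    return estimator_cls(cfg[models_key], cfg["boxsize"])


estimator = build_estimator(config)
```

With this layout, the shared parameters are read once and handed to whichever backend is selected, so changing the framework means editing a single value.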
Another big step forward in the past weeks has been enabling CUDA-based acceleration for both frameworks. I have also upgraded my hardware. Current hardware and software specifications:
- Laptop: Intel Core i7-7700HQ @ 2.80GHz; NVIDIA GeForce GTX-1050.
- CUDA: v8.0.
- cuDNN: v7.
Before moving on to solve a real problem with the acquired knowledge, it’s worth comparing the performance and the qualitative results of the integrated models. The following test has been carried out:
- Both models have been tested against the first ten seconds of the following video: McEwen Spin-O-Rama to the Button - 2015 World Financial Group Continental Cup of Curling. At 30 fps, that adds up to 300 frames.
- Both CPU and GPU-accelerated inference have been evaluated.
- Each model has been tested with four different boxsizes: 96, 128, 192 and 320 px.
- For each of these 2x2x4 = 16 tests, I have stored, for each of the 300 frames, the inference times of:
  - Human detector model.
  - Pose estimation model.
  - Total time. It includes the human and pose inference times, as well as the time it takes to process the images and coordinates before, during and after them.
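The measurement described above could be sketched roughly as follows. This is not the code used for the tests: `detect_humans` and `estimate_pose` are hypothetical callables standing in for the two models, and the dummy versions at the end only exist to make the sketch runnable.

```python
import itertools
import time
from statistics import mean

FRAMEWORKS = ("caffe", "tensorflow")
DEVICES = ("cpu", "gpu")
BOXSIZES = (96, 128, 192, 320)

# The 2 x 2 x 4 grid of test configurations.
test_grid = list(itertools.product(FRAMEWORKS, DEVICES, BOXSIZES))


def time_frame(frame, detect_humans, estimate_pose):
    """Return (human_ms, pose_ms, total_ms) for a single frame."""
    t_start = time.perf_counter()

    t0 = time.perf_counter()
    boxes = detect_humans(frame)
    human_ms = (time.perf_counter() - t0) * 1000.0

    t0 = time.perf_counter()
    estimate_pose(frame, boxes)
    pose_ms = (time.perf_counter() - t0) * 1000.0

    # Total spans both model calls plus any processing around them.
    total_ms = (time.perf_counter() - t_start) * 1000.0
    return human_ms, pose_ms, total_ms


def benchmark(frames, detect_humans, estimate_pose):
    """Average the three timings over all frames."""
    samples = [time_frame(f, detect_humans, estimate_pose) for f in frames]
    human, pose, total = zip(*samples)
    return {
        "human_ms": mean(human),
        "pose_ms": mean(pose),
        "total_ms": mean(total),
    }


# Dummy stand-ins for the real models, only to exercise the harness.
def dummy_detect(frame):
    return [(0, 0, 10, 10)]


def dummy_pose(frame, boxes):
    return [[] for _ in boxes]


averages = benchmark(range(300), dummy_detect, dummy_pose)
```

Running `benchmark` once per entry of `test_grid`, with the matching models loaded, would produce one row of averages per configuration.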
Performance comparison
In terms of performance, the Caffe model (remember, the original release) does slightly better than its sibling TensorFlow implementation. The following figure shows the average times for human detection, pose estimation and full prediction depending on the boxsize.
And here are the tabulated results for the same tests.
Human detection times (ms)
| | 96 px | 128 px | 192 px | 320 px |
|---|---|---|---|---|
| CPU - TensorFlow | 215 | 378 | 846 | 2385 |
| CPU - Caffe | 328 | 559 | 1230 | 3378 |
| GPU - TensorFlow | 34 | 40 | 60 | 144 |
| GPU - Caffe | 23 | 28 | 50 | 153 |
Pose estimation times (ms)
| | 96 px | 128 px | 192 px | 320 px |
|---|---|---|---|---|
| CPU - TensorFlow | 315 | 588 | 1335 | 4002 |
| CPU - Caffe | 270 | 451 | 1028 | 3058 |
| GPU - TensorFlow | 71 | 94 | 133 | 312 |
| GPU - Caffe | 26 | 33 | 48 | 156 |
Full inference times (ms)
| | 96 px | 128 px | 192 px | 320 px |
|---|---|---|---|---|
| CPU - TensorFlow | 473 | 944 | 1841 | 5659 |
| CPU - Caffe | 580 | 1030 | 2056 | 6039 |
| GPU - TensorFlow | 119 | 165 | 204 | 489 |
| GPU - Caffe | 73 | 94 | 129 | 368 |
After taking a look at the results, the first thing that stands out is the great difference between CPU and GPU-accelerated inference. In the case of TensorFlow, using CUDA and cuDNN makes the complete inference roughly 4 to 12 times faster depending on the boxsize, while the Caffe model makes predictions roughly 8 to 16 times faster. It’s worth noting that while the TensorFlow model is slightly faster than the Caffe one without GPU-based acceleration, Caffe performs better when the GPU is used, specifically around 1.3 to 1.8 times faster.
If we compare human detection and pose estimation times, pose estimation clearly takes longer for TensorFlow; for Caffe the two are very close on GPU, and human detection is actually the slower step on CPU. On GPU, if we sum up both times and compare them with the full inference times, we can see that there’s a little overhead introduced when processing frames, drawing limbs… but it doesn’t seem worrying, at least for now.
In a nutshell, we get a great improvement with the GPU, and Caffe performs a little faster than TensorFlow. With a boxsize of 192 px, which gives nice qualitative results, the Caffe model can make pose estimations at about 7-10 frames per second (the 129 ms average corresponds to roughly 7.8).
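The ratios discussed here can be rechecked directly from the full inference times table. The short sketch below only redoes that arithmetic; the variable and function names are illustrative.

```python
BOXSIZES = (96, 128, 192, 320)

# Average full inference times in ms, copied from the table above.
full_ms = {
    ("cpu", "tensorflow"): (473, 944, 1841, 5659),
    ("cpu", "caffe"): (580, 1030, 2056, 6039),
    ("gpu", "tensorflow"): (119, 165, 204, 489),
    ("gpu", "caffe"): (73, 94, 129, 368),
}


def ratios(slow, fast):
    """Element-wise slow/fast ratio, one value per boxsize."""
    return [round(s / f, 1) for s, f in zip(slow, fast)]


# GPU speedup over CPU for each framework.
tf_gpu_speedup = ratios(full_ms[("cpu", "tensorflow")],
                        full_ms[("gpu", "tensorflow")])
caffe_gpu_speedup = ratios(full_ms[("cpu", "caffe")],
                           full_ms[("gpu", "caffe")])

# How much faster Caffe is than TensorFlow when both run on GPU.
caffe_vs_tf_gpu = ratios(full_ms[("gpu", "tensorflow")],
                         full_ms[("gpu", "caffe")])

# Frames per second for Caffe on GPU: 1000 ms divided by ms per frame.
caffe_gpu_fps = [round(1000 / ms, 1) for ms in full_ms[("gpu", "caffe")]]

for i, box in enumerate(BOXSIZES):
    print(box, tf_gpu_speedup[i], caffe_gpu_speedup[i],
          caffe_vs_tf_gpu[i], caffe_gpu_fps[i])
```

The speedup grows with the boxsize for both frameworks, which suggests that the GPU pays off more as the input gets larger.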
Qualitative results
Now let’s take a look at the estimated poses. The following video shows comparisons between the Caffe and TensorFlow models (with GPU and boxsize = 192 px) and between different boxsizes (TensorFlow with GPU). Needless to say, the framerate has been adjusted to get a natural-looking video and does not represent real inference times.
As can be seen in the video, it’s difficult to appreciate differences between the poses estimated by the two models. It may be too risky to draw any conclusion without performing a quantitative analysis, but it seems like they have been trained similarly. With regard to the different boxsizes, bigger boxes visibly lead to better results. A good trade-off between inference time and results is reached with a 192 px boxsize.