Weeks 14-18: TensorFlow & Caffe working with GPU - Comparison

4 minute read

These weeks I have finally integrated the TensorFlow implementation of CPMs. Now the humanpose component can estimate poses with both frameworks, Caffe and TensorFlow. Switching between them is as easy as changing the Framework parameter in the brand new humanpose.yml configuration file. The configuration file format has been changed to YAML to keep up with the latest JdeRobot updates. This change only affects the Camera object code, which now depends on the comm and config libraries (installed along with JdeRobot). These libraries provide a new level of abstraction, avoiding the need to use Ice directly to establish communication with the drivers. Besides which framework to use, a set of parameters shared between Caffe and TensorFlow (boxsize, limb colors…), as well as the path to each model, are specified in the YAML file.
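To give an idea of what that looks like, here is a minimal sketch of a humanpose.yml; the key names and paths below are illustrative assumptions, not the component's exact schema:

```yaml
# Hypothetical sketch of humanpose.yml -- key names and paths are
# illustrative only, not the exact schema used by the component.
HumanPose:
  Framework: "caffe"          # or "tensorflow"; selects the backend
  Boxsize: 192                # input box size shared by both backends
  LimbColors: [[255, 0, 0], [0, 255, 0], [0, 0, 255]]
  Caffe:
    ModelPath: "models/caffe/pose_model.caffemodel"
  TensorFlow:
    ModelPath: "models/tensorflow/cpm_model.ckpt"
```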

Another big step forward that has been taken in the past weeks is enabling CUDA based acceleration for both frameworks. I have also upgraded my hardware. Current hardware and software specifications:

  • Laptop: Intel Core i7-7700HQ @ 2.80GHz; NVIDIA GeForce GTX-1050.
  • CUDA: v8.0.
  • CuDNN: v7.

Before moving on to solving a real problem with the acquired knowledge, it is worth comparing the performance and qualitative results of the integrated models. The following test has been carried out:

  • Both models have been tested against the first ten seconds of the following video: McEwen Spin-O-Rama to the Button - 2015 World Financial Group Continental Cup of Curling. At 30 fps, that amounts to 300 frames.
  • CPU and GPU accelerated inferences have been evaluated.
  • Each model has been tested out using four different boxsizes: 96, 128, 192, 320.
  • For each of these 2x2x4 = 16 tests, I have stored, for each of the 300 frames, the inference times of:
    • The human detector model.
    • The pose estimation model.
    • The total pipeline. This includes human and pose inference times, as well as the time it takes to process the images and coordinates before, during and after them.
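The per-frame measurement described above can be sketched as follows; `detect_humans` and `estimate_pose` are stand-ins for the real Caffe or TensorFlow model calls, which are not reproduced here:

```python
import time

def time_frames(frames, detect_humans, estimate_pose):
    """Record per-frame inference times for a detector/estimator pair.

    detect_humans and estimate_pose are placeholders for the actual
    model calls (Caffe or TensorFlow); each returns its own result.
    """
    records = []
    for frame in frames:
        t0 = time.perf_counter()
        boxes = detect_humans(frame)                       # human detector
        t1 = time.perf_counter()
        poses = [estimate_pose(frame, box) for box in boxes]  # pose model
        t2 = time.perf_counter()
        # In the real component, pre/post-processing (coordinate mapping,
        # drawing limbs...) also falls inside the "total" measurement.
        t3 = time.perf_counter()
        records.append({
            "human_ms": (t1 - t0) * 1000,
            "pose_ms": (t2 - t1) * 1000,
            "total_ms": (t3 - t0) * 1000,
        })
    return records

def average_ms(records, key):
    """Average one timing column, as reported in the tables below."""
    return sum(r[key] for r in records) / len(records)
```

Averaging each column of the returned records over the 300 frames yields the per-boxsize figures reported in the tables that follow.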

Performance comparison

In terms of performance, the Caffe model (remember, the original release) does slightly better than its sibling TensorFlow implementation. The following figure shows the average times for human detection, pose estimation and full prediction as a function of boxsize.

And here are the tabulated results for the same tests.

Human detection times (ms)

|                  | 96 px | 128 px | 192 px | 320 px |
|------------------|------:|-------:|-------:|-------:|
| CPU - TensorFlow |   215 |    378 |    846 |   2385 |
| CPU - Caffe      |   328 |    559 |   1230 |   3378 |
| GPU - TensorFlow |    34 |     40 |     60 |    144 |
| GPU - Caffe      |    23 |     28 |     50 |    153 |

Pose estimation times (ms)

|                  | 96 px | 128 px | 192 px | 320 px |
|------------------|------:|-------:|-------:|-------:|
| CPU - TensorFlow |   315 |    588 |   1335 |   4002 |
| CPU - Caffe      |   270 |    451 |   1028 |   3058 |
| GPU - TensorFlow |    71 |     94 |    133 |    312 |
| GPU - Caffe      |    26 |     33 |     48 |    156 |

Full inference times (ms)

|                  | 96 px | 128 px | 192 px | 320 px |
|------------------|------:|-------:|-------:|-------:|
| CPU - TensorFlow |   473 |    944 |   1841 |   5659 |
| CPU - Caffe      |   580 |   1030 |   2056 |   6039 |
| GPU - TensorFlow |   119 |    165 |    204 |    489 |
| GPU - Caffe      |    73 |     94 |    129 |    368 |

Looking at the results, the first thing that stands out is the large difference between CPU and GPU accelerated inference. In the case of TensorFlow, using CUDA and CuDNN makes the complete inference around 10 times faster, while the Caffe model makes predictions up to 15 times faster. It is worth noting that while the TensorFlow model is slightly faster than the Caffe one without GPU acceleration, Caffe performs better when the GPU is used, specifically around 1.5 times faster. For both frameworks, pose estimation generally takes longer than human detection, and if we add both times and compare them with the full inference times, we see that there is a little overhead introduced when processing frames, drawing limbs and so on, but it doesn't seem worrying, at least for now. In a nutshell, the GPU brings a great improvement, and Caffe performs a little faster than TensorFlow. With a boxsize of 192 px, which gives nice qualitative results, the Caffe model can estimate poses at about 7-10 frames per second.
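These speedup figures can be checked directly against the full-inference table; the ratios vary with boxsize, and the "around 10x" and "up to 15x" figures correspond to the larger boxsizes:

```python
# Full inference times (ms), copied from the table above.
# Columns correspond to boxsizes 96, 128, 192 and 320 px.
full_ms = {
    ("tf", "cpu"):    [473, 944, 1841, 5659],
    ("tf", "gpu"):    [119, 165, 204, 489],
    ("caffe", "cpu"): [580, 1030, 2056, 6039],
    ("caffe", "gpu"): [73, 94, 129, 368],
}

def speedup(framework):
    """CPU-to-GPU speedup per boxsize for one framework."""
    return [cpu / gpu for cpu, gpu in
            zip(full_ms[(framework, "cpu")], full_ms[(framework, "gpu")])]

tf_speedup = speedup("tf")        # roughly 4.0x to 11.6x across boxsizes
caffe_speedup = speedup("caffe")  # roughly 7.9x to 16.4x across boxsizes

# Frames per second for Caffe + GPU at boxsize 192 px:
fps_caffe_192 = 1000 / full_ms[("caffe", "gpu")][2]  # ~7.8 fps
```

This also confirms the ~7-10 fps figure: 129 ms per frame at 192 px corresponds to roughly 7.8 frames per second.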

Qualitative results

Now let's take a look at the estimated poses. In the following video, comparisons between the Caffe and TensorFlow models (with GPU and boxsize = 192 px) and between different boxsizes (TensorFlow with GPU) are shown. Needless to say, the framerate has been adjusted to get a natural-looking video and does not represent real inference times.

As can be seen in the video, it is difficult to appreciate any differences between the poses estimated by the two models. It may be too risky to draw conclusions without performing a quantitative analysis, but it seems they have been similarly trained. With regard to the different boxsizes, it is pretty obvious that bigger boxes lead to better results. A good trade-off between inference time and quality is reached when using a 192 px boxsize.
