Weeks 14-18: TensorFlow & Caffe working with GPU - Comparison

These weeks I have finally integrated the TensorFlow implementation of CPMs. Now the humanpose component can estimate poses with both frameworks, Caffe and TensorFlow. Switching between them is as easy as changing the Framework parameter in the brand new humanpose.yml configuration file. The configuration file format has been changed to YAML to keep up with the latest JdeRobot updates. This change only affects the Camera object code, which now depends on the comm and config libraries (installed along with JdeRobot). These libraries provide a new level of abstraction, removing the need to use Ice directly to establish communication with the drivers. Besides which framework to use, the YAML file specifies a number of parameters shared between Caffe and TensorFlow (boxsize, limb colors…), as well as the path to each model.
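
To make the idea of switching frameworks through a single parameter more concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the configuration is shown as an already-loaded dictionary, only the Framework parameter name comes from this post, and every other key, path and function name is hypothetical rather than taken from the real humanpose.yml or the component code.

```python
# Hypothetical stand-in for humanpose.yml once it has been loaded into a dict.
# Only "Framework" is a parameter name mentioned in the post; the rest is assumed.
config = {
    "Framework": "tensorflow",
    "boxsize": 192,
    "models": {
        "caffe": "path/to/caffe/model",
        "tensorflow": "path/to/tensorflow/model",
    },
}


def load_caffe_estimator(model_path, boxsize):
    # Placeholder: a real loader would build the Caffe networks here.
    return {"backend": "caffe", "model": model_path, "boxsize": boxsize}


def load_tensorflow_estimator(model_path, boxsize):
    # Placeholder: a real loader would restore the TensorFlow graph here.
    return {"backend": "tensorflow", "model": model_path, "boxsize": boxsize}


# One loader per supported framework, keyed by the lowercase framework name.
LOADERS = {
    "caffe": load_caffe_estimator,
    "tensorflow": load_tensorflow_estimator,
}


def build_estimator(cfg):
    """Pick the loader named by cfg["Framework"] and pass it the shared parameters."""
    framework = cfg["Framework"].lower()
    if framework not in LOADERS:
        raise ValueError("Unsupported framework: " + cfg["Framework"])
    return LOADERS[framework](cfg["models"][framework], cfg["boxsize"])


estimator = build_estimator(config)
print(estimator["backend"], estimator["boxsize"])
```

With a layout like this, the rest of the component never needs to know which framework is behind the estimator, which is what makes the switch a one-line change in the configuration.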

Another big step forward taken in the past weeks is enabling CUDA-based acceleration for both frameworks. I have also upgraded my hardware. Current hardware and software specifications:

Laptop: Intel Core i7-7700HQ @ 2.80GHz; NVIDIA GeForce GTX-1050.
CUDA: v8.0.
CuDNN: v7.

Before moving on to solve a real problem with the acquired knowledge, it’s worth comparing the performance and qualitative results of the integrated models. The following test has been carried out:

- Both models have been tested on the first ten seconds of the following video: McEwen Spin-O-Rama to the Button - 2015 World Financial Group Continental Cup of Curling. At 30 fps, that amounts to 300 frames.
- Both CPU and GPU-accelerated inference have been evaluated.
- Each model has been tested with four different boxsizes: 96, 128, 192 and 320 px.
- For each of these 2x2x4 = 16 tests (2 frameworks, 2 devices, 4 boxsizes), I have stored the following times for each of the 300 frames:
  - Human detector model inference time.
  - Pose estimation model inference time.
  - Total time. It includes the human and pose inference times, as well as the time it takes to process the images and coordinates before, during and after them.
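
The measurement procedure described above can be sketched roughly as follows. This is not the code used for the tests: the detector and pose functions are dummy stand-ins, and all names (`timed`, `benchmark`, `fake_detector`, `fake_pose`) are hypothetical. It only shows one way to enumerate the 16 test combinations and to collect per-frame times for the three measured quantities.

```python
import itertools
import time
from statistics import mean

FRAMEWORKS = ("caffe", "tensorflow")
DEVICES = ("cpu", "gpu")
BOXSIZES = (96, 128, 192, 320)


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0


def benchmark(frames, detect_humans, estimate_pose):
    """Store per-frame times (ms) for each stage and return their averages."""
    times = {"human": [], "pose": [], "total": []}
    for frame in frames:
        total_start = time.perf_counter()
        humans, t_human = timed(detect_humans, frame)
        _, t_pose = timed(estimate_pose, frame, humans)
        times["human"].append(t_human)
        times["pose"].append(t_pose)
        # The total also covers whatever happens between and around both models.
        times["total"].append((time.perf_counter() - total_start) * 1000.0)
    return {stage: mean(values) for stage, values in times.items()}


def fake_detector(frame):
    # Dummy stand-in for the human detector model.
    return [frame]


def fake_pose(frame, humans):
    # Dummy stand-in for the pose estimation model.
    return [(0, 0) for _ in humans]


# 2 frameworks x 2 devices x 4 boxsizes
test_grid = list(itertools.product(FRAMEWORKS, DEVICES, BOXSIZES))
print(len(test_grid), "test combinations")

# 300 dummy frames, matching ten seconds of video at 30 fps.
averages = benchmark(range(300), fake_detector, fake_pose)
print(sorted(averages))
```

In a real run, `benchmark` would be called once per entry of `test_grid`, with the models loaded for that framework, device and boxsize.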

Performance comparison

In terms of performance, the Caffe model (remember, the original release) does slightly better than its sibling implementation in TensorFlow. The following figure shows the average times for human detection, pose estimation and full prediction depending on the boxsize.

And here are the tabulated results for the same tests.

Human detection times (ms)

	96 px	128 px	192 px	320 px
CPU - TensorFlow	215	378	846	2385
CPU - Caffe	328	559	1230	3378
GPU - TensorFlow	34	40	60	144
GPU - Caffe	23	28	50	153

Pose estimation times (ms)

	96 px	128 px	192 px	320 px
CPU - TensorFlow	315	588	1335	4002
CPU - Caffe	270	451	1028	3058
GPU - TensorFlow	71	94	133	312
GPU - Caffe	26	33	48	156

Full inference times (ms)

	96 px	128 px	192 px	320 px
CPU - TensorFlow	473	944	1841	5659
CPU - Caffe	580	1030	2056	6039
GPU - TensorFlow	119	165	204	489
GPU - Caffe	73	94	129	368

After taking a look at the results, the first thing that stands out is the large difference between CPU and GPU-accelerated inference. With TensorFlow, using CUDA and CuDNN makes the complete inference roughly 4 to 12 times faster depending on the boxsize (around 10 times for the larger ones), while the Caffe model makes predictions roughly 8 to 16 times faster (around 15 times for the larger ones). It’s worth noting that while the TensorFlow model is slightly faster than the Caffe one without GPU-based acceleration, Caffe performs better when the GPU is used: around 1.5 times faster (between 1.3 and 1.8 depending on the boxsize).

If we compare human detection and pose estimation times, the second one generally takes longer, with Caffe on CPU as the exception. And if we sum both times for the GPU runs and compare them with the full inference times, we can see that there’s a little overhead introduced when processing frames, drawing limbs… but it doesn’t seem worrying, at least for now.

In a nutshell, we get a great improvement with the GPU, and Caffe performs a little faster than TensorFlow. With a boxsize of 192 px, which gives nice qualitative results, the Caffe model can make pose estimations at about 7-10 frames per second.
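
The ratios discussed above can be recomputed directly from the full inference times table. The short Python sketch below just does that arithmetic; the numbers are copied from the table, while the function and variable names are my own for illustration.

```python
BOXSIZES = (96, 128, 192, 320)

# Full inference times (ms), copied from the table above.
FULL_MS = {
    ("cpu", "tensorflow"): (473, 944, 1841, 5659),
    ("cpu", "caffe"): (580, 1030, 2056, 6039),
    ("gpu", "tensorflow"): (119, 165, 204, 489),
    ("gpu", "caffe"): (73, 94, 129, 368),
}


def gpu_speedup(framework):
    """CPU time divided by GPU time, one value per boxsize, rounded to 1 decimal."""
    cpu = FULL_MS[("cpu", framework)]
    gpu = FULL_MS[("gpu", framework)]
    return [round(c / g, 1) for c, g in zip(cpu, gpu)]


def caffe_vs_tensorflow_on_gpu():
    """TensorFlow GPU time divided by Caffe GPU time, one value per boxsize."""
    tf = FULL_MS[("gpu", "tensorflow")]
    caffe = FULL_MS[("gpu", "caffe")]
    return [round(t / c, 1) for t, c in zip(tf, caffe)]


def fps(ms_per_frame):
    """Convert an average time per frame in milliseconds to frames per second."""
    return 1000.0 / ms_per_frame


for framework in ("tensorflow", "caffe"):
    print(framework, "GPU speedup:", dict(zip(BOXSIZES, gpu_speedup(framework))))

print("Caffe vs TensorFlow on GPU:", dict(zip(BOXSIZES, caffe_vs_tensorflow_on_gpu())))

caffe_192_ms = FULL_MS[("gpu", "caffe")][BOXSIZES.index(192)]
print("Caffe, GPU, 192 px: %.1f fps" % fps(caffe_192_ms))
```

Note that these are ratios of averages over the 300 frames, so they say nothing about how much individual frames vary.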

Qualitative results

Now let’s take a look at the estimated poses. The following video shows comparisons between the Caffe and TensorFlow models (with GPU and boxsize = 192 px) and between different boxsizes (TensorFlow with GPU). Needless to say, the framerate has been adjusted to get a natural-looking video and does not represent real inference times.

As can be seen in the video, it’s difficult to appreciate differences between the poses estimated by the two models. It may be too risky to draw any conclusion without performing a quantitative analysis, but it seems that they have been trained similarly. With regard to the different boxsizes, it’s pretty obvious that bigger boxes lead to better results. A good trade-off between inference time and results is reached with a 192 px boxsize.
