Weeks 20-21: Evaluation methodology
Now that we have integrated different CPM implementations, we must define an evaluation methodology that goes beyond the qualitative comparisons we have already carried out and enables quantitative ones. This methodology must allow fast experimentation so that we can analyze the effect that different techniques, architectures, parameters and learning processes have on CPM performance. It must also be compatible with previous work (same metrics, same datasets).
Datasets
According to the original CPMs paper and other state-of-the-art works, the most popular datasets for human pose estimation are the following:
- Max Planck Institut Informatik (MPII) Human Pose: includes around 20k images from YouTube containing more than 40k people. Annotations include the person's center and scale within the image, 16 body joint coordinates, occlusions and activities.
- Leeds Sports Pose (LSP): includes 2k images from Flickr, each containing a single person performing a sport. Annotations include 14 body joint coordinates. It does not provide any information about human scale or location, since subjects are already roughly centered and scaled in the images.
- Frames Labeled in Cinema (FLIC): includes 5k images from popular Hollywood films, each containing a single person. Annotations are provided only for upper-body joints.
Because of its number of samples, as well as its exhaustive annotations, we are going to start working with the MPII Human Pose dataset. Besides, it is the main dataset used for training in the original paper (although they also train a model on MPII and LSP combined).
Metrics
The most commonly used metrics for evaluating models on the datasets just mentioned are:
- Percentage of Correct Parts (PCP): a body part (limb) is considered correctly detected if its segment endpoints lie within half of the length of the ground-truth segment from their annotated location (Ferrari et al., 2008). It is used when the algorithm estimates body limbs instead of joints.
- Percentage of Correct Keypoints (PCK): a keypoint (joint) is considered correctly detected if it lies within a certain distance from its annotated location (Yang and Ramanan, 2013). That distance is usually related to the size of the human in the image.
- PCKh: the same metric as PCK, but with the acceptance distance set to half of the length of the subject's head segment. This is the evaluation measure used by the MPII Human Pose dataset, so it is the one we are going to use as well; a minimal sketch of how it can be computed follows this list.
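Since PCKh is the metric we will be reporting, here is a minimal NumPy sketch of how it can be computed. The argument names and array shapes are my own convention, not part of the MPII toolkit:

```python
import numpy as np

def pckh(pred, gt, head_sizes, visible, alpha=0.5):
    """Per-joint PCKh with a head-normalized acceptance distance.

    pred, gt:   arrays of shape (n_samples, n_joints, 2) with (x, y) coordinates.
    head_sizes: array of shape (n_samples,) with the ground-truth head segment
                length of each person.
    visible:    boolean mask of shape (n_samples, n_joints); only annotated
                joints count towards the metric.
    alpha:      fraction of the head size used as acceptance distance
                (0.5 gives the standard PCKh@0.5).
    """
    # Euclidean distance between prediction and annotation for every joint
    dists = np.linalg.norm(pred - gt, axis=2)
    # A joint is correct if its distance is below alpha * head size
    correct = dists <= alpha * head_sizes[:, None]
    # Fraction of correct detections per joint, over annotated joints only
    return (correct & visible).sum(axis=0) / visible.sum(axis=0)
```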
Implementation
Aided by the original CPMs paper and its repo, the dataset and evaluation toolkit provided by MPII, and the eval-mpii-pose repo by @anibali, I have managed to evaluate the Caffe model with the following procedure:
- Reading annotations. The most direct option would be to parse the annotations provided on the MPII dataset webpage, as samples are already divided into train and test. However, I have found out that test samples have no joint location annotations, because MPII withholds them for its official evaluation. The only option left is to divide the train samples into train/validation/test subsets. This has already been done, and HDF5 files for these subsets are provided in the eval-mpii-pose repo.
- Preprocessing. For this task I have revisited the original paper and repo. In a nutshell, each person in the image is cropped, based on their annotated center and scale, into a patch of size ``boxsize x boxsize`` (see the sketch after this list).
- Estimation. Human pose estimation is performed on each cropped patch with the CPM model.
- Storing results. The model regresses the coordinates of 14 body joints. These coordinates are reordered to match the MPII evaluation specification and then stored in an HDF5 file.
- Evaluation. Once all the predictions have been stored in the HDF5 file, the evaluation is performed by a Matlab script provided in the eval-mpii-pose repo, which is based on the MPII evaluation toolkit. It computes and stores the PCKh metric for each joint and plots the results.
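As a rough illustration of the first four steps, here is a simplified Python sketch of the loop. The HDF5 paths and field names, as well as `estimate_pose` and `reorder_to_mpii`, are placeholders of my own; the real layout is defined by the eval-mpii-pose repo and the CPM code, and the cropping is a simplified version of the preprocessing in the original repo:

```python
import cv2
import h5py
import numpy as np

BOXSIZE = 368          # input size of the CPM model
N_MPII_JOINTS = 16     # MPII annotates 16 joints (assumed layout of the predictions file)

def crop_person(image, center, scale, boxsize=BOXSIZE):
    """Crop and resize one person to a boxsize x boxsize patch."""
    # MPII scales are given with respect to a 200 px person height
    factor = boxsize / (scale * 200.0)
    resized = cv2.resize(image, None, fx=factor, fy=factor)
    cx, cy = int(center[0] * factor), int(center[1] * factor)
    half = boxsize // 2
    # Pad so the crop never falls outside the image
    padded = cv2.copyMakeBorder(resized, half, half, half, half,
                                cv2.BORDER_CONSTANT, value=(128, 128, 128))
    # After padding, (cx + half, cy + half) is the person center,
    # so this slice is a boxsize crop centered on the person
    return padded[cy:cy + boxsize, cx:cx + boxsize]

# Hypothetical file paths and dataset names; the real ones come from eval-mpii-pose
with h5py.File("annot/valid.h5", "r") as annot:
    centers = annot["center"][:]
    scales = annot["scale"][:]
    names = annot["imgname"][:]

preds = np.zeros((len(names), N_MPII_JOINTS, 2))
for i, name in enumerate(names):
    image = cv2.imread(f"images/{name.decode()}")
    crop = crop_person(image, centers[i], scales[i])
    joints = estimate_pose(crop)          # placeholder for the CPM forward pass
    preds[i] = reorder_to_mpii(joints)    # placeholder: map the 14 predicted
                                          # joints to the MPII joint ordering

# Predictions file consumed later by the Matlab evaluation script
with h5py.File("pred/cpm-preds.h5", "w") as out:
    out.create_dataset("preds", data=preds)
```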
The following images show the Caffe model's results on the validation subset using different boxsizes.
I am having trouble with TensorFlow, so as soon as I manage to solve it I will evaluate the TF model as well and discuss the results properly. Regarding the Caffe results, I will just say that they are slightly worse than those reported in the original paper, which could be explained by a refinement process that the authors used. It is also important to remember that the samples we have used for evaluation have probably been seen by the model during training, so that is a problem we still have to solve.