This week I tested the Mask R-CNN with different image sizes to reduce the execution time of a segmentation. The minimum size allowed by the net that obtains results and mantains the original aspect ratio is 540x404. The execution time ranges now between 23 and 25 seconds (using CPU).
I also improved the architecture of the program with 4 branches running independently: Camera, GUI, Net and Tracker. The temporary tracker uses a multitracker implemented in OpenCV (https://github.com/opencv/opencv_contrib/blob/master/modules/tracking/samples/multitracker.py). The continuous mode works at the moment with the detections (and segmentations) given by the Mask R-CNN (after a considerably amount of time) followed by the tracking-by-detection. In future improvements I expect to use a GPU to accelerate the detections.
In this video, the current functionality of the component can be seen: