This work proposes a novel training framework for the activity recognition problem based on a pose estimation intermediate step, and convolutional neural networks (CNNs). The pose estimation problem is formulated as a CNN-based regression problem towards body joints. A CNN regressor is used to generate high precision pose estimates of images before performing activity recognition. A mask over the image of joint location estimates serves as a fourth layer stacked onto the RGB image. Using image + mask, this approach has the advantage of reasoning about the image subject’s pose before attempting to classify the activity that the subject is performing. This framework benefits from the intermediate mask-generation step by forcing the model to interpret the subject’s pose, and discourages the model to classify based on potentially unrelated aspects. This repo presents a robust proof-of-concept of this pose-augmented action recognition framework on diverse real-world images.
This work was done for Caltech's CS 148 Computer Vision course with Prof. Pietro Perona. The link to the associated report is https://github.com/mzhao98/pose-estimation/blob/master/report/CS148__Pose_Estimation.pdf.