Do action categories share parts (features)?

Hueihan Jhuang

2008/9/17 in 6.870

Introduction

Different object categories share parts (for example, chairs and tables both have four legs), which motivates feature sharing, or joint boosting, for training object detectors across multiple categories [1]. Do humans treat actions as a union of "motion templates"? Do different actions also share parts? We try to answer this question by applying joint boosting to a set of 9 human actions.

The Algorithm (see [1] for details)

At each round of boosting, instead of choosing a weak learner

that minimizes the loss function

joint boosting minimizes the loss function

using a weak learner of the form

where S(n) is the subset of category labels (the binary weak learner treats the labels that belong to S(n) as positive and all others as negative). At each round of joint boosting, the algorithm searches over all the features and over the possible subsets of labels, of which there are on the order of 2^C for C categories.
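The four expressions that the sentences above refer to are not reproduced in this text. In the notation of [1] they can be written roughly as follows. This is a reconstruction under the assumption that the regression-stump formulation of [1] is used, not a copy of the original equations. Here v_i is the feature vector of sample i, v^f its f-th feature, z_i^c in {-1, +1} says whether sample i belongs to class c, and w_i^c = exp(-z_i^c H(v_i, c)) are the boosting weights for the current strong classifier H.

```latex
% Standard boosting: one weak learner per class, fitted independently
\begin{align*}
h_m(v) &= a\,\delta(v^f > \theta) + b \\
J &= \sum_{i=1}^{N} w_i \bigl(z_i - h_m(v_i)\bigr)^2
\end{align*}

% Joint boosting: one weak learner per round, shared by the classes in S(n)
\begin{align*}
J_{wse} &= \sum_{c=1}^{C} \sum_{i=1}^{N} w_i^c \bigl(z_i^c - h_m(v_i, c)\bigr)^2 \\
h_m(v, c) &=
\begin{cases}
a_S & \text{if } v^f > \theta \text{ and } c \in S(n) \\
b_S & \text{if } v^f \le \theta \text{ and } c \in S(n) \\
k^c & \text{if } c \notin S(n)
\end{cases}
\end{align*}
```

The constant k^c lets classes outside S(n) absorb their own label imbalance, so that a shared stump is not penalized for classes it does not try to separate.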

 

Dataset

This set [2] contains 9 actors performing 9 actions (bend, jumping jack, jump, jump in place, run, gallop sideways, walk, one-hand wave, two-hands wave). There are 81 clips in total, each with a resolution of 120x160 pixels and 50-100 frames.

Experiment

Features: We use the C2 features from [3]. 300 random S2 patches of sizes 4x4, 8x8 and 12x12 are extracted from the 9 actions of the first actor, which leads to a 300-element feature vector for each frame.
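As a rough illustration of how a fixed set of random patches turns each frame into a fixed-length vector, the sketch below samples square patches from lower-level response maps and takes, for each patch, the best match over all positions in a frame. It is not the implementation of [3]: the names `c1_maps`, `sample_patches`, `c2_vector` and `sigma` are hypothetical, and the Gaussian similarity is one plausible choice of matching function assumed here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def sample_patches(c1_maps, n_patches=300, sizes=(4, 8, 12), seed=0):
    """Cut random square patches from a list of (H, W, K) response maps."""
    rng = np.random.default_rng(seed)
    patches = []
    for _ in range(n_patches):
        m = c1_maps[rng.integers(len(c1_maps))]
        s = sizes[rng.integers(len(sizes))]
        r = rng.integers(m.shape[0] - s + 1)
        c = rng.integers(m.shape[1] - s + 1)
        patches.append(m[r:r + s, c:c + s].copy())
    return patches


def c2_vector(c1_map, patches, sigma=1.0):
    """One feature per patch: the best match over all positions in the map."""
    feats = np.empty(len(patches))
    for j, p in enumerate(patches):
        s = p.shape[0]
        # All s x s windows of the map: shape (H-s+1, W-s+1, K, s, s).
        windows = sliding_window_view(c1_map, (s, s), axis=(0, 1))
        template = np.moveaxis(p, -1, 0)  # (K, s, s), matches the window layout
        d2 = ((windows - template) ** 2).sum(axis=(2, 3, 4))
        # Assumed similarity: Gaussian of the squared distance. Its maximum
        # over positions is reached where the distance is smallest.
        feats[j] = np.exp(-d2.min() / (2.0 * sigma ** 2))
    return feats
```

With 300 patches, `c2_vector` returns a 300-element vector per frame, which is the representation the boosting stage consumes. Taking the maximum over positions makes the feature insensitive to where in the frame the pattern occurs.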

Training/Testing set: We take all 9 actions of n actors, chosen from actors 2-8, as the training set, and those of the remaining 8-n actors as the testing set. In the training stage, we use 20 training samples for each category. Assuming that intra-class variation increases as the training set contains actions from more actors, we vary n to compare the performance of joint boosting with that of standard boosting.
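The split described above is by actor, so that no actor appears in both sets. A minimal sketch, assuming the candidate actors are passed in as a list (the helper name `split_by_actor` and the random choice of training actors are illustrative assumptions; the text does not say how the n actors are picked):

```python
import random


def split_by_actor(pool, n, seed=0):
    """Pick n training actors from the pool; the rest become test actors."""
    rng = random.Random(seed)
    train = sorted(rng.sample(pool, n))
    test = [actor for actor in pool if actor not in train]
    return train, test


if __name__ == "__main__":
    # Assumed pool of eight actors, so that 8 - n actors remain for testing.
    pool = list(range(2, 10))
    for n in (1, 2, 3, 4):
        train, test = split_by_actor(pool, n)
        print(n, train, test)
```

Keeping the first actor out of the pool avoids evaluating on the clips that the feature patches were cut from.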

Joint boosting: The number of boosting rounds (weak learners) is set to 50. We restrict sharing to two-pairs (subsets of 1 to 2 classes) and three-pairs (subsets of 1 to 3 classes).
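The restricted search can be sketched as follows. This is a small illustrative version written for this page, assuming regression stumps and the weighted squared error of [1]; it is not the sample code mentioned below, and the function names and the `max_share` parameter are hypothetical. Setting `max_share=1` gives class-specific (no sharing) boosting, 2 gives two-pairs, and 3 gives three-pairs.

```python
from itertools import combinations
import numpy as np


def fit_round(X, Z, W, max_share):
    """Search features, thresholds and label subsets of size <= max_share."""
    n_classes = Z.shape[1]
    # Constant output k^c for classes outside the subset, and its error.
    k = (W * Z).sum(axis=0) / W.sum(axis=0)
    const_err = (W * (Z - k) ** 2).sum(axis=0)
    best = None
    for size in range(1, max_share + 1):
        for subset in combinations(range(n_classes), size):
            S = list(subset)
            w = W[:, S].sum(axis=1)               # weights pooled over S
            wz = (W[:, S] * Z[:, S]).sum(axis=1)  # weighted labels pooled over S
            outside = const_err.sum() - const_err[S].sum()
            for f in range(X.shape[1]):
                x = X[:, f]
                for theta in np.unique(x)[:-1]:   # both sides stay non-empty
                    hi = x > theta
                    a = wz[hi].sum() / w[hi].sum()
                    b = wz[~hi].sum() / w[~hi].sum()
                    h = np.where(hi, a, b)[:, None]
                    err = (W[:, S] * (Z[:, S] - h) ** 2).sum() + outside
                    if best is None or err < best[0]:
                        best = (err, S, f, theta, a, b, k)
    return best[1:]


def stump_output(X, learner):
    """Evaluate one shared stump for every sample and every class."""
    S, f, theta, a, b, k = learner
    h = np.tile(k, (X.shape[0], 1))
    h[:, S] = np.where(X[:, f] > theta, a, b)[:, None]
    return h


def joint_boost(X, y, n_classes, rounds=50, max_share=2):
    """Fit `rounds` shared stumps; y holds integer class labels."""
    Z = np.where(y[:, None] == np.arange(n_classes), 1.0, -1.0)
    W = np.ones_like(Z)
    learners = []
    for _ in range(rounds):
        learner = fit_round(X, Z, W, max_share)
        W *= np.exp(-Z * stump_output(X, learner))  # reweight hard samples
        learners.append(learner)
    return learners


def predict(learners, X, n_classes):
    """Sum the stump outputs and return the highest-scoring class."""
    H = np.zeros((X.shape[0], n_classes))
    for learner in learners:
        H += stump_output(X, learner)
    return H.argmax(axis=1)
```

The exhaustive loop over thresholds is written for clarity rather than speed. Capping the subset size keeps the number of candidate subsets small: with 9 classes there are 45 subsets of size 1 to 2 and 129 of size 1 to 3, compared with 511 non-empty subsets in total.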

Result

(a) No sharing, two pairs, three pairs, for n = 1, 2, 3, 4 (n = # of actors in training set)

(b) Visualization of shared features

With one actor in the training set, shared features perform slightly better than no-sharing (class-specific) features, and two-pair and three-pair sharing behave about the same. As we increase the number of actors in the training set, the relative performance of the three methods (no sharing, two pairs, three pairs) does not change dramatically. The visualization of shared features shows that there exist features that can be shared among action classes, but their effect on the overall performance is marginal.

(sample code of using joint boosting, without code for building features...)

References

[1] Sharing visual features for multiclass and multiview object detection
A. Torralba, K. P. Murphy and W. T. Freeman
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 5, pp. 854-869, May 2007.

[2] http://www.wisdom.weizmann.ac.il/~vision/SpaceTimeActions.html#Database

[3] A biologically inspired system for action recognition
H. Jhuang, T. Serre, L. Wolf and T. Poggio
ICCV, 2007.