MuHAVi: A Multicamera Human Action Video Dataset


1. Introduction

As part of the EPSRC-funded REASON project, a large body of human action video, called MuHAVi, has been collected using multiple cameras in a challenging environment (uneven background and night-time street-light illumination). The raw images in the dataset can be used for different types of human action recognition methods (depending on what type of image features are used), as well as for evaluating robust object segmentation algorithms. The dataset complements similar efforts such as the CMU Motion Database [1] and HumanEva [2], both mostly aimed at pose recovery with motion-capture ground truth. The closest dataset is IXMAS [3], another challenging multi-view collection. MuHAVi concentrates on CCTV-like views (at an angle and at some distance from the observed people) under real street-light illumination. Manually annotated silhouettes have been produced specifically for evaluating silhouette-based human action recognition (SBHAR) methods.

For action recognition algorithms that are based purely on human silhouettes, i.e. where other image properties such as color and intensity are not necessarily used, it is important to have accurate silhouette data for the video frames. This problem is usually treated not as part of action recognition but as a lower-level problem in change detection and motion tracking. Hence, for researchers working at the recognition level, access to reliable manually annotated silhouette data is a major bonus: the comparison of action recognition algorithms is not distorted by differences in segmentation approaches. Nevertheless, because the silhouettes are simply masks that define the foreground, appearance-based methods for action recognition can also be evaluated on this dataset. Furthermore, for researchers interested in object segmentation, the manually annotated silhouettes provide a useful ground truth against which to evaluate their algorithms. Finally, because the data is multi-camera, it may also be useful to researchers working on 3D reconstruction, e.g. using space-carving methods [4], from which silhouettes can be generated by projection. In short, the dataset serves several purposes.

Human Action Recognition Using Silhouette Histogram

Chaur-Heh Hsieh, *Ping S. Huang, and Ming-Da Tang

Department of Computer and Communication Engineering

Ming Chuan University

Taoyuan 333, Taiwan, ROC

*Department of Electronic Engineering

Ming Chuan University

Taoyuan 333, Taiwan, ROC

Proposed Method

The proposed system includes four main processes, as shown in Figure 1. First, the human silhouette is extracted from the input video by a background-subtraction method; the MuHAVi dataset can also be used directly as the source of silhouette data. Then, the extracted silhouette is mapped into three polar coordinate systems that characterize three parts of the human figure. The largest circle covers the motion of the whole body, while the other two circles capture the effect of the arms and the legs on the action, which is why their centres lie between the shoulders and between the hips, respectively. Each polar coordinate system is quantized by partitioning it into several cells with different radii and angles. By counting the number of silhouette pixels falling into each cell at a particular frame, the silhouette histogram of that frame is obtained. Collecting the silhouette histograms over a sequence of frames yields a descriptor of the video clip that is used to describe the human action. Based on this silhouette histogram descriptor, an action classifier is trained and then used to recognize the action in an input video clip.

Figure 1. The four main processes of the proposed system.
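As a rough sketch of how these four stages fit together, the following Python snippet stacks the per-frame histograms of a clip and passes them to a trained classifier. The helper silhouette_histogram and the classifier object are assumptions here; concrete sketches of the individual steps are given in the subsections below.

```python
import numpy as np

def clip_descriptor(masks, silhouette_histogram):
    """Stack the per-frame silhouette histograms of a clip into one
    descriptor sequence (one row per frame)."""
    return np.vstack([silhouette_histogram(mask) for mask in masks])

def recognize_action(masks, silhouette_histogram, classifier):
    """Classify a clip from its sequence of silhouette histograms.
    `classifier` is any trained model exposing predict() (assumption)."""
    descriptor = clip_descriptor(masks, silhouette_histogram)
    return classifier.predict(descriptor)
```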

1- Silhouette Extraction

The MuHAVi dataset is used, which contains manually annotated silhouettes for the different action classes.
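When the manually annotated MuHAVi silhouettes are not used directly, a foreground mask can be obtained with a standard background-subtraction routine. The sketch below is one possible implementation using OpenCV's MOG2 subtractor; the video path, threshold, and kernel size are illustrative assumptions rather than values from the paper.

```python
import cv2

def extract_silhouettes(video_path, history=200, var_threshold=16):
    """Yield one binary silhouette mask per frame using MOG2 background subtraction."""
    cap = cv2.VideoCapture(video_path)
    subtractor = cv2.createBackgroundSubtractorMOG2(history=history,
                                                    varThreshold=var_threshold,
                                                    detectShadows=True)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        fg = subtractor.apply(frame)
        # MOG2 marks shadows with value 127; keep only confident foreground.
        _, mask = cv2.threshold(fg, 200, 255, cv2.THRESH_BINARY)
        # Morphological opening/closing to remove speckle noise and fill small holes.
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
        mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
        yield mask
    cap.release()
```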

2- Polar Transform

To describe the human shape effectively, the Cartesian coordinates are transformed into polar coordinates through the following equations:

$r_i = \sqrt{(x_i - x_c)^2 + (y_i - y_c)^2}$, $\quad \theta_i = \tan^{-1}\!\left(\dfrac{y_i - y_c}{x_i - x_c}\right)$

where (x_i, y_i) are the coordinates of a silhouette pixel in the Cartesian coordinate system, (r_i, θ_i) are the corresponding radius and angle in the polar coordinate system, and (x_c, y_c) is the centre of the silhouette. The centre of the silhouette can be calculated by

$x_c = \dfrac{1}{N}\sum_{i=1}^{N} x_i$, $\quad y_c = \dfrac{1}{N}\sum_{i=1}^{N} y_i$

where N is the total number of pixels.
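A minimal NumPy sketch of this transform, assuming the silhouette is given as a binary mask whose nonzero pixels are foreground:

```python
import numpy as np

def silhouette_centre(mask):
    """Centroid (x_c, y_c) of the foreground pixels of a binary mask."""
    ys, xs = np.nonzero(mask)          # row (y) and column (x) indices
    return xs.mean(), ys.mean()

def to_polar(mask):
    """Map every foreground pixel to polar coordinates (r_i, theta_i)
    about the silhouette centre."""
    xc, yc = silhouette_centre(mask)
    ys, xs = np.nonzero(mask)
    dx, dy = xs - xc, ys - yc
    r = np.hypot(dx, dy)
    theta = np.arctan2(dy, dx)         # angle in (-pi, pi]
    return r, theta
```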

Existing approaches often use a single polar coordinate system to describe the human posture. However, our investigation indicates that a single coordinate system is not sufficient to discriminate between postures with small differences. In this work, we therefore use three polar coordinate systems (three circles), defined as follows:

C1: Circle that encloses the whole human body.

C2: Circle that encloses the upper part of a body.

C3: Circle that encloses the lower part of a body.

Consider, for example, wave1 (one-hand waving) and wave2 (two-hand waving), as shown in Figure 2. The silhouette histograms obtained with C1 and C2 are shown in Figure 3 and Figure 4, respectively. The two histograms from C1 are very similar, so the discriminability between the two action types is poor. In contrast, the two histograms from C2 show much better discriminability.

3- Histogram Computation

The procedures for calculating the silhouette histogram can be organized into the following steps.

i- First, compute the centre of the human silhouette and divide the silhouette into an upper part and a lower part at the centre position. Then compute the centres of the upper and lower silhouettes individually. These three centre positions are taken as the origins of the respective polar coordinate systems.

ii- Second, compute the heights of all human silhouettes, which are used to calculate the radius of C1. The radii of C2 and C3 are half the radius of C1.

iii- Third, compute the three histograms separately for each human silhouette, as illustrated in the sketch below.
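The following sketch implements steps i to iii for a single binary silhouette mask. The numbers of radial and angular bins, and the choice of half the silhouette height as the radius of C1, are assumptions made for illustration rather than parameters specified in the text.

```python
import numpy as np

def polar_histogram(mask, centre, radius, n_r=5, n_theta=12):
    """Histogram of foreground pixels over an n_r x n_theta polar grid
    centred at `centre`; pixels beyond `radius` fall into the outermost ring."""
    ys, xs = np.nonzero(mask)
    dx, dy = xs - centre[0], ys - centre[1]
    r = np.hypot(dx, dy)
    theta = np.arctan2(dy, dx) + np.pi                      # shift to [0, 2*pi)
    r_bin = np.clip((r / radius * n_r).astype(int), 0, n_r - 1)
    t_bin = np.clip((theta / (2 * np.pi) * n_theta).astype(int), 0, n_theta - 1)
    hist = np.zeros((n_r, n_theta))
    np.add.at(hist, (r_bin, t_bin), 1)
    return hist.ravel()

def silhouette_histogram(mask, n_r=5, n_theta=12):
    """Concatenated histograms for C1 (whole body), C2 (upper part), C3 (lower part)."""
    ys, xs = np.nonzero(mask)
    xc, yc = xs.mean(), ys.mean()                           # step i: whole-body centre
    height = ys.max() - ys.min() + 1                        # step ii: silhouette height
    r1 = height / 2.0                                       # assumption: C1 radius = half the height
    upper = mask.copy()
    upper[int(yc):, :] = 0                                  # keep rows above the centre
    lower = mask.copy()
    lower[:int(yc), :] = 0                                  # keep rows below the centre
    yu, xu = np.nonzero(upper)
    yl, xl = np.nonzero(lower)
    h1 = polar_histogram(mask,  (xc, yc), r1, n_r, n_theta)
    h2 = polar_histogram(upper, (xu.mean(), yu.mean()), r1 / 2, n_r, n_theta)
    h3 = polar_histogram(lower, (xl.mean(), yl.mean()), r1 / 2, n_r, n_theta)
    return np.concatenate([h1, h2, h3])                     # step iii: one descriptor per frame
```

Stacking these per-frame histograms over a clip yields the descriptor sequence used to train the action classifier.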


Human action recognition using shape and CLG-motion flow from multi-view image sequence

Mohiuddin Ahmad, Seong-Whan Lee

Introduction

Recognition of human actions from multi-view image sequences is very popular in the computer vision community, since it has applications in video surveillance and monitoring, human–computer interaction, model-based compression, augmented reality, and so on. Existing methods of human action recognition can be categorized by the image properties they rely on, such as motion-based, shape-based, and gradient-based approaches. Several human action recognition methods have been proposed in the last few decades, and detailed surveys discuss the different methodologies for recognizing human actions and movements. Based on these reviews, researchers use either human body shape information or motion information, with or without a body shape model, for action recognition. The approach described here can be considered a combination of shape- and motion-based representations that does not use any prior body shape model.

One standard approach to human action recognition is to extract a set of features from each frame of the image sequence and use these features to train classifiers and to perform recognition. It is therefore important to ask: which features are robust for action recognition under critical conditions or varying environments? Usually, there is no rigid syntax or well-defined structure available for human action recognition. Moreover, several sources of variability can affect recognition, such as variation in speed, viewpoint, size and shape of the performer, and the phase of the action, and the motion of the human body is non-rigid in nature. These characteristics make human action recognition a challenging task. Considering these circumstances, the authors consider several issues that affect the development of action models and classifiers, as follows:

•The trajectory of an action is different when seen from different viewing directions; some body parts (part of a hand, the lower part of a leg, part of the body, etc.) are occluded as the view changes, as shown in Fig. 6. Fig. 6 represents human action using shape and motion sequences with multiple views: (a) multiple-view variation of an action; (b) shape sequences (walking, raising the right hand, and bowing); (c) motion sequences (walking, raising the right hand, and bowing). The motion distribution is different for each action.

•An action can be viewed as a series of silhouette images of the human body (Fig. 6(b)). The silhouette information does not depend on translation, rotation, or scaling, and the silhouette sequence of an action is invariant to the speed at which it is performed.

•An action can also be viewed through the motion or velocity of the human body parts (Fig. 6(c)). A simple action involves the motion of a small number of body parts, while a complex action involves the motion of the whole body. The motion is non-rigid in nature.

•Human action depends on anthropometry, the way the action is performed, phase variation (the starting and ending times of the action), scale variation of the action, and so on.

Proposed Method

Fig. 7. Flow diagram of the proposed method.

Fig. 7 shows a flow diagram of the proposed method. In the preprocessing step, the foreground is extracted using background modeling, shadow elimination, and morphological operations. From the foreground image, the velocity of an action is estimated using combined local–global (CLG) optical flow. Global shape-flow features are extracted from the silhouette image sequence; the shape flow represents the flow deviation and invariant moments. A modified Zernike moment, which is robust to noise and invariant to scale, rotation, and translation, is used to reduce noise and to normalize the action data spatially. Motion features are extracted with respect to the same centre of mass (CM) of the corresponding silhouette image. The combined features are then fed to a multidimensional hidden Markov model (MDHMM). In the classification stage, an unknown sequence is matched against each model by computing the probability that the MDHMM could generate that sequence; the MDHMM with the highest probability most likely generated it.
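The CLG optical-flow and modified Zernike-moment computations are not reproduced here. As an illustrative stand-in, the sketch below estimates a dense flow field with OpenCV's Farneback method and summarizes it over the foreground as a magnitude-weighted histogram of flow directions; the function name and bin count are assumptions, not the paper's exact features.

```python
import cv2
import numpy as np

def motion_features(prev_gray, gray, mask, n_bins=8):
    """Per-frame motion feature: a magnitude-weighted histogram of flow
    directions over the foreground mask. Farneback dense flow is used here
    as a stand-in for CLG optical flow."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2,
                                        flags=0)
    fg = mask > 0
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])   # angle in radians
    hist, _ = np.histogram(ang[fg], bins=n_bins, range=(0, 2 * np.pi),
                           weights=mag[fg])
    # Normalize so the feature does not depend on silhouette size.
    return hist / (hist.sum() + 1e-8)
```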


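A minimal classification sketch in the spirit of the description above, using hmmlearn's GaussianHMM as a stand-in for the multidimensional HMM: one model is fitted per action class, and an unknown sequence is assigned to the class whose model gives it the highest log-likelihood. The per-frame feature arrays are assumed inputs.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM   # stand-in for the paper's MDHMM

def train_action_models(sequences_by_class, n_states=5):
    """Fit one Gaussian HMM per action class.
    sequences_by_class: dict mapping class label -> list of (T_i, D) feature arrays."""
    models = {}
    for label, seqs in sequences_by_class.items():
        X = np.vstack(seqs)                  # concatenate all training sequences
        lengths = [len(s) for s in seqs]     # per-sequence lengths for hmmlearn
        model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        model.fit(X, lengths)
        models[label] = model
    return models

def classify(models, sequence):
    """Assign the unknown sequence to the model with the highest log-likelihood."""
    scores = {label: m.score(sequence) for label, m in models.items()}
    return max(scores, key=scores.get)
```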
