Human vision can recognize a wide variety of objects with great accuracy from minimal motion signals such as frame differences and point-light displays. Replicating this capability on a computer has been difficult. However, such an artificial system would be immensely useful, as it would bring portable, cost-effective solutions to surveillance, gaming, and robotics, to name a few. Thus, it is important from both theoretical and practical perspectives to design an algorithm that can 1) extract sparse representations of motion signals, 2) group them into coherent spatio-temporal patterns, and 3) interpret the underlying activities. Most currently available approaches require special sensors (such as range or stereo sensors), background models, and/or foreground models of specific targets such as pedestrians and cars. These requirements simplify the problem but severely limit the applicability of the approaches. In this open communication, we outline our recent efforts toward the first two goals without any special hardware or background model.
To extract sparse representations from video frames, we calculate motion signals by frame differences, reduce the motion signals of each frame to a sparse dot pattern by subsampling them in 3x3 windows, and finally approximate the dot pattern with skeletons using the following graph algorithm. First, the dot pattern is clustered into connected components. For each component, a Delaunay triangulation graph (G) is derived. Edges longer than the block size (3 pixels) are then removed from G. Finally, the skeleton of the component is derived as a longest shortest path in G, that is, the longest among the shortest paths between all pairs of nodes. To test whether the reduction preserved important information, we recorded 5 movies of various farm animals (cat, chicken, dog, geese, and llama) and 2 movies of humans imitating animals. We reduced each movie to the sparse representation and presented it to human volunteers, who were asked to pick an animal (including human) from a list of 12. The recognition rate for the sparse representation was 76% (n=30), while the recognition rate for the frame differences was 84% (n=31).
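The extraction steps above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: all function names, the difference threshold, and the placement of each dot at its block centre are assumptions made for illustration. To stay dependency-free, the sketch also replaces the Delaunay triangulation followed by edge pruning with a simple proximity graph that links every pair of dots at most one block size apart (a superset of the pruned Delaunay edges), and it defines connected components on that graph.

```python
import heapq
from collections import deque
from math import hypot

def frame_difference_dots(prev, curr, threshold=20, block=3):
    """One dot (block centre) per block x block window that contains motion."""
    h, w = len(curr), len(curr[0])
    dots = []
    for by in range(0, h, block):
        for bx in range(0, w, block):
            moving = any(abs(curr[y][x] - prev[y][x]) > threshold
                         for y in range(by, min(by + block, h))
                         for x in range(bx, min(bx + block, w)))
            if moving:
                dots.append((bx + block // 2, by + block // 2))
    return dots

def proximity_graph(dots, max_len=3.0):
    """Weighted adjacency map; assumed stand-in for Delaunay edges <= max_len."""
    adj = {i: {} for i in range(len(dots))}
    for i, (xi, yi) in enumerate(dots):
        for j in range(i + 1, len(dots)):
            d = hypot(xi - dots[j][0], yi - dots[j][1])
            if d <= max_len:
                adj[i][j] = adj[j][i] = d
    return adj

def connected_components(adj):
    """Breadth-first search over the graph, one node list per component."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        comp, queue = [], deque([start])
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        comps.append(comp)
    return comps

def shortest_paths(adj, source):
    """Dijkstra: distances and parent pointers from one source node."""
    dist, parent = {source: 0.0}, {source: None}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], parent[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    return dist, parent

def skeleton(adj, comp):
    """Longest shortest path in one component (Dijkstra from every node)."""
    best_len, best_path = -1.0, []
    for s in comp:
        dist, parent = shortest_paths(adj, s)
        far = max(dist, key=dist.get)
        if dist[far] > best_len:
            best_len, path, node = dist[far], [], far
            while node is not None:
                path.append(node)
                node = parent[node]
            best_path = path[::-1]
    return best_path

def skeletons(prev, curr):
    """Skeletons (lists of dot coordinates) for one pair of grayscale frames."""
    dots = frame_difference_dots(prev, curr)
    adj = proximity_graph(dots)
    return [[dots[i] for i in skeleton(adj, comp)]
            for comp in connected_components(adj)]
```

Calling `skeletons(prev, curr)` on two grayscale frames, given as lists of pixel rows, returns one list of dot coordinates per connected component. Running Dijkstra from every node is quadratic or worse in the number of dots, which is tolerable here only because each component is already sparse.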
To group the skeletons within and across frames, we simultaneously estimate the optimum rigid transformation, the intra-frame grouping, and the inter-frame correspondence by formulating the problem as an Expectation-Maximization (EM) problem. Our preliminary results indicate that the approach is highly accurate and robust against noise and clutter.
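The abstract does not spell out the EM formulation, so the following is only a simplified sketch of the general idea, not the authors' method. It assumes a single group of 2D points, an isotropic Gaussian noise model with a fixed, hand-chosen `sigma`, and no outlier handling. The E-step computes soft inter-frame correspondences, and the M-step solves for the rigid transformation (rotation plus translation) in closed form. Intra-frame grouping is omitted entirely. All names and parameters are hypothetical.

```python
from math import atan2, cos, exp, sin

def apply_rigid(points, theta, tx, ty):
    """Rotate 2D points by theta (radians) and translate by (tx, ty)."""
    c, s = cos(theta), sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def em_rigid_align(src, dst, sigma=2.0, iters=30):
    """Estimate (theta, tx, ty) mapping src onto dst with soft correspondences."""
    theta, tx, ty = 0.0, 0.0, 0.0
    for _ in range(iters):
        moved = apply_rigid(src, theta, tx, ty)

        # E-step: resp[j][i] is the probability that dst[j] matches src[i].
        resp = []
        for qx, qy in dst:
            row = [exp(-((mx - qx) ** 2 + (my - qy) ** 2) / (2 * sigma ** 2))
                   for mx, my in moved]
            total = sum(row) or 1.0  # guard against an all-zero row
            resp.append([r / total for r in row])

        pairs = [(resp[j][i], src[i], dst[j])
                 for j in range(len(dst)) for i in range(len(src))]
        weight = sum(w for w, _, _ in pairs)
        if weight == 0.0:
            break  # no usable correspondences

        # M-step: weighted centroids of the matched source and target points.
        sx = sum(w * p[0] for w, p, _ in pairs) / weight
        sy = sum(w * p[1] for w, p, _ in pairs) / weight
        dx = sum(w * q[0] for w, _, q in pairs) / weight
        dy = sum(w * q[1] for w, _, q in pairs) / weight

        # Closed-form 2D rotation maximizing the weighted alignment.
        num = sum(w * ((p[0] - sx) * (q[1] - dy) - (p[1] - sy) * (q[0] - dx))
                  for w, p, q in pairs)
        den = sum(w * ((p[0] - sx) * (q[0] - dx) + (p[1] - sy) * (q[1] - dy))
                  for w, p, q in pairs)
        theta = atan2(num, den)

        # Translation that maps the rotated source centroid onto the target one.
        c, s = cos(theta), sin(theta)
        tx = dx - (c * sx - s * sy)
        ty = dy - (s * sx + c * sy)
    return theta, tx, ty
```

In this simplified setting, `src` and `dst` would be the skeleton dots of two consecutive frames. A full treatment along the lines described above would additionally need latent variables for group membership and some handling of noise and clutter points.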
In this talk, we will present our algorithms and their results, and propose a future research direction for recognizing animals and humans from the sparse representation without any explicit foreground models.