Motion Fused Frames: Data Level Fusion Strategy for Hand Gesture Recognition

Abstract

Acquiring the spatio-temporal states of an action is the most crucial step for action classification. This paper proposes a data level fusion strategy, Motion Fused Frames (MFFs), designed to fuse motion information into static images so that they better represent the spatio-temporal states of an action.

Introduction

The applied fusion strategy plays a critical role in the performance of multimodal gesture recognition. Different modalities can be fused at the data level, feature level, or decision level.

Feature level and decision level fusion are the most popular strategies, applied by most current CNN-based approaches.

However, they have some drawbacks:

  1. Usually a separate network must be trained for each modality, so the number of trainable parameters is several times that of a single network.
  2. In most cases, pixel-wise correspondences between the modalities cannot be established, since fusion operates only on the classification scores or on the final fully connected layers.
  3. The applied fusion scheme might require complex modifications to the network to obtain good results.

Data level fusion is the most cumbersome strategy, since it requires frame registration, which is a difficult task when the multimodal data is captured by different hardware. However, the drawbacks of feature and decision level fusion disappear inherently:

  1. Training a single network is sufficient, which reduces the number of parameters several times.
  2. Since the modalities are fused at the data level, pixel-wise correspondences are automatically established.
  3. The CNN architecture can be adopted with very little modification.

This paper proposes a data level fusion strategy, Motion Fused Frames (MFFs), using color and optical flow modalities for hand gesture recognition. MFFs are designed to fuse motion information into static images so that they better represent the spatio-temporal states of an action.

An MFF is generated by appending optical flow frames to a static image as extra channels. The appended optical flow frames are calculated from the consecutive frames preceding the selected static image.
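The channel-stacking step can be sketched in a few lines of NumPy. This is a minimal illustration of the fusion idea, not the paper's implementation: the function name, array shapes, and the number of flow frames are assumptions for the example; each flow frame is taken to have two channels (horizontal and vertical displacement).

```python
import numpy as np

def make_mff(rgb_frame, flow_frames):
    """Fuse motion into a static image (a sketch of the MFF idea).

    rgb_frame:   (H, W, 3) array, the selected static image.
    flow_frames: list of (H, W, 2) arrays, optical flow fields computed
                 from the consecutive frames preceding rgb_frame.
    Returns an (H, W, 3 + 2 * len(flow_frames)) array: the flow fields
    are appended to the image as extra channels.
    """
    return np.concatenate([rgb_frame] + list(flow_frames), axis=-1)

# Example: one 224x224 RGB frame fused with 3 preceding flow fields
# (illustrative sizes only).
rgb = np.zeros((224, 224, 3), dtype=np.float32)
flows = [np.zeros((224, 224, 2), dtype=np.float32) for _ in range(3)]
mff = make_mff(rgb, flows)
print(mff.shape)  # (224, 224, 9)
```

The resulting 9-channel tensor can then be fed to a CNN whose first convolutional layer accepts the extended channel count, which is the "very little modification" the data level strategy requires.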

Approach

Motion Fused Frames

A single RGB image usually contains the static appearance at a specific time instant and lacks contextual and temporal information about the previous and next frames. As a result, a single video frame cannot completely represent the actual state of an action.