Audio-visual recognition systems rely on efficient feature extraction. Many spatio-temporal interest point detectors for visual feature extraction are either too sparse, leading to loss of information, or too dense, resulting in noisy and redundant information. Furthermore, interest point detectors designed for controlled environments can be affected by camera motion. In this paper, a salient spatio-temporal interest point detector is proposed based on a low-rank and group-sparse matrix approximation. The detector handles camera motion through short-window video stabilization. The multimodal audio-visual features from multiple descriptors are represented by a super descriptor, from which a compact set of features is extracted through tensor decomposition and feature selection. This tensor decomposition retains the spatio-temporal structure among features obtained from multiple descriptors. Experimental validation is conducted on two benchmark human interaction recognition datasets, TVHID and Parliament. The results show that the proposed approach outperforms many state-of-the-art methods, achieving classification rates of 74.7% and 88.5% on the TVHID and Parliament datasets, respectively.
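As a rough illustration of the low-rank plus group-sparse idea the detector builds on, the sketch below alternates between a truncated-SVD update for a low-rank component and group soft-thresholding for a group-sparse residual. This is a generic alternating scheme, not the paper's algorithm; the contiguous-column grouping, the parameter names, and the values of `rank`, `group_size`, and `lam` are all illustrative assumptions.

```python
import numpy as np

def low_rank_group_sparse(M, rank=2, group_size=4, lam=0.5, n_iter=50):
    """Approximate M as L + S, with L low-rank and S group-sparse.

    Hypothetical sketch: groups are contiguous column blocks of width
    `group_size`; the actual grouping in the paper may differ.
    """
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    for _ in range(n_iter):
        # Low-rank update: rank-truncated SVD of the residual M - S.
        U, s, Vt = np.linalg.svd(M - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Group-sparse update: soft-threshold each column group of M - L
        # by its Frobenius norm, zeroing weak groups entirely.
        R = M - L
        S = np.zeros_like(M)
        for j in range(0, M.shape[1], group_size):
            G = R[:, j:j + group_size]
            norm = np.linalg.norm(G)
            if norm > lam:
                S[:, j:j + group_size] = (1.0 - lam / norm) * G
    return L, S
```

In this kind of decomposition, the low-rank term absorbs globally correlated structure (e.g. background and slow camera-induced variation), while the surviving groups in the sparse term mark localized, salient activity.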