Abstract
Coalition situational understanding (CSU) involves the use of artificial intelligence (AI) based assets to assist human decision-makers, operating near the edge of the network, in responding to rapidly changing events. CSU is built from many sources, including data collected from sensors of multiple modalities, e.g., visual (video) and audio. A decision-maker seeking CSU will often need to rely on information from AI assets created by different coalition partners; the assets must therefore be capable of generating explanations for their outputs to better engender trust among coalition users. The edge setting, with its scarce resources, demands that explanations be as specific and targeted as possible. However, existing explainable AI techniques for state-of-the-art convolutional neural network (CNN)-based video activity recognition systems fail to explicitly distinguish the contribution of motion to a model's decision. This matters where explanations need to focus on changes in a scene. This paper presents a new technique called selective audio-visual relevance (SAVR), which separates temporal from spatial information in explanations. We demonstrate the utility of our method by applying it to an activity recognition task. Moreover, we show that our method extends to multiple modalities: multimodal activity recognition networks that use video and audio feature extractors are compatible with our selective relevance technique.