Paying Attention to Attention for Explanation

Abstract	Attention weights in sequence transduction models characterize the importance of encoder hidden states, which are in turn aligned with the positions of the input sequence. In this work, we propose attention gradient, the gradient of the loss with respect to the attention weights, as a novel method for explanation. We then conduct a comprehensive series of tests to evaluate and compare its effectiveness with respect to the following properties desirable in an explanation: (i) Faithfulness to the model (identifying both positive and inhibitor words); (ii) Resilience to model-invariant perturbations; (iii) Consistency (intra-class cohesion and inter-class separation); and (iv) Sensitivity to adversarial noise. The results show that attention gradient can indeed provide transparency into model behavior and often outperforms some popular post-hoc explanation techniques.
Authors	Supriyo Chakraborty (IBM US), Franck Le (IBM US), Prudhvi Gurram (ARL), Richard Tomsett (IBM UK)
Date	Sep-2020
Venue	4th Annual Fall Meeting of the DAIS ITA, 2020
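
The abstract defines attention gradient as the gradient of the loss with respect to the attention weights. The following is a minimal sketch of that quantity on a toy classifier, not the authors' implementation. The model (an attention-weighted sum of encoder states followed by a linear layer and cross-entropy loss), the function names, and all numeric values are assumptions made for illustration. The attention weights are treated as free variables, so the gradient is taken with respect to the weights themselves rather than the pre-softmax scores.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def loss_from_attention(attn, hidden, W, label):
    """Cross-entropy loss of a toy classifier reading an attention-weighted context.

    attn:   one attention weight per input position
    hidden: one encoder hidden-state vector per input position
    W:      linear classifier weights, one row per class
    Returns (loss, class probabilities).
    """
    dim = len(hidden[0])
    context = [sum(a * h[d] for a, h in zip(attn, hidden)) for d in range(dim)]
    logits = [sum(w * c for w, c in zip(row, context)) for row in W]
    probs = softmax(logits)
    return -math.log(probs[label]), probs


def attention_gradient(attn, hidden, W, label):
    """Analytic gradient of the loss with respect to each attention weight."""
    _, probs = loss_from_attention(attn, hidden, W, label)
    # d(loss)/d(logits) for softmax cross-entropy is probs - one_hot(label).
    dlogits = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    dim = len(hidden[0])
    # Backpropagate through the linear layer to the context vector.
    dcontext = [sum(dlogits[k] * W[k][d] for k in range(len(W))) for d in range(dim)]
    # context = sum_i attn[i] * hidden[i], so d(loss)/d(attn[i]) = dcontext . hidden[i].
    return [sum(dc * hd for dc, hd in zip(dcontext, h)) for h in hidden]


# Hypothetical toy inputs: three positions, two hidden dimensions, two classes.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, -1.0], [-1.0, 1.0]]
attn = softmax([0.5, 1.0, -0.5])
label = 0

grads = attention_gradient(attn, hidden, W, label)
for position, g in enumerate(grads):
    print(f"position {position}: d(loss)/d(attention) = {g:+.4f}")
```

In this toy setting, a negative entry marks a position whose extra attention would lower the loss, and a positive entry marks a position whose extra attention would raise it. This is one plausible way to read the "positive and inhibitor words" distinction in the abstract, offered here as an assumption.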