Abstract
Transformer models have achieved unprecedented breakthroughs in text classification and have become the foundation of most state-of-the-art NLP systems. The core component driving this success is the attention mechanism, which allows the model to dynamically focus on different parts of the input sequence when producing predictions. Several previous works have investigated using attention weights to explain model predictions because, intuitively, attention weights reflect the importance of input positions to the output. Specifically, the objective of explanation is to compute a relevance score for each input token so that the input words most important to the prediction can be identified. However, previous efforts have produced mixed results. We find that the key reason attention weights cannot be used directly as effective relevance indicators is that they lack directional information (i.e., whether an input token contributes towards or against the prediction). We then propose two novel explanation techniques, AGrad and RePAGrad, which produce directional relevance scores based on attention weights. To evaluate explanation performance, we propose three properties that an effective explanation method should satisfy (faithfulness, resilience, and consistency) and design a corresponding test to quantify each property. Through extensive evaluations with Transformer models and pre-trained BERT models on multiple public text classification datasets, we show that AGrad and RePAGrad significantly outperform existing state-of-the-art explanation methods in faithfulness and consistency, at the cost of a nominal degradation in resilience compared to attention weights. In addition, we reveal that elements of the model architecture can play an important role in explainability.
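To make the directionality point concrete, the following is a minimal, hypothetical sketch of combining attention weights with gradients of the predicted-class score to obtain signed per-token relevance. The toy single-head model, tensor names, and the particular attention-times-gradient formula are illustrative assumptions only; they are not the paper's definitions of AGrad or RePAGrad.

```python
# Illustrative sketch (assumed formulation, not the paper's AGrad/RePAGrad):
# signed relevance via attention * gradient, recovering the directional
# information that raw attention weights alone do not carry.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, d_model, num_classes = 5, 8, 2
x = torch.randn(seq_len, d_model, requires_grad=True)  # toy token embeddings
Wq = torch.randn(d_model, d_model)
Wk = torch.randn(d_model, d_model)
Wv = torch.randn(d_model, d_model)
Wc = torch.randn(d_model, num_classes)                  # toy classifier head

# Single-head self-attention over the toy sequence.
scores = (x @ Wq) @ (x @ Wk).T / d_model ** 0.5
attn = F.softmax(scores, dim=-1)
attn.retain_grad()                      # keep d(logit)/d(attention) after backward

pooled = (attn @ (x @ Wv)).mean(dim=0)  # crude mean pooling over positions
logits = pooled @ Wc
pred_class = logits.argmax()

# Backpropagate the predicted-class logit to obtain gradients on attention.
logits[pred_class].backward()

# Signed relevance per input token: attention * gradient, summed over query
# positions. Positive values support the prediction; negative values oppose it,
# which plain attention weights (always non-negative) cannot express.
relevance = (attn * attn.grad).sum(dim=0)
print(relevance)
```

In this sketch the sign of the gradient is what distinguishes tokens that push the predicted-class score up from those that push it down, while the attention weight scales how much each token was actually attended to.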