An Effective Joint Framework for Document Summarization

Document summarization is an important research problem that has attracted considerable attention from the research community. Approaches to document summarization can be classified as extractive or abstractive. In this work, we introduce an effective joint framework that integrates extractive and abstractive summarization models, which is closer to the way humans write summaries (first underlining important information). Preliminary experiments on a real benchmark dataset demonstrate that our model is competitive with state-of-the-art methods.


INTRODUCTION
Document summarization, the task of compressing a document into a condensed but informative summary, has been extensively investigated as a way to alleviate information overload. Studies on document summarization have used either extractive or abstractive methods. Most previous work has been extractive [2], owing to the immaturity of text generation technologies and the simplicity of extractive methods, which generally identify key sentences or important phrases of an input document and reproduce them as a summary. However, extractive methods suffer from incoherence and differ from the way humans write. Abstractive summarization attempts to produce a condensed representation, parts of which may not appear in the original input text. With the emergence of deep learning techniques as a viable alternative for Natural Language Processing (NLP) tasks, researchers have begun applying modern neural networks to abstractive summarization [4], which is much closer to the way humans write summaries. Although abstractive methods have achieved remarkable success, they remain challenging. Encoding and decoding a long sequence of multiple sentences fails to achieve satisfactory performance. The neural Seq2Seq framework for summarization tends to generate trivial and generic summaries with limited grammaticality and readability. It is also trained with the maximum likelihood estimation (MLE) objective to predict the next word given the previous ground-truth words as input [5].
To address these challenges, we introduce a joint framework that combines the advantages of extractive and abstractive summarization to improve performance. According to the experiments and discussion in [4], using only the first 400 tokens of an article can yield significantly higher ROUGE scores than using the first 800 tokens. This result may be explained by the structure of most articles, especially news articles, which tend to place the most important information at the start or in a few sentences. Inspired by this observation, we use extractive summarization methods to select the k most important sentences before applying abstractive methods. This process is similar to how humans often underline salient sentences before they summarize an article.

OUR MODEL
In this section, we introduce our proposed model. The widely used sequence-to-sequence framework is adopted to encode an article with multiple sentences and decode it as a short, condensed summary. Our baseline model is similar to that of See et al. [4] and is illustrated in Figure 1. The main distinction of our work is that we introduce an extractive layer, which highlights salient sentences to improve the efficiency of the decoder when it attends to different parts of the input document. In the following sections, we first introduce the extractive layer and then describe the combination of extractive and abstractive methods.

Extractive Layer
A document $d$ is a sequence of sentences $d = S_1, S_2, \dots, S_n$, and a sentence $S_i$ is a sequence of words $S_i = w_{i1}, w_{i2}, \dots, w_{ik}$, where $w_{ik}$ is the $k$-th word of the $i$-th sentence. Words are fed one by one into a single-layer bidirectional LSTM encoder. The encoder produces a sequence of encoder hidden states $h^e_i$, and the decoder state at time step $t$ is denoted as $h^d_t$. Mimicking the way humans write summaries, we implement an extractive layer that uses extractive summarization methods to highlight salient and informative sentences, whose content is more likely to be attended to by the decoder. Although many extractive methods could be used to select salient sentences, we simplify this process and follow the Lead-3 model, which achieves outstanding performance and is surpassed by the best extractive system by only a small margin [2,4]. The extractive layer selects only the first $k$ sentences of each article as the basis for abstractive summary generation.
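The lead-style selection described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `split_sentences` and `lead_k` and the regex-based sentence splitter are assumptions, since the paper does not say how articles are segmented into sentences.

```python
import re


def split_sentences(text):
    """Naive splitter on ., ! or ? followed by whitespace.

    Assumption: the paper does not specify a sentence segmenter; this
    regex is only a stand-in for illustration.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def lead_k(text, k):
    """Return the first k sentences of an article (Lead-k selection)."""
    return split_sentences(text)[:k]


article = (
    "The storm hit on Monday. Thousands lost power. "
    "Crews worked overnight. Schools reopened Thursday."
)
selected = lead_k(article, k=2)
print(selected)
```

Only the selected sentences would then be passed to the encoder, so the decoder attends to a shorter and more salient input.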

Neural Attentional Model
The attention mechanism [1] is usually introduced to alleviate the burden of remembering the whole input sequence and to let the decoder focus on different parts of the input document at different time steps. In our work, the attention distribution $a^t$ is calculated as in See et al. [4]:

$$e^t_i = v^\top \tanh(W_h h^e_i + W_s h^d_t + b_{attn}), \qquad a^t = \mathrm{softmax}(e^t),$$

where $v$, $W_h$, $W_s$ and $b_{attn}$ are learnable parameters. The context vector $h^*_t$, which is a weighted sum of the encoder hidden states, differs at each time step, namely $h^*_t = \sum_i a^t_i h^e_i$. The decoder hidden state is then concatenated with the context vector and fed through two linear layers to produce the vocabulary distribution:

$$P_{vocab} = \mathrm{softmax}(V^* (V [h^d_t, h^*_t] + b) + b^*),$$

where $V^*$, $V$, $b$ and $b^*$ are learnable parameters. Similar to See et al. [4] and Nallapati et al. [3], we introduce a pointer mechanism that allows both copying words via pointing and generating words from the vocabulary, in order to handle out-of-vocabulary words. We define $u_t$ as a binary variable: $u_t = 1$ means that the pointer mechanism is active, and $u_t = 0$ otherwise. The final distribution is:

$$p(y_t) = p(u_t = 1)\, p(y_t \mid u_t = 1) + p(u_t = 0)\, p(y_t \mid u_t = 0),$$

where $p(y_t = x_i \mid u_t = 1)$ is equal to the attention weight of the input token $x_i$, which is copied from the source document. The output at time step $t$ is denoted as $y_t$.

The loss function $L$ of the model is the negative log-likelihood of generating summaries. We define the ground-truth output sequence as $y^* = y^*_1, y^*_2, \dots, y^*_n$ for a given input $d$. During training, the loss is calculated as $L = -\sum_{t=1}^{n} \log p(y^*_t \mid y^*_1, \dots, y^*_{t-1}, x)$.
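To make the computation in this section concrete, the following pure-Python sketch walks through one decoder step: attention scores and weights, the context vector, the mixture of copying and generating, and the negative log-likelihood. It is an illustration under stated assumptions, not the authors' code. All function names and toy values are hypothetical, the pointer probability `p_point` (standing for $p(u_t = 1)$) is passed in as a given number because the text does not say how it is computed, and attention mass on repeated source tokens is summed, as in the formulation of See et al. [4].

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attention(enc_states, dec_state, W_h, W_s, b_attn, v):
    """Attention weights: softmax over v^T tanh(W_h h_i + W_s h_t + b_attn)."""
    proj_dec = matvec(W_s, dec_state)
    scores = []
    for h in enc_states:
        proj_enc = matvec(W_h, h)
        hidden = [math.tanh(pe + pd + b)
                  for pe, pd, b in zip(proj_enc, proj_dec, b_attn)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    return softmax(scores)


def context_vector(attn, enc_states):
    """Weighted sum of the encoder hidden states."""
    dim = len(enc_states[0])
    return [sum(a * h[j] for a, h in zip(attn, enc_states)) for j in range(dim)]


def final_distribution(p_point, attn, src_ids, p_vocab):
    """Mix copying (attention over source ids) with generating (p_vocab)."""
    final = {w: (1.0 - p_point) * p for w, p in enumerate(p_vocab)}
    for a, w in zip(attn, src_ids):
        final[w] = final.get(w, 0.0) + p_point * a
    return final


def nll(step_probs):
    """Negative log-likelihood of the ground-truth words, one prob per step."""
    return -sum(math.log(p) for p in step_probs)


# Toy values (hypothetical): three source tokens, 2-dimensional states.
enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dec_state = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
attn = attention(enc_states, dec_state, identity, identity, [0.0, 0.0], [1.0, 1.0])
ctx = context_vector(attn, enc_states)

p_vocab = [0.7, 0.2, 0.1]  # stand-in vocabulary distribution over ids 0..2
src_ids = [2, 5, 2]        # id 5 is out-of-vocabulary, reachable only by copying
final = final_distribution(0.4, attn, src_ids, p_vocab)
print(attn, ctx, final)
```

In this toy case the out-of-vocabulary id 5 receives probability only through the copy term, which is the purpose of the pointer mechanism.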

EXPERIMENT
We conduct experiments on the CNN/Daily Mail corpus, which is widely used in abstractive document summarization and comprises news stories with multi-sentence human-generated summaries. The corpus contains 287,226 training pairs, 13,368 validation pairs, and 11,490 test pairs. The value of k, which denotes the number of sentences selected as the basis of the abstractive summarization process, is set to 5. We compare our model with two state-of-the-art approaches, i.e., the graph-based attention model (GBA) [6] and the pointer-generator network (PGN, without the coverage mechanism) [4]. We have conducted preliminary experiments on the proposed model with the extractive [...] We also perform a human evaluation of the output summaries. We randomly select 30 articles from the dataset, and three evaluators are asked to score the summaries generated by typical models from 1 to 5 points, where 1 indicates the lowest readability and 5 the highest. From Table 1, we can see that the proposed method improves the readability of summaries.
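One simple way to aggregate such ratings is to average the evaluators' scores per article and then average over articles. The sketch below assumes this aggregation, which the paper does not specify; the function name `mean_readability` is hypothetical, and the scores are made-up placeholders rather than the paper's data.

```python
from statistics import mean


def mean_readability(scores_by_article):
    """Average 1-5 readability scores per article, then across articles.

    Assumption: plain averaging; the paper does not state how the three
    evaluators' scores are combined.
    """
    for scores in scores_by_article:
        if any(not 1 <= s <= 5 for s in scores):
            raise ValueError("scores must lie in [1, 5]")
    per_article = [mean(scores) for scores in scores_by_article]
    return mean(per_article)


# Placeholder ratings: three articles, three evaluators each (not real data).
toy_scores = [[4, 5, 3], [2, 3, 3], [5, 4, 4]]
result = mean_readability(toy_scores)
print(result)
```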

CONCLUSION
In this paper, we proposed a joint framework that combines extractive and abstractive summarization methods. Experimental results demonstrate that our model improves performance on the benchmark dataset and can generate more readable and natural summaries.