I’ve been researching abstractive summarization for a substantial amount of time now. This series of articles is meant to help someone get started in the field and follow the key intuitive concepts of the papers published since the advent of the pointer-generator network [1]. One thing that sets abstractive summarization a little apart from its counterparts in NLP is the heavy use of RL to optimize ROUGE scores directly.

Summarization is the task of taking a long piece of text and producing a shorter version of it that is relevant, informative, and gives the reader a fair idea of the original text without reading it completely. There are two kinds of summarization: extractive and abstractive. Extractive summarization creates a summary by highlighting the key sentences in the text and calling these “important” highlighted sentences the summary. Although the selected sentences don’t encompass the entire article, the attempt is to select the ones that form the crux of the text. Abstractive summarization is what we as humans do: reading a long piece of text, understanding it, and then writing it in our own clear, concise and coherent language.

Now viewing these tasks from an NLP perspective, extractive summarization is the task of selecting relevant and important sentences from the original text and simply combining them, while abstractive summarization involves natural language understanding to comprehend the text and then natural language generation to produce text from that understanding. Most pre-neural summarization work focused on the extractive side, since it was largely rule based. In the neural NLP era, the focus has shifted to abstractive summarization, a very active area of research with a lot of scope for improvement.

But why is abstractive summarization so difficult in the first place? This is mostly owed to the fact that neural networks are pretty bad at generating text even when context is given. It’s only recently, with the advent of GPT-2 [2], that they have been getting better, but these are models trained on large corpora of text not aimed at any specific task, although attempts to fine-tune them on tasks which require language generation are being made. Secondly, coming down to the nitty-gritties of the task itself, neural networks attempting to generate text suffer from repetition (basically getting stuck on a word, for example “i had the the the ….”). It’s also very difficult to handle factual details: the original text might say “100000 people turned up for the protests”, and you would want the model to copy this fact correctly. Finally, a fundamental problem in NLP models is that during training we constrain them to a vocabulary of 50,000–200,000 words, and since the English language has at least a million words, and infinitely many when you consider derivational morphology, handling out-of-vocabulary words is a difficult task for neural networks.
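To make the vocabulary problem concrete, here is a minimal sketch (a toy vocabulary and made-up preprocessing of my own, not the paper’s pipeline) of what happens when a fixed word list meets an unseen word:

```python
# Toy illustration: a fixed vocabulary maps every unseen word to the same UNK id.
vocab = {"<unk>": 0, "people": 1, "turned": 2, "up": 3, "for": 4, "the": 5, "protests": 6}

def encode(tokens, vocab):
    # Any word outside the fixed vocabulary collapses to <unk>,
    # so a plain generator literally cannot emit it in a summary.
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

print(encode(["100000", "people", "turned", "up", "for", "the", "protests"], vocab))
# -> [0, 1, 2, 3, 4, 5, 6]   ("100000" is out of vocabulary)
```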

Although the dataset itself was introduced in a 2016 paper [3], abstractive summarization for multi-line summaries came into the limelight after “Get to the Point” [1], the paper which introduced the pointer-generator network. It was the first paper to show that the issues mentioned in the previous paragraph could be handled, and it does so by introducing a pointer-generator architecture and a coverage vector. The intuitive idea of this work comes from its ability to copy words from the source text by pointing to them, combined with its ability to generate new words. Sequence-to-sequence models have always had the ability to generate new words thanks to encoder-decoder architectures: a sequence of some arbitrary length is fed into an encoder (an LSTM or RNN) and encoded into a vector, which is then fed to the decoder, which uses it along with its own decoder states to generate words from a vocabulary distribution. This sequence-to-sequence model was then augmented by pointer networks [4] (2015), which allowed it to copy words from the source document.
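As a rough sketch of that encoder-decoder backbone (a simplified single-layer, unidirectional LSTM in PyTorch, not the exact bidirectional setup from the paper), the flow looks something like this:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: encode the article, then decode a summary
    one word at a time from a softmax over a fixed vocabulary."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # Encode the source article into a final hidden state.
        _, (h, c) = self.encoder(self.embed(src_ids))
        # Decode the summary conditioned on that state (teacher forcing).
        dec_out, _ = self.decoder(self.embed(tgt_ids), (h, c))
        return torch.softmax(self.out(dec_out), dim=-1)  # P_vocab at each step
```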

Not diving too deep into the description of the components of this architecture, I’ll try to intuitively explain the expression for the final distribution from which words are generated. The pointer-generator architecture calculates p_gen, the probability of generating a word from the vocabulary, while 1 - p_gen represents the chance that a word is copied from the source. Looking at this practically, if a word is out-of-vocabulary (OOV) then P_vocab(w) is zero, since the word doesn’t exist in the dictionary, so the generation term falls to zero and the word can only be produced by copying. Similarly, if a word doesn’t appear in the source document, the attention term is zero and the word can only come from the vocabulary. The architecture is trained with negative log likelihood on the reference summary words.
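To make that final distribution concrete, here is a small sketch of my own (assuming the attention weights and p_gen have already been computed, and using an “extended vocabulary” that appends the article’s in-document OOV words after the fixed vocabulary, as the paper does):

```python
import torch

def final_distribution(p_gen, p_vocab, attention, src_ids_ext, extended_size):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention on copies of w.

    p_gen:        scalar in [0, 1]
    p_vocab:      [vocab_size] softmax over the fixed vocabulary
    attention:    [src_len] attention weights over the source tokens
    src_ids_ext:  [src_len] source token ids in the extended vocabulary
                  (in-article OOV words get ids >= vocab_size)
    """
    dist = torch.zeros(extended_size)
    dist[:p_vocab.size(0)] = p_gen * p_vocab          # generation path
    copy = (1 - p_gen) * attention
    dist.scatter_add_(0, src_ids_ext, copy)           # copy path: point at source words
    return dist

# Training then minimizes the negative log likelihood of the reference word
# at each step: loss_t = -log P(w*_t), averaged over the summary.
```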

Finally, we’ve seen how the model handles out-of-vocabulary words and how it reduces factual errors through its ability to copy source text. The way this paper handles repetition is by introducing a coverage loss: a coverage vector is maintained, which is basically the sum of the attention distributions from all previous decoder steps. Attention in neural NLP describes which part of the document the model should pay attention to when it generates its next word. The sum of past attention distributions is therefore a record of all the places the model has already attended to, and turning it into a loss discourages attending to the same places again; to keep the loss low, the model has to be diverse in which sections of the article it attends to.

Turning to the practical details of the paper, one thing to observe is that articles were truncated to 400 tokens and summaries to 100–120 tokens, and that performance took a hit when longer articles were used. It is rather counterintuitive that providing more context in the form of longer articles drops the performance of the model, and this highlights to some extent the inability of current neural NLP models to handle long sequences such as articles, papers, or blogs. Another training detail is that the model is first trained without the coverage loss, so it initially just learns its main objective, i.e. generating summaries by copying relevant text from the source document while generating the words that weave these bits into coherent text. Only at the end is the model trained with coverage, so coverage can be thought of as a fine-tuning step here.
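A minimal sketch of the coverage idea (assuming a matrix of per-step attention distributions is already available; the per-step penalty follows the paper’s min(attention, coverage) form):

```python
import torch

def coverage_loss(attentions):
    """attentions: [T, src_len] attention distribution at each decoder step.

    Coverage at step t is the sum of attention from steps < t; the penalty
    at step t is sum_i min(a_i^t, c_i^t), so re-attending to an already
    covered position is what gets punished."""
    coverage = torch.zeros_like(attentions[0])
    loss = 0.0
    for a_t in attentions:
        loss = loss + torch.minimum(a_t, coverage).sum()
        coverage = coverage + a_t   # remember where we have already looked
    return loss / attentions.size(0)
```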

The authors mention that during training the model prefers generating words over copying them, while at test time it prefers copying words more, which makes the model lean towards extraction. The former is possibly due to the presence of reference summaries, and the latter due to their absence.