Automatic text summarization system using Transformers — Are you tired of reading a long paper?
An automatic text summarization system built with Transformers can help you digest long papers and articles. Let’s build a summarization system using HuggingFace and Streamlit.
Let’s try summarizing an article titled “How BTS Became The Undisputed Kings Of K-Pop”.
An impressively concise summarization result.
So how does summarization with Transformers actually work?
Abstract
We present a system that can summarize a paper using Transformers. It uses BART, which pre-trains a model combining bidirectional and auto-regressive Transformers, and PEGASUS, a state-of-the-art model for abstractive text summarization.
The AI community currently approaches automatic text summarization in two ways: extractive summarization and abstractive summarization. In this system, I focus on abstractive summarization because it is more advanced and closer to human-like interpretation; it also has more potential and is generally more interesting for researchers and developers.
Introduction
People are often tasked with reading a document or paper and producing a summary to demonstrate both reading comprehension and writing ability. So, I want to build a summarization system that can condense a long news article into a short summary that covers most of the meaning of the original text.
In this blog post, we will talk about several attempts to automate the summarizing process and see how they work.
Related work
If you aren’t familiar with Transformers and attention mechanisms, check my previous blog post for a general overview before reading on.
Extractive Summarization: the extractive approach selects the most important phrases and sentences from the document and combines them to create the summary. In this case, every line and word of the summary actually belongs to the original document being summarized.
Abstractive Summarization: the abstractive approach uses new phrases and terms that are different from the original document while keeping the meaning the same, just like humans do when they summarize. It is therefore much harder than the extractive approach.
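To make the contrast concrete, here is a minimal, purely illustrative sketch of the extractive approach: score each sentence by the frequency of its words and keep the top-ranked sentences verbatim. This toy function is an assumption for illustration only and is not part of the system we build below.

from collections import Counter

def extractive_summary(text, num_sentences=2):
    """Toy extractive summarizer: keep the sentences with the most frequent words."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    # Word frequencies over the whole document (ignoring case)
    freqs = Counter(w.lower() for s in sentences for w in s.split())

    def score(sentence):
        # Average frequency of the words in this sentence
        words = sentence.split()
        return sum(freqs[w.lower()] for w in words) / len(words)

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Keep the selected sentences in their original order
    top.sort(key=sentences.index)
    return ". ".join(top) + "."

Every sentence in the output comes straight from the input, which is exactly what abstractive models like BART and PEGASUS do not do.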
Overall architecture
At the end of 2019, researchers at Facebook AI published a new model for Natural Language Processing (NLP) called BART (Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension). BART has outperformed other models in the NLP field and achieved new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE.
Similarly, in 2020 researchers at Google AI introduced a new NLP model called PEGASUS (Pre-training with Extracted Gap-Sentences for Abstractive Summarization), which achieved state-of-the-art results on 12 diverse summarization datasets.
In this blog post, we are going to look at what BART and PEGASUS are and how they work on text summarization tasks.
This article is structured as follows:
- What is BART?
- What is PEGASUS?
- Dataset
- Implement an automatic text summarization system using HuggingFace.
What is BART?
BART, which stands for Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, was developed by Facebook AI in 2019. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (bidirectional encoder) and GPT (left-to-right decoder).
BERT: Random tokens are replaced with the token [MASK], and the document is encoded bidirectionally. Missing tokens are predicted independently, so BERT cannot easily be used for generation.
GPT: Tokens are predicted auto-regressively, meaning GPT can be used for generation. However, words can only condition on leftward context, so it cannot learn bidirectional interactions.
BART: Inputs to the encoder need not be aligned with the decoder outputs, which allows arbitrary noising transformations. Here, a document has been corrupted by replacing spans of text with [MASK] symbols. The corrupted document (left) is encoded with a bidirectional encoder, and the likelihood of the original document (right) is then computed with an autoregressive decoder.
Because BART has an autoregressive decoder, it can be fine-tuned for sequence generation tasks such as summarization. In summarization, information is copied from the input but manipulated, which is closely related to the denoising pre-training objective. Here, the encoder receives the input sequence, and the decoder generates the output autoregressively.
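As a quick illustration of the denoising idea, the pretrained facebook/bart-large checkpoint (not the summarization-fine-tuned one used in the demo below) can reconstruct an input in which a span has been replaced by the <mask> token. This is only a small sketch of the pre-training behaviour, following the mask-filling pattern from the Transformers documentation:

from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
# forced_bos_token_id=0 is recommended for mask filling with this checkpoint
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large", forced_bos_token_id=0)

# A "damaged" document: a span of text has been replaced with the mask token
corrupted = "BTS have become the undisputed <mask> of K-pop."
input_ids = tokenizer(corrupted, return_tensors="pt").input_ids

# The autoregressive decoder generates the most likely original document
reconstructed_ids = model.generate(input_ids, num_beams=4, max_length=20)
print(tokenizer.decode(reconstructed_ids[0], skip_special_tokens=True))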
What is PEGASUS?
PEGASUS, which stands for Pre-training with Extracted Gap-Sentences for Abstractive Summarization, was developed by Google AI in 2020. The authors propose pre-training large Transformer-based encoder-decoder models on massive text corpora with a new self-supervised objective.
In PEGASUS, several complete sentences are masked out of a document, and the model is trained to predict them. The input is the document with the missing sentences, and the output is the missing sentences concatenated together. This task is called Gap Sentence Generation (GSG).
Although the main contribution of PEGASUS is Gap Sentence Generation, its base architecture includes an encoder and a decoder, so PEGASUS also pre-trains its encoder as a masked language model (MLM).
In the encoder module, random words in the sequence are masked, and the remaining words are used to predict them.
In PEGASUS, the encoder (MLM) and decoder (GSG) objectives are trained simultaneously.
Suppose the original document has three sentences. One sentence is masked with [MASK1] and used as the target generation text (GSG). The other two sentences remain in the input, but some of their words are randomly masked with [MASK2] (MLM).
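Here is a small, purely illustrative sketch of how such a training pair could be constructed. The real PEGASUS pipeline selects the "important" gap sentences by ROUGE score; this toy helper simply masks the sentence you point it at plus a few random words, and every name in it is hypothetical:

import random

def make_pegasus_example(sentences, gap_index, mlm_prob=0.15):
    """Toy construction of a GSG + MLM training pair from a list of sentences."""
    inputs, target = [], sentences[gap_index]
    for i, sentence in enumerate(sentences):
        if i == gap_index:
            # GSG: the whole sentence becomes the decoder target
            inputs.append("[MASK1]")
        else:
            # MLM: randomly mask some words in the remaining sentences
            words = [w if random.random() > mlm_prob else "[MASK2]" for w in sentence.split()]
            inputs.append(" ".join(words))
    return " ".join(inputs), target

doc = ["Pegasus is mythical.", "It is pure white.", "It names the model."]
encoder_input, decoder_target = make_pegasus_example(doc, gap_index=1)
print(encoder_input)   # e.g. "Pegasus is mythical. [MASK1] It [MASK2] the model."
print(decoder_target)  # "It is pure white."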
The Dataset
The dataset we can use for training BART and PEGASUS is the CNN/DailyMail dataset.
This dataset has two features:
- The article, which is the text of the news article.
- The highlights, which represent the key elements of the text and can be useful for summarizing.
The CNN/DailyMail dataset (Hermann et al., 2015) contains roughly 300,000 articles (93k from CNN and 220k from the Daily Mail), and each article comes with several highlights.
· Average article length: ~766 words
· Average summary length: ~53 words
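If you want to inspect the data yourself, the dataset is available on the Hugging Face Hub. A minimal sketch, assuming the datasets library is installed (pip install datasets):

from datasets import load_dataset

# Loads ~300k article/highlights pairs (the download is several GB)
dataset = load_dataset("cnn_dailymail", "3.0.0")

example = dataset["train"][0]
print(example["article"][:300])   # the news article (model input)
print(example["highlights"])      # the highlights (reference summary)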
Let’s have a walk-through of the code!
In this code, we use Newspaper3k and Streamlit to build a simple demo.
Install dependencies
pip install transformers
pip install newspaper3k
pip install streamlit
Run the code
import time

import streamlit as st
import torch
from newspaper import Article
from transformers import BartForConditionalGeneration, BartTokenizer
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

# Use the GPU if one is available, otherwise fall back to the CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

st.title('Text Summarization Demo')
st.markdown('Orient Development team')
model = st.selectbox('Model', ["Bart", "Pegasus"])

# Download and parse the article from the URL entered by the user
link_paper = st.text_area('URL of paper')
input_text = ""
if link_paper:
    article = Article(link_paper)
    article.download()
    article.parse()
    input_text = article.text
    st.text(input_text)

def run_model(input_text):
    start_time = time.time()
    # Collapse whitespace so the tokenizer sees clean text
    input_text = ' '.join(input_text.split())
    if model == "Bart":
        bart_model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").to(device)
        bart_tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
        # Truncate to BART's maximum input length of 1024 tokens
        input_tokenized = bart_tokenizer.encode(input_text, truncation=True, max_length=1024,
                                                return_tensors='pt').to(device)
        summary_ids = bart_model.generate(input_tokenized,
                                          num_beams=4,
                                          num_return_sequences=1,
                                          no_repeat_ngram_size=2,
                                          length_penalty=1,
                                          min_length=12,
                                          max_length=128,
                                          early_stopping=True)
        output = [bart_tokenizer.decode(g, skip_special_tokens=True,
                                        clean_up_tokenization_spaces=False) for g in summary_ids]
        st.write('Summary')
        st.success(output)
    else:
        pegasus_model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-cnn_dailymail").to(device)
        pegasus_tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-cnn_dailymail")
        # Tokenize and truncate to the model's maximum input length
        batch = pegasus_tokenizer(input_text, truncation=True, padding='longest',
                                  return_tensors="pt").to(device)
        summary_ids = pegasus_model.generate(**batch,
                                             num_beams=6,
                                             num_return_sequences=1,
                                             no_repeat_ngram_size=2,
                                             length_penalty=1,
                                             min_length=30,
                                             max_length=128,
                                             early_stopping=True)
        output = pegasus_tokenizer.batch_decode(summary_ids, skip_special_tokens=True,
                                                clean_up_tokenization_spaces=False)
        st.write("Summary")
        st.success(output)
    print("--- %s seconds ---" % (time.time() - start_time))

if st.button('Submit'):
    run_model(input_text)
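Save the script (for example as summarize_app.py, a filename chosen here for illustration) and launch the demo with:
streamlit run summarize_app.py
Then paste an article URL into the text box, pick a model, and click Submit.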
Result
I tried several articles and the system produced good summaries.
Text: https://e.vnexpress.net/news/news/pm-orders-covid-19-inoculation-starting-this-week-4242380.html
Summary generated by BART:
Prime Minister Nguyen Xuan Phuc ordered the Health Ministry to commence Covid-19 vaccination for prioritized groups from this week. The poor, families under preferential treatment and some priority groups need to be inoculated with the vaccine quickly, he told a Tuesday meeting.
Summary generated by PEGASUS:
Vietnam has set a target of immunizing 10 million people against the deadly Covid-19 influenza virus by the end of the year, the government has said.
Text: https://www.dw.com/en/coronavirus-home-tests-will-give-germany-more-freedom/a-56677136
Summary generated by BART:
Jens Spahn says home coronavirus tests are an important step on the return to normalcy. Three such self-administered rapid antigen tests have been given special approval for use. German Chancellor Angela Merkel echoed her health minister in emphasizing the importance of treating those who are and are not vaccinated the same.
Summary generated by PEGASUS:
Germany’s health minister says the country is “on the right path” in its efforts to return to normal following the H1N1 pandemic.
Summary generated by BART:
World Health Organisation says it’s “unrealistic” to think the COVID-19 pandemic will be over before the end of the year. Number of new cases rose globally in the week ending February 22 — the first weekly increase recorded since early January. Confirmed cases roses in Americas, Eastern Mediterranean, Europe, and South-East Asia.
Summary generated by PEGASUS:
It started “unrealistic” to think the first CO-19 pandemic will be over before the end of the year, a top World Health Organisation official stressed on Monday.
Conclusion
The BART model trained on CNN/DailyMail data performs well and produces fluent summaries. However, we think it still has some weaknesses:
- The BART model is trained on an English vocabulary, so it may not work for other languages.
- BART may miss some keywords that researchers might want to see as part of the summary.
The PEGASUS model, also trained on CNN/DailyMail data, produces shorter summaries than the BART model. However, its summaries are not always meaningful and correct (in the second text, PEGASUS mistakes the coronavirus for the H1N1 pandemic); PEGASUS sometimes produces incorrect information like this.
Read original and latest article at:
https://www.neurond.com/blog/automatic-text-summarization-system-using-transformers
NeurondAI is a transformation business. Contact us at:
Website: https://www.neurond.com/