Question Answering System using Transformer — You’ve got questions? We’ve got answers

End-to-end Question Answering system using Transformer

Neurond AI
11 min read · Feb 9, 2021


Question answering systems are widely used in natural language processing. They answer questions posed in natural language and have a wide range of applications.

This blog post deals mainly with a question answering system designed for a specific domain. Such systems are usually built on a model called the Transformer, which relies on several methods and mechanisms that I’ll introduce here. The papers I refer to in the post offer a more detailed and quantitative description.

Overall architecture

At the end of 2018, researchers at Google AI Language published a new model for Natural Language Processing (NLP) called BERT (Bidirectional Encoder Representations from Transformers). At the time of its release, BERT outperformed other models in the NLP field and reached the state of the art on a range of language tasks.

In this blog post, we are going to look at what BERT is and how to fine-tune it for question answering tasks. We also cover a modified version of BERT called Longformer: what it is, how it came about, and why we should use it.

This article is structured as follows –

  • Before BERT, how did people approach NLP? (RNN, LSTM)
  • What is the Transformer? Five steps to understand the attention mechanism.
  • Why Transformers?
  • What is BERT?
  • Why do we use BERT for the question answering task?
  • How to use BERT for question answering
  • What is Longformer?
  • Why do we use Longformer for the question answering task?
  • Implementing the question answering task.

Let’s take a look back at traditional neural networks for sequential data.


A Recurrent Neural Network (RNN) is a generalization of the feedforward neural network that has an internal memory. RNNs can use their internal state (memory) to process sequences of inputs.

Long Short-Term Memory (LSTM) networks are a modified version of recurrent neural networks that makes it easier to retain past data in memory. They largely mitigate the vanishing gradient problem of plain RNNs. LSTMs are well suited to classifying, processing and predicting time series with time lags of unknown duration. They are trained with back-propagation.


For the purposes of this discussion, recurrent neural networks and long short-term memory models are almost identical in their core properties:

Sequential processing: sentences must be processed word by word.

Past information retained through past hidden states: sequence-to-sequence models follow the Markov property, so each state is assumed to depend only on the previously seen state.

The first property is the reason why RNNs and LSTMs can’t be trained in parallel. To encode the second word in a sentence I need the hidden state computed for the first word, so I have to compute that first.

The second property limits how far information travels. Information in RNNs and LSTMs is retained through previously computed hidden states, but the encoding of a specific word is passed on directly only to the next time step. It therefore strongly affects only the representation of the next word, and its influence is quickly lost after a few time steps. LSTMs (and GRUs) can extend the dependency range they can learn somewhat, thanks to deeper processing of the hidden states through specific units (which comes with more parameters to train), but the problem is inherent to recursion.

Another way people have mitigated this problem is to use bidirectional models, which encode the same sentence in two directions, from start to end and from end to start. This gives words at the end of a sentence a stronger influence on the hidden representation, but it is a workaround rather than a real solution for very long dependencies.
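To make the sequential dependency concrete, here is a toy recurrent step in plain Python. It is only a sketch: the scalar weights and the name `rnn_forward` are illustrative assumptions, not taken from any library. It shows that each hidden state needs the previous one, and that the influence of an early input shrinks at every step.

```python
import math


def rnn_forward(inputs, w_in=0.5, w_hidden=0.5):
    """Run a toy scalar RNN over a sequence and return every hidden state."""
    hidden = 0.0  # initial memory
    states = []
    for x in inputs:  # step t cannot start before step t-1 has finished
        hidden = math.tanh(w_in * x + w_hidden * hidden)
        states.append(hidden)
    return states


# Feed a single "1" followed by zeros and watch its influence fade.
states = rnn_forward([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
for step, value in enumerate(states):
    print(f"step {step}: hidden = {value:.4f}")
```

The loop body cannot be run for all positions at once, which is the parallelism problem described above.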

LSTM is DEAD. Long live Transformers.

The paper “Attention Is All You Need” describes Transformers and what is called a sequence-to-sequence (Seq2Seq) architecture. A sequence-to-sequence model is a neural network that transforms a given sequence into another sequence for a specific task.

The most famous application of Seq2Seq models is translation, where a sequence of words in one language is transformed into a sequence of words in another language. A popular choice for this type of model used to be the Long Short-Term Memory (LSTM) based model. However, on long sequences LSTMs train slowly and lose information (the long-range dependency problem).

The Transformer model was born to solve these problems of the LSTM by replacing the recurrent mechanism with the attention mechanism. Transformers were introduced in the context of machine translation, with the aim of avoiding recursion in order to allow parallel computation (reducing training time) and to reduce the drop in performance caused by long dependencies. The attention mechanism looks at an input sequence and decides at each step which other parts of the sequence are important. It sounds abstract, but let me clarify with an easy example: when reading this text, you always focus on the word you are reading, but at the same time your mind still holds the important keywords of the text in memory in order to provide context.

Five steps to understand attention mechanism

This is the core idea of Transformers. Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.

Example: when the model is processing the word “it”, self-attention tries to associate “it” with “street” in the same sentence.

So how do we get the relations between words?

First step: for each word, we create three vectors, Q (query), K (key) and V (value), by multiplying the word’s embedding by three matrices (WQ, WK, WV) learned during training.

Second step: we compute a score for every word in the sentence against the current word, by taking the dot product of the current word’s Q vector with that word’s K vector.

Third step: we divide each score by the square root of the dimension of the key vectors, then apply the softmax function, which determines how much each word will be expressed at this position.

Fourth step: we multiply each value vector by its softmax score, keeping the important related words and suppressing the others.

Final step: we sum the weighted V vectors to get the attention vector Z for the word. Repeating these steps for every word gives the attention matrix for the sentence.
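The steps above can be sketched in a few lines of plain Python. This is a minimal, unoptimized illustration of scaled dot-product self-attention. The helper names, the tiny 2-dimensional embeddings and the identity projection matrices are assumptions chosen for readability; a real model learns its projection matrices and uses a tensor library.

```python
import math


def matvec(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings, w_q, w_k, w_v):
    # Project every embedding into query, key and value vectors.
    queries = [matvec(x, w_q) for x in embeddings]
    keys = [matvec(x, w_k) for x in embeddings]
    values = [matvec(x, w_v) for x in embeddings]
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score the current word against every word with a dot product.
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        # Scale by sqrt(d_k), then normalise the scores with softmax.
        weights = softmax([s / math.sqrt(d_k) for s in scores])
        # Weight each value vector and sum them into the attention vector z.
        z = [sum(w * v[j] for w, v in zip(weights, values))
             for j in range(len(values[0]))]
        outputs.append(z)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy 2-dimensional "word embeddings" and identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z, attn = self_attention(embeddings, identity, identity, identity)
for word_index, weights in enumerate(attn):
    print(word_index, [round(w, 3) for w in weights])
```

Each printed row shows how strongly one word attends to every word in the toy sentence; the weights in a row always sum to 1.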

However, a single self-attention operation can learn only one kind of relation between words, and we want the model to learn many. So we use multi-head attention, which runs several self-attention operations in parallel to capture several kinds of relations between words.

What is BERT?

BERT, which stands for Bidirectional Encoder Representations from Transformers, was developed by researchers at Google in 2018. It is based on the Transformer, a deep learning model in which every output element is connected to every input element and the weightings between them are calculated dynamically based on their connection.

It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks.

BERT stacks multiple transformer encoders on top of each other. The transformer is based on the famous multi-head attention module which has shown substantial success in both vision and language tasks.


In search, for example, BERT helps the engine understand the significance of small function words like ‘to’ and ‘for’ in the keywords used.

For the question answering system, BERT takes two inputs, the question and the passage, as a single packed sequence. We then fine-tune the model so that its output marks the span of the passage that contains the answer.
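As a rough illustration of the packed sequence, the sketch below joins a question and a passage in the `[CLS] question [SEP] passage [SEP]` layout and then picks the answer span from per-token start and end scores. The function names and the hand-written scores are assumptions made for this example; in a fine-tuned model the scores come from the output layer added on top of BERT, and the tokens come from its WordPiece tokenizer rather than a simple word list.

```python
def pack_inputs(question_tokens, passage_tokens):
    """Pack question and passage as [CLS] question [SEP] passage [SEP]."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + passage_tokens + ["[SEP]"]
    # Segment ids: 0 for the question part, 1 for the passage part.
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(passage_tokens) + 1)
    passage_start = len(question_tokens) + 2  # index of the first passage token
    return tokens, segment_ids, passage_start


def best_span(start_scores, end_scores, passage_start, passage_len, max_answer_len=15):
    """Return the (start, end) pair inside the passage with the highest total score."""
    passage_end = passage_start + passage_len  # exclusive; stops before the last [SEP]
    best, best_score = None, float("-inf")
    for start in range(passage_start, passage_end):
        for end in range(start, min(start + max_answer_len, passage_end)):
            score = start_scores[start] + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


question = ["who", "wrote", "hamlet", "?"]
passage = ["hamlet", "was", "written", "by", "william", "shakespeare"]
tokens, segment_ids, passage_start = pack_inputs(question, passage)

# Made-up scores standing in for the model's start and end predictions.
start_scores = [0.0] * len(tokens)
end_scores = [0.0] * len(tokens)
start_scores[10] = 5.0  # "william"
end_scores[11] = 5.0    # "shakespeare"

start, end = best_span(start_scores, end_scores, passage_start, len(passage))
print(tokens[start:end + 1])
```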

What is Longformer?

Transformer-based language models have been leading the NLP benchmarks lately. Models like BERT and RoBERTa have been state of the art for a while. However, one major drawback of these models is that they cannot “attend” to longer sequences. For example, BERT is limited to a maximum of 512 tokens at a time. With a classic Transformer you cannot process a long document in one pass, so you divide it into chunks, process each chunk individually and then make predictions. The drawback is that the model cannot connect a word in one chunk to a word in another chunk at the neural level: once you split a document into individual samples they become independent, and the attention mechanism cannot operate across the chunk boundaries.
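The chunking workaround can be sketched as follows. This is a simplified illustration that works on a plain list of tokens; `split_into_chunks` is a hypothetical helper, and `stride` here means how far the window moves between chunks, which is an assumption of this sketch rather than a statement about any particular library.

```python
def split_into_chunks(tokens, max_len=512, stride=128):
    """Split a token list into overlapping windows of at most max_len tokens."""
    if not 0 < stride <= max_len:
        raise ValueError("stride must be between 1 and max_len")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the current window already reaches the end of the document
        start += stride
    return chunks


document = list(range(1000))  # stand-in for 1000 token ids
chunks = split_into_chunks(document, max_len=512, stride=128)
for index, chunk in enumerate(chunks):
    print(f"chunk {index}: tokens {chunk[0]}..{chunk[-1]} ({len(chunk)} tokens)")
```

Each chunk is fed to the model on its own, so a token in one chunk never attends to a token that only appears in another chunk.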

To overcome these long sequence issues, the Longformer essentially combines several attention patterns:

1. Sliding Window

The name speaks for itself. In this approach, we take an arbitrary window size w, and each token in the sequence will only attend to some w tokens (mostly w/2 to the left and w/2 to the right).

To understand the working of this attention pattern, let’s consider the example of convolutions. Say we take a kernel of size w and slide it through all the tokens in the sequence. After this operation, we’ll have the hidden state representations of all the tokens in the sequence when attended with w adjacent tokens. Now, if we do the same thing for l layers, each token in our sequence would’ve attended (l x w) adjacent tokens, so more or less, the entire input sequence. The authors call this space a receptive field (the reach of attention for a given token). The sliding window attention has a receptive field of (l x w).
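A small sketch of the sliding window pattern: the first helper lists which positions a token attends to, and the second applies the l x w estimate of the receptive field from the paragraph above. The helper names are hypothetical, and the sequence length, window size and layer count are illustrative values chosen for this example.

```python
def window_indices(position, seq_len, window):
    """Positions a token attends to: about window/2 on each side, plus itself."""
    half = window // 2
    return list(range(max(0, position - half), min(seq_len, position + half + 1)))


def receptive_field(num_layers, window):
    """Approximate reach of a token after stacking layers (l x w)."""
    return num_layers * window


seq_len, window, num_layers = 4096, 512, 12

# Count attended (query, key) pairs with a sliding window versus full attention.
local_pairs = sum(len(window_indices(i, seq_len, window)) for i in range(seq_len))
full_pairs = seq_len * seq_len

print("token 5 of 10, window 4:", window_indices(5, 10, 4))
print("pairs with sliding window:", local_pairs)
print("pairs with full attention:", full_pairs)
print("receptive field after", num_layers, "layers:", receptive_field(num_layers, window))
```

The pair count grows linearly with the sequence length for a fixed window, while full attention grows quadratically.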

2. Dilated Sliding Window

Dilated sliding window: instead of attending only to adjacent tokens, the window has gaps, for example skipping every other word. The idea is to cover a much wider span of the sequence with the same window size w, so that information is incorporated faster across the layers. Because the number of attended tokens stays the same, this does not increase the model’s computation.
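The difference from the plain sliding window is easiest to see by listing the attended positions. In this sketch `dilated_window_indices` is a hypothetical helper and `dilation` is the step between attended positions; a dilation of 1 gives the ordinary sliding window, and a dilation of 2 skips every other token.

```python
def dilated_window_indices(position, seq_len, window, dilation):
    """Positions attended to when neighbours are taken every `dilation` tokens."""
    half = window // 2
    indices = []
    for offset in range(-half, half + 1):
        target = position + offset * dilation
        if 0 <= target < seq_len:  # drop positions that fall outside the sequence
            indices.append(target)
    return indices


plain = dilated_window_indices(10, 100, window=4, dilation=1)
dilated = dilated_window_indices(10, 100, window=4, dilation=2)
print("plain:  ", plain, "span", plain[-1] - plain[0] + 1)
print("dilated:", dilated, "span", dilated[-1] - dilated[0] + 1)
```

Both windows attend to the same number of tokens, but the dilated one reaches further along the sequence.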

3. Global Attention (full self-attention)

Let’s consider the same example of QA tasks. In the case of Longformer, we can give all the question tokens a global attention pattern, i.e., have them attend to all the other tokens in the sequence. Moreover, the rest of the tokens also attend to all the tokens in the question, along with the tokens in their own window. This is shown in the figure above.

Longformer applies the three attention patterns above to handle long sequences.
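To visualise how local and global attention combine, the sketch below builds a small boolean attention mask. The function name, the 12-token sequence and the choice of positions 0 to 2 as the "question" are assumptions made for illustration; dilation is left out to keep the picture readable.

```python
def longformer_mask(seq_len, window, global_positions):
    """Boolean matrix: mask[i][j] is True when token i may attend to token j."""
    half = window // 2
    global_set = set(global_positions)
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            local = abs(i - j) <= half
            # Global tokens attend everywhere, and every token attends to them.
            is_global = i in global_set or j in global_set
            mask[i][j] = local or is_global
    return mask


# Toy QA input: positions 0-2 hold the question, the rest hold the passage.
mask = longformer_mask(seq_len=12, window=2, global_positions=[0, 1, 2])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The full rows and columns at the top and left belong to the question tokens; the diagonal band is the sliding window of the passage tokens.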

The Dataset

We will be using the Stanford Question Answering Dataset (SQuAD 2.0) for training and evaluating our model. SQuAD is a reading comprehension dataset and a standard benchmark for QA models. The dataset is publicly available on the website.

SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
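For orientation, this sketch walks the nested SQuAD 2.0 JSON layout and counts answerable and unanswerable questions. The tiny inline sample is invented for this example so the snippet runs on its own; with the real files you would load `train-v2.0.json` with `json.load` instead.

```python
import json

# A hand-made sample in the SQuAD 2.0 layout (not taken from the real dataset).
sample = json.loads("""
{
  "version": "v2.0",
  "data": [{
    "title": "Toy_article",
    "paragraphs": [{
      "context": "BERT was published by researchers at Google in 2018.",
      "qas": [
        {"id": "q1", "question": "Who published BERT?", "is_impossible": false,
         "answers": [{"text": "researchers at Google", "answer_start": 22}]},
        {"id": "q2", "question": "Who published GPT?", "is_impossible": true,
         "answers": []}
      ]
    }]
  }]
}
""")


def count_questions(squad):
    """Return (answerable, unanswerable) question counts."""
    answerable = unanswerable = 0
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if qa.get("is_impossible", False):
                    unanswerable += 1
                else:
                    answerable += 1
    return answerable, unanswerable


print(count_questions(sample))
```

Answerable questions carry the answer text and its character offset in the context; unanswerable ones are flagged with `is_impossible` and have an empty answer list.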

Let’s have a walk-through of the code!


1. Install Anaconda.

2. Create an Anaconda environment with Python version 3.7.

conda create -n QAS_longformer python=3.7

3. Activate environment.

conda activate QAS_longformer

4. We recommend using CUDA for fast training.

# install pytorch with cuda version
pip install torch==1.7.0+cu101 torchvision==0.8.1+cu101 torchaudio==0.7.0 -f

5. Install the Transformers library.

pip install transformers
pip install simpletransformers

Prepare dataset

Download the SQuAD2.0 dataset.

The file directory should look like this (the scripts below read both files from a data folder):

data/train-v2.0.json
data/dev-v2.0.json


Training model

Create a Python file with the code below.

import json
import logging

from simpletransformers.question_answering import QuestionAnsweringModel

# transformers_logger = logging.getLogger("transformers")

if __name__ == '__main__':
    # Prepare training data: flatten the SQuAD articles into a list of paragraphs
    with open('data/train-v2.0.json', 'r') as f:
        train_data = json.load(f)
    train_data = [item for topic in train_data['data'] for item in topic['paragraphs']]
    train_data = train_data[0:5001]  # keep a subset to shorten training

    with open('data/dev-v2.0.json', 'r') as d:
        eval_data = json.load(d)
    eval_data = [item for topic in eval_data['data'] for item in topic['paragraphs']]
    eval_data = eval_data[0:501]

    train_args = {
        'learning_rate': 3e-5,
        'num_train_epochs': 5,
        'max_seq_length': 384,
        'doc_stride': 128,
        'overwrite_output_dir': True,
        'reprocess_input_data': False,
        'train_batch_size': 2,
        'gradient_accumulation_steps': 8,
        # run evaluation on eval_data while training
        'evaluate_during_training': True,
    }

    model = QuestionAnsweringModel(
        "longformer", "allenai/longformer-base-4096", args=train_args, use_cuda=True
    )

    # Train the model
    model.train_model(train_data, eval_data=eval_data)

    # Evaluate the model
    result, texts = model.eval_model(eval_data)
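As a side note on the training arguments, the small batch size combined with gradient accumulation gives a larger effective batch. The arithmetic can be sketched as below. This is a rough estimate only: `estimate_schedule` is a hypothetical helper, the figure of 10,000 training examples is invented for illustration, and the library's own step count may differ in detail.

```python
import math


def estimate_schedule(num_examples, train_batch_size,
                      gradient_accumulation_steps, num_train_epochs):
    """Estimate the effective batch size and the number of optimizer updates."""
    effective_batch = train_batch_size * gradient_accumulation_steps
    batches_per_epoch = math.ceil(num_examples / train_batch_size)
    updates_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
    return effective_batch, updates_per_epoch * num_train_epochs


# Values from the training arguments above; num_examples is an assumed figure.
effective_batch, total_updates = estimate_schedule(
    num_examples=10_000,
    train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=5,
)
print("effective batch size:", effective_batch)
print("approximate optimizer updates:", total_updates)
```

Accumulating gradients over 8 batches of 2 lets the model behave roughly as if it were trained with batches of 16, while only 2 long sequences need to fit in GPU memory at once.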


To ask questions with the trained model, create a second Python file with the code below.

import time

from simpletransformers.question_answering import QuestionAnsweringModel, QuestionAnsweringArgs

model_args = QuestionAnsweringArgs(overwrite_output_dir=True, doc_stride=80)

# After training, point the model at the directory where training saved its outputs, as below.
model = QuestionAnsweringModel(
    "longformer", "./outputs", use_cuda=True, args=model_args
)


def predictset():
    phrase = "break"  # typing this word ends the session
    while True:
        input_question = input("question: ")
        if input_question == phrase:
            print("QAS: good bye!")
            break

        to_predict1 = [{
            "context": "<input your context here>",
            "qas": [{
                "question": input_question,
                "id": "0",
            }],
        }]

        start_time = time.time()
        answers, probabilities = model.predict(to_predict1)
        print("--- %s seconds ---" % (time.time() - start_time))

        # answers[0]["answer"] lists the candidate answers, best first
        dict_ans = answers[0]
        real_answer = dict_ans["answer"][0]
        print(real_answer)


if __name__ == '__main__':
    predictset()


You can replace the context with your own text and ask questions related to it.


Finally, I built a user interface and asked the system several questions. The answers were CORRECT!!


In this blog post, I explained what the Transformer model is, introduced its modified versions BERT and Longformer, and used the SQuAD dataset to solve the question answering task on any text.

We saw another model that uses a modified form of attention to optimize the performance of the traditional Transformer architecture. The Longformer provides computational as well as memory efficiency. Moreover, it also provides support for multiple NLP downstream tasks, unlike other long document Transformer architectures.

The advantage of this question answering system is that it reliably captures answers that exist in the context. In the “no answer” case, where the answer doesn’t exist in the context, the system will either predict “no answer” or predict an answer with a low probability, so you can detect and discard it.

The disadvantage of this system is that, in reality, not all documents and answers relate to a question in such a direct way, so domain-specific lexical knowledge is still necessary for many questions, and the SQuAD dataset doesn’t cover all the cases.
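One way to discard low-probability answers is a simple threshold on the top prediction, sketched below. The function name and the 0.5 threshold are hypothetical, and the dictionary layout (an "answer" list paired with a "probability" list, both ordered best first) is an assumption about the prediction output that you should check against your installed version of the library.

```python
def pick_answer(answer_entry, probability_entry, min_probability=0.5):
    """Return the top answer, or None when it is empty or too unlikely."""
    candidates = answer_entry["answer"]
    scores = probability_entry["probability"]
    if not candidates or not scores:
        return None
    top_answer, top_score = candidates[0], scores[0]
    # An empty string is treated as the model saying "no answer".
    if top_answer.strip() == "" or top_score < min_probability:
        return None
    return top_answer


# Hand-written stand-ins for one entry of the prediction output.
confident = pick_answer({"id": "0", "answer": ["Google", ""]},
                        {"id": "0", "probability": [0.9, 0.05]})
unsure = pick_answer({"id": "1", "answer": ["Paris", ""]},
                     {"id": "1", "probability": [0.2, 0.1]})
empty = pick_answer({"id": "2", "answer": ["", "Rome"]},
                    {"id": "2", "probability": [0.8, 0.1]})
print(confident, unsure, empty)
```

The right threshold depends on your data, so it is worth tuning it on questions whose answers you already know.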

For future improvement, we still need to enrich a language model with our domain knowledge and fit the model to our questions and documents.
