How to Extract Text from a PDF Using PyMuPDF and Python

Neurond AI
7 min read · Sep 12, 2022

Text Extraction refers to the process of automatically scanning and converting unstructured text into a structured format. It’s one of the most important tasks in natural language processing.

Figure 1 — Extract the text from a document

Reading or scanning many documents manually involves a lot of time and effort, especially when you have to look through thousands of PDF files.

Fortunately, this issue can be easily tackled by programming with the help of the PyMuPDF library.

Installation

We’ll assume that you already have a Python environment (with Python >=3.7). If you are a beginner, please follow this Python — Environment Setup tutorial to set up a proper programming workspace. A virtual environment is preferable since we can manage our Python packages.

We also recommend installing Jupyter Notebook (from Project Jupyter), which is great for showcasing your work because it shows the code and its results side by side.

Let’s dive into PyMuPDF, the library we need for text extraction. You can install it from the terminal.

With pip:

pip install pymupdf

And start using the library by importing the installed module:

import fitz

Bear in mind that the top-level Python import name of the PyMuPDF library is fitz. According to the author, the name is kept for historical reasons. Recent PyMuPDF releases also accept import pymupdf.

Note: In this post, we only work with searchable PDF files. To check whether your PDF is searchable, open it in a PDF reader and try to copy some text or search for a word. A searchable PDF lets you do both, while a scanned PDF does not, because its pages are stored as images. PyMuPDF’s text extraction returns no text for such pages unless OCR is applied first.
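The manual check above can also be approximated in code. The sketch below assumes that a document whose pages yield no extractable characters is a scanned one. The names has_text_layer and min_chars are hypothetical and not part of PyMuPDF. The function accepts any iterable of page objects that expose get_text(), which is what iterating over a document opened with fitz.open() gives you.

```python
def has_text_layer(pages, min_chars=1):
    """Return True if at least one page yields extractable text.

    `pages` is any iterable of objects with a get_text() method,
    for example a document opened with fitz.open().
    `min_chars` is a hypothetical threshold, not a PyMuPDF setting.
    """
    for page in pages:
        # Scanned pages typically return an empty or whitespace-only string
        if len(page.get_text().strip()) >= min_chars:
            return True
    return False
```

With PyMuPDF installed, has_text_layer(fitz.open(path)) returning False suggests that the file at path needs OCR before any text can be extracted.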

Extract Text from PDF

First of all, we need a variable that holds the path to our PDF file. Please replace ‘PATH_TO_YOUR_AWESOME_RESUME_PDF’ with your own path:

my_path = "PATH_TO_YOUR_AWESOME_RESUME_PDF"

Here is an example of our working PDF. This is a typical Resume PDF containing a candidate’s information such as contact details, summary, objective, education, skills, and work experience sections.

Figure 2 — Sample Resume PDF file from an Applicant

Let’s open it with fitz:

doc = fitz.open(my_path)

Here, doc is an instance of PyMuPDF’s Document class and represents the whole document. We will get all the information we need from it, including the text. To extract the text, run the following in your Jupyter notebook or Python file:

for page in doc:
    text = page.get_text()
    print(text)

For a multi-page document, the loop visits every page and extracts its plain text. Here is the result when we print the output:

Figure 3 — The output text from PyMuPDF
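If you would rather keep the text than print it, the same loop can fill a dictionary. This is only a small sketch: text_by_page is a hypothetical helper name, and it relies on nothing beyond the get_text() call already used above.

```python
def text_by_page(pages):
    """Map 1-based page numbers to the plain text of each page.

    `pages` is any iterable of objects with a get_text() method,
    such as the `doc` object opened with fitz.open().
    """
    return {number: page.get_text() for number, page in enumerate(pages, start=1)}
```

Calling text_by_page(doc) then lets you look up a single page, for example the first one with text_by_page(doc)[1].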

The output is quite readable, since PyMuPDF returns the text in the order it is stored in the file, which for many documents matches the natural reading order. However, what if you want to separate particular text blocks? You can do that by passing the “blocks” option to the get_text() method.

output = page.get_text("blocks")

The output is a list of tuples. Each item looks like this:

Figure 4 — Screenshot by the Author

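Each tuple has the form (x0, y0, x1, y1, text, block_no, block_type). The values x0, y0, x1, y1 are the coordinates of the block’s bounding box on the page. The next element is the text itself, with its lines separated by newline characters. “block_no” is the block number, and “block_type” indicates whether the block is text (0) or an image (1).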
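Indexing a tuple by position is easy to get wrong, so it can help to give the fields names. The sketch below wraps each tuple in a namedtuple and keeps only the text blocks. Block and text_blocks are hypothetical names, and the sample data is made up to match the tuple layout described above.

```python
from collections import namedtuple

# Field order follows the tuple layout described above
Block = namedtuple("Block", "x0 y0 x1 y1 text block_no block_type")

def text_blocks(raw_blocks):
    """Wrap raw block tuples and keep only text blocks (block_type == 0)."""
    blocks = [Block(*raw) for raw in raw_blocks]
    return [b for b in blocks if b.block_type == 0]

# Made-up sample data in the same shape as page.get_text("blocks")
sample_blocks = [
    (72.0, 70.0, 300.0, 90.0, "JOHN DOE\n", 0, 0),
    (72.0, 100.0, 200.0, 180.0, "<image placeholder>", 1, 1),
    (72.0, 200.0, 500.0, 260.0, "Summary\nData engineer\n", 2, 0),
]

for b in text_blocks(sample_blocks):
    print(b.block_no, repr(b.text))
```

With real data, you would pass page.get_text("blocks") instead of sample_blocks.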

From now on, we only care about the text and the block number. We print a blank line whenever the block number changes, so that each block stands apart in the output:

for page in doc:
    output = page.get_text("blocks")
    previous_block_id = 0  # Block number of the last printed block
    for block in output:
        if block[6] == 0:  # We only take the text (block_type 0)
            if previous_block_id != block[5]:  # Compare the block number
                print("\n")
            print(block[4])
            previous_block_id = block[5]  # Remember the current block

Figure 5 — Extract each text block from a document

You may notice some strange symbols in the output. They appear because the extracted text contains Unicode characters that have no direct ASCII form, while we want plain ASCII. To fix this, we install the Unidecode library (pip install Unidecode) and pass each string through its unidecode function.
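To get a feel for what ASCII transliteration does, here is a rough sketch that uses only the standard library’s unicodedata module. It is not the Unidecode algorithm: to_ascii is a hypothetical helper that decomposes characters and then simply drops whatever is still not ASCII, so symbols such as bullets or typographic quotes disappear instead of being replaced.

```python
import unicodedata

def to_ascii(text):
    """Approximate ASCII transliteration using only the standard library."""
    # NFKD splits accented letters and ligatures into base characters plus marks
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping non-ASCII characters removes the combining marks
    # (and any other symbol that has no ASCII decomposition)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("Café ﬁle"))  # prints "Cafe file"
```

Unidecode goes further by mapping many symbols to readable ASCII replacements, which is why the article uses it for the real extraction.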

from unidecode import unidecode

output = []
for page in doc:
    output += page.get_text("blocks")

previous_block_id = 0  # Set a variable to mark the block id
for block in output:
    if block[6] == 0:  # We only take the text
        if previous_block_id != block[5]:  # Compare the block number
            print("\n")
        plain_text = unidecode(block[4])  # Transliterate to ASCII
        print(plain_text)
        previous_block_id = block[5]  # Remember the current block

Figure 9 — Structure of text blocks

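In PyMuPDF’s dictionary output, a page contains blocks, each text block contains lines, and each line contains spans. A span is a run of text that shares the same font properties. To get the spans from the PDF file, pass the parameter “dict” to the get_text() method of each page in the doc object that we opened before.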

block_dict = {}
page_num = 1
for page in doc:  # Iterate all pages in the document
    file_dict = page.get_text('dict')  # Get the page dictionary
    block = file_dict['blocks']  # Get the block information
    block_dict[page_num] = block  # Store in block dictionary
    page_num += 1  # Increase the page value by 1
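The nesting of blocks, lines, and spans is easier to see on a small example. The sketch below walks a made-up page dictionary that contains only the keys used in this article; real dictionaries returned by get_text("dict") carry more fields. iter_spans is a hypothetical helper name, and the font names are invented for illustration.

```python
def iter_spans(page_dict):
    """Yield every span of every text block in a page dictionary."""
    for block in page_dict["blocks"]:
        if block["type"] != 0:  # Skip image blocks, which carry no "lines"
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                yield span

# Made-up page dictionary, reduced to the keys used in this article
sample_page = {
    "blocks": [
        {"type": 0, "lines": [
            {"spans": [
                {"text": "EDUCATION", "size": 14.0, "font": "Arial-Bold",
                 "bbox": (72.0, 70.0, 160.0, 86.0)},
            ]},
            {"spans": [
                {"text": "BSc in Computer Science", "size": 11.0, "font": "Arial",
                 "bbox": (72.0, 90.0, 210.0, 103.0)},
            ]},
        ]},
        {"type": 1, "bbox": (72.0, 120.0, 200.0, 220.0)},
    ]
}

for span in iter_spans(sample_page):
    print(span["size"], span["font"], span["text"])
```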

The “block_dict” dictionary maps each page number to that page’s list of blocks, which in turn hold the lines and spans. Let’s retrieve the spans and store them in a DataFrame as follows:

import re

import pandas as pd  # pip install pandas
from unidecode import unidecode

rows = []
for page_num, blocks in block_dict.items():
    for block in blocks:
        if block['type'] == 0:  # Text blocks only
            for line in block['lines']:
                for span in line['spans']:
                    xmin, ymin, xmax, ymax = list(span['bbox'])
                    font_size = span['size']
                    text = unidecode(span['text'])
                    span_font = span['font']
                    is_upper = False
                    is_bold = False
                    if "bold" in span_font.lower():
                        is_bold = True
                    # Ignore bracketed parts when checking for upper case
                    if re.sub(r"[\(\[].*?[\)\]]", "", text).isupper():
                        is_upper = True
                    if text.replace(" ", "") != "":  # Skip whitespace-only spans
                        rows.append((xmin, ymin, xmax, ymax, text,
                                     is_upper, is_bold, span_font, font_size))

# Build the DataFrame once, after all spans are collected
span_df = pd.DataFrame(rows, columns=['xmin', 'ymin', 'xmax', 'ymax', 'text',
                                      'is_upper', 'is_bold', 'span_font', 'font_size'])

In short, the code above loops over the pages, blocks, and lines of the document and collects every span in each line. A span has several properties, but we only keep the bbox (the bounding box), size, font, and text. We also record whether the font name contains “bold” and whether the text is in upper case, ignoring anything inside brackets. You can check our result in the image below:

Figure 10 — The Span Dataframe

We can create more features from these, such as a tag for each piece of text. The tag matters because it helps distinguish headings from content.

Figure 11 — Heading and its content in a document

We will define three types of tags: h, p, and s.

  • The ‘h’ tag denotes text that is bigger and more important than normal paragraphs. Text with the ‘h’ tag is usually in upper case and bold.
  • The ‘p’ tag stands for paragraph, the normal content of the document. We find it by counting how often each text size occurs in the document and choosing the size that occurs most often.
  • The ‘s’ tag is used for less important text, which is smaller than ‘p’ text.

Following this idea, we start by gathering all the font sizes and styles in the span DataFrame. We use the term “score” to express the importance of a text span. The base score of a span is its font size, and we increase the score by 1 if the text is in upper case or bold. Note that we skip this increase for text containing special symbols.
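The scoring idea described above can be sketched with the standard library alone. Several details here are assumptions rather than the article’s exact method: the score adds 1 for bold and 1 for upper case separately, the font size is rounded to an integer, the set of “special symbols” in SPECIAL is our own choice, and span_score and tag_spans are hypothetical names. The sample rows are made up and use the same keys as the span DataFrame columns.

```python
import re
from collections import Counter

# Assumption: which characters count as "special symbols" is our own choice
SPECIAL = re.compile(r"[(_:/,#%=@)]")

def span_score(span):
    """Rounded font size, plus 1 each for bold and upper-case text."""
    score = round(span["font_size"])
    if not SPECIAL.search(span["text"]):  # No bonus for text with special symbols
        score += int(span["is_bold"]) + int(span["is_upper"])
    return score

def tag_spans(spans):
    """Return one tag per span: 'h', 'p' or 's'."""
    scores = [span_score(s) for s in spans]
    if not scores:
        return []
    # The most frequent score is taken to be normal paragraph text
    p_score = Counter(scores).most_common(1)[0][0]
    return ["p" if sc == p_score else "h" if sc > p_score else "s" for sc in scores]

# Made-up rows with the same keys as the span DataFrame columns
sample_spans = [
    {"text": "EDUCATION", "font_size": 14.0, "is_bold": True, "is_upper": True},
    {"text": "BSc in Computer Science", "font_size": 11.0, "is_bold": False, "is_upper": False},
    {"text": "University of Somewhere", "font_size": 11.0, "is_bold": False, "is_upper": False},
    {"text": "Page 1", "font_size": 8.0, "is_bold": False, "is_upper": False},
    {"text": "TEL: 555-0100", "font_size": 11.0, "is_bold": False, "is_upper": True},
]

print(tag_spans(sample_spans))  # prints ['h', 'p', 'p', 's', 'p']
```

With the DataFrame built earlier, span_df.to_dict("records") produces rows in this shape, so tag_spans can be applied to them directly.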

After some further processing based on font size, location, and font information, we can extract a decent result, shown below.

Figure 11 — All the headings and their content
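The article does not show the processing behind this result. One simple way to get a similar grouping is sketched below, assuming each span already carries an ‘h’, ‘p’, or ‘s’ tag and the spans arrive in reading order. group_by_heading is a hypothetical helper, and the sample data is made up.

```python
def group_by_heading(tagged_spans):
    """Group text under the most recent heading.

    `tagged_spans` is an iterable of (tag, text) pairs in reading order.
    Text that appears before the first heading is stored under the key "".
    """
    sections = {}
    current = ""
    for tag, text in tagged_spans:
        if tag == "h":
            current = text  # A heading opens a new section
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(text)
    return sections

# Made-up tagged spans
sample_tagged = [
    ("h", "EDUCATION"),
    ("p", "BSc in Computer Science"),
    ("s", "2015 to 2019"),
    ("h", "SKILLS"),
    ("p", "Python, SQL"),
]

for heading, content in group_by_heading(sample_tagged).items():
    print(heading, content)
```

This treats every ‘h’ span as a new section and ignores positions on the page, so it is only a starting point for documents with nested or multi-line headings.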

Deal with Multi-columns Document

Documents do not always come in a neat single column, and many have two or more. Fortunately, PyMuPDF returns the text of each block separately, which usually lets us read the columns one by one.

Figure 12— Reading two columns document with PyMuPDF
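When the blocks do not arrive in the order you want, you can reorder them yourself using their coordinates. The sketch below is a simple heuristic, not a PyMuPDF feature: it assumes exactly two columns split at the middle of the page, and sort_two_columns is a hypothetical helper. The sample tuples are made up and follow the “blocks” layout shown earlier, with 595 standing in for the page width in points.

```python
def sort_two_columns(blocks, page_width):
    """Order block tuples left column first, then right, each top to bottom."""
    middle = page_width / 2

    def key(block):
        x0, y0, x1 = block[0], block[1], block[2]
        # A block belongs to the column that contains its horizontal center
        column = 0 if (x0 + x1) / 2 < middle else 1
        return (column, y0, x0)

    return sorted(blocks, key=key)

# Made-up blocks on a page that is 595 points wide
sample_columns = [
    (320.0, 80.0, 560.0, 200.0, "Right top\n", 0, 0),
    (40.0, 80.0, 280.0, 200.0, "Left top\n", 1, 0),
    (320.0, 220.0, 560.0, 300.0, "Right bottom\n", 2, 0),
    (40.0, 220.0, 280.0, 300.0, "Left bottom\n", 3, 0),
]

for block in sort_two_columns(sample_columns, page_width=595):
    print(block[4].strip())
```

With a real page, page.rect.width gives the page width. Full-width elements such as a title spanning both columns would be assigned to one column by this heuristic, so treat it as a starting point.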

Conclusion

We’ve walked you through how PyMuPDF and Python help us with text extraction. The method frees you from copying single text lines manually or using a PDF reader. Hundreds of documents can be auto-extracted and organized in a structured format.

There is still a lot of work to do, such as processing scanned PDF files. This requires OCR (Optical Character Recognition) to read and extract the text from images.

Read original and latest article at: https://www.neurond.com/blog/extract-text-from-pdf-pymupdf-and-python

NeurondAI is a transformation business.

Website: https://www.neurond.com/
