How to Extract Data from PDF using Python? (Text & Images)


Introduction

PDF (Portable Document Format) is one of the world’s most popular and widely used file formats. PDF documents are known for their cross-platform portability, their ability to retain visual integrity by displaying a consistent layout regardless of hardware and software, their support for embedding various kinds of multimedia, and their storage efficiency.

However, it is not easy to edit a PDF’s contents, because the format is designed mainly for content preservation. Adobe Systems, the creator of the file format, designed it to be used reliably in digital environments independent of the hardware, software, and operating system involved.

Extracting information from a PDF is also challenging, especially for scanned or non-searchable PDFs. In this article, we’ll learn how to extract various kinds of data from PDFs in a structured format using Python, and pick up some insights about editing them along the way. We cover text extraction and image processing from a PDF using Python libraries.

Python for processing PDF

You might wonder why you should use Python when many online and offline tools can process PDFs. Most of the reliable tools require a premium subscription, and using free tools always poses a security risk; vulnerabilities have been identified in many PDF readers. So a small program is a simple and safe way to get this task done.

Python is a high-level, object-oriented programming language with very useful text-analytics libraries and frameworks, which makes it a natural choice for PDF processing. Popular Python libraries for PDF processing include PyPDF2, pdfminer, textract, PyMuPDF, pdf2docx, borb, and pdf2image.

PyPDF2 is one of the most widely used libraries for operating on PDFs; it is a pure-Python package that supports many operations. PyPDF3 and PyPDF4 forks also exist and are largely the same, and the project has since been merged back into the actively maintained pypdf package.
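If you prefer the maintained pypdf package, the PyPDF2 example shown below needs only an import change. A minimal sketch, assuming pypdf is installed and the same 'h2oai.pdf' file is in the working directory:

%pip install pypdf
from pypdf import PdfReader

# PdfReader accepts a path directly; pages and extract_text() work as in PyPDF2.
reader = PdfReader("h2oai.pdf")
text = "".join(page.extract_text() or "" for page in reader.pages)
print(text)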

Text Extraction from PDF

Here are four Python libraries that make it easy to extract text from a PDF. We will look at them one by one with examples:

PyPDF2

%pip install PyPDF2
import PyPDF2

# Open the PDF file in read-binary mode
with open('h2oai.pdf', 'rb') as pdf_file:
    # Create a PDF reader object
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Get the total number of pages in the PDF document
    num_pages = len(pdf_reader.pages)
    text = ''

    # Loop through each page and extract the text
    for page_num in range(num_pages):
        page_obj = pdf_reader.pages[page_num]
        page_text = page_obj.extract_text()
        text += page_text
        print(f'Text from page {page_num + 1}:\n{page_text}\n')

You can use Google Colaboratory, Jupyter, or VS Code to run the code, but make sure that the required PDF file is in the working directory. Install and import the PyPDF2 library.

The line with open('h2oai.pdf', 'rb') as pdf_file: opens the file (‘h2oai.pdf’ in my case) in read-binary mode, and the with statement ensures the file is closed when the block completes. The PdfReader object then reads and interacts with the file. We get the number of pages and loop through each page to extract its text.

Get the page object for each page, extract its text with the extract_text() method, and append it to the string “text” as we loop over the pages. The print statement prints the page number and the text extracted from each page.
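As a small follow-up, you may want to persist the combined text. A minimal sketch (the output filename 'h2oai_text.txt' is just illustrative):

# Save the combined text collected in the loop above to a UTF-8 text file.
with open("h2oai_text.txt", "w", encoding="utf-8") as out_file:
    out_file.write(text)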

pdfminer

%pip install pdfminer.six
from pdfminer.high_level import extract_text

text = extract_text("h2oai.pdf")
print(text)

The code imports extract_text from the pdfminer.high_level module. We then use the extract_text() function to extract all the text from the specified file and store it in the variable ‘text’.
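extract_text() can also limit extraction to specific pages. A minimal sketch, assuming the pdfminer.six package and the same file:

from pdfminer.high_level import extract_text

# page_numbers is zero-based; this extracts only the first two pages.
first_pages = extract_text("h2oai.pdf", page_numbers=[0, 1])
print(first_pages)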

PyMuPDF

%pip install PyMuPDF

import fitz 

with fitz.open("h2oai.pdf") as doc:
    text = ""
    for page in doc:
        text += page.get_text()

print(text)

PyMuPDF is an enhanced Python binding for MuPDF, a lightweight PDF, XPS, and e-book viewer, renderer, and toolkit. Import the fitz module from the PyMuPDF library.

Open the file with fitz.open() and assign the resulting document to the variable doc. Iterate through each page, extract its text with get_text(), and append it to the variable text. Finally, the program prints the complete extracted text.

pdfplumber

pdfplumber is a library built on pdfminer.six. You can use it to extract text and tables and to visually debug PDFs. The following code implementation is easy to grasp, as it is similar to the previous examples.

%pip install pdfplumber
# Import required library.
import pdfplumber

text = ""

# Open the file and create a pdf object.
with pdfplumber.open("h2oai.pdf") as pdf:

    # Get the number of pages.
    num_pages = len(pdf.pages)
    print("Number of Pages:", num_pages)

    # Loop through each page and extract the text.
    for page_obj in pdf.pages:
        page_text = page_obj.extract_text() or ""
        text += page_text

print(text)

Here is a benchmark comparing these Python libraries on their accuracy at extracting text:

[Figure: benchmark of the accuracy of various PDF text extraction libraries on 10 different PDFs]

PDF to Image using Python

The above libraries work best when parsing and extracting machine-generated text from PDFs, and they are the right tools for that purpose. But what if the document was scanned, so that you need to process a scanned PDF?

(A scanned PDF consists of pages that have been scanned in some form and then combined and stored as a PDF file.) The above methods do not work for a scanned PDF.

To process a scanned PDF, we first convert the PDF into a separate image for each page and then use Optical Character Recognition (OCR) to extract text from these images. The best-known open-source OCR software is the Tesseract OCR engine, which uses an LSTM (a type of neural network).

OCR vs Text Extraction in PDFs

It is not wise to simply treat every PDF as an image and run OCR each time. Text extraction software like pypdf can use more information from the PDF than just the rendered image: it knows about fonts, encodings, typical character distances, and similar details.

That gives pypdf a clear advantage with characters that are easy to confuse, such as oO0ö. pypdf does not have to guess; it simply reads what is in the file.

To use the pdf2image module, install Poppler by downloading the latest release from GitHub. After downloading, copy the path of its bin folder and add it to the PATH in your system environment variables, or read How to Install Poppler on Windows? If you did not add the path to the environment variables, use the commented-out line below to pass the Poppler path explicitly.

from pdf2image import convert_from_path

# If Poppler's bin folder is not on your PATH, pass it explicitly:
# pages = convert_from_path('h2oai.pdf', 500, poppler_path=r'path\to\poppler_bin')
pages = convert_from_path('h2oai.pdf', 500)

for i, page in enumerate(pages):
    # Save each page as a PNG image at the destination path.
    filepath = r"path\to\Image_file_Destination" + "\\image_" + str(i) + ".png"
    page.save(filepath, "PNG")

We import the convert_from_path function from the pdf2image library. The function converts the PDF into a list of page images; its default resolution is 200 dpi (dots per inch), and here we pass 500 dpi explicitly.

Then we loop over all the pages returned by convert_from_path. We build the file path where we want to save each page image and save the image in PNG format; the filename for each page’s image is generated from the page index. If only a filename is given, the images are saved in the current working directory.
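Once the pages are available as images, you can run OCR on them. Here is a minimal sketch using the pytesseract wrapper (an assumption on my part, since the article does not prescribe a specific OCR library; it also requires the Tesseract engine to be installed separately):

%pip install pytesseract
import pytesseract

# 'pages' is the list of PIL images returned by convert_from_path above.
ocr_text = ""
for i, page in enumerate(pages):
    # image_to_string runs Tesseract on a PIL image and returns the recognized text.
    page_text = pytesseract.image_to_string(page)
    ocr_text += page_text
    print(f"OCR text from page {i + 1}:\n{page_text}\n")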

Table Extraction

The Python library tabula-py extracts tables from PDFs (note that it requires a Java runtime). Import the read_pdf function from the tabula.io module and specify the path to the PDF file. Set pages to “all” to extract tables from every page of the PDF. tables is a Python list of DataFrames, with each DataFrame representing one table extracted from the PDF.

%pip install tabula-py
from tabula.io import read_pdf

path = r"Path\to\PDF_File"
tables = read_pdf(path, pages="all")
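Since each entry in tables is a pandas DataFrame, you can work with the results directly. A small follow-up sketch (the CSV filenames are just illustrative):

for i, df in enumerate(tables):
    # Peek at the first rows of each extracted table.
    print(f"Table {i}:")
    print(df.head())
    # Save each table to its own CSV file.
    df.to_csv(f"table_{i}.csv", index=False)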

URL Extraction

We will use PyPDF2 for URL extraction from a PDF. Here is the code showing how to use it:

import PyPDF2
import re

def url_finder(page_content):
    # Regular expression that matches http/https URLs.
    regex = r"(https?://\S+)"
    urls = re.findall(regex, page_content)
    return urls

url_dict = {}

with open("h2oai.pdf", 'rb') as file:
    readPDF = PyPDF2.PdfReader(file)
    for page_no in range(len(readPDF.pages)):
        page = readPDF.pages[page_no]
        text = page.extract_text()
        url_dict[page_no] = url_finder(text)
        print(f"URLs of page {page_no}: " + str(url_dict[page_no]))

Import PyPDF2 to read the document and the re module, which is used to work with regular expressions in Python. Define the url_finder function, which contains a regular expression that matches URLs.

Then we use the re.findall() function to fetch all matches of our expression in the page_content passed to the function. The function returns the list of all URLs found in the page_content.

Then, similar to the earlier text extraction, we iterate through each page, extract its text, pass it to url_finder to find the URLs on that page, and store them in url_dict.

Image Extraction from PDF

The following Python script uses the PyMuPDF library to open a PDF file, iterate over each page, and extract all images. For each image, it creates an image object using the Python Imaging Library (PIL) and then saves the image to a file.

import fitz
import io
from PIL import Image

file_in_pdf_format = fitz.open("h2oai.pdf")

for page_number in range(len(file_in_pdf_format)):
    page = file_in_pdf_format[page_number]
    img_list = page.get_images()
    if len(img_list) == 0:
        print("There is no image on page", page_number)
        continue
    for img_index, img in enumerate(img_list, start=1):
        # The first element of each tuple is the image's xref in the PDF.
        xref = img[0]
        base_img = file_in_pdf_format.extract_image(xref)
        img_bytes = base_img["image"]
        img_ext = base_img["ext"]
        # Load the raw bytes into a PIL image and save it to disk.
        image = Image.open(io.BytesIO(img_bytes))
        image.save(f"image{page_number + 1}_{img_index}.{img_ext}")

If a page doesn’t contain any images, a message saying so is printed. The filename for each image is generated from the page number and the image index. Replace f"image{page_number + 1}_{img_index}.{img_ext}" with a specific file path to store the images at a particular location, similar to the PDF-to-image implementation.

Conclusion

There are many resources and libraries that provide a wide range of functionality for working with PDF files in Python. This article covered methods of data extraction from PDF files using Python: text, images, tables, and hyperlinks. If you would like to work with and edit an existing file, here is an interesting read: How to Work With a PDF in Python.

Imagine having a virtual assistant at your fingertips that can engage in dynamic conversations and provide insightful answers to complex questions; read PDF GPT: How to Build a Personal PDF Chat Assistant to build one.
