can python edit the metadata of a pdf file

3 min read 07-12-2024

Can Python Edit the Metadata of a PDF File? Yes, and Here's How

Python, with its rich ecosystem of libraries, offers powerful capabilities for manipulating PDF files, including editing their metadata. This article explores how you can leverage Python to modify various aspects of a PDF's metadata, such as author, title, subject, and keywords. We'll cover several approaches, highlighting their strengths and weaknesses.

Understanding PDF Metadata

Before diving into the code, it's crucial to understand what PDF metadata entails. Metadata is data about the PDF file itself, not the content within the document. It includes information like:

Title: The title of the document.
Author: The author(s) of the document.
Subject: A brief description of the document's topic.
Keywords: Relevant keywords associated with the document.
Creation Date: The date the document was created.
Modification Date: The date the document was last modified.
Producer: The software used to create the PDF.
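
The two date fields are stored as strings in a PDF-specific format: `D:YYYYMMDDHHmmSS` followed by a UTC offset such as `+02'00'`, or `Z` for UTC. As a minimal sketch of what such a value looks like, the standard-library helper below builds one from a `datetime`. The name `to_pdf_date` is made up for this illustration, and treating naive datetimes as UTC is an assumption of the sketch, not a rule of the format.

```python
from datetime import datetime, timedelta, timezone

def to_pdf_date(dt: datetime) -> str:
    """Format a datetime as a PDF date string such as D:20241207093000+02'00'."""
    if dt.tzinfo is None:
        # Assumption for this sketch: naive datetimes are treated as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    base = dt.strftime("D:%Y%m%d%H%M%S")
    total_minutes = int(dt.utcoffset().total_seconds() // 60)
    if total_minutes == 0:
        return base + "Z"
    sign = "+" if total_minutes > 0 else "-"
    hours, minutes = divmod(abs(total_minutes), 60)
    # Offset hours and minutes are each followed by an apostrophe.
    return f"{base}{sign}{hours:02d}'{minutes:02d}'"

# Example: 09:30 on 7 December 2024 in a UTC+2 time zone
stamp = to_pdf_date(datetime(2024, 12, 7, 9, 30, tzinfo=timezone(timedelta(hours=2))))
print(stamp)
```

A string in this form is what a library would expect if you chose to set the creation or modification date by hand.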

Methods for Editing PDF Metadata with Python

Several Python libraries can handle PDF metadata editing. We'll focus on two popular choices: PyPDF2 and pikepdf.

Method 1: Using PyPDF2

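PyPDF2 is a widely used, pure-Python library for working with PDFs. (Development has since continued under the name pypdf, but PyPDF2 remains installable.) The usual pattern is to copy the pages into a new PdfWriter and attach the metadata to that writer. Because the writer builds a new file rather than editing the original in place, fields you do not set are not carried over from the input automatically, which makes selectively modifying or removing existing entries less convenient than with pikepdf.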

import PyPDF2

def edit_pdf_metadata_pypdf2(input_path, output_path, metadata):
    """Writes a copy of the PDF with the given metadata using PyPDF2.

    Only title, author, subject, and keywords are handled; keys are case-insensitive.
    """
    try:
        with open(input_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            writer = PyPDF2.PdfWriter()

            # Copy pages
            for page in reader.pages:
                writer.add_page(page)

            # Map the supplied keys to Info-dictionary names such as "/Title",
            # skipping fields the caller did not provide
            supplied = {k.lower(): v for k, v in metadata.items()}
            info = {
                f"/{name.capitalize()}": supplied[name]
                for name in ("title", "author", "subject", "keywords")
                if name in supplied
            }
            writer.add_metadata(info)

            with open(output_path, 'wb') as output_file:
                writer.write(output_file)
        print(f"Metadata updated successfully. Saved to {output_path}")

    except FileNotFoundError:
        print(f"Error: File not found at {input_path}")
    except Exception as e:
        print(f"An error occurred: {e}")


#Example usage
metadata = {
    "Title": "Updated PDF Document",
    "Author": "Updated Author Name",
    "Subject": "Revised Subject",
    "Keywords": "keyword1, keyword2, keyword3"
}

edit_pdf_metadata_pypdf2("input.pdf", "output.pdf", metadata)
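
Inside the PDF, document-information keys are PDF names written with a leading slash (`/Title`, `/Author`), whereas the example dictionary above uses plain keys. PyPDF2's `PdfWriter.add_metadata` accepts a dictionary keyed in the slash form. The sketch below shows one way to normalize keys up front and reject unknown ones; `to_info_keys` and the `KNOWN_FIELDS` table are hypothetical helpers written for this article, not part of PyPDF2.

```python
# Hypothetical lookup table: lower-case field name -> PDF Info-dictionary key
KNOWN_FIELDS = {
    "title": "/Title",
    "author": "/Author",
    "subject": "/Subject",
    "keywords": "/Keywords",
    "creator": "/Creator",
    "producer": "/Producer",
}

def to_info_keys(metadata):
    """Return a copy of `metadata` keyed by PDF names such as "/Title".

    Keys are matched case-insensitively, with or without a leading slash.
    Unknown keys raise KeyError instead of being silently dropped.
    """
    result = {}
    for key, value in metadata.items():
        name = KNOWN_FIELDS.get(key.lstrip("/").lower())
        if name is None:
            raise KeyError(f"Unsupported metadata field: {key!r}")
        result[name] = value
    return result

print(to_info_keys({"Title": "Updated PDF Document", "author": "Updated Author Name"}))
```

Raising on unknown keys is a design choice for this sketch: a typo such as "Titel" surfaces immediately rather than producing a PDF with a missing title.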

Method 2: Using pikepdf

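pikepdf is a Python library built on the C++ qpdf library. It offers lower-level control over PDF structure, including robust metadata editing. Because it opens the existing document and saves a modified copy, individual metadata fields can be set, overwritten, or removed while everything else is preserved.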

import pikepdf

def edit_pdf_metadata_pikepdf(input_path, output_path, metadata):
    """Edits PDF metadata using pikepdf."""
    try:
        with pikepdf.Pdf.open(input_path) as pdf:
            for key, value in metadata.items():
                # docinfo keys are PDF names with a leading slash, e.g. "/Title"
                name = key if key.startswith("/") else f"/{key}"
                pdf.docinfo[name] = value  # sets a new field or overwrites an existing one
            pdf.save(output_path)
        print(f"Metadata updated successfully. Saved to {output_path}")
    except FileNotFoundError:
        print(f"Error: File not found at {input_path}")
    except Exception as e:
        print(f"An error occurred: {e}")

# Example usage (same metadata as above)
edit_pdf_metadata_pikepdf("input.pdf", "output_pikepdf.pdf", metadata)
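
Removing a field amounts to deleting its key from the document information dictionary, which pikepdf exposes as `pdf.docinfo`. The sketch below keeps the deletion logic in a small helper that works on any dictionary-like object, so it can be tried without a PDF at hand. Both `remove_fields` and `strip_pdf_fields` are illustrative names, not pikepdf functions. The sketch only touches the document information dictionary; a PDF may also carry XMP metadata, which is left alone here.

```python
def remove_fields(docinfo, names):
    """Delete the given fields (e.g. "Author") from a docinfo-like mapping.

    Returns the list of field names that were actually present and removed.
    """
    removed = []
    for name in names:
        key = name if name.startswith("/") else f"/{name}"
        if key in docinfo:
            del docinfo[key]
            removed.append(name)
    return removed

def strip_pdf_fields(input_path, output_path, names):
    """Save a copy of the PDF with the named docinfo fields removed."""
    import pikepdf  # imported here so remove_fields can be used without pikepdf

    with pikepdf.Pdf.open(input_path) as pdf:
        removed = remove_fields(pdf.docinfo, names)
        pdf.save(output_path)
    return removed

# Trying the helper on a plain dict that stands in for pdf.docinfo
sample = {"/Title": "Report", "/Author": "Someone", "/Keywords": "a, b"}
print(remove_fields(sample, ["Author", "Keywords", "Subject"]))
print(sample)
```

With a real file, the call would look like `strip_pdf_fields("input.pdf", "output_stripped.pdf", ["Author", "Keywords"])`.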

Remember to install the necessary libraries: pip install PyPDF2 pikepdf

Choosing the Right Library

For simple metadata updates (primarily setting values), PyPDF2 might suffice. However, for more complex scenarios, including removing or modifying existing metadata reliably, pikepdf provides superior capabilities and is generally recommended. pikepdf offers more control and better handles various PDF structures.

Important Considerations

Error Handling: Always include robust error handling (as shown in the examples) to gracefully manage potential issues like file not found errors.
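Permissions: Modifying PDF metadata may be restricted by file permissions or the PDF's security settings. An encrypted PDF generally has to be opened with its password before it can be rewritten, and the output location must be writable.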
File Paths: Ensure you provide the correct paths to your input and output PDF files.
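
The points above can be combined into a small wrapper that checks the paths and keeps a backup copy before any editing function runs. This is a minimal sketch using only the standard library; `edit_with_backup` and the `.bak` naming scheme are assumptions made for illustration, and `edit_func` stands for any function that takes an input path and an output path.

```python
import shutil
from pathlib import Path

def edit_with_backup(input_path, output_path, edit_func):
    """Copy the input to '<name>.bak', then call edit_func(input_path, output_path).

    Returns the path of the backup file.
    """
    source = Path(input_path)
    if not source.is_file():
        raise FileNotFoundError(f"No such PDF: {source}")
    if Path(output_path).resolve() == source.resolve():
        # Refuse to overwrite the file that is being read
        raise ValueError("Write to a different path than the input file")

    backup = source.with_name(source.name + ".bak")
    shutil.copy2(source, backup)  # copy2 also preserves file timestamps

    edit_func(str(source), str(output_path))
    return backup
```

Raising exceptions here, instead of printing a message, lets the calling code decide how to report a missing file or a bad output path.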

This guide provides a solid foundation for using Python to edit PDF metadata. Remember to experiment and adapt these techniques to your specific needs. Always back up your original PDF files before performing any modifications.
