Corrupted Documents

Read and repair corrupted PDF documents

Introduction

Adobe published the complete PDF specification in 1993. Since then, many implementations of the specification have been written. Some follow it closely and some simply do not.

The most annoying implementations are those that write invalid keywords or misinterpret the specification, which can result, for example, in cross-reference tables whose object numbers are shifted by one.

Equally (or even more) annoying is that most PDF readers repair such files automatically without flagging them as corrupt, which has allowed corrupted PDF documents to spread.

How SetaPDF Handles Corrupted Documents

The SetaPDF-Core component can also repair corrupted PDF documents without user interaction. It cannot repair every error, but common issues, such as corrupted cross-references or invalid data before the file header, are resolved automatically.

If the document is saved afterwards, it is rewritten from scratch and a new, valid document structure is created.

This process can be triggered by simply loading and saving a PDF document:

PHP
try {
    $writer = new \SetaPDF_Core_Writer_File('repaired.pdf');
    $document = \SetaPDF_Core_Document::loadByFilename('corrupted.pdf', $writer);
    $document->save()->finish();
} catch (\Exception $e) {
    echo 'This file is not repairable!';
}

If you save a corrupted document, the save() method ignores the $method parameter and automatically uses SetaPDF_Core_Document::SAVE_METHOD_REWRITE.

Corrupted Cross-Reference Table

If the internal parser encounters errors while reading the cross-reference table, it falls back to another parser that scans the whole file for object byte offsets.
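To illustrate the general idea of such a fallback scan, here is a minimal sketch. It is not SetaPDF's internal parser: the function name scanObjectOffsets() is hypothetical, and the approach (a regular expression over the raw bytes) is an assumption chosen for brevity. It looks for "<number> <generation> obj" markers and records the byte offset of each one. A real implementation would also have to cope with false matches inside stream data.

```php
<?php
/**
 * Hypothetical helper (not part of SetaPDF): scans raw PDF bytes for
 * indirect object markers and returns their byte offsets.
 *
 * @param string $data Raw PDF file content
 * @return array<string, int> Map of "objectNumber generation" => byte offset
 */
function scanObjectOffsets(string $data): array
{
    $offsets = [];

    // Match "<num> <gen> obj" at the start of the data or after whitespace.
    // Without the "u" modifier the captured offsets are byte offsets.
    preg_match_all(
        '/(?<!\S)(\d+)\s+(\d+)\s+obj\b/',
        $data,
        $matches,
        PREG_SET_ORDER | PREG_OFFSET_CAPTURE
    );

    foreach ($matches as $match) {
        $key = $match[1][0] . ' ' . $match[2][0];
        // A later definition overwrites an earlier one, so the last
        // occurrence in the file wins (as with incremental updates).
        $offsets[$key] = $match[0][1];
    }

    return $offsets;
}
```

Calling scanObjectOffsets(file_get_contents('corrupted.pdf')) would return an array keyed by strings such as "1 0", which could stand in for a damaged cross-reference table in this simplified model.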

Invalid File Header

If the PDF file begins with invalid data before the file header, the parser estimates an offset value, which is then applied whenever it jumps to an individual byte offset position.
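The offset correction can be sketched as follows. This is only an illustration under stated assumptions, not SetaPDF's actual code: the function names detectHeaderShift() and resolveOffset() and the 1024-byte search window are hypothetical. The sketch assumes that the byte offsets stored in the file are relative to the "%PDF-" header, so the number of junk bytes in front of the header has to be added to each of them.

```php
<?php
/**
 * Hypothetical helper: returns the number of bytes that precede the
 * "%PDF-" header within the first $searchLimit bytes of the data.
 */
function detectHeaderShift(string $data, int $searchLimit = 1024): int
{
    $position = strpos(substr($data, 0, $searchLimit), '%PDF-');
    if ($position === false) {
        throw new RuntimeException('No PDF file header found.');
    }

    return $position;
}

/**
 * Hypothetical helper: translates a byte offset stored in the file
 * (assumed to be relative to the header) into a real file position.
 */
function resolveOffset(int $storedOffset, int $shift): int
{
    return $storedOffset + $shift;
}
```

With a shift obtained from detectHeaderShift(), every stored byte offset is passed through resolveOffset() before the reader seeks to it. For a file without leading junk the shift is 0 and the offsets stay unchanged.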