Performance Optimizations

Introduction

Merging a large number of large PDF documents is a problematic process in a web environment because we have to deal with several limits: the memory limit, the script time limit, and also operating system limits such as the maximum number of allowed open file handles/descriptors.

On this page you will find some hints and examples on how to improve a merge process with these limitations in mind.

Limit of Open Files

Operating systems may limit the number of allowed open files/file descriptors. These limits can be system-wide limits or per-user limits. A nice overview of how to check and change these settings on a Linux system can be found here.

With these limits in mind, it is impossible to open more files than defined, e.g. with PHP's fopen() function. The limit also affects standard statements like require or include. So a script will end with an annoying warning, mostly followed by a fatal error:

Warning: fopen(path/to/document.pdf): failed to open stream: Too many open files in /you/script.php on line 123  
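
Before working around the limit in code, you can inspect it yourself. On a typical Linux system the following standard commands show the relevant values (the /proc path is Linux-specific and may differ on other platforms):

```shell
# soft per-process limit for the current shell session
ulimit -n

# hard limit (the ceiling a non-root user may raise the soft limit to)
ulimit -Hn

# system-wide maximum number of open file descriptors (Linux)
cat /proc/sys/fs/file-max

# raise the soft limit for this shell session, e.g. to 4096
ulimit -n 4096
```

Note that `ulimit` only affects the current shell session and its children; permanent changes are usually made in /etc/security/limits.conf or via systemd unit settings.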

In version 2.17.0.768 of the SetaPDF-Core component, a special reader class and a handler class were introduced to overcome this issue: SetaPDF_Core_Reader_MaxFile and SetaPDF_Core_Reader_MaxFileHandler.

The handler instance observes open files/file descriptors and closes them when a specific limit is reached. The reader itself opens a handle when necessary and notifies the handler. As both objects are bound to each other, helper methods are implemented to create a reader instance:

PHP
// let's define a maximum of open files
$maxOpenFileHandles = 500;
// create a handler with this value
$handler = new SetaPDF_Core_Reader_MaxFileHandler($maxOpenFileHandles);
// create a merger instance
$merger = new SetaPDF_Merger();
// iterate over thousands of PDF files
foreach (glob('several/thousand/pdfs/*.pdf') as $path) {
    // create a reader instance
    $reader = $handler->createReader($path);
    // create a document instance with this reader and pass it to the merger instance
    $merger->addDocument(SetaPDF_Core_Document::load($reader));
}

Improve Processing Speed

Working with PDF documents requires the components to tokenize a string into several thousand tokens while creating objects and structures from them. Additionally, e.g. in a merge process, all of these objects need to be reassembled and output to a new document. The bottleneck for speed is the large number of operations that need to be done, plus the huge object structures that need to be held in memory.

The Garbage Collection

PHP 5.3 was the first version that was shipped with a garbage collection mechanism (also known as GC). This mechanism releases memory by searching for unused cyclic references. A detailed description of how this works is available on php.net. The described algorithm will only run once the root buffer reaches the limit of 10,000 registered zvals.
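
This threshold can be observed with a few lines of plain PHP (a minimal illustration, not SetaPDF-specific): each self-referencing object that goes out of scope becomes a possible root, and once 10,000 of them have accumulated, PHP runs the collector on its own.

```php
<?php
// each iteration leaves behind one cyclic reference that only
// the cycle collector can reclaim
for ($i = 0; $i < 25000; $i++) {
    $a = new stdClass();
    $a->self = $a; // cycle: the object references itself
    unset($a);     // refcount stays > 0, so it is buffered as a possible root
}

// while the loop was running, the collector already fired automatically
// every time the root buffer hit its limit of 10,000 entries;
// this call collects whatever is still buffered
echo gc_collect_cycles() . " remaining cycles collected\n";
```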

Especially in situations where the SetaPDF-Merger component has to deal with very large PDF documents, this limit will be reached very often. However, the GC will not find any free roots during a merge or save process because they are all in use. Sadly, this will not prevent it from being executed again and again and again and... Because of this, the GC slows the whole process down. Sometimes more than 50% of the time of a large merge and save process is spent by the GC searching for free roots.

So if the script you're executing ends after the merge/save process, you could try to disable the garbage collection through the gc_disable() function to gain a speed boost.

The following example shows the process time with the GC enabled and disabled when merging 4 documents, each holding 1000 pages:

PHP
<?php
require_once('library/SetaPDF/Autoload.php');

// let's merge 4000 pages with GC enabled
$start = microtime(true);

$merger = new SetaPDF_Merger();
$merger->addFile('files/pdfs/misc/large/1000-black.pdf');
$merger->addFile('files/pdfs/misc/large/1000-red.pdf');
$merger->addFile('files/pdfs/misc/large/1000-green.pdf');
$merger->addFile('files/pdfs/misc/large/1000-blue.pdf');
$merger->merge();
$document = $merger->getDocument();
$document->setWriter(new SetaPDF_Core_Writer_TempFile('files/_temp/'));
$document->save()->finish();

echo '4000 pages assembled in ' . round(microtime(true) - $start, 4) . 
     ' seconds (GC enabled)<br />';

echo 'Memory usage: ' . round(memory_get_usage() / 1024 / 1024, 2) . ' MB<br />';

// let's clean up
unset($merger, $document);
gc_collect_cycles();

// now we disable GC
gc_disable();

// let's merge 4000 pages with GC disabled
$start = microtime(true);

$merger = new SetaPDF_Merger();
$merger->addFile('files/pdfs/misc/large/1000-black.pdf');
$merger->addFile('files/pdfs/misc/large/1000-red.pdf');
$merger->addFile('files/pdfs/misc/large/1000-green.pdf');
$merger->addFile('files/pdfs/misc/large/1000-blue.pdf');
$merger->merge();
$document = $merger->getDocument();
$document->setWriter(new SetaPDF_Core_Writer_TempFile('files/_temp/'));
$document->save()->finish();

echo '4000 pages assembled in ' . round(microtime(true) - $start, 4) . 
     ' seconds (GC disabled)<br />';
echo 'Memory usage: ' . round(memory_get_usage() / 1024 / 1024, 2) . ' MB<br />';

As you can see, the performance gain is up to 100% on an old PHP version! And it is still an improvement in PHP 7.

So if you are working with large PDF documents and your script ends after a merge process, you can increase the performance by disabling the garbage collection.

BUT: If you need to execute other code after the PDF processing, keep in mind that the memory consumed during the PDF process will not be released by the GC as long as it stays disabled!
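
If you do have to run further code afterwards, a pattern that keeps both the speed gain and the memory under control is to disable the GC only for the duration of the PDF work and to trigger one manual collection once it is done (a sketch; the comment marks where your merge/save code would go):

```php
<?php
// switch the garbage collection off for the time-critical part
gc_disable();

// ... the expensive merge/save process happens here ...

// switch it back on and reclaim all cyclic references
// that accumulated while the collector was off
gc_enable();
gc_collect_cycles();
```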

In 2014 the gc_disable() function got big attention because it was used in Composer to gain a great speed boost, too. In this commit you will also find interesting links to articles that explain the behavior of the GC in detail.

Caching

Let's say you have a repository of hundreds of PDF documents and you want your users to create individual compositions from this repository. By default, the SetaPDF component has to parse and interpret each document individually before it proceeds with a merge process. But isn't a single document then parsed and interpreted several times, once for each user's composition? Sure! And this can be reduced if you create a cached version of the document instance, e.g. at the moment a PDF document is uploaded to your system. This is possible due to the fact that the SetaPDF_Core_Document instance is serializable. The following demo will save 4 serialized document instances.

In a production system you should do this e.g. at the moment a file is uploaded to your system, to distribute the processing across individual script calls!

You should update your cache data if you update the SetaPDF component. 

PHP
<?php
require_once('library/SetaPDF/Autoload.php');

// disable the garbage collector
gc_disable();

// define a cache dir for our files
$cacheDir = 'files/cache/merger-demo/';

// get some pdf files
$files = array(
    'files/pdfs/misc/large/1000-black.pdf',
    'files/pdfs/misc/large/1000-red.pdf',
    'files/pdfs/misc/large/1000-green.pdf',
    'files/pdfs/misc/large/1000-blue.pdf'
);

$start = microtime(true);

foreach ($files as $file) {
    // create a cache path 
    $cachePath = $cacheDir . basename($file, '.pdf') . '.cache';

    // create a document instance
    $document = SetaPDF_Core_Document::loadByFilename($file);
    // ensure that all pages are read
    $pages = $document->getCatalog()->getPages();
    $pages->ensureAllPageObjects(); 
        
    // cache a serialized version in the file system
    file_put_contents($cachePath, serialize($document));    
}

echo 'Cache created for 4 PDF documents ' .
     'with a total page count of 4000 pages in ' . 
     round(microtime(true) - $start, 4) . ' seconds.';

Now we have cached versions of all documents we want to merge. But let us try to merge these documents without the cache to get a feeling about the process time (we already did this some paragraphs above):

PHP
<?php
require_once('library/SetaPDF/Autoload.php');

// disable the garbage collector
gc_disable();

$start = microtime(true);

// create a merger instance
$merger = new SetaPDF_Merger();
// add all 4 files
$merger->addFile('files/pdfs/misc/large/1000-black.pdf');
$merger->addFile('files/pdfs/misc/large/1000-red.pdf');
$merger->addFile('files/pdfs/misc/large/1000-green.pdf');
$merger->addFile('files/pdfs/misc/large/1000-blue.pdf');
// merge the documents
$merger->merge();

// save the resulting document 
$document = $merger->getDocument();
$document->setWriter(new SetaPDF_Core_Writer_TempFile('files/_temp/'));
$document->save()->finish();

echo '4000 pages assembled in ' . round(microtime(true) - $start, 4) . 
     ' seconds (GC disabled)<br />';

echo 'Memory usage: ' . round(memory_get_usage() / 1024 / 1024, 2) . ' MB<br />';

This ends at ~2.5 seconds on PHP 5.4 and ~1 second on PHP 7 for 4000 pages. Pretty good, but let's go a step further and use the cached document instances now:

PHP
<?php
require_once('library/SetaPDF/Autoload.php');

// disable the garbage collection
gc_disable();

// define a cache dir for our files
$cacheDir = 'files/cache/merger-demo/';

// get some pdf files
$files = array(
    'files/pdfs/misc/large/1000-black.pdf',
    'files/pdfs/misc/large/1000-red.pdf',
    'files/pdfs/misc/large/1000-green.pdf',
    'files/pdfs/misc/large/1000-blue.pdf'
);

$start = microtime(true);

// create a merger instance
$merger = new SetaPDF_Merger();

// iterate over all files and get their document instances 
// through a cached version.
foreach ($files as $file) {
    // ATTENTION: In a production environment you should 
    //            ensure that the cache is up to date!!
    $cachePath = $cacheDir . basename($file, '.pdf') . '.cache';
    $document = unserialize(file_get_contents($cachePath));
    $merger->addDocument($document); 
}

// merge the documents
$merger->merge();

// save the resulting document 
$document = $merger->getDocument();
$document->setWriter(new SetaPDF_Core_Writer_TempFile('files/_temp/'));
$document->save()->finish();

echo '4000 pages assembled in ' . round(microtime(true) - $start, 4) . 
     ' seconds (GC disabled + cached document instances)<br />';

echo 'Memory usage: ' . round(memory_get_usage() / 1024 / 1024, 2) . ' MB<br />';

With cached document instances we end at ~1.7 seconds on PHP 5.4 and ~0.75 seconds on PHP 7 for 4000 pages.

So on an old PHP version, caching is a good idea to speed things up. In PHP 7 this technique will also gain a speed improvement, but it is only reasonable for a very large number of documents.

The downside of this solution is the fact that the unserialize() function seems to consume much more memory than creating a plain instance. We're still evaluating this problem and hope to find a solution as soon as possible.
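
As noted above, the cached instances have to be kept up to date. A simple guard is to compare modification times and rebuild a cache entry whenever the source PDF is newer than its cache file. The following is a minimal sketch built from the calls used in the demos above; getCachedDocument() is a hypothetical helper, not part of the SetaPDF API, and you should still invalidate the whole cache when you update the component itself:

```php
<?php
// hypothetical helper: returns a document instance for $file,
// rebuilding the serialized cache entry when it is missing or stale
function getCachedDocument($file, $cacheDir)
{
    $cachePath = $cacheDir . basename($file, '.pdf') . '.cache';

    if (!is_file($cachePath) || filemtime($cachePath) < filemtime($file)) {
        // (re-)create the cache entry
        $document = SetaPDF_Core_Document::loadByFilename($file);
        // ensure that all pages are read before serializing
        $document->getCatalog()->getPages()->ensureAllPageObjects();
        file_put_contents($cachePath, serialize($document));

        return $document;
    }

    return unserialize(file_get_contents($cachePath));
}
```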

Merging PDF Documents Asynchronously

Sometimes it is impossible to perform a merge process in a single script call, because the limits are simply reached and cannot be changed or extended. It is also possible that you want to move the merge process into a background process that is triggered e.g. by a cron job.

It is sadly impossible to create an all-purpose solution for this task, because we have to deal with temporary files and it's up to you how to trigger the process or how you create e.g. a queue. So the following example simply concatenates 8 documents of 1000 pages each over 8 individual script calls. All temporary data is held in a session variable.

PHP
<?php
require_once('library/SetaPDF/Autoload.php');

// start a session
session_start();

// disable the garbage collector
gc_disable();

// if this is the first call or if the process should restart:
if (!isset($_SESSION['myTemporaryDocument']) || $_SESSION['myTemporaryDocument'] === '') {
    // we add 8 documents with each holding 1000 pages
    $_SESSION['myFiles'] = array(
        'files/pdfs/misc/large/1000-black.pdf',
        'files/pdfs/misc/large/1000-red.pdf',
        'files/pdfs/misc/large/1000-green.pdf',
        'files/pdfs/misc/large/1000-blue.pdf',
        'files/pdfs/misc/large/1000-black.pdf',
        'files/pdfs/misc/large/1000-red.pdf',
        'files/pdfs/misc/large/1000-green.pdf',
        'files/pdfs/misc/large/1000-blue.pdf'
    );
    
    // let's create the first document instance
    $document = new SetaPDF_Core_Document();
    
// The process is running, so...
} else {
    // initiate a document instance from the last PDF document content
    $document = SetaPDF_Core_Document::loadByString($_SESSION['myTemporaryDocument']);
}

// reset output
$_SESSION['myTemporaryDocument'] = '';

// get the next file from the array
$currentFile = array_shift($_SESSION['myFiles']);
// check how many files are left
$filesLeft = count($_SESSION['myFiles']);
// no files left, let's use a HTTP writer
if ($filesLeft === 0) {
    $writer = new SetaPDF_Core_Writer_Http('4000-async.pdf', true);
// otherwise use a variable writer
} else {
    $writer = new SetaPDF_Core_Writer_Var($_SESSION['myTemporaryDocument']);
}

// initiate a merger instance starting with
// the previously initiated document instance
$merger = new SetaPDF_Merger($document);
// add the next file to the document
$merger->addFile($currentFile);
// merge
$merger->merge();

// set the writer and save
$document->setWriter($writer);
$document->save()->finish();

// if files left, output some content and initiate a reload
if ($filesLeft) {
    if ($filesLeft === 1) {
        echo 'Merging the last document! Download will start...<br />';
    } else {
        echo 'Merging... (' . count($_SESSION['myFiles']) . ' documents left).<br />';
    }
    echo 'Memory usage: ' . round(memory_get_usage() / 1024 / 1024, 2) . ' MB<br />';
    echo '<meta http-equiv="refresh" content="0; async.php?' . time() . '">';
}