Performance Optimizations
Introduction
Merging a large number of PDF documents, or very large ones, is a problematic process in a web environment because we have to deal with several limits: the memory limit, the script time limit, and also operating system limits such as the limit on allowed open file handles/descriptors.
On this page you will find some hints and examples on how to improve a merge process considering these limitations.
Limit of Open Files
Operating systems may have a limit on allowed open files/file descriptors. These limits can be system-wide limits or per-user limits. A nice overview of how to check and change these settings on a Linux system can be found here.
With these limits in mind, it is impossible to open more files than allowed, e.g. with the fopen() function of PHP. The limit also affects standard statements like require or include. So a script will end in an annoying warning, mostly followed by a fatal error:
Warning: fopen(path/to/document.pdf): failed to open stream: Too many open files in /you/script.php on line 123
In version 2.17.0.768 of the SetaPDF-Core component a special reader and a handler class were introduced to overcome this issue: the SetaPDF_Core_Reader_MaxFile and the SetaPDF_Core_Reader_MaxFileHandler class.
The handler instance observes open files/file descriptors and closes them when a specific limit is reached. The reader itself opens a handle when necessary and notifies the handler. As both objects are bound to each other, helper methods are implemented to create a reader instance:
// let's define a maximum of open files
$maxOpenFileHandles = 500;

// create a handler with this value
$handler = new \SetaPDF_Core_Reader_MaxFileHandler($maxOpenFileHandles);

// create a merger instance
$merger = new \SetaPDF_Merger();

// iterate over thousands of PDF files
foreach (glob('several/thousand/pdfs/*.pdf') as $path) {
    // create a reader instance
    $reader = $handler->createReader($path);
    // create a document instance with this reader and pass it to the merger instance
    $merger->addDocument(\SetaPDF_Core_Document::load($reader));
}
Improve Processing Speed
Working with PDF documents requires the components to tokenize a string into several thousand tokens while creating objects and structures from them. Additionally, e.g. in a merge process, all of these objects need to be reassembled and written to a new document. The bottleneck for speed is the large number of operations that need to be done, plus the huge object structures that need to be held in memory.
Garbage Collection
PHP 5.3 was the first version that shipped with a garbage collection mechanism (also known as GC). This mechanism releases memory by searching for unused cyclic references. A detailed description of how this works is available on php.net. The algorithm described there runs only when the root buffer reaches its limit of 10,000 registered zvals.
Especially in situations where the SetaPDF-Merger component has to deal with very large PDF documents, it will reach this limit very often. However, the GC will not find any freeable roots during a merge or save process because they are all still in use. Sadly, this does not prevent it from being executed again and again and again and... Because of this, the GC slows the whole process down; sometimes more than 50% of the time of a large merge and save process is spent by the GC searching for free roots.
So if the script you're executing ends after the merge/save process, you could try to disable the garbage collection through the gc_disable() function to gain a speed boost.
Following is an example that shows the process time with both enabled and disabled GC on 4 documents, each holding 4000 pages:
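A minimal sketch of such a measurement could look like the following; the input and output paths are assumptions for this example:

gc_disable(); // comment this line out to measure with enabled GC

$start = microtime(true);

$merger = new \SetaPDF_Merger();
// add the 4 large documents (paths are placeholders)
foreach (glob('large-documents/*.pdf') as $path) {
    $merger->addFile($path);
}
$merger->merge();

// write the result to a file
$document = $merger->getDocument();
$document->setWriter(new \SetaPDF_Core_Writer_File('merged.pdf'));
$document->save()->finish();

echo 'Process time: ' . round(microtime(true) - $start, 2) . ' seconds';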
As you can see, the performance gain is up to 100% on an old PHP version! And it is still an improvement in PHP 7.
So if you are working with large PDF documents and your script ends after a merge process you could increase the performance by disabling the garbage collection.
BUT: If you need to execute other code after the PDF processing, keep in mind that the memory that was consumed during the PDF process will not be released by the GC at all!
In 2014 the gc_disable() function got big attention because it was used in Composer to gain a great speed boost, too. In this commit you will also find interesting links to articles that explain the behavior of the GC in detail.
Caching
Let's say you have a repository of hundreds of PDF documents and you want your users to create individual compositions from this repository. By default the SetaPDF component has to parse and interpret each document individually before it proceeds with a merge process. But isn't a single document then parsed and interpreted several times, once for each user's composition? Sure! And this can be reduced if you create a cached version of the document instance, e.g. at the moment a PDF document is uploaded to your system. This is possible due to the fact that a SetaPDF_Core_Document instance is serializable. The following demo will save 4 serialized document instances:
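A minimal sketch of this caching step, assuming the document and cache paths:

// create and cache a serialized version of each document instance
foreach (glob('documents/*.pdf') as $path) {
    $document = \SetaPDF_Core_Document::loadByFilename($path);
    $cachePath = 'cache/' . basename($path, '.pdf') . '.ser';
    file_put_contents($cachePath, serialize($document));
}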
In a production system you should do this e.g. at the moment a file is uploaded to your system, to distribute the processing across individual script calls!
You should update your cache data if you update the SetaPDF component.
Now we have cached versions of all documents we want to merge. But let us first merge these documents without the cache to get a feeling for the process time (we already did this some paragraphs above):
This ends at ~2.5 seconds on PHP 5.4 and ~1 second on PHP 7 for 4000 pages. Pretty good, but let's go a step further and use the cached document instances now:
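A sketch of how the cached instances could be used, assuming the cache files created above:

$merger = new \SetaPDF_Merger();
// restore the document instances from the cache instead of re-parsing the PDFs
foreach (glob('cache/*.ser') as $cachePath) {
    /* @var \SetaPDF_Core_Document $document */
    $document = unserialize(file_get_contents($cachePath));
    $merger->addDocument($document);
}
$merger->merge();

// write the merged result
$resultDocument = $merger->getDocument();
$resultDocument->setWriter(new \SetaPDF_Core_Writer_File('merged.pdf'));
$resultDocument->save()->finish();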
With cached document instances we end at ~1.7 seconds on PHP 5.4 and ~0.75 seconds on PHP 7 for 4000 pages.
So caching on an old PHP version is a good idea to speed things up. In PHP 7 this technique will also gain a speed improvement, but it will only be reasonable for a very large number of documents.
The downside of this solution is the fact that the unserialize() function seems to consume much more memory than creating a plain instance. We are still evaluating this problem and hope to find a solution as soon as possible.
Improve Memory Usage
Depending on the number and size of the PDF documents, the memory limit of your PHP process can be reached. Following are two possibilities to overcome such an issue:
Use Temporary Document Instances
By default the resulting PDF is assembled completely in memory, while all other document instances are kept until the document is finished. This can lead to high memory consumption. By creating temporary results you can minimize this. Following is a simple method that takes a writer instance, an array of files, and a limit argument. It will use intermediate document instances: each time $limit files have been added and processed, the intermediate result is written to a temporary file and reloaded. This results in much better memory usage but will surely require a bit more processing time:
public function mergeOptimized(
    \SetaPDF_Core_Writer_FileInterface $writer,
    array $files,
    $limit = 500
) {
    $files = array_reverse($files);
    $merger = new \SetaPDF_Merger();
    $document = $merger->getDocument();
    $count = 0;

    while ($file = array_pop($files)) {
        $merger->addFile($file);

        if ((++$count % $limit) === 0) {
            // merge the files added so far and write the intermediate
            // result to a temporary file
            $merger->merge();
            $tmpWriter = new \SetaPDF_Core_Writer_TempFile();
            $document->setWriter($tmpWriter);
            $document->save()->finish();

            // reload the intermediate result and continue with a fresh merger
            $document = \SetaPDF_Core_Document::loadByFilename($tmpWriter->getPath());
            $merger = new \SetaPDF_Merger($document);

            $prevTmpWriter = $tmpWriter; // keep the temporary file until the next one is written
        }
    }

    $merger->merge();
    $document->setWriter($writer);
    $document->save()->finish();
}
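A possible call of this method could look like the following (the surrounding class context and the file list are assumptions for this example):

// merge thousands of files with an intermediate result every 200 files
$writer = new \SetaPDF_Core_Writer_File('merged.pdf');
$this->mergeOptimized($writer, glob('several/thousand/pdfs/*.pdf'), 200);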
Merging PDF Documents Asynchronously
Sometimes it is impossible to handle a merge process in a single script call because the limits are simply reached and cannot be changed or extended. It is also possible that you want to move the merge process into a background process that is triggered by e.g. a cron job.
It is sadly impossible to create a one-size-fits-all solution for this task because we have to deal with temporary files, and it is up to you how to trigger the process or how to create e.g. a queue. So the following example simply concatenates 8 documents of 1000 pages each over 8 individual script calls. All temporary data will be held in a session variable.
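A minimal sketch of a single script call in such a process; the queue handling, paths, and cleanup logic are assumptions for this example:

session_start();

if (!isset($_SESSION['queue'])) {
    // first call: initialize the queue with the documents to merge
    $_SESSION['queue'] = glob('documents/*.pdf');
    $_SESSION['tmpFile'] = null;
}

$merger = new \SetaPDF_Merger();
// continue with the intermediate result of the previous call
if ($_SESSION['tmpFile'] !== null) {
    $merger->addFile($_SESSION['tmpFile']);
}
// add the next document from the queue
$merger->addFile(array_shift($_SESSION['queue']));
$merger->merge();

$document = $merger->getDocument();
if (count($_SESSION['queue']) === 0) {
    // last call: write the final result and clean up
    $document->setWriter(new \SetaPDF_Core_Writer_File('merged.pdf'));
    $document->save()->finish();
    if ($_SESSION['tmpFile'] !== null) {
        unlink($_SESSION['tmpFile']);
    }
    unset($_SESSION['queue'], $_SESSION['tmpFile']);
} else {
    // intermediate call: write a temporary result for the next call
    $tmpPath = tempnam(sys_get_temp_dir(), 'merge');
    $document->setWriter(new \SetaPDF_Core_Writer_File($tmpPath));
    $document->save()->finish();
    // remove the temporary file of the previous call and remember the new one
    if ($_SESSION['tmpFile'] !== null) {
        unlink($_SESSION['tmpFile']);
    }
    $_SESSION['tmpFile'] = $tmpPath;
}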