Getting started

System Requirements and Installation

Because the Extractor component is based on the SetaPDF-Core component, its system requirements and installation are almost identical.

However, text extraction is a much more performance- and memory-intensive task, because the component has to keep track of text positions and metrics to get the best possible results. It is therefore a good idea to raise the memory_limit directive and possibly the max_execution_time directive. In the end, these settings depend on the documents you want to process: the more text items a document contains, the more memory and execution time are needed.
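As a minimal sketch, both directives can be raised at runtime for the script that runs the extraction. The values 512M and 120 seconds are assumptions chosen for illustration, not recommendations from the manual; measure with your own documents and adjust.

```php
<?php
// Assumed example values: tune them to the documents you actually process.
ini_set('memory_limit', '512M');

// Allow the script to run for up to 120 seconds.
ini_set('max_execution_time', '120');

// ... run the text extraction here ...

// Report the peak memory actually used, to help pick a sensible limit.
printf("Peak memory: %.1f MB\n", memory_get_peak_usage(true) / 1048576);
```

If ini_set() is disabled on your host, the same directives can be set in php.ini or in the web server configuration instead.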

Limitations

In the current version (2.x), the SetaPDF-Extractor component is only able to extract text, words and glyphs written in LTR (left-to-right) languages. RTL (right-to-left) languages such as Arabic and Hebrew are NOT supported at the moment! The component still lets you extract the characters, but the result may be in reversed order.

Additionally, there are known issues with texts that use stacked vowels, e.g. Thai. These vowels may be returned in an invalid order and need to be re-ordered "manually".
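To illustrate what correcting a reversed right-to-left result by hand could look like, the following sketch reverses a UTF-8 string by grapheme clusters, so that combining marks stay attached to their base characters. It assumes the whole string came back reversed, which the component does not guarantee; mixed LTR/RTL content, numbers and punctuation would need real bidirectional handling. The function reverseGraphemes() is a hypothetical helper and not part of SetaPDF.

```php
<?php
/**
 * Reverses a UTF-8 string grapheme cluster by grapheme cluster.
 * Hypothetical helper for illustration, not part of SetaPDF.
 */
function reverseGraphemes(string $text): string
{
    // \X matches one extended grapheme cluster; the u modifier enables UTF-8 mode.
    if (preg_match_all('/\X/u', $text, $matches) === false) {
        // Matching failed (e.g. invalid UTF-8): return the input unchanged.
        return $text;
    }

    return implode('', array_reverse($matches[0]));
}

// The Hebrew word "shalom" in reversed character order, written with
// Unicode escapes so the source stays readable in any editor.
$extracted = "\u{05DD}\u{05D5}\u{05DC}\u{05E9}";

echo reverseGraphemes($extracted), PHP_EOL;
```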

Loading the Component

The SetaPDF-Extractor component makes use of classes and methods of the SetaPDF-Core component. The Extractor component itself is integrated into the same structure and is fully covered by the autoload function of the Core component.

Loading the SetaPDF-Extractor component is as simple as:

PHP
require_once('/absolute/path/to/library/SetaPDF/Autoload.php');

or

PHP
require_once('../relative/path/to/library/SetaPDF/Autoload.php');

If the component is installed via Composer, just use the autoloader from Composer:

PHP
require 'vendor/autoload.php';

Error Handling

Besides the exceptions mentioned in the Core manual, the Extractor component has its own base exception: SetaPDF_Extractor_Exception.
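A minimal sketch of how this exception could be handled is shown below. The file name example.pdf is a placeholder, the Composer autoloader is assumed, and the calls used to load the document and extract a page (SetaPDF_Core_Document::loadByFilename(), the SetaPDF_Extractor constructor and getResultByPageNumber()) are assumptions about the 2.x API that should be checked against the API reference.

```php
<?php
require_once 'vendor/autoload.php';

try {
    // Assumed 2.x API: load a document and extract the text of page 1.
    $document = SetaPDF_Core_Document::loadByFilename('example.pdf');
    $extractor = new SetaPDF_Extractor($document);
    echo $extractor->getResultByPageNumber(1), PHP_EOL;
} catch (SetaPDF_Extractor_Exception $e) {
    // Raised by the Extractor component itself.
    error_log('Extraction failed: ' . $e->getMessage());
} catch (Exception $e) {
    // Anything else, e.g. the exceptions described in the Core manual.
    error_log('Processing failed: ' . $e->getMessage());
}
```

Catching the Extractor exception first keeps extraction-specific failures separate from general document errors, such as a file that cannot be read.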