However, text extraction is a much more performance- and memory-intensive task, as the component has to keep track of text positions and metrics, which is required to get the best possible results. Because of this, it is a good idea to raise the memory_limit directive and possibly the max_execution_time directive as well. In the end, suitable values depend on the documents you want to process: the more text items a document contains, the more memory and execution time are needed.
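As a minimal sketch, both directives can be raised at runtime for the extraction script only, instead of changing them globally in php.ini. The values below (512M and 120 seconds) are illustrative assumptions, not recommendations from the SetaPDF documentation; appropriate limits depend on your documents and hosting environment.

```php
<?php
// Illustrative values only (assumptions); tune them to the documents you process.

// Raise the memory ceiling for this script.
ini_set('memory_limit', '512M');

// Allow this script to run for up to 120 seconds from this point on.
// This has the same effect as changing max_execution_time at runtime.
set_time_limit(120);
```

Setting the limits per script keeps the stricter defaults in place for the rest of the application. Some hosting setups do not allow these values to be changed at runtime, in which case they have to be adjusted in the server configuration.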
In the current version (2.x), the SetaPDF-Extractor component is only able to extract text, words and glyphs written in LTR (left-to-right) languages. RTL (right-to-left) languages such as Arabic and Hebrew are NOT supported at the moment! The component still allows you to extract the characters, but the result may be in reversed order.
Additionally, there are known issues with text that uses stacked vowels, e.g. Thai. These vowels may be returned in an invalid order and need to be re-ordered "manually".
The SetaPDF-Extractor component makes use of classes and methods of the SetaPDF-Core component. The Extractor component itself is integrated into the same structure and is fully covered by the autoload function of the Core component.
Loading the SetaPDF-Extractor component is as simple as this: