Plain Text Strategy
Extracts Simple Plain Text
Introduction
The plain text strategy is the default strategy used by the SetaPDF-Extractor component and allows you to extract plain text from PDF documents. It is represented by the class SetaPDF_Extractor_Strategy_Plain.
By default, the text items are sorted by the baseline sorter, but a different or custom sorter instance can be passed through the setSorter() method.
The result will be a standard PHP string.
Process
The plain text strategy extracts all defined text items, including their metrics, into a temporary result. The items are taken as they appear in the PDF data stream. This means that several words in a single text item are processed as a whole, while a word split over several text items is processed as several individual items.
This result is then sorted and grouped into lines and orientations (by default via the baseline sorter).
The resulting text string is created by iterating over the sorted and grouped result and comparing each item with the previous one to decide whether the two text items form a continuous segment. This is done by checking for a gap between both items on their ordinate. The size of this gap is defined by the average width of the space character of both text items divided by the factor defined in the $spaceWidthFactor property.
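The gap check described above can be sketched in plain PHP. This is a simplified illustration and not SetaPDF's internal code: the function name joinLineItems(), the array keys 'text', 'start', 'end' and 'spaceWidth', and the default factor of 2.0 are all hypothetical. It also assumes the items already belong to one sorted line, and that a gap larger than the computed size results in a space being inserted.

```php
<?php
// Hypothetical sketch, not part of the SetaPDF API.
// Each item is an array with:
//   'text'       => the extracted characters
//   'start'/'end' => positions of the item along the line (assumed 1-D)
//   'spaceWidth' => width of the space character for this item's font and size
function joinLineItems(array $items, float $spaceWidthFactor = 2.0): string
{
    $result = '';
    $last = null;

    foreach ($items as $item) {
        if ($last !== null) {
            // average space width of both items, divided by the factor
            $averageSpace = ($last['spaceWidth'] + $item['spaceWidth']) / 2;
            $threshold = $averageSpace / $spaceWidthFactor;

            // distance between the end of the last item and the start of this one
            $gap = $item['start'] - $last['end'];

            // assumption: a gap above the threshold separates two words
            if ($gap > $threshold) {
                $result .= ' ';
            }
        }

        $result .= $item['text'];
        $last = $item;
    }

    return $result;
}
```

With a space width of 5 and a factor of 2.0, the threshold in this sketch is 2.5 units: two items that are 0.5 units apart are joined into one word, while items that are 5 units apart are separated by a space.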
If a space between words is "faked" by a character spacing value, this strategy is not able to recognize it as a word separator. The exact plain strategy is able to handle this situation.
Usage
An instance can be created individually or by receiving it from the main class:
// get the default instance
$extractor = new \SetaPDF_Extractor($document);
$plainText = $extractor->getStrategy();

// or create your own
$plainText = new \SetaPDF_Extractor_Strategy_Plain();
$extractor = new \SetaPDF_Extractor($document);
$extractor->setStrategy($plainText);
You can get the string result of this strategy by calling the getResultByPageNumber() method for each individual page.
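A per-page loop could look like the following minimal sketch. The helper extractTextByPage() is hypothetical and only wraps the call named above. It assumes that getResultByPageNumber() is invoked on the extractor instance with a 1-based page number, and that the page count is supplied by the caller.

```php
<?php
// Hypothetical helper, not part of the SetaPDF API.
// $extractor is expected to be a \SetaPDF_Extractor instance using the
// plain text strategy; $pageCount must be provided by the caller.
function extractTextByPage($extractor, int $pageCount): array
{
    $texts = [];

    // assumption: page numbers start at 1
    for ($pageNo = 1; $pageNo <= $pageCount; $pageNo++) {
        // with the plain text strategy each result is a standard PHP string
        $texts[$pageNo] = $extractor->getResultByPageNumber($pageNo);
    }

    return $texts;
}
```

The returned array is keyed by page number, so the text of the first page is available as $texts[1].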
The strategy allows you to pass a filter instance to limit the result, e.g. to a specific area on a page.