Exact Plain Text Strategy Extracts Simple Plain Text By Glyphs
Introduction
The exact plain text strategy is similar to the plain text strategy, but it uses the detail level of the glyph strategy to extract the text. It recreates the resulting text by sorting, comparing and concatenating individual glyphs and ignores existing text items from the PDF document.
Especially in conjunction with a filter, e.g. a rectangle filter, this strategy returns a much more precise result.
The strategy is represented by the SetaPDF_Extractor_Strategy_ExactPlain class.
As with the plain text strategy, the result is a standard PHP string.
Process
The exact plain text strategy makes use of the glyph strategy, which extracts every single glyph, including its metrics, in the order in which it appears in the PDF data stream.
These glyphs are then passed to the same logic that the plain text strategy uses.
If a space between words is "faked" by a character spacing value, this strategy is able to recognize it as a word separator.
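The idea of detecting a "faked" space can be illustrated with a small, self-contained sketch. It is not the library's implementation: the function name joinGlyphs(), the glyph array layout (char, x, width) and the fixed gap threshold are all assumptions made for illustration, and the sketch handles a single line of text only.

```php
<?php
/**
 * Hypothetical sketch, not SetaPDF code: joins the glyphs of one text line
 * and inserts a space wherever the horizontal gap between two neighbouring
 * glyphs is larger than a threshold.
 *
 * Each glyph is an array: ['char' => string, 'x' => float, 'width' => float].
 */
function joinGlyphs(array $glyphs, float $gapThreshold): string
{
    // Sort by left edge so the order in the data stream does not matter.
    usort($glyphs, function (array $a, array $b): int {
        return $a['x'] <=> $b['x'];
    });

    $text = '';
    $previousRight = null;
    foreach ($glyphs as $glyph) {
        // A gap wider than the threshold is treated as a word separator.
        if ($previousRight !== null && $glyph['x'] - $previousRight > $gapThreshold) {
            $text .= ' ';
        }
        $text .= $glyph['char'];
        $previousRight = $glyph['x'] + $glyph['width'];
    }

    return $text;
}

// Glyphs in arbitrary stream order; "c" starts 6 units after "b" ends.
$glyphs = [
    ['char' => 'c', 'x' => 16.0, 'width' => 5.0],
    ['char' => 'a', 'x' => 0.0,  'width' => 5.0],
    ['char' => 'd', 'x' => 21.0, 'width' => 5.0],
    ['char' => 'b', 'x' => 5.0,  'width' => 5.0],
];

echo joinGlyphs($glyphs, 2.0), PHP_EOL; // prints "ab cd"
```

A real implementation would also have to group glyphs into lines and would most likely derive the threshold from the font metrics rather than use a fixed value.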
Usage
An instance has to be created individually and passed to the main class:
$strategy = new \SetaPDF_Extractor_Strategy_ExactPlain();
$extractor = new \SetaPDF_Extractor($document);
$extractor->setStrategy($strategy);
To get the result as a string, call the getResultByPageNumber() method for each individual page.
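A minimal sketch of such a loop could look like the following. It assumes that the SetaPDF autoloader is already registered and uses the placeholder file name document.pdf; the page count is read through the document catalog.

```php
<?php
// Assumption: the SetaPDF autoloader is registered and 'document.pdf'
// (a placeholder name) is a readable PDF file.
$document = \SetaPDF_Core_Document::loadByFilename('document.pdf');

$strategy = new \SetaPDF_Extractor_Strategy_ExactPlain();
$extractor = new \SetaPDF_Extractor($document);
$extractor->setStrategy($strategy);

// Page numbers are 1-based.
$pageCount = $document->getCatalog()->getPages()->count();
for ($pageNo = 1; $pageNo <= $pageCount; $pageNo++) {
    // With this strategy the result is a plain PHP string.
    $text = $extractor->getResultByPageNumber($pageNo);
    echo 'Page ', $pageNo, ':', PHP_EOL, $text, PHP_EOL;
}
```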
The strategy also accepts a filter instance to limit the result, e.g. to a specific area on a page.
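As a sketch, a rectangle filter could be attached as shown below. The coordinates are made-up example values in PDF user space, and the constructor argument order (lower-left x/y, upper-right x/y) as well as the MODE_CONTACT constant are stated here from memory, so they should be checked against the API reference before use.

```php
<?php
// Assumption: the SetaPDF autoloader is registered and 'document.pdf'
// (a placeholder name) is a readable PDF file.
$document = \SetaPDF_Core_Document::loadByFilename('document.pdf');

// Example area only: lower-left (50, 700) to upper-right (300, 800).
$area = new \SetaPDF_Core_Geometry_Rectangle(50, 700, 300, 800);

// MODE_CONTACT is assumed to accept every glyph that touches the area.
$filter = new \SetaPDF_Extractor_Filter_Rectangle(
    $area,
    \SetaPDF_Extractor_Filter_Rectangle::MODE_CONTACT
);

$strategy = new \SetaPDF_Extractor_Strategy_ExactPlain();
$strategy->setFilter($filter);

$extractor = new \SetaPDF_Extractor($document);
$extractor->setStrategy($strategy);

// Only text inside the filtered area of page 1 is returned.
echo $extractor->getResultByPageNumber(1), PHP_EOL;
```

Because the strategy works on individual glyphs, the filter is evaluated per glyph, which is what makes the result more precise than with the plain text strategy.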