Strategies

Overview

To offer as much flexibility as possible, the extraction process is not built into the main class but outsourced to individual strategy classes. A strategy instance has to be passed to the main class via the setStrategy() method.

The strategies let you control the level of detail of the extracted data (currently only text). The default strategy returns plain text only, while the results of the other strategies also include position details for their individual result type.

Currently there are 5 strategies available:

Result Types

Each extraction strategy returns an individual result type.

There are two base result types: string and SetaPDF_Extractor_Result_Collection (the individual result types of the strategies extend this type). If a filter with an $id is in use, a string result is converted to an array of strings in which the $id is used as the key. For collection results, the $id is instead forwarded to the Glyph or Word instances. While string is a standard PHP data type, the Collection result is something special:

The Collection is a container for individual results. Besides several PHP interfaces (Iterator, ArrayAccess and Countable), it also implements the SetaPDF_Extractor_Result_HasBoundsInterface interface. This interface allows you to get the outermost bounding box of all items in the result:

Description

Get the outer-most bounds of all items in this collection.

This method will only return values of non-rotated items.
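
The idea of an outer-most bounding box can be sketched in plain PHP. This is only an illustration, not the library's implementation: the array keys (llx, lly, urx, ury, rotated) and the function name outerBounds() are assumptions made for this example, and the sketch reads "only non-rotated items" as "rotated items are skipped".

```php
<?php
// Hypothetical item format: lower-left (llx, lly) and upper-right (urx, ury)
// corners in page coordinates, plus a flag marking rotated items.
function outerBounds(array $items): ?array
{
    $bounds = null;
    foreach ($items as $item) {
        // Assumption: rotated items do not contribute to the bounds.
        if ($item['rotated']) {
            continue;
        }
        if ($bounds === null) {
            // The first non-rotated item defines the initial box.
            $bounds = [
                'llx' => $item['llx'],
                'lly' => $item['lly'],
                'urx' => $item['urx'],
                'ury' => $item['ury'],
            ];
            continue;
        }
        // Grow the box so that it also encloses the current item.
        $bounds['llx'] = min($bounds['llx'], $item['llx']);
        $bounds['lly'] = min($bounds['lly'], $item['lly']);
        $bounds['urx'] = max($bounds['urx'], $item['urx']);
        $bounds['ury'] = max($bounds['ury'], $item['ury']);
    }
    return $bounds; // null if there was no non-rotated item
}

$items = [
    ['llx' => 10.0, 'lly' => 700.0, 'urx' => 60.0, 'ury' => 712.0, 'rotated' => false],
    ['llx' => 65.0, 'lly' => 698.0, 'urx' => 120.0, 'ury' => 712.0, 'rotated' => false],
    ['llx' => 300.0, 'lly' => 100.0, 'urx' => 312.0, 'ury' => 180.0, 'rotated' => true],
];
$bounds = outerBounds($items);
```

In this example the rotated third item is ignored, so the resulting box spans only the first two items.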

The Glyph strategy will return an instance of SetaPDF_Extractor_Result_Collection.  

The Word strategy will return an instance of SetaPDF_Extractor_Result_Words which is a collection of several SetaPDF_Extractor_Result_Word items.

The Word Group strategy will return an instance of SetaPDF_Extractor_Result_WordGroups which is a collection of several SetaPDF_Extractor_Result_Words items holding several instances of SetaPDF_Extractor_Result_Word.

Words Result

The Words result instance is not only a collection of words. It also comes with a getString() method and a __toString() implementation, which return a string built by concatenating all word strings and their specific delimiter characters (space or new-line) to recreate a readable text version:

Description
public SetaPDF_Extractor_Result_Words::getString (
[ string $encoding = 'utf-8' ]
): string

Get the words string value in a specific encoding.

Parameters
$encoding : string
 

A more advanced method is getStringAndOffsets(), which getString() also uses internally. It returns the assembled string together with the offsets of the individual word instances in that string. The indexes of these offsets are the indexes of the words in the collection itself:

Description
public SetaPDF_Extractor_Result_Words::getStringAndOffsets (
[ string $encoding = 'utf-8' ]
): array{string: string, offsets: array<int>}

Get the words string value including offset positions of the words.

Parameters
$encoding : string
 
Return Values

Offsets are only returned if $encoding is set to UTF-8.
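
How an assembled string and its word offsets relate can be sketched without the library. The following is a minimal illustration under stated assumptions, not SetaPDF's implementation: the input format (a list of [text, delimiter] pairs), the choice to append the delimiter after each word, and the function name buildStringAndOffsets() are all hypothetical. Offsets are counted in bytes here, which is also what preg_match_all() reports with PREG_OFFSET_CAPTURE.

```php
<?php
// Hypothetical input: each word is a [text, delimiter] pair. The delimiter
// ("", " " or "\n") is assumed to follow the word in the assembled string.
function buildStringAndOffsets(array $words): array
{
    $string = '';
    $offsets = [];
    foreach ($words as $index => [$text, $delimiter]) {
        // Store the byte offset at which this word starts, keyed by the
        // word's own index in the input list.
        $offsets[$index] = strlen($string);
        $string .= $text . $delimiter;
    }
    return ['string' => $string, 'offsets' => $offsets];
}

$result = buildStringAndOffsets([
    ['Invoice', ' '],
    ['No.', ' '],
    ['12345', "\n"],
    ['Total', ''],
]);
// $result['string']  is "Invoice No. 12345\nTotal"
// $result['offsets'] is [0, 8, 12, 18]
```

Because each offset is stored under the index of its word, a position in the string can later be traced back to the word it belongs to.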

The methods above lead us to the search() method. It accepts a regular expression, which is matched against the string representation of all words. The offsets captured by the preg_match_all() call are then used to resolve the matched words:

Description

Searches by a regular expression on the string version of the words.

Parameters
$regex : string
 
Return Values

The collection will hold SetaPDF_Extractor_Result_Words instances.
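
Resolving regular expression matches back to words can be illustrated in plain PHP. This is a sketch under stated assumptions rather than the library's code: the helper findWordIndexes(), its parameters and the sample data are hypothetical. It relies only on preg_match_all() with the PREG_OFFSET_CAPTURE flag, which reports the byte offset of every match.

```php
<?php
// Hypothetical helper: for every regex match, return the indexes of all
// words whose text overlaps the matched byte range.
// $offsets[$i] is the byte offset at which word $i starts in $string,
// $lengths[$i] is the byte length of word $i (without its delimiter).
function findWordIndexes(string $regex, string $string, array $offsets, array $lengths): array
{
    // preg_match_all() returns 0 (no match) or false (error): nothing to resolve.
    if (!preg_match_all($regex, $string, $matches, PREG_OFFSET_CAPTURE)) {
        return [];
    }

    $results = [];
    foreach ($matches[0] as [$matchText, $matchStart]) {
        $matchEnd = $matchStart + strlen($matchText);
        $hit = [];
        foreach ($offsets as $index => $wordStart) {
            $wordEnd = $wordStart + $lengths[$index];
            // Two half-open ranges overlap if each starts before the other ends.
            if ($wordStart < $matchEnd && $matchStart < $wordEnd) {
                $hit[] = $index;
            }
        }
        $results[] = $hit; // one list of word indexes per match
    }
    return $results;
}

$string  = "Invoice No. 12345\nTotal";
$offsets = [0, 8, 12, 18];
$lengths = [7, 3, 5, 5];

$found = findWordIndexes('/No\.\s+\d+/', $string, $offsets, $lengths);
// $found is [[1, 2]]: the single match covers the words "No." and "12345"
```

Each inner list plays the role of one group of matched words. A match that spans several words resolves to all of them, and a match that covers only part of a word still resolves to that whole word.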


Encoding

By default, all strategies return resolved text and/or characters in UTF-8 encoding.

A string result can be converted to another encoding by using the Encoding class of the Core component:

PHP
// Convert the extracted string from UTF-8 to UTF-16BE
$result = \SetaPDF_Core_Encoding::convert($result, 'UTF-8', 'UTF-16BE');

Object result types such as Words or Glyphs let you get their string value in a specific encoding by passing the encoding as an argument to the getString() method:

PHP
$string = $word->getString('UTF-16BE');

Sorters

The SetaPDF-Extractor component extracts text based on its rendered position on a PDF page and not on its position/definition in a page's content stream. To sort the individual items (glyphs, words, fragments), all available strategies make use of sorter classes.

All sorters first group the items into rotation groups (defined by their rotation value). This means that a rotated item is never placed in the same group as a non-rotated item or an item with a different rotation value.

By default, a strategy uses the Baseline sorter class. This class sorts all text items by their baseline value. It treats an item as being on a new line if its baseline value differs by more than 0.7pt.
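
The baseline rule can be sketched in plain PHP. This is an illustration under stated assumptions, not the Baseline sorter's actual implementation: the item format ('text', 'x', 'baseline'), the function name groupByBaseline(), the comparison against the first item of each line, and the use of PDF page coordinates (y growing upwards) are all choices made for this example. Only the 0.7pt threshold is taken from the description above. The arrow functions require PHP 7.4 or later.

```php
<?php
// Hypothetical items: 'text', 'x' (left edge) and 'baseline' (y value) in points.
// Assumes PDF page coordinates, in which y grows upwards.
function groupByBaseline(array $items, float $tolerance = 0.7): array
{
    // Top of the page first: higher baseline values come first.
    usort($items, fn(array $a, array $b) => $b['baseline'] <=> $a['baseline']);

    $lines = [];
    $current = [];
    $lineBaseline = null;
    foreach ($items as $item) {
        // Close the current line once the baseline differs from the line's
        // first item by more than the tolerance.
        if ($lineBaseline !== null && abs($lineBaseline - $item['baseline']) > $tolerance) {
            $lines[] = $current;
            $current = [];
            $lineBaseline = null;
        }
        if ($lineBaseline === null) {
            $lineBaseline = $item['baseline'];
        }
        $current[] = $item;
    }
    if ($current !== []) {
        $lines[] = $current;
    }

    // Within each line, order the items from left to right.
    foreach ($lines as &$line) {
        usort($line, fn(array $a, array $b) => $a['x'] <=> $b['x']);
    }
    unset($line);

    return $lines;
}

$lines = groupByBaseline([
    ['text' => 'World', 'x' => 50.0, 'baseline' => 700.2],
    ['text' => 'Hello', 'x' => 10.0, 'baseline' => 700.0],
    ['text' => '2',     'x' => 80.0, 'baseline' => 703.5], // raised, e.g. a superscript
    ['text' => 'Next',  'x' => 10.0, 'baseline' => 686.0],
]);

// Reduce every line to its texts for inspection.
$texts = array_map(fn(array $line) => array_column($line, 'text'), $lines);
// $texts is [['2'], ['Hello', 'World'], ['Next']]
```

"Hello" and "World" differ by only 0.2pt and share a line, while the raised "2" lies 3.3pt above them and ends up on a line of its own with this strict rule.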

The other sorter class is the FlexLine sorter class. It tries to estimate whether two items are part of the same line by taking things like height and font size into consideration. For example, it allows you to keep sub- or superscripts on the same line.

A sorter instance can be passed to a strategy through the setSorter() method:

PHP
$sorter = new \SetaPDF_Extractor_Sorter_FlexLine();
$strategy->setSorter($sorter);