SetaPDF_Extractor_Strategy_WordGroup

Extraction strategy for word groups

File: /SetaPDF v2/Extractor/Strategy/WordGroup.php

The result of this strategy is sorted from top-left to bottom-right.

Each group is represented by an instance of SetaPDF_Extractor_Result_Words.

This class allows you to retrieve a group's boundary through the getBounds() method.
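
The top-left to bottom-right ordering can be pictured with a small, library-independent sketch. It assumes each group is reduced to a plain array holding its left edge (`llx`) and top edge (`ury`) in default PDF user space, where y grows upwards. These array keys and the comparison rule are illustrative assumptions, not the sorter SetaPDF actually uses.

```php
<?php
// Hypothetical group boundaries (not SetaPDF objects): left edge and top edge.
$groups = [
    ['text' => 'Footer', 'llx' => 50.0,  'ury' => 40.0],
    ['text' => 'Right',  'llx' => 300.0, 'ury' => 780.0],
    ['text' => 'Left',   'llx' => 50.0,  'ury' => 780.0],
];

// Higher top edge first (y grows upwards), then smaller left edge first.
usort($groups, function (array $a, array $b): int {
    return [$b['ury'], $a['llx']] <=> [$a['ury'], $b['llx']];
});

$order = array_column($groups, 'text');
echo implode(', ', $order), "\n"; // prints "Left, Right, Footer"
```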

Static Properties

$characters

static public string SetaPDF_Extractor_Strategy_WordGroup::$characters = '-‐−—‑²­'

Additional UTF-8 sequences that should be handled as word characters

In the outdated PHP 5.2 version, this property could be used to extend the character classes.

"\x2D" 'HYPHEN-MINUS' (U+002D)
"\xE2\x80\x90" 'HYPHEN' (U+2010)
"\xE2\x88\x92" 'MINUS SIGN' (U+2212)
"\xE2\x80\x94" 'EM DASH' (U+2014)
"\xE2\x80\x91" 'NON-BREAKING HYPHEN' (U+2011)
"\xC2\xB2" 'SUPERSCRIPT TWO' (U+00B2)
"\xC2\xAD" 'SOFT HYPHEN' (U+00AD)

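The byte sequences documented for $characters can be checked against their code points with plain PHP. The sketch below does not touch the library. It only rebuilds the string from the escapes listed for this property and assumes the mbstring extension is available for `mb_ord()` (PHP 7.2 or later).

```php
<?php
// Byte sequences as documented for $characters, concatenated in the same order.
$characters = "\x2D"   // HYPHEN-MINUS (U+002D)
    . "\xE2\x80\x90"   // HYPHEN (U+2010)
    . "\xE2\x88\x92"   // MINUS SIGN (U+2212)
    . "\xE2\x80\x94"   // EM DASH (U+2014)
    . "\xE2\x80\x91"   // NON-BREAKING HYPHEN (U+2011)
    . "\xC2\xB2"       // SUPERSCRIPT TWO (U+00B2)
    . "\xC2\xAD";      // SOFT HYPHEN (U+00AD)

// Split the string into single UTF-8 characters and print each code point.
$chars = preg_split('//u', $characters, -1, PREG_SPLIT_NO_EMPTY);
foreach ($chars as $char) {
    printf("U+%04X (%d bytes)\n", mb_ord($char, 'UTF-8'), strlen($char));
}
```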

Properties

$_allowedFontSizeDifference

protected float|int SetaPDF_Extractor_Strategy_WordGroup::$_allowedFontSizeDifference = 3

The allowed height difference.

$_boundaryFilter

protected null|SetaPDF_Extractor_Filter_Rectangle SetaPDF_Extractor_Strategy_WordGroup::$_boundaryFilter

The boundary filter.

$_cleanStreamCallback

protected callable SetaPDF_Extractor_Strategy_WordGroup::$_cleanStreamCallback

A callback that is called before processing a stream.

$_contentParser

protected SetaPDF_Core_Parser_Content SetaPDF_Extractor_Strategy_WordGroup::$_contentParser

$_detailLevel

protected string SetaPDF_Extractor_Strategy_WordGroup::$_detailLevel = 'default'

The detail level.

$_filter

protected null|SetaPDF_Extractor_Filter_FilterInterface SetaPDF_Extractor_Strategy_WordGroup::$_filter

A filter.

$_graphicState

protected SetaPDF_Core_Canvas_GraphicState SetaPDF_Extractor_Strategy_WordGroup::$_graphicState

The graphic state instance.

$_hyphens

protected string SetaPDF_Extractor_Strategy_WordGroup::$_hyphens = '-'

These characters are used as hyphens in the dehyphen logic.

"\x2D" 'HYPHEN-MINUS' (U+002D)

$_items

protected SetaPDF_Extractor_TextItem[] SetaPDF_Extractor_Strategy_WordGroup::$_items = array()

The text items.

$_lastMatrix

protected SetaPDF_Core_Geometry_Matrix[] SetaPDF_Extractor_Strategy_WordGroup::$_lastMatrix = array(...)

The matrices used.

$_rectScaleFactorX

protected float|int SetaPDF_Extractor_Strategy_WordGroup::$_rectScaleFactorX = 0.2707

The value that is used to scale the word-bounding-box on the abscissa.

$_rectScaleFactorY

protected float SetaPDF_Extractor_Strategy_WordGroup::$_rectScaleFactorY = 0.2707

The value that is used to scale the word-bounding-box on the ordinate.

$_resources

protected SetaPDF_Core_Type_Dictionary SetaPDF_Extractor_Strategy_WordGroup::$_resources

The stream resources dictionary.

$_sorter

protected SetaPDF_Extractor_Sorter SetaPDF_Extractor_Strategy_WordGroup::$_sorter

The sorter instance.

$_storage

protected SetaPDF_Extractor_Storage_StorageInterface SetaPDF_Extractor_Strategy_WordGroup::$_storage

The storage.

$_textCount

protected int SetaPDF_Extractor_Strategy_WordGroup::$_textCount = 0

A text item counter.

$_useDehyphen

protected bool SetaPDF_Extractor_Strategy_WordGroup::$_useDehyphen = true

Defines whether the dehyphen logic should be executed or not.

$spaceWidthFactor

public float SetaPDF_Extractor_Strategy_WordGroup::$spaceWidthFactor = 2.0

A factor to calculate whether a distance can be seen as a character separator.

The font's space character width is divided by this factor to define the minimum space for a character separator.
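
As a worked example of that division, the following sketch computes the threshold for a given space width and classifies a few gaps. The function name `minimumSeparatorWidth()`, the sample widths, and the use of `>=` for the comparison are assumptions made for illustration. The library's internal check may differ.

```php
<?php
/**
 * Hypothetical helper (not part of SetaPDF): the minimum gap that counts
 * as a separator is the space character width divided by the factor.
 */
function minimumSeparatorWidth(float $spaceWidth, float $spaceWidthFactor = 2.0): float
{
    return $spaceWidth / $spaceWidthFactor;
}

$spaceWidth = 2.5; // assumed width of the font's space glyph in user space units
$threshold = minimumSeparatorWidth($spaceWidth);

// Classify a few assumed horizontal gaps between neighbouring text items.
foreach ([0.4, 1.25, 3.0] as $gap) {
    $isSeparator = $gap >= $threshold; // comparison operator is an assumption
    printf("gap %.2f -> %s\n", $gap, $isSeparator ? 'separator' : 'no separator');
}
```

A larger factor lowers the threshold, so smaller gaps are already treated as separators.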


Static Methods

_isNonWordCharacter()

protected static SetaPDF_Extractor_Strategy_Word::_isNonWordCharacter (
string $character
): bool

Checks whether the character is a non word character.

Parameters
$character : string


getWords()

public static SetaPDF_Extractor_Strategy_Word::getWords (
$text [, string $encoding = 'UTF-8' ]
): array

Helper method to split a string into words.

This helper allows you to split a string into words with the same logic that is used to build words from the text items extracted from a PDF document.

Parameters
$text
 
$encoding : string
 

Methods

__construct()

_accept()

Proxy method that forwards the call to a filter instance if available.

Parameters
$textItem : SetaPDF_Extractor_TextItem
 

_calculateFontSize()

Calculates the font size of a text item in user space.

Parameters
$textItem : SetaPDF_Extractor_TextItem
 

_cleanResult()

Callback to clean up the resulting text.

Parameters
$result
 

_dehyphen()

protected SetaPDF_Extractor_Strategy_WordGroup::_dehyphen (
array $group
): array

Executes the dehyphen logic.

Parameters
$group : array
 

_getParser()

Creates the content stream parser.

Parameters
$stream : string
 

_getSubInstance()

Get an instance of the same strategy for processing another stream (e.g. a Form XObject stream).

Parameters
$gs : SetaPDF_Core_Canvas_GraphicState
 

_insertIntoStorage()

Prepares and fills the storage for resolving word groups.

Parameters
$stream
 
$resources : SetaPDF_Core_Type_Dictionary
 

_onAfterShowText()

public SetaPDF_Extractor_Strategy_Plain::_onAfterShowText (
string $rawString
): void

Callback that is called after a show text operation was invoked.

Parameters
$rawString : string
 

_onBeforeShowText()

Callback that is called before a show text operation is invoked.

_onBeginOrEndText()

public SetaPDF_Extractor_Strategy_Plain::_onBeginOrEndText (
array $arguments, string $operator
): void

Callback for begin or end text operators (BT/ET).

Parameters
$arguments : array
 
$operator : string
 

_onCurrentTransformationMatrix()

public SetaPDF_Extractor_Strategy_Plain::_onCurrentTransformationMatrix (
array $arguments, string $operator
): void

Callback for ctm changes (cm).

Parameters
$arguments : array
 
$operator : string
 

_onFormXObject()

public SetaPDF_Extractor_Strategy_Plain::_onFormXObject (
array $arguments, string $operator
): void

Callback for painting a specified XObject.

Parameters
$arguments : array
 
$operator : string
 
Exceptions

Throws SetaPDF_Exception_NotImplemented

_onGraphicStateChange()

public SetaPDF_Extractor_Strategy_Plain::_onGraphicStateChange (
array $arguments, string $operator
): void

Callback for graphic state change operators (q/Q).

Parameters
$arguments : array
 
$operator : string
 

_onInlineImage()

public SetaPDF_Extractor_Strategy_Plain::_onInlineImage (
array $arguments, string $operator
): void

Callback for the inline image operator.

Parameters
$arguments : array
 
$operator : string
 

_onTextPosition()

public SetaPDF_Extractor_Strategy_Plain::_onTextPosition (
array $arguments, string $operator
): void

Callback for text position operators.

Parameters
$arguments : array
 
$operator : string
 

_onTextShow()

public SetaPDF_Extractor_Strategy_Glyph::_onTextShow (
string $arguments, mixed $operator
): void

Callback that is called if a text should be shown.

Parameters
$arguments : string
 
$operator : mixed
 

_onTextState()

public SetaPDF_Extractor_Strategy_Plain::_onTextState (
array $arguments, string $operator
): void

Callback for text state operators.

All states have to be passed to the current graphic state as defined in PDF 32000-1:2008, Table 52 on page 121.

Parameters
$arguments : array
 
$operator : string
 
Exceptions

Throws SetaPDF_Extractor_Exception

_processLine()

Process all text items of a line.

Parameters
$items : SetaPDF_Extractor_TextItem[]
 

_resolveGroup()

Resolves all intersecting entries of the storage, starting from a single entry.

Parameters
$storageEntry : SetaPDF_Extractor_Storage_StorageEntry
 

_saveLastMatrix()

protected SetaPDF_Extractor_Strategy_Plain::_saveLastMatrix (
string $type
): void

Saves the last matrix by a specific type.

Parameters
$type : string
 

_showText()

protected SetaPDF_Extractor_Strategy_Glyph::_showText (
$string
): void

Method that shows text.

Parameters
$string
 

_showTextStrings()

public SetaPDF_Extractor_Strategy_Glyph::_showTextStrings (
array $textStrings
): void

Callback that is called if text strings should be shown.

Parameters
$textStrings : array
 

getAllowedFontSizeDifference()

Gets the currently allowed font-size difference.

getCleanStreamCallback()

Get the callback that is called before a stream is processed.

getDehyphen()

Gets whether the dehyphen logic should be executed or not.

getDetailLevel()

Get the detail level of the expected result.

getGraphicState()

getRectScaleFactorX()

Gets the rect scale-factor for the abscissa.

getRectScaleFactorY()

Gets the rect scale-factor for the ordinate.

getResult()

Get all resolved word groups.

Parameters
$stream : string
 
$resources : SetaPDF_Core_Type_Dictionary
 

getSorter()

Get the sorter instance.

If none was set a base line sorter is created automatically.

process()

Processes a stream through the plain text strategy.

Parameters
$stream : string
 
$resources : SetaPDF_Core_Type_Dictionary
 

setAllowedFontSizeDifference()

Sets the current allowed font-size difference.

Parameters
$value : int|float
 

setBoundary()

Sets the boundary for the current strategy.

Parameters
$boundary : SetaPDF_Core_Geometry_Rectangle|null
 

setCleanStreamCallback()

public SetaPDF_Extractor_Strategy_AbstractStrategy::setCleanStreamCallback (
[ callable|null $callback = null ]
): void

Set a callback that is called before processing a stream.

Parameters
$callback : callable|null
 

setDehyphen()

public SetaPDF_Extractor_Strategy_WordGroup::setDehyphen (
bool $dehyphen [, string|null $hyphens = null ]
): void

Sets whether the dehyphen logic should be executed or not.

Parameters
$dehyphen : bool
 
$hyphens : string|null
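
To make the effect of the $hyphens argument more concrete, here is a minimal, library-independent sketch of what dehyphenation means: a word that ends a line with one of the configured hyphen characters is joined with the first word of the next line. The function `dehyphenSketch()` and its input shape (a list of lines, each a list of words) are hypothetical and are not SetaPDF's implementation of _dehyphen().

```php
<?php
/**
 * Hypothetical illustration, not SetaPDF's implementation.
 *
 * @param string[][] $lines   Words per line, in reading order.
 * @param string     $hyphens UTF-8 characters treated as hyphens.
 * @return string[] Flat list of words with line-end hyphenation removed.
 */
function dehyphenSketch(array $lines, string $hyphens = '-'): array
{
    $hyphenChars = preg_split('//u', $hyphens, -1, PREG_SPLIT_NO_EMPTY);
    $lines = array_values($lines);
    $words = [];
    $joinNext = false;

    foreach ($lines as $n => $line) {
        foreach ($line as $word) {
            if ($joinNext) {
                // Continue the hyphenated word from the previous line.
                $words[count($words) - 1] .= $word;
                $joinNext = false;
            } else {
                $words[] = $word;
            }
        }
        // Nothing to join after the last line or before any word was seen.
        if ($n === count($lines) - 1 || $words === []) {
            continue;
        }
        $last = count($words) - 1;
        foreach ($hyphenChars as $hyphen) {
            $len = strlen($hyphen);
            if (strlen($words[$last]) > $len && substr($words[$last], -$len) === $hyphen) {
                // Strip the trailing hyphen and mark the word for joining.
                $words[$last] = substr($words[$last], 0, -$len);
                $joinNext = true;
                break;
            }
        }
    }

    return $words;
}

print_r(dehyphenSketch([['An', 'exam-'], ['ple', 'text']]));
```

Passing a second argument such as "-\xE2\x80\x90" would additionally treat U+2010 HYPHEN as a hyphen in this sketch.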
 

setDetailLevel()

public SetaPDF_Extractor_Strategy_Word::setDetailLevel (
string $detailLevel
): void

Set the detail level of the result.

Parameters
$detailLevel : string
 

setFilter()

setGraphicState()

Set the graphic state.

Parameters
$graphicState : SetaPDF_Core_Canvas_GraphicState
 

setRectScaleFactorX()

Sets the rect scale-factor on the abscissa.

The boundaries of the words are scaled using the product of the font-size and the given scale-factor.

Parameters
$value : int|float
 

setRectScaleFactorY()

Sets the rect scale-factor on the ordinate.

The boundaries of the words are scaled using the product of the font-size and the given scale-factor.

Parameters
$value : int|float
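
The arithmetic behind both scale factors can be sketched without the library. The example below multiplies the font size by each factor and, as an assumption, applies the resulting margin to both sides of a word's bounding box. The function `scaleWordBounds()`, the array keys, and the idea of growing the box symmetrically are illustrative. Only the product of font size and scale factor is taken from the description above.

```php
<?php
/**
 * Hypothetical helper (not part of SetaPDF): enlarges a bounding box by
 * font size times scale factor on each axis.
 */
function scaleWordBounds(
    array $bounds,
    float $fontSize,
    float $factorX = 0.2707,
    float $factorY = 0.2707
): array {
    $marginX = $fontSize * $factorX;
    $marginY = $fontSize * $factorY;

    return [
        'llx' => $bounds['llx'] - $marginX,
        'lly' => $bounds['lly'] - $marginY,
        'urx' => $bounds['urx'] + $marginX,
        'ury' => $bounds['ury'] + $marginY,
    ];
}

// Assumed word box and a 12 unit font size: the margin is 12 * 0.2707 = 3.2484.
$word = ['llx' => 100.0, 'lly' => 700.0, 'urx' => 140.0, 'ury' => 712.0];
$scaled = scaleWordBounds($word, 12.0);
print_r($scaled);
```

In this sketch, larger factors make neighbouring boxes overlap sooner, while smaller factors keep them apart.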
 

setSorter()

Set a sorter instance.

Parameters
$sorter : SetaPDF_Extractor_Sorter