SetaPDF_Extractor_Strategy_WordGroup Extraction strategy for word groups

File: /SetaPDF v2/Extractor/Strategy/WordGroup.php

The result of this strategy is sorted from top-left to bottom-right.

Each group is represented by an instance of SetaPDF_Extractor_Result_Words.

This class allows you to retrieve a group's boundary through the getBounds() method.
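The top-left to bottom-right ordering mentioned above can be pictured with a small, library-independent sketch. The box structure (`label`, `left`, `top`) and the function name are hypothetical and only serve the illustration; the strategy itself delegates ordering to its sorter instance, whose exact rules are not reproduced here. The sketch assumes PDF user space, where the y axis grows upwards.

```php
<?php
/**
 * Hypothetical helper: orders boxes from top-left to bottom-right.
 * Each box is an array with the keys 'label', 'left' and 'top'
 * (coordinates in a space where y grows upwards).
 */
function sortTopLeftToBottomRight(array $boxes): array
{
    usort($boxes, function (array $a, array $b): int {
        // Larger 'top' first (higher on the page); ties are broken
        // by the smaller 'left' value (further to the left).
        return [$b['top'], $a['left']] <=> [$a['top'], $b['left']];
    });

    return $boxes;
}

$sorted = sortTopLeftToBottomRight([
    ['label' => 'A', 'left' => 300.0, 'top' => 700.0],
    ['label' => 'B', 'left' => 50.0,  'top' => 700.0],
    ['label' => 'C', 'left' => 50.0,  'top' => 500.0],
]);

echo implode(', ', array_column($sorted, 'label')), "\n";
```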

Constants

DETAIL_LEVEL_DEFAULT

public const string SetaPDF_Extractor_Strategy_Word::DETAIL_LEVEL_DEFAULT = 'default'

Detail level constant.

Default detail level resulting in instances of SetaPDF_Extractor_Result_Word.

DETAIL_LEVEL_GLYPHS

Detail level constant.

Extended detail level resulting in instances of SetaPDF_Extractor_Result_WordWithGlyphs.


Static Properties

$characters

static public string SetaPDF_Extractor_Strategy_Word::$characters = '\\-‐−—‑²­'

Additional UTF-8 sequences that should be handled as word characters

In the outdated PHP 5.2 version, this property could be used to extend the character classes.

"\x2D" 'HYPHEN-MINUS' (U+002D)
"\xE2\x80\x90" 'HYPHEN' (U+2010)
"\xE2\x88\x92" 'MINUS SIGN' (U+2212)
"\xE2\x80\x94" 'EM DASH' (U+2014)
"\xE2\x80\x91" 'NON-BREAKING HYPHEN' (U+2011)
"\xC2\xB2" 'SUPERSCRIPT TWO' (U+00B2)
"\xC2\xAD" 'SOFT HYPHEN' (U+00AD)
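Because several of these characters are invisible or look alike in most fonts, it can help to check the byte sequences against their code points. The following sketch does that with `mb_ord()`; it assumes PHP 7.2 or later with the mbstring extension, and the function name `decodeCodePoints` is made up for this example.

```php
<?php
/**
 * Hypothetical helper: maps each UTF-8 byte sequence to the code point
 * of its first (and here only) character, formatted as "U+XXXX".
 */
function decodeCodePoints(array $sequences): array
{
    $result = [];
    foreach ($sequences as $bytes) {
        $result[bin2hex($bytes)] = sprintf('U+%04X', mb_ord($bytes, 'UTF-8'));
    }

    return $result;
}

// The byte sequences as listed in the documentation of $characters.
$sequences = [
    "\x2D",         // HYPHEN-MINUS
    "\xE2\x80\x90", // HYPHEN
    "\xE2\x88\x92", // MINUS SIGN
    "\xE2\x80\x94", // EM DASH
    "\xE2\x80\x91", // NON-BREAKING HYPHEN
    "\xC2\xB2",     // SUPERSCRIPT TWO
    "\xC2\xAD",     // SOFT HYPHEN
];

foreach (decodeCodePoints($sequences) as $hex => $codePoint) {
    echo $hex, ' => ', $codePoint, "\n";
}
```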


Properties

$_allowedFontSizeDifference

The allowed height difference.

$_cleanStreamCallback

A callback that is called before processing a stream.

$_detailLevel

protected string SetaPDF_Extractor_Strategy_Word::$_detailLevel = 'default'

The detail level.

$_graphicState

$_hyphens

These characters are used as hyphens in the dehyphen logic.

"\x2D" 'HYPHEN-MINUS' (U+002D)

$_items

$_keepIntersectingSpaces

Defines whether intersecting spaces should be ignored or not.

$_lastMatrix

$_rectScaleFactorX

The value that is used to scale the word-bounding-box on the abscissa.

$_rectScaleFactorY

The value that is used to scale the word-bounding-box on the ordinate.

$_resources

The stream resources dictionary.

$_sorter

$_textCount

A text item counter.

$_useDehyphen

Defines whether the dehyphen logic should be executed or not.

$spaceWidthFactor

A factor to calculate whether a distance can be seen as a character separator.

The font's space character width is divided by this factor to define the minimum space for a character separator.
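The division described above can be sketched as plain arithmetic. Both function names are hypothetical, the units are whatever the font's width is measured in, and the use of `>=` for the comparison is an assumption, since the documentation only states how the minimum is derived.

```php
<?php
/**
 * Hypothetical helper: the minimum distance that counts as a character
 * separator, i.e. the space character width divided by the factor.
 */
function minimumSeparatorDistance(float $spaceWidth, float $spaceWidthFactor): float
{
    return $spaceWidth / $spaceWidthFactor;
}

/**
 * Hypothetical helper: decides whether a gap between two glyphs is wide
 * enough to be seen as a separator.
 * Assumption: a gap equal to the minimum already counts.
 */
function isCharacterSeparator(float $distance, float $spaceWidth, float $spaceWidthFactor): bool
{
    return $distance >= minimumSeparatorDistance($spaceWidth, $spaceWidthFactor);
}

// A space width of 250 units and a factor of 2 give a minimum of 125 units.
var_dump(minimumSeparatorDistance(250.0, 2.0));
var_dump(isCharacterSeparator(130.0, 250.0, 2.0));
var_dump(isCharacterSeparator(100.0, 250.0, 2.0));
```

A larger factor lowers the minimum, so smaller gaps are treated as separators; a smaller factor has the opposite effect.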


Static Methods

_isNonWordCharacter()

protected static SetaPDF_Extractor_Strategy_Word::_isNonWordCharacter (
$character
):

Checks whether the character is a non-word character.

Parameters
$character
 

getWords()

public static SetaPDF_Extractor_Strategy_Word::getWords (
$text [, $encoding = 'UTF-8' ]
):

Helper method to split a string into words.

This helper allows you to split a string into words with the same logic that is used to resolve words from text items extracted from a PDF document.

Parameters
$text
 
$encoding
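To give a feel for what word splitting with additional word characters means, here is a simplified stand-in that does not call the library. It is not the logic used by getWords(); it is a plain regular expression that treats letters, numbers and the characters listed for $characters as parts of a word. The function name `splitIntoWords` is hypothetical.

```php
<?php
/**
 * Hypothetical stand-in (NOT the library's implementation): splits a UTF-8
 * string into words, where a word is a run of letters, numbers or one of
 * the hyphen-like characters listed for $characters.
 */
function splitIntoWords(string $text): array
{
    // HYPHEN, MINUS SIGN, EM DASH, NON-BREAKING HYPHEN, SUPERSCRIPT TWO,
    // SOFT HYPHEN and finally HYPHEN-MINUS (escaped, at the end of the class).
    $pattern = '/[\p{L}\p{N}\x{2010}\x{2212}\x{2014}\x{2011}\x{00B2}\x{00AD}\-]+/u';

    preg_match_all($pattern, $text, $matches);

    return $matches[0];
}

print_r(splitIntoWords("A well-known value, in m\u{00B2}."));
```

The real helper additionally accepts an $encoding argument, which this sketch leaves out by assuming UTF-8 input.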
 

Methods

__construct()

_accept()

Proxy method that forwards the call to a filter instance if available.

Parameters
$textItem : SetaPDF_Extractor_TextItem
 
Exceptions

Throws SetaPDF_Extractor_Exception

_calculateFontSize()

Calculates the font size of a text item in user space.

Parameters
$textItem : SetaPDF_Extractor_TextItem
 

_cleanResult()

Callback to clean up the resulting text.

Parameters
$result
 

_createWord()

Creates a new word instance using the glyphs.

Parameters
$glyphs
 

_dehyphen()

Executes the dehyphen logic.

Parameters
$group
 

_getParser()

Creates the content stream parser.

Parameters
$stream
 

_getSubInstance()

Get an instance of the same strategy for processing another stream (e.g. a Form XObject stream).

Parameters
$gs : SetaPDF_Core_Canvas_GraphicState
 

_insertIntoStorage()

Prepares and fills the storage for resolving word groups.

Parameters
$stream
 
$resources : SetaPDF_Core_Type_Dictionary
 

_onAfterShowText()

Callback that is called after a show text operation was invoked.

Parameters
$rawString
 
Exceptions

Throws SetaPDF_Extractor_Exception

_onBeforeShowText()

Callback that is called before a show text operation is invoked.

_onBeginOrEndText()

public SetaPDF_Extractor_Strategy_Plain::_onBeginOrEndText (
$arguments, $operator
): void

Callback for begin or end text operators (BT/ET).

Parameters
$arguments
 
$operator
 

_onCurrentTransformationMatrix()

Callback for ctm changes (cm).

Parameters
$arguments
 
$operator
 

_onFormXObject()

public SetaPDF_Extractor_Strategy_Plain::_onFormXObject (
$arguments, $operator
): void

Callback for painting a specified XObject.

Parameters
$arguments
 
$operator
 
Exceptions

Throws SetaPDF_Exception_NotImplemented

_onGraphicStateChange()

public SetaPDF_Extractor_Strategy_Plain::_onGraphicStateChange (
$arguments, $operator
): void

Callback for graphic state change operators (q/Q).

Parameters
$arguments
 
$operator
 

_onInlineImage()

public SetaPDF_Extractor_Strategy_Plain::_onInlineImage (
$arguments, $operator
): void

Callback for the inline image operator.

Parameters
$arguments
 
$operator
 
Exceptions

Throws SetaPDF_Core_Exception

_onTextPosition()

public SetaPDF_Extractor_Strategy_Plain::_onTextPosition (
$arguments, $operator
): void

Callback for text position operators.

Parameters
$arguments
 
$operator
 

_onTextShow()

public SetaPDF_Extractor_Strategy_Glyph::_onTextShow (
$arguments, $operator
): void

Callback that is called if a text should be shown.

Parameters
$arguments
 
$operator
 

_onTextState()

public SetaPDF_Extractor_Strategy_Plain::_onTextState (
$arguments, $operator
): void

Callback for text state operators.

All states have to be passed to the current graphic state as defined in PDF 32000-1:2008, Table 52 on page 121.

Parameters
$arguments
 
$operator
 
Exceptions

Throws SetaPDF_Extractor_Exception

_processLine()

protected SetaPDF_Extractor_Strategy_Word::_processLine (
array $items
):

Process all text items of a line.

Parameters
$items : array
 

_resolveGroup()

Resolves all intersecting entries of the storage, starting from a single entry.

Parameters
$storageEntry : SetaPDF_Extractor_Storage_StorageEntry
 

_saveLastMatrix()

Saves the last matrix by a specific type.

Parameters
$type
 

_showText()

protected SetaPDF_Extractor_Strategy_Glyph::_showText (
$string
): void

Method that shows text.

Parameters
$string
 

_showTextStrings()

public SetaPDF_Extractor_Strategy_Glyph::_showTextStrings (
$textStrings
): void

Callback that is called if text strings should be shown.

Parameters
$textStrings
 

getAllowedFontSizeDifference()

Gets the currently allowed font-size difference.

getCleanStreamCallback()

Get the callback that is called before a stream is processed.

getDehyphen()

Gets whether the dehyphen logic should be executed or not.

getDetailLevel()

Get the detail level of the expected result.

getFilter()

Get the filter.

getGraphicState()

Get the graphic state.

getKeepIntersecingSpaces()

Get a flag which defines whether intersecting spaces are ignored or not.

getRectScaleFactorX()

Gets the rect scale-factor for the abscissa.

getRectScaleFactorY()

Gets the rect scale-factor for the ordinate.

getResult()

Get all resolved word groups.

Parameters
$stream
 
$resources : SetaPDF_Core_Type_Dictionary
 
Exceptions

Throws SetaPDF_Core_Exception

getSorter()

Get the sorter instance.

If none was set, a base line sorter is created automatically.

process()

Processes a stream through the plain text strategy.

Parameters
$stream
 
$resources : SetaPDF_Core_Type_Dictionary
 

setAllowedFontSizeDifference()

Sets the current allowed font-size difference.

Parameters
$value
 

setBoundary()

Sets the boundary for the current strategy.

Parameters
$boundary : SetaPDF_Core_Geometry_Rectangle|null
 

setCleanStreamCallback()

public SetaPDF_Extractor_Strategy_AbstractStrategy::setCleanStreamCallback (
[ callable|null $callback = null ]
): void

Set a callback that is called before processing a stream.

Parameters
$callback : callable|null
 

setDehyphen()

public SetaPDF_Extractor_Strategy_WordGroup::setDehyphen (
$dehyphen [, $hyphens = null ]
): void

Sets whether the dehyphen logic should be executed or not.

Parameters
$dehyphen
 
$hyphens
 

setDetailLevel()

public SetaPDF_Extractor_Strategy_Word::setDetailLevel (
$detailLevel
): void

Set the detail level of the result.

Parameters
$detailLevel
 

setFilter()

setGraphicState()

Set the graphic state.

Parameters
$graphicState : SetaPDF_Core_Canvas_GraphicState
 

setKeepIntersectingSpaces()

Set a flag which defines whether intersecting spaces are ignored or not.

By default this is set to false, which removes a space or white-space character that intersects with another character by more than 55 percent.

Parameters
$keep
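The 55 percent rule can be illustrated with a one-dimensional sketch. How the library actually measures the intersection is not stated here, so the choice to measure the horizontal overlap relative to the width of the space is an assumption, and both function names are hypothetical.

```php
<?php
/**
 * Hypothetical helper: fraction of a space's horizontal extent that is
 * covered by another character.
 * Assumption: the overlap is measured relative to the width of the space.
 */
function overlapFraction(float $spaceLeft, float $spaceRight, float $otherLeft, float $otherRight): float
{
    $width = $spaceRight - $spaceLeft;
    if ($width <= 0.0) {
        return 0.0;
    }

    $overlap = min($spaceRight, $otherRight) - max($spaceLeft, $otherLeft);

    return max(0.0, $overlap) / $width;
}

/**
 * Hypothetical helper: a space is dropped only if the flag is false and
 * the overlap exceeds 55 percent.
 */
function shouldDropSpace(float $fraction, bool $keepIntersectingSpaces): bool
{
    return !$keepIntersectingSpaces && $fraction > 0.55;
}

// A space from x=0 to x=10 and a character starting at x=4: 60 percent overlap.
$fraction = overlapFraction(0.0, 10.0, 4.0, 20.0);
var_dump(shouldDropSpace($fraction, false)); // flag at its default: the space is dropped
var_dump(shouldDropSpace($fraction, true));  // flag set to true: the space is kept
```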
 

setRectScaleFactorX()

Sets the rect scale-factor on the abscissa.

The boundaries of the words are scaled using the product of the font-size and the given scale-factor.

Parameters
$value
 

setRectScaleFactorY()

Sets the rect scale-factor on the ordinate.

The boundaries of the words are scaled using the product of the font-size and the given scale-factor.

Parameters
$value
 

setSorter()

Set a sorter instance.

Parameters
$sorter : SetaPDF_Extractor_Sorter