setasign\SetaPDF2\Extractor\Strategy
WordGroupStrategy Extraction strategy for word groups
File: /SetaPDF v2/Extractor/Strategy/WordGroupStrategy.php
Old class name (alias):
\SetaPDF_Extractor_Strategy_WordGroup
The result of this strategy is sorted from top-left to bottom-right.
Each group is represented by an instance of \setasign\SetaPDF2\Extractor\Result\Words.
This class allows you to retrieve a group's boundary through the getBounds() method.
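The "top-left to bottom-right" ordering can be pictured with a minimal sketch. The function name and the array keys (llx, ury) are hypothetical and not part of SetaPDF. The strategy itself uses a sorter instance (see getSorter()), which is likely more tolerant than this strict comparison. The sketch assumes PDF user space, where the origin is at the bottom-left and a larger y value is higher on the page.

```php
<?php
/**
 * Hypothetical helper, not part of SetaPDF: orders plain bounding boxes
 * from top-left to bottom-right.
 *
 * Each box is an array with the (assumed) keys 'llx' (left edge) and
 * 'ury' (upper edge) in PDF user space.
 */
function sortTopLeftToBottomRight(array $boxes): array
{
    usort($boxes, static function (array $a, array $b): int {
        // A larger upper edge is higher on the page, so it comes first.
        $vertical = $b['ury'] <=> $a['ury'];
        if ($vertical !== 0) {
            return $vertical;
        }

        // Same height: the leftmost box comes first.
        return $a['llx'] <=> $b['llx'];
    });

    return $boxes;
}
```

In practice, groups whose upper edges differ only slightly would need a tolerance instead of an exact comparison.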
Summary
Methods
- __construct()
- _accept()
- _calculateFontSize()
- _cleanResult()
- _createWord()
- _dehyphen()
- _getParser()
- _getSubInstance()
- _ignore()
- _insertIntoStorage()
- _onBeginOrEndText()
- _onCurrentTransformationMatrix()
- _onFormXObject()
- _onGraphicStateChange()
- _onInlineImage()
- _onTextPosition()
- _onTextShow()
- _onTextState()
- _processLine()
- _resolveGroup()
- _showText()
- _showTextStrings()
- getAllowedFontSizeDifference()
- getBoundary()
- getCleanStreamCallback()
- getDehyphen()
- getDetailLevel()
- getFilter()
- getGraphicState()
- getIgnoreSpaceCharacter()
- getKeepIntersecingSpaces()
- getKeepIntersectingSpaces()
- getRectScaleFactorX()
- getRectScaleFactorY()
- getResult()
- getSorter()
- process()
- setAllowedFontSizeDifference()
- setBoundary()
- setCleanStreamCallback()
- setDehyphen()
- setDetailLevel()
- setFilter()
- setGraphicState()
- setIgnoreFaultyStreams()
- setIgnoreSpaceCharacter()
- setKeepIntersectingSpaces()
- setRectScaleFactorX()
- setRectScaleFactorY()
- setSorter()
Properties
- $_allowedFontSizeDifference
- $_boundaryFilter
- $_cleanStreamCallback
- $_contentParser
- $_detailLevel
- $_filter
- $_graphicState
- $_hyphens
- $_ignoreFaultyStreams
- $_ignoreSpaceCharacter
- $_items
- $_keepIntersectingSpaces
- $_rectScaleFactorX
- $_rectScaleFactorY
- $_resources
- $_sorter
- $_storage
- $_textCount
- $_useDehyphen
- $spaceWidthFactor
Constants
- DELIMITER_ITEMS_NOT_JOINING
- DELIMITER_LINE
- DELIMITER_NEXT_ITEM_DOES_NOT_JOIN
- DELIMITER_NEXT_ITEM_IS_NOT_A_NUMBER
- DELIMITER_NONWORD_CHARACTER
- DELIMITER_NO_DECIMAL_SEPARATOR
- DELIMITER_NO_NEXT_ITEM
- DELIMITER_PREV_ITEM_IS_NOT_A_NUMBER
- DELIMITER_PREV_WORD_IS_BUILD_BY_NON_WORD_CHARACTERS
- DELIMITER_SPACE
- DELIMITER_SPACE_CHARACTER
- DETAIL_LEVEL_DEFAULT
- DETAIL_LEVEL_GLYPHS
Constants
DELIMITER_ITEMS_NOT_JOINING
Delimiter constant
DELIMITER_LINE
Delimiter constant
DELIMITER_NEXT_ITEM_DOES_NOT_JOIN
Delimiter constant
DELIMITER_NEXT_ITEM_IS_NOT_A_NUMBER
Delimiter constant
DELIMITER_NONWORD_CHARACTER
Delimiter constant
DELIMITER_NO_DECIMAL_SEPARATOR
Delimiter constant
DELIMITER_NO_NEXT_ITEM
Delimiter constant
DELIMITER_PREV_ITEM_IS_NOT_A_NUMBER
Delimiter constant
DELIMITER_PREV_WORD_IS_BUILD_BY_NON_WORD_CHARACTERS
Delimiter constant
DELIMITER_SPACE
Delimiter constant
DELIMITER_SPACE_CHARACTER
Delimiter constant
DETAIL_LEVEL_DEFAULT
Detail level constant.
Default detail level resulting in instances of \setasign\SetaPDF2\Extractor\Result\Word.
DETAIL_LEVEL_GLYPHS
Detail level constant.
Extended detail level resulting in instances of \setasign\SetaPDF2\Extractor\Result\WordWithGlyphs.
Static Properties
$characters
Additional UTF-8 sequences that should be handled as word characters
In outdated PHP 5.2 versions, this property could be used to extend the character classes.
- "\x2D" 'HYPHEN-MINUS' (U+002D)
- "\xE2\x80\x90" 'HYPHEN' (U+2010)
- "\xE2\x88\x92" 'MINUS SIGN' (U+2212)
- "\xE2\x80\x94" 'EM DASH' (U+2014)
- "\xE2\x80\x91" 'NON-BREAKING HYPHEN' (U+2011)
- "\xC2\xB2" 'SUPERSCRIPT TWO' (U+00B2)
- "\xC2\xAD" 'SOFT HYPHEN' (U+00AD)
Properties
$_cleanStreamCallback
A callback that is called before processing a stream.
$_hyphens
These characters are used as hyphens in the dehyphen logic.
- "\x2D" 'HYPHEN-MINUS' (U+002D)
- "\xC2\xAD" 'SOFT HYPHEN (SHY)' (U+00AD)
$_sorter
The sorter instance.
$spaceWidthFactor
A factor to calculate whether a distance can be seen as a character separator.
The font's space character width is divided by this factor to define the minimum space for a character separator.
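The arithmetic described for $spaceWidthFactor can be sketched as follows. The function and its parameters are hypothetical and only illustrate the division. Whether the library compares with "greater than" or "greater than or equal" is an assumption here.

```php
<?php
/**
 * Hypothetical helper, not part of SetaPDF: decides whether a horizontal
 * gap between two glyphs is wide enough to count as a separator.
 *
 * @param float $gap              Distance between the two glyphs.
 * @param float $spaceWidth       Width of the font's space character.
 * @param float $spaceWidthFactor Divisor applied to the space width.
 */
function isSeparatorGap(float $gap, float $spaceWidth, float $spaceWidthFactor): bool
{
    if ($spaceWidthFactor <= 0) {
        throw new InvalidArgumentException('The factor must be greater than zero.');
    }

    // The minimum gap is the space width divided by the factor.
    $minimumGap = $spaceWidth / $spaceWidthFactor;

    // Assumption: a gap equal to the minimum already counts as a separator.
    return $gap >= $minimumGap;
}
```

With a space width of 2.5 and a factor of 2.0, the minimum gap is 1.25, so a larger factor makes the strategy treat smaller gaps as separators.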
Static Methods
_isNonWordCharacter()
Checks whether the character is a non word character.
Parameters
- $character : string
getWords()
Helper method to split a string into words.
This helper allows you to split a string into words with the same logic that is used to build words from the text items extracted from a PDF document.
Parameters
- $text : string
- $encoding : string
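Going by the signature above, the helper would be called statically as WordGroupStrategy::getWords($text, $encoding); its return value is not documented in this section. The sketch below is a rough stand-in for the idea and not the library's logic. It assumes UTF-8 input, treats letters and digits as word characters, and accepts a hypothetical list of additional word characters similar to $characters.

```php
<?php
/**
 * Hypothetical stand-in, not the SetaPDF implementation: splits UTF-8 text
 * into words. Letters, digits and the given extra characters form words;
 * everything else separates them.
 *
 * @param string   $text                UTF-8 encoded input.
 * @param string[] $extraWordCharacters Additional UTF-8 sequences handled as word characters.
 * @return string[]
 */
function splitIntoWords(string $text, array $extraWordCharacters = ["\x2D", "\xC2\xB2"]): array
{
    $extra = '';
    foreach ($extraWordCharacters as $character) {
        // Escape characters such as "-" so they are literal inside the class.
        $extra .= preg_quote($character, '/');
    }

    $pattern = '/[^\p{L}\p{N}' . $extra . ']+/u';
    $words = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure, e.g. for invalid UTF-8 input.
    return $words === false ? [] : $words;
}
```

Passing an empty array as the second argument shows the effect of the extra characters: a hyphenated word is then split at the hyphen.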
Methods
__construct()
The constructor.
Parameters
- $storage : ?\SetaPDF_Extractor_Storage_StorageInterface
_accept()
Proxy method that forwards the call to a filter instance if available.
This strategy filters space characters automatically if specified (see setIgnoreSpaceCharacter()).
Parameters
- $textItem : \SetaPDF_Extractor_TextItem
Exceptions
Throws \setasign\SetaPDF2\Core\Exception
Throws \setasign\SetaPDF2\Extractor\Exception
_calculateFontSize()
Calculates the font size of a text item in user space.
Parameters
- $textItem : \SetaPDF_Extractor_TextItem
_createWord()
Creates a new word instance using the glyphs.
Parameters
- $glyphs : \SetaPDF_Extractor_Result_Glyph[]
- $delemitterType
Exceptions
_dehyphen()
_getSubInstance()
Get an instance of the same strategy for processing another stream (e.g. a Form XObject stream).
Parameters
_ignore()
Method to allow implementation of individual logic.
Parameters
- $string : string
- $prevString : string
- $item : \SetaPDF_Extractor_TextItem
- $prevItem : \SetaPDF_Extractor_TextItem
_insertIntoStorage()
Prepares and fills the storage for resolving word groups.
Parameters
- $stream : string
- $resources : \SetaPDF_Core_Type_Dictionary
Exceptions
_onBeginOrEndText()
Callback for begin or end text operators (BT/ET).
Parameters
- $arguments : array
- $operator : string
_onCurrentTransformationMatrix()
Callback for changes of the current transformation matrix (cm).
Parameters
- $arguments : array
- $operator : string
_onFormXObject()
Callback for painting a specified XObject.
Parameters
- $arguments : array
- $operator : string
Exceptions
Throws \setasign\SetaPDF2\Core\Exception
Throws \setasign\SetaPDF2\Core\Filter\Exception
Throws \setasign\SetaPDF2\Core\Parser\Pdf\InvalidTokenException
Throws \setasign\SetaPDF2\Core\Type\Exception
Throws \setasign\SetaPDF2\Exception
_onGraphicStateChange()
Callback for graphic state change operators (q/Q).
Parameters
- $arguments : array
- $operator : string
_onTextPosition()
Callback for text position operators.
Parameters
- $arguments : array
- $operator : string
_onTextShow()
Callback for text show operators.
Parameters
- $arguments : array
- $operator : string
Exceptions
_onTextState()
Callback for text state operators.
All states have to be passed to the current graphic state as defined in PDF 32000-1:2008, Table 52 on page 121.
Parameters
- $arguments : array
- $operator : string
Exceptions
_processLine()
_resolveGroup()
Resolves all intersecting entries of the storage, starting from a single entry.
Parameters
- $storageEntry : \SetaPDF_Extractor_Storage_StorageEntry
_showTextStrings()
Callback that is called if text strings should be shown.
Parameters
- $textStrings : array
Exceptions
getKeepIntersecingSpaces()
WARNING: This method is marked as deprecated!
Use getKeepIntersectingSpaces() instead.
getKeepIntersectingSpaces()
Get a flag which defines whether intersecting spaces are ignored or not.
getResult()
Get all resolved word groups.
Parameters
- $stream : string
- $resources : \SetaPDF_Core_Type_Dictionary
Exceptions
Throws \ReflectionException
getSorter()
Get the sorter instance.
If none was set a baseline sorter is created automatically.
process()
Processes a stream through the plain text strategy.
Parameters
- $stream : string
- $resources : \SetaPDF_Core_Type_Dictionary
Exceptions
Throws \setasign\SetaPDF2\Core\Exception
Throws \setasign\SetaPDF2\Core\Parser\Pdf\InvalidTokenException
setAllowedFontSizeDifference()
Sets the current allowed font-size difference.
Parameters
- $value : float|int
setBoundary()
setCleanStreamCallback()
Set a callback that is called before processing a stream.
Parameters
- $callback : ?callable
setDehyphen()
Sets whether the dehyphen logic should be executed or not.
Parameters
- $dehyphen : bool
- $hyphens : ?string
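The idea behind dehyphenation can be sketched in plain PHP. This is not the library's implementation: the function name is hypothetical, and the hyphens are passed as an array here, while setDehyphen() takes a nullable string. The default characters mirror the ones listed for $_hyphens.

```php
<?php
/**
 * Hypothetical helper, not the SetaPDF implementation: joins lines and
 * removes a trailing hyphen so that a word split across two lines is
 * merged again.
 *
 * @param string[] $lines   Lines in reading order.
 * @param string[] $hyphens Byte sequences treated as hyphens.
 */
function dehyphenLines(array $lines, array $hyphens = ["\x2D", "\xC2\xAD"]): string
{
    $text = '';
    $joinNext = false; // true when the previous line ended with a hyphen

    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // skip empty lines
        }

        // Separate lines by a space unless the previous line was hyphenated.
        if ($text !== '' && !$joinNext) {
            $text .= ' ';
        }

        $joinNext = false;
        foreach ($hyphens as $hyphen) {
            $length = strlen($hyphen);
            if ($length > 0 && substr($line, -$length) === $hyphen) {
                // Drop the hyphen; this also applies to the last line.
                $line = (string) substr($line, 0, -$length);
                $joinNext = true;
                break;
            }
        }

        $text .= $line;
    }

    return $text;
}
```

A sketch this simple also merges genuine compounds that happen to break at their hyphen, which is one reason to make the behavior switchable.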
setFilter()
Set a filter.
Parameters
- $filter : ?\SetaPDF_Extractor_Filter_FilterInterface
setGraphicState()
setIgnoreFaultyStreams()
Defines whether to continue when a stream cannot be decoded.
Parameters
- $ignoreFaultyStreams : bool
setIgnoreSpaceCharacter()
Defines whether space characters should be ignored or not.
If this is set to false, the strategy will use the found space character as a delimiter. If this is set to true (default), the strategy will calculate a delimiter from the distance between two characters/glyphs.
Parameters
- $ignoreSpaceCharacter : bool
setKeepIntersectingSpaces()
Set a flag which defines whether intersecting spaces are kept or not.
By default, this is set to false, which removes a space or white-space character that intersects with another character by more than 55 percent.
Parameters
- $keep : bool
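The 55 percent rule can be illustrated with a one-dimensional overlap check. Both functions are hypothetical and not part of SetaPDF. The sketch assumes that the intersection is measured along the horizontal axis relative to the width of the space, which the description above does not state.

```php
<?php
/**
 * Hypothetical helper: returns the share of the space's width (0.0 to 1.0)
 * that is covered by another character, measured on the x-axis only.
 */
function overlapRatio(float $spaceLeft, float $spaceRight, float $otherLeft, float $otherRight): float
{
    $width = $spaceRight - $spaceLeft;
    if ($width <= 0) {
        return 0.0; // a space without width cannot be covered
    }

    $overlap = min($spaceRight, $otherRight) - max($spaceLeft, $otherLeft);

    // A negative value means the two ranges do not touch at all.
    return max(0.0, $overlap) / $width;
}

/**
 * Hypothetical helper: a space is removed only if intersecting spaces are
 * not kept and the overlap exceeds the threshold (55 percent by default).
 */
function shouldRemoveSpace(float $ratio, bool $keepIntersectingSpaces, float $threshold = 0.55): bool
{
    return !$keepIntersectingSpaces && $ratio > $threshold;
}
```

For a space spanning 10 to 20 and a character spanning 14 to 30, the overlap is 6 of 10 units, so the space would be removed unless the flag is set to true.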
setRectScaleFactorX()
Sets the rect scale-factor on the abscissa.
The boundaries of the words are scaled using the product of the font-size and the given scale-factor.
Parameters
- $value : float|int
setRectScaleFactorY()
Sets the rect scale-factor on the ordinate.
The boundaries of the words are scaled using the product of the font-size and the given scale-factor.
Parameters
- $value : float|int
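The scaling described for both scale factors can be sketched as follows. The function and the array keys are hypothetical. The sketch assumes that the product of font size and scale factor is applied as a margin on both sides of the rectangle; the source only states that the product is used for scaling.

```php
<?php
/**
 * Hypothetical helper, not part of SetaPDF: enlarges a rectangle by the
 * product of the font size and a scale factor per axis.
 *
 * @param array $bounds  Rectangle with the (assumed) keys llx, lly, urx, ury.
 * @param float $fontSize Font size in user space.
 * @param float $factorX  Scale factor on the abscissa.
 * @param float $factorY  Scale factor on the ordinate.
 */
function scaleBounds(array $bounds, float $fontSize, float $factorX, float $factorY): array
{
    $marginX = $fontSize * $factorX;
    $marginY = $fontSize * $factorY;

    // Assumption: the margin is added on both sides of each axis.
    return [
        'llx' => $bounds['llx'] - $marginX,
        'lly' => $bounds['lly'] - $marginY,
        'urx' => $bounds['urx'] + $marginX,
        'ury' => $bounds['ury'] + $marginY,
    ];
}
```

Larger factors make neighboring word rectangles intersect sooner, which in turn would merge more words into one group.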