SetaPDF_Extractor_Strategy_WordGroup
Extraction strategy for word groups.
File: /SetaPDF v2/Extractor/Strategy/WordGroup.php
The result of this strategy is sorted from top-left to bottom-right.
Each group is represented by an instance of SetaPDF_Extractor_Result_Words.
This class allows you to retrieve a group's boundary through the getBounds() method.
Summary
Methods
- __construct()
- _accept()
- _calculateFontSize()
- _cleanResult()
- _createWord()
- _dehyphen()
- _getParser()
- _getSubInstance()
- _insertIntoStorage()
- _onAfterShowText()
- _onBeforeShowText()
- _onBeginOrEndText()
- _onCurrentTransformationMatrix()
- _onFormXObject()
- _onGraphicStateChange()
- _onInlineImage()
- _onTextPosition()
- _onTextShow()
- _onTextState()
- _processLine()
- _resolveGroup()
- _saveLastMatrix()
- _showText()
- _showTextStrings()
- getAllowedFontSizeDifference()
- getBoundary()
- getCleanStreamCallback()
- getDehyphen()
- getDetailLevel()
- getFilter()
- getGraphicState()
- getKeepIntersecingSpaces()
- getRectScaleFactorX()
- getRectScaleFactorY()
- getResult()
- getSorter()
- process()
- setAllowedFontSizeDifference()
- setBoundary()
- setCleanStreamCallback()
- setDehyphen()
- setDetailLevel()
- setFilter()
- setGraphicState()
- setKeepIntersectingSpaces()
- setRectScaleFactorX()
- setRectScaleFactorY()
- setSorter()
Constants
DETAIL_LEVEL_DEFAULT
Detail level constant.
Default detail level resulting in instances of SetaPDF_Extractor_Result_Word.
DETAIL_LEVEL_GLYPHS
Detail level constant.
Extended detail level resulting in instances of SetaPDF_Extractor_Result_WordWithGlyphs.
Static Properties
$characters
Additional UTF-8 sequences that should be handled as word characters
In outdated PHP 5.2 versions, this property could be used to extend the character classes.
- "\x2D" 'HYPHEN-MINUS' (U+002D)
- "\xE2\x80\x90" 'HYPHEN' (U+2010)
- "\xE2\x88\x92" 'MINUS SIGN' (U+2212)
- "\xE2\x80\x94" 'EM DASH' (U+2014)
- "\xE2\x80\x91" 'NON-BREAKING HYPHEN' (U+2011)
- "\xC2\xB2" 'SUPERSCRIPT TWO' (U+00B2)
- "\xC2\xAD" 'SOFT HYPHEN' (U+00AD)
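As a quick sanity check on the byte sequences listed above, the following minimal sketch decodes each one and compares it with the stated code point. It assumes the mbstring extension is available (mb_ord() requires PHP 7.2 or later). The variable names $expected and $mismatches are illustrative only and are not part of SetaPDF.

```php
<?php
// Byte sequence => expected Unicode code point, copied from the list above.
$expected = [
    "\x2D"         => 0x002D, // HYPHEN-MINUS
    "\xE2\x80\x90" => 0x2010, // HYPHEN
    "\xE2\x88\x92" => 0x2212, // MINUS SIGN
    "\xE2\x80\x94" => 0x2014, // EM DASH
    "\xE2\x80\x91" => 0x2011, // NON-BREAKING HYPHEN
    "\xC2\xB2"     => 0x00B2, // SUPERSCRIPT TWO
    "\xC2\xAD"     => 0x00AD, // SOFT HYPHEN
];

// Collect every sequence whose decoded code point differs from the listed one.
$mismatches = [];
foreach ($expected as $bytes => $codePoint) {
    if (mb_ord($bytes, 'UTF-8') !== $codePoint) {
        $mismatches[] = bin2hex($bytes);
    }
}

printf("%d sequences checked, %d mismatches\n", count($expected), count($mismatches));
```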
Properties
$_boundaryFilter
The boundary filter.
$_cleanStreamCallback
A callback that is called before processing a stream.
$_filter
A filter.
$_lastMatrix
Used matrices.
$_storage
The storage.
$spaceWidthFactor
A factor to calculate whether a distance can be seen as a character separator.
The font's space character width is divided by this factor to define the minimum space for a character separator.
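The division described above can be sketched in a few lines. This is an illustration of the arithmetic only, not the library's implementation: the helper names minimumSeparatorWidth() and isSeparator() are hypothetical, the numbers are made up, and treating a gap equal to the threshold as a separator is an assumption.

```php
<?php
/**
 * Hypothetical helper (not part of SetaPDF): the minimum gap that counts as a
 * character separator, i.e. the font's space width divided by the factor.
 */
function minimumSeparatorWidth(float $spaceWidth, float $spaceWidthFactor): float
{
    if ($spaceWidthFactor <= 0) {
        throw new InvalidArgumentException('The factor must be greater than zero.');
    }

    return $spaceWidth / $spaceWidthFactor;
}

/**
 * Hypothetical check: does a horizontal gap reach that minimum?
 * Using ">=" rather than ">" is an assumption made for this sketch.
 */
function isSeparator(float $gap, float $spaceWidth, float $spaceWidthFactor): bool
{
    return $gap >= minimumSeparatorWidth($spaceWidth, $spaceWidthFactor);
}

// Illustrative numbers only: a space glyph 2.5 units wide and a factor of 2.
$threshold = minimumSeparatorWidth(2.5, 2.0);
$wideGap   = isSeparator(1.5, 2.5, 2.0); // gap above the threshold
$narrowGap = isSeparator(1.0, 2.5, 2.0); // gap below the threshold
```

A larger factor lowers the threshold, so smaller gaps are treated as separators; a smaller factor has the opposite effect.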
Static Methods
getWords()
Helper method to split a string into words.
This helper allows you to split a string into words with the same logic that is used to resolve words from text items extracted from a PDF document.
Parameters
- $text
- $encoding
Methods
__construct()
The constructor.
Parameters
- $storage : SetaPDF_Extractor_Storage_StorageInterface|null
_accept()
Proxy method that forwards the call to a filter instance if available.
Parameters
- $textItem : SetaPDF_Extractor_TextItem
Exceptions
Throws SetaPDF_Extractor_Exception
_calculateFontSize()
Calculates the font size of a text item in user space.
Parameters
- $textItem : SetaPDF_Extractor_TextItem
_getSubInstance()
Get an instance of the same strategy for processing another stream (e.g. a Form XObject stream).
Parameters
_insertIntoStorage()
Prepares and fills the storage for resolving word groups.
Parameters
- $stream
- $resources : SetaPDF_Core_Type_Dictionary
_onAfterShowText()
Callback that is called after a show text operation was invoked.
Parameters
- $rawString
Exceptions
Throws SetaPDF_Extractor_Exception
_onCurrentTransformationMatrix()
Callback for ctm changes (cm).
Parameters
- $arguments
- $operator
_onFormXObject()
Callback for painting a specified XObject.
Parameters
- $arguments
- $operator
Exceptions
_onGraphicStateChange()
Callback for graphic state change operators (q/Q).
Parameters
- $arguments
- $operator
_onInlineImage()
Callback for the inline image operator.
Parameters
- $arguments
- $operator
Exceptions
Throws SetaPDF_Core_Exception
_onTextState()
Callback for text state operators.
All states have to be passed to the current graphic state as defined in PDF 32000-1:2008, Table 52 on page 121.
Parameters
- $arguments
- $operator
Exceptions
Throws SetaPDF_Extractor_Exception
_resolveGroup()
Resolves all intersecting entries of the storage, starting by a single entry.
Parameters
- $storageEntry : SetaPDF_Extractor_Storage_StorageEntry
getResult()
Get all resolved word groups.
Parameters
- $stream
- $resources : SetaPDF_Core_Type_Dictionary
Exceptions
Throws SetaPDF_Core_Exception
getSorter()
Get the sorter instance.
If none was set a base line sorter is created automatically.
process()
Processes a stream through the plain text strategy.
Parameters
- $stream
- $resources : SetaPDF_Core_Type_Dictionary
setBoundary()
Sets the boundary for the current strategy.
Parameters
- $boundary : SetaPDF_Core_Geometry_Rectangle|null
setCleanStreamCallback()
Set a callback that is called before processing a stream.
Parameters
- $callback : callable|null
setDehyphen()
Sets whether the dehyphen logic should be executed or not.
Parameters
- $dehyphen
- $hyphens
setFilter()
Set a filter.
Parameters
- $filter : SetaPDF_Extractor_Filter_FilterInterface|null
setGraphicState()
Set the graphic state.
Parameters
- $graphicState : SetaPDF_Core_Canvas_GraphicState
setKeepIntersectingSpaces()
Set a flag which defines whether intersecting spaces are ignored or not.
By default this is set to false, which removes a space or white-space character that intersects with another character by more than 55 percent.
Parameters
- $keep
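The 55 percent rule can be pictured with a simplified one-dimensional sketch. How the library actually measures the intersection is not stated here, so the following rests on an assumption: the overlap is taken as the share of the space's horizontal extent that is covered by the other character. The functions overlapRatio() and shouldDropSpace() are hypothetical and not part of SetaPDF.

```php
<?php
/**
 * Hypothetical helper: fraction of the space's horizontal extent
 * [$spaceLeft, $spaceRight] that is covered by [$charLeft, $charRight].
 * Measuring the overlap on the x-axis only is an assumption of this sketch.
 */
function overlapRatio(float $spaceLeft, float $spaceRight, float $charLeft, float $charRight): float
{
    $width = $spaceRight - $spaceLeft;
    if ($width <= 0) {
        return 0.0;
    }

    $overlap = min($spaceRight, $charRight) - max($spaceLeft, $charLeft);

    return max(0.0, $overlap) / $width;
}

/**
 * Hypothetical decision: a space is dropped only if the keep flag is false
 * and the overlap exceeds 55 percent.
 */
function shouldDropSpace(float $ratio, bool $keep): bool
{
    return !$keep && $ratio > 0.55;
}

// A space spanning 0..10 that is covered from 4 onwards: 60 percent overlap.
$heavy = overlapRatio(0.0, 10.0, 4.0, 20.0);
// The same space covered from 8 onwards: 20 percent overlap.
$light = overlapRatio(0.0, 10.0, 8.0, 20.0);

$droppedByDefault = shouldDropSpace($heavy, false); // flag left at its default
$keptWhenFlagSet  = shouldDropSpace($heavy, true);  // flag set to true
$keptWhenLight    = shouldDropSpace($light, false); // overlap below the limit
```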
setRectScaleFactorX()
Sets the rect scale-factor on the abscissa.
The boundaries of the words are scaled using the product of the font-size and the given scale-factor.
Parameters
- $value
setRectScaleFactorY()
Sets the rect scale-factor on the ordinate.
The boundaries of the words are scaled using the product of the font-size and the given scale-factor.
Parameters
- $value
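The scaling described for both axes can be sketched as follows. The product of font size and scale factor comes from the description above; applying that amount as an outward offset on each side of the rectangle is an assumption of this sketch, and scaleBounds() with its array keys (llx, lly, urx, ury) is a hypothetical helper, not a SetaPDF method.

```php
<?php
/**
 * Hypothetical helper: grow a rectangle by fontSize * factor per axis.
 * Applying the offset to both sides of each axis is an assumption.
 *
 * @param array{llx: float, lly: float, urx: float, ury: float} $rect
 * @return array{llx: float, lly: float, urx: float, ury: float}
 */
function scaleBounds(array $rect, float $fontSize, float $factorX, float $factorY): array
{
    $dx = $fontSize * $factorX; // offset on the abscissa
    $dy = $fontSize * $factorY; // offset on the ordinate

    return [
        'llx' => $rect['llx'] - $dx,
        'lly' => $rect['lly'] - $dy,
        'urx' => $rect['urx'] + $dx,
        'ury' => $rect['ury'] + $dy,
    ];
}

// Illustrative values: a 12 pt word with factors 0.5 (x) and 0.25 (y).
$scaled = scaleBounds(
    ['llx' => 100.0, 'lly' => 200.0, 'urx' => 150.0, 'ury' => 212.0],
    12.0,
    0.5,
    0.25
);
```

Under these assumptions, larger factors make the rectangles of neighbouring words more likely to intersect, which in turn merges more words into one group.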