SetaPDF_Extractor_Strategy_Word Extraction strategy for single words.
File: /SetaPDF v2/Extractor/Strategy/Word.php
The result of this strategy is sorted from top-left to bottom-right.
Each word is represented by an instance of SetaPDF_Extractor_Result_Word or SetaPDF_Extractor_Result_WordWithGlyphs.
This class allows you to retrieve a word's boundary through the getBounds() method of the word instance.
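The following is a minimal usage sketch of how this strategy can be attached to an extractor and how the bounds of each word can be read. It assumes the SetaPDF autoloader is already registered. The helper names formatWord() and extractWordLines() are hypothetical. The calls SetaPDF_Core_Document::loadByFilename(), SetaPDF_Extractor::setStrategy() and getResultByPageNumber(), as well as the corner accessors getLl()/getUr() with getX()/getY(), are not described on this page and should be checked against your library version.

```php
<?php
// Assumption: the SetaPDF autoloader has been registered before this code runs.

/**
 * Formats one word result as "text [llx, lly, urx, ury] ...".
 * Uses getString() and getBounds() of the word instance; the corner
 * accessors on the bounds objects are an assumption (see lead-in).
 */
function formatWord($word): string
{
    $parts = [];
    foreach ($word->getBounds() as $bounds) {
        $ll = $bounds->getLl(); // lower-left corner
        $ur = $bounds->getUr(); // upper-right corner
        $parts[] = sprintf(
            '[%.2F, %.2F, %.2F, %.2F]',
            $ll->getX(), $ll->getY(), $ur->getX(), $ur->getY()
        );
    }

    return $word->getString() . ' ' . implode(' ', $parts);
}

/**
 * Extracts all words of a page and returns one formatted line per word.
 */
function extractWordLines(string $pdfPath, int $pageNo): array
{
    $document = SetaPDF_Core_Document::loadByFilename($pdfPath);
    $extractor = new SetaPDF_Extractor($document);
    $extractor->setStrategy(new SetaPDF_Extractor_Strategy_Word());

    $lines = [];
    foreach ($extractor->getResultByPageNumber($pageNo) as $word) {
        $lines[] = formatWord($word);
    }

    return $lines;
}
```

A call such as extractWordLines('document.pdf', 1) (the file name is a placeholder) would then return the words of the first page in the sorted order described above.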
Summary
Methods
- __construct()
- _accept()
- _cleanResult()
- _createWord()
- _getParser()
- _getSubInstance()
- _ignore()
- _onBeginOrEndText()
- _onCurrentTransformationMatrix()
- _onFormXObject()
- _onGraphicStateChange()
- _onInlineImage()
- _onTextPosition()
- _onTextShow()
- _onTextState()
- _processLine()
- _showText()
- _showTextStrings()
- getBoundary()
- getCleanStreamCallback()
- getDetailLevel()
- getFilter()
- getGraphicState()
- getIgnoreSpaceCharacter()
- getKeepIntersecingSpaces()
- getKeepIntersectingSpaces()
- getResult()
- getSorter()
- process()
- setBoundary()
- setCleanStreamCallback()
- setDetailLevel()
- setFilter()
- setGraphicState()
- setIgnoreFaultyStreams()
- setIgnoreSpaceCharacter()
- setKeepIntersectingSpaces()
- setSorter()
Constants
- DELIMITER_ITEMS_NOT_JOINING
- DELIMITER_LINE
- DELIMITER_NEXT_ITEM_DOES_NOT_JOIN
- DELIMITER_NEXT_ITEM_IS_NOT_A_NUMBER
- DELIMITER_NONWORD_CHARACTER
- DELIMITER_NO_DECIMAL_SEPARATOR
- DELIMITER_NO_NEXT_ITEM
- DELIMITER_PREV_ITEM_IS_NOT_A_NUMBER
- DELIMITER_PREV_WORD_IS_BUILD_BY_NON_WORD_CHARACTERS
- DELIMITER_SPACE
- DELIMITER_SPACE_CHARACTER
- DETAIL_LEVEL_DEFAULT
- DETAIL_LEVEL_GLYPHS
Constants
DELIMITER_ITEMS_NOT_JOINING
Delimiter constant
DELIMITER_LINE
Delimiter constant
DELIMITER_NEXT_ITEM_DOES_NOT_JOIN
Delimiter constant
DELIMITER_NEXT_ITEM_IS_NOT_A_NUMBER
Delimiter constant
DELIMITER_NONWORD_CHARACTER
Delimiter constant
DELIMITER_NO_DECIMAL_SEPARATOR
Delimiter constant
DELIMITER_NO_NEXT_ITEM
Delimiter constant
DELIMITER_PREV_ITEM_IS_NOT_A_NUMBER
Delimiter constant
DELIMITER_PREV_WORD_IS_BUILD_BY_NON_WORD_CHARACTERS
Delimiter constant
DELIMITER_SPACE
Delimiter constant
DELIMITER_SPACE_CHARACTER
Delimiter constant
DETAIL_LEVEL_DEFAULT
Detail level constant.
Default detail level resulting in instances of SetaPDF_Extractor_Result_Word.
DETAIL_LEVEL_GLYPHS
Detail level constant.
Extended detail level resulting in instances of SetaPDF_Extractor_Result_WordWithGlyphs.
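As a small sketch of how the two detail level constants might be used, the helper below creates a strategy that is switched to the glyph level through setDetailLevel(), which is listed in the method summary. The function name createGlyphStrategy() is hypothetical, and the SetaPDF autoloader is assumed to be registered.

```php
<?php
// Assumption: the SetaPDF autoloader has been registered before this code runs.

/**
 * Creates a word strategy whose results should be instances of
 * SetaPDF_Extractor_Result_WordWithGlyphs instead of plain word results.
 */
function createGlyphStrategy()
{
    $strategy = new SetaPDF_Extractor_Strategy_Word();
    $strategy->setDetailLevel(SetaPDF_Extractor_Strategy_Word::DETAIL_LEVEL_GLYPHS);

    return $strategy;
}
```

Leaving the detail level untouched, or passing DETAIL_LEVEL_DEFAULT, keeps the lighter SetaPDF_Extractor_Result_Word results.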
Static Properties
$characters
Additional UTF-8 sequences that should be handled as word characters.
In outdated PHP 5.2 versions, this property could be used to extend the character classes.
- "\x2D" 'HYPHEN-MINUS' (U+002D)
- "\xE2\x80\x90" 'HYPHEN' (U+2010)
- "\xE2\x88\x92" 'MINUS SIGN' (U+2212)
- "\xE2\x80\x94" 'EM DASH' (U+2014)
- "\xE2\x80\x91" 'NON-BREAKING HYPHEN' (U+2011)
- "\xC2\xB2" 'SUPERSCRIPT TWO' (U+00B2)
- "\xC2\xAD" 'SOFT HYPHEN' (U+00AD)
Properties
$_boundaryFilter
The boundary filter.
$_cleanStreamCallback
A callback that is called before processing a stream.
$_filter
A filter.
$spaceWidthFactor
A factor to calculate whether a distance can be seen as a character separator.
The font's space character width is divided by this factor to define the minimum space for a character separator.
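The division described above can be illustrated with a short, self-contained sketch. The function names and the sample numbers are hypothetical, and treating a gap that is exactly equal to the threshold as a separator is an assumption; the library's internal comparison is not documented here.

```php
<?php
/**
 * Minimum gap that counts as a character separator:
 * the width of the font's space character divided by the factor.
 */
function minimumSeparatorGap(float $spaceWidth, float $spaceWidthFactor): float
{
    if ($spaceWidthFactor <= 0) {
        throw new InvalidArgumentException('The factor must be greater than zero.');
    }

    return $spaceWidth / $spaceWidthFactor;
}

/**
 * Checks whether a measured gap between two glyphs reaches the threshold.
 * Assumption: ">=" is used; the real implementation may compare differently.
 */
function isSeparatorGap(float $gap, float $spaceWidth, float $spaceWidthFactor): bool
{
    return $gap >= minimumSeparatorGap($spaceWidth, $spaceWidthFactor);
}

// Hypothetical numbers: a space glyph 2.5 units wide and a factor of 2
// give a threshold of 1.25 units.
var_dump(isSeparatorGap(1.5, 2.5, 2.0)); // true: 1.5 >= 1.25
var_dump(isSeparatorGap(1.0, 2.5, 2.0)); // false: 1.0 < 1.25
```

A larger factor lowers the threshold, so smaller gaps are already treated as separators; a smaller factor has the opposite effect.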
Static Methods
_isNonWordCharacter()
Checks whether the character is a non-word character.
Parameters
- $character : string
getWords()
Helper method to split a string into words.
This helper allows you to split a string into words with the same logic that is used to build words from the text items resolved from a PDF document.
Parameters
- $text : string
- $encoding : string
Methods
_accept()
Proxy method that forwards the call to a filter instance if available.
This strategy filters space characters automatically if specified (see setIgnoreSpaceCharacter()).
Parameters
- $textItem : SetaPDF_Extractor_TextItem
Exceptions
Throws SetaPDF_Core_Exception
Throws SetaPDF_Extractor_Exception
_createWord()
Creates a new word instance using the glyphs.
Parameters
- $glyphs : SetaPDF_Extractor_Result_Glyph[]
- $delemitterType
Exceptions
Throws SetaPDF_Core_Exception
_getParser()
Creates the content stream parser.
Parameters
- $stream : string
_getSubInstance()
Get an instance of the same strategy for processing another stream (e.g. a Form XObject stream).
Parameters
_ignore()
Method to allow implementation of individual logic.
Parameters
- $string : string
- $prevString : string
- $item : SetaPDF_Extractor_TextItem
- $prevItem : SetaPDF_Extractor_TextItem
_onBeginOrEndText()
Callback for begin or end text operators (BT/ET).
Parameters
- $arguments : array
- $operator : string
_onCurrentTransformationMatrix()
Callback for changes of the current transformation matrix (cm).
Parameters
- $arguments : array
- $operator : string
_onFormXObject()
Callback for painting a specified XObject.
Parameters
- $arguments : array
- $operator : string
Exceptions
Throws SetaPDF_Core_Exception
Throws SetaPDF_Core_Filter_Exception
Throws SetaPDF_Core_Parser_Pdf_InvalidTokenException
Throws SetaPDF_Core_Type_Exception
Throws SetaPDF_Exception
_onGraphicStateChange()
Callback for graphic state change operators (q/Q).
Parameters
- $arguments : array
- $operator : string
_onInlineImage()
Callback for the inline image operator.
Parameters
- $arguments : array
- $operator : string
_onTextPosition()
Callback for text position operators.
Parameters
- $arguments : array
- $operator : string
_onTextShow()
Callback for text show operators.
Parameters
- $arguments : array
- $operator : string
Exceptions
Throws SetaPDF_Core_Exception
_onTextState()
Callback for text state operators.
All states have to be passed to the current graphic state as defined in PDF 32000-1:2008, Table 52 on page 121.
Parameters
- $arguments : array
- $operator : string
Exceptions
Throws SetaPDF_Extractor_Exception
_processLine()
Process all text items of a line.
Parameters
- $items : SetaPDF_Extractor_TextItem[]
Exceptions
Throws SetaPDF_Core_Exception
_showText()
_showTextStrings()
Callback that is called if text strings should be shown.
Parameters
- $textStrings : array
Exceptions
Throws SetaPDF_Core_Exception
getBoundary()
getFilter()
Get the filter.
getKeepIntersecingSpaces()
WARNING: This method is marked as deprecated!
Use getKeepIntersectingSpaces() instead.
getKeepIntersectingSpaces()
Get a flag which defines whether intersecting spaces are ignored or not.
getResult()
Get all resolved words.
Parameters
- $stream : string
- $resources : SetaPDF_Core_Type_Dictionary
Exceptions
Throws SetaPDF_Core_Exception
getSorter()
Get the sorter instance.
If none was set a baseline sorter is created automatically.
process()
Processes a stream through the plain text strategy.
Parameters
- $stream : string
- $resources : SetaPDF_Core_Type_Dictionary
Exceptions
Throws SetaPDF_Core_Exception
setBoundary()
Sets the boundary for the current strategy.
Parameters
- $boundary : SetaPDF_Core_Geometry_Rectangle|null
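A brief sketch of restricting the strategy to a rectangular area follows. The function name createAreaStrategy() is hypothetical, and the four-number form of the SetaPDF_Core_Geometry_Rectangle constructor (two opposite corner points) is assumed from the core library rather than taken from this page.

```php
<?php
// Assumption: the SetaPDF autoloader has been registered before this code runs.

/**
 * Creates a word strategy that only works within the given rectangle,
 * defined by its lower-left and upper-right corners.
 */
function createAreaStrategy(float $llx, float $lly, float $urx, float $ury)
{
    $strategy = new SetaPDF_Extractor_Strategy_Word();
    $strategy->setBoundary(
        new SetaPDF_Core_Geometry_Rectangle($llx, $lly, $urx, $ury)
    );

    return $strategy;
}
```

Since the parameter also accepts null, passing null to setBoundary() is the documented way to remove a boundary again.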
setCleanStreamCallback()
Set a callback that is called before processing a stream.
Parameters
- $callback : callable|null
setFilter()
Set a filter.
Parameters
- $filter : SetaPDF_Extractor_Filter_FilterInterface|null
setGraphicState()
Set the graphic state.
Parameters
- $graphicState : SetaPDF_Core_Canvas_GraphicState
setIgnoreFaultyStreams()
Defines whether to continue when a stream cannot be decoded.
Parameters
- $ignoreFaultyStreams : boolean
setIgnoreSpaceCharacter()
Defines whether a space character should be fetched or not.
If this is set to true, the strategy will use the found space character as a delimiter. If this is set to false (default), the strategy will calculate a delimiter from the distance between two characters/glyphs.
Parameters
- $ignoreSpaceCharacter : bool
setKeepIntersectingSpaces()
Set a flag which defines whether intersecting spaces are ignored or not.
By default, this is set to false, which removes a space or white-space character that intersects with another character by more than 55 percent.
Parameters
- $keep : bool
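To make the 55 percent rule more concrete, here is a self-contained sketch. It assumes that the intersection is measured as the share of the space's horizontal extent covered by the other character; how the library actually measures it is not documented on this page, and both function names are hypothetical.

```php
<?php
/**
 * Share (0..1) of the space's width that is covered by another glyph,
 * using only the horizontal extent [left, right] of each box.
 */
function horizontalOverlapRatio(
    float $spaceLeft,
    float $spaceRight,
    float $otherLeft,
    float $otherRight
): float {
    $width = $spaceRight - $spaceLeft;
    if ($width <= 0) {
        return 0.0;
    }

    $overlap = min($spaceRight, $otherRight) - max($spaceLeft, $otherLeft);

    return max(0.0, $overlap) / $width;
}

/**
 * A space is dropped only if keeping is disabled and the overlap
 * exceeds 55 percent.
 */
function shouldDropSpace(float $ratio, bool $keepIntersectingSpaces): bool
{
    return !$keepIntersectingSpaces && $ratio > 0.55;
}

// Hypothetical boxes: the space spans 10..14, the other glyph 11..20,
// so 3 of 4 units overlap (ratio 0.75).
$ratio = horizontalOverlapRatio(10.0, 14.0, 11.0, 20.0);
var_dump(shouldDropSpace($ratio, false)); // true: removed by default
var_dump(shouldDropSpace($ratio, true));  // false: kept when the flag is set
```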