SetaPDF_Extractor_Strategy_ExactPlain
Extraction strategy for plain text that rebuilds the text from single glyphs.
File: /SetaPDF v2/Extractor/Strategy/ExactPlain.php
Summary
Methods
- __construct()
- _accept()
- _cleanResult()
- _getParser()
- _getSubInstance()
- _ignore()
- _onAfterShowText()
- _onBeforeShowText()
- _onBeginOrEndText()
- _onCurrentTransformationMatrix()
- _onFormXObject()
- _onGraphicStateChange()
- _onInlineImage()
- _onTextPosition()
- _onTextShow()
- _onTextState()
- _saveLastMatrix()
- _showText()
- _showTextStrings()
- getBoundary()
- getCleanStreamCallback()
- getFilter()
- getGraphicState()
- getIgnoreSpaceCharacter()
- getKeepIntersecingSpaces()
- getKeepIntersectingSpaces()
- getResult()
- getSorter()
- process()
- setBoundary()
- setCleanStreamCallback()
- setFilter()
- setGraphicState()
- setIgnoreFaultyStreams()
- setIgnoreSpaceCharacter()
- setKeepIntersectingSpaces()
- setSorter()
Properties
$_boundaryFilter
The boundary filter.
$_cleanStreamCallback
A callback that is called before processing a stream.
$_filter
A filter.
$_lastMatrix
Used matrices.
$spaceWidthFactor
A factor to calculate whether a distance can be seen as a character separator.
The font's space character width is divided by this factor to define the minimum distance that counts as a character separator.
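The division described above can be sketched in a few lines. This is only an illustration under stated assumptions: isCharacterSeparator() is a hypothetical helper and not part of SetaPDF, the example widths and the factor of 2 are made up, and whether the library compares with "greater than" or "greater than or equal" is not stated in this reference.

```php
<?php
// Hypothetical helper, not part of SetaPDF: decides whether the gap between
// two glyphs is wide enough to be treated as a character separator.
function isCharacterSeparator(float $gap, float $spaceWidth, float $spaceWidthFactor): bool
{
    // The minimum gap is the font's space character width divided by the factor.
    $minimumGap = $spaceWidth / $spaceWidthFactor;

    // Assumption: a gap equal to the minimum already counts as a separator.
    return $gap >= $minimumGap;
}

// Assumed example values: a space glyph 250 units wide and a factor of 2,
// which gives a minimum gap of 125 units.
var_dump(isCharacterSeparator(130.0, 250.0, 2.0)); // bool(true)
var_dump(isCharacterSeparator(100.0, 250.0, 2.0)); // bool(false)
```

A larger factor lowers the minimum gap, so smaller distances between glyphs are already treated as separators.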
Methods
_accept()
Proxy method that forwards the call to a filter instance if available.
This strategy filters space characters automatically if specified (see setIgnoreSpaceCharacter()).
Parameters
- $textItem : SetaPDF_Extractor_TextItem
Exceptions
Throws SetaPDF_Core_Exception
Throws SetaPDF_Extractor_Exception
_getParser()
Creates the content stream parser.
Parameters
- $stream : string
_getSubInstance()
Get an instance of the same strategy for processing another stream (e.g. a Form XObject stream).
Parameters
_ignore()
Method to allow implementation of individual logic.
Parameters
- $string : string
- $prevString : string
- $item : SetaPDF_Extractor_TextItem
- $prevItem : SetaPDF_Extractor_TextItem
_onAfterShowText()
Callback that is called after a show text operation was invoked.
Parameters
- $rawString : string
Exceptions
Throws SetaPDF_Extractor_Exception
Throws SetaPDF_Core_Exception
_onBeginOrEndText()
Callback for begin or end text operators (BT/ET).
Parameters
- $arguments : array
- $operator : string
_onCurrentTransformationMatrix()
Callback for changes to the current transformation matrix (cm operator).
Parameters
- $arguments : array
- $operator : string
_onFormXObject()
Callback for painting a specified XObject.
Parameters
- $arguments : array
- $operator : string
Exceptions
Throws SetaPDF_Core_Exception
Throws SetaPDF_Core_Filter_Exception
Throws SetaPDF_Core_Parser_Pdf_InvalidTokenException
Throws SetaPDF_Core_Type_Exception
Throws SetaPDF_Exception
_onGraphicStateChange()
Callback for graphic state change operators (q/Q).
Parameters
- $arguments : array
- $operator : string
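In PDF content streams, q saves the current graphic state on a stack and Q restores the most recently saved one. The following is a minimal sketch of that behaviour, not the library's implementation: the class GraphicStateStackSketch and its methods are hypothetical, and replaceCtm() simply overwrites the matrix, whereas a real cm operator concatenates matrices.

```php
<?php
// Hypothetical sketch, not part of SetaPDF: a save/restore stack for the
// graphic state as driven by the q and Q operators.
final class GraphicStateStackSketch
{
    /** @var array[] Saved states, the most recent one last. */
    private $stack = [];

    /** @var array The current state; only the CTM is tracked here. */
    private $current = ['ctm' => [1, 0, 0, 1, 0, 0]];

    public function onOperator(string $operator): void
    {
        if ($operator === 'q') {
            // PHP arrays are copied by value, so this stores a snapshot.
            $this->stack[] = $this->current;
        } elseif ($operator === 'Q' && $this->stack !== []) {
            // Restore the most recently saved state; an unbalanced Q is ignored.
            $this->current = array_pop($this->stack);
        }
    }

    // Simplification: overwrites the matrix instead of concatenating it.
    public function replaceCtm(array $ctm): void
    {
        $this->current['ctm'] = $ctm;
    }

    public function getCtm(): array
    {
        return $this->current['ctm'];
    }
}

$state = new GraphicStateStackSketch();
$state->onOperator('q');                    // save the identity matrix
$state->replaceCtm([2, 0, 0, 2, 10, 10]);   // change the state inside q ... Q
$state->onOperator('Q');                    // restore the saved state
var_dump($state->getCtm() === [1, 0, 0, 1, 0, 0]); // bool(true)
```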
_onInlineImage()
Callback for the inline image operator.
Parameters
- $arguments : array
- $operator : string
_onTextPosition()
Callback for text position operators.
Parameters
- $arguments : array
- $operator : string
_onTextShow()
Callback that is called if a text should be shown.
Parameters
- $arguments : array
- $operator : mixed
Exceptions
Throws SetaPDF_Core_Exception
_onTextState()
Callback for text state operators.
All states have to be passed to the current graphic state as defined in PDF 32000-1:2008, Table 52 on page 121.
Parameters
- $arguments : array
- $operator : string
Exceptions
Throws SetaPDF_Extractor_Exception
_showText()
_showTextStrings()
Callback that is called if text strings should be shown.
Parameters
- $textStrings : array
Exceptions
Throws SetaPDF_Core_Exception
getBoundary()
getFilter()
Get the filter.
getKeepIntersecingSpaces()
WARNING: This method is marked as deprecated!
Use getKeepIntersectingSpaces() instead.
getKeepIntersectingSpaces()
Get a flag which defines whether intersecting spaces are ignored or not.
getResult()
Get the plain text from a stream.
Parameters
- $stream : string
- $resources : SetaPDF_Core_Type_Dictionary
Exceptions
Throws SetaPDF_Core_Exception
getSorter()
Get the sorter instance.
If none was set, a baseline sorter is created automatically.
process()
Processes a stream through the plain text strategy.
Parameters
- $stream : string
- $resources : SetaPDF_Core_Type_Dictionary
Exceptions
Throws SetaPDF_Core_Exception
setBoundary()
Sets the boundary for the current strategy.
Parameters
- $boundary : SetaPDF_Core_Geometry_Rectangle|null
setCleanStreamCallback()
Set a callback that is called before processing a stream.
Parameters
- $callback : callable|null
setFilter()
Set a filter.
Parameters
- $filter : SetaPDF_Extractor_Filter_FilterInterface|null
setGraphicState()
Set the graphic state.
Parameters
- $graphicState : SetaPDF_Core_Canvas_GraphicState
setIgnoreFaultyStreams()
Defines whether to continue when a stream cannot be decoded.
Parameters
- $ignoreFaultyStreams : boolean
setIgnoreSpaceCharacter()
Defines whether a space character should be fetched or not.
If this is set to true, the strategy will use the found space character as a delimiter. If this is set to false (default), the strategy will calculate a delimiter from the distance between two characters/glyphs.
Parameters
- $ignoreSpaceCharacter : bool
setKeepIntersectingSpaces()
Set a flag which defines whether intersecting spaces are ignored or not.
By default this is set to false, which removes a space or white-space character that intersects with another character by more than 55 percent.
Parameters
- $keep : bool
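The 55 percent rule can be illustrated with a small one-dimensional sketch. Both functions below are hypothetical and not part of SetaPDF. The sketch assumes the percentage is measured against the width of the space character and only looks at horizontal extents; this reference does not say how the library measures the intersection.

```php
<?php
// Hypothetical helper, not part of SetaPDF: horizontal overlap of a space
// glyph with another glyph, as a fraction of the space glyph's width.
function overlapRatio(float $spaceLeft, float $spaceRight, float $otherLeft, float $otherRight): float
{
    $width = $spaceRight - $spaceLeft;
    if ($width <= 0.0) {
        return 0.0; // a zero-width space cannot overlap anything
    }

    // Length of the shared interval; negative when the glyphs do not touch.
    $overlap = min($spaceRight, $otherRight) - max($spaceLeft, $otherLeft);

    return max(0.0, $overlap) / $width;
}

// Hypothetical helper: a space is dropped only when intersecting spaces are
// not kept and the overlap exceeds the 55 percent threshold described above.
function shouldDropSpace(float $ratio, bool $keepIntersectingSpaces): bool
{
    return !$keepIntersectingSpaces && $ratio > 0.55;
}

// Assumed example: a space from x=0 to x=10 and a glyph from x=4 to x=20
// share 6 of the space's 10 units.
$ratio = overlapRatio(0.0, 10.0, 4.0, 20.0);
var_dump(shouldDropSpace($ratio, false)); // bool(true)
var_dump(shouldDropSpace($ratio, true));  // bool(false)
```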