SetaPDF_Extractor_Strategy_ExactPlain

Extraction strategy for plain text that rebuilds the text from single glyphs.

File: /SetaPDF v2/Extractor/Strategy/ExactPlain.php

Properties

$_cleanStreamCallback

A callback that is called before processing a stream.

$_graphicState

$_ignoreFaultyStreams

Defines whether to continue when a stream cannot be decoded.

$_ignoreSpaceCharacter

Defines whether space characters should be ignored or not.

$_items

$_keepIntersectingSpaces

Defines whether intersecting spaces are kept (true) or removed (false).

$_lastMatrix

$_resources

The stream resources dictionary.

$_sorter

$_textCount

A text item counter.

$spaceWidthFactor

A factor to calculate whether a distance can be seen as a character separator.

The font's space character width is divided by this factor to define the minimum space for a character separator.
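The relationship described above can be sketched as follows. This is a minimal illustration and not the library's implementation: the function names `minimumSeparatorGap()` and `isCharacterSeparator()` are hypothetical, the sample values are arbitrary, and treating a gap that exactly equals the minimum as a separator is an assumption.

```php
<?php
/**
 * Hypothetical helper: the smallest gap that counts as a character
 * separator, i.e. the space character width divided by the factor.
 */
function minimumSeparatorGap(float $spaceWidth, float $spaceWidthFactor): float
{
    if ($spaceWidthFactor <= 0) {
        throw new InvalidArgumentException('The factor must be greater than zero.');
    }

    return $spaceWidth / $spaceWidthFactor;
}

/**
 * Hypothetical helper: decides whether the distance between two glyphs
 * is large enough to be seen as a character separator.
 * Assumption: a distance equal to the minimum already qualifies.
 */
function isCharacterSeparator(float $distance, float $spaceWidth, float $spaceWidthFactor): bool
{
    return $distance >= minimumSeparatorGap($spaceWidth, $spaceWidthFactor);
}
```

A larger factor lowers the minimum gap, so smaller distances between glyphs are treated as separators. A smaller factor raises it.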


Methods

__construct()

_accept()

Proxy method that forwards the call to a filter instance if available.

This strategy filters space characters automatically if specified (see setIgnoreSpaceCharacter()).

Parameters
$textItem : SetaPDF_Extractor_TextItem
 
Exceptions

Throws SetaPDF_Core_Exception

Throws SetaPDF_Extractor_Exception

_cleanResult()

public SetaPDF_Extractor_Strategy_Plain::_cleanResult (
string $result
): string

Callback to clean up the resulting text.

Parameters
$result : string
 

_getParser()

Creates the content stream parser.

Parameters
$stream : string
 

_getSubInstance()

Get an instance of the same strategy for processing another stream (e.g. a Form XObject stream).

Parameters
$gs : SetaPDF_Core_Canvas_GraphicState
 

_ignore()

protected SetaPDF_Extractor_Strategy_Plain::_ignore (
string $string, string $prevString, SetaPDF_Extractor_TextItem $item, SetaPDF_Extractor_TextItem $prevItem
): boolean

Method to allow implementation of individual logic.

Parameters
$string : string
 
$prevString : string
 
$item : SetaPDF_Extractor_TextItem
 
$prevItem : SetaPDF_Extractor_TextItem
 

_onAfterShowText()

public SetaPDF_Extractor_Strategy_Plain::_onAfterShowText (
string $rawString
): void

Callback that is called after a show text operation was invoked.

Parameters
$rawString : string
 
Exceptions

Throws SetaPDF_Extractor_Exception

Throws SetaPDF_Core_Exception

_onBeforeShowText()

Callback that is called before a show text operation is invoked.

_onBeginOrEndText()

public SetaPDF_Extractor_Strategy_Plain::_onBeginOrEndText (
array $arguments, string $operator
): void

Callback for begin or end text operators (BT/ET).

Parameters
$arguments : array
 
$operator : string
 

_onCurrentTransformationMatrix()

public SetaPDF_Extractor_Strategy_Plain::_onCurrentTransformationMatrix (
array $arguments, string $operator
): void

Callback for changes to the current transformation matrix (cm).

Parameters
$arguments : array
 
$operator : string
 

_onFormXObject()

public SetaPDF_Extractor_Strategy_Plain::_onFormXObject (
array $arguments, string $operator
): void

Callback for painting a specified XObject.

Parameters
$arguments : array
 
$operator : string
 
Exceptions

Throws SetaPDF_Core_Exception

Throws SetaPDF_Core_Filter_Exception

Throws SetaPDF_Core_Parser_Pdf_InvalidTokenException

Throws SetaPDF_Core_Type_Exception

Throws SetaPDF_Exception

Throws SetaPDF_Exception_NotImplemented

_onGraphicStateChange()

public SetaPDF_Extractor_Strategy_Plain::_onGraphicStateChange (
array $arguments, string $operator
): void

Callback for graphic state change operators (q/Q).

Parameters
$arguments : array
 
$operator : string
 

_onInlineImage()

public SetaPDF_Extractor_Strategy_Plain::_onInlineImage (
array $arguments, string $operator
): false|void

Callback for the inline image operator.

Parameters
$arguments : array
 
$operator : string
 

_onTextPosition()

public SetaPDF_Extractor_Strategy_Plain::_onTextPosition (
array $arguments, string $operator
): void

Callback for text position operators.

Parameters
$arguments : array
 
$operator : string
 

_onTextShow()

public SetaPDF_Extractor_Strategy_Glyph::_onTextShow (
array $arguments, mixed $operator
): void

Callback that is called if a text should be shown.

Parameters
$arguments : array
 
$operator : mixed
 
Exceptions

Throws SetaPDF_Core_Exception

_onTextState()

public SetaPDF_Extractor_Strategy_Plain::_onTextState (
array $arguments, string $operator
): void

Callback for text state operators.

All states have to be passed to the current graphic state as defined in PDF 32000-1:2008, Table 52 on page 121.

Parameters
$arguments : array
 
$operator : string
 
Exceptions

Throws SetaPDF_Extractor_Exception

_saveLastMatrix()

protected SetaPDF_Extractor_Strategy_Plain::_saveLastMatrix (
string $type
): void

Saves the last matrix by a specific type.

Parameters
$type : string
 

_showText()

protected SetaPDF_Extractor_Strategy_Glyph::_showText (
string $string
): void

Method that shows text.

Parameters
$string : string
 
Exceptions

Throws SetaPDF_Core_Exception

_showTextStrings()

public SetaPDF_Extractor_Strategy_Glyph::_showTextStrings (
array $textStrings
): void

Callback that is called if text strings should be shown.

Parameters
$textStrings : array
 
Exceptions

Throws SetaPDF_Core_Exception

getCleanStreamCallback()

Get the callback that is called before a stream is processed.

getGraphicState()

getIgnoreSpaceCharacter()

Gets whether a space character should be fetched or not.

getKeepIntersecingSpaces()

WARNING: This method is marked as deprecated!

Use getKeepIntersectingSpaces() instead.

getKeepIntersectingSpaces()

Get a flag which defines whether intersecting spaces are kept (true) or removed (false).

getResult()

Get the plain text from a stream.

Parameters
$stream : string
 
$resources : SetaPDF_Core_Type_Dictionary
 
Exceptions

Throws SetaPDF_Core_Exception

getSorter()

Get the sorter instance.

If none was set, a baseline sorter is created automatically.

process()

Processes a stream through the plain text strategy.

Parameters
$stream : string
 
$resources : SetaPDF_Core_Type_Dictionary
 
Exceptions

Throws SetaPDF_Core_Exception

Throws SetaPDF_Core_Parser_Pdf_InvalidTokenException

setBoundary()

Sets the boundary for the current strategy.

Parameters
$boundary : SetaPDF_Core_Geometry_Rectangle|null
 

setCleanStreamCallback()

public SetaPDF_Extractor_Strategy_AbstractStrategy::setCleanStreamCallback (
[ callable|null $callback = null ]
): void

Set a callback that is called before processing a stream.

Parameters
$callback : callable|null
 

setFilter()

setGraphicState()

Set the graphic state.

Parameters
$graphicState : SetaPDF_Core_Canvas_GraphicState
 

setIgnoreFaultyStreams()

public SetaPDF_Extractor_Strategy_AbstractStrategy::setIgnoreFaultyStreams (
boolean $ignoreFaultyStreams
): void

Define whether to continue when a stream cannot be decoded.

Parameters
$ignoreFaultyStreams : boolean
 

setIgnoreSpaceCharacter()

public SetaPDF_Extractor_Strategy_Glyph::setIgnoreSpaceCharacter (
[ bool $ignoreSpaceCharacter = true ]
): void

Defines whether a space character should be fetched or not.

If this is set to true, the strategy will use the found space character as a delimiter. If this is set to false (default), the strategy will calculate a delimiter from the distance between two characters/glyphs.

Parameters
$ignoreSpaceCharacter : bool
 

setKeepIntersectingSpaces()

public SetaPDF_Extractor_Strategy_Plain::setKeepIntersectingSpaces (
[ bool $keep = true ]
): void

Set a flag which defines whether intersecting spaces are kept (true) or removed (false).

By default this is set to false, which removes a space or white-space character that intersects with another character by more than 55 percent.

Parameters
$keep : bool
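
The 55 percent rule can be pictured with a simplified, one-dimensional sketch. This is an illustration under stated assumptions, not the library's code: the function names `overlapRatio()` and `shouldRemoveSpace()` are hypothetical, only horizontal extents are compared, and the overlap is measured relative to the width of the space character.

```php
<?php
/**
 * Hypothetical helper: fraction of the space character's width that is
 * covered by another character (horizontal extents only).
 */
function overlapRatio(float $spaceStart, float $spaceEnd, float $otherStart, float $otherEnd): float
{
    $width = $spaceEnd - $spaceStart;
    if ($width <= 0) {
        return 0.0;
    }

    // Length of the shared interval, clamped to zero if there is none.
    $overlap = min($spaceEnd, $otherEnd) - max($spaceStart, $otherStart);

    return max(0.0, $overlap) / $width;
}

/**
 * Hypothetical helper: a space is removed only if intersecting spaces are
 * not kept and the overlap exceeds 55 percent.
 */
function shouldRemoveSpace(float $ratio, bool $keepIntersectingSpaces): bool
{
    return !$keepIntersectingSpaces && $ratio > 0.55;
}
```

With the flag set to true, the space is kept regardless of how much it overlaps.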
 

setSorter()

Set a sorter instance.

Parameters
$sorter : SetaPDF_Extractor_Sorter