setasign\SetaPDF2\Extractor\Strategy

WordGroupStrategy Extraction strategy for word groups

File: /SetaPDF v2/Extractor/Strategy/WordGroupStrategy.php
Old class name (alias): \SetaPDF_Extractor_Strategy_WordGroup

The result of this strategy is sorted from top-left to bottom-right.

Each group is represented by an instance of \setasign\SetaPDF2\Extractor\Result\Words.

This class allows you to retrieve a group's boundary through the getBounds() method.

Constants

DELIMITER_ITEMS_NOT_JOINING

Delimiter constant

DELIMITER_LINE

public const int WordStrategy::DELIMITER_LINE = 2

Delimiter constant

DELIMITER_NEXT_ITEM_DOES_NOT_JOIN

Delimiter constant

DELIMITER_NEXT_ITEM_IS_NOT_A_NUMBER

Delimiter constant

DELIMITER_NONWORD_CHARACTER

Delimiter constant

DELIMITER_NO_DECIMAL_SEPARATOR

Delimiter constant

DELIMITER_NO_NEXT_ITEM

public const int WordStrategy::DELIMITER_NO_NEXT_ITEM = 64

Delimiter constant

DELIMITER_PREV_ITEM_IS_NOT_A_NUMBER

Delimiter constant

DELIMITER_PREV_WORD_IS_BUILD_BY_NON_WORD_CHARACTERS

Delimiter constant

DELIMITER_SPACE

public const int WordStrategy::DELIMITER_SPACE = 1

Delimiter constant

DELIMITER_SPACE_CHARACTER

Delimiter constant
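The three values shown on this page (1, 2 and 64) are powers of two, which suggests that the delimiter constants can be combined and tested as bit flags. The page does not state this, so treat it as an assumption. The sketch below uses a hypothetical local class, DelimiterFlags, that only mirrors the documented values; it does not touch the library.

```php
<?php
// Local stand-ins mirroring the three values documented on this page.
// Treating them as combinable bit flags is an assumption, not documented behaviour.
final class DelimiterFlags
{
    public const SPACE = 1;         // WordStrategy::DELIMITER_SPACE
    public const LINE = 2;          // WordStrategy::DELIMITER_LINE
    public const NO_NEXT_ITEM = 64; // WordStrategy::DELIMITER_NO_NEXT_ITEM
}

// Returns true if every bit of $flag is set in $delimiterType.
function hasDelimiter(int $delimiterType, int $flag): bool
{
    return ($delimiterType & $flag) === $flag;
}

// A word that ended because of a space and because no further item followed.
$type = DelimiterFlags::SPACE | DelimiterFlags::NO_NEXT_ITEM;

var_dump(hasDelimiter($type, DelimiterFlags::SPACE));
var_dump(hasDelimiter($type, DelimiterFlags::LINE));
```

If the constants turn out to be plain enumeration values rather than flags, a strict comparison (===) against a single constant is the safer check.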

DETAIL_LEVEL_DEFAULT

public const string WordStrategy::DETAIL_LEVEL_DEFAULT = 'default'

Detail level constant.

Default detail level resulting in instances of \setasign\SetaPDF2\Extractor\Result\Word.

DETAIL_LEVEL_GLYPHS

public const string WordStrategy::DETAIL_LEVEL_GLYPHS = 'glyphs'

Detail level constant.

Extended detail level resulting in instances of \setasign\SetaPDF2\Extractor\Result\WordWithGlyphs.


Static Properties

$characters

static public string WordStrategy::$characters = '\\-‐−—‑²­'

Additional UTF-8 sequences that should be handled as word characters

In outdated PHP versions (5.2), this property could be used to extend the character classes.

"\x2D" 'HYPHEN-MINUS' (U+002D)
"\xE2\x80\x90" 'HYPHEN' (U+2010)
"\xE2\x88\x92" 'MINUS SIGN' (U+2212)
"\xE2\x80\x94" 'EM DASH' (U+2014)
"\xE2\x80\x91" 'NON-BREAKING HYPHEN' (U+2011)
"\xC2\xB2" 'SUPERSCRIPT TWO' (U+00B2)
"\xC2\xAD" 'SOFT HYPHEN' (U+00AD)

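The byte sequences documented for $characters can be checked against their code points with a few lines of plain PHP. The helper utf8CodePoint() below is hypothetical and not part of the library; it decodes one UTF-8 sequence of up to three bytes by hand, without validating the lead byte.

```php
<?php
// Decodes a single UTF-8 sequence of 1 to 3 bytes into its code point.
// Minimal sketch: lead and continuation bytes are not validated.
function utf8CodePoint(string $bytes): int
{
    $b = array_values(unpack('C*', $bytes));
    switch (count($b)) {
        case 1:
            return $b[0];
        case 2:
            return (($b[0] & 0x1F) << 6) | ($b[1] & 0x3F);
        case 3:
            return (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
        default:
            throw new InvalidArgumentException('Expected a 1 to 3 byte UTF-8 sequence.');
    }
}

// Byte sequences and code points as listed for WordStrategy::$characters.
$listed = [
    "\x2D"         => 0x002D, // HYPHEN-MINUS
    "\xE2\x80\x90" => 0x2010, // HYPHEN
    "\xE2\x88\x92" => 0x2212, // MINUS SIGN
    "\xE2\x80\x94" => 0x2014, // EM DASH
    "\xE2\x80\x91" => 0x2011, // NON-BREAKING HYPHEN
    "\xC2\xB2"     => 0x00B2, // SUPERSCRIPT TWO
    "\xC2\xAD"     => 0x00AD, // SOFT HYPHEN
];

foreach ($listed as $bytes => $expected) {
    $actual = utf8CodePoint($bytes);
    printf("U+%04X %s\n", $actual, $actual === $expected ? 'ok' : 'mismatch');
}
```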

Properties

$_allowedFontSizeDifference

The allowed height difference.

$_boundaryFilter

$_cleanStreamCallback

A callback that is called before processing a stream.

$_detailLevel

protected string WordStrategy::$_detailLevel = 'default'

The detail level.

$_graphicState

The graphic state instance.

$_hyphens

protected string WordGroupStrategy::$_hyphens = '-­'

These characters are used as hyphens in the dehyphen logic.

"\x2D" 'HYPHEN-MINUS' (U+002D)
"\xC2\xAD" 'Soft Hyphen (SHY)' (U+00AD)

$_ignoreFaultyStreams

protected bool AbstractStrategy::$_ignoreFaultyStreams = false

Defines whether to continue when a stream cannot be decoded.

$_ignoreSpaceCharacter

protected bool GlyphStrategy::$_ignoreSpaceCharacter = false

Defines whether space characters should be ignored or not.

$_items

The text items.

$_keepIntersectingSpaces

protected bool PlainStrategy::$_keepIntersectingSpaces = false

Defines whether intersecting spaces should be ignored or not.

$_rectScaleFactorX

protected float|int WordGroupStrategy::$_rectScaleFactorX = 0.274

The value that is used to scale the word-bounding-box on the x-axis.

$_rectScaleFactorY

protected float|int WordGroupStrategy::$_rectScaleFactorY = 0.2707

The value that is used to scale the word-bounding-box on the y-axis.

$_resources

The stream resources dictionary.

$_sorter

The sorter instance.

$_textCount

protected int PlainStrategy::$_textCount = 0

A text item counter.

$_useDehyphen

protected bool WordGroupStrategy::$_useDehyphen = true

Defines whether the dehyphen logic should be executed or not.

$spaceWidthFactor

public float PlainStrategy::$spaceWidthFactor = 2.0

A factor to calculate whether a distance can be seen as a character separator.

The font's space character width is divided by this factor to define the minimum space for a character separator.
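The division described for $spaceWidthFactor can be written out as a small calculation. The function names below are hypothetical, all values are assumed to be in the same unit, and whether the library compares with "greater than" or "greater than or equal" is not documented here; the sketch uses the latter.

```php
<?php
// Minimum gap that counts as a character separator:
// the width of the font's space character divided by the factor.
function minimumSeparatorGap(float $spaceWidth, float $spaceWidthFactor = 2.0): float
{
    if ($spaceWidthFactor <= 0) {
        throw new InvalidArgumentException('The factor must be greater than zero.');
    }

    return $spaceWidth / $spaceWidthFactor;
}

// Assumption: a gap equal to the minimum already counts as a separator.
function isSeparatorGap(float $gap, float $spaceWidth, float $spaceWidthFactor = 2.0): bool
{
    return $gap >= minimumSeparatorGap($spaceWidth, $spaceWidthFactor);
}

// With a space width of 3.0 and the default factor 2.0 the threshold is 1.5.
var_dump(minimumSeparatorGap(3.0));
var_dump(isSeparatorGap(2.0, 3.0));
var_dump(isSeparatorGap(1.0, 3.0));
```

A larger factor lowers the threshold, so smaller gaps are treated as separators; a smaller factor has the opposite effect.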


Static Methods

_isNonWordCharacter()

protected static WordStrategy::_isNonWordCharacter (
string $character
): bool

Checks whether the character is a non-word character.

Parameters
$character : string
 

getWords()

public static WordStrategy::getWords (
string $text,
string $encoding = 'UTF-8'
): array

Helper method to split a string into words.

This helper allows you to split a string into words with the same logic that is used to build words from text items extracted from a PDF document.

Parameters
$text : string
 
$encoding : string
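A guarded call to getWords() might look like the sketch below. The fully qualified class name is assembled from the namespace shown at the top of this page, which is an assumption, and the wrapper splitIntoWords() is hypothetical. The shape of the returned array entries is not documented here, so the example only counts them.

```php
<?php
// Assumed fully qualified name: namespace from the top of this page plus "WordStrategy".
const WORD_STRATEGY = 'setasign\\SetaPDF2\\Extractor\\Strategy\\WordStrategy';

// Returns the result of WordStrategy::getWords(), or null if the class cannot be loaded.
function splitIntoWords(string $text, string $encoding = 'UTF-8'): ?array
{
    if (!class_exists(WORD_STRATEGY)) {
        return null; // SetaPDF is not installed or not registered with the autoloader
    }

    return call_user_func([WORD_STRATEGY, 'getWords'], $text, $encoding);
}

$words = splitIntoWords('A hyphen-ated example with 1,234.5 numbers.');
echo $words === null
    ? "SetaPDF is not available\n"
    : count($words) . " entries returned\n";
```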
 

Methods

__construct()

_accept()

protected GlyphStrategy::_accept (): bool|string

Proxy method that forwards the call to a filter instance if available.

This strategy filters space characters automatically if specified (see setIgnoreSpaceCharacter()).

Parameters
$textItem : \SetaPDF_Extractor_TextItem
 
Exceptions

Throws \setasign\SetaPDF2\Core\Exception

Throws \setasign\SetaPDF2\Extractor\Exception

_calculateFontSize()

Calculates the font size of a text item in user space.

Parameters
$textItem : \SetaPDF_Extractor_TextItem
 

_cleanResult()

public PlainStrategy::_cleanResult (
string $result
): string

Callback to clean up the resulting text.

Parameters
$result : string
 

_createWord()

Creates a new word instance using the glyphs.

Parameters
$glyphs : \SetaPDF_Extractor_Result_Glyph[]
 
$delemitterType
 
Exceptions

Throws \setasign\SetaPDF2\Core\Exception

_dehyphen()

protected WordGroupStrategy::_dehyphen (
array $group
): array

Executes the dehyphen logic.

Parameters
$group : array
 
Exceptions

Throws \ReflectionException

Throws \setasign\SetaPDF2\Core\Exception

_getParser()

Creates the content stream parser.

Parameters
$stream : string
 

_getSubInstance()

Get an instance of the same strategy for processing another stream (e.g. a Form XObject stream).

Parameters
$gs : \SetaPDF_Core_Canvas_GraphicState
 

_ignore()

protected PlainStrategy::_ignore (
string $string,
string $prevString,
\SetaPDF_Extractor_TextItem $item,
\SetaPDF_Extractor_TextItem $prevItem
): bool

Method to allow implementation of individual logic.

Parameters
$string : string
 
$prevString : string
 
$item : \SetaPDF_Extractor_TextItem
 
$prevItem : \SetaPDF_Extractor_TextItem
 

_insertIntoStorage()

protected WordGroupStrategy::_insertIntoStorage (
string $stream,
\SetaPDF_Core_Type_Dictionary $resources
): void

Prepares and fills the storage for resolving word groups.

Parameters
$stream : string
 
$resources : \SetaPDF_Core_Type_Dictionary
 
Exceptions

Throws \setasign\SetaPDF2\Core\Exception

_onBeginOrEndText()

public PlainStrategy::_onBeginOrEndText (
array $arguments,
string $operator
): void

Callback for begin or end text operators (BT/ET).

Parameters
$arguments : array
 
$operator : string
 

_onCurrentTransformationMatrix()

public PlainStrategy::_onCurrentTransformationMatrix (
array $arguments,
string $operator
): void

Callback for ctm changes (cm).

Parameters
$arguments : array
 
$operator : string
 

_onFormXObject()

public PlainStrategy::_onFormXObject (
array $arguments,
string $operator
): void

Callback for painting a specified XObject.

Parameters
$arguments : array
 
$operator : string
 
Exceptions

Throws \setasign\SetaPDF2\Core\Exception

Throws \setasign\SetaPDF2\Core\Filter\Exception

Throws \setasign\SetaPDF2\Core\Parser\Pdf\InvalidTokenException

Throws \setasign\SetaPDF2\Core\Type\Exception

Throws \setasign\SetaPDF2\Exception

Throws \setasign\SetaPDF2\NotImplementedException

_onGraphicStateChange()

public PlainStrategy::_onGraphicStateChange (
array $arguments,
string $operator
): void

Callback for graphic state changes operators (q/Q).

Parameters
$arguments : array
 
$operator : string
 

_onInlineImage()

public PlainStrategy::_onInlineImage (
array $arguments,
string $operator
): false|void

Callback for the inline image operator.

Parameters
$arguments : array
 
$operator : string
 

_onTextPosition()

public PlainStrategy::_onTextPosition (
array $arguments,
string $operator
): void

Callback for text position operators.

Parameters
$arguments : array
 
$operator : string
 

_onTextShow()

public PlainStrategy::_onTextShow (
array $arguments,
string $operator
): void

Callback for text show operators.

Parameters
$arguments : array
 
$operator : string
 
Exceptions

Throws \setasign\SetaPDF2\Core\Exception

_onTextState()

public PlainStrategy::_onTextState (
array $arguments,
string $operator
): void

Callback for text state operators.

All states have to be passed to the current graphic state as defined in PDF 32000-1:2008, Table 52 on page 121.

Parameters
$arguments : array
 
$operator : string
 
Exceptions

Throws \setasign\SetaPDF2\Extractor\Exception

_processLine()

Process all text items of a line.

Parameters
$items : \SetaPDF_Extractor_TextItem[]
 
Exceptions

Throws \setasign\SetaPDF2\Core\Exception

_resolveGroup()

Resolves all intersecting entries of the storage, starting from a single entry.

Parameters
$storageEntry : \SetaPDF_Extractor_Storage_StorageEntry
 

_showText()

protected GlyphStrategy::_showText (
string $string
): void

Method that shows text.

Parameters
$string : string
 
Exceptions

Throws \setasign\SetaPDF2\Core\Exception

_showTextStrings()

public PlainStrategy::_showTextStrings (
array $textStrings
): void

Callback that is called if text strings should be shown.

Parameters
$textStrings : array
 
Exceptions

Throws \setasign\SetaPDF2\Core\Exception

getAllowedFontSizeDifference()

Gets the currently allowed font-size difference.

getCleanStreamCallback()

public AbstractStrategy::getCleanStreamCallback (
void
): ?callable

Get the callback that is called before a stream is processed.

getDehyphen()

public WordGroupStrategy::getDehyphen (
void
): bool

Gets whether the dehyphen logic should be executed or not.

getDetailLevel()

public WordStrategy::getDetailLevel (
void
): string

Get the detail level of the expected result.

getFilter()

getGraphicState()

Get the graphic state.

getIgnoreSpaceCharacter()

public GlyphStrategy::getIgnoreSpaceCharacter (
void
): bool

Gets whether a space character should be fetched or not.

getKeepIntersecingSpaces()

WARNING: This method is marked as deprecated!

Use getKeepIntersectingSpaces() instead.

getKeepIntersectingSpaces()

Get a flag which defines whether intersecting spaces are ignored or not.

getRectScaleFactorX()

public WordGroupStrategy::getRectScaleFactorX (
void
): float|int

Gets the rect scale-factor for the abscissa.

getRectScaleFactorY()

public WordGroupStrategy::getRectScaleFactorY (
void
): float|int

Gets the rect scale-factor for the ordinate.

getResult()

Get all resolved word groups.

Parameters
$stream : string
 
$resources : \SetaPDF_Core_Type_Dictionary
 
Exceptions

Throws \ReflectionException

Throws \setasign\SetaPDF2\Core\Exception

getSorter()

Get the sorter instance.

If none was set a baseline sorter is created automatically.

process()

Processes a stream through the plain text strategy.

Parameters
$stream : string
 
$resources : \SetaPDF_Core_Type_Dictionary
 
Exceptions

Throws \setasign\SetaPDF2\Core\Exception

Throws \setasign\SetaPDF2\Core\Parser\Pdf\InvalidTokenException

setAllowedFontSizeDifference()

public WordGroupStrategy::setAllowedFontSizeDifference (
float|int $value
): void

Sets the current allowed font-size difference.

Parameters
$value : float|int
 

setBoundary()

Sets the boundary for the current strategy.

Parameters
$boundary : ?\SetaPDF_Core_Geometry_Rectangle
 

setCleanStreamCallback()

public AbstractStrategy::setCleanStreamCallback (
?callable $callback = null
): void

Set a callback that is called before processing a stream.

Parameters
$callback : ?callable
 

setDehyphen()

public WordGroupStrategy::setDehyphen (
bool $dehyphen,
?string $hyphens = null
): void

Sets whether the dehyphen logic should be executed or not.

Parameters
$dehyphen : bool
 
$hyphens : ?string
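What dehyphenation means can be illustrated with a simplified, library-independent sketch. It assumes that a word at the end of a line which ends with one of the hyphen characters is joined with the first word of the next line. The real _dehyphen() logic works on word groups and its exact rules are not documented on this page; the functions and the input shape (a list of lines, each a list of word strings) are hypothetical.

```php
<?php
// Returns the hyphen a word ends with, or null. A word that consists only of
// a hyphen is left untouched.
function trailingHyphen(string $word, array $hyphens): ?string
{
    foreach ($hyphens as $hyphen) {
        if (strlen($word) > strlen($hyphen) && substr($word, -strlen($hyphen)) === $hyphen) {
            return $hyphen;
        }
    }

    return null;
}

// Joins a hyphenated fragment at a line end with the first word of the next line.
// Default hyphens: HYPHEN-MINUS (U+002D) and SOFT HYPHEN (U+00AD), as UTF-8 bytes.
function dehyphenLines(array $lines, array $hyphens = ["\x2D", "\xC2\xAD"]): array
{
    $words = [];
    $pending = null;       // fragment (without hyphen) waiting for its continuation
    $pendingHyphen = '';   // the hyphen that was stripped from the fragment

    foreach ($lines as $line) {
        $line = array_values($line);
        $last = count($line) - 1;

        foreach ($line as $i => $word) {
            if ($pending !== null) {
                $word = $pending . $word;
                $pending = null;
            }

            $hyphen = $i === $last ? trailingHyphen($word, $hyphens) : null;
            if ($hyphen !== null) {
                $pending = substr($word, 0, -strlen($hyphen));
                $pendingHyphen = $hyphen;
                continue;
            }

            $words[] = $word;
        }
    }

    if ($pending !== null) {
        $words[] = $pending . $pendingHyphen; // no continuation: keep the hyphen
    }

    return $words;
}

print_r(dehyphenLines([['This', 'is', 'hyphen-'], ['ated', 'text']]));
```

A rule this simple also joins compounds that legitimately break at a hyphen (for example "well-" followed by "known"), which is one reason to make the behaviour switchable and the hyphen set configurable.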
 

setDetailLevel()

public WordStrategy::setDetailLevel (
string $detailLevel
): void

Set the detail level of the result.

Parameters
$detailLevel : string
 

setFilter()

Set a filter.

Parameters
$filter : ?\SetaPDF_Extractor_Filter_FilterInterface
 

setGraphicState()

Set the graphic state.

Parameters
$graphicState : \SetaPDF_Core_Canvas_GraphicState
 

setIgnoreFaultyStreams()

public AbstractStrategy::setIgnoreFaultyStreams (
bool $ignoreFaultyStreams
): void

Defines whether to continue when a stream cannot be decoded.

Parameters
$ignoreFaultyStreams : bool
 

setIgnoreSpaceCharacter()

public GlyphStrategy::setIgnoreSpaceCharacter (
bool $ignoreSpaceCharacter = true
): void

Defines whether a space character should be fetched or not.

If this is set to true, the strategy will use the found space character as a delimiter. If this is set to false (default), the strategy will calculate a delimiter from the distance between two characters/glyphs.

Parameters
$ignoreSpaceCharacter : bool
 

setKeepIntersectingSpaces()

public PlainStrategy::setKeepIntersectingSpaces (
bool $keep = true
): void

Set a flag which defines whether intersecting spaces are ignored or not.

By default, this is set to false, which removes a space or white-space character that intersects with another character by more than 55 percent.

Parameters
$keep : bool
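The 55 percent rule can be pictured with a one-dimensional overlap calculation along the writing direction. How the library actually measures the intersection (width, area or something else) is not stated on this page, so the functions below are a hypothetical illustration only.

```php
<?php
// Fraction of the interval [$aStart, $aEnd] that is covered by [$bStart, $bEnd].
function overlapFraction(float $aStart, float $aEnd, float $bStart, float $bEnd): float
{
    $width = $aEnd - $aStart;
    if ($width <= 0) {
        return 0.0;
    }

    $overlap = min($aEnd, $bEnd) - max($aStart, $bStart);

    return max(0.0, $overlap) / $width;
}

// Assumption: a space is dropped when more than 55 percent of it is covered.
function spaceWouldBeRemoved(float $spaceStart, float $spaceEnd, float $glyphStart, float $glyphEnd): bool
{
    return overlapFraction($spaceStart, $spaceEnd, $glyphStart, $glyphEnd) > 0.55;
}

// A space from 10 to 14 and a glyph from 11 to 20: 3 of 4 units are covered.
var_dump(spaceWouldBeRemoved(10.0, 14.0, 11.0, 20.0));
// The same space and a glyph from 13 to 20: only 1 of 4 units is covered.
var_dump(spaceWouldBeRemoved(10.0, 14.0, 13.0, 20.0));
```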
 

setRectScaleFactorX()

public WordGroupStrategy::setRectScaleFactorX (
float|int $value
): void

Sets the rect scale-factor on the abscissa.

The boundaries of the words are scaled using the product of the font-size and the given scale-factor.

Parameters
$value : float|int
 

setRectScaleFactorY()

public WordGroupStrategy::setRectScaleFactorY (
float|int $value
): void

Sets the rect scale-factor on the ordinate.

The boundaries of the words are scaled using the product of the font-size and the given scale-factor.

Parameters
$value : float|int
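One way to read the two rect scale-factors: each word box grows by font size times factor on the respective axis, and words whose grown boxes intersect end up in the same group. The sketch below follows that reading with the documented defaults (0.274 and 0.2707). Growing the box on both sides, the array keys and the function names are assumptions, not the library's implementation.

```php
<?php
// Grows a box ['llx', 'lly', 'urx', 'ury'] by fontSize * factor on each side.
function scaleWordBox(array $box, float $fontSize, float $factorX = 0.274, float $factorY = 0.2707): array
{
    $dx = $fontSize * $factorX;
    $dy = $fontSize * $factorY;

    return [
        'llx' => $box['llx'] - $dx,
        'lly' => $box['lly'] - $dy,
        'urx' => $box['urx'] + $dx,
        'ury' => $box['ury'] + $dy,
    ];
}

// Axis-aligned rectangle intersection test (touching edges count as intersecting).
function boxesIntersect(array $a, array $b): bool
{
    return $a['llx'] <= $b['urx'] && $b['llx'] <= $a['urx']
        && $a['lly'] <= $b['ury'] && $b['lly'] <= $a['ury'];
}

$fontSize = 10.0;
$first  = ['llx' => 0.0,  'lly' => 0.0, 'urx' => 20.0, 'ury' => 10.0];
$near   = ['llx' => 24.0, 'lly' => 0.0, 'urx' => 40.0, 'ury' => 10.0]; // gap of 4 units
$far    = ['llx' => 30.0, 'lly' => 0.0, 'urx' => 46.0, 'ury' => 10.0]; // gap of 10 units

$scaledFirst = scaleWordBox($first, $fontSize);
var_dump(boxesIntersect($scaledFirst, scaleWordBox($near, $fontSize)));
var_dump(boxesIntersect($scaledFirst, scaleWordBox($far, $fontSize)));
```

Under this reading, larger factors merge words that are further apart into one group, and smaller factors split the page into more, tighter groups.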
 

setSorter()

Set a sorter instance.

Parameters
$sorter : \SetaPDF_Extractor_Sorter