setasign\SetaPDF2\Extractor

Extractor

The main class of the SetaPDF-Extractor component.

File: /SetaPDF v2/Extractor/Extractor.php
Old class name (alias): \SetaPDF_Extractor

Constants

VERSION

public const string Extractor::VERSION = '2.48.0.2155'

The version


Properties

$_document

The document instance

$_ignoreFaultyStreams

protected bool Extractor::$_ignoreFaultyStreams = false

Defines whether to continue when a stream cannot be decoded.

$_strategy

The extraction strategy


Methods

__construct()

public Extractor::__construct (
\setasign\SetaPDF2\Core\Document $document,
?Strategy\AbstractStrategy $strategy = null,
bool $ignoreFaultyStreams = false
)

The constructor.

Parameters
$document : \setasign\SetaPDF2\Core\Document
 
$strategy : ?Strategy\AbstractStrategy
 
$ignoreFaultyStreams : bool
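
A minimal construction sketch. It assumes Composer autoloading, a hypothetical input file example.pdf, and that the document is loaded with Document::loadByFilename() from the Core component, which is not described on this page.

```php
<?php
use setasign\SetaPDF2\Core\Document;
use setasign\SetaPDF2\Extractor\Extractor;

require_once 'vendor/autoload.php'; // assumption: the library is installed via Composer

// Hypothetical file name; loadByFilename() belongs to the Core component.
$document = Document::loadByFilename('example.pdf');

// No strategy is passed (null), so the extractor falls back to its default strategy.
// The third argument sets $ignoreFaultyStreams to true, so extraction continues
// when a stream cannot be decoded.
$extractor = new Extractor($document, null, true);
```

Leaving the third argument at its default (false) is the stricter choice: a stream that cannot be decoded then stops the extraction instead of being skipped.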
 

cleanUp()

public Extractor::cleanUp (
void
): void

Release cyclic references.

getResultByPage()

Get the result by the default or individual strategy of a specific page by its page object.

Parameters
$page : \setasign\SetaPDF2\Core\Document\Page
 
$boundaryBox : ?string

If set, the page boundary is used to limit the result to the rectangle of the given boundary. See \setasign\SetaPDF2\Core\PageBoundaries::XXX_BOX constants for possible values.

Exceptions

Throws \setasign\SetaPDF2\Core\Exception

Throws \setasign\SetaPDF2\Core\Parser\Pdf\InvalidTokenException

Throws \setasign\SetaPDF2\Core\Type\Exception

getResultByPageNumber()

public Extractor::getResultByPageNumber (
int $pageNumber,
?string $boundaryBox = null
): Result\Collection|Result\Words|Result\WordGroups|string|string[]

Get the result by the default or individual strategy of a specific page by its page number.

Parameters
$pageNumber : int
 
$boundaryBox : ?string

If set, the page boundary is used to limit the result to the rectangle of the given boundary. See \setasign\SetaPDF2\Core\PageBoundaries::XXX_BOX constants for possible values.

Exceptions

Throws \setasign\SetaPDF2\Core\Exception
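
A sketch of page-by-page extraction with getResultByPageNumber(). The file name is hypothetical, and two things are assumptions not stated on this page: that page numbers start at 1, and that the page count is available through the Core catalog ($document->getCatalog()->getPages()->count()). PageBoundaries::CROP_BOX is used as one of the XXX_BOX constants.

```php
<?php
use setasign\SetaPDF2\Core\Document;
use setasign\SetaPDF2\Core\PageBoundaries;
use setasign\SetaPDF2\Extractor\Extractor;

require_once 'vendor/autoload.php'; // assumption: the library is installed via Composer

$document = Document::loadByFilename('example.pdf'); // hypothetical file name
$extractor = new Extractor($document);

// Assumption: the page count is read through the Core catalog/pages helper.
$pageCount = $document->getCatalog()->getPages()->count();

for ($pageNumber = 1; $pageNumber <= $pageCount; $pageNumber++) {
    // Limit the result to the crop box of each page.
    $result = $extractor->getResultByPageNumber($pageNumber, PageBoundaries::CROP_BOX);

    // The return type depends on the active strategy; only plain strings are printed here.
    if (is_string($result)) {
        echo 'Page ' . $pageNumber . ":\n" . $result . "\n\n";
    }
}

// Release cyclic references once the extractor is no longer needed.
$extractor->cleanUp();
```

The is_string() check reflects the union return type in the signature: other strategies may return a collection object or an array instead of a string.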

getStrategy()

Get the extraction strategy.

getTextItemsByPage()

public Extractor::getTextItemsByPage (
\setasign\SetaPDF2\Core\Document\Page $page,
?string $boundaryBox = null
): TextItem[]

Get all text items by the default or individual strategy of a specific page by its page object.

These text items can be used to get a result by an individual method of a strategy (e.g. the Strategy\PlainStrategy::getResultByTextItems() method). By using this intermediate state it is possible to use several filters, which may collect the same text items.

Parameters
$page : \setasign\SetaPDF2\Core\Document\Page
 
$boundaryBox : ?string

If set, the page boundary is used to limit the result to the rectangle of the given boundary. See \setasign\SetaPDF2\Core\PageBoundaries::XXX_BOX constants for possible values.

Exceptions

Throws \setasign\SetaPDF2\Core\Exception

Throws \setasign\SetaPDF2\Core\Parser\Pdf\InvalidTokenException

Throws \setasign\SetaPDF2\Core\Type\Exception
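
The intermediate state described above can be sketched as follows. This is an illustration under assumptions: the file name is hypothetical, the page object is fetched through the Core catalog, PlainStrategy is assumed to be constructible without arguments, and getResultByTextItems() is assumed to accept the array returned by getTextItemsByPage().

```php
<?php
use setasign\SetaPDF2\Core\Document;
use setasign\SetaPDF2\Extractor\Extractor;
use setasign\SetaPDF2\Extractor\Strategy\PlainStrategy;

require_once 'vendor/autoload.php'; // assumption: the library is installed via Composer

$document = Document::loadByFilename('example.pdf'); // hypothetical file name

// Keep a reference to the strategy so its own methods can be called later.
$strategy = new PlainStrategy(); // assumption: no constructor arguments are required
$extractor = new Extractor($document, $strategy);

// Assumption: the first page is fetched through the Core catalog/pages helper.
$page = $document->getCatalog()->getPages()->getPage(1);

// Collect the text items once ...
$textItems = $extractor->getTextItemsByPage($page);

// ... and resolve them through the strategy. The same $textItems array could be
// reused after changing the strategy's configuration (for example, a filter).
$text = $strategy->getResultByTextItems($textItems);
echo $text, "\n";
```

Collecting the items once and resolving them several times avoids re-parsing the page content for each filter.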

setStrategy()

public Extractor::setStrategy (
Strategy\AbstractStrategy $strategy
): void

Set the extraction strategy.

Parameters
$strategy : Strategy\AbstractStrategy