The Main Class

Introduction

The SetaPDF-Extractor component is made up of several classes and interfaces. The central entry point for an extraction process is the main class SetaPDF_Extractor.

It controls the extraction process for individual pages and strategies. Internally, it sets up the environment data needed by the strategy in use and acts as a proxy between your script and that strategy.

Get an Instance

The only parameter of the SetaPDF_Extractor constructor is a document instance. This is the document from which the content will be extracted.

Reading the PDF document is left to the Core component, so the same document instance remains available for further work:

PHP
$document = \SetaPDF_Core_Document::loadByFilename('path/to/a/document.pdf');

$extractor = new \SetaPDF_Extractor($document);

// ...

// access the classes/functionalities via the Core component
$document->getCatalog()->setPageLayout(\SetaPDF_Core_Document_PageLayout::ONE_COLUMN);

Getting Results

The Extractor component allows you to work with different extraction strategies which control the detail level of the result. By default the component uses the plain text strategy.

This strategy simply extracts the plain text from a page of a PDF document.

The component returns results page by page through the getResultByPageNumber() method:

Description

Get the result for a specific page, using the default strategy or an individually set one.

Parameters
$pageNumber : integer
 
$boundaryBox : string

If set, the result is limited to the rectangle of the given page boundary. See the SetaPDF_Core_PageBoundaries::XXX_BOX constants for possible values.

Exceptions

Throws SetaPDF_Core_Exception

To get the results for a whole document, the result has to be resolved for each page individually. A simple plain text extraction script for a full document looks like this:

PHP
$document = \SetaPDF_Core_Document::loadByFilename(...);
$extractor = new \SetaPDF_Extractor($document);
$pageCount = $document->getCatalog()->getPages()->count();

$textPerPage = array();
for ($pageNo = 1; $pageNo <= $pageCount; $pageNo++) {
    $textPerPage[$pageNo] = $extractor->getResultByPageNumber($pageNo);
}

// if you're done with the document instance
$document->cleanUp();
...
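
The $boundaryBox parameter described above can be used to restrict the result to one of the page boundaries. The following is a minimal sketch, not a definitive implementation. It assumes that the SetaPDF classes are already loaded by an autoloader and that CROP_BOX is one of the XXX_BOX constants of SetaPDF_Core_PageBoundaries. The file path is the placeholder used earlier on this page.

```php
// Assumption: the SetaPDF classes are available through an autoloader.
$document = \SetaPDF_Core_Document::loadByFilename('path/to/a/document.pdf');
$extractor = new \SetaPDF_Extractor($document);

// Limit the result of page 1 to the rectangle of its crop box.
// Assumption: CROP_BOX is one of the XXX_BOX constants mentioned above.
$text = $extractor->getResultByPageNumber(
    1,
    \SetaPDF_Core_PageBoundaries::CROP_BOX
);

// With the default plain text strategy the result can be printed directly.
echo $text;

// release the document instance when done
$document->cleanUp();
```

Content that lies outside the chosen boundary rectangle should then not be part of the result.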

Setting a Strategy

An individual extraction strategy can be passed to the setStrategy() method:

Description

Set the extraction strategy.

Parameters
$strategy : SetaPDF_Extractor_Strategy_AbstractStrategy
 

The available extraction strategies are described in the following chapters.
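
As an illustration of setStrategy(), the sketch below replaces the default plain text strategy before a page is resolved. The class name SetaPDF_Extractor_Strategy_Word is an assumption made for this example and is not taken from this page. Any class extending SetaPDF_Extractor_Strategy_AbstractStrategy should be usable in the same way.

```php
// Assumption: the SetaPDF classes are available through an autoloader.
$document = \SetaPDF_Core_Document::loadByFilename('path/to/a/document.pdf');
$extractor = new \SetaPDF_Extractor($document);

// Assumption: SetaPDF_Extractor_Strategy_Word is one of the bundled strategies.
$strategy = new \SetaPDF_Extractor_Strategy_Word();
$extractor->setStrategy($strategy);

// The structure of the result depends on the strategy set above.
$result = $extractor->getResultByPageNumber(1);

// release the document instance when done
$document->cleanUp();
```

Because the extractor acts as a proxy to the strategy, the call to getResultByPageNumber() stays the same and only the kind of result changes.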