The Main Class

Introduction

The SetaPDF-Extractor component consists of several classes and interfaces. The central entry point for an extraction process is the main class SetaPDF_Extractor.

It controls the extraction process for individual pages and strategies. Internally it sets up the environment data needed by the selected strategy and acts as a proxy between your script and that strategy.

Get an Instance

The constructor of the SetaPDF_Extractor class takes a single parameter: the document instance from which the content will be extracted.

Reading the PDF document is handled by the Core component. This means you can keep working with the same document instance through the Core component:

PHP
$document = SetaPDF_Core_Document::loadByFilename('path/to/a/document.pdf');

$extractor = new SetaPDF_Extractor($document);

// ...

// access the classes/functionalities via the Core component
$document->getCatalog()->setPageLayout(SetaPDF_Core_Document_PageLayout::ONE_COLUMN);

Getting Results

The Extractor component allows you to work with different extraction strategies which control the detail level of the result. By default the component uses the plain text strategy.

This strategy simply extracts the plain text from a page of a PDF document.

The component returns results per page only, via the getResultByPageNumber() method.

Description
public SetaPDF_Extractor_Result_Segment|string|string[] SetaPDF_Extractor::getResultByPageNumber ( integer $pageNumber [, boolean $cleanUp = null ] )

Get the result for a specific page, using either the default strategy or an individually set one.

Parameters
$pageNumber : integer
 
$cleanUp : boolean

Defines whether the strategy should automatically call the cleanUp() method on internally used objects once they are no longer needed. By default this parameter evaluates to true on PHP versions below 5.3.
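
The following is a minimal sketch of passing the $cleanUp flag explicitly instead of relying on the version-dependent default. The helper name extractSinglePage() is hypothetical and not part of the component. The only library call it makes is getResultByPageNumber() with the signature shown above, and it assumes that page numbers start at 1.

```php
/**
 * Hypothetical helper (not part of SetaPDF): extracts one page and always
 * passes an explicit boolean for $cleanUp.
 *
 * @param object  $extractor  Expected to be a SetaPDF_Extractor instance
 * @param integer $pageNumber Page number, assumed to start at 1
 * @param boolean $cleanUp    Forwarded as the second argument
 * @return mixed The result as returned by the active strategy
 */
function extractSinglePage($extractor, $pageNumber, $cleanUp = true)
{
    if (!is_int($pageNumber) || $pageNumber < 1) {
        throw new InvalidArgumentException('Expected a page number of 1 or greater.');
    }

    // Forward the flag as a real boolean so the behavior does not
    // depend on the PHP version in use.
    return $extractor->getResultByPageNumber($pageNumber, (bool) $cleanUp);
}

// Usage with a real extractor instance:
// $text = extractSinglePage($extractor, 1, true);
```

The parameter is left untyped here so that the helper can be exercised with a stand-in object in tests; in production code a SetaPDF_Extractor type hint would be the more natural choice.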


To get the results for a whole document, the result has to be retrieved for each page individually. A simple plain text extraction script for a full document looks like this:

PHP
$document = ...
$extractor = new SetaPDF_Extractor($document);
$pageCount = $document->getCatalog()->getPages()->count();

$textPerPage = array();
for ($pageNo = 1; $pageNo <= $pageCount; $pageNo++) {
    $textPerPage[$pageNo] = $extractor->getResultByPageNumber($pageNo);
}

...
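
If a single string for the whole document is needed, the per-page loop above can be wrapped in a small function. This is only a sketch: the function name extractDocumentText() and the $separator parameter are hypothetical, and it assumes the default plain text strategy, so that every page result is a string.

```php
/**
 * Hypothetical helper (not part of SetaPDF): collects the plain text of
 * all pages and joins it into one string.
 *
 * @param object  $extractor Expected to be a SetaPDF_Extractor instance
 * @param integer $pageCount Number of pages in the document
 * @param string  $separator Text placed between two pages
 * @return string
 */
function extractDocumentText($extractor, $pageCount, $separator = "\n\n")
{
    $textPerPage = array();

    for ($pageNo = 1; $pageNo <= $pageCount; $pageNo++) {
        $result = $extractor->getResultByPageNumber($pageNo);

        // Other strategies may return arrays or result objects, which
        // cannot simply be concatenated.
        if (!is_string($result)) {
            throw new UnexpectedValueException(
                'Page ' . $pageNo . ' did not return a string result.'
            );
        }

        $textPerPage[$pageNo] = $result;
    }

    return implode($separator, $textPerPage);
}

// Usage with a real document and extractor instance:
// $pageCount = $document->getCatalog()->getPages()->count();
// echo extractDocumentText($extractor, $pageCount);
```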

Setting a Strategy

An individual extraction strategy can be passed to the setStrategy() method:

Description
public void SetaPDF_Extractor::setStrategy ( SetaPDF_Extractor_Strategy_AbstractStrategy $strategy )

Set the extraction strategy.

Parameters
$strategy : SetaPDF_Extractor_Strategy_AbstractStrategy
 

The available extraction strategies are described in the following chapters.
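
As a rough sketch of how setStrategy() and getResultByPageNumber() work together, the helper below sets a strategy and then extracts one page with it. The function name extractPageWithStrategy() is hypothetical. The concrete strategy class in the usage comment, SetaPDF_Extractor_Strategy_Word, is an assumption and does not appear on this page; any instance of a class extending SetaPDF_Extractor_Strategy_AbstractStrategy should fit in its place.

```php
/**
 * Hypothetical helper (not part of SetaPDF): sets a strategy on the
 * extractor and returns the result for one page.
 *
 * @param object  $extractor  Expected to be a SetaPDF_Extractor instance
 * @param object  $strategy   Expected to extend
 *                            SetaPDF_Extractor_Strategy_AbstractStrategy
 * @param integer $pageNumber Page number, assumed to start at 1
 * @return mixed The result type depends on the strategy
 */
function extractPageWithStrategy($extractor, $strategy, $pageNumber)
{
    // The strategy stays set on the extractor after this call, so later
    // getResultByPageNumber() calls use it as well.
    $extractor->setStrategy($strategy);

    return $extractor->getResultByPageNumber($pageNumber);
}

// Usage with a real extractor instance (the class name is an assumption):
// $strategy = new SetaPDF_Extractor_Strategy_Word();
// $result = extractPageWithStrategy($extractor, $strategy, 1);
```

Because the return type of getResultByPageNumber() depends on the strategy, the calling code should check what it receives before treating the result as a string.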