The Main Class

Introduction

The SetaPDF-Extractor component consists of several classes and interfaces. The central entry point for an extraction process is the main class \setasign\SetaPDF2\Extractor\Extractor.

It controls the extraction process for individual pages and strategies. Internally it sets up the environment data needed by the strategy in use and acts as a kind of proxy between your script and that strategy.

Get an Instance

The only parameter of the \setasign\SetaPDF2\Extractor\Extractor constructor is the document instance from which the content will be extracted.

Reading the PDF document is left to the Core component, so you can keep working with the document instance afterwards:

PHP
use setasign\SetaPDF2\Core\Document;
use setasign\SetaPDF2\Extractor\Extractor;

$document = Document::loadByFilename('path/to/a/document.pdf');

$extractor = new Extractor($document);

// ...

// access the classes/functionalities via the Core component
$document->getCatalog()->setPageLayout(Document\PageLayout::ONE_COLUMN);

Setting a Strategy

The Extractor component allows you to work with different extraction strategies which control the detail level of the result. By default the component uses the plain text strategy.

This strategy simply extracts the plain text from a page of a PDF document.

An individual extraction strategy can be passed to the setStrategy() method.
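
As a minimal sketch, the following passes a word-based strategy to the extractor. The class name setasign\SetaPDF2\Extractor\Strategy\WordStrategy is an assumption derived from the component's naming scheme; check the strategy chapters for the classes your version actually provides.

PHP
use setasign\SetaPDF2\Core\Document;
use setasign\SetaPDF2\Extractor\Extractor;
// assumption: the word strategy lives in the Strategy sub-namespace
use setasign\SetaPDF2\Extractor\Strategy\WordStrategy;

$document = Document::loadByFilename('path/to/a/document.pdf');
$extractor = new Extractor($document);

// replace the default plain text strategy
$strategy = new WordStrategy();
$extractor->setStrategy($strategy);

// results are now created by the strategy set above
$result = $extractor->getResultByPageNumber(1);

The type of the returned result depends on the strategy that is set.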

The available extraction strategies are described in the following chapters.

Getting Results

The component allows you to get results by the Extractor::getResultByPageNumber() or Extractor::getResultByPage() methods:

getResultByPage()

Get the result for a specific page, identified by its page object, using the default strategy or the one you set.

getResultByPageNumber()

Get the result for a specific page, identified by its page number, using the default strategy or the one you set.
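
Both methods can be sketched side by side. This assumes that the pages object returned by getCatalog()->getPages() offers a getPage() method that resolves a page object by its page number.

PHP
use setasign\SetaPDF2\Core\Document;
use setasign\SetaPDF2\Extractor\Extractor;

$document = Document::loadByFilename('path/to/a/document.pdf');
$extractor = new Extractor($document);

// address the page by its number
$textByNumber = $extractor->getResultByPageNumber(1);

// address the page by its page object
// (assumption: getPage() returns the page object for a page number)
$page = $document->getCatalog()->getPages()->getPage(1);
$textByPage = $extractor->getResultByPage($page);

Both calls address the first page, so with the same strategy they should yield equivalent results.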

To get the results of a whole document, the result has to be resolved for each page individually. A simple plain text extraction script for a full document looks like this:

PHP
use setasign\SetaPDF2\Core\Document;
use setasign\SetaPDF2\Extractor\Extractor;

$document = Document::loadByFilename(...);
$extractor = new Extractor($document);
$pageCount = $document->getCatalog()->getPages()->count();

$textPerPage = array();
for ($pageNo = 1; $pageNo <= $pageCount; $pageNo++) {
    $textPerPage[$pageNo] = $extractor->getResultByPageNumber($pageNo);
}

// if you're done with the document instance
$document->cleanUp();
//...

To limit the result to the content of a page boundary, just pass the boundary constant to the $boundaryBox parameter. For example:

PHP
use setasign\SetaPDF2\Core\PageBoundaries;
$result = $extractor->getResultByPageNumber($pageNo, PageBoundaries::CROP_BOX);

Process a Result Several Times

There are situations where you want to extract different information from the same content (e.g. to identify a document type by a special keyword in a pre-defined area). Instead of parsing the PDF content stream again and again, it is possible to create an intermediate state, which you can re-use in a strategy. You can get this intermediate result through the Extractor::getTextItemsByPage() method.

This intermediate result can then be passed to the AbstractStrategy::getResultByTextItems() method of the individual strategy. You can find example code for such a process on the page for filters.
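
The flow can be sketched as follows. The argument lists are assumptions: getTextItemsByPage() is assumed to accept a page object and getResultByTextItems() to accept the returned text items. WordStrategy and getPage() are likewise assumed names, and strategy-specific configuration such as filters is omitted.

PHP
use setasign\SetaPDF2\Core\Document;
use setasign\SetaPDF2\Extractor\Extractor;
// assumption: class name of a strategy extending AbstractStrategy
use setasign\SetaPDF2\Extractor\Strategy\WordStrategy;

$document = Document::loadByFilename('path/to/a/document.pdf');
$extractor = new Extractor($document);

$strategy = new WordStrategy();
$extractor->setStrategy($strategy);

// assumption: getPage() returns the page object for a page number
$page = $document->getCatalog()->getPages()->getPage(1);

// parse the content stream once (assumed signature)
$textItems = $extractor->getTextItemsByPage($page);

// re-use the intermediate result instead of parsing the page again
$firstResult = $strategy->getResultByTextItems($textItems);
// ... change the strategy configuration here, e.g. its filter ...
$secondResult = $strategy->getResultByTextItems($textItems);

The page is parsed only in the getTextItemsByPage() call; every later result is built from the stored text items.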