The Main Class
Introduction
The SetaPDF-Extractor component is made up of several classes and interfaces. The central entry point for an extraction process is the main class SetaPDF_Extractor.
It controls the extraction process for individual pages and strategies. Internally it sets up the environment data needed by the active strategy and acts as a proxy between your script and that strategy.
Get an Instance
The only parameter of the SetaPDF_Extractor constructor is a document instance: the document from which the content will be extracted.
Reading the PDF document is handled by the Core component, so the same document instance can be used for further work:
    $document = \SetaPDF_Core_Document::loadByFilename('path/to/a/document.pdf');
    $extractor = new \SetaPDF_Extractor($document);

    // ...

    // access the classes/functionalities via the Core component
    $document->getCatalog()->setPageLayout(\SetaPDF_Core_Document_PageLayout::ONE_COLUMN);
Getting Results
The Extractor component lets you work with different extraction strategies, which control the level of detail of the result. By default the component uses the plain text strategy, which simply extracts the plain text from a page of a PDF document.
The component returns results per page only, via the getResultByPageNumber() method.
Description
Get the result for a specific page, using the default or an individually set strategy.
Parameters
- $pageNumber : integer
- $boundaryBox : string
  If set, the result is limited to the rectangle of the given page boundary. See the SetaPDF_Core_PageBoundaries::XXX_BOX constants for possible values.
Exceptions
Throws SetaPDF_Core_Exception
To get the results for a whole document, the result has to be fetched for each page individually. A simple plain text extraction script for a full document looks like this:
    $document = \SetaPDF_Core_Document::loadByFilename(...);
    $extractor = new \SetaPDF_Extractor($document);

    $pageCount = $document->getCatalog()->getPages()->count();

    $textPerPage = array();
    for ($pageNo = 1; $pageNo <= $pageCount; $pageNo++) {
        $textPerPage[$pageNo] = $extractor->getResultByPageNumber($pageNo);
    }

    // if you're done with the document instance
    $document->cleanUp();
    ...
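The optional $boundaryBox parameter described above can be combined with the same per-page loop. The following is only a minimal sketch: extractPagesWithinBox() is a hypothetical helper, not part of SetaPDF, and the CROP_BOX constant in the usage comment is assumed to be one of the SetaPDF_Core_PageBoundaries::XXX_BOX constants.

```php
<?php

/**
 * Collect the result of every page, limited to one page boundary box.
 *
 * $extractor is expected to be a \SetaPDF_Extractor instance. It is left
 * untyped so the sketch can also be tried with a stand-in object.
 * extractPagesWithinBox() is a hypothetical helper, not part of SetaPDF.
 */
function extractPagesWithinBox($extractor, int $pageCount, string $boundaryBox): array
{
    $resultPerPage = [];

    // page numbers are 1-based, as in the loop of the full-document script
    for ($pageNo = 1; $pageNo <= $pageCount; $pageNo++) {
        $resultPerPage[$pageNo] = $extractor->getResultByPageNumber($pageNo, $boundaryBox);
    }

    return $resultPerPage;
}

// Usage with the real component (assumes that a CROP_BOX constant exists):
// $pageCount = $document->getCatalog()->getPages()->count();
// $textPerPage = extractPagesWithinBox(
//     $extractor, $pageCount, \SetaPDF_Core_PageBoundaries::CROP_BOX
// );
```

Passing the boundary as an argument keeps the loop identical to the plain text script and makes it easy to compare the results for different boxes.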
Setting a Strategy
An individual extraction strategy can be passed to the setStrategy() method:
Description
Set the extraction strategy.
Parameters
- $strategy : SetaPDF_Extractor_Strategy_AbstractStrategy
The available extraction strategies are described in the following chapters.
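Whichever strategy is chosen, the call sequence stays the same: set the strategy first, then fetch the result per page. The sketch below illustrates that order only. extractPageWithStrategy() is a hypothetical helper, and the strategy class named in the usage comment is an assumption that should be checked against the strategy chapters.

```php
<?php

/**
 * Set an extraction strategy and return the result of a single page.
 *
 * $extractor is expected to be a \SetaPDF_Extractor instance and $strategy
 * an instance of a class extending
 * \SetaPDF_Extractor_Strategy_AbstractStrategy. Both are left untyped so
 * the sketch can be tried with stand-in objects.
 * extractPageWithStrategy() is a hypothetical helper, not part of SetaPDF.
 */
function extractPageWithStrategy($extractor, $strategy, int $pageNo)
{
    // the strategy has to be set before the result is requested
    $extractor->setStrategy($strategy);

    // the type of the returned result depends on the strategy in use
    return $extractor->getResultByPageNumber($pageNo);
}

// Usage with the real component (the class name is an assumption):
// $strategy = new \SetaPDF_Extractor_Strategy_Word();
// $result = extractPageWithStrategy($extractor, $strategy, 1);
```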