The Main Class
Introduction
The SetaPDF-Extractor component consists of several classes and interfaces. The central entry point for an extraction process is the main class \setasign\SetaPDF2\Extractor\Extractor.
It controls the extraction process for individual pages and strategies. Internally it sets up the environment data needed by the strategy in use and acts as a kind of proxy between your script and that strategy.
Get an Instance
The one and only parameter of the constructor of the \setasign\SetaPDF2\Extractor\Extractor class is a document instance. This is the document from which the content will be extracted.
Reading the PDF document is up to the Core component. This way the same document instance remains available for further work:
use setasign\SetaPDF2\Core\Document;
use setasign\SetaPDF2\Extractor\Extractor;

$document = Document::loadByFilename('path/to/a/document.pdf');
$extractor = new Extractor($document);

// ...

// access the classes/functionalities via the Core component
$document->getCatalog()->setPageLayout(Document\PageLayout::ONE_COLUMN);
Getting Results
The Extractor component allows you to work with different extraction strategies, which control the level of detail of the result. By default the component uses the plain text strategy.
This strategy simply extracts the plain text from a page of a PDF document.
The component returns results per page only, via the getResultByPageNumber() method.
Description
getResultByPageNumber(int $pageNumber, string $boundaryBox = null)
Get the result for a specific page, using the default or an individually set strategy.
Parameters
- $pageNumber : int
- $boundaryBox : string
If set, the result is limited to the rectangle of the given page boundary. See the \setasign\SetaPDF2\Core\PageBoundaries::XXX_BOX constants for possible values.
Exceptions
Throws \setasign\SetaPDF2\Core\Exception
To get the results for a whole document, the result has to be retrieved for each page individually. A simple plain text extraction script for a full document looks like this:
use setasign\SetaPDF2\Core\Document;
use setasign\SetaPDF2\Extractor\Extractor;

$document = Document::loadByFilename(...);
$extractor = new Extractor($document);

$pageCount = $document->getCatalog()->getPages()->count();

$textPerPage = array();
for ($pageNo = 1; $pageNo <= $pageCount; $pageNo++) {
    $textPerPage[$pageNo] = $extractor->getResultByPageNumber($pageNo);
}

// if you're done with the document instance
$document->cleanUp();

//...
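The per-page loop can also be wrapped in a small function that passes the optional $boundaryBox parameter through to getResultByPageNumber(). The following is only a sketch: extractAllPages() is a hypothetical helper and not part of SetaPDF, and it assumes nothing about the extractor beyond the getResultByPageNumber() method described above.

```php
<?php
/**
 * Collect the extraction result of every page, optionally limited to a page boundary.
 *
 * extractAllPages() is a hypothetical helper, not part of SetaPDF. $extractor only
 * needs to offer the getResultByPageNumber() method described above.
 */
function extractAllPages($extractor, int $pageCount, ?string $boundaryBox = null): array
{
    $resultPerPage = [];
    for ($pageNo = 1; $pageNo <= $pageCount; $pageNo++) {
        // null means "no boundary limit", which is the documented default value
        $resultPerPage[$pageNo] = $extractor->getResultByPageNumber($pageNo, $boundaryBox);
    }

    return $resultPerPage;
}
```

With the instances from the script above, a call could look like extractAllPages($extractor, $pageCount, \setasign\SetaPDF2\Core\PageBoundaries::CROP_BOX). CROP_BOX is assumed here to be one of the XXX_BOX constants; check the PageBoundaries class for the values that are actually available.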
Setting a Strategy
An individual extraction strategy can be passed to the setStrategy() method:
Description
Set the extraction strategy.
Parameters
- $strategy : \SetaPDF_Extractor_Strategy_AbstractStrategy
The available extraction strategies are described in the following chapters.
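As a rough illustration of how setStrategy() and getResultByPageNumber() work together, the sketch below sets a strategy and then requests the result of a single page. extractPageWith() is a hypothetical helper and not part of SetaPDF; it relies only on the two methods described on this page.

```php
<?php
/**
 * Set a strategy on an extractor and return the result of a single page.
 *
 * extractPageWith() is a hypothetical helper, not part of SetaPDF. It only uses
 * the setStrategy() and getResultByPageNumber() methods described above.
 */
function extractPageWith($extractor, $strategy, int $pageNumber)
{
    // hand the individual strategy over to the extractor
    $extractor->setStrategy($strategy);

    // the result is now produced by the strategy that was just set
    return $extractor->getResultByPageNumber($pageNumber);
}
```

A call could look like extractPageWith($extractor, new WordStrategy(), 1), where WordStrategy stands for one of the strategy classes; the class name is an assumption here and should be checked against the strategy chapters.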