Filter: Filter Items Before They Are Passed to a Result

Introduction

Sometimes it is necessary to extract only specific text or glyphs from a PDF page.

For this purpose, each strategy offers a setFilter() method that accepts a filter instance.

The filter interface defines a single method, accept(), which is called for each text item found. It has to return either a value that evaluates to true or a boolean false; this decides whether the item is added to the result.

If the returned value evaluates to true but is not a boolean, it is forwarded to the result as a kind of filter identification.
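The return-value rule can be illustrated outside the library. The following is a minimal sketch in plain PHP, not SetaPDF code: collectAccepted() and the item strings are hypothetical stand-ins for a strategy and its text items. It only mimics how a falsy result drops an item, a boolean true keeps it, and any other truthy value is kept as a filter identification.

```php
<?php
/**
 * Hypothetical stand-in for a strategy (not part of SetaPDF): calls $accept
 * for each item and groups the accepted items by the returned value.
 * Items accepted with a boolean true are stored under the key ''.
 */
function collectAccepted(array $items, callable $accept): array
{
    $result = [];
    foreach ($items as $item) {
        $value = $accept($item);
        if (!$value) {
            continue; // falsy return value: the item is dropped
        }
        // true: accepted without id; any other truthy value acts as the id
        $key = ($value === true) ? '' : (string) $value;
        $result[$key][] = $item;
    }

    return $result;
}

// hypothetical callback: numeric items get an id, short strings are just accepted
$accept = function (string $item) {
    if (is_numeric($item)) {
        return 'number';
    }

    return strlen($item) <= 3;
};

$result = collectAccepted(['42', 'abc', 'too long', '7'], $accept);
var_export($result);
```

In this sketch '42' and '7' end up under the key 'number', 'abc' is accepted without an identification, and 'too long' is dropped.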

Be aware that a text item passed to the accept() method is not guaranteed to be a semantic entity.

Each filter also comes with an $id parameter. Depending on the result type of the strategy in use, this parameter is either used as a key in an array of strings (if the expected result is a string) or forwarded to the underlying Glyph and Word instances.

Please note that a filter works on a low level: it is evaluated while a content stream is parsed. The text items it receives are therefore neither ordered nor reassembled into words; they are raw data of the content stream.
A filter cannot be used to search for words or text.

Predefined Filter Classes

The SetaPDF-Extractor component comes with three predefined filter implementations: Rectangle, Font Size and Multi.

Rectangle

The rectangle filter allows you to restrict the matched items to a rectangular area.

The class supports two modes, one of which is passed to the constructor:

public const string SetaPDF_Extractor_Filter_Rectangle::MODE_CONTACT = 'contact'

A mode constant.

In this mode the text item only has to touch the rectangle of this filter instance at any point or intersect it.

public const string SetaPDF_Extractor_Filter_Rectangle::MODE_CONTAINS = 'contains'

A mode constant.

In this mode the whole text item has to lie inside the rectangle of this filter instance.
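The difference between the two modes can be sketched with two simple tests. This is plain PHP for illustration only, not the library's implementation: rectContacts() and rectContains() are hypothetical helpers, rectangles are assumed to be axis-aligned arrays of the form [llx, lly, urx, ury], and the glyph bounds are made-up values.

```php
<?php
/**
 * Hypothetical helpers (not part of SetaPDF). A rectangle is an array
 * [llx, lly, urx, ury] with llx <= urx and lly <= ury.
 */
function rectContacts(array $filter, array $item): bool
{
    // true if both rectangles share at least one point (touching edges count)
    return $item[0] <= $filter[2] && $item[2] >= $filter[0]
        && $item[1] <= $filter[3] && $item[3] >= $filter[1];
}

function rectContains(array $filter, array $item): bool
{
    // true only if the item lies completely inside the filter rectangle
    return $item[0] >= $filter[0] && $item[2] <= $filter[2]
        && $item[1] >= $filter[1] && $item[3] <= $filter[3];
}

$filter = [40, 710, 220, 711]; // a very thin filter rectangle
$glyph  = [50, 705, 56, 715];  // assumed glyph bounds crossing that thin area

var_dump(rectContacts($filter, $glyph)); // bool(true)
var_dump(rectContains($filter, $glyph)); // bool(false)
```

A very thin rectangle is therefore only useful in contact mode; in contains mode the rectangle has to be at least as large as the items it should match.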

The coordinates returned by a glyph or word are in the user space of the PDF page. Because a page can be rotated or its boundary box may be shifted, the rectangle filter internally translates the rectangle values by this rotation and offset. To do this, the filter also implements the SetaPDF_Extractor_Filter_PageFilterInterface interface, which allows a filter to adjust its settings when the page changes.

The rectangle coordinates are relative to the lower left corner of the CropBox of a page. 

The following demo shows you how to extract only the sender name of an invoice:

PHP
<?php
require_once('library/SetaPDF/Autoload.php');

// get a document instance
$document = \SetaPDF_Core_Document::loadByFilename(
    'files/pdfs/camtown/eBook-Invoice.pdf'
);

// initiate an extractor instance
$extractor = new \SetaPDF_Extractor($document);

// get the default strategy
$strategy = $extractor->getStrategy();

// create a rectangle filter with a very low height and with the mode "contact".
$senderFilter = new \SetaPDF_Extractor_Filter_Rectangle(
    new \SetaPDF_Core_Geometry_Rectangle(40, 710, 220, 711),
    \SetaPDF_Extractor_Filter_Rectangle::MODE_CONTACT
);

// set the filter
$strategy->setFilter($senderFilter);

// get the result which is only the sender name and address in the address field
$result = $extractor->getResultByPageNumber(1);

// debug
echo "<pre>";
var_dump($result);

To make this demo easier to understand, the next demo shows the filtered area and the matched glyphs. It uses another strategy for demonstration purposes:

PHP
<?php
require_once('library/SetaPDF/Autoload.php');

// get a document instance
$document = \SetaPDF_Core_Document::loadByFilename(
    'files/pdfs/camtown/eBook-Invoice.pdf'
);

// initiate an extractor instance
$extractor = new \SetaPDF_Extractor($document);

// create a glyph strategy
$strategy = new \SetaPDF_Extractor_Strategy_Glyph();
$extractor->setStrategy($strategy);

// create a rectangle filter with a very low height and with the mode "contact".
$senderFilter = new \SetaPDF_Extractor_Filter_Rectangle(
    new \SetaPDF_Core_Geometry_Rectangle(40, 710, 220, 711),
    \SetaPDF_Extractor_Filter_Rectangle::MODE_CONTACT
);

// set the filter
$strategy->setFilter($senderFilter);

// get the result which is only the sender name and address in the address field
$glyphs = $extractor->getResultByPageNumber(1);

// now draw the filter area and found glyphs rectangles
$page = $document->getCatalog()->getPages()->getPage(1);
$page->getContents()->encapsulateExistingContentInGraphicState();

$canvas = $page->getCanvas();
$path = $canvas->path();
$path->setLineWidth(.5);

$colors = array(
    array(0, 1, 0),
    array(0, 0, 1),
    array(1, 1, 0),
    array(0, 1, 1),
    array(1, 0, 1),
    array(1, 0, 0),
    array(1, 0, 1)
);

$colorIndex = 0;

// draw the bounds of all found glyphs
foreach ($glyphs AS $glyph) {
    $canvas->setStrokingColor($colors[$colorIndex]);

    $bounds = $glyph->getBounds();
    foreach ($bounds AS $bound) {
        $path->moveTo($bound->getUl()->getX(), $bound->getUl()->getY())
            ->lineTo($bound->getUr()->getX(), $bound->getUr()->getY())
            ->lineTo($bound->getLr()->getX(), $bound->getLr()->getY())
            ->lineTo($bound->getLl()->getX(), $bound->getLl()->getY())
            ->closeAndStroke();
    }

    $colorIndex++;
    if ($colorIndex === count($colors))
        $colorIndex = 0;
}

// show the filtered rectangle area
$rect = $senderFilter->getRectangle();
$path->setStrokingColor(.5);
$path->rect($rect->getLl()->getX(), $rect->getLl()->getY(), $rect->getWidth(), $rect->getHeight())
    ->closeAndStroke();

$document->setWriter(new \SetaPDF_Core_Writer_Http('marked-items.pdf', true));
$document->save()->finish();

If this filter were used in contains mode, no glyph would match, because no glyph fits completely into such a thin rectangle. To demonstrate contains mode, we enlarge the rectangle so that it also intersects the address line below the expected line. Because of the contains mode, only the sender name is matched:

PHP
<?php
require_once('library/SetaPDF/Autoload.php');

// get a document instance
$document = \SetaPDF_Core_Document::loadByFilename(
    'files/pdfs/camtown/eBook-Invoice.pdf'
);

// initiate an extractor instance
$extractor = new \SetaPDF_Extractor($document);

// create a glyph strategy
$strategy = new \SetaPDF_Extractor_Strategy_Glyph();
$extractor->setStrategy($strategy);

// create a rectangle filter
$senderFilter = new \SetaPDF_Extractor_Filter_Rectangle(
    new \SetaPDF_Core_Geometry_Rectangle(40, 685, 220, 720),
    \SetaPDF_Extractor_Filter_Rectangle::MODE_CONTAINS
);

// set the filter
$strategy->setFilter($senderFilter);

// get the result: only the glyphs of the sender name, which lie completely inside the rectangle
$glyphs = $extractor->getResultByPageNumber(1);

// now draw the filter area and found glyphs rectangles
$page = $document->getCatalog()->getPages()->getPage(1);
$page->getContents()->encapsulateExistingContentInGraphicState();

$canvas = $page->getCanvas();
$path = $canvas->path();
$path->setLineWidth(.5);

$colors = array(
    array(0, 1, 0),
    array(0, 0, 1),
    array(1, 1, 0),
    array(0, 1, 1),
    array(1, 0, 1),
    array(1, 0, 0),
    array(1, 0, 1)
);

$colorIndex = 0;

// draw the bounds of all found glyphs
foreach ($glyphs AS $glyph) {
    $canvas->setStrokingColor($colors[$colorIndex]);

    $bounds = $glyph->getBounds();
    foreach ($bounds AS $bound) {
        $path->moveTo($bound->getUl()->getX(), $bound->getUl()->getY())
            ->lineTo($bound->getUr()->getX(), $bound->getUr()->getY())
            ->lineTo($bound->getLr()->getX(), $bound->getLr()->getY())
            ->lineTo($bound->getLl()->getX(), $bound->getLl()->getY())
            ->closeAndStroke();
    }

    $colorIndex++;
    if ($colorIndex === count($colors))
        $colorIndex = 0;
}

// show the filtered rectangle area
$rect = $senderFilter->getRectangle();
$path->setStrokingColor(.5);
$path->rect($rect->getLl()->getX(), $rect->getLl()->getY(), $rect->getWidth(), $rect->getHeight())
    ->closeAndStroke();

$document->setWriter(new \SetaPDF_Core_Writer_Http('marked-items.pdf', true));
$document->save()->finish();

Font Size

The font size filter allows you to filter text items based on their font size. Individual modes, each represented by a class constant, give you full control over the matched items:

A mode constant.

Defines that the font size needs to be between the given filter values. If this mode is used, the filter value needs to be an array. Otherwise the mode behaves the same as SetaPDF_Extractor_Filter_FontSize::MODE_EQUALS.

A mode constant.

Defines that the font size needs to be between or equal to the given filter values. If this mode is used, the filter value needs to be an array. Otherwise the mode behaves the same as SetaPDF_Extractor_Filter_FontSize::MODE_EQUALS.

A mode constant.

Defines that the font size needs to be equal to the given filter value.

A mode constant.

Defines that the font size needs to be larger than the given filter value.

A mode constant.

Defines that the font size needs to be larger than or equal to the given filter value.

A mode constant.

Defines that the font size needs to be smaller than the given filter value.

A mode constant.

Defines that the font size needs to be smaller than or equal to the given filter value.
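The comparison rules of these modes, including the fallback to an equality check when a range mode is used without an array, can be sketched as follows. This is plain PHP for illustration, not library code: fontSizeMatches() is a hypothetical helper and its mode labels are made up for this sketch; they are not the values of the class constants.

```php
<?php
/**
 * Hypothetical comparison helper (not part of SetaPDF).
 * $value is a number, or an array of two numbers for the range modes.
 */
function fontSizeMatches(float $size, $value, string $mode = 'equals'): bool
{
    // range modes need an array; otherwise fall back to an equality check
    if (in_array($mode, ['between', 'betweenOrEquals'], true) && !is_array($value)) {
        $mode = 'equals';
    }

    switch ($mode) {
        case 'between':
            return $size > min($value) && $size < max($value);
        case 'betweenOrEquals':
            return $size >= min($value) && $size <= max($value);
        case 'larger':
            return $size > $value;
        case 'largerOrEquals':
            return $size >= $value;
        case 'smaller':
            return $size < $value;
        case 'smallerOrEquals':
            return $size <= $value;
        default: // 'equals'
            return $size == $value;
    }
}

var_dump(fontSizeMatches(24, [24, 30], 'between'));         // bool(false)
var_dump(fontSizeMatches(24, [24, 30], 'betweenOrEquals')); // bool(true)
var_dump(fontSizeMatches(24, 24, 'between'));               // bool(true), fallback to equals
```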

A simple demo script that extracts text with a font size of 24pt could look like:

PHP
<?php
require_once('library/SetaPDF/Autoload.php');

// create a document instance
$document = \SetaPDF_Core_Document::loadByFilename('files/pdfs/Brand-Guide.pdf');

// create an extractor instance
$extractor = new \SetaPDF_Extractor($document);

// create the word strategy...
$strategy = new \SetaPDF_Extractor_Strategy_Word();
// ...and pass it to the extractor
$extractor->setStrategy($strategy);

// create the instance and ...
$filter = new \SetaPDF_Extractor_Filter_FontSize(24);
// ...pass it to the strategy
$strategy->setFilter($filter);

// get access to the document pages
$pages = $document->getCatalog()->getPages();

// iterate over the pages and extract the words:
for ($pageNo = 1; $pageNo <= $pages->count(); $pageNo++) {

    echo '<h1>Words with a font size of ' . $filter->getFontSize() .
         'pt on Page #' . $pageNo . '</h1>';

    $words = $extractor->getResultByPageNumber($pageNo);

    foreach ($words as $word) {
        echo '<li>' . htmlspecialchars($word->getString()) . '</li>';
    }
}

Multi

The SetaPDF_Extractor_Filter_Multi class allows you to combine several filter instances into a single filter.

By default the filters are combined with OR logic: if one of the filters matches, the item is accepted. AND logic is also possible by passing the mode constant SetaPDF_Extractor_Filter_Multi::MODE_AND as the second parameter to the constructor.

The $id parameter of the individual filter instances is only used if the multi-instance is working in OR mode.
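The two logics can be sketched in plain PHP. This is not the library's implementation: multiAccept() and the filter arrays are hypothetical, and returning the id of the first matching filter in OR mode is an assumption made for this sketch.

```php
<?php
/**
 * Hypothetical sketch (not part of SetaPDF). Each filter is an array with
 * a callable under 'accept' and an optional identification under 'id'.
 */
function multiAccept(array $filters, $item, string $mode = 'or')
{
    if ($mode === 'and') {
        foreach ($filters as $filter) {
            if (!$filter['accept']($item)) {
                return false; // a single failing filter rejects the item
            }
        }

        return true; // the ids of the individual filters are not used here
    }

    foreach ($filters as $filter) {
        if ($filter['accept']($item)) {
            // assumption: the first matching filter wins and its id is returned
            return $filter['id'] ?? true;
        }
    }

    return false;
}

$filters = [
    ['id' => 'short', 'accept' => function ($s) { return strlen($s) < 4; }],
    ['id' => 'numeric', 'accept' => 'is_numeric'],
];

var_dump(multiAccept($filters, '12345'));     // string(7) "numeric"
var_dump(multiAccept($filters, '12'));        // string(5) "short"
var_dump(multiAccept($filters, '12', 'and')); // bool(true)
var_dump(multiAccept($filters, 'abcdef'));    // bool(false)
```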

The following demo extracts the sender's line and the invoice number. We add filter ids to the filter instances to get a detailed result:

PHP
<?php
require_once('library/SetaPDF/Autoload.php');

// get a document instance
$document = \SetaPDF_Core_Document::loadByFilename(
    'files/pdfs/camtown/eBook-Invoice.pdf'
);

// initiate an extractor instance
$extractor = new \SetaPDF_Extractor($document);

// get the default strategy
$strategy = $extractor->getStrategy();

// create a rectangle filter with a very low height and with the mode "contact".
$senderFilter = new \SetaPDF_Extractor_Filter_Rectangle(
    new \SetaPDF_Core_Geometry_Rectangle(40, 710, 220, 711),
    \SetaPDF_Extractor_Filter_Rectangle::MODE_CONTACT,
    'senderName' // the identification of this filter
);

$invoiceNoFilter = new \SetaPDF_Extractor_Filter_Rectangle(
    new \SetaPDF_Core_Geometry_Rectangle(512, 520, 580, 540),
    \SetaPDF_Extractor_Filter_Rectangle::MODE_CONTACT,
    'invoiceNo' // the identification of this filter
);

// set the filter
$strategy->setFilter(
    new \SetaPDF_Extractor_Filter_Multi(array($senderFilter, $invoiceNoFilter))
);

/* get the result which is the sender name and address
 * in the address field and the invoice number
 */
$result = $extractor->getResultByPageNumber(1);

// debug
echo "<pre>";
var_dump($result);

Individual Filter

Creating an individual filter is easy: implement at least the SetaPDF_Extractor_Filter_FilterInterface interface.

The following example will show you how to filter only numeric text items:

PHP
<?php
require_once('library/SetaPDF/Autoload.php');

// create a document instance
$document = \SetaPDF_Core_Document::loadByFilename('files/pdfs/Brand-Guide.pdf');

// create an extractor instance
$extractor = new \SetaPDF_Extractor($document);

// create the word strategy...
$strategy = new \SetaPDF_Extractor_Strategy_Word();
// ...and pass it to the extractor
$extractor->setStrategy($strategy);

/**
 * Class NumericFilter
 *
 * An individual filter class that will only accept numeric text items.
 */
class NumericFilter implements \SetaPDF_Extractor_Filter_FilterInterface
{
    /**
     * This is the callback that will decide if a text item will get matched or not.
     *
     * @param \SetaPDF_Extractor_TextItem $textItem
     * @return bool
     */
    public function accept(\SetaPDF_Extractor_TextItem $textItem)
    {
        return is_numeric($textItem->getString());
    }
}

// create the instance and ...
$filter = new NumericFilter();
// ...pass it to the strategy
$strategy->setFilter($filter);

// get access to the document pages
$pages = $document->getCatalog()->getPages();

// iterate over the pages and extract the words:
for ($pageNo = 1; $pageNo <= $pages->count(); $pageNo++) {

    echo '<h1>Numbers on Page #' . $pageNo . '</h1>';
    $words = $extractor->getResultByPageNumber($pageNo);

    foreach ($words as $word) {
        echo '<li>' . htmlspecialchars($word->getString()) . '</li>';
    }
}

If the individual filter depends on a specific page format or another page property, you may implement the SetaPDF_Extractor_Filter_PageFilterInterface interface to keep track of page changes.

The setPage() method is called whenever the page changes.