Filter Items Before They Are Passed to a Result

Introduction

Sometimes it is necessary to extract only specific text, glyphs or words from a PDF page.

For this purpose, each strategy offers a setFilter() method that accepts a filter instance.

The filter interface defines a single method, accept(), which is called with each text item that is found. The method has to return either a value that evaluates to true or a boolean false. This return value decides whether the item is added to the result.

If the returned value is not a boolean but evaluates to true, it is forwarded to the result as a kind of filter identification.
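
The return value rule can be modelled in a few lines of plain PHP. The following sketch is not SetaPDF code: interpretFilterResult() is a hypothetical helper, and treating every falsy value like a boolean false is an assumption of this sketch.

```php
<?php
// Simplified model of the return value rule; this is not SetaPDF code.
// interpretFilterResult() is a hypothetical helper name.
function interpretFilterResult($returned)
{
    if (!$returned) {
        // false (and, by assumption, any other falsy value): drop the item
        return array('accepted' => false, 'filterId' => null);
    }

    // a boolean true carries no identification, any other truthy value does
    $filterId = ($returned === true) ? null : $returned;

    return array('accepted' => true, 'filterId' => $filterId);
}

var_dump(interpretFilterResult(false));
var_dump(interpretFilterResult(true));
var_dump(interpretFilterResult('senderName'));
```

A filter that returns a string such as 'senderName' therefore accepts the item and labels it at the same time.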

Be aware that a text item passed to the accept() method is not guaranteed to be a semantic entity.

Predefined Filter Classes

The SetaPDF-Extractor component comes with three predefined filter implementations:

Rectangle

The rectangle filter allows you to restrict the matched items to a rectangular area.

The class works in one of two modes, which is passed to the constructor:

const string SetaPDF_Extractor_Filter_Rectangle::MODE_CONTACT = 'contact'

A mode constant.

This mode says that the text item has to touch or intersect the rectangle of this filter instance at any point.

const string SetaPDF_Extractor_Filter_Rectangle::MODE_CONTAINS = 'contains'

A mode constant.

This mode says that the whole text item has to be contained by the rectangle of this filter instance.
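
The difference between both modes can be illustrated with axis-aligned rectangles in plain PHP. This is a simplified model and not the implementation of the library: the helper functions, the array layout and the glyph bounds are hypothetical. Only the filter rectangle is the one used in the invoice demo on this page.

```php
<?php
// Simplified geometric model; not the library's implementation.
// Rectangles are arrays [llx, lly, urx, ury] in one coordinate system.

// "contains": the whole item lies inside the filter rectangle
function rectContains(array $filter, array $item)
{
    return $item[0] >= $filter[0] && $item[1] >= $filter[1]
        && $item[2] <= $filter[2] && $item[3] <= $filter[3];
}

// "contact": item and filter rectangle share at least one point
function rectContacts(array $filter, array $item)
{
    return $item[0] <= $filter[2] && $item[2] >= $filter[0]
        && $item[1] <= $filter[3] && $item[3] >= $filter[1];
}

$filter = array(40, 710, 220, 711); // thin rectangle from the invoice demo
$glyph  = array(50, 705, 56, 716);  // hypothetical glyph bounds crossing it

var_dump(rectContacts($filter, $glyph)); // bool(true)
var_dump(rectContains($filter, $glyph)); // bool(false)
```

A rectangle with a height of one point works like a horizontal scan line in contact mode, while almost nothing fits into it in contains mode.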

The coordinates returned by a glyph or word are in the user space of the PDF page. Because a page can be rotated or its boundary box may be shifted, the rectangle filter internally translates the rectangle values by this rotation and offset. To do this, the filter also implements the SetaPDF_Extractor_Filter_PageFilterInterface interface, which can be used to adjust settings when the page changes.

The rectangle coordinates are relative to the lower left corner of the CropBox of a page. 
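
The offset part of this translation can be sketched in plain PHP. The sketch deliberately ignores page rotation and is not the code used by the library: toUserSpace() is a hypothetical helper and the CropBox values are made up for illustration.

```php
<?php
// Offset-only model (rotation is left out on purpose); not SetaPDF code.
// toUserSpace() is a hypothetical helper.
// $rect and $cropBox are arrays [llx, lly, urx, ury].
function toUserSpace(array $rect, array $cropBox)
{
    // the lower left corner of the CropBox is the origin of $rect
    $dx = $cropBox[0];
    $dy = $cropBox[1];

    return array($rect[0] + $dx, $rect[1] + $dy, $rect[2] + $dx, $rect[3] + $dy);
}

// a hypothetical page whose CropBox starts at (10, 20) instead of (0, 0)
$cropBox = array(10, 20, 605, 862);

print_r(toUserSpace(array(40, 710, 220, 711), $cropBox));
```

For a page whose CropBox starts at (0, 0) and that is not rotated, both coordinate systems are identical.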

The following demo shows you how to extract only the sender name of an invoice:

PHP
<?php
require_once('library/SetaPDF/Autoload.php');

// get a document instance
$document = SetaPDF_Core_Document::loadByFilename(
    'files/pdfs/camtown/eBook-Invoice.pdf'
);

// initiate an extractor instance
$extractor = new SetaPDF_Extractor($document);

// get the default strategy
$strategy = $extractor->getStrategy();

// create a rectangle filter with a very low height and with the mode "contact".
$senderFilter = new SetaPDF_Extractor_Filter_Rectangle(
    new SetaPDF_Core_Geometry_Rectangle(40, 710, 220, 711),
    SetaPDF_Extractor_Filter_Rectangle::MODE_CONTACT
);

// set the filter
$strategy->setFilter($senderFilter);

// get the result which is only the sender name and address in the address field
$result = $extractor->getResultByPageNumber(1);

// debug
echo "<pre>";
var_dump($result);            

To make this demo easier to follow, the next demo draws the filtered area and the bounds of the matched glyphs. It uses another strategy for demonstration purposes:

PHP
<?php
require_once('library/SetaPDF/Autoload.php');

// get a document instance
$document = SetaPDF_Core_Document::loadByFilename(
    'files/pdfs/camtown/eBook-Invoice.pdf'
);

// initiate an extractor instance
$extractor = new SetaPDF_Extractor($document);

// create a glyph strategy
$strategy = new SetaPDF_Extractor_Strategy_Glyph();
$extractor->setStrategy($strategy);

// create a rectangle filter with a very low height and with the mode "contact".
$senderFilter = new SetaPDF_Extractor_Filter_Rectangle(
    new SetaPDF_Core_Geometry_Rectangle(40, 710, 220, 711),
    SetaPDF_Extractor_Filter_Rectangle::MODE_CONTACT
);

// set the filter
$strategy->setFilter($senderFilter);

// get the result which is only the sender name and address in the address field
$glyphs = $extractor->getResultByPageNumber(1);

// now draw the filter area and found glyphs rectangles
$page = $document->getCatalog()->getPages()->getPage(1);
$page->getContents()->encapsulateExistingContentInGraphicState();

$canvas = $page->getCanvas();
$path = $canvas->path();
$path->setLineWidth(.5);

$colors = array(
    array(0, 1, 0),
    array(0, 0, 1),
    array(1, 1, 0),
    array(0, 1, 1),
    array(1, 0, 1),
    array(1, 0, 0),
    array(1, 0, 1)
);

$colorIndex = 0;

// draw the bounds of all found glyphs
foreach ($glyphs AS $glyph) {
    $canvas->setStrokingColor($colors[$colorIndex]);

    $bounds = $glyph->getBounds();
    foreach ($bounds AS $bound) {
        $path->moveTo($bound->getUl()->getX(), $bound->getUl()->getY())
            ->lineTo($bound->getUr()->getX(), $bound->getUr()->getY())
            ->lineTo($bound->getLr()->getX(), $bound->getLr()->getY())
            ->lineTo($bound->getLl()->getX(), $bound->getLl()->getY())
            ->closeAndStroke();
    }

    $colorIndex++;
    if ($colorIndex === count($colors)) {
        $colorIndex = 0;
    }
}

// show the filtered rectangle area
$rect = $senderFilter->getRectangle();
$path->setStrokingColor(.5);
$path->rect($rect->getLl()->getX(), $rect->getLl()->getY(), $rect->getWidth(), $rect->getHeight())
    ->closeAndStroke();

$document->setWriter(new SetaPDF_Core_Writer_Http('marked-items.pdf', true));
$document->save()->finish();            

If the filter were used in contains mode with this thin rectangle, the glyphs would not match. To demonstrate this behaviour, we change the rectangle to a larger one that also intersects with the address field below the expected line. Because of the contains mode, only the sender name gets matched:

PHP
<?php
require_once('library/SetaPDF/Autoload.php');

// get a document instance
$document = SetaPDF_Core_Document::loadByFilename(
    'files/pdfs/camtown/eBook-Invoice.pdf'
);

// initiate an extractor instance
$extractor = new SetaPDF_Extractor($document);

// create a glyph strategy
$strategy = new SetaPDF_Extractor_Strategy_Glyph();
$extractor->setStrategy($strategy);

// create a rectangle filter
$senderFilter = new SetaPDF_Extractor_Filter_Rectangle(
    new SetaPDF_Core_Geometry_Rectangle(40, 685, 220, 720),
    SetaPDF_Extractor_Filter_Rectangle::MODE_CONTAINS
);

// set the filter
$strategy->setFilter($senderFilter);

// get the result which is only the sender name and address in the address field
$glyphs = $extractor->getResultByPageNumber(1);

// now draw the filter area and found glyphs rectangles
$page = $document->getCatalog()->getPages()->getPage(1);
$page->getContents()->encapsulateExistingContentInGraphicState();

$canvas = $page->getCanvas();
$path = $canvas->path();
$path->setLineWidth(.5);

$colors = array(
    array(0, 1, 0),
    array(0, 0, 1),
    array(1, 1, 0),
    array(0, 1, 1),
    array(1, 0, 1),
    array(1, 0, 0),
    array(1, 0, 1)
);

$colorIndex = 0;

// draw the bounds of all found glyphs
foreach ($glyphs AS $glyph) {
    $canvas->setStrokingColor($colors[$colorIndex]);

    $bounds = $glyph->getBounds();
    foreach ($bounds AS $bound) {
        $path->moveTo($bound->getUl()->getX(), $bound->getUl()->getY())
            ->lineTo($bound->getUr()->getX(), $bound->getUr()->getY())
            ->lineTo($bound->getLr()->getX(), $bound->getLr()->getY())
            ->lineTo($bound->getLl()->getX(), $bound->getLl()->getY())
            ->closeAndStroke();
    }

    $colorIndex++;
    if ($colorIndex === count($colors)) {
        $colorIndex = 0;
    }
}

// show the filtered rectangle area
$rect = $senderFilter->getRectangle();
$path->setStrokingColor(.5);
$path->rect($rect->getLl()->getX(), $rect->getLl()->getY(), $rect->getWidth(), $rect->getHeight())
    ->closeAndStroke();

$document->setWriter(new SetaPDF_Core_Writer_Http('marked-items.pdf', true));
$document->save()->finish();            

Font Size

The font size filter allows you to filter text items based on their font size. Individual modes, which are represented by class constants, give you full control over the matched items:

const string SetaPDF_Extractor_Filter_FontSize::MODE_BETWEEN = '><'

A mode constant.

Defines that the font size needs to be between the given filter values. If this mode is used, the filter value needs to be an array. Otherwise the mode will be the same as SetaPDF_Extractor_Filter_FontSize::MODE_EQUALS.

const string SetaPDF_Extractor_Filter_FontSize::MODE_BETWEEN_OR_EQUALS = '<=||>='

A mode constant.

Defines that the font size needs to be between or equal to the given filter values. If this mode is used, the filter value needs to be an array. Otherwise the mode will be the same as SetaPDF_Extractor_Filter_FontSize::MODE_EQUALS.

const string SetaPDF_Extractor_Filter_FontSize::MODE_EQUALS = '=='

A mode constant.

Defines that the font size needs to be equal to the given filter value.

const string SetaPDF_Extractor_Filter_FontSize::MODE_LARGER = '>'

A mode constant.

Defines that the font size needs to be larger than the given filter value.

const string SetaPDF_Extractor_Filter_FontSize::MODE_LARGER_OR_EQUALS = '>='

A mode constant.

Defines that the font size needs to be larger than or equal to the given filter value.

const string SetaPDF_Extractor_Filter_FontSize::MODE_SMALLER = '<'

A mode constant.

Defines that the font size needs to be smaller than the given filter value.

const string SetaPDF_Extractor_Filter_FontSize::MODE_SMALLER_OR_EQUALS = '<='

A mode constant.

Defines that the font size needs to be smaller than or equal to the given filter value.
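
The meaning of the modes can be summarised in a small plain PHP model. It is a sketch and not the implementation of the library: matchesFontSize() is a hypothetical helper, the mode strings are the constant values listed above, and taking the smallest and largest array value as range limits is an assumption of this sketch.

```php
<?php
// Simplified model of the mode semantics; not the library's implementation.
// matchesFontSize() is a hypothetical helper.
function matchesFontSize($fontSize, $filterValue, $mode = '==')
{
    $isRangeMode = ($mode === '><' || $mode === '<=||>=');
    if ($isRangeMode && !is_array($filterValue)) {
        $mode = '=='; // documented fallback for a non-array filter value
    }

    switch ($mode) {
        case '><':
            return $fontSize > min($filterValue) && $fontSize < max($filterValue);
        case '<=||>=':
            return $fontSize >= min($filterValue) && $fontSize <= max($filterValue);
        case '>':
            return $fontSize > $filterValue;
        case '>=':
            return $fontSize >= $filterValue;
        case '<':
            return $fontSize < $filterValue;
        case '<=':
            return $fontSize <= $filterValue;
        default: // '=='
            return $fontSize == $filterValue;
    }
}

var_dump(matchesFontSize(12, array(10, 14), '><'));     // inside the range
var_dump(matchesFontSize(10, array(10, 14), '><'));     // on the lower limit
var_dump(matchesFontSize(10, array(10, 14), '<=||>=')); // on the lower limit
```

The only difference between the two range modes is whether a font size that lies exactly on one of the limits is accepted.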

A simple demo script that extracts text with a font size of 24pt could look like this:

PHP
<?php
require_once('library/SetaPDF/Autoload.php');

// create a document instance
$document = SetaPDF_Core_Document::loadByFilename('files/pdfs/Brand-Guide.pdf');

// create an extractor instance
$extractor = new SetaPDF_Extractor($document);

// create the word strategy...
$strategy = new SetaPDF_Extractor_Strategy_Word();
// ...and pass it to the extractor
$extractor->setStrategy($strategy);

// create the instance and ...
$filter = new SetaPDF_Extractor_Filter_FontSize(24);
// ...pass it to the strategy
$strategy->setFilter($filter);

// get access to the document pages
$pages = $document->getCatalog()->getPages();

// iterate over the pages and extract the words:
for ($pageNo = 1; $pageNo <= $pages->count(); $pageNo++) {

    echo '<h1>Words with a font size of ' . $filter->getFontSize() .
         'pt on Page #' . $pageNo . '</h1>';

    $words = $extractor->getResultByPageNumber($pageNo);

    // wrap the list items in a list element to get valid markup
    echo '<ul>';
    foreach ($words as $word) {
        echo '<li>' . htmlspecialchars($word->getString()) . '</li>';
    }
    echo '</ul>';
}            

Multi

The SetaPDF_Extractor_Filter_Multi class allows you to combine several filter instances into a single filter.

By default, the filters are evaluated with an OR logic: if one of the filters matches, the item is accepted. It is also possible to build an AND logic by passing the mode constant SetaPDF_Extractor_Filter_Multi::MODE_AND as the second parameter to the constructor.

The id parameter is only used if the instance is working in AND mode. 
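
The two evaluation modes can be modelled with plain PHP callables. The sketch is not the implementation of the library: combineFilters() and the mode strings 'or' and 'and' are hypothetical, and filter ids are left out completely.

```php
<?php
// Simplified model of the OR/AND combination; not the library's code.
// combineFilters() is a hypothetical helper working on plain callables.
function combineFilters(array $filters, $mode = 'or')
{
    return function ($item) use ($filters, $mode) {
        foreach ($filters as $filter) {
            $accepted = (bool) $filter($item);
            if ($mode === 'or' && $accepted) {
                return true;  // one match is enough
            }
            if ($mode === 'and' && !$accepted) {
                return false; // one miss rejects the item
            }
        }

        // OR: no filter matched, AND: every filter matched
        return $mode === 'and';
    };
}

$isShort = function ($text) {
    return strlen($text) <= 2;
};

$either = combineFilters(array('is_numeric', $isShort));
$both   = combineFilters(array('is_numeric', $isShort), 'and');

var_dump($either('123')); // numeric but not short
var_dump($both('123'));   // numeric but not short
```

Both modes stop as soon as the outcome is clear, so the order of the filters only matters for performance in this model.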

The following demo extracts the sender line and the invoice number. We add filter ids to the filter instances to get a detailed result:

PHP
<?php
require_once('library/SetaPDF/Autoload.php');

// get a document instance
$document = SetaPDF_Core_Document::loadByFilename(
    'files/pdfs/camtown/eBook-Invoice.pdf'
);

// initiate an extractor instance
$extractor = new SetaPDF_Extractor($document);

// get the default strategy
$strategy = $extractor->getStrategy();

// create a rectangle filter with a very low height and with the mode "contact".
$senderFilter = new SetaPDF_Extractor_Filter_Rectangle(
    new SetaPDF_Core_Geometry_Rectangle(40, 710, 220, 711),
    SetaPDF_Extractor_Filter_Rectangle::MODE_CONTACT,
    'senderName' // the identification of this filter
);

$invoiceNoFilter = new SetaPDF_Extractor_Filter_Rectangle(
    new SetaPDF_Core_Geometry_Rectangle(512, 520, 580, 540),
    SetaPDF_Extractor_Filter_Rectangle::MODE_CONTACT,
    'invoiceNo' // the identification of this filter
);

// set the filter
$strategy->setFilter(
    new SetaPDF_Extractor_Filter_Multi(array($senderFilter, $invoiceNoFilter))
);

/* get the result which is the sender name and address
 * in the address field and the invoice number
 */
$result = $extractor->getResultByPageNumber(1);

// debug
echo "<pre>";
var_dump($result);            

Individual Filter

Creating an individual filter is easy: the class has to implement at least the SetaPDF_Extractor_Filter_FilterInterface interface.

The following example will show you how to filter only numeric text items:

PHP
<?php
require_once('library/SetaPDF/Autoload.php');

// create a document instance
$document = SetaPDF_Core_Document::loadByFilename('files/pdfs/Brand-Guide.pdf');

// create an extractor instance
$extractor = new SetaPDF_Extractor($document);

// create the word strategy...
$strategy = new SetaPDF_Extractor_Strategy_Word();
// ...and pass it to the extractor
$extractor->setStrategy($strategy);

/**
 * Class numeric filter
 *
 * An individual filter class that will only accept numeric text items.
 */
class NumericFilter implements SetaPDF_Extractor_Filter_FilterInterface
{
    /**
     * This is the callback that will decide if a text item will get matched or not.
     *
     * @param SetaPDF_Extractor_TextItem $textItem
     * @return bool
     */
    public function accept(SetaPDF_Extractor_TextItem $textItem)
    {
        return is_numeric($textItem->getString());
    }
}

// create the instance and ...
$filter = new NumericFilter();
// ...pass it to the strategy
$strategy->setFilter($filter);

// get access to the document pages
$pages = $document->getCatalog()->getPages();

// iterate over the pages and extract the words:
for ($pageNo = 1; $pageNo <= $pages->count(); $pageNo++) {

    echo '<h1>Numbers on Page #' . $pageNo . '</h1>';
    $words = $extractor->getResultByPageNumber($pageNo);

    // wrap the list items in a list element to get valid markup
    echo '<ul>';
    foreach ($words as $word) {
        echo '<li>' . htmlspecialchars($word->getString()) . '</li>';
    }
    echo '</ul>';
}            
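
Note that is_numeric() also accepts decimal and exponent notation such as "12.5" or "1e3". If only plain digit sequences should match, the accept() method could delegate to a stricter check. The following sketch shows one possible check in plain PHP; isDigitsOnly() is a hypothetical helper and not part of SetaPDF.

```php
<?php
// isDigitsOnly() is a hypothetical helper, not part of SetaPDF.
function isDigitsOnly($text)
{
    // \z anchors at the very end, so a trailing newline is not accepted
    return preg_match('/^[0-9]+\z/', $text) === 1;
}

// compare both checks for a few sample strings
foreach (array('2024', '12.5', '1e3', 'abc') as $text) {
    printf(
        "%-5s is_numeric: %-5s digits only: %s\n",
        $text,
        var_export(is_numeric($text), true),
        var_export(isDigitsOnly($text), true)
    );
}
```

Inside accept(), the line return isDigitsOnly($textItem->getString()); would then replace the is_numeric() call.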

If the individual filter is related to a specific page format or another page property, you may implement the SetaPDF_Extractor_Filter_PageFilterInterface interface to keep track of page changes.

The setPage() method is called whenever the page changes.