Filter: Filter Items Before They Are Passed to a Result

Introduction

Sometimes it is necessary to extract only specific text or glyphs from a PDF page.

For this purpose, each strategy offers a setFilter() method that accepts a filter instance.

The filter interface defines a single method, accept(), which is called for each text item found. This method has to return either a value that evaluates to true or a boolean false to decide whether the item is added to the result.

If the returned value evaluates to true but is not a boolean, it is forwarded to the result as a kind of filter identification.
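The return value rules above can be sketched in plain PHP. The helper interpretAcceptResult() is hypothetical and not part of SetaPDF; it only illustrates one reading of the rules, and it assumes that any value that does not evaluate to true rejects the item:

```php
<?php

/**
 * Hypothetical helper (not part of SetaPDF): describes how a value
 * returned by accept() is treated according to the rules above.
 *
 * @param mixed $returnValue
 * @return array with the keys "accepted" (bool) and "id" (mixed)
 */
function interpretAcceptResult($returnValue): array
{
    // assumption: anything that does not evaluate to true rejects the item
    if (!$returnValue) {
        return ['accepted' => false, 'id' => null];
    }

    // a plain boolean true accepts the item without an identification
    if ($returnValue === true) {
        return ['accepted' => true, 'id' => null];
    }

    // any other truthy value accepts the item and doubles as its filter id
    return ['accepted' => true, 'id' => $returnValue];
}

var_dump(interpretAcceptResult(false));
var_dump(interpretAcceptResult(true));
var_dump(interpretAcceptResult('invoiceNo'));
```

Returning an id instead of true is what allows a result to be traced back to the filter that accepted it.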

Be aware that a text item passed to the accept() method is not guaranteed to be a semantic entity.

Each filter also comes with an $id parameter. Depending on the result type of the strategy in use, this parameter is either used as a key in an array of strings (if the expected result is a string) or forwarded to the underlying Glyph and Word instances.
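To illustrate the "key of an array of strings" case, the following sketch groups accepted text by filter id. It does not use the library; the item texts, the ids and the plain concatenation are assumptions made for illustration, and the real result may differ in detail:

```php
<?php

// Hypothetical text items together with the value a filter returned for them.
$filtered = [
    ['text' => 'ACME ',   'filterResult' => 'senderName'],
    ['text' => 'Ltd.',    'filterResult' => 'senderName'],
    ['text' => 'Lorem',   'filterResult' => false],
    ['text' => 'IN-1001', 'filterResult' => 'invoiceNo'],
];

$result = [];
foreach ($filtered as $item) {
    // rejected items never reach the result
    if (!$item['filterResult']) {
        continue;
    }

    // the id becomes the array key, the text is appended to its string
    $id = $item['filterResult'];
    $result[$id] = ($result[$id] ?? '') . $item['text'];
}

print_r($result);
```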

Please note that a filter works on a low level: it is evaluated while a content stream is parsed. This means that the text items it receives are neither ordered nor reassembled into words; they are raw data from the content stream.
A filter cannot be used to search for words or text.
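A small sketch shows why a per-item check cannot find words. The fragments below are hypothetical; they stand for raw text items as they might appear in a content stream, where one word can be split across several items:

```php
<?php

// Hypothetical raw text items in content stream order. The word "Invoice"
// is split into two items and nothing has been reassembled yet.
$rawItems = ['Total', 'Inv', 'oice', '42'];

// A per-item check, as a filter would perform it: each item is seen in isolation.
$acceptsInvoice = static function (string $item): bool {
    return $item === 'Invoice';
};

// No single item equals "Invoice", so nothing is accepted.
$matches = array_values(array_filter($rawItems, $acceptsInvoice));
var_dump($matches);

// Only after the fragments are joined does the word exist.
$joined = implode('', array_slice($rawItems, 1, 2));
var_dump($joined);
```

Searching for words therefore has to happen on the extracted result, not inside a filter.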

Predefined Filter Classes

The SetaPDF-Extractor component comes with the following predefined filter implementations:

Rectangle

The rectangle filter allows you to restrict the matched items to a rectangular area.

The class supports two modes, one of which is passed to the constructor:

\setasign\SetaPDF2\Extractor\Filter\RectangleFilter::MODE_CONTACT

In this mode the text item has to touch or intersect the rectangle of the filter instance at any point.

\setasign\SetaPDF2\Extractor\Filter\RectangleFilter::MODE_CONTAINS

In this mode the whole text item has to lie inside the rectangle of the filter instance.
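The difference between the contact and the contains mode can be sketched with two geometric checks. The functions below are hypothetical and simplified: they assume axis-aligned boxes given as [llx, lly, urx, ury], whereas the library works with the real bounds of glyphs and words. The glyph boxes are made-up values:

```php
<?php

/**
 * Hypothetical check for the "contact" idea: the boxes touch or overlap.
 */
function rectContacts(array $filter, array $item): bool
{
    return $item[0] <= $filter[2] && $item[2] >= $filter[0]
        && $item[1] <= $filter[3] && $item[3] >= $filter[1];
}

/**
 * Hypothetical check for the "contains" idea: the item lies completely inside.
 */
function rectContains(array $filter, array $item): bool
{
    return $item[0] >= $filter[0] && $item[2] <= $filter[2]
        && $item[1] >= $filter[1] && $item[3] <= $filter[3];
}

$thinFilter  = [40, 710, 220, 711]; // very low rectangle
$largeFilter = [40, 685, 220, 720]; // larger rectangle
$glyph       = [50, 705, 56, 715];  // made-up glyph crossing the thin rectangle
$lowerGlyph  = [50, 680, 56, 690];  // made-up glyph crossing the lower edge

var_dump(rectContacts($thinFilter, $glyph));       // touches the thin rectangle
var_dump(rectContains($thinFilter, $glyph));       // too tall to fit into it
var_dump(rectContains($largeFilter, $glyph));      // fits into the larger one
var_dump(rectContains($largeFilter, $lowerGlyph)); // sticks out at the bottom
```

A thin rectangle combined with the contact idea picks a single line, while a larger rectangle combined with the contains idea drops everything that crosses its border.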

The coordinates returned by a glyph or word are in the user space of the PDF page. Because a page can be rotated or its boundary box may be shifted, the rectangle filter automatically translates the rectangle values by this rotation and offset internally. To do this, the filter also implements the \setasign\SetaPDF2\Extractor\Filter\PageFilterInterface interface, which is used to adjust these settings when the page changes.

The rectangle coordinates are relative to the lower left corner of the CropBox of a page. 
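As a rough illustration of the offset part of this translation, the following sketch shifts a rectangle given relative to the CropBox into user space. The function shiftByCropBox() is hypothetical, it ignores page rotation completely, and it is not how the filter is implemented internally:

```php
<?php

/**
 * Hypothetical helper: shifts a rectangle that is given relative to the
 * lower left corner of the CropBox into user space. Rotation is ignored.
 *
 * @param array $rect    [llx, lly, urx, ury] relative to the CropBox
 * @param array $cropBox [llx, lly, urx, ury] of the CropBox in user space
 * @return array the shifted rectangle as [llx, lly, urx, ury]
 */
function shiftByCropBox(array $rect, array $cropBox): array
{
    // the lower left corner of the CropBox is the offset
    [$offsetX, $offsetY] = $cropBox;

    return [
        $rect[0] + $offsetX,
        $rect[1] + $offsetY,
        $rect[2] + $offsetX,
        $rect[3] + $offsetY,
    ];
}

// a CropBox starting at the origin leaves the rectangle unchanged
print_r(shiftByCropBox([40, 710, 220, 711], [0, 0, 595, 842]));

// a shifted CropBox moves the rectangle by the same offset
print_r(shiftByCropBox([40, 710, 220, 711], [10, 20, 605, 862]));
```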

The following demo shows you how to extract only the sender name of an invoice:

PHP
<?php

use setasign\SetaPDF2\Core\Document;
use setasign\SetaPDF2\Core\Geometry\Rectangle;
use setasign\SetaPDF2\Extractor\Extractor;
use setasign\SetaPDF2\Extractor\Filter\RectangleFilter;

require_once('library/SetaPDF/Autoload.php');

// get a document instance
$document = Document::loadByFilename(
    'files/pdfs/camtown/eBook-Invoice.pdf'
);

// initiate an extractor instance
$extractor = new Extractor($document);

// get the default strategy
$strategy = $extractor->getStrategy();

// create a rectangle filter with a very low height and with the mode "contact".
$senderFilter = new RectangleFilter(
    new Rectangle(40, 710, 220, 711),
    RectangleFilter::MODE_CONTACT
);

// set the filter
$strategy->setFilter($senderFilter);

// get the result which is only the sender name and address in the address field
$result = $extractor->getResultByPageNumber(1);

// debug
echo "<pre>";
var_dump($result);

To make this easier to follow, the next demo draws the filtered area and the bounds of the matched glyphs. It uses the glyph strategy for demonstration purposes:

PHP
<?php

use setasign\SetaPDF2\Core\Document;
use setasign\SetaPDF2\Core\Geometry\Rectangle;
use setasign\SetaPDF2\Core\Writer\HttpWriter;
use setasign\SetaPDF2\Extractor\Extractor;
use setasign\SetaPDF2\Extractor\Filter\RectangleFilter;
use setasign\SetaPDF2\Extractor\Strategy\GlyphStrategy;

require_once('library/SetaPDF/Autoload.php');

// get a document instance
$document = Document::loadByFilename(
    'files/pdfs/camtown/eBook-Invoice.pdf'
);

// initiate an extractor instance
$extractor = new Extractor($document);

// create a glyph strategy
$strategy = new GlyphStrategy();
$extractor->setStrategy($strategy);

// create a rectangle filter with a very low height and with the mode "contact".
$senderFilter = new RectangleFilter(
    new Rectangle(40, 710, 220, 711),
    RectangleFilter::MODE_CONTACT
);

// set the filter
$strategy->setFilter($senderFilter);

// get the result which is only the sender name and address in the address field
$glyphs = $extractor->getResultByPageNumber(1);

// now draw the filter area and found glyphs rectangles
$page = $document->getCatalog()->getPages()->getPage(1);
$page->getContents()->encapsulateExistingContentInGraphicState();

$canvas = $page->getCanvas();
$path = $canvas->path();
$path->setLineWidth(.5);

$colors = [
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
    [1, 0, 0],
    [1, 0, 1]
];

$colorIndex = 0;

// draw the bounds of all found glyphs
foreach ($glyphs AS $glyph) {
    $canvas->setStrokingColor($colors[$colorIndex]);

    $bounds = $glyph->getBounds();
    foreach ($bounds AS $bound) {
        $path->moveTo($bound->getUl()->getX(), $bound->getUl()->getY())
            ->lineTo($bound->getUr()->getX(), $bound->getUr()->getY())
            ->lineTo($bound->getLr()->getX(), $bound->getLr()->getY())
            ->lineTo($bound->getLl()->getX(), $bound->getLl()->getY())
            ->closeAndStroke();
    }

    $colorIndex++;
    if ($colorIndex === count($colors)) {
        $colorIndex = 0;
    }
}

// show the filtered rectangle area
$rect = $senderFilter->getRectangle();
$path->setStrokingColor(.5);
$path->rect($rect->getLl()->getX(), $rect->getLl()->getY(), $rect->getWidth(), $rect->getHeight())
    ->closeAndStroke();

$document->setWriter(new HttpWriter('marked-items.pdf', true));
$document->save()->finish();

If this thin rectangle were used in contains mode, no glyph would match, because no glyph fits completely into it. To demonstrate the contains mode, we enlarge the rectangle so that it also intersects the address line below the expected line. Because of the contains mode, only the sender name is matched:

PHP
<?php

use setasign\SetaPDF2\Core\Document;
use setasign\SetaPDF2\Core\Geometry\Rectangle;
use setasign\SetaPDF2\Core\Writer\HttpWriter;
use setasign\SetaPDF2\Extractor\Extractor;
use setasign\SetaPDF2\Extractor\Filter\RectangleFilter;
use setasign\SetaPDF2\Extractor\Strategy\GlyphStrategy;

require_once('library/SetaPDF/Autoload.php');

// get a document instance
$document = Document::loadByFilename(
    'files/pdfs/camtown/eBook-Invoice.pdf'
);

// initiate an extractor instance
$extractor = new Extractor($document);

// create a glyph strategy
$strategy = new GlyphStrategy();
$extractor->setStrategy($strategy);

// create a rectangle filter
$senderFilter = new RectangleFilter(
    new Rectangle(40, 685, 220, 720),
    RectangleFilter::MODE_CONTAINS
);

// set the filter
$strategy->setFilter($senderFilter);

// get the result which is only the sender name and address in the address field
$glyphs = $extractor->getResultByPageNumber(1);

// now draw the filter area and found glyphs rectangles
$page = $document->getCatalog()->getPages()->getPage(1);
$page->getContents()->encapsulateExistingContentInGraphicState();

$canvas = $page->getCanvas();
$path = $canvas->path();
$path->setLineWidth(.5);

$colors = [
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
    [1, 0, 0],
    [1, 0, 1]
];

$colorIndex = 0;

// draw the bounds of all found glyphs
foreach ($glyphs AS $glyph) {
    $canvas->setStrokingColor($colors[$colorIndex]);

    $bounds = $glyph->getBounds();
    foreach ($bounds AS $bound) {
        $path->moveTo($bound->getUl()->getX(), $bound->getUl()->getY())
            ->lineTo($bound->getUr()->getX(), $bound->getUr()->getY())
            ->lineTo($bound->getLr()->getX(), $bound->getLr()->getY())
            ->lineTo($bound->getLl()->getX(), $bound->getLl()->getY())
            ->closeAndStroke();
    }

    $colorIndex++;
    if ($colorIndex === count($colors)) {
        $colorIndex = 0;
    }
}

// show the filtered rectangle area
$rect = $senderFilter->getRectangle();
$path->setStrokingColor(.5);
$path->rect($rect->getLl()->getX(), $rect->getLl()->getY(), $rect->getWidth(), $rect->getHeight())
    ->closeAndStroke();

$document->setWriter(new HttpWriter('marked-items.pdf', true));
$document->save()->finish();

Font Size

The font size filter lets you filter text items by their font size. Its modes, each represented by a class constant, give you full control over the matched items:

A mode constant.

Defines that the font size needs to be between the given filter values. If this mode is used, the filter value needs to be an array. Otherwise the mode behaves like \setasign\SetaPDF2\Extractor\Filter\FontSizeFilter::MODE_EQUALS.

A mode constant.

Defines that the font size needs to be between or equal to the given filter values. If this mode is used, the filter value needs to be an array. Otherwise the mode behaves like \setasign\SetaPDF2\Extractor\Filter\FontSizeFilter::MODE_EQUALS.

A mode constant.

Defines that the font size needs to be equal to the given filter value.

A mode constant.

Defines that the font size needs to be larger than the given filter value.

A mode constant.

Defines that the font size needs to be larger than or equal to the given filter value.

A mode constant.

Defines that the font size needs to be smaller than the given filter value.

A mode constant.

Defines that the font size needs to be smaller than or equal to the given filter value.
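The comparison rules of these modes, including the fallback to an equality check when a range mode receives a single value, can be sketched as follows. The function fontSizeMatches() and its string mode names are hypothetical and are not the class constants of the library; the small tolerance in the equality check is also an assumption:

```php
<?php

/**
 * Hypothetical sketch of the mode semantics described above.
 *
 * @param float $fontSize the font size of a text item
 * @param string $mode    an illustrative mode name (not a library constant)
 * @param mixed $value    a single number or an array of two numbers
 */
function fontSizeMatches(float $fontSize, string $mode, $value): bool
{
    $rangeModes = ['between', 'betweenOrEquals'];

    // a range mode without an array falls back to an equality check
    if (in_array($mode, $rangeModes, true) && !is_array($value)) {
        $mode = 'equals';
    }

    // all other modes compare against a single number
    if (is_array($value) && !in_array($mode, $rangeModes, true)) {
        throw new InvalidArgumentException('An array value requires a range mode.');
    }

    switch ($mode) {
        case 'between':
            return $fontSize > min($value) && $fontSize < max($value);
        case 'betweenOrEquals':
            return $fontSize >= min($value) && $fontSize <= max($value);
        case 'equals':
            return abs($fontSize - $value) < 1e-6;
        case 'larger':
            return $fontSize > $value;
        case 'largerOrEquals':
            return $fontSize >= $value;
        case 'smaller':
            return $fontSize < $value;
        case 'smallerOrEquals':
            return $fontSize <= $value;
        default:
            throw new InvalidArgumentException('Unknown mode: ' . $mode);
    }
}

var_dump(fontSizeMatches(12, 'between', [12, 24]));         // boundary excluded
var_dump(fontSizeMatches(12, 'betweenOrEquals', [12, 24])); // boundary included
var_dump(fontSizeMatches(24, 'between', 24));               // falls back to equals
```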

A simple demo script that extracts text with a font size of 24pt could look like this:

PHP
<?php

use setasign\SetaPDF2\Core\Document;
use setasign\SetaPDF2\Extractor\Extractor;
use setasign\SetaPDF2\Extractor\Filter\FontSizeFilter;
use setasign\SetaPDF2\Extractor\Strategy\WordStrategy;

require_once('library/SetaPDF/Autoload.php');

// create a document instance
$document = Document::loadByFilename('files/pdfs/Brand-Guide.pdf');

// create an extractor instance
$extractor = new Extractor($document);

// create the word strategy...
$strategy = new WordStrategy();
// ...and pass it to the extractor
$extractor->setStrategy($strategy);

// create the instance and ...
$filter = new FontSizeFilter(24);
// ...pass it to the strategy
$strategy->setFilter($filter);

// get access to the document pages
$pages = $document->getCatalog()->getPages();

// iterate over the pages and extract the words:
for ($pageNo = 1; $pageNo <= $pages->count(); $pageNo++) {

    echo '<h1>Words with a font size of ' . $filter->getFontSize() .
         'pt on Page #' . $pageNo . '</h1>';

    $words = $extractor->getResultByPageNumber($pageNo);

    foreach ($words as $word) {
        echo '<li>' . htmlspecialchars($word->getString()) . '</li>';
    }
}

Multi

The \setasign\SetaPDF2\Extractor\Filter\MultiFilter class allows you to combine several filter instances into a single filter.

By default the filters are evaluated with OR logic: if one of the filters matches, the item is accepted. It is also possible to use AND logic by passing the mode constant \setasign\SetaPDF2\Extractor\Filter\MultiFilter::MODE_AND as the second parameter to the constructor.

The $id parameter of the individual filter instances is only used if the multi-instance is working in OR mode.
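The OR and AND logic can be sketched with plain callables. The function combineFilters() is a hypothetical stand-in, not the MultiFilter class: each callable returns false, true or an id, and the sketch assumes that in OR mode the result of the first matching filter is passed on, while AND mode only reports true or false:

```php
<?php

/**
 * Hypothetical stand-in for combining filters.
 *
 * @param callable[] $filters callables returning false, true or an id
 * @param string $mode        "or" (default) or "and"
 */
function combineFilters(array $filters, string $mode = 'or'): callable
{
    return static function (string $item) use ($filters, $mode) {
        foreach ($filters as $filter) {
            $result = $filter($item);

            // OR: the first match wins and its id is passed on
            if ($mode === 'or' && $result) {
                return $result;
            }

            // AND: a single miss rejects the item
            if ($mode === 'and' && !$result) {
                return false;
            }
        }

        // OR: nothing matched; AND: everything matched (ids are not used)
        return $mode === 'and';
    };
}

$isNumeric = static function (string $item) {
    return is_numeric($item) ? 'number' : false;
};
$isShort = static function (string $item) {
    return strlen($item) <= 3 ? 'short' : false;
};

$or  = combineFilters([$isNumeric, $isShort]);
$and = combineFilters([$isNumeric, $isShort], 'and');

var_dump($or('12345'));  // matched by the first filter
var_dump($or('abc'));    // matched by the second filter
var_dump($and('123'));   // both filters match
var_dump($and('12345')); // numeric but not short
```

In AND mode an item is matched by all filters at once, so there is no single id that could identify it, which fits the note above that the ids are only used in OR mode.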

The following demo extracts the sender line and the invoice number. We add filter ids to the filter instances to get a detailed result:

PHP
<?php

use setasign\SetaPDF2\Core\Document;
use setasign\SetaPDF2\Core\Geometry\Rectangle;
use setasign\SetaPDF2\Extractor\Extractor;
use setasign\SetaPDF2\Extractor\Filter\MultiFilter;
use setasign\SetaPDF2\Extractor\Filter\RectangleFilter;

require_once('library/SetaPDF/Autoload.php');

// get a document instance
$document = Document::loadByFilename(
    'files/pdfs/camtown/eBook-Invoice.pdf'
);

// initiate an extractor instance
$extractor = new Extractor($document);

// get the default strategy
$strategy = $extractor->getStrategy();

// create a rectangle filter with a very low height and with the mode "contact".
$senderFilter = new RectangleFilter(
    new Rectangle(40, 710, 220, 711),
    RectangleFilter::MODE_CONTACT,
    'senderName' // the identification of this filter
);

$invoiceNoFilter = new RectangleFilter(
    new Rectangle(512, 520, 580, 540),
    RectangleFilter::MODE_CONTACT,
    'invoiceNo' // the identification of this filter
);

// set the filter
$strategy->setFilter(
    new MultiFilter([$senderFilter, $invoiceNoFilter])
);

/* get the result which is the sender name and address
 * in the address field and the invoice number
 */
$result = $extractor->getResultByPageNumber(1);

// debug
echo "<pre>";
var_dump($result);

Individual Filter

Creating an individual filter is easy: implement at least the \setasign\SetaPDF2\Extractor\Filter\FilterInterface interface.

The following example shows how to accept only numeric text items:

PHP
<?php

use setasign\SetaPDF2\Core\Document;
use setasign\SetaPDF2\Extractor\Extractor;
use setasign\SetaPDF2\Extractor\Filter\FilterInterface;
use setasign\SetaPDF2\Extractor\Strategy\WordStrategy;
use setasign\SetaPDF2\Extractor\TextItem;

require_once('library/SetaPDF/Autoload.php');

// create a document instance
$document = Document::loadByFilename('files/pdfs/Brand-Guide.pdf');

// create an extractor instance
$extractor = new Extractor($document);

// create the word strategy...
$strategy = new WordStrategy();
// ...and pass it to the extractor
$extractor->setStrategy($strategy);

/**
 * Class numeric filter
 *
 * An individual filter class that will only accept numeric text items.
 */
class NumericFilter implements FilterInterface
{
    /**
     * This is the callback that will decide if a text item will get matched or not.
     *
     * @param TextItem $textItem
     * @return bool
     */
    public function accept(TextItem $textItem)
    {
        return is_numeric($textItem->getString());
    }
}

// create the instance and ...
$filter = new NumericFilter();
// ...pass it to the strategy
$strategy->setFilter($filter);

// get access to the document pages
$pages = $document->getCatalog()->getPages();

// iterate over the pages and extract the words:
for ($pageNo = 1; $pageNo <= $pages->count(); $pageNo++) {

    echo '<h1>Numbers on Page #' . $pageNo . '</h1>';
    $words = $extractor->getResultByPageNumber($pageNo);

    foreach ($words as $word) {
        echo '<li>' . htmlspecialchars($word->getString()) . '</li>';
    }
}

If the individual filter depends on a specific page format or another page property, you may implement the \setasign\SetaPDF2\Extractor\Filter\PageFilterInterface interface to keep track of page changes.

The setPage() method is called whenever the page changes.