Exact Plain Text Strategy

Extracts Simple Plain Text by Glyphs

Introduction
Process
Usage

Introduction

The exact plain text strategy is similar to the plain text strategy, but it works at the detail level of the glyph strategy. It rebuilds the resulting text by sorting, comparing and concatenating individual glyphs and ignores the text items as they exist in the PDF document.

Especially in conjunction with a filter, e.g. a rectangle filter, this strategy returns a much more precise result, because individual glyphs rather than whole text items are evaluated.

The strategy is represented by the \setasign\SetaPDF2\Extractor\Strategy\ExactPlainStrategy class.

As with the plain text strategy, the result is a standard PHP string.

Process

The exact plain text strategy makes use of the glyph strategy, which extracts every single glyph, including its metrics, in the order in which it appears in the PDF data stream.

These glyphs are then passed to the same logic that the plain text strategy uses.

If a space between words is "faked" by a character spacing value, this strategy is able to recognize it as a word separator.
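
The idea of rebuilding text from individual glyphs can be illustrated with a small stand-alone sketch. It does not use SetaPDF and is not the library's implementation: the glyph arrays, the function name joinGlyphs() and the gap threshold are assumptions made for illustration only. Each hypothetical glyph carries a character, its left x position, its baseline y position and its width. A gap that is clearly wider than the usual spacing is treated as a word separator, although no space character exists in the data.

```php
<?php

/**
 * Conceptual sketch only (not SetaPDF code): rebuild text from single glyphs.
 *
 * @param array $glyphs    list of ['char' => string, 'x' => float, 'y' => float, 'width' => float]
 * @param float $gapFactor assumed threshold: a gap wider than this fraction
 *                         of the previous glyph's width counts as a space
 */
function joinGlyphs(array $glyphs, float $gapFactor = 0.5): string
{
    // sort top to bottom (the PDF y axis points upwards), then left to right
    usort($glyphs, function (array $a, array $b): int {
        if ($a['y'] !== $b['y']) {
            return $b['y'] <=> $a['y'];
        }
        return $a['x'] <=> $b['x'];
    });

    $text = '';
    $prev = null;
    foreach ($glyphs as $glyph) {
        if ($prev !== null) {
            if ($glyph['y'] !== $prev['y']) {
                // simplification: a different baseline starts a new line
                $text .= "\n";
            } else {
                // compare: distance between the end of the previous glyph and this one
                $gap = $glyph['x'] - ($prev['x'] + $prev['width']);
                if ($gap > $prev['width'] * $gapFactor) {
                    $text .= ' ';
                }
            }
        }
        // concatenate
        $text .= $glyph['char'];
        $prev = $glyph;
    }

    return $text;
}

// "Hi" and "PDF" are separated only by extra spacing, there is no space glyph.
// The glyphs are intentionally not in reading order.
$glyphs = [
    ['char' => 'P', 'x' => 30.0, 'y' => 700.0, 'width' => 6.0],
    ['char' => 'H', 'x' => 10.0, 'y' => 700.0, 'width' => 6.0],
    ['char' => 'i', 'x' => 16.5, 'y' => 700.0, 'width' => 3.0],
    ['char' => 'D', 'x' => 36.5, 'y' => 700.0, 'width' => 6.0],
    ['char' => 'F', 'x' => 43.0, 'y' => 700.0, 'width' => 6.0],
];

echo joinGlyphs($glyphs), "\n"; // prints "Hi PDF"
```

The sketch compares baselines with strict equality and uses a fixed threshold, which is enough to show the principle. Real documents need tolerances for both.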

Usage

An instance of the strategy has to be created individually and passed to the extractor instance:

PHP

use setasign\SetaPDF2\Extractor\Extractor;
use setasign\SetaPDF2\Extractor\Strategy\ExactPlainStrategy;

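// create the strategy
$strategy = new ExactPlainStrategy();
// $document is an already loaded document instance
$extractor = new Extractor($document);
// pass the strategy to the extractor instance
$extractor->setStrategy($strategy);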

You get the result of this strategy as a string by calling the getResultByPageNumber() method for each individual page:

PHP

<?php

use setasign\SetaPDF2\Core\Document;
use setasign\SetaPDF2\Extractor\Extractor;
use setasign\SetaPDF2\Extractor\Strategy\ExactPlainStrategy;

require_once('library/SetaPDF/Autoload.php');

// get a document instance
$document = Document::loadByFilename(
    'files/pdfs/camtown/Laboratory-Report.pdf'
);

// create an extractor instance
$extractor = new Extractor($document);
// create the strategy
$strategy = new ExactPlainStrategy();
// pass it to the extractor instance
$extractor->setStrategy($strategy);
// we need the total page count
$pageCount = $document->getCatalog()->getPages()->count();

// walk through the pages
for ($pageNo = 1; $pageNo <= $pageCount; $pageNo++) {
    // ...and extract the text through the exact plain text strategy:
    $result = $extractor->getResultByPageNumber($pageNo);

    // debug/demonstration output
    echo '<h1>Page #' . $pageNo . '</h1>';
    echo '<pre>';
    var_dump($result);
    echo '</pre>';
}


The strategy allows you to pass a filter instance to limit the result, e.g. to a specific area on a page.
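
Why glyph-level detail helps with an area filter can be shown with another stand-alone sketch. Again, this is not SetaPDF code and does not show the library's filter classes: the function glyphsInRectangle(), the glyph arrays and the rectangle array are hypothetical and exist only for this illustration. A text item such as "No:4711" is a single unit that could only be kept or dropped as a whole, whereas its glyphs can be checked one by one against the rectangle.

```php
<?php

/**
 * Conceptual sketch only (not SetaPDF code): keep the glyphs that lie fully inside a rectangle.
 *
 * @param array $glyphs list of ['char' => string, 'x' => float, 'y' => float, 'width' => float, 'height' => float]
 * @param array $rect   ['llx' => float, 'lly' => float, 'urx' => float, 'ury' => float]
 */
function glyphsInRectangle(array $glyphs, array $rect): string
{
    $text = '';
    foreach ($glyphs as $glyph) {
        // the whole glyph box has to be inside the rectangle
        $inside = $glyph['x'] >= $rect['llx']
            && $glyph['x'] + $glyph['width'] <= $rect['urx']
            && $glyph['y'] >= $rect['lly']
            && $glyph['y'] + $glyph['height'] <= $rect['ury'];

        if ($inside) {
            $text .= $glyph['char'];
        }
    }

    return $text;
}

// build the glyphs of the single text item "No:4711", each glyph 6 units wide
$glyphs = [];
foreach (str_split('No:4711') as $i => $char) {
    $glyphs[] = [
        'char' => $char,
        'x' => 100.0 + $i * 6.0,
        'y' => 500.0,
        'width' => 6.0,
        'height' => 10.0,
    ];
}

// the rectangle covers only the digits, not the "No:" prefix
$rect = ['llx' => 118.0, 'lly' => 495.0, 'urx' => 150.0, 'ury' => 515.0];

echo glyphsInRectangle($glyphs, $rect), "\n"; // prints "4711"
```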
