Plain Text Strategy

Extracts Simple Plain Text

Introduction

The plain text strategy is the default strategy used by the SetaPDF-Extractor component and allows you to extract plain text from PDF documents. It is represented by the class \setasign\SetaPDF2\Extractor\Strategy\PlainStrategy.

By default, the text items are sorted by the baseline sorter, but a different or custom sorter instance can be passed through the setSorter() method.

The result will be a standard PHP string.

Process

The plain text strategy extracts all defined text items, including their metrics, into a temporary result. The items are taken as they appear in the PDF data stream. This means that several words in a single text item are processed as a whole, while a word split over several text items is processed as several individual items.

This temporary result is then sorted and grouped into lines and orientations (by default via the baseline sorter).
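
The sorting and grouping step can be pictured with a small stand-alone sketch. It does not use the SetaPDF API and does not claim to mirror the baseline sorter's implementation: the item arrays, their keys ('text', 'x', 'baseline'), the tolerance value and the function groupIntoLines() are all hypothetical, and orientations are ignored for brevity. Items arrive in data stream order, are grouped by their baseline and are then ordered from left to right within each line.

```php
<?php

/**
 * Groups text items into lines by their baseline.
 * Hypothetical helper for illustration, not part of SetaPDF.
 *
 * @param array $items     Items with the keys 'text', 'x' and 'baseline'
 * @param float $tolerance Maximum baseline difference within one line (assumed value)
 * @return array           A list of lines, each line being a list of items
 */
function groupIntoLines(array $items, float $tolerance = 0.5): array
{
    // top of the page first: PDF y-coordinates grow upwards
    usort($items, fn(array $a, array $b) => $b['baseline'] <=> $a['baseline']);

    $lines = [];
    foreach ($items as $item) {
        $last = count($lines) - 1;
        // same line if the baseline matches the current line within the tolerance
        if ($last >= 0 && abs($lines[$last][0]['baseline'] - $item['baseline']) <= $tolerance) {
            $lines[$last][] = $item;
        } else {
            $lines[] = [$item];
        }
    }

    // order the items of each line from left to right
    foreach ($lines as &$line) {
        usort($line, fn(array $a, array $b) => $a['x'] <=> $b['x']);
    }
    unset($line);

    return $lines;
}

// items in data stream order, which is not the reading order here
$items = [
    ['text' => 'World', 'x' => 40.0, 'baseline' => 700.0],
    ['text' => 'Second line', 'x' => 10.0, 'baseline' => 680.0],
    ['text' => 'Hello', 'x' => 10.0, 'baseline' => 700.0],
];

foreach (groupIntoLines($items) as $line) {
    echo implode(' | ', array_column($line, 'text')), PHP_EOL;
}
```

In this sketch the two items on baseline 700 end up in the first line in left-to-right order, and the remaining item forms a second line.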

The resulting text string is created by running through the sorted and grouped result and comparing the last item with the current one to decide whether both text items form a continuous segment. This is done by checking for a gap between both items on their ordinate. The size of this gap is defined by the average width of the space character of both text items divided by a factor defined in the $spaceWidthFactor property.
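
The gap check described above can be illustrated with a minimal stand-alone sketch. It is not the library's implementation: the function joinLine(), the item keys ('text', 'start', 'end', 'spaceWidth') and the factor of 2.0 are hypothetical, 'start' and 'end' stand for the item positions on the axis on which the gap is measured, and the use of a strict "greater than" comparison is an assumption.

```php
<?php

/**
 * Joins the items of one sorted line into a string.
 * Hypothetical helper for illustration, not part of SetaPDF.
 *
 * @param array $items            Items with the keys 'text', 'start', 'end' and 'spaceWidth'
 * @param float $spaceWidthFactor Divisor applied to the average space width
 */
function joinLine(array $items, float $spaceWidthFactor): string
{
    $result = '';
    $last = null;

    foreach ($items as $item) {
        if ($last !== null) {
            $gap = $item['start'] - $last['end'];
            // average space width of both items divided by the factor
            $threshold = ($last['spaceWidth'] + $item['spaceWidth']) / 2 / $spaceWidthFactor;

            // assumption: a gap larger than the threshold separates two words
            if ($gap > $threshold) {
                $result .= ' ';
            }
        }

        $result .= $item['text'];
        $last = $item;
    }

    return $result;
}

// "Hello" is split over two text items, "World" follows after a visible gap
$line = [
    ['text' => 'Hel',   'start' => 10.0, 'end' => 25.0, 'spaceWidth' => 3.0],
    ['text' => 'lo',    'start' => 25.2, 'end' => 34.0, 'spaceWidth' => 3.0],
    ['text' => 'World', 'start' => 37.5, 'end' => 65.0, 'spaceWidth' => 3.0],
];

echo joinLine($line, 2.0), PHP_EOL; // prints "Hello World"
```

With a space width of 3.0 and a factor of 2.0 the threshold is 1.5, so the gap of roughly 0.2 between "Hel" and "lo" joins both items, while the gap of 3.5 before "World" produces a space. A larger factor lowers the threshold and makes the sketch insert spaces more readily.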

If a space between words is "faked" by a character spacing value, this strategy is not able to recognize it as a word separator. The exact plain strategy is able to handle this situation.

Usage

An instance can be created individually or by receiving it from the main class:

PHP

use setasign\SetaPDF2\Extractor\Extractor;
use setasign\SetaPDF2\Extractor\Strategy\PlainStrategy;

// get the default instance
$extractor = new Extractor($document);
$plainText = $extractor->getStrategy();

// or create your own
$plainText = new PlainStrategy();
$extractor = new Extractor($document);
$extractor->setStrategy($plainText);

You can get the string result of this strategy by calling the getResultByPageNumber() method for each individual page:

PHP

<?php

use setasign\SetaPDF2\Core\Document;
use setasign\SetaPDF2\Extractor\Extractor;

require_once('library/SetaPDF/Autoload.php');

// get a document instance
$document = Document::loadByFilename(
    'files/pdfs/camtown/Laboratory-Report.pdf'
);

// create an extractor instance
$extractor = new Extractor($document);
// we need the total page count
$pageCount = $document->getCatalog()->getPages()->count();

// walk through the pages
for ($pageNo = 1; $pageNo <= $pageCount; $pageNo++) {
    // ...and extract the data through the default strategy:
    $result = $extractor->getResultByPageNumber($pageNo);

    // debug/demonstration output
    echo '<h1>Page #' . $pageNo . '</h1>';
    echo '<pre>';
    var_dump($result);
    echo '</pre>';
}

The strategy allows you to pass a filter instance to limit the result, e.g., to a specific area on a page.
