Glyph Strategy

Glyph Strategy Extracts Glyphs and Metrics

Introduction
Process
Usage

Introduction

The glyph strategy allows you to extract single glyphs from PDF documents. It is represented by the class \setasign\SetaPDF2\Extractor\Strategy\GlyphStrategy.

The result is an instance of \setasign\SetaPDF2\Extractor\Result\Collection. Each glyph in the collection is represented by an instance of \setasign\SetaPDF2\Extractor\Result\Glyph.

Process

This strategy extracts each individual glyph, together with its metrics, in the order in which it appears in the PDF data stream. The result is NOT sorted, so this order does not necessarily match the visual reading order of the page.

The result can be used for further processing by another strategy or for text analysis.
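
Because the glyphs arrive unsorted, further processing often starts by ordering them. The following is a minimal sketch and not part of SetaPDF: it assumes each glyph has already been reduced to a plain array holding its text and the lower-left corner (`x`, `y`) of its bounds. The helper name `sortGlyphsIntoReadingOrder()`, the record layout and the line tolerance are hypothetical.

```php
<?php

/**
 * Sorts glyph records top-to-bottom, then left-to-right.
 *
 * Each record is a plain array: ['text' => string, 'x' => float, 'y' => float].
 * Glyphs whose y value lies within $lineTolerance of the first glyph of a
 * line are treated as belonging to that line. PDF y coordinates grow
 * upwards, so larger y values come first.
 */
function sortGlyphsIntoReadingOrder(array $glyphs, float $lineTolerance = 2.0): array
{
    // Order by y descending so that lines can be collected sequentially.
    usort($glyphs, static function (array $a, array $b): int {
        return $b['y'] <=> $a['y'];
    });

    // Group glyphs into lines based on the y value of each line's first glyph.
    $lines = [];
    foreach ($glyphs as $glyph) {
        $last = count($lines) - 1;
        if ($last >= 0 && abs($lines[$last]['y'] - $glyph['y']) <= $lineTolerance) {
            $lines[$last]['glyphs'][] = $glyph;
        } else {
            $lines[] = ['y' => $glyph['y'], 'glyphs' => [$glyph]];
        }
    }

    // Within each line, order by x ascending and flatten the result.
    $sorted = [];
    foreach ($lines as $line) {
        usort($line['glyphs'], static function (array $a, array $b): int {
            return $a['x'] <=> $b['x'];
        });
        foreach ($line['glyphs'] as $glyph) {
            $sorted[] = $glyph;
        }
    }

    return $sorted;
}

// Sample data in "stream order" (hypothetical values).
$glyphs = [
    ['text' => 'b', 'x' => 20.0, 'y' => 700.0],
    ['text' => 'c', 'x' => 10.0, 'y' => 680.0],
    ['text' => 'a', 'x' => 10.0, 'y' => 700.5],
];

$text = implode('', array_column(sortGlyphsIntoReadingOrder($glyphs), 'text'));
// $text is "abc"
```

The tolerance handles glyphs on the same line whose positions differ slightly. A real document may need a value derived from the glyph heights instead of a fixed number.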

Usage

Create an instance of the strategy yourself and pass it to the extractor instance:

PHP

use setasign\SetaPDF2\Extractor\Extractor;
use setasign\SetaPDF2\Extractor\Strategy\GlyphStrategy;

$glyphStrategy = new GlyphStrategy();
$extractor = new Extractor($document);
$extractor->setStrategy($glyphStrategy);

You can get the result of this strategy by calling the getResultByPageNumber() method for each individual page. Each glyph is represented by an instance of \setasign\SetaPDF2\Extractor\Result\Glyph, which implements both the \setasign\SetaPDF2\Extractor\Result\CompareableInterface and \setasign\SetaPDF2\Extractor\Result\HasBoundsInterface interfaces.

PHP

<?php

use setasign\SetaPDF2\Core\Document;
use setasign\SetaPDF2\Extractor\Extractor;
use setasign\SetaPDF2\Extractor\Strategy\GlyphStrategy;

require_once('library/SetaPDF/Autoload.php');

// get a document instance
$document = Document::loadByFilename(
    'files/pdfs/camtown/Laboratory-Report.pdf'
);

// create an extractor instance
$extractor = new Extractor($document);

// create the glyph strategy and pass it to the extractor instance
$strategy = new GlyphStrategy();
$extractor->setStrategy($strategy);

// we need the total page count
$pageCount = $document->getCatalog()->getPages()->count();

// walk through the pages
for ($pageNo = 1; $pageNo <= $pageCount; $pageNo++) {
    // ...and extract the glyphs through the glyph strategy set above:
    $result = $extractor->getResultByPageNumber($pageNo);

    // debug/demonstration output
    echo '<h1>Page #' . $pageNo . '</h1>';
    echo 'Found ' . count($result) . ' glyphs on page ' . $pageNo .
         '. The first 100 glyphs are:<br />';

    echo '<table border="1">';
    echo '<tr><th>Glyph</th><th>llx</th><th>lly</th><th>ulx</th>' .
         '<th>uly</th><th>urx</th><th>ury</th><th>lrx</th><th>lry</th></tr>';

    foreach ($result as $i => $glyph) {
        echo '<tr>';
        echo '<td><b>' . $glyph . '</b></td>';

        $allBounds = $glyph->getBounds();
        foreach ($allBounds as $bounds) {
            echo '<td>' . $bounds->getLl()->getX() . '</td>';
            echo '<td>' . $bounds->getLl()->getY() . '</td>';
            echo '<td>' . $bounds->getUl()->getX() . '</td>';
            echo '<td>' . $bounds->getUl()->getY() . '</td>';
            echo '<td>' . $bounds->getUr()->getX() . '</td>';
            echo '<td>' . $bounds->getUr()->getY() . '</td>';
            echo '<td>' . $bounds->getLr()->getX() . '</td>';
            echo '<td>' . $bounds->getLr()->getY() . '</td>';
        }

        echo '</tr>';

        if ($i >= 99) {
            echo '<tr><td colspan="9">...</td></tr>';
            break;
        }
    }
    echo '</table>';
}
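
The four corner points printed above can be combined into simple metrics. As a hedged sketch, the helper below (hypothetical, not part of SetaPDF) takes corner coordinates as plain `[x, y]` pairs, such as the values read via `getLl()`, `getLr()` and `getUl()` in the example, and derives width, height and the baseline angle. It uses distances between corners instead of plain coordinate differences, so it should also hold if the bounds are rotated.

```php
<?php

/**
 * Derives width, height and rotation from three corner points.
 *
 * @param float[] $ll Lower-left corner as [x, y]
 * @param float[] $lr Lower-right corner as [x, y]
 * @param float[] $ul Upper-left corner as [x, y]
 */
function glyphMetricsFromCorners(array $ll, array $lr, array $ul): array
{
    // Width is the length of the bottom edge (ll -> lr).
    $width = hypot($lr[0] - $ll[0], $lr[1] - $ll[1]);

    // Height is the length of the left edge (ll -> ul).
    $height = hypot($ul[0] - $ll[0], $ul[1] - $ll[1]);

    // Angle of the bottom edge relative to the x axis, in degrees.
    $angle = rad2deg(atan2($lr[1] - $ll[1], $lr[0] - $ll[0]));

    return ['width' => $width, 'height' => $height, 'angle' => $angle];
}

// Sample corner values (hypothetical) for an unrotated glyph.
$metrics = glyphMetricsFromCorners([100.0, 200.0], [106.0, 200.0], [100.0, 212.0]);
// width: 6.0, height: 12.0, angle: 0.0
```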


You can pass a filter instance to the strategy to limit the result, e.g., to a specific area on a page.
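
To illustrate what limiting the result to an area means, the sketch below filters plain glyph records against a rectangle. It deliberately does not use the library's filter classes, whose API is not shown in this section. The function `filterGlyphsByArea()`, the record layout and the "fully inside" rule are assumptions made for this example only.

```php
<?php

/**
 * Keeps only glyph records whose bounding box lies fully inside a rectangle.
 *
 * Each record is a plain array with the keys 'text', 'llx', 'lly', 'urx', 'ury'.
 * The rectangle is given by its lower-left and upper-right corners.
 */
function filterGlyphsByArea(array $glyphs, float $llx, float $lly, float $urx, float $ury): array
{
    return array_values(array_filter(
        $glyphs,
        static function (array $g) use ($llx, $lly, $urx, $ury): bool {
            return $g['llx'] >= $llx && $g['lly'] >= $lly
                && $g['urx'] <= $urx && $g['ury'] <= $ury;
        }
    ));
}

// Sample records (hypothetical values), one near the top of an A4 page
// and one near the bottom.
$glyphs = [
    ['text' => 'A', 'llx' => 50.0, 'lly' => 700.0, 'urx' => 56.0, 'ury' => 712.0],
    ['text' => 'B', 'llx' => 300.0, 'lly' => 100.0, 'urx' => 306.0, 'ury' => 112.0],
];

// Keep only glyphs in the upper part of the page.
$inHeader = filterGlyphsByArea($glyphs, 0.0, 650.0, 595.0, 842.0);
// $inHeader contains only the record for "A"
```

Filtering inside the strategy, as the manual describes, avoids collecting glyphs that are discarded afterwards. A post-filter like this one only makes sense when the full result is needed anyway.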
