Glyph Strategy: Extracts Glyphs and Metrics

Table of Contents

  1. Introduction
  2. Process
  3. Usage

        Introduction

        The glyph strategy allows you to extract single glyphs from PDF documents. It is represented by the class SetaPDF_Extractor_Strategy_Glyph.

        The result is an instance of SetaPDF_Extractor_Result_Segment. Each glyph in the segment is represented by an instance of SetaPDF_Extractor_Result_Glyph.

        Process

        This strategy extracts each single glyph including its metrics in the order in which it appears in the PDF data stream. The result is NOT sorted.

        The result may be used for further processing by another strategy or for text analysis.
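
        Because the glyphs arrive in data stream order, a later processing step often has to bring them into reading order first. The following is a minimal sketch of such a step, not part of the library: it works on plain arrays with hypothetical sample values, where x and y stand for the lower-left corner of a glyph (the values that getBounds() and getLl() return in the usage example). The function name sortGlyphsInReadingOrder() is an assumption made for this illustration.

        ```php
        <?php
        // Hypothetical sample data: each entry stands in for one extracted glyph,
        // reduced to its text and the lower-left corner of its bounds.
        $glyphs = [
            ['text' => 'b', 'x' => 20.0, 'y' => 700.0],
            ['text' => 'c', 'x' => 10.0, 'y' => 680.0],
            ['text' => 'a', 'x' => 10.0, 'y' => 700.0],
        ];

        /**
         * Sorts glyphs top to bottom, then left to right.
         *
         * PDF user space has its origin at the lower left, so a larger y value
         * means a position higher up on the page.
         */
        function sortGlyphsInReadingOrder(array $glyphs): array
        {
            usort($glyphs, function (array $a, array $b): int {
                // higher glyphs first; fall back to x for glyphs on the same y
                return [$b['y'], $a['x']] <=> [$a['y'], $b['x']];
            });

            return $glyphs;
        }

        $sorted = sortGlyphsInReadingOrder($glyphs);
        echo implode('', array_column($sorted, 'text')), PHP_EOL; // prints "abc"
        ```

        The comparison uses exact y values to keep the sketch short. Real documents usually need a tolerance when deciding whether two glyphs share a line, because baselines on the same line can differ slightly.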

        Usage

        An instance has to be created explicitly and passed to the main class:

        PHP
        $glyphStrategy = new SetaPDF_Extractor_Strategy_Glyph();
        $extractor = new SetaPDF_Extractor($document);
        $extractor->setStrategy($glyphStrategy);

        You can get the result of this strategy by calling the getResultByPageNumber() method of the SetaPDF_Extractor instance for each individual page. Each glyph is represented by an instance of SetaPDF_Extractor_Result_Glyph, which implements both the SetaPDF_Extractor_Result_CompareableInterface and SetaPDF_Extractor_Result_HasBoundsInterface interfaces.

        PHP
        <?php
        require_once('library/SetaPDF/Autoload.php');
        
        // get a document instance
        $document = SetaPDF_Core_Document::loadByFilename(
            'files/pdfs/camtown/Laboratory-Report.pdf'
        );
        
        // create an extractor instance
        $extractor = new SetaPDF_Extractor($document);
        
        // create the glyph strategy and pass it to the extractor instance
        $strategy = new SetaPDF_Extractor_Strategy_Glyph();
        $extractor->setStrategy($strategy);
        
        // we need the total page count
        $pageCount = $document->getCatalog()->getPages()->count();
        
        // walk through the pages
        for ($pageNo = 1; $pageNo <= $pageCount; $pageNo++) {
            // ...and extract the data through the glyph strategy set above:
            $result = $extractor->getResultByPageNumber($pageNo);
        
            // debug/demonstration output
            echo '<h1>Page #' . $pageNo . '</h1>';
            echo 'Found ' . count($result) . ' glyphs on page ' . $pageNo .
                 '. The first 100 glyphs are:<br />';
        
            echo '<table border="1">';
            echo '<tr><th>Glyph</th><th>llx</th><th>lly</th><th>ulx</th>' .
                 '<th>uly</th><th>urx</th><th>ury</th><th>lrx</th><th>lry</th></tr>';
        
            foreach ($result as $i => $glyph) {
                echo '<tr>';
                echo '<td><b>' . $glyph . '</b></td>';
        
                $allBounds = $glyph->getBounds();
                foreach ($allBounds as $bounds) {
                    echo '<td>' . $bounds->getLl()->getX() . '</td>';
                    echo '<td>' . $bounds->getLl()->getY() . '</td>';
                    echo '<td>' . $bounds->getUl()->getX() . '</td>';
                    echo '<td>' . $bounds->getUl()->getY() . '</td>';
                    echo '<td>' . $bounds->getUr()->getX() . '</td>';
                    echo '<td>' . $bounds->getUr()->getY() . '</td>';
                    echo '<td>' . $bounds->getLr()->getX() . '</td>';
                    echo '<td>' . $bounds->getLr()->getY() . '</td>';
                }
        
                echo '</tr>';
        
                if ($i >= 99) {
                    echo '<tr><td colspan="9">...</td></tr>';
                    break;
                }
            }
            echo '</table>';
        }            

        The strategy also accepts a filter instance that limits the result, for example to a specific area on a page.
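
        The filter classes themselves are not covered on this page. As a rough illustration of what limiting a result to an area means, the following sketch applies a rectangle test to plain arrays after extraction. It does not use the library's filter API: the data is hypothetical, and the function name glyphsInsideArea() and the array keys are assumptions made for this example.

        ```php
        <?php
        // Hypothetical sample data: text plus lower-left (ll) and upper-right (ur)
        // corners of each glyph, in PDF user space units.
        $glyphs = [
            ['text' => 'A', 'llx' => 50.0,  'lly' => 700.0, 'urx' => 58.0,  'ury' => 712.0],
            ['text' => 'B', 'llx' => 300.0, 'lly' => 700.0, 'urx' => 308.0, 'ury' => 712.0],
            ['text' => 'C', 'llx' => 60.0,  'lly' => 705.0, 'urx' => 68.0,  'ury' => 717.0],
        ];

        /**
         * Keeps only glyphs whose bounds lie completely inside the given rectangle.
         */
        function glyphsInsideArea(array $glyphs, float $llx, float $lly, float $urx, float $ury): array
        {
            return array_values(array_filter(
                $glyphs,
                function (array $g) use ($llx, $lly, $urx, $ury): bool {
                    return $g['llx'] >= $llx && $g['lly'] >= $lly
                        && $g['urx'] <= $urx && $g['ury'] <= $ury;
                }
            ));
        }

        $inside = glyphsInsideArea($glyphs, 40.0, 690.0, 200.0, 720.0);
        echo implode('', array_column($inside, 'text')), PHP_EOL; // prints "AC"
        ```

        Passing a filter instance to the strategy avoids this manual step, because glyphs outside the area never reach the result.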