Word Strategy Extracts Words, Glyphs and Metrics

Table of Contents

  1. Introduction
  2. Process
  3. Usage

        Introduction

        The word strategy allows you to extract words from PDF documents. It is represented by the class SetaPDF_Extractor_Strategy_Word.

        The result will be an instance of SetaPDF_Extractor_Result_Words. Each word in the words collection is represented by an instance of SetaPDF_Extractor_Result_Word.

        Process

        This strategy internally extracts every glyph, including its metrics, in the order in which it appears in the PDF data stream. This temporary result is then sorted and grouped by a sorter instance (the baseline sorter by default).

        The sorted and grouped result is used to build individual words: the strategy runs through the grouped lines and checks whether the next glyph should be part of the current word. This is done by checking for a gap between both glyph items on their ordinate. The size of this gap is defined by the average width of the space character of both glyph items divided by the factor defined in the $spaceWidthFactor property.
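
        The gap check described above can be sketched in a few lines of plain PHP. This is a minimal illustration, not the library's implementation: the function name groupGlyphsIntoWords(), the array keys (text, x0, x1, spaceWidth), the default factor of 2.0 and the strict "greater than" comparison are all assumptions made for this example.

```php
<?php
/**
 * Groups glyphs of one already sorted line into words.
 *
 * Each glyph is a hypothetical array with the keys 'text', 'x0' (left edge),
 * 'x1' (right edge) and 'spaceWidth' (width of a space in the glyph's font).
 * These are NOT SetaPDF classes; they only illustrate the gap check.
 */
function groupGlyphsIntoWords(array $glyphs, float $spaceWidthFactor = 2.0): array
{
    $words = [];
    $current = '';
    $previous = null;

    foreach ($glyphs as $glyph) {
        if ($previous !== null) {
            // distance between the end of the previous and the start of this glyph
            $gap = $glyph['x0'] - $previous['x1'];
            // average space width of both glyph items
            $avgSpaceWidth = ($glyph['spaceWidth'] + $previous['spaceWidth']) / 2;

            // a gap larger than the threshold starts a new word (assumed comparison)
            if ($gap > $avgSpaceWidth / $spaceWidthFactor) {
                $words[] = $current;
                $current = '';
            }
        }

        $current .= $glyph['text'];
        $previous = $glyph;
    }

    if ($current !== '') {
        $words[] = $current;
    }

    return $words;
}

$line = [
    ['text' => 'H', 'x0' => 0.0,  'x1' => 7.0,  'spaceWidth' => 3.0],
    ['text' => 'i', 'x0' => 7.2,  'x1' => 9.5,  'spaceWidth' => 3.0],
    ['text' => 'y', 'x0' => 12.5, 'x1' => 18.0, 'spaceWidth' => 3.0],
    ['text' => 'o', 'x0' => 18.1, 'x1' => 24.0, 'spaceWidth' => 3.0],
];

$words = groupGlyphsIntoWords($line);
// $words is ['Hi', 'yo']: only the gap of 3.0 exceeds the threshold of 3.0 / 2.0 = 1.5
print_r($words);
```

        In this sketch a larger factor lowers the threshold, so glyphs are split into separate words more readily; a smaller factor tolerates wider gaps inside a word.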

        Numeric values with joining dots or commas (e.g. "1.234,00" or "2,3456.12") are treated as a single "word". Several consecutive non-word characters are also treated as a single "word".
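
        To make the numeric rule more concrete, the following sketch tests whether a token consists of digit groups joined by single dots or commas. The pattern and the function name looksLikeJoinedNumber() are assumptions chosen for illustration; how the strategy detects such values internally is not described here.

```php
<?php
/**
 * Returns true if the token is made of digit groups joined by single
 * dots or commas. The pattern is an assumption for illustration only.
 */
function looksLikeJoinedNumber(string $token): bool
{
    return preg_match('/\A\d+(?:[.,]\d+)*\z/', $token) === 1;
}

var_dump(looksLikeJoinedNumber('1.234,00'));  // bool(true)
var_dump(looksLikeJoinedNumber('2,3456.12')); // bool(true)
var_dump(looksLikeJoinedNumber('12.'));       // bool(false), trailing separator
```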

        The following characters are treated as "word characters":

        • HYPHEN-MINUS (U+002D)
        • HYPHEN (U+2010)
        • MINUS SIGN (U+2212)
        • EM DASH (U+2014)
        • NON-BREAKING HYPHEN (U+2011)   
        • SUPERSCRIPT TWO (U+00B2)
        • SOFT HYPHEN (U+00AD) 

        If you want to treat additional characters as "word characters", you can append them to the static $characters property. The characters need to be added in UTF-8 encoding. For example, to treat symbols such as "«" and "»" as word characters, you can add them this way:

        PHP
        \SetaPDF_Extractor_Strategy_Word::$characters .= '«»';
        // or
        \SetaPDF_Extractor_Strategy_Word::$characters .= "\xC2\xAB\xC2\xBB";
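
        The two variants above only produce the same string if the PHP source file itself is saved in UTF-8. The following check, which does not need the library, shows the byte sequences involved; the variable names are chosen for this example only.

```php
<?php
// U+00AB and U+00BB written as explicit UTF-8 byte sequences
$escaped = "\xC2\xAB\xC2\xBB";

// the same characters via Unicode code point escapes (PHP 7.0+),
// which always yield UTF-8 regardless of the source file encoding
$fromCodePoints = "\u{00AB}\u{00BB}";

var_dump($escaped === $fromCodePoints); // bool(true)
var_dump(bin2hex($fromCodePoints));     // string(8) "c2abc2bb"
```

        If the encoding of the source file is not under your control, the escaped forms are the safer choice.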

        The strategy will ignore spaces or "empty" words.

        Usage

        An instance of the strategy has to be created and passed to the main class:

        PHP
        $wordStrategy = new \SetaPDF_Extractor_Strategy_Word();
        $extractor = new \SetaPDF_Extractor($document);
        $extractor->setStrategy($wordStrategy);

        You can get the result of this strategy by calling the getResultByPageNumber() method for each individual page. Each word will be represented by an instance of SetaPDF_Extractor_Result_Word (default) or SetaPDF_Extractor_Result_WordWithGlyphs, both of which implement the SetaPDF_Extractor_Result_HasBoundsInterface interface.

        The detail level of the result can be controlled through the setDetailLevel() method. It accepts one of the following constants as its argument:

        public const string SetaPDF_Extractor_Strategy_Word::DETAIL_LEVEL_DEFAULT = 'default'

        Detail level constant.

        Default detail level resulting in instances of SetaPDF_Extractor_Result_Word.

        public const string SetaPDF_Extractor_Strategy_Word::DETAIL_LEVEL_GLYPHS = 'glyphs'

        Detail level constant.

        Extended detail level resulting in instances of SetaPDF_Extractor_Result_WordWithGlyphs.

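        The default result class SetaPDF_Extractor_Result_Word does not hold information about glyphs but is less memory intensive. To get additional information about the glyphs of a word, set the detail level to glyphs. The following example uses the default detail level; it walks through all pages and prints each word together with its bounds: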

        PHP
        <?php
        require_once('library/SetaPDF/Autoload.php');
        
        // get a document instance
        $document = \SetaPDF_Core_Document::loadByFilename(
            'files/pdfs/camtown/Laboratory-Report.pdf'
        );
        
        // create an extractor instance
        $extractor = new \SetaPDF_Extractor($document);
        
        // create the word strategy and pass it to the extractor instance
        $strategy = new \SetaPDF_Extractor_Strategy_Word();
        $extractor->setStrategy($strategy);
        
        // we need the total page count
        $pageCount = $document->getCatalog()->getPages()->count();
        
        // walk through the pages
        for ($pageNo = 1; $pageNo <= $pageCount; $pageNo++) {
            // ...and extract the data through the word strategy:
            $result = $extractor->getResultByPageNumber($pageNo);
        
            // debug/demonstration output
            echo '<h1>Page #' . $pageNo . '</h1>';
            echo 'Found ' . count($result) . ' words on page ' . $pageNo . ':<br />';
        
            echo '<table border="1">';
            echo '<tr><th>Word</th><th>llx</th><th>lly</th><th>ulx</th>' .
                 '<th>uly</th><th>urx</th><th>ury</th><th>lrx</th><th>lry</th></tr>';
        
            foreach ($result AS $i => $word) {
                echo '<tr>';
                echo '<td><b>' . $word . '</b></td>';
        
                $allBounds = $word->getBounds();
                foreach ($allBounds as $bounds) {
                    echo '<td>' . sprintf('%.4F', $bounds->getLl()->getX()) . '</td>';
                    echo '<td>' . sprintf('%.4F', $bounds->getLl()->getY()) . '</td>';
                    echo '<td>' . sprintf('%.4F', $bounds->getUl()->getX()) . '</td>';
                    echo '<td>' . sprintf('%.4F', $bounds->getUl()->getY()) . '</td>';
                    echo '<td>' . sprintf('%.4F', $bounds->getUr()->getX()) . '</td>';
                    echo '<td>' . sprintf('%.4F', $bounds->getUr()->getY()) . '</td>';
                    echo '<td>' . sprintf('%.4F', $bounds->getLr()->getX()) . '</td>';
                    echo '<td>' . sprintf('%.4F', $bounds->getLr()->getY()) . '</td>';
                }
        
                echo '</tr>';
            }
            echo '</table>';
        }

        The strategy allows you to pass a filter instance to limit the result, e.g. to a specific area on a page.
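
        The idea of limiting a result to an area can be illustrated with plain arrays. Note that this sketch does not use the library's filter classes and filters after the fact instead of during extraction. The function filterWordsByArea() and the keys text, llx, lly, urx and ury are hypothetical names used only here.

```php
<?php
/**
 * Keeps only the words whose bounds lie completely inside the given area.
 * Words and area are hypothetical arrays, not SetaPDF result objects.
 */
function filterWordsByArea(array $words, array $area): array
{
    return array_values(array_filter(
        $words,
        function (array $word) use ($area): bool {
            return $word['llx'] >= $area['llx']
                && $word['lly'] >= $area['lly']
                && $word['urx'] <= $area['urx']
                && $word['ury'] <= $area['ury'];
        }
    ));
}

$words = [
    ['text' => 'Invoice', 'llx' => 50.0,  'lly' => 780.0, 'urx' => 95.0,  'ury' => 792.0],
    ['text' => 'Total',   'llx' => 400.0, 'lly' => 100.0, 'urx' => 430.0, 'ury' => 112.0],
];

// a band at the top of an A4 page (595 x 842 points)
$headerArea = ['llx' => 0.0, 'lly' => 750.0, 'urx' => 595.0, 'ury' => 842.0];

$inHeader = filterWordsByArea($words, $headerArea);
echo implode(', ', array_column($inHeader, 'text')), PHP_EOL; // prints "Invoice"
```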