Word Group Strategy: Extracts Word Groups and Their Metrics

Table of Contents

  1. Introduction
  2. Process
  3. Usage

        Introduction

        The word group strategy allows you to extract groups of related words, such as the words in a column or paragraph. It is represented by the class \setasign\SetaPDF2\Extractor\Strategy\WordGroupStrategy.

        The result will be an instance of \setasign\SetaPDF2\Extractor\Result\WordGroups. Each group in this collection will be an instance of \setasign\SetaPDF2\Extractor\Result\Words, which holds several instances of \setasign\SetaPDF2\Extractor\Result\Word items.

        Process

        Because this strategy is based on words, it first creates them using the same logic as the word strategy.

        These words are then inserted into a spatial storage, each with a scaled rectangle. The scaling factors can be controlled with the setRectScaleFactorX() and setRectScaleFactorY() methods. Each factor is multiplied by the font size of a word to get the final scaling value.
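
        The scaling step can be pictured with a small stand-alone sketch. It does not use SetaPDF at all: the scaleRect() helper, the array keys and the rule that the rectangle grows by the scaling value on every side are assumptions made for illustration only, and the library may apply the value differently.

```php
<?php
// Hypothetical helper, not part of SetaPDF: expand a word's bounding box
// by (scale factor * font size) on each side.
function scaleRect(array $rect, float $fontSize, float $factorX, float $factorY): array
{
    $dx = $factorX * $fontSize; // final horizontal scaling value
    $dy = $factorY * $fontSize; // final vertical scaling value

    return [
        'llx' => $rect['llx'] - $dx,
        'lly' => $rect['lly'] - $dy,
        'urx' => $rect['urx'] + $dx,
        'ury' => $rect['ury'] + $dy,
    ];
}

// A word set in 10pt type, with assumed factors of 0.5 (x) and 0.25 (y).
$word = ['llx' => 100.0, 'lly' => 200.0, 'urx' => 140.0, 'ury' => 210.0];
$scaled = scaleRect($word, 10.0, 0.5, 0.25);
// The box grows by 5 units horizontally and 2.5 units vertically on each side.
```

        Because the value depends on the font size, words in larger type reach further towards their neighbours than words in small type.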

        The strategy then re-extracts the words from the storage, keeping words whose rectangles collide together in one group.
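
        As a rough illustration of grouping by collision, the following stand-alone sketch collects words whose rectangles overlap, directly or through a chain of other words. The collides() and groupWords() helpers and the data layout are hypothetical, and a brute-force comparison stands in for the spatial storage; it is not the library's implementation.

```php
<?php
// Hypothetical helpers, not part of SetaPDF.

// True if two axis-aligned rectangles overlap or touch.
function collides(array $a, array $b): bool
{
    return $a['llx'] <= $b['urx'] && $b['llx'] <= $a['urx']
        && $a['lly'] <= $b['ury'] && $b['lly'] <= $a['ury'];
}

// Group words whose (already scaled) rectangles collide, directly or
// through a chain of other words. Brute force, no real spatial index.
function groupWords(array $words): array
{
    $groups = [];
    $remaining = $words;

    while ($remaining !== []) {
        // Take any remaining word as the seed of a new group.
        $queue = [array_shift($remaining)];
        $group = [];

        while ($queue !== []) {
            $current = array_shift($queue);
            $group[] = $current['text'];

            // Pull every remaining word that collides with the current one.
            foreach ($remaining as $key => $candidate) {
                if (collides($current['rect'], $candidate['rect'])) {
                    $queue[] = $candidate;
                    unset($remaining[$key]);
                }
            }
        }

        $groups[] = $group;
    }

    return $groups;
}

$words = [
    ['text' => 'Hello',  'rect' => ['llx' => 0,   'lly' => 100, 'urx' => 30,  'ury' => 110]],
    ['text' => 'world',  'rect' => ['llx' => 28,  'lly' => 100, 'urx' => 60,  'ury' => 110]],
    ['text' => 'next',   'rect' => ['llx' => 0,   'lly' => 90,  'urx' => 25,  'ury' => 101]],
    ['text' => 'Footer', 'rect' => ['llx' => 200, 'lly' => 10,  'urx' => 240, 'ury' => 20]],
];

$groups = groupWords($words);
// "Hello", "world" and "next" end up in one group, "Footer" in its own.
```

        The sketch returns the words of a group in the order they were discovered, which is not necessarily reading order.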

        Additionally, the setAllowedFontSizeDifference() method lets you define the allowed font-size difference between words.

        Because all words are ordered in advance, the resulting groups are simply re-ordered based on that ordering.

        After that, an optional process (active by default) reassembles words that are split by hyphens across several lines. This can be controlled with the setDehyphen() method.
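
        The idea behind dehyphenation can be sketched in plain PHP. The dehyphen() helper below is hypothetical and works on simple arrays of strings rather than on the library's result objects; it only shows the general technique of joining a word that ends a line with a hyphen to the first word of the next line.

```php
<?php
// Hypothetical helper, not part of SetaPDF: join a word that ends a line
// with a hyphen to the first word of the following line.
function dehyphen(array $lines): array
{
    $words = [];
    $carry = ''; // first half of a word split at the end of the previous line

    foreach ($lines as $line) {
        $last = count($line) - 1;

        foreach (array_values($line) as $index => $word) {
            if ($index === 0) {
                // Prepend whatever was held back from the previous line.
                $word = $carry . $word;
                $carry = '';
            }

            if ($index === $last && strlen($word) > 1 && substr($word, -1) === '-') {
                // Hold the first half back and drop the hyphen.
                $carry = substr($word, 0, -1);
                continue;
            }

            $words[] = $word;
        }
    }

    // A trailing hyphenated word without a continuation is kept as it is.
    if ($carry !== '') {
        $words[] = $carry . '-';
    }

    return $words;
}

$result = dehyphen([
    ['The', 'docu-'],
    ['ment', 'is', 'long'],
]);
```

        A naive rule like this also merges genuine compounds that happen to break at their hyphen, which is one reason why it can be useful to switch the step off.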

        Usage

        An instance has to be created individually and passed to the main Extractor class:

        PHP
        use setasign\SetaPDF2\Extractor\Extractor;
        use setasign\SetaPDF2\Extractor\Strategy\WordGroupStrategy;
        
        $wordStrategy = new WordGroupStrategy();
        $extractor = new Extractor($document);
        $extractor->setStrategy($wordStrategy);
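
        The methods mentioned in the Process section can be called on the strategy instance before it is passed to the extractor. The following is only a sketch: the method names come from the text above, while the argument types and all values shown are assumptions and not the library defaults. $document is assumed to be an already loaded document instance.

```php
<?php
use setasign\SetaPDF2\Extractor\Extractor;
use setasign\SetaPDF2\Extractor\Strategy\WordGroupStrategy;

$strategy = new WordGroupStrategy();

// All argument values below are illustrative assumptions, not defaults.
$strategy->setRectScaleFactorX(0.5);          // multiplied by a word's font size
$strategy->setRectScaleFactorY(0.5);          // multiplied by a word's font size
$strategy->setAllowedFontSizeDifference(1.0); // tolerated font-size difference
$strategy->setDehyphen(false);                // skip the dehyphenation step

// $document is assumed to exist already.
$extractor = new Extractor($document);
$extractor->setStrategy($strategy);
```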

        You can get the result of this strategy by calling the getResultByPageNumber() method for each individual page.

        Each word group will be represented by an instance of \setasign\SetaPDF2\Extractor\Result\Words. The items in this collection will be represented by instances of \setasign\SetaPDF2\Extractor\Result\Word (default) or \setasign\SetaPDF2\Extractor\Result\WordWithGlyphs, which both implement the \setasign\SetaPDF2\Extractor\Result\HasBoundsInterface interface.

        The detail level of the result can be controlled through the setDetailLevel() method. It accepts the following constant values as arguments:

        Default detail level: results in instances of \setasign\SetaPDF2\Extractor\Result\Word.

        Extended detail level: results in instances of \setasign\SetaPDF2\Extractor\Result\WordWithGlyphs.

        The default result class \setasign\SetaPDF2\Extractor\Result\Word does not hold information about glyphs but is less memory intensive. To get additional information about the glyphs of a word, set the detail level to the extended level.

        PHP
        <?php
        
        use setasign\SetaPDF2\Core\Document;
        use setasign\SetaPDF2\Extractor\Extractor;
        use setasign\SetaPDF2\Extractor\Strategy\WordGroupStrategy;
        
        require_once('library/SetaPDF/Autoload.php');
        
        // get a document instance
        $document = Document::loadByFilename(
            'files/pdfs/camtown/Terms-and-Conditions.pdf'
        );
        
        // create an extractor instance
        $extractor = new Extractor($document);
        
        // create the word group strategy and pass it to the extractor instance
        $strategy = new WordGroupStrategy();
        $extractor->setStrategy($strategy);
        
        // we need the total page count
        $pageCount = $document->getCatalog()->getPages()->count();
        
        // walk through the pages
        for ($pageNo = 1; $pageNo <= $pageCount; $pageNo++) {
            // ...and extract the data through the word group strategy:
            $result = $extractor->getResultByPageNumber($pageNo);
        
            // debug/demonstration output
            echo '<h1>Page #' . $pageNo . '</h1>';
            echo 'Found ' . count($result) . ' word groups on page ' . $pageNo . ':<br />';
        
            foreach ($result as $group) {
                echo '<p>';
        
                echo $group;
                // or iterate over all words
                // foreach ($group as $word) {
                //     echo $word->getString() . ' ';
                // }
        
                echo '</p><hr />';
            }
        }
        

        The strategy also allows you to pass a filter instance to limit the result, for example to a specific area on a page.