Exact Plain Text Strategy: Extracts Simple Plain Text By Glyphs

Table of Contents

  1. Introduction
    1. Process
      1. Usage

        Introduction

        The exact plain text strategy is similar to the plain text strategy, but it uses the detail level of the glyph strategy to extract the text. That is, it recreates the resulting text by sorting, comparing, and concatenating individual glyphs, and it ignores existing text items from the PDF document.

        Especially in conjunction with a filter, e.g. a rectangle filter, this strategy returns much more precise results.

        The strategy is represented by the SetaPDF_Extractor_Strategy_ExactPlain class. 

        The result is also a standard PHP string.

        Process

        The exact plain text strategy makes use of the glyph strategy, which extracts every single glyph, including its metrics, in the order in which it appears in the PDF data stream.

        These glyphs are then passed to the same logic that the plain text strategy uses.

        If a space between words is "faked" by a character spacing value, this strategy is able to recognize it as a word separator!
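
        The idea of sorting, comparing, and concatenating glyphs can be illustrated with a small, self-contained sketch. It is not SetaPDF's internal implementation. The glyph array shape (char, x, y, width), the function name joinGlyphs(), and the 0.3 gap threshold are assumptions made for this example only, and it assumes that all glyphs of a line share the same baseline. A horizontal gap that is wide relative to the preceding glyph is treated as a word separator, even though no space glyph exists in the input.

```php
<?php
// Illustrative sketch only: NOT the SetaPDF implementation.
// A hypothetical glyph is an array with the keys:
//   char  => the character, x => left edge, y => baseline, width => advance width

/**
 * Rebuilds plain text from individual glyphs by sorting them into reading
 * order and comparing each glyph with its predecessor.
 *
 * $spaceFactor is an assumed threshold: a gap wider than this fraction of the
 * previous glyph's width counts as a word separator.
 */
function joinGlyphs(array $glyphs, float $spaceFactor = 0.3): string
{
    // Sort top to bottom (PDF y coordinates grow upwards), then left to right.
    usort($glyphs, function (array $a, array $b): int {
        if ($a['y'] != $b['y']) {
            return $b['y'] <=> $a['y'];
        }
        return $a['x'] <=> $b['x'];
    });

    $text = '';
    $prev = null;
    foreach ($glyphs as $glyph) {
        if ($prev !== null) {
            if ($glyph['y'] != $prev['y']) {
                // A different baseline starts a new line.
                $text .= "\n";
            } else {
                // Distance between the end of the previous glyph and this one.
                $gap = $glyph['x'] - ($prev['x'] + $prev['width']);
                if ($gap > $spaceFactor * $prev['width']) {
                    // A wide gap (e.g. "faked" by character spacing) separates words.
                    $text .= ' ';
                }
            }
        }
        $text .= $glyph['char'];
        $prev = $glyph;
    }

    return $text;
}

// Glyphs in arbitrary stream order; there is no space glyph between "Hi" and "PDF".
$glyphs = [
    ['char' => 'P', 'x' => 30.0, 'y' => 100.0, 'width' => 6.0],
    ['char' => 'H', 'x' => 10.0, 'y' => 100.0, 'width' => 7.0],
    ['char' => 'k', 'x' => 15.0, 'y' => 80.0,  'width' => 5.0],
    ['char' => 'i', 'x' => 17.0, 'y' => 100.0, 'width' => 3.0],
    ['char' => 'D', 'x' => 36.0, 'y' => 100.0, 'width' => 7.0],
    ['char' => 'o', 'x' => 10.0, 'y' => 80.0,  'width' => 5.0],
    ['char' => 'F', 'x' => 43.0, 'y' => 100.0, 'width' => 6.0],
];

echo joinGlyphs($glyphs), PHP_EOL; // prints "Hi PDF" and, on the next line, "ok"
```

        A real implementation has to deal with more cases than this sketch does, such as rotated text, differing font sizes, and baselines that vary slightly within a line.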

        Usage

        An instance has to be created individually and passed to the main class:

        PHP
        $strategy = new \SetaPDF_Extractor_Strategy_ExactPlain();
        $extractor = new \SetaPDF_Extractor($document);
        $extractor->setStrategy($strategy);

        You can get the string result of this strategy by calling the getResultByPageNumber() method for each individual page:

        PHP
        <?php
        require_once('library/SetaPDF/Autoload.php');
        
        // get a document instance
        $document = \SetaPDF_Core_Document::loadByFilename(
            'files/pdfs/camtown/Laboratory-Report.pdf'
        );
        
        // create an extractor instance
        $extractor = new \SetaPDF_Extractor($document);
        // create the strategy
        $strategy = new \SetaPDF_Extractor_Strategy_ExactPlain();
        // pass it to the extractor instance
        $extractor->setStrategy($strategy);
        // we need the total page count
        $pageCount = $document->getCatalog()->getPages()->count();
        
        // walk through the pages
        for ($pageNo = 1; $pageNo <= $pageCount; $pageNo++) {
            // ...and extract the data through the exact plain text strategy:
            $result = $extractor->getResultByPageNumber($pageNo);
        
            // debug/demonstration output
            echo '<h1>Page #' . $pageNo . '</h1>';
            echo '<pre>';
            var_dump($result);
            echo '</pre>';
        }

        The strategy allows you to pass a filter instance to limit the result, e.g. to a specific area on a page.