Exact Plain Text Strategy: Extracts Simple Plain Text By Glyphs

Table of Contents

  1. Introduction
  2. Process
  3. Usage

        Introduction

        The exact plain text strategy is similar to the plain text strategy, but it works at the detail level of the glyph strategy to extract the text. That is, it recreates the resulting text by sorting, comparing and concatenating individual glyphs, and it ignores the existing text items of the PDF document.

        Especially in conjunction with a filter, e.g. a rectangle filter, this strategy returns much more precise results.

        The strategy is represented by the \setasign\SetaPDF2\Extractor\Strategy\ExactPlainStrategy class. 

        As with the plain text strategy, the result is a standard PHP string.

        Process

        The exact plain text strategy makes use of the glyph strategy, which extracts every single glyph, including its metrics, in the order in which it appears in the PDF data stream.

        These single glyphs are then passed to the same logic that the plain text strategy uses.

        If a space between words is "faked" by a character spacing value, this strategy is able to recognize it as a word separator!
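
        The idea of rebuilding text from positioned glyphs can be illustrated with a minimal, self-contained sketch. It is not SetaPDF's implementation: the glyph array layout, the function name joinGlyphs() and the $spaceFactor threshold are assumptions made purely for illustration. The sketch sorts the glyphs by position and inserts a space whenever the horizontal gap between two neighbouring glyphs is noticeably larger than zero, which is how a space "faked" by character spacing can be detected without a real space character.

        ```php
        <?php

        /**
         * Hypothetical helper (not part of SetaPDF): rebuilds text from glyphs.
         * Each glyph is ['char' => string, 'x' => float, 'y' => float, 'width' => float].
         */
        function joinGlyphs(array $glyphs, float $spaceFactor = 0.3): string
        {
            // sort top-to-bottom (the PDF y axis points upwards), then left-to-right
            usort($glyphs, function (array $a, array $b): int {
                return $b['y'] <=> $a['y'] ?: $a['x'] <=> $b['x'];
            });

            $text = '';
            $previous = null;
            foreach ($glyphs as $glyph) {
                if ($previous !== null) {
                    if ($glyph['y'] != $previous['y']) {
                        // a different baseline starts a new line
                        $text .= "\n";
                    } else {
                        // distance between the end of the previous glyph and this one
                        $gap = $glyph['x'] - ($previous['x'] + $previous['width']);
                        if ($gap > $spaceFactor * $previous['width']) {
                            // a visible gap without a space glyph is a word separator
                            $text .= ' ';
                        }
                    }
                }
                $text .= $glyph['char'];
                $previous = $glyph;
            }

            return $text;
        }

        // glyphs in arbitrary stream order; no space glyph exists between "Hi" and "you"
        $glyphs = [
            ['char' => 'y', 'x' => 15, 'y' => 700, 'width' => 6],
            ['char' => 'H', 'x' => 0,  'y' => 700, 'width' => 7],
            ['char' => 'i', 'x' => 7,  'y' => 700, 'width' => 3],
            ['char' => 'o', 'x' => 0,  'y' => 680, 'width' => 6],
            ['char' => 'o', 'x' => 21, 'y' => 700, 'width' => 6],
            ['char' => 'u', 'x' => 27, 'y' => 700, 'width' => 6],
            ['char' => 'k', 'x' => 6,  'y' => 680, 'width' => 6],
        ];

        // prints "Hi you" on the first line and "ok" on the second
        echo joinGlyphs($glyphs);
        ```

        The threshold is relative to the width of the previous glyph so that the check scales with the font size. A real implementation would also have to tolerate slightly differing baselines, rotated text and kerning, all of which this sketch ignores.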

        Usage

        An instance has to be created explicitly and passed to the main Extractor class:

        PHP
        use setasign\SetaPDF2\Extractor\Extractor;
        use setasign\SetaPDF2\Extractor\Strategy\ExactPlainStrategy;
        
        $strategy = new ExactPlainStrategy();
        $extractor = new Extractor($document);
        $extractor->setStrategy($strategy);

        With this strategy you get a string result by calling the getResultByPageNumber() method for each individual page:

        PHP
        <?php
        
        use setasign\SetaPDF2\Core\Document;
        use setasign\SetaPDF2\Extractor\Extractor;
        use setasign\SetaPDF2\Extractor\Strategy\ExactPlainStrategy;
        
        require_once('library/SetaPDF/Autoload.php');
        
        // get a document instance
        $document = Document::loadByFilename(
            'files/pdfs/camtown/Laboratory-Report.pdf'
        );
        
        // create an extractor instance
        $extractor = new Extractor($document);
        // create the strategy
        $strategy = new ExactPlainStrategy();
        // pass it to the extractor instance
        $extractor->setStrategy($strategy);
        // we need the total page count
        $pageCount = $document->getCatalog()->getPages()->count();
        
        // walk through the pages
        for ($pageNo = 1; $pageNo <= $pageCount; $pageNo++) {
            // ...and extract the data through the exact plain text strategy:
            $result = $extractor->getResultByPageNumber($pageNo);
        
            // debug/demonstration output
            echo '<h1>Page #' . $pageNo . '</h1>';
            echo '<pre>';
            var_dump($result);
            echo '</pre>';
        }
        

        The strategy allows you to pass a filter instance to limit the result, e.g. to a specific area on a page: