Plain Text Strategy: Extracts Simple Plain Text

Table of Contents

  1. Introduction
  2. Process
  3. Usage

        Introduction

        The plain text strategy is the default strategy of the SetaPDF-Extractor component and lets you extract plain text from PDF documents. It is represented by the class SetaPDF_Extractor_Strategy_Plain.

        By default, the text items are sorted by the baseline sorter, but a different or custom sorter instance can be passed through the setSorter() method.
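        As a rough sketch of how a sorter could be passed explicitly: the class name SetaPDF_Extractor_Sorter_Baseline is an assumption about how the baseline sorter is named in the library, and the file path is only an example. Check both against your installation before relying on this.

        ```php
        <?php
        require_once('library/SetaPDF/Autoload.php');

        // get a document instance (example path)
        $document = SetaPDF_Core_Document::loadByFilename(
            'files/pdfs/camtown/Laboratory-Report.pdf'
        );

        // create the strategy and hand over a sorter instance explicitly;
        // the sorter class name is an assumption, see the note above
        $plainText = new SetaPDF_Extractor_Strategy_Plain();
        $plainText->setSorter(new SetaPDF_Extractor_Sorter_Baseline());

        // use the configured strategy for the extraction
        $extractor = new SetaPDF_Extractor($document);
        $extractor->setStrategy($plainText);

        echo $extractor->getResultByPageNumber(1);
        ```

        Any other sorter instance would be passed the same way.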

        The result is a standard PHP string.

        Process

        The plain text strategy extracts all defined text items, including their metrics, into a temporary result. The items are taken as they appear in the PDF data stream. This means that several words in a single text item are processed as a whole, while a word split over several text items is processed as several individual items.

        This result is then sorted and grouped into lines and orientations (by default via the baseline sorter).

        The resulting text string is created by walking through the sorted and grouped result and comparing the previous item with the current one to decide whether both text items form a continuous segment. This is done by checking for a gap between both items on their ordinate. The required size of this gap is defined by the average width of the space character of both text items divided by a factor defined in the $spaceWidthFactor property.
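        The gap check can be illustrated with a small, self-contained sketch. It is not the library's implementation: the item layout (text, start, end, spaceWidth), the function name joinItems(), the default factor of 2.0 and the ">=" comparison are all assumptions made for illustration, and positions are simplified to a single axis along the line.

        ```php
        <?php
        /**
         * Joins already sorted text items of one line into a string.
         * Each item is a hypothetical array with the keys text, start, end
         * and spaceWidth (width of a space character for that item).
         */
        function joinItems(array $items, float $spaceWidthFactor = 2.0): string
        {
            $result = '';
            $previous = null;

            foreach ($items as $item) {
                if ($previous !== null) {
                    // average space width of both items, divided by the factor
                    $threshold = (($previous['spaceWidth'] + $item['spaceWidth']) / 2)
                        / $spaceWidthFactor;
                    $gap = $item['start'] - $previous['end'];

                    // a gap of at least the threshold is treated as a word separator
                    if ($gap >= $threshold) {
                        $result .= ' ';
                    }
                }

                $result .= $item['text'];
                $previous = $item;
            }

            return $result;
        }

        // "Text" is split over two items, as it may be in a PDF data stream
        $items = [
            ['text' => 'Plain', 'start' => 0.0,  'end' => 25.0, 'spaceWidth' => 3.0],
            ['text' => 'Te',    'start' => 28.0, 'end' => 38.0, 'spaceWidth' => 3.0],
            ['text' => 'xt',    'start' => 38.2, 'end' => 47.0, 'spaceWidth' => 3.0],
        ];

        echo joinItems($items), PHP_EOL; // prints "Plain Text"
        ```

        With these sample values the threshold is 3.0 / 2.0 = 1.5, so the gap of 3.0 yields a space while the gap of about 0.2 joins "Te" and "xt" into one word. A smaller factor raises the threshold and merges more items.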

        If a space between words is "faked" by a character spacing value, this strategy cannot recognize it as a word separator. The exact plain strategy is able to handle this situation.

        Usage

        An instance can be created individually or retrieved from the main class:

        PHP
        // get the default instance
        $extractor = new SetaPDF_Extractor($document);
        $plainText = $extractor->getStrategy();
        
        // or create your own
        $plainText = new SetaPDF_Extractor_Strategy_Plain();
        $extractor = new SetaPDF_Extractor($document);
        $extractor->setStrategy($plainText);

        To get a string result from this strategy, call the getResultByPageNumber() method for each individual page:

        PHP
        <?php
        require_once('library/SetaPDF/Autoload.php');
        
        // get a document instance
        $document = SetaPDF_Core_Document::loadByFilename(
            'files/pdfs/camtown/Laboratory-Report.pdf'
        );
        
        // create an extractor instance
        $extractor = new SetaPDF_Extractor($document);
        // we need the total page count
        $pageCount = $document->getCatalog()->getPages()->count();
        
        // walk through the pages
        for ($pageNo = 1; $pageNo <= $pageCount; $pageNo++) {
            // ...and extract the data through the default strategy:
            $result = $extractor->getResultByPageNumber($pageNo);
        
            // debug/demonstration output
            echo '<h1>Page #' . $pageNo . '</h1>';
            echo '<pre>';
            var_dump($result);
            echo '</pre>';
        }            

        The strategy allows you to pass a filter instance to limit the result, e.g. to a specific area on a page.
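        A minimal sketch of an area filter might look like the following. The class names SetaPDF_Extractor_Filter_Rectangle and SetaPDF_Core_Geometry_Rectangle, the setFilter() method and the constructor arguments are assumptions based on the library's naming scheme, and the coordinates are arbitrary example values; verify them against the API reference.

        ```php
        <?php
        require_once('library/SetaPDF/Autoload.php');

        // get a document instance (example path)
        $document = SetaPDF_Core_Document::loadByFilename(
            'files/pdfs/camtown/Laboratory-Report.pdf'
        );

        $extractor = new SetaPDF_Extractor($document);
        $strategy = $extractor->getStrategy();

        // limit the result to a rectangular area on the page;
        // lower left and upper right corner, arbitrary example values
        $filter = new SetaPDF_Extractor_Filter_Rectangle(
            new SetaPDF_Core_Geometry_Rectangle(40, 700, 300, 800)
        );
        $strategy->setFilter($filter);

        // only text items matched by the filter end up in the result
        $result = $extractor->getResultByPageNumber(1);
        var_dump($result);
        ```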