Filter: Filter Items Before They Are Passed To a Result
Sometimes it is necessary to extract only specific text, glyphs or words from a PDF page.
The filter interface defines a single method, accept(), which is called with each text item found. Its return value decides whether the item is added to the result: a value that evaluates to true accepts the item, a boolean false rejects it.
If the returned value evaluates to true but is not a boolean, it is forwarded to the result as a kind of filter identification (filter id).
Be aware that a text item passed to the accept() method is not guaranteed to be a semantic entity.
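The return value rule can be sketched in a few lines. This is a minimal, self-contained illustration and not SetaPDF code: the function applyFilter(), the plain string items and the shape of the result array are assumptions made only for this example.

```php
<?php
// Hypothetical stand-in for how an extractor might evaluate accept() results.
// Not part of SetaPDF: names and the result array shape are assumptions.
function applyFilter(callable $accept, array $items): array
{
    $result = [];
    foreach ($items as $item) {
        $value = $accept($item);
        if (!$value) {
            continue; // false (or any falsy value): the item is dropped
        }
        // true: accepted without an id; any other truthy value acts as filter id
        $result[] = [
            'item'     => $item,
            'filterId' => $value === true ? null : $value,
        ];
    }
    return $result;
}

// A filter callback that tags digit-only items and accepts other non-empty ones.
$accept = function (string $item) {
    if (ctype_digit($item)) {
        return 'number';  // truthy non-boolean: forwarded as filter id
    }
    return $item !== '';  // plain boolean decision
};

$filtered = applyFilter($accept, ['Invoice', '4711', '']);
// 'Invoice' is kept without an id, '4711' is kept with id 'number', '' is dropped.
```

Returning an id instead of true makes it possible to tell later which rule let an item through.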
The SetaPDF-Extractor component comes with the following predefined filter implementations: the rectangle filter, the font size filter and the multi filter.
The rectangle filter allows you to restrict the matched items to a rectangular area.
The class works in two modes, one of which is passed to the constructor:
Contact mode: the text item has to touch the rectangle of the filter instance, through any point or intersection.
Contains mode: the whole text item has to lie inside the rectangle of the filter instance.
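The difference between the two modes can be illustrated with a small geometric sketch. It assumes axis-aligned rectangles given as [llx, lly, urx, ury] arrays; the helper functions rectContacts() and rectContains() are hypothetical and not part of SetaPDF, and the library's own handling of rotated text items may differ.

```php
<?php
// Hypothetical helpers, not part of SetaPDF. Rectangles are axis-aligned
// arrays of the form [llx, lly, urx, ury] in the same coordinate system.

// Contact: the two rectangles share at least one point (touching edges count).
function rectContacts(array $filter, array $item): bool
{
    return $item[0] <= $filter[2] && $item[2] >= $filter[0]
        && $item[1] <= $filter[3] && $item[3] >= $filter[1];
}

// Contains: the item lies completely inside the filter rectangle.
function rectContains(array $filter, array $item): bool
{
    return $item[0] >= $filter[0] && $item[2] <= $filter[2]
        && $item[1] >= $filter[1] && $item[3] <= $filter[3];
}

$filterRect = [40, 700, 260, 720];
$inside     = [50, 705, 120, 715]; // fully inside the filter rectangle
$crossing   = [50, 690, 120, 710]; // crosses the lower edge
$outside    = [50, 600, 120, 610]; // no overlap at all

$contactInside    = rectContacts($filterRect, $inside);
$containsInside   = rectContains($filterRect, $inside);
$contactCrossing  = rectContacts($filterRect, $crossing);
$containsCrossing = rectContains($filterRect, $crossing);
$contactOutside   = rectContacts($filterRect, $outside);
```

An item that only crosses an edge matches in contact mode but not in contains mode, which is why the choice of mode matters when neighbouring lines sit close to the filter area.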
The coordinates returned by a glyph or word are in the user space of the PDF page. Because a page can be rotated, the rectangle filter internally translates the rectangle values according to this rotation. To do this, the filter also implements the SetaPDF_Extractor_Filter_PageFilterInterface interface, which allows a filter to adjust its settings when the page changes.
The following demo shows you how to extract only the sender name of an invoice:
To make this demo easier to follow, the next demo shows the filtered area and the matched glyphs. It uses another strategy for demonstration purposes:
If the filter were used in contains mode, the glyphs would not match. To demonstrate this behaviour, we simply enlarge the rectangle so that it intersects the address field below the expected line. Because of the contains mode, only the sender name is matched:
The font size filter allows you to filter text items based on their font size. Its modes, each represented by a class constant, give you full control over the matched items:
Between: the font size needs to be between the given filter values. In this mode the filter value needs to be an array. Otherwise the mode behaves like SetaPDF_Extractor_Filter_FontSize::MODE_EQUALS.
Between or equal: the font size needs to be between or equal to the given filter values. In this mode the filter value needs to be an array. Otherwise the mode behaves like SetaPDF_Extractor_Filter_FontSize::MODE_EQUALS.
Equals: the font size needs to be equal to the given filter value.
Larger: the font size needs to be larger than the given filter value.
Larger or equal: the font size needs to be larger than or equal to the given filter value.
Smaller: the font size needs to be smaller than the given filter value.
Smaller or equal: the font size needs to be smaller than or equal to the given filter value.
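The comparison rules above can be summarised in a short sketch. It is not the SetaPDF implementation: the function fontSizeMatches() and its mode strings are hypothetical labels used only here (they are not the library's class constants), and the small tolerance in the equality check is an assumption.

```php
<?php
// Hypothetical stand-in, not the SetaPDF implementation. The mode strings
// are illustrative labels, not the library's class constants.
function fontSizeMatches(float $size, string $mode, $value): bool
{
    // The "between" modes need an array of values; per the description above,
    // any other value makes them behave like "equals".
    if (in_array($mode, ['between', 'betweenOrEquals'], true) && !is_array($value)) {
        $mode = 'equals';
    }

    switch ($mode) {
        case 'between':
            return $size > min($value) && $size < max($value);
        case 'betweenOrEquals':
            return $size >= min($value) && $size <= max($value);
        case 'equals':
            return abs($size - $value) < 1e-6; // tolerance is an assumption
        case 'larger':
            return $size > $value;
        case 'largerOrEquals':
            return $size >= $value;
        case 'smaller':
            return $size < $value;
        case 'smallerOrEquals':
            return $size <= $value;
    }

    throw new InvalidArgumentException('Unknown mode: ' . $mode);
}

$headline  = fontSizeMatches(24.0, 'equals', 24);                  // exact match
$exclusive = fontSizeMatches(12.0, 'between', [12, 24]);           // bounds excluded
$inclusive = fontSizeMatches(12.0, 'betweenOrEquals', [12, 24]);   // bounds included
$fallback  = fontSizeMatches(24.0, 'between', 24);                 // no array: treated as equals
```

The only difference between the two range modes is whether the boundary values themselves count as a match.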
A simple demo script that extracts text with a font size of 24pt could look like:
The SetaPDF_Extractor_Filter_Multi class allows you to combine several filter instances into a single filter.
The filters are combined with OR logic by default: if one of the filters matches, the item is accepted. It is also possible to use AND logic by passing the mode constant SetaPDF_Extractor_Filter_Multi::MODE_AND as the second parameter to the constructor.
The id parameter is only used if the instance is working in AND mode.
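The two combination modes can be sketched as follows. This is a self-contained illustration, not SetaPDF code: multiAccept() and the callable filters are hypothetical, and the assumption that OR mode passes on the id of the first matching filter is an interpretation of the text above rather than a confirmed detail of the library.

```php
<?php
// Hypothetical stand-in, not SetaPDF code. Filters are plain callables that
// return false, true, or a truthy filter id.
function multiAccept(array $filters, $item, string $mode = 'or', $id = null)
{
    if ($mode === 'or') {
        foreach ($filters as $filter) {
            $value = $filter($item);
            if ($value) {
                return $value; // assumption: the first match keeps its own id
            }
        }
        return false;
    }

    // AND: every filter has to accept the item.
    foreach ($filters as $filter) {
        if (!$filter($item)) {
            return false;
        }
    }

    // No single sub-filter "owns" the match, so the group's id is returned.
    return $id === null ? true : $id;
}

$isNumeric = function (string $s) { return ctype_digit($s) ? 'numeric' : false; };
$isLong    = function (string $s) { return strlen($s) >= 4 ? 'long' : false; };
$filters   = [$isNumeric, $isLong];

$orShortNumber  = multiAccept($filters, '12');                  // only $isNumeric matches
$orLongWord     = multiAccept($filters, 'Invoice');             // only $isLong matches
$andShortNumber = multiAccept($filters, '12', 'and', 'both');   // $isLong rejects it
$andLongNumber  = multiAccept($filters, '4711', 'and', 'both'); // both filters match
```

In this sketch the id only matters in AND mode because in OR mode the matching sub-filter already supplies one.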
The following demo extracts the sender line and the invoice number. Filter ids are added to the filter instances to get a detailed result:
Creating an individual (custom) filter is easy: implement at least the SetaPDF_Extractor_Filter_FilterInterface interface.
The following example will show you how to filter only numeric text items:
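A minimal sketch of such a filter is shown here. To keep it runnable without the library, it uses the stand-ins DemoFilterInterface and DemoTextItem, which are hypothetical; in a real project the class would implement SetaPDF_Extractor_Filter_FilterInterface instead, and the getString() accessor on the text item is an assumption of this example.

```php
<?php
// Stand-ins so the sketch runs without the library (both are hypothetical).
interface DemoFilterInterface
{
    public function accept(DemoTextItem $textItem);
}

final class DemoTextItem
{
    private $string;

    public function __construct(string $string)
    {
        $this->string = $string;
    }

    // Assumed accessor for the text of the item.
    public function getString(): string
    {
        return $this->string;
    }
}

final class NumericFilter implements DemoFilterInterface
{
    public function accept(DemoTextItem $textItem)
    {
        // Accepts everything PHP's is_numeric() treats as a number,
        // for example "4711", "19.99" or "1e3"; surrounding spaces are ignored.
        return is_numeric(trim($textItem->getString()));
    }
}

$filter     = new NumericFilter();
$acceptsId  = $filter->accept(new DemoTextItem('4711'));
$acceptsSum = $filter->accept(new DemoTextItem(' 19.99 '));
$acceptsTxt = $filter->accept(new DemoTextItem('Invoice'));
```

Because a text item is not guaranteed to be a complete word or number, a filter like this may see fragments, so the chosen extraction strategy influences what it matches.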
If a custom filter depends on a specific page format or another page property, you can also implement the SetaPDF_Extractor_Filter_PageFilterInterface interface to keep track of page changes.
The setPage() method is called whenever the page changes.
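The idea of a page-aware filter can be sketched like this. Everything in the example is hypothetical: TopHalfFilter is not a SetaPDF class, the page is modelled as a plain array with a height entry, and accept() receives only a y coordinate instead of a text item. The library's real method signatures differ.

```php
<?php
// Hypothetical page-aware filter, not SetaPDF code. It keeps only items
// located in the upper half of the current page.
final class TopHalfFilter
{
    private $pageHeight = 0.0;

    // Called once per page change; caches what accept() needs later.
    public function setPage(array $page): void
    {
        $this->pageHeight = (float) $page['height'];
    }

    // $y is the vertical position of the item in page coordinates.
    public function accept(float $y): bool
    {
        return $y >= $this->pageHeight / 2;
    }
}

$filter = new TopHalfFilter();

$filter->setPage(['height' => 842]);   // e.g. an A4 portrait page
$upperOnA4 = $filter->accept(700.0);
$lowerOnA4 = $filter->accept(100.0);

$filter->setPage(['height' => 200]);   // a much smaller page
$upperOnSmall = $filter->accept(100.0);
```

The same y coordinate is rejected on the tall page and accepted on the small one, which is the kind of per-page state that setPage() is meant to refresh.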