Hints
Individual Glyph Names
Some PDFs are created in a way that makes it impossible to extract the real text. The glyphs shown have no relation to character codes in any specific encoding, so the extracted text is gibberish. You can reproduce this by copying and pasting from the PDF reader application of your choice.
We stumble over such documents daily.
We also noticed that some of these documents follow their own logic by using non-standard glyph names. For these documents you can define a callback or an additional glyph name table that maps a glyph name to a UTF-16BE value.
For example, we encounter documents whose glyph names have a specific structure: the prefix "MT" followed by a number that is identical to the code point in WinAnsi encoding. A callback that gives you access to the text of such documents looks like this:
$callback = static function($name) {
    if (strpos($name, 'MT') === 0) {
        return \SetaPDF_Core_Encoding::convert(
            chr((int)substr($name, 2)),
            \SetaPDF_Core_Encoding::WIN_ANSI,
            'UTF-16BE'
        );
    }

    return '';
};

\SetaPDF_Core_Font_Glyph_List::$lists[\SetaPDF_Core_Font_Glyph_List::LIST_CUSTOM] = $callback;

// or

$table = array(
    'MT64' => \SetaPDF_Core_Encoding::convert(chr(64), \SetaPDF_Core_Encoding::WIN_ANSI, 'UTF-16BE'),
    'MT65' => \SetaPDF_Core_Encoding::convert(chr(65), \SetaPDF_Core_Encoding::WIN_ANSI, 'UTF-16BE'),
    // ...
);

\SetaPDF_Core_Font_Glyph_List::$lists[\SetaPDF_Core_Font_Glyph_List::LIST_CUSTOM] = $table;
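To experiment with the mapping outside of the library, the following stand-alone sketch performs the same "MT" plus number translation with PHP's iconv extension. The function name `mtGlyphNameToUtf16Be` is hypothetical, and CP1252 is only assumed to be a close enough stand-in for WinAnsi. Unlike the callback above, it also rejects names whose suffix is not a number between 0 and 255.

```php
<?php
// Hypothetical helper, not part of SetaPDF: maps names such as "MT65"
// to the UTF-16BE bytes of the matching single-byte code.
function mtGlyphNameToUtf16Be(string $name): string
{
    // Accept only "MT" followed by one to three decimal digits.
    if (preg_match('/^MT(\d{1,3})$/', $name, $matches) !== 1) {
        return '';
    }

    $code = (int) $matches[1];
    if ($code > 255) {
        return ''; // not a single-byte code
    }

    // Assumption: CP1252 is used here in place of WinAnsi.
    // Bytes that are undefined in CP1252 make iconv() fail, so the
    // notice is suppressed and the failure is mapped to ''.
    $utf16 = @iconv('CP1252', 'UTF-16BE', chr($code));

    return $utf16 === false ? '' : $utf16;
}

echo bin2hex(mtGlyphNameToUtf16Be('MT65')), "\n";  // 0041 (A)
echo bin2hex(mtGlyphNameToUtf16Be('MT128')), "\n"; // 20ac (euro sign)
```

The stricter name check is a design choice of this sketch: a name like "MTabc" yields an empty string instead of being cast to code 0.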
The custom list entry is only used if the glyph name was not found in the Adobe Glyph List (AGL) and could not be evaluated as defined in the AGL specification. It is impossible to override an existing name with this technique.
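The lookup order described above can be pictured with a small model. This is an illustration only and not SetaPDF's actual implementation: `resolveGlyphName` and both lists are hypothetical, the standard list is a one-entry stand-in for the AGL, and the algorithmic names defined by the AGL specification are left out.

```php
<?php
// Illustration only: a simplified lookup order, not the library's code.
// $customList may be a callback or a name => UTF-16BE table.
function resolveGlyphName(string $name, array $standardList, $customList): string
{
    // 1. A name known to the standard list always wins.
    if (isset($standardList[$name])) {
        return $standardList[$name];
    }

    // 2. Only unknown names reach the custom list.
    if (is_callable($customList)) {
        return (string) $customList($name);
    }
    if (is_array($customList) && isset($customList[$name])) {
        return $customList[$name];
    }

    return '';
}

$standardList = ['A' => "\x00\x41"];                     // tiny stand-in for the AGL
$customList = ['A' => "\x00\x5A", 'MT66' => "\x00\x42"]; // tries to remap "A"

echo bin2hex(resolveGlyphName('A', $standardList, $customList)), "\n";    // 0041
echo bin2hex(resolveGlyphName('MT66', $standardList, $customList)), "\n"; // 0042
```

In this model the custom entry for "A" is never consulted, which mirrors why an existing name cannot be redefined through the custom list.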
If you encounter any PDF document that returns gibberish, feel free to send it to support@setasign.com so we may have a chance to analyze it!