Fonts and Encodings

Using fonts and understanding the text encoding in SetaPDF

Introduction

The PDF format supports a wide range of fonts and encodings.

The SetaPDF-Core component handles them all transparently in the background for you. The default input encoding in all SetaPDF components is UTF-8. 

Fonts

A font is represented in SetaPDF as a class instance implementing the SetaPDF_Core_Font_FontInterface interface.

PDF Standard Fonts

The SetaPDF-Core component provides all 14 PDF standard fonts, each represented by its own class (for example SetaPDF_Core_Font_Standard_Helvetica).

A standard font has to be instantiated either with an explicit encoding or with the default encoding. It is also possible to define differences from a base encoding to build individual encodings:

PHP
// Replace "uacute" (code 250) with "Lslash" and "ucircumflex" (code 251) with "fi"
$font = \SetaPDF_Core_Font_Standard_Helvetica::create(
    $document,
    \SetaPDF_Core_Encoding::WIN_ANSI,
    array(250 => 'Lslash', 251 => 'fi')
);

The glyph names are defined in the Adobe Glyph List. This list is also included in the core component, which makes it possible to resolve a glyph's name with the following code:

PHP
$name = \SetaPDF_Core_Font_Glyph_List::byCode(chr(251), \SetaPDF_Core_Encoding::WIN_ANSI); // ucircumflex
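
The effect of the differences array passed to create() above can be pictured as an override of individual code-to-glyph-name assignments in the base encoding. The following is only a conceptual sketch in plain PHP, not SetaPDF code: the $base array is a hand-written three-entry excerpt of WinAnsiEncoding, and the variable names are chosen for illustration.

```php
<?php
// Illustrative excerpt of WinAnsiEncoding: byte code => glyph name.
$base = [249 => 'ugrave', 250 => 'uacute', 251 => 'ucircumflex'];

// The same differences as in the create() call above.
$differences = [250 => 'Lslash', 251 => 'fi'];

// array_replace() preserves the integer keys and lets the later array win.
$effective = array_replace($base, $differences);

print_r($effective); // 249 => ugrave, 250 => Lslash, 251 => fi
```

Code 249 keeps its base assignment, while codes 250 and 251 now select the replacement glyphs.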

TrueType Fonts

SetaPDF comes with a TrueType parser and subset engine which allows you to use any character from the Unicode range, as long as it is available in the given TrueType or OpenType (with TrueType outlines) font program.

A font subset has the advantage that it reduces the size of the embedded font program to a minimum, which results in very small PDF files.

A TrueType font subset can be created with the SetaPDF_Core_Font_TrueType_Subset class. With this font instance you can use up to 255 individual characters. The font program embedded in the resulting PDF document will automatically be subset to only the glyphs actually used.

PHP
$font = new \SetaPDF_Core_Font_TrueType_Subset($document, 'path/to/font/file.ttf');

So if your text uses no more than 255 different characters, this font class is sufficient.

If you need to use more than 255 different characters, you can use the SetaPDF_Core_Font_Type0_Subset class, which represents a Type0 font with a TrueType font program as its descendant font.

PHP
$font = new \SetaPDF_Core_Font_Type0_Subset($document, 'path/to/font/file.ttf');
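
Which of the two subset classes fits a given text depends on how many distinct characters it contains. A minimal sketch for checking this up front, assuming PHP 7.4 or later with mbstring and UTF-8 input; the helper countDistinctCharacters() is hypothetical and not part of SetaPDF, and it simply treats every Unicode code point as one character:

```php
<?php
// Hypothetical helper, not part of SetaPDF: counts the distinct
// code points in a UTF-8 string (requires mbstring, PHP >= 7.4).
function countDistinctCharacters(string $text): int
{
    if ($text === '') {
        return 0;
    }
    return count(array_unique(mb_str_split($text, 1, 'UTF-8')));
}

$text = 'Hello World';
$count = countDistinctCharacters($text);

// Up to 255 distinct characters fit the limit described for the TrueType subset class.
$class = $count <= 255
    ? 'SetaPDF_Core_Font_TrueType_Subset'
    : 'SetaPDF_Core_Font_Type0_Subset';

echo $count, ' distinct characters: ', $class, PHP_EOL;
// prints: 8 distinct characters: SetaPDF_Core_Font_TrueType_Subset
```

If the text is not known in advance, choosing the Type0 subset class from the start avoids having to count at all.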

If you need to embed the complete font program, you can still use the SetaPDF_Core_Font_TrueType font class. An instance of this class can be used like any other font type and has to be created by the static create() method.

This font class supports automated encoding by simply passing "auto" to the $diffEncoding parameter. This way the Differences entry will be built automatically from the glyphs used. A single font instance can use up to 255 different glyphs, which can also cover a wide range of text and languages:

PHP
$font = \SetaPDF_Core_Font_TrueType::create($document, 'path/to/font/file.ttf', 'WinAnsiEncoding', 'auto');

In general, you need legal permission to embed a font or a subset of it into a PDF document. Some fonts have a permission flag set which states that the font "[...]must not be modified, embedded or exchanged in any manner without first obtaining permission of the legal owner." If this flag is set, all font classes will throw a SetaPDF_Core_Font_Exception. If you have the permission, you can disable this exception by passing true to the $ignoreLicenseRestrictions parameter of the respective method.

Please note that none of the font classes currently supports scripts and languages which need pre-processing such as glyph substitution or glyph reordering (for example Arabic or Hebrew).

Encodings

The PDF format defines two encodings for the internal representation of strings: PDFDocEncoding and UTF-16BE. These encodings only affect strings at their lowest level. They are used, for example, for metadata such as the author or creator. Form field values, for instance, are also stored in one of these encodings.
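
To make the UTF-16BE case concrete: a PDF text string stored in this encoding starts with the byte order mark FE FF, followed by the UTF-16BE code units. The following sketch uses only mbstring; the helper toUtf16BeTextString() is hypothetical and only illustrates the byte layout, since SetaPDF performs this conversion for you.

```php
<?php
// Hypothetical helper, not part of SetaPDF: encodes a UTF-8 string as a
// UTF-16BE PDF text string, which starts with the byte order mark FE FF.
function toUtf16BeTextString(string $utf8): string
{
    return "\xFE\xFF" . mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');
}

// "Ł" is U+0141, so the result is the BOM followed by the bytes 01 41.
echo strtoupper(bin2hex(toUtf16BeTextString('Ł'))), PHP_EOL; // FEFF0141
```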

The SetaPDF-Core component offers an encoding class which is a wrapper around mbstring (used by default) and iconv with support for PDF-specific encodings.

All components make use of this class, so that different encodings are handled seamlessly in the background. Any method that accepts a text string offers an encoding parameter with which you can define the input encoding if it differs from UTF-8.

Once fonts come into play, encoding issues become much more complex, because it is not guaranteed that a font covers a complete encoding scheme. This can result in replacement characters (?) or the "missing glyph" if a glyph is not available in the desired font.
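
One part of this problem can be checked before any font is involved: whether the input fits into a single-byte encoding at all. The sketch below lists the characters of a UTF-8 string that have no equivalent in Windows-1252, which is used here only as an approximation of WinAnsiEncoding. The helper findUnencodableCharacters() is hypothetical and not part of SetaPDF, and it assumes PHP 7.4 or later with mbstring. Whether the chosen font actually contains a glyph for each remaining character is a separate question that this check does not answer.

```php
<?php
// Hypothetical helper, not part of SetaPDF: returns the distinct characters
// of a UTF-8 string that have no Windows-1252 equivalent. Windows-1252 is
// used here as an approximation of WinAnsiEncoding.
function findUnencodableCharacters(string $utf8): array
{
    $missing = [];
    foreach (array_unique(mb_str_split($utf8, 1, 'UTF-8')) as $char) {
        // Characters without a mapping do not survive the round trip.
        $roundTrip = mb_convert_encoding(
            mb_convert_encoding($char, 'Windows-1252', 'UTF-8'),
            'UTF-8',
            'Windows-1252'
        );
        if ($roundTrip !== $char) {
            $missing[] = $char;
        }
    }
    return $missing;
}

// "Ł" and "ź" are not part of Windows-1252, while "ó" and "d" are.
print_r(findUnencodableCharacters('Łódź'));
```

A non-empty result suggests either defining a differences encoding for the missing glyph names or using one of the subset font classes described above.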