Fonts and Encodings

Using fonts and understanding the text encoding in SetaPDF

Introduction

The PDF format offers wide-ranging support for fonts and encodings.

The SetaPDF-Core component handles them all transparently in the background for you. The default input encoding in all SetaPDF components is UTF-8. 

Fonts

A font is represented in SetaPDF as an instance of a SetaPDF_Core_Font class. 

PDF Standard Fonts

The SetaPDF-Core component provides all 14 PDF standard fonts, each represented by its own class, such as SetaPDF_Core_Font_Standard_Helvetica.

Standard fonts currently have to be instantiated with an explicit encoding or with the default encoding. It is also possible to define differences from a base encoding to build custom encodings:

PHP
// Replace "uacute" by "Lslash" and "ucircumflex" with "fi"
$font = SetaPDF_Core_Font_Standard_Helvetica::create(
    $document,
    SetaPDF_Core_Encoding::WIN_ANSI,
    array(250 => 'Lslash', 251 => 'fi')
);

The glyph names are defined in the Adobe Glyph List. This list is also included in the core component, which makes it possible to resolve a glyph's name with the following code:

PHP
$name = SetaPDF_Core_Font_Glyph_List::byCode(chr(251), SetaPDF_Core_Encoding::WIN_ANSI); // ucircumflex

It is planned to implement a mechanism for standard fonts that adjusts the differences automatically if characters are used which are not covered by the base encoding. Currently the differences have to be defined manually.
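The effect of a differences array can be pictured as a plain lookup-table override. The following is a minimal sketch in plain PHP that does not use SetaPDF at all: the $baseEncoding excerpt and the applyDifferences() helper are illustrative assumptions, not part of the library.

```php
<?php
// Small excerpt of a "code => glyph name" table as used by WinAnsiEncoding.
// Only a few entries are listed here for illustration.
$baseEncoding = array(
    65  => 'A',
    250 => 'uacute',
    251 => 'ucircumflex',
    252 => 'udieresis',
);

/**
 * Apply a differences array to a base table.
 * Illustrative helper, not a SetaPDF function.
 */
function applyDifferences(array $base, array $differences)
{
    // array_replace() keeps the integer keys intact (array_merge() would renumber them)
    return array_replace($base, $differences);
}

$encoding = applyDifferences($baseEncoding, array(250 => 'Lslash', 251 => 'fi'));

print_r($encoding);
```

After the override, the codes 250 and 251 select "Lslash" and "fi", while all other codes keep the glyphs of the base encoding.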

TrueType Fonts

The Core component offers a parser for TrueType fonts, which is used by the SetaPDF_Core_Font_TrueType class. An instance of this class can be used like any standard font.

Using such a font object makes it possible to embed the full font program into the PDF file (subsetting is currently not supported).

Furthermore, this font class already supports automated encoding: simply pass "auto" to the $diffEncoding parameter. The Differences entry will then be built automatically from the glyphs that are actually used. A single font instance can address up to 255 different glyphs, which can cover a wide range of text and languages:

PHP
$font = SetaPDF_Core_Font_TrueType::create($document, 'path/to/font/file.ttf', 'WinAnsiEncoding', 'auto');

The create() method is defined as follows:

Description
static public SetaPDF_Core_Font_TrueType SetaPDF_Core_Font_TrueType::create ( SetaPDF_Core_Document $document, string $fontFile [, string $baseEncoding = \SetaPDF_Core_Encoding::WIN_ANSI [, array|string $diffEncoding = array ( ) [, boolean $embedded = true [, bool $forceLicenseRestrictions = false ]]]] )

Creates a font object based on a TrueType font file.

Parameters
$document : SetaPDF_Core_Document

The document instance in which the font will be used

$fontFile : string

A path to the TTF font file

$baseEncoding : string

The base encoding

$diffEncoding : array|string

A translation table to adjust individual char codes to different glyphs or "auto" to build this table dynamically.

$embedded : boolean

Defines if the font program will be embedded in the document or not

$forceLicenseRestrictions : bool

Can be used to disable the font license check

Return Values

The SetaPDF_Core_Font_TrueType instance

Exceptions

Throws SetaPDF_Core_Font_Exception
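The idea behind the "auto" value of $diffEncoding can be sketched as a table that hands out one-byte codes to glyph names on first use. This is only a conceptual illustration in plain PHP and not SetaPDF's internal implementation: the class AutoDifferencesSketch, its methods, and the choice of codes 1 to 255 are assumptions made for this example.

```php
<?php
/**
 * Illustrative sketch only: assigns one-byte codes to glyph names on first use.
 * Not part of SetaPDF.
 */
class AutoDifferencesSketch
{
    /** Maximum number of glyphs per instance (assumption: codes 1..255 are used) */
    const MAX_GLYPHS = 255;

    /** @var array glyph name => assigned code */
    private $codes = array();

    /**
     * Return the code for a glyph name, assigning the next free code if needed.
     */
    public function codeFor($glyphName)
    {
        if (!isset($this->codes[$glyphName])) {
            if (count($this->codes) >= self::MAX_GLYPHS) {
                throw new OverflowException('No free code left in this font instance.');
            }
            // The next free code is simply the number of glyphs seen so far plus one
            $this->codes[$glyphName] = count($this->codes) + 1;
        }

        return $this->codes[$glyphName];
    }

    /**
     * Return the collected table as "code => glyph name".
     */
    public function getDifferences()
    {
        return array_flip($this->codes);
    }
}

$table = new AutoDifferencesSketch();
$table->codeFor('Lslash'); // first glyph gets code 1
$table->codeFor('fi');     // second glyph gets code 2
$table->codeFor('Lslash'); // already known, code 1 is reused

print_r($table->getDifferences());
```

Once all codes are taken, the sketch throws an exception. In practice a text that needs more glyphs would have to be split across several font instances.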

Encodings

The PDF format defines two encodings for the internal representation of strings: PDFDocEncoding and UTF-16BE. These encodings only affect strings at their lowest level. They are used, for example, for metadata such as author or creator. Form field values are also saved in one of these encodings.
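To make the UTF-16BE variant more tangible, the following sketch converts a UTF-8 string into the byte sequence of a PDF text string, which starts with the byte order mark FE FF. It uses only PHP's mbstring or iconv functions. The helper utf8ToPdfUtf16Be() is a hypothetical name for this example and not a SetaPDF method.

```php
<?php
/**
 * Convert a UTF-8 string to a UTF-16BE PDF text string including the byte order mark.
 * Illustrative helper, not part of SetaPDF.
 */
function utf8ToPdfUtf16Be($utf8)
{
    // Prefer mbstring and fall back to iconv if it is not available
    $utf16 = function_exists('mb_convert_encoding')
        ? mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8')
        : iconv('UTF-8', 'UTF-16BE', $utf8);

    // UTF-16BE text strings in PDF are prefixed with the byte order mark FE FF
    return "\xFE\xFF" . $utf16;
}

// "A" followed by "é" (written as UTF-8 bytes to be independent of the file encoding)
echo bin2hex(utf8ToPdfUtf16Be("A\xC3\xA9")); // feff004100e9
```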

The SetaPDF-Core component offers an encoding class which is a wrapper around mbstring (used by default) and iconv with support for PDF specific encodings.  

All components make use of this class, so different encodings are handled seamlessly in the background. Any method that accepts a text string offers an encoding parameter with which you can define the input encoding if it differs from UTF-8.

When fonts come into play, encoding issues become much more complex, because a font is not guaranteed to cover a complete encoding scheme. This can result in replacement characters (?) if a glyph is not available in the desired font.
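How such replacement characters come about can be shown with a small sketch in plain PHP. The function replaceMissingGlyphs() and the list of available characters are hypothetical and serve only to illustrate the principle. A real font would be queried for its glyphs instead.

```php
<?php
/**
 * Replace every character for which the font has no glyph with "?".
 * $availableChars is a hypothetical list of characters the font can display.
 */
function replaceMissingGlyphs($utf8Text, array $availableChars)
{
    // Split into single UTF-8 characters (the "u" modifier enables UTF-8 mode)
    $chars = preg_split('//u', $utf8Text, -1, PREG_SPLIT_NO_EMPTY);
    $lookup = array_flip($availableChars);

    $result = '';
    foreach ($chars as $char) {
        $result .= isset($lookup[$char]) ? $char : '?';
    }

    return $result;
}

// A font that only offers glyphs for a handful of basic Latin letters
$available = array('Z', 'o', 't', 'y');

// "Z", "l with stroke" (UTF-8 bytes C5 82), "o", "t", "y"
echo replaceMissingGlyphs("Z\xC5\x82oty", $available); // Z?oty
```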