SUMMARY
Applications usually draw Unicode text by specifying character codes in a string of WORDs. When a TrueType font is used, the operating system converts these character codes into TrueType glyph indices when it draws the glyphs. An application may need to translate character codes to glyph indices. This article discusses, with sample source code, how to obtain glyph indices from TrueType font files.
MORE INFORMATION
Drawing a character onto a device in Windows involves mapping the character code to a font-dependent index of the character's graphic (a glyph). When a TrueType font is used, these indices are called glyph indices.
TrueType font files have a flexible, table-oriented file format. The flexibility of this file format supports a variety of character encoding (or mapping) from a character code to the respective glyph. One such encoding can be from Unicode to glyph indices.
The Microsoft Windows operating systems use TrueType font files that are almost always Unicode-encoded. Even the TrueType fonts used by Windows 95 (and its predecessors back to Windows 3.1) use the Unicode standard as their internal encoding. See the References section of this article for more information on the Unicode standard and the TrueType font file.
The functions of the Win32 Application Programming Interface (API) are exported as two separate entry points. The first has the base function name with an appended "A"; the second has the base function name with an appended "W". These entry points support the character set (or ANSI) string encoding and the Unicode (or wide character) encoding, respectively. The Platform SDK header files define the base function name as the "A" or "W" variant, depending on the use of the UNICODE macro, but either variant can be referenced directly.
Most of the functions in the Win32 API for Windows 95 and later versions support only the ANSI versions of the functions. In addition to the ANSI API, there is also support for a limited subset of Unicode-capable functions such as TextOutW, lstrlenW, and so on. See the References section for more information on Unicode support in Windows 95 and its successors.
This set of Unicode or wide-character functions supports writing a limited Unicode-capable application. Just as an ANSI application may need to use glyph indices, a Unicode-capable application may need access to them as well. Such a case occurs when the application uses some advanced features of TrueType font files, if the application needs the glyph definitions, or if it needs to implement a workaround or functionality not otherwise present in the operating system.
ANSI applications can use the GetCharacterPlacementA function to translate a byte string of character codes to glyph indices. If the application uses a Unicode string encoding and it runs on Windows NT, there is a wide-character version of the GetCharacterPlacement function. On Windows 95, the GetCharacterPlacementW function is not implemented, so there is no API to convert from Unicode to glyph indices.
To convert a Unicode character code to a TrueType glyph index, an application must retrieve the table data for the Unicode-to-glyph-index encoding. The data in a TrueType font file can be obtained by calling the GetFontData Win32 function. This function returns an unprocessed byte buffer, but it can extract this buffer relative to the start of a named TrueType table. The function also operates on the TrueType font file currently realized in the Device Context (DC). These features make the function more useful than the alternative of locating the font file and parsing its table directory to seek to the appropriate table.
A Unicode-to-glyph-index encoding is located in a table of a TrueType font file that is marked as "cmap", which is the tag name for the table containing the character mappings for the font file. This table may contain one or more subtables that are different mappings.
Immediately following the initial unsigned short values of the "cmap" table is a directory of each encoding that the TrueType font file contains. Per the TrueType specification, the Unicode encoding is located in the subtable marked with a PlatformId value of 3 and a SpecificId value of 1. The specification also defines the subtable referenced by this 3-1 encoding to be a Format 4 subtable. An encoding with a PlatformId of 3 and a SpecificId of 0 (zero) is also a Format 4 encoding, but the font file is usually interpreted as a symbol font file. Per the symbol font suggestion in the TrueType specification, one would expect this encoding to contain character codes from the Unicode Private Use Area.
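The directory scan described above can be sketched in portable C. This is a minimal illustration, not the article's sample code: it assumes the raw, still big-endian "cmap" table is already in memory, and the names be16, be32, and find_cmap_subtable are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Read big-endian values from a raw byte buffer. */
static uint16_t be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical helper: scan the "cmap" encoding directory for a record
   with the given platform and encoding IDs. Returns the subtable offset
   (relative to the start of the "cmap" table), or 0 if the record is
   not found or the buffer is too short. */
uint32_t find_cmap_subtable(const uint8_t *cmap, size_t size,
                            uint16_t platformId, uint16_t encodingId)
{
    size_t i, count;

    if (size < 4)
        return 0;                    /* no room for the two-USHORT header */
    count = be16(cmap + 2);          /* number of encoding records */
    if (size < 4 + count * 8)
        return 0;                    /* directory is truncated */

    for (i = 0; i < count; i++) {
        const uint8_t *rec = cmap + 4 + i * 8;   /* 8 bytes per record */
        if (be16(rec) == platformId && be16(rec + 2) == encodingId)
            return be32(rec + 4);
    }
    return 0;
}
```

A caller would typically try find_cmap_subtable(cmap, size, 3, 1) first and fall back to the 3-0 encoding if the result is zero.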
A Format 4 subtable is a sparse array. To accommodate the 64K entries needed by the 16-bit Unicode standard, the Format 4 subtable collects neighboring sequences of characters into segments. Each segment is defined by the first and last character code values covered by that range of characters. The collection of segments is stored in the subtable as a set of parallel arrays: one for the starting character of each range (startCount) and one for the ending character of each range (endCount). The segment arrays classify a character code's potential mapping; they are not the actual mapping.
The mapping to a glyph index requires the use of a third parallel array from the subtable called idRangeOffset. Its value determines which of two methods is used to compute the final glyph index. The first method uses a simple delta value to compute the glyph index from the character code. The second method uses a lookup into an intermediate glyph ID table. If the value in this table is zero, there is no glyph; otherwise, the value is used to compute the final glyph index. The lookup table is an efficient way of representing a collection of noncontiguous character codes that span a segment.
The balance of this article covers how to obtain TrueType glyph indices from a TrueType font file. The following assumptions have been made: the TrueType font is currently selected into the application's DC, the current string encoding is Unicode, and the TrueType font file has a Unicode character-map table ("cmap" Format 4 subtable).
NOTE: This technique for obtaining glyph indices is applicable to Windows NT; however, because a Unicode version of the GetCharacterPlacement function has been implemented, it is typically unnecessary. If a Windows 95 application's strings are not Unicode, it can call the GetCharacterPlacementA function to convert to glyph indices instead.
The complete source code is at the end of this article. Within the text of the article, numerous references to the source code are made. Please refer to the sample source code to examine the relevant references in their full context.
Odds and Ends
To work with the data in a TrueType font file, a number of data-type problems must be well understood. In the TrueType specification, all tables in the font file are defined as a collection of base data types. The base data types are also defined by the specification but they correspond well to some base data types defined by the Windows Platform SDK.
The TrueType font file is byte-packed, meaning that all data types sit on a byte offset within the file. Specific padding bytes are included in any table definitions where applicable to the specification. Most compilers allow and sometimes default to structure alignment on other than byte boundaries. This means that defining "C" language structures to mimic the definition of a table may not be compatible.
This sample code uses structure definitions where possible to represent tables. To work correctly, the code must be compiled with byte structure alignment such as is ensured by the sample code's use of the pack pragma.
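The effect of byte packing can be checked at compile time. The following is a small sketch, assuming a compiler that supports the push/pop form of the pack pragma (MSVC, GCC, and Clang do) and C11 _Static_assert; the structure names are hypothetical and use fixed-width types in place of the Win32 ones.

```c
#include <stddef.h>
#include <stdint.h>

#pragma pack(push, 1)          /* byte alignment, as in the font file */
typedef struct {
    uint16_t platformId;
    uint16_t encodingId;
    uint32_t offset;
} PackedEncoding;              /* hypothetical; mirrors CMAPENCODING */

typedef struct {
    uint8_t  tag;
    uint32_t value;            /* most compilers would pad this to offset 4 */
} PackedProbe;                 /* hypothetical; exists only to show the effect */
#pragma pack(pop)

/* Compile-time checks: the build fails if packing is not in effect. */
_Static_assert(sizeof(PackedEncoding) == 8, "encoding record must be 8 bytes");
_Static_assert(sizeof(PackedProbe) == 5, "probe must be byte-packed");
_Static_assert(offsetof(PackedProbe, value) == 1, "no padding after tag");
```

Using push and pop restores the previous alignment afterward, so the packing does not leak into unrelated structure definitions.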
Even when byte packing is guaranteed, just reading the tables into a structure is incorrect. TrueType font files use "Big Endian" or Motorola-style byte ordering while the Intel microprocessors use the "Little Endian" byte ordering. This means that all data larger than a byte taken from a TrueType font file must have the bytes swapped. The swapping of bytes makes the data compatible with Intel microprocessors. The SWAPWORD and SWAPLONG macro definitions provide a facility for doing this.
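As an illustration of the byte swapping, here is a portable sketch that does not depend on the Win32 MAKEWORD and MAKELONG macros. The function names are hypothetical. The read_be16 variant assembles the value from individual bytes, so it gives the same result on either byte ordering.

```c
#include <stdint.h>

/* Swap the two bytes of a 16-bit value. */
uint16_t swap16(uint16_t x)
{
    return (uint16_t)((x >> 8) | (x << 8));
}

/* Swap the four bytes of a 32-bit value: swap each half,
   then exchange the halves. */
uint32_t swap32(uint32_t x)
{
    return ((uint32_t)swap16((uint16_t)(x & 0xFFFFu)) << 16) |
            (uint32_t)swap16((uint16_t)(x >> 16));
}

/* Read a big-endian 16-bit value from raw bytes. This works on any
   host because it never reinterprets memory as a wider type. */
uint16_t read_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}
```

For example, swap16 turns 0x1234 into 0x3412, and swap32 turns 0x11223344 into 0x44332211.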
The Definitions
The "cmap" table of a TrueType font file consists of subtables, each of which defines a different encoding. To locate the subtables, the sample code defines a macro called CMAPHEADERSIZE, which is convenient for calculating the offset to the beginning of the subtable directory. This macro returns the size of the two unsigned short data types used to store the "cmap" table version and the number of subtables in the "cmap" table:
/* CMAP table Data
From the TrueType Spec revision 1.66
USHORT Table Version #
USHORT Number of encoding tables
*/
#define CMAPHEADERSIZE (sizeof(USHORT)*2)
Each encoding subtable has a directory entry in the main "cmap" table. This is represented in the source code by the structure definition _CMapEncoding. This structure contains two ID fields used to distinguish each subtable and the offset from the start of the "cmap" table where the subtable is located:
typedef struct _CMapEncoding
{
    USHORT PlatformId;
    USHORT EncodingId;
    ULONG Offset;
} CMAPENCODING;
The GetFontData function in the Win32 API takes a single DWORD argument as the table name. This works because the TrueType specification defines a table name as a four-byte tag sequence. To properly pack the table name into the DWORD parameter of the GetFontData function call, the sample code defines a macro called MAKETABLENAME. The macro works by sequentially shifting the four individual byte values of the table name tag into a DWORD data type:
// Macro to pack a TrueType table name into a DWORD.
#define MAKETABLENAME(ch1, ch2, ch3, ch4) (\
    (((DWORD)(ch4)) << 24) | \
    (((DWORD)(ch3)) << 16) | \
    (((DWORD)(ch2)) << 8) | \
    ((DWORD)(ch1)) \
    )
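To make the packing order concrete, the same shifts can be written as a portable function. This is a sketch with a hypothetical name, using uint32_t in place of DWORD. The first tag character lands in the least significant byte, so on a little-endian processor the four bytes sit in memory in the same order as they appear in the font file.

```c
#include <stdint.h>

/* Pack a four-character table tag with ch1 in the low byte. */
uint32_t make_table_name(char ch1, char ch2, char ch3, char ch4)
{
    return ((uint32_t)(uint8_t)ch4 << 24) |
           ((uint32_t)(uint8_t)ch3 << 16) |
           ((uint32_t)(uint8_t)ch2 << 8)  |
            (uint32_t)(uint8_t)ch1;
}
```

With the ASCII values 'c' = 0x63, 'm' = 0x6D, 'a' = 0x61, and 'p' = 0x70, make_table_name('c', 'm', 'a', 'p') evaluates to 0x70616D63.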
The Unicode-encoding subtable is marked by a PlatformId of 3 and a SpecificId of 1. This "3-1" encoding is a Format 4 subtable according to the TrueType specification. Defined in the source code is a structure, _CMap4, which corresponds to the first seven data types of the Format 4 subtable plus a symbolic array of one unsigned short. The array represents the balance of the subtable definition that consists of multiple unsigned short arrays. By including the array symbol in the structure definition, a convenient address to the start of the unsigned short arrays is defined. The array symbol can then be used to compute the other array start addresses by using their offsets from the first array. This is useful when casting a larger memory buffer containing a full subtable to dereference one of the unsigned short arrays:
typedef struct _CMap4 // From the TrueType Spec. revision 1.66.
{
    USHORT format; // Format number is set to 4.
    USHORT length; // Length in bytes.
    USHORT version; // Version number (starts at 0).
    USHORT segCountX2; // 2 x segCount
    USHORT searchRange; // 2 x (2**floor(log2(segCount)))
    USHORT entrySelector; // log2(searchRange/2)
    USHORT rangeShift; // 2 x segCount - searchRange
    USHORT Arrays[1]; // Placeholder symbol for address of arrays following.
} CMAP4, *LPCMAP4;
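The formulas in the structure comments, and the offsets of the arrays that follow the header, can be worked through in a short sketch. The names below are hypothetical. The offsets assume the Format 4 layout of seven header USHORTs, the endCount array, one reserved USHORT, and then the startCount, idDelta, and idRangeOffset arrays, with the glyph ID array last.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t segCountX2, searchRange, entrySelector, rangeShift;
} SearchFields;                      /* hypothetical helper type */

/* Derive the binary-search header fields from the segment count. */
SearchFields format4_search_fields(uint16_t segCount)
{
    SearchFields f = { 0, 0, 0, 0 };
    uint16_t pow2 = 1, log2v = 0;

    if (segCount == 0)
        return f;
    while ((uint32_t)pow2 * 2 <= segCount) {   /* largest power of 2 <= segCount */
        pow2 = (uint16_t)(pow2 * 2);
        log2v++;
    }
    f.segCountX2    = (uint16_t)(segCount * 2);
    f.searchRange   = (uint16_t)(2 * pow2);
    f.entrySelector = log2v;
    f.rangeShift    = (uint16_t)(f.segCountX2 - f.searchRange);
    return f;
}

/* Byte offsets of the arrays from the start of the subtable. */
size_t end_count_offset(void)                   { return 7 * 2; }
size_t start_count_offset(uint16_t segCount)    { return 8 * 2 + (size_t)segCount * 2; }
size_t id_delta_offset(uint16_t segCount)       { return 8 * 2 + (size_t)segCount * 4; }
size_t id_range_offset_offset(uint16_t segCount){ return 8 * 2 + (size_t)segCount * 6; }
size_t glyph_id_array_offset(uint16_t segCount) { return 8 * 2 + (size_t)segCount * 8; }
```

For a subtable with 39 segments, this gives segCountX2 = 78, searchRange = 64, entrySelector = 5, and rangeShift = 14.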
The Process
The sample code implements two basic tasks: retrieval of a Unicode "cmap" subtable from the font file and a search of the subtable to find a TrueType glyph index for a Unicode character code.
To retrieve table data from a TrueType font file given a DC, the code uses the GetFontData function. This function requires that the TrueType font file be selected into the DC. To retrieve table data indexed by a named TrueType table, the four-byte table tag name must be packed into a DWORD. Because we are only interested in getting data from the "cmap" table, the sample code defines a global DWORD, dwCmapName, that is packed with the "cmap" tag. All calls to the GetFontData function are coded to use the global dwCmapName variable.
The GetTTUnicodeCoverage function is the source code to retrieve the Unicode "cmap" subtable. It is declared as:
BOOL GetTTUnicodeCoverage (
HDC hdc, // DC with TT font.
LPCMAP4 pBuffer, // Properly allocated buffer.
DWORD cbSize, // Size of properly allocated buffer.
DWORD *pcbNeeded // Size of buffer needed.
)
This function retrieves the full Unicode subtable from the TrueType font's "cmap" table. If it is called with a buffer that is too small (for example, a size of zero) as declared by cbSize, or with NULL for the pBuffer parameter, it fails and returns FALSE. When it fails in this manner, it calculates and returns the size of the buffer needed in the pcbNeeded parameter. When the function succeeds, the pBuffer parameter is filled and the number of bytes copied is placed in the pcbNeeded parameter.
This function first searches for a subtable that would contain a Unicode encoding, either a "3-1" encoding or a "3-0" encoding. These are Format 4 subtables as defined in the "cmap" data type chapter of the TrueType specification.
If the function finds a Unicode encoding, it retrieves the first seven elements of the Format 4 subtable by using the GetFontFormat4Header function. The code then calculates the size of the buffer that is needed to return the entire Unicode subtable. If the buffer is too small or not provided, the size is returned to the caller so that the caller can allocate an appropriately sized buffer and call the function again.
If the buffer supplied by the caller is large enough, the sample code then uses the GetFontFormat4Subtable function to retrieve the entire subtable. This function appropriately reorders the bytes to accommodate Intel microprocessors. If the subtable retrieval was successful, the result is copied to the caller's buffer. If it was not successful, the code has not modified the user's buffer, and can safely return failure. By setting the bytes-needed parameter to zero, the sample code can indicate that it did not copy bytes to the buffer and can distinguish this failure from that of a lack of buffer space.
Once the Unicode subtable has been obtained, it can be used to retrieve glyph indices for a character code or to implement a number of other useful functions.
Converting a Unicode character code to a glyph index is accomplished in the sample code's GetTTUnicodeGlyphIndex function:
USHORT GetTTUnicodeGlyphIndex (
HDC hdc, // DC with a TrueType font selected.
USHORT ch // Unicode character to convert to Index.
)
This function has a simpler interface that requires only hdc, the handle to a DC that contains the TrueType font, and ch, the Unicode character code to convert. The function returns the glyph index for ch when it is successful. If the Unicode character code is not located in the encoding (that is, there is no glyph), the missing glyph-index value of zero is returned.
The function first retrieves the Unicode subtable by allocating a buffer and calling the GetTTUnicodeCoverage function. If an error occurs, the sample code fails the call by returning the missing glyph index. At this point, failure can mean that either the DC does not contain a TrueType font, or the TrueType font does not contain a suitable Unicode subtable.
Next, the code attempts to locate the Unicode character code in the encoding. The search is performed per the Format 4 subtable reference in the "cmap" data type chapter of the TrueType specification. The code segments from the subtable are linearly searched by the FindFormat4Segment function. If no code segment brackets this character code, then the font file does not contain an encoding and therefore there is no glyph. The code then returns the missing glyph index.
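A linear segment search of this kind might look like the following sketch. It is not the article's FindFormat4Segment function, whose body is not shown in this part of the text. The name find_segment is hypothetical, and the arrays are assumed to be already byte-swapped and sorted by endCount as the specification requires.

```c
#include <stdint.h>

/* Return the index of the segment that brackets ch, or segCount if
   no segment does. Because segments are sorted by endCount, the first
   segment whose endCount is >= ch is the only candidate. */
uint16_t find_segment(const uint16_t *startCount, const uint16_t *endCount,
                      uint16_t segCount, uint16_t ch)
{
    uint16_t i;

    for (i = 0; i < segCount; i++) {
        if (endCount[i] >= ch)
            return (startCount[i] <= ch) ? i : segCount;
    }
    return segCount;                 /* ch is beyond the last segment */
}
```

Returning segCount for "not found" keeps the return type unsigned; the caller then returns the missing glyph index. The searchRange, entrySelector, and rangeShift header fields exist to support a binary search instead, which matters for fonts with many segments.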
The lookup of the glyph index occurs in the last half of the GetTTUnicodeGlyphIndex function. There are two methods of looking up the glyph index for a particular character code. Both cases use the idRangeOffset array by examining the value at the ordinal index of the segment in which the character code was found.
In the first case, if the value located at the segment index of the idRangeOffset array is zero, the code dereferences the idDelta array with the same array index and converts to the glyph index using modulo arithmetic:
// Per TT spec, if the RangeOffset is zero,
if ( idRangeOffset[iSegment] == 0)
{
    // calculate the glyph index directly.
    GlyphIndex = (idDelta[iSegment] + ch) % 65536;
}
else
{
    ...
}
In the second case, the value located at the ordinal segment index is part of an index into a lookup table for glyph indices. Based upon the order and location of the subtables' arrays, an obscure indexing trick using the address of the value at the idRangeOffset element returns an intermediate ID value. The indexing mechanism is explained in the TrueType specification's Format 4 subtable chapter. If nonzero, this value is then added to the idDelta value and converted with modulo arithmetic; otherwise, there is no glyph and the missing glyph index is returned:
// Per TT spec, if the RangeOffset is zero,
if ( idRangeOffset[iSegment] == 0)
{
    ...
}
else
{
    // otherwise, use the glyph ID array to get the index.
    USHORT idResult; //Intermediate ID calc.
    idResult = *(
        idRangeOffset[iSegment]/2 +
        (ch - startCount[iSegment]) +
        &idRangeOffset[iSegment]
        ); // Indexing equation from TT spec.
    if (idResult)
        // Per TT spec, nonzero means there is a glyph.
        GlyphIndex = (idDelta[iSegment] + idResult) % 65536;
    else
        // Otherwise, return the missing glyph.
        GlyphIndex = 0;
}
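Both cases can be combined into one self-contained sketch that works on a Format 4 subtable held as an array of 16-bit words already swapped to host byte order. The name lookup_glyph is hypothetical. The index arithmetic is the specification's pointer expression rewritten as a word index from the start of the subtable, which also allows a bounds check against the length field (an addition not taken from the sample code).

```c
#include <stddef.h>
#include <stdint.h>

/* sub points to a host-order Format 4 subtable:
   words 0..6 are the header, endCount starts at word 7. */
uint16_t lookup_glyph(const uint16_t *sub, uint16_t ch)
{
    uint16_t segCount = (uint16_t)(sub[3] / 2);      /* segCountX2 / 2 */
    size_t   words    = sub[1] / 2u;                 /* length in words */
    const uint16_t *endCount      = sub + 7;
    const uint16_t *startCount    = sub + 8 + segCount;   /* skips reserved word */
    const uint16_t *idDelta       = sub + 8 + 2 * segCount;
    const uint16_t *idRangeOffset = sub + 8 + 3 * segCount;
    uint16_t i = 0, id;
    size_t idx;

    while (i < segCount && endCount[i] < ch)         /* linear segment search */
        i++;
    if (i == segCount || startCount[i] > ch)
        return 0;                                    /* no segment: missing glyph */

    if (idRangeOffset[i] == 0)
        return (uint16_t)(idDelta[i] + ch);          /* modulo 65536 via the cast */

    /* Same address as &idRangeOffset[i] + idRangeOffset[i]/2 + (ch - start). */
    idx = (size_t)8 + 3u * segCount + i
        + idRangeOffset[i] / 2u + (size_t)(ch - startCount[i]);
    if (idx >= words)
        return 0;                                    /* malformed subtable */
    id = sub[idx];
    return id ? (uint16_t)(idDelta[i] + id) : 0;
}
```

The cast to a 16-bit type performs the same modulo 65536 reduction as the explicit % 65536 in the sample code.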
Some other useful functions can be derived from the decoding of the Unicode subtable. For instance, this sample code implements a function called:
USHORT GetTTUnicodeCharCount (
HDC hdc
)
This function adds up the characters covered by each segment in the Format 4 subtable to find the total number of Unicode character codes represented in the TrueType font file. Note, however, that this function must test each individual character code for a mapping if the segment in question uses the glyph ID array rather than being contiguous. It is also instructive to note that the count of Unicode character codes that are mapped to a glyph is not necessarily equivalent to the number of glyphs contained in the font file. There may be fewer glyphs if the Unicode encoding maps multiple character codes to the same glyph. There may also be more glyphs in the font file than the mapping suggests. For example, TrueType Open (now called OpenType Layout) tables define glyph index substitutions to multiple, alternative glyphs.
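The counting approach can be sketched as follows, again on a host-order subtable held as 16-bit words. This is an illustration under stated assumptions rather than the sample's GetTTUnicodeCharCount: the name count_mapped_chars is hypothetical, the terminating 0xFFFF segment is skipped as a sentinel, and the result is a 32-bit value so that a fully populated font cannot overflow the count.

```c
#include <stddef.h>
#include <stdint.h>

uint32_t count_mapped_chars(const uint16_t *sub)
{
    uint16_t segCount = (uint16_t)(sub[3] / 2);
    size_t   words    = sub[1] / 2u;                 /* length in words */
    const uint16_t *endCount      = sub + 7;
    const uint16_t *startCount    = sub + 8 + segCount;
    const uint16_t *idRangeOffset = sub + 8 + 3 * segCount;
    uint32_t total = 0, ch;
    uint16_t i;

    for (i = 0; i < segCount; i++) {
        if (startCount[i] == 0xFFFF || startCount[i] > endCount[i])
            continue;                /* sentinel (assumption) or malformed */

        if (idRangeOffset[i] == 0) {
            /* Contiguous segment: every code in the range is counted. */
            total += (uint32_t)endCount[i] - startCount[i] + 1;
            continue;
        }
        /* Glyph ID array segment: test each code individually. */
        for (ch = startCount[i]; ch <= endCount[i]; ch++) {
            size_t idx = (size_t)8 + 3u * segCount + i
                       + idRangeOffset[i] / 2u + (ch - startCount[i]);
            if (idx < words && sub[idx] != 0)
                total++;
        }
    }
    return total;
}
```

The loop variable ch is 32 bits wide so that a segment ending at 0xFFFF cannot wrap around and loop forever.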
The GetTTUnicodeGlyphIndex function could also be used to implement a function to determine whether a given TrueType font contains a glyph for a given Unicode character code. Just call the GetTTUnicodeGlyphIndex function with the character code and test the return for equivalence with the missing glyph index (a value of zero).
A Word About Implementation
This sample code was written for clarity of explanation. It is not optimized for repeated use because it allocates and retrieves TrueType tables each time a public function is called. For real applications, a good optimization would be to cache the Unicode encoding for the TrueType font file as long as it remained in the DC. An application can compare to see whether the font selected into a DC is the same TrueType font file by caching and comparing the checksum value of the font file. This checksum is located in the Table Directory of the TrueType font file at the beginning of the file and can be retrieved by using the GetFontData function. See the TrueType specification's discussion of "The Table Directory" under the Data Types chapter to locate the checksum of a font file.
The Complete Source Code
#pragma pack(1) // for byte alignment
// We need byte alignment to be structure compatible with the
// contents of a TrueType font file
// Macros to swap from Big Endian to Little Endian
#define SWAPWORD(x) MAKEWORD( \
    HIBYTE(x), \
    LOBYTE(x) \
    )
#define SWAPLONG(x) MAKELONG( \
    SWAPWORD(HIWORD(x)), \
    SWAPWORD(LOWORD(x)) \
    )
typedef struct _CMap4 // From the TrueType Spec. revision 1.66
{
    USHORT format; // Format number is set to 4.
    USHORT length; // Length in bytes.
    USHORT version; // Version number (starts at 0).
    USHORT segCountX2; // 2 x segCount.
    USHORT searchRange; // 2 x (2**floor(log2(segCount)))
    USHORT entrySelector; // log2(searchRange/2)
    USHORT rangeShift; // 2 x segCount - searchRange
    USHORT Arrays[1]; // Placeholder symbol for address of arrays following
} CMAP4, *LPCMAP4;
/* CMAP table Data
From the TrueType Spec revision 1.66
USHORT Table Version #
USHORT Number of encoding tables
*/
#define CMAPHEADERSIZE (sizeof(USHORT)*2)
/* ENCODING entry Data aka CMAPENCODING
From the TrueType Spec revision 1.66
USHORT Platform Id
USHORT Platform Specific Encoding Id
ULONG Byte Offset from beginning of table
*/
#define ENCODINGSIZE (sizeof(USHORT)*2 + sizeof(ULONG))
typedef struct _CMapEncoding
{
USHORT PlatformId;
USHORT EncodingId;
ULONG Offset;
} CMAPENCODING;
// Macro to pack a TrueType table name into a DWORD
#define MAKETABLENAME(ch1, ch2, ch3, ch4) (\
(((DWORD)(ch4)) << 24) | \
(((DWORD)(ch3)) << 16) | \
(((DWORD)(ch2)) << 8) | \
((DWORD)(ch1)) \
)
/* public functions */
USHORT GetTTUnicodeGlyphIndex(HDC hdc, USHORT ch);
USHORT GetTTUnicodeCharCount(HDC hdc);
// DWORD packed four letter table name for each GetFontData()
// function call when working with the CMAP TrueType table
DWORD dwCmapName = MAKETABLENAME( 'c','m','a','p' );
USHORT *GetEndCountArray(LPBYTE pBuff)
{
return (USHORT *)(pBuff + 7 * sizeof(USHORT)); // Per TT spec
}
USHORT *GetStartCountArray(LPBYTE pBuff)
{
DWORD segCount = ((LPCMAP4)pBuff)->segCountX2/2;
return (USHORT *)( pBuff +
8 * sizeof(USHORT) + // 7 header + 1 reserved USHORT
segCount*sizeof(USHORT) ); // Per TT spec
}
USHORT *GetIdDeltaArray(LPBYTE pBuff)
{
DWORD segCount = ((LPCMAP4)pBuff)->segCountX2/2;
return (USHORT *)( pBuff +
8 * sizeof(USHORT) + // 7 header + 1 reserved USHORT
segCount * 2 * sizeof(USHORT) ); // Per TT spec
}
USHORT *GetIdRangeOffsetArray(LPBYTE pBuff)
{
DWORD segCount = ((LPCMAP4)pBuff)->segCountX2/2;
return (USHORT *)( pBuff +
8 * sizeof(USHORT) + // 7 header + 1 reserved USHORT
segCount * 3 * sizeof(USHORT) ); // Per TT spec
}
void SwapArrays( LPCMAP4 pFormat4 )
{
DWORD segCount = pFormat4->segCountX2/2; // Per TT Spec
DWORD i;
USHORT *pGlyphId,
*pEndOfBuffer,
*pstartCount = GetStartCountArray( (LPBYTE)pFormat4 ),
*pidDelta = GetIdDeltaArray( (LPBYTE)pFormat4 ),
*pidRangeOffset = GetIdRangeOffsetArray( (LPBYTE)pFormat4 ),
*pendCount = GetEndCountArray( (LPBYTE)pFormat4 );
// Swap the array elements for Intel.
for (i=0; i < segCount; i++)
{
pendCount[i] = SWAPWORD(pendCount[i]);
pstartCount[i] = SWAPWORD(pstartCount[i]);
pidDelta[i] = SWAPWORD(pidDelta[i]);
pidRangeOffset[i] = SWAPWORD(pidRangeOffset[i]);
}
// Swap the Glyph Id array
pGlyphId = pidRangeOffset + segCount; // Per TT spec
pEndOfBuffer = (USHORT*)((LPBYTE)pFormat4 + pFormat4->length);
for (;pGlyphId < pEndOfBuffer; pGlyphId++)
{
*pGlyphId = SWAPWORD(*pGlyphId);
}
} /* end of function SwapArrays */
BOOL GetFontEncoding (
HDC hdc,
CMAPENCODING * pEncoding,
int iEncoding
)
/*
Note for this function to work correctly, structures must
have byte alignment.
*/
{
DWORD dwResult;
BOOL fSuccess = TRUE;
// Get the structure data from the TrueType font
dwResult = GetFontData (
hdc,
dwCmapName,
CMAPHEADERSIZE + ENCODINGSIZE*iEncoding,
pEncoding,
sizeof(CMAPENCODING) );
fSuccess = (dwResult == sizeof(CMAPENCODING));
// swap the Platform Id for Intel
pEncoding->PlatformId = SWAPWORD(pEncoding->PlatformId);
// swap the Specific Id for Intel
pEncoding->EncodingId = SWAPWORD(pEncoding->EncodingId);
// swap the subtable offset for Intel
pEncoding->Offset = SWAPLONG(pEncoding->Offset);
return fSuccess;
} /* end of function GetFontEncoding */
BOOL GetFontFormat4Header (
HDC hdc,
LPCMAP4 pFormat4,
DWORD dwOffset
)
/*
Note for this function to work correctly, structures must
have byte alignment.
*/
{
BOOL fSuccess = TRUE;
DWORD dwResult;
int i;
USHORT *pField;
// Loop and Alias a writeable pointer to the field of interest
pField = (USHORT *)pFormat4;
for (i=0; i < 7; i++)
{
// Get the field from the subtable
dwResult = GetFontData (
hdc,
dwCmapName,
dwOffset + sizeof(USHORT)*i,
pField,
sizeof(USHORT) );
// swap it to make it right for Intel.
*pField = SWAPWORD(*pField);
// move on to the next
pField++;
// accumulate our success
fSuccess = (dwResult == sizeof(USHORT)) && fSuccess;
}
return fSuccess;
} /* end of function GetFontFormat4Header */
BOOL GetFontFormat4Subtable (
HDC hdc, // DC with TrueType font
LPCMAP4 pFormat4Subtable, // destination buffer
DWORD dwOffset // Offset within font
)
{
DWORD dwResult;
USHORT length;
// Retrieve the header values in swapped order
if (!GetFontFormat4Header ( hdc,
pFormat4Subtable,
dwOffset ))
{
return FALSE;
}
// Get the rest of the table
length = pFormat4Subtable->length - (7 * sizeof(USHORT));
dwResult = GetFontData( hdc,
dwCmapName,
dwOffset + 7 * sizeof(USHORT), // pos of arrays
(LPBYTE)pFormat4Subtable->Arrays, // destination
length );
if ( dwResult != length)
{
// We really shouldn't ever get here
return FALSE;
}
// Swap the arrays
SwapArrays( pFormat4Subtable );
return TRUE;
}
USHORT GetFontFormat4CharCount (
LPCMAP4 pFormat4 // pointer to a valid Format4 subtable
)
{
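USHORT i,
*pendCount,
*pstartCount,
*idRangeOffset;
USHORT nGlyphs = 0;
if ( pFormat4 == NULL )
return 0;
// Get the array pointers only after the NULL check, because the
// helper functions dereference the subtable header.
pendCount = GetEndCountArray((LPBYTE) pFormat4);
pstartCount = GetStartCountArray((LPBYTE) pFormat4);
idRangeOffset = GetIdRangeOffsetArray( (LPBYTE) pFormat4 );
// Count the # of glyphs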
// by adding up the coverage of each segment
for (i=0; i < (pFormat4->segCountX2/2); i++)
{
if ( idRangeOffset[i] == 0)
{
// if per the TT spec, the idRangeOffset element is zero,
// all of the characters in this segment exist.
nGlyphs += pendCount[i] - pstartCount[i] +1;
}
else
{
// otherwise we have to test for glyph existence for
// each character in the segment.
USHORT idResult; //Intermediate id calc.
DWORD ch; // wider than USHORT so the loop ends when pendCount[i] is 0xFFFF
for (ch = pstartCount[i]; ch <= pendCount[i]; ch++)
{
// determine if a glyph exists
idResult = *(
idRangeOffset[i]/2 +
(ch - pstartCount[i]) +
&idRangeOffset[i]
); // indexing equation from TT spec
if (idResult != 0)
// Yep, count it.
nGlyphs++;
}
}
}
return nGlyphs;
} /* end of function GetFontFormat4CharCount */
BOOL GetTTUnicodeCoverage (
HDC hdc, // DC with TT font
LPCMAP4 pBuffer, // Properly allocated buffer
DWORD cbSize, // Size of properly allocated buffer
DWORD *pcbNeeded // size of buffer needed
)
/*
if cbSize is too small or zero, or if pBuffer is NULL the function
will fail and return the required buffer size in *pcbNeeded.
if another error occurs, the function will fail and *pcbNeeded will
be zero.
When the function succeeds, *pcbNeeded contains the number of bytes
copied to pBuffer.
*/
{
USHORT nEncodings; // # of encoding in the TT font
CMAPENCODING Encoding; // The current encoding
DWORD dwResult;
DWORD i,
iUnicode; // The Unicode encoding
CMAP4 Format4; // Unicode subtable format
LPCMAP4 pFormat4Subtable; // Working buffer for subtable
// Get the number of subtables in the CMAP table from the CMAP header
// The # of subtables is the second USHORT in the CMAP table, per the TT Spec.
dwResult = GetFontData ( hdc, dwCmapName, sizeof(USHORT), &nEncodings, sizeof(USHORT) );
nEncodings = SWAPWORD(nEncodings);
if ( dwResult != sizeof(USHORT) )
{
// Something is wrong, we probably got GDI_ERROR back
// Probably this means that the Device Context does not have
// a TrueType font selected into it.
*pcbNeeded = 0; // report the error as documented above
return FALSE;
}
// Get the encodings and look for a Unicode Encoding
iUnicode = nEncodings;
for (i=0; i < nEncodings; i++)
{
// Get the encoding entry for each encoding
if (!GetFontEncoding ( hdc, &Encoding, i ))
{
*pcbNeeded = 0;
return FALSE;
}
// Take note of the Unicode encoding.
//
// A Unicode encoding per the TrueType specification has a
// Platform Id of 3 and a Platform specific encoding id of 1
// Note that Symbol fonts are supposed to have a Platform Id of 3
// and a specific id of 0. If the TrueType spec. suggestions were
// followed then the Symbol font's Format 4 encoding could also
// be considered Unicode because the mapping would be in the
// Private Use Area of Unicode. We assume this here and allow
// Symbol fonts to be interpreted. If they do not contain a
// Format 4, we bail later. If they do not have a Unicode
// character mapping, we'll get wrong results.
// Code could infer from the coverage whether 3-0 fonts are
// Unicode or not by examining the segments for placement within
// the Private Use Area Subrange.
if (Encoding.PlatformId == 3 &&
(Encoding.EncodingId == 1 || Encoding.EncodingId == 0) )
{
iUnicode = i; // Set the index to the Unicode encoding
break; // stop here so Encoding still holds this entry; its Offset is used below
}
}
// index out of range means failure to find a Unicode mapping
if (iUnicode >= nEncodings)
{
// No Unicode encoding found.
*pcbNeeded = 0;
return FALSE;
}
// Get the header entries(first 7 USHORTs) for the Unicode encoding.
if ( !GetFontFormat4Header ( hdc, &Format4, Encoding.Offset ) )
{
*pcbNeeded = 0;
return FALSE;
}
// Check to see if we retrieved a Format 4 table
if ( Format4.format != 4 )
{
// Bad, subtable is not format 4, bail.
// This could happen if the font is corrupt
// It could also happen if there is a new font format we
// don't understand.
*pcbNeeded = 0;
return FALSE;
}
// Figure buffer size and tell caller if buffer too small
*pcbNeeded = Format4.length;
if (*pcbNeeded > cbSize || pBuffer == NULL)
{
// Either test indicates caller needs to know
// the buffer size and the parameters are not setup
// to continue.
return FALSE;
}
// allocate a full working buffer
pFormat4Subtable = (LPCMAP4)malloc ( Format4.length );
if ( pFormat4Subtable == NULL)
{
// Bad things happening if we can't allocate memory
*pcbNeeded = 0;
return FALSE;
}
// get the entire subtable
if (!GetFontFormat4Subtable ( hdc, pFormat4Subtable, Encoding.Offset ))
{
// Bad things happening if we can't read the subtable
free ( pFormat4Subtable ); // don't leak the working buffer
*pcbNeeded = 0;
return FALSE;
}
// Copy the retrieved table into the buffer
CopyMemory( pBuffer,
pFormat4Subtable,
pFormat4Subtable->length );
free ( pFormat4Subtable );
return TRUE;
} /* end of function GetTTUnicodeCoverage */
BOOL FindFormat4Segment (
LPCMAP4 pTable, // a valid Format4 subtable buffer
USHORT ch, // Unicode character to search for
USHORT *piSeg // out: index of segment containing ch
)
/*
if the Unicode character ch is not contained in one of the
segments the function returns FALSE.
if the Unicode character ch is found in a segment, the index
of the segment is placed in *piSeg and the function returns
TRUE.
*/
{
USHORT i,
segCount = pTable->segCountX2/2;
USHORT *pendCount = GetEndCountArray((LPBYTE) pTable);
USHORT *pstartCount = GetStartCountArray((LPBYTE) pTable);
// Find segment that could contain the Unicode character code
for (i=0; i < segCount && pendCount[i] < ch; i++);
// We looked in them all, ch not there
if (i >= segCount)
return FALSE;
// character code not within the range of the segment
if (pstartCount[i] > ch)
return FALSE;
// this segment contains the character code
*piSeg = i;
return TRUE;
} /* end of function FindFormat4Segment */
USHORT GetTTUnicodeCharCount (
HDC hdc
)
/*
Returns the number of Unicode character glyphs that
are in the TrueType font that is selected into the hdc.
*/
{
LPCMAP4 pUnicodeCMapTable;
USHORT cChar;
DWORD dwSize = 0; // stays zero if the size query fails early
// Get the Unicode CMAP table from the TT font
GetTTUnicodeCoverage( hdc, NULL, 0, &dwSize );
pUnicodeCMapTable = (LPCMAP4)malloc( dwSize );
if (!GetTTUnicodeCoverage( hdc, pUnicodeCMapTable, dwSize, &dwSize ))
{
// possibly no Unicode cmap, not a TT font selected,...
free( pUnicodeCMapTable );
return 0;
}
cChar = GetFontFormat4CharCount( pUnicodeCMapTable );
free( pUnicodeCMapTable );
return cChar;
} /* end of function GetTTUnicodeCharCount */
USHORT GetTTUnicodeGlyphIndex (
HDC hdc, // DC with a TrueType font selected
USHORT ch // Unicode character to convert to Index
)
/*
When the TrueType font contains a glyph for ch, the
function returns the glyph index for that character.
If an error occurs, or there is no glyph for ch, the
function will return the missing glyph index of zero.
*/
{
LPCMAP4 pUnicodeCMapTable;
DWORD dwSize = 0; // stays zero if the size query fails early
USHORT iSegment;
USHORT *idRangeOffset;
USHORT *idDelta;
USHORT *startCount;
USHORT GlyphIndex = 0; // Initialize to missing glyph
// How big a buffer do we need for Unicode CMAP?
GetTTUnicodeCoverage( hdc, NULL, 0, &dwSize );
pUnicodeCMapTable = (LPCMAP4)malloc( dwSize );
if (!GetTTUnicodeCoverage( hdc, pUnicodeCMapTable, dwSize, &dwSize ))
{
// Either no Unicode cmap, or some other error occurred
// like font in DC is not TT.
free( pUnicodeCMapTable );
return 0; // return missing glyph on error
}
// Find the cmap segment that has the character code.
if (!FindFormat4Segment( pUnicodeCMapTable, ch, &iSegment ))
{
free( pUnicodeCMapTable );
return 0; // ch not in cmap, return missing glyph
}
// Get pointers to the cmap data
idRangeOffset = GetIdRangeOffsetArray( (LPBYTE) pUnicodeCMapTable );
idDelta = GetIdDeltaArray( (LPBYTE) pUnicodeCMapTable );
startCount = GetStartCountArray( (LPBYTE) pUnicodeCMapTable );
// Per TT spec, if the RangeOffset is zero,
if ( idRangeOffset[iSegment] == 0)
{
// calculate the glyph index directly
GlyphIndex = (idDelta[iSegment] + ch) % 65536;
}
else
{
// otherwise, use the glyph id array to get the index
USHORT idResult; //Intermediate id calc.
idResult = *(
idRangeOffset[iSegment]/2 +
(ch - startCount[iSegment]) +
&idRangeOffset[iSegment]
); // indexing equation from TT spec
if (idResult)
// Per TT spec, nonzero means there is a glyph
GlyphIndex = (idDelta[iSegment] + idResult) % 65536;
else
// otherwise, return the missing glyph
GlyphIndex = 0;
}
free( pUnicodeCMapTable );
return GlyphIndex;
} /* end of function GetTTUnicodeGlyphIndex */
REFERENCES
For more information on the Unicode standard please see:
The Unicode Consortium. The Unicode Standard, Version 2.0. Reading, MA, Addison-Wesley Developers Press, 1996. ISBN 0-201-48345-9.
On the Internet: The Unicode Consortium (http://www.unicode.org)
For additional information, click the article number below to view the article in the Microsoft Knowledge Base:
210341 (http://support.microsoft.com/kb/210341/EN-US/) INFO: Unicode Support in Windows 95 and Windows 98
For more information on the TrueType specification please see:
Microsoft TrueType Specifications (http://www.microsoft.com/typography/tt/tt.htm)
Also available on the Microsoft Developer Network Library CD's under Specifications.