Text Data Types


 Product:MicroStation
 Version:All
 Environment:N\A
 Area:Annotations
 Subarea:N\A

The MicroStation API uses several different data types to represent string data, each with varying underlying data types and semantics. Inside your own applications, we recommend using MSWChar and/or WString, and only converting to other data types when absolutely necessary.

Encoding

The goal of this section is to introduce you to the importance of encodings, not to be a full treatment or history; for more information, you may want to consult other sources.

In the distant past, computers operated on English-only text, and 128 characters were sufficient to represent all characters. However, they soon needed to be able to convey data in many more languages, some with thousands of characters each. There are two predominant encodings: locale, and Unicode; additionally, each of these can be represented in multi-byte or fixed-byte formats.

At the end of the day, your 'string' is just a series of numbers to the computer. The computer doesn't inherently equate 0x41 with the capital Latin letter A, or 0xB1 with a plus/minus symbol; your application must correctly interpret this information. In order to do this, you must know the encoding being used, and sometimes you need even more information. For example, when dealing with a locale-encoded string, 0x8E could mean a Cyrillic capital Tshe, an Arabic Jeh, a Latin capital Z with caron, or many other characters, depending on the code page; in fact, for a multi-byte locale-encoded string, a given byte may even indicate that you must combine it with the following byte in order to determine the character. If 0x8E were Unicode-encoded, it would be an unprintable control character!
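To make the 0x8E example concrete, here is a minimal sketch of a lookup for that one byte. The helper name DecodeByte0x8E is hypothetical, and the code page numbers (1251, 1256, 1252) are not named in the text above; they are the Windows code pages in which 0x8E carries the three meanings listed.

```cpp
#include <cstdint>

// Hypothetical helper for illustration only: returns the Unicode code point
// that the single byte 0x8E stands for under a few Windows code pages, or 0
// when this sketch does not cover the requested code page.
inline std::uint32_t DecodeByte0x8E(unsigned codePage)
{
    switch (codePage)
    {
        case 1251: return 0x040B; // Cyrillic capital letter Tshe
        case 1256: return 0x0698; // Arabic letter Jeh
        case 1252: return 0x017D; // Latin capital letter Z with caron
        default:   return 0;      // not covered by this sketch
    }
}
```

The byte never changes; only the code page passed in decides which character comes back.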

One side note worth mentioning is that in Unicode and in the Windows code pages discussed here, the values 0x00-0x7F (0-127) represent the same characters (the ASCII set, which covers unaccented English text).
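One practical consequence is that a string made up only of values in the 0-127 range reads the same regardless of code page. A minimal sketch of such a check follows; the function name IsPureAscii is hypothetical and not part of any MicroStation API.

```cpp
#include <string>

// Hypothetical helper: returns true when every byte is in the range
// 0x00-0x7F, in which case the code page does not affect interpretation.
inline bool IsPureAscii(const std::string& bytes)
{
    for (unsigned char b : bytes)
    {
        if (b > 0x7F)
            return false; // this byte's meaning depends on the code page
    }
    return true;
}
```

A check like this does not remove the need to track a code page; it only tells you when a particular string happens not to depend on it.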

Locale-Encoding

* This encoding should be avoided in favor of the Unicode encoding.

This type of character encoding attempts to minimize the space needed to represent every character in all supported languages by tightly coupling the encoding to a separate numeric identifier called a code page. In the example above, a single number (0x8e) was attributed many meanings, dictated by code page. While this encoding minimizes the space needed to represent a string (with many Latin-based languages using only a single byte for every character), it requires you to track a code page at all times. A locale-encoded string without a code page is meaningless, and cannot be interpreted to mean anything. Unfortunately, what many functions (and developers) assume is that the locale of a string is that of the system: the Active Code Page (ACP). Windows keeps track of this code page at the system level, and it affects all running applications, and can only be changed with a reboot. This type of encoding can be represented by multi- or fixed-byte formats.
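The sketch below illustrates why the code page has to travel with the bytes. LocaleString, IsLeadByte932 and CountCharacters are hypothetical names, and the choice of code page 932 (Shift-JIS, whose lead bytes are 0x81-0x9F and 0xE0-0xFC) is an assumption made for illustration; every other code page is treated as single-byte here, which is a simplification.

```cpp
#include <cstddef>
#include <string>

// Hypothetical pairing of locale-encoded bytes with the code page needed
// to interpret them.
struct LocaleString
{
    std::string bytes;
    unsigned    codePage;
};

// True for bytes that start a two-byte character in code page 932.
inline bool IsLeadByte932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Counts characters, not bytes. Only code page 932 is handled as
// multi-byte; all other code pages are assumed single-byte in this sketch.
inline std::size_t CountCharacters(const LocaleString& s)
{
    if (s.codePage != 932)
        return s.bytes.size();

    std::size_t count = 0;
    for (std::size_t i = 0; i < s.bytes.size(); ++i)
    {
        const unsigned char b = static_cast<unsigned char>(s.bytes[i]);
        if (IsLeadByte932(b) && i + 1 < s.bytes.size())
            ++i; // consume the trail byte together with its lead byte
        ++count;
    }
    return count;
}
```

The same four bytes, 0x93 0xFA 0x96 0x7B, count as two characters under code page 932 and as four under code page 1252, so even a basic question like "how long is this string?" has no answer without the code page.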

Unicode-Encoding

* This is the preferred, unambiguous encoding method.

This type of character encoding reserves a unique number for every character in every language, thus there is no need to track a separate 'code page'. Unicode, in a way, is its own code page. This encoding is an industry standard, and is maintained by a consortium of companies and individuals. While technically there are more than 65,536 unique characters in the world (the number of values 2 bytes can hold), 2 bytes per character is enough to represent all of the characters in MicroStation's supported languages (those in the BMP, or Basic Multilingual Plane). When dealing with MicroStation, this type of encoding is only supported by fixed-byte formats (for those who want more detail, see information on UTF-16/UCS-2; because MicroStation only supports characters in the BMP, this is effectively a fixed-byte format, where each character uses a single 16-bit word).
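If you need to confirm that a UTF-16 string stays within the BMP, you can look for surrogate code units (0xD800-0xDFFF), which UTF-16 uses in pairs to encode characters outside the BMP. The sketch below uses the standard char16_t and std::u16string as stand-ins rather than MicroStation's own types, and the function names are hypothetical.

```cpp
#include <string>

// True when the 16-bit unit is half of a UTF-16 surrogate pair, meaning it
// does not represent a complete character on its own.
inline bool IsSurrogate(char16_t unit)
{
    return unit >= 0xD800 && unit <= 0xDFFF;
}

// Hypothetical helper: true when every character fits in one 16-bit unit,
// that is, when the whole string lies in the Basic Multilingual Plane.
inline bool IsBmpOnly(const std::u16string& text)
{
    for (char16_t unit : text)
    {
        if (IsSurrogate(unit))
            return false;
    }
    return true;
}
```

For strings that pass this check, the number of 16-bit units equals the number of code points, which is the sense in which the format is effectively fixed-byte.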

This external article explains Unicode and the various encoding methods in more detail: http://www.joelonsoftware.com/articles/Unicode.html

Data Types

char*, std::string

Encoding: Locale-only
Ordinal: Multi-byte-only

This data type only supports locale-encoded, multi-byte strings. Older APIs and data structures utilize this format, but going forward, it should never be used. When dealing with this format, you must always keep track of the associated code page.

MSWChar*, Bentley::WString (also wchar_t* and std::wstring)

Encoding: Unicode-only
Ordinal: Fixed-byte-only

This data type only supports Unicode-encoded, fixed-byte strings. New APIs and data structures utilize this format, and this should be the only format used going forward. When dealing with this format, you do not need to keep track of any separate information.

MSWideChar*

Encoding: Locale or Unicode
Ordinal: Fixed-byte-only

This data type is a MicroStation invention. It should never be used in your applications, except where absolutely necessary; there is active and ongoing work to deprecate and replace any APIs that utilize this data type. When dealing with this format, you must always keep track of the associated code page.

It is important to realize that although this data type is fixed-byte (and can be cast to wchar_t-based data types), it is generally inappropriate to do so. This format uses the same storage mechanism for both locale and Unicode encoding. If an MSWideChar* is Unicode-encoded, then it will function like an MSWChar*; however, if it is locale-encoded, it functions unlike anything else: it is similar to a multi-byte char*, but reserves two bytes for every character, regardless of whether the character needs them. It is also important to realize that it is wrong to provide an MSWideChar* to any Windows or C runtime functions.
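The following sketch models why such a cast misleads. It assumes, for illustration only, that a locale-encoded wide string built from single-byte text holds each code page value unchanged in its own 16-bit unit; WideUnit and WidenLocaleBytes are hypothetical stand-ins, not MicroStation types or functions.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a 16-bit storage unit; not the real MSWideChar.
typedef std::uint16_t WideUnit;

// Assumed model: each single-byte locale value is widened to 16 bits with
// no change of encoding, so the numbers are still code page values.
inline std::vector<WideUnit> WidenLocaleBytes(const std::string& bytes)
{
    std::vector<WideUnit> units;
    for (unsigned char b : bytes)
        units.push_back(b); // value preserved; only the storage grows
    return units;
}
```

Under this model, the code page 1252 byte 0x8E (Latin capital Z with caron) becomes the unit 0x008E. Read as Unicode, 0x008E is a control character rather than U+017D, so treating the buffer as a wchar_t string silently yields the wrong text.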

Converting Between Data Types

[Table: functions to convert from (left column) to (top row) in MicroStation V8i]

Note that the functions to convert between MSWChar and MSWideChar are not published; you will have to stub your own declarations in order to be able to use them. The Font-based methods have the advantage of knowing what code page to use (as any MSWideChar strings you provide should always be in the code page of the font).

To use the C functions, you will have to manually provide a code page, but can simply append the following declarations to your source code. These functions return the number of characters (not strictly bytes) inserted into the destination buffer.

int MSWideCharStringToMSWCharString (MSWChar* pOutString, UInt32 nOutChars, MSWideChar const * pInString, UInt32 codePage);
int MSWCharStringToMSWideCharString (MSWideChar* pOutString, UInt32 nOutChars, MSWChar const * pInString, UInt32 codePage);

To use the methods on the Font object, you will have to modify your delivered FontManager.h file to include the following declarations in the Font class:

MSCORE_EXPORT int MSWideCharStringToMSWCharString (MSWCharP outString, UInt32 nOutChars, UInt16 const* inString) const;
MSCORE_EXPORT int MSWCharStringToMSWideCharString (UInt16* outString, UInt32 nOutChars, MSWCharCP inString) const; 

 Original Author:Bentley Technical Support Group