

13.10.1 Character Processing Terminology


   This section introduces a few terms and explains a few concepts to help understand the character processing portions of this document.

   13.10.1.1 Character Set

   A finite set of different characters used for the representation, organization, or control of data. In this specification, the term “character set” is used without any relationship to code representation or associated encoding. Examples of character sets are the English alphabet, Kanji or sets of ideographic characters, corporate character sets (commonly used in Japan), and the characters needed to write certain European languages.

   13.10.1.2 Coded Character Set, or Code Set

   A set of unambiguous rules that establishes a character set and the one-to-one relationship between each character of the set and its bit representation or numeric value. In this specification, the term “code set” is used as an abbreviation for the term “coded character set.” Examples include ASCII, ISO 8859-1, JIS X0208 (which includes Roman characters, Japanese hiragana, Greek characters, Japanese kanji, etc.) and Unicode.

   13.10.1.3 Code Set Classifications

   Some language environments distinguish between byte-oriented and “wide characters.” The byte-oriented characters are encoded in one or more 8-bit bytes. A typical single-byte encoding is ASCII as used for western European languages like English. A typical multi-byte encoding which uses from one to three 8-bit bytes for each character is eucJP (Extended UNIX Code - Japan, packed format) as used for Japanese workstations.

   Wide characters are a fixed 16 or 32 bits long, and are used for languages like Chinese, Japanese, etc., where the number of combinations offered by 8 bits is insufficient and a fixed-width encoding is needed. A typical example is Unicode (a “universal” character set defined by the Unicode Consortium, which uses an encoding scheme identical to ISO 10646 UCS-2, or 2-byte Universal Character Set encoding). An extended encoding scheme for Unicode characters is UTF-16 (UCS Transformation Format, 16-bit representations).
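   The difference between the 2-byte UCS-2 form and the extended UTF-16 form can be seen by counting 16-bit code units. The following is a minimal Python sketch and is not part of this specification. The helper name utf16_code_units is hypothetical, and the sketch relies on Python's built-in utf-16-be codec. A character that fits in 16 bits occupies one code unit, while a character outside that range occupies two code units in UTF-16.

```python
def utf16_code_units(ch: str) -> int:
    """Return the number of 16-bit code units UTF-16 needs for one character.

    Hypothetical helper for illustration only.
    """
    # "utf-16-be" writes no byte order mark, so every 2 bytes is one code unit.
    return len(ch.encode("utf-16-be")) // 2


if __name__ == "__main__":
    for ch in ("A", "\u65e5", "\U0002000B"):
        print(f"U+{ord(ch):04X}: {utf16_code_units(ch)} code unit(s)")
```

   Code that assumes one 16-bit unit per character is therefore only safe for data restricted to the UCS-2 range.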

   The C language has data types char for byte-oriented characters and wchar_t for wide characters. The language definition for C states that the sizes for these characters are implementation-dependent. Some environments do not distinguish between byte-oriented and wide characters (e.g., Ada and Smalltalk). Here again, the size of a character is implementation-dependent. The following table illustrates code set classifications as used in this document.

   Table 13-3 Code Set Classification

   Orientation         Code Element Encoding   Code Set Examples                                         C Data Type
   byte-oriented       single-byte             ASCII, ISO 8859-1 (Latin-1), EBCDIC, ...                  char
   byte-oriented       multi-byte              UTF-8, eucJP, Shift-JIS, JIS, Big5, ...                   char[]
   non-byte-oriented   fixed-length            ISO 10646 UCS-2 (Unicode), ISO 10646 UCS-4, UTF-16, ...   wchar_t
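   The classifications in Table 13-3 can be observed by measuring how many bytes each character occupies in a given code set. The following is a minimal Python sketch under stated assumptions: the helper name encoded_sizes is hypothetical, and the codec names (ascii, utf-8, euc_jp, shift_jis, utf-16-be, utf-32-be) are Python's names for the corresponding code sets, not identifiers defined by this specification.

```python
def encoded_sizes(text: str, encoding: str) -> list:
    """Return the number of bytes each character of text occupies in encoding.

    Hypothetical helper for illustration only.
    """
    return [len(ch.encode(encoding)) for ch in text]


if __name__ == "__main__":
    sample = "A\u65e5"  # Latin capital A followed by one kanji
    for name in ("utf-8", "euc_jp", "shift_jis", "utf-16-be", "utf-32-be"):
        print(name, encoded_sizes(sample, name))
```

   In the multi-byte code sets the two characters occupy different numbers of bytes, while in the fixed-length encodings both characters occupy the same number of bytes.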

   13.10.1.4 Narrow and Wide Characters

   Some language environments distinguish between “narrow” and “wide” characters. Typically the narrow characters are considered to be 8 bits long and are used for western European languages like English, while the wide characters are 16 or 32 bits long and are used for languages like Chinese, Japanese, etc., where the number of combinations offered by 8 bits is insufficient. However, as noted above, there are common encoding schemes in which Asian characters are encoded using multi-byte code sets, and it is incorrect to assume that Asian characters are always encoded as “wide” characters.

   Within this specification, the general terms “narrow character” and “wide character” are only used in discussing OMG IDL.

   13.10.1.5 Char Data and Wchar Data

   The phrase “char data” in this specification refers to data whose IDL types have been specified as char or string. Likewise “wchar data” refers to data whose IDL types have been specified as wchar or wstring.

   13.10.1.6 Byte-Oriented Code Set

   An encoding of characters where the numeric code corresponding to a character code element can occupy one or more bytes. A byte as used in this specification is synonymous with octet, which occupies 8 bits.

   13.10.1.7 Multi-Byte Character Strings

   A character string represented in a byte-oriented encoding where each character can occupy one or more bytes is called a multi-byte character string. Typically, wide characters are converted to this form from a (fixed-width) process code set before transmitting the characters outside the process (see below about process code sets). Care must be taken to correctly process the component bytes of a character’s multi-byte representation.
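   The need for care with component bytes shows up when a multi-byte character string is split at an arbitrary byte boundary, for example by a buffer limit. The following is a minimal Python sketch, not a mechanism defined by this specification. The helper name decode_chunks is hypothetical, and UTF-8 is chosen only because it appears as a multi-byte example in Table 13-3. The sketch uses the standard codecs.getincrementaldecoder function, which keeps the bytes of an incomplete character until the remaining bytes arrive.

```python
import codecs


def decode_chunks(chunks, encoding):
    """Decode byte chunks whose boundaries may fall inside a character.

    Hypothetical helper for illustration only.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True makes the decoder report an error if a character is left unfinished.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


if __name__ == "__main__":
    data = "\u65e5\u672c\u8a9e".encode("utf-8")  # three characters, three bytes each
    chunks = [data[:4], data[4:]]               # split inside the second character
    print(decode_chunks(chunks, "utf-8") == "\u65e5\u672c\u8a9e")
```

   Decoding each chunk independently would fail on the first chunk, because its last byte is only the first component of a three-byte character.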

   13.10.1.8 Non-Byte-Oriented Code Set

   An encoding of characters where the numeric code corresponding to a character code element occupies a fixed 16 or 32 bits.

   13.10.1.9 Char and Wchar Transmission Code Set (TCS-C and TCS-W)

   These two terms refer to code sets that are used for transmission between ORBs after negotiation is completed. As the names imply, the first one is used for char data and the second one for wchar data. Each TCS can be byte-oriented or non-byte-oriented.

   13.10.1.10 Process Code Set and File Code Set

   Processes generally represent international characters in an internal fixed-width format which allows for efficient representation and manipulation. This internal format is called a “process code set.” The process code set is irrelevant outside the process, and hence to the interoperation between CORBA clients and servers through their respective ORBs.

   When a process needs to write international character information out to a file, or communicate with another process (possibly over a network), it typically uses a different encoding called a “file code set.” In this specification, unless otherwise indicated, all references to a program’s code set refer to the file code set, not the process code set. Even when a client and server are located physically on the same machine, it is possible for them to use different file code sets.
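   The distinction between the two code sets can be illustrated with a short Python sketch. It is an analogy under stated assumptions, not part of this specification: the Python str type stands in for the process code set, the helper names to_file_code_set and from_file_code_set are hypothetical, and euc_jp and shift_jis are Python's codec names for two possible file code sets.

```python
def to_file_code_set(text: str, file_code_set: str) -> bytes:
    """Convert in-process text to the bytes written outside the process."""
    return text.encode(file_code_set)


def from_file_code_set(data: bytes, file_code_set: str) -> str:
    """Convert externally stored bytes back to in-process text."""
    return data.decode(file_code_set)


if __name__ == "__main__":
    text = "\u65e5\u672c"  # the same in-process characters in both programs
    as_euc = to_file_code_set(text, "euc_jp")
    as_sjis = to_file_code_set(text, "shift_jis")
    print(as_euc != as_sjis)  # the external byte sequences differ
    print(from_file_code_set(as_euc, "euc_jp") == text)
```

   Two programs on the same machine can hold identical characters internally and still produce different bytes externally, which is why the file code set is the one that matters for interoperation.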

   13.10.1.11 Native Code Set

   A native code set is the code set which a client or a server uses to communicate with its ORB. There might be separate native code sets for char and wchar data.

   13.10.1.12 Transmission Code Set

   A transmission code set is the commonly agreed upon encoding used for character data transfer between a client’s ORB and a server’s ORB. There are two transmission code sets established per session between a client and its server, one for char data (TCS-C) and the other for wchar data (TCS-W). Figure 13-6 illustrates these relationships:

   


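   [Figure: a client and a server each exchange data with their own ORB in a native code set; the two ORBs exchange data with each other in the transmission code sets.]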

   Figure 13-6 Transmission Code Sets

   The intent is for TCS-C to be byte-oriented and TCS-W to be non-byte-oriented. However, this specification does allow both types of characters to be transmitted using the same transmission code set. That is, the selection of a transmission code set is orthogonal to the wideness or narrowness of the characters, although a given code set may be better suited for either narrow or wide characters.

   13.10.1.13 Conversion Code Set (CCS)

   With respect to a particular ORB’s native code set, the conversion code set is the set of other (target) code sets for which the ORB can convert all code points or character encodings between the native code set and that target code set. For each code set in this CCS, the ORB maintains appropriate translation or conversion procedures and advertises the ability to use that code set for transmitted data in addition to the native code set.
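   The way native and conversion code sets could feed into the choice of a transmission code set can be sketched as follows. This is a simplified illustration and not the normative negotiation procedure: the function name choose_transmission_code_set, the order of preference, and the use of plain strings to identify code sets are all assumptions made for this example. The same selection would be made once for char data (TCS-C) and once for wchar data (TCS-W).

```python
from typing import Optional, Sequence


def choose_transmission_code_set(
    client_native: str,
    client_ccs: Sequence[str],
    server_native: str,
    server_ccs: Sequence[str],
) -> Optional[str]:
    """Pick a code set both ORBs can use, or None if there is none.

    Hypothetical, simplified selection for illustration only.
    """
    if client_native == server_native:
        return client_native           # no conversion needed on either side
    if server_native in client_ccs:
        return server_native           # client ORB converts
    if client_native in server_ccs:
        return client_native           # server ORB converts
    for code_set in client_ccs:
        if code_set in server_ccs:
            return code_set            # both ORBs convert to a shared code set
    return None                        # no mutually usable code set


if __name__ == "__main__":
    print(choose_transmission_code_set("eucJP", ["UTF-8"], "Shift-JIS", ["UTF-8"]))
```

   The sketch only shows why each ORB advertises its CCS alongside its native code set: without that information the other side cannot tell which conversions are available.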