Previous Table of Contents Next


13.10.2 Code Set Conversion Framework


   13.10.2.1 Requirements

   The file code set that an application uses is often determined by the platform on which it runs. In Japan, for example, Japanese EUC is used on Unix systems, while Shift-JIS is used on PCs. Code set conversion is therefore required to enable interoperability across these platforms. This proposal defines a framework for the automatic conversion of code sets in such situations. The requirements of this framework are:

   13.10.2.2 Overview of the Conversion Framework

   Both the client and server indicate a native code set indirectly by specifying a locale. The exact method for doing this is language-specific, such as the XPG4 C/C++ function setlocale. The client and server use their native code set to communicate with their ORB. (Note that these native code sets are in general different from process code sets and hence conversions may be required at the client and server ends.)

    The conversion framework is illustrated in Figure 13-7. The server-side ORB stores a server’s code set information in a component of the IOR multiple-component profile structure (see Section 13.6.2, “Interoperable Object References: IORs? on page 13-14 )1. The code sets actually used for transmission are carried in the service context field of an IOP (Inter-ORB Protocol) request header (see Section 13.7, “Service Context? on page 13-28 and Section 13.10.2.5, “GIOP Code Set Service Context? on page 13-44). Recall that there are two code sets (TCS-C and TCS-W) negotiated for each session.

   1. Version 1.1 of the IIOP profile body can also be used to specify the server’s code set information, as this version introduces an extra field that is a sequence of tagged components.


   Figure 13-7 Code Set Conversion Framework Overview

   If the native code sets used by a client and server are the same, then no conversion is performed. If the native code sets are different and the client-side ORB has an appropriate converter, then the CMIR conversion method is used. In this case, the server’s native code set is used as the transmission code set. If the native code sets are different and the client-side ORB does not have an appropriate converter but the server-side ORB does have one, then the SMIR conversion method is used. In this case, the client’s native code set is used as the transmission code set.

   The conversion framework allows clients and servers to specify a native char code set and a native wchar code set, which determine the local encodings of IDL types char and wchar, respectively. The conversion process outlined above is executed independently for the char code set and the wchar code set. In other words, the algorithm that is used to select a transmission code set is run twice, once for char data and once for wchar data.

   The rationale for selecting two transmission code sets rather than one (which is typically inferred from the locale of a process) is to allow efficient data transmission without any conversions when the client and server have identical representations for char and/or wchar data. For example, when a Windows NT client talks to a Windows NT server and they both use Unicode for wide character data, it becomes possible to transmit wide character data from one to the other without any conversions. Of course, this becomes possible only for those wide character representations that are well-defined, not for any proprietary ones. If a single transmission code set was mandated, it might require unnecessary conversions. (For example, choosing Unicode as the transmission code set would force conversion of all byte-oriented character data to Unicode.)

   13.10.2.3 ORB Databases and Code Set Converters

   The conversion framework requires an ORB to be able to determine the native code set for a locale and to convert between code sets as necessary. While the details of exactly how these tasks are accomplished are implementation-dependent, the following databases and code set converters might be used:

   13.10.2.4 CodeSet Component of IOR Multi-Component Profile

   The code set component of the IOR multi-component profile structure contains:

   Both char and wchar conversion code sets are listed in order of preference. The code set component is identified by the following tag:

   const IOP::ComponentID TAG_CODE_SETS = 1;

    This tag has been assigned by OMG (See Section 13.6.6, “Standard IOR Components? on page 13-19.). The following IDL structure defines the representation of code set information within the component:

   module CONV_FRAME { // IDL typedef unsigned long CodeSetId; struct CodeSetComponent {

   CodeSetId native_code_set;

   sequence<CodeSetId> conversion_code_sets; }; struct CodeSetComponentInfo {

   CodeSetComponent ForCharData;CodeSetComponent ForWcharData;};};

   Code sets are identified by a 32-bit integer id from the OSF Character and Code Set Registry (See Section 13.10.5.1, “Character and Code Set Registry? on page 13-50 for further information). Data within the code set component is represented as a structure of type CodeSetComponentInfo, and is encoded as a CDR encapsulation. In other words, the char code set information comes first, then the wchar information, represented as structures of type CodeSetComponent.

   A null value should be used in the native_code_set field if the server desires to indicate no native code set (possibly with the identification of suitable conversion code sets).

   If the code set component is not present in a multi-component profile structure, then the default char code set is ISO 8859-1 for backward compatibility. However, there is no default wchar code set. If a server supports interfaces that use wide character data but does not specify the wchar code sets that it supports, client-side ORBs will raise exception INV_OBJREF, with standard minor code 1.

   If a client application invokes an operation which results in an attempt by the client ORB to marshal wchar or wstring data for an in parameter (or to unmarshal wchar or wstring data for an in/out parameter, out parameter or the return value), and the associated Object Reference does not include a codeset component, then the client ORB shall raise the INV_OBJREF standard system exception with standard minor code 2 as a response to the operation invocation.

   13.10.2.5 GIOP Code Set Service Context

   The code set GIOP service context contains:

   in the form of a code set service. This service is identified by:

   const IOP::ServiceID CodeSets = 1;

   The following IDL structure defines the representation of code set service information:

   module CONV_FRAME { // IDL typedef unsigned long CodeSetId; struct CodeSetContext {

   CodeSetId char_data;CodeSetId wchar_data;};};

   Code sets are identified by a 32-bit integer id from the OSF Character and Code Set Registry (See Section 13.10.5.1, “Character and Code Set Registry? on page 13-50 for further information).

   Note – A server’s char and wchar Code set components are usually different, but under some special circumstances they can be the same. That is, one could use the same code set for both char data and wchar data. Likewise the CodesetIds in the service context don’t have to be different.

   13.10.2.6 Code Set Negotiation

   The client-side ORB determines a server’s native and conversion code sets from the code set component in an IOR multi-component profile structure, and it determines a client’s native and conversion code sets from the locale setting (and/or environment variables/configuration files) and the converters that are available on the client. From this information, the client-side ORB chooses char and wchar transmission code sets (TCS-C and TCS-W). For both requests and replies, the char TCS-C determines the encoding of char and string data, and the wchar TCS-W determines the encoding of wchar and wstring data.

   Code set negotiation is not performed on a per-request basis, but only when a client initially connects to a server. All text data communicated on a connection are encoded as defined by the TCSs selected when the connection is established.

    Figure 13-8 illustrates, there are two channels for character data flowing between the client and the server. The first, TCS-C, is used for char data and the second, TCS-W, is used for wchar data. Also note that two native code sets, one for each type of data, could be used by the client and server to talk to their respective ORBs (as noted earlier, the selection of the particular native code set used at any particular point is done via setlocale or some other implementation-dependent method).


   Figure 13-8 Transmission Code Set Use

   Let us look at an example. Assume that the code set information for a client and server is as shown in the table below. (Note that this example concerns only char code sets and is applicable only for data described as chars in the IDL.)

Client

Server

Native code set: SJIS eucJP
Conversion code sets: eucJP, JIS SJIS, JIS

   The client-side ORB first compares the native code sets of the client and server. If they are identical, then the transmission and native code sets are the same and no conversion is required. In this example, they are different, so code set conversion is necessary. Next, the client-side ORB checks to see if the server’s native code set, eucJP, is one of the conversion code sets supported by the client. It is, so eucJP is selected as the transmission code set, with the client (i.e., its ORB) performing conversion to and from its native code set, SJIS, to eucJP. Note that the client may first have to convert all its data described as chars (and possibly wchar_ts) from process codes to SJIS first.

   Now let us look at the general algorithm for determining a transmission code set and where conversions are performed. First, we introduce the following abbreviations:

   The algorithm is as follows:

   if (CNCS==SNCS) TCS = CNCS; // no conversion required else { if (elementOf(SNCS,CCCS)) TCS = SNCS; // client converts to server’s native code set else if (elementOf(CNCS,SCCS)) TCS = CNCS; // server converts from client’s native code set

   else if (intersection(CCCS,SCCS) != emptySet) {

   TCS = oneOf(intersection(CCCS,SCCS));

   // client chooses TCS, from intersection(CCCS,SCCS), that is

   // most preferable to server;

   // client converts from CNCS to TCS and server

   // from TCS to SNCS

   else if (compatible(CNCS,SNCS)) TCS = fallbackCS; // fallbacks are UTF-8 (for char data) and // UTF-16 (for wchar data) else raise CODESET_INCOMPATIBLE exception; }

   The algorithm first checks to see if the client and server native code sets are the same. If they are, then the native code set is used for transmission and no conversion is required. If the native code sets are not the same, then the conversion code sets are examined to see if

   If the third option is selected and there is more than one possible intermediate conversion code set (i.e., the intersection of CCCS and SCCS contains more than one code set), then the one most preferable to the server is selected.2

   If none of these conversions is possible, then the fallback code set (UTF-8 for char data and UTF-16 for wchar data— see below) is used. However, before selecting the fallback code set, a compatibility test is performed. This test looks at the character sets encoded by the client and server native code sets. If they are different (e.g., Korean and French), then meaningful communication between the client and server is not possible and a CODESET_INCOMPATIBLE exception is raised. This test is similar to the DCE compatibility test and is intended to catch those cases where conversion from the client native code set to the fallback, and the fallback to the server native code set would result in massive data loss. (See Section 13.10.5, “Relevant OSFM Registry Interfaces? on page 13-50 for the relevant OSF registry interfaces that could be used for determining compatibility.)

   A DATA_CONVERSION exception is raised when a client or server attempts to transmit a character that does not map into the negotiated transmission code set. For example, not all characters in Taiwan Chinese map into Unicode. When an attempt is made to transmit one of these characters via Unicode, an ORB is required to raise a DATA_CONVERSION exception, with standard minor code 1.

   In summary, the fallback code set is UTF-8 for char data (identified in the Registry as 0x05010001, “X/Open UTF-8; UCS Transformation Format 8 (UTF-8)"), and UTF-16 for wchar data (identified in the Registry as 0x00010109, "ISO/IEC 10646-1:1993; UTF-16, UCS Transformation Format 16-bit form"). As mentioned above the fallback code set is meaningful only when the client and server character sets are compatible, and the fallback code set is distinguished from a default code set used for backward compatibility.

   If a server’s native char code set is not specified in the IOR multi-component profile, then it is considered to be ISO 8859-1 for backward compatibility. However, a server that supports interfaces that use wide character data is required to specify its native wchar code set; if one is not specified, then the client-side ORB raises exception INV_OBJREF, with standard minor code set to 1.

   Similarly, if no char transmission code set is specified in the code set service context, then the char transmission code set is considered to be ISO 8859-1 for backward compatibility. If a client transmits wide character data and does not specify its wchar transmission code set in the service context, then the server-side ORB raises exception BAD_PARAM, with standard minor code set to 23.

   To guarantee “out-of-the-box? interoperability, clients and servers must be able to convert between their native char code set and UTF-8 and their native wchar code set (if specified) and Unicode. Note that this does not require that all server native code sets be mappable to Unicode, but only those that are exported as native in the IOR. The server may have other native code sets that aren’t mappable to Unicode and those can

   2.Recall that server conversion code sets are listed in order of preference.

   be exported as SCCSs (but not SNCSs). This is done to guarantee out-of-the-box interoperability and to reduce the number of code set converters that a CORBA-compliant ORB must provide.

   ORB implementations are strongly encouraged to use widely-used code sets for each regional market. For example, in the Japanese marketplace, all ORB implementations should support Japanese EUC, JIS and Shift JIS to be compatible with existing business practices.