23.1.4 Architectural Overview

Figure 23-2 presents an architectural overview of a fault-tolerant system, showing an example strategy for implementation of the specifications for Fault Tolerant CORBA. Other implementation strategies are possible.

set create_ Invoked by roperties object() Application

Replication Manager Fault notifications Notifier

Property Object Group Manager Manager

Generic Factory

create_ Invoked by

object() Replication Manager on Factory Objects

Host H1

FaultDetector

fault reports is_alive()

Host Host H2 H3

Client Object Server Replica Server Replica

CS1 S2

Fault Fault Detector Detector Factory Factory

ORB ORB ORB

Logging Recovery Logging Recovery Mechanism Mechanism Mechanism Mechanism

Figure 23-2 Achitectural Overview of a Fault-Tolerant System

At the top of the figure are shown several components of the Fault Tolerance Infrastructure (Replication Manager, Fault Notifier, Fault Detector), all of which are implemented as CORBA objects. Logically, there is a single instance of the Replication Manager and Fault Notifier in each fault tolerance domain but, physically, they are replicated to protect against faults, just as are the application objects. The Replication Manager inherits the PropertyManager, ObjectGroupManager, and GenericFactoryinterfaces.

The bottom of the figure shows three hosts, as follows:

• a client application object C on host H 1that is invoking a replicated server object with two replicas,

• S 1on host H 2, and

• S 2on host H 3.

A typical system will contain many such client and server objects.

The figure shows Factory and Fault Detector objects that may be present on each host and are specific to that host. These host-specific objects are not replicated, unlike the service objects shown at the top of the figure, which are replicated objects.

The figure also shows the Message Handler and the Logging and Recovery Mechanisms that are present on each host. These are not CORBA objects but, rather, are a part of the ORB, or are located between the ORB and the operating system.

23.1.4.1 Fault Tolerance Property Management

This specification provides a PropertyManager interface that allows the user to define fault tolerance properties of object groups. The specification of the PropertyManager interface is designed to allow vendors to develop graphical user interfaces and to define additional properties should they so desire.

Two properties of particular relevance are the Membership Style and the Consistency Style. The Membership Style defines whether the membership of an object group is infrastructure-controlled or application-controlled. Similarly, the Consistency Style defines whether the consistency of the states of the members of an object group is infrastructure-controlled or application-controlled. Some components of the Fault Tolerance Infrastructure, such as the Logging and Recovery Mechanisms, are used only for object groups that have the infrastructure-controlled Consistency Style.

23.1.4.2 Replication Management

For the infrastructure-controlled (MEMB_INF_CTRL) Membership Style ( Section 23.3.2.2, “MembershipStyle,? on page 23-33) the replication of objects is substantially transparent to the application program, which simplifies the development of new application programs, and allows the continued use of existing application programs.

Using the create_object() operation of the GenericFactory interface, the application program requests the creation of a replicated object (object group), just as it would an unreplicated object. This operation is invoked on the Replication Manager, rather than directly on the factory (as it would have been in the unreplicated case). The Replication Manager then invokes the factories, on the different hosts, where a replica is to be created, using the same create_object() operation of the GenericFactory interface.

Using the create_member(), add_member(), and remove_member() operations of the ObjectGroupManager interface, the application can exercise control over the addition and removal, and location, of members of an object group (violating transparency).

While each individual replica has its own object reference, the object group as a whole has its interoperable object group reference, which is created by the Replication Manager. This object group reference contains a TAG_FT_GROUP component for the object group within the profiles of the object group reference. The object group reference is returned to the application by the Replication Manager, and is published by the server object. The client objects use the object group reference to invoke methods on the server object group, just as they would have used a conventional object reference for an unreplicated object.

Because of the object group abstraction, the client objects are not aware that the server objects are replicated (client transparency to replication), and are not aware of faults in the server replicas or of the recovery of server replicas when a fault has occurred (client transparency to faults).

23.1.4.3 Fault Detection and Notification

Fault tolerance requires fault detection, and typical systems contain several fault detection mechanisms to detect host failures, resource exhaustion, etc. This specification defines a simple PullMonitorable interface that the application objects inherit. The PullMonitorable interface contains the is_alive() operation that a Fault Detector invokes. For efficiency, the Fault Detector that monitors an application object is typically located on the same host as that object, while the local Fault Detectors are monitored by a global Fault Detector that is replicated for fault tolerance.

The Fault Detector, and other kinds of fault detectors in the system, such as those based on the PUSH Monitoring Style and those that detect host or network faults, report faults to the Fault Notifier, which passes fault notifications to the Replication Manager and other objects that have registered for such notifications. An application-specific fault analyzer may register to receive such notifications, and may condense and filter such notifications into further fault reports that it returns to the Fault Notifier.

23.1.4.4 Logging and Recovery

For the COLD_PASSIVE and WARM_PASSIVE Replication Styles, under fault-free conditions, only one member of an object group, the primary member, executes the requests and generates the replies. If the Fault Detector suspects that the primary member is faulty, the Replication Manager, at its discretion, restarts the current primary member or promotes a backup member to become the new primary member.

For the application-controlled (CONS_APP_CTRL) Consistency Style, the Replication Manager takes no further recovery action and the new primary member is responsible for the recovery of its own state.

For the infrastructure-controlled (CONS_INF_CTRL) Consistency Style, the new primary member must start operation with the appropriate state, and must execute the same sequence of requests that were, or should have been, executed by the previous primary member, had it not failed. Thus, each GIOP message is passed to the Logging and Recovery Mechanisms, automatically and invisibly to the application. The Logging Mechanism records the message in a log, from which the Recovery Mechanism can retrieve the message during recovery.

Periodically, the Logging Mechanism invokes the get_state() operation of the Checkpointable interface, which must be implemented by every replicated application object, to obtain the state of the object, so that the state can be recorded in a log. During recovery, the Recovery Mechanism invokes the set_state() operation of the Checkpointable interface of the new primary to set its state to the state that was recorded in the log.