HP OpenVMS Systems Documentation

HP Volume Shadowing for OpenVMS: OpenVMS Version 8.4 > Chapter 2 Configuring Your System for High Data Availability

Repair and Recovery from Failures

Volume shadowing failures, some of which the volume shadowing software can recover from automatically, are grouped into the following categories:

  • Controller errors

  • Device errors

  • Data errors

  • Connectivity failures

How a shadow set is repaired and recovered depends on the type of failure that occurred and on the hardware configuration. In general, a device that becomes inaccessible fails over to another controller whenever possible; otherwise, it is removed from the shadow set. Errors that result from media defects can often be repaired automatically by the volume shadowing software.

Table 2-1 describes these failure types and recovery mechanisms.

Table 2-1 Types of Failures

Type Description

Controller error

Results from a failure in the controller. If the failure is recoverable, processing continues and data availability is not affected. If the failure is nonrecoverable, shadow set members connected to the controller are removed from the shadow set, and processing continues with the remaining members. In configurations where disks are dual-pathed between two controllers, and one controller fails, the shadow set members fail over to the remaining controller and processing continues.

Device error

Signifies that the mechanics or electronics in the device failed. If the failure is recoverable, processing continues. If the failure is nonrecoverable, the node that detects the error removes the device from the shadow set.

Data errors

Results when a device detects corrupt data. Data errors usually result from media defects that do not cause the device to be removed from a shadow set. Depending on the severity of the data error (or the degree of media deterioration), the controller takes one of the following actions:

  • Corrects the error and returns valid data.

  • Corrects the data and, depending on the device and controller implementation, may re-vector it to a new logical block number (LBN).

  • Returns a parity error status to Volume Shadowing, which means the data cannot be read without error.

When data cannot be corrected by the controller, volume shadowing attempts to replace the lost data by retrieving it from another shadow set member and writing the data to the member with the error. This repair operation is synchronized within the cluster and with the application I/O stream. If the operation fails, then the member with the error is removed from the shadow set.
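The repair sequence described above can be pictured with a small model. The following is a minimal sketch, not the volume shadowing implementation: the Member class, the ParityError exception, and read_with_repair are hypothetical names introduced for illustration, and the sketch leaves out the synchronization within the cluster and with the application I/O stream that the real repair operation performs.

```python
from dataclasses import dataclass, field


class ParityError(Exception):
    """Hypothetical stand-in for the parity error status a controller returns."""


@dataclass
class Member:
    """Hypothetical in-memory model of one shadow set member."""
    name: str
    blocks: dict                                  # LBN -> data
    bad_lbns: set = field(default_factory=set)    # LBNs that cannot be read
    write_fails: bool = False                     # simulates a failed repair

    def read(self, lbn):
        if lbn in self.bad_lbns:
            raise ParityError(f"{self.name}: cannot read LBN {lbn}")
        return self.blocks[lbn]

    def write(self, lbn, data):
        if self.write_fails:
            raise OSError(f"{self.name}: cannot write LBN {lbn}")
        self.blocks[lbn] = data
        self.bad_lbns.discard(lbn)


def read_with_repair(members, lbn):
    """Read one block, repairing members that report an uncorrectable error.

    The data is taken from the first member that can read the block and is
    written back to each member that failed. A member whose repair write
    fails is removed from the shadow set (the `members` list).
    """
    failed = []
    data = None
    for member in members:
        try:
            data = member.read(lbn)
            break
        except ParityError:
            failed.append(member)
    if data is None:
        raise ParityError(f"LBN {lbn} is unreadable on every member")

    for member in failed:
        try:
            member.write(lbn, data)      # repair from the good copy
        except OSError:
            members.remove(member)       # repair failed: drop the member
    return data


# Usage: DISK_A has a bad block that is repaired from DISK_B.
disk_a = Member("DISK_A", {7: b"stale"}, bad_lbns={7})
disk_b = Member("DISK_B", {7: b"good"})
shadow_set = [disk_a, disk_b]
data = read_with_repair(shadow_set, 7)
```

In this model the shadow set keeps both members when the repair write succeeds and loses the faulty member only when the write fails, which mirrors the two outcomes described in the paragraph above.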

Connectivity failures

When a connectivity failure occurs, the first node to detect the failure must decide how to recover from the failure in a manner least likely to affect the availability or consistency of the data. As each node discovers the recoverable device failure, it determines its course of action as follows:

  • If at least one member of the shadow set is accessible by the node that detected the error, that node attempts to recover from the failure. The node repeatedly attempts to access the failed shadow set member within the period of time specified by the system parameter SHADOW_MBR_TMO. (This time period could be either the default setting or a different value previously set by the system manager.) If access to the failed disk is not established within the time specified by SHADOW_MBR_TMO, the disk is removed from the shadow set.

  • If no members of a shadow set can be accessed by the node, that node does not attempt to make any adjustments to the shadow set membership. Rather, it assumes that another node, one that does have access to the shadow set, will make the appropriate corrections.

    The node attempts to access the shadow set members until the period of time designated by the system parameter MVTIMEOUT expires. (This time period could be the default setting or a different value previously set by the system manager.) After the time expires, all application I/O is returned with the following error status message:

    -SYSTEM-F-VOLINV, Volume is not software enabled
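
The per-node decision for connectivity failures can be summarized as a small decision function. This is a minimal sketch under stated assumptions, not the OpenVMS logic itself: the Action enumeration and the connectivity_action function are hypothetical, and the timeout values in the usage lines are illustrative numbers rather than a statement of the system defaults. Only the parameter names SHADOW_MBR_TMO and MVTIMEOUT and the VOLINV status come from the text above.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical labels for the outcomes described in the text."""
    RETRY_FAILED_MEMBER = "keep trying to reach the failed member"
    REMOVE_FAILED_MEMBER = "remove the failed member from the shadow set"
    RETRY_ALL_MEMBERS = "keep trying; leave membership unchanged"
    FAIL_IO_VOLINV = "return application I/O with SYSTEM-F-VOLINV"


def connectivity_action(accessible_members, elapsed_seconds,
                        shadow_mbr_tmo, mvtimeout):
    """Return what a node does after detecting a connectivity failure.

    accessible_members: number of shadow set members this node can reach
    elapsed_seconds:    time since the node detected the failure
    shadow_mbr_tmo:     value of the SHADOW_MBR_TMO system parameter
    mvtimeout:          value of the MVTIMEOUT system parameter
    """
    if accessible_members > 0:
        # The node can still reach part of the shadow set, so it owns the
        # recovery and retries the failed member until SHADOW_MBR_TMO expires.
        if elapsed_seconds < shadow_mbr_tmo:
            return Action.RETRY_FAILED_MEMBER
        return Action.REMOVE_FAILED_MEMBER

    # The node cannot reach any member: it never changes membership and
    # only retries until MVTIMEOUT expires.
    if elapsed_seconds < mvtimeout:
        return Action.RETRY_ALL_MEMBERS
    return Action.FAIL_IO_VOLINV


# Usage with illustrative timeout values (in seconds).
partial_loss = connectivity_action(1, 30, shadow_mbr_tmo=120, mvtimeout=600)
total_loss = connectivity_action(0, 600, shadow_mbr_tmo=120, mvtimeout=600)
```

The sketch keeps the two timers separate because they govern different situations: SHADOW_MBR_TMO applies only while the node can still reach at least one member, and MVTIMEOUT applies only when it can reach none.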