HP OpenVMS Systems Documentation
The OpenVMS Frequently Asked Questions (FAQ)
15.6.1 OpenVMS Cluster Communications Protocol Details?
The following sections contain information on the OpenVMS System
Communications Services (SCS) Protocol. Cluster terminology is
available in Section 18.104.22.168.1.
The OpenVMS Cluster environment operates over various network protocols, but the core of clustering uses the System Communications Services (SCS) protocols, and SCS-specific network datagrams. Direct (full) connectivity is assumed.
An OpenVMS Cluster does not operate over DECnet, nor over IP.
No SCS protocol routers are available.
Many folks have suggested operating SCS over DECnet or IP over the years, but SCS is too far down in the layers, and any such project would entail a major or complete rewrite of SCS and of the DECnet or IP drivers. Further, the current DECnet and IP implementations have large tracts of code that operate at the application level, while SCS must operate in the rather more primitive contexts of the system and particularly the bootstrap---to get SCS to operate over a DECnet or IP connection would require relocating major portions of the DECnet or IP stack into the kernel. (And it is not clear that the result would even meet the bandwidth and latency expectations.)
The usual approach for multi-site OpenVMS Cluster configurations
involves FDDI, Memory Channel (MC2), or a point-to-point remote bridge,
brouter, or switch. The connection must be transparent, and it must
operate at 10 megabits per second or better (Ethernet speed), with
latency characteristics similar to those of Ethernet or better. Various
sites use FDDI, MC2, ATM, or a point-to-point T3 link.
This section discusses OpenVMS Cluster communications, cluster
terminology, related utilities, and command and control interfaces.
SCS: System Communications Services. The protocol used to communicate between VMScluster systems and between OpenVMS systems and SCS-based storage controllers. (SCSI-based storage controllers do not use SCS.)
All systems and storage controllers establish "Virtual Circuits" to enable communications between all available pairs of ports.
VMS$DISK_CL_DRIVER connects to MSCP$DISK
VMS$TAPE_CL_DRIVER connects to MSCP$TAPE
VMS$VAXCLUSTER connects to VMS$VAXCLUSTER
SCS$DIR_LOOKUP connects to SCS$DIRECTORY
MSCP and TMSCP
SCS CONNECTION: A SYSAP on one node establishes an SCS connection to
its counterpart on another node. This connection will be on ONE AND
ONLY ONE of the available virtual circuits.
When there are multiple virtual circuits between two OpenVMS systems it is possible for the VMS$VAXCLUSTER to VMS$VAXCLUSTER connection to use any one of these circuits. All lock traffic between the two systems will then travel on the selected virtual circuit.
Each port has a "LOAD CLASS" associated with it. This load class helps to determine which virtual circuit a connection will use. If one port has a higher load class than all others then this port will be used. If two or more ports have equally high load classes then the connection will use the first of these that it finds. Prior to enhancements found in V7.3-1 and later, the load class is static and normally all CI and DSSI ports have a load class of 14(hex), while the Ethernet and FDDI ports will have a load class of A(hex). With V7.3-1 and later, the load class values are dynamic.
For instance, if you have multiple DSSI busses and an FDDI, the VMS$VAXCLUSTER connection will choose the DSSI bus that has the system disk, because that bus will always be the first DSSI bus discovered when the OpenVMS system boots.
To force all lock traffic off the DSSI and on to the FDDI, for instance, an adjustment to the load class value is required, or the DSSI SCS port must be disabled.
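The selection rule described above (highest load class wins, and the first port found wins a tie) can be sketched as follows. This is only an illustration of the static, pre-V7.3-1 behavior as the text describes it, not the actual OpenVMS implementation; the Port type, the port labels, and the select_port function are hypothetical names invented for the example.

```python
from typing import NamedTuple, Optional, Sequence


class Port(NamedTuple):
    name: str        # hypothetical label, for illustration only
    load_class: int  # load class value as described in the text


def select_port(ports: Sequence[Port]) -> Optional[Port]:
    """Return the first port, in discovery order, with the highest load class."""
    best: Optional[Port] = None
    for port in ports:
        # Strictly greater: on a tie, the port that was found first is kept.
        if best is None or port.load_class > best.load_class:
            best = port
    return best


# Static values quoted in the text: CI and DSSI = 14 (hex), Ethernet and FDDI = A (hex).
ports = [
    Port("DSSI bus 1", 0x14),
    Port("FDDI", 0x0A),
    Port("DSSI bus 2", 0x14),
]
print(select_port(ports).name)  # DSSI bus 1
```

In this model, moving the connection onto the FDDI means either removing both DSSI entries from the list (disabling the ports) or changing the load class values so that the FDDI value is the highest.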
In addition to the load class mechanisms, you can also use the "preferred path" mechanisms of MSCP and TMSCP services. This allows you to control the SCS connections used for serving remote disk and tape storage. The preferred path mechanism is most commonly used to explicitly spread cluster I/O activity over hosts and/or storage controllers serving disk or tape storage in parallel. This can be particularly useful if your hosts or storage controllers individually lack the necessary I/O bandwidth for the current I/O load, and must thus aggregate bandwidth to serve the cluster I/O load.
For related tools, see various utilities including LAVC$STOP_BUS and
LAVC$START_BUS, and see DCL commands including SET PREFERRED_PATH.
In most OpenVMS versions, you can use the tools:
These tools permit you to disable or enable all SCS traffic on the specified paths.
You can also use a preferred path mechanism that tells the local MSCP disk class driver (DUDRIVER) which path to a disk should be used. Generally, this is used with dual-pathed disks, forcing I/O traffic through one of the controllers instead of the other. This can be used to implement a crude form of I/O load balancing at the disk I/O level.
Prior to V7.2, the preferred path feature uses the tool:
In OpenVMS V7.2 and later, you can use the following DCL command:
The preferred path mechanism does not disable nor affect SCS operations on the non-preferred path.
With OpenVMS V7.3 and later, please see the SCACP utility for control
over cluster communications, SCS virtual circuit control, port
selection, and related operations.
The following sections contain details of configuring cluster-related
The VMScluster connection manager uses the concept of votes and quorum to prevent disk and memory data corruptions---when sufficient votes are present for quorum, then access to resources is permitted. When sufficient votes are not present, user activity will be blocked. The act of blocking user activity is called a "quorum hang", and is better thought of as a "user data integrity interlock". This mechanism is designed to prevent a partitioned VMScluster, and the resultant massive disk data corruptions. The quorum mechanism is expressly intended to prevent your data from becoming severely corrupted.
On each OpenVMS node in a VMScluster, one sets two values in SYSGEN: VOTES, and EXPECTED_VOTES. The former is how many votes the node contributes to the VMScluster. The latter is the total number of votes expected when the full VMScluster is bootstrapped.
Some sites erroneously attempt to set EXPECTED_VOTES too low, believing that this will allow operation when only a subset of voting nodes is present in a VMScluster. It does not. Further, an erroneous setting in EXPECTED_VOTES is automatically corrected once VMScluster connections to other nodes are established; user data is at risk of severe corruptions during the earliest and most vulnerable portion of the system bootstrap, before the connections have been established.
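To see why a low EXPECTED_VOTES value is dangerous, here is a rough sketch. It assumes the quorum formula given in the OpenVMS Cluster documentation, quorum = (EXPECTED_VOTES + 2) / 2 with the fraction dropped, which this FAQ section does not state itself. The function names are hypothetical, and the model ignores the automatic correction that happens once connections are established; it only covers the window before that.

```python
from typing import Sequence


def quorum(expected_votes: int) -> int:
    """Quorum derived from EXPECTED_VOTES (assumed formula, integer division)."""
    return (expected_votes + 2) // 2


def can_partition(node_votes: Sequence[int], expected_votes: int) -> bool:
    """True if the nodes can split into two isolated groups that both reach quorum."""
    total = sum(node_votes)
    needed = quorum(expected_votes)
    # Collect every vote total that some subset of the nodes can reach.
    reachable = {0}
    for votes in node_votes:
        reachable |= {subtotal + votes for subtotal in reachable}
    # A partition is possible if one group and its complement both have quorum.
    return any(s >= needed and total - s >= needed for s in reachable)


# Two nodes with one vote each.
print(can_partition([1, 1], expected_votes=2))  # False
print(can_partition([1, 1], expected_votes=1))  # True
```

With EXPECTED_VOTES set correctly to 2, neither node can proceed alone. With EXPECTED_VOTES set to 1, each node believes it has quorum by itself, which is exactly the partitioned cluster the quorum scheme exists to prevent.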
One can operate a VMScluster with one, two, or many voting nodes. With any but the two-node configuration, keeping a subset of the nodes active when some nodes fail can be easily configured. With the two-node configuration, one must use a primary-secondary configuration (where the primary has all the votes), a peer configuration (where when either node is down, the other hangs), or (preferable) a shared quorum disk.
Use of a quorum disk does slow down VMScluster transitions somewhat -- the addition of a third voting node that contributes the vote(s) that would be assigned to the quorum disk makes for faster transitions---but the use of a quorum disk does mean that either node in a two-node VMScluster configuration can operate when the other node is down.
If you choose to use a quorum disk, a QUORUM.DAT file will be automatically created the first time OpenVMS boots with a quorum disk specified; more precisely, the QUORUM.DAT file will be created when OpenVMS is booted without also needing the votes from the quorum disk.
In a two-node VMScluster with a shared storage interconnect, typically each node has one vote, and the quorum disk also has one vote. EXPECTED_VOTES is set to three.
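The three two-node arrangements discussed above can be compared with a small sketch. It assumes the documented quorum formula, quorum = (EXPECTED_VOTES + 2) / 2 with the fraction dropped, and it assumes the surviving node can still reach the quorum disk. The member names A, B, and QDISK and the keeps_running function are hypothetical.

```python
def quorum(expected_votes: int) -> int:
    """Quorum derived from EXPECTED_VOTES (assumed formula, integer division)."""
    return (expected_votes + 2) // 2


# Each entry: votes held by each member, and the EXPECTED_VOTES setting.
CONFIGS = {
    "primary-secondary": ({"A": 1, "B": 0}, 1),
    "peer": ({"A": 1, "B": 1}, 2),
    "quorum disk": ({"A": 1, "B": 1, "QDISK": 1}, 3),
}


def keeps_running(config: str, failed_node: str) -> bool:
    """True if the remaining members still hold enough votes for quorum."""
    votes, expected = CONFIGS[config]
    remaining = sum(v for member, v in votes.items() if member != failed_node)
    return remaining >= quorum(expected)


for name in CONFIGS:
    print(name, keeps_running(name, "A"), keeps_running(name, "B"))
# primary-secondary False True
# peer False False
# quorum disk True True
```

Only the quorum disk arrangement lets either node continue when the other is down, which is why it is described as preferable.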
Using a quorum disk on a non-shared interconnect is unnecessary---the use of a quorum disk does not provide any value, and the votes assigned to the quorum disk should be assigned to the OpenVMS host serving access to the disk.
For information on quorum hangs, see the OpenVMS documentation. For information on changing the EXPECTED_VOTES value on a running system, see the SET CLUSTER/EXPECTED_VOTES command, and see the documentation for the AMDS and Availability Manager tools. Also of potential interest is the OpenVMS system console documentation for the processor-specific console commands used to trigger the IPC (Interrupt Priority Level %x0C; IPL C) handler. (IPC is not available on OpenVMS I64 V8.2.) AMDS, Availability Manager, and the IPC handler can each be used to clear a quorum hang. Use of AMDS and Availability Manager is generally recommended over IPC, particularly because IPC can cause CLUEXIT bugchecks if the system should remain halted beyond the cluster sanity timer limits, and because some Alpha consoles and most (all?) Integrity consoles do not permit a restart after a halt.
The quorum scheme is a set of "blade guards" deliberately
implemented by OpenVMS Engineering to provide data integrity---remove
these blade guards at your peril. OpenVMS Engineering did not
implement the quorum mechanism to make a system manager's life more
difficult; the quorum mechanism was specifically implemented to
keep your data from getting scrambled.
Stated simply, Host-Based Volume Shadowing uses the Distributed Lock Manager (DLM) to coordinate changes to membership of a shadowset (e.g. removing a member). The DLM depends in turn on the Connection Manager enforcing the Quorum Scheme and deciding which node(s) (and quorum disk) are participating in the cluster, and telling the DLM when it needs to do things like a lock database rebuild operation. So you can't introduce a dependency of the Connection Manager on Shadowing to try to pick proper shadowset member(s) to use as the Quorum Disk when Shadowing itself is using the DLM and thus indirectly depending on the Connection Manager to keep the cluster membership straight---it's a circular dependency.
So in practice, folks simply depend on controller-based mirroring (or
controller-based RAID) to protect the Quorum Disk against disk failures
(and dual-redundant controllers to protect against most cases of
controller and interconnect failures). Since this disk unit appears to
be a single disk up at the VMS level, there's no chance of ambiguity.
The allocation class mechanism provides the system manager with a way to configure and resolve served and direct paths to storage devices within a cluster. Any served device that provides multiple paths should be configured using a non-zero allocation class, either at the MSCP (or TMSCP) storage controllers, at the port (for port allocation classes), or at the OpenVMS MSCP (or TMSCP) server. All controllers or servers providing a path to the same device should have the same allocation class (at the port, controller, or server level).
Each disk (or tape) unit number used within a non-zero disk (or tape) allocation class must be unique, regardless of the particular device prefix. For the purposes of multi-path device path determination, any disk (or tape) device with the same unit number and the same disk (or tape) allocation class configuration is assumed to be the same device.
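The uniqueness rule above can be checked mechanically. The following sketch takes a hypothetical inventory of physically distinct disks (the labels, the example device names, and the find_unit_conflicts function are all made up for illustration) and reports any allocation class and unit number pair claimed by more than one disk, regardless of the device prefix.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Matches names such as $2$DUA100 or $2$DKB100:
# $class$, two-letter device code, controller letter, unit number.
DEVICE_RE = re.compile(r"^\$(\d+)\$([A-Z]{2})([A-Z])(\d+):?$")


def find_unit_conflicts(disks: Dict[str, str]) -> Dict[Tuple[int, int], List[str]]:
    """Map (allocation class, unit) to the disk labels that collide on it."""
    seen: Dict[Tuple[int, int], List[str]] = defaultdict(list)
    for label, name in disks.items():
        match = DEVICE_RE.match(name)
        if match is None:
            raise ValueError(f"unrecognized device name: {name}")
        # The device code and controller letter are deliberately ignored.
        key = (int(match.group(1)), int(match.group(4)))
        seen[key].append(label)
    return {key: labels for key, labels in seen.items() if len(labels) > 1}


# Hypothetical inventory of four physically distinct disks.
disks = {
    "disk-a": "$2$DUA100",
    "disk-b": "$2$DKB100",
    "disk-c": "$2$DKB200",
    "disk-d": "$3$DUA100",
}
print(find_unit_conflicts(disks))  # {(2, 100): ['disk-a', 'disk-b']}
```

Here disk-a and disk-b would be treated as the same device for path determination, even though one is a DU device and the other a DK device, because both use unit 100 in allocation class 2.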
If you are reconfiguring disk device allocation classes, you will want
to avoid the use of allocation class one ($1$) until/unless you have
Fibre Channel storage configured. (Fibre Channel storage specifically
requires the use of allocation class $1$, for example $1$DGA0:.)
The HSZ allocation class is applied to devices, starting with OpenVMS V7.2. It is considered a port allocation class (PAC), and all device names with a PAC have their controller letter forced to "A". (You might infer from the text in the "Guidelines for OpenVMS Cluster Configurations" that this is something you have to do, though OpenVMS will thoughtfully handle this renaming for you.)
You can force the device names back to DKB by setting the HSZ allocation class to zero, and setting the PKB PAC to -1. This will use the host allocation class, and will leave the controller letter alone (that is, the DK controller letter will be the same as the SCSI port (PK) controller). Note that this won't work if the HSZ is configured in multibus failover mode. In this case, OpenVMS requires that you use an allocation class for the HSZ.
When your configuration gets even moderately complex, you must pay careful attention to how you assign the three kinds of allocation class: node, port and HSZ/HSJ, as otherwise you could wind up with device naming conflicts that can be painful to resolve.
The displayable path information is for SCSI multi-path, and permits the multi-path software to distinguish between different paths to the same device. If you have two paths to $1$DKA100, for example by having two KZPBA controllers and two SCSI buses to the HSZ, you would have two UCBs in a multi-path set. The path information is used by the multi-path software to distinguish between these two UCBs.
The displayable path information describes the path; in this case, the SCSI port. If port is PKB, that's the path name you get. The device name is no longer completely tied to the port name; the device name now depends on the various allocation class settings of the controller, SCSI port or node.
The reason the device name's controller letter is forced to "A" when you use PACs is because a shared SCSI bus may be configured via different ports on the various nodes connected to the bus. The port may be PKB on one node, and PKC on the other. Rather obviously, you will want to have the shared devices use the same device names on all nodes. To establish this, you will assign the same PAC on each node, and OpenVMS will force the controller letter to be the same on each node. Simply choosing "A" was easier and more deterministic than negotiating the controller letter between the nodes, and also parallels the solution used for this situation when DSSI or SDI/STI storage was used.
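The effect of forcing the controller letter can be shown with a deliberately simplified naming model. The device_name function below is hypothetical and covers only the two cases discussed above: a PAC in effect (controller letter forced to "A"), and no PAC in effect (controller letter follows the SCSI port). It leaves out the node allocation class or node name prefix that a real device name would carry in the second case.

```python
from typing import Optional


def device_name(port: str, unit: int, pac: Optional[int] = None) -> str:
    """Build a simplified DK device name for a disk behind the given PK port."""
    if pac is not None:
        # With a port allocation class, the controller letter is forced to "A".
        return f"${pac}$DKA{unit}"
    # Without one, the DK controller letter matches the PK port letter.
    return f"DK{port[-1]}{unit}"


# The same shared SCSI bus seen through PKB on one node and PKC on another.
print(device_name("PKB", 100), device_name("PKC", 100))                # DKB100 DKC100
print(device_name("PKB", 100, pac=4), device_name("PKC", 100, pac=4))  # $4$DKA100 $4$DKA100
```

With the same PAC assigned on both nodes, the shared disk gets one name cluster-wide no matter which port letter each node happens to use.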
This information is also described in the Cluster Systems and
Guidelines for OpenVMS Cluster Configurations manuals.
The OpenVMS DCL commands SET HOST/DUP and SET HOST/HSC are used to connect to storage controllers via the Diagnostics and Utility Protocol (DUP). These commands require that the FYDRIVER device driver be connected. This device driver connection is typically performed by adding the following command(s) into the system startup command procedure:
On OpenVMS Alpha:
On OpenVMS VAX:
Alternatives to the DCL SET HOST/DUP command include the console SET HOST command available on various mid- to recent-vintage VAX consoles:
Access to Parameters on an Embedded DSSI controller:
Access to Directory of tools on an Embedded DSSI controller:
Access to Parameters on a KFQSA DSSI controller:
These console commands are available on most MicroVAX and VAXstation 3xxx series systems, and most (all?) VAX 4xxx series systems. For further information, see the system documentation and---on most VAX systems---see the console HELP text.
EK-410AB-MG, _DSSI VAXcluster Installation and Troubleshooting_, is a good resource for setting up a DSSI VMScluster on OpenVMS VAX nodes. (This manual predates coverage of OpenVMS Alpha systems, but gives good coverage to all hardware and software aspects of setting up a DSSI-based VMScluster---and most of the concepts covered are directly applicable to OpenVMS Alpha systems. This manual specifically covers the hardware, which is something not covered by the standard OpenVMS VMScluster documentation.)
If you want to renumber or rename DSSI disks or DSSI tapes, it's easy---if you know the secret incantation...
From the console on most 3000- and 4000-class VAX systems... (Obviously, the system must be halted for these commands...)
For information on how to get out into the PARAMS subsystem, also see the HELP at the console prompt for the SET HOST syntax, or see the HELP on SET HOST /DUP (once you've connected FYDRIVER under OpenVMS).
Once you are out into the PARAMS subsystem, you can use the FORCEUNI option to force the use of the UNITNUM value and then set a unique UNITNUM inside each DSSI ISE; this causes each DSSI ISE to use the specified unit number and not use the DSSI node as the unit number. Other parameters of interest are NODENAME and ALLCLASS, the node name and the (disk or tape) cluster allocation class.
Ensure that all disk unit numbers used within an OpenVMS Cluster disk
allocation class are unique, and all tape unit numbers used within an
OpenVMS Cluster tape allocation class are also unique. For details on
the SCS name of the OpenVMS host, see Section 5.7. For details of SET
HOST/DUP, see Section 15.6.3.
15.6.6 Which files must be shared in an OpenVMS Cluster?
The following files are expected to be common across all nodes in a cluster environment, and though SYSUAF is very often common, it can also be carefully coordinated---with matching UIC values and matching binary identifier values across all copies. (The most common use of multiple SYSUAF files is to allow different quotas on different nodes. In any event, the binary UIC values and the binary identifier values must be coordinated across all SYSUAF files, and must match the RIGHTSLIST file.) In addition to the list of files (and directories, in some cases) shown in Table 15-1, please review the VMScluster documentation, and the System Management documentation.
In addition to the documentation, also see the current version of the file SYS$STARTUP:SYLOGICALS.TEMPLATE. Specifically, please see the most recent version of this file available, starting on or after OpenVMS V7.2.
A failure to have common or (in the case of multiple SYSUAF files) synchronized files can cause problems with batch operations, with the SUBMIT/USER command, with general cluster alias operations, and with various SYSMAN and related operations. Object protections and defaults will not necessarily be consistent, either. Skewed files, and particularly skewed binary identifier values, can also lead to system security problems, including unintended access denials and unintended object accesses.
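If you do run multiple SYSUAF files, the coordination requirement amounts to a consistency check like the one sketched here. The sketch assumes you have already extracted username and UIC pairs from each copy by some means of your own; the copy labels, usernames, UIC values, and the find_uic_mismatches function are hypothetical, and nothing here reads a real SYSUAF file.

```python
from typing import Dict, Tuple

UIC = Tuple[int, int]  # (group, member)


def find_uic_mismatches(copies: Dict[str, Dict[str, UIC]]) -> Dict[str, Dict[str, UIC]]:
    """Return, per username, the UIC seen in each copy when the copies disagree."""
    by_user: Dict[str, Dict[str, UIC]] = {}
    for copy_label, records in copies.items():
        for username, uic in records.items():
            by_user.setdefault(username, {})[copy_label] = uic
    # Keep only usernames whose UIC is not identical in every copy that has them.
    return {
        user: seen
        for user, seen in by_user.items()
        if len(set(seen.values())) > 1
    }


# Hypothetical extracts from two SYSUAF copies; UICs written in octal.
copies = {
    "NODEA": {"SMITH": (0o200, 0o10), "JONES": (0o200, 0o11)},
    "NODEB": {"SMITH": (0o200, 0o10), "JONES": (0o201, 0o11)},
}
print(sorted(find_uic_mismatches(copies)))  # ['JONES']
```

A username that exists in only one copy is not reported, since differing account lists (and differing quotas) are a legitimate use of multiple SYSUAF files; only conflicting UIC values are flagged. The same kind of check applies to binary identifier values against the RIGHTSLIST file.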