
HP OpenVMS Systems

ask the wizard

Multi-Site Cluster Failure Scenario?


The Question is:

 
I have a multi-site FDDI VAXcluster: Site A has 2 VAX 7000s and an SW800 in CI
 configuration, and Site B has 7 VAX 7000s and an SW800, also in CI configuration.
 All servers use FDDI as the cluster interconnect and Ethernet for network access.
 Each server contributes 1 vote and quorum is 3. Site A has 2 votes and Site B has 3.
 
I had a power outage recently and both the FDDI and Network switches went
 down. All servers and storage were up except for 1 of the 3 servers in Site
 B.
 
The entire cluster hung for a while and later all SW800 volumes were
 software-disabled. When power was restored and the FDDI and Network switches
 were powered up, the down server was also powered up but failed to boot
 completely because it was unable to mount the disabled volumes. I had to shut
 down - forced shutdown - all the servers and reboot them all to bring the
 cluster back up.
 
Questions:
1) Will losing both FDDI and Network connectivity cause the entire cluster to
 hang?
2) If Site A had a higher vote total (say 4) than Site B (3 currently), would
 the Site A servers still hang in this scenario?
3) In what situation would cluster partitioning occur?
 
Your answers to these, and any related input on this power outage problem,
 would be greatly appreciated.
Thank you.
 


The Answer is :

 
  Your stated facts are unclear or contradictory.  You first claim 2 nodes
  at Site A and 7 at Site B, each with 1 vote -- but then say there are only
  3 votes at Site B.  Apparently, only 3 of the systems at Site B have 1 vote
  each and the other 4 have 0 votes.  The resulting total of 5 votes is
  consistent with the stated quorum of 3.  Perhaps the 7 was a typo?
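 
  For reference, OpenVMS computes quorum as (total votes + 2) / 2, rounded
  down -- so 5 votes gives a quorum of 3, and 7 votes would give 4.  One way
  to confirm what each member is actually contributing is the following
  minimal sketch (the exact display fields vary by OpenVMS version):
 
      $ ! This node's own settings, as currently active in memory
      $ RUN SYS$SYSTEM:SYSGEN
      SYSGEN> USE ACTIVE
      SYSGEN> SHOW VOTES
      SYSGEN> SHOW EXPECTED_VOTES
      SYSGEN> EXIT
      $ ! Cluster-wide view; in SHOW CLUSTER/CONTINUOUS you can ADD VOTES
      $ ! and ADD QUORUM to see per-member votes and the current quorum
      $ SHOW CLUSTER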
 
  Did the power outage really hit both sites concurrently?  The OpenVMS Wizard
  suspects that only one site actually had a power problem, but that the
  problem isolated the two sites from each other.
 
  Site A systems should enter a quorum hang (block), as that lobe has only 2
  votes in total.  Site B systems would have stayed online if all 3 voting
  servers there had been up.  Why was the one system down at Site B?  With
  that system down, the two remaining voting systems at Site B also trigger
  a quorum hang.
 
  If all of the cluster interconnects go down, then the isolated nodes all
  encounter quorum hangs (blocks), as none of the nodes has the 3 required
  votes by itself.  This is expected behavior.
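 
  As an aside, if voting members are removed deliberately (for scheduled
  maintenance, for example), the remaining members can be told to recompute
  quorum so that a later failure does not block the whole cluster.  A minimal
  sketch, assuming the surviving members still hold quorum when the command
  is issued:
 
      $ ! Recalculate quorum from the votes of the members currently present
      $ SET CLUSTER/EXPECTED_VOTES
 
  This does not help once a quorum hang is already in progress, since the
  blocked systems cannot execute DCL commands at that point.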
 
  The system parameter MVTIMEOUT specifies how long a volume can remain in
  mount verification before it is marked unavailable.  After the mount verify
  timeout occurs, only a dismount and remount can make the volume available
  again.  If the volume comes back online before MVTIMEOUT expires, the
  stalled I/Os are simply reissued and the applications pick up from where
  they left off.  You could consider increasing MVTIMEOUT to tolerate longer
  temporary outages.
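 
  A minimal sketch of examining and raising MVTIMEOUT; the value shown is
  only an example:
 
      $ ! Raise MVTIMEOUT to (for example) 7200 seconds, i.e. two hours
      $ RUN SYS$SYSTEM:SYSGEN
      SYSGEN> USE ACTIVE
      SYSGEN> SHOW MVTIMEOUT
      SYSGEN> SET MVTIMEOUT 7200
      SYSGEN> WRITE ACTIVE
      SYSGEN> WRITE CURRENT
      SYSGEN> EXIT
 
  WRITE ACTIVE applies the new value to the running system (MVTIMEOUT is a
  dynamic parameter); WRITE CURRENT preserves it across reboots.  If you keep
  the change, also record it in SYS$SYSTEM:MODPARAMS.DAT so a later AUTOGEN
  run does not quietly undo it.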
 
  Although you might have been able to recover without rebooting all of the
  systems in the cluster, it was probably easier to do so.  Usually when a
  volume is made unavailable, the stalled I/Os fail back to the applications,
  which report the error and exit.  Typically, you can then dismount the
  volumes and remount them.  The problem comes when an application keeps a
  channel open to the volume despite the state change.  Finding and stopping
  all such applications cluster-wide can be tedious.
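 
  A minimal sketch of that manual recovery, using an illustrative device
  name and volume label:
 
      $ ! See which processes still have files open on the dead volume
      $ SHOW DEVICE/FILES $1$DUA10:
      $ ! Once those processes have been stopped, dismount and remount
      $ ! cluster-wide
      $ DISMOUNT/CLUSTER $1$DUA10:
      $ MOUNT/CLUSTER $1$DUA10: USERDISK
 
  The device name $1$DUA10: and the label USERDISK are placeholders for
  your actual volumes or shadow sets.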
 
  Answers:
   1: Not necessarily.  Having the one system down at Site B was the problem.
 
   2: No.  With 4+3=7, the total votes would require quorum to be set to 4.
      Site A would stay online (if all systems there were up).
 
   3: Cluster partitioning -- where both sides are online, despite being
      unconnected -- happens most often when EXPECTED_VOTES is
      misconfigured.  It can also happen when a system manager forces
      the blocked systems to recalculate quorum dynamically.  (For more
      details on VOTES and EXPECTED_VOTES, please see the OpenVMS FAQ.)
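 
  To reduce the risk of such misconfiguration, it is good practice to keep
  VOTES and EXPECTED_VOTES in MODPARAMS.DAT on every member and let AUTOGEN
  apply them.  A minimal sketch for the five-vote configuration described
  above (the values are only examples):
 
      $ ! In SYS$SYSTEM:MODPARAMS.DAT on each cluster member:
      $ !     VOTES = 1              ! 1 on voting members, 0 on the others
      $ !     EXPECTED_VOTES = 5     ! sum of VOTES across the entire cluster
      $ @SYS$UPDATE:AUTOGEN GETDATA SETPARAMS NOFEEDBACK
 
  The new parameter values take effect at the next reboot of each member.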
 

answer written or last revised on ( 1-JUL-2003 )
