Saturday 17 May, 2008 - 21:56
/var/log/messages shows the following after starting richmond2:
richmond2 logger: Oracle CSS daemon failed to start up. Check CRS logs for diagnostics.
/u00/crs/oracle/product/10/app/log/richmond2/cssd/ocssd.log shows a copious amount of information, amongst which is:
>USER: Oracle Database 10g CSS Release 10.2.0.1.0 Production Copyright 1996, 2004 Oracle. All rights reserved.
>USER: CSS daemon log for node richmond2, number 2, in cluster richmond
[ clsdmt]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=richmond2DBG_CSSD))
>TRACE: clssscmain: local-only set to false
>TRACE: clssnmReadNodeInfo: added node 1 (richmond1) to cluster
>TRACE: clssnmReadNodeInfo: added node 2 (richmond2) to cluster
>TRACE: clssnm_skgxnmon: skgxn init failed, rc 1
>TRACE: clssnm_skgxnonline: Using vacuous skgxn monitor
>TRACE: clssnmDiskStateChange: state from 1 to 2 disk (0//dev/raw/raw2)
>TRACE: clssnmDiskStateChange: state from 1 to 2 disk (1//dev/raw/raw17)
>TRACE: clssnmDiskStateChange: state from 1 to 2 disk (2//dev/raw/raw32)
>TRACE: clssnmDiskStateChange: state from 2 to 4 disk (0//dev/raw/raw2)
>TRACE: clssnmDiskStateChange: state from 2 to 4 disk (1//dev/raw/raw17)
>TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(8321) LATS(0) Disk lastSeqNo(8321)
>TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(8321) LATS(0) Disk lastSeqNo(8321)
>TRACE: clssnmDiskStateChange: state from 2 to 4 disk (2//dev/raw/raw32)
>TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(8322) LATS(0) Disk lastSeqNo(8322)
>TRACE: clssnmFatalInit: fatal mode enabled
>TRACE: clssnmconnect: connecting to node 2, flags 0x0001, connector 1
>TRACE: clssnmconnect: connecting to node 0, flags 0x0000, connector 1
>TRACE: clssnmconnect: connecting to node 1, flags 0x0001, connector 0
>TRACE: clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=Oracle_CSS_LclLstnr_richmond_2))
>TRACE: clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_richmond2_richmond))
>TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(8322) LATS(9049470) Disk lastSeqNo(8322)
...
>TRACE: clsc_send_msg: (0x96679a8) NS err (12571, 12560), transport (530, 113, 0)
>ERROR: clssnmInitialMsg: send failed, con (0x9667e20), rc 3
...
>TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(8380) LATS(9110800) Disk lastSeqNo(8380)
>TRACE: clssnmCheckDskInfo: detected active cluster node 1
>ERROR: clssnmCheckDskInfo: Found cluster with node 1, state 3, incarn 2
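What appears to be happening is a split-brain situation: richmond2 can see that richmond1 is alive in the cluster but is unable to communicate with it. richmond1's existence is seen through the clssnmReadDskHeartbeat reads of the voting disks (the three raw devices listed by clssnmDiskStateChange, not the OCR), where its write count (wrtcnt) keeps climbing, from 8321 to 8380 in the excerpt above. The network side fails at the same time: clssnmInitialMsg reports "send failed".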
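The disk-heartbeat reasoning can be checked mechanically. The following is a minimal Python sketch, not an Oracle tool: it pulls the wrtcnt value out of each clssnmReadDskHeartbeat line and reports whether the count advanced. The helper names (write_counts, still_writing) are made up for illustration, and the sample lines are copied from the ocssd.log excerpt above.

```python
import re

# Matches e.g. "clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(8321) ..."
HEARTBEAT = re.compile(r"clssnmReadDskHeartbeat: node\((\d+)\).*?wrtcnt\((\d+)\)")

def write_counts(lines):
    """Collect the wrtcnt values seen for each node number, in log order."""
    counts = {}
    for line in lines:
        m = HEARTBEAT.search(line)
        if m:
            counts.setdefault(int(m.group(1)), []).append(int(m.group(2)))
    return counts

def still_writing(values):
    """True if the write count advanced between the first and last sample."""
    return len(values) > 1 and values[-1] > values[0]

sample = [
    ">TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(8321) LATS(0) Disk lastSeqNo(8321)",
    ">TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(8322) LATS(0) Disk lastSeqNo(8322)",
    ">TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(8380) LATS(9110800) Disk lastSeqNo(8380)",
]

for node, values in write_counts(sample).items():
    print(node, values[0], values[-1], still_writing(values))
```

A rising count for node 1 means richmond1 is still writing its heartbeat to disk, even though the same log lines label it "down" from the network point of view.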
In other words, the interconnect is down for some reason. This is confirmed by the pings returning host unreachable messages.
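The clsc_send_msg line in the log appears to point the same way. The sketch below is a rough Python illustration that assumes the second number in the "transport (530, 113, 0)" triple is the operating system errno; that reading is an assumption, not something the log states. On Linux, errno 113 is EHOSTUNREACH ("No route to host"). The small lookup table is hardcoded with Linux values so the result does not depend on the machine running the script, and the function name os_error is hypothetical.

```python
import re

# Linux errno values, hardcoded so the lookup is independent of the host OS
LINUX_ERRNO = {
    110: "ETIMEDOUT (Connection timed out)",
    111: "ECONNREFUSED (Connection refused)",
    113: "EHOSTUNREACH (No route to host)",
}

# Matches the trailing "transport (a, b, c)" triple of a clsc_send_msg line
TRANSPORT = re.compile(r"transport \((\d+), (\d+), (\d+)\)")

def os_error(line):
    """Return a description of the presumed OS errno in a clsc_send_msg line."""
    m = TRANSPORT.search(line)
    if m is None:
        return None
    code = int(m.group(2))  # assumption: the second field is the OS errno
    return LINUX_ERRNO.get(code, f"errno {code} (not in table)")

line = ">TRACE: clsc_send_msg: (0x96679a8) NS err (12571, 12560), transport (530, 113, 0)"
print(os_error(line))
```

If that reading is right, the CSS daemon is hitting the same "host unreachable" condition that ping reports.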
The question now is why the interconnect is not responding.