RAC node reboot

Hi,
May I ask how to prevent split-brain from happening on a healthy two-node RAC? I understand that Oracle decides to restart one node based on the health of its network messaging, and on the other hand, I think that from 10g to 11g there are bugs that evict a node due to IPC timeouts.
Thanks

You should check the following to identify the actual issue:
refer to the database alert log and its associated trace files, and the ASM alert log and its associated trace files, then drill further down into the ocssd, crsd and evmd log files.
From the trace files you will get the reason for the node eviction, normally one of the following:
Reason 0 = No reconfiguration
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
Reason 4 = Reconfiguration after suspend
Once you know the reason, look for the cause and fix it. For troubleshooting and data gathering, refer to the relevant Metalink notes.
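As a starting point, a minimal set of checks might look like the following (the Clusterware home and host name below are placeholders; disktimeout only exists on later 10.2 patchsets):
ls <CRS_HOME>/log/<hostname>/cssd/ocssd.log
ls <CRS_HOME>/log/<hostname>/crsd/crsd.log
ls <CRS_HOME>/log/<hostname>/evmd/evmd.log
crsctl get css misscount
crsctl get css disktimeout
The ocssd.log on the surviving node is usually the first place the eviction reason shows up.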
Thanks.

Similar Messages

  • RAC node reboots from time to time

    Hi %,
    we have a problem with our RAC: it's a three-node RAC on SLES9, 64-bit. One node reboots from time to time. We found nothing in any log file, except this in /var/log/messages on node 1:
    "Feb 21 14:58:02 pmg-db1 kernel: o2net: connection to node pmg-db2 (num 1) at 192.168.0.2:7777 has been idle for 10 seconds, shutting it down."
    Has anyone had a similar problem, or does anyone have an idea?
    regards
    Andreas

    Sorry, no /var/log/dmesg.
    Perhaps I should add another detail: the third node was added after the two-node RAC had run for several months. First we had the reboot problem with this third node. We found out that its interconnect was connected to a 100 Mbit module of the switch and not to a 1000 Mbit module. We changed this a few days ago, but now the second node rebooted, and it is connected at 1000 Mbit/s.
    And did I mention that we use 10.2.0.2?
    regards
    Andreas
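    One quick way to confirm the negotiated speed of an interconnect NIC on Linux is ethtool (the interface name below is only an example):
    ethtool eth1 | grep -i speed
    A gigabit interconnect should report Speed: 1000Mb/s on every node.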

  • RAC node rebooting frequently

    Hi all,
    I am working on a two-node RAC environment. One of my RAC nodes is rebooting very frequently. I am using an Oracle 10g database and Clusterware (10.2.0.1).
    I have checked the OS logs (Linux AS 4) and the RAC-related logs and am not able to find anything. I am posting all the logs; please suggest.

    Hi, I am posting the Clusterware alert log, the OS log and the ocssd log.
    Clusterware alert log:
    [crsd(5649)]CRS-1201:CRSD started on node ctmisdb1.
    2012-03-21 09:50:38.188
    [cssd(7490)]CRS-1601:CSSD Reconfiguration complete. Active nodes are ctmisdb1 .
    2012-03-21 09:50:46.726
    [crsd(5649)]CRS-1204:Recovering CRS resources for node ctmisdb2.
    2012-03-21 09:55:21.760
    [cssd(7490)]CRS-1601:CSSD Reconfiguration complete. Active nodes are ctmisdb1 ctmisdb2 .
    2012-03-21 12:07:46.681
    [cssd(7426)]CRS-1605:CSSD voting file is online: /dev/raw/raw2. Details in /u01/app/oracle/product/crs/log/ctmisdb1/cssd/ocssd.log.
    2012-03-21 12:07:50.432
    [cssd(7426)]CRS-1601:CSSD Reconfiguration complete. Active nodes are ctmisdb1 ctmisdb2 .
    2012-03-21 12:07:50.893
    [crsd(5549)]CRS-1012:The OCR service started on node ctmisdb1.
    2012-03-21 12:07:50.942
    [evmd(7304)]CRS-1401:EVMD started on node ctmisdb1.
    2012-03-21 12:07:52.827
    [crsd(5549)]CRS-1201:CRSD started on node ctmisdb1.
    2012-03-21 12:48:41.908
    [cssd(7448)]CRS-1605:CSSD voting file is online: /dev/raw/raw2. Details in /u01/app/oracle/product/crs/log/ctmisdb1/cssd/ocssd.log.
    2012-03-21 12:48:45.741
    [cssd(7448)]CRS-1601:CSSD Reconfiguration complete. Active nodes are ctmisdb1 ctmisdb2 .
    2012-03-21 12:48:49.173
    [crsd(5546)]CRS-1012:The OCR service started on node ctmisdb1.
    2012-03-21 12:48:49.190
    [evmd(7328)]CRS-1401:EVMD started on node ctmisdb1.
    2012-03-21 12:48:50.818
    [crsd(5546)]CRS-1201:CRSD started on node ctmisdb1.
    2012-03-21 13:26:36.398
    [cssd(7343)]CRS-1605:CSSD voting file is online: /dev/raw/raw2. Details in /u01/app/oracle/product/crs/log/ctmisdb1/cssd/ocssd.log.
    2012-03-21 13:26:40.492
    [cssd(7343)]CRS-1601:CSSD Reconfiguration complete. Active nodes are ctmisdb1 ctmisdb2 .
    2012-03-21 13:26:40.939
    [crsd(5542)]CRS-1012:The OCR service started on node ctmisdb1.
    2012-03-21 13:26:40.977
    [evmd(7223)]CRS-1401:EVMD started on node ctmisdb1.
    2012-03-21 13:26:42.772
    [crsd(5542)]CRS-1201:CRSD started on node ctmisdb1.
    Node OS log:
    Mar 21 12:06:35 ctmisdb1 rc: Starting readahead: succeeded
    Mar 21 12:06:35 ctmisdb1 messagebus: messagebus startup succeeded
    Mar 21 12:06:36 ctmisdb1 cups-config-daemon: cups-config-daemon startup succeeded
    Mar 21 12:06:36 ctmisdb1 haldaemon: haldaemon startup succeeded
    Mar 21 12:06:37 ctmisdb1 fstab-sync[6267]: removed all generated mount points
    Mar 21 12:06:37 ctmisdb1 fstab-sync[6378]: added mount point /media/cdrecorder for /dev/hde
    Mar 21 12:06:37 ctmisdb1 su(pam_unix)[6323]: session opened for user oracle by (uid=0)
    Mar 21 12:06:37 ctmisdb1 su(pam_unix)[6324]: session opened for user oracle by (uid=0)
    Mar 21 12:06:37 ctmisdb1 su(pam_unix)[6229]: session opened for user oracle by (uid=0)
    Mar 21 12:06:37 ctmisdb1 su(pam_unix)[6229]: session closed for user oracle
    Mar 21 12:06:37 ctmisdb1 su(pam_unix)[6644]: session opened for user oracle by (uid=0)
    Mar 21 12:06:37 ctmisdb1 kernel: matroxfb: cannot set xres to 800, rounded up to 832
    Mar 21 12:06:37 ctmisdb1 last message repeated 2 times
    Mar 21 12:06:41 ctmisdb1 su(pam_unix)[6323]: session closed for user oracle
    Mar 21 12:06:41 ctmisdb1 su(pam_unix)[6644]: session closed for user oracle
    Mar 21 12:06:41 ctmisdb1 su(pam_unix)[6324]: session closed for user oracle
    Mar 21 12:06:41 ctmisdb1 logger: Cluster Ready Services completed waiting on dependencies.
    Mar 21 12:06:41 ctmisdb1 last message repeated 2 times
    Mar 21 12:06:45 ctmisdb1 gdm(pam_unix)[6379]: session opened for user root by (uid=0)
    Mar 21 12:06:46 ctmisdb1 gconfd (root-7052): starting (version 2.8.1), pid 7052 user 'root'
    Mar 21 12:06:47 ctmisdb1 gconfd (root-7052): Resolved address "xml:readonly:/etc/gconf/gconf.xml.mandatory" to a read-only configuration source at position 0
    Mar 21 12:06:47 ctmisdb1 gconfd (root-7052): Resolved address "xml:readwrite:/root/.gconf" to a writable configuration source at position 1
    Mar 21 12:06:47 ctmisdb1 gconfd (root-7052): Resolved address "xml:readonly:/etc/gconf/gconf.xml.defaults" to a read-only configuration source at position 2
    Mar 21 12:06:55 ctmisdb1 gconfd (root-7052): Resolved address "xml:readwrite:/root/.gconf" to a writable configuration source at position 0
    Mar 21 12:07:41 ctmisdb1 su(pam_unix)[5547]: session opened for user oracle by (uid=0)
    Mar 21 12:07:41 ctmisdb1 logger: Running CRSD with TZ =
    Mar 21 12:07:43 ctmisdb1 su(pam_unix)[7399]: session opened for user oracle by (uid=0)
    Mar 21 12:12:49 ctmisdb1 sshd(pam_unix)[15323]: session opened for user root by root(uid=0)
    Mar 21 12:12:57 ctmisdb1 su(pam_unix)[15531]: session opened for user oracle by root(uid=0)
    Mar 21 12:47:05 ctmisdb1 syslogd 1.4.1: restart.
    ocssd log:
    [    CSSD]2012-03-21 11:24:41.045 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800661f0c0) proc(0x8006622560) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 11:24:41.078 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660cfe0) proc(0x800662ba70) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:07:44.564 >USER: Oracle Database 10g CSS Release 10.2.0.1.0 Production Copyright 1996, 2004 Oracle. All rights reserved.
    [  clsdmt]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=ctmisdb1DBG_CSSD))
    [    CSSD]2012-03-21 12:07:44.564 >USER: CSS daemon log for node ctmisdb1, number 1, in cluster crs
    [    CSSD]2012-03-21 12:07:44.581 [28260544] >TRACE: clssscmain: local-only set to false
    [    CSSD]2012-03-21 12:07:44.603 [28260544] >TRACE: clssnmReadNodeInfo: added node 1 (ctmisdb1) to cluster
    [    CSSD]2012-03-21 12:07:44.621 [28260544] >TRACE: clssnmReadNodeInfo: added node 2 (ctmisdb2) to cluster
    [    CSSD]2012-03-21 12:07:44.627 [72925824] >TRACE: clssnm_skgxnmon: skgxn init failed, rc 1
    [    CSSD]2012-03-21 12:07:44.627 [28260544] >TRACE: clssnm_skgxnonline: Using vacuous skgxn monitor
    [    CSSD]2012-03-21 12:07:44.641 [28260544] >TRACE: clssnmInitNMInfo: misscount set to 60
    [    CSSD]2012-03-21 12:07:44.655 [28260544] >TRACE: clssnmDiskStateChange: state from 1 to 2 disk (0//dev/raw/raw2)
    [    CSSD]2012-03-21 12:07:46.661 [72925824] >TRACE: clssnmDiskStateChange: state from 2 to 4 disk (0//dev/raw/raw2)
    [    CSSD]2012-03-21 12:07:46.690 [72925824] >TRACE: clssnmReadDskHeartbeat: node(2) is down. rcfg(18) wrtcnt(7920) LATS(0) Disk lastSeqNo(7920)
    [    CSSD]2012-03-21 12:07:46.752 [28260544] >TRACE: clssnmFatalInit: fatal mode enabled
    [    CSSD]2012-03-21 12:07:46.752 [94777984] >TRACE: clssnmconnect: connecting to node 1, flags 0x0001, connector 1
    [    CSSD]2012-03-21 12:07:46.753 [94777984] >TRACE: clssnmconnect: connecting to node 0, flags 0x0000, connector 1
    [    CSSD]2012-03-21 12:07:46.753 [94777984] >TRACE: clssnmClusterListener: Probing node(2)
    [    CSSD]2012-03-21 12:07:46.755 [94777984] >TRACE: clssnmConnComplete: connected to node 2 (con 0x8006601040), state 3 birth 0, unique 1332303918/1332303918 prevConuni(0)
    [    CSSD]2012-03-21 12:07:46.756 [106332800] >TRACE: clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=Oracle_CSS_LclLstnr_crs_1))
    [    CSSD]2012-03-21 12:07:46.756 [106332800] >TRACE: clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_ctmisdb1_crs))
    [    CSSD]2012-03-21 12:07:46.757 [151810688] >TRACE: clssnmPollingThread: Connection complete
    [    CSSD]2012-03-21 12:07:46.757 [162296448] >TRACE: clssnmSendingThread: Connection complete
    [    CSSD]2012-03-21 12:07:46.757 [172782208] >TRACE: clssnmRcfgMgrThread: Connection complete
    [    CSSD]2012-03-21 12:07:46.757 [172782208] >TRACE: clssnmRcfgMgrThread: Local Join
    [    CSSD]2012-03-21 12:07:46.757 [172782208] >WARNING: clssnmLocalJoinEvent: takeover aborted due to connected but inactive nodes
    [    CSSD]2012-03-21 12:07:47.339 [94777984] >TRACE: clssnmHandleSync: Acknowledging sync: src[2] srcName[ctmisdb2] seq[5] sync[18]
    [    CSSD]2012-03-21 12:07:47.759 [172782208] >TRACE: clssnmRcfgMgrThread: lastleader(2) unique(1332311864)
    [    CSSD]2012-03-21 12:07:48.341 [94777984] >TRACE: clssnmSendVoteInfo: node(2) syncSeqNo(18)
    [    CSSD]2012-03-21 12:07:50.346 [94777984] >TRACE: clssnmUpdateNodeState: node 0, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
    [    CSSD]2012-03-21 12:07:50.346 [94777984] >TRACE: clssnmDeactivateNode: node 0 () left cluster
    [    CSSD]2012-03-21 12:07:50.346 [94777984] >TRACE: clssnmUpdateNodeState: node 1, state (1/2) unique (1332311864/1332311864) prevConuni(0) birth (0/18) (old/new)
    [    CSSD]2012-03-21 12:07:50.346 [94777984] >TRACE: clssnmUpdateNodeState: node 2, state (4/3) unique (1332303918/1332303918) prevConuni(0) birth (0/16) (old/new)
    [    CSSD]2012-03-21 12:07:50.346 [94777984] >USER: clssnmHandleUpdate: SYNC(18) from node(2) completed
    [    CSSD]2012-03-21 12:07:50.346 [94777984] >USER: clssnmHandleUpdate: NODE 1 (ctmisdb1) IS ACTIVE MEMBER OF CLUSTER
    [    CSSD]2012-03-21 12:07:50.346 [94777984] >USER: clssnmHandleUpdate: NODE 2 (ctmisdb2) IS ACTIVE MEMBER OF CLUSTER
    [    CSSD]2012-03-21 12:07:50.429 [28260544] >USER: NMEVENT_SUSPEND [00][00][00][00]
    [    CSSD]2012-03-21 12:07:50.429 [183267968] >TRACE: clssgmReconfigThread: started for reconfig (18)
    [    CSSD]2012-03-21 12:07:50.429 [183267968] >USER: NMEVENT_RECONFIG [00][00][00][06]
    [    CSSD]2012-03-21 12:07:50.429 [183267968] >TRACE: clssgmEstablishConnections: 2 nodes in cluster incarn 18
    [    CSSD]2012-03-21 12:07:50.430 [140255872] >TRACE: clssgmInitialRecv: (0x102a0360) accepted a new connection from node 2 born at 16 active (2, 2), vers (10,3,1,2)
    [    CSSD]2012-03-21 12:07:50.430 [140255872] >TRACE: clssgmInitialRecv: conns done (2/2)
    [    CSSD]2012-03-21 12:07:50.430 [183267968] >TRACE: clssgmEstablishMasterNode: MASTER for 18 is node(2) birth(16)
    [    CSSD]2012-03-21 12:07:50.430 [183267968] >TRACE: clssgmChangeMasterNode: requeued 0 RPCs
    [    CSSD]2012-03-21 12:07:50.432 [140255872] >TRACE: clssgmHandleDBDone(): src/dest (2/65535) size(72) incarn 18
    [    CSSD]CLSS-3000: reconfiguration successful, incarnation 18 with 2 nodes
    [    CSSD]CLSS-3001: local node number 1, master node number 2
    [    CSSD]2012-03-21 12:07:50.433 [183267968] >TRACE: clssgmReconfigThread: completed for reconfig(18), with status(1)
    [    CSSD]2012-03-21 12:07:50.550 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006603bb0) proc(0x8006608b00) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:07:50.551 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x80066066f0) proc(0x8006608d70) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:07:53.569 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660ec70) proc(0x8006611260) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:08:00.829 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006610990) proc(0x800660de00) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:08:04.698 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006613030) proc(0x8006612930) pid(8115) proto(10:2:1:1)
    [    CSSD]2012-03-21 12:08:04.816 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006612950) proc(0x8006613c20) pid(8115) proto(10:2:1:1)
    [    CSSD]2012-03-21 12:08:04.832 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006612950) proc(0x8006613c20) pid(8115) proto(10:2:1:1)
    [    CSSD]2012-03-21 12:08:06.615 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006612950) proc(0x8006613c20) pid(8171) proto(10:2:1:1)
    [    CSSD]2012-03-21 12:08:07.114 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006615960) proc(0x8006616350) pid(8175) proto(10:2:1:1)
    [    CSSD]2012-03-21 12:08:11.373 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x80066192a0) proc(0x8006619470) pid(8302) proto(10:2:1:1)
    [    CSSD]2012-03-21 12:08:11.669 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800661bf60) proc(0x800661ee20) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:08:17.135 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800661bf60) proc(0x800661ee70) pid(8458) proto(10:2:1:1)
    [    CSSD]2012-03-21 12:08:17.268 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800661fc00) proc(0x80066220d0) pid(8460) proto(10:2:1:1)
    [    CSSD]2012-03-21 12:08:17.305 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x80066223e0) proc(0x8006625250) pid(8462) proto(10:2:1:1)
    [    CSSD]2012-03-21 12:08:17.353 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006625560) proc(0x8006628430) pid(8464) proto(10:2:1:1)
    [    CSSD]2012-03-21 12:08:24.585 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006625560) proc(0x8006628430) pid(8645) proto(10:2:1:1)
    [    CSSD]2012-03-21 12:08:27.957 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006628740) proc(0x800662b610) pid(8722) proto(10:2:1:1)
    [    CSSD]2012-03-21 12:08:30.931 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800662cce0) proc(0x800662c860) pid(8801) proto(10:2:1:1)
    [    CSSD]2012-03-21 12:08:36.400 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800661c5f0) proc(0x800661eb50) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:08:37.863 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800662f1c0) proc(0x800661eee0) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:08:38.537 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800662f1c0) proc(0x800661d500) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:08:39.232 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800661bf60) proc(0x800661d500) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:08:43.085 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x8006611210) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:08:58.971 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b830) proc(0x80066112c0) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:09:59.290 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x800660b190) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:10:59.589 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x800660b190) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:11:59.904 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x800660b190) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:13:00.203 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x800660b190) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:13:14.029 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b830) proc(0x800660b190) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:14:00.501 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x8006611210) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:15:00.809 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x8006628670) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:16:01.117 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x8006628670) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:17:01.447 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x800662f0f0) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:18:01.762 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x8006628670) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:18:39.841 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x8006628670) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:18:42.123 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b830) proc(0x8006628670) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:18:42.316 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x8006628670) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:18:42.843 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x8006628670) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:18:42.963 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b830) proc(0x8006628670) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:18:43.098 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b260) proc(0x800662bd20) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:18:44.173 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b830) proc(0x8006628670) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:18:44.368 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b260) proc(0x800660b310) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:18:45.351 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b830) proc(0x8006628670) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:18:46.236 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b830) proc(0x800662f0f0) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:18:47.031 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b830) proc(0x800662f0f0) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:18:47.694 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b830) proc(0x800662f0f0) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:18:47.819 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b260) proc(0x800660b310) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:18:48.103 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b830) proc(0x800662f0f0) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:18:48.327 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b260) proc(0x800660b310) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:18:48.484 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b830) proc(0x8006611210) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:18:48.758 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x800662f0f0) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:18:49.529 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b830) proc(0x800662f0f0) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:18:50.509 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x800662f0f0) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:18:51.060 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b830) proc(0x800662f0f0) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:18:51.558 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x800662f0f0) pid() proto(10:2:1:1)
    [    CSSD]2012-03-21 12:48:39.836 >USER: Oracle Database 10g CSS Release 10.2.0.1.0 Production Copyright 1996, 2004 Oracle. All rights reserved.
    [  clsdmt]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=ctmisdb1DBG_CSSD))
    [    CSSD]2012-03-21 12:48:39.836 >USER: CSS daemon log for node ctmisdb1, number 1, in cluster crs
    [    CSSD]2012-03-21 12:48:39.849 [28260544] >TRACE: clssscmain: local-only set to false
    [    CSSD]2012-03-21 12:48:39.865 [28260544] >TRACE: clssnmReadNodeInfo: added node 1 (ctmisdb1) to cluster
    [    CSSD]2012-03-21 12:48:39.872 [28260544] >TRACE: clssnmReadNodeInfo: added node 2 (ctmisdb2) to cluster
    [    CSSD]2012-03-21 12:48:39.879 [72925824] >TRACE: clssnm_skgxnmon: skgxn init failed, rc 1
    [    CSSD]2012-03-21 12:48:39.879 [28260544] >TRACE: clssnm_skgxnonline: Using vacuous skgxn monitor
    [    CSSD]2012-03-21 12:48:39.881 [28260544] >TRACE: clssnmInitNMInfo: misscount set to 60
    [    CSSD]2012-03-21 12:48:39.888 [28260544] >TRACE: clssnmDiskStateChange: state from 1 to 2 disk (0//dev/raw/raw2)
    [    CSSD]2012-03-21 12:48:41.892 [72925824] >TRACE: clssnmDiskStateChange: state from 2 to 4 disk (0//dev/raw/raw2)
    [    CSSD]2012-03-21 12:48:41.915 [72925824] >TRACE: clssnmReadDskHeartbeat: node(2) is down. rcfg(20) wrtcnt(10367) LATS(0) Disk lastSeqNo(10367)
    [    CSSD]2012-03-21 12:48:41.959 [28260544] >TRACE: clssnmFatalInit: fatal mode enabled
    [    CSSD]2012-03-21 12:48:41.959 [94777984] >TRACE: clssnmconnect: connecting to node 1, flags 0x0001, connector 1
    [    CSSD]2012-03-21 12:48:41.959 [94777984] >TRACE: clssnmconnect: connecting to node 0, flags 0x0000, connector 1
    [    CSSD]2012-03-21 12:48:41.959 [94777984] >TRACE: clssnmClusterListener: Probing node(2)
    [    CSSD]2012-03-21 12:48:41.961 [94777984] >TRACE: clssnmConnComplete: connected to node 2 (con 0x8006702790), state 3 birth 0, unique 1332303918/1332303918 prevConuni(0)
    [    CSSD]2012-03-21 12:48:41.962 [106332800] >TRACE: clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=Oracle_CSS_LclLstnr_crs_1))
    [    CSSD]2012-03-21 12:48:41.962 [106332800] >TRACE: clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_ctmisdb1_crs))
    [    CSSD]2012-03-21 12:48:41.963 [152330880] >TRACE: clssnmPollingThread: Connection complete
    [    CSSD]2012-03-21 12:48:41.963 [162816640] >TRACE: clssnmSendingThread: Connection complete
    [    CSSD]2012-03-21 12:48:41.963 [173302400] >TRACE: clssnmRcfgMgrThread: Connection complete
    [    CSSD]2012-03-21 12:48:41.963 [173302400] >TRACE: clssnmRcfgMgrThread: Local Join
    [    CSSD]2012-03-21 12:48:41.963 [173302400] >WARNING: clssnmLocalJoinEvent: takeover aborted due to connected but inactive nodes
    [    CSSD]2012-03-21 12:48:42.631 [94777984] >TRACE: clssnmHandleSync: Acknowledging sync: src[2] srcName[ctmisdb2] seq[13] sync[20]
    [    CSSD]2012-03-21 12:48:42.965 [173302400] >TRACE: clssnmRcfgMgrThread: lastleader(2) unique(1332314319)
    [    CSSD]2012-03-21 12:48:43.636 [94777984] >TRACE: clssnmSendVoteInfo: node(2) syncSeqNo(20)
    [    CSSD]2012-03-21 12:48:45.640 [94777984] >TRACE: clssnmUpdateNodeState: node 0, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
    [    CSSD]2012-03-21 12:48:45.640 [94777984] >TRACE: clssnmDeactivateNode: node 0 () left cluster
    [    CSSD]2012-03-21 12:48:45.640 [94777984] >TRACE: clssnmUpdateNodeState: node 1, state (1/2) unique (1332314319/1332314319) prevConuni(0) birth (0/20) (old/new)
    [    CSSD]2012-03-21 12:48:45.640 [94777984] >TRACE: clssnmUpdateNodeState: node 2, state (4/3) unique (1332303918/1332303918) prevConuni(0) birth (0/16) (old/new)
    [    CSSD]2012-03-21 12:48:45.640 [94777984] >USER: clssnmHandleUpdate: SYNC(20) from node(2) completed
    [    CSSD]2012-03-21 12:48:45.640 [94777984] >USER: clssnmHandleUpdate: NODE 1 (ctmisdb1) IS ACTIVE MEMBER OF CLUSTER
    [    CSSD]2012-03-21 12:48:45.640 [94777984] >USER: clssnmHandleUpdate: NODE 2 (ctmisdb2) IS ACTIVE MEMBER OF CLUSTER
    [    CSSD]2012-03-21 12:48:45.737 [28260544] >USER: NMEVENT_SUSPEND [00][00][00][00]
    [    CSSD]2012-03-21 12:48:45.738 [183788160] >TRACE: clssgmReconfigThread: started for reconfig (20)
    [    CSSD]2012-03-21 12:48:45.738 [183788160] >USER: NMEVENT_RECONFIG [00][00][00][06]
    [    CSSD]2012-03-21 12:48:45.738 [183788160] >TRACE: clssgmEstablishConnections: 2 nodes in cluster incarn 20
    [    CSSD]2012-03-21 12:48:45.739 [140776064] >TRACE: clssgmInitialRecv: (0x102a0370) accepted a new connection from node 2 born at 16 active (2, 2), vers (10,3,1,2)
    [    CSSD]2012-03-21 12:48:45.739 [140776064] >TRACE: clssgmInitialRecv: conns done (2/2)
    [    CSSD]2012-03-21 12:48:45.739 [183788160] >TRACE: clssgmEstablishMasterNode: MASTER for 20 is node(2) birth(16)
    [    CSSD]2012-03-21 12:48:45.739 [183788160] >TRACE: clssgmChangeMasterNode: requeued 0 RPCs
    [    CSSD]2012-03-21 12:48:45.741 [140776064] >TRACE: clssgmHandleDBDone(): src/dest (2/65535) size(72) incarn 20
    [    CSSD]CLSS-3000: reconfiguration successful, incarnation 20 with 2 nodes
    Please check and help.

  • RAC nodes rebooting

    I'm a newbie, trying to implement 11g RAC using Openfiler on Enterprise Linux 5.3.
    I have so far successfully configured Openfiler, created the volumes, configured the nodes, and configured OCFS2 and ASM.
    When I rebooted the machines, I first started the Openfiler server and the external storage; they start fine and all the volumes (devices) come up fine. But when I boot the nodes one after the other, they keep rebooting continuously after a couple of minutes, one after the other. I am clueless about how to figure out what the problem is and why this is happening. Has anyone else experienced a similar situation? How can it be resolved?
    I would appreciate any advice or help.
    Thanks

    What is the difference in clock timings between your RAC nodes? Anything > 45 seconds can possibly cause reboots.
    Check your disk timeouts and hangcheck-timer settings.
    hth
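    A minimal sketch of those checks (the hangcheck-timer parameter location is the usual one on EL5; verify on your system):
    date                               (run on each node and compare the clocks)
    crsctl get css misscount
    crsctl get css disktimeout
    grep hangcheck /etc/modprobe.conf  (shows hangcheck_tick / hangcheck_margin if they were set)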

  • If use MSSQ , when oracle rac node reboot, client get TPEOS error

    Hi, all
    In my Tuxedo application, if we use Single Server, Single Queue mode, then when we reboot any Oracle RAC node our application is OK and the client gets correct results. If we use MSSQ (Multi Server, Single Queue) and the Oracle RAC nodes are OK, our application is also OK. But if we reboot any Oracle RAC node, the client program can continue to run and gets correct results, yet it always gets a TPEOS error; in this situation the server receives the client request, but the client does not get the server reply, only a TPEOS error.
    Our environment is:
    Oracle RAC 10g 10.2.0.4, two instances (rac1, rac2), and two DTP services, s1 and s2; TAF for s1 and s2 is set to BASIC
    Tuxedo 10gR3, two nodes, working in MP mode, using XA to access the Oracle RAC database; the services are both transactional and non-transactional
    OS is Linux AS4 U5, 64-bit
    The server programs use OCI.
    Has anyone encountered this problem?

    Hi, first, thank you.
    The ULOG file only contains failover information and no other error messages; the client side also shows no other errors.
    When we do not use MSSQ, the relevant part of the UBB file is:
    SERVERS
    DEFAULT:
    CLOPT="-A "
    sinUpdate_server SRVGRP=GROUP11 SRVID=80 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    sinUpdate_server SRVGRP=GROUP12 SRVID=160 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    sinCount_server SRVGRP=GROUP11 SRVID=240 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    sinCount_server SRVGRP=GROUP12 SRVID=320 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    sinSelect_server SRVGRP=GROUP11 SRVID=360 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    sinSelect_server SRVGRP=GROUP12 SRVID=400 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    sinInsert_server SRVGRP=GROUP11 SRVID=520 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    sinInsert_server SRVGRP=GROUP12 SRVID=560 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    sinDelete_server SRVGRP=GROUP11 SRVID=600 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    sinDelete_server SRVGRP=GROUP12 SRVID=640 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    sinDdl_server SRVGRP=GROUP11 SRVID=700 MIN=5 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    sinDdl_server SRVGRP=GROUP12 SRVID=740 MIN=5 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    lockselect_server SRVGRP=GROUP11 SRVID=800 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    lockselect_server SRVGRP=GROUP12 SRVID=840 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    #mulup_server SRVGRP=GROUP11 SRVID=1 MIN=2 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    #mulup_server SRVGRP=GROUP12 SRVID=60 MIN=2 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    sinUpdate_server SRVGRP=GROUP13 SRVID=83 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    sinUpdate_server SRVGRP=GROUP14 SRVID=164 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    sinCount_server SRVGRP=GROUP13 SRVID=243 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    sinCount_server SRVGRP=GROUP14 SRVID=324 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    sinSelect_server SRVGRP=GROUP13 SRVID=363 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    sinSelect_server SRVGRP=GROUP14 SRVID=404 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    sinInsert_server SRVGRP=GROUP13 SRVID=523 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    sinInsert_server SRVGRP=GROUP14 SRVID=564 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    sinDelete_server SRVGRP=GROUP13 SRVID=603 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    sinDelete_server SRVGRP=GROUP14 SRVID=644 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    sinDdl_server SRVGRP=GROUP13 SRVID=703 MIN=5 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    sinDdl_server SRVGRP=GROUP14 SRVID=744 MIN=5 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    lockselect_server SRVGRP=GROUP13 SRVID=803 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    lockselect_server SRVGRP=GROUP14 SRVID=844 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    #mulup_server SRVGRP=GROUP13 SRVID=13 MIN=2 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    #mulup_server SRVGRP=GROUP14 SRVID=64 MIN=2 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y
    WSL SRVGRP=GROUP11 SRVID=1000
    CLOPT="-A -- -n//120.3.8.237:7200 -I 60 -T 60 -w WSH -m 50 -M 100 -x 6 -N 3600"
    WSL SRVGRP=GROUP12 SRVID=1001
    CLOPT="-A -- -n//120.3.8.238:7200 -I 60 -T 60 -w WSH -m 50 -M 100 -x 6 -N 3600"
    WSL SRVGRP=GROUP13 SRVID=1003
    CLOPT="-A -- -n//120.3.8.237:7203 -I 60 -T 60 -w WSH -m 50 -M 100 -x 6 -N 3600"
    WSL SRVGRP=GROUP14 SRVID=1004
    CLOPT="-A -- -n//120.3.8.238:7204 -I 60 -T 60 -w WSH -m 50 -M 100 -x 6 -N 3600"
    If we use MSSQ, the relevant part of the UBB file is:
    *SERVERS
    DEFAULT:
    CLOPT="-A -p 1,60:1,30"
    sinUpdate_server SRVGRP=GROUP11 SRVID=80 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=sinUpdate11 REPLYQ=Y
    sinUpdate_server SRVGRP=GROUP12 SRVID=160 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=sinUpdate12 REPLYQ=Y
    sinCount_server SRVGRP=GROUP11 SRVID=240 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=sinCount11 REPLYQ=Y
    sinCount_server SRVGRP=GROUP12 SRVID=320 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=sinCount12 REPLYQ=Y
    sinSelect_server SRVGRP=GROUP11 SRVID=360 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=sinSelec11 REPLYQ=Y
    sinSelect_server SRVGRP=GROUP12 SRVID=400 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=sinSelect12 REPLYQ=Y
    sinInsert_server SRVGRP=GROUP11 SRVID=520 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=sinInsert11 REPLYQ=Y
    sinInsert_server SRVGRP=GROUP12 SRVID=560 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=sinInsert12 REPLYQ=Y
    sinDelete_server SRVGRP=GROUP11 SRVID=600 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=sinDelete11 REPLYQ=Y
    sinDelete_server SRVGRP=GROUP12 SRVID=640 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=sinDelete12 REPLYQ=Y
    sinDdl_server SRVGRP=GROUP11 SRVID=700 MIN=5 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=sinDdl11 REPLYQ=Y
    sinDdl_server SRVGRP=GROUP12 SRVID=740 MIN=5 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=sinDdl12 REPLYQ=Y
    lockselect_server SRVGRP=GROUP11 SRVID=800 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=lockselect11 REPLYQ=Y
    lockselect_server SRVGRP=GROUP12 SRVID=840 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=lockselect12 REPLYQ=Y
    #mulup_server SRVGRP=GROUP11 SRVID=1 MIN=2 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=mulup11 REPLYQ=Y
    #mulup_server SRVGRP=GROUP12 SRVID=60 MIN=2 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=mulup12 REPLYQ=Y
    sinUpdate_server SRVGRP=GROUP13 SRVID=83 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=sinUpdate13 REPLYQ=Y
    sinUpdate_server SRVGRP=GROUP14 SRVID=164 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=sinUpdate14 REPLYQ=Y
    sinCount_server SRVGRP=GROUP13 SRVID=243 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=sinCount13 REPLYQ=Y
    sinCount_server SRVGRP=GROUP14 SRVID=324 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=sinCount14 REPLYQ=Y
    sinSelect_server SRVGRP=GROUP13 SRVID=363 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=sinSelec13 REPLYQ=Y
    sinSelect_server SRVGRP=GROUP14 SRVID=404 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=sinSelect14 REPLYQ=Y
    sinInsert_server SRVGRP=GROUP13 SRVID=523 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=sinInsert13 REPLYQ=Y
    sinInsert_server SRVGRP=GROUP14 SRVID=564 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=sinInsert14 REPLYQ=Y
    sinDelete_server SRVGRP=GROUP13 SRVID=603 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=sinDelete13 REPLYQ=Y
    sinDelete_server SRVGRP=GROUP14 SRVID=644 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=sinDelete14 REPLYQ=Y
    sinDdl_server SRVGRP=GROUP13 SRVID=703 MIN=5 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=sinDdl13 REPLYQ=Y
    sinDdl_server SRVGRP=GROUP14 SRVID=744 MIN=5 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=sinDdl14 REPLYQ=Y
    lockselect_server SRVGRP=GROUP13 SRVID=803 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=lockselect13 REPLYQ=Y
    lockselect_server SRVGRP=GROUP14 SRVID=844 MIN=10 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=lockselect14 REPLYQ=Y
    #mulup_server SRVGRP=GROUP13 SRVID=13 MIN=2 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=mulup13 REPLYQ=Y
    #mulup_server SRVGRP=GROUP14 SRVID=64 MIN=2 MAX=30 MAXGEN=10 GRACE=10 RESTART=Y RQADDR=mulup14 REPLYQ=Y
    WSL SRVGRP=GROUP11 SRVID=1000
    CLOPT="-A -- -n//120.3.8.237:7200 -I 60 -T 60 -w WSH -m 50 -M 100 -x 6 -N 3600"
    WSL SRVGRP=GROUP12 SRVID=1001
    CLOPT="-A -- -n//120.3.8.238:7200 -I 60 -T 60 -w WSH -m 50 -M 100 -x 6 -N 3600"
    WSL SRVGRP=GROUP13 SRVID=1003
    CLOPT="-A -- -n//120.3.8.237:7203 -I 60 -T 60 -w WSH -m 50 -M 100 -x 6 -N 3600"
    WSL SRVGRP=GROUP14 SRVID=1004
    CLOPT="-A -- -n//120.3.8.238:7204 -I 60 -T 60 -w WSH -m 50 -M 100 -x 6 -N 3600"
    Is there any error in the above UBB file, or are we not using MSSQ correctly?
    Looking forward to your answer, thanks.

  • Linux RAC NODES Rebooting

    We have a 2-node RAC cluster that has been running GC for about 3 months. But lately (the last 3 weeks) we have seen node 2 reboot 5-6 times with CSSD errors:
    Oracle clsomon failed with fatal status 12
    Oracle CSSD failure 134.
    Oracle CRS failure. Rebooting for cluster integrity.
    This environment is RHEL4 U4 with all RAC components running 10.2.0.3. Has anyone encountered the same?
    thanks

    Chandra,
    We looked into ocssd.log and didn't find anything unusual.
    Below is the log from the node that failed.
    [    CSSD]2008-xx-xx 11:49:49.172 [1199618400] >TRACE: clssgmClientConnectMsg: Connect from con(0x786ba0) proc(0x7ba7b0) pid() proto(10:2:1:1)
    [    CSSD]2008-xx-xx 11:50:11.573 [1199618400] >TRACE: clssgmClientConnectMsg: Connect from con(0x786ba0) proc(0x7ba7b0) pid() proto(10:2:1:1)
    [    CSSD]2008-xx-xx 11:50:44.376 [1199618400] >TRACE: clssgmClientConnectMsg: Connect from con(0x786ba0) proc(0x7ba7b0) pid() proto(10:2:1:1)
    [    CSSD]2008-xx-xx 11:51:44.652 [1199618400] >TRACE: clssgmClientConnectMsg: Connect from con(0x786ba0) proc(0x7ba7b0) pid() proto(10:2:1:1)
    [    CSSD]2008-xx-xx 11:52:44.921 [1199618400] >TRACE: clssgmClientConnectMsg: Connect from con(0x786ba0) proc(0x7adaf0) pid() proto(10:2:1:1)
    [    CSSD]2008-xx-xx 11:58:02.771 >USER: Oracle Database 10g CSS Release 10.2.0.3.0 Production Copyright 1996, 2004 Oracle. All rights reserved.
    [  clsdmt]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=xxxxxDBG_CSSD))
    [    CSSD]2008-xx-xx 11:58:02.771 >USER: CSS daemon log for node xxxxxx, number 2, in cluster xxxxxxx-crs
    [    CSSD]2008-xx-xx 11:58:02.801 [2538463008] >TRACE: clssscmain: local-only set to false
    [    CSSD]2008-xx-xx 11:58:02.844 [2538463008] >TRACE: clssnmReadNodeInfo: added node 1 (xxxx) to cluster
    [    CSSD]2008-xx-xx 11:58:02.853 [2538463008] >TRACE: clssnmReadNodeInfo: added node 2 (xxxx) to cluster
    [    CSSD]2008-xx-xx 11:58:02.862 [1115699552] >TRACE: clssnm_skgxnmon: skgxn init failed
    [    CSSD]2008-xx-xx 11:58:02.862 [2538463008] >TRACE: clssnm_skgxnonline: Using vacuous skgxn monitor
    If you look at the log on the failed node, it failed at around 11:53 according to the OS log. After the reboot it took back its resources and will run for a couple of days without any issue, and then the same thing happens again.
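    If ocssd.log is clean, two other places usually worth checking on 10.2 for a clsomon/CSSD-triggered reboot are the oclsomon log and the OS messages around the failure time (paths are the usual defaults; adjust to your CRS home and host name):
    <CRS_HOME>/log/<hostname>/cssd/oclsomon/oclsomon.log
    grep -i -e restart -e reboot /var/log/messages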

  • Solaris RAC nodes re-booting

    I have a pre-production 2-node cluster running on Solaris 10, Oracle 10.2.0.3 with the Oracle CRS, and using a NetApp filer as the shared storage.
    I also have a separate Solaris server running Grid Control 10.2.0.3, with the repository as one of the databases on the RAC (don't know if this is relevant to my problem).
    Periodically both RAC nodes reboot, with no trace of why (the GC server is fine). There is nothing logged in the Solaris logs (messages file), CRS logs, Oracle logs or the NetApp logs.
    All that is shown is the relevant service starting up following the shutdown.
    Has anyone any experience of this, or any thoughts on which component may cause such an issue?
    Thanks in advance
    Bob

    What type of Sun hardware are you using?
    Below is the Action Plan Oracle support sent me on my SR on this issue, not sure if any of this was provided to you or would be of help.
    ACTION PLAN
    ============
    1. There is nothing in the files at all that sheds any light on the issue.
    Again, 3 separate sets of clusters all losing all nodes at the same time is a very strange occurrence. Please be sure to have the admin look for
    anything in common with all clusters.
    2. We advise placing OSWatcher on the systems; see Note 301137.1 Ext/Pub OS Watcher User Guide.
    If we have another occurrence we will want the OSWatcher logs from 1 hour before the issue through the issue.
    Also see if the Unix admin has any OS stats from this occurrence.
    3. We advise setting ntpd to run with the -x option. I do see that you are having negative time changes at times; -x will give us a slew rather than an abrupt time change.
    4. We advise setting this when you can. Please do the following to set the diagwait parameter:
    crsctl set css diagwait N [-force]
    Where N is the number of seconds to wait for a filesystem sync to
    complete (after this wait the node will reboot regardless of whether the
    sync has completed). This change must be made with the clusterware
    down, which will require the '-force', or with the stack up on just 1
    node, after which the stack on that node must be restarted before the
    stack starts up on any of the other nodes.
    N should be set to 25 (25 seconds)
    5. We advise that you also have PCW MLR#6, Patch 5980915, on the systems.
    I do not believe this was an Oracle bug; the reason for applying the patch is the advanced diagnostics included in that patchset.
    6. The two issues Sun is working on:
    Sun is working to resolve a time skew issue and a Solaris 10 kernel SIGALRM bug, Sun#6292092, in addition to Sun#6595936.
    7. We do have a diagnostic oprocd that some sites have used, but only on their test systems. It stops the reboots and dumps information, but I have been hesitant to place it on production boxes. If you continue to have issues we may consider downloading the oprocd_skewfix_noreboot from Bug 6279879, but at this time I do not believe that is warranted.
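    (A quick, platform-independent way to see whether the node clocks are slewing or jumping, relevant to the ntpd -x advice in item 3 above, is to watch the NTP peer offsets on each node; ntpq ships with the standard NTP distribution:
    ntpq -p
    Large or oscillating offsets across the RAC nodes point to the time-skew issue this action plan describes.)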

  • Question on Rebooting RAC Nodes

    Hi, I heard that when rebooting all RAC nodes, one has to wait at least 5 minutes between each node reboot. So you would reboot node 1 at time 0, node 2 five minutes later, and so on.
    However, I could not find any documentation on this; can someone please point me to the right place to look? Thanks.

    I have not heard that before. I generally use srvctl to stop/start the databases; I do not think it waits 5 minutes between starting nodes, as it does not take 15 minutes to stop/start the databases. As far as rebooting the hosts goes, the only interval between machines was the time it took to send the reboot command to each host.
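    For reference, a srvctl sequence for a planned rolling reboot might look like this (the database and instance names are placeholders):
    srvctl stop instance -d ORCL -i ORCL1
    (reboot node 1 and wait for Clusterware to come back up)
    srvctl start instance -d ORCL -i ORCL1
    then repeat for the second node; srvctl itself imposes no mandatory delay between nodes.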

  • RAC Node hang and unexpected reboot

    Hello friends,
    We are facing an intermittent issue of node hangs and unexpected node shutdowns. This is a 2-node RAC 10.2.0.3 running on Windows 2003. Here's the crsd.log:
    2009-07-16 17:24:03.058: [ OCRMSG][5252]prom_rpc: CLSC recv failure..ret code 7
    2009-07-16 17:24:03.058: [ OCRMSG][5252]prom_rpc: possible OCR retry scenario
    2009-07-16 17:24:03.058: [ COMMCRS][5616]clscsendx: (0000000002AF5C60) Physical connection (0000000003892080) not active
    2009-07-16 17:24:03.058: [ OCRMSG][5616]prom_rpc: CLSC send failure..ret code 11
    2009-07-16 17:24:03.058: [ OCRMSG][5616]prom_rpc: possible OCR retry scenario
    2009-07-16 17:24:03.105: [ COMMCRS][5252]clscsendx: (0000000002AF5C60) Connection not active
    2009-07-16 17:24:03.105: [ OCRMSG][5252]prom_rpc: CLSC send failure..ret code 6
    2009-07-16 17:24:03.105: [ OCRMSG][5252]prom_rpc: possible OCR retry scenario
    2009-07-16 17:24:03.105: [ COMMCRS][5616]clscsendx: (0000000002AF5C60) Connection not active
    2009-07-16 17:24:03.105: [ OCRMSG][5616]prom_rpc: CLSC send failure..ret code 6
    2009-07-16 17:24:03.105: [ OCRMSG][5616]prom_rpc: possible OCR retry scenario
    2009-07-16 17:24:03.152: [ COMMCRS][5252]clscsendx: (0000000002AF5C60) Connection not active
    2009-07-16 17:24:03.152: [ OCRMSG][5252]prom_rpc: CLSC send failure..ret code 6
    2009-07-16 17:24:03.152: [ OCRMSG][5252]prom_rpc: possible OCR retry scenario
    2009-07-16 17:24:03.168: [ COMMCRS][5616]clscsendx: (0000000002AF5C60) Connection not active
    2009-07-16 17:24:03.168: [ OCRMSG][5616]prom_rpc: CLSC send failure..ret code 6
    2009-07-16 17:24:03.168: [ OCRMSG][5616]prom_rpc: possible OCR retry scenario
    2009-07-16 17:24:03.215: [ COMMCRS][5252]clscsendx: (0000000002AF5C60) Connection not active
    2009-07-16 17:24:03.215: [ OCRMSG][5252]prom_rpc: CLSC send failure..ret code 6
    2009-07-16 17:24:03.215: [ OCRMSG][5252]prom_rpc: possible OCR retry scenario
    2009-07-16 17:24:03.215: [ COMMCRS][5616]clscsendx: (0000000002AF5C60) Connection not active
    2009-07-16 17:24:03.215: [ OCRMSG][5616]prom_rpc: CLSC send failure..ret code 6
    2009-07-16 17:24:03.215: [ OCRMSG][5616]prom_rpc: possible OCR retry scenario
    2009-07-16 17:24:03.261: [ COMMCRS][5616]clscsendx: (0000000002AF5C60) Connection not active
    2009-07-16 17:24:03.261: [ OCRMSG][5616]prom_rpc: CLSC send failure..ret code 6
    2009-07-16 17:24:03.261: [ OCRMSG][5616]prom_rpc: possible OCR retry scenario
    Please shed some light on what the issue might be.

    I suggest you install [ IPD/OS|http://www.oracle.com/technology/products/database/clustering/ipd_download_homepage.html] on your cluster. This will give you all the relevant OS statistics, so when a node reboot happens you can figure out what the state of the nodes was at that time and then fix the problem. The hang is often caused by something other than Oracle RAC.

  • Private Interconnect: Should any nodes other than RAC nodes have one?

    The contractors that set up our four-node production 10g RAC (and a standalone development server) also assigned private interconnect addresses to 2 Apache/ApEx servers and a standalone development database server.
    There are service names in the tnsnames.ora on all servers in our infrastructure referencing these private interconnects, even on the non-RAC member servers. The NICs on these servers are not bound for failover with the NICs bound to the public/VIP addresses. These NICs are isolated on their own switch.
    Could this configuration be related to lost heartbeats or voting disk errors? We experience RAC node evictions and even arbitrary bounces (reboots!) of all the RAC nodes.

    I do not have access to the contractors; I can only look at what they have left behind and try to figure out their intention.
    I am reading the Ault/Tumma book Oracle 10g Grid and Real Application Clusters, looking through our own settings and config files, and learning the srvctl and crsctl commands from their examples. I am also googling and searching OTN through the library full of documentation.
    I still have yet to figure out whether the private interconnect spoken about so frequently in the cluster configuration documents is the binding to the set of node.vip address specifications in the tnsnames.ora (bound to the first eth adapter along with the public IP addresses for the nodes), or the binding on the second eth adapter to the node.prv addresses, which are not found in the local pfile, the tnsnames.ora, or the listener.ora (but are found at the operating system level in ifconfig). If the node.prv addresses are not the private interconnect, can anyone tell me what they are for?
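    One way to see which interface and subnet Clusterware itself registered as the private interconnect, rather than inferring it from tnsnames.ora, is the oifcfg utility in the CRS home (the interface names and subnets below are illustrative only):
    oifcfg getif
    eth0  10.1.1.0     global  public
    eth1  192.168.2.0  global  cluster_interconnect
    The node.prv names typically just resolve to addresses on that cluster_interconnect subnet, which is why they appear in ifconfig but not in tnsnames.ora or listener.ora.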

  • Rac Node is Restarting

    Hello,
    I request your help in solving an issue with my RAC, to see if someone can guide me through the solution:
    I have a RAC running 11g R1 (11.1.0.7.0) on an Itanium Superdome, OS HP-UX version B.11.31, with EMC storage and 64 GB RAM.
    I have 8 databases on it. The problem is with one particular database: when starting its instance on node 1, it causes the whole node to restart. Additionally, the alert log is showing problems with the heartbeat; however, HP has already diagnosed the hardware and it is OK.
    Is there someone who can help me figure out what is causing this and guide me through a possible solution?

    Hi,
    check
    Troubleshooting 10g and 11.1 Clusterware Reboots [Document 265769.1]
    especially use OSWatcher to get information on why your node reboots.
    OSWatcher Black Box User Guide (Includes: [Video]) (Doc ID 301137.1)
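    For reference, OSWatcher Black Box is a set of shell scripts; once unpacked it is typically started with a snapshot interval in seconds and a retention period in hours, for example (arguments are illustrative only):
    ./startOSWbb.sh 30 48
    and stopped with ./stopOSWbb.sh. Keep it running on every node so the data covering the hour before a reboot is available.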
    Regards
    Sebastian

  • RAC node restarting!

    Hi,
    one of our RAC environments keeps restarting.
    I've disabled init.cssd, init.crs and init.evmd in /etc/inittab in order to check the logs.
    This is the situation:
    crsd.log:
    2009-02-04 00:09:00.118: [ COMMCRS][9]clsc_connect: (8000000100318640) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node1_loud))
    2009-02-04 00:09:00.132: [ CSSCLNT][1]clsssInitNative: connect failed, rc 9
    2009-02-04 00:09:00.134: [  CRSRTI][1]32CSS is not ready. Received status 3 from CSS. Waiting for good status ..
    2009-02-04 00:09:08.016: [    CRSD][1]32Daemon Version: 10.2.0.2.0 Active Version: 10.2.0.2.0
    2009-02-04 00:09:08.016: [    CRSD][1]32Active Version and Software Version are same
    2009-02-04 00:09:08.017: [ CRSMAIN][1]32Initializing OCR
    2009-02-04 00:09:08.037: [  OCRRAW][1]proprioo: for disk 0 (/dev/rdsk/ora_ocr_raw), id match (1), my id set (752560621,1028247821) total id sets (1), 1st set
    (752560621,1028247821), 2nd set (0,0) my votes (2), total votes (2)
    2009-02-04 00:09:08.140: [ CSSCLNT][24]clssgsGroupJoin: CSS has not reached fatal mode.Registration is not yet safe. Retrying
    ocssd.log:
    [    CSSD]2009-02-03 21:52:08.651 [9] >USER: clssnmHandleUpdate: NODE 1 (node1l) IS ACTIVE MEMBER OF CLUSTER
    [    CSSD]2009-02-03 21:52:08.651 [9] >TRACE: clssnmHandleUpdate: diskTimeout set to (200000)ms
    [    CSSD]2009-02-03 21:52:08.651 [16] >TRACE: clssnmWaitForAcks: done, msg type(15)
    [    CSSD]2009-02-03 21:52:08.651 [16] >TRACE: clssnmDoSyncUpdate: Sync Complete!
    [    CSSD]2009-02-03 21:52:08.722 [1] >USER: NMEVENT_SUSPEND [00][00][00][00]
    [    CSSD]2009-02-03 21:52:08.724 [17] >TRACE: clssgmReconfigThread: started for reconfig (1)
    [    CSSD]2009-02-03 21:52:08.749 [17] >USER: NMEVENT_RECONFIG [00][00][00][02]
    [    CSSD]2009-02-03 21:52:08.749 [17] >TRACE: clssgmEstablishConnections: 1 nodes in cluster incarn 1
    [    CSSD]2009-02-03 21:52:08.751 [13] >TRACE: clssgmPeerListener: connects done (1/1)
    [    CSSD]2009-02-03 21:52:08.752 [17] >TRACE: clssgmEstablishMasterNode: MASTER for 1 is node(1) birth(1)
    [    CSSD]2009-02-03 21:52:08.752 [17] >TRACE: clssgmChangeMasterNode: requeued 0 RPCs
    [    CSSD]2009-02-03 21:52:08.752 [17] >TRACE: clssgmMasterCMSync: Synchronizing group/lock status
    [    CSSD]2009-02-03 21:52:08.752 [17] >TRACE: clssgmMasterSendDBDone: group/lock status synchronization complete
    [    CSSD]CLSS-3000: reconfiguration successful, incarnation 1 with 1 nodes
    [    CSSD]CLSS-3001: local node number 1, master node number 1
    [    CSSD]2009-02-03 21:52:08.753 [17] >TRACE: clssgmReconfigThread: completed for reconfig(1), with status(1)
    [    CSSD]2009-02-03 21:52:08.863 [10] >TRACE: clssgmClientConnectMsg: Connect from con(80000001008fd2a0) proc(8000000100ae26a8) pid() proto(10:2:1:1)
    [    CSSD]2009-02-03 21:52:08.864 [10] >TRACE: clssgmClientConnectMsg: Connect from con(8000000100ae0128) proc(8000000100ae2a10) pid() proto(10:2:1:1) from con(8000000100aa32c0) proc(8000000100aa5b90) pid() proto(10:2:1:1)
    alertlog:
    [cssd(2535)]CRS-1601:CSSD Reconfiguration complete. Active nodes are node1 .
    2009-02-03 23:55:20.821
    [cssd(2575)]CRS-1605:CSSD voting file is online: /dev/rdsk/ora_voting_raw. Detai ls in /work/crs/product/10.2/crs/log/lourmel/cssd/ocssd.log.
    2009-02-03 23:55:28.376
    evmd.log:
    Oracle Database 10g CRS Release 10.2.0.2.0 Production Copyright 1996, 2004, Oracle. All rights reserved
    2009-02-04 00:08:58.331: [    EVMD][1]32EVMD waiting for CSS to be ready err = 3
    2009-02-04 00:08:59.939: [ COMMCRS][9]clsc_connect: (800000010007d658) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node1_loud))
    2009-02-04 00:08:59.946: [ CSSCLNT][1]clsssInitNative: connect failed, rc 9
    2009-02-04 00:08:59.948: [    EVMD][1]32EVMD waiting for CSS to be ready err = 3
    2009-02-04 00:09:07.596: [ CSSCLNT][1]clssgsGroupJoin: CSS has not reached fatal mode.Registration is not yet safe. Retrying
    syslog:
    Feb 4 00:08:41 lourmel syslog: Oracle Cluster Ready Services starting up automatically.
    Feb 4 00:08:45 lourmel sfd[2153]: starting the daemon.
    Feb 4 00:08:45 lourmel su: + tty?? root-orac
    Feb 4 00:08:45 lourmel krsd[2152]: Delay time is 300 seconds
    Feb 4 00:08:43 lourmel syslog: Oracle Cluster Ready Services starting up automatically.
    Feb 4 00:08:52 lourmel above message repeats 2 times
    Feb 4 00:08:52 lourmel syslog: Cluster Ready Services completed waiting on dependencies.
    Feb 4 00:08:53 lourmel syslog: Running CRSD with TZ =
    When I checked (before the restart) with the crs_stat command, I got the message:
    ORA-0184: Cannot communicate with CRS
    crsctl check crs gives us:
    Failure 1 contacting CSS daemon
    Cannot communicate with CRS
    Cannot communicate with EVM
    As I said before, the machine keeps restarting.
    Does anyone have an idea? Please.

    Dear All,
    I recently upgraded a few RAC setups with Oracle 10g Patchset 3 (10.2.0.4) on Linux servers.
    In one of the RAC setups I found the servers were rebooting daily. The same setup was working fine, and the problem started only after applying the patchset. I checked all the logs and found nothing relevant.
    Then I checked the things that were added with this patchset.
    The most interesting find: Oracle added a new daemon, oprocd.
    # ps -efl | grep oprocd
    4 S root 6440 6063 0 -40 - - 2114 - Mar03 ? 00:00:00 /opt/oracle/product/10.2.0/crs/bin/oprocd.bin run -t 1000 -m 500 -hsi 5:10:50:75:90 -f
    These are the interesting points about the above line:
    1. This process is run by the root user.
    2. It runs with the highest priority, -40.
    3. It probes every second (-t 1000).
    4. It waits for a CPU response for 500 milliseconds (-m 500 means the margin time is 500 milliseconds).
    5. The process is fatal (-f).
    My conclusion from these points: this daemon probes the CPU every second and waits for a response within 500 milliseconds. If it gets no response from the CPU within those 500 milliseconds, it assumes the CPU is hung and tries to reboot the machine. The operating system does not get enough time to write the system logs, and the server reboots.
    So the solution is to increase the margin time from 500 milliseconds to 10 seconds.
    The following are the steps to increase the margin time.
    Please remember: the modification needs downtime, and you need to stop the cluster services on all member nodes.
    1. Stop The CRS Process
    #crsctl stop crs
    #<CRS_HOME>/bin/oprocd stop
    2. Ensure that Clusterware stack is down and not running
    #ps -ef |egrep "crsd.bin|ocssd.bin|evmd.bin|oprocd"
    This should return no processes.
    3. From one node of the cluster, change the value of the "diagwait" parameter to 13 by issuing the command as root:
    #crsctl set css diagwait 13 -force
    4. Check if diagwait is successfully set.
    #crsctl get css diagwait
    5. Restart the Oracle Clusterware on all the nodes by executing:
    #crsctl start crs
    (Note: if you face any problem restarting the CRS services, ASM and the database, you can reboot the nodes. The cluster and database will come up automatically thanks to the init startup scripts.)
    6. The oprocd daemon process will now show -m 10000:
    # ps -efl| grep oprocd
    # 4 S root 6440 6063 0 -40 - - 2114 - Feb02 ? 00:00:00 /opt/oracle/product/10.2.0/crs/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:90 -f
    Rollback procedure:
    If you need to unset the diagwait value for any reason:
    #crsctl unset css diagwait
    I am confident the abnormal RAC node restart problem will be solved with this workaround.
    Regards,
    Sumit
    Bangalore,India

  • DB didn't come up along with crs after node reboot

    Grid Version: 11.2.0.3
    OS: Red Hat Enterprise Linux 5.6
    Node 2 of our two-node RAC got rebooted. Upon reboot, CRS and the ASM instance came up, but the DB didn't come up.
    How can I check if DB is linked to CRS startup ?
    How can I enable DB startup upon CRS startup ?

    Hi,
    Check the alert log of the database instance on that node.
    By default, if an Oracle database instance is already started on one node, CRS automatically starts the database instance on the other nodes.
    But if you issue "srvctl stop instance" before shutting down the node, the state of this resource will be "shutdown" (i.e. it stays down). The CRS database resource has the attribute AUTO_START=restore by default, which means Oracle CRSD will remember the last state of that resource.
    In that case you must manually issue "srvctl start instance" after Clusterware starts up; but if the database instance was running and you issued "crsctl stop crs", crsd should start the database automatically at the next Clusterware start.
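    A quick way to check how the database resource is configured on 11.2 (the database name ORCL below is a placeholder):
    srvctl config database -d ORCL
    crsctl stat res ora.orcl.db -p | grep AUTO_START
    The first command reports the management policy, the second shows the AUTO_START attribute described above. If you want the instance to start regardless of its last recorded state, the policy can be changed with "srvctl modify database -d ORCL -y AUTOMATIC"; the restore behaviour is just the default.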

  • Both cluster node reboot

    There is a two-node cluster running an Oracle RAC DB. Yesterday both nodes rebooted at the same time (less than a few seconds apart). We don't know whether it was caused by Oracle CRS or by the servers themselves.
    Here is the log:
    /var/log/messages in node 1
    Dec 8 15:14:38 dc01locs01 kernel: 493 http://RAIDarray.mppdcsgswsst6140:1:0:2 Cmnd failed-retry the same path. vcmnd SN 18469446 pdev H3:C0:T0:L2 0x02/0x04/0x01 0x08000002 mpp_status:1
    Dec 8 15:14:38 dc01locs01 kernel: 493 http://RAIDarray.mppdcsgswsst6140:1:0:2 Cmnd failed-retry the same path. vcmnd SN 18469448 pdev H3:C0:T0:L2 0x02/0x04/0x01 0x08000002 mpp_status:1
    Dec 8 15:17:20 dc01locs01 syslogd 1.4.1: restart.
    Dec 8 15:17:20 dc01locs01 kernel: klogd 1.4.1, log source = /proc/kmsg started.
    Dec 8 15:17:20 dc01locs01 kernel: Linux version 2.6.18-128.7.1.0.1.el5 ([email protected]) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Mon Aug 24 14:07:09 EDT 2009
    Dec 8 15:17:20 dc01locs01 kernel: Command line: ro root=/dev/vg00/root rhgb quiet crashkernel=128M@16M
    Dec 8 15:17:20 dc01locs01 kernel: BIOS-provided physical RAM map:
    ocssd.log in node 1
    CSSD2009-12-08 15:14:33.467 1134680384 >TRACE: clssgmDispatchCMXMSG: msg type(13) src(2) dest(1) size(123) tag(00000000) incarnation(148585637)
    CSSD2009-12-08 15:14:33.468 1134680384 >TRACE: clssgmHandleDataInvalid: grock HB+ASM, member 2 node 2, birth 1
    CSSD2009-12-08 15:19:00.217 >USER: Copyright 2009, Oracle version 11.1.0.7.0
    CSSD2009-12-08 15:19:00.217 >USER: CSS daemon log for node dc01locs01, number 1, in cluster ocsprodrac
    clsdmtListening to (ADDRESS=(PROTOCOL=ipc)(KEY=dc01locs01DBG_CSSD))
    CSSD2009-12-08 15:19:00.235 1995774848 >TRACE: clssscmain: Cluster GUID is 79db6803afc7df32ffd952110f22702c
    CSSD2009-12-08 15:19:00.239 1995774848 >TRACE: clssscmain: local-only set to false
    /var/log/messages in node 2
    Dec 8 15:14:38 dc01locs02 kernel: 493 http://RAIDarray.mppdcsgswsst6140:1:0:2 Cmnd failed-retry the same path. vcmnd SN 18561465 pdev H3:C0:T0:L2 0x02/0x04/0x01 0x08000002 mpp_status:1
    Dec 8 15:14:38 dc01locs02 kernel: 493 http://RAIDarray.mppdcsgswsst6140:1:0:2 Cmnd failed-retry the same path. vcmnd SN 18561463 pdev H3:C0:T0:L2 0x02/0x04/0x01 0x08000002 mpp_status:1
    Dec 8 15:17:14 dc01locs02 syslogd 1.4.1: restart.
    Dec 8 15:17:14 dc01locs02 kernel: klogd 1.4.1, log source = /proc/kmsg started.
    Dec 8 15:17:14 dc01locs02 kernel: Linux version 2.6.18-128.7.1.0.1.el5 ([email protected]) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Mon Aug 24 14:07:09 EDT 2009
    Dec 8 15:17:14 dc01locs02 kernel: Command line: ro root=/dev/vg00/root rhgb quiet crashkernel=128M@16M
    Dec 8 15:17:14 dc01locs02 kernel: BIOS-provided physical RAM map:
    ocssd.log in node 2
    CSSD2009-12-08 15:14:35.450 1264081216 >TRACE: clssgmExecuteClientRequest: Received data update request from client (0x2aaaac065a00), type 1
    CSSD2009-12-08 15:14:36.909 1127713088 >TRACE: clssgmDispatchCMXMSG: msg type(13) src(1) dest(1) size(123) tag(00000000) incarnation(148585637)
    CSSD2009-12-08 15:14:36.909 1127713088 >TRACE: clssgmHandleDataInvalid: grock HB+ASM, member 1 node 1, birth 0
    CSSD2009-12-08 15:18:55.047 >USER: Copyright 2009, Oracle version 11.1.0.7.0
    clsdmtListening to (ADDRESS=(PROTOCOL=ipc)(KEY=dc01locs02DBG_CSSD))
    CSSD2009-12-08 15:18:55.047 >USER: CSS daemon log for node dc01locs02, number 2, in cluster ocsprodrac
    CSSD2009-12-08 15:18:55.071 3628915584 >TRACE: clssscmain: Cluster GUID is 79db6803afc7df32ffd952110f22702c
    CSSD2009-12-08 15:18:55.077 3628915584 >TRACE: clssscmain: local-only set to false

    Hi!
    I suppose this seems easy: you have a service at 'http://RAIDarray.mppdcsgswsst6140:1:0:2' (a RAID perhaps?) which failed. Logically, all servers connected to this RAID went down at the same time.
    This does not seem to be an Oracle problem. Good luck!

  • Oracle Cluster Node Reboots Abruptly

    One of our RAC 11gR2 cluster nodes rebooted abruptly. We found the following error in the Grid home alert log file and the ocssd.log file:
    [cssd(6014)]CRS-1611:Network communication with node mumchora12 (1) missing for 75% of timeout interval. Removal of this node from cluster in 6.190 seconds
    We need to find the root cause for this node reboot. Kindly assist.
    OS Version : RHEL 5.8
    GRID : 11.2.0.2
    Database : 11.2.0.2.10

    Hi,
    Looking at the logs, it seems to be a private interconnect problem. I would suggest you refer to a nice Metalink doc on the same issue:
    Node reboot or eviction: How to check if your private interconnect CRS can transmit network heartbeats [ID 1445075.1]
    Hope it helps you identify the root cause of the node eviction.
    Thanks
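    As a basic first check of the interconnect itself (the private IP, interface and path below are placeholders for your environment):
    oifcfg getif
    ping -s 1500 -c 20 192.168.10.2
    grep -i "missing for" $GRID_HOME/log/<hostname>/cssd/ocssd.log
    Packet loss or rising latency on the private address, or repeated CRS-1610/1611/1612 "missing for ... of timeout interval" messages in ocssd.log, point at the interconnect rather than the database.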
