RAC reboot

Hi Folks,
I have a RAC 10g database running on two separate machines, x1 with instance1 and x2 with instance2, both running Sun Solaris. I also have a physical standby database configured with instance1.
A server reboot is scheduled for machine x1.
Please suggest the steps that should be performed on the database(s) before the server reboot, and afterwards to restart the DB.
Your early response is appreciated.

I found Krishna's approach somewhat helpful, but I finally got the solution from a few Oracle docs.
BTW, these are the steps:
Disable standby archive writing at production node1:
SQL> alter system set log_archive_dest_state_2=defer scope=both sid='instance1';
Disable auto recovery at the standby DR database:
SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
Stop the entire database (from any prod RAC node):
srvctl stop database -d instance
Reboot node1 with OS commands.
Start the database again (from any DC RAC node):
srvctl start database -d instance
Restart managed recovery at the standby DR database:
SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE THROUGH ALL SWITCHOVER DISCONNECT;
Enable standby archive writing at production database instance1:
SQL> alter system set log_archive_dest_state_2=enable scope=both sid='instance1';
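The order of the steps above matters (standby shipping and recovery are paused before the database is stopped, and resumed only after it is back). A minimal Python sketch that keeps the sequence as an ordered checklist; the `Step` type and the `reboot_runbook` function are made up for illustration, the commands are the ones quoted in the post, and `instance` / `instance1` are the names used there. It only prints the steps and runs nothing.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    where: str    # host or database the step is run against
    command: str  # command text, as quoted in the post above


def reboot_runbook(db_name: str, sid: str) -> List[Step]:
    """Return the reboot steps in the order given in the post (hypothetical helper)."""
    dest = "alter system set log_archive_dest_state_2={state} scope=both sid='{sid}';"
    return [
        Step("primary", dest.format(state="defer", sid=sid)),
        Step("standby", "ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;"),
        Step("any RAC node", f"srvctl stop database -d {db_name}"),
        Step("node1", "reboot with OS commands"),
        Step("any RAC node", f"srvctl start database -d {db_name}"),
        Step("standby", "ALTER DATABASE RECOVER MANAGED STANDBY DATABASE "
                        "THROUGH ALL SWITCHOVER DISCONNECT;"),
        Step("primary", dest.format(state="enable", sid=sid)),
    ]


if __name__ == "__main__":
    # Print a numbered checklist; nothing is executed against the cluster.
    for number, step in enumerate(reboot_runbook("instance", "instance1"), start=1):
        print(f"{number}. [{step.where}] {step.command}")
```

Keeping the list as data makes it easy to review the order before the maintenance window, or to paste it into a change ticket.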

Similar Messages

RAC Reboot Error

Hi all,
Our RAC site has 5 nodes on SUSE Linux. It has been running smoothly since 2009. Suddenly, after a power failure on Node5, all the other nodes rebooted at the same time!
This usually happens while the UPS is out of service. Every time, one node goes down and all instances are terminated afterwards.
Could anybody give me some advice?

First, nodeA lost power and was restarted:
Jan 30 02:27:16 Node5 sshd(pam_unix)[21583]: session opened for user oracle by (uid=0)
Jan 30 02:27:16 Node5 sshd(pam_unix)[21583]: session closed for user oracle
Jan 30 02:35:12 Node5 syslogd 1.4.1: restart.
Second, since nodeB was in the process of restarting, we started getting the messages below in /var/log/messages:
Jan 30 02:28:45 Node1 kernel: o2net: connection to node Node5 (num 2) at 10.2.0.2:7777 has been idle for 60.0 seconds, shutting it down.
Jan 30 02:28:45 Node1 kernel: o2net: connection to node Node5 (num 2) at 10.2.0.2:7777 has been idle for 60.0 seconds, shutting it down.
Third, all the other nodes suddenly restarted after the messages below:
Jan 30 02:28:45 Node1 kernel: o2net: no longer connected to node Node5 (num 2) at 10.2.0.2:7777
Jan 30 02:28:45 Node1 Kernel: o2net: no longer connected to node pracdb003 (num 2) at 10.2.0.2:7777
Jan 30 02:35:06 Node1 syslogd 1.4.1: restart.
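The o2net idle-timeout lines above carry the peer node, its address and how long the link had been idle. A small Python sketch for pulling those fields out of /var/log/messages; whitespace in the pattern is optional so that it also copes with logs whose spacing was lost when pasted. The names `IdleTimeout` and `parse_o2net_idle` are made up for illustration.

```python
import re
from typing import NamedTuple, Optional


class IdleTimeout(NamedTuple):
    peer: str            # node name of the peer that went quiet
    node_num: int        # o2net node number of that peer
    address: str         # interconnect IP address
    port: int            # interconnect TCP port
    idle_seconds: float  # idle time reported by o2net


# \s* everywhere: tolerate both normal and whitespace-stripped log lines.
_O2NET_IDLE = re.compile(
    r"o2net:\s*connection\s*to\s*node\s*(\S+?)\s*\(num\s*(\d+)\)\s*at\s*"
    r"([\d.]+):(\d+)\s*has\s*been\s*idle\s*for\s*([\d.]+)\s*seconds"
)


def parse_o2net_idle(line: str) -> Optional[IdleTimeout]:
    """Return the idle-timeout details, or None if the line is not such a message."""
    match = _O2NET_IDLE.search(line)
    if match is None:
        return None
    peer, num, address, port, idle = match.groups()
    return IdleTimeout(peer, int(num), address, int(port), float(idle))
```

Running this over the messages file of every surviving node shows which peer each node lost and at what timeout, which is a quick way to confirm that all nodes lost the same peer at the same moment.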

Oracle RAC reboots

Hi,
I have a two-node cluster with raw devices as its storage. I have an ASM instance and three database instances running on these servers. The OS is Solaris 10 and the DB version is 10.2.0.3. The problem is that the servers reboot by themselves at intervals. I am not sure what to look at to fix this issue. The only thing I know is that the CSSD process is failing and making the servers reboot. This is a production environment. Any kind of help will be appreciated.
Thanks

The following is what I got from the Sun people.
The system was rebooted by the Oracle CSSD process. It appears to be due to the fact that the system was CPU-bottlenecked, with overloaded CPUs (only 2 on the system) that had about 50 runnable threads on each CPU dispatch queue. Most of the threads in the CPU dispatch queues belong to Oracle.
The customer needs to contact Oracle for further analysis and recommendations.
How can we reduce CPU usage and find out whether CPU is actually the culprit?

Solaris 10 mpio and Emulex LP11002 HBA

Greetings,
Has anyone tried to use an Emulex LP11002 HBA with the Solaris 10 mpio driver?
Kindly advise.
Thanks and Regards

Hi All,
What I observed is a SAN message just before a sudden reboot (periodically the two nodes of the Oracle RAC reboot). It says "disappeared from fabric".
The configuration is:
2 T2000 with 2 dual-port Emulex HBAs on each.
Two Qlogic 5200 switches
1 Stk 6140
Mpxio enabled.
Everything is updated to the latest version: sc, obp, emulex driver,...
Jul 29 13:48:04 std01b fctl: [ID 517869 kern.warning] WARNING: fp(1)::GPN_ID for D_ID=10500 failed
Jul 29 13:48:04 std01b fctl: [ID 517869 kern.warning] WARNING: fp(1)::N_x Port with D_ID=10500, PWWN=10000000c95de248 disappeared from fabric
Jul 29 13:48:04 std01b fctl: [ID 517869 kern.warning] WARNING: fp(4)::GPN_ID for D_ID=10400 failed
Jul 29 13:48:04 std01b fctl: [ID 517869 kern.warning] WARNING: fp(4)::N_x Port with D_ID=10400, PWWN=10000000c95de2b4 disappeared from fabric
Jul 29 13:48:04 std01b e1000g: [ID 801593 kern.notice] NOTICE: pciex8086,105e - e1000g[2] : Adapter copper link is down.
Jul 29 13:50:53 std01b genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_125100-07 64-bit
Jul 29 13:50:53 std01b genunix: [ID 172907 kern.notice] Copyright 1983-2006 Sun Microsystems, Inc. All rights reserved.
Jul 29 13:50:53 std01b Use is subject to license terms.
Jul 29 13:50:53 std01b genunix: [ID 678236 kern.info] Ethernet address = 0:14:4f:6f:21:28
Regards.

Database startup after reboot of RAC server

Hello,
My config: 2 nodes on w2k3 with 15 DBs. Oracle 10.2.0.3
I started scheduling reboots of my RAC servers, so I followed the Oracle doc to properly shut down all DBs, ASM, services, listeners, etc.
Today one of the servers rebooted, but some instances didn't start (only five, and they are the last five in alphabetical order).
In the logs:
db log: nothing, just the shutdown
crs_log: the start of the first 10 DBs, but nothing for the last five!!
Just a trace on one listener service:
StartResource error for ora.sqyora01.LISTENER_1522_SQYORA01.lsnr error code = 1
2010-08-11 02:56:08.000: [ CRSRES][3008]32Start of `ora.sqyora01.LISTENER_1522_SQYORA01.lsnr` on member `sqyora01` failed.
2010-08-11 02:56:08.125: [ CRSRES][3008]32Skip online resource: ora.sqyora01.LISTENER_SQYORA01.lsnr
In Windows services: the OracleServiceSID service isn't started.
Any help in understanding this problem is welcome.
Thanks.

user4511076 wrote:
> IMO: (I am not a great Oracle DBA, because I do not have enough time to spend on it, and my colleagues need really simple things to do, like startup/shutdown of a database, with no further investigation)
Why startup/shutdown a database? I have databases that work 24x7 and have uptimes of over a year. The record so far has been over 24 months of uptime (the first downtime was caused by a power failure to the rack cabinet). And that is for a database instance that runs a number of processes and deals with hundreds to thousands of inserts per second.
> It's easier to see that an Oracle process takes more CPU, and so investigate what is happening in that database, rather than search my v$sql table to find which schema is doing the wrong things
Not sure what you are trying to say here. Single or multiple database instances do not change what a session looks like, or how SQLs are executed, and thus not how you troubleshoot problems. Nor does it change resource requirements, or how you address these. But multiple instances do impact resource availability and require one to split resources amongst instances - the end result of this is less flexibility.
> It's easier to assign CPU resources to a process than to configure it in Oracle
Please explain.
> It's easier to stop a database rather than "put offline a schema"
This is a "silly" statement to make IMO. Why do you want to offline a schema? Availability of applications and data is a critical feature in today's information system landscape. So why argue the complete opposite and say it is now more complex to make something less available?
By the same token, why do you want to offline a database? A down database is even less useful than a downed server - at least you can use the downed server as a doorstop.. ;-)
> And if a database crashes it only affects one application.
Yeah.. and this happens when? Every few minutes? Every hour? Once a day?
This is not a valid argument. Oracle instances very seldom crash just for the heck of it. Most often it is due to incorrect o/s configuration, problematic hardware, uncertified components, old drivers, etc. Or plain old application abuse of Oracle.
> There is the theoretical approach and the practical one. I'm not alone in that case:
> http://oracledoug.com/serendipity/index.php?/archives/1339-The-Reality-Gap-3-A-Single-Instance-per-Server.html
> http://www.dba-oracle.com/art_dbazine_server_consolidation.htm
I have never done RDBMS as a theoretical thing. The most dynamic table (ito SQLs against it) I have grows by more than 450 million rows per day. This is as real world as real world gets. And that is what shapes my experience and opinion. And Burleson and I have never seen eye to eye on a number of subjects - so quoting his views on consolidation does little to persuade me to change my opinion.
> Have you any idea where I can find an error message that explains why my database doesn't start? (like not enough resources ;-) )
Troubleshooting a problem starts with a very basic principle - isolation. Isolate the layer or moving part where the problem occurs. As the other instances are running, it means that the CRS software layer is up and running, that the storage layer is up and running and so on. So the problem should be with the instance that for some reason refuses to start. Confirm this by using sqlplus to start a down instance and looking at what its alert log file says. It should have some kind of pointer to which dependency has not been met.

Oracle RAC Nodes getting reboot in case of preferred controller failed

When we disconnect both fiber cables from the preferred Controller A, or pull the Controller A card out of the disk array (IBM DS 4300), both servers reboot after 90 seconds.
During this time the complete RAC network is out of service for approx. 5 minutes. After the reboot both servers come back with both instances without any manual intervention.
It's a critical issue for us because we are losing high availability. Let us know how we can resolve this critical issue.
Details of the network:
1. Software - Oracle 10g Release 2
2. OS - Red Hat Linux 3 (Kernel Version 2.4.21-27.ELsmp)
3. Shared storage - IBM DS 4300
4. Multipathing driver - RDAC (rdac-LINUX-09.00 A5.13)
5. Nodes - IBM 346
6. Database on ASM
7. The preferred controller for the ASM, OCR and voting disk is A.
8. Hangcheck timer value is 210 seconds.
9. Both servers have 2 HBA ports. One HBA port is connected to Controller A and the second HBA port is connected to Controller B of the SAN disk array.
As per my understanding,
the voting disk resides in the disk array and Controller A is the preferred owner of the voting disk LUN. When I disconnect both fiber cables from the preferred Controller A, the Clusterware software on both nodes tries to contact the voting disk. When it is unable to contact the voting disk within a specific time period, the nodes go for a reboot.
I tested controller failure with the Oracle RAC software as well as without Oracle. Without Oracle it works fine, and the reason is that in that case the disk array waits approx. 300 seconds before changing the preferred controller from A to B.
But with Oracle, the Clusterware software reboots both nodes before the controller can shift from A to B.
So, to conclude, a tech who has a good understanding of Oracle Clusterware on Linux and the IBM RDAC multipath driver can help me.
When we install Oracle RAC on Linux, it is required to configure the hangcheck timer.
Oracle recommends 180 seconds.
It means that if one of the nodes is hanging, the second node will wait for 180 seconds; if the situation is not resolved within 180 seconds, it will reboot the hung node.
I think the hangcheck timer configuration is required only with Linux.
Configuration File
cat >> /etc/rc.d/rc.local << EOF
modprobe hangcheck-timer hangcheck_tick=15 hangcheck_margin=60
EOF

Sorry, the hangcheck timer configuration is actually:
Configuration File
cat >> /etc/rc.d/rc.local << EOF
modprobe hangcheck-timer hangcheck_tick=30 hangcheck_margin=180
EOF
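The 210-second hangcheck value quoted in the configuration list is the sum of the two module parameters in the corrected line (30 + 180). A minimal Python sketch of that arithmetic, assuming the hangcheck-timer module resets a node once a hang lasts longer than hangcheck_tick + hangcheck_margin; the function names are made up for illustration.

```python
import re
from typing import Dict


def parse_hangcheck_options(modprobe_line: str) -> Dict[str, int]:
    """Extract hangcheck_* options from a modprobe command line."""
    pairs = re.findall(r"(hangcheck_\w+)=(\d+)", modprobe_line)
    return {name: int(value) for name, value in pairs}


def reset_threshold_seconds(options: Dict[str, int]) -> int:
    """Hang duration after which the module would reset the node (tick + margin)."""
    return options["hangcheck_tick"] + options["hangcheck_margin"]


if __name__ == "__main__":
    line = "modprobe hangcheck-timer hangcheck_tick=30 hangcheck_margin=180"
    threshold = reset_threshold_seconds(parse_hangcheck_options(line))
    # Figures from the post: reboot seen after 90 s, controller failover takes ~300 s.
    for label, seconds in (("observed reboot", 90), ("controller failover", 300)):
        relation = "before" if seconds < threshold else "after"
        print(f"{label} at {seconds}s happens {relation} the {threshold}s threshold")
```

By this arithmetic the reboot observed after 90 seconds comes well before the 210-second threshold, while the 300-second controller failover is longer than it, so both timings are worth comparing against whatever timeout is actually firing.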

SAN reboot for oracle DB at ASM in linux RAC

Hi Experts,
We use a 10.2.0.4 database on ASM in Oracle RAC on Red Hat 5 Linux.
We use 3 directories (asm, crs, and database).
I was notified that the SAN box (which supports ASM for the database) will be rebooted.
Under this condition, what do I need to do? Shut down the instance? The database? CRS? Or the ASM instance?
Thanks for your help.
JIM

user589812 wrote:
> Does this mean starting all Oracle-related services in sequence?
CRS will start the complete Oracle cluster s/w stack for you: ASM, RAC, nodeapps, etc.
Usually, the only effort required is simply hitting the reset/power on button - as the o/s boots, CRS will start and it will in turn bring up the s/w stack. No manual intervention is required (unless you configured it differently on purpose).
> Based on Billy's suggestion, can I use srvctl stop nodeapps -n all and #ORA_CRS_HOME/bin/crsctl stop crs
No - my suggestion is that before the SAN maintenance window starts, you do a "shutdown -h now" on all cluster nodes to halt/power down each and every RAC server.
And after the SAN maintenance period is over, and the SAN is available again, ssh into the LoM (Lights Out Management) console of each server and do a "start SYS" (or equivalent) to power up the server.
In other words, with the SAN down/busy rebooting/undergoing maintenance, I would not want to have my RAC servers up and running as there is no storage layer to run them on. IMO, it is a lot safer to have these servers powered down during such a maintenance period.
PS. I have even had the odd case during SAN maintenance of power cables being pulled, interconnect switches being accidentally reset and so on - or you could have some bright spark also shutting down the aircon with the SAN, and your RAC servers suffering heat problems and potential damage while running. So my question is - why should I take the risk of keeping my RAC servers up when the storage layer is not there and the cluster is broken and useless? Surely it makes a lot more sense to power down those servers too and only power them on again when the maintenance period is over and the SAN (and data centre) is in a proper running state again.

Question on Rebooting RAC Nodes

Hi, I heard that when rebooting all RAC nodes, one has to wait at least 5 minutes between each node reboot. So you would reboot node 1 at 0 time, node 2 5 minutes later etc.
However, I could not find any documentation on this, can someone please point me to the right place to look? Thanks.

I have not heard that before. I generally use srvctl to stop/start the databases, I do not think it waits 5 minutes between starting nodes as it does not take 15 minutes to stop/start the databases. As far as rebooting the hosts, the only time interval between machines was the time it took to send the reboot command to the host.

Oracle RAC 10.2G reboots node every 45 minutes

Hello:
- We have installed Oracle RAC 10.2G for Solaris X86 ( 64 bit ).
- On one node, there are no issues. But the other node ( I think )
is being rebooted by CRS every 45 minutes or so.
- Is this issue caused by some misconfiguration I did during the install ?
- Or is there a patch available to fix this ?
- Has anyone else encountered this problem ?
Thanks
jlem

Hello:
- I re-installed Oracle RAC. The nodes were only rebooted once so far.
So, the second install may be ok. If not, I have provided answers to the first email reply.
- Any help given is most welcome. In meantime, I will continue searching the oracle forums
for solutions.
- My environment is:
- both nodes are running under vmware ESX server version 3.0.1
- the shared storage for OCR and Voting Disk is a raw shared device under vmware
- both nodes are using Solaris X86 5.10 update 5
- Oracle version is: 10.2.0.3 ( patched from version 10.2.0.1 )
- My public network configuration is:
node 1:
e1000g0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 10.20.1.74 netmask ffff0000 broadcast 10.20.255.255
ether 0:c:29:3a:45:a9
e1000g0:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 10.20.1.77 netmask ffff0000 broadcast 10.20.255.255
node 2:
e1000g0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 10.20.1.75 netmask ffff0000 broadcast 10.20.255.255
ether 0:c:29:2b:db:90
e1000g0:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 10.20.1.78 netmask ffff0000 broadcast 10.20.255.255
- My private network configuration is:
node 1:
e1000g1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
inet 192.168.0.1 netmask ffffff00 broadcast 192.168.0.255
ether 0:c:29:3a:45:b3
node 2:
e1000g1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
inet 192.168.0.2 netmask ffffff00 broadcast 192.168.0.255
ether 0:c:29:2b:db:9a
- My storage solution is:
- 3 virtual shared SCSI hard disks ( each 500 MB in size )
- My log files are:
- /var/adm/messages
- doesn't report much only the following:
Nov 12 10:57:05 saucer nfs4cbd[328]: [ID 867284 daemon.notice] nfsv4 cannot determine local hostname binding for transport tcp6 - delegations will not be available on this transport
Nov 12 10:57:21 saucer savecore: [ID 570001 auth.error] reboot after panic: forced crash dump initiated at user request
Nov 12 10:57:21 saucer savecore: [ID 748169 auth.error] saving system crash dump in /var/crash/saucer/*.2
Nov 12 10:57:41 saucer root: [ID 702911 user.error] Oracle Cluster Ready Services disabled by administrator.
Nov 12 10:57:54 saucer rootnex: [ID 349649 kern.info] xsvc0 at root
Nov 12 10:57:54 saucer genunix: [ID 936769 kern.info] xsvc0 is /xsvc
- ocssd.log file for node1 indicates that node2 was evicted for impeding a reconfig. Details are:
[    CSSD]2008-11-12 10:55:43.700 [15] >TRACE: clssnmPollingThread: node saucer (2) is impending reconfig
[    CSSD]2008-11-12 10:55:43.700 [15] >WARNING: clssnmPollingThread: node saucer (2) at 90% heartbeat fatal, eviction in 0.973 seconds
[    CSSD]2008-11-12 10:55:44.679 [15] >TRACE: clssnmPollingThread: node saucer (2) is impending reconfig
[    CSSD]2008-11-12 10:55:44.679 [15] >TRACE: clssnmPollingThread: Eviction started for node saucer (2), flags 0x000d, state 3, wt4c 0
[    CSSD]2008-11-12 10:55:44.690 [17] >TRACE: clssnmDoSyncUpdate: Initiating sync 3
[    CSSD]2008-11-12 10:55:44.690 [17] >TRACE: clssnmDoSyncUpdate: diskTimeout set to (27000)ms
[    CSSD]2008-11-12 10:55:44.691 [17] >TRACE: clssnmSetupAckWait: Ack message type (11)
[    CSSD]2008-11-12 10:55:44.691 [17] >TRACE: clssnmSetupAckWait: node(1) is ALIVE
[    CSSD]2008-11-12 10:55:44.691 [17] >TRACE: clssnmSetupAckWait: node(2) is ALIVE
[    CSSD]2008-11-12 10:55:44.691 [17] >TRACE: clssnmSendSync: syncSeqNo(3)
- node2 ocssd.log does not indicate the problem. See below for details:
[    CSSD]2008-11-12 10:52:34.731 [11] >TRACE: clssgmClientConnectMsg: Connect from con(da8410) proc(dab900) pid() proto(10:2:1:1)
[    CSSD]2008-11-12 10:53:37.305 [11] >TRACE: clssgmClientConnectMsg: Connect from con(da8410) proc(dab900) pid() proto(10:2:1:1)
[    CSSD]2008-11-12 10:54:40.515 [11] >TRACE: clssgmClientConnectMsg: Connect from con(da8410) proc(dab900) pid() proto(10:2:1:1)
[    CSSD]2008-11-12 11:18:09.997 >USER: Oracle Database 10g CSS Release 10.2.0.3.0 Production Copyright 1996, 2004 Oracle. All rights reserved.
[    CSSD]2008-11-12 11:18:09.997 >USER: CSS daemon log for node saucer, number 2, in cluster crs
[ clsdmt]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=saucerDBG_CSSD))
[    CSSD]2008-11-12 11:18:10.016 [1] >TRACE: clssscmain: local-only set to false
[    CSSD]2008-11-12 11:18:10.031 [1] >TRACE: clssnmReadNodeInfo: added node 1 (flying) to cluster
[    CSSD]2008-11-12 11:18:10.042 [1] >TRACE: clssnmReadNodeInfo: added node 2 (saucer) to cluster
[    CSSD]2008-11-12 11:18:10.057 [5] >TRACE: clssnm_skgxnmon: skgxn init failed
[    CSSD]2008-11-12 11:18:10.057 [1] >TRACE: clssnm_skgxnonline: Using vacuous skgxn monitor
- ORACLE VERIFY: cluvfy was run on node2 resulting with the following:
bash-3.00$ ./cluvfy comp ocr -n all -verbose
Verifying OCR integrity
Checking OCR integrity...
Checking the absence of a non-clustered configuration...
All nodes free of non-clustered, local-only configurations.
Uniqueness check for OCR device passed.
Checking the version of OCR...
OCR of correct Version "2" exists.
Checking data integrity of OCR...
Data integrity check for OCR passed.
OCR integrity check passed.
Verification of OCR integrity was successful.
bash-3.00$
Thanks
jlem

SC 3.1 and Oracle 10g RAC: instance goes down when rebooting other node

I have Sun Cluster 3.1 with Oracle 10gR2 on Solaris 10 SPARC. Thanks to this forum, my cluster now seems fine with a database running. However, I still have one problem: when I reboot node1, the instance on node2 also disappears. The instance on node2 comes back by itself once node1 is up again. The same happens to the instance on node1 if I reboot node2.
The interconnect cables are direct cross-over cable.
Any input is appreciated,
Luke

Although I am not Tim, I can anticipate his answer, as he gave it 3 topics back in this forum. You cannot mount UFS on top of shared SVM. It does not work, as you can see with your own configuration. The only shared filesystem that works for RAC is shared QFS. The doc
http://docs.sun.com/app/docs/doc/819-0583/6n30h62v7?a=view
has all the details.
If you need a shared filesystem for your binaries or whatever, you have to use UFS/PxFS but that sits on top of normal SVM and not shared SVM.
Hartmut

When one node reboot other node in RAC

Hi Friends,
I faced a situation where one node of a RAC cluster was rebooted by the other node. This happened due to fluctuation of the network interconnect link.
Sep 13 16:23:48 kkvs1a su: [ID 810491 auth.crit] 'su admin' failed for wipro1 on /dev/pts/3
Sep 14 00:22:17 kkvs1a ixgbe: [ID 611667 kern.info] NOTICE: ixgbe3: link down
Sep 14 00:22:21 kkvs1a ixgbe: [ID 611667 kern.info] NOTICE: ixgbe3: link up, , full duplex
Sep 14 00:22:31 kkvs1a ixgbe: [ID 611667 kern.info] NOTICE: ixgbe1: link down
Sep 14 00:22:31 kkvs1a ixgbe: [ID 611667 kern.info] NOTICE: ixgbe3: link down
/opt/oracle/product/10.2.0/crs/log/node1/alertkk1a.log
==============================================
2013-09-14 00:22:05.180
[cssd(12561)]CRS-1612:node kk1b (2) at 50% heartbeat fatal, eviction in 14.251 seconds
2013-09-14 00:22:12.180
[cssd(12561)]CRS-1611:node kk1b (2) at 75% heartbeat fatal, eviction in 7.251 seconds
2013-09-14 00:22:13.180
[cssd(12561)]CRS-1611:node kk1b (2) at 75% heartbeat fatal, eviction in 6.251 seconds
2013-09-14 00:22:17.179
[cssd(12561)]CRS-1610:node kk1b (2) at 90% heartbeat fatal, eviction in 2.251 seconds
2013-09-14 00:22:18.180
[cssd(12561)]CRS-1610:node kkvs1b (2) at 90% heartbeat fatal, eviction in 1.251 seconds
This clearly shows that CSSD on node kkvs1a issued the node eviction message for node kkvs1b.
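The countdown in those CRS-1612/1611/1610 lines can be checked mechanically: adding the "eviction in" value to each timestamp should give the same moment every time. A minimal Python sketch, assuming the alert log layout shown above (a timestamp line followed by the message line); the names `HeartbeatWarning`, `parse_heartbeat_warnings` and `predicted_eviction` are made up for illustration.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, List, NamedTuple


class HeartbeatWarning(NamedTuple):
    logged_at: datetime  # timestamp line preceding the message
    node: str            # node that is missing heartbeats
    percent: int         # how far into the timeout CSSD is (50, 75, 90)
    eviction_in: float   # seconds left before eviction


_TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+$")
_HEARTBEAT = re.compile(
    r"CRS-161[0-2]:node (\S+) \((\d+)\) at (\d+)% heartbeat fatal, "
    r"eviction in ([\d.]+) seconds"
)


def parse_heartbeat_warnings(lines: Iterable[str]) -> List[HeartbeatWarning]:
    """Pair each heartbeat warning with the timestamp line just before it."""
    warnings: List[HeartbeatWarning] = []
    logged_at = None
    for raw in lines:
        line = raw.strip()
        if _TIMESTAMP.match(line):
            logged_at = datetime.strptime(line, "%Y-%m-%d %H:%M:%S.%f")
            continue
        match = _HEARTBEAT.search(line)
        if match and logged_at is not None:
            node, _num, percent, seconds = match.groups()
            warnings.append(
                HeartbeatWarning(logged_at, node, int(percent), float(seconds))
            )
    return warnings


def predicted_eviction(warning: HeartbeatWarning) -> datetime:
    """Moment at which the warning says the node will be evicted."""
    return warning.logged_at + timedelta(seconds=warning.eviction_in)
```

For the excerpt above, the first and last warnings both point at roughly 00:22:19.4, a few seconds before the ASM instance on the evicted node reports its abort at 00:22:25.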
I got the following messages on the instance that was rebooted:
ASM alert log:
Sat Sep 14 00:22:25 IST 2013
Error: KGXGN aborts the instance (6)
Sat Sep 14 00:22:25 IST 2013
Errors in file /opt/oracle/admin/+ASM/bdump/+asm2_lmon_8527.trc:
ORA-29702: error occurred in Cluster Group Service operation
LMON: terminating instance due to error 29702
A network fluctuation shouldn't cause a reboot like this. Why did Oracle design it this way? Is this a bug? My Oracle version is 10.2.0.5.0.
Could you tell me the other possible situations in which one RAC instance reboots another RAC instance?

What you are describing is the expected behaviour: if your interconnect fails, you will have a node eviction. Releases < 11.2.0.2 evict a node by reboot, which can fix the problem: the NIC may come up correctly when the machine re-starts. Releases >= 11.2.0.2 can often evict without a re-boot. But either way, if your interconnect goes down, a node must be evicted to prevent uncoordinated disc writes.
If you are interested, you can find some discussion and demos of this in a series of webcasts I've recorded,
Free Oracle Database Tutorials for Administration and Developers
If you really don't like this behaviour and the problems are transient, you can try raising the CSS MISSCOUNT parameter.
John Watson
Oracle Certified Master DBA

RAC Node hang and unexpected reboot

Hello friends
We are facing an intermittent issue of node hangs and unexpected node shutdowns. This is a 2-node RAC, 10.2.0.3, running on Windows 2003. Here's crsd.log:
2009-07-16 17:24:03.058: [ OCRMSG][5252]prom_rpc: CLSC recv failure..ret code 7
2009-07-16 17:24:03.058: [ OCRMSG][5252]prom_rpc: possible OCR retry scenario
2009-07-16 17:24:03.058: [ COMMCRS][5616]clscsendx: (0000000002AF5C60) Physical connection (0000000003892080) not active
2009-07-16 17:24:03.058: [ OCRMSG][5616]prom_rpc: CLSC send failure..ret code 11
2009-07-16 17:24:03.058: [ OCRMSG][5616]prom_rpc: possible OCR retry scenario
2009-07-16 17:24:03.105: [ COMMCRS][5252]clscsendx: (0000000002AF5C60) Connection not active
2009-07-16 17:24:03.105: [ OCRMSG][5252]prom_rpc: CLSC send failure..ret code 6
2009-07-16 17:24:03.105: [ OCRMSG][5252]prom_rpc: possible OCR retry scenario
2009-07-16 17:24:03.105: [ COMMCRS][5616]clscsendx: (0000000002AF5C60) Connection not active
2009-07-16 17:24:03.105: [ OCRMSG][5616]prom_rpc: CLSC send failure..ret code 6
2009-07-16 17:24:03.105: [ OCRMSG][5616]prom_rpc: possible OCR retry scenario
2009-07-16 17:24:03.152: [ COMMCRS][5252]clscsendx: (0000000002AF5C60) Connection not active
2009-07-16 17:24:03.152: [ OCRMSG][5252]prom_rpc: CLSC send failure..ret code 6
2009-07-16 17:24:03.152: [ OCRMSG][5252]prom_rpc: possible OCR retry scenario
2009-07-16 17:24:03.168: [ COMMCRS][5616]clscsendx: (0000000002AF5C60) Connection not active
2009-07-16 17:24:03.168: [ OCRMSG][5616]prom_rpc: CLSC send failure..ret code 6
2009-07-16 17:24:03.168: [ OCRMSG][5616]prom_rpc: possible OCR retry scenario
2009-07-16 17:24:03.215: [ COMMCRS][5252]clscsendx: (0000000002AF5C60) Connection not active
2009-07-16 17:24:03.215: [ OCRMSG][5252]prom_rpc: CLSC send failure..ret code 6
2009-07-16 17:24:03.215: [ OCRMSG][5252]prom_rpc: possible OCR retry scenario
2009-07-16 17:24:03.215: [ COMMCRS][5616]clscsendx: (0000000002AF5C60) Connection not active
2009-07-16 17:24:03.215: [ OCRMSG][5616]prom_rpc: CLSC send failure..ret code 6
2009-07-16 17:24:03.215: [ OCRMSG][5616]prom_rpc: possible OCR retry scenario
2009-07-16 17:24:03.261: [ COMMCRS][5616]clscsendx: (0000000002AF5C60) Connection not active
2009-07-16 17:24:03.261: [ OCRMSG][5616]prom_rpc: CLSC send failure..ret code 6
2009-07-16 17:24:03.261: [ OCRMSG][5616]prom_rpc: possible OCR retry scenario
Please shed some light on what the issue may be.

I suggest you install IPD/OS (http://www.oracle.com/technology/products/database/clustering/ipd_download_homepage.html) on your cluster. This will give you all the relevant OS statistics, so when a node reboot happens you can figure out what the state of the nodes was at that time and then fix the problem. The hang is often caused by something other than Oracle RAC.

RAC node reboots from time to time

Hi,
we have a problem with our RAC: it's a three-node RAC on SLES 9, 64-bit. One node reboots from time to time. We found nothing in any log file, only this in /var/log/messages of node 1:
"Feb 21 14:58:02 pmg-db1 kernel: o2net: connection to node pmg-db2 (num 1) at 192.168.0.2:7777 has been idle for 10 seconds, shutting it down."
Has anyone had a similar problem? Or does anyone have an idea?
regards
Andreas

Sorry, no /var/log/dmesg.
Perhaps I should add another detail: the third node was added after the two-node RAC had run for several months. At first we had the reboot problem with this third node. We found out that the interconnect was connected to a 100 Mbit module of the switch and not to a 1000 Mbit module. We changed this a few days ago, but now the second node has rebooted. And it is connected at 1000 Mbit/s.
And did I mention that we use 10.2.0.2?
regards
Andreas

RAC node rebooting frequently

Hi all,
I am working on a two-node RAC environment. One of my RAC nodes is rebooting very frequently. I am using Oracle 10g database and clusterware (10.2.0.1).
I have checked the OS logs (Linux AS 4) and the RAC-related logs, but was not able to find anything. I am posting all the logs; please suggest.

Hi, I am posting the alert log, OS log and ocssd log.
Clusterware alert log:
[crsd(5649)]CRS-1201:CRSD started on node ctmisdb1.
2012-03-21 09:50:38.188
[cssd(7490)]CRS-1601:CSSD Reconfiguration complete. Active nodes are ctmisdb1 .
2012-03-21 09:50:46.726
[crsd(5649)]CRS-1204:Recovering CRS resources for node ctmisdb2.
2012-03-21 09:55:21.760
[cssd(7490)]CRS-1601:CSSD Reconfiguration complete. Active nodes are ctmisdb1 ctmisdb2 .
2012-03-21 12:07:46.681
[cssd(7426)]CRS-1605:CSSD voting file is online: /dev/raw/raw2. Details in /u01/app/oracle/product/crs/log/ctmisdb1/cssd/ocssd.log.
2012-03-21 12:07:50.432
[cssd(7426)]CRS-1601:CSSD Reconfiguration complete. Active nodes are ctmisdb1 ctmisdb2 .
2012-03-21 12:07:50.893
[crsd(5549)]CRS-1012:The OCR service started on node ctmisdb1.
2012-03-21 12:07:50.942
[evmd(7304)]CRS-1401:EVMD started on node ctmisdb1.
2012-03-21 12:07:52.827
[crsd(5549)]CRS-1201:CRSD started on node ctmisdb1.
2012-03-21 12:48:41.908
[cssd(7448)]CRS-1605:CSSD voting file is online: /dev/raw/raw2. Details in /u01/app/oracle/product/crs/log/ctmisdb1/cssd/ocssd.log.
2012-03-21 12:48:45.741
[cssd(7448)]CRS-1601:CSSD Reconfiguration complete. Active nodes are ctmisdb1 ctmisdb2 .
2012-03-21 12:48:49.173
[crsd(5546)]CRS-1012:The OCR service started on node ctmisdb1.
2012-03-21 12:48:49.190
[evmd(7328)]CRS-1401:EVMD started on node ctmisdb1.
2012-03-21 12:48:50.818
[crsd(5546)]CRS-1201:CRSD started on node ctmisdb1.
2012-03-21 13:26:36.398
[cssd(7343)]CRS-1605:CSSD voting file is online: /dev/raw/raw2. Details in /u01/app/oracle/product/crs/log/ctmisdb1/cssd/ocssd.log.
2012-03-21 13:26:40.492
[cssd(7343)]CRS-1601:CSSD Reconfiguration complete. Active nodes are ctmisdb1 ctmisdb2 .
2012-03-21 13:26:40.939
[crsd(5542)]CRS-1012:The OCR service started on node ctmisdb1.
2012-03-21 13:26:40.977
[evmd(7223)]CRS-1401:EVMD started on node ctmisdb1.
2012-03-21 13:26:42.772
[crsd(5542)]CRS-1201:CRSD started on node ctmisdb1.
Node OS log:
Mar 21 12:06:35 ctmisdb1 rc: Starting readahead: succeeded
Mar 21 12:06:35 ctmisdb1 messagebus: messagebus startup succeeded
Mar 21 12:06:36 ctmisdb1 cups-config-daemon: cups-config-daemon startup succeeded
Mar 21 12:06:36 ctmisdb1 haldaemon: haldaemon startup succeeded
Mar 21 12:06:37 ctmisdb1 fstab-sync[6267]: removed all generated mount points
Mar 21 12:06:37 ctmisdb1 fstab-sync[6378]: added mount point /media/cdrecorder for /dev/hde
Mar 21 12:06:37 ctmisdb1 su(pam_unix)[6323]: session opened for user oracle by (uid=0)
Mar 21 12:06:37 ctmisdb1 su(pam_unix)[6324]: session opened for user oracle by (uid=0)
Mar 21 12:06:37 ctmisdb1 su(pam_unix)[6229]: session opened for user oracle by (uid=0)
Mar 21 12:06:37 ctmisdb1 su(pam_unix)[6229]: session closed for user oracle
Mar 21 12:06:37 ctmisdb1 su(pam_unix)[6644]: session opened for user oracle by (uid=0)
Mar 21 12:06:37 ctmisdb1 kernel: matroxfb: cannot set xres to 800, rounded up to 832
Mar 21 12:06:37 ctmisdb1 last message repeated 2 times
Mar 21 12:06:41 ctmisdb1 su(pam_unix)[6323]: session closed for user oracle
Mar 21 12:06:41 ctmisdb1 su(pam_unix)[6644]: session closed for user oracle
Mar 21 12:06:41 ctmisdb1 su(pam_unix)[6324]: session closed for user oracle
Mar 21 12:06:41 ctmisdb1 logger: Cluster Ready Services completed waiting on dependencies.
Mar 21 12:06:41 ctmisdb1 last message repeated 2 times
Mar 21 12:06:45 ctmisdb1 gdm(pam_unix)[6379]: session opened for user root by (uid=0)
Mar 21 12:06:46 ctmisdb1 gconfd (root-7052): starting (version 2.8.1), pid 7052 user 'root'
Mar 21 12:06:47 ctmisdb1 gconfd (root-7052): Resolved address "xml:readonly:/etc/gconf/gconf.xml.mandatory" to a read-only configuration source at position 0
Mar 21 12:06:47 ctmisdb1 gconfd (root-7052): Resolved address "xml:readwrite:/root/.gconf" to a writable configuration source at position 1
Mar 21 12:06:47 ctmisdb1 gconfd (root-7052): Resolved address "xml:readonly:/etc/gconf/gconf.xml.defaults" to a read-only configuration source at position 2
Mar 21 12:06:55 ctmisdb1 gconfd (root-7052): Resolved address "xml:readwrite:/root/.gconf" to a writable configuration source at position 0
Mar 21 12:07:41 ctmisdb1 su(pam_unix)[5547]: session opened for user oracle by (uid=0)
Mar 21 12:07:41 ctmisdb1 logger: Running CRSD with TZ =
Mar 21 12:07:43 ctmisdb1 su(pam_unix)[7399]: session opened for user oracle by (uid=0)
Mar 21 12:12:49 ctmisdb1 sshd(pam_unix)[15323]: session opened for user root by root(uid=0)
Mar 21 12:12:57 ctmisdb1 su(pam_unix)[15531]: session opened for user oracle by root(uid=0)
Mar 21 12:47:05 ctmisdb1 syslogd 1.4.1: restart.
ocssd log:
[    CSSD]2012-03-21 11:24:41.045 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800661f0c0) proc(0x8006622560) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 11:24:41.078 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660cfe0) proc(0x800662ba70) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:07:44.564 >USER: Oracle Database 10g CSS Release 10.2.0.1.0 Production Copyright 1996, 2004 Oracle. All rights reserved.
[ clsdmt]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=ctmisdb1DBG_CSSD))
[    CSSD]2012-03-21 12:07:44.564 >USER: CSS daemon log for node ctmisdb1, number 1, in cluster crs
[    CSSD]2012-03-21 12:07:44.581 [28260544] >TRACE: clssscmain: local-only set to false
[    CSSD]2012-03-21 12:07:44.603 [28260544] >TRACE: clssnmReadNodeInfo: added node 1 (ctmisdb1) to cluster
[    CSSD]2012-03-21 12:07:44.621 [28260544] >TRACE: clssnmReadNodeInfo: added node 2 (ctmisdb2) to cluster
[    CSSD]2012-03-21 12:07:44.627 [72925824] >TRACE: clssnm_skgxnmon: skgxn init failed, rc 1
[    CSSD]2012-03-21 12:07:44.627 [28260544] >TRACE: clssnm_skgxnonline: Using vacuous skgxn monitor
[    CSSD]2012-03-21 12:07:44.641 [28260544] >TRACE: clssnmInitNMInfo: misscount set to 60
[    CSSD]2012-03-21 12:07:44.655 [28260544] >TRACE: clssnmDiskStateChange: state from 1 to 2 disk (0//dev/raw/raw2)
[    CSSD]2012-03-21 12:07:46.661 [72925824] >TRACE: clssnmDiskStateChange: state from 2 to 4 disk (0//dev/raw/raw2)
[    CSSD]2012-03-21 12:07:46.690 [72925824] >TRACE: clssnmReadDskHeartbeat: node(2) is down. rcfg(18) wrtcnt(7920) LATS(0) Disk lastSeqNo(7920)
[    CSSD]2012-03-21 12:07:46.752 [28260544] >TRACE: clssnmFatalInit: fatal mode enabled
[    CSSD]2012-03-21 12:07:46.752 [94777984] >TRACE: clssnmconnect: connecting to node 1, flags 0x0001, connector 1
[    CSSD]2012-03-21 12:07:46.753 [94777984] >TRACE: clssnmconnect: connecting to node 0, flags 0x0000, connector 1
[    CSSD]2012-03-21 12:07:46.753 [94777984] >TRACE: clssnmClusterListener: Probing node(2)
[    CSSD]2012-03-21 12:07:46.755 [94777984] >TRACE: clssnmConnComplete: connected to node 2 (con 0x8006601040), state 3 birth 0, unique 1332303918/1332303918 prevConuni(0)
[    CSSD]2012-03-21 12:07:46.756 [106332800] >TRACE: clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=Oracle_CSS_LclLstnr_crs_1))
[    CSSD]2012-03-21 12:07:46.756 [106332800] >TRACE: clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_ctmisdb1_crs))
[    CSSD]2012-03-21 12:07:46.757 [151810688] >TRACE: clssnmPollingThread: Connection complete
[    CSSD]2012-03-21 12:07:46.757 [162296448] >TRACE: clssnmSendingThread: Connection complete
[    CSSD]2012-03-21 12:07:46.757 [172782208] >TRACE: clssnmRcfgMgrThread: Connection complete
[    CSSD]2012-03-21 12:07:46.757 [172782208] >TRACE: clssnmRcfgMgrThread: Local Join
[    CSSD]2012-03-21 12:07:46.757 [172782208] >WARNING: clssnmLocalJoinEvent: takeover aborted due to connected but inactive nodes
[    CSSD]2012-03-21 12:07:47.339 [94777984] >TRACE: clssnmHandleSync: Acknowledging sync: src[2] srcName[ctmisdb2] seq[5] sync[18]
[    CSSD]2012-03-21 12:07:47.759 [172782208] >TRACE: clssnmRcfgMgrThread: lastleader(2) unique(1332311864)
[    CSSD]2012-03-21 12:07:48.341 [94777984] >TRACE: clssnmSendVoteInfo: node(2) syncSeqNo(18)
[    CSSD]2012-03-21 12:07:50.346 [94777984] >TRACE: clssnmUpdateNodeState: node 0, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
[    CSSD]2012-03-21 12:07:50.346 [94777984] >TRACE: clssnmDeactivateNode: node 0 () left cluster
[    CSSD]2012-03-21 12:07:50.346 [94777984] >TRACE: clssnmUpdateNodeState: node 1, state (1/2) unique (1332311864/1332311864) prevConuni(0) birth (0/18) (old/new)
[    CSSD]2012-03-21 12:07:50.346 [94777984] >TRACE: clssnmUpdateNodeState: node 2, state (4/3) unique (1332303918/1332303918) prevConuni(0) birth (0/16) (old/new)
[    CSSD]2012-03-21 12:07:50.346 [94777984] >USER: clssnmHandleUpdate: SYNC(18) from node(2) completed
[    CSSD]2012-03-21 12:07:50.346 [94777984] >USER: clssnmHandleUpdate: NODE 1 (ctmisdb1) IS ACTIVE MEMBER OF CLUSTER
[    CSSD]2012-03-21 12:07:50.346 [94777984] >USER: clssnmHandleUpdate: NODE 2 (ctmisdb2) IS ACTIVE MEMBER OF CLUSTER
[    CSSD]2012-03-21 12:07:50.429 [28260544] >USER: NMEVENT_SUSPEND [00][00][00][00]
[    CSSD]2012-03-21 12:07:50.429 [183267968] >TRACE: clssgmReconfigThread: started for reconfig (18)
[    CSSD]2012-03-21 12:07:50.429 [183267968] >USER: NMEVENT_RECONFIG [00][00][00][06]
[    CSSD]2012-03-21 12:07:50.429 [183267968] >TRACE: clssgmEstablishConnections: 2 nodes in cluster incarn 18
[    CSSD]2012-03-21 12:07:50.430 [140255872] >TRACE: clssgmInitialRecv: (0x102a0360) accepted a new connection from node 2 born at 16 active (2, 2), vers (10,3,1,2)
[    CSSD]2012-03-21 12:07:50.430 [140255872] >TRACE: clssgmInitialRecv: conns done (2/2)
[    CSSD]2012-03-21 12:07:50.430 [183267968] >TRACE: clssgmEstablishMasterNode: MASTER for 18 is node(2) birth(16)
[    CSSD]2012-03-21 12:07:50.430 [183267968] >TRACE: clssgmChangeMasterNode: requeued 0 RPCs
[    CSSD]2012-03-21 12:07:50.432 [140255872] >TRACE: clssgmHandleDBDone(): src/dest (2/65535) size(72) incarn 18
[    CSSD]CLSS-3000: reconfiguration successful, incarnation 18 with 2 nodes
[    CSSD]CLSS-3001: local node number 1, master node number 2
[    CSSD]2012-03-21 12:07:50.433 [183267968] >TRACE: clssgmReconfigThread: completed for reconfig(18), with status(1)
[    CSSD]2012-03-21 12:07:50.550 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006603bb0) proc(0x8006608b00) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:07:50.551 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x80066066f0) proc(0x8006608d70) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:07:53.569 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660ec70) proc(0x8006611260) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:08:00.829 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006610990) proc(0x800660de00) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:08:04.698 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006613030) proc(0x8006612930) pid(8115) proto(10:2:1:1)
[    CSSD]2012-03-21 12:08:04.816 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006612950) proc(0x8006613c20) pid(8115) proto(10:2:1:1)
[    CSSD]2012-03-21 12:08:04.832 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006612950) proc(0x8006613c20) pid(8115) proto(10:2:1:1)
[    CSSD]2012-03-21 12:08:06.615 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006612950) proc(0x8006613c20) pid(8171) proto(10:2:1:1)
[    CSSD]2012-03-21 12:08:07.114 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006615960) proc(0x8006616350) pid(8175) proto(10:2:1:1)
[    CSSD]2012-03-21 12:08:11.373 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x80066192a0) proc(0x8006619470) pid(8302) proto(10:2:1:1)
[    CSSD]2012-03-21 12:08:11.669 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800661bf60) proc(0x800661ee20) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:08:17.135 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800661bf60) proc(0x800661ee70) pid(8458) proto(10:2:1:1)
[    CSSD]2012-03-21 12:08:17.268 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800661fc00) proc(0x80066220d0) pid(8460) proto(10:2:1:1)
[    CSSD]2012-03-21 12:08:17.305 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x80066223e0) proc(0x8006625250) pid(8462) proto(10:2:1:1)
[    CSSD]2012-03-21 12:08:17.353 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006625560) proc(0x8006628430) pid(8464) proto(10:2:1:1)
[    CSSD]2012-03-21 12:08:24.585 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006625560) proc(0x8006628430) pid(8645) proto(10:2:1:1)
[    CSSD]2012-03-21 12:08:27.957 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006628740) proc(0x800662b610) pid(8722) proto(10:2:1:1)
[    CSSD]2012-03-21 12:08:30.931 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800662cce0) proc(0x800662c860) pid(8801) proto(10:2:1:1)
[    CSSD]2012-03-21 12:08:36.400 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800661c5f0) proc(0x800661eb50) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:08:37.863 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800662f1c0) proc(0x800661eee0) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:08:38.537 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800662f1c0) proc(0x800661d500) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:08:39.232 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800661bf60) proc(0x800661d500) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:08:43.085 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x8006611210) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:08:58.971 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b830) proc(0x80066112c0) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:09:59.290 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x800660b190) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:10:59.589 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x800660b190) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:11:59.904 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x800660b190) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:13:00.203 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x800660b190) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:13:14.029 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b830) proc(0x800660b190) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:14:00.501 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x8006611210) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:15:00.809 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x8006628670) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:16:01.117 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x8006628670) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:17:01.447 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x800662f0f0) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:18:01.762 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x8006628670) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:18:39.841 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x8006628670) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:18:42.123 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b830) proc(0x8006628670) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:18:42.316 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x8006628670) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:18:42.843 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x8006628670) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:18:42.963 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b830) proc(0x8006628670) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:18:43.098 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b260) proc(0x800662bd20) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:18:44.173 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b830) proc(0x8006628670) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:18:44.368 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b260) proc(0x800660b310) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:18:45.351 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b830) proc(0x8006628670) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:18:46.236 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b830) proc(0x800662f0f0) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:18:47.031 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b830) proc(0x800662f0f0) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:18:47.694 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b830) proc(0x800662f0f0) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:18:47.819 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b260) proc(0x800660b310) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:18:48.103 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b830) proc(0x800662f0f0) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:18:48.327 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b260) proc(0x800660b310) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:18:48.484 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b830) proc(0x8006611210) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:18:48.758 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x800662f0f0) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:18:49.529 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b830) proc(0x800662f0f0) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:18:50.509 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x800662f0f0) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:18:51.060 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x800660b830) proc(0x800662f0f0) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:18:51.558 [106332800] >TRACE: clssgmClientConnectMsg: Connect from con(0x8006611630) proc(0x800662f0f0) pid() proto(10:2:1:1)
[    CSSD]2012-03-21 12:48:39.836 >USER: Oracle Database 10g CSS Release 10.2.0.1.0 Production Copyright 1996, 2004 Oracle. All rights reserved.
[ clsdmt]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=ctmisdb1DBG_CSSD))
[    CSSD]2012-03-21 12:48:39.836 >USER: CSS daemon log for node ctmisdb1, number 1, in cluster crs
[    CSSD]2012-03-21 12:48:39.849 [28260544] >TRACE: clssscmain: local-only set to false
[    CSSD]2012-03-21 12:48:39.865 [28260544] >TRACE: clssnmReadNodeInfo: added node 1 (ctmisdb1) to cluster
[    CSSD]2012-03-21 12:48:39.872 [28260544] >TRACE: clssnmReadNodeInfo: added node 2 (ctmisdb2) to cluster
[    CSSD]2012-03-21 12:48:39.879 [72925824] >TRACE: clssnm_skgxnmon: skgxn init failed, rc 1
[    CSSD]2012-03-21 12:48:39.879 [28260544] >TRACE: clssnm_skgxnonline: Using vacuous skgxn monitor
[    CSSD]2012-03-21 12:48:39.881 [28260544] >TRACE: clssnmInitNMInfo: misscount set to 60
[    CSSD]2012-03-21 12:48:39.888 [28260544] >TRACE: clssnmDiskStateChange: state from 1 to 2 disk (0//dev/raw/raw2)
[    CSSD]2012-03-21 12:48:41.892 [72925824] >TRACE: clssnmDiskStateChange: state from 2 to 4 disk (0//dev/raw/raw2)
[    CSSD]2012-03-21 12:48:41.915 [72925824] >TRACE: clssnmReadDskHeartbeat: node(2) is down. rcfg(20) wrtcnt(10367) LATS(0) Disk lastSeqNo(10367)
[    CSSD]2012-03-21 12:48:41.959 [28260544] >TRACE: clssnmFatalInit: fatal mode enabled
[    CSSD]2012-03-21 12:48:41.959 [94777984] >TRACE: clssnmconnect: connecting to node 1, flags 0x0001, connector 1
[    CSSD]2012-03-21 12:48:41.959 [94777984] >TRACE: clssnmconnect: connecting to node 0, flags 0x0000, connector 1
[    CSSD]2012-03-21 12:48:41.959 [94777984] >TRACE: clssnmClusterListener: Probing node(2)
[    CSSD]2012-03-21 12:48:41.961 [94777984] >TRACE: clssnmConnComplete: connected to node 2 (con 0x8006702790), state 3 birth 0, unique 1332303918/1332303918 prevConuni(0)
[    CSSD]2012-03-21 12:48:41.962 [106332800] >TRACE: clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=Oracle_CSS_LclLstnr_crs_1))
[    CSSD]2012-03-21 12:48:41.962 [106332800] >TRACE: clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_ctmisdb1_crs))
[    CSSD]2012-03-21 12:48:41.963 [152330880] >TRACE: clssnmPollingThread: Connection complete
[    CSSD]2012-03-21 12:48:41.963 [162816640] >TRACE: clssnmSendingThread: Connection complete
[    CSSD]2012-03-21 12:48:41.963 [173302400] >TRACE: clssnmRcfgMgrThread: Connection complete
[    CSSD]2012-03-21 12:48:41.963 [173302400] >TRACE: clssnmRcfgMgrThread: Local Join
[    CSSD]2012-03-21 12:48:41.963 [173302400] >WARNING: clssnmLocalJoinEvent: takeover aborted due to connected but inactive nodes
[    CSSD]2012-03-21 12:48:42.631 [94777984] >TRACE: clssnmHandleSync: Acknowledging sync: src[2] srcName[ctmisdb2] seq[13] sync[20]
[    CSSD]2012-03-21 12:48:42.965 [173302400] >TRACE: clssnmRcfgMgrThread: lastleader(2) unique(1332314319)
[    CSSD]2012-03-21 12:48:43.636 [94777984] >TRACE: clssnmSendVoteInfo: node(2) syncSeqNo(20)
[    CSSD]2012-03-21 12:48:45.640 [94777984] >TRACE: clssnmUpdateNodeState: node 0, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
[    CSSD]2012-03-21 12:48:45.640 [94777984] >TRACE: clssnmDeactivateNode: node 0 () left cluster
[    CSSD]2012-03-21 12:48:45.640 [94777984] >TRACE: clssnmUpdateNodeState: node 1, state (1/2) unique (1332314319/1332314319) prevConuni(0) birth (0/20) (old/new)
[    CSSD]2012-03-21 12:48:45.640 [94777984] >TRACE: clssnmUpdateNodeState: node 2, state (4/3) unique (1332303918/1332303918) prevConuni(0) birth (0/16) (old/new)
[    CSSD]2012-03-21 12:48:45.640 [94777984] >USER: clssnmHandleUpdate: SYNC(20) from node(2) completed
[    CSSD]2012-03-21 12:48:45.640 [94777984] >USER: clssnmHandleUpdate: NODE 1 (ctmisdb1) IS ACTIVE MEMBER OF CLUSTER
[    CSSD]2012-03-21 12:48:45.640 [94777984] >USER: clssnmHandleUpdate: NODE 2 (ctmisdb2) IS ACTIVE MEMBER OF CLUSTER
[    CSSD]2012-03-21 12:48:45.737 [28260544] >USER: NMEVENT_SUSPEND [00][00][00][00]
[    CSSD]2012-03-21 12:48:45.738 [183788160] >TRACE: clssgmReconfigThread: started for reconfig (20)
[    CSSD]2012-03-21 12:48:45.738 [183788160] >USER: NMEVENT_RECONFIG [00][00][00][06]
[    CSSD]2012-03-21 12:48:45.738 [183788160] >TRACE: clssgmEstablishConnections: 2 nodes in cluster incarn 20
[    CSSD]2012-03-21 12:48:45.739 [140776064] >TRACE: clssgmInitialRecv: (0x102a0370) accepted a new connection from node 2 born at 16 active (2, 2), vers (10,3,1,2)
[    CSSD]2012-03-21 12:48:45.739 [140776064] >TRACE: clssgmInitialRecv: conns done (2/2)
[    CSSD]2012-03-21 12:48:45.739 [183788160] >TRACE: clssgmEstablishMasterNode: MASTER for 20 is node(2) birth(16)
[    CSSD]2012-03-21 12:48:45.739 [183788160] >TRACE: clssgmChangeMasterNode: requeued 0 RPCs
[    CSSD]2012-03-21 12:48:45.741 [140776064] >TRACE: clssgmHandleDBDone(): src/dest (2/65535) size(72) incarn 20
[    CSSD]CLSS-3000: reconfiguration successful, incarnation 20 with 2 nodes
Please check and help.
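One way to read a long ocssd log like the one above is to pull out only the lines that mark a CSS daemon start and a completed reconfiguration. The following is a minimal sketch, not an Oracle tool: the `summarize` function and the two regular expressions are illustrative assumptions written against the line formats visible in this post, and the sample text is a handful of lines copied from it.

```python
import re

# A few lines copied from the ocssd log above; a real run would read the log file instead.
SAMPLE = """\
[    CSSD]2012-03-21 12:07:44.564 >USER: CSS daemon log for node ctmisdb1, number 1, in cluster crs
[    CSSD]2012-03-21 12:07:44.641 [28260544] >TRACE: clssnmInitNMInfo: misscount set to 60
[    CSSD]CLSS-3000: reconfiguration successful, incarnation 18 with 2 nodes
[    CSSD]2012-03-21 12:48:39.836 >USER: CSS daemon log for node ctmisdb1, number 1, in cluster crs
[    CSSD]CLSS-3000: reconfiguration successful, incarnation 20 with 2 nodes
"""

# Assumed patterns, derived only from the lines shown in this thread.
START = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ >USER: CSS daemon log for node (\S+),"
)
RECONFIG = re.compile(
    r"CLSS-3000: reconfiguration successful, incarnation (\d+) with (\d+) nodes"
)

def summarize(text):
    """Return (daemon starts, successful reconfigurations) found in the log text."""
    starts = []
    reconfigs = []
    for line in text.splitlines():
        m = START.search(line)
        if m:
            starts.append((m.group(1), m.group(2)))
            continue
        m = RECONFIG.search(line)
        if m:
            reconfigs.append((int(m.group(1)), int(m.group(2))))
    return starts, reconfigs

starts, reconfigs = summarize(SAMPLE)
for when, node in starts:
    print(f"CSS daemon started on {node} at {when}")
for incarnation, nodes in reconfigs:
    print(f"reconfiguration {incarnation} completed with {nodes} nodes")
```

Each daemon start in the summary corresponds to the node coming back up, so comparing those timestamps with the `syslogd 1.4.1: restart.` lines in the messages file is a quick way to line up the two logs.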

Oracle RAC 10g 10.2 on linux_86: linux1 (node1) rebooted and CRS stopped

Hi,
I am fairly new to the RAC world.
I followed the otn.oracle.com guide (build your own RAC on iSCSI) exactly.
There was a power surge and linux1 went down.
Could somebody please help me bring it back? I am facing this error:
PRKH-1010 : Unable to communicate with CRS services.
and
linux1:orcl1:/u01/app/crs/bin:>crs_stat
CRS-0184: Cannot communicate with the CRS daemon.
Result of CLUVRFY utility
System requirement failed for 'database'
Checking CRS integrity...
Checking daemon liveness...
Check: Liveness for "CRS daemon"
  Node Name    Running
  linux4       yes
  linux1       no
Result: Liveness check failed for "CRS daemon".
Checking daemon liveness...
Check: Liveness for "CSS daemon"
  Node Name    Running
  linux4       yes
  linux1       no
Result: Liveness check failed for "CSS daemon".
Checking daemon liveness...
Check: Liveness for "EVM daemon"
  Node Name    Running
  linux4       yes
  linux1       no
Result: Liveness check failed for "EVM daemon".
Liveness of all the daemons
  Node Name    CRS daemon    CSS daemon    EVM daemon
  linux4       yes           yes           yes
  linux1       no            no            no
Checking CRS health...
Check: Health of CRS
  Node Name    CRS OK?
  linux4       yes
Result: CRS health check passed.
CRS integrity check failed.
Checking node application existence...
Checking existence of VIP node application
  Node Name    Required    Status    Comment
  linux4       yes         exists    passed
  linux1       yes         exists    passed
Result: Check passed.
Checking existence of ONS node application
  Node Name    Required    Status    Comment
  linux4       no          exists    passed
  linux1       no          exists    passed
Result: Check passed.
Checking existence of GSD node application
  Node Name    Required    Status    Comment
  linux4       no          exists    passed
  linux1       no          exists    passed
Result: Check passed.
Message was edited by:
shakil_zubair

In addition to Chandra's steps, also make sure that the shared voting and OCR disks are visible from linux1 and have the same device mappings as before the node reboot.
If the CRS stack fails to come up, start by looking at the following files:
a) The OS messages file, to check for any messages logged from the time when you attempted to start the stack manually.
b) The /tmp mountpoint, which sometimes contains files named crsctl* that could also indicate what the problem is.
c) The logs under
$ORA_CRS_HOME/log/<nodename>/ocssd
$ORA_CRS_HOME/log/<nodename>/crsd
$ORA_CRS_HOME/log/<nodename>/client
Let us know if there are any messages in these files that might help us analyze this further.
Vishwa
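The three log locations in (c) can be listed with a short script. This is only a sketch under stated assumptions: the helper names `crs_log_dirs` and `newest_files` are made up for illustration, the fallback CRS home `/u01/app/crs` is inferred from the shell prompt in the question, and `linux1` is the node name used there. Adjust all of them to match your own install.

```python
import os

def crs_log_dirs(crs_home, node):
    """Return the ocssd, crsd and client log directories named in the reply above."""
    return [os.path.join(crs_home, "log", node, sub) for sub in ("ocssd", "crsd", "client")]

def newest_files(directory, count=3):
    """Return up to `count` most recently modified files in `directory`, newest first."""
    if not os.path.isdir(directory):
        return []  # directory missing or not readable as a directory
    paths = [os.path.join(directory, name) for name in os.listdir(directory)]
    files = [p for p in paths if os.path.isfile(p)]
    return sorted(files, key=os.path.getmtime, reverse=True)[:count]

# Assumption: fall back to the CRS home seen in the question if ORA_CRS_HOME is unset.
crs_home = os.environ.get("ORA_CRS_HOME", "/u01/app/crs")
for directory in crs_log_dirs(crs_home, "linux1"):
    print(directory, newest_files(directory))
```

Running it on the failed node shows which of the three directories has recent files, which is usually the quickest way to decide which log to open first.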
