Interconnect IP Issue

Hello All,
We have a 4-node RAC on Windows 64-bit.
This is the information from the alert log on node 1:
Interface type 1 Private 192.168.117.0 configured from OCR for use as a cluster interconnect
Interface type 1 Public 192.168.0.0 configured from OCR for use as a public interface
This is the information from the alert log on node 1:
Interface type 1 Private 192.168.117.0 configured from OCR for use as a cluster interconnect
WARNING 192.168.117.0 could not be translated to a network address error 1
Interface type 1 Public 192.168.0.0 configured from OCR for use as a public interface
WARNING: No cluster interconnect has been specified. Depending on
the communication driver configured Oracle cluster traffic
may be directed to the public interface of this machine.
Oracle recommends that RAC clustered databases be configured
with a private interconnect for enhanced security and
performance.
Now on node 1 if i do
D:\oracle\product\10.2.0\crs\BIN>oifcfg getif
Public 192.168.0.0 global public
Private 192.168.117.0 global cluster_interconnect
D:\oracle\product\10.2.0\crs\BIN>oifcfg iflist
Public 192.168.110.0
Public 192.168.0.0
Private 192.168.117.0
Node 2
D:\oracle\product\10.2.0\crs\bin>oifcfg getif
Public 192.168.0.0 global public
Private 192.168.117.0 global cluster_interconnect
D:\oracle\product\10.2.0\crs\bin>oifcfg iflist
Public 192.168.110.0
Public 192.168.0.0
Privae 192.168.117.0
Please note the spelling of "Private" in the getif and iflist output on node 2. The interface name in the Windows network connections tab is also missing a "t" on db02.
How do I rectify the warning in alert.log and make node 2 use the correct interconnect network, 192.168.117.0, rather than the public one?
Also, could this be the reason for the node evictions (network ping, fatal heartbeat) we have been seeing over the last month or so? This configuration has been the same for the last 8 months and we have had no issues.
Thanks,

Windows as a gaming console these days, I know. As a cluster O/S... not really. ;-)
I would simply try to isolate the problem, working from the bottom up. Are the public IPs correct? Is the netmask correct? Does connectivity work to, from, and between each node? Repeat for the private IPs (Interconnect).
This will eliminate any network config error as the potential problem - allowing you to focus next on the CRS stack and its configuration.
BTW, how are your public and interconnect networks physically wired? Why are you using a 192. address range for the Interconnect and not, for example, a 10. range? Is this a new error in an existing cluster? Or an installation/config error in a new cluster?
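
As part of that bottom-up check, the interface names stored in the OCR (oifcfg getif) can be compared against what the OS reports (oifcfg iflist). The following is a minimal Python sketch, assuming each output line starts with the interface name followed by the subnet, and using the node 2 output pasted in the question as literal strings. The helper names parse_oifcfg and missing_interfaces are hypothetical, not part of any Oracle tool.

```python
def parse_oifcfg(output):
    """Return the set of (interface_name, subnet) pairs found in oifcfg output text."""
    pairs = set()
    for line in output.strip().splitlines():
        fields = line.split()
        if len(fields) >= 2:
            pairs.add((fields[0], fields[1]))
    return pairs


def missing_interfaces(getif_output, iflist_output):
    """List (name, subnet) pairs configured in the OCR that the OS does not report."""
    configured = parse_oifcfg(getif_output)
    present = parse_oifcfg(iflist_output)
    return sorted(configured - present)


# Output copied from the question (node 2).
NODE2_GETIF = """
Public 192.168.0.0 global public
Private 192.168.117.0 global cluster_interconnect
"""
NODE2_IFLIST = """
Public 192.168.110.0
Public 192.168.0.0
Privae 192.168.117.0
"""

# Any pair listed here is configured for the cluster but not visible under that name.
print(missing_interfaces(NODE2_GETIF, NODE2_IFLIST))
```

With the node 2 output, the configured ("Private", "192.168.117.0") pair has no exact match in iflist, which is consistent with the misspelled interface name described in the question.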

Similar Messages

Wait events indicating Interconnect Hardware issues

Version:10.2, 11.2
Which wait events in an AWR report could indicate that the high-speed interconnect is not functioning well?

What are the "expected standards"?
There are numerous factors that determine performance at this level. Latency. Packet size. Collisions. Bandwidth. Etc.
The critical bit is that your Interconnect is not of much use if it is the same speed, or slower, than your storage fabric layer. For example, if you have a dual port HBA running 2 x 2Gb fibre channels into the storage system switch, it makes little sense to have an Interconnect running at 1Gb (minimum Oracle recommendation).
Cache Fusion runs over the Interconnect. Having a cluster cache that is slower (with more latency) than your physical storage layer will be a major performance drawback.
The "standards" for RAC Interconnect, if one wants to call it that, would seem to be running RDS (an Infiniband protocol) over QDR/Quad Data Rate (40Gb) Infiniband - as that is what Oracle's Database Machine and Exadata servers use.
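
The bandwidth comparison in the example above can be written out as simple arithmetic. This is only a rough Python sketch using the figures from the post (a dual-port HBA with 2 x 2Gb fibre channels against a single 1Gb Interconnect). It compares nominal link speeds only and ignores latency, packet size, and protocol overhead, all of which matter in practice. The function names are hypothetical.

```python
def aggregate_gbps(links, gbps_per_link):
    """Nominal aggregate bandwidth of several identical links, in Gb/s."""
    return links * gbps_per_link


def interconnect_is_bottleneck(interconnect_gbps, storage_gbps):
    """True when the Interconnect is the same speed as, or slower than, the storage fabric."""
    return interconnect_gbps <= storage_gbps


storage = aggregate_gbps(links=2, gbps_per_link=2.0)       # 2 x 2Gb fibre channel
interconnect = aggregate_gbps(links=1, gbps_per_link=1.0)  # single 1Gb link

print(storage, interconnect, interconnect_is_bottleneck(interconnect, storage))
```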

Oracle 9i reading BLOB performance issues

Windows XP Pro SP2
JDK 1.5.0_05
Oracle 9i
Oracle Thin Driver for JDK 1.4 v.10.2.0.1.0
DBCP v.1.2.1
Spring v1.2.7 (I am using the JDBC template for convenience)
I have run into serious performance issues reading BLOBs from Oracle using Oracle's JDBC thin driver. I am not sure if it is a constraint/misconfiguration with Oracle or a JDBC problem.
I am hoping that someone has some experience accessing multi-MB BLOBs under heavy volume.
We are considering using Oracle 8 or 9 as a document repository. It will end up storing hundreds of thousands of PDFs that can be as large as 30 MBs. We don't have access to Oracle 10.
TESTS
I am running tests against Oracle 8 and 9 to simulate single and multi-threaded document access. Our goal is to get a sense of KBps throughput and BLOB data access contention.
DATA
There is a single test table with 100 rows. Each row has a PK id and a BLOB field. The blobs range in size from a few dozen KB to 12MB. They represent a valid sample of production data. The total data size is approx. 121 MBs.
Single Threaded Test
The test selects a single blob object at a time and then reads the contents of the blob's binary input stream in 2 KB chunks. At the end of the test, it will have accessed all 100 blobs and streamed all 121 MBs. The test harness is JUnit.
8i Results: On 8i it starts and terminates successfully on a steady and reliable basis. The throughput hovers around 4.8 MBps.
9i Results: Similar reliability to 8i. The throughput is about 30% better.
Multi-Threaded Test
The multi-threaded test uses the same "blob reader" functionality used in the single threaded test. However, it spawns 8 threads each running a separate "blob reader".
8i Results: The tests successfully complete on a reliable basis. The aggregate throughput of all 8 threads is a bit more than 4.8 MBps.
9i Results: Erratic. The tests were highly erratic on 9i. Threads would intermittently lock when accessing a BLOB's output stream. Sometimes they lock accessing data from the same row, other times from distinct rows. The number and timing of the thread "locks" are indeterminate. When the test completed successfully, the aggregate throughput of the 8 threads was approx. 5.4 MBps.
I would be more than happy to post code or the data model if that would help.
Carlos
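
For readers who want to picture the shape of the test harness described above, here is a minimal sketch in Python rather than Java. It is not the poster's code: in-memory byte strings stand in for the BLOB input streams (an assumption), and the function names drain and run_readers are hypothetical. It only illustrates reading each stream in 2 KB chunks from a pool of 8 worker threads and measuring the aggregate bytes and elapsed time.

```python
import io
import time
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 2 * 1024  # 2 KB chunks, as in the test described above


def drain(stream, chunk_size=CHUNK_SIZE):
    """Read a binary stream to the end in fixed-size chunks; return the bytes read."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return total
        total += len(chunk)


def run_readers(blobs, workers):
    """Drain every blob using `workers` threads; return (bytes_read, elapsed_seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        totals = list(pool.map(lambda data: drain(io.BytesIO(data)), blobs))
    return sum(totals), time.perf_counter() - start


# Stand-in data: zero-filled byte strings instead of real BLOBs (an assumption).
sample_blobs = [bytes(size) for size in (40_000, 250_000, 1_000_000)]
bytes_read, seconds = run_readers(sample_blobs, workers=8)
print(f"{bytes_read} bytes in {seconds:.4f}s")
```

Dividing the bytes read by the elapsed seconds gives the aggregate throughput figure quoted in the results.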

Hi Murphy16,
Try to investigate where the main issues are in your RAC system.
Check:
* Expensive SQL statements;
* Sorts on disk;
* Wait events;
* Interconnect hardware issues;
* Applications taking unnecessary manual locks (SQL);
* Whether the SGA is adequately sized (take care not to use swap space on disk);
* Backups and unnecessary jobs running during business hours (reallocate these jobs and backups to a night window or a less busy period for the database);
* Indexes to rebuild and tables that must be reorganized (fragmentation);
* Other software consuming resources on your server.
Please give us more info about your environment. The steps above are general, but you can use them as a guide for basic performance issues.
Regards,
Rodrigo Mufalani
http://mufalani.blogspot.com

RAC Issue with high Transaction OLTP + GC's

In the lab I am testing an in-house application with Oracle 10g.
Setup
- 2 Database servers (RHEL OS 3, IBM H/W, 4GB RAM)
- 2 in house Application servers (RHEL OS 3, IBM H/W, 4GB RAM)
- IBM Disk Array with 2Gbps fiber Cable Speed
- Cisco LAN Switch with 100mbps/sec Port
Application Detail-
-- It's an OLTP application.
-- It does lots of inserts, updates, deletes, and commits for each transaction.
-- Under high application load, it performs 5000 DB transactions/sec against the database.
The following test scenarios need to be tested:
1. Start one application server, connect it to DB instance1, and push a high load.
2. Start both application servers, connect them to DB Instance1b, and push a high load from both applications.
3. Start both application servers and connect them to both DB servers (for example, A1-DB1 & A2-DB2) and check the application's compatibility with the RAC environment (note: here only a very low DB transaction rate is pushed to the DB servers).
4. Same as 3, but with a very high DB transaction rate.
Now these are the results
1. Full Success
2. Full Success
3. Full Success
4. Fail
Detail description of Case4 Failure.
Whenever I start the applications with a high load, within a few minutes the application queues fill up and the DB instances are not able to handle the load.
When I checked the Oracle ASH report, I found the following major events:
1. gc buffer busy
2. gc current block busy
3. gc current multi block request
4. gc cr block 2-way
I checked the Oracle documentation and found three major root causes for this issue:
-- Interconnect
-- Load issues
-- SQL Execution against a Large shared working set
But I am not sure how to resolve this issue.
If anyone has faced this issue with Oracle RAC, please share your experience or give some suggestions for resolving it.
Rgds
Sumit
Bangalore,India

Hi,
You can use fiber for the interconnect. I once looked into this option but haven't implemented it.
In this case your global cache requests are too high. One more thing: if possible, divide the load by domain, so that the same kinds of queries run on one node and other domains' queries run on the second node, e.g. reporting on node2 and OLTP transactions on node1.
Also, kindly look into this:
http://www.ardentperf.com/2007/09/12/gc-buffer-busy-waits-in-rac-finding-hot-blocks/
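
To put the interconnect concern in numbers, here is a rough back-of-the-envelope sketch in Python. It assumes the interconnect traffic passes through the 100 Mbps switch port listed in the setup (the post does not say so explicitly) and assumes an 8 KB database block size. It ignores protocol overhead and latency, so the real ceiling would be lower. The function name is hypothetical.

```python
def max_blocks_per_second(link_mbps, block_bytes):
    """Upper bound on blocks per second a link can carry, ignoring overhead and latency."""
    bytes_per_second = link_mbps * 1_000_000 / 8
    return int(bytes_per_second // block_bytes)


LINK_MBPS = 100      # switch port speed from the setup list
BLOCK_BYTES = 8192   # assumption: 8 KB database block size

print(max_blocks_per_second(LINK_MBPS, BLOCK_BYTES))
```

Under these assumptions the link can move roughly 1,500 blocks per second at most, which is small next to a workload of 5000 transactions per second if many of them need blocks held by the other instance.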

Management IP Address inquiry in UCS

We are starting to grow our UCS environment, so we are using more and more IPs from the pool we set aside for Management IP addresses. Presently, our fabric interconnects reside in the same subnet (class C) as the pool we are using for Management IPs. It looks as though each blade uses two IPs out of the Management Pool, so we are exhausting the IPs we set aside pretty quickly. When I initially set up the UCS environment, I tried to use a subnet for the Management Pool that was different from the subnet used by the fabric interconnects. Unfortunately, I was unable to connect to the KVM functionality with the Management Pool in a different subnet than the fabric interconnects. When I changed the Management Pool IPs to the same subnet as the fabric interconnects, the issue went away. So I take it that the KVM functionality comes through the management interfaces of the fabric interconnects.
My question is: how do I provide a different subnet for my Management Pool? I will run out of IPs in this one subnet eventually and will have to add another one that is different from what the fabric interconnects are using. Do I have to set the ports that the fabric interconnect management interfaces are plugged into for VLAN tagging and then change the network configuration of the fabric interconnects?

Russ,
In the current version (up to 2.1) the Management IPs for blades are required to belong to the same subnet as the Management Interfaces of the Fabric Interconnects. This is due to the way we proxy the KVM request from Management Interfaces to the Blades CIMC. In a future release we're investigating breaking the blade IPs into their own subnet/VLAN but this is a ways out - no committed date at this time.
We understand this puts quite a requirement on the size of the Management Subnet, but with proper design it shouldn't be much of an issue.
Regards,
Robert
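
Since the blade management IPs must share the subnet with the Fabric Interconnect management interfaces, sizing that subnet becomes simple arithmetic. The Python sketch below is an illustration only: the subnet 192.168.10.0/24 and the count of 4 reserved addresses (for the Fabric Interconnects, gateway, and similar) are assumptions, and the figure of two IPs per blade comes from the observation in the question. The function name is hypothetical.

```python
import ipaddress


def blades_supported(subnet, reserved_addresses, ips_per_blade=2):
    """Estimate how many blades fit in a management subnet."""
    network = ipaddress.ip_network(subnet)
    usable = network.num_addresses - 2  # exclude network and broadcast addresses
    return max(usable - reserved_addresses, 0) // ips_per_blade


# Hypothetical class C management subnet with 4 addresses reserved for infrastructure.
print(blades_supported("192.168.10.0/24", reserved_addresses=4))
# The same calculation for a hypothetical larger /23 subnet.
print(blades_supported("192.168.10.0/23", reserved_addresses=4))
```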

3 Nodes with transport Switches

Hi Support,
I have a new installation of a 3-node Oracle 9i cluster. I have a problem with the cluster interconnect VLANs: I get the error message below when I try to assign a network interface as private.
unexpected network traffic was seen on e1000g1 may be cabled to a public network . Thus the nodes can not join the cluster
I have configured 2 VLANs: the first VLAN on the first switch, including server1:e1000g0, server2:e1000g0, server3:e1000g0, and another VLAN on switch2, including server1:e1000g3, server2:e1000g3, server3:e1000g3.
But it still gives the above message. What is the problem, and how should I configure the switches? And what traffic is being seen?
Thanks

Is this during the running of scinstall? If so, it shouldn't prevent you from installing the clusters. There should be a question like:
"Expected traffic seen on network... Do you still want to use this network"
or something like that. You should say yes, if you have correctly configured your switches. You often see this kind of traffic where the switches are keeping some internal state. At least, I've seen it before and it was never a problem.
However, if you've configured your switches incorrectly, you could be heading into trouble. I suggest using snoop to see what traffic is on those networks.
Regards,
Tim
---

Infiniband vs 10GB

Can anyone help me with some information about Infiniband vs 10Gb for a RAC interconnect? I'm using 4 quadsocket servers with 24 total cores.
So, I'm worried about possible issues with the interconnect.
Is 1Gb with Jumbo Frames an option?
Thanks

user11178363 wrote:
Can anyone help me with some information about Infiniband vs 10Gb for a RAC interconnect? I'm using 4 quadsocket servers with 24 total cores.
So, I'm worried about possible issues with the interconnect.
What issues specifically?
Is 1Gb with Jumbo Frames an option?
An option with IB (Infiniband) or GigE?
Cannot recall seeing a 1Gb option when configuring IB. The slowest speed we configure it at is 10Gb. Have not used jumbo frames with IB, but super-jumbo frames - at that time we had server h/w issues (eventually traced and fixed with a CMOS patch) and, during the course of troubleshooting, disabled jumbo frames. Have not re-enabled them since, but I suspect that when we do, Interconnect performance will scale better.
Personally - IB over GigE anytime. Been using it for a number of years now and it has been a rock solid Interconnect architecture.
Also, IB is what Oracle themselves selected to use with their Exadata product - they are using Voltaire switches. Not a surprise there either as Cisco not only selected to miss the boat, but also burn the raft they had (by discontinuing their IB FC gateway product range).

Oracle10g RAC Cluster Interconnect issues

Hello Everybody,
Just a brief overview of what I am currently doing. I have installed an Oracle10g RAC database on a cluster of two Windows 2000 AS nodes. These two nodes access an external SCSI hard disk. I have used Oracle Cluster File System.
Currently I am facing some performance issues when it comes to balancing the workload across both nodes. (A single-instance database load is faster than a parallel load using two database instances.)
I feel the performance issues could be due to IPC using the public Ethernet IP instead of the private interconnect.
(During a parallel load, a large number of data packets are sent over the public IP and not the private interconnect.)
How can I be sure that the private interconnect is used for transferring cluster traffic and not the public IP? (Oracle mentions that for an Oracle10g RAC database, the private IP should be used for the heartbeat as well as for transferring cluster traffic.)
Thanks in advance,
Regards,
Salil

You find the answers here:
RAC: Frequently Asked Questions
Doc ID: NOTE:220970.1
At least crossover interconnect is completely unsupported.
Werner

Fabric Interconnect ISSU support

I have been asked by many customers about ISSU support for the UCS Fabric Interconnect. They all know that the N5K (without the L3 module) supports ISSU. Is it on the roadmap?

Hello wdey,
This is a different concept: the Nexus control plane needs to keep its relationships during the upgrade process.
In the UCS Fabric Interconnect, you have other technologies that provide high availability for the overall system (fabric failover, UCSM cluster, FC multipath).
Although the N5K and the FI are almost the same box, the systems running on them are completely different. ISSU in the FI is not necessary, as all functions can be transferred to the peer FI that remains running.
Richard

Requirement to issue COMMIT to propagate messages via Interconnect

Interconnect is currently in the frame as the Hub and Spoke for a solution integrating various vendor components.
However one item I would like to get clarity on; the concept that for a message to be propagated via Interconnect a commit must be issued. e.g. Doing data entry against Oracle Database; the inputted data will be validated against details from another database and/or a call to a rules engine using Interconnect. To get the message out to Interconnect my current understanding is that a commit must be issued in the Data Entry Database ==> commits the entered data to the database.
However if the validation fails (e.g. person id not valid ) after the call to the other database / rules engine , concerned we may have previously committed data in the Data Entry Database which is not correct. Ideally would like to issue a commit on the data entry database only when the inputted data has been successfully validated.
Questions;
Is a commit required to propagate messages via Interconnect ?
If so is it possible to restrict the scope of the commit to Interconnect Message call by using the Autonomous Transaction construct ?
Are there other approaches ?
Have searched various places on Oracle web-site but can't find a definite answer to the above questions, so all advice most welcome.
Any pointers to what I should read up on which would aid my understanding on Interconnect much appreciated.
Rgds & Thanks,
Eamonn

Yes a commit would be required if you are using the database adapter to transmit the data to interconnect.
And yes it is possible to perform this in an autonomous transaction.

RAC 11R2 Private Interconnect Issue

Friends
We had set up our Oracle Clusterware on Solaris SPARC at version 11.2.0.3 with PSU 2. Some changes happened at the OS level, and the private Interconnect IPs were picked up wrongly by our Oracle Clusterware registry.
The clusterware is down. We are not able to bring up the clusterware. There will be a need to change the private IP configuration at the Oracle Clusterware level and now the clusterware is down.
Is there any way we can change the configuration in private Interconnect ?
Whenever we are trying to do a change. Getting the error message "PRIF-10: failed to initialize the cluster registry"
$ oifcfg setif -global vnet2/10.131.239.0:cluster_interconnect
PRIF-10: failed to initialize the cluster registry
Thank You !
Jai

The clusterware is down. We are not able to bring up the clusterware. There will be a need to change the private IP configuration at the Oracle Clusterware level and now the clusterware is down.
Is there any way we can change the configuration in private Interconnect ?
Whenever we are trying to do a change. Getting the error message "PRIF-10: failed to initialize the cluster registry"
$ oifcfg setif -global vnet2/10.131.239.0:cluster_interconnect
PRIF-10: failed to initialize the cluster registry
This error happens when the clusterware is down and you try to change the Interconnect configuration. You must start Oracle Clusterware on the node before making changes.
We are not able to bring up the clusterware. Some changes happen at the OS level and the private Interconnect IPs
Why is your clusterware not starting? Please post the alert log and crsd.log of the cluster (only the relevant info).
If the error in crsd.log is: PROC-44: Error in network address and interface operations Network address and interface operations error
This error indicates a mismatch between the OS settings (oifcfg iflist) and the gpnp profile settings (profile.xml).
Restore the OS network configuration to its original state and start Oracle Clusterware. Then try to make the changes again.

9iAS Interconnect Issue

Hi Everyone,
As part of 9iAs integration scenarios for one of our customer(sending the messages from Oracle CRM to WebMethod(WM) through 9iAs and getting back response from WM to do some DML operations at Oracle CRM side through 9iAs) we are facing message creation problem in Request & Reply within same session. We are able to execute this scenario with single message. When we have parent with multiple child elements we are unable to send message from CRM to 9iAS. Could you pls help us to resolve this problem? Also if you have any 9iAS related white papers /documents that can help us in resolving this pls send -> [email protected]
Thanks in advance and waiting for your reply.
Suman

Karen, I would not believe that anything in the application server would cause the downloading of JRE 1.2.2. This behavior sounds like a bug in the application. Since this is a web-based application, can you identify the html source that triggers the downloading and post it here?

Aggregates, VLAN's, Jumbo-Frames and cluster interconnect opinions

Hi All,
I'm reviewing my options for a new cluster configuration and would like the opinions of people with more expertise than myself out there.
What I have in mind as follows:
2 x X4170 servers with 8 x NIC's in each.
On each 4170 I was going to configure 2 aggregates with 3 nics in each aggregate as follows
igb0 device in aggr1
igb1 device in aggr1
igb2 device in aggr1
igb3 stand-alone device for iSCSI network
e1000g0 device in aggr2
e1000g1 device in aggr2
e1000g2 device in aggr3
e1000g3 stand-alone device of iSCSI network
Now, on top of these aggregates, I was planning on creating VLAN interfaces which will allow me to connect to our two "public" network segments and for the cluster heartbeat network.
I was then going to configure the vlan's in an IPMP group for failover. I know there are some questions around that configuration in the sense that IPMP will not detect a nic failure if a NIC goes offline in the aggregate, but I could monitor that in a different manner.
At this point, my questions are:
[1] Are VLANs on top of aggregates supported within Solaris Cluster? I've not seen anything in the documentation saying that they are, or are not for that matter. I see that VLANs are supported, including support for cluster interconnects over VLANs.
Now with the standalone interface I want to enable jumbo frames, but I've noticed that the igb.conf file has a global setting for all nic ports, whereas I can enable it for a single nic port in the e1000g.conf kernel driver. My questions are as follows:
[2] What is the general feeling about mixing MTU sizes on the same LAN/VLAN? I've seen some comments that this is not a good idea, and some that say it doesn't cause a problem.
[3] If the underlying nic, igb0-2 (aggr1) for example, has 9k mtu enabled, I can force the mtu size (1500) for "normal" networks on the vlan interfaces pointing to my "public" network and cluster interconnect vlan. Does anyone have experience of this causing any issues?
Thanks in advance for all comments/suggestions.

For 1) the question is really "Do I need to enable Jumbo Frames if I don't want to use them (on neither the public nor the private network)?" - the answer is no.
For 2) each cluster needs to have its own separate set of VLANs.
Greets
Thorsten
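
On the question of mixed MTU sizes on one VLAN, a quick inventory check can at least show where sizes differ before deciding whether that is acceptable. The Python sketch below is illustrative only: the interface names are taken from the question, but the VLAN IDs and MTU values are hypothetical, as is the function name vlans_with_mixed_mtu.

```python
from collections import defaultdict


def vlans_with_mixed_mtu(interfaces):
    """Return {vlan_id: sorted MTUs} for VLANs whose member interfaces disagree on MTU."""
    mtus_by_vlan = defaultdict(set)
    for _name, vlan_id, mtu in interfaces:
        mtus_by_vlan[vlan_id].add(mtu)
    return {vlan: sorted(mtus) for vlan, mtus in mtus_by_vlan.items() if len(mtus) > 1}


# Hypothetical inventory: (interface, VLAN ID, MTU).
interfaces = [
    ("aggr1", 100, 1500),
    ("aggr2", 100, 1500),
    ("igb3", 200, 9000),
    ("e1000g3", 200, 1500),
]

print(vlans_with_mixed_mtu(interfaces))
```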

Post Upgrade SQL Performance Issue

Hello,
I just upgraded/migrated my database from 11.1.0.6 SE to 11.2.0.3 EE. I did this with a Data Pump export/import out of the 11.1.0.6 database and into a new 11.2.0.3 database. Both the old and the new database are on the same Linux server. The new database has 2GB more RAM assigned to its SGA than the old one. Both DBs are using AMM.
The strange part is that I have a SQL statement that completes in 1 second in the old DB and takes 30 seconds in the new one. I even moved the SQL plan from the old DB into the new DB so they are using the same plan.
To sum up the issue. I have one SQL statement using the same SQL Plan running at dramatically different speeds on two different databases on the same server. The databases are 11.1.0.7 SE and 11.2.0.3 EE.
Not sure what is going on or how to fix it, Any help would be great!
I have included Explains and Auto Traces from both NEW and OLD databases.
NEW DB Explain Plan (Slow)
Plan hash value: 1046170788
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
| 0 | SELECT STATEMENT | | 94861 | 193M| | 74043 (1)| 00:18:52 |
| 1 | SORT ORDER BY | | 94861 | 193M| 247M| 74043 (1)| 00:18:52 |
| 2 | VIEW | PBM_MEMBER_INTAKE_VW | 94861 | 193M| | 31803 (1)| 00:08:07 |
| 3 | UNION-ALL | | | | | | |
| 4 | NESTED LOOPS OUTER | | 1889 | 173K| | 455 (1)| 00:00:07 |
|* 5 | HASH JOIN | | 1889 | 164K| | 454 (1)| 00:00:07 |
| 6 | TABLE ACCESS FULL| PBM_CODES | 2138 | 21380 | | 8 (0)| 00:00:01 |
|* 7 | TABLE ACCESS FULL| PBM_MEMBER_INTAKE | 1889 | 145K| | 446 (1)| 00:00:07 |
|* 8 | INDEX UNIQUE SCAN | ADJ_PK | 1 | 5 | | 1 (0)| 00:00:01 |
| 9 | NESTED LOOPS | | 92972 | 9987K| | 31347 (1)| 00:08:00 |
| 10 | NESTED LOOPS OUTER| | 92972 | 8443K| | 31346 (1)| 00:08:00 |
|* 11 | TABLE ACCESS FULL| PBM_MEMBERS | 92972 | 7989K| | 31344 (1)| 00:08:00 |
|* 12 | INDEX UNIQUE SCAN| ADJ_PK | 1 | 5 | | 1 (0)| 00:00:01 |
|* 13 | INDEX UNIQUE SCAN | PBM_EMPLOYER_UK1 | 1 | 17 | | 1 (0)| 00:00:01 |
Predicate Information (identified by operation id):
5 - access("C"."CODE_ID"="MI"."STATUS_ID")
7 - filter("MI"."CLAIM_NUMBER" LIKE '%A0000250%' AND "MI"."CLAIM_NUMBER" IS NOT NULL)
8 - access("MI"."ADJUSTER_ID"="A"."ADJUSTER_ID"(+))
11 - filter("M"."THEIR_GROUP_ID" LIKE '%A0000250%' AND "M"."THEIR_GROUP_ID" IS NOT NULL)
12 - access("M"."ADJUSTER_ID"="A"."ADJUSTER_ID"(+))
13 - access("M"."GROUP_CODE"="E"."GROUP_CODE" AND "M"."EMPLOYER_CODE"="E"."EMPLOYER_CODE")
Note
- SQL plan baseline "SYS_SQL_PLAN_a3c20fdcecd98dfe" used for this statement
OLD DB Explain Plan (Fast)
Plan hash value: 1046170788
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
| 0 | SELECT STATEMENT | | 95201 | 193M| | 74262 (1)| 00:14:52 |
| 1 | SORT ORDER BY | | 95201 | 193M| 495M| 74262 (1)| 00:14:52 |
| 2 | VIEW | PBM_MEMBER_INTAKE_VW | 95201 | 193M| | 31853 (1)| 00:06:23 |
| 3 | UNION-ALL | | | | | | |
| 4 | NESTED LOOPS OUTER | | 1943 | 178K| | 486 (1)| 00:00:06 |
|* 5 | HASH JOIN | | 1943 | 168K| | 486 (1)| 00:00:06 |
| 6 | TABLE ACCESS FULL| PBM_CODES | 2105 | 21050 | | 7 (0)| 00:00:01 |
|* 7 | TABLE ACCESS FULL| PBM_MEMBER_INTAKE | 1943 | 149K| | 479 (1)| 00:00:06 |
|* 8 | INDEX UNIQUE SCAN | ADJ_PK | 1 | 5 | | 0 (0)| 00:00:01 |
| 9 | NESTED LOOPS | | 93258 | 9M| | 31367 (1)| 00:06:17 |
| 10 | NESTED LOOPS OUTER| | 93258 | 8469K| | 31358 (1)| 00:06:17 |
|* 11 | TABLE ACCESS FULL| PBM_MEMBERS | 93258 | 8014K| | 31352 (1)| 00:06:17 |
|* 12 | INDEX UNIQUE SCAN| ADJ_PK | 1 | 5 | | 0 (0)| 00:00:01 |
|* 13 | INDEX UNIQUE SCAN | PBM_EMPLOYER_UK1 | 1 | 17 | | 0 (0)| 00:00:01 |
Predicate Information (identified by operation id):
5 - access("C"."CODE_ID"="MI"."STATUS_ID")
7 - filter("MI"."CLAIM_NUMBER" LIKE '%A0000250%')
8 - access("MI"."ADJUSTER_ID"="A"."ADJUSTER_ID"(+))
11 - filter("M"."THEIR_GROUP_ID" LIKE '%A0000250%')
12 - access("M"."ADJUSTER_ID"="A"."ADJUSTER_ID"(+))
13 - access("M"."GROUP_CODE"="E"."GROUP_CODE" AND "M"."EMPLOYER_CODE"="E"."EMPLOYER_CODE")
NEW DB Auto trace (Slow)
active txn count during cleanout     0
blocks decrypted     0
buffer is not pinned count     664129
buffer is pinned count     3061793
bytes received via SQL*Net from client     3339
bytes sent via SQL*Net to client     28758
Cached Commit SCN referenced     662366
calls to get snapshot scn: kcmgss     3
calls to kcmgas     0
calls to kcmgcs     8
CCursor + sql area evicted     0
cell physical IO interconnect bytes     0
cleanout - number of ktugct calls     0
cleanouts only - consistent read gets     0
cluster key scan block gets     0
cluster key scans     0
commit cleanout failures: block lost     0
commit cleanout failures: callback failure      0
commit cleanouts     0
commit cleanouts successfully completed     0
Commit SCN cached     0
commit txn count during cleanout     0
concurrency wait time     0
consistent changes     0
consistent gets     985371
consistent gets - examination     2993
consistent gets direct     0
consistent gets from cache     985371
consistent gets from cache (fastpath)     982093
CPU used by this session     3551
CPU used when call started     3551
CR blocks created     0
cursor authentications     1
data blocks consistent reads - undo records applied     0
db block changes     0
db block gets     0
db block gets direct     0
db block gets from cache     0
db block gets from cache (fastpath)     0
DB time     3553
deferred (CURRENT) block cleanout applications     0
dirty buffers inspected     0
Effective IO time     0
enqueue releases     0
enqueue requests     0
execute count     3
file io wait time     0
free buffer inspected     0
free buffer requested     0
heap block compress     0
Heap Segment Array Updates     0
hot buffers moved to head of LRU     0
HSC Heap Segment Block Changes     0
immediate (CR) block cleanout applications     0
immediate (CURRENT) block cleanout applications     0
IMU Flushes     0
IMU ktichg flush     0
IMU Redo allocation size     0
IMU undo allocation size     0
index fast full scans (full)     2
index fetch by key     0
index scans kdiixs1     12944
lob reads     0
LOB table id lookup cache misses     0
lob writes     0
lob writes unaligned     0
logical read bytes from cache     -517775360
logons cumulative     0
logons current     0
messages sent     0
no buffer to keep pinned count     10
no work - consistent read gets     982086
non-idle wait count     6
non-idle wait time     0
Number of read IOs issued     0
opened cursors cumulative     4
opened cursors current     1
OS Involuntary context switches     853
OS Maximum resident set size     0
OS Page faults     0
OS Page reclaims     2453
OS System time used     9
OS User time used     3549
OS Voluntary context switches     238
parse count (failures)     0
parse count (hard)     0
parse count (total)     1
parse time cpu     0
parse time elapsed     0
physical read bytes     0
physical read IO requests     0
physical read total bytes     0
physical read total IO requests     0
physical read total multi block requests     0
physical reads     0
physical reads cache     0
physical reads cache prefetch     0
physical reads direct     0
physical reads direct (lob)     0
physical write bytes     0
physical write IO requests     0
physical write total bytes     0
physical write total IO requests     0
physical writes     0
physical writes direct     0
physical writes direct (lob)     0
physical writes non checkpoint     0
pinned buffers inspected     0
pinned cursors current     0
process last non-idle time     0
recursive calls     0
recursive cpu usage     0
redo entries     0
redo size     0
redo size for direct writes     0
redo subscn max counts     0
redo synch time     0
redo synch time (usec)     0
redo synch writes     0
Requests to/from client     3
rollbacks only - consistent read gets     0
RowCR - row contention     0
RowCR attempts     0
rows fetched via callback     0
session connect time     0
session cursor cache count     1
session cursor cache hits     3
session logical reads     985371
session pga memory     131072
session pga memory max     0
session uga memory     392928
session uga memory max     0
shared hash latch upgrades - no wait     284
shared hash latch upgrades - wait     0
sorts (memory)     3
sorts (rows)     243
sql area evicted     0
sql area purged     0
SQL*Net roundtrips to/from client     4
switch current to new buffer     0
table fetch by rowid     1861456
table fetch continued row     9
table scan blocks gotten     0
table scan rows gotten     0
table scans (short tables)     0
temp space allocated (bytes)     0
undo change vector size     0
user calls     7
user commits     0
user I/O wait time     0
workarea executions - optimal     10
workarea memory allocated     342
OLD DB Auto trace (Fast)
active txn count during cleanout     0
buffer is not pinned count     4
buffer is pinned count     101
bytes received via SQL*Net from client     1322
bytes sent via SQL*Net to client     9560
calls to get snapshot scn: kcmgss     15
calls to kcmgas     0
calls to kcmgcs     0
calls to kcmgrs     1
cleanout - number of ktugct calls     0
cluster key scan block gets     0
cluster key scans     0
commit cleanouts     0
commit cleanouts successfully completed     0
concurrency wait time     0
consistent changes     0
consistent gets     117149
consistent gets - examination     56
consistent gets direct     115301
consistent gets from cache     1848
consistent gets from cache (fastpath)     1792
CPU used by this session     118
CPU used when call started     119
cursor authentications     1
db block changes     0
db block gets     0
db block gets from cache     0
db block gets from cache (fastpath)     0
DB time     123
deferred (CURRENT) block cleanout applications     0
Effective IO time     2012
enqueue conversions     3
enqueue releases     2
enqueue requests     2
enqueue waits     1
execute count     2
free buffer requested     0
HSC Heap Segment Block Changes     0
IMU Flushes     0
IMU ktichg flush     0
index fast full scans (full)     0
index fetch by key     101
index scans kdiixs1     0
lob writes     0
lob writes unaligned     0
logons cumulative     0
logons current     0
messages sent     0
no work - consistent read gets     117080
Number of read IOs issued     1019
opened cursors cumulative     3
opened cursors current     1
OS Involuntary context switches     54
OS Maximum resident set size     7868
OS Page faults     12
OS Page reclaims     2911
OS System time used     57
OS User time used     71
OS Voluntary context switches     25
parse count (failures)     0
parse count (hard)     0
parse count (total)     3
parse time cpu     0
parse time elapsed     0
physical read bytes     944545792
physical read IO requests     1019
physical read total bytes     944545792
physical read total IO requests     1019
physical read total multi block requests     905
physical reads     115301
physical reads cache     0
physical reads cache prefetch     0
physical reads direct     115301
physical reads prefetch warmup     0
process last non-idle time     0
recursive calls     0
recursive cpu usage     0
redo entries     0
redo size     0
redo synch writes     0
rows fetched via callback     0
session connect time     0
session cursor cache count     1
session cursor cache hits     2
session logical reads     117149
session pga memory     -983040
session pga memory max     0
session uga memory     0
session uga memory max     0
shared hash latch upgrades - no wait     0
sorts (memory)     2
sorts (rows)     157
sql area purged     0
SQL*Net roundtrips to/from client     3
table fetch by rowid     0
table fetch continued row     0
table scan blocks gotten     117077
table scan rows gotten     1972604
table scans (direct read)     1
table scans (long tables)     1
table scans (short tables)     2
undo change vector size     0
user calls     5
user I/O wait time     0
workarea executions - optimal     4
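
As a rough cross-check of the OLD DB figures above, the read statistics can be related to each other. This is only a sketch: it assumes that "physical reads" counts blocks and "physical read IO requests" counts read calls, and the variable names are made up for illustration.

```python
# Values copied from the OLD DB autotrace statistics above.
physical_read_bytes = 944545792
physical_reads = 115301          # assumed to be a count of blocks
read_io_requests = 1019          # assumed to be a count of read calls

# Bytes per block read: should match the database block size.
bytes_per_block = physical_read_bytes / physical_reads

# Average number of blocks fetched per read request.
blocks_per_request = physical_reads / read_io_requests

print(f"bytes per block:    {bytes_per_block:.0f}")
print(f"blocks per request: {blocks_per_request:.1f}")
```

Under those assumptions the numbers point to an 8 KB block size and large multiblock reads, which fits the "table scans (direct read)" and "physical reads direct" counters in the same listing.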

Hi Srini,
Yes, the stats on the tables and indexes are current in both DBs. However, the NEW DB has "System Stats" in sys.aux_stats$ and the OLD DB does not. The OLD DB has optimizer_index_caching=0 and optimizer_index_cost_adj=100. The NEW DB has them at optimizer_index_caching=90 and optimizer_index_cost_adj=25, but should not be using them because of the "System Stats".
Also, I thought none of the SQL optimizer settings would matter because I forced in my own SQL plan using SPM.
Differences in init.ora
OLD-11     _optimizer_push_pred_cost_based = FALSE
NEW-15     audit_sys_operations = FALSE
     audit_trail = "DB, EXTENDED"
     awr_snapshot_time_offset = 0
OLD-16     audit_sys_operations = TRUE
     audit_trail = "XML, EXTENDED"
NEW-22     cell_offload_compaction = "ADAPTIVE"
     cell_offload_decryption = TRUE
     cell_offload_plan_display = "AUTO"
     cell_offload_processing = TRUE
NEW-28     clonedb = FALSE
NEW-32     compatible = "11.2.0.0.0"
OLD-27     compatible = "11.1.0.0.0"
NEW-37     cursor_bind_capture_destination = "memory+disk"
     cursor_sharing = "FORCE"
OLD-32     cursor_sharing = "EXACT"
NEW-50     db_cache_size = 4294967296
     db_domain = "my.com"
OLD-44     db_cache_size = 0
NEW-54     db_flash_cache_size = 0
NEW-58     db_name = "NEWDB"
     db_recovery_file_dest_size = 214748364800
OLD-50     db_name = "OLDDB"
     db_recovery_file_dest_size = 8438939648
NEW-63     db_unique_name = "NEWDB"
     db_unrecoverable_scn_tracking = TRUE
     db_writer_processes = 2
OLD-55     db_unique_name = "OLDDB"
     db_writer_processes = 1
NEW-68     deferred_segment_creation = TRUE
NEW-71     dispatchers = "(PROTOCOL=TCP) (SERVICE=NEWDBXDB)"
OLD-61     dispatchers = "(PROTOCOL=TCP) (SERVICE=OLDDBXDB)"
NEW-73     dml_locks = 5068
     dst_upgrade_insert_conv = TRUE
OLD-63     dml_locks = 3652
     drs_start = FALSE
NEW-80     filesystemio_options = "SETALL"
OLD-70     filesystemio_options = "none"
NEW-87     instance_name = "NEWDB"
OLD-77     instance_name = "OLDDB"
NEW-94     job_queue_processes = 1000
OLD-84     job_queue_processes = 100
NEW-104     log_archive_dest_state_11 = "enable"
     log_archive_dest_state_12 = "enable"
     log_archive_dest_state_13 = "enable"
     log_archive_dest_state_14 = "enable"
     log_archive_dest_state_15 = "enable"
     log_archive_dest_state_16 = "enable"
     log_archive_dest_state_17 = "enable"
     log_archive_dest_state_18 = "enable"
     log_archive_dest_state_19 = "enable"
NEW-114     log_archive_dest_state_20 = "enable"
     log_archive_dest_state_21 = "enable"
     log_archive_dest_state_22 = "enable"
     log_archive_dest_state_23 = "enable"
     log_archive_dest_state_24 = "enable"
     log_archive_dest_state_25 = "enable"
     log_archive_dest_state_26 = "enable"
     log_archive_dest_state_27 = "enable"
     log_archive_dest_state_28 = "enable"
     log_archive_dest_state_29 = "enable"
NEW-125     log_archive_dest_state_30 = "enable"
     log_archive_dest_state_31 = "enable"
NEW-139     log_buffer = 7012352
OLD-108     log_buffer = 34412032
OLD-112     max_commit_propagation_delay = 0
NEW-144     max_enabled_roles = 150
     memory_max_target = 12884901888
     memory_target = 8589934592
     nls_calendar = "GREGORIAN"
OLD-114     max_enabled_roles = 140
     memory_max_target = 6576668672
     memory_target = 6576668672
NEW-149     nls_currency = "$"
     nls_date_format = "DD-MON-RR"
     nls_date_language = "AMERICAN"
     nls_dual_currency = "$"
     nls_iso_currency = "AMERICA"
NEW-157     nls_numeric_characters = ".,"
     nls_sort = "BINARY"
NEW-160     nls_time_format = "HH.MI.SSXFF AM"
     nls_time_tz_format = "HH.MI.SSXFF AM TZR"
     nls_timestamp_format = "DD-MON-RR HH.MI.SSXFF AM"
     nls_timestamp_tz_format = "DD-MON-RR HH.MI.SSXFF AM TZR"
NEW-172     optimizer_features_enable = "11.2.0.3"
     optimizer_index_caching = 90
     optimizer_index_cost_adj = 25
OLD-130     optimizer_features_enable = "11.1.0.6"
     optimizer_index_caching = 0
     optimizer_index_cost_adj = 100
NEW-184     parallel_degree_limit = "CPU"
     parallel_degree_policy = "MANUAL"
     parallel_execution_message_size = 16384
     parallel_force_local = FALSE
OLD-142     parallel_execution_message_size = 2152
NEW-189     parallel_max_servers = 320
OLD-144     parallel_max_servers = 0
NEW-192     parallel_min_time_threshold = "AUTO"
NEW-195     parallel_servers_target = 128
NEW-197     permit_92_wrap_format = TRUE
OLD-154     plsql_native_library_subdir_count = 0
NEW-220     result_cache_max_size = 21495808
OLD-173     result_cache_max_size = 0
NEW-230     service_names = "NEWDB, NEWDB.my.com, NEW"
OLD-183     service_names = "OLDDB, OLD.my.com"
NEW-233     sessions = 1152
     sga_max_size = 12884901888
OLD-186     sessions = 830
     sga_max_size = 6576668672
NEW-238     shared_pool_reserved_size = 35232153
OLD-191     shared_pool_reserved_size = 53687091
OLD-199     sql_version = "NATIVE"
NEW-248     star_transformation_enabled = "TRUE"
OLD-202     star_transformation_enabled = "FALSE"
NEW-253     timed_os_statistics = 60
OLD-207     timed_os_statistics = 5
NEW-256     transactions = 1267
OLD-210     transactions = 913
NEW-262     use_large_pages = "TRUE"
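
A listing like the one above can also be produced programmatically. The following is a minimal sketch, not the tool that generated this diff: it assumes each parameter sits on its own `name = value` line, uses a handful of values copied from the listing, and the helper names `parse_params` and `diff_params` are hypothetical.

```python
def parse_params(text):
    """Parse 'name = value' lines into a dict; other lines are ignored."""
    params = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        name, _, value = line.partition("=")
        params[name.strip()] = value.strip().strip('"')
    return params


def diff_params(old, new):
    """Return {name: (old_value, new_value)} for every differing parameter."""
    changed = {}
    for name in sorted(old.keys() | new.keys()):
        old_value, new_value = old.get(name), new.get(name)
        if old_value != new_value:
            changed[name] = (old_value, new_value)
    return changed


# A few optimizer-related values taken from the listing above.
old_text = """
optimizer_features_enable = "11.1.0.6"
optimizer_index_caching = 0
optimizer_index_cost_adj = 100
cursor_sharing = "EXACT"
star_transformation_enabled = "FALSE"
parallel_max_servers = 0
"""

new_text = """
optimizer_features_enable = "11.2.0.3"
optimizer_index_caching = 90
optimizer_index_cost_adj = 25
cursor_sharing = "FORCE"
star_transformation_enabled = "TRUE"
parallel_max_servers = 320
"""

differences = diff_params(parse_params(old_text), parse_params(new_text))
for name, (old_value, new_value) in differences.items():
    print(f"{name}: OLD={old_value} NEW={new_value}")
```

Parameters that exist on only one side show up with `None` for the missing value, which makes version-specific parameters easy to separate from real setting changes.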

Failed to restart the CSSD during the interconnect failure

Hi all,
I ran a small ATP in my lab, where I have:
- 2x nodes RAC 11.2.0.2 & ASM (my OCR & Voting files are stored on ASM)
- 1 public interface <> eth0
- 1 private interface <> eth1
- 1 SCAN IP defined in the /etc/hosts file (i'm not using DNS or GNS)
The test I ran was to shut down the private interface (eth1) on node 1, and I saw that:
1) all cluster services and cluster daemons on node 2 were killed and node 2 was evicted from the cluster by node 1
2) all new connections were redirected to the surviving node
3) the Oracle OHASD daemon was restarted on node 2 and tried to start the cluster services without success, because the private network between the cluster nodes was down
Up to here everything worked as expected, but once I turned eth1 back on it took ~ 9 minutes for CSSD to start up and bring all the components up & running.
The node2 alert log shows:
[ctssd(12949)]CRS-2402:The Cluster Time Synchronization Service aborted on host node2. Details at (:ctss_css_init1:) in /u01/oracle/installed/oracle_cluster-11.2.0.2-1/log/node2/ctssd/octssd.log.
2011-04-13 08:09:40.978
[ohasd(5058)]CRS-2765:Resource 'ora.cssd' has failed on server 'node2'.
2011-04-13 08:09:40.985
[/u01/oracle/installed/oracle_cluster-11.2.0.2-1/bin/oraagent.bin(5764)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/oracle/installed/oracle_cluster-11.2.0.2-1/log/node2/agent/ohasd/oraagent_oracle/oraagent_oracle.log";
2011-04-13 08:09:41.169
[ohasd(5058)]CRS-2765:Resource 'ora.asm' has failed on server 'node2'.
2011-04-13 08:09:50.337
[cssd(13103)]CRS-1713:CSSD daemon is started in clustered mode
2011-04-13 08:10:05.833
[cssd(13103)]CRS-1707:Lease acquisition for node node2 number 2 completed
2011-04-13 08:10:07.119
[cssd(13103)]CRS-1605:CSSD voting file is online: ORCL:CRS_DISK1_2G; details in /u01/oracle/installed/oracle_cluster-11.2.0.2-1/log/node2/cssd/ocssd.log.
2011-04-13 08:10:07.121
[cssd(13103)]CRS-1605:CSSD voting file is online: ORCL:CRS_DISK2_2G; details in /u01/oracle/installed/oracle_cluster-11.2.0.2-1/log/node2/cssd/ocssd.log.
2011-04-13 08:10:07.143
[cssd(13103)]CRS-1605:CSSD voting file is online: ORCL:CRS_DISK1_2G; details in /u01/oracle/installed/oracle_cluster-11.2.0.2-1/log/node2/cssd/ocssd.log.
2011-04-13 08:19:49.386
[/u01/oracle/installed/oracle_cluster-11.2.0.2-1/bin/cssdagent(13091)]CRS-5818:Aborted command 'start for resource: ora.cssd 1 1' for resource 'ora.cssd'. Details at (:CRSAGF00113:) {0:6:7} in /u01/oracle/installed/oracle_cluster-11.2.0.2-1/log/node2/agent/ohasd/oracssdagent_root/oracssdagent_root.log.
2011-04-13 08:19:49.387
[cssd(13103)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/oracle/installed/oracle_cluster-11.2.0.2-1/log/node2/cssd/ocssd.log
2011-04-13 08:19:49.387
[cssd(13103)]CRS-1603:CSSD on node node2 shutdown by user.
2011-04-13 08:19:54.501
[ohasd(5058)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server 'node2'.
2011-04-13 08:19:57.723
[cssd(17068)]CRS-1713:CSSD daemon is started in clustered mode
2011-04-13 08:20:01.177
[ohasd(5058)]CRS-2765:Resource 'ora.diskmon' has failed on server 'node2'.
2011-04-13 08:20:13.167
[cssd(17068)]CRS-1707:Lease acquisition for node node2 number 2 completed
Pay attention to the timestamps 08:10:07.143 and 08:19:49.386.
The error in the oracssdagent_root.log is:
2011-04-13 08:09:49.286: [CLSFRAME][3014212592] New Framework state: 2
2011-04-13 08:09:49.286: [CLSFRAME][3014212592] M2M is starting...
2011-04-13 08:09:49.288: [ CRSCOMM][3014212592] Ipc: Starting send thread
2011-04-13 08:09:49.288: [ CRSCOMM][1092061504] Ipc: sendWork thread started.
2011-04-13 08:09:49.289: [ CRSCOMM][1105643840] IpcC: IPC Client thread started listening
2011-04-13 08:09:49.289: [ CRSCOMM][1105643840] IpcC: Received member number of 10
2011-04-13 08:09:49.290: [CLSFRAME][3014212592] New IPC Member:{Relative|Node:0|Process:0|Type:2}:OHASD:node2
2011-04-13 08:09:49.290: [CLSFRAME][3014212592] New process connected to us ID:{Relative|Node:0|Process:0|Type:2} Info:OHASD:node2
2011-04-13 08:09:49.291: [CLSFRAME][3014212592] Tints initialized with nodeId: 0 procId: 10
2011-04-13 08:09:49.291: [CLSFRAME][3014212592] Starting thread model named: MultiThread
2011-04-13 08:09:49.292: [CLSFRAME][3014212592] Starting thread model named: TimerSharedTM
2011-04-13 08:09:49.293: [CLSFRAME][3014212592] New Framework state: 3
2011-04-13 08:09:49.293: [    AGFW][3014212592] Agent Framework started successfully
2011-04-13 08:09:49.293: [    AGFW][1116150080] {0:10:2} Agfw engine module has enabled...
2011-04-13 08:09:49.293: [CLSFRAME][1116150080] {0:10:2} Module Enabling is complete
2011-04-13 08:09:49.293: [CLSFRAME][1116150080] {0:10:2} New Framework state: 6
2011-04-13 08:09:49.294: [CLSFRAME][3014212592] M2M is now powered by a doWork() thread.
2011-04-13 08:09:49.294: [    AGFW][1116150080] {0:10:2} Agent is started with userid: root , expected user: root
2011-04-13 08:09:49.294: [   AGENT][1116150080] {0:10:2} Static Version 11.2.0.2.0
2011-04-13 08:09:49.294: [    AGFW][1116150080] {0:10:2} Agent sending message to PE: AGENT_HANDSHAKE[Proxy] ID 20484:11
2011-04-13 08:09:49.302: [    AGFW][1116150080] {0:10:2} Agent received the message: RESTYPE_ADD[ora.cssd.type] ID 8196:12358
2011-04-13 08:09:49.302: [    AGFW][1116150080] {0:10:2} Added new restype: ora.cssd.type
2011-04-13 08:09:49.303: [    AGFW][1116150080] {0:10:2} Agent sending last reply for: RESTYPE_ADD[ora.cssd.type] ID 8196:12358
2011-04-13 08:09:49.305: [    AGFW][1116150080] {0:10:2} Agent received the message: RESOURCE_ADD[ora.cssd 1 1] ID 4356:12359
2011-04-13 08:09:49.305: [    AGFW][1116150080] {0:10:2} Added new resource: ora.cssd 1 1 to the agfw
2011-04-13 08:09:49.306: [    AGFW][1116150080] {0:10:2} Agent sending last reply for: RESOURCE_ADD[ora.cssd 1 1] ID 4356:12359
2011-04-13 08:09:49.308: [    AGFW][1116150080] {0:6:7} Agent received the message: RESOURCE_START[ora.cssd 1 1] ID 4098:12360
2011-04-13 08:09:49.308: [    AGFW][1116150080] {0:6:7} Preparing START command for: ora.cssd 1 1
2011-04-13 08:09:49.308: [    AGFW][1116150080] {0:6:7} ora.cssd 1 1 state changed from: UNKNOWN to: STARTING
2011-04-13 08:09:49.309: [ora.cssd][1114048832] {0:6:7} [start] clsncssd_cssdstart: Start action called
2011-04-13 08:09:49.309: [ora.cssd][1114048832] {0:6:7} [start] clsncssd_getattr: attr OMON_INITRATE, value 1000
2011-04-13 08:09:49.309: [ora.cssd][1114048832] {0:6:7} [start] clsncssd_getattr: attr OMON_POLLRATE, value 500
2011-04-13 08:09:49.309: [ora.cssd][1114048832] {0:6:7} [start] clsncssd_getattr: attr ORA_OPROCD_MODE, value
2011-04-13 08:09:49.310: [ora.cssd][1114048832] {0:6:7} [start] clsncssd_getattr: attr PROCD_TIMEOUT, value 1000
2011-04-13 08:09:49.310: [ora.cssd][1114048832] {0:6:7} [start] clsncssd_getattr: attr LOGGING_LEVEL, value 1
2011-04-13 08:09:49.310: [ora.cssd][1114048832] {0:6:7} [start] clsncssd_cssdstart: loglevels CSSD=2,GIPCNM=2,GIPCGM=2,GIPCCM=2,CLSF=0,SKGFD=0,GPNP=1,OLR=0
2011-04-13 08:09:49.313: [ora.cssd][1114048832] {0:6:7} [start] clsncssd_cssdstart: START action for resource /u01/oracle/installed/oracle_cluster-11.2.0.2-1/bin/ocssd: SUCCESS
2011-04-13 08:09:49.313: [ora.cssd][1114048832] {0:6:7} [start] clsncssd_waitomon: start waiting
2011-04-13 08:09:49.313: [ CSSCLNT][1098377536]clsssInitNative: Init for agent
2011-04-13 08:09:50.317: [ CSSCLNT][1098377536]clsssInitNative: Init for agent
2011-04-13 08:09:51.319: [ CSSCLNT][1098377536]clsssInitNative: Init for agent
2011-04-13 08:09:51.322: [ CSSCLNT][1098377536]clssnsqueryfatal: css is fatal = 0
2011-04-13 08:09:51.322: [ USRTHRD][1098377536] clsncssd_thrdspawn: spawn OPROCD succ
2011-04-13 08:09:51.322: [ USRTHRD][1098377536] clsncssd_thrdspawn: spawn POLLMSG succ
2011-04-13 08:09:51.323: [ USRTHRD][1099954496] clsnpollmsg_main: starting pollmsg thread
2011-04-13 08:09:51.323: [ USRTHRD][1107745088] clsnproc_main: timeout of procd cannot be 0, now we set to default 1000.
2011-04-13 08:09:51.323: [ USRTHRD][1117727040] clsnwork_main: starting worker thread
2011-04-13 08:09:51.323: [ USRTHRD][1098377536] clsncssd_thrdspawn: spawn WORKER succ
2011-04-13 08:09:51.323: [ USRTHRD][1107745088] clsnproc_main: starting oprocd
2011-04-13 08:09:51.323: [ USRTHRD][1098377536] clsncssd_thrdspawn: spawn KILL succ
2011-04-13 08:10:07.151: [ USRTHRD][1098377536] clsnomon_init: css init done, nodenum 2
2011-04-13 08:10:07.151: [ USRTHRD][1098377536] clsnomon_WaitToRegister: waiting for first reconfiguration and kgzf initialization
2011-04-13 08:19:49.385: [CLSFRAME][3014212592] TM [MultiThread] is changing desired thread # to 3. Current # is 2
2011-04-13 08:19:49.387: [    AGFW][1111947584] {0:6:7} Created alert : (:CRSAGF00113:) : Aborting the command: start for resource: ora.cssd 1 1
2011-04-13 08:19:49.387: [ora.cssd][1111947584] {0:6:7} [start] clsncssd_cssdabort: sending shutdown abort to CSS with new ctx
2011-04-13 08:19:49.387: [ CSSCLNT][1098377536]clsssRecvMsg: wrong type request (0) on 0xc9 ret 0
2011-04-13 08:19:49.387: [ CSSCLNT][1098377536]clssnskgzfdone: RPC failed rc 1
2011-04-13 08:19:49.387: [ USRTHRD][1098377536] clsnomon_WaitToRegister: exadata initialization completed with rc=1
2011-04-13 08:19:49.387: [ USRTHRD][1098377536] clsnomon_init: problems in the CSS to allow OMON registration 2
2011-04-13 08:19:49.387: [ USRTHRD][1098377536] clsnomon_cleanup: to exit status = 2
2011-04-13 08:19:49.387: [ USRTHRD][1098377536] clsnomon_cleanup: failure, sending shutdown immediate to CSS
2011-04-13 08:19:49.387: [ USRTHRD][1098377536] CHECK action is in progress, Rejecting the check action requested by entry point for ora.cssd
2011-04-13 08:19:49.426: [    AGFW][2008402928] Starting the agent: /u01/oracle/installed/oracle_cluster-11.2.0.2-1/log/node2/agent/ohasd/oracssdagent_root/
2011-04-13 08:19:49.426: [   AGENT][2008402928] Agent framework initialized, Process Id = 17013
2011-04-13 08:19:49.426: [ USRTHRD][2008402928] to enter agent main
2011-04-13 08:19:49.426: [ USRTHRD][2008402928] clsscssd_main: New soft limit for stack size is 1572864, hard limit is 4294967295
2011-04-13 08:19:49.434: [ USRTHRD][2008402928] clsncssd_main: setting priority to 4
2011-04-13 08:19:49.434: [ USRTHRD][2008402928] *** Agent Framework Started ***
Do you have any idea why it took so long to bring all the components up & running?
Thanks a lot!!
G

Hi,
there is an internal timer in the clusterware that governs the restarting of resources.
After a node eviction or a restart of the cluster stack, the clusterware tries to start up again.
If the issue still persists, CRS waits for some time before it tries to start the stack again. This "restart" attempt is based on a timer that is set to 600 seconds (note that this is not the ORA_CHECK_TIMEOUT but the STARTUP_TIMEOUT).
Since a missing interconnect has implications not only for the network but for the whole stack, it is expected that the cluster does not start quickly on its own, because the first start attempt is still running.
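
The 600 seconds can be checked against the oracssdagent_root.log excerpt in the question. The sketch below simply subtracts the timestamp of the RESOURCE_START message for ora.cssd from the timestamp of the "Aborting the command" alert; both values are copied from that log, while the helper name `seconds_between` is made up for illustration.

```python
from datetime import datetime

# Timestamp layout used in the agent log, e.g. "2011-04-13 08:09:49.308".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def seconds_between(start, end):
    """Elapsed seconds between two log timestamps given as strings."""
    started = datetime.strptime(start, LOG_TIME_FORMAT)
    ended = datetime.strptime(end, LOG_TIME_FORMAT)
    return (ended - started).total_seconds()


# RESOURCE_START[ora.cssd 1 1] received by the agent.
start_requested = "2011-04-13 08:09:49.308"
# "Aborting the command: start for resource: ora.cssd 1 1".
start_aborted = "2011-04-13 08:19:49.387"

elapsed = seconds_between(start_requested, start_aborted)
print(f"start attempt ran for {elapsed:.3f} seconds before it was aborted")
```

The result is just over 600 seconds, so the first start attempt appears to have been allowed to run for the full timeout before the agent aborted it and a fresh start began.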
There is another "issue" connected to this: Oracle will only try a limited number of times (FAILURE_COUNT/FAILURE_THRESHOLD) to restart resources. If it cannot restart cssd/crsd after several attempts, OCW will not try to start up automatically again, but expects the administrator to resolve the error and then start the stack manually.
But this actually makes sense:
We have to allow some time for an error to be resolved before starting automatically. It does not matter if the restart of the node is delayed by this, because
=> If the error is fixed automatically, it will normally be fixed after a cluster/node reboot, and hence the cluster will come up.
=> If the error is not fixed automatically but manually, the administrator can be expected to tell the clusterware that the issue is resolved, simply by starting the stack (crsctl start crs).
=> If the error is fixed automatically but fixing it took a while (let's say 15 minutes), it does not really matter if the clusterware needs 10 more minutes to come up.
So what you see is expected, and wanted.
It would cost far too much to monitor all resources for cluster problems and trigger a startup...
Sebastian
