Cluster Synchronization/Communication

Hi,
What is SAP's suggested or preferred way to implement cluster synchronization/communication in a NetWeaver AS Java cluster? For example, an application deployed on two instances manages a RAM-based cache. This cache needs to be synchronized somehow; at the very least, a flush triggered on one instance should result in a flush on the other. I would use JMS for this kind of situation. Since the AS seems to be J2EE 1.3 compatible, this should be no problem, right? Are there other suggested or preferred ways of implementing this communication?
Best regards,
Fabian
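
For other readers, here is a minimal sketch of the flush-propagation pattern described in the question. It is written in Python with an in-process stand-in for a JMS topic, so it shows only the message flow and does not use any SAP or JMS API. The names FlushTopic, CacheNode, instance-a and instance-b are made up for illustration.

```python
from typing import Callable, Dict, List


class FlushTopic:
    """In-process stand-in for a JMS topic: every subscriber sees every message."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, callback: Callable[[str], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, origin: str) -> None:
        # Deliver the name of the publishing instance to all subscribers.
        for callback in list(self._subscribers):
            callback(origin)


class CacheNode:
    """One application instance holding a RAM-based cache (hypothetical)."""

    def __init__(self, name: str, topic: FlushTopic) -> None:
        self.name = name
        self.cache: Dict[str, str] = {}
        self._topic = topic
        topic.subscribe(self._on_flush_message)

    def flush(self) -> None:
        """Flush locally, then tell the other instances to do the same."""
        self.cache.clear()
        self._topic.publish(self.name)

    def _on_flush_message(self, origin: str) -> None:
        # Ignore our own message so a flush does not trigger itself again.
        if origin != self.name:
            self.cache.clear()


topic = FlushTopic()
node_a = CacheNode("instance-a", topic)
node_b = CacheNode("instance-b", topic)
node_a.cache["k"] = "v1"
node_b.cache["k"] = "v1"
node_a.flush()  # both caches are empty afterwards
```

Tagging each message with its origin is one way to keep an instance from reacting to its own flush; a real topic would also need a decision on durable versus non-durable subscriptions, which this sketch leaves open.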

Not sure if you've found an answer or are still looking, but since I was looking for an answer to the same thing and found this post on Google, I'll post this here for any other Googlers.
From the release notes for Sun Cluster 3.2 Geographic Edition:
Sun Cluster Manager Requires Same Root Password on Partner Clusters (6260505)
Problem Summary: To use the Sun Cluster Manager graphical user interface (GUI), the root password must be the same on all nodes of both clusters in the Sun Cluster Geographic Edition deployment.
Workaround: If you use Sun Cluster Manager to configure your clusters, ensure that the root password is the same on every node of both clusters. If you prefer to not set the root password identically on all nodes, use the command-line interface to configure your clusters.
I had the exact same error, changed the root passwords to match, and the error went away, so apparently that was the issue.

Similar Messages

Error: failed to synchronize community: [ SOLVED ]

Hi friends,
I can't synchronize community. I use pacman 3.0 from testing.
pacman -Sy
:: Synchronizing package databases...
testing is up to date
current is up to date
extra is up to date
unstable is up to date
error: failed to synchronize community:
local database is up to date
Directory /var/lib/pacman/community is empty.
I think that's the problem.
Any suggestions?
Last edited by mezcal (2007-04-01 21:19:00)

First off, check whether the file
/etc/pacman.d/community
exists. Make sure the file contains a list of mirrors.
Then look inside the file
/etc/pacman.conf
and make sure you haven't commented out the line Include=/etc/pacman.d/community
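
The second check can be scripted. Below is a minimal sketch in Python that lists the Include lines that are not commented out; the function name active_includes and the sample configuration text are made up for illustration, and on a real system you would pass in the contents of /etc/pacman.conf instead.

```python
from typing import List


def active_includes(conf_text: str) -> List[str]:
    """Return the paths of all uncommented Include lines in pacman.conf text."""
    paths = []
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments (lines starting with '#').
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "Include":
            paths.append(value.strip())
    return paths


# Made-up sample: the community Include is commented out.
sample = """
[extra]
Include = /etc/pacman.d/extra

[community]
#Include = /etc/pacman.d/community
"""
print(active_includes(sample))  # ['/etc/pacman.d/extra']
```

If /etc/pacman.d/community is missing from the returned list, the Include line is either commented out or absent.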

ORA-29701: unable to connect to Cluster Synchronization Service

Hi, I'm working on an Oracle 11.2 EE on a RedHat 5.4.
I installed the Grid Infrastructure to work with ASM.
While starting ASM instance I got this error:
ORA-01078: failure in processing system parameters
ORA-29701: unable to connect to Cluster Synchronization Service
I tried to start the resource with:
$ crsctl start resource -all
CRS-4639: Could not contact Oracle High Availability Services
CRS-4000: Command Start failed, or completed with errors.
Then I tried this:
$ crsctl start has
CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.
Do you have any suggestions on how to start the CSS?
Thanks in advance.
Samuel

Hi,
It was just a matter of the run level.
As soon as I set it to 5 everything started to work.
Samuel

Logs for Cluster Synchronization Service daemon

ORA-29746: Cluster Synchronization Service is being shut down.
Cause: The administrator has shut down the Cluster Synchronization Service daemon. This error message is intended to be informative to users on the status of the service.
Action: Check the log file of the Cluster Synchronization Service daemon to verify the state of the service.
Can someone tell me the location of these logs?
Thanks

$CRS_HOME/log/<node>/cssd/

Cluster Network communication doesn't go correctly through adapters

Hi!
I have a cluster configured from VMM with logical networks for cluster communication, live migration, and a public network with management. These networks run over a team of two physical adapters with a converged switch. A virtual adapter is created for each network on every cluster host.
Communication in the team behaves strangely: inbound traffic uses one adapter and outbound traffic uses the other. There should be aggregation. The team uses LACP, and we tried both Hyper-V Port and Address Hash, with the same behavior in both configurations.
Does anybody know what to do to get aggregated communication?
Thanks

I'd like to push this up to the LabVIEW dev folks and ask why a string-outputting version of an enum could not be created and made functional as a type def. I understand the difference between enums and rings. But I would like to suggest that the powers that be create a data type for strings, similar to the enum, which updates all instances of a type def when changes are made to the string content within.
To draw a parallel: an enum pairs the string label with a number and packages the two as part of the type def. Could we not instead pair the string label with a string and again package the two as a type def? You could choose to have the label and string content match or not as needed, but the pair would update universally.
Here is one instance where this would be useful. I have been using the Asynchronous Message Communication Library lately, which uses the standard LabVIEW queue functions. While the standard queue functions can accept various data types, the AMC library is limited to string messages only, and rewriting the entire AMC library is time-prohibitive.
So it would be very convenient to have something that looks like a combo box constant as a type def to feed into the AMC libraries instead of individual string constants. It would significantly reduce the errors that come from typing the same string constants over and over, only to find them the hard way after hours of debugging.

Cluster synchronization monitor

Hello all,
I have successfully clustered several CQ5.5 (CRX 2.3) instances. I would like to monitor all nodes to ensure they are in sync. How do you do this? In other words, if a cluster node were to become, say, 10-15 minutes out of sync with the master, how would you know?
I see a ClusterNodeRevision property in MBean com.adobe.granite:type=Repository. This seems like a type of sequence number. Is it safe to assume that ClusterNodeRevision will always increase? If so, we could check to see if ClusterNodeRevision was, say, within 100000 of the master.
What are you doing to ensure your cluster nodes stay in sync?
Thanks,
Jeffrey
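
For anyone who wants to try the threshold idea from the question, here is a minimal sketch in Python. It assumes that ClusterNodeRevision can be read as an integer that only ever increases, which is exactly the open point in the post, and it does not show how the values are read from the MBean. The function name lagging_nodes, the node names, and the revision numbers are made up; 100000 is the threshold mentioned above.

```python
from typing import Dict, List


def lagging_nodes(master_revision: int,
                  node_revisions: Dict[str, int],
                  max_lag: int = 100000) -> List[str]:
    """Return the names of nodes whose revision trails the master by more than max_lag."""
    return sorted(
        name
        for name, revision in node_revisions.items()
        if master_revision - revision > max_lag
    )


# Made-up revisions: node-b trails by 50,000, node-c by 300,000.
revisions = {"node-b": 4_950_000, "node-c": 4_700_000}
print(lagging_nodes(5_000_000, revisions))  # ['node-c']
```

A fixed revision gap does not translate directly into minutes; how many revisions correspond to 10-15 minutes depends on the write rate of the repository.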

Hi Tim,
Sorry for asking too many questions. I searched the internet for CMM before posting. Most of the results describe CMM and its function, but none answers my question.
If CMM is a kernel module, then modinfo should return the name of the CMM module. But the output does not show the CMM module.
My questions are:
Where is CMM running?
Is there a daemon for it? If so, what is the daemon's name?
How do I start CMM if it is stopped?
Regards,
R. Rajesh Kannan.

Question on cluster synchronization...

Say you have two servers in a cluster, S1 and S2. Assume client A
updates entity bean B in server S1. Client B later queries entity
bean B in server S2, which was pooled. Will client B read stale data?
When client A updates entity bean B in server S1, how is this change
propagated to server S2? Is it propagated at all?
Does it make sense to use pooling in a multi-server configuration?
Thanks in advance
Andrej


You have to set DBShared to true; then every method invocation on the
bean will load the data from the database, so you will always have the
latest changes.
- Prasad

During the installation of grid infra(cluster) for Oracle 11.2 RAC one.

Good Day All, and thanks in advance…
During the installation of grid infrastructure (cluster) for Oracle 11.2 RAC One Node on AIX 6.1 (PROD), with ASM in use, I am getting the errors below when executing ./root.sh.
On investigation, I found note 1068212.1 on the Oracle support site (see below for details). I might be hitting unpublished bug 8670579. I also logged a Severity 2 SR with Oracle Support to get the bug/patch fix, but no one has responded yet.
This might be a configuration issue or something else. If you have experienced the same issue, please assist. (If you need more log files, please feel free to ask.)
I ran the Cluster Verify check; everything passed.
Many Thanks
Ezekiel Filane
/u01/app/11.2.0/grid#./root.sh
Running Oracle 11g root.sh script...
The following environment variables are set as:
ORACLE_OWNER= grid
ORACLE_HOME= /u01/app/11.2.0/grid
Enter the full pathname of the local bin directory: [usr/local/bin]:
The file "dbhome" already exists in /usr/local/bin. Overwrite it? (y/n) [n]:
The file "oraenv" already exists in /usr/local/bin. Overwrite it? (y/n) [n]:
The file "coraenv" already exists in /usr/local/bin. Overwrite it? (y/n) [n]:
Creating /etc/oratab file...
Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root.sh script.
Now product-specific root actions will be performed.
2010-10-19 10:33:11: Parsing the host name
2010-10-19 10:33:11: Checking for super user privileges
2010-10-19 10:33:11: User has super user privileges
Using configuration parameter file: /u01/app/11.2.0/grid/crs/install/crsconfig_params
Creating trace directory
User grid has the required capabilities to run CSSD in realtime mode
LOCAL ADD MODE
Creating OCR keys for user 'root', privgrp 'system'..
Operation successful.
root wallet
root wallet cert
root cert export
peer wallet
profile reader wallet
pa wallet
peer wallet keys
pa wallet keys
peer cert request
pa cert request
peer cert
pa cert
peer root cert TP
profile reader root cert TP
pa root cert TP
peer pa cert TP
pa peer cert TP
profile reader pa cert TP
profile reader peer cert TP
peer user cert
pa user cert
Adding daemon to inittab
CRS-4123: Oracle High Availability Services has been started.
ohasd is starting
CRS-2672: Attempting to start 'ora.gipcd' on 'csgipm'
CRS-2672: Attempting to start 'ora.mdnsd' on 'csgipm'
CRS-2676: Start of 'ora.gipcd' on 'csgipm' succeeded
CRS-2676: Start of 'ora.mdnsd' on 'csgipm' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'csgipm'
CRS-2676: Start of 'ora.gpnpd' on 'csgipm' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'csgipm'
CRS-2676: Start of 'ora.cssdmonitor' on 'csgipm' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'csgipm'
CRS-2672: Attempting to start 'ora.diskmon' on 'csgipm'
CRS-2676: Start of 'ora.diskmon' on 'csgipm' succeeded
CRS-2676: Start of 'ora.cssd' on 'csgipm' succeeded
CRS-2672: Attempting to start 'ora.ctssd' on 'csgipm'
Start action for daemon aborted
CRS-2674: Start of 'ora.ctssd' on 'csgipm' failed
CRS-2679: Attempting to clean 'ora.ctssd' on 'csgipm'
CRS-2681: Clean of 'ora.ctssd' on 'csgipm' succeeded
CRS-4000: Command Start failed, or completed with errors.
Command return code of 1 (256) from command: /u01/app/11.2.0/grid/bin/crsctl start resource ora.ctssd -init
Start of resource "ora.ctssd -init" failed
Clusterware exclusive mode start of resource ora.ctssd failed
CRS-2500: Cannot stop resource 'ora.crsd' as it is not running
CRS-4000: Command Stop failed, or completed with errors.
Command return code of 1 (256) from command: /u01/app/11.2.0/grid/bin/crsctl stop resource ora.crsd -init
Stop of resource "ora.crsd -init" failed
Failed to stop CRSD
CRS-2500: Cannot stop resource 'ora.asm' as it is not running
CRS-4000: Command Stop failed, or completed with errors.
Command return code of 1 (256) from command: /u01/app/11.2.0/grid/bin/crsctl stop resource ora.asm -init
Stop of resource "ora.asm -init" failed
Failed to stop ASM
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'csgipm'
CRS-2677: Stop of 'ora.cssdmonitor' on 'csgipm' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'csgipm'
CRS-2677: Stop of 'ora.cssd' on 'csgipm' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'csgipm'
CRS-2677: Stop of 'ora.gpnpd' on 'csgipm' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'csgipm'
CRS-2677: Stop of 'ora.gipcd' on 'csgipm' succeeded
CRS-2673: Attempting to stop 'ora.mdnsd' on 'csgipm'
CRS-2677: Stop of 'ora.mdnsd' on 'csgipm' succeeded
Initial cluster configuration failed. See /u01/app/11.2.0/grid/cfgtoollogs/crsconfig/rootcrs_csgipm.log for details
csgipm:/u01/app/11.2.0/grid#ps -ef | grep pmon
root 6160492 3932160 0 10:54:13 pts/2 0:00 grep pmon
csgipm:/usr/sbin#more /u01/app/11.2.0/grid/log/csgipm/client/ocrconfig_5767204.log
2010-10-19 10:33:14.435: [ OCROSD][1]utread:3: Problem reading buffer 104ef000 buflen 4096 retval 0 phy_offset 102400 retry 4
2010-10-19 10:33:14.435: [ OCROSD][1]utread:3: Problem reading buffer 104ef000 buflen 4096 retval 0 phy_offset 102400 retry 5
2010-10-19 10:33:14.435: [ OCRRAW][1]propriogid:1_1: Failed to read the whole bootblock. Assumes invalid format.
2010-10-19 10:33:14.435: [ OCRRAW][1]proprioini: all disks are not OCR/OLR formatted
2010-10-19 10:33:14.435: [ OCRRAW][1]proprinit: Could not open raw device
2010-10-19 10:33:14.442: [ default][1]a_init:7!: Backend init unsuccessful : [26]
2010-10-19 10:33:14.461: [ OCRCONF][1]Exporting OCR data to [OCRUPGRADEFILE]
2010-10-19 10:33:14.461: [ OCRAPI][1]a_init:7!: Backend init unsuccessful : [33]
2010-10-19 10:33:14.461: [ OCRCONF][1]There was no previous version of OCR. error:[PROCL-33: Oracle Local Registry is not configured]
2010-10-19 10:33:14.461: [ OCROSD][1]utread:3: Problem reading buffer 104ef000 buflen 4096 retval 0 phy_offset 102400 retry 0
2010-10-19 10:33:14.461: [ OCROSD][1]utread:3: Problem reading buffer 104ef000 buflen 4096 retval 0 phy_offset 102400 retry 1
2010-10-19 10:33:14.462: [ OCROSD][1]utread:3: Problem reading buffer 104ef000 buflen 4096 retval 0 phy_offset 102400 retry 2
2010-10-19 10:33:14.462: [ OCROSD][1]utread:3: Problem reading buffer 104ef000 buflen 4096 retval 0 phy_offset 102400 retry 3
2010-10-19 10:33:14.462: [ OCROSD][1]utread:3: Problem reading buffer 104ef000 buflen 4096 retval 0 phy_offset 102400 retry 4
2010-10-19 10:33:14.462: [ OCROSD][1]utread:3: Problem reading buffer 104ef000 buflen 4096 retval 0 phy_offset 102400 retry 5
2010-10-19 10:33:14.462: [ OCRRAW][1]propriogid:1_1: Failed to read the whole bootblock. Assumes invalid format.
2010-10-19 10:33:14.462: [ OCRRAW][1]proprioini: all disks are not OCR/OLR formatted
2010-10-19 10:33:14.462: [ OCRRAW][1]proprinit: Could not open raw device
2010-10-19 10:33:14.462: [ default][1]a_init:7!: Backend init unsuccessful : [26]
2010-10-19 10:33:14.462: [ OCROSD][1]utread:3: Problem reading buffer 104ef000 buflen 4096 retval 0 phy_offset 102400 retry 0
2010-10-19 10:33:14.463: [ OCROSD][1]utread:3: Problem reading buffer 104ef000 buflen 4096 retval 0 phy_offset 102400 retry 1
2010-10-19 10:33:14.463: [ OCROSD][1]utread:3: Problem reading buffer 104ef000 buflen 4096 retval 0 phy_offset 102400 retry 2
2010-10-19 10:33:14.463: [ OCROSD][1]utread:3: Problem reading buffer 104ef000 buflen 4096 retval 0 phy_offset 102400 retry 3
2010-10-19 10:33:14.463: [ OCROSD][1]utread:3: Problem reading buffer 104ef000 buflen 4096 retval 0 phy_offset 102400 retry 4
2010-10-19 10:33:14.463: [ OCROSD][1]utread:3: Problem reading buffer 104ef000 buflen 4096 retval 0 phy_offset 102400 retry 5
2010-10-19 10:33:14.463: [ OCRRAW][1]propriogid:1_1: Failed to read the whole bootblock. Assumes invalid format.
2010-10-19 10:33:14.463: [ OCROSD][1]utread:3: Problem reading buffer 104ef000 buflen 4096 retval 0 phy_offset 102400 retry 0
2010-10-19 10:33:14.463: [ OCROSD][1]utread:3: Problem reading buffer 104ef000 buflen 4096 retval 0 phy_offset 102400 retry 1
2010-10-19 10:33:14.463: [ OCROSD][1]utread:3: Problem reading buffer 104ef000 buflen 4096 retval 0 phy_offset 102400 retry 2
2010-10-19 10:33:14.463: [ OCROSD][1]utread:3: Problem reading buffer 104ef000 buflen 4096 retval 0 phy_offset 102400 retry 3
2010-10-19 10:33:14.463: [ OCROSD][1]utread:3: Problem reading buffer 104ef000 buflen 4096 retval 0 phy_offset 102400 retry 4
2010-10-19 10:33:14.463: [ OCROSD][1]utread:3: Problem reading buffer 104ef000 buflen 4096 retval 0 phy_offset 102400 retry 5
2010-10-19 10:33:14.483: [ OCRRAW][1]ibctx: Failed to read the whole bootblock. Assumes invalid format.
2010-10-19 10:33:14.483: [ OCRRAW][1]proprinit:problem reading the bootblock or superbloc 22
2010-10-19 10:33:14.483: [ OCROSD][1]utread:3: Problem reading buffer 104fe000 buflen 4096 retval 0 phy_offset 102400 retry 0
2010-10-19 10:33:14.483: [ OCROSD][1]utread:3: Problem reading buffer 104fe000 buflen 4096 retval 0 phy_offset 102400 retry 1
2010-10-19 10:33:14.483: [ OCROSD][1]utread:3: Problem reading buffer 104fe000 buflen 4096 retval 0 phy_offset 102400 retry 2
2010-10-19 10:33:14.484: [ OCROSD][1]utread:3: Problem reading buffer 104fe000 buflen 4096 retval 0 phy_offset 102400 retry 3
2010-10-19 10:33:14.484: [ OCROSD][1]utread:3: Problem reading buffer 104fe000 buflen 4096 retval 0 phy_offset 102400 retry 4
2010-10-19 10:33:14.484: [ OCROSD][1]utread:3: Problem reading buffer 104fe000 buflen 4096 retval 0 phy_offset 102400 retry 5
2010-10-19 10:33:14.484: [ OCRRAW][1]propriogid:1_1: Failed to read the whole bootblock. Assumes invalid format.
2010-10-19 10:33:14.541: [ OCRAPI][1]a_init:6a: Backend init successful
2010-10-19 10:33:14.646: [ OCRCONF][1]Initialized DATABASE keys
2010-10-19 10:33:14.650: [ OCRCONF][1]Exiting [status=success]...
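
In the root.sh output above, the resource whose start fails is ora.ctssd (the CRS-2674 line). When the output is long, a small filter makes such lines easier to spot. This is a minimal sketch in Python; the function name failed_starts is made up, and the sample lines are copied from the output above.

```python
import re
from typing import List, Tuple

# Matches lines such as: CRS-2674: Start of 'ora.ctssd' on 'csgipm' failed
FAILED_START = re.compile(r"CRS-2674: Start of '([^']+)' on '([^']+)' failed")


def failed_starts(output: str) -> List[Tuple[str, str]]:
    """Return (resource, node) pairs for every CRS-2674 line in the output."""
    return FAILED_START.findall(output)


sample = """\
CRS-2676: Start of 'ora.cssd' on 'csgipm' succeeded
CRS-2672: Attempting to start 'ora.ctssd' on 'csgipm'
Start action for daemon aborted
CRS-2674: Start of 'ora.ctssd' on 'csgipm' failed
"""
print(failed_starts(sample))  # [('ora.ctssd', 'csgipm')]
```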

Hi,
We are also trying to install 11.2.0.2 Grid infrastructure for Oracle RAC One Node on AIX 6.1. We did a POC in our lab environment and after much struggle got that working. Now we are building 4 clusters in the production environment and the first cluster installation failed while running root.sh on node2. We already have a Sev1 ticket open with Oracle Support but have not heard anything.
Here is root.sh output from node2. The two node names are p01dou416 and p01dou417.
CRS-4402: The CSS daemon was started in exclusive mode but found an active CSS daemon on node p01dou416, number 1, and is terminating
An active cluster was found during exclusive startup, restarting to join the cluster
Failed to start Oracle Clusterware stack
Failed to start Cluster Synchorinisation Service in clustered mode at /u01/app/11.2.0/grid/crs/install/crsconfig_lib.pm line 1020.
/u01/app/11.2.0/grid/perl/bin/perl -I/u01/app/11.2.0/grid/perl/lib -I/u01/app/11.2.0/grid/crs/install /u01/app/11.2.0/grid/crs/install/rootcrs.pl execution failed
[root@P01DOU417] /u01/app/11.2.0/grid #
LOG output: /u01/app/11.2.0/grid/cfgtoollogs/crsconfig/rootcrs_p01dou417.log
2010-11-13 17:22:14: Successfully started requested Oracle stack daemons
2010-11-13 17:22:14: Starting CSS in clustered mode
2010-11-13 17:22:14: Executing cmd: /u01/app/11.2.0/grid/bin/crsctl start resource ora.cssd -init
2010-11-13 17:32:28: Command output:
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'p01dou417'
CRS-2672: Attempting to start 'ora.gipcd' on 'p01dou417'
CRS-2676: Start of 'ora.cssdmonitor' on 'p01dou417' succeeded
CRS-2676: Start of 'ora.gipcd' on 'p01dou417' succeeded
CRS-2679: Attempting to clean 'ora.cssd' on 'p01dou417'
CRS-2681: Clean of 'ora.cssd' on 'p01dou417' succeeded
CRS-2673: Attempting to stop 'ora.diskmon' on 'p01dou417'
CRS-2677: Stop of 'ora.diskmon' on 'p01dou417' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'p01dou417'
CRS-2677: Stop of 'ora.gipcd' on 'p01dou417' succeeded
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'p01dou417'
CRS-2677: Stop of 'ora.cssdmonitor' on 'p01dou417' succeeded
CRS-5804: Communication error with agent process
CRS-4000: Command Start failed, or completed with errors.
End Command output
2010-11-13 17:32:28: Executing cmd: /u01/app/11.2.0/grid/bin/crsctl check css
2010-11-13 17:32:28: Command output:
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
End Command output
2010-11-13 17:32:28: Checking the status of css
2010-11-13 17:32:33: Executing cmd: /u01/app/11.2.0/grid/bin/crsctl check css
2010-11-13 17:32:33: Command output:
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
End Command output
2010-11-13 17:32:33: Checking the status of css
2010-11-13 17:32:38: CRS-2672: Attempting to start 'ora.cssdmonitor' on 'p01dou417'
2010-11-13 17:32:38: CRS-2672: Attempting to start 'ora.gipcd' on 'p01dou417'
2010-11-13 17:32:38: CRS-2676: Start of 'ora.cssdmonitor' on 'p01dou417' succeeded
2010-11-13 17:32:38: CRS-2676: Start of 'ora.gipcd' on 'p01dou417' succeeded
2010-11-13 17:32:38: CRS-2672: Attempting to start 'ora.cssd' on 'p01dou417'
2010-11-13 17:32:38: CRS-2672: Attempting to start 'ora.diskmon' on 'p01dou417'
2010-11-13 17:32:38: CRS-2676: Start of 'ora.diskmon' on 'p01dou417' succeeded
2010-11-13 17:32:38: CRS-2674: Start of 'ora.cssd' on 'p01dou417' failed
2010-11-13 17:32:38: CRS-2679: Attempting to clean 'ora.cssd' on 'p01dou417'
2010-11-13 17:32:38: CRS-2681: Clean of 'ora.cssd' on 'p01dou417' succeeded
2010-11-13 17:32:38: CRS-2673: Attempting to stop 'ora.diskmon' on 'p01dou417'
2010-11-13 17:32:38: CRS-2677: Stop of 'ora.diskmon' on 'p01dou417' succeeded
2010-11-13 17:32:38: CRS-2673: Attempting to stop 'ora.gipcd' on 'p01dou417'
2010-11-13 17:32:38: CRS-2677: Stop of 'ora.gipcd' on 'p01dou417' succeeded
2010-11-13 17:32:38: CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'p01dou417'
2010-11-13 17:32:38: CRS-2677: Stop of 'ora.cssdmonitor' on 'p01dou417' succeeded
2010-11-13 17:32:38: CRS-5804: Communication error with agent process
2010-11-13 17:32:38: CRS-4000: Command Start failed, or completed with errors.
2010-11-13 17:32:38: Failed to start Oracle Clusterware stack
2010-11-13 17:32:38: ###### Begin DIE Stack Trace ######
2010-11-13 17:32:38: Package File Line Calling
2010-11-13 17:32:38: --------------- -------------------- ---- ----------
2010-11-13 17:32:38: 1: main rootcrs.pl 324 crsconfig_lib::dietrap
2010-11-13 17:32:38: 2: crsconfig_lib crsconfig_lib.pm 1020 main::__ANON__
2010-11-13 17:32:38: 3: crsconfig_lib crsconfig_lib.pm 997 crsconfig_lib::start_cluster
2010-11-13 17:32:38: 4: main rootcrs.pl 697 crsconfig_lib::perform_start_cluster
2010-11-13 17:32:38: ####### End DIE Stack Trace #######
2010-11-13 17:32:38: 'ROOTCRS_STACK' checkpoint has failed
Any help on this is appreciated.
Edited by: user12019257 on Nov 17, 2010 1:26 PM

CRS-4535: Cannot communicate with Cluster Ready Services

[oracle@bnl11237dat01 ~]$ /u01/app/11.2.0/grid/bin/cluvfy stage -post crsinst -n bnl11237dat01,bnl11237dat02 -verbose
Performing post-checks for cluster services setup
Checking node reachability...
Check: Node reachability from node "bnl11237dat01"
Destination Node Reachable?
bnl11237dat01 yes
bnl11237dat02 yes
Result: Node reachability check passed from node "bnl11237dat01"
Checking user equivalence...
Check: User equivalence for user "oracle"
Node Name Comment
bnl11237dat02 passed
bnl11237dat01 passed
Result: User equivalence check passed for user "oracle"
ERROR:
Unable to obtain network interface list from Oracle Clusterware (OIFCFG)
Verification cannot proceed
Post-check for cluster services setup was unsuccessful on all the nodes.
[oracle@bnl11237dat01 ~]$ oifcfg iflist
bond1 10.32.25.160
bond2 10.32.17.200
bond2 169.254.0.0
bond0.1515 10.32.24.0
bond0.1536 172.16.7.0
[oracle@bnl11237dat01 ~]$ oifcfg iflist -p
bond1 10.32.25.160 PRIVATE
bond2 10.32.17.200 PRIVATE
bond2 169.254.0.0 UNKNOWN
bond0.1515 10.32.24.0 PRIVATE
bond0.1536 172.16.7.0 PRIVATE
[oracle@bnl11237dat01 ~]$ oifcfg iflist -p -n
bond1 10.32.25.160 PRIVATE 255.255.255.240
bond2 10.32.17.200 PRIVATE 255.255.255.248
bond2 169.254.0.0 UNKNOWN 255.255.0.0
bond0.1515 10.32.24.0 PRIVATE 255.255.255.192
bond0.1536 172.16.7.0 PRIVATE 255.255.255.0
[oracle@bnl11237dat01 ~]$ /u01/app/11.2.0/grid/bin/onsctl debug
HTTP/1.1 200 OK
Content-Length: 1923
Content-Type: text/html
Response:
== bnl11237dat01:6200 8732 11/05/19 11:44:09 ==
Home: /u01/app/11.2.0/grid
======== ONS ========
IP ADDRESS PORT TIME SEQUENCE FLAGS
127.0.0.1 6200 4dd4f2a3 00000000 00000008
Listener:
TYPE BIND ADDRESS PORT SOCKET
Local 127.0.0.1 6100 5
Remote any 6200 7
Remote any 6200 -
Connection Topology: (1)
IP PORT VERS TIME
127.0.0.1 6200 4 4dd4f2a3=
Client connections:
ID CONNECTION ADDRESS PORT FLAGS SENDQ REF SUB W
0 internal 0 01008a 00000 001 002
2 127.0.0.1 6100 01001a 00000 001 001
1 127.0.0.1 6100 01001a 00000 001 001
3 127.0.0.1 6100 01001a 00000 001 001
4 127.0.0.1 6100 01001a 00000 001 001
7 127.0.0.1 6100 01001a 00000 001 000
request 127.0.0.1 6100 03201a 00000 001 000
Worker Ticket: 10/10, Last: 11/05/19 11:42:21
THREAD FLAGS
40460940 00000012
405b4940 00000012
41280940 00000012
Resources:
Notifications:
Received: Total 0 (Internal 0), in Receive Q: 0
Processed: Total 0, in Process Q: 0
Pool Counts:
Message: 1, Link: 1, Ack: 1, Match: 1
[oracle@bnl11237dat01 ~]$
[root@bnl11237dat01 ~]# /u01/app/11.2.0/grid/bin/crsctl stop crs
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'bnl11237dat01'
CRS-2673: Attempting to stop 'ora.crsd' on 'bnl11237dat01'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'bnl11237dat01'
CRS-2673: Attempting to stop 'ora.bnl11237dat01.vip' on 'bnl11237dat01'
CRS-2673: Attempting to stop 'ora.oc4j' on 'bnl11237dat01'
CRS-2673: Attempting to stop 'ora.cvu' on 'bnl11237dat01'
CRS-2673: Attempting to stop 'ora.LISTENER_SCAN1.lsnr' on 'bnl11237dat01'
CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'bnl11237dat01'
CRS-2673: Attempting to stop 'ora.LISTENER_SCAN3.lsnr' on 'bnl11237dat01'
CRS-2673: Attempting to stop 'ora.LISTENER_SCAN2.lsnr' on 'bnl11237dat01'
CRS-2677: Stop of 'ora.bnl11237dat01.vip' on 'bnl11237dat01' succeeded
CRS-2677: Stop of 'ora.cvu' on 'bnl11237dat01' succeeded
CRS-2677: Stop of 'ora.LISTENER_SCAN1.lsnr' on 'bnl11237dat01' succeeded
CRS-2673: Attempting to stop 'ora.scan1.vip' on 'bnl11237dat01'
CRS-2677: Stop of 'ora.LISTENER.lsnr' on 'bnl11237dat01' succeeded
CRS-2673: Attempting to stop 'ora.bnl11237dat02.vip' on 'bnl11237dat01'
CRS-2677: Stop of 'ora.LISTENER_SCAN3.lsnr' on 'bnl11237dat01' succeeded
CRS-2673: Attempting to stop 'ora.scan3.vip' on 'bnl11237dat01'
CRS-2677: Stop of 'ora.LISTENER_SCAN2.lsnr' on 'bnl11237dat01' succeeded
CRS-2673: Attempting to stop 'ora.scan2.vip' on 'bnl11237dat01'
CRS-2677: Stop of 'ora.scan1.vip' on 'bnl11237dat01' succeeded
CRS-2677: Stop of 'ora.bnl11237dat02.vip' on 'bnl11237dat01' succeeded
CRS-2677: Stop of 'ora.scan3.vip' on 'bnl11237dat01' succeeded
CRS-2677: Stop of 'ora.scan2.vip' on 'bnl11237dat01' succeeded
CRS-2677: Stop of 'ora.oc4j' on 'bnl11237dat01' succeeded
CRS-2673: Attempting to stop 'ora.ons' on 'bnl11237dat01'
CRS-2677: Stop of 'ora.ons' on 'bnl11237dat01' succeeded
CRS-2673: Attempting to stop 'ora.net1.network' on 'bnl11237dat01'
CRS-2677: Stop of 'ora.net1.network' on 'bnl11237dat01' succeeded
CRS-2792: Shutdown of Cluster Ready Services-managed resources on 'bnl11237dat01' has completed
CRS-2677: Stop of 'ora.crsd' on 'bnl11237dat01' succeeded
CRS-2673: Attempting to stop 'ora.mdnsd' on 'bnl11237dat01'
CRS-2673: Attempting to stop 'ora.crf' on 'bnl11237dat01'
CRS-2673: Attempting to stop 'ora.ctssd' on 'bnl11237dat01'
CRS-2673: Attempting to stop 'ora.evmd' on 'bnl11237dat01'
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'bnl11237dat01'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'bnl11237dat01' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'bnl11237dat01' succeeded
CRS-2677: Stop of 'ora.crf' on 'bnl11237dat01' succeeded
CRS-2677: Stop of 'ora.evmd' on 'bnl11237dat01' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'bnl11237dat01' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'bnl11237dat01'
CRS-2677: Stop of 'ora.cssd' on 'bnl11237dat01' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'bnl11237dat01'
CRS-2673: Attempting to stop 'ora.diskmon' on 'bnl11237dat01'
CRS-2677: Stop of 'ora.gipcd' on 'bnl11237dat01' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'bnl11237dat01'
CRS-2677: Stop of 'ora.diskmon' on 'bnl11237dat01' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'bnl11237dat01' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'bnl11237dat01' has completed
CRS-4133: Oracle High Availability Services has been stopped.
[root@bnl11237dat01 ~]# /u01/app/11.2.0/grid/bin/crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
[root@bnl11237dat01 ~]# /u01/app/11.2.0/grid/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
I am unable to start Oracle Cluster Services on 11.2.0.2. Could you tell me what the problem is and how to start CRS?
Thanks.

Hi,
After crsctl start crs you have to wait a while until the cluster stack is up. You just issued crsctl check crs a little too early.
To watch the stack start up, run crsctl stat res -t -init.
If not everything in crsctl stat res -t -init goes online after a while, then you have a problem that has to be analyzed.
However, if crsctl stat res -t -init shows everything started, use crsctl stat res -t to see the other processes coming online.
Can you post the output of crsctl stat res -t -init (if not everything is online) or of crsctl stat res -t?
The only thing I find strange is that oifcfg does not list a PUBLIC interface...
Regards
Sebastian
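
The "wait a while, then check again" advice can be automated. Here is a minimal sketch in Python that polls a status command until every CRS line reports "is online". The function names offline_lines and wait_for_stack are made up, the check command is passed in as a callable so the sketch runs without a cluster, and the sample messages are ones quoted on this page.

```python
import time
from typing import Callable, List


def offline_lines(check_output: str) -> List[str]:
    """Return the CRS-* status lines that do not report 'is online'."""
    return [
        line.strip()
        for line in check_output.splitlines()
        if line.strip().startswith("CRS-") and "is online" not in line
    ]


def wait_for_stack(run_check: Callable[[], str],
                   attempts: int = 30,
                   delay_seconds: float = 10.0) -> List[str]:
    """Poll until every line is online; return the lines still offline (empty on success)."""
    remaining: List[str] = []
    for attempt in range(attempts):
        remaining = offline_lines(run_check())
        if not remaining:
            return []
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return remaining


# Fake check command: the first call is not fully online, the second call is.
outputs = iter([
    "CRS-4638: Oracle High Availability Services is online\n"
    "CRS-4534: Cannot communicate with Event Manager\n",
    "CRS-4638: Oracle High Availability Services is online\n"
    "CRS-4533: Event Manager is online\n",
])
print(wait_for_stack(lambda: next(outputs), attempts=5, delay_seconds=0))  # []
```

On a real node, run_check could wrap something like subprocess.run(["/u01/app/11.2.0/grid/bin/crsctl", "check", "crs"], capture_output=True, text=True).stdout, with the path adjusted to your Grid home. If the function returns a non-empty list after all attempts, those are the components to look at with crsctl stat res -t -init.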

Fresh cluster inst., servers reboot, cannot restart clusters or ASM

Hello all,
I just installed an 11gR2 cluster across 5 nodes. In the installer I chose ASM to hold the voting disk, etc.
I installed the RDBMS binaries successfully across all nodes. NO INSTANCES YET.
A few days went by....
I was getting ready to apply post-install patches and found things looking strange. Working on node1, I found that the clustering system was not running.
I looked, and the servers (all 5 of them) had for some reason rebooted since the install.
I tried starting the cluster:
crsctl start cluster -all.
It took a while to return, and then errored out with a timeout message.
I checked to see if it was up:
./crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4533: Event Manager is online
It then dawned on me...maybe ASM wasn't up either?
Nope...not running.
I tried to start it locally on node 1
I set the SID, and tried using sqlplus
I got:
ORA-01078: failure in processing system parameters
LRM-00109: could not open parameter file '/u01/app/oracle/product/11.2.0/dbhome_1/dbs/init+ASM1.ora'
I looked...nothing in that directory at all but a simple init.ora file.
I tried shutting down the cluster with
crsctl stop cluster -all
I got a ton of messages for each node like:
CRS-4548: Unable to connect to CRSD
CRS-2678: 'ora.crsd' on 'node1' has experienced an unrecoverable failure
CRS-0267: Human intervention required to resume its availability.
I'm trying to get through to Oracle support, but they're running slow.
Any ideas here?
I used OUI to create ASM for the cluster...why would it not put an init file there to point to the spfile in ASM?
I'm guessing this is the reason the nodes couldn't talk or sync. Trouble is...how do I start ASM without an init file? I seem to recall there might be a way to create a file to point to the ASM for the spfile, but I'm new to this too...and not sure where to point or the syntax to use.
Will starting the cluster with no ASM running have done any damage? If so, how do I fix it?
As you can tell, learning about clusters/RAC and ASM....and I'm not finding good reference materials on troubleshooting. Heck, the install docs are bad enough....
Thank you in advance for any advice or links...
cayenne
ps. this is on RHEL5
Edited by: cayenne on Aug 10, 2010 12:33 PM
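
For anyone who wants to script the status check shown in this post, here is a minimal Python sketch. It assumes the output of `crsctl check crs` has already been captured as text, and it only knows the CRS message codes that are quoted on this page; anything else is reported as unknown. The function name `summarize_crs_check` is made up for illustration and is not part of any Oracle tool.

```python
import re

# Only the message codes quoted on this page; this is not a complete list.
ONLINE_CODES = {"CRS-4638", "CRS-4533", "CRS-4529"}
FAILING_CODES = {"CRS-4535", "CRS-4530", "CRS-4534"}


def summarize_crs_check(output: str) -> dict:
    """Group the lines of captured `crsctl check crs` output by health."""
    summary = {"online": [], "failing": [], "unknown": []}
    for line in output.splitlines():
        match = re.match(r"\s*(CRS-\d+):\s*(.+)", line)
        if not match:
            continue  # skip prompts, blank lines, and other noise
        code, text = match.group(1), match.group(2).strip()
        if code in ONLINE_CODES:
            summary["online"].append(text)
        elif code in FAILING_CODES:
            summary["failing"].append(text)
        else:
            summary["unknown"].append(line.strip())
    return summary


# The four lines reported in the post above.
SAMPLE = """\
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4533: Event Manager is online
"""

if __name__ == "__main__":
    for problem in summarize_crs_check(SAMPLE)["failing"]:
        print("needs attention:", problem)
```

With the sample above, the two failing entries are the CRSD and CSSD lines, which matches the symptom described: the High Availability Services layer is up, but the daemons above it are not reachable.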

ssolbach wrote:
Hi,
Try doing crsctl stop crs -f 2 times. Sometimes this is needed.
crsctl check crs should report you that nothing is running anymore.
If it is... then the only way I can think of is to disable the automatic startup of crs
crsctl disable crs
and restart the node... This will definitely bring everything down (and does not start it up after restarting).
You can then start the crs stack with crsctl start crs.
Furthermore don't get confused with oracleasm.
oracleasm is for ASMLib, which may be used when preparing the storage for ASM but has nothing to do with whether ASM is running later on.
To check the status of ASM first see if clusterstack was started successfully (you need to wait a little till it is started):
crsctl check crs.
Everything should be online... if it is, you can do a crsctl stat res -t, which will show you all resources, including whether ASM is running.
If there has been a problem starting up the stack (crsctl check crs) then we have to find out why.
Check $CRS_HOME/log/<hostname>/alert*.log for error messages.
If something indicates a problem with ASM do a:
adrci
show alert
and choose the ASM alert.log.
Search for error messages.
GL.
Sebastian

Thank you, I'd not known of adrci before!! Once I set the ORACLE_HOME, it came right up!
Ok, it does look like an ASM problem...and clustering can't find the voting disk.
Errors in file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_dia0_25186.trc:
ORA-27508: IPC error sending a message
ORA-27300: OS system dependent operation:sendmsg failed with status: 22
ORA-27301: OS failure message: Invalid argument
ORA-27302: failure occurred at: sskgxpsnd1
Errors in file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_dia0_25186.trc:
ORA-27508: IPC error sending a message
Errors in file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_o000_22504.trc (incident=4801):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_found failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr9
ORA-27303: additional information: requested interface 192.168.100.1 not found. Check output from ifconfig command
Incident details in: /u01/app/oracle/diag/asm/+asm/+ASM1/incident/incdir_4801/+ASM1_o000_22504_i4801.trc
Errors in file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_gmon_25212.trc:
ORA-29746: Cluster Synchronization Service is being shut down.
ORA-29702: error occurred in Cluster Group Service operation
GMON (ospid: 25212): terminating the instance due to error 29746
opidrv aborting process O000 ospid (22504) as a result of ORA-603
Ok, I've been going through the alert log, and the .trc files it indicates. I started by checking the private address 192.168.100.1...it seems to be up with ifconfig..and pingable.
I'm looking through the errors...the ORA-27300 one returning status 22 got a hit on Oracle Support...but it wasn't the same set of errors as I got.
Whew...still plowing through all the files and logs...not seeing anything out there so far that matches my problem...dunno what could have caused this to just all go BANG. I mean...no one but me using the machines, no databases on them yet...all that was installed was clustering with its voting disk and (ocr?) on a single ASM disk group....and RDBMS binaries installed across all 5 nodes.
Please let me know if you see anything that stands out here...I'm still searching myself...
Again, thanks for the adrci hint!!
cayenne
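
The ORA-27303 line in the log above names a specific address and suggests checking the `ifconfig` output. A small Python sketch of that comparison follows, assuming both the alert log text and the `ifconfig` output have been saved as strings. The function names and the sample `ifconfig` fragment are invented for illustration; only the ORA-27303 line is taken from the post.

```python
import re


def requested_interface(alert_text: str):
    """Return the IP named in an ORA-27303 'requested interface' line, or None."""
    match = re.search(
        r"ORA-27303:.*requested interface (\d{1,3}(?:\.\d{1,3}){3}) not found",
        alert_text,
    )
    return match.group(1) if match else None


def address_is_configured(ip: str, ifconfig_text: str) -> bool:
    """True if `ip` appears as a whole address in captured ifconfig output."""
    # The lookarounds stop 192.168.100.1 from matching inside 192.168.100.10.
    pattern = r"(?<![\d.])" + re.escape(ip) + r"(?![\d.])"
    return re.search(pattern, ifconfig_text) is not None


ALERT_LINE = (
    "ORA-27303: additional information: requested interface 192.168.100.1 "
    "not found. Check output from ifconfig command"
)

# Invented sample in the style of RHEL5 ifconfig output, for illustration only.
IFCONFIG_SAMPLE = """\
eth1      Link encap:Ethernet  HWaddr 00:00:00:00:00:01
          inet addr:192.168.100.1  Bcast:192.168.100.255  Mask:255.255.255.0
"""

if __name__ == "__main__":
    ip = requested_interface(ALERT_LINE)
    print(ip, "configured:", address_is_configured(ip, IFCONFIG_SAMPLE))
```

Running the same comparison on every node, and again right after a reboot, may show whether the private interface is missing at the moment ASM starts even though it looks fine later.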

Oracle 11gR2 RAC Cluster issue

We have a 2-node Oracle 11gR2 RAC environment on HP-UX 11.31. It had been running for the last 2 months without any issue.
We had a netconfig issue, and node-1 got rebooted today. After the reboot the cluster did not start on node-1; the database is running on node-2.
grid@hublhp4:/app/oracle/grid/product/11.2.0.1/log/hublhp4/crsd$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager
grid@hublhp4:/app/oracle/grid/product/11.2.0.1/log/hublhp4/crsd$ crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.
grid@hublhp4:/app/oracle/grid/product/11.2.0.1/log/hublhp4/crsd$ ocrcheck
PROT-602: Failed to retrieve data from the cluster registry
PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=8, opn=kgfolclcpi1, dep=301, loc=kgfokge
AMDU-00301: Unable to open file tmp-AMIPOCR01.ocr
AMDU-00204: Disk N0002 is in currently mounted diskgroup AMIPOCR01
AMDU-00201: Disk N0002: '/dev/rdisk/ora_OCR
] [8]
grid@hublhp4:/app/oracle/grid/product/11.2.0.1/log/hublhp4/crsd$ olsnodes -n
hublhp4 1
hublhp5 2
any idea please.
Edited by: ManoRangasamy on Jul 5, 2011 6:38 PM

Hi,
Please post the ASM alert log, crsd.log and ocssd.log from node 1.
It might be because node 1 can't see the ASM disk, or because the permissions were accidentally changed when the node rebooted.
Cheers
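
The permission theory in this reply can be checked with a few lines of Python. This is only a sketch: the function name `ownership_problems` is made up, and the device path, owner, group and mode you pass in are whatever your own installation expects, since the thread does not state them (the device path in the output above is cut off, so it is not reproduced here).

```python
import os
import stat


def ownership_problems(path, expected_uid, expected_gid, expected_mode):
    """Compare a file's owner, group and permission bits with expected values.

    Returns a list of human-readable mismatches; an empty list means the
    file matches what was expected.
    """
    info = os.stat(path)
    problems = []
    if info.st_uid != expected_uid:
        problems.append(f"owner uid is {info.st_uid}, expected {expected_uid}")
    if info.st_gid != expected_gid:
        problems.append(f"group gid is {info.st_gid}, expected {expected_gid}")
    actual_mode = stat.S_IMODE(info.st_mode)  # permission bits only
    if actual_mode != expected_mode:
        problems.append(f"mode is {actual_mode:o}, expected {expected_mode:o}")
    return problems
```

Calling it with the OCR device path and the grid owner's numeric uid and gid on both nodes, then comparing the two results, would show whether the reboot left node 1 with different ownership than node 2.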

Cluster Installation Error while running root.sh

Hi all,
Please help
I tried to install Clusterware, but when I run root.sh on the first node it shows the output below
[root@rac11g1 etc]# /u01/app/oraInventory/orainstRoot.sh
Changing permissions of /u01/app/oraInventory to 770.
Changing groupname of /u01/app/oraInventory to oinstall.
The execution of the script is complete
[root@rac11g1 etc]# /u01/crs/oracle/product/11.1.0/crs/root.sh
WARNING: directory '/u01/crs/oracle/product/11.1.0' is not owned by root
WARNING: directory '/u01/crs/oracle/product' is not owned by root
WARNING: directory '/u01/crs/oracle' is not owned by root
WARNING: directory '/u01/crs' is not owned by root
WARNING: directory '/u01' is not owned by root
Checking to see if Oracle CRS stack is already configured
/etc/oracle does not exist. Creating it now.
Setting the permissions on OCR backup directory
Setting up Network socket directories
Oracle Cluster Registry configuration upgraded successfully
The directory '/u01/crs/oracle/product/11.1.0' is not owned by root. Changing owner to root
The directory '/u01/crs/oracle/product' is not owned by root. Changing owner to root
The directory '/u01/crs/oracle' is not owned by root. Changing owner to root
The directory '/u01/crs' is not owned by root. Changing owner to root
The directory '/u01' is not owned by root. Changing owner to root
Successfully accumulated necessary OCR keys.
Using ports: CSS=49895 CRS=49896 EVMC=49898 and EVMR=49897.
node <nodenumber>: <nodename> <private interconnect name> <hostname>
node 1: rac11g1 rac11g1-priv rac11g1
node 2: rac11g2 rac11g2-priv rac11g2
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
Now formatting voting device: /dev/sdd1
Format of 1 voting devices complete.
Startup will be queued to init within 30 seconds.
Adding daemons to inittab
Expecting the CRS daemons to be up within 600 seconds.
Cluster Synchronization Services is active on these nodes.
rac11g1
Cluster Synchronization Services is inactive on these nodes.
rac11g2
Local node checking complete. Run root.sh on remaining nodes to start CRS daemon
On the Second Node
[root@rac11g2 crs]# ./root.sh
WARNING: directory '/u01/crs/oracle/product/11.1.0' is not owned by root
WARNING: directory '/u01/crs/oracle/product' is not owned by root
WARNING: directory '/u01/crs/oracle' is not owned by root
WARNING: directory '/u01/crs' is not owned by root
WARNING: directory '/u01' is not owned by root
Checking to see if Oracle CRS stack is already configured
/etc/oracle does not exist. Creating it now.
Setting the permissions on OCR backup directory
Setting up Network socket directories
Oracle Cluster Registry configuration upgraded successfully
The directory '/u01/crs/oracle/product/11.1.0' is not owned by root. Changing owner to root
The directory '/u01/crs/oracle/product' is not owned by root. Changing owner to root
The directory '/u01/crs/oracle' is not owned by root. Changing owner to root
The directory '/u01/crs' is not owned by root. Changing owner to root
The directory '/u01' is not owned by root. Changing owner to root
Successfully accumulated necessary OCR keys.
Using ports: CSS=49895 CRS=49896 EVMC=49898 and EVMR=49897.
node <nodenumber>: <nodename> <private interconnect name> <hostname>
node 1: rac11g1 rac11g1-priv rac11g1
node 2: rac11g2 rac11g2-priv rac11g2
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
Now formatting voting device: /dev/sdd1
Format of 1 voting devices complete.
Startup will be queued to init within 30 seconds.
Adding daemons to inittab
Expecting the CRS daemons to be up within 600 seconds.
Cluster Synchronization Services is active on these nodes.
rac11g2
Cluster Synchronization Services is inactive on these nodes.
rac11g1
Local node checking complete. Run root.sh on remaining nodes to start CRS daemon
When I run root.sh on the first node it shows that the second node is not active, which is acceptable,
but when I run it on the second node it shows that the first node is not active.
Please help, I've been stuck here for the past 2 days.
Bala
Edited by: Bala0575 on Jun 16, 2010 3:48 AM

Did you run cluvfy before performing the installation? If so, did it report any errors? I would suggest running it now in "pre crsinst" mode and reporting any findings back. If CVU says everything is good then I would recommend talking to Oracle Support.
It looks like CSS is running on node1 at the end of its root.sh, yet it's dead when root.sh is run on node 2.... makes me wonder if it's root.sh on node2 that's killing it.
Run "cluvfy stage -pre crsinst" and post the output back to the forum please?
Yours,
Bob
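
To make the comparison between the two root.sh runs explicit, here is a small Python sketch that pulls the "active" and "inactive" node lists out of captured root.sh output. The function name `css_node_status` is invented, and the parsing relies only on the two header lines exactly as they appear in the output posted above.

```python
ACTIVE_HEADER = "Cluster Synchronization Services is active on these nodes."
INACTIVE_HEADER = "Cluster Synchronization Services is inactive on these nodes."


def css_node_status(root_sh_output: str) -> dict:
    """Collect the node names listed under the CSS active/inactive headers."""
    status = {"active": [], "inactive": []}
    current = None
    for raw in root_sh_output.splitlines():
        line = raw.strip()
        if line == ACTIVE_HEADER:
            current = "active"
        elif line == INACTIVE_HEADER:
            current = "inactive"
        elif current and line and " " not in line:
            status[current].append(line)  # a bare node name under a header
        else:
            current = None  # any other line ends the current list
    return status


# Tail of the node 1 run quoted in the question.
NODE1_TAIL = """\
Cluster Synchronization Services is active on these nodes.
rac11g1
Cluster Synchronization Services is inactive on these nodes.
rac11g2
Local node checking complete. Run root.sh on remaining nodes to start CRS daemon
"""

if __name__ == "__main__":
    print(css_node_status(NODE1_TAIL))
```

Fed the node 2 output, the same function reports rac11g2 as active and rac11g1 as inactive, so each node sees only itself. It may also be worth noting that both runs print "Now formatting voting device: /dev/sdd1", so checking that both nodes really see the same shared device seems a reasonable next step alongside the cluvfy run.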

What are the host network requirements for a 2012 R2 failover cluster using fiber channel?

I've seen comments on here regarding how the heartbeat signal isn't really required anymore - is that true? We started using Hyper-V in its infancy and have upgraded gleefully every step of the way. With 2012 R2, we also upgraded from 1Gb iSCSI to 8Gb Fiber Channel. Currently, I have three NICs in use on each host. One for "No cluster communication" on its own VLAN. Another for "Allow cluster network communication on this network" but NOT allowing clients, on a different VLAN. And lastly the public network which allows cluster comms and clients on it (public VLAN).
Is it still necessary to have all three of these NICs in use? If the heartbeat isn't necessary any more, is there any reason to not have two public IPs and do away with the rest of the network? (two for fault tolerance) Does Live Migration still use Ethernet if FC is available? I wasn't sure what all has changed with these requirements since Hyper-V first came out.
If it matters, we have 5 servers w/160GB RAM, 8 NICs, dual HBAs connected to redundant FC switches, going to two SANs. We're running around 30 VMs right now.
Can someone share their knowledge with me regarding the proper setup for my environment? Many Thanks!

Hi,
You can set up a cluster with a single network, but that leaves you with a single point of failure on the networking front, so it is still recommended to have a heartbeat network.
Live migration would still happen through Ethernet; it has nothing to do with FC. Don't get confused: you had iSCSI for storage, which used one of your VLANs, and now you have FC for your storage.
Your hardware specs look good. You can set up the following networks -
1. Public Network - Team two or more NICs (based on bandwidth aggregation)
2. Heartbeat Network - Don't use a teamed adapter
3. Live Migration - Team two or more NICs (based on bandwidth aggregation)
Plan properly and draw up guidelines to visualize and remove single points of failure at all points.
Feel free to ask if you have some more queries.
Regards
Prabhash

Hyper-V cluster Backup causes virtual machine reboots for common Cluster Shared Volumes members.

I am having a problem where my VMs are rebooting while other VMs that share the same CSV are being backed up. I have provided all the information that I have gathered to this point below. If I have missed anything, please let me know.
My HyperV Cluster configuration:
5 Node Cluster running 2008R2 Core DataCenter w/SP1. All updates as released by WSUS that will install on a Core installation
Each Node has 8 NICs configured as follows:
NIC1 - Management/Campus access (26.x VLAN)
NIC2 - iSCSI dedicated (22.x VLAN)
NIC3 - Live Migration (28.x VLAN)
NIC4 - Heartbeat (20.x VLAN)
NIC5 - VSwitch (26.x VLAN)
NIC6 - VSwitch (18.x VLAN)
NIC7 - VSwitch (27.x VLAN)
NIC8 - VSwitch (22.x VLAN)
The following hotfixes were additionally installed on MS guidance (either during the build or when troubleshooting a stability issue in Jan 2013)
KB2531907 - Was installed during original building of cluster
KB2705759 - Installed during troubleshooting in early Jan2013
KB2684681 - Installed during troubleshooting in early Jan2013
KB2685891 - Installed during troubleshooting in early Jan2013
KB2639032 - Installed during troubleshooting in early Jan2013
Original cluster build was two hosts with quorum drive. Initial two hosts were HST1 and HST5
Next host added was HST3, then HST6 and finally HST2.
NOTE: HST4 hardware was used in different project and HST6 will eventually become HST4
Validation of cluster comes with warning for following things:
Updates inconsistent across hosts
  I have tried to manually install "missing" updates and they were not applicable
  Most likely cause is different build times for each machine in cluster
   HST1 and HST5 are both the same level because they were built at same time
   HST3 was not rebuilt from scratch due to time constraints and it actually goes back to Pre-SP1 and has a larger list of updates that others are lacking and hence the inconsistency
   HST6 was built from scratch but has more updates missing than 1 or 5 (10 missing instead of 7)
   HST2 was most recently built and it has the most missing updates (15)
Storage - List Potential Cluster Disks
  It says there are Persistent Reservations on all 14 of my CSV volumes and thinks they are from another cluster.
  They are removed from the validation set for this reason. These iSCSI volumes/disks were all created new for
  this cluster and have never been a part of any other cluster.
When I run the Cluster Validation wizard, I get a slew of Event ID 5120 from FailoverClustering. Wording of error:
  Cluster Shared Volume 'Volume12' ('Cluster Disk 13') is no longer available on this node because of
  'STATUS_MEDIA_WRITE_PROTECTED(c00000a2)'. All I/O will temporarily be queued until a path to the
  volume is reestablished.
Under Storage and Cluster Shared Volumes in Failover Cluster Manager, all disks show online and there is no negative effect of the errors.
Cluster Shared Volumes
We have 14 CSVs that are all iSCSI attached to all 5 hosts. They are housed on an HP P4500G2 (LeftHand) SAN.
I have limited the number of VMs to no more than 7 per CSV as per best practices documentation from HP/Lefthand
VMs in each CSV are spread out amongst all 5 hosts (as you would expect)
Backup software we use is BackupChain from BackupChain.com.
Problem we are having:
When backup kicks off for a VM, all VMs on same CSV reboot without warning. This normally happens within seconds of the backup starting
What we have done to troubleshoot this:
We have tried rebalancing our backups
  Originally, I had backup jobs scheduled to kick off on Friday or Saturday evening after 9pm
  2 or 3 hosts would be backing up VMs (Serially; one VM per host at a time) each night.
  I changed my backup scheduled so that of my 90 VMs, only one per CSV is backing up at the same time
   I mapped out my Hosts and CSVs and scheduled my backups to run on week nights where each night, there
   is only one VM backed up per CSV. All VMs can be backed up over 5 nights (there are some VMs that don't
   get backed up). I also staggered the start times for each Host so that only one Host would be starting
   in the same timeframe. There was some overlap for Hosts that had backups that ran longer than 1 hour.
  Testing this new schedule did not fix my problem. It only made it more clear. As each backup timeframe
  started, whichever CSV the first VM to start was on would have all of their VMs reboot and come back up.
I then thought maybe I was overloading the network still so I decided to disable all of the scheduled backup
and run it manually. Kicking off a backup on a single VM, in most cases, will cause the reboot of common
CSV members.
Ok, maybe there is something wrong with my backup software.
  Downloaded a Demo of Veeam and installed it onto my cluster.
  Did a test backup of one VM and I had no problems.
  Did a test backup of a second VM and I had the same problem. All VMs on same CSV rebooted
Ok, it is not my backup software. Apparently it is VSS. I have looked through various websites. The best troubleshooting
site I have found for VSS in one place is on BackupChain.com (http://backupchain.com/hyper-v-backup/Troubleshooting.html)
I have tested almost every process on their list and I will lay out results below:
  1. I have rebooted HST6 and problems still persist
  2. When I run VSSADMIN delete shadows /all, I have no shadows to delete on any of my 5 nodes
   When I run VSSADMIN list writers, I have no error messages on any writers on any node...
  3. When I check the listed registry key, I only have the built-in MS VSS writer listed (I am using software VSS)
  4. When I run the VSSADMIN Resize ShadowStorage command, there is no shadow storage on any node
  5. I have completed the registration and service cycling on HST6 as laid out here and most of the stuff "errors"
   Only a few of the DLL's actually register.
  6. HyperV Integration Services were reconciled when I worked with MS in early January and I have no indication of
   further issue here.
  7. I did not complete the step to delete the Subscriptions because, again, I have no error messages when I list writers
  8. I removed the Veeam software that I had installed to test (it hadn't added any VSS Writer anyway though)
  9. I can't realistically uninstall my HyperV and test VSS
  10. Already have latest SPs and Updates
  11. This is part of step 5 so I already did this. This seems to be a rehash of various other strategies
I have used the VSS Troubleshooter that is part of BackupChain (Ctrl-T) and I get the following error:
  ERROR: Selected writer 'Microsoft Hyper-V VSS Writer' is in failed state!
  - Status: 8 (VSS_WS_FAILED_AT_PREPARE_SNAPSHOT)
  - Writer Failure code: 0x800423f0 (<Unknown error code>)
  - Writer ID: {66841cd4-6ded-4f4b-8f17-fd23f8ddc3de}
  - Instance ID: {d55b6934-1c8d-46ab-a43f-4f997f18dc71}
  VSS snapshot creation failed with result: 8000FFFF
VSS errors in event viewer. Below are representative errors I have received from various Nodes of my cluster:
I have various of the below spread out over all hosts except for HST6
Source: VolSnap, Event ID 10, The shadow copy of volume took too long to install
Source: VolSnap, Event ID 16, The shadow copies of volume x were aborted because volume y, which contains shadow copy storage for this shadow copy, was force dismounted.
Source: VolSnap, Event ID 27, The shadow copies of volume x were aborted during detection because a critical control file could not be opened.
I only have one instance of each of these and both of the below are from HST3
Source: VSS, Event ID 12293, Volume Shadow Copy Service error: Error calling a routine on a Shadow Copy Provider {b5946137-7b9f-4925-af80-51abd60b20d5}. Routine details RevertToSnapshot [hr = 0x80042302, A Volume Shadow Copy Service component encountered an unexpected error.
Source: VSS, Event ID 8193, Volume Shadow Copy Service error: Unexpected error calling routine GetOverlappedResult. hr = 0x80070057, The parameter is incorrect.
So, basically, everything I have tried has resulted in no success towards solving this problem.
I would appreciate any assistance that can be provided.
Thanks,
Charles J. Palmer
Wright Flood

Tim,
Thanks for the reply. I ran the first two commands and got this:
Name                                Role  Metric
Cluster Network 1                      3   10000
Cluster Network 2 - HeartBeat          1    1300
Cluster Network 3 - iSCSI              0   10100
Cluster Network 4 - LiveMigration      1    1200
When you look at the properties of each network, this is how I have it configured:
Cluster Network 1 - Allow cluster network communications on this network and Allow clients to connect through this network (26.x subnet)
Cluster Network 2 - Allow cluster network communications on this network. New network added while working with Microsoft support last month. (28.x subnet)
Cluster Network 3 - Do not allow cluster network communications on this network. (22.x subnet)
Cluster Network 4 - Allow cluster network communications on this network. Existing but not configured to be used by VMs for Live Migration until MS corrected. (20.x subnet)
Should I modify my metrics further, or are the current values sufficient?
I worked with an MS support rep because my cluster (once I added the 5th host) stopped being able to live migrate VMs and I had VMs host jumping on startup. It was a mess for a couple of days. They had me add the Heartbeat network as part of the solution to my problem. There doesn't seem to be anywhere to configure a network specifically for CSV, so I would assume it would use (based on my metrics above) Cluster Network 4 and then Cluster Network 2 for CSV communications and would fall back to Cluster Network 1 if both 2 and 4 were down/inaccessible.
As to the iSCSI getting a second NIC, I would love to, but management wants separation of our VMs by subnet and role, hence why I need the 4 VSwitch NICs. I would have to look at adding an additional quad port NIC to my servers and I would have to use half height cards for 2 of my 5 servers for that to work.
But, on that note, it doesn't appear to actually be a bandwidth issue. I can run a backup for a single VM and get nothing on the network card (it apparently causes the reboots before any real data has even started to pass) and still the problem occurs.
As to Backup Chain, I have been working with the vendor and they are telling me the issue is with VSS. They also say they support CSVs; see this page (http://backupchain.com/Hyper-V-Backup-Software.html). Their tech support has been very helpful but unfortunately, nothing has fixed the problem.
What is annoying is that not every backup causes a problem. I have a daily backup of one of our machines that runs fine without initiating any additional reboots. But almost every other backup job will trigger the VMs on the common CSV to reboot.
I understood about the updates, but I had to "prove" it to the MS tech I was on the phone with, hence I brought it up. I understand on the storage as well. Why give a warning for something that is working though... I think it is just a poor indicator, and the report doesn't explain that.
At a loss for what else I can do,
Charles J. Palmer
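
The assumption in this reply about which network carries CSV traffic can be written down as a short Python sketch. It assumes the commonly documented behavior that internal cluster traffic, CSV included, prefers the cluster-enabled network with the lowest metric, and that Role 0 means no cluster traffic, 1 means cluster only, and 3 means cluster and client. The type and function names are made up; the values are the ones from the table in the reply.

```python
from typing import List, NamedTuple


class ClusterNetwork(NamedTuple):
    name: str
    role: int    # 0 = no cluster traffic, 1 = cluster only, 3 = cluster and client
    metric: int  # lower metric = preferred for internal cluster traffic


def cluster_traffic_order(networks: List[ClusterNetwork]) -> List[str]:
    """Names of cluster-enabled networks, most preferred (lowest metric) first."""
    usable = [net for net in networks if net.role in (1, 3)]
    return [net.name for net in sorted(usable, key=lambda net: net.metric)]


# Values taken from the table in the reply above.
NETWORKS = [
    ClusterNetwork("Cluster Network 1", 3, 10000),
    ClusterNetwork("Cluster Network 2 - HeartBeat", 1, 1300),
    ClusterNetwork("Cluster Network 3 - iSCSI", 0, 10100),
    ClusterNetwork("Cluster Network 4 - LiveMigration", 1, 1200),
]

if __name__ == "__main__":
    for position, name in enumerate(cluster_traffic_order(NETWORKS), start=1):
        print(position, name)
```

Under that assumption the order comes out as Cluster Network 4, then 2, then 1, with the iSCSI network excluded, which agrees with the fallback order described in the reply.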

MTR Sync error in Initial Synchronization

Getting the below error in the initial sync for the MTR application SR 05 on MI 2.5 SP 21.
Confirmation Details
Customizing Tables
Due to a communications error method WAF_TRIP_GET_CUSTOMIZING could not be executed
Synchronization failed
Please let me know if there is something missing.
Thanks

Thanks for the Reply.
After looking for the table MEREP_MAPDEST, I am getting the error "Table MEREP_MAPDEST is not active in the Dictionary".
Did you by any chance mean "MEMAPPDEST"?
Please let me know what the next step is that I should look into.
Continued ....
Currently the errors are reduced to 1 after implementing Note 595069; the trace is below. Let me know.
[20081108 01:13:57:078] I [MI/API/Logging           ] ***** LOG / TRACE SWITCHED ON
[20081108 01:13:57:078] I [MI/API/Logging           ] ***** Mobile Infrastructure Version: MI 25 SP 21 Patch 01 Build 200809011414
[20081108 01:13:57:078] I [MI/API/Logging           ] ***** Current Timezone: America/New_York
[20081108 01:13:57:078] I [MI/API/Logging           ] ***** Current Trace Level: 50
[20081108 01:14:58:281] E [Unknown                  ] com.sap.mbs.mttcore.tools.TEException: Field name SCFRA not found in PTRV_T706B1
com.sap.mbs.mttcore.tools.TEException: Field name SCFRA not found in PTRV_T706B1
     at com.sap.mbs.mttcore.tools.TypedRecord.setField(Unknown Source)
     at com.sap.mbs.mttcore.synchronization.container.UnpackContainerHelper.unpackME1(Unknown Source)
     at com.sap.mbs.mttcore.synchronization.communication.InboundProcessing.unpack(Unknown Source)
     at com.sap.mbs.mttcore.synchronization.communication.InboundProcessing.processIntern(Unknown Source)
     at com.sap.mbs.mttcore.synchronization.communication.InboundProcessing.process(Unknown Source)
     at com.sap.ip.me.sync.SyncManagerImpl.processSingleContainer(SyncManagerImpl.java:175)
     at com.sap.ip.me.sync.SyncManagerMerger.processInboundContainers(SyncManagerMerger.java:169)
     at com.sap.ip.me.sync.SyncManagerImpl.processSyncCycle(SyncManagerImpl.java:836)
     at com.sap.ip.me.sync.SyncManagerImpl.syncForUser(SyncManagerImpl.java:1278)
     at com.sap.ip.me.sync.SyncManagerImpl.processSynchronization(SyncManagerImpl.java:909)
     at com.sap.ip.me.sync.SyncManagerImpl.synchronizeWithBackend(SyncManagerImpl.java:464)
     at com.sap.ip.me.sync.SyncManagerImpl.synchronizeWithBackend(SyncManagerImpl.java:319)
     at com.sap.ip.me.api.sync.SyncManager.synchronizeWithBackend(SyncManager.java:79)
     at com.sap.mbs.mttcore.compatibility.sync.TESyncManager.synchronizeWithBackend(Unknown Source)
     at com.sap.mbs.mtr.sync.process.impl.SynchronizerAbstract.doSync(Unknown Source)
     at com.sap.mbs.mtr.sync.process.impl.CustomizationSynchronizer.syncWAF_TRIP_GET_CUSTOMIZING(Unknown Source)
     at com.sap.mbs.mtr.sync.process.impl.CustomizationSynchronizer.eventSyncUserOptions(Unknown Source)
     at com.sap.mbs.mtr.sync.process.impl.SyncProcessImpl.syncInitial(Unknown Source)
     at com.sap.mbs.mtr.sync.control.impl.InitialSyncController.sync(Unknown Source)
     at com.sap.mbs.mtr.sync.control.impl.InitialSyncController.onPrepareSync(Unknown Source)
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
     at java.lang.reflect.Method.invoke(Unknown Source)
     at com.sap.mbs.core.control.AbstractViewController.process(Unknown Source)
     at com.sap.mbs.core.control.DefaultStateMachine.process(Unknown Source)
     at com.sap.mbs.core.web.FrontServlet.doHandleEvent(Unknown Source)
     at com.sap.mbs.mtt.application.web.FrontServletImpl.doHandleEvent(Unknown Source)
     at com.sap.ip.me.api.runtime.jsp.AbstractMEHttpServlet.doGetNotThreadSafe(AbstractMEHttpServlet.java:347)
     at com.sap.ip.me.api.runtime.jsp.AbstractMEHttpServlet.doGet(AbstractMEHttpServlet.java:689)
     at com.sap.ip.me.api.runtime.jsp.AbstractMEHttpServlet.doPost(AbstractMEHttpServlet.java:706)
     at javax.servlet.http.HttpServlet.service(HttpServlet.java:760)
     at com.sap.ip.me.api.runtime.jsp.AbstractMEHttpServlet.service(AbstractMEHttpServlet.java:313)
     at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
     at org.apache.tomcat.core.ServletWrapper.doService(ServletWrapper.java:405)
     at org.apache.tomcat.core.Handler.service(Handler.java:287)
     at org.apache.tomcat.core.ServletWrapper.service(ServletWrapper.java:372)
     at org.apache.tomcat.core.ContextManager.internalService(ContextManager.java:806)
     at org.apache.tomcat.core.ContextManager.service(ContextManager.java:752)
     at org.apache.tomcat.service.http.HttpConnectionHandler.processConnection(HttpConnectionHandler.java:213)
     at org.apache.tomcat.service.TcpWorkerThread.runIt(PoolTcpEndpoint.java:416)
     at org.apache.tomcat.util.ThreadPool$ControlRunnable.run(ThreadPool.java:501)
     at java.lang.Thread.run(Unknown Source)
Edited by: SJ on Nov 8, 2008 2:19 AM
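
For long traces like the one above, a short Python sketch can pull out just the error entries and the field and table named in them. The line layout is inferred only from the trace lines quoted in this post, and the function name `error_entries` is made up for illustration.

```python
import re
from datetime import datetime

# Layout inferred from the quoted trace: [timestamp] LEVEL [component] message
ENTRY = re.compile(r"\[(\d{8} \d{2}:\d{2}:\d{2}:\d{3})\] (\w) \[([^\]]*)\] (.*)")
MISSING_FIELD = re.compile(r"Field name (\w+) not found in (\w+)")


def error_entries(trace_text: str) -> list:
    """Return one dict per 'E' level trace line, with any missing field/table."""
    entries = []
    for line in trace_text.splitlines():
        match = ENTRY.match(line)
        if not match or match.group(2) != "E":
            continue
        stamp, _level, component, message = match.groups()
        missing = MISSING_FIELD.search(message)
        entries.append({
            "time": datetime.strptime(stamp, "%Y%m%d %H:%M:%S:%f"),
            "component": component.strip(),
            "message": message,
            "field": missing.group(1) if missing else None,
            "table": missing.group(2) if missing else None,
        })
    return entries


# Two lines copied from the trace above.
TRACE = """\
[20081108 01:13:57:078] I [MI/API/Logging           ] ***** Current Trace Level: 50
[20081108 01:14:58:281] E [Unknown                  ] com.sap.mbs.mttcore.tools.TEException: Field name SCFRA not found in PTRV_T706B1
"""

if __name__ == "__main__":
    for entry in error_entries(TRACE):
        print(entry["time"], entry["field"], entry["table"])
```

On the quoted trace this isolates the single remaining error: the client-side record for PTRV_T706B1 has no field SCFRA, so the place to look is probably a mismatch between the backend structure and the structure the client application expects.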
