Solaris Cluster Private Link Failure

Hi,
I have configured Solaris Cluster 3.3 and add two Back to Back interconnect cable.
Sun Cluster is working fine but private link is fail and i can not ping the clusternode2-priv and clusternode1-priv form each other. some cammands faile
~ # ping clusternode2-priv
no answer from clusternode2-priv
~ # metaset -s nfsds -a -h t1u331 t1u332
metaset: 172.16.4.1: metad client create: RPC: Rpcbind failure
~ # scstat
-- Cluster Nodes --
Node name Status
Cluster node: n1u332 Online
Cluster node: n1u331 Online
-- Cluster Transport Paths --
Endpoint Endpoint Status
Transport path:   n1u332:nxge2           n1u331:nxge2           Path online
Transport path:   n1u332:nxge1           n1u331:nxge1           Path online
-- Quorum Summary from latest node reconfiguration --
Quorum votes possible: 3
Quorum votes needed: 2
Quorum votes present: 3
-- Quorum Votes by Node (current status) --
Node Name Present Possible Status
Node votes: n1u332 1 1 Online
Node votes: n1u331 1 1 Online
-- Quorum Votes by Device (current status) --
Device Name Present Possible Status
Device votes: /dev/did/rdsk/d4s2 1 1 Online
-- Device Group Servers --
Device Group Primary Secondary
-- Device Group Status --
Device Group Status
-- Multi-owner Device Groups --
Device Group Online Status
-- Resource Groups and Resources --
Group Name Resources
-- Resource Groups --
Group Name Node Name State Suspended
-- Resources --
Resource Name Node Name State Status Message
-- IPMP Groups --
Node Name Group Status Adapter Status
[root @ n1u332]
~ # ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
e1000g0: flags=1000802<BROADCAST,MULTICAST,IPv4> mtu 1500 index 2
inet 0.0.0.0 netmask 0
ether 0:15:17:e3:a4:e8
vsw0: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 3
inet 10.131.58.76 netmask ffffff00 broadcast 10.131.58.255
groupname ipmp-grp
ether 0:14:4f:f9:1:bd
vsw0:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
inet 10.131.58.75 netmask ffffff00 broadcast 10.131.58.255
vsw1: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 4
inet 10.131.58.77 netmask ffffff00 broadcast 10.131.58.255
groupname ipmp-grp
ether 0:14:4f:fb:44:4
nxge1: flags=1008843<UP,BROADCAST,RUNNING,MULTICAST,PRIVATE,IPv4> mtu 1500 index 7
inet 172.16.0.129 netmask ffffff80 broadcast 172.16.0.255
ether 0:14:4f:a0:81:d9
nxge2: flags=1008843<UP,BROADCAST,RUNNING,MULTICAST,PRIVATE,IPv4> mtu 1500 index 6
inet 172.16.1.1 netmask ffffff80 broadcast 172.16.1.127
ether 0:14:4f:a0:81:da
clprivnet0: flags=1009843<UP,BROADCAST,RUNNING,MULTICAST,MULTI_BCAST,PRIVATE,IPv4> mtu 1500 index 8
inet 172.16.4.1 netmask fffffe00 broadcast 172.16.5.255
ether 0:0:0:0:0:1
[root @ n1u332]
~ # dladm show-dev
vsw0 link: up speed: 1000 Mbps duplex: full
vsw1 link: up speed: 1000 Mbps duplex: full
e1000g0 link: down speed: 0 Mbps duplex: half
e1000g1 link: up speed: 1000 Mbps duplex: full
e1000g2 link: unknown speed: 0 Mbps duplex: half
e1000g3 link: unknown speed: 0 Mbps duplex: half
nxge0 link: up speed: 100 Mbps duplex: full
nxge1 link: up speed: 1000 Mbps duplex: full
nxge2 link: up speed: 1000 Mbps duplex: full
nxge3 link: up speed: 100 Mbps duplex: full
e1000g4 link: unknown speed: 0 Mbps duplex: half
e1000g5 link: up speed: 1000 Mbps duplex: full
clprivnet0              link: unknown   speed: 0     Mbps       duplex: unknown
Edited by: 808696 on Mar 2, 2011 8:27 AM

If your private interconnect had really failed then one or other of the cluster nodes would have panicked. I think it is more likely that either you have changed the nsswitch.conf entry for hosts such that it does not include 'cluster' first, although I would have expected that to result in an unresolved host name. The other option is that you have hardened your machine in some way with ipfilters or security settings.
Has it ever worked?
Tim
---

Similar Messages

  • UCCE Router Call Key Reset Post Private Link Failure

                     Hi Guys
    As above. My customer has UCCE v7.5.8 together with CVP 7.0 designed with a pair of Proggers geographically split with a 2x10Mb LES between the 2 sites supporting both the visible and private traffic (One LES for each).
    The LES supporting the Private traffic 'blipped' but recovered within 10 mins. Now each day @ midnight Router Call Key starts at 201, however,whilst looking into a reporting issue I noticed that Router Call Key, seems to have reset to 201, but prepended with digits 34768, making a Router Call Key of 34768201. This appears to coincide with the time of the Private link recovery. Has anyone seen this before? I'm guessing this denotes a state-transfer/Db re-sync has taken place, or am I barking up the wrong tree?
    Cheers
    Freddie

    Thanks Lasse,
    When running that tool, all the links are showing zero utilization - not sure if the tool is able to show any more detailed information? 
    I should state that this Lync implementation is not live, there no Lync clients out on the corporate network, and I'm not expecting any traffic on the WAN site links at this point.
    It just seems to be an issue when accessing as a remote user.   At this point I think it must be because I have misconfigured CAC.  And because it is happening for a conference and not peer to peer for the remote user, then it is probably the relationship
    between the front end server pool and Edge server / remote user.  
    I have a local / central site defined in CAC with no bandwidth policy applied.  Initially the subnets assigned to this central site were just the LAN subnets for the central site that I knew would have Lync clients running on them.  Then as part
    of troubleshooting decided to add the Edge Internal IP, and then Edge external IPs as subnets associated with this central site.  Still getting the same problem though.
    Its around about times like now where I wonder if it is something simple like a service needing restarting so I'll double check things like this.   If you have any other ideas it would be greatly appreciated - a bit stuck.
    Regards,
    James

  • HOWTO: Create 2-node Solaris Cluster 4.1/Solaris 11.1(x64) using VirtualBox

    I did this on VirtualBox 4.1 on Windows 7 and VirtualBox 4.2 on Linux.X64. Basic pre-requisites are : 40GB disk space, 8GB RAM, 64-bit guest capable VirtualBox.
    Please read all the descriptive messages/prompts shown by 'scinstall' and 'clsetup' before answering.
    0) Download from OTN
    - Solaris 11.1 Live Media for x86(~966 MB)
    - Complete Solaris 11.1 IPS Repository Image (total 7GB)
    - Oracle Solaris Cluster 4.1 IPS Repository image (~73MB)
    1) Run VirtualBox Console, create VM1 : 3GB RAM, 30GB HDD
    2) The new VM1 has 1 NIC, add 2 more NICs (total 3). Setting the NIC to any type should be okay, 'VirtualBox Host Only Adapter' worked fine for me.
    3) Start VM1, point the "Select start-up disk" to the Solaris 11.1 Live Media ISO.
    4) Select "Oracle Solaris 11.1" in the GRUB menu. Select Keyboard layout and Language.
    VM1 will boot and the Solaris 11.1 Live Desktop screen will appear.
    5) Click <Install Oracle Solaris> from the desktop, supply necessary inputs.
    Default Disk Discovery (iSCSI not needed) and Disk Selection are fine.
    Disable the "Support Registration" connection info
    6) The alternate user created during the install has root privileges (sudo). Set appropriate VM1 name
    7) When the VM has to be rebooted after the installation is complete, make sure the Solaris 11.1 Live ISO is ejected or else the VM will again boot from the Live CD.
    8) Repeat steps 1-6, create VM2 and install Solaris.
    9) FTP(secure) the Solaris 11.1 Repository IPS and Solaris Cluster 4.1 IPS onto both the VMs e.g under /home/user1/
    10) We need to setup both the packages: Solaris 11.1 Repository and Solaris Cluster 4.1
    11) All commands now to be run as root
    12) By default the 'solaris' repository is of type online (pkg.oracle.com), that needs to be updated to the local ISO we downloaded :-
    +$ sudo sh+
    +# lofiadm -a /home/user1/sol-11_1-repo-full.iso+
    +//output : /dev/lofi/N+
    +# mount -F hsfs /dev/lofi/N /mnt+
    +# pkg set-publisher -G '*' -M '*' -g /mnt/repo solaris+
    13) Setup the ha-cluster package :-
    +# lofiadm -a /home/user1/osc-4_1-ga-repo-full.iso+
    +//output : /dev/lofi/N+
    +# mkdir /mnt2+
    +# mount -f hsfs /dev/lofi/N /mnt2+
    +# pkg set-publisher -g file:///mnt2/repo ha-cluster+
    14) Verify both packages are fine :-
    +# pkg publisher+
    PUBLISHER                   TYPE     STATUS P LOCATION
    solaris                     origin   online F file:///mnt/repo/
    ha-cluster                  origin   online F file:///mnt2/repo/
    15) Install the complete SC4.1 package by installing 'ha-cluster-full'
    +# pkg install ha-cluster-full+
    14) Repeat steps 12-15 on VM2.
    15) Now both VMs have the OS and SC4.1 installed.
    16) By default the 3 NICs are in the "Automatic" profile and have DHCP configured. We need to activate the Fixed profile and put the 3 NICs into it. Only 1 interface, the public interface, needs to be
    configured. The other 2 are for the cluster interconnect and will be automatically configured by scinstall. Execute the following commands :-
    +# netadm enable -p ncp defaultfixed+
    +//verify+
    +# netadm list -p ncp defaultfixed+
    +#Configure the public-interface+
    +#Verify none of the interfaces are listed, add all the 3+
    +# ipadm show-if+
    +# run dladm show-phys or dladm show-link to check interface names : must be net0/net1/net2+
    +# ipadm create-ip net0+
    +# ipadm create-ip net1+
    +# ipadm create-ip net2+
    +# ipadm show-if+
    +//select proper IP and configure the public interface. I have used 192.168.56.171 & 172+
    +# ipadm create-addr -T static -a 192.168.56.171/24 net0/publicip+
    +#IP plumbed, restart+
    +# ipadm down-addr -t net0/publicip+
    +# ipadm up-addr -t net0/publicip+
    +//Verify publicip is fine by pinging the host+
    +# ping 192.168.56.1+
    +//Verify, net0 should be up, net1/net2 should be down+
    +# ipadm+
    17) Repeat step 16 on VM2
    18) Verify both VMs can ping each other using the public IP. Add entries to each other's /etc/hosts
    Now we are ready to run scinstall and create/configure the 2-node cluster
    19)
    +# cd /usr/cluster/bin+
    +# ./scinstall+
    select 1) Create a new cluster ...
    select 1) Create a new cluster
    select 2) Custom in "Typical or Custom Mode"
    Enter cluster name : mycluster1 (e.g)
    Add the 2 nodes : solvm1 & solvm2 and press <ctrl-d>
    Accept default "No" for <Do you need to use DES authentication>"
    Accept default "Yes" for <Should this cluster use at least two private networks>
    Enter "No" for <Does this two-node cluster use switches>
    Select "1)net1" for "Select the first cluster transport adapter"
    If there is warning of unexpected traffic on "net"1, ignore it
    Enter "net1" when it asks corresponding adapter on "solvm2"
    Select "2)net2" for "Select the second cluster transport adapter"
    Enter "net2" when it asks corresponding adapter on "solvm2"
    Select "Yes" for "Is it okay to accept the default network address"
    Select "Yes" for "Is it okay to accept the default network netmask"Now the IP addresses 172.16.0.0 will be plumbed in the 2 private interfaces
    Select "yes" for "Do you want to turn off global fencing"
    (These are SATA serial disks, so no fencing)
    Enter "Yes" for "Do you want to disable automatic quorum device selection"
    (we will add quorum disks later)
    Enter "Yes" for "Proceed with cluster creation"
    Select "No" for "Interrupt cluster creation for cluster check errors"
    The second node will be configured and 2nd node rebooted
    The first node will be configured and rebootedAfter both nodes have rebooted, verify the cluster has been created and both nodes joined.
    On both nodes :-
    +# cd /usr/cluster/bin+
    +# ./clnode status+
    +//should show both nodes Online.+
    At this point there are no quorum disks, so 1 of the node's will be designated quorum vote. That node VM has to be up for the other node to come up and cluster to be formed.
    To check the current quorum status, run :-
    +# ./clquorum show+
    +//one of the nodes will have 1 vote and other 0(zero).+
    20)
    Now the cluster is in 'Installation Mode' and we need to add a quorum disk.
    Shutdown both the nodes as we will be adding shared disks to both of them
    21)
    Create 2 VirtualBox HDDs (VDI Files) on the host, 1 for quorum and 1 for shared filesystem. I have used a size of 1 GB for each :-
    *$ vboxmanage createhd --filename /scratch/myimages/sc41cluster/sdisk1.vdi --size 1024 --format VDI --variant Fixed*
    *0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%*
    *Disk image created. UUID: 899147b9-d21f-4495-ad55-f9cf1ae46cc3*
    *$ vboxmanage createhd --filename /scratch/myimages/sc41cluster/sdisk2.vdi --size 1024 --format VDI --variant Fixed*
    *0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%*
    *Disk image created. UUID: 899147b9-d22f-4495-ad55-f9cf15346caf*
    22)
    Attach these disks to both the VMs as shared type
    *$ vboxmanage storageattach solvm1 --storagectl "SATA" --port 1 --device 0 --type hdd --medium /scratch/myimages/sc41cluster/sdisk1.vdi --mtype shareable*
    *$ vboxmanage storageattach solvm1 --storagectl "SATA" --port 2 --device 0 --type hdd --medium /scratch/myimages/sc41cluster/sdisk2.vdi --mtype shareable*
    *$ vboxmanage storageattach solvm2 --storagectl "SATA" --port 1 --device 0 --type hdd --medium /scratch/myimages/sc41cluster/sdisk1.vdi --mtype shareable*
    *$ vboxmanage storageattach solvm2 --storagectl "SATA" --port 2 --device 0 --type hdd --medium /scratch/myimages/sc41cluster/sdisk2.vdi --mtype shareable*
    The disks are attached to SATA ports 1 & 2 of each VM. On my VirtualBox on Linux, the controller type is "SATA", whereas on Windows it is "SATA Controller".
    The "--mtype shareable' parameter is important
    23)
    Mark both disks as shared :-
    *$ vboxmanage modifyhd /scratch/myimages/sc41cluster/sdisk1.vdi --type shareable*
    *$ vboxmanage modifyhd /scratch/myimages/sc41cluster/sdisk2.vdi --type shareable*
    24) Start both VMs. We need to format the 2 shared disks
    25) From VM1, run format. In my case, the 2 new shared disks show up as 'c7t1d0' and 'c7t2d0'.
    +# format+
    select disk 1 (c7t1d0)
    [disk formated]
    FORMAT MENU
    fdisk
    Type 'y' to accept default partition
    partition
    0
    <enter>
    <enter>
    1
    995mb
    print
    label
    <yes>
    quit
    quit26) Repeat step 25) for the 2nd disk (c7t2d0)
    27) Make sure the shared disks can be used for quorum :-
    On VM1
    +# ./cldevice refresh+
    +# ./cldevice show+
    On VM2
    +# ./cldevice refresh+
    +# ./cldevice show+
    The shared disks should have the same DID (d2,d3,d4 etc). Note down the DID that you are going to use for quorum (e.g d2)
    By default, global fencing is enabled for these disks. We need to turn it off for all disks as these are SATA disks :-
    +# cldevice set -p default_fencing=nofencing-noscrub d1+
    +# cldevice set -p default_fencing=nofencing-noscrub d2+
    +# cldevice set -p default_fencing=nofencing-noscrub d3+
    +# cldevice set -p default_fencing=nofencing-noscrub d4+
    28) It is better to do one more reboot of both VMs, otherwise I got a error when adding the quorum disk
    29) Run clsetup to add quorum disk and to complete cluster configuration :-
    +# ./clsetup+
    === Initial Cluster Setup ===
    Enter 'Yes' for "Do you want to continue"
    Enter 'Yes' for "Do you want add any quorum devices"
    Select '1) Directly Attached Shared Disk' for the type of device
    Enter 'Yes' for "Is it okay to continue"
    Enter 'd2' (or 'd3') for 'Which global device do you want to use'
    Enter 'Yes' for "Is it okay to proceed with the update"
    The command 'clquorum add d2' is run
    Enter 'No' for "Do you want to add another quorum device"
    Enter 'Yes' for "Is it okay to reset "installmode"?"Cluster initialization is complete.!!!
    30) Run 'clquorum status' to confirm both nodes and the quorum disk have 1 vote each
    31) Run other cluster commands to explore!
    I will cover Data services and shared file system in another post. Basically the other shared disk
    can be used to create a UFS filesystem and mount it on all nodes.

    The Solaris Cluster 4.1 Installation and Concepts Guide are available at :-
    http://docs.oracle.com/cd/E29086_01/index.html
    Thanks.

  • Re: Oracle Solaris Cluster 3.3 using Oracle Solaris 09/10 on virtualbox

    Dear All,
    I am trying to build a solaris cluster 3.3 on virtualbox using Oracle Solaris 09/10. My host OS is Ubuntu 10.10. But, when i try to add the second node, it is not able to find the node1 says "unable to connect to node1". Can anyone help me build this cluster.
    my laptop is hp probook 4420s and has i3 processor and 2gb ram, wherein i have assigned 750mb ram to both Oracle Solaris 09/10 OS. Moreover, the first node is successfully attached. Both the virtual machines can ssh succesfully and link based ipmp is setup on both the virtual machines. Please help me regarding this as i am new to clustering.

    As Nik said, this could be a problem with the setup of the private interconnects. This could also be caused by the fact that the necessary ports on the nodes are closed - due to secure by default.
    There is an old blog entry on blogs.sun.com by Thorsten Frueauf that explains how to open the ports and/or start the necessary services. I think it is also documented now as part of the normal installation manuals.

  • Cisco 6880-x vsl link failure

    Hi Guys,
    I have a head scratcher that I'm dealing with.  I have 2x (2 month) 6880-x switches configured with vss with 2 vsl links on each of its sup cards and switch 1 is the active switch on the vss.  Both 6880x chassis also have 2 additional 10g modules each.  I am dealing with 2 issues in this setup.  
    The first issue is,  i have a 3750x with a 10g module thats port channeled to the vss with one 10g link going to each of the chassis.  Within the first 3 days it started dropping the link that's plugged to switch1, then comes back on right away.  It does it randomly throughout the day.  I have changed out the optics on both ends, changed the cable, and also changed the module on the 3750x and I still get the same issue.  The strange part is that when switch2 on the vss becomes the active switch, the problem goes away.  I did notice this before changing out the optics, cable, module on the 3750x.
    The second issue is the one I'm been trying to figure out along with cisco TAC.  This started about a month after bring the vss online.  When switch1 was the active switch in the vss, every couple of days vsl links drop one at a time, eventually killing the vss and putting the standby unit into recovery mode because of the dual active detection.  once I reboot the standby switch, the vss comes back up normally.  it did this a couple of times until I decided to force switch2 to be the active switch on the vss.  when switch2 became active, the vss was stable for about a month then the vsl links died again, and the system failed over to switch1.  after looking at the logs with cisco tac, we see that the vsl links stop responding which causes the failover, but up until now we still can't determine what is causing the vsl links to fail.  cisco tac said that maybe the vsl links were being overloaded but we have been monitoring the bandwidth utilization on the vsl links and they never go beyond 1% utilization.  The last suggestion by cisco tac was to add another vsl link but through another module other than the sup.  This was done a couple of days ago so now i'm waiting to see if the vsl links fail again and since switch1 is the active switch on the vss, i'm having to deal with the first issue above with the 3750x.
    I've included some log entries from the dropped port-channel member and also logs for when the vsl links fail.
    Logs for 3750x port-channel member drops.
    *Jul  5 05:27:10 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to down
    *Jul  5 05:27:10 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to down
    *Jul  5 05:27:10 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to up
    *Jul  5 05:28:10 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to up
    *Jul  6 11:28:52 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to down
    *Jul  6 11:28:52 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to down
    *Jul  6 11:28:53 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to up
    *Jul  6 11:29:53 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to up
    *Jul  7 06:27:25 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to down
    *Jul  7 06:27:25 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to down
    *Jul  7 06:27:26 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to up
    *Jul  7 06:27:33 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to up
    *Jul  7 06:33:43 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to down
    *Jul  7 06:33:43 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to down
    *Jul  7 06:33:43 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to up
    *Jul  7 06:34:43 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to up
    *Jul  7 07:38:35 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to down
    *Jul  7 07:38:35 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to down
    *Jul  7 07:38:36 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to up
    log for VSL link failures
    *Jun 27 23:59:25 PDT: %VSLP-SW2-3-VSLP_LMP_FAIL_REASON: Te2/5/15: Link down
    *Jun 27 23:59:25 PDT: %VSL-SW2-5-VSL_CNTRL_LINK:  New VSL Control Link  2/5/16 
    *Jun 27 23:59:25 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/5/15, changed state to down
    *Jun 27 23:59:25 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/5/15, changed state to down
    *Jun 27 23:59:25 PDT: %LINK-SW2-3-UPDOWN: Interface TenGigabitEthernet2/5/15, changed state to down
    *Jun 27 23:59:25 PDT: %VSLP-SW2-3-VSLP_LMP_FAIL_REASON: Te2/5/16: Link down
    *Jun 27 23:59:25 PDT: %VSLP-SW2-2-VSL_DOWN:   Last VSL interface Te2/5/16 went down
    *Jun 27 23:59:25 PDT: %VSLP-SW2-2-VSL_DOWN:   All VSL links went down while switch is in ACTIVE role
    *Jun 27 23:59:25 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/5/16, changed state to down
    *Jun 27 23:59:25 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface Port-channel2, changed state to down
    *Jun 27 23:59:25 PDT: %LINK-SW2-3-UPDOWN: Interface Port-channel2, changed state to down
    *Jun 27 23:59:25 PDT: %LINK-SW2-3-UPDOWN: Interface TenGigabitEthernet2/5/16, changed state to down
    *Jun 27 23:59:25 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface Port-channel1, changed state to down
    *Jun 27 23:59:25 PDT: %LINK-SW2-3-UPDOWN: Interface Port-channel1, changed state to down
    *Jun 27 23:59:25 PDT: %OIR-SW2-6-INSREM: Switch 1 Physical Slot 5 - Module Type LINE_CARD  removed 
    *Jun 27 23:59:25 PDT: %OSPF-SW2-5-ADJCHG: Process 1, Nbr 10.253.0.3 on TenGigabitEthernet1/1/1 from FULL to DOWN, Neighbor Down: Interface down or detached
    *Jun 27 23:59:25 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface Port-channel24, changed state to down
    *Jun 27 23:59:25 PDT: %LINK-SW2-3-UPDOWN: Interface Port-channel24, changed state to down
    *Jun 27 23:59:26 PDT: %OIR-SW2-6-INSREM: Switch 1 Physical Slot 1 - Module Type LINE_CARD  removed 
    *Jun 27 23:59:26 PDT: %LINK-SW2-3-UPDOWN: Interface Port-channel17, changed state to down
    *Jun 27 23:59:26 PDT: %PFREDUN-SW2-6-ACTIVE: Standby processor removed or reloaded, changing to Simplex mode
    *Jun 27 23:59:26 PDT: %OIR-SW2-6-INSREM: Switch 1 Physical Slot 2 - Module Type LINE_CARD  removed 
    *Jun 27 23:59:27 PDT: %LINK-SW2-3-UPDOWN: Interface TenGigabitEthernet1/5/14, changed state to down
    *Jun 27 23:59:27 PDT: %LINK-SW2-3-UPDOWN: Interface TenGigabitEthernet1/5/15, changed state to down
    *Jun 27 23:59:27 PDT: %LINK-SW2-3-UPDOWN: Interface TenGigabitEthernet1/5/16, changed state to down
    *Jun 27 23:59:27 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/5/14, changed state to down
    *Jun 27 23:59:27 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/5/16, changed state to down
    *Jun 27 23:59:27 PDT: %LINK-SW2-3-UPDOWN: Interface TenGigabitEthernet1/1/1, changed state to down
    *Jun 27 23:59:27 PDT: %LINK-SW2-3-UPDOWN: Interface TenGigabitEthernet1/1/2, changed state to down
    *Jun 27 23:59:27 PDT: %LINK-SW2-3-UPDOWN: Interface TenGigabitEthernet1/1/3, changed state to down
    *Jun 27 23:59:27 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/1/1, changed state to down
    *Jun 27 23:59:27 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/1/2, changed state to down
    *Jun 27 23:59:27 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/1/3, changed state to down
    Press RETURN to get started!
    *Jun 28 07:03:33.421: %USBFLASH-SW2_STBY-5-CHANGE: bootdisk has been inserted!
    *Jun 28 07:03:53.297: %OIR-SW2_STBY-6-INSPS: Power supply inserted in slot 1
    *Jun 28 07:03:53.301: %C6KPWR-SW2_STBY-4-PSOK: power supply 1 turned on.
    *Jun 28 07:04:26.093: %FABRIC-SW2_STBY-5-FABRIC_MODULE_ACTIVE: The Switch Fabric Module in slot 5 became active.
    *Jun 28 07:04:48.497: %DIAG-SW2_STBY-6-RUN_MINIMUM: Switch 2 Module 5: Running Minimal Diagnostics...
    *Jun 28 07:04:48.497: %CONST_DIAG-SW2_STBY-6-DIAG_PORT_SKIPPED: Module 5 port 15 is skipped in TestLoopback due to: the port is used as a VSL link.
    *Jun 28 07:04:48.497: %CONST_DIAG-SW2_STBY-6-DIAG_PORT_SKIPPED: Module 5 port 16 is skipped in TestLoopback due to: the port is used as a VSL link.
    *Jun 28 07:04:55.049: %CONST_DIAG-SW2_STBY-6-DIAG_PORT_SKIPPED: Module 5 port 15 is skipped in TestFexModeLoopback due to: the port is used as a VSL link.
    *Jun 28 07:04:55.049: %CONST_DIAG-SW2_STBY-6-DIAG_PORT_SKIPPED: Module 5 port 16 is skipped in TestFexModeLoopback due to: the port is used as a VSL link.
    *Jun 28 07:05:00.165: %CONST_DIAG-SW2_STBY-6-DIAG_PORT_SKIPPED: Module 5 port 15 is skipped in TestL2CTSLoopback due to: the port is used as a VSL link.
    *Jun 28 07:05:00.165: %CONST_DIAG-SW2_STBY-6-DIAG_PORT_SKIPPED: Module 5 port 16 is skipped in TestL2CTSLoopback due to: the port is used as a VSL link.
    *Jun 28 07:05:06.049: %CONST_DIAG-SW2_STBY-6-DIAG_PORT_SKIPPED: Module 5 port 15 is skipped in TestL3CTSLoopback due to: the port is used as a VSL link.
    *Jun 28 07:05:06.049: %CONST_DIAG-SW2_STBY-6-DIAG_PORT_SKIPPED: Module 5 port 16 is skipped in TestL3CTSLoopback due to: the port is used as a VSL link.
    *Jun 28 07:05:23.309: %DIAG-SW2_STBY-6-DIAG_OK: Switch 2 Module 5: Passed Online Diagnostics
    *Jun 28 00:05:38 PDT: %SYS-SW2_STBY-6-CLOCKUPDATE: System clock has been updated from 00:05:38 PDT Sat Jun 28 2014 to 00:05:38 PDT Sat Jun 28 2014, configured from console by console.
    *Jun 28 00:05:38 PDT: %SYS-SW2_STBY-6-CLOCKUPDATE: System clock has been updated from 00:05:38 PDT Sat Jun 28 2014 to 00:05:38 PDT Sat Jun 28 2014, configured from console by console.
    *Jun 28 00:05:38 PDT: %SSH-SW2_STBY-5-DISABLED: SSH 2.0 has been disabled
    *Jun 28 00:05:54 PDT: %SYS-SW2_STBY-5-RESTART: System restarted --
    Cisco IOS Software, c6880x Software (c6880x-ADVENTERPRISEK9-M), Version 15.1(2)SY2, RELEASE SOFTWARE (fc3)
    Technical Support: http://www.cisco.com/techsupport
    Copyright (c) 1986-2014 by Cisco Systems, Inc.
    Compiled Wed 26-Feb-14 15:30 by prod_rel_team
    *Jun 28 00:05:54 PDT: %SSH-SW2_STBY-5-ENABLED: SSH 2.0 has been enabled
    *Jun 28 00:05:54 PDT: %SYS-SW2_STBY-3-LOGGER_FLUSHED: System was paused for 00:03:01 to ensure console debugging output.
    *Jun 28 00:05:56 PDT: %C6KENV-SW2_STBY-4-LOWER_SLOT_EMPTY: The lower adjacent slot of module 5 might be empty. Airdam must be installed in that slot to be NEBS compliant
    *Jun 28 00:07:36 PDT: %DIAG-SW2_STBY-6-RUN_MINIMUM: Switch 2 Module 1: Running Minimal Diagnostics...
    *Jun 28 00:07:37 PDT: %SYS-SW2_STBY-3-LOGGER_FLUSHED: System was paused for 00:01:29 to ensure console debugging output.
    *Jun 28 00:07:41 PDT: %DIAG-SW2_STBY-6-RUN_MINIMUM: Switch 2 Module 2: Running Minimal Diagnostics...
    *Jun 28 00:08:10 PDT: %DIAG-SW2_STBY-6-DIAG_OK: Switch 2 Module 1: Passed Online Diagnostics
    *Jun 28 00:08:22 PDT: %EC-SW2_STBY-5-CANNOT_BUNDLE2: Te2/1/10 is not compatible with Te1/2/10 and will be suspended (Operational flow control send of Te2/1/10 is off, Te1/2/10 is on)
    *Jun 28 00:08:31 PDT: %EC-SW2_STBY-5-COMPATIBLE: Te2/1/10 is compatible with port-channel members
    *Jun 28 00:08:45 PDT: %DIAG-SW2_STBY-6-DIAG_OK: Switch 2 Module 2: Passed Online Diagnostics
    *Jun 28 00:08:46 PDT: %C6KENV-SW2_STBY-4-HIGHER_SLOT_EMPTY: The higher adjacent slot of module 2 might be empty. Airdam must be installed in that slot to be NEBS compliant
    ELDC1C1-AG01-1 line 0 
    ************************ W A R N I N G ***************************
    * THIS IS A PRIVATE COMPUTER SYSTEM, AND FOR AUTHORIZED USE ONLY.*
    * THIS SYSTEM IS MONITORED, AND ANY UNAUTHORIZED USE MAY BE      *
    * SUBJECT TO CRIMINAL PROSECUTION.                               *
    * IF YOU ARE NOT AUTHORIZED, LOG OUT IMMEDIATELY!!!!!            *
    so according to the logs, it seems like the switch was reloaded or something of the nature but even the cisco tac said that it wasnt the case but couldnt determine whats causing it either.  Tac went through the config on the vss and has said that its configured correctly.
    Maybe someone else has experienced this issue or if someone can point out something that I can look at...  sorry for the very long post.
    thanks

    I suppose, we have the same problem. Only when Switch 1 is the active switch of the 6880-X VSS system, the connection to the FEX-stack breaks suddenly after running fine for some days and the FEX-stack (3 x 6800-IA) get only ready again when I reset the whole VSS-system, both VSS-switches and the FEX-stack.
    All access ports at the FEX can't connect to our infrastructure (DHCP,DNS...). The two connected TenG FEX-stack uplinks sometimes get down and sometimes only the uplink to the active parent switch get down. The SYST LED of FEX 1 lighting amber.
    When parent switch 2 is the active unit the issue never occured.
    6880-X running with 15.2(1)SY, 6800-IA running with 15.2(3)E.
    Cisco TAC is investigating that issue, but till now without any result.
    When the issue occurs, the recorded logging messages are:
    Syslog 6880-X
    Apr  7 07:42:43.182: %PLATFORM_RPC-3-MSG_THROTTLED: RPC Msg Dropped by throttle mechanism: type 3, class 21, max_msg 32, total throttled 1274 (FEX-101)
    Apr  7 07:45:43.181: %PLATFORM_RPC-3-MSG_THROTTLED: RPC Msg Dropped by throttle mechanism: type 3, class 21, max_msg 32, total throttled 1276 (FEX-101)
    Syslog 6800-IA:
    Apr  7 07:57:43.182: %PLATFORM_RPC-3-MSG_THROTTLED: RPC Msg Dropped by throttle mechanism: type 3, class 21, max_msg 32, total throttled 1284
    -Traceback= 5198C4z 21413F8z 1C3B35Cz 1C3C6FCz 1C3CA00z 2654CC0z 26509BCz
    Apr  7 08:00:43.181: %PLATFORM_RPC-3-MSG_THROTTLED: RPC Msg Dropped by throttle mechanism: type 3, class 21, max_msg 32, total throttled 1286
    -Traceback= 5198C4z 21413F8z 1C3B35Cz 1C3C6FCz 1C3CA00z 2654CC0z 26509BCz

  • Solaris Cluster 3.3 on VMware ESX 4.1

    Hi there,
    I am trying to setup Solaris Cluster 3.3 on Vmware ESX 4.1
    My first question is: Is there anyone out there setted up Solaris Cluster on vmware accross boxes?
    My tools:
    Solaris 10 U9 x64
    Solaris Cluster 3.3
    Vmware ESX 4.1
    HP DL 380 G7
    HP P2000 Fibre Channel Storage
    When I try to setup cluster, just next next next, it completes successfully. It reboots the second node first and then the itself.
    After second node comes up on login screen, ping stops after 5 sec. Same either nodes!
    I am trying to understand why it does that? I did every possibility to complete this job. Setted up quorum as RDM from VMware. Solaris has direct access to quorum disk now.
    I am new to Solaris and I am having the errors below. If someone would like to help me it will be much appreciated!
    Please explain me in more details i am new bee in solaris :) Thanks!
    I need help especially on error: /proc fails to mount periodically during reboots.
    Here is the error messages. Is there any one out there setted up Solaris Cluster on ESX 4.1 ?
    * cluster check (ver 1.0)
    Report Date: 2011.02.28 at 16.04.46 EET
    2011.02.28 at 14.04.46 GMT
    Command run on host:
    39bc6e2d- sun1
    Checks run on nodes:
    sun1
    Unique Checks: 5
    ===========================================================================
    * Summary of Single Node Check Results for sun1
    ===========================================================================
    Checks Considered: 5
    Results by Status
    Violated : 0
    Insufficient Data : 0
    Execution Error : 0
    Unknown Status : 0
    Information Only : 0
    Not Applicable : 2
    Passed : 3
    Violations by Severity
    Critical : 0
    High : 0
    Moderate : 0
    Low : 0
    * Details for 2 Not Applicable Checks on sun1
    * Check ID: S6708606 ***
    * Severity: Moderate
    * Problem Statement: Multiple network interfaces on a single subnet have the same MAC address.
    * Applicability: Scan output of '/usr/sbin/ifconfig -a' for more than one interface with an 'ether' line. Check does not apply if zero or only one ether line.
    * Check ID: S6708496 ***
    * Severity: Moderate
    * Problem Statement: Cluster node (3.1 or later) OpenBoot Prom (OBP) has local-mac-address? variable set to 'false'.
    * Applicability: Applicable to SPARC architecture only.
    * Details for 3 Passed Checks on sun1
    * Check ID: S6708605 ***
    * Severity: Critical
    * Problem Statement: The /dev/rmt directory is missing.
    * Check ID: S6708638 ***
    * Severity: Moderate
    * Problem Statement: Node has insufficient physical memory.
    * Check ID: S6708642 ***
    * Severity: Critical
    * Problem Statement: /proc fails to mount periodically during reboots.
    ===========================================================================
    * End of Report 2011.02.28 at 16.04.46 EET
    ===========================================================================
    Edited by: user13603929 on 28-Feb-2011 22:22
    Edited by: user13603929 on 28-Feb-2011 22:24
    Note: Please ignore memory error I have installed 5GB memory and it says it requires min 1 GB! i think it is a bug!
    Edited by: user13603929 on 28-Feb-2011 22:25

    @TimRead
    Hi, thanks for reply,
    I have already followed the steps also on your links but no joy on this.
    What i noticed here is cluster seems to be buggy. Because i have tried to install cluster 3.3 on physical hardware and it gave me excat same error messages! interesting isnt it?
    Please see errors below that I got from on top of VMware and also on Solaris Physical hardware installation:
    ERROR1:
    Comment: I have installed different memories all the time. It keeps sayying that silly error.
    problem_statement : *Node has insufficient physical memory.
    <analysis>5120 MB of memory is installed on this node.The current release of Solaris Cluster requires a minimum of 1024 MB of physical memory in each node. Additional memory required for various Data Services.</analysis>
    <recommendations>Add enough memory to this node to bring its physical memory up to the minimum required level.
    ERROR2
    Comment: Despite rmt directory is there I gor error below on cluster check
    <problem_statement>The /dev/rmt directory is missing.
    <analysis>The /dev/rmt directory is missing on this Solaris Cluster node. The current implementation of scdidadm(1M) relies on the existence of /dev/rmt to successfully execute 'scdidadm -r'. The /dev/rmt directory is created by Solaris regardless of the existence of the actual nderlying devices. The expectation is that the user will never delete this directory. During a reconfiguration reboot to add new disk devices, if /dev/rmt is missing scdidadm will not create the new devices and will exit with the following error: 'ERR in discover_paths : Cannot walk /dev/rmt' The absence of /dev/rmt might prevent a failover to this node and result in a cluster outage. See BugIDs 4368956 and 4783135 for more details.</analysis>
    ERROR3
    Comment: All Nics have different MAC address though, also I have done what it suggests me. No joy here as well!
    <problem_statement>Cluster node (3.1 or later) OpenBoot Prom (OBP) has local-mac-address? variable set to 'false'.
    <analysis>The local-mac-address? variable must be set to 'true.' Proper operation of the public networks depends on each interface having a different MAC address.</analysis>
    <recommendations>Change the local-mac-address? variable to true: 1) From the OBP (ok> prompt): ok> setenv local-mac-address? true ok> reset 2) Or as root: # /usr/sbin/eeprom local-mac-address?=true # init 0 ok> reset</recommendations>
    ERROR4
    Comment: No comment on this, i have done what it says no joy...
    <problem_statement>/proc fails to mount periodically during reboots.
    <analysis>Something is trying to access /proc before it is normally mounted during the boot process. This can cause /proc not to mount. If /proc isn't mounted, some Solaris Cluster daemons might fail on startup, which can cause the node to panic. The following lines were found:</analysis>
    Thanks!

  • Solaris cluster 3.2 Sparc

    Hi folks
    First things first. I may not have great knowledge about Solaris clusters, so please be merciful :)
    Here it is what I have:
    - 2 x Netra T1 AC200 each with 1GB Ram, 2x18GB disks, 500 MHZ Sparc Cpu, 4 port ethernet card
    - 1 array netra d130 3x36 GB
    -- cable et all, switches , you name it
    So, I set up the OS, all ok. I set up the cluster, all SEEMS to be ok.
    But when I define my resources and stuff like that all goes fine, except when I try top bring the resource group on line.
    On another configuration I teste the shared logical hostname and works fine.
    Group Name Resources
    Resources: ingresc nodec ingresr
    -- Resource Groups --
    Group Name Node Name State Suspended
    Group: ingresc node2 Unmanaged No
    Group: ingresc node1 Unmanaged No
    -- Resources --
    Resource Name Node Name State Status Message
    Resource: nodec node2 Offline Offline
    Resource: nodec node1 Offline Offline
    Resource: ingresr node2 Offline Offline
    Resource: ingresr node1 Offline Offline
    scswitch: (C969069) Request failed because resource group ingresc is in ERROR_STOP_FAILED state and requires operator attention
    Now, in /var/adm/messsages I spotted this :
    Mar 6 17:09:03 node2 Cluster.RGM.rgmd: [ID 224900 daemon.notice] launching method <hafoip_stop> for resource <nodec>, resource group <IngresNCG>, node <node2>, timeout <300> seconds
    Mar 6 17:09:03 node2 Cluster.RGM.rgmd: [ID 510020 daemon.notice] 46 fe_rpc_command: cmd_type(enum):<1>:cmd=</usr/cluster/lib/rgm/rt/hafoip/hafoip_stop>:tag=<IngresNCG.nodec.1>: Calling security_clnt_connect(..., host=<node2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
    A little bit of research points in the direction of a bug (see CR 6565601)
    Here it is what I see as my options:
    1 - reinstall Solaris OS, but not the Solaris Cluster 3.2, instead using Solaris Express 10/07 or 2/08. But will this combination work ? Or will it work only in the combination Solaris Cluster Express and Solaris Express Developer Edition ? If the later, which versions will work together ?
    2 - Beg for a Solaris Cluster 3.2 patch, although in my humble opinion, this should be free since it looks to me that once you write your own stuff, you run in the bug, and after all it is education
    Any ideas, help, greatly appreciated
    Many thanks
    Armand

    Although names are different since I used two setups, this is the relevant part of /var/adm/messages.
    It looks to me Ingres resource is failing:
    Mar  6 17:08:03 node2 Cluster.RGM.rgmd: [ID 224900 daemon.notice] launching method <hafoip_prenet_start> for resource <nodec>, resource group <IngresNCG>, node <node2>, timeout <300> seconds
    Mar  6 17:08:03 node2 Cluster.RGM.rgmd: [ID 510020 daemon.notice] 46 fe_rpc_command: cmd_type(enum):<1>:cmd=</usr/cluster/lib/rgm/rt/hafoip/hafoip_prenet_start>:tag=<IngresNCG.nodec.10>: Calling security_clnt_connect(..., host=<node2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
    Mar  6 17:08:05 node2 svc.startd[8]: [ID 652011 daemon.warning] svc:/system/cluster/scsymon-srv:default: Method "/usr/cluster/lib/svc/method/svc_scsymon_srv start" failed with exit status 96.
    Mar  6 17:08:05 node2 svc.startd[8]: [ID 748625 daemon.error] system/cluster/scsymon-srv:default misconfigured: transitioned to maintenance (see 'svcs -xv' for details)
    Mar  6 17:08:09 node2 Cluster.RGM.rgmd: [ID 515159 daemon.notice] method <hafoip_prenet_start> completed successfully for resource <nodec>, resource group <IngresNCG>, node <node2>, time used: 1% of timeout <300 seconds>
    Mar  6 17:08:09 node2 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource nodec state on node node2 change to R_PRENET_STARTED
    Mar  6 17:08:09 node2 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource nodec state on node node2 change to R_STARTING
    Mar  6 17:08:09 node2 Cluster.RGM.rgmd: [ID 224900 daemon.notice] launching method <hafoip_start> for resource <nodec>, resource group <IngresNCG>, node <node2>, timeout <500> seconds
    Mar  6 17:08:09 node2 Cluster.RGM.rgmd: [ID 510020 daemon.notice] 46 fe_rpc_command: cmd_type(enum):<1>:cmd=</usr/cluster/lib/rgm/rt/hafoip/hafoip_start>:tag=<IngresNCG.nodec.0>: Calling security_clnt_connect(..., host=<node2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
    Mar  6 17:08:11 node2 Cluster.RGM.rgmd: [ID 784560 daemon.notice] resource nodec status on node node2 change to R_FM_ONLINE
    Mar  6 17:08:11 node2 Cluster.RGM.rgmd: [ID 922363 daemon.notice] resource nodec status msg on node node2 change to <LogicalHostname online.>
    Mar  6 17:08:11 node2 Cluster.RGM.rgmd: [ID 515159 daemon.notice] method <hafoip_start> completed successfully for resource <nodec>, resource group <IngresNCG>, node <node2>, time used: 0% of timeout <500 seconds>
    Mar  6 17:08:11 node2 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource nodec state on node node2 change to R_JUST_STARTED
    Mar  6 17:08:11 node2 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource nodec state on node node2 change to R_ONLINE_UNMON
    Mar  6 17:08:11 node2 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource IngresNCR state on node node2 change to R_STARTING
    Mar  6 17:08:11 node2 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource nodec state on node node2 change to R_MON_STARTING
    Mar  6 17:08:11 node2 Cluster.RGM.rgmd: [ID 784560 daemon.notice] resource IngresNCR status on node node2 change to R_FM_UNKNOWN
    Mar  6 17:08:11 node2 Cluster.RGM.rgmd: [ID 922363 daemon.notice] resource IngresNCR status msg on node node2 change to <Starting>
    Mar  6 17:08:11 node2 Cluster.RGM.rgmd: [ID 224900 daemon.notice] launching method <bin/ingres_server_start> for resource <IngresNCR>, resource group <IngresNCG>, node <node2>, timeout <300> seconds
    Mar  6 17:08:11 node2 Cluster.RGM.rgmd: [ID 224900 daemon.notice] launching method <hafoip_monitor_start> for resource <nodec>, resource group <IngresNCG>, node <node2>, timeout <300> seconds
    Mar  6 17:08:11 node2 Cluster.RGM.rgmd: [ID 510020 daemon.notice] 46 fe_rpc_command: cmd_type(enum):<1>:cmd=</global/disk2s0/ing_nc_1/ingresclu/bin/ingres_server_start>:tag=<IngresNCG.IngresNCR.0>: Calling security_clnt_connect(..., host=<node2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
    Mar  6 17:08:11 node2 Cluster.RGM.rgmd: [ID 268902 daemon.notice] 45 fe_rpc_command: cmd_type(enum):<1>:cmd=</usr/cluster/lib/rgm/rt/hafoip/hafoip_monitor_start>:tag=<IngresNCG.nodec.7>: Calling security_clnt_connect(..., host=<node2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
    Mar  6 17:08:12 node2 Cluster.RGM.rgmd: [ID 515159 daemon.notice] method <hafoip_monitor_start> completed successfully for resource <nodec>, resource group <IngresNCG>, node <node2>, time used: 0% of timeout <300 seconds>
    Mar  6 17:08:12 node2 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource nodec state on node node2 change to R_ONLINE
    Mar  6 17:08:13 node2 Cluster.RGM.rgmd: [ID 922363 daemon.notice] resource IngresNCR status msg on node node2 change to <Bringing Ingres DBMS server online.>
    Mar  6 17:08:30 node2 sendmail[534]: [ID 702911 mail.alert] unable to qualify my own domain name (node2) -- using short name
    Mar  6 17:08:30 node2 sendmail[535]: [ID 702911 mail.alert] unable to qualify my own domain name (node2) -- using short name
    Mar  6 17:08:31 node2 Cluster.RGM.rgmd: [ID 922363 daemon.notice] resource IngresNCR status msg on node node2 change to <Bringing Ingres DBMS server offline.>
    Mar  6 17:08:45 node2 SC[Ingres.ingres_server,IngresNCG,IngresNCR,stop]: [ID 147958 daemon.error] ERROR : HA-Ingres failed to stop.
    Mar  6 17:08:46 node2 Cluster.RGM.rgmd: [ID 784560 daemon.notice] resource IngresNCR status on node node2 change to R_FM_FAULTED
    Mar  6 17:08:46 node2 Cluster.RGM.rgmd: [ID 922363 daemon.notice] resource IngresNCR status msg on node node2 change to <Ingres DBMS server faulted.>
    Mar  6 17:08:46 node2 SC[Ingres.ingres_server,IngresNCG,IngresNCR,start]: [ID 335575 daemon.error] ERROR : Stop method failed for the HA-Ingres data service.
    Mar  6 17:08:46 node2 Cluster.RGM.rgmd: [ID 938318 daemon.error] Method <bin/ingres_server_start> failed on resource <IngresNCR> in resource group <IngresNCG> [exit code <1>, time used: 11% of timeout <300 seconds>]
    Mar  6 17:08:46 node2 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource IngresNCR state on node node2 change to R_START_FAILED
    Mar  6 17:08:46 node2 Cluster.RGM.rgmd: [ID 529407 daemon.notice] resource group IngresNCG state on node node2 change to RG_PENDING_OFF_START_FAILED
    Mar  6 17:08:46 node2 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource IngresNCR state on node node2 change to R_STOPPING
    Mar  6 17:08:46 node2 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource nodec state on node node2 change to R_MON_STOPPING
    Mar  6 17:08:46 node2 Cluster.RGM.rgmd: [ID 784560 daemon.notice] resource IngresNCR status on node node2 change to R_FM_UNKNOWN
    Mar  6 17:08:46 node2 Cluster.RGM.rgmd: [ID 922363 daemon.notice] resource IngresNCR status msg on node node2 change to <Stopping>
    Mar  6 17:08:46 node2 Cluster.RGM.rgmd: [ID 224900 daemon.notice] launching method <bin/ingres_server_stop> for resource <IngresNCR>, resource group <IngresNCG>, node <node2>, timeout <300> seconds
    Mar  6 17:08:46 node2 Cluster.RGM.rgmd: [ID 224900 daemon.notice] launching method <hafoip_monitor_stop> for resource <nodec>, resource group <IngresNCG>, node <node2>, timeout <300> seconds
    Mar  6 17:08:46 node2 Cluster.RGM.rgmd: [ID 510020 daemon.notice] 46 fe_rpc_command: cmd_type(enum):<1>:cmd=</global/disk2s0/ing_nc_1/ingresclu/bin/ingres_server_stop>:tag=<IngresNCG.IngresNCR.1>: Calling security_clnt_connect(..., host=<node2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
    Mar  6 17:08:46 node2 Cluster.RGM.rgmd: [ID 268902 daemon.notice] 45 fe_rpc_command: cmd_type(enum):<1>:cmd=</usr/cluster/lib/rgm/rt/hafoip/hafoip_monitor_stop>:tag=<IngresNCG.nodec.8>: Calling security_clnt_connect(..., host=<node2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
    Mar  6 17:08:47 node2 Cluster.RGM.rgmd: [ID 922363 daemon.notice] resource IngresNCR status msg on node node2 change to <Bringing Ingres DBMS server offline.>
    Mar  6 17:08:48 node2 Cluster.RGM.rgmd: [ID 515159 daemon.notice] method <hafoip_monitor_stop> completed successfully for resource <nodec>, resource group <IngresNCG>, node <node2>, time used: 0% of timeout <300 seconds>
    Mar  6 17:08:48 node2 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource nodec state on node node2 change to R_ONLINE_UNMON
    Mar  6 17:09:00 node2 SC[Ingres.ingres_server,IngresNCG,IngresNCR,stop]: [ID 147958 daemon.error] ERROR : HA-Ingres failed to stop.
    Mar  6 17:09:02 node2 Cluster.RGM.rgmd: [ID 784560 daemon.notice] resource IngresNCR status on node node2 change to R_FM_FAULTED
    Mar  6 17:09:02 node2 Cluster.RGM.rgmd: [ID 922363 daemon.notice] resource IngresNCR status msg on node node2 change to <Ingres DBMS server faulted.>
    Mar  6 17:09:03 node2 Cluster.RGM.rgmd: [ID 938318 daemon.error] Method <bin/ingres_server_stop> failed on resource <IngresNCR> in resource group <IngresNCG> [exit code <2>, time used: 5% of timeout <300 seconds>]
    Mar  6 17:09:03 node2 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource IngresNCR state on node node2 change to R_STOP_FAILED
    Mar  6 17:09:03 node2 Cluster.RGM.rgmd: [ID 529407 daemon.notice] resource group IngresNCG state on node node2 change to RG_PENDING_OFF_STOP_FAILED
    Mar  6 17:09:03 node2 Cluster.RGM.rgmd: [ID 424774 daemon.error] Resource group <IngresNCG> requires operator attention due to STOP failure
    Mar  6 17:09:03 node2 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource nodec state on node node2 change to R_STOPPING
    Mar  6 17:09:03 node2 Cluster.RGM.rgmd: [ID 784560 daemon.notice] resource nodec status on node node2 change to R_FM_UNKNOWN
    Mar  6 17:09:03 node2 Cluster.RGM.rgmd: [ID 922363 daemon.notice] resource nodec status msg on node node2 change to <Stopping>
    Mar  6 17:09:03 node2 Cluster.RGM.rgmd: [ID 224900 daemon.notice] launching method <hafoip_stop> for resource <nodec>, resource group <IngresNCG>, node <node2>, timeout <300> seconds
    Mar  6 17:09:03 node2 Cluster.RGM.rgmd: [ID 510020 daemon.notice] 46 fe_rpc_command: cmd_type(enum):<1>:cmd=</usr/cluster/lib/rgm/rt/hafoip/hafoip_stop>:tag=<IngresNCG.nodec.1>: Calling security_clnt_connect(..., host=<node2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
    Mar  6 17:09:04 node2 ip: [ID 678092 kern.notice] TCP_IOC_ABORT_CONN: local = 192.168.005.085:0, remote = 000.000.000.000:0, start = -2, end = 6
    Mar  6 17:09:04 node2 ip: [ID 302654 kern.notice] TCP_IOC_ABORT_CONN: aborted 0 connection
    Mar  6 17:09:04 node2 Cluster.RGM.rgmd: [ID 784560 daemon.notice] resource nodec status on node node2 change to R_FM_OFFLINE
    Mar  6 17:09:04 node2 Cluster.RGM.rgmd: [ID 922363 daemon.notice] resource nodec status msg on node node2 change to <LogicalHostname offline.>
    Mar  6 17:09:04 node2 Cluster.RGM.rgmd: [ID 515159 daemon.notice] method <hafoip_stop> completed successfully for resource <nodec>, resource group <IngresNCG>, node <node2>, time used: 0% of timeout <300 seconds>
    Mar  6 17:09:04 node2 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource nodec state on node node2 change to R_OFFLINE
    Mar  6 17:09:04 node2 Cluster.RGM.rgmd: [ID 529407 daemon.notice] resource group IngresNCG state on node node2 change to RG_ERROR_STOP_FAILED
    Mar  6 17:09:04 node2 Cluster.RGM.rgmd: [ID 424774 daemon.error] Resource group <IngresNCG> requires operator attention due to STOP failure
    Mar  6 17:09:04 node2 Cluster.RGM.rgmd: [ID 663692 daemon.error] failback attempt failed on resource group <IngresNCG> with error <resource group in ERROR_STOP_FAILED state requires operator attention>
    Mar  6 17:09:10 node2 java[1652]: [ID 807473 user.error] pkcs11_softtoken: Keystore version failure.Thank you
    Armand

  • DAA Installation in Solaris cluster Environment

    Hello,
    We are integrate our PRD system to SOlution Manager 7.1 .
    Our System is in Solaris Cluster where in DB run on one node and SAP on another.
    Right now I have installed DAA in Node A where SAP is installed.  In managed system setup in solution manager I can see both the nodes and I can assign the DAA for SAP ABAP system .
    My doubt is do I have to install DAA again in node B and do the same manged system steps for Node B in Solman.
    I am not able to find any relevant blog or notes related to this scenario.
    It would be great if someone can guide me through the process.
    Thanks
    Raghu

    Hi ,
    Ok , so as Divyanshu said I can just continue installation on Node B with same DAA SID.
    But If I going to install as Amar said I have to uninstall the current installation and start a fresh installation on both Node without mentioning Logical Hostname.
    Amar,
    I have already gone through that link but now the problem is I have to convince the client to uninstall the DAA which is already installed . To get permission and to redo activities its bit complicated.
    So I probably go with Divyanshu's method . Install DAA in Node B with same SID and continue the same configuration in Solman.
    I will update the result once I finish with that.
    Thank you
    Regards
    Raghu

  • Configure Solaris cluster to failover guest domain when NICs were down

    Hi,
    I am running Solaris 11 as the control domains on 2 clustered nodes running on Solaris Cluster 4. There is a Solaris 10 guest domain which is managed via the Solaris cluster in failover mode.
    2 virtual switches connected to 2 different network switches are presented to the guest domain. I would like to use link based IPMP to facilitate HA for the network connections. I understand that in this case the IPMP can only be configured within the guest domain. Now the question is how do I configure it in such a way that the guest domain fails over to the second cluster node (standby control domain) if both network interfaces are down? Thanks.
    Edited by: user12925046 on Dec 25, 2012 9:48 PM
    Edited by: user12925046 on Dec 25, 2012 9:49 PM

    The Solaris Cluster 4.1 Installation and Concepts Guide are available at :-
    http://docs.oracle.com/cd/E29086_01/index.html
    Thanks.

  • Solaris Cluster 3.3u2 configuration - Solaris Ldoms

    Hi,
    I've 2 sun blade T5-1B and I create 4 ldoms on each blade, also I install the Solaris Cluster 3.3u2 and its all done successfully, I did several time resource fail-over with nodes, everything seems working perfect for me as normal behavior of expected but I've an issue to test public transport testing, hereby I remove the both physical public transport cable form node1 but its didn't fail-over on node2, little confuse as its should be fail-over to node2, but not. Please let me know if where missed something to get proper fail-over?
    Please advise.
    regards,   

    Hi M10vir,
    this means you are running SC3.3u2 in guest domains and pulling the cables from the primary domain?
    If so, this would mean that the IPMP group used in the guest domain need to get a link down that Solaris Cluster do a failover when logical hostname resources are configured. How are the network interfaces configured and what is the status of the network interfaces when the cables are removed? Furthermore does the domains recognize that the network is gone and the relevant messages are in /var/adm/messages?
    Thanks,
      Juergen

  • Grid installation: root.sh failed on the first node on Solaris cluster 4.1

    Hi all,
    I'm trying to install the Grid (11.2.0.3.0) on the 2 node-clusters (OSC 4.1).
    When I run the root.sh on the first node, I got the out put as follow:
    xha239080-root-5.11# root.sh
    Performing root user operation for Oracle 11g
    The following environment variables are set as:
    ORACLE_OWNER= oracle
    ORACLE_HOME= /Grid/CRShome
    Enter the full pathname of the local bin directory: [/usr/local/bin]:
    /usr/local/bin is read only. Continue without copy (y/n) or retry (r)? [y]:
    Warning: /usr/local/bin is read only. No files will be copied.
    Creating /var/opt/oracle/oratab file...
    Entries will be added to the /var/opt/oracle/oratab file as needed by
    Database Configuration Assistant when a database is created
    Finished running generic part of root script.
    Now product-specific root actions will be performed.
    Using configuration parameter file: /Grid/CRShome/crs/install/crsconfig_params
    Creating trace directory
    User ignored Prerequisites during installation
    OLR initialization - successful
    root wallet
    root wallet cert
    root cert export
    peer wallet
    profile reader wallet
    pa wallet
    peer wallet keys
    pa wallet keys
    peer cert request
    pa cert request
    peer cert
    pa cert
    peer root cert TP
    profile reader root cert TP
    pa root cert TP
    peer pa cert TP
    pa peer cert TP
    profile reader pa cert TP
    profile reader peer cert TP
    peer user cert
    pa user cert
    Adding Clusterware entries to inittab
    CRS-2672: Attempting to start 'ora.mdnsd' on 'xha239080'
    CRS-2676: Start of 'ora.mdnsd' on 'xha239080' succeeded
    CRS-2672: Attempting to start 'ora.gpnpd' on 'xha239080'
    CRS-2676: Start of 'ora.gpnpd' on 'xha239080' succeeded
    CRS-2672: Attempting to start 'ora.cssdmonitor' on 'xha239080'
    CRS-2672: Attempting to start 'ora.gipcd' on 'xha239080'
    CRS-2676: Start of 'ora.cssdmonitor' on 'xha239080' succeeded
    CRS-2676: Start of 'ora.gipcd' on 'xha239080' succeeded
    CRS-2672: Attempting to start 'ora.cssd' on 'xha239080'
    CRS-2672: Attempting to start 'ora.diskmon' on 'xha239080'
    CRS-2676: Start of 'ora.diskmon' on 'xha239080' succeeded
    CRS-2676: Start of 'ora.cssd' on 'xha239080' succeeded
    ASM created and started successfully.
    Disk Group DATA created successfully.
    clscfg: -install mode specified
    Successfully accumulated necessary OCR keys.
    Creating OCR keys for user 'root', privgrp 'root'..
    Operation successful.
    CRS-4256: Updating the profile
    Successful addition of voting disk 9cdb938773bc4f16bf332edac499fd06.
    Successful addition of voting disk 842907db11f74f59bf65247138d6e8f5.
    Successful addition of voting disk 748852d2a5c84f72bfcd50d60f65654d.
    Successfully replaced voting disk group with +DATA.
    CRS-4256: Updating the profile
    CRS-4266: Voting file(s) successfully replaced
    ## STATE File Universal Id File Name Disk group
    1. ONLINE 9cdb938773bc4f16bf332edac499fd06 (/dev/did/rdsk/d10s6) [DATA]
    2. ONLINE 842907db11f74f59bf65247138d6e8f5 (/dev/did/rdsk/d8s6) [DATA]
    3. ONLINE 748852d2a5c84f72bfcd50d60f65654d (/dev/did/rdsk/d9s6) [DATA]
    Located 3 voting disk(s).
    Start of resource "ora.cssd" failed
    CRS-2672: Attempting to start 'ora.cssdmonitor' on 'xha239080'
    CRS-2672: Attempting to start 'ora.gipcd' on 'xha239080'
    CRS-2676: Start of 'ora.cssdmonitor' on 'xha239080' succeeded
    CRS-2676: Start of 'ora.gipcd' on 'xha239080' succeeded
    CRS-2672: Attempting to start 'ora.cssd' on 'xha239080'
    CRS-2672: Attempting to start 'ora.diskmon' on 'xha239080'
    CRS-2676: Start of 'ora.diskmon' on 'xha239080' succeeded
    CRS-2674: Start of 'ora.cssd' on 'xha239080' failed
    CRS-2679: Attempting to clean 'ora.cssd' on 'xha239080'
    CRS-2681: Clean of 'ora.cssd' on 'xha239080' succeeded
    CRS-2673: Attempting to stop 'ora.gipcd' on 'xha239080'
    CRS-2677: Stop of 'ora.gipcd' on 'xha239080' succeeded
    CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'xha239080'
    CRS-2677: Stop of 'ora.cssdmonitor' on 'xha239080' succeeded
    CRS-5804: Communication error with agent process
    CRS-4000: Command Start failed, or completed with errors.
    Failed to start Oracle Grid Infrastructure stack
    Failed to start Cluster Synchorinisation Service in clustered mode at /Grid/CRShome/crs/install/crsconfig_lib.pm line 1211.
    /Grid/CRShome/perl/bin/perl -I/Grid/CRShome/perl/lib -I/Grid/CRShome/crs/install /Grid/CRShome/crs/install/rootcrs.pl execution failed
    xha239080-root-5.11# history
    checking the ocssd.log, I see some thing as follow:
    2013-09-16 18:46:24.238: [    CSSD][1]clssscmain: Starting CSS daemon, version 11.2.0.3.0, in (clustered) mode with uniqueness value 1379371584
    2013-09-16 18:46:24.239: [    CSSD][1]clssscmain: Environment is production
    2013-09-16 18:46:24.239: [    CSSD][1]clssscmain: Core file size limit extended
    2013-09-16 18:46:24.248: [    CSSD][1]clssscmain: GIPCHA down 1
    2013-09-16 18:46:24.249: [    CSSD][1]clssscGetParameterOLR: OLR fetch for parameter logsize (8) failed with rc 21
    2013-09-16 18:46:24.250: [    CSSD][1]clssscExtendLimits: The current soft limit for file descriptors is 65536, hard limit is 65536
    2013-09-16 18:46:24.250: [    CSSD][1]clssscExtendLimits: The current soft limit for locked memory is 4294967293, hard limit is 4294967293
    2013-09-16 18:46:24.250: [    CSSD][1]clssscGetParameterOLR: OLR fetch for parameter priority (15) failed with rc 21
    2013-09-16 18:46:24.250: [    CSSD][1]clssscSetPrivEnv: Setting priority to 4
    2013-09-16 18:46:24.253: [    CSSD][1]clssscSetPrivEnv: unable to set priority to 4
    2013-09-16 18:46:24.253: [    CSSD][1]SLOS: cat=-2, opn=scls_mem_lockdown, dep=11, loc=mlockall
    unable to lock memory
    2013-09-16 18:46:24.253: [    CSSD][1](:CSSSC00011:)clssscExit: A fatal error occurred during initialization
    Do anyone have any idea what going on and how can I fix it ?

    Hi,
    solaris has several issues with DISM, e.g.:
    Solaris 10 and Solaris 11 Shared Memory Locking May Fail (Doc ID 1590151.1)
    Sounds like Solaris Cluster  has a similar bug. A "workaround" is to reboot the (cluster) zone, that "fixes" the mlock error. This bug was introduced with updates in september, atleast to our environment (Solaris 11.1). Prior i did not have the issue and now i have to restart the entire zone, whenever i stop crs.
    With 11.2.0.3 the root.sh script can be rerun without prior cleaning up, so you should be able to continue installation at that point after the reboot. After the root.sh completes some configuration assistants need to be run, to complete the installation. You need to execute this manually as you wipe your oui session
    Kind Regards
    Thomas

  • Common Agent Container Problem on New Solaris Cluster 3.2 (1/09)

    Hi,
    I have just installed Solaris Cluster 3.2 u2. Getting following error when run cluster check or sccheck -v 2:
    cluster check
    cacaocsc: unable to connect: Connection refused by peer
    cluster check: (C704199) unable to reach Common Agent Container
    sccheck -v 2
    sccheck: Requesting explorer data and node report from node1.
    sccheck: Requesting explorer data and node report from node1-cl.
    sccheck: node1: Additional explorer arguments: -w !default,cluster,disks,etc,messages,nbu,netinfo,patch,pkg,sds,lvm,sonoma,sysconfig,var,vxvm,vxfs,vxfsextended -c "/usr/cluster/lib/sccheck/vfstab-global-mount-points" -c "/usr/cluster/lib/sccheck/netapp-nas-quorum-devices"
    sccheck: node1: WARNING: EXP_CONTRACT_ID not set!
    sccheck: node1: WARNING: EXP_REPLY not set!
    sccheck: node1:
    sccheck: node1: 2 warnings found in /etc/opt/SUNWexplo/default/explorer
    sccheck: node1:
    sccheck: node1: Mar 07 21:56:15 node1[5519] explorer: ERROR explorer
    sccheck: node1: Mar 07 21:56:15 node1[5519] explorer: ERROR Module or alias sds does not exist.
    sccheck: node1: Explorer run failed:
    sccheck: node1-cl: Additional explorer arguments: -w !default,cluster,disks,etc,messages,nbu,netinfo,patch,pkg,sds,lvm,sonoma,sysconfig,var,vxvm,vxfs,vxfsextended -c "/usr/cluster/lib/sccheck/vfstab-global-mount-points" -c "/usr/cluster/lib/sccheck/netapp-nas-quorum-devices"
    sccheck: node1-cl: WARNING: EXP_CONTRACT_ID not set!
    sccheck: node1-cl: WARNING: EXP_REPLY not set!
    sccheck: node1-cl:
    sccheck: node1-cl: 2 warnings found in /etc/opt/SUNWexplo/default/explorer
    sccheck: node1-cl:
    sccheck: node1-cl: Mar 07 21:56:15 node1-cl[3851] explorer: ERROR explorer
    sccheck: node1-cl: Mar 07 21:56:15 node1-cl[3851] explorer: ERROR Module or alias sds does not exist.
    sccheck: node1-cl: Explorer run failed:
    sccheck: node1 error: Unexpected early return from server.
    sccheck: node1-cl error: Unexpected early return from server.
    sccheck: Unable to run checks on: node1,node1-cl
    Even when I try to run Cluster Manager from web I am getting following error:
    "A communication problem was encountered by the system"
    Following is the version I am using:
    root@node1 #
    root@node1 # cacaoadm -V
    2.2.0.1
    root@node1 #
    root@node1 # smcwebserver -V
    Version 3.1
    root@node1 #
    root@node1 #
    root@node1 # svcs -a | grep container
    online 21:30:30 svc:/application/management/common-agent-container-1:default
    root@node1 #
    root@node1 #
    root@node1 # svcs -a | grep webconsole
    online 21:20:29 svc:/system/webconsole:console
    root@node1 #
    root@node1 #
    root@node1 #
    root@node1 # cat /etc/release
    Solaris 10 5/08 s10s_u5wos_10 SPARC
    Copyright 2008 Sun Microsystems, Inc. All Rights Reserved.
    Use is subject to license terms.
    Assembled 24 March 2008
    root@node1 #
    root@node1 #
    root@node1 # cat /etc/cluster/release
    Sun Cluster 3.2u2 for Solaris 10 sparc
    Copyright 2008 Sun Microsystems, Inc. All Rights Reserved.
    root@node1 #
    root@node1 #
    Do you have any idea or tips to solve this problem?
    Thanks
    Edited by: shmozumder on Mar 8, 2009 1:50 AM

    Hi Tim,
    Yes - it's running:
    default instance is ENABLED at system startup.
    Smf monitoring process:
    29410
    29411
    Uptime: 0 day(s), 0:14
    dssdbgen03p1 # svcs -a |grep -i comm
    disabled 17:03:06 svc:/network/rpc/mdcomm:default
    online 13:20:53 svc:/application/management/common-agent-container-2:default
    uninitialized 16:58:32 svc:/application/management/common-agent-container-1:default
    I also get messages like the following in the /var/adm/messages file when starting cacaoadm :
    Jul 7 13:20:52 dssdbgen03p1 java.lang.ClassNotFoundException: Cannot find class com.sun.cacao.rmi.impl.RMIModule in module com.sun.cacao.rmi
    Jul 7 13:20:52 dssdbgen03p1 java.lang.ClassNotFoundException: Cannot find class com.sun.cacao.invoker.impl.InvokerModule in module com.sun.cacao.invoker
    Jul 7 13:20:52 dssdbgen03p1 java.lang.ClassNotFoundException: Cannot find class com.sun.cacao.snmpv3adaptor.SnmpV3AdaptorModule in module com.sun.cacao.snmpv3_adaptor
    Jul 7 13:20:52 dssdbgen03p1 java.lang.ClassNotFoundException: Cannot find class com.sun.cacao.dtrace.impl.DTraceModule in module com.sun.cacao.dtrace
    Jul 7 13:20:52 dssdbgen03p1 java.lang.ClassNotFoundException: Cannot find class com.sun.cacao.rbac.impl.RbacModule in module com.sun.cacao.rbac
    Jul 7 13:20:52 dssdbgen03p1 java.lang.ClassNotFoundException: Cannot find class com.sun.cacao.instrum.impl.InstrumModule in module com.sun.cacao.instrum
    Jul 7 13:20:52 dssdbgen03p1 java.lang.ClassNotFoundException: Cannot find class com.sun.cacao.commandstream.CommandStreamAdaptorModule in module com.sun.cacao.command_stream_adaptor

  • ColdFusion 10 - MySQL Communications link failure

    We are migrating some websites onto a cloud infrastructure running Windows 2008 virtual machines.  These websites all run on ColdFusion with MySQL databases.  They currently are running in our CoLo with no problems.  Additionally, they are running on our development network in our offices with no problems.
    We are setting up our cloud to match as closely as possible the configuration we currently use which is, essentially, CF10 + IIS on one server and MySQL on a separate machine.  We are 99% finished and most things are running great.  However....
    We have run into a couple, as in 2, places where we click a link/button and are greeted with:
    Error Executing Database Query.
    Communications link failure The last packet successfully received from the server was 0 milliseconds ago. The last packet sent successfully to the server was 0 milliseconds ago.
    Scanning the stack-trace I also find:
    `Caused by: java.net.SocketException: Connection reset`
    The communications link error is ALWAYS: 0ms.
    What's most puzzling is the Queries that seem to be causing this are *simple* queries that are used ALL OVER the sites with no problems.  Why they are failing at hese 2 particular places has us at wits end.
    Our only clue is, looking at the CF Error description of what scripts are called, we can see the script where the query is failing is getting called twice?  For example, one of the occurences is in our Application file:
        >The error occurred in D:/Our_Web_Sites/oursite/Application.cfm: line 73
        >Called from D:/Our_Web_Sites/oursite/Application.cfm: line 17
        >Called from D:/Our_Web_Sites/oursite/Application.cfm: line 1
        >Called from D:/Our_Web_Sites/oursite/Application.cfm: line 73
        >Called from D:/Our_Web_Sites/oursite/Application.cfm: line 17
        >Called from D:/Our_Web_Sites/oursite/Application.cfm: line 1
    We can find nothing in our CF code that would be causing the script to be called twice so our guess is the first call is failing on the Query so CF tries again...only to fail and error.
    Googling this issue I've found lots of posts about changing the MySQL timeouts.  None of those worked and I didn't expect them to since what we're dealing with doesn't appear to be a timeout issue.  These pages fail each and every time.
    The closest we've come to a solution came from this blog posting:
    http://www.talkingtree.com/blog/index.cfm/2011/1/12/Validation-Query-for-MySQL-communicati ons-link-failure
    If we UNCHECK the "Maintain connections across client requests. " setting in CFAdmin then the error goes away.  The blog suggests leaving that checked, which is our preference, and using Connection Validation of "SELECT 1;".  Try that...same error.
    We've also tried the JDBC AutoConnect=true option.  No effect.
    Downloaded latest JDBC Connector and used it instead of standard CF10-MySQL connector.  No effect.
    Again, 99% of the site works with the exception of these two links, both of which work just fine in all our other environments.  Any other ideas?

    I have some questions.
    1) You talk of 2 calls. However, you show us 5 calls from Application.cfm. What do those line numbers tell us?
    2) Shouldn't that be AutoReconnect=true rather than AutoConnect=true? In any case, it seems that the problem here is that ColdFusion attempts to use a stale connection to the database. It is then usually preferable to change wait_timeout instead. The MySQL documentation says, 'Alternatively[to AutoReconnect=true], as a last option, investigate setting the MySQL server variable "wait_timeout" to a high value, rather than the default of 8 hours.'.
    3) Shouldn't you already convert to Application.cfc?

  • Communcation link failure with HS ODBC to AS/400  (v4r3)

    HI all,
    i have problem when using Heterogeneous Service to connect Oralce to DB2/400 in AS/400 (v4r3).
    I am using HS's ODBC to connect to DB2/400 in AS/400.
    1) first I created a DSN with name AS400 and driver "Client Access ODBC Driver (32-bit)"
    2) then configured HS in %ORACLE_HOME%/hs/admin/initodbc.ora
    3) modified listener.ora as well as tnsnames.ora
    4) restarted the listener and Oracle.
    5) then I created the database link for this service.
    create database link as400
    connect to "USER" identified by "PASSWRD"
    using 'as400';but when I tried to select something from DB2/400. i encountered the following error
    select count(*) from grw99.sitsupp@as400ORA-28500: connection from ORACLE to a non-Oracle system returned this message:
    [Generic Connectivity Using ODBC][IBM][Client Access ODBC Driver (32-bit)]
    [DB2/400 SQL]Communication link failure. COMM RC=0x5 (SQL State: 08S01; SQL Code: -5)
    ORA-02063: preceding 2 lines from HSODBC
    can anyone tell me why?!!
    pls help me!
    Marco

    Sounds like the IBM SQLExtendedFetch bug within the IBM ODBC driver.
    Please add these parameters to your DG4ODBC configuration file initDB2_SID.ora:
    HS_FDS_FETCH_ROWS=1
    HS_RPC_FETCH_REBLOCKING=off
    Then open a new SQL*Plus session and retry the select.

  • Oracle ASM Configuration on Solaris Cluster - Oracle 11.2.0.3

    Hi,
    I want some clarifications!
    I need to set Active and Passive Cluster settup on Solaris 10 SPARC Operating System, the HA software is Solaris Cluster and Oracle 11.2.0.3.
    1) I understand "Single instance Oracle ASM is not supported with Oracle 11g release 2" so we need to go for Clustered ASM - is it required to use RAC framework in this case?
    2) When i use the RAC framework, do i need to have license for RAC?
    Am new to Oracle, any help is appreciated.
    Regards,
    Shashank

    Hi,
    I want some clarifications!
    I need to set Active and Passive Cluster settup on Solaris 10 SPARC Operating System, the HA software is Solaris Cluster and Oracle 11.2.0.3.
    1) I understand "Single instance Oracle ASM is not supported with Oracle 11g release 2" so we need to go for Clustered ASM - is it required to use RAC framework in this case?
    2) When i use the RAC framework, do i need to have license for RAC?
    Am new to Oracle, any help is appreciated.
    Regards,
    Shashank

Maybe you are looking for

  • Early 2011 (March) MBP hard drive crashed a week ago, all data was lost. Why is the Hitachi hard drive so prone to have defective read/write heads?

    Shouldn't Apple be better quality than this? I've used my old MBP from 2006 for 5 years with no problems. I've only had to back up data maybe once a year because it was so durable. But this 2011 MBP had the hitachi 320gb SATA drive and one day it sta

  • Report - asset balances by asset class

    dear all ; when i run report asset balances by asset class (S_ALR_87011946) , it tips error message "fiscal year change not yet made for company code1000" what's wrong with it ,how can i do ? thanks.

  • Single table recovery

    Is it possible to recover a single table in oracle instead of the entire database? We lost all of our data in just one table by mistake. Please let me know..

  • Object Link MARC - Screen 211

    In SAP you can link a Document info record to object link to MARC (Material u2013 Plant) I was expecting to be able to view DIR in a plant specific view within the material master, but I canu2019t seem to find it. Does anyone know where I can see thi

  • Kuler API doesn't work with Windows 8 apps

    Hi, I've been trying to get the Kuler API to work with my app for Windows 8 (I'm using C# and XAML). For example: string feed = await client.GetStringAsync("https://kuler-api.adobe.com/rss/get.cfm?listType=" + type + "&itemsPerPage=" + items + "&time