Cluster IP address

Just in theory.
When I create a server pool with the HA option, the server pool master gets a cluster IP on Xen bridge 0 (xenbr0:0).
If the server pool master shuts down, the next available node should take over the master role and get this cluster IP. Is that correct?
Christian

Christian,
As long as the VIP for the server pool is enabled in Oracle VM Manager, the VIP/master agent role will fail over to a live node if the server holding the master agent role is knocked off the network or if its ovs-agent fails.
If you have two boxes, enable the VIP, then either a) run "service ovs-agent stop --disable-nowayout" from dom0 to force a fail-over of the VIP/master agent role, or b) unplug the master's NIC, to validate that the VIP/master role moves between the nodes.
If you have more than two nodes, ssh in to the VIP after you force the fail-over to determine which box is now the master.
Respectfully,
Roddy

Similar Messages

  • Getting Error: cluster ip address not added to tcpip properties

    I have 2 2008 R2 physical servers on the same subnet and they have been using NLB for the past 1.5 years. We had a firewall issue and I took one of the servers out of the cluster to do testing, while the other main server (priority 1) was left serving up the virtual IPs. The main server continues to work properly.
    The servers have 2 NICs, one for NLB and one just for regular traffic.  The NICs also have their own IP addresses and then there is a cluster IP and 2 virtual IPs.
    Error:
    When I try to add the second server to the cluster, I first connect to the existing cluster, which works fine. Then I choose Add Host to Cluster, type the name of the server and select the NLB NIC. It sees the other server and seems to start the process; however, soon after, the NLB NIC goes from having internet access to an "enabled" state and the gateway gets taken out of the settings. I try to add it back, but as soon as I leave the settings it disappears again. NLB Manager tells me: cluster ip address (192.#.#.#) not added to tcpip properties. It lists this error 4 times, once for each IP (2 virtual, 1 cluster, and once for the dedicated NLB NIC IP). I have also tried adding all virtual IPs to the NLB NIC's settings and still get the exact same error. Registry: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\Tcpip\Parameters\Interfaces - even the registry looks good.
    Any help would be appreciated. If I can't get a resolution, my next step is going to be to delete the NLB cluster on the main server and recreate it, but this requires downtime and I have to make sure it comes back up!

    Hi,
    Check the event log for the related event, then refer to the following article for further troubleshooting.
    Network Adapter Functionality
    http://technet.microsoft.com/en-us/library/cc726411.aspx
    More information:
    Dual-NIC NLB Configuration with Windows Server 2008 NLB Clusters
    http://blogs.technet.com/b/networking/archive/2008/11/20/balancing-act-dual-nic-configuration-with-windows-server-2008-nlb-clusters.aspx
    Hope this helps.

  • DAG 2010 Cluster IP address resource 'Cluster IP Address' cannot be brought online because the cluster network replication

    Hello,
    DAG Exchange 2010 SP3 RU6 with MAPI network and Replication network. All works correctly.
    But when a DAG member restarts, the cluster goes offline and I can't bring it online.
    The error message:
    Cluster IP address resource 'Cluster IP Address' cannot be brought online because the cluster network 'Cluster Network 1' is not configured to allow client access.
    Cluster Network 1 is the replication network, so it is normal that "allow client access" is unchecked.
    I already tried checking the option and applying, then unchecking and applying. It changes nothing.
    Could you please help me figure out the issue?
    Best regards

    Hi,
    Check the link below.
    http://forums.msexchange.org/Cluster_network_name_is_not_online/m_1800552315/tm.htm
    I was able to resolve the issue without taking down any resources.
    First, I noticed that in Failover Cluster Manager the "Cluster Name" resource had only the IP address of the replication network.
    After going back through the guide at http://technet.microsoft.com/en-us/library/dd638104.aspx, I changed the properties on the NICs for file sharing, etc. I then adjusted Windows Firewall rules to block traffic from my MAPI network destined for the replication network.
    I then removed the IP from the replication network on the DAG, leaving only the one MAPI network IP.
    After an hour or so, I ran Get-DatabaseAvailabilityGroupNetwork and saw that the MAPIAccess property was finally set to true on my MAPI network. I went back to Failover Cluster Manager, and my Cluster Core Resource Cluster Name had dropped the associated IP address (the IP from the replication network). I added a new IP from my MAPI network range, updated the DAG IP in Exchange and the DNS record for the DAG, and my cluster resource came online.

  • Cluster IP address thru L2L tunnel

    I have 3 Windows 2003 terminal servers set up for load balancing using Windows Network Load Balancing Manager. IP addresses 192.168.1.14, 192.168.1.15, 192.168.1.16; cluster IP 192.168.1.40, multicast.
    I have a remote site connected via a site-to-site VPN tunnel using Cisco ASA5510 devices, subnet 192.168.100.1. On the local LAN (192.168.1.0) I can connect to the terminal servers using the cluster IP; at the remote site I cannot. At the remote site I can connect to each TS using its actual IP address, and I can ping the cluster IP address or the DNS name and get a response. Can anybody think of any reason why I cannot connect using the cluster IP address?
    Thanks

    I have setup wireshark on my 192.168.1.0 subnet and setup a packet capture on the ASA5510. On the wireshark I see SYN packets coming in from my machine 192.168.100.102 to the cluster IP and I see SYN,ACK packets Src the cluster IP with the mac address of one of the terminal servers and the dst my IP address with the mac address of the ASA 5510. On the ASA5510 packet capture I only see the SYN packets from my machine coming in but no SYN,ACK packets going out. What happened to the SYN,ACK packets?
    I did a packet capture when connecting to the actual IP address of the terminal server (Which Works) and compared the SYN,ACK packets from both and saw no difference.

  • SSL Cert for 2008 R2 Reporting Services that is installed on a Failover Cluster - server address mismatch?

    I used the idea from http://www.mssqltips.com/sqlservertip/2778/how-to-add-reporting-services-to-an-existing-sql-server-clustered-instance/ to install 2008 R2 Reporting Services on a new clustered SQL instance. In short: create the new clustered SQL instance on Node1, installing Reporting Services with it. Then, on Node2, add a failover cluster node (without choosing Reporting Services), and follow that up by starting the SQL setup.exe from a cmd prompt with an option that bypasses a check, so that the Reporting Services feature can then be installed on Node2. The article points out using the SQL cluster network name for connecting to Reporting Services.
    I verified that after failover I could still access the Reports and ReportServer URLs. However, after adding an SSL certificate to the RS configuration, I run into the warning "mismatched address - the security certificate presented by this website was issued for a different website's address", which I can click through to get to the Reports or ReportManager URLs.
    I played with different certs (created by an internal CA), SANs and other things, but I still get this error with the cert. The Reports URL, for example, is https://<SQLClusterNetworkName>/Reports, and the cert has a CN and Friendly Name of SQLClusterNetworkName (with a SAN of DNS: SQLClusterNetworkName.<domain>), but the error still happens.
    What am I missing to eliminate the mismatched address warning when using the SQLClusterNetworkName as the base of the URLs?

    I got it working by using the FQDN as the common name on the SSL cert, with FQDN in RS URLs.
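
The fix above is consistent with how clients usually compare the host in the URL against the names in the certificate: when SAN DNS entries are present they are the names that get checked, and a short name is not the same string as an FQDN. A minimal Python sketch of that comparison follows. It is a simplified model (exact case-insensitive match plus a single leftmost wildcard label), not the logic of any particular browser, and the names `sqlclu01` and `example.com` are hypothetical placeholders for the cluster network name and domain.

```python
def hostname_matches(host, common_name, san_dns_names):
    """Simplified name check: if SAN DNS names exist, only they are
    consulted; the CN is used as a fallback when there are none."""
    candidates = san_dns_names if san_dns_names else [common_name]
    host = host.lower().rstrip(".")
    for name in candidates:
        name = name.lower().rstrip(".")
        if name == host:
            return True
        # A leading "*." is treated as matching exactly one leftmost label.
        if name.startswith("*.") and "." in host:
            if host.split(".", 1)[1] == name[2:]:
                return True
    return False


# Hypothetical names standing in for the cluster network name and its FQDN.
SHORT = "sqlclu01"
FQDN = "sqlclu01.example.com"

# URL uses the short name, cert has CN=short name and SAN=FQDN: mismatch.
print(hostname_matches(SHORT, SHORT, [FQDN]))
# URL uses the FQDN, which is what the SAN lists: match.
print(hostname_matches(FQDN, SHORT, [FQDN]))
```

Under these assumptions the short-name URL fails even though the CN carries the short name, because the SAN entry takes precedence; using the FQDN in both the certificate and the RS URLs makes the two agree.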

  • Change Cluster IP address and Hostname

    Hi pals,
    Just to confirm: if we have a cluster of PUB and SUB for Unity Connection 8.6.2a, does the procedure to change the IP address or hostname always start with the SUB and then the PUB, or with the PUB and then the SUB?
    If we start with the PUB first, I would need to install a new license file before I can start on the SUB, right?
    cheers,
    TA

    Hi TA,
    The steps differ depending on whether the server(s) are defined by IP address or hostname, but all the steps are here:
    http://www.cisco.com/en/US/docs/voice_ip_comm/connection/8x/upgrade/guide/8xcucrug050.html
    If you are talking licenses then you must be running in a virtual environment. If so, you will have a 30-day grace period to get the license updated:
    License Files and License MACs for Cisco Unity Connection Virtual Machines
    Each license file for a Cisco Unity Connection virtual machine (except for the demonstration license file) is registered to a license MAC value. This value is calculated to look like a MAC address based on the settings listed in Table 42-1, but it is not a real MAC address.
    If you change any of these settings, the existing licenses become invalid, and you must obtain replacement license files that are registered to the calculated license MAC value that is based on the new settings. The old licenses continue to work for a 30-day grace period. During the grace period, you can change the settings back to the original values to make your original licenses valid again. If you need more than 30 days of grace period, change your settings to the original values, then change them back to the new values that you want to use, and you will get another 30-day grace period.
    If you do not reset the 30-day grace period by changing settings back to the original values, then Connection stops running. If you restart the server, Connection starts running again but stops after 24 hours. Each time you restart the server, Connection runs for another 24 hours until you either change the settings back to the original values or you install licenses based on the new license MAC value.
    http://www.cisco.com/en/US/docs/voice_ip_comm/connection/8x/administration/guide/8xcucsag310.html#wp1074759
    Cheers!
    Rob

  • Managed server not able to join the cluster

    Hi
    I have two storage-enabled Coherence servers on two different machines. These two are able to form the cluster without any problem. I also have two Managed servers. When I start the first one, it joins the cluster without any issue, but when I start the second one (the fourth member), it does not join the cluster. Only one Managed server joins the cluster. I am getting the following error.
    2011-12-22 15:39:26.940/356.798 Oracle Coherence GE 3.6.0.4 <Info> (thread=[ACTIVE] ExecuteThread: '2' for queue: 'weblogic.kernel.Default (self-tuning)', member=n/a): Loaded cache configuration from "file:/u02/oracle/admin/atddomain/atdcluster/ATD/config/atd-client-cache-config.xml"
    2011-12-22 15:39:26.943/356.801 Oracle Coherence GE 3.6.0.4 <D4> (thread=[ACTIVE] ExecuteThread: '2' for queue: 'weblogic.kernel.Default (self-tuning)', member=n/a): TCMP bound to /172.23.34.91:8190 using SystemSocketProvider
    2011-12-22 15:39:57.909/387.767 Oracle Coherence GE 3.6.0.4 <Warning> (thread=Cluster, member=n/a): This Member(Id=0, Timestamp=2011-12-22 15:39:26.944, Address=172.23.34.91:8190, MachineId=39242, Location=site:dev.icd,machine:appsoad2-web2,process:24613, Role=WeblogicServer) has been attempting to join the cluster at address 231.1.1.50:7777 with TTL 4 for 30 seconds without success; this could indicate a mis-configured TTL value, or it may simply be the result of a busy cluster or active failover.
    2011-12-22 15:39:57.909/387.767 Oracle Coherence GE 3.6.0.4 <Warning> (thread=Cluster, member=n/a): Received a discovery message that indicates the presence of an existing cluster:
    Message "NewMemberAnnounceWait"
    FromMember=Member(Id=2, Timestamp=2011-12-22 15:22:56.607, Address=172.23.34.74:8090, MachineId=39242, Location=site:dev.icd,machine:appsoad4,process:23937,member:CoherenceServer2, Role=WeblogicWeblogicCacheServer)
    FromMessageId=0
    Internal=false
    MessagePartCount=1
    PendingCount=0
    MessageType=9
    ToPollId=0
    Poll=null
    Packets
    [000]=Broadcast{PacketType=0x0DDF00D2, ToId=0, FromId=2, Direction=Incoming, ReceivedMillis=15:39:57.909, MessageType=9, ServiceId=0, MessagePartCount=1, MessagePartIndex=0, Body=0}
    Service=ClusterService{Name=Cluster, State=(SERVICE_STARTED, STATE_ANNOUNCE), Id=0, Version=3.6}
    ToMemberSet=null
    NotifySent=false
    ToMember=Member(Id=0, Timestamp=2011-12-22 15:39:26.944, Address=172.23.34.91:8190, MachineId=39242, Location=site:dev.icd,machine:appsoad2-web2,process:24613, Role=WeblogicServer)
    SeniorMember=Member(Id=1, Timestamp=2011-12-22 15:22:53.032, Address=172.23.34.73:8090, MachineId=39241, Location=site:dev.icd,machine:appsoad3,process:19339,member:CoherenceServer1, Role=WeblogicWeblogicCacheServer)
    2011-12-22 15:40:02.915/392.773 Oracle Coherence GE 3.6.0.4 <Warning> (thread=Cluster, member=n/a): Received a discovery message that indicates the presence of an existing cluster:
    Message "NewMemberAnnounceWait"
    FromMember=Member(Id=2, Timestamp=2011-12-22 15:22:56.607, Address=172.23.34.74:8090, MachineId=39242, Location=site:dev.icd,machine:appsoad4,process:23937,member:CoherenceServer2, Role=WeblogicWeblogicCacheServer)
    FromMessageId=0
    Internal=false
    MessagePartCount=1
    PendingCount=0
    MessageType=9
    ToPollId=0
    Poll=null
    Packets

    Hi,
    By default Coherence uses a multicast protocol to discover other nodes when forming a cluster. Since you are having difficulty establishing a cluster via multicast, can you please perform a multicast test and see whether multicast is configured properly?
    http://wiki.tangosol.com/display/COH32UG/Multicast+Test
    I hope you are using the same configuration files across the cluster members; all members of the cluster must specify the same cluster name in order to be allowed to join the cluster.
    <cluster-name system-property="tangosol.coherence.cluster">xxx</cluster-name>
    I would suggest trying the unicast-listener with well-known-addresses instead of the multicast-listener.
    http://wiki.tangosol.com/display/COH32UG/well-known-addresses
    Add entries similar to those below to your tangosol override XML:
    <well-known-addresses>
      <socket-address id="1">
        <address>172.23.34.91</address>
        <port>8190</port>
      </socket-address>
      <socket-address id="2">
        <address>172.23.34.74</address>
        <port>8090</port>
      </socket-address>
    </well-known-addresses>
    This list is used by all other nodes to find their way into the cluster without the use of multicast, thus at least one well known node must be running for other nodes to be able to join.
    Hope this helps!!
    Thanks,
    Ashok.
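
If the member list changes often, the well-known-addresses fragment suggested above can be generated rather than hand-edited. A minimal Python sketch, assuming only the element names shown in the reply; the helper name `build_wka` is made up for illustration, and where the fragment belongs inside the override file depends on the Coherence version, so check the documentation for your release.

```python
import xml.etree.ElementTree as ET


def build_wka(members):
    """Build a <well-known-addresses> element from (address, port) pairs."""
    root = ET.Element("well-known-addresses")
    for idx, (address, port) in enumerate(members, start=1):
        sock = ET.SubElement(root, "socket-address", id=str(idx))
        ET.SubElement(sock, "address").text = address
        ET.SubElement(sock, "port").text = str(port)
    return root


# Addresses and ports taken from the reply above.
members = [("172.23.34.91", 8190), ("172.23.34.74", 8090)]
wka_xml = ET.tostring(build_wka(members), encoding="unicode")
print(wka_xml)
```

Generating the fragment avoids the stray characters and leading spaces that crept into the hand-typed version, which an XML parser or the address lookup may reject.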

  • WARNING:IP: Hardware address '02:bf:0a:d0:00:2a' trying to be our address

    Hi,
    When my system boots up the following error message is prompted:
    WARNING:IP: Hardware address '02:bf:0a:d0:00:2a' trying to be our address
    I perused the documentation, which states that this occurs when the ATM lane device is set to promiscuous mode by running snoop -d lane0. The corrective action is to not let the ATM lane device run in promiscuous mode and not to ignore the warning.
    What I want to know is how to implement the resolution?
    Thanks!

    Ok, my mistake.
    So that PC is also trying to use 192.168.0.1.
    And it must be doing ARP broadcasts on the LAN.
    Solaris should probably be ignoring them since it has a 10.x.x.x address on the LAN, so it shouldn't care about 192.168.0.x addresses on the LAN.
    Only on the crossover network.
    But apparently it cares anyway.
    So I suggest you change the cluster heartbeat addresses to something else that's less likely to be chosen by random computers on your LAN.
    Or figure out why the PC is trying to use 192.168.0.1. It could be someone running VMware or VirtualBox.
    Or VLAN your LAN, so your critical servers aren't sharing a network with random poorly configured PCs.

  • "Service Cluster left the cluster" - lost all my data

    My four storage-enabled cluster nodes lost all their cached data when all the services left the cluster in response to some issue(?). Is that the expected behavior? Is the correct procedure to transactionally store to disk so you can reload when this happens, or should this simply never happen? It seems like this should not happen. These four nodes are on the same server. At about 12:31 everything goes pear-shaped.
    2011-01-14 12:31:16.904/50004.436 Oracle Coherence GE 3.6.0.0 <Error> (thread=Cluster, member=3): This senior Member(Id=3, Timestamp=2011-01-13 22:37:52.106, Address=192.168.3.20:8088, MachineId=27412, Location=machine:amd4,process:4428,member:Administrator, Role=CoherenceServer) appears to have been disconnected from other nodes due to a long period of inactivity and the seniority has been assumed by the Member(Id=9, Timestamp=2011-01-13 22:38:01.438, Address=192.168.3.20:8094, MachineId=27412, Location=machine:amd4,process:3904,member:Administrator, Role=CoherenceServer); stopping cluster service.
    2011-01-14 12:31:16.905/50004.437 Oracle Coherence GE 3.6.0.0 <D5> (thread=Cluster, member=3): Service Cluster left the cluster
    2011-01-14 12:31:16.906/50004.438 Oracle Coherence GE 3.6.0.0 <D5> (thread=DistributedCache:DistributedStatsCacheService, member=3): Service DistributedStatsCacheService left the cluster
    2011-01-14 12:31:16.906/50004.438 Oracle Coherence GE 3.6.0.0 <D5> (thread=Proxy:ExtendTcpProxyService, member=3): Service ExtendTcpProxyService left the cluster
    2011-01-14 12:31:16.907/50004.439 Oracle Coherence GE 3.6.0.0 <D5> (thread=DistributedCache:DistributedQuotesCacheService, member=3): Service DistributedQuotesCacheService left the cluster
    2011-01-14 12:31:16.913/50004.445 Oracle Coherence GE 3.6.0.0 <D5> (thread=Invocation:Management, member=3): Service Management left the cluster
    2011-01-14 12:31:16.913/50004.445 Oracle Coherence GE 3.6.0.0 <D5> (thread=DistributedCache:DistributedOrdersService, member=3): Service DistributedOrdersService left the cluster
    2011-01-14 12:31:16.913/50004.445 Oracle Coherence GE 3.6.0.0 <D5> (thread=DistributedCache:DistributedCacheService, member=3): Service DistributedCacheService left the cluster
    2011-01-14 12:31:16.914/50004.446 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: Channel(Id=214992652, Open=false)
    2011-01-14 12:31:16.914/50004.446 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: Channel(Id=8305999, Open=false)
    2011-01-14 12:31:16.915/50004.447 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: Channel(Id=1383343339, Open=false)
    2011-01-14 12:31:16.915/50004.447 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: TcpConnection(Id=0x0000012D84061C15C0A803149CF3279B334BE6140AC76C47CA03670D76A96D22, Open=false, LocalAddress=192.168.3.20:9091, RemoteAddress=192.168.3.6:65480)
    2011-01-14 12:31:16.915/50004.447 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: Channel(Id=1003858188, Open=false)
    2011-01-14 12:31:16.915/50004.447 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: Channel(Id=1586910282, Open=false)
    2011-01-14 12:31:16.915/50004.447 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: TcpConnection(Id=0x0000012D84060E5AC0A8031442EA3CC26AC425D55D93A6AFC5404E5A76A96D1E, Open=false, LocalAddress=192.168.3.20:9091, RemoteAddress=192.168.3.6:65472)
    2011-01-14 12:31:16.915/50004.447 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor:TcpProcessor, member=3): Released: TcpConnection(Id=0x0000012D84061C15C0A803149CF3279B334BE6140AC76C47CA03670D76A96D22, Open=false, LocalAddress=192.168.3.20:9091, RemoteAddress=192.168.3.6:65480)
    2011-01-14 12:31:16.915/50004.447 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: Channel(Id=160435953, Open=false)
    2011-01-14 12:31:16.915/50004.447 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor:TcpProcessor, member=3): Released: TcpConnection(Id=0x0000012D84060E5AC0A8031442EA3CC26AC425D55D93A6AFC5404E5A76A96D1E, Open=false, LocalAddress=192.168.3.20:9091, RemoteAddress=192.168.3.6:65472)
    2011-01-14 12:31:16.916/50004.448 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: Channel(Id=1635893341, Open=false)
    2011-01-14 12:31:16.916/50004.448 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: TcpConnection(Id=0x0000012D84061203C0A8031455CD3A790F6009CA79AEC8BACC464D9976A96D20, Open=false, LocalAddress=192.168.3.20:9091, RemoteAddress=192.168.3.6:65478)
    2011-01-14 12:31:16.916/50004.448 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor:TcpProcessor, member=3): Released: TcpConnection(Id=0x0000012D84061203C0A8031455CD3A790F6009CA79AEC8BACC464D9976A96D20, Open=false, LocalAddress=192.168.3.20:9091, RemoteAddress=192.168.3.6:65478)
    2011-01-14 12:31:16.916/50004.448 Oracle Coherence GE 3.6.0.0 <D5> (thread=DistributedCache:DistributedExecutionsService, member=3): Service DistributedExecutionsService left the cluster
    2011-01-14 12:31:16.919/50004.451 Oracle Coherence GE 3.6.0.0 <D5> (thread=DistributedCache:DistributedPositionsCacheService, member=3): Service DistributedPositionsCacheService left the cluster
    and ...
    2011-01-14 12:31:22.874/50006.273 Oracle Coherence GE 3.6.0.0 <Info> (thread=main, member=n/a): Restarting cluster
    2011-01-14 12:31:22.924/50006.323 Oracle Coherence GE 3.6.0.0 <D4> (thread=main, member=n/a): TCMP bound to /192.168.3.20:8094 using SystemSocketProvider
    2011-01-14 12:31:52.937/50036.336 Oracle Coherence GE 3.6.0.0 <Warning> (thread=Cluster, member=n/a): This Member(Id=0, Timestamp=2011-01-14 12:31:22.924, Address=192.168.3.20:8094, MachineId=27412, Location=machine:amd4,process:4136,member:Administrator, Role=CoherenceServer) has been attempting to join the cluster at address 225.0.0.1:54321 with TTL 4 for 30 seconds without success; this could indicate a mis-configured TTL value, or it may simply be the result of a busy cluster or active failover.
    2011-01-14 12:31:52.950/50036.349 Oracle Coherence GE 3.6.0.0 <Warning> (thread=Cluster, member=n/a): Received a discovery message that indicates the presence of an existing cluster that does not respond to join requests; this is usually caused by a network layer failure:
    Logs starting at 12:30 from the four nodes are here:
    http://www.nmedia.net/~andrew/logs/1.log
    http://www.nmedia.net/~andrew/logs/2.log
    http://www.nmedia.net/~andrew/logs/3.log
    http://www.nmedia.net/~andrew/logs/4.log
    If someone could tell me if this is a bug in the cluster re-join logic or something I screwed up that would be great. Thanks!
    Andrew

    Hi Andrew
    I had a quick look at your logs but cannot say for certain why your cluster died. I can say that losing data is a normal consequence of node loss though. If you have the backup count set to 1 then you can lose a single node without losing data. If you lose more than one node (on different machines, or the same machine if you only have one) over a very short space of time then you will almost certainly lose at least one partition and hence lose the data within that partition.
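
The backup-count point can be illustrated with a toy model. The Python sketch below assumes a simple round-robin placement in which partition p has its primary on node p mod N and its backups on the following nodes; this placement is an assumption made for illustration and is not Coherence's actual distribution strategy. A partition counts as lost only when every node holding a copy of it has failed.

```python
def owners(partition, num_nodes, backup_count):
    """Nodes holding the primary and backup copies (toy round-robin placement)."""
    return {(partition + i) % num_nodes for i in range(backup_count + 1)}


def lost_partitions(num_nodes, num_partitions, failed, backup_count=1):
    """Partitions whose primary and all backups sit on failed nodes."""
    failed = set(failed)
    return [p for p in range(num_partitions)
            if owners(p, num_nodes, backup_count) <= failed]


# Four nodes, eight partitions, backup count 1.
print(lost_partitions(4, 8, failed={0}))           # one node lost
print(lost_partitions(4, 8, failed={0, 1}))        # two nodes lost together
print(lost_partitions(4, 8, failed={0, 1, 2, 3}))  # whole machine lost
```

In this model a single failure loses nothing, two simultaneous failures lose the partitions whose copies both lived on those two nodes, and losing all four nodes at once (as when they share one server) loses everything.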
    Going back to your logs, it is difficult to determine the underlying cause without the whole set of logs. You have posted links to four logs, but from looking at them the cluster has about 16 nodes. I know from experience (as we had a cluster that was quite unstable for a while) that tracing these issues through the logs can be a bit awkward, but you soon get the hang of it :-)
    For example in the log http://www.nmedia.net/~andrew/logs/1.log you have...
    2011-01-14 12:31:16.807/49993.331 Oracle Coherence GE 3.6.0.0 <D5> (thread=Cluster, member=9): MemberLeft notification for Member(Id=3, Timestamp=2011-01-13 22:37:52.106, Address=192.168.3.20:8088, MachineId=27412, Location=machine:amd4,process:4428,member:Administrator, Role=CoherenceServer, PublisherSuccessRate=0.9975, ReceiverSuccessRate=0.9999, PauseRate=0.0, Threshold=93, Paused=false, Deferring=false, OutstandingPackets=0, DeferredPackets=0, ReadyPackets=0, LastIn=261ms, LastOut=277ms, LastSlow=n/a) received from Member(Id=22, Timestamp=2011-01-14 08:21:22.284, Address=192.168.3.121:8092, MachineId=27513, Location=machine:H1,process:3716,member:Howard, Role=Order_entry_window, PublisherSuccessRate=0.8326, ReceiverSuccessRate=1.0, PauseRate=0.0024, Threshold=1456, Paused=false, Deferring=false, OutstandingPackets=0, DeferredPackets=0, ReadyPackets=0, LastIn=0ms, LastOut=8ms, LastSlow=n/a)
    ...which is Member-9 receiving a message about the departure of Member-3 from Member-22, so you would then need to look at the logs for Member-22 to see why it thought Member-3 had departed, and also look at the logs for Member-3 for that time to see what might be wrong with it.
    The more worrying messages are ones like this...
    2011-01-14 12:31:16.709/49993.233 Oracle Coherence GE 3.6.0.0 <Warning> (thread=PacketPublisher, member=9): Experienced a 19025 ms communication delay (probable remote GC) with Member(Id=21, Timestamp=2011-01-14 08:21:12.174, Address=192.168.3.121:8090, MachineId=27513, Location=machine:H1,process:4316,member:Howard, Role=OrderbookviewerViewer); 111 packets rescheduled, PauseRate=0.0014, Threshold=1696
    ...a 19 second delay is a long time and would suggest either very long GC pauses or a network problem. Do you have GC logs for these processes? Are all the servers connected to the same switch, or is the cluster distributed over more than one part of your network? Do you have too much on one machine, are you overloading the NIC, are you swapping? All of these can cause delays and/or loss of packets.
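    With 16 or so members, picking these two kinds of message out of each log by eye gets tedious. The following is only a sketch of how that could be scripted in Python: the regular expressions are written against the two sample lines quoted in this reply, the `classify` helper is made up for illustration, and other Coherence versions may word the messages differently.

```python
import re

# Patterns written against the two sample messages in this thread; other
# Coherence versions may word them differently.
MEMBER_LEFT = re.compile(
    r"member=(?P<reporter>\d+)\): MemberLeft notification for "
    r"Member\(Id=(?P<departed>\d+),.*? received from Member\(Id=(?P<source>\d+),"
)
COMM_DELAY = re.compile(
    r"member=(?P<reporter>\d+)\): Experienced a (?P<delay_ms>\d+) ms "
    r"communication delay .*?with Member\(Id=(?P<peer>\d+),"
)


def classify(line):
    """Return a tuple describing the line, or None if it is neither message."""
    m = MEMBER_LEFT.search(line)
    if m:
        # (kind, member that logged it, member that left, member that reported it)
        return ("member_left", int(m["reporter"]), int(m["departed"]), int(m["source"]))
    m = COMM_DELAY.search(line)
    if m:
        # (kind, member that logged it, slow peer, delay in milliseconds)
        return ("comm_delay", int(m["reporter"]), int(m["peer"]), int(m["delay_ms"]))
    return None


# Shortened copies of the two log lines (member details elided with "...").
left = ("2011-01-14 12:31:16.807/49993.331 Oracle Coherence GE 3.6.0.0 <D5> "
        "(thread=Cluster, member=9): MemberLeft notification for Member(Id=3, ...) "
        "received from Member(Id=22, ...)")
delay = ("2011-01-14 12:31:16.709/49993.233 Oracle Coherence GE 3.6.0.0 <Warning> "
         "(thread=PacketPublisher, member=9): Experienced a 19025 ms communication "
         "delay (probable remote GC) with Member(Id=21, ...); 111 packets rescheduled")

print(classify(left))   # ('member_left', 9, 3, 22)
print(classify(delay))  # ('comm_delay', 9, 21, 19025)
```

    Running something like this over every member's log and sorting the results by timestamp gives a rough picture of who declared whom dead, and which members were slow just beforehand.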
    We have had problems with storage disabled nodes doing long GC pauses and causing storage nodes to drop out of the cluster. Our cluster was on 3.5.3-p8 whereas you are on 3.6.0.0 which is supposed to have better node death detection so you might not have the same issues we had.
    Sorry to not be more help,
    JK

  • Failover cluster failed due to mysterious IP conflict ?

    I'm having a mysterious problem with my failover cluster.
    Cluster name: PrintCluster01.domain.com
    Members: PrintServer01.domain.com and PrintServer02.domain.com
    In Failover Cluster Management, under Cluster Events, I received critical errors 1135 and 1177:
    Log Name: System
    Source: Microsoft-Windows-FailoverClustering
    Date: 15/06/2011 9:07:49 PM
    Event ID: 1177
    Task Category: None
    Level: Critical
    Keywords:
    User: SYSTEM
    Computer: PrintServer01.domain.com
    Description:
    The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.
    Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
    Log Name: System
    Source: Microsoft-Windows-FailoverClustering
    Date: 15/06/2011 9:07:28 PM
    Event ID: 1135
    Task Category: None
    Level: Critical
    Keywords:
    User: SYSTEM
    Computer: PrintServer01.domain.com
    Description:
    Cluster node 'PrintServer02' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
    After further investigation, I found something interesting: this is the very first error logged in Event Viewer on PrintServer02:
    Log Name: System
    Source: Tcpip
    Date: 15/06/2011 9:07:29 PM
    Event ID: 4199
    Task Category: None
    Level: Error
    Keywords: Classic
    User: N/A
    Computer: PrintServer02-VM.domain.com
    Description:
    The system detected an address conflict for IP address 192.168.127.142 with the system having network hardware address 00-50-56-AE-29-23. Network operations on this system may be disrupted as a result.
    192.168.127.142 --> secondary IP of PrintServer01
    How can the conflict be with one of PrintServer01's own addresses? The details are below:
    **From PrintServer01**
    Ethernet adapter Local Area Connection* 8:
    Connection-specific DNS Suffix . :
    Description . . . . . . . . . . . : Microsoft Failover Cluster Virtual Adapter
    Physical Address. . . . . . . . . : 02-50-56-AE-29-23
    DHCP Enabled. . . . . . . . . . . : No
    Autoconfiguration Enabled . . . . : Yes
    IPv4 Address. . . . . . . . . . . : 169.254.1.183(Preferred)
    Subnet Mask . . . . . . . . . . . : 255.255.0.0
    Default Gateway . . . . . . . . . :
    NetBIOS over Tcpip. . . . . . . . : Enabled
    I have double-checked on all of the cluster members that all IP addresses are now unique.
    I'm also sure that the IPs are static and not assigned by DHCP, as the IPCONFIG results below show:
    From **PrintServer01** (the Active Node)
    Windows IP Configuration
    Host Name . . . . . . . . . . . . : PrintServer01
    Primary Dns Suffix . . . . . . . : domain.com
    Node Type . . . . . . . . . . . . : Hybrid
    IP Routing Enabled. . . . . . . . : No
    WINS Proxy Enabled. . . . . . . . : No
    DNS Suffix Search List. . . . . . : domain.com
    domain.com.au
    Ethernet adapter Local Area Connection* 8:
    Connection-specific DNS Suffix . :
    Description . . . . . . . . . . . : Microsoft Failover Cluster Virtual Adapter
    Physical Address. . . . . . . . . : 02-50-56-AE-29-23
    DHCP Enabled. . . . . . . . . . . : No
    Autoconfiguration Enabled . . . . : Yes
    IPv4 Address. . . . . . . . . . . : 169.254.1.183(Preferred)
    Subnet Mask . . . . . . . . . . . : 255.255.0.0
    Default Gateway . . . . . . . . . :
    NetBIOS over Tcpip. . . . . . . . : Enabled
    Ethernet adapter Cluster Public Network:
    Connection-specific DNS Suffix . :
    Description . . . . . . . . . . . : Intel® PRO/1000 MT Network Connection
    Physical Address. . . . . . . . . : 00-50-56-AE-29-23
    DHCP Enabled. . . . . . . . . . . : No
    Autoconfiguration Enabled . . . . : Yes
    IPv4 Address. . . . . . . . . . . : 192.168.127.155(Preferred)
    Subnet Mask . . . . . . . . . . . : 255.255.255.0
    IPv4 Address. . . . . . . . . . . : 192.168.127.88(Preferred)
    Subnet Mask . . . . . . . . . . . : 255.255.255.0
    IPv4 Address. . . . . . . . . . . : 192.168.127.142(Preferred)
    Subnet Mask . . . . . . . . . . . : 255.255.255.0
    IPv4 Address. . . . . . . . . . . : 192.168.127.143(Preferred)
    Subnet Mask . . . . . . . . . . . : 255.255.255.0
    IPv4 Address. . . . . . . . . . . : 192.168.127.144(Preferred)
    Subnet Mask . . . . . . . . . . . : 255.255.255.0
    Default Gateway . . . . . . . . . : 192.168.127.254
    DNS Servers . . . . . . . . . . . : 192.168.127.10
    192.168.127.11
    Primary WINS Server . . . . . . . : 192.168.127.10
    Secondary WINS Server . . . . . . : 192.168.127.11
    NetBIOS over Tcpip. . . . . . . . : Enabled
    Ethernet adapter Cluster Private Network:
    Connection-specific DNS Suffix . :
    Description . . . . . . . . . . . : Intel® PRO/1000 MT Network Connection #2
    Physical Address. . . . . . . . . : 00-50-56-AE-43-EC
    DHCP Enabled. . . . . . . . . . . : No
    Autoconfiguration Enabled . . . . : Yes
    IPv4 Address. . . . . . . . . . . : 10.184.2.2(Preferred)
    Subnet Mask . . . . . . . . . . . : 255.255.255.0
    Default Gateway . . . . . . . . . :
    NetBIOS over Tcpip. . . . . . . . : Disabled
    From **PrintServer02**
    Windows IP Configuration
    Host Name . . . . . . . . . . . . : PrintServer02
    Primary Dns Suffix . . . . . . . : domain.com
    Node Type . . . . . . . . . . . . : Hybrid
    IP Routing Enabled. . . . . . . . : No
    WINS Proxy Enabled. . . . . . . . : No
    DNS Suffix Search List. . . . . . : domain.com
    domain.com.au
    Ethernet adapter Local Area Connection* 8:
    Connection-specific DNS Suffix . :
    Description . . . . . . . . . . . : Microsoft Failover Cluster Virtual Adapter
    Physical Address. . . . . . . . . : 02-50-56-AE-5F-E5
    DHCP Enabled. . . . . . . . . . . : No
    Autoconfiguration Enabled . . . . : Yes
    IPv4 Address. . . . . . . . . . . : 169.254.2.86(Preferred)
    Subnet Mask . . . . . . . . . . . : 255.255.0.0
    Default Gateway . . . . . . . . . :
    NetBIOS over Tcpip. . . . . . . . : Enabled
    Ethernet adapter Cluster Public Network:
    Connection-specific DNS Suffix . :
    Description . . . . . . . . . . . : Intel® PRO/1000 MT Network Connection
    Physical Address. . . . . . . . . : 00-50-56-AE-79-FA
    DHCP Enabled. . . . . . . . . . . : No
    Autoconfiguration Enabled . . . . : Yes
    IPv4 Address. . . . . . . . . . . : 192.168.127.172(Preferred)
    Subnet Mask . . . . . . . . . . . : 255.255.255.0
    IPv4 Address. . . . . . . . . . . : 192.168.127.119(Preferred)
    Subnet Mask . . . . . . . . . . . : 255.255.255.0
    Default Gateway . . . . . . . . . : 192.168.127.254
    DNS Servers . . . . . . . . . . . : 192.168.127.10
    192.168.127.11
    Primary WINS Server . . . . . . . : 192.168.127.11
    Secondary WINS Server . . . . . . : 192.168.127.10
    NetBIOS over Tcpip. . . . . . . . : Enabled
    Ethernet adapter Cluster Private Network:
    Connection-specific DNS Suffix . :
    Description . . . . . . . . . . . : Intel® PRO/1000 MT Network Connection #2
    Physical Address. . . . . . . . . : 00-50-56-AE-77-8D
    DHCP Enabled. . . . . . . . . . . : No
    Autoconfiguration Enabled . . . . : Yes
    IPv4 Address. . . . . . . . . . . : 10.184.2.3(Preferred)
    Subnet Mask . . . . . . . . . . . : 255.255.255.0
    Default Gateway . . . . . . . . . :
    NetBIOS over Tcpip. . . . . . . . : Disabled
    Any help would be greatly appreciated.
    Thanks,
    AWT
    /* Server Support Specialist */
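    One way to check which adapter a reported hardware address belongs to is to index the `ipconfig /all` output by physical address. The Python sketch below is only an illustration: `adapters_by_mac` is a made-up helper, the parsing assumes the English field labels seen in the output above, and the sample text is a trimmed copy of the PrintServer01 listing.

```python
import re


def adapters_by_mac(ipconfig_text):
    """Map each physical address to (adapter name, [IPv4 addresses])."""
    result = {}
    name, mac = None, None
    for raw in ipconfig_text.splitlines():
        line = raw.strip()
        # Section header, e.g. "Ethernet adapter Cluster Public Network:"
        header = re.match(r"(.+?) adapter (.+):$", line)
        if header:
            name, mac = header.group(2), None
            continue
        field = re.match(r"^(Physical Address|IPv4 Address)[ .]*: (\S+)", line)
        if not field or name is None:
            continue
        if field.group(1) == "Physical Address":
            mac = field.group(2).upper()
            result[mac] = (name, [])
        elif mac is not None:
            result[mac][1].append(field.group(2).replace("(Preferred)", ""))
    return result


# Trimmed excerpt of the PrintServer01 output quoted in this post.
SAMPLE = """\
Ethernet adapter Local Area Connection* 8:
   Description . . . . . . . . . . . : Microsoft Failover Cluster Virtual Adapter
   Physical Address. . . . . . . . . : 02-50-56-AE-29-23
   IPv4 Address. . . . . . . . . . . : 169.254.1.183(Preferred)
Ethernet adapter Cluster Public Network:
   Description . . . . . . . . . . . : Intel PRO/1000 MT Network Connection
   Physical Address. . . . . . . . . : 00-50-56-AE-29-23
   IPv4 Address. . . . . . . . . . . : 192.168.127.155(Preferred)
   IPv4 Address. . . . . . . . . . . : 192.168.127.142(Preferred)
"""

macs = adapters_by_mac(SAMPLE)
print(macs["00-50-56-AE-29-23"])
# ('Cluster Public Network', ['192.168.127.155', '192.168.127.142'])
```

    In the listing above, the address named in event 4199 (00-50-56-AE-29-23) is the Cluster Public Network adapter on PrintServer01, which holds 192.168.127.142; the Failover Cluster Virtual Adapter uses 02-50-56-AE-29-23 instead.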

    I am facing the same scenario as the original poster. This is on Server 2008 R2 SP1.
    The Windows event log entries follow the same pattern. The MAC address listed in connection with the duplicate IP belonged to the passive node.
    Interestingly, the Cluster.log begins to explode with activity a few milliseconds before the first Windows event is logged.
    2012/07/11-15:20:59.517 INFO  [CHANNEL fe80::8145:f2b9:898e:784e%37:~3343~] graceful close, status (of previous failure, may not indicate problem) ERROR_IO_PENDING(997)
    2012/07/11-15:20:59.517 WARN  [PULLER SQLTESTSQLB] ReadObject failed with GracefulClose(1226)' because of 'channel to remote endpoint fe80::8145:f2b9:898e:784e%37:~3343~ is closed'
    2012/07/11-15:20:59.517 ERR   [NODE] Node 1: Connection to Node 2 is broken. Reason GracefulClose(1226)' because of 'channel to remote endpoint fe80::8145:f2b9:898e:784e%37:~3343~ is closed'
    2012/07/11-15:20:59.517 WARN  [RGP] Node 1: only local suspects are missing (2). moving to the next stage (shortcut compensation time 05.000)
    2012/07/11-15:20:59.548 WARN  [NETFTAPI] Failed to query parameters for fe80::5efe:169.254.1.79 (status 80070490)
    2012/07/11-15:20:59.548 WARN  [NETFTAPI] Failed to query parameters for fe80::5efe:169.254.1.79 (status 80070490)
    2012/07/11-15:20:59.579 INFO  [CHANNEL 192.168.3.22:~3343~] graceful close, status (of previous failure, may not indicate problem) ERROR_SUCCESS(0)
    2012/07/11-15:20:59.579 WARN  cxl::ConnectWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 192.168.3.22:~3343~ is closed'
    2012/07/11-15:20:59.829 INFO  [GEM] Node 1: EnterRepairStage1: Gem agent for node 1
    2012/07/11-15:21:00.141 INFO  [GEM] Node 1: EnterRepairStage2: Gem agent for node 1
    2012/07/11-15:21:00.499 WARN  [RCM] Moving orphaned group Available Storage from downed node SQLTESTSQLB to node SQLTESTSQLA.
    2012/07/11-15:21:00.499 WARN  [RES] IP Address <Cluster IP Address>: WorkerThread: NetInterface ef150d1a-f4a1-4f4f-a5c7-6e7cb2bfacab changed to state 3.
    2012/07/11-15:21:00.499 WARN  [RCM] Moving orphaned group MSSTEST from downed node SQLTESTSQLB to node SQLTESTSQLA.
    2012/07/11-15:21:00.546 WARN  [RES] IP Address <SQL IP Address 1 (DEVSQL)>: Failed to delete IP interface 2003B882, status 87.
    2012/07/11-15:21:00.562 WARN  [RES] Physical Disk <Cluster Disk 2>: PR reserve failed, status 170
    2012/07/11-15:21:00.577 WARN  [RES] Physical Disk <Cluster Disk 1>: PR reserve failed, status 170
    2012/07/11-15:21:00.593 WARN  [RES] Physical Disk <Cluster Disk 3>: PR reserve failed, status 170
    2012/07/11-15:21:02.215 WARN  [NETFTAPI] Failed to query parameters for 192.168.3.32 (status 80070490)
    2012/07/11-15:21:02.215 WARN  [NETFTAPI] Failed to query parameters for 192.168.3.32 (status 80070490)
    2012/07/11-15:21:05.864 DBG   [NETFTAPI] received NsiParameterNotification  for fe80::5cd:8cc2:186:f5cb (IpDadStatePreferred )
    2012/07/11-15:21:06.565 ERR   [RES] Physical Disk <Cluster Disk 2>: Failed to preempt reservation, status 170
    2012/07/11-15:21:06.581 ERR   [RES] Physical Disk <Cluster Disk 2>: OnlineThread: Unable to arbitrate for the disk. Error: 170.
    2012/07/11-15:21:06.581 ERR   [RES] Physical Disk <Cluster Disk 2>: OnlineThread: Error 170 bringing resource online.
    2012/07/11-15:21:06.581 ERR   [RHS] Online for resource Cluster Disk 2 failed.
    2012/07/11-15:21:06.581 WARN  [RCM] HandleMonitorReply: ONLINERESOURCE for 'Cluster Disk 2', gen(0) result 5018.
    2012/07/11-15:21:06.581 ERR   [RCM] rcm::RcmResource::HandleFailure: (Cluster Disk 2)
    2012/07/11-15:21:06.581 WARN  [RES] Physical Disk <Cluster Disk 2>: Terminate: Failed to open device \Device\Harddisk5\Partition1, Error 2
    2012/07/11-15:21:06.581 ERR   [RES] Physical Disk <Cluster Disk 1>: Failed to preempt reservation, status 170
    2012/07/11-15:21:06.581 ERR   [RES] Physical Disk <Cluster Disk 1>: OnlineThread: Unable to arbitrate for the disk. Error: 170.
    2012/07/11-15:21:06.581 ERR   [RES] Physical Disk <Cluster Disk 1>: OnlineThread: Error 170 bringing resource online.
    Full cluster log here:
    https://skydrive.live.com/redir?resid=A694FDEBF02727CD!133&authkey=!ADQMxHShdeDvXVc
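    When the Cluster.log "explodes" like this, a quick count of warnings and errors per component can show where to start reading. This is a rough Python sketch, assuming the line layout seen in the excerpt above (timestamp, level, optional bracketed component); the `summarize` helper is hypothetical and the sample lines are copied from the excerpt.

```python
import re
from collections import Counter

# Timestamp, level, then an optional "[COMPONENT ...]" tag.
LINE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>INFO|WARN|ERR|DBG)\s+"
    r"(?:\[(?P<component>[A-Z]+)[^\]]*\])?"
)


def summarize(lines, levels=("WARN", "ERR")):
    """Count log lines per (level, component) for the given levels."""
    counts = Counter()
    for line in lines:
        m = LINE.match(line.strip())
        if m and m["level"] in levels:
            counts[(m["level"], m["component"] or "-")] += 1
    return counts


sample = [
    "2012/07/11-15:20:59.829 INFO  [GEM] Node 1: EnterRepairStage1: Gem agent for node 1",
    "2012/07/11-15:21:00.562 WARN  [RES] Physical Disk <Cluster Disk 2>: PR reserve failed, status 170",
    "2012/07/11-15:21:00.577 WARN  [RES] Physical Disk <Cluster Disk 1>: PR reserve failed, status 170",
    "2012/07/11-15:21:06.565 ERR   [RES] Physical Disk <Cluster Disk 2>: Failed to preempt reservation, status 170",
    "2012/07/11-15:21:06.581 ERR   [RHS] Online for resource Cluster Disk 2 failed.",
]

for key, n in sorted(summarize(sample).items()):
    print(key, n)
```

    Lines that do not start with a timestamp (wrapped continuations, for example) are skipped rather than guessed at.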

  • WARNING: IP: Hardware address '08:00:20:c4:1f:e8' trying to be our address

    ""ERROR""
    WARNING: IP: Hardware address '08:00:20:c4:1f:e8' trying to be our address 192.168.001.001
    I have 2 nodes running Sun Solaris 9, both configured in a Veritas Cluster. The cluster nodes have 2 heartbeat links connected with crossover cables.
    The crossover links use the IPs below:
    LAN1 of node1 192.168.2.1
    LAN2 of node1 192.168.2.2
    LAN1 of node2 192.168.1.1
    LAN2 of node2 192.168.1.2
    All of the above interfaces are connected with crossover cables, but there is still an IP conflict message in dmesg. The error is shown above.

    Ok, my mistake.
    So that PC is also trying to use 192.168.0.1.
    And it must be doing arp broadcasts on the LAN.
    Solaris should probably be ignoring them since it has a 10.x.x.x address on the LAN. So it shouldn't care about 192.168.0.x addresses on the LAN.
    Only on the crossover network.
    But apparently it cares anyway.
    So I suggest you change the cluster heartbeat addresses to something else that's less likely to be chosen by random computers on your LAN.
    Or figure out why the PC is trying to use 192.168.0.1. It could be someone running VMware or VirtualBox.
    Or VLAN your LAN, so your critical servers aren't sharing a network with random, poorly configured PCs.
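    The kernel warning prints the address with zero-padded octets (192.168.001.001), which makes it easy to misread when comparing against the configured interfaces. Below is a small Python sketch, using the heartbeat addresses listed in the question, that normalizes the padded form and reports which configured interface it matches. The `normalize` and `owners_of` helpers are made up for illustration, and the sketch assumes the padded octets are plain decimal.

```python
import ipaddress


def normalize(dotted):
    """Strip zero padding so '192.168.001.001' becomes '192.168.1.1'.

    Assumes each octet is decimal; recent Python versions refuse to parse
    zero-padded octets directly, so the padding is removed first.
    """
    return ".".join(str(int(octet, 10)) for octet in dotted.split("."))


# Heartbeat addresses as listed in the question.
heartbeat = {
    "LAN1 of node1": "192.168.2.1",
    "LAN2 of node1": "192.168.2.2",
    "LAN1 of node2": "192.168.1.1",
    "LAN2 of node2": "192.168.1.2",
}


def owners_of(warning_ip):
    """Return the configured interfaces whose address equals warning_ip."""
    target = ipaddress.ip_address(normalize(warning_ip))
    return [name for name, ip in heartbeat.items()
            if ipaddress.ip_address(ip) == target]


print(owners_of("192.168.001.001"))  # ['LAN1 of node2']
```

    Once the contested address is known, the hardware address from the warning can be compared with the MAC of that interface to tell whether the other claimant is a cluster node or some unrelated machine.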

  • Urgent! Node keep disconnecting from Coherence Cluster

    The system consists of 4 standalone cache servers with local storage set to true and 14 other embedded nodes started with different web apps on Tomcat with local storage set to false.
    When the servers are started after a new deployment, sometimes it just works, but most times some random Tomcat server gets stuck in the following pattern.
    First it successfully starts the cluster service and joins an existing cluster.
    Oracle Coherence Version 3.5.1/461
    Grid Edition: Development mode
    Copyright (c) 2000, 2009, Oracle and/or its affiliates. All rights reserved.
    2012-07-18 12:24:33.335/31.845 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Service Cluster joined the cluster with senior service member n/a
    2012-07-18 12:24:33.550/32.060 Oracle Coherence GE 3.5.1/461 <Info> (thread=Cluster, member=n/a): This Member(Id=8, Timestamp=2012-07-18 12:24:33.347, Address=10.34.32.107:8089, MachineId=2155, Location=machine:dev1sxapp2,process:20831, Role=ApacheCatalinaStartupBootstrap, Edition=Grid Edition, Mode=Development, CpuCount=24, SocketCount=24) joined cluster "DEV1" with senior Member(Id=10, Timestamp=2012-07-18 09:39:44.861, Address=10.34.32.101:8090, MachineId=2149, Location=machine:dev1ssapp3,process:27796, Role=ApacheCatalinaStartupBootstrap, Edition=Grid Edition, Mode=Development, CpuCount=64, SocketCount=64)
    2012-07-18 12:24:33.555/32.065 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Member(Id=1, Timestamp=2012-07-18 12:22:14.231, Address=10.34.32.107:8090, MachineId=2155, Location=machine:dev1sxapp2,process:1278, Role=ApacheCatalinaStartupBootstrap) joined Cluster with senior member 10
    2012-07-18 12:24:33.555/32.065 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Member(Id=2, Timestamp=2012-07-18 12:22:14.331, Address=10.34.32.106:8089, MachineId=2154, Location=machine:dev1sxapp1,process:6549, Role=ApacheCatalinaStartupBootstrap) joined Cluster with senior member 10
    2012-07-18 12:24:33.555/32.065 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Member(Id=3, Timestamp=2012-07-18 12:22:55.086, Address=10.34.32.106:8088, MachineId=2154, Location=machine:dev1sxapp1,process:23083, Role=ApacheCatalinaStartupBootstrap) joined Cluster with senior member 10
    2012-07-18 12:24:33.555/32.065 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Member(Id=4, Timestamp=2012-07-18 12:22:56.799, Address=10.34.32.107:8088, MachineId=2155, Location=machine:dev1sxapp2,process:19624, Role=ApacheCatalinaStartupBootstrap) joined Cluster with senior member 10
    2012-07-18 12:24:33.555/32.065 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Member(Id=5, Timestamp=2012-07-18 12:24:31.869, Address=10.34.32.106:8090, MachineId=2154, Location=machine:dev1sxapp1,process:24411, Role=ApacheCatalinaStartupBootstrap) joined Cluster with senior member 10
    2012-07-18 12:24:33.555/32.065 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Member(Id=6, Timestamp=2012-07-18 12:24:33.084, Address=10.34.32.101:8088, MachineId=2149, Location=machine:dev1ssapp3,process:28932, Role=ApacheCatalinaStartupBootstrap) joined Cluster with senior member 10
    2012-07-18 12:24:33.555/32.065 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Member(Id=14, Timestamp=2012-07-18 09:40:50.645, Address=10.34.32.104:8090, MachineId=2152, Location=machine:dev1ssapp4,process:17697, Role=ApacheCatalinaStartupBootstrap) joined Cluster with senior member 10
    2012-07-18 12:24:33.556/32.066 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Member(Id=17, Timestamp=2012-07-18 10:35:16.722, Address=10.34.32.104:8093, MachineId=2152, Location=machine:dev1ssapp4,process:19365, Role=ApacheCatalinaStartupBootstrap) joined Cluster with senior member 10
    2012-07-18 12:24:33.556/32.066 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Member(Id=18, Timestamp=2012-07-18 10:38:47.714, Address=10.34.32.101:8093, MachineId=2149, Location=machine:dev1ssapp3,process:29887, Role=ApacheCatalinaStartupBootstrap) joined Cluster with senior member 10
    2012-07-18 12:24:33.563/32.073 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Member 10 joined Service Management with senior member 10
    2012-07-18 12:24:33.566/32.076 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Member 1 joined Service Management with senior member 10
    2012-07-18 12:24:33.566/32.076 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Member 2 joined Service Management with senior member 10
    2012-07-18 12:24:33.566/32.076 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Member 4 joined Service Management with senior member 10
    2012-07-18 12:24:33.566/32.076 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Member 17 joined Service Management with senior member 10
    2012-07-18 12:24:33.566/32.076 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Member 17 joined Service PFExpiryDistributedCache with senior member 17
    2012-07-18 12:24:33.566/32.076 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Member 17 joined Service SsoRuleEntryDistributedCache with senior member 17
    2012-07-18 12:24:33.567/32.077 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Member 3 joined Service Management with senior member 10
    2012-07-18 12:24:33.567/32.077 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Member 18 joined Service Management with senior member 10
    2012-07-18 12:24:33.567/32.077 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Member 18 joined Service PFExpiryDistributedCache with senior member 17
    2012-07-18 12:24:33.567/32.077 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Member 18 joined Service SsoRuleEntryDistributedCache with senior member 17
    2012-07-18 12:24:33.568/32.078 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Member 14 joined Service Management with senior member 10
    2012-07-18 12:24:33.568/32.078 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Member 5 joined Service Management with senior member 10
    2012-07-18 12:24:33.579/32.089 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Member 6 joined Service Management with senior member 10
    Then it started getting heartbeat overdue messages and the cluster stopped:
    2012-07-18 12:37:20.717/799.227 Oracle Coherence GE 3.5.1/461 <Info> (thread=PacketListenerN, member=8): Scheduled senior member heartbeat is overdue; rejoining multicast group.
    2012-07-18 12:37:29.916/808.426 Oracle Coherence GE 3.5.1/461 <Info> (thread=PacketListenerN, member=8): Scheduled senior member heartbeat is overdue; rejoining multicast group.
    2012-07-18 12:37:59.291/837.801 Oracle Coherence GE 3.5.1/461 <Error> (thread=PacketListenerN, member=8): Stopping cluster due to unhandled exception: com.tangosol.net.messaging.ConnectionException: Unable to refresh sockets: [UnicastUdpSocket{State=STATE_OPEN, address:port=10.34.32.107:8089}, MulticastUdpSocket{State=STATE_OPEN, address:port=237.0.0.1:40109, InterfaceAddress=10.34.32.107, TimeToLive=4}, TcpSocketAccepter{State=STATE_OPEN, ServerSocket=10.34.32.107:8089}]; last failed socket: MulticastUdpSocket{State=STATE_OPEN, address:port=237.0.0.1:40109, InterfaceAddress=10.34.32.107, TimeToLive=4}
    at com.tangosol.coherence.component.net.Cluster$SocketManager.refreshSockets(Cluster.CDB:91)
    at com.tangosol.coherence.component.net.Cluster$SocketManager$MulticastUdpSocket.onInterruptedIOException(Cluster.CDB:9)
    at com.tangosol.coherence.component.net.socket.UdpSocket.receive(UdpSocket.CDB:33)
    at com.tangosol.coherence.component.net.UdpPacket.receive(UdpPacket.CDB:4)
    at com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketListener.onNotify(PacketListener.CDB:19)
    at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
    at java.lang.Thread.run(Thread.java:662)
    Caused by: java.net.SocketTimeoutException: Receive timed out
    at java.net.PlainDatagramSocketImpl.receive0(Native Method)
    at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:145)
    at java.net.DatagramSocket.receive(DatagramSocket.java:725)
    at com.tangosol.coherence.component.net.socket.UdpSocket.receive(UdpSocket.CDB:20)
    at com.tangosol.coherence.component.net.UdpPacket.receive(UdpPacket.CDB:4)
    at com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketListener.onNotify(PacketListener.CDB:19)
    at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
    at java.lang.Thread.run(Thread.java:662)
    2012-07-18 12:37:59.291/837.801 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=8): Service Cluster left the cluster
    2012-07-18 12:37:59.293/837.803 Oracle Coherence GE 3.5.1/461 <D5> (thread=Invocation:Management, member=8): Service Management left the cluster
    2012-07-18 12:37:59.293/837.803 Oracle Coherence GE 3.5.1/461 <D5> (thread=ReplicatedCache:HibernateReplicatedCache, member=8): Service HibernateReplicatedCache left the cluster
    Then it started getting messages from various nodes about the existing cluster:
    2012-07-18 12:40:02.862/961.372 Oracle Coherence GE 3.5.1/461 <Info> (thread=queue://authenticationService.logonEvent.consumer-2, member=n/a): Restarting cluster
    2012-07-18 12:40:02.891/961.401 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Service Cluster joined the cluster with senior service member n/a
    2012-07-18 12:40:20.167/978.677 Oracle Coherence GE 3.5.1/461 <Warning> (thread=Cluster, member=n/a): This Member(Id=0, Timestamp=2012-07-18 12:40:02.867, Address=10.34.32.107:8089, MachineId=2155, Location=machine:dev1sxapp2,process:20831, Role=ApacheCatalinaStartupBootstrap) has been attempting to join the cluster at address 237.0.0.1:40109 with TTL 4 for 17 seconds without success; this could indicate a mis-configured TTL value, or it may simply be the result of a busy cluster or active failover.
    2012-07-18 12:40:20.168/978.678 Oracle Coherence GE 3.5.1/461 <Warning> (thread=Cluster, member=n/a): Received a discovery message that indicates the presence of an existing cluster:
    Message "NewMemberAnnounceWait"
    FromMember=Member(Id=4, Timestamp=2012-07-18 12:22:56.799, Address=10.34.32.107:8088, MachineId=2155, Location=machine:dev1sxapp2,process:19624, Role=ApacheCatalinaStartupBootstrap)
    FromMessageId=0
    Internal=false
    MessagePartCount=1
    PendingCount=0
    MessageType=9
    ToPollId=0
    Poll=null
    Packets
    [000]=Broadcast{PacketType=0x0DDF00D2, ToId=0, FromId=4, Direction=Incoming, ReceivedMillis=12:40:20.167, MessageType=9, MessagePartCount=1, MessagePartIndex=0, Body=0x00000001389AE63E1F0A22206B00000000000000000000000040001
    F980000086B000405011818044445563140400A64657631737861707032053139363234401E417061636865436174616C696E6153746172747570426F6F7473747261700001000001389AF5E6330A22206B00000000000000000000000040001F990000086B000005011818044445563140
    400A64657631737861707032053230383331401E417061636865436174616C696E6153746172747570426F6F74737472617000000001389AE6376E0A22206A00000000000000000000000040001F980000086A000305011818044445563140400A646576317378617070310532333038334
    01E417061636865436174616C696E6153746172747570426F6F74737472617000, Body.length=287}
    Service=ClusterService{Name=Cluster, State=(SERVICE_STARTED, STATE_ANNOUNCE), Id=0, Version=3.5}
    ToMemberSet=null
    NotifySent=false
    ToMember=Member(Id=0, Timestamp=2012-07-18 12:40:02.867, Address=10.34.32.107:8089, MachineId=2155, Location=machine:dev1sxapp2,process:20831, Role=ApacheCatalinaStartupBootstrap)
    SeniorMember=Member(Id=3, Timestamp=2012-07-18 12:22:55.086, Address=10.34.32.106:8088, MachineId=2154, Location=machine:dev1sxapp1,process:23083, Role=ApacheCatalinaStartupBootstrap)
    Then it failed to connect to the cluster:
    2012-07-18 12:40:33.187/991.697 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Service Cluster left the cluster
    2012-07-18 12:40:33.190/991.700 Oracle Coherence GE 3.5.1/461 <Error> (thread=queue://authenticationService.logonEvent.consumer-2, member=n/a): Error while starting cluster: com.tangosol.net.RequestTimeoutException:
    Timeout during service start: ServiceInfo(Id=0, Name=Cluster, Type=Cluster
    MemberSet=ServiceMemberSet(
    OldestMember=n/a
    ActualMemberSet=MemberSet(Size=0, BitSetCount=0
    MemberId/ServiceVersion/ServiceJoined/ServiceLeaving
    at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onStartupTimeout(Grid.CDB:6)
    at com.tangosol.coherence.component.util.daemon.queueProcessor.Service.start(Service.CDB:28)
    at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.start(Grid.CDB:38)
    at com.tangosol.coherence.component.net.Cluster.onStart(Cluster.CDB:366)
    at com.tangosol.coherence.component.net.Cluster.start(Cluster.CDB:11)
    at com.tangosol.coherence.component.util.SafeCluster.startCluster(SafeCluster.CDB:3)
    at com.tangosol.coherence.component.util.SafeCluster.restartCluster(SafeCluster.CDB:7)
    at com.tangosol.coherence.component.util.SafeCluster.ensureRunningCluster(SafeCluster.CDB:27)
    at com.tangosol.coherence.component.util.SafeCluster.start(SafeCluster.CDB:2)
    at com.tangosol.net.CacheFactory.ensureCluster(CacheFactory.java:1011)
    at com.tangosol.coherence.hibernate.CoherenceCacheProvider.nextTimestamp(CoherenceCacheProvider.java:58)
    at org.hibernate.cache.impl.bridge.RegionFactoryCacheProviderBridge.nextTimestamp(RegionFactoryCacheProviderBridge.java:93)
    at org.hibernate.impl.SessionFactoryImpl.openSession(SessionFactoryImpl.java:652)
    at org.hibernate.ejb.EntityManagerImpl.getRawSession(EntityManagerImpl.java:111)
    at org.hibernate.ejb.EntityManagerImpl.getSession(EntityManagerImpl.java:91)
    at org.hibernate.ejb.AbstractEntityManagerImpl.setDefaultProperties(AbstractEntityManagerImpl.java:250)
    at org.hibernate.ejb.AbstractEntityManagerImpl.postInit(AbstractEntityManagerImpl.java:162)
    at org.hibernate.ejb.EntityManagerImpl.<init>(EntityManagerImpl.java:84)
    at org.hibernate.ejb.EntityManagerFactoryImpl.createEntityManager(EntityManagerFactoryImpl.java:112)
    at org.hibernate.ejb.EntityManagerFactoryImpl.createEntityManager(EntityManagerFactoryImpl.java:107)
    at org.springframework.orm.jpa.JpaTransactionManager.createEntityManagerForTransaction(JpaTransactionManager.java:399)
    at org.springframework.orm.jpa.JpaTransactionManager.doBegin(JpaTransactionManager.java:321)
    at org.springframework.transaction.support.AbstractPlatformTransactionManager.getTransaction(AbstractPlatformTransactionManager.java:371)
    at org.springframework.transaction.interceptor.TransactionAspectSupport.createTransactionIfNecessary(TransactionAspectSupport.java:335)
    at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:105)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
    at org.springframework.aop.framework.Cglib2AopProxy$DynamicAdvisedInterceptor.intercept(Cglib2AopProxy.java:621)
    at org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:560)
    at org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:498)
    at org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:467)
    at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:325)
    at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:263)
    at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1058)
    at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.executeOngoingLoop(DefaultMessageListenerContainer.java:1050)
    at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:947)
    at java.lang.Thread.run(Thread.java:662)
    2012-07-18 12:40:33.216/991.726 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Service Cluster joined the cluster with senior service member n/a
    2012-07-18 12:40:50.398/1008.908 Oracle Coherence GE 3.5.1/461 <Warning> (thread=Cluster, member=n/a): This Member(Id=0, Timestamp=2012-07-18 12:40:33.194, Address=10.34.32.107:8089, MachineId=2155, Location=machine:dev1sxapp2,process:20831, Role=ApacheCatalinaStartupBootstrap) has been attempting to join the cluster at address 237.0.0.1:40109 with TTL 4 for 17 seconds without success; this could indicate a mis-configured TTL value, or it may simply be the result of a busy cluster or active failover.
    This particular JVM then goes into a loop: it receives many messages from other nodes about the existing cluster but fails to join.
    I have run the MulticastTest and the Datagram test, neither of which revealed any obvious network issue. What should I do next?
    JVM is 1.6.0_31
    Thanks a lot in advance, any help will be greatly appreciated.

    I correlated the logs across all servers and found that the issue may be caused by a member it was connected to being restarted at the same time.
    Server 1:
    - It starts as member 23, discovers the existing cluster, and joins it. Server 1 then logs many messages about other members joining the cluster with different member IDs.
    - Then it finds that a member has failed to respond:
    2012-07-30 22:00:25.371/34.325 Oracle Coherence GE 3.5.1/461 <D6> (thread=PacketPublisher, member=n/a): Member(Id=5, Timestamp=2012-07-30 15:35:09.735, Address=10.34.32.107:8089, MachineId=2155, Location=machine:dev1sxapp2,process:21324, Role=ApacheCatalinaStartupBootstrap) has failed to respond to 17 packets; declaring this member as paused.
    - Then it requests departure confirmation for member 5:
    2012-07-30 22:00:52.042/60.996 Oracle Coherence GE 3.5.1/461 <Warning> (thread=PacketPublisher, member=n/a): Timeout while delivering a packet Directed{PacketType=0x0DDF00D5, ToId=0, FromId=23, Direction=Outgoing, SentCount=145, SentMillis=22:00:51.832, ToMemberSet=[5(1)], ServiceId=0, MessageType=16, FromMessageId=6, ToMessageId=0, MessagePartCount=1, MessagePartIndex=0, NackInProgress=false, ResendScheduled=22:00:52.32, Timeout=22:00:51.849, PendingResendSkips=0, DeliveryState=unsent, Body=0x0000000200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000..., Body.length=1398}; requesting the departure confirmation for Member(Id=5, Timestamp=2012-07-30 15:35:09.735, Address=10.34.32.107:8089, MachineId=2155, Location=machine:dev1sxapp2,process:21324, Role=ApacheCatalinaStartupBootstrap)
    by MemberSet(Size=2, BitSetCount=2
    Member(Id=1, Timestamp=2012-07-27 10:46:51.616, Address=10.34.32.101:8088, MachineId=2149, Location=machine:dev1ssapp3,process:1, Role=CoherenceServer)
    Member(Id=12, Timestamp=2012-07-30 22:00:43.132, Address=10.34.32.101:8091, MachineId=2149, Location=machine:dev1ssapp3,process:7803, Role=ApacheCatalinaStartupBootstrap)
    - Then the member set confirms the departure; however, at the same time, the Cluster service also leaves.
    2012-07-30 22:00:52.046/61.000 Oracle Coherence GE 3.5.1/461 <Info> (thread=Cluster, member=n/a): Member departure confirmed by MemberSet(Size=1, BitSetCount=2
    Member(Id=12, Timestamp=2012-07-30 22:00:43.132, Address=10.34.32.101:8091, MachineId=2149, Location=machine:dev1ssapp3,process:7803, Role=ApacheCatalinaStartupBootstrap)
    ); removing Member(Id=5, Timestamp=2012-07-30 15:35:09.735, Address=10.34.32.107:8089, MachineId=2155, Location=machine:dev1sxapp2,process:21324, Role=ApacheCatalinaStartupBootstrap)
    2012-07-30 22:00:52.046/61.000 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Member(Id=5, Timestamp=2012-07-30 22:00:52.046, Address=10.34.32.107:8089, MachineId=2155, Location=machine:dev1sxapp2,process:21324, Role=ApacheCatalinaStartupBootstrap) left Cluster with senior member 1
    2012-07-30 22:00:52.049/61.003 Oracle Coherence GE 3.5.1/461 <D5> (thread=Cluster, member=n/a): Service Cluster left the cluster
    - Then the service start times out, so the application fails to start:
    2012-07-30 22:00:52.051/61.005 Oracle Coherence GE 3.5.1/461 <Error> (thread=main, member=n/a): Error while starting cluster: com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(Id=0, Name=Cluster, Type=Cluster
    MemberSet=ServiceMemberSet(
    OldestMember=n/a
    ActualMemberSet=MemberSet(Size=15, BitSetCount=2
    Member(Id=1, Timestamp=2012-07-27 10:46:51.616, Address=10.34.32.101:8088, MachineId=2149, Location=machine:dev1ssapp3,process:1, Role=CoherenceServer)
    Member(Id=2, Timestamp=2012-07-27 10:47:12.122, Address=10.34.32.101:8089, MachineId=2149, Location=machine:dev1ssapp3,process:2, Role=CoherenceServer)
    Member(Id=3, Timestamp=2012-07-27 10:48:02.603, Address=10.34.32.104:8088, MachineId=2152, Location=machine:dev1ssapp4,process:1, Role=CoherenceServer)
    Member(Id=4, Timestamp=2012-07-27 10:48:04.76, Address=10.34.32.104:8089, MachineId=2152, Location=machine:dev1ssapp4,process:2, Role=CoherenceServer)
    Member(Id=8, Timestamp=2012-07-30 14:27:07.382, Address=10.34.32.101:8090, MachineId=2149, Location=machine:dev1ssapp3,process:23727, Role=ApacheCatalinaStartupBootstrap)
    Member(Id=9, Timestamp=2012-07-30 22:00:28.596, Address=10.34.32.101:8092, MachineId=2149, Location=machine:dev1ssapp3,process:7619, Role=ApacheCatalinaStartupBootstrap)
    Member(Id=10, Timestamp=2012-07-30 14:34:27.573, Address=10.34.32.104:8090, MachineId=2152, Location=machine:dev1ssapp4,process:25219, Role=ApacheCatalinaStartupBootstrap)
    Member(Id=11, Timestamp=2012-07-30 22:00:41.609, Address=10.34.32.107:8088, MachineId=2155, Location=machine:dev1sxapp2,process:17632, Role=ApacheCatalinaStartupBootstrap)
    Member(Id=12, Timestamp=2012-07-30 22:00:43.132, Address=10.34.32.101:8091, MachineId=2149, Location=machine:dev1ssapp3,process:7803, Role=ApacheCatalinaStartupBootstrap)
    Member(Id=14, Timestamp=2012-07-30 15:35:09.811, Address=10.34.32.106:8088, MachineId=2154, Location=machine:dev1sxapp1,process:5186, Role=ApacheCatalinaStartupBootstrap)
    Member(Id=15, Timestamp=2012-07-30 16:02:34.096, Address=10.34.32.106:8091, MachineId=2154, Location=machine:dev1sxapp1,process:2691, Role=ApacheCatalinaStartupBootstrap)
    Member(Id=16, Timestamp=2012-07-30 16:08:41.885, Address=10.34.32.107:8091, MachineId=2155, Location=machine:dev1sxapp2,process:15992, Role=ApacheCatalinaStartupBootstrap)
    Member(Id=21, Timestamp=2012-07-30 21:58:56.669, Address=10.34.32.106:8089, MachineId=2154, Location=machine:dev1sxapp1,process:28689, Role=ApacheCatalinaStartupBootstrap)
    Member(Id=22, Timestamp=2012-07-30 21:58:58.29, Address=10.34.32.107:8090, MachineId=2155, Location=machine:dev1sxapp2,process:15491, Role=ApacheCatalinaStartupBootstrap)
    Member(Id=23, Timestamp=2012-07-30 22:00:21.648, Address=10.34.32.106:8090, MachineId=2154, Location=machine:dev1sxapp1,process:556, Role=ApacheCatalinaStartupBootstrap)
    MemberId/ServiceVersion/ServiceJoined/ServiceLeaving
    1/3.5/Fri Jul 27 10:46:51 EDT 2012/false,
    2/3.5/Fri Jul 27 10:47:12 EDT 2012/false,
    3/3.5/Fri Jul 27 10:48:02 EDT 2012/false,
    4/3.5/Fri Jul 27 10:48:04 EDT 2012/false,
    8/3.5/Mon Jul 30 14:27:07 EDT 2012/false,
    9/3.5/Mon Jul 30 22:00:28 EDT 2012/false,
    10/3.5/Mon Jul 30 14:34:27 EDT 2012/false,
    11/3.5/Mon Jul 30 22:00:41 EDT 2012/false,
    12/3.5/Mon Jul 30 22:00:43 EDT 2012/false,
    14/3.5/Mon Jul 30 15:35:09 EDT 2012/false,
    15/3.5/Mon Jul 30 16:02:34 EDT 2012/false,
    16/3.5/Mon Jul 30 16:08:41 EDT 2012/false,
    21/3.5/Mon Jul 30 21:58:56 EDT 2012/false,
    22/3.5/Mon Jul 30 21:58:58 EDT 2012/false,
    23/3.5/Mon Jul 30 22:00:21 EDT 2012/false
         at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onStartupTimeout(Grid.CDB:6)
         at com.tangosol.coherence.component.util.daemon.queueProcessor.Service.start(Service.CDB:28)
         at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.start(Grid.CDB:38)
         at com.tangosol.coherence.component.net.Cluster.onStart(Cluster.CDB:366)
         at com.tangosol.coherence.component.net.Cluster.start(Cluster.CDB:11)
         at com.tangosol.coherence.component.util.SafeCluster.startCluster(SafeCluster.CDB:3)
         at com.tangosol.coherence.component.util.SafeCluster.restartCluster(SafeCluster.CDB:7)
         at com.tangosol.coherence.component.util.SafeCluster.ensureRunningCluster(SafeCluster.CDB:27)
         at com.tangosol.coherence.component.util.SafeCluster.start(SafeCluster.CDB:2)
         at com.tangosol.net.CacheFactory.ensureCluster(CacheFactory.java:1011)
         at com.tangosol.coherence.hibernate.CoherenceCacheProvider.start(CoherenceCacheProvider.java:73)
         at org.hibernate.cache.impl.bridge.RegionFactoryCacheProviderBridge.start(RegionFactoryCacheProviderBridge.java:72)
    Looking at member 5's log, I found that it was being bounced at that time, but it somehow failed to stop the Coherence thread and did not send a departure event to the cluster until other members requested it.
    SEVERE: The web application [riding-services] appears to have started a thread named Cluster but has failed to stop it. This is very likely to create a memory leak.
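As a rough check on the timing, the timestamps in the Server 1 log lines quoted above can be subtracted directly. This is only a sketch in Python: the constant names are made up for illustration, and the values are copied from the quoted log lines (member 5 declared paused, member 5 departure confirmed, member 23's join timestamp, and the startup error).

```python
from datetime import datetime

# Timestamps copied from the Server 1 log lines quoted above.
PAUSED_AT = "2012-07-30 22:00:25.371"     # member 5 declared paused
CONFIRMED_AT = "2012-07-30 22:00:52.046"  # member 5 departure confirmed
JOINED_AT = "2012-07-30 22:00:21.648"     # member 23 (Server 1) join timestamp
FAILED_AT = "2012-07-30 22:00:52.051"     # "Error while starting cluster"

FMT = "%Y-%m-%d %H:%M:%S.%f"


def seconds_between(start: str, end: str) -> float:
    """Return the elapsed seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


detection = seconds_between(PAUSED_AT, CONFIRMED_AT)
startup = seconds_between(JOINED_AT, FAILED_AT)
print(f"paused -> departure confirmed: {detection:.3f} s")
print(f"join timestamp -> startup failure: {startup:.3f} s")
```

By this measure the departure was confirmed about 26.7 seconds after the member was declared paused, and the startup failure came about 30.4 seconds after member 23's join timestamp.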
    Questions:
    1. This issue seems to happen only when one server starts while another is shutting down in the same time window and the two happen to be connected to each other for distributed caching. How can I modify the script to retry startup after the first attempt times out? Or should I change the configuration to use a longer timeout value?
    2. Is it possible to detect an unavailable member more quickly? It currently seems to take 30 seconds or more.
    Thanks in advance,
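One way to approach the retry idea in question 1 is a small wrapper that re-runs the startup step with a growing delay. The sketch below is in Python purely for illustration and is not Coherence-specific: `retry_startup`, its parameters, and the `start` callable are hypothetical names, and the real startup step (for example, whatever launches Tomcat) would have to be plugged in by the reader.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_startup(
    start: Callable[[], T],
    attempts: int = 3,
    initial_delay: float = 5.0,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call start() until it succeeds or the attempts are exhausted.

    `start` is a hypothetical callable that raises on a failed startup.
    The delay between attempts is multiplied by `backoff` each time.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return start()
        except Exception as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the last failure
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.0f} s")
            sleep(delay)
            delay *= backoff
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the wrapper can be exercised without real waiting. Whether a retry is preferable to a longer timeout depends on how long the departing member takes to leave the cluster.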

  • Message Bridge to Cluster

    Hey guys,
    I'm trying to configure a message bridge to a cluster of servers. When configuring the destination portion of the bridge, I entered server 1's connection URL and messages went to that one server. I shut that server down to see whether the bridge would send messages to any of the other servers, and it didn't.
    So in the destination bridge connection URL I entered the addresses of all the servers in the cluster and got the following error when restarting the JMS server:
    <Feb 22, 2010 2:00:09 PM EST> <Warning> <Connector> <BEA-190032> << eis/jms/WLSConnectionFactoryJNDIXA > ResourceAllocationException thrown by resource adapter on call to ManagedConnectionFactory.createManagedConnection(): "javax.resource.ResourceException: ConnectionFactory: failed to get initial context (InitialContextFactory =weblogic.jndi.WLInitialContextFactory, url = t3://localhost:9015,t3://localhost:9016,t3://localhost:9017, user name = null) ">
    Do I need a separate bridge for each server, or how would I configure the bridge to pick a server at random to send messages to?
    In the cluster there is a distributed queue module targeted to all the machines in the cluster.
    Thanks for viewing and any helpful replies

    Changing the connection URL to the cluster URL address fixed that problem :)
    But it's still not working :(
    I think the problem now is my cluster JMS server. It's currently targeted to the first server in the cluster, which is the one I'm shutting down. From the articles I've read, the JMS server is supposed to move to another server within the cluster?
    Help!
    Thanks
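On the connection URL: the URL in the error message repeats `t3://` before every host. The follow-up does not show the "cluster URL" that fixed it, so the sketch below assumes the usual WebLogic cluster address form, in which the scheme appears once and the `host:port` pairs are comma-separated. `cluster_url` is a hypothetical helper written in Python for illustration; the hosts and ports are the ones from the error message.

```python
from typing import Iterable, Tuple


def cluster_url(members: Iterable[Tuple[str, int]], scheme: str = "t3") -> str:
    """Join (host, port) pairs into one URL with a single scheme prefix."""
    pairs = [f"{host}:{port}" for host, port in members]
    if not pairs:
        raise ValueError("at least one cluster member is required")
    return f"{scheme}://" + ",".join(pairs)


# Hosts and ports taken from the error message above.
members = [("localhost", 9015), ("localhost", 9016), ("localhost", 9017)]
print(cluster_url(members))  # t3://localhost:9015,localhost:9016,localhost:9017
```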

  • Failing to create HA nfs storage on a shared 3310 HW Raid cluster 3.2

    Hi,
    I'm testing clustering on a couple of v240s running identical Sol10 10/08 and Sun Cluster 3.2. In trying things out, I may have messed up the cluster, and I may want to back out the cluster and start over. Is that possible, or do I need to install Solaris fresh?
    But first, the problem. I have the array connected to both machines and working. I mount 1 LUN on /global/nfs using the device /dev/did/dsk/d4s0. Then I ran the commands:
    # clrt register SUNW.nfs
    # clrt register SUNW.HAStoragePlus
    # clrt list -v
    Resource Type Node List
    SUNW.LogicalHostname:2 <All>
    SUNW.SharedAddress:2 <All>
    SUNW.nfs:3.2 <All>
    SUNW.HAStoragePlus:6 <All>
    # clrg create -n stnv240a,stnv240b -p PathPrefix=/global/nfs/admin nfs-rg
    I enabled them just now so:
    # clrg status
    Cluster Resource Groups ===
    Group Name   Node Name   Suspended   Status
    nfs-rg       stnv240a    No          Online
                 stnv240b    No          Offline
    Then:
    # clrslh create -g nfs-rg cluster
    # clrslh status
    Cluster Resources ===
    Resource Name   Node Name   State     Status Message
    cluster         stnv240a    Online    Online - LogicalHostname online.
                    stnv240b    Offline   Offline
    I'm guessing that 'b' is offline because it's the backup.
    Finally, I get:
    # clrs create -t HAStoragePlus -g nfs-rg -p AffinityOn=true -p FilesystemMountPoints=/global/nfs nfs-stor
    clrs: stnv240b - Invalid global device path /dev/did/dsk/d4s0 detected.
    clrs: (C189917) VALIDATE on resource nfs-stor, resource group nfs-rg, exited with non-zero exit status.
    clrs: (C720144) Validation of resource nfs-stor in resource group nfs-rg on node stnv240b failed.
    clrs: (C891200) Failed to create resource "nfs-stor".
    On stnv240a:
    # df -h /global/nfs
    Filesystem size used avail capacity Mounted on
    /dev/did/dsk/d4s0 49G 20G 29G 41% /global/nfs
    and on stnv240b:
    # df -h /global/nfs
    Filesystem size used avail capacity Mounted on
    /dev/did/dsk/d4s0 49G 20G 29G 41% /global/nfs
    Any help? Like I said, this is a test setup. I've started over once. So I can start over if I did something irreversible.

    I still have the issue. I reinstalled from scratch and installed the cluster. Then I did the following:
    $ vi /etc/default/nfs
    GRACE_PERIOD=10
    $ ls /global//nfs
    $ mount /global/nfs
    $ df -h
    Filesystem size used avail capacity Mounted on
    /dev/global/dsk/d4s0 49G 20G 29G 41% /global/nfs
    $ clrt register SUNW.nfs
    $ clrt register SUNW.HAStoragePlus
    $ clrt list -v
    Resource Type Node List
    SUNW.LogicalHostname:2 <All>
    SUNW.SharedAddress:2 <All>
    SUNW.nfs:3.2 <All>
    SUNW.HAStoragePlus:6 <All>
    $ clrg create -n stnv240a,stnv240b -p PathPrefix=/global/nfs/admin nfs-rg
    $ clrslh create -g nfs-rg patience
    clrslh: IP Address 204.155.141.146 is already plumbed at host: stnv240b
    $ grep cluster /etc/hosts
    204.155.141.140 stnv240a stnv240a.mns.qintra.com # global - cluster
    204.155.141.141 cluster cluster.mns.qintra.com # cluster virtual address
    204.155.141.146 stnv240b stnv240b.mns.qintra.com patience patience.mns.qintra.com # global v240 - cluster test
    $ clrslh create -g nfs-rg cluster
    $ clrs create -t HAStoragePlus -g nfs-rg -p AffinityOn=true -p FilesystemMountPoints=/global/nfs nfs-stor
    clrs: stnv240b - Failed to analyze the device special file associated with file system mount point /global/nfs: No such file or directory.
    clrs: (C189917) VALIDATE on resource nfs-stor, resource group nfs-rg, exited with non-zero exit status.
    clrs: (C720144) Validation of resource nfs-stor in resource group nfs-rg on node stnv240b failed.
    clrs: (C891200) Failed to create resource "nfs-stor".
    Now, on the second machine (stnv240b), /dev/global does not exist, but the file system mounts anyway. I guess that's cluster magic?
    $ cat /etc/vfstab
    /dev/global/dsk/d4s0 /dev/global/dsk/d4s0 /global/nfs ufs 1 yes global
    $ df -h /global/nfs
    Filesystem size used avail capacity Mounted on
    /dev/global/dsk/d4s0 49G 20G 29G 41% /global/nfs
    $ ls -l /dev/global
    /dev/global: No such file or directory
    I followed the other thread and ran devfsadm and scgdevs.
    One other thing I noticed: both nodes show the global devices mount as node@1:
    /dev/md/dsk/d6 723M 3.5M 662M 1% /global/.devices/node@1
    /dev/md/dsk/d6 723M 3.5M 662M 1% /global/.devices/node@1
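The "already plumbed" message for `patience` earlier in this follow-up is consistent with the /etc/hosts lines shown by the grep: `patience` is listed as an alias on stnv240b's own address, while `cluster` has an address of its own. A small Python sketch (the helper names are made up for illustration) makes the comparison explicit:

```python
from collections import defaultdict

# The three /etc/hosts lines shown by `grep cluster /etc/hosts` above.
HOSTS = """\
204.155.141.140 stnv240a stnv240a.mns.qintra.com # global - cluster
204.155.141.141 cluster cluster.mns.qintra.com # cluster virtual address
204.155.141.146 stnv240b stnv240b.mns.qintra.com patience patience.mns.qintra.com # global v240 - cluster test
"""


def names_by_address(hosts_text):
    """Map each IP address to the host names listed for it."""
    table = defaultdict(list)
    for line in hosts_text.splitlines():
        entry = line.split("#", 1)[0].split()  # drop the trailing comment
        if len(entry) >= 2:
            table[entry[0]].extend(entry[1:])
    return dict(table)


def shares_node_address(name, node, table):
    """True if `name` and `node` are listed on the same address."""
    return any(name in names and node in names for names in table.values())


table = names_by_address(HOSTS)
for candidate in ("patience", "cluster"):
    clash = shares_node_address(candidate, "stnv240b", table)
    print(f"{candidate}: shares stnv240b's address = {clash}")
```

This only explains the first error; the later HAStoragePlus validation failure on stnv240b is a separate issue.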

  • Failover Cluster - Hyperv Networking

    Hi All,
    I am in the middle of a private cloud project and I am having a few issues with my failover clustering.
    I created my failover cluster, which ran successfully but later complained about my networking.
    This is a fully virtualized environment with 3 Hyper-V hosts running Windows Server 2012 R2.
    These are my network settings.
    I have put the main and cluster IP addresses as shown above on vmnet 3 and 4 (VMware Workstation).
    I am not sure how to configure the Hyper-V virtual switch network, though.
    So I have a feeling this could be the cause of the error I get when I try to validate my failover cluster. The error is shown below.
    Regards to all

    "Is there any other way I can just create a two-node failover cluster in VMware Workstation without having to use physical servers?"
    Absolutely! You will need at least three VMs. One VM will be your Active Directory domain controller and your shared storage server (not a best practice, but it seems like you are creating a lab, so it will work). Then you will need two VMs to be the nodes of the cluster. On the first machine you will need to set up either iSCSI or SMB; SMB is a lot easier.
    http://blogs.technet.com/b/josebda/archive/2013/08/16/3587652.aspx provides a step-by-step guide for setting up a configuration that is more complex than you want, but it should help you get started. He uses a three-node cluster, but two will do. He is also doing things under Hyper-V, so some things will have to be translated to the VMware environment. Of course, if you use Hyper-V on a Windows 8.1 system instead of VMware Workstation, more will apply.
    . : | : . : | : . tim

Maybe you are looking for

  • Problem in creation of promotion

    Dear Friends, We would like to create promotion and the scenario is. We have 4 products. if customer purchase 10 carton in any combination of these 4 products. customer will be eligible for one free carton of any of these 4 product. how can we create

  • How do I set an appointment that is every 4 weeks, or the second Thursday of the month?

    How do I set a recurring event that is every 4 weeks or the second Thursday of every month?

  • HP C4385 All In One is the worst printer I've ever bought

    I will never buy another HP printer or other HP product for the rest of my life. I purchased a HP C4385 Photosmart All In One printer and it has been absolutely terrible from day one. First when installing it screwed up my LAN so bad that I had to sp

  • How to create email users with open directory?

    I'm trying to used a mac mini as a mail server for my domains. It works well for SMTP server/gateway for multiple locally networked systems running Lion, Mountain Lion and Maverick. The server is running Mavericks 10.9.2 server 3.1.1. I need to add e

  • FCE doesn't recognize Panasonic VTR

    For the first time I have filmed HDV 1080i on a Panasonic HVX 200 P2 card; while trying to capture in FCE, the following message comes up: plug in VTR. It does not recognize the camera. I have an iMac 20". Can anyone help please?