Unexpected cluster heartbeat

I'm seeing the following errors when starting several nodes of a cluster using scripts. It only happens occasionally; most of the time the startup works fine. There is also no problem when the nodes are started manually, one by one.
The cluster consists of 2 hosts, each running multiple programs (JVMs), as indicated in the log.
Could someone explain what happened and how to fix it?
Thanks!
2010-09-23 19:29:48,161 14496 [Logger@559022270 3.5.3/465] DEBUG Coherence - 2010-09-23 19:29:48.161/14.743 Oracle Coherence GE 3.5.3/465 <D5> (thread=Cluster,member=n/a): Service Cluster joined the cluster with senior service member n/a
2010-09-23 19:29:48,382 14717 [Logger@9250185 3.5.3/465] INFO Coherence - 2010-09-23 19:29:48.382/14.964 Oracle Coherence GE 3.5.3/465 <Info> (thread=Cluster,member=n/a): This Member(Id=7, Timestamp=2010-09-23 19:29:48.207, Address=10.253.97.133:16001, MachineId=38533, Location=site:mytest.com,machine:host1,process:18482, Role=Program1, Edition=Grid Edition, Mode=Development, CpuCount=2, SocketCount=2) joined cluster "test1" with senior Member(Id=12, Timestamp=2010-09-23 19:15:01.264, Address=10.253.97.133:16002, MachineId=38533, Location=site:mytest.com,machine:host1,process:15142, Role=Program2, Edition=Grid Edition, Mode=Development, CpuCount=2, SocketCount=2)
2010-09-23 19:29:49,157 15492 [Logger@9250185 3.5.3/465] WARN Coherence - 2010-09-23 19:29:49.157/15.739 Oracle Coherence GE 3.5.3/465 <Warning> (thread=Cluster, member=n/a): Notifying the senior Member(Id=12, Timestamp=2010-09-23 19:15:01.264, Address=10.253.97.133:16002, MachineId=38533, Location=site:mytest.com,machine:host1,process:15142, Role=Program2) of an unexpected cluster heartbeat from Member(Id=13, Timestamp=2010-09-23 19:15:03.524, Address=10.253.97.134:16001, MachineId=38534, Location=site:mytest.com,machine:host2,process:24439, Role=Program2)
2010-09-23 19:29:57,729 24064 [Logger@9250185 3.5.3/465] WARN Coherence - 2010-09-23 19:29:57.729/24.311 Oracle Coherence GE 3.5.3/465 <Warning> (thread=Cluster, member=n/a): Notifying the senior Member(Id=12, Timestamp=2010-09-23 19:15:01.264, Address=10.253.97.133:16002, MachineId=38533, Location=site:mytest.com,machine:host1,process:15142, Role=Program2) of an unexpected cluster heartbeat from Member(Id=14, Timestamp=2010-09-23 19:15:16.618, Address=10.253.97.133:16003, MachineId=38533, Location=site:mytest.com,machine:host1,process:15734, Role=Program3)
2010-09-23 19:30:13,808 40143 [Logger@9250185 3.5.3/465] WARN Coherence - 2010-09-23 19:30:13.807/40.390 Oracle Coherence GE 3.5.3/465 <Warning> (thread=Cluster, member=n/a): Notifying the senior Member(Id=12, Timestamp=2010-09-23 19:15:01.264, Address=10.253.97.133:16002, MachineId=38533, Location=site:mytest.com,machine:host1,process:15142, Role=Program2) of an unexpected cluster heartbeat from Member(Id=15, Timestamp=2010-09-23 19:15:26.421, Address=10.253.97.134:16002, MachineId=38534, Location=site:mytest.com,machine:host2,process:24991, Role=Program3)
2010-09-23 19:30:18,453 44788 [Logger@9250185 3.5.3/465] ERROR Coherence - 2010-09-23 19:30:18.453/45.035 Oracle Coherence GE 3.5.3/465 <Error> (thread=main, member=n/a): Error while starting cluster: com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(Id=0, Name=Cluster, Type=Cluster
MemberSet=ServiceMemberSet(
OldestMember=n/a
ActualMemberSet=MemberSet(Size=2, BitSetCount=2
Member(Id=7, Timestamp=2010-09-23 19:29:48.207, Address=10.253.97.133:16001, MachineId=38533, Location=site:mytest.com,machine:host1,process:18482, Role=Program1)
Member(Id=12, Timestamp=2010-09-23 19:15:01.264, Address=10.253.97.133:16002, MachineId=38533, Location=site:mytest.com,machine:host1,process:15142, Role=Program2)
MemberId/ServiceVersion/ServiceJoined/ServiceLeaving
7/3.5/Thu Sep 23 19:29:48 UTC 2010/false,
12/3.5/Thu Sep 23 19:15:01 UTC 2010/false
at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onStartupTimeout(Grid.CDB:6)
at com.tangosol.coherence.component.util.daemon.queueProcessor.Service.start(Service.CDB:28)
at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.start(Grid.CDB:38)
at com.tangosol.coherence.component.net.Cluster.onStart(Cluster.CDB:366)
at com.tangosol.coherence.component.net.Cluster.start(Cluster.CDB:11)
at com.tangosol.coherence.component.util.SafeCluster.startCluster(SafeCluster.CDB:3)
at com.tangosol.coherence.component.util.SafeCluster.restartCluster(SafeCluster.CDB:7)
at com.tangosol.coherence.component.util.SafeCluster.ensureRunningCluster(SafeCluster.CDB:27)
at com.tangosol.coherence.component.util.SafeCluster.start(SafeCluster.CDB:2)
at com.tangosol.net.CacheFactory.ensureCluster(CacheFactory.java:998)
at com.tangosol.net.DefaultConfigurableCacheFactory.ensureService(DefaultConfigurableCacheFactory.java:915)
at com.oracle.coherence.environment.extensible.ExtensibleEnvironment.ensureService(ExtensibleEnvironment.java:374)
at com.tangosol.net.DefaultConfigurableCacheFactory.ensureCache(DefaultConfigurableCacheFactory.java:877)
at com.tangosol.net.DefaultConfigurableCacheFactory.configureCache(DefaultConfigurableCacheFactory.java:1088)
at com.tangosol.net.DefaultConfigurableCacheFactory.ensureCache(DefaultConfigurableCacheFactory.java:304)
at com.tangosol.net.CacheFactory.getCache(CacheFactory.java:735)
at com.tangosol.net.CacheFactory.getCache(CacheFactory.java:712)
2010-09-23 19:30:18,456 44791 [Logger@9250185 3.5.3/465] ERROR Coherence - 2010-09-23 19:30:18.456/45.038 Oracle Coherence GE 3.5.3/465 <Error> (thread=Cluster, member=n/a): validatePolls: This service timed-out due to unanswered handshake request. Manual intervention is required to stop the members that have not responded to this Poll
PollId=1, active
InitTimeMillis=1285270188378
Service=Cluster (0)
RespondedMemberSet=[]
LeftMemberSet=[]
RemainingMemberSet=[12]
Edited by: user10049765 on Oct 7, 2010 2:23 PM

Any suggestion?
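One thing worth trying, since the failures only show up under scripted startup: stagger the launches so each member finishes its cluster join before the next one starts, instead of launching all JVMs at the same instant. A minimal sketch; the `start_node` function, node names, and delay are placeholders, so substitute your real Coherence launch command:

```shell
#!/bin/sh
# Staggered cluster startup sketch. start_node is a placeholder --
# replace the echo with your real Coherence launch command.
DELAY=1   # seconds between launches; in practice use something like 10-30

start_node() {
  # e.g. java -Dtangosol.coherence.role="$1" ... com.example.Main &
  echo "started $1"
}

for node in program1 program2 program3; do
  start_node "$node"
  sleep "$DELAY"   # give each member time to complete its cluster join
done
```

Whether this helps depends on whether the problem really is a join race between simultaneously starting members, but it is a cheap experiment.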

Similar Messages

  • Cluster heartbeat message in Coherence 3.6

    Hi,
    We recently upgraded to Coherence 3.6 in our production environment. Occasionally in the Coherence cluster, I see the following happening.
    2010-08-18 18:27:48.953/28.828 Oracle Coherence GE 3.6.0.0 <Error> (thread=Cluster, member=13): Received cluster heartbeat from the senior Member(Id=1, Timestamp=2010-08-18 18:03:43.927, Address=10.31.151.246:9000, MachineId=33526, Location=site:xxx.com,machine:machine1,process:2665,member:coherence_cache_server-0, Role=cache-server) that does not contain this Member(Id=13, Timestamp=2010-08-18 18:25:20.158, Address=10.30.71.60:8092, MachineId=21308, Location=site:xxx.com,machine:machine2,process:3540,member:CoherenceCommandLineTool, Role=cache-client); stopping cluster service.
    Whenever any node (for example, the Coherence command-line tool) tries to join the cluster, it gets kicked out immediately. Nodes keep leaving the cluster and rejoining; this happens constantly. Pasting another log snippet:
    2010-08-18 14:23:22.458/-13596.00-214 Oracle Coherence GE 3.6.0.0 <D5> (thread=Cluster, member=7): Member 5 joined Service Management with senior member 1
    2010-08-18 14:23:36.110/-13582.00-562 Oracle Coherence GE 3.6.0.0 <D5> (thread=Cluster, member=7): Member 5 joined Service DistributedCache with senior member 1
    2010-08-18 14:23:37.811/-13580.00-861 Oracle Coherence GE 3.6.0.0 <D5> (thread=Cluster, member=7): MemberLeft notification for Member(Id=5, Timestamp=2010-08-18 14:23:25.924, Address=10.30.71.60:8092, MachineId=21308, Location=site:xxx.com,machine:machine2,process:5936,member:CoherenceCommandLineTool, Role=cache-client, PublisherSuccessRate=0.9166, ReceiverSuccessRate=1.0, PauseRate=0.0, Threshold=1976, Paused=false, Deferring=false, OutstandingPackets=0, DeferredPackets=0, ReadyPackets=0, LastIn=845ms, LastOut=854ms, LastSlow=n/a) received from Member(Id=2, Timestamp=2010-08-18 12:34:15.068, Address=10.31.151.246:9001, MachineId=33526, Location=site:xxx.com,machine:machine1,process:2667,member:coherence_cache_server-1, Role=cache-server, PublisherSuccessRate=0.8568, ReceiverSuccessRate=0.5934, PauseRate=0.0021, Threshold=1878, Paused=false, Deferring=false, OutstandingPackets=0, DeferredPackets=0, ReadyPackets=0, LastIn=24744ms, LastOut=24753ms, LastSlow=n/a)
    2010-08-18 14:23:37.811/-13580.00-861 Oracle Coherence GE 3.6.0.0 <D5> (thread=Cluster, member=7): Member 5 left service Management with senior member 1
    2010-08-18 14:23:37.811/-13580.00-860 Oracle Coherence GE 3.6.0.0 <D5> (thread=Cluster, member=7): Member 5 left service DistributedCache with senior member 1
    Just to give a brief background on our environment: we have 3 Linux hosts configured to form the cluster. Two of them run 2 cache server nodes each, for a total of 4 cache servers in the cluster, plus about 7 storage-disabled client nodes.
    Any clues as to why this is happening? Do we need to configure anything on the cluster? All the ports on which the nodes communicate are open for UDP/TCP, inbound and outbound.
    Appreciate all help on this.
    -Chandini

    Here is the log snippet on member 2 for around the same timestamp when member 5 was removed
    2010-08-18 14:23:26.269/6588.113 Oracle Coherence GE 3.6.0.0 <D5> (thread=Cluster, member=2): Member 5 joined Service Management with senior member 1
    2010-08-18 14:23:39.922/6601.766 Oracle Coherence GE 3.6.0.0 <D5> (thread=Cluster, member=2): Member 5 joined Service DistributedCache with senior member 1
    2010-08-18 14:23:41.607/6603.451 Oracle Coherence GE 3.6.0.0 <Warning> (thread=Cluster, member=2): Failed to reach address /10.30.71.60 within the IpMonitor timeout. Members [Member(Id=5, Timestamp=2010-08-18 14:23:25.924, Address=10.30.71.60:8092, MachineId=21308, Location=site:xxx.com,machine:machine2,process:5936,member:CoherenceCommandLineTool, Role=cache-client)] are suspect.
    2010-08-18 14:23:41.608/6603.452 Oracle Coherence GE 3.6.0.0 <Warning> (thread=Cluster, member=2): Timed-out members MemberSet(Size=1, BitSetCount=2
    Member(Id=5, Timestamp=2010-08-18 14:23:25.924, Address=10.30.71.60:8092, MachineId=21308, Location=site:xxx.com,machine:machine2,process:5936,member:CoherenceCommandLineTool, Role=cache-client)
    ) will be removed.
    2010-08-18 14:23:41.608/6603.452 Oracle Coherence GE 3.6.0.0 <D5> (thread=Cluster, member=2): Member 5 left service Management with senior member 1
    2010-08-18 14:23:41.609/6603.453 Oracle Coherence GE 3.6.0.0 <D5> (thread=Cluster, member=2): Member 5 left service DistributedCache with senior member 1
    I will work on getting the logs on the other members for the first message and post it here.
    Thanks.
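One detail worth checking for the "Failed to reach address within the IpMonitor timeout" warning above: Coherence 3.6 added machine-level death detection (the IpMonitor), which relies on ICMP reachability (Java's `InetAddress.isReachable`). A firewall that permits the cluster's UDP/TCP ports but drops ICMP can therefore evict members exactly this way. A trivial sketch that just prints the manual checks to run from each machine (hostnames are placeholders for your cluster hosts):

```shell
# Print the ICMP reachability checks to run between cluster machines.
# Hostnames are placeholders -- substitute your actual hosts, and run
# the printed commands from every machine toward every other machine.
for host in machine1 machine2; do
  echo "ping -c 3 -W 2 $host"
done
```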

  • Unicast cluster - heartbeat message failure messages

    We are using unicast messaging mode, and I see the following messages:
    ####<Jul 9, 2010 12:46:56 AM PDT> <Info> <Cluster> <anaeur30> <WL10MP2-ServiceSTServer6> <[ACTIVE] ExecuteThread: '45' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1278661616559> <BEA-000112> <Removing WL10MP2-ServiceSTServer1 jvmid:6806396782256322086S:anaeur10:[7033,7033,-1,-1,-1,-1,-1]:anaeur10:7033,anaeur10:7035,anaeur20:7033,anaeur20:7035,anaeur30:7033,anaeur30:7035,anaeur50:7033,anaeur50:7035:WL10MP2-ServiceTier:WL10MP2-ServiceSTServer1 from cluster view due to timeout.>
    ####<Jul 9, 2010 12:55:36 AM PDT> <Info> <Cluster> <anaeur30> <WL10MP2-ServiceSTServer6> <[ACTIVE] ExecuteThread: '34' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1278662136552> <BEA-000112> <Removing WL10MP2-ServiceSTServer2 jvmid:-2694311272134716565S:anaeur10:[7035,7035,-1,-1,-1,-1,-1]:anaeur10:7033,anaeur10:7035,anaeur20:7033,anaeur20:7035,anaeur30:7033,anaeur30:7035,anaeur50:7033,anaeur50:7035:WL10MP2-ServiceTier:WL10MP2-ServiceSTServer2 from cluster view due to timeout.>
    During the same time frame, I see lost-multicast messages on all the instances for about 20 minutes. What could be the problem? Why am I seeing multicast messages when using unicast? My config.xml has multicast-related entries for each server, but how would those take effect? Is that an issue? We see servers dropping out of the cluster frequently.
    000115> <Lost 1 multicast message(s).>
    ####<Jul 9, 2010 12:46:42 AM PDT> <Info> <Cluster> <anaeur10> <WL10MP2-ServiceSTServer2> <weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1278661602751> <BEA-000115> <Lost 1 multicast message(s).>
    ####<Jul 9, 2010 12:46:46 AM PDT> <Info> <Cluster> <anaeur10> <WL10MP2-ServiceSTServer2> <weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1278661606548> <BEA-000115> <Lost 2 multicast message(s).>
    ####<Jul 9, 2010 12:47:04 AM PDT> <Info> <Cluster> <anaeur10> <WL10MP2-ServiceSTServer2> <weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1278661624185> <BEA-000115> <Lost 2 multicast message(s).>
    ####<Jul 9, 2010 12:48:40 AM PDT> <Info> <Cluster> <anaeur10> <WL10MP2-ServiceSTServer2> <weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1278661720809> <BEA-000115> <Lost 2 multicast message(s).>
    ####<Jul 9, 2010 12:54:14 AM PDT> <Info> <Cluster> <anaeur10> <WL10MP2-ServiceSTServer2> <weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1278662054823> <BEA-000115> <Lost 2 multicast message(s).>
    ####<Jul 9, 2010 12:54:14 AM PDT> <Info> <Cluster> <anaeur10> <WL10MP2-ServiceSTServer2> <weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1278662054827> <BEA-000115> <Lost 1 multicast message(s).>
    ####<Jul 9, 2010 12:54:14 AM PDT> <Info> <Cluster> <anaeur10> <WL10MP2-ServiceSTServer2> <weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1278662054827> <BEA-000115> <Lost 2 multicast message(s).>

    SJ,
    Thanks, that's the perfect explanation I was looking for. We always create the cluster from the console, and it could be that we used MULTICAST messaging mode in the past, hence the entries in config.xml. What made me raise the question "will UNICAST or MULTICAST be used?" is that whenever we experience a server dropping out of the cluster, I see the following message written to each managed server log. Ideally, the following should only be written to the log if multicast messaging mode is in operation, right?
    <weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1276490260768> <BEA-000115> <Lost 2 multicast message(s).>
    <weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1276490260768> <BEA-000115> <Lost 2 multicast message(s).>
    <weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1276490261355> <BEA-000115> <Lost 2 multicast message(s).>
    <weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1276490261355> <BEA-000115> <Lost 2 multicast message(s).>
    The above message is not written all the time, only when a server is removed from the cluster group. Please be informed that I have enabled unicast debug mode. Will unicast also write messages like the above when a heartbeat message is lost?
    To trace our issue further, I will have to manually remove the multicast references from config.xml and monitor for some time. It's still a mystery why the servers are dropping out of the cluster. Sometimes, soon after instances drop out, I can see the drop-out frequency shown as "Rarely", and after a week or so the members regroup with a different group leader. Are you aware of any issues with unicast messaging mode in WL10 MP2?
    Is it a good idea to test multicast?
    Thanks a lot for your time.
    -RR
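As a triage aid for threads like this, it can help to tally the cluster message IDs (BEA-000115 "lost multicast" vs BEA-000112 "removed from cluster view") per log, to see whether losses spike before a removal. A sketch; the inline sample lines are stand-ins for your managed server logs:

```shell
# Tally cluster-related WebLogic message IDs from log lines.
# The inline sample stands in for the real managed server logs.
cat > wl-sample.log <<'EOF'
####<Jul 9, 2010 12:46:42 AM PDT> <Info> <Cluster> <anaeur10> <WL10MP2-ServiceSTServer2> ... <BEA-000115> <Lost 1 multicast message(s).>
####<Jul 9, 2010 12:46:46 AM PDT> <Info> <Cluster> <anaeur10> <WL10MP2-ServiceSTServer2> ... <BEA-000115> <Lost 2 multicast message(s).>
####<Jul 9, 2010 12:46:56 AM PDT> <Info> <Cluster> <anaeur30> <WL10MP2-ServiceSTServer6> ... <BEA-000112> <Removing WL10MP2-ServiceSTServer1 ... due to timeout.>
EOF
grep -o 'BEA-0001[0-9]*' wl-sample.log | sort | uniq -c
```

Pointing the grep at all managed server logs at once (`grep -o 'BEA-0001[0-9]*' *.log`) gives a quick picture of which servers were losing heartbeats around each removal.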

  • The panic protocol and Coherence 3.5

    All,
    We just upgraded from 3.3.1 to 3.5 but I'm having trouble forming a cluster in multi-server environments. Our config files were developed against older versions of Coherence and I had a lot of trouble with them at first, some of which is detailed here: Config file problem with new Coherence 3.5
    The problem now is that we have 2 standalone nodes and 2 application nodes (WebLogic) spread across 2 physical servers (1 standalone and 1 application on each box). Previously (Coherence 3.3.1), they all formed one happy cluster of 4 members. Now (Coherence 3.5), they form separate clusters: each physical machine makes a cluster of 2 members. At startup, I can see the 2-node clusters form. Some time later (not immediately), I see the "unexpected cluster heartbeat" message warning about a heartbeat from the other physical server. Clearly the members on the different servers can communicate to some degree if they get these unexpected heartbeats, so why don't they form a single cluster in the first place?
    If I understand the config correctly, we're using a TTL of 4, the default. I ran the multicast test, and a TTL of 1 worked as well. I think the join timeout is 30000.
    When the standalone node starts, it logs a TTL of 4 and the expected cluster address and port.
    One wrinkle in the config is that there are 2 applications deployed to the same weblogic jvm that both use Coherence. They are in separate classloaders and use unique cluster ports. This hasn't been a problem in the past. Now, however, my app is Coherence 3.5 and the other one is still 3.3.1. The Coherence jars are not shared and the startup params apply to both applications.
    In the past I've seen errors where 2 nodes weren't using the same coherence version, same cluster name, etc. but I don't see anything like that now.
    thanks
    john

    Hi John,
    The clustering technologies did not change between 3.3 and 3.5, so the fact that you could establish a multicast-based cluster in 3.3 but not in 3.5 is quite odd. My initial guess would be that your network may be blocking certain multicast address/port ranges. Are you using the same multicast address and port you'd successfully used in 3.3? Also, please use that address and port when running the multicast test, to make the test as close as possible to the medium on which Coherence is trying to operate.
    If none of these suggestions resolves the issue, can you please post the following:
    - multicast test output from all nodes running the test concurrently
    - coherence logs from all nodes, including startup, and panic
    - coherence operational configuration
    Regarding the mix of Coherence 3.3 and 3.5 in the same JVM: so long as they are classloader-isolated and running on a different multicast address/port, you should be fine. Note that I'm suggesting both the address and the port be different; some OSes (Linux among them) have issues with not taking the port into consideration during multicast packet delivery. It also wouldn't hurt to try starting 3.5 without the 3.3 app running, just to ensure that it isn't causing your troubles in some unforeseen way.
    thanks,
    Mark
    Oracle Coherence
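For reference, the multicast test Mark mentions is the stock Coherence utility. A sketch of the invocation (the classpath, address, and port below are placeholders; use the exact address:port from your operational config, run it on all nodes concurrently, and verify the flags against your 3.5 documentation):

```shell
# Assemble the Coherence multicast test invocation (flags per the
# standard com.tangosol.net.MulticastTest utility -- verify against
# your Coherence 3.5 docs). Echoed here; drop the echo to run it.
GROUP="237.0.0.1:9000"   # placeholder -- your cluster multicast address:port
TTL=4                    # match the TTL your cluster is configured with
CMD="java -cp coherence.jar com.tangosol.net.MulticastTest -group $GROUP -ttl $TTL"
echo "$CMD"
```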

  • Aggregates, VLAN's, Jumbo-Frames and cluster interconnect opinions

    Hi All,
    I'm reviewing my options for a new cluster configuration and would like the opinions of people with more expertise than myself out there.
    What I have in mind as follows:
    2 x X4170 servers with 8 x NIC's in each.
    On each 4170 I was going to configure 2 aggregates with 3 nics in each aggregate as follows
    igb0 device in aggr1
    igb1 device in aggr1
    igb2 device in aggr1
    igb3 stand-alone device for iSCSI network
    e1000g0 device in aggr2
    e1000g1 device in aggr2
    e1000g2 device in aggr3
    e1000g3 stand-alone device of iSCSI network
    Now, on top of these aggregates, I was planning on creating VLAN interfaces which will allow me to connect to our two "public" network segments and for the cluster heartbeat network.
    I was then going to configure the vlan's in an IPMP group for failover. I know there are some questions around that configuration in the sense that IPMP will not detect a nic failure if a NIC goes offline in the aggregate, but I could monitor that in a different manner.
    At this point, my questions are:
    [1] Are vlan's, on top of aggregates, supported withing Solaris Cluster? I've not seen anything in the documentation to mention that it is, or is not for that matter. I see that vlan's are supported, inluding support for cluster interconnects over vlan's.
    Now with the standalone interface I want to enable jumbo frames, but I've noticed that the igb.conf file has a global setting for all nic ports, whereas I can enable it for a single nic port in the e1000g.conf kernel driver. My questions are as follows:
    [2] What is the general feeling with mixing mtu sizes on the same lan/vlan? Ive seen some comments that this is not a good idea, and some say that it doesnt cause a problem.
    [3] If the underlying nic, igb0-2 (aggr1) for example, has 9k mtu enabled, I can force the mtu size (1500) for "normal" networks on the vlan interfaces pointing to my "public" network and cluster interconnect vlan. Does anyone have experience of this causing any issues?
    Thanks in advance for all comments/suggestions.

    For 1) the question is really "Do I need to enable jumbo frames if I don't want to use them (on neither the public nor the private network)?" - the answer is no.
    For 2) each cluster needs to have its own separate set of VLANs.
    Greets
    Thorsten

  • Install Guide - SQL Server 2014, Failover Cluster, Windows 2012 R2 Server Core

    I am looking for anyone who has a guide with notes about installing a two-node, multi-subnet failover cluster for SQL Server 2014 on the Server Core edition.

    Hi KamarasJaranger,
    According to your description, you want to configure a SQL Server 2014 multi-subnet failover cluster on Windows Server 2012 R2. Below are the overall steps for the configuration; for the detailed steps, please download and refer to the PDF file.
    1. Add the required Windows features (.NET Framework 3.5 Features, Failover Clustering and Multipath I/O).
    2. Discover target portals.
    3. Connect targets and configure multipathing.
    4. Initialize and format the disks.
    5. Verify the storage replication process.
    6. Run the Failover Cluster Validation Wizard.
    7. Create the Windows Server 2012 R2 multi-subnet cluster.
    8. Tune the cluster heartbeat settings.
    9. Install SQL Server 2014 on the multi-subnet failover cluster.
    10. Add a node to the SQL Server 2014 multi-subnet cluster.
    11. Tune the SQL Server 2014 failover clustered instance DNS settings.
    12. Test application connectivity.
    Regards,
    Michelle Li

  • Multiple Senior Cluster Members?

    Hi Guys,
    We've had a few nodes kicked out of one of our production clusters all with messages similar to this:
    ERROR 2008-04-21 18:17:05.753 Oracle Coherence GE 3.3.1/389p1 <Error> (thread=Cluster, member=29): Received cluster heartbeat from the senior Member(Id=2, Timestamp=2008-04-18 11:07:21.948, Address=172.21.205.151:8089, MachineId=29847, Location=process:17367@trulxfw0006,member:trulxfw0006-2) that does not contain this Member(Id=29, Timestamp=2008-04-18 11:07:34.491, Address=172.21.205.148:8092, MachineId=29844, Location=process:17098@trulxfw0003,member:trulxfw0003-5); stopping cluster service.
    DEBUG 2008-04-21 18:17:05.753 Oracle Coherence GE 3.3.1/389p1 <D5> (thread=Cluster, member=29): Service Cluster left the cluster
    The logs on the senior member (2) are interesting though:
    DEBUG 2008-04-21 18:17:05.601 Oracle Coherence GE 3.3.1/389p1 <D5> (thread=Cluster, member=2): Member 29 left service Management with senior member 2
    DEBUG 2008-04-21 18:17:05.602 Oracle Coherence GE 3.3.1/389p1 <D5> (thread=Cluster, member=2): Member 29 left service WriteQueueSync with senior member 3
    DEBUG 2008-04-21 18:17:05.602 Oracle Coherence GE 3.3.1/389p1 <D5> (thread=Cluster, member=2): Member 29 left service WriteQueueAsync with senior member 3
    DEBUG 2008-04-21 18:17:05.603 Oracle Coherence GE 3.3.1/389p1 <D5> (thread=Cluster, member=2): Member 29 left service DistributedCache with senior member 3
    DEBUG 2008-04-21 18:17:05.603 Oracle Coherence GE 3.3.1/389p1 <D5> (thread=Cluster, member=2): Member 29 left service ServiceControl with senior member 3
    DEBUG 2008-04-21 18:17:05.604 Oracle Coherence GE 3.3.1/389p1 <D5> (thread=Cluster, member=2): Member 29 left service InvocationService with senior member 2
    DEBUG 2008-04-21 18:17:05.607 Oracle Coherence GE 3.3.1/389p1 <D5> (thread=Cluster, member=2): Member 29 left Cluster with senior member 2
    I wasn't aware that there could be multiple senior members. Is this indicative of something bad going on?
    Thanks, Paul
    PS Metalink is not behaving so I can't raise it there.

    Hi Jon,
    I've done some more digging into the logs for the whole cluster.
    Of the five nodes that left the cluster, three have the same reason:
    trulxfw0002/180-primary-0.log.1:ERROR 2008-04-21 17:05:19.763 Oracle Coherence GE 3.3.1/389p1 <Error> (thread=Cluster, member=34): This node appears to have partially lost the connectivity: it receives responses from MemberSet(Size=2, BitSetCount=2, ids=[8, 32]) which communicate with Member(Id=22, Timestamp=2008-04-18 11:07:29.911, Address=172.21.205.149:8092, MachineId=29845, Location=process:11858@trulxfw0004,member:trulxfw0004-5), but is not responding directly to this member; that could mean that either requests are not coming out or responses are not coming in; stopping cluster service.
    trulxfw0003/180-primary-7.log.3:ERROR 2008-04-21 00:40:02.153 Oracle Coherence GE 3.3.1/389p1 <Error> (thread=Cluster, member=28): This node appears to have partially lost the connectivity: it receives responses from MemberSet(Size=2, BitSetCount=3, ids=[35, 38]) which communicate with Member(Id=5, Timestamp=2008-04-18 11:07:21.992, Address=172.21.205.151:8090, MachineId=29847, Location=process:17351@trulxfw0006,member:trulxfw0006-1), but is not responding directly to this member; that could mean that either requests are not coming out or responses are not coming in; stopping cluster service.
    trulxfw0006/180-primary-0.log.6:ERROR 2008-04-18 23:13:33.896 Oracle Coherence GE 3.3.1/389p1 <Error> (thread=Cluster, member=1): This node appears to have partially lost the connectivity: it receives responses from MemberSet(Size=2, BitSetCount=3, ids=[14, 42]) which communicate with Member(Id=28, Timestamp=2008-04-18 11:07:34.381, Address=172.21.205.148:8091, MachineId=29844, Location=process:17152@trulxfw0003,member:trulxfw0003-7), but is not responding directly to this member; that could mean that either requests are not coming out or responses are not coming in; stopping cluster service.
    Member 29 had been marked as paused during its lifetime by various other nodes; however, these declarations were not frequent at the time of eviction:
    grep -R "member:trulxfw0003-5" * | grep "failed to respond" | grep "18:17"
    grep -R "member:trulxfw0003-5" * | grep "failed to respond" | grep "18:16"
    trulxfw0004/180-primary-6.log.8:DEBUG 2008-04-18 18:16:26.337 Oracle Coherence GE 3.3.1/389p1 <D6> (thread=PacketPublisher, member=23): Member(Id=29, Timestamp=2008-04-18 11:07:34.491, Address=172.21.205.148:8092, MachineId=29844, Location=process:17098@trulxfw0003,member:trulxfw0003-5) has failed to respond to 17 packets; declaring this member as paused.
    trulxfw0005/180-primary-1.log.7:DEBUG 2008-04-21 18:16:37.315 Oracle Coherence GE 3.3.1/389p1 <D6> (thread=PacketPublisher, member=10): Member(Id=29, Timestamp=2008-04-18 11:07:34.491, Address=172.21.205.148:8092, MachineId=29844, Location=process:17098@trulxfw0003,member:trulxfw0003-5) has failed to respond to 17 packets; declaring this member as paused.
    [xflow@lonrs00342 machines]$ grep -R "member:trulxfw0003-5" * | grep "failed to respond" | grep "18:15"
    trulxfw0002/180-primary-7.log.8:DEBUG 2008-04-18 18:15:44.161 Oracle Coherence GE 3.3.1/389p1 <D6> (thread=PacketPublisher, member=37): Member(Id=29, Timestamp=2008-04-18 11:07:34.491, Address=172.21.205.148:8092, MachineId=29844, Location=process:17098@trulxfw0003,member:trulxfw0003-5) has failed to respond to 17 packets; declaring this member as paused.
    trulxfw0006/180-primary-7.log.9:DEBUG 2008-04-18 18:15:51.477 Oracle Coherence GE 3.3.1/389p1 <D6> (thread=PacketPublisher, member=4): Member(Id=29, Timestamp=2008-04-18 11:07:34.491, Address=172.21.205.148:8092, MachineId=29844, Location=process:17098@trulxfw0003,member:trulxfw0003-5) has failed to respond to 17 packets; declaring this member as paused.
    grep -R "member:trulxfw0003-5" * | grep "failed to respond" | grep "18:14"
    trulxfw0002/180-primary-0.log.41:DEBUG 2008-04-18 22:18:14.220 Oracle Coherence GE 3.3.1/389p1 <D6> (thread=PacketPublisher, member=34): Member(Id=29, Timestamp=2008-04-18 11:07:34.491, Address=172.21.205.148:8092, MachineId=29844, Location=process:17098@trulxfw0003,member:trulxfw0003-5) has failed to respond to 17 packets; declaring this member as paused.
    [xflow@lonrs00342 machines]$
    grep -R "member:trulxfw0003-5" * | grep "failed to respond" | grep "18:13"
    trulxfw0002/180-primary-0.log.43:DEBUG 2008-04-18 18:13:41.083 Oracle Coherence GE 3.3.1/389p1 <D6> (thread=PacketPublisher, member=34): Member(Id=29, Timestamp=2008-04-18 11:07:34.491, Address=172.21.205.148:8092, MachineId=29844, Location=process:17098@trulxfw0003,member:trulxfw0003-5) has failed to respond to 17 packets; declaring this member as paused.
    How closely does a paused declaration correlate with a vote for eviction?
    The last node which was kicked out had a similar pattern to trulxfw0003-5 above, however it had significantly more paused declaration messages in the minutes preceding eviction.
    We have a listener on the Cluster service which detects the local node leaving the cluster and kills itself. Is this a good idea, or is it safer to let Coherence sort itself out?
    Thanks, Paul
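To put numbers behind the "more paused declarations in the minutes preceding eviction" observation above, something like this counts the paused declarations per minute for one member. The inline sample lines are stand-ins; point the grep at the real log directories instead:

```shell
# Count "declaring this member as paused" events per minute.
# The inline sample stands in for the real logs greped above.
cat > paused-sample.log <<'EOF'
DEBUG 2008-04-18 18:15:44.161 Oracle Coherence GE 3.3.1/389p1 <D6> (thread=PacketPublisher, member=37): ... declaring this member as paused.
DEBUG 2008-04-18 18:15:51.477 Oracle Coherence GE 3.3.1/389p1 <D6> (thread=PacketPublisher, member=4): ... declaring this member as paused.
DEBUG 2008-04-18 18:16:26.337 Oracle Coherence GE 3.3.1/389p1 <D6> (thread=PacketPublisher, member=23): ... declaring this member as paused.
EOF
grep "declaring this member as paused" paused-sample.log \
  | awk '{print substr($3, 1, 5)}' \
  | sort | uniq -c
```

A sharp rise in the per-minute count just before the eviction timestamp would suggest the eviction followed sustained communication trouble rather than a one-off vote.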

  • Best Practice Question on Heartbeat Issue

    Our environment consists of 2 Fibre Channel hard drive enclosures. One is an HP P2000 that has 12 2TB disks in it; piggybacked on its controllers is a D2700 with 24 SFF drives, 15,000 RPM and 146GB each. This enclosure has full RAID and volume/LUN creation capability. I can pretty much put the disks together any way I want, though I cannot combine SFF and LFF (2TB) drives into a single RAID set.
    My other devices (Texas Memory System 810) are 2 extremely fast SSD enclosures that each have 8 500GB cards and show up as 4TB of storage each. Neither device has RAID capability, so there is no redundancy, except that if the device detects a bad "chip" it migrates the data to one of the spare chips. There is a full card that is considered a spare, but your data does not exist in more than one place from what I can tell. I can create any number of LUNs, and both devices are completely visible to my Oracle VM environment.
    The LFF spinning disks mostly hold LUNs that are used for large data transfers (backups etc.), but the P2000 is also the controller for the SFF disks. The SSDs are used for our database ASM, with normal redundancy across the 2 distinct TMS810s. The SFF drives are used for the various filesystems that actually boot the servers and other things that need faster disk.
    My question is: which of these 3 should I create my cluster heartbeat on? I currently have that LUN on an LFF LUN (one LUN of many on the RAID1 set of two combined 2TB drives). The LUN is only 20GB, but I do have other LUNs on that same RAID set, as I did not want to waste the whole 2TB on a single heartbeat. That way I knew that if one disk failed in the set, I could swap the disk and not lose my heartbeat, and therefore all of the guests running in my cluster. We are looking for 99.9999% uptime.
    Everything in my environment is redundant except for the heartbeat. Does OVM 3.1 expect to have redundant heartbeats perhaps?
    If the P2000 goes down, I loose my heartbeat and all of my servers/guests go down too. Its my single point of failure.
    I tried a large filecopy 1TB worth of data to a 3TB filesystem on the LFF drives and it seemed to loose heartbeat connectivity and fenced my server. I expect the redundant controllers were overloaded and OVM was not able to keep up. I have no other explanation why the guest was down, and the server needed to be fully rebooted. OVMM showed the server down, the guest down, but I could ping the server still.
    I could place it on the extremely fast SSDs, but then it would only be in one location on one set of chips. If I need to replace a flash card in this device, I must take that single device down; my database would still be up from the other device via ASM, but I would lose my servers and guests. Not the ideal solution.
    I am all ears as to how to 1) better configure the hardware we have, or 2) buy additional hardware if absolutely necessary. I have 4 physical enclosures, all on separate redundant 8 Gb FC cards in our 2 servers. It seems that should be enough.
    Thanks for all your help. Apologies for the long post.

    Avi Miller wrote:
    >
    OCFS2's timeout needs to be larger than the timeout for your SAN. If your SAN takes 120 seconds to fail from one path to another, but OCFS2 is set to a 60 second disk heartbeat timeout, then your servers will fence halfway through a potential fabric failover.
    So, do you know how to check this setting for the Server Pool heartbeat? Did you say OCFS2 is 60 seconds by default?
    >
    OCFS2 v1.8 does support multiple global heartbeat regions, and there are plans to allow multiple heartbeat devices in some future version of Oracle VM; however, I have no idea when that will be. Keep in mind, however, that if the enclosure hosting the heartbeat goes down, you will lose everything else hosted on that enclosure as well. If you put it on the large storage repository, all your VM virtual disks disappear too, so you're offline anyway. If you put it on the fast SSDs, all your data has gone away, so you're hosed anyway. Both enclosures appear (to me) to be fairly critical for the running of your VMs, so losing either of them during normal operation would probably cause an outage. Unless I'm missing something?
    Yes, since we have multiple enclosures, I have separated a lot of the servers: 2-node RAC DB servers running on each enclosure (primary on the P2000, which is RAIDed; secondary on the SSD, which is unRAIDed but is a backup), and 2 different web/app servers on both as well. So if one enclosure goes down, yes, I would lose one set of servers, but one DB and one web server would still be up. No single point of failure. Even if one of the SSDs went down for the database files, those are 2 distinct physical redundant devices with ASM, and ASM handles having one side of the failure group down until it can be brought back online. But if I lose the enclosure with the heartbeat, I lose all my servers and nothing stays up. It's the only point of frustration in my design.

  • Sun Cluster 3.2/Solaris 10 Excessive ICMP traffic

    Hi all,
    I have inherited a 2-node cluster with a 3510 SAN, which I have upgraded to Cluster 3.2/Solaris 10. Apparently this was happening on Cluster 3.0/Solaris 8 as well.
    The real interfaces on the two nodes seem to be sending excessive pings to the default gateway they are connected to. The network adapter configuration is the same on both nodes: 2 NICs on each are grouped for multihoming, and 2 NICs are configured as private interfaces for cluster heartbeats.
    The 2 NICs that are grouped together on each server are the cards generating the traffic.
    23:27:52.402377 192.168.200.216 > 192.168.200.1: icmp: echo request [ttl 1]
    23:27:52.402392 192.168.200.1 > 192.168.200.216: icmp: echo reply
    23:27:52.588793 192.168.200.217 > 192.168.200.1: icmp: echo request [ttl 1]
    23:27:52.588806 192.168.200.1 > 192.168.200.217: icmp: echo reply
    23:27:52.818690 192.168.200.215 > 192.168.200.1: icmp: echo request [ttl 1]
    23:27:52.818714 192.168.200.1 > 192.168.200.215: icmp: echo reply
    23:27:53.072442 192.168.200.214 > 192.168.200.1: icmp: echo request [ttl 1]
    23:27:53.072479 192.168.200.1 > 192.168.200.214: icmp: echo reply
    Here is the setup to one of the servers:
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
    inet 127.0.0.1 netmask ff000000
    ce0: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 2
    inet 192.168.200.214 netmask ffffff00 broadcast 192.168.200.255
    groupname prod
    ether 0:3:ba:43:f4:f4
    ce0:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
    inet 192.168.200.212 netmask ffffff00 broadcast 192.168.200.255
    ce1: flags=1008843<UP,BROADCAST,RUNNING,MULTICAST,PRIVATE,IPv4> mtu 1500 index 5
    inet 172.16.0.129 netmask ffffff80 broadcast 172.16.0.255
    ether 0:3:ba:43:f4:f3
    qfe0: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 3
    inet 192.168.200.216 netmask ffffff00 broadcast 192.168.200.255
    groupname prod
    ether 0:3:ba:34:95:4
    qfe1: flags=1008843<UP,BROADCAST,RUNNING,MULTICAST,PRIVATE,IPv4> mtu 1500 index 4
    inet 172.16.1.1 netmask ffffff80 broadcast 172.16.1.127
    ether 0:3:ba:34:95:5
    clprivnet0: flags=1009843<UP,BROADCAST,RUNNING,MULTICAST,MULTI_BCAST,PRIVATE,IPv4> mtu 1500 index 6
    inet 172.16.193.1 netmask ffffff00 broadcast 172.16.193.255
    ether 0:0:0:0:0:1
    Any suggestions on why the excessive traffic?

    I would guess these are the IPMP probes (man in.mpathd).
    You can start in.mpathd in debug mode to find out.
    HTH,
    jono
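    For reference, the probe behaviour jono describes is driven by /etc/default/mpathd; a stock Solaris 10 file looks roughly like this (values shown are the documented defaults, and with a 10-second failure detection time each grouped NIC probes its target router about every 2 seconds, which would match the TTL-1 echo requests in the trace above):

    ```
    # /etc/default/mpathd (Solaris 10 defaults)
    #
    # Time taken by in.mpathd to detect a NIC failure, in milliseconds.
    # Probes are sent often enough to detect failure within this window,
    # i.e. roughly one probe every FAILURE_DETECTION_TIME/5 ms per NIC.
    FAILURE_DETECTION_TIME=10000
    #
    # Fail back to a repaired interface automatically.
    FAILBACK=yes
    #
    # Only probe interfaces that are configured as part of an IPMP group.
    TRACK_INTERFACES_ONLY_WITH_GROUPS=yes
    ```

    Raising FAILURE_DETECTION_TIME reduces the probe rate at the cost of slower failover detection.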

  • Guest Cluster error in Hyper-V Cluster

    Hello everybody,
    in my environment I have an issue with failover clusters (Exchange, file server) while performing a live migration of one virtual cluster node: the cluster group goes offline.
    The environment is the following:
    2x Hyper-V Clusters: Hyper-V-Cluster1 and Hyper-V-Cluster2 (Windows Server 2012 R2) with 5 Nodes per Cluster
    1x Scaleout Fileserver (Windows Server 2012 R2) with 2 Nodes
    1x Exchange Cluster (Windows Server 2012 R2) with EX01 VM running on Hyper-V-Cluster1 and EX02 VM running on Hyper-V-Cluster2
    1x Fileserver Failover Cluster (Windows Server 2012 R2) with FS01 VM running on Hyper-V-Cluster1 and FS02 VM running on Hyper-V-Cluster2
    The physical networks on the Hyper-V Nodes are redundant with 2x 10Gb/s uplinks to 2x physical switches for VMs in a LBFO Team:
    New-NetLbfoTeam
    -Name 10Gbit_TEAM -TeamMembers 10Gbit_01,10Gbit_02
    -TeamingMode SwitchIndependent -LoadBalancingAlgorithm HyperVPort
    The SMB 3 traffic runs on 2x 10Gb/s NIC without NIC-Teaming (SMB-Multichannel).
    SMB is used for livemigrations.
    The VMs for clustering were installed according to the technet guideline:
    http://technet.microsoft.com/en-us/library/dn265980.aspx
    Because my Hyper-V uplinks are already redundant, I am using one NIC inside each VM.
    As I understand it, there is no advantage to using two NICs inside the VM as long as they are connected to the same vSwitch.
    Now, when I want to perform a hardware maintenance, I have to livemigrate the EX01 VM from Hyper-V-Cluster1-Node-1 to Hyper-V-Cluster1-Node-2.
    EX02 VM still runs untouched on Hyper-V-Cluster2-Node-1.
    At the end of the live migration I see error 1135 (source: FailoverClustering) on the EX01 VM, which says that EX02 was removed from the failover cluster and that I should check my network.
    The Exchange cluster group is offline after that event, and I have to bring it online again manually.
    Any ideas what can cause this behavior?
    Thanks.
    Greetings,
    torsten

    Hello again,
    I found the cause and the solution :-)
    In the article here: http://technet.microsoft.com/en-us/library/dn440540.aspx
    is the description of my cluster failure:
    ########## relevant part from article #######################
    Protect against short-term network interruptions
    Failover cluster nodes use the network to send heartbeat packets to other nodes of the cluster. If a node does not receive a response from another node for a specified period of time, the cluster removes the node from cluster membership. By default, a guest cluster node is considered down if it does not respond within 5 seconds. Other nodes that are members of the cluster will take over any clustered roles that were running on the removed node.
    Typically, during the live migration of a virtual machine there is a fast final transition when the virtual machine is stopped on the source node and is running on the destination node. However, if something causes the final transition to take longer than the configured heartbeat threshold settings, the guest cluster considers the node to be down even though the live migration eventually succeeds. If the live migration final transition is completed within the TCP time-out interval (typically around 20 seconds), clients that are connected through the network to the virtual machine seamlessly reconnect.
    To make the cluster heartbeat time-out more consistent with the TCP time-out interval, you can change the SameSubnetThreshold and CrossSubnetThreshold cluster properties from the default of 5 seconds to 20 seconds. By default, the cluster sends a heartbeat every 1 second. The threshold specifies how many heartbeats to miss in succession before the cluster considers the cluster node to be down.
    After changing both parameters in the failover cluster as described, the error is gone.
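    For anyone landing here later, the property change described in the article can be made with PowerShell on any node of the guest cluster; a minimal sketch, assuming the 2012 R2 FailoverClusters module is available:

    ```powershell
    # Raise the heartbeat thresholds from 5 to 20 missed 1-second heartbeats,
    # as recommended in the TechNet article quoted above.
    (Get-Cluster).SameSubnetThreshold  = 20
    (Get-Cluster).CrossSubnetThreshold = 20

    # Verify the new values.
    Get-Cluster | Format-List SameSubnetDelay, SameSubnetThreshold, CrossSubnetThreshold
    ```

    The change takes effect immediately and applies cluster-wide; no service restart is needed.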
    Greetings,
    torsten

  • Changing Hyper-V host and cluster virtual IP addresses to new subnet/VLAN

    I have a 2 node Hyper-V 2012 R2 failover cluster, managed by System Center Virtual Machine Manager 2012 R2, and I would like to change the IP addresses of the hosts and the cluster, in order to move them to a new subnet and VLAN. The existing and new subnets
    are able to route to each other so all hosts will still be able to communicate throughout the parts of the process where they may be on separate subnets. There is also a dedicated cluster heartbeat network on its own subnet and VLAN that I am not altering
    in any way.
    The 2 hosts are configured with 4 nics in a team, with dedicated virtual interfaces for each of the following:
    -Live Migration
    -Cluster Heartbeating
    -Host management/general traffic (the cluster virtual IP address is also on the same subnet as these interfaces).
    It is the host management/general traffic addresses that I want to change. The interfaces were created and configured with the Add-VMNetworkAdapter, New-NetIPAddress and Set-VMNetworkAdapterVlan commands.
    Please advise if the following process is correct:
    1) Evacuate all the VMs from the first host to be changed and put it in maintenance mode.
    2) Use Set-VMNetworkAdapter to change the name of the interface (the current name refers to the VLAN it's on)
    3) Use Set-NetIPAddress to change the IP address and gateway of the interface as appropriate
    4) Use Set-VMNetworkAdapterVlan to set the VLAN ID
    5) Take the host out of maintenance mode and move all VMs off the other host
    6) Repeat above steps on the other host
    I know that I will then need to change the cluster virtual IP address, but I have no idea how to do this or where to look for that setting. Please advise!
    Cheers.

    Hi new_guise,
    For changing cluster node's IP address please refer to the link below :
    https://support.microsoft.com/kb/230356?wa=wsignin1.0
    For changing VIP please refer to this article :
    http://blogs.technet.com/b/chrad/archive/2011/09/16/changing-hyper-v-cluster-virtual-ip-address-vip-after-layer-3-changes.aspx
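    In short, the VIP article boils down to editing the parameters of the cluster's IP address resource. A hedged PowerShell sketch of that step — the resource name "Cluster IP Address" and the address values below are illustrative, so check Get-ClusterResource for your actual resource name first:

    ```powershell
    # Inspect the current cluster IP address resource and its parameters.
    Get-ClusterResource "Cluster IP Address" | Get-ClusterParameter

    # Point the VIP at the new subnet (illustrative values).
    Get-ClusterResource "Cluster IP Address" |
        Set-ClusterParameter -Multiple @{ Address = "192.168.20.10"; SubnetMask = "255.255.255.0" }

    # The resource must be cycled for the change to take effect.
    Stop-ClusterResource  "Cluster IP Address"
    Start-ClusterResource "Cluster IP Address"
    ```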
    Best Regards,
    Elton Ji
    Please remember to mark the replies as answers if they help and unmark them if they provide no help. If you have feedback for TechNet Subscriber Support, contact [email protected] .

  • Best Practice Question on Heartbeat Network

    After running 3.0.3 for a few weeks in production, we are wondering if we set up our heartbeat/servers correctly.
    We have 2 servers in our Production Server pool. Our LAN, a 192.168.x.x network, has the Virtual IP of the Cluster (heartbeat), the 2 main IP addresses of the servers, and a NIC assigned to each guest. All of this has been configured on the same network. Over the weekend, I wanted to separate the Heartbeat onto a new network, but when trying to add to the pool I received:
    Cannot add server: ovsx.mydomain.com, to pool: mypool. Server Mgt IP address: 192.168.x.x, is not on same subnet as pool VIP: 192.168.y.y
    Currently, I only have one router, which translates our WAN to our LAN of 192.168.x.x. I thought the heartbeat would be strictly internal and would not need to be routed anywhere, just set up as a separate VLAN, which is why I created 192.168.y.y. I know that the servers can have multiple IP addresses, and I have 3 networks added to my OVM servers: 192.168.x.x, 192.168.y.y and 192.168.z.z. The y and z networks are not pingable from anything but the servers themselves or one of the guests I have assigned to that network. I cannot ping them directly from our office network, even through the VPN, which only gives us access to 192.168.x.x.
    I guess I could change my Server Mgt IP from 192.168.x.x to 192.168.y.y, but can I do that without reinstalling the VM server? How have others structured their networks, especially relating to the heartbeat?
    Is there any documentation/guides that would describe how to set up the networks properly relating to the heartbeat?
    Thanks for any help!!

    Hello user,
    In order to change your environment, what you could do is go to the Hardware tab -> Network. Within here you can create new networks and also change via the Edit this Network pencil icon what networks should manage what roles (i.e. Virtual Machine, Cluster Heartbeat, etc). In my past experience, I've had issues changing the cluster heartbeat once it has been set. If you have issues changing it, via the OVM Manager, one thing you could do is change it manually via the /etc/ocfs2/cluster.conf file. Also, if it successfully lets you change it via the OVM Manager, verify it within the cluster.conf to ensure it actually did your change. This is where that is being set. However, doing it manually can be tricky because OVM has a tendency to like to revert it's changes back to its original state say after a reboot. Of course I'm not even sure if they support you manually making that change. Ideally, when setting up an OVM environment, best practice would be to separate your networks as much as possible i.e. (Public network, private network, management network, clusterhb network, and live migration network if you do a lot of live migrating, otherwise you can probably place it with say the management network).
    Hope that helps,
    Roger

  • Error in coherence-- stopping cluster service.

    I found the following error in one of my Coherence server log files. Can someone explain what it means?
    Coherence Logger@9272718 3.4.2/411 ERROR 2009-06-01 16:08:31.396/1217.130 Oracle Coherence GE 3.4.2/411 <Error> (thread=Cluster, member=3): Received cluster heartbeat from the senior Member(Id=7, Timestamp=2009-04-24 12:29:25.802, Address=xx.xxx.xx.xxx:8093, MachineId=55400, Location=machine:server72,process:11324, Role=WeblogicServer) that does not contain this Member(Id=3, Timestamp=2009-06-01 15:48:09.18, Address=xx.xxx.xxx.xx:8091, MachineId=47428, Location=site:ops.company.org,machine:cohserverbox1,process:14401, Role=CoherenceServer); stopping cluster service.
    Thanks Much

    Hi,
    This error essentially means what it says: the process received a cluster heartbeat that did not include the process as a member of the cluster. The process therefore stops its cluster service and will attempt to join the cluster again when appropriate. There are a few reasons the senior member may not have included the process in its heartbeat. Based on the timestamps and roles, I would first want to confirm the intent to cluster these processes. If the intent is not to cluster them, I would adjust their configurations appropriately (e.g. use a distinct port) to form separate clusters. If the intent is to cluster them and the error (with the timestamp spread) reproduces, I would want to examine the network topology and look for reasons the members are being dropped from the cluster.
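    As an illustration of the "distinct port" approach, a minimal operational override (conventionally named tangosol-coherence-override.xml and placed on the classpath) could pin one set of processes to its own multicast group. The address and port values here are made up:

    ```xml
    <?xml version="1.0"?>
    <coherence>
      <cluster-config>
        <multicast-listener>
          <!-- Processes using a different address/port form a separate cluster
               and will neither send nor accept this cluster's heartbeats. -->
          <address>224.3.4.5</address>
          <port>9099</port>
        </multicast-listener>
      </cluster-config>
    </coherence>
    ```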
    Regards,
    Harv

  • IP Multicast requirements in WebLogic 5.1 cluster

              Hi,
              Does multicast need to be available to servers containing the Weblogic plug-in?
              We have a firewall between web servers (iPlanet) and the WebLogic cluster.
              Thanks in advance,
              Julian
              

              Another way of putting my question:
              How does the plug-in become aware of the need to fail over (e.g. to a replica session)?
              (i) "A proactive method" - plug-ins listen to cluster heartbeats to determine instance availability and only send requests to available instances (e.g. compare the list of available instances from heartbeat results against the primary/secondary instances specified in the cookie).
              (ii) "A reactive method" - the plug-in sends the client request to the primary instance specified in the cookie; the request is not fulfilled, so it intercepts the response to the client and re-sends the request to the secondary instance.
              (iii) other??
              Please help.
              Thanks again - Julian
              "Julian Herzel" <[email protected]> wrote:
              >
              >Hi,
              >
              >Does multicast need to be available to servers containing the Weblogic
              >plug-in?
              >We have a firewall between web servers (iPlanet) and the WebLogic cluster.
              >
              >Thanks in advance,
              >
              >Julian
              

  • Cluster fragmentation

    What happens when part of my cluster cannot communicate with the other part?
    For example, what if I have two machines in a cluster and at some point they can no longer communicate with each other, but are otherwise connected to the network and running fine?
    Will they both assume the other machine has left the cluster and form two independent clusters? This can lead to problems, as an update made by one machine will not be reflected in the other. Basically, each cluster will get out of sync with the db as the other cluster makes updates.
    Or will one machine somehow become the sole member of "the cluster" and the other will not be in "the cluster" at all? If this is the case, how does the machine that has left the cluster behave? Do all cache access methods throw exceptions until it is able to rejoin the cluster, or what?

    Hi Rohan,
    The two islands will start looking for each other the next time the application code calls into the Coherence API (e.g. CacheFactory.getCache(...)). When they will see each other again depends on how flaky your server/network/switch is.
    I have observed that when the split occurs, if the one server that is effectively in an island by itself does not make calls into the Coherence API, it remains disconnected even though the other servers in the cluster continue to make their own calls into the API.
    Exactly, there is no need for this node to be in the cluster if it is not using the cache (i.e. accessing the API).
    Also, I have seen this warning:
    2005-06-23 09:18:47,270 INFO [STDOUT] 2005-06-23
    09:18:47.270 Tangosol Coherence 2.5/290 <Warning>
    (thread=Cluster, member=3): The member formerly known
    as Member(Id=1, Timestamp=Thu Jun 23 09:18:46 UTC
    2005, Address=A.B.C.139, Port=8088, MachineId=2955)
    has been forcefully evicted from the cluster, but
    continues to emit a cluster heartbeat; henceforth,
    the member will be shunned and its messages will be
    ignored.
    Does this mean that the shunned Member 1 will never be able to rejoin the cluster, since its messages will be ignored?
    The member will attempt to rejoin the cluster once it has (1) shut down all its Coherence services and (2) the application calls into the Coherence API again.
    Log messages on Member 1:
    2005-06-23 09:18:47,390 INFO [STDOUT] 2005-06-23
    09:18:47.389 Tangosol Coherence 2.5/290 <Error>
    (thread=Cluster, member=1): This senior Member(Id=1,
    Timestamp=Thu Jun 23 08:53:10 UTC 2005,
    Address=A.B.C.139, Port=8088, MachineId=2955) appears
    to have been disconnected from other nodes due to a
    long period of inactivity
    and the seniority has been assumed by the
    Member(Id=2, Timestamp=Thu Jun 23 08:53:17 UTC 2005,
    Address=A.B.C.140, Port=8088, MachineId=2956);
    stopping cluster.
    2005-06-23 09:21:47,908 INFO [STDOUT] 2005-06-23
    09:21:47.908 Tangosol Coherence 2.5/290 <Info>
    (thread=Thread-48, member=1): Restarting NamedCache:
    ruleCache
    2005-06-23 09:21:47.908 Tangosol Coherence 2.5/290
    <Info> (thread=Thread-48, member=1): Restarting
    Service: DistributedCache
    2005-06-23 09:21:47.908 Tangosol Coherence 2.5/290
    <Info> (thread=Thread-48, member=n/a): Restarting
    cluster
    Member 1 did appear to be operating as normal after this ... do these log messages mean it really was part of the cluster again, or did it just think it was, even though it was being shunned?
    Correct, it is now part of the cluster again. As I stated above, a shunned member will attempt to rejoin the cluster.
    This is the level of fault-tolerance and reliability that is built into Coherence from the start. However, I would still suggest fixing the flaky server/network/switch.
    Later,
    Rob Misek
    Tangosol, Inc.
    Message was edited by: rmisek
