Unicast cluster - heartbeat message failure messages

I am using unicast messaging mode and I see the following messages:
####<Jul 9, 2010 12:46:56 AM PDT> <Info> <Cluster> <anaeur30> <WL10MP2-ServiceSTServer6> <[ACTIVE] ExecuteThread: '45' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1278661616559> <BEA-000112> <Removing WL10MP2-ServiceSTServer1 jvmid:6806396782256322086S:anaeur10:[7033,7033,-1,-1,-1,-1,-1]:anaeur10:7033,anaeur10:7035,anaeur20:7033,anaeur20:7035,anaeur30:7033,anaeur30:7035,anaeur50:7033,anaeur50:7035:WL10MP2-ServiceTier:WL10MP2-ServiceSTServer1 from cluster view due to timeout.>
####<Jul 9, 2010 12:55:36 AM PDT> <Info> <Cluster> <anaeur30> <WL10MP2-ServiceSTServer6> <[ACTIVE] ExecuteThread: '34' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1278662136552> <BEA-000112> <Removing WL10MP2-ServiceSTServer2 jvmid:-2694311272134716565S:anaeur10:[7035,7035,-1,-1,-1,-1,-1]:anaeur10:7033,anaeur10:7035,anaeur20:7033,anaeur20:7035,anaeur30:7033,anaeur30:7035,anaeur50:7033,anaeur50:7035:WL10MP2-ServiceTier:WL10MP2-ServiceSTServer2 from cluster view due to timeout.>
During the same time frame, I see lost multicast messages on all the instances for about 20 minutes. What could be the problem? Why am I seeing multicast messages when using unicast? My config.xml has multicast-related entries for each server, but will they have any effect? Is that an issue? We see servers dropping out of the cluster frequently.
000115> <Lost 1 multicast message(s).>
####<Jul 9, 2010 12:46:42 AM PDT> <Info> <Cluster> <anaeur10> <WL10MP2-ServiceSTServer2> <weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1278661602751> <BEA-000115> <Lost 1 multicast message(s).>
####<Jul 9, 2010 12:46:46 AM PDT> <Info> <Cluster> <anaeur10> <WL10MP2-ServiceSTServer2> <weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1278661606548> <BEA-000115> <Lost 2 multicast message(s).>
####<Jul 9, 2010 12:47:04 AM PDT> <Info> <Cluster> <anaeur10> <WL10MP2-ServiceSTServer2> <weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1278661624185> <BEA-000115> <Lost 2 multicast message(s).>
####<Jul 9, 2010 12:48:40 AM PDT> <Info> <Cluster> <anaeur10> <WL10MP2-ServiceSTServer2> <weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1278661720809> <BEA-000115> <Lost 2 multicast message(s).>
####<Jul 9, 2010 12:54:14 AM PDT> <Info> <Cluster> <anaeur10> <WL10MP2-ServiceSTServer2> <weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1278662054823> <BEA-000115> <Lost 2 multicast message(s).>
####<Jul 9, 2010 12:54:14 AM PDT> <Info> <Cluster> <anaeur10> <WL10MP2-ServiceSTServer2> <weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1278662054827> <BEA-000115> <Lost 1 multicast message(s).>
####<Jul 9, 2010 12:54:14 AM PDT> <Info> <Cluster> <anaeur10> <WL10MP2-ServiceSTServer2> <weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1278662054827> <BEA-000115> <Lost 2 multicast message(s).>
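
To check whether the lost-message entries really bunch up around the removals, the two message IDs can be correlated through the epoch-millisecond field in each entry. The following is only a rough sketch in Java: the class name `ClusterLogScan`, its methods, and the idea of a time window are made up for illustration, and it assumes each log entry sits on a single line in the format shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClusterLogScan {
    // Trailing fields of an entry: <epoch millis> <BEA-nnnnnn> <message text>
    private static final Pattern ENTRY =
            Pattern.compile("<(\\d{13})> <(BEA-\\d{6})> <(.*)>\\s*$");
    private static final Pattern LOST =
            Pattern.compile("Lost (\\d+) multicast message");

    /** One parsed log entry (hypothetical helper type). */
    public static final class Entry {
        public final long millis;
        public final String id;
        public final String text;

        Entry(long millis, String id, String text) {
            this.millis = millis;
            this.id = id;
            this.text = text;
        }
    }

    /** Parses one log line; returns null if the line does not match the assumed format. */
    public static Entry parse(String line) {
        Matcher m = ENTRY.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new Entry(Long.parseLong(m.group(1)), m.group(2), m.group(3));
    }

    /**
     * Sums the counts of BEA-000115 entries logged within windowMillis
     * (before or after) of any BEA-000112 removal entry.
     */
    public static int lostNearRemovals(List<String> lines, long windowMillis) {
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            Entry e = parse(line);
            if (e != null) {
                entries.add(e);
            }
        }
        int total = 0;
        for (Entry e : entries) {
            if (!e.id.equals("BEA-000115")) {
                continue;
            }
            Matcher m = LOST.matcher(e.text);
            if (!m.find()) {
                continue;
            }
            for (Entry r : entries) {
                if (r.id.equals("BEA-000112")
                        && Math.abs(r.millis - e.millis) <= windowMillis) {
                    total += Integer.parseInt(m.group(1));
                    break; // count each lost-message entry at most once
                }
            }
        }
        return total;
    }
}
```

Feeding it the combined log lines of all managed servers with a window of a minute or so would show whether the lost messages precede each removal or are unrelated background noise.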

SJ,
Thanks, that's exactly the explanation I was looking for. We always create clusters from the console, and it could be that we used the MULTICAST messaging mode in the past, hence the entries in config.xml. What made me ask "will UNICAST or MULTICAST be used" is that whenever a server drops out of the cluster, I see the following message written to each managed server log. Ideally, the following should be written to the log only if the multicast messaging mode is in operation, right?
<weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1276490260768> <BEA-000115> <Lost 2 multicast message(s).>
<weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1276490260768> <BEA-000115> <Lost 2 multicast message(s).>
<weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1276490261355> <BEA-000115> <Lost 2 multicast message(s).>
<weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1276490261355> <BEA-000115> <Lost 2 multicast message(s).>
The above message is not written all the time, only when a server is removed from the cluster group. Please note that I have enabled unicast debug mode. Does unicast also write messages like the above when a heartbeat message is lost?
To trace our issue further, I will have to manually remove the multicast references from config.xml and monitor for some time. It is still a mystery why the cluster members are dropping out. Sometimes, soon after cluster instances have dropped out, I can see the drop-out frequency reported as "Rarely", and after a week or so the members are regrouped with a different group leader. Are you aware of any issue with unicast messaging mode in WL10 MP2?
Would it be a good idea to test multicast?
Thanks a lot for your time.
-RR
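
Before editing config.xml by hand, it may help to list what each cluster entry currently declares. The following is a rough Java sketch, not an official tool: the class name `ClusterConfigCheck` is made up, and the element names `cluster-messaging-mode`, `multicast-address` and `multicast-port` under a top-level `cluster` element are assumptions about the config.xml layout that should be checked against the actual file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ClusterConfigCheck {

    /** Returns one summary line per top-level cluster element in the given config.xml text. */
    public static String describe(String configXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // the file may declare a default namespace
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(configXml.getBytes(StandardCharsets.UTF_8)));

        StringBuilder out = new StringBuilder();
        NodeList clusters = doc.getElementsByTagNameNS("*", "cluster");
        for (int i = 0; i < clusters.getLength(); i++) {
            Element cluster = (Element) clusters.item(i);
            // Skip cluster references nested inside server elements.
            if (!doc.getDocumentElement().isSameNode(cluster.getParentNode())) {
                continue;
            }
            out.append(childText(cluster, "name", "?"))
               .append(": mode=")
               .append(childText(cluster, "cluster-messaging-mode", "(not set)"))
               .append(", multicast-address=")
               .append(childText(cluster, "multicast-address", "(not set)"))
               .append(", multicast-port=")
               .append(childText(cluster, "multicast-port", "(not set)"))
               .append('\n');
        }
        return out.toString();
    }

    /** Text of the first direct child element with the given local name, or the fallback. */
    private static String childText(Element parent, String localName, String fallback) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && localName.equals(n.getLocalName())) {
                return n.getTextContent().trim();
            }
        }
        return fallback;
    }
}
```

Reading the file into a string (for example with `java.nio.file.Files.readAllBytes`) and passing it to `describe` gives a quick view of whether a cluster still carries multicast settings next to a unicast messaging mode.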

Similar Messages

  • Cluster heartbeat message in Coherence 3.6

    Hi,
    We recently upgraded to Coherence 3.6 in our production environment. Occasionally in the Coherence cluster, I see the following happening.
    2010-08-18 18:27:48.953/28.828 Oracle Coherence GE 3.6.0.0 <Error> (thread=Cluster, member=13): Received cluster heartbeat from the senior Member(Id=1, Timestamp=2010-08-18 18:03:43.927, Address=10.31.151.246:9000, MachineId=33526, Location=site:xxx.com,machine:machine1,process:2665,member:coherence_cache_server-0, Role=cache-server) that does not contain this Member(Id=13, Timestamp=2010-08-18 18:25:20.158, Address=10.30.71.60:8092, MachineId=21308, Location=site:xxx.com,machine:machine2,process:3540,member:CoherenceCommandLineTool, Role=cache-client); stopping cluster service.
    Whenever any node (for example, the Coherence command-line tool) tries to join the cluster, it gets kicked out immediately. Nodes keep exiting the cluster and rejoining. This happens constantly. Here is another log snippet:
    2010-08-18 14:23:22.458/-13596.00-214 Oracle Coherence GE 3.6.0.0 <D5> (thread=Cluster, member=7): Member 5 joined Service Management with senior member 1
    2010-08-18 14:23:36.110/-13582.00-562 Oracle Coherence GE 3.6.0.0 <D5> (thread=Cluster, member=7): Member 5 joined Service DistributedCache with senior member 1
    2010-08-18 14:23:37.811/-13580.00-861 Oracle Coherence GE 3.6.0.0 <D5> (thread=Cluster, member=7): MemberLeft notification for Member(Id=5, Timestamp=2010-08-18 14:23:25.924, Address=10.30.71.60:8092, MachineId=21308, Location=site:xxx.com,machine:machine2,process:5936,member:CoherenceCommandLineTool, Role=cache-client, PublisherSuccessRate=0.9166, ReceiverSuccessRate=1.0, PauseRate=0.0, Threshold=1976, Paused=false, Deferring=false, OutstandingPackets=0, DeferredPackets=0, ReadyPackets=0, LastIn=845ms, LastOut=854ms, LastSlow=n/a) received from Member(Id=2, Timestamp=2010-08-18 12:34:15.068, Address=10.31.151.246:9001, MachineId=33526, Location=site:xxx.com,machine:machine1,process:2667,member:coherence_cache_server-1, Role=cache-server, PublisherSuccessRate=0.8568, ReceiverSuccessRate=0.5934, PauseRate=0.0021, Threshold=1878, Paused=false, Deferring=false, OutstandingPackets=0, DeferredPackets=0, ReadyPackets=0, LastIn=24744ms, LastOut=24753ms, LastSlow=n/a)
    2010-08-18 14:23:37.811/-13580.00-861 Oracle Coherence GE 3.6.0.0 <D5> (thread=Cluster, member=7): Member 5 left service Management with senior member 1
    2010-08-18 14:23:37.811/-13580.00-860 Oracle Coherence GE 3.6.0.0 <D5> (thread=Cluster, member=7): Member 5 left service DistributedCache with senior member 1
    Just to give a brief background on our environment: we have 3 Linux hosts configured to form the cluster. 2 of them run 2 cache server nodes each, for a total of 4 cache servers in the cluster, with about 7 storage-disabled client nodes.
    Any clues as to why this is happening? Do we need to configure anything on the cluster? All the ports on which the nodes communicate are open for UDP and TCP, inbound and outbound.
    Appreciate all help on this.
    -Chandini

    Here is the log snippet on member 2 for around the same timestamp when member 5 was removed
    2010-08-18 14:23:26.269/6588.113 Oracle Coherence GE 3.6.0.0 <D5> (thread=Cluster, member=2): Member 5 joined Service Management with senior member 1
    2010-08-18 14:23:39.922/6601.766 Oracle Coherence GE 3.6.0.0 <D5> (thread=Cluster, member=2): Member 5 joined Service DistributedCache with senior member 1
    2010-08-18 14:23:41.607/6603.451 Oracle Coherence GE 3.6.0.0 <Warning> (thread=Cluster, member=2): Failed to reach address /10.30.71.60 within the IpMonitor timeout. Members [Member(Id=5, Timestamp=2010-08-18 14:23:25.924, Address=10.30.71.60:8092, MachineId=21308, Location=site:xxx.com,machine:machine2,process:5936,member:CoherenceCommandLineTool, Role=cache-client)] are suspect.
    2010-08-18 14:23:41.608/6603.452 Oracle Coherence GE 3.6.0.0 <Warning> (thread=Cluster, member=2): Timed-out members MemberSet(Size=1, BitSetCount=2
    Member(Id=5, Timestamp=2010-08-18 14:23:25.924, Address=10.30.71.60:8092, MachineId=21308, Location=site:xxx.com,machine:machine2,process:5936,member:CoherenceCommandLineTool, Role=cache-client)
    ) will be removed.
    2010-08-18 14:23:41.608/6603.452 Oracle Coherence GE 3.6.0.0 <D5> (thread=Cluster, member=2): Member 5 left service Management with senior member 1
    2010-08-18 14:23:41.609/6603.453 Oracle Coherence GE 3.6.0.0 <D5> (thread=Cluster, member=2): Member 5 left service DistributedCache with senior member 1
    I will work on getting the logs on the other members for the first message and post it here.
    Thanks.

  • Sending heartbeat messages

    Can anyone post sample code for sending heartbeat messages from a client to a server? I need it urgently.
    thanks in advance

    Most of the code is here. There are parts before and after the snippets, but this should be enough to get the point across. I have not included the class that is being serialized; it can be whatever you want it to be (the contents of the class/object don't matter).
    source system:
    try {
        /* NON-SSL connection */
        if (!sslConnection) {
            mySocket = new Socket(serverName, communicationPort);
            oout = new ObjectOutputStream(mySocket.getOutputStream());
            oin = new ObjectInputStream(mySocket.getInputStream());
        }
        /* SSL connection */
        else {
            sslFact = (SSLSocketFactory) SSLSocketFactory.getDefault();
            mySSLSocket = (SSLSocket) sslFact.createSocket(serverName, communicationPort);
            oout = new ObjectOutputStream(mySSLSocket.getOutputStream());
            oin = new ObjectInputStream(mySSLSocket.getInputStream());
        }
        /* Send the heartbeat. */
        heartBeatMessage = new DmiHeartBeatMessage(destinationName, serverName);
        myMessage = new CommunicationMessage(heartBeatMessage);
        oout.writeObject(myMessage);
        oout.flush();
        /* Read replies until the server closes the connection; readObject() */
        /* then throws and control moves to the catch block below. */
        while (true) {
            incomingMessage = (CommunicationMessage) oin.readObject();
            if (incomingMessage.getMessageText().equals("Request Complete"))
                connectionClosed = true;
        }
    }
    catch (Exception e) {
        if (connectionClosed)
            executerLogger.writeToLog("Remote agent available.", true, true);
        else
            executerLogger.writeToLog("Remote agent not available.", true, true);
    }
    target:
    /* Non-SSL connection */
    if (communicationSocket != null) {
        agentInfo.getAgentLogger().writeToLog("Received a NON-SSL connection request...", true, true);
        socketInput = communicationSocket.getInputStream();
        socketOutput = communicationSocket.getOutputStream();
    }
    /* SSL connection */
    else {
        agentInfo.getAgentLogger().writeToLog("Received a SSL connection request...", true, true);
        socketInput = sslCommunicationSocket.getInputStream();
        socketOutput = sslCommunicationSocket.getOutputStream();
    }
    oout = new ObjectOutputStream(socketOutput);
    oin = new ObjectInputStream(socketInput);
    /* Continue to read messages until the client disconnects. */
    incomingMessage = (CommunicationMessage) oin.readObject();
    /* A heartbeat message has arrived: log it and acknowledge. */
    if (incomingMessage.getObject() instanceof DmiHeartBeatMessage) {
        agentInfo.getAgentLogger().writeToLog("Heartbeat received from Deployment Executer: " +
            ((DmiHeartBeatMessage) incomingMessage.getObject()).getDestinationName() + ", " +
            ((DmiHeartBeatMessage) incomingMessage.getObject()).getServerName() + ".", true, true);
        sendMessageToClient("Request Complete", false);
    }
    ...
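
Since the snippets above depend on classes that were not posted, here is a smaller self-contained sketch of the same request/acknowledge idea. It is not the poster's code: the class name `HeartbeatSketch`, the `HEARTBEAT`/`ACK` text protocol, and the timeout handling are all assumptions chosen to keep the example short, and it uses plain text lines rather than object serialization.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class HeartbeatSketch {

    /** Starts a daemon thread that answers every "HEARTBEAT" line with "ACK". */
    public static ServerSocket startServer() throws IOException {
        ServerSocket server = new ServerSocket(0); // port 0 = pick any free port
        Thread worker = new Thread(() -> {
            while (!server.isClosed()) {
                try (Socket client = server.accept();
                     BufferedReader in = new BufferedReader(new InputStreamReader(
                             client.getInputStream(), StandardCharsets.UTF_8));
                     PrintWriter out = new PrintWriter(new OutputStreamWriter(
                             client.getOutputStream(), StandardCharsets.UTF_8), true)) {
                    String line;
                    // Serve this client until it disconnects.
                    while ((line = in.readLine()) != null) {
                        if ("HEARTBEAT".equals(line)) {
                            out.println("ACK");
                        }
                    }
                } catch (IOException e) {
                    // accept() fails once the server socket is closed;
                    // the loop condition then ends the thread.
                }
            }
        });
        worker.setDaemon(true);
        worker.start();
        return server;
    }

    /** Sends one heartbeat; returns true only if "ACK" arrives within the timeout. */
    public static boolean ping(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            socket.setSoTimeout(timeoutMillis); // bound the wait for the reply
            PrintWriter out = new PrintWriter(new OutputStreamWriter(
                    socket.getOutputStream(), StandardCharsets.UTF_8), true);
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    socket.getInputStream(), StandardCharsets.UTF_8));
            out.println("HEARTBEAT");
            return "ACK".equals(in.readLine());
        } catch (IOException e) {
            return false; // refused, timed out, or dropped: treat as not available
        }
    }
}
```

A client would typically call `ping` on a timer and report the remote side as unavailable after a few consecutive failures. This server handles one client at a time, which is enough for a sketch but not for production use.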

  • Unexpected cluster heartbeat

    I'm seeing the following errors when starting several nodes of a cluster using scripts. This only happens occasionally; it works most of the time. Also, there is no problem when the nodes are started manually one by one.
    The cluster consists of 2 hosts, each running multiple programs (JVMs), as indicated in the log.
    Could someone explain what happened and how to fix it?
    Thanks!
    2010-09-23 19:29:48,161 14496 [Logger@559022270 3.5.3/465] DEBUG Coherence - 2010-09-23 19:29:48.161/14.743 Oracle Coherence GE 3.5.3/465 <D5> (thread=Cluster,member=n/a): Service Cluster joined the cluster with senior service member n/a
    2010-09-23 19:29:48,382 14717 [Logger@9250185 3.5.3/465] INFO Coherence - 2010-09-23 19:29:48.382/14.964 Oracle Coherence GE 3.5.3/465 <Info> (thread=Cluster,member=n/a): This Member(Id=7, Timestamp=2010-09-23 19:29:48.207, Address=10.253.97.133:16001, MachineId=38533, Location=site:mytest.com,machine:host1,process:18482, Role=Program1, Edition=Grid Edition, Mode=Development, CpuCount=2, SocketCount=2) joined cluster "test1" with senior Member(Id=12, Timestamp=2010-09-23 19:15:01.264, Address=10.253.97.133:16002, MachineId=38533, Location=site:mytest.com,machine:host1,process:15142, Role=Program2, Edition=Grid Edition, Mode=Development, CpuCount=2, SocketCount=2)
    2010-09-23 19:29:49,157 15492 [Logger@9250185 3.5.3/465] WARN Coherence - 2010-09-23 19:29:49.157/15.739 Oracle Coherence GE 3.5.3/465 <Warning> (thread=Cluster, member=n/a): Notifying the senior Member(Id=12, Timestamp=2010-09-23 19:15:01.264, Address=10.253.97.133:16002, MachineId=38533, Location=site:mytest.com,machine:host1,process:15142, Role=Program2) of an unexpected cluster heartbeat from Member(Id=13, Timestamp=2010-09-23 19:15:03.524, Address=10.253.97.134:16001, MachineId=38534, Location=site:mytest.com,machine:host2,process:24439, Role=Program2)
    2010-09-23 19:29:57,729 24064 [Logger@9250185 3.5.3/465] WARN Coherence - 2010-09-23 19:29:57.729/24.311 Oracle Coherence GE 3.5.3/465 <Warning> (thread=Cluster, member=n/a): Notifying the senior Member(Id=12, Timestamp=2010-09-23 19:15:01.264, Address=10.253.97.133:16002, MachineId=38533, Location=site:mytest.com,machine:host1,process:15142, Role=Program2) of an unexpected cluster heartbeat from Member(Id=14, Timestamp=2010-09-23 19:15:16.618, Address=10.253.97.133:16003, MachineId=38533, Location=site:mytest.com,machine:host1,process:15734, Role=Program3)
    2010-09-23 19:30:13,808 40143 [Logger@9250185 3.5.3/465] WARN Coherence - 2010-09-23 19:30:13.807/40.390 Oracle Coherence GE 3.5.3/465 <Warning> (thread=Cluster, member=n/a): Notifying the senior Member(Id=12, Timestamp=2010-09-23 19:15:01.264, Address=10.253.97.133:16002, MachineId=38533, Location=site:mytest.com,machine:host1,process:15142, Role=Program2) of an unexpected cluster heartbeat from Member(Id=15, Timestamp=2010-09-23 19:15:26.421, Address=10.253.97.134:16002, MachineId=38534, Location=site:mytest.com,machine:host2,process:24991, Role=Program3)
    2010-09-23 19:30:18,453 44788 [Logger@9250185 3.5.3/465] ERROR Coherence - 2010-09-23 19:30:18.453/45.035 Oracle Coherence GE 3.5.3/465 <Error> (thread=main, member=n/a): Error while starting cluster: com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(Id=0, Name=Cluster, Type=Cluster
    MemberSet=ServiceMemberSet(
    OldestMember=n/a
    ActualMemberSet=MemberSet(Size=2, BitSetCount=2
    Member(Id=7, Timestamp=2010-09-23 19:29:48.207, Address=10.253.97.133:16001, MachineId=38533, Location=site:mytest.com,machine:host1,process:18482, Role=Program1)
    Member(Id=12, Timestamp=2010-09-23 19:15:01.264, Address=10.253.97.133:16002, MachineId=38533, Location=site:mytest.com,machine:host1,process:15142, Role=Program2)
    MemberId/ServiceVersion/ServiceJoined/ServiceLeaving
    7/3.5/Thu Sep 23 19:29:48 UTC 2010/false,
    12/3.5/Thu Sep 23 19:15:01 UTC 2010/false
    at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onStartupTimeout(Grid.CDB:6)
    at com.tangosol.coherence.component.util.daemon.queueProcessor.Service.start(Service.CDB:28)
    at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.start(Grid.CDB:38)
    at com.tangosol.coherence.component.net.Cluster.onStart(Cluster.CDB:366)
    at com.tangosol.coherence.component.net.Cluster.start(Cluster.CDB:11)
    at com.tangosol.coherence.component.util.SafeCluster.startCluster(SafeCluster.CDB:3)
    at com.tangosol.coherence.component.util.SafeCluster.restartCluster(SafeCluster.CDB:7)
    at com.tangosol.coherence.component.util.SafeCluster.ensureRunningCluster(SafeCluster.CDB:27)
    at com.tangosol.coherence.component.util.SafeCluster.start(SafeCluster.CDB:2)
    at com.tangosol.net.CacheFactory.ensureCluster(CacheFactory.java:998)
    at com.tangosol.net.DefaultConfigurableCacheFactory.ensureService(DefaultConfigurableCacheFactory.java:915)
    at com.oracle.coherence.environment.extensible.ExtensibleEnvironment.ensureService(ExtensibleEnvironment.java:374)
    at com.tangosol.net.DefaultConfigurableCacheFactory.ensureCache(DefaultConfigurableCacheFactory.java:877)
    at com.tangosol.net.DefaultConfigurableCacheFactory.configureCache(DefaultConfigurableCacheFactory.java:1088)
    at com.tangosol.net.DefaultConfigurableCacheFactory.ensureCache(DefaultConfigurableCacheFactory.java:304)
    at com.tangosol.net.CacheFactory.getCache(CacheFactory.java:735)
    at com.tangosol.net.CacheFactory.getCache(CacheFactory.java:712)
    2010-09-23 19:30:18,456 44791 [Logger@9250185 3.5.3/465] ERROR Coherence - 2010-09-23 19:30:18.456/45.038 Oracle Coherence GE 3.5.3/465 <Error> (thread=Cluster, member=n/a): validatePolls: This service timed-out due to unanswered handshake request. Manual intervention is required to stop the members that have not responded to this Poll
    PollId=1, active
    InitTimeMillis=1285270188378
    Service=Cluster (0)
    RespondedMemberSet=[]
    LeftMemberSet=[]
    RemainingMemberSet=[12]
    Edited by: user10049765 on Oct 7, 2010 2:23 PM

    Any suggestion?

  • Uniform Distributed Topics and Unicast Cluster Messaging

    ...seem to be incompatible. I have a WLS 10.0 domain that has a cluster of servers to which a uniform distributed topic is deployed. There is an MDB listening to the topic member in each server. A stateless session EJB publishes to the distributed topic from one of its methods, and I expect every MDB instance to receive the message. This works correctly when the cluster is configured with multicast messaging, but when I configured the cluster to use unicast messaging (as recommended in the documentation) the MDB on the server that receives the EJB call is the only MDB that receives the message.
    Is there something else I need to configure?

  • O-Cluster Error Messages

    Hello,
    Our team is running the O-Cluster algorithm in ODM (10G R2). During training, we are getting the following error message. Our models train using the K-Means algorithm, and we can train small, trivial models using O-Cluster, but I'm assuming there is something about our data that it doesn't like:
    ORA-40101: Data Mining System Error ODM_OC_CLUSTERING_MODEL-BUILD_OC.build_ocluster--20010
    ORA-06512: at "SYS.DBMS_SYS_ERROR", line 105
    ORA-06512: at "DMSYS.ODM_OC_CLUSTERING_MODEL", line 122
    ORA-06512: at "DMSYS.ODM_OC_CLUSTERING_MODEL", line 2408
    ORA-40101: Data Mining System Error ODM_OC_CLUSTERING_MODEL-BUILD_OC.ocluster--20010
    ORA-06512: at "SYS.DBMS_SYS_ERROR", line 105
    ORA-06512: at "DMSYS.ODM_OC_CLUSTERING_MODEL", line 122
    ORA-06512: at "DMSYS.ODM_OC_CLUSTERING_MODEL", line 2312
    ORA-06500: PL/SQL:
    Any ideas?
    Thanks,
    Chad

    Hi Chad,
    Sorry but there is not enough to go on with the error message.
    Are you running ODM 10.2.0.3?
    Did you invoke model build using ODMr?
    We might need to have a test case to run to understand why the failure is taking place.
    Have you ever worked with Oracle Support to file a problem report?
    They provide a means for development to access data from a client.
    Thanks, Mark

  • Compressor Cluster - Error message when attaching .scc caption files

    Hello,
    We have a 3 XServer cluster controlled by a 4th XServer (our FCServer machine). My workflow is:
    Source video: 1920X1080 ProRes video (28:30 min)
    Resized to 640X360 ProRes LT (also deinterlaced, with some black restore and sharpening applied here)
    Encoded 640X360 to H.264 at 750 Kb/sec, with the .scc files defined in the "Additional Information" tab in Compressor at this point.
    This job is submitted to the cluster. My submitting machine as well as all cluster machines are all connected to the same fiber network. All files are on the same XSAN.
    I am getting the following error message. I get it after it has tried to encode the video:
    Status: Failed - 5x HOST [fcsqm2.local] error: Failed to add CC to movie: -50
    note: fcsqm2 is one of the encoding machines in the cluster.
    I can't seem to find any answers via google. Anyone got any suggestions where I can look? Any ideas?
    Thanks a lot!
    Nathan

    {Ctrl + Shft + J} - any messages in the Error Console, relating to that?

  • Cluster error message ????

    I am running 2 WL servers in a cluster on two separate SUN Solaris
    machines not using a shared file system and using NES plugin. The
    properties files are exactly the same. The cluster comes up ok. I am
    using a simple counter servlet to test the clustering. First time the
    primary server updates the secondary just fine.
    I take the primary down and reload the servlet a few times. Things work
    fine with the message
    <RepMan> updateSecondary called on unpaired primary
    Then I bring back the primary server that I had killed. Try to reload
    the servlet. I get the following messages
    "<RepMan> getRepMan unable to obtain ReplicationManager (for
    id+ipaddress - where id is WL generated appended to IP of the machine).
    .[7001,7001,7002,7002,-1] " in the new primary server.
    "Unable to to create secondary for (id - WL generated id for the
    server)"
    Has anyone come across these messages in their logs when running a WL
    cluster (http session) on separate boxes w/o a shared file system???
    I have tested multicast and it seems to be ok for both machines. I have
    added a 3rd machine to the cluster and I still get the same results.

    Prasad,
    I did not see any comments on in-memory replication related issues in SP7.
    Do you have any information on when we should expect a service pack dealing
    with these issues?
    Thanks
    Vlad
    Prasad Peddada wrote:
    > This has been identified as a bug. We will fix this in the next service
    > pack.
    >
    > -- Prasad
    >
    > Junaid Hossain wrote:
    >
    > > I am running 2 WL servers in a cluster on two separate SUN Solaris
    > > machines not using a shared file system and using NES plugin. The
    > > properties file are exactly the same. The cluster comes up ok. I am
    > > using a simple counter servlet to test the clustering. First time the
    > > primary server updates seconday just fine.
    > > I take the primary down and reload the servlet a few times. Things work
    > > fine with the message
    > > <RepMan> updateSecondary called on unpaired primary
    > >
    > > Then I bring back the primary server that I had killed. Try to reload
    > > the servlet. I get the following messages
    > > "<RepMan> getRepMan unable to obtain ReplicationManager (for
    > > id+ipaddress - where id is WL generated appended to IP of the machine).
    > > .[7001,7001,7002,7002,-1] " in the new primary server.
    > >
    > > "Unable to to create secondary for (id - WL generated id for the
    > > server)"
    > >
    > > Has anyone come across these messages in their logs when running a WL
    > > cluster (http session) on separate boxes w/o a shared file system???
    > > I have tested multicast and seems to be ok for both machines. I have
    > > added 3rd machine in cluster and I still get the same results.

  • Compressor 3 "no cluster found" message - EASY FIX

    Greetings,
    I have had to install Compressor 3 times before I finally found this easy fix. In my case Compressor would stop working when a video would get "stuck" in the batch and just go on forever. From then on the batch would always say "no cluster found", and if I tried to submit a batch it would say something like "cluster not found".
    So how did I fix it without the dreaded delete everything and reinstall? I saw a post that said to check that your sharing name matches your cluster name in the Qmaster system preferences. They did not exactly match. The name in sharing was "My Mac" and the name in Qmaster was "My Mac Cluster". I changed it to match the sharing name "My Mac" and hit the start sharing button. I restarted my Mac (not sure if that's needed) and then opened the batch monitor. Instead of "no cluster found" it now said "My Mac", and when I submitted a batch in Compressor, "This Computer" showed up again and it started working again!
    Not sure if this works in all cases. But I hope this post helps someone else from having to reinstall everything, AND maybe apple can read this to help figure out what the problem is with their software.

    i had the usual "no clusters found" in batch monitor, and "this computer" did not appear in batch monitor. so, when trying to submit a batch from compressor, i got the "unable to submit batch / restart your computer" error message.
    so for 2 days i tried really EVERY METHOD i found in all blogs, threads, posts, discussion forums to remove / reinstall / make compressor work.
    here is what i tried:
    fcs remover.app / completely reinstall fcp from scratch
    (http://www.digitalrebellion.com/fcs_remover.htm)
    partially delete / reinstall compressor / qmaster only via standard install
    (http://docs.info.apple.com/article.html?artnum=302845)
    partially delete / reinstall compressor / qmaster only via pacifist install
    http://www.scottsimmons.tv/blog/2008/01/11/compressor-hatred-resolved/
    http://www.charlessoft.com/
    NOTHING WORKED. before completely re-installing my entire OS, i found this post here AND IT WORKS - at least on my machine
    i hope this will save some of you sleepless nights
    regards
    ivan

  • WLS Cluster with Message Driven Beans and MQSeries on more than one Host

    With the examples at http://developer.bea.com/jmsproviders.jsp and http://developer.bea.com/jmsmdb.jsp, an MDB can be configured to work with MQSeries with one WLS Server. This works only if a Queuemanager is started on the same host that runs the WLS Server and the QueueConnectionFactory (QCF) is configured to TRANSPORT(BIND).
    My configuration should have two WLS Servers and one JMS Queue (MQS) with the Queuemanager. A Message Driven Bean is deployed on both WLS Servers, which should get the messages of this queue. If one of the two WLS Servers fails, the other WLS Server with the corresponding MDB should get the messages of the MQSeries queue.
    If the QCF is configured to TRANSPORT(CLIENT), the Message Driven Bean can't start and the following exception is thrown:
    <Jul 18, 2001 3:52:49 PM CEST> <Error> <J2EE> <Error deploying EJB Component : mdb_deployed
    weblogic.ejb20.EJBDeploymentException: Error deploying Message-Driven EJB:; nested exception is:
    javax.jms.JMSException: MQJMS2005: failed to create MQQueueManager for 'btsun1a:TEST'
    javax.jms.JMSException: MQJMS2005: failed to create MQQueueManager for 'btsun1a:TEST'
    at com.ibm.mq.jms.services.ConfigEnvironment.newException(ConfigEnvironment.java:434)
    I'm puzzled, because there is an MQQueueManager on btsun1a; all servers throw the same exception when the MDB is deployed.
    The JMSAdmin configuration on both hosts is the following:
    dis qcf(myQCF2)
    HOSTNAME(btsun1a)
    CCSID(819)
    TRANSPORT(CLIENT)
    PORT(1414)
    TEMPMODEL(SYSTEM.DEFAULT.MODEL.QUEUE)
    QMANAGER(TEST)
    CHANNEL(JAVA.CHANNEL)
    VERSION(1)
    dis q(myQueue)
    CCSID(819)
    PERSISTENCE(APP)
    TARGCLIENT(JMS)
    QUEUE(MYQUEUE)
    EXPIRY(APP)
    QMANAGER(TEST)
    ENCODING(NATIVE)
    VERSION(1)
    PRIORITY(APP)
    I think only TRANSPORT(CLIENT) can be used when I don't want to install a Queue and a QueueManager on each WLS Server.
    Does anybody know of a problem with WLS 6.0 SP2 handling TRANSPORT(CLIENT)?
              

              With the Examples of http://developer.bea.com/jmsproviders.jsp and http://developer.bea.com/jmsmdb.jsp
              a MDB can be
              configured to work with MQSeries with one WLS Server. This works only, if a Queuemanager
              is started at the same Host that runs the WLS Server too.
              And the QueueConnectionFactory (QCF) is configured to TRANSPORT(BIND).
              In my configuration should be two WLS Servers and one JMS Queue (MQS) with the
              Queuemanager.
              A Message Driven Bean is deployed on both WLS Servers wich should get the Messages
              of this Queue.
              If one of the two WLS Servers fails the other WLS Server with the corresponding
              MDB should get the Messages of the
              MQSeries Queue.
              If the QCF is configured to TRANSPORT(Client) the Message Driven Bean can't start
              and the following Exception is thrown:
              <Jul 18, 2001 3:52:49 PM CEST> <Error> <J2EE> <Error deploying EJB Component :
              mdb_deployed
              weblogic.ejb20.EJBDeploymentException: Error deploying Message-Driven EJB:; nested
              exception is:
              javax.jms.JMSException: MQJMS2005: failed to create MQQueueManager for
              'btsun1a:TEST'
              javax.jms.JMSException: MQJMS2005: failed to create MQQueueManager for 'btsun1a:TEST'
              at com.ibm.mq.jms.services.ConfigEnvironment.newException(ConfigEnvironment.java:434)
              I'm puzzled, because there is an MQQueueManager on btsun1a; all servers throw
              the same exception when the MDB is deployed.
              The JMSAdmin configuration on both hosts is as follows:
              dis qcf(myQCF2)
              HOSTNAME(btsun1a)
              CCSID(819)
              TRANSPORT(CLIENT)
              PORT(1414)
              TEMPMODEL(SYSTEM.DEFAULT.MODEL.QUEUE)
              QMANAGER(TEST)
              CHANNEL(JAVA.CHANNEL)
              VERSION(1)
              dis q(myQueue)
              CCSID(819)
              PERSISTENCE(APP)
              TARGCLIENT(JMS)
              QUEUE(MYQUEUE)
              EXPIRY(APP)
              QMANAGER(TEST)
              ENCODING(NATIVE)
              VERSION(1)
              PRIORITY(APP)
              I think only TRANSPORT(CLIENT) can be used when I don't want to install a queue
              and a queue manager on each WLS server.
              Does anybody know of a problem in WLS 6.0 SP2 with handling TRANSPORT(CLIENT)?
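
With TRANSPORT(CLIENT), the JMS client reaches the queue manager over TCP, so one basic check is whether each WLS host can open a connection to the listener at all. The sketch below is an illustration only and not part of the original post: `listener_reachable` is a hypothetical helper, and it assumes the listener for TEST is expected at the HOSTNAME and PORT shown in the QCF above (btsun1a, 1414). A successful connect says nothing about whether JAVA.CHANNEL exists as a server-connection channel on the queue manager; that would have to be verified separately.

```python
import socket


def listener_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: it only tests network reachability of the
    listener port, not the MQ channel or queue manager configuration.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Run on each WLS host, `listener_reachable("btsun1a", 1414)` returning False would point to a network or listener problem rather than a WLS one.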
              

  • Hyper-V guest SQL 2012 cluster live migration failure

    I have two IBM HX5 nodes connected to an IBM DS5300. A Hyper-V 2012 cluster was built on the blades. Six virtual machines were created in the Hyper-V cluster and connected to the DS5300 via Hyper-V Virtual SAN. These VMs form a guest SQL cluster. The database files are placed on
    DS5300 storage and are accessed through the VM Fibre Channel adapters. The IBM MPIO module is installed on all hosts and VMs.
    The SQL Server instances work without problems. But when I try to live migrate a SQL VM to another Hyper-V node, the SQL instance fails. In the SQL error log I see:
    2013-06-19 10:39:44.07 spid1s      Error: 17053, Severity: 16, State: 1.
    2013-06-19 10:39:44.07 spid1s      SQLServerLogMgr::LogWriter: Operating system error 170(The requested resource is in use.) encountered.
    2013-06-19 10:39:44.07 spid1s      Write error during log flush.
    2013-06-19 10:39:44.07 spid55      Error: 9001, Severity: 21, State: 4.
    2013-06-19 10:39:44.07 spid55      The log for database 'Admin' is not available. Check the event log for related error messages. Resolve any errors and restart the database.
    2013-06-19 10:39:44.07 spid55      Database Admin was shutdown due to error 9001 in routine 'XdesRMFull::CommitInternal'. Restart for non-snapshot databases will be attempted after all connections to the database are aborted.
    2013-06-19 10:39:44.31 spid36s     Error: 17053, Severity: 16, State: 1.
    2013-06-19 10:39:44.31 spid36s     fcb::close-flush: Operating system error (null) encountered.
    2013-06-19 10:39:44.31 spid36s     Error: 17053, Severity: 16, State: 1.
    2013-06-19 10:39:44.31 spid36s     fcb::close-flush: Operating system error (null) encountered.
    2013-06-19 10:39:44.32 spid36s     Error: 17053, Severity: 16, State: 1.
    2013-06-19 10:39:44.32 spid36s     fcb::close-flush: Operating system error (null) encountered.
    2013-06-19 10:39:44.32 spid36s     Error: 17053, Severity: 16, State: 1.
    2013-06-19 10:39:44.32 spid36s     fcb::close-flush: Operating system error (null) encountered.
    2013-06-19 10:39:44.33 spid36s     Starting up database 'Admin'.
    2013-06-19 10:39:44.58 spid36s     349 transactions rolled forward in database 'Admin' (6:0). This is an informational message only. No user action is required.
    2013-06-19 10:39:44.58 spid36s     SQLServerLogMgr::FixupLogTail (failure): alignBuf 0x000000001A75D000, writeSize 0x400, filePos 0x156adc00
    2013-06-19 10:39:44.58 spid36s     blankSize 0x3c0000, blkOffset 0x1056e, fileSeqNo 1313, totBytesWritten 0x0
    2013-06-19 10:39:44.58 spid36s     fcb status 0x42, handle 0x0000000000000BC0, size 262144 pages
    2013-06-19 10:39:44.58 spid36s     Error: 17053, Severity: 16, State: 1.
    2013-06-19 10:39:44.58 spid36s     SQLServerLogMgr::FixupLogTail: Operating system error 170(The requested resource is in use.) encountered.
    2013-06-19 10:39:44.58 spid36s     Error: 5159, Severity: 24, State: 13.
    2013-06-19 10:39:44.58 spid36s     Operating system error 170(The requested resource is in use.) on file "v:\MSSQL\log\Admin\Log.ldf" during FixupLogTail.
    2013-06-19 10:39:44.58 spid36s     Error: 3414, Severity: 21, State: 1.
    2013-06-19 10:39:44.58 spid36s     An error occurred during recovery, preventing the database 'Admin' (6:0) from restarting. Diagnose the recovery errors and fix them, or restore from a known good backup. If errors are not corrected or expected,
    contact Technical Support.
    In the Windows System log I see a lot of warnings like this:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="Microsoft-Windows-Ntfs" Guid="{3FF37A1C-A68D-4D6E-8C9B-F79E8B16C482}" />
        <EventID>140</EventID>
        <Version>0</Version>
        <Level>3</Level>
        <Task>0</Task>
        <Opcode>0</Opcode>
        <Keywords>0x8000000000000008</Keywords>
        <TimeCreated SystemTime="2013-06-19T06:39:44.314400200Z" />
        <EventRecordID>25239</EventRecordID>
        <Correlation />
        <Execution ProcessID="4620" ThreadID="4284" />
        <Channel>System</Channel>
        <Computer>sql-node-5.local.net</Computer>
        <Security UserID="S-1-5-21-796845957-515967899-725345543-17066" />
      </System>
      <EventData>
        <Data Name="VolumeId">\\?\Volume{752f0849-6201-48e9-8821-7db897a10305}</Data>
        <Data Name="DeviceName">\Device\HarddiskVolume70</Data>
        <Data Name="Error">0x80000011</Data>
      </EventData>
    </Event>
    The system failed to flush data to the transaction log. Corruption may occur in VolumeId: \\?\Volume{752f0849-6201-48e9-8821-7db897a10305}, DeviceName: \Device\HarddiskVolume70.
    ({Device Busy}
    The device is currently busy.)
    There are no errors or warnings on the Hyper-V hosts.

    Hello,
    I am trying to involve someone more familiar with this topic to take a further look at this issue. Some delay may be expected while the case is transferred. Your patience is greatly appreciated.
    Thank you for your understanding and support.
    Regards,
    Fanny Liu
    TechNet Community Support

  • Cluster 3.2 failure retry time

    Dear All,
    I have Messaging Server 7 in a cluster; however, for some reason, after many watcher crashes, the cluster didn't restart the messaging resource.
    I am wondering if there is a retry timeout or a maximum number of retries.
    Would anyone please let me know if there are such options and, if so, how to set them?
    Regards,
    Scotty

    Hi Scotty,
    if you want indefinite restarts, you can set Retry_count to -1. However, I would recommend against that. Think about a cyclic failure: for example, 10 seconds after each start your messaging server stops reacting to the probe and a restart is submitted. In such cases you will find it hard to intervene.
    The retry count is a safety feature that prevents such reconfiguration storms.
    My suggestion is to set it to a reasonable number.
    Retry_count * Thorough_probe_interval needs to be smaller than Retry_interval. If the default number seems too small, increase it.
    The command is:
    clrs set -p retry_count=<new value> <your resource name>
    Cheers
    Detlef
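
The rule of thumb in the reply above (Retry_count * Thorough_probe_interval smaller than Retry_interval) is easy to sanity-check before changing a resource. The following is a minimal sketch, not part of the original reply: `retry_settings_consistent` is a hypothetical helper, all values are assumed to be in seconds, and the numbers in the usage note are made up for illustration rather than taken from any resource type's defaults.

```python
def retry_settings_consistent(retry_count, thorough_probe_interval, retry_interval):
    """Check that retry_count * thorough_probe_interval < retry_interval.

    Hypothetical helper: intervals are in seconds. A negative retry_count
    (unlimited restarts) is rejected because the rule does not apply to it.
    """
    if retry_count < 0:
        raise ValueError("rule applies only to a non-negative retry count")
    return retry_count * thorough_probe_interval < retry_interval
```

For example, with an illustrative probe interval of 60 and retry interval of 300, a count of 2 satisfies the rule (120 < 300) while a count of 5 does not (300 is not smaller than 300).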

  • Solaris Cluster Private Link Failure

    Hi,
    I have configured Solaris Cluster 3.3 and added two back-to-back interconnect cables.
    Sun Cluster is working fine, but the private link fails: I cannot ping clusternode2-priv and clusternode1-priv from each other, and some commands fail:
    ~ # ping clusternode2-priv
    no answer from clusternode2-priv
    ~ # metaset -s nfsds -a -h t1u331 t1u332
    metaset: 172.16.4.1: metad client create: RPC: Rpcbind failure
    ~ # scstat
    -- Cluster Nodes --
    Node name Status
    Cluster node: n1u332 Online
    Cluster node: n1u331 Online
    -- Cluster Transport Paths --
    Endpoint Endpoint Status
    Transport path:   n1u332:nxge2           n1u331:nxge2           Path online
    Transport path:   n1u332:nxge1           n1u331:nxge1           Path online
    -- Quorum Summary from latest node reconfiguration --
    Quorum votes possible: 3
    Quorum votes needed: 2
    Quorum votes present: 3
    -- Quorum Votes by Node (current status) --
    Node Name Present Possible Status
    Node votes: n1u332 1 1 Online
    Node votes: n1u331 1 1 Online
    -- Quorum Votes by Device (current status) --
    Device Name Present Possible Status
    Device votes: /dev/did/rdsk/d4s2 1 1 Online
    -- Device Group Servers --
    Device Group Primary Secondary
    -- Device Group Status --
    Device Group Status
    -- Multi-owner Device Groups --
    Device Group Online Status
    -- Resource Groups and Resources --
    Group Name Resources
    -- Resource Groups --
    Group Name Node Name State Suspended
    -- Resources --
    Resource Name Node Name State Status Message
    -- IPMP Groups --
    Node Name Group Status Adapter Status
    [root @ n1u332]
    ~ # ifconfig -a
    lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
    inet 127.0.0.1 netmask ff000000
    e1000g0: flags=1000802<BROADCAST,MULTICAST,IPv4> mtu 1500 index 2
    inet 0.0.0.0 netmask 0
    ether 0:15:17:e3:a4:e8
    vsw0: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 3
    inet 10.131.58.76 netmask ffffff00 broadcast 10.131.58.255
    groupname ipmp-grp
    ether 0:14:4f:f9:1:bd
    vsw0:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
    inet 10.131.58.75 netmask ffffff00 broadcast 10.131.58.255
    vsw1: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 4
    inet 10.131.58.77 netmask ffffff00 broadcast 10.131.58.255
    groupname ipmp-grp
    ether 0:14:4f:fb:44:4
    nxge1: flags=1008843<UP,BROADCAST,RUNNING,MULTICAST,PRIVATE,IPv4> mtu 1500 index 7
    inet 172.16.0.129 netmask ffffff80 broadcast 172.16.0.255
    ether 0:14:4f:a0:81:d9
    nxge2: flags=1008843<UP,BROADCAST,RUNNING,MULTICAST,PRIVATE,IPv4> mtu 1500 index 6
    inet 172.16.1.1 netmask ffffff80 broadcast 172.16.1.127
    ether 0:14:4f:a0:81:da
    clprivnet0: flags=1009843<UP,BROADCAST,RUNNING,MULTICAST,MULTI_BCAST,PRIVATE,IPv4> mtu 1500 index 8
    inet 172.16.4.1 netmask fffffe00 broadcast 172.16.5.255
    ether 0:0:0:0:0:1
    [root @ n1u332]
    ~ # dladm show-dev
    vsw0 link: up speed: 1000 Mbps duplex: full
    vsw1 link: up speed: 1000 Mbps duplex: full
    e1000g0 link: down speed: 0 Mbps duplex: half
    e1000g1 link: up speed: 1000 Mbps duplex: full
    e1000g2 link: unknown speed: 0 Mbps duplex: half
    e1000g3 link: unknown speed: 0 Mbps duplex: half
    nxge0 link: up speed: 100 Mbps duplex: full
    nxge1 link: up speed: 1000 Mbps duplex: full
    nxge2 link: up speed: 1000 Mbps duplex: full
    nxge3 link: up speed: 100 Mbps duplex: full
    e1000g4 link: unknown speed: 0 Mbps duplex: half
    e1000g5 link: up speed: 1000 Mbps duplex: full
    clprivnet0              link: unknown   speed: 0     Mbps       duplex: unknown
    Edited by: 808696 on Mar 2, 2011 8:27 AM

    If your private interconnect had really failed, one or the other of the cluster nodes would have panicked. I think it is more likely that you have changed the nsswitch.conf entry for hosts so that it no longer lists 'cluster' first, although I would have expected that to result in an unresolved host name. The other option is that you have hardened your machine in some way with IP Filter rules or security settings.
    Has it ever worked?
    Tim
    ---
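
The nsswitch.conf point in the reply above can be checked mechanically. This is a minimal sketch under stated assumptions, not something from the original thread: `hosts_sources` and `cluster_listed_first` are hypothetical helper names, and the sample lines in the usage note are invented for illustration and do not come from the poster's machine.

```python
def hosts_sources(nsswitch_text):
    """Return the lookup sources on the 'hosts:' line, in order."""
    for line in nsswitch_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line.startswith("hosts:"):
            tokens = line[len("hosts:"):].split()
            # Skip action items such as [NOTFOUND=return].
            return [tok for tok in tokens if not tok.startswith("[")]
    return []


def cluster_listed_first(nsswitch_text):
    """True if 'cluster' is the first source for host lookups."""
    sources = hosts_sources(nsswitch_text)
    return bool(sources) and sources[0] == "cluster"
```

Passing the contents of /etc/nsswitch.conf to `cluster_listed_first` returns True for a line such as `hosts: cluster files dns` and False for `hosts: files dns`.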

  • Multihomed, unicast cluster question

    I'm trying to direct a unicast cache cluster to run on a different IP than the server default by using the following Java command-line options:
    -Dtangosol.coherence.localhost=192.168.202.xxx -Dtangosol.coherence.localport=20000
    I'm able to tell the jvms how to connect to the cluster over the non-default IP address on the box. All 6 jvms (2 jvms per server - 3 servers) cluster in, but it seems that something is missing because when we kick off the script to load the cache files, it doesn't load successfully - it seems stalled. There are no error messages anywhere and no indicator that anything is wrong. When I remove all the references to the IP I want to use and use the defaults instead, the jvms cluster fine and when we kick off the script to load the cache, it runs normally. I'm not modifying the tangosol-coherence-override.xml file - I'm just doing the above options on the java command line. Shouldn't this be enough? Or do I need to make modifications to the override xml too?

    Sorry for taking so long to respond, the past couple weeks were pretty busy. I've found that the cache loading script does work, but instead of taking 10-15 minutes, it's taking about an hour and 15 minutes to load cache over hipersocket. Over ethernet, the load time is normal. We've found out about a bug for the hipersocket driver that exists in SuSE z/OS linux, and we think it also exists in Redhat RHEL4.0 z/OS linux. I'm going to do some more research into server / network tuning on hipersockets before I open a ticket with you guys. Thanks for the prompt response though ...
    - Jim

  • Cluster point of failure

    I'm trying to set up an environment where, if my primary web server goes down, requests are sent to the backup. I think clustering can help me here, but my fear is that I have a single point of failure on the managing server. If I have a cluster, is one machine managing all traffic, so that if that machine went down my entire site would be down? Any suggestions on how to handle this at the router level would also be appreciated.
    Scott

    I'm not sure I understand your question completely.
    You can certainly run multiple managed servers and/or a cluster of managed servers to give you some redundancy.
    You can run multiple physical and/or virtual machines.
    You can run multiple sites etc for disaster recovery.
    I can't recall a site I've visited in a long time that didn't do all of these.
    Was there a specific question you had about HA or failure scenarios?
    -- Rob
    WLS Blog http://dev2dev.bea.com/blog/rwoollen/
