Error in Coherence: stopping cluster service

I found the following error in one of my Coherence server log files. Can someone explain what it means?
Coherence Logger@9272718 3.4.2/411 ERROR 2009-06-01 16:08:31.396/1217.130 Oracle Coherence GE 3.4.2/411 <Error> (thread=Cluster, member=3): Received cluster heartbeat from the senior Member(Id=7, Timestamp=2009-04-24 12:29:25.802, Address=xx.xxx.xx.xxx:8093, MachineId=55400, Location=machine:server72,process:11324, Role=WeblogicServer) that does not contain this Member(Id=3, Timestamp=2009-06-01 15:48:09.18, Address=xx.xxx.xxx.xx:8091, MachineId=47428, Location=site:ops.company.org,machine:cohserverbox1,process:14401, Role=CoherenceServer); stopping cluster service.
Thanks Much

Hi,
This error essentially means what it says: the process received a cluster heartbeat that did not include the process as a member of the cluster. The process therefore stops its cluster service and will attempt to rejoin the cluster when appropriate. There are a few reasons why the senior member may not have included the process in its heartbeat. Based on the timestamps and roles, I would first want to confirm that these processes are meant to be clustered together. If they are not, I would adjust their configurations (e.g. use a distinct port) so that they form separate clusters. If they are, and the error (with the timestamp spread) reproduces, I would examine the network topology and look for reasons the members are being dropped from the cluster.
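To make the "timestamp spread" mentioned above concrete, here is a minimal Java sketch that pulls the two Member timestamps out of a heartbeat error line and reports how far apart they are. The class and method names are hypothetical, the regular expression assumes the Member(Id=..., Timestamp=..., ...) layout shown in the log line above, and it assumes the Timestamp field reflects when each member joined. It uses only the JDK, not any Coherence API.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MemberTimestampSpread {

    // Captures the Timestamp field of each "Member(Id=..., Timestamp=..., ...)" description.
    private static final Pattern MEMBER_TIMESTAMP = Pattern.compile(
            "Member\\(Id=\\d+, Timestamp=(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{1,3})");

    // The fractional part has 1 to 3 digits in the log (e.g. "15:48:09.18").
    private static final DateTimeFormatter FORMAT = new DateTimeFormatterBuilder()
            .appendPattern("uuuu-MM-dd HH:mm:ss")
            .appendFraction(ChronoField.NANO_OF_SECOND, 1, 3, true)
            .toFormatter();

    /** Returns the Member timestamps in the order they appear in the line. */
    static List<LocalDateTime> memberTimestamps(String logLine) {
        List<LocalDateTime> result = new ArrayList<>();
        Matcher m = MEMBER_TIMESTAMP.matcher(logLine);
        while (m.find()) {
            result.add(LocalDateTime.parse(m.group(1), FORMAT));
        }
        return result;
    }

    /** Time between the first Member (the senior) and the second (this member). */
    static Duration spread(String logLine) {
        List<LocalDateTime> ts = memberTimestamps(logLine);
        if (ts.size() < 2) {
            throw new IllegalArgumentException("expected two Member(...) descriptions");
        }
        return Duration.between(ts.get(0), ts.get(1));
    }

    public static void main(String[] args) {
        // Shortened copy of the error line from the question.
        String line = "Received cluster heartbeat from the senior "
                + "Member(Id=7, Timestamp=2009-04-24 12:29:25.802, Role=WeblogicServer) "
                + "that does not contain this "
                + "Member(Id=3, Timestamp=2009-06-01 15:48:09.18, Role=CoherenceServer)";
        System.out.println("Spread in days: " + spread(line).toDays());
    }
}
```

For the line in the question, the two timestamps are a little over five weeks apart, and the roles differ (WeblogicServer versus CoherenceServer), which is why confirming the intent to cluster these processes is the first step.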
Regards,
Harv

Similar Messages

  • Cluster service is requested to stop on all nodes when DNS is unavailable

    Our 6-node Coherence cluster had been running fine for a few days. All Coherence nodes were requested to stop the cluster service when the DNS server was unavailable for a few minutes due to scheduled maintenance. The cluster services did not come back up until the DNS server was available again. Why would the cluster need a DNS server when it had already started and had been running fine for a few days?
    Here’s the error message and thread dump from the logs:
    2010-12-18 18:07:18.819/3464791.277 Oracle Coherence GE 3.6.0.3 <Error> (thread=IpMonitor, member=7): Detected hard timeout) of {WrapperGuardable Guard{Daemon=Cluster} Service=ClusterService{Name=Cluster, State=(SERVICE_STARTED, STATE_JOINED), Id=0, Version=3.6, OldestMemberId=5}}
    2010-12-18 18:07:18.823/3464791.281 Oracle Coherence GE 3.6.0.3 <Error> (thread=Termination Thread, member=7): Full Thread Dump
    Thread[Invocation:Management:EventDispatcher,5,Cluster]
    java.lang.Object.wait(Native Method)
    com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
    com.tangosol.coherence.component.util.daemon.queueProcessor.Service$EventDispatcher.onWait(Service.CDB:7)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
    java.lang.Thread.run(Thread.java:619)
    Thread[Logger@9250962 3.6.0.3,3,main]
    java.lang.Object.wait(Native Method)
    com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
    java.lang.Thread.run(Thread.java:619)
    Thread[Signal Dispatcher,9,system]
    Thread[Finalizer,8,system]
    java.lang.Object.wait(Native Method)
    java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
    java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
    java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
    Thread[Invocation:Management,5,Cluster]
    java.lang.Object.wait(Native Method)
    com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onWait(Grid.CDB:6)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
    java.lang.Thread.run(Thread.java:619)
    ThreadCluster
    java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
    java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:850)
    java.net.InetAddress.getAddressFromNameService(InetAddress.java:1201)
    java.net.InetAddress.getAllByName0(InetAddress.java:1154)
    java.net.InetAddress.getAllByName(InetAddress.java:1084)
    java.net.InetAddress.getAllByName(InetAddress.java:1020)
    java.net.InetAddress.getByName(InetAddress.java:970)
    java.net.InetSocketAddress.<init>(InetSocketAddress.java:124)
    com.tangosol.net.ConfigurableAddressProvider$AddressHolder.getAddress(ConfigurableAddressProvider.java:426)
    com.tangosol.net.ConfigurableAddressProvider$1.next(ConfigurableAddressProvider.java:167)
    java.util.AbstractCollection.contains(AbstractCollection.java:89)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.ClusterService.isWellKnown(ClusterService.CDB:5)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.ClusterService.compareImportance(ClusterService.CDB:7)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.ClusterService.getWitnessMemberSet(ClusterService.CDB:49)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.ClusterService.verifyMemberLeft(ClusterService.CDB:91)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.ClusterService.onNotifyTcmpTimeout(ClusterService.CDB:11)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.ClusterService$NotifyTcmpTimeout.onReceived(ClusterService.CDB:1)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onMessage(Grid.CDB:11)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onNotify(Grid.CDB:33)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.ClusterService.onNotify(ClusterService.CDB:3)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
    java.lang.Thread.run(Thread.java:619)
    Thread[main,5,main]
    java.lang.Object.wait(Native Method)
    com.tangosol.net.DefaultCacheServer.monitorServices(DefaultCacheServer.java:270)
    com.tangosol.net.DefaultCacheServer.startAndMonitor(DefaultCacheServer.java:56)
    com.tangosol.net.DefaultCacheServer.main(DefaultCacheServer.java:197)
    Thread[PacketReceiver,7,Cluster]
    java.lang.Object.wait(Native Method)
    com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
    com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketReceiver.onWait(PacketReceiver.CDB:2)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
    java.lang.Thread.run(Thread.java:619)
    Thread[PacketSpeaker,8,Cluster]
    java.lang.Object.wait(Native Method)
    com.tangosol.coherence.component.util.queue.ConcurrentQueue.waitForEntry(ConcurrentQueue.CDB:16)
    com.tangosol.coherence.component.util.queue.ConcurrentQueue.remove(ConcurrentQueue.CDB:7)
    com.tangosol.coherence.component.util.Queue.remove(Queue.CDB:1)
    com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketSpeaker.onNotify(PacketSpeaker.CDB:21)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
    java.lang.Thread.run(Thread.java:619)
    Thread[Termination Thread,6,Cluster]
    java.lang.Thread.dumpThreads(Native Method)
    java.lang.Thread.getAllStackTraces(Thread.java:1487)
    com.tangosol.net.GuardSupport.logStackTraces(GuardSupport.java:810)
    com.tangosol.coherence.component.net.Cluster$DefaultFailurePolicy.onGuardableTerminate(Cluster.CDB:4)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid$WrapperGuardable.terminate(Grid.CDB:1)
    com.tangosol.net.GuardSupport$Context$2.run(GuardSupport.java:677)
    java.lang.Thread.run(Thread.java:619)
    Thread[Reference Handler,10,system]
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
    Thread[PacketPublisher,6,Cluster]
    java.lang.Object.wait(Native Method)
    com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
    com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketPublisher.onWait(PacketPublisher.CDB:2)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
    java.lang.Thread.run(Thread.java:619)
    Thread[DistributedCache,5,Cluster]
    java.lang.Object.wait(Native Method)
    com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onWait(Grid.CDB:6)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
    java.lang.Thread.run(Thread.java:619)
    Thread[IpMonitor,6,Cluster]
    java.lang.Object.wait(Native Method)
    com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
    com.tangosol.coherence.component.util.daemon.IpMonitor.onWait(IpMonitor.CDB:4)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
    java.lang.Thread.run(Thread.java:619)
    Thread[PacketListener1P,8,Cluster]
    java.net.PlainDatagramSocketImpl.receive0(Native Method)
    java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:136)
    java.net.DatagramSocket.receive(DatagramSocket.java:725)
    com.tangosol.coherence.component.net.socket.UdpSocket.receive(UdpSocket.CDB:22)
    com.tangosol.coherence.component.net.UdpPacket.receive(UdpPacket.CDB:1)
    com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketListener.onNotify(PacketListener.CDB:20)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
    java.lang.Thread.run(Thread.java:619)
    Thread[PacketListener1,8,Cluster]
    java.net.PlainDatagramSocketImpl.receive0(Native Method)
    java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:136)
    java.net.DatagramSocket.receive(DatagramSocket.java:725)
    com.tangosol.coherence.component.net.socket.UdpSocket.receive(UdpSocket.CDB:22)
    com.tangosol.coherence.component.net.UdpPacket.receive(UdpPacket.CDB:1)
    com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketListener.onNotify(PacketListener.CDB:20)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
    java.lang.Thread.run(Thread.java:619)
    2010-12-18 18:07:18.823/3464791.281 Oracle Coherence GE 3.6.0.3 <Warning> (thread=Termination Thread, member=7): Terminating Guard{Daemon=Cluster}
    2010-12-18 18:07:18.823/3464791.281 Oracle Coherence GE 3.6.0.3 <Error> (thread=StopService, member=7): Requested to stop cluster service.
    2010-12-18 18:07:18.826/3464791.284 Oracle Coherence GE 3.6.0.3 <D5> (thread=DistributedCache, member=7): Service DistributedCache left the cluster
    2010-12-18 18:07:18.826/3464791.284 Oracle Coherence GE 3.6.0.3 <D5> (thread=Invocation:Management, member=7): Service Management left the cluster
    2010-12-18 18:07:24.904/3464797.362 Oracle Coherence GE 3.6.0.3 <Error> (thread=main, member=7): Failed to restart services: com.tangosol.net.RequestTimeoutException: Timeout while waiting for cluster to stop.
    2010-12-18 18:07:33.915/3464806.373 Oracle Coherence GE 3.6.0.3 <Error> (thread=main, member=7): Failed to restart services: com.tangosol.net.RequestTimeoutException: Timeout while waiting for cluster to stop.
    2010-12-18 18:07:42.924/3464815.382 Oracle Coherence GE 3.6.0.3 <Error> (thread=main, member=7): Failed to restart services: com.tangosol.net.RequestTimeoutException: Timeout while waiting for cluster to stop.
    2010-12-18 18:07:51.936/3464824.394 Oracle Coherence GE 3.6.0.3 <Error> (thread=main, member=7): Failed to restart services: com.tangosol.net.RequestTimeoutException: Timeout while waiting for cluster to stop.

    The log file shows the well-known address list as IP addresses, but the addresses are configured by hostname in the override file.
    Here's the log entry:
    WellKnownAddressList(Size=2,
    WKA{Address=165.X.X.XX7, Port=8088}
    WKA{Address=165.X.X.XX8, Port=8088}
    Here's the configuration from tangosol-coherence-override-prod.xml:
    <well-known-addresses>
      <socket-address id="1">
        <address system-property="tangosol.coherence.wka">serverA</address>
        <port system-property="tangosol.coherence.wka.port">8088</port>
      </socket-address>
      <socket-address id="2">
        <address system-property="tangosol.coherence.wka">serverB</address>
        <port system-property="tangosol.coherence.wka.port">8088</port>
      </socket-address>
    </well-known-addresses>
    Thanks,
    Ramesh
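In the thread dump above, the Cluster thread is inside InetAddress.getByName, reached from InetSocketAddress.<init> via ConfigurableAddressProvider and ClusterService.isWellKnown. That suggests the WKA hostnames were being resolved again at runtime, which would block while DNS is unreachable. The Java sketch below only illustrates the standard JDK behaviour involved, not Coherence internals: a hostname passed to the InetSocketAddress constructor triggers a name lookup, while an IP literal does not. The class name, method names, and the address 192.0.2.7 (a documentation-only address) are assumptions for illustration.

```java
import java.net.InetSocketAddress;

public class WkaResolutionSketch {

    // Same constructor as in the trace. For a hostname this calls
    // InetAddress.getByName and may block on DNS; for an IP literal the JDK
    // only validates the text and performs no lookup.
    static InetSocketAddress eager(String hostOrIp, int port) {
        return new InetSocketAddress(hostOrIp, port);
    }

    // Never performs a lookup; the address stays flagged as unresolved.
    static InetSocketAddress deferred(String host, int port) {
        return InetSocketAddress.createUnresolved(host, port);
    }

    // Measures how long the eager construction takes, in milliseconds.
    static long timeEagerMillis(String hostOrIp, int port) {
        long start = System.nanoTime();
        eager(hostOrIp, port);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // IP literal: resolved immediately without contacting a name service.
        System.out.println(eager("192.0.2.7", 8088).isUnresolved());
        // Hostname kept unresolved on purpose: no DNS traffic at all.
        System.out.println(deferred("serverA", 8088).isUnresolved());
        // Uncomment to time a real hostname lookup in your environment;
        // with DNS down this call is the one that can stall.
        // System.out.println(timeEagerMillis("serverA", 8088) + " ms");
    }
}
```

If this is indeed the cause, one thing worth testing (an assumption, not a confirmed fix) is whether listing the WKA members by IP address, or making the names resolvable locally, keeps the cluster service from depending on DNS availability.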

  • Windows could not start the Cluster Service on Local computer. For more information, review the System Event Log. If this is a non-Microsoft service, contact the service vendor, and refer to service-specific error code 2.

    Dear Technet,
    Windows could not start the Cluster Service on Local computer. For more information, review the System Event Log. If this is a non-Microsoft service, contact the service vendor, and refer to service-specific error code 2.
    My cluster suddenly disappeared, so I tried to restart the Cluster service. The error above comes up whenever I try to start it.
    I also tried to remove the cluster through PowerShell, but that fails as well because the Cluster service is not running.
    Help me please.. thank you.
    Regards
    Shamil

    Hi,
    Could you confirm which account you use to start the Cluster service? The Cluster service requires a domain user account.
    The server cluster Setup program changes the local security policy for this account by granting it a set of user rights. Additionally, this account is made a member of the local Administrators group.
    If one or more of these user rights are missing, the Cluster service may stop immediately during startup or later, depending on when the Cluster service requires the particular user right.
    Hope this helps.

  • The Cluster Service function call 'ClusterResourceControl' failed with error code '1008(An attempt was made to reference a token that does not exist.)' while verifying the file path. Verify that your failover cluster is configured properly.

    I am experiencing this error in one of our cluster environments. Can anyone help me with this issue?
    The Cluster Service function call 'ClusterResourceControl' failed with error code '1008(An attempt was made to reference a token that does not exist.)' while verifying the file path. Verify that your failover cluster is configured properly.
    Thanks,
    Venu S.

    Hi Venu S,
    Based on my research, you might encounter a known issue, please try the hotfix in this KB:
    http://support.microsoft.com/kb/928385
    Meanwhile, since there is little information about this issue so far, please provide the following before we investigate further:
    The version of Windows Server you are using
    The result of SELECT @@VERSION
    The scenario when you get this error
    If anything is unclear, please let me know.
    Regards,
    Tom Li

  • Error stopping 'Bonjour Service' when trying to download itunes 9.2

    I recently tried to update iTunes to 9.2 on my Windows PC. The automatic download had an error, so I ended up having to do it manually. When trying to download iTunes, I encountered an error message saying that the Bonjour service could not be stopped. It asked me to verify that I have sufficient privileges to stop system services (??!!). I decided I would try to delete Bonjour altogether, but I couldn't find it anywhere on my computer. I am so confused and all I really want is to update iTunes. Please help!!

    HRESULT: 0x8007054f.
    ... that seems to have been the most popular alphanumeric code in recent days. (Just like katherine's one.)
    In general, these tend to be caused by underlying problems on the PC that can prevent Windows Updates from installing. So my main strategy is to try to fix the Windows Update issues. If we can get Windows Updates flowing again, that generally clears up the iTunes install errors en passant.
    Head into Windows Update and check for any new updates. Do any fail? If so, open your Update History and double-click the failure entries to bring up a small box with alphanumeric error codes. Are you getting 8000FFFFs? (They often seem to be associated with an iTunes-install 0x8007054f.)
    If you're getting 8000FFFFs, try the following Microsoft document:
    Windows Update Error 8000FFFF: http://windows.microsoft.com/en-US/windows-vista/Windows-Update-Error-8000FFFF
    Head back to Windows Update, and see if you can get Updates to install successfully this time. If you can get updates now, stock up on any updates you're behind on. Once you're completely up to date in that regard, try another iTunes install. Does it go through properly this time?

  • To run an application on iAS6sp1 on HP-Unix, while starting the kjs from command line, it gives a GDS error and crashes. Subsequently, after stopping all services and restarting, iAS would not come up.

     

    Hi,
    Not a problem, please post the KJS error logs for me to hunt the
    exact reason for the error.
    Thanks & Regards
    Raj
    Neel John wrote:
    To run an application on iAS6sp1 on HP-Unix, while starting the kjs
    from command line, it gives a GDS error and crashes. Subsequently,
    after stopping all services and restarting, iAS would not come up.

  • Cloud service created works fine in the local Azure Emulator, but does not work when deployed to actual Azure and says Restarting (Role has encountered an error and has stopped. Unhandled Exception: System.IO.FileNotFoundException)

    Hi,
    I have created a cloud service, and when I run it locally in the Azure emulator it works fine. But when I try to deploy it to Azure, it says: Restarting (Role has encountered an error and has stopped.
    Unhandled Exception: System.IO.FileNotFoundException)
    Can anyone please help me on this.
    Thanks,
    Satya Chenna

    Is there any specific reason you have a timer on role startup? Are you using Thread.sleep for a long duration in the method?
    If you are using Azure diagnostics, check the following:
    1. The Azure storage connection string is set to the correct storage account and not to development storage.
    2. Check the session state provider; a local provider is not supported on Azure.
    You can also refer to this for more information:
    http://blogs.msdn.com/b/kwill/archive/2013/09/06/troubleshooting-scenario-3-role-stuck-in-busy.aspx
    Bhushan

  • Error communicating with the Cluster Manage Web Service

    Hi,
    I connected an Oracle database to Endeca Studio, but while uploading the Departments dataset I am getting this error:
    "Error communicating with the Cluster Manage Web Service". Can anyone explain what this error is about?
    Thanks in advance.

    I actually ran into this myself. As Pat notes, it's hard to say without seeing log files, but I think that perhaps the default domain profile that the Endeca Server uses for domains created using Provisioning services has not been created. First see what domain profiles exist. Navigate to wherever endeca-cmd lives (e.g., user_projects/domains/endeca_server_domain/EndecaServer/bin), and use the list-dd-profiles command:
    -bash-3.2$ ./endeca-cmd.sh list-dd-profiles
    The profiles that exist will be returned in your terminal window:
    prov_dd_profile
    default
    If you only see 'default', then you will need to create 'prov_dd_profile'
    ./endeca-cmd.sh put-dd-profile prov_dd_profile
    Then try uploading your file again in Studio.
    Cheers,
    Andrew

  • Error while stopping Apache Services.

    Hi partners,
    I am having a problem stopping the Apache service with the "standard" script "adapcctl.sh".
    I get the following error message when I execute "adapcctl.sh stop apps/<apps_password>":
    /d03/applmgr/prodcomn/admin/scripts/PROD_gamhap22/adapcctl.sh stop
    Timeout specified in context file: 100 second(s)
    script returned:
    adapcctl.sh version 115.53
    Apache Restricted Web Server Listener :httpd ( pid 7244 ) is running.
    Service can not be stopped using this script
    Apache Restricted Web Server Listener (PLSQL) :httpd ( pid 7522 ) is running.
    Service can not be stopped using this script
    adapcctl.sh: exiting with status 1
    .end std out.
    .end err out.
    The GSM Enabled profile is set to "Y", but I have similar "TEST" environments where GSM is enabled and the Apache server shuts down without any problem.
    Any advice or help will be really appreciated.
    Thanks in advance.
    Kind regards,
    Francisco Mtz.

    Hi list,
    I have the solution to this problem.
    I took a look inside the adapcctl.sh script and found the following lines:
    elif [ "$control_code" = "stop" ]; then
      if [ "$RUNNING" = "0" ]; then
        exit_code=2;
      elif [ "$RUNNING" = "1" -a "$RESTRICT_RUNNING" = "1" ]; then
        STATUSMSG="Service can not be stopped using this script\n"
        printf "$STATUSMSG"
        printf "$STATUSMSG" >> $OUTFILE
        exit_code=1;
      else
    RUNNING is equal to 1 and RESTRICT_RUNNING is equal to 1 when the following condition is met:
    if [ -f $RESTRICT_PIDFILE ] ; then
      restrict_pid=`cat $RESTRICT_PIDFILE | grep -i "TRUE"`
      if [ "x$restrict_pid" != "x" -a "$RUNNING" = "1" ] ; then
        RESTRICT_RUNNING=1
    RESTRICT_PIDFILE points to the file "$IAS_HOME/Apache/Apache/logs/apache_runmode.properties", which was created in this environment (I don't know who created it).
    Kind regards,
    Francisco Mtz.
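The branch quoted from adapcctl.sh can be restated as a small decision function, which may make it easier to see why the script refuses to stop Apache. This Java sketch is only a model of the quoted lines: the class, enum, and method names are hypothetical, the sample file content "restrict=TRUE" is an assumption (the post does not show what the file contains), and the final branch stands for whatever the elided "else" part of the script does.

```java
import java.util.Locale;

public class StopDecisionSketch {

    enum Outcome { NOT_RUNNING_EXIT_2, RESTRICTED_EXIT_1, PROCEED_WITH_STOP }

    // Models: [ -f $RESTRICT_PIDFILE ] plus grep -i "TRUE" plus RUNNING = 1.
    // A null content stands for "the file does not exist".
    static boolean restrictRunning(String pidFileContent, boolean running) {
        return running
                && pidFileContent != null
                && pidFileContent.toLowerCase(Locale.ROOT).contains("true");
    }

    // Models the "stop" branch quoted in the post.
    static Outcome decideStop(boolean running, boolean restrictRunning) {
        if (!running) {
            return Outcome.NOT_RUNNING_EXIT_2;
        } else if (restrictRunning) {
            // "Service can not be stopped using this script"
            return Outcome.RESTRICTED_EXIT_1;
        }
        // The rest of the script is not shown in the post.
        return Outcome.PROCEED_WITH_STOP;
    }

    public static void main(String[] args) {
        boolean running = true;
        // Hypothetical content of apache_runmode.properties.
        boolean restricted = restrictRunning("restrict=TRUE\n", running);
        System.out.println(decideStop(running, restricted));
    }
}
```

Under this reading, the presence of a run-mode file containing "TRUE" while Apache is running is enough to produce the exit status 1 seen in the output.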

  • The cluster service terminated, error 7024, cannot create a file when that file already exists

    I have a test 2-node Failover cluster using Server 2012 R2
    As of last night the cluster service on one of the 2 nodes is down with this error:
    The Cluster Service service terminated with the following service-specific error: 
    Cannot create a file when that file already exists.
    EventID 7024
    The Cluster service waits 60 sec, tries to start, and the same error occurs again. 
    Any idea where to look to identify which file this error is referring to, or how to go about identifying root cause and getting a solution?
    thank you.
    samb

    Hi Yeswanth
    Then you can try "Add Counter". This creates a new file each time with the same name, with a counter appended to the file name indicating how many times it has been created.
    You can also specify the format of the counter: once you select this option, you can fill in the Format and Step fields accordingly.
    Will this be fine.
    Regards
    Ashmi

  • Stopping cluster due to unhandled exception: java.lang.ArrayIndexOutOfBound

    We had this problem in production, where one node of a 16-node cluster terminated with this error.
    2013-04-12 11:39:00.533/1139.283 Oracle Coherence EE 3.6.1.4 <Warning> (thread=PacketPublisher, member=4): Experienced a 12316 ms communication delay (probable remote GC) with Member(Id=6, Timestamp=2013-04-12 11:20:08.733, Address=169.168.22.79:32120, MachineId=5967, Location=XXXX,machine:XXXXXXX,process:18088102,member:Container1u7, Role=XXXXXXXX); 114 packets rescheduled, PauseRate=0.0108, Threshold=1878
    2013-04-12 11:47:35.704/2528.573 Oracle Coherence EE 3.6.1.4 <Error> (thread=PacketReceiver, member=1): Stopping cluster due to unhandled exception: java.lang.ArrayIndexOutOfBoundsException
         at com.tangosol.coherence.component.net.Packet.extract(Packet.CDB:30)
         at com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketReceiver.onNotify(PacketReceiver.CDB:28)
         at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
         at java.lang.Thread.run(Thread.java:777)
    After that, when the services that are configured to restart tried to restart, they failed with the following exception. Any idea what could be causing this error? We have WKA configured.
    2013-04-12 11:47:35.951/2528.820 Oracle Coherence EE 3.6.1.4 <Error> (thread=DEFAULT_EDN-Thread-28, member=n/a): Error while starting cluster: (Wrapped) java.io.IOException: SystemSocketProvider unable find available port(s)
         at com.tangosol.util.Base.ensureRuntimeException(Base.java:293)
         at com.tangosol.util.Base.ensureRuntimeException(Base.java:269)
         at com.tangosol.coherence.component.net.Cluster.onStart(Cluster.CDB:232)
         at com.tangosol.coherence.component.net.Cluster.start(Cluster.CDB:11)
         at com.tangosol.coherence.component.util.SafeCluster.startCluster(SafeCluster.CDB:3)
         at com.tangosol.coherence.component.util.SafeCluster.restartCluster(SafeCluster.CDB:7)
         at com.tangosol.coherence.component.util.SafeCluster.ensureRunningCluster(SafeCluster.CDB:26)
         at com.tangosol.coherence.component.util.SafeService.restartService(SafeService.CDB:22)
         at com.tangosol.coherence.component.util.SafeService.ensureRunningService(SafeService.CDB:39)
         at com.tangosol.coherence.component.util.safeService.SafeCacheService.ensureRunningCacheService(SafeCacheService.CDB:3)
         at com.tangosol.coherence.component.util.SafeNamedCache$CacheAction.run(SafeNamedCache.CDB:3)
         at java.security.AccessController.doPrivileged(AccessController.java:252)
         at javax.security.auth.Subject.doAs(Subject.java:494)
         at com.tangosol.coherence.component.util.SafeNamedCache.restartNamedCache(SafeNamedCache.CDB:8)
         at com.tangosol.coherence.component.util.SafeNamedCache.ensureRunningNamedCache(SafeNamedCache.CDB:33)
         at com.tangosol.coherence.component.util.SafeNamedCache.getRunningNamedCache(SafeNamedCache.CDB:1)
         at com.tangosol.coherence.component.util.SafeNamedCache.lock(SafeNamedCache.CDB:1)
         at container.pool.BoundedThreadPool$PooledThread.run(BoundedThreadPool.java:591)
    Caused by: java.io.IOException: SystemSocketProvider unable find available port(s)
         at com.tangosol.coherence.component.net.Cluster$SocketManager.bindListeners(Cluster.CDB:117)
         at com.tangosol.coherence.component.net.Cluster.onStart(Cluster.CDB:228)
         ... 20 more

    Hello,
    This is not a Coherence bug. It looks like the system is running out of memory.
    Best regards,
    -Dave

  • Java.lang.OutOfMemoryError: getNewTla  in coherence production cluster!

    Hi guys, we need some urgent help. JVMs in our production Coherence cluster randomly stop with an OutOfMemoryError, and we cannot find the root cause.
    1) Version 3.7.1.4, running 4 servers, each with 40 JVMs, each JVM set to a 2 GB heap, for a total of 320 GB across the cluster. Each server also runs an Extend proxy with a 3 GB heap (no issue).
    2) The cluster is configured using WKA by explicitly listing out all the server nodes in the config.
    3) Our data storage is only ~30GB, details below.
    Stats for cache 'CACHE0':
    Number of cache entries: 14761116
    Memory usage (mb): 26722.643
    Average entry size (bytes): 1898
    Stats for cache 'CACHE1':
    Number of cache entries: 46047
    Memory usage (mb): 51.911
    Average entry size (bytes): 1182
    Stats for cache 'CACHE2':
    Number of cache entries: 4
    Memory usage (mb): 0.154
    Average entry size (bytes): 40448
    Stats for cache 'CACHE3':
    Number of cache entries: 69
    Memory usage (mb): 0.705
    Average entry size (bytes): 10707
    Grand total: 26775.413 MB, Number of entries: 14807237
    4) Random storage-node JVMs (not the proxies) on each server just go down with the errors below. We cannot reproduce the issue; it happens at random. Out of 40 JVMs on each server, about 3-5 went down over the weekend, and the issue happens on all 4 servers.
    ERROR Coherence - 2012-08-11 11:36:51.670/156864.993 Oracle Coherence GE 3.7.1.4 <Error> (thread=Cluster, member=17):
    java.lang.OutOfMemoryError: getNewTla
    at java.util.HashMap.newKeyIterator(HashMap.java:1024)
    at java.util.HashMap$KeySet.iterator(HashMap.java:1062)
    at java.util.HashSet.iterator(HashSet.java:153)
    at sun.nio.ch.SelectorImpl.processDeregisterQueue(SelectorImpl.java:127)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:69)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
    at com.tangosol.coherence.component.net.TcpRing.select(TcpRing.CDB:11)
    at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.ClusterService.onWait(ClusterService.CDB:6)
    at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
    at java.lang.Thread.run(Thread.java:662)
    ERROR Coherence - 2012-08-11 11:36:51.854/156865.177 Oracle Coherence GE 3.7.1.4 <Error> (thread=PacketListener1, member=17): Stopping cluster due to unhandled exception: java.lang.OutOfMemoryError: java/net/Inet4Address, size 24B
    at java.net.PlainDatagramSocketImpl.receive0(Native Method)
    at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:145)
    at java.net.DatagramSocket.receive(DatagramSocket.java:725)
    at com.tangosol.coherence.component.net.socket.UdpSocket.receive(UdpSocket.CDB:22)
    at com.tangosol.coherence.component.net.UdpPacket.receive(UdpPacket.CDB:1)
    at com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketListener.onNotify(PacketListener.CDB:20)
    at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
    at java.lang.Thread.run(Thread.java:662)
    Exception in thread "Main Thread" java.lang.OutOfMemoryError
    5) Initially we thought it was caused by the small default packet speaker size when joining the nodes, since we are using WKA. But changing the config did not help:
    <coherence>
      <cluster-config>
        <packet-speaker>
          <volume-threshold>
            <minimum-packets>10000</minimum-packets>
          </volume-threshold>
        </packet-speaker>
      </cluster-config>
    </coherence>
    Out of ideas, any help will be greatly appreciated. Thanks
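As a rough sanity check on the numbers in this post, the Java sketch below works out how much cache data each storage JVM would hold if the reported total were spread evenly. The class and method names are hypothetical, the even distribution is an assumption, and so is the backup count of 1 (the post does not state it). It ignores indexes, serialization overhead, and everything else on the heap, so it is only a back-of-the-envelope estimate.

```java
public class HeapBudgetSketch {

    // Total heap configured across all storage JVMs, in GB.
    static int totalHeapGb(int servers, int jvmsPerServer, int heapGbPerJvm) {
        return servers * jvmsPerServer * heapGbPerJvm;
    }

    // Cache data per JVM in MB, counting primary plus backup copies and
    // assuming the data is spread evenly over all storage JVMs.
    static double dataPerJvmMb(double totalDataMb, int servers,
                               int jvmsPerServer, int backupCount) {
        int jvms = servers * jvmsPerServer;
        return totalDataMb * (1 + backupCount) / jvms;
    }

    public static void main(String[] args) {
        double totalDataMb = 26775.413; // "Grand total" from the post
        int servers = 4;
        int jvmsPerServer = 40;
        int heapGbPerJvm = 2;
        int backupCount = 1;            // assumption, not stated in the post

        double perJvm = dataPerJvmMb(totalDataMb, servers, jvmsPerServer, backupCount);
        double heapMb = heapGbPerJvm * 1024.0;

        System.out.printf("Total heap: %d GB%n",
                totalHeapGb(servers, jvmsPerServer, heapGbPerJvm));
        System.out.printf("Data per JVM: %.1f MB (%.1f%% of heap)%n",
                perJvm, 100.0 * perJvm / heapMb);
    }
}
```

Under these assumptions each JVM would hold a few hundred MB of cache data against a 2 GB heap, so the reported data volume alone does not obviously explain the OutOfMemoryError; a heap dump would be needed to see what is actually filling the heap.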

    I don't think the issue is with the code. I just noticed that as soon as I started up all the cache servers, one went down already, and no one is accessing the system.
    This is extremely troublesome. I am loading the hprof file to look at the dump per the suggestion above, though I am not sure it will help pinpoint the root cause.
    cacheserver:1 30578 [Logger@9218328 3.7.1.4] WARN Coherence - 2012-08-13 13:52:14.857/32.262 Oracle Coherence GE 3.7.1.4 <Warning> (thread=PacketPublisher, member=22): Experienced a 1230 ms communication delay (probable remote GC) with Member(Id=1, Timestamp=2012-08-09 16:02:24.413, Address=xxxxxx, MachineId=xxxxx, Location=site:,machine:xxxxx,process:27118,member:xxxxxx:cacheserver:1, Role=CoherenceServer); 25 packets rescheduled, PauseRate=0.042, Threshold=875
    Exception in thread "Main Thread" java.lang.OutOfMemoryError
    [WARN ][thread ] dispatchUncaughtException
    Logger: java.lang.OutOfMemoryError
    java.lang.OutOfMemoryError
    Exception in thread "PacketListener1" java.lang.OutOfMemoryError
    [WARN ][thread ] dispatchUncaughtException
    [WARN ][thread ] dispatchUncaughtException
    java.lang.OutOfMemoryError: getNewTla
    at java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:983)
    at java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:976)
    at java.lang.Thread.dispatchUncaughtException(Thread.java:1874)
    java/lang/OutOfMemoryError: getNewTla
    --- End of stack trace
    java/lang/OutOfMemoryError: getNewTla
    --- End of stack trace
    Exception in thread "Logger@9218328 3.7.1.4" java.lang.OutOfMemoryError
    Exception in thread "PacketListener1P" java.lang.OutOfMemoryError
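
Since the server above dies with OutOfMemoryError right after startup, one quick check is whether the JVM is actually running with the heap size you expect. The following is only a minimal sketch using the standard java.lang.management API; the class name HeapReport is hypothetical and is not part of Coherence. It could be run with the same JVM options as the cache server.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of Coherence: reports heap headroom of the JVM it runs in.
final class HeapReport {
    // Converts a byte count to whole mebibytes (rounds down).
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    // Formats a heap usage snapshot; getMax() is -1 when the maximum is undefined.
    static String describe(MemoryUsage heap) {
        String max = heap.getMax() < 0 ? "undefined" : toMiB(heap.getMax()) + " MiB";
        return "heap used=" + toMiB(heap.getUsed()) + " MiB"
                + ", committed=" + toMiB(heap.getCommitted()) + " MiB"
                + ", max=" + max;
    }

    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        System.out.println(describe(memory.getHeapMemoryUsage()));
    }
}
```

If the reported maximum is far below what the start script is supposed to set, the heap options may not be reaching the JVM. This does not replace analysing the hprof dump.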

  • Stopping Cluster... Unable to refresh sockets

    Hi,
    Can someone explain what the following errors mean please:
    05/18/10 14:53:15.415 INFO: [ProcessWrapper] [STDOUT] 2010-05-18 14:53:15,413 [Logger@9226875 3.5.3/465p2] DEBUG Coherence - 2010-05-18 14:53:15.413/14734.646 Oracle Coherence GE 3.5.3/465p2 <D6> (thread=PacketListenerN, member=46): Attempt to refresh sockets: [UnicastUdpSocket{State=STATE_OPEN, address:port=11.160.45.170:8097}, MulticastUdpSocket{State=STATE_OPEN, address:port=239.255.12.37:1235, InterfaceAddress=11.160.45.170, TimeToLive=100}, TcpSocketAccepter{State=STATE_OPEN, ServerSocket=11.160.45.170:8097}] caused by MulticastUdpSocket{State=STATE_OPEN, address:port=239.255.12.37:1235, InterfaceAddress=11.160.45.170, TimeToLive=100}; exception java.net.SocketTimeoutException: Receive timed out
    05/18/10 14:53:15.416 INFO: [ProcessWrapper] [STDOUT]      at java.net.PlainDatagramSocketImpl.receive0(Native Method)
    05/18/10 14:53:15.416 INFO: [ProcessWrapper] [STDOUT]      at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:136)
    05/18/10 14:53:15.416 INFO: [ProcessWrapper] [STDOUT]      at java.net.DatagramSocket.receive(DatagramSocket.java:712)
    05/18/10 14:53:15.416 INFO: [ProcessWrapper] [STDOUT]      at com.tangosol.coherence.component.net.socket.UdpSocket.receive(UdpSocket.CDB:20)
    05/18/10 14:53:15.416 INFO: [ProcessWrapper] [STDOUT]      at com.tangosol.coherence.component.net.UdpPacket.receive(UdpPacket.CDB:4)
    05/18/10 14:53:15.416 INFO: [ProcessWrapper] [STDOUT]      at com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketListener.onNotify(PacketListener.CDB:19)
    05/18/10 14:53:15.416 INFO: [ProcessWrapper] [STDOUT]      at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
    05/18/10 14:53:15.416 INFO: [ProcessWrapper] [STDOUT]      at java.lang.Thread.run(Thread.java:619)
    05/18/10 14:53:15.416 INFO: [ProcessWrapper] [STDOUT]
    05/18/10 14:53:17.011 INFO: [ProcessWrapper] [STDOUT] 14734.924: [Full GC (System) 14734.924: [CMS: 153422K->153463K(3350528K), 1.4893660 secs] 225600K->153463K(3468544K), [CMS Perm : 23955K->23955K(39996K)], 1.4897740 secs] [Times: user=1.13 sys=0.00, real=1.49 secs]
    05/18/10 14:53:17.071 INFO: [ProcessWrapper] [STDOUT] 2010-05-18 14:53:17,070 [Logger@9226875 3.5.3/465p2] INFO  Coherence - 2010-05-18 14:53:17.069/14736.302 Oracle Coherence GE 3.5.3/465p2 <Info> (thread=PacketListenerN, member=46): Scheduled senior member heartbeat is overdue; rejoining multicast group.
    05/18/10 14:53:21.598 INFO: [ProcessWrapper] [STDOUT] 2010-05-18 14:53:21,598 [latest-cache-serviceWorker:72] DEBUG EventsPublisherImpl - Published message : ObjectMessage={ Header={ JMSMessageID={ID:LON_DEV_ODCache_7026.104C4BEBFD63A9A32:63038} JMSDestination={Topic[odcache.dev.topic.odcdev.persistence]} JMSReplyTo={null} JMSDeliveryMode={PERSISTENT} JMSRedelivered={false} JMSCorrelationID={null} JMSType={null} JMSTimestamp={Tue May 18 14:53:21 BST 2010} JMSExpiration={0} JMSPriority={4} } Properties={ partitionId={Integer:6803} } Object={PersistenceEventImpl for cache: LatestRefLedgerBook} } ; event : PersistenceEventImpl for cache: LatestRefLedgerBook
    05/18/10 14:53:23.762 INFO: [ProcessWrapper] [STDOUT] 2010-05-18 14:53:23,762 [latest-cache-serviceWorker:35] DEBUG EventsPublisherImpl - Published message : ObjectMessage={ Header={ JMSMessageID={ID:LON_DEV_ODCache_7026.104C4BEBFD63A9A32:63039} JMSDestination={Topic[odcache.dev.topic.odcdev.persistence]} JMSReplyTo={null} JMSDeliveryMode={PERSISTENT} JMSRedelivered={false} JMSCorrelationID={null} JMSType={null} JMSTimestamp={Tue May 18 14:53:23 BST 2010} JMSExpiration={0} JMSPriority={4} } Properties={ partitionId={Integer:1015} } Object={PersistenceEventImpl for cache: LatestRefLedgerBook} } ; event : PersistenceEventImpl for cache: LatestRefLedgerBook
    05/18/10 14:53:34.972 INFO: [ProcessWrapper] [STDOUT] 2010-05-18 14:53:34,972 [Logger@9226875 3.5.3/465p2] ERROR Coherence - 2010-05-18 14:53:34.969/14754.202 Oracle Coherence GE 3.5.3/465p2 <Error> (thread=PacketListenerN, member=46): Stopping cluster due to unhandled exception: com.tangosol.net.messaging.ConnectionException: Unable to refresh sockets: [UnicastUdpSocket{State=STATE_OPEN, address:port=11.160.45.170:8097}, MulticastUdpSocket{State=STATE_OPEN, address:port=239.255.12.37:1235, InterfaceAddress=11.160.45.170, TimeToLive=100}, TcpSocketAccepter{State=STATE_OPEN, ServerSocket=11.160.45.170:8097}]; last failed socket: MulticastUdpSocket{State=STATE_OPEN, address:port=239.255.12.37:1235, InterfaceAddress=11.160.45.170, TimeToLive=100}
    05/18/10 14:53:34.972 INFO: [ProcessWrapper] [STDOUT]      at com.tangosol.coherence.component.net.Cluster$SocketManager.refreshSockets(Cluster.CDB:91)
    05/18/10 14:53:34.972 INFO: [ProcessWrapper] [STDOUT]      at com.tangosol.coherence.component.net.Cluster$SocketManager$MulticastUdpSocket.onInterruptedIOException(Cluster.CDB:9)
    05/18/10 14:53:34.972 INFO: [ProcessWrapper] [STDOUT]      at com.tangosol.coherence.component.net.socket.UdpSocket.receive(UdpSocket.CDB:33)
    05/18/10 14:53:34.972 INFO: [ProcessWrapper] [STDOUT]      at com.tangosol.coherence.component.net.UdpPacket.receive(UdpPacket.CDB:4)
    05/18/10 14:53:34.972 INFO: [ProcessWrapper] [STDOUT]      at com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketListener.onNotify(PacketListener.CDB:19)
    05/18/10 14:53:34.972 INFO: [ProcessWrapper] [STDOUT]      at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
    05/18/10 14:53:34.972 INFO: [ProcessWrapper] [STDOUT]      at java.lang.Thread.run(Thread.java:619)
    05/18/10 14:53:34.972 INFO: [ProcessWrapper] [STDOUT] Caused by: java.net.SocketTimeoutException: Receive timed out
    05/18/10 14:53:34.972 INFO: [ProcessWrapper] [STDOUT]      at java.net.PlainDatagramSocketImpl.receive0(Native Method)
    05/18/10 14:53:34.972 INFO: [ProcessWrapper] [STDOUT]      at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:136)
    05/18/10 14:53:34.972 INFO: [ProcessWrapper] [STDOUT]      at java.net.DatagramSocket.receive(DatagramSocket.java:712)
    05/18/10 14:53:34.972 INFO: [ProcessWrapper] [STDOUT]      at com.tangosol.coherence.component.net.socket.UdpSocket.receive(UdpSocket.CDB:20)
    05/18/10 14:53:34.972 INFO: [ProcessWrapper] [STDOUT]      at com.tangosol.coherence.component.net.UdpPacket.receive(UdpPacket.CDB:4)
    05/18/10 14:53:34.972 INFO: [ProcessWrapper] [STDOUT]      at com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketListener.onNotify(PacketListener.CDB:19)
    05/18/10 14:53:34.972 INFO: [ProcessWrapper] [STDOUT]      at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
    05/18/10 14:53:34.972 INFO: [ProcessWrapper] [STDOUT]      at java.lang.Thread.run(Thread.java:619)
    05/18/10 14:53:34.972 INFO: [ProcessWrapper] [STDOUT]
    05/18/10 14:53:34.974 INFO: [ProcessWrapper] [STDOUT] 2010-05-18 14:53:34,974 [Logger@9226875 3.5.3/465p2] DEBUG Coherence - 2010-05-18 14:53:34.974/14754.207 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=46): Service Cluster left the cluster
    05/18/10 14:53:35.094 INFO: [ProcessWrapper] [STDOUT] 2010-05-18 14:53:35,093 [Logger@9226875 3.5.3/465p2] DEBUG Coherence - 2010-05-18 14:53:35.086/14754.319 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Invocation:PartitionedFilterInvocationService, member=46): Service PartitionedFilterInvocationService left the cluster
    05/18/10 14:53:35.094 INFO: [ProcessWrapper] [STDOUT] 2010-05-18 14:53:35,093 [Logger@9226875 3.5.3/465p2] ERROR Coherence - 2010-05-18 14:53:35.086/14754.320 Oracle Coherence GE 3.5.3/465p2 <Error> (thread=DistributedCache:distributed-pof-scheme, member=46): validatePolls: This service timed-out due to unanswered handshake request. Manual intervention is required to stop the members that have not responded to this Poll
    05/18/10 14:53:35.094 INFO: [ProcessWrapper] [STDOUT]   {
    05/18/10 14:53:35.094 INFO: [ProcessWrapper] [STDOUT]   PollId=244, active
    05/18/10 14:53:35.094 INFO: [ProcessWrapper] [STDOUT]   InitTimeMillis=1274188777824
    05/18/10 14:53:35.094 INFO: [ProcessWrapper] [STDOUT]   Service=distributed-pof-scheme (10)
    05/18/10 14:53:35.094 INFO: [ProcessWrapper] [STDOUT]   RespondedMemberSet=[]
    05/18/10 14:53:35.094 INFO: [ProcessWrapper] [STDOUT]   LeftMemberSet=[]
    05/18/10 14:53:35.094 INFO: [ProcessWrapper] [STDOUT]   RemainingMemberSet=[2]
    05/18/10 14:53:35.094 INFO: [ProcessWrapper] [STDOUT]   }
    05/18/10 14:53:35.094 INFO: [ProcessWrapper] [STDOUT] Request=Message "InvokeRequest"
    05/18/10 14:53:35.094 INFO: [ProcessWrapper] [STDOUT]   {
    05/18/10 14:53:35.095 INFO: [ProcessWrapper] [STDOUT]   FromMember=Member(Id=46, Timestamp=2010-05-18 10:47:44.479, Address=11.160.45.170:8097, MachineId=34474, Location=machine:lonrs05718,process:16198,member:lonrs05718:Data-14, Role=RbsOdcCoreDaoODCCacheServer)
    05/18/10 14:53:35.095 INFO: [ProcessWrapper] [STDOUT]   FromMessageId=279506
    05/18/10 14:53:35.095 INFO: [ProcessWrapper] [STDOUT]   Internal=false
    05/18/10 14:53:35.095 INFO: [ProcessWrapper] [STDOUT]   MessagePartCount=3
    05/18/10 14:53:35.095 INFO: [ProcessWrapper] [STDOUT]   PendingCount=0
    05/18/10 14:53:35.095 INFO: [ProcessWrapper] [STDOUT]   MessageType=27
    05/18/10 14:53:35.095 INFO: [ProcessWrapper] [STDOUT]   ToPollId=0
    05/18/10 14:53:35.095 INFO: [ProcessWrapper] [STDOUT]   Poll=null
    05/18/10 14:53:35.095 INFO: [ProcessWrapper] [STDOUT]   Packets
    05/18/10 14:53:35.095 INFO: [ProcessWrapper] [STDOUT]     {
    05/18/10 14:53:35.095 INFO: [ProcessWrapper] [STDOUT]     }
    05/18/10 14:53:35.095 INFO: [ProcessWrapper] [STDOUT]   Service=DistributedCache{Name=distributed-pof-scheme, State=(SERVICE_STOPPED), Not initialized}
    05/18/10 14:53:35.095 INFO: [ProcessWrapper] [STDOUT]   ToMemberSet=MemberSet(Size=1, BitSetCount=1
    05/18/10 14:53:35.095 INFO: [ProcessWrapper] [STDOUT]       Member(Id=2, Timestamp=2010-05-18 10:47:20.865, Address=11.160.45.172:8088, MachineId=34476, Location=machine:lonrs05720,process:20359,member:lonrs05720:Data-4, Role=RbsOdcCoreDaoODCCacheServer)
    05/18/10 14:53:35.095 INFO: [ProcessWrapper] [STDOUT]       )
    05/18/10 14:53:35.095 INFO: [ProcessWrapper] [STDOUT]   NotifySent=false
    05/18/10 14:53:35.095 INFO: [ProcessWrapper] [STDOUT]   }
    Specifically the error:
    <Error> (thread=PacketListenerN, member=46): Stopping cluster due to unhandled exception: com.tangosol.net.messaging.ConnectionException: Unable to refresh sockets
    What does it mean, and what might be causing it? The only similar post I could find was this one, "Re: start Conherence ERROR!!!", from last year, where the response was to try the multicast test. We run this cluster with multicast all the time, though.
    Cheers,
    JK

    Hi Jonathan,
    This thread: http://www.tibcommunity.com/thread/11192
    seems to indicate that the problem is at the socket buffer level.
    This one: http://www.jivesoftware.com/jivespace/message/60848;jsessionid=5BF9B4F2637DFC55A07B9F3090927A9D
    indicates that it may be a conflict on the network with another device.
    HTH
    Serge
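
To follow up on the socket buffer suggestion above, it can help to see how large a UDP receive buffer the operating system will actually grant to a Java process. This is a minimal sketch using plain java.net; the class name UdpBufferCheck and the 2 MiB request are illustrative assumptions, not Coherence settings.

```java
import java.io.UncheckedIOException;
import java.net.DatagramSocket;
import java.net.SocketException;

// Hypothetical diagnostic, not part of Coherence.
final class UdpBufferCheck {
    // Requests a receive buffer size and returns what the OS actually granted.
    static int grantedReceiveBuffer(int requestedBytes) {
        // The no-arg constructor binds to any free local port.
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.setReceiveBufferSize(requestedBytes); // only a hint to the OS
            return socket.getReceiveBufferSize();
        } catch (SocketException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int requested = 2 * 1024 * 1024; // 2 MiB, an arbitrary example value
        int granted = grantedReceiveBuffer(requested);
        System.out.println("requested=" + requested + " granted=" + granted);
        if (granted < requested) {
            System.out.println("The OS capped the buffer; check the OS-level UDP buffer limits.");
        }
    }
}
```

The granted value depends on the platform, so compare it across the hosts in the cluster rather than against a fixed number.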

  • SAP Cluster service issue

    Here is the description of the PRD cluster scenario (Windows 2008 + Oracle).
    We have 2 nodes:
    1. host-erpn01 (has ASCS, Database instance, Enqueue and Dialog Instance installed)
    2. host-erp02 (has Central Instance, Dialog Instance and Enqueue installed)
    When we move the "SAP SID" service from one node to another using the "failover cluster management tool", it fails, and we have to manually bring the "SAP SID cluster service" and "SAP SID cluster instance" online.
    Both the service and the instance came online after manual selection. However, after some time, in the MMC console of node 2 the SAP instances hosted on node 1 are marked with a red cross and give the error "cannot connect to sap service dcom interface error 800706BA".
    We copied sapstartsrv.exe from the working directory of the ASCS instance to the CI executable directory, replacing the existing file.
    Now disp+work is stopped for the CI instance. Also, in the CI instance executable directory we can see five files named sapstartsrv, i.e.
    sapstartsrv.exe.new, sapstartsrv.exe.tmp, sapstartsrv.new, sapstartsrv.pdb and the actual sapstartsrv.exe file.
    Here is the sapstartsrv.log from the CI work directory on node 2:
    trc file: "sapstartsrv.log", trc level: 0, release: "701"
    pid        1968
    Mon Oct 11 15:55:33 2010
    SAP HA Trace: Build in SAP Microsoft Cluster library '701, patch 32, changelist 1046543' initialized
    Initializing SAPControl Webservice
    SapSSLInit failed => https support disabled
    Starting WebService Named Pipe thread
    Starting WebService thread
    Webservice named pipe thread started, listening on port \\.\pipe\sapcontrol_01
    Webservice thread started, listening on port 50113
    GCCIA\csrvadmin is starting SAP System at 2010/10/11 16:09:07
    SAP HA Trace: FindClusterResource: SAP resource not found [sapwinha.cpp, line 334]
    SAP HA Trace: SAP_HA_FindSAPInstance returns: SAP_HA_NOT_CLUSTERED [sapwinha.cpp, line 907]
    You can also view other logs from the work directory dump at
    http://s000.tinyupload.com/index.php?file_id=45384422007535688902
    Now when we try to start the SAPSID_00 service manually, it gives the error "The SAPSID_00 service failed to start due to the following error: The system cannot find the path specified."
    Please advise.
    Regards
    Edited by: Tech GCCIA on Oct 11, 2010 3:27 PM
    Edited by: Tech GCCIA on Oct 11, 2010 3:28 PM

    Hi Sunil,
    On node 1 there is no listener.trc in the /oracle_home/network/trace folder. Here is the listener.log file in case it is helpful.
    TNSLSNR for 64-bit Windows: Version 10.2.0.4.0 - Production on 10-OCT-2010 10:37:37
    Copyright (c) 1991, 2007, Oracle.  All rights reserved.
    System parameter file is D:\oracle\GCP\102\network\admin\listener.ora
    Log messages written to D:\oracle\GCP\102\network\log\listener.log
    Trace information written to D:\oracle\GCP\102\network\trace\listener.trc
    Trace level is currently 0
    Started with pid=3116
    Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(PIPENAME=\\.\pipe\GCP.WORLDipc)))
    Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(PIPENAME=\\.\pipe\GCPipc)))
    Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=gccia-erpn01.gccia.com.sa)(PORT=1527)))
    Listener completed notification to CRS on start
    TIMESTAMP * CONNECT DATA [* PROTOCOL INFO] * EVENT [* SID] * RETURN CODE
    TNSLSNR for 64-bit Windows: Version 10.2.0.4.0 - Production on 10-OCT-2010 11:59:37
    Copyright (c) 1991, 2007, Oracle.  All rights reserved.
    System parameter file is D:\oracle\GCP\102\network\admin\listener.ora
    Log messages written to D:\oracle\GCP\102\network\log\listener.log
    Trace information written to D:\oracle\GCP\102\network\trace\listener.trc
    Trace level is currently 0
    Started with pid=5036
    Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(PIPENAME=\\.\pipe\GCP.WORLDipc)))
    Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(PIPENAME=\\.\pipe\GCPipc)))
    Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.11.13)(PORT=1527)))
    Listener completed notification to CRS on start
    TIMESTAMP * CONNECT DATA [* PROTOCOL INFO] * EVENT [* SID] * RETURN CODE
    10-OCT-2010 12:00:31 * (CONNECT_DATA=(SID=GCP)(GLOBAL_NAME=GCP.WORLD)(CID=(PROGRAM=D:\oracle\OFS\SRV\fs\fssvr\bin\FsSurrogate.exe)(HOST=GCCIA-ERPN01)(USER=csrvadmin))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.11.13)(PORT=60592)) * establish * GCP * 0
    10-OCT-2010 12:00:31 * (CONNECT_DATA=(SID=GCP)(GLOBAL_NAME=GCP.WORLD)(CID=(PROGRAM=D:\oracle\OFS\SRV\fs\fssvr\bin\FsSurrogate.exe)(HOST=GCCIA-ERPN01)(USER=csrvadmin))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.11.13)(PORT=60593)) * establish * GCP * 0
    10-OCT-2010 12:00:31 * (CONNECT_DATA=(SID=GCP)(GLOBAL_NAME=GCP.WORLD)(CID=(PROGRAM=D:\oracle\OFS\SRV\fs\fssvr\bin\FsSurrogate.exe)(HOST=GCCIA-ERPN01)(USER=csrvadmin))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.11.13)(PORT=60594)) * establish * GCP * 0
    10-OCT-2010 12:00:31 * (CONNECT_DATA=(SID=GCP)(GLOBAL_NAME=GCP.WORLD)(CID=(PROGRAM=D:\oracle\OFS\SRV\fs\fssvr\bin\FsSurrogate.exe)(HOST=GCCIA-ERPN01)(USER=csrvadmin))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.11.13)(PORT=60595)) * establish * GCP * 0
    10-OCT-2010 12:00:31 * (CONNECT_DATA=(SID=GCP)(GLOBAL_NAME=GCP.WORLD)(CID=(PROGRAM=D:\oracle\OFS\SRV\fs\fssvr\bin\FsSurrogate.exe)(HOST=GCCIA-ERPN01)(USER=csrvadmin))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.11.13)(PORT=60596)) * establish * GCP * 0
    10-OCT-2010 13:01:19 * (CONNECT_DATA=(SID=GCP)(GLOBAL_NAME=GCP.WORLD)(CID=(PROGRAM=D:\oracle\OFS\SRV\fs\fssvr\bin\FsSurrogate.exe)(HOST=GCCIA-ERPN01)(USER=csrvadmin))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.11.13)(PORT=61336)) * establish * GCP * 0
    10-OCT-2010 13:01:37 * (CONNECT_DATA=(SID=GCP)(GLOBAL_NAME=GCP.WORLD)(CID=(PROGRAM=D:\oracle\OFS\SRV\fs\fssvr\bin\FsSurrogate.exe)(HOST=GCCIA-ERPN01)(USER=csrvadmin))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.11.13)(PORT=61340)) * establish * GCP * 0
    10-OCT-2010 13:01:37 * (CONNECT_DATA=(SID=GCP)(GLOBAL_NAME=GCP.WORLD)(CID=(PROGRAM=D:\oracle\OFS\SRV\fs\fssvr\bin\FsSurrogate.exe)(HOST=GCCIA-ERPN01)(USER=csrvadmin))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.11.13)(PORT=61341)) * establish * GCP * 0
    10-OCT-2010 13:01:37 * (CONNECT_DATA=(SID=GCP)(GLOBAL_NAME=GCP.WORLD)(CID=(PROGRAM=D:\oracle\OFS\SRV\fs\fssvr\bin\FsSurrogate.exe)(HOST=GCCIA-ERPN01)(USER=csrvadmin))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.11.13)(PORT=61342)) * establish * GCP * 0
    10-OCT-2010 13:01:37 * (CONNECT_DATA=(SID=GCP)(GLOBAL_NAME=GCP.WORLD)(CID=(PROGRAM=D:\oracle\OFS\SRV\fs\fssvr\bin\FsSurrogate.exe)(HOST=GCCIA-ERPN01)(USER=csrvadmin))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.11.13)(PORT=61343)) * establish * GCP * 0
    10-OCT-2010 13:01:37 * (CONNECT_DATA=(SID=GCP)(GLOBAL_NAME=GCP.WORLD)(CID=(PROGRAM=D:\oracle\OFS\SRV\fs\fssvr\bin\FsSurrogate.exe)(HOST=GCCIA-ERPN01)(USER=csrvadmin))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.11.13)(PORT=61344)) * establish * GCP * 0
    10-OCT-2010 13:08:27 * (CONNECT_DATA=(SID=GCP)(GLOBAL_NAME=GCP.WORLD)(CID=(PROGRAM=D:\oracle\OFS\SRV\fs\fssvr\bin\FsSurrogate.exe)(HOST=GCCIA-ERPN01)(USER=csrvadmin))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.11.13)(PORT=61485)) * establish * GCP * 0
    10-OCT-2010 13:08:42 * (CONNECT_DATA=(SID=GCP)(GLOBAL_NAME=GCP.WORLD)(CID=(PROGRAM=D:\oracle\OFS\SRV\fs\fssvr\bin\FsSurrogate.exe)(HOST=GCCIA-ERPN01)(USER=csrvadmin))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.11.13)(PORT=61489)) * establish * GCP * 0
    10-OCT-2010 13:08:42 * (CONNECT_DATA=(SID=GCP)(GLOBAL_NAME=GCP.WORLD)(CID=(PROGRAM=D:\oracle\OFS\SRV\fs\fssvr\bin\FsSurrogate.exe)(HOST=GCCIA-ERPN01)(USER=csrvadmin))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.11.13)(PORT=61490)) * establish * GCP * 0
    10-OCT-2010 13:08:42 * (CONNECT_DATA=(SID=GCP)(GLOBAL_NAME=GCP.WORLD)(CID=(PROGRAM=D:\oracle\OFS\SRV\fs\fssvr\bin\FsSurrogate.exe)(HOST=GCCIA-ERPN01)(USER=csrvadmin))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.11.13)(PORT=61491)) * establish * GCP * 0
    10-OCT-2010 13:08:42 * (CONNECT_DATA=(SID=GCP)(GLOBAL_NAME=GCP.WORLD)(CID=(PROGRAM=D:\oracle\OFS\SRV\fs\fssvr\bin\FsSurrogate.exe)(HOST=GCCIA-ERPN01)(USER=csrvadmin))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.11.13)(PORT=61492)) * establish * GCP * 0
    10-OCT-2010 13:08:42 * (CONNECT_DATA=(SID=GCP)(GLOBAL_NAME=GCP.WORLD)(CID=(PROGRAM=D:\oracle\OFS\SRV\fs\fssvr\bin\FsSurrogate.exe)(HOST=GCCIA-ERPN01)(USER=csrvadmin))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.11.13)(PORT=61493)) * establish * GCP * 0
    TNSLSNR for 64-bit Windows: Version 10.2.0.4.0 - Production on 10-OCT-2010 13:09:57
    Copyright (c) 1991, 2007, Oracle.  All rights reserved.
    System parameter file is D:\oracle\GCP\102\network\admin\listener.ora
    Log messages written to D:\oracle\GCP\102\network\log\listener.log
    Trace information written to D:\oracle\GCP\102\network\trace\listener.trc
    Trace level is currently 0
    Started with pid=2336
    Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(PIPENAME=\\.\pipe\GCP.WORLDipc)))
    Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(PIPENAME=\\.\pipe\GCPipc)))
    Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.11.13)(PORT=1527)))
    Listener completed notification to CRS on start
    TIMESTAMP * CONNECT DATA [* PROTOCOL INFO] * EVENT [* SID] * RETURN CODE
    TNSLSNR for 64-bit Windows: Version 10.2.0.4.0 - Production on 10-OCT-2010 13:14:34
    Copyright (c) 1991, 2007, Oracle.  All rights reserved.
    System parameter file is D:\oracle\GCP\102\network\admin\listener.ora
    Log messages written to D:\oracle\GCP\102\network\log\listener.log
    Trace information written to D:\oracle\GCP\102\network\trace\listener.trc
    Trace level is currently 0
    Started with pid=4948
    Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(PIPENAME=\\.\pipe\GCP.WORLDipc)))
    Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(PIPENAME=\\.\pipe\GCPipc)))
    Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.11.13)(PORT=1527)))
    Listener completed notification to CRS on start
    TIMESTAMP * CONNECT DATA [* PROTOCOL INFO] * EVENT [* SID] * RETURN CODE
    TNSLSNR for 64-bit Windows: Version 10.2.0.4.0 - Production on 10-OCT-2010 13:38:12
    Copyright (c) 1991, 2007, Oracle.  All rights reserved.
    System parameter file is D:\oracle\GCP\102\network\admin\listener.ora
    Log messages written to D:\oracle\GCP\102\network\log\listener.log
    Trace information written to D:\oracle\GCP\102\network\trace\listener.trc
    Trace level is currently 0
    Started with pid=2456
    Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(PIPENAME=\\.\pipe\GCP.WORLDipc)))
    Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(PIPENAME=\\.\pipe\GCPipc)))
    Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.11.13)(PORT=1527)))
    Listener completed notification to CRS on start
    TIMESTAMP * CONNECT DATA [* PROTOCOL INFO] * EVENT [* SID] * RETURN CODE
    TNSLSNR for 64-bit Windows: Version 10.2.0.4.0 - Production on 10-OCT-2010 14:03:35
    Copyright (c) 1991, 2007, Oracle.  All rights reserved.
    System parameter file is D:\oracle\GCP\102\network\admin\listener.ora
    Log messages written to D:\oracle\GCP\102\network\log\listener.log
    Trace information written to D:\oracle\GCP\102\network\trace\listener.trc
    Trace level is currently 0
    Started with pid=2756
    Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(PIPENAME=\\.\pipe\GCP.WORLDipc)))
    Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(PIPENAME=\\.\pipe\GCPipc)))
    Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.11.13)(PORT=1527)))
    Listener completed notification to CRS on start
    TIMESTAMP * CONNECT DATA [* PROTOCOL INFO] * EVENT [* SID] * RETURN CODE
    TNSLSNR for 64-bit Windows: Version 10.2.0.4.0 - Production on 10-OCT-2010 14:10:42
    Copyright (c) 1991, 2007, Oracle.  All rights reserved.
    System parameter file is D:\oracle\GCP\102\network\admin\listener.ora
    Log messages written to D:\oracle\GCP\102\network\log\listener.log
    Trace information written to D:\oracle\GCP\102\network\trace\listener.trc
    Trace level is currently 0
    Started with pid=4812
    Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(PIPENAME=\\.\pipe\GCP.WORLDipc)))
    Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(PIPENAME=\\.\pipe\GCPipc)))
    Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.11.13)(PORT=1527)))
    Listener completed notification to CRS on start
    TIMESTAMP * CONNECT DATA [* PROTOCOL INFO] * EVENT [* SID] * RETURN CODE
    TNSLSNR for 64-bit Windows: Version 10.2.0.4.0 - Production on 11-OCT-2010 09:34:05
    Copyright (c) 1991, 2007, Oracle.  All rights reserved.
    System parameter file is D:\oracle\GCP\102\network\admin\listener.ora
    Log messages written to D:\oracle\GCP\102\network\log\listener.log
    Trace information written to D:\oracle\GCP\102\network\trace\listener.trc
    Trace level is currently 0
    Started with pid=1920
    Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.11.13)(PORT=1527)))
    Listener completed notification to CRS on start
    TIMESTAMP * CONNECT DATA [* PROTOCOL INFO] * EVENT [* SID] * RETURN CODE
    TNSLSNR for 64-bit Windows: Version 10.2.0.4.0 - Production on 11-OCT-2010 21:12:29
    Copyright (c) 1991, 2007, Oracle.  All rights reserved.
    System parameter file is D:\oracle\GCP\102\network\admin\listener.ora
    Log messages written to D:\oracle\GCP\102\network\log\listener.log
    Trace information written to D:\oracle\GCP\102\network\trace\listener.trc
    Trace level is currently 0
    Started with pid=1952
    Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.11.13)(PORT=1527)))
    Listener completed notification to CRS on start
    TIMESTAMP * CONNECT DATA [* PROTOCOL INFO] * EVENT [* SID] * RETURN CODE

  • Problem: Stopping cluster due to unhandled exception .. Unable to refresh

    Greetings
    While testing a rolling upgrade of an application that uses Coherence 3.5.2 on a 62-JVM cluster hosted on
    11 physical machines, we encountered a situation where, after the upgrade was completed, most JVMs
    in the system abruptly left the cluster. The physical hosts are running CentOS 5.4; the Java used is
    the 64-bit server version 1.6.0_16-b01. The test was run under a scenario that imposed "moderate"
    load, with CPU usage on the physical machines never exceeding 60% busy, network bandwidth never
    exceeding 5%, and with some free physical memory. Swapping did not occur at any time during
    the test. I believe we are using the default tangosol-coherence.xml. We got the error below in
    our coherence.log files on all of the systems, all at about the same time (within 10 milliseconds).
    55 of the JVMs left the cluster during the incident. There were 55 copies of the error message in
    the various logs, all nearly identical except for the time and member ID.
    My questions include:
    - What does the error mean?
    - What could cause it? (I investigated the system logs and found no evidence of the NICs
    going offline at the time. Any suggestions about how to look for evidence of a broadcast storm?)
    - How can we keep it from happening again?
    Many thanks for your help -
    Mike Murphy
    2011-04-26 17:34:14,629 Coherence Logger@9224544 3.5.2/463 ERROR 2011-04-26 17:34:14.629/1929.311 Oracle Coherence GE 3.5.2/463 <Error> (thread=PacketListenerN, member=30): Stopping cluster due to unhandled exception: com.tangosol.net.messaging.ConnectionException: Unable to refresh sockets: [UnicastUdpSocket{State=STATE_OPEN, address:port=10.48.88.116:8091}, MulticastUdpSocket{State=STATE_OPEN, address:port=224.3.5.2:10013, InterfaceAddress=10.48.88.116, TimeToLive=4}, TcpSocketAccepter{State=STATE_OPEN, ServerSocket=10.48.88.116:8091}]; last failed socket: MulticastUdpSocket{State=STATE_OPEN, address:port=224.3.5.2:10013, InterfaceAddress=10.48.88.116, TimeToLive=4}
    at com.tangosol.coherence.component.net.Cluster$SocketManager.refreshSockets(Cluster.CDB:91)
    at com.tangosol.coherence.component.net.Cluster$SocketManager$MulticastUdpSocket.onInterruptedIOException(Cluster.CDB:9)
    at com.tangosol.coherence.component.net.socket.UdpSocket.receive(UdpSocket.CDB:33)
    at com.tangosol.coherence.component.net.UdpPacket.receive(UdpPacket.CDB:4)
    at com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketListener.onNotify(PacketListener.CDB:19)
    at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
    at java.lang.Thread.run(Thread.java:619)
    Caused by: java.net.SocketTimeoutException: Receive timed out
    at java.net.PlainDatagramSocketImpl.receive0(Native Method)
    at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:136)
    at java.net.DatagramSocket.receive(DatagramSocket.java:712)
    at com.tangosol.coherence.component.net.socket.UdpSocket.receive(UdpSocket.CDB:20)
    at com.tangosol.coherence.component.net.UdpPacket.receive(UdpPacket.CDB:4)
    at com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketListener.onNotify(PacketListener.CDB:19)
    at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)

    Hi Mike,
    It looks like you are having problems with multicast. You can run the multicast test described here:
    [http://download.oracle.com/docs/cd/E15357_01/coh.360/e15723/tune_multigramtest.htm]
    It will help in diagnosing the problem.
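
Independently of the Coherence multicast test linked above, a plain java.net probe can show whether any datagram arrives on the multicast group within a timeout. This is a minimal sketch and not the Coherence test utility; the class name MulticastProbe and the 5 second timeout are assumptions, while the group 224.3.5.2 and port 10013 are taken from the log in the question.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.MulticastSocket;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;

// Hypothetical diagnostic; not the Coherence multicast test utility.
final class MulticastProbe {
    // Returns true if the literal IP address lies in the multicast range.
    static boolean isMulticast(String ipLiteral) {
        try {
            return InetAddress.getByName(ipLiteral).isMulticastAddress();
        } catch (UnknownHostException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Joins the group and waits up to timeoutMillis for a single datagram.
    static boolean receivedWithin(String group, int port, int timeoutMillis) throws IOException {
        InetSocketAddress groupAddress = new InetSocketAddress(InetAddress.getByName(group), port);
        try (MulticastSocket socket = new MulticastSocket(port)) {
            socket.joinGroup(groupAddress, null); // null: use the default network interface
            socket.setSoTimeout(timeoutMillis);
            try {
                socket.receive(new DatagramPacket(new byte[1500], 1500));
                return true;
            } catch (SocketTimeoutException e) {
                return false; // same exception type as the "Receive timed out" in the log
            }
        }
    }

    public static void main(String[] args) throws IOException {
        String group = "224.3.5.2"; // multicast group from the log above
        int port = 10013;           // multicast port from the log above
        if (!isMulticast(group)) {
            throw new IllegalArgumentException(group + " is not a multicast address");
        }
        System.out.println(receivedWithin(group, port, 5000) ? "received a datagram" : "timed out");
    }
}
```

Run on a cluster host while other members are up, a timeout here would be consistent with multicast traffic not reaching that host, but the official test remains the better diagnostic.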

Maybe you are looking for