Restart of Cluster node

Hi!
I have restarted one of the Windows cluster nodes. Now I cannot log in again.
Should any steps on the other cluster node be executed (moving groups)?
If yes, how?
Thank you very much!
regards
Thom

Hi
I have restarted one of the windows cluster.
---- Was it Node A or Node B? Either way, you should still be able to log in to the other node. Did you run a ping test or a network traceroute from your system to the active node that is online?
There will be a total of 5 IP addresses assigned to your cluster server. Ping all of them, and if they are available, check at the server terminal that the physical server is up and running. Try to log in directly at the server console, not over the network.
Then check that all services are up and running. Most importantly, check that the Cluster service is running.
Try all the above steps and report back.
regards
Raj
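
The reachability check suggested above can be scripted. The following is only a minimal sketch: the function names and the example addresses are made up for illustration, and the real cluster addresses have to come from your own configuration. It sends one ping per address and reports which ones did not answer. Note that on Windows, ping can exit with code 0 even when the reply is "Destination host unreachable", so treat a "reachable" result as a hint rather than proof.

```python
import platform
import subprocess


def ping_once(address, timeout_s=5):
    """Return True if a single ICMP echo request to `address` appears to succeed."""
    # Windows ping uses -n for the packet count, most other systems use -c.
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", address],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def check_addresses(addresses, probe=ping_once):
    """Probe every address and return {address: reachable}."""
    return {address: probe(address) for address in addresses}


def unreachable(results):
    """Return the sorted list of addresses that did not answer."""
    return sorted(address for address, ok in results.items() if not ok)


if __name__ == "__main__":
    # Hypothetical addresses; replace them with the ones assigned to your cluster.
    cluster_ips = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]
    print(unreachable(check_addresses(cluster_ips)))
```

The `probe` parameter is there so the logic can be tried without sending real packets, for example with `check_addresses(ips, probe=lambda a: True)`.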

Similar Messages

  • Restarting a cluster node

    Hello!
    Normally it should be a simple thing to restart a server when it's part of a cluster. But last time it went very wrong and it took us 90 minutes to get SAP running again.
    What we have:
    Windows 2003 Cluster:
    Node1 holding SAP application
    Node2 holding the database
    What we did:
    moving SAP application from Node1 to Node2
    shutting down SAP instances on Node1
    restarting Node1
    What happened:
    Node1 not responding
    Cluster group was malfunctioning
    group SAP PRD not found
    We had to shut down both servers to initialize both nodes and get the cluster running again. Today we have to do the same again, but this time without any problems, please. Did we do something wrong? Do we have to shut down the SAP instances on Node1 (they were already running on Node2 the last time)?
    Thanks for any input

    I don't have good experience with clusters on Windows.
    When you move a node, does it succeed on the first attempt, or does it have to try several times?
    When you restart the server, some servers may be corrupted.
    Regards,
    Alfredo.
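
One thing worth checking before restarting a node is whether any cluster group is still owned by it. The sketch below is an illustration only: the group-to-owner mapping is typed in by hand and the owner values are hypothetical, so in practice you would fill it from whatever your cluster administration tool reports. It does not claim to explain what went wrong in the post above.

```python
def groups_on_node(group_owners, node):
    """Return the sorted names of all groups currently owned by `node`."""
    return sorted(group for group, owner in group_owners.items() if owner == node)


def safe_to_restart(group_owners, node):
    """A node is treated as safe to restart only if it owns no groups."""
    return not groups_on_node(group_owners, node)


if __name__ == "__main__":
    # Hypothetical ownership snapshot, entered manually for illustration.
    owners = {
        "Cluster Group": "Node1",
        "SAP PRD": "Node2",
    }
    print(groups_on_node(owners, "Node1"))
    print(safe_to_restart(owners, "Node1"))
```

With this example snapshot the SAP group has already moved, but another group is still on Node1, so the check would ask you to move it first.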

  • Cluster Node Unable to Maintain Cluster Membership

    My cluster logs are very similar to the above thread... was it ever addressed?
    [SV] Already protecting connection with message security level 'sign'
    [FTI] Stream already exists to node: false
    [Channel IP to another cluster node member] Close()
    GracefuleClose(1226) because of channel to remote endpoint another cluster node
    ~ is closed
    The Cluster service stops and generates:
    The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server serverName$. The target name used was serverName.
    This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal name (SPN) is registered on an account other than the account the target service is using. Ensure that the target SPN is only registered on the account used by the server.
    Roderick Lyons
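
The condition named in that error, an SPN registered on more than one account, can be illustrated with a small sketch. It assumes you have already collected the SPNs per account by some other means; the account names and SPN strings below are invented, and comparing SPNs case-insensitively is an assumption of this sketch.

```python
from collections import defaultdict


def find_duplicate_spns(spns_by_account):
    """Return {spn: [accounts]} for every SPN registered on more than one account."""
    accounts_by_spn = defaultdict(set)
    for account, spns in spns_by_account.items():
        for spn in spns:
            # Assumption: SPNs are compared without regard to case.
            accounts_by_spn[spn.lower()].add(account)
    return {
        spn: sorted(accounts)
        for spn, accounts in accounts_by_spn.items()
        if len(accounts) > 1
    }


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    inventory = {
        "serverName$": ["HOST/serverName", "MSServerClusterMgmtAPI/serverName"],
        "svc-old": ["host/servername"],
    }
    print(find_duplicate_spns(inventory))
```

Any SPN that shows up in the result is a candidate for the "registered on the wrong account" situation the Kerberos message describes.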

    Hi Roderick Lyons,
    Could you give us the exact URL of the “above thread”? I am not sure which thread you mean.
    Please also provide more information about your environment, such as the DC server edition and the cluster node server edition.
    If you have a mixed 2003 and 2012 R2 DC environment, please restart your cluster node and then continue monitoring.
    The related article:
    It turns out that weird things can happen when you mix Windows Server 2003 and Windows Server 2012 R2 domain controllers
    http://blogs.technet.com/b/askds/archive/2014/07/23/it-turns-out-that-weird-things-can-happen-when-you-mix-windows-server-2003-and-windows-server-2012-r2-domain-controllers.aspx
    Can't log on after changing machine account password in mixed Windows Server 2012 R2 and Windows Server 2003 environment
    http://support.microsoft.com/kb/2989971
    From the current error, another possibility is that you never ran cluster validation before creating the cluster. Please run cluster validation first, then post the warning or error information.
    If the above solution does not work, please consider rebooting your PDC outside production hours.
    More information:
    Kerberos Service Principal Name on Wrong Account
    https://support.microsoft.com/kb/2706695?wa=wsignin1.0
    Fixing the Security-Kerberos / 4 error
    http://blogs.technet.com/b/dcaro/archive/2013/07/04/fixing-the-security-kerberos-4-error.aspx
    Service Principal Names (SPNs) SetSPN Syntax (Setspn.exe)
    http://social.technet.microsoft.com/wiki/contents/articles/717.service-principal-names-spns-setspn-syntax-setspn-exe.aspx
    I’m glad to be of help to you!

  • SCVMM losing connection to cluster nodes

    Hey guys'n girls, I hope this is the right forum for this question. I already opened a ticket at MS support as well because it's impacting our production environment indirectly, but even after a week there's been no contact. Losing faith in MS support there.
    The problem we're having is that in SCVMM a host enters the 'needs attention' state, with a WinRM error 0x80338126. I guess it has something to do with the network or with Kerberos, and I've found some info on it, but I still haven't been able to solve it. Do you guys have any ideas?
    Problem summary:
    We are seeing an issue on our new hyper-v platform. The platform should have been in production last week, but this issue is delaying our project as we can't seem to get it stable.
    The problem we are experiencing is that SCVMM loses the connection to some of the Hyper-V nodes. Not one specific node. Last week it happened to two nodes, and today it happened to another node. I see issues with WinRM, and I expect it has something to do with Kerberos. See the bottom of this post for background details and software versions.
    The host gets the status 'needs attention', and if you look at the status of the machine, WinRM gives an error. The error is:
    Error (2916)
    VMM is unable to complete the request. The connection to the agent cc1-hyp-10.domaincloud1.local was lost.
    WinRM: URL: [http://cc1-hyp-10.domaincloud1.local:5985], Verb: [ENUMERATE], Resource: [http://schemas.microsoft.com/wbem/wsman/1/wmi/root/cimv2/Win32_Service], Filter: [select * from Win32_Service where Name="WinRM"]
    Unknown error (0x80338126)
    Recommended Action
    Ensure that the Windows Remote Management (WinRM) service and the VMM agent are installed and running and that a firewall is not blocking HTTP/HTTPS traffic. Ensure that VMM server is able to communicate with cc1-hyp-10.domaincloud1.local over WinRM by successfully running the following command:
     winrm id -r:cc1-hyp-10.domaincloud1.local
    This problem can also be caused by a Windows Management Instrumentation (WMI) service crash. If the server is running Windows Server 2008 R2, ensure that KB 982293 (http://support.microsoft.com/kb/982293) is installed on it.
    If the error persists, restart cc1-hyp-10.domaincloud1.local and then try the operation again.
    Refer to http://support.microsoft.com/kb/2742275 for more details.
    Doing a simple test from the VMM server to the problematic cluster node shows this error:
    PS C:\> hostname
    CC1-VMM-01
    PS C:\> winrm id -r:cc1-hyp-10.domaincloud1.local
    WSManFault
        Message = WinRM cannot complete the operation. Verify that the specified computer name is valid, that the computer is accessible over the network, and that a firewall exception for the WinRM service is enabled and allows access from this computer. By default, the WinRM firewall exception for public profiles limits access to remote computers within the same local subnet.
    Error number:  -2144108250 0x80338126
    WinRM cannot complete the operation. Verify that the specified computer name is valid, that the computer is accessible over the network, and that a firewall exception for the WinRM service is enabled and allows access from this computer. By default, the WinRM firewall exception for public profiles limits access to remote computers within the same local subnet.
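
The two numbers in the "Error number" line are the same 32-bit value, printed once as a signed decimal and once as hexadecimal. That helps when searching, because some sources quote only one of the two forms. A minimal sketch of the conversion (the function names are made up for this example):

```python
def signed_to_hex(signed_code):
    """Format a signed 32-bit error number as an unsigned hex string."""
    # Masking with 0xFFFFFFFF reinterprets the two's-complement value as unsigned.
    return "0x{:08X}".format(signed_code & 0xFFFFFFFF)


def hex_to_signed(hex_text):
    """Convert a hex error code such as '0x80338126' back to a signed 32-bit integer."""
    value = int(hex_text, 16)
    # If the top bit is set, the signed interpretation is negative.
    return value - (1 << 32) if value & 0x80000000 else value


if __name__ == "__main__":
    print(signed_to_hex(-2144108250))
    print(hex_to_signed("0x80338126"))
```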
    I CAN connect from other hosts to this problematic cluster node:
    PS C:\> hostname
    CC1-HYP-16
    PS C:\> winrm id -r:cc1-hyp-10.domaincloud1.local
    IdentifyResponse
        ProtocolVersion = http://schemas.dmtf.org/wbem/wsman/1/wsman.xsd
        ProductVendor = Microsoft Corporation
        ProductVersion = OS: 6.3.9600 SP: 0.0 Stack: 3.0
        SecurityProfiles
            SecurityProfileName = http://schemas.dmtf.org/wbem/wsman/1/wsman/secprofile/http/spnego-kerberos
    And I can connect from the vmm server to all other cluster nodes:
    PS C:\> hostname
    CC1-VMM-01
    PS C:\> winrm id -r:cc1-hyp-11.domaincloud1.local
    IdentifyResponse
        ProtocolVersion = http://schemas.dmtf.org/wbem/wsman/1/wsman.xsd
        ProductVendor = Microsoft Corporation
        ProductVersion = OS: 6.3.9600 SP: 0.0 Stack: 3.0
        SecurityProfiles
            SecurityProfileName = http://schemas.dmtf.org/wbem/wsman/1/wsman/secprofile/http/spnego-kerberos
    So at this point only the test from the cc1-vmm-01 to cc1-hyp-10 seems to be problematic.
    I followed the steps on the page https://support.microsoft.com/kb/2742275 (which is referred to above). I tried the VMMCA, but I can't really get it working the way I want, or it seems to give outdated recommendations.
    I tried checking for duplicate SPNs by running setspn -x on the affected machines. No results (although I do not understand what an SPN is or how it works). I rebuilt the performance counters.
    I tried setting 'sc config winrm type= own' as described in [http://blinditandnetworkadmin.blogspot.nl/2012/08/kb-how-to-troubleshoot-needs-attention.html].
    If I reboot this cc1-hyp-10 machine, it will start working perfectly again. However, then I can't troubleshoot the issue, and it will happen again.
    I want this problem to be solved, so vmm never loses connection to the hypervisors it's managing again!
    Background information:
    We've set up a platform with Hyper-V to run a VM workload. The platform consists of the following hardware:
    2 Dell R620's with 32GB of RAM, running hyper-v to virtualize the cloud management layer (DC's, VMM, SQL). These machines are called cc1-hyp-01 and cc1-hyp-02. They run the management vm's like cc1-dc-01/02, cc1-sql-01, cc1-vmm-01, etc. The names are self-explanatory.
    The VMM machine is NOT clustered.
    8 Dell M620 blades with 320GB of RAM, running hyper-v to virtualize the customer workload. The machines are called cc1-hyp-10 through cc1-hyp-17. They are in a cluster.
    2 Equallogic units form a SAN (premium storage), and we have a Dell R515 running iscsi target (budget storage).
    We have Dell Force10 switches and Cisco C3750X switches to connect everything together (mostly 10GB links).
    All hosts run Windows Server 2012 R2 Datacenter edition. The VMM server runs System Center Virtual Machine Manager 2012 R2.
    All the latest Windows updates are installed on every host. There are no firewalls between any host (vmm and hypervisors) at this level. Windows firewalls are all disabled. No antivirus software is installed, no symantec software is installed.
    The only non-standard software that is installed is the Dell Host Integration Tools 4.7.1, Dell Openmanage Server Administrator, and some small stuff like 7-zip, bginfo, net-snap, etc.
    The SCVMM service is running under the domain account DOMAINCLOUD1\scvmm. This machine is in the local administrators group of each cluster node.
    On top of this cloud layer we're running the tenant layer with a lot of vm's for a specific customer (although they are all off now).

    I think I found the culprit. After an hour of analyzing Wireshark dumps, I found that the VMM server had jumbo frames enabled on the management interface to the hosts (and the underlying infrastructure does not). With that corrected, my winrm commands started working again.
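
A jumbo-frame mismatch like the one described above can usually be spotted faster than with packet captures by pinging with the Don't Fragment flag set and a payload sized to the MTU under test. The sketch below only builds the Windows ping command line (`-f` sets Don't Fragment, `-l` sets the payload size, `-n` the count); the helper names are invented here, the 28 bytes assume an IPv4 header without options plus the ICMP header, and the host name is the one from the post.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def icmp_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet of size `mtu`."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def windows_df_ping_command(host, mtu):
    """Build a Windows ping command that sends one Don't Fragment packet filling `mtu`."""
    return ["ping", "-f", "-l", str(icmp_payload_for_mtu(mtu)), "-n", "1", host]


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(" ".join(windows_df_ping_command("cc1-hyp-10.domaincloud1.local", mtu)))
```

If the 1500-byte test passes but the 9000-byte test fails while the sending interface is configured for jumbo frames, something on the path does not carry jumbo frames.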

  • Error: Halting this cluster node due to unrecoverable service failure

    Our cluster has experienced some sort of fault that has only become apparent today. The origin appears to have been nearly a month ago yet the symptoms have only just manifested.
    The node in question is a standalone instance running a DistributedCache service with local storage. It output the following to stdout on Jan-22:
    Coherence <Error>: Halting this cluster node due to unrecoverable service failure
    It finally failed today with OutOfMemoryError: Java heap space.
    We're running coherence-3.5.2.jar.
    Q1: It looks like this node failed on Jan-22 yet we did not notice. What is the best way to monitor node health?
    Q2: What might the root cause be for such a fault?
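
As a rough illustration for Q1, one low-tech option is to scan the node's log for the guardian and halt messages this node logged, and alert on any hit. This is only a sketch, not a Coherence facility: the patterns are taken from the message texts quoted in this post, and the function name and result shape are made up.

```python
import re

# Patterns based on the message texts quoted in this post.
PATTERNS = {
    "soft_timeout": re.compile(
        r"Attempting recovery \(due to soft timeout\) of Guard\{Daemon=(?P<daemon>[^}]+)\}"
    ),
    "hard_timeout": re.compile(
        r"Terminating guarded execution \(due to hard timeout\) of Guard\{Daemon=(?P<daemon>[^}]+)\}"
    ),
    "halt": re.compile(r"Halting this cluster node due to unrecoverable service failure"),
}


def scan_log(lines):
    """Return (line_number, kind, daemon) for every line matching a known pattern."""
    findings = []
    for number, line in enumerate(lines, start=1):
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                # The halt pattern has no named group, so daemon is None there.
                findings.append((number, kind, match.groupdict().get("daemon")))
    return findings


if __name__ == "__main__":
    sample = [
        "<Error> (thread=Cluster, member=33): Attempting recovery (due to soft timeout) of Guard{Daemon=DistributedCache}",
        "Coherence <Error>: Halting this cluster node due to unrecoverable service failure",
    ]
    for finding in scan_log(sample):
        print(finding)
```

Run periodically against the log file, a non-empty result would have flagged the Jan-22 failure on the day it happened.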
    I found the following in the logs:
    2011-01-22 01:18:58,296 Coherence Logger@9216774 3.5.2/463 ERROR 2011-01-22 01:18:58.296/9910749.462 Oracle Coherence EE 3.5.2/463 <Error> (thread=Cluster, member=33): Attempting recovery (due to soft timeout) of Guard{Daemon=DistributedCache}
    2011-01-22 01:19:04,772 Coherence Logger@9216774 3.5.2/463 ERROR 2011-01-22 01:19:04.772/9910755.938 Oracle Coherence EE 3.5.2/463 <Error> (thread=Cluster, member=33): Terminating guarded execution (due to hard timeout) of Guard{Daemon=DistributedCache}
    2011-01-22 01:19:05,785 Coherence Logger@9216774 3.5.2/463 ERROR 2011-01-22 01:19:05.785/9910756.951 Oracle Coherence EE 3.5.2/463 <Error> (thread=Termination Thread, member=33): Full Thread Dump
    Thread[Reference Handler,10,system]
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
    Thread[DistributedCache,5,Cluster]
    java.nio.Bits.copyToByteArray(Native Method)
    java.nio.DirectByteBuffer.get(DirectByteBuffer.java:224)
    com.tangosol.io.nio.ByteBufferInputStream.read(ByteBufferInputStream.java:123)
    java.io.DataInputStream.readFully(DataInputStream.java:178)
    java.io.DataInputStream.readFully(DataInputStream.java:152)
    com.tangosol.util.Binary.readExternal(Binary.java:1066)
    com.tangosol.util.Binary.<init>(Binary.java:183)
    com.tangosol.io.nio.BinaryMap$Block.readValue(BinaryMap.java:4304)
    com.tangosol.io.nio.BinaryMap$Block.getValue(BinaryMap.java:4130)
    com.tangosol.io.nio.BinaryMap.get(BinaryMap.java:377)
    com.tangosol.io.nio.BinaryMapStore.load(BinaryMapStore.java:64)
    com.tangosol.net.cache.SerializationPagedCache$WrapperBinaryStore.load(SerializationPagedCache.java:1547)
    com.tangosol.net.cache.SerializationPagedCache$PagedBinaryStore.load(SerializationPagedCache.java:1097)
    com.tangosol.net.cache.SerializationMap.get(SerializationMap.java:121)
    com.tangosol.net.cache.SerializationPagedCache.get(SerializationPagedCache.java:247)
    com.tangosol.net.cache.AbstractSerializationCache$1.getOldValue(AbstractSerializationCache.java:315)
    com.tangosol.net.cache.OverflowMap$Status.registerBackEvent(OverflowMap.java:4210)
    com.tangosol.net.cache.OverflowMap.onBackEvent(OverflowMap.java:2316)
    com.tangosol.net.cache.OverflowMap$BackMapListener.onMapEvent(OverflowMap.java:4544)
    com.tangosol.util.MultiplexingMapListener.entryDeleted(MultiplexingMapListener.java:49)
    com.tangosol.util.MapEvent.dispatch(MapEvent.java:214)
    com.tangosol.util.MapEvent.dispatch(MapEvent.java:166)
    com.tangosol.util.MapListenerSupport.fireEvent(MapListenerSupport.java:556)
    com.tangosol.net.cache.AbstractSerializationCache.dispatchEvent(AbstractSerializationCache.java:338)
    com.tangosol.net.cache.AbstractSerializationCache.dispatchPendingEvent(AbstractSerializationCache.java:321)
    com.tangosol.net.cache.AbstractSerializationCache.removeBlind(AbstractSerializationCache.java:155)
    com.tangosol.net.cache.SerializationPagedCache.removeBlind(SerializationPagedCache.java:348)
    com.tangosol.util.AbstractKeyBasedMap$KeySet.remove(AbstractKeyBasedMap.java:556)
    com.tangosol.net.cache.OverflowMap.removeInternal(OverflowMap.java:1299)
    com.tangosol.net.cache.OverflowMap.remove(OverflowMap.java:380)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$Storage.clear(DistributedCache.CDB:24)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache.onClearRequest(DistributedCache.CDB:32)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$ClearRequest.run(DistributedCache.CDB:1)
    com.tangosol.coherence.component.net.message.requestMessage.DistributedCacheRequest.onReceived(DistributedCacheRequest.CDB:12)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onMessage(Grid.CDB:9)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onNotify(Grid.CDB:136)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache.onNotify(DistributedCache.CDB:3)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
    java.lang.Thread.run(Thread.java:619)
    Thread[Finalizer,8,system]
    java.lang.Object.wait(Native Method)
    java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
    java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
    java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
    Thread[PacketReceiver,7,Cluster]
    java.lang.Object.wait(Native Method)
    com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
    com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketReceiver.onWait(PacketReceiver.CDB:2)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
    java.lang.Thread.run(Thread.java:619)
    Thread[RMI TCP Accept-0,5,system]
    java.net.PlainSocketImpl.socketAccept(Native Method)
    java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
    java.net.ServerSocket.implAccept(ServerSocket.java:453)
    java.net.ServerSocket.accept(ServerSocket.java:421)
    sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
    sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)
    java.lang.Thread.run(Thread.java:619)
    Thread[PacketSpeaker,8,Cluster]
    java.lang.Object.wait(Native Method)
    com.tangosol.coherence.component.util.queue.ConcurrentQueue.waitForEntry(ConcurrentQueue.CDB:16)
    com.tangosol.coherence.component.util.queue.ConcurrentQueue.remove(ConcurrentQueue.CDB:7)
    com.tangosol.coherence.component.util.Queue.remove(Queue.CDB:1)
    com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketSpeaker.onNotify(PacketSpeaker.CDB:62)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
    java.lang.Thread.run(Thread.java:619)
    Thread[Logger@9216774 3.5.2/463,3,main]
    java.lang.Object.wait(Native Method)
    com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
    java.lang.Thread.run(Thread.java:619)
    Thread[PacketListener1,8,Cluster]
    java.net.PlainDatagramSocketImpl.receive0(Native Method)
    java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:136)
    java.net.DatagramSocket.receive(DatagramSocket.java:712)
    com.tangosol.coherence.component.net.socket.UdpSocket.receive(UdpSocket.CDB:20)
    com.tangosol.coherence.component.net.UdpPacket.receive(UdpPacket.CDB:4)
    com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketListener.onNotify(PacketListener.CDB:19)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
    java.lang.Thread.run(Thread.java:619)
    Thread[main,5,main]
    java.lang.Object.wait(Native Method)
    com.tangosol.net.DefaultCacheServer.main(DefaultCacheServer.java:79)
    com.networkfleet.cacheserver.Launcher.main(Launcher.java:122)
    Thread[Signal Dispatcher,9,system]
    Thread[RMI TCP Accept-41006,5,system]
    java.net.PlainSocketImpl.socketAccept(Native Method)
    java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
    java.net.ServerSocket.implAccept(ServerSocket.java:453)
    java.net.ServerSocket.accept(ServerSocket.java:421)
    sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
    sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)
    java.lang.Thread.run(Thread.java:619)
    ThreadCluster
    java.lang.Object.wait(Native Method)
    com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onWait(Grid.CDB:9)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
    java.lang.Thread.run(Thread.java:619)
    Thread[TcpRingListener,6,Cluster]
    java.net.PlainSocketImpl.socketAccept(Native Method)
    java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
    java.net.ServerSocket.implAccept(ServerSocket.java:453)
    java.net.ServerSocket.accept(ServerSocket.java:421)
    com.tangosol.coherence.component.net.socket.TcpSocketAccepter.accept(TcpSocketAccepter.CDB:18)
    com.tangosol.coherence.component.util.daemon.TcpRingListener.acceptConnection(TcpRingListener.CDB:10)
    com.tangosol.coherence.component.util.daemon.TcpRingListener.onNotify(TcpRingListener.CDB:9)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
    java.lang.Thread.run(Thread.java:619)
    Thread[PacketPublisher,6,Cluster]
    java.lang.Object.wait(Native Method)
    com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
    com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketPublisher.onWait(PacketPublisher.CDB:2)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
    java.lang.Thread.run(Thread.java:619)
    Thread[RMI TCP Accept-0,5,system]
    java.net.PlainSocketImpl.socketAccept(Native Method)
    java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
    java.net.ServerSocket.implAccept(ServerSocket.java:453)
    java.net.ServerSocket.accept(ServerSocket.java:421)
    sun.management.jmxremote.LocalRMIServerSocketFactory$1.accept(LocalRMIServerSocketFactory.java:34)
    sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
    sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)
    java.lang.Thread.run(Thread.java:619)
    Thread[PacketListenerN,8,Cluster]
    java.net.PlainDatagramSocketImpl.receive0(Native Method)
    java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:136)
    java.net.DatagramSocket.receive(DatagramSocket.java:712)
    com.tangosol.coherence.component.net.socket.UdpSocket.receive(UdpSocket.CDB:20)
    com.tangosol.coherence.component.net.UdpPacket.receive(UdpPacket.CDB:4)
    com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketListener.onNotify(PacketListener.CDB:19)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
    java.lang.Thread.run(Thread.java:619)
    Thread[Invocation:Management,5,Cluster]
    java.lang.Object.wait(Native Method)
    com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onWait(Grid.CDB:9)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
    java.lang.Thread.run(Thread.java:619)
    Thread[DistributedCache:PofDistributedCache,5,Cluster]
    java.lang.Object.wait(Native Method)
    com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onWait(Grid.CDB:9)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
    java.lang.Thread.run(Thread.java:619)
    Thread[Invocation:Management:EventDispatcher,5,Cluster]
    java.lang.Object.wait(Native Method)
    com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
    com.tangosol.coherence.component.util.daemon.queueProcessor.Service$EventDispatcher.onWait(Service.CDB:7)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
    java.lang.Thread.run(Thread.java:619)
    Thread[Termination Thread,5,Cluster]
    java.lang.Thread.dumpThreads(Native Method)
    java.lang.Thread.getAllStackTraces(Thread.java:1487)
    sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    java.lang.reflect.Method.invoke(Method.java:597)
    com.tangosol.net.GuardSupport.logStackTraces(GuardSupport.java:791)
    com.tangosol.coherence.component.net.Cluster.onServiceFailed(Cluster.CDB:5)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid$Guard.terminate(Grid.CDB:17)
    com.tangosol.net.GuardSupport$2.run(GuardSupport.java:652)
    java.lang.Thread.run(Thread.java:619)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
    java.lang.Thread.run(Thread.java:619)
    Thread[RMI TCP Accept-0,5,system]
    java.net.PlainSocketImpl.socketAccept(Native Method)
    java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
    java.net.ServerSocket.implAccept(ServerSocket.java:453)
    java.net.ServerSocket.accept(ServerSocket.java:421)
    sun.management.jmxremote.LocalRMIServerSocketFactory$1.accept(LocalRMIServerSocketFactory.java:34)
    sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
    sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)
    java.lang.Thread.run(Thread.java:619)
    Thread[PacketListenerN,8,Cluster]
    java.net.PlainDatagramSocketImpl.receive0(Native Method)
    java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:136)
    java.net.DatagramSocket.receive(DatagramSocket.java:712)
    com.tangosol.coherence.component.net.socket.UdpSocket.receive(UdpSocket.CDB:20)
    com.tangosol.coherence.component.net.UdpPacket.receive(UdpPacket.CDB:4)
    com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketListener.onNotify(PacketListener.CDB:19)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
    java.lang.Thread.run(Thread.java:619)
    Thread[Invocation:Management,5,Cluster]
    java.lang.Object.wait(Native Method)
    com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onWait(Grid.CDB:9)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
    java.lang.Thread.run(Thread.java:619)
    Thread[DistributedCache:PofDistributedCache,5,Cluster]
    java.lang.Object.wait(Native Method)
    com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onWait(Grid.CDB:9)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
    java.lang.Thread.run(Thread.java:619)
    Thread[Invocation:Management:EventDispatcher,5,Cluster]
    java.lang.Object.wait(Native Method)
    com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
    com.tangosol.coherence.component.util.daemon.queueProcessor.Service$EventDispatcher.onWait(Service.CDB:7)
    com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
    java.lang.Thread.run(Thread.java:619)
    Thread[Termination Thread,5,Cluster]
    java.lang.Thread.dumpThreads(Native Method)
    java.lang.Thread.getAllStackTraces(Thread.java:1487)
    sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    java.lang.reflect.Method.invoke(Method.java:597)
    com.tangosol.net.GuardSupport.logStackTraces(GuardSupport.java:791)
    com.tangosol.coherence.component.net.Cluster.onServiceFailed(Cluster.CDB:5)
    com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid$Guard.terminate(Grid.CDB:17)
    com.tangosol.net.GuardSupport$2.run(GuardSupport.java:652)
    java.lang.Thread.run(Thread.java:619)
    2011-01-22 01:19:06,738 Coherence Logger@9216774 3.5.2/463 INFO 2011-01-22 01:19:06.738/9910757.904 Oracle Coherence EE 3.5.2/463 <Info> (thread=main, member=33): Restarting Service: DistributedCache
    2011-01-22 01:19:06,738 Coherence Logger@9216774 3.5.2/463 INFO 2011-01-22 01:19:06.738/9910757.904 Oracle Coherence EE 3.5.2/463 <Info> (thread=main, member=33): Restarting Service: DistributedCache
    2011-01-22 01:19:06,738 Coherence Logger@9216774 3.5.2/463 ERROR 2011-01-22 01:19:06.738/9910757.904 Oracle Coherence EE 3.5.2/463 <Error> (thread=main, member=33): Failed to restart services: java.lang.IllegalStateException: Failed to unregister: DistributedCache{Name=DistributedCache, State=(SERVICE_STARTED), LocalStorage=enabled, PartitionCount=257, BackupCount=1, AssignedPartitions=16, BackupPartitions=16}
    2011-01-22 01:19:06,738 Coherence Logger@9216774 3.5.2/463 ERROR 2011-01-22 01:19:06.738/9910757.904 Oracle Coherence EE 3.5.2/463 <Error> (thread=main, member=33): Failed to restart services: java.lang.IllegalStateException: Failed to unregister: DistributedCache{Name=DistributedCache, State=(SERVICE_STARTED), LocalStorage=enabled, PartitionCount=257, BackupCount=1, AssignedPartitions=16, BackupPartitions=16}

    Hi
    It seems the problem in this case is the call to clear(), which tries to load all entries stored in the overflow scheme in order to emit potential cache events to listeners. This probably requires far more memory than the available Java heap, hence the OOM.
    Our recommendation in this case is to call destroy(), since this bypasses the event firing.
    /Charlie

  • Hyper-V Failover Cluster Node Corruption

    Dear All,
    Some of my nodes are behaving abnormally: they restart every now and then. I had updated the cluster nodes, but all updates were OS-specific; none of them was a hardware-related update.
    I have analyzed the crash dumps and found that the following is causing the crash:
    page_fault_in_nonpaged_area
    Does anyone have any idea about this?
    Thanks in advance.

    Hi ,
    What is the OS of the cluster node?
    Did you try removing the protection client for troubleshooting?
    If it is a 2008 R2 cluster, please refer to this thread:
    http://social.technet.microsoft.com/Forums/en-US/32ab6a85-6002-4c3c-97ea-27cb1091e9b3/windows-cluster-server-is-getting-restarted?forum=winservergen
    Hope it helps
    Best Regards
    Elton Ji
    We are trying to better understand customer views on social support experience, so your participation in this interview project would be greatly appreciated if you have time.
    Thanks for helping make community forums a great place.

  • Hyper-V Guest Cluster Node Failing Regularly

    Hi,
    We currently have a 4-node Server 2012 R2 cluster which hosts, among other things, a 3-node guest cluster running a single clustered file service.
    Around once a week, the guest cluster node that is currently hosting the clustered file service fails. It's as if the VM is blue screening. That in itself is fairly annoying, and I'll be doing all the updates and checking the event log for clues as to the cause.
    The problem then is that whichever physical cluster node is hosting the VM when it fails will not unlock some of the VM's files. The virtual machine configuration lists as Online Pending. This means that the failed VM cannot be restarted on any other cluster node. The only fix is to drain the physical host it failed on and reboot it.
    Looking for suggestions on how to fix the following:
    1. Crashing guest file cluster node
    2. Failed VM with shared VHDX requiring physical host reboot
    Event messages for the physical host that was hosting the failed VM, in the order they occurred:
    Hyper-V-Worker: Event ID 18590 - 'FS-03' has encountered a fatal error.  The guest operating system reported that it failed with the following error codes: ErrorCode0: 0x9E, ErrorCode1: 0x6C2A17C0, ErrorCode2: 0x3C, ErrorCode3: 0xA, ErrorCode4: 0x0.  If the problem persists, contact Product Support for the guest operating system.  (Virtual machine ID 36166B47-D003-4E51-AFB5-7B967A3EFD2D)
    FailoverClustering: Event ID 1069 - Cluster resource 'Virtual Machine FS-03' of type 'Virtual Machine' in clustered role 'FS-03' failed.
    Hyper-V-High-Availability: Event ID 21128 - 'Virtual Machine FS-03' failed to shutdown the virtual machine during the resource termination. The virtual machine will be forcefully stopped.
    Hyper-V-High-Availability: Event ID 21110 - 'Virtual Machine FS-03' failed to terminate.
    Hyper-V-VMMS: Event ID 20108 - The Virtual Machine Management Service failed to start the virtual machine '36166B47-D003-4E51-AFB5-7B967A3EFD2D': The group or resource is not in the correct state to perform the requested operation. (0x8007139F).
    Hyper-V-High-Availability: Event ID 21107 - 'Virtual Machine FS-03' failed to start.
    FailoverClustering: Event ID 1205 - The Cluster service failed to bring clustered role 'FS-03' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

    Hi,
    I haven't found a similar issue. Does your cluster pass cluster validation? Are all your Hyper-V hosts compatible with Server 2012 R2? Have you tried disabling all your antivirus software and the firewall? Please rerun the Storage validation on the cluster during non-production hours; the cluster validation report should help locate the issue quickly.
    More information:
    Cluster
    http://technet.microsoft.com/en-us/library/dd581778(v=ws.10).aspx
    Hope this helps.

  • Soft-restart of Java node by using command line utility

    Hello,
    Could anyone advise whether there is a way to soft-restart the Java node by using a command line utility (if there is one)?
    I would like to script this to run on Unix.
    Kind regards,
    Murad.

    Thank you for all your replies.
    Does Jcmon issue a soft restart?
    We have a problem with Veritas Cluster. When a failover occurs, the Java nodes appear to be online when we check from SMICM, but in fact they lose the connection to the central instance. We have to issue a soft restart for each Java node to re-establish the connection. It is a known bug that can only be fixed by using the replicated enqueue server, which is only available in an SP that we cannot apply right now. What I want to do is create a script that automates the soft restart and runs right after a failover.
    Thanks,
    Murad

  • Can I use cluster node for metadata controller?

    Just as the subject says: can I use a cluster node as a secondary metadata controller, or do I need to use an Xserve? We already have an Xserve we would use as the primary controller.

    I have done this: I attached the node to my iBook by FireWire, booted the node in target mode, and set the ServerHD as the startup disk for my iBook. Upon restart, my iBook was an OS X server.

  • Question about cluster node majority voting

    We've been having problems with a DB instance crashing regularly.  This weekend when it crashed, it seems to have taken the node it was on with it, or this was a separate incident...
    Right now I have 3 nodes in the cluster.  2 nodes are running 3 instances (2 on 1). The 3rd node is in a state where the OS is mostly unusable and the Cluster service will not start. 
    Event Log:
    "The failover cluster database could not be unloaded. If restarting the cluster service does not fix the problem, please restart the machine."
    Cluster Log from that machine:
    00003768.000067a0::2014/01/06-03:28:05.393 INFO  -----------------------------+ LOG BEGIN +-----------------------------
    00003768.000067a0::2014/01/06-03:28:05.393 INFO  [CS] Starting clussvc as a service
    00003768.000067a0::2014/01/06-03:28:05.394 INFO  [CS] cluster service logging level is 2
    00003768.00004c30::2014/01/06-03:28:05.521 DBG   [NETFTAPI] received NsiInitialNotification
    00003768.00004c30::2014/01/06-03:28:05.523 DBG   [NETFTAPI] received NsiInitialNotification
    00003768.000031f4::2014/01/06-03:28:05.588 DBG   [NETFTAPI] received NsiAddInstance  for 169.254.3.47
    00003768.00004eb4::2014/01/06-03:28:05.590 ERR   [DM] Error while restoring (refreshing) the hive: STATUS_INVALID_PARAMETER(c000000d
    00003768.00004eb4::2014/01/06-03:28:05.592 ERR   [DM] mscs::DmAgent::Start: STATUS_INVALID_PARAMETER(c000000d' because of 'Load(NOTHROW(), securityAttributes, discardError )'
    00003768.00004eb4::2014/01/06-03:28:05.592 ERR   [DM] Node 3: failed to unload cluster hive, error 87.
    00003768.00004eb4::2014/01/06-03:28:05.592 ERR   Hive unload failed (status = 87)
    00003768.00004eb4::2014/01/06-03:28:05.592 ERR   FatalError is Calling Exit Process.
    This is a 3 node cluster set to node majority, I don't have an available drive letter for a witness disk.  Since the cluster service won't start, I'm not certain how the cluster is still running, but am thankful that it is.
    A reboot might fix everything, but I'm very worried that if I reboot the server, and the cluster service still fails to start... it may prevent the entire cluster from starting and we won't be able to run the instances on the other 2 nodes.
    Does the 3rd server still act as an odd-number server, even if the cluster service won't start?  If I reboot and the cluster service still fails to start, will the cluster itself be able to be in an UP state and run the DB instances on the other nodes?
    I already need to open a MS Support incident on the DB instance crashing, so I'd rather not have to open a 2nd one just to answer this hopefully simple question.
    Thanks in advance!
    Mark

    I'll answer it here, since it matters fundamentally to SQL High Availability.
    There are a couple of entities you are conflating here, leading to much confusion.  There is a difference between the Cluster and the cluster service.
    The cluster service will run on a node once the Failover Cluster Feature is installed on that node.  The cluster service will run even if a cluster has not been created.  It may generate errors and not participate in a Cluster if it cannot talk to the other nodes, but it will not shut down.
    The Cluster itself requires a quorum, that is, a majority of votes, in order to operate.  With three nodes, you should choose the Node Majority quorum model, which sounds like what you have.  Any two votes will count, so the third node being offline does not matter.  You can safely restart the cluster service on the failed node, and even restart the node.  Note that with the third node down, you have no redundancy.  (Windows 2012 and 2012 R2 have dynamic quorum, which adjusts the quorum count based on the last "settled" quorum vote, but that doesn't apply here.)
    I am concerned with your statement that you are out of drive letters.  With three instances, you should have plenty of drive letters left.  I suggest investigating Mount Points.  You only need one drive letter per instance when using Mount Points.
    Geoff N. Hiten Principal Consultant Microsoft SQL Server MVP
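
The majority rule described above can be written down as a few lines of arithmetic. This is only a sketch of static Node Majority voting (one vote per node, no witness, no dynamic quorum); the function names `votes_needed` and `has_quorum` are made up for illustration and are not part of any Windows clustering API.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest number of votes that forms a strict majority of total_votes."""
    return total_votes // 2 + 1


def has_quorum(total_votes: int, online_votes: int) -> bool:
    """True if the voters that are currently online form a strict majority."""
    return online_votes >= votes_needed(total_votes)


if __name__ == "__main__":
    # Three-node cluster: walk through losing one node at a time.
    for online in range(3, -1, -1):
        print(f"3 voters, {online} online -> quorum: {has_quorum(3, online)}")
```

With three voters, two online votes are enough, which is why the two healthy nodes keep running while the third is down; losing one more node would drop the cluster below a majority.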

  • Cluster node down....

    Hello All,
    I set up an MSCS cluster in a NetWeaver system last week. Now suddenly one of my nodes is showing as failed. I have even applied note # 1043592, but it's still not working. I have even changed my startsapsrv script, but it's still not working.
    Can anyone help me with this?
    Also, let me know where I can download the latest ntclust.sar.
    Regards,
    Nirmal.K

    Hello Juan,
    I just followed note # 1043592, replaced the files saprc.dll and saprcex.dll, and restarted the cluster. But my Java node didn't restart and shows failed in its status. When I tried to bring the node online I got the following error:
    An error occurred trying to bring the SAP <SID> node online.
    The group or resource is not in the correct state to perform the requested operation.
    Error ID : 5023(0000139f)
    Also, let me know if you need any other log.
    Regards,
    Nirmal.K

  • Cluster node does not shutdown after "received shutdown"

    Hi,
    We put together an automated restart process that restarts cluster nodes across multiple servers. To shutdown a node, we use the Coherence MBeanConnector and invoke stop on object: name=Management,nodeId=<member id>. This works for most cases where member's log output shows "received shutdown", partition transfer messages and after the last primary partitions have been transferred the VM exits.
    For one node however, the VM did not exit. From looking at the log file for this particular node, the primary partitions were transferred, the distributedCache thread stops showing output, but the Cluster thread continues to show activity.
    Note that this node was the last VM to stop on the given server.
    Has anyone seen this before or ideas on why this particular node did not exit after receiving the shutdown message?
    Thanks!
    Marcel.

    Hi Marcel -
    Please take a thread dump (via "kill -3" or "ctrl-break") on the VM that does not stop correctly. Coherence does not shut the VM down; it simply shuts itself down. If a non-daemon thread is running on the VM, then it may not exit. However, we won't know that until we see the thread dump.
    Peace,
    Cameron Purdy | Oracle Coherence
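
Once a thread dump is available, the non-daemon threads are the ones worth looking at first. The sketch below assumes a HotSpot-style dump in which each thread's header line starts with the quoted thread name, optionally followed by the word `daemon`, then `prio=`; the helper name `non_daemon_threads` and the sample dump are made up for illustration, and other JVMs or newer header layouts may need a different pattern.

```python
import re

# Header line of one thread in a HotSpot-style dump, for example:
#   "main" prio=10 tid=0x... nid=0x... in Object.wait()
#   "Signal Dispatcher" daemon prio=10 tid=0x... nid=0x... runnable
# Some JVM versions put "#<n>" after the name; the pattern tolerates that form too.
THREAD_HEADER = re.compile(
    r'^"(?P<name>[^"]+)"(?:\s+#\d+)?(?P<daemon>\s+daemon)?\s+prio='
)


def non_daemon_threads(dump_text: str) -> list:
    """Return the names of all threads whose header lacks the 'daemon' marker."""
    names = []
    for line in dump_text.splitlines():
        match = THREAD_HEADER.match(line)
        if match and not match.group("daemon"):
            names.append(match.group("name"))
    return names


# Made-up sample input, not taken from the node discussed in this thread.
SAMPLE_DUMP = '''\
"Signal Dispatcher" daemon prio=10 tid=0x01 nid=0x10 runnable
"main" prio=10 tid=0x02 nid=0x11 in Object.wait()
    at java.lang.Object.wait(Native Method)
"worker-1" prio=10 tid=0x03 nid=0x12 waiting on condition
'''

if __name__ == "__main__":
    for name in non_daemon_threads(SAMPLE_DUMP):
        print("non-daemon thread:", name)
```

Any name this prints is a thread that can keep the VM alive after Coherence itself has shut down.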

  • Cluster node is hung but not killed

    Hello,
    one of two SC3.2 cluster nodes hung under heavy load, probably due to low memory.
    One of the visible symptoms was these error messages:
    Jan 29 17:48:56 node2 genunix: [ID 661778 kern.warning] WARNING: clcomm: memory low: freemem 0xff8
    Jan 29 17:49:15 node2 genunix: [ID 661778 kern.warning] WARNING: clcomm: memory low: freemem 0xff6
    The problem is that the node wasn't killed, and the whole cluster (it's an HA NFS Active/Passive configuration) became non-functional.
    What can be done to prevent such a situation?
    TIA,
    -- leon

    Those error messages seem to indicate that at least some part of the system was
    working enough to complain that more memory is needed.
    Can you tell us a bit more about the exact problem you were experiencing? You mention
    that there was heavy load, which indicates perhaps that there was lots of IO going on
    on the system? If so, that is opposite of "hung" which, to me, means that the system is
    not able to perform any useful work at all.
    Perhaps the system was merely very slow, because of lack of memory and very heavy
    load?
    It is possible that the lack of memory is not because of lots of load, but because of a
    bug in the system (a daemon which is leaking memory, perhaps?). However, in that
    case, doing a "prstat" should help you find if that is the case. Otherwise, start with
    memory analysis on your system and try to figure out what it is that is consuming memory.
    Assuming that, as you suggested, the problem is heavy load on the system, read on...
    You mention that HA-NFS is the application running on the system. If you want the system to
    fail over in cases of extreme load and slowness, you can configure HA-NFS to do so by
    reducing its timeouts etc. However, please realize that after the failover to another node
    (or restart on the local node), the client load would resume, the system would again become
    slow, and you haven't really achieved anything.
    You ask, "What can be done?" I would say adding more memory would be a start. But
    I personally suspect that you are running into a bug in the system somewhere which is
    causing this slowness. If so, let us start with figuring that out by looking closely at the
    system. Do a prstat, note the size of processes on the system and rule out any user level
    processes. Next, look at the filesystem in use by NFS, and make sure that is working
    fine (you are able to create files etc.), look at the CPU/disk usage to rule out "maxed out
    CPU or disk usage" as the cause of the slowness/almost_hungness....
    HTH,
    -ashu

  • Cluster node networking

    I have a five-node Windows Server 2008 R2 Hyper-V cluster. I put one node into maintenance mode and all VMs migrated to other hosts. I pulled the LAN cables out of that node for testing (one out, waited a little, put it back, pulled the second, and so on) and put them right back in.
    After that I had a lot of cluster errors and some VMs restarted.
    I have put nodes into maintenance mode and restarted or shut them down many times and never had any cluster problems. Why did I have them now, when I pulled out the LAN cables?

    Hi antesl,
    The failover behavior occurs because the cluster detected a failure of a cluster resource or node, such as a network or storage failure. Please refer to the following related articles to confirm that there is no potential single point of failure in your cluster configuration.
    Failover Cluster
    http://msdn.microsoft.com/en-us/library/ff650328.aspx
    Failover Cluster Step-by-Step Guide: Configuring the Quorum in a Failover Cluster
    http://technet.microsoft.com/zh-cn/library/cc770620(v=ws.10).aspx
    How a Server Cluster Works
    http://technet.microsoft.com/en-us/library/cc738051(v=ws.10).aspx
    HYPER-V 2008 R2 SP1 Best Practices (In Easy Checklist Form)
    http://blogs.technet.com/b/askpfeplat/archive/2012/11/19/hyper-v-2008-r2-sp1-best-practices-in-easy-checklist-form.aspx
    I’m glad to be of help to you!

  • Cluster node problems in communication channel

    Hello,
    I often have node problems where all messages get stuck in a particular cluster node of a communication channel.
    Once it is restarted, all messages are processed without problems. The following error was reported in connection with this: Error when getting an FTP connection from connection pool: com.sap.aii.af.service.util.concurrent.ResourcePoolException: Unable to create new pooled resource: ConnectException: Connection refused (errno:239)
    Please let me know the solution to avoid this in the future.
    Regards,
    Anandh

    Hello,
    Question 43 of the file adapter FAQ:
    Q: J2EE engine hangs with the File/FTP sender channels. How to resolve this?
    A: The reason for this is that sometimes, due to network-level issues, a message waits forever for a response from an FTP server that has been down for some time. The adapter does not know this and tries to poll the FTP server again with a second message. This goes on and on, and eventually the J2EE engine hangs. To solve this, the following things need to be applied:
        1) Set the FTP timeout to an appropriate value if the channel is an FTP sender channel.
        2) In the advanced mode table options, add a new parameter 'clusterSyncMode' and set its value to 'lock'. The parameter is entered without the single quotes and is case-sensitive.
        3) Last but not least, make sure that you are on the latest patches of SP19/SP20/SP21/SP22 for the SAPXIAFC component of the XI 3.0 release and SP10/SP11/SP12/SP13/SP14 for the SAPXIAFC component of the XI 7.0 release. Any patch released after 11 February 2008 for the above releases is fine.
    We had the same issue and this did the trick.
    Regards,
    Bhavesh
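
To see why the timeout in step 1 matters, here is a generic illustration outside of SAP XI: an FTP client with no timeout waits indefinitely on a server that accepts the connection but never answers, while a client with a timeout gives up and can report the failure. The sketch uses Python's standard `ftplib`; the function name `probe_ftp`, the default timeout, and the local "silent server" in the demo are assumptions made for illustration and have nothing to do with the adapter's own configuration.

```python
import ftplib
import socket


def probe_ftp(host: str, port: int = 21, timeout_s: float = 5.0) -> bool:
    """Return True if the server sends its FTP welcome banner within timeout_s."""
    ftp = ftplib.FTP()
    try:
        # connect() opens the socket and then waits for the welcome banner;
        # with a timeout set, a silent server raises instead of blocking forever.
        ftp.connect(host, port, timeout=timeout_s)
        return True
    except ftplib.all_errors:
        return False
    finally:
        ftp.close()


if __name__ == "__main__":
    # Demo: a local socket that accepts connections but never sends a banner,
    # standing in for an FTP server that is hanging.
    silent = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    silent.bind(("127.0.0.1", 0))
    silent.listen(1)
    port = silent.getsockname()[1]
    print("server responded:", probe_ftp("127.0.0.1", port, timeout_s=1.0))
    silent.close()
```

Without the `timeout` argument the call in the demo would never return, which is the same kind of pile-up of waiting requests that the FAQ answer describes.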
