Cluster Node Failure

Hi,
I am a Columbia University engineering graduate student doing research on Sun Clusters. I have two quick questions that I was unable to find answers to on the Sun Cluster documentation website. If anyone can help me answer them, I would really appreciate it. The questions are as follows.
1. If a node of a cluster fails, how are resources such as file locks or client sessions recovered or rebuilt, if at all? I understand that if a service fails while talking to a client, it gets restarted on the same node if that node is healthy, or on a backup node if it is not. But I am not clear on what happens to resources such as file locks that were originally owned by the failed node.
2. If a node of a cluster fails, how is the node's internal data recovered or rebuilt, if at all? I mean things like caches or internal data structures.
Please let me know.
Thanks,
Larry Chen

I assume by 3.0 you actually mean 3.x because 3.0 is very old technology now. We're currently on 3.2, which is the 7th or 8th release of the software. Furthermore, I don't think these are particularly simple things to give answers for, so I may have to refer you to other material for longer answers.
You may find most of the answers you want in the book I co-wrote with Richard Elling entitled "Designing Enterprise Solutions with Sun Cluster 3.0". Chapter 3 covers some of the stuff you're asking about.
From your question I'm assuming you're asking about NFS? If so, the NFS protocol together with statd and lockd co-ordinate client recovery after an NFS fail-over (although to the client, it just looks like the same server came back up quickly).
For objects that do have state in the kernel, e.g. writes to PxFS, global devices, etc, these are all handled by a highly available services framework in the kernel. They effectively use a two phase commit-like protocol to ensure that these operations can continue after failures.
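As a rough illustration of the two-phase commit idea mentioned above, here is a generic textbook sketch. It is not Sun Cluster's kernel framework: the `Replica` class, its methods, and the sample key and value are hypothetical names invented for this example. A coordinator first asks every replica to stage an update, and only if all of them agree does it tell them to commit, so a surviving replica never holds a half-applied operation.

```python
class Replica:
    """Hypothetical in-memory replica; not a Sun Cluster API."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.committed = {}   # state visible after a successful commit
        self.pending = None   # update staged during phase 1

    def prepare(self, key, value):
        # Phase 1: stage the update and vote yes, or vote no if unhealthy.
        if not self.healthy:
            return False
        self.pending = (key, value)
        return True

    def commit(self):
        # Phase 2 (all voted yes): make the staged update visible.
        if self.pending is not None:
            key, value = self.pending
            self.committed[key] = value
            self.pending = None

    def abort(self):
        # Phase 2 (someone voted no): discard the staged update.
        self.pending = None


def two_phase_write(replicas, key, value):
    """Apply the write on all replicas or on none of them."""
    votes = [r.prepare(key, value) for r in replicas]
    if all(votes):
        for r in replicas:
            r.commit()
        return True
    for r in replicas:
        r.abort()
    return False


primary, secondary = Replica("node1"), Replica("node2")
ok = two_phase_write([primary, secondary], "lock:/export/home/f", "client-42")
print(ok, secondary.committed)
```

Because the secondary holds the same committed state as the primary, it can carry on serving that state if the primary disappears.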
Hope that helps (somewhat),
Tim
---

Similar Messages

Error: Halting this cluster node due to unrecoverable service failure

Our cluster has experienced some sort of fault that has only become apparent today. The origin appears to have been nearly a month ago yet the symptoms have only just manifested.
The node in question is a standalone instance running a DistributedCache service with local storage. It output the following to stdout on Jan-22:
Coherence <Error>: Halting this cluster node due to unrecoverable service failure
It finally failed today with OutOfMemoryError: Java heap space.
We're running coherence-3.5.2.jar.
Q1: It looks like this node failed on Jan-22 yet we did not notice. What is the best way to monitor node health?
Q2: What might the root cause be for such a fault?
I found the following in the logs:
2011-01-22 01:18:58,296 Coherence Logger@9216774 3.5.2/463 ERROR 2011-01-22 01:18:58.296/9910749.462 Oracle Coherence EE 3.5.2/463 <Error> (thread=Cluster, member=33): Attempting recovery (due to soft timeout) of Guard{Daemon=DistributedCache}
2011-01-22 01:19:04,772 Coherence Logger@9216774 3.5.2/463 ERROR 2011-01-22 01:19:04.772/9910755.938 Oracle Coherence EE 3.5.2/463 <Error> (thread=Cluster, member=33): Terminating guarded execution (due to hard timeout) of Guard{Daemon=DistributedCache}
2011-01-22 01:19:05,785 Coherence Logger@9216774 3.5.2/463 ERROR 2011-01-22 01:19:05.785/9910756.951 Oracle Coherence EE 3.5.2/463 <Error> (thread=Termination Thread, member=33): Full Thread Dump
Thread[Reference Handler,10,system]
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
Thread[DistributedCache,5,Cluster]
java.nio.Bits.copyToByteArray(Native Method)
java.nio.DirectByteBuffer.get(DirectByteBuffer.java:224)
com.tangosol.io.nio.ByteBufferInputStream.read(ByteBufferInputStream.java:123)
java.io.DataInputStream.readFully(DataInputStream.java:178)
java.io.DataInputStream.readFully(DataInputStream.java:152)
com.tangosol.util.Binary.readExternal(Binary.java:1066)
com.tangosol.util.Binary.<init>(Binary.java:183)
com.tangosol.io.nio.BinaryMap$Block.readValue(BinaryMap.java:4304)
com.tangosol.io.nio.BinaryMap$Block.getValue(BinaryMap.java:4130)
com.tangosol.io.nio.BinaryMap.get(BinaryMap.java:377)
com.tangosol.io.nio.BinaryMapStore.load(BinaryMapStore.java:64)
com.tangosol.net.cache.SerializationPagedCache$WrapperBinaryStore.load(SerializationPagedCache.java:1547)
com.tangosol.net.cache.SerializationPagedCache$PagedBinaryStore.load(SerializationPagedCache.java:1097)
com.tangosol.net.cache.SerializationMap.get(SerializationMap.java:121)
com.tangosol.net.cache.SerializationPagedCache.get(SerializationPagedCache.java:247)
com.tangosol.net.cache.AbstractSerializationCache$1.getOldValue(AbstractSerializationCache.java:315)
com.tangosol.net.cache.OverflowMap$Status.registerBackEvent(OverflowMap.java:4210)
com.tangosol.net.cache.OverflowMap.onBackEvent(OverflowMap.java:2316)
com.tangosol.net.cache.OverflowMap$BackMapListener.onMapEvent(OverflowMap.java:4544)
com.tangosol.util.MultiplexingMapListener.entryDeleted(MultiplexingMapListener.java:49)
com.tangosol.util.MapEvent.dispatch(MapEvent.java:214)
com.tangosol.util.MapEvent.dispatch(MapEvent.java:166)
com.tangosol.util.MapListenerSupport.fireEvent(MapListenerSupport.java:556)
com.tangosol.net.cache.AbstractSerializationCache.dispatchEvent(AbstractSerializationCache.java:338)
com.tangosol.net.cache.AbstractSerializationCache.dispatchPendingEvent(AbstractSerializationCache.java:321)
com.tangosol.net.cache.AbstractSerializationCache.removeBlind(AbstractSerializationCache.java:155)
com.tangosol.net.cache.SerializationPagedCache.removeBlind(SerializationPagedCache.java:348)
com.tangosol.util.AbstractKeyBasedMap$KeySet.remove(AbstractKeyBasedMap.java:556)
com.tangosol.net.cache.OverflowMap.removeInternal(OverflowMap.java:1299)
com.tangosol.net.cache.OverflowMap.remove(OverflowMap.java:380)
com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$Storage.clear(DistributedCache.CDB:24)
com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache.onClearRequest(DistributedCache.CDB:32)
com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$ClearRequest.run(DistributedCache.CDB:1)
com.tangosol.coherence.component.net.message.requestMessage.DistributedCacheRequest.onReceived(DistributedCacheRequest.CDB:12)
com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onMessage(Grid.CDB:9)
com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onNotify(Grid.CDB:136)
com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache.onNotify(DistributedCache.CDB:3)
com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
java.lang.Thread.run(Thread.java:619)
Thread[Finalizer,8,system]
java.lang.Object.wait(Native Method)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
Thread[PacketReceiver,7,Cluster]
java.lang.Object.wait(Native Method)
com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketReceiver.onWait(PacketReceiver.CDB:2)
com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
java.lang.Thread.run(Thread.java:619)
Thread[RMI TCP Accept-0,5,system]
java.net.PlainSocketImpl.socketAccept(Native Method)
java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
java.net.ServerSocket.implAccept(ServerSocket.java:453)
java.net.ServerSocket.accept(ServerSocket.java:421)
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)
java.lang.Thread.run(Thread.java:619)
Thread[PacketSpeaker,8,Cluster]
java.lang.Object.wait(Native Method)
com.tangosol.coherence.component.util.queue.ConcurrentQueue.waitForEntry(ConcurrentQueue.CDB:16)
com.tangosol.coherence.component.util.queue.ConcurrentQueue.remove(ConcurrentQueue.CDB:7)
com.tangosol.coherence.component.util.Queue.remove(Queue.CDB:1)
com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketSpeaker.onNotify(PacketSpeaker.CDB:62)
com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
java.lang.Thread.run(Thread.java:619)
Thread[Logger@9216774 3.5.2/463,3,main]
java.lang.Object.wait(Native Method)
com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
java.lang.Thread.run(Thread.java:619)
Thread[PacketListener1,8,Cluster]
java.net.PlainDatagramSocketImpl.receive0(Native Method)
java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:136)
java.net.DatagramSocket.receive(DatagramSocket.java:712)
com.tangosol.coherence.component.net.socket.UdpSocket.receive(UdpSocket.CDB:20)
com.tangosol.coherence.component.net.UdpPacket.receive(UdpPacket.CDB:4)
com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketListener.onNotify(PacketListener.CDB:19)
com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
java.lang.Thread.run(Thread.java:619)
Thread[main,5,main]
java.lang.Object.wait(Native Method)
com.tangosol.net.DefaultCacheServer.main(DefaultCacheServer.java:79)
com.networkfleet.cacheserver.Launcher.main(Launcher.java:122)
Thread[Signal Dispatcher,9,system]
Thread[RMI TCP Accept-41006,5,system]
java.net.PlainSocketImpl.socketAccept(Native Method)
java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
java.net.ServerSocket.implAccept(ServerSocket.java:453)
java.net.ServerSocket.accept(ServerSocket.java:421)
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)
java.lang.Thread.run(Thread.java:619)
ThreadCluster
java.lang.Object.wait(Native Method)
com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onWait(Grid.CDB:9)
com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
java.lang.Thread.run(Thread.java:619)
Thread[TcpRingListener,6,Cluster]
java.net.PlainSocketImpl.socketAccept(Native Method)
java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
java.net.ServerSocket.implAccept(ServerSocket.java:453)
java.net.ServerSocket.accept(ServerSocket.java:421)
com.tangosol.coherence.component.net.socket.TcpSocketAccepter.accept(TcpSocketAccepter.CDB:18)
com.tangosol.coherence.component.util.daemon.TcpRingListener.acceptConnection(TcpRingListener.CDB:10)
com.tangosol.coherence.component.util.daemon.TcpRingListener.onNotify(TcpRingListener.CDB:9)
com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
java.lang.Thread.run(Thread.java:619)
Thread[PacketPublisher,6,Cluster]
java.lang.Object.wait(Native Method)
com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketPublisher.onWait(PacketPublisher.CDB:2)
com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
java.lang.Thread.run(Thread.java:619)
Thread[RMI TCP Accept-0,5,system]
java.net.PlainSocketImpl.socketAccept(Native Method)
java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
java.net.ServerSocket.implAccept(ServerSocket.java:453)
java.net.ServerSocket.accept(ServerSocket.java:421)
sun.management.jmxremote.LocalRMIServerSocketFactory$1.accept(LocalRMIServerSocketFactory.java:34)
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)
java.lang.Thread.run(Thread.java:619)
Thread[PacketListenerN,8,Cluster]
java.net.PlainDatagramSocketImpl.receive0(Native Method)
java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:136)
java.net.DatagramSocket.receive(DatagramSocket.java:712)
com.tangosol.coherence.component.net.socket.UdpSocket.receive(UdpSocket.CDB:20)
com.tangosol.coherence.component.net.UdpPacket.receive(UdpPacket.CDB:4)
com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketListener.onNotify(PacketListener.CDB:19)
com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
java.lang.Thread.run(Thread.java:619)
Thread[Invocation:Management,5,Cluster]
java.lang.Object.wait(Native Method)
com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onWait(Grid.CDB:9)
com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
java.lang.Thread.run(Thread.java:619)
Thread[DistributedCache:PofDistributedCache,5,Cluster]
java.lang.Object.wait(Native Method)
com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onWait(Grid.CDB:9)
com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
java.lang.Thread.run(Thread.java:619)
Thread[Invocation:Management:EventDispatcher,5,Cluster]
java.lang.Object.wait(Native Method)
com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
com.tangosol.coherence.component.util.daemon.queueProcessor.Service$EventDispatcher.onWait(Service.CDB:7)
com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
java.lang.Thread.run(Thread.java:619)
Thread[Termination Thread,5,Cluster]
java.lang.Thread.dumpThreads(Native Method)
java.lang.Thread.getAllStackTraces(Thread.java:1487)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:597)
com.tangosol.net.GuardSupport.logStackTraces(GuardSupport.java:791)
com.tangosol.coherence.component.net.Cluster.onServiceFailed(Cluster.CDB:5)
com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid$Guard.terminate(Grid.CDB:17)
com.tangosol.net.GuardSupport$2.run(GuardSupport.java:652)
java.lang.Thread.run(Thread.java:619)
2011-01-22 01:19:06,738 Coherence Logger@9216774 3.5.2/463 INFO 2011-01-22 01:19:06.738/9910757.904 Oracle Coherence EE 3.5.2/463 <Info> (thread=main, member=33): Restarting Service: DistributedCache
2011-01-22 01:19:06,738 Coherence Logger@9216774 3.5.2/463 ERROR 2011-01-22 01:19:06.738/9910757.904 Oracle Coherence EE 3.5.2/463 <Error> (thread=main, member=33): Failed to restart services: java.lang.IllegalStateException: Failed to unregister: DistributedCache{Name=DistributedCache, State=(SERVICE_STARTED), LocalStorage=enabled, PartitionCount=257, BackupCount=1, AssignedPartitions=16, BackupPartitions=16}

Hi
It seems the problem in this case is the call to clear(), which tries to load every entry stored in the overflow scheme in order to emit potential cache events to listeners. This probably requires much more memory than the available Java heap, hence the OOM.
Our recommendation in this case is to call destroy() instead, since this bypasses the event firing.
/Charlie
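
On Q1 (noticing the failure sooner), one low-tech option is to watch the node's output for the messages quoted in this thread. The following is only a sketch: the alert phrases and the `member=` pattern are taken from the log lines above, while the function name and the idea of running it periodically against the log file are assumptions, not a Coherence feature.

```python
import re

# Phrases copied from the messages quoted in this thread.
ALERT_PHRASES = (
    "soft timeout",
    "hard timeout",
    "Halting this cluster node",
    "Failed to restart services",
)

# "member=33" appears in the timestamped log lines; the plain stdout
# message has no member id, so the match is optional.
MEMBER_RE = re.compile(r"member=(\d+)")


def find_alerts(lines):
    """Return (member_id_or_None, line) for every line worth alerting on."""
    alerts = []
    for line in lines:
        if not any(phrase in line for phrase in ALERT_PHRASES):
            continue  # stack-trace frames and routine messages
        match = MEMBER_RE.search(line)
        member = match.group(1) if match else None
        alerts.append((member, line.strip()))
    return alerts


sample = [
    "2011-01-22 01:18:58,296 Coherence Logger@9216774 3.5.2/463 ERROR "
    "2011-01-22 01:18:58.296/9910749.462 Oracle Coherence EE 3.5.2/463 <Error> "
    "(thread=Cluster, member=33): Attempting recovery (due to soft timeout) "
    "of Guard{Daemon=DistributedCache}",
    "java.lang.Thread.run(Thread.java:619)",
    "2011-01-22 01:19:06,738 Coherence Logger@9216774 3.5.2/463 INFO "
    "2011-01-22 01:19:06.738/9910757.904 Oracle Coherence EE 3.5.2/463 <Info> "
    "(thread=main, member=33): Restarting Service: DistributedCache",
    "Coherence <Error>: Halting this cluster node due to unrecoverable service failure",
]
alerts = find_alerts(sample)
for member, text in alerts:
    print(member, text[:60])
```

A check like this would have flagged the soft timeout on Jan-22 rather than a month later, provided something actually reads its output.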

OrainstRoot.sh: Failure to promote local gpnp setup to other cluster nodes

I'm trying to build a 2-node cluster, and everything appeared to be going swimmingly until the end of the first node's run of the orainstRoot.sh script.
The following is the end of the output:
Disk Group OCR_VOTE created successfully.
clscfg: -install mode specified
Successfully accumulated necessary OCR keys.
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
CRS-4256: Updating the profile
Successful addition of voting disk 4e3f692529584f8bbf7f16146bd90346.
Successful addition of voting disk 728bed918cf54f6cbf904d37638c674b.
Successful addition of voting disk 8ac20793405d4fdcbfcafc7e311f877d.
Successfully replaced voting disk group with +OCR_VOTE.
CRS-4256: Updating the profile
CRS-4266: Voting file(s) successfully replaced
## STATE File Universal Id File Name Disk group
1. ONLINE 4e3f692529584f8bbf7f16146bd90346 (ORCL:VOTE01) [OCR_VOTE]
2. ONLINE 728bed918cf54f6cbf904d37638c674b (ORCL:VOTE02) [OCR_VOTE]
3. ONLINE 8ac20793405d4fdcbfcafc7e311f877d (ORCL:VOTE03) [OCR_VOTE]
Located 3 voting disk(s).
Failed to rmtcopy "/tmp/fileLgKPGV" to "/u01/app/11.2.0/grid/gpnp/manifest.txt" for nodes {ilprevzedb01,ilprevzedb02}, rc=256
Failed to rmtcopy "/u01/app/11.2.0/grid/gpnp/ilprevzedb01/profiles/peer/profile.xml" to "/u01/app/11.2.0/grid/gpnp/profiles/peer/profile.xml" for nodes {ilprevzedb01,ilprevzedb02}, rc=256
rmtcopy aborted
Failed to promote local gpnp setup to other cluster nodes at /u01/app/11.2.0/grid/crs/install/crsconfig_lib.pm line 6504.
/u01/app/11.2.0/grid/perl/bin/perl -I/u01/app/11.2.0/grid/perl/lib -I/u01/app/11.2.0/grid/crs/install /u01/app/11.2.0/grid/crs/install/rootcrs.pl execution failed
Has anyone run into this problem and found a solution?
Thanks in advance!

Ok, for everyone out there, I resolved the issue. Hopefully this will help others encountering the same problem.
It turns out that when the OS was installed, the iptables firewall was enabled. This causes havoc with the installer scripts.
My first inkling should have been when the installer stalled at 65% trying to copy home directories between nodes, the first time I ran through the installer.
At that time, Googling around found that iptables might be the problem and indeed it was running, so I just did a 'service iptables stop' WITHOUT REBOOTING THE NODES and re-ran the installer.
Well, it looks as though NOT REBOOTING THE NODES doesn't quite cut it. I then did a 'chkconfig iptables off' and REBOOTED BOTH NODES.
Oracle support simply provided me with: How to Proceed from Failed 11gR2 Grid Infrastructure (CRS) Installation (Doc ID 942166.1), which didn't really work all that well, lots of failures, errors, etc. So I just deleted the 11.2.0 directory and tried running the installer again.
This time the install went through without problems.
Thanks!

Question about cluster node NodeWeight property

Hi,
I have a three-node (A/B/C) Windows Server 2008 R2 SP1 cluster, testCluster, with KB2494036 installed on all three nodes. Suppose node A is the active node.
I set node C's NodeWeight property to 0 and left node A and node B at the default (NodeWeight=1). I also added a shared disk Q for the cluster quorum.
What I want to know is: if node C and node B are down, does testCluster go down because of lost quorum, or does it stay up?
At first I thought testCluster should stay up, because the cluster would still have 2 votes (node A and the quorum disk): node B is down, and node C does not take part in voting. But in testing, testCluster went down because of lost quorum.
Does anybody know the reason? Thanks.

Hello mark.gao,
Let me see if I understand your steps correctly. If you created your cluster with three nodes, your quorum model at the beginning should have been "Node Majority", with three votes, one per node.
Then the vote for node "C" was removed and a disk was added as a witness for the cluster quorum. At that point we have two of the three votes from the original "Node Majority" configuration.
Question:
At some point, did you change the quorum model to "Node and Disk Majority"?
Maybe this is the issue: if you are still on "Node Majority", then when nodes "B" and "C" are down there is only one vote, from node "A", so there is no quorum to keep the service online.
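To make the vote counting concrete, here is a small sketch of static majority arithmetic. It assumes quorum requires strictly more than half of all configured votes, and the function and parameter names are made up for illustration; they are not cluster APIs.

```python
def has_quorum(node_weights, nodes_up, disk_witness=False, disk_online=True):
    """Static majority: online votes must exceed half of all configured votes.

    node_weights maps node name -> NodeWeight (0 or 1).
    disk_witness=True models "Node and Disk Majority";
    disk_witness=False models "Node Majority".
    """
    total = sum(node_weights.values()) + (1 if disk_witness else 0)
    online = sum(w for name, w in node_weights.items() if name in nodes_up)
    if disk_witness and disk_online:
        online += 1
    return online > total / 2


weights = {"A": 1, "B": 1, "C": 0}   # node C's vote has been removed

# Only node A is still up.
node_majority = has_quorum(weights, {"A"})                      # 1 of 2 votes
node_and_disk = has_quorum(weights, {"A"}, disk_witness=True)   # 2 of 3 votes
print(node_majority, node_and_disk)
```

Under these assumptions, node A alone keeps quorum only when the disk counts as a vote, which is consistent with the cluster going down if it is still on "Node Majority".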
In Windows Server 2012 we have the awesome option to configure a Dynamic Quorum:
Dynamic quorum management
In Windows Server 2012, as an advanced quorum configuration option, you can choose to enable dynamic quorum management by cluster. When this option is enabled, the cluster dynamically manages the vote assignment to nodes, based on the state of each node. Votes are automatically removed from nodes that leave active cluster membership, and a vote is automatically assigned when a node rejoins the cluster. By default, dynamic quorum management is enabled.
Note
With dynamic quorum management, the cluster quorum majority is determined by the set of nodes that are active members of the cluster at any time. This is an important distinction from the cluster quorum in Windows Server 2008 R2, where the quorum majority is fixed, based on the initial cluster configuration.
With dynamic quorum management, it is also possible for a cluster to run on the last surviving cluster node. By dynamically adjusting the quorum majority requirement, the cluster can sustain sequential node shutdowns to a single node.
The cluster-assigned dynamic vote of a node can be verified with the DynamicWeight common property of the cluster node by using the Get-ClusterNode Windows PowerShell cmdlet. A value of 0 indicates that the node does not have a quorum vote. A value of 1 indicates that the node has a quorum vote.
The vote assignment for all cluster nodes can be verified by using the Validate Cluster Quorum validation test.
Additional considerations
Dynamic quorum management does not allow the cluster to sustain a simultaneous failure of a majority of voting members. To continue running, the cluster must always have a quorum majority at the time of a node shutdown or failure.
If you have explicitly removed the vote of a node, the cluster cannot dynamically add or remove that vote.
Configure and Manage the Quorum in a Windows Server 2012 Failover Cluster
https://technet.microsoft.com/en-us/library/jj612870.aspx#BKMK_dynamic
Hope this info helps you reach your goal. :D
5ALU2 !

Local NFS / LDAP on cluster nodes

Hi,
I have a 2-node cluster (3.2 1/09) on Solaris 10 U8, providing NFS (/home) and LDAP for clients. I would like to configure LDAP and NFS clients on each cluster node, so they share user information with the rest of the machines.
I assume the right way to do this is to configure the cluster nodes the same as other clients, using the HA Logical Hostnames for the LDAP and NFS server; this way, there's always a working LDAP and NFS server for each node. However, what happens if both nodes reboot at once (for example, power failure)? As the first node boots, there is no working LDAP or NFS server, because it hasn't been started yet. Will this cause the boot to fail and require manual intervention, or will the cluster boot without NFS and LDAP clients enabled, allowing me to fix it later?

Thanks. In that case, is it safe to configure the NFS-exported filesystem as a global mount, and symlink e.g. "/home" -> "/global/home", so home directories are accessible via the normal path on both nodes? (I understand global filesystems have worse performance, but this would just be for administrators logging in with their LDAP accounts.)
For LDAP, my concern is that if svc:/network/ldap/client:default fails during startup (because no LDAP server is running yet), it might prevent the cluster services from starting, even though all names required by cluster are available from /etc.

MDM Cluster Node 2 rebuild

Hi ,
We are using SAP MDM 5.5 application installed in Microsoft Cluster.
Unfortunately one of our cluster nodes went down, and according to the System Management team we have to rebuild node 2 from scratch.
While looking for a resolution I found the Microsoft link below, which describes a similar situation and its resolution.
http://technet.microsoft.com/en-us/library/cc786625(v=ws.10).aspx
Scenario 6: Single Cluster Node Corruption or Failure.
While the System Management team is working on this, I want to check what other options we have, and what the process will be if we have to rebuild the server from scratch.
I am assuming the process below.
1.     The Windows team rebuilds the server (OS and cluster configuration).
2.     We install Oracle DB and the MDM application from the installation media.
3.     We add node 2 to the existing cluster configuration (on node 1).
But I am not sure about this process and have some doubts. For example, on node 2 do we have to perform a fresh installation of the apps and DB, as we did when installing the cluster the first time, or is there a different process in this case, since the apps and DB are working fine on node 1?
Please help me if anyone has faced this kind of issue.
Thanks and Regards
Alok
Edited by: Alok Jain on Mar 6, 2012 7:47 AM

Hi buddy,
What a pity!!! :(
I wish you the best with this recovery!!!
About your questions:
Am I being too paranoid with this and wasting too much time on a mock environment while running on risky hardware? I don't think so. As you've never done it before, I think it's safer to test it first. It can become worse if you do the wrong thing :)
Is the recovery of this node really as straightforward as it seems: delete the node, add the node back? Yes. As you have to rebuild the node, you'll have to rebuild CRS too. You have to remove the node and add it again. Don't forget about the instance, listeners, services, etc. The procedure in the documentation is really clean.
Can I add the node back as the same named node or will the cluster freak out due to some lingering previous config? You can add the node back as the same named node.
Are there any other "gotchas" I may not be thinking about that some of you may have experienced? As you said, this is a very crucial component of your production system. If I were you, I would work with Oracle support instead of doing everything by myself.
Good Luck!
Cerreia

Xserve Cluster node does not (really) switch on

Hi.
I've got a pretty strange problem.
Today, after a long break, I have tried to switch on my Xserve G5 cluster node. The yellow ID light was constantly on, the power light was white, machine was powered on, and nothing happened. It did not boot, did not fail -- just sat there with both lights on.
Then I tried to get into OF (Open Firmware) using the front-panel lights method, but it didn't work. No matter how long I held the status button after power-on, nothing happened: none of the activity lights came on or flashed.
Next I tried resetting the PMU.
After that the situation is exactly the same, except that the power light is no longer on, although the machine starts up and the status light is on. I am still not able to enter any OF commands using the front panel.
All in all, it looks like the bootstrap processor is on, but none of the G5 processors is coming up (for example, even if I take out all of the memory, nothing changes and no failures are indicated).
Any ideas?

Try this:
1 - Remove the battery from the machine for 24 hours. Replace.
2 - Reset the PMU.
3 - Try to boot.
If you still have a problem, disconnect the firewire cable from the front panel. Try to boot.
If you STILL have a problem, you're at the hardware change stage. It could be a number of problems, but it's likely either the main logic board, the power supply, or the front panel board.

Cluster node networking

I have a five-node Windows Server 2008 R2 Hyper-V cluster. I put one node into maintenance mode and all VMs migrated to other hosts. I pulled the LAN cables from that node for testing (one out, waited a little, put it back, then pulled the second, and so on) and put them right back in.
After that I had a lot of cluster errors and some VMs restarted.
I have put nodes into maintenance mode and restarted or shut them down many times and never had any cluster problems. Why did I have problems this time, when I pulled the LAN cables?

Hi antesl,
The failover behavior occurs because the cluster detected a failure of a cluster resource or node, such as a network or storage failure. Please refer to the following related articles to confirm that there is no potential single point of failure in your cluster configuration.
Failover Cluster
http://msdn.microsoft.com/en-us/library/ff650328.aspx
Failover Cluster Step-by-Step Guide: Configuring the Quorum in a Failover Cluster
http://technet.microsoft.com/zh-cn/library/cc770620(v=ws.10).aspx
How a Server Cluster Works
http://technet.microsoft.com/en-us/library/cc738051(v=ws.10).aspx
HYPER-V 2008 R2 SP1 Best Practices (In Easy Checklist Form)
http://blogs.technet.com/b/askpfeplat/archive/2012/11/19/hyper-v-2008-r2-sp1-best-practices-in-easy-checklist-form.aspx
I’m glad to be of help to you!

Cluster node addition fails on cleanup

We already have a 2-node cluster set up:
(2) HP BL460c G8 servers connected to a VNX5300 SAN (nodes 1 & 2)
Server 2012 Datacenter installed
Quorum: Node + Disk
All failover tests went perfectly and all VMs are healthy
Verification on the cluster shows some warnings but no failures
We have rebuilt a server (node 3), renamed it, and run a single-machine verification test to see if it is suitable for clustering. It succeeded with minor warnings.
We ran verification on all three machines and received the aforementioned warnings but no showstoppers. However, when trying to add the host to the cluster we get the following error in the logs:
WARN mscs::ListenerWorker::operator (): ERROR_TIMEOUT(1460)' because of '[FTI][Initiator] Aborting connection because NetFT route to node <machine name> on virtual IP fe80::cdf2:f6ea:5ce:5f9c:~3343~ has failed to come up.'
This happens after the node is added to the cluster: the cleanup process reports a failure and everything is reverted. I have done all of this under my domain_admin account.
Before and after the attempt to add the node, the NetFT adapter is in the media-disconnected state. During the attempts it does pull down a 169 address as it is supposed to.
Node 3 networking breakdown
The new host uses an Intel/HP NC365T quad-port adapter
port 1: Mgmt: Static assignment subnet 1
port 2: VM net: Static assignment subnet 2
port 3: Heartbeat: assigned via DHCP subnet 1 pool (we have attempted the above with this disabled as well)
NCU is not installed for the adapter and bridging in Server 2012 is not enabled.
I am at a loss and would appreciate any additional help, as I have spent 3 days researching this to try to find the cause.

Hi,
The error message mentions an IPv6 address. Have you enabled IPv6 networking for the cluster?
Check the IPv6 network configuration on the third node: is it enabled or disabled?
When two or more cluster nodes are running IPv6 for heartbeat communications, any additional nodes that join must also be running IPv6. If the node has IPv6 disabled, it will fail to join.
Also check whether the cluster nodes have antivirus software installed; you may temporarily disable it and try to join the new node again.
Check that and give us feedback for further troubleshooting. For more information, please refer to the following Microsoft articles:
Failover Cluster Creation Issue
http://social.technet.microsoft.com/Forums/en-US/winserverClustering/thread/1ed1936d-6283-46cc-951d-9c236329b8be
Failure to re-add rebuilt cluster node to Windows 2008 R2 Cluster: System error 1460 has occurred (0x000005b4). Timeout.
http://social.technet.microsoft.com/Forums/en-US/winserverClustering/thread/a21e9a8e-9f68-4d83-a747-204000cda65a
Hope this helps!
Lawrence
TechNet Community Support

Cluster Node paused

Hi there
My Setup:
2 Cluster Nodes (HP DL380 G7 & HP DL380 Gen8)
HP P2000 G3 FC MSA (MPIO)
The Gen8 cluster node pauses after a few minutes, but stays online if the G7 is paused (no drain). My troubleshooting has led me to believe that there is a problem with the Cluster Shared Volume:
00001508.000010b4::2015/02/19-14:51:14.189 INFO [RES] Network Name: Agent: Sending request Netname/RecheckConfig to NN:cf2dec1d-ee88-4fb6-a86d-0c2d1aa888b4:Netbios
00000d1c.0000299c::2015/02/19-14:51:14.615 INFO [API] s_ApiGetQuorumResource final status 0.
00000d1c.0000299c::2015/02/19-14:51:14.616 INFO [RCM [RES] Virtual Machine VirtualMachine1 embedded failure notification, code=0 _isEmbeddedFailure=false _embeddedFailureAction=2
00001508.000010b4::2015/02/19-14:51:15.010 INFO [RES] Network Name <Cluster Name>: Getting Read only private properties
00000d1c.00002294::2015/02/19-14:51:15.096 INFO [API] s_ApiGetQuorumResource final status 0.
00000d1c.00002294::2015/02/19-14:51:15.121 INFO [API] s_ApiGetQuorumResource final status 0.
000014a8.000024f4::2015/02/19-14:51:15.269 INFO [RES] Physical Disk <Quorum>: VolumeIsNtfs: Volume
\\?\GLOBALROOT\Device\Harddisk1\ClusterPartition2\ has FS type NTFS
00000d1c.00002294::2015/02/19-14:51:15.343 WARN [RCM] ResourceTypeChaseTheOwnerLoop::DoCall: ResType MSMQ's DLL is not present on this node. Attempting to find a good node...
00000d1c.00002294::2015/02/19-14:51:15.352 WARN [RCM] ResourceTypeChaseTheOwnerLoop::DoCall: ResType MSMQTriggers's DLL is not present on this node. Attempting to find a good node...
000014a8.000024f4::2015/02/19-14:51:15.386 INFO [RES] Physical Disk: HardDiskpQueryDiskFromStm: ClusterStmFindDisk returned device='\\?\mpio#disk&ven_hp&prod_p2000_g3_fc&rev_t250#1&7f6ac24&0&36304346463030314145374646423434393243353331303030#{53f56307-b6bf-11d0-94f2-00a0c91efb8b}'
000014a8.000024f4::2015/02/19-14:51:15.386 ERR   [RES] Physical Disk: HardDiskpGetDiskInfo: GetVolumeInformation failed for
\\?\GLOBALROOT\Device\Harddisk3\ClusterPartition2\, status 3
000014a8.000024f4::2015/02/19-14:51:15.386 ERR   [RES] Physical Disk: HardDiskpGetDiskInfo: failed to get partition size for
\\?\GLOBALROOT\Device\Harddisk3\ClusterPartition2\, status 3
00000d1c.00001420::2015/02/19-14:51:15.847 WARN [RCM] ResourceTypeChaseTheOwnerLoop::DoCall: ResType MSMQ's DLL is not present on this node. Attempting to find a good node...
00000d1c.00001420::2015/02/19-14:51:15.855 WARN [RCM] ResourceTypeChaseTheOwnerLoop::DoCall: ResType MSMQTriggers's DLL is not present on this node. Attempting to find a good node...
000014a8.000024f4::2015/02/19-14:51:15.887 INFO [RES] Physical Disk: HardDiskpQueryDiskFromStm: ClusterStmFindDisk returned device='\\?\mpio#disk&ven_hp&prod_p2000_g3_fc&rev_t250#1&7f6ac24&0&36304346463030314145374646423434393243353331303030#{53f56307-b6bf-11d0-94f2-00a0c91efb8b}'
000014a8.000024f4::2015/02/19-14:51:15.888 ERR   [RES] Physical Disk: HardDiskpGetDiskInfo: GetVolumeInformation failed for
\\?\GLOBALROOT\Device\Harddisk3\ClusterPartition2\, status 3
000014a8.000024f4::2015/02/19-14:51:15.888 ERR   [RES] Physical Disk: HardDiskpGetDiskInfo: failed to get partition size for
\\?\GLOBALROOT\Device\Harddisk3\ClusterPartition2\, status 3
00000d1c.00001420::2015/02/19-14:51:15.928 WARN [RCM] ResourceTypeChaseTheOwnerLoop::DoCall: ResType MSMQ's DLL is not present on this node. Attempting to find a good node...
00000d1c.00001420::2015/02/19-14:51:15.939 WARN [RCM] ResourceTypeChaseTheOwnerLoop::DoCall: ResType MSMQTriggers's DLL is not present on this node. Attempting to find a good node...
000014a8.000024f4::2015/02/19-14:51:15.968 INFO [RES] Physical Disk: HardDiskpQueryDiskFromStm: ClusterStmFindDisk returned device='\\?\mpio#disk&ven_hp&prod_p2000_g3_fc&rev_t250#1&7f6ac24&0&36304346463030314145374646423434393243353331303030#{53f56307-b6bf-11d0-94f2-00a0c91efb8b}'
000014a8.000024f4::2015/02/19-14:51:15.969 ERR   [RES] Physical Disk: HardDiskpGetDiskInfo: GetVolumeInformation failed for
\\?\GLOBALROOT\Device\Harddisk3\ClusterPartition2\, status 3
000014a8.000024f4::2015/02/19-14:51:15.969 ERR   [RES] Physical Disk: HardDiskpGetDiskInfo: failed to get partition size for
\\?\GLOBALROOT\Device\Harddisk3\ClusterPartition2\, status 3
00000d1c.00001420::2015/02/19-14:51:16.005 WARN [RCM] ResourceTypeChaseTheOwnerLoop::DoCall: ResType MSMQ's DLL is not present on this node. Attempting to find a good node...
00000d1c.00001420::2015/02/19-14:51:16.015 WARN [RCM] ResourceTypeChaseTheOwnerLoop::DoCall: ResType MSMQTriggers's DLL is not present on this node. Attempting to find a good node...
000014a8.000024f4::2015/02/19-14:51:16.059 INFO [RES] Physical Disk: HardDiskpQueryDiskFromStm: ClusterStmFindDisk returned device='\\?\mpio#disk&ven_hp&prod_p2000_g3_fc&rev_t250#1&7f6ac24&0&36304346463030314145374646423434393243353331303030#{53f56307-b6bf-11d0-94f2-00a0c91efb8b}'
000014a8.000024f4::2015/02/19-14:51:16.059 ERR   [RES] Physical Disk: HardDiskpGetDiskInfo: GetVolumeInformation failed for
\\?\GLOBALROOT\Device\Harddisk3\ClusterPartition2\, status 3
000014a8.000024f4::2015/02/19-14:51:16.059 ERR   [RES] Physical Disk: HardDiskpGetDiskInfo: failed to get partition size for
\\?\GLOBALROOT\Device\Harddisk3\ClusterPartition2\, status 3
00000d1c.00002568::2015/02/19-14:51:17.110 INFO [GEM] Node 1: Deleting [2:395 , 2:396] (both included) as it has been ack'd by every node
00000d1c.0000299c::2015/02/19-14:51:17.444 INFO [RCM [RES] Virtual Machine VirtualMachine2 embedded failure notification, code=0 _isEmbeddedFailure=false _embeddedFailureAction=2
00000d1c.0000299c::2015/02/19-14:51:18.103 INFO [RCM] rcm::DrainMgr::PauseNodeNoDrain: [DrainMgr] PauseNodeNoDrain
00000d1c.0000299c::2015/02/19-14:51:18.103 INFO [GUM] Node 1: Processing RequestLock 1:164
00000d1c.00002568::2015/02/19-14:51:18.104 INFO [GUM] Node 1: Processing GrantLock to 1 (sent by 2 gumid: 1470)
00000d1c.0000299c::2015/02/19-14:51:18.104 INFO [GUM] Node 1: executing request locally, gumId:1471, my action: /nsm/stateChange, # of updates: 1
00000d1c.00001420::2015/02/19-14:51:18.104 INFO [DM] Starting replica transaction, paxos: 99:99:50133, smartPtr: HDL( c9b16cf1e0 ), internalPtr: HDL( c9b21
This issue has been bugging me for some time now. The cluster is fully functional and works great until the node gets paused again. I've read somewhere that the MSMQ errors can be ignored, but I can't find anything about the HardDiskpGetDiskInfo: GetVolumeInformation failed messages. There are no errors in the SAN or the server event logs. Drivers and firmware are up to date. Any help would be greatly appreciated.
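For a long cluster log like the one above, tallying the ERR entries can make the repeating pattern easier to see. The following is a rough sketch only: the regular expression is inferred from the lines in this excerpt rather than from any documented log format, and the names `LINE_RE` and `count_by_level` are made up for illustration. Wrapped lines (such as the device path that follows each error) are counted against the entry before them.

```python
import re
from collections import Counter

# Layout inferred from the excerpt: <pid>.<tid>::<date>-<time> <LEVEL> [<channel>] <message>
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<stamp>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>INFO|WARN|ERR)\s+"
    r"\[(?P<channel>[A-Z]+)\]?\s*(?P<message>.*)$"
)


def count_by_level(lines, level="ERR"):
    """Count messages of one severity and the wrapped detail lines that follow them."""
    messages = Counter()
    details = Counter()
    in_match = False
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            in_match = m.group("level") == level
            if in_match:
                messages[m.group("message").strip()] += 1
        elif in_match and line.strip():
            # Continuation of the previous entry, e.g. the device path and status.
            details[line.strip()] += 1
    return messages, details


# A few lines copied from the log in this post.
SAMPLE = [
    r"00000d1c.00002294::2015/02/19-14:51:15.343 WARN [RCM] ResourceTypeChaseTheOwnerLoop::DoCall: ResType MSMQ's DLL is not present on this node. Attempting to find a good node...",
    r"000014a8.000024f4::2015/02/19-14:51:15.386 ERR   [RES] Physical Disk: HardDiskpGetDiskInfo: GetVolumeInformation failed for",
    r"\\?\GLOBALROOT\Device\Harddisk3\ClusterPartition2\, status 3",
    r"000014a8.000024f4::2015/02/19-14:51:15.386 ERR   [RES] Physical Disk: HardDiskpGetDiskInfo: failed to get partition size for",
    r"\\?\GLOBALROOT\Device\Harddisk3\ClusterPartition2\, status 3",
    r"000014a8.000024f4::2015/02/19-14:51:15.888 ERR   [RES] Physical Disk: HardDiskpGetDiskInfo: GetVolumeInformation failed for",
    r"\\?\GLOBALROOT\Device\Harddisk3\ClusterPartition2\, status 3",
]

messages, details = count_by_level(SAMPLE)
for text, n in messages.most_common():
    print(n, text)
for text, n in details.most_common():
    print(n, text)
```

On the sample lines, every error points at the same Harddisk3 partition, which is the kind of pattern worth comparing against the disk numbers in the validation report.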
Best regards

Thank you for your replies.
First some information I left out in my original post. We're using Windows Server 2012 R2 Datacenter and are currently only hosting virtual machines on the cluster.
I did some testing over the weekend, including a firmware update on the SAN and cluster validation.
The problem doesn't seem to be related to backup. We use Microsoft DPM to make a full express backup once every day, whereas the GetVolumeInformation failed error is logged every half an hour.
Excerpts from the validation report:
Validate Disk Failover
Description: Validate that a disk can fail over successfully with
data intact.
Start: 21.02.2015 18:02:17.
Node Node2 holds the SCSI PR on Test Disk 3
and brought the disk online, but failed in its attempt to write file data to
partition table entry 1. The disk structure is corrupted and
unreadable.
Stop: 21.02.2015 18:02:37.
Node Node1 holds the SCSI PR on Test Disk 3
and brought the disk online, but failed in its attempt to write file data to
partition table entry 1. The disk structure is corrupted and unreadable.
Validate File System
Description: Validate that the file system on disks in shared
storage is supported by failover clusters and Cluster Shared Volumes (CSVs).
Failover cluster physical disk resources support NTFS, ReFS, FAT32, FAT, and
RAW. Only volumes formatted as NTFS or ReFS are accessible in disks added as
CSVs.
The test was canceled.
Validate Simultaneous Failover
Description: Validate that disks can fail over simultaneously with
data intact.
The test was canceled.
Validate Storage Spaces Persistent Reservation
Description: Validate that storage supports the SCSI-3 Persistent
Reservation commands needed by Storage Spaces to support clustering.
Start: 21.02.2015 18:01:00.
Verifying there are no Persistent Reservations, or Registration
keys, on Test Disk 3 from node Node1. Issuing Persistent Reservation REGISTER AND IGNORE EXISTING KEY
using RESERVATION KEY 0x0 SERVICE ACTION RESERVATION KEY 0x30000000a for Test
Disk 3 from node Node1.
Issuing Persistent Reservation RESERVE on Test Disk 3 from node
Node1 using key 0x30000000a.
Issuing Persistent Reservation REGISTER AND IGNORE EXISTING KEY
using RESERVATION KEY 0x0 SERVICE ACTION RESERVATION KEY 0x3000100aa for Test
Disk 3 from node Node2.
Issuing Persistent Reservation REGISTER using RESERVATION KEY
0x30000000a SERVICE ACTION RESERVATION KEY 0x30000000b for Test Disk 3 from node
Node1 to change the registered key while holding the
reservation for the disk.
Verifying there are no Persistent Reservations, or Registration
keys, on Test Disk 2 from node Node1.
Issuing Persistent Reservation REGISTER AND IGNORE EXISTING KEY
using RESERVATION KEY 0x0 SERVICE ACTION RESERVATION KEY 0x20000000a for Test
Disk 2 from node Node1.
Issuing Persistent Reservation RESERVE on Test Disk 2 from node
Node1 using key 0x20000000a.
Issuing Persistent Reservation REGISTER AND IGNORE EXISTING KEY
using RESERVATION KEY 0x0 SERVICE ACTION RESERVATION KEY 0x2000100aa for Test
Disk 2 from node Node2.
Issuing Persistent Reservation REGISTER using RESERVATION KEY
0x20000000a SERVICE ACTION RESERVATION KEY 0x20000000b for Test Disk 2 from node
Node1 to change the registered key while holding the
reservation for the disk.
Verifying there are no Persistent Reservations, or Registration
keys, on Test Disk 0 from node Node1.
Issuing Persistent Reservation REGISTER AND IGNORE EXISTING KEY
using RESERVATION KEY 0x0 SERVICE ACTION RESERVATION KEY 0xa for Test Disk 0
from node Node1.
Issuing Persistent Reservation RESERVE on Test Disk 0 from node
Node1 using key 0xa.
Issuing Persistent Reservation REGISTER AND IGNORE EXISTING KEY
using RESERVATION KEY 0x0 SERVICE ACTION RESERVATION KEY 0x100aa for Test Disk 0
from node Node2.
Issuing Persistent Reservation REGISTER using RESERVATION KEY
0xa SERVICE ACTION RESERVATION KEY 0xb for Test Disk 0 from node
Node1 to change the registered key while holding the
reservation for the disk.
Verifying there are no Persistent Reservations, or Registration
keys, on Test Disk 1 from node Node1.
Issuing Persistent Reservation REGISTER AND IGNORE EXISTING KEY
using RESERVATION KEY 0x0 SERVICE ACTION RESERVATION KEY 0x10000000a for Test
Disk 1 from node Node1.
Issuing Persistent Reservation RESERVE on Test Disk 1 from node
Node1 using key 0x10000000a.
Issuing Persistent Reservation REGISTER AND IGNORE EXISTING KEY
using RESERVATION KEY 0x0 SERVICE ACTION RESERVATION KEY 0x1000100aa for Test
Disk 1 from node Node2.
Issuing Persistent Reservation REGISTER using RESERVATION KEY
0x10000000a SERVICE ACTION RESERVATION KEY 0x10000000b for Test Disk 1 from node
Node1 to change the registered key while holding the
reservation for the disk.
Failure. Persistent Reservation not present on Test Disk 3 from
node Node1 after successful call to update reservation holder's
registration key 0x30000000b.
Failure. Persistent Reservation not present on Test Disk 1 from
node Node1 after successful call to update reservation holder's
registration key 0x10000000b.
Failure. Persistent Reservation not present on Test Disk 0 from
node Node1 after successful call to update reservation holder's
registration key 0xb.
Failure. Persistent Reservation not present on Test Disk 2 from
node Node1 after successful call to update reservation holder's
registration key 0x20000000b.
Test Disk 0 does not support SCSI-3 Persistent Reservations
commands needed by clustered storage pools that use the Storage Spaces
subsystem. Some storage devices require specific firmware versions or settings
to function properly with failover clusters. Contact your storage administrator
or storage vendor for help with configuring the storage to function properly
with failover clusters that use Storage Spaces.
Test Disk 1 does not support SCSI-3 Persistent Reservations
commands needed by clustered storage pools that use the Storage Spaces
subsystem. Some storage devices require specific firmware versions or settings
to function properly with failover clusters. Contact your storage administrator
or storage vendor for help with configuring the storage to function properly
with failover clusters that use Storage Spaces.
Test Disk 2 does not support SCSI-3 Persistent Reservations
commands needed by clustered storage pools that use the Storage Spaces
subsystem. Some storage devices require specific firmware versions or settings
to function properly with failover clusters. Contact your storage administrator
or storage vendor for help with configuring the storage to function properly
with failover clusters that use Storage Spaces.
Test Disk 3 does not support SCSI-3 Persistent Reservations
commands needed by clustered storage pools that use the Storage Spaces
subsystem. Some storage devices require specific firmware versions or settings
to function properly with failover clusters. Contact your storage administrator
or storage vendor for help with configuring the storage to function properly
with failover clusters that use Storage Spaces.
Stop: 21.02.2015 18:01:02
Thank you for your help.
David

Cluster node has exceeded it's failover threshold

I am trying to create the Availability Group listener for a 2-node cluster, and the cluster node events show a failure due to "Cluster node has exceeded it's failover threshold" after a single attempt that failed for one of a variety of reasons, usually permissions. How do I set the threshold higher? All the information I find tells me to open processes that are not listed in Failover Cluster Manager, and I haven't seen PowerShell code that works. How can I set the failover threshold higher than one?

Hi,
Please install the recommended hotfixes and updates for Windows Server 2012-based failover clusters, then monitor the cluster again.
The related hotfixes.
Recommended hotfixes and updates for Windows Server 2012-based failover clusters
http://support.microsoft.com/kb/2784261/en-us
Hope this helps.

SCVMM losing connection to cluster nodes

Hey guys and girls, I hope this is the right forum for this question. I already opened a ticket with MS support as well, because it's indirectly impacting our production environment, but even after a week there's been no contact. I'm losing faith in MS support there.
The problem we're having with SCVMM is that a host enters the 'needs attention' state with WinRM error 0x80338126. I guess it has something to do with the network or with Kerberos, and I've found some info on it, but I still haven't been able to solve it. Do you guys have any ideas?
Problem summary:
We are seeing an issue on our new hyper-v platform. The platform should have been in production last week, but this issue is delaying our project as we can't seem to get it stable.
The problem we are experiencing is that SCVMM loses the connection to some of the Hyper-V nodes, not to one specific node. Last week it happened to two nodes, and today it happened to another node. I see issues with WinRM, and I suspect it has something to do with Kerberos. See the bottom of this post for background details and software versions.
The host gets the status 'needs attention', and if you look at the status of the machine, WinRM gives an error. The error is:
Error (2916)
VMM is unable to complete the request. The connection to the agent cc1-hyp-10.domaincloud1.local was lost.
WinRM: URL: [http://cc1-hyp-10.domaincloud1.local:5985], Verb: [ENUMERATE], Resource: [http://schemas.microsoft.com/wbem/wsman/1/wmi/root/cimv2/Win32_Service], Filter: [select * from Win32_Service where Name="WinRM"]
Unknown error (0x80338126)
Recommended Action
Ensure that the Windows Remote Management (WinRM) service and the VMM agent are installed and running and that a firewall is not blocking HTTP/HTTPS traffic. Ensure that VMM server is able to communicate with cc1-hyp-10.domaincloud1.local over WinRM by successfully running the following command:
winrm id –r:cc1-hyp-10.domaincloud1.local
This problem can also be caused by a Windows Management Instrumentation (WMI) service crash. If the server is running Windows Server 2008 R2, ensure that KB 982293 (http://support.microsoft.com/kb/982293) is installed on it.
If the error persists, restart cc1-hyp-10.domaincloud1.local and then try the operation again.
Refer to http://support.microsoft.com/kb/2742275 for more details.
Doing a simple test from the VMM server to the problematic cluster node shows this error:
PS C:\> hostname
CC1-VMM-01
PS C:\> winrm id -r:cc1-hyp-10.domaincloud1.local
WSManFault
    Message = WinRM cannot complete the operation. Verify that the specified computer name is valid, that the computer is accessible over the network, and that a firewall exception for the WinRM service is enabled and allows access from this
computer. By default, the WinRM firewall exception for public profiles limits access to remote computers within the same local subnet.
Error number: -2144108250 0x80338126
WinRM cannot complete the operation. Verify that the specified computer name is valid, that the computer is accessible over the network, and that a firewall exception for the WinRM service is enabled and allows access from this computer. By default, the WinRM
firewall exception for public profiles limits access to remote computers within the same local subnet.
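The two error numbers in this output are the same 32-bit value shown two ways: -2144108250 is the signed decimal reading and 0x80338126 is the unsigned hex reading. A minimal sketch of the conversion, useful when searching for an error under either form (the function names are made up for illustration):

```python
def to_hex_code(signed_code):
    """Format a signed 32-bit error number as unsigned 8-digit hex."""
    return "0x{:08X}".format(signed_code & 0xFFFFFFFF)


def to_signed_code(hex_code):
    """Interpret a 32-bit hex error code as a signed decimal number."""
    value = int(hex_code, 16)
    return value - 0x100000000 if value & 0x80000000 else value


print(to_hex_code(-2144108250))      # 0x80338126
print(to_signed_code("0x80338126"))  # -2144108250
```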
I CAN connect from other hosts to this problematic cluster node:
PS C:\> hostname
CC1-HYP-16
PS C:\> winrm id -r:cc1-hyp-10.domaincloud1.local
IdentifyResponse
    ProtocolVersion =
http://schemas.dmtf.org/wbem/wsman/1/wsman.xsd
    ProductVendor = Microsoft Corporation
    ProductVersion = OS: 6.3.9600 SP: 0.0 Stack: 3.0
    SecurityProfiles
        SecurityProfileName =
http://schemas.dmtf.org/wbem/wsman/1/wsman/secprofile/http/spnego-kerberos
And I can connect from the vmm server to all other cluster nodes:
PS C:\> hostname
CC1-VMM-01
PS C:\> winrm id -r:cc1-hyp-11.domaincloud1.local
IdentifyResponse
    ProtocolVersion =
http://schemas.dmtf.org/wbem/wsman/1/wsman.xsd
    ProductVendor = Microsoft Corporation
    ProductVersion = OS: 6.3.9600 SP: 0.0 Stack: 3.0
    SecurityProfiles
        SecurityProfileName =
http://schemas.dmtf.org/wbem/wsman/1/wsman/secprofile/http/spnego-kerberos
So at this point only the test from the cc1-vmm-01 to cc1-hyp-10 seems to be problematic.
I followed the steps on the page https://support.microsoft.com/kb/2742275 (which is referred to above). I tried the VMMCA, but I can't really get it working the way I want, or it seems to give outdated recommendations.
I tried checking for duplicate SPNs by running setspn -x on the affected machines. No results (although I do not understand what an SPN is or how it works). I rebuilt the performance counters.
I tried setting 'sc config winrm type= own' as described in [http://blinditandnetworkadmin.blogspot.nl/2012/08/kb-how-to-troubleshoot-needs-attention.html].
If I reboot this cc1-hyp-10 machine, it will start working perfectly again. However, then I can't troubleshoot the issue, and it will happen again.
I want this problem to be solved, so vmm never loses connection to the hypervisors it's managing again!
Background information:
We've set up a platform with Hyper-V to run a VM workload. The platform consists of the following hardware:
2 Dell R620's with 32GB of RAM, running hyper-v to virtualize the cloud management layer (DC's, VMM, SQL). These machines are called cc1-hyp-01 and cc1-hyp-02. They run the management vm's like cc1-dc-01/02, cc1-sql-01, cc1-vmm-01, etc. The names are self-explanatory.
The VMM machine is NOT clustered.
8 Dell M620 blades with 320GB of RAM, running hyper-v to virtualize the customer workload. The machines are called cc1-hyp-10 through cc1-hyp-17. They are in a cluster.
2 Equallogic units form a SAN (premium storage), and we have a Dell R515 running iscsi target (budget storage).
We have Dell Force10 switches and Cisco C3750X switches to connect everything together (mostly 10GB links).
All hosts run Windows Server 2012 R2 Datacenter edition. The VMM server runs System Center Virtual Machine Manager 2012 R2.
All the latest Windows updates are installed on every host. There are no firewalls between any host (vmm and hypervisors) at this level. Windows firewalls are all disabled. No antivirus software is installed, no symantec software is installed.
The only non-standard software that is installed is the Dell Host Integration Tools 4.7.1, Dell Openmanage Server Administrator, and some small stuff like 7-zip, bginfo, net-snap, etc.
The SCVMM service is running under the domain account DOMAINCLOUD1\scvmm. This machine is in the local administrators group of each cluster node.
On top of this cloud layer we're running the tenant layer with a lot of vm's for a specific customer (although they are all off now).

I think I found the culprit. After an hour of analyzing Wireshark dumps, I found that the VMM server had jumbo frames enabled on the management interface to the hosts, while the underlying infrastructure does not. Now my WinRM commands have started working again.
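
To illustrate why a jumbo frame mismatch like this can break some traffic while smaller requests still get through, here is a minimal sketch. The MTU values are assumptions for illustration only (9000 bytes is a common jumbo frame setting, 1500 bytes is the standard Ethernet default); the actual values on this network are not given above, and the function names are hypothetical.

```python
def path_mtu(hop_mtus):
    """Return the largest packet size every hop on the path can carry."""
    if not hop_mtus:
        raise ValueError("path must contain at least one hop")
    return min(hop_mtus)


def fits_without_fragmentation(packet_size, hop_mtus):
    """True if a packet of this size passes every hop unfragmented."""
    return packet_size <= path_mtu(hop_mtus)


# Assumed values: sender NIC with jumbo frames enabled, followed by
# infrastructure hops left at the standard Ethernet MTU.
ASSUMED_PATH = [9000, 1500, 1500]

for size in (576, 1500, 9000):
    ok = fits_without_fragmentation(size, ASSUMED_PATH)
    print(f"{size:>5} byte packet: {'fits' if ok else 'too large for path'}")
```

Under these assumptions, small packets fit while anything the sender builds above 1500 bytes does not, which would be consistent with a connection that is established but then fails on larger payloads.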

Unable to failover the services in active-active cluster node

Hi,
I am applying the SP2 patch for SQL Server 2008 R2 in an active-active cluster. We have 3 services in the cluster: node 1 is the preferred owner of 2 of them and node 2 is the preferred owner of 1. When I try to move the service from node 2 to node 1, I get the errors below:
DCOM was unable to communicate with the computer XXXXXXXXX using any of the configured protocols.
The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server XXXXXXXXX. The target name used was RPCSS/XXXXXX. This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal name (SPN) is registered on an account other than the account the target service is using. Please ensure that the target SPN is registered on, and only registered on, the account used by the server. This error can also happen when the target service is using a different password for the target service account than what the Kerberos Key Distribution Center (KDC) has for the target service account. Please ensure that the service on the server and the KDC are both updated to use the current password. If the server name is not fully qualified, and the target domain (XXXXXX) is different from the client domain (XXXXXXX), check if there are identically named server accounts in these two domains, or use the fully-qualified name to identify the server.
The Cluster service failed to bring clustered service or application 'CHCROCHC045' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.
Cluster resource 'SQL Server (CHCROCHC045)' in clustered service or application 'CHCROCHC045' failed.
Any input to resolve this issue would be appreciated, as I cannot proceed with patching.
BR
PGR
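
One cause named in the Kerberos message above is an SPN that is registered on more than one account. As a rough illustration of that check, the sketch below looks for duplicates in an exported account-to-SPN mapping. The account names and SPNs are hypothetical, the case-insensitive comparison is an assumption, and this is not how any Windows tool implements the check.

```python
from collections import defaultdict


def find_duplicate_spns(account_spns):
    """Map each SPN held by more than one account to the accounts holding it."""
    owners = defaultdict(set)
    for account, spns in account_spns.items():
        for spn in spns:
            # Assumption: SPNs are compared without regard to case.
            owners[spn.lower()].add(account)
    return {spn: sorted(accts) for spn, accts in owners.items() if len(accts) > 1}


# Hypothetical export of accounts and their registered SPNs.
EXAMPLE = {
    "svc-sql-a": ["MSSQLSvc/sqlnode.example.com:1433"],
    "svc-sql-b": ["mssqlsvc/SQLNODE.example.com:1433"],
    "svc-web": ["HTTP/web.example.com"],
}

for spn, accounts in find_duplicate_spns(EXAMPLE).items():
    print(f"{spn} is registered on: {', '.join(accounts)}")
```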

Hi PGR,
As the issue is more related to Windows Server, I would recommend posting it in the Windows Server forums for better support.
In addition, below are some articles about troubleshooting the error "DCOM was unable to communicate with the computer XXXXXXXXX using any of the configured protocols" for your reference.
Event ID 10009 — COM Remote Service Availability
How to troubleshoot DCOM 10009 error logged in system event?
Thanks,
Lydia Zhang
TechNet Community Support

Error while getting cluster node subtree

Hi,
We are on SP15.
The console logs show the following error
log generation timestamp : 2006_01_17_at_17_14_05
java.rmi.RemoteException: Error while getting cluster node subtree of :name=ClusterNodeRepresentative,j2eeType=com.sap.engine.services.adminadapter.impl.ClusterNodeRepresentative,SAP_J2EEClusterNode=2053400,SAP_J2EECluster=""; nested exception is:
     com.sap.engine.services.jmx.exception.MBeanServerClusterException: Exception during invocation of remote MBeanServer method, target node: 2053400
     at com.sap.engine.services.adminadapter.impl.ConvenienceEngineAdministratorImpl.getClusterNodeSubTree(ConvenienceEngineAdministratorImpl.java:242)
     at com.sap.engine.services.adminadapter.impl.ConvenienceEngineAdministratorImplp4_Skel.dispatch(ConvenienceEngineAdministratorImplp4_Skel.java:99)
     at com.sap.engine.services.rmi_p4.DispatchImpl._runInternal(DispatchImpl.java:304)
     at com.sap.engine.services.rmi_p4.DispatchImpl._run(DispatchImpl.java:193)
     at com.sap.engine.services.rmi_p4.server.P4SessionProcessor.request(P4SessionProcessor.java:122)
     at com.sap.engine.core.service630.context.cluster.session.ApplicationSessionMessageListener.process(ApplicationSessionMessageListener.java:33)
     at com.sap.engine.core.cluster.impl6.session.MessageRunner.run(MessageRunner.java:41)
     at com.sap.engine.core.thread.impl3.ActionObject.run(ActionObject.java:37)
     at java.security.AccessController.doPrivileged(Native Method)
     at com.sap.engine.core.thread.impl3.SingleThread.execute(SingleThread.java:100)
     at com.sap.engine.core.thread.impl3.SingleThread.run(SingleThread.java:170)
Caused by: com.sap.engine.services.jmx.exception.MBeanServerClusterException: Exception during invocation of remote MBeanServer method, target node: 2053400
     at com.sap.engine.services.jmx.ClusterInterceptor.invoke(ClusterInterceptor.java:816)
     at com.sap.pj.jmx.server.interceptor.MBeanServerInterceptorChain.invoke(MBeanServerInterceptorChain.java:330)
     at com.sap.engine.services.adminadapter.impl.ConvenienceEngineAdministratorImpl.getClusterNodeSubTree(ConvenienceEngineAdministratorImpl.java:239)
     ... 10 more
Caused by: com.sap.engine.services.jmx.exception.JmxConnectorException: Unable to de-serialize request parameters, message [ JMX request (java) v1.0 len: 345 | src: cluster target-node: 2053400 req: invoke params-number: 4 params-bytes: 0 | :name=ClusterNodeRepresentative,j2eeType=com.sap.engine.services.adminadapter.impl.ClusterNodeRepresentative,SAP_J2EEClusterNode=2053400,SAP_J2EECluster="" null null null ]
     at com.sap.engine.services.jmx.MBeanServerConnectionImpl.invokeMbsInternal(MBeanServerConnectionImpl.java:680)
     at com.sap.engine.services.jmx.MBeanServerConnectionImpl.invoke(MBeanServerConnectionImpl.java:467)
     at com.sap.engine.services.jmx.MBeanServerConnectionSecurityWrapper.invoke(MBeanServerConnectionSecurityWrapper.java:221)
     at com.sap.engine.services.jmx.ClusterInterceptor.invoke(ClusterInterceptor.java:813)
     ... 12 more
Caused by: javax.management.InstanceNotFoundException: MBean with name com.sap.default:name=ClusterNodeRepresentative,j2eeType=com.sap.engine.services.adminadapter.impl.ClusterNodeRepresentative,SAP_J2EEClusterNode=2053400,SAP_J2EECluster=XD1 not found in repository
     at com.sap.pj.jmx.server.MBeanServerImpl.getClassLoaderFor(MBeanServerImpl.java:1408)
     at com.sap.pj.jmx.server.interceptor.MBeanServerWrapperInterceptor.getClassLoaderFor(MBeanServerWrapperInterceptor.java:455)
     at com.sap.engine.services.jmx.CompletionInterceptor.getClassLoaderFor(CompletionInterceptor.java:567)
     at com.sap.pj.jmx.server.interceptor.BasicMBeanServerInterceptor.getClassLoaderFor(BasicMBeanServerInterceptor.java:438)
     at com.sap.jmx.provider.ProviderInterceptor.getClassLoaderFor(ProviderInterceptor.java:330)
     at com.sap.engine.services.jmx.RedirectInterceptor.getClassLoaderFor(RedirectInterceptor.java:501)
     at com.sap.pj.jmx.server.interceptor.MBeanServerInterceptorChain.getClassLoaderFor(MBeanServerInterceptorChain.java:443)
     at com.sap.engine.services.jmx.RequestMessage.readParams(RequestMessage.java:523)
     at com.sap.engine.services.jmx.RequestMessage.getParams(RequestMessage.java:578)
     at com.sap.engine.services.jmx.MBeanServerInvoker.invokeMbs(MBeanServerInvoker.java:106)
     at com.sap.engine.services.jmx.JmxServiceConnectorServer.receiveWait(JmxServiceConnectorServer.java:173)
     at com.sap.engine.core.service630.context.cluster.message.MessageListenerWrapper.process(MessageListenerWrapper.java:81)
     at com.sap.engine.core.cluster.impl6.ms.MSListenerThread.run(MSListenerThread.java:47)
     at com.sap.engine.frame.core.thread.Task.run(Task.java:64)
     at com.sap.engine.core.thread.impl6.SingleThread.execute(SingleThread.java:78)
     at com.sap.engine.core.thread.impl6.SingleThread.run(SingleThread.java:148)
java.lang.NullPointerException
     at com.sap.engine.services.adminadapter.gui.ClusterView.addGlobalDispatcherServiceProperties(ClusterView.java:455)
     at com.sap.engine.services.adminadapter.gui.ClusterView.createGlobalTrees(ClusterView.java:508)
     at com.sap.engine.services.adminadapter.gui.ClusterView.access$1200(ClusterView.java:29)
     at com.sap.engine.services.adminadapter.gui.ClusterView$4.run(ClusterView.java:420)
Any clue what this is?
rgds
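
When reading a long Java trace like the one above, it can help to pull out just the "Caused by:" lines, since the last one is the innermost cause. The following is a minimal sketch of that idea; the trace text is an abbreviated excerpt of the log above (messages shortened), and the helper name is hypothetical.

```python
def cause_chain(trace):
    """Return (exception class, message) for each 'Caused by:' line, in order."""
    prefix = "Caused by: "
    chain = []
    for line in trace.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            exc, _, msg = line[len(prefix):].partition(": ")
            chain.append((exc, msg))
    return chain


# Abbreviated excerpt of the trace above; stack frames and long messages shortened.
TRACE = """\
java.rmi.RemoteException: Error while getting cluster node subtree
     at com.sap.engine.services.rmi_p4.DispatchImpl._run(DispatchImpl.java:193)
Caused by: com.sap.engine.services.jmx.exception.MBeanServerClusterException: Exception during invocation of remote MBeanServer method, target node: 2053400
     ... 10 more
Caused by: com.sap.engine.services.jmx.exception.JmxConnectorException: Unable to de-serialize request parameters
     ... 12 more
Caused by: javax.management.InstanceNotFoundException: MBean not found in repository
"""

chain = cause_chain(TRACE)
root_class, root_message = chain[-1]
print("innermost cause:", root_class)
```

For this trace the innermost cause is the InstanceNotFoundException, which suggests the ClusterNodeRepresentative MBean for node 2053400 is not registered on the target node.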

Got the same error
+ /usr/java14_64/bin/java -showversion -Duser.language=en -DP4ClassLoad=P4Connection -Dp4Cache=clean -jar go.jar
java version "1.4.2"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2)
Classic VM (build 1.4.2, J2RE 1.4.2 IBM AIX 5L for PowerPC (64 bit JVM) build caix64142ifx-20061222 (ifix 113727: SR7 + 112603) (JIT enabled: jitc))
java.lang.NullPointerException
        at com.sap.engine.services.adminadapter.gui.ClusterView$4.run(ClusterView.java:405)
Need some help!
Bernard

INS-40925 - One or more nodes have interfaces not configured with a subnet that is common across all cluster nodes.

Hi All,
I am facing the below error while installing Oracle RAC in Silent Mode.
SEVERE: There are no common subnets represented by network interfaces across all cluster nodes.
SEVERE: [FATAL] [INS-40925] One or more nodes have interfaces not configured with a subnet that is common across all cluster nodes.
   CAUSE: Not all nodes have network interfaces that are configured on subnets that are common to all nodes in the cluster.
   ACTION: Ensure all cluster nodes have a public interface defined with the same subnet accessible by all nodes in the cluster.
My /etc/hosts is given below.
127.0.0.1        localhost    localhost.localdomain
#Public
192.168.1.101    rac1         rac1.localdomain
192.168.1.102    rac2         rac2.localdomain
#Private
192.168.2.101    rac1-priv    rac1-priv.localdomain
192.168.2.102    rac2-priv    rac2-priv.localdomain
#Virtual
192.168.1.103    rac1-vip     rac1-vip.localdomain
192.168.1.104    rac2-vip     rac2-vip.localdomain
#SCAN
192.168.1.105    rac-scan     rac-scan.localdomain
Could you please help me get rid of the INS-40925 error? Any ideas?

Hi Ramesh,
Please find the result of ifconfig -a from both nodes RAC1 & RAC2.
ifconfig -a in RAC1
[oracle@rac1 Desktop]$ ifconfig -a
eth0      Link encap:Ethernet HWaddr 08:00:27:17:7A:D5
          inet addr:192.168.1.101 Bcast:192.168.1.255 Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe17:7ad5/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:102 errors:0 dropped:0 overruns:0 frame:0
          TX packets:48 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:25472 (24.8 KiB) TX bytes:3322 (3.2 KiB)
          Interrupt:19 Base address:0xd020
eth1      Link encap:Ethernet HWaddr 08:00:27:C0:AC:DB
          inet addr:192.168.2.101 Bcast:192.168.2.255 Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fec0:acdb/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:4 errors:0 dropped:0 overruns:0 frame:0
          TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:240 (240.0 b) TX bytes:816 (816.0 b)
          Interrupt:16 Base address:0xd240
lo        Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:56 errors:0 dropped:0 overruns:0 frame:0
          TX packets:56 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:6394 (6.2 KiB) TX bytes:6394 (6.2 KiB)
virbr0    Link encap:Ethernet HWaddr 52:54:00:CC:BD:FB
          inet addr:192.168.122.1 Bcast:192.168.122.255 Mask:255.255.255.0
          UP BROADCAST MULTICAST MTU:1500 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
virbr0-nic Link encap:Ethernet HWaddr 52:54:00:CC:BD:FB
          BROADCAST MULTICAST MTU:1500 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500
          RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
ifconfig -a in RAC2
[oracle@rac2 Desktop]$ ifconfig -a
eth0      Link encap:Ethernet HWaddr 08:00:27:C9:38:82
          inet addr:192.168.1.102 Bcast:192.168.1.255 Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fec9:3882/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:122 errors:0 dropped:0 overruns:0 frame:0
          TX packets:59 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:32617 (31.8 KiB) TX bytes:5157 (5.0 KiB)
          Interrupt:19 Base address:0xd020
eth1      Link encap:Ethernet HWaddr 08:00:27:90:B5:A0
          inet addr:192.168.2.102 Bcast:192.168.2.255 Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe90:b5a0/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:4 errors:0 dropped:0 overruns:0 frame:0
          TX packets:11 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:240 (240.0 b) TX bytes:746 (746.0 b)
          Interrupt:16 Base address:0xd240
lo        Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:56 errors:0 dropped:0 overruns:0 frame:0
          TX packets:56 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:6390 (6.2 KiB) TX bytes:6390 (6.2 KiB)
virbr0    Link encap:Ethernet HWaddr 52:54:00:CC:BD:FB
          inet addr:192.168.122.1 Bcast:192.168.122.255 Mask:255.255.255.0
          UP BROADCAST MULTICAST MTU:1500 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
virbr0-nic Link encap:Ethernet HWaddr 52:54:00:CC:BD:FB
          BROADCAST MULTICAST MTU:1500 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500
          RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
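
As a quick sanity check of the output above, the subnets each node sits on can be computed from the address and netmask pairs and then intersected. This is only a sketch of the idea behind the INS-40925 message, not the installer's actual logic; the addresses are copied from the two ifconfig listings (loopback omitted), and the function names are hypothetical.

```python
import ipaddress

# Address and netmask per interface, copied from the ifconfig output above.
NODE_INTERFACES = {
    "rac1": {
        "eth0": ("192.168.1.101", "255.255.255.0"),
        "eth1": ("192.168.2.101", "255.255.255.0"),
        "virbr0": ("192.168.122.1", "255.255.255.0"),
    },
    "rac2": {
        "eth0": ("192.168.1.102", "255.255.255.0"),
        "eth1": ("192.168.2.102", "255.255.255.0"),
        "virbr0": ("192.168.122.1", "255.255.255.0"),
    },
}


def subnets_for(interfaces):
    """Return the set of networks a node's interfaces belong to."""
    return {
        ipaddress.ip_interface(f"{addr}/{mask}").network
        for addr, mask in interfaces.values()
    }


def common_subnets(nodes):
    """Return the networks that every node has an interface on."""
    per_node = [subnets_for(ifaces) for ifaces in nodes.values()]
    return set.intersection(*per_node)


for network in sorted(common_subnets(NODE_INTERFACES)):
    print(network)
```

By this calculation both nodes share 192.168.1.0/24 and 192.168.2.0/24 (plus the virbr0 bridge network, which has the same address on both nodes), so the interface addressing itself looks consistent; the cause may lie in what the installer was told or could discover about the nodes rather than in the addresses shown here.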
