Node fails to join the cluster

We are observing a problem where a node, after being restarted, fails to rejoin the cluster.
We run two Coherence clusters across three boxes. Each box runs eight Java processes: four belonging to one cluster and four to the other. They all run as Windows NT services. Occasionally a node goes down and is restarted, but it then fails to join the cluster with the following exception:
"com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(Id=8, Name=DistributedFIIndicativeCacheWithPublishingCacheStore, Type=DistributedCache"
Has anyone experienced and addressed such a problem? If required, I can provide exact details of the cluster setup.
-Bharat

Hi Bharat,
This may be caused by a stuck or slow DistributedService thread on one of your nodes. Please log into http://support.oracle.com and take a look at [Note 845363.1|https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=845363.1] for more details. Additionally, consider upgrading to Coherence 3.5 as it includes the [Service Guardian for deadlock detection/resolution|http://blackbeanbag.net/wp/2009/07/20/coherence-3-5-service-guardian-deadlock-detection/].
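For example, once on 3.5 you can tune how long the guardian waits before acting on a stuck service thread; a minimal sketch (the property name and value here are assumptions to verify against the 3.5 docs), passed on each node's JVM command line:
-Dtangosol.coherence.guard.timeout=60000
or the equivalent <service-guardian>/<timeout-milliseconds> setting in the operational override file.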
Thanks,
Patrick

Similar Messages

  • Node failed to join the cluster because it could not send and receive failure detection network messages

    One of my customers has a Windows Server 2008 R2 cluster for an Exchange 2010 Mailbox Database Availability Group.  Lately, they've been having problems with one of their nodes (the one node that is on a different subnet in a different datacenter) where
    their Exchange databases aren't replicating.  While looking into this issue it seems that the problem is the Network Manager isn't started because the cluster service is failing.  Since the issue seems to be with the cluster service, and not Exchange,
    I'm asking here. 
    When the cluster service starts, it appears to start working, but within a few minutes the following is logged in the system event log.
    FailoverClustering
    1572
    Critical
    Cluster Virtual Adapter
    Node 'nodename' failed to join the cluster because it could not send and receive failure detection network messages with other cluster nodes. ...
    It seems that the problem is with the 169.254 address on the cluster virtual adapter.  An entry in the cluster.log file says: Aborting connection because NetFT route to node nodename on virtual IP 169.254.1.44:~3343~ has failed to come up. 
    In my experience, you never have to mess with the cluster virtual adapter.  I'm not sure what happened here, but I doubt it has been modified.  I need the cluster to communicate with its other nodes on our routed 10. network.  I've never experienced
    this before and found little in my searches on the subject.  Any idea how I can fix this?
    Thanks,
    Joe
    Joseph M. Durnal MCM: Exchange 2010 MCITP: Enterprise Messaging Administrator, Exchange 2010 MCITP: Enterprise Messaging Administrator, MCITP: Enterprise Administrator

    Hi,
    I suspect an issue with communication on UDP port 3343. Please confirm that firewall rules for port 3343 are in place on all the nodes and that connections are allowed for all firewall profiles on every node, or otherwise verify connectivity between all the nodes.
    Use ipconfig /flushdns to clear each node's DNS resolver cache, then confirm that the entries for the nodes on your DNS server are correct.
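    For example, an inbound rule for the port can be added on each node with something like the following (a sketch; the rule name is arbitrary):
    netsh advfirewall firewall add rule name="Failover Cluster UDP 3343" dir=in action=allow protocol=UDP localport=3343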
    A similar issue is discussed in this article:
    Exchange 2010 DAG - NetworkManager has not yet been initialized
    https://blogs.technet.com/b/dblanch/archive/2012/03/05/exchange-2010-dag-networkmanager-has-not-yet-been-initialized.aspx?Redirected=true
    Hope this helps.

  • After patching the node, the node is not joining the cluster.

    Dear All,
    We have a two-node Sun Cluster with the release below:
    Sun Cluster 3.2u1 for Solaris 10 sparc
    Copyright 2008 Sun Microsystems, Inc. All Rights Reserved.
    And nodes are
    Node Name Status
    scrbdomdefrm005 Online
    scrbdomderue005 Offline
    We are applying the 2Q 2009 quarterly patches. We patched node scrbdomderue005 first, following the steps below:
    1) Our root mirror d0 has submirrors d1 (c0t0d0s0) and d2 (c1t0d0s0).
    2) We detached d2 from d0 (metaclear d2).
    3) We mounted c1t0d0s0 on /mnt.
    4) We used patchadd -R /mnt to patch the server. While patching we got only one error: patch 126106-27 needs to be installed in non-cluster mode.
    5) We switched the RGs from node scrbdomderue005 to scrbdomdfrm005.
    6) We shut down scrbdomderue005, booted it from c1t0d0s0 in non-cluster single-user mode, and installed patch 126106-27 successfully.
    7) We shut down scrbdomderue005 again, booted it from c1t0d0s0 in cluster mode, and got the following errors:
    Booting as part of a cluster
    NOTICE: CMM: Node scrbdomdefrm005 (nodeid = 1) with votecount = 1 added.
    NOTICE: CMM: Node scrbdomderue005 (nodeid = 2) with votecount = 1 added.
    WARNING: CMM: Open failed for quorum device /dev/did/rdsk/d5s2 with error 1.
    NOTICE: clcomm: Adapter nxge7 constructed
    NOTICE: clcomm: Adapter nxge3 constructed
    NOTICE: CMM: Node scrbdomderue005: attempting to join cluster.
    NOTICE: nxge3: xcvr addr:0x0a - link is up 1000 Mbps full duplex
    NOTICE: nxge7: xcvr addr:0x0a - link is up 1000 Mbps full duplex
    WARNING: CMM: Open failed for quorum device /dev/did/rdsk/d5s2 with error 1.
    NOTICE: CMM: Cluster doesn't have operational quorum yet; waiting for quorum.
    NOTICE: clcomm: Path scrbdomderue005:nxge7 - scrbdomdefrm005:nxge7 errors during initiation
    NOTICE: clcomm: Path scrbdomderue005:nxge3 - scrbdomdefrm005:nxge3 errors during initiation
    WARNING: Path scrbdomderue005:nxge7 - scrbdomdefrm005:nxge7 initiation encountered errors, errno = 62. Remote node may be down or unreachable through this path.
    WARNING: Path scrbdomderue005:nxge3 - scrbdomdefrm005:nxge3 initiation encountered errors, errno = 62. Remote node may be down or unreachable through this path.
    exit from console.
    We are able to boot node scrbdomderue005 in non-cluster mode successfully; please check the details below.
    scrbdomderue005:/# uname -a
    SunOS scrbdomderue005 5.10 Generic_138888-07 sun4u sparc SUNW,SPARC-Enterprise
    scrbdomderue005:/#
    Before patching the server scrbdomderue005, the kernel version was:
    SunOS scrbdomderue005 5.10 Generic_137111-07 sun4u sparc SUNW,SPARC-Enterprise
    If I boot scrbdomderue005 from d1 (c0t0d0s0), the server joins the cluster properly without issue.
    Could anyone please guide me on what the problem could be and how to resolve the issue?

    Hi
    It could be because you have installed patch 138888. It has problems with nxge interfaces used as the interconnect.
    Rgds
    Carsten
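    If 138888 does turn out to be the culprit, one way out (a sketch; it assumes the kernel patch ID is 138888-07, matching the Generic_138888-07 string shown by uname) is to boot the patched side outside the cluster and back the patch out:
    ok boot -sx            (from the OBP prompt: -x = non-cluster, -s = single-user)
    # patchrm 138888-07    (remove the suspect kernel patch, then boot back into the cluster)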

  • 11gR1: one node won't join the cluster after reboot

    This is a high level description of a problem.
    We usually run a two node cluster.
    This week we had an issue where one node needed to be taken down. It became unresponsive, and upon reboot the other node no longer functioned correctly.
    So one node was left running until the maintenance window.
    Apparently when it's brought back up it has the MAC of the second node in the arp cache.
    This leads to node1 not being able to join the cluster.
    I've seen workarounds that involve refreshing the arp cache but is there anything else to this?
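    (For reference, the arp-cache workaround usually amounts to deleting the stale entry so that the next packet re-ARPs and learns the correct MAC; a sketch, with a hypothetical VIP address:
    arp -d 10.1.2.3
    run on whichever host holds the stale entry.)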


  • Managed server not able to join the cluster

    Hi
    I have two storage-node-enabled Coherence servers on two different machines. These two are able to form the cluster without any problem. I also have two managed servers. When I start the first one, it joins the cluster without any issue, but when I start the fourth process (the second managed server), it does not join the cluster. Only one managed server joins the cluster. I am getting the following error:
    2011-12-22 15:39:26.940/356.798 Oracle Coherence GE 3.6.0.4 <Info> (thread=[ACTIVE] ExecuteThread: '2' for queue: 'weblogic.kernel.Default (self-tuning)', member=n/a): Loaded cache configuration from "file:/u02/oracle/admin/atddomain/atdcluster/ATD/config/atd-client-cache-config.xml"
    2011-12-22 15:39:26.943/356.801 Oracle Coherence GE 3.6.0.4 <D4> (thread=[ACTIVE] ExecuteThread: '2' for queue: 'weblogic.kernel.Default (self-tuning)', member=n/a): TCMP bound to /172.23.34.91:8190 using SystemSocketProvider
    2011-12-22 15:39:57.909/387.767 Oracle Coherence GE 3.6.0.4 <Warning> (thread=Cluster, member=n/a): This Member(Id=0, Timestamp=2011-12-22 15:39:26.944, Address=172.23.34.91:8190, MachineId=39242, Location=site:dev.icd,machine:appsoad2-web2,process:24613, Role=WeblogicServer) has been attempting to join the cluster at address 231.1.1.50:7777 with TTL 4 for 30 seconds without success; this could indicate a mis-configured TTL value, or it may simply be the result of a busy cluster or active failover.
    2011-12-22 15:39:57.909/387.767 Oracle Coherence GE 3.6.0.4 <Warning> (thread=Cluster, member=n/a): Received a discovery message that indicates the presence of an existing cluster:
    Message "NewMemberAnnounceWait"
    FromMember=Member(Id=2, Timestamp=2011-12-22 15:22:56.607, Address=172.23.34.74:8090, MachineId=39242, Location=site:dev.icd,machine:appsoad4,process:23937,member:CoherenceServer2, Role=WeblogicWeblogicCacheServer)
    FromMessageId=0
    Internal=false
    MessagePartCount=1
    PendingCount=0
    MessageType=9
    ToPollId=0
    Poll=null
    Packets
    [000]=Broadcast{PacketType=0x0DDF00D2, ToId=0, FromId=2, Direction=Incoming, ReceivedMillis=15:39:57.909, MessageType=9, ServiceId=0, MessagePartCount=1, MessagePartIndex=0, Body=0}
    Service=ClusterService{Name=Cluster, State=(SERVICE_STARTED, STATE_ANNOUNCE), Id=0, Version=3.6}
    ToMemberSet=null
    NotifySent=false
    ToMember=Member(Id=0, Timestamp=2011-12-22 15:39:26.944, Address=172.23.34.91:8190, MachineId=39242, Location=site:dev.icd,machine:appsoad2-web2,process:24613, Role=WeblogicServer)
    SeniorMember=Member(Id=1, Timestamp=2011-12-22 15:22:53.032, Address=172.23.34.73:8090, MachineId=39241, Location=site:dev.icd,machine:appsoad3,process:19339,member:CoherenceServer1, Role=WeblogicWeblogicCacheServer)
    2011-12-22 15:40:02.915/392.773 Oracle Coherence GE 3.6.0.4 <Warning> (thread=Cluster, member=n/a): Received a discovery message that indicates the presence of an existing cluster:
    Message "NewMemberAnnounceWait"
    FromMember=Member(Id=2, Timestamp=2011-12-22 15:22:56.607, Address=172.23.34.74:8090, MachineId=39242, Location=site:dev.icd,machine:appsoad4,process:23937,member:CoherenceServer2, Role=WeblogicWeblogicCacheServer)
    FromMessageId=0
    Internal=false
    MessagePartCount=1
    PendingCount=0
    MessageType=9
    ToPollId=0
    Poll=null
    Packets

    Hi,
    By default, Coherence uses a multicast protocol to discover other nodes when forming a cluster. Since you are having difficulty establishing a cluster via multicast, can you please perform a multicast test to see whether multicast is configured properly?
    http://wiki.tangosol.com/display/COH32UG/Multicast+Test
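    For reference, the test ships in coherence.jar and can be run on each machine with something like the following (a sketch; the group address and TTL here are taken from your log, adjust to your configuration):
    java -cp coherence.jar com.tangosol.net.MulticastTest -group 231.1.1.50:7777 -ttl 4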
    Make sure you are using the same configuration files across the cluster members; all members of the cluster must specify the same cluster name in order to be allowed to join the cluster:
    <cluster-name system-property="tangosol.coherence.cluster">xxx</cluster-name>
    I would suggest trying the unicast listener with well-known addresses instead of the multicast listener.
    http://wiki.tangosol.com/display/COH32UG/well-known-addresses
    Add entries like the ones below to your tangosol override XML:
    <well-known-addresses>
      <socket-address id="1">
        <address>172.23.34.91</address>
        <port>8190</port>
      </socket-address>
      <socket-address id="2">
        <address>172.23.34.74</address>
        <port>8090</port>
      </socket-address>
    </well-known-addresses>
    This list is used by all other nodes to find their way into the cluster without the use of multicast, thus at least one well known node must be running for other nodes to be able to join.
    Hope this helps!!
    Thanks,
    Ashok.

  • Client application without joining the cluster

    Hi All,
    Is it possible for a client application to access a cache within a Coherence cluster when the client application is not a part of the cluster and wasn't started with any cache config files or anything else?
    The client application just uses:
    NamedCache cache = CacheFactory.getCache("VirtualCache");
    If a client application starts with a cache-config file, it will also join the cluster; in this case, will the JVM of the client app also be loaded/distributed/replicated with the cache contents?
    Please clarify my doubts.
    Regards
    Srinivas.

    The only clean way of NOT joining the cluster is to connect via Extend.
    You can join the cluster and specify the LocalStorage=false parameter; however, that is only applicable to distributed caches. Replicated cache data still exists on every node. A bigger issue, in my mind, is that your node will be actively managing the membership of other members in the cluster, and that can become a problem.
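    If you go the Extend route, the client is configured with a remote cache scheme instead of clustered schemes; below is a minimal sketch of a client-side cache config (the host, port, and service name are hypothetical, and a matching <proxy-scheme> must be running on the cluster side):
    <cache-config>
      <caching-scheme-mapping>
        <cache-mapping>
          <cache-name>VirtualCache</cache-name>
          <scheme-name>extend-remote</scheme-name>
        </cache-mapping>
      </caching-scheme-mapping>
      <caching-schemes>
        <remote-cache-scheme>
          <scheme-name>extend-remote</scheme-name>
          <service-name>ExtendTcpCacheService</service-name>
          <initiator-config>
            <tcp-initiator>
              <remote-addresses>
                <socket-address>
                  <address>proxy-host.example.com</address>
                  <port>9099</port>
                </socket-address>
              </remote-addresses>
            </tcp-initiator>
          </initiator-config>
        </remote-cache-scheme>
      </caching-schemes>
    </cache-config>
    The client then calls CacheFactory.getCache("VirtualCache") exactly as before, but holds no cache data and takes no part in cluster membership management.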
    Timur

  • Can the node's UID be different when re-joining the cluster?

    Let's say that we have a split-brain scenario, and when both islands meet again, one of them is "marked as invalid" and is restarted. Upon cluster restart, will the restarted nodes get new UID values?
    Thanks,
    -- Gato

    Gato,
    Absolutely! The UID is the member's identity, and if the cluster service restarts (within a given Java process), it will always be assigned a new and unique UID.
    Regards,
    Gene

  • Public Interface not responding after second node is started in the cluster

    Hi
    Has anyone ever experienced the public interface not responding between nodes in the cluster (ping, ssh, scp) after the second nodeapps is started in the cluster?
    This is a new install, so all I have installed so far is the base release of CRS 10.2.0. This is on Solaris 10. The vipca failed during the installation; however, I was able to proceed and manually add the nodeapps using srvctl add nodeapps -n -o -A.
    It seems after the second node is started I lose all connectivity to the public interfaces and to my default gateway.
    Also, I sometimes get the following messages after I try to stop the nodeapps and start them back up.
    CRS-1006: No more members to consider
    CRS-0215: Could not start resource 'ora.node1.vip'.
    Any suggestions on where I should start troubleshooting?
    Thanks

    Do you have a default gateway?
    The node can connect to the gateway, can't it?
    Check MetaLink:
    CRS-0215: Could not start resource 'ora..vip' [ID 356535.1]
    CRS-1006: No more members to consider when starting service [ID 465364.1]
    Good Luck

  • IP integration Node: failed to link the design

    I'm unable to generate supporting files when importing a VHDL file into IP integration node. I have seen people having the same issue because they used an unsupported OS but my OS is supported by Vivado 2014.4.
    I have: 
    NI LabView 2014 with all the modules and latest updates installed
    Xilinx Vivado 2014.4
    Windows 8.1 64-bit (which is supported by LabVIEW)
    I get the following error: 
    Waiting for 2 sub-compilation(s) to finish...
    ERROR: [XSIM 43-3238] Failed to link the design.
    Generated IP unsuccessfully. Your source file(s) can't work for the FPGA famili(es) you select. Fix the above error(s) or warning(s) and generate the IP again, or go back to previous page to reselect FPGA Family Support.
    It is worth mentioning that I have selected only one FPGA family (Zynq), which Vivado supports. The code that I'm importing is correct and very short, just an LED test which worked fine in simulations. What is causing this issue, and how may I solve it?

    Hi aan928
    You are correct in that Windows 8.1 does support LabVIEW 2014, but LabVIEW 2014 isn't FULLY supported on it (as in modules and drivers), FPGA being one of them. The compiler tools that we depend on Xilinx for unfortunately aren't supported at this moment in time. Although some FPGA compilation may work, not all compilations will. Unfortunately there is no timeline for how Xilinx & NI will fully implement support for specific modules with Windows 8.0 & 8.1.
    Please reference the document below to see a full list of compatibilities between NI tools & Windows 8.0, which may provide some help:
    http://www.ni.com/white-paper/14281/en/
    Regards

    J Hird CLAD
    NI UK Applications Engineer

  • Replicating a session to a server joining the cluster

    It is my understanding that session data is replicated only during calls to HttpSession.setAttribute() and HttpSession.removeAttribute(). If a user has established a session on server A (secondary on server B) and server B goes down, then an incomplete secondary is created on some other server. If server A fails before this user accidentally causes all of the session data to be replicated, then some requests to the new managed server will work and some will not, creating erratic errors.

    Yes, the replication behaviour in 6.x is different from 5.1:
    "I believe that we "lazily" replicate sessions in 6.x - that is, until you access the session, it is not replicated. The reason for this is performance. Imagine if node1 had thousands of sessions when node2 was down. If we did not do session replication lazily, then as soon as node2 was viable, it would be flooded with thousands of requests to replicate the existing sessions. I believe this is the reason the change was made in the architecture between 5.1 and 6.x. There's always a trade-off. Some customers want performance and some customers want reliability. It looks like either we can choose the reliable behavior as same as WLS 5.1 does, or explain the "lazy" replication of HttpSessions in our edocs."
    If this behaviour is not acceptable for you, please contact support and raise your voice. Also reference CR070084.
    Kumar
    "Russ Hampleman" <[email protected]> wrote in message news:[email protected]...
    > This is the exact behavior I am seeing with Weblogic 6.1sp1. This is not the case with Weblogic 5.1sp8. Is this a bug with 6.1?
    >
    > Ben Valentine <[email protected]> wrote:
    > > It is my understanding that session data is replicated only during calls to HttpSession.setAttribute() and HttpSession.removeAttribute(). If a user has established a session on server A (secondary on server B) and server B goes down, then an incomplete secondary is created on some other server. If server A fails before this user accidentally causes all of the session data to be replicated, then some requests to the new managed server will work and some will not, creating erratic errors.
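    A common defensive pattern against the lazy behaviour described above is to re-set any attribute you mutate, so every mutating request forces replication; a minimal servlet sketch (the class and attribute names are hypothetical):
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import javax.servlet.http.HttpSession;
    public class CartServlet extends HttpServlet {
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            HttpSession session = req.getSession();
            List<String> cart = (List<String>) session.getAttribute("cart");
            if (cart == null) {
                cart = new ArrayList<String>();
            }
            cart.add("widget");
            // Mutating the list in place is invisible to the replication machinery;
            // re-setting the attribute marks the session dirty so the change is
            // replicated to the secondary as part of this request.
            session.setAttribute("cart", cart);
        }
    }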

  • Node does not join cluster upon reboot

    Hi Guys,
    I have two servers [Sun Fire X4170] clustered together using Solaris Cluster 3.3 for Oracle Database. They are connected to shared storage, which is a Dell EqualLogic [iSCSI] array. Lately I have run into a weird problem: both nodes come up fine and join the cluster when rebooted together; however, when I reboot only one of the nodes, it does not join the cluster and shows the following errors.
    This is happening on both nodes [if I reboot only one node at a time]. But if I reboot both nodes at the same time, they successfully join the cluster and everything runs fine.
    Below is the output from the node which I rebooted; it did not join the cluster and spat out the following errors. The other node is running fine with all the services.
    In order to get out of this situation, I have to reboot both nodes together.
    # dmesg output #
    Apr 23 17:37:03 srvhqon11 ixgbe: [ID 611667 kern.info] NOTICE: ixgbe2: link down
    Apr 23 17:37:12 srvhqon11 iscsi: [ID 933263 kern.notice] NOTICE: iscsi connection(5) unable to connect to target SENDTARGETS_DISCOVERY
    Apr 23 17:37:12 srvhqon11 iscsi: [ID 114404 kern.notice] NOTICE: iscsi discovery failure - SendTargets (010.010.017.104)
    Apr 23 17:37:13 srvhqon11 iscsi: [ID 240218 kern.notice] NOTICE: iscsi session(9) iqn.2001-05.com.equallogic:0-8a0906-96cf73708-ef30000005e50a1b-sblprdbk online
    Apr 23 17:37:13 srvhqon11 scsi: [ID 583861 kern.info] sd11 at scsi_vhci0: unit-address g6090a0887073cf961b0ae505000030ef: g6090a0887073cf961b0ae505000030ef
    Apr 23 17:37:13 srvhqon11 genunix: [ID 936769 kern.info] sd11 is /scsi_vhci/disk@g6090a0887073cf961b0ae505000030ef
    Apr 23 17:37:13 srvhqon11 scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
    Apr 23 17:37:13 srvhqon11 /scsi_vhci/disk@g6090a0887073cf961b0ae505000030ef (sd11): Command failed to complete (3) on path iscsi0/[email protected]:0-8a0906-96cf73708-ef30000005e50a1b-sblprdbk0001,0
    Apr 23 17:46:54 srvhqon11 svc.startd[11]: [ID 122153 daemon.warning] svc:/network/iscsi/initiator:default: Method or service exit timed out. Killing contract 41.
    Apr 23 17:46:54 srvhqon11 svc.startd[11]: [ID 636263 daemon.warning] svc:/network/iscsi/initiator:default: Method "/lib/svc/method/iscsid start" failed due to signal KILL.
    Apr 23 17:46:54 srvhqon11 svc.startd[11]: [ID 748625 daemon.error] network/iscsi/initiator:default failed repeatedly: transitioned to maintenance (see 'svcs -xv' for details)
    Apr 24 14:50:16 srvhqon11 svc.startd[11]: [ID 694882 daemon.notice] instance svc:/system/console-login:default exited with status 1
    root@srvhqon11 # svcs -xv
    svc:/system/cluster/loaddid:default (Oracle Solaris Cluster loaddid)
    State: offline since Tue Apr 23 17:46:54 2013
    Reason: Start method is running.
    See: http://sun.com/msg/SMF-8000-C4
    See: /var/svc/log/system-cluster-loaddid:default.log
    Impact: 49 dependent services are not running:
    svc:/system/cluster/bootcluster:default
    svc:/system/cluster/cl_execd:default
    svc:/system/cluster/zc_cmd_log_replay:default
    svc:/system/cluster/sc_zc_member:default
    svc:/system/cluster/sc_rtreg_server:default
    svc:/system/cluster/sc_ifconfig_server:default
    svc:/system/cluster/initdid:default
    svc:/system/cluster/globaldevices:default
    svc:/system/cluster/gdevsync:default
    svc:/milestone/multi-user:default
    svc:/system/boot-config:default
    svc:/system/cluster/cl-svc-enable:default
    svc:/milestone/multi-user-server:default
    svc:/application/autoreg:default
    svc:/system/basicreg:default
    svc:/system/zones:default
    svc:/system/cluster/sc_zones:default
    svc:/system/cluster/scprivipd:default
    svc:/system/cluster/cl-svc-cluster-milestone:default
    svc:/system/cluster/sc_svtag:default
    svc:/system/cluster/sckeysync:default
    svc:/system/cluster/rpc-fed:default
    svc:/system/cluster/rgm-starter:default
    svc:/application/management/common-agent-container-1:default
    svc:/system/cluster/scsymon-srv:default
    svc:/system/cluster/sc_syncsa_server:default
    svc:/system/cluster/scslmclean:default
    svc:/system/cluster/cznetd:default
    svc:/system/cluster/scdpm:default
    svc:/system/cluster/rpc-pmf:default
    svc:/system/cluster/pnm:default
    svc:/system/cluster/sc_pnm_proxy_server:default
    svc:/system/cluster/cl-event:default
    svc:/system/cluster/cl-eventlog:default
    svc:/system/cluster/cl-ccra:default
    svc:/system/cluster/ql_upgrade:default
    svc:/system/cluster/mountgfs:default
    svc:/system/cluster/clusterdata:default
    svc:/system/cluster/ql_rgm:default
    svc:/system/cluster/scqdm:default
    svc:/application/stosreg:default
    svc:/application/sthwreg:default
    svc:/application/graphical-login/cde-login:default
    svc:/application/cde-printinfo:default
    svc:/system/cluster/scvxinstall:default
    svc:/system/cluster/sc_failfast:default
    svc:/system/cluster/clexecd:default
    svc:/system/cluster/sc_pmmd:default
    svc:/system/cluster/clevent_listenerd:default
    svc:/application/print/server:default (LP print server)
    State: disabled since Tue Apr 23 17:36:44 2013
    Reason: Disabled by an administrator.
    See: http://sun.com/msg/SMF-8000-05
    See: man -M /usr/share/man -s 1M lpsched
    Impact: 2 dependent services are not running:
    svc:/application/print/rfc1179:default
    svc:/application/print/ipp-listener:default
    svc:/network/iscsi/initiator:default (?)
    State: maintenance since Tue Apr 23 17:46:54 2013
    Reason: Restarting too quickly.
    See: http://sun.com/msg/SMF-8000-L5
    See: /var/svc/log/network-iscsi-initiator:default.log
    Impact: This service is not running.
    ######## Cluster Status from working node ############
    root@srvhqon10 # cluster status
    === Cluster Nodes ===
    --- Node Status ---
    Node Name Status
    srvhqon10 Online
    srvhqon11 Offline
    === Cluster Transport Paths ===
    Endpoint1 Endpoint2 Status
    srvhqon10:igb3 srvhqon11:igb3 faulted
    srvhqon10:igb2 srvhqon11:igb2 faulted
    === Cluster Quorum ===
    --- Quorum Votes Summary from (latest node reconfiguration) ---
    Needed Present Possible
    2 2 3
    --- Quorum Votes by Node (current status) ---
    Node Name Present Possible Status
    srvhqon10 1 1 Online
    srvhqon11 0 1 Offline
    --- Quorum Votes by Device (current status) ---
    Device Name Present Possible Status
    d2 1 1 Online
    === Cluster Device Groups ===
    --- Device Group Status ---
    Device Group Name Primary Secondary Status
    --- Spare, Inactive, and In Transition Nodes ---
    Device Group Name Spare Nodes Inactive Nodes In Transistion Nodes
    --- Multi-owner Device Group Status ---
    Device Group Name Node Name Status
    === Cluster Resource Groups ===
    Group Name Node Name Suspended State
    ora-rg srvhqon10 No Online
    srvhqon11 No Offline
    nfs-rg srvhqon10 No Online
    srvhqon11 No Offline
    backup-rg srvhqon10 No Online
    srvhqon11 No Offline
    === Cluster Resources ===
    Resource Name Node Name State Status Message
    ora-listener srvhqon10 Online Online
    srvhqon11 Offline Offline
    ora-server srvhqon10 Online Online
    srvhqon11 Offline Offline
    ora-stor srvhqon10 Online Online
    srvhqon11 Offline Offline
    ora-lh srvhqon10 Online Online - LogicalHostname online.
    srvhqon11 Offline Offline
    nfs-rs srvhqon10 Online Online - Service is online.
    srvhqon11 Offline Offline
    nfs-stor-rs srvhqon10 Online Online
    srvhqon11 Offline Offline
    nfs-lh-rs srvhqon10 Online Online - LogicalHostname online.
    srvhqon11 Offline Offline
    backup-stor srvhqon10 Online Online
    srvhqon11 Offline Offline
    cluster: (C383355) No response from daemon on node "srvhqon11".
    === Cluster DID Devices ===
    Device Instance Node Status
    /dev/did/rdsk/d1 srvhqon10 Ok
    /dev/did/rdsk/d2 srvhqon10 Ok
    srvhqon11 Unknown
    /dev/did/rdsk/d3 srvhqon10 Ok
    srvhqon11 Unknown
    /dev/did/rdsk/d4 srvhqon10 Ok
    /dev/did/rdsk/d5 srvhqon10 Fail
    srvhqon11 Unknown
    /dev/did/rdsk/d6 srvhqon11 Unknown
    /dev/did/rdsk/d7 srvhqon11 Unknown
    /dev/did/rdsk/d8 srvhqon10 Ok
    srvhqon11 Unknown
    /dev/did/rdsk/d9 srvhqon10 Ok
    srvhqon11 Unknown
    === Zone Clusters ===
    --- Zone Cluster Status ---
    Name Node Name Zone HostName Status Zone Status
    Regards.

    Check whether your global devices are mounted properly:
    # cat /etc/mnttab | grep -i global
    Check whether the proper entries are present on both systems:
    # cat /etc/vfstab | grep -i global
    Give the output for the quorum devices:
    # scstat -q
    or
    # clquorum list -v
    Also check why your iSCSI initiator service is going offline unexpectedly:
    # vi /var/svc/log/network-iscsi-initiator:default.log

  • ISE PSN node won't join cluster

    Hi All,
    Has anyone seen an issue where a PSN can't join the cluster?
    We join the PSN node:
    - Node is registered successfully (sync in progress)
    - 1 hr later - Replication to node failed
    - Replication sync failed because the secondary database is down
    I have a customer where the admin node and PSN are separated by a firewall.
    We allow in both directions (Admin <--> PSN):
    ICMP
    HTTPS
    1521
    The firewall is not showing drops.
    DNS and NTP are OK.
    The current topology is 1 PSN, 1 admin node.
    It works fine in our test lab, but not in the customer's environment.
    Cheers
    Peter.

    You will probably need more ports opened between the PSN and the admin node beyond your current rules; you might want to add syslog (UDP 20514) as well.
    Also, what type of firewall are you using? If it's an ASA, what happens if you run packet tracer and/or a packet capture? Is the flow allowed through, and do you see the packets in the capture?
    Last but not least, can you confirm that the DB service is running on the secondary node? From the CLI run "show application status ise". If it is not running, either restart the node or just issue "application start ise".

  • Cluster node fails after testing removing both interconnects in a two node

    Hi,
    The cluster node panics and fails to join the cluster after we test removing both interconnects in a two-node cluster. The cluster is up on one node, but the panicked node fails to rejoin the cluster, saying there is not sufficient quorum yet and that both cluster interconnects have failed (even after reconnecting the interconnects). The quorum device used is a shared disk.
    Is this a bug?
    Any workaround or solution?
    Cluster is 3.2 SPARC
    Thanking you
    Ushas Symon

    Sounds like a networking problem to me. If the failed node genuinely can't communicate with the remaining node, it will not be allowed to join the cluster, hence the quorum message. I would suspect one of:
    * Misconnected cables
    * A switch that has blocked or disabled the port
    * A failed auto-negotiation
    This is of course without knowing anything about what your network infrastructure actually is!
    Tim
    ---

  • Node can not join cluster after RAC HA Testing

    Dear forum,
    We are performing RAC failover tests according to the document "RAC System Test Plan Outline 11gR2, Version 2.0". In test case #14 - Interconnect network failure (11.2.0.2 and higher) - we disabled the private interconnect network of node node1 (the OCR master).
    Then - as expected - node node2 was evicted. Now, after re-enabling the private interconnect network on node node1, I want to start CRS again on node2. However, the node does not join the cluster, logging these messages:
    2012-03-15 14:12:35.138: [ CSSD][1113114944]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
    2012-03-15 14:12:35.371: [ CSSD][1109961024]clssnmvDHBValidateNCopy: node 1, node1, has a disk HB, but no network HB, DHB has rcfg 226493542, wrtcnt, 2301201, LATS 5535614, lastSeqNo 2301198, uniqueness 1331804892, timestamp 1331817153/13040714
    2012-03-15 14:12:35.479: [ CSSD][1100884288]clssnmvDHBValidateNCopy: node 1, node1, has a disk HB, but no network HB, DHB has rcfg 226493542, wrtcnt, 2301202, LATS 5535724, lastSeqNo 2301199, uniqueness 1331804892, timestamp 1331817154/13041024
    2012-03-15 14:12:35.675: [ CSSD][1080801600]clssnmvDHBValidateNCopy: node 1, node1, has a disk HB, but no network HB, DHB has rcfg 226493542, wrtcnt, 2301203, LATS 5535924, lastSeqNo 2301200, uniqueness 1331804892, timestamp 1331817154/13041364
    Rebooting node2 did not help. Node1 was online all the time (although its private interconnect interface was unplugged for a few minutes and then plugged back in). I suppose that if we reboot node1 as well, the problem will disappear, but there should be a solution which keeps the availability requirements.
    Setup:
    2 Nodes (OEL5U7, UEK)
    2 Storages
    Network bonding via Linux bonding
    GI 11.2.0.3.1
    RDBMS 11.1.0.7.10
    Any ideas?
    Regards,
    Martin

    I have found a solution myself: remove and re-add both slave interfaces of the interconnect bond on node1 (echo -ethX detaches a slave from the bond, echo +ethX re-attaches it):
    [root@node1 trace]# echo -eth3 > /sys/class/net/bond1/bonding/slaves
    [root@node1 trace]# echo -eth1 > /sys/class/net/bond1/bonding/slaves
    [root@node1 trace]# echo +eth1 > /sys/class/net/bond1/bonding/slaves
    [root@node1 trace]# echo +eth3 > /sys/class/net/bond1/bonding/slaves
    Now node2 is automatically joining the cluster.
    Regards,
    martin

  • How to fix? Please advise: in Adobe LiveCycle ES2, a JBoss (4.2.1.GA) node is unable to join the cluster after restart

    Hi Team,
    We are using Adobe LiveCycle ES2 with JBoss (4.2.1.GA) on Windows.
    We are facing an issue every time we restart JBoss: the node comes up after the restart but is unable to join the cluster.
    We are getting the error below in the JBoss server.log:
    2014-07-18 00:25:37,206 WARN [org.jgroups.protocols.pbcast.GMS] join(10.183.100.39:61469) sent to 10.183.100.39:64118 timed out, retrying
    2014-07-18 00:25:44,206 WARN  [org.jgroups.protocols.pbcast.GMS] join(10.183.100.39:61469) sent to 10.183.100.39:64118 timed out, retrying
    2014-07-18 00:25:51,206 WARN [org.jgroups.protocols.pbcast.GMS] join(10.183.100.39:61469) sent to 10.183.100.39:64118 timed out, retrying
    2014-07-18 00:25:58,207 WARN [org.jgroups.protocols.pbcast.GMS] join(10.183.100.39:61469) sent to 10.183.100.39:64118 timed out, retrying
    2014-07-18 00:26:05,207 WARN [org.jgroups.protocols.pbcast.GMS] join(10.183.100.39:61469) sent to 10.183.100.39:64118 timed out, retrying
    Could you please advise on this?
    Thanks.

    My apologies about the wall of text. After I made my original post, I thought maybe it would have been better to go back and put it in a pastebin instead, but I was not able to edit that post once I sent it.
    In regard to your question, the permissions on the
    /Library/LaunchAgents/com.adobe.AAM.Updater-1.0.plist file are "read and write" for system, wheel, and everyone.
