Cluster Node Panics when other node is rebooted

Hello
We have a two-node Sun Cluster 3.2u1 cluster that uses a shared SAN LUN as its quorum device. Both nodes can see the LUN, and during normal operation everything is fine.
When one of the nodes is rebooted, or even crashes (for whatever reason), the other node goes into panic mode and also reboots, causing loads of issues as you can imagine.
Looking at the logs from the time of the crash, my guess is that the cluster doesn't like the rebooting node losing access to its quorum device and so issues a panic to all nodes of the cluster. Does that sound reasonable?
Can anybody offer any advice on how to proceed with this one, or has anyone had a similar experience? (It might not even be the issue I described above.)
Also, how do I read the /var/cluster/log/eventlog file in a reasonable format?
thanks
Simon.

Node 1 (the node that panicked):
Jun 14 14:17:31 ussaplon01 cl_dlpitrans: [ID 624622 kern.notice] Notifying cluster that this node is panicking
Jun 14 14:17:31 ussaplon01 unix: [ID 836849 kern.notice]
Jun 14 14:17:31 ussaplon01 ^Mpanic[cpu4]/thread=fffffe8003b38c80:
Jun 14 14:17:31 ussaplon01 genunix: [ID 137713 kern.notice] free: freeing free frag, dev:0xef00000102, blk:26834, cg:0, ino:1184, fs:/sapmnt/P30
Jun 14 14:17:31 ussaplon01 unix: [ID 100000 kern.notice]
Jun 14 14:17:31 ussaplon01 genunix: [ID 655072 kern.notice] fffffe8003b386a0 genunix:vcmn_err+13 ()
Jun 14 14:17:31 ussaplon01 genunix: [ID 655072 kern.notice] fffffe8003b386d0 ufs:real_panic_v+120 ()
Jun 14 14:17:31 ussaplon01 genunix: [ID 655072 kern.notice] fffffe8003b38720 ufs:ufs_fault_v+b6 ()
Jun 14 14:17:31 ussaplon01 genunix: [ID 655072 kern.notice] fffffe8003b38800 ufs:ufs_fault+9b ()
Jun 14 14:17:31 ussaplon01 genunix: [ID 655072 kern.notice] fffffe8003b388b0 ufs:free+635 ()
Jun 14 14:17:31 ussaplon01 genunix: [ID 655072 kern.notice] fffffe8003b38b00 ufs:ufs_itrunc+510 ()
Jun 14 14:17:31 ussaplon01 genunix: [ID 655072 kern.notice] fffffe8003b38b70 ufs:ufs_trans_itrunc+9e ()
Jun 14 14:17:31 ussaplon01 genunix: [ID 655072 kern.notice] fffffe8003b38c00 ufs:ufs_delete+239 ()
Jun 14 14:17:31 ussaplon01 genunix: [ID 655072 kern.notice] fffffe8003b38c60 ufs:ufs_thread_delete+ba ()
Jun 14 14:17:31 ussaplon01 genunix: [ID 655072 kern.notice] fffffe8003b38c70 unix:thread_start+8 ()
Jun 14 14:17:31 ussaplon01 unix: [ID 100000 kern.notice]
Jun 14 14:17:31 ussaplon01 genunix: [ID 672855 kern.notice] syncing file systems...
Node 2:
Jun 14 14:17:31 ussaplon02 genunix: [ID 489438 kern.notice] NOTICE: clcomm: Path ussaplon02:e1000g6 - ussaplon01:e1000g6 being drained
Jun 14 14:17:31 ussaplon02 genunix: [ID 489438 kern.notice] NOTICE: clcomm: Path ussaplon02:e1000g4 - ussaplon01:e1000g4 being drained
Jun 14 14:17:31 ussaplon02 scsi_vhci: [ID 734749 kern.warning] WARNING: vhci_scsi_reset 0x0
Jun 14 14:17:37 ussaplon02 genunix: [ID 250885 kern.notice] NOTICE: CMM: Quorum device /dev/did/rdsk/d3s2: owner set to node 1.
Jun 14 14:17:37 ussaplon02 genunix: [ID 446068 kern.notice] NOTICE: CMM: Node ussaplon01 (nodeid = 2) is down.
Jun 14 14:17:37 ussaplon02 genunix: [ID 108990 kern.notice] NOTICE: CMM: Cluster members: ussaplon02.
Jun 14 14:17:37 ussaplon02 Cluster.RGM.rgmd: [ID 446068 daemon.notice] CMM: Node ussaplon01 (nodeid = 2) is down.
Jun 14 14:17:37 ussaplon02 genunix: [ID 279084 kern.notice] NOTICE: CMM: node reconfiguration #9 completed.

Similar Messages

  • Cluster node reboots repeatedly

    We have a two-node 10.1.0.3 cluster setup. We had a problem with an HBA card for the Fibre Channel connection to the SAN, and after replacing it, one of the cluster nodes keeps rebooting itself right after the cluster processes start up.
    We have had this issue once before, and Support suggested the following. However, the same solution is not working this time around. Any ideas?
    Check that the output of the Unix command hostname is node1.
    Rename the cssnorun file in the /etc/oracle/scls_scr/node1/root directory. Issue "touch /etc/oracle/scls_scr/node1/root/crsdboot" and change the permission and ownership of the file to match those on node 2. Check whether there are any differences in permissions, ownership, or group for any files or directories under /etc/oracle between the two nodes.
    Reboot node 1 after this change and see if you run into the same problem.
    Check whether there are any /tmp/crsctl* files.

    Well, especially if you are on Linux RH4, the new controller card will have caused the device names to change. Check that out. It could be that you are no longer seeing your voting and CRS partitions. This can happen on other operating systems too if the devices get new names because the controller card has changed.
    For Linux, try the man pages on udev, and search for udev on OTN.
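    As a hedged illustration of that advice, you can compare a disk's persistent SCSI identity against what Clusterware expects; the device name below is a placeholder, the scsi_id syntax shown is the RHEL 4/5 style, and the query commands are 10.2-style:
    # Print the persistent SCSI identifier for the suspected vote/CRS disk
    /sbin/scsi_id -g -u -s /block/sdb
    # Compare against where Clusterware thinks the OCR and voting disks live
    cat /etc/oracle/ocr.loc
    $ORA_CRS_HOME/bin/crsctl query css votedisk
    If the identifiers no longer line up with the configured paths, a udev rule (see the man pages mentioned above) can pin stable names to the devices.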
    Regards

  • OES2 SP2a cluster node freeze

    Hi all.
    I have a 3-node cluster based on OES2 SP2a, fully patched. There are a couple of resources: Master_IP and an NSS volume.
    The cluster is virtualized on ESXi 4.1, fully patched, and vmware-tools are installed and up to date.
    If I do an "rcnetwork stop" on a node, it remains with no network for about 20 seconds and then freezes. It does not reboot; it only freezes. The resource fails over correctly, but the server remains hung.
    This behaviour is the same on a server with a cluster resource on it and on a server with no cluster resource on it. It always hangs.
    The correct behaviour should be a reboot, shouldn't it?
    Any hints?
    Thanks in advance.

    The node does not reboot because ....
    9.11 Preventing a Cluster Node Reboot after a Node Shutdown
    If LAN connectivity is lost between a cluster node and the other nodes in the cluster, it is possible that the lost node will be automatically shut down by the other cluster nodes. This is normal cluster operating behavior, and it prevents the lost node from trying to load cluster resources because it cannot detect the other cluster nodes. By default, cluster nodes are configured to reboot after an automatic shutdown.
    On certain occasions, you might want to prevent a downed cluster node from rebooting so you can troubleshoot problems.
    Section 9.11.1, OES 2 SP2 with Patches and Later
    Section 9.11.2, OES 2 SP2 Release Version and Earlier
    9.11.1 OES 2 SP2 with Patches and Later
    Beginning in the OES 2 SP2 Maintenance Patch for May 2010, the Novell Cluster Services reboot behavior conforms to the kernel panic setting for the Linux operating system. By default the kernel panic setting is set for no reboot after a node shutdown.
    You can set the kernel panic behavior in the /etc/sysctl.conf file by adding a kernel.panic command line. Set the value to 0 for no reboot after a node shutdown. Set the value to a positive integer value to indicate that the server should be rebooted after waiting the specified number of seconds. For information about the Linux sysctl, see the Linux man pages on sysctl and sysctl.conf.
    1. As the root user, open the /etc/sysctl.conf file in a text editor.
    2. If the kernel.panic token is not present, add it:
    kernel.panic = 0
    3. Set the kernel.panic value to 0 or to a positive integer value, depending on the desired behavior.
    No Reboot: To prevent an automatic cluster reboot after a node shutdown, set the kernel.panic token to 0. This allows the administrator to determine what caused the kernel panic condition before manually rebooting the server. This is the recommended setting.
    kernel.panic = 0
    Reboot: To allow a cluster node to reboot automatically after a node shutdown, set the kernel.panic token to a positive integer value that represents the number of seconds to delay the reboot.
    kernel.panic = <seconds>
    For example, to wait 1 minute (60 seconds) before rebooting the server, specify the following:
    kernel.panic = 60
    4. Save your changes.
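    As a small aside (not part of the quoted Novell documentation), the same setting can be applied and verified on a running node with sysctl:
    # Apply the new value immediately, without a reboot, then verify it
    sysctl -w kernel.panic=0
    cat /proc/sys/kernel/panic
    # Or re-read /etc/sysctl.conf so the persistent setting takes effect
    sysctl -p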
    9.11.2 OES 2 SP2 Release Version and Earlier
    In the OES 2 SP2 release version and earlier, you can modify the /opt/novell/ncs/bin/ldncs file for the cluster so the server does not automatically reboot after a shutdown.
    1. Open the /opt/novell/ncs/bin/ldncs file in a text editor.
    2. Find the following line:
    echo -n $TOLERANCE > /proc/sys/kernel/panic
    3. Replace $TOLERANCE with a value of 0 to cause the server to not automatically reboot after a shutdown.
    4. After editing the ldncs file, reboot the server to make the change take effect.

  • After reboot cluster node went into maintenance mode (CONTROL-D)

    Hi there!
    I have configured a 2-node cluster on 2 x Sun Enterprise 220R and a StorEdge D1000.
    Each time I reboot any of the cluster nodes, I get the following error during boot-up:
    The / file system (/dev/rdsk/c0t1d0s0) is being checked.
    /dev/rdsk/c0t1d0s0: UNREF DIR I=35540 OWNER=root MODE=40755
    /dev/rdsk/c0t1d0s0: SIZE=512 MTIME=Jun 5 15:02 2006 (CLEARED)
    /dev/rdsk/c0t1d0s0: UNREF FILE I=1192311 OWNER=root MODE=100600
    /dev/rdsk/c0t1d0s0: SIZE=96 MTIME=Jun 5 13:23 2006 (RECONNECTED)
    /dev/rdsk/c0t1d0s0: LINK COUNT FILE I=1192311 OWNER=root MODE=100600
    /dev/rdsk/c0t1d0s0: SIZE=96 MTIME=Jun 5 13:23 2006 COUNT 0 SHOULD BE 1
    /dev/rdsk/c0t1d0s0: LINK COUNT INCREASING
    /dev/rdsk/c0t1d0s0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
    In maintenance mode I do:
    # fsck -y -F ufs /dev/rdsk/c0t1d0s0
    and it manages to correct the problem ... but the problem occurs again after each reboot, on each cluster node!
    I have installed Sun Cluster 3.1 on Solaris 9 SPARC.
    How can i get rid of it?
    Any ideas?
    Brgds,
    Sergej

    Hi, I get this:
    112941-09 SunOS 5.9: sysidnet Utility Patch
    116755-01 SunOS 5.9: usr/snadm/lib/libadmutil.so.2 Patch
    113434-30 SunOS 5.9: /usr/snadm/lib Library and Differential Flash Patch
    112951-13 SunOS 5.9: patchadd and patchrm Patch
    114711-03 SunOS 5.9: usr/sadm/lib/diskmgr/VDiskMgr.jar Patch
    118064-04 SunOS 5.9: Admin Install Project Manager Client Patch
    113742-01 SunOS 5.9: smcpreconfig.sh Patch
    113813-02 SunOS 5.9: Gnome Integration Patch
    114501-01 SunOS 5.9: drmproviders.jar Patch
    112943-09 SunOS 5.9: Volume Management Patch
    113799-01 SunOS 5.9: solregis Patch
    115697-02 SunOS 5.9: mtmalloc lib Patch
    113029-06 SunOS 5.9: libaio.so.1 librt.so.1 and abi_libaio.so.1 Patch
    113981-04 SunOS 5.9: devfsadm Patch
    116478-01 SunOS 5.9: usr platform links Patch
    112960-37 SunOS 5.9: patch libsldap ldap_cachemgr libldap
    113332-07 SunOS 5.9: libc_psr.so.1 Patch
    116500-01 SunOS 5.9: SVM auto-take disksets Patch
    114349-04 SunOS 5.9: sbin/dhcpagent Patch
    120441-03 SunOS 5.9: libsec patch
    114344-19 SunOS 5.9: kernel/drv/arp Patch
    114373-01 SunOS 5.9: UMEM - abi_libumem.so.1 patch
    118558-27 SunOS 5.9: Kernel Patch
    115675-01 SunOS 5.9: /usr/lib/liblgrp.so Patch
    112958-04 SunOS 5.9: patch pci.so
    113451-11 SunOS 5.9: IKE Patch
    112920-02 SunOS 5.9: libipp Patch
    114372-01 SunOS 5.9: UMEM - llib-lumem patch
    116229-01 SunOS 5.9: libgen Patch
    116178-01 SunOS 5.9: libcrypt Patch
    117453-01 SunOS 5.9: libwrap Patch
    114131-03 SunOS 5.9: multi-terabyte disk support - libadm.so.1 patch
    118465-02 SunOS 5.9: rcm_daemon Patch
    113490-04 SunOS 5.9: Audio Device Driver Patch
    114926-02 SunOS 5.9: kernel/drv/audiocs Patch
    113318-25 SunOS 5.9: patch /kernel/fs/nfs and /kernel/fs/sparcv9/nfs
    113070-01 SunOS 5.9: ftp patch
    114734-01 SunOS 5.9: /usr/ccs/bin/lorder Patch
    114227-01 SunOS 5.9: yacc Patch
    116546-07 SunOS 5.9: CDRW DVD-RW DVD+RW Patch
    119494-01 SunOS 5.9: mkisofs patch
    113471-09 SunOS 5.9: truss Patch
    114718-05 SunOS 5.9: usr/kernel/fs/pcfs Patch
    115545-01 SunOS 5.9: nss_files patch
    115544-02 SunOS 5.9: nss_compat patch
    118463-01 SunOS 5.9: du Patch
    116016-03 SunOS 5.9: /usr/sbin/logadm patch
    115542-02 SunOS 5.9: nss_user patch
    116014-06 SunOS 5.9: /usr/sbin/usermod patch
    116012-02 SunOS 5.9: ps utility patch
    117433-02 SunOS 5.9: FSS FX RT Patch
    117431-01 SunOS 5.9: nss_nis Patch
    115537-01 SunOS 5.9: /kernel/strmod/ptem patch
    115336-03 SunOS 5.9: /usr/bin/tar, /usr/sbin/static/tar Patch
    117426-03 SunOS 5.9: ctsmc and sc_nct driver patch
    121319-01 SunOS 5.9: devfsadmd_mod.so Patch
    121316-01 SunOS 5.9: /kernel/sys/doorfs Patch
    121314-01 SunOS 5.9: tl driver patch
    116554-01 SunOS 5.9: semsys Patch
    112968-01 SunOS 5.9: patch /usr/bin/renice
    116552-01 SunOS 5.9: su Patch
    120445-01 SunOS 5.9: Toshiba platform token links (TSBW,Ultra-3i)
    112964-15 SunOS 5.9: /usr/bin/ksh Patch
    112839-08 SunOS 5.9: patch libthread.so.1
    115687-02 SunOS 5.9:/var/sadm/install/admin/default Patch
    115685-01 SunOS 5.9: sbin/netstrategy Patch
    115488-01 SunOS 5.9: patch /kernel/misc/busra
    115681-01 SunOS 5.9: usr/lib/fm/libdiagcode.so.1 Patch
    113032-03 SunOS 5.9: /usr/sbin/init Patch
    113031-03 SunOS 5.9: /usr/bin/edit Patch
    114259-02 SunOS 5.9: usr/sbin/psrinfo Patch
    115878-01 SunOS 5.9: /usr/bin/logger Patch
    116543-04 SunOS 5.9: vmstat Patch
    113580-01 SunOS 5.9: mount Patch
    115671-01 SunOS 5.9: mntinfo Patch
    113977-01 SunOS 5.9: awk/sed pkgscripts Patch
    122716-01 SunOS 5.9: kernel/fs/lofs patch
    113973-01 SunOS 5.9: adb Patch
    122713-01 SunOS 5.9: expr patch
    117168-02 SunOS 5.9: mpstat Patch
    116498-02 SunOS 5.9: bufmod Patch
    113576-01 SunOS 5.9: /usr/bin/dd Patch
    116495-03 SunOS 5.9: specfs Patch
    117160-01 SunOS 5.9: /kernel/misc/krtld patch
    118586-01 SunOS 5.9: cp/mv/ln Patch
    120025-01 SunOS 5.9: ipsecconf Patch
    116527-02 SunOS 5.9: timod Patch
    117155-08 SunOS 5.9: pcipsy Patch
    114235-01 SunOS 5.9: libsendfile.so.1 Patch
    117152-01 SunOS 5.9: magic Patch
    116486-03 SunOS 5.9: tsalarm Driver Patch
    121998-01 SunOS 5.9: two-key mode fix for 3DES Patch
    116484-01 SunOS 5.9: consconfig Patch
    116482-02 SunOS 5.9: modload Utils Patch
    117746-04 SunOS 5.9: patch platform/sun4u/kernel/drv/sparcv9/pic16f819
    121992-01 SunOS 5.9: fgrep Patch
    120768-01 SunOS 5.9: grpck patch
    119438-01 SunOS 5.9: usr/bin/login Patch
    114389-03 SunOS 5.9: devinfo Patch
    116510-01 SunOS 5.9: wscons Patch
    114224-05 SunOS 5.9: csh Patch
    116670-04 SunOS 5.9: gld Patch
    114383-03 SunOS 5.9: Enchilada/Stiletto - pca9556 driver
    116506-02 SunOS 5.9: traceroute patch
    112919-01 SunOS 5.9: netstat Patch
    112918-01 SunOS 5.9: route Patch
    112917-01 SunOS 5.9: ifrt Patch
    117132-01 SunOS 5.9: cachefsstat Patch
    114370-04 SunOS 5.9: libumem.so.1 patch
    114010-02 SunOS 5.9: m4 Patch
    117129-01 SunOS 5.9: adb Patch
    117483-01 SunOS 5.9: ntwdt Patch
    114369-01 SunOS 5.9: prtvtoc patch
    117125-02 SunOS 5.9: procfs Patch
    117480-01 SunOS 5.9: pkgadd Patch
    112905-02 SunOS 5.9: ippctl Patch
    117123-06 SunOS 5.9: wanboot Patch
    115030-03 SunOS 5.9: Multiterabyte UFS - patch mount
    114004-01 SunOS 5.9: sed Patch
    113335-03 SunOS 5.9: devinfo Patch
    113495-05 SunOS 5.9: cfgadm Library Patch
    113494-01 SunOS 5.9: iostat Patch
    113493-03 SunOS 5.9: libproc.so.1 Patch
    113330-01 SunOS 5.9: rpcbind Patch
    115028-02 SunOS 5.9: patch /usr/lib/fs/ufs/df
    115024-01 SunOS 5.9: file system identification utilities
    117471-02 SunOS 5.9: fifofs Patch
    118897-01 SunOS 5.9: stc Patch
    115022-03 SunOS 5.9: quota utilities
    115020-01 SunOS 5.9: patch /usr/lib/adb/ml_odunit
    113720-01 SunOS 5.9: rootnex Patch
    114352-03 SunOS 5.9: /etc/inet/inetd.conf Patch
    123056-01 SunOS 5.9: ldterm patch
    116243-01 SunOS 5.9: umountall Patch
    113323-01 SunOS 5.9: patch /usr/sbin/passmgmt
    116049-01 SunOS 5.9: fdfs Patch
    116241-01 SunOS 5.9: keysock Patch
    113480-02 SunOS 5.9: usr/lib/security/pam_unix.so.1 Patch
    115018-01 SunOS 5.9: patch /usr/lib/adb/dqblk
    113277-44 SunOS 5.9: sd and ssd Patch
    117457-01 SunOS 5.9: elfexec Patch
    113110-01 SunOS 5.9: touch Patch
    113077-17 SunOS 5.9: /platform/sun4u/kernal/drv/su Patch
    115006-01 SunOS 5.9: kernel/strmod/kb patch
    113072-07 SunOS 5.9: patch /usr/sbin/format
    113071-01 SunOS 5.9: patch /usr/sbin/acctadm
    116782-01 SunOS 5.9: tun Patch
    114331-01 SunOS 5.9: power Patch
    112835-01 SunOS 5.9: patch /usr/sbin/clinfo
    114927-01 SunOS 5.9: usr/sbin/allocate Patch
    119937-02 SunOS 5.9: inetboot patch
    113467-01 SunOS 5.9: seg_drv & seg_mapdev Patch
    114923-01 SunOS 5.9: /usr/kernel/drv/logindmux Patch
    117443-01 SunOS 5.9: libkvm Patch
    114329-01 SunOS 5.9: /usr/bin/pax Patch
    119929-01 SunOS 5.9: /usr/bin/xargs patch
    113459-04 SunOS 5.9: udp patch
    113446-03 SunOS 5.9: dman Patch
    116009-05 SunOS 5.9: sgcn & sgsbbc patch
    116557-04 SunOS 5.9: sbd Patch
    120241-01 SunOS 5.9: bge: Link & Speed LEDs flash constantly on V20z
    113984-01 SunOS 5.9: iosram Patch
    113220-01 SunOS 5.9: patch /platform/sun4u/kernel/drv/sparcv9/upa64s
    113975-01 SunOS 5.9: ssm Patch
    117165-01 SunOS 5.9: pmubus Patch
    116530-01 SunOS 5.9: bge.conf Patch
    116529-01 SunOS 5.9: smbus Patch
    116488-03 SunOS 5.9: Lights Out Management (lom) patch
    117131-01 SunOS 5.9: adm1031 Patch
    117124-12 SunOS 5.9: platmod, drmach, dr, ngdr, & gptwocfg Patch
    114003-01 SunOS 5.9: bbc driver Patch
    118539-02 SunOS 5.9: schpc Patch
    112837-10 SunOS 5.9: patch /usr/lib/inet/in.dhcpd
    114975-01 SunOS 5.9: usr/lib/inet/dhcp/svcadm/dhcpcommon.jar Patch
    117450-01 SunOS 5.9: ds_SUNWnisplus Patch
    113076-02 SunOS 5.9: dhcpmgr.jar Patch
    113572-01 SunOS 5.9: docbook-to-man.ts Patch
    118472-01 SunOS 5.9: pargs Patch
    122709-01 SunOS 5.9: /usr/bin/dc patch
    113075-01 SunOS 5.9: pmap patch
    113472-01 SunOS 5.9: madv & mpss lib Patch
    115986-02 SunOS 5.9: ptree Patch
    115693-01 SunOS 5.9: /usr/bin/last Patch
    115259-03 SunOS 5.9: patch usr/lib/acct/acctcms
    114564-09 SunOS 5.9: /usr/sbin/in.ftpd Patch
    117441-01 SunOS 5.9: FSSdispadmin Patch
    113046-01 SunOS 5.9: fcp Patch
    118191-01 gtar patch
    114818-06 GNOME 2.0.0: libpng Patch
    117177-02 SunOS 5.9: lib/gss module Patch
    116340-05 SunOS 5.9: gzip and Freeware info files patch
    114339-01 SunOS 5.9: wrsm header files Patch
    122673-01 SunOS 5.9: sockio.h header patch
    116474-03 SunOS 5.9: libsmedia Patch
    117138-01 SunOS 5.9: seg_spt.h
    112838-11 SunOS 5.9: pcicfg Patch
    117127-02 SunOS 5.9: header Patch
    112929-01 SunOS 5.9: RIPv2 Header Patch
    112927-01 SunOS 5.9: IPQos Header Patch
    115992-01 SunOS 5.9: /usr/include/limits.h Patch
    112924-01 SunOS 5.9: kdestroy kinit klist kpasswd Patch
    116231-03 SunOS 5.9: llc2 Patch
    116776-01 SunOS 5.9: mipagent patch
    117420-02 SunOS 5.9: mdb Patch
    117179-01 SunOS 5.9: nfs_dlboot Patch
    121194-01 SunOS 5.9: usr/lib/nfs/statd Patch
    116502-03 SunOS 5.9: mountd Patch
    113331-01 SunOS 5.9: usr/lib/nfs/rquotad Patch
    113281-01 SunOS 5.9: patch /usr/lib/netsvc/yp/ypbind
    114736-01 SunOS 5.9: usr/sbin/nisrestore Patch
    115695-01 SunOS 5.9: /usr/lib/netsvc/yp/yppush Patch
    113321-06 SunOS 5.9: patch sf and socal
    113049-01 SunOS 5.9: luxadm & liba5k.so.2 Patch
    116663-01 SunOS 5.9: ntpdate Patch
    117143-01 SunOS 5.9: xntpd Patch
    113028-01 SunOS 5.9: patch /kernel/ipp/flowacct
    113320-06 SunOS 5.9: patch se driver
    114731-08 SunOS 5.9: kernel/drv/glm Patch
    115667-03 SunOS 5.9: Chalupa platform support Patch
    117428-01 SunOS 5.9: picl Patch
    113327-03 SunOS 5.9: pppd Patch
    114374-01 SunOS 5.9: Perl patch
    115173-01 SunOS 5.9: /usr/bin/sparcv7/gcore /usr/bin/sparcv9/gcore Patch
    114716-02 SunOS 5.9: usr/bin/rcp Patch
    112915-04 SunOS 5.9: snoop Patch
    116778-01 SunOS 5.9: in.ripngd patch
    112916-01 SunOS 5.9: rtquery Patch
    112928-03 SunOS 5.9: in.ndpd Patch
    119447-01 SunOS 5.9: ses Patch
    115354-01 SunOS 5.9: slpd Patch
    116493-01 SunOS 5.9: ProtocolTO.java Patch
    116780-02 SunOS 5.9: scmi2c Patch
    112972-17 SunOS 5.9: patch /usr/lib/libssagent.so.1 /usr/lib/libssasnmp.so.1 mibiisa
    116480-01 SunOS 5.9: IEEE 1394 Patch
    122485-01 SunOS 5.9: 1394 mass storage driver patch
    113716-02 SunOS 5.9: sar & sadc Patch
    115651-02 SunOS 5.9: usr/lib/acct/runacct Patch
    116490-01 SunOS 5.9: acctdusg Patch
    117473-01 SunOS 5.9: fwtmp Patch
    116180-01 SunOS 5.9: geniconvtbl Patch
    114006-01 SunOS 5.9: tftp Patch
    115646-01 SunOS 5.9: libtnfprobe shared library Patch
    113334-03 SunOS 5.9: udfs Patch
    115350-01 SunOS 5.9: ident_udfs.so.1 Patch
    122484-01 SunOS 5.9: preen_md.so.1 patch
    117134-01 SunOS 5.9: svm flasharchive patch
    116472-02 SunOS 5.9: rmformat Patch
    112966-05 SunOS 5.9: patch /usr/sbin/vold
    114229-01 SunOS 5.9: action_filemgr.so.1 Patch
    114335-02 SunOS 5.9: usr/sbin/rmmount Patch
    120443-01 SunOS 5.9: sed core dumps on long lines
    121588-01 SunOS 5.9: /usr/xpg4/bin/awk Patch
    113470-02 SunOS 5.9: winlock Patch
    119211-07 NSS_NSPR_JSS 3.11: NSPR 4.6.1 / NSS 3.11 / JSS 4.2
    118666-05 J2SE 5.0: update 6 patch
    118667-05 J2SE 5.0: update 6 patch, 64bit
    114612-01 SunOS 5.9: ANSI-1251 encodings file errors
    114276-02 SunOS 5.9: Extended Arabic support in UTF-8
    117400-01 SunOS 5.9: ISO8859-6 and ISO8859-8 iconv symlinks
    113584-16 SunOS 5.9: yesstr, nostr nl_langinfo() strings incorrect in S9
    117256-01 SunOS 5.9: Remove old OW Xresources.ow files
    112625-01 SunOS 5.9: Dcam1394 patch
    114600-05 SunOS 5.9: vlan driver patch
    117119-05 SunOS 5.9: Sun Gigabit Ethernet 3.0 driver patch
    117593-04 SunOS 5.9: Manual Page updates for Solaris 9
    112622-19 SunOS 5.9: M64 Graphics Patch
    115953-06 Sun Cluster 3.1: Sun Cluster sccheck patch
    117949-23 Sun Cluster 3.1: Core Patch for Solaris 9
    115081-06 Sun Cluster 3.1: HA-Sun One Web Server Patch
    118627-08 Sun Cluster 3.1: Manageability and Serviceability Agent
    117985-03 SunOS 5.9: XIL 1.4.2 Loadable Pipeline Libraries
    113896-06 SunOS 5.9: en_US.UTF-8 locale patch
    114967-02 SunOS 5.9: FDL patch
    114677-11 SunOS 5.9: International Components for Unicode Patch
    112805-01 CDE 1.5: Help volume patch
    113841-01 CDE 1.5: answerbook patch
    113839-01 CDE 1.5: sdtwsinfo patch
    115713-01 CDE 1.5: dtfile patch
    112806-01 CDE 1.5: sdtaudiocontrol patch
    112804-02 CDE 1.5: sdtname patch
    113244-09 CDE 1.5: dtwm patch
    114312-02 CDE1.5: GNOME/CDE Menu for Solaris 9
    112809-02 CDE:1.5 Media Player (sdtjmplay) patch
    113868-02 CDE 1.5: PDASync patch
    119976-01 CDE 1.5: dtterm patch
    112771-30 Motif 1.2.7 and 2.1.1: Runtime library patch for Solaris 9
    114282-01 CDE 1.5: libDtWidget patch
    113789-01 CDE 1.5: dtexec patch
    117728-01 CDE1.5: dthello patch
    113863-01 CDE 1.5: dtconfig patch
    112812-01 CDE 1.5: dtlp patch
    113861-04 CDE 1.5: dtksh patch
    115972-03 CDE 1.5: dtterm libDtTerm patch
    114654-02 CDE 1.5: SmartCard patch
    117632-01 CDE1.5: sun_at patch for Solaris 9
    113374-02 X11 6.6.1: xpr patch
    118759-01 X11 6.6.1: Font Administration Tools patch
    117577-03 X11 6.6.1: TrueType fonts patch
    116084-01 X11 6.6.1: font patch
    113098-04 X11 6.6.1: X RENDER extension patch
    112787-01 X11 6.6.1: twm patch
    117601-01 X11 6.6.1: libowconfig.so.0 patch
    117663-02 X11 6.6.1: xwd patch
    113764-04 X11 6.6.1: keyboard patch
    113541-02 X11 6.6.1: XKB patch
    114561-01 X11 6.6.1: X splash screen patch
    113513-02 X11 6.6.1: platform support for new hardware
    116121-01 X11 6.4.1: platform support for new hardware
    114602-04 X11 6.6.1: libmpg_psr patch
    Is there a bundle to install, or do I have to install each patch separately?

  • OrainstRoot.sh: Failure to promote local gpnp setup to other cluster nodes

    I'm trying to build a 2-node cluster, and everything appeared to be going swimmingly until the end of the first node's run of the orainstRoot.sh script.
    The following is the end of the output:
    Disk Group OCR_VOTE created successfully.
    clscfg: -install mode specified
    Successfully accumulated necessary OCR keys.
    Creating OCR keys for user 'root', privgrp 'root'..
    Operation successful.
    CRS-4256: Updating the profile
    Successful addition of voting disk 4e3f692529584f8bbf7f16146bd90346.
    Successful addition of voting disk 728bed918cf54f6cbf904d37638c674b.
    Successful addition of voting disk 8ac20793405d4fdcbfcafc7e311f877d.
    Successfully replaced voting disk group with +OCR_VOTE.
    CRS-4256: Updating the profile
    CRS-4266: Voting file(s) successfully replaced
    ## STATE File Universal Id File Name Disk group
    1. ONLINE 4e3f692529584f8bbf7f16146bd90346 (ORCL:VOTE01) [OCR_VOTE]
    2. ONLINE 728bed918cf54f6cbf904d37638c674b (ORCL:VOTE02) [OCR_VOTE]
    3. ONLINE 8ac20793405d4fdcbfcafc7e311f877d (ORCL:VOTE03) [OCR_VOTE]
    Located 3 voting disk(s).
    Failed to rmtcopy "/tmp/fileLgKPGV" to "/u01/app/11.2.0/grid/gpnp/manifest.txt" for nodes {ilprevzedb01,ilprevzedb02}, rc=256
    Failed to rmtcopy "/u01/app/11.2.0/grid/gpnp/ilprevzedb01/profiles/peer/profile.xml" to "/u01/app/11.2.0/grid/gpnp/profiles/peer/profile.xml" for nodes {ilprevzedb01,ilprevzedb02}, rc=256
    rmtcopy aborted
    Failed to promote local gpnp setup to other cluster nodes at /u01/app/11.2.0/grid/crs/install/crsconfig_lib.pm line 6504.
    /u01/app/11.2.0/grid/perl/bin/perl -I/u01/app/11.2.0/grid/perl/lib -I/u01/app/11.2.0/grid/crs/install /u01/app/11.2.0/grid/crs/install/rootcrs.pl execution failed
    Has anyone run into this problem and found a solution?
    Thanks in advance!

    OK, for everyone out there, I resolved the issue. Hopefully this will help others encountering the same problem.
    It turns out that when the OS was installed, the iptables firewall was enabled. This will cause havoc with the installer scripts.
    My first inkling should have been when the installer stalled at 65% while trying to copy home directories between nodes, the first time I ran through the installer.
    At that time, Googling around suggested that iptables might be the problem, and indeed it was running, so I just did a 'service iptables stop' WITHOUT REBOOTING THE NODES and re-ran the installer.
    Well, it looks as though NOT REBOOTING THE NODES doesn't quite cut it. I then did a 'chkconfig iptables off' and REBOOTED BOTH NODES.
    Oracle support simply provided me with: How to Proceed from Failed 11gR2 Grid Infrastructure (CRS) Installation (Doc ID 942166.1), which didn't really work all that well: lots of failures, errors, etc. So I just deleted the 11.2.0 directory and tried running the installer again.
    This time the install went through without problems.
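    To summarize, the sequence that ended up working was roughly this (RHEL-style init commands, exactly as quoted above; the grid home path is the one from this install):
    # Stop the firewall now and keep it disabled across reboots
    service iptables stop
    chkconfig iptables off
    # Reboot both nodes so no stale installer state survives
    reboot
    # Remove the failed grid home, then re-run the installer
    rm -rf /u01/app/11.2.0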
    Thanks!

  • How to start CRS and other services after rebooting nodes.

    Hi,
    I have created a two-node cluster database. How do I start CRS and the other services after rebooting the nodes?
    Thanks,

    Use the crsctl command to start it:
    $ORA_CRS_HOME/bin/crsctl start crs
    Or check the status of the Clusterware:
    $ORA_CRS_HOME/bin/crsctl check crs
    CSS appears healthy
    CRS appears healthy
    EVM appears healthy
    If not OK, check the logs at $ORA_CRS_HOME/log/hostname/*
    Then check the status of the database, services, etc. with crs_stat:
    $ORA_CRS_HOME/bin/crs_stat -t
    $ORA_CRS_HOME/bin/crs_stat
    If you find some service or instance that is not online, use the srvctl command to start it.
    Get help:
    $ORA_CRS_HOME/bin/srvctl -h
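    For example, a typical srvctl session might look like this (the database, instance, and node names here are hypothetical):
    # Start a database and one of its instances
    $ORA_CRS_HOME/bin/srvctl start database -d orcl
    $ORA_CRS_HOME/bin/srvctl start instance -d orcl -i orcl1
    # Start the node applications (VIP, GSD, ONS, listener) on a node
    $ORA_CRS_HOME/bin/srvctl start nodeapps -n node1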

  • Oracle Cluster Node Reboots Abruptly

    One of our RAC 11gR2 cluster nodes rebooted abruptly. We found the following error in the grid home alert log file and in ocssd.log:
    [cssd(6014)]CRS-1611:Network communication with node mumchora12 (1) missing for 75% of timeout interval.  Removal of this node from cluster in 6.190 seconds
    We need to find the root cause for this node reboot. Kindly assist.
    OS Version : RHEL 5.8
    GRID : 11.2.0.2
    Database : 11.2.0.2.10

    Hi,
    Looking at the logs, it seems to be a private interconnect problem. I would suggest you refer to a nice Metalink doc on the same issue:
    Node reboot or eviction: How to check if your private interconnect CRS can transmit network heartbeats [ID 1445075.1]
    Hope it helps you identify the root cause of the node eviction.
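    Before diving into the note, a quick sanity check of the interconnect from each node is worth doing; a sketch (the address is a placeholder, and oifcfg reports which networks Clusterware has registered):
    # Which interfaces does Clusterware use, and for what?
    $GRID_HOME/bin/oifcfg getif
    # From node 1, ping node 2's private address with full-size,
    # non-fragmenting packets (1472 bytes + headers = 1500-byte MTU)
    ping -c 3 -M do -s 1472 <private-ip-of-node-2>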
    Thanks

  • Cluster Node Joining other cluster

    Because of a network problem, two of our clusters shared their interconnects. This of course led to duplicate IPs on the interconnects and to reboots of the nodes. Now one of the cluster nodes of ClusterA tried to join ClusterB:
    cluster.name   
    cluster.state   enabled
    cluster.properties.cluster_id   0x48FDxxxx  [ Different from Node A ]
    cluster.properties.installmode  disabled
    cluster.properties.auth_joinlist_type   sys
    cluster.properties.auth_joinlist_hostslist      ,
    cluster.properties.cmm_version  1
    cluster.nodes.1.name
    I am really surprised this node tried to join the other cluster; it seems it got the CCR from there during one of the reboots.
    The real question I have now is how to get out of this once we have fixed the network problem. How can we bring this node back into ClusterA? Must we reinstall it?
    Fritz

    I'm surprised that an invalid node picked up a CCR update from a cluster that it wasn't part of. I would have expected the cluster ids to be different and thus prevent this, but to be honest I haven't checked to see how much prevention there is against this.
    Anyway, to get out of it you could hack the CCRs on the other clusters and try to put them onto a different subnet with different private addresses. You'll need to use ccradm. Messy, though.
    Tim
    ---

  • Hyper-V Guest Cluster Node Failing Regularly

    Hi,
    We currently have a 4-node Server 2012 R2 cluster which hosts, among other things, a 3-node guest cluster running a single clustered file service.
    Around once a week, the guest cluster node that is currently hosting the clustered file service will fail. It's as if the VM is blue-screening. That in itself is fairly annoying, and I'll be doing all the updates and checking the event log for clues as to the cause.
    The problem then is that whichever physical cluster node is hosting the VM when it fails will not unlock some of the VM's files. The Virtual Machine Configuration lists as Online Pending. This means that the failed VM cannot be restarted on any other cluster node. The only fix is to drain the physical host it failed on and reboot.
    Looking for suggestions on how to fix the following:
    1. Crashing guest file cluster node.
    2. Failed VM with shared VHDX requiring a physical host reboot.
    Event messages for the physical host that was hosting the failed VM, in the order they occurred:
    Hyper-V-Worker: Event ID 18590 - 'FS-03' has encountered a fatal error.  The guest operating system reported that it failed with the following error codes: ErrorCode0: 0x9E, ErrorCode1: 0x6C2A17C0, ErrorCode2: 0x3C, ErrorCode3: 0xA, ErrorCode4: 0x0.  If the problem persists, contact Product Support for the guest operating system.  (Virtual machine ID 36166B47-D003-4E51-AFB5-7B967A3EFD2D)
    FailoverClustering: Event ID 1069 - Cluster resource 'Virtual Machine FS-03' of type 'Virtual Machine' in clustered role 'FS-03' failed.
    Hyper-V-High-Availability: Event ID 21128 - 'Virtual Machine FS-03' failed to shutdown the virtual machine during the resource termination. The virtual machine will be forcefully stopped.
    Hyper-V-High-Availability: Event ID 21110 - 'Virtual Machine FS-03' failed to terminate.
    Hyper-V-VMMS: Event ID 20108 - The Virtual Machine Management Service failed to start the virtual machine '36166B47-D003-4E51-AFB5-7B967A3EFD2D': The group or resource is not in the correct state to perform the requested operation. (0x8007139F).
    Hyper-V-High-Availability: Event ID 21107 - 'Virtual Machine FS-03' failed to start.
    FailoverClustering: Event ID 1205 - The Cluster service failed to bring clustered role 'FS-03' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

    Hi,
    I have not found a similar issue. Can your cluster pass cluster validation? Are all your Hyper-V hosts compatible with Server 2012 R2? Have you tried disabling all your AV software and firewalls? Please rerun storage validation on the cluster in non-production hours; the cluster validation report will quickly locate the issue.
    More information:
    Cluster
    http://technet.microsoft.com/en-us/library/dd581778(v=ws.10).aspx
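    For reference, validation can also be rerun from PowerShell; a sketch (the node names are placeholders, and -Include limits the run to the storage tests; note that storage tests take clustered disks offline, hence the advice to run outside production hours):
    # Run only the storage validation tests against the guest cluster nodes
    Test-Cluster -Node FS-01,FS-02,FS-03 -Include "Storage"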
    Hope this helps.

  • After applying Oracle 10.2.0.4 CRS PSU bundle 2, node 1 reboots suddenly

    OS : hp itanium 11.31 (HP-UX B.11.31 U ia64 3664852670)
    Oracle version : 10.2.0.4
    After applying 8705958 CRS PSU 2 (Clusterware patch set update 2), about one month later, node 1 rebooted suddenly.
    I observed these two logs:
    *$ORACLE_HOME/db10g/log/db1/racg/imon_db1.log:*
    2010-03-28 09:27:34.857: [    RACG][21] [12463][21][ora.dbrac.db1.inst]: GIMH: GIM-00104: Health check failed to connect to instance.
    GIM-00090: OS-dependent operation:mmap failed with status: 12
    GIM-00091: OS failure message: Not enough space
    GIM-00092: OS failure occurred at: sskgmsmr_13
    */var/adm/syslog/Syslog.log :*
    Mar 28 21:38:41 db1 vmunix: file: table is full
    Mar 28 21:49:20 db1 syslog: Oracle clsomon failed with fatal status 13.
    Mar 28 21:49:20 db1 syslog: Oracle CRS failure. Rebooting for cluster integrity.
    Mar 28 21:49:20 db1 vmunix: User requested reset of the system.
    Mar 28 21:48:58 db1 vmunix: file: table is full
    Mar 28 21:49:20 db1 above message repeats 635 times
    Mar 28 21:49:20 db1 vmunix:
    Mar 28 21:49:20 db1 vmunix: Oracle CRS TOC for clusterware integrity...
    From Metalink: File handles not released after upgrade to 10.2.0.3 CRS Bundle#2 or 10.2.0.4 [ID 739557.1]
    The article implies that I could apply Patch 7493592 CRS 10.2.0.4 Bundle Patch #2 to fix it, but 8705958 CRS PSU 2 is a superset of 7493592.
    Does anyone have suggestions? Please kindly let me know.
    BRS , Jay
    --detail steps: apply 8705958 CRS PSU 2 (Clusterware patch set update 2)
    --unzip opatch to latest version 10.2.0.4.9
    cd /opt/oracle/crs10g
    mv OPatch OPatch.bak
    cp -p /opt/oracle_sw/opatch/p6880880_102000_HPUX-IA64.zip /opt/oracle/crs10g/.
    unzip /opt/oracle_sw/opatch/p6880880_102000_HPUX-IA64.zip
    chown -R oracle:oinstall OPatch
    su - oracle
    #Verify that the Oracle Inventory is properly configured
    /opt/oracle/crs10g/OPatch/opatch lsinventory -detail -oh /opt/oracle/crs10g/
    /opt/oracle/crs10g/OPatch/opatch lsinventory -detail -oh /opt/oracle/db10g/
    #Unzip the PSE container file
    unzip 8705958.zip
    #Apply the patch to the CRS Home and all applicable RDBMS homes as root
    /opt/oracle/crs10g/OPatch/ocm/bin/emocmrsp
    su - root
    cd /opt/oracle_sw/PSU/8705958
    /opt/oracle/crs10g/OPatch/opatch auto -och /opt/oracle/crs10g/
    : /opt/oracle/crs10g/OPatch/ocm/bin/ocm.rsp
    Patch node db1? (y/n/abort/N/N1-N2/help):
    y
    Applying patch 8705958 on node db1

    Assuming you are correct that this is an error created by Oracle's patch ... open an SR at Metalink.
    Otherwise, when instances shoot themselves in the head, it is generally because they cannot contact other instances, and most often I find VLANs at fault.
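    Separately, the repeated "file: table is full" messages in the syslog excerpt suggest the kernel's open-file table (nfile) was exhausted, which would also fit the mmap "Not enough space" failure. On HP-UX 11.31 that can be checked roughly as follows (a sketch; the tunable name assumes the 11i v3 kctune toolset, so verify it against kctune's own output):
    # Show current usage of the open-file table versus its limit
    kcusage nfile
    # Show the tunable, and raise it if usage sits near the limit
    kctune nfile
    kctune nfile=65536    # example value only; size it to your workload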

  • Shutdown of inactive node causes reboot of active node

    Hi,
    when I try to shut down an inactive node of Oracle Clusterware (two nodes), the active node reboots (OS: Oracle Linux 5.6 with OCFS2).
    First it acquires the VIP of the second node, then it reboots and works correctly.
    Can anyone help me?
    Have a nice day

    user2907588 wrote:
    Here is the log of the active node:
    Jan 31 15:35:23 esse3-db1 avahi-daemon[7041]: Registering new address record for 192.168.101.222 on eth0.
    Jan 31 15:56:30 esse3-db1 kernel: bnx2: eth2 NIC Copper Link is Down
    Jan 31 15:56:32 esse3-db1 kernel: bnx2: eth2 NIC Copper Link is Up, 100 Mbps full duplex, receive & transmit flow control ON
    Jan 31 15:56:59 esse3-db1 kernel: o2net: connection to node esse3-db2.unisalento.it (num 0) at 192.168.101.202:7777 has been idle for 30.0 seconds, shutting it down.
    Jan 31 15:56:59 esse3-db1 kernel: (0,17):o2net_idle_timer:1503 here are some times that might help debug the situation: (tmr 1359644189.432221 now 1359644219.431047 dr 1359644189.432200 adv 1359644189.432221:1359644189.432221 func (1f70fe7a:504) 1359644047.329097:1359644047.329102)
    Jan 31 15:56:59 esse3-db1 kernel: o2net: no longer connected to node esse3-db2.unisalento.it (num 0) at 192.168.101.202:7777
    Jan 31 15:57:29 esse3-db1 kernel: (6377,17):o2net_connect_expired:1664 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
    Jan 31 16:01:40 esse3-db1 syslogd 1.4.1: restart.
    Hello,
    One of the nodes in the cluster seems to have been evicted previously due to the eth2 NIC outage between the nodes; the evicted node could be what you are referring to as "inactive".
    Please check the lines I have highlighted in the provided log information. Check whether you can ping the specified IP and do password-less ssh to the other node (101), and ask your system/network administrator to look into it.
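    Concretely, from the surviving node something like this would confirm the path (the addresses are the ones from the log above; o2net uses TCP port 7777):
    # Basic reachability of the other node's interconnect address
    ping -c 3 192.168.101.202
    # Password-less ssh, as suggested above
    ssh root@esse3-db2 true
    # Check that the o2net port is reachable
    telnet 192.168.101.202 7777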
    Regards,
    Naga

  • VMM Thinks Cluster Node is in Maintenance

    I'm running VMM 2012 SP1 (version 3.1.6020.0). The cluster in question is running Windows Server 2012 Datacenter.
    I performed maintenance on one of my Hyper-V failover clusters (installed the KBs in
    this article
    ) and when I took one of the nodes out of maintenance, I successfully migrated VMs between the two via the Failover Cluster Manager console. However, I noticed that VMM still had the exclamation mark on the cluster name. I didn't notice this until a couple of days later, and now I'm trying to do a cross-cluster migration and it's not allowing me because VMM thinks the node is in maintenance. I've tried rebooting the VMM server, refreshing the cluster, and refreshing all the VMs, with no luck.
    When I go into the Failover Cluster Manager on each of the cluster nodes, both nodes show in production (not in maintenance). Any ideas?
    Note: the way that I took the node out of maintenance was via the Failover Cluster Manager console and NOT through the VMM console, as the VMM server was unavailable at the time.

    It is interesting that VMM was unavailable at the time you were doing this. Are you able to refresh this particular host and see if anything changes? Is the option to "stop maintenance mode" available on this host from VMM?
    Anyhow, the root cause here will be that the data in the VMM database is not consistent with your resources, so as a last attempt you could remove and add your cluster again, just so that the database performs a clean-up of the objects.
    -kn
    Kristian (Virtualization and some coffee: http://kristiannese.blogspot.com )

  • Cluster Node Unable to Maintain Cluster Membership

    My cluster logs are very similar to the above thread... was it ever addressed?
    [SV] Already protecting connection with message security level 'sign'
    [FTI] Stream already exists to node: false
    [Channel IP to another cluster node member] Close()
    GracefuleClose(1226) because of channel to remote endpoint another cluster node
    ~ is closed
    Cluster services stops and generates:
    The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server serverName$. The target name used was
    serverName.
    This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal name (SPN) is registered on an account other than the account the target service is using. Ensure that the target SPN
    is only registered on the account used by the server.
    Roderick Lyons

    Hi Roderick Lyons,
    Could you tell us the exact URL of the "above thread"? I am not sure which thread you mean.
    Please offer more information about your environment, such as the DC server edition and the cluster node server edition.
    If you have a mixed 2003 and 2012 R2 DC environment, please restart your cluster node and then monitor further.
    The related article:
    It turns out that weird things can happen when you mix Windows Server 2003 and Windows Server 2012 R2 domain controllers
    http://blogs.technet.com/b/askds/archive/2014/07/23/it-turns-out-that-weird-things-can-happen-when-you-mix-windows-server-2003-and-windows-server-2012-r2-domain-controllers.aspx
    Can't log on after changing machine account password in mixed Windows Server 2012 R2 and Windows Server 2003 environment
    http://support.microsoft.com/kb/2989971
    From the current error, another possibility is that you never ran cluster validation before you created the cluster; please run cluster validation first, then post the warning or error information.
    If the above does not work, please consider rebooting your PDC at an unproductive time.
    More information:
    Kerberos Service Principal Name on Wrong Account
    https://support.microsoft.com/kb/2706695?wa=wsignin1.0
    Fixing the Security-Kerberos / 4 error
    http://blogs.technet.com/b/dcaro/archive/2013/07/04/fixing-the-security-kerberos-4-error.aspx
    Service Principal Names (SPNs) SetSPN Syntax (Setspn.exe)
    http://social.technet.microsoft.com/wiki/contents/articles/717.service-principal-names-spns-setspn-syntax-setspn-exe.aspx
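    As a concrete starting point, the SPN checks from those articles boil down to something like this (serverName is the placeholder from the error text above):
    rem List every SPN registered on the cluster node's computer account
    setspn -L serverName
    rem Search the forest for duplicate SPNs
    setspn -X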
    I’m glad to be of help to you!

  • SCVMM losing connection to cluster nodes

    Hey guys and girls, I hope this is the right forum for this question. I already opened a ticket with MS support as well because it's indirectly impacting our production environment, but even after a week there's been no contact. Losing faith in MS support there.
    The problem we're having is that a host enters the 'needs attention' state in SCVMM, with WinRM error 0x80338126. I guess it has something to do with the network or with Kerberos, and I've found some info on it, but I still haven't been able to solve it. Do you guys have any ideas?
    Problem summary:
    We are seeing an issue on our new hyper-v platform. The platform should have been in production last week, but this issue is delaying our project as we can't seem to get it stable.
    The problem we are experiencing is that SCVMM loses the connection to some of the Hyper-V nodes. Not one specific node. Last week it happened to two nodes, and today it happened to another node. I see issues with WinRM, and I suspect something to do with Kerberos. See the bottom of this post for background details and software versions.
    The host gets the status 'needs attention', and if you look at the status of the machine, WinRM gives an error. The error is:
    Error (2916)
    VMM is unable to complete the request. The connection to the agent cc1-hyp-10.domaincloud1.local was lost.
    WinRM: URL: [http://cc1-hyp-10.domaincloud1.local:5985], Verb: [ENUMERATE], Resource: [http://schemas.microsoft.com/wbem/wsman/1/wmi/root/cimv2/Win32_Service], Filter: [select * from Win32_Service where Name="WinRM"]
    Unknown error (0x80338126)
    Recommended Action
    Ensure that the Windows Remote Management (WinRM) service and the VMM agent are installed and running and that a firewall is not blocking HTTP/HTTPS traffic. Ensure that VMM server is able to communicate with cc1-hyp-10.domaincloud1.local over WinRM by successfully running the following command:
     winrm id -r:cc1-hyp-10.domaincloud1.local
    This problem can also be caused by a Windows Management Instrumentation (WMI) service crash. If the server is running Windows Server 2008 R2, ensure that KB 982293 (http://support.microsoft.com/kb/982293) is installed on it.
    If the error persists, restart cc1-hyp-10.domaincloud1.local and then try the operation again. Refer to
    http://support.microsoft.com/kb/2742275 for more details.
    Doing a simple test from the VMM server to the problematic cluster node shows this error:
    PS C:\> hostname
    CC1-VMM-01
    PS C:\> winrm id -r:cc1-hyp-10.domaincloud1.local
    WSManFault
        Message = WinRM cannot complete the operation. Verify that the specified computer name is valid, that the computer is accessible over the network, and that a firewall exception for the WinRM service is enabled and allows access from this
    computer. By default, the WinRM firewall exception for public profiles limits access to remote computers within the same local subnet.
    Error number:  -2144108250 0x80338126
    WinRM cannot complete the operation. Verify that the specified computer name is valid, that the computer is accessible over the network, and that a firewall exception for the WinRM service is enabled and allows access from this computer. By default, the WinRM
    firewall exception for public profiles limits access to remote computers within the same local subnet.
    I CAN connect from other hosts to this problematic cluster node:
    PS C:\> hostname
    CC1-HYP-16
    PS C:\> winrm id -r:cc1-hyp-10.domaincloud1.local
    IdentifyResponse
        ProtocolVersion =
    http://schemas.dmtf.org/wbem/wsman/1/wsman.xsd
        ProductVendor = Microsoft Corporation
        ProductVersion = OS: 6.3.9600 SP: 0.0 Stack: 3.0
        SecurityProfiles
            SecurityProfileName =
    http://schemas.dmtf.org/wbem/wsman/1/wsman/secprofile/http/spnego-kerberos
    And I can connect from the vmm server to all other cluster nodes:
    PS C:\> hostname
    CC1-VMM-01
    PS C:\> winrm id -r:cc1-hyp-11.domaincloud1.local
    IdentifyResponse
        ProtocolVersion =
    http://schemas.dmtf.org/wbem/wsman/1/wsman.xsd
        ProductVendor = Microsoft Corporation
        ProductVersion = OS: 6.3.9600 SP: 0.0 Stack: 3.0
        SecurityProfiles
            SecurityProfileName =
    http://schemas.dmtf.org/wbem/wsman/1/wsman/secprofile/http/spnego-kerberos
    So at this point only the test from the cc1-vmm-01 to cc1-hyp-10 seems to be problematic.
    I followed the steps on the page https://support.microsoft.com/kb/2742275 (which is referred to above). I tried the VMMCA, but I can't really get it working the way I want, or it seems to give outdated recommendations.
    I tried checking for duplicate SPNs by running setspn -x on the affected machines. No results (although I do not understand what an SPN is or how it works). I rebuilt the performance counters.
    I tried setting 'sc config winrm type= own' as described in [http://blinditandnetworkadmin.blogspot.nl/2012/08/kb-how-to-troubleshoot-needs-attention.html].
    If I reboot this cc1-hyp-10 machine, it will start working perfectly again. However, then I can't troubleshoot the issue, and it will happen again.
    I want this problem to be solved, so vmm never loses connection to the hypervisors it's managing again!
    Background information:
    We've set up a platform with Hyper-V to run a VM workload. The platform consists of the following hardware:
    2 Dell R620's with 32GB of RAM, running hyper-v to virtualize the cloud management layer (DC's, VMM, SQL). These machines are called cc1-hyp-01 and cc1-hyp-02. They run the management vm's like cc1-dc-01/02, cc1-sql-01, cc1-vmm-01, etc. The names are self-explanatory.
    The VMM machine is NOT clustered.
    8 Dell M620 blades with 320GB of RAM, running Hyper-V to virtualize the customer workload. The machines are called cc1-hyp-10 through cc1-hyp-17. They are in a cluster.
    2 EqualLogic units form a SAN (premium storage), and we have a Dell R515 running an iSCSI target (budget storage).
    We have Dell Force10 switches and Cisco C3750X switches to connect everything together (mostly 10GB links).
    All hosts run Windows Server 2012 R2 Datacenter edition. The VMM server runs System Center Virtual Machine Manager 2012 R2.
    All the latest Windows updates are installed on every host. There are no firewalls between any host (vmm and hypervisors) at this level. Windows firewalls are all disabled. No antivirus software is installed, no symantec software is installed.
    The only non-standard software that is installed is the Dell Host Integration Tools 4.7.1, Dell Openmanage Server Administrator, and some small stuff like 7-zip, bginfo, net-snap, etc.
    The SCVMM service is running under the domain account DOMAINCLOUD1\scvmm. This account is in the local administrators group of each cluster node.
    On top of this cloud layer we're running the tenant layer with a lot of vm's for a specific customer (although they are all off now).

    I think I found the culprit: after an hour of analyzing Wireshark dumps I found that the VMM server had jumbo frames enabled on the management interface to the hosts (and the underlying infrastructure does not). Now my winrm commands have started working again.
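    For anyone troubleshooting the same thing, the MTU mismatch can be confirmed without Wireshark; a sketch (the host name is the one from this thread; 8972 bytes = 9000-byte jumbo MTU minus IP/ICMP headers):
    rem Show the MTU of each interface on the VMM server
    netsh interface ipv4 show subinterfaces
    rem Send a non-fragmentable jumbo-sized ping to a host; if the
    rem infrastructure is not jumbo-clean end to end, this will fail
    ping -f -l 8972 cc1-hyp-10.domaincloud1.local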

  • Unable to fail over the services in an active-active cluster

    Hi,
    I am applying the SP2 patch for SQL Server 2008 R2 in an active-active cluster. We have 3 services in the cluster; node 1 is the preferred owner of two of them and node 2 the preferred owner of one. When I try to move a service from node 2 to node 1, I get the errors below:
    DCOM was unable to communicate with the computer XXXXXXXXX using any of the configured protocols.
    The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server XXXXXXXXX. The target name used was RPCSS/XXXXXX. This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal
    name (SPN) is registered on an account other than the account the target service is using. Please ensure that the target SPN is registered on, and only registered on, the account used by the server. This error can also happen when the target service is using
    a different password for the target service account than what the Kerberos Key Distribution Center (KDC) has for the target service account. Please ensure that the service on the server and the KDC are both updated to use the current password. If the server
    name is not fully qualified, and the target domain (XXXXXX) is different from the client domain (XXXXXXX), check if there are identically named server accounts in these two domains, or use the fully-qualified name to identify the server.
    The Cluster service failed to bring clustered service or application 'CHCROCHC045' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.
    Cluster resource 'SQL Server (CHCROCHC045)' in clustered service or application 'CHCROCHC045' failed.
    Any input to help resolve this issue is appreciated, as I cannot proceed with patching.
    BR
    PGR

    Hi PGR,
    As the issue is more related to Windows Server, I would recommend you post it in the Windows Server forums for better support.
    In addition, below are some articles about troubleshooting the error "DCOM was unable to communicate with the computer XXXXXXXXX using any of the configured protocols" for your reference:
    Event ID 10009 — COM Remote Service Availability
    How to troubleshoot DCOM 10009 error logged in system event?
    Thanks,
    Lydia Zhang
    TechNet Community Support
