Failover cluster fails validation after a single node restart
I had a lab environment set up that worked great: it passed validation and could do live migrations without issue. But as soon as I restarted one of the nodes, the still-running node became the only node able to access the storage backend. What's weird is that the restarted node can still access the CSV storage and run VMs off of it, but the validation report is unable to list the actual disks.
My cluster consists of 2 nodes backed by an iSCSI shared storage server. I can see that both of my nodes are connected to the iSCSI targets successfully, but the node I first restarted no longer lists any disks/volumes in Disk Management, and the previously available MPIO menus are disabled in the iSCSI control panel. I also tried restarting the second node after the first node came back, but although the first node was up and running VMs, restarting the second node brought the entire cluster down. I see event IDs 1177, 1573, and 1069 in the Cluster Events log. When the second node came back up, the cluster came back with it, but not the storage; both nodes now show the same behavior and the storage is inaccessible from both. I was able to get both nodes reconnected by opening iscsicpl, disconnecting all current connections to the iSCSI backend, and adding them back. Repeating the test after bringing the storage back up produced the same behavior, and this time redoing the iSCSI connections is not helping.
I think the key observation is that the first node I restarted is unable to see any disks/volumes from the storage backend only after joining the cluster and rebooting. Before joining the cluster I rebooted both nodes and both reconnected to the iSCSI backend without issue. It wasn't until after joining the cluster that node 1 lost access to the storage backend following a reboot. The validation report fails with "No disks were found on which to perform cluster validation tests. To correct this, review the following possible causes: ...", although none of the suggestions seem applicable, and validation succeeded right before the node was restarted.
Does anyone have suggestions on how to further troubleshoot or resolve this issue?
I am using Hyper-V Server 2012 R2 on both nodes and they are joined to the same domain.
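Before re-running validation, it may help to confirm from each node that the iSCSI sessions actually surface disks to the OS. A diagnostic sketch from an elevated PowerShell prompt (uses the iSCSI and Storage modules included in 2012 R2; the node names are placeholders):

```powershell
# List iSCSI sessions and whether they are persistent across reboots
Get-IscsiSession | Format-Table TargetNodeAddress, IsConnected, IsPersistent

# Force the OS to rescan for disks exposed by those sessions
Update-HostStorageCache

# Check which disks the node can now see
Get-Disk | Format-Table Number, FriendlyName, OperationalStatus, IsOffline

# Re-run only the storage validation tests against both nodes
Test-Cluster -Node Node1,Node2 -Include "Storage"
```

Note that storage validation tests can take disks offline, so run them in a maintenance window. If `Get-IscsiSession` shows `IsPersistent : False`, the sessions will not survive a reboot, which would match the symptom of disks vanishing after a restart.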
Hi,
I could not find a similar issue on record. Please verify that your storage is compatible with Server 2012 R2, update the network card drivers and firmware on both nodes, temporarily disable your AV software and firewall, and install the recommended hotfixes and updates for Windows Server 2012 R2-based failover clusters:
The Recommended hotfixes and updates for Windows Server 2012 R2-based failover clusters
http://support.microsoft.com/kb/2920151/en-us
Hope this helps.
We are trying to better understand customer views on social support experience, so your participation in this interview project would be greatly appreciated if you have time.
Thanks for helping make community forums a great place.
Similar Messages
-
Failover cluster failed due to mysterious IP conflict ?
I'm having a mysterious problem with my Failover cluster,
Cluster name: PrintCluster01.domain.com
Members: PrintServer01.domain.com and PrintServer02.domain.com
In Failover Cluster Management – Cluster Events, I received the critical error messages 1135 and 1177:
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 15/06/2011 9:07:49 PM
Event ID: 1177
Task Category: None
Level: Critical
Keywords:
User: SYSTEM
Computer: PrintServer01.domain.com
Description:
The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.
Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is
connected such as hubs, switches, or bridges.
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 15/06/2011 9:07:28 PM
Event ID: 1135
Task Category: None
Level: Critical
Keywords:
User: SYSTEM
Computer: PrintServer01.domain.com
Description:
Cluster node 'PrintServer02' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run
the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node
is connected such as hubs, switches, or bridges.
After further investigation, I found an interesting error: the very first critical error message logged in the Event Viewer on PrintServer02:
Log Name: System
Source: Tcpip
Date: 15/06/2011 9:07:29 PM
Event ID: 4199
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: PrintServer02-VM.domain.com
Description:
The system detected an address conflict for IP address 192.168.127.142 with the system having network hardware address 00-50-56-AE-29-23. Network operations on this system may be disrupted as a result.
192.168.127.142 --> secondary IP of PrintServer01
How could it possibly conflict with one of PrintServer01's own addresses? The details are below:
**From PrintServer01**
Ethernet adapter Local Area Connection* 8:
Connection-specific DNS Suffix . :
Description . . . . . . . . . . . : Microsoft Failover Cluster Virtual Adapter
Physical Address. . . . . . . . . : 02-50-56-AE-29-23
DHCP Enabled. . . . . . . . . . . : No
Autoconfiguration Enabled . . . . : Yes
IPv4 Address. . . . . . . . . . . : 169.254.1.183(Preferred)
Subnet Mask . . . . . . . . . . . : 255.255.0.0
Default Gateway . . . . . . . . . :
NetBIOS over Tcpip. . . . . . . . : Enabled
I have double-checked on all of the cluster members that all IP addresses are now unique. I am also sure the IPs are static, not assigned by DHCP, as the IPCONFIG results below show:
From **PrintServer01** (the Active Node)
Windows IP Configuration
Host Name . . . . . . . . . . . . : PrintServer01
Primary Dns Suffix . . . . . . . : domain.com
Node Type . . . . . . . . . . . . : Hybrid
IP Routing Enabled. . . . . . . . : No
WINS Proxy Enabled. . . . . . . . : No
DNS Suffix Search List. . . . . . : domain.com
domain.com.au
Ethernet adapter Local Area Connection* 8:
Connection-specific DNS Suffix . :
Description . . . . . . . . . . . : Microsoft Failover Cluster Virtual Adapter
Physical Address. . . . . . . . . : 02-50-56-AE-29-23
DHCP Enabled. . . . . . . . . . . : No
Autoconfiguration Enabled . . . . : Yes
IPv4 Address. . . . . . . . . . . : 169.254.1.183(Preferred)
Subnet Mask . . . . . . . . . . . : 255.255.0.0
Default Gateway . . . . . . . . . :
NetBIOS over Tcpip. . . . . . . . : Enabled
Ethernet adapter Cluster Public Network:
Connection-specific DNS Suffix . :
Description . . . . . . . . . . . : Intel® PRO/1000 MT Network Connection
Physical Address. . . . . . . . . : 00-50-56-AE-29-23
DHCP Enabled. . . . . . . . . . . : No
Autoconfiguration Enabled . . . . : Yes
IPv4 Address. . . . . . . . . . . : 192.168.127.155(Preferred)
Subnet Mask . . . . . . . . . . . : 255.255.255.0
IPv4 Address. . . . . . . . . . . : 192.168.127.88(Preferred)
Subnet Mask . . . . . . . . . . . : 255.255.255.0
IPv4 Address. . . . . . . . . . . : 192.168.127.142(Preferred)
Subnet Mask . . . . . . . . . . . : 255.255.255.0
IPv4 Address. . . . . . . . . . . : 192.168.127.143(Preferred)
Subnet Mask . . . . . . . . . . . : 255.255.255.0
IPv4 Address. . . . . . . . . . . : 192.168.127.144(Preferred)
Subnet Mask . . . . . . . . . . . : 255.255.255.0
Default Gateway . . . . . . . . . : 192.168.127.254
DNS Servers . . . . . . . . . . . : 192.168.127.10
192.168.127.11
Primary WINS Server . . . . . . . : 192.168.127.10
Secondary WINS Server . . . . . . : 192.168.127.11
NetBIOS over Tcpip. . . . . . . . : Enabled
Ethernet adapter Cluster Private Network:
Connection-specific DNS Suffix . :
Description . . . . . . . . . . . : Intel® PRO/1000 MT Network Connection #2
Physical Address. . . . . . . . . : 00-50-56-AE-43-EC
DHCP Enabled. . . . . . . . . . . : No
Autoconfiguration Enabled . . . . : Yes
IPv4 Address. . . . . . . . . . . : 10.184.2.2(Preferred)
Subnet Mask . . . . . . . . . . . : 255.255.255.0
Default Gateway . . . . . . . . . :
NetBIOS over Tcpip. . . . . . . . : Disabled
From **PrintServer02**
Windows IP Configuration
Host Name . . . . . . . . . . . . : PrintServer02
Primary Dns Suffix . . . . . . . : domain.com
Node Type . . . . . . . . . . . . : Hybrid
IP Routing Enabled. . . . . . . . : No
WINS Proxy Enabled. . . . . . . . : No
DNS Suffix Search List. . . . . . : domain.com
domain.com.au
Ethernet adapter Local Area Connection* 8:
Connection-specific DNS Suffix . :
Description . . . . . . . . . . . : Microsoft Failover Cluster Virtual Adapter
Physical Address. . . . . . . . . : 02-50-56-AE-5F-E5
DHCP Enabled. . . . . . . . . . . : No
Autoconfiguration Enabled . . . . : Yes
IPv4 Address. . . . . . . . . . . : 169.254.2.86(Preferred)
Subnet Mask . . . . . . . . . . . : 255.255.0.0
Default Gateway . . . . . . . . . :
NetBIOS over Tcpip. . . . . . . . : Enabled
Ethernet adapter Cluster Public Network:
Connection-specific DNS Suffix . :
Description . . . . . . . . . . . : Intel® PRO/1000 MT Network Connection
Physical Address. . . . . . . . . : 00-50-56-AE-79-FA
DHCP Enabled. . . . . . . . . . . : No
Autoconfiguration Enabled . . . . : Yes
IPv4 Address. . . . . . . . . . . : 192.168.127.172(Preferred)
Subnet Mask . . . . . . . . . . . : 255.255.255.0
IPv4 Address. . . . . . . . . . . : 192.168.127.119(Preferred)
Subnet Mask . . . . . . . . . . . : 255.255.255.0
Default Gateway . . . . . . . . . : 192.168.127.254
DNS Servers . . . . . . . . . . . : 192.168.127.10
192.168.127.11
Primary WINS Server . . . . . . . : 192.168.127.11
Secondary WINS Server . . . . . . : 192.168.127.10
NetBIOS over Tcpip. . . . . . . . : Enabled
Ethernet adapter Cluster Private Network:
Connection-specific DNS Suffix . :
Description . . . . . . . . . . . : Intel® PRO/1000 MT Network Connection #2
Physical Address. . . . . . . . . : 00-50-56-AE-77-8D
DHCP Enabled. . . . . . . . . . . : No
Autoconfiguration Enabled . . . . : Yes
IPv4 Address. . . . . . . . . . . : 10.184.2.3(Preferred)
Subnet Mask . . . . . . . . . . . : 255.255.255.0
Default Gateway . . . . . . . . . :
NetBIOS over Tcpip. . . . . . . . : Disabled
Any help would be greatly appreciated.
Thanks,
AWT
/* Server Support Specialist */
I am facing the same scenario as the original poster. This is on Server 2008 R2 SP1.
Windows event log entries follow the same pattern. The MAC address listed in connection with the duplicate IP belonged to the passive node.
Interestingly, the Cluster.log begins to explode with activity a few milliseconds before the first Windows event is logged.
2012/07/11-15:20:59.517 INFO [CHANNEL fe80::8145:f2b9:898e:784e%37:~3343~] graceful close, status (of previous failure, may not indicate problem) ERROR_IO_PENDING(997)
2012/07/11-15:20:59.517 WARN [PULLER SQLTESTSQLB] ReadObject failed with GracefulClose(1226)' because of 'channel to remote endpoint fe80::8145:f2b9:898e:784e%37:~3343~
is closed'
2012/07/11-15:20:59.517 ERR [NODE] Node 1: Connection to Node 2 is broken. Reason GracefulClose(1226)' because of 'channel to remote endpoint fe80::8145:f2b9:898e:784e%37:~3343~
is closed'
2012/07/11-15:20:59.517 WARN [RGP] Node 1: only local suspects are missing (2). moving to the next stage (shortcut compensation time 05.000)
2012/07/11-15:20:59.548 WARN [NETFTAPI] Failed to query parameters for fe80::5efe:169.254.1.79 (status 80070490)
2012/07/11-15:20:59.548 WARN [NETFTAPI] Failed to query parameters for fe80::5efe:169.254.1.79 (status 80070490)
2012/07/11-15:20:59.579 INFO [CHANNEL 192.168.3.22:~3343~] graceful close, status (of previous failure, may not indicate problem) ERROR_SUCCESS(0)
2012/07/11-15:20:59.579 WARN cxl::ConnectWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 192.168.3.22:~3343~ is closed'
2012/07/11-15:20:59.829 INFO [GEM] Node 1: EnterRepairStage1: Gem agent for node 1
2012/07/11-15:21:00.141 INFO [GEM] Node 1: EnterRepairStage2: Gem agent for node 1
2012/07/11-15:21:00.499 WARN [RCM] Moving orphaned group Available Storage from downed node SQLTESTSQLB to node SQLTESTSQLA.
2012/07/11-15:21:00.499 WARN [RES] IP Address <Cluster IP Address>: WorkerThread: NetInterface ef150d1a-f4a1-4f4f-a5c7-6e7cb2bfacab changed to state 3.
2012/07/11-15:21:00.499 WARN [RCM] Moving orphaned group MSSTEST from downed node SQLTESTSQLB to node SQLTESTSQLA.
2012/07/11-15:21:00.546 WARN [RES] IP Address <SQL IP Address 1 (DEVSQL)>: Failed to delete IP interface 2003B882, status 87.
2012/07/11-15:21:00.562 WARN [RES] Physical Disk <Cluster Disk 2>: PR reserve failed, status 170
2012/07/11-15:21:00.577 WARN [RES] Physical Disk <Cluster Disk 1>: PR reserve failed, status 170
2012/07/11-15:21:00.593 WARN [RES] Physical Disk <Cluster Disk 3>: PR reserve failed, status 170
2012/07/11-15:21:02.215 WARN [NETFTAPI] Failed to query parameters for 192.168.3.32 (status 80070490)
2012/07/11-15:21:02.215 WARN [NETFTAPI] Failed to query parameters for 192.168.3.32 (status 80070490)
2012/07/11-15:21:05.864 DBG [NETFTAPI] received NsiParameterNotification for fe80::5cd:8cc2:186:f5cb (IpDadStatePreferred )
2012/07/11-15:21:06.565 ERR [RES] Physical Disk <Cluster Disk 2>: Failed to preempt reservation, status 170
2012/07/11-15:21:06.581 ERR [RES] Physical Disk <Cluster Disk 2>: OnlineThread: Unable to arbitrate for the disk. Error: 170.
2012/07/11-15:21:06.581 ERR [RES] Physical Disk <Cluster Disk 2>: OnlineThread: Error 170 bringing resource online.
2012/07/11-15:21:06.581 ERR [RHS] Online for resource Cluster Disk 2 failed.
2012/07/11-15:21:06.581 WARN [RCM] HandleMonitorReply: ONLINERESOURCE for 'Cluster Disk 2', gen(0) result 5018.
2012/07/11-15:21:06.581 ERR [RCM] rcm::RcmResource::HandleFailure: (Cluster Disk 2)
2012/07/11-15:21:06.581 WARN [RES] Physical Disk <Cluster Disk 2>: Terminate: Failed to open device \Device\Harddisk5\Partition1, Error 2
2012/07/11-15:21:06.581 ERR [RES] Physical Disk <Cluster Disk 1>: Failed to preempt reservation, status 170
2012/07/11-15:21:06.581 ERR [RES] Physical Disk <Cluster Disk 1>: OnlineThread: Unable to arbitrate for the disk. Error: 170.
2012/07/11-15:21:06.581 ERR [RES] Physical Disk <Cluster Disk 1>: OnlineThread: Error 170 bringing resource online.
Full cluster log here:
https://skydrive.live.com/redir?resid=A694FDEBF02727CD!133&authkey=!ADQMxHShdeDvXVc -
An error occurred while executing the test. There was an error getting information about the SAS controllers installed on the nodes. There was an error retrieving information
about the SAS host bus adapters from node Invalid class
Successfully put PR reserve on cluster disk 0 from node C while it should have failed
Cluster Disk 0 does not support Persistent Reservations. Some storage devices require specific firmware versions or settings to function properly with failover clusters.
Please contact your storage administrator or storage vendor to check the configuration of the storage to allow it to function properly with failover clusters.
Cluster Disk 1 does not support Persistent Reservations. Some storage devices require specific firmware versions or settings to function properly with failover clusters.
Please contact your storage administrator or storage vendor to check the configuration of the storage to allow it to function properly with failover clusters.
Hi,
It sounds like your storage is not compatible with Windows Server Failover Clustering. The cluster will most likely still work in this state, but before running it in a production environment you may need to do a few things.
Almost all storage vendors and current shipping models support failover clustering, but many require firmware updates or specific configuration settings, so please contact your storage vendor to confirm what is needed.
Second, if you are building the failover cluster in a VMware® virtualization environment, please refer to the VMware® article:
Configuring Microsoft Cluster Service fails with the error: Validate SCSI-3 Persistent Reservation (1030632)
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1030632
More information:
Windows 2008 Failover Cluster Validation Fails on ‘Validate SCSI-3 Persistent Reservation’
http://blogs.technet.com/b/askcore/archive/2009/04/15/windows-2008-failover-cluster-validation-fails-on-validate-scsi-3-persistent-reservation.aspx
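As the articles above describe, SCSI-3 persistent reservation support is exercised by the cluster validation storage tests, which can be re-run on their own once firmware or settings have been changed. A minimal sketch (node names are placeholders; storage tests can briefly take disks offline, so schedule them carefully):

```powershell
Test-Cluster -Node NodeA,NodeB -Include "Storage"
```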
Hope this helps.
-
SQL 2012 installation for Failover Cluster failed
During installation of SQL 2012 on the FOC, validation fails on the "Database Engine Configuration" page with the following error:
The volume that contains SQL Server data directory g:\MSSQL11.MSSQLSERVER\MSSQL\DATA does not belong to the cluster group.
I want to know how the SQL installation wizard queries the volumes configured with the failover cluster. Does it:
- enumerate "Physical Disk" resources in the FOC,
- enumerate all storage-class resources in the FOC to get the volume list,
- or depend on WMI (Win32_Volume) to get the volumes?
The wizard correctly discovers volume g:\ in its FOC group on the "Cluster Resource Group" and "Cluster Disk Selection" pages, but gives the error on the Database Engine Configuration page.
Any help in this would be appreciated.
Thanks in advance
Rakesh
Rakesh Agrawal
Can you please check if there is any disk in the cluster which is not in an online state? Please run the script below, following these steps.
1. Save the script below as "Disk.vbs".
2. Run it with CSCRIPT. Syntax: CSCRIPT Disk.vbs <Windows Cluster Name>
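On systems where the FailoverClusters PowerShell module is available, a rough equivalent of the VBScript below is (a sketch; the cluster name is a placeholder):

```powershell
Import-Module FailoverClusters

# List physical disk resources with their state and owning node
Get-ClusterResource -Cluster MyCluster |
    Where-Object { $_.ResourceType -eq 'Physical Disk' } |
    Format-Table Name, State, OwnerNode
```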
<Script>
Option Explicit
Public objArgs, objCluster
Public Function Connect()
' Opens a global cluster object. Using Windows Script Host syntax,
' the cluster name or "" must be passed as the first argument.
Set objArgs = WScript.Arguments
if objArgs.Count=0 then
wscript.Echo "Usage Cscript <script file name> <Windows Cluster Name> "
WScript.Quit
end IF
Set objCluster = CreateObject("MSCluster.Cluster")
objCluster.Open objArgs(0)
End Function
Public Function Disconnect()
' Dereferences global objects. Used with Connect.
Set objCluster = Nothing
Set objArgs = Nothing
End Function
Connect
Dim objEnum
For Each objEnum in objCluster.Resources
If objEnum.ClassInfo = 1 Then
WScript.Echo ObjEnum.Name
Dim objDisk
Dim objPartition
On Error Resume Next
Set objDisk = objEnum.Disk
If Err.Number <> 0 Then
WScript.Echo "Unable to retrieve the disk: " & Err
Else
For Each objPartition in objDisk.Partitions
WScript.Echo objPartition.DeviceName
Next
End If
End If
Next
Disconnect
</Script> -
Hyper-V Failover Cluster Networking Configuration After Install
Hello All,
Is it possible to install Hyper-V and failover clustering, in other words create a Hyper-V failover cluster, and then configure the networking part of the solution later? As I am still coming to terms with the networking part, I wanted to do it after the install. Is that possible?
By "later configuration" I mean the creation of the NIC team, virtual NICs, VLAN tagging, etc.
Hi,
Failover cluster deployment requires network connectivity between cluster nodes. You can't create a cluster without properly configured TCP/IP on the cluster nodes.
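That said, while basic TCP/IP must work from day one, items such as NIC teams and converged virtual NICs can be reshaped afterwards, one node at a time, as long as cluster connectivity survives. A hedged sketch of the later configuration (adapter names, switch name, and VLAN ID are all placeholders):

```powershell
# Build a team from two physical NICs
New-NetLbfoTeam -Name "Team1" -TeamMembers "NIC1","NIC2" -TeamingMode SwitchIndependent

# Put a Hyper-V virtual switch on top of the team
New-VMSwitch -Name "ConvergedSwitch" -NetAdapterName "Team1" -AllowManagementOS $false

# Add a management-OS virtual NIC and tag it with a VLAN
Add-VMNetworkAdapter -ManagementOS -Name "LiveMigration" -SwitchName "ConvergedSwitch"
Set-VMNetworkAdapterVlan -ManagementOS -VMNetworkAdapterName "LiveMigration" -VlanId 20 -Access
```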
http://OpsMgr.ru/ -
I switched the SQL Server resource group to the standby node. When the disk resource was brought online on the passive node, an exception occurred: the original dependency disk resource used drive letter 'K:', but when the disk came online it was automatically reassigned the new drive letter 'H:', so the SQL Server resource could not come online. After manually changing the drive letter back to 'K:' on the passive node, it worked! So my question is: why did it not use the original drive letter, and what could cause it to reassign a new one? A mount point? Some logs follow:
00001cbc.000004e0::2015/03/12-14:41:11.377 WARN [RES] Physical Disk <FltLowestPrice_K>: OnlineThread: Failed to set volguid \??\Volume{e32c13d5-02e6-4924-a2d9-59a6fae1a1be}. Error: 183.
00001cbc.000004e0::2015/03/12-14:41:11.377 INFO [RES] Physical Disk <FltLowestPrice_K>: Found 2 mount points for device \Device\Harddisk8\Partition2
00001cbc.00001cdc::2015/03/12-14:41:11.377 INFO [RES] Physical Disk: PNP: Update volume exit, status 1168
00001cbc.00001cdc::2015/03/12-14:41:11.377 INFO [RES] Physical Disk: PNP: Updating volume
\\?\STORAGE#Volume#{1a8ddb8e-fe43-11e2-b7c5-6c3be5a5cdca}#0000000008100000#{53f5630d-b6bf-11d0-94f2-00a0c91efb8b}
00001cbc.00001cdc::2015/03/12-14:41:11.377 INFO [RES] Physical Disk: PNP: Update volume exit, status 5023
00001cbc.000004e0::2015/03/12-14:41:11.377 ERR [RES] Physical Disk: Failed to get volname for drive H:\, status 2
00001cbc.000004e0::2015/03/12-14:41:11.377 INFO [RES] Physical Disk <FltLowestPrice_K>: VolumeIsNtfs: Volume
\\?\GLOBALROOT\Device\Harddisk8\Partition2\ has FS type NTFS
00001cbc.000004e0::2015/03/12-14:41:11.377 INFO [RES] Physical Disk: Volume
\\?\GLOBALROOT\Device\Harddisk8\Partition2\ has FS type NTFS
00001cbc.000004e0::2015/03/12-14:41:11.377 INFO [RES] Physical Disk: MountPoint H:\ points to volume
\\?\Volume{e32c13d5-02e6-4924-a2d9-59a6fae1a1be}\
Sounds like you have a cluster hive that is out of date or bad, or some registry settings which are incorrect. You'll want to have this question transferred to the Windows forum, as that's really what you're asking about.
-Sean
The views, opinions, and posts do not reflect those of my company and are solely my own. No warranty, service, or results are expressed or implied. -
Server 2012 Failover cluster. Make two VMs stay on the same node
We have a unique situation where I need two machines to stay on the same node. It's a 4-node cluster with 30+ resources, but I want to make sure two boxes are ALWAYS on the same node. If one migrates to another node, the second needs to follow. Is there a way to do this?
How can this KB help keep the two VMs on the same node?
With all due respect @justinv, how could this have helped with your problem? Your question was: "We have a unique situation where I need two machines to stay on the same node. It's a 4 node cluster with 30+ resources but I want to make sure two boxes are ALWAYS on the same node"
and the KB that Elden showed you is for: "Failover clusters that are running inside of virtual machines (sometimes referred to as "guest clusters") may have problems with nodes joining the cluster."
@justinv, could you tell us more about this? Did I misunderstand your question?
Greetings, Robert Smit Follow me @clustermvp http://robertsmit.wordpress.com/ “Please click "Vote As Helpful" if it is helpful for you and Proposed As Answer” Please remember to click “Mark as Answer” on the post that helps you
I explained in one of my replies that my underlying issue was exactly what the KB fixed: a guest cluster failing when moved to different nodes. That's the only reason I wanted them on the same node to begin with. So while this post didn't solve my original question, it solved what my real problem was. -
2 node failover cluster power down
I have a 2-node failover cluster. When I power down a node that has the SQL Server instance and resources, all the resources and services fail over to the other node. Once I see that all the resources and services report "online", I then power down that node. I am being told that this is improper because failover may not have completed. Is that correct?
Also, in our 2-node failover cluster, is there a proper sequence for restarting the powered-down nodes?
Hi,
The cluster group containing SQL Server can be configured for automatic failback to the primary node when it becomes available again. By default, this is set to off.
To Configure:
Right-click the group containing SQL Server in the cluster administrator, select 'properties' then 'failback' tab.
To prevent auto-failback, select 'Prevent Failback'; to allow it, select 'Allow Failback' and then one of the following options:
Immediately: not recommended, as it can disrupt clients
Failback between n and n1 hours: allows a controlled failback to a preferred node (if it's online) during a certain period.
The related article:
Windows Failover Clustering Overview
http://blogs.technet.com/b/rob/archive/2008/05/07/failover-clustering.aspx
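The same failback settings can be inspected and changed from PowerShell via the cluster group object. A sketch (assumes the FailoverClusters module; the group name is a placeholder and should match your SQL Server group):

```powershell
Import-Module FailoverClusters

# Inspect the current failback configuration for the SQL Server group
$grp = Get-ClusterGroup -Name "SQL Server (MSSQLSERVER)"
$grp | Format-List AutoFailbackType, FailbackWindowStart, FailbackWindowEnd

# 0 = prevent failback, 1 = allow failback
$grp.AutoFailbackType = 1

# Restrict failback to a quiet window, e.g. between 02:00 and 04:00
$grp.FailbackWindowStart = 2
$grp.FailbackWindowEnd = 4
```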
Hope this helps.
-
Failover cluster not cleanly shutting down service
I've got a two node 2008 R2 failover cluster. I have a single service being managed by it that I configured just as a generic service. The failover works perfectly when the service is stopped, or when one of the machines goes down, and the immediate
failback I have configured works perfectly in both scenarios as well.
However, there's an issue when I take the networking down on the preferred owner of the service. As far as I can tell (this is the first time I've tried failover clustering, so I'm learning), when I take the networking down, the cluster service shuts
down, and in turn shuts down the service I've told it to manage. At this point, when the services aren't running, the service fails over to the secondary as intended. The problem shows up when I turn the networking back on. The service tries
and fails to start on the primary (as many times as I've configured it to try), and then eventually gives up and goes back to the secondary.
The reason for this, examining logs for the service, is that the required port is already in use. I checked some more, and sure enough, when I take the networking offline the service gets shut down, but the executable is still running. This is
repeatable every time. When I just stop the service, though, the executables go away. So it's something to do specifically with how the managed service gets shut down *when it's shut down due to the cluster service stopping*. For some reason
it's not cleaning up that associated executable.
Any ideas as to why this is happening and how to fix or work around it would be extremely welcome. Thank you!
Try to generate the cluster log using cluster log /g /copy:<path to a local folder>. You might need to bump up log verbosity using cluster /prop ClusterLogLevel=5 (you can check the current level using cluster /prop).
You also can look at the SCM diagnostic channel in the event viewer. Start eventvwr. Wait for the clock icon on the Application and Services Logs to go away. Once the clock icon is gone select this entry and in the menu check Show Analytic and Debug Logs.
Now expand to the SCM provider located at
Application and Services Logs\Microsoft\Service Control Manager Performance Diagnostic Provider\Diagnostic.
or Microsoft-Windows-Services/Diagnostic
Enable the log, run repro, disable the log. After that you should see events from the SCM showing you your service state transitions.
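The enable/repro/disable cycle described above can also be scripted with wevtutil from an elevated prompt (a sketch; the channel name is taken from the reply above and may vary by OS version):

```powershell
# Enable the SCM analytic/debug channel
wevtutil sl Microsoft-Windows-Services/Diagnostic /e:true

# ... reproduce the failover and the failed service restart ...

# Disable the channel, then export the captured events for review
wevtutil sl Microsoft-Windows-Services/Diagnostic /e:false
wevtutil qe Microsoft-Windows-Services/Diagnostic /f:text
```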
The terminate parameters do not seem to be configurable. I can think of two ways of fixing the issue:
- Writing your own cluster resource DLL where you can implement your own policies. This would be a place to start: http://blogs.msdn.com/b/clustering/archive/2010/08/24/10053405.aspx.
- This option assumes you cannot change the source code of the service to kill orphaned child processes on startup, so you have to clean up by some other means. Create another service and make your service dependent on this new service. The new service must be much faster in responding to SCM commands. On start of this service, use PSAPI to enumerate all processes running on the machine and kill the orphaned child processes. You should probably be able to achieve something similar using a GenScript resource plus a VBScript that does the cleanup.
Regards, Vladimir Petter, Microsoft Corporation -
Two VM's in one role - Failover cluster
Hello,
In my 2 node Hyper-V 2012R2 cluster I had 2 VM's, DC01 and APP01.
Today, I only saw DC01 in the Roles list in Failover Cluster Manager. After a while I found APP01 under the Resources for DC01. What is happening here, and how can I revert it? APP01 is down.
You should have 1 VM per Clustered Role.
In Server 2012 R2, it's not supported to have more than 1 VM per Clustered Role.
Sam Boutros, Senior Consultant, Software Logic, KOP, PA http://superwidgets.wordpress.com (Please take a moment to Vote as Helpful and/or Mark as Answer, where applicable) _________________________________________________________________________________
Powershell: Learn it before it's an emergency http://technet.microsoft.com/en-us/scriptcenter/powershell.aspx http://technet.microsoft.com/en-us/scriptcenter/dd793612.aspx -
Can't remove Failover Cluster feature on Windows 2008 R2
Hello
When remove the Failover Cluster feature has following message:
Cannot remove Failover Clustering
This server is an active node in a failover cluster. Uninstalling the Failover Clustering feature on this node may impact the availability of clustered services and applications. It is recommended that you first evict the server from cluster membership. This can be done through the Failover Cluster Management snap-in by expanding the console tree under Nodes, selecting the node, clicking More Actions, and then clicking Evict.
I'm sure there is no cluster formed, so how can I remove it?
Thanks!
Hey, I have the same problem.
Somehow clustering got installed on one node running Windows 2008 R2, but it shows nothing in the Failover Cluster Manager wizard and the Cluster service is not running.
When I try to remove the failover cluster feature, it says:
"This server is an active node in a failover cluster. Uninstalling the Failover Clustering feature on this node may impact the availability of clustered services and applications. It is recommended that you first evict the server from cluster membership. This can be done through the Failover Cluster Management snap-in by expanding the console tree under Nodes, selecting the node, clicking More Actions, and then clicking Evict."
But there is no cluster at all, and I am not sure how to remove it.
So let me know: will the PowerShell command "Clear-ClusterNode" fix my problem?
And please let me know: do I need to run it in a normal PowerShell command line, or in the failover cluster PowerShell environment? -
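Regarding the Clear-ClusterNode question above: it is a cmdlet from the FailoverClusters module, so it can be run from any elevated PowerShell session once the module is imported; no special "failover cluster" command line is needed. A sketch:

```powershell
Import-Module FailoverClusters

# Remove stale cluster configuration from the local node
Clear-ClusterNode -Force
```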
Can I upgrade Server 2012R2 Standard to 2012R2 Datacenter while in a failover cluster?
I would like to know if it is supported to upgrade my Server 2012R2 Standard to 2012R2 Datacenter while in a failover cluster. I have a 3 node cluster all running Server 2012R2 standard with only a few VM's running on them at this time.
-Jim
Should be no problem. There are no technical differences between Standard and Datacenter; the difference is in the licensing.
I would evict a node, upgrade it, and add it back in. Rinse and repeat.
. : | : . : | : . tim -
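The evict/upgrade/re-add loop above can be sketched with the FailoverClusters cmdlets. The cluster and node names below (CLU1, NODE3) are placeholders, and the DISM edition-servicing step is shown as a comment since it needs your own Datacenter product key:

```powershell
# 1. Drain roles off the node, then evict it from the cluster.
Suspend-ClusterNode -Name NODE3 -Cluster CLU1 -Drain
Remove-ClusterNode -Name NODE3 -Cluster CLU1

# 2. On NODE3, upgrade the edition in place and reboot, e.g.:
#    DISM /Online /Set-Edition:ServerDatacenter /ProductKey:<your key> /AcceptEula

# 3. Add the upgraded node back into the cluster.
Add-ClusterNode -Name NODE3 -Cluster CLU1
```

Repeat for each node until all three are on Datacenter.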
Hi,
New to 2012 and implementing a clustered environment for our File Services role. Have got to a point where I have successfully configured the Shadow copy settings.
Have a large (15tb) disk. S:
Have a VSS drive (volume shadow copy drive) V:
Have successfully configured through Windows Explorer the Shadow copy settings.
Created dependencies in the Failover Cluster Manager console whereby S: depends on V:
However, when I failover the resource and browse the Client Access Point share there are no entries under the "Previous Versions" tab.
When I visit the S: drive in Windows Explorer and open the Shadow Copy dialog box, there are entries showing the times and dates of the shadow copies taken on the original node. So the disk knows about the shadow copies that were taken on the
original node, but the "Previous Versions" tab has no entries to display.
This is in a 2012 server (NOT R2 version).
Can anyone explain what might be the reason? Do I have an "issue" or is this by design?
All help appreciated!
Kathy
Kathleen Hayhurst, Senior IT Support Analyst

Hi,
Please first check the requirements in following article:
Using Shadow Copies of Shared Folders in a server cluster
http://technet.microsoft.com/en-us/library/cc779378(v=ws.10).aspx
Cluster-managed shadow copies can only be created in a single quorum device cluster on a disk with a Physical Disk resource. In a single node cluster or majority node set cluster without a shared cluster disk, shadow copies can only be created and managed
locally.
You cannot enable Shadow Copies of Shared Folders for the quorum resource, although you can enable Shadow Copies of Shared Folders for a File Share resource.
The recurring scheduled task that generates volume shadow copies must run on the same node that currently owns the storage volume.
The cluster resource that manages the scheduled task must be able to fail over with the Physical Disk resource that manages the storage volume.
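Given the last two requirements, a quick sanity check is to confirm that the Physical Disk resource for S: and the shadow-copy task resource live in the same cluster group (so they fail over together). A sketch using the FailoverClusters module; resource type names are the standard built-in ones, but verify them against your own cluster:

```powershell
Import-Module FailoverClusters

# List the disk and shadow-copy task resources with their owning group/node;
# the S: disk and its "Volume Shadow Copy Service Task" resource should share
# the same OwnerGroup and OwnerNode.
Get-ClusterResource |
    Where-Object { $_.ResourceType -in 'Physical Disk','Volume Shadow Copy Service Task' } |
    Format-Table Name, ResourceType, OwnerGroup, OwnerNode, State
```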
If you have any feedback on our support, please send to [email protected] -
SQL Server Agent fails to connect to DB after enabling mirror on failover cluster
Hello:
We have multiple databases running in a Failover Cluster instance: SQL 2012SP1 on Server 2008 R2 failover cluster (NOT AlwaysOn). We are trying to add a high-performance mirror in a standalone instance for DR. My understanding is that should be a perfectly
normal, supported configuration.
The mirroring is working properly; however, the clustered SQL Server agent is unable to run jobs that run in the mirrored databases.
We get the following in the job log: Unable to connect to SQL Server 'VIRTUALSERVERNAME\INSTANCE'. The step failed.
There is a partner message in the agent log: [165] ODBC Error: 0, Connecting to a mirrored SQL Server instance using the MultiSubnetFailover connection option is not supported. [SQLSTATE IMH01]
The cluster is not a multi-subnet cluster. All hosts are connected to the same subnets and there is no storage replication. I cannot find any place where I can adjust the connection-string options for SQL Agent.
Any guidance or suggestions on how to resolve this would be appreciated.
~joe

SQL Team - MSFT:
Thank you for taking the time to research and provide a clear answer.
This seems very much a workaround and very unsatisfactory.
You are correct, there is an IP dependency with OR condition. Moving to an AND condition is not viable for us. The whole point is to provide network redundancy. With an AND condition, if EITHER network interface fails, the service will go offline or fail
to come online without manual intervention. This is arguably worse for uptime than having a single interface available.
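For anyone comparing their own setup, the IP dependency expression can be inspected with the FailoverClusters cmdlets. The resource name below is a placeholder; list your resources first to find the actual network name resource:

```powershell
Import-Module FailoverClusters

# Find the SQL network name resource, then show its dependency expression;
# an OR of two IP Address resources looks like "[IP A] or [IP B]".
Get-ClusterResource
Get-ClusterResourceDependency -Resource "SQL Network Name (VIRTUALSERVERNAME)"
```

It is the presence of multiple possible IPs behind the network name that leads client components to apply multi-subnet connection behavior, which is what the agent error message is complaining about.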
We are in process of rewriting all our SQL jobs to start in tempdb before transitioning to the appropriate target database. If this works for all of our jobs, I will mark the above response as answer.
Again, thank you for the answer.
Regards,
Joe M. -
GI installation on a single-node cluster error.
Hello, I am trying to install GI on a single-node cluster (Solaris 10 / SPARC), but the root.sh script fails with the following error (this is not a GI installation for a Standalone Server):
root@selvac./dev/ASM/OCRVTD_DG # /app/oracle/grid/11.2/root.sh
Running Oracle 11g root script...
The following environment variables are set as:
ORACLE_OWNER= grid
ORACLE_HOME= /app/oracle/grid/11.2
Enter the full pathname of the local bin directory: [usr/local/bin]:
Copying dbhome to /usr/local/bin ...
Copying oraenv to /usr/local/bin ...
Copying coraenv to /usr/local/bin ...
Creating /var/opt/oracle/oratab file...
Entries will be added to the /var/opt/oracle/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Using configuration parameter file: /app/oracle/grid/11.2/crs/install/crsconfig_params
Creating trace directory
LOCAL ADD MODE
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
OLR initialization - successful
root wallet
root wallet cert
root cert export
peer wallet
profile reader wallet
pa wallet
peer wallet keys
pa wallet keys
peer cert request
pa cert request
peer cert
pa cert
peer root cert TP
profile reader root cert TP
pa root cert TP
peer pa cert TP
pa peer cert TP
profile reader pa cert TP
profile reader peer cert TP
peer user cert
pa user cert
Adding daemon to inittab
ACFS-9200: Supported
ACFS-9300: ADVM/ACFS distribution files found.
ACFS-9312: Existing ADVM/ACFS installation detected.
ACFS-9314: Removing previous ADVM/ACFS installation.
ACFS-9315: Previous ADVM/ACFS components successfully removed.
ACFS-9307: Installing requested ADVM/ACFS software.
ACFS-9308: Loading installed ADVM/ACFS drivers.
ACFS-9327: Verifying ADVM/ACFS devices.
ACFS-9309: ADVM/ACFS installation correctness verified.
CRS-2672: Attempting to start 'ora.mdnsd' on 'selvac'
CRS-2676: Start of 'ora.mdnsd' on 'selvac' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'selvac'
CRS-2676: Start of 'ora.gpnpd' on 'selvac' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'selvac'
CRS-2672: Attempting to start 'ora.gipcd' on 'selvac'
CRS-2676: Start of 'ora.cssdmonitor' on 'selvac' succeeded
CRS-2676: Start of 'ora.gipcd' on 'selvac' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'selvac'
CRS-2672: Attempting to start 'ora.diskmon' on 'selvac'
CRS-2676: Start of 'ora.diskmon' on 'selvac' succeeded
CRS-2676: Start of 'ora.cssd' on 'selvac' succeeded
ASM created and started successfully.
Disk Group OCRVTD_DG created successfully.
The ora.asm resource is not ONLINE
Did not succssfully configure and start ASM at /app/oracle/grid/11.2/crs/install/crsconfig_lib.pm line 6465.
/app/oracle/grid/11.2/perl/bin/perl -I/app/oracle/grid/11.2/perl/lib -I/app/oracle/grid/11.2/crs/install /app/oracle/grid/11.2/crs/install/rootcrs.pl execution failed
I also hit the "PRVF-5150: Path OCRL:DISK1 is not a valid path on all nodes" error, but as I have read that it is a bug, I ignored it. But...
I think my OCR/voting ASM disk group is OK: it is accessible by the grid user with 660 permissions. It seems ASM does not start, or does not start in time.
Any help is welcome.
Thanks in advance.

Thanks a lot for the hint. I had already checked this doc, but I think it is not the problem. Actually, the "ora.asm resource is not ONLINE" error is not correct: after root.sh fails, ora.asm is ONLINE:
root@selvac./app/oracle/grid/11.2/bin # ./crsctl check resource ora.asm -init
root@selvac./app/oracle/grid/11.2/bin # ./crsctl stat resource ora.asm -init
NAME=ora.asm
TYPE=ora.asm.type
TARGET=ONLINE
STATE=ONLINE on selvac
The last part of the /app/oracle/grid/11.2/cfgtoollogs/crsconfig/rootcrs_selvac.log file reads :
>
ASM created and started successfully.
Disk Group OCRVTD_DG created successfully.
End Command output2011-04-14 13:24:16: Executing cmd: /app/oracle/grid/11.2/bin/crsctl check resource ora.asm -init
2011-04-14 13:24:17: Executing cmd: /app/oracle/grid/11.2/bin/crsctl status resource ora.asm -init
2011-04-14 13:24:17: Command output:
NAME=ora.asm
TYPE=ora.asm.type
TARGET=ONLINE
STATE=OFFLINE
End Command output
2011-04-14 13:24:17: Checking the status of ora.asm
(the same "crsctl status resource ora.asm -init" command, with identical TARGET=ONLINE / STATE=OFFLINE output, repeats every ~5 seconds through 13:25:04)
2011-04-14 13:25:09: The ora.asm resource is not ONLINE
2011-04-14 13:25:09: Running as user grid: /app/oracle/grid/11.2/bin/cluutil -ckpt -oraclebase /app/grid -writeckpt -name ROOTCRS_BOOTCFG -state FAIL
2011-04-14 13:25:09: s_run_as_user2: Running /bin/su grid -c ' /app/oracle/grid/11.2/bin/cluutil -ckpt -oraclebase /app/grid -writeckpt -name ROOTCRS_BOOTCFG -state FAIL '
2011-04-14 13:25:10: Removing file /var/tmp/mbahSaGPn
2011-04-14 13:25:10: Successfully removed file: /var/tmp/mbahSaGPn
2011-04-14 13:25:10: /bin/su successfully executed
2011-04-14 13:25:10: Succeeded in writing the checkpoint:'ROOTCRS_BOOTCFG' with status:FAIL
2011-04-14 13:25:10: ###### Begin DIE Stack Trace ######
2011-04-14 13:25:10: Package File Line Calling
2011-04-14 13:25:10: --------------- -------------------- ---- ----------
2011-04-14 13:25:10: 1: main rootcrs.pl 322 crsconfig_lib::dietrap
2011-04-14 13:25:10: 2: crsconfig_lib crsconfig_lib.pm 6465 main::__ANON__
2011-04-14 13:25:10: 3: crsconfig_lib crsconfig_lib.pm 6390 crsconfig_lib::perform_initial_config
2011-04-14 13:25:10: 4: main rootcrs.pl 671 crsconfig_lib::perform_init_config
2011-04-14 13:25:10: ####### End DIE Stack Trace #######
2011-04-14 13:25:10: 'ROOTCRS_BOOTCFG' checkpoint has failed
So this must be a bug. During root.sh execution ora.asm is OFFLINE, but after the script fails it is ONLINE. It might be a question of waiting/retrying or a timeout: the "Checking the status of ora.asm" command is repeated several times during root.sh, but perhaps not often enough. Now root.sh has failed and the installation is halted, but ASM is ONLINE.
Any other Idea?
Thanks again.
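One thing worth trying (a sketch under the assumption that the failure really is just ASM starting too slowly, using the paths from this thread): deconfigure the failed run and execute root.sh again, since a retry will go through the same checks with ASM already warm:

```shell
# Roll back the failed CRS configuration on this node (11.2 syntax),
# then re-run root.sh; it resumes from the recorded checkpoint state.
/app/oracle/grid/11.2/perl/bin/perl \
    /app/oracle/grid/11.2/crs/install/rootcrs.pl -deconfig -force
/app/oracle/grid/11.2/root.sh
```

If the re-run fails at the same ora.asm check, an SR with Oracle Support referencing the rootcrs log excerpt above would be the next step.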
Maybe you are looking for
-
Windows 8 installation freezes on reboot
I downloaded bootcamp today and went through it until I got to the part where I partitioned half of my memory to Windows. I then inserted a new Windows 8 disc and it downloaded the partition to it then rebooted the computer. Once it rebooted it began
-
Photoshop Elements 12 Default Issue
I have installed PE 12, but I'm unable to open my photos using Photoshop Elements 12 as default. (I must first open PE 12, choose Open: then navigate to the photo.) I have several other versions of PE installed, including PE 11. Right now when I doub
-
Software Components for XI in SLD.
I'm starting with my first project in XI, and I have the following doubts. The SAP ERP systems are loaded as Technical Systems into the SLD with their products and software components, which are contents of Content Repository of the SLD. If, for exam
-
WD4A : Displaying images in ALV table
Hi, Does anyone know how to display images in ALV tables in Webdynpro?
-
Hi Everyone, I'm using SAP B1 2007B. When I open XL Reporter for any company, some code numbers are displayed in the place of module names for all modules except financial. For eg this is the menu I see below financial in the tree view: B1_08_42553.