Cluster Network Randomly Failing on Hyper-V Cluster
Please let me know if there is a more appropriate forum. I am having a really strange, seemingly random issue. I have a 3-host cluster; all hosts have identical hardware and run Hyper-V Server 2012 R2. The networking is as follows, and each network is a different VLAN/subnet:
3 Cluster networks for virtual machines
1 Cluster network for cluster traffic/management
1 Heartbeat network
2 iSCSI networks for storage
All of the networks are perfectly fine except for one, which fails on a random node at a random time during the day (so far, at most once per day). If I start to live migrate virtual machines that are on the failed network, the cluster network comes back up.
The cluster networks are teamed using SCVMM; the teams are switch independent and run the Dynamic teaming algorithm. We tried changing the network switches in case the network hardware was faulty; things ran fine for one day and then it happened again today, so we've ruled that out.
The only error message I get is 1127, which states that the cluster network has gone into a failed state, and that doesn't help much. I've run the cluster validation tool for networking several times and it always passes 100%.
What I am worried about is hardware incompatibility, as I am using Dell servers (PowerEdge R720) that have Broadcom NICs in them. We have 12 Ethernet ports in each server, and the hardware is identical across servers: four integrated Broadcom ports, four from a Broadcom quad add-on NIC, and four from an Intel quad add-on NIC. All are server-grade NICs. The only problem I've had in the past is with VMQ, which we've had to disable as a workaround, but that has always stabilized our virtual networks. In any case, all of the cluster networks for virtual machines are set up identically, and only this particular one randomly fails on any one of the three hosts (it has happened at least once on each node now).
I am wondering if anyone has had this experience before. I have read that there are some nasty compatibility issues between Broadcom and Hyper-V, but I am wondering if someone could give me some ideas on how to narrow this down, since the event logs don't seem to be speaking in obvious terms to me.
Please let me know if you have any suggestions on how to narrow down what's causing this or if there is more information that I could provide. In the meantime, I'm going to try to take note of which virtual machines are running on the host that has the network fail, just in case there's some correlation there, but it could take a while to accrue any useful data and our users aren't too happy with the instability...
Thank you in advance for your time and sorry for the lengthy post!
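The plan of noting which virtual machines are on the failing host could be tallied with something like the following. This is only a rough sketch: the incident list, host names and VM names are all made up for illustration and do not come from the cluster described here.

```python
from collections import Counter

# Hypothetical incident log: one entry per cluster-network failure, recording
# the node that failed and the VMs it was hosting at that moment.
incidents = [
    {"node": "HOST1", "vms": {"vm-a", "vm-b", "vm-c"}},
    {"node": "HOST3", "vms": {"vm-b", "vm-d"}},
    {"node": "HOST2", "vms": {"vm-b", "vm-c", "vm-e"}},
]


def vm_failure_counts(incidents):
    """Count how many incidents each VM was present for."""
    counts = Counter()
    for incident in incidents:
        counts.update(incident["vms"])
    return counts


def always_present(incidents):
    """Return the VMs that were on the failing node in every incident."""
    if not incidents:
        return set()
    return set.intersection(*(set(i["vms"]) for i in incidents))


counts = vm_failure_counts(incidents)
suspects = always_present(incidents)
print(counts.most_common())
print(sorted(suspects))
```

A VM that shows up in every incident is only a lead worth checking, not proof of a cause, especially with just a handful of incidents.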
Since I made the change last Friday evening, 4/10, I haven't experienced the issue. I won't be completely convinced that this resolved it until I monitor for at least one more week, since the problem didn't present itself for the first time until I was already one week into live deployment. Also, the link below puts it much more eloquently than I did and describes my issue exactly. That article, coupled with the KB article that someone posted in its comments section (the same one that I posted earlier here), is what led me to check the VMQ status through PowerShell, which is much better than going through the registry to do it (I'm running Hyper-V Server 2012 R2, which is like Core, so I don't have the GUI options shown in the article).
http://alexappleton.net/post/77116755157/hyper-v-virtual-machines-losing-network
I could try updating the driver but there is mention in the comments of this post that driver updates have yet to resolve this issue so we may still be waiting on Broadcom for a fix. Please confirm otherwise if anyone has any information.
Similar Messages
-
Failover Cluster Network Name Failed and Can't be Repaired
I have an issue that seems to be a different problem than any others have encountered.
I've scoured everything I can find and nothing has fixed my problem.
The problem starts with the common problem of the cluster network name failing on my 2 node server 2012 file server cluster. The computer object was still in AD and appeared to be fine so it was not the common problem of the object
getting deleted somehow. At the time, there was no other object with that name in the recycling bin, so I don't think it was mistakenly deleted and quickly recreated to cover any tracks, so to speak.
Following one guide, I tried to find the registry key that corresponded with the GUID of the object, but neither node in the cluster had it in its registry (which may be part of the problem).
Since it was in the failed state, I tried to do the repair on the object to no avail.
We run a "locked down" DC environment so all computer objects have to be pre-provisioned. They were all pre-provisioned successfully and successfully assigned during cluster creation. The cluster was running with no issues for a month
or so before this problem came up.
When I do a repair on the object while taking diagnostic logs the following 4609 error appears:
The action 'Repair' did not complete. - System.ApplicationException: An error occurred resetting the password for 'Cluster Name'. ---> System.ComponentModel.Win32Exception: Unknown error (0x80005000)
There appears to be a corresponding 4771 error with a failure code 0x18 that comes from the security log of the DC that states there was a Kerberos pre-authentication failure for the cluster network name object (Domain\Clustername$)
I believe this is what is causing the repair failure. All the information I found related to security error 4771 was either a bad credentials given for a user account or the fix was to reconnect the computer to the domain. I can't seem to find
a way to do this with the cluster network name. If there's a way please let me know.
I've tried a number of things, like resetting the object, disabling it, deleting and creating a new object with the same name, deleting that new object and recovering the original, etc...
Can anyone shed some light on what is going on and, hopefully, how to fix it other than by rebuilding the cluster? I'm quite close to just tearing it down and building it back up, but I am hesitant because this cluster is currently in production...
Any help would be appreciated.
Hi,
I have not found an issue similar to yours. Based on my experience, the 4096 error is often caused by a CSV disk issue, and the 0x80005000 error is sometimes caused by a repetitive computer object in the OU. Please check those areas, or run the validation test and then post the error information.
Although I do have a CSV, there don't seem to be any problems with it, and it was running just fine for a month or so before the problem started. I double-checked and there are no duplicate computer objects. Maybe I don't understand what you mean by repetitive; could you explain further?
The cluster validates successfully with a few warnings:
Validating cluster resource Name: DT-FileCluster.
This resource is marked with a state of 'Failed' instead of
'Online'. This failed state indicates that the resource had a problem either
coming online or had a failure while it was online. The event logs and cluster
logs may have information that is helpful in identifying the cause of the
failure.
- This is because the cluster name is in the failed state
Validating the service principal names for Name:
DT-FileCluster.
The network name Name: DT-FileCluster does not have a valid
value for the read-only property 'ObjectGUID'. To validate the service principal
name the read-only private property 'ObjectGuid' must have a valid value. To
correct this issue make sure that the network name has been brought online at
least once. If this does not correct this issue you will need to delete the
network name and re-create it.
- This is definitely related to the problem and the GUID probably got removed when we attempted a fix by resetting the object and trying the repair from the failover cluster manager.
The user running validate, does not have permissions to create
computer objects in the 'ad.unlv.edu' domain.
- This is correct; we run a restricted domain. I have a delegated OU that I can pre-provision accounts in. The account was pre-provisioned successfully and was at one point set up and working just fine.
There are no other errors nor warnings. -
High Network Utilization on HeartBeat and LiveMigration networks - Windows 2012 R2 Hyper-V cluster
Hi all,
I have just setup a fresh Windows 2012 R2 Hyper-V cluster with 4 nodes.
Config:
Network:
2 x Networks for iSCSI
1 x VM network 1
1 x VM network 2
1 x Heartbeat
1 x LiveMigration
Disks:
5 GB = Quorum
2 x 2.99 TB (Deduplication enabled)
The problem I am getting is that HeartBeat and LiveMigration (both configured with Cluster Only traffic) have a lot of traffic on them even though no live migration is going on (the LiveMigration card is the only one configured for live migration).
All network configuration is the same as it was in the Windows 2008 R2 SP1 cluster that these machines were running before (not an upgrade; a fresh install and migration), and that cluster did not have this "problem".
Has anyone experienced this or have a solution to this?
Regards,
Thorir
thorir
Hi Thorir,
According to your description, you can try to monitor and analyze the Live Migration (LM) traffic with Network Monitor for troubleshooting.
Here are the links for downloading and using Network Monitor:
http://www.microsoft.com/en-us/download/details.aspx?id=4865
http://technet.microsoft.com/en-us/library/cc723623.aspx
Any further information please feel free to let us know.
Hope this helps
Best Regards
Elton Ji
-
Cluster Network 2 is missing from Failover cluster Manager
I have a two-node Windows 2008 R2 SP1 MS SQL cluster. There was an issue with one node's NIC, which was replaced, and since then "Cluster Network 2" is not visible in Failover Cluster Manager. Also, the Heartbeat and Public NICs of both nodes appear in the "Cluster Network 1" container. Even though the public and heartbeat networks are on different subnets, the cluster still detects them as one subnet. I need to know how to bring back "Cluster Network 2".
After the NIC was replaced, the cluster worked fine on the other node for a couple of days, and then the cluster services started terminating on both nodes.
When I checked the cluster logs, I found the error below.
"ERR [CORE] Node 2: exception caught AlreadyExists(183)' because of 'already exists'(AODBP2DB1 - Local Area Connection)"
To correct this problem, I renamed the NICs on both nodes to Public and Private; the cluster then started, and I tested failover as well. Since then, however, I can only see "Cluster Network 1", and all NICs are listed under this container.
The cluster validation report, however, shows entries for both Cluster Network 1 and Cluster Network 2.
The network section from the cluster validation report is pasted below:
Network: Cluster Network 1
DHCP Enabled: False
Network Role: Enabled
Prefix               Prefix Length
10.100.18.0          25
Item                 Value
Network Interface    Node1 - Public
DHCP Enabled         False
IP Address           10.100.18.11
Prefix Length        25
Item                 Value
Network Interface    Node2 - Public
DHCP Enabled         False
IP Address           10.100.18.16
Prefix Length        25
Network: Cluster Network 2
DHCP Enabled: False
Network Role: Internal
Prefix               Prefix Length
10.101.130.0         25
Item                 Value
Network Interface    Node1 - Private
DHCP Enabled         False
IP Address           10.101.130.11
Prefix Length        25
Item                 Value
Network Interface    Node2 - Private1
DHCP Enabled         False
IP Address           10.101.130.13
Prefix Length        25
Verifying that each cluster network interface within a cluster network is configured with the same IP subnets.
Examining network Cluster Network 1.
Network interface Node1- Public has addresses on all the subnet prefixes of network Cluster Network 1.
Network interface Node2- Public has addresses on all the subnet prefixes of network Cluster Network 1.
Examining network Cluster Network 2.
Network interface Node1- Private has addresses on all the subnet prefixes of network Cluster Network 2.
Network interface Node2- Private1 has addresses on all the subnet prefixes of network Cluster Network 2.
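The addressing in the report can be double-checked offline. The sketch below uses only the prefixes and IP addresses quoted in the validation report and Python's standard ipaddress module; it says nothing about what the cluster service itself has cached, so treat it as a sanity check on the subnet math only.

```python
import ipaddress

# Prefixes and addresses copied from the validation report above.
networks = {
    "Cluster Network 1": ipaddress.ip_network("10.100.18.0/25"),
    "Cluster Network 2": ipaddress.ip_network("10.101.130.0/25"),
}
interfaces = {
    "Node1 - Public": "10.100.18.11",
    "Node2 - Public": "10.100.18.16",
    "Node1 - Private": "10.101.130.11",
    "Node2 - Private1": "10.101.130.13",
}


def classify(address):
    """Return the names of the cluster networks whose prefix contains address."""
    ip = ipaddress.ip_address(address)
    return [name for name, net in networks.items() if ip in net]


placement = {name: classify(addr) for name, addr in interfaces.items()}
subnets_overlap = networks["Cluster Network 1"].overlaps(networks["Cluster Network 2"])

for name, nets in placement.items():
    print(name, "->", nets)
print("subnets overlap:", subnets_overlap)
```

If the two prefixes do not overlap and each interface lands in exactly one of them, the addressing itself does not explain why all NICs appear under a single cluster network.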
Verifying that, for each cluster network, all adapters are consistently configured with either DHCP or static IP addresses.
Checking DHCP consistency for network: Cluster Network 1. Network DHCP status is disabled.
DHCP status (disabled) for network interface Node1- Public matches network Cluster Network 1.
DHCP status (disabled) for network interface Node2- Public matches network Cluster Network 1.
Checking DHCP consistency for network: Cluster Network 2. Network DHCP status is disabled.
DHCP status (disabled) for network interface Node1- Private matches network Cluster Network 2.
DHCP status (disabled) for network interface Node2- Private1 matches network Cluster Network 2. -
How to delete Cluster Network 1 in Hyper-V cluster?
Hi folks!
So I added a few 10Gbe cards to our 4 node HV 2012 R2 cluster. I connected two of them to the switches and almost immediately I saw this weird "Cluster Network 1" show up in Failover Cluster Manager.
I have since then disconnected the network cables, but it still shows up and the event logs keep saying that "Cluster Network 1" has failed.
Is there a way to delete this "Cluster Network 1"?
I have already set "Do not allow cluster network communication on this network". It is not used in the "Live Migration Settings". Rebooted nodes. Why is it still there!!?? Pulling my hair out here! LOL.
PS: the closest I could find on the 'net is this old thread:
https://social.technet.microsoft.com/Forums/windowsserver/en-US/7f8ffe60-835d-489e-a86e-20893c52fb21/cluster-network-status-failed
-Rajeev rajdude.com
The cluster goes out and finds all networks that are available. This is an automatic process. As long as there is a network there, it will show it. If you don't want it to show, disable the interface and it will disappear.
. : | : . : | : . tim
Thanks a million Tim! Disabling the correct interface made it disappear!
Actually, I had disabled the interface before I started this thread. When you posted the suggestion, I went back and looked at the network connections again and found out that I had disabled the wrong interface! My bad!
However, here is the interesting thing:
After I disabled the NIC and it disappeared from the FCM I physically disconnected its cable (actually I disconnected all unused / unconfigured NICs) ....then I re-enabled it. Now I see that FCM does NOT pick it up again.
I think FCM picked up those NICs because at one time I had put that NIC in the production VLAN (from the switch side).
One observation:
If you have NICs in a HV machine which are not connected to a switch, they do NOT show up in FCM. Once you connect them to some network, they start showing up. I guess by connecting them to a network, they are able to see NICs in other HV nodes.
Take a look at this screenshot. I have 6 NICs in there which are disconnected, none show up in FCM
-Rajeev rajdude.com -
Server 2008 R2 Failover cluster network configuration
Hi
We have a customer with a Server 2008 R2 Hyper-V failover cluster. They have 2 cluster networks, "Cluster Network 1" and "Cluster Network 2".
"Cluster Network 1": NIC team on 172.16.1.0/24 for private cluster network communication
"Cluster Network 2": NIC team on 192.168.1.0/24 for production network communication
I can see that "Cluster Network 1" is configured to "Allow cluster network communication on this network" and "Allow clients to connect through this network".
If "Cluster Network 1" is ONLY for communication between the two cluster nodes, then I assume the selection in "Allow clients to connect through this network" should be removed?
/Lasse
It will cause a lost network connection for any client that is accessing through that network. Those clients would need to reconnect.
Did you configure both IPs on the cluster resource name that clients are accessing? If you only configured the one you want, there should be no issue. If you configured both, then it is possible some clients might be connected via the private
network.
Another thing you should do (and if you have already done this, you will most likely not have issues at all) is disable DNS registration on any network you do not want client access coming through. If the clients can only find the resource through the registered DNS name, that is the way they will be coming in. In my clusters, which often have 7 or more NICs, there is only one with a published DNS record.
. : | : . : | : . tim -
How to specify only possible owner of the VM in Hyper-v cluster (Windows 2012R2)
Good day,
We want to prevent the migration of virtual machines between the cluster nodes in a Windows 2012R2 Hyper-V cluster.
How do we specify the only possible owner of a VM in a Hyper-V cluster (Windows 2012R2)?
SQL clustering
Hi Al_leont,
Are you using Failover Cluster Manager (FOCM) or SCVMM? I ask because possible owners are configured in different places in each. As well as possible owners, you can also configure preferred owners, affinity and anti-affinity groups, and placement rules.
To configure possible owners in FOCM, select the VM you want to configure, then in the bottom window select the Resources tab (change from the Summary tab). Right-click the Virtual Machine resource, then select the Advanced Policies tab of the popup window. You should then see the Hyper-V nodes as possible owners.
In SCVMM, just right-click the VM and select Properties, then Settings from the popup window.
Preferred owners, affinity and anti-affinity groups, and placement rules are configured in other locations or through PowerShell.
Kind Regards
Michael Coutanche
Note: Posts are provided “AS IS” without warranty of any kind, either expressed or implied, including but not limited to the implied warranties of merchantability and/or fitness for a particular purpose. -
I'm stuck here trying to figure this error out.
2003 domain, 2012 hyper v core 3 nodes. (I have two of these hyper V groups, hvclust2012 is the problem group, hvclust2008 is okay)
In Failover Cluster Manager I see these errors, "Cluster network name resource 'Cluster Name' failed registration of one or more associated DNS name(s) for the following reason: The handle is invalid."
I restarted the host node that was listed in having the error then another node starts showing the errors.
I tried to follow this site: http://blog.subvertallmedia.com/2012/12/06/repairing-a-failover-cluster-in-windows-server-2012-live-migration-fails-dns-cluster-name-errors/
Then this error shows up when doing the repair: there was an error repairing the active directory object for 'Cluster Name'
I looked at our domain controller and noticed I don't have access to local users and groups. I can access our other hvclust2008 (both clusters are same version 2012).
<image here>
I came upon this thread: http://social.technet.microsoft.com/Forums/en-US/85fc2ad5-b0c0-41f0-900e-df1db8625445/windows-2012-cluster-resource-name-fails-dns-registration-evt-1196?forum=winserverClustering
Now I'm stuck on adding a managed service account (MSA). I'm not sure if I'm way off track to fix this. Any advice? Thanks in advance!
<image here>
Thanks Elton,
I restarted the 3 hosts after applying the hotfix. Then I did the steps below and got stuck on step 5. That is when I get the error (image above): There was an error repairing the active directory object for 'Cluster Name'. For more data, see 'Information Details'.
To reset the password on the affected name resource, perform the following steps:
1. From Failover Cluster Manager, locate the name resource.
2. Right-click on the resource, and click Properties.
3. On the Policies tab, select If resource fails, do not restart, and then click OK.
4. Right-click on the resource, click More Actions, and then click Simulate Failure.
5. When the name resource shows "Failed," right-click on the resource, click More Actions, and then click Repair.
6. After the name resource is online, right-click on the resource, and then click Properties.
7. On the Policies tab, select If resource fails, attempt restart on current node, and then click OK.
Thanks -
Hyper-V cluster: Unable to fail VM over to secondary host
I am working on a Server 2012 Hyper-V Cluster. I am unable to fail my VMs from one node to the other using either LIVE or Quick migration.
A force shutdown of VMHost01 will force a migration to VMHost02. And once we are on VMHost02 we can migrate back to VMHost01, but once that is done we can't move the VMs back to VMHost02 without a force shutdown.
The following error pops up:
Event ID: 21502 The Virtual Machine Management Service failed to establish a connection for a Virtual machine migration with host.... The connection attempt failed because the connected party did not properly respond after a period of time, or the established
connection failed because connected host has failed to respond (0X8007274C)
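The code in that message can be unpacked to see what it really says. The sketch below splits the HRESULT into its standard fields; nothing here is specific to Hyper-V. For 0x8007274C the facility comes out as 7 (Win32) and the code as 10060, which is the Winsock timeout error WSAETIMEDOUT and matches the "did not properly respond after a period of time" wording.

```python
def decode_hresult(value):
    """Split a 32-bit HRESULT into its severity, facility and code fields."""
    return {
        "failure": bool((value >> 31) & 0x1),  # top bit set means failure
        "facility": (value >> 16) & 0x7FF,     # 11-bit facility field
        "code": value & 0xFFFF,                # low 16 bits: the error code
    }


fields = decode_hresult(0x8007274C)
print(fields)
```

So the error is a plain TCP connection timeout, which points at the destination host not accepting the connection rather than at the migration logic itself.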
Here's what I noticed:
VMMS.exe is running on VMHost02 however it is not listening on Port 6600. Confirmed this after a reboot by running netstat -a. We have tried setting this service to a delayed start.
I have checked Firewall rules and Anti-Virus exclusions, and they are correct. I have not run the cluster validation test yet, because I'll need to schedule a period of downtime to do so.
We can start/stop the VMMS.exe service just fine and without errors, but I am puzzled as to why it will not listen on Port 6600 anywhere. Anyone have any suggestions on how to troubleshoot this particular issue?
Thanks,
Tho H. Le
Just ran into the same issue in a 16-node cluster being managed by VMM. When trying to live migrate VMs using the VMM console, the migration would fail with the following: Error 10698. Failover Cluster Manager would report the following error code: Error (0x8007274C).
+ Validated Live Migration and Cluster networks. Everything checked out.
+ Looking in Hyper-V manager and migrations are enabled and correct networks displayed.
+ Found this particular Blog that mentions that the Virtual Machine Management service is not listening to port 6600
http://blogs.technet.com/b/roplatforms/archive/2012/10/16/shared-nothing-migration-fails-0x8007274c.aspx
Ran the following from an elevated command line:
Netstat -ano | findstr 6600
Node 2 did not return anything
Node 1 returned correct output:
TCP    10.xxx.251.xxx:6600    0.0.0.0:0    LISTENING    4540
TCP    10.xxx.252.xxx:6600    0.0.0.0:0    LISTENING    4560
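When the same check has to be repeated across many nodes, the netstat output can be filtered by script instead of by eye. This is a minimal sketch that assumes the usual five-column layout of `netstat -ano` TCP rows; the addresses in the sample are placeholders, since the real ones are masked in the output above.

```python
def listeners_on_port(netstat_output, port):
    """Return (local_address, pid) for each TCP LISTENING row on the port."""
    found = []
    for line in netstat_output.splitlines():
        parts = line.split()
        # Expected TCP row: Proto, Local Address, Foreign Address, State, PID
        if len(parts) != 5 or parts[0] != "TCP" or parts[3] != "LISTENING":
            continue
        local = parts[1]
        # Compare the text after the last colon so other ports are ignored.
        if local.rsplit(":", 1)[-1] == str(port):
            found.append((local, int(parts[4])))
    return found


# Placeholder sample in the same shape as the Node 1 output.
sample = """\
  TCP    10.0.251.10:6600    0.0.0.0:0    LISTENING    4540
  TCP    10.0.252.10:6600    0.0.0.0:0    LISTENING    4560
  TCP    0.0.0.0:445         0.0.0.0:0    LISTENING    4
"""

print(listeners_on_port(sample, 6600))
```

An empty list for port 6600 corresponds to what Node 2 showed: nothing is listening for live migration connections.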
Set Hyper-V Virtual Machine Service to delayed start.
Restarted the service; no change.
Checked the Event Logs for Hyper-V VMMS and noted the following events - VMMS Listener started for Live Migration networks, and then shortly after listener stopped.
Removed the system from the cluster and restarted - No change
Checked this host by running gpedit.msc - could not open console: Permission Error
Tried to run a GPO refresh (gpupdate /force), but error returned that LocalGPO could not apply registry settings. Group Policy processing would not continue until this was resolved.
Checked the local group policy folder on node 2 and it was corrupt:
C:\Windows\System32\GroupPolicy\Machine\reg.pol showed 0K for the size.
Copied local policy folders from Node 1 to 2, and then was able to refresh the GPOs.
Restarting the VMMS service did not change the status of the ports.
Restarted Server, added Live Migration networks back into Hyper-V manager and now netstat output reports that VMMS service is listening on 6600. -
Network Questions on 2012 R2 Hyper-V Cluster
I am going through the setup and configuration of a clustered Windows Server 2012 R2 Hyper-V host.
I’ve followed as much documentation as I can find, and the Cluster Validation is passing with flying colors, but I have three questions about the networking setup.
Here’s an overview as well as a diagram of our configuration:
We are running two Server 2012 R2 nodes on a Dell VRTX Blade Chassis.
We have 4-dual port 10 GBe Intel NICS installed in the VRTX Chassis.
We have two Netgear 12-Port 10 GBe switches, both uplinked to our network backbone switch.
Here’s what I’ve done on each 2012 R2 node:
-Created a NIC team using two 10GBe ports from separate physical cards in the blade chassis.
-Created a Virtual Switch using this team called “Cluster Switch” with “ManagementOS” specified.
-Created 3 virtual Nics that connect to this “Cluster Switch”:
Management (10.1.10.x), Cluster (172.16.1.x), Live Migration (172.16.2.x)
-Set up VLAN ID 200 on the Cluster NIC using PowerShell.
-Set Bandwidth Weight on each of the 3 NICS. Management has 5, Cluster has 40, Live Migration has 20.
-Set a Default Minimum Bandwidth for the switch at 35 (for the VM traffic.)
-Created two virtual switches for iSCSI both with
“-AllowManagementOS $false” specified.
-Each of these switches is using a 10GBe port from separate physical cards in the blade chassis.
-Created a virtual NIC for each of the virtual switches:
ISCSI1 (172.16.3.x) and ISCSI2 (172.16.4.x)
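The bandwidth weights in the list above (5, 40, 20, plus the switch default of 35) add up to 100, so each weight can be read as a percentage. As a rough illustration of what that means under contention, the sketch below turns the weights into minimum shares. The 20 Gb/s figure is an assumption (two 10 GbE team members at nominal speed); the usable capacity of a real team depends on the teaming mode and the traffic mix.

```python
# Weights from the list above; "Default (VM traffic)" is the switch default.
weights = {
    "Management": 5,
    "Cluster": 40,
    "Live Migration": 20,
    "Default (VM traffic)": 35,
}


def minimum_shares(weights, link_gbps):
    """Return each flow's guaranteed share of the link when it is saturated."""
    total = sum(weights.values())
    return {name: link_gbps * weight / total for name, weight in weights.items()}


# Assumed nominal team capacity: 2 x 10 Gb/s.
shares = minimum_shares(weights, 20.0)
for name, gbps in shares.items():
    print(f"{name}: at least {gbps:.1f} Gb/s under contention")
```

These are floors, not caps: when the link is not saturated, any flow can use idle bandwidth beyond its share.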
Here’s what I’ve done on the Netgear 10GB switches:
-Created a LAG using two ports on each switch to connect them together.
-Currently, I have no traffic going across the LAG as I’m not sure how I should configure it.
-Spread out the network connections over each Netgear switch so traffic from the virtual switch “Cluster Switch” on each node is connected to both Netgear 10 GB switches.
-Connected each virtual iSCSI switch from each node to its own port on each Netgear switch.
First Question: As I mentioned, the cluster validation wizard thinks everything is great.
But what about the traffic the Host and Guest VMs use to communicate with the rest of the corporate network?
That traffic is on the same subnet as the Management NIC.
Should the Management traffic be on that same corporate subnet, or should it be on its own subnet?
If Management is on its own subnet, then how do I manage the cluster from the corporate network?
I feel like I’m missing something simple here.
Second Question: Do I even need to implement VLANS in this configuration?
Since everything is on its own subnet, I don’t see the need.
Third Question: I’m confused how the LAG will work between the two 10 Gbe switches when both have separate uplinks to the backbone switch.
I see diagrams that show this setup, but I’m not sure how to achieve it without causing a loop.
Thanks!
"First Question: As I mentioned, the cluster validation wizard thinks everything is great.
But what about the traffic the Host and Guest VMs use to communicate with the rest of the corporate network?
That traffic is on the same subnet as the Management NIC.
Should the Management traffic be on that same corporate subnet, or should it be on its own subnet?
If Management is on its own subnet, then how do I manage the cluster from the corporate network?
I feel like I’m missing something simple here."
This is an operational question, not a technical question. You can have all VM and management traffic on the same network if you want. If you want to isolate the two, you can do that, too. Generally, recommended
practice is to create separate networks for host management and VM access, but it is not a strict requirement.
"Second Question: Do I even need to implement VLANS in this configuration?
Since everything is on its own subnet, I don’t see the need."
No, you don't need VLANs if separation by IP subnet is sufficient. VLANs provide a level of security against snooping that simple subnet isolation does not provide. Again, it is up to you how you want to configure things. I've done it both ways, and it works both ways.
"Third Question: I’m confused how the LAG will work between the two 10 Gbe switches when both have separate uplinks to the backbone switch.
I see diagrams that show this setup, but I’m not sure how to achieve it without causing a loop."
This is pretty much outside the bounds of a clustering question. You might want to take network configuration questions to a networking forum, or talk with a Netgear specialist. Different networking vendors can accomplish this in different ways.
.:|:.:|:. tim -
Hyper-V 2012 R2 Cluster - Drain Roles / Fail Roles Back
Hi all,
In the past when I've needed to apply windows updates to my 3 Hyper-V cluster nodes I used to make a note of which VM's were running on each node, then I'd live migrate them to one of the other cluster nodes before pausing the node I need to work on and
carry out the updates, once I finished installing the updates I'd then simply resume the node and live migrate the VM's back to their original node.
Having recently upgraded my nodes to Windows 2012 R2 I decided to use the new functionality in Failover Cluster Manager where you can pause & drain a node of its roles, perform the updates/maintenance, and then resume & fail roles back to the node,
unfortunately this didn't go as smoothly as I'd hoped, for some reason it seems like the drain/fail back decided to be cumulative rather than one off jobs per-node ... hard to explain, hopefully the following will be clear enough if the formatting survives:
1. Beginning State:
Hyper1 Hyper2 Hyper3
VM01 VM04 VM07
VM02 VM05 VM08
VM03 VM06 VM09
2. Drain Hyper1:
Hyper1 Hyper2 Hyper3
VM04 VM01
VM05 VM02
VM06 VM03
VM07
VM08
VM09
3. Fail Roles Back:
Hyper1 Hyper2 Hyper3
VM01 VM04 VM07
VM02 VM05 VM08
VM03 VM06 VM09
4. Drain Hyper2:
Hyper1 Hyper2 Hyper3
VM01 VM04
VM02 VM05
VM03 VM06
VM07
VM08
VM09
5. Fail Roles Back:
Hyper1 Hyper2 Hyper3
VM01 VM07
VM02 VM08
VM03 VM09
VM04
VM05
VM06
6. Manually Live Migrate VM's back to correct location:
Hyper1 Hyper2 Hyper3
VM01 VM04 VM07
VM02 VM05 VM08
VM03 VM06 VM09
7. Drain Hyper3:
Hyper1 Hyper2 Hyper3
VM01 VM04
VM02 VM05
VM03 VM06
VM07
VM08
VM09
8. Fail Roles Back:
Hyper1 Hyper2 Hyper3
VM01
VM02
VM03
VM04
VM05
VM06
VM07
VM08
VM09
9. Manually Live Migrate VM's back to correct location:
Hyper1 Hyper2 Hyper3
VM01 VM04 VM07
VM02 VM05 VM08
VM03 VM06 VM09
Step 8 was a rather hairy moment, although I was pleased to see my cluster hardware capacity planning rubber stamped; good to know that if I were ever to lose 2 out of 3 nodes everything would keep ticking over!
So, I'm back to the old ways of doing things for now, has anyone else experienced this strange behaviour?
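The "old way" of noting where each VM lives and moving it back can be written down as a small planning helper. This is only a sketch of the bookkeeping, not of the migration itself: the original placement is the beginning state from step 1, and the current placement is the state after step 8, read here as all nine VMs sitting on Hyper3.

```python
# Placement recorded before maintenance (step 1).
original = {
    "Hyper1": ["VM01", "VM02", "VM03"],
    "Hyper2": ["VM04", "VM05", "VM06"],
    "Hyper3": ["VM07", "VM08", "VM09"],
}

# Placement after step 8, assumed to be everything on Hyper3.
current = {
    "Hyper1": [],
    "Hyper2": [],
    "Hyper3": ["VM01", "VM02", "VM03", "VM04", "VM05", "VM06",
               "VM07", "VM08", "VM09"],
}


def moves_to_restore(original, current):
    """List (vm, from_node, to_node) moves that restore the original layout."""
    where_now = {vm: node for node, vms in current.items() for vm in vms}
    moves = []
    for node, vms in original.items():
        for vm in vms:
            if where_now.get(vm) != node:
                moves.append((vm, where_now.get(vm), node))
    return moves


moves = moves_to_restore(original, current)
for vm, source, target in moves:
    print(f"live migrate {vm}: {source} -> {target}")
```

The resulting list is what would then be worked through by hand, as in steps 6 and 9.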
Thanks in advance,
Ben
Hi,
Just want to confirm the current situations.
Please feel free to let us know if you need further assistance.
Regards.
-
Hi, I'm having a problem in a VM guest cluster using Windows Server 2012 R2 with virtual disk sharing enabled.
It's a SQL 2012 cluster, which has around 10 VHDX disks shared this way. All the VHDX files are inside LUNs on a SAN. These LUNs are presented to all clustered members of the Windows Server 2012 R2 Hyper-V cluster via Cluster Shared Volumes.
Yesterday a very strange problem happened: both the Quorum Disk and the DTC disks had their contents completely erased. The VHDX disks themselves were there, but the data inside was gone.
The SQL admin had to recreate both disks, but now we don't know if this issue was related to the virtualization platform or to another event inside the cluster itself.
Right now I'm seeing these errors on one of the VM guests:
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 3/4/2014 11:54:55 AM
Event ID: 1069
Task Category: Resource Control Manager
Level: Error
Keywords:
User: SYSTEM
Computer: ServerDB02.domain.com
Description:
Cluster resource 'Quorum-HDD' of type 'Physical Disk' in clustered role 'Cluster Group' failed.
Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster
Manager or the Get-ClusterResource Windows PowerShell cmdlet.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows-FailoverClustering" Guid="{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}" />
<EventID>1069</EventID>
<Version>1</Version>
<Level>2</Level>
<Task>3</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000000</Keywords>
<TimeCreated SystemTime="2014-03-04T17:54:55.498842300Z" />
<EventRecordID>14140</EventRecordID>
<Correlation />
<Execution ProcessID="1684" ThreadID="2180" />
<Channel>System</Channel>
<Computer>ServerDB02.domain.com</Computer>
<Security UserID="S-1-5-18" />
</System>
<EventData>
<Data Name="ResourceName">Quorum-HDD</Data>
<Data Name="ResourceGroup">Cluster Group</Data>
<Data Name="ResTypeDll">Physical Disk</Data>
</EventData>
</Event>
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 3/4/2014 11:54:55 AM
Event ID: 1558
Task Category: Quorum Manager
Level: Warning
Keywords:
User: SYSTEM
Computer: ServerDB02.domain.com
Description:
The cluster service detected a problem with the witness resource. The witness resource will be failed over to another node within the cluster in an attempt to reestablish access to cluster configuration data.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows-FailoverClustering" Guid="{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}" />
<EventID>1558</EventID>
<Version>0</Version>
<Level>3</Level>
<Task>42</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000000</Keywords>
<TimeCreated SystemTime="2014-03-04T17:54:55.498842300Z" />
<EventRecordID>14139</EventRecordID>
<Correlation />
<Execution ProcessID="1684" ThreadID="2180" />
<Channel>System</Channel>
<Computer>ServerDB02.domain.com</Computer>
<Security UserID="S-1-5-18" />
</System>
<EventData>
<Data Name="NodeName">ServerDB02</Data>
</EventData>
</Event>
We don't know if this can happen again. What if this happens on a disk with data? We don't know if this is related to the virtual disk sharing technology or to anything else in the virtualization layer, but I'm asking here to find out if it is a possibility.
Any ideas are appreciated.
Thanks.
Eduardo Rojas
Hi,
Please refer to the following link:
http://blogs.technet.com/b/keithmayer/archive/2013/03/21/virtual-machine-guest-clustering-with-windows-server-2012-become-a-virtualization-expert-in-20-days-part-14-of-20.aspx#.Ux172HnxtNA
Best Regards,
Vincent Wu
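When several of these events pile up, the Event Xml blocks quoted in the question can be reduced to a handful of fields so that they are easier to compare. The following is only a minimal sketch using Python's standard library: `summarize_event` is a hypothetical helper (not part of any Microsoft tooling), and the XML is a trimmed copy of the 1069 record quoted above.

```python
import xml.etree.ElementTree as ET

# Namespace used by Windows event XML, taken from the records in the post.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Trimmed copy of the Event ID 1069 record quoted in the question.
EVENT_XML = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows-FailoverClustering" Guid="{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}" />
<EventID>1069</EventID>
<Level>2</Level>
<TimeCreated SystemTime="2014-03-04T17:54:55.498842300Z" />
<Computer>ServerDB02.domain.com</Computer>
</System>
<EventData>
<Data Name="ResourceName">Quorum-HDD</Data>
<Data Name="ResourceGroup">Cluster Group</Data>
<Data Name="ResTypeDll">Physical Disk</Data>
</EventData>
</Event>"""


def summarize_event(xml_text):
    """Return a flat dict with the event ID, time, computer and EventData fields."""
    root = ET.fromstring(xml_text)
    system = root.find("e:System", NS)
    summary = {
        "event_id": int(system.findtext("e:EventID", namespaces=NS)),
        "time": system.find("e:TimeCreated", NS).get("SystemTime"),
        "computer": system.findtext("e:Computer", namespaces=NS),
    }
    # Each <Data Name="..."> element becomes one key in the summary.
    for data in root.findall("e:EventData/e:Data", NS):
        summary[data.get("Name")] = data.text
    return summary


if __name__ == "__main__":
    print(summarize_event(EVENT_XML))
```

Running the same helper over the 1558 record would show that both events carry the same `SystemTime`, which is consistent with the witness warning and the disk resource failure being one incident rather than two.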
-
Hyper-V Failover Cluster Networking Configuration After Install
Hello All,
Is it possible to install Hyper-V and Failover Clustering (in other words, create a Hyper-V failover cluster) and then configure the networking part of the solution later? As I am still coming to terms with the networking part, I wanted to do it after the install. Is that possible?
By later configuration I mean creating the NIC team, virtual NICs, VLAN tagging, etc.
Hi,
Failover cluster deployment requires network connectivity between cluster nodes. You can't create a cluster without properly configured TCP/IP on the cluster nodes.
http://OpsMgr.ru/ -
We have a DAG configured between 2 mailbox servers, one in each of our main data centres. Our comms team recently performed a DR test between our 2 data centres, switching from the main production link to the backup link. During this outage the Failover Cluster Manager reported errors, with each mailbox server reporting the other as uncontactable. The events that were logged include the following:
Isatap interface isatap.{02ADE20A-D5D4-437F-AD00-E6601F7E7A9D} is no longer active. (EventID 4201)
Cluster node 'MAILBOX_SERVER' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the
Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is
connected such as hubs, switches, or bridges. (EventID 1135)
File share witness resource 'File Share Witness (\\WITNESS_SERVER\SHARE_NAME)' failed to arbitrate for the file share '\\WITNESS_SERVER\SHARE_NAME'. Please ensure that file share '\\WITNESS_SERVER\SHARE_NAME' exists and is accessible by the cluster. (EventID
1564)
Cluster resource 'File Share Witness (\\WITNESS_SERVER\SHARE_NAME)' in clustered service or application 'Cluster Group' failed. (EventID 1069)
The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk. Run the Validate a Configuration wizard to check your network
configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. (EventID 1177)
The Cluster Service service terminated with service-specific error A quorum of cluster nodes was not present to form a cluster. (EventID 7024)
The Microsoft Exchange Information Store service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 5000 milliseconds: Restart the service. (EventID 7031)
Looking at the Cluster Events in the Failover Cluster Manager snap-in, I see a heap of Event ID 47 (cannot activate the DAG databases as the server is not up according to Windows Failover Cluster Service) and:
Node status could not be recorded. This could prevent some network failure logic from functioning correctly. NodeStatus:IsHealthy=True,HasADAccess=True,ClusterErrorOverrideFalse,LastUpdate=5/2/2011 8:25:42 AMUTC Failure:An Active Manager operation failed.
Error: An error occurred while attempting a cluster operation. Error: Cluster API '"ClusterRegSetValue() failed with 0x6be. Error: The remote procedure call failed"' failed.. (EventID 184)
Forcefully dismounting all the locally mounted databases on server 'BACKUP_MAILBOX_SERVER. (EventID 307).
Our comms team doesn't believe it is a comms issue, as they did not log any network communication errors between the servers in the two sites (using ICMP). So if it is not a comms issue, how can I configure the failover cluster to be resilient to this type of network failover event?
Thanks
Dan
Isn't it also true that in a stretched DAG with an even number of nodes, the PAM needs to be in the same site as the active DAG node? If the connection between both nodes goes down and the PAM is in the "passive" site, the primary node will dismount the databases since it can't check with the PAM to make sure it's safe for it to be up.
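The events in the question (1135, 1564, then 1177 and 7024) fit a simple majority-vote picture: with two mailbox servers and one file share witness there are three votes, and a node that can reach neither its peer nor the witness is left with one. The sketch below is only an illustration of that arithmetic, not the cluster service's actual logic. The function names are hypothetical, and the heartbeat values (1000 ms delay, threshold of 5 missed heartbeats) are the commonly documented defaults for this generation of Windows failover clustering, assumed here rather than read from the poster's cluster.

```python
# Assumed layout from the post: 2 mailbox servers + 1 file share witness = 3 votes.
TOTAL_VOTES = 3


def has_quorum(votes_reachable, total_votes):
    """Majority rule: a partition keeps quorum with more than half of all votes."""
    return votes_reachable > total_votes // 2


def node_survives(sees_peer, holds_witness):
    """A node counts its own vote plus each voter it can still reach.

    Simplification: holds_witness means this node both reaches the file
    share witness and wins arbitration for it; only one side can do that.
    """
    votes = 1 + int(sees_peer) + int(holds_witness)
    return has_quorum(votes, TOTAL_VOTES)


def heartbeat_tolerance_s(delay_ms, threshold):
    """Seconds of silence tolerated before a peer is dropped from membership."""
    return delay_ms * threshold / 1000.0


if __name__ == "__main__":
    # Link switchover: peer unreachable and witness arbitration failed (event 1564).
    print("isolated node keeps quorum:", node_survives(False, False))
    # Same outage, but the node can still arbitrate the witness.
    print("node with witness keeps quorum:", node_survives(False, True))
    # Assumed defaults: 1000 ms delay, 5 missed heartbeats.
    print("tolerated outage (s):", heartbeat_tolerance_s(1000, 5))
```

Under these assumptions an interruption of only a few seconds is enough to remove a node from membership, which a periodic ICMP check may not register. The heartbeat delay and threshold are cluster properties that can be inspected and raised, so it is worth checking the actual values before concluding anything about the link.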
In a stretched DAG with an even number of nodes, the PAM moves to the DR/passive site every time a failover occurs, but doesn't automatically switch back when you reactivate the primary node. -
Hi all,
I had a very strange situation in my two-node Hyper-V cluster:
I have one network for heartbeat only (10.0.0.0/24) and a second for Hyper-V internal networking for virtual machines (marked "Do not allow cluster network communication" in its properties).
The machines were working properly, and migrations too.
One day my second node, HyperV2, was marked red in the Failover Cluster Manager MMC. I discovered that the Hyper-V LAN was unavailable on this second node. BUT everything was working properly: the HyperV2 node was on the internet, communicated with the AD domain, and could even run any virtual machine...
I checked the configuration several times and also checked the TMG configuration, wondering whether a network access rule might be set wrongly. I tried restarting the host, with no result: the network was still unavailable.
After about an hour I found the resolution:
On my second Hyper-V node, disable and re-enable the Local Area Connection network adapter that is connected to the Hyper-V LAN, in the Network Connections control panel!
Hope this helps somebody ;)
Marian, just trying to help you
Resolution:
On the affected Hyper-V node, disable and re-enable the Local Area Connection network adapter that is connected to the Hyper-V LAN, in the Network Connections control panel.
I guess something gets flushed in the network configuration, and/or it is some combination with the network adapter driver.
Marian, just trying to help you