Cluster Network Randomly Failing on Hyper-V Cluster

Please let me know if there is a more appropriate forum. I am having a really strange issue that is seemingly random. I have a 3-host cluster of identical hardware running Hyper-V Server 2012 R2. The networking is as follows, and each network is a different VLAN/subnet:
3 Cluster networks for virtual machines
1 Cluster network for cluster traffic/management
1 Heartbeat network
2 iSCSI networks for storage
All of the networks are perfectly fine except for one, which seems to fail on a random node at a random time during the day (so far, at most once per day).
If I start to live migrate virtual machines that are on the failed network, the cluster network comes back up. The cluster networks are teamed using SCVMM; they are switch independent and run the Dynamic load-balancing algorithm. We tried swapping the
network switches to see if it was faulty network hardware; things ran fine for one day and then it happened again today, so we've ruled that out. The only error message I get is Event ID 1127, which just states that the cluster network has gone into a failed
state, so it doesn't help much. I've run the cluster validation tool for networking several times and it always passes 100%. What I am worried about is hardware incompatibilities, as I am using Dell servers (PowerEdge R720) with Broadcom NICs.
We have 12 Ethernet ports in each server and they are identical across all three hosts: four integrated Broadcom ports, four from a Broadcom quad-port add-on NIC, and four from an Intel quad-port add-on NIC. All are server-grade NICs. The
only problem I've had in the past is with VMQ, which we've had to disable as a workaround, but that has always stabilized our virtual networks. In any case, all of the cluster networks for virtual machines are set up identically, and only this particular one randomly
fails on any one of the three hosts (it has happened at least once on each node now).
I am wondering if anyone has had this experience before. I have read that there are some nasty compatibility issues between Broadcom and Hyper-V, but I am hoping someone can give me some ideas on how to narrow this down, since the event
logs aren't telling me anything obvious.
Please let me know if you have any suggestions on how to narrow down what's causing this, or if there is more information I could provide. In the meantime, I'm going to take note of which virtual machines are running on the host whose
network fails, in case there's some correlation there, but that could take a while to accrue any useful data and our users aren't too happy with the instability...
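For what it's worth, this is roughly what I'm running to watch things from PowerShell while I wait for the next failure (a sketch; the destination path and time window are just mine):
    # Watch the state of each cluster network and its member interfaces
    Get-ClusterNetwork | Format-Table Name, State, Role, Address
    Get-ClusterNetworkInterface | Format-Table Node, Name, Network, State
    # Pull the cluster log covering the last 30 minutes after a failure
    Get-ClusterLog -UseLocalTime -TimeSpan 30 -Destination C:\Temp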
Thank you in advance for your time and sorry for the lengthy post!

Since I made the change last Friday evening, 4/10, I haven't experienced the issue. I won't be completely convinced that this resolved it until I've monitored for at least one more week, since the problem didn't first present itself until I was already
one week into live deployment. Also, the link below describes my issue exactly, and much more eloquently than I put it. Coupled with the KB article that someone posted in that article's comments section (the same one I posted earlier here),
this is what led me to check the VMQ status through PowerShell, which is much better than going through the registry (I'm running Hyper-V Server 2012 R2, which is like Core, so I don't have the GUI options shown in the article).
http://alexappleton.net/post/77116755157/hyper-v-virtual-machines-losing-network
I could try updating the driver, but the comments on that post mention that driver updates have yet to resolve this issue, so we may still be waiting on Broadcom for a fix. Please confirm otherwise if anyone has any information.
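For anyone landing here from search, the VMQ check/disable I referred to looks roughly like this in PowerShell (a sketch; the adapter name is an example, and we only touched the Broadcom ports):
    # Show VMQ state per physical adapter
    Get-NetAdapterVmq | Format-Table Name, InterfaceDescription, Enabled
    # Disable VMQ on one Broadcom port (name is an example; repeat per port)
    Set-NetAdapterVmq -Name "Ethernet 3" -Enabled $false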

Similar Messages

  • Failover Cluster Network Name Failed and Can't be Repaired

    I have an issue that seems to be a different problem from those others have encountered.
    I've scoured everything I can find and nothing has fixed my problem.
    It starts with the common symptom of the cluster network name failing on my 2-node Server 2012 file server cluster.  The computer object was still in AD and appeared to be fine, so it was not the common problem of the object
    getting deleted somehow.  At the time, there was no other object with that name in the recycle bin, so I don't think it was mistakenly deleted and quickly recreated to cover any tracks, so to speak.
    Following one guide, I tried to find the registry key that corresponded with the GUID of the object, but neither node in the cluster had it in its registry (which may be part of the problem).
    Since it was in the failed state, I tried to do the repair on the object to no avail.
    We run a "locked down" DC environment so all computer objects have to be pre-provisioned.  They were all pre-provisioned successfully and successfully assigned during cluster creation.  The cluster was running with no issues for a month
    or so before this problem came up.
    When I do a repair on the object while taking diagnostic logs the following 4609 error appears:
    The action 'Repair' did not complete. - System.ApplicationException: An error occurred resetting the password for 'Cluster Name'. ---> System.ComponentModel.Win32Exception: Unknown error (0x80005000)
    There appears to be a corresponding 4771 error with a failure code 0x18 that comes from the security log of the DC that states there was a Kerberos pre-authentication failure for the cluster network name object (Domain\Clustername$)
    I believe this is what is causing the repair failure.  All the information I found related to security error 4771 involved either bad credentials for a user account, or a fix of rejoining the computer to the domain.  I can't seem to find
    a way to do that with the cluster network name.  If there's a way, please let me know.
    I've tried a number of things, like resetting the object, disabling it, deleting and creating a new object with the same name, deleting that new object and recovering the original, etc...
    Can anyone shed some light on what is going on and hopefully how to fix it, other than rebuilding the cluster?  I'm quite close to just tearing it down and building it back up, but am hesitant because this cluster is currently in production...
    Any help would be appreciated

    Hi,
    I have not seen the same issue as yours. Based on my experience, the 4609 error is often caused by a CSV disk issue, and the 0x80005000 error is sometimes caused by a repetitive computer object in the OU. Please check the related areas above, or run the validation test and post the error information.
    Although I do have a CSV, there doesn't seem to be any problem with it, and it was running just fine for a month or so before the problem started.  I double checked and there are no duplicate computer objects; maybe I don't understand what you mean by
    "repetitive", could you explain further?
    The cluster validates successfully with a few warnings:
    Validating cluster resource Name: DT-FileCluster.
    This resource is marked with a state of 'Failed' instead of
    'Online'. This failed state indicates that the resource had a problem either
    coming online or had a failure while it was online. The event logs and cluster
    logs may have information that is helpful in identifying the cause of the
    failure.
    - This is because the cluster name is in the failed state
    Validating the service principal names for Name:
    DT-FileCluster.
    The network name Name: DT-FileCluster does not have a valid
    value for the read-only property 'ObjectGUID'. To validate the service principal
    name the read-only private property 'ObjectGuid' must have a valid value. To
    correct this issue make sure that the network name has been brought online at
    least once. If this does not correct this issue you will need to delete the
    network name and re-create it.
    - This is definitely related to the problem and the GUID probably got removed when we attempted a fix by resetting the object and trying the repair from the failover cluster manager.
    The user running validate, does not have permissions to create
    computer objects in the 'ad.unlv.edu' domain.
    - This is correct, we run a restricted domain.  I have a delegated OU that I can pre-provision accounts in.  The account was pre-provisioned successfully and was at one point set up and working just fine.
    There are no other errors nor warnings.
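    In case it helps anyone else compare, this is how I've been reading the ObjectGUID that the validation report complains about (a sketch; Get-ADComputer needs the RSAT ActiveDirectory module):
        # Private properties of the network name resource, including ObjectGUID
        Get-ClusterResource "Cluster Name" | Get-ClusterParameter
        # Compare against the pre-provisioned computer object in AD
        Import-Module ActiveDirectory
        Get-ADComputer "DT-FileCluster" | Select-Object Name, ObjectGUID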

  • High Network Utilization on HeartBeat and LiveMigration networks - Windows 2012 R2 Hyper-V cluster

    Hi all,
    I have just setup a fresh Windows 2012 R2 Hyper-V cluster with 4 nodes.
    Config:
    Network:
    2 x Networks for iSCSI
    1 x VM network 1
    1 x VM network 2
    1 x Heartbeat
    1 x LiveMigration
    Disks:
    5 GB = Quorum
    2 x 2.99 TB (Deduplication enabled)
    The problem I am getting is that HeartBeat and LiveMigration (both configured for cluster-only traffic) have a lot of traffic on them even though no live migration (the only card configured for LiveMigration) is going on.
    All network configuration is the same as it was in the Windows 2008 R2 SP1 cluster these machines were running before (not an upgrade; a fresh install and migration), which did not have this "problem".
    Has anyone experienced this or have a solution to this?
    Regards,
    Thorir

    Hi Thorir,
    According to your description, you can monitor and analyse the LM traffic via Network Monitor for troubleshooting.
    Here are the links for downloading and using Network Monitor:
    http://www.microsoft.com/en-us/download/details.aspx?id=4865
    http://technet.microsoft.com/en-us/library/cc723623.aspx
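    Alternatively, before capturing packets, it may be worth checking what roles and metrics the cluster assigned to each network; a quick PowerShell sketch (the MigrationExcludeNetworks check assumes the usual private property on the Virtual Machine resource type):
        # Role 1 = cluster only, 3 = cluster and client; the lowest-metric network carries cluster/CSV traffic
        Get-ClusterNetwork | Format-Table Name, Role, Metric, AutoMetric, Address
        # Networks excluded from live migration (if any) are listed here
        Get-ClusterResourceType "Virtual Machine" | Get-ClusterParameter MigrationExcludeNetworks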
    If you need any further information, please feel free to let us know.
    Hope this helps
    Best Regards
    Elton Ji

  • Cluster Network 2 is missing from Failover cluster Manager

    I have a two-node Windows 2008 R2 SP1 MS SQL cluster. There was an issue with one node's NIC, which was replaced, and since then "Cluster Network 2" is not visible in Failover Cluster Manager; the Heartbeat and Public IP NICs of both nodes
    are appearing in the "Cluster Network 1" container. Even though the public and heartbeat networks are on different subnets, the cluster still detects them as one subnet. I need to know how to bring back "Cluster Network 2".

    After replacing the NIC, the cluster worked fine on the other node for a couple of days, and then the cluster service started terminating on both nodes.
    When I checked the cluster logs, I found the error below:
    "ERR   [CORE] Node 2: exception caught AlreadyExists(183)' because of 'already exists'(AODBP2DB1 - Local Area Connection)"
    To correct this problem, I renamed both nodes' NICs to Public and Private, and the cluster started; I tested failover as well. But since then I can only see "Cluster Network 1", and all NICs are listed under this container.
    However, the cluster validation report shows both Cluster Network 1 and Cluster Network 2 entries.
    Here is the network section from the cluster validation report:
    Network: Cluster Network 1
    DHCP Enabled: False
    Network Role: Enabled
    Prefix: 10.100.18.0, Prefix Length: 25
    Network Interface: Node1 - Public   | DHCP Enabled: False | IP Address: 10.100.18.11  | Prefix Length: 25
    Network Interface: Node2 - Public   | DHCP Enabled: False | IP Address: 10.100.18.16  | Prefix Length: 25
    Network: Cluster Network 2
    DHCP Enabled: False
    Network Role: Internal
    Prefix: 10.101.130.0, Prefix Length: 25
    Network Interface: Node1 - Private  | DHCP Enabled: False | IP Address: 10.101.130.11 | Prefix Length: 25
    Network Interface: Node2 - Private1 | DHCP Enabled: False | IP Address: 10.101.130.13 | Prefix Length: 25
    Verifying that each cluster network interface within a cluster network is configured with the same IP subnets.
    Examining network Cluster Network 1.
    Network interface Node1- Public has addresses on all the subnet prefixes of network Cluster Network 1.
    Network interface Node2- Public has addresses on all the subnet prefixes of network Cluster Network 1.
    Examining network Cluster Network 2.
    Network interface Node1- Private has addresses on all the subnet prefixes of network Cluster Network 2.
    Network interface Node2- Private1 has addresses on all the subnet prefixes of network Cluster Network 2.
    Verifying that, for each cluster network, all adapters are consistently configured with either DHCP or static IP addresses.
    Checking DHCP consistency for network: Cluster Network 1. Network DHCP status is disabled.
    DHCP status (disabled) for network interface Node1- Public matches network Cluster Network 1.
    DHCP status (disabled) for network interface Node2- Public matches network Cluster Network 1.
    Checking DHCP consistency for network: Cluster Network 2. Network DHCP status is disabled.
    DHCP status (disabled) for network interface Node1- Private matches network Cluster Network 2.
    DHCP status (disabled) for network interface Node2- Private1 matches network Cluster Network 2.
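    For reference, here is how I'm comparing what the running cluster currently sees against the validation report (a PowerShell sketch; on 2008 R2 the FailoverClusters module has to be imported first):
        Import-Module FailoverClusters
        # Networks the cluster has discovered, with their subnets
        Get-ClusterNetwork | Format-Table Name, State, Role, Address, AddressMask
        # Which NIC on which node landed in which cluster network
        Get-ClusterNetworkInterface | Format-Table Node, Name, Network, Address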

  • How to delete Cluster Network 1 in Hyper-V cluster?

    Hi folks!
    So I added a few 10Gbe cards to our 4 node HV 2012 R2 cluster. I connected two of them to the switches and almost immediately I saw this weird "Cluster Network 1" show up in Failover Cluster Manager.
    I have since then disconnected the network cables, but it still shows up and the event logs keep saying that "Cluster Network 1" has failed.
    Is there a way to delete this "Cluster Network 1"? 
    I have already set "Do not allow cluster network communication on this network". It is not used in the "Live Migration Settings".  Rebooted nodes. Why is it still there!!?? Pulling my hair out here! LOL.
    PS: the closest I could find on the 'net is this old thread:
    https://social.technet.microsoft.com/Forums/windowsserver/en-US/7f8ffe60-835d-489e-a86e-20893c52fb21/cluster-network-status-failed
    -Rajeev rajdude.com

    The cluster goes out and finds all networks that are available.  This is an automatic process.  As long as there is a network there, it will show it.  If you don't want it to show, disable the interface and it will disappear.
    . : | : . : | : . tim
    Thanks a million Tim! Disabling the correct interface made it disappear!
    Actually, I had disabled the interface before I started this thread. When you posted the suggestion, I went back and looked at the network connections again and found out that I had disabled the wrong interface! My bad!
    However, here is the interesting thing:
    After I disabled the NIC and it disappeared from the FCM, I physically disconnected its cable (actually I disconnected all unused/unconfigured NICs), then re-enabled it. Now I see that FCM does NOT pick it up again.
    I think FCM picked up those NICs because at one time I had put that NIC in the production VLAN (from the switch side).
    One observation:
    If you have NICs in a HV machine which are not connected to a switch, they do NOT show up in FCM. Once you connect them to some network, they start showing up. I guess by connecting them to a network, they are able to see NICs in other HV nodes.
    Take a look at this screenshot. I have 6 NICs in there which are disconnected, none show up in FCM
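    By the way, the disable/enable dance can also be done from PowerShell on 2012 R2 if you have a lot of ports to go through (a sketch; the adapter name is just an example):
        # List everything first so you disable the right port this time :)
        Get-NetAdapter | Sort-Object Name | Format-Table Name, Status, LinkSpeed
        Disable-NetAdapter -Name "10GbE Port 2" -Confirm:$false
        Enable-NetAdapter -Name "10GbE Port 2"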
    -Rajeev rajdude.com

  • Server 2008 R2 Failover cluster network configuration

    Hi
    We have a customer with a Server 2008 R2 Hyper-V failover cluster. They have 2 cluster networks, "Cluster Network 1" and "Cluster Network 2".
    "Cluster Network 1": NIC team on 172.16.1.0/24 for private cluster network communication
    "Cluster Network 2": NIC team on 192.168.1.0/24 for production network communication
    I can see that "Cluster Network 1" is configured to "Allow cluster network communication on this network" and "Allow clients to connect through this network".
    If "Cluster Network 1" is ONLY for communication between the to cluster nodes then I assume the selection in "Allow clients to connect through this network" should be removed?
    /Lasse

    It will cause a lost network connection for any client that is accessing through that network.  Those clients would need to reconnect.
    Did you configure both IPs on the cluster resource name that clients are accessing?  If you only configured the one you want, there should be no issue.  If you configured both, then it is possible some clients might be connected via the private
    network.
    Another thing you should do, and if you have already done this you will most likely not have issues at all, is to disable DNS registration on any network you do not want client access coming through.  If the clients can only find the resource
    through the DNS name registered, that is the way they will be coming in.  In my clusters, which often have 7 or more NICs, there is only one with a published DNS record.
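    If you script your builds, the same two settings look roughly like this on 2012 or later (a sketch; network and interface names are examples, and on 2008 R2 the DNS part is the checkbox on the NIC's DNS tab):
        # Restrict the private network to cluster-only traffic (Role 1 = cluster only)
        (Get-ClusterNetwork "Cluster Network 1").Role = 1
        # Stop the private NIC registering in DNS so clients never resolve to it
        Set-DnsClient -InterfaceAlias "Private" -RegisterThisConnection $false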
    . : | : . : | : . tim

  • How to specify only possible owner of the VM in Hyper-v cluster (Windows 2012R2)

    Good day,
    We want to prevent the migration of virtual machines between the cluster nodes in a Windows 2012 R2 Hyper-V cluster.
    How do we specify the only possible owner of a VM in a Hyper-V cluster (Windows 2012 R2)?

    Hi Al_leont,
    I ask because possible owners are configured in different places depending on whether you're using FCM or SCVMM. As well as possible owners, you can also configure preferred owners, affinity and anti-affinity groups, and placement rules.
    To configure possible owners in FCM, select the VM you want to configure, then in the bottom window select the Resources tab (change from the Summary tab). Right-click the Virtual Machine resource, then select the Advanced Policies tab of
    the popup window. You should then see the Hyper-V nodes as possible owners.
    In SCVMM you just right-click the VM and select Properties, then Settings in the popup window.
    Preferred owners, affinity and anti-affinity groups, and placement rules are configured in other locations or by PowerShell, as sketched below.
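    For example, from PowerShell on one of the nodes it would look roughly like this (VM, group, and node names are examples):
        # Possible owners: which nodes may host the VM resource at all
        Get-ClusterResource "Virtual Machine VM01" | Set-ClusterOwnerNode -Owners Node1, Node2
        # Preferred owners: ordered placement preference for the clustered role
        Set-ClusterOwnerNode -Group "VM01" -Owners Node1
        # Verify
        Get-ClusterGroup "VM01" | Get-ClusterOwnerNode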
    Kind Regards
    Michael Coutanche
    Note: Posts are provided “AS IS” without warranty of any kind, either expressed or implied, including but not limited to the implied warranties of merchantability and/or fitness for a particular purpose.

  • Cluster network name resource 'Cluster Name' failed registration of one or more associated DNS name(s) for the following reason: The handle is invalid.

    I'm stuck here trying to figure this error out.  
    2003 domain, 2012 Hyper-V Core, 3 nodes.  (I have two of these Hyper-V clusters; hvclust2012 is the problem cluster, hvclust2008 is okay.)
    In Failover Cluster Manager I see these errors, "Cluster network name resource 'Cluster Name' failed registration of one or more associated DNS name(s) for the following reason:  The handle is invalid."
    I restarted the host node that was listed as having the error, and then another node started showing the errors.
    I tried to follow this site:  http://blog.subvertallmedia.com/2012/12/06/repairing-a-failover-cluster-in-windows-server-2012-live-migration-fails-dns-cluster-name-errors/
    Then this error shows up when doing the repair:  there was an error repairing the active directory object for 'Cluster Name'
    I looked at our domain controller and noticed I don't have access to Local Users and Groups. I can access it for our other cluster, hvclust2008 (both clusters are the same version, 2012).
    <image here>
    I came upon this thread:  http://social.technet.microsoft.com/Forums/en-US/85fc2ad5-b0c0-41f0-900e-df1db8625445/windows-2012-cluster-resource-name-fails-dns-registration-evt-1196?forum=winserverClustering
    Now, I'm stuck on adding a managed service account (MSA).  I'm not sure if I'm way off track with this fix.  Any advice?  Thanks in advance!
    <image here>

    Thanks Elton,
    I restarted the 3 hosts after applying the hotfix.  Then I did the steps below and got stuck on step 5.  That is when I get the error (image above): "There was an error repairing the active directory object for 'Cluster Name'.  For more data, see 'Information Details'."
    To reset the password on the affected name resource, perform the following steps:
    1. From Failover Cluster Manager, locate the name resource.
    2. Right-click on the resource, and click Properties.
    3. On the Policies tab, select "If resource fails, do not restart", and then click OK.
    4. Right-click on the resource, click More Actions, and then click Simulate Failure.
    5. When the name resource shows "Failed", right-click on the resource, click More Actions, and then click Repair.
    6. After the name resource is online, right-click on the resource, and then click Properties.
    7. On the Policies tab, select "If resource fails, attempt restart on current node", and then click OK.
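    For reference, I've been approximating the same sequence in PowerShell between attempts (a sketch; the Repair step itself still has to come from the FCM GUI):
        Get-ClusterResource "Cluster Name" | Format-List Name, State
        # Equivalent of the Simulate Failure step in the list above
        Test-ClusterResourceFailure -Name "Cluster Name"
        # ...run Repair from FCM while the resource shows Failed, then bring it online...
        Get-ClusterResource "Cluster Name" | Start-ClusterResource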
    Thanks

  • Hyper-V cluster: Unable to fail VM over to secondary host

    I am working on a Server 2012 Hyper-V Cluster. I am unable to fail my VMs from one node to the other using either LIVE or Quick migration.
    A force shutdown of VMHost01 will force a migration to VMHost02. And once we are on VMHost02 we can migrate back to VMHost01, but once that is done we can't move the VMs back to VMHost02 without a force shutdown.
    The following error pops up:
    Event ID: 21502 The Virtual Machine Management Service failed to establish a connection for a Virtual machine migration with host.... The connection attempt failed because the connected party did not properly respond after a period of time, or the established
    connection failed because connected host has failed to respond (0X8007274C)
    Here's what I noticed:
    VMMS.exe is running on VMHost02; however, it is not listening on port 6600. I confirmed this after a reboot by running netstat -a. We have tried setting this service to a delayed start.
    I have checked Firewall rules and Anti-Virus exclusions, and they are correct. I have not run the cluster validation test yet, because I'll need to schedule a period of downtime to do so.
    We can start/stop the VMMS.exe service just fine and without errors, but I am puzzled as to why it will not listen on Port 6600 anywhere. Anyone have any suggestions on how to troubleshoot this particular issue? 
    Thanks,
    Tho H. Le

    Just ran into the same issue in a 16-node cluster being managed by VMM. When trying to live migrate VMs using the VMM console, the migration would fail with Error 10698, and Failover Cluster Manager would report error code 0x8007274C.
    + Validated Live Migration and Cluster networks. Everything checked out.
    + Looking in Hyper-V manager and migrations are enabled and correct networks displayed.
    + Found this particular Blog that mentions that the Virtual Machine Management service is not listening to port 6600
    http://blogs.technet.com/b/roplatforms/archive/2012/10/16/shared-nothing-migration-fails-0x8007274c.aspx
    Ran the following from an elevated command line:
    Netstat -ano | findstr 6600
    Node 2 did not return anything.
    Node 1 returned correct output:
    TCP    10.xxx.251.xxx:6600    0.0.0.0:0    LISTENING    4540
    TCP    10.xxx.252.xxx:6600    0.0.0.0:0    LISTENING    4560
    Set Hyper-V Virtual Machine Service to delayed start.
    Restarted the service; no change.
    Checked the event logs for Hyper-V VMMS and noted the following events: the VMMS listener started for the Live Migration networks, and then shortly after the listener stopped.
    Removed the system from the cluster and restarted - No change
    Checked this host by running gpedit.msc - could not open console: Permission Error
    Tried to run a GPO refresh (gpupdate /force), but an error returned that LocalGPO could not apply registry settings; Group Policy processing would not continue until this was resolved.
    Checked the local group policy folder on node 2 and it was corrupt:
    C:\Windows\System32\GroupPolicy\Machine\reg.pol showed 0K for the size.
    Copied local policy folders from Node 1 to 2, and then was able to refresh the GPOs.
    Restarting the VMMS service did not change the status of the ports.
    Restarted the server, added the Live Migration networks back into Hyper-V Manager, and now netstat output reports that the VMMS service is listening on 6600.
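    In hindsight, the whole port check and service bounce can be scripted; roughly (Get-NetTCPConnection needs 2012 or later):
        # Is VMMS listening on the live migration port on this node?
        Get-NetTCPConnection -LocalPort 6600 -State Listen -ErrorAction SilentlyContinue
        # Bounce the service if not (same as net stop/start vmms)
        Restart-Service vmms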

  • Network Questions on 2012 R2 Hyper-V Cluster

    I am going through the setup and configuration of a clustered Windows Server 2012 R2 Hyper-V host. 
    I’ve followed as much documentation as I can find, and the Cluster Validation is passing with flying colors, but I have three questions about the networking setup.
    Here’s an overview as well as a diagram of our configuration:
    We are running two Server 2012 R2 nodes on a Dell VRTX Blade Chassis. 
    We have 4-dual port 10 GBe Intel NICS installed in the VRTX Chassis. 
    We have two Netgear 12-Port 10 GBe switches, both uplinked to our network backbone switch.
    Here’s what I’ve done on each 2012 R2 node:
    -Created a NIC team using two 10GBe ports from separate physical cards in the blade chassis.
    -Created a Virtual Switch using this team called “Cluster Switch” with “ManagementOS” specified.
    -Created 3 virtual NICs that connect to this "Cluster Switch": 
    Management (10.1.10.x), Cluster (172.16.1.x), Live Migration (172.16.2.x)
    -Set up VLAN ID 200 on the Cluster NIC using PowerShell.
    -Set a bandwidth weight on each of the 3 NICs: Management has 5, Cluster has 40, Live Migration has 20.
    -Set a Default Minimum Bandwidth for the switch at 35 (for the VM traffic.)
    -Created two virtual switches for iSCSI both with 
    “-AllowManagementOS $false” specified.
    -Each of these switches is using a 10GBe port from separate physical cards in the blade chassis.
    -Created a virtual NIC for each of the virtual switches: 
    ISCSI1 (172.16.3.x) and ISCSI2 (172.16.4.x)
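    In case the exact commands matter, this is roughly the PowerShell behind the node steps above (a sketch, not exactly what I ran; team, switch, and vNIC names are mine, one host vNIC shown):
        # Team and converged switch; the Dynamic algorithm needs 2012 R2
        New-NetLbfoTeam -Name "ClusterTeam" -TeamMembers "NIC1","NIC2" -TeamingMode SwitchIndependent -LoadBalancingAlgorithm Dynamic
        New-VMSwitch -Name "Cluster Switch" -NetAdapterName "ClusterTeam" -MinimumBandwidthMode Weight -AllowManagementOS $false
        Set-VMSwitch "Cluster Switch" -DefaultFlowMinimumBandwidthWeight 35
        # One host vNIC shown; repeated for Management (weight 5) and Live Migration (weight 20)
        Add-VMNetworkAdapter -ManagementOS -Name "Cluster" -SwitchName "Cluster Switch"
        Set-VMNetworkAdapterVlan -ManagementOS -VMNetworkAdapterName "Cluster" -Access -VlanId 200
        Set-VMNetworkAdapter -ManagementOS -Name "Cluster" -MinimumBandwidthWeight 40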
    Here’s what I’ve done on the Netgear 10GB switches:
    -Created a LAG using two ports on each switch to connect them together.
    -Currently, I have no traffic going across the LAG as I’m not sure how I should configure it.
    -Spread out the network connections over each Netgear switch so traffic from the virtual switch “Cluster Switch” on each node is connected to both Netgear 10 GB switches.
    -Connected each virtual iSCSI switch from each node to its own port on each Netgear switch.
    First Question:  As I mentioned, the cluster validation wizard thinks everything is great. 
    But what about the traffic the Host and Guest VMs use to communicate with the rest of the corporate network? 
    That traffic is on the same subnet as the Management NIC. 
    Should the Management traffic be on that same corporate subnet, or should it be on its own subnet? 
    If Management is on its own subnet, then how do I manage the cluster from the corporate network? 
    I feel like I’m missing something simple here.
    Second Question:  Do I even need to implement VLANS in this configuration? 
    Since everything is on its own subnet, I don’t see the need.
    Third Question:  I’m confused how the LAG will work between the two 10 Gbe switches when both have separate uplinks to the backbone switch. 
    I see diagrams that show this setup, but I’m not sure how to achieve it without causing a loop.
    Thanks!

    "First Question:  As I mentioned, the cluster validation wizard thinks everything is great. 
    But what about the traffic the Host and Guest VMs use to communicate with the rest of the corporate network? 
    That traffic is on the same subnet as the Management NIC. 
    Should the Management traffic be on that same corporate subnet, or should it be on its own subnet? 
    If Management is on its own subnet, then how do I manage the cluster from the corporate network? 
    I feel like I’m missing something simple here."
    This is an operational question, not a technical question.  You can have all VM and management traffic on the same network if you want.  If you want to isolate the two, you can do that, too.  Generally, recommended
    practice is to create separate networks for host management and VM access, but it is not a strict requirement.
    "Second Question:  Do I even need to implement VLANS in this configuration? 
    Since everything is on its own subnet, I don’t see the need."
    No, you don't need VLANs if separation by IP subnet is sufficient.  VLANs provide a level of security against snooping that simple subnet isolation does not.  Again, it's up to you how you want to configure things. 
    I've done it both ways, and it works both ways.
    "Third Question:  I’m confused how the LAG will work between the two 10 Gbe switches when both have separate uplinks to the backbone switch. 
    I see diagrams that show this setup, but I’m not sure how to achieve it without causing a loop."
    This is pretty much outside the bounds of a clustering question.  You might want to take network configuration questions to a networking forum, or talk with a Netgear specialist.  Different networking
    vendors can accomplish this in different ways.
    .:|:.:|:. tim

  • Hyper-V 2012 R2 Cluster - Drain Roles / Fail Roles Back

    Hi all,
    In the past, when I've needed to apply Windows updates to my 3 Hyper-V cluster nodes, I would make a note of which VMs were running on each node, live migrate them to the other cluster nodes, pause the node I needed to work on, and
    carry out the updates. Once I'd finished installing the updates, I'd simply resume the node and live migrate the VMs back to their original node.
    Having recently upgraded my nodes to Windows 2012 R2, I decided to use the new functionality in Failover Cluster Manager where you can pause and drain a node of its roles, perform the updates/maintenance, and then resume and fail the roles back to the node.
    Unfortunately this didn't go as smoothly as I'd hoped; for some reason the drain/fail back seems to be cumulative rather than a one-off job per node. It's hard to explain, but hopefully the following will be clear enough if the formatting survives:
    1. Beginning State:
    Hyper1     Hyper2     Hyper3
    VM01       VM04       VM07
    VM02       VM05       VM08
    VM03       VM06       VM09
    2. Drain Hyper1:
    Hyper1     Hyper2     Hyper3
               VM04       VM01
               VM05       VM02
               VM06       VM03
                          VM07
                          VM08
                          VM09
    3. Fail Roles Back:
    Hyper1     Hyper2     Hyper3
    VM01       VM04       VM07
    VM02       VM05       VM08
    VM03       VM06       VM09
    4. Drain Hyper2:
    Hyper1     Hyper2     Hyper3
    VM01                  VM04
    VM02                  VM05
    VM03                  VM06
                          VM07
                          VM08
                          VM09
    5. Fail Roles Back:
    Hyper1     Hyper2     Hyper3
               VM01       VM07
               VM02       VM08
               VM03       VM09
               VM04
               VM05
               VM06
    6. Manually Live Migrate VMs back to correct location:
    Hyper1     Hyper2     Hyper3
    VM01       VM04       VM07
    VM02       VM05       VM08
    VM03       VM06       VM09
    7. Drain Hyper3:
    Hyper1     Hyper2     Hyper3
    VM01       VM04
    VM02       VM05
    VM03       VM06
               VM07
               VM08
               VM09
    8. Fail Roles Back:
    Hyper1     Hyper2     Hyper3
                          VM01
                          VM02
                          VM03
                          VM04
                          VM05
                          VM06
                          VM07
                          VM08
                          VM09
    9. Manually Live Migrate VMs back to correct location:
    Hyper1     Hyper2     Hyper3
    VM01       VM04       VM07
    VM02       VM05       VM08
    VM03       VM06       VM09
    Step 8 was a rather hairy moment, although I was pleased to see my cluster hardware capacity planning rubber-stamped; good to know that if I were ever to lose 2 out of 3 nodes, everything would keep ticking over!
    So, I'm back to the old way of doing things for now. Has anyone else experienced this strange behaviour?
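    For what it's worth, I was driving the drain/resume with the equivalent of the commands below (a sketch; the node name is an example), so I don't think I was clicking the wrong thing:
        Suspend-ClusterNode -Name Hyper1 -Drain -Wait
        # ...apply updates and reboot...
        Resume-ClusterNode -Name Hyper1 -Failback Immediate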
    Thanks in advance,
    Ben

    Hi,
    Just want to confirm the current situation.
    Please feel free to let us know if you need further assistance.
    Regards.

  • Cluster Quorum Disk failing inside Guest cluster VMs in Hyper-V Cluster using Virtual Disk Sharing Windows Server 2012 R2

    Hi, I'm having a problem in a VM guest cluster using Windows Server 2012 R2 with virtual disk sharing enabled. 
    It's a SQL 2012 cluster, which has around 10 VHDX disks shared this way. All the VHDX files are inside LUNs on a SAN. These LUNs are presented to all clustered members of the Windows Server 2012 R2 Hyper-V cluster via Cluster Shared Volumes.
    Yesterday a very strange problem happened: both the Quorum disk and the DTC disk had their contents completely erased. The VHDX files themselves were still there, but the information inside was gone.
    The SQL admin had to recreate both disks, but now we don't know whether this issue was related to the virtualization platform or to another event inside the cluster itself.
    Right now I'm seeing these errors on one of the VM guests:
     Log Name:      System
    Source:        Microsoft-Windows-FailoverClustering
    Date:          3/4/2014 11:54:55 AM
    Event ID:      1069
    Task Category: Resource Control Manager
    Level:         Error
    Keywords:      
    User:          SYSTEM
    Computer:      ServerDB02.domain.com
    Description:
    Cluster resource 'Quorum-HDD' of type 'Physical Disk' in clustered role 'Cluster Group' failed.
    Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster
    Manager or the Get-ClusterResource Windows PowerShell cmdlet.
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="Microsoft-Windows-FailoverClustering" Guid="{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}" />
        <EventID>1069</EventID>
        <Version>1</Version>
        <Level>2</Level>
        <Task>3</Task>
        <Opcode>0</Opcode>
        <Keywords>0x8000000000000000</Keywords>
        <TimeCreated SystemTime="2014-03-04T17:54:55.498842300Z" />
        <EventRecordID>14140</EventRecordID>
        <Correlation />
        <Execution ProcessID="1684" ThreadID="2180" />
        <Channel>System</Channel>
        <Computer>ServerDB02.domain.com</Computer>
        <Security UserID="S-1-5-18" />
      </System>
      <EventData>
        <Data Name="ResourceName">Quorum-HDD</Data>
        <Data Name="ResourceGroup">Cluster Group</Data>
        <Data Name="ResTypeDll">Physical Disk</Data>
      </EventData>
    </Event>
    Log Name:      System
    Source:        Microsoft-Windows-FailoverClustering
    Date:          3/4/2014 11:54:55 AM
    Event ID:      1558
    Task Category: Quorum Manager
    Level:         Warning
    Keywords:      
    User:          SYSTEM
    Computer:      ServerDB02.domain.com
    Description:
    The cluster service detected a problem with the witness resource. The witness resource will be failed over to another node within the cluster in an attempt to reestablish access to cluster configuration data.
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="Microsoft-Windows-FailoverClustering" Guid="{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}" />
        <EventID>1558</EventID>
        <Version>0</Version>
        <Level>3</Level>
        <Task>42</Task>
        <Opcode>0</Opcode>
        <Keywords>0x8000000000000000</Keywords>
        <TimeCreated SystemTime="2014-03-04T17:54:55.498842300Z" />
        <EventRecordID>14139</EventRecordID>
        <Correlation />
        <Execution ProcessID="1684" ThreadID="2180" />
        <Channel>System</Channel>
        <Computer>ServerDB02.domain.com</Computer>
        <Security UserID="S-1-5-18" />
      </System>
      <EventData>
        <Data Name="NodeName">ServerDB02</Data>
      </EventData>
    </Event>
    We don't know if this can happen again; what if it happens on a disk with data? We don't know if this is related to the virtual disk sharing technology or anything else related to virtualization, but I'm asking here to find out if that's a possibility.
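    In the meantime, I'm keeping an eye on the shared disk resources from inside the guest cluster with something like this (a sketch):
        # State and ownership of the shared disks in the guest cluster
        Get-ClusterResource | Where-Object { $_.ResourceType -eq "Physical Disk" } |
            Format-Table Name, State, OwnerGroup, OwnerNode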
    Any ideas are appreciated.
    Thanks.
    Eduardo Rojas

    Hi,
    Please refer to the following link:
    http://blogs.technet.com/b/keithmayer/archive/2013/03/21/virtual-machine-guest-clustering-with-windows-server-2012-become-a-virtualization-expert-in-20-days-part-14-of-20.aspx#.Ux172HnxtNA
    Best Regards,
    Vincent Wu

  • Hyper-V Failover Cluster Networking Configuration After Install

    Hello All,
    Is it possible to install Hyper-V and Failover Clustering (in other words, create a Hyper-V failover cluster) and then configure the networking part of the solution later? As I am still coming to terms with the networking part of it, I wanted to do it after the install. Is that possible?
    By later configuration I mean the creation of the NIC team, virtual NICs, VLAN tagging, etc.

    Hi,
    Failover cluster deployment requires network connectivity between cluster nodes. You can't create a cluster without properly configured TCP/IP on the cluster nodes.
    http://OpsMgr.ru/

  • Network DR test causes Exchange DAG network to fail (Failover Cluster Manager reports comms errors)

    We have a DAG configured between 2 mailbox servers, one in each of our main data centres. Our comms team recently performed a DR test between our 2 data centres, switching from the main production link to the backup link. During this outage the Failover
    Cluster Manager reported errors, with each mailbox server reporting the other as uncontactable. The events that were logged include the following:
    Isatap interface isatap.{02ADE20A-D5D4-437F-AD00-E6601F7E7A9D} is no longer active. (EventID 4201)
    Cluster node 'MAILBOX_SERVER' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the
    Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is
    connected such as hubs, switches, or bridges. (EventID 1135)
    File share witness resource 'File Share Witness (\\WITNESS_SERVER\SHARE_NAME)' failed to arbitrate for the file share '\\WITNESS_SERVER\SHARE_NAME'. Please ensure that file share '\\WITNESS_SERVER\SHARE_NAME' exists and is accessible by the cluster. (EventID
    1564)
    Cluster resource 'File Share Witness (\\\WITNESS_SERVER\SHARE_NAME)' in clustered service or application 'Cluster Group' failed. (EventID 1069)
    The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk. Run the Validate a Configuration wizard to check your network
    configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. (EventID 1177)
    The Cluster Service service terminated with service-specific error A quorum of cluster nodes was not present to form a cluster. (EventID 7024)
    The Microsoft Exchange Information Store service terminated unexpectedly.  It has done this 1 time(s).  The following corrective action will be taken in 5000 milliseconds: Restart the service. (EventID 7031)
    Looking at the Cluster Events in the Failover Cluster Manager Snap-In i see a heap of Event ID 47 (cannot activate the DAG databases as the server is not up according to Windows Failover Cluster Service) and:
    Node status could not be recorded. This could prevent some network failure logic from functioning correctly. NodeStatus:IsHealthy=True,HasADAccess=True,ClusterErrorOverrideFalse,LastUpdate=5/2/2011 8:25:42 AMUTC Failure:An Active Manager operation failed.
    Error: An error occurred while attempting a cluster operation. Error: Cluster API '"ClusterRegSetValue() failed with 0x6be. Error: The remote procedure call failed"' failed.. (EventID 184)
    Forcefully dismounting all the locally mounted databases on server 'BACKUP_MAILBOX_SERVER. (EventID 307).
    Our comms team doesn't believe it is a comms issue, as they did not log any network communication errors between the servers in the two sites (using ICMP). So if it is not a comms issue, how can I configure the failover cluster to be resilient to
    this type of network failover event?
    Thanks
    Dan

    Isn't it also true that in a stretched DAG with an even number of nodes, the PAM needs to be in the same site as the active DAG node?  If the connection between the nodes goes down, and the PAM is in the "passive" site, the primary node will
    dismount the databases, since it can't check with the PAM to make sure it's safe to stay up.
    In an even-numbered-node stretched DAG, the PAM moves to the DR/passive site every time a failover occurs, but doesn't automatically switch back when you reactivate the primary node.
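    On the resilience question: if the DR switchover only drops the link briefly, one knob worth looking at is the cluster heartbeat tuning, so the nodes tolerate a longer outage before evicting each other (a sketch; test values in your own environment):
        # Current heartbeat tuning
        Get-Cluster | Format-List *SubnetDelay, *SubnetThreshold
        # Example: tolerate roughly 20 seconds of cross-site loss instead of the default ~5
        (Get-Cluster).CrossSubnetDelay = 2000
        (Get-Cluster).CrossSubnetThreshold = 10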

  • Hyper-V internal (1 of 2, NOT Cluster) network unavailable in Failover Cluster Manager 2008R2

    Hi all,
    I had a very strange situation in my two-node Hyper-V cluster:
    I have one network for heartbeat only (10.0.0.0/24) and a second for Hyper-V internal networking for virtual machines (marked "Do not allow cluster network communication" in its properties).
    Machines were working properly, and so was any migration.
    One day, my second node, HyperV2, was marked red in the Failover Cluster Manager MMC. I discovered that the Hyper-V LAN was unavailable on this second node. BUT everything was working properly: the HyperV2 node was on the internet, communicated with the AD domain, and could even run any
    virtual machine...
    Several times I checked the configuration, including the TMG configuration; I wondered whether a network access rule was set wrong. I tried restarting the host with no result... the network was still unavailable.
    After about an hour I found the resolution:
    On my second Hyper-V node, disable/enable the Local Area Connection network adapter connected to the Hyper-V LAN in the Network Connections control panel!
    Hope this will help somebody ;)
    Marian, just trying to help you

    Resolution:
    On the affected Hyper-V node, disable/enable the Local Area Connection network adapter connected to the Hyper-V LAN in the Network Connections control panel.
    I guess something got flushed in the network configuration, possibly in combination with the network adapter driver.
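    If you prefer the command line, the same disable/enable cycle can be done from an elevated prompt (2008 R2 has no NetAdapter cmdlets, so netsh is one option; the adapter name is an example):
        # Cycle the adapter bound to the Hyper-V LAN
        netsh interface set interface name="Local Area Connection" admin=disabled
        netsh interface set interface name="Local Area Connection" admin=enabled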
    Marian, just trying to help you
