Host server live migration causing Guest Cluster node to go down
Hi
I have a two-node Hyper-V host cluster. I'm using a converged network for host management, live migration, and the cluster network, and separate NICs for iSCSI multi-pathing. When I live migrate a guest node from one host to another, the node within
the guest cluster goes down. I have increased the cluster threshold and delay values. Guest nodes connect to the iSCSI network directly from the iSCSI initiator on Server 2012.
The converged management, cluster, and live migration networks are built on top of a NIC team in switch-independent mode with Hyper-V Port load balancing.
I have VMQ enabled on the converged fabric and jumbo frames enabled on iSCSI.
Can anyone guess why live migration would cause a failure on the guest node?
thanks
mumtaz
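For reference, the "threshold" and "delay" values mentioned above are presumably the cluster common properties SameSubnetDelay and SameSubnetThreshold. Inside the guest cluster they can be inspected and raised like this (the values shown are illustrative, not a recommendation):

```powershell
# Run inside the guest cluster, not on the Hyper-V hosts.
# Show the current heartbeat settings:
Get-Cluster | Format-List *SubnetDelay, *SubnetThreshold

# Allow the guest nodes to tolerate the brief network blackout
# that can occur while the host live-migrates the VM:
(Get-Cluster).SameSubnetDelay = 2000      # ms between heartbeats
(Get-Cluster).SameSubnetThreshold = 10    # missed heartbeats before a node is declared down
```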
Repost here: http://social.technet.microsoft.com/Forums/en-US/winserverhyperv/threads
in the Hyper-V forum. You'll get a lot more help there.
This forum is for Virtual Server 2005.
Similar Messages
-
Hyper-V Guest Cluster Node Failing Regularly
Hi,
We currently have a 4-node Server 2012 R2 cluster which hosts, among other things, a 3-node guest cluster running a single clustered file service.
Around once a week, the guest cluster node that is currently hosting the clustered file service will fail. It's as if the VM is blue screening. That in itself is fairly annoying, and I'll be doing all the updates and checking the event log for clues
as to the cause.
The problem then is that whichever physical cluster node is hosting the VM when it fails will not unlock some of the VM's files. The virtual machine configuration lists as Online Pending. This means the failed VM cannot be restarted
on any other cluster node. The only fix is to drain the physical host it failed on, and reboot.
Looking for suggestions on how to fix the following.
1. Crashing guest file cluster node
2. Failed VM with shared VHDX requiring physical host reboot.
Event messages from the physical host that was hosting the failed VM, in the order they occurred:
Hyper-V-Worker: Event ID 18590 - 'FS-03' has encountered a fatal error. The guest operating system reported that it failed with the following error codes: ErrorCode0: 0x9E, ErrorCode1: 0x6C2A17C0, ErrorCode2: 0x3C, ErrorCode3: 0xA, ErrorCode4:
0x0. If the problem persists, contact Product Support for the guest operating system. (Virtual machine ID 36166B47-D003-4E51-AFB5-7B967A3EFD2D)
FailoverClustering: Event ID 1069 - Cluster resource 'Virtual Machine FS-03' of type 'Virtual Machine' in clustered role 'FS-03' failed.
Hyper-V-High-Availability: Event ID 21128 - 'Virtual Machine FS-03' failed to shutdown the virtual machine during the resource termination. The virtual machine will be forcefully stopped.
Hyper-V-High-Availability: Event ID 21110 - 'Virtual Machine FS-03' failed to terminate.
Hyper-V-VMMS: Event ID 20108 - The Virtual Machine Management Service failed to start the virtual machine '36166B47-D003-4E51-AFB5-7B967A3EFD2D': The group or resource is not in the correct state to perform the requested operation. (0x8007139F).
Hyper-V-High-Availability: Event ID 21107 - 'Virtual Machine FS-03' failed to start.
FailoverClustering: Event ID 1205 - The Cluster service failed to bring clustered role 'FS-03' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.
Hi,
I could not find a similar issue. Does your cluster pass cluster validation? Are all of your Hyper-V hosts compatible with Server 2012 R2? Have you tried disabling all of your
AV software and firewalls? Please rerun storage validation on the cluster during non-production hours; the cluster validation report will quickly locate the issue.
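If you only need the storage tests, a sketch (node names are placeholders; storage validation is disruptive, hence the non-production-hours advice above):

```powershell
# Rerun just the storage portion of cluster validation:
Test-Cluster -Node "Node1","Node2","Node3","Node4" -Include "Storage"
```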
More information:
Cluster
http://technet.microsoft.com/en-us/library/dd581778(v=ws.10).aspx
Hope this helps.
-
Live Migration to Best possible node
Hi,
I have a 20 node cluster with virtual machine role.
I would like to know the equivalent PowerShell command for migrating VMs to the best possible node.
When I right-click a VM in Cluster Manager, I see the option below for live migration to the best possible node. I would like to achieve the same thing through PowerShell.
Thanks in advance.
Thanks, Krishna
Well, you're asking the cluster to make its best determination on where the VMs should go, so I don't really know that I can second-guess its behavior.
You could ensure that all highly available VMs are moved off a specific node by using
Suspend-ClusterNode:
Suspend-ClusterNode -Name "node1" -Drain
Then when you want to put the roles back, use
Resume-ClusterNode:
Resume-ClusterNode -Name "node1" -Failback Immediate
You can enter multiple node names at once, if you want.
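For what it's worth, the GUI's "Best Possible Node" action corresponds to a live migration with no explicit destination; as far as I can tell, omitting -Node lets the cluster choose the target itself (the VM name is a placeholder):

```powershell
# Live-migrate one clustered VM and let the cluster pick the node:
Move-ClusterVirtualMachineRole -Name "MyVM" -MigrationType Live
```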
But if stress testing a network device is your aim, I would look at actual test tools, like
IOMeter.
Eric Siron Altaro Hyper-V Blog
I am an independent blog contributor, not an Altaro employee. I am solely responsible for the content of my posts.
"Every relationship you have is in worse shape than you think." -
Reports server 10.1.2.0.2 keeps going down!
Hi Guys,
The reports server 10.1.2.0.2 (in-process, NOT the standalone one) works all day; when we come to work the next day, we find that the reports server is down and giving
the following error:
REP-501: Unable to connect to the specified database.
We have to restart the reports server for it to work. The reports server connects to a cluster database; therefore, we don't shut down the database during the night for backup, so the reports server connections should not fail.
I am thinking it may be related to being idle and somehow losing connections.
http://servername:7778/reports/rwservlet?server=rep_servername&envid=PIBIS&report=SWRANBT&destype=CACHE&desformat=PDF&paramform=YES&userid=
Operating System: Sun Solaris SPARC 64-bit
Oracle Application Server: 10.1.2.0.2
Oracle DB: 10.1.0.4
Please advise,
Cheers,
Feras
Hello,
Please check the engine (rwEng) trace file; is it showing a similar error to the following?
REP-0501: Unable to connect to the specified database.
ORA-24323: value not allowed
> therefore, we dont shutdown the database during night for backup
It has also been stated as a reason/cause in the doc;
It is possible that the connection has been lost simply because a scheduled (e.g. overnight) backup and restart of the Oracle Server has taken place while the Report Server has remained up and running.
Please review and see if it is helpful ;
REP-501 on Initial Run After The Server's been Idle. Cannot Recover until Engine Restarts: Doc ID: Note:357652.1
https://metalink.oracle.com/metalink/plsql/f?p=130:14:804350786344799034::::p14_database_id,p14_docid,p14_show_header,p14_show_help,p14_black_frame,p14_font:NOT,357652.1,1,1,1,helvetica
Adith -
Live Upgrade fails on cluster node with zfs root zones
We are having issues using Live Upgrade in the following environment:
-UFS root
-ZFS zone root
-Zones are not under cluster control
-System is fully up to date for patching
We also use Live Upgrade with the exact same system configuration on other nodes, except that the zones are UFS root, and Live Upgrade works fine.
Here is the output of a Live Upgrade:
bash-3.2# lucreate -n sol10-20110505 -m /:/dev/md/dsk/d302:ufs,mirror -m /:/dev/md/dsk/d320:detach,attach,preserve -m /var:/dev/md/dsk/d303:ufs,mirror -m /var:/dev/md/dsk/d323:detach,attach,preserve
Determining types of file systems supported
Validating file system requests
The device name </dev/md/dsk/d302> expands to device path </dev/md/dsk/d302>
The device name </dev/md/dsk/d303> expands to device path </dev/md/dsk/d303>
Preparing logical storage devices
Preparing physical storage devices
Configuring physical storage devices
Configuring logical storage devices
Analyzing system configuration.
Comparing source boot environment <sol10> file systems with the file
system(s) you specified for the new boot environment. Determining which
file systems should be in the new boot environment.
Updating boot environment description database on all BEs.
Updating system configuration files.
The device </dev/dsk/c0t1d0s0> is not a root device for any boot environment; cannot get BE ID.
Creating configuration for boot environment <sol10-20110505>.
Source boot environment is <sol10>.
Creating boot environment <sol10-20110505>.
Creating file systems on boot environment <sol10-20110505>.
Preserving <ufs> file system for </> on </dev/md/dsk/d302>.
Preserving <ufs> file system for </var> on </dev/md/dsk/d303>.
Mounting file systems for boot environment <sol10-20110505>.
Calculating required sizes of file systems for boot environment <sol10-20110505>.
Populating file systems on boot environment <sol10-20110505>.
Checking selection integrity.
Integrity check OK.
Preserving contents of mount point </>.
Preserving contents of mount point </var>.
Copying file systems that have not been preserved.
Creating shared file system mount points.
Creating snapshot for <data/zones/img1> on <data/zones/img1@sol10-20110505>.
Creating clone for <data/zones/img1@sol10-20110505> on <data/zones/img1-sol10-20110505>.
Creating snapshot for <data/zones/jdb3> on <data/zones/jdb3@sol10-20110505>.
Creating clone for <data/zones/jdb3@sol10-20110505> on <data/zones/jdb3-sol10-20110505>.
Creating snapshot for <data/zones/posdb5> on <data/zones/posdb5@sol10-20110505>.
Creating clone for <data/zones/posdb5@sol10-20110505> on <data/zones/posdb5-sol10-20110505>.
Creating snapshot for <data/zones/geodb3> on <data/zones/geodb3@sol10-20110505>.
Creating clone for <data/zones/geodb3@sol10-20110505> on <data/zones/geodb3-sol10-20110505>.
Creating snapshot for <data/zones/dbs9> on <data/zones/dbs9@sol10-20110505>.
Creating clone for <data/zones/dbs9@sol10-20110505> on <data/zones/dbs9-sol10-20110505>.
Creating snapshot for <data/zones/dbs17> on <data/zones/dbs17@sol10-20110505>.
Creating clone for <data/zones/dbs17@sol10-20110505> on <data/zones/dbs17-sol10-20110505>.
WARNING: The file </tmp/.liveupgrade.4474.7726/.lucopy.errors> contains a
list of <2> potential problems (issues) that were encountered while
populating boot environment <sol10-20110505>.
INFORMATION: You must review the issues listed in
</tmp/.liveupgrade.4474.7726/.lucopy.errors> and determine if any must be
resolved. In general, you can ignore warnings about files that were
skipped because they did not exist or could not be opened. You cannot
ignore errors such as directories or files that could not be created, or
file systems running out of disk space. You must manually resolve any such
problems before you activate boot environment <sol10-20110505>.
Creating compare databases for boot environment <sol10-20110505>.
Creating compare database for file system </var>.
Creating compare database for file system </>.
Updating compare databases on boot environment <sol10-20110505>.
Making boot environment <sol10-20110505> bootable.
ERROR: unable to mount zones:
WARNING: zone jdb3 is installed, but its zonepath /.alt.tmp.b-tWc.mnt/zoneroot/jdb3-sol10-20110505 does not exist.
WARNING: zone posdb5 is installed, but its zonepath /.alt.tmp.b-tWc.mnt/zoneroot/posdb5-sol10-20110505 does not exist.
WARNING: zone geodb3 is installed, but its zonepath /.alt.tmp.b-tWc.mnt/zoneroot/geodb3-sol10-20110505 does not exist.
WARNING: zone dbs9 is installed, but its zonepath /.alt.tmp.b-tWc.mnt/zoneroot/dbs9-sol10-20110505 does not exist.
WARNING: zone dbs17 is installed, but its zonepath /.alt.tmp.b-tWc.mnt/zoneroot/dbs17-sol10-20110505 does not exist.
zoneadm: zone 'img1': "/usr/lib/fs/lofs/mount /.alt.tmp.b-tWc.mnt/global/backups/backups/img1 /.alt.tmp.b-tWc.mnt/zoneroot/img1-sol10-20110505/lu/a/backups" failed with exit code 111
zoneadm: zone 'img1': call to zoneadmd failed
ERROR: unable to mount zone <img1> in </.alt.tmp.b-tWc.mnt>
ERROR: unmounting partially mounted boot environment file systems
ERROR: cannot mount boot environment by icf file </etc/lu/ICF.2>
ERROR: Unable to remount ABE <sol10-20110505>: cannot make ABE bootable
ERROR: no boot environment is mounted on root device </dev/md/dsk/d302>
Making the ABE <sol10-20110505> bootable FAILED.
ERROR: Unable to make boot environment <sol10-20110505> bootable.
ERROR: Unable to populate file systems on boot environment <sol10-20110505>.
ERROR: Cannot make file systems for boot environment <sol10-20110505>.
Any ideas why it can't mount that "backups" lofs filesystem into /.alt? I am going to try to remove the lofs from the zone configuration and try again. But if that works, I still need to find a way to use lofs filesystems in the zones while using Live Upgrade.
Thanks
I was able to successfully do a Live Upgrade with zones with a ZFS root in Solaris 10 update 9.
When attempting to do a "lumount s10u9c33zfs", it gave the following error:
ERROR: unable to mount zones:
zoneadm: zone 'edd313': "/usr/lib/fs/lofs/mount -o rw,nodevices /.alt.s10u9c33zfs/global/ora_export/stage /zonepool/edd313 -s10u9c33zfs/lu/a/u04" failed with exit code 111
zoneadm: zone 'edd313': call to zoneadmd failed
ERROR: unable to mount zone <edd313> in </.alt.s10u9c33zfs>
ERROR: unmounting partially mounted boot environment file systems
ERROR: No such file or directory: error unmounting <rpool1/ROOT/s10u9c33zfs>
ERROR: cannot mount boot environment by name <s10u9c33zfs>
The solution in this case was:
zonecfg -z edd313
info ;# display current setting
remove fs dir=/u05 ;#remove filesystem linked to a "/global/" filesystem in the GLOBAL zone
verify ;# check change
commit ;# commit change
exit
-
JMS Uniform Distributed Queue Unit of Order, problem when one node goes down
Hi ,
I have the following code which post a message (with Unit of Order set ) to a Uniform Distribute Queue in a cluster with two member servers (server1 and server2).
--UDQ is targeted to a subdeployment that is mapped to two JMS servers pointing to each member servers
--Connection Factory is using default targeting ( i tried mapping to Sub deployment also)
try {
    javax.naming.InitialContext serverContext = new javax.naming.InitialContext();
    javax.jms.QueueConnectionFactory qConnFactory = (javax.jms.QueueConnectionFactory) serverContext.lookup(jmsQConnFactoryName);
    javax.jms.QueueConnection qConn = (javax.jms.QueueConnection) qConnFactory.createConnection();
    javax.jms.QueueSession qSession = qConn.createQueueSession(false, Session.AUTO_ACKNOWLEDGE);
    javax.jms.Queue q = (javax.jms.Queue) serverContext.lookup(jmsQName);
    weblogic.jms.extensions.WLMessageProducer qSender = (weblogic.jms.extensions.WLMessageProducer) qSession.createProducer(q);
    qSender.setUnitOfOrder("MyUnitOfOrder");
    javax.jms.ObjectMessage message = qSession.createObjectMessage();
    HashMap<String, Object> map = new HashMap<String, Object>();
    map.put("something", "SomeObject");
    message.setObject(map);
    qSender.send(message);
} catch (Exception e) {
    e.printStackTrace();
}
Steps followed:
1. Post a message from "server1"
2. Message picked up by "server2"
3. Everything fine
4. Shutdown "server2"
5. Post a message from "server1"
6. ERROR: "hashed member of MyAppJMSModule!MyDistributedQ is MyAppJMSModule!MyJMSServer-2@MyDistributedQ which is not available"
WebLogic version : 10.3.5
Is there a way (other than configuring Path Service ) to make this code work "with unit of order" for a UDQ even if some member servers go down ?
Thanks very much for your time.
If you want to avoid use of the Path Service, then the alternative is to make the destination members highly available. This will help ensure that the host member for a particular UOO is up.
One approach to HA is to configure "service migration". For more information see the Automatic Service Migration white-paper at
http://www.oracle.com/technology/products/weblogic/pdf/weblogic-automatic-service-migration-whitepaper.pdf
In addition, I recommend referencing Best Practices for JMS Beginners and Advanced Users
http://docs.oracle.com/cd/E17904_01/web.1111/e13738/best_practice.htm#JMSAD455 to help with WL configuration in general.
Hope this helps,
Tom -
JDBC read stuck if RAC node goes down
We did several tests with Java applications against our RAC DB and face a hanging application if we power off the RAC node that is executing the current (long-running) query.
We can see that the application receives HA-events via UCP:
2015-01-22 13:02:11 | r-thread-1 | WARN | o.ucp.jdbc.oracle.ONSDatabaseFailoverEvent | NO timezone in HA event
However, the application had started a query before, and the query is not aborted with an exception. A thread dump after about 7 minutes shows that the application is hanging in a socket read call:
"pool-1-thread-1" #32 prio=5 os_prio=0 tid=0x00007fedf45b2000 nid=0xbc4 runnable [0x00007fee00cd3000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:150)
at java.net.SocketInputStream.read(SocketInputStream.java:121)
at oracle.net.ns.Packet.receive(Packet.java:283)
at oracle.net.ns.DataPacket.receive(DataPacket.java:103)
at oracle.net.ns.NetInputStream.getNextPacket(NetInputStream.java:230)
at oracle.net.ns.NetInputStream.read(NetInputStream.java:175)
at oracle.net.ns.NetInputStream.read(NetInputStream.java:100)
at oracle.net.ns.NetInputStream.read(NetInputStream.java:85)
at oracle.jdbc.driver.T4CSocketInputStreamWrapper.readNextPacket(T4CSocketInputStreamWrapper.java:123)
at oracle.jdbc.driver.T4CSocketInputStreamWrapper.read(T4CSocketInputStreamWrapper.java:79)
at oracle.jdbc.driver.T4CMAREngine.unmarshalUB1(T4CMAREngine.java:1122)
at oracle.jdbc.driver.T4CMAREngine.unmarshalSB1(T4CMAREngine.java:1099)
at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:288)
at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:191)
at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:523)
at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:207)
at oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:863)
at oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:1153)
at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1275)
at oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:3576)
at oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:3620)
- locked <0x00000000c0ddcb20> (a oracle.jdbc.driver.T4CConnection)
at oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:1491)
at org.springframework.jdbc.core.JdbcTemplate$1.doInPreparedStatement(JdbcTemplate.java:703)
The expected behaviour would be that the running query is aborted with an exception. (BTW: that is what happens if the service is taken down with "shutdown immediate". All OK for that case.)
We are considering implementing custom ONS listeners [1], but we actually expected UCP to handle such situations, or to let us register strategies/callbacks for certain events.
Our config:
Oracle Enterprise 11.2.0.4.0 with RAC
ons.jar 12.1.0.1
ojdbc6.jar 11.2.0.2
ucp.jar 12.1.0.1
Server JRE 1.8.0_25
Any hints appreciated.
[1] http://docs.oracle.com/cd/E11882_01/java.112/e16548/apxracfan.htm#JJDBC28945
Your concept isn't right:
http://docs.oracle.com/cd/E11882_01/server.112/e25494/restart.htm#ADMIN13178
Overview of Fast Application Notification
FAN is a notification mechanism that Oracle Restart can use to notify other processes about configuration changes that include service status changes, such as UP or DOWN events. FAN provides the ability to immediately terminate inflight transaction when an instance or server fails. Integrated Oracle clients receive the events and respond. Applications can respond either by propagating the error to the user or by resubmitting the transactions and masking the error from the application user. When a DOWN event occurs, integrated clients immediately clean up connections to the terminated database. When an UP event occurs, the clients create new connections to the new primary database instance.
Also, take a look at these docs: http://docs.oracle.com/cd/E11882_01/java.112/e12265/rac.htm#JJUCP08100 ; and https://support.oracle.com/epmos/faces/DocumentDisplay?_afrLoop=890204623685515&id=566573.1&_afrWindowMode=0&_adf.ctrl-s…
And make a test: execute a query that takes about 1 minute, and after you execute it, power down the node where it is executing, to see if it will retrieve the results.
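One mitigation not mentioned in this thread (my suggestion, not something the linked docs prescribe) is to bound the blocking read itself, so a query against a dead node fails with an exception instead of sitting in socketRead0. The thin driver honours the connection properties below; the class name, credentials, and values are an illustrative sketch only:

```java
import java.util.Properties;

// Sketch: connection properties that bound a blocked socket read,
// assuming the Oracle thin driver. Property names are the documented
// driver ones; credentials and timeout values are illustrative.
public class TimeoutProps {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("user", "scott");       // hypothetical credentials
        props.setProperty("password", "tiger");
        // Fail the TCP connect after 5 seconds instead of hanging:
        props.setProperty("oracle.net.CONNECT_TIMEOUT", "5000");
        // Abort a socket read that receives no data for 60 seconds; the
        // pending query then fails with an exception instead of blocking:
        props.setProperty("oracle.jdbc.ReadTimeout", "60000");
        System.out.println("ReadTimeout=" + props.getProperty("oracle.jdbc.ReadTimeout"));
    }
}
```

With UCP, such a Properties object would typically be handed to the pool via PoolDataSource.setConnectionProperties(...); verify the property names against your exact driver version, since behaviour has differed across releases.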
Regards. -
Unable to live migrate VM (error 21502)
Hi,
I have a four-node Hyper-V cluster built on Windows Server 2012. I found an issue where one virtual machine is unable to live migrate to another cluster node, with the following error:
Live migration of 'Virtual Machine VM' failed.
Virtual machine migration operation for 'VM' failed at migration destination 'HYPERV2'. (Virtual machine ID EB7708F3-6D0B-4F7E-9EC9-EA7EE718A134)
'VM' Microsoft Emulated IDE Controller (Instance ID 83F8638B-8DCA-4152-9EDA-2CA8B33039B4): Failed to restore with Error 'The process cannot access the file because another process has locked a portion of the file.' (0x80070021). (Virtual machine ID EB7708F3-6D0B-4F7E-9EC9-EA7EE718A134)
'VM': Failed to open attachment 'C:\ClusterStorage\Volume1\VM\VM.vhdx'. Error: 'The process cannot access the file because another process has locked a portion of the file.' (0x80070021). (Virtual machine ID EB7708F3-6D0B-4F7E-9EC9-EA7EE718A134)
'VM': Failed to open attachment 'C:\ClusterStorage\Volume1\VM\VM.vhdx'. Error: 'The process cannot access the file because another process has locked a portion of the file.' (0x80070021). (Virtual machine ID EB7708F3-6D0B-4F7E-9EC9-EA7EE718A134)
It's possible to migrate the VM in the Stopped state, but then the VM cannot start on the new host, with the following error:
'Virtual Machine VM' failed to start.
'VM' failed to start. (Virtual machine ID EB7708F3-6D0B-4F7E-9EC9-EA7EE718A134)
'VM' Microsoft Emulated IDE Controller (Instance ID 83F8638B-8DCA-4152-9EDA-2CA8B33039B4): Failed to Power on with Error 'The process cannot access the file because another process has locked a portion of the file.' (0x80070021). (Virtual machine ID EB7708F3-6D0B-4F7E-9EC9-EA7EE718A134)
'VM': Failed to open attachment 'C:\ClusterStorage\Volume1\VM\VM.vhdx'. Error: 'The process cannot access the file because another process has locked a portion of the file.' (0x80070021). (Virtual machine ID EB7708F3-6D0B-4F7E-9EC9-EA7EE718A134)
'VM': Failed to open attachment 'C:\ClusterStorage\Volume1\VM\VM.vhdx'. Error: 'The process cannot access the file because another process has locked a portion of the file.' (0x80070021). (Virtual machine ID EB7708F3-6D0B-4F7E-9EC9-EA7EE718A134)
Live storage migration works fine. When I migrate VM back to original node then VM starts correctly.
Thanks for any response.
Hi, Daniel,
Sometimes you might face failed live migration due to VMSwitches being named differently. So, the first thing to do is to make sure that VMSwitches on both hosts have the same name.
Also, you can try to take the cluster offline and perform the repair procedure, which appears to fix the mysterious issue causing live migrations of VMs to fail. (Open Failover Cluster Manager -> select the cluster name -> Take Offline
-> More Actions -> Repair.)
Otherwise, if you're short on time and willing to migrate the VM as soon as possible, you can perform a one-time backup/restore operation using one of the free backup utilities available on the market (VeeamZIP or similar). In many ways this
tool acts as a zip utility for VMs. It helped us a lot when migration failed for whatever reason and we didn't have enough time to find the root cause.
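To rule out the mismatched-switch-name theory quickly, something like this should work (host names are placeholders):

```powershell
# Compare virtual switch names on both cluster nodes; a VM cannot
# live-migrate if the destination has no switch with a matching name:
Get-VMSwitch -ComputerName "HYPERV1","HYPERV2" |
    Select-Object ComputerName, Name, SwitchType
```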
Kind regards, Leonardo. -
Hi ,
We are using SAP MDM 5.5 application installed in Microsoft Cluster.
Unfortunately one of our cluster nodes went down, and per the System Management team we have to rebuild node 2 from scratch.
While checking for a resolution I found the MS link below, which explains a similar situation and its resolution:
http://technet.microsoft.com/en-us/library/cc786625(v=ws.10).aspx
Scenario 6: Single Cluster Node Corruption or Failure.
While the System Management team is working on this, I want to check what other options we have; if we have to rebuild the server from scratch, what will the process be?
I am assuming the process below.
1. The Windows team rebuilds the server (OS and cluster configuration).
2. We install the Oracle DB and the MDM application from the installation media.
3. We add node 2 to the existing cluster configuration (on node 1).
But I am not sure about this process and have some doubts, e.g. on node 2 do we have to perform a fresh installation of the apps and DB as we did when installing the cluster the first time, or is there a different process since the apps and DB are working fine on node 1?
Please help me if anyone has ever faced this kind of issue.
Thanks and Regards
Alok
Edited by: Alok Jain on Mar 6, 2012 7:47 AM
Hi buddy,
What a pity!!! :(
I wish you the best for this recovery!!!
About Your questions:
Am I being too paranoid with this and wasting too much time on a mock environment while running on risky hardware?
I don't think so. As you've never done it yet, I guess it's safer to test it before. It can become worse if you do the wrong thing :)
Is the recovery of this node really as straightforward as it seems: delete the node, add the node back?
Yes. As you have to rebuild the node, you'll have to rebuild CRS too. You have to remove and add the node again. Don't forget about the instance, listeners, services, etc. The procedure in the documentation is really, really clean.
Can I add the node back as the same named node, or will the cluster freak out due to some lingering previous config?
You can add the node back as the same named node.
Are there any other "gotchas" I may not be thinking about that some of you may have experienced?
As you said, this is a very crucial component of your production system; if I were you, I would work with Oracle Support instead of executing everything by myself.
Good Luck!
Cerreia -
Server 2012 cluster - virtual machine live migration does not work
Hi,
We have a hyper-v cluster with two nodes running Windows Server 2012. All the configurations are identical.
When I try to do a live migration from one node to the other, I get an error message saying:
Live migration of 'Virtual Machine XXXXXX' failed.
I get no other error messages, not even in Event Viewer. The same happens with all of our virtual machines.
A normal quick migration works just fine for all of the virtual machines, so network configuration should not be an issue.
The above error message does not provide much information.
Hi,
Please check whether your configuration meets the live migration requirements:
Two (or more) servers running Hyper-V that:
Support hardware virtualization.
Yes they support virtualization.
Are using processors from the same manufacturer (for example, all AMD or all Intel).
Both Servers are identical and brand new Fujitsu-Siemens RX300S7 with the same kind of processor (Xeon E5-2620).
Belong to either the same Active Directory domain, or to domains that trust each other.
Both nodes are in the same domain.
Virtual machines must be configured to use virtual hard disks or virtual Fibre Channel disks (no physical disks).
All of the virtual machines have virtual hard disks.
Use of a private network is recommended for live migration network traffic.
Have tried this, but does not help.
Requirements for live migration in a cluster:
Windows Failover Clustering is enabled and configured.
Yes
Cluster Shared Volume (CSV) storage in the cluster is enabled.
Yes
Requirements for live migration using shared storage:
All files that comprise a virtual machine (for example, virtual hard disks, snapshots, and configuration) are stored on an SMB share.
They are all on the same CSV.
Permissions on the SMB share have been configured to grant access to the computer accounts of all servers running Hyper-V.
Requirements for live migration with no shared infrastructure:
No extra requirements exist.
Also, please refer to this article to check whether you have finished all preparation work for live migration:
Virtual Machine Live Migration Overview
http://technet.microsoft.com/en-us/library/hh831435.aspx
Hyper-V: Using Live Migration with Cluster Shared Volumes in Windows Server 2008 R2
http://technet.microsoft.com/en-us/library/dd446679(v=WS.10).aspx
Configure and Use Live Migration on Non-clustered Virtual Machines
http://technet.microsoft.com/en-us/library/jj134199.aspx
Hope this helps!
Lawrence
TechNet Community Support
I have also read all of the technet articles but can't find anything that could help. -
JNDI Lookup for multiple server instances with multiple cluster nodes
Hi Experts,
I need help with retrieving log files for multiple server instances with multiple cluster nodes. The system is NetWeaver 7.01.
There are 3 server instances, each with 3 cluster nodes.
There are EJB session beans deployed on them to retrieve the log information for each server node.
In the session bean there is a method:
public List getServers() {
    List servers = new ArrayList();
    ClassLoader saveLoader = Thread.currentThread().getContextClassLoader();
    try {
        Properties prop = new Properties();
        prop.setProperty(Context.INITIAL_CONTEXT_FACTORY, "com.sap.engine.services.jndi.InitialContextFactoryImpl");
        prop.put(Context.SECURITY_AUTHENTICATION, "none");
        Thread.currentThread().setContextClassLoader((com.sap.engine.services.adminadapter.interfaces.RemoteAdminInterface.class).getClassLoader());
        InitialContext mInitialContext = new InitialContext(prop);
        RemoteAdminInterface rai = (RemoteAdminInterface) mInitialContext.lookup("adminadapter");
        ClusterAdministrator cadm = rai.getClusterAdministrator();
        ConvenienceEngineAdministrator cea = rai.getConvenienceEngineAdministrator();
        int nodeId[] = cea.getClusterNodeIds();
        String dispatcherIP = null;
        String p4Port = null;
        for (int i = 0; i < nodeId.length; i++) {
            // Only dispatcher nodes (type 1) expose a P4 port that clients can reach
            if (cea.getClusterNodeType(nodeId[i]) != 1)
                continue;
            Properties dispatcherProp = cadm.getNodeInfo(nodeId[i]);
            dispatcherIP = dispatcherProp.getProperty("Host", "localhost");
            p4Port = cea.getServiceProperty(nodeId[i], "p4", "port");
            String[] loc = new String[3];
            loc[0] = dispatcherIP;
            loc[1] = p4Port;
            loc[2] = null;
            servers.add(loc);
        }
        mInitialContext.close();
    } catch (NamingException e) {
        e.printStackTrace();
    } catch (RemoteException e) {
        e.printStackTrace();
    } finally {
        Thread.currentThread().setContextClassLoader(saveLoader);
    }
    return servers;
}
and the retrieved server information is used here in another class:
public void run() {
    ReadLogsSession readLogsSession;
    int total = servers.size();
    for (Iterator iter = servers.iterator(); iter.hasNext();) {
        if (keepAlive) {
            try {
                Thread.sleep(500);
            } catch (InterruptedException e) {
                status = status + e.getMessage();
                System.err.println("LogReader Thread Exception" + e.toString());
                e.printStackTrace();
            }
            String[] serverLocs = (String[]) iter.next();
            searchFilter.setDetails("[" + serverLocs[1] + "]");
            Properties prop = new Properties();
            prop.put(Context.INITIAL_CONTEXT_FACTORY, "com.sap.engine.services.jndi.InitialContextFactoryImpl");
            prop.put(Context.PROVIDER_URL, serverLocs[0] + ":" + serverLocs[1]);
            System.err.println("LogReader run [" + serverLocs[0] + ":" + serverLocs[1] + "]");
            status = " Reading :[" + serverLocs[0] + ":" + serverLocs[1] + "] servers :[" + currentIndex + "/" + total + " ] ";
            prop.put("force_remote", "true");
            prop.put(Context.SECURITY_AUTHENTICATION, "none");
            try {
                Context ctx = new InitialContext(prop);
                Object ob = ctx.lookup("com.xom.sia.ReadLogsSession");
                ReadLogsSessionHome readLogsSessionHome = (ReadLogsSessionHome) PortableRemoteObject.narrow(ob, ReadLogsSessionHome.class);
                status = status + "Found ReadLogsSessionHome [" + readLogsSessionHome + "]";
                readLogsSession = readLogsSessionHome.create();
                if (readLogsSession != null) {
                    status = status + " Created [" + readLogsSession + "]";
                    List l = readLogsSession.getAuditLogs(searchFilter);
                    serverLocs[2] = String.valueOf(l.size());
                    status = status + serverLocs[2];
                    allRecords.addAll(l);
                } else {
                    status = status + " unable to create readLogsSession ";
                }
                ctx.close();
            } catch (NamingException e) {
                status = status + e.getMessage();
                System.err.println(e.getMessage());
                e.printStackTrace();
            } catch (CreateException e) {
                status = status + e.getMessage();
                System.err.println(e.getMessage());
                e.printStackTrace();
            } catch (IOException e) {
                status = status + e.getMessage();
                System.err.println(e.getMessage());
                e.printStackTrace();
            } catch (Exception e) {
                status = status + e.getMessage();
                System.err.println(e.getMessage());
                e.printStackTrace();
            }
            currentIndex++;
        }
    }
    jobComplete = true;
}
The application works against multiple server instances when there is a single cluster node, but not in a multi-node clustered environment.
Does anybody know what should be changed to handle more cluster nodes?
Thanks,
Gergely
Thanks for the response.
I was afraid that it would be something like that, although I was hoping for something closer to the application pools we use with IIS to isolate sites and limit the impact one badly behaving one can have on another.
mmr
"Ian Skinner" <[email protected]> wrote in message news:fe5u5v$pue$[email protected]..
> Run CF with one instance. Look at your processes and see how much memory
> the "JRun" process is using, then multiply this by the number of other CF
> instances.
>
> You are most likely going to end up implementing a "handful" of instances
> versus "dozens" of instances on all but the beefiest of servers.
>
> This can be affected by how much memory each instance uses. An application
> that puts major amounts of data into persistent scopes such as application
> and|or session will have a larger footprint than a leaner application that
> does not put much data into memory and|or leave it there for a very long time.
>
> I know the first time we made use of CF in its multi-homed flavor, we went
> a bit overboard and created way too many. After nearly bringing a moderate
> server to its knees, we consolidated until we had three or four or so IIRC.
> A couple dedicated to each of our largest and most critical applications,
> and a couple of general instances that each ran many smaller applications.
-
VM live migration during OVM server upgrade
Hi Guys,
I'm planning to upgrade OVM 3.1.1 to 3.2.7.
There are 4 OVM Servers in the server pool and all are using the same CPU family, which means live migration is possible.
I'm just wondering: if I upgrade one OVM server to 3.2.7 first, will it still be possible to live migrate VMs from the 3.1.1 servers to the new 3.2.7 server?
Thanks in advance.
Jay
Hi Jay,
I'd do the following:
- free up one OVS by migrating all guests to the remaining OVS
- upgrade OVM Manager straight to 3.2.8
- upgrade the idle OVS to 3.2.8
- live migrate your guests from one 3.1.1 OVS to the new, idle 3.2.8 OVS - if not using OVMM, then using xm
- round robin upgrade your remaining OVS
I've done that a couple of times…
Cheers,
budy -
Guest Cluster error in Hyper-V Cluster
Hello everybody,
in my environment I have an issue with failover clusters (Exchange, Fileserver) while performing a live migration of one virtual cluster node. The cluster group goes offline.
The environment is the following:
2x Hyper-V Clusters: Hyper-V-Cluster1 and Hyper-V-Cluster2 (Windows Server 2012 R2) with 5 Nodes per Cluster
1x Scaleout Fileserver (Windows Server 2012 R2) with 2 Nodes
1x Exchange Cluster (Windows Server 2012 R2) with EX01 VM running on Hyper-V-Cluster1 and EX02 VM running on Hyper-V-Cluster2
1x Fileserver Failover Cluster (Windows Server 2012 R2) with FS01 VM running on Hyper-V-Cluster1 and FS02 VM running on Hyper-V-Cluster2
The physical networks on the Hyper-V Nodes are redundant with 2x 10Gb/s uplinks to 2x physical switches for VMs in a LBFO Team:
New-NetLbfoTeam -Name 10Gbit_TEAM -TeamMembers 10Gbit_01,10Gbit_02 -TeamingMode SwitchIndependent -LoadBalancingAlgorithm HyperVPort
The SMB 3 traffic runs on 2x 10Gb/s NICs without NIC teaming (SMB Multichannel).
SMB is used for live migrations.
The VMs for clustering were installed according to the technet guideline:
http://technet.microsoft.com/en-us/library/dn265980.aspx
Because my Hyper-V uplinks are already redundant, I am using one NIC inside the VM.
As I understand it, there is no advantage to using two NICs inside the VM as long as they are connected to the same vSwitch.
Now, when I want to perform hardware maintenance, I have to live migrate the EX01 VM from Hyper-V-Cluster1-Node-1 to Hyper-V-Cluster1-Node-2.
EX02 VM still runs untouched on Hyper-V-Cluster2-Node-1.
At the end of the live migration I see error 1135 (source: FailoverClustering) on the EX01 VM, which says that the EX02 VM was removed from the failover cluster and that I should check my network.
The Exchange cluster group is offline after that event and I have to bring it online again manually.
Any ideas what can cause this behavior?
Thanks.
Greetings,
torsten
Hello again,
I found the cause and the solution :-)
In the article here: http://technet.microsoft.com/en-us/library/dn440540.aspx
is the description of my cluster failure:
########## relevant part from article #######################
Protect against short-term network interruptions
Failover cluster nodes use the network to send heartbeat packets to other nodes of the cluster. If a node does not receive a response from another node for a specified period of time, the cluster removes the node from cluster membership. By default, a guest
cluster node is considered down if it does not respond within 5 seconds. Other nodes that are members of the cluster will take over any clustered roles that were running on the removed node.
Typically, during the live migration of a virtual machine there is a fast final transition when the virtual machine is stopped on the source node and is running on the destination node. However, if something causes the final transition to take longer than
the configured heartbeat threshold settings, the guest cluster considers the node to be down even though the live migration eventually succeeds. If the live migration final transition is completed within the TCP time-out interval (typically around 20 seconds),
clients that are connected through the network to the virtual machine seamlessly reconnect.
To make the cluster heartbeat time-out more consistent with the TCP time-out interval, you can change the
SameSubnetThreshold and CrossSubnetThreshold cluster properties from the default of 5 seconds to 20 seconds. By default, the cluster sends a heartbeat every 1 second. The threshold specifies how many heartbeats to miss in succession
before the cluster considers the cluster node to be down.
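On Server 2012 R2 these are standard cluster common properties, so the change can be applied from an elevated PowerShell prompt on any cluster node. A minimal sketch (the 20-heartbeat value is the one the article recommends):

```powershell
# Inspect the current heartbeat settings (defaults: 1000 ms delay, 5 missed heartbeats)
Get-Cluster | Format-List SameSubnetDelay, SameSubnetThreshold, CrossSubnetDelay, CrossSubnetThreshold

# Raise both thresholds from 5 to 20 missed heartbeats so that node-down
# detection (~20 s at a 1 s heartbeat interval) lines up with the typical TCP time-out
(Get-Cluster).SameSubnetThreshold = 20
(Get-Cluster).CrossSubnetThreshold = 20
```

Run this inside the guest cluster (the one whose nodes are being live migrated), not on the Hyper-V host cluster.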
After changing both parameters in the failover cluster as described, the error is gone.
Greetings,
torsten -
There is a Hyper-V cluster with 2 nodes. Windows Server 2012 R2 is used as the operating system.
Trying to live migrate test VM from node 1 to node 2 and get error 21502:
Live migration of 'Virtual Machine test' failed.
'Virtual Machine test' failed to fixup network settings. Verify VM settings and update them as necessary.
VM has Network Adapter connected to Virtual switch. This vSwitch has Private network as connection type.
If I set virtual switch property to "Not connected" in Network Adapter settings of VM I get successful migration.
All VM's that are not connected to any private networks (virtual switches with private network connection type) can be live migrated without any issues.
Is there any official reference related to Hyper-V live migration of VMs that have a "private network" connection type?
I can Live Migrate virtual machines with adapters on private switches without error. Aside from having the wrong name, the only way I can get it to fail is if I make the switch on one host use a different QoS minimum mode than the other and
enable QoS on the virtual adapter. Even then I get a different message than what you're getting. I only get that one with differently named switches.
There is a PowerShell cmdlet available to see why a guest won't run on another host.
Here's an example of its usage.
There's a way to use it to get it to Live Migrate.
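The linked examples did not survive in this copy of the thread. The cmdlet being referred to is presumably Compare-VM (an assumption; the post does not name it), which produces a compatibility report explaining why a guest cannot run on a given destination host, and whose report object can then be handed to Move-VM:

```powershell
# Ask Hyper-V why 'test' cannot run on the other node, and inspect the reasons
$report = Compare-VM -Name 'test' -DestinationHost 'HV-Node2'
$report.Incompatibilities | Format-Table Message

# After resolving the reported incompatibilities, the same report object
# can be passed to Move-VM to perform the live migration
Move-VM -CompatibilityReport $report
```

The VM name and host name here are placeholders for the poster's environment.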
But there is no way to truly Live Migrate three virtual machines in perfect lockstep. Even if you figure out whatever is preventing you from migrating these machines, there will still be periods during Live Migration where they can't communicate across that
private network. You also can't guarantee that all these guests will always be running on the same host without preventing Live Migration in the first place. This is why there really isn't anyone doing what you're trying to do. I suggest you consider another
isolation solution, like VLANs.
Eric Siron Altaro Hyper-V Blog
I am an independent blog contributor, not an Altaro employee. I am solely responsible for the content of my posts.
"Every relationship you have is in worse shape than you think." -
Live Migrating Virtual Machines with Shared VHDx
I am facing problems when live migrating a virtual machine that is using a shared VHDX. The virtual machine gets migrated, that is, the configuration gets migrated, but the virtual machine fails to start up; starting it manually fails too.
What is the method to live migrate virtual machines that are using a shared VHDX? Thanks in advance.
Another couple of gotchas:
You cannot do host-level backups of the guest cluster. This is the same as it always was. You will have to install backup agents in the guest cluster nodes and back them up as if they were physical machines.
You cannot perform a hot-resize of the shared VHDX. But you can hot-add more shared VHDX files to the clustered VMs.
You cannot Storage Live Migrate the shared VHDX file. You can move the other VM files and perform normal Live Migration.
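As a sketch of the hot-add mentioned above, on Server 2012 R2 an additional shared VHDX can be attached to each clustered VM with Add-VMHardDiskDrive; the -ShareVirtualDisk switch and the CSV path below are assumptions to verify against your environment:

```powershell
# Hot-add a second shared data disk to both guest cluster nodes;
# the VHDX must live on a CSV (or an SMB3 share) reachable by both hosts
Add-VMHardDiskDrive -VMName 'FS01' -Path 'C:\ClusterStorage\Volume1\Data2.vhdx' -ShareVirtualDisk
Add-VMHardDiskDrive -VMName 'FS02' -Path 'C:\ClusterStorage\Volume1\Data2.vhdx' -ShareVirtualDisk
```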
As long as you have your shared VHDX on an SMB3 share, you can also have the nodes of the guest cluster on different Hyper-V hosts.