HA NFS failover time? [SC3.1 2005Q4]

Just built a test cluster to play with for a project (a V210, a V240 and a 3310). All appears to be working fine and we have a couple of NFS services running on it. One question however :-)
How long should it take to fail over a simple NFS resource group? It's currently taking something like 1 min 45 secs to fail over, and the scswitch command doesn't return for over 4 min 30 secs. Is that normal? (It probably is; I just thought NFS would migrate faster than this for some reason :))
Also, why does the scswitch command take so much longer to return? The service has failed over and started fine, yet it still takes a couple more minutes to return a prompt. Is it waiting for successful probes or something (which I guess makes sense...)?
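For context, the switch was kicked off manually; the exact command line isn't shown here, so treat this as a sketch of the usual SC3.1 invocation rather than a transcript:

  # move the NFS resource group over to the other node
  scswitch -z -g nfs-rg1 -h dev-v240

  # watch resource group / resource state while it moves
  scstat -g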
cheers,
Darren

Failing over from one machine (dev-v210) to the other (dev-v240):
Aug 1 11:31:59 dev-v210 Cluster.RGM.rgmd: [ID 529407 daemon.notice] resource group nfs-rg1 state on node dev-v210 change to RG_PENDING_OFFLINE
Aug 1 11:31:59 dev-v210 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource nfs1-res state on node dev-v210 change to R_MON_STOPPING
Aug 1 11:31:59 dev-v210 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource nfs1-hastorageplus-res state on node dev-v210 change to R_MON_STOPPING
Aug 1 11:31:59 dev-v210 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource nfs1-hafoip-res state on node dev-v210 change to R_MON_STOPPING
Aug 1 11:31:59 dev-v210 Cluster.RGM.rgmd: [ID 707948 daemon.notice] launching method <hafoip_monitor_stop> for resource <nfs1-hafoip-res>, resource group <nfs-rg1>, timeout <300> seconds
Aug 1 11:31:59 dev-v210 Cluster.RGM.rgmd: [ID 707948 daemon.notice] launching method <hastorageplus_monitor_stop> for resource <nfs1-hastorageplus-res>, resource group <nfs-rg1>, timeout <90> seconds
Aug 1 11:31:59 dev-v210 Cluster.RGM.rgmd: [ID 707948 daemon.notice] launching method <nfs_monitor_stop> for resource <nfs1-res>, resource group <nfs-rg1>, timeout <300> seconds
Aug 1 11:31:59 dev-v210 Cluster.RGM.rgmd: [ID 736390 daemon.notice] method <hastorageplus_monitor_stop> completed successfully for resource <nfs1-hastorageplus-res>, resource group <nfs-rg1>, time used: 0% of timeout <90 seconds>
Aug 1 11:31:59 dev-v210 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource nfs1-hastorageplus-res state on node dev-v210 change to R_ONLINE_UNMON
Aug 1 11:32:00 dev-v210 Cluster.RGM.rgmd: [ID 736390 daemon.notice] method <hafoip_monitor_stop> completed successfully for resource <nfs1-hafoip-res>, resource group <nfs-rg1>, time used: 0% of timeout <300 seconds>
Aug 1 11:32:00 dev-v210 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource nfs1-hafoip-res state on node dev-v210 change to R_ONLINE_UNMON
Aug 1 11:32:00 dev-v210 Cluster.RGM.rgmd: [ID 736390 daemon.notice] method <nfs_monitor_stop> completed successfully for resource <nfs1-res>, resource group <nfs-rg1>, time used: 0% of timeout <300 seconds>
Aug 1 11:32:00 dev-v210 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource nfs1-res state on node dev-v210 change to R_ONLINE_UNMON
Aug 1 11:32:00 dev-v210 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource nfs1-res state on node dev-v210 change to R_STOPPING
Aug 1 11:32:00 dev-v210 Cluster.RGM.rgmd: [ID 707948 daemon.notice] launching method <nfs_svc_stop> for resource <nfs1-res>, resource group <nfs-rg1>, timeout <300> seconds
Aug 1 11:32:00 dev-v210 Cluster.RGM.rgmd: [ID 784560 daemon.notice] resource nfs1-res status on node dev-v210 change to R_FM_UNKNOWN
Aug 1 11:32:00 dev-v210 Cluster.RGM.rgmd: [ID 922363 daemon.notice] resource nfs1-res status msg on node dev-v210 change to <Stopping>
Aug 1 11:32:00 dev-v210 SC[SUNW.nfs:3.1,nfs-rg1,nfs1-res,nfs_svc_stop]: [ID 584207 daemon.notice] Stopping nfsd and mountd.
Aug 1 11:32:00 dev-v210 SC[SUNW.nfs:3.1,nfs-rg1,nfs1-res,nfs_svc_stop]: [ID 948424 daemon.notice] Stopping NFS daemon /usr/lib/nfs/mountd.
Aug 1 11:32:00 dev-v210 SC[SUNW.nfs:3.1,nfs-rg1,nfs1-res,nfs_svc_stop]: [ID 948424 daemon.notice] Stopping NFS daemon /usr/lib/nfs/nfsd.
Aug 1 11:32:00 dev-v210 nfssrv: [ID 624069 kern.notice] NOTICE: nfs_server: server is now quiesced; NFSv4 state has been preserved
Aug 1 11:32:00 dev-v210 Cluster.RGM.rgmd: [ID 736390 daemon.notice] method <nfs_svc_stop> completed successfully for resource <nfs1-res>, resource group <nfs-rg1>, time used: 0% of timeout <300 seconds>
Aug 1 11:32:00 dev-v210 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource nfs1-res state on node dev-v210 change to R_STOPPED
Aug 1 11:32:00 dev-v210 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource nfs1-hastorageplus-res state on node dev-v210 change to R_STOPPING
Aug 1 11:32:00 dev-v210 Cluster.RGM.rgmd: [ID 707948 daemon.notice] launching method <hastorageplus_stop> for resource <nfs1-hastorageplus-res>, resource group <nfs-rg1>, timeout <1800> seconds
Aug 1 11:32:00 dev-v210 Cluster.RGM.rgmd: [ID 784560 daemon.notice] resource nfs1-hastorageplus-res status on node dev-v210 change to R_FM_UNKNOWN
Aug 1 11:32:00 dev-v210 Cluster.RGM.rgmd: [ID 922363 daemon.notice] resource nfs1-hastorageplus-res status msg on node dev-v210 change to <Stopping>
Aug 1 11:32:00 dev-v210 Cluster.RGM.rgmd: [ID 736390 daemon.notice] method <hastorageplus_stop> completed successfully for resource <nfs1-hastorageplus-res>, resource group <nfs-rg1>, time used: 0% of timeout <1800 seconds>
Aug 1 11:32:00 dev-v210 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource nfs1-hastorageplus-res state on node dev-v210 change to R_STOPPED
Aug 1 11:32:00 dev-v210 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource nfs1-hafoip-res state on node dev-v210 change to R_STOPPING
Aug 1 11:32:00 dev-v210 Cluster.RGM.rgmd: [ID 784560 daemon.notice] resource nfs1-hafoip-res status on node dev-v210 change to R_FM_UNKNOWN
Aug 1 11:32:00 dev-v210 Cluster.RGM.rgmd: [ID 922363 daemon.notice] resource nfs1-hafoip-res status msg on node dev-v210 change to <Stopping>
Aug 1 11:32:00 dev-v210 Cluster.RGM.rgmd: [ID 707948 daemon.notice] launching method <hafoip_stop> for resource <nfs1-hafoip-res>, resource group <nfs-rg1>, timeout <300> seconds
Aug 1 11:32:00 dev-v210 ip: [ID 678092 kern.notice] TCP_IOC_ABORT_CONN: local = 129.012.020.137:0, remote = 000.000.000.000:0, start = -2, end = 6
Aug 1 11:32:00 dev-v210 ip: [ID 302654 kern.notice] TCP_IOC_ABORT_CONN: aborted 0 connection
Aug 1 11:32:00 dev-v210 Cluster.RGM.rgmd: [ID 784560 daemon.notice] resource nfs1-hafoip-res status on node dev-v210 change to R_FM_OFFLINE
Aug 1 11:32:00 dev-v210 Cluster.RGM.rgmd: [ID 922363 daemon.notice] resource nfs1-hafoip-res status msg on node dev-v210 change to <LogicalHostname offline.>
Aug 1 11:32:00 dev-v210 Cluster.RGM.rgmd: [ID 736390 daemon.notice] method <hafoip_stop> completed successfully for resource <nfs1-hafoip-res>, resource group <nfs-rg1>, time used: 0% of timeout <300 seconds>
Aug 1 11:32:00 dev-v210 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource nfs1-hafoip-res state on node dev-v210 change to R_OFFLINE
Aug 1 11:32:00 dev-v210 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource nfs1-res state on node dev-v210 change to R_POSTNET_STOPPING
Aug 1 11:32:00 dev-v210 Cluster.RGM.rgmd: [ID 707948 daemon.notice] launching method <nfs_postnet_stop> for resource <nfs1-res>, resource group <nfs-rg1>, timeout <300> seconds
Aug 1 11:32:00 dev-v210 SC[SUNW.nfs:3.1,nfs-rg1,nfs1-res,nfs_postnet_stop]: [ID 584207 daemon.notice] Stopping lockd and statd.
Aug 1 11:32:00 dev-v210 SC[SUNW.nfs:3.1,nfs-rg1,nfs1-res,nfs_postnet_stop]: [ID 948424 daemon.notice] Stopping NFS daemon /usr/lib/nfs/lockd.
Aug 1 11:32:01 dev-v210 SC[SUNW.nfs:3.1,nfs-rg1,nfs1-res,nfs_postnet_stop]: [ID 948424 daemon.notice] Stopping NFS daemon /usr/lib/nfs/statd.
Aug 1 11:32:01 dev-v210 SC[SUNW.nfs:3.1,nfs-rg1,nfs1-res,nfs_postnet_stop]: [ID 530938 daemon.notice] Starting NFS daemon /usr/lib/nfs/statd.
Aug 1 11:33:51 dev-v210 SC[SUNW.nfs:3.1,nfs-rg1,nfs1-res,nfs_postnet_stop]: [ID 906922 daemon.notice] Started NFS daemon /usr/lib/nfs/statd.
Aug 1 11:33:51 dev-v210 SC[SUNW.nfs:3.1,nfs-rg1,nfs1-res,nfs_postnet_stop]: [ID 530938 daemon.notice] Starting NFS daemon /usr/lib/nfs/lockd.
Aug 1 11:33:51 dev-v210 SC[SUNW.nfs:3.1,nfs-rg1,nfs1-res,nfs_postnet_stop]: [ID 906922 daemon.notice] Started NFS daemon /usr/lib/nfs/lockd.
Aug 1 11:33:51 dev-v210 SC[SUNW.nfs:3.1,nfs-rg1,nfs1-res,nfs_postnet_stop]: [ID 530938 daemon.notice] Starting NFS daemon /usr/lib/nfs/mountd.
Aug 1 11:33:51 dev-v210 SC[SUNW.nfs:3.1,nfs-rg1,nfs1-res,nfs_postnet_stop]: [ID 906922 daemon.notice] Started NFS daemon /usr/lib/nfs/mountd.
Aug 1 11:33:51 dev-v210 SC[SUNW.nfs:3.1,nfs-rg1,nfs1-res,nfs_postnet_stop]: [ID 530938 daemon.notice] Starting NFS daemon /usr/lib/nfs/nfsd.
Aug 1 11:33:51 dev-v210 nfssrv: [ID 760318 kern.notice] NOTICE: nfs_server: server was previously quiesced; existing NFSv4 state will be re-used
Aug 1 11:33:51 dev-v210 SC[SUNW.nfs:3.1,nfs-rg1,nfs1-res,nfs_postnet_stop]: [ID 906922 daemon.notice] Started NFS daemon /usr/lib/nfs/nfsd.
Aug 1 11:33:51 dev-v210 Cluster.RGM.rgmd: [ID 784560 daemon.notice] resource nfs1-res status on node dev-v210 change to R_FM_OFFLINE
Aug 1 11:33:51 dev-v210 Cluster.RGM.rgmd: [ID 922363 daemon.notice] resource nfs1-res status msg on node dev-v210 change to <Completed successfully.>
Aug 1 11:33:51 dev-v210 Cluster.RGM.rgmd: [ID 736390 daemon.notice] method <nfs_postnet_stop> completed successfully for resource <nfs1-res>, resource group <nfs-rg1>, time used: 36% of timeout <300 seconds>
Aug 1 11:33:51 dev-v210 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource nfs1-res state on node dev-v210 change to R_OFFLINE
The delay seems to come with "Starting NFS daemon /usr/lib/nfs/statd." It appears to stop it and then start it again - and the starting takes a couple of minutes.
When the other node starts it up again we see a similar thing - starting statd takes a couple of minutes.
Other than that it works fine - it feels like statd is blocking on some sort of timeout?...
Would be good to get this failing over faster if possible!
Uname reports "SunOS dev-v210 5.10 Generic_118833-17 sun4u sparc SUNW,Sun-Fire-V210". Not using Veritas VM at all - this is all SVM on these machines.
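If it really is statd stalling, one thing worth checking (a guess on my part - the log doesn't prove it) is whether statd spends that time trying to resolve or contact stale client entries in its monitor directory when it restarts. Rough sketch; the path is an assumption, since HA-NFS keeps its state under the resource's Pathprefix directory rather than /var/statmon:

  # list the clients statd will walk through at start-up
  ls /global/nfs1/SUNW.nfs/statmon/sm

  # check that each recorded client name still resolves quickly
  for h in `ls /global/nfs1/SUNW.nfs/statmon/sm`; do
      echo "$h:"; time getent hosts "$h"
  done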
Darren

Similar Messages

  • Any experience with NFS failover in Sun Cluster?

    Hello,
    I am planning to install a dual-node Sun Cluster for an NFS failover configuration. The SAN storage is shared between the nodes via Fibre Channel. The NFS shares will be manually assigned to nodes and should fail over / be taken back between nodes.
    Is this setup well tested? How do the NFS clients survive the failover (without "stale NFS handle" errors)? Does it work smoothly for Solaris, Linux and FreeBSD clients?
    Please share your experience.
    TIA,
    -- Leon

    My 3-year-old Linux installation on my laptop, which is my NFS client most of the time, uses UDP by default (kernel 2.4.19).
    Anyway, the key is that the NFS client, or better, the RPC implementation on the client, is intelligent enough to detect a failed TCP connection and tries to re-establish it with the same IP address. Once the cluster has failed over the logical IP, the reconnect will be successful and NFS traffic continues as if nothing bad had happened. This only(!) works if the NFS mount was done with the "hard" option. Only this makes the client retry the connection.
    Other "dumb" TCP-based applications might not retry and thus would need manual intervention.
    Regarding UFS or PxFS, it does not make a difference. NFS does not know the difference; it shares a mount point.
    Hope that helped.
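    To make that concrete, a client-side mount along these lines (hostname and paths are made up) gives the retry behaviour described above:

      # "hard" makes the client retry RPCs indefinitely instead of erroring out;
      # "intr" lets a hung process be interrupted if the share never comes back
      mount -t nfs -o hard,intr,tcp nfs-logical-host:/export/data /mnt/data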

  • Failover time.

    hi,
    I have got a problem with failover time.
    My environment:
    One cluster: two WebLogic Server 5.1 SP4 instances running on Sun Solaris. The cluster uses in-memory replication.
    The web server is Apache running on Sun Solaris. The Apache bridge is set up with a weblogic.conf that reads:
      WeblogicCluster 10.2.2.20:7001,10.2.2.21:7001
      ConnectTimeoutSecs 10
      ConnectRetrySecs 5
      StatPath true
      HungServerRecoverSecs 30:100:120
    Everything starts fine. Both WebLogic servers say they join the cluster and the application works fine. When one WebLogic server is forced to shut down, failover takes place fine.
    The problem occurs when the machine that has the first entry in the weblogic.conf file (10.2.2.20) is unplugged from the network: failover then takes three minutes.
    Could someone help me reduce this time? Is there any property that has to be set in weblogic.conf or in the weblogic.properties file?
    Thanks in advance,
    Arun
              

    arunbabu wrote:
    > The problem occurs when the machine that has the first entry in the
    > weblogic.conf file (10.2.2.20) is unplugged from the network: failover
    > then takes three minutes. Could someone help me reduce this time?
    HungServerRecoverSecs:
    This setting takes care of hung or unresponsive servers in the cluster. The
    plug-in waits HungServerRecoverSecs for the server to respond and then declares
    that server dead, failing over to the next server. The minimum value for this
    setting is 10 and the maximum value is 600. The default is 300. It should be
    set to a very large value: if it is less than the time your servlets take to
    process, you will see unexpected results.
    Try reducing HungServerRecoverSecs. But remember, if your application
    processing takes a long time, you will be in trouble, since the plug-in will be
    failing over to other servers in the cluster and you will be thrashing the
    servers.
    Cheers,
    - Prasad
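    As an illustration only (the 60 is made up, not a recommendation), a weblogic.conf tuned along those lines might look like:

      WeblogicCluster        10.2.2.20:7001,10.2.2.21:7001
      ConnectTimeoutSecs     10
      ConnectRetrySecs       5
      # lower than the 300 s default, but still longer than your slowest servlet
      HungServerRecoverSecs  60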
              

  • What are typical failover times for application X on Sun Cluster

    Our company does not yet have any hands-on experience with clustering anything on Solaris, although we do with Veritas and Microsoft. My experience with MS is that it is as close to seamless (instantaneous) as possible. Veritas clustering takes a little bit longer to activate the standbys. A new application we are bringing in house soon runs on Sun Cluster (it is some BEA Tuxedo/WebLogic/Oracle monster). They claim the time it takes to flip from the active node to the standby node is ~30 minutes. This seems a bit insane to us, since they are calling this "HA". Is this type of failover time typical in Sun land? Thanks for any numbers or references.

    This is a hard question to answer because it depends on the cluster agent/application.
    On one hand you may have a simple Sun Cluster application that fails over in seconds because it has to do a limited amount of work (umount here, mount there, plumb network interface, etc) to actually failover.
    On the other hand these operations may, depending on the application, take longer than another application due to the very nature of that application.
    An Apache web server failover may take 10-15 seconds but an Oracle failover may take longer. There are many variables that control what happens from the time that a node failure is detected to the time that an application appears on another cluster node.
    If the failover time is 30 minutes I would ask your vendor why that is exactly.
    Not in a confrontational way but a 'I don't get how this is high availability' since the assumption is that up to 30 minutes could elapse from the time that your application goes down to it coming back on another node.
    A better solution might be a different application vendor (I know, I know) or a scalable application that can run on more than one cluster node at a time.
    The logic with the scalable approach is that if a failover takes 30 minutes or so to complete, failover becomes an expensive operation, so I would rather have my application use multiple nodes at once than eat a 30-minute failover if one node dies in a two-node cluster:
    serverA > 30 minute failover > serverB
    seems to be less desirable than
    serverA, serverB, serverC, etc concurrently providing access to the application so that failover only happens when we get down to a handful of nodes
    Either one is probably more desirable than having an application outage(?)

  • VIP failover time

    I have configured a critical service (ap-kal-pinglist) for the redundant VIP failover; the default freq, maxfail and retry freq are 5, 3, 5, so I think the failover time should be 5 + 5*3*2 = 35 s. But the virtual router's state changed from "master" to "backup" around 5 secs after the connection was lost.
    Can anyone help me understand it?

    Service sw1-up-down connects to the e2 interface, going down in 15 sec.
    Service sw2-up-down connects to the e3 interface, going down in 4 sec?
    JAN 14 02:38:41 5/1 3857 NETMAN-2: Generic:LINK DOWN for e2
    JAN 14 02:39:57 5/1 3858 NETMAN-2: Generic:LINK DOWN for e3
    JAN 14 02:39:57 5/1 3859 VRRP-0: VrrpTx: Failed on Ipv4FindInterface
    JAN 14 02:40:11 5/1 3860 NETMAN-2: Enterprise:Service Transition:sw2-up-down -> down
    JAN 14 02:40:11 5/1 3861 NETMAN-2: Enterprise:Service Transition:sw1-up-down -> down

  • Failover time using BFD

    Hi Champs,
    we have configured BFD in a multihoming scenario with the BGP routing protocol. The timer configuration is bfd interval 100 min_rx 100 multiplier 5.
    Failover from the first ISP to the second takes 30 sec, and failover back from the second ISP to the first takes more than 1 min. Can you suggest a reason for the different failover times, and how can I get an equal failover time from both ISPs? How is convergence time calculated in a BGP + BFD scenario?
    Regards
    V

    Vicky,
    A simple topology diagram would help to better understand the scenario. Do you have both ISPs terminated on the same router or on different routers?
    How many prefixes are you learning? The full internet table or a few prefixes?
    Accordingly, you could consider BGP PIC or best-external to speed up the convergence.
    -Nagendra
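    For the detection part, assuming the timers negotiate as configured, BFD declares the neighbour down after roughly interval x multiplier = 100 ms x 5 = 500 ms. So almost all of the 30 s / 1 min being observed is BGP withdrawing and re-installing prefixes (table size, update processing, PIC/best-external), not BFD detection.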

  • 2540 / RDAC path failover time

    Hi,
    I have a RHEL 5.3 server with two single port HBAs. These connect to a Brocade 300 switch and are zoned to two controllers on a 2540. Each HBA is zoned to see each controller. RDAC is used as the multipathing driver.
    When testing the solution, if I pull the cable from the active path between the HBA and the switch, it takes 60 seconds before the path fails over to the second HBA. No controller failover is taking place on the array - the path already exists through the brocade between the preferred array controller and the second HBA. After 60 seconds disk I/O continues to the original controller.
    Is this normal? Is there a way of reducing the failover time? I had a look at the /etc/mpp.conf variables but there is nothing obvious there that is causing this delay.
    Thanks

    Thanks Hugh,
    I forgot to mention that we were using Qlogic HBAs so our issue was a bit different...
    To resolve our problem: since we had 2x2 FC HBA cards in each server, we needed to configure zoning on the Brocade switch to ensure that each HBA port only saw one of the two array controllers (previously both controllers were visible to each HBA port, which was breaking an RDAC rule). Also, we upgraded the QLogic drivers using qlinstall -i before installing RDAC (the QLogic drivers that come with RHEL 5.3 are pretty old, it seems).
    Anyway, after these changes path failovers were working as expected and our timeout value of 60sec for Oracle ocfs2 cluster was not exceeded.
    We actually ended up having to increase the ocfs2 timeout from 60 to 120 seconds because another test case failed - it was taking more than 60 sec for a controller to fail over (simulated by placing the active controller offline from the Service Advisor). We are not sure whether this time is expected or not... anyway, we have a service request open for this.
    Thanks again,
    Trev

  • Optimize rac failover time?

    I have a 2-node RAC and failover is taking 4 minutes. Please advise some tips/documents/links that show how to optimize the RAC failover time.

    Hi
    Could you provide some more information about what it is you are trying to achieve? I assume you are talking about the time it takes for clients to start connecting to the available instance on the second node; could you clarify this?
    There are SQL*Net parameters that can be set; you can also make shadow connections with the PRECONNECT option in the FAILOVER_MODE section of the tnsnames.ora on the clients.
    Have you set both of your hosts as preferred in the service configuration on the RAC cluster? The impact of a failure will be less, as approximately half of your connections will be unaffected when an instance fails.
    Cheers
    Peter
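    For what it's worth, a client-side TAF entry usually looks something like this (host and service names, retries and delays are placeholders; METHOD = PRECONNECT would pre-establish the shadow connection instead of BASIC):

      RACDB =
        (DESCRIPTION =
          (ADDRESS_LIST =
            (ADDRESS = (PROTOCOL = TCP)(HOST = rac-node1)(PORT = 1521))
            (ADDRESS = (PROTOCOL = TCP)(HOST = rac-node2)(PORT = 1521))
            (LOAD_BALANCE = yes)
            (FAILOVER = on)
          )
          (CONNECT_DATA =
            (SERVICE_NAME = racdb)
            (FAILOVER_MODE =
              (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 20)(DELAY = 5)
            )
          )
        )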

  • Fwsm failover times in real crash

    Hi,
    I have got two Cat6k VSS chassis and two FWSM service modules.
    How fast will the FWSM switch over to the backup firewall after the active FW crashes or loses power?
    Sent from Cisco Technical Support iPad App

    Hi,
    The initial 15 seconds detection time can be reduced to 3 seconds, by tuning failover polltime and holdtime to the following:
    "failover polltime unit 1 holdtime 3"
    Also keep in mind that after a switchover the new active unit will establish neighbor relations with the neighboring routers. At no point does the standby participate in the OSPF process, so in short the new active unit has to re-establish adjacencies.
    Hope that helps.
    Thanks,
    Varun

  • RAC failover time problem!

    I am trying TAF (transparent application failover) on RAC (9.0.1.3 and 9.2.0.1) and I have the same problem. When I test with "shutdown abort" the failover is fast (about 5-7 sec). When I turn off the node, the failover works fine, but it takes too much time (about 3 minutes). Is there any parameter (TCP/IP or Oracle Net timeout, or a keepalive parameter) that helps?
    Thanks: Robert Gasz
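    One thing that often explains the "shutdown abort is fast, power-off is slow" pattern: when the node dies silently there is no TCP reset, so clients sit on dead connections until the OS gives up on them. If the clients are Linux, the relevant kernel knobs look roughly like this (the values are only an example, and they only help if the Oracle client enables keepalive on its sockets, e.g. with (ENABLE=BROKEN) in the connect descriptor):

      # idle time before the first keepalive probe (seconds)
      sysctl -w net.ipv4.tcp_keepalive_time=60
      # interval between probes, and failed probes before the connection is declared dead
      sysctl -w net.ipv4.tcp_keepalive_intvl=10
      sysctl -w net.ipv4.tcp_keepalive_probes=3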

    Can you confirm that you are able to set up RAC with 9.2.0 on Linux?
    Did you use the files downloadable from technet?
    I have problems joining the cluster from the second node (either one) when
    the first is up (either one).
    I didn't have this problem with the 9.0.1.x version.
    Which release of Linux are you using?
    Are you using raw devices, LVM, or something else for your raw partitions?
    Thanks in advance.
    Bye,
    Gianluca

  • FWSM Failover times

    Hi Folks
    I have two 6509s with FWSMs in them. They are configured in active/standby failover with default values.
    The 6500s are OSPF routers as well. Everything is redundant: HSRP, FWSM, etc.
    When we reboot one of the 6500s it takes approximately 45 seconds for the standby FWSM to become active.
    Is this normal? can the time be shortened?
    any comments appreciated.

    Hi,
    The initial 15 seconds detection time can be reduced to 3 seconds, by tuning failover polltime and holdtime to the following:
    "failover polltime unit 1 holdtime 3"
    Also keep in mind that after a switchover the new active unit will establish neighbor relations with the neighboring routers. At no point does the standby participate in the OSPF process, so in short the new active unit has to re-establish adjacencies.
    Hope that helps.
    Thanks,
    Varun

  • RAC Active Active cluster failover time

    Hi,
    In a RAC active-active cluster, how long does it take to fail over to the surviving instance?
    As per the documentation, I understand that rollback is done just for the SELECT statements and not others. Is that correct?

    RAC is an active-active cluster situation by design.
    A failover from a session from a stopped/crashed instance to a surviving one can be implemented in several ways.
    The most common way to do failover is using TAF, Transparent Application Failover, which is implemented on the client (using settings in the tnsnames.ora file)
    When an instance of a RAC cluster crashes, the surviving instances (actually the voted master instance) will detect that an instance has crashed and recover the crashed instance using its online redo log files. Current transactions in that instance will be rolled back. The time it takes depends on the activity in the database, and thus on the amount there is to recover.

  • OS X Lion, NFS shares, Time Machine?

    A few years ago I bought a Mac Mini server. I wanted to use it as a storage server using attached drives. I got everything up and running but kept running into the same problem: the NFS server daemon would stop / crash for no apparent reason when I was writing large files to the server over NFS.
    I spent some time troubleshooting the issue but never resolved it, and eventually resorted to wiping out OS X on the Mac Mini server and installing CentOS Linux instead. The NFS daemon there is rock solid and I have used it ever since.
    Fast forward to today, where my Time Capsule died due to the usual power supply failures (thanks to Apple for a crappy design). I then realized that I could use my Mac Mini Server as a host for Time Machine; I'd just need to get it running OS X Server again. I would still want to serve writeable NFS shares, so this leads to my question:
    Does anyone know for sure whether OS X Mountain Lion has had any improvements to the NFS daemon over previous versions? I'd be quite happy to buy the new OS, but I'd prefer to be a little more confident that all the work would pay off (in the form of a stable NFS service).

    The Server.app and Server Admin utilities for Lion no longer let you configure NFS. However the NFS software is still there and in fact was significantly upgraded and now supports NFS v4.
    See http://support.apple.com/kb/HT4695
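    If it helps, the command-line side is all still there even without the GUI. A quick way to poke at it (the share path and network below are just examples):

      # define a share and check the syntax of /etc/exports
      echo "/Volumes/Data -alldirs -network 192.168.1.0 -mask 255.255.255.0" | sudo tee -a /etc/exports
      sudo nfsd checkexports

      # enable/start the NFS server and confirm it is running
      sudo nfsd enable
      sudo nfsd status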

  • Eigrp Failover Time

    hello Friends
    I have a problem with EIGRP failover. I have two branch routers, MPLS-R1 and MW-R2 (running HSRP between them), configured as stubs and connected to the local LAN (with 4 subinterfaces), with WAN connections to the HO ASR (MPLS-R1 is 4 MB and preferred, MW-Satellite-R2 is 2 MB). All my traffic goes through MPLS-R1. I have EIGRP running on them.
    Now when I shut down the LAN interface on MPLS-R1, MW-R2 takes over as ACTIVE HSRP for all VLANs, but traffic from the HO ASR takes around 15 to 20 timeouts (30 to 40 seconds) to switch over to the MW-R2 route. The normal time should be 15 secs.
    Kindly help.

    It depends where the failure is. If they are point-to-point links, then if one end fails the other end should go down as well. However, if these are provisioned by a provider they may well appear point-to-point, but they may go through provider switches, in which case if one end fails the other might still think it is up.
    The only way to test if to shut one end down and see if the other end goes down as well.
    Let's assume they do behave like that and you are using HSRP. You are also tracking the status of the point-to-point link within HSRP:
    1) if the link itself fails then both routers should see the WAN interface go down, and because you are tracking with HSRP both routers fail over to the other HSRP router, so it all works
    2) if the WAN interface on either router fails, the other router's WAN interface should go down, and again because of HSRP tracking both routers fail over.
    3) if the active router at either end of the link fails then again the other router's WAN interface should go down and both routers fail over.
    but -
    4) if the LAN interface on one of the active routers fails then it fails over to the other router. But the active router at the other end does not fail over, because its WAN link is still up: it was the LAN interface that failed on the router, not the WAN interface.
    All of the above, as I say, depends on whether those links act as true point-to-point links. If they don't, then you definitely can't rely on HSRP with tracking for any failover.
    So it depends. And if you wanted to be sure of failing over in all scenarios then you may need additional configuration.
    Jon
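    The tracking referred to in scenarios 1-3 looks roughly like this (interface names, group numbers and the priority decrement are made up for illustration):

      interface GigabitEthernet0/0.10
       standby 10 ip 10.10.10.1
       standby 10 priority 110
       standby 10 preempt
       ! drop priority if the WAN link goes down so the peer takes over
       standby 10 track Serial0/0 20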

  • Lowering OD failover times

    Hello,
    I have around 15 labs of Macs that I look after. They are bound to Active Directory for user accounts and Open Directory for computer settings, as per Mike Bombich's guide on his website.
    On the whole this works great. I have an OD-Master and an OD-Replica for failover purposes; there is a 100 Mbit link between them even though they are 8 miles apart.
    Last week, to test the failover, I decided to turn off the OD-Master and rebooted a lab of 20 Macs. The Macs took between 15 and 40 mins to come back up to a working state.
    My question in a nutshell..... Is there any way I can lower this time?

    It could be as simple as a DNS issue.
    You could also try to add the second server in Directory Access manually. It may be that the clients are searching for that second OD server for that long.
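    A quick way to rule the DNS angle in or out (server names are placeholders):

      # both OD servers should resolve, forward and reverse, from a lab client
      host odmaster.example.edu
      host odreplica.example.edu

      # on each OD server, check its own hostname/DNS configuration
      sudo changeip -checkhostname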
