UCCX Purposely Prevent Fail-over

Hi.  I was wondering if shutting down the engine on a secondary server would be enough to prevent fail-over in an HA environment.
Basically, we had a TAC case open because the servers were failing over and then back for no apparent reason.  What we found was that the two servers were losing heartbeat to each other, so the secondary server was trying to take control.  This then caused all of our agents to fail over, but calls could get lost because the primary server was actually still fully functional.  That led us to another TAC case about an error on a router near the secondary server that was causing the loss of heartbeat.  Problem is, that router cannot come down for some time and is due to be replaced at the end of the year.
So now, maybe not entirely to my liking, the plan is to try running just the primary; if worst comes to worst, we can start the secondary back up again, and I am curious what the best procedure for that would be.  The hope is that this would at least stop the random fail-overs, even if it doesn't actually address the real issue.

I have to rely on another guy for the router, switch and UCM side of things. He hasn't said exactly what the error message is, but he called TAC and it is supposedly only cosmetic, and a reboot of the router would clear it.   Unfortunately, given where that router sits, it will not be brought down until a maintenance window at the end of the year.
At any rate, the UCCX servers have been ruled out: we have had multiple TAC tickets, first for UCCX and then for UCM, and both teams have been pointing to a network issue. Shutting down the secondary server does not make that issue go away, mainly because we also have a CM publisher and subscriber on the same network.

Similar Messages

  • How do I prevent fail over

    This is actually a two part question.
    First - I need to upgrade a wireless controller 4402 and I need to update the boot loader and software. Can I do both at the same time?
    Second - How can I prevent the APs from failing over when I reboot, without having to go into each AP and remove the secondary controller?

    Specific to the AP failover: why don't you deploy AP fallback? When a controller goes offline for
    any reason, the APs join other controllers. HOWEVER, when the controller comes back online they will fall back to the controllers
    you want with NO intervention from you.
    FYI:
    Note: When an access point’s primary controller comes back online, the access point disassociates from the backup controller and reconnects to its primary controller. The access point falls back to its primary controller and not to any secondary controller for which it is configured. For example, if an access point is configured with primary, secondary, and tertiary controllers, it fails over to the tertiary controller when the primary and secondary controllers become unresponsive and waits for the primary controller to come back online so that it can fall back to the primary controller. The access point does not fall back from the tertiary controller to the secondary controller if the secondary controller comes back online; it stays connected to the tertiary controller until the primary controller comes back up.

  • Failed over to an Async Replica and now the previous primary replica (now secondary) is in NOT SYNC state

    Hello All,
    Here is my situation :
    3 nodes in an AG configuration, and it's a multi-site cluster: sync commit between 2 nodes in one DC and async commit to a node in the DR DC.
    The AG is failed over to the async replica, which is at the DR site; all the databases come up fine and the application can also connect using the listener.
    When I checked the state of the secondary databases, they are in NOT SYNC mode and data movement is suspended automatically. I can resume data movement to fix the problem, but I was curious why they end up in NOT SYNC mode?
    Thanks in advance.
    Thank you,
    Anup

    Hello Anup,
    The reason this happens is the forced failover that has to be used when moving to an async replica. It causes all other replicas to become suspended, because it is never known whether data loss will occur or not.
    It might not make sense right now, but think about a situation where the databases are not synchronized and failover is forced (it has to work in all situations). There may be a good bit of data on the primary replica that has not yet made it (or has only partially made it) to the async secondary. It wouldn't make sense to negotiate the old primary back down (after all, the new primary is the async one) and undo valid transactions. Suspending also allows a database snapshot or other method to be used on the old sync primary, which could be used for DR purposes to get those valid transactions and data out.
    BOL Doc:
    http://msdn.microsoft.com/en-us/library/hh213151.aspx#ForcedFailover
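    If it helps, resuming the suspended data movement is a one-liner per database, run on the replica where the database shows as suspended (the database name below is a placeholder):
    -- Resume suspended data movement after a forced failover
    ALTER DATABASE [YourAvailabilityDatabase] SET HADR RESUME;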
    Sean Gallardy | Blog | Twitter

  • After adding 2nd WiSM and failing over AP's some apps don't work

    We have a dual core made up of two 6513s. In 6513#1 we have WiSM#1, which we have had for some time now. We have added a second WiSM in 6513#2 for redundancy purposes, and we are also going to re-configure the WiSM in 6513#1 to more closely match the new WiSM in 6513#2. We installed the new WiSM and failed over the APs from 6513#1 so we can re-configure its WiSM. The failover went great with no issues, except that a web application or two didn't function from wireless clients and users were having issues getting to some mapped drives. The only difference between the new WiSM config and the old WiSM is that on the old WiSM the APs were in the same VLAN as the controller management interfaces; on the new WiSM, the controllers' AP management interface IP addresses are in a different VLAN from the APs, which we did based on Cisco best practices. If we revert the APs back to the original WiSM/controllers, where they are on the same VLAN/subnet, the applications and shares that were having issues the other way work. We have placed a call with Cisco TAC and they say our configs look good; we even sent them some packet captures and they said everything looks normal. The wireless clients can ping and resolve the server hosting the application database just fine.
    Thanks

    We did create the mobility groups, and we are using DHCP option 43. The APs find the second WiSM just fine, associate to the controllers, and all the WLANs work. The only issue is that after the APs are on the new WiSM and controllers, there are an application or two having trouble locating their database server, and some shares are not working. Again, the only difference in this new setup is that the APs are now on a different subnet/VLAN from the controller management addresses, whereas before they were in the same subnet/VLAN and the application and shares worked fine. It's almost like a routing issue?
    Thanks

  • Physical standby database fail-over

    Hi,
    I am working on Oracle 10.2.0.3 on Solaris SPARC 64-bit.
    I have a Data Guard configuration with a single physical standby database that uses real-time apply. We had a major application upgrade yesterday, and before the start of the upgrade we cancelled the media recovery and deferred log_archive_dest_n so that it would not ship the archive logs to the standby site. We left the Data Guard configuration in this mode in case of a rollback.
    Primary:
    alter system set log_archive_dest_state_2='DEFER';
    alter system switch logfile;
    Standby:
    alter database recover managed standby database cancel;
    Due to application-upgrade-induced problems we had to fail over to the physical standby, which had not been in sync with the primary since yesterday. I used the following method to fail over, since I did not want to apply any redo from yesterday.
    Standby:
    alter database activate physical standby database;
    alter database open;
    shutdown immediate;
    startup
    So, after this step, the database was a standalone database with no standby databases yet (it still has the log_archive_config and log_archive_dest_n parameters set, but I have DEFERred the log_archive_dest_n pointing to the old primary). I have even changed the archive log deletion policy to NONE:
    RMAN> configure archivelog deletion policy to none;
    After the fail-over completed, the log sequence restarted from sequence 1. We cleared the FRA to make space for the new archive logs and kicked off a FULL database backup (backup incremental level 0 database plus archivelog delete input). The backup succeeded, but we got alerts in the backup log that RMAN could not delete the archive logs:
    RMAN-08137: WARNING: archive log not deleted as it is still needed
    My questions here are:
    1) Even though I have disabled the log_archive_dest_n parameters, why is RMAN not able to delete the archive logs after the backup when there is no standby database for this failed-over database?
    2) Are all the old backups marked unusable after a fail-over is performed?
    FYI... flashback database was not used in this case as it did not serve our purpose.
    Any information or documentation links would be greatly appreciated.
    Thanks,
    Harris.

    Thanks for the reply.
    FINISH FORCE works in some cases, but if there is an archive gap (though none was reported in our case) it might not work (DOC ID 846087.1). So we followed the switch-over and fail-over best practices, which mention this ACTIVATE PHYSICAL STANDBY method for a fail-over when you do not intend to apply any archive logs. The process we followed is the right one.
    Anyhow, we got the issue resolved. Below is the resolution path.
    1) Even if you DEFER the LOG_ARCHIVE_DEST_STATE_N parameters on the primary, there are situations where the primary database in a Data Guard configuration will not delete the archive logs because of SCN issues. This may or may not arise in a given fail-over scenario. If it does, then do the following checks.
    Follow DOC ID 803635.1, which describes a PL/SQL procedure to check for problematic SCNs in a Data Guard configuration even when the physical standby databases are no longer available (i.e., the Data Guard parameters log_archive_config and log_archive_dest_n='SERVICE=...' are still set, even though the corresponding LOG_ARCHIVE_DEST_STATE_N parameters are DEFERRED).
    If this procedure returns any rows, then the primary database cannot delete the archive logs because it still thinks there is a standby database and keeps the archive logs because of the SCN conflict.
    So the best thing to do is remove the DG-related parameters from the spfile (log_archive_config and the log_archive_dest_n parameters).
    After I made these changes, I ran a test backup using "backup archivelog all delete input", and the archive logs were deleted after the backup without any issues.
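    For the record, the cleanup amounted to something like the following (destination 2 is only an example; reset whichever log_archive_dest_n pointed to the old primary, and note these are spfile changes that take effect at the next restart):
    -- Remove the leftover Data Guard parameters from the spfile
    alter system reset log_archive_config scope=spfile sid='*';
    alter system reset log_archive_dest_2 scope=spfile sid='*';
    alter system reset log_archive_dest_state_2 scope=spfile sid='*';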
    Thanks,
    Harris.

  • Why sever-side state saving doesn't support fail over?

    Hi all,
    In my previous thread "ADF server-side state saving method" Frank said that it doesn't support fail over.
    Re: ADF server-side state saving method
    My customer is wondering the reason.
    If anyone has a clear statement about it, could you share it?
    Any help will be much appreciated.
    Atsushi

    Timo,
    As I wrote in my previous thread, my customer adopted a multi-window application design, not knowing that it frequently causes ViewExpiredException.
    Now I'm looking for the best settings to avoid the exception and need an ADF guru's help.
    Frank said that ADF is built on Sun's RI, and the state-saving parameters of Mojarra seem to work correctly in my environment. However, none of the ADF docs clearly describe the behavior of server-side state saving. When I set the state-saving method to "server", view states are managed per logical view (≒ window), which seems better for a multi-window application than client-token based state management from the perspective of preventing ViewExpiredException.
    Because fail over is not a requirement for them, they might adopt server-side state saving if we can make sure it doesn't have other side effects.
    So I'd like to know in more detail about the behavior.
    Thanks,
    Atsushi

  • Is there a way to config WLS to fail over from a primary RAC cluster to a DR RAC cluster?

    Here's the situation:
    We have two Oracle RAC clusters, one in a primary site, and the other in a DR site
    Although they run active/active using some sort of replication (Oracle Streams? not sure), we are being asked to use only the one currently being used as the primary to prevent latency & conflict issues
    We are using this only for read-only queries.
    We are not concerned with XA
    We're using WebLogic 10.3.5 with MultiDatasources, using the Oracle Thin driver (non-XA for this use case) for instances
    I know how to set up MultiDatasources for an individual RAC cluster, and I have been doing that for years.
    Question:
    Is there a way to configure MultiDatasources (mDS) in WebLogic to allow for automatic failover between the two clusters, or does the app have to be coded to fail over from an mDS that's not working to one that's working (with preference to the currently labelled "primary" site)?
    Note:
    We still want to have load balancing across the current "primary" cluster's members
    Is there a "best practice" here?

    Hi Steve,
    There are 2 ways to connect WLS to an Oracle RAC:
    1. Use the Oracle RAC service URL, which contains the details of all the RAC nodes and the respective IP addresses and DNS names.
    2. Connect to the primary cluster as you are currently doing and use an MDS to load-balance/fail over between multiple nodes in the primary RAC (if applicable).
        In case of a primary RAC failure and a switch to the DR RAC nodes, use WLST scripts to change the connection URL and restart the application to remove any old connections.
        Such DB fail-over tests can be conducted in a test/reference environment to work out the required log monitoring and subsequent steps, and to measure the timelines.
    Thanks,
    Souvik.

  • Why is DML not failed over in TAF??

    Hi,
    I have an OLTP application running on a 2 node 10gR2 RAC (10.2.0.3) on AIX 5.3L ML 8. I have configured TAF here for SESSION failover. I would like to know two things from you all:
    1) Since each instance is able to read the other instance's undo tablespace data and redo logs, why is TAF not able to fail over DML transactions?
    2) As of now, is there any way to fail over DML other than catching the error thrown back to the application and re-executing the statement? Is it possible in 11gR1?
    I am grateful to you all for sparing your valuable time to answer this.
    Thanks and Regards,
    Vijay Shanker

    Re: Failover DML on RAC
    The reason is transaction processing and its implications.
    Imagine that you updated a row, then waited idly, then some other session wanted that same row and waited for you to either rollback or commit.
    You failed.
    Automatically, Oracle will rollback your transaction and release all your locks.
    What should the other session do: wait to see that maybe you have TAF or FCF and will reconnect and rerun your uncommitted DML, or should it proceed with its own work?
    Failed session rollback currently happens regardless of whether you or anybody else have TAF, FCF, or even whether you have RAC.
    But in order for you to be able to replay your DML safely after reconnect, that transaction rollback had to be prevented, and your new failed over session should magically re-attach to the failed session's transaction.
    Maybe some day Oracle will implement something like that, but it's not easy, and Oracle leaves it up to the application to decide what to do (TAF-specific error codes).
    On the other hand, replaying selects is fairly easy: re-executing the query (with scn as of the originally failed cursor to ensure read-consistency) and re-fetching up to the point of last fetch.
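    As a footnote, you can check from SQL*Plus whether TAF is in effect for a session and whether it has already failed over; a quick sanity query (nothing installation-specific assumed):
    -- Show TAF settings and failover status for connected sessions
    select username, failover_type, failover_method, failed_over
    from v$session
    where username is not null;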

  • Firewall keeps failing over when IPS fails

    Is there a way to prevent the firewall from failing over if the IPS fails? I do not have it selected as a failover criterion, but I've been having some issues with the IPS module and the firewall keeps failing over.

    Hello Matt,
    There is an enhancement request for this:
    http://tools.cisco.com/Support/BugToolKit/search/getBugDetails.do?method=fetchBugDetails&bugId=CSCsm81086
    But there isn't an ETA yet. You can save the bug to get updates.
    Regards,
    Felipe
    Security Team.

  • /WS fail-over

    Hello everyone,
    I'm having some difficulties working with /WS fail-over, or rather with /WS fail-back.
    In my Tuxedo 8.0RP133 /WS client I define a WSNADDR containing two destinations,
    something like
    WSNADDR=//host:port1,//host:port2
    My destination is a TUXEDO 6.5 application with two WSL servers.
    In my client I trap tpcall() errors and for some errors (TPESYSTEM, for example)
    I assume that an idle timeout or some temporary error has occurred, call tpterm(),
    tpinit() and retry tpcall() again.
    This sort of works, most of the time... But it seems that the fail-over (from
    host:port1 to host:port2) is one-way only. If I shut down the first WSL the client
    fails-over to the second one quite nicely. If I then restart the first WSL and
    shut down the second one, the client fails. As far as I can understand, once the
    client process has started using the second address in WSNADDR there is no turning
    back to the first one again.
    Is this the way it is supposed to be? Have I misinterpreted the syntax for WSNADDR?
    Or is this simply a case for BEA support?
    Best regards,
    /Per

    Thanks for your input, Amit.
    I was probably only using the second WSL all the time. Seems like an encryption
    settings problem prevented the first one from ever being useful for me...
    /Per
    "Amit" <[email protected]> wrote:
    >
    The WSL Connection to the Tuxedo Server is decided at the time of tpinit.
    So if
    you have specified 2 address for tpinit, the workstation tries to connect
    to the
    first ip address specified and if it is not successful, it tries to connect
    to
    the second ip address...
    this process is done every time you do a tpinit.
    So i guess, if you do a tpinit after bringing back the 1st IP Address
    WSL, you
    should be able connect ..
    I hope this helps.
    -Amit
    "Per Lindström" <[email protected]> wrote:
    Hello everyone,
    I'm having some difficulties working with /WS fail-over, or rather with
    /WS fail-back.
    In my Tuxedo 8.0RP133 /WS client I define a WSNADDR containing two destinations,
    something like
    WSNADDR=//host:port1,//host:port2
    My destination is a TUXEDO 6.5 application with two WSL servers.
    In my client I trap tpcall() errors and for some errors (TPESYSTEM,for
    example)
    I assume that an idle timeout or some temporary error has occurred,call
    tpterm(),
    tpinit() and retry tpcall() again.
    This sort of works, most of the time... But it seems that the fail-over
    (from
    host:port1 to host:port2) is one-way only. If I shut down the firstWSL
    the client
    fails-over to the second one quite nicely. If I then restart the first
    WSL and
    shut down the second one, the client fails. As far as I can understand,
    once the
    client process has started using the second address in WSNADDR there
    is no turning
    back to the first one again.
    Is this the way it is supposed to be? Have I misinterpreted the syntax
    for WSNADDR?
    Or is this simply a case for BEA support?
    Best regards,
    /Per

  • SQL Server 2014 AlwaysOn HA takes 8-14 seconds to fail over; application-side timeouts occur

    Hi All,
    I have a very similar post in the SQL Server 2014 forums too (https://social.technet.microsoft.com/Forums/sqlserver/en-US/adb5e338-907e-4405-aa62-d3ea93c7a98a/sql-server-2014-always-on-ha-takes-814-seconds-to-fail-over-application-side-timeouts-occur?forum=sqldisasterrecovery) -
    advice in the end was to post a question here.
    SQL Server nodes: 2014 (12.0.2480.0)
    1 file share witness (on a separate subnet)
    1 Cluster
    1 Listener
    I have been testing the response time to failovers – both manual (right-click, fail over in SSMS) and automatic (shut down the primary host). The way I am testing response is to have an SSMS query running on my desktop, connected to the listener and querying a small table, and hit execute.
    The query response time, from execute to receiving the result, has been between 8 and 14 seconds in my testing. My previous experience (in a separate environment) showed around 2 second fail-over times in a very similar configuration.
    The availability DB is 200 MB and is not actively used. The nodes are synchronised.
    SQL Server Hosts: Windows 2012, 2 cpu, 8gb RAM.
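    To make the testing concrete, the probe run from SSMS against the listener is roughly the following (the table name is made up; the DMV select is just a convenient way to watch replica state):
    -- Small query against the availability database (hypothetical table)
    SELECT TOP (1) * FROM dbo.SmallTestTable;
    -- Watch synchronization state of the availability databases
    SELECT database_id, synchronization_state_desc, synchronization_health_desc
    FROM sys.dm_hadr_database_replica_states;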
    Questions:
    1: It’s a big question, but what should I expect for a ‘normal’ fail-over time? Keep in mind this scenario is about as simple as it gets.
    2: As it stands, an 8 to 14 second ‘outage’ could cause some applications to time out. Or am I being unreasonable? I am seeing the very simple query in SSMS time out with this:
    Msg 983, Level 14, State 1, Line 2
    Unable to access availability database 'DATABASE' because the database replica is not in the PRIMARY or SECONDARY role. Connections to
    an availability database is permitted only when the database replica is in the PRIMARY or SECONDARY role. Try the operation again later.
    Cluster logs are long - this section accounts for 8 seconds of the 11 second outage I experienced. I can supply the full log if required. Also, this log covers just the 2 cluster nodes; I removed the witness share to keep things as simple as possible.
    00001090.00002128::2015/02/25-03:05:08.255 INFO  [GEM] Node 2: Deleting [1:65 , 1:71] (both included) as it has been ack'd by every node
    00001ee4.00002130::2015/02/25-03:05:10.107 INFO  [RES] Network Name: Agent: Sending request Netname/RecheckConfig to NN:5b81e7bd-58fe-4be9-a68a-c48ba2aa552b:Netbios
    00001090.00002128::2015/02/25-03:05:11.888 INFO  [GEM] Node 2: Deleting [1:72 , 1:73] (both included) as it has been ack'd by every node
    00001090.00002698::2015/02/25-03:05:11.889 INFO  [GUM] Node 2: Processing RequestLock 2:49
    00001090.00002128::2015/02/25-03:05:11.890 INFO  [GUM] Node 2: Processing GrantLock to 2 (sent by 1 gumid: 67)
    00001090.00002698::2015/02/25-03:05:11.890 INFO  [GUM] Node 2: executing request locally, gumId:68, my action: /dm/update, # of updates: 1
    00001090.00002128::2015/02/25-03:05:12.890 INFO  [GEM] Node 2: Deleting [1:74 , 1:74] (both included) as it has been ack'd by every node
    00001ee4.00002130::2015/02/25-03:05:15.107 INFO  [RES] Network Name: Agent: Sending request Netname/RecheckConfig to NN:5b81e7bd-58fe-4be9-a68a-c48ba2aa552b:Netbios
    00001090.00002128::2015/02/25-03:05:16.988 INFO  [GUM] Node 2: Processing RequestLock 1:28
    Thanks in advance.
    Keegan

    Hi Keegan,
    From these event logs, what I can see is that the "Sending request Netname" steps are where the time went.
    Could you please tell us the network configuration of the cluster nodes?
    If I recall correctly, it is recommended to keep only the TCP/IP protocol and disable NetBIOS over TCP/IP on the "Private Network"; also, do not configure DNS/WINS or a default gateway for the "Private Network":
    https://support.microsoft.com/kb/258750?wa=wsignin1.0
    After that, please test again.
    Best Regards,
    Elton JI

  • Front End pool failed over

    Hi all,
    1. I set up a pool with three Front End servers (the pool FQDN is pool.site1.sip96x2.com and it points to the IP addresses of the three Front End servers). Everything works fine, but when I disable the network interface on FE1 and FE2, the Lync clients are disconnected.
    I don't clearly understand how Lync clients fail over within a pool; please clarify.
    2. I have two central sites (a Root site and a Primary site, with different domains: sip96x2.com and site1.sip96x2.com). The dial-in simple URL points to the Front End server at the Root site. So if the link between the Root site and the Primary site is down, how can the users at the Primary site reach the dial-in URL?
    3. When building the topology for the Front End pool, I checked "Override FQDN internal web service" and set the FQDN to "poolint.site1.sip96x2.com". I created three A records for "poolint.site1.sip96x2.com" pointing to the three IP addresses of the Front End servers. Is that right?
    Thanks so much!

    Ah ok. Well, first thing, if I am reading this correctly: pool pairing Standard with Enterprise is not supported. You should only pair Standard with Standard and Enterprise with Enterprise (even though Topology Builder won't stop you). Take a look here for supported scenarios: http://technet.microsoft.com/en-us/library/jj204697.aspx
    To deal with the simple URLs in the event of a failover, you need to add them using PowerShell. Take a look at this article, which explains and gives an example: http://blogs.perficient.com/microsoft/2012/01/configuring-simple-urls-for-multiple-lync-pools/
    Georg Thomas | Lync MVP
    Blog www.lynced.com.au | Twitter
    @georgathomas
    Lync Edge Port Check (Beta)

  • How does a Front End pool deal with fail-over to keep user state?

    Hello to all. I searched a lot of articles to understand how Lync 2010 keeps user state if a failure happens on a Front End pool node, but didn't find anything clear.
    I found some MS information about this topic: "The Front End Servers maintain transient information—such as logged-on state and control information for an IM, Web, or audio/video (A/V) conference—only for the duration of a user’s session. This configuration is an advantage because in the event of a Front End Server failure, the clients connected to that server can quickly reconnect to another Front End Server that belongs to the same Front End pool."
    As I read it, the client uses DNS to reconnect to another Front End in the pool. When it reconnects to an available server, does the user lose what he/she was doing in the Lync client? Can the server that is now hosting the session recover all of the user's session data? If so, how?
    Regards, EEOC.

    The presence information and other dynamic user data is stored in the RTCDYN database on the backend SQL database in a 2010 pool:
    http://blog.insidelync.com/2011/04/the-lync-server-databases/  If you fail over to another pool member, this pool member has access to the same data.
    Ongoing conversations and the like are cached at the workstation.
    SWC Unified Communications

  • Is it possible to add Hyper-V fail-over clustering afterwards?

    Hi,
    We are currently testing Windows Server 2012 R2 Hyper-V with only one standalone host (no failover clustering) and a few virtual machines. Is it possible to add failover clustering afterwards, add a second Hyper-V node and a shared disk, and move the virtual machines there, or do we have to install both nodes from scratch?
    ~ Jukka ~

    Hi Jukka,
    In addition, before you build a Hyper-V failover cluster, please review the requirements in the article below:
    http://technet.microsoft.com/en-us/library/jj863389.aspx
    Best Regards
    Elton Ji

  • OCR and voting disks on ASM, problems in case of fail-over instances

    Hi everybody
    in case at your site you :
    - have an 11.2 fail-over cluster using Grid Infrastructure (CRS, OCR, voting disks),
    where you have yourself created additional CRS resources to handle single-node db instances,
    their listener, their disks and so on (which are started only on one node at a time,
    can fail from that node and restart to another);
    - have put OCR and voting disks into an ASM diskgroup (as strongly suggested by Oracle);
    then you might have problems (as we had) because you might:
    - reach max number of diskgroups handled by an ASM instance (63 only, above which you get ORA-15068);
    - experience delays (especially in case of multipath), find fake CRS resources, etc.,
    whenever you dismount disks from one node and mount them on another;
    So (if both conditions are true) you might be interested in this story,
    then please keep reading on for the boring details.
    One step backward (I'll try to keep it simple).
    Oracle Grid Infrastructure is mainly used by RAC db instances,
    which means that any db you create usually has one instance started on each node,
    and all instances access read / write the same disks from each node.
    So, ASM instance on each node will mount diskgroups in Shared Mode,
    because the same diskgroups are mounted also by other ASM instances on the other nodes.
    ASM instances have a spfile parameter CLUSTER_DATABASE=true (and this parameter implies
    that every diskgroup is mounted in Shared Mode, among other things).
    In this context, it is quite obvious that Oracle strongly recommends to put OCR and voting disks
    inside ASM: this (usually called CRS_DATA) will become diskgroup number 1
    and ASM instances will mount it before CRS starts.
    Then, additional diskgroup will be added by users, for DATA, REDO, FRA etc of each RAC db,
    and will be mounted later when a RAC db instance starts on the specific node.
    In case of fail-over cluster, where instances are not RAC type and there is
    only one instance running (on one of the nodes) at any time for each db, it is different.
    All diskgroups of db instances don't need to be mounted in Shared Mode,
    because they are used by one instance only at a time
    (on the contrary, they should be mounted in Exclusive Mode).
    Yet, if you follow Oracle advice and put OCR and voting inside ASM, then:
    - at installation OUI will start ASM instance on each node with CLUSTER_DATABASE=true;
    - the first diskgroup, which contains OCR and votings, will be mounted Shared Mode;
    - all other diskgroups, used by each db instance, will be mounted Shared Mode, too,
    even if you'll take care that they'll be mounted by one ASM instance at a time.
    At our site, for our three-node cluster, this fact has two consequences.
    One consequence is that we hit the ORA-15068 limit (max 63 diskgroups) earlier than expected:
    - none of the instances on this cluster are Production (only Test, Dev, etc.);
    - we planned to have usually 10 instances on each node, each of them with 3 diskgroups (DATA, REDO, FRA),
    so 30 diskgroups each node, for a total of 90 diskgroups (30 instances) on the cluster;
    - in case one node failed, surviving two should get resources of the failing node,
    in the worst case: one node with 60 diskgroups (20 instances), the other one with 30 diskgroups (10 instances)
    - in case two nodes failed, the only node survived should not be able to mount additional diskgroups
    (because of limit of max 63 diskgroup mounted by an ASM instance), so all other would remain unmounted
    and their db instances stopped (they are not Production instances);
    But it didn't work: since ASM has the parameter CLUSTER_DATABASE=true, you cannot mount 90 diskgroups;
    you can mount 62 globally (once a diskgroup is mounted on one node, it is given a number between 2 and 63,
    and diskgroups mounted on other nodes cannot reuse that number).
    So as a matter of fact we can mount only 21 diskgroups (about 7 instances) on each node.
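    As a quick check, the globally assigned group numbers are visible on each node through a standard ASM view (nothing site-specific assumed here):
    select group_number, name, state from v$asm_diskgroup;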
    The second consequence is that every time our hand-made CRS scripts dismount diskgroups
    from one node and mount them on another, there are delays in the range of seconds (especially with multipath).
    We also found in the CRS log that whenever we mounted diskgroups (on one node only),
    additional fake resources of type ora*.dg were created on the fly behind the scenes,
    maybe to accommodate the fact that on the other nodes those diskgroups were left unmounted
    (once again, instances are single-node here, not RAC type).
    That's all.
    Did anyone go into similar problems?
    We opened an SR with Oracle asking what options we have here, and we are disappointed by their answer.
    Regards
    Oscar

    Hi Klaas-Jan
    - best practices require that online redo log files also be in a separate diskgroup, in case of ASM logical corruption (we are a little bit paranoid): if the DATA diskgroup gets corrupted, you can restore the full backup plus archived redo logs plus online redo logs (otherwise you will stop at the latest archived log).
    So we have 3 diskgroups for each db instance: DATA, REDO, FRA.
    - in case of a fail-over cluster (active-passive), Oracle provides some template CRS scripts (in $CRS_HOME/crs/crs/public) that you can edit and change at will; you might also create additional scripts for any additional resources you need (Oracle agents, backup agents, file systems, monitoring tools, etc.)
    About our problem, the only solution is to move the OCR and voting disks out of ASM and change the pfile of all ASM instances (parameter CLUSTER_DATABASE from true to false), as sketched below.
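    A minimal sketch of that parameter change, run against each ASM instance and followed by an instance restart (adapt the SID clause to your environment):
    -- Switch ASM out of clustered mode so diskgroups can be mounted in Exclusive Mode
    alter system set cluster_database=false scope=spfile sid='*';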
    Oracle's answers were a little bit odd:
    - first they told us to use Grid Standalone (without CRS, OCR, or voting disks at all), but we told them that we needed a fail-over solution;
    - then they told us to use RAC One Node, which actually has some better features (in case of a planned fail-over it might be able to migrate client sessions without causing a reconnect, for SELECTs only, not for a running transaction), but we already have a few fail-over clusters and we cannot change them all.
    So we plan to move OCR and voting disks into block devices (we think that the other solution, which needs a Shared File System, will take longer).
    Thanks Marko for pointing us to OCFS2 pros / cons.
    We asked Oracle for confirmation that it is supported; they said yes, but it is discouraged (and also doesn't work with OUI or ASMCA).
    Anyway, that's the simplest approach. This is a non-Prod cluster; we'll start here and, if everything is fine, after a while we'll do it on the Prod ones too.
    - Note 605828.1, paragraph 5, Configuring non-raw multipath devices for Oracle Clusterware 11g (11.1.0, 11.2.0) on RHEL5/OL5
    - Note 428681.1: OCR / Vote disk Maintenance Operations: (ADD/REMOVE/REPLACE/MOVE)
    -"Grid Infrastructure Install on Linux", paragraph 3.1.6, Table 3-2
    Oscar
