Fail Over and Redundancy with UCCE 7.5

I have a customer that is installing UCCE and they want to run side A and side B standalone if the visible and private networks are both down. Based on the SRND, the system looks at the PG with the most active connections and takes over, and the other side goes dark. I am designing this in a distributed mode with agents at both sites. Any ideas other than Parent/Child?

... the system looks at the PG with the most active connections and takes over and the other side goes dark.
Not quite. Behaviour of a duplex Router pair when the private network breaks is a complex affair.
As you probably know, the MDS pairs form a "synchronized zone" - one MDS will be PAIRED-ENABLED and the other PAIRED-DISABLED.
Consider all the PGs out there. On some PGs, the active link of the pgagent will be connected to the ccagent on the enabled side, while on the remainder of the PGs, the pgagent active link will be connected to the disabled side.
When a pgagent has an active link to the disabled side, that MDS cannot set the message order - it has to send the message to its peer MDS (enabled), who sets the message order, and now both Routers get the message in the same order at the same time.
Therefore, when the private network breaks, any PGs that have the active link connected to the disabled side will realign to the enabled side. The idle side remains connected - it's just a state change.
Idle paths and active paths both count for device majority.
The rules for the enabled side are simple: if it has device majority, it goes straight to ISOLATED-ENABLED. If it doesn't, it goes to ISOLATED-DISABLED.
The disabled side is more complex. First it checks for device majority. If it has this, it initiates the TOS (test other side) process. If every PG it can communicate with reports that it has no communication to the other side, then it will promote itself to ISOLATED-ENABLED.
If the private network breaks and the public network is affected such that neither side has device majority, they both go disabled. Assuming the private link stays down, but the public network starts to come back in stages, eventually the majority of the PGs will be able to talk to one of the disabled sides, and then that will initiate the TOS process, and will go enabled.
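The decision rules above can be sketched roughly as follows. This is an illustrative model, not actual ICM router code; the function and parameter names are invented for the example.

```python
def has_device_majority(reachable_pgs: int, total_pgs: int) -> bool:
    """A side has device majority when it can reach more than half of the PGs."""
    return reachable_pgs * 2 > total_pgs

def decide_isolated_state(was_enabled: bool,
                          reachable_pgs: int,
                          total_pgs: int,
                          pgs_report_other_side_down: bool) -> str:
    """Return the state a router side takes when the private network breaks.

    was_enabled: True if this side's MDS was PAIRED-ENABLED before the break.
    pgs_report_other_side_down: outcome of the TOS (test other side) probe --
        True only if every reachable PG reports no path to the peer side.
    """
    majority = has_device_majority(reachable_pgs, total_pgs)
    if was_enabled:
        # Enabled side: device majority alone decides.
        return "ISOLATED-ENABLED" if majority else "ISOLATED-DISABLED"
    # Previously disabled side: majority alone is not enough; it must also
    # confirm via TOS that the other side is unreachable before promoting.
    if majority and pgs_report_other_side_down:
        return "ISOLATED-ENABLED"
    return "ISOLATED-DISABLED"
```

Note how, with one PG per side and two PGs total, neither side reaches majority (1 of 2 is not more than half) - which is exactly the colocated-controller problem discussed below.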
Now let's consider what you have - you say "agents at both sites".
Let's imagine for a moment you have a 3rd and 4th site that have no agents - they are just for the central controller. You have a dedicated link between sites 3 and 4 for the private network, and a public network out to sites 1 and 2.
At sites 1 and 2, you have a Call Manager cluster, pair of PGs etc.
If the private network goes down, one of the sides will run simplex until the network is restored. Routing at sites 1 and 2 is unaffected.
If the public network to site 1 is down, routing at site 1 is broken until the network is restored. Site 2 is unaffected.
If the public network to site 2 is down, routing at site 2 is broken until the network is restored. Site 1 is unaffected.
If both networks are down, the whole system is isolated and no routing occurs until the visible network has come back to the point where one of the sides can come up as ISOLATED-ENABLED.
Now what happens when we colocate the central controllers at the agent sites, as in your model? Have we improved the situation? On the surface it looks like we have - and that's what your customer is asking for with "they want to run side A and side B standalone if the visible and private networks are both down".
When the private link breaks and the public link breaks, each router is ISOLATED-DISABLED and cannot come up because it only sees 1 of 2 PGs (the ones on the LAN at the site). So now you are down on both sites.
You might address this by installing a third PG at site 1, configured in the normal way (it doesn't do anything), talking to both Call Routers - one local, one remote. It can be simplex.
Now when the private link breaks and the public link breaks, site 1 can see the majority of the PGs so it comes up in ISOLATED-ENABLED. Routing resumes at site 1, but site 2 remains off the air. This is the best result you can achieve.
The most important thing to think about is this: when the private network comes back up, the synchronizers try to do a state transfer. Assuming success, the synchronizers change to PAIRED mode. Now the routers and loggers will exchange state. If each site had been working in simplex mode ("split brain"), then when they come together you will have a totally messed up database. This corrupted state will most likely be unrecoverable.
It has happened in the past. I'll spare you the gory details.

Similar Messages

  • Time Machine Failing Over and Over

    So I am backing up my external HD with Time Machine for the first time and it keeps failing. It stops between 3 GB and about 20 GB (I have 180 total). It keeps giving the error "The backup was not performed because an error occurred while copying files to the backup disk". The HD I was using is a brand new Western Digital that I formatted into two partitions: one for my old Windows machine and one for this MBP. I reformatted it several times and it keeps failing.
    My neighbor said I need a HD that is completely clean, so I went and bought a brand new Iomega HD just for Mac. It also failed over and over! I have read about every thread out there and can't seem to find a possible cause other than my internal HD being toast.
    Has anyone else had this experience before?
    -T

    Sorry I'm not sure I understand your question now.
    Let me try: if you excluded your Documents folder and TM now works, your backup size was likely the problem.
    So if you simply remove Documents from the TM exclusion list you will run back into the size problem... no good.
    If you first delete the large contents from the documents folder on your startup disk, then you can ask TM to backup again the Documents folder by removing it from the TM exclusion list, but obviously only what remains there (not the deleted material) will be saved by TM.
    In other words, if you want to keep your 80 GB of files on your startup disk, either you do not back them up or you use a larger backup disk. Otherwise you may remove the 80 GB from your startup disk; but then you probably want to keep two copies of them on two different external disks, for safety, without using TM.
    Did I answer your question?
    Piero

  • VPN device with dual ISP, fail-over, and load balancing

    We currently service a client that has a PIX firewall that connects to multiple, separate outside vendors via IPSEC VPN. The VPN connections are mission critical and if for any reason the VPN device or the internet connection (currently only a T1) goes down, the business goes down too. We're looking for a solution that allows dual-ISP, failover, and load balancing. I see that there are several ASA models as well as the IOS that support this but what I'm confused about is what are the requirements for the other end of the VPN, keeping in mind that the other end will always be an outside vendor and out of our control. Current VPN endpoints for outside vendors are to devices like VPN 3000 Concentrator, Sonicwall, etc. that likely do not support any type of fail-over, trunking, load-balancing. Is this just not possible?

    Unless I am mistaken, the ASA doesn't do VPN load balancing for point-to-point IPsec connections either. What you're really after is opportunistic connection failover, and/or something like DMVPN. Coordinating opportunistic failover shouldn't be too much of an issue with the partners, but be prepared for a lot of questions.

  • ISE admin, PSN, and monitoring node fail-over and fallback scenario

    Hi Experts,
    I have a question about ISE failover.
    I have two ISE appliances in two different locations. I am trying to understand the failover and fallback scenarios.
    I have gone through the documentation but am still not clear.
    My primary ISE server would have the primary admin role and primary monitoring node, and the secondary ISE would have the secondary admin and secondary monitoring roles.
    In case of a primary ISE appliance failure, I will have to log in to the secondary ISE node and make its admin role primary - but what about when the primary ISE comes back? What would the scenario be?
    During the primary failure, will there be any impact on user authentication? As long as a PSN is available on the secondary, it should work... right?
    And what is the actual method to promote the secondary ISE admin node to primary? Do I even have to make the monitoring node role changes manually?
    Will I have to reboot the secondary ISE after promoting its admin role to primary?

    We have the same setup across an OTV link and have tested this scenario multiple times. You don't have to do anything if communication is broken between the primary and secondary nodes. The secondary will automatically start authenticating devices that it is in contact with. If you promote the secondary to primary after the link is broken, it will retain the primary role when the link is restored and force the former primary node to secondary.

  • Replication fail-over and reconfiguration

    I would like to get a conversation going on the topic of replication. I have
    set up replication on several sites using the Netscape / iPlanet 4.x server
    and all has worked fine so far. I now need to produce some documentation and
    testing for replication failover for the master. I would like to hear from
    anyone with experience promoting a consumer to a supplier. I'm
    looking for best practices on this issue. Here is what I am thinking;
    please feel free to correct me or add input.
    Disaster recovery plan:
    1.) Select a consumer from the group of read-only replicas
    2.) Change the database from Read-Only to Read-Write
    3.) Delete the replication agreement (in my case I am using a SIR)
    4.) Create a new agreement to reflect the supplier status of the chosen
    replica (again a SIR for me)
    5.) Reinitialize the consumers (Online or LDIF depending on your number of
    entries)
    That is the general plan so far. Other questions and topics might include:
    1.) What to do when the original master comes back online
    2.) DNS round-robin strategies (hardware assistance, Dynamic DNS, etc.)
    3.) General backup and recovery procedures when: the directory is corrupted;
    the link is down / the network is partitioned; or there is disk / server
    corruption or destruction
    Well I hope that is a good basis for getting a discussion going. Feel free
    to email me if you have questions or I can help you with one of your issues.
    Best regards,
    Ray Cormier

    There is no failover in Meta-Directory 5.1; you can implement manual failover on the metaview by using multi-master replication with Directory Server. There are limitations, and this is a manual process.
    - Paul

  • Users contacts missing after failing over and then failing back pool

    We have 2 Lync enterprise pools that are paired.
    Three days ago, I failed the central management store and all users over from pool01 to pool02.
    This morning, I failed the CMS and all users back from pool02 to pool01.
    All users signed back in to Lync and no issues were reported. A user then contacted me to say that his contact list was empty.
    I had him sign out and back in to Lync, and also had him sign into Lync from a different workstation, as well as his mobile device. All of which showed his contacts list as empty.
    We have unified contacts enabled (hybrid mode with Office 365 Exchange Online, and Lync on-prem). When I check the user's Outlook contacts, I can see all of his contacts listed under "Lync Contacts", along with the current presence of each user.
    If I perform an export-csuserdata for that user's userdata, the XML file contained within the ZIP file shows the contacts that he is missing.
    I've also checked the client log on the workstation too, and can see that Lync can see the contacts as it lists them in the log. They do not appear in the Lync client though.
    Environment details:
    Lync 2013 - 2 enterprise pools running the latest December 2014 CU updates.
    Lync 2013 clients - running on Windows 8.1. User who is experiencing the issue is running client version 15.0.4675.1000 (32 bit)
    I have attempted to re-import the user data using both import-csuserdata (and restarting the front end services) and update-csuserdata. Both of these have had no effect.

    Hi Eason,
    Thanks for your reply. I've double-checked and can confirm that only one policy exists, which enables it globally.
    I believe this problem relates to issues that always seem to happen when ever our primary pool is failed over to the backup pool, and then failed back.
    What I often see is that upon pool failback, things like response group announcements don't play on inbound calls (white noise is heard, followed by the call disconnecting), and agents somehow get signed out of queues (although they appear to be signed in to the queue when checking their Response Group settings in the Lync client). I've also noticed that every time we fail back, a different user will come to me and report that either their entire contacts list is missing, or that half of their contacts are missing.
    I am able to restore these from backup though.
    This appears to happen regardless of if the failover to the backup pool is due to a disaster, or to simply perform pool maintenance on our primary pool.

  • Fail over and Commonj

    We have a session bean inside WebLogic which creates a number of Work objects executed on a cluster of Tangosol Coherence caches. It doesn't wait until the created Works complete. If the cluster of WebLogic servers dies, we have the ability to fix integrity problems based on tlog files. What if the cluster of cache servers dies? Is there any way to find out which Work objects crashed using only Tangosol features?

    It sounds like you may be trying to ensure "at least once" (i.e. guaranteed) processing of the work items. Is that correct?
    Yes.
    OK, first the bad news: the Work Manager implementation in Coherence does not have those guarantees.
    Also, you said previously:
    What if cluster of cache server dies? Is there any way to find out which Work objects crashed using only Tangosol features?
    So you want the work items to survive cluster shutdown. That means that you want them to be persistent. And from your "tlog" comment, I assume you mean transactional as well. None of those qualities is defined by the commonJ spec, but you'll find some commonJ implementations (e.g. Redwood) that provide them.
    To achieve similar with Coherence:
    1. Define a partitioned cache (e.g. "pending-work") with write-through (or write-behind) to a database
    2. Place Work items into the cache
    3. On backing map "insert" event(s), issue work item(s); you can do this with local affinity on the work manager, since the work items are already partitioned
    4. On work completion, delete work items from cache
    5. On application startup, preload the "pending-work" cache from the database
    This doesn't guarantee "only once", so the work items MUST be idempotent or otherwise non-destructive.
    You can achieve many guarantees (concurrency, etc.) by taking advantage of other features, but obviously there is some work related to this.
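    The five steps above can be sketched roughly as follows. This is a plain-Python illustration of the pattern, not Coherence API code; the class name is invented, and a dict stands in for both the "pending-work" partitioned cache and the backing database.

```python
class PendingWorkStore:
    """Illustrative pending-work pattern: cache with write-through persistence."""

    def __init__(self):
        self.db = {}       # stands in for the write-through backing database
        self.cache = {}    # stands in for the "pending-work" partitioned cache

    def submit(self, work_id, payload):
        # Step 2: place the work item in the cache; write-through persists it.
        self.cache[work_id] = payload
        self.db[work_id] = payload

    def complete(self, work_id):
        # Step 4: on completion, delete the item from the cache and the store.
        self.cache.pop(work_id, None)
        self.db.pop(work_id, None)

    def recover(self):
        # Step 5: after a restart, preload the cache from the database so
        # unfinished items are re-issued. Since "only once" is not guaranteed,
        # processing the re-issued items must be idempotent.
        self.cache = dict(self.db)
        return sorted(self.cache)
```

    A crash between submit and complete leaves the item in the database, so recover() re-surfaces exactly the unfinished work.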
    Peace.

  • CAS ARRAY fail-over and emails stuck

    Dear all,
    for some reasons we are in Exchange Server coexistence mode, that is, one Exchange 2003 server and two Exchange 2010 servers.
    We have a CAS array (Node-1 and Node-2) and a DAG in place, but the problem is that whenever Node-1 is down, emails get stuck on the routing group connector on the legacy server, while Exchange 2010 to Exchange 2010 mail flow keeps working.
    Conversely, when Node-2 is down, everything works.
    How do I fix this?
    TheAtulA

    I assume the CAS servers also have the Hub Transport role installed?
    Check the routing group connector(s) (Get-RoutingGroupConnector) and ensure the source and destination transport servers include both CAS/Hub nodes, not just Node-1.
    If not, then use Set-RoutingGroupConnector to set the correct source and target servers.
    https://technet.microsoft.com/en-us/library/aa998581(v=exchg.141).aspx

  • Redundancy with dual nic servers

    Hi, I have two 11500s configured with box-to-box redundancy. I have a number of app servers, each with dual NICs (which are teamed), connected directly to the CSSs. NIC 1 in each goes to the master CSS1 and is therefore live; NIC 2 goes to the standby CSS2. The CSSs are connected to two 4500 switches towards the public network. I am monitoring the links to the 4500s. If I switch off the master CSS1, we fail over and the servers all connect via NIC 2 to the new master CSS2. But when the link to the 4500 from the master CSS1 goes down, the CSSs fail over but the NIC 2s do not connect to CSS2, because NIC 1 in each server has not failed, i.e. they still see CSS1 as up. Is there a workaround to this problem?
    Thanks
    J

    I know of no way to link CSS interfaces so that if the uplink goes down, the other ports are shut down. There may be another way to configure the adapter teaming or failover on the server side. I know some OSes send out test frames from one adapter to the other to verify network integrity.
    What I'd recommend is that you setup a VLAN on your 4500's for the server's physical connections, and uplink that to a "backend" interface on the CSS. This can be done with the CSS in either a router or bridge configuration, but I'd recommend router mode.

  • Audio Applications in Unity Fail-over

    Hi all,
    I am going to install Cisco Unity with failover and, from what I remember, I should rebuild applications like the Auto Attendant on the secondary server, because they are not part of the replication.
    Am I right? Or is there no need to rebuild the applications?

    Hi JFV,
    That is no longer the case
    How Standby Redundancy Works in Cisco Unity 8.x
    Cisco Unity standby redundancy uses failover functionality to provide duplicate Cisco Unity servers for disaster recovery. The primary server is located at the primary facility, and the secondary server is located at the disaster-recovery facility.
    Standby redundancy functions in the following manner:
    •Data is replicated to the secondary server, with the exceptions noted in the "Data That Is Not Replicated in Cisco Unity 8.x" section.
    •Automatic failover is disabled.
    •In the event of a loss of the primary server, the secondary server is manually activated.
    Data That Is Not Replicated in Cisco Unity 8.x
    Changes to the following Cisco Unity settings are not replicated between the primary and secondary servers. You must manually change values on both servers.
    •Registry settings
    •Recording settings
    •Phone language settings
    •GUI language settings
    •Port settings
    •Integration settings
    •Conversation scripts
    •Key mapping scripts (can be modified through the Custom Key Map tool)
    •Media Master server name settings
    •Exchange message store, when installed on the secondary server
    http://www.cisco.com/en/US/docs/voice_ip_comm/unity/8x/failover/guide/8xcufg040.html#wp1099338
    Cheers!
    Rob

  • Is it possible to add hyper-V fail over clustering afterwards?

    Hi,
    We are testing Windows 2012 R2 Hyper-V using only one standalone host, without failover clustering, with a few virtual machines. Is it possible to add failover clustering afterwards, add a second Hyper-V node and shared disk, and move the virtual machines there, or do we have to install both nodes from scratch?
    ~ Jukka ~

    Hi Jukka,
    In addition, before you build a Hyper-V failover cluster, please refer to the requirements in the article below:
    http://technet.microsoft.com/en-us/library/jj863389.aspx
    Best Regards
    Elton Ji

  • UCCX Purposely Prevent Fail-over

    Hi. I was wondering if shutting down the engine on a secondary server would be enough to prevent failover in an HA environment.
    Basically, we have had a case open with TAC about the servers failing over and then back for no apparent reason. What we found was that the two servers were losing heartbeat to each other, so the secondary server was trying to take control. This then caused all of our agents to fail over, and calls could get lost even though the primary server was actually fully functioning. This led us to another TAC case on an error on a router near the secondary server that was causing the loss of heartbeat. The problem is that the router cannot come down for some time and is due to be replaced at the end of the year.
    So now, maybe not entirely to my liking, we want to try running just the primary, and if worst comes to worst we can start the secondary back up again; I am curious what the best procedure for that would be. The hope is that this would at least stop the random failovers, even if it doesn't actually address the real issue.

    I have to rely on another guy for the router, switches and UCM side of things, and he hasn't said exactly what the error message is, but he called TAC and it is supposed to be only cosmetic; a reboot of the router would clear it. Unfortunately, given where that router is, it will not be brought down until a maintenance window at the end of the year.
    At any rate, the UCCX server has been ruled out: we have had multiple TAC tickets, first for UCCX and then for UCM, and both point to a network issue that is not avoided by shutting down the secondary server, mainly because we also have a CM publisher and subscriber on the same network.

  • /WS fail-over

    Hello everyone,
    I'm having some difficulties working with /WS fail-over, or rather with /WS fail-back.
    In my Tuxedo 8.0RP133 /WS client I define a WSNADDR containing two destinations,
    something like
    WSNADDR=//host:port1,//host:port2
    My destination is a TUXEDO 6.5 application with two WSL servers.
    In my client I trap tpcall() errors and for some errors (TPESYSTEM, for example)
    I assume that an idle timeout or some temporary error has occurred, call tpterm(),
    tpinit() and retry tpcall() again.
    This sort of works, most of the time... But it seems that the failover (from
    host:port1 to host:port2) is one-way only. If I shut down the first WSL, the client
    fails over to the second one quite nicely. If I then restart the first WSL and
    shut down the second one, the client fails. As far as I can understand, once the
    client process has started using the second address in WSNADDR there is no turning
    back to the first one again.
    Is this the way it is supposed to be? Have I misinterpreted the syntax for WSNADDR?
    Or is this simply a case for BEA support?
    Best regards,
    /Per

    Thanks for your input, Amit.
    I was probably only using the second WSL all the time. Seems like an encryption
    settings problem prevented the first one from ever being useful for me...
    /Per
    "Amit" <[email protected]> wrote:
    >
    The WSL Connection to the Tuxedo Server is decided at the time of tpinit.
    So if
    you have specified 2 address for tpinit, the workstation tries to connect
    to the
    first ip address specified and if it is not successful, it tries to connect
    to
    the second ip address...
    this process is done every time you do a tpinit.
    So i guess, if you do a tpinit after bringing back the 1st IP Address
    WSL, you
    should be able connect ..
    I hope this helps.
    -Amit
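    The connect sequence Amit describes can be sketched like this. It is an illustrative model of the WSNADDR behaviour, not Tuxedo client code; parse_wsnaddr, tpinit_like, and the injected connect callback are invented names.

```python
def parse_wsnaddr(wsnaddr: str):
    """Split a WSNADDR value like '//host:port1,//host:port2' into addresses."""
    return [a.strip() for a in wsnaddr.split(",") if a.strip()]

def tpinit_like(wsnaddr: str, connect):
    """Walk the address list from the start on every call, as tpinit does.

    connect(addr) stands in for the real network attempt and returns True
    on success. Because each call restarts from the first address, a
    restored first WSL is picked up again on the next tpterm()/tpinit().
    """
    for addr in parse_wsnaddr(wsnaddr):
        if connect(addr):
            return addr
    raise ConnectionError("no WSL reachable in " + wsnaddr)
```

    In this model the failover is not one-way: whichever listed WSL answers first wins on each fresh tpinit, which matches the behaviour Amit describes.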
    "Per Lindström" <[email protected]> wrote:
    Hello everyone,
    I'm having some difficulties working with /WS fail-over, or rather with
    /WS fail-back.
    In my Tuxedo 8.0RP133 /WS client I define a WSNADDR containing two destinations,
    something like
    WSNADDR=//host:port1,//host:port2
    My destination is a TUXEDO 6.5 application with two WSL servers.
    In my client I trap tpcall() errors and for some errors (TPESYSTEM,for
    example)
    I assume that an idle timeout or some temporary error has occurred,call
    tpterm(),
    tpinit() and retry tpcall() again.
    This sort of works, most of the time... But it seems that the fail-over
    (from
    host:port1 to host:port2) is one-way only. If I shut down the firstWSL
    the client
    fails-over to the second one quite nicely. If I then restart the first
    WSL and
    shut down the second one, the client fails. As far as I can understand,
    once the
    client process has started using the second address in WSNADDR there
    is no turning
    back to the first one again.
    Is this the way it is supposed to be? Have I misinterpreted the syntax
    for WSNADDR?
    Or is this simply a case for BEA support?
    Best regards,
    /Per

  • BEA WebLogic 6.1 does not detect Oracle database failover

    Hi. We had Concurrency Strategy: Exclusive. We changed that to Database for performance reasons. Since we changed it, when we do an Oracle database failover, WebLogic 6.1 does not detect the database failover and needs a restart.
    How can we resolve this?

    mt wrote:
    Hi. We had Concurrency Strategy: Exclusive. We changed that to Database for performance reasons. Since we changed it, when we do an Oracle database failover, WebLogic 6.1 does not detect the database failover and needs a restart. How can we resolve this?
    Are your pools set to test connections at reserve time?
    Joe
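    Joe's suggestion - testing connections at reserve time - can be sketched like this. It is a simplified, hypothetical pool, not the WebLogic implementation; the class and parameter names are invented, and is_alive stands in for a cheap test query such as SELECT 1 FROM DUAL.

```python
class Pool:
    """Minimal connection pool that validates connections at reserve time."""

    def __init__(self, factory, is_alive):
        self.factory = factory    # creates a fresh connection
        self.is_alive = is_alive  # validation test run before handing out
        self.idle = []

    def reserve(self):
        # Test each idle connection before returning it; connections left
        # stale by a database failover are silently discarded and replaced,
        # so the application never needs a restart to recover.
        while self.idle:
            conn = self.idle.pop()
            if self.is_alive(conn):
                return conn
        return self.factory()

    def release(self, conn):
        self.idle.append(conn)
```

    The trade-off is one extra round trip per reserve; that is why pools make test-on-reserve optional.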

  • Database failover problem after we changed the Concurrency Strategy

    Hi. We had Concurrency Strategy: Exclusive. We changed that to Database for performance reasons. Since we changed it, when we do an Oracle database failover, WebLogic 6.1 does not detect the database failover and needs to be rebooted.
    How can we resolve this?

    Hi,
    It is just failing on one of the application servers. A developer wrote that when installing the CI, the local hostname is written in the database and the SDM. We will have to do a homogeneous system copy to change the name.
    The problem is that I used the virtual SAP group name for the CI and DI application servers; for the SCS and ASCS we used virtual hostnames, which is OK according to the SAP developer.
    The Start and instance profiles were checked and everything was fine; just the dispatcher from the CI is having problems when coming from Node B to Node A.
    Regards
