UCCX 9.0(2) HA unexpected failover

We are seeing unexpected failovers between an HA pair of UCCX 9.0(2) servers on a single site. Can anyone suggest where to look to find out why these failovers are occurring?  The Engine service runtime suggests it is not a service failure.  Thanks.

As stated earlier this was due to a network issue:
6209: Mar 10 14:49:17.691 GMT %MCVD-CVD-5-HEARTBEAT_MISSING_HEARTBEAT:CVD does not receive heartbeat from node for a long period: nodeId=2,dt=15562
6210: Mar 10 14:49:17.691 GMT %MCVD-CVD-4-HEARTBEAT_SUSPECT_NODE_CRASH:CVD suspects node crash: state=Heartbeat State,nodeInfo=Node id=2 ip=10.65.64.114 convId=22 cmd=34 viewLen=1,dt=10370
6211: Mar 10 14:49:17.691 GMT %MCVD-CVD-7-UNK:Convergence State activated, type=13, remNode Node id=2 ip=10.65.64.114 convId=22 cmd=34 viewLen=1
6212: Mar 10 14:49:17.691 GMT %MCVD-CVD-4-HEARTBEAT_SUSPECT_NODE_CRASH:CVD suspects node crash: state=Convergence State,nodeInfo=Node id=2 ip=10.65.64.114 convId=22 cmd=34 viewLen=1,dt=15562
6213: Mar 10 14:49:17.691 GMT %MCVD-CVD-7-UNK:com.cisco.cluster.impl.cvd.net.impl.UnicastTransmitter: setData retransmitInterval=100
6214: Mar 10 14:49:17.692 GMT %MCVD-CVD-7-UNK: >> try to process HeartbeatConvergenceStartedCmdImpl
6215: Mar 10 14:49:17.692 GMT %MCVD-CVD-7-UNK: >> try to process HeartbeatNodeLeaveCmdImpl nodeId=2
6216: Mar 10 14:49:17.692 GMT %MCVD-CVD-3-NODE_LEAVE_CLUSTER:Node leave cluster: nodeId=2
6217: Mar 10 14:49:17.698 GMT %MCVD-CVD-7-UNK:removeSubscriber 2
6218: Mar 10 14:49:17.701 GMT %MCVD-CLUSTER_MGR-7-UNK:try to process NodeLeaveCmdImpl, nodeId=2
6219: Mar 10 14:49:17.701 GMT %MCVD-CLUSTER_MGR-7-UNK:process Node Leave, id=2
6220: Mar 10 14:49:17.701 GMT %MCVD-CLUSTER_MGR-7-UNK:Post Convergence Event: CONVERGENCE_STARTED, name=Cisco Unified CCX Database
6221: Mar 10 14:49:17.702 GMT %MCVD-CLUSTER_MGR-7-UNK:DatastoreService171: Cisco Unified CCX Database on node 2 change master from true to false
6222: Mar 10 14:49:17.702 GMT %MCVD-CLUSTER_MGR-7-UNK:Post Master Event: MASTER_DROPPED, name=Cisco Unified CCX Database, node=2
6223: Mar 10 14:49:17.702 GMT %MCVD-CLUSTER_MGR-7-UNK:Post Convergence Event: CONVERGENCE_STARTED, name=Cisco Unified CCX Engine
6224: Mar 10 14:49:17.702 GMT %MCVD-CLUSTER_MGR-7-UNK:JavaService175: Cisco Unified CCX Engine on node 2 change master from true to false
6225: Mar 10 14:49:17.702 GMT %MCVD-CLUSTER_MGR-7-UNK:Post Master Event: MASTER_DROPPED, name=Cisco Unified CCX Engine, node=2
6226: Mar 10 14:49:17.702 GMT %MCVD-CLUSTER_MGR-7-UNK:Post Master Event: MASTER_DROPPED, name=Cisco Desktop Browser and IP Phone Agent Service, node=2
6227: Mar 10 14:49:17.702 GMT %MCVD-CLUSTER_MGR-7-UNK:Cisco Desktop Agent E-Mail Service.stateChanged() from IN SERVICE to UNKNOWN, oldMaster=false newMaster=false
6228: Mar 10 14:49:17.702 GMT %MCVD-CLUSTER_MGR-7-UNK:Cisco Desktop Sync Service.stateChanged() from IN SERVICE to UNKNOWN, oldMaster=false newMaster=false
6229: Mar 10 14:49:17.703 GMT %MCVD-CLUSTER_MGR-7-UNK:Cisco Desktop Enterprise Service.stateChanged() from IN SERVICE to UNKNOWN, oldMaster=false newMaster=false
6230: Mar 10 14:49:17.703 GMT %MCVD-CLUSTER_MGR-7-UNK:Cisco Desktop Recording and Statistics Service.stateChanged() from IN SERVICE to UNKNOWN, oldMaster=false newMaster=false
6231: Mar 10 14:49:17.703 GMT %MCVD-CLUSTER_MGR-7-UNK:Cisco Desktop Browser and IP Phone Agent Service.stateChanged() from IN SERVICE to UNKNOWN, oldMaster=false newMaster=false
6232: Mar 10 14:49:17.703 GMT %MCVD-CLUSTER_MGR-7-UNK:Cisco Desktop Call/Chat Service.stateChanged() from IN SERVICE to UNKNOWN, oldMaster=false newMaster=false
6233: Mar 10 14:49:17.703 GMT %MCVD-CLUSTER_MGR-7-UNK:Cisco Desktop License and Resource Manager Service.stateChanged() from IN SERVICE to UNKNOWN, oldMaster=false newMaster=false
6234: Mar 10 14:49:17.703 GMT %MCVD-CLUSTER_MGR-7-UNK:Node 2 change state from PARTIAL SERVICE to UNKNOWN
6235: Mar 10 14:49:17.703 GMT %MCVD-CVD-7-UNK:BootstrapListenerImpl cvd IGNORE repositoryShutdown() arg0=10.65.64.114
6236: Mar 10 14:49:17.703 GMT %MCVD-CLUSTER_MGR-7-UNK:Post Master Event: MASTER_DROPPED, name=Cisco Desktop Enterprise Service, node=2
After a few consecutive missed heartbeats, the node that wrote this log suspected that node 2 had crashed (the log shows node 2 leaving the cluster and dropping mastership), and it took over mastership.
I see multiple failovers in these log files; the snippet above is for the timestamp you stated. You need to get the routing and switching team to check the network between the nodes.
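
To find every such event across a larger set of MCVD logs, the heartbeat lines can be filtered out programmatically. The following is a minimal sketch in Python, not a Cisco tool. It assumes the line layout seen in the snippet above; the function name and the choice of "interesting" mnemonics are my own. The unit of the dt value is not stated in the log, so it is reported as-is.

```python
import re

# Layout assumed from the snippet above:
#   <seq>: <Mon DD HH:MM:SS.mmm> <TZ> %<FACILITY>-<SUBFACILITY>-<SEVERITY>-<MNEMONIC>:<message>
LINE_RE = re.compile(
    r"^(?P<seq>\d+): (?P<ts>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}\.\d{3}) (?P<tz>\w+) "
    r"%(?P<facility>[A-Z_]+)-(?P<sub>[A-Z_]+)-(?P<sev>\d)-(?P<mnemonic>[A-Z_]+):(?P<msg>.*)$"
)

# Mnemonics taken from the log above; this selection is an assumption, not a complete list.
INTERESTING = {
    "HEARTBEAT_MISSING_HEARTBEAT",
    "HEARTBEAT_SUSPECT_NODE_CRASH",
    "NODE_LEAVE_CLUSTER",
}


def heartbeat_events(lines):
    """Return one dict per heartbeat-related line, with node id and dt if present."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None or m.group("mnemonic") not in INTERESTING:
            continue
        msg = m.group("msg")
        node = re.search(r"(?:nodeId|Node id)=(\d+)", msg)
        dt = re.search(r"\bdt=(\d+)", msg)
        events.append({
            "ts": m.group("ts"),
            "mnemonic": m.group("mnemonic"),
            "node": int(node.group(1)) if node else None,
            "dt": int(dt.group(1)) if dt else None,  # unit not stated in the log
        })
    return events


SAMPLE = """\
6209: Mar 10 14:49:17.691 GMT %MCVD-CVD-5-HEARTBEAT_MISSING_HEARTBEAT:CVD does not receive heartbeat from node for a long period: nodeId=2,dt=15562
6213: Mar 10 14:49:17.691 GMT %MCVD-CVD-7-UNK:com.cisco.cluster.impl.cvd.net.impl.UnicastTransmitter: setData retransmitInterval=100
6216: Mar 10 14:49:17.692 GMT %MCVD-CVD-3-NODE_LEAVE_CLUSTER:Node leave cluster: nodeId=2
"""

events = heartbeat_events(SAMPLE.splitlines())
for event in events:
    print(event)
```

Running this over the logs from both nodes and comparing the timestamps with switch or router logs may help the network team narrow down when the heartbeats were lost.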

Similar Messages

  • SQL Cluster unexpected failover

    So we had one of our SQL clusters unexpectedly failover recently. Second time in a few months. Two node active/passive SQL 2012 cluster running on Windows 2012 Standard.
    Here's what we could cull from the application/system logs:
    1. "
    Cluster resource 'SQLServer' of type 'SQL Server' in clustered role 'SQLServerRole' failed.
    Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster
    Manager or the Get-ClusterResource Windows PowerShell cmdlet."
    2. "
    Cluster resource 'SQLServer' (resource type 'SQL Server', DLL 'sqsrvres.dll') did not respond to a request in a timely fashion. Cluster health detection will attempt to automatically recover by terminating the Resource Hosting Subsystem (RHS) process running
    this resource. This may affect other resources hosted in the same RHS process. The resources will then be restarted. 
    The suspect resource 'SQLServer' will be marked to run in an isolated RHS process to avoid impacting multiple resources in the event that this resource failure occurs again. Please ensure services, applications, or underlying infrastructure (such as storage
    or networking) associated with the suspect resource is functioning properly."
    3. "The cluster Resource Hosting Subsystem (RHS) stopped unexpectedly. An attempt will be made to restart it. This is usually associated with recovery of a crashed or deadlocked resource.  Please determine which resource and resource DLL is causing
    the issue and verify it is functioning properly."
    4. "A timeout (30000 milliseconds) was reached while waiting for a transaction response from the MSSQLSERVER service."
    Cluster.log wasn't much more helpful on the root cause either:
    00000f28.00001c78::2014/12/04-21:25:54.662 INFO  [RES] Network Name <Cluster Name>: Netbios: Slow Operation, FinishWithReply: 0
    00000f28.00001c78::2014/12/04-21:25:54.662 INFO  [RES] Network Name:  [NN] got sync reply: 0
    00000f28.00001c78::2014/12/04-21:25:54.662 INFO  [RES] Network Name <Cluster Name>: Netbios: End of Slow Operation, state: Initialized/Idle, prevWorkState: Idle
    00000f20.00000e94::2014/12/04-21:25:55.240 INFO  [RES] SQL Server Agent <SQL Server Agent>: [sqagtres] IsAlive request.
    00000f20.00000e94::2014/12/04-21:25:55.240 INFO  [RES] SQL Server Agent <SQL Server Agent>: [sqagtres] CheckServiceAlive: returning TRUE (success)
    00001134.000001d8::2014/12/04-21:25:57.287 ERR   [RES] SQL Server <SQLServer>: [sqsrvres] Failure detected, diagnostics heartbeat is lost
    00001134.000001d8::2014/12/04-21:25:57.287 INFO  [RES] SQL Server <SQLServer>: [sqsrvres] IsAlive returns FALSE
    00001134.000001d8::2014/12/04-21:25:57.287 WARN  [RHS] Resource SQLServer IsAlive has indicated failure.
    00000880.0000161c::2014/12/04-21:25:57.303 INFO  [NM] Received request from client address HOST-XXX-SQL02.
    00000880.0000161c::2014/12/04-21:25:57.303 INFO  [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'SQLServer', gen(3) result 1/0.
    00000880.000023a4::2014/12/04-21:25:57.303 INFO  [GEM] Sending 1 messages as a batched GEM message
    00000880.0000161c::2014/12/04-21:25:57.303 INFO  [RCM] Res SQLServer: Online -> ProcessingFailure( StateUnknown )
    00000880.0000161c::2014/12/04-21:25:57.303 INFO  [RCM] TransitionToState(SQLServer) Online-->ProcessingFailure.
    00000880.0000161c::2014/12/04-21:25:57.318 INFO  [RCM] rcm::RcmGroup::UpdateStateIfChanged: (SQLServerRole, Online --> Pending)
    00000880.00001db8::2014/12/04-21:25:57.334 INFO  [GEM] Sending 1 messages as a batched GEM message
    00000880.0000161c::2014/12/04-21:25:57.334 ERR   [RCM] rcm::RcmResource::HandleFailure: (SQLServer)
    00000880.00001db8::2014/12/04-21:25:57.334 INFO  [GEM] Sending 1 messages as a batched GEM message
    00000880.00000bac::2014/12/04-21:25:57.334 INFO  [RCM] ignored non-local state Pending for group SQLServerRole
    00000880.0000161c::2014/12/04-21:25:57.350 INFO  [RCM] resource SQLServer: failure count: 1, restartAction: 2 persistentState: 1.
    00000880.0000161c::2014/12/04-21:25:57.350 INFO  [RCM] Greater than restartPeriod time has elapsed since first failure of SQLServer, resetting failureTime and failureCount.
    00000880.0000161c::2014/12/04-21:25:57.350 INFO  [RCM] Will queue immediate restart (500 milliseconds) of SQLServer after terminate is complete."
    Any ideas? Anywhere we could look for more specific info? Any preventative measures we could take?
    Thanks,
    Ryan

    Hello,
    Since you are using SQL Server 2012, there is an extended events trace running on the cluster that holds all of the return values from sp_server_diagnostics. Check that out (.xel) to see if there is anything in there.
    The error is pretty straightforward: there wasn't a timely response to the sp_server_diagnostics return set. Look for schedulers that are overwhelmed, SQL Server paging a bunch of memory (outside OS pressure), someone pausing a service, etc.
    Is this happening during a peak traffic or load time?
    -Sean
    The views, opinions, and posts do not reflect those of my company and are solely my own. No warranty, service, or results are expressed or implied.
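
    When cluster.log is large, it can help to pull out only the ERR and WARN entries and then read the INFO lines around their timestamps. The following is a rough sketch in Python, assuming the line layout shown in the excerpt above (process.thread ids, timestamp, level, bracketed component, message). The function names are my own and not part of any Microsoft tooling.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt: <pid>.<tid>::<YYYY/MM/DD-HH:MM:SS.mmm> <LEVEL> [<COMPONENT>] <message>
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+\[(?P<component>[A-Za-z]+)\]\s+(?P<msg>.*)$"
)


def parse_line(line):
    """Parse one cluster.log line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    entry = m.groupdict()
    entry["ts"] = datetime.strptime(entry["ts"], "%Y/%m/%d-%H:%M:%S.%f")
    return entry


def problems(lines, levels=("ERR", "WARN")):
    """Return the parsed entries whose level is in `levels`, in file order."""
    parsed = (parse_line(line) for line in lines)
    return [e for e in parsed if e is not None and e["level"] in levels]


SAMPLE = """\
00000f20.00000e94::2014/12/04-21:25:55.240 INFO  [RES] SQL Server Agent <SQL Server Agent>: [sqagtres] CheckServiceAlive: returning TRUE (success)
00001134.000001d8::2014/12/04-21:25:57.287 ERR   [RES] SQL Server <SQLServer>: [sqsrvres] Failure detected, diagnostics heartbeat is lost
00001134.000001d8::2014/12/04-21:25:57.287 INFO  [RES] SQL Server <SQLServer>: [sqsrvres] IsAlive returns FALSE
00001134.000001d8::2014/12/04-21:25:57.287 WARN  [RHS] Resource SQLServer IsAlive has indicated failure.
"""

found = problems(SAMPLE.splitlines())
for entry in found:
    print(entry["ts"], entry["level"], entry["component"], entry["msg"])
```

    In the excerpt, the first ERR entry is the lost diagnostics heartbeat, which is consistent with the suggestion above to look at the sp_server_diagnostics output around that time.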

  • Sync two 10.6 file servers

    Hi everyone,
    this might be a very basic question, but I haven’t seen a good guide on how to accomplish this, hence the question.
    I would like to use 2 absolutely identical file servers (no other services, only AFP) with one of them being a backup that is not in use unless the first one crashes, burns, gets stolen, etc. The servers will be a Mac mini Server each with both having a 12 TB FireWire RAID attached. They need to be in separate locations, but Gigabit Ethernet is available to connect them.
    In order to be clear: I thought of syncing both the system and the data on the RAID.
    Do you have any recommendations on how to best accomplish this? Is there a best practice? Is rsync what I need here?
    Thanks
    Björn

    There are many, many elements to your question and synching the data is the least (and easiest?) part of the equation.
    For one, what's your failover model? Do you want the failover to happen instantly, with no disruption to the users? or do you mind if the users get disconnected and have to reconnect to their shares?
    Or maybe you want failover to only happen manually? (i.e. only when you know the primary server is going to be down for a while). This is common because the cost of failback (i.e. resynching the 'backup' data to the primary server) is time consuming and could take longer than the primary server would be offline, anyway - if it'll take 2 hours to sync your data back then there's no point in failing over if your server is going to be back in 10 minutes.
    Then there's the volume of data and, more importantly, the rate of change. Even if you have 10TB of data there may be only a few megabytes of data that changes daily and needs to be kept in sync. That will have a big impact on your replication strategy.
    While on that subject, how much tolerance do you have for the servers being out of sync? If you need them to be real-time then you don't have the equipment for this - real-time replication of filesystems is a tricky (and expensive) task. If you want to sync daily, or even a few times a day, then that's easier, with the cost being a few hours' lost work should an unexpected failover happen. That may or may not be viable for you.
    Either way I would not recommend Retrospect for this (or even for regular backups). A simple rsync shell script can replicate the data between two servers; it's largely an issue of frequency and volume that you have to consider.
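
    As a rough illustration of the rsync approach, here is a small Python sketch that only builds and prints the rsync command line. The host name and paths are placeholders I made up, and the flag choice (-a for archive mode, --delete to mirror deletions, --dry-run to preview) is one reasonable starting point rather than a recommendation for this exact setup. It does not address Mac-specific metadata, which you would need to check for your rsync version.

```python
import shlex


def build_rsync_command(source_dir, dest_host, dest_dir, dry_run=True):
    """Build an rsync argument list that mirrors source_dir onto dest_host:dest_dir.

    All names passed in are hypothetical examples. The trailing slash on the
    source makes rsync copy the directory's contents rather than the directory itself.
    """
    cmd = ["rsync", "-a", "--delete"]
    if dry_run:
        cmd.append("--dry-run")  # preview only; nothing is changed on the destination
    cmd.append(source_dir.rstrip("/") + "/")
    cmd.append(f"{dest_host}:{dest_dir}")
    return cmd


# Placeholder values for illustration only.
cmd = build_rsync_command("/Volumes/RAID", "backup.example.com", "/Volumes/RAID/")
print(shlex.join(cmd))
```

    Starting with the dry run is deliberate: --delete removes files on the backup that no longer exist on the primary, so it is worth inspecting the preview before scheduling the real run (for example from cron or launchd).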

  • Surprises on Replacing failed primary unit

    Dear friends,
    I have done failover for firewalls umpteen times, but yesterday it failed for some reason.
    I had replaced the failed primary unit with a fresh one, and I expected that it would detect the secondary unit as active and begin config replication from it. Instead, it wiped the secondary unit's config. I don't think I made a mistake in the sequence, but let me share what I did:
    1. Put the four or five lines of failover configuration (except the failover command) and did a no shut on the failover interface (management0/0)
    2. Ran the failover command
    Instead of getting the config from the active unit, it started forcing its config onto the other unit. I had to reload the active unit to restore its config. After that I reloaded the fresh unit, and this time the failover happened as expected.
    I think that I should have forced a reload of the new unit before trying to establish failover.
    Has anyone done this in a fail-proof way during production hours? If yes, can you please share the steps with me?
    I did not ask for downtime because I was confident, but I ended up bringing down the ASA for 5 minutes because of the unexpected failover action.
    Thanks a lot
    Gautam

    Dear kureli,
    Thanks a lot for the efforts you took. I really appreciate it.
    Here's the exact sequence of steps that happened:
    1.  When the primary unit failed, the secondary became active. I don't remember whether sh fail showed "secondary- not detected" or "secondary - failed".
    2.  I replaced the faulty primary unit with another primary unit, did a no shut on the m0/0 failover interface, and entered all the failover commands except the "failover" command.
    3. I made sure that the new primary unit ran the same code (I checked only the main code version; I did not check that the ASDM versions matched). The ASDM versions were in fact different on the two boxes.
    4. After powering up the box and connecting the cables, I entered failover. It then told me that the SSL license was not the same on both units and that it was disabling failover.
    5. I applied for an activation key from [email protected] and then got the SSL license from them.
    6. The next day I went back to the customer and installed the license key. After installing it, I entered failover. It gave me the message "No response from mate".
    7. I then entered no failover to disable failover on the new primary unit.
    8. I then went to the secondary (active) unit and entered failover, as failover was disabled.
    9. I then went back to the primary unit and entered failover.
    10. This is where blank config replication started !!
    11. Reloaded secondary unit to undo the blank running config
    12. Went to Primary unit and disconnected the failover cable. Rebooted the primary unit and connected the failover cable.
    13. Secondary came up as active, primary then came up, and this time primary honored the secondary as active and did config replication
    14. All was well then!!
    I'm still not sure why this happened, and it was a bit embarrassing to see it after 3.5 years of firewalling experience.
    Anyway, I am willing to learn and improve from now on.
    Next time, I would probably apply the failover configs, reload, and connect the failover cable while the unit is reloading.
    I think the lesson is that when a unit reloads, it always honors the currently active unit and does not try to override its role.
    This is what worked for me.
    Thanks a lot
    Gautam

  • Tidal Email Adapter error monitoring folder 'INBOX':Illegal whitespace in address

    I am experiencing an issue with a tidal email adapter and email events are not being triggered.
    TIDAL Enterprise Scheduler: version 5.3.1.316
    Java version: 1.6.0_27
    Java Virtual Machine version: 20.2-b06
    Adapter Host: version 5.3.1.299
    Java version: 1.6.0_27
    Java Virtual Machine version: 20.2-b06
    The issue started shortly after our environment did an unexpected failover a couple of days ago.  The servers have been rebooted but the issue still appears.
    I have disabled the email adapter and created a new one, but the issue is still there.  The email mailbox is working and receiving messages, and I am able to connect to the mailbox with a POP3 client with no problem.
    Any suggestions on resolving this would be appreciated.
    Here is the Tidal Master email adapter log messages:
    02/29 13:45:12:759[MD-8]: (mem=5062504/16515072) Connection 7: Error monitoring folder 'INBOX':Illegal whitespace in address
    02/29 13:45:12:759[MD-8]: (mem=5059960/16515072) javax.mail.internet.AddressException: Illegal whitespace in address in string ``Mail Delivery Subsystem''
    at javax.mail.internet.InternetAddress.checkAddress(InternetAddress.java:900)
    at javax.mail.internet.InternetAddress.parse(InternetAddress.java:793)
    at javax.mail.internet.InternetAddress.parseHeader(InternetAddress.java:554)
    at javax.mail.internet.MimeMessage.getAddressHeader(MimeMessage.java:658)
    at javax.mail.internet.MimeMessage.getFrom(MimeMessage.java:321)
    at com.tidalsoft.service.logic.EmailInterface.b(Unknown Source)
    at com.tidalsoft.service.logic.EmailInterface.a(Unknown Source)
    at com.tidalsoft.service.logic.EmailInterface.a(Unknown Source)
    at bd.p(Unknown Source)
    at bd.l(Unknown Source)
    at com.tidalsoft.service.logic.ConnectionMessageHandler.onPoll(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at bt.a(Unknown Source)
    at ba.b(Unknown Source)
    at ba.a(Unknown Source)
    at com.tidalsoft.framework.message.BaseMessageHandlerImpl.onMessage(Unknown Source)
    at com.tidalsoft.framework.data.DataWrapper.onMessage(Unknown Source)
    at ad.run(Unknown Source)
    Thanks,
    Pete

    We found the problem: there was a message in the inbox that is monitored by the email event.  The message was an undeliverable, and the sender was = Mail Delivery Subsystem
    We had received previous undeliverables from this company, and the sender was =  Mail Delivery Subsystem <[email protected]
    The email monitoring was not impacted by those previous messages, but as soon as the sender showed only Mail Delivery Subsystem with no email domain name, it failed.
    Once we removed the messages from the inbox, the email events triggered as expected.  I now need to build a rule to remove these messages from the mailbox.
    Thanks,
    Pete
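
    The check that such a cleanup rule needs could look something like the sketch below. It is written in Python purely for illustration and has nothing to do with Tidal's own API. The function names and the example address are hypothetical, and the pattern is a deliberately loose "does this From header contain something shaped like user@domain" test, not a full RFC 5322 validator.

```python
import re

# Loose test for "something@something" anywhere in the header value.
ADDRESS_RE = re.compile(r"[^\s<>@]+@[^\s<>@]+")


def has_address(from_header):
    """Return True if the From header contains an address-like token."""
    return bool(ADDRESS_RE.search(from_header))


def split_by_sender(from_headers):
    """Split headers into (keep, remove): remove holds display-name-only senders."""
    keep, remove = [], []
    for header in from_headers:
        (keep if has_address(header) else remove).append(header)
    return keep, remove


# Hypothetical examples modelled on the two cases described above.
headers = [
    "Mail Delivery Subsystem <mailer-daemon@example.com>",
    "Mail Delivery Subsystem",
]
keep, remove = split_by_sender(headers)
print("keep:", keep)
print("remove:", remove)
```

    A rule built on this idea would move or delete the display-name-only messages before the email event polls the INBOX, which is the condition that triggered the AddressException in the log.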

  • UCCX 8.0 SNMP notification to report failover from active to the standby server?

    Does anyone know the UCCX 8.0 SNMP notification that reports failover from the active to the standby server?  (The specific notification.)
    ftp://ftp.cisco.com/pub/mibs/v2/CISCO-VOICE-APPS-MIB.my
    I see this in the MIB.
    cvaModuleStart NOTIFICATION-TYPE
      OBJECTS   { cvaAlarmSeverity, cvaModuleName }
      STATUS    current
      DESCRIPTION
            "A cvaModuleStart notification signifies that an
            application module or subsystem has successfully
            started and transitioned into in-service state. 
            This notification is working in conjunction with
            the cvaModuleStop notification to notify the start
            and stop status of a particular application module."
      ::= {ciscoVoiceAppsMIBNotifications 1}

    Attached are two files:
    cad-ecc-viewer.html
    This is a template HTML document which dictates how the pop up will look and what data fields are available.
    cad-ecc-viewer.vbs
    This is the Windows Scripting Host file which you run from a CDA workflow, passing it the values of the call data. It then launches an instance of IE and loads the above template.
    By default the code is set up to use the following data from the call, but it can be modified to work with more, less, or different data:
    Customer Name
    Customer Status (like Premium, or Platinum)
    Customer Number (like an account number)
    Customer Phone Number
    So when you specify the VBS file to run in CDA you need to pass those variables in that order.
    The CDA should expect the VBS file on the root of C:\ by default, and the VBS file expects the HTML Template on the root of C:\ also.
    I have only tested this on UCCX 7x and IE 8x.  Use the code as a guide to your own solution, one that suits your business requirements.
    EDIT: I see that this does not work on my Win7/IE9 system, so I will spend some time updating the code for Win7/IE9 and I'll let you know how it goes.
    Anthony Holloway
    Please use the star ratings to help drive great content to the top of searches.

  • Unexpected DAG mailbox database failover

    Good evening
    I am running a three node Exchange 2013 server environment in a stretched DAG configuration. I have been experiencing a problem recently whereby failover is occurring unexpectedly. Below is a brief description of the environment.
    Site 1 (Head Office)
    2 x Exchange 2013 CAS/MBX (dual role servers, let's call them A and B)
    Both are members of the only DAG we have
    They both have copies of the only mailbox database we have
    Server A has activation preference 1
    Server B has activation preference 2
    Site 2 (Datacenter)
    1 x Exchange 2013 CAS/MBX (dual role server, let's call it C)
    C is a member of the same DAG as above
    It hosts a copy of the same mailbox database as above
    Server C has activation preference 3 (although I have configured it to never automatically activate a database copy)
    Server C will only be actively used in a disaster scenario where we lose both A and B in the head office for whatever reason
    Recently I have noticed that when the mailbox database is mounted and active on server A, after a period of not more than an hour or two the database is automatically moved to server B. This has happened 3 times now.
    I find these informational alerts in the event viewer on server A.
    Event ID 2136
    Log Name: Application
    Source: MSExchangeRepl
    Date: 14/11/2014 16:01:03
    Event ID: 2136
    Task Category: Service
    Level: Warning
    Keywords: Classic
    User: N/A
    Computer: ServerA.domain.internal
    Description:
    Unable to communicate with the Microsoft Exchange Information Store service to coordinate log truncation for database 'Mailbox Database 1\ServerA' due to an RPC communication failure. Error: 3355381764
    Extended error: Failed to open a log truncation context to source server 'ServerA.domain.internal'. Hresult: 0xc7ff1004. Error: Error returned from an ESE function call (-1305).
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
    <System>
    <Provider Name="MSExchangeRepl" />
    <EventID Qualifiers="32772">2136</EventID>
    <Level>3</Level>
    <Task>1</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2014-11-14T16:01:03.000000000Z" />
    <EventRecordID>4381652</EventRecordID>
    <Channel>Application</Channel>
    <Computer>ServerA.domain.internal</Computer>
    <Security />
    </System>
    <EventData>
    <Data>Mailbox Database 1\ServerA</Data>
    <Data>3355381764</Data>
    <Data>Failed to open a log truncation context to source server 'ServerB.domain.internal'. Hresult: 0xc7ff1004. Error: Error returned from an ESE function call (-1305).
    </Data>
    </EventData>
    </Event>
    Event ID  3169
    Log Name: Application
    Source: MSExchangeRepl
    Date: 14/11/2014 16:01:04
    Event ID: 3169
    Task Category: Service
    Level: Information
    Keywords: Classic
    User: N/A
    Computer: ServerA.domain.internal
    Description:
    (Active Manager) Database Mailbox Database 1 was successfully moved from ServerA.domain.internal to ServerB.domain.internal. Move comment: Managed availability system failover initiated by Responder=RpsDeepTestPSProxyFailover Component=RPS.
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
    <System>
    <Provider Name="MSExchangeRepl" />
    <EventID Qualifiers="16388">3169</EventID>
    <Level>4</Level>
    <Task>1</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2014-11-14T16:01:04.000000000Z" />
    <EventRecordID>4381655</EventRecordID>
    <Channel>Application</Channel>
    <Computer>ServerA.domain.internal</Computer>
    <Security />
    </System>
    <EventData>
    <Data>Mailbox Database 1</Data>
    <Data>ServerA.domain.internal</Data>
    <Data>ServerB.domain.internal</Data>
    <Data>Managed availability system failover initiated by Responder=RpsDeepTestPSProxyFailover Component=RPS.</Data>
    </EventData>
    </Event>
    Does anyone have any ideas as to why this might be happening or whether there are any additional troubleshooting steps I can take?
    Regards

    Hi George,
    I agree with Jared: running the CollectOverMetrics.ps1 script is a good way to troubleshoot the issue.
    Besides this method, you can also refer to the following article to troubleshoot the issue using other useful information, such as mailbox database replication status and the HighAvailability channel event logs.
    http://technet.microsoft.com/en-us/library/dd351258(v=exchg.150).aspx
    Best regards,
    Niko Cheng
    TechNet Community Support
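
    If the database keeps moving, it may help to collect every event 3169 and tally which Managed Availability responder initiated each move. The following is a small Python sketch that parses the Event XML posted above. The order of the Data fields (database, source, target, move comment) is inferred from the event description in the post and should be treated as an assumption, as should the function name.

```python
import re
import xml.etree.ElementTree as ET

NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Trimmed copy of the event 3169 XML from the post.
EVENT_XML = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="MSExchangeRepl" />
<EventID Qualifiers="16388">3169</EventID>
<TimeCreated SystemTime="2014-11-14T16:01:04.000000000Z" />
<Computer>ServerA.domain.internal</Computer>
</System>
<EventData>
<Data>Mailbox Database 1</Data>
<Data>ServerA.domain.internal</Data>
<Data>ServerB.domain.internal</Data>
<Data>Managed availability system failover initiated by Responder=RpsDeepTestPSProxyFailover Component=RPS.</Data>
</EventData>
</Event>"""


def summarize_move(xml_text):
    """Extract database, source, target, responder and component from one event."""
    root = ET.fromstring(xml_text)
    event_id = int(root.findtext("e:System/e:EventID", namespaces=NS))
    data = [d.text or "" for d in root.findall("e:EventData/e:Data", NS)]
    # Assumed field order: database, source server, target server, move comment.
    database, source, target, comment = data[:4]
    responder = re.search(r"Responder=(\w+)", comment)
    component = re.search(r"Component=(\w+)", comment)
    return {
        "event_id": event_id,
        "database": database,
        "source": source,
        "target": target,
        "responder": responder.group(1) if responder else None,
        "component": component.group(1) if component else None,
    }


summary = summarize_move(EVENT_XML)
print(summary)
```

    In the posted event the responder is RpsDeepTestPSProxyFailover in the RPS component, so the move was initiated by a Managed Availability health check rather than by a database or replication fault. That points the investigation toward whatever that probe tests on server A.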

  • UCCX HA failover, what is sync'ed?

    Hi
    When setting up UCCX HA, the configuration is sync'ed with the heartbeat, and if the primary fails the slave takes over.
    I have found information about the exchange of configuration, but I haven't found information about what happens to sessions in progress when the primary fails.
    Is it stateful? That is, are the sessions copied from the primary to the slave so that sessions in progress are kept alive, or do they have to be re-established (the callers are disconnected and have to place the call again)?
    If I have overlooked the document(s) describing this, I'd like to be referred to them.
    Thanks in advance.        
    /Tony

    Hi Tony,
    If the UCCX Publisher (primary Engine) fails and the Subscriber (second node) takes over Engine mastership (assuming that CUCM is up and running during this time and no failover is happening in the CUCM cluster), then:
    1. All connected calls will continue to work, even though you will see a warning message in CAD and CSD stating that the active connection to the first UCCX node is lost. The RTP stream is kept alive, but you may not see exact call detail records because the primary Engine has failed. CAD and CSD will automatically connect to the second UCCX node.
    2. New calls attempted during the failover will not get connected; you need to wait until the secondary node takes control fully.
    3. You cannot perform any config changes if either UCCX node is down in a High Availability setup.
    4. IPPA agents need to log in again with the second UCCX node's service selected.
    Hope this helps.
    Anand
    Please rate helpful posts !!

  • How to test UCCX failover (no HA)

    Hello,
    I would like to test UCCX failover with a single UCCX server; no HA is configured.
    In UCCX, under Cisco Unified CM Configuration, two CallManagers are configured as AXL Service Providers and two as CTI Managers. I am thinking of stopping the first CallManager in each list (AXL Service Providers and CTI Managers).
    Can you tell me how I can test UCCX failover (CAD, calls, etc.)?
    Only one UCCX server is installed (no HA).
    Thank you so much.
    BR
    Aubert

    Hi Gergely,
    Thanks a lot for your advice. If I understand correctly, the test scenario is the same whether we disconnect the first CallManager with the AXL Service Provider activated or the one with the CTI Manager activated. Or is the test scenario different for the two services?
    -Test calls and verify that they are not disconnected
    -Verify that CAD does not disconnect
    -Verify that reports work
    Thank you so much.
    BR
    Aubert

  • UCCX 8.5.1 Failover scenario

    Hi All,
    Would someone please be kind enough to explain how failover works in the latest version of UCCX? We found that when the primary server fails, everything fails over to the standby server correctly. The confusing bit is what happens when the primary server comes back online. When testing in our lab, and on the day of the switchover (which was out of hours), the agents, supervisors and all calls moved back to the primary. I know this was not the norm in previous versions. The issue I'm facing now is that the customer had a power cut during business hours and everything failed over correctly, but when the primary came back up, nothing moved back to the primary. Can someone please confirm whether the primary will only assume control if there are no active calls in the system?
    Thank you
    Brett

    With the new version it seems there is no fixed master: whichever server comes up first becomes the master, and it will fail over when there is an issue with the master.  If there's an error and the standby becomes master, then it will not go back to the other server unless it too suffers an error.
    david
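
    The behaviour described in this reply (first node up becomes master, no automatic failback) can be modelled with a few lines of Python. This is only a toy model of that description, with made-up class and node names. It is not how UCCX is implemented, and it does not explain the failback Brett saw in the lab.

```python
class HaPair:
    """Toy model of non-preemptive mastership between two nodes."""

    def __init__(self):
        self.up = set()
        self.master = None

    def node_up(self, node):
        self.up.add(node)
        # A node only becomes master if nobody currently holds mastership.
        if self.master is None:
            self.master = node

    def node_down(self, node):
        self.up.discard(node)
        if self.master == node:
            # Mastership moves to a surviving node, if there is one.
            self.master = min(self.up) if self.up else None


pair = HaPair()
history = []

pair.node_up("primary")
pair.node_up("secondary")
history.append(pair.master)   # primary came up first

pair.node_down("primary")     # e.g. a power cut on the primary
history.append(pair.master)

pair.node_up("primary")       # primary returns, but does not preempt
history.append(pair.master)

print(history)
```

    Under this model the secondary stays master after the primary returns, which matches what the customer saw after the power cut.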

  • UCCX Failover and NTP

    According to what I've found in UCCX 9.0 installation guides, the secondary node in a HA installation looks at the primary for NTP. What happens in the case of failover when the primary is unavailable?
    Thanks in advance,
    Dillon

    When you issue a "show ntp status" on the secondary node, it in fact does list the primary node's address as its NTP source.  This is true whether the secondary node is Slave or Master.
    With that, in the event of a failover between UCCX nodes, as long as the primary node is reachable on the network, the secondary will continue to sync its clock to it.
    However, if the primary is offline altogether, then the secondary's clock could start to drift.  This is no different from a healthy UCCX HA pair where the NTP servers go offline and the primary's clock starts to drift.
    I'm not positive on what this looks like in production, but I would suspect it's not critical for the secondary to keep processing calls.  It doesn't depend on the time on any other server to function properly.  Maybe the log files will have timestamps which do not line up with your CUCM log files.
    Since we're on the topic of extended outages, if you do take a large enough hit to your primary node that you are worried about time drift, then you also have to consider this in an HAoW environment:
    Data in Agent Datastore, Historical Datastore and Repository Datastore of Informix IDS database start merging after the network partition is restored and this could potentially generate heavy data traffic over the WAN. Cisco recommends restoring the WAN link during after hours to minimize the performance impact.
    Source: UCCX SRND, Page 4-12
    Anthony Holloway
    Please use the star ratings to help drive great content to the top of searches.
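
To put the drift concern raised above into numbers, here is a rough back-of-the-envelope sketch. The drift rate is a purely hypothetical figure chosen for illustration; the real rate depends on the hardware or hypervisor clock and is not stated anywhere in this thread.

```python
def estimated_drift_seconds(outage_seconds: float, drift_ppm: float) -> float:
    """Worst-case clock offset after free-running for outage_seconds.

    drift_ppm is the oscillator error in parts per million. It is an
    assumed input, not a value published for UCCX.
    """
    return outage_seconds * drift_ppm / 1_000_000


# Hypothetical example: a 24-hour outage of the primary at an assumed 50 ppm.
one_day = 24 * 60 * 60
offset = estimated_drift_seconds(one_day, drift_ppm=50)
```

With those assumed inputs the offset comes out at roughly 4.3 seconds, enough to make log timestamps disagree with CUCM traces but, as the reply suggests, probably not enough to stop the node processing calls.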

  • UCCX 8 Network Failover

    All,
    I looked in the UCCX 8.0 SRND for information about server network failover and could not find any. Before UCCX 8, a UCCX server only connected on one network interface. UCCX 8 does have the command "set network failover ena" in its command list. The question is: does UCCX 8 support network failover?
    Thanks,
    Wenqian

    Hi
    In my experience the number of outages caused by software failures or administrative errors outweighs those caused by LAN/NIC failures by a huge amount. So much so that the question of NIC failover doesn't really matter much, especially when you have a pair of UCCX servers in the cluster (as most of my customers do, especially where redundancy is important). I think it's more important to stay within the guidelines, so that when it does fail TAC don't have the option of pointing at your NIC configuration as 'unsupported'.
    If NIC failover is really that important to the customer, then one option is to deploy it on UCS. You can have a single vNIC on the virtual machine, and multiple physical NICs in a fault tolerant mode whilst still staying in a 'supported' config.
    Aaron
    Please rate helpful posts..

  • CAD Log out due to CTI failed connection Failover UCCX 8.5

    Hi guys,
    I have an issue with CAD running on CCX 8.5 SU4. Every day, in the morning or the afternoon, I have connectivity problems with the CTI manager (UCCX).
    I found the following discussion and followed all the steps:
    https://supportforums.cisco.com/thread/2141592
    I updated to the latest version of CCX and changed the LRO parameter under UCS, but I still experience the same issue.
    All CAD clients lose connectivity with the CTI manager at the same time.
    With IP Phone Agent I see the same problem, but only sometimes.
    Here are the logs:
    CAD LOGS :
    2013-11-01 22:36:06:683 INFO VOIP4001 The Desktop Monitoring module on the local host has been successfully initialized.
    2013-11-01 22:36:07:667 INFO VOIP2021 Desktop monitoring enabled for extension [860].
    2013-11-01 22:40:55:183 WARN ACMI3002 Unable to send HEARTBEAT_REQ to CTI service: <com.calabrio.util.socket.SplkSocketException: Write error: 10054:Se ha forzado la interrupción de una conexión existente por el host remoto.>.
    2013-11-01 22:40:55:183 WARN ACMI3002 Unable to send CLOSE_REQ to CTI service: <com.calabrio.util.socket.SplkSocketException: Write error: 10054:Se ha forzado la interrupción de una conexión existente por el host remoto.>.
    2013-11-01 22:40:55:183 INFO STD0005 Client <CPDGlobals> disconnected from service at <10.100.10.7>.
    2013-11-01 22:40:55:339 INFO STD0004 Client <CPDGlobals> connected to service at <10.100.10.7>.
    2013-11-01 22:40:56:152 INFO VOIP4005 Desktop monitoring disabled for extension [860].
    2013-11-01 22:40:56:261 INFO DESK1120 Login to telephony server.
    2013-11-01 22:40:59:308 INFO DESK1118 Login to chat server.
    2013-11-01 22:40:59:980 INFO FCCC0000 Successfully connected to the Desktop Chat Service.
    CCX ENGINE :
    EXCEPTION:java.net.SocketException: Socket closed
    212732: Nov 04 17:03:40.018 CET %MIVR-ICD_CTI-3-EXCEPTION: at java.net.SocketInputStream.socketRead0(Native Method)
    [addr=/172.16.143.18,port=50851,localport=12028]
    212746: Nov 04 17:03:40.019 CET %MIVR-ICD_CTI-7-UNK:returning from thread : MIVR_ICD_CTI_client_thread_102-322-
    212747: Nov 04 17:03:40.019 CET %MIVR-ICD_CTI-7-UNK:NULL socketInfo is found due to socket is closed. Socket[addr=/10.100.10.7,port=51479,localport=12028]
    212748: Nov 04 17:03:40.019 CET %MIVR-ICD_CTI-7-UNK:returning from thread : MIVR_ICD_CTI_client_thread_104-326-
    212749: Nov 04 17:03:40.019 CET %MIVR-SS_RM-7-UNK:LOGOUT_OBSERVE [479]: Removed Address observer and Call observer for extn [479]
    212750: Nov 04 17:03:40.019 CET %MIVR-SS_RM-7-UNK:Removed mapping for extension key [479]
    212751: Nov 04 17:03:40.019 CET %MIVR-SS_RM-7-UNK:The mapping corresponding to 479 has been removed from SecondaryExtToPrimaryExt!
    212752: Nov 04 17:03:40.019 CET %MIVR-SS_RM-7-UNK: All mappings for ICD extension key: 863 has been removed from PrimaryExtToSecondaryExt!
    212753: Nov 04 17:03:40.019 CET %MIVR-SS_RM-7-UNK:The extension 863 has been closed
    212754: Nov 04 17:03:40.019 CET %MIVR-SS_RM-7-UNK:Rsrc: rosario New State:LOGOFF Old State:UNAVAILABLE Reason code:32765 = sin conexion.
    All the PCs run Windows 7 and have the correct version installed.
    Any ideas? The customer is very angry.
    Thanks
    Carlos
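
To see whether the drops really hit all desktops at the same moments, the CAD logs can be summarised per hour. The sketch below assumes the timestamp layout shown in the excerpt above and treats the `STD0005` line ("disconnected from service") as the disconnect marker; both are assumptions drawn only from this excerpt, not from CAD documentation.

```python
import re
from collections import Counter

# Layout copied from the excerpt: "2013-11-01 22:40:55:183 INFO STD0005 ..."
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}):\d{2}:\d{2}:\d{3} (\w+) (\w+) ")


def disconnects_per_hour(lines, code="STD0005"):
    """Count log lines carrying the given message code, grouped by hour."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group(4) == code:
            counts[f"{match.group(1)} {match.group(2)}:00"] += 1
    return counts


sample = [
    "2013-11-01 22:40:55:183 INFO STD0005 Client <CPDGlobals> disconnected from service at <10.100.10.7>.",
    "2013-11-01 22:40:55:339 INFO STD0004 Client <CPDGlobals> connected to service at <10.100.10.7>.",
]
summary = disconnects_per_hour(sample)
```

Running the same function over the logs from several agent PCs and comparing the hourly buckets would show whether the disconnects cluster at the same times, which would point at the server or network rather than individual desktops.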

    Hi Ravi,
    It is not HAoWAN; it is a standalone server.
    After configuring port mirroring to capture all the traffic from the UCS, I found that the PCs are discarding packets with destination port TCP 7. When CCX sends retransmissions, CAD disconnects.
    I opened TCP port 7 on Windows but still have the same error: after 8 hours I had more than 20 disconnected sessions.
    I'll upload the traces for you; you can paste this filter:
    ip.src == 172.16.144.110 or ip.dst == 172.16.144.110
    CCX is 10.100.10.7
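
A quick way to check whether a given agent PC accepts connections on TCP port 7 after the firewall change is a plain connect test. This is a generic sketch using the Python standard library; that CCX relies on TCP 7 for its keepalive is the poster's observation from the capture, not something confirmed here, and the address used is only the PC address taken from the filter above.

```python
import socket


def tcp_port_open(host: str, port: int = 7, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Address taken from the capture filter in the post; adjust as needed.
    print(tcp_port_open("172.16.144.110", 7))
```

A `False` result only shows that the connection attempt failed from where the script runs; a firewall between the test machine and the PC could produce the same result.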

  • UCCX 7.01 CTI manager failover

    Hi, I couldn't find any definitive documentation on what determines whether a CCX server fails over to the secondary CTI manager. Is it loss of ping? Is it CTI service related?
    Env:
    2 CCM servers running 7.0.2
    1 CCX server running 7.0.1
    Our CCX server is pointed to the subscriber first and the publisher second under AXL provider and CTI manager. Replication between the 2 servers failed on Jan 2nd. When folks tried to log into CCX on the morning of Jan 3rd, they received an error stating that a CTI manager was not available. We logged into CCX and removed the subscriber from the list of selected AXL and CTIM servers, and people were able to log into CCX. We rebooted the subscriber and after some time, replication was restored.
    My main concern is why the CTI Manager running on the publisher did not take over. Was it because the subscriber was still pingable? The TAC engineer stated that the failover was based on pings. Can anyone verify this? He did not seem too confident when he made this statement, and it would make much more sense to have the failover based on service availability.
    We've since gone into RTMT and set up e-mail alerts so that in the future we'll at least be given a heads up if there is a replication failure.
    Thanks in advance!
    -Eric

    This is what I found but it does not seem to contain any useable information.
    503: Nov 11 15:19:46.829 EST %MIVR-CLUSTER_MGR-7-UNK:Post Convergence Event: CONVERGENCE_COMPLETED, name=CRS SQL Server - Repository
    504: Nov 11 15:19:46.844 EST %MIVR-PROMPT_MGR-1-MGR_PARTIAL_SERVICE:Prompt Manager in partial service:
    505: Nov 11 15:19:46.844 EST %MIVR-GRAMMAR_MGR-6-MGR_IN_SERVICE:Grammar Manager in service:
    506: Nov 11 15:19:46.844 EST %MIVR-CLUSTER_MGR-7-UNK:try to process MasterConvergenceCompletedCmdImpl: name CRS SQL Server - Historical, nodeId=1, type=MASTER_ELECTED, uniqueId=99, master=true, updateTick=423, baseTick=422, nodeCurrentTick=423

  • UCCX 8.5.1 Failover setup error

    Hi All,
    I have just installed an HA UCCX 8.5.1 SU3 server and added it to an existing server to create a cluster.  The install completed OK and all services are activated.  The last message after the first logon on the 2nd node said the server was added to the cluster and gave the following message:
    Informational Message: If you have configured CM Telephony Call Control Groups on your Primary (First) Node, update the Port Groups to create the corresponding ports for this node.
    We already have Call Control Groups as the UCCX has been live for several years.  I log on to the primary, navigate to the port groups and select a port group.  I choose the 2nd node I have just added from the drop-down and click Go.  I then click Update but get an error that states "failed to update user in CM".  This is the same error I get in each of the port groups I have tried this on.
    The user that was set up on the first UCCX node was UCCXAdministrator, and I made sure to enter the credentials case-sensitively.  However, when I check the CUCM or UCCX user management pages I can't find this user listed, even though I can log on to UCCX with this account.  Any suggestions on how to get the port groups set up on the new 2nd node?
    thanks in advance
    David

    Hi David,
    About the account UCCXAdministrator, which you said you can log into UCCX with but couldn't find in CUCM: the fact that you can log into UCCX with it tells me that it's a CUCM End User account, so be sure to search End Users and not Application Users for it.  With that said, since you can log in with it, this account probably is not that important in your troubleshooting.
    The account which you need to find is a CUCM Application User account, which should have AXL rights.  Since you can log into UCCX at all, I would say that this account is set up correctly as well.  That's because UCCX authenticates via CUCM's AXL API, and if that account weren't set up and working properly, you wouldn't be able to log in to UCCX.
    You mentioned that you went into the CCG, chose the second server from the drop-down, clicked Go, then clicked Update.  But you never mentioned whether you actually configured anything first.  Did you?  Did you create the second port group for the second node before you clicked Update?  Or did you simply click Go, then click Update?
    Also, since the error happened while in AppAdmin, the logs you want to collect and review are the Application Administration logs in RTMT.  They may or may not give you more clues as to what's happening.
    Anthony Holloway
