Extend-TCP client not failing over to another proxy after machine failure

I have a configuration of three hosts: host A runs the client, and hosts B and C each run a proxy and a cache instance. I've defined an AddressProvider that returns the address of B and then C. The client just repeatedly calls the cache (read-only). The client configuration is:
<?xml version="1.0"?>
<cache-config
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns="http://xmlns.oracle.com/coherence/coherence-cache-config"
    xsi:schemaLocation="http://xmlns.oracle.com/coherence/coherence-cache-config
                        coherence-cache-config.xsd">
  <caching-scheme-mapping>
    <cache-mapping>
      <cache-name>cache1</cache-name>
      <scheme-name>extend-near</scheme-name>
    </cache-mapping>
  </caching-scheme-mapping>
  <!-- Use ExtendTCP to connect to a proxy. -->
  <caching-schemes>
    <near-scheme>
      <scheme-name>extend-near</scheme-name>
      <front-scheme>
        <local-scheme>
          <high-units>1000</high-units>
        </local-scheme>
      </front-scheme>
      <back-scheme>
        <remote-cache-scheme>
          <scheme-ref>remote-cache1</scheme-ref>
        </remote-cache-scheme>
      </back-scheme>
      <invalidation-strategy>all</invalidation-strategy>
    </near-scheme>
    <remote-cache-scheme>
      <scheme-name>remote-cache1</scheme-name>
      <service-name>cache1ExtendedTcpProxyService</service-name>
      <initiator-config>
        <tcp-initiator>
          <remote-addresses>
            <address-provider>
              <class-name>com.foo.clients.Cache1AddressProvider</class-name>
            </address-provider>
          </remote-addresses>
          <connect-timeout>10s</connect-timeout>
        </tcp-initiator>
        <outgoing-message-handler>
          <request-timeout>5s</request-timeout>
        </outgoing-message-handler>
      </initiator-config>
    </remote-cache-scheme>
  </caching-schemes>
</cache-config>
If I shut down the proxy that the client is connected to on host B, failover occurs quickly: the AddressProvider is called and the client reconnects. But if I shut down the network for host B (or drop the TCP port of the proxy on B) to simulate a machine failure, failover does not occur. The client simply keeps trying to contact B and dutifully times out after 5 seconds. It never asks the AddressProvider for another address.
How do I get failover to kick in?
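For reference, the AddressProvider mentioned above simply hands the initiator candidate addresses to try. A self-contained sketch of a round-robin provider, using a stand-in interface in place of Coherence's com.tangosol.net.AddressProvider so the example compiles without the Coherence jar (the real Cache1AddressProvider would implement the Coherence interface directly):

```java
import java.net.InetSocketAddress;

public class RoundRobinAddressProvider {

    // Stand-in for com.tangosol.net.AddressProvider (assumed shape):
    // getNextAddress() returns the next candidate, or null once every
    // address has been tried for the current connect attempt;
    // accept()/reject() report how the last returned address worked out.
    public interface Provider {
        InetSocketAddress getNextAddress();
        void accept();
        void reject(Throwable cause);
    }

    // A provider that rotates through a fixed list of proxy endpoints
    // (e.g. host B first, then host C).
    public static Provider create(final InetSocketAddress... addresses) {
        return new Provider() {
            private int next;                          // next address to hand out
            private int remaining = addresses.length;  // addresses left in this attempt

            public InetSocketAddress getNextAddress() {
                if (remaining == 0) {
                    remaining = addresses.length;      // reset for the next attempt
                    return null;                       // signal: list exhausted
                }
                remaining--;
                InetSocketAddress addr = addresses[next];
                next = (next + 1) % addresses.length;  // rotate so the other host is tried next
                return addr;
            }

            public void accept() { remaining = addresses.length; }  // connected OK
            public void reject(Throwable cause) { /* keep rotating */ }
        };
    }
}
```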

Hello,
If you are testing Coherence*Extend failover in the face of a network, machine, or NIC failure, you should enable Connection heartbeats on both the <tcp-initiator/> and <tcp-acceptor/>. For example:
Client cache config:
<remote-cache-scheme>
  <scheme-name>extend-direct</scheme-name>
  <service-name>ExtendTcpCacheService</service-name>
  <initiator-config>
    <tcp-initiator>
      <remote-addresses>
        <socket-address>
          <address system-property="tangosol.coherence.extend.address">localhost</address>
          <port system-property="tangosol.coherence.extend.port">9099</port>
        </socket-address>
      </remote-addresses>
      <connect-timeout>2s</connect-timeout>
    </tcp-initiator>
    <outgoing-message-handler>
      <heartbeat-interval>10s</heartbeat-interval>
      <heartbeat-timeout>5s</heartbeat-timeout>
      <request-timeout>15s</request-timeout>
    </outgoing-message-handler>
  </initiator-config>
</remote-cache-scheme>

Proxy cache config:
<proxy-scheme>
  <scheme-name>example-proxy</scheme-name>
  <service-name>ExtendTcpProxyService</service-name>
  <thread-count system-property="tangosol.coherence.extend.threads">2</thread-count>
  <acceptor-config>
    <tcp-acceptor>
      <local-address>
        <address system-property="tangosol.coherence.extend.address">localhost</address>
        <port system-property="tangosol.coherence.extend.port">9099</port>
      </local-address>
    </tcp-acceptor>
    <outgoing-message-handler>
      <heartbeat-interval>10s</heartbeat-interval>
      <heartbeat-timeout>5s</heartbeat-timeout>
      <request-timeout>15s</request-timeout>
    </outgoing-message-handler>
  </acceptor-config>
  <autostart system-property="tangosol.coherence.extend.enabled">true</autostart>
</proxy-scheme>

This is necessary because it may take the TCP/IP stack a considerable amount of time (O/S dependent) to detect that its peer is unavailable after a network, machine, or NIC failure.
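To see why this helps, the initiator-side logic can be sketched generically: send a ping every heartbeat-interval and declare the connection dead if no reply arrives within heartbeat-timeout, instead of waiting on the O/S TCP timers. A simplified illustration (not Coherence's actual implementation):

```java
public class HeartbeatWatchdog {
    private final long intervalMillis;  // heartbeat-interval: how often to ping
    private final long timeoutMillis;   // heartbeat-timeout: how long to wait for a reply
    private long lastPingAt = Long.MIN_VALUE;  // when the last ping was sent
    private boolean awaitingReply;

    public HeartbeatWatchdog(long intervalMillis, long timeoutMillis) {
        this.intervalMillis = intervalMillis;
        this.timeoutMillis = timeoutMillis;
    }

    // Called periodically with the current time (millis). Returns true when
    // the peer missed the heartbeat deadline and the connection should be
    // closed so the initiator can fail over to the next address.
    public boolean onTick(long now) {
        if (awaitingReply) {
            return now - lastPingAt >= timeoutMillis;  // no reply in time: dead
        }
        if (lastPingAt == Long.MIN_VALUE || now - lastPingAt >= intervalMillis) {
            lastPingAt = now;       // (a real implementation would send a ping here)
            awaitingReply = true;
        }
        return false;
    }

    // Called when the peer's heartbeat response arrives.
    public void onReply() { awaitingReply = false; }
}
```

With the 10s interval and 5s timeout from the configs above, a dead peer is noticed within roughly 15 seconds regardless of how long the O/S would take to report the broken socket.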
Jason

Similar Messages

  • NIC not failing Over in Cluster

    Hi there... I have configured a 2-node cluster with the SoFS role, for VM clustering and HA, using Windows Server 2012 Datacenter. The current setup: the host server has 3 NICs (2 with a default gateway configured, 192.x.x.x; the 3rd NIC is for heartbeat, 10.x.x.x). CSV is configured (I can also see the shortcut in C:\). I'm planning to set up a few VMs pointing to the disks on the 2 separate storage servers (1 NIC in the 192.x.x.x network, and also 2 NICs in the 10.x.x.x network). I am able to install a VM and point its disk to the share in cluster volume 1.
    I have created 2 VM switches on the 2 separate host servers (using Hyper-V Manager). When I test the functionality by taking Node 2 down, I can see the disk owner node changing to Node 1, but VM NIC 2 is not failing over automatically to VM NIC 1 (though I can see VM NIC 1 showing up unselected in the VM settings). When I go to VM Settings > Network Adapter, I get this error:
    "An error occurred for resource VM 'VM Name'. Select the 'information details' action to view events for this resource. The network adapter is configured to a switch which no longer exists or a resource pool that has been deleted or renamed" (with a configuration error in the "Virtual Switch" drop-down menu).
    Can you please let me know any resolution to fix this issue...Hoping to hear from you.
    VT

    Hi,
    From your description “My another thing I would like to test is...I also would like to bring a disk down (right now, I have 2 disk - CSV and one Quorum disk) for that 2 node
    cluster. I was testing by bringing a csv disk down, the VM didnt failover” Are you trying to test the failover cluster now? If so, please refer the following related KB:
    Test the Failover of a Clustered Service or Application
    http://technet.microsoft.com/en-us/library/cc754577.aspx
    Hope this helps.

  • Server Pool Master fails and cannot fail over to another VM Server

    Dear All,
    Oracle VM 2.2.2
    I have 2 VM Servers connect to Storage 6140 Array and on VM Manager I enable HA on the server pool, then on all virtual machines.
    - VM Server 1 has role as Server Pool Master, Utility Server, Virtual Machine Server and has virtual machines running
    - VM Server 2 has role as Utility Server, Virtual Machine Server and has virtual machines running.
    When I shut down VM Server 1, which acts as Server Pool Master, the Server Pool Master role does not fail over to VM Server 2, and the status of both servers becomes Unreachable.
    In particular, none of the virtual machines are accessible.
    Please kindly give advice for this.
    Thanks and regards,
    Heng

    Thanks Avi, I'll find and read that document. And thanks also for elaborating about the Utility Server.
    After reading the followups to my original question, I tried to think of possible server "layouts" in a HA environment.
    1) "N" servers in the pool, one of them is Pool Master, Utility Server AND VM Guests Server at the same time. Maybe this will be the preferred server for smaller, quicker VMs.
    2) "N" servers in the pool, one is Pool Master AND Utility Server, but has no VM guests running on it
    3) "N" servers in the pool, one is the Pool Master, another one is the Utility Server (none of them has VMs running on them), and finally a number of VM Guest servers
    Let's take case 1. If the Pool Master & Utility server fails, given that it has VM guests running on it as well, I understand from your explanation that I'll be ANYWAY able to manually "live migrate" the guests somewhere else, using VM Manager. Is this correct?
    If that's correct, then it's just a question of how much money I want to spend on dedicated servers for different tasks, JUST FOR BETTER PERFORMANCE REASONS. Do you agree? And especially: do YOU have dedicated Pool Masters (just to figure out your "real" approach to the problem :-) )
    I feel that I still miss something, the picture is not completely clear to me. The fact is, that I'm now testing on my new bladesystem, but for now I put up one single blade. Testing HA will be the next step. I was just trying to get a few things sorted out in advance, but there is still something that I'm missing, as I was saying...
    Looking forward to your next reply, thanx again
    Rob

  • How do you configure the RTC as an Extend/TCP client vs Compute Client?

    How do you choose between the Real Time Client acting as an Extend/TCP Client or a Compute Client?
    Thanks,
    Andrew

    Hi Andrew,
    I believe RT client as a Compute client means storage-disabled normal TCMP cluster node in a Grid Edition cluster, but I may be wrong.
    Best regards,
    Robert

  • VIP is not failed over to surviving nodes in oracle 11.2.0.2 grid infra

    Hi ,
    It is an 8-node 11.2.0.2 grid infrastructure.
    When we pull both cables from the public NIC, the VIP fails over to a surviving node on most of the nodes, but on 2 of the nodes it does not fail over. Please help me with this.
    If we remove power from those servers, the VIP does fail over to surviving nodes.
    The public NICs are bonded.
    grdoradr105:/apps/grid/grdhome/sh:+ASM5> ./crsstat.sh |grep -i vip |grep -i 101
    ora.grdoradr101.vip ONLINE OFFLINE
    grdoradr101:/apps/grid/grdhome:+ASM1> cat /proc/net/bonding/bond0
    Ethernet Channel Bonding Driver: v3.4.0-1 (October 7, 2008)
    Bonding Mode: fault-tolerance (active-backup)
    Primary Slave: None
    Currently Active Slave: eth0
    MII Status: up
    MII Polling Interval (ms): 100
    Up Delay (ms): 0
    Down Delay (ms): 0
    Slave Interface: eth0
    MII Status: up
    Speed: 100 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 84:2b:2b:51:3f:1e
    Slave Interface: eth1
    MII Status: up
    Speed: 100 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 84:2b:2b:51:3f:20
    Thanks
    Bala

    Please check below MOS note for this issue.
    1276737.1
    HTH
    Edited by: krishan on Jul 28, 2011 2:49 AM

  • Thin Client connection not failing over

    I'm using the following thin-client connection and the sessions do not fail over. Testing with SQL*Plus, the sessions do fail over. One difference I see between the two connections is that the thin connection shows NONE for failover_method and failover_type, while the SQL*Plus connection shows BASIC for failover_method and SELECT for failover_type.
    Are there any known issues with the thin client? The version is 10.2.0.3.
    jdbc:oracle:thin:@(description=(address_list=(load_balance=YES)(address=(protocol=tcp)(host=crpu306-vip.wm.com)(port=1521))(address=(protocol=tcp)(host=crpu307-vip.wm.com)(port=1521)))(connect_data=(service_name=ocsqat02)(failover_mode=(type=select)(method=basic)(DELAY=5)(RETRIES=180))))

    You have to include (FAILOVER=on) in the JDBC URL as well.
    http://download.oracle.com/docs/cd/B19306_01/network.102/b14212/advcfg.htm#sthref1292
    Example: TAF with Connect-Time Failover and Client Load Balancing
    Implement TAF with connect-time failover and client load balancing for multiple addresses. In the following example, Oracle Net connects randomly to one of the protocol addresses on sales1-server or sales2-server. If the instance fails after the connection, the TAF application fails over to the other node's listener, preserving any SELECT statements in progress.
    sales.us.acme.com=
     (DESCRIPTION=
      (LOAD_BALANCE=on)
      (FAILOVER=on)
      (ADDRESS=
       (PROTOCOL=tcp)
       (HOST=sales1-server)
       (PORT=1521))
      (ADDRESS=
       (PROTOCOL=tcp)
       (HOST=sales2-server)
       (PORT=1521))
      (CONNECT_DATA=
       (SERVICE_NAME=sales.us.acme.com)
       (FAILOVER_MODE=
        (TYPE=select)
        (METHOD=basic))))
    Example: TAF Retrying a Connection
    TAF also provides the ability to automatically retry connecting if the first connection attempt fails, via the RETRIES and DELAY parameters. In the following example, Oracle Net tries to reconnect to the listener on sales1-server. If the failover connection fails, Oracle Net waits 15 seconds before trying to reconnect again. Oracle Net attempts to reconnect up to 20 times.
    sales.us.acme.com=
     (DESCRIPTION=
      (ADDRESS=
       (PROTOCOL=tcp)
       (HOST=sales1-server)
       (PORT=1521))
      (CONNECT_DATA=
       (SERVICE_NAME=sales.us.acme.com)
       (FAILOVER_MODE=
        (TYPE=select)
        (METHOD=basic)
        (RETRIES=20)
        (DELAY=15))))
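Applied to the URL in the question, the fix is just the extra (failover=on) directive. A sketch, using the same hosts and service name as the original URL:

```
jdbc:oracle:thin:@(description=(address_list=(load_balance=yes)(failover=on)(address=(protocol=tcp)(host=crpu306-vip.wm.com)(port=1521))(address=(protocol=tcp)(host=crpu307-vip.wm.com)(port=1521)))(connect_data=(service_name=ocsqat02)(failover_mode=(type=select)(method=basic)(delay=5)(retries=180))))
```

Treat this as a sketch to verify in your own environment; TAF behaviour also depends on the client driver (the 10.2 thin driver's TAF support differs from OCI).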

  • Stateful bean not failing over

    I have a cluster of two servers and an Admin server. Both servers are running NT 4 SP6 and WLS 6 SP1.
    When I stop one of the servers, the client doesn't automatically fail over to the other server; instead it fails, unable to contact the server that has stopped.
    My bean is configured to have its home clusterable and is a stateful bean. My client holds onto the remote interface and makes calls through this. If Server B fails then it should automatically fail over to Server A.
    I have tested my multicast address and all seems to be working fine between servers; my stateless beans work well, load balancing between servers nicely.
    Does anybody have any ideas regarding what could be causing the stateful bean remote interface not to provide failover info?
    Also, is it true that you can have only one JMS destination queue/topic per cluster, i.e. the JMS cluster targeting doesn't work at the moment, so you need to deploy to individual servers?
    Thanks

    Did you enable stateful session bean replication in the weblogic-ejb-jar.xml?
    -- Rob
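Rob's question above refers to the replication switch in weblogic-ejb-jar.xml. Roughly, it looks like this (element names per the WebLogic 6.x DTD; a sketch to check against your version, with a hypothetical bean name):

```xml
<weblogic-ejb-jar>
  <weblogic-enterprise-bean>
    <ejb-name>MyStatefulBean</ejb-name>  <!-- hypothetical bean name -->
    <stateful-session-descriptor>
      <stateful-session-clustering>
        <home-is-clusterable>true</home-is-clusterable>
        <replication-type>InMemory</replication-type>
      </stateful-session-clustering>
    </stateful-session-descriptor>
  </weblogic-enterprise-bean>
</weblogic-ejb-jar>
```

Without replication-type set to InMemory, the bean's state exists only on the primary server, so there is nothing for the replica-aware stub to fail over to.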

  • GSLB Zone-Based DNS Payment Gw - Config Active-Active: Not Failing Over

    Hello All:
    Currently having a bit of a problem, have exhausted all resources and brain power dwindling.
    Brief:
    Two geographically diverse sites. Different AS's, different front ends. Migrated from one site with two CSS 11506's to two sites with one 11506 each.
    Flow of connection is as follows:
    Client --> FW Public Destination NAT --> CSS Private content VIP/destination NAT --> server/service --> CSS Source VIP/NAT --> FW Public Source NAT --> client.
    Using Load Balancers as DNS servers, authoritative for zones due to the requirement for second level Domain DNS load balancing (i.e xxxx.com, AND FQDNs http://www.xxxx.com). Thus, CSS is configured to respond as authoritative for xxxx.com, http://www.xxxx.com, postxx.xxxx.com, tmx.xxxx.com, etc..., but of course cannot do MX records, so is also configured with dns-forwarders which consequently were the original DNS servers for the domains. Those DNS servers have had their zone files changed to reflect that the new DNS servers are in fact the CSS'. Domain records (i.e. NS records in the zone file), and the records at the registrar (i.e. tucows, which I believe resells .com, .net and .org for netsol) have been changed to reflect the same. That part of the equation has already been tested and is true to DNS Workings. The reason for the forwarders is of course for things such as non load balanced Domain Names, as well as MX records, etc...
    Due to design, which unfortunately cannot be changed, dns-record configuration uses kal-ap, example:
    dns-record a http://www.xxxx.com 0 111.222.333.444 multiple kal-ap 10.xx.1.xx 254 sticky-enabled weightedrr 10
    So, to explain so we're absolutely clear:
    - 111.222.333.444 is the public address returned to the client.
    - multiple is configured so we return both site addresses for redundancy (unless I'm misunderstanding that configuration option)
    - kal-ap and the 10.xx.1.xx address because due to the configuration we have no other way of knowing the content rule/service is down and to stop advertising the address for said server/rule
    - sticky-enabled because we don't want to lose a payment and have it go through twice or something crazy like that
    - weighterr 10 (and on the other side weightedrr 1) because we want to keep most of the traffic on the site that is closer to where the bulk of the clients are
    So, now, the problem becomes, that the clients (i.e. something like an interac machine, RFID tags...) need to be able to fail over almost instantly to either of the sites should one lose connectivity and/or servers/services. However, this does not happen. The CSS changes it's advertisement, and this has been confirmed by running "nslookups/digs" directly against the CSSs... however, the client does not recognize this and ends up returning a "DNS Error/Page not found".
    Thinking this may have something to do with the "sticky-enabled" and/or the fact that DNS doesn't necessarily react very well to a TTL of "0".
    Any thoughts... comments... suggestions... experiences???
    Much appreciated in advance for any responses!!!
    Oh... should probably add:
    nslookups to some DNS servers consistently - ALWAYS the same ones - take 3 lookups before getting a reply. Other DNS servers are instant....
    Cheers,
    Ben Shellrude
    Sr. Network Analyst
    MTS AllStream Inc

    Hi Ben,
    if I got your posting right the CSSes are doing their job and do advertise the correct IP for a DNS-query right?
    If some of your clients are having a problem this might be related to DNS-caching. Some clients are caching the DNS-response and do not do a refresh until they fail or this timeout is gone.
    Even worse if the request fails you sometimes have to reset the clients DNS-demon so that they are requesting IP-addresses from scratch. I had this issue with some Unixboxes. If I remeber it corretly you can configure the DNS behaviour for unix boxes and can forbidd them to cache DNS responsed.
    Kind Regards,
    joerg

  • Coherence *Extend-TCP configuration not working

    Hi,
         I was trying to setup the Coherence *Extend-TCP configuration on my solaris box.
         To start with, i'm trying to start a Cache server instance by using the cluster-side configuration XML (given at URL below)
         http://wiki.tangosol.com/display/COH32UG/Configuring+and+Using+Coherence*Extend
         But while starting its throwing me the below error. The Coherence version that i'm using is 3.2/353. Please advise.
         Exception in thread "main" java.lang.IllegalArgumentException: The "Proxy" element is missing a required acceptor configuration element
         at com.tangosol.coherence.component.util.daemon.queueProcessor.service.ProxyService.configure(ProxyService.CDB:30)
         at com.tangosol.coherence.component.util.SafeService.startService(SafeService.CDB:5)
         at com.tangosol.coherence.component.util.SafeService.getRunningService(SafeService.CDB:26)
         at com.tangosol.coherence.component.util.SafeService.ensureRunningService(SafeService.CDB:1)
         at com.tangosol.coherence.component.util.SafeService.start(SafeService.CDB:9)
         at com.tangosol.net.DefaultConfigurableCacheFactory.ensureService(DefaultConfigurableCacheFactory.java:775)
         at com.tangosol.net.DefaultCacheServer.start(DefaultCacheServer.java:138)
         at com.tangosol.net.DefaultCacheServer.main(DefaultCacheServer.java:60)
         regards
         Mike

    Sorry,
         I noticed that the above error occurs for version 3.1.1 (and not for 3.2) as specified in my previous message. My apologies.
         As a follow-up, I've now installed the 3.2 jars in my environment and noticed that the above error doesn't occur for this version. The cache server seems to be coming up fine now (with the appropriate TCP/IP configuration tag in the XML).
         But when I try to run my client application (which attempts to connect to this remote cache server), it throws an InvocationTargetException (full exception below).
         The error indicates that I'm missing some elements in the XML configuration.
         Exception
         (Wrapped) java.lang.reflect.InvocationTargetException
              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
              at java.lang.reflect.Method.invoke(Unknown Source)
              at com.tangosol.net.extend.AdapterFactory.ensureCacheServiceAdapter(AdapterFactory.java:69)
              at com.tangosol.net.DefaultConfigurableCacheFactory.ensureService(DefaultConfigurableCacheFactory.java:729)
              at com.tangosol.net.DefaultConfigurableCacheFactory.ensureCache(DefaultConfigurableCacheFactory.java:650)
              at com.tangosol.net.DefaultConfigurableCacheFactory.configureCache(DefaultConfigurableCacheFactory.java:831)
              at com.tangosol.net.DefaultConfigurableCacheFactory.ensureCache(DefaultConfigurableCacheFactory.java:284)
              at com.tangosol.net.CacheFactory.getCache(CacheFactory.java:622)
              at com.tangosol.net.CacheFactory.getCache(CacheFactory.java:600)
              at com.tangosol.examples.explore.SimpleCacheClient.main(SimpleCacheClient.java:25)
         Caused by: java.lang.IllegalArgumentException: Missing required initiator child configuration element: <extend-cache-scheme tier='front'>
         <scheme-name>extend-direct</scheme-name>
         <service-name>ExtendTcpCacheService</service-name>
         <initiator-config tier='front'>
         <tcp-initiator>
         <remote-addresses>
         <socket-address>
         <address>gpblnx1d.nam.nsroot.net</address>
         <port>32000</port>
         </socket-address>
         </remote-addresses>
         <connect-timeout>10s</connect-timeout>
         <request-timeout>5s</request-timeout>
         </tcp-initiator>
         </initiator-config>
         </extend-cache-scheme>
              at com.tangosol.coherence.extend.component.comm.Adapter.getInitiatorElement(Adapter.CDB:13)
              at com.tangosol.coherence.extend.component.comm.adapter.CacheServiceStub.configure(CacheServiceStub.CDB:5)
              at com.tangosol.coherence.extend.component.application.library.generic.CoherenceExtend.createCacheServiceStub(CoherenceExtend.CDB:4)
              at com.tangosol.coherence.extend.component.application.library.generic.CoherenceExtend.ensureCacheServiceStub(CoherenceExtend.CDB:15)

  • Extend TCP- clients Failover

    Hi,
    In an attempt to fail over a TCP client, we had a config file set up with the following elements:
        <heartbeat-interval>50s</heartbeat-interval>
        <heartbeat-timeout>35s</heartbeat-timeout>
    When inserting the event handler for the service stopping, with the following code:
    public static void InstallServiceEventHandler(string servicename, ServiceEventHandler eh) // this should be a generic handler
    {
        Tangosol.Net.IService ics = CacheFactory.GetService(servicename); // --> This line throws
        try
        {
            ics.ServiceStopping += eh;
        }
        catch (Exception e)
        {
            log.Error("Exception in the service event handler insertion", e);
            return;
        }
    }
    The exception is:
    {"The element 'outgoing-message-handler' in namespace 'http://schemas.tangosol.com/cache' has invalid child element 'heartbeat-interval' in namespace 'http://schemas.tangosol.com/cache'."}
    On commenting out the heartbeat* lines, the above line executes, which is of course useless for detecting server failures without a heartbeat.
    What are we doing wrong?
    Thanks,
    Vipin
    Given below is the config file:
    <cache-config xmlns="http://schemas.tangosol.com/cache">
      <caching-scheme-mapping>
        <cache-mapping>
          <cache-name>dist-*</cache-name>
          <scheme-name>extend-direct</scheme-name>
        </cache-mapping>
      </caching-scheme-mapping>
      <caching-schemes>
        <remote-cache-scheme>
          <scheme-name>extend-direct</scheme-name>
          <service-name>ExtendTcpCacheService</service-name>
          <initiator-config>
            <tcp-initiator>
              <remote-addresses>
                <socket-address>
                  <address>nycs00057388.us.net.intra</address>
                  <port>8078</port>
                </socket-address>
                <socket-address>
                  <address>nycs00057389.us.net.intra</address>
                  <port>8078</port>
                </socket-address>
              </remote-addresses>
            </tcp-initiator>
            <outgoing-message-handler>
              <request-timeout>30s</request-timeout>
              <heartbeat-interval>50s</heartbeat-interval>
              <heartbeat-timeout>35s</heartbeat-timeout>
            </outgoing-message-handler>
          </initiator-config>
        </remote-cache-scheme>
        <remote-invocation-scheme>
          <scheme-name>extend-invocation</scheme-name>
          <service-name>ExtendTcpInvocationService</service-name>
          <initiator-config>
            <tcp-initiator>
              <remote-addresses>
                <socket-address>
                  <address>nycs00057388.us.net.intra</address>
                  <port>8078</port>
                </socket-address>
              </remote-addresses>
            </tcp-initiator>
            <outgoing-message-handler>
              <!--<request-timeout>30s</request-timeout>-->
            </outgoing-message-handler>
          </initiator-config>
        </remote-invocation-scheme>
      </caching-schemes>
    </cache-config>

    Hi Vipin -
    While I do not have a definite answer on the issue, the internal tracking number is COH-2534. While I cannot commit on dates, at last check it was being worked on for inclusion in 3.6, and the fix would likely be back-ported to 3.5.x.
    I suggest that you open an SR with Oracle Support if you have not already done so, so that you can specifically request the resolution of this and the backport to 3.5.
    I apologize for the inconvenience that this has caused you.
    Peace,
    Cameron Purdy | Oracle Coherence

  • Http cluster servlet not failing over when no answer received from server

    I am using WebLogic 5.1.0 SP9. I have a WebLogic server proxying all requests to a WebLogic cluster using the HttpClusterServlet.
    When I kill the WebLogic process servicing my request, I see the next request get failed over to the secondary server, and all my session information has been replicated. In short, I see the behavior I expect.
    However, when I either disconnect the primary server from the network or just switch the server off, I just get a message back to the browser: "unable to connect to servers".
    I don't really understand why the behaviour should be different. I would expect both to fail over in the same manner. Does the cluster servlet only handle TCP reset failures?
    Has anybody else experienced this or have any ideas?
    Thanks
    r.troon

    I think I might have found the answer......
    The AD objects for the clusters had been moved from the Computers OU into a newly created OU. I'm suspecting that the cluster node computer objects didn't have perms to the cluster object within that OU and that was causing the issue. I know I've seen cluster
    object issues before when moving to a new OU.
    All has started working again for the moment so I now just need to investigate what permissions I need on the new OU so that I can move the cluster object in.

  • BGP in Dual Homing setup not failing over correctly

    Hi all,
    we have dual homed BGP connections to our sister company network but the failover testing is failing.
    If i shutdown the WAN interface on the primary router, after about 5 minutes, everything converges and fails over fine.
    But, if i shut the LAN interface down on the primary router, we never regain connectivity to the sister network.
    Our two ASRs have an iBGP relationship, and I can see that after a certain amount of time the BGP routes with a next hop of the primary router get flushed from BGP, and the preferred exit path is through the secondary router. This bit works OK, but I believe that the return traffic is still attempting to return over the primary link...
    To add to this, we have two inline firewalls on each link which are only performing IPS, no packet filtering.
    Any pointers would be great.
    thanks
    Mario                

    Hi John,
    right... please look at the output below which is the partial BGP table during a link failure...
    10.128.0.0/9 is the problematic summary that still keeps getting advertised out when we do not want it to during a failure....
    now there are prefixes in the BGP table which fall within that large summary address space. But I am sure that they are all routes that are being advertised to us from the eBGP peer...
    *> 10.128.0.0/9     0.0.0.0                            32768 i
    s> 10.128.56.16/32  172.17.17.241                 150      0 2856 64619 i
    s> 10.128.56.140/32 172.17.17.241                 150      0 2856 64619 i
    s> 10.160.0.0/21    172.17.17.241                 150      0 2856 64611 i
    s> 10.160.14.0/24   172.17.17.241                 150      0 2856 64611 i
    s> 10.160.16.0/24   172.17.17.241                 150      0 2856 64611 i
    s> 10.200.16.8/30   172.17.17.241                 150      0 2856 65008 ?
    s> 10.200.16.12/30  172.17.17.241                 150      0 2856 65006 ?
    s> 10.255.245.0/24  172.17.17.241                 150      0 2856 64548 ?
    s> 10.255.253.4/32  172.17.17.241                 150      0 2856 64548 ?
    s> 10.255.253.10/32 172.17.17.241                 150      0 2856 64548 ?
    s> 10.255.255.8/30  172.17.17.241                 150      0 2856 6670 ?
    s> 10.255.255.10/32 172.17.17.241                 150      0 2856 ?
    s> 10.255.255.12/30 172.17.17.241                 150      0 2856 6670 ?
    s> 10.255.255.14/32 172.17.17.241                 150      0 2856 ?
    i would not expect summary addresses to still be advertised if the specific prefixes are coming from eBGP... am i wrong?
    thanks for everything so far...
    Mario De Rosa

  • Why DML not failed over in TAF??

    Hi,
    I have an OLTP application running on a 2-node 10gR2 RAC (10.2.0.3) on AIX 5.3L ML 8. I have configured TAF for SESSION failover. I would like to know two things from you all:
    1) Since each instance is able to read the other instance's undo tablespace data and redo logs, why is TAF not able to fail over DML transactions?
    2) As of now, is there any way to fail over DML other than catching the error thrown back to the application and re-executing the statement? Is it possible in 11gR1?
    I am grateful to you all for sparing your valuable time to answer this.
    Thanks and Regards,
    Vijay Shanker

    Re: Failover DML on RAC
    The reason is transaction processing and its implications.
    Imagine that you updated a row, then waited idly, then some other session wanted that same row and waited for you to either rollback or commit.
    You failed.
    Automatically, Oracle will rollback your transaction and release all your locks.
    What should the other session do: wait to see that maybe you have TAF or FCF and will reconnect and rerun your uncommitted DML, or should it proceed with its own work?
    Failed session rollback currently happens regardless of whether you or anybody else have TAF, FCF, or even whether you have RAC.
    But in order for you to be able to replay your DML safely after reconnect, that transaction rollback had to be prevented, and your new failed over session should magically re-attach to the failed session's transaction.
    Maybe some day Oracle will implement something like that, but it's not easy, and Oracle leaves it up to the application to decide what to do (TAF-specific error codes).
    On the other hand, replaying selects is fairly easy: re-execute the query (with the SCN as of the originally failed cursor, to ensure read consistency) and re-fetch up to the point of the last fetch.
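
    The application-side workaround described above (catch the TAF-specific error, then re-run the DML on the failed-over session) can be sketched as a small retry wrapper. The error-code set and helper names here are illustrative assumptions, not an official Oracle API; check the codes your driver actually raises.

    ```java
    import java.sql.SQLException;
    import java.util.concurrent.Callable;

    public class TafRetry {

        // ORA codes commonly associated with a TAF failover mid-call
        // (assumed set for illustration): 25402 "transaction must roll back",
        // 25408 "can not safely replay call", 25409 "failover during the
        // network operation".
        private static final int[] FAILOVER_CODES = {25402, 25408, 25409};

        static boolean isFailoverError(SQLException e) {
            for (int code : FAILOVER_CODES) {
                if (e.getErrorCode() == code) {
                    return true;
                }
            }
            return false;
        }

        // Runs 'work' (e.g. a DML statement plus commit), retrying only when a
        // failover-related error is thrown. Any other SQLException propagates.
        static <T> T retryOnFailover(Callable<T> work, int maxAttempts) throws Exception {
            SQLException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return work.call();
                } catch (SQLException e) {
                    if (!isFailoverError(e)) {
                        throw e;
                    }
                    last = e; // session failed over; Oracle rolled back, so replay
                }
            }
            throw last;
        }
    }
    ```

    This only re-runs work whose original transaction Oracle has already rolled back, which is exactly why it must live in the application rather than in TAF itself.
    
    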

  • Proper steps to fail over to another host in a cluster

    Hello,
    Pardon my ignorance. What are the proper steps to force a failover to the standby host in a two-node cluster?
    My secondary host is currently the active host for the cluster name. I would like to force it to fail over to the primary, which is acting as the standby. Thank you in advance.

    Hi MS_Moron,
    You can refer to the following KB article to gracefully move a clustered resource to another node:
    Test the Failover of a Clustered Service or Application
    http://technet.microsoft.com/en-us/library/cc754577.aspx
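
    The graceful move in that article can also be done from PowerShell with the FailoverClusters module (available on Windows Server 2008 R2 and later); the group and node names below are placeholders for your own:

    ```
    Import-Module FailoverClusters

    # List cluster groups and their current owner nodes
    Get-ClusterGroup

    # Gracefully move the group holding the cluster name resource
    # back to the primary node
    Move-ClusterGroup -Name "Cluster Group" -Node "PrimaryNode"
    ```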
    I’m glad to be of help to you!

  • Problems with Oracle FailSafe - Primary node not failing over the DB to the secondary node

    I am using 11.1.0.7 on 64-bit Windows, with two nodes clustered at the OS level. The cluster is working fine at the Windows level and the shared drive fails over. However, the database does not fail over when the primary node is shut down or restarted.
    The Oracle software is on local drive on each box. The Oracle DB files and Logs are on shared drive.

    Is the database listed in your cluster group that you are failing over?
