CSM 4.2(5): Recurring failed health probes

Hi all
I've finally started to investigate an issue I have with our CSM setup. Several times a day I get the below syslog message from the 6500
10:49:11: %CSM_SLB-6-RSERVERSTATE: Module 4 server state changed: SLB-NETMGT: TCP health probe failed for server
Then a few seconds later
10:49:41: %CSM_SLB-6-RSERVERSTATE: Module 4 server state changed: SLB-NETMGT: TCP health probe re-activated server
I never seem to catch the event in action, so I can never verify whether the real server has actually failed or whether this is only a probe timeout. I have both layer 2 and layer 3 server farms in operation, and this problem occurs on all of my server farms a few times a day.
There is no pattern, and I have no other indications of any problems. I have most of the probes set to 1 repeat and a 30-second timeout. Should I increase the probe timeouts?
Regards
Fredrik

Those messages come from the probing the CSM does to determine server health. For a TCP probe, a failure means that the CSM either received a TCP RST from the server or did not see a SYN-ACK come back from the server.
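To check from another host which of the two failure modes a server is showing, something like the following could be used. This is only a rough sketch in Python, not what the CSM runs internally; the function name tcp_probe and the 10-second default are assumptions made for illustration.

```python
import socket


def tcp_probe(host, port, open_timeout=10.0):
    """Open and immediately close a TCP connection, like a basic TCP probe.

    Returns "ok" if the handshake completes, "refused" if the server
    answers with a RST, "timeout" if no SYN-ACK arrives within
    open_timeout seconds, and "error" for anything else.
    """
    try:
        with socket.create_connection((host, port), timeout=open_timeout):
            return "ok"
    except ConnectionRefusedError:
        # The server (or something in front of it) sent a RST.
        return "refused"
    except socket.timeout:
        # No SYN-ACK was seen before the open timeout expired.
        return "timeout"
    except OSError:
        # Unreachable network, name resolution failure, and so on.
        return "error"
```

Running this in a loop against a real server and logging anything other than "ok" with a timestamp may help correlate the syslog messages with RSTs versus timeouts.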

Similar Messages

CSM health probe for server farm with multiple vservers

Is there a way to specify the vserver port that a health probe monitors when multiple vservers are configured for the same serverfarm? Let's say I have a serverfarm named farm1. farm1 services two ports www and https so two vservers vserver_www and vserver_https are configured and bound to farm1. I would like to enable http health probe on farm1 with the intention of only monitoring vserver_www http port but, instead, the health probe monitors both www and https and since a http probe on https fails it takes farm1 reals and both vservers vserver_www and vserver_https out-of-service. Is there a way to configure a health probe to monitor a specific port? Or, should I create two duplicate serverfarms farm1 bound to vserver_www and farm2 bound to vserver_https and only enable http health probe on farm1? Any other ideas welcomed.

Appreciate the feedback. I also found what I was looking for in configuration examples. To summarize I've borrowed the comment from the URL below:
# The port for the probe is inherited from the vservers.
# The port is necessary in this case, since the same farm
# is serving a vserver on port 80 and one on port 23.
# If the "port 80" parameter is removed, the HTTP probe
# will be sent out on both ports 80 and 23, thus failing
# on port 23 which does not serve HTTP requests.
http://www.cisco.com/univercd/cc/td/doc/product/lan/cat6000/mod_icn/csm/csm_4_2/config/cfgxpls.htm
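The inheritance rule described in that comment can be modelled in a few lines. This is only an illustration of the rule as quoted, not CSM code; probed_ports and its arguments are hypothetical names.

```python
from typing import List, Optional


def probed_ports(probe_port: Optional[int], vserver_ports: List[int]) -> List[int]:
    """Return the ports a probe would be sent to.

    If the probe has an explicit port, only that port is probed.
    Otherwise the port is inherited from every vserver bound to the farm.
    """
    if probe_port is not None:
        return [probe_port]
    return sorted(set(vserver_ports))


# Farm serving vservers on ports 80 and 23, as in the quoted comment.
print(probed_ports(80, [80, 23]))    # explicit "port 80": only 80 is probed
print(probed_ports(None, [80, 23]))  # no port: both 23 and 80 are probed
```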

Multiple health probes on CSM

We have a CSM blade in a 6509, IOS 12.2(18)SXF7, CSM software version 4.2(7).
We'd like to create a serverfarm, where servers are checked for several ports and only considered as working when all probes succeed.
Although Cisco docs state that there should be a possibility to associate multiple probes with a serverfarm, I haven't managed to do so.
Here's what I've tried:
probe PING icmp
   interval 5
   failed 10
   receive 4
probe TCP-1234 tcp
   interval 10
   retries 2
   failed 25
   port 1234
real PROBE-TEST-R
   address 1.2.3.4
serverfarm PROBE-TEST-SF
   real name PROBE-TEST-R
   health probe PING
   health probe TCP-1234
but when trying to add the second probe, I get:
% You must first disassociate from probe PING.
Any ideas, how multiple probes could be implemented?

Configure them with "probe" under the serverfarm, not "health probe":
serverfarm PROBE-TEST-SF
   probe PING
   probe TCP-1234
Gilles.

ACE failing server out using TCP health probe

We have a mix of ACE20s and ACE30s currently, and I am seeing the ACE on both HW platforms failing out our servers sporadically after a successful TCP handshake. Here is the configuration:
probe tcp TCP-25
   port 25
   interval 25
   faildetect 2
   passdetect interval 90
   open 10
When I do a show probe TCP-25 detail I see the default recv timeout is 10.
I captured a trace between the ACE and the server. When the health probes pass I see a good 3-way TCP handshake, then 50 ms later the server sends an SMTP 220, then ACK from the ACE, FIN ACK from the ACE, and graceful TCP termination occurs. When the probe fails I see a successful TCP handshake, but the ACE sends FIN ACK 47 ms after it sends the ACK for the TCP connection. The server then sends ACK and the ACE sends RST.
Shouldn't the ACE wait 10 seconds in this example for the server to respond after the TCP handshake?

TAC/Martin Nash was very helpful in explaining this. The TCP 3-way handshake was successful and the ACE sent a FIN ACK as expected, but after the server sent an ACK it did not send a FIN ACK of its own, so the ACE marked it down. The health check requires not only a 3-way handshake but also a clean teardown of the TCP session.

CSM HTTP Health Probe

Is there any way to configure an HTTP health probe that will test a web page and fail if the server takes too long to respond? I have attempted to do this (see below), but the "receive" parameter doesn't seem to help. We are currently having a problem where one of the web servers, for whatever reason, gets really slow while the other works fine with about the same number of users. I'd like to fail the slow one when this occurs.
Here is my probe config:
probe HTTP-SERVERASP http
   request method get url /server.asp
   expect status 200 299
   interval 5
   failed 30
   receive 5
Thanks...Jeff

Jeff,
receive seems to be the solution for what you need.
Did you verify how fast or slow the server is responding?
Currently you allow 5 sec for the response to come back, and 3 consecutive probes must fail before the server is brought down, so if your server responds fast enough even once, the server stays up.
So, use a sniffer trace to verify the response time.
Send me the trace if you want.
Gilles.
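To see why a single fast response keeps a slow server in service, the counting described above can be sketched as follows. This is a simplified model in Python under the assumption that only consecutive failures count; first_down_index and its defaults are illustrative names and values, not CSM internals.

```python
def first_down_index(response_times, receive_timeout=5.0, retries=3):
    """Return the index of the probe at which the server is marked down.

    response_times holds one entry per probe: the response time in
    seconds, or None if no response arrived at all. A probe fails when
    the response is missing or slower than receive_timeout. The server
    is marked down after `retries` consecutive failures. Returns None
    if that never happens.
    """
    consecutive = 0
    for i, t in enumerate(response_times):
        if t is None or t > receive_timeout:
            consecutive += 1
            if consecutive >= retries:
                return i
        else:
            # One response inside the timeout resets the counter.
            consecutive = 0
    return None
```

With response times of 6, 1, 6, 1 seconds the function returns None (the server is never failed), which matches the behaviour described: an intermittently slow server may never be taken out of service unless the receive timeout or the retry count is lowered.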

CSM Health Probe source IP

Can anyone tell me what IP address health probes are sourced from on the CSM? I've got a simple ICMP health probe setup but I'm trying to figure out what the source of those probes will be.
Is it the VLAN IP, or maybe the VIP, or possibly the router interface IP?
Thanks,
Bob

It is the VLAN IP.
Gilles.

SSLM Health Probe?

I have two 6509s, each with a CSM and SSLM. One CSM is active and both SSLMs are active. I load balance encrypted requests to the SSLMs.
The SSLM decrypts the incoming HTTPS requests and sends the request back to the CSM using HTTP (clear text). The CSM serverfarm then load balances the session to one of the web servers. Because the web server responds back in clear text, I have implemented a health probe to monitor the web page for a specific string of characters within the serverfarm. If a web page displays the page incorrectly, the probe fails for that server.
Now I have a new requirement, where I must re-encrypt the traffic (backend encryption) and send the requests to the server encrypted (HTTPS).
My questions are:
1. Can I implement health probes on the SSLM?
2. Can I implement an effective health probe from the CSM so that I can still poll for a string of characters?
Thank you.

The SSLM should only be probed with ICMP.

ACE http health probes - best practice for interval and passdetect interval?

Hi,
Is there a recommended standard for http health probes in terms of interval and passdetect interval timings, i.e. should the passdetect interval always be less than the interval or vice versa? Can an http probe be 'mis-configured', i.e. return a 'false positive', by configuring an interval timeout that's 'incompatible' with the device it's polling?
I have a http probe for a serverfarm consisting of two Apache http servers and get intermittent 'server reply timeout' probe failures. I'm keen to ensure that the configuration of the probe isn't at fault so I can be confident that a failed probe indicates a problem with the server and not my configuration.
The probe is currently configured as below:-
probe http http-apache
   interval 30
   passdetect interval 15
   passdetect count 6
   request method get url /cs/images/ACE.html
   expect status 200 304
Any advice on the subject would be gratefully received.
thanks
Matthew

Hi Gilles,
Thanks for the advice. In another discussion (found here https://supportforums.cisco.com/message/462397#462397) a poster has stated that:-
"(The) "Probe interval" should always be less then (open+recieve) timeout value. Default open & receive timeouts are 10 seconds."
Are you able to advise on whether the above is correct and, if so, why? I currently have an interval value of 30, which obviously goes against the advice above (which I've interpreted to mean that if you leave the open and receive timeouts at their default settings, your probe interval should be less than 20 seconds).
thanks
Matthew

Cisco ACE Health Probes

Probe Interval: 5
Pass Detect (Seconds): 60
Fail Detect: 3
Please can someone explain the above settings that are configured for a health probe? Am I correct in thinking the probe is sent every 5 seconds, and must fail 3 times in order to fail over? Does the "Pass Detect" indicate that the server must be back online for 60 seconds before being placed back into the server farm?
Also, if we have a primary server and a backup server (used if the primary fails), and the primary fails and the backup server becomes active, will the primary server become available again when it comes back online, or will all connections continue to go to the backup? Is there any way to make the old primary the new backup when it comes back online?

Hi,
You are right about the probe interval and fail detect, but pass detect has two parameters, interval and count. The interval defines how long to wait between probes sent to a failed server, whereas the count sets the minimum number of successful probe responses required from the server before it is made active again.
Regarding the backup server, once the primary server comes online again all new connections will be directed to it, while existing connections will continue on the server they are already using. I guess "inservice standby" will be the command of interest for gracefully removing the primary and bringing the backup active.
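How the probe interval, fail detect, pass detect interval and pass detect count interact can be illustrated with a small simulation. This is a simplified model in Python, not ACE code: it ignores the time each probe takes and any open/receive timeouts, and the function name, argument names and the pd_count default are assumptions made for illustration.

```python
def simulate(outcomes, interval=5, faildetect=3, pd_interval=60, pd_count=3):
    """Replay probe outcomes and return a list of (time_s, state) tuples.

    outcomes is a list of booleans, one per probe (True = probe passed).
    While the server is UP, probes are sent every `interval` seconds and
    `faildetect` consecutive failures mark it DOWN. While it is DOWN,
    probes are sent every `pd_interval` seconds and `pd_count`
    consecutive successes mark it UP again.
    """
    t = 0
    up = True
    fails = 0
    passes = 0
    log = []
    for ok in outcomes:
        if up:
            fails = 0 if ok else fails + 1
            if fails >= faildetect:
                up = False
                passes = 0
        else:
            passes = passes + 1 if ok else 0
            if passes >= pd_count:
                up = True
                fails = 0
        log.append((t, "UP" if up else "DOWN"))
        # The next probe is scheduled according to the state just reached.
        t += interval if up else pd_interval
    return log
```

With interval 5, fail detect 3, pass detect interval 60 and an assumed pass detect count of 2, the sequence pass, fail, fail, fail, pass, pass marks the server DOWN at the probe sent at 15 s and UP again at the probe sent at 135 s. So the 60 seconds is the spacing between probes to a failed server, not a fixed "must be online for 60 seconds" timer.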

Configuring Health Probe for Server Farm

If I have a server farm with real servers listening on port 8888 and I apply an HTTP-type health probe with no port number specified, will the ACE know to probe the servers at 8888 or will it try to probe port 80?

Hi,
Yes, it should inherit the port from the real servers defined in the serverfarm. This gives you the flexibility to associate the same probe with different serverfarms, probing different servers on different ports. This is the probe port inheritance feature in ACE.
Regards,
Kanwal

Health probe for RDP farm

I have an RDP server farm that lost a disk. The RDP service was still running but users were unable to log in. I'd like to create a health probe that does maybe a combination of TCP probe for port 3389 and something that can determine if the drive that stores user profiles is available.
I cannot add any new service (http or ftp) to the server.
Can anyone think of another way to do this? Is there any way I can check SNMP MIBs on the Windows server, or maybe WMI through TCL?
Thanks.

Can you drop me a mail offline ([email protected]) and I can share what I have. Matthew

ACE Health probe using get URL

Hello,
We are trying to create a health probe for our google search appliances and as part of the URL get there is a question mark but the ACE doesn't like that. Is there a way around this or should it be done differently?
request method get url /searchq? (This is what we want the URL to be)
request method get url /searchq (This is where it thinks I'm asking it for help)
Thanks in Advance.

Hello,
You need to type Ctrl+V prior to entering the ?
That's the Control key then lowercase v, then your question mark.
Hope this helps,
Sean

WLS 9.2: State FAILED Health OK!

Hi,
I have seen instances where the Weblogic Server 9.2 Admin console shows the state of managed server as FAILED, but the health is OK. I wanted to know if this is how the console behaves (normally), as according to me the server health cannot be OK when it has failed.
I can only think of one thing, the server not responding to any health checks from the Admin Server and so the health showing the last available state (which is OK before server failed). But, I have seen this happen (ie Admin console showing state FAILED health OK) even an hour after the managed server failed (probably due to deadlocks).
Please help me comprehend this state better.
Thanks in advance
Vikas

hi,
Once you refresh the page, the health status is updated automatically.

Function of health probe timers on CSM

Hi,
we use the following configuration on a csm to monitor a server farm and I'm wondering how exactly the probe timers work.
===
serverfarm sf
   nat server
   nat client natpool1
   failaction purge
   real name serv1
      weight 1
      inservice
   real name serv2
      weight 1
      inservice
   probe probe1
probe probe1 script
   script LDAP_PROBE
   interval 5
   retries 2
   receive 1
   port 389
===
As I understand it, the probes are sent every 5 seconds. When a probe isn't answered within one second it's marked as failed. If two probes fail (retries 2) the real server is marked as down.
Is this correct?
In a network trace I see a different behaviour: probes are sent every 5 seconds. If a real server goes out of service I see a probe which is not answered, and the next probe is sent after 10 seconds (I expected 5 seconds). 5 seconds later the real server is marked down in the switch log.
It would be great if anybody could help me.
Best Regards,
Thorsten Steffen

Hi,
The parameters have the following meanings:

Router(config-slb-probe)# interval seconds
Sets the interval between probes in seconds (from the end of the previous probe to the beginning of the next probe) when the server is healthy.
Range = 2-65535 seconds
Default = 120 seconds

Router(config-slb-probe)# retries retry-count
Sets the number of failed probes that are allowed before marking the server as failed.
Range = 0-65535
Default = 3

Router(config-slb-probe)# failed failed-interval
Sets the time between health checks when the server has been marked as failed. The time is in seconds.
Range = 2-65535
Default = 300 seconds

Router(config-slb-probe)# open open-timeout
Sets the maximum time to wait for a TCP connection. This command is not used for any non-TCP health checks (ICMP or DNS).
Range = 1-65535
Default = 10 seconds

There are two different timeout values: open and receive. The open timeout specifies how many seconds to wait for the connection to open (that is, how many seconds to wait for SYN ACK after sending SYN). The receive timeout specifies how many seconds to wait for data to be received (that is, how many seconds to wait for an HTTP reply after sending a GET/HEAD request). Because TCP probes close as soon as they open without sending any data, the receive timeout is not used.
When sniffing, you should see a probe every 5 seconds. When a probe fails for the first time, a second probe should be sent after 5 seconds. When this probe fails too, the server is put out of service.
That is the behaviour you should see.
HTH,
Dario
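Because the interval is counted from the end of one probe to the start of the next, a probe that times out adds its timeout to the gap. A rough way to estimate the detection time is sketched below in Python. It assumes that `retries` is the number of consecutive failed probes needed to mark the server down (as described in the reply above) and that each failed probe runs for the full timeout; the function name is hypothetical, and the estimate is not meant to account for the 10-second gap seen in the trace.

```python
def time_to_mark_down(interval: float, timeout: float, retries: int) -> float:
    """Estimate seconds from the start of the first failed probe until
    the server is marked down.

    Each failed probe lasts `timeout` seconds, and `interval` seconds
    pass between the end of one probe and the start of the next.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    return retries * timeout + (retries - 1) * interval


# interval 5, receive 1, retries 2 as in the configuration above
print(time_to_mark_down(interval=5, timeout=1, retries=2))
```

For the configuration in the question this gives 1 + 5 + 1 = 7 seconds under the stated assumptions.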

CSM HTTPS or SSL Health Probe

We are currently using a TCP probe for HTTPS web server health checking. Is there an HTTPS or SSL probe available on the CSM that can send a URL to detect whether the HTTPS Apache web server is up or not?
Many Thx, Q.Xie

You can download the TCL script file from the same location as the CSM software.
In this TCL file you should find the following scripts:
[root@linux-1 cisco]# cat /tftpboot/c6slb-apc.4-2-1.tcl | grep -i "name ="
#!name = CHECKPORT_STD_SCRIPT
#!name = ECHO_PROBE_SCRIPT
#!name = FINGER_PROBE_SCRIPT
#!name = FTP_PROBE_SCRIPT
#!name = HTTPCONTENT_PROBE
#!name = HTTPHEADER_PROBE
#!name = HTTPPROXY_PROBE
#!name = HTTP_PROBE_SCRIPT
#!name = IMAP_PROBE
#!name = LDAP_PROBE
#!name = MAIL_PROBE
#!name = POP3_PROBE
#!name = PROBENOTICE_PROBE
#!name = RTSP_PROBE
#!name = SSL_PROBE_SCRIPT
#!name = TFTP_PROBE
There is an SSL_PROBE_SCRIPT that will verify that the SSL server responds to a client SSL HELLO message.
It does not verify whether you can send an HTTP request.
It only sends a HELLO as a client and waits for the server HELLO.
With the SSLM for the CSM, there might be a way to achieve an HTTPS probe.
I never tried it, but the solution I see would be to create an HTTP probe on the CSM and direct it to the SSLM, which will do the encryption and forward it to the server.
Regards,
Gilles
