Monitoring for probe failure to real servers

Hi All,
I'm working on a task to send an alert to an SNMP server when a probe to a real server fails.
Is there any way to send an SNMP trap for a probe failure on any real server?
Regards,
Thiyagu

Thiyagu,
This is possible; refer to the link below for the ACE management features.
https://supportforums.cisco.com/docs/DOC-22543
Regards,
Siva

Similar Messages

GUI for monitoring ACE probes

Hello,
Can Cisco LMS monitor and report on ACE module probes?
Thanks.

Yes, as imported MIBs.
The ACE appliance 3.x currently supports more SNMP OIDs for probes than the ACE module 2.x does, but with ACE module 2.3, due by Q4 CY09, both will have the same probe-monitoring capabilities.
See:
TableName:cslbxProbeCfgTable
cslbxProbeState
INDEX: slbEntity, cslbxProbeName
For Probe State per Probe Name.
cslbxProbeState can have two values: ACTIVE and INACTIVE.
As part of reporting probe statistics per RServer, the following OIDs will be added to the cesRserverProbeTable in CISCO-ENHANCED-SLB-MIB:
Table Name:cesRserverProbeTable
cesRserverProbesPassed
cesRserverProbesFailed
cesRserverProbeHealthMonState
INDEX: Probe Name, RServer
Probe statistics per RServer (configured probe).
This displays stats based on Probe Name per RServer (RServers are physical devices not associated with any server farm). Stats are generated when a probe is associated with an RServer.
Table Name: cesRealServerProbeTable
cesRealServerProbeName
cesRealServerProbeStorageType
cesRealServerProbeRowStatus
INDEX: Probe Name, Server Farm Name,Real Server Name, Real Server Port
Represents a probe associated with a real server directly. For example the following configuration adds an entry to the table.
As part of reporting probe statistics for probes that are assigned to a real server or server farm, the following table with these OIDs will be added to CISCO-SLB-HEALTH-MON-MIB:
cshMonServerfarmRealProbeStatsTable: (New Table)
cshMonServerfarmRealPassedProbes
cshMonServerfarmRealFailedProbes
cshMonServerfarmRealProbeHealthMonState
INDEX:Probe Name, Server Farm Name,Real Server Name, Real Server Port ,Inherited Port
Statistics for probes assigned to real server/serverfarm
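One way a management station could turn these counters into an alert is to poll a failed-probe counter (for example cshMonServerfarmRealFailedProbes) and compare successive samples. The sketch below is only an illustration under assumptions: it does not perform SNMP itself, the ProbeSample type and function names are hypothetical, and the counter values would come from whatever SNMP poller is already in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeSample:
    """One poll of the per-real-server probe counters (hypothetical container)."""
    passed: int  # e.g. value of cshMonServerfarmRealPassedProbes
    failed: int  # e.g. value of cshMonServerfarmRealFailedProbes


def new_failures(previous: ProbeSample, current: ProbeSample) -> int:
    """Return how many probe failures occurred between two polls.

    If the counter went backwards (assumed to mean a reload or counter
    reset), the current value is treated as the number of new failures.
    """
    if current.failed < previous.failed:
        return current.failed
    return current.failed - previous.failed


def should_alert(previous: ProbeSample, current: ProbeSample, threshold: int = 1) -> bool:
    """Alert when at least `threshold` new failures were seen since the last poll."""
    return new_failures(previous, current) >= threshold
```

A poller would keep the last ProbeSample per (probe, server farm, real server, port) index and call should_alert on each new poll.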

Temperature failure for probe (CPU_A)

Hello,
We have a couple of C150s running quite OK. Ever since we enabled SNMP traps, both boxes have been bombarding our management station with traps saying:
Temperature has exceeded a recoverable failure threshold for probe (CPU_A)
The documentation says this trap is sent when the CPU temperature is above 90C. It also suggests a heatsink issue.
Before we set out to replace the (new) boxes, we'd like to confirm it's a real HW issue, but I'm unable to get any independent reading of the temperature probes. SNMP returns only the "planar" temperature, which is cool:
SNMPv2-SMI::enterprises.15497.1.1.1.9.1.2.1 = INTEGER: 15
SNMPv2-SMI::enterprises.15497.1.1.1.9.1.3.1 = STRING: "Planar"
Upgrade from AsyncOS 5.5 to 6.1 made no difference.
In case of a HW failure an orange light should be flashing, too, and this is something the on-site personnel couldn't confirm. OTOH, the LEDs seem to be a bit hidden on the C150 ...
What would you suggest to do next?
Thanks,
jozef :-)

Hello jariih,
reading through the MIB, .15497.1.1.1.9.1.* should be the temperature table, with rows numbered by the last number and the value/name columns numbered by the one before it ("2" and "3").
On bigger boxes there are more rows - your listing shows 5, among them the CPU_A value. Too bad that value doesn't exist here on the C150; there's just one row, with the "Planar" value:
SNMPv2-SMI::enterprises.15497.1.1.1.9.1.3.1 = STRING: "Planar"
SNMPv2-SMI::enterprises.15497.1.1.1.9.1.3.2 = No Such Instance currently exists at this OID
SNMPv2-SMI::enterprises.15497.1.1.1.9.1.3.3 = No Such Instance currently exists at this OID
SNMPv2-SMI::enterprises.15497.1.1.1.9.1.3.4 = No Such Instance currently exists at this OID
SNMPv2-SMI::enterprises.15497.1.1.1.9.1.3.5 = No Such Instance currently exists at this OID
Anyway, support asked for remote access and will check the boxes.
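The row/column reading of the OIDs described above can be sketched as a small parser over snmpwalk-style output. This is an illustration only: the column labels "celsius" and "name" are my own (the post only says column 2 is the value and column 3 is the name), and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   SNMPv2-SMI::enterprises.15497.1.1.1.9.1.2.1 = INTEGER: 15
# Lines without a "TYPE: value" part (e.g. "No Such Instance ...") do not match.
LINE_RE = re.compile(
    r"enterprises\.15497\.1\.1\.1\.9\.1\.(?P<col>\d+)\.(?P<row>\d+)"
    r" = (?P<type>\w+): (?P<value>.*)$"
)

# Assumed labels for the two columns mentioned in the thread.
COLUMN_NAMES = {2: "celsius", 3: "name"}


def parse_temperature_table(walk_output: str) -> dict:
    """Group walk lines into {row_index: {column_label: value}}."""
    rows = defaultdict(dict)
    for line in walk_output.splitlines():
        match = LINE_RE.search(line.strip())
        if match is None:
            continue  # skip blank lines and "No Such Instance" answers
        column = COLUMN_NAMES.get(int(match.group("col")))
        if column is None:
            continue  # a column this sketch does not know about
        value = match.group("value").strip()
        if match.group("type") == "INTEGER":
            value = int(value)
        else:
            value = value.strip('"')
        rows[int(match.group("row"))][column] = value
    return dict(rows)
```

On the C150 output quoted in this thread the result has a single row for "Planar", which matches the observation that no CPU_A row is exposed.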

CSS 11501 7.40 Monitoring the services on real servers?

Hi,
I just want to ask some basic questions. How can I monitor the services (i.e. ports 80 and 443) of the real servers, so that when the CSS 11501 detects that a service on one of the real servers is down, it will not forward traffic to that server? Or is the CSS configured to monitor the services by default?
We are planning to upgrade one of the web servers (web01) while web02 is running. If we shut down the services on ports 80 and 443, will it affect the end users, or will the CSS automatically redirect them to web02?
Regards,
Marlon

Here is my sample configuration
!************************** SERVICE **************************
service WEB01-79-HTTP
ip address 172.20.13.4
keepalive type tcp
keepalive port 80
active
service WEB01-79-HTTPS
ip address 172.20.13.4
keepalive type tcp
keepalive port 443
active
service WEB01-80-HTTP
ip address 172.20.13.5
keepalive type tcp
keepalive port 80
active
service WEB01-80-HTTPS
ip address 172.20.13.5
keepalive type tcp
keepalive port 443
active
service WEB01-82-HTTP
ip address 172.20.13.6
keepalive type tcp
keepalive port 80
active
service WEB01-82-HTTPS
ip address 172.20.13.6
keepalive type tcp
keepalive port 443
active
service WEB01-83-HTTP
ip address 172.20.13.7
keepalive type tcp
keepalive port 80
active
service WEB01-83-HTTPS
ip address 172.20.13.7
keepalive type tcp
keepalive port 443
active
service WEB01-79
ip address 172.20.13.4
active
service WEB01-80
ip address 172.20.13.5
active
service WEB02-82
ip address 172.20.13.6
active
service WEB02-83
ip address 172.20.13.7
active
!*************************** OWNER ***************************
owner VRL
content VIP
redundancy-l4-stateless
content WEB-HTTP1
vip address 172.20.10.85
protocol tcp
port 80
advanced-balance sticky-srcip
add service WEB01-79-HTTP
add service WEB01-82-HTTP
redundancy-l4-stateless
active
content WEB-HTTP2
vip address 172.20.10.86
port 80
protocol tcp
advanced-balance sticky-srcip
add service WEB01-80-HTTP
add service WEB01-83-HTTP
redundancy-l4-stateless
active
content WEB-HTTPS1
advanced-balance sticky-srcip
vip address 172.20.10.85
protocol tcp
port 443
add service WEB01-79-HTTPS
add service WEB01-82-HTTPS
redundancy-l4-stateless
application ssl
sticky-inact-timeout 20
active
content WEB-HTTPS2
advanced-balance sticky-srcip
vip address 172.20.10.86
protocol tcp
port 443
add service WEB01-80-HTTPS
add service WEB01-83-HTTPS
redundancy-l4-stateless
application ssl
sticky-inact-timeout 20
active
content WEB01-79
add service WEB01-79
vip address 172.20.10.79
redundancy-l4-stateless
active
content WEB01-80
add service WEB01-80
vip address 172.20.10.80
redundancy-l4-stateless
active
content WEB02-82
add service WEB02-82
vip address 172.20.10.82
redundancy-l4-stateless
active
content WEB02-83
add service WEB02-83
vip address 172.20.10.83
redundancy-l4-stateless
active
!*************************** GROUP ***************************
group WEB01-79
add service WEB01-79
vip address 172.20.10.79
redundancy-l4-stateless
active
group WEB01-80
add service WEB01-80
vip address 172.20.10.80
redundancy-l4-stateless
active
group WEB02-82
add service WEB02-82
vip address 172.20.10.82
redundancy-l4-stateless
active
group WEB02-83
add service WEB02-83
vip address 172.20.10.83
redundancy-l4-stateless
active
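For readers unfamiliar with what a "keepalive type tcp" check amounts to, the sketch below shows the general idea as a plain TCP connect test from a management host: a service counts as alive if its port accepts a connection within a timeout. It is an illustration under that assumption, not the CSS implementation, and the function names are hypothetical.

```python
import socket


def tcp_keepalive(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat the service as down.
        return False


def alive_services(services: dict) -> list:
    """Given {service_name: (ip, port)}, return the names that answered."""
    return [name for name, (ip, port) in services.items() if tcp_keepalive(ip, port)]
```

For example, alive_services({"WEB01-79-HTTP": ("172.20.13.4", 80)}) would report whether that address currently accepts connections on port 80.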

Distribution Monitor for 2 different servers from 2 different sites

Hello all,
We are trying to use Distribution Monitor during a parallel Unicode Conversion on a SAP 4.7 system.
The source system and target system are 2 different servers located on 2 different sites (more than 500Kms distant).
Questions:
1. Can we use Distribution Monitor with 1 source server dedicated for the Export and 1 target server dedicated for the import of a package?
2. If it is not possible, what are the constraints in fact?
3. Can we have a scenario where Distribution Monitor is used on the source system in order to use the parallelism benefit and Migration Monitor used on the target system?
Thanks for your help & feedback,
Chris

Hi Chris,
1. Can we use Distribution Monitor with 1 source server dedicated for the Export and 1 target server dedicated for the import of a package? The Answer is No
To use Distribution Monitor, you need at least two application servers on the source system and, correspondingly, at least two application servers on the target system.
For example, say application servers A and B on the source system and application servers C and D on the target system.
Then configure the Distribution Monitor properties to include the two application servers as source systems and two application servers as target systems. When you execute the Distribution Monitor preparation, it first scans the database servers in the source and target systems and then scans the CI servers in the source and target systems. The packages are then distributed across application servers A and B.
Run the export from application server A for the first fifty packages and, at the same time, import these first fifty packages on application server C.
Run the export from application server B for the remaining packages and, at the same time, import the remaining packages on application server D.
(that is one to one correspondence)
2. If it is not possible, what are the constraints in fact? - There are no constraints. However, Distribution Monitor preparation and checking are very time-consuming.
3. Can we have a scenario where Distribution Monitor is used on the source system in order to use the parallelism benefit and Migration Monitor used on the target system? - The Answer is No.
You cannot mix the Distribution Monitor tool on the source system with the Migration Monitor tool on the target system.
You have to use one tool or the other, depending on the size of the database.
If your database is very large, Distribution Monitor is recommended, since on the source you can run multiple R3load jobs on each application server (say, application server A uses 20 R3load jobs and application server B uses 15 R3load jobs).
Thanks
APR

Time sync monitor for all windows member servers

How can we monitor time sync issues on all member servers and domain controllers? Can we monitor this through SCOM 2007 R2? In particular, we need an alert when the member servers are not in time sync.
B John

There is no monitor that does exactly what you ask.
The AD management pack includes a Time Service Health monitor for DCs that makes sure the time service is running:
http://social.technet.microsoft.com/Forums/en-US/0c921fa7-45ed-4bee-8f53-c92c750f6cbf/scom-module-for-monitoring-time-sync-issues?forum=operationsmanagergeneral
Here is my "domain time tip:"
Select your preferred time server from this list
http://tycho.usno.navy.mil/NTP/. For example 'tick.usno.navy.mil'.
Then run these commands on your PDC emulator of the domain:
W32tm /config /syncfromflags:manual /manualpeerlist:"<DNS-name-of-time-server>"
W32tm /config /reliable:yes
W32tm /config /update
W32tm /resync
Net stop w32time
Net start w32time
If you execute these commands, AD defaults will cause all your member servers to discover the domain time standard and update to it.
Good luck,
John Joyner MVP-SC-CDM

Disable SQL Monitoring for few servers

Hi Team,
I need to disable complete SQL Monitoring from all SQL servers except for a couple of servers.
For this I disabled the discovery "SQL Server Installation Seed", which is targeted at the "Windows Server" class. Even after applying this override, I see that my objects are discovered in the SQL Server classes and I still get SQL Server alerts for all the SQL servers.
I am running SCOM 2012 SP1, so do I need to run "Remove-SCOMDisabledClassInstance"?
Also am I disabling the discovery for the correct class "Installation Seed"?
Thanks,
S K Agrawal

Hi Faizan,
Suppose you want to disable monitoring for a particular set of SQL servers. Here are the steps to be followed -
1) Create a group of these SQL servers.
2) Considering these are SQL server 2008, go to Authoring--> Object Discoveries.
3) Then disable the discovery for the "Installation Seed" class, because it is targeted at the "Windows Server" class and is the base class for all other SQL classes. You can do this by creating an override that disables the discovery for the group created in Step 1.
4) Open an Operations Manager PowerShell window and run "Remove-SCOMDisabledClassInstance". You might get some errors about RuleID and ObjectID. Don't worry about them; keep running "Remove-SCOMDisabledClassInstance" a few more times until you finally get a message that the command completed successfully.
5) Then, to confirm that discovery has been disabled, wait a few minutes, go to Monitoring--> Discovered Inventory, and change the target type to any of the SQL Server 2008 classes. You will not find the SQL servers from the group in the discovered inventory.
To Re-enable the SQL monitoring, just remove the override MP or go to "installation seed" class for SQL server 2008. Open Override Summary, Click on Edit, and Remove the override where you have disabled the discovery.
I hope this helps.
Please mark this as Answer if this helped you.

Disable monitoring for "There were database redundancy check failures for database ..."

I'm trying to disable the monitor for a database in a DAG that we don't want to have copies for.
I tried the following, but a few hours later the monitoring alert appears again;
Set-MailboxDatabase -Identity "name" -AutoDagExcludeFromMonitoring $True
Add-GlobalMonitoringOverride -Identity DataProtection\EnableDatabaseMonitoringResponder -ItemType Responder -PropertyName "Enabled" -PropertyValue "0" -Duration 60.00:00:00
Does anybody know how to disable the database redundancy check for a specific database?

Hi,
According to your post, I understand that you want to disable the monitor for a specific database in a DAG.
If I misunderstand your concern, please do not hesitate to let me know.
I’m afraid we cannot disable the database redundancy check for a specific database.
However if you don’t want this copy, you can remove this corrupted copy then create a new one:
Remove-MailboxDatabaseCopy -Identity DB1\MBX1 -Confirm:$False
More details about Add a Mailbox Database Copy, for your reference:
https://technet.microsoft.com/en-us/library/dd298080(v=exchg.150).aspx
Thanks
Allen Wang
TechNet Community Support

ACE30-MOD-K9 in bridge mode. Individual server in the same VLAN as the real servers not reachable.

I configured an ACE30-MOD-K9 in bridge mode and configured a server farm with its real servers. Traffic passes and is balanced correctly across all the rservers. But I cannot reach a server that is on the same VLAN as the server farm but does not belong to it.
I thought that traffic directed to this "spare" server would not be load balanced, but that the bridge would still let it pass (transparent mode). Is that correct?
What does the ACE do in bridge mode with traffic directed to servers that do not belong to any server farm but are on the same VLAN (same bridge group)?
With the following configuration, 10.10.10.168 is not reachable:
access-list INBOUND line 8 extended permit ip any any
access-list INBOUND line 16 extended permit icmp any any
probe http HTTP_PROBE1
expect status 200 200
rserver host RS_WEB1
ip address 10.10.10.163
inservice
rserver host RS_WEB2
ip address 10.10.10.164
inservice
rserver host RS_WEB3
ip address 10.10.10.165
inservice
rserver host RS_WEB4
ip address 10.10.10.167
inservice
serverfarm host SF_FIREGROUP
rserver RS_WEB1
    inservice
rserver RS_WEB2
    inservice
rserver RS_WEB3
    inservice
rserver RS_WEB4
    inservice
sticky ip-netmask 255.255.255.255 address source sticky-ip
replicate sticky
serverfarm SF_FIREGROUP
sticky http-cookie myCookie sticky-cookie
cookie insert browser-expire
serverfarm SF_FIREGROUP
class-map match-any VS_FIREGROUP
2 match virtual-address 10.10.10.169 tcp eq www
4 match virtual-address 10.10.10.169 tcp eq 8081
5 match virtual-address 10.10.10.169 tcp eq 8082
6 match virtual-address 10.10.10.169 tcp eq 8083
7 match virtual-address 10.10.10.169 tcp eq 8084
8 match virtual-address 10.10.10.169 tcp eq 8085
9 match virtual-address 10.10.10.169 tcp eq 8097
class-map match-any VS_FIREGROUP_HTTPS
2 match virtual-address 10.10.10.169 tcp eq https
policy-map type loadbalance first-match HTTP
class class-default
    sticky-serverfarm sticky-cookie
policy-map type loadbalance first-match HTTPS
class class-default
    sticky-serverfarm sticky-ip
policy-map multi-match HTTP_HTTPS_MULTI_MATCH
class VS_FIREGROUP
    loadbalance vip inservice
    loadbalance policy HTTP
    loadbalance vip advertise active
class VS_FIREGROUP_HTTPS
    loadbalance vip inservice
    loadbalance policy HTTPS
    loadbalance vip advertise active
interface vlan 4
bridge-group 1
access-group input INBOUND
service-policy input HTTP_HTTPS_MULTI_MATCH
no shutdown
interface vlan 700
bridge-group 1
access-group input INBOUND
no shutdown
interface bvi 1
ip address 10.10.10.150 255.255.255.0
no shutdown
ip route 0.0.0.0 0.0.0.0 10.10.10.1
Thanks a lot
Francesco

Hi Francesco,
Just to add a bit more: a bridge group is very similar to routed mode, except that the ACE cannot NAT pass-through traffic, VLANs cannot be shared, and a couple of other things, but clients should still be able to access the server as before.
But whether in bridge or routed mode, the ACE does create flows and applies any other configured security parameters to the traffic. This is for security. Also, the ACE needs to know the MAC address of the device to forward the traffic to. Can you check whether the ACE has the MAC of the destination? You can also add a route for testing purposes and see if that resolves the issue. That is probably the quickest way to check whether the ACE is causing the issue here.
Regards,
Kanwal

Ace probe failure after IIS app pool recycle?

Windows Server 2003 SP2
ACE Module A2(1.6a)
I suspect this is caused by an IIS6 setting, but posting here in case anyone has seen this. For this one particular site, we have 4 servers in the farm. 2 of those servers are fine. The other 2 (new) servers will generate probe failure after the site's app pool recycles. I then remove the 2 servers from service and re-activate (no inservice, then inservice) and the probe comes back as operational. It appears that the app pool recycle somehow is resetting the hash on the default page, though I'm not sure how. Any ideas are very much appreciated.

Yeah, the hash is inside the probe. Here's the config for the serverfarm and the probe. Public-007 and Public-008 are new servers...the other 6 have been in the farm for the last 2.5 years and they don't have this issue. It's only the 2 new boxes that the probe fails when the app pool is recycled.
serverfarm host PUBLIC
probe URL-DEFAULT-ASPX
rserver PUBLIC-001
    inservice
rserver PUBLIC-002
    inservice
rserver PUBLIC-003
    inservice
rserver PUBLIC-004
    inservice
rserver PUBLIC-005
    inservice
rserver PUBLIC-006
    inservice
rserver PUBLIC-007
    inservice
rserver PUBLIC-008
    inservice
probe http URL-DEFAULT-ASPX
interval 2
faildetect 2
passdetect interval 2
passdetect count 2
request method get url /default.aspx
expect status 200 200
hash

Real Servers not connected to ACE VLAN and Real Servers are clients accessing the VIP

Hi,
I have a very strange set up and need some help to get my config working
I have a ASA firewall with three VLANs
VLAN 1 = Internet
VLAN 2 = DMZ
VLAN 3 = Goes to ACE
On the ACE I have four VLANs
VLAN 3 = Goes to ASA
VLAN 4 = Web Server Tier
VLAN 5 = DB Tier
VLAN 6 = VIPs
Our Application team have asked us to create a New VIP on the ACE with real servers in DMZ (Server A and Server B)
And they have told us that the clients accessing the VIP will be Server A and Server B
I have always created VIPs with real servers directly connected to the ACE but not connected elsewhere.
I believe I have a big challenge of opening ports on the firewall etc. to get this setup working. Also, should I use some sort of NAT / SNAT?
Could anyone guide me on this setup please?
Raj

Hi Raj,
First of all, it is possible to add servers to the ACE that are a hop away from the ACE interfaces. Here the servers are a hop away, but their VIP is part of an ACE interface subnet. The only requirement is that the servers' return traffic towards the client must pass through the ACE (so that the ACE can maintain state and change the source IP of the reply packet from the server IP to the VIP on which the client requested the connection).
When the servers are a hop away and the ACE is not in the path between server and client, we have to do SNAT on the initial client request. This configuration forces the return traffic from the server back to the ACE (as the server will see the NAT IP as the client IP).
In your case the DMZ VIP, which is created for the two real servers A and B, will be accessed only by these servers. So this is a case of servers accessing their own VIP. For this scenario to work we must have SNAT (no matter whether the servers are directly connected or a hop away). So the best solution here is the VIP in VLAN 3, the rservers for this VIP in the DMZ, and SNAT of the client requests using a free IP in VLAN 3.
You also have to open ports on the firewall for both the real server probes and the actual application ports, and modify the firewall policies to allow traffic from the DMZ to the ACE VIP, from the DMZ to the NAT IP, and the corresponding return traffic.

ACE show serverfarm - failure counter is not incremented on Probe-Failure event

Hi,
Despite the probe failure, the failure counter is not incremented. Is there any correlation between the configured probe and the failure counter?
(Custom script probe is used for this serverfarm)
# sh serverfarm xxxxxSt
serverfarm     : xxxxxSt, type: HOST
total rservers : 2
                                                ----------connections-----------
       real                  weight state        current    total      failures
   ---+---------------------+------+------------+----------+----------+---------
   rserver: xxxxx6
       10.222.0.90:8000      8      OPERATIONAL 13         157        0
   rserver: xxxxx7
       10.222.0.92:8000      8      PROBE-FAILED 0          0          0
Thanks,
Attila

Hi Attila,
The Connection Failure counter under show serverfarm is for Loadbalanced Connections which are failing.
If Probes are failing, this counter will not increment.
The connection failure counter can increment for various reasons; some of them are:
- Server not responding to the SYN packet sent by ACE for Loadbalanced connection
- Server sending Reset to the SYN packet sent by ACE for Loadbalanced connection
To check on stats for Probe, you can run "show probe detail" command.
Hope this helps,
Best Regards,
Rahul
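Since the "failures" column counts failed load-balanced connections rather than probe results, a script that watches this output should key on the state column instead. The sketch below parses text shaped like the "show serverfarm" listing quoted in this thread; it is an assumption that other ACE versions print the same layout, and the function names are hypothetical.

```python
import re

RSERVER_RE = re.compile(r"^\s*rserver:\s*(?P<name>\S+)")
REAL_RE = re.compile(
    r"^\s*(?P<ip>\d+\.\d+\.\d+\.\d+):(?P<port>\d+)\s+(?P<weight>\d+)\s+"
    r"(?P<state>[A-Z-]+)\s+(?P<current>\d+)\s+(?P<total>\d+)\s+(?P<failures>\d+)\s*$"
)


def parse_show_serverfarm(text: str) -> list:
    """Return one dict per real server found in the listing."""
    results = []
    name = None
    for line in text.splitlines():
        header = RSERVER_RE.match(line)
        if header:
            name = header.group("name")  # remember which rserver the next row belongs to
            continue
        row = REAL_RE.match(line)
        if row and name is not None:
            results.append({
                "rserver": name,
                "ip": row.group("ip"),
                "port": int(row.group("port")),
                "state": row.group("state"),
                "conn_failures": int(row.group("failures")),
            })
    return results


def probe_failed(text: str) -> list:
    """Names of rservers whose state column reads PROBE-FAILED."""
    return [r["rserver"] for r in parse_show_serverfarm(text) if r["state"] == "PROBE-FAILED"]
```

Applied to the listing above, probe_failed would name xxxxx7 even though its connection failure counter is 0.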

ACE 4710 same real servers, different ports.

Hi! I have the following question based on a new site requirement. The following sites use the same back end servers. Names changed to protect the innocent and my finger fumbling with pretty names for my actual config.
I have two real servers being load balanced: 10.0.0.1 and 10.0.0.2
They have:
Site A URL= www.testsite.com:80
Site B URL= www.newstuff.com:81
I want Site B answering on port 81 for anything referencing the URL match for either port :80, and :81, then redirect to :81 anything that is on :80.
I want Site A answering on port 80 for anything not referencing the Site B URL.
How do I split the traffic coming in while also redirecting if only needed for the one site?
Also, one further question, how do I handle monitoring the ports up for each as validation for the VIP? If either port goes down is that going to take both of them offline?

Hi,
Since they are two different URL's, they would be resolving to two different VIPs. You can create two serverfarms with same servers but listening on ports 81 and 80 and create a class-map for different IP's or even same IP, listening on port 81 and 80. Any client coming with port 80 as destination would be loadbalanced to serverfarm_80 and any client coming on port 81 as destination would be loadbalanced to serverfarm_81.
class-map match-all Test_80
2 match virtual-address 10.1.1.1 tcp eq www
class-map match-all Test_81
3 match virtual-address 10.1.1.2 tcp eq 81
rserver r1
ip address 10.0.0.1
inservice
rserver r2
ip address 10.0.0.2
inservice
serverfarm host serverfarm_80
rserver r1 80
inservice
rserver r2 80
inservice
serverfarm host serverfarm_81
rserver r1 81
inservice
rserver r2 81
inservice
policy-map type loadbalance http first-match http
class class-default
    serverfarm serverfarm_80
policy-map type loadbalance http first-match http_81
class class-default
    serverfarm serverfarm_81
policy-map multi-match Test
class Test_80
    loadbalance vip inservice
    loadbalance policy http
    loadbalance vip icmp-reply active
   class Test_81
    loadbalance vip inservice
    loadbalance policy http_81
    loadbalance vip icmp-reply active
Let me know if you have any questions.
Regards,
Kanwal
Note: Please mark answers if they are helpful.

ACE keep probing real servers using "https get 302"

Hi all,
I have a problem with the Cisco ACE in my company. Currently, two ACE appliances are working as an HA redundant pair. Previously I enabled some HTTPS and HTTP probes using get 302 for some servers and services. But then I was told to remove all HTTPS and HTTP probing and use TCP ports 443 and 80 instead. After that, one of the serverfarms (server groups) is still receiving https get 302. I already checked in the monitoring to see whether there is any HTTPS probe for the respective real servers, but I could not find any. Even if I disable all probing for that serverfarm, all its member servers still receive https get 302. Is this behavior a bug?
The ACE version is A3(2.1), and the HA status is standby cold. Can standby cold cause this kind of trouble?

Hi Daniel,
I just corrected the cert problem and brought the peer state to standby hot. But it still keeps probing with get 302. Then I tried to restart both ACEs. The first step was to restart the second ACE (standby) and then switch all contexts over to it. The problem is that when I made the second one active, some services were not working, especially the ones with SSL terminated on the ACE. I'm pretty sure that both ACEs were in sync.
Any idea what is the problem?

Guidelines for Health Monitoring for TimesTen

This document provides some guidance on monitoring the health of a TimesTen
datastore. Information is provided on monitoring the health of the
datastore itself, and on monitoring the health of replication.
There are two basic mechanisms for monitoring TimesTen:
1. Reactive - monitor for alerts either via SNMP traps (preferred) or
by scanning the TimesTen daemon log (very difficult) and reacting
to problems as they occur.
2. Proactive - probe TimesTen periodically and react if problems, or
potential problems, are detected.
This document focusses on the second (proactive) approach.
First, some basic recommendations and guidelines relating to monitoring
TimesTen:
1. Monitoring should be implemented as a separate process which maintains
a persistent connection to TimesTen. Monitoring schemes (typically based
on scripts) that open a connection each time they check TimesTen impose
an unnecessary and undesirable load on the system and are discouraged.
2. Many aspects of monitoring are 'stateful'. They require periodic
sampling of some metric maintained by TimesTen and comparing its
value with the previous sample. This is another reason why a separate
process with a persistent connection is desirable.
3. A good monitoring implementation will be configurable since the values
used for some of the checks may depend on e.g. the TimesTen configuration
in use or the workload being handled.
MONITORING THE HEALTH OF A DATASTORE
====================================
At the simplest level, this can be achieved by performing a simple SELECT
against one of the system tables. The recommended table to use is the
SYS.MONITOR table. If this SELECT returns within a short time then the
datastore can be considered basically healthy.
If the SELECT does not return within a short time then the datastore is
stuck in a low level hang situation (incredibly unlikely and very serious).
More likely, the SELECT may return an error such as 994 or 846 indicating
that the datastore has crashed (again very unlikely, but possible).
A slightly more sophisticated version would also include an update to a
row in a dummy table. This would ensure that the datastore is also capable
of performing updates. This is important since if the filesystem holding
the transaction logs becomes full the datastore may start to refuse write
operations while still allowing reads.
Now, the SYS.MONITOR table contains many useful operational metrics. A more
sophisticated monitoring scheme could sample some of these metrics and
compute the delta between subsequent samples, raising an alert if the
delta exceeds some (configurable) threshold.
Some examples of metrics that could be handled in this way are:
PERM_IN_USE_SIZE and PERM_IN_USE_HIGH_WATER compared to PERM_ALLOCATED_SIZE
(to detect if datastore is in danger of becoming full).
TEMP_IN_USE_SIZE and TEMP_IN_USE_HIGH_WATER compared to TEMP_ALLOCATED_SIZE
(ditto for temp area).
XACT_ROLLBACKS - excessive rollbacks are a sign of excessive database
contention or application logic problems.
DEADLOCKS - as for XACT_ROLLBACKS.
LOCK_TIMEOUTS - excessive lock timeouts usually indicate high levels of
contention and/or application logic problems.
CMD_PREPARES & CMD_REPREPARES - it is very important for performance that
applications use parameterised SQL statements that they prepare just once
and then execute many times. If these metrics are continuously increasing
then this points to bad application programming which will be hurting
performance.
CMD_TEMP_INDEXES - if this value is increasing then the optimiser is
continually creating temporary indices to process certain queries. This
is usually a serious performance problem and indicates a missing index.
LOG_BUFFER_WAITS - if this value is increasing over time this indicates
inadequate logging capacity. You may need to increase the size of the
datastore log buffer (LogBuffSize) and log file size (LogFileSize). If that
does not alleviate the problem you may need to change your disk layout or
even obtain a higher performance storage subsystem.
LOG_FS_READS - this indicates an inefficiency in 'log snoop' processing as
performed by replication and the XLA/JMS API. To alleviate this you should
try increasing LogBuffSize and LogFileSize.
Checking these metrics is of course optional and not necessary for a basic
healthy/failed decision but if you do check them then you will detect more
subtle problems in advance and be able to take remedial action.
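The delta-based checks described above can be sketched as follows. This is a minimal illustration, assuming each sample is a dict built from one row of SYS.MONITOR by a monitoring process that holds a persistent connection; the function names and threshold values are hypothetical, and only column names mentioned in this document are used.

```python
def metric_alerts(previous: dict, current: dict, max_delta: dict) -> list:
    """Compare two SYS.MONITOR samples and report counters that grew too fast.

    `max_delta` maps a column name to the largest increase tolerated
    between two consecutive samples (a configurable threshold).
    """
    alerts = []
    for name, limit in max_delta.items():
        delta = current[name] - previous[name]
        if delta > limit:
            alerts.append(f"{name} grew by {delta} (limit {limit})")
    return alerts


def perm_usage_ratio(sample: dict) -> float:
    """Fraction of the permanent region in use, to detect a datastore filling up."""
    return sample["PERM_IN_USE_SIZE"] / sample["PERM_ALLOCATED_SIZE"]
```

The monitoring loop would keep the previous sample in memory, which is one reason the stateful, persistent-connection design is recommended earlier in this document.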
MONITORING THE HEALTH OF REPLICATION
====================================
This is a little more complex but is vital to achieve a robust and reliable
system. Ideally, monitoring should be implemented at both datastores, the
active and the standby. There are many more failure modes possible for
a replicated system than for a standalone datastore and it is not possible
to enumerate them all here. However the information provided here should
be sufficient to form the basis of a robust monitoring scheme.
Monitoring replication at the ACTIVE datastore
1.     CALL ttDataStoreStatus() and check result set;
If no connections with type 'replication' exists, conclude that
replication agents are stopped, restart the agents and skip
next steps.
It is assumed here that the replication start policy is 'norestart'.
An alarm about unstable replication agents should be raised
if this is the Nth restart in M seconds (N and M are configuration parameters).
The alarm can later be cleared when the agents have stayed alive for K
seconds (K is a configuration parameter).
2.     CALL ttReplicationStatus() and check result set;
This returns a row for every replication peer for this datastore.
If the pState is not 'start' for any peer, raise an alarm about paused or
stopped replication and skip rest of the steps.
It is assumed that the master cannot help the fact that the state is not
'start'. An operator may have stopped/paused the replication or
TimesTen stopped the replication because of the fail threshold
strategy. In the former case the operator hopefully starts the replication
sooner or later (of course, after that TimesTen may stop it again
because of the fail threshold strategy). In the latter case the standby
side monitor process should recognise the fact and duplicate the data
store with the setMasterRepStart option, which sets the state back to 'start'.
If for any peer, lastMsg > MAX (MAX is a configuration parameter), raise
an alarm for potential communication problems.
Note that if replication is idle (nothing to replicate), or there is
very little replication traffic, the value for lastMsg may become as
high as 60 seconds without indicating any problem. The test logic
should cater for this (i.e. MAX must be > 60 seconds).
3.     CALL ttBookmark();
Compute the holdLSN delta between the values from this call and the
previous call and if the delta is greater than maximum allowed
(configuration parameter), raise an alarm about standby
that is too far behind. Continue to next step.
Notice that maximum delta should be less than FAILTHRESHOLD * logSize.
4.     CALL ttRepSyncSubscriberStatus(datastore, host);
This step is only needed if you are using RETURN RECEIPT or RETURN TWOSAFE
with the optional DISABLE RETURN feature.
If disabled is 1, raise an alarm for disabled return service.
Continue to next step. If RESUME RETURN policy is not enabled we could,
of course, try to enable return service again (especially when DURABLE
COMMIT is OFF).
There should be no reason to reject TimesTen's own mechanisms that
control the return service. Thus, no other actions are taken for a disabled
return service.
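The decision logic of step 2 for the active side, and the configuration constraint from step 3, can be sketched as follows. This is only an illustration: fetching the ttReplicationStatus() rows is omitted, the PeerStatus type and both function names are hypothetical, and the maximum delta and logSize are assumed to be expressed in the same unit.

```python
from typing import List, NamedTuple


class PeerStatus(NamedTuple):
    """The fields of one ttReplicationStatus() row used by the checks below."""
    subscriber: str  # hypothetical label identifying the peer
    p_state: str     # value of the pState column
    last_msg: int    # value of the lastMsg column, in seconds


def check_peers(peers: List[PeerStatus], max_last_msg: int) -> List[str]:
    """Return the alarm messages for step 2; an empty list means no alarm."""
    if max_last_msg <= 60:
        # lastMsg can reach 60 seconds on an idle link, so MAX must exceed that.
        raise ValueError("MAX must be greater than 60 seconds")
    not_started = [p for p in peers if p.p_state != "start"]
    if not_started:
        # Paused or stopped replication: raise this alarm and skip the rest.
        return [f"replication not started for {p.subscriber} (pState={p.p_state})"
                for p in not_started]
    return [f"no message from {p.subscriber} for {p.last_msg}s"
            for p in peers if p.last_msg > max_last_msg]


def max_delta_is_valid(max_delta: int, fail_threshold: int, log_size: int) -> bool:
    """Step 3 constraint: the maximum delta must be below FAILTHRESHOLD * logSize."""
    return max_delta < fail_threshold * log_size


peers = [PeerStatus("standby1", "start", 5), PeerStatus("standby2", "pause", 300)]
for alarm in check_peers(peers, max_last_msg=90):
    print(alarm)
```

Returning early when any peer is not in the 'start' state mirrors the "skip rest of the steps" instruction; the lastMsg test is only meaningful for peers that are actually replicating.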
Monitoring replication at the STANDBY datastore
1.     CALL ttDataStoreStatus();
If no connections with type 'replication' exist, conclude that
replication agents are stopped, restart the agents and skip the
next steps.
It is assumed that the replication start policy is 'norestart'.
An alarm about unstable replication agents should be raised
if this is the Nth restart in M seconds (N and M are configuration parameters).
The alarm can later be cleared once the agents have stayed alive for K
seconds (K is a configuration parameter).
2.     Call SQLGetInfo(...,TT_REPLICATION_INVALID,...);
If the status is 1, this indicates that the active store has marked this store
as failed because it fell too far out of sync (log FAILTHRESHOLD exceeded).
Start recovery actions by destroying the datastore and recreating via a
'duplicate' operation from the active.
3.     Check 'timerecv' value for relevant row in TTREP.REPPEERS
If (timerecv - previous timerecv) > MAX (MAX is a configuration parameter),
raise an alarm for potential communication problems.
You can determine the correct row in TTREP.REPPEERS by first getting the
correct TT_STORE_ID value from TTREP.TTSTORES based on the values in
HOST_NAME and TT_STORE_NAME (you want the id corresponding to the active
store) and then using that to query TTREP.REPPEERS (you can use a join if
you like).
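Step 1 of both procedures above raises an alarm on the Nth agent restart within M seconds and clears it after the agents have stayed alive for K seconds. One possible way to track this is sketched below; the class and method names are hypothetical, timestamps are plain seconds supplied by the caller, and restarting the agents themselves is left to the monitoring application.

```python
from collections import deque


class RestartAlarm:
    """Tracks replication agent restarts (hypothetical helper).

    n, m_seconds and k_seconds correspond to the N, M and K
    configuration parameters of the monitoring application.
    """

    def __init__(self, n: int, m_seconds: float, k_seconds: float) -> None:
        self.n = n
        self.m_seconds = m_seconds
        self.k_seconds = k_seconds
        self.restarts = deque()  # timestamps of recent restarts
        self.active = False      # True while the alarm is raised

    def record_restart(self, now: float) -> bool:
        """Call each time the monitor restarts the agents."""
        self.restarts.append(now)
        # Forget restarts that fall outside the M-second window.
        while now - self.restarts[0] > self.m_seconds:
            self.restarts.popleft()
        if len(self.restarts) >= self.n:
            self.active = True
        return self.active

    def record_alive(self, now: float) -> bool:
        """Call each time the monitor finds the agents running."""
        if self.active and now - self.restarts[-1] >= self.k_seconds:
            self.active = False
        return self.active


alarm = RestartAlarm(n=3, m_seconds=60, k_seconds=120)
for t in (0, 10, 20):
    raised = alarm.record_restart(t)
print("alarm raised:", raised)
```

The sketch takes the current time as an argument rather than reading a clock, which keeps the logic easy to test; a production monitor would typically pass in a monotonic clock reading.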
The recovery actions that should be taken in the event of a problem with
replication depend on several factors:
1. The application requirements
2. The type of replication configuration
3. The replication mode (asynchronous, return receipt or return twosafe)
that is in use
Consult the TimesTen replication guide for information on detailed recovery
procedures for each combination.
================================ END ==================================

The information in the forum article is the abridged text of a whitepaper I wrote recommending best practice for building a monitoring infrastructure for TimesTen, i.e. you write an 'application' in C, C++ or Java that performs these monitoring activities and run it continually in production against your datastores. Various aspects of the behaviour of the application could be controlled by configurable parameters; these are not TimesTen parameters but parameters defined and used by the monitoring application.
In the specific case you mentioned, the 'lastMsg' value returned by ttReplicationStatus is the number of seconds since the last message was received from that peer. The monitoring application would compare this against some meaningful threshold (maybe 30 seconds) and if lastMsg is greater than that value, raise an alarm. To allow flexibility, the value compared against (MAX) should be configurable.
Does that make sense?
Chris
