Member death detection (Coh-3.5.3)

Hi,
Can someone explain death detection in 3.5.3, please? I have a reasonable idea of how it works:
A member suspects another member has departed, so it asks two other members for confirmation.
If these other members confirm the departure, then the original member informs the rest of the cluster that the member has departed.
Wherever possible, the members being asked to confirm departure will have different roles from the member asking for confirmation.
We occasionally lose storage nodes in this way but I have some questions around what I am seeing in the logs below.
The scenario is this:
* Member 27 has a timeout sending a packet to Member 16
* Member 27 asks Member 83 and member 85 to confirm departure of Member 16
* Member 83 rejects the confirmation request (@ 2010-09-22 05:21:43.411)
* Member 85 accepts the confirmation request (I assume it does as it has no rejection in its log)
* Member 27 informs the rest of the cluster that Member 16 has departed
* Member 1 (the senior member) heartbeats Member 16 causing it to re-initialise itself - it then rejoins as Member 127.
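The sequence above was pieced together by hand from four separate logs. A minimal sketch of how the same ordering could be produced mechanically is below. It assumes the log header layout seen in this thread (`timestamp/uptime Oracle Coherence <edition> <version> <Level> (thread=..., member=...): message`) and that the machines' wall clocks are in sync; the function names are my own, and the sample messages are cut short with `...`.

```python
import re
from datetime import datetime

# Header layout as it appears in the logs in this thread (an assumption
# for any other Coherence version or logging setup).
HEADER = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})/[\d.]+ "
    r"Oracle Coherence \S+ \S+ <(?P<level>\w+)> "
    r"\(thread=(?P<thread>[^,]+), member=(?P<member>[^)]+)\): (?P<msg>.*)"
)

def parse(line):
    """Return (timestamp, member, level, message), or None for non-Coherence lines."""
    m = HEADER.search(line)
    if m is None:
        return None
    ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f")
    return ts, m["member"], m["level"], m["msg"]

def timeline(lines):
    """Merge lines from any number of member logs into one time-ordered list."""
    events = [e for e in map(parse, lines) if e is not None]
    return sorted(events, key=lambda e: e[0])

# Lines taken from the logs in this post, with the message bodies abbreviated.
lines = [
    "2010-09-22 05:21:44.005/648893.537 Oracle Coherence GE 3.5.3/465p2 <Error> (thread=Cluster, member=16): Received cluster heartbeat from the senior Member(Id=1, ...",
    "2010-09-22 05:21:43.410/648886.462 Oracle Coherence GE 3.5.3/465p2 <Warning> (thread=PacketPublisher, member=27): Timeout while delivering a packet ...",
    "2010-09-22 05:21:43.411/648845.780 Oracle Coherence GE 3.5.3/465p2 <Info> (thread=Cluster, member=83): Rejecting the departure confirmation request by Member(Id=27, ...",
    "648857.732: [GC 648857.732: [ParNew: ...",  # GC lines are skipped
]

for ts, member, level, msg in timeline(lines):
    print(ts.time(), f"member={member}", level, msg[:50])
```

Sorting on the wall-clock timestamp rather than the JVM uptime figure is deliberate: the uptime after the `/` is relative to each process start, so it cannot be compared across members.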
My question is: given that Member 83 rejected the confirmation request, I assume it could still see Member 16. What exactly are the rules around forcing a Member to depart the cluster when this happens?
The nodes asked to confirm departure were another storage node (which rejected the request) and a storage disabled worker node (which accepted the request).
These storage disabled nodes can sometimes be under reasonable load so might not be the best ones to ask to confirm departure.
What happens when only one of the two members confirms departure?
Can we choose which roles get asked to confirm departure?
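On the "only one of the two confirms" question, the Member 27 log in this post can at least be checked mechanically for who was asked and who confirmed. The sketch below does only that; it says nothing about what the actual rules are. It assumes the `by MemberSet(Size=N, ...` layout shown in these logs, the helper name is hypothetical, and the member details are abbreviated with `...`.

```python
import re

MEMBER_SET = re.compile(r"by MemberSet\(Size=(\d+)")
MEMBER_ID = re.compile(r"Member\(Id=(\d+)")

def member_sets(log_text):
    """For each 'by MemberSet(Size=N' return the N member Ids that follow it."""
    sets = []
    for m in MEMBER_SET.finditer(log_text):
        size = int(m.group(1))
        ids = MEMBER_ID.findall(log_text[m.end():])[:size]
        sets.append([int(i) for i in ids])
    return sets

# Abbreviated from the Member 27 log in this post.
log_27 = """\
... requesting the departure confirmation for Member(Id=16, ...)
by MemberSet(Size=2, BitSetCount=4
  Member(Id=83, ...)
  Member(Id=85, ...)
... Member departure confirmed by MemberSet(Size=1, BitSetCount=4
  Member(Id=85, ...)
  ); removing Member(Id=16, ...)
"""

asked, confirmed = member_sets(log_27)
print("asked:", asked)
print("confirmed:", confirmed)
print("asked but did not confirm:", sorted(set(asked) - set(confirmed)))
```

For this log the result is that 83 and 85 were asked and only 85 is listed as confirming, which matches the rejection logged by Member 83.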
Member 27
2010-09-22 05:21:43.410/648886.462 Oracle Coherence GE 3.5.3/465p2 <Warning> (thread=PacketPublisher, member=27): Timeout while delivering a packet Directed{PacketType=0x0DDF00D5, ToId=16, FromId=27, Direction=Outgoing, SentCount=79, SentMillis=05:21:43.111, ToMemberSet=null, ServiceId=7, MessageType=16, FromMessageId=32360401, ToMessageId=1730276, MessagePartCount=1, MessagePartIndex=0, NackInProgress=false, ResendScheduled=05:21:43.311, Timeout=05:21:43.15, PendingResendSkips=0, DeliveryState=outstanding, Body=0x0034D45C01001B012B110B545A001B012B110B5459004C021564BEA9FC8FE2CA80014C230D992515A16200A501843100004E084744532047424C4F40A6014E063834353235374000004CA90215A06200A401945F00A201BE2000A4014219A501A16200A501843100004E084744532047424C4F40A6014E063834353235374040..., Body.length=1445}; requesting the departure confirmation for Member(Id=16, Timestamp=2010-09-14 17:06:52.289, Address=xx.xxx.34.93:8088, MachineId=35165, Location=machine:xxxxx06428,process:8645,member:xxxxx06428:Data-2, Role=RbsOdcCoreDaoODCCacheServer)
by MemberSet(Size=2, BitSetCount=4
  Member(Id=83, Timestamp=2010-09-14 17:07:39.704, Address=xx.xxx.34.97:8091, MachineId=35169, Location=machine:xxxxx06432,process:25212,member:xxxxx06432:Data-6, Role=RbsOdcCoreDaoODCCacheServer)
   Member(Id=85, Timestamp=2010-09-21 00:43:23.208, Address=xx.xxx.34.88:8090, MachineId=35160, Location=machine:xxxxx06441,process:10718,member:xxxxx06441:VEST-6, Role=RbsOdcVestCoreVestMain)
2010-09-22 05:21:43,412 [Logger@9227652 3.5.3/465p2] INFO  Coherence - 2010-09-22 05:21:43.411/648886.463 Oracle Coherence GE 3.5.3/465p2 <Info> (thread=Cluster, member=27): Member departure confirmed by MemberSet(Size=1, BitSetCount=4
   Member(Id=85, Timestamp=2010-09-21 00:43:23.208, Address=xx.xxx.34.88:8090, MachineId=35160, Location=machine:xxxxx06441,process:10718,member:xxxxx06441:VEST-6, Role=RbsOdcVestCoreVestMain)
  ); removing Member(Id=16, Timestamp=2010-09-14 17:06:52.289, Address=xx.xxx.34.93:8088, MachineId=35165, Location=machine:xxxxx06428,process:8645,member:xxxxx06428:Data-2, Role=RbsOdcCoreDaoODCCacheServer)
2010-09-22 05:21:43.412/648886.464 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=27): Member 16 left service Management with senior member 1
Member 83
2010-09-22 05:19:06.003/648688.372 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=83): Member 100 joined Service PutAllInvocationService with senior member 1
648726.432: [GC 648726.432: [ParNew: 186837K->14089K(191744K), 0.0153790 secs] 2272524K->2100231K(2538752K), 0.0155370 secs] [Times: user=0.19 sys=0.00, real=0.01 secs]
648782.121: [GC 648782.121: [ParNew: 184585K->8246K(191744K), 0.0077860 secs] 2270727K->2094904K(2538752K), 0.0079440 secs] [Times: user=0.09 sys=0.00, real=0.01 secs]
648784.254: [GC 648784.254: [ParNew: 178742K->11283K(191744K), 0.0231890 secs] 2265400K->2097941K(2538752K), 0.0232940 secs] [Times: user=0.18 sys=0.00, real=0.02 secs]
648840.470: [GC 648840.470: [ParNew: 180452K->8909K(191744K), 0.0078950 secs] 2267110K->2095568K(2538752K), 0.0080540 secs] [Times: user=0.06 sys=0.01, real=0.01 secs]
648842.869: [GC 648842.869: [ParNew: 179405K->9775K(191744K), 0.0189500 secs] 2266064K->2096433K(2538752K), 0.0190970 secs] [Times: user=0.21 sys=0.00, real=0.02 secs]
2010-09-22 05:21:43.411/648845.780 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=83): MemberLeft request for Member 16 received from Member(Id=27, Timestamp=2010-09-14 17:06:58.895, Address=xx.xxx.34.95:8090, MachineId=35167, Location=machine:xxxxx06430,process:8662,member:xxxxx06430:Data-1, Role=RbsOdcCoreDaoODCCacheServer)
2010-09-22 05:21:43.411/648845.780 Oracle Coherence GE 3.5.3/465p2 <Info> (thread=Cluster, member=83): Rejecting the departure confirmation request by Member(Id=27, Timestamp=2010-09-14 17:06:58.895, Address=xx.xxx.34.95:8090, MachineId=35167, Location=machine:xxxxx06430,process:8662,member:xxxxx06430:Data-1, Role=RbsOdcCoreDaoODCCacheServer) regarding Member(Id=16, Timestamp=2010-09-14 17:06:52.289, Address=xx.xxx.34.93:8088, MachineId=35165, Location=machine:xxxxx06428,process:8645,member:xxxxx06428:Data-2, Role=RbsOdcCoreDaoODCCacheServer)
2010-09-22 05:21:43.413/648845.782 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=83): MemberLeft notification for Member 16 received from Member(Id=27, Timestamp=2010-09-14 17:06:58.895, Address=xx.xxx.34.95:8090, MachineId=35167, Location=machine:xxxxx06430,process:8662,member:xxxxx06430:Data-1, Role=RbsOdcCoreDaoODCCacheServer)
2010-09-22 05:21:43.413/648845.782 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=83): Member 16 left service Management with senior member 1
Member 85
2010-09-22 05:19:05.894/102952.247 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=85): Member 100 joined Service distributed-pof-service with senior member 1
2010-09-22 05:19:06.003/102952.356 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=85): Member 100 joined Service PutAllInvocationService with senior member 1
09/22/10 05:21:07.667 INFO: [ProcessWrapper] [STDOUT] 103074.362: [GC 103074.362: [ParNew: 225038K->3371K(249216K), 0.0025490 secs] 782627K->560965K(2069504K), 0.0026190 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
2010-09-22 05:21:43.411/103109.764 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=85): MemberLeft request for Member 16 received from Member(Id=27, Timestamp=2010-09-14 17:06:58.895, Address=xx.xxx.34.95:8090, MachineId=35167, Location=machine:xxxxx06430,process:8662,member:xxxxx06430:Data-1, Role=RbsOdcCoreDaoODCCacheServer)
2010-09-22 05:21:43.412/103109.765 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=85): MemberLeft notification for Member 16 received from Member(Id=27, Timestamp=2010-09-14 17:06:58.895, Address=xx.xxx.34.95:8090, MachineId=35167, Location=machine:xxxxx06430,process:8662,member:xxxxx06430:Data-1, Role=RbsOdcCoreDaoODCCacheServer)
2010-09-22 05:21:43.412/103109.765 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=85): Member 16 left service Management with senior member 1
Member 16
09/22/10 05:11:03.478 INFO: [ProcessWrapper] [STDOUT] 648253.507: [GC 648253.507: [ParNew: 173464K->2443K(191744K), 0.0128920 secs] 1401433K->1230671K(2538752K), 0.0130370 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
09/22/10 05:12:18.307 INFO: [ProcessWrapper] [STDOUT] 648328.337: [GC 648328.337: [ParNew: 172939K->2972K(191744K), 0.0108240 secs] 1401167K->1231401K(2538752K), 0.0109550 secs] [Times: user=0.05 sys=0.01, real=0.01 secs]
09/22/10 05:13:30.532 INFO: [ProcessWrapper] [STDOUT] 648400.564: [GC 648400.564: [ParNew: 173468K->2582K(191744K), 0.0095490 secs] 1401897K->1231266K(2538752K), 0.0097180 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
09/22/10 05:14:51.958 INFO: [ProcessWrapper] [STDOUT] 648481.990: [GC 648481.990: [ParNew: 173078K->2969K(191744K), 0.0087620 secs] 1401762K->1231877K(2538752K), 0.0088810 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
09/22/10 05:16:12.607 INFO: [ProcessWrapper] [STDOUT] 648562.641: [GC 648562.641: [ParNew: 173465K->2798K(191744K), 0.0067770 secs] 1402373K->1231945K(2538752K), 0.0069300 secs] [Times: user=0.07 sys=0.01, real=0.01 secs]
09/22/10 05:17:23.249 INFO: [ProcessWrapper] [STDOUT] 648633.284: [GC 648633.284: [ParNew: 173294K->2824K(191744K), 0.0064570 secs] 1402441K->1232187K(2538752K), 0.0065950 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
09/22/10 05:18:43.913 INFO: [ProcessWrapper] [STDOUT] 648713.948: [GC 648713.948: [ParNew: 173320K->2812K(191744K), 0.0065200 secs] 1402683K->1232354K(2538752K), 0.0066450 secs] [Times: user=0.05 sys=0.01, real=0.00 secs]
2010-09-22 05:19:05.894/648735.426 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=16): Member 100 joined Service distributed-pof-service with senior member 1
2010-09-22 05:19:06.003/648735.535 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=16): Member 100 joined Service PutAllInvocationService with senior member 1
09/22/10 05:21:30.919 INFO: [ProcessWrapper] [STDOUT] 648880.948: [GC 648880.948: [ParNew: 173272K->3255K(191744K), 0.0126250 secs] 1403096K->1233261K(2538752K), 0.0127910 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
2010-09-22 05:21:44.005/648893.537 Oracle Coherence GE 3.5.3/465p2 <Error> (thread=Cluster, member=16): Received cluster heartbeat from the senior Member(Id=1, Timestamp=2010-09-14 17:06:38.751, Address=xx.xxx.34.98:8088, MachineId=35170, Location=machine:xxxxx06433,process:20795,member:xxxxx06433:Data-1, Role=RbsOdcCoreDaoODCCacheServer) that does not contain this Member(Id=16, Timestamp=2010-09-14 17:06:52.289, Address=xx.xxx.34.93:8088, MachineId=35165, Location=machine:xxxxx06428,process:8645,member:xxxxx06428:Data-2, Role=RbsOdcCoreDaoODCCacheServer); stopping cluster service.
2010-09-22 05:21:44.005/648893.537 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=16): Service Cluster left the cluster
Any more information would be appreciated (or any settings I can tweak).
Cheers,
JK

Thanks Mark, we will look at the timeout settings.
In the logs for the accusing node we see...
2010-09-22 05:21:14.408/648857.460 Oracle Coherence GE 3.5.3/465p2 <D6> (thread=PacketPublisher, member=27): Member(Id=16, Timestamp=2010-09-14 17:06:52.289, Address=xx.xxx.34.93:8088, MachineId=35165, Location=machine:xxxxx06428,process:8645,member:xxxxx06428:Data-2, Role=RbsOdcCoreDaoODCCacheServer) has failed to respond to 17 packets; declaring this member as paused.
648857.732: [GC 648857.732: [ParNew: 189247K->16395K(191744K), 0.0229780 secs] 1556730K->1384220K(2538752K), 0.0231600 secs] [Times: user=0.29 sys=0.00, real=0.02 secs]
2010-09-22 05:21:43.410/648886.462 Oracle Coherence GE 3.5.3/465p2 <Warning> (thread=PacketPublisher, member=27): Timeout while delivering a packet Directed{PacketType=0x0DDF00D5, ToId=16, FromId=27, Direction=Outgoing, SentCount=79, SentMillis=05:21:43.111, ToMemberSet=null, ServiceId=7, MessageType=16, FromMessageId=32360401, ToMessageId=1730276, MessagePartCount=1, MessagePartIndex=0, NackInProgress=false, ResendScheduled=05:21:43.311, Timeout=05:21:43.15, PendingResendSkips=0, DeliveryState=outstanding, Body=0x0034D45C01001B012B110B545A001B012B110B5459004C021564BEA9FC8FE2CA80014C230D992515A16200A501843100004E084744532047424C4F40A6014E063834353235374000004CA90215A06200A401945F00A201BE2000A4014219A501A16200A501843100004E084744532047424C4F40A6014E063834353235374040..., Body.length=1445}; requesting the departure confirmation for Member(Id=16, Timestamp=2010-09-14 17:06:52.289, Address=xx.xxx.34.93:8088, MachineId=35165, Location=machine:xxxxx06428,process:8645,member:xxxxx06428:Data-2, Role=RbsOdcCoreDaoODCCacheServer)
by MemberSet(Size=2, BitSetCount=4
Member(Id=83, Timestamp=2010-09-14 17:07:39.704, Address=xx.xxx.34.97:8091, MachineId=35169, Location=machine:xxxxx06432,process:25212,member:xxxxx06432:Data-6, Role=RbsOdcCoreDaoODCCacheServer)
   Member(Id=85, Timestamp=2010-09-21 00:43:23.208, Address=xx.xxx.34.88:8090, MachineId=35160, Location=machine:xxxxx06441,process:10718,member:xxxxx06441:VEST-6, Role=RbsOdcVestCoreVestMain)
2010-09-22 05:21:43,412 [Logger@9227652 3.5.3/465p2] INFO  Coherence - 2010-09-22 05:21:43.411/648886.463 Oracle Coherence GE 3.5.3/465p2 <Info> (thread=Cluster, member=27): Member departure confirmed by MemberSet(Size=1, BitSetCount=4
   Member(Id=85, Timestamp=2010-09-21 00:43:23.208, Address=xx.xxx.34.88:8090, MachineId=35160, Location=machine:xxxxx06441,process:10718,member:xxxxx06441:VEST-6, Role=RbsOdcVestCoreVestMain)
   ); removing Member(Id=16, Timestamp=2010-09-14 17:06:52.289, Address=xx.xxx.34.93:8088, MachineId=35165, Location=machine:xxxxx06428,process:8645,member:xxxxx06428:Data-2, Role=RbsOdcCoreDaoODCCacheServer)
2010-09-22 05:21:43.412/648886.464 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=27): Member 16 left service Management with senior member 1
So the first hint that anything is wrong is the debug message at 05:21:14.408.
Confirmation is requested at 05:21:43.410 which is 29 seconds later, so I assume we have the "dev" timeout settings of 30 seconds.
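The 29-second figure can be checked directly from the two Member 27 log headers. A small sketch using only the timestamps and JVM uptime values quoted above:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Wall-clock timestamps from the two Member 27 log lines.
paused = datetime.strptime("2010-09-22 05:21:14.408", FMT)
confirm_requested = datetime.strptime("2010-09-22 05:21:43.410", FMT)
elapsed = (confirm_requested - paused).total_seconds()
print(f"wall clock: {elapsed:.3f} s")

# The JVM uptime values after the '/' in the same two headers give the same gap.
uptime_gap = 648886.462 - 648857.460
print(f"JVM uptime: {uptime_gap:.3f} s")
```

Both come out at 29.002 seconds. That is consistent with a timeout of roughly 30 seconds, but the log alone does not say which setting produced it.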
The logs for the suspect member have...
09/22/10 05:17:23.249 INFO: [ProcessWrapper] [STDOUT] 648633.284: [GC 648633.284: [ParNew: 173294K->2824K(191744K), 0.0064570 secs] 1402441K->1232187K(2538752K), 0.0065950 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
09/22/10 05:18:43.913 INFO: [ProcessWrapper] [STDOUT] 648713.948: [GC 648713.948: [ParNew: 173320K->2812K(191744K), 0.0065200 secs] 1402683K->1232354K(2538752K), 0.0066450 secs] [Times: user=0.05 sys=0.01, real=0.00 secs]
09/22/10 05:19:05.897 INFO: [ProcessWrapper] [STDOUT] 2010-09-22 05:19:05,895 [Logger@9248631 3.5.3/465p2] DEBUG Coherence - 2010-09-22 05:19:05.894/648735.426 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=16): Member 100 joined Service distributed-pof-service with senior member 1
09/22/10 05:19:06.004 INFO: [ProcessWrapper] [STDOUT] 2010-09-22 05:19:06,004 [Logger@9248631 3.5.3/465p2] DEBUG Coherence - 2010-09-22 05:19:06.003/648735.535 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=16): Member 100 joined Service PutAllInvocationService with senior member 1
09/22/10 05:20:08.250 INFO: [ProcessWrapper] [STDOUT] 648798.277: [GC 648798.277: [ParNew: 173308K->2776K(191744K), 0.0136850 secs] 1402850K->1232600K(2538752K), 0.0138770 secs] [Times: user=0.09 sys=0.00, real=0.01 secs]
09/22/10 05:20:16.851 INFO: [FabricDiagnosticsPlugin] [M: 15M/40M/184M] [T: O(30)] [3.0.1.7] [JRE: 1.5.0_05/Sun Microsystems Inc.] [OS: Linux/2.6.9-89.0.9.ELlargesmp/i386] [H: xx.xxx.34.93]
09/22/10 05:21:30.919 INFO: [ProcessWrapper] [STDOUT] 648880.948: [GC 648880.948: [ParNew: 173272K->3255K(191744K), 0.0126250 secs] 1403096K->1233261K(2538752K), 0.0127910 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
09/22/10 05:21:44.005 INFO: [ProcessWrapper] [STDOUT] 2010-09-22 05:21:44,005 [Logger@9248631 3.5.3/465p2] ERROR Coherence - 2010-09-22 05:21:44.005/648893.537 Oracle Coherence GE 3.5.3/465p2 <Error> (thread=Cluster, member=16): Received cluster heartbeat from the senior Member(Id=1, Timestamp=2010-09-14 17:06:38.751, Address=xx.xxx.34.98:8088, MachineId=35170, Location=machine:xxxxx06433,process:20795,member:xxxxx06433:Data-1, Role=RbsOdcCoreDaoODCCacheServer) that does not contain this Member(Id=16, Timestamp=2010-09-14 17:06:52.289, Address=xx.xxx.34.93:8088, MachineId=35165, Location=machine:xxxxx06428,process:8645,member:xxxxx06428:Data-2, Role=RbsOdcCoreDaoODCCacheServer); stopping cluster service.
These seem to show that it is reasonably OK: there are GC pauses, but none is anywhere near 30 seconds, although I suppose there are other reasons besides GC that may cause comms timeouts.
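Rather than eyeballing the GC lines, the longest recorded pause can be pulled out with a few lines of script. This is a sketch that assumes the GC log layout shown above, where the total pause is the last `N secs]` figure before `[Times:`. It only sees pauses that the GC log actually records.

```python
import re

# Total collection time, e.g. "0.0138770 secs] [Times: ..."
PAUSE = re.compile(r"(\d+\.\d+) secs\] \[Times:")

def longest_gc_pause(lines):
    """Return the longest GC pause in seconds found in the given log lines."""
    pauses = [float(m.group(1)) for line in lines for m in PAUSE.finditer(line)]
    return max(pauses, default=0.0)

# The last two GC lines from Member 16's log before it was removed.
gc_lines = [
    "09/22/10 05:20:08.250 INFO: [ProcessWrapper] [STDOUT] 648798.277: [GC 648798.277: [ParNew: 173308K->2776K(191744K), 0.0136850 secs] 1402850K->1232600K(2538752K), 0.0138770 secs] [Times: user=0.09 sys=0.00, real=0.01 secs]",
    "09/22/10 05:21:30.919 INFO: [ProcessWrapper] [STDOUT] 648880.948: [GC 648880.948: [ParNew: 173272K->3255K(191744K), 0.0126250 secs] 1403096K->1233261K(2538752K), 0.0127910 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]",
]

longest = longest_gc_pause(gc_lines)
print(f"longest GC pause: {longest * 1000:.1f} ms")
```

For these two lines the longest pause is about 14 ms, which is nowhere near the 29-second gap seen on Member 27.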
Cheers,
JK

Similar Messages

  • PSVC application death detected

    Please Help!
    My Sunfire V490 has an error in /var/adm/messages indicating "picld[129]: [ID 230523 daemon.error] PSVC application death detected".
    Platform:
    SunOS apgdb3 5.10 Generic_125100-05 sun4u sparc SUNW,Sun-Fire-V490
    I did a search and realized that this error is related to the Platform Service. The most common cause of this is that server platform-specific patches are not up to date.
    For SunOS 5.9, Solaris 9, there is a recommendation to install patch 113447-23. Patch 113447-23 is obsoleted by patch 118558-39 and 118558-39 is recommended for SunOS 5.9.
    Questions:
    1) How can I fix the error "PSVC application death detected" in SunOS 5.10?
    2) If I want to install patch 118558-39 of SunOS 5.9 in SunOS 5.10 Solaris 10, how can I find an equivalent patch for SunOS 5.10?
    3) If I want to install/update platform-specific patches for Sunfire V490, SunOS 5.10, which patch(es) should be the right one? (I did a search, many results return)
    Much appreciated for your help.
    Joe_ES

    Hallo Salthaus_de
    Tbh, it might work if you do as you just said and move it over to a package, and it might not work. :)
    The most typical reason that this happens is that your application install ends before it is done doing what it should. I have seen this happening a lot with "Autodesk" products: they start a "setup.exe" which spawns another "setup.exe" and exits the first.
    This then tells SCCM that the setup.exe PID you requested to look at has ended, and you can now do your detection check.
    This means it will be installing for another 15 minutes and taking up msiexec.exe, which means the applications installing after this one could possibly end up not installing, with the error code "another product is currently being installed".
    Based on this I would still, even though the "workaround" works as it should, find out why this happens.
    Just my 2 cents, what you do next is up to you. :)
    Kind regards
    Morten Leth

  • Detected another cluster senior running on an incompatible protocol at null

    Recently saw this confusing error. I'm not sure what "null" should have been but I think the error is erroneous also since everything I have is running 3.7.1.1.
    -Andrew
    0656PM.log-2012-03-01 20:23:35.016/5246.928 Oracle Coherence GE 3.7.1.1 <Error> (thread=Cluster, member=16): Detected another cluster senior, running on an incompatible protocol at null manual intervention may be required

    If none of the troubleshooting steps for the display adaptor (graphics card driver update, BadDrivres.txt etc.) works, we have to understand that the display could actually not be supported. By display, I mean the card-driver-driver version combination. So,
    1. It could be a bad card - None of the available drivers work for the application. Here, ditch the card and get a newer one that works.
    2. It could be a bad driver - For e.g. Only DirectX drivers work, OpenGL ones could be pathetic (Many games detect this and if this is bad, I have seen Photoshop struggle with this). Or some display configurations would be badly supported (multiple monitors, some funny resolutions, some of the ports, crossfire etc.). Older drivers/cards/some OEM drivers etc. have these issues in my experience.
    3. It could be a bad driver version - Here upgrade or a downgrade works.
    Because of a Video editor, the (lack of) capabilities of the card is highlighted. Video Playback alone doesn't nearly exercise the capabilities the cards usually have to offer. Editing really does. That is why it matters if the Video Editor actually says that it supports Graphics acceleration (Or GPU acceleration) for any of the workflows.
    What I am saying here is that when we blame the application for any of the issues, we have to be certain that the issues are NOT due to one of the mentioned three causes. I have seen this all too often in the gaming forums, Nero forums (some years back), in PPro forum, and here. If the application manufacturer is really to blame, the mistake would be all too obvious and too many of us would have complained about it. I do not think this is the case with Premiere Elements 10.
    Introspection works.

  • "Service Cluster left the cluster" - lost all my data

    My four storage-enabled cluster nodes lost all their cached data when all the services left the cluster in response to some issue(?). Is that the expected behavior? Is the correct procedure to transactionally store to disk so you can reload when this happens, or should this simply never happen? Seems like this should not happen. These four nodes are on the same server. At about time 12:31 everything goes pear-shaped.
    2011-01-14 12:31:16.904/50004.436 Oracle Coherence GE 3.6.0.0 <Error> (thread=Cluster, member=3): This senior Member(Id=3, Timestamp=2011-01-13 22:37:52.106, Address=192.168.3.20:8088, MachineId=27412, Location=machine:amd4,process:4428,member:Administrator, Role=CoherenceServer) appears to have been disconnected from other nodes due to a long period of inactivity and the seniority has been assumed by the Member(Id=9, Timestamp=2011-01-13 22:38:01.438, Address=192.168.3.20:8094, MachineId=27412, Location=machine:amd4,process:3904,member:Administrator, Role=CoherenceServer); stopping cluster service.
    2011-01-14 12:31:16.905/50004.437 Oracle Coherence GE 3.6.0.0 <D5> (thread=Cluster, member=3): Service Cluster left the cluster
    2011-01-14 12:31:16.906/50004.438 Oracle Coherence GE 3.6.0.0 <D5> (thread=DistributedCache:DistributedStatsCacheService, member=3): Service DistributedStatsCacheService left the cluster
    2011-01-14 12:31:16.906/50004.438 Oracle Coherence GE 3.6.0.0 <D5> (thread=Proxy:ExtendTcpProxyService, member=3): Service ExtendTcpProxyService left the cluster
    2011-01-14 12:31:16.907/50004.439 Oracle Coherence GE 3.6.0.0 <D5> (thread=DistributedCache:DistributedQuotesCacheService, member=3): Service DistributedQuotesCacheService left the cluster
    2011-01-14 12:31:16.913/50004.445 Oracle Coherence GE 3.6.0.0 <D5> (thread=Invocation:Management, member=3): Service Management left the cluster
    2011-01-14 12:31:16.913/50004.445 Oracle Coherence GE 3.6.0.0 <D5> (thread=DistributedCache:DistributedOrdersService, member=3): Service DistributedOrdersService left the cluster
    2011-01-14 12:31:16.913/50004.445 Oracle Coherence GE 3.6.0.0 <D5> (thread=DistributedCache:DistributedCacheService, member=3): Service DistributedCacheService left the cluster
    2011-01-14 12:31:16.914/50004.446 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: Channel(Id=214992652, Open=false)
    2011-01-14 12:31:16.914/50004.446 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: Channel(Id=8305999, Open=false)
    2011-01-14 12:31:16.915/50004.447 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: Channel(Id=1383343339, Open=false)
    2011-01-14 12:31:16.915/50004.447 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: TcpConnection(Id=0x0000012D84061C15C0A803149CF3279B334BE6140AC76C47CA03670D76A96D22, Open=false, LocalAddress=192.168.3.20:9091, RemoteAddress=192.168.3.6:65480)
    2011-01-14 12:31:16.915/50004.447 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: Channel(Id=1003858188, Open=false)
    2011-01-14 12:31:16.915/50004.447 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: Channel(Id=1586910282, Open=false)
    2011-01-14 12:31:16.915/50004.447 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: TcpConnection(Id=0x0000012D84060E5AC0A8031442EA3CC26AC425D55D93A6AFC5404E5A76A96D1E, Open=false, LocalAddress=192.168.3.20:9091, RemoteAddress=192.168.3.6:65472)
    2011-01-14 12:31:16.915/50004.447 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor:TcpProcessor, member=3): Released: TcpConnection(Id=0x0000012D84061C15C0A803149CF3279B334BE6140AC76C47CA03670D76A96D22, Open=false, LocalAddress=192.168.3.20:9091, RemoteAddress=192.168.3.6:65480)
    2011-01-14 12:31:16.915/50004.447 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: Channel(Id=160435953, Open=false)
    2011-01-14 12:31:16.915/50004.447 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor:TcpProcessor, member=3): Released: TcpConnection(Id=0x0000012D84060E5AC0A8031442EA3CC26AC425D55D93A6AFC5404E5A76A96D1E, Open=false, LocalAddress=192.168.3.20:9091, RemoteAddress=192.168.3.6:65472)
    2011-01-14 12:31:16.916/50004.448 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: Channel(Id=1635893341, Open=false)
    2011-01-14 12:31:16.916/50004.448 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: TcpConnection(Id=0x0000012D84061203C0A8031455CD3A790F6009CA79AEC8BACC464D9976A96D20, Open=false, LocalAddress=192.168.3.20:9091, RemoteAddress=192.168.3.6:65478)
    2011-01-14 12:31:16.916/50004.448 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor:TcpProcessor, member=3): Released: TcpConnection(Id=0x0000012D84061203C0A8031455CD3A790F6009CA79AEC8BACC464D9976A96D20, Open=false, LocalAddress=192.168.3.20:9091, RemoteAddress=192.168.3.6:65478)
    2011-01-14 12:31:16.916/50004.448 Oracle Coherence GE 3.6.0.0 <D5> (thread=DistributedCache:DistributedExecutionsService, member=3): Service DistributedExecutionsService left the cluster
    2011-01-14 12:31:16.919/50004.451 Oracle Coherence GE 3.6.0.0 <D5> (thread=DistributedCache:DistributedPositionsCacheService, member=3): Service DistributedPositionsCacheService left the cluster
    and ...
    2011-01-14 12:31:22.874/50006.273 Oracle Coherence GE 3.6.0.0 <Info> (thread=main, member=n/a): Restarting cluster
    2011-01-14 12:31:22.924/50006.323 Oracle Coherence GE 3.6.0.0 <D4> (thread=main, member=n/a): TCMP bound to /192.168.3.20:8094 using SystemSocketProvider
    2011-01-14 12:31:52.937/50036.336 Oracle Coherence GE 3.6.0.0 <Warning> (thread=Cluster, member=n/a): This Member(Id=0, Timestamp=2011-01-14 12:31:22.924, Address=192.168.3.20:8094, MachineId=27412, Location=machine:amd4,process:4136,member:Administrator, Role=CoherenceServer) has been attempting to join the cluster at address 225.0.0.1:54321 with TTL 4 for 30 seconds without success; this could indicate a mis-configured TTL value, or it may simply be the result of a busy cluster or active failover.
    2011-01-14 12:31:52.950/50036.349 Oracle Coherence GE 3.6.0.0 <Warning> (thread=Cluster, member=n/a): Received a discovery message that indicates the presence of an existing cluster that does not respond to join requests; this is usually caused by a network layer failure:
    Logs starting at 12:30 from the four nodes are here:
    http://www.nmedia.net/~andrew/logs/1.log
    http://www.nmedia.net/~andrew/logs/2.log
    http://www.nmedia.net/~andrew/logs/3.log
    http://www.nmedia.net/~andrew/logs/4.log
    If someone could tell me if this is a bug in the cluster re-join logic or something I screwed up that would be great. Thanks!
    Andrew

    Hi Andrew
    I had a quick look at your logs but cannot say for certain why your cluster died. I can say that losing data is a normal consequence of node loss though. If you have the backup count set to 1 then you can lose a single node without losing data. If you lose more than one node (on different machines, or the same machine if you only have one) over a very short space of time then you will almost certainly lose at least one partition and hence lose the data within that partition.
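To make the backup-count point concrete, here is a toy model with one primary and one backup copy per partition. The layout rule is invented purely for illustration and is NOT how Coherence actually assigns partitions; the node names and the partition count of 257 are assumptions as well. It needs at least two nodes.

```python
def assign(partitions, nodes):
    """Toy layout: each partition gets a primary and a backup on different nodes.

    Hypothetical scheme for illustration only, not Coherence's strategy.
    """
    n = len(nodes)
    layout = {}
    for p in range(partitions):
        primary = p % n
        offset = 1 + (p // n) % (n - 1)   # 1..n-1, so backup != primary
        backup = (primary + offset) % n
        layout[p] = (nodes[primary], nodes[backup])
    return layout

def lost_partitions(layout, dead):
    """Partitions whose primary and backup both lived on dead nodes."""
    dead = set(dead)
    return [p for p, owners in layout.items() if set(owners) <= dead]

nodes = ["n1", "n2", "n3", "n4"]          # four storage nodes, names made up
layout = assign(257, nodes)

one_down = lost_partitions(layout, ["n1"])
two_down = lost_partitions(layout, ["n1", "n2"])
print("partitions lost with one node down: ", len(one_down))
print("partitions lost with two nodes down:", len(two_down))
```

With a single backup, one dead node never takes both copies of a partition, while two dead nodes take out every partition whose two copies happened to sit on that pair. If all the nodes run on one machine, as in your case, losing that machine loses everything regardless.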
    Going back to your logs, it is difficult to determine the underlying cause without the whole set of logs. You have posted links to four logs, but from looking at them the cluster has about 16 nodes. I know from experience (as we had a cluster that was quite unstable for a while) that tracing these issues through the logs can be a bit awkward, but you soon get the hang of it :-)
    For example in the log http://www.nmedia.net/~andrew/logs/1.log you have...
    2011-01-14 12:31:16.807/49993.331 Oracle Coherence GE 3.6.0.0 <D5> (thread=Cluster, member=9): MemberLeft notification for Member(Id=3, Timestamp=2011-01-13 22:37:52.106, Address=192.168.3.20:8088, MachineId=27412, Location=machine:amd4,process:4428,member:Administrator, Role=CoherenceServer, PublisherSuccessRate=0.9975, ReceiverSuccessRate=0.9999, PauseRate=0.0, Threshold=93, Paused=false, Deferring=false, OutstandingPackets=0, DeferredPackets=0, ReadyPackets=0, LastIn=261ms, LastOut=277ms, LastSlow=n/a) received from Member(Id=22, Timestamp=2011-01-14 08:21:22.284, Address=192.168.3.121:8092, MachineId=27513, Location=machine:H1,process:3716,member:Howard, Role=Order_entry_window, PublisherSuccessRate=0.8326, ReceiverSuccessRate=1.0, PauseRate=0.0024, Threshold=1456, Paused=false, Deferring=false, OutstandingPackets=0, DeferredPackets=0, ReadyPackets=0, LastIn=0ms, LastOut=8ms, LastSlow=n/a)
    ...which is Member-9 receiving a message about the departure of Member-3 from Member-22, so you would then need to look at the logs for Member-22 to see why it thought Member-3 had departed, and also look at the logs for Member-3 for that time to see what might be wrong with it.
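When there are many of these lines, it helps to pull out just who departed and who reported it, so you know which member's log to open next. A rough sketch, assuming the two message layouts that appear in this thread (3.5.3 prints `Member 16`, 3.6.0.0 prints `Member(Id=3, ...)`); the function name is hypothetical and the sample lines are abbreviated with `...`.

```python
import re

MEMBER_LEFT = re.compile(
    r"MemberLeft (?P<kind>request|notification) for Member(?: |\(Id=)(?P<departed>\d+)"
    r".*? received from Member\(Id=(?P<reporter>\d+)"
)

def who_reported(line):
    """Return (kind, departed_id, reporter_id), or None if the line does not match."""
    m = MEMBER_LEFT.search(line)
    if m is None:
        return None
    return m["kind"], int(m["departed"]), int(m["reporter"])

# Abbreviated versions of lines quoted in this thread.
line_353 = "... (thread=Cluster, member=83): MemberLeft request for Member 16 received from Member(Id=27, ...)"
line_360 = "... (thread=Cluster, member=9): MemberLeft notification for Member(Id=3, ...) received from Member(Id=22, ...)"

print(who_reported(line_353))
print(who_reported(line_360))
```

The reporter Id is the member whose log should explain why the departure was declared.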
    The more worrying messages would be these...
    2011-01-14 12:31:16.709/49993.233 Oracle Coherence GE 3.6.0.0 <Warning> (thread=PacketPublisher, member=9): Experienced a 19025 ms communication delay (probable remote GC) with Member(Id=21, Timestamp=2011-01-14 08:21:12.174, Address=192.168.3.121:8090, MachineId=27513, Location=machine:H1,process:4316,member:Howard, Role=OrderbookviewerViewer); 111 packets rescheduled, PauseRate=0.0014, Threshold=1696
    ...a 19 second delay is a long time and would suggest either very long GC pauses or a network problem. Do you have GC logs for these processes? Are all the servers connected to the same switch, or is the cluster distributed over more than one part of your network? Do you have too much on one machine, are you overloading the NIC, are you swapping? All of these can cause delays and/or loss of packets.
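A quick way to find every such warning across a set of logs is sketched below. It assumes the message wording shown above; the 5000 ms threshold and the function name are arbitrary choices of mine, and the sample line is abbreviated with `...`.

```python
import re

DELAY = re.compile(
    r"Experienced a (\d+) ms communication delay\b.*? with Member\(Id=(\d+)"
)

def long_delays(lines, threshold_ms=5000):
    """Return (member_id, delay_ms) for each delay warning at or above the threshold."""
    hits = []
    for line in lines:
        m = DELAY.search(line)
        if m and int(m.group(1)) >= threshold_ms:
            hits.append((int(m.group(2)), int(m.group(1))))
    return hits

# Abbreviated from the Member 9 warning quoted above.
sample = [
    "... (thread=PacketPublisher, member=9): Experienced a 19025 ms communication delay (probable remote GC) with Member(Id=21, ...); 111 packets rescheduled, ...",
    "... (thread=Cluster, member=9): Service Cluster left the cluster",
]

print(long_delays(sample))
```

Any member that shows up repeatedly in the result is a good candidate for a closer look at its GC log and its network path.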
    We have had problems with storage disabled nodes doing long GC pauses and causing storage nodes to drop out of the cluster. Our cluster was on 3.5.3-p8 whereas you are on 3.6.0.0 which is supposed to have better node death detection so you might not have the same issues we had.
    Sorry to not be more help,
    JK

  • I would like to know the role of each thread in Coherence

    Help me.
    I would like to know the role of each thread in Coherence.
    There are too many kinds of threads.
    For example:
    GC Slave     GC Slave     RUNNABLE
    RMI TCP Accept-1972     RMI TCP Accept-1972     RUNNABLE
    Health Center trace subscriber     Health Center trace subscriber     RUNNABLE
    LT=0:P=342534:O=0:port=55170     LT=0:P=342534:O=0:port=55170     RUNNABLE
    Attach API wait loop     Attach API wait loop     RUNNABLE
    PacketListener1     PacketListener1     RUNNABLE
    PacketListener1P     PacketListener1P     RUNNABLE
    PacketListenerN     PacketListenerN     RUNNABLE
    Cluster|Member(Id=1, Timestamp=2013-04-05 10:45:44.655, Address=192.168.240.157:8088, MachineId=50044, Location=site:,machine:TMTEST-PC,process:5316, Role=CoherenceServer)     Cluster|Member(Id=1, Timestamp=2013-04-05 10:45:44.655, Address=192.168.240.157:8088, MachineId=50044, Location=site:,machine:TMTEST-PC,process:5316, Role=CoherenceServer)     RUNNABLE
    RT=0:P=342534:O=0:TCPTransportConnection[addr=192.168.240.157,port=55178,local=55170]     RT=0:P=342534:O=0:TCPTransportConnection[addr=192.168.240.157,port=55178,local=55170]     RUNNABLE
    Finalizer thread     Finalizer thread     RUNNABLE
    WT=10     WT=10     RUNNABLE
    main     main     TIMED_WAITING
    IpMonitor     IpMonitor     TIMED_WAITING
    Invocation:Management:EventDispatcher     Invocation:Management:EventDispatcher     TIMED_WAITING
    Invocation:Management     Invocation:Management     TIMED_WAITING
    DistributedCache     DistributedCache     TIMED_WAITING
    JMX server connection timeout 52     JMX server connection timeout 52     TIMED_WAITING
    RMI Scheduler(0)     RMI Scheduler(0)     WAITING
    Thread-6     Thread-6     WAITING
    stop JMX Server on shutdown     stop JMX Server on shutdown     WAITING
    Logger@9228429 3.7.1.7     Logger@9228429 3.7.1.7     WAITING
    PacketReceiver     PacketReceiver     WAITING
    PacketPublisher     PacketPublisher     WAITING
    PacketSpeaker     PacketSpeaker     WAITING
    WT=7     WT=7     WAITING
    WT=9     WT=9     WAITING
    -----------------------------------------------------------------------------------------------------------------------------------------------

    Briefly
    PacketListener1 PacketListener1P PacketListenerN - listening IO threads for TCMP transport protocol
    Cluster|Member(Id=1, Timestamp=2013-04-05 10:45:44.655, Address=192.168.240.157:8088, MachineId=50044, Location=site:,machine:TMTEST-PC,process:5316, Role=CoherenceServer) - main thread for cluster service (discovery, node join / leave, etc)
    IpMonitor - IP monitor, participates in death detection scheme
    Invocation:Management:EventDispatcher - Event dispatch thread for distributed JMX service in Coherence
    Invocation:Management - main thread for distributed JMX service in Coherence
    DistributedCache - main thread for DistributedCache cache service
    Logger@9228429 3.7.1.7 - Coherence async logging thread
    PacketReceiver - Thread dispatching incoming network packets
    PacketPublisher - Thread sending out packets via TCMP
    PacketSpeaker - Thread sending out packets via TCMP (offloads some work from PacketPublisher for better core utilization)
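
    If you want to annotate a thread dump with these roles, a rough sketch could look like this. The role texts are taken from the list above; matching on name prefixes and the names ROLES and describe are my own assumptions, and threads such as "GC Slave" or "RMI TCP Accept" are not covered by the list, so they come back as unknown.

```python
# Role descriptions come from the list above; prefix matching is an assumption.
# More specific prefixes must come before shorter ones that they start with.
ROLES = [
    ("PacketListener", "listening IO thread for TCMP transport protocol"),
    ("Cluster|", "main thread for cluster service"),
    ("IpMonitor", "IP monitor, participates in death detection"),
    ("Invocation:Management:EventDispatcher",
     "event dispatch thread for distributed JMX service"),
    ("Invocation:Management", "main thread for distributed JMX service"),
    ("DistributedCache", "main thread for DistributedCache cache service"),
    ("Logger@", "Coherence async logging thread"),
    ("PacketReceiver", "dispatches incoming network packets"),
    ("PacketPublisher", "sends out packets via TCMP"),
    ("PacketSpeaker", "sends out packets via TCMP (offloads PacketPublisher)"),
]

def describe(thread_name):
    """Return the role for a thread name, or 'unknown' if not listed."""
    for prefix, role in ROLES:
        if thread_name.startswith(prefix):
            return role
    return "unknown"

for name in ["PacketListener1P", "Invocation:Management:EventDispatcher",
             "Logger@9228429 3.7.1.7", "GC Slave"]:
    print(name, "->", describe(name))
```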

  • TCP Extend (DefaultCacheServer rejects connections)

    Hi guys
    Have been trying to set up TCP Extend to make a Linux box use a cache configured on a Windows box, and the DefaultCacheServer rejects TCP connections. The config files I'm using are attached. Can anyone help?
    The DefaultCacheServer comes up nicely
    SafeCluster: Name=n/a
    Group{Address=224.3.2.0, Port=32367, TTL=1}
    MasterMemberSet
    ThisMember=Member(Id=1, Timestamp=2007-03-29 16:07:16.026, Address=147.114.162.160:54321, MachineId=17312)
    OldestMember=Member(Id=1, Timestamp=2007-03-29 16:07:16.026, Address=147.114.162.160:54321, MachineId=17312)
    ActualMemberSet=MemberSet(Size=1, BitSetCount=2
    Member(Id=1, Timestamp=2007-03-29 16:07:16.026, Address=147.114.162.160:54321, MachineId=17312)
    RecycleMillis=120000
    RecycleSet=MemberSet(Size=0, BitSetCount=0
    Services
    TcpRing{TcpSocketAccepter{State=STATE_OPEN, ServerSocket=147.114.162.160:54321}, Connections=[]}
    ClusterService{Name=Cluster, State=(SERVICE_STARTED, STATE_JOINED), Id=0, Version=3.2, OldestMemberId=1}
    DistributedCache{Name=DistributedCache, State=(SERVICE_STARTED), Id=1, Version=3.2, OldestMemberId=1, LocalStorage=enabled, PartitionCount=257, BackupCount=1, AssignedPartitions=257, BackupPartitions=0}
    but when I run the client, I get this
    2007-03-29 16:09:42.698 Tangosol Coherence DGE 3.2/367 <D4> (thread=TcpRingListener, member=1): Rejecting connection to member 649 using TcpSocket{State=STATE_OPEN, Socket=Socket[addr=/172.26.102.115,port=36952,localport=54321]}
    Attachments: cluster-side-config.xml, client-side-config.xml

    Hi pandeyv,
    You need to configure an instance of the ProxyService in your cluster-side cache configuration file. Coherence*Extend clients connect to the ProxyService over TCP/IP and not the TcpRingService. The TcpRingService is only used by cluster members for death detection.
    See the following for instructions on configuring the cluster and client-side configuration files:
    http://wiki.tangosol.com/display/COH32UG/Configuring+and+Using+Coherence*Extend
    Additionally, I noticed that you are using an old release of Coherence 3.2. Please upgrade to the latest 3.2 service pack (3.2.2):
    http://www.tangosol.com/product-downloads.jsp
    Regards,
    Jason

  • Startup timeout and packet-delivery timeout

    Hi,
    At the moment it takes my first cluster node approximately 30 seconds to start and setting the packet-delivery timeout smaller than this means the system cannot start. I'm trying to reduce the packet-delivery setting to improve responsiveness during failover caused by hardware failures. I think 15-20 seconds would be ideal, allowing for GC pauses.
    Subsequent nodes can start in a couple of seconds.
    Is this a reasonable time to expect the first Coherence node in a cluster to start up? What kind of values is everyone else working with?
    Thanks & Regards,
    Martin
    The error when packet-delivery is less than 30 seconds:
    2009-11-24 13:04:05.568/24.141 Oracle Coherence GE 3.4/405 <Error> (thread=main, member=n/a): Error while starting cluster: com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(Id=0, Name=Cluster, Type=Cluster
    MemberSet=ServiceMemberSet(
    OldestMember=n/a
    ActualMemberSet=MemberSet(Size=0, BitSetCount=0
    MemberId/ServiceVersion/ServiceJoined/ServiceLeaving
         at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onStartupTimeout(Grid.CDB:6)
         at com.tangosol.coherence.component.util.daemon.queueProcessor.Service.start(Service.CDB:27)
         at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.start(Grid.CDB:38)
         at com.tangosol.coherence.component.net.Cluster.onStart(Cluster.CDB:317)
         at com.tangosol.coherence.component.net.Cluster.start(Cluster.CDB:11)
         at com.tangosol.coherence.component.util.SafeCluster.startCluster(SafeCluster.CDB:3)
         at com.tangosol.coherence.component.util.SafeCluster.restartCluster(SafeCluster.CDB:7)
         at com.tangosol.coherence.component.util.SafeCluster.ensureRunningCluster(SafeCluster.CDB:27)
         at com.tangosol.coherence.component.util.SafeCluster.start(SafeCluster.CDB:2)
         at com.tangosol.net.CacheFactory.ensureCluster(CacheFactory.java:951)
         at com.tangosol.net.DefaultConfigurableCacheFactory.ensureService(DefaultConfigurableCacheFactory.java:748)
         at com.tangosol.net.DefaultConfigurableCacheFactory.ensureCache(DefaultConfigurableCacheFactory.java:710)
         at com.tangosol.net.DefaultConfigurableCacheFactory.configureCache(DefaultConfigurableCacheFactory.java:919)
         at com.tangosol.net.DefaultConfigurableCacheFactory.ensureCache(DefaultConfigurableCacheFactory.java:277)
         at com.tangosol.net.CacheFactory.getCache(CacheFactory.java:689)
         at com.tangosol.net.CacheFactory.getCache(CacheFactory.java:667)
         at com.changingworlds.datagrid.cache.Cache.init(Cache.java:111)
         at com.changingworlds.datagrid.cache.Cache.initializeNamedCache(Cache.java:95)
         at com.changingworlds.discovery.DiscoveryMain.initSpring(DiscoveryMain.java:201)
         at com.changingworlds.discovery.DiscoveryMain.createDiscovery(DiscoveryMain.java:154)
         at com.changingworlds.discovery.DiscoveryMain.main(DiscoveryMain.java:78)
    Edited by: MartinMc on Nov 24, 2009 1:15 PM

    Hi Martin,
    The packet delivery timeout relates to death detection, not to cluster formation. You'll want to have a look at the join-timeout-milliseconds specified within the multicast-listener element (see http://coherence.oracle.com/display/COH35UG/multicast-listener) if you wish to change the amount of time it takes to form a new cluster. The reason for this timeout is to prevent a new node from accidentally forming a secondary cluster if the existing cluster members are temporarily unreachable while the new node starts. Assuming you are running more than just a few nodes in your cluster, you should be fine lowering this value to 5-10s.
    I don't however see how this relates to failover unless by failover you mean starting an entirely new cluster after the complete loss of the formerly running cluster.
    thanks,
    Mark
    Oracle Coherence

  • Thread STUCK for more than 10minutes at  com.tangosol.util.SegmentedHashMap

    <[STUCK] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)' has been busy for "707" seconds working on the request "weblogic.work.SelfTuningWorkManagerImpl$WorkAdapterImpl@11ddce1", which is more than the configured time (StuckThreadMaxTime) of "600" seconds. Stack trace:
    java.lang.Object.wait(Native Method)
    com.tangosol.util.SegmentedHashMap.contendForSegment(SegmentedHashMap.java:1391)
    com.tangosol.util.SegmentedHashMap.lockSegment(SegmentedHashMap.java:1301)
    com.tangosol.util.SegmentedHashMap.lockBucket(SegmentedHashMap.java:1266)
    com.tangosol.util.SegmentedHashMap.invokeOnKey(SegmentedHashMap.java:1058)
    com.tangosol.util.SegmentedConcurrentMap.lock(SegmentedConcurrentMap.java:197)
    com.tangosol.net.cache.CachingMap.get(CachingMap.java:462)

    Hi,
    A 2019 ms delay could be caused by a number of things. It could be network related or, more likely, as the message says, the WebLogic node could have been doing a GC. As your WebLogic node is a cluster member, you need to make sure that its GC is properly tuned to avoid long GC pauses, as these can destabilize the whole cluster. I see you are using 3.5.3/465p8, which is an older version of Coherence and lacks the newer node death detection algorithms. I know from the project I work on, which also uses 3.5.3/465p8, that long GCs on cluster members can cause other nodes in the cluster to die if the GC pauses are too long or too frequent.
    I doubt that this message is related to the stuck thread issue in your original post, though it is hard to tell for certain.
    JK

  • Special character issue

    Hi
    I am getting following error while uploading master data.
    Master data (dealt by table level) has errors
    Detected duplicate member ID 'Agrippal'
    Dimension member DNA/PLG Prime + GP14 is an invalid member ID
    Dimension member DTaP/alum vaccine (p is an invalid member ID
    Dimension member DTwP + Hib full liqu is an invalid member ID
    Dimension member DTwP + Hib lyophilis is an invalid member ID
    Dimension member Flu + MF59 + CpG is an invalid member ID
    Dimension member Fluvirin + MF59 is an invalid member ID
    Detected duplicate member ID 'H. pylori'
    Dimension member HBV/MF59 is an invalid member ID
    Dimension member Hepatitis C (E1E2HCV is an invalid member ID
    Dimension member Laminarin-CRM197 + a is an invalid member ID
    Dimension member MenB (New Zealand st is an invalid member ID
    Dimension member MenB (Norwegian stra is an invalid member ID
    Dimension member MenB Engineered (Chi is an invalid member ID
    Dimension member MenC lyophilised + M is an invalid member ID
    Detected duplicate member ID 'Rabies vaccine'
    Dimension member Rhein-Biotech DTaP/H is an invalid member ID
    Dimension member Staph Aureus (Staphy is an invalid member ID
    Detected duplicate member ID 'Td-IPV'
    Dimension member V Quinvaxem (DTwP) p is an invalid member ID
    Dimension member VLP-H5; Virus-Like P is an invalid member ID
    Dimension member Vi - CRM (NVGH Vacci is an invalid member ID
    Error in Admin module
    Record count: 80
    Accept count: 0
    I tried a couple of options, writing JavaScript in the INTERNAL column of the conversion file, without success. Could anybody please guide me on how to rectify this issue?
    Thanks in advance
    regards
    mahi

    Hi,
    I don't think you need to implement an OSS Note; it appears the source data contains special characters.
    What does the source data look like?
    BPC has restrictions on the IDs that can be imported, and a useful OSS note explains some of these restrictions:
    SAP Note 1448836 - [Link|https://websmp130.sap-ag.de/sap(bD1lbiZjPTAwMQ==)/bc/bsp/spn/sapnotes/index2.htm?numm=1448836]
    You can use JavaScript in a conversion file: set the EXTERNAL column to ' * ' so it selects all, then use js: JavaScript to remove special characters. You'll need to do a bit of research into exactly how you'd set it up, but the SAP help file has some guidance here:
    http://help.sap.com/saphelp_bpc70/helpdata/en/81/94a8a5febd40268d5c59b4fc31be37/content.htm
    Hope it helps,
    Nick
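
    Before touching the conversion file, it can help to check the source IDs offline. The sketch below is only an illustration: the 20-character limit is inferred from the truncated IDs in the error log above, and the allowed character set is a guess, so check the SAP note for the real rules. clean_id and report are made-up names.

```python
import re
from collections import Counter

MAX_LEN = 20  # assumption: the IDs in the error log are all cut at 20 chars
DISALLOWED = re.compile(r"[^A-Za-z0-9_.]")  # assumption: illustrative only

def clean_id(raw):
    """Strip characters outside the assumed allowed set, then truncate."""
    return DISALLOWED.sub("", raw)[:MAX_LEN]

def report(raw_ids):
    """Return the cleaned IDs and any cleaned ID that occurs more than once."""
    cleaned = [clean_id(raw) for raw in raw_ids]
    counts = Counter(cleaned)
    duplicates = sorted(c for c, n in counts.items() if n > 1)
    return cleaned, duplicates

ids = ["HBV/MF59", "Flu + MF59 + CpG", "Td-IPV", "Td IPV"]
cleaned, duplicates = report(ids)
print(cleaned)
print(duplicates)
```

    Note that stripping characters can itself create duplicates, which is one possible source of the "Detected duplicate member ID" messages, so the duplicate list is worth checking before loading.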

  • Sql Apply issue in logical standby database--(10.2.0.5.0) x86 platform

    Hi Friends,
    I am getting the following exception in the logical standby database during SQL Apply.
    After I run the command alter database start logical standby apply, the SQL Apply services start, but after a few seconds they stop automatically with the following exception.
    alter database start logical standby apply
    Tue May 17 06:42:00 2011
    No optional part
    Attempt to start background Logical Standby process
    LOGSTDBY Parameter: MAX_SERVERS = 20
    LOGSTDBY Parameter: MAX_SGA = 100
    LOGSTDBY Parameter: APPLY_SERVERS = 10
    LSP0 started with pid=30, OS id=4988
    Tue May 17 06:42:00 2011
    Completed: alter database start logical standby apply
    Tue May 17 06:42:00 2011
    LOGSTDBY status: ORA-16111: log mining and apply setting up
    Tue May 17 06:42:00 2011
    LOGMINER: Parameters summary for session# = 1
    LOGMINER: Number of processes = 4, Transaction Chunk Size = 201
    LOGMINER: Memory Size = 100M, Checkpoint interval = 500M
    Tue May 17 06:42:00 2011
    LOGMINER: krvxpsr summary for session# = 1
    LOGMINER: StartScn: 0 (0x0000.00000000)
    LOGMINER: EndScn: 0 (0x0000.00000000)
    LOGMINER: HighConsumedScn: 2660033 (0x0000.002896c1)
    LOGMINER: session_flag 0x1
    LOGMINER: session# = 1, preparer process P002 started with pid=35 OS id=4244
    LOGSTDBY Apply process P014 started with pid=47 OS id=5456
    LOGSTDBY Apply process P010 started with pid=43 OS id=6484
    LOGMINER: session# = 1, reader process P000 started with pid=33 OS id=4732
    Tue May 17 06:42:01 2011
    LOGMINER: Begin mining logfile for session 1 thread 1 sequence 1417, X:\TANVI\ARCHIVE2\ARC01417_0748170313.001
    Tue May 17 06:42:01 2011
    LOGMINER: Turning ON Log Auto Delete
    Tue May 17 06:42:01 2011
    LOGMINER: End mining logfile: X:\TANVI\ARCHIVE2\ARC01417_0748170313.001
    Tue May 17 06:42:01 2011
    LOGMINER: Begin mining logfile for session 1 thread 1 sequence 1418, X:\TANVI\ARCHIVE2\ARC01418_0748170313.001
    LOGSTDBY Apply process P008 started with pid=41 OS id=4740
    LOGSTDBY Apply process P013 started with pid=46 OS id=7864
    LOGSTDBY Apply process P006 started with pid=39 OS id=5500
    LOGMINER: session# = 1, builder process P001 started with pid=34 OS id=4796
    Tue May 17 06:42:02 2011
    LOGMINER: skipped redo. Thread 1, RBA 0x00058a.00000950.0010, nCV 6
    LOGMINER: op 4.1 (Control File)
    Tue May 17 06:42:02 2011
    LOGMINER: End mining logfile: X:\TANVI\ARCHIVE2\ARC01418_0748170313.001
    Tue May 17 06:42:03 2011
    LOGMINER: Begin mining logfile for session 1 thread 1 sequence 1419, X:\TANVI\ARCHIVE2\ARC01419_0748170313.001
    Tue May 17 06:42:03 2011
    LOGMINER: End mining logfile: X:\TANVI\ARCHIVE2\ARC01419_0748170313.001
    Tue May 17 06:42:03 2011
    LOGMINER: Begin mining logfile for session 1 thread 1 sequence 1420, X:\TANVI\ARCHIVE2\ARC01420_0748170313.001
    Tue May 17 06:42:03 2011
    LOGMINER: End mining logfile: X:\TANVI\ARCHIVE2\ARC01420_0748170313.001
    Tue May 17 06:42:03 2011
    LOGMINER: Begin mining logfile for session 1 thread 1 sequence 1421, X:\TANVI\ARCHIVE2\ARC01421_0748170313.001
    LOGSTDBY Analyzer process P004 started with pid=37 OS id=5096
    Tue May 17 06:42:03 2011
    LOGMINER: End mining logfile: X:\TANVI\ARCHIVE2\ARC01421_0748170313.001
    LOGSTDBY Apply process P007 started with pid=40 OS id=2760
    Tue May 17 06:42:03 2011
    Errors in file x:\oracle\product\10.2.0\admin\tanvi\bdump\tanvi_p001_4796.trc:
    ORA-00600: internal error code, arguments: [krvxbpx20], [1], [1418], [2380], [16], [], [], []
    LOGSTDBY Apply process P012 started with pid=45 OS id=7152
    Tue May 17 06:42:03 2011
    LOGMINER: Begin mining logfile for session 1 thread 1 sequence 1422, X:\TANVI\ARCHIVE2\ARC01422_0748170313.001
    Tue May 17 06:42:03 2011
    LOGMINER: End mining logfile: X:\TANVI\ARCHIVE2\ARC01422_0748170313.001
    Tue May 17 06:42:03 2011
    LOGMINER: Begin mining logfile for session 1 thread 1 sequence 1423, X:\TANVI\ARCHIVE2\ARC01423_0748170313.001
    Tue May 17 06:42:03 2011
    LOGMINER: End mining logfile: X:\TANVI\ARCHIVE2\ARC01423_0748170313.001
    Tue May 17 06:42:03 2011
    LOGMINER: Begin mining logfile for session 1 thread 1 sequence 1424, X:\TANVI\ARCHIVE2\ARC01424_0748170313.001
    LOGMINER: session# = 1, preparer process P003 started with pid=36 OS id=5468
    Tue May 17 06:42:03 2011
    LOGMINER: End mining logfile: X:\TANVI\ARCHIVE2\ARC01424_0748170313.001
    Tue May 17 06:42:04 2011
    LOGMINER: Begin mining logfile for session 1 thread 1 sequence 1425, X:\TANVI\ARCHIVE2\ARC01425_0748170313.001
    LOGSTDBY Apply process P011 started with pid=44 OS id=6816
    LOGSTDBY Apply process P005 started with pid=38 OS id=5792
    LOGSTDBY Apply process P009 started with pid=42 OS id=752
    Tue May 17 06:42:05 2011
    krvxerpt: Errors detected in process 34, role builder.
    Tue May 17 06:42:05 2011
    krvxmrs: Leaving by exception: 600
    Tue May 17 06:42:05 2011
    Errors in file x:\oracle\product\10.2.0\admin\tanvi\bdump\tanvi_p001_4796.trc:
    ORA-00600: internal error code, arguments: [krvxbpx20], [1], [1418], [2380], [16], [], [], []
    LOGSTDBY status: ORA-00600: internal error code, arguments: [krvxbpx20], [1], [1418], [2380], [16], [], [], []
    Tue May 17 06:42:06 2011
    Errors in file x:\oracle\product\10.2.0\admin\tanvi\bdump\tanvi_lsp0_4988.trc:
    ORA-12801: error signaled in parallel query server P001
    ORA-00600: internal error code, arguments: [krvxbpx20], [1], [1418], [2380], [16], [], [], []
    Tue May 17 06:42:06 2011
    LogMiner process death detected
    Tue May 17 06:42:06 2011
    logminer process death detected, exiting logical standby
    LOGSTDBY Analyzer process P004 pid=37 OS id=5096 stopped
    LOGSTDBY Apply process P010 pid=43 OS id=6484 stopped
    LOGSTDBY Apply process P008 pid=41 OS id=4740 stopped
    LOGSTDBY Apply process P012 pid=45 OS id=7152 stopped
    LOGSTDBY Apply process P014 pid=47 OS id=5456 stopped
    LOGSTDBY Apply process P005 pid=38 OS id=5792 stopped
    LOGSTDBY Apply process P006 pid=39 OS id=5500 stopped
    LOGSTDBY Apply process P007 pid=40 OS id=2760 stopped
    LOGSTDBY Apply process P011 pid=44 OS id=6816 stopped
    Tue May 17 06:42:10 2011

    Errors in file x:\oracle\product\10.2.0\admin\tanvi\bdump\tanvi_p001_4796.trc:
    ORA-00600: internal error code, arguments: [krvxbpx20], [1], [1418], [2380], [16], [], [], []
    Submit an SR to Oracle Support.
    Refer to these too:
    *ORA-600/ORA-7445 Error Look-up Tool [ID 153788.1]*
    *Bug 6022014: ORA-600 [KRVXBPX20] ON LOGICAL STANDBY*

  • Standby Logical DB ORA-00600: internal error code, arguments: [ksmovrflow],

    hi
    Please help me fix this problem.
    I have a primary Oracle database, version 9.2.0.1.0.
    I configured a logical standby database on a separate computer and everything was going well, but today I noticed the archived logs haven't been applied, and I have ORA-16111: log mining and apply setting up in DBA_LOGSTDBY_EVENTS without any further progress.
    In the alert log file I found these errors:
    Errors in file c:\oracle\admin\bmdbsb\udump\bmdbsb_p004_2796.trc:
    ORA-00600: internal error code, arguments: [ksmovrflow], [knahs:ddl_string], [], [], [], [], [], []
    Wed Jun 18 13:25:29 2003
    Errors in file c:\oracle\admin\bmdbsb\bdump\bmdbsb_lsp0_2212.trc:
    ORA-12805: parallel query server died unexpectedly
    Wed Jun 18 13:25:29 2003
    logminer process death detected, exiting logical standby
    what can I do?
    thanks

    Hi,
    create pfile from spfile;
    shutdown immediate;
    edit the changes now in pfile
    startup pfile='/////' open;
    The spfile is a binary file and should not be edited by hand, so make the changes in the pfile.
    In 10g the instance initially runs with the pfile until you create an spfile.
    regards,
    Nirmal

  • Failover Cluster testing - not working

    Hello
    We are trying to perform failover testing on OC4J clusters. We have two nodes clustered. On each node we have installed SOA suite - four OC4J instances (home, oc4J_soa, oc4j_wsm, and oc4j_esbdt).
    This is how we are performing the test:
    We have deployed servlet on both oc4j_soa with different application names but with same context-root. We have observed following behavior.
    1) In mod_oc4j.conf file , we have "Oc4jSelectMethod roundrobin:local".
    So when we hit http://<loadbalancer>/<context-root-for-servlet> it works fine in roundrobin fashion. We have set the HTML page title as node name in servlets.
    And so we see Node 1 title one time and Node 2 title the other time.
    2) Now if I undeploy the servlet from Node 1 and hit http://<loadbalancer>/<context-root-for-servlet>, it doesn't work one time (gives HTTP 500 error) and works the other time.
    i.e. it is still sending one request to Node 1 and the other request to Node 2.
    It should not do this, right? If my application is down/not available on Node 1, all requests should go to Node 2.
    3) Also, what I have observed is that when we hit the nodes' HTTP URLs directly instead of going through the load balancer, it crisscrosses the requests.
    ie. http://NODE 1/<context-root-for-servlet> - this one always goes to Node 2
    and http://NODE 2/<context-root-for-servlet> - always goes to Node 1
    This is something weird.
    Does anybody have any idea, please? I am not sure why we are getting the unexpected behavior mentioned in 2) and 3).
    Please let me know if you need anything about config details.
    Thanks
    /Mishit

    2) I hope there is a hardware load balancer (LBR) front-ending this architecture. If so, hardware load balancers have intelligent death detection mechanisms: if Node 1 crashes, the LBR stops sending requests to the failed node until it is back online. So this setting belongs on the LBR rather than in mod_oc4j.conf.
    3) If load balancing is configured correctly, I don't think you should be getting this issue.
    To test load balancing you can do as below:
    - Ensure Virtual host configuration is done in Apache of both nodes
    - Ensure Virtual host entry is added to /etc/hosts
    Like on Node 1 /etc/hosts
    <Node 2 IP> <Virtual hostname>
    Similarly on Node 2 /etc/hosts
    <Node 1 IP> < Virtual hostname>
    - Give Virtual URL to your LBR guy and he will do the settings in LBR
    - Bring down node 1 keeping node 2 alive & check below:
    http://<virtualURL>/contextroot
    http://<node2>/contextroot
    - Similarly bring down node 2 keeping node 1 alive & check below:
    http://<virtualURL>/contextroot
    http://<node1>/contextroot

  • Error OLAP processing one application - SAP BPC 7.0 MS

    Hi everybody,
    This is our problem: when I try to process my application called "Consolidation", it always fails with the following error message:
    "Cube process: Errors in the OLAP storage engine. The Attribute key cannot be found: Table dbo_tblFactConsolidation, column ACCOUNT, value: E1500."
    We tried to launch a full process of the "ACCOUNT" dimension, but it also fails.
    Could anybody tell me what is happening?
    Thanks for your help.

    Juan,
    Here is the solution for solving this issue.
    <Reason>
        -. Your fact table has an invalid record. In your case, it is 'E1500'.
        -. It means the Account dimension doesn't have an E1500 member, but your fact table has records for it.
           Therefore, Analysis Services can't process it.
        -. It usually happens when users load data from an outside source into BPC without validation.
        -. If you are using the 'make dimension package' in 5.1 SP8, it might be a bug. I saw this issue in that version, but the error doesn't happen when you create the dimension from an Excel worksheet through the BPC admin console.
    <Solution>
        -.  Select the records from the fact tables that have 'E1500' in the account column.
        -.  Delete them.
    <Note>
         - Because of how MS Analysis Services detects errors, invalid members are reported one at a time.
           Therefore, you may need to repeat this step again and again.
         - One way to avoid this is to find invalid members between the fact and dimension tables using a join query.
           Here is a sample SQL query:
           select a.ID, b.account from mbrAccount as a right join tblFactFinance as b
           on a.ID = b.ACCOUNT where a.ID is null
           It will show all invalid member IDs in the account column of the fact table, so you can figure out which accounts are wrong. It can save a lot of time by avoiding processing the cube again and again.
    I hope it will help you
    James Lim
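
    For anyone who wants to try the join idea on a scratch database first, here is a small sketch using Python's built-in sqlite3 module. The table names follow the sample query above; the SIGNEDDATA column, the sample rows and the function name orphan_accounts are placeholders I made up. It uses a LEFT JOIN from the fact table, which should be equivalent to the RIGHT JOIN in the sample query.

```python
import sqlite3

# In-memory stand-ins for the dimension and fact tables (sample data only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mbrAccount (ID TEXT PRIMARY KEY);
    CREATE TABLE tblFactFinance (ACCOUNT TEXT, SIGNEDDATA REAL);
    INSERT INTO mbrAccount VALUES ('E1000'), ('E1200');
    INSERT INTO tblFactFinance VALUES
        ('E1000', 10.0), ('E1500', 5.0), ('E1500', 7.5);
""")

def orphan_accounts(conn):
    """List fact-table accounts that have no matching dimension member."""
    rows = conn.execute("""
        SELECT DISTINCT f.ACCOUNT
        FROM tblFactFinance AS f
        LEFT JOIN mbrAccount AS d ON d.ID = f.ACCOUNT
        WHERE d.ID IS NULL
        ORDER BY f.ACCOUNT
    """).fetchall()
    return [row[0] for row in rows]

print(orphan_accounts(conn))
```

    Running it once lists every orphaned account in one pass, instead of discovering them one failed cube process at a time.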

  • Logmnr/capure error b'coz of corruption of redo log block

    Hi,
    We all know that the capture process reads the REDO entries from redo log files or archived log files. Therefore we need to have the db in ARCHIVELOG mode.
    In alert log file, I found error saying :
    Creating archive destination LOG_ARCHIVE_DEST_1: 'E:\ORACLE\ORADATA\REPF\ARCHIVE\LOCATION01\1_36.ARC'
    ARC0: Log corruption near block 66922 change 0 time ?
    ARC0: All Archive destinations made inactive due to error 354
    Fri Apr 04 12:57:44 2003
    Errors in file e:\oracle\admin\repf\bdump\trishul_arc0_1724.trc:
    ORA-00354: corrupt redo log block header
    ORA-00353: log corruption near block 66922 change 0 time 04/04/2003 11:05:40
    ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\ARCHIVE\REDO02.LOG'
    ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\REDO02.LOG'
    As normal practice, we multiplex the redo log files at different locations, but even the second copy of the redo log is of no use for recovering it. This explains why the redo log could not be archived: it can't be read. The same is true for the Logmnr process; it could not read the redo log file and failed. Now, we have a way to recover from this situation (as far as the DB is concerned, not Stream Replication), since the shutdown after this error was IMMEDIATE, causing a checkpoint, so rollback/rollforward is not required during system startup (no instance recovery). We can put the db in NOARCHIVELOG mode, drop that particular group, create a new one, and turn the db back to ARCHIVELOG mode. This will certainly serve the purpose as far as consistency of the DB is concerned.
    Here is the catch for Stream Replication. The corrupted redo log must contain a few transactions that have not been archived, each with a corresponding SCN. The Capture process reads the information sequentially in SCN order. A few transactions are now missing, and the Capture process can't jump to the next SCN, skipping the SCNs in between. So we have to re-instantiate the objects on the other system, which has no errors, and start working from there. My concern is what will happen to those missed transactions on the other database. It is an absolute loss of data. In development I can manage that, but in a real production environment this is a critical situation. How can I recover the corrupted information from the redo log?
    I have not dropped any of the log groups yet, because I would like to recover from this situation without LOSS of data.
    Thanx, & regards,
    Kamlesh Chaudhary
    Content of trace files :
    Dump file e:\oracle\admin\repf\bdump\trishul_arc0_1724.trc
    Fri Apr 04 12:57:31 2003
    ORACLE V9.2.0.2.1 - Production vsnsta=0
    vsnsql=12 vsnxtr=3
    Windows 2000 Version 5.0 Service Pack 2, CPU type 586
    Oracle9i Enterprise Edition Release 9.2.0.2.1 - Production
    With the Partitioning, OLAP and Oracle Data Mining options
    JServer Release 9.2.0.2.0 - Production
    Windows 2000 Version 5.0 Service Pack 2, CPU type 586
    Instance name: trishul
    Redo thread mounted by this instance: 1
    Oracle process number: 16
    Windows thread id: 1724, image: ORACLE.EXE
    *** SESSION ID:(13.1) 2003-04-04 12:57:31.000
    - Created archivelog as 'E:\ORACLE\ORADATA\REPF\ARCHIVE\LOCATION02\1_36.ARC'
    - Created archivelog as 'E:\ORACLE\ORADATA\REPF\ARCHIVE\LOCATION01\1_36.ARC'
    *** 2003-04-04 12:57:44.000
    ARC0: All Archive destinations made inactive due to error 354
    *** 2003-04-04 12:57:44.000
    kcrrfail: dest:2 err:354 force:0
    *** 2003-04-04 12:57:44.000
    kcrrfail: dest:1 err:354 force:0
    ORA-00354: corrupt redo log block header
    ORA-00353: log corruption near block 66922 change 0 time 04/04/2003 11:05:40
    ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\ARCHIVE\REDO02.LOG'
    ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\REDO02.LOG'
    *** 2003-04-04 12:57:44.000
    ARC0: Archiving not possible: error count exceeded
    ORA-16038: log 2 sequence# 36 cannot be archived
    ORA-00354: corrupt redo log block header
    ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\REDO02.LOG'
    ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\ARCHIVE\REDO02.LOG'
    ORA-16014: log 2 sequence# 36 not archived, no available destinations
    ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\REDO02.LOG'
    ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\ARCHIVE\REDO02.LOG'
    ORA-16014: log 2 sequence# 36 not archived, no available destinations
    ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\REDO02.LOG'
    ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\ARCHIVE\REDO02.LOG'
    ORA-16014: log 2 sequence# 36 not archived, no available destinations
    ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\REDO02.LOG'
    ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\ARCHIVE\REDO02.LOG'
    ORA-16014: log 2 sequence# 36 not archived, no available destinations
    ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\REDO02.LOG'
    ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\ARCHIVE\REDO02.LOG'
    ORA-16014: log 2 sequence# 36 not archived, no available destinations
    ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\REDO02.LOG'
    ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\ARCHIVE\REDO02.LOG
    Dump file e:\oracle\admin\repf\udump\trishul_cp01_2048.trc
    Fri Apr 04 12:57:27 2003
    ORACLE V9.2.0.2.1 - Production vsnsta=0
    vsnsql=12 vsnxtr=3
    Windows 2000 Version 5.0 Service Pack 2, CPU type 586
    Oracle9i Enterprise Edition Release 9.2.0.2.1 - Production
    With the Partitioning, OLAP and Oracle Data Mining options
    JServer Release 9.2.0.2.0 - Production
    Windows 2000 Version 5.0 Service Pack 2, CPU type 586
    Instance name: trishul
    Redo thread mounted by this instance: 1
    Oracle process number: 30
    Windows thread id: 2048, image: ORACLE.EXE (CP01)
    *** 2003-04-04 12:57:28.000
    *** SESSION ID:(27.42) 2003-04-04 12:57:27.000
    TLCR process death detected. Shutting down TLCR
    error 1280 in STREAMS process
    ORA-01280: Fatal LogMiner Error.
    OPIRIP: Uncaught error 447. Error stack:
    ORA-00447: fatal error in background process
    ORA-01280: Fatal LogMiner Error
    **********************

    I have a similar problem. I am using a Streams environment and got these
    "ORA-00353: log corruption near block" errors in the alert.log file
    while capturing changes on the primary database, and the Capture
    process aborted after that.
    Were those transactions lost, or were they captured and sent to the
    target database after I started the Capture process again?
    Has anyone solved this problem?
    Can you help me with it?

  • 11i load balancing web nodes without use of Hardware http load balancer

    I am looking at note 217368.1 (Advanced Configurations and Topologies for Enterprise Deployments of E-Business Suite 11i) and some other notes on load balancing but some aspects are not clear.
    The aim is to load balance traffic to the web nodes without using hardware (BigIP, Cisco, etc.) for HTTP layer load balancing.
    Which is preferable: DNS or the Apache JServ load balancer?
    I need details such as failover capabilities, death detection of nodes, functionality testing and ways to monitor the Apache JServ load balancer.
    Any help in this regard is welcome .
    thx
    arun

    Oracle recommends using loadbalancing hardware rather than using DNS. If you want the features you mention above, you will need a hardware loadbalancer.
    http://blogs.oracle.com/stevenChan/2006/06/indepth_loadbalancing_ebusines.html
    http://blogs.oracle.com/stevenChan/2009/01/using_cisco_ace_series_hardware_load-balancers_ebs12.html
    HTH
    Srini
