Member death detection (Coh-3.5.3)
Hi,
Can someone explain death detection in 3.5.3, please? I have a reasonable idea of how it works:
A member suspects another member has departed so it asks two other members for confirmation.
If these other members confirm departure then the original member informs the rest of the cluster that the member has departed.
Wherever possible, the members being asked to confirm departure will have different roles from the member asking for confirmation.
We occasionally lose storage nodes in this way but I have some questions around what I am seeing in the logs below.
The scenario is this:
* Member 27 has a timeout sending a packet to Member 16
* Member 27 asks Member 83 and member 85 to confirm departure of Member 16
* Member 83 rejects the confirmation request (@ 2010-09-22 05:21:43.411)
* Member 85 accepts the confirmation request (I assume it does as it has no rejection in its log)
* Member 27 informs the rest of the cluster that Member 16 has departed
* Member 1 (the senior member) heartbeats Member 16 causing it to re-initialise itself - it then rejoins as Member 127.
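The timeline above was pieced together by hand from four separate logs. A minimal Python sketch of how the same ordering could be rebuilt automatically is shown here. It assumes the log header layout seen in the excerpts in this thread (timestamp/uptime, product, level, thread, member); the names `HEADER`, `parse_line` and `timeline` are illustrative, and the sample lines are abbreviated with "...".

```python
import re
from datetime import datetime

# Header layout as seen in the excerpts in this thread (an assumption, not an official format spec).
HEADER = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})/(?P<uptime>\d+\.\d+) "
    r"Oracle Coherence \w+ (?P<version>\S+) <(?P<level>\w+)> "
    r"\(thread=(?P<thread>[^,]+), member=(?P<member>[^)]+)\): (?P<text>.*)"
)

def parse_line(line):
    """Return a dict for a Coherence log line, or None for anything else (GC output etc.)."""
    m = HEADER.search(line)
    if m is None:
        return None
    event = m.groupdict()
    event["ts"] = datetime.strptime(event["ts"], "%Y-%m-%d %H:%M:%S.%f")
    return event

def timeline(lines):
    """Merge lines from several members into one list ordered by wall-clock timestamp."""
    events = [e for e in map(parse_line, lines) if e is not None]
    return sorted(events, key=lambda e: e["ts"])

# Abbreviated sample lines taken from the Member 83 and Member 27 logs.
sample = [
    "2010-09-22 05:21:43.411/648845.780 Oracle Coherence GE 3.5.3/465p2 <Info> (thread=Cluster, member=83): Rejecting the departure confirmation request by Member(Id=27, ...)",
    "2010-09-22 05:21:43.410/648886.462 Oracle Coherence GE 3.5.3/465p2 <Warning> (thread=PacketPublisher, member=27): Timeout while delivering a packet ...",
    "648857.732: [GC 648857.732: [ParNew: ...]]",
]

events = timeline(sample)
for e in events:
    print(e["ts"].time(), "member", e["member"], e["level"], e["text"][:40])
```

Ordering by wall-clock time only makes sense if the machines' clocks are reasonably in sync, which a millisecond-level comparison like this one depends on.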
My question is: given that Member 83 rejected the confirmation request, I assume it could still see Member 16. What exactly are the rules around forcing a member to depart the cluster when this happens?
The nodes asked to confirm departure were another storage node (which rejected the request) and a storage disabled worker node (which accepted the request).
These storage disabled nodes can sometimes be under reasonable load so might not be the best ones to ask to confirm departure.
What happens when only one of the two members confirms departure?
Can we choose which roles get asked to confirm departure?
Member 27
2010-09-22 05:21:43.410/648886.462 Oracle Coherence GE 3.5.3/465p2 <Warning> (thread=PacketPublisher, member=27): Timeout while delivering a packet Directed{PacketType=0x0DDF00D5, ToId=16, FromId=27, Direction=Outgoing, SentCount=79, SentMillis=05:21:43.111, ToMemberSet=null, ServiceId=7, MessageType=16, FromMessageId=32360401, ToMessageId=1730276, MessagePartCount=1, MessagePartIndex=0, NackInProgress=false, ResendScheduled=05:21:43.311, Timeout=05:21:43.15, PendingResendSkips=0, DeliveryState=outstanding, Body=0x0034D45C01001B012B110B545A001B012B110B5459004C021564BEA9FC8FE2CA80014C230D992515A16200A501843100004E084744532047424C4F40A6014E063834353235374000004CA90215A06200A401945F00A201BE2000A4014219A501A16200A501843100004E084744532047424C4F40A6014E063834353235374040..., Body.length=1445}; requesting the departure confirmation for Member(Id=16, Timestamp=2010-09-14 17:06:52.289, Address=xx.xxx.34.93:8088, MachineId=35165, Location=machine:xxxxx06428,process:8645,member:xxxxx06428:Data-2, Role=RbsOdcCoreDaoODCCacheServer)
by MemberSet(Size=2, BitSetCount=4
Member(Id=83, Timestamp=2010-09-14 17:07:39.704, Address=xx.xxx.34.97:8091, MachineId=35169, Location=machine:xxxxx06432,process:25212,member:xxxxx06432:Data-6, Role=RbsOdcCoreDaoODCCacheServer)
Member(Id=85, Timestamp=2010-09-21 00:43:23.208, Address=xx.xxx.34.88:8090, MachineId=35160, Location=machine:xxxxx06441,process:10718,member:xxxxx06441:VEST-6, Role=RbsOdcVestCoreVestMain)
2010-09-22 05:21:43,412 [Logger@9227652 3.5.3/465p2] INFO Coherence - 2010-09-22 05:21:43.411/648886.463 Oracle Coherence GE 3.5.3/465p2 <Info> (thread=Cluster, member=27): Member departure confirmed by MemberSet(Size=1, BitSetCount=4
Member(Id=85, Timestamp=2010-09-21 00:43:23.208, Address=xx.xxx.34.88:8090, MachineId=35160, Location=machine:xxxxx06441,process:10718,member:xxxxx06441:VEST-6, Role=RbsOdcVestCoreVestMain)
); removing Member(Id=16, Timestamp=2010-09-14 17:06:52.289, Address=xx.xxx.34.93:8088, MachineId=35165, Location=machine:xxxxx06428,process:8645,member:xxxxx06428:Data-2, Role=RbsOdcCoreDaoODCCacheServer)
2010-09-22 05:21:43.412/648886.464 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=27): Member 16 left service Management with senior member 1
Member 83
2010-09-22 05:19:06.003/648688.372 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=83): Member 100 joined Service PutAllInvocationService with senior member 1
648726.432: [GC 648726.432: [ParNew: 186837K->14089K(191744K), 0.0153790 secs] 2272524K->2100231K(2538752K), 0.0155370 secs] [Times: user=0.19 sys=0.00, real=0.01 secs]
648782.121: [GC 648782.121: [ParNew: 184585K->8246K(191744K), 0.0077860 secs] 2270727K->2094904K(2538752K), 0.0079440 secs] [Times: user=0.09 sys=0.00, real=0.01 secs]
648784.254: [GC 648784.254: [ParNew: 178742K->11283K(191744K), 0.0231890 secs] 2265400K->2097941K(2538752K), 0.0232940 secs] [Times: user=0.18 sys=0.00, real=0.02 secs]
648840.470: [GC 648840.470: [ParNew: 180452K->8909K(191744K), 0.0078950 secs] 2267110K->2095568K(2538752K), 0.0080540 secs] [Times: user=0.06 sys=0.01, real=0.01 secs]
648842.869: [GC 648842.869: [ParNew: 179405K->9775K(191744K), 0.0189500 secs] 2266064K->2096433K(2538752K), 0.0190970 secs] [Times: user=0.21 sys=0.00, real=0.02 secs]
2010-09-22 05:21:43.411/648845.780 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=83): MemberLeft request for Member 16 received from Member(Id=27, Timestamp=2010-09-14 17:06:58.895, Address=xx.xxx.34.95:8090, MachineId=35167, Location=machine:xxxxx06430,process:8662,member:xxxxx06430:Data-1, Role=RbsOdcCoreDaoODCCacheServer)
2010-09-22 05:21:43.411/648845.780 Oracle Coherence GE 3.5.3/465p2 <Info> (thread=Cluster, member=83): Rejecting the departure confirmation request by Member(Id=27, Timestamp=2010-09-14 17:06:58.895, Address=xx.xxx.34.95:8090, MachineId=35167, Location=machine:xxxxx06430,process:8662,member:xxxxx06430:Data-1, Role=RbsOdcCoreDaoODCCacheServer) regarding Member(Id=16, Timestamp=2010-09-14 17:06:52.289, Address=xx.xxx.34.93:8088, MachineId=35165, Location=machine:xxxxx06428,process:8645,member:xxxxx06428:Data-2, Role=RbsOdcCoreDaoODCCacheServer)
2010-09-22 05:21:43.413/648845.782 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=83): MemberLeft notification for Member 16 received from Member(Id=27, Timestamp=2010-09-14 17:06:58.895, Address=xx.xxx.34.95:8090, MachineId=35167, Location=machine:xxxxx06430,process:8662,member:xxxxx06430:Data-1, Role=RbsOdcCoreDaoODCCacheServer)
2010-09-22 05:21:43.413/648845.782 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=83): Member 16 left service Management with senior member 1
Member 85
2010-09-22 05:19:05.894/102952.247 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=85): Member 100 joined Service distributed-pof-service with senior member 1
2010-09-22 05:19:06.003/102952.356 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=85): Member 100 joined Service PutAllInvocationService with senior member 1
09/22/10 05:21:07.667 INFO: [ProcessWrapper] [STDOUT] 103074.362: [GC 103074.362: [ParNew: 225038K->3371K(249216K), 0.0025490 secs] 782627K->560965K(2069504K), 0.0026190 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
2010-09-22 05:21:43.411/103109.764 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=85): MemberLeft request for Member 16 received from Member(Id=27, Timestamp=2010-09-14 17:06:58.895, Address=xx.xxx.34.95:8090, MachineId=35167, Location=machine:xxxxx06430,process:8662,member:xxxxx06430:Data-1, Role=RbsOdcCoreDaoODCCacheServer)
2010-09-22 05:21:43.412/103109.765 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=85): MemberLeft notification for Member 16 received from Member(Id=27, Timestamp=2010-09-14 17:06:58.895, Address=xx.xxx.34.95:8090, MachineId=35167, Location=machine:xxxxx06430,process:8662,member:xxxxx06430:Data-1, Role=RbsOdcCoreDaoODCCacheServer)
2010-09-22 05:21:43.412/103109.765 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=85): Member 16 left service Management with senior member 1
Member 16
09/22/10 05:11:03.478 INFO: [ProcessWrapper] [STDOUT] 648253.507: [GC 648253.507: [ParNew: 173464K->2443K(191744K), 0.0128920 secs] 1401433K->1230671K(2538752K), 0.0130370 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
09/22/10 05:12:18.307 INFO: [ProcessWrapper] [STDOUT] 648328.337: [GC 648328.337: [ParNew: 172939K->2972K(191744K), 0.0108240 secs] 1401167K->1231401K(2538752K), 0.0109550 secs] [Times: user=0.05 sys=0.01, real=0.01 secs]
09/22/10 05:13:30.532 INFO: [ProcessWrapper] [STDOUT] 648400.564: [GC 648400.564: [ParNew: 173468K->2582K(191744K), 0.0095490 secs] 1401897K->1231266K(2538752K), 0.0097180 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
09/22/10 05:14:51.958 INFO: [ProcessWrapper] [STDOUT] 648481.990: [GC 648481.990: [ParNew: 173078K->2969K(191744K), 0.0087620 secs] 1401762K->1231877K(2538752K), 0.0088810 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
09/22/10 05:16:12.607 INFO: [ProcessWrapper] [STDOUT] 648562.641: [GC 648562.641: [ParNew: 173465K->2798K(191744K), 0.0067770 secs] 1402373K->1231945K(2538752K), 0.0069300 secs] [Times: user=0.07 sys=0.01, real=0.01 secs]
09/22/10 05:17:23.249 INFO: [ProcessWrapper] [STDOUT] 648633.284: [GC 648633.284: [ParNew: 173294K->2824K(191744K), 0.0064570 secs] 1402441K->1232187K(2538752K), 0.0065950 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
09/22/10 05:18:43.913 INFO: [ProcessWrapper] [STDOUT] 648713.948: [GC 648713.948: [ParNew: 173320K->2812K(191744K), 0.0065200 secs] 1402683K->1232354K(2538752K), 0.0066450 secs] [Times: user=0.05 sys=0.01, real=0.00 secs]
2010-09-22 05:19:05.894/648735.426 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=16): Member 100 joined Service distributed-pof-service with senior member 1
2010-09-22 05:19:06.003/648735.535 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=16): Member 100 joined Service PutAllInvocationService with senior member 1
09/22/10 05:21:30.919 INFO: [ProcessWrapper] [STDOUT] 648880.948: [GC 648880.948: [ParNew: 173272K->3255K(191744K), 0.0126250 secs] 1403096K->1233261K(2538752K), 0.0127910 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
2010-09-22 05:21:44.005/648893.537 Oracle Coherence GE 3.5.3/465p2 <Error> (thread=Cluster, member=16): Received cluster heartbeat from the senior Member(Id=1, Timestamp=2010-09-14 17:06:38.751, Address=xx.xxx.34.98:8088, MachineId=35170, Location=machine:xxxxx06433,process:20795,member:xxxxx06433:Data-1, Role=RbsOdcCoreDaoODCCacheServer) that does not contain this Member(Id=16, Timestamp=2010-09-14 17:06:52.289, Address=xx.xxx.34.93:8088, MachineId=35165, Location=machine:xxxxx06428,process:8645,member:xxxxx06428:Data-2, Role=RbsOdcCoreDaoODCCacheServer); stopping cluster service.
2010-09-22 05:21:44.005/648893.537 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=16): Service Cluster left the cluster
Any more information would be appreciated (or any settings I can tweak).
Cheers,
JK
Thanks Mark, we will look at the timeout settings.
In the logs for the accusing node we see...
2010-09-22 05:21:14.408/648857.460 Oracle Coherence GE 3.5.3/465p2 <D6> (thread=PacketPublisher, member=27): Member(Id=16, Timestamp=2010-09-14 17:06:52.289, Address=xx.xxx.34.93:8088, MachineId=35165, Location=machine:xxxxx06428,process:8645,member:xxxxx06428:Data-2, Role=RbsOdcCoreDaoODCCacheServer) has failed to respond to 17 packets; declaring this member as paused.
648857.732: [GC 648857.732: [ParNew: 189247K->16395K(191744K), 0.0229780 secs] 1556730K->1384220K(2538752K), 0.0231600 secs] [Times: user=0.29 sys=0.00, real=0.02 secs]
2010-09-22 05:21:43.410/648886.462 Oracle Coherence GE 3.5.3/465p2 <Warning> (thread=PacketPublisher, member=27): Timeout while delivering a packet Directed{PacketType=0x0DDF00D5, ToId=16, FromId=27, Direction=Outgoing, SentCount=79, SentMillis=05:21:43.111, ToMemberSet=null, ServiceId=7, MessageType=16, FromMessageId=32360401, ToMessageId=1730276, MessagePartCount=1, MessagePartIndex=0, NackInProgress=false, ResendScheduled=05:21:43.311, Timeout=05:21:43.15, PendingResendSkips=0, DeliveryState=outstanding, Body=0x0034D45C01001B012B110B545A001B012B110B5459004C021564BEA9FC8FE2CA80014C230D992515A16200A501843100004E084744532047424C4F40A6014E063834353235374000004CA90215A06200A401945F00A201BE2000A4014219A501A16200A501843100004E084744532047424C4F40A6014E063834353235374040..., Body.length=1445}; requesting the departure confirmation for Member(Id=16, Timestamp=2010-09-14 17:06:52.289, Address=xx.xxx.34.93:8088, MachineId=35165, Location=machine:xxxxx06428,process:8645,member:xxxxx06428:Data-2, Role=RbsOdcCoreDaoODCCacheServer)
by MemberSet(Size=2, BitSetCount=4
Member(Id=83, Timestamp=2010-09-14 17:07:39.704, Address=xx.xxx.34.97:8091, MachineId=35169, Location=machine:xxxxx06432,process:25212,member:xxxxx06432:Data-6, Role=RbsOdcCoreDaoODCCacheServer)
Member(Id=85, Timestamp=2010-09-21 00:43:23.208, Address=xx.xxx.34.88:8090, MachineId=35160, Location=machine:xxxxx06441,process:10718,member:xxxxx06441:VEST-6, Role=RbsOdcVestCoreVestMain)
2010-09-22 05:21:43,412 [Logger@9227652 3.5.3/465p2] INFO Coherence - 2010-09-22 05:21:43.411/648886.463 Oracle Coherence GE 3.5.3/465p2 <Info> (thread=Cluster, member=27): Member departure confirmed by MemberSet(Size=1, BitSetCount=4
Member(Id=85, Timestamp=2010-09-21 00:43:23.208, Address=xx.xxx.34.88:8090, MachineId=35160, Location=machine:xxxxx06441,process:10718,member:xxxxx06441:VEST-6, Role=RbsOdcVestCoreVestMain)
); removing Member(Id=16, Timestamp=2010-09-14 17:06:52.289, Address=xx.xxx.34.93:8088, MachineId=35165, Location=machine:xxxxx06428,process:8645,member:xxxxx06428:Data-2, Role=RbsOdcCoreDaoODCCacheServer)
2010-09-22 05:21:43.412/648886.464 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=27): Member 16 left service Management with senior member 1
So the first hint that anything is wrong is the debug message at 05:21:14.408.
Confirmation is requested at 05:21:43.410 which is 29 seconds later, so I assume we have the "dev" timeout settings of 30 seconds.
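The 29-second figure can be checked directly from the two timestamps quoted above. A small Python sketch (the variable names are illustrative):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# "declaring this member as paused" message on Member 27
paused = datetime.strptime("2010-09-22 05:21:14.408", FMT)
# "requesting the departure confirmation" message on Member 27
confirm = datetime.strptime("2010-09-22 05:21:43.410", FMT)

elapsed = (confirm - paused).total_seconds()
print(f"{elapsed:.3f} s between the pause warning and the confirmation request")
```

This only measures the gap between the two log messages; it does not by itself prove which timeout setting is in effect.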
The logs for the suspect member have...
09/22/10 05:17:23.249 INFO: [ProcessWrapper] [STDOUT] 648633.284: [GC 648633.284: [ParNew: 173294K->2824K(191744K), 0.0064570 secs] 1402441K->1232187K(2538752K), 0.0065950 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
09/22/10 05:18:43.913 INFO: [ProcessWrapper] [STDOUT] 648713.948: [GC 648713.948: [ParNew: 173320K->2812K(191744K), 0.0065200 secs] 1402683K->1232354K(2538752K), 0.0066450 secs] [Times: user=0.05 sys=0.01, real=0.00 secs]
09/22/10 05:19:05.897 INFO: [ProcessWrapper] [STDOUT] 2010-09-22 05:19:05,895 [Logger@9248631 3.5.3/465p2] DEBUG Coherence - 2010-09-22 05:19:05.894/648735.426 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=16): Member 100 joined Service distributed-pof-service with senior member 1
09/22/10 05:19:06.004 INFO: [ProcessWrapper] [STDOUT] 2010-09-22 05:19:06,004 [Logger@9248631 3.5.3/465p2] DEBUG Coherence - 2010-09-22 05:19:06.003/648735.535 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=16): Member 100 joined Service PutAllInvocationService with senior member 1
09/22/10 05:20:08.250 INFO: [ProcessWrapper] [STDOUT] 648798.277: [GC 648798.277: [ParNew: 173308K->2776K(191744K), 0.0136850 secs] 1402850K->1232600K(2538752K), 0.0138770 secs] [Times: user=0.09 sys=0.00, real=0.01 secs]
09/22/10 05:20:16.851 INFO: [FabricDiagnosticsPlugin] [M: 15M/40M/184M] [T: O(30)] [3.0.1.7] [JRE: 1.5.0_05/Sun Microsystems Inc.] [OS: Linux/2.6.9-89.0.9.ELlargesmp/i386] [H: xx.xxx.34.93]
09/22/10 05:21:30.919 INFO: [ProcessWrapper] [STDOUT] 648880.948: [GC 648880.948: [ParNew: 173272K->3255K(191744K), 0.0126250 secs] 1403096K->1233261K(2538752K), 0.0127910 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
09/22/10 05:21:44.005 INFO: [ProcessWrapper] [STDOUT] 2010-09-22 05:21:44,005 [Logger@9248631 3.5.3/465p2] ERROR Coherence - 2010-09-22 05:21:44.005/648893.537 Oracle Coherence GE 3.5.3/465p2 <Error> (thread=Cluster, member=16): Received cluster heartbeat from the senior Member(Id=1, Timestamp=2010-09-14 17:06:38.751, Address=xx.xxx.34.98:8088, MachineId=35170, Location=machine:xxxxx06433,process:20795,member:xxxxx06433:Data-1, Role=RbsOdcCoreDaoODCCacheServer) that does not contain this Member(Id=16, Timestamp=2010-09-14 17:06:52.289, Address=xx.xxx.34.93:8088, MachineId=35165, Location=machine:xxxxx06428,process:8645,member:xxxxx06428:Data-2, Role=RbsOdcCoreDaoODCCacheServer); stopping cluster service.
These logs seem to show the member is reasonably OK: there are GC pauses but none lasts anywhere near 30 seconds, although I suppose there are other reasons besides GC that may cause comms timeouts.
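To back up the "no long GC pauses" reading without eyeballing every line, the pause durations can be pulled out of the GC output. A rough Python sketch follows, assuming the ParNew line layout shown in the logs above; `PAUSE` and `gc_pauses` are illustrative names, and other collectors or JVM flags will print different formats.

```python
import re

# Captures the JVM uptime stamp and the total pause (the last "... secs]" before "[Times").
PAUSE = re.compile(
    r"(?P<uptime>\d+\.\d+): \[GC .*?(?P<pause>\d+\.\d+) secs\] \[Times"
)

def gc_pauses(lines):
    """Return (uptime_seconds, pause_seconds) for every GC line that matches."""
    pauses = []
    for line in lines:
        m = PAUSE.search(line)
        if m:
            pauses.append((float(m.group("uptime")), float(m.group("pause"))))
    return pauses

# Two GC lines copied from the suspect member's log above.
sample = [
    "09/22/10 05:20:08.250 INFO: [ProcessWrapper] [STDOUT] 648798.277: [GC 648798.277: [ParNew: 173308K->2776K(191744K), 0.0136850 secs] 1402850K->1232600K(2538752K), 0.0138770 secs] [Times: user=0.09 sys=0.00, real=0.01 secs]",
    "09/22/10 05:21:30.919 INFO: [ProcessWrapper] [STDOUT] 648880.948: [GC 648880.948: [ParNew: 173272K->3255K(191744K), 0.0126250 secs] 1403096K->1233261K(2538752K), 0.0127910 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]",
]

pauses = gc_pauses(sample)
longest = max(pauses, key=lambda p: p[1])
print(f"longest pause: {longest[1] * 1000:.1f} ms at uptime {longest[0]}")
```

A clean GC log only rules out collector pauses that the JVM reports; swapping, network problems or a starved CPU would not show up here.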
Cheers,
JK
Similar Messages
-
PSVC application death detected
Please Help!
My Sunfire V490 has an error in /var/adm/messages indicating "picld[129]: [ID 230523 daemon.error] PSVC application death detected".
Platform:
SunOS apgdb3 5.10 Generic_125100-05 sun4u sparc SUNW,Sun-Fire-V490
I did a search and realized that this error is related to the Platform Service. The most common cause is that the server's platform-specific patches are not up to date.
For SunOS 5.9, Solaris 9, there is a recommendation to install patch 113447-23. Patch 113447-23 is obsoleted by patch 118558-39 and 118558-39 is recommended for SunOS 5.9.
Questions:
1) How can I fix the error "PSVC application death detected" in SunOS 5.10?
2) If I want to install patch 118558-39 of SunOS 5.9 in SunOS 5.10 Solaris 10, how can I find an equivalent patch for SunOS 5.10?
3) If I want to install/update platform-specific patches for Sunfire V490, SunOS 5.10, which patch(es) would be the right ones? (I did a search and it returned many results.)
Your help is much appreciated.
Joe_ES
Hello Salthaus_de
To be honest, it might work if you do as you just said and move it over to a package, and it might not. :)
The most typical reason this happens is that your application install ends before it has finished what it should do. I have seen this happen a lot with "Autodesk" products: they start a "setup.exe" which spawns another "setup.exe" and then the first one exits.
This tells SCCM that the setup.exe PID it was asked to watch has ended, so it can now run the detection check.
This means the product will keep installing for another 15 minutes and hold msiexec.exe, which means the applications installed after this one could fail with the error code saying that another product is currently being installed.
Based on this I would still, even though the "workaround" works as it should, find out why this happens.
Just my 2 cents, what you do next is up to you. :)
Kind regards
Morten Leth -
Detected another cluster senior running on an incompatible protocol at null
I recently saw this confusing error. I'm not sure what "null" should have been, but I think the error is also erroneous since everything I have is running 3.7.1.1.
-Andrew
0656PM.log-2012-03-01 20:23:35.016/5246.928 Oracle Coherence GE 3.7.1.1 <Error> (thread=Cluster, member=16): Detected another cluster senior, running on an incompatible protocol at null manual intervention may be required
If none of the troubleshooting steps for the display adaptor (graphics card driver update, BadDrivres.txt etc.) work, we have to understand that the display may simply not be supported. By display, I mean the card-driver-driver version combination. So,
1. It could be a bad card - None of the available drivers work for the application. Here, ditch the card and get a newer one that works.
2. It could be a bad driver - For example, only the DirectX drivers work and the OpenGL ones are poor (many games detect this, and when it is bad I have seen Photoshop struggle with it). Or some display configurations are badly supported (multiple monitors, some odd resolutions, some of the ports, crossfire etc.). Older drivers/cards/some OEM drivers etc. have these issues in my experience.
3. It could be a bad driver version - Here an upgrade or a downgrade works.
A video editor highlights the (lack of) capabilities of the card. Video playback alone doesn't nearly exercise the capabilities the cards usually have to offer. Editing really does. That is why it matters whether the video editor actually says that it supports graphics acceleration (or GPU acceleration) for any of the workflows.
What I am saying here is that when we blame the application for any of the issues, we have to be certain that the issues are NOT due to one of the three causes mentioned. I have seen this all too often in the gaming forums, Nero forums (some years back), in the PPro forum, and here. If the application manufacturer were really to blame, the mistake would be all too obvious and too many of us would have complained about it. I do not think this is the case with Premiere Elements 10.
Introspection works. -
"Service Cluster left the cluster" - lost all my data
My four storage-enabled cluster nodes lost all their cached data when all the services left the cluster in response to some issue(?). Is that the expected behavior? Is the correct procedure to transactionally store to disk so you can reload when this happens, or should this simply never happen? It seems like this should not happen. These four nodes are on the same server. At about 12:31 everything goes pear-shaped.
2011-01-14 12:31:16.904/50004.436 Oracle Coherence GE 3.6.0.0 <Error> (thread=Cluster, member=3): This senior Member(Id=3, Timestamp=2011-01-13 22:37:52.106, Address=192.168.3.20:8088, MachineId=27412, Location=machine:amd4,process:4428,member:Administrator, Role=CoherenceServer) appears to have been disconnected from other nodes due to a long period of inactivity and the seniority has been assumed by the Member(Id=9, Timestamp=2011-01-13 22:38:01.438, Address=192.168.3.20:8094, MachineId=27412, Location=machine:amd4,process:3904,member:Administrator, Role=CoherenceServer); stopping cluster service.
2011-01-14 12:31:16.905/50004.437 Oracle Coherence GE 3.6.0.0 <D5> (thread=Cluster, member=3): Service Cluster left the cluster
2011-01-14 12:31:16.906/50004.438 Oracle Coherence GE 3.6.0.0 <D5> (thread=DistributedCache:DistributedStatsCacheService, member=3): Service DistributedStatsCacheService left the cluster
2011-01-14 12:31:16.906/50004.438 Oracle Coherence GE 3.6.0.0 <D5> (thread=Proxy:ExtendTcpProxyService, member=3): Service ExtendTcpProxyService left the cluster
2011-01-14 12:31:16.907/50004.439 Oracle Coherence GE 3.6.0.0 <D5> (thread=DistributedCache:DistributedQuotesCacheService, member=3): Service DistributedQuotesCacheService left the cluster
2011-01-14 12:31:16.913/50004.445 Oracle Coherence GE 3.6.0.0 <D5> (thread=Invocation:Management, member=3): Service Management left the cluster
2011-01-14 12:31:16.913/50004.445 Oracle Coherence GE 3.6.0.0 <D5> (thread=DistributedCache:DistributedOrdersService, member=3): Service DistributedOrdersService left the cluster
2011-01-14 12:31:16.913/50004.445 Oracle Coherence GE 3.6.0.0 <D5> (thread=DistributedCache:DistributedCacheService, member=3): Service DistributedCacheService left the cluster
2011-01-14 12:31:16.914/50004.446 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: Channel(Id=214992652, Open=false)
2011-01-14 12:31:16.914/50004.446 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: Channel(Id=8305999, Open=false)
2011-01-14 12:31:16.915/50004.447 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: Channel(Id=1383343339, Open=false)
2011-01-14 12:31:16.915/50004.447 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: TcpConnection(Id=0x0000012D84061C15C0A803149CF3279B334BE6140AC76C47CA03670D76A96D22, Open=false, LocalAddress=192.168.3.20:9091, RemoteAddress=192.168.3.6:65480)
2011-01-14 12:31:16.915/50004.447 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: Channel(Id=1003858188, Open=false)
2011-01-14 12:31:16.915/50004.447 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: Channel(Id=1586910282, Open=false)
2011-01-14 12:31:16.915/50004.447 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: TcpConnection(Id=0x0000012D84060E5AC0A8031442EA3CC26AC425D55D93A6AFC5404E5A76A96D1E, Open=false, LocalAddress=192.168.3.20:9091, RemoteAddress=192.168.3.6:65472)
2011-01-14 12:31:16.915/50004.447 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor:TcpProcessor, member=3): Released: TcpConnection(Id=0x0000012D84061C15C0A803149CF3279B334BE6140AC76C47CA03670D76A96D22, Open=false, LocalAddress=192.168.3.20:9091, RemoteAddress=192.168.3.6:65480)
2011-01-14 12:31:16.915/50004.447 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: Channel(Id=160435953, Open=false)
2011-01-14 12:31:16.915/50004.447 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor:TcpProcessor, member=3): Released: TcpConnection(Id=0x0000012D84060E5AC0A8031442EA3CC26AC425D55D93A6AFC5404E5A76A96D1E, Open=false, LocalAddress=192.168.3.20:9091, RemoteAddress=192.168.3.6:65472)
2011-01-14 12:31:16.916/50004.448 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: Channel(Id=1635893341, Open=false)
2011-01-14 12:31:16.916/50004.448 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor, member=3): Closed: TcpConnection(Id=0x0000012D84061203C0A8031455CD3A790F6009CA79AEC8BACC464D9976A96D20, Open=false, LocalAddress=192.168.3.20:9091, RemoteAddress=192.168.3.6:65478)
2011-01-14 12:31:16.916/50004.448 Oracle Coherence GE 3.6.0.0 <D6> (thread=Proxy:ExtendTcpProxyService:TcpAcceptor:TcpProcessor, member=3): Released: TcpConnection(Id=0x0000012D84061203C0A8031455CD3A790F6009CA79AEC8BACC464D9976A96D20, Open=false, LocalAddress=192.168.3.20:9091, RemoteAddress=192.168.3.6:65478)
2011-01-14 12:31:16.916/50004.448 Oracle Coherence GE 3.6.0.0 <D5> (thread=DistributedCache:DistributedExecutionsService, member=3): Service DistributedExecutionsService left the cluster
2011-01-14 12:31:16.919/50004.451 Oracle Coherence GE 3.6.0.0 <D5> (thread=DistributedCache:DistributedPositionsCacheService, member=3): Service DistributedPositionsCacheService left the cluster
and ...
2011-01-14 12:31:22.874/50006.273 Oracle Coherence GE 3.6.0.0 <Info> (thread=main, member=n/a): Restarting cluster
2011-01-14 12:31:22.924/50006.323 Oracle Coherence GE 3.6.0.0 <D4> (thread=main, member=n/a): TCMP bound to /192.168.3.20:8094 using SystemSocketProvider
2011-01-14 12:31:52.937/50036.336 Oracle Coherence GE 3.6.0.0 <Warning> (thread=Cluster, member=n/a): This Member(Id=0, Timestamp=2011-01-14 12:31:22.924, Address=192.168.3.20:8094, MachineId=27412, Location=machine:amd4,process:4136,member:Administrator, Role=CoherenceServer) has been attempting to join the cluster at address 225.0.0.1:54321 with TTL 4 for 30 seconds without success; this could indicate a mis-configured TTL value, or it may simply be the result of a busy cluster or active failover.
2011-01-14 12:31:52.950/50036.349 Oracle Coherence GE 3.6.0.0 <Warning> (thread=Cluster, member=n/a): Received a discovery message that indicates the presence of an existing cluster that does not respond to join requests; this is usually caused by a network layer failure:
Logs starting at 12:30 from the four nodes are here:
http://www.nmedia.net/~andrew/logs/1.log
http://www.nmedia.net/~andrew/logs/2.log
http://www.nmedia.net/~andrew/logs/3.log
http://www.nmedia.net/~andrew/logs/4.log
If someone could tell me if this is a bug in the cluster re-join logic or something I screwed up that would be great. Thanks!
Andrew
Hi Andrew
I had a quick look at your logs but cannot say for certain why your cluster died. I can say that losing data is a normal consequence of node loss though. If you have the backup count set to 1 then you can lose a single node without losing data. If you lose more than one node (on different machines, or the same machine if you only have one) over a very short space of time then you will almost certainly lose at least one partition and hence lose the data within that partition.
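The backup-count point can be illustrated with a toy model. The Python sketch below is only that: it uses a made-up round-robin placement (`assign`) rather than Coherence's real partition assignment, and all names in it are hypothetical. It shows why one lost node is survivable with a backup count of 1 while two nodes lost together are not.

```python
def assign(partitions, nodes, backup_count=1):
    """Toy round-robin placement: each partition gets a primary plus backups on distinct nodes.

    This is NOT how Coherence assigns partitions; it only illustrates the counting argument.
    """
    owners = {}
    for p in range(partitions):
        owners[p] = [nodes[(p + k) % len(nodes)] for k in range(backup_count + 1)]
    return owners

def lost_partitions(owners, dead):
    """A partition is lost only when every node holding a copy of it is dead."""
    return [p for p, holders in owners.items() if all(n in dead for n in holders)]

nodes = ["n1", "n2", "n3", "n4"]
owners = assign(partitions=8, nodes=nodes, backup_count=1)

one_down = lost_partitions(owners, {"n1"})
two_down = lost_partitions(owners, {"n1", "n2"})
all_down = lost_partitions(owners, set(nodes))

print("one node lost:  ", one_down)
print("two nodes lost: ", two_down)
print("all nodes lost: ", len(all_down), "of", len(owners), "partitions")
```

In this model the backups only help if the surviving node has time to create a new backup copy before the next failure, which is why several departures in quick succession lose data.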
Going back to your logs, it is difficult to determine the underlying cause without the whole set of logs. You have posted links to four logs, but from looking at them the cluster has about 16 nodes. I know from experience (as we had a cluster that was quite unstable for a while) that tracing these issues through the logs can be a bit awkward, but you soon get the hang of it :-)
For example in the log http://www.nmedia.net/~andrew/logs/1.log you have...
2011-01-14 12:31:16.807/49993.331 Oracle Coherence GE 3.6.0.0 <D5> (thread=Cluster, member=9): MemberLeft notification for Member(Id=3, Timestamp=2011-01-13 22:37:52.106, Address=192.168.3.20:8088, MachineId=27412, Location=machine:amd4,process:4428,member:Administrator, Role=CoherenceServer, PublisherSuccessRate=0.9975, ReceiverSuccessRate=0.9999, PauseRate=0.0, Threshold=93, Paused=false, Deferring=false, OutstandingPackets=0, DeferredPackets=0, ReadyPackets=0, LastIn=261ms, LastOut=277ms, LastSlow=n/a) received from Member(Id=22, Timestamp=2011-01-14 08:21:22.284, Address=192.168.3.121:8092, MachineId=27513, Location=machine:H1,process:3716,member:Howard, Role=Order_entry_window, PublisherSuccessRate=0.8326, ReceiverSuccessRate=1.0, PauseRate=0.0024, Threshold=1456, Paused=false, Deferring=false, OutstandingPackets=0, DeferredPackets=0, ReadyPackets=0, LastIn=0ms, LastOut=8ms, LastSlow=n/a)
...which is Member-9 receiving a message about the departure of Member-3 from Member-22, so you would then need to look at the logs for Member-22 to see why it thought Member-3 had departed, and also look at the logs for Member-3 for that time to see what might be wrong with it.
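When there are many nodes, finding out which member reported a departure is easier with a small script than by reading each log. A minimal Python sketch, assuming the two "MemberLeft notification" layouts that appear on this page (the 3.5.3 form "Member 16" and the 3.6 form "Member(Id=3, ...)"); `NOTIFY` and `who_reported` are illustrative names and the sample lines are abbreviated with "...".

```python
import re

# observer = member writing the log line, departed = member removed, reporter = member that announced it
NOTIFY = re.compile(
    r"member=(?P<observer>\d+)\): MemberLeft notification for "
    r"Member(?:\(Id=| )(?P<departed>\d+).*? received from Member\(Id=(?P<reporter>\d+)"
)

def who_reported(line):
    """Return observer/departed/reporter ids for a MemberLeft notification line, else None."""
    m = NOTIFY.search(line)
    if m is None:
        return None
    return {k: int(v) for k, v in m.groupdict().items()}

# Abbreviated samples in the 3.5.3 and 3.6.0.0 layouts.
old_style = ("2010-09-22 05:21:43.413/648845.782 Oracle Coherence GE 3.5.3/465p2 <D5> "
             "(thread=Cluster, member=83): MemberLeft notification for Member 16 "
             "received from Member(Id=27, ...)")
new_style = ("2011-01-14 12:31:16.807/49993.331 Oracle Coherence GE 3.6.0.0 <D5> "
             "(thread=Cluster, member=9): MemberLeft notification for Member(Id=3, ..., LastSlow=n/a) "
             "received from Member(Id=22, ...)")

print(who_reported(old_style))
print(who_reported(new_style))
```

The `reporter` id tells you whose log to open next; "MemberLeft request" lines are deliberately not matched because they are confirmation requests rather than announcements.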
The more worrying messages would be ones like this...
2011-01-14 12:31:16.709/49993.233 Oracle Coherence GE 3.6.0.0 <Warning> (thread=PacketPublisher, member=9): Experienced a 19025 ms communication delay (probable remote GC) with Member(Id=21, Timestamp=2011-01-14 08:21:12.174, Address=192.168.3.121:8090, MachineId=27513, Location=machine:H1,process:4316,member:Howard, Role=OrderbookviewerViewer); 111 packets rescheduled, PauseRate=0.0014, Threshold=1696
...a 19 second delay is a long time and would suggest either very long GC pauses or a network problem. Do you have GC logs for these processes? Are all the servers connected to the same switch, or is the cluster distributed over more than one part of your network? Do you have too much on one machine, are you overloading the NIC, are you swapping? All of these can cause delays and/or loss of packets.
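Communication-delay warnings like the one above can be collected across all the logs to see which members are slow and how often. A short Python sketch, assuming the message wording shown above; `DELAY` and `parse_delay` are illustrative names and the sample line is abbreviated with "...".

```python
import re

DELAY = re.compile(
    r"member=(?P<observer>\d+)\): Experienced a (?P<ms>\d+) ms communication delay "
    r".*?with Member\(Id=(?P<peer>\d+).*?(?P<packets>\d+) packets rescheduled"
)

def parse_delay(line):
    """Return observer, delay in ms, slow peer id and rescheduled packet count, or None."""
    m = DELAY.search(line)
    if m is None:
        return None
    return {k: int(v) for k, v in m.groupdict().items()}

sample = ("2011-01-14 12:31:16.709/49993.233 Oracle Coherence GE 3.6.0.0 <Warning> "
          "(thread=PacketPublisher, member=9): Experienced a 19025 ms communication delay "
          "(probable remote GC) with Member(Id=21, ..., Role=OrderbookviewerViewer); "
          "111 packets rescheduled, PauseRate=0.0014, Threshold=1696")

delay = parse_delay(sample)
print(f"member {delay['observer']} waited {delay['ms'] / 1000:.1f} s on member {delay['peer']}")
```

Grouping the results by `peer` would show whether one process (for example a storage-disabled node with long GC pauses) is behind most of the delays.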
We have had problems with storage disabled nodes doing long GC pauses and causing storage nodes to drop out of the cluster. Our cluster was on 3.5.3-p8 whereas you are on 3.6.0.0 which is supposed to have better node death detection so you might not have the same issues we had.
Sorry to not be more help,
JK -
I would like to know the role of each thread in Coherence
Help me.
I would like to know the role of each thread in Coherence.
There are too many kinds of threads.
Example ~
GC Slave RUNNABLE
RMI TCP Accept-1972 RUNNABLE
Health Center trace subscriber RUNNABLE
LT=0:P=342534:O=0:port=55170 RUNNABLE
Attach API wait loop RUNNABLE
PacketListener1 RUNNABLE
PacketListener1P RUNNABLE
PacketListenerN RUNNABLE
Cluster|Member(Id=1, Timestamp=2013-04-05 10:45:44.655, Address=192.168.240.157:8088, MachineId=50044, Location=site:,machine:TMTEST-PC,process:5316, Role=CoherenceServer) RUNNABLE
RT=0:P=342534:O=0:TCPTransportConnection[addr=192.168.240.157,port=55178,local=55170] RUNNABLE
Finalizer thread RUNNABLE
WT=10 RUNNABLE
main TIMED_WAITING
IpMonitor TIMED_WAITING
Invocation:Management:EventDispatcher TIMED_WAITING
Invocation:Management TIMED_WAITING
DistributedCache TIMED_WAITING
JMX server connection timeout 52 TIMED_WAITING
RMI Scheduler(0) WAITING
Thread-6 WAITING
stop JMX Server on shutdown WAITING
Logger@9228429 3.7.1.7 WAITING
PacketReceiver WAITING
PacketPublisher WAITING
PacketSpeaker WAITING
WT=7 WAITING
WT=9 WAITING
-----------------------------------------------------------------------------------------------------------------------------------------------
Briefly:
PacketListener1 PacketListener1P PacketListenerN - listening IO threads for the TCMP transport protocol
Cluster|Member(Id=1, Timestamp=2013-04-05 10:45:44.655, Address=192.168.240.157:8088, MachineId=50044, Location=site:,machine:TMTEST-PC,process:5316, Role=CoherenceServer) - main thread for the cluster service (discovery, node joining / leaving, etc.)
IpMonitor - IP monitor, participates in the death detection scheme
Invocation:Management:EventDispatcher - event dispatch thread for the distributed JMX service in Coherence
Invocation:Management - main thread for the distributed JMX service in Coherence
DistributedCache - main thread for the DistributedCache cache service
Logger@9228429 3.7.1.7 - Coherence async logging thread
PacketReceiver - thread dispatching incoming network packets
PacketPublisher - thread sending out packets via TCMP
PacketSpeaker - thread sending out packets via TCMP (offloads some work from PacketPublisher for better core utilization) -
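The list in the reply above can be turned into a small lookup for annotating a thread dump. This is a sketch only: the role texts are paraphrased from that reply, while the prefix matching and the names `ROLE_BY_PREFIX` and `describe_thread` are assumptions made for illustration, not anything Coherence provides.

```python
# Roles paraphrased from the reply above. Matching by name prefix is an
# assumption of this sketch; longer prefixes are listed before shorter ones.
ROLE_BY_PREFIX = [
    ("PacketListener", "listening IO thread for TCMP"),
    ("PacketReceiver", "dispatches incoming network packets"),
    ("PacketPublisher", "sends out packets via TCMP"),
    ("PacketSpeaker", "sends out packets via TCMP (offloads PacketPublisher)"),
    ("Cluster|", "main thread for the cluster service"),
    ("IpMonitor", "IP monitor, participates in death detection"),
    ("Invocation:Management:EventDispatcher",
     "event dispatch thread for the distributed JMX service"),
    ("Invocation:Management", "main thread for the distributed JMX service"),
    ("DistributedCache", "main thread for the DistributedCache service"),
    ("Logger@", "async logging thread"),
]

def describe_thread(name):
    """Return the role for a Coherence thread name, or None if it is not listed."""
    for prefix, role in ROLE_BY_PREFIX:
        if name.startswith(prefix):
            return role
    return None  # e.g. JVM, RMI or JMX threads that the reply does not cover
```

Threads such as `GC Slave`, `RMI TCP Accept-1972` or `Finalizer thread` return `None` here because the reply does not describe them.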
TCP Extend (DefaultCacheServer rejects connections)
Hi guys
I have been trying to set up TCP Extend so that a Linux box can use a cache configured on a Windows box, but the DefaultCacheServer rejects the TCP connections. The config files I'm using are attached. Can anyone help?
The DefaultCacheServer comes up nicely
SafeCluster: Name=n/a
Group{Address=224.3.2.0, Port=32367, TTL=1}
MasterMemberSet
ThisMember=Member(Id=1, Timestamp=2007-03-29 16:07:16.026, Address=147.114.162.160:54321, MachineId=17312)
OldestMember=Member(Id=1, Timestamp=2007-03-29 16:07:16.026, Address=147.114.162.160:54321, MachineId=17312)
ActualMemberSet=MemberSet(Size=1, BitSetCount=2
Member(Id=1, Timestamp=2007-03-29 16:07:16.026, Address=147.114.162.160:54321, MachineId=17312)
RecycleMillis=120000
RecycleSet=MemberSet(Size=0, BitSetCount=0
Services
TcpRing{TcpSocketAccepter{State=STATE_OPEN, ServerSocket=147.114.162.160:54321}, Connections=[]}
ClusterService{Name=Cluster, State=(SERVICE_STARTED, STATE_JOINED), Id=0, Version=3.2, OldestMemberId=1}
DistributedCache{Name=DistributedCache, State=(SERVICE_STARTED), Id=1, Version=3.2, OldestMemberId=1, LocalStorage=enabled, PartitionCount=257, BackupCount=1, AssignedPartitions=257, BackupPartitions=0}
but when I run the client, I get this
2007-03-29 16:09:42.698 Tangosol Coherence DGE 3.2/367 <D4> (thread=TcpRingListener, member=1): Rejecting connection to member 649 using TcpSocket{State=STATE_OPEN, Socket=Socket[addr=/172.26.102.115,port=36952,localport=54321]}
Attachment: cluster-side-config.xml (*To use this attachment you will need to rename 516.bin to cluster-side-config.xml after the download is complete.)
Attachment: client-side-config.xml (*To use this attachment you will need to rename 517.bin to client-side-config.xml after the download is complete.)
Hi pandeyv,
You need to configure an instance of the ProxyService in your cluster-side cache configuration file. Coherence*Extend clients connect to the ProxyService over TCP/IP and not the TcpRingService. The TcpRingService is only used by cluster members for death detection.
See the following for instructions on configuring the cluster and client-side configuration files:
http://wiki.tangosol.com/display/COH32UG/Configuring+and+Using+Coherence*Extend
Additionally, I noticed that you are using an old release of Coherence 3.2. Please upgrade to the latest 3.2 service pack (3.2.2):
http://www.tangosol.com/product-downloads.jsp
Regards,
Jason -
Startup timeout and packet-delivery timeout
Hi,
At the moment it takes my first cluster node approximately 30 seconds to start and setting the packet-delivery timeout smaller than this means the system cannot start. I'm trying to reduce the packet-delivery setting to improve responsiveness during failover caused by hardware failures. I think 15-20 seconds would be ideal, allowing for GC pauses.
Subsequent nodes can start in a couple of seconds.
Is this a reasonable time to expect the first Coherence node in a cluster to start up? What kind of values is everyone else working with?
Thanks & Regards,
Martin
The error when packet-delivery is less than 30 seconds:
2009-11-24 13:04:05.568/24.141 Oracle Coherence GE 3.4/405 <Error> (thread=main, member=n/a): Error while starting cluster: com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(Id=0, Name=Cluster, Type=Cluster
MemberSet=ServiceMemberSet(
OldestMember=n/a
ActualMemberSet=MemberSet(Size=0, BitSetCount=0
MemberId/ServiceVersion/ServiceJoined/ServiceLeaving
at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onStartupTimeout(Grid.CDB:6)
at com.tangosol.coherence.component.util.daemon.queueProcessor.Service.start(Service.CDB:27)
at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.start(Grid.CDB:38)
at com.tangosol.coherence.component.net.Cluster.onStart(Cluster.CDB:317)
at com.tangosol.coherence.component.net.Cluster.start(Cluster.CDB:11)
at com.tangosol.coherence.component.util.SafeCluster.startCluster(SafeCluster.CDB:3)
at com.tangosol.coherence.component.util.SafeCluster.restartCluster(SafeCluster.CDB:7)
at com.tangosol.coherence.component.util.SafeCluster.ensureRunningCluster(SafeCluster.CDB:27)
at com.tangosol.coherence.component.util.SafeCluster.start(SafeCluster.CDB:2)
at com.tangosol.net.CacheFactory.ensureCluster(CacheFactory.java:951)
at com.tangosol.net.DefaultConfigurableCacheFactory.ensureService(DefaultConfigurableCacheFactory.java:748)
at com.tangosol.net.DefaultConfigurableCacheFactory.ensureCache(DefaultConfigurableCacheFactory.java:710)
at com.tangosol.net.DefaultConfigurableCacheFactory.configureCache(DefaultConfigurableCacheFactory.java:919)
at com.tangosol.net.DefaultConfigurableCacheFactory.ensureCache(DefaultConfigurableCacheFactory.java:277)
at com.tangosol.net.CacheFactory.getCache(CacheFactory.java:689)
at com.tangosol.net.CacheFactory.getCache(CacheFactory.java:667)
at com.changingworlds.datagrid.cache.Cache.init(Cache.java:111)
at com.changingworlds.datagrid.cache.Cache.initializeNamedCache(Cache.java:95)
at com.changingworlds.discovery.DiscoveryMain.initSpring(DiscoveryMain.java:201)
at com.changingworlds.discovery.DiscoveryMain.createDiscovery(DiscoveryMain.java:154)
at com.changingworlds.discovery.DiscoveryMain.main(DiscoveryMain.java:78)
Edited by: MartinMc on Nov 24, 2009 1:15 PM
Hi Martin,
The packet delivery timeout relates to death detection, not to cluster formation. You'll want to have a look at the join-timeout-milliseconds specified within the multicast-listener element (see http://coherence.oracle.com/display/COH35UG/multicast-listener) if you wish to change the amount of time it takes to form a new cluster. The reason for this timeout is to prevent a new node from accidentally forming a secondary cluster if the existing cluster members are temporarily unreachable while the new node starts. Assuming you are running more than just a few nodes in your cluster, you should be fine lowering this value to 5-10s.
I don't however see how this relates to failover unless by failover you mean starting an entirely new cluster after the complete loss of the formerly running cluster.
thanks,
Mark
Oracle Coherence -
Thread STUCK for more than 10minutes at com.tangosol.util.SegmentedHashMap
<[STUCK] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)' has been busy for "707" seconds working on the request "weblogic.work.SelfTuningWorkManagerImpl$WorkAdapterImpl@11ddce1", which is more than the configured time (StuckThreadMaxTime) of "600" seconds. Stack trace:
java.lang.Object.wait(Native Method)
com.tangosol.util.SegmentedHashMap.contendForSegment(SegmentedHashMap.java:1391)
com.tangosol.util.SegmentedHashMap.lockSegment(SegmentedHashMap.java:1301)
com.tangosol.util.SegmentedHashMap.lockBucket(SegmentedHashMap.java:1266)
com.tangosol.util.SegmentedHashMap.invokeOnKey(SegmentedHashMap.java:1058)
com.tangosol.util.SegmentedConcurrentMap.lock(SegmentedConcurrentMap.java:197)
com.tangosol.net.cache.CachingMap.get(CachingMap.java:462)
Hi,
A 2019 ms delay could be caused by a number of things. It could be network related or, more likely, as the message says, the WebLogic node could have been doing a GC. As your WebLogic node is a cluster member, you need to make sure that its GC is properly tuned to avoid long GC pauses, as these can destabilize the whole cluster. I see you are using 3.5.3/465p8, which is an older version of Coherence and lacks the newer node death detection algorithms. I know from the project I work on, which also uses 3.5.3/465p8, that long GCs on cluster members can cause other nodes in the cluster to die if the GC pauses are too long or too frequent.
I doubt that this message is related to the stuck thread issue in your original post, though it is hard to tell for certain.
JK -
Hi
I am getting following error while uploading master data.
Master data (dealt by table level) has errors
Detected duplicate member ID 'Agrippal'
Dimension member DNA/PLG Prime + GP14 is an invalid member ID
Dimension member DTaP/alum vaccine (p is an invalid member ID
Dimension member DTwP + Hib full liqu is an invalid member ID
Dimension member DTwP + Hib lyophilis is an invalid member ID
Dimension member Flu + MF59 + CpG is an invalid member ID
Dimension member Fluvirin + MF59 is an invalid member ID
Detected duplicate member ID 'H. pylori'
Dimension member HBV/MF59 is an invalid member ID
Dimension member Hepatitis C (E1E2HCV is an invalid member ID
Dimension member Laminarin-CRM197 + a is an invalid member ID
Dimension member MenB (New Zealand st is an invalid member ID
Dimension member MenB (Norwegian stra is an invalid member ID
Dimension member MenB Engineered (Chi is an invalid member ID
Dimension member MenC lyophilised + M is an invalid member ID
Detected duplicate member ID 'Rabies vaccine'
Dimension member Rhein-Biotech DTaP/H is an invalid member ID
Dimension member Staph Aureus (Staphy is an invalid member ID
Detected duplicate member ID 'Td-IPV'
Dimension member V Quinvaxem (DTwP) p is an invalid member ID
Dimension member VLP-H5; Virus-Like P is an invalid member ID
Dimension member Vi - CRM (NVGH Vacci is an invalid member ID
Error in Admin module
Record count: 80
Accept count: 0
I tried a couple of options, writing JavaScript in the internal column of the conversion file, without success. Could anybody please guide me on how to fix this issue?
Thanks in advance
regards
mahi
Hi,
I don't think that it is an OSS Note that you require to implement as it appears the source data has special characters in?
What does the source data look like?
BPC has restrictions on the IDs that can be imported into BPC, and a useful OSS note explains some of these restrictions:
SAP Note 1448836 - https://websmp130.sap-ag.de/sap(bD1lbiZjPTAwMQ==)/bc/bsp/spn/sapnotes/index2.htm?numm=1448836
In a conversion file you can set the EXTERNAL column to ' * ' so that it selects all members, then use a js: JavaScript expression to remove the special characters. You'll need to do a bit of research into exactly how you'd set it up, but the SAP help file has some guidance here:
http://help.sap.com/saphelp_bpc70/helpdata/en/81/94a8a5febd40268d5c59b4fc31be37/content.htm
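Before touching the conversion file, it can help to check the source IDs offline for the two error types in the log (duplicates and invalid IDs). The sketch below is not BPC code. The 20-character limit is inferred from the IDs in the error log, which all appear cut at 20 characters, and the allowed-character pattern is only a placeholder; the real rules are in the SAP note mentioned above.

```python
import re
from collections import Counter

MAX_LEN = 20  # assumption: inferred from the 20-character cut-off visible in the log
ALLOWED = re.compile(r"^[A-Za-z0-9_.]+$")  # assumption: placeholder, see SAP Note 1448836

def check_member_ids(ids):
    """Return (duplicates, invalid) for a list of candidate member IDs."""
    truncated = [member_id[:MAX_LEN] for member_id in ids]
    counts = Counter(truncated)
    # IDs that collide, possibly only after truncation
    duplicates = sorted(key for key, n in counts.items() if n > 1)
    # IDs containing characters outside the assumed allowed set
    invalid = sorted({t for t in truncated if not ALLOWED.match(t)})
    return duplicates, invalid

duplicates, invalid = check_member_ids(
    ["Agrippal", "Agrippal", "HBV/MF59", "Fluvirin + MF59", "MenC_1"]
)
```

If two long descriptions share their first 20 characters, they show up as duplicates here, which may be one explanation for the "Detected duplicate member ID" lines.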
Hope it helps,
Nick -
Hi Friends,
I am getting the following exception in logical standby database at the time of Sql Apply.
After running the command alter database start logical standby apply, the SQL Apply services start, but after a few seconds they stop automatically with the following exception.
alter database start logical standby apply
Tue May 17 06:42:00 2011
No optional part
Attempt to start background Logical Standby process
LOGSTDBY Parameter: MAX_SERVERS = 20
LOGSTDBY Parameter: MAX_SGA = 100
LOGSTDBY Parameter: APPLY_SERVERS = 10
LSP0 started with pid=30, OS id=4988
Tue May 17 06:42:00 2011
Completed: alter database start logical standby apply
Tue May 17 06:42:00 2011
LOGSTDBY status: ORA-16111: log mining and apply setting up
Tue May 17 06:42:00 2011
LOGMINER: Parameters summary for session# = 1
LOGMINER: Number of processes = 4, Transaction Chunk Size = 201
LOGMINER: Memory Size = 100M, Checkpoint interval = 500M
Tue May 17 06:42:00 2011
LOGMINER: krvxpsr summary for session# = 1
LOGMINER: StartScn: 0 (0x0000.00000000)
LOGMINER: EndScn: 0 (0x0000.00000000)
LOGMINER: HighConsumedScn: 2660033 (0x0000.002896c1)
LOGMINER: session_flag 0x1
LOGMINER: session# = 1, preparer process P002 started with pid=35 OS id=4244
LOGSTDBY Apply process P014 started with pid=47 OS id=5456
LOGSTDBY Apply process P010 started with pid=43 OS id=6484
LOGMINER: session# = 1, reader process P000 started with pid=33 OS id=4732
Tue May 17 06:42:01 2011
LOGMINER: Begin mining logfile for session 1 thread 1 sequence 1417, X:\TANVI\ARCHIVE2\ARC01417_0748170313.001
Tue May 17 06:42:01 2011
LOGMINER: Turning ON Log Auto Delete
Tue May 17 06:42:01 2011
LOGMINER: End mining logfile: X:\TANVI\ARCHIVE2\ARC01417_0748170313.001
Tue May 17 06:42:01 2011
LOGMINER: Begin mining logfile for session 1 thread 1 sequence 1418, X:\TANVI\ARCHIVE2\ARC01418_0748170313.001
LOGSTDBY Apply process P008 started with pid=41 OS id=4740
LOGSTDBY Apply process P013 started with pid=46 OS id=7864
LOGSTDBY Apply process P006 started with pid=39 OS id=5500
LOGMINER: session# = 1, builder process P001 started with pid=34 OS id=4796
Tue May 17 06:42:02 2011
LOGMINER: skipped redo. Thread 1, RBA 0x00058a.00000950.0010, nCV 6
LOGMINER: op 4.1 (Control File)
Tue May 17 06:42:02 2011
LOGMINER: End mining logfile: X:\TANVI\ARCHIVE2\ARC01418_0748170313.001
Tue May 17 06:42:03 2011
LOGMINER: Begin mining logfile for session 1 thread 1 sequence 1419, X:\TANVI\ARCHIVE2\ARC01419_0748170313.001
Tue May 17 06:42:03 2011
LOGMINER: End mining logfile: X:\TANVI\ARCHIVE2\ARC01419_0748170313.001
Tue May 17 06:42:03 2011
LOGMINER: Begin mining logfile for session 1 thread 1 sequence 1420, X:\TANVI\ARCHIVE2\ARC01420_0748170313.001
Tue May 17 06:42:03 2011
LOGMINER: End mining logfile: X:\TANVI\ARCHIVE2\ARC01420_0748170313.001
Tue May 17 06:42:03 2011
LOGMINER: Begin mining logfile for session 1 thread 1 sequence 1421, X:\TANVI\ARCHIVE2\ARC01421_0748170313.001
LOGSTDBY Analyzer process P004 started with pid=37 OS id=5096
Tue May 17 06:42:03 2011
LOGMINER: End mining logfile: X:\TANVI\ARCHIVE2\ARC01421_0748170313.001
LOGSTDBY Apply process P007 started with pid=40 OS id=2760
Tue May 17 06:42:03 2011
Errors in file x:\oracle\product\10.2.0\admin\tanvi\bdump\tanvi_p001_4796.trc:
ORA-00600: internal error code, arguments: [krvxbpx20], [1], [1418], [2380], [16], [], [], []
LOGSTDBY Apply process P012 started with pid=45 OS id=7152
Tue May 17 06:42:03 2011
LOGMINER: Begin mining logfile for session 1 thread 1 sequence 1422, X:\TANVI\ARCHIVE2\ARC01422_0748170313.001
Tue May 17 06:42:03 2011
LOGMINER: End mining logfile: X:\TANVI\ARCHIVE2\ARC01422_0748170313.001
Tue May 17 06:42:03 2011
LOGMINER: Begin mining logfile for session 1 thread 1 sequence 1423, X:\TANVI\ARCHIVE2\ARC01423_0748170313.001
Tue May 17 06:42:03 2011
LOGMINER: End mining logfile: X:\TANVI\ARCHIVE2\ARC01423_0748170313.001
Tue May 17 06:42:03 2011
LOGMINER: Begin mining logfile for session 1 thread 1 sequence 1424, X:\TANVI\ARCHIVE2\ARC01424_0748170313.001
LOGMINER: session# = 1, preparer process P003 started with pid=36 OS id=5468
Tue May 17 06:42:03 2011
LOGMINER: End mining logfile: X:\TANVI\ARCHIVE2\ARC01424_0748170313.001
Tue May 17 06:42:04 2011
LOGMINER: Begin mining logfile for session 1 thread 1 sequence 1425, X:\TANVI\ARCHIVE2\ARC01425_0748170313.001
LOGSTDBY Apply process P011 started with pid=44 OS id=6816
LOGSTDBY Apply process P005 started with pid=38 OS id=5792
LOGSTDBY Apply process P009 started with pid=42 OS id=752
Tue May 17 06:42:05 2011
krvxerpt: Errors detected in process 34, role builder.
Tue May 17 06:42:05 2011
krvxmrs: Leaving by exception: 600
Tue May 17 06:42:05 2011
Errors in file x:\oracle\product\10.2.0\admin\tanvi\bdump\tanvi_p001_4796.trc:
ORA-00600: internal error code, arguments: [krvxbpx20], [1], [1418], [2380], [16], [], [], []
LOGSTDBY status: ORA-00600: internal error code, arguments: [krvxbpx20], [1], [1418], [2380], [16], [], [], []
Tue May 17 06:42:06 2011
Errors in file x:\oracle\product\10.2.0\admin\tanvi\bdump\tanvi_lsp0_4988.trc:
ORA-12801: error signaled in parallel query server P001
ORA-00600: internal error code, arguments: [krvxbpx20], [1], [1418], [2380], [16], [], [], []
Tue May 17 06:42:06 2011
LogMiner process death detected
Tue May 17 06:42:06 2011
logminer process death detected, exiting logical standby
LOGSTDBY Analyzer process P004 pid=37 OS id=5096 stopped
LOGSTDBY Apply process P010 pid=43 OS id=6484 stopped
LOGSTDBY Apply process P008 pid=41 OS id=4740 stopped
LOGSTDBY Apply process P012 pid=45 OS id=7152 stopped
LOGSTDBY Apply process P014 pid=47 OS id=5456 stopped
LOGSTDBY Apply process P005 pid=38 OS id=5792 stopped
LOGSTDBY Apply process P006 pid=39 OS id=5500 stopped
LOGSTDBY Apply process P007 pid=40 OS id=2760 stopped
LOGSTDBY Apply process P011 pid=44 OS id=6816 stopped
Tue May 17 06:42:10 2011
Errors in file x:\oracle\product\10.2.0\admin\tanvi\bdump\tanvi_p001_4796.trc:
ORA-00600: internal error code, arguments: [krvxbpx20], [1], [1418], [2380], [16], [], [], []
Submit an SR to Oracle Support.
Refer to these too:
ORA-600/ORA-7445 Error Look-up Tool [ID 153788.1]
Bug 6022014: ORA-600 [KRVXBPX20] ON LOGICAL STANDBY -
hi
Please help me fix this problem.
I have a primary Oracle database, version 9.2.0.1.0.
I had configured a logical standby database on a separate computer and everything was going well, but today I noticed that the archived logs haven't been applied, and DBA_LOGSTDBY_EVENTS shows ORA-16111: log mining and apply setting up, without any progress.
In the alert log file I found these errors:
Errors in file c:\oracle\admin\bmdbsb\udump\bmdbsb_p004_2796.trc:
ORA-00600: internal error code, arguments: [ksmovrflow], [knahs:ddl_string], [], [], [], [], [], []
Wed Jun 18 13:25:29 2003
Errors in file c:\oracle\admin\bmdbsb\bdump\bmdbsb_lsp0_2212.trc:
ORA-12805: parallel query server died unexpectedly
Wed Jun 18 13:25:29 2003
logminer process death detected, exiting logical standby
what can I do?
thanks
Hi,
create pfile from spfile;
shutdown immediate;
edit the changes now in pfile
startup pfile='/////' open;
The spfile is a binary file, so it cannot be edited directly; make the changes in the pfile instead. In 10g you have to create the spfile yourself; the instance runs with the pfile until you do.
regards,
Nirmal -
Failover Cluster testing - not working
Hello
We are trying to perform failover testing on OC4J clusters. We have two nodes clustered. On each node we have installed SOA suite - four OC4J instances (home, oc4J_soa, oc4j_wsm, and oc4j_esbdt).
This is how we are performing the test:
We have deployed servlet on both oc4j_soa with different application names but with same context-root. We have observed following behavior.
1) In mod_oc4j.conf file , we have "Oc4jSelectMethod roundrobin:local".
So when we hit http://<loadbalancer>/<context-root-for-servlet> it works fine in roundrobin fashion. We have set the HTML page title as node name in servlets.
And so we see Node 1 title one time and Node 2 title the other time.
2) Now if I undeploy the servlet from Node 1 and hit http://<loadbalancer>/<context-root-for-servlet>, it fails one time (gives an HTTP 500 error) and works the next time.
i.e. it is still sending one request to Node 1 and the next request to Node 2.
It should not do this, right? If my application is down or not available on Node 1, all requests should go to Node 2.
3) Also, what I have observed is that when we hit the nodes' HTTP URLs directly instead of going through the load balancer, the requests crisscross.
ie. http://NODE 1/<context-root-for-servlet> - this one always goes to Node 2
and http://NODE 2/<context-root-for-servlet> - always goes to Node 1
This is something weird.
Does anybody have any idea, please? I am not sure why we are getting the unexpected behavior mentioned in 2) and 3).
Please let me know if you need anything about config details.
Thanks
/Mishit
2) I hope there is a hardware load balancer (LBR) front-ending this architecture. If so, hardware load balancers have intelligent death detection mechanisms: if Node 1 crashes, the LBR stops sending requests to the failed node until it is back online. So this is configured at the LBR rather than in mod_oc4j.conf.
3) If load balancing is configured correctly, I don't think you should be getting this issue.
To test load balancing you can do as below:
- Ensure Virtual host configuration is done in Apache of both nodes
- Ensure Virtual host entry is added to /etc/hosts
Like on Node 1 /etc/hosts
<Node 2 IP> <Virtual hostname>
Similarly on Node 2 /etc/hosts
<Node 1 IP> < Virtual hostname>
- Give Virtual URL to your LBR guy and he will do the settings in LBR
- Bring down node 1 keeping node 2 alive & check below:
http://<virtualURL>/contextroot
http://<node2>/contextroot
- Similarly bring down node 2 keeping node 1 alive & check below:
http://<virtualURL>/contextroot
http://<node1>/contextroot -
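The behavior expected in point 2) of the reply above (round-robin that skips a node once it is detected as down) can be sketched in a few lines. This is a toy simulation, not mod_oc4j or any load balancer's actual logic; `route_requests` and the `is_alive` callback are hypothetical names.

```python
from itertools import cycle

def route_requests(nodes, is_alive, n_requests):
    """Round-robin over nodes, skipping any node the health check reports as down."""
    ring = cycle(nodes)
    routed = []
    for _ in range(n_requests):
        # Try each node at most once per request, starting where the ring left off.
        for _ in range(len(nodes)):
            node = next(ring)
            if is_alive(node):
                routed.append(node)
                break
        else:
            routed.append(None)  # no healthy node: the request fails
    return routed

all_up = route_requests(["node1", "node2"], lambda n: True, 4)
node1_down = route_requests(["node1", "node2"], lambda n: n != "node1", 3)
```

Without the health check, every second request would still go to the node where the application was undeployed, which matches the alternating HTTP 500 described in the question.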
Error OLAP processing one application - SAP BPC 7.0 MS
Hi everybody,
This is our problem: when I try to process my application called "Consolidation", it always fails with the following error message:
"Cube process: Errors in the OLAP storage engine. The Attribute key cannot be found: Table dbo_tblFactConsolidation, column ACCOUNT, value: E1500."
We tried to launch a full process of the "ACCOUNT" dimension, but it also fails.
Could anybody tell me what is happening?
Thanks for your help.
Juan,
Here is the solution for this issue.
<Reason>
-. Your fact table has an invalid record. In your case, it is 'E1500'.
-. That is, the Account dimension doesn't have an E1500 member, but your fact table has records for it.
Therefore, Analysis Services can't process it.
-. This usually happens when a user loads data from an outside source into BPC without validation.
-. If you are using the 'make dimension package' in 5.1 SP8, it might be a bug. I saw this issue in that version, but the error doesn't happen when you create the dimension from an Excel worksheet through the BPC admin console.
<Solution>
-. Select the records in the fact tables that have 'E1500' in the account column.
-. Delete them.
<Note>
- Because of how MS Analysis Services detects errors, invalid members are reported one at a time.
Therefore, you may need to repeat this step again and again.
- One way to avoid this is to find the invalid members between the fact and dim tables using a join query.
Here is a sample SQL query:
select a.ID, b.account from mbrAccount as a right join tblFactFinance as b
on a.ID = b.ACCOUNT where a.ID is null
It will show all invalid member IDs in the account column of the fact table, so that you can figure out which accounts are wrong. It can save you a lot of time by avoiding processing the cube again and again.
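The idea behind the sample query can be tried on a toy dataset. The sketch below uses SQLite instead of SQL Server and rewrites the right join as an equivalent left join from the fact side; the table names are taken from the sample query, while the `AMOUNT` column and the rows are invented purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy data: only the table names come from the sample query above;
# the AMOUNT column and all rows are made up for this sketch.
conn.executescript("""
    CREATE TABLE mbrAccount (ID TEXT PRIMARY KEY);
    CREATE TABLE tblFactFinance (ACCOUNT TEXT, AMOUNT REAL);
    INSERT INTO mbrAccount VALUES ('E1000'), ('E1200');
    INSERT INTO tblFactFinance VALUES ('E1000', 10.0), ('E1500', 5.0), ('E1500', 7.0);
""")

def find_orphan_accounts(connection):
    """Return fact-table accounts that have no matching dimension member."""
    rows = connection.execute("""
        SELECT DISTINCT f.ACCOUNT
        FROM tblFactFinance AS f
        LEFT JOIN mbrAccount AS m ON m.ID = f.ACCOUNT
        WHERE m.ID IS NULL
        ORDER BY f.ACCOUNT
    """).fetchall()
    return [row[0] for row in rows]

orphans = find_orphan_accounts(conn)
```

The `DISTINCT` keeps each invalid account to one row, so all problem members can be reviewed in a single pass rather than one failed cube process at a time.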
I hope it will help you
James Lim -
LogMiner/capture error because of corruption of redo log block
Hi,
We all know that the capture process reads the redo entries from redo log files or archived log files. Therefore we need to have the DB in ARCHIVELOG mode.
In alert log file, I found error saying :
Creating archive destination LOG_ARCHIVE_DEST_1: 'E:\ORACLE\ORADATA\REPF\ARCHIVE\LOCATION01\1_36.ARC'
ARC0: Log corruption near block 66922 change 0 time ?
ARC0: All Archive destinations made inactive due to error 354
Fri Apr 04 12:57:44 2003
Errors in file e:\oracle\admin\repf\bdump\trishul_arc0_1724.trc:
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 66922 change 0 time 04/04/2003 11:05:40
ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\ARCHIVE\REDO02.LOG'
ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\REDO02.LOG'
As a normal practice, we do multiplex the redo log files at different locations, but even the second copy of the redo log is of no use for recovering it. This explains why the redo log could not be archived: it can't be read. The same is true for the LogMiner process; it could not read the redo log file and it failed. Now, we have a way to recover from this situation (as far as the DB is concerned, not Streams replication), since the shutdown after this error was IMMEDIATE, causing a checkpoint, so rollback/rollforward is not required during system startup (no instance recovery). We can put the DB in NOARCHIVELOG mode, drop that particular group, create a new one, and turn the DB back to ARCHIVELOG mode. This will certainly serve the purpose as far as consistency of the DB is concerned.
Here is the catch for Streams replication. The redo log that got corrupted must contain a few transactions that were not archived, each with a corresponding SCN. Now, the capture process reads the information sequentially in SCN order. A few transactions are now missing, and the capture process can't jump to the next SCN, skipping a few SCNs in between. So we have to re-instantiate the objects on the other system, which has no errors, and start working on it. My worry is what will happen to those missed transactions on the other database. It's an absolute loss of data. In development I can manage that. But in a real production environment, this is a critical situation. How can I recover from this situation and get back the corrupted information from the redo log?
I have not dropped any of the log groups yet, because I would like to recover from this situation without LOSS of data.
Thanx, & regards,
Kamlesh Chaudhary
Content of trace files :
Dump file e:\oracle\admin\repf\bdump\trishul_arc0_1724.trc
Fri Apr 04 12:57:31 2003
ORACLE V9.2.0.2.1 - Production vsnsta=0
vsnsql=12 vsnxtr=3
Windows 2000 Version 5.0 Service Pack 2, CPU type 586
Oracle9i Enterprise Edition Release 9.2.0.2.1 - Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.2.0 - Production
Windows 2000 Version 5.0 Service Pack 2, CPU type 586
Instance name: trishul
Redo thread mounted by this instance: 1
Oracle process number: 16
Windows thread id: 1724, image: ORACLE.EXE
*** SESSION ID:(13.1) 2003-04-04 12:57:31.000
- Created archivelog as 'E:\ORACLE\ORADATA\REPF\ARCHIVE\LOCATION02\1_36.ARC'
- Created archivelog as 'E:\ORACLE\ORADATA\REPF\ARCHIVE\LOCATION01\1_36.ARC'
*** 2003-04-04 12:57:44.000
ARC0: All Archive destinations made inactive due to error 354
*** 2003-04-04 12:57:44.000
kcrrfail: dest:2 err:354 force:0
*** 2003-04-04 12:57:44.000
kcrrfail: dest:1 err:354 force:0
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 66922 change 0 time 04/04/2003 11:05:40
ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\ARCHIVE\REDO02.LOG'
ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\REDO02.LOG'
*** 2003-04-04 12:57:44.000
ARC0: Archiving not possible: error count exceeded
ORA-16038: log 2 sequence# 36 cannot be archived
ORA-00354: corrupt redo log block header
ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\REDO02.LOG'
ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\ARCHIVE\REDO02.LOG'
ORA-16014: log 2 sequence# 36 not archived, no available destinations
ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\REDO02.LOG'
ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\ARCHIVE\REDO02.LOG'
ORA-16014: log 2 sequence# 36 not archived, no available destinations
ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\REDO02.LOG'
ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\ARCHIVE\REDO02.LOG'
ORA-16014: log 2 sequence# 36 not archived, no available destinations
ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\REDO02.LOG'
ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\ARCHIVE\REDO02.LOG'
ORA-16014: log 2 sequence# 36 not archived, no available destinations
ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\REDO02.LOG'
ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\ARCHIVE\REDO02.LOG'
ORA-16014: log 2 sequence# 36 not archived, no available destinations
ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\REDO02.LOG'
ORA-00312: online log 2 thread 1: 'E:\ORACLE\ORADATA\REPF\ARCHIVE\REDO02.LOG
Dump file e:\oracle\admin\repf\udump\trishul_cp01_2048.trc
Fri Apr 04 12:57:27 2003
ORACLE V9.2.0.2.1 - Production vsnsta=0
vsnsql=12 vsnxtr=3
Windows 2000 Version 5.0 Service Pack 2, CPU type 586
Oracle9i Enterprise Edition Release 9.2.0.2.1 - Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.2.0 - Production
Windows 2000 Version 5.0 Service Pack 2, CPU type 586
Instance name: trishul
Redo thread mounted by this instance: 1
Oracle process number: 30
Windows thread id: 2048, image: ORACLE.EXE (CP01)
*** 2003-04-04 12:57:28.000
*** SESSION ID:(27.42) 2003-04-04 12:57:27.000
TLCR process death detected. Shutting down TLCR
error 1280 in STREAMS process
ORA-01280: Fatal LogMiner Error.
OPIRIP: Uncaught error 447. Error stack:
ORA-00447: fatal error in background process
ORA-01280: Fatal LogMiner Error
**********************
I have a similar problem. I am using a Streams environment and got these
"ORA-00353: log corruption near block" errors in the alert.log file
while capturing changes on the primary database, and the capture
process aborted after that.
Were those transactions lost, or were they captured and sent to the target database
after I started the capture process again?
Has anyone solved this problem?
Can you help me with it? -
11i load balancing web nodes without use of Hardware http load balancer
I am looking at note 217368.1 (Advanced Configurations and Topologies for Enterprise Deployments of E-Business Suite 11i) and some other notes on load balancing but some aspects are not clear.
Aim is to implement load balancing traffic to web nodes without using Hardware ( BigIP, cisco etc) for HTTP layer load balancing.
Which is preferable: DNS or the Apache JServ load balancer?
Need details like failover capabilities, death detection of node, functionality testing and ways to monitor Apache Jserv load balancer.
Any help in this regard is welcome .
thx
arun
Oracle recommends using load-balancing hardware rather than using DNS. If you want the features you mention above, you will need a hardware load balancer.
http://blogs.oracle.com/stevenChan/2006/06/indepth_loadbalancing_ebusines.html
http://blogs.oracle.com/stevenChan/2009/01/using_cisco_ace_series_hardware_load-balancers_ebs12.html
HTH
Srini