Cluster shutdown process?

Is there a way to let Coherence know that we are going to shut down the whole cluster, so that it doesn't bother transferring distributed cache data to the other storage nodes as we shut nodes down one by one?
We have a cluster with multiple storage-enabled nodes.
We are calling CacheFactory.shutdown() on each node during the cluster shutdown process. It turns out that Coherence tries to move data to the other storage nodes as we shut the storage nodes down one by one, which makes the whole shutdown process take a lot longer, especially once the cluster already contains a large amount of data.
Is there a way to tell Coherence that the whole cluster is being torn down, so it doesn't bother moving data around? Please be aware that issuing a kill command to all nodes is not an option for us.

There is no Coherence API to shut down an entire cluster or prevent the data transfer as nodes are shut down. You can use the InvocationService to run an Invocable that simply does a System.exit(), which will effectively force the cluster down immediately.
--David
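
For illustration, here is a minimal sketch of that approach, assuming an InvocationService named "InvocationService" is defined in the cache configuration and started on every member; the class name KillNodeInvocable is made up for this example:

    import com.tangosol.net.AbstractInvocable;
    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.InvocationService;

    // Invocable that terminates whichever cluster member it runs on.
    public class KillNodeInvocable extends AbstractInvocable {
        public void run() {
            System.exit(0); // kill this JVM without waiting for any partition transfer
        }

        // Run from any member (or a storage-disabled client) to force the cluster down.
        public static void main(String[] args) {
            InvocationService svc =
                (InvocationService) CacheFactory.getService("InvocationService");
            // A null member set means "all members running this service"; execute() is
            // asynchronous, so the caller is not left waiting on the other members.
            svc.execute(new KillNodeInvocable(), null, null);
        }
    }

If the member issuing the call also runs the invocation service, it will receive the Invocable too; depending on timing you may prefer to exclude the local member from the member set and exit it last.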

Similar Messages

  • Disable PartitionListeners on service prior to cluster shutdown

    We're running a Coherence cluster of 70 storage nodes and a further 10 storage-disabled tcp-extend nodes, all running DefaultCacheServer.
    I have a PartitionListener configured on a distributed cache service which sends an email to a distribution list in the event of a 'PARTITION_LOST' event. However, currently when the storage-enabled processes are stopped (by killing the processes), we get spurious emails with partition losses due to the cluster being shut down.
    Is there a tidy, built in way to disable the partition listeners across a cluster before the storage nodes are stopped, and avoid sending these spurious mails? Note that the mails are not sent by the process that is currently stopping, but by one of the other processes when the PartitionEvent is detected.
    I've tried using Cluster.shutdown(), but this does not stop the listeners from being called. I also tried using Cluster.isRunning() in the listener to determine whether to send the error email, but it appears the cluster gets restarted (and isRunning() returns true again) if CacheFactory.getCache() is called after Cluster.shutdown() has been called, so this is not a safe mechanism either.
    My current solution is to register each partition listener as a JMX bean on the management service (Registry) as it's instantiated, and expose a disable() method. I then run a standalone java class which joins the cluster and iterates over those registered beans calling disable on each one, prior to stopping the storage processes.
    The above solution works, but am I missing a built-in way of doing this? I'd rather use an out of the box solution if there's one that fits the bill.
    Thanks in advance,
    Giles Taylor

    Hi Giles,
    I don't think that there is a feature for this particular purpose, as for Coherence the death of a node is an ordinary event; it does not have the concept of "I am going to kill all nodes now".
    However, it is very easy to implement this on your own.
    Just have a special-purpose cache on which you register an event listener. The listener waits for a certain change in that cache (e.g. you put a specific value to a specific key to indicate that a full shutdown is about to commence) and then sets a static flag indicating that a full cluster shutdown has started, so no emails need to be sent on partition loss from that point on. Alternatively, you can do the same thing with an InvocationService (that invocation service needs to be started on all nodes), or you can put that special entry into a replicated cache (so checking it can be carried out locally).
    Then in the partition listener you just check the static flag or the replicated cache, depending on which approach you chose, and send the mail only if you have not indicated that a full shutdown is commencing.
    Then, when you want to do a full shutdown, you first carry out the action for your chosen approach, and only then start to kill nodes.
    BR,
    Robert
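
    A minimal sketch of Robert's first suggestion, assuming a control cache named "cluster-control" and a control key "shutdown" (both names invented for this example); the alerting logic in the partition listener is left as a placeholder:

    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.NamedCache;
    import com.tangosol.net.partition.PartitionEvent;
    import com.tangosol.net.partition.PartitionListener;
    import com.tangosol.util.AbstractMapListener;
    import com.tangosol.util.MapEvent;

    // Tracks the "planned full-cluster shutdown" flag on this node.
    public class ShutdownFlag {
        private static volatile boolean clusterShutdown;

        public static boolean isClusterShutdown() {
            return clusterShutdown;
        }

        // Call once per node at startup to start watching the control cache.
        public static void register() {
            NamedCache control = CacheFactory.getCache("cluster-control");
            control.addMapListener(new AbstractMapListener() {
                public void entryInserted(MapEvent evt) { check(evt); }
                public void entryUpdated(MapEvent evt)  { check(evt); }
                private void check(MapEvent evt) {
                    if ("shutdown".equals(evt.getKey())
                            && Boolean.TRUE.equals(evt.getNewValue())) {
                        clusterShutdown = true;
                    }
                }
            });
        }
    }

    // Partition listener that stays quiet during a planned full-cluster shutdown.
    class SilenceablePartitionListener implements PartitionListener {
        public void onPartitionEvent(PartitionEvent evt) {
            if (evt.getId() == PartitionEvent.PARTITION_LOST
                    && !ShutdownFlag.isClusterShutdown()) {
                // send the alert e-mail here
            }
        }
    }

    Before stopping the storage processes you would then run, from any member: CacheFactory.getCache("cluster-control").put("shutdown", Boolean.TRUE);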

  • Close/shutdown the Sun Cluster Package/resource Group

    Hi,
    I have a SUN cluster system.
    I want to know what the script does when Sun Cluster shuts down the package "app-gcota-rg", as I may need to modify it. Where can I find this information on the system? In which directory and log file? Any suggestions?
    Resource Groups --
    Group Name Node Name State
    Group: ora_gcota_rg ytgcota-1 Online
    Group: ora_gcota_rg ytgcota-2 Offline
    Group: app-gcota-rg ytgcota-1 Online
    Group: app-gcota-rg ytgcota-2 Offline

    Hi,
    you would first find out which resources belong to app-gcota-rg.
    Do a "clrs list -g app-gcota-rg". Then find out which of the resource is the one dealing with your application. Then try to find out its resource type:
    "clrs show -v <resource name>| fgrep Type". If it is a standard type like HA Oracle, it is an extremely bad idea to hack the scripts, as you'll lose support. If type is SUNWgds, the scripts to start, stop and monitor the application are user supplied. You can find their pathnames using:
    "clrs show -v <resource-name>| fgrep _command". This should display full pathnames.
    Regards
    Hartmut

  • Synchronization of memberLeaving event across the cluster

    Hi,
    Let's assume that we have 2 nodes and one of them is leaving the cluster (but will continue to operate alone). Both nodes use a MemberListener to react to the memberLeaving(MemberEvent) event. Moreover, in our case as part of the leaving event logic we need to send a lot of messages from one node to the other node. That is why we need to have both nodes be part of the cluster until the leaving event is done in all nodes.
    So my question is: will the node that is leaving wait to actually leave the cluster until all the other nodes (and the node itself) have finished processing the memberLeaving event?
    If the answer is no, I was thinking that the way to ensure a proper clean-up would be to do all the clean-up logic on the node leaving the cluster, but that would imply bringing all the shared/cached data to the leaving node, and that wouldn't be a scalable operation.
    Ideas are welcome. :)
    Thanks,
    -- Gato

    Hi Gato,
    The memberLeaving event is an asynchronous notification raised on a node B when another node A issues a programmatic Service.shutdown() or Cluster.shutdown() call. There is no way for code running on B to block or delay the shutdown process on the node A.
    If you want to "negotiate" the conditions of a clean shutdown between cluster nodes, you would need to use a "satellite" InvocationService to communicate all necessary information and actions across the cluster before initiating the shutdown sequence.
    Regards,
    Gene
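
    For reference, a minimal sketch of such a MemberListener registration, assuming a distributed cache service named "DistributedCache" (the service name is an assumption); as Gene notes, the callback is informational only and cannot delay the departing node:

    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.MemberEvent;
    import com.tangosol.net.MemberListener;
    import com.tangosol.net.Service;

    public class LeavingLogger {
        public static void register() {
            Service service = CacheFactory.getService("DistributedCache");
            service.addMemberListener(new MemberListener() {
                public void memberJoined(MemberEvent evt) { }
                public void memberLeaving(MemberEvent evt) {
                    // Asynchronous notification only: nothing done here can block or
                    // delay Service.shutdown()/Cluster.shutdown() on the departing member.
                    System.out.println("Member leaving: " + evt.getMember());
                }
                public void memberLeft(MemberEvent evt) { }
            });
        }
    }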

  • Apache cluster down because of sharedaddress problem

    Hi,
    I had a problem in my 2-node + QS cluster: the Apache service resource went offline on both nodes. I could find a probe timeout for the Apache resource in /var/adm/messages. The shared address resource is online on node 1 and offline on node 2. The cluster was online and my Apache resource was working properly, and all of a sudden this problem arose and the Apache resource went offline on both nodes.
    Then I took the shared address resource offline. After offlining it, when I do ifconfig -a I can see the shared IP address showing up on interface lo0:1 on both nodes!
    I unplumbed lo0:1 on both nodes and brought the shared address resource online. It shows online in clrg status on node 1, and ifconfig -a showed bge3:1 with the shared address on node 1 and nothing (no shared IP) on node 2.
    After I shut down the cluster (cluster shutdown) and started up both nodes, it came up as normal. Apache is also online on both nodes.
    Any clue?

    Hi,
    If it was a non-existing problem, why was the Apache resource offline on both nodes? In a cluster, even if the Apache resource is offline for any reason, the resource must be running on the other node. In my case, I was not able to access my web page; the whole cluster was down. There was no configuration change, and nobody had even logged in to the servers. For sure there was something which made the Apache resource go down. It can't be Apache itself, because both nodes can't have the same problem at the same time. That's why I thought it might be a problem with the shared address...
    I couldn't even rectify the problem by restarting both the shared address resource group and the Apache resource group. All that saved me was the total shutdown of the cluster and a restart.
    Thanks
    Ushas Symon

  • Automate shutdown

    I have successfully installed Oracle 8.1.6 on Red Hat 6.1 and everything is working fine apart from the automatic database startup and shutdown. I can successfully automate database startup using the standard dbora script with the link /etc/rc.d/init.d/rc2.d/S99dbora, but /etc/rc.d/init.d/rc0.d/K10dbora is not run when the system shuts down. The /etc/rc.d/rc script does seem to pick up the K10dbora link, though. :(
    Has anyone else experienced this problem?

    WLST online scripting (http://www.togotutor.com):
    def connection_to_Admin():
        try:
            connect(username, password, URL)
        except wlst.WLSTException, ex:
            print "Caught exception while connecting: %s" % ex
        else:
            print "------- Connected successfully! ---------"

    connection_to_Admin()
    state(cluster_1, 'Cluster')     # state of the cluster
    shutdown(cluster_1, 'Cluster')  # shut down the cluster
    state(cluster_1, 'Cluster')     # shows that the servers are down
    disconnect()                    # disconnect from the admin server
    Thanks
    Togotutor

  • Putting cluster into maintenace for network disruption

    Hi all
    All our network switches will be rebooted in the next few days. They will be rebooted at the same time; a reboot takes, in the best case, 5 minutes, which means we will lose the interconnects and the public interfaces at the same time in our clusters. I know we should reboot the redundant switches one after the other, but this has been planned by other people with other priorities.....
    What is the best method to put the clusters into a maintenance mode so they don't panic and switch over?
    My current suggestion is to shut down one node and unmonitor all logical hostname resources on the remaining node.
    Any better ideas?

    Well Tim, a cluster shutdown would be a clean solution, but we try to keep the downtime as short as possible (our priorities are slightly different from the priorities of the networking people ;-) ). There will be hundreds (yes!) of switches, so a safe maintenance window would probably be an hour or so, while the switches on which a single cluster depends will be rebooted much faster. So in order to keep the downtime minimal, we try to keep at least one cluster node up and running so the service is available as soon as possible.
    I know this rebooting of all switches simultaneously is a call for a nightmare, if I just think of all the applications which are too stupid to reconnect automatically to the database if they lose the connection....
    The only safe solution would be detailed planning and a step-by-step approach, but maybe we have to live with the brute-force method.
    Fritz

  • Shutting down a Coherence node via JMX?

    I retrieved a Coherence node using "Coherence:type=Node,nodeId=<whatever>", and then invoked the shutdown method (see attachment CoherenceJMX.JPG). The node seems to shut down fine (I don't see it any more in the Agent View), but after a few seconds it came back (refreshing the Agent View, I see it again). Sometimes it comes back with a different nodeId.
    I'm confused as to what the shutdown() method is doing.
    Is there anything equivalent to calling Cluster.shutdown() for a particular node? Thanks.

    Thanks Dimitri, I actually didn't see your note until now. I took a different approach and created a rather trivial extension of DefaultCacheServer:
    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.DefaultCacheServer;
    import com.tangosol.net.management.Registry;

    public class PCSDefaultCacheServer extends DefaultCacheServer {
        public static void main(String[] args) {
            System.out.println("Registering Custom Management Extensions...");
            // register a custom MBean with the cluster's management registry
            Registry registry = CacheFactory.ensureCluster().getManagement();
            String name = registry.ensureGlobalName("type=Node,name=CustomManagement");
            CoherenceCustomManagementMBean bean = new CoherenceCustomManagement();
            registry.register(name, bean);
            System.out.println("Done.");
            // then start the regular cache server
            DefaultCacheServer.main(args);
        }
    }
    The CoherenceCustomManagement MBean registered above exposes two methods:
    * shutdown() - which simply shuts down all Coherence services
    * stop() - which does the above plus a System.exit to terminate the JVM
    Regardless, your alternative is very helpful, thank you!
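
    The MBean pair itself is not shown in the post; a minimal sketch of what it might look like, given the two operations described above (the class bodies and the use of CacheFactory.shutdown() are assumptions):

    // Hypothetical MBean interface matching the two operations described above.
    public interface CoherenceCustomManagementMBean {
        void shutdown(); // stop all Coherence services in this JVM
        void stop();     // shutdown() plus terminating the JVM
    }

    // Hypothetical implementation.
    class CoherenceCustomManagement implements CoherenceCustomManagementMBean {
        public void shutdown() {
            com.tangosol.net.CacheFactory.shutdown(); // shuts down all services on this member
        }
        public void stop() {
            shutdown();
            System.exit(0); // terminate the JVM as well
        }
    }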

  • Secondary Node Rebooted instead of falling to Ok prompt

    Hi all,
    We need to get system backup for our clustered DB, before and after our maintenance work.
    We have the following configuration:
    Node #1 and Node #2
    Solaris 8
    SunCluster 3.0
    Oracle 9.3.4
    VxVM 3.2
    Before issuing the cluster shutdown command, I verified which node was primary:
    #scstat
    I issued scshutdown -y -i0 on the primary node, and the secondary node rebooted instead of halting at the {ok} prompt. (The primary server successfully fell to the ok prompt.)
    When I checked on the logs on the secondary node.
    May 16 08:18:41 SC[SUNW.HAStoragePlus,ttmapd-rg,tmsstor-res,hastorageplus_prenet_start_private]: Global device path /dev/vx/rdsk/tms_usr_dg01/bak_redo11_vol is not recognized as a device group or a device special file.
    May 16 08:18:42 SC[SUNW.LogicalHostname,ttmapd-rg,tmslhost3-res,hafoip_start]: pnm_init: RPC: Rpcbind failure - RPC: Unable to receive
    May 16 08:18:42 SC[SUNW.LogicalHostname,ttmapd-rg,tmslhost0-res,hafoip_start]: pnm_init: RPC: Rpcbind failure - RPC: Unable to receive
    May 16 08:18:42 SC[SUNW.LogicalHostname,ttmapd-rg,tmslhost3-res,hafoip_start]: Failed to validate NAFO group name <nafo0> nafo errorcode <5>.
    May 16 08:18:42 SC[SUNW.LogicalHostname,ttmapd-rg,tmslhost0-res,hafoip_start]: Failed to validate NAFO group name <nafo1> nafo errorcode <5>.
    May 16 08:18:42 SC[SUNW.LogicalHostname,ttmapd-rg,tmslhost0-res,hafoip_stop]: pnm_init: RPC: Rpcbind failure - RPC: Unable to receive
    May 16 08:18:42 SC[SUNW.LogicalHostname,ttmapd-rg,tmslhost0-res,hafoip_stop]: Failed to validate NAFO group name <nafo1> nafo errorcode <5>.
    May 16 08:18:42 SC[SUNW.LogicalHostname,ttmapd-rg,tmslhost3-res,hafoip_stop]: pnm_init: RPC: Rpcbind failure - RPC: Unable to receive
    May 16 08:18:42 SC[SUNW.LogicalHostname,ttmapd-rg,tmslhost3-res,hafoip_stop]: Failed to validate NAFO group name <nafo0> nafo errorcode <5>.
    Has anyone encountered this error before?
    Thank you in advance.
    Regards,
    Rachele

    scshutdown issues a shutdown command on both nodes. Here is the procedure:
    - fail over your resource group to node_2
    root@node_2#scswitch -z -g oracle -h node_2
    you can check the status of the resource group with scstat
    - on the node which you want to back up:
    root@node_1#init 0
    ok boot -sx
    s = single usermode
    x = outside of the cluster
    Once you're in single usermode, start your backup
    If you wish to avoid a bunch of logging on node_2, you can always disable node_1, or put it in maintenance state, from node_2 with scconf:
    root@node_2#scconf -q node=node_1,maintstate
    (make sure you know what you're doing here)
    to reboot node_1 and join it into the cluster:
    root@node_1#umount -a
    root@node_1#sync
    root@node_1#reboot
    ok boot (if auto-boot? is set to false)
    root@node_2#scconf -q node=node_1,reset
    that's it
    cheers,
    Kim

  • Replacing a clustered machine

    Hello,
    We are going to replace our clustered C600 series by C650 machines.
    Since we have a rather complex firewall config I'd like to do an “in-place” machine swap.
    What I’m planning to do is the following:
    • Give the new machine IP addresses as our production machines have now
    • Install the certificates on the new machine
    • Shut down the listeners of the new machine
    • Shut down the new machine
    • Shut down the listeners on the old machine
    • Wait until the queues are empty
    • Remove the machine from the cluster
    • Shutdown the machine
    • Replace the C600 by the C650
    • Connect only the management interface and boot the machine
    • Check if the listeners are still stopped
    • Connect DATA1 and DATA2
    • Check if the connectivity is as expected (can I connect to my internal mail servers, can I connect to internet mail servers)
    • Check if DNS is working
    • Add the machine to the cluster.
    • Check the configuration
    • Start the listeners
    I have a few questions:
    • Is this a good / safe approach or am I overlooking something?
    • Is it sensible to install the certificates while the machine is still stand-alone, or will I have to do it after the machine has become part of the cluster? (it's a terrible job, so I'd like to do it only once)
    • When I stop the listeners before shutting down the machine, will they stay stopped after booting the systems again or will they be started automatically?
    All input / responses are appreciated. Thank you!
    Steven

    Steven,
    I can't see any traps in your approach so far.
    IMHO, you should install certificates before joining the cluster if you use them for TLS connections. As soon as the machine is part of the cluster it has listeners defined which will be active for incoming connections. These connections may fail if the sender expects official certificates.
    The listeners stay suspended after a reboot.
    Joerg

  • VIP failover in Oracle RAC

    Dear all,
    I am using Oracle Rac 10gR2 running on top of Sun Cluster 3.2u3.
    I have a test to check the failover ability of VIP in Oracle RAC, however the result was not as I expected.
    The test scenario was:
    - Turn on the two nodes and wait for all services, including both Sun Cluster and Oracle RAC, to come online.
    - Use SQL Navigator to connect to the database using the VIP on node1 (VIP1).
    - Shut down node1.
    - All services and resources on node2 stayed online; however, even after a long time (about 10 minutes), I did not see VIP1 fail over to the surviving node.
    - The "crs_stat -t" command did not show VIP1 online on node2 (the surviving node).
    - SQL Navigator could not establish a connection to the database using VIP1 any more.
    The output of "crs_stat -t" command before shutting down the node1:
    oracle@t5120-02 $ crs_stat -t
    Name Type Target State Host
    ora.orcl.db application ONLINE ONLINE t5120-02
    ora....l1.inst application ONLINE ONLINE t5120-01
    ora....l2.inst application ONLINE ONLINE t5120-02
    ora....01.lsnr application ONLINE ONLINE t5120-01
    ora....-01.gsd application ONLINE ONLINE t5120-01
    ora....-01.ons application ONLINE ONLINE t5120-01
    ora....-01.vip application ONLINE ONLINE t5120-01
    ora....02.lsnr application ONLINE ONLINE t5120-02
    ora....-02.gsd application ONLINE ONLINE t5120-02
    ora....-02.ons application ONLINE ONLINE t5120-02
    ora....-02.vip application ONLINE ONLINE t5120-02
    The output of "crs_stat -t" command after shutting down the node1:
    oracle@t5120-02 $ crs_stat -t
    Name Type Target State Host
    ora.orcl.db application ONLINE ONLINE t5120-02
    ora....l1.inst application OFFLINE OFFLINE
    ora....l2.inst application ONLINE ONLINE t5120-02
    ora....01.lsnr application OFFLINE OFFLINE
    ora....-01.gsd application OFFLINE OFFLINE
    ora....-01.ons application OFFLINE OFFLINE
    ora....-01.vip application OFFLINE OFFLINE
    ora....02.lsnr application ONLINE ONLINE t5120-02
    ora....-02.gsd application ONLINE ONLINE t5120-02
    ora....-02.ons application ONLINE ONLINE t5120-02
    ora....-02.vip application ONLINE ONLINE t5120-02
    So my questions are:
    - Was my test scenario correct to check the failover ability of VIP in Oracle RAC?
    - Is there any additional configuration needed to perform on the system to achieve the VIP failover?
    Please help me in this case as I am new to Oracle RAC.
    Thanks.
    HuyNQ.

    Dear Rajesh,
    Sorry for late reply.
    I have already tested two cases: shutting down a node and crashing a node. Below is the output of the log files in the two test cases.
    Once again, when shutting down a node, the VIP did not fail over, although the CRS on that node was shut down before all the other Sun Cluster services and resources were shut down.
    Please help to check the log files and advise if you see anything abnormal.
    Thanks.
    * In case of shutting down the node 1: (at about 09:05 Sep 17)
    Shutdown node 1:
    root@t5120-01 # shutdown -y -g0 -i0
    Shutdown started. Fri Sep 17 09:04:55 ICT 2010
    Changing to init state 0 - please wait
    Broadcast Message from root (console) on t5120-01 Fri Sep 17 09:04:55...
    THE SYSTEM t5120-01 IS BEING SHUT DOWN NOW ! ! !
    Log off now or risk your files being damaged
    crsd.log file on node 2:
    root@t5120-02 # more /u01/app/oracle/10.2.0/crs/log/t5120-02/crsd/crsd.log
    2010-09-16 16:35:56.281: [  CRSRES][1326] t5120-02 : CRS-1019: Resource ora.t5120-01.gsd (application) cannot run on t5120-02
    2010-09-16 16:35:56.320: [  CRSRES][1325] t5120-02 : CRS-1019: Resource ora.t5120-01.LISTENER_T5120-01.lsnr (application) cannot run on t5120-02
    2010-09-16 16:35:56.346: [  CRSRES][1327] t5120-02 : CRS-1019: Resource ora.t5120-01.ons (application) cannot run on t5120-02
    2010-09-16 17:06:10.202: [  CRSRES][1520] StopResource: setting CLI values
    2010-09-17 09:06:10.567: [ CRSCOMM][5709] CLEANUP: Searching for connections to failed node t5120-01
    2010-09-17 09:06:10.577: [  CRSEVT][5709] Processing member leave for t5120-01, incarnation: 11
    2010-09-17 09:06:10.665: [    CRSD][5709] SM: recovery in process: 8
    2010-09-17 09:06:10.665: [  CRSEVT][5709] Do failover for: t5120-01
    2010-09-17 09:06:10.826: [  CRSEVT][5709] Post recovery done evmd event for: t5120-01
    2010-09-17 09:06:10.898: [    CRSD][5709] SM: recoveryDone: 0
    2010-09-17 09:06:10.918: [  CRSEVT][5710] Processing RecoveryDone
    crs_stat -t on node 2:
    oracle@t5120-02 $ crs_stat -t
    Name Type Target State Host
    ora.orcl.db application ONLINE ONLINE t5120-02
    ora....l1.inst application OFFLINE OFFLINE
    ora....l2.inst application ONLINE ONLINE t5120-02
    ora....01.lsnr application OFFLINE OFFLINE
    ora....-01.gsd application OFFLINE OFFLINE
    ora....-01.ons application OFFLINE OFFLINE
    ora....-01.vip application OFFLINE OFFLINE
    ora....02.lsnr application ONLINE ONLINE t5120-02
    ora....-02.gsd application ONLINE ONLINE t5120-02
    ora....-02.ons application ONLINE ONLINE t5120-02
    ora....-02.vip application ONLINE ONLINE t5120-02
    * In case of crashing the node 1: (at about 09:32 Sep 17)
    Crash the node 1:
    root@t5120-01 # Sep 17 09:31:16 t5120-01 Cluster.CCR: pmmd: fsync_core_files: could not get any core file paths: pcorefile error Invalid argument, gcorefile error Invalid argument, zcorefile error Invalid argument
    Sep 17 09:31:16 t5120-01 Cluster.CCR: [ID 408757 daemon.alert] pmmd: fsync_core_files: could not get any core file paths: pcorefile error Invalid argument, gcorefile error Invalid argument, zcorefile error Invalid argument
    Notifying cluster that this node is panicking
    crsd.log file on node 2:
    root@t5120-02 # tail -30 /u01/app/oracle/10.2.0/crs/log/t5120-02/crsd/crsd.log
    2010-09-16 16:35:56.281: [  CRSRES][1326] t5120-02 : CRS-1019: Resource ora.t5120-01.gsd (application) cannot run on t5120-02
    2010-09-16 16:35:56.320: [  CRSRES][1325] t5120-02 : CRS-1019: Resource ora.t5120-01.LISTENER_T5120-01.lsnr (application) cannot run on t5120-02
    2010-09-16 16:35:56.346: [  CRSRES][1327] t5120-02 : CRS-1019: Resource ora.t5120-01.ons (application) cannot run on t5120-02
    2010-09-16 17:06:10.202: [  CRSRES][1520] StopResource: setting CLI values
    2010-09-17 09:06:10.567: [ CRSCOMM][5709] CLEANUP: Searching for connections to failed node t5120-01
    2010-09-17 09:06:10.577: [  CRSEVT][5709] Processing member leave for t5120-01, incarnation: 11
    2010-09-17 09:06:10.665: [    CRSD][5709] SM: recovery in process: 8
    2010-09-17 09:06:10.665: [  CRSEVT][5709] Do failover for: t5120-01
    2010-09-17 09:06:10.826: [  CRSEVT][5709] Post recovery done evmd event for: t5120-01
    2010-09-17 09:06:10.898: [    CRSD][5709] SM: recoveryDone: 0
    2010-09-17 09:06:10.918: [  CRSEVT][5710] Processing RecoveryDone
    2010-09-17 09:32:08.810: [ CRSCOMM][5837] CLEANUP: Searching for connections to failed node t5120-01
    2010-09-17 09:32:08.811: [  CRSEVT][5837] Processing member leave for t5120-01, incarnation: 13
    2010-09-17 09:32:08.824: [    CRSD][5837] SM: recovery in process: 8
    2010-09-17 09:32:08.824: [  CRSEVT][5837] Do failover for: t5120-01
    2010-09-17 09:32:09.036: [  CRSRES][5837] startup = 0
    2010-09-17 09:32:09.075: [  CRSRES][5837] startup = 0
    2010-09-17 09:32:09.106: [  CRSRES][5837] startup = 0
    2010-09-17 09:32:09.132: [  CRSRES][5837] startup = 0
    2010-09-17 09:32:09.153: [  CRSRES][5837] startup = 0
    2010-09-17 09:32:09.565: [  CRSRES][5839] startRunnable: setting CLI values
    2010-09-17 09:32:09.575: [  CRSRES][5839] Attempting to start `ora.t5120-01.vip` on member `t5120-02`
    2010-09-17 09:32:16.276: [  CRSRES][5839] Start of `ora.t5120-01.vip` on member `t5120-02` succeeded.
    2010-09-17 09:32:16.340: [  CRSEVT][5837] Post recovery done evmd event for: t5120-01
    2010-09-17 09:32:16.342: [    CRSD][5837] SM: recoveryDone: 0
    2010-09-17 09:32:16.348: [  CRSEVT][5846] Processing RecoveryDone
    crs_stat -t on node 2:
    oracle@t5120-02 $ crs_stat -t
    Name Type Target State Host
    ora.orcl.db application ONLINE ONLINE t5120-02
    ora....l1.inst application ONLINE OFFLINE
    ora....l2.inst application ONLINE ONLINE t5120-02
    ora....01.lsnr application ONLINE OFFLINE
    ora....-01.gsd application ONLINE OFFLINE
    ora....-01.ons application ONLINE OFFLINE
    ora....-01.vip application ONLINE ONLINE t5120-02
    ora....02.lsnr application ONLINE ONLINE t5120-02
    ora....-02.gsd application ONLINE ONLINE t5120-02
    ora....-02.ons application ONLINE ONLINE t5120-02
    ora....-02.vip application ONLINE ONLINE t5120-02

  • The Cluster not failover when i shutdown one managed server?

    Hello, I created one cluster with two managed servers and deployed an application across the cluster, but WebLogic Server gave me two URLs with two different ports for accessing this application:
    http://server1:7003/App_name
    http://server1:7005/App_name
    When I shut down (immediate) one managed server, I lose the connection to the application through that managed server. My question is: why do the failover and the load balancer not work?
    Why two different addresses?
    Thanks for any help

    Well, you have two different addresses (URLs) because those are two physical managed servers. By creating a cluster you are not automatically going to get a virtual address (URL) that load-balances requests for that application between those two managed servers.
    If you want one URL to access this application, you will have to put some kind of web server in front of your WebLogic. You can install and configure Oracle HTTP Server to route requests to the WebLogic cluster. Refer to this:
    http://download.oracle.com/docs/cd/E12839_01/web.1111/e10144/intro_ohs.htm#i1008837
    And this for details on how to configure mod_wl_ohs to route requests from OHS to WLS:
    http://download.oracle.com/docs/cd/E12839_01/web.1111/e10144/under_mods.htm#BABGCGHJ
    Hope this helps.
    Thanks
    Shail

  • Oracle shutdown steps in a cluster

    Hi all,
    The PRD setup is as follows
    One single oracle 10.2.0.4 Database, with two physical Servers (DBServ-1, DBServ-2)
    Running Application is SAP
    Oracle is in a SUSE Linux cluster (not a RAC environment, just an OS-level cluster to provide failover);
    Currently the DBServ-1 node is active, and DBServ-2 is standby
    We need to shut down this entire setup and restart it again.
    What is the correct sequence to shut down the Oracle DB?
    Since DBServ-2 is standby, we can shut it down as a normal server shutdown (e.g. init 0) without any issue.
    But Oracle is up and running on the DBServ-1 node.
    From where should I stop it?
    Is it from the SuSE cluster, or from inside the DB (log in as / and issue shutdown immediate)?
    Regards,
    Zerandib

    Hi,
    As I understand it, the OS is clustered, not the DB. It is better to shut the DB down cleanly by hand and then start it again.
    To start, log in through sqlplus as sys as sysdba:
    1. startup
    2. Once that is done, start the listener
    3. Start the application
    Anand

  • Errors of BEA-101147 during shutdown if a managed server form a cluster

    Hi there,
    We have a cluster with two machines, each running one managed server. Access to these servers is made with mod_wl.so on Apache. The problem we're having is: when I shut down one of the managed servers, it seems to undeploy the web application before it stops serving requests, producing the following error message in the log:
    ####<05/10/2004 08h19min43s BRT> <Debug> <HTTP> <calipso> <calipsoFOCC> <ExecuteThread: '0' for queue: 'weblogic.socket.Muxer'> <<WLS Kernel>> <> <BEA-101147> <HttpServer(20184735,null default ctx,calipsoFOCC) Found no context for "/externo/portal/pop/sendData.jsp". This request does not match the context path for any installed Web applications, and there is no default Web application configured.>
    The effect for the end user is that he receives 404 errors, even though the session is being replicated to the other server.
    I'm running WLS Premium 8.1SP2. If anyone has a clue on this I would be very grateful.
    Regards,
    Renato Moutinho

    Hi,
    You didn't mention what WLS release you are using. There is a known problem with 8.1sp2 and 8.1sp3. You should open a case with BEA Support and reference change request number CR185885.
    Jane

  • Excessive (?) cluster delays during shutdown of storage enabled node.

    We are experiencing significant delays when shutting down a storage enabled node. At the moment, this is happening in a benchmark environment. If these delays were to occur in production, however, they would push us well outside of our acceptable response times, so we are looking for ways to reduce/eliminate the delays.
    Some background:
    - We're running in a 'grid' style arrangement with a dedicated cache tier.
    - We're running our benchmarks with a vanilla distributed cache -- binary storage, no backups, no operations other than put/get.
    - We're allocating a relatively large number of partitions (1973), basing that number on the total potential cluster storage and the '50MB per partition' rule.
    - We're using JSW to manage startup/shutdown, calling DefaultCacheServer.main() to start the cache server, and using the shutdown hook (from the operational config) to shutdown the instance.
    - We're currently running all of the dedicated cache JVMs on a single machine (that won't be the case in production, of course), with a relatively higher ratio of JVMs to cores --> about 2 to 1.
    - We're using a simple benchmarking client that is issuing a combination of puts/gets against the distributed cache. The ids for these puts/gets are randomized (completely synthetic, i know).
    - We're currently handling all operations on the distributed service thread (i.e. thread count is zero).
    What we see:
    - When adding a new node to a cluster under steady load (~50% CPU idle avg), there is a very slight degradation, but only very slight. There is no apparent pause, and the maximum operation times against the cluster might barely exceed ~100 ms.
    - When later removing that node from the cluster (kill the JVM, triggering the coherence supplied shutdown hook), there is an obvious, extended pause. During this time, the maximum operation times against the cluster are as high as 5, 10, or even 15 seconds.
    At the beginning of the pause, a client will see this message:
    2010-07-13 22:23:53.227/55.738 Oracle Coherence GE 3.5.3/465 <D5> (thread=Cluster, member=10): Member 8 left service Management with senior member 1
    During the length of the pause, the cache server logging indicates that primary partitions are being shuffled around.
    When the partition shuffle is complete, the clients become immediately responsive, and display these messages:
    2010-07-13 22:23:58.935/61.446 Oracle Coherence GE 3.5.3/465 <D5> (thread=Cluster, member=10): Member 8 left service hibL2-distributed with senior member 1
    2010-07-13 22:23:58.973/61.484 Oracle Coherence GE 3.5.3/465 <D5> (thread=Cluster, member=10): MemberLeft notification for Member 8 received from Member(Id=8, Timestamp=2010-07-13 22:23:21.378, Address=x.x.x.x:8001, MachineId=47282, Location=site:xxx.com,machine:xxx,process:30552,member:xxx-S02, Role=server)
    2010-07-13 22:23:58.973/61.484 Oracle Coherence GE 3.5.3/465 <D5> (thread=Cluster, member=10): Member(Id=8, Timestamp=2010-07-13 22:23:58.973, Address=x.x.x.x:8001, MachineId=47282, Location=site:xxx.com,machine:xxx,process:30552,member:xxx-S02, Role=server) left Cluster with senior member 1
    2010-07-13 22:23:59.135/61.646 Oracle Coherence GE 3.5.3/465 <D5> (thread=Cluster, member=10): TcpRing: disconnected from member 8 due to the peer departure
    Note that there was almost nothing actually in the entire cluster-wide cache at this point -- maybe 10 MB of data at most.
    Any thoughts on how we could eliminate (or nearly eliminate) these pauses on shutdown?

    Increasing the number of threads associated with the distributed service does not seem to have a noticeable effect. I might try it in a larger-scale test, just to make sure, but the initial indications are not positive.
    From the client side, the operations seem hung behind the DistributedCache$BinaryMap.waitForPartitionRedistribution() method. The call stack is listed below.
    "main" prio=10 tid=0x09a75400 nid=0x6f02 in Object.wait() [0xb7452000]
    java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$BinaryMap.waitForPartitionRedistribution(DistributedCache.CDB:96)
    - locked <0x9765c938> (a com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$BinaryMap$Contention)
    at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$BinaryMap.waitForRedistribution(DistributedCache.CDB:10)
    at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$BinaryMap.ensureRequestTarget(DistributedCache.CDB:21)
    at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$BinaryMap.get(DistributedCache.CDB:16)
    at com.tangosol.util.ConverterCollections$ConverterMap.get(ConverterCollections.java:1547)
    at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$ViewMap.get(DistributedCache.CDB:1)
    at com.tangosol.coherence.component.util.SafeNamedCache.get(SafeNamedCache.CDB:1)
    at com.ea.nova.coherence.lt.GetRandomTask.main(GetRandomTask.java:90)
    Any help appreciated!
