Cluster 3.2 failure retry time

Dear All,
I have Messaging Server 7 in a cluster. However, for some reason, after a number of watcher crashes, the cluster didn't restart the messaging resource.
I am wondering if there is a retry timeout or a maximum number of retries.
Would anyone please let me know if there are such options? If so, how do I set them?
Regards,
Scotty

Hi Scotty,
If you want indefinite restarts, you can set Retry_count to -1. However, I would recommend not doing that. Think about a cyclic failure: say that after 10 seconds your messaging server stops responding to the probe and a restart is submitted, over and over. In such cases you will find it hard to interact with the system.
The retry count is a safety feature that prevents such reconfiguration storms.
My suggestion is to set it to a fair number.
Retry_count * Thorough_probe_interval needs to be smaller than Retry_interval. If the default number seems too small, increase it.
The command is:
clrs set -p retry_count=<new value> <your resource name>
Cheers
Detlef
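
As a rough illustration of the rule above, the following Python sketch checks whether a set of values satisfies Retry_count * Thorough_probe_interval < Retry_interval and computes the largest retry count that still fits. The function names and the example numbers (a 60-second probe interval, a 300-second retry interval) are assumptions made for illustration; they are not defaults taken from the cluster software.

```python
def retry_settings_consistent(retry_count, thorough_probe_interval, retry_interval):
    """Return True when retry_count * thorough_probe_interval < retry_interval."""
    return retry_count * thorough_probe_interval < retry_interval


def max_retry_count(thorough_probe_interval, retry_interval):
    """Largest retry count that still satisfies the inequality (integer seconds)."""
    return (retry_interval - 1) // thorough_probe_interval


if __name__ == "__main__":
    # Hypothetical values, in seconds; substitute the values of your own resource.
    probe, interval = 60, 300
    print(retry_settings_consistent(2, probe, interval))
    print(max_retry_count(probe, interval))
```

Whatever value comes out can then be applied with the clrs set command shown above.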

Similar Messages

Is it possible to set delivery retry time-interval between messages?

Hello!
          I use WLS 8.1.5 and would love to be able to set a delivery retry time interval between messages.
          The complete situation looks like this.
          We have one queue (queue_A) that receives a lot of messages.
          On queue_A we have a BMT-MDB that consumes all messages (one by one).
          The BMT-MDB is supposed to forward the messages to another queue (queue_B). (We also do some other things as well.)
          If we cannot forward the message to queue_B because an exception is thrown, the transaction is rolled back, exactly as we want.
          But when this happens (e.g. the server where queue_B resides is down), our BMT-MDB keeps trying to post the message like an Energizer bunny. It would be better if it waited a few seconds before trying again.
          We have tried playing around with RedeliveryDelayOverride="5000" on queue_B. But if we set it to anything greater than RedeliveryDelayOverride="-1", other messages pass through and are also put into "pending". This ended in an out-of-memory error when the load was high.
          So what we would love is to keep all messages in a FIFO row on queue_A, and if queue_B is out of order, keep trying to consume the first message every 5 seconds.
          So if anyone knows whether and how this can be configured in WLS 8.1.5, please let us know!
          Best regards
          Fredrik

Hello Again Fredrik!
          Later versions have several more options, but here are some that might work on 8.1:
          (A) A good bit of coding: Write code that undeploys the MDB on certain failures, and redeploys it sometime later.
          (B) A small amount of coding: On a failure, simply force a tx rollback, then put a Thread.sleep() call in the MDB application itself. (Just make sure the MDB is set up with a dedicated thread pool to avoid using up default threads.)
          (C) Requires no coding: Have the MDB forward the message to a local destination rather than the remote destination (so the forward always succeeds), then use the messaging bridge to forward the message onward. The messaging bridge automatically does a periodic retry on failure, and doesn't need to use redelivery delays.
          (D) Requires no coding: Raise a support case with BEA. I personally think the fact that redelivery-delayed messages fail to page out likely indicates a bug in 8.1. (An upgrade to 9.x quite likely doesn't have this problem.)
          Hope this helps,
          Tom
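
The idea behind option (B) can be sketched outside WebLogic. The Python simulation below is not JMS or MDB code: the `send` callback, the function name and the 5-second delay are assumptions for illustration. It only shows the control flow the question asks for: the head of the queue is retried after a pause, so later messages never overtake it.

```python
import time
from collections import deque


def forward_in_order(messages, send, retry_delay_seconds=5.0, sleep=time.sleep):
    """Forward messages strictly in FIFO order, pausing after each failure.

    `send` is a hypothetical callback that raises an exception when the
    target queue is unavailable. Returns the number of send attempts made.
    """
    pending = deque(messages)
    attempts = 0
    while pending:
        attempts += 1
        try:
            send(pending[0])              # always try the oldest message first
        except Exception:
            sleep(retry_delay_seconds)    # wait instead of retrying immediately
            continue                      # the same message stays at the head
        pending.popleft()                 # remove only after a successful send
    return attempts
```

In a real MDB the pause would sit after the rollback, as Tom describes, and the dedicated thread pool matters because the sleeping thread is blocked for the whole delay.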

Why do I keep getting "Message Send Failure" every time I send pictures on iMessage? All are sent to other iPhone/iPad users.

For a few months now, I've been getting "Message Send Failure" every time I send pictures to other Apple devices (iPhones, iPads, Macs), despite the recipient receiving them. So, even though the pictures are going through, the "Message Send Failure" notice keeps showing up. I checked with other people who have Apple devices and they also encountered the same issue whenever they send pictures (usually send an email to make sure it was received). I thought when iOS 7.1.1 came out that it would fix this issue. Has anyone else encountered the same issue? Thanks!

I started getting the error on my iPhone 5 (iOS 7.1.x) several weeks ago - if not a month ago. Today I finally decided to run through a wholesale sign-out/sign-in process on all my devices (iMac, MacBook, iPad 1, iPad 3 and iPhone5) and it worked!
I signed out of iMessage/Messages on all devices. I also deleted my iCloud accounts on all my devices (make sure you tell the software to keep local copies when you are prompted - relax, it all stays in the cloud, too). Then I signed back in to iMessage/Messages and re-added the iCloud accounts all around. I just sent a message with a photo attachment and did not get the "Not Delivered" message - yay!
(FYI - even though I received the error every time I sent a message with an image attachment, the message still went through - very, very happy, though to finally get rid of the message...)

Why do I get a conversion failure every time when I try to merge documents together???

Hi mdrhine,
I'm sorry that you've been unable to merge files. Let's see what we can figure out.
Are you unable to merge any files? How large are the files that you're trying to merge, and how many are you trying to merge at once?
If you're merging more than a few files, or those files are particularly large, try merging them in smaller batches. If one or more files are causing the conversion failure, merging in smaller batches should help you isolate the problematic file or files.
I look forward to hearing back from you with some details about the files that you're trying to merge.
Best,
Sara
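
The batching advice can be pictured with a small Python sketch. It assumes a hypothetical `merge_ok` callback that attempts a merge and reports success; no such function exists in the merge service itself, and the batch size of 3 is arbitrary. Batches that fail are re-tested one file at a time to find the suspects.

```python
def find_failing_files(files, merge_ok, batch_size=3):
    """Return the files that fail on their own, found by testing in batches.

    `merge_ok(batch)` is a hypothetical callback returning True when the
    given list of files merges without a conversion failure.
    """
    suspects = []
    for start in range(0, len(files), batch_size):
        batch = files[start:start + batch_size]
        if merge_ok(batch):
            continue                      # whole batch is fine, move on
        # The batch failed: test each file individually to isolate the cause.
        suspects.extend(f for f in batch if not merge_ok([f]))
    return suspects
```

If a batch fails but none of its files fails alone, the list stays empty for that batch, which would point at total size or file count rather than at one bad file.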

DST: the SQL Cluster did not change the time by itself..

Hello,
DST: the SQL Cluster did not change the time by itself..
Where should I look for the issue? The time was changed manually to continue processing.
Thanks,
Dom
System Center Operations Manager 2007 / System Center Configuration Manager 2007 R2 / Forefront Client Security / Forefront Identity Manager

Hi,
Thanks for your posting.
For the SQL cluster issue, I think you should ask in:
http://social.msdn.microsoft.com/Forums/sqlserver/en-US/home?category=sqlserver&filter=alltypes&sort=lastpostdesc
Regards.
Vivian Wang

I get a conversion failure every time I upload a pdf to combine - any ideas?

Hi sharon,
Have you tried combining any other PDFs apart from ones which fail to combine?
Are the PDF files complex or simple?
You might try using a different browser.
Regards,
Anubha

Automatic Execution - Retry Times

Hi all,
I have set the Retry Times property of my OBPM engine on WebLogic Server to 1,
with Retry Interval = 60 seconds.
I saved it and restarted my engine.
But the retry does not seem to happen.
It goes directly into a System Exception loop without retrying even once, although I have set it to retry at least once.
Why is that so?
Any idea?
Are any changes needed on the WebLogic Server or in the Process Administrator?
Do we have to rebuild the Engine.ear file after making these changes?
Edited by: user8766631 on May 3, 2010 8:44 AM

Hey Frank. Kory Squire here. I had a similar situation with layouts for entering Management Cockpit data. I have FOX formulas that validate data entry and an error message is raised if data is incorrect. I bundled these into a local planning sequence, then added this to the planning folder with the properties 'Execute Function before Layout Change.' I created the Web layout based on this planning folder. When a user enters data in the web layout, each time they press enter, click the check/save buttons, the sequence is triggered. So in our case, the settings in the planning folder seem to have carried over into the web layout.
Take care
Kory

"Message Send Failure" every time I send pictures on message?

"Message Send Failure" every time I send pictures on message? Help. This is also happening on my husband's phone. I have the 5C and he has the 5S. This all started about 4-5 weeks ago. The picture do however go through.

There have been many, many reports of this lately so I believe it's not anything you're doing.

Server 2012 File Server Cluster Shadow Copies Disappear Some Time After Failover

Hello,
I've seen similar questions posted on here before however I have yet to find a solution that worked for us so I'm adding my process in hopes someone can point out where I went wrong.
The problem: After failover, shadow copies are only available for a short time on the secondary server. Before the task to create new shadow copies runs, the shadow copies are deleted. Failing back shows them missing on the primary server as well when this happens.
We have a 2-node cluster (hereafter server1 and server2) with a quorum disk. There are 8 disk resources, which are mapped to the cluster via iSCSI. 4 of these disks are set up as storage and the other 4 are currently set up as shadow copy volumes for their respective storage volumes.
Previously we weren't using separate shadow copy volumes and were seeing the same issue described in the topic title. I followed two other topics on here that seemed close and then set up the separate shadow copy volumes; however, it has yet to alleviate the issue. These are the two other topics:
Topic 1: https://social.technet.microsoft.com/Forums/windowsserver/en-US/ba0d2568-53ac-4523-a49e-4e453d14627f/failover-cluster-server-file-server-role-is-clustered-shadow-copies-do-not-seem-to-travel-to?forum=winserverClustering
Topic 2: https://social.technet.microsoft.com/Forums/windowsserver/en-US/c884c31b-a50e-4c9d-96f3-119e347a61e8/shadow-copies-missing-after-failover-on-2008-r2-cluster
After reading both of those topics I did the following:
1) Added the 4 new volumes to the cluster for shadow copies
2) Made each storage volume dependent on its shadow copy volume in FCM
3) Went to the currently active node directly and opened up "My Computer". I then went to the properties of each storage volume and set up shadow copies to go to the respective shadow copy volume drive letter with correct size for spacing, etc.
4) I then went back to FCM, right-clicked on the corresponding storage volume, chose "Configure Shadow Copy" and set the schedule for 12:00 noon and 5:00 PM.
5) I noticed that on the nodes the task was created and that the task would failover between the nodes and appeared correct.
6) Everything appears to failover correctly, all volumes come up, drive letters are same, shadow copy storage settings are the same, and 4 scheduled tasks for shadow copy appear on the current node after failover.
Thinking everything was set up according to best practice, I did some testing by changing file contents throughout the day, making sure that previous versions were created as scheduled on server1. I then rebooted server1 to simulate failure. Server2 picked up the role within about 10 seconds and files were available. I checked and I could still see previous versions for the files after failover that were created on server1. Unfortunately that didn't last: the next day before noon I was going to make more changes to files to ensure that not only could we see the shadow copies that were created when server1 owned the file server role, but also that the copies created on server2 would be seen on failback. I was disappointed to discover that the shadow copies were all gone, and failing back didn't produce them either.
Does anyone have any insight into this issue? I must be missing a switch somewhere or perhaps this isn't even possible with our cluster type based on this: http://technet.microsoft.com/en-us/library/cc779378%28v=ws.10%29.aspx
Now here's an interesting part: shadow copies on 1 of our 4 volumes have been retained from both nodes through the testing, but I can't figure out what makes it different, though I do suspect that perhaps the "Disk#s" in Computer Management / Disk Management need to be the same between servers. For example, on server 1 the disk # for cluster volume 1 might be "Disk4" but on server 2 the same volume might be called "Disk7"; however, I think that operations like this and shadow copy are based on the disk GUID, so perhaps this shouldn't matter.
Edit: I checked the disk numbers and see no correlation between what I'm seeing in shadow copy and what is happening to the numbers. All other items, quotas, etc. fail over and work correctly despite these diffs:
Disk Numbers on Server 1:
Format: "shadow/storerelation volume = Disk Number"
aHome storage1 = 16
aShared storage2 = 09
sHome storage3 = 01
sShared storage4 = 04
aHome shadow1 = 10
aShared shadow2 = 11
sHome shadow3 = 02
sShared shadow4 = 05
Disk numbers on Server 2:
aHome storage1 = 16 (SAME)
aShared storage2 = 04 (DIFF)
sHome storage3 = 05 (DIFF)
sShared storage4 = 08 (DIFF)
aHome shadow1 = 10 (SAME)
aShared shadow2 = 11 (SAME)
sHome shadow3 = 06 (DIFF)
sShared shadow4 = 09 (DIFF)
Thanks in advance for your assistance/guidance on this matter!
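
The SAME/DIFF comparison in the listing above can be reproduced with a few lines of Python, which helps if the disk numbers change again after another failover. The dictionaries are copied from the post; the function name is an assumption for illustration. The sketch only compares numbers and says nothing about whether disk numbering affects shadow copies.

```python
# Disk numbers as listed in the post, keyed by "volume role".
SERVER1 = {
    "aHome storage1": 16, "aShared storage2": 9,
    "sHome storage3": 1, "sShared storage4": 4,
    "aHome shadow1": 10, "aShared shadow2": 11,
    "sHome shadow3": 2, "sShared shadow4": 5,
}
SERVER2 = {
    "aHome storage1": 16, "aShared storage2": 4,
    "sHome storage3": 5, "sShared storage4": 8,
    "aHome shadow1": 10, "aShared shadow2": 11,
    "sHome shadow3": 6, "sShared shadow4": 9,
}


def differing_volumes(first, second):
    """Return the volumes whose disk number differs between the two nodes."""
    return [name for name in first if first[name] != second.get(name)]


if __name__ == "__main__":
    for name in differing_volumes(SERVER1, SERVER2):
        print(f"{name}: server1={SERVER1[name]} server2={SERVER2[name]}")
```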

Hello Alex,
Thank you for your reply. I will go through your questions in order as best I can, though I'm not the backup expert here.
1) "Did you see any event ID when the VSS fail?
please offer us more information about your environment, such as what type backup you are using the soft ware based or hard ware VSS device."
I saw a number of events on inspection. Interestingly enough, the event ID 60 issues did not occur on the drive where shadow copies did remain after the two reboots. I'm putting my event notes in a code block to try to preserve formatting/readability.
I've written down events from both server 1 and 2 in this code block, documenting the first reboot causing the role to move to server 2 and then the second reboot going back to server 1:
JANUARY 2
9:34:20 PM - Server 1 - Event ID: 1074 - INFO - Source: User32 - Standard reboot request from explorer.exe (Initiated by me)
9:34:21 PM - Server 1 - Event ID: 7036 - INFO - Source: Service Control Manager - "The Volume Shadow Copy service entered the running state."
9:34:21 PM - Server 1 - Event ID: 60 - ERROR - Source: volsnap - "The description for Event ID 60 from source volsnap cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.
If the event originated on another computer, the display information had to be saved with the event.
The following information was included with the event:
\Device\HarddiskVolumeShadowCopy49
F:
T:
The locale specific resource for the desired message is not present"
9:34:21 PM - Server 1 - Event ID 60 - ERROR - Source: volsnap - "The description for Event ID 60 from source volsnap cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.
If the event originated on another computer, the display information had to be saved with the event.
The following information was included with the event:
\Device\HarddiskVolumeShadowCopy1
H:
V:
The locale specific resource for the desired message is not present"
***The above event repeats with only the number changing, drive letters stay same, citing VolumeShadowCopy# numbers 6, 13, 18, 22, 27, 32, 38, 41, 45, 51,
9:34:21 PM - Server 1 - Event ID: 60 - ERROR - Source: volsnap - "The description for Event ID 60 from source volsnap cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.
If the event originated on another computer, the display information had to be saved with the event.
The following information was included with the event:
\Device\HarddiskVolumeShadowCopy4
E:
S:
The locale specific resource for the desired message is not present"
***The above event repeats with only the number changing, drive letters stay same, citing VolumeShadowCopy# numbers 5, 10, 19, 21, 25, 29, 37, 40, 46, 48, 48
9:34:28 PM - Server 1 - Event ID: 7036 - INFO - Source: Service Control Manager - "The NetBackup Legacy Network Service service entered the stopped state."
9:34:28 PM - Server 1 - Event ID: 7036 - INFO - Source: Service Control Manager - "The Volume Shadow Copy service entered the stopped state."
9:34:29 PM - Server 1 - Event ID: 7036 - INFO - Source: Service Control Manager - "The NetBackup Client Service service entered the stopped state."
9:34:30 PM - Server 1 - Event ID: 7036 - INFO - Source: Service Control Manager - "The NetBackup Discovery Framework service entered the stopped state."
10:44:07 PM - Server 2 - Event ID: 7036 - INFO - Source: Service Control Manager - "The Volume Shadow Copy service entered the running state."
10:44:08 PM - Server 2 - Event ID: 7036 - INFO - Source: Service Control Manager - "The Microsoft Software Shadow Copy Provider service entered the running state."
10:45:01 PM - Server 2 - Event ID: 48 - ERROR - Source: bxois - "Target failed to respond in time to a NOP request."
10:45:01 PM - Server 2 - Event ID: 20 - ERROR - Source: bxois - "Connection to the target was lost. The initiator will attempt to retry the connection."
10:45:01 PM - Server 2 - Event ID: 153 - WARN - Source: disk - "The IO operation at logical block address 0x146d2c580 for Disk 7 was retried."
10:45:03 PM - Server 2 - Event ID: 34 - INFO - Source: bxois - "A connection to the target was lost, but Initiator successfully reconnected to the target. Dump data contains the target name."
JANUARY 3
At around 2:30 I rebooted Server 2, seeing that the shadow copies were missing after the previous failure. Here are the relevant events from the flip back to server 1.
2:30:34 PM - Server 2 - Event ID: 60 - ERROR - Source: volsnap - "The description for Event ID 60 from source volsnap cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.
If the event originated on another computer, the display information had to be saved with the event.
The following information was included with the event:
\Device\HarddiskVolumeShadowCopy24
F:
T:
The locale specific resource for the desired message is not present"
2:30:34 PM - Server 2 - Event ID: 60 - ERROR - Source: volsnap - "The description for Event ID 60 from source volsnap cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.
If the event originated on another computer, the display information had to be saved with the event.
The following information was included with the event:
\Device\HarddiskVolumeShadowCopy23
E:
S:
The locale specific resource for the desired message is not present"
We are using Symantec NetBackup. The client agent is installed on both server1 and server2. We're backing them up based on the complete drive letter for each storage volume (this makes recovery easier). I believe this is what you would call "software based VSS". We don't have the infrastructure/setup to do hardware based snapshots. The drives reside on a Compellent SAN mapped to the cluster via iSCSI.
2) "Confirm the following registry is exist:
- HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\VSS\Settings"
The key is there; however, the DWORD value is not. Would that mean that the default value is being used at this point?

Cluster point of failure

I'm trying to set up an environment where, if my primary web server goes down, requests will be sent to the backup. I think clustering can help me here, but my fear is that I have a single point of failure in the managing server. If I have a cluster, is one machine managing all traffic? If that machine were to go down, would my entire site be down? Any suggestions on how to handle this at the router level would also be appreciated.
Scott

I'm not sure I understand your question completely.
You can certainly run multiple managed servers and/or a cluster of managed servers to give you some redundancy.
You can run multiple physical and/or virtual machines.
You can run multiple sites etc for disaster recovery.
I can't recall a site I've visited in a long time that didn't do all of these.
Was there a specific question you had about HA or failure scenarios?
-- Rob
WLS Blog http://dev2dev.bea.com/blog/rwoollen/

Default retry time for B2B

Hi all,
I found out that B2B retries after waiting 2 hours, counting from the previous failure.
Example: failed at 10.09am (HTTP timeout), then the next retry is at 12.09pm.
Is there any place where I can change it from 2 hours to 30 minutes?
thanks
kin wah

Kinwah,
I guess you are interested in the following parameters when defining the delivery channel:
Time To Acknowledgement: Enter a value in minutes. This value specifies the time in which an acknowledgment must be received. If an acknowledgment is not received, then retries occur.
Retry Count: Enter a value. This defines the number of times to retry.
Please change the Time To Acknowledgement according to your requirements.
Rgds, Ramesh

Solaris Cluster Private Link Failure

Hi,
I have configured Solaris Cluster 3.3 and added two back-to-back interconnect cables.
Sun Cluster is working fine, but the private link fails: I cannot ping clusternode2-priv and clusternode1-priv from each other, and some commands fail:
~ # ping clusternode2-priv
no answer from clusternode2-priv
~ # metaset -s nfsds -a -h t1u331 t1u332
metaset: 172.16.4.1: metad client create: RPC: Rpcbind failure
~ # scstat
-- Cluster Nodes --
Node name Status
Cluster node: n1u332 Online
Cluster node: n1u331 Online
-- Cluster Transport Paths --
Endpoint Endpoint Status
Transport path:   n1u332:nxge2           n1u331:nxge2           Path online
Transport path:   n1u332:nxge1           n1u331:nxge1           Path online
-- Quorum Summary from latest node reconfiguration --
Quorum votes possible: 3
Quorum votes needed: 2
Quorum votes present: 3
-- Quorum Votes by Node (current status) --
Node Name Present Possible Status
Node votes: n1u332 1 1 Online
Node votes: n1u331 1 1 Online
-- Quorum Votes by Device (current status) --
Device Name Present Possible Status
Device votes: /dev/did/rdsk/d4s2 1 1 Online
-- Device Group Servers --
Device Group Primary Secondary
-- Device Group Status --
Device Group Status
-- Multi-owner Device Groups --
Device Group Online Status
-- Resource Groups and Resources --
Group Name Resources
-- Resource Groups --
Group Name Node Name State Suspended
-- Resources --
Resource Name Node Name State Status Message
-- IPMP Groups --
Node Name Group Status Adapter Status
[root @ n1u332]
~ # ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
e1000g0: flags=1000802<BROADCAST,MULTICAST,IPv4> mtu 1500 index 2
inet 0.0.0.0 netmask 0
ether 0:15:17:e3:a4:e8
vsw0: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 3
inet 10.131.58.76 netmask ffffff00 broadcast 10.131.58.255
groupname ipmp-grp
ether 0:14:4f:f9:1:bd
vsw0:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
inet 10.131.58.75 netmask ffffff00 broadcast 10.131.58.255
vsw1: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 4
inet 10.131.58.77 netmask ffffff00 broadcast 10.131.58.255
groupname ipmp-grp
ether 0:14:4f:fb:44:4
nxge1: flags=1008843<UP,BROADCAST,RUNNING,MULTICAST,PRIVATE,IPv4> mtu 1500 index 7
inet 172.16.0.129 netmask ffffff80 broadcast 172.16.0.255
ether 0:14:4f:a0:81:d9
nxge2: flags=1008843<UP,BROADCAST,RUNNING,MULTICAST,PRIVATE,IPv4> mtu 1500 index 6
inet 172.16.1.1 netmask ffffff80 broadcast 172.16.1.127
ether 0:14:4f:a0:81:da
clprivnet0: flags=1009843<UP,BROADCAST,RUNNING,MULTICAST,MULTI_BCAST,PRIVATE,IPv4> mtu 1500 index 8
inet 172.16.4.1 netmask fffffe00 broadcast 172.16.5.255
ether 0:0:0:0:0:1
[root @ n1u332]
~ # dladm show-dev
vsw0 link: up speed: 1000 Mbps duplex: full
vsw1 link: up speed: 1000 Mbps duplex: full
e1000g0 link: down speed: 0 Mbps duplex: half
e1000g1 link: up speed: 1000 Mbps duplex: full
e1000g2 link: unknown speed: 0 Mbps duplex: half
e1000g3 link: unknown speed: 0 Mbps duplex: half
nxge0 link: up speed: 100 Mbps duplex: full
nxge1 link: up speed: 1000 Mbps duplex: full
nxge2 link: up speed: 1000 Mbps duplex: full
nxge3 link: up speed: 100 Mbps duplex: full
e1000g4 link: unknown speed: 0 Mbps duplex: half
e1000g5 link: up speed: 1000 Mbps duplex: full
clprivnet0              link: unknown   speed: 0     Mbps       duplex: unknown
Edited by: 808696 on Mar 2, 2011 8:27 AM

If your private interconnect had really failed then one or other of the cluster nodes would have panicked. I think it is more likely that either you have changed the nsswitch.conf entry for hosts such that it does not include 'cluster' first, although I would have expected that to result in an unresolved host name. The other option is that you have hardened your machine in some way with ipfilters or security settings.
Has it ever worked?
Tim
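
The nsswitch.conf point can be checked mechanically. The Python sketch below reads the text of an nsswitch.conf file and reports whether 'cluster' is the first source on the hosts line. The function names and the sample lines in the comments are assumptions for illustration; the real file on a cluster node may list different sources after 'cluster'.

```python
def hosts_sources(nsswitch_text):
    """Return the lookup sources on the 'hosts:' line, in order."""
    for raw_line in nsswitch_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()   # drop comments
        if line.startswith("hosts:"):
            return line[len("hosts:"):].split()
    return []


def cluster_listed_first(nsswitch_text):
    """True when 'cluster' is the first source for host lookups."""
    sources = hosts_sources(nsswitch_text)
    return bool(sources) and sources[0] == "cluster"


if __name__ == "__main__":
    # Hypothetical file contents, for example "hosts: cluster files dns".
    with open("/etc/nsswitch.conf") as handle:
        print(cluster_listed_first(handle.read()))
```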
---

Hyper-V guest SQL 2012 cluster live migration failure

I have two IBM HX5 nodes connected to an IBM DS5300. A Hyper-V 2012 cluster was built on the blades. Six virtual machines were created in the HV cluster and connected to the DS5300 via HV Virtual SAN. These VMs form a guest SQL cluster. The database files are placed on DS5300 storage and are available through the VM Fibre Channel adapters. The IBM MPIO module is installed on all hosts and VMs.
SQL Server instances work without problems. But when I try to live migrate a SQL VM to another HV node, the SQL instance fails. In the SQL error log I see:
2013-06-19 10:39:44.07 spid1s      Error: 17053, Severity: 16, State: 1.
2013-06-19 10:39:44.07 spid1s      SQLServerLogMgr::LogWriter: Operating system error 170(The requested resource is in use.) encountered.
2013-06-19 10:39:44.07 spid1s      Write error during log flush.
2013-06-19 10:39:44.07 spid55      Error: 9001, Severity: 21, State: 4.
2013-06-19 10:39:44.07 spid55      The log for database 'Admin' is not available. Check the event log for related error messages. Resolve any errors and restart the database.
2013-06-19 10:39:44.07 spid55      Database Admin was shutdown due to error 9001 in routine 'XdesRMFull::CommitInternal'. Restart for non-snapshot databases will be attempted after all connections to the database are aborted.
2013-06-19 10:39:44.31 spid36s     Error: 17053, Severity: 16, State: 1.
2013-06-19 10:39:44.31 spid36s     fcb::close-flush: Operating system error (null) encountered.
2013-06-19 10:39:44.31 spid36s     Error: 17053, Severity: 16, State: 1.
2013-06-19 10:39:44.31 spid36s     fcb::close-flush: Operating system error (null) encountered.
2013-06-19 10:39:44.32 spid36s     Error: 17053, Severity: 16, State: 1.
2013-06-19 10:39:44.32 spid36s     fcb::close-flush: Operating system error (null) encountered.
2013-06-19 10:39:44.32 spid36s     Error: 17053, Severity: 16, State: 1.
2013-06-19 10:39:44.32 spid36s     fcb::close-flush: Operating system error (null) encountered.
2013-06-19 10:39:44.33 spid36s     Starting up database 'Admin'.
2013-06-19 10:39:44.58 spid36s     349 transactions rolled forward in database 'Admin' (6:0). This is an informational message only. No user action is required.
2013-06-19 10:39:44.58 spid36s     SQLServerLogMgr::FixupLogTail (failure): alignBuf 0x000000001A75D000, writeSize 0x400, filePos 0x156adc00
2013-06-19 10:39:44.58 spid36s     blankSize 0x3c0000, blkOffset 0x1056e, fileSeqNo 1313, totBytesWritten 0x0
2013-06-19 10:39:44.58 spid36s     fcb status 0x42, handle 0x0000000000000BC0, size 262144 pages
2013-06-19 10:39:44.58 spid36s     Error: 17053, Severity: 16, State: 1.
2013-06-19 10:39:44.58 spid36s     SQLServerLogMgr::FixupLogTail: Operating system error 170(The requested resource is in use.) encountered.
2013-06-19 10:39:44.58 spid36s     Error: 5159, Severity: 24, State: 13.
2013-06-19 10:39:44.58 spid36s     Operating system error 170(The requested resource is in use.) on file "v:\MSSQL\log\Admin\Log.ldf" during FixupLogTail.
2013-06-19 10:39:44.58 spid36s     Error: 3414, Severity: 21, State: 1.
2013-06-19 10:39:44.58 spid36s     An error occurred during recovery, preventing the database 'Admin' (6:0) from restarting. Diagnose the recovery errors and fix them, or restore from a known good backup. If errors are not corrected or expected, contact Technical Support.
In the Windows system log I see a lot of warnings like this:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows-Ntfs" Guid="{3FF37A1C-A68D-4D6E-8C9B-F79E8B16C482}" />
<EventID>140</EventID>
<Version>0</Version>
<Level>3</Level>
<Task>0</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000008</Keywords>
<TimeCreated SystemTime="2013-06-19T06:39:44.314400200Z" />
<EventRecordID>25239</EventRecordID>
<Correlation />
<Execution ProcessID="4620" ThreadID="4284" />
<Channel>System</Channel>
<Computer>sql-node-5.local.net</Computer>
<Security UserID="S-1-5-21-796845957-515967899-725345543-17066" />
</System>
<EventData>
<Data Name="VolumeId">\\?\Volume{752f0849-6201-48e9-8821-7db897a10305}</Data>
<Data Name="DeviceName">\Device\HarddiskVolume70</Data>
<Data Name="Error">0x80000011</Data>
</EventData>
</Event>
The system failed to flush data to the transaction log. Corruption may occur in VolumeId: \\?\Volume{752f0849-6201-48e9-8821-7db897a10305}, DeviceName: \Device\HarddiskVolume70.
({Device Busy}
The device is currently busy.)
There aren't any errors or warnings on the HV hosts.

Hello,
I am trying to involve someone more familiar with this topic to take a further look at this issue. Some delay might be expected while the case is transferred. Your patience is greatly appreciated.
Thank you for your understanding and support.
Regards,
Fanny Liu
TechNet Community Support

Computers in cluster spending all their CPU time with system; many questio

I'd be interested to know if anyone on the list has been successful in getting QMaster to work on a home network of G4 computers, 800 - 867 MHz, NO server, 100 mbps hubs, existing CAT5 wiring violates the radius recommendations for 100 mbps.
At one point, I had been successful in having all three macs busy processing (CPU activity monitor mostly in the green) only to find gaps in my encoded video.
Most recently, I tried having my main computer as client rather than controller. The client computer was compressing happily, but the other two were in the red, i.e. devoting much but not all of their CPU time to the system, with little user CPU activity. The estimated completion time was more than double what the job should have taken on my local machine, so I canceled the job after about an hour.
Host names are unknown according to QAdministrator.
I am still unclear about shared cluster storage. Does it matter how it is set other than for the machine that is controller?
Should I try to mount volumes in the finder for the other two computers on each machine? The documentation doesn't say anything about this.
What folders need to have read and write privileges for all other computers on the network? What is the most reliable way to set this? What user group do I choose? What folders do I apply this to? Might I need to set up a common group for all my machines, similar to Windows Workgroups? If so, how do I do this?
Thanks,
Cris

Hi Cris, yes, it can be frustrating. Please see a post I wrote when I had loads of trouble, where I provided a detailed resolution: http://discussions.apple.com/thread.jspa?messageID=4171772&#417.
There are some options to STOP QMASTER from copying objects for COMPRESSOR by simply mounting (NFS) the volumes with ALL your file systems where source and target files will be.
Also, for the CLUSTER CONTROLLER, use the Qmaster system prefs to SET the cluster file to one of the NETWORK or SSAFS (xsa) shared volumes. I guess you may not have XSAN, so just have ALL the volumes mounted so each HOST can access them.
G5 QUAD 8GB ram w/3.5TB + 2 x 15in MBPCore Mac OS X (10.4.9)
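
The advice above (every host must be able to reach every volume holding source and target files) can be checked with a small script run on each machine. This is only a rough sketch: the volume paths are hypothetical placeholders, and it tests nothing Qmaster-specific, only that each path exists and that the current user can read and write it.

```python
import os

def check_shared_volumes(paths):
    """Return {path: status}; status is 'ok', 'missing' or 'no read/write access'."""
    result = {}
    for path in paths:
        if not os.path.isdir(path):
            # Volume is not mounted (or the path is wrong) on this host.
            result[path] = "missing"
        elif not os.access(path, os.R_OK | os.W_OK):
            # Mounted, but the current user cannot both read and write it.
            result[path] = "no read/write access"
        else:
            result[path] = "ok"
    return result

if __name__ == "__main__":
    # Hypothetical mount points; replace with the volumes used on your network.
    for path, status in check_shared_volumes(
        ["/Volumes/Source", "/Volumes/Target", "/Volumes/ClusterStorage"]
    ).items():
        print(f"{path}: {status}")
```

Running it on the controller and on each service node should report the same set of volumes as "ok" everywhere before a job is submitted.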

Unicast cluster - heartbeat message failure messages

I am using unicast messaging mode and I see the following messages:
####<Jul 9, 2010 12:46:56 AM PDT> <Info> <Cluster> <anaeur30> <WL10MP2-ServiceSTServer6> <[ACTIVE] ExecuteThread: '45' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1278661616559> <BEA-000112> <Removing WL10MP2-ServiceSTServer1 jvmid:6806396782256322086S:anaeur10:[7033,7033,-1,-1,-1,-1,-1]:anaeur10:7033,anaeur10:7035,anaeur20:7033,anaeur20:7035,anaeur30:7033,anaeur30:7035,anaeur50:7033,anaeur50:7035:WL10MP2-ServiceTier:WL10MP2-ServiceSTServer1 from cluster view due to timeout.>
####<Jul 9, 2010 12:55:36 AM PDT> <Info> <Cluster> <anaeur30> <WL10MP2-ServiceSTServer6> <[ACTIVE] ExecuteThread: '34' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1278662136552> <BEA-000112> <Removing WL10MP2-ServiceSTServer2 jvmid:-2694311272134716565S:anaeur10:[7035,7035,-1,-1,-1,-1,-1]:anaeur10:7033,anaeur10:7035,anaeur20:7033,anaeur20:7035,anaeur30:7033,anaeur30:7035,anaeur50:7033,anaeur50:7035:WL10MP2-ServiceTier:WL10MP2-ServiceSTServer2 from cluster view due to timeout.>
During the same time frame, I see lost multicast messages on all the instances for about 20 minutes. What could be the problem? Why am I seeing multicast messages when using unicast? My config.xml has multicast-related entries for each server, but will they have any effect? Is that an issue? We see servers dropping out of the cluster frequently.
000115> <Lost 1 multicast message(s).>
####<Jul 9, 2010 12:46:42 AM PDT> <Info> <Cluster> <anaeur10> <WL10MP2-ServiceSTServer2> <weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1278661602751> <BEA-000115> <Lost 1 multicast message(s).>
####<Jul 9, 2010 12:46:46 AM PDT> <Info> <Cluster> <anaeur10> <WL10MP2-ServiceSTServer2> <weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1278661606548> <BEA-000115> <Lost 2 multicast message(s).>
####<Jul 9, 2010 12:47:04 AM PDT> <Info> <Cluster> <anaeur10> <WL10MP2-ServiceSTServer2> <weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1278661624185> <BEA-000115> <Lost 2 multicast message(s).>
####<Jul 9, 2010 12:48:40 AM PDT> <Info> <Cluster> <anaeur10> <WL10MP2-ServiceSTServer2> <weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1278661720809> <BEA-000115> <Lost 2 multicast message(s).>
####<Jul 9, 2010 12:54:14 AM PDT> <Info> <Cluster> <anaeur10> <WL10MP2-ServiceSTServer2> <weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1278662054823> <BEA-000115> <Lost 2 multicast message(s).>
####<Jul 9, 2010 12:54:14 AM PDT> <Info> <Cluster> <anaeur10> <WL10MP2-ServiceSTServer2> <weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1278662054827> <BEA-000115> <Lost 1 multicast message(s).>
####<Jul 9, 2010 12:54:14 AM PDT> <Info> <Cluster> <anaeur10> <WL10MP2-ServiceSTServer2> <weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1278662054827> <BEA-000115> <Lost 2 multicast message(s).>
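
To line up the BEA-000115 "lost multicast" records against the BEA-000112 removals, the server logs can be summarized with a short script. The following is a minimal sketch, not a WebLogic tool: it assumes each record sits on a single line in the layout shown above, and the names `summarize`, `RECORD`, `LOST` and `REMOVED` are made up for this illustration.

```python
import re
from collections import Counter

# One server-log record in the layout shown above:
# ####<time> <severity> <subsystem> <machine> <server> ... <BEA-nnnnnn> <message>
RECORD = re.compile(
    r"####<(?P<time>[^>]*)> <(?P<severity>[^>]*)> <(?P<subsystem>[^>]*)> "
    r"<(?P<machine>[^>]*)> <(?P<server>[^>]*)> .*"
    r"<(?P<code>BEA-\d{6})> <(?P<message>.*)>\s*$"
)
LOST = re.compile(r"Lost (\d+) multicast message\(s\)")
REMOVED = re.compile(r"Removing (\S+) ")

def summarize(lines):
    """Return (lost, removed) for an iterable of log lines.

    lost:    Counter mapping reporting server -> total lost multicast messages
    removed: list of (time, reporting server, removed member) tuples
    """
    lost = Counter()
    removed = []
    for line in lines:
        record = RECORD.match(line.strip())
        if not record:
            continue  # wrapped, truncated or unrelated line
        if record["code"] == "BEA-000115":
            count = LOST.search(record["message"])
            if count:
                lost[record["server"]] += int(count.group(1))
        elif record["code"] == "BEA-000112":
            member = REMOVED.match(record["message"])
            if member:
                removed.append((record["time"], record["server"], member.group(1)))
    return lost, removed
```

Feeding it the log of each managed server (for example `summarize(open(path))`) gives a per-server total of lost messages and a list of removal events whose timestamps can then be compared by hand.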

SJ,
Thanks, that's the perfect explanation I was looking for. We always create clusters from the console, and it could be that we used the MULTICAST messaging mode in the past, hence the entries in config.xml. What made me raise the question "will UNICAST or MULTICAST be used" is that whenever a server drops out of the cluster, I see the following message written to each managed server log. Ideally, the following should be written to the log only if the multicast messaging mode is in operation, right?
<weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1276490260768> <BEA-000115> <Lost 2 multicast message(s).>
<weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1276490260768> <BEA-000115> <Lost 2 multicast message(s).>
<weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1276490261355> <BEA-000115> <Lost 2 multicast message(s).>
<weblogic.cluster.MessageReceiver> <<WLS Kernel>> <> <> <1276490261355> <BEA-000115> <Lost 2 multicast message(s).>
The above message is not written all the time, but only when a server is removed from the cluster group. Please be informed that I have enabled unicast debug mode. Will unicast also write messages like the above when a heartbeat message is lost?
To trace our issue further, I have to manually remove the references from config.xml and monitor for some time. It is still a mystery why the cluster members are dropping out. Sometimes, soon after cluster instances have dropped out, I can see the drop-out frequency as "Rarely", and after a week or so the members are regrouped with a different group leader. Are you aware of any issue with unicast messaging mode in WL10 MP2?
Is it a good idea to test multicast?
Thanks a lot for your time.
-RR
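
Before editing config.xml by hand, it may help to list what each cluster entry currently declares. The sketch below is an illustration only: the element names `cluster-messaging-mode`, `multicast-address` and `multicast-port` are assumptions about the domain schema and should be checked against your own config.xml, and the function name `cluster_messaging` is invented for this example.

```python
import xml.etree.ElementTree as ET

def local(tag):
    """Strip the XML namespace, e.g. '{ns}cluster' -> 'cluster'."""
    return tag.rsplit("}", 1)[-1]

def cluster_messaging(source):
    """Return {cluster name: messaging settings} from a config.xml path or file object."""
    root = ET.parse(source).getroot()
    report = {}
    for element in root.iter():
        if local(element.tag) != "cluster":
            continue
        # Collect the direct children of this <cluster> element.
        fields = {local(child.tag): (child.text or "").strip() for child in element}
        name = fields.get("name")
        if name is None:
            continue  # a <cluster> reference inside a <server>, not a definition
        report[name] = {
            # Element names below are assumed, not confirmed by the thread.
            "mode": fields.get("cluster-messaging-mode"),
            "multicast-address": fields.get("multicast-address"),
            "multicast-port": fields.get("multicast-port"),
        }
    return report
```

A cluster that reports a unicast mode together with a leftover multicast address or port would match the situation described in this thread; whether those leftovers have any effect is exactly the open question, so the script only makes them visible.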
