I need a recommended way of recovering from Bus Off errors

I have write problems when a UUT is instructed to reset. When the UUT is in the reset state I get Error Passive warnings and Bus Off errors after attempting to write extended CAN messages using the Frame API.
In brief the test goes like this,
1. I send extended message 0x500 with 8 data bytes containing information to tell the UUT to go into reset.
2. I wait 400ms hoping that the UUT gets at least 1 of the 3 possible messages. It always does and does a reset.
3. I then MUST send message 0x500 with updated data telling the UUT to come out of reset.
Problem is the write fails with a Bus Off error (can't remember the error code as I am typing this at home). I can get this to work in a brute force kind of way by repeating these steps below several times in a loop,
1. reopen the network object
2. reopen all periodic tx objects,
3. do a ncAction NC_OP_RESET on the network object,
4. do a ncAction NC_OP_RESET on all periodic tx objects,
5. do a ncAction NC_OP_START on the network object,
6. do a ncAction NC_OP_START on all the periodic tx objects,
(no warnings or errors so far from these calls, although it occasionally ends up with an exception and NI-CAN internal driver errors. I'm probably abusing the CAN standard and API with all the rapid opening and closing of all these handles and blindly ignoring errors.)
7. then do a ncWrite for all periodic tx objects (we usually get Error Passive warnings here, if the write is repeated it frequently gets a Bus Off error).
When the UUT (by chance in all honesty) gets the 0x500 message and comes out of reset, CAN operations are fine, but the problem lies when the UUT is in reset, I can't send the updated 0x500 message to tell it to come out of reset. I get randomly Error Passive errors and Bus Off errors.
Found out today this is what the UUT is doing when in reset (written in PDL),
while not received 0x500 with data indicating to come out of reset
   possibly repower most of the UUT circuitry (I can't remember)
   reset Bosch CANBUS controller circuitry on ASIC (takes 2us I'm told)
   do some unit reset processing, takes up to 100ms
wend
(yes, it resets the CANBUS controller roughly every 100ms!)
I need a sensible way of recovering from a Bus Off error and retry sending that 0x500 message again.
Any thoughts, comments, solutions?
Regards.

Hi Flump,
The idea here is that many CAN devices will "sleep" after some predetermined period of inactivity (not receiving a frame). In such cases, the device usually wakes up after seeing activity on the bus, where the amount of time it takes to go from the "sleep" state to an "active" state will inevitably vary from device to device. Well, suppose the controller on a CAN network sends a frame to a device which is "sleeping," and the device takes, for argument's sake, 10 seconds to "wake up" and become active again. By definition in the CAN standard, frames which are not acknowledged will be retransmitted. Also in the CAN standard is the requirement that a device or controller implement transmit and receive "error counters" so that an "errant" device or controller can be "silenced" if it continues to generate errors. There are 3 basic error states, the last (worst) of which is the Bus Off Error State, which occurs when the transmit error counter exceeds 255. Herein lies the problem: if a device takes a long time to wake up, then a controller will send, and subsequently resend, the frame while it attempts to communicate with the "sleeping" device. Since the controller's transmit error counter will increase by 8 for each frame which is sent and NOT acknowledged, and it will continue sending frames until acknowledged, the controller can actually reach a Bus Off Error state before the device fully "wakes up." This is usually undesirable, and can be prevented.
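To put rough numbers on that, here is a minimal Python sketch of the transmit error counter as described above. It is a simplified model, not a full implementation of the CAN fault-confinement rules: it assumes every failed transmission adds 8 to the counter, uses the usual thresholds (Error Passive at 128, Bus Off above 255), and ignores successful transmissions, error frames and bit stuffing. The function names, and any frame length or bit rate you pass in, are illustrative assumptions and not values from this thread.

```python
ERROR_PASSIVE_THRESHOLD = 128  # counter >= 128 means Error Passive
BUS_OFF_LIMIT = 255            # counter > 255 means Bus Off
TEC_STEP = 8                   # assumed increment per failed transmission


def state_for(tec: int) -> str:
    """Map a transmit error counter value to an error state name."""
    if tec > BUS_OFF_LIMIT:
        return "bus_off"
    if tec >= ERROR_PASSIVE_THRESHOLD:
        return "error_passive"
    return "error_active"


def attempts_until(target_state: str) -> int:
    """Count consecutive failed transmissions needed to reach target_state."""
    if target_state not in ("error_active", "error_passive", "bus_off"):
        raise ValueError(f"unknown state: {target_state}")
    tec = 0
    attempts = 0
    while state_for(tec) != target_state:
        tec += TEC_STEP
        attempts += 1
    return attempts


def seconds_to_bus_off(frame_bits: int, bitrate_bps: int) -> float:
    """Rough lower bound on the time to Bus Off with back-to-back retries.

    frame_bits and bitrate_bps are caller-supplied assumptions; error
    frames and stuff bits are ignored, so real timing will be longer.
    """
    return attempts_until("bus_off") * frame_bits / bitrate_bps


if __name__ == "__main__":
    print("failed attempts to Error Passive:", attempts_until("error_passive"))
    print("failed attempts to Bus Off:", attempts_until("bus_off"))
```

Under these assumptions the controller is Error Passive after 16 consecutive failed attempts and Bus Off after 32, which at typical bit rates is a matter of milliseconds, far shorter than a 400ms wait or a 100ms reset cycle.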
For more information about the CAN standard, see Appendix B of the NI-CAN Hardware and Software Manual linked in the Related Links section below.
The solution may be to send a single wake-up frame (just one time), then delay to allow the device to "wake-up," and then continue normal communication. It is important to realize that when a device "sleeps," it actually relies on the fact that a CAN controller will send frames multiple times. That is, the first frame received when a device is "sleeping" is NOT processed. The sudden voltage change on the bus caused by a frame transmission is sensed by a CAN device and will cause it to resume active operating conditions, but the frame which initiates the wakeup cannot be processed because the hardware was previously asleep (some of it literally not powered). Thus, if we have a mechanism for sending a single "wake-up" frame, and then delay until all devices (or at least the one we intend to communicate with) wake up, we can resume normal communications while knowing deterministically that subsequent commands should/will be processed by the device to which we wish to communicate.
In the NI-CAN API, the way to transmit a single frame - one time only - is by setting the Single Shot Transmit attribute to 1 (using the set attribute function: in LabVIEW use the ncSetAttr.vi for the Frame API and CAN Set Property.vi for the Channel API). For Frame API users, the Network Configuration object (programmed explicitly) can be used, where of course we must stop and start the task (using ncAction.vi) around the attribute setting. The sequence of events would generally be: Network Config (should have happened anyway at some point), Network Open, Stop, Set Attribute, Start, Write "wakeup frame," proceed with the program after sufficient delay. Please note that the required delay may be very small; the "10 second" wake-up time suggested for a device above is much much longer than a normal device's "wake-up period". Of course, the baud rate used on a given network will factor into how many frames can be sent by a controller in a given period, and therefore how fast a corresponding error counter will increment as a result of unacknowledged frames.
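As a rough illustration of that ordering (stop, set attribute, start, write once, then wait), here is a minimal Python sketch. The `RecordingPort` class and its method names are hypothetical stand-ins that only log what is called; they are not the NI-CAN API. In a real program each step would map onto the corresponding NI-CAN action, set-attribute and write calls on an already opened network interface object, and the settle time would have to be chosen for the device in question.

```python
import time
from typing import Callable, List, Tuple


class RecordingPort:
    """Hypothetical stand-in for an opened CAN network interface.

    It only records the order of operations so the sequence can be checked.
    """

    def __init__(self) -> None:
        self.calls: List[Tuple] = []

    def stop(self) -> None:
        self.calls.append(("stop",))

    def set_single_shot(self, enabled: bool) -> None:
        self.calls.append(("set_single_shot", enabled))

    def start(self) -> None:
        self.calls.append(("start",))

    def write(self, arb_id: int, data: bytes) -> None:
        self.calls.append(("write", arb_id, data))


def send_wakeup_frame(
    port: RecordingPort,
    arb_id: int,
    data: bytes,
    settle_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Send one frame with retransmission disabled, then wait for the device."""
    port.stop()                 # attribute changes need the interface stopped
    port.set_single_shot(True)  # one transmission attempt, no automatic retry
    port.start()
    port.write(arb_id, data)    # the "wake-up" frame; it may go unacknowledged
    sleep(settle_s)             # give the device time to become active


if __name__ == "__main__":
    port = RecordingPort()
    # 0x500 with 8 data bytes mirrors the frame in the question; the payload
    # and the 0.1 s settle time are placeholders.
    send_wakeup_frame(port, 0x500, bytes(8), settle_s=0.1)
    for call in port.calls:
        print(call)
```

Passing `sleep` as a parameter is only a convenience so the sequence can be exercised without real delays.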
Attached is an example, which will write a single "wake-up" frame using the technique described above, where the write will take place when a "Wake-Up" button is clicked.
Is this what you are looking for?
AdamB
Message Edited by AdamB on 12-12-2006 04:47 AM
Applications Engineering Team Leader | National Instruments | UK & Ireland
Attachments:
SendSingleWakeUpFrame.vi (73 KB)

Similar Messages

  • Why do I get "The system has recovered from a serious error" after system shut down

    Every time I start or restart the computer using the attached .vi I get a window with "The system has recovered from a serious error." Although this does not seem to have any effect on the application and I am able to use the test equipment, could someone explain why this is happening?
    LV 8.2
    TestStand 3.5
    MAX 4.1
    NI DAQ Card PXI-4461
    Many Thanks
    Mehran
    Mehran Fard
    Attachments:
    System Shutdown.doc (63 KB)

    Hi Fardm,
    how should we answer that question when you don't attach the VI, but only a picture hidden in some proprietary file format?
    Attach the VI!
    You also have some Rube-Goldberg in your code:
    IF not zero THEN use ErrorCode ELSE use zero
    Why not just connect errorcode directly with BundleByName? You don't need the Select function here...
    Best regards,
    GerdW
    CLAD, using 2009SP1 + LV2011SP1 + LV2014SP1 on WinXP+Win7+cRIO
    Kudos are welcome

  • "system has recovered from a serious error" problem

    I just built a PC on a MSI PT880-Neo (running XP professional), and most of my configurations and whatnot has been worked out.  However, I'm finding that my PC will every so often, inexplicably, just freak out and go to a blue screen with lots of white technical lettering that says something about "prevent damage"--and then proceeds to restart.  Horrible!
    When it boots up, I get a little dialog that says my PC has "recovered from a serious error" with the following:
    BCCode : 7f     BCP1 : 00000000     BCP2 : 00000000     BCP3 : 00000000    
    BCP4 : 00000000     OSVer : 5_1_2600     SP : 1_0     Product : 256_1
    What is all this?  Never really seen anything like it--but it's frustrating and disconcerting.
    If anyone has any suggestions on how to troubleshoot this or what this might be about, I'd really appreciate it.

    First off, are you using the onboard network interface, I had a weird blue screen of death, with an error code I couldn't figure out and all it said, was to call Hardware Vendor. This particular blue screen of death, did not go away until I uninstalled the drivers for the onboard nic, shut down, installed a brand new Netgear FA311 Network card. In my case it wasn't a driver issue at all, like some people here claimed it was, and I shall not name said person or persons.
    If you are using the onboard nic, putting in a PCI nic(network card), could possibly make this problem you are having go away.

  • System has recover from a serious error! Plz help

    For some reason now whenever I boot my computer up I get a message saying it has recovered from a serious error, it used to spontaneously reboot itself but that has stopped now. any ideas on how to fix?

    Hi,
    See if you can boot SAFE MODE, F8 key..............
    Del

  • Windows has recovered from a serious error

    Hi all, I've been reading a lot about overclocking and I finally started today. I have set my FSB to 233, and I have been trying to find the lowest voltage I can run it at. 1.6 works fine but I tried going lower. This resulted in a spontaneous reboot. After Windows booted back up when I increased the voltage, the Windows error reporting dialog came up and said "windows has recovered from a serious error..." So I sent the report... and Internet Explorer opened with a little info. It says the crash was due to a problem with a driver but does not mention a specific driver. Does this sound familiar to anyone? Are there any drivers I should investigate or replace? Thanks a lot for your help!

    Often with overclocking you will see 'file not found' BSODs on startup. It is reasonable to assume that your overclock may have upset a driver, driver component or driver-associated file.
    Don't sweat it, as long as you're not on that vCore now there should be no damage done to Windows.
    Have to admit I've had "Windows has...serious error" before, following a vCore increase on an AMD board.

  • My system recovered from a serious error recently. During the error, Firefox crashed. Since I have restarted my system and Firefox, I cannot use any of the toolbar options such as:close, minimise and maximise. Also, when I right click, an outline of a box

    My system recovered from a serious error recently. During the error, Firefox crashed. Since I have restarted my system and Firefox, I cannot use any of the toolbar options such as:close, minimise and maximise. Also, when I right click, an outline of a box appears without the visible options although on some occasions they appear after 10 seconds or so. Can you advise me of what to do to cure this problem please or is it a case of using a Windows tool such as System Restore?
    == This happened ==
    Every time Firefox opened
    == 22nd May 2010

    I HAVE NOW SOLVED THE PROBLEM BY RUNNING WINDOWS REGISTRY REPAIR TOOL WHICH HAS OBVIOUSLY REPAIRED CORRUPT REGISTRY FILES.

  • Where are temp files for Photoshop CS4? Is there any way to recover from them?

    Hi. My Mac crashed with all my work. I forgot to do Cmd S and now I lost 2 days worth of work in Photoshop CS4. Any chance the work is still in temp files? If yes, is there any way to recover files from them? Best, Caldvin

    If you had any smart objects involved, you might be able to get to them. But no, not your main file.
    How did you manage to work for two days without a save!?!
    We all have to suffer these every now and again; it teaches you to hit “save” every thirty seconds. You’ll be obsessive about it for eight months, then forget, lose hours of work and repeat. Perfectly natural.
    On the plus side doing something the second time is usually quicker…

  • Best way to recover from "drop user; drop table space"

    Hello,
    I am practicing several different RMAN recovery scenarios on Oracle 11g on Windows 2003.
    The scenario that I am stuck on is:
    drop user MYUSER cascade;
    drop table space MYUSER including contents and datafiles;
    Originally, I was trying to do this with RMAN. From further reading, it seems that RMAN fit for this type of recovery.
    What is the best approach to recover from this?
    thanks for any tips.

    DBPITR did not bring back my tablespaces.
    break database
    drop user PWRPLANT CASCADE
    drop role PWRPLANT_ROLE_USER
    drop role PWRPLANT_ROLE_DEV
    drop role PWRPLANT_ROLE_ADMIN
    alter tablespace PWRPLANT_IDX offline
    alter tablespace PWRPLANT offline
    drop tablespace PWRPLANT_IDX INCLUDING CONTENTS AND DATAFILES
    drop tablespace PWRPLANT INCLUDING CONTENTS AND DATAFILES
    recover with DBPITR
    RMAN> run {set until sequence 56; restore database; recover database;}
    executing command: SET until clause
    using target database control file instead of recovery catalog
    Starting restore at 28-APR-11
    allocated channel: ORA_DISK_1
    channel ORA_DISK_1: SID=317 device type=DISK
    channel ORA_DISK_1: starting datafile backup set restore
    channel ORA_DISK_1: specifying datafile(s) to restore from backup set
    channel ORA_DISK_1: restoring datafile 00001 to I:\ORACLE\ORADATA\PWRGAME\SYSTEM01.DBF
    channel ORA_DISK_1: restoring datafile 00002 to I:\ORACLE\ORADATA\PWRGAME\SYSAUX01.DBF
    channel ORA_DISK_1: restoring datafile 00003 to I:\ORACLE\ORADATA\PWRGAME\UNDOTBS01.DBF
    channel ORA_DISK_1: restoring datafile 00004 to I:\ORACLE\ORADATA\PWRGAME\USERS01.DBF
    channel ORA_DISK_1: reading from backup piece I:\ORACLE\FLASH_RECOVERY_AREA\PWRGAME\BACKUPSET\2011_04_28\O1_MF_NNNDF_DATABASE_FULL_BACKUP_6VMMSSXV_.BKP
    channel ORA_DISK_1: piece handle=I:\ORACLE\FLASH_RECOVERY_AREA\PWRGAME\BACKUPSET\2011_04_28\O1_MF_NNNDF_DATABASE_FULL_BACKUP_6VMMSSXV_.BKP tag=DATABASE_FULL_BACKUP
    channel ORA_DISK_1: restored backup piece 1
    channel ORA_DISK_1: restore complete, elapsed time: 00:07:06
    Finished restore at 28-APR-11
    Starting recover at 28-APR-11
    using channel ORA_DISK_1
    starting media recovery
    archived log for thread 1 with sequence 55 is already on disk as file I:\ORACLE\PRODUCT\11.1.0\DB_1\RDBMS\ARC00055_0748950531.001
    archived log file name=I:\ORACLE\PRODUCT\11.1.0\DB_1\RDBMS\ARC00055_0748950531.001 thread=1 sequence=55
    media recovery complete, elapsed time: 00:00:02
    Finished recover at 28-APR-11
    Did I miss something?

  • How to recover from STOP: 0x0000007E Error?

    Satellite L25-S119
    Windows XP Home Edition
    The other day while trying to install a new iPod Touch the laptop died with the blue screen of death as IE was trying to open with the iPod connected to the USB port.
    Every time the machine boots now this blue screen comes up while trying to load windows. Using the boot menu option to disable auto restart on error I was able to record the details as follows:
    Stop: 0x0000007E (0xC0000005, 0x84F22B58, 0xF7AEB69C, 0xF7AE7398)
    Sometimes the 3rd number in the () is 0xF7AE769C, but otherwise the error is always consistent. No other information in the message. Windows did detect the iPod as it booted but never got to the point where IE started. I have tried every option in the boot menu but Windows will not start up in any mode without this error coming up.
    I got out and opened the recovery disk that came with the computer (this is the only disk we received with it) and the only thing this will do is format the hard disk and re-install the original software. I certainly do not want to wipe out everything on my disk so how do I boot up the machine so I can try to determine what happened to it? Doesn't Toshiba provide a "repair" disk that allows you to start the computer and try to repair the installation as it is without going to such extreme measures as formatting the entire disk? I know another machine I have provided this.
    Can you please help me find a way to boot up to repair this?
    Thank you.......

    To access the files without issues with permissions, you can try using a PE-based diagnostic disc... I think UBCD (Ultimate Boot CD) has something like that, or anything based on BartPE, PE Builder, etc. All of the PE (Preinstallation Environment?) discs I've used ignore file permissions entirely, so that should be okay for your uses.
    RE: Purchasing another Toshiba laptop... I'm actually using an HP laptop now, and have a ridiculous number of HP products. Mostly because I get them broken and repair them, or buy them cheap. I only got the Toshiba laptop because a family member needed one, and I got it for 65% off retail price at Best Buy (yay for November clearance!). I got a 17" HP laptop for $199 and a 15.4" Compaq laptop for $75... all with one-year warranties, all important accessories, and in good condition.
    Either way, the parts inside the computer are usually made by the same companies... AOI, LG, Samsung, Philips for the LCD; Hitachi, Toshiba, WD, Seagate, Samsung, Fujitsu for the HDD; CPU/Chipset: Intel/AMD/VIA; GPU: AMD(ATI)/nVidia/Intel/VIA; DVD:Toshiba/Samsung, LG, Pioneer, Philips; Touchpad:ALPS, Synaptics; Audio: Realtek, Conexant, Creative; LAN/WLAN: Marvell, Intel, Realtek, Broadcom; Motherboard/Assembly: Compal, HannStar, Winbook (I think)....obviously there are others, but you get the idea.
    Very few companies actually manufacture their own products...... some don't even design their own products.
    And cee_64, some information there isn't entirely accurate (but close enough to call me nit-picky, which I'll accept). Some manufacturers do include recovery discs, specifically all Dell computers until recently, but many still come with them; and ASUS laptops. Additionally, many Dell discs are near-generic, and can be used to install (but not activate) Windows on other systems.
    Toshiba isn't the only manufacturer that makes HDDs and laptops... Fujitsu and Samsung make laptops and HDDs, and both supply diagnostic tools. Samsung also makes RAM, ODDs, and other semiconductor-based products. (Back in the day, when IBM still made consumer IT, it made HDDs and laptops).

  • How to recover from Oracle XA Error Native Error 24776?

    Hello,
    I believe that I have a connection leak somewhere and that an Oracle
    connection is hung.
    I get the following error when trying to access the data source
    getConnection method.
    How can I recover from this? The WL Server console shows 0 connections
    to the connection pool, but obviously some are in there.
    Thanks,
    java.sql.SQLException: XA error: XAER_RMERR : A resource manager error has occured in the transaction branch start() failed on resource 'Oracle Connection Pool': XAER_RMERR : A resource manager error has occured in the transaction branch
    javax.transaction.xa.XAException: [BEA][Oracle JDBC Driver]Oracle XA Error Occurred. Native Error: 24776
    at weblogic.jdbcx.oracle.OracleImplXAResource.checkError(Unknown Source)
    at weblogic.jdbcx.oracle.OracleImplXAResource.start(Unknown Source)
    at weblogic.jdbcx.base.BaseXAResource.start(Unknown Source)
    at weblogic.jdbc.jta.DataSource.start(DataSource.java:617)
    at weblogic.transaction.internal.XAServerResourceInfo.start(XAServerResourceInfo.java:1075)
    at weblogic.transaction.internal.XAServerResourceInfo.xaStart(XAServerResourceInfo.java:1007)
    at weblogic.transaction.internal.XAServerResourceInfo.enlist(XAServerResourceInfo.java:203)
    at weblogic.transaction.internal.ServerTransactionImpl.enlistResource(ServerTransactionImpl.java:419)
    at weblogic.jdbc.jta.DataSource.enlist(DataSource.java:1287)
    at weblogic.jdbc.jta.DataSource.refreshXAConnAndEnlist(DataSource.java:1250)
    at weblogic.jdbc.jta.DataSource.getConnection(DataSource.java:385)
    at weblogic.jdbc.jta.DataSource.connect(DataSource.java:343)
    at weblogic.jdbc.common.internal.RmiDataSource.getConnection(RmiDataSource.java:305)

    Steven ,
    It's already been answered:
    http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&c2coff=1&threadm=4009ccc9%241%40newsgroups.bea.com&rnum=1&prev=/groups%3Fq%3D24776%2520OracleImplXAResource%26hl%3Den%26lr%3D%26ie%3DUTF-8%26c2coff%3D1%26sa%3DN%26tab%3Dwg
    Regards,
    Slava Imeshev

  • I need help in restoring data recovered from hard drive crash to new hard drive

    I have the data that was recovered by a recovery service (long story) from my wife's hard drive after it crashed, which is now stored on my computer. There are two user data sets that have been recovered, one for each of us. I need to put each on my wife's computer with its new hard drive, which now has OSX 10.10.? (Yosemite) on it.
    My wife's computer was running OSX 10.7 (Lion) when the hard drive failed and my computer is still running OSX 10.6.8 (Snow Leopard).
    I was told that I should use the Data Transfer Utility (starting my computer up while holding the "T" key), but I have forgotten what I have to do when I start up her computer. Are there any operating system compatibility problems that are sensitive to this process?

    That is a Microsoft Windows question??  You should probably ask this question on Microsoft's Forums.  BTW there is no "Windows 9".  Only Windows 7, Windows 8 and apparently MS is skipping the number 9 in favor of 10.

  • Best way to recover from incorrect migration

    Let me start out by saying that we know we did things incorrectly, and that "doing it over again the right way" isn't the advice we're looking for in this discussion.
    Here's the situation:
    Existing database instance (A) is Oracle 10g on AIX.
    New database instance (B) is Oracle 11g on Solaris.
    We exported our schema from A and imported it into B.
    Unfortunately, about 12 tables in the schema were versioned using Workspace Manager in the 10g instance, but no data from WMSYS was exported.
    So we have a situation where we have MyTable_LT tables containing rows, but the MyTable views are reporting 0 rows. User_WM_Versioned_Tables reports 0 rows.
    It would be perfectly OK with us if we could "unversion" the tables in the new instance, and start again.
    Is there a way to determine which rows in the _LT tables are belonging to the LIVE workspace?
    Alternatively, is there a way to export/import the metadata from WMSYS in instance A into instance B?
    Thanks in advance for your advice.

    Ben -
    Thank you - that is very helpful. Your first solution looked like it would be best.
    Yesterday I looked at obtaining the LIVE version data from the source DB instance, but the drift in data content since the time the original export was done meant that the data refresh would have been fraught with risk, just as you identified.
    I convinced the powers-that-be that a do-over of the full export/import was the best way to go, however, this time we first un-versioned the tables in the source DB.
    I intend to study the documentation for how best to perform this process if we *did* want to retain all the workspaces and versioned data, but can I beg your indulgence and ask you what you would recommend?
    Assuming a clean install of Oracle 11g (complete with WMSYS) as the destination, would it be appropriate to:
    drop the WMSYS schema in the destination;
    export the MY_SCHEMA and WMSYS from the source;
    import into the destination
    upgrade the WMSYS in the destination from 10g to 11g
    Is this the basic process, or am I missing something important?
    Thanks,
    - Colin

  • What is the best way to recover from an error which requires a reconnect (e.g. ORA-01033)?

    We use 11g with the OCI library.  A session pool was created for the server process (OCISessionPoolCreate), and a session was obtained in a thread (OCISessionGet).  Is it sufficient to call OCISessionRelease(...,OCI_SESSRLS_DROPSESS) in the thread, and then get a new session?  Or should the application call OCISessionPoolDestroy(...,OCI_SPD_FORCE), and then recreate the pool?

    Do the following:
    Copy the backup library package/folder to the Pictures folder on your new Mac. If you have any other libraries there move them to the Desktop first.
    Download and run iPhoto Library Upgrader 1.1 on the library to convert it to the new format.  The app will be in your Applications/Utilities folder.
    Launch iPhoto 9.5.1 and open the "converted" library.
    Now you can delete any libraries that you previously moved to the Desktop.

  • ASM auto recover from I/O Error

    My question is about ASM: is there a way to have it auto recover when there are I/O errors? In my environment we have ASM on raw devices and the I/O paths were not available, so ASM shut down all I/O to diskgroups, which in turn caused the database using that ASM to crash. What I want to know is, when ASM detects the I/O errors, will it continue to try I/O for a given period, and if so can that period be defined by a parameter in the init.ora?

    When there are I/O failures to multiple spindles in a diskgroup (that is, ASM can't isolate the failed disk and throw it out of the diskgroup or there are no other accessible failgroups left for this diskgroup,) ASM shuts down immediately and takes dependent instances with it to prevent damage to the database.
    If you think of it, the only way your database can be protected from corruption and work loss under these circumstances is to stop all attempts to write anything to the failed path and shutdown immediately not waiting for the path to recover. If you do wait and the path does not recover, more work might be performed in the database while this wait times out and it will be lost because the database will ultimately crash anyway. So the earlier you shutdown, the less crash recovery you'll need and the less workload you will lose. This also minimizes the chance that some data makes its way to the physical storage and gets corrupted because of malfunctioning controller.
    So the answer to your question is no, it will not retry beyond what the OS I/O layer will under the hood.
    Regards,
    Vladimir M. Zakharychev

  • How to recover from database consistency errors?

    I have a SQL Server 2005 cluster. Due to a failure of the SAN storage's controller and disks, one of my SharePoint content DBs has been corrupted and has been causing numerous errors in SharePoint. I have run the command "DBCC CHECKDB WITH NO_INFOMSGS" and the truncated output is as follows:
    Table error: Object ID 53575229, index ID 1, partition ID 72057594038583296, alloc unit ID 72057594043564032 (type In-row data), page ID (3:21580503) contains an incorrect page ID in its page header. The PageId in the page header = (3:21580511).
    CHECKDB found 0 allocation errors and 6 consistency errors in table 'AllDocs' (object ID 53575229).
    Table error: Object ID 53575229, index ID 1, partition ID 72057594038583296, alloc unit ID 72057594043564032 (type In-row data), page ID (3:21580503) contains an incorrect page ID in its page header. The PageId in the page header = (3:21580511).
    CHECKDB found 0 allocation errors and 6 consistency errors in table 'AllDocs' (object ID 53575229).
    Object ID 1058102810, index ID 4, partition ID 72057594052411392, alloc unit ID 72057594058571776 (type In-row data): Page (3:21580478) could not be processed.  See other errors for details.
    CHECKDB found 0 allocation errors and 4 consistency errors in table 'EventCache' (object ID 1058102810).
    Table error: Object ID 1762105318, index ID 1, partition ID 72057594055819264, alloc unit ID 72057594062897152 (type LOB data). The off-row data node at page (3:21985593), slot 35, text ID 5702751551488 is not referenced.
    Msg 8964, Level 16, State 1, Line 1
    Table error: Object ID 1762105318, index ID 1, partition ID 72057594055819264, alloc unit ID 72057594062897152 (type LOB data). The off-row data node at page (3:21985594), slot 14, text ID 5702751354880 is not referenced.
    Msg 8986, Level 16, State 1, Line 1
    Too many errors found (201) for object ID 1762105318. To see all error messages rerun the statement using "WITH ALL_ERRORMSGS".
    CHECKDB found 0 allocation errors and 307 consistency errors in table 'AuditData' (object ID 1762105318).
    CHECKDB found 0 allocation errors and 363 consistency errors in database 'ALCIM_WSS_Content'.
    repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (WSS_Content).
    Error show in the event log:
    SQL Server detected a logical consistency-based I/O error: incorrect pageid (expected 3:21580475; actual 0:0). It occurred during a read of page (3:21580475) in database ID 9 at offset 0x00002929576000 in file 'E:\Microsoft SQL Server\Data\MSSQL.1\MSSQL\DATA\WSS_Content_2.ndf'. 
    The last DB backup was created about a month ago, so doing a DB restore will be my last choice. Is it possible to recover the DB without data loss using "DBCC CHECKDB ('WSS_Content', REPAIR_REBUILD)"? Is there any alternative method to achieve my goal?
    Thank you.

    Hi,
    Check this part of the output that you have posted
    "repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (WSS_Content)."
    Which clearly states that your only option is "repair_allow_data_loss". This option should ONLY be tried as a last resort. If you have any chance of restoring the backup as Bass_player suggested that should be your way. Even if you run repair_allow_data_loss
    and it runs successfully and fixes the corruption, you still would be facing logical corruption with data, as we never know which all records repair_allow_data_loss removes.
    Moreover, in the case of SharePoint databases, as far as I know, Microsoft SharePoint Support has never supported SharePoint databases which were repaired. They will only support a backup of the database in case of corruption.
    I would suggest you to start working on a better disaster recovery plan in the mean while you are waiting for the backups :)
    HTH,
    Regards, Ashwin Menon My Blog - http:\\sqllearnings.wordpress.com
