Received: (at submit) by bugs.debian.org; 19 Jul 2001 20:37:56 +0000 From sharkey@nngroup.physics.sunysb.edu Thu Jul 19 15:37:55 2001 Return-path: Received: from ale.physics.sunysb.edu [::ffff:129.49.56.40] by master.debian.org with esmtp (Exim 3.12 1 (Debian)) id 15NKYV-0004Nu-00; Thu, 19 Jul 2001 15:37:55 -0500 Received: from sharkey by ale.physics.sunysb.edu with local (Exim 3.22 #1 (Debian)) id 15NKYH-0001En-00; Thu, 19 Jul 2001 16:37:41 -0400 From: Eric Sharkey Subject: raidtools2: data loss when recovering from multiple "bad" disks To: submit@bugs.debian.org X-Mailer: bug 3.3.9 Message-Id: Sender: Eric Sharkey Date: Thu, 19 Jul 2001 16:37:41 -0400 Delivered-To: submit@bugs.debian.org Package: raidtools2 Version: 0.90.990824-11 Severity: grave I just lost two weeks worth of data on my primary raid due to an error in raidtools reconstruction procedure. I'm still trying to work out exactly what happened. This is mostly my own fault for not making backups and ignoring a known problem, but, raidtools could have performed better in this case. I have /dev/md0 mounted on /home, so I still have /var/log intact and can go through and figure out exactly what broke when. For me, /dev/md0 is a raid1 (mirror) combination of /dev/hde1 and /dev/hdg1. The first sign of trouble is here, this seems to be flakey hardware or a kernel bug causing DMA problems: Jul 4 23:52:37 ale kernel: hdg: timeout waiting for DMA Jul 4 23:52:37 ale kernel: ide_dmaproc: chipset supported ide_dma_timeout func only: 14 Jul 4 23:52:37 ale kernel: hdg: irq timeout: status=0x50 { DriveReady SeekComplete } Jul 4 23:52:37 ale kernel: hde: timeout waiting for DMA Jul 4 23:52:37 ale kernel: ide_dmaproc: chipset supported ide_dma_timeout func only: 14 Jul 4 23:52:37 ale kernel: hde: irq timeout: status=0x50 { DriveReady SeekComplete } Jul 4 23:52:45 ale kernel: hdg: timeout waiting for DMA Jul 4 23:52:45 ale kernel: ide_dmaproc: chipset supported ide_dma_timeout func only: 14 Jul 4 23:52:46 ale kernel: hdg: irq timeout: status=0x50 { DriveReady SeekComplete } Jul 5 00:07:08 ale -- MARK -- Jul 5 00:27:08 ale -- MARK -- Jul 5 00:47:08 ale -- MARK -- Jul 5 00:48:28 ale kernel: hde: status timeout: status=0x80 { Busy } Jul 5 00:48:28 ale kernel: hde: DMA disabled Jul 5 00:48:29 ale kernel: ide2: reset: master: error (0x00?) Jul 5 00:48:29 ale kernel: hde: status error: status=0x00 { } Jul 5 00:48:29 ale kernel: hde: drive not ready for command Jul 5 00:48:29 ale kernel: hde: status error: status=0x00 { } Jul 5 00:48:29 ale kernel: hde: drive not ready for command Jul 5 00:48:29 ale kernel: hde: status error: status=0x00 { } Jul 5 00:48:29 ale kernel: hde: drive not ready for command Jul 5 00:48:29 ale kernel: hde: status error: status=0x00 { } Jul 5 00:48:29 ale kernel: hde: drive not ready for command Jul 5 00:48:29 ale kernel: hde: drive not ready for command Jul 5 00:48:32 ale kernel: ide2: reset: master: error (0x00?) Jul 5 00:48:32 ale kernel: hde: status error: status=0x00 { } Jul 5 00:48:32 ale kernel: end_request: I/O error, dev 21:01 (hde), sector 16817840 Jul 5 00:48:32 ale kernel: ^IOperation continuing on 1 devices At this point /dev/hde1 is marked bad, and the raid continues in degraded mode using /dev/hdg1 only. Later, it happens on the other drive: Jul 5 13:29:33 ale kernel: hdg: lost interrupt Jul 5 13:30:03 ale last message repeated 3 times Jul 5 13:31:03 ale last message repeated 6 times Jul 5 13:32:03 ale last message repeated 6 times Jul 5 13:32:43 ale last message repeated 4 times Jul 5 13:32:54 ale kernel: hdg: irq timeout: status=0x80 { Busy } Jul 5 13:32:56 ale kernel: ide3: reset: master: error (0x00?) Jul 5 13:32:56 ale kernel: hdg: status error: status=0x00 { } Jul 5 13:32:56 ale kernel: hdg: drive not ready for command [clip] Jul 5 13:32:58 ale kernel: ide3: reset: master: error (0x00?) Jul 5 13:32:58 ale kernel: hdg: status error: status=0x00 { } Jul 5 13:32:58 ale kernel: end_request: I/O error, dev 22:01 (hdg), sector 77070400 [clip] Jul 5 13:32:58 ale kernel: ide3: reset: master: error (0x00?)<6>md: recovery thread got woken up ... Jul 5 13:32:58 ale kernel: Jul 5 13:32:58 ale kernel: hdg: status error: status=0x00<6>md: recovery thread finished ... Errors like this pour into /var/log/messages at very high speed until: Jul 5 13:36:21 ale kernel: hdg: status error: status=0x00 { } Jul 5 13:36:22 ale kernel: hdg: drive not ready for command Jul 5 13:36:30 ale kernel: ide3: reset: success and then all is well again. Or is it? At the end of this thrashing, /dev/hdg1 has also been marked as bad, but the machine keeps using it. I neglect the machine, knowing it's in degraded mode, but not having the time to go fix it. Eventually an unrelated problem crops up which requires attention: Jul 19 02:02:53 ale kernel: NETDEV WATCHDOG: eth0: transmit timed out Jul 19 02:02:56 ale kernel: nfs: server cypress not responding, still trying Jul 19 02:02:56 ale last message repeated 4 times Its network card vanishes. I don't have a clue what caused this, but I can't ignore it any longer, so I come in and power cycle the machine. Big mistake. Jul 19 11:53:10 ale kernel: md: raid0 personality registered Jul 19 11:53:10 ale kernel: md: raid1 personality registered Jul 19 11:53:10 ale kernel: md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27Jul 19 11:53:10 ale kernel: md: Autodetecting RAID arrays. Jul 19 11:53:10 ale kernel: (read) hde1's sb offset: 60030336 [events: 0000004b]Jul 19 11:53:10 ale kernel: (read) hdg1's sb offset: 60030336 [events: 0000004b]Jul 19 11:53:10 ale kernel: md: autorun ... Jul 19 11:53:10 ale kernel: md: considering hdg1 ... Jul 19 11:53:10 ale kernel: md: adding hdg1 ... Jul 19 11:53:10 ale kernel: md: adding hde1 ... Jul 19 11:53:10 ale kernel: md: created md0 Jul 19 11:53:10 ale kernel: md: bind Jul 19 11:53:10 ale kernel: md: bind Jul 19 11:53:10 ale kernel: md: running: Jul 19 11:53:10 ale kernel: md: now! Jul 19 11:53:10 ale kernel: md: hdg1's event counter: 0000004b Jul 19 11:53:10 ale kernel: md: hde1's event counter: 0000004b Jul 19 11:53:10 ale kernel: md0: max total readahead window set to 124k Jul 19 11:53:10 ale kernel: md0: 1 data-disks, max readahead per data-disk: 124kJul 19 11:53:10 ale kernel: raid1: device hdg1 operational as mirror 1 Jul 19 11:53:10 ale kernel: raid1: device hde1 operational as mirror 0 Jul 19 11:53:10 ale kernel: raid1: raid set md0 not clean; reconstructing mirrors Jul 19 11:53:10 ale kernel: raid1: raid set md0 active with 2 out of 2 mirrors Jul 19 11:53:10 ale kernel: md: syncing RAID array md0 Jul 19 11:53:10 ale kernel: md: minimum _guaranteed_ reconstruction speed: 100 KB/sec/disc. Jul 19 11:53:10 ale kernel: md: updating md0 RAID superblock on device Jul 19 11:53:10 ale kernel: md: using maximum available idle IO bandwith (but not more than 100000 KB/sec) for reconstruction. Jul 19 11:53:10 ale kernel: md: <6>md: using 124k window, over a total of 60030336 blocks. Jul 19 11:53:10 ale kernel: hdg1 [events: 0000004c](write) hdg1's sb offset: 60030336 Jul 19 11:53:10 ale kernel: md: hde1 [events: 0000004c](write) hde1's sb offset: 60030336 Jul 19 11:53:10 ale kernel: md: ... autorun DONE. It copies the contents of /dev/hde1, which was marked bad first, onto /dev/hdg1, which was marked bad later, overwriting two weeks worth of changes. I blame myself for this, *but* this should not have happened. The recovery process should have copied /dev/hdg1 onto /dev/hde1, and not the other way around! Now, I'm not really sure where raidtools starts and the kernel ends, so this may actually be a kernel problem. It was running 2.4.5 at the time this happened. If you could forward this report to whoever is most responsible for the bits that handle reconstruction, I'd appreciate it. Thanks, Eric -- System Information Debian Release: testing/unstable Kernel Version: Linux ale 2.4.6 #1 SMP Thu Jul 5 14:08:45 EDT 2001 i686 unknown Versions of the packages raidtools2 depends on: ii debconf 0.9.66 Debian configuration management system ii libc6 2.2.3-6 GNU C Library: Shared libraries and Timezone