Report forwarded to debian-bugs-dist@lists.debian.org, Pekka Aleksi Knuutila <pa@debian.org>:
Bug#105924; Package raidtools2.

debian-bugs-dist@lists.debian.org, Pekka Aleksi Knuutila

Subject: Bug#105924: raidtools2: data loss when recovering from multiple "bad" disks
Reply-To: Eric Sharkey , 105924@bugs.debian.org
Resent-From: Eric Sharkey
Orignal-Sender: Eric Sharkey
Resent-To: debian-bugs-dist@lists.debian.org
Resent-CC: Pekka Aleksi Knuutila
Resent-Date: Thu, 19 Jul 2001 20:48:02 GMT
Resent-Message-ID:
Resent-Sender: owner@bugs.debian.org
X-Debian-PR-Message: report 105924
X-Debian-PR-Package: raidtools2
X-Debian-PR-Keywords:
X-Loop: owner@bugs.debian.org
Received: via spool by submit@bugs.debian.org id=B.99557507616991 (code B ref -1); Thu, 19 Jul 2001 20:48:02 GMT
From: Eric Sharkey
To: submit@bugs.debian.org
X-Mailer: bug 3.3.9
Message-Id:
Sender: Eric Sharkey
Date: Thu, 19 Jul 2001 16:37:41 -0400
Delivered-To: submit@bugs.debian.org

Package: raidtools2
Version: 0.90.990824-11
Severity: grave

I just lost two weeks' worth of data on my primary raid due to an error in raidtools' reconstruction procedure. I'm still trying to work out exactly what happened. This is mostly my own fault for not making backups and ignoring a known problem, but raidtools could have performed better in this case.

I have /dev/md0 mounted on /home, so I still have /var/log intact and can go through and figure out exactly what broke when. For me, /dev/md0 is a raid1 (mirror) combination of /dev/hde1 and /dev/hdg1.
The first sign of trouble is here; this seems to be flakey hardware or a kernel bug causing DMA problems:

Jul 4 23:52:37 ale kernel: hdg: timeout waiting for DMA
Jul 4 23:52:37 ale kernel: ide_dmaproc: chipset supported ide_dma_timeout func only: 14
Jul 4 23:52:37 ale kernel: hdg: irq timeout: status=0x50 { DriveReady SeekComplete }
Jul 4 23:52:37 ale kernel: hde: timeout waiting for DMA
Jul 4 23:52:37 ale kernel: ide_dmaproc: chipset supported ide_dma_timeout func only: 14
Jul 4 23:52:37 ale kernel: hde: irq timeout: status=0x50 { DriveReady SeekComplete }
Jul 4 23:52:45 ale kernel: hdg: timeout waiting for DMA
Jul 4 23:52:45 ale kernel: ide_dmaproc: chipset supported ide_dma_timeout func only: 14
Jul 4 23:52:46 ale kernel: hdg: irq timeout: status=0x50 { DriveReady SeekComplete }
Jul 5 00:07:08 ale -- MARK --
Jul 5 00:27:08 ale -- MARK --
Jul 5 00:47:08 ale -- MARK --
Jul 5 00:48:28 ale kernel: hde: status timeout: status=0x80 { Busy }
Jul 5 00:48:28 ale kernel: hde: DMA disabled
Jul 5 00:48:29 ale kernel: ide2: reset: master: error (0x00?)
Jul 5 00:48:29 ale kernel: hde: status error: status=0x00 { }
Jul 5 00:48:29 ale kernel: hde: drive not ready for command
Jul 5 00:48:29 ale kernel: hde: status error: status=0x00 { }
Jul 5 00:48:29 ale kernel: hde: drive not ready for command
Jul 5 00:48:29 ale kernel: hde: status error: status=0x00 { }
Jul 5 00:48:29 ale kernel: hde: drive not ready for command
Jul 5 00:48:29 ale kernel: hde: status error: status=0x00 { }
Jul 5 00:48:29 ale kernel: hde: drive not ready for command
Jul 5 00:48:29 ale kernel: hde: drive not ready for command
Jul 5 00:48:32 ale kernel: ide2: reset: master: error (0x00?)
Jul 5 00:48:32 ale kernel: hde: status error: status=0x00 { }
Jul 5 00:48:32 ale kernel: end_request: I/O error, dev 21:01 (hde), sector 16817840
Jul 5 00:48:32 ale kernel: ^IOperation continuing on 1 devices

At this point /dev/hde1 is marked bad, and the raid continues in degraded mode using /dev/hdg1 only.
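To establish which disk failed first from a syslog like the one above, the `end_request: I/O error` lines can be pulled out mechanically. The following is a minimal Python sketch under stated assumptions: the function name `first_io_errors` and the regular expression are illustrative only (they are not part of raidtools or the kernel), and the pattern is fitted to the 2.4-era log format quoted in this report.

```python
import re

# Matches kernel I/O error lines in the format quoted in this report, e.g.:
#   Jul 5 00:48:32 ale kernel: end_request: I/O error, dev 21:01 (hde), sector 16817840
IO_ERROR = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: end_request: I/O error, "
    r"dev \S+ \((\w+)\), sector (\d+)"
)


def first_io_errors(lines):
    """Map each device name to the (timestamp, sector) of its first I/O error.

    Hypothetical helper for illustration; later errors on the same
    device are ignored so only the first failure is recorded.
    """
    first = {}
    for line in lines:
        match = IO_ERROR.match(line)
        if match:
            stamp, dev, sector = match.group(1), match.group(2), int(match.group(3))
            first.setdefault(dev, (stamp, sector))
    return first


# Sample lines copied from the log excerpt above.
sample = [
    "Jul 5 00:48:32 ale kernel: hde: status error: status=0x00 { }",
    "Jul 5 00:48:32 ale kernel: end_request: I/O error, dev 21:01 (hde), sector 16817840",
    "Jul 5 00:48:32 ale kernel: ^IOperation continuing on 1 devices",
]
print(first_io_errors(sample))
```

Feeding the whole of /var/log/messages through the same function would give one entry per failed device, which makes the order of failures easy to read off.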
Later, it happens on the other drive:

Jul 5 13:29:33 ale kernel: hdg: lost interrupt
Jul 5 13:30:03 ale last message repeated 3 times
Jul 5 13:31:03 ale last message repeated 6 times
Jul 5 13:32:03 ale last message repeated 6 times
Jul 5 13:32:43 ale last message repeated 4 times
Jul 5 13:32:54 ale kernel: hdg: irq timeout: status=0x80 { Busy }
Jul 5 13:32:56 ale kernel: ide3: reset: master: error (0x00?)
Jul 5 13:32:56 ale kernel: hdg: status error: status=0x00 { }
Jul 5 13:32:56 ale kernel: hdg: drive not ready for command
[clip]
Jul 5 13:32:58 ale kernel: ide3: reset: master: error (0x00?)
Jul 5 13:32:58 ale kernel: hdg: status error: status=0x00 { }
Jul 5 13:32:58 ale kernel: end_request: I/O error, dev 22:01 (hdg), sector 77070400
[clip]
Jul 5 13:32:58 ale kernel: ide3: reset: master: error (0x00?)<6>md: recovery thread got woken up ...
Jul 5 13:32:58 ale kernel:
Jul 5 13:32:58 ale kernel: hdg: status error: status=0x00<6>md: recovery thread finished ...

Errors like this pour into /var/log/messages at very high speed until:

Jul 5 13:36:21 ale kernel: hdg: status error: status=0x00 { }
Jul 5 13:36:22 ale kernel: hdg: drive not ready for command
Jul 5 13:36:30 ale kernel: ide3: reset: success

and then all is well again. Or is it? At the end of this thrashing, /dev/hdg1 has also been marked as bad, but the machine keeps using it. I neglect the machine, knowing it's in degraded mode, but not having the time to go fix it. Eventually an unrelated problem crops up which requires attention:

Jul 19 02:02:53 ale kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jul 19 02:02:56 ale kernel: nfs: server cypress not responding, still trying
Jul 19 02:02:56 ale last message repeated 4 times

Its network card vanishes. I don't have a clue what caused this, but I can't ignore it any longer, so I come in and power cycle the machine. Big mistake.
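The boot log that follows reports an event counter for each member's superblock. As a hedged illustration, the Python sketch below extracts those counters from `(read)` lines of the form quoted in this report and lists any member that lags behind the highest value. The function names are hypothetical, and this is not the md driver's actual logic, only a way to read the log. In this report both hde1 and hdg1 show 0000004b, so the counters alone give no hint as to which mirror holds the newer data.

```python
import re

# Matches md superblock read lines in the format quoted in this report, e.g.:
#   (read) hde1's sb offset: 60030336 [events: 0000004b]
SB_READ = re.compile(r"\(read\) (\w+)'s sb offset: \d+ \[events: ([0-9a-fA-F]+)\]")


def event_counters(log_text):
    """Return {device: event_count} for every superblock read line found."""
    return {dev: int(hexval, 16) for dev, hexval in SB_READ.findall(log_text)}


def stale_members(counters):
    """Return the devices whose event count is below the highest one seen."""
    if not counters:
        return []
    freshest = max(counters.values())
    return sorted(dev for dev, count in counters.items() if count < freshest)


# The two superblock read lines from the boot log in this report.
boot_log = (
    "Jul 19 11:53:10 ale kernel: (read) hde1's sb offset: 60030336 [events: 0000004b]\n"
    "Jul 19 11:53:10 ale kernel: (read) hdg1's sb offset: 60030336 [events: 0000004b]\n"
)
counters = event_counters(boot_log)
print(counters)
print(stale_members(counters))
```

An empty result from `stale_members` means every member reported the same counter, which is the situation shown in the log.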
Jul 19 11:53:10 ale kernel: md: raid0 personality registered
Jul 19 11:53:10 ale kernel: md: raid1 personality registered
Jul 19 11:53:10 ale kernel: md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
Jul 19 11:53:10 ale kernel: md: Autodetecting RAID arrays.
Jul 19 11:53:10 ale kernel: (read) hde1's sb offset: 60030336 [events: 0000004b]
Jul 19 11:53:10 ale kernel: (read) hdg1's sb offset: 60030336 [events: 0000004b]
Jul 19 11:53:10 ale kernel: md: autorun ...
Jul 19 11:53:10 ale kernel: md: considering hdg1 ...
Jul 19 11:53:10 ale kernel: md: adding hdg1 ...
Jul 19 11:53:10 ale kernel: md: adding hde1 ...
Jul 19 11:53:10 ale kernel: md: created md0
Jul 19 11:53:10 ale kernel: md: bind
Jul 19 11:53:10 ale kernel: md: bind
Jul 19 11:53:10 ale kernel: md: running:
Jul 19 11:53:10 ale kernel: md: now!
Jul 19 11:53:10 ale kernel: md: hdg1's event counter: 0000004b
Jul 19 11:53:10 ale kernel: md: hde1's event counter: 0000004b
Jul 19 11:53:10 ale kernel: md0: max total readahead window set to 124k
Jul 19 11:53:10 ale kernel: md0: 1 data-disks, max readahead per data-disk: 124k
Jul 19 11:53:10 ale kernel: raid1: device hdg1 operational as mirror 1
Jul 19 11:53:10 ale kernel: raid1: device hde1 operational as mirror 0
Jul 19 11:53:10 ale kernel: raid1: raid set md0 not clean; reconstructing mirrors
Jul 19 11:53:10 ale kernel: raid1: raid set md0 active with 2 out of 2 mirrors
Jul 19 11:53:10 ale kernel: md: syncing RAID array md0
Jul 19 11:53:10 ale kernel: md: minimum _guaranteed_ reconstruction speed: 100 KB/sec/disc.
Jul 19 11:53:10 ale kernel: md: updating md0 RAID superblock on device
Jul 19 11:53:10 ale kernel: md: using maximum available idle IO bandwith (but not more than 100000 KB/sec) for reconstruction.
Jul 19 11:53:10 ale kernel: md: <6>md: using 124k window, over a total of 60030336 blocks.
Jul 19 11:53:10 ale kernel: hdg1 [events: 0000004c](write) hdg1's sb offset: 60030336
Jul 19 11:53:10 ale kernel: md: hde1 [events: 0000004c](write) hde1's sb offset: 60030336
Jul 19 11:53:10 ale kernel: md: ... autorun DONE.

It copies the contents of /dev/hde1, which was marked bad first, onto /dev/hdg1, which was marked bad later, overwriting two weeks' worth of changes.

I blame myself for this, *but* this should not have happened. The recovery process should have copied /dev/hdg1 onto /dev/hde1, and not the other way around!

Now, I'm not really sure where raidtools starts and the kernel ends, so this may actually be a kernel problem. It was running 2.4.5 at the time this happened. If you could forward this report to whoever is most responsible for the bits that handle reconstruction, I'd appreciate it.

Thanks,
Eric

-- System Information
Debian Release: testing/unstable
Kernel Version: Linux ale 2.4.6 #1 SMP Thu Jul 5 14:08:45 EDT 2001 i686 unknown

Versions of the packages raidtools2 depends on:
ii debconf 0.9.66 Debian configuration management system
ii libc6 2.2.3-6 GNU C Library: Shared libraries and Timezone

Acknowledgement sent to Eric Sharkey <sharkey@superk.physics.sunysb.edu>:
New Bug report received and forwarded. Copy sent to Pekka Aleksi Knuutila <pa@debian.org>.

From: owner@bugs.debian.org (Debian Bug Tracking System)
To: Eric Sharkey
Subject: Bug#105924: Acknowledgement (raidtools2: data loss when recovering from multiple "bad" disks)
Message-ID:
In-Reply-To:
References:
X-Debian-PR-Message: ack 105924

Thank you for the problem report you have sent regarding Debian. This is an automatically generated reply, to let you know your message has been received. It is being forwarded to the developers mailing list for their attention; they will reply in due course.

Your message has been sent to the package maintainer(s):
Pekka Aleksi Knuutila

If you wish to submit further information on your problem, please send it to 105924@bugs.debian.org (and *not* to submit@bugs.debian.org).

Please do not reply to the address at the top of this message, unless you wish to report a problem with the Bug-tracking system.

Darren Benham
(administrator, Debian Bugs database)

Reply sent to Pekka Aleksi Knuutila <zur@edu.lahti.fi>:
You have marked Bug as forwarded.

From: owner@bugs.debian.org (Debian Bug Tracking System)
To: Pekka Aleksi Knuutila
Cc: Pekka Aleksi Knuutila
Bcc: debian-bugs-forwarded@lists.debian.org
Subject: Bug#105924: marked as forwarded (raidtools2: data loss when recovering from multiple "bad" disks)
Message-ID:
In-Reply-To: <20010720212624.I32470@edu.lahti.fi>
References: <20010720212624.I32470@edu.lahti.fi>
X-Debian-PR-Message: forwarded 105924

Your message dated Fri, 20 Jul 2001 21:26:24 +0300 with message-id <20010720212624.I32470@edu.lahti.fi> has caused the Debian Bug report #105924, regarding raidtools2: data loss when recovering from multiple "bad" disks to be marked as having been forwarded to the upstream software author(s) mingo@redhat.com.

(NB: If you are a system administrator and have no idea what I am talking about this indicates a serious mail system misconfiguration somewhere. Please contact me immediately.)

Darren Benham
(administrator, Debian Bugs database)

Received: (at 105924-forwarded) by bugs.debian.org; 20 Jul 2001 18:26:40 +0000
From zur@edu.lahti.fi Fri Jul 20 13:26:40 2001
Return-path:
Received: from spoon.edu.lahti.fi (edu.lahti.fi) [::ffff:212.226.80.23] by master.debian.org with smtp (Exim 3.12 1 (Debian)) id 15Nez1-0001JF-00; Fri, 20 Jul 2001 13:26:39 -0500
Received: (qmail 2912 invoked from network); 20 Jul 2001 18:26:24 -0000
Received: from nexus.edu.lahti.fi (zur@212.226.80.21) by mail.edu.lahti.fi with SMTP; 20 Jul 2001 18:26:24 -0000
Received: by nexus.edu.lahti.fi (sSMTP sendmail emulation); Fri, 20 Jul 2001 21:26:24 +0300
Date: Fri, 20 Jul 2001 21:26:24 +0300
From: Pekka Aleksi Knuutila
To: mingo@redhat.com
Cc: sharkey@superk.physics.sunysb.edu, 105924-forwarded@bugs.debian.org
Subject: [sharkey@superk.physics.sunysb.edu: Bug#105924: raidtools2: data loss when recovering from multiple "bad" disks]
Message-ID: <20010720212624.I32470@edu.lahti.fi>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="ZoaI/ZTpAVc4A5k6"
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
Delivered-To: 105924-forwarded@bugs.debian.org

--ZoaI/ZTpAVc4A5k6
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Eric Sharkey wrote:
> Now, I'm not really sure where raidtools starts and the kernel ends, so
> this may actually be a kernel problem. It was running 2.4.5 at the time
> this happened. If you could forward this report to whoever is most
> responsible for the bits that handle reconstruction, I'd appreciate it.

To my understanding, the resync procedure is handled by the kernel. I'm not sure who is in charge of the md drivers currently, hopefully Ingo Molnar can take a look.

Thanks
--Aleksi

--
P.A. Knuutila
5285A09F ECEB0B22 881EA428 CF7E8E24 1ADD95A3

--ZoaI/ZTpAVc4A5k6--

Severity set to `normal'. Request was from Pekka Aleksi Knuutila <zur@edu.lahti.fi> to control@bugs.debian.org.

Received: (at control) by bugs.debian.org; 13 Jan 2002 21:40:13 +0000
From zur@edu.lahti.fi Sun Jan 13 15:40:13 2002
Return-path:
Received: from tux.edu.lahti.fi (edu.lahti.fi) [212.226.80.30] by master.debian.org with smtp (Exim 3.12 1 (Debian)) id 16PsMO-0007c2-00; Sun, 13 Jan 2002 15:40:12 -0600
Received: (qmail 28042 invoked from network); 13 Jan 2002 21:40:05 -0000
Received: from nexus.edu.lahti.fi (zur@212.226.80.21) by mail.edu.lahti.fi with SMTP; 13 Jan 2002 21:40:05 -0000
Received: by nexus.edu.lahti.fi (sSMTP sendmail emulation); Sun, 13 Jan 2002 23:40:11 +0200
Date: Sun, 13 Jan 2002 23:40:11 +0200
From: Pekka Aleksi Knuutila
To: control@bugs.debian.org
Subject: downgrading #105924
Message-ID: <20020113234011.B24182@edu.lahti.fi>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
Delivered-To: control@bugs.debian.org

severity 105924 normal
thanks

Reply sent to Barry deFreese <bddebian@comcast.net>:
You have taken responsibility.

MIME-Version: 1.0
X-Mailer: MIME-tools 5.420 (Entity 5.420)
X-Loop: owner@bugs.debian.org
From: owner@bugs.debian.org (Debian Bug Tracking System)
To: Barry deFreese
Subject: Bug#105924: marked as done (raidtools2: data loss when recovering from multiple "bad" disks)
Message-ID:
References: <1208229016.752997.27437.nullmailer@comcast.net>
X-Debian-PR-Message: closed 105924
X-Debian-PR-Package: raidtools2
Content-Type: multipart/mixed; boundary="----------=_1208229482-8680-0"

This is a multi-part message in MIME format...

------------=_1208229482-8680-0
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=utf-8

Your message dated Mon, 14 Apr 2008 23:10:16 -0400 with message-id <1208229016.752997.27437.nullmailer@comcast.net> and subject line raidtools2 has been removed from Debian, closing #105924 has caused the Debian Bug report #105924, regarding raidtools2: data loss when recovering from multiple "bad" disks to be marked as done.

This means that you claim that the problem has been dealt with. If this is not the case it is now your responsibility to reopen the Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this message is talking about, this may indicate a serious mail system misconfiguration somewhere. Please contact owner@bugs.debian.org immediately.)
--
105924: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=105924
Debian Bug Tracking System
Contact owner@bugs.debian.org with problems

------------=_1208229482-8680-0
Content-Type: message/rfc822
Content-Disposition: inline
Content-Transfer-Encoding: 7bit

Received: (at submit) by bugs.debian.org; 19 Jul 2001 20:37:56 +0000
Return-path:
Received: from ale.physics.sunysb.edu [::ffff:129.49.56.40] by master.debian.org with esmtp (Exim 3.12 1 (Debian)) id 15NKYV-0004Nu-00; Thu, 19 Jul 2001 15:37:55 -0500
Received: from sharkey by ale.physics.sunysb.edu with local (Exim 3.22 #1 (Debian)) id 15NKYH-0001En-00; Thu, 19 Jul 2001 16:37:41 -0400
From: Eric Sharkey
Subject: raidtools2: data loss when recovering from multiple "bad" disks
To: submit@bugs.debian.org
X-Mailer: bug 3.3.9
Message-Id:
Sender: Eric Sharkey
Date: Thu, 19 Jul 2001 16:37:41 -0400
Delivered-To: submit@bugs.debian.org

Package: raidtools2
Version: 0.90.990824-11
Severity: grave

I just lost two weeks' worth of data on my primary raid due to an error in raidtools' reconstruction procedure. I'm still trying to work out exactly what happened. This is mostly my own fault for not making backups and ignoring a known problem, but raidtools could have performed better in this case.

I have /dev/md0 mounted on /home, so I still have /var/log intact and can go through and figure out exactly what broke when. For me, /dev/md0 is a raid1 (mirror) combination of /dev/hde1 and /dev/hdg1.

The first sign of trouble is here, this seems to be flakey hardware or
a kernel bug causing DMA problems:

Jul 4 23:52:37 ale kernel: hdg: timeout waiting for DMA
Jul 4 23:52:37 ale kernel: ide_dmaproc: chipset supported ide_dma_timeout func only: 14
Jul 4 23:52:37 ale kernel: hdg: irq timeout: status=0x50 { DriveReady SeekComplete }
Jul 4 23:52:37 ale kernel: hde: timeout waiting for DMA
Jul 4 23:52:37 ale kernel: ide_dmaproc: chipset supported ide_dma_timeout func only: 14
Jul 4 23:52:37 ale kernel: hde: irq timeout: status=0x50 { DriveReady SeekComplete }
Jul 4 23:52:45 ale kernel: hdg: timeout waiting for DMA
Jul 4 23:52:45 ale kernel: ide_dmaproc: chipset supported ide_dma_timeout func only: 14
Jul 4 23:52:46 ale kernel: hdg: irq timeout: status=0x50 { DriveReady SeekComplete }
Jul 5 00:07:08 ale -- MARK --
Jul 5 00:27:08 ale -- MARK --
Jul 5 00:47:08 ale -- MARK --
Jul 5 00:48:28 ale kernel: hde: status timeout: status=0x80 { Busy }
Jul 5 00:48:28 ale kernel: hde: DMA disabled
Jul 5 00:48:29 ale kernel: ide2: reset: master: error (0x00?)
Jul 5 00:48:29 ale kernel: hde: status error: status=0x00 { }
Jul 5 00:48:29 ale kernel: hde: drive not ready for command
Jul 5 00:48:29 ale kernel: hde: status error: status=0x00 { }
Jul 5 00:48:29 ale kernel: hde: drive not ready for command
Jul 5 00:48:29 ale kernel: hde: status error: status=0x00 { }
Jul 5 00:48:29 ale kernel: hde: drive not ready for command
Jul 5 00:48:29 ale kernel: hde: status error: status=0x00 { }
Jul 5 00:48:29 ale kernel: hde: drive not ready for command
Jul 5 00:48:29 ale kernel: hde: drive not ready for command
Jul 5 00:48:32 ale kernel: ide2: reset: master: error (0x00?)
Jul 5 00:48:32 ale kernel: hde: status error: status=0x00 { }
Jul 5 00:48:32 ale kernel: end_request: I/O error, dev 21:01 (hde), sector 16817840
Jul 5 00:48:32 ale kernel: ^IOperation continuing on 1 devices

At this point /dev/hde1 is marked bad, and the raid continues in
degraded mode using /dev/hdg1 only.

Later, it happens on the other drive:

Jul 5 13:29:33 ale kernel: hdg: lost interrupt
Jul 5 13:30:03 ale last message repeated 3 times
Jul 5 13:31:03 ale last message repeated 6 times
Jul 5 13:32:03 ale last message repeated 6 times
Jul 5 13:32:43 ale last message repeated 4 times
Jul 5 13:32:54 ale kernel: hdg: irq timeout: status=0x80 { Busy }
Jul 5 13:32:56 ale kernel: ide3: reset: master: error (0x00?)
Jul 5 13:32:56 ale kernel: hdg: status error: status=0x00 { }
Jul 5 13:32:56 ale kernel: hdg: drive not ready for command
[clip]
Jul 5 13:32:58 ale kernel: ide3: reset: master: error (0x00?)
Jul 5 13:32:58 ale kernel: hdg: status error: status=0x00 { }
Jul 5 13:32:58 ale kernel: end_request: I/O error, dev 22:01 (hdg), sector 77070400
[clip]
Jul 5 13:32:58 ale kernel: ide3: reset: master: error (0x00?)<6>md: recovery thread got woken up ...
Jul 5 13:32:58 ale kernel:
Jul 5 13:32:58 ale kernel: hdg: status error: status=0x00<6>md: recovery thread finished ...

Errors like this pour into /var/log/messages at very high speed until:

Jul 5 13:36:21 ale kernel: hdg: status error: status=0x00 { }
Jul 5 13:36:22 ale kernel: hdg: drive not ready for command
Jul 5 13:36:30 ale kernel: ide3: reset: success

and then all is well again. Or is it? At the end of this thrashing,
/dev/hdg1 has also been marked as bad, but the machine keeps using it.

I neglect the machine, knowing it's in degraded mode, but not having
the time to go fix it. Eventually an unrelated problem crops up which
requires attention:

Jul 19 02:02:53 ale kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jul 19 02:02:56 ale kernel: nfs: server cypress not responding, still trying
Jul 19 02:02:56 ale last message repeated 4 times

Its network card vanishes. I don't have a clue what caused this, but I
can't ignore it any longer, so I come in and power cycle the machine.
Big mistake.

Jul 19 11:53:10 ale kernel: md: raid0 personality registered
Jul 19 11:53:10 ale kernel: md: raid1 personality registered
Jul 19 11:53:10 ale kernel: md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
Jul 19 11:53:10 ale kernel: md: Autodetecting RAID arrays.
Jul 19 11:53:10 ale kernel: (read) hde1's sb offset: 60030336 [events: 0000004b]
Jul 19 11:53:10 ale kernel: (read) hdg1's sb offset: 60030336 [events: 0000004b]
Jul 19 11:53:10 ale kernel: md: autorun ...
Jul 19 11:53:10 ale kernel: md: considering hdg1 ...
Jul 19 11:53:10 ale kernel: md: adding hdg1 ...
Jul 19 11:53:10 ale kernel: md: adding hde1 ...
Jul 19 11:53:10 ale kernel: md: created md0
Jul 19 11:53:10 ale kernel: md: bind
Jul 19 11:53:10 ale kernel: md: bind
Jul 19 11:53:10 ale kernel: md: running:
Jul 19 11:53:10 ale kernel: md: now!
Jul 19 11:53:10 ale kernel: md: hdg1's event counter: 0000004b
Jul 19 11:53:10 ale kernel: md: hde1's event counter: 0000004b
Jul 19 11:53:10 ale kernel: md0: max total readahead window set to 124k
Jul 19 11:53:10 ale kernel: md0: 1 data-disks, max readahead per data-disk: 124k
Jul 19 11:53:10 ale kernel: raid1: device hdg1 operational as mirror 1
Jul 19 11:53:10 ale kernel: raid1: device hde1 operational as mirror 0
Jul 19 11:53:10 ale kernel: raid1: raid set md0 not clean; reconstructing mirrors
Jul 19 11:53:10 ale kernel: raid1: raid set md0 active with 2 out of 2 mirrors
Jul 19 11:53:10 ale kernel: md: syncing RAID array md0
Jul 19 11:53:10 ale kernel: md: minimum _guaranteed_ reconstruction speed: 100 KB/sec/disc.
Jul 19 11:53:10 ale kernel: md: updating md0 RAID superblock on device
Jul 19 11:53:10 ale kernel: md: using maximum available idle IO bandwith (but not more than 100000 KB/sec) for reconstruction.
Jul 19 11:53:10 ale kernel: md: <6>md: using 124k window, over a total of 60030336 blocks.
Jul 19 11:53:10 ale kernel: hdg1 [events: 0000004c](write) hdg1's sb offset: 60030336
Jul 19 11:53:10 ale kernel: md: hde1 [events: 0000004c](write) hde1's sb offset: 60030336
Jul 19 11:53:10 ale kernel: md: ... autorun DONE.

It copies the contents of /dev/hde1, which was marked bad first, onto
/dev/hdg1, which was marked bad later, overwriting two weeks worth of
changes.

I blame myself for this, *but* this should not have happened. The
recovery process should have copied /dev/hdg1 onto /dev/hde1, and not
the other way around!

Now, I'm not really sure where raidtools starts and the kernel ends,
so this may actually be a kernel problem. It was running 2.4.5 at the
time this happened. If you could forward this report to whoever is
most responsible for the bits that handle reconstruction, I'd
appreciate it.

Thanks,

Eric

-- System Information
Debian Release: testing/unstable
Kernel Version: Linux ale 2.4.6 #1 SMP Thu Jul 5 14:08:45 EDT 2001 i686 unknown

Versions of the packages raidtools2 depends on:
ii debconf 0.9.66  Debian configuration management system
ii libc6   2.2.3-6 GNU C Library: Shared libraries and Timezone

------------=_1208229482-8680-0
Content-Type: message/rfc822
Content-Disposition: inline
Content-Transfer-Encoding: 7bit

Received: (at 105924-done) by bugs.debian.org; 15 Apr 2008 03:08:26 +0000
X-Spam-Checker-Version: SpamAssassin 3.1.4-bugs.debian.org_2005_01_02 (2006-07-26) on rietz.debian.org
X-Spam-Level:
X-Spam-Status: No, score=-2.0 required=4.0 tests=BAYES_00,DNS_FROM_RFC_POST,
  FORGED_RCVD_HELO,SUBJ_HAS_UNIQ_ID autolearn=no
  version=3.1.4-bugs.debian.org_2005_01_02
Return-path:
Received: from qmta08.emeryville.ca.mail.comcast.net ([76.96.30.80])
  by rietz.debian.org with esmtp (Exim 4.63) (envelope-from )
  id 1JlbX0-0007zY-4r for 105924-done@bugs.debian.org; Tue, 15 Apr 2008 03:08:26 +0000
Received: from OMTA13.emeryville.ca.mail.comcast.net ([76.96.30.52])
  by QMTA08.emeryville.ca.mail.comcast.net with comcast
  id De2o1Z00317UAYkA803E00; Tue, 15 Apr 2008 03:07:01 +0000
Received: from bddebian2.bddebian.com ([71.224.175.179])
  by OMTA13.emeryville.ca.mail.comcast.net with comcast
  id Df8K1Z0063sciBK8Z00000; Tue, 15 Apr 2008 03:08:20 +0000
X-Authority-Analysis: v=1.0 c=1 a=MIoPn-dSKh8A:10 a=dmK9XBzfEoEA:10
  a=xNf9USuDAAAA:8 a=GEYYyrE9ksqY0zE9onsA:9 a=FU2Jw3VsHMYcqgDMxd8A:7
  a=4EFE0p0xMxM5AgsYaN-U8EqqfVwA:4 a=10sAvMsTeQkA:10
Received: (nullmailer pid 27438 invoked by uid 1000); Tue, 15 Apr 2008 03:10:16 -0000
From: Barry deFreese
To: 105924-done@bugs.debian.org
Subject: raidtools2 has been removed from Debian, closing #105924
Date: Mon, 14 Apr 2008 23:10:16 -0400
Message-Id: <1208229016.752997.27437.nullmailer@comcast.net>
X-CrossAssassin-Score: 13

Version: 1.00.3-17+rm

The raidtools2 package has been removed from Debian testing, unstable and
experimental, so I am now closing the bugs that were still opened
against it.

For more information about this package's removal, read
http://bugs.debian.org/298968 . That bug might give the reasons why
this package was removed, and suggestions of possible replacements.

Don't hesitate to reply to this mail if you have any question.

Thank you for your contribution to Debian.

Barry deFreese

------------=_1208229482-8680-0--

Notification sent to Eric Sharkey <sharkey@superk.physics.sunysb.edu>:
Bug acknowledged by developer.

MIME-Version: 1.0
X-Mailer: MIME-tools 5.420 (Entity 5.420)
X-Loop: owner@bugs.debian.org
From: owner@bugs.debian.org (Debian Bug Tracking System)
To: Eric Sharkey
Subject: Bug#105924 closed by Barry deFreese (raidtools2 has been removed from Debian, closing #105924)
Message-ID:
References: <1208229016.752997.27437.nullmailer@comcast.net>
X-Debian-PR-Message: they-closed 105924
X-Debian-PR-Package: raidtools2
Reply-To: 105924@bugs.debian.org
Content-Type: multipart/mixed; boundary="----------=_1208229483-8680-1"

This is a multi-part message in MIME format...

------------=_1208229483-8680-1
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

This is an automatic notification regarding your Bug report
which was filed against the raidtools2 package:

#105924: raidtools2: data loss when recovering from multiple "bad" disks

It has been closed by Barry deFreese .

Their explanation is attached below along with your original report.
If this explanation is unsatisfactory and you have not received a
better one in a separate message then please contact Barry deFreese by
replying to this email.
--
105924: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=105924
Debian Bug Tracking System
Contact owner@bugs.debian.org with problems

------------=_1208229483-8680-1--

Received: (at 105924-done) by bugs.debian.org; 15 Apr 2008 03:08:26 +0000
From bdefreese@comcast.net Tue Apr 15 03:08:26 2008
X-Spam-Checker-Version: SpamAssassin 3.1.4-bugs.debian.org_2005_01_02 (2006-07-26) on rietz.debian.org
X-Spam-Level:
X-Spam-Status: No, score=-2.0 required=4.0 tests=BAYES_00,DNS_FROM_RFC_POST,
  FORGED_RCVD_HELO,SUBJ_HAS_UNIQ_ID autolearn=no
  version=3.1.4-bugs.debian.org_2005_01_02
Return-path:
Received: from qmta08.emeryville.ca.mail.comcast.net ([76.96.30.80])
  by rietz.debian.org with esmtp (Exim 4.63) (envelope-from )
  id 1JlbX0-0007zY-4r for 105924-done@bugs.debian.org; Tue, 15 Apr 2008 03:08:26 +0000
Received: from OMTA13.emeryville.ca.mail.comcast.net ([76.96.30.52])
  by QMTA08.emeryville.ca.mail.comcast.net with comcast
  id De2o1Z00317UAYkA803E00; Tue, 15 Apr 2008 03:07:01 +0000
Received: from bddebian2.bddebian.com ([71.224.175.179])
  by OMTA13.emeryville.ca.mail.comcast.net with comcast
  id Df8K1Z0063sciBK8Z00000; Tue, 15 Apr 2008 03:08:20 +0000
X-Authority-Analysis: v=1.0 c=1 a=MIoPn-dSKh8A:10 a=dmK9XBzfEoEA:10
  a=xNf9USuDAAAA:8 a=GEYYyrE9ksqY0zE9onsA:9 a=FU2Jw3VsHMYcqgDMxd8A:7
  a=4EFE0p0xMxM5AgsYaN-U8EqqfVwA:4 a=10sAvMsTeQkA:10
Received: (nullmailer pid 27438 invoked by uid 1000); Tue, 15 Apr 2008 03:10:16 -0000
From: Barry deFreese
To: 105924-done@bugs.debian.org
Subject: raidtools2 has been removed from Debian, closing #105924
Date: Mon, 14 Apr 2008 23:10:16 -0400
Message-Id: <1208229016.752997.27437.nullmailer@comcast.net>
X-CrossAssassin-Score: 13

Version: 1.00.3-17+rm

The raidtools2 package has been removed from Debian testing, unstable and
experimental, so I am now closing the bugs that were still opened
against it.

For more information about this package's removal, read
http://bugs.debian.org/298968 . That bug might give the reasons why
this package was removed, and suggestions of possible replacements.

Don't hesitate to reply to this mail if you have any question.

Thank you for your contribution to Debian.

Barry deFreese
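
Note on the boot log in the original report: at assembly time both members are logged with the same event counter (0000004b), so as far as the log shows, the counters alone would not indicate which mirror held the newer data. Below is a minimal Python sketch for pulling those counters out of syslog text when triaging a report like this one. The names EVENT_RE, event_counters and freshest_member are hypothetical helpers invented for illustration; this is not how the md driver or raidtools chooses a resync source, only a way to read the log.

```python
import re

# Matches kernel lines such as "md: hdg1's event counter: 0000004b".
EVENT_RE = re.compile(r"md: (\w+)'s event counter: ([0-9a-fA-F]{8})")


def event_counters(log_text):
    """Return {device: counter} for each md event-counter line in log_text."""
    return {dev: int(count, 16) for dev, count in EVENT_RE.findall(log_text)}


def freshest_member(counters):
    """Return the device with the strictly highest counter, else None.

    None means the log gives no way to rank the members (empty or tied).
    """
    if not counters:
        return None
    top = max(counters.values())
    leaders = [dev for dev, count in counters.items() if count == top]
    return leaders[0] if len(leaders) == 1 else None


# Two lines copied from the Jul 19 boot log in the report.
boot_log = """\
Jul 19 11:53:10 ale kernel: md: hdg1's event counter: 0000004b
Jul 19 11:53:10 ale kernel: md: hde1's event counter: 0000004b
"""

counters = event_counters(boot_log)
print(counters)                   # {'hdg1': 75, 'hde1': 75}
print(freshest_member(counters))  # None: the counters are tied
```

With the two lines from the report the helper returns None, which matches the observation that the logged counters do not distinguish /dev/hde1 from /dev/hdg1.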