Report forwarded to debian-bugs-dist@lists.debian.org, Pekka Aleksi Knuutila <pa@debian.org>:
Bug#105924; Package raidtools2.
Subject: Bug#105924: raidtools2: data loss when recovering from multiple "bad" disks
Reply-To: Eric Sharkey , 105924@bugs.debian.org
Resent-From: Eric Sharkey
Orignal-Sender: Eric Sharkey
Resent-To: debian-bugs-dist@lists.debian.org
Resent-CC: Pekka Aleksi Knuutila
Resent-Date: Thu, 19 Jul 2001 20:48:02 GMT
Resent-Message-ID:
Resent-Sender: owner@bugs.debian.org
X-Debian-PR-Message: report 105924
X-Debian-PR-Package: raidtools2
X-Debian-PR-Keywords:
X-Loop: owner@bugs.debian.org
Received: via spool by submit@bugs.debian.org id=B.99557507616991
(code B ref -1); Thu, 19 Jul 2001 20:48:02 GMT
From: Eric Sharkey
To: submit@bugs.debian.org
X-Mailer: bug 3.3.9
Message-Id:
Sender: Eric Sharkey
Date: Thu, 19 Jul 2001 16:37:41 -0400
Delivered-To: submit@bugs.debian.org
Package: raidtools2
Version: 0.90.990824-11
Severity: grave
I just lost two weeks' worth of data on my primary raid due to an error in
raidtools' reconstruction procedure. I'm still trying to work out exactly
what happened. This is mostly my own fault for not making backups and
ignoring a known problem, but raidtools could have performed better in
this case.
I have /dev/md0 mounted on /home, so I still have /var/log intact and
can go through and figure out exactly what broke when. For me, /dev/md0
is a raid1 (mirror) combination of /dev/hde1 and /dev/hdg1.
The first sign of trouble is here; this seems to be flaky hardware or
a kernel bug causing DMA problems:
Jul 4 23:52:37 ale kernel: hdg: timeout waiting for DMA
Jul 4 23:52:37 ale kernel: ide_dmaproc: chipset supported ide_dma_timeout func only: 14
Jul 4 23:52:37 ale kernel: hdg: irq timeout: status=0x50 { DriveReady SeekComplete }
Jul 4 23:52:37 ale kernel: hde: timeout waiting for DMA
Jul 4 23:52:37 ale kernel: ide_dmaproc: chipset supported ide_dma_timeout func only: 14
Jul 4 23:52:37 ale kernel: hde: irq timeout: status=0x50 { DriveReady SeekComplete }
Jul 4 23:52:45 ale kernel: hdg: timeout waiting for DMA
Jul 4 23:52:45 ale kernel: ide_dmaproc: chipset supported ide_dma_timeout func only: 14
Jul 4 23:52:46 ale kernel: hdg: irq timeout: status=0x50 { DriveReady SeekComplete }
Jul 5 00:07:08 ale -- MARK --
Jul 5 00:27:08 ale -- MARK --
Jul 5 00:47:08 ale -- MARK --
Jul 5 00:48:28 ale kernel: hde: status timeout: status=0x80 { Busy }
Jul 5 00:48:28 ale kernel: hde: DMA disabled
Jul 5 00:48:29 ale kernel: ide2: reset: master: error (0x00?)
Jul 5 00:48:29 ale kernel: hde: status error: status=0x00 { }
Jul 5 00:48:29 ale kernel: hde: drive not ready for command
Jul 5 00:48:29 ale kernel: hde: status error: status=0x00 { }
Jul 5 00:48:29 ale kernel: hde: drive not ready for command
Jul 5 00:48:29 ale kernel: hde: status error: status=0x00 { }
Jul 5 00:48:29 ale kernel: hde: drive not ready for command
Jul 5 00:48:29 ale kernel: hde: status error: status=0x00 { }
Jul 5 00:48:29 ale kernel: hde: drive not ready for command
Jul 5 00:48:29 ale kernel: hde: drive not ready for command
Jul 5 00:48:32 ale kernel: ide2: reset: master: error (0x00?)
Jul 5 00:48:32 ale kernel: hde: status error: status=0x00 { }
Jul 5 00:48:32 ale kernel: end_request: I/O error, dev 21:01 (hde), sector 16817840
Jul 5 00:48:32 ale kernel: ^IOperation continuing on 1 devices
At this point /dev/hde1 is marked bad, and the raid continues in degraded
mode using /dev/hdg1 only. Later, it happens on the other drive:
Jul 5 13:29:33 ale kernel: hdg: lost interrupt
Jul 5 13:30:03 ale last message repeated 3 times
Jul 5 13:31:03 ale last message repeated 6 times
Jul 5 13:32:03 ale last message repeated 6 times
Jul 5 13:32:43 ale last message repeated 4 times
Jul 5 13:32:54 ale kernel: hdg: irq timeout: status=0x80 { Busy }
Jul 5 13:32:56 ale kernel: ide3: reset: master: error (0x00?)
Jul 5 13:32:56 ale kernel: hdg: status error: status=0x00 { }
Jul 5 13:32:56 ale kernel: hdg: drive not ready for command
[clip]
Jul 5 13:32:58 ale kernel: ide3: reset: master: error (0x00?)
Jul 5 13:32:58 ale kernel: hdg: status error: status=0x00 { }
Jul 5 13:32:58 ale kernel: end_request: I/O error, dev 22:01 (hdg), sector 77070400
[clip]
Jul 5 13:32:58 ale kernel: ide3: reset: master: error (0x00?)<6>md: recovery thread got woken up ...
Jul 5 13:32:58 ale kernel:
Jul 5 13:32:58 ale kernel: hdg: status error: status=0x00<6>md: recovery thread finished ...
Errors like this pour into /var/log/messages at very high speed until:
Jul 5 13:36:21 ale kernel: hdg: status error: status=0x00 { }
Jul 5 13:36:22 ale kernel: hdg: drive not ready for command
Jul 5 13:36:30 ale kernel: ide3: reset: success
and then all is well again. Or is it? At the end of this thrashing,
/dev/hdg1 has also been marked as bad, but the machine keeps using it.
I neglect the machine, knowing it's in degraded mode, but not having
the time to go fix it.
Eventually an unrelated problem crops up which requires attention:
Jul 19 02:02:53 ale kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jul 19 02:02:56 ale kernel: nfs: server cypress not responding, still trying
Jul 19 02:02:56 ale last message repeated 4 times
Its network card vanishes. I don't have a clue what caused this, but I
can't ignore it any longer, so I come in and power cycle the machine.
Big mistake.
Jul 19 11:53:10 ale kernel: md: raid0 personality registered
Jul 19 11:53:10 ale kernel: md: raid1 personality registered
Jul 19 11:53:10 ale kernel: md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
Jul 19 11:53:10 ale kernel: md: Autodetecting RAID arrays.
Jul 19 11:53:10 ale kernel: (read) hde1's sb offset: 60030336 [events: 0000004b]
Jul 19 11:53:10 ale kernel: (read) hdg1's sb offset: 60030336 [events: 0000004b]
Jul 19 11:53:10 ale kernel: md: autorun ...
Jul 19 11:53:10 ale kernel: md: considering hdg1 ...
Jul 19 11:53:10 ale kernel: md: adding hdg1 ...
Jul 19 11:53:10 ale kernel: md: adding hde1 ...
Jul 19 11:53:10 ale kernel: md: created md0
Jul 19 11:53:10 ale kernel: md: bind
Jul 19 11:53:10 ale kernel: md: bind
Jul 19 11:53:10 ale kernel: md: running:
Jul 19 11:53:10 ale kernel: md: now!
Jul 19 11:53:10 ale kernel: md: hdg1's event counter: 0000004b
Jul 19 11:53:10 ale kernel: md: hde1's event counter: 0000004b
Jul 19 11:53:10 ale kernel: md0: max total readahead window set to 124k
Jul 19 11:53:10 ale kernel: md0: 1 data-disks, max readahead per data-disk: 124k
Jul 19 11:53:10 ale kernel: raid1: device hdg1 operational as mirror 1
Jul 19 11:53:10 ale kernel: raid1: device hde1 operational as mirror 0
Jul 19 11:53:10 ale kernel: raid1: raid set md0 not clean; reconstructing mirrors
Jul 19 11:53:10 ale kernel: raid1: raid set md0 active with 2 out of 2 mirrors
Jul 19 11:53:10 ale kernel: md: syncing RAID array md0
Jul 19 11:53:10 ale kernel: md: minimum _guaranteed_ reconstruction speed: 100 KB/sec/disc.
Jul 19 11:53:10 ale kernel: md: updating md0 RAID superblock on device
Jul 19 11:53:10 ale kernel: md: using maximum available idle IO bandwith (but not more than 100000 KB/sec) for reconstruction.
Jul 19 11:53:10 ale kernel: md: <6>md: using 124k window, over a total of 60030336 blocks.
Jul 19 11:53:10 ale kernel: hdg1 [events: 0000004c](write) hdg1's sb offset: 60030336
Jul 19 11:53:10 ale kernel: md: hde1 [events: 0000004c](write) hde1's sb offset: 60030336
Jul 19 11:53:10 ale kernel: md: ... autorun DONE.
It copies the contents of /dev/hde1, which was marked bad first, onto
/dev/hdg1, which was marked bad later, overwriting two weeks' worth of
changes.
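The event counters that md prints during autorun can be pulled out of the log with a short script. The following is a minimal sketch in Python, assuming syslog lines in the format quoted above; the function name `event_counters` is hypothetical and exists only for this illustration. Run against the two counter lines from this boot, it shows that both members report the same value (0000004b).

```python
import re

# Matches lines such as "md: hdg1's event counter: 0000004b" from the log above.
EVENT_RE = re.compile(r"md: (\w+)'s event counter: ([0-9a-fA-F]+)")


def event_counters(log_lines):
    """Return {device: counter} for every event-counter line found.

    Hypothetical helper: it only parses text, it does not talk to the kernel.
    """
    counters = {}
    for line in log_lines:
        match = EVENT_RE.search(line)
        if match:
            # The counter is printed as zero-padded hexadecimal.
            counters[match.group(1)] = int(match.group(2), 16)
    return counters


# The two counter lines copied from the Jul 19 boot log.
sample = [
    "Jul 19 11:53:10 ale kernel: md: hdg1's event counter: 0000004b",
    "Jul 19 11:53:10 ale kernel: md: hde1's event counter: 0000004b",
]

counters = event_counters(sample)
for device, value in sorted(counters.items()):
    print(f"{device}: {value:#x}")
```

Because the two values are identical, the counters shown in this log do not by themselves indicate which member holds the newer data.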
I blame myself for this, *but* this should not have happened. The recovery
process should have copied /dev/hdg1 onto /dev/hde1, and not the other way
around!
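The behaviour asked for here can be sketched as follows. This is an illustration only, not the md driver's actual logic: the `Mirror` type and the `pick_resync_source` function are hypothetical, and the policy (copy from the member with the highest event counter, and refuse to choose when the counters tie) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mirror:
    """Hypothetical view of one RAID1 member, for illustration only."""
    name: str
    events: int  # superblock event counter, as printed in the boot log


def pick_resync_source(mirrors: List[Mirror]) -> Optional[Mirror]:
    """Return the member to copy from, or None if the choice is ambiguous.

    Assumed policy: the member with the strictly highest event counter is
    treated as the freshest. A tie (or an empty list) yields None so that a
    caller could stop and ask the administrator instead of guessing.
    """
    if not mirrors:
        return None
    newest = max(m.events for m in mirrors)
    freshest = [m for m in mirrors if m.events == newest]
    return freshest[0] if len(freshest) == 1 else None


# Counters as reported in the boot log above: both members are at 0x4b.
tied = [Mirror("hde1", 0x4B), Mirror("hdg1", 0x4B)]

# Hypothetical counters, NOT taken from the log: hdg1 is one event ahead.
skewed = [Mirror("hde1", 0x4B), Mirror("hdg1", 0x4C)]

print(pick_resync_source(tied))
print(pick_resync_source(skewed).name)
```

With the tied counters from the log the sketch declines to pick a source; with the hypothetical skewed counters it selects hdg1, the member that stayed in service longer.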
Now, I'm not really sure where raidtools starts and the kernel ends, so
this may actually be a kernel problem. It was running 2.4.5 at the time
this happened. If you could forward this report to whoever is most
responsible for the bits that handle reconstruction, I'd appreciate it.
Thanks,
Eric
-- System Information
Debian Release: testing/unstable
Kernel Version: Linux ale 2.4.6 #1 SMP Thu Jul 5 14:08:45 EDT 2001 i686 unknown
Versions of the packages raidtools2 depends on:
ii debconf 0.9.66 Debian configuration management system
ii libc6 2.2.3-6 GNU C Library: Shared libraries and Timezone
Acknowledgement sent to Eric Sharkey <sharkey@superk.physics.sunysb.edu>:
New Bug report received and forwarded. Copy sent to Pekka Aleksi Knuutila <pa@debian.org>.
-t
From: owner@bugs.debian.org (Debian Bug Tracking System)
To: Eric Sharkey
Subject: Bug#105924: Acknowledgement (raidtools2: data loss when recovering from multiple "bad" disks)
Message-ID:
In-Reply-To:
References:
X-Debian-PR-Message: ack 105924
Thank you for the problem report you have sent regarding Debian.
This is an automatically generated reply, to let you know your message has
been received. It is being forwarded to the developers mailing list for
their attention; they will reply in due course.
Your message has been sent to the package maintainer(s):
Pekka Aleksi Knuutila
If you wish to submit further information on your problem, please send
it to 105924@bugs.debian.org (and *not* to
submit@bugs.debian.org).
Please do not reply to the address at the top of this message,
unless you wish to report a problem with the Bug-tracking system.
Darren Benham
(administrator, Debian Bugs database)
Received: (at submit) by bugs.debian.org; 19 Jul 2001 20:37:56 +0000
From sharkey@nngroup.physics.sunysb.edu Thu Jul 19 15:37:55 2001
Return-path:
Received: from ale.physics.sunysb.edu [::ffff:129.49.56.40]
by master.debian.org with esmtp (Exim 3.12 1 (Debian))
id 15NKYV-0004Nu-00; Thu, 19 Jul 2001 15:37:55 -0500
Received: from sharkey by ale.physics.sunysb.edu with local (Exim 3.22 #1 (Debian))
id 15NKYH-0001En-00; Thu, 19 Jul 2001 16:37:41 -0400
From: Eric Sharkey
Subject: raidtools2: data loss when recovering from multiple "bad" disks
To: submit@bugs.debian.org
X-Mailer: bug 3.3.9
Message-Id:
Sender: Eric Sharkey
Date: Thu, 19 Jul 2001 16:37:41 -0400
Delivered-To: submit@bugs.debian.org
Reply sent to Pekka Aleksi Knuutila <zur@edu.lahti.fi>:
You have marked Bug as forwarded.
-t
From: owner@bugs.debian.org (Debian Bug Tracking System)
To: Pekka Aleksi Knuutila
Cc: Pekka Aleksi Knuutila
Bcc: debian-bugs-forwarded@lists.debian.org
Subject: Bug#105924: marked as forwarded (raidtools2: data loss when recovering from multiple "bad" disks)
Message-ID:
In-Reply-To: <20010720212624.I32470@edu.lahti.fi>
References: <20010720212624.I32470@edu.lahti.fi>
X-Debian-PR-Message: forwarded 105924
Your message dated Fri, 20 Jul 2001 21:26:24 +0300
with message-id <20010720212624.I32470@edu.lahti.fi>
has caused the Debian Bug report #105924,
regarding raidtools2: data loss when recovering from multiple "bad" disks
to be marked as having been forwarded to the upstream software
author(s) mingo@redhat.com.
(NB: If you are a system administrator and have no idea what I am
talking about this indicates a serious mail system misconfiguration
somewhere. Please contact me immediately.)
Darren Benham
(administrator, Debian Bugs database)
Received: (at 105924-forwarded) by bugs.debian.org; 20 Jul 2001 18:26:40 +0000
From zur@edu.lahti.fi Fri Jul 20 13:26:40 2001
Return-path:
Received: from spoon.edu.lahti.fi (edu.lahti.fi) [::ffff:212.226.80.23]
by master.debian.org with smtp (Exim 3.12 1 (Debian))
id 15Nez1-0001JF-00; Fri, 20 Jul 2001 13:26:39 -0500
Received: (qmail 2912 invoked from network); 20 Jul 2001 18:26:24 -0000
Received: from nexus.edu.lahti.fi (zur@212.226.80.21)
by mail.edu.lahti.fi with SMTP; 20 Jul 2001 18:26:24 -0000
Received: by nexus.edu.lahti.fi (sSMTP sendmail emulation); Fri, 20 Jul 2001 21:26:24 +0300
Date: Fri, 20 Jul 2001 21:26:24 +0300
From: Pekka Aleksi Knuutila
To: mingo@redhat.com
Cc: sharkey@superk.physics.sunysb.edu, 105924-forwarded@bugs.debian.org
Subject: [sharkey@superk.physics.sunysb.edu: Bug#105924: raidtools2: data loss when recovering from multiple "bad" disks]
Message-ID: <20010720212624.I32470@edu.lahti.fi>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="ZoaI/ZTpAVc4A5k6"
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
Delivered-To: 105924-forwarded@bugs.debian.org
--ZoaI/ZTpAVc4A5k6
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Eric Sharkey wrote:
> Now, I'm not really sure where raidtools starts and the kernel ends, so
> this may actually be a kernel problem. It was running 2.4.5 at the time
> this happened. If you could forward this report to whoever is most
> responsible for the bits that handle reconstruction, I'd appreciate it.
To my understanding, the resync procedure is handled by the kernel. I'm
not sure who is in charge of the md drivers currently, hopefully Ingo
Molnar can take a look.
Thanks --Aleksi
--
P.A. Knuutila 5285A09F ECEB0B22 881EA428 CF7E8E24 1ADD95A3
--ZoaI/ZTpAVc4A5k6
Content-Type: message/rfc822
Content-Disposition: inline
Return-Path:
Delivered-To: zur@edu.lahti.fi
Received: (qmail 6642 invoked from network); 19 Jul 2001 20:48:06 -0000
Received: from master.debian.org (216.234.231.130)
by mail.edu.lahti.fi with SMTP; 19 Jul 2001 20:48:06 -0000
Received: from pa by master.debian.org with local (Exim 3.12 1 (Debian))
id 15NKiK-0005Qb-00; Thu, 19 Jul 2001 15:48:04 -0500
Received: from gecko by master.debian.org with local (Exim 3.12 1 (Debian))
id 15NKiK-0005QQ-00; Thu, 19 Jul 2001 15:48:04 -0500
Subject: Bug#105924: raidtools2: data loss when recovering from multiple "bad" disks
Reply-To: Eric Sharkey , 105924@bugs.debian.org
Resent-From: Eric Sharkey
Orignal-Sender: Eric Sharkey
Resent-To: debian-bugs-dist@lists.debian.org
Resent-CC: Pekka Aleksi Knuutila
Resent-Date: Thu, 19 Jul 2001 20:48:02 GMT
Resent-Message-ID:
X-Debian-PR-Message: report 105924
X-Debian-PR-Package: raidtools2
X-Debian-PR-Keywords:
X-Loop: owner@bugs.debian.org
Received: via spool by submit@bugs.debian.org id=B.99557507616991
(code B ref -1); Thu, 19 Jul 2001 20:48:02 GMT
From: Eric Sharkey
To: submit@bugs.debian.org
X-Mailer: bug 3.3.9
Message-Id:
Sender: Eric Sharkey
Date: Thu, 19 Jul 2001 16:37:41 -0400
Delivered-To: submit@bugs.debian.org
Delivered-To: pa@debian.org
Resent-Sender: Pekka Aleksi Knuutila
--ZoaI/ZTpAVc4A5k6--
Severity set to `normal'.
Request was from Pekka Aleksi Knuutila <zur@edu.lahti.fi>
to control@bugs.debian.org.
Received: (at control) by bugs.debian.org; 13 Jan 2002 21:40:13 +0000
From zur@edu.lahti.fi Sun Jan 13 15:40:13 2002
Return-path:
Received: from tux.edu.lahti.fi (edu.lahti.fi) [212.226.80.30]
by master.debian.org with smtp (Exim 3.12 1 (Debian))
id 16PsMO-0007c2-00; Sun, 13 Jan 2002 15:40:12 -0600
Received: (qmail 28042 invoked from network); 13 Jan 2002 21:40:05 -0000
Received: from nexus.edu.lahti.fi (zur@212.226.80.21)
by mail.edu.lahti.fi with SMTP; 13 Jan 2002 21:40:05 -0000
Received: by nexus.edu.lahti.fi (sSMTP sendmail emulation); Sun, 13 Jan 2002 23:40:11 +0200
Date: Sun, 13 Jan 2002 23:40:11 +0200
From: Pekka Aleksi Knuutila
To: control@bugs.debian.org
Subject: downgrading #105924
Message-ID: <20020113234011.B24182@edu.lahti.fi>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
Delivered-To: control@bugs.debian.org
severity 105924 normal
thanks
Reply sent to Barry deFreese <bddebian@comcast.net>:
You have taken responsibility.
-t
MIME-Version: 1.0
X-Mailer: MIME-tools 5.420 (Entity 5.420)
X-Loop: owner@bugs.debian.org
From: owner@bugs.debian.org (Debian Bug Tracking System)
To: Barry deFreese
Subject: Bug#105924: marked as done (raidtools2: data loss when recovering from multiple "bad" disks)
Message-ID:
References: <1208229016.752997.27437.nullmailer@comcast.net>
X-Debian-PR-Message: closed 105924
X-Debian-PR-Package: raidtools2
Content-Type: multipart/mixed; boundary="----------=_1208229482-8680-0"
This is a multi-part message in MIME format...
------------=_1208229482-8680-0
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=utf-8
Your message dated Mon, 14 Apr 2008 23:10:16 -0400
with message-id <1208229016.752997.27437.nullmailer@comcast.net>
and subject line raidtools2 has been removed from Debian, closing #105924
has caused the Debian Bug report #105924,
regarding raidtools2: data loss when recovering from multiple "bad" disks
to be marked as done.
This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.
(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact owner@bugs.debian.org
immediately.)
-- 
105924: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=105924
Debian Bug Tracking System
Contact owner@bugs.debian.org with problems
------------=_1208229482-8680-0
Content-Type: message/rfc822
Content-Disposition: inline
Content-Transfer-Encoding: 7bit
Received: (at submit) by bugs.debian.org; 19 Jul 2001 20:37:56 +0000
Return-path:
Received: from ale.physics.sunysb.edu [::ffff:129.49.56.40]
by master.debian.org with esmtp (Exim 3.12 1 (Debian))
id 15NKYV-0004Nu-00; Thu, 19 Jul 2001 15:37:55 -0500
Received: from sharkey by ale.physics.sunysb.edu with local (Exim 3.22 #1 (Debian))
id 15NKYH-0001En-00; Thu, 19 Jul 2001 16:37:41 -0400
From: Eric Sharkey
Subject: raidtools2: data loss when recovering from multiple "bad" disks
To: submit@bugs.debian.org
X-Mailer: bug 3.3.9
Message-Id:
Sender: Eric Sharkey
Date: Thu, 19 Jul 2001 16:37:41 -0400
Delivered-To: submit@bugs.debian.org
Package: raidtools2
Version: 0.90.990824-11
Severity: grave
I just lost two weeks' worth of data on my primary raid due to an error in
the raidtools reconstruction procedure. I'm still trying to work out exactly
what happened. This is mostly my own fault for not making backups and
ignoring a known problem, but raidtools could have performed better in
this case.
I have /dev/md0 mounted on /home, so I still have /var/log intact and
can go through and figure out exactly what broke when. For me, /dev/md0
is a raid1 (mirror) combination of /dev/hde1 and /dev/hdg1.
The first sign of trouble is here; this seems to be flaky hardware or
a kernel bug causing DMA problems:
Jul 4 23:52:37 ale kernel: hdg: timeout waiting for DMA
Jul 4 23:52:37 ale kernel: ide_dmaproc: chipset supported ide_dma_timeout func only: 14
Jul 4 23:52:37 ale kernel: hdg: irq timeout: status=0x50 { DriveReady SeekComplete }
Jul 4 23:52:37 ale kernel: hde: timeout waiting for DMA
Jul 4 23:52:37 ale kernel: ide_dmaproc: chipset supported ide_dma_timeout func only: 14
Jul 4 23:52:37 ale kernel: hde: irq timeout: status=0x50 { DriveReady SeekComplete }
Jul 4 23:52:45 ale kernel: hdg: timeout waiting for DMA
Jul 4 23:52:45 ale kernel: ide_dmaproc: chipset supported ide_dma_timeout func only: 14
Jul 4 23:52:46 ale kernel: hdg: irq timeout: status=0x50 { DriveReady SeekComplete }
Jul 5 00:07:08 ale -- MARK --
Jul 5 00:27:08 ale -- MARK --
Jul 5 00:47:08 ale -- MARK --
Jul 5 00:48:28 ale kernel: hde: status timeout: status=0x80 { Busy }
Jul 5 00:48:28 ale kernel: hde: DMA disabled
Jul 5 00:48:29 ale kernel: ide2: reset: master: error (0x00?)
Jul 5 00:48:29 ale kernel: hde: status error: status=0x00 { }
Jul 5 00:48:29 ale kernel: hde: drive not ready for command
Jul 5 00:48:29 ale kernel: hde: status error: status=0x00 { }
Jul 5 00:48:29 ale kernel: hde: drive not ready for command
Jul 5 00:48:29 ale kernel: hde: status error: status=0x00 { }
Jul 5 00:48:29 ale kernel: hde: drive not ready for command
Jul 5 00:48:29 ale kernel: hde: status error: status=0x00 { }
Jul 5 00:48:29 ale kernel: hde: drive not ready for command
Jul 5 00:48:29 ale kernel: hde: drive not ready for command
Jul 5 00:48:32 ale kernel: ide2: reset: master: error (0x00?)
Jul 5 00:48:32 ale kernel: hde: status error: status=0x00 { }
Jul 5 00:48:32 ale kernel: end_request: I/O error, dev 21:01 (hde), sector 16817840
Jul 5 00:48:32 ale kernel: ^IOperation continuing on 1 devices
At this point /dev/hde1 is marked bad, and the raid continues in degraded
mode using /dev/hdg1 only. Later, it happens on the other drive:
Jul 5 13:29:33 ale kernel: hdg: lost interrupt
Jul 5 13:30:03 ale last message repeated 3 times
Jul 5 13:31:03 ale last message repeated 6 times
Jul 5 13:32:03 ale last message repeated 6 times
Jul 5 13:32:43 ale last message repeated 4 times
Jul 5 13:32:54 ale kernel: hdg: irq timeout: status=0x80 { Busy }
Jul 5 13:32:56 ale kernel: ide3: reset: master: error (0x00?)
Jul 5 13:32:56 ale kernel: hdg: status error: status=0x00 { }
Jul 5 13:32:56 ale kernel: hdg: drive not ready for command
[clip]
Jul 5 13:32:58 ale kernel: ide3: reset: master: error (0x00?)
Jul 5 13:32:58 ale kernel: hdg: status error: status=0x00 { }
Jul 5 13:32:58 ale kernel: end_request: I/O error, dev 22:01 (hdg), sector 77070400
[clip]
Jul 5 13:32:58 ale kernel: ide3: reset: master: error (0x00?)<6>md: recovery thread got woken up ...
Jul 5 13:32:58 ale kernel:
Jul 5 13:32:58 ale kernel: hdg: status error: status=0x00<6>md: recovery thread finished ...
Errors like this pour into /var/log/messages at very high speed until:
Jul 5 13:36:21 ale kernel: hdg: status error: status=0x00 { }
Jul 5 13:36:22 ale kernel: hdg: drive not ready for command
Jul 5 13:36:30 ale kernel: ide3: reset: success
and then all is well again. Or is it? At the end of this thrashing,
/dev/hdg1 has also been marked as bad, but the machine keeps using it.
I neglect the machine, knowing it's in degraded mode, but not having
the time to go fix it.
Eventually an unrelated problem crops up which requires attention:
Jul 19 02:02:53 ale kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jul 19 02:02:56 ale kernel: nfs: server cypress not responding, still trying
Jul 19 02:02:56 ale last message repeated 4 times
Its network card vanishes. I don't have a clue what caused this, but I
can't ignore it any longer, so I come in and power cycle the machine.
Big mistake.
Jul 19 11:53:10 ale kernel: md: raid0 personality registered
Jul 19 11:53:10 ale kernel: md: raid1 personality registered
Jul 19 11:53:10 ale kernel: md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
Jul 19 11:53:10 ale kernel: md: Autodetecting RAID arrays.
Jul 19 11:53:10 ale kernel: (read) hde1's sb offset: 60030336 [events: 0000004b]
Jul 19 11:53:10 ale kernel: (read) hdg1's sb offset: 60030336 [events: 0000004b]
Jul 19 11:53:10 ale kernel: md: autorun ...
Jul 19 11:53:10 ale kernel: md: considering hdg1 ...
Jul 19 11:53:10 ale kernel: md: adding hdg1 ...
Jul 19 11:53:10 ale kernel: md: adding hde1 ...
Jul 19 11:53:10 ale kernel: md: created md0
Jul 19 11:53:10 ale kernel: md: bind
Jul 19 11:53:10 ale kernel: md: bind
Jul 19 11:53:10 ale kernel: md: running:
Jul 19 11:53:10 ale kernel: md: now!
Jul 19 11:53:10 ale kernel: md: hdg1's event counter: 0000004b
Jul 19 11:53:10 ale kernel: md: hde1's event counter: 0000004b
Jul 19 11:53:10 ale kernel: md0: max total readahead window set to 124k
Jul 19 11:53:10 ale kernel: md0: 1 data-disks, max readahead per data-disk: 124k
Jul 19 11:53:10 ale kernel: raid1: device hdg1 operational as mirror 1
Jul 19 11:53:10 ale kernel: raid1: device hde1 operational as mirror 0
Jul 19 11:53:10 ale kernel: raid1: raid set md0 not clean; reconstructing mirrors
Jul 19 11:53:10 ale kernel: raid1: raid set md0 active with 2 out of 2 mirrors
Jul 19 11:53:10 ale kernel: md: syncing RAID array md0
Jul 19 11:53:10 ale kernel: md: minimum _guaranteed_ reconstruction speed: 100 KB/sec/disc.
Jul 19 11:53:10 ale kernel: md: updating md0 RAID superblock on device
Jul 19 11:53:10 ale kernel: md: using maximum available idle IO bandwith (but not more than 100000 KB/sec) for reconstruction.
Jul 19 11:53:10 ale kernel: md: <6>md: using 124k window, over a total of 60030336 blocks.
Jul 19 11:53:10 ale kernel: hdg1 [events: 0000004c](write) hdg1's sb offset: 60030336
Jul 19 11:53:10 ale kernel: md: hde1 [events: 0000004c](write) hde1's sb offset: 60030336
Jul 19 11:53:10 ale kernel: md: ... autorun DONE.
It copies the contents of /dev/hde1, which was marked bad first, onto
/dev/hdg1, which was marked bad later, overwriting two weeks' worth of
changes.
I blame myself for this, *but* this should not have happened. The recovery
process should have copied /dev/hdg1 onto /dev/hde1, and not the other way
around!
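The expectation above can be illustrated with a small sketch. The Python below is not the md driver's actual logic: the `Mirror` type, the `pick_resync_source` function, and the "strictly highest event counter wins" rule are assumptions introduced here for illustration only. It reads the two event counters from the boot log quoted above (both 0000004b) and shows that, under such a rule, equal counters give no basis for preferring the mirror that was written to most recently. Whether this is what actually determined the resync direction is not confirmed by the report.

```python
import re
from typing import List, NamedTuple, Optional

# Matches lines such as "md: hdg1's event counter: 0000004b" from the boot log.
EVENT_RE = re.compile(r"md: (\w+)'s event counter: ([0-9a-fA-F]+)")


class Mirror(NamedTuple):
    device: str
    events: int  # superblock event counter, parsed from hex


def parse_event_counters(log_text: str) -> List[Mirror]:
    """Collect (device, event counter) pairs from md boot messages."""
    return [Mirror(dev, int(count, 16)) for dev, count in EVENT_RE.findall(log_text)]


def pick_resync_source(mirrors: List[Mirror]) -> Optional[Mirror]:
    """Hypothetical rule: the mirror with the strictly highest counter wins.

    Returns None when the highest counter is shared, because the counters
    alone then cannot say which mirror holds the newer data.
    """
    if not mirrors:
        return None
    top = max(m.events for m in mirrors)
    freshest = [m for m in mirrors if m.events == top]
    return freshest[0] if len(freshest) == 1 else None


# The two event-counter lines taken from the Jul 19 boot log in this report.
BOOT_LOG = """\
Jul 19 11:53:10 ale kernel: md: hdg1's event counter: 0000004b
Jul 19 11:53:10 ale kernel: md: hde1's event counter: 0000004b
"""

mirrors = parse_event_counters(BOOT_LOG)
source = pick_resync_source(mirrors)
print(mirrors)
print("resync source:", source.device if source else "ambiguous (counters tie)")
```

With the counters from this log the function returns None, meaning the sketch refuses to choose a source rather than silently picking one of the two mirrors.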
Now, I'm not really sure where raidtools starts and the kernel ends, so
this may actually be a kernel problem. The machine was running kernel 2.4.5
at the time this happened. If you could forward this report to whoever is
most responsible for the bits that handle reconstruction, I'd appreciate it.
Thanks,
Eric
-- System Information
Debian Release: testing/unstable
Kernel Version: Linux ale 2.4.6 #1 SMP Thu Jul 5 14:08:45 EDT 2001 i686 unknown
Versions of the packages raidtools2 depends on:
ii debconf 0.9.66 Debian configuration management system
ii libc6 2.2.3-6 GNU C Library: Shared libraries and Timezone
------------=_1208229482-8680-0
Content-Type: message/rfc822
Content-Disposition: inline
Content-Transfer-Encoding: 7bit
Received: (at 105924-done) by bugs.debian.org; 15 Apr 2008 03:08:26 +0000
X-Spam-Checker-Version: SpamAssassin 3.1.4-bugs.debian.org_2005_01_02
(2006-07-26) on rietz.debian.org
X-Spam-Level:
X-Spam-Status: No, score=-2.0 required=4.0 tests=BAYES_00,DNS_FROM_RFC_POST,
FORGED_RCVD_HELO,SUBJ_HAS_UNIQ_ID autolearn=no
version=3.1.4-bugs.debian.org_2005_01_02
Return-path:
Received: from qmta08.emeryville.ca.mail.comcast.net ([76.96.30.80])
by rietz.debian.org with esmtp (Exim 4.63)
(envelope-from )
id 1JlbX0-0007zY-4r
for 105924-done@bugs.debian.org; Tue, 15 Apr 2008 03:08:26 +0000
Received: from OMTA13.emeryville.ca.mail.comcast.net ([76.96.30.52])
by QMTA08.emeryville.ca.mail.comcast.net with comcast
id De2o1Z00317UAYkA803E00; Tue, 15 Apr 2008 03:07:01 +0000
Received: from bddebian2.bddebian.com ([71.224.175.179])
by OMTA13.emeryville.ca.mail.comcast.net with comcast
id Df8K1Z0063sciBK8Z00000; Tue, 15 Apr 2008 03:08:20 +0000
X-Authority-Analysis: v=1.0 c=1 a=MIoPn-dSKh8A:10 a=dmK9XBzfEoEA:10
a=xNf9USuDAAAA:8 a=GEYYyrE9ksqY0zE9onsA:9 a=FU2Jw3VsHMYcqgDMxd8A:7
a=4EFE0p0xMxM5AgsYaN-U8EqqfVwA:4 a=10sAvMsTeQkA:10
Received: (nullmailer pid 27438 invoked by uid 1000);
Tue, 15 Apr 2008 03:10:16 -0000
From: Barry deFreese
To: 105924-done@bugs.debian.org
Subject: raidtools2 has been removed from Debian, closing #105924
Date: Mon, 14 Apr 2008 23:10:16 -0400
Message-Id: <1208229016.752997.27437.nullmailer@comcast.net>
X-CrossAssassin-Score: 13
Version: 1.00.3-17+rm
The raidtools2 package has been removed from Debian testing, unstable and
experimental, so I am now closing the bugs that were still open
against it.
For more information about this package's removal, read
http://bugs.debian.org/298968 . That bug might give the reasons why
this package was removed, and suggestions of possible replacements.
Don't hesitate to reply to this mail if you have any questions.
Thank you for your contribution to Debian.
Barry deFreese
------------=_1208229482-8680-0--
Notification sent to Eric Sharkey <sharkey@superk.physics.sunysb.edu>:
Bug acknowledged by developer.
-t
MIME-Version: 1.0
X-Mailer: MIME-tools 5.420 (Entity 5.420)
X-Loop: owner@bugs.debian.org
From: owner@bugs.debian.org (Debian Bug Tracking System)
To: Eric Sharkey
Subject: Bug#105924 closed by Barry deFreese
(raidtools2 has been removed from Debian, closing #105924)
Message-ID:
References: <1208229016.752997.27437.nullmailer@comcast.net>
X-Debian-PR-Message: they-closed 105924
X-Debian-PR-Package: raidtools2
Reply-To: 105924@bugs.debian.org
Content-Type: multipart/mixed; boundary="----------=_1208229483-8680-1"
This is a multi-part message in MIME format...
------------=_1208229483-8680-1
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"
This is an automatic notification regarding your Bug report
which was filed against the raidtools2 package:
#105924: raidtools2: data loss when recovering from multiple "bad" disks
It has been closed by Barry deFreese.
Their explanation is attached below along with your original report.
If this explanation is unsatisfactory and you have not received a
better one in a separate message then please contact Barry deFreese by
replying to this email.
--=20
105924: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=3D105924
Debian Bug Tracking System
Contact owner@bugs.debian.org with problems
------------=_1208229483-8680-1
Content-Type: message/rfc822
Content-Disposition: inline
Content-Transfer-Encoding: 7bit
Received: (at 105924-done) by bugs.debian.org; 15 Apr 2008 03:08:26 +0000
X-Spam-Checker-Version: SpamAssassin 3.1.4-bugs.debian.org_2005_01_02
(2006-07-26) on rietz.debian.org
X-Spam-Level:
X-Spam-Status: No, score=-2.0 required=4.0 tests=BAYES_00,DNS_FROM_RFC_POST,
FORGED_RCVD_HELO,SUBJ_HAS_UNIQ_ID autolearn=no
version=3.1.4-bugs.debian.org_2005_01_02
Return-path:
Received: from qmta08.emeryville.ca.mail.comcast.net ([76.96.30.80])
by rietz.debian.org with esmtp (Exim 4.63)
(envelope-from )
id 1JlbX0-0007zY-4r
for 105924-done@bugs.debian.org; Tue, 15 Apr 2008 03:08:26 +0000
Received: from OMTA13.emeryville.ca.mail.comcast.net ([76.96.30.52])
by QMTA08.emeryville.ca.mail.comcast.net with comcast
id De2o1Z00317UAYkA803E00; Tue, 15 Apr 2008 03:07:01 +0000
Received: from bddebian2.bddebian.com ([71.224.175.179])
by OMTA13.emeryville.ca.mail.comcast.net with comcast
id Df8K1Z0063sciBK8Z00000; Tue, 15 Apr 2008 03:08:20 +0000
X-Authority-Analysis: v=1.0 c=1 a=MIoPn-dSKh8A:10 a=dmK9XBzfEoEA:10
a=xNf9USuDAAAA:8 a=GEYYyrE9ksqY0zE9onsA:9 a=FU2Jw3VsHMYcqgDMxd8A:7
a=4EFE0p0xMxM5AgsYaN-U8EqqfVwA:4 a=10sAvMsTeQkA:10
Received: (nullmailer pid 27438 invoked by uid 1000);
Tue, 15 Apr 2008 03:10:16 -0000
From: Barry deFreese
To: 105924-done@bugs.debian.org
Subject: raidtools2 has been removed from Debian, closing #105924
Date: Mon, 14 Apr 2008 23:10:16 -0400
Message-Id: <1208229016.752997.27437.nullmailer@comcast.net>
X-CrossAssassin-Score: 13
Version: 1.00.3-17+rm
The raidtools2 package has been removed from Debian testing, unstable and
experimental, so I am now closing the bugs that were still open
against it.
For more information about this package's removal, read
http://bugs.debian.org/298968 . That bug might give the reasons why
this package was removed, and suggestions of possible replacements.
Don't hesitate to reply to this mail if you have any questions.
Thank you for your contribution to Debian.
Barry deFreese
------------=_1208229483-8680-1--