Commit Graph

339 Commits

Author SHA1 Message Date
Kevin Corry
702ca6f0be [PATCH] dm mirror log: sector size fix
On-disk logs for dm-mirror devices are currently hard-coded to use 512 byte
hard-sector-sizes.  This patch fixes dm-log so it will work with devices with
non-512-byte hard-sector-sizes.

To maintain full compatibility, instead of moving the clean-bits bitset to a
bitset, and enlarges the disk-header buffer to encompass both the header and
the bitset.  The I/O routines for the bitset are removed, and the I/O routines
for the disk-header now also read/write the bitset.

Signed-off-by: Kevin Corry <kevcorry@us.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:35 -07:00
Milan Broz
143535396c [PATCH] dm table: get_target: fix last index
The table is indexed from 0, so an index equal to t->num_targets should be
rejected.

(There is no code in the current tree that would exercise this bug.)

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:35 -07:00
Neil Brown
e4c8b3ba34 [PATCH] dm: mirror sector offset fix
The device-mapper core does not perform any remapping of bios before passing
them to the targets.  If a particular mapping begins part-way into a device,
targets obtain the sector relative to the start of the mapping by subtracting
ti->begin.

The dm-raid1 target didn't do this everywhere: this patch fixes it, taking
care to subtract ti->begin exactly once for each bio.

[akpm: too late for 2.6.17 - suitable for 2.6.17.x after it has settled]

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:35 -07:00
Jeff Mahoney
f0b0411536 [PATCH] dm: fix block device initialisation
In alloc_dev(), we register the device with the block layer and then continue
to initialize the device.  But register_disk() makes the device available to
be opened before we have completed initialising it.

This patch moves the final bits of the initialization above the disk
registration.

[akpm: too late for 2.6.17 - suitable for 2.6.17.x after it has settled]

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:35 -07:00
Jeff Mahoney
10da4f795f [PATCH] dm: add module ref counting
The reference counting on dm-mod is zero if no mapped devices are open.  This
is incorrect, and can lead to an oops if the module is unloaded while mapped
devices exist.

This patch claims a reference to the module whenever a device is created, and
drops it again when the device is freed.

Devices must be removed before dm-mod is unloaded.

[akpm: too late for 2.6.17 - suitable for 2.6.17.x after it has settled]

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:35 -07:00
Jeff Mahoney
7ec75f2547 [PATCH] dm: fix mapped device ref counting
To avoid races, _minor_lock must be held while changing mapped device
reference counts.

There are a few paths where a mapped_device pointer is returned before a
reference is taken.  This patch fixes them.

[akpm: too late for 2.6.17 - suitable for 2.6.17.x after it has settled]

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:35 -07:00
Jeff Mahoney
fba9f90e56 [PATCH] dm: add DMF_FREEING
There is a chicken and egg problem between the block layer and dm in which the
gendisk associated with a mapping keeps a reference-less pointer to the
mapped_device.

This patch uses a new flag DMF_FREEING to indicate when the mapped_device is
no longer valid.  This is checked to prevent any attempt to open the device
from succeeding while the device is being destroyed.

[akpm: too late for 2.6.17 - suitable for 2.6.17.x after it has settled]

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:34 -07:00
Jeff Mahoney
f32c10b099 [PATCH] dm: change minor_lock to spinlock
While removing a device, another another thread might attempt to resurrect it.

This patch replaces the _minor_lock mutex with a spinlock and uses
atomic_dec_and_lock() to serialize reference counting in dm_put().

[akpm: too late for 2.6.17 - suitable for 2.6.17.x after it has settled]

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:34 -07:00
Jeff Mahoney
62f75c2f32 [PATCH] dm: move idr_pre_get
idr_pre_get() can sleep while allocating memory.

The next patch will change _minor_lock into a spinlock, so this patch moves
idr_pre_get() outside the lock in preparation.

[akpm: too late for 2.6.17 - suitable for 2.6.17.x after it has settled]

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:34 -07:00
Jeff Mahoney
ba61fdd17d [PATCH] dm: fix idr minor allocation
One part of the system can attempt to use a mapped device before another has
finished initialising it or while it is being freed.

This patch introduces a place holder value, MINOR_ALLOCED, to mark the minor
as allocated but in a state where it can't be used, such as mid-allocation or
mid-free.  At the end of the initialization, it replaces the place holder with
the pointer to the mapped_device, making it available to the rest of the dm
subsystem.

[akpm: too late for 2.6.17 - suitable for 2.6.17.x after it has settled]

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:34 -07:00
Alasdair G Kergon
c51c275249 [PATCH] dm snapshot: unify chunk_size
Persistent snapshots currently store a private copy of the chunk size.
Userspace also supplies the chunk size when loading a snapshot.  Ensure
consistency by only storing the chunk_size in one place instead of two.

Currently the two sizes will differ if the chunk size supplied by userspace
does not match the chunk size an existing snapshot actually uses.  Amongst
other problems, this causes an incorrect 'percentage full' to be reported.

The patch ensures consistency by only storing the chunk_size in one place,
removing it from struct pstore.  Some initialisation is delayed until the
correct chunk_size is known.  If read_header() discovers that the wrong chunk
size was supplied, the 'area' buffer (which the header already got read into)
is reinitialised to the correct size.

[akpm: too late for 2.6.17 - suitable for 2.6.17.x after it has settled]

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:34 -07:00
Akinobu Mita
179e09172a [PATCH] drivers: use list_move()
This patch converts the combination of list_del(A) and list_add(A, B) to
list_move(A, B) under drivers/.

Acked-by: Corey Minyard <minyard@mvista.com>
Cc: Ben Collins <bcollins@debian.org>
Acked-by: Roland Dreier <rolandd@cisco.com>
Cc: Alasdair Kergon <dm-devel@redhat.com>
Cc: Gerd Knorr <kraxel@bytesex.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frank Pavlic <fpavlic@de.ibm.com>
Acked-by: Matthew Wilcox <matthew@wil.cx>
Cc: Andrew Vasquez <linux-driver@qlogic.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:18 -07:00
Adrian Bunk
a5d6839b75 [PATCH] drivers/md/raid6algos.c: fix a NULL dereference
This patch fixes a NULL dereference spotted by the Coverity checker.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:08 -07:00
NeilBrown
c331eb04b9 [PATCH] md: Fix badness in sysfs_notify caused by md_new_event
From: NeilBrown <neilb@suse.de>

If an error is reported by a drive in a RAID array (which is done via
bi_end_io - in interrupt context), we call md_error and md_new_event which
calls sysfs_notify.  However sysfs_notify grabs a mutex and so cannot be
called in interrupt context.

This patch just creates a variant of md_new_event which avoids the sysfs
call, and uses that.  A better fix for later is to arrange for the event to
be called from user-context.

Note: avoiding the sysfs call isn't a problem as an error will not, by
itself, modify the sync_action attribute.  (We do still need to
wake_up(&md_event_waiters) as an error by itself will modify /proc/mdstat).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-31 16:27:11 -07:00
Neil Brown
c71d48877e [PATCH] Unlock md devices when stopping them on reboot.
otherwise we get nasty messages about locks not being released.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-26 11:52:11 -07:00
NeilBrown
5c4c33318d [PATCH] md: fix possible oops when starting a raid0 array
This loop that sets up the hash_table has problems.

Careful examination will show that the last time through, everything but
the first line is pointless.  This is because all it does is change 'cur'
and 'size' and neither of these are used after the loop.  This should ring
warning bells...  That last time through the loop,

        size += conf->strip_zone[cur].size

can index off the end of the strip_zone array.  Depending on what it finds
there, it might exit the loop cleanly, or it might spin going further and
further beyond the array until it hits an unmapped address.

This patch rearranges the code so that the last, pointless, iteration of
the loop never happens.  i.e.  the one statement of the last loop that is
needed is moved the the end of the previous loop - or to before the loop
starts - and the loop counter starts from 1 instead of 0.

Cc: "Don Dupuis" <dondster@gmail.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-23 10:35:31 -07:00
NeilBrown
2adc7d47c4 [PATCH] md: Fix inverted test for 'repair' directive.
We should be able to write 'repair' to /sys/block/mdX/md/sync_action,
however due to and inverted test, that always given EINVAL.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-21 12:59:17 -07:00
NeilBrown
5e7dd2ab6b [PATCH] md: Fix 'rdev->nr_pending' count when retrying barrier requests
When retrying a failed BIO_RW_BARRIER request, we need to keep the reference
in ->nr_pending over the whole retry.  Currently, we only hold the reference
if the failed request is the *last* one to finish - which is silly, because it
would normally be the first to finish.

So move the rdev_dec_pending call up into the didn't-fail branch.  As the rdev
isn't used in the later code, calling rdev_dec_pending earlier doesn't hurt.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-01 18:17:42 -07:00
NeilBrown
62de608da0 [PATCH] md: Improve detection of lack of barrier support in raid1
Move the test for 'do barrier work' down a bit so that if the first write to a
raid1 is a BIO_RW_BARRIER write, the checking done by superblock writes will
cause the right thing to happen.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-01 18:17:42 -07:00
NeilBrown
bea2771871 [PATCH] md: Change ENOTSUPP to EOPNOTSUPP
Because that is what you get if a BIO_RW_BARRIER isn't supported!

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-01 18:17:42 -07:00
NeilBrown
e0a33270ed [PATCH] md: Fixed refcounting/locking when attempting read error correction in raid10
We need to hold a reference to rdevs while reading and writing to attempt to
correct read errors.  This reference must be taken under an rcu lock.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-01 18:17:42 -07:00
NeilBrown
df30d0f4ca [PATCH] md: Avoid oops when attempting to fix read errors on raid10
We should add to the counter for the rdev *after* checking if the rdev is
NULL!!!

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-01 18:17:42 -07:00
Ingo Molnar
5dc5cf7dd2 [PATCH] md: locking fix
- fix mddev_lock() usage bugs in md_attr_show() and md_attr_store().
  [they did not anticipate the possibility of getting a signal]

- remove mddev_lock_uninterruptible() [unused]

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-20 07:54:04 -07:00
NeilBrown
4508a7a734 [PATCH] sysfs: Allow sysfs attribute files to be pollable
It works like this:
  Open the file
  Read all the contents.
  Call poll requesting POLLERR or POLLPRI (so select/exceptfds works)
  When poll returns,
     close the file and go to top of loop.
   or lseek to start of file and go back to the 'read'.

Events are signaled by an object manager calling
   sysfs_notify(kobj, dir, attr);

If the dir is non-NULL, it is used to find a subdirectory which
contains the attribute (presumably created by sysfs_create_group).

This has a cost of one int  per attribute, one wait_queuehead per kobject,
one int per open file.

The name "sysfs_notify" may be confused with the inotify
functionality.  Maybe it would be nice to support inotify for sysfs
attributes as well?

This patch also uses sysfs_notify to allow /sys/block/md*/md/sync_action
to be pollable

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-04-14 11:41:24 -07:00
NeilBrown
6f91fe88e4 [PATCH] md: make sure 64bit fields in version-1 metadata are 64-bit aligned
reshape_position is a 64bit field that was not 64bit aligned.  So swap with
new_level.

NOTE: this is a user-visible change.  However:
  - The bad code has not appeared in a released kernel
  - This code is still marked 'experimental'
  - This only affects version-1 superblock, which are not in wide use
  - These field are only used (rather than simply reported) by user-space
    tools in extemely rare circumstances : after a reshape crashes in the
    first second of the reshape process.

So I believe that, at this stage, the change is safe.  Especially if people
heed the 'help' message on use mdadm-2.4.1.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:30 -07:00
Eric Sesterhenn
b638548384 BUG_ON() Conversion in md/raid10.c
this changes if() BUG(); constructs to BUG_ON() which is
cleaner and can better optimized away

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-02 13:34:29 +02:00
Eric Sesterhenn
43dab9bbe9 BUG_ON() Conversion in md/raid6main.c
this changes if() BUG(); constructs to BUG_ON() which is
cleaner and can better optimized away

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-02 13:33:30 +02:00
Eric Sesterhenn
78bafebd46 BUG_ON() Conversion in md/raid5.c
this changes if() BUG(); constructs to BUG_ON() which is
cleaner and can better optimized away

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-02 13:31:42 +02:00
Eric Sesterhenn
9e77c485f7 BUG_ON() Conversion in md/raid1.c
this changes if() BUG(); constructs to BUG_ON() which is
cleaner and can better optimized away

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-01 01:08:49 +02:00
Eric Sesterhenn
543cb2a451 BUG_ON() Conversion in md/dm-target.c
this changes if() BUG(); constructs to BUG_ON() which is
cleaner and can better optimized away

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-01 01:08:12 +02:00
NeilBrown
ec350a7fc1 [PATCH] md: Raid-6 did not create sysfs entries for stripe cache
Signed-off-by: Brad Campbell <brad@wasp.net.au>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:19:01 -08:00
NeilBrown
926ce2d8a7 [PATCH] md: Remove some code that can sleep from under a spinlock
And remove the comments that were put in inplace of a fix too....

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:19:01 -08:00
NeilBrown
6b1117d505 [PATCH] md: Don't clear bits in bitmap when writing to one device fails during recovery
Currently a device failure during recovery leaves bits set in the bitmap.
This normally isn't a problem as the offending device will be rejected because
of errors.  However if device re-adding is being used with non-persistent
bitmaps, this can be a problem.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:19:01 -08:00
NeilBrown
df5b89b323 [PATCH] md: Convert reconfig_sem to reconfig_mutex
... being careful that mutex_trylock is inverted wrt down_trylock

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:03 -08:00
Arjan van de Ven
48c9c27b8b [PATCH] sem2mutex: drivers/md
Semaphore to mutex conversion.

The conversion was generated via scripts, and the result was validated
automatically via a script as well.

Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:03 -08:00
NeilBrown
2f889129de [PATCH] md: Restore 'remaining' count when retrying an write operation
When retrying a write due to barrier failure, we don't reset 'remaining', so
it goes negative and never hits 0 again.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:03 -08:00
NeilBrown
8ddeeae51f [PATCH] md: Fix md grow/size code to correctly find the maximum available space
An md array can be asked to change the amount of each device that it is using,
and in particular can be asked to use the maximum available space.  This
currently only works if the first device is not larger than the rest.  As
'size' gets changed and so 'fit' becomes wrong.  So check if a 'fit' is
required early and don't corrupt it.

Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:03 -08:00
NeilBrown
f6344757a9 [PATCH] md: Remove bi_end_io call out from under a spinlock
raid5 overloads bi_phys_segments to count the number of blocks that the
request was broken in to so that it knows when the bio is completely handled.

Accessing this must always be done under a spinlock.  In one case we also call
bi_end_io under that spinlock, which probably isn't ideal as bi_end_io could
be expensive (even though it isn't allowed to sleep).

So we reducde the range of the spinlock to just accessing bi_phys_segments.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:03 -08:00
NeilBrown
b3b46be38a [PATCH] md: Remove some stray semi-colons after functions called in macro..
wait_event_lock_irq puts a ';' after its usage of the 4th arg, so we don't
need to.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:02 -08:00
NeilBrown
df8e7f7639 [PATCH] md: Improve comments about locking situation in raid5 make_request
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:02 -08:00
NeilBrown
e464eafdb4 [PATCH] md: Support suspending of IO to regions of an md array
This allows user-space to access data safely.  This is needed for raid5
reshape as user-space needs to take a backup of the first few stripes before
allowing reshape to commence.

It will also be useful in cluster-aware raid1 configurations so that all
cluster members can leave a section of the array untouched while a
resync/recovery happens.

A 'start' and 'end' of the suspended range are written to 2 sysfs attributes.
Note that only one range can be suspended at a time.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:02 -08:00
NeilBrown
16484bf596 [PATCH] md: Make 'reshape' a possible sync_action action
This allows reshape to be triggerred via sysfs (which is the only way to start
it happening).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:02 -08:00
NeilBrown
63c70c4f3a [PATCH] md: Split reshape handler in check_reshape and start_reshape
check_reshape checks validity and does things that can be done instantly -
like adding devices to raid1.  start_reshape initiates a restriping process to
convert the whole array.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:02 -08:00
NeilBrown
b578d55fdd [PATCH] md: Only checkpoint expansion progress occasionally
Instead of checkpointing at each stripe, only checkpoint when a new write
would overwrite uncheckpointed data.  Block any write to the uncheckpointed
area.  Arbitrarily checkpoint at least every 3Meg.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:02 -08:00
NeilBrown
f67055780c [PATCH] md: Checkpoint and allow restart of raid5 reshape
We allow the superblock to record an 'old' and a 'new' geometry, and a
position where any conversion is up to.  The geometry allows for changing
chunksize, layout and level as well as number of devices.

When using verion-0.90 superblock, we convert the version to 0.91 while the
conversion is happening so that an old kernel will refuse the assemble the
array.  For version-1, we use a feature bit for the same effect.

When starting an array we check for an incomplete reshape and restart the
reshape process if needed.  If the reshape stopped at an awkward time (like
when updating the first stripe) we refuse to assemble the array, and let
user-space worry about it.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:01 -08:00
NeilBrown
292695531a [PATCH] md: Final stages of raid5 expand code
This patch adds raid5_reshape and end_reshape which will start and finish the
reshape processes.

raid5_reshape is only enabled in CONFIG_MD_RAID5_RESHAPE is set, to discourage
accidental use.

Read the 'help' for the CONFIG_MD_RAID5_RESHAPE entry.

and Make sure that you have backups, just in case.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:01 -08:00
NeilBrown
ccfcc3c10b [PATCH] md: Core of raid5 resize process
This patch provides the core of the resize/expand process.

sync_request notices if a 'reshape' is happening and acts accordingly.

It allocated new stripe_heads for the next chunk-wide-stripe in the target
geometry, marking them STRIPE_EXPANDING.

Then it finds which stripe heads in the old geometry can provide data needed
by these and marks them STRIPE_EXPAND_SOURCE.  This causes stripe_handle to
read all blocks on those stripes.

Once all blocks on a STRIPE_EXPAND_SOURCE stripe_head are read, any that are
needed are copied into the corresponding STRIPE_EXPANDING stripe_head.  Once a
STRIPE_EXPANDING stripe_head is full, it is marks STRIPE_EXPAND_READY and then
is written out and released.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:01 -08:00
NeilBrown
7ecaa1e6a1 [PATCH] md: Infrastructure to allow normal IO to continue while array is expanding
We need to allow that different stripes are of different effective sizes, and
use the appropriate size.  Also, when a stripe is being expanded, we must
block any IO attempts until the stripe is stable again.

Key elements in this change are:
 - each stripe_head gets a 'disk' field which is part of the key,
   thus there can sometimes be two stripe heads of the same area of
   the array, but covering different numbers of devices.  One of these
   will be marked STRIPE_EXPANDING and so won't accept new requests.
 - conf->expand_progress tracks how the expansion is progressing and
   is used to determine whether the target part of the array has been
   expanded yet or not.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:01 -08:00
NeilBrown
ad01c9e375 [PATCH] md: Allow stripes to be expanded in preparation for expanding an array
Before a RAID-5 can be expanded, we need to be able to expand the stripe-cache
data structure.

This requires allocating new stripes in a new kmem_cache.  If this succeeds,
we copy cache pages over and release the old stripes and kmem_cache.

We then allocate new pages.  If that fails, we leave the stripe cache at it's
new size.  It isn't worth the effort to shrink it back again.

Unfortuanately this means we need two kmem_cache names as we, for a short
period of time, we have two kmem_caches.  So they are raid5/%s and
raid5/%s-alt

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:01 -08:00
NeilBrown
b55e6bfcd2 [PATCH] md: Split disks array out of raid5 conf structure so it is easier to grow
The remainder of this batch implements raid5 reshaping.  Currently the only
shape change that is supported is added a device, but it is envisioned that
changing the chunksize and layout will also be supported, as well as changing
the level (e.g.  1->5, 5->6).

The reshape process naturally has to move all of the data in the array, and so
should be used with caution.  It is believed to work, and some testing does
support this, but wider testing would be great for increasing my confidence.

You will need a version of mdadm newer than 2.3.1 to make use of raid5 growth.
 This is because mdadm need to take a copy of a 'critical section' at the
start of the array incase there is a crash at an awkward moment.  On restart,
mdadm will restore the critical section and allow reshape to continue.

I hope to release a 2.4-pre by early next week - it still needs a little more
polishing.

This patch:

Previously the array of disk information was included in the raid5 'conf'
structure which was allocated to an appropriate size.  This makes it awkward
to change the size of that array.  So we split it off into a separate
kmalloced array which will require a little extra indexing, but is much easier
to grow.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:01 -08:00