This is for bugzilla bug #248176: GFS2: invalid metadata block
Patches 1 thru 3 were accepted upstream, but there were problems
with 4 and 5. Those issues have been resolved and now the recovery
tests are passing without errors. This code has gone through
41 * 3 successful gfs2 recovery tests before it hit an
unrelated (openais) problem.
This is a complete rewrite of patch 4 for bug #248176.
Part of the problem was that inodes were being recycled
before their buffers were flushed to the journal logs.
Another problem was that the clone bitmaps were being
searched for deleted inodes to recycle, but only the
"real" bitmaps should be searched for that purpose.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
We only need a single gfs2_scand process rather than the one
per filesystem which we had previously. As a result the parameter
determining the frequency of gfs2_scand runs becomes a module
parameter rather than a mount parameter as it was before.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
these struct *_operations are all method tables, thus should be const.
Signed-off-by: Denis Cheng <crquan@gmail.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This is patch 5 of 5 for bug #248176
Metadata corruption was occurring because page references weren't
being removed in all cases. I previously added a function called
detach_bufdata, but I discovered there already WAS a function out
there to do the job. It's called gfs2_meta_cache_flush. So I added
a call to that to remove the page references.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
this is more clear.
Signed-off-by: Denis Cheng <crquan@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This is patch three of five for bug #248176.
The try_rgrp_unlink code in rgrp.c had an infinite loop. This was
caused because the bitmap function rgblk_search can return a block
less than the "goal" block, in which case it was looping. The fix is
to make it always march forward as needed.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This is patch 2 of 5 for bug #248176.
The list_move code previously concocted in log.c for bug #238162
(see https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=238162#c23)
never runs as bh can now never be NULL at this point.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This is the first of five patches for bug #248176:
There were still some critical variables being manipulated outside
the log_lock spinlock. That usually resulted in a hang.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This fixes an oops which was occurring during glock dumping due to the
seq file code not taking a reference to the glock. Also this fixes a
memory leak which occurred in certain cases, in turn preventing the
filesystem from unmounting.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When looking at an unrelated problem, I noticed that nfsd does not
set nameidata pointer on create (ie nd is NULL). This should
cause an oops in some cases in which when NFSd is mounted over GFS2.
Signed-off-by: Steve French <sfrench@us.ibm.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch cleans up duplicate includes in
fs/gfs2/
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
If a glock is in the exclusive state and a request for demote to
deferred has been received, then further requests for demote to
shared are being ignored. This patch fixes that by ensuring that
we demote to unlocked in that case.
Signed-off-by: Josef Whiter <jwhiter@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
One of the races relates to referencing a variable while not holding
its protecting spinlock. The patch simply moves the test inside the
spin lock. The other races occurs when a demote to unlocked request
occurs during the time a demote to shared request is already running.
This of course only happens in the case that the lock was in the
exclusive mode to start with. The patch adds a check to see if another
demote request has occurred in the mean time and if it has, then it
performs a second demote.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The recent fix for a circular lock dependency unfortunately introduced a
potential memory leak in the event where the call to nlmsvc_lookup_host
fails for some reason.
Thanks to Roel Kluin for spotting this.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When IOCB_FLAG_RESFD flag is set and iocb->aio_resfd is incorrect,
statement 'goto out_put_req' is executed. At label 'out_put_req',
aio_put_req(..) is called, which requires 'req->ki_filp' set.
Signed-off-by: Yan Zheng<yanzheng@21cn.com>
Cc: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/cooloney/blackfin-2.6:
Blackfin arch: fix PORT_J BUG for BF537/6 EMAC driver reported by Kalle Pokki <kalle.pokki@iki.fi>
Blackfin arch: gpio pinmux and resource allocation API required by BF537 on chip ethernet mac driver
Blackfin arch: add some missing syscall
binfmt_flat: checkpatch fixing minimum support for the blackfin relocations
Binfmt_flat: Add minimum support for the Blackfin relocations
The fs was not unlocking the local alloc inode mutex in the code path in
which it failed to find a window of free bits in the global bitmap.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Nick Piggin points out that splice isn't being good about the mmap
semaphore: while two readers can nest inside each others, it does leave
a possible deadlock if a writer (ie a new mmap()) comes in during that
nesting.
Original "just move the locking" patch by Nick, replaced by one by me
based on an optimistic pagefault_disable(). And then Jens tested and
updated that patch.
Reported-by: Nick Piggin <npiggin@suse.de>
Tested-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit b394e43e99.
Lachlan McIlroy says:
It tried to fix an issue where log replay is replaying an inode cluster
initialisation transaction that should not be replayed because the inode
cluster on disk is more up to date. Since we don't log file sizes (we
rely on inode flushing to get them to disk) then we can't just replay
all the transations in the log and expect the inode to be completely
restored. We lose file size updates. Unfortunately this fix is causing
more (serious) problems than it is fixing.
SGI-PV: 969656
SGI-Modid: xfs-linux-melb:xfs-kern:29804a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
It doesn't look as if the NFS file name limit is being initialised correctly
in the struct nfs_server. Make sure that we limit whatever is being set in
nfs_probe_fsinfo() and nfs_init_server().
Also ensure that readdirplus and nfs4_path_walk respect our file name
limits.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The problem is that the garbage collector for the 'host' structures
nlm_gc_hosts(), holds nlm_host_mutex while calling down to
nlmsvc_mark_resources, which, eventually takes the file->f_mutex.
We cannot therefore call nlmsvc_lookup_host() from within
nlmsvc_create_block, since the caller will already hold file->f_mutex, so
the attempt to grab nlm_host_mutex may deadlock.
Fix the problem by calling nlmsvc_lookup_host() outside the file->f_mutex.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6:
[PATCH] WE : Add missing auth compat-ioctl
[PATCH] softmac: Fix inability to associate with WEP networks
Different types of ufs hold state in different places, to hide complexity
of this, there is ufs_get_fs_state, it returns state according to
"UFS_SB(sb)->s_flags", but during mount ufs_get_fs_state is called, before
setting s_flags, this cause message for ufs types like sun ufs: "fs need
fsck", and remount in readonly state.
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add minimum support for the Blackfin relocations, since we don't have
enough space in each reloc. The idea is to store a value with one
relocation so that subsequent ones can access it.
Actually, this patch is required for Blackfin. Currently if BINFMT_FLAT is
enabled, git-tree kernel will fail to compile.
Signed-off-by: Bernd Schmidt <bernd.schmidt@analog.com>
Signed-off-by: Bryan Wu <bryan.wu@analog.com>
Cc: David McCullough <davidm@snapgear.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Cc: Miles Bader <miles.bader@necel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
ocfs2: Pack vote message and response structures
ocfs2: Don't double set write parameters
ocfs2: Fix pos/len passed to ocfs2_write_cluster
ocfs2: Allow smaller allocations during large writes
Johannes just found that we are missing a compat-ioctl
declaration. The fix is trivial. As previous patches for compat-ioctl,
this should also go to stable.
More info :
http://marc.info/?l=linux-wireless&m=119029667902588&w=2
Signed-off-by: Jean Tourrilhes <jt@hpl.hp.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The ocfs2_vote_msg and ocfs2_response_msg structs needed to be
packed to ensure similar sizeofs in 32-bit and 64-bit arches. Without this,
we had inadvertantly broken 32/64 bit cross mounts.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
The target page offsets were being incorrectly set a second time in
ocfs2_prepare_page_for_write(), which was causing problems on a 16k page
size kernel. Additionally, ocfs2_write_failure() was incorrectly using those
parameters instead of the parameters for the individual page being cleaned
up.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
This was broken for file systems whose cluster size is greater than page
size. Pos needs to be incremented as we loop through the descriptors, and
len needs to be capped to the size of a single cluster.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
The ocfs2 write code loops through a page much like the block code, except
that ocfs2 allocation units can be any size, including larger than page
size. Typically it's equal to or larger than page size - most kernels run 4k
pages, the minimum ocfs2 allocation (cluster) size.
Some changes introduced during 2.6.23 changed the way writes to pages are
handled, and inadvertantly broke support for > 4k page size. Instead of just
writing one cluster at a time, we now handle the whole page in one pass.
This means that multiple (small) seperate allocations might happen in the
same pass. The allocation code howver typically optimizes by getting the
maximum which was reserved. This triggered a BUG_ON in the extend code where
it'd ask for a single bit (for one part of a > 4k page) and get back more
than it asked for.
Fix this by providing a variant of the high level allocation function which
allows the caller to specify a maximum. The traditional function remains and
just calls the new one with a maximum determined from the initial
reservation.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
This simplifies signalfd code, by avoiding it to remain attached to the
sighand during its lifetime.
In this way, the signalfd remain attached to the sighand only during
poll(2) (and select and epoll) and read(2). This also allows to remove
all the custom "tsk == current" checks in kernel/signal.c, since
dequeue_signal() will only be called by "current".
I think this is also what Ben was suggesting time ago.
The external effect of this, is that a thread can extract only its own
private signals and the group ones. I think this is an acceptable
behaviour, in that those are the signals the thread would be able to
fetch w/out signalfd.
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The new xlog_recover_do_reg_buffer checks call be16_to_cpu on di_gen which
is a 32bit value so sparse rightly complains. Fortunately the warning is
harmless because we don't care for the value, but only whether it's
non-NULL. Due to that fact we can simply kill the endian swaps on this and
the previous di_mode check entirely.
SGI-PV: 969656
SGI-Modid: xfs-linux-melb:xfs-kern:29709a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
xfs_filestream_mount() sets up an mru cache with:
err = xfs_mru_cache_create(&mp->m_filestream, lifetime, grp_count,
(xfs_mru_cache_free_func_t)xfs_fstrm_free_func);
but that cast is causing problems...
typedef void (*xfs_mru_cache_free_func_t)(unsigned long, void*);
but:
void xfs_fstrm_free_func( xfs_ino_t ino, fstrm_item_t *item)
so on a 32-bit box, it's casting (32, 32) args into (64, 32) and I assume
it's getting garbage for *item, which subsequently causes an explosion.
With this change the filestreams xfsqa tests don't oops on my 32-bit box.
SGI-PV: 967795
SGI-Modid: xfs-linux-melb:xfs-kern:29510a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
* 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6:
[XFS] Avoid replaying inode buffer initialisation log items if on-disk version is newer.
[XFS] Ensure file size updates have been completed before writing inode to disk.
[XFS] On-demand reaping of the MRU cache
The do_split() function for htree dir blocks is intended to split a leaf
block to make room for a new entry. It sorts the entries in the original
block by hash value, then moves the last half of the entries to the new
block - without accounting for how much space this actually moves. (IOW,
it moves half of the entry *count* not half of the entry *space*). If by
chance we have both large & small entries, and we move only the smallest
entries, and we have a large new entry to insert, we may not have created
enough space for it.
The patch below stores each record size when calculating the dx_map, and
then walks the hash-sorted dx_map, calculating how many entries must be
moved to more evenly split the existing entries between the old block and
the new block, guaranteeing enough space for the new entry.
The dx_map "offs" member is reduced to u16 so that the overall map size
does not change - it is temporarily stored at the end of the new block, and
if it grows too large it may be overwritten. By making offs and size both
u16, we won't grow the map size.
Also add a few comments to the functions involved.
This fixes the testcase reported by hooanon05@yahoo.co.jp on the
linux-ext4 list, "ext3 dir_index causes an error"
Thanks to Andreas Dilger for discussing the problem & solution with me.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Tested-by: Junjiro Okajima <hooanon05@yahoo.co.jp>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: <linux-ext4@vger.kernel.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert asserts (BUGs) in dx_probe from bad on-disk data to recoverable
errors with helpful warnings. With help catching other asserts from Duane
Griffin <duaneg@dghda.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: Duane Griffin <duaneg@dghda.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of running the mru cache reaper all the time based on a timeout,
we should only run it when the cache has active objects. This allows CPUs
to sleep when there is no activity rather than be woken repeatedly just to
check if there is anything to do.
SGI-PV: 968554
SGI-Modid: xfs-linux-melb:xfs-kern:29305a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
ocfs2: Fix calculation of i_blocks during truncate
[PATCH] ocfs2: Fix a wrong cluster calculation.
[PATCH] ocfs2: fix mount option parsing
ocfs2: update docs for new features
The inode->i_flock list contains the leases, flocks and posix
locks in the specified order. However, the flocks are added in
the head of this list thus hiding the leases from F_GETLEASE
command, from time_out_leases() and other code that expects
the leases to come first.
The following example will demonstrate this:
#define _GNU_SOURCE
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
static void show_lease(int fd)
{
int res;
res = fcntl(fd, F_GETLEASE);
switch (res) {
case F_RDLCK:
printf("Read lease\n");
break;
case F_WRLCK:
printf("Write lease\n");
break;
case F_UNLCK:
printf("No leases\n");
break;
default:
printf("Some shit\n");
break;
}
}
int main(int argc, char **argv)
{
int fd, res;
fd = open(argv[1], O_RDONLY);
if (fd == -1) {
perror("Can't open file");
return 1;
}
res = fcntl(fd, F_SETLEASE, F_WRLCK);
if (res == -1) {
perror("Can't set lease");
return 1;
}
show_lease(fd);
if (flock(fd, LOCK_SH) == -1) {
perror("Can't flock shared");
return 1;
}
show_lease(fd);
return 0;
}
The first call to show_lease() will show the write lease set, but
the second will show no leases.
Fix the flock adding so that the leases always stay in the head
of this list.
Found during making the flocks pid-namespaces aware.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Taneli Vähäkangas <vahakang@cs.helsinki.fi> reported that commit
786d7e1612 aka "Fix rmmod/read/write races
in /proc entries" broke SBCL + SLIME combo.
The old code in do_select() used DEFAULT_POLLMASK, if couldn't find
->poll handler. The new code makes ->poll always there and returns 0 by
default, which is not correct. Return DEFAULT_POLLMASK instead.
Steps to reproduce:
install emacs, SBCL, SLIME
emacs
M-x slime in *inferior-lisp* buffer
[watch it doing "Connecting to Swank on port X.."]
Please, apply before 2.6.23.
P.S.: why SBCL can't just read(2) /proc/cpuinfo is a mystery.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: T Taneli Vahakangas <vahakang@cs.helsinki.fi>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dput must be called before mntput here.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-By: David Howells <dhowells@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>