Commit Graph

16450 Commits

Author SHA1 Message Date
Tao Ma
0a1ea437d8 ocfs2: Only bug out when page size is larger than cluster size.
In CoW, we have to make sure that the page is already written
out to the disk. So we have a BUG_ON(PageDirty(page)).

In ppc platform we have pagesize=64K, so if the cs=4K, if the
file have fragmented clusters, we will map the page many times.
See this file as an example.
Tree Depth: 0   Count: 19   Next Free Rec: 14
	## Offset        Clusters       Block#          Flags
	0  0             4              2164864         0x2 Refcounted
	1  4             2              9302792         0x2 Refcounted
...

We have to replace the extent recs one by one, so the page with index 0
will be mapped and dirtied twice.

I'd like to leave the BUG_ON there while adding a check so that in
case we meet with an error in other platforms, we can find it easily.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-02 18:15:35 -08:00
Tao Ma
d622b89a2f ocfs2: Fix memory overflow in cow_by_page.
In ocfs2_duplicate_clusters_by_page, we calculate map_end
by shifting page_index. But actually in case we meet with
a large offset(say in a i686 box, poff_t is only 32 bits
and page_index=2056240), we will overflow. So change the
type of page_index to loff_t.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-02 18:14:20 -08:00
Sunil Mushran
26636bf6b2 ocfs2/dlm: Print more messages during lock migration
When a lock resource is migrated, the dlm compares the migrated
locks with that that was already existing on the new node. If the
comparison fails, it BUGs. This patch prints more messages when the
comparison fails inorder to help with the root cause analyis.

http://oss.oracle.com/bugzilla/show_bug.cgi?id=1206
This does not fix bz1206. However, if we run into it again, we will
have more information to chew on.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-01-25 19:21:09 -08:00
Sunil Mushran
71656fa6ec ocfs2/dlm: Ignore LVBs of locks in the Blocked list
During lock resource migration, o2dlm fills the packet with a LVB from the
first valid lock. For sanity, it ensures that the other valid locks have the
same LVB. If not, it BUGs.

The valid locks are ones that have granted EX or PR lock levels and are either
on the Granted or Converting lists. Locks in the Blocked list cannot have a
valid LVB.

This patch ensures that we skip the locks in the Blocked list.

Fixes oss bugzilla#1202
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1202

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-01-25 19:20:57 -08:00
Sunil Mushran
2bd632165c ocfs2/trivial: Remove trailing whitespaces
Patch removes trailing whitespaces.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-01-25 19:20:51 -08:00
Wengang Wang
e5f2cb2b1a ocfs2: fix a misleading variable name
a local variable "dlm_version" is used as a fs locking version.
rename it fs_version.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-01-25 19:20:48 -08:00
Tao Ma
1097df3ffe ocfs2: Sync max_inline_data_with_xattr from tools.
In ocfs2-tools, we have added ocfs2_max_inline_data_with_xattr,
so add it in the kernel's ocfs2_fs.h.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-01-25 19:20:45 -08:00
OGAWA Hirofumi
1dd473fdf1 ocfs2: Fix refcnt leak on ocfs2_fast_follow_link() error path
If ->follow_link handler returns an error, it should decrement
nd->path refcnt. But ocfs2_fast_follow_link() doesn't decrement.

This patch fixes the problem by using nd_set_link() style error handling
instead of playing with nd->path.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-01-11 15:38:50 -08:00
Linus Torvalds
1b4d40a517 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: Ensure we force all busy extents in range to disk
  xfs: Don't flush stale inodes
  xfs: fix timestamp handling in xfs_setattr
  xfs: use DECLARE_EVENT_CLASS
2010-01-11 09:48:48 -08:00
Linus Torvalds
79ecb043ea Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
  GFS2: Use MAX_LFS_FILESIZE for meta inode size
  GFS2: Fix gfs2_xattr_acl_chmod()
  GFS2: Fix locking bug in rename
  GFS2: Ensure uptodate inode size when using O_APPEND
2010-01-11 09:48:29 -08:00
Linus Torvalds
db1fc95744 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
  quota: Fix dquot_transfer for filesystems different from ext4
2010-01-11 09:48:14 -08:00
Minchan Kim
7f53a09ed4 smaps: fix wrong rss count
A long time ago we regarded zero page as file_rss and vm_normal_page
doesn't return NULL.

But now, we reinstated ZERO_PAGE and vm_normal_page's implementation can
return NULL in case of zero page.  Also we don't count it with file_rss
any more.

Then, RSS and PSS can't be matched.  For consistency, Let's ignore zero
page in smaps_pte_range.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Acked-by: Matt Mackall <mpm@selenic.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-11 09:34:07 -08:00
KOSAKI Motohiro
1306d603fc proc: partially revert "procfs: provide stack information for threads"
Commit d899bf7b (procfs: provide stack information for threads) introduced
to show stack information in /proc/{pid}/status.  But it cause large
performance regression.  Unfortunately /proc/{pid}/status is used ps
command too and ps is one of most important component.  Because both to
take mmap_sem and page table walk are heavily operation.

If many process run, the ps performance is,

[before d899bf7b]

% perf stat ps >/dev/null

 Performance counter stats for 'ps':

     4090.435806  task-clock-msecs         #      0.032 CPUs
             229  context-switches         #      0.000 M/sec
               0  CPU-migrations           #      0.000 M/sec
             234  page-faults              #      0.000 M/sec
      8587565207  cycles                   #   2099.425 M/sec
      9866662403  instructions             #      1.149 IPC
      3789415411  cache-references         #    926.409 M/sec
        30419509  cache-misses             #      7.437 M/sec

   128.859521955  seconds time elapsed

[after d899bf7b]

% perf stat  ps  > /dev/null

 Performance counter stats for 'ps':

     4305.081146  task-clock-msecs         #      0.028 CPUs
             480  context-switches         #      0.000 M/sec
               2  CPU-migrations           #      0.000 M/sec
             237  page-faults              #      0.000 M/sec
      9021211334  cycles                   #   2095.480 M/sec
     10605887536  instructions             #      1.176 IPC
      3612650999  cache-references         #    839.160 M/sec
        23917502  cache-misses             #      5.556 M/sec

   152.277819582  seconds time elapsed

Thus, this patch revert it. Fortunately /proc/{pid}/task/{tid}/smaps
provide almost same information. we can use it.

Commit d899bf7b introduced two features:

 1) Add the annotattion of [thread stack: xxxx] mark to
    /proc/{pid}/task/{tid}/maps.
 2) Add StackUsage field to /proc/{pid}/status.

I only revert (2), because I haven't seen (1) cause regression.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Stefani Seibold <stefani@seibold.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-11 09:34:06 -08:00
Jan Kara
05b5d89823 quota: Fix dquot_transfer for filesystems different from ext4
Commit fd8fbfc1 modified the way we find amount of reserved space
belonging to an inode. The amount of reserved space is checked
from dquot_transfer and thus inode_reserved_space gets called
even for filesystems that don't provide get_reserved_space callback
which results in a BUG.

Fix the problem by checking get_reserved_space callback and return 0 if
the filesystem does not provide it.

CC: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-01-11 13:06:41 +01:00
Steven Whitehouse
ba198098a2 GFS2: Use MAX_LFS_FILESIZE for meta inode size
Using ~0ULL was cauing sign issues in filemap_fdatawrite_range, so
use MAX_LFS_FILESIZE instead.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-01-11 08:57:55 +00:00
Dave Chinner
fd45e47841 xfs: Ensure we force all busy extents in range to disk
When we search for and find a busy extent during allocation we
force the log out to ensure the extent free transaction is on
disk before the allocation transaction. The current implementation
has a subtle bug in it--it does not handle multiple overlapping
ranges.

That is, if we free lots of little extents into a single
contiguous extent, then allocate the contiguous extent, the busy
search code stops searching at the first extent it finds that
overlaps the allocated range. It then uses the commit LSN of the
transaction to force the log out to.

Unfortunately, the other busy ranges might have more recent
commit LSNs than the first busy extent that is found, and this
results in xfs_alloc_search_busy() returning before all the
extent free transactions are on disk for the range being
allocated. This can lead to potential metadata corruption or
stale data exposure after a crash because log replay won't replay
all the extent free transactions that cover the allocation range.

Modified-by: Alex Elder <aelder@sgi.com>

(Dropped the "found" argument from the xfs_alloc_busysearch trace
event.)

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-10 12:22:02 -06:00
Dave Chinner
44e08c45cc xfs: Don't flush stale inodes
Because inodes remain in cache much longer than inode buffers do
under memory pressure, we can get the situation where we have
stale, dirty inodes being reclaimed but the backing storage has
been freed.  Hence we should never, ever flush XFS_ISTALE inodes
to disk as there is no guarantee that the backing buffer is in
cache and still marked stale when the flush occurs.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-10 12:22:00 -06:00
Christoph Hellwig
d6d59bada3 xfs: fix timestamp handling in xfs_setattr
We currently have some rather odd code in xfs_setattr for
updating the a/c/mtime timestamps:

 - first we do a non-transaction update if all three are updated
   together
 - second we implicitly update the ctime for various changes
   instead of relying on the ATTR_CTIME flag
 - third we set the timestamps to the current time instead of the
   arguments in the iattr structure in many cases.

This patch makes sure we update it in a consistent way:

 - always transactional
 - ctime is only updated if ATTR_CTIME is set or we do a size
   update, which is a special case
 - always to the times passed in from the caller instead of the
   current time

The only non-size caller of xfs_setattr that doesn't come from
the VFS is updated to set ATTR_CTIME and pass in a valid ctime
value.

Reported-by: Eric Blake <ebb9@byu.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-10 12:21:58 -06:00
Christoph Hellwig
ea9a48881e xfs: use DECLARE_EVENT_CLASS
Using DECLARE_EVENT_CLASS allows us to to use trace event code
instead of duplicating it in the binary.  This was not available
before 2.6.33 so it had to be done as a separate step once the
prerequisite was merged.

This only requires changes to xfs_trace.h and the results are
rather impressive:

hch@brick:~/work/linux-2.6/obj-kvm$ size fs/xfs/xfs.o*
text	   data	    bss	    dec	    hex	filename
 607732	  41884	   3616	 653232	  9f7b0	fs/xfs/xfs.o
1026732	  41884	   3808	1072424	 105d28	fs/xfs/xfs.o.old

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-10 12:21:56 -06:00
Linus Torvalds
82062e7b50 Merge branch 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing
* 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing:
  reiserfs: Relax reiserfs_xattr_set_handle() while acquiring xattr locks
  reiserfs: Fix unreachable statement
  reiserfs: Don't call reiserfs_get_acl() with the reiserfs lock
  reiserfs: Relax lock on xattr removing
  reiserfs: Relax the lock before truncating pages
  reiserfs: Fix recursive lock on lchown
  reiserfs: Fix mistake in down_write() conversion
2010-01-08 14:03:55 -08:00
Linus Torvalds
dbd6a7cfea Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: kill some warnings on i386 builds
2010-01-08 13:57:32 -08:00
Linus Torvalds
e2b6d02cca Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  nfs: fix oops in nfs_rename()
  sunrpc: fix build-time warning
  sunrpc: on successful gss error pipe write, don't return error
  SUNRPC: Fix the return value in gss_import_sec_context()
  SUNRPC: Fix up an error return value in gss_import_sec_context_kerberos()
2010-01-08 13:55:14 -08:00
Dave Chinner
a539bd8c86 xfs: kill some warnings on i386 builds
Randy Dunlap Reported printk() format-related warnings reported
on i386 builds in his environment.  Dave Chinner provided this
patch to eliminate them.

Signed-off by: Dave Chinner <david@fromorbit.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>

Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-08 13:32:29 -06:00
Steven Whitehouse
e412bdb126 GFS2: Fix gfs2_xattr_acl_chmod()
The ref counting for the bh returned by gfs2_ea_find() was
wrong. This patch ensures that we always drop the ref count
to that bh correctly.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-01-08 13:42:59 +00:00
Steven Whitehouse
24b977b5fd GFS2: Fix locking bug in rename
The rename code was taking a resource group lock in cases where
it wasn't actually needed, this caused problems if the rename
was resulting in an inode being unlinked. The patch ensures that
we only take the rgrp lock early if it is really needed.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-01-08 13:42:42 +00:00
Steven Whitehouse
56aa616a03 GFS2: Ensure uptodate inode size when using O_APPEND
The VFS reads the inode size during generic_file_aio_write() but
with no locking around it. In order to get the expected result
from O_APPEND opens, this patch updated the inode size before
calling generic_file_aio_write()

There is of course still a race here, in that there is nothing to
prevent another node coming in and extending the file in the
mean time. On the other hand, when used with file locking this
will ensure that the expected results are obtained.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-01-08 13:42:27 +00:00
Frederic Weisbecker
31370f62ba reiserfs: Relax reiserfs_xattr_set_handle() while acquiring xattr locks
Fix remaining xattr locks acquired in reiserfs_xattr_set_handle()
while we are holding the reiserfs lock to avoid lock inversions.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
2010-01-07 16:02:53 +01:00
Jiri Slaby
e0baec1b63 reiserfs: Fix unreachable statement
Stanse found an unreachable statement in reiserfs_ioctl. There is a
if followed by error assignment and `break' with no braces. Add the
braces so that we don't break every time, but only in error case,
so that REISERFS_IOC_SETVERSION actually works when it returns no
error.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Reiserfs <reiserfs-devel@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-01-07 14:03:18 +01:00
Frederic Weisbecker
6c28705418 reiserfs: Don't call reiserfs_get_acl() with the reiserfs lock
reiserfs_get_acl is usually not called under the reiserfs lock,
as it doesn't need it. But it happens when it is called by
reiserfs_acl_chmod(), which creates a dependency inversion against
the private xattr inodes mutexes for the given inode.

We need to call it without the reiserfs lock, especially since
it's unnecessary.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
2010-01-07 13:46:48 +01:00
Mike Frysinger
04e4f2b18c FDPIC: Respect PT_GNU_STACK exec protection markings when creating NOMMU stack
The current code will load the stack size and protection markings, but
then only use the markings in the MMU code path.  The NOMMU code path
always passes PROT_EXEC to the mmap() call.  While this doesn't matter
to most people whilst the code is running, it will cause a pointless
icache flush when starting every FDPIC application.  Typically this
icache flush will be of a region on the order of 128KB in size, or may
be the entire icache, depending on the facilities available on the CPU.

In the case where the arch default behaviour seems to be desired
(EXSTACK_DEFAULT), we probe VM_STACK_FLAGS for VM_EXEC to determine
whether we should be setting PROT_EXEC or not.

For arches that support an MPU (Memory Protection Unit - an MMU without
the virtual mapping capability), setting PROT_EXEC or not will make an
important difference.

It should be noted that this change also affects the executability of
the brk region, since ELF-FDPIC has that share with the stack.  However,
this is probably irrelevant as NOMMU programs aren't likely to use the
brk region, preferring instead allocation via mmap().

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-06 18:16:02 -08:00
Linus Torvalds
93939f4e5d Merge branch 'for-2.6.33' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.33' of git://linux-nfs.org/~bfields/linux:
  sunrpc: fix peername failed on closed listener
  nfsd: make sure data is on disk before calling ->fsync
  nfsd: fix "insecure" export option
2010-01-06 18:10:15 -08:00
OGAWA Hirofumi
56335936de nfs: fix oops in nfs_rename()
Recent change is missing to update "rehash".  With that change, it will
become the cause of adding dentry to hash twice.

This explains the reason of Oops (dereference the freed dentry in
__d_lookup()) on my machine.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reported-by: Marvin <marvin24@gmx.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-01-06 18:48:26 -05:00
Christoph Hellwig
7211a4e859 nfsd: make sure data is on disk before calling ->fsync
nfsd is not using vfs_fsync, so I missed it when changing the calling
convention during the 2.6.32 window.  This patch fixes it to not only
start the data writeout, but also wait for it to complete before calling
into ->fsync.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-01-06 17:37:26 -05:00
Linus Torvalds
c6f7afaeed Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd
* 'for-linus' of git://git.open-osd.org/linux-open-osd:
  exofs: simple_write_end does not mark_inode_dirty
  exofs: fix pnfs_osd re-definitions in pre-pnfs trees
2010-01-06 01:41:07 -08:00
Linus Torvalds
6307daad84 Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
  ocfs2: Handle O_DIRECT when writing to a refcounted cluster.
2010-01-05 16:01:04 -08:00
Boaz Harrosh
efd124b999 exofs: simple_write_end does not mark_inode_dirty
exofs uses simple_write_end() for it's .write_end handler. But
it is not enough because simple_write_end() does not call
mark_inode_dirty() when it extends i_size. So even if we do
call mark_inode_dirty at beginning of write out, with a very
long IO and a saturated system we might get the .write_inode()
called while still extend-writing to file and miss out on the last
i_size updates.

So override .write_end, call simple_write_end(), and afterwords if
i_size was changed call mark_inode_dirty().

It stands to logic that since simple_write_end() was the one extending
i_size it should also call mark_inode_dirty(). But it looks like all
users of simple_write_end() are memory-bound pseudo filesystems, who
could careless about mark_inode_dirty(). I might submit a
warning-comment patch to simple_write_end() in future.

CC: Stable <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-01-05 09:14:32 +02:00
Boaz Harrosh
89be503021 exofs: fix pnfs_osd re-definitions in pre-pnfs trees
Some on disk exofs constants and types are defined in the pnfs_osd_xdr.h
file. Since we needed these types before the pnfs-objects code was
accepted to mainline we duplicated the minimal needed definitions into
an exofs local header. The definitions where conditionally included
depending on !CONFIG_PNFS defined. So if PNFS was present in the tree
definitions are taken from there and if not they are defined locally.

That was all good but, the CONFIG_PNFS is planed to be included upstream
before the pnfs-objects is also included. (The first pnfs batch might be
pnfs-files only)

So condition exofs local definitions on the absence of pnfs_osd_xdr.h
inclusion (__PNFS_OSD_XDR_H__ not defined). User code must make sure
that in future pnfs_osd_xdr.h will be included before fs/exofs/pnfs.h,
which happens to be so in current code.

Once pnfs-objects hits mainline, exofs's local header will be removed.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-01-05 09:14:32 +02:00
Frederic Weisbecker
4f3be1b5a9 reiserfs: Relax lock on xattr removing
When we remove an xattr, we call lookup_and_delete_xattr()
that takes some private xattr inodes mutexes. But we hold
the reiserfs lock at this time, which leads to dependency
inversions.

We can safely call lookup_and_delete_xattr() without the
reiserfs lock, where xattr inodes lookups only need the
xattr inodes mutexes.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
2010-01-05 08:00:50 +01:00
Frederic Weisbecker
108d3943c0 reiserfs: Relax the lock before truncating pages
While truncating a file, reiserfs_setattr() calls inode_setattr()
that will truncate the mapping for the given inode, but for that
it needs the pages locks.

In order to release these, the owners need the reiserfs lock to
complete their jobs. But they can't, as we don't release it before
calling inode_setattr().

We need to do that to fix the following softlockups:

INFO: task flush-8:0:2149 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-8:0     D f51af998     0  2149      2 0x00000000
 f51af9ac 00000092 00000002 f51af998 c2803304 00000000 c1894ad0 010f3000
 f51af9cc c1462604 c189ef80 f51af974 c1710304 f715b450 f715b5ec c2807c40
 00000000 0005bb00 c2803320 c102c55b c1710304 c2807c50 c2803304 00000246
Call Trace:
 [<c1462604>] ? schedule+0x434/0xb20
 [<c102c55b>] ? resched_task+0x4b/0x70
 [<c106fa22>] ? mark_held_locks+0x62/0x80
 [<c146414d>] ? mutex_lock_nested+0x1fd/0x350
 [<c14640b9>] mutex_lock_nested+0x169/0x350
 [<c1178cde>] ? reiserfs_write_lock+0x2e/0x40
 [<c1178cde>] reiserfs_write_lock+0x2e/0x40
 [<c11719a2>] do_journal_end+0xc2/0xe70
 [<c1172912>] journal_end+0xb2/0x120
 [<c11686b3>] ? pathrelse+0x33/0xb0
 [<c11729e4>] reiserfs_end_persistent_transaction+0x64/0x70
 [<c1153caa>] reiserfs_get_block+0x12ba/0x15f0
 [<c106fa22>] ? mark_held_locks+0x62/0x80
 [<c1154b24>] reiserfs_writepage+0xa74/0xe80
 [<c1465a27>] ? _raw_spin_unlock_irq+0x27/0x50
 [<c11f3d25>] ? radix_tree_gang_lookup_tag_slot+0x95/0xc0
 [<c10b5377>] ? find_get_pages_tag+0x127/0x1a0
 [<c106fa22>] ? mark_held_locks+0x62/0x80
 [<c106fcd4>] ? trace_hardirqs_on_caller+0x124/0x170
 [<c10bc1e0>] __writepage+0x10/0x40
 [<c10bc9ab>] write_cache_pages+0x16b/0x320
 [<c10bc1d0>] ? __writepage+0x0/0x40
 [<c10bcb88>] generic_writepages+0x28/0x40
 [<c10bcbd5>] do_writepages+0x35/0x40
 [<c11059f7>] writeback_single_inode+0xc7/0x330
 [<c11067b2>] writeback_inodes_wb+0x2c2/0x490
 [<c1106a86>] wb_writeback+0x106/0x1b0
 [<c1106cf6>] wb_do_writeback+0x106/0x1e0
 [<c1106c18>] ? wb_do_writeback+0x28/0x1e0
 [<c1106e0a>] bdi_writeback_task+0x3a/0xb0
 [<c10cbb13>] bdi_start_fn+0x63/0xc0
 [<c10cbab0>] ? bdi_start_fn+0x0/0xc0
 [<c105d1f4>] kthread+0x74/0x80
 [<c105d180>] ? kthread+0x0/0x80
 [<c100327a>] kernel_thread_helper+0x6/0x10
3 locks held by flush-8:0/2149:
 #0:  (&type->s_umount_key#30){+++++.}, at: [<c110676f>] writeback_inodes_wb+0x27f/0x490
 #1:  (&journal->j_mutex){+.+...}, at: [<c117199a>] do_journal_end+0xba/0xe70
 #2:  (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1178cde>] reiserfs_write_lock+0x2e/0x40
INFO: task fstest:3813 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
fstest        D 00000002     0  3813   3812 0x00000000
 f5103c94 00000082 f5103c40 00000002 f5ad5450 00000007 f5103c28 011f3000
 00000006 f5ad5450 c10bb005 00000480 c1710304 f5ad5450 f5ad55ec c2907c40
 00000001 f5ad5450 f5103c74 00000046 00000002 f5ad5450 00000007 f5103c6c
Call Trace:
 [<c10bb005>] ? free_hot_cold_page+0x1d5/0x280
 [<c1462d64>] io_schedule+0x74/0xc0
 [<c10b5a45>] sync_page+0x35/0x60
 [<c146325a>] __wait_on_bit_lock+0x4a/0x90
 [<c10b5a10>] ? sync_page+0x0/0x60
 [<c10b59e5>] __lock_page+0x85/0x90
 [<c105d660>] ? wake_bit_function+0x0/0x60
 [<c10bf654>] truncate_inode_pages_range+0x1e4/0x2d0
 [<c10bf75f>] truncate_inode_pages+0x1f/0x30
 [<c10bf7cf>] truncate_pagecache+0x5f/0xa0
 [<c10bf86a>] vmtruncate+0x5a/0x70
 [<c10fdb7d>] inode_setattr+0x5d/0x190
 [<c1150117>] reiserfs_setattr+0x1f7/0x2f0
 [<c1464569>] ? down_write+0x49/0x70
 [<c10fde01>] notify_change+0x151/0x330
 [<c10e6f3d>] do_truncate+0x6d/0xa0
 [<c10f4ce2>] do_filp_open+0x9a2/0xcf0
 [<c1465aec>] ? _raw_spin_unlock+0x2c/0x50
 [<c10fec50>] ? alloc_fd+0xe0/0x100
 [<c10e602d>] do_sys_open+0x6d/0x130
 [<c1002cfb>] ? sysenter_exit+0xf/0x16
 [<c10e615e>] sys_open+0x2e/0x40
 [<c1002ccc>] sysenter_do_call+0x12/0x32
3 locks held by fstest/3813:
 #0:  (&sb->s_type->i_mutex_key#4){+.+.+.}, at: [<c10e6f33>] do_truncate+0x63/0xa0
 #1:  (&sb->s_type->i_alloc_sem_key#3){+.+.+.}, at: [<c10fdf07>] notify_change+0x257/0x330
 #2:  (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1178c8e>] reiserfs_write_lock_once+0x2e/0x50

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
2010-01-05 08:00:29 +01:00
Frederic Weisbecker
5fe1533fda reiserfs: Fix recursive lock on lchown
On chown, reiserfs will call reiserfs_setattr() to change the owner
of the given inode, but it may also recursively call
reiserfs_setattr() to propagate the owner change to the private xattr
files for this inode.

Hence, the reiserfs lock may be acquired twice which is not wanted
as reiserfs_setattr() calls journal_begin() that is going to try to
relax the lock in order to safely acquire the journal mutex.

Using reiserfs_write_lock_once() from reiserfs_setattr() solves
the problem.

This fixes the following warning, that precedes a lockdep report.

WARNING: at fs/reiserfs/lock.c:95 reiserfs_lock_check_recursive+0x3f/0x50()
Hardware name: MS-7418
Unwanted recursive reiserfs lock!
Pid: 4189, comm: fsstress Not tainted 2.6.33-rc2-tip-atom+ #195
Call Trace:
 [<c1178bff>] ? reiserfs_lock_check_recursive+0x3f/0x50
 [<c1178bff>] ? reiserfs_lock_check_recursive+0x3f/0x50
 [<c103f7ac>] warn_slowpath_common+0x6c/0xc0
 [<c1178bff>] ? reiserfs_lock_check_recursive+0x3f/0x50
 [<c103f84b>] warn_slowpath_fmt+0x2b/0x30
 [<c1178bff>] reiserfs_lock_check_recursive+0x3f/0x50
 [<c1172ae3>] do_journal_begin_r+0x83/0x350
 [<c1172f2d>] journal_begin+0x7d/0x140
 [<c106509a>] ? in_group_p+0x2a/0x30
 [<c10fda71>] ? inode_change_ok+0x91/0x140
 [<c115007d>] reiserfs_setattr+0x15d/0x2e0
 [<c10f9bf3>] ? dput+0xe3/0x140
 [<c1465adc>] ? _raw_spin_unlock+0x2c/0x50
 [<c117831d>] chown_one_xattr+0xd/0x10
 [<c11780a3>] reiserfs_for_each_xattr+0x113/0x2c0
 [<c1178310>] ? chown_one_xattr+0x0/0x10
 [<c14641e9>] ? mutex_lock_nested+0x2a9/0x350
 [<c117826f>] reiserfs_chown_xattrs+0x1f/0x60
 [<c106509a>] ? in_group_p+0x2a/0x30
 [<c10fda71>] ? inode_change_ok+0x91/0x140
 [<c1150046>] reiserfs_setattr+0x126/0x2e0
 [<c1177c20>] ? reiserfs_getxattr+0x0/0x90
 [<c11b0d57>] ? cap_inode_need_killpriv+0x37/0x50
 [<c10fde01>] notify_change+0x151/0x330
 [<c10e659f>] chown_common+0x6f/0x90
 [<c10e67bd>] sys_lchown+0x6d/0x80
 [<c1002ccc>] sysenter_do_call+0x12/0x32
---[ end trace 7c2b77224c1442fc ]---

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
2010-01-05 07:59:38 +01:00
Eric W. Biederman
846f99749a sysfs: Add lockdep annotations for the sysfs active reference
Holding locks over device_del -> kobject_del -> sysfs_deactivate can
cause deadlocks if those same locks are grabbed in sysfs show or store
methods.

The I model s_active count + completion as a sleeping read/write lock.
I describe to lockdep sysfs_get_active as a read_trylock,
sysfs_put_active as a read_unlock, and sysfs_deactivate as a
write_lock and write_unlock pair.  This seems to capture the essence
for purposes of finding deadlocks, and in my testing gives finds real
issues and ignores non-issues.

This brings us back to holding locks over kobject_del is a problem
that ideally we should find a way of addressing, but at least lockdep
can tell us about the problems instead of requiring developers to debug
rare strange system deadlocks, that happen when sysfs files are removed
while being written to.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-04 12:34:46 -08:00
Linus Torvalds
d4d3b19212 Merge branch 'sh/for-2.6.33' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6
* 'sh/for-2.6.33' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6:
  binfmt_elf_fdpic: Fix build breakage introduced by coredump changes.
  sh: update defconfigs.
  sh: Don't default enable PMB support.
  sh: Disable PMB for SH4AL-DSP CPUs.
  sh: Only provide a PCLK definition for legacy CPG CPUs.
2010-01-04 12:32:09 -08:00
Linus Torvalds
e43c259777 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Calculate metadata requirements more accurately
  ext4: Fix accounting of reserved metadata blocks
2010-01-04 12:31:52 -08:00
Linus Torvalds
a87da40875 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: update mailing list address
  nilfs2: Storage class should be before const qualifier
  nilfs2: trivial coding style fix
2010-01-04 12:28:26 -08:00
Daisuke HATAYAMA
2f48912d14 binfmt_elf_fdpic: Fix build breakage introduced by coredump changes.
Commit f6151dfea2 introduces build
breakage, so this patch fixes it together with some printk formatting
cleanup.

Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2010-01-04 15:42:14 +09:00
Frederic Weisbecker
f3e22f48f3 reiserfs: Fix mistake in down_write() conversion
Fix a mistake in commit 0719d34347
(reiserfs: Fix reiserfs lock <-> i_xattr_sem dependency inversion)
that has converted a down_write() into a down_read() accidentally.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
2010-01-03 03:44:53 +01:00
Linus Torvalds
45d28b0972 Merge branch 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing
* 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing:
  reiserfs: Safely acquire i_mutex from xattr_rmdir
  reiserfs: Safely acquire i_mutex from reiserfs_for_each_xattr
  reiserfs: Fix journal mutex <-> inode mutex lock inversion
  reiserfs: Fix unwanted recursive reiserfs lock in reiserfs_unlink()
  reiserfs: Relax lock before open xattr dir in reiserfs_xattr_set_handle()
  reiserfs: Relax reiserfs lock while freeing the journal
  reiserfs: Fix reiserfs lock <-> i_mutex dependency inversion on xattr
  reiserfs: Warn on lock relax if taken recursively
  reiserfs: Fix reiserfs lock <-> i_xattr_sem dependency inversion
  reiserfs: Fix remaining in-reclaim-fs <-> reclaim-fs-on locking inversion
  reiserfs: Fix reiserfs lock <-> inode mutex dependency inversion
  reiserfs: Fix reiserfs lock and journal lock inversion dependency
  reiserfs: Fix possible recursive lock
2010-01-02 11:17:05 -08:00
Jaswinder Singh Rajput
4b6764fa9e writeback: add missing kernel-doc notation
Fix the following htmldocs warning:

  Warning(fs/fs-writeback.c:255): No description found for parameter 'sb'

Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-02 10:09:44 -08:00
Frederic Weisbecker
835d5247d9 reiserfs: Safely acquire i_mutex from xattr_rmdir
Relax the reiserfs lock before taking the inode mutex from
xattr_rmdir() to avoid the usual reiserfs lock <-> inode mutex
bad dependency.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Tested-by: Christian Kujau <lists@nerdbynature.de>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
2010-01-02 01:59:48 +01:00
Frederic Weisbecker
8b513f56d4 reiserfs: Safely acquire i_mutex from reiserfs_for_each_xattr
Relax the reiserfs lock before taking the inode mutex from
reiserfs_for_each_xattr() to avoid the usual bad dependencies:

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.32-atom #179
-------------------------------------------------------
rm/3242 is trying to acquire lock:
 (&sb->s_type->i_mutex_key#4/3){+.+.+.}, at: [<c11428ef>] reiserfs_for_each_xattr+0x23f/0x290

but task is already holding lock:
 (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1143389>] reiserfs_write_lock+0x29/0x40

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&REISERFS_SB(s)->lock){+.+.+.}:
       [<c105ea7f>] __lock_acquire+0x11ff/0x19e0
       [<c105f2c8>] lock_acquire+0x68/0x90
       [<c1401aab>] mutex_lock_nested+0x5b/0x340
       [<c1143339>] reiserfs_write_lock_once+0x29/0x50
       [<c1117022>] reiserfs_lookup+0x62/0x140
       [<c10bd85f>] __lookup_hash+0xef/0x110
       [<c10bf21d>] lookup_one_len+0x8d/0xc0
       [<c1141e3a>] open_xa_dir+0xea/0x1b0
       [<c1142720>] reiserfs_for_each_xattr+0x70/0x290
       [<c11429ba>] reiserfs_delete_xattrs+0x1a/0x60
       [<c111ea2f>] reiserfs_delete_inode+0x9f/0x150
       [<c10c9c32>] generic_delete_inode+0xa2/0x170
       [<c10c9d4f>] generic_drop_inode+0x4f/0x70
       [<c10c8b07>] iput+0x47/0x50
       [<c10c0965>] do_unlinkat+0xd5/0x160
       [<c10c0b13>] sys_unlinkat+0x23/0x40
       [<c1002ec4>] sysenter_do_call+0x12/0x32

-> #0 (&sb->s_type->i_mutex_key#4/3){+.+.+.}:
       [<c105f176>] __lock_acquire+0x18f6/0x19e0
       [<c105f2c8>] lock_acquire+0x68/0x90
       [<c1401aab>] mutex_lock_nested+0x5b/0x340
       [<c11428ef>] reiserfs_for_each_xattr+0x23f/0x290
       [<c11429ba>] reiserfs_delete_xattrs+0x1a/0x60
       [<c111ea2f>] reiserfs_delete_inode+0x9f/0x150
       [<c10c9c32>] generic_delete_inode+0xa2/0x170
       [<c10c9d4f>] generic_drop_inode+0x4f/0x70
       [<c10c8b07>] iput+0x47/0x50
       [<c10c0965>] do_unlinkat+0xd5/0x160
       [<c10c0b13>] sys_unlinkat+0x23/0x40
       [<c1002ec4>] sysenter_do_call+0x12/0x32

other info that might help us debug this:

1 lock held by rm/3242:
 #0:  (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1143389>] reiserfs_write_lock+0x29/0x40

stack backtrace:
Pid: 3242, comm: rm Not tainted 2.6.32-atom #179
Call Trace:
 [<c13ffa13>] ? printk+0x18/0x1a
 [<c105d33a>] print_circular_bug+0xca/0xd0
 [<c105f176>] __lock_acquire+0x18f6/0x19e0
 [<c105c932>] ? mark_held_locks+0x62/0x80
 [<c105cc3b>] ? trace_hardirqs_on+0xb/0x10
 [<c1401098>] ? mutex_unlock+0x8/0x10
 [<c105f2c8>] lock_acquire+0x68/0x90
 [<c11428ef>] ? reiserfs_for_each_xattr+0x23f/0x290
 [<c11428ef>] ? reiserfs_for_each_xattr+0x23f/0x290
 [<c1401aab>] mutex_lock_nested+0x5b/0x340
 [<c11428ef>] ? reiserfs_for_each_xattr+0x23f/0x290
 [<c11428ef>] reiserfs_for_each_xattr+0x23f/0x290
 [<c1143180>] ? delete_one_xattr+0x0/0x100
 [<c11429ba>] reiserfs_delete_xattrs+0x1a/0x60
 [<c1143339>] ? reiserfs_write_lock_once+0x29/0x50
 [<c111ea2f>] reiserfs_delete_inode+0x9f/0x150
 [<c11b0d4f>] ? _atomic_dec_and_lock+0x4f/0x70
 [<c111e990>] ? reiserfs_delete_inode+0x0/0x150
 [<c10c9c32>] generic_delete_inode+0xa2/0x170
 [<c10c9d4f>] generic_drop_inode+0x4f/0x70
 [<c10c8b07>] iput+0x47/0x50
 [<c10c0965>] do_unlinkat+0xd5/0x160
 [<c1401098>] ? mutex_unlock+0x8/0x10
 [<c10c3e0d>] ? vfs_readdir+0x7d/0xb0
 [<c10c3af0>] ? filldir64+0x0/0xf0
 [<c1002ef3>] ? sysenter_exit+0xf/0x16
 [<c105cbe4>] ? trace_hardirqs_on_caller+0x124/0x170
 [<c10c0b13>] sys_unlinkat+0x23/0x40
 [<c1002ec4>] sysenter_do_call+0x12/0x32

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Tested-by: Christian Kujau <lists@nerdbynature.de>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
2010-01-02 01:59:14 +01:00