kernel_optimize_test

Author	SHA1	Message	Date
Chris Mason	25472b880c	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable into for-linus	2009-10-01 12:58:13 -04:00
Chris Mason	ab93dbecfb	Btrfs: take i_mutex before generic_write_checks btrfs_file_write was incorrectly calling generic_write_checks without taking i_mutex. This lead to problems with racing around i_size when doing O_APPEND writes. The fix here is to move i_mutex higher. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-10-01 12:29:10 -04:00
Christoph Hellwig	35d62a942d	Btrfs: fix arguments to btrfs_wait_on_page_writeback_range wait_on_page_writeback_range/btrfs_wait_on_page_writeback_range takes a pagecache offset, not a byte offset into the file. Shift the arguments around to wait for the correct range Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-10-01 10:27:01 -04:00
Eric Sandeen	f0e2dfa7f3	ext4: drop ext4dev compat Kconfig & super.c promised it'd be gone by 2.6.31, so it's about time to drop it. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-10-01 02:21:07 -04:00
Theodore Ts'o	1f94533d9c	ext4: fix a BUG_ON crash by checking that page has buffers attached to it In ext4_num_dirty_pages() we were calling page_buffers() before checking to see if the page actually had pages attached to it; this would cause a BUG check crash in the inline function page_buffers(). Thanks to Markus Trippelsdorf for reporting this bug. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-30 22:57:41 -04:00
David Teigland	6861f35078	dlm: fix socket fd translation The code to set up sctp sockets was not using the sockfd_lookup() and sockfd_put() routines to translate an fd to a socket. The direct fget and fput calls were resulting in error messages from alloc_fd(). Also clean up two log messages and remove a third, related to setting up sctp associations. Signed-off-by: David Teigland <teigland@redhat.com>	2009-09-30 12:19:44 -05:00
David Teigland	04bedd79a7	dlm: fix lowcomms_connect_node for sctp The recently added dlm_lowcomms_connect_node() from `391fbdc5d5` does not work when using SCTP instead of TCP. The sctp connection code has nothing to do without data to send. Check for no data in the sctp connection code and do nothing instead of triggering a BUG. Also have connect_node() do nothing when the protocol is sctp. Signed-off-by: David Teigland <teigland@redhat.com>	2009-09-30 12:19:44 -05:00
Linus Torvalds	9abf47f11b	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: nilfs2: fix missing initialization of i_dir_start_lookup member nilfs2: fix missing zero-fill initialization of btree node cache	2009-09-30 09:42:24 -07:00
Linus Torvalds	9f44fdc518	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: Fix time encoding with extra epoch bits ext4: Add a stub for mpage_da_data in the trace header jbd2: Use tracepoints for history file ext4: Use tracepoints for mb_history trace file ext4, jbd2: Drop unneeded printks at mount and unmount time ext4: Handle nested ext4_journal_start/stop calls without a journal ext4: Make sure ext4_dirty_inode() updates the inode in no journal mode ext4: Avoid updating the inode table bh twice in no journal mode ext4: EXT4_IOC_MOVE_EXT: Check for different original and donor inodes first ext4: async direct IO for holes and fallocate support ext4: Use end_io callback to avoid direct I/O fallback to buffered I/O ext4: Split uninitialized extents for direct I/O ext4: release reserved quota when block reservation for delalloc retry ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks ext4: Fix hueristic which avoids group preallocation for closed files ext4: Use ext4_msg() for ext4_da_writepage() errors ext4: Update documentation about quota mount options	2009-09-30 09:32:30 -07:00
Linus Torvalds	4c8f1cb266	Merge git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6: fat: Check s_dirt in fat_sync_fs() vfat: change the default from shortname=lower to shortname=mixed fat/nls: Fix handling of utf8 invalid char	2009-09-30 09:31:14 -07:00
Theodore Ts'o	c1fccc0696	ext4: Fix time encoding with extra epoch bits "Looking at ext4.h, I think the setting of extra time fields forgets to mask the epoch bits so the epoch part overwrites nsec part. The second change is only for coherency (2 -> EXT4_EPOCH_BITS)." Thanks to Damien Guibouret for pointing out this problem. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-30 01:13:55 -04:00
Theodore Ts'o	bf6993276f	jbd2: Use tracepoints for history file The /proc/fs/jbd2/<dev>/history was maintained manually; by using tracepoints, we can get all of the existing functionality of the /proc file plus extra capabilities thanks to the ftrace infrastructure. We save memory as a bonus. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-30 00:32:06 -04:00
Theodore Ts'o	296c355cd6	ext4: Use tracepoints for mb_history trace file The /proc/fs/ext4/<dev>/mb_history was maintained manually, and had a number of problems: it required a largish amount of memory to be allocated for each ext4 filesystem, and the s_mb_history_lock introduced a CPU contention problem. By ripping out the mb_history code and replacing it with ftrace tracepoints, and we get more functionality: timestamps, event filtering, the ability to correlate mballoc history with other ext4 tracepoints, etc. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-30 00:32:42 -04:00
Sage Weil	dd7e0b7b02	Btrfs: fix deadlock with free space handling and user transactions If an ioctl-initiated transaction is open, we can't force a commit during the free space checks in order to free up pinned extents or else we deadlock. Just ENOSPC instead. A more satisfying solution that reserves space for the entire user transaction up front is forthcoming... Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-29 19:50:07 -04:00
Sage Weil	1ab86aedbc	Btrfs: fix error cases for ioctl transactions Fix leak of vfsmount write reference and open_ioctl_trans reference on ENOMEM. Clean up the error paths while we're at it. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-29 18:38:44 -04:00
Theodore Ts'o	90576c0b9a	ext4, jbd2: Drop unneeded printks at mount and unmount time There are a number of kernel printk's which are printed when an ext4 filesystem is mounted and unmounted. Disable them to economize space in the system logs. In addition, disabling the mballoc stats by default saves a number of unneeded atomic operations for every block allocation or deallocation. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-29 15:51:30 -04:00
Chris Ball	3baf0bed0a	Btrfs: Use CONFIG_BTRFS_POSIX_ACL to enable ACL code We've already defined CONFIG_BTRFS_POSIX_ACL in Kconfig, but we're currently not using it and are testing CONFIG_FS_POSIX_ACL instead. CONFIG_FS_POSIX_ACL states "Never use this symbol for ifdefs". Signed-off-by: Chris Ball <cjb@laptop.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-29 13:51:05 -04:00
Julia Lawall	fd2696f399	Btrfs: introduce missing kfree Error handling code following a kzalloc should free the allocated data. The semantic match that finds the problem is as follows: (http://www.emn.fr/x-info/coccinelle/) // <smpl> @r exists@ local idexpression x; statement S; expression E; identifier f,f1,l; position p1,p2; expression ptr != NULL; @@ x@p1 = $kmalloc\\|kzalloc\\|kcalloc$(...); ... if (x == NULL) S <... when != x when != if (...) { <+...x...+> } ( x->f1 = E \| (x->f1 == NULL \|\| ...) \| f(...,x->f1,...) ) ...> ( return $0\\|<+...x...+>\\|ptr$; \| return@p2 ...; ) @script:python@ p1 << r.p1; p2 << r.p2; @@ print " file: %s kmalloc %s return %s" % (p1[0].file,p1[0].line,p2[0].line) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-29 13:51:04 -04:00
Chris Ball	49cf6f4529	Btrfs: Fix setting umask when POSIX ACLs are not enabled We currently set sb->s_flags \|= MS_POSIXACL unconditionally, which is incorrect -- it tells the VFS that it shouldn't set umask because we will, yet we don't set it ourselves if we aren't using POSIX ACLs, so the umask ends up ignored. Signed-off-by: Chris Ball <cjb@laptop.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-29 13:51:04 -04:00
Curt Wohlgemuth	d3d1faf6a7	ext4: Handle nested ext4_journal_start/stop calls without a journal This patch fixes a problem with handling nested calls to ext4_journal_start/ext4_journal_stop, when there is no journal present. Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-29 11:01:03 -04:00
Curt Wohlgemuth	f3dc272fd5	ext4: Make sure ext4_dirty_inode() updates the inode in no journal mode This patch a problem that ext4_dirty_inode() was not calling ext4_mark_inode_dirty() if the current_handle is not valid, which it is the case in no journal mode. It also removes a test for non-matching transaction which can never happen. Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-29 16:06:01 -04:00
Frank Mayhar	830156c79b	ext4: Avoid updating the inode table bh twice in no journal mode This is a cleanup of commit `91ac6f4`. Since ext4_mark_inode_dirty() has already called ext4_mark_iloc_dirty(), which in turn calls ext4_do_update_inode(), it's not necessary to have ext4_write_inode() call ext4_do_update_inode() in no journal mode. Indeed, it would be duplicated work. Reviewed-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Frank Mayhar <fmayhar@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-29 10:07:47 -04:00
Ryusuke Konishi	3cc811bffd	nilfs2: fix missing initialization of i_dir_start_lookup member The i_dir_start_lookup field in nilfs_inode_info objects should be cleared when the objects are allocated, but the the initialization was missing in case of reading from disk. This adds the initialization. Since the variable just gives a start page on directory lookups, the bug was nonfatal until now. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-09-29 20:32:13 +09:00
Ryusuke Konishi	1f28fcd925	nilfs2: fix missing zero-fill initialization of btree node cache This will fix file system corruption which infrequently happens after mount. The problem was reported from users with the title "[NILFS users] Fail to mount NILFS." (Message-ID: <200908211918.34720.yuri@itinteg.net>), and so forth. I've also experienced the corruption multiple times on kernel 2.6.30 and 2.6.31. The problem turned out to be caused due to discordance between mapping->nrpages of a btree node cache and the actual number of pages hung on the cache; if the mapping->nrpages becomes zero even as it has pages, truncate_inode_pages() returns without doing anything. Usually this is harmless except it may cause page leak, but garbage collection fairly infrequently sees a stale page remained in the btree node cache of DAT (i.e. disk address translation file of nilfs), and induces the corruption. I identified a missing initialization in btree node caches was the root cause. This corrects the bug. I've tested this for kernel 2.6.30 and 2.6.31. Reported-by: Yuri Chislov <yuri@itinteg.net> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: stable <stable@kernel.org>	2009-09-29 20:12:56 +09:00
Josef Bacik	9ed74f2dba	Btrfs: proper -ENOSPC handling At the start of a transaction we do a btrfs_reserve_metadata_space() and specify how many items we plan on modifying. Then once we've done our modifications and such, just call btrfs_unreserve_metadata_space() for the same number of items we reserved. For keeping track of metadata needed for data I've had to add an extent_io op for when we merge extents. This lets us track space properly when we are doing sequential writes, so we don't end up reserving way more metadata space than what we need. The only place where the metadata space accounting is not done is in the relocation code. This is because Yan is going to be reworking that code in the near future, so running btrfs-vol -b could still possibly result in a ENOSPC related panic. This patch also turns off the metadata_ratio stuff in order to allow users to more efficiently use their disk space. This patch makes it so we track how much metadata we need for an inode's delayed allocation extents by tracking how many extents are currently waiting for allocation. It introduces two new callbacks for the extent_io tree's, merge_extent_hook and split_extent_hook. These help us keep track of when we merge delalloc extents together and split them up. Reservations are handled prior to any actually dirty'ing occurs, and then we unreserve after we dirty. btrfs_unreserve_metadata_for_delalloc() will make the appropriate unreservations as needed based on the number of reservations we currently have and the number of extents we currently have. Doing the reservation outside of doing any of the actual dirty'ing lets us do things like filemap_flush() the inode to try and force delalloc to happen, or as a last resort actually start allocation on all delalloc inodes in the fs. This has survived dbench, fs_mark and an fsx torture test. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-28 16:29:42 -04:00
Theodore Ts'o	f3ce8064b3	ext4: EXT4_IOC_MOVE_EXT: Check for different original and donor inodes first Move the check to make sure the original and donor inodes are different earlier, to avoid a potential deadlock by trying to lock the same inode twice. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-28 15:58:29 -04:00
Mingming Cao	8d5d02e6b1	ext4: async direct IO for holes and fallocate support For async direct IO that covers holes or fallocate, the end_io callback function now queued the convertion work on workqueue but don't flush the work rightaway as it might take too long to afford. But when fsync is called after all the data is completed, user expects the metadata also being updated before fsync returns. Thus we need to flush the conversion work when fsync() is called. This patch keep track of a listed of completed async direct io that has a work queued on workqueue. When fsync() is called, it will go through the list and do the conversion. Signed-off-by: Mingming Cao <cmm@us.ibm.com>	2009-09-28 15:48:29 -04:00
Mingming Cao	4c0425ff68	ext4: Use end_io callback to avoid direct I/O fallback to buffered I/O Currently the DIO VFS code passes create = 0 when writing to the middle of file. It does this to avoid block allocation for holes, so as not to expose stale data out when there is a parallel buffered read (which does not hold the i_mutex lock). Direct I/O writes into holes falls back to buffered IO for this reason. Since preallocated extents are treated as holes when doing a get_block() look up (buffer is not mapped), direct IO over fallocate also falls back to buffered IO. Thus ext4 actually silently falls back to buffered IO in above two cases, which is undesirable. To fix this, this patch creates unitialized extents when a direct I/O write into holes in sparse files, and registering an end_io callback which converts the uninitialized extent to an initialized extent after the I/O is completed. Singed-Off-By: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-28 15:48:41 -04:00
Mingming Cao	0031462b5b	ext4: Split uninitialized extents for direct I/O When writing into an unitialized extent via direct I/O, and the direct I/O doesn't exactly cover the unitialized extent, split the extent into uninitialized and initialized extents before submitting the I/O. This avoids needing to deal with an ENOSPC error in the end_io callback that gets used for direct I/O. When the IO is complete, the written extent will be marked as initialized. Singed-Off-By: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-28 15:49:08 -04:00
Mingming Cao	9f0ccfd8e0	ext4: release reserved quota when block reservation for delalloc retry ext4_da_reserve_space() can reserve quota blocks multiple times if ext4_claim_free_blocks() fail and we retry the allocation. We should release the quota reservation before restarting. Bug found by Jan Kara. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-28 15:49:52 -04:00
Theodore Ts'o	55138e0bc2	ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks Work around problems in the writeback code to force out writebacks in larger chunks than just 4mb, which is just too small. This also works around limitations in the ext4 block allocator, which can't allocate more than 2048 blocks at a time. So we need to defeat the round-robin characteristics of the writeback code and try to write out as many blocks in one inode before allowing the writeback code to move on to another inode. We add a a new per-filesystem tunable, max_writeback_mb_bump, which caps this to a default of 128mb per inode. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-29 13:31:31 -04:00
Theodore Ts'o	7178057730	ext4: Fix hueristic which avoids group preallocation for closed files The hueristic was designed to avoid using locality group preallocation when writing the last segment of a closed file. Fix it by move setting size to the maximum of size and isize until after we check whether size == isize. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-28 00:06:20 -04:00
Alexey Dobriyan	f0f37e2f77	const: mark struct vm_struct_operations * mark struct vm_area_struct::vm_ops as const * mark vm_ops in AGP code But leave TTM code alone, something is fishy there with global vm_ops being used. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-27 11:39:25 -07:00
Theodore Ts'o	1693918e0b	ext4: Use ext4_msg() for ext4_da_writepage() errors This allows the user to see what filesystem was involved with a particular ext4_da_writepage() error. Also, use KERN_CRIT which is more appropriate than KERN_EMERG. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-26 17:43:59 -04:00
Linus Torvalds	bfebb14063	Merge branch 'writeback' of git://git.kernel.dk/linux-2.6-block * 'writeback' of git://git.kernel.dk/linux-2.6-block: writeback: pass in super_block to bdi_start_writeback()	2009-09-26 10:11:13 -07:00
Linus Torvalds	07e2e6ba27	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: cifs: fix locking and list handling code in cifs_open and its helper [CIFS] Remove build warning cifs: fix problems with last two commits [CIFS] Fix build break when keys support turned off cifs: eliminate cifs_init_private cifs: convert oplock breaks to use slow_work facility (try #4) cifs: have cifsFileInfo hold an extra inode reference cifs: take read lock on GlobalSMBSes_lock in is_valid_oplock_break cifs: remove cifsInodeInfo.oplockPending flag cifs: fix oplock request handling in posix codepath [CIFS] Re-enable Lanman security	2009-09-26 10:10:35 -07:00
Jens Axboe	a72bfd4dea	writeback: pass in super_block to bdi_start_writeback() Sometimes we only want to write pages from a specific super_block, so allow that to be passed in. This fixes a problem with commit `56a131dcf7` causing writeback on all super_blocks on a bdi, where we only really want to sync a specific sb from writeback_inodes_sb(). Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-26 00:10:40 +02:00
Jeff Layton	3321b791b2	cifs: fix locking and list handling code in cifs_open and its helper The patch to remove cifs_init_private introduced a locking imbalance. It didn't remove the leftover list addition code and the unlocking in that function. cifs_new_fileinfo does the list addition now, so there should be no need to do it outside of that function. pCifsInode will never be NULL, so we don't need to check for that. This patch also gets rid of the ugly locking and unlocking across function calls. Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-09-25 17:59:31 +00:00
Linus Torvalds	6d7f18f6ea	Merge branch 'writeback' of git://git.kernel.dk/linux-2.6-block * 'writeback' of git://git.kernel.dk/linux-2.6-block: writeback: writeback_inodes_sb() should use bdi_start_writeback() writeback: don't delay inodes redirtied by a fast dirtier writeback: make the super_block pinning more efficient writeback: don't resort for a single super_block in move_expired_inodes() writeback: move inodes from one super_block together writeback: get rid to incorrect references to pdflush in comments writeback: improve readability of the wb_writeback() continue/break logic writeback: cleanup writeback_single_inode() writeback: kupdate writeback shall not stop when more io is possible writeback: stop background writeback when below background threshold writeback: balance_dirty_pages() shall write more than dirtied pages fs: Fix busyloop in wb_writeback()	2009-09-25 09:27:30 -07:00
Jens Axboe	56a131dcf7	writeback: writeback_inodes_sb() should use bdi_start_writeback() Pointless to iterate other devices looking for a super, when we have a bdi mapping. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-25 18:08:26 +02:00
Wu Fengguang	b3af9468ae	writeback: don't delay inodes redirtied by a fast dirtier Debug traces show that in per-bdi writeback, the inode under writeback almost always get redirtied by a busy dirtier. We used to call redirty_tail() in this case, which could delay inode for up to 30s. This is unacceptable because it now happens so frequently for plain cp/dd, that the accumulated delays could make writeback of big files very slow. So let's distinguish between data redirty and metadata only redirty. The first one is caused by a busy dirtier, while the latter one could happen in XFS, NFS, etc. when they are doing delalloc or updating isize. The inode being busy dirtied will now be requeued for next io, while the inode being redirtied by fs will continue to be delayed to avoid repeated IO. CC: Jan Kara <jack@suse.cz> CC: Theodore Ts'o <tytso@mit.edu> CC: Dave Chinner <david@fromorbit.com> CC: Chris Mason <chris.mason@oracle.com> CC: Christoph Hellwig <hch@infradead.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-25 18:08:26 +02:00
Jens Axboe	9ecc2738ac	writeback: make the super_block pinning more efficient Currently we pin the inode->i_sb for every single inode. This increases cache traffic on sb->s_umount sem. Lets instead cache the inode sb pin state and keep the super_block pinned for as long as keep writing out inodes from the same super_block. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-25 18:08:26 +02:00
Jens Axboe	cf137307cd	writeback: don't resort for a single super_block in move_expired_inodes() If we only moved inodes from a single super_block to the temporary list, there's no point in doing a resort for multiple super_blocks. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-25 18:08:26 +02:00
Shaohua Li	5c03449d34	writeback: move inodes from one super_block together __mark_inode_dirty adds inode to wb dirty list in random order. If a disk has several partitions, writeback might keep spindle moving between partitions. To reduce the move, better write big chunk of one partition and then move to another. Inodes from one fs usually are in one partion, so idealy move indoes from one fs together should reduce spindle move. This patch tries to address this. Before per-bdi writeback is added, the behavior is write indoes from one fs first and then another, so the patch restores previous behavior. The loop in the patch is a bit ugly, should we add a dirty list for each superblock in bdi_writeback? Test in a two partition disk with attached fio script shows about 3% ~ 6% improvement. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-25 18:08:25 +02:00
Jens Axboe	5b0830cb90	writeback: get rid to incorrect references to pdflush in comments Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-25 18:08:25 +02:00
Jens Axboe	71fd05a887	writeback: improve readability of the wb_writeback() continue/break logic And throw some comments in there, too. Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-25 18:08:25 +02:00
Wu Fengguang	ae1b7f7d4b	writeback: cleanup writeback_single_inode() Make the if-else straight in writeback_single_inode(). No behavior change. Cc: Jan Kara <jack@suse.cz> Cc: Michael Rubin <mrubin@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-25 18:08:25 +02:00
Wu Fengguang	7fbdea3232	writeback: kupdate writeback shall not stop when more io is possible Fix the kupdate case, which disregards wbc.more_io and stop writeback prematurely even when there are more inodes to be synced. wbc.more_io should always be respected. Also remove the pages_skipped check. It will set when some page(s) of some inode(s) cannot be written for now. Such inodes will be delayed for a while. This variable has nothing to do with whether there are other writeable inodes. CC: Jan Kara <jack@suse.cz> CC: Dave Chinner <david@fromorbit.com> CC: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-25 18:08:25 +02:00
Wu Fengguang	d3ddec7635	writeback: stop background writeback when below background threshold Treat bdi_start_writeback(0) as a special request to do background write, and stop such work when we are below the background dirty threshold. Also simplify the (nr_pages <= 0) checks. Since we already pass in nr_pages=LONG_MAX for WB_SYNC_ALL and background writes, we don't need to worry about it being decreased to zero. Reported-by: Richard Kennedy <richard@rsk.demon.co.uk> CC: Jan Kara <jack@suse.cz> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-25 18:08:24 +02:00
Jan Kara	a5989bdc98	fs: Fix busyloop in wb_writeback() If all inodes are under writeback (e.g. in case when there's only one inode with dirty pages), wb_writeback() with WB_SYNC_NONE work basically degrades to busylooping until I_SYNC flags of the inode is cleared. Fix the problem by waiting on I_SYNC flags of an inode on b_more_io list in case we failed to write anything. Tested-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-25 18:08:24 +02:00
Steve French	15dd478107	[CIFS] Remove build warning Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-09-25 02:24:45 +00:00
Jeff Layton	5d2c0e2259	cifs: fix problems with last two commits Fix problems with commits: `086f68bd97` `3bc303c254` Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-09-25 02:12:33 +00:00
Steve French	0f59e61c1f	[CIFS] Fix build break when keys support turned off Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-09-25 00:33:37 +00:00
Andrew Morton	c44972f178	procfs: disable per-task stack usage on NOMMU It needs walk_page_range(). Reported-by: Michal Simek <monstr@monstr.eu> Tested-by: Michal Simek <monstr@monstr.eu> Cc: Stefani Seibold <stefani@seibold.net> Cc: David Howells <dhowells@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 17:11:24 -07:00
Linus Torvalds	b9b9df62e7	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6: eCryptfs: Prevent lower dentry from going negative during unlink eCryptfs: Propagate vfs_read and vfs_write return codes eCryptfs: Validate global auth tok keys eCryptfs: Filename encryption only supports password auth tokens eCryptfs: Check for O_RDONLY lower inodes when opening lower files eCryptfs: Handle unrecognized tag 3 cipher codes ecryptfs: improved dependency checking and reporting eCryptfs: Fix lockdep-reported AB-BA mutex issue ecryptfs: Remove unneeded locking that triggers lockdep false positives	2009-09-24 17:10:17 -07:00
Jeff Layton	086f68bd97	cifs: eliminate cifs_init_private ...it does the same thing as cifs_fill_fileinfo, but doesn't handle the flist ordering correctly. Also rename cifs_fill_fileinfo to a more descriptive name and have it take an open flags arg instead of just a write_only flag. That makes the logic in the callers a little simpler. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-09-24 19:35:18 +00:00
Al Viro	36dd2fdb37	nfs[23] tcp breakage in mount with binary options We forget to set nfs_server.protocol in tcp case when old-style binary options are passed to mount. The thing remains zero and never validated afterwards. As the result, we hit BUG in fs/nfs/client.c:588. Breakage has been introduced in NFS: Add nfs_alloc_parsed_mount_data merged yesterday... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-09-24 14:58:42 -04:00
Jeff Layton	3bc303c254	cifs: convert oplock breaks to use slow_work facility (try #4 ) This is the fourth respin of the patch to convert oplock breaks to use the slow_work facility. A customer of ours was testing a backport of one of the earlier patchsets, and hit a "Busy inodes after umount..." problem. An oplock break job had raced with a umount, and the superblock got torn down and its memory reused. When the oplock break job tried to dereference the inode->i_sb, the kernel oopsed. This patchset has the oplock break job hold an inode and vfsmount reference until the oplock break completes. With this, there should be no need to take a tcon reference (the vfsmount implicitly holds one already). Currently, when an oplock break comes in there's a chance that the oplock break job won't occur if the allocation of the oplock_q_entry fails. There are also some rather nasty races in the allocation and handling these structs. Rather than allocating oplock queue entries when an oplock break comes in, add a few extra fields to the cifsFileInfo struct. Get rid of the dedicated cifs_oplock_thread as well and queue the oplock break job to the slow_work thread pool. This approach also has the advantage that the oplock break jobs can potentially run in parallel rather than be serialized like they are today. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-09-24 18:33:18 +00:00
Linus Torvalds	7ca263cdf8	Merge branch 'cputime' of git://git390.marist.edu/pub/scm/linux-2.6 * 'cputime' of git://git390.marist.edu/pub/scm/linux-2.6: [PATCH] Fix idle time field in /proc/uptime	2009-09-24 09:04:24 -07:00
Linus Torvalds	dc2af6a6bc	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (42 commits) Btrfs: hash the btree inode during fill_super Btrfs: relocate file extents in clusters Btrfs: don't rename file into dummy directory Btrfs: check size of inode backref before adding hardlink Btrfs: fix releasepage to avoid unlocking extents we haven't locked Btrfs: Fix test_range_bit for whole file extents Btrfs: fix errors handling cached state in set/clear_extent_bit Btrfs: fix early enospc during balancing Btrfs: deal with NULL space info Btrfs: account for space used by the super mirrors Btrfs: fix extent entry threshold calculation Btrfs: remove dead code Btrfs: fix bitmap size tracking Btrfs: don't keep retrying a block group if we fail to allocate a cluster Btrfs: make balance code choose more wisely when relocating Btrfs: fix arithmetic error in clone ioctl Btrfs: add snapshot/subvolume destroy ioctl Btrfs: change how subvolumes are organized Btrfs: do not reuse objectid of deleted snapshot/subvol Btrfs: speed up snapshot dropping ...	2009-09-24 08:57:29 -07:00
Linus Torvalds	6c5daf012c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: truncate: use new helpers truncate: new helpers fs: fix overflow in sys_mount() for in-kernel calls fs: Make unload_nls() NULL pointer safe freeze_bdev: grab active reference to frozen superblocks freeze_bdev: kill bd_mount_sem exofs: remove BKL from super operations fs/romfs: correct error-handling code vfs: seq_file: add helpers for data filling vfs: remove redundant position check in do_sendfile vfs: change sb->s_maxbytes to a loff_t vfs: explicitly cast s_maxbytes in fiemap_check_ranges libfs: return error code on failed attr set seq_file: return a negative error code when seq_path_root() fails. vfs: optimize touch_time() too vfs: optimization for touch_atime() vfs: split generic_forget_inode() so that hugetlbfs does not have to copy it fs/inode.c: add dev-id and inode number for debugging in init_special_inode() libfs: make simple_read_from_buffer conventional	2009-09-24 08:32:11 -07:00
Linus Torvalds	db16826367	Merge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits) HWPOISON: Enable error_remove_page on btrfs HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs HWPOISON: Add madvise() based injector for hardware poisoned pages v4 HWPOISON: Enable error_remove_page for NFS HWPOISON: Enable .remove_error_page for migration aware file systems HWPOISON: The high level memory error handler in the VM v7 HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process HWPOISON: shmem: call set_page_dirty() with locked page HWPOISON: Define a new error_remove_page address space op for async truncation HWPOISON: Add invalidate_inode_page HWPOISON: Refactor truncate to allow direct truncating of page v2 HWPOISON: check and isolate corrupted free pages v2 HWPOISON: Handle hardware poisoned pages in try_to_unmap HWPOISON: Use bitmask/action code for try_to_unmap behaviour HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2 HWPOISON: Add poison check to page fault handling HWPOISON: Add basic support for poisoned pages in fault handler v3 HWPOISON: Add new SIGBUS error codes for hardware poison signals HWPOISON: Add support for poison swap entries v2 HWPOISON: Export some rmap vma locking to outside world ...	2009-09-24 07:53:22 -07:00
Hiroshi Shimamoto	801460d0cf	task_struct cleanup: move binfmt field to mm_struct Because the binfmt is not different between threads in the same process, it can be moved from task_struct to mm_struct. And binfmt moudle is handled per mm_struct instead of task_struct. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:05 -07:00
Julia Lawall	a21f3c2a04	fs/romfs: correct error-handling code romfs_iget returns an ERR_PTR value in an error case instead of NULL. A simplified version of the semantic match that finds this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @match exists@ expression x, E; statement S1, S2; @@ x = romfs_iget(...) ... when != x = E ( * if (x == NULL \|\| ...) S1 else S2 \| * if (x == NULL && ...) S1 else S2 ) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:05 -07:00
Roel Kluin	3886de938c	adfs: remove redundant test on unsigned unsigned block cannot be less than 0. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:05 -07:00
Alexey Dobriyan	8d65af789f	sysctl: remove "struct file *" argument of ->proc_handler It's unused. It isn't needed -- read or write flag is already passed and sysctl shouldn't care about the rest. It _was_ used in two places at arch/frv for some reason. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "David S. Miller" <davem@davemloft.net> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:04 -07:00
Renzo Davoli	dd5d81f326	fs/char_dev.c: remove useless loop There are two useless lines in fs/char_dev.c. In register_chrdev there is a loop to change all '/' into '!' in the kernel object name. This code is useless as the same substitution is in kobject_set_name_vargs in lib/kobject.c: 228 /* ewww... some of these buggers have '/' in the name ... / 229 while ((s = strchr(kobj->name, '/'))) 230 s[0] = '!'; kobject_set_name_vargs is called by kobject_set_name. kobject_set_name is called just above the useless loop. [hidave.darkstar@gmail.com: fix warning, remove the unused char s] Signed-off-by: Renzo Davoli <renzo@cs.unibo.it> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:03 -07:00
Mike Frysinger	0b8c78f2bf	flat: use IS_ERR_VALUE() helper macro There is a common macro now for testing mixed pointer/errno values, so use that rather than handling the casts ourself. Signed-off-by: Mike Frysinger <vapier@gentoo.org> Acked-by: David McCullough <david_mccullough@securecomputing.com> Acked-by: Greg Ungerer <gerg@uclinux.org> Cc: David Howells <dhowells@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:03 -07:00
David Howells	8e8b63a68c	fdpic: ignore the loader's PT_GNU_STACK when calculating the stack size Ignore the loader's PT_GNU_STACK when calculating the stack size, and only consider the executable's PT_GNU_STACK, assuming the executable has one. Currently the behaviour is to take the largest stack size and use that, but that means you can't reduce the stack size in the executable. The loader's stack size should probably only be used when executing the loader directly. WARNING: This patch is slightly dangerous - it may render a system inoperable if the loader's stack size is larger than that of important executables, and the system relies unknowingly on this increasing the size of the stack. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Mike Frysinger <vapier@gentoo.org> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:02 -07:00
Amerigo Wang	0cf062d0ff	elf: clean up fill_note_info() Introduce a helper function elf_note_info_init() to help fill_note_info() to do initializations, also fix the potential memory leaks. [akpm@linux-foundation.org: remove NUM_NOTES] Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:01 -07:00
Peter Zijlstra	ba0a6c9f6f	fcntl: add F_[SG]ETOWN_EX In order to direct the SIGIO signal to a particular thread of a multi-threaded application we cannot, like suggested by the manpage, put a TID into the regular fcntl(F_SETOWN) call. It will still be send to the whole process of which that thread is part. Since people do want to properly direct SIGIO we introduce F_SETOWN_EX. The need to direct SIGIO comes from self-monitoring profiling such as with perf-counters. Perf-counters uses SIGIO to notify that new sample data is available. If the signal is delivered to the same task that generated the new sample it can augment that data by inspecting the task's user-space state right after it returns from the kernel. This is esp. convenient for interpreted or virtual machine driven environments. Both F_SETOWN_EX and F_GETOWN_EX take a pointer to a struct f_owner_ex as argument: struct f_owner_ex { int type; pid_t pid; }; Where type is one of F_OWNER_TID, F_OWNER_PID or F_OWNER_GID. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Tested-by: stephane eranian <eranian@googlemail.com> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Roland McGrath <roland@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:01 -07:00
Oleg Nesterov	06f1631a16	signals: send_sigio: use do_send_sig_info() to avoid check_kill_permission() group_send_sig_info()->check_kill_permission() assumes that current is the sender and uses current_cred(). This is not true in send_sigio_to_task() case. From the security pov the sender is not current, but the task which did fcntl(F_SETOWN), that is why we have sigio_perm() which uses the right creds to check. Fortunately, send_sigio() always sends either SEND_SIG_PRIV or SI_FROMKERNEL() signal, so check_kill_permission() does nothing. But still it would be tidier to avoid this bogus security check and save a couple of cycles. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stephane eranian <eranian@googlemail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:01 -07:00
Oleg Nesterov	964ee7df90	exec: fix set_binfmt() vs sys_delete_module() race sys_delete_module() can set MODULE_STATE_GOING after search_binary_handler() does try_module_get(). In this case set_binfmt()->try_module_get() fails but since none of the callers check the returned error, the task will run with the wrong old ->binfmt. The proper fix should change all ->load_binary() methods, but we can rely on fact that the caller must hold a reference to binfmt->module and use __module_get() which never fails. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:01 -07:00
Neil Horman	61be228a06	exec: allow do_coredump() to wait for user space pipe readers to complete Allow core_pattern pipes to wait for user space to complete One of the things that user space processes like to do is look at metadata for a crashing process in their /proc/<pid> directory. this is racy however, since do_coredump in the kernel doesn't wait for the user space process to complete before it reaps the crashing process. This patch corrects that. Allowing the kernel to wait for the user space process to complete before cleaning up the crashing process. This is a bit tricky to do for a few reasons: 1) The user space process isn't our child, so we can't sys_wait4 on it 2) We need to close the pipe before waiting for the user process to complete, since the user process may rely on an EOF condition I've discussed several solutions with Oleg Nesterov off-list about this, and this is the one we've come up with. We add ourselves as a pipe reader (to prevent premature cleanup of the pipe_inode_info), and remove ourselves as a writer (to provide an EOF condition to the writer in user space), then we iterate until the user space process exits (which we detect by pipe->readers == 1, hence the > 1 check in the loop). When we exit the loop, we restore the proper reader/writer values, then we return and let filp_close in do_coredump clean up the pipe data properly. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: Earl Chew <earl_chew@agilent.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Andi Kleen <andi@firstfloor.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:00 -07:00
Neil Horman	a293980c2e	exec: let do_coredump() limit the number of concurrent dumps to pipes Introduce core pipe limiting sysctl. Since we can dump cores to pipe, rather than directly to the filesystem, we create a condition in which a user can create a very high load on the system simply by running bad applications. If the pipe reader specified in core_pattern is poorly written, we can have lots of ourstandig resources and processes in the system. This sysctl introduces an ability to limit that resource consumption. core_pipe_limit defines how many in-flight dumps may be run in parallel, dumps beyond this value are skipped and a note is made in the kernel log. A special value of 0 in core_pipe_limit denotes unlimited core dumps may be handled (this is the default value). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: Earl Chew <earl_chew@agilent.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Andi Kleen <andi@firstfloor.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:00 -07:00
Neil Horman	725eae32df	exec: make do_coredump() more resilient to recursive crashes Change how we detect recursive dumps. Currently we have a mechanism by which we try to compare pathnames of the crashing process to the core_pattern path. This is broken for a dozen reasons, and just doesn't work in any sort of robust way. I'm replacing it with the use of a 0 RLIMIT_CORE value. Since helper apps set RLIMIT_CORE to zero, we don't write out core files for any process with that particular limit set. It the core_pattern is a pipe, any non-zero limit is translated to RLIM_INFINITY. This allows complete dumps to be captured, but prevents infinite recursion in the event that the core_pattern process itself crashes. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: Earl Chew <earl_chew@agilent.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Andi Kleen <andi@firstfloor.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:00 -07:00
From: Mel Gorman	ef1ff6b8c0	hugetlbfs: do not call user_shm_lock() for MAP_HUGETLB fix Commit `6bfde05bf5` ("hugetlbfs: allow the creation of files suitable for MAP_PRIVATE on the vfs internal mount") altered can_do_hugetlb_shm() to check if a file is being created for shared memory or mmap(). If this returns false, we then unconditionally call user_shm_lock() triggering a warning. This block should never be entered for MAP_HUGETLB. This patch partially reverts the problem and fixes the check. Signed-off-by: Eric B Munson <ebmunson@us.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Adam Litke <agl@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:20:56 -07:00
Chris Mason	54bcf382da	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable into for-linus Conflicts: fs/btrfs/super.c	2009-09-24 10:00:58 -04:00
Yan Zheng	c65ddb52dc	Btrfs: hash the btree inode during fill_super The snapshot deletion patches dropped this line, but the inode needs to be hashed. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-24 09:24:43 -04:00
Yan, Zheng	0257bb82d2	Btrfs: relocate file extents in clusters The extent relocation code copy file extents one by one when relocating data block group. This is inefficient if file extents are small. This patch makes the relocation code copy file extents in clusters. So we can can make better use of read-ahead. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-24 09:17:31 -04:00
Yan, Zheng	f679a84034	Btrfs: don't rename file into dummy directory A recent change enforces only one access point to each subvolume. The first directory entry (the one added when the subvolume/snapshot was created) is treated as valid access point, all other subvolume links are linked to dummy empty directories. The dummy directories are temporary inodes that only in memory, so we can not rename file into them. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-24 09:17:31 -04:00
Yan, Zheng	a571952143	Btrfs: check size of inode backref before adding hardlink For every hardlink in btrfs, there is a corresponding inode back reference. All inode back references for hardlinks in a given directory are stored in single b-tree item. The size of b-tree item is limited by the size of b-tree leaf, so we can only create limited number of hardlinks to a given file in a directory. The original code lacks of the check, it oops if the number of hardlinks goes over the limit. This patch fixes the issue by adding check to btrfs_link and btrfs_rename. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-24 09:17:31 -04:00
npiggin@suse.de	c08d3b0e33	truncate: use new helpers Update some fs code to make use of new helper functions introduced in the previous patch. Should be no significant change in behaviour (except CIFS now calls send_sig under i_lock, via inode_newsize_ok). Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Miklos Szeredi <miklos@szeredi.hu> Cc: linux-nfs@vger.kernel.org Cc: Trond.Myklebust@netapp.com Cc: linux-cifs-client@lists.samba.org Cc: sfrench@samba.org Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-09-24 08:41:47 -04:00
npiggin@suse.de	25d9e2d152	truncate: new helpers Introduce new truncate helpers truncate_pagecache and inode_newsize_ok. vmtruncate is also consolidated from mm/memory.c and mm/nommu.c and into mm/truncate.c. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-09-24 08:41:47 -04:00
Vegard Nossum	eca6f534e6	fs: fix overflow in sys_mount() for in-kernel calls sys_mount() reads/copies a whole page for its "type" parameter. When do_mount_root() passes a kernel address that points to an object which is smaller than a whole page, copy_mount_options() will happily go past this memory object, possibly dereferencing "wild" pointers that could be in any state (hence the kmemcheck warning, which shows that parts of the next page are not even allocated). (The likelihood of something going wrong here is pretty low -- first of all this only applies to kernel calls to sys_mount(), which are mostly found in the boot code. Secondly, I guess if the page was not mapped, exact_copy_from_user() _would_ in fact handle it correctly because of its access_ok(), etc. checks.) But it is much nicer to avoid the dubious reads altogether, by stopping as soon as we find a NUL byte. Is there a good reason why we can't do something like this, using the already existing strndup_from_user()? [akpm@linux-foundation.org: make copy_mount_string() static] [AV: fix compat mount breakage, which involves undoing akpm's change above] Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: al <al@dizzy.pdmi.ras.ru>	2009-09-24 08:40:15 -04:00
Thomas Gleixner	6d729e44a5	fs: Make unload_nls() NULL pointer safe Most call sites of unload_nls() do: if (nls) unload_nls(nls); Check the pointer inside unload_nls() like we do in kfree() and simplify the call sites. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Steve French <sfrench@us.ibm.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: Petr Vandrovec <vandrove@vc.cvut.cz> Cc: Anton Altaparmakov <aia21@cantab.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-09-24 07:47:42 -04:00
Christoph Hellwig	4504230a71	freeze_bdev: grab active reference to frozen superblocks Currently we held s_umount while a filesystem is frozen, despite that we might return to userspace and unlock it from a different process. Instead grab an active reference to keep the file system busy and add an explicit check for frozen filesystems in remount and reject the remount instead of blocking on s_umount. Add a new get_active_super helper to super.c for use by freeze_bdev that grabs an active reference to a superblock from a given block device. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-09-24 07:47:41 -04:00
Christoph Hellwig	4fadd7bb20	freeze_bdev: kill bd_mount_sem Now that we have the freeze count there is not much reason for bd_mount_sem anymore. The actual freeze/thaw operations are serialized using the bd_fsfreeze_mutex, and the only other place we take bd_mount_sem is get_sb_bdev which tries to prevent mounting a filesystem while the block device is frozen. Instead of add a check for bd_fsfreeze_count and return -EBUSY if a filesystem is frozen. While that is a change in user visible behaviour a failing mount is much better for this case rather than having the mount process stuck uninterruptible for a long time. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-09-24 07:47:39 -04:00
Boaz Harrosh	1ba50bbe93	exofs: remove BKL from super operations the two places inside exofs that where taking the BKL were: exofs_put_super() - .put_super and exofs_sync_fs() - which is .sync_fs and is also called from .write_super. Now exofs_sync_fs() is protected from itself by also taking the sb_lock. exofs_put_super() directly calls exofs_sync_fs() so there is no danger between these two either. In anyway there is absolutely nothing dangerous been done inside exofs_sync_fs(). Unless there is some subtle race with the actual lifetime of the super_block in regard to .put_super and some other parts of the VFS. Which is highly unlikely. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-09-24 07:47:38 -04:00
Julia Lawall	88a0a53d70	fs/romfs: correct error-handling code romfs_fill_super() assumes that romfs_iget() returns NULL when it fails. romfs_iget() actually returns ERR_PTR(-ve) in that case... Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-09-24 07:47:37 -04:00
Miklos Szeredi	f84398068d	vfs: seq_file: add helpers for data filling Add two helpers that allow access to the seq_file's own buffer, but hide the internal details of seq_files. This allows easier implementation of special purpose filling functions. It also cleans up some existing functions which duplicated the seq_file logic. Make these inline functions in seq_file.h, as suggested by Al. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-09-24 07:47:35 -04:00
Jeff Layton	f9098980ff	vfs: remove redundant position check in do_sendfile As Johannes Weiner pointed out, one of the range checks in do_sendfile is redundant and is already checked in rw_verify_area. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Robert Love <rlove@google.com> Cc: Mandeep Singh Baines <msb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-09-24 07:47:34 -04:00
Jeff Layton	42cb56ae2a	vfs: change sb->s_maxbytes to a loff_t sb->s_maxbytes is supposed to indicate the maximum size of a file that can exist on the filesystem. It's declared as an unsigned long long. Even if a filesystem has no inherent limit that prevents it from using every bit in that unsigned long long, it's still problematic to set it to anything larger than MAX_LFS_FILESIZE. There are places in the kernel that cast s_maxbytes to a signed value. If it's set too large then this cast makes it a negative number and generally breaks the comparison. Change s_maxbytes to be loff_t instead. That should help eliminate the temptation to set it too large by making it a signed value. Also, add a warning for couple of releases to help catch filesystems that set s_maxbytes too large. Eventually we can either convert this to a BUG() or just remove it and in the hope that no one will get it wrong now that it's a signed value. Signed-off-by: Jeff Layton <jlayton@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Robert Love <rlove@google.com> Cc: Mandeep Singh Baines <msb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-09-24 07:47:33 -04:00
Jeff Layton	5aa98b706e	vfs: explicitly cast s_maxbytes in fiemap_check_ranges If fiemap_check_ranges is passed a large enough value, then it's possible that the value would be cast to a signed value for comparison against s_maxbytes when we change it to loff_t. Make sure that doesn't happen by explicitly casting s_maxbytes to an unsigned value for the purposes of comparison. Signed-off-by: Jeff Layton <jlayton@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Robert Love <rlove@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mandeep Singh Baines <msb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-09-24 07:47:31 -04:00
Wu Fengguang	05cc0cee69	libfs: return error code on failed attr set Currently all simple_attr.set handlers return 0 on success and negative codes on error. Fix simple_attr_write() to return these error codes. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-09-24 07:47:30 -04:00
Tetsuo Handa	7a62cc1021	seq_file: return a negative error code when seq_path_root() fails. seq_path_root() is returning a return value of successful __d_path() instead of returning a negative value when mangle_path() failed. This is not a bug so far because nobody is using return value of seq_path_root(). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-09-24 07:47:29 -04:00
Andi Kleen	ce06e0b21d	vfs: optimize touch_time() too Do a similar optimization as earlier for touch_atime. Getting the lock in mnt_get_write is relatively costly, so try all avenues to avoid it first. This patch is careful to still only update inode fields inside the lock region. This didn't show up in benchmarks, but it's easy enough to do. [akpm@linux-foundation.org: fix typo in comment] [hugh.dickins@tiscali.co.uk: fix inverted test of mnt_want_write_file()] Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Valerie Aurora <vaurora@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-09-24 07:47:27 -04:00
Andi Kleen	b12536c270	vfs: optimization for touch_atime() Some benchmark testing shows touch_atime to be high up in profile logs for IO intensive workloads. Most likely that's due to the lock in mnt_want_write(). Unfortunately touch_atime first takes the lock, and then does all the other tests that could avoid atime updates (like noatime or relatime). Do it the other way round -- first try to avoid the update and only then if that didn't succeed take the lock. That works because none of the atime avoidance tests rely on locking. This also eliminates a goto. Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Christoph Hellwig <hch@infradead.org> Reviewed-by: Valerie Aurora <vaurora@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-09-24 07:47:26 -04:00
Jan Kara	22fe404218	vfs: split generic_forget_inode() so that hugetlbfs does not have to copy it Hugetlbfs needs to do special things instead of truncate_inode_pages(). Currently, it copied generic_forget_inode() except for truncate_inode_pages() call which is asking for trouble (the code there isn't trivial). So create a separate function generic_detach_inode() which does all the list magic done in generic_forget_inode() and call it from hugetlbfs_forget_inode(). Signed-off-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-09-24 07:47:25 -04:00
Manish Katiyar	af0d9ae811	fs/inode.c: add dev-id and inode number for debugging in init_special_inode() Add device-id and inode number for better debugging. This was suggested by Andreas in one of the threads http://article.gmane.org/gmane.comp.file-systems.ext4/12062 . "If anyone has a chance, fixing this error message to be not-useless would be good... Including the device name and the inode number would help track down the source of the problem." Signed-off-by: Manish Katiyar <mkatiyar@gmail.com> Cc: Andreas Dilger <adilger@sun.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-09-24 07:47:24 -04:00
Steven Rostedt	14be27460e	libfs: make simple_read_from_buffer conventional Impact: have simple_read_from_buffer conform to standards It was brought to my attention by Andrew Morton, Theodore Tso, and H. Peter Anvin that a read from userspace should only return -EFAULT if nothing was actually read. Looking at the simple_read_from_buffer I noticed that this function does not conform to that rule. This patch fixes that function. [akpm@linux-foundation.org: simplification suggested by hpa] [hpa@zytor.com: fix count==0 handling] Signed-off-by: Steven Rostedt <srostedt@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-09-24 07:47:22 -04:00
Michael Abbott	96830a57de	[PATCH] Fix idle time field in /proc/uptime Git commit `79741dd` changes idle cputime accounting, but unfortunately the /proc/uptime file hasn't caught up. Here the idle time calculation from /proc/stat is copied over. Signed-off-by: Michael Abbott <michael.abbott@diamond.ac.uk> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>	2009-09-24 10:16:24 +02:00
Alexey Dobriyan	2bcd57ab61	headers: utsname.h redux * remove asm/atomic.h inclusion from linux/utsname.h -- not needed after kref conversion * remove linux/utsname.h inclusion from files which do not need it NOTE: it looks like fs/binfmt_elf.c do not need utsname.h, however due to some personality stuff it _is_ needed -- cowardly leave ELF-related headers and files alone. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 18:13:10 -07:00
Chris Mason	11ef160fda	Btrfs: fix releasepage to avoid unlocking extents we haven't locked During releasepage, we try to drop any extent_state structs for the bye offsets of the page we're releaseing. But the code was incorrectly telling clear_extent_bit to delete the state struct unconditionallly. Normally this would be fine because we have the page locked, but other parts of btrfs will lock down an entire extent, the most common place being IO completion. releasepage was deleting the extent state without first locking the extent, which may result in removing a state struct that another process had locked down. The fix here is to leave the NODATASUM and EXTENT_LOCKED bits alone in releasepage. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-23 20:30:53 -04:00
Chris Mason	46562cec98	Btrfs: Fix test_range_bit for whole file extents If test_range_bit finds an extent that goes all the way to (u64)-1, it can incorrectly wrap the u64 instead of treaing it like the end of the address space. This just adds a check for the highest possible offset so we don't wrap. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-23 20:30:52 -04:00
Chris Mason	42daec299b	Btrfs: fix errors handling cached state in set/clear_extent_bit Both set and clear_extent_bit allow passing a cached state struct to reduce rbtree search times. clear_extent_bit was improperly bypassing some of the checks around making sure the extent state fields were correct for a given operation. The fix used here (from Yan Zheng) is to use the hit_next goto target instead of jumping all the way down to start clearing bits without making sure the cached state was exactly correct for the operation we were doing. This also fixes up the setting of the start variable for both ops in the case where we find an overlapping extent that begins before the range we want to change. In both cases we were incorrectly going backwards from the original requested change. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-23 20:30:52 -04:00
Linus Torvalds	9e12a7e7d8	Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6 * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: NFS: Propagate 'fsc' mount option through automounts sunrpc/rpc_pipe: fix kernel-doc notation sunrpc: xdr_xcode_hyper helpers cannot presume 64-bit alignment NFS: Add nfs_alloc_parsed_mount_data NFS/RPC: fix problems with reestablish_timeout and related code. NFS: Get rid of the NFS_MOUNT_VER3 and NFS_MOUNT_TCP flags	2009-09-23 15:22:41 -07:00
David Howells	2df5480638	NFS: Propagate 'fsc' mount option through automounts Propagate the NFS 'fsc' mount option through NFS automounts of various types. This is now required as commit: commit `c02d7adf8c` Author: Trond Myklebust <Trond.Myklebust@netapp.com> Date: Mon Jun 22 15:09:14 2009 -0400 NFSv4: Replace nfs4_path_walk() with VFS path lookup in a private namespace uses VFS-driven automounting to reach all submounts barring the root, thus preventing fscaching from being enabled on any submount other than the root. This patch gets around that by propagating the NFS_OPTION_FSCACHE flag across automounts. If a uniquifier is supplied to a mount then this is propagated to all automounts of that mount too. Signed-off-by: David Howells <dhowells@redhat.com> [Trond: Fixed up the definition of nfs_fscache_get_super_cookie for the case of #undef CONFIG_NFS_FSCACHE] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-09-23 14:36:39 -04:00
Chuck Lever	9423a08ad5	NFS: Add nfs_alloc_parsed_mount_data Allocating nfs_parsed_mount_data and setting up the defaults is nearly the same for both nfs and nfs4 mounts. Both paths seem to use nfs_validate_transport_protocol(), so setting a default value for nfs_server.protocol ought to be unnecessary. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-09-23 14:36:38 -04:00
Trond Myklebust	8a6e5deb8a	NFS: Get rid of the NFS_MOUNT_VER3 and NFS_MOUNT_TCP flags Keep it in the case of the legacy binary mount interface, but purge it from the nfs_server structure. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-09-23 14:36:37 -04:00
Abhishek Kulkarni	60e78d2c99	9p: Add fscache support to 9p This patch adds a persistent, read-only caching facility for 9p clients using the FS-Cache caching backend. When the fscache facility is enabled, each inode is associated with a corresponding vcookie which is an index into the FS-Cache indexing tree. The FS-Cache indexing tree is indexed at 3 levels: - session object associated with each mount. - inode/vcookie - actual data (pages) A cache tag is chosen randomly for each session. These tags can be read off /sys/fs/9p/caches and can be passed as a mount-time parameter to re-attach to the specified caching session. Signed-off-by: Abhishek Kulkarni <adkulkar@umail.iu.edu> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2009-09-23 13:03:46 -05:00
Abhishek Kulkarni	637d020a02	9p: Fix the incorrect update of inode size in v9fs_file_write() When using the cache=loose flags, the inode's size was not being updated correctly on a remote write. Thus subsequent reads of the whole file resulted in a truncated read. Fix it. Signed-off-by: Abhishek Kulkarni <adkulkar@umail.iu.edu> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2009-09-23 13:03:46 -05:00
Abhishek Kulkarni	7549ae3e81	9p: Use the i_size_[read, write]() macros instead of using inode->i_size directly. Change all occurrence of inode->i_size with i_size_read() or i_size_write() as appropriate. Signed-off-by: Abhishek Kulkarni <adkulkar@umail.iu.edu> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2009-09-23 13:03:46 -05:00
Linus Torvalds	a7c367b95a	Merge git://git.infradead.org/mtd-2.6 * git://git.infradead.org/mtd-2.6: (58 commits) mtd: jedec_probe: add PSD4256G6V id mtd: OneNand support for Nomadik 8815 SoC (on NHK8815 board) mtd: nand: driver for Nomadik 8815 SoC (on NHK8815 board) m25p80: Add Spansion S25FL129P serial flashes jffs2: Use SLAB_HWCACHE_ALIGN for jffs2_raw_{dirent,inode} slabs mtd: sh_flctl: register sh_flctl using platform_driver_probe() mtd: nand: txx9ndfmc: transfer 512 byte at a time if possible mtd: nand: fix tmio_nand ecc correction mtd: nand: add __nand_correct_data helper function mtd: cfi_cmdset_0002: add 0xFF intolerance for M29W128G mtd: inftl: fix fold chain block number mtd: jedec: fix compilation problem with I28F640C3B definition mtd: nand: fix ECC Correction bug for SMC ordering for NDFC driver mtd: ofpart: Check availability of reg property instead of name property driver/Makefile: Initialize "mtd" and "spi" before "net" mtd: omap: adding DMA mode support in nand prefetch/post-write mtd: omap: add support for nand prefetch-read and post-write mtd: add nand support for w90p910 (v2) mtd: maps: add mtd-ram support to physmap_of mtd: pxa3xx_nand: add single-bit error corrections reporting ...	2009-09-23 10:07:49 -07:00
Linus Torvalds	b64ada6b23	Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (85 commits) ocfs2: Use buffer IO if we are appending a file. ocfs2: add spinlock protection when dealing with lockres->purge. dlmglue.c: add missed mlog lines ocfs2: __ocfs2_abort() should not enable panic for local mounts ocfs2: Add ioctl for reflink. ocfs2: Enable refcount tree support. ocfs2: Implement ocfs2_reflink. ocfs2: Add preserve to reflink. ocfs2: Create reflinked file in orphan dir. ocfs2: Use proper parameter for some inode operation. ocfs2: Make transaction extend more efficient. ocfs2: Don't merge in 1st refcount ops of reflink. ocfs2: Modify removing xattr process for refcount. ocfs2: Add reflink support for xattr. ocfs2: Create an xattr indexed block if needed. ocfs2: Call refcount tree remove process properly. ocfs2: Attach xattr clusters to refcount tree. ocfs2: Abstract ocfs2 xattr tree extend rec iteration process. ocfs2: Abstract the creation of xattr block. ocfs2: Remove inode from ocfs2_xattr_bucket_get_name_value. ...	2009-09-23 09:29:20 -07:00
Heiko Carstens	4fd8da8d62	fs: change sys_truncate length parameter type For this system call user space passes a signed long length parameter, while the kernel side takes an unsigned long parameter and converts it later to signed long again. This has led to bugs in compat wrappers see e.g. `dd90bbd5` "powerpc: Add compat_sys_truncate". The s390 compat wrapper for this functions is broken as well since it also performs zero extension instead of sign extension for the length parameter. In addition if hpa comes up with an automated way of generating compat wrappers it would generate a wrong one here. So change the length parameter from unsigned long to long. Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 09:21:05 -07:00
Heiko Carstens	a4255e4c1c	ext2: fix format string compile warning (ino_t) Unlike on most other architectures ino_t is an unsigned int on s390. So add an explicit cast to avoid this compile warning: fs/ext2/namei.c: In function 'ext2_lookup': fs/ext2/namei.c:73: warning: format '%lu' expects type 'long unsigned int', but argument 4 has type 'ino_t' Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:58 -07:00
Doug Graham	9f6c133393	V3 minixfs: add missing directory type checking There are a few places in the Minix FS code where the "inode" field of a minix_dir_entry is used without checking first to see if the dirent is really a minix3_dir_entry. The inode number in a V1/V2 dirent is 16 bits, whereas that in a V3 dirent is 32 bits. Accessing it as a 16 bit field when it really should be accessed as a 32 bit field probably kinda sorta works on a little-endian machine, but leads to some rather odd behaviour on big-endian machines. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Doug Graham <dgraham@nortel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:57 -07:00
Bartlomiej Zolnierkiewicz	8b2feb10c9	ncpfs: fix wrong check in __ncp_ioctl() We want to check for s_inode's existence, not inode's one (inode is always valid in this function). This takes care of the following entry from Dan's list: fs/ncpfs/ioctl.c +445 __ncp_ioctl(180) warning: variable derefenced before check 'inode' Reported-by: Dan Carpenter <error27@gmail.com> Cc: Julia Lawall <julia@diku.dk> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Cc: Petr Vandrovec <vandrove@vc.cvut.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:42 -07:00
Roel Kluin	c5df59136a	ncpfs: read buffer overflow This function uses signed integers for the unix_date and local variables - if a negative number is supplied and the leap-year condition is not met, month will be 0, leading to a later read of day_n[-1] Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Cc: Petr Vandrovec <VANDROVE@vc.cvut.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:42 -07:00
maximilian attems	a7e3108cca	ramfs: move RAMFS_MAGIC to include/linux/magic.h initramfs userspace likes to use this magic number. Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: maximilian attems <max@stro.at> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:42 -07:00
KAMEZAWA Hiroyuki	0d4c36a9b6	/proc/kcore: update stat.st_size after memory hotplug After memory hotplug (or other events in future), kcore size can be modified. To update inode->i_size, we have to know inode/dentry but we can't get it from inside /proc directly. But considerinyg memory hotplug, kcore image is updated only when it's opened. Then, updating inode->i_size at open() is enough. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:42 -07:00
KAMEZAWA Hiroyuki	678ad5d8aa	/proc/kcore: fix stat.st_size Presently the size of /proc/kcore which can be read by 'ls -l' is 0. But it's not the correct value. On x86-64, ls -l shows ... root root 140737486266368 2009-09-17 10:29 /proc/kcore Then, 7FFFFFFE02000. This comes from vmalloc area's size. (*) This shows "core" size, not memory size. This patch shows the size by updating "size" field in struct proc_dir_entry. Later, lookup routine will create inode and fill inode->i_size based on this value. Then, this has a problem. - Once inode is cached, inode->i_size will never be updated. Then, this patch is not memory-hotplug-aware. To update inode->i_size, we have to know dentry or inode. But there is no way to lookup them by inside kernel. Hmmm.... Next patch will try it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:42 -07:00
KAMEZAWA Hiroyuki	90396f96b7	kcore: more fixes for init proc_kcore_init() doesn't check NULL case. fix it and remove unnecessary comments. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:42 -07:00
KAMEZAWA Hiroyuki	81ac3ad906	kcore: register module area in generic way Some archs define MODULED_VADDR/MODULES_END which is not in VMALLOC area. This is handled only in x86-64. This patch make it more generic. And we can use vread/vwrite to access the area. Fix it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:42 -07:00
KAMEZAWA Hiroyuki	26562c59fa	kcore: register vmemmap range Benjamin Herrenschmidt <benh@kernel.crashing.org> pointed out that vmemmap range is not included in KCORE_RAM, KCORE_VMALLOC .... This adds KCORE_VMEMMAP if SPARSEMEM_VMEMMAP is used. By this, vmemmap can be readable via /proc/kcore Because it's not vmalloc area, vread/vwrite cannot be used. But the range is static against the memory layout, this patch handles vmemmap area by the same scheme with physical memory. This patch assumes SPARSEMEM_VMEMMAP range is not in VMALLOC range. It's correct now. [akpm@linux-foundation.org: fix typo] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: WANG Cong <xiyou.wangcong@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:41 -07:00
KAMEZAWA Hiroyuki	3089aa1b0c	kcore: use registerd physmem information For /proc/kcore, each arch registers its memory range by kclist_add(). In usual, - range of physical memory - range of vmalloc area - text, etc... are registered but "range of physical memory" has some troubles. It doesn't updated at memory hotplug and it tend to include unnecessary memory holes. Now, /proc/iomem (kernel/resource.c) includes required physical memory range information and it's properly updated at memory hotplug. Then, it's good to avoid using its own code(duplicating information) and to rebuild kclist for physical memory based on /proc/iomem. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: WANG Cong <xiyou.wangcong@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:41 -07:00
KAMEZAWA Hiroyuki	9492587cf3	kcore: register text area in generic way Some 64bit arch has special segment for mapping kernel text. It should be entried to /proc/kcore in addtion to direct-linear-map, vmalloc area. This patch unifies KCORE_TEXT entry scattered under x86 and ia64. I'm not familiar with other archs (mips has its own even after this patch) but range of [_stext ..._end) is a valid area of text and it's not in direct-map area, defining CONFIG_ARCH_PROC_KCORE_TEXT is only a necessary thing to do. Note: I left mips as it is now. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:41 -07:00
KAMEZAWA Hiroyuki	a0614da88b	kcore: register vmalloc area in generic way For /proc/kcore, vmalloc areas are registered per arch. But, all of them registers same range of [VMALLOC_START...VMALLOC_END) This patch unifies them. By this. archs which have no kclist_add() hooks can see vmalloc area correctly. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:41 -07:00
KAMEZAWA Hiroyuki	c30bb2a25f	kcore: add kclist types Presently, kclist_add() only eats start address and size as its arguments. Considering to make kclist dynamically reconfigulable, it's necessary to know which kclists are for System RAM and which are not. This patch add kclist types as KCORE_RAM KCORE_VMALLOC KCORE_TEXT KCORE_OTHER This "type" is used in a patch following this for detecting KCORE_RAM. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:41 -07:00
KAMEZAWA Hiroyuki	2ef43ec772	kcore: use usual list for kclist This patchset is for /proc/kcore. With this, - many per-arch hooks are removed. - /proc/kcore will know really valid physical memory area. - /proc/kcore will be aware of memory hotplug. - /proc/kcore will be architecture independent i.e. if an arch supports CONFIG_MMU, it can use /proc/kcore. (if the arch uses usual memory layout.) This patch: /proc/kcore uses its own list handling codes. It's better to use generic list codes. No changes in logic. just clean up. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:41 -07:00
Stefani Seibold	d899bf7b55	procfs: provide stack information for threads A patch to give a better overview of the userland application stack usage, especially for embedded linux. Currently you are only able to dump the main process/thread stack usage which is showed in /proc/pid/status by the "VmStk" Value. But you get no information about the consumed stack memory of the the threads. There is an enhancement in the /proc/<pid>/{task/,}/maps and which marks the vm mapping where the thread stack pointer reside with "[thread stack xxxxxxxx]". xxxxxxxx is the maximum size of stack. This is a value information, because libpthread doesn't set the start of the stack to the top of the mapped area, depending of the pthread usage. A sample output of /proc/<pid>/task/<tid>/maps looks like: 08048000-08049000 r-xp 00000000 03:00 8312 /opt/z 08049000-0804a000 rw-p 00001000 03:00 8312 /opt/z 0804a000-0806b000 rw-p 00000000 00:00 0 [heap] a7d12000-a7d13000 ---p 00000000 00:00 0 a7d13000-a7f13000 rw-p 00000000 00:00 0 [thread stack: 001ff4b4] a7f13000-a7f14000 ---p 00000000 00:00 0 a7f14000-a7f36000 rw-p 00000000 00:00 0 a7f36000-a8069000 r-xp 00000000 03:00 4222 /lib/libc.so.6 a8069000-a806b000 r--p 00133000 03:00 4222 /lib/libc.so.6 a806b000-a806c000 rw-p 00135000 03:00 4222 /lib/libc.so.6 a806c000-a806f000 rw-p 00000000 00:00 0 a806f000-a8083000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0 a8083000-a8084000 r--p 00013000 03:00 14462 /lib/libpthread.so.0 a8084000-a8085000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0 a8085000-a8088000 rw-p 00000000 00:00 0 a8088000-a80a4000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2 a80a4000-a80a5000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2 a80a5000-a80a6000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2 afaf5000-afb0a000 rw-p 00000000 00:00 0 [stack] ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso] Also there is a new entry "stack usage" in /proc/<pid>/{task/,}/status which will you give the current stack usage in kb. A sample output of /proc/self/status looks like: Name: cat State: R (running) Tgid: 507 Pid: 507 . . . CapBnd: fffffffffffffeff voluntary_ctxt_switches: 0 nonvoluntary_ctxt_switches: 0 Stack usage: 12 kB I also fixed stack base address in /proc/<pid>/{task/,}/stat to the base address of the associated thread stack and not the one of the main process. This makes more sense. [akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()] Signed-off-by: Stefani Seibold <stefani@seibold.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:41 -07:00
Vincent Li	cba8aafe1e	fs/proc/base.c: fix proc_fault_inject_write() input sanity check Remove obfuscated zero-length input check and return -EINVAL instead of -EIO error to make the error message clear to user. Add whitespace stripping. No functionality changes. The old code: echo 1 > /proc/pid/make-it-fail (ok) echo 1foo > /proc/pid/make-it-fail (-bash: echo: write error: Input/output error) The new code: echo 1 > /proc/pid/make-it-fail (ok) echo 1foo > /proc/pid/make-it-fail (-bash: echo: write error: Invalid argument) This patch is conservative in changes to not breaking existing scripts/applications. Signed-off-by: Vincent Li <macli@brc.ubc.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:40 -07:00
Vincent Li	fb92a4b068	fs/proc/task_mmu.c v1: fix clear_refs_write() input sanity check Andrew Morton pointed out similar string hacking and obfuscated check for zero-length input at the end of the function, David Rientjes suggested to use strict_strtol to replace simple_strtol, this patch cover above suggestions, add removing of leading and trailing whitespace from user input. It does not change function behavious. Signed-off-by: Vincent Li <macli@brc.ubc.ca> Acked-by: David Rientjes <rientjes@google.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Amerigo Wang <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:40 -07:00
Amerigo Wang	acef82b873	kcore: fix /proc/kcore's stat.st_size In `9063c61fd5` ("x86, 64-bit: Clean up user address masking") Linus fixed the wrong size of /proc/kcore problem. But its size still looks insane, since it never equals the size of physical memory. Signed-off-by: WANG Cong <amwang@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Tao Ma <tao.ma@oracle.com> Cc: <mtk.manpages@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:40 -07:00
Oleg Nesterov	9b4d1cbef8	proc_flush_task: flush /proc/tid/task/pid when a sub-thread exits The exiting sub-thread flushes /proc/pid only, but this doesn't buy too much: ps and friends mostly use /proc/tid/task/pid. Remove "if (thread_group_leader())" checks from proc_flush_task() path, this means we always remove /proc/tid/task/pid dentry on exit, and this actually matches the comment above proc_flush_task(). The test-case: static void* tfunc(void *arg) { char name[256]; sprintf(name, "/proc/%d/task/%ld/status", getpid(), gettid()); close(open(name, O_RDONLY)); return NULL; } int main(void) { pthread_t t; for (;;) { if (!pthread_create(&t, NULL, &tfunc, NULL)) pthread_join(t, NULL); } } slabtop shows that pid/proc_inode_cache/etc grow quickly and "indefinitely" until the task is killed or shrink_slab() is called, not good. And the main thread needs a lot of time to exit. The same can happen if something like "ps -efL" runs continuously, while some application spawns short-living threads. Reported-by: "James M. Leddy" <jleddy@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dominic Duval <dduval@redhat.com> Cc: Frank Hirtz <fhirtz@redhat.com> Cc: "Fuller, Johnray" <Johnray.Fuller@gs.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Paul Batkowski <pbatkowski@redhat.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:40 -07:00
Kees Cook	cff4edb591	proc: fix reported unit for RLIMIT_CPU /proc/$pid/limits should show RLIMIT_CPU as seconds, which is the unit used in kernel/posix-cpu-timers.c: unsigned long psecs = cputime_to_secs(ptime); ... if (psecs >= sig->rlim[RLIMIT_CPU].rlim_max) { ... __group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk); Signed-off-by: Kees Cook <kees.cook@canonical.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:40 -07:00
Jiri Pirko	1f10206cf8	getrusage: fill ru_maxrss value Make ->ru_maxrss value in struct rusage filled accordingly to rss hiwater mark. This struct is filled as a parameter to getrusage syscall. ->ru_maxrss value is set to KBs which is the way it is done in BSD systems. /usr/bin/time (gnu time) application converts ->ru_maxrss to KBs which seems to be incorrect behavior. Maintainer of this util was notified by me with the patch which corrects it and cc'ed. To make this happen we extend struct signal_struct by two fields. The first one is ->maxrss which we use to store rss hiwater of the task. The second one is ->cmaxrss which we use to store highest rss hiwater of all task childs. These values are used in k_getrusage() to actually fill ->ru_maxrss. k_getrusage() uses current rss hiwater value directly if mm struct exists. Note: exec() clear mm->hiwater_rss, but doesn't clear sig->maxrss. it is intetionally behavior. BSD getrusage have exec() inheriting. test programs ======================================================== getrusage.c =========== #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/types.h> #include <sys/time.h> #include <sys/resource.h> #include <sys/types.h> #include <sys/wait.h> #include <unistd.h> #include <signal.h> #include <sys/mman.h> #include "common.h" #define err(str) perror(str), exit(1) int main(int argc, char* argv) { int status; printf("allocate 100MB\n"); consume(100); printf("testcase1: fork inherit? \n"); printf(" expect: initial.self ~= child.self\n"); show_rusage("initial"); if (__fork()) { wait(&status); } else { show_rusage("fork child"); _exit(0); } printf("\n"); printf("testcase2: fork inherit? (cont.) \n"); printf(" expect: initial.children ~= 100MB, but child.children = 0\n"); show_rusage("initial"); if (__fork()) { wait(&status); } else { show_rusage("child"); _exit(0); } printf("\n"); printf("testcase3: fork + malloc \n"); printf(" expect: child.self ~= initial.self + 50MB\n"); show_rusage("initial"); if (__fork()) { wait(&status); } else { printf("allocate +50MB\n"); consume(50); show_rusage("fork child"); _exit(0); } printf("\n"); printf("testcase4: grandchild maxrss\n"); printf(" expect: post_wait.children ~= 300MB\n"); show_rusage("initial"); if (__fork()) { wait(&status); show_rusage("post_wait"); } else { system("./child -n 0 -g 300"); _exit(0); } printf("\n"); printf("testcase5: zombie\n"); printf(" expect: pre_wait ~= initial, IOW the zombie process is not accounted.\n"); printf(" post_wait ~= 400MB, IOW wait() collect child's max_rss. \n"); show_rusage("initial"); if (__fork()) { sleep(1); /* children become zombie / show_rusage("pre_wait"); wait(&status); show_rusage("post_wait"); } else { system("./child -n 400"); _exit(0); } printf("\n"); printf("testcase6: SIG_IGN\n"); printf(" expect: initial ~= after_zombie (child's 500MB alloc should be ignored).\n"); show_rusage("initial"); signal(SIGCHLD, SIG_IGN); if (__fork()) { sleep(1); / children become zombie / show_rusage("after_zombie"); } else { system("./child -n 500"); _exit(0); } printf("\n"); signal(SIGCHLD, SIG_DFL); printf("testcase7: exec (without fork) \n"); printf(" expect: initial ~= exec \n"); show_rusage("initial"); execl("./child", "child", "-v", NULL); return 0; } child.c ======= #include <sys/types.h> #include <unistd.h> #include <sys/types.h> #include <sys/wait.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/types.h> #include <sys/time.h> #include <sys/resource.h> #include "common.h" int main(int argc, char* argv) { int status; int c; long consume_size = 0; long grandchild_consume_size = 0; int show = 0; while ((c = getopt(argc, argv, "n:g:v")) != -1) { switch (c) { case 'n': consume_size = atol(optarg); break; case 'v': show = 1; break; case 'g': grandchild_consume_size = atol(optarg); break; default: break; } } if (show) show_rusage("exec"); if (consume_size) { printf("child alloc %ldMB\n", consume_size); consume(consume_size); } if (grandchild_consume_size) { if (fork()) { wait(&status); } else { printf("grandchild alloc %ldMB\n", grandchild_consume_size); consume(grandchild_consume_size); exit(0); } } return 0; } common.c ======== #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/types.h> #include <sys/time.h> #include <sys/resource.h> #include <sys/types.h> #include <sys/wait.h> #include <unistd.h> #include <signal.h> #include <sys/mman.h> #include "common.h" #define err(str) perror(str), exit(1) void show_rusage(char prefix) { int err, err2; struct rusage rusage_self; struct rusage rusage_children; printf("%s: ", prefix); err = getrusage(RUSAGE_SELF, &rusage_self); if (!err) printf("self %ld ", rusage_self.ru_maxrss); err2 = getrusage(RUSAGE_CHILDREN, &rusage_children); if (!err2) printf("children %ld ", rusage_children.ru_maxrss); printf("\n"); } / Some buggy OS need this worthless CPU waste. / void make_pagefault(void) { void addr; int size = getpagesize(); int i; for (i=0; i<1000; i++) { addr = mmap(NULL, size, PROT_READ \| PROT_WRITE, MAP_PRIVATE \| MAP_ANON, -1, 0); if (addr == MAP_FAILED) err("make_pagefault"); memset(addr, 0, size); munmap(addr, size); } } void consume(int mega) { size_t sz = mega * 1024 * 1024; void ptr; ptr = malloc(sz); memset(ptr, 0, sz); make_pagefault(); } pid_t __fork(void) { pid_t pid; pid = fork(); make_pagefault(); return pid; } common.h ======== void show_rusage(char prefix); void make_pagefault(void); void consume(int mega); pid_t __fork(void); FreeBSD result (expected result) ======================================================== allocate 100MB testcase1: fork inherit? expect: initial.self ~= child.self initial: self 103492 children 0 fork child: self 103540 children 0 testcase2: fork inherit? (cont.) expect: initial.children ~= 100MB, but child.children = 0 initial: self 103540 children 103540 child: self 103564 children 0 testcase3: fork + malloc expect: child.self ~= initial.self + 50MB initial: self 103564 children 103564 allocate +50MB fork child: self 154860 children 0 testcase4: grandchild maxrss expect: post_wait.children ~= 300MB initial: self 103564 children 154860 grandchild alloc 300MB post_wait: self 103564 children 308720 testcase5: zombie expect: pre_wait ~= initial, IOW the zombie process is not accounted. post_wait ~= 400MB, IOW wait() collect child's max_rss. initial: self 103564 children 308720 child alloc 400MB pre_wait: self 103564 children 308720 post_wait: self 103564 children 411312 testcase6: SIG_IGN expect: initial ~= after_zombie (child's 500MB alloc should be ignored). initial: self 103564 children 411312 child alloc 500MB after_zombie: self 103624 children 411312 testcase7: exec (without fork) expect: initial ~= exec initial: self 103624 children 411312 exec: self 103624 children 411312 Linux result (actual test result) ======================================================== allocate 100MB testcase1: fork inherit? expect: initial.self ~= child.self initial: self 102848 children 0 fork child: self 102572 children 0 testcase2: fork inherit? (cont.) expect: initial.children ~= 100MB, but child.children = 0 initial: self 102876 children 102644 child: self 102572 children 0 testcase3: fork + malloc expect: child.self ~= initial.self + 50MB initial: self 102876 children 102644 allocate +50MB fork child: self 153804 children 0 testcase4: grandchild maxrss expect: post_wait.children ~= 300MB initial: self 102876 children 153864 grandchild alloc 300MB post_wait: self 102876 children 307536 testcase5: zombie expect: pre_wait ~= initial, IOW the zombie process is not accounted. post_wait ~= 400MB, IOW wait() collect child's max_rss. initial: self 102876 children 307536 child alloc 400MB pre_wait: self 102876 children 307536 post_wait: self 102876 children 410076 testcase6: SIG_IGN expect: initial ~= after_zombie (child's 500MB alloc should be ignored). initial: self 102876 children 410076 child alloc 500MB after_zombie: self 102880 children 410076 testcase7: exec (without fork) expect: initial ~= exec initial: self 102880 children 410076 exec: self 102880 children 410076 Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:30 -07:00
Suzuki Poulose	d7d7561c90	fix compat_sys_utimensat() Compat utimensat() returns EINVAL when the tv_nsec is one of UTIME_OMIT or UTIME_NOW and the tv_sec is set to non-zero. As per man pages, the tv_sec field should be ignored. sys_utimensat() works fine in this case. Test case: #define _GNU_SOURCE #define _ATFILE_SOURCE #include <stdio.h> #include <fcntl.h> #include <unistd.h> #include <sys/stat.h> #include <stdlib.h> main(int argc, char argv[]) { struct timespec ts[2]; struct timespec tsp; if (argc < 2) { fprintf(stderr, "Usage : %s filename\n", argv[0]); exit (-1); } ts[0].tv_nsec = ts[1].tv_nsec = UTIME_NOW; ts[0].tv_sec = ts[1].tv_sec = 1; tsp = ts; if (utimensat(AT_FDCWD, argv[1],tsp,0) == -1) perror("utimensat"); else fprintf(stdout, "utimensat success\n"); return 0; } mjs22lp5:~ # cc -m64 utimensat-test.c -o utimensat_test64 mjs22lp5:~ # cc -m32 utimensat-test.c -o utimensat_test32 mjs22lp5:~ # ./utimensat_test32 /tmp/utimensat_test utimensat: Invalid argument mjs22lp5:~ # ./utimensat_test64 /tmp/utimensat_test utimensat success mjs22lp5:~ # uname -r 2.6.31-rc8 With the patch : mjs22lp5:~ # ./utimensat_test64 /tmp/utimensat_test utimensat success mjs22lp5:~ # ./utimensat_test32 /tmp/utimensat_test utimensat success mjs22lp5:~ # uname -r 2.6.31-rc8utimensat Signed-off-by: Suzuki K P <suzuki@in.ibm.com> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:30 -07:00
Christoph Hellwig	945ffe54bb	qnx4: remove write support qnx4 wrte support has never been fully implement, is broken since the dawn of time and hasn't been actively developed since before git history started. Instead of letting it further bitrot and complicate API transition (like the new truncate code) remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Anders Larsen <al@alarsen.net> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:30 -07:00
Christoph Hellwig	8a9f47ddb1	ntfs: remove ntfs_file_write do_sync_write() does the right thing for turning the aio_writev method into a normal non-vectored synchronous write, no need to duplicate it in ntfs. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Anton Altaparmakov <aia21@cantab.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:29 -07:00
Davide Libenzi	562787a5c3	anonfd: split interface into file creation and install Split the anonfd interface into a bare file pointer creation one, and a file pointer creation plus install one. There are cases, like the usage of eventfds inside other kernel interfaces, where the file pointer created by anonfd needs to be used inside the initialization of other structures. As it is right now, as soon as anon_inode_getfd() returns, the kenrle can race with userspace closing the newly installed file descriptor. This patch, while keeping the old anon_inode_getfd(), introduces a new anon_inode_getfile() (whose services are reused in anon_inode_getfd()) that allows to split the file creation phase and the fd install one. Once all the kernel structures are initialized, the code can call the proper fd_install(). Gregory manifested the need for something like this inside KVM. Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: James Morris <jmorris@namei.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Gregory Haskins <ghaskins@novell.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Roland Dreier <rolandd@cisco.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:29 -07:00
H Hartley Sweeten	385773e048	aio.c: move EXPORT* macros to line after function As mentioned in Documentation/CodingStyle, move EXPORT* macro's to the line immediately after the closing function brace line. Also, move the __initcall() similarly. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:29 -07:00
H Hartley Sweeten	1fe72eaa0f	fs/buffer.c: clean up EXPORT* macros According to Documentation/CodingStyle the EXPORT* macro should follow immediately after the closing function brace line. Also, mark_buffer_async_write_endio() and do_thaw_all() are not used elsewhere so they should be marked as static. In addition, file_fsync() is actually in fs/sync.c so move the EXPORT* to that file. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:29 -07:00
Nick Piggin	88e0fbc452	fs: turn iprune_mutex into rwsem We have had a report of bad memory allocation latency during DVD-RAM (UDF) writing. This is causing the user's desktop session to become unusable. Jan tracked the cause of this down to UDF inode reclaim blocking: gnome-screens D ffff810006d1d598 0 20686 1 ffff810006d1d508 0000000000000082 ffff810037db6718 0000000000000800 ffff810006d1d488 ffffffff807e4280 ffffffff807e4280 ffff810006d1a580 ffff8100bccbc140 ffff810006d1a8c0 0000000006d1d4e8 ffff810006d1a8c0 Call Trace: [<ffffffff804477f3>] io_schedule+0x63/0xa5 [<ffffffff802c2587>] sync_buffer+0x3b/0x3f [<ffffffff80447d2a>] __wait_on_bit+0x47/0x79 [<ffffffff80447dc6>] out_of_line_wait_on_bit+0x6a/0x77 [<ffffffff802c24f6>] __wait_on_buffer+0x1f/0x21 [<ffffffff802c442a>] __bread+0x70/0x86 [<ffffffff88de9ec7>] :udf:udf_tread+0x38/0x3a [<ffffffff88de0fcf>] :udf:udf_update_inode+0x4d/0x68c [<ffffffff88de26e1>] :udf:udf_write_inode+0x1d/0x2b [<ffffffff802bcf85>] __writeback_single_inode+0x1c0/0x394 [<ffffffff802bd205>] write_inode_now+0x7d/0xc4 [<ffffffff88de2e76>] :udf:udf_clear_inode+0x3d/0x53 [<ffffffff802b39ae>] clear_inode+0xc2/0x11b [<ffffffff802b3ab1>] dispose_list+0x5b/0x102 [<ffffffff802b3d35>] shrink_icache_memory+0x1dd/0x213 [<ffffffff8027ede3>] shrink_slab+0xe3/0x158 [<ffffffff8027fbab>] try_to_free_pages+0x177/0x232 [<ffffffff8027a578>] __alloc_pages+0x1fa/0x392 [<ffffffff802951fa>] alloc_page_vma+0x176/0x189 [<ffffffff802822d8>] __do_fault+0x10c/0x417 [<ffffffff80284232>] handle_mm_fault+0x466/0x940 [<ffffffff8044b922>] do_page_fault+0x676/0xabf This blocks with iprune_mutex held, which then blocks other reclaimers: X D ffff81009d47c400 0 17285 14831 ffff8100844f3728 0000000000000086 0000000000000000 ffff81000000e288 ffff81000000da00 ffffffff807e4280 ffffffff807e4280 ffff81009d47c400 ffffffff805ff890 ffff81009d47c740 00000000844f3808 ffff81009d47c740 Call Trace: [<ffffffff80447f8c>] __mutex_lock_slowpath+0x72/0xa9 [<ffffffff80447e1a>] mutex_lock+0x1e/0x22 [<ffffffff802b3ba1>] shrink_icache_memory+0x49/0x213 [<ffffffff8027ede3>] shrink_slab+0xe3/0x158 [<ffffffff8027fbab>] try_to_free_pages+0x177/0x232 [<ffffffff8027a578>] __alloc_pages+0x1fa/0x392 [<ffffffff8029507f>] alloc_pages_current+0xd1/0xd6 [<ffffffff80279ac0>] __get_free_pages+0xe/0x4d [<ffffffff802ae1b7>] __pollwait+0x5e/0xdf [<ffffffff8860f2b4>] :nvidia:nv_kern_poll+0x2e/0x73 [<ffffffff802ad949>] do_select+0x308/0x506 [<ffffffff802adced>] core_sys_select+0x1a6/0x254 [<ffffffff802ae0b7>] sys_select+0xb5/0x157 Now I think the main problem is having the filesystem block (and do IO) in inode reclaim. The problem is that this doesn't get accounted well and penalizes a random allocator with a big latency spike caused by work generated from elsewhere. I think the best idea would be to avoid this. By design if possible, or by deferring the hard work to an asynchronous context. If the latter, then the fs would probably want to throttle creation of new work with queue size of the deferred work, but let's not get into those details. Anyway, the other obvious thing we looked at is the iprune_mutex which is causing the cascading blocking. We could turn this into an rwsem to improve concurrency. It is unreasonable to totally ban all potentially slow or blocking operations in inode reclaim, so I think this is a cheap way to get a small improvement. This doesn't solve the whole problem of course. The process doing inode reclaim will still take the latency hit, and concurrent processes may end up contending on filesystem locks. So fs developers should keep these problems in mind. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Jan Kara <jack@ucw.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:29 -07:00
James Morris	88e9d34c72	seq_file: constify seq_operations Make all seq_operations structs const, to help mitigate against revectoring user-triggerable function pointers. This is derived from the grsecurity patch, although generated from scratch because it's simpler than extracting the changes from there. Signed-off-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:29 -07:00
Nick Black	1fd7317d02	Move magic numbers into magic.h Move various magic-number definitions into magic.h. Signed-off-by: Nick Black <dank@qemfd.net> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "David S. Miller" <davem@davemloft.net> Cc: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:28 -07:00
Guillaume Knispel	5ae87e79ec	poll/select: avoid arithmetic overflow in __estimate_accuracy() __estimate_accuracy() was prone to integer overflow, for example if *tv == {2147, 483648000} on a 32 bit computer (or even for delays as small as {429, 500000000} if the task is niced). Because the result was already forced between 0 and 100ms, the effect of the overflow was not too problematic, but the use of the hrtimer range feature was not optimal in overflow cases. This patch ensures that there can not be an integer overflow in this function. Signed-off-by: Guillaume Knispel <gknispel@proformatique.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:27 -07:00
Roel Kluin	ca976c53de	smbfs: read buffer overflow This function uses signed integers for the unix_date and local variables - if a negative number is supplied and the leap-year condition is not met, month will be 0, leading to a read of day_n[-1] Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:27 -07:00
Tyler Hicks	9c2d205664	eCryptfs: Prevent lower dentry from going negative during unlink When calling vfs_unlink() on the lower dentry, d_delete() turns the dentry into a negative dentry when the d_count is 1. This eventually caused a NULL pointer deref when a read() or write() was done and the negative dentry's d_inode was dereferenced in ecryptfs_read_update_atime() or ecryptfs_getxattr(). Placing mutt's tmpdir in an eCryptfs mount is what initially triggered the oops and I was able to reproduce it with the following sequence: open("/tmp/upper/foo", O_RDWR\|O_CREAT\|O_EXCL\|O_NOFOLLOW, 0600) = 3 link("/tmp/upper/foo", "/tmp/upper/bar") = 0 unlink("/tmp/upper/foo") = 0 open("/tmp/upper/bar", O_RDWR\|O_CREAT\|O_NOFOLLOW, 0600) = 4 unlink("/tmp/upper/bar") = 0 write(4, "eCryptfs test\n"..., 14 <unfinished ...> +++ killed by SIGKILL +++ https://bugs.launchpad.net/ecryptfs/+bug/387073 Reported-by: Loïc Minier <loic.minier@canonical.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: ecryptfs-devel@lists.launchpad.net Cc: stable <stable@kernel.org> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2009-09-23 09:10:34 -05:00
Tyler Hicks	96a7b9c2f5	eCryptfs: Propagate vfs_read and vfs_write return codes Errors returned from vfs_read() and vfs_write() calls to the lower filesystem were being masked as -EINVAL. This caused some confusion to users who saw EINVAL instead of ENOSPC when the disk was full, for instance. Also, the actual bytes read or written were not accessible by callers to ecryptfs_read_lower() and ecryptfs_write_lower(), which may be useful in some cases. This patch updates the error handling logic where those functions are called in order to accept positive return codes indicating success. Cc: Eric Sandeen <esandeen@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: ecryptfs-devel@lists.launchpad.net Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2009-09-23 09:10:34 -05:00
Tyler Hicks	3891959846	eCryptfs: Validate global auth tok keys When searching through the global authentication tokens for a given key signature, verify that a matching key has not been revoked and has not expired. This allows the `keyctl revoke` command to be properly used on keys in use by eCryptfs. Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: ecryptfs-devel@lists.launchpad.net Cc: stable <stable@kernel.org> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2009-09-23 09:10:32 -05:00
Tyler Hicks	df6ad33ba1	eCryptfs: Filename encryption only supports password auth tokens Returns -ENOTSUPP when attempting to use filename encryption with something other than a password authentication token, such as a private token from openssl. Using filename encryption with a userspace eCryptfs key module is a future goal. Until then, this patch handles the situation a little better than simply using a BUG_ON(). Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: ecryptfs-devel@lists.launchpad.net Cc: stable <stable@kernel.org> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2009-09-23 09:10:32 -05:00
Tyler Hicks	ac22ba23b6	eCryptfs: Check for O_RDONLY lower inodes when opening lower files If the lower inode is read-only, don't attempt to open the lower file read/write and don't hand off the open request to the privileged eCryptfs kthread for opening it read/write. Instead, only try an unprivileged, read-only open of the file and give up if that fails. This patch fixes an oops when eCryptfs is mounted on top of a read-only mount. Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Eric Sandeen <esandeen@redhat.com> Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: ecryptfs-devel@lists.launchpad.net Cc: stable <stable@kernel.org> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2009-09-23 09:10:32 -05:00
Tyler Hicks	b0105eaefa	eCryptfs: Handle unrecognized tag 3 cipher codes Returns an error when an unrecognized cipher code is present in a tag 3 packet or an ecryptfs_crypt_stat cannot be initialized. Also sets an crypt_stat->tfm error pointer to NULL to ensure that it will not be incorrectly freed in ecryptfs_destroy_crypt_stat(). Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: ecryptfs-devel@lists.launchpad.net Cc: stable <stable@kernel.org> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2009-09-23 09:10:31 -05:00
Dave Hansen	382684984e	ecryptfs: improved dependency checking and reporting So, I compiled a 2.6.31-rc5 kernel with ecryptfs and loaded its module. When it came time to mount my filesystem, I got this in dmesg, and it refused to mount: [93577.776637] Unable to allocate crypto cipher with name [aes]; rc = [-2] [93577.783280] Error attempting to initialize key TFM cipher with name = [aes]; rc = [-2] [93577.791183] Error attempting to initialize cipher with name = [aes] and key size = [32]; rc = [-2] [93577.800113] Error parsing options; rc = [-22] I figured from the error message that I'd either forgotten to load "aes" or that my key size was bogus. Neither one of those was the case. In fact, I was missing the CRYPTO_ECB config option and the 'ecb' module. Unfortunately, there's no trace of 'ecb' in that error message. I've done two things to fix this. First, I've modified ecryptfs's Kconfig entry to select CRYPTO_ECB and CRYPTO_CBC. I also took CRYPTO out of the dependencies since the 'select' will take care of it for us. I've also modified the error messages to print a string that should contain both 'ecb' and 'aes' in my error case. That will give any future users a chance of finding the right modules and Kconfig options. I also wonder if we should: select CRYPTO_AES if !EMBEDDED since I think most ecryptfs users are using AES like me. Cc: ecryptfs-devel@lists.launchpad.net Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: Dustin Kirkland <kirkland@canonical.com> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> [tyhicks@linux.vnet.ibm.com: Removed extra newline, 80-char violation] Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2009-09-23 09:10:31 -05:00
Roland Dreier	aa06117f19	eCryptfs: Fix lockdep-reported AB-BA mutex issue Lockdep reports the following valid-looking possible AB-BA deadlock with global_auth_tok_list_mutex and keysig_list_mutex: ecryptfs_new_file_context() -> ecryptfs_copy_mount_wide_sigs_to_inode_sigs() -> mutex_lock(&mount_crypt_stat->global_auth_tok_list_mutex); -> ecryptfs_add_keysig() -> mutex_lock(&crypt_stat->keysig_list_mutex); vs ecryptfs_generate_key_packet_set() -> mutex_lock(&crypt_stat->keysig_list_mutex); -> ecryptfs_find_global_auth_tok_for_sig() -> mutex_lock(&mount_crypt_stat->global_auth_tok_list_mutex); ie the two mutexes are taken in opposite orders in the two different code paths. I'm not sure if this is a real bug where two threads could actually hit the two paths in parallel and deadlock, but it at least makes lockdep impossible to use with ecryptfs since this report triggers every time and disables future lockdep reporting. Since ecryptfs_add_keysig() is called only from the single callsite in ecryptfs_copy_mount_wide_sigs_to_inode_sigs(), the simplest fix seems to be to move the lock of keysig_list_mutex back up outside of the where global_auth_tok_list_mutex is taken. This patch does that, and fixes the lockdep report on my system (and ecryptfs still works OK). The full output of lockdep fixed by this patch is: ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.31-2-generic #14~rbd2 ------------------------------------------------------- gdm/2640 is trying to acquire lock: (&mount_crypt_stat->global_auth_tok_list_mutex){+.+.+.}, at: [<ffffffff8121591e>] ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90 but task is already holding lock: (&crypt_stat->keysig_list_mutex){+.+.+.}, at: [<ffffffff81217728>] ecryptfs_generate_key_packet_set+0x58/0x2b0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&crypt_stat->keysig_list_mutex){+.+.+.}: [<ffffffff8108c897>] check_prev_add+0x2a7/0x370 [<ffffffff8108cfc1>] validate_chain+0x661/0x750 [<ffffffff8108d2e7>] __lock_acquire+0x237/0x430 [<ffffffff8108d585>] lock_acquire+0xa5/0x150 [<ffffffff815526cd>] __mutex_lock_common+0x4d/0x3d0 [<ffffffff81552b56>] mutex_lock_nested+0x46/0x60 [<ffffffff8121526a>] ecryptfs_add_keysig+0x5a/0xb0 [<ffffffff81213299>] ecryptfs_copy_mount_wide_sigs_to_inode_sigs+0x59/0xb0 [<ffffffff81214b06>] ecryptfs_new_file_context+0xa6/0x1a0 [<ffffffff8120e42a>] ecryptfs_initialize_file+0x4a/0x140 [<ffffffff8120e54d>] ecryptfs_create+0x2d/0x60 [<ffffffff8113a7d4>] vfs_create+0xb4/0xe0 [<ffffffff8113a8c4>] __open_namei_create+0xc4/0x110 [<ffffffff8113d1c1>] do_filp_open+0xa01/0xae0 [<ffffffff8112d8d9>] do_sys_open+0x69/0x140 [<ffffffff8112d9f0>] sys_open+0x20/0x30 [<ffffffff81013132>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff -> #0 (&mount_crypt_stat->global_auth_tok_list_mutex){+.+.+.}: [<ffffffff8108c675>] check_prev_add+0x85/0x370 [<ffffffff8108cfc1>] validate_chain+0x661/0x750 [<ffffffff8108d2e7>] __lock_acquire+0x237/0x430 [<ffffffff8108d585>] lock_acquire+0xa5/0x150 [<ffffffff815526cd>] __mutex_lock_common+0x4d/0x3d0 [<ffffffff81552b56>] mutex_lock_nested+0x46/0x60 [<ffffffff8121591e>] ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90 [<ffffffff812177d5>] ecryptfs_generate_key_packet_set+0x105/0x2b0 [<ffffffff81212f49>] ecryptfs_write_headers_virt+0xc9/0x120 [<ffffffff8121306d>] ecryptfs_write_metadata+0xcd/0x200 [<ffffffff8120e44b>] ecryptfs_initialize_file+0x6b/0x140 [<ffffffff8120e54d>] ecryptfs_create+0x2d/0x60 [<ffffffff8113a7d4>] vfs_create+0xb4/0xe0 [<ffffffff8113a8c4>] __open_namei_create+0xc4/0x110 [<ffffffff8113d1c1>] do_filp_open+0xa01/0xae0 [<ffffffff8112d8d9>] do_sys_open+0x69/0x140 [<ffffffff8112d9f0>] sys_open+0x20/0x30 [<ffffffff81013132>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff other info that might help us debug this: 2 locks held by gdm/2640: #0: (&sb->s_type->i_mutex_key#11){+.+.+.}, at: [<ffffffff8113cb8b>] do_filp_open+0x3cb/0xae0 #1: (&crypt_stat->keysig_list_mutex){+.+.+.}, at: [<ffffffff81217728>] ecryptfs_generate_key_packet_set+0x58/0x2b0 stack backtrace: Pid: 2640, comm: gdm Tainted: G C 2.6.31-2-generic #14~rbd2 Call Trace: [<ffffffff8108b988>] print_circular_bug_tail+0xa8/0xf0 [<ffffffff8108c675>] check_prev_add+0x85/0x370 [<ffffffff81094912>] ? __module_text_address+0x12/0x60 [<ffffffff8108cfc1>] validate_chain+0x661/0x750 [<ffffffff81017275>] ? print_context_stack+0x85/0x140 [<ffffffff81089c68>] ? find_usage_backwards+0x38/0x160 [<ffffffff8108d2e7>] __lock_acquire+0x237/0x430 [<ffffffff8108d585>] lock_acquire+0xa5/0x150 [<ffffffff8121591e>] ? ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90 [<ffffffff8108b0b0>] ? check_usage_backwards+0x0/0xb0 [<ffffffff815526cd>] __mutex_lock_common+0x4d/0x3d0 [<ffffffff8121591e>] ? ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90 [<ffffffff8121591e>] ? ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90 [<ffffffff8108c02c>] ? mark_held_locks+0x6c/0xa0 [<ffffffff81125b0d>] ? kmem_cache_alloc+0xfd/0x1a0 [<ffffffff8108c34d>] ? trace_hardirqs_on_caller+0x14d/0x190 [<ffffffff81552b56>] mutex_lock_nested+0x46/0x60 [<ffffffff8121591e>] ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90 [<ffffffff812177d5>] ecryptfs_generate_key_packet_set+0x105/0x2b0 [<ffffffff81212f49>] ecryptfs_write_headers_virt+0xc9/0x120 [<ffffffff8121306d>] ecryptfs_write_metadata+0xcd/0x200 [<ffffffff81210240>] ? ecryptfs_init_persistent_file+0x60/0xe0 [<ffffffff8120e44b>] ecryptfs_initialize_file+0x6b/0x140 [<ffffffff8120e54d>] ecryptfs_create+0x2d/0x60 [<ffffffff8113a7d4>] vfs_create+0xb4/0xe0 [<ffffffff8113a8c4>] __open_namei_create+0xc4/0x110 [<ffffffff8113d1c1>] do_filp_open+0xa01/0xae0 [<ffffffff8129a93e>] ? _raw_spin_unlock+0x5e/0xb0 [<ffffffff8155410b>] ? _spin_unlock+0x2b/0x40 [<ffffffff81139e9b>] ? getname+0x3b/0x240 [<ffffffff81148a5a>] ? alloc_fd+0xfa/0x140 [<ffffffff8112d8d9>] do_sys_open+0x69/0x140 [<ffffffff81553b8f>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff8112d9f0>] sys_open+0x20/0x30 [<ffffffff81013132>] system_call_fastpath+0x16/0x1b Signed-off-by: Roland Dreier <rolandd@cisco.com> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2009-09-23 09:10:30 -05:00
Roland Dreier	05dafedb90	ecryptfs: Remove unneeded locking that triggers lockdep false positives In ecryptfs_destroy_inode(), inode_info->lower_file_mutex is locked, and just after the mutex is unlocked, the code does: kmem_cache_free(ecryptfs_inode_info_cache, inode_info); This means that if another context could possibly try to take the same mutex as ecryptfs_destroy_inode(), then it could end up getting the mutex just before the data structure containing the mutex is freed. So any such use would be an obvious use-after-free bug (catchable with slab poisoning or mutex debugging), and therefore the locking in ecryptfs_destroy_inode() is not needed and can be dropped. Similarly, in ecryptfs_destroy_crypt_stat(), crypt_stat->keysig_list_mutex is locked, and then the mutex is unlocked just before the code does: memset(crypt_stat, 0, sizeof(struct ecryptfs_crypt_stat)); Therefore taking this mutex is similarly not necessary. Removing this locking fixes false-positive lockdep reports such as the following (and they are false-positives for exactly the same reason that the locking is not needed): ================================= [ INFO: inconsistent lock state ] 2.6.31-2-generic #14~rbd3 --------------------------------- inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. kswapd0/323 [HC0[0]:SC0[0]:HE1:SE1] takes: (&inode_info->lower_file_mutex){+.+.?.}, at: [<ffffffff81210d34>] ecryptfs_destroy_inode+0x34/0x100 {RECLAIM_FS-ON-W} state was registered at: [<ffffffff8108c02c>] mark_held_locks+0x6c/0xa0 [<ffffffff8108c10f>] lockdep_trace_alloc+0xaf/0xe0 [<ffffffff81125a51>] kmem_cache_alloc+0x41/0x1a0 [<ffffffff8113117a>] get_empty_filp+0x7a/0x1a0 [<ffffffff8112dd46>] dentry_open+0x36/0xc0 [<ffffffff8121a36c>] ecryptfs_privileged_open+0x5c/0x2e0 [<ffffffff81210283>] ecryptfs_init_persistent_file+0xa3/0xe0 [<ffffffff8120e838>] ecryptfs_lookup_and_interpose_lower+0x278/0x380 [<ffffffff8120f97a>] ecryptfs_lookup+0x12a/0x250 [<ffffffff8113930a>] real_lookup+0xea/0x160 [<ffffffff8113afc8>] do_lookup+0xb8/0xf0 [<ffffffff8113b518>] __link_path_walk+0x518/0x870 [<ffffffff8113bd9c>] path_walk+0x5c/0xc0 [<ffffffff8113be5b>] do_path_lookup+0x5b/0xa0 [<ffffffff8113bfe7>] user_path_at+0x57/0xa0 [<ffffffff811340dc>] vfs_fstatat+0x3c/0x80 [<ffffffff8113424b>] vfs_stat+0x1b/0x20 [<ffffffff81134274>] sys_newstat+0x24/0x50 [<ffffffff81013132>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff irq event stamp: 7811 hardirqs last enabled at (7811): [<ffffffff810c037f>] call_rcu+0x5f/0x90 hardirqs last disabled at (7810): [<ffffffff810c0353>] call_rcu+0x33/0x90 softirqs last enabled at (3764): [<ffffffff810631da>] __do_softirq+0x14a/0x220 softirqs last disabled at (3751): [<ffffffff8101440c>] call_softirq+0x1c/0x30 other info that might help us debug this: 2 locks held by kswapd0/323: #0: (shrinker_rwsem){++++..}, at: [<ffffffff810f67ed>] shrink_slab+0x3d/0x190 #1: (&type->s_umount_key#35){.+.+..}, at: [<ffffffff811429a1>] prune_dcache+0xd1/0x1b0 stack backtrace: Pid: 323, comm: kswapd0 Tainted: G C 2.6.31-2-generic #14~rbd3 Call Trace: [<ffffffff8108ad6c>] print_usage_bug+0x18c/0x1a0 [<ffffffff8108aff0>] ? check_usage_forwards+0x0/0xc0 [<ffffffff8108bac2>] mark_lock_irq+0xf2/0x280 [<ffffffff8108bd87>] mark_lock+0x137/0x1d0 [<ffffffff81164710>] ? fsnotify_clear_marks_by_inode+0x30/0xf0 [<ffffffff8108bee6>] mark_irqflags+0xc6/0x1a0 [<ffffffff8108d337>] __lock_acquire+0x287/0x430 [<ffffffff8108d585>] lock_acquire+0xa5/0x150 [<ffffffff81210d34>] ? ecryptfs_destroy_inode+0x34/0x100 [<ffffffff8108d2e7>] ? __lock_acquire+0x237/0x430 [<ffffffff815526ad>] __mutex_lock_common+0x4d/0x3d0 [<ffffffff81210d34>] ? ecryptfs_destroy_inode+0x34/0x100 [<ffffffff81164710>] ? fsnotify_clear_marks_by_inode+0x30/0xf0 [<ffffffff81210d34>] ? ecryptfs_destroy_inode+0x34/0x100 [<ffffffff8129a91e>] ? _raw_spin_unlock+0x5e/0xb0 [<ffffffff81552b36>] mutex_lock_nested+0x46/0x60 [<ffffffff81210d34>] ecryptfs_destroy_inode+0x34/0x100 [<ffffffff81145d27>] destroy_inode+0x87/0xd0 [<ffffffff81146b4c>] generic_delete_inode+0x12c/0x1a0 [<ffffffff81145832>] iput+0x62/0x70 [<ffffffff811423c8>] dentry_iput+0x98/0x110 [<ffffffff81142550>] d_kill+0x50/0x80 [<ffffffff81142623>] prune_one_dentry+0xa3/0xc0 [<ffffffff811428b1>] __shrink_dcache_sb+0x271/0x290 [<ffffffff811429d9>] prune_dcache+0x109/0x1b0 [<ffffffff81142abf>] shrink_dcache_memory+0x3f/0x50 [<ffffffff810f68dd>] shrink_slab+0x12d/0x190 [<ffffffff810f9377>] balance_pgdat+0x4d7/0x640 [<ffffffff8104c4c0>] ? finish_task_switch+0x40/0x150 [<ffffffff810f63c0>] ? isolate_pages_global+0x0/0x60 [<ffffffff810f95f7>] kswapd+0x117/0x170 [<ffffffff810777a0>] ? autoremove_wake_function+0x0/0x40 [<ffffffff810f94e0>] ? kswapd+0x0/0x170 [<ffffffff810773be>] kthread+0x9e/0xb0 [<ffffffff8101430a>] child_rip+0xa/0x20 [<ffffffff81013c90>] ? restore_args+0x0/0x30 [<ffffffff81077320>] ? kthread+0x0/0xb0 [<ffffffff81014300>] ? child_rip+0x0/0x20 Signed-off-by: Roland Dreier <roland@digitalvampire.org> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2009-09-23 09:10:30 -05:00
Tao Ma	b80474b432	ocfs2: Use buffer IO if we are appending a file. In ocfs2_file_aio_write, we will prevent direct io if we find that we are appending(changing i_size) and call generic_file_aio_write_nolock. But actually O_DIRECT flag is there and this function will call generic_file_direct_write eventually which will update i_size and leave di->i_size alone. The bug is http://oss.oracle.com/bugzilla/show_bug.cgi?id=1173. So this patch let ocfs2_direct_IO returns 0 directly if we are appending so that buffered write will be called and di->i_size get updated successfully. And this is also what we want in ocfs2_file_aio_write. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2009-09-23 01:54:49 -07:00
Wengang Wang	83e32d9044	ocfs2: add spinlock protection when dealing with lockres->purge. when we check/modify lockres->purge, we should with the protection of lockres->spinlock. in dlm_purge_lockres(), the checking/modifying is not with the protectin. this patch fixes it. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2009-09-23 01:54:48 -07:00
Coly Li	d92bc5127b	dlmglue.c: add missed mlog lines This patch adds the missed mlog_exit() and mlog_exit_void() lines when routines return. Signed-off-by: Coly Li <coly.li@suse.de> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2009-09-23 01:54:47 -07:00
Sunil Mushran	a2f2ddbf2b	ocfs2: __ocfs2_abort() should not enable panic for local mounts In a clustered setup, we have to panic the box on journal abort. This is because we don't have the facility to go hard readonly. With hard ro, another node would detect node failure and initiate recovery. Having said that, we shouldn't force panic if the volume is mounted locally. This patch defers the handling to the mount option, errors. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2009-09-23 01:54:46 -07:00
Tao Ma	bd50873dc7	ocfs2: Add ioctl for reflink. The ioctl will take 3 parameters: old_path, new_path and preserve and call vfs_reflink. It is useful when we backport reflink features to old kernels. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:51 -07:00
Tao Ma	64871b8d62	ocfs2: Enable refcount tree support. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:50 -07:00
Tao Ma	09bf27a000	ocfs2: Implement ocfs2_reflink. Implement ocfs2_reflink. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:49 -07:00
Tao Ma	0fe9b66c65	ocfs2: Add preserve to reflink. reflink has 2 options for the destination file: 1. snapshot: reflink will attempt to preserve ownership, permissions, and all other security state in order to create a full snapshot. 2. new file: it will acquire the data extent sharing but will see the file's security state and attributes initialized as a new file. So add the option to ocfs2. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:49 -07:00
Tao Ma	bc13d34757	ocfs2: Create reflinked file in orphan dir. reflink is a very complicated process, so it can't be integrated into one transaction. So if the system panic in the operation, we may leave a unfinished inode in the destication directory. So we will try to create an inode in orphan_dir first, reflink it to the src file and then move it to the destication file in the end. In that way we won't be afraid of any corruption during the reflink. This patch adds 2 functions for orphan_dir operation: 1. Create a new inode in orphand dir. 2. Move an inode to a target dir. Note: fsck.ocfs2 should work for us to remove the unfinished file in the orphan_dir. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:48 -07:00
Tao Ma	19bd341f6a	ocfs2: Use proper parameter for some inode operation. In order to make the original function more suitable for reflink, we modify the following inode operations. Both are tiny. 1. ocfs2_mknod_locked only use dentry for mlog, so move it to the caller so that reflink can use it without dentry. 2. ocfs2_prepare_orphan_dir only want inode to get its ip_blkno. So use ip_blkno instead. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:47 -07:00
Tao Ma	c18b812d12	ocfs2: Make transaction extend more efficient. In ocfs2_extend_rotate_transaction, op_credits is the orignal credits in the handle and we only want to extend the credits for the rotation, but the old solution always double it. It is harmless for some minor operations, but for actions like reflink we may rotate tree many times and cause the credits increase dramatically. So this patch try to only increase the desired credits. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:46 -07:00
Tao Ma	7540c1a77b	ocfs2: Don't merge in 1st refcount ops of reflink. Actually the whole reflink will touch refcount tree 2 times: 1. It will add the clusters in the extent record to the tree if it isn't refcounted before. 2. It will add 1 refcount to these clusters when it add these extent records to the tree. So actually we shouldn't do merge in the 1st operation since the 2nd one will soon be called and we may have to split it again. Do a merge first and split soon is a waste of time. So we only merge in the 2nd round. This is done by adding a new internal __ocfs2_increase_refcount and call it with "not-merge" for 1st refcount operation in reflink. This also has a side-effect that we don't need to worry too much about the metadata allocation in the 2nd round since it will only merge and no split will happen for those records. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:46 -07:00
Tao Ma	ce9c5a54c0	ocfs2: Modify removing xattr process for refcount. The old xattr value remove is quite simple, it just erase the tree and free the clusters. But as we have added refcount support, The process is a little complicated. We have to lock the refcount tree at the beginning, what's more, we may split the refcount tree in some cases, so meta/credits are needed. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:45 -07:00
Tao Ma	2999d12f4d	ocfs2: Add reflink support for xattr. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:45 -07:00
Tao Ma	a7fe7a3a1a	ocfs2: Create an xattr indexed block if needed. With reflink, there is a need that we create a new xattr indexed block from the very beginning. So add a new parameter for ocfs2_create_xattr_block. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:44 -07:00
Tao Ma	8b2c0dba51	ocfs2: Call refcount tree remove process properly. Now with xattr refcount support, we need to check whether we have xattr refcounted before we remove the refcount tree. Now the mechanism is: 1) Check whether i_clusters == 0, if no, exit. 2) check whether we have i_xattr_loc in dinode. if yes, exit. 2) Check whether we have inline xattr stored outside, if yes, exit. 4) Remove the tree. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:44 -07:00
Tao Ma	0129241e2b	ocfs2: Attach xattr clusters to refcount tree. In ocfs2, when xattr's value is larger than OCFS2_XATTR_INLINE_SIZE, it will be kept outside of the blocks we store xattr entry. And they are stored in a b-tree also. So this patch try to attach all these clusters to refcount tree also. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:43 -07:00
Tao Ma	47bca4950b	ocfs2: Abstract ocfs2 xattr tree extend rec iteration process. Currently we have ocfs2_iterate_xattr_buckets which can receive a para and a callback to iterate a series of bucket. It is good. But actually the 2 callers ocfs2_xattr_tree_list_index_block and ocfs2_delete_xattr_index_block are almost the same. The only difference is that the latter need to handle the extent record also. So add a new function named ocfs2_iterate_xattr_index_block. It can be given func callback which are used for exten record. So now we only have one iteration function for the xattr index block. Ane what's more, it is useful for our future reflink operations. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:43 -07:00
Tao Ma	5aea1f0ef4	ocfs2: Abstract the creation of xattr block. In xattr reflink, we also need to create xattr block, so abstract the process out. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:42 -07:00
Tao Ma	fd68a894fc	ocfs2: Remove inode from ocfs2_xattr_bucket_get_name_value. In ocfs2_xattr_bucket_get_name_value, actually we only use super_block. So use it. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:41 -07:00
Tao Ma	492a8a33e1	ocfs2: Add CoW support for xattr. In order to make 2 transcation(xattr and cow) independent with each other, we CoW the whole xattr out in case we are setting them. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:41 -07:00
Tao Ma	913580b4cd	ocfs2: Abstract duplicate clusters process in CoW. We currently use pagecache to duplicate clusters in CoW, but it isn't suitable for xattr case. So abstract it out so that the caller can decide which method it use. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:40 -07:00
Tao Ma	1061f9c1c9	ocfs2: Return extent flags for xattr value tree. With the new refcount tree, xattr value can also be refcounted among multiple files. So return the appropriate extent flags so that CoW can used it later. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:39 -07:00
Tao Ma	a9063ab9a3	ocfs2: handle file attributes issue for reflink. A reflink creates a snapshot of a file, that means the attributes must be identical except for three exceptions - nlink, ino, and ctime. As for time changes, Here is a brief description: 1. Source file: 1) atime: Ignore. Let the lazy atime code handle that. 2) mtime: don't touch. 3) ctime: If we change the tree (adding REFCOUNTED to at least one extent), update it. 2. Destination file: 1) atime: ignore. 2) mtime: we want it to appear identical to the source. 3) ctime: update. The idea here is that an ls -l will show the same time for the src and target - it shows mtime. Backup software like rsync and tar will treat the new file correctly too. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:39 -07:00
Tao Ma	110a045aca	ocfs2: Add normal functions for reflink a normal file's extents. 2 major functions are added in this patch. ocfs2_attach_refcount_tree will create a new refcount tree to the old file if it doesn't have one and insert all the extent records to the tree if they are not refcounted. ocfs2_create_reflink_node will: 1. set the refcount tree to the new file. 2. call ocfs2_duplicate_extent_list which will iterate all the extents for the old file, insert it to the new file and increase the corresponding referennce count. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:38 -07:00
Tao Ma	37f8a2bfaa	ocfs2: CoW a reflinked cluster when it is truncated. When we truncate a file to a specific size which resides in a reflinked cluster, we need to CoW it since ocfs2_zero_range_for_truncate will zero the space after the size(just another type of write). So we add a "max_cpos" in ocfs2_refcount_cow so that it will stop when it hit the max cluster offset. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:38 -07:00
Tao Ma	293b2f70b4	ocfs2: Integrate CoW in file write. When we use mmap, we CoW the refcountd clusters in ocfs2_write_begin_nolock. While for normal file io(including directio), we do CoW in ocfs2_prepare_inode_for_write. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:37 -07:00
Tao Ma	6ae23c5555	ocfs2: CoW refcount tree improvement. During CoW, if the old extent record is refcounted, we allocate som new clusters and do CoW. Actually we can have some improvement here. If the old extent has refcount=1, that means now it is only used by this file. So we don't need to allocate new clusters, just remove the refcounted flag and it is OK. We also have to remove it from the refcount tree while not deleting it. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:36 -07:00
Tao Ma	6f70fa5199	ocfs2: Add CoW support. This patch try CoW support for a refcounted record. the whole process will be: 1. Calculate how many clusters we need to CoW and where we start. Extents that are not completely encompassed by the write will be broken on 1MB boundaries. 2. Do CoW for the clusters with the help of page cache. 3. Change the b-tree structure with the new allocated clusters. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:36 -07:00
Tao Ma	bcbbb24a6a	ocfs2: Decrement refcount when truncating refcounted extents. Add 'Decrement refcount for delete' in to the normal truncate process. So for a refcounted extent record, call refcount rec decrementation instead of cluster free. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:35 -07:00
Tao Ma	1aa75fea64	ocfs2: Add functions for extents refcounted. Add function ocfs2_mark_extent_refcounted which can mark an extent refcounted. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:34 -07:00
Tao Ma	1823cb0b9f	ocfs2: Add support of decrementing refcount for delete. Given a physical cpos and length, decrement the refcount in the tree. If the refcount for any portion of the extent goes to zero, that portion is queued for freeing. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:33 -07:00
Tao Ma	e73a819db9	ocfs2: Add support for incrementing refcount in the tree. Given a physical cpos and length, increment the refcount in the tree. If the extent has not been seen before, a refcount record is created for it. Refcount records may be merged or split by this operation. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:33 -07:00
Tao Ma	e2e9f6082b	ocfs2: move tree path functions to alloc.h. Now fs/ocfs2/alloc.c has more than 7000 lines. It contains our basic b-tree operation. Although we have already make our b-tree operation generic, the basic structrue ocfs2_path which is used to iterate one b-tree branch is still static and limited to only used in alloc.c. As refcount tree need them and I don't want to add any more b-tree unrelated code to alloc.c, export them out. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:32 -07:00
Tao Ma	fe92441595	ocfs2: Add refcount b-tree as a new extent tree. Add refcount b-tree as a new extent tree so that it can use the b-tree to store and maniuplate ocfs2_refcount_rec. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:31 -07:00
Tao Ma	555936bfcb	ocfs2: Abstract extent split process. ocfs2_mark_extent_written actually does the following things: 1. check the parameters. 2. initialize the left_path and split_rec. 3. call __ocfs2_mark_extent_written. it will do: 1) check the flags of unwritten 2) do the real split work. The whole process is packed tightly somehow. So this patch will abstract 2 different functions so that future b-tree operation can work with it. 1. __ocfs2_split_extent will accept path and split_rec and do the real split work. 2. ocfs2_change_extent_flag will accept a new flag and initialize path and split_rec. So now ocfs2_mark_extent_written will do: 1. check the parameters. 2. call ocfs2_change_extent_flag. 1) initalize the left_path and split_rec. 2) check whether the new flags conflict with the old one. 3) call __ocfs2_split_extent to do the split. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:31 -07:00
Tao Ma	853a3a1439	ocfs2: Wrap ocfs2_extent_contig in ocfs2_extent_tree. Add a new operation eo_ocfs2_extent_contig int the extent tree's operations vector. So that with the new refcount tree, We want this so that refcount trees can always return CONTIG_NONE and prevent extent merging. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:30 -07:00
Tao Ma	8bf396de98	ocfs2: Basic tree root operation. Add basic refcount tree root operation. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:30 -07:00
Tao Ma	374a263e79	ocfs2: Add refcount tree lock mechanism. Implement locking around struct ocfs2_refcount_tree. This protects all read/write operations on refcount trees. ocfs2_refcount_tree has its own lock and its own caching_info, protecting buffers among multiple nodes. User must call ocfs2_lock_refcount_tree before his operation on the tree and unlock it after that. ocfs2_refcount_trees are referenced by the block number of the refcount tree root block, So we create an rb-tree on the ocfs2_super to look them up. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:29 -07:00
Tao Ma	c732eb16bf	ocfs2: Add caching info for refcount tree. refcount tree should use its own caching info so that when we downconvert the refcount tree lock, we can drop all the cached buffer head. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:28 -07:00
Tao Ma	8dec98edfe	ocfs2: Add new refcount tree lock resource in dlmglue. refcount tree lock resource is used to protect refcount tree read/write among multiple nodes. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:28 -07:00
Tao Ma	a433848132	ocfs2: Abstract caching info checkpoint. In meta downconvert, we need to checkpoint the metadata in an inode. For refcount tree, we also need it. So abstract the process out. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:27 -07:00
Tao Ma	f2c870e3b1	ocfs2: Add ocfs2_read_refcount_block. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:26 -07:00
Tao Ma	93c97087a6	ocfs2: Add metaecc for ocfs2_refcount_block. Add metaecc and journal trigger for ocfs2_refcount_block. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:26 -07:00
Tao Ma	721f69c404	ocfs2: Define refcount tree structure. Signed-off-by: Tao Ma <tao.ma@oracle.com>	2009-09-22 20:09:25 -07:00
Chris Mason	7ce618db98	Btrfs: fix early enospc during balancing We now do extra checks before a balance to make sure there is room for the balance to take place. One of the checks was testing to see if we were trying to balance away the last block group of a given type. If there is no space available for new chunks, we should not try and balance away the last block group of a give type. But, the code wasn't checking for available chunk space, and so it was exiting too soon. The fix here is to combine some of the checks and make sure we try to allocate new chunks when we're balancing the last block group. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-22 14:48:44 -04:00
Chris Mason	33b4d47f5e	Btrfs: deal with NULL space info After a balance it is briefly possible for the space info field in the inode to be NULL. This adds some checks to make sure things properly deal with the NULL value. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-22 14:45:50 -04:00
Linus Torvalds	a87e84b5cd	Merge branch 'for-2.6.32' of git://linux-nfs.org/~bfields/linux * 'for-2.6.32' of git://linux-nfs.org/~bfields/linux: (68 commits) nfsd4: nfsv4 clients should cross mountpoints nfsd: revise 4.1 status documentation sunrpc/cache: avoid variable over-loading in cache_defer_req sunrpc/cache: use list_del_init for the list_head entries in cache_deferred_req nfsd: return success for non-NFS4 nfs4_state_start nfsd41: Refactor create_client() nfsd41: modify nfsd4.1 backchannel to use new xprt class nfsd41: Backchannel: Implement cb_recall over NFSv4.1 nfsd41: Backchannel: cb_sequence callback nfsd41: Backchannel: Setup sequence information nfsd41: Backchannel: Server backchannel RPC wait queue nfsd41: Backchannel: Add sequence arguments to callback RPC arguments nfsd41: Backchannel: callback infrastructure nfsd4: use common rpc_cred for all callbacks nfsd4: allow nfs4 state startup to fail SUNRPC: Defer the auth_gss upcall when the RPC call is asynchronous nfsd4: fix null dereference creating nfsv4 callback client nfsd4: fix whitespace in NFSPROC4_CLNT_CB_NULL definition nfsd41: sunrpc: add new xprt class for nfsv4.1 backchannel sunrpc/cache: simplify cache_fresh_locked and cache_fresh_unlocked. ...	2009-09-22 07:54:33 -07:00
Linus Torvalds	342ff1a1b5	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits) trivial: fix typo in aic7xxx comment trivial: fix comment typo in drivers/ata/pata_hpt37x.c trivial: typo in kernel-parameters.txt trivial: fix typo in tracing documentation trivial: add __init/__exit macros in drivers/gpio/bt8xxgpio.c trivial: add __init macro/ fix of __exit macro location in ipmi_poweroff.c trivial: remove unnecessary semicolons trivial: Fix duplicated word "options" in comment trivial: kbuild: remove extraneous blank line after declaration of usage() trivial: improve help text for mm debug config options trivial: doc: hpfall: accept disk device to unload as argument trivial: doc: hpfall: reduce risk that hpfall can do harm trivial: SubmittingPatches: Fix reference to renumbered step trivial: fix typos "man[ae]g?ment" -> "management" trivial: media/video/cx88: add __init/__exit macros to cx88 drivers trivial: fix typo in CONFIG_DEBUG_FS in gcov doc trivial: fix missing printk space in amd_k7_smp_check trivial: fix typo s/ketymap/keymap/ in comment trivial: fix typo "to to" in multiple files trivial: fix typos in comments s/DGBU/DBGU/ ...	2009-09-22 07:51:45 -07:00
Michael S. Tsirkin	3d2d827f5c	mm: move use_mm/unuse_mm from aio.c to mm/ Anyone who wants to do copy to/from user from a kernel thread, needs use_mm (like what fs/aio has). Move that into mm/, to make reusing and exporting easier down the line, and make aio use it. Next intended user, besides aio, will be vhost-net. Acked-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:42 -07:00
Eric B Munson	6bfde05bf5	hugetlbfs: allow the creation of files suitable for MAP_PRIVATE on the vfs internal mount This patchset adds a flag to mmap that allows the user to request that an anonymous mapping be backed with huge pages. This mapping will borrow functionality from the huge page shm code to create a file on the kernel internal mount and use it to approximate an anonymous mapping. The MAP_HUGETLB flag is a modifier to MAP_ANONYMOUS and will not work without both flags being preset. A new flag is necessary because there is no other way to hook into huge pages without creating a file on a hugetlbfs mount which wouldn't be MAP_ANONYMOUS. To userspace, this mapping will behave just like an anonymous mapping because the file is not accessible outside of the kernel. This patchset is meant to simplify the programming model. Presently there is a large chunk of boiler platecode, contained in libhugetlbfs, required to create private, hugepage backed mappings. This patch set would allow use of hugepages without linking to libhugetlbfs or having hugetblfs mounted. Unification of the VM code would provide these same benefits, but it has been resisted each time that it has been suggested for several reasons: it would break PAGE_SIZE assumptions across the kernel, it makes page-table abstractions really expensive, and it does not provide any benefit on architectures that do not support huge pages, incurring fast path penalties without providing any benefit on these architectures. This patch: There are two means of creating mappings backed by huge pages: 1. mmap() a file created on hugetlbfs 2. Use shm which creates a file on an internal mount which essentially maps it MAP_SHARED The internal mount is only used for shared mappings but there is very little that stops it being used for private mappings. This patch extends hugetlbfs_file_setup() to deal with the creation of files that will be mapped MAP_PRIVATE on the internal hugetlbfs mount. This extended API is used in a subsequent patch to implement the MAP_HUGETLB mmap() flag. Signed-off-by: Eric Munson <ebmunson@us.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Adam Litke <agl@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:41 -07:00
Hugh Dickins	3f96b79ad9	tmpfs: depend on shmem CONFIG_SHMEM off gives you (ramfs masquerading as) tmpfs, even when CONFIG_TMPFS is off: that's a little anomalous, and I'd intended to make more sense of it by removing CONFIG_TMPFS altogether, always enabling its code when CONFIG_SHMEM; but so many defconfigs have CONFIG_SHMEM on CONFIG_TMPFS off that we'd better leave that as is. But there is no point in asking for CONFIG_TMPFS if CONFIG_SHMEM is off: make TMPFS depend on SHMEM, which also prevents TMPFS_POSIX_ACL shmem_acl.o being pointlessly built into the kernel when SHMEM is off. And a selfish change, to prevent the world from being rebuilt when I switch between CONFIG_SHMEM on and off: the only CONFIG_SHMEM in the header files is mm.h shmem_lock() - give that a shmem.c stub instead. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:41 -07:00
Hugh Dickins	f3e8fccd06	mm: add get_dump_page In preparation for the next patch, add a simple get_dump_page(addr) interface for the CONFIG_ELF_CORE dumpers to use, instead of calling get_user_pages() directly. They're not interested in errors: they just want to use holes as much as possible, to save space and make sure that the data is aligned where the headers said it would be. Oh, and don't use that horrid DUMP_SEEK(off) macro! Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:40 -07:00
KOSAKI Motohiro	5d863b8968	oom: fix oom_adjust_write() input sanity check Andrew Morton pointed out oom_adjust_write() has very strange EIO and new line handling. this patch fixes it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:39 -07:00
KOSAKI Motohiro	495789a51a	oom: make oom_score to per-process value oom-killer kills a process, not task. Then oom_score should be calculated as per-process too. it makes consistency more and makes speed up select_bad_process(). Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:39 -07:00
KOSAKI Motohiro	28b83c5193	oom: move oom_adj value from task_struct to signal_struct Currently, OOM logic callflow is here. __out_of_memory() select_bad_process() for each task badness() calculate badness of one task oom_kill_process() search child oom_kill_task() kill target task and mm shared tasks with it example, process-A have two thread, thread-A and thread-B and it have very fat memory and each thread have following oom_adj and oom_score. thread-A: oom_adj = OOM_DISABLE, oom_score = 0 thread-B: oom_adj = 0, oom_score = very-high Then, select_bad_process() select thread-B, but oom_kill_task() refuse kill the task because thread-A have OOM_DISABLE. Thus __out_of_memory() call select_bad_process() again. but select_bad_process() select the same task. It mean kernel fall in livelock. The fact is, select_bad_process() must select killable task. otherwise OOM logic go into livelock. And root cause is, oom_adj shouldn't be per-thread value. it should be per-process value because OOM-killer kill a process, not thread. Thus This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct signal_struct. it naturally prevent select_bad_process() choose wrong task. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:39 -07:00
Jan Beulich	4481374ce8	mm: replace various uses of num_physpages by totalram_pages Sizing of memory allocations shouldn't depend on the number of physical pages found in a system, as that generally includes (perhaps a huge amount of) non-RAM pages. The amount of what actually is usable as storage should instead be used as a basis here. Some of the calculations (i.e. those not intending to use high memory) should likely even use (totalram_pages - totalhigh_pages). Signed-off-by: Jan Beulich <jbeulich@novell.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Dave Airlie <airlied@linux.ie> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: "David S. Miller" <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:38 -07:00
KAMEZAWA Hiroyuki	73d7c33e81	kcore: /proc/kcore should use vread /proc/kcore has its own routine to access vmallc area. It can be replaced with vread(). And by this, /proc/kcore can do safe access to vmalloc area. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: WANG Cong <xiyou.wangcong@gmail.com> Cc: Mike Smith <scgtrp@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:34 -07:00
Moussa A. Ba	398499d5f3	pagemap clear_refs: modify to specify anon or mapped vma clearing The patch makes the clear_refs more versatile in adding the option to select anonymous pages or file backed pages for clearing. This addition has a measurable impact on user space application performance as it decreases the number of pagewalks in scenarios where one is only interested in a specific type of page (anonymous or file mapped). The patch adds anonymous and file backed filters to the clear_refs interface. echo 1 > /proc/PID/clear_refs resets the bits on all pages echo 2 > /proc/PID/clear_refs resets the bits on anonymous pages only echo 3 > /proc/PID/clear_refs resets the bits on file backed pages only Any other value is ignored Signed-off-by: Moussa A. Ba <moussa.a.ba@gmail.com> Signed-off-by: Jared E. Hulbert <jaredeh@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:33 -07:00
Hugh Dickins	9a84089514	ksm: identify PageKsm pages KSM will need to identify its kernel merged pages unambiguously, and /proc/kpageflags will probably like to do so too. Since KSM will only be substituting anonymous pages, statistics are best preserved by making a PageKsm page a special PageAnon page: one with no anon_vma. But KSM then needs its own page_add_ksm_rmap() - keep it in ksm.h near PageKsm; and do_wp_page() must COW them, unlike singly mapped PageAnons. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Chris Wright <chrisw@redhat.com> Signed-off-by: Izik Eidus <ieidus@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:31 -07:00
KOSAKI Motohiro	4b02108ac1	mm: oom analysis: add shmem vmstat Recently we encountered OOM problems due to memory use of the GEM cache. Generally a large amuont of Shmem/Tmpfs pages tend to create a memory shortage problem. We often use the following calculation to determine the amount of shmem pages: shmem = NR_ACTIVE_ANON + NR_INACTIVE_ANON - NR_ANON_PAGES however the expression does not consider isolated and mlocked pages. This patch adds explicit accounting for pages used by shmem and tmpfs. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:27 -07:00
KOSAKI Motohiro	c6a7f5728a	mm: oom analysis: Show kernel stack usage in /proc/meminfo and OOM log output The amount of memory allocated to kernel stacks can become significant and cause OOM conditions. However, we do not display the amount of memory consumed by stacks. Add code to display the amount of memory used for stacks in /proc/meminfo. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:27 -07:00
Alexey Dobriyan	83d5cde47d	const: make block_device_operations const Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:25 -07:00
Alexey Dobriyan	7b021967c5	const: make lock_manager_operations const Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:25 -07:00
Alexey Dobriyan	6aed62853c	const: make file_lock_operations const Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:25 -07:00
Alexey Dobriyan	6e1d5dcc2b	const: mark remaining inode_operations as const Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:24 -07:00
Alexey Dobriyan	7f09410bbc	const: mark remaining address_space_operations const Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:24 -07:00
Alexey Dobriyan	ac4cfdd6d1	const: mark remaining export_operations const Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:24 -07:00
Alexey Dobriyan	b87221de6a	const: mark remaining super_operations const Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:24 -07:00
Alexey Dobriyan	0d54b217a2	const: make struct super_block::s_qcop const Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:24 -07:00
Alexey Dobriyan	61e225dc34	const: make struct super_block::dq_op const Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:24 -07:00
Jan Kara	580be0837a	fs: make sure data stored into inode is properly seen before unlocking new inode In theory it could happen that on one CPU we initialize a new inode but clearing of I_NEW \| I_LOCK gets reordered before some of the initialization. Thus on another CPU we return not fully uptodate inode from iget_locked(). This seems to fix a corruption issue on ext3 mounted over NFS. [akpm@linux-foundation.org: add some commentary] Signed-off-by: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:24 -07:00
Josef Bacik	1b2da372b0	Btrfs: account for space used by the super mirrors As we get closer to proper -ENOSPC handling in btrfs, we need more accurate space accounting for the space info's. Currently we exclude the free space for the super mirrors, but the space they take up isn't accounted for in any of the counters. This patch introduces bytes_super, which keeps track of the amount of bytes used for a super mirror in the block group cache and space info. This makes sure that our free space caclucations will be completely accurate. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-21 19:23:50 -04:00
Josef Bacik	25891f796d	Btrfs: fix extent entry threshold calculation There is a slight problem with the extent entry threshold calculation for the free space cache. We only adjust the threshold down as we add bitmaps, but never actually adjust the threshold up as we add bitmaps. This means we could fragment the free space so badly that we end up using all bitmaps to describe the free space, use all the free space which would result in the bitmaps being freed, but then go to add free space again as we delete things and immediately add bitmaps since the extent threshold would still be 0. Now as we free bitmaps the extent threshold will be ratcheted up to allow more extent entries to be added. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-21 19:23:50 -04:00
Josef Bacik	f61408b81c	Btrfs: remove dead code This patch removes a bunch of dead code from the snapshot removal stuff. It was confusing me when doing the metadata ENOSPC stuff so I killed it. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-21 19:23:49 -04:00
Josef Bacik	f019f4264a	Btrfs: fix bitmap size tracking When we first go to add free space, we allocate a new info and set the offset and bytes to the space we are adding. This is fine, except we actually set the size of a bitmap as we set the bits in it, so if we add space to a bitmap, we'd end up counting the same space twice. This isn't a huge deal, it just makes the allocator behave weirdly since it will think that a bitmap entry has more space than it ends up actually having. I used a BUG_ON() to catch when this problem happened, and with this patch I no longer get the BUG_ON(). Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-21 19:23:49 -04:00
Josef Bacik	0a24325e6d	Btrfs: don't keep retrying a block group if we fail to allocate a cluster The box can get locked up in the allocator if we happen upon a block group under these conditions: 1) During a commit, so caching threads cannot make progress 2) Our block group currently is in the middle of being cached 3) Our block group currently has plenty of free space in it 4) Our block group is so fragmented that it ends up having no free space chunks larger than min_bytes calculated by btrfs_find_space_cluster. What happens is we try and do btrfs_find_space_cluster, which fails because it is unable to find enough free space chunks that are large than min_bytes and are close enough together. Since the block group is not cached we do a wait_block_group_cache_progress, which waits for the number of bytes we need, except the block group already has _plenty_ of free space, its just severely fragmented, so we loop and try again, ad infinitum. This patch keeps us from waiting on the block group to finish caching if we failed to find a free space cluster before. It also makes sure that we don't even try to find a free space cluster if we are on our last loop in the allocator, since we will have tried everything at this point at it is futile. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-21 19:23:49 -04:00
Josef Bacik	ba1bf4818b	Btrfs: make balance code choose more wisely when relocating Currently, we can panic the box if the first block group we go to move is of a type where there is no space left to move those extents. For example, if we fill the disk up with data, and then we try to balance and we have no room to move the data nor room to allocate new chunks, we will panic. Change this by checking to see if we have room to move this chunk around, and if not, return -ENOSPC and move on to the next chunk. This will make sure we remove block groups that are moveable, like if we have alot of empty metadata block groups, and then that way we make room to be able to balance our data chunks as well. Tested this with an fs that would panic on btrfs-vol -b normally, but no longer panics with this patch. V1->V2: -actually search for a free extent on the device to make sure we can allocate a chunk if need be. -fix btrfs_shrink_device to make sure we actually try to relocate all the chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r we don't remove the device with data still on it. -check to make sure the block group we are going to relocate isn't the last one in that particular space -fix a bug in btrfs_shrink_device where we would change the device's size and not fix it if we fail to do our relocate Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-21 19:23:48 -04:00
Steve Dickson	3c394ddaa7	nfsd4: nfsv4 clients should cross mountpoints Allow NFS v4 clients to seamlessly cross mount point without have to set either the 'crossmnt' or the 'nohide' export options. Signed-Off-By: Steve Dickson <steved@redhat.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-09-21 16:02:25 -04:00
Sage Weil	1fb58a6051	Btrfs: fix arithmetic error in clone ioctl Fix an arithmetic error that was breaking extents cloned via the clone ioctl starting in the second half of a file. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-21 16:00:27 -04:00
Yan, Zheng	76dda93c6a	Btrfs: add snapshot/subvolume destroy ioctl This patch adds snapshot/subvolume destroy ioctl. A subvolume that isn't being used and doesn't contains links to other subvolumes can be destroyed. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-21 16:00:26 -04:00
Yan, Zheng	4df27c4d5c	Btrfs: change how subvolumes are organized btrfs allows subvolumes and snapshots anywhere in the directory tree. If we snapshot a subvolume that contains a link to other subvolume called subvolA, subvolA can be accessed through both the original subvolume and the snapshot. This is similar to creating hard link to directory, and has the very similar problems. The aim of this patch is enforcing there is only one access point to each subvolume. Only the first directory entry (the one added when the subvolume/snapshot was created) is treated as valid access point. The first directory entry is distinguished by checking root forward reference. If the corresponding root forward reference is missing, we know the entry is not the first one. This patch also adds snapshot/subvolume rename support, the code allows rename subvolume link across subvolumes. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-21 15:56:00 -04:00
Yan, Zheng	13a8a7c8c4	Btrfs: do not reuse objectid of deleted snapshot/subvol The new back reference format does not allow reusing objectid of deleted snapshot/subvol. So we use ++highest_objectid to allocate objectid for new snapshot/subvol. Now we use ++highest_objectid to allocate objectid for both new inode and new snapshot/subvolume, so this patch removes 'find hole' code in btrfs_find_free_objectid. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-21 15:56:00 -04:00
Yan, Zheng	1c4850e21d	Btrfs: speed up snapshot dropping This patch contains two changes to avoid unnecessary tree block reads during snapshot dropping. First, check tree block's reference count and flags before reading the tree block. if reference count > 1 and there is no need to update backrefs, we can avoid reading the tree block. Second, save when snapshot was created in root_key.offset. we can compare block pointer's generation with snapshot's creation generation during updating backrefs. If a given block was created before snapshot was created, the snapshot can't be the tree block's owner. So we can avoid reading the block. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-21 15:55:59 -04:00
Linus Torvalds	43c1266ce4	Merge branch 'perfcounters-rename-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perfcounters-rename-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: Tidy up after the big rename perf: Do the big rename: Performance Counters -> Performance Events perf_counter: Rename 'event' to event_id/hw_event perf_counter: Rename list_entry -> group_entry, counter_list -> group_list Manually resolved some fairly trivial conflicts with the tracing tree in include/trace/ftrace.h and kernel/trace/trace_syscalls.c.	2009-09-21 09:15:07 -07:00
Linus Torvalds	58e75a0973	Merge branch 'writeback' of git://git.kernel.dk/linux-2.6-block * 'writeback' of git://git.kernel.dk/linux-2.6-block: nfs: initialize the backing_dev_info when creating the server writeback: make balance_dirty_pages() gradually back more off writeback: don't use schedule_timeout() without setting runstate nfs: nfs_kill_super() should call bdi_unregister() after killing super	2009-09-21 09:04:30 -07:00
Jens Axboe	48d0764998	nfs: initialize the backing_dev_info when creating the server NFS may free the server structure without ever having used the bdi, so we either need to flag the bdi as being uninitialized or initialize it up front. This does the latter. This fixes a crash with mounting more than one NFS file system, should people ever need that kind of obscure NFS functionality. Tested-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-21 15:40:33 +02:00
Jens Axboe	92f25053c0	nfs: nfs_kill_super() should call bdi_unregister() after killing super Otherwise we could be attempting to flush data for a writeback thread and bdi that have already disappeared. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-21 15:40:32 +02:00
Joe Perches	a419aef8b8	trivial: remove unnecessary semicolons Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-09-21 15:14:58 +02:00
Anand Gadiyar	fd589a8f0a	trivial: fix typo "to to" in multiple files Signed-off-by: Anand Gadiyar <gadiyar@ti.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-09-21 15:14:55 +02:00
Anand Gadiyar	411c940385	trivial: fix typo "for for" in multiple files trivial: fix typo "for for" in multiple files Signed-off-by: Anand Gadiyar <gadiyar@ti.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-09-21 15:14:54 +02:00
Ingo Molnar	cdd6c482c9	perf: Do the big rename: Performance Counters -> Performance Events Bye-bye Performance Counters, welcome Performance Events! In the past few months the perfcounters subsystem has grown out its initial role of counting hardware events, and has become (and is becoming) a much broader generic event enumeration, reporting, logging, monitoring, analysis facility. Naming its core object 'perf_counter' and naming the subsystem 'perfcounters' has become more and more of a misnomer. With pending code like hw-breakpoints support the 'counter' name is less and less appropriate. All in one, we've decided to rename the subsystem to 'performance events' and to propagate this rename through all fields, variables and API names. (in an ABI compatible fashion) The word 'event' is also a bit shorter than 'counter' - which makes it slightly more convenient to write/handle as well. Thanks goes to Stephane Eranian who first observed this misnomer and suggested a rename. User-space tooling and ABI compatibility is not affected - this patch should be function-invariant. (Also, defconfigs were not touched to keep the size down.) This patch has been generated via the following script: FILES=$(find * -type f \| grep -vE 'oprofile\|[^K]config') sed -i \ -e 's/PERF_EVENT_/PERF_RECORD_/g' \ -e 's/PERF_COUNTER/PERF_EVENT/g' \ -e 's/perf_counter/perf_event/g' \ -e 's/nb_counters/nb_events/g' \ -e 's/swcounter/swevent/g' \ -e 's/tpcounter_event/tp_event/g' \ $FILES for N in $(find . -name perf_counter.[ch]); do M=$(echo $N \| sed 's/perf_counter/perf_event/g') mv $N $M done FILES=$(find . -name perf_event.*) sed -i \ -e 's/COUNTER_MASK/REG_MASK/g' \ -e 's/COUNTER/EVENT/g' \ -e 's/\<event\>/event_id/g' \ -e 's/counter/event/g' \ -e 's/Counter/Event/g' \ $FILES ... to keep it as correct as possible. This script can also be used by anyone who has pending perfcounters patches - it converts a Linux kernel tree over to the new naming. We tried to time this change to the point in time where the amount of pending patches is the smallest: the end of the merge window. Namespace clashes were fixed up in a preparatory patch - and some stylistic fallout will be fixed up in a subsequent patch. ( NOTE: 'counters' are still the proper terminology when we deal with hardware registers - and these sed scripts are a bit over-eager in renaming them. I've undone some of that, but in case there's something left where 'counter' would be better than 'event' we can undo that on an individual basis instead of touching an otherwise nicely automated patch. ) Suggested-by: Stephane Eranian <eranian@google.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Paul Mackerras <paulus@samba.org> Reviewed-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Howells <dhowells@redhat.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <linux-arch@vger.kernel.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-21 14:28:04 +02:00

... 3 4 5 6 7 ...

15788 Commits