This patchset introduces wakeup hints for some of the most popular (from
epoll POV) devices, so that epoll code can avoid spurious wakeups on its
waiters.
The problem with epoll is that the callback-based wakeups do not, ATM,
carry any information about the events the wakeup is related to. So the
only choice epoll has (not being able to call f_op->poll() from inside the
callback), is to add the file* to a ready-list and resolve the real events
later on, at epoll_wait() (or its own f_op->poll()) time. This can cause
spurious wakeups, since the wake_up() itself might be for an event the
caller is not interested into.
The rate of these spurious wakeup can be pretty high in case of many
network sockets being monitored.
By allowing devices to report the events the wakeups refer to (at least
the two major classes - POLLIN/POLLOUT), we are able to spare useless
wakeups by proper handling inside the epoll's poll callback.
Epoll will have in any case to call f_op->poll() on the file* later on,
since the change to be done in order to have the full event set sent via
wakeup, is too invasive for the way our f_op->poll() system works (the
full event set is calculated inside the poll function - there are too many
of them to even start thinking the change - also poll/select would need
change too).
Epoll is changed in a way that both devices which send event hints, and
the ones that don't, are correctly handled. The former will gain some
efficiency though.
As a general rule for devices, would be to add an event mask by using
key-aware wakeup macros, when making up poll wait queues. I tested it
(together with the epoll's poll fix patch Andrew has in -mm) and wakeups
for the supported devices are correctly filtered.
Test program available here:
http://www.xmailserver.org/epoll_test.c
This patch:
Nothing revolutionary here. Just using the available "key" that our
wakeup core already support. The __wake_up_locked_key() was no brainer,
since both __wake_up_locked() and __wake_up_locked_key() are thin wrappers
around __wake_up_common().
The __wake_up_sync() function had a body, so the choice was between
borrowing the body for __wake_up_sync_key() and calling it from
__wake_up_sync(), or make an inline and calling it from both. I chose the
former since in most archs it all resolves to "mov $0, REG; jmp ADDR".
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: William Lee Irwin III <wli@movementarian.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
People started using eventfd in a semaphore-like way where before they
were using pipes.
That is, counter-based resource access. Where a "wait()" returns
immediately by decrementing the counter by one, if counter is greater than
zero. Otherwise will wait. And where a "post(count)" will add count to
the counter releasing the appropriate amount of waiters. If eventfd the
"post" (write) part is fine, while the "wait" (read) does not dequeue 1,
but the whole counter value.
The problem with eventfd is that a read() on the fd returns and wipes the
whole counter, making the use of it as semaphore a little bit more
cumbersome. You can do a read() followed by a write() of COUNTER-1, but
IMO it's pretty easy and cheap to make this work w/out extra steps. This
patch introduces a new eventfd flag that tells eventfd to only dequeue 1
from the counter, allowing simple read/write to make it behave like a
semaphore. Simple test here:
http://www.xmailserver.org/eventfd-sem.c
To be back-compatible with earlier kernels, userspace applications should
probe for the availability of this feature via
#ifdef EFD_SEMAPHORE
fd = eventfd2 (CNT, EFD_SEMAPHORE);
if (fd == -1 && errno == EINVAL)
<fallback>
#else
<fallback>
#endif
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: <linux-api@vger.kernel.org>
Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
eventpoll.c uses void * in one place for no obvious reason; change it to
use the real type instead.
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ep_modify() doesn't need to set event.data from within the ep->lock
spinlock as the comment suggests. The only place event.data is used is
ep_send_events_proc(), and this is protected by ep->mtx instead of
ep->lock. Also update the comment for mutex_lock() at the top of
ep_scan_ready_list(), which mentions epoll_ctl(EPOLL_CTL_DEL) but not
epoll_ctl(EPOLL_CTL_MOD).
ep_modify() can also use spin_lock_irq() instead of spin_lock_irqsave().
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
xchg in ep_unregister_pollwait() is unnecessary because it is protected by
either epmutex or ep->mtx (the same protection as ep_remove()).
If xchg was necessary, it would be insufficient to protect against
problems: if multiple concurrent calls to ep_unregister_pollwait() were
possible then a second caller that returns without doing anything because
nwait == 0 could return before the waitqueues are removed by the first
caller, which looks like it could lead to problematic races with
ep_poll_callback().
So remove xchg and add comments about the locking.
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If epoll_wait returns -EFAULT, the event that was being returned when the
fault was encountered will be forgotten. This is not a big deal since
EFAULT will happen only if a buggy userspace program passes in a bad
address, in which case what happens later usually doesn't matter.
However, it is easy to remember the event for later, and this patch makes
a simple change to do that.
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ep_call_nested() (formerly ep_poll_safewake()) uses "current" (without
dereferencing it) to detect callback recursion, but it may be called from
irq context where the use of current is generally discouraged. It would
be better to use get_cpu() and put_cpu() to detect the callback recursion.
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove debugging code from epoll. There's no need for it to be included
into mainline code.
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a bug inside the epoll's f_op->poll() code, that returns POLLIN even
though there are no actual ready monitored fds. The bug shows up if you
add an epoll fd inside another fd container (poll, select, epoll).
The problem is that callback-based wake ups used by epoll does not carry
(patches will follow, to fix this) any information about the events that
actually happened. So the callback code, since it can't call the file*
->poll() inside the callback, chains the file* into a ready-list.
So, suppose you added an fd with EPOLLOUT only, and some data shows up on
the fd, the file* mapped by the fd will be added into the ready-list (via
wakeup callback). During normal epoll_wait() use, this condition is
sorted out at the time we're actually able to call the file*'s
f_op->poll().
Inside the old epoll's f_op->poll() though, only a quick check
!list_empty(ready-list) was performed, and this could have led to
reporting POLLIN even though no ready fds would show up at a following
epoll_wait(). In order to correctly report the ready status for an epoll
fd, the ready-list must be checked to see if any really available fd+event
would be ready in a following epoll_wait().
Operation (calling f_op->poll() from inside f_op->poll()) that, like wake
ups, must be handled with care because of the fact that epoll fds can be
added to other epoll fds.
Test code:
/*
* epoll_test by Davide Libenzi (Simple code to test epoll internals)
* Copyright (C) 2008 Davide Libenzi
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
*
* Davide Libenzi <davidel@xmailserver.org>
*
*/
#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <signal.h>
#include <limits.h>
#include <poll.h>
#include <sys/epoll.h>
#include <sys/wait.h>
#define EPWAIT_TIMEO (1 * 1000)
#ifndef POLLRDHUP
#define POLLRDHUP 0x2000
#endif
#define EPOLL_MAX_CHAIN 100L
#define EPOLL_TF_LOOP (1 << 0)
struct epoll_test_cfg {
long size;
long flags;
};
static int xepoll_create(int n) {
int epfd;
if ((epfd = epoll_create(n)) == -1) {
perror("epoll_create");
exit(2);
}
return epfd;
}
static void xepoll_ctl(int epfd, int cmd, int fd, struct epoll_event *evt) {
if (epoll_ctl(epfd, cmd, fd, evt) < 0) {
perror("epoll_ctl");
exit(3);
}
}
static void xpipe(int *fds) {
if (pipe(fds)) {
perror("pipe");
exit(4);
}
}
static pid_t xfork(void) {
pid_t pid;
if ((pid = fork()) == (pid_t) -1) {
perror("pipe");
exit(5);
}
return pid;
}
static int run_forked_proc(int (*proc)(void *), void *data) {
int status;
pid_t pid;
if ((pid = xfork()) == 0)
exit((*proc)(data));
if (waitpid(pid, &status, 0) != pid) {
perror("waitpid");
return -1;
}
return WIFEXITED(status) ? WEXITSTATUS(status): -2;
}
static int check_events(int fd, int timeo) {
struct pollfd pfd;
fprintf(stdout, "Checking events for fd %d\n", fd);
memset(&pfd, 0, sizeof(pfd));
pfd.fd = fd;
pfd.events = POLLIN | POLLOUT;
if (poll(&pfd, 1, timeo) < 0) {
perror("poll()");
return 0;
}
if (pfd.revents & POLLIN)
fprintf(stdout, "\tPOLLIN\n");
if (pfd.revents & POLLOUT)
fprintf(stdout, "\tPOLLOUT\n");
if (pfd.revents & POLLERR)
fprintf(stdout, "\tPOLLERR\n");
if (pfd.revents & POLLHUP)
fprintf(stdout, "\tPOLLHUP\n");
if (pfd.revents & POLLRDHUP)
fprintf(stdout, "\tPOLLRDHUP\n");
return pfd.revents;
}
static int epoll_test_tty(void *data) {
int epfd, ifd = fileno(stdin), res;
struct epoll_event evt;
if (check_events(ifd, 0) != POLLOUT) {
fprintf(stderr, "Something is cooking on STDIN (%d)\n", ifd);
return 1;
}
epfd = xepoll_create(1);
fprintf(stdout, "Created epoll fd (%d)\n", epfd);
memset(&evt, 0, sizeof(evt));
evt.events = EPOLLIN;
xepoll_ctl(epfd, EPOLL_CTL_ADD, ifd, &evt);
if (check_events(epfd, 0) & POLLIN) {
res = epoll_wait(epfd, &evt, 1, 0);
if (res == 0) {
fprintf(stderr, "Epoll fd (%d) is ready when it shouldn't!\n",
epfd);
return 2;
}
}
return 0;
}
static int epoll_wakeup_chain(void *data) {
struct epoll_test_cfg *tcfg = data;
int i, res, epfd, bfd, nfd, pfds[2];
pid_t pid;
struct epoll_event evt;
memset(&evt, 0, sizeof(evt));
evt.events = EPOLLIN;
epfd = bfd = xepoll_create(1);
for (i = 0; i < tcfg->size; i++) {
nfd = xepoll_create(1);
xepoll_ctl(bfd, EPOLL_CTL_ADD, nfd, &evt);
bfd = nfd;
}
xpipe(pfds);
if (tcfg->flags & EPOLL_TF_LOOP)
{
xepoll_ctl(bfd, EPOLL_CTL_ADD, epfd, &evt);
/*
* If we're testing for loop, we want that the wakeup
* triggered by the write to the pipe done in the child
* process, triggers a fake event. So we add the pipe
* read size with EPOLLOUT events. This will trigger
* an addition to the ready-list, but no real events
* will be there. The the epoll kernel code will proceed
* in calling f_op->poll() of the epfd, triggering the
* loop we want to test.
*/
evt.events = EPOLLOUT;
}
xepoll_ctl(bfd, EPOLL_CTL_ADD, pfds[0], &evt);
/*
* The pipe write must come after the poll(2) call inside
* check_events(). This tests the nested wakeup code in
* fs/eventpoll.c:ep_poll_safewake()
* By having the check_events() (hence poll(2)) happens first,
* we have poll wait queue filled up, and the write(2) in the
* child will trigger the wakeup chain.
*/
if ((pid = xfork()) == 0) {
sleep(1);
write(pfds[1], "w", 1);
exit(0);
}
res = check_events(epfd, 2000) & POLLIN;
if (waitpid(pid, NULL, 0) != pid) {
perror("waitpid");
return -1;
}
return res;
}
static int epoll_poll_chain(void *data) {
struct epoll_test_cfg *tcfg = data;
int i, res, epfd, bfd, nfd, pfds[2];
pid_t pid;
struct epoll_event evt;
memset(&evt, 0, sizeof(evt));
evt.events = EPOLLIN;
epfd = bfd = xepoll_create(1);
for (i = 0; i < tcfg->size; i++) {
nfd = xepoll_create(1);
xepoll_ctl(bfd, EPOLL_CTL_ADD, nfd, &evt);
bfd = nfd;
}
xpipe(pfds);
if (tcfg->flags & EPOLL_TF_LOOP)
{
xepoll_ctl(bfd, EPOLL_CTL_ADD, epfd, &evt);
/*
* If we're testing for loop, we want that the wakeup
* triggered by the write to the pipe done in the child
* process, triggers a fake event. So we add the pipe
* read size with EPOLLOUT events. This will trigger
* an addition to the ready-list, but no real events
* will be there. The the epoll kernel code will proceed
* in calling f_op->poll() of the epfd, triggering the
* loop we want to test.
*/
evt.events = EPOLLOUT;
}
xepoll_ctl(bfd, EPOLL_CTL_ADD, pfds[0], &evt);
/*
* The pipe write mush come before the poll(2) call inside
* check_events(). This tests the nested f_op->poll calls code in
* fs/eventpoll.c:ep_eventpoll_poll()
* By having the pipe write(2) happen first, we make the kernel
* epoll code to load the ready lists, and the following poll(2)
* done inside check_events() will test nested poll code in
* ep_eventpoll_poll().
*/
if ((pid = xfork()) == 0) {
write(pfds[1], "w", 1);
exit(0);
}
sleep(1);
res = check_events(epfd, 1000) & POLLIN;
if (waitpid(pid, NULL, 0) != pid) {
perror("waitpid");
return -1;
}
return res;
}
int main(int ac, char **av) {
int error;
struct epoll_test_cfg tcfg;
fprintf(stdout, "\n********** Testing TTY events\n");
error = run_forked_proc(epoll_test_tty, NULL);
fprintf(stdout, error == 0 ?
"********** OK\n": "********** FAIL (%d)\n", error);
tcfg.size = 3;
tcfg.flags = 0;
fprintf(stdout, "\n********** Testing short wakeup chain\n");
error = run_forked_proc(epoll_wakeup_chain, &tcfg);
fprintf(stdout, error == POLLIN ?
"********** OK\n": "********** FAIL (%d)\n", error);
tcfg.size = EPOLL_MAX_CHAIN;
tcfg.flags = 0;
fprintf(stdout, "\n********** Testing long wakeup chain (HOLD ON)\n");
error = run_forked_proc(epoll_wakeup_chain, &tcfg);
fprintf(stdout, error == 0 ?
"********** OK\n": "********** FAIL (%d)\n", error);
tcfg.size = 3;
tcfg.flags = 0;
fprintf(stdout, "\n********** Testing short poll chain\n");
error = run_forked_proc(epoll_poll_chain, &tcfg);
fprintf(stdout, error == POLLIN ?
"********** OK\n": "********** FAIL (%d)\n", error);
tcfg.size = EPOLL_MAX_CHAIN;
tcfg.flags = 0;
fprintf(stdout, "\n********** Testing long poll chain (HOLD ON)\n");
error = run_forked_proc(epoll_poll_chain, &tcfg);
fprintf(stdout, error == 0 ?
"********** OK\n": "********** FAIL (%d)\n", error);
tcfg.size = 3;
tcfg.flags = EPOLL_TF_LOOP;
fprintf(stdout, "\n********** Testing loopy wakeup chain (HOLD ON)\n");
error = run_forked_proc(epoll_wakeup_chain, &tcfg);
fprintf(stdout, error == 0 ?
"********** OK\n": "********** FAIL (%d)\n", error);
tcfg.size = 3;
tcfg.flags = EPOLL_TF_LOOP;
fprintf(stdout, "\n********** Testing loopy poll chain (HOLD ON)\n");
error = run_forked_proc(epoll_poll_chain, &tcfg);
fprintf(stdout, error == 0 ?
"********** OK\n": "********** FAIL (%d)\n", error);
return 0;
}
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Pavel Pisa <pisa@cmp.felk.cvut.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a driver for Intersil's ISL29003 ambient light sensor device plus some
documentation. Inspired by tsl2550.c, a driver for a similar device.
It is put in drivers/misc for now until the industrial I/O framework gets
merged.
Signed-off-by: Daniel Mack <daniel@caiaq.de>
Acked-by: Jonathan Cameron <jic23@cam.ac.uk>
Cc: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change hpilo open and close logic to spin for 10usec between checking device,
rather than every usec.
Because the loop is coded to take up to 10ms, it seemed prudent to
increase the interval between polling the device, to reduce the load on
the system and allow more other work to happen.
Signed-off-by: David Altobelli <david.altobelli@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The base versions handle constant folding now and are shorter than these
private wrappers, use them directly.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We cover all log-levels by pr_... macros except KERN_CONT one. Add it
for convenience.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The GPIO API is supposed to return 0 or a negative error code,
but the SSB GPIO functions return the bitmask of the GPIO register.
Fix this by ignoring the bitmask and always returning 0. The SSB GPIO functions can't fail.
Signed-off-by: Michael Buesch <mb@bu3sch.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Brownell <david-b@pacbell.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove PARPORT dependency for Auxiliary Display support.
This is not needed since the dependency for the KS0108 driver is
PARPORT_PC.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Miguel Ojeda Sandonis <miguel.ojeda.sandonis@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that the filesystem freeze operation has been elevated to the VFS, and
is just an ioctl away, some sort of safety net for unintentionally frozen
root filesystems may be in order.
The timeout thaw originally proposed did not get merged, but perhaps
something like this would be useful in emergencies.
For example, freeze /path/to/mountpoint may freeze your root filesystem if
you forgot that you had that unmounted.
I chose 'j' as the last remaining character other than 'h' which is sort
of reserved for help (because help is generated on any unknown character).
I've tested this on a non-root fs with multiple (nested) freezers, as well
as on a system rendered unresponsive due to a frozen root fs.
[randy.dunlap@oracle.com: emergency thaw only if CONFIG_BLOCK enabled]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: Takashi Sato <t-sato@yk.jp.nec.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tfour 4 redundant if-conditions in function __rb_erase_color() in
lib/rbtree.c are removed.
In pseudo-source-code, the structure of the code is as follows:
if ((!A || B) && (!C || D)) {
.
.
.
} else {
if (!C || D) {//if this is true, it implies: (A == true) && (B == false)
if (A) {//hence this always evaluates to 'true'...
.
}
.
//at this point, C always becomes true, because of:
__rb_rotate_right/left();
//and:
other = parent->rb_right/left;
}
.
.
if (C) {//...and this too !
.
}
}
Signed-off-by: Wolfram Strepp <wstrepp@gmx.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These comments are useless now, remove them.
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These error messages are from check_sysemu(), not check_ptrace().
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
uml uses a concatenated string literal to store the contents of .config,
but .config file content is varaible, it can be very long.
Use an array of string literals instead.
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MAJOR_NR isn't needed anymore since very early 2.5 kernels.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make the following header file changes:
- remove arch ifdefs and asm/suspend.h from linux/suspend.h
- add asm/suspend.h to disk.c (for arch_prepare_suspend())
- add linux/io.h to swsusp.c (for ioremap())
- x86 32/64 bit compile fixes
Signed-off-by: Magnus Damm <damm@igel.co.jp>
Cc: Paul Mundt <lethal@linux-sh.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert alpha architecture to use u64 as unsigned long long. This is
being done so that (a) all arches use u64 as unsigned long long and (b)
printk of a u64 as %ll[ux] will not generate format warnings by gcc.
The only gcc cross-compiler that I have is 4.0.2, which generates errors
about miscompiling __weak references, so I have commented out that line in
compiler-gcc4.h so that most of these compile, but more builds and real
machine testing would be Real Good.
[akpm@linux-foundation.org: fix warning]
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
From: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- "_local" versions of xchg/cmpxchg functions duplicate code
of non-local ones (quite a few pages of assembler), except
memory barriers. We can generate these two variants from a
single header file using simple macros;
- convert xchg macro back to inline function using always_inline
attribute;
- use proper argument types for cmpxchg_u8/u16 functions
to fix a problem with negative arguments.
Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When this macros isn't called with 'fixup', e.g. with foo this will
incorectly expand to foo->foo.bits.errreg
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Synopsis: if shmem_writepage calls swap_writepage directly, most shmem
swap loads benefit, and a catastrophic interaction between SLUB and some
flash storage is avoided.
shmem_writepage() has always been peculiar in making no attempt to write:
it has just transferred a shmem page from file cache to swap cache, then
let that page make its way around the LRU again before being written and
freed.
The idea was that people use tmpfs because they want those pages to stay
in RAM; so although we give it an overflow to swap, we should resist
writing too soon, giving those pages a second chance before they can be
reclaimed.
That was always questionable, and I've toyed with this patch for years;
but never had a clear justification to depart from the original design.
It became more questionable in 2.6.28, when the split LRU patches classed
shmem and tmpfs pages as SwapBacked rather than as file_cache: that in
itself gives them more resistance to reclaim than normal file pages. I
prepared this patch for 2.6.29, but the merge window arrived before I'd
completed gathering statistics to justify sending it in.
Then while comparing SLQB against SLUB, running SLUB on a laptop I'd
habitually used with SLAB, I found SLUB to run my tmpfs kbuild swapping
tests five times slower than SLAB or SLQB - other machines slower too, but
nowhere near so bad. Simpler "cp -a" swapping tests showed the same.
slub_max_order=0 brings sanity to all, but heavy swapping is too far from
normal to justify such a tuning. The crucial factor on that laptop turns
out to be that I'm using an SD card for swap. What happens is this:
By default, SLUB uses order-2 pages for shmem_inode_cache (and many other
fs inodes), so creating tmpfs files under memory pressure brings lumpy
reclaim into play. One subpage of the order is chosen from the bottom of
the LRU as usual, then the other three picked out from their random
positions on the LRUs.
In a tmpfs load, many of these pages will be ones which already passed
through shmem_writepage, so already have swap allocated. And though their
offsets on swap were probably allocated sequentially, now that the pages
are picked off at random, their swap offsets are scattered.
But the flash storage on the SD card is very sensitive to having its
writes merged: once swap is written at scattered offsets, performance
falls apart. Rotating disk seeks increase too, but less disastrously.
So: stop giving shmem/tmpfs pages a second pass around the LRU, write them
out to swap as soon as their swap has been allocated.
It's surely possible to devise an artificial load which runs faster the
old way, one whose sizing is such that the tmpfs pages on their second
pass are the ones that are wanted again, and other pages not.
But I've not yet found such a load: on all machines, under the loads I've
tried, immediate swap_writepage speeds up shmem swapping: especially when
using the SLUB allocator (and more effectively than slub_max_order=0), but
also with the others; and it also reduces the variance between runs. How
much faster varies widely: a factor of five is rare, 5% is common.
One load which might have suffered: imagine a swapping shmem load in a
limited mem_cgroup on a machine with plenty of memory. Before 2.6.29 the
swapcache was not charged, and such a load would have run quickest with
the shmem swapcache never written to swap. But now swapcache is charged,
so even this load benefits from shmem_writepage directly to swap.
Apologies for the #ifndef CONFIG_SWAP swap_writepage() stub in swap.h:
it's silly because that will never get called; but refactoring shmem.c
sensibly according to CONFIG_SWAP will be a separate task.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
try_to_free_pages() is used for the direct reclaim of up to
SWAP_CLUSTER_MAX pages when watermarks are low. The caller to
alloc_pages_nodemask() can specify a nodemask of nodes that are allowed to
be used but this is not passed to try_to_free_pages(). This can lead to
unnecessary reclaim of pages that are unusable by the caller and int the
worst case lead to allocation failure as progress was not been make where
it is needed.
This patch passes the nodemask used for alloc_pages_nodemask() to
try_to_free_pages().
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of open-coding the lru-list-add pagevec batching when expanding a
file mapping from zero, defer to the appropriate page cache function that
also takes care of adding the page to the lru list.
This is cleaner, saves code and reduces the stack footprint by 16 words
worth of pagevec.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Nick Piggin <npiggin@suse.de>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.com>
Cc: MinChan Kim <minchan.kim@gmail.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a shrinker has a negative number of objects to delete, the symbol
name of the shrinker should be printed, not shrink_slab. This also makes
the error message slightly more informative.
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make CONFIG_UNEVICTABLE_LRU available when CONFIG_MMU=n. There's no logical
reason it shouldn't be available, and it can be used for ramfs.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Greg Ungerer <gerg@snapgear.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Enrik Berkhan <Enrik.Berkhan@ge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The mlock() facility does not exist for NOMMU since all mappings are
effectively locked anyway, so we don't make the bits available when
they're not useful.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Greg Ungerer <gerg@snapgear.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Enrik Berkhan <Enrik.Berkhan@ge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use debug_kmap_atomic in kmap_atomic, kmap_atomic_pfn, and
iomap_atomic_prot_pfn.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
x86 has debug_kmap_atomic_prot() which is error checking function for
kmap_atomic. It is usefull for the other architectures, although it needs
CONFIG_TRACE_IRQFLAGS_SUPPORT.
This patch exposes it to the other architectures.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix warnings and return values in sysfs bin_page_mkwrite(), fixing
fs/sysfs/bin.c: In function `bin_page_mkwrite':
fs/sysfs/bin.c:250: warning: passing argument 2 of `bb->vm_ops->page_mkwrite' from incompatible pointer type
fs/sysfs/bin.c: At top level:
fs/sysfs/bin.c:280: warning: initialization from incompatible pointer type
Expects to have my [PATCH next] sysfs: fix some bin_vm_ops errors
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: "Eric W. Biederman" <ebiederm@aristanetworks.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page_mkwrite is called with neither the page lock nor the ptl held. This
means a page can be concurrently truncated or invalidated out from
underneath it. Callers are supposed to prevent truncate races themselves,
however previously the only thing they can do in case they hit one is to
raise a SIGBUS. A sigbus is wrong for the case that the page has been
invalidated or truncated within i_size (eg. hole punched). Callers may
also have to perform memory allocations in this path, where again, SIGBUS
would be wrong.
The previous patch ("mm: page_mkwrite change prototype to match fault")
made it possible to properly specify errors. Convert the generic buffer.c
code and btrfs to return sane error values (in the case of page removed
from pagecache, VM_FAULT_NOPAGE will cause the fault handler to exit
without doing anything, and the fault will be retried properly).
This fixes core code, and converts btrfs as a template/example. All other
filesystems defining their own page_mkwrite should be fixed in a similar
manner.
Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change the page_mkwrite prototype to take a struct vm_fault, and return
VM_FAULT_xxx flags. There should be no functional change.
This makes it possible to return much more detailed error information to
the VM (and also can provide more information eg. virtual_address to the
driver, which might be important in some special cases).
This is required for a subsequent fix. And will also make it easier to
merge page_mkwrite() with fault() in future.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Artem Bityutskiy <dedekind@infradead.org>
Cc: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On PowerPC we allocate large boot time hashes on node 0. This leads to an
imbalance in the free memory, for example on a 64GB box (4 x 16GB nodes):
Free memory:
Node 0: 97.03%
Node 1: 98.54%
Node 2: 98.42%
Node 3: 98.53%
If we switch to using vmalloc (like ia64 and x86-64) things are more
balanced:
Free memory:
Node 0: 97.53%
Node 1: 98.35%
Node 2: 98.33%
Node 3: 98.33%
For many HPC applications we are limited by the free available memory on
the smallest node, so even though the same amount of memory is used the
better balancing helps.
Since all 64bit NUMA capable architectures should have sufficient vmalloc
space, it makes sense to enable it via CONFIG_64BIT.
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=9838
On i386, HZ=1000, jiffies_to_clock_t() converts time in a somewhat strange
way from the user's point of view:
# echo 500 >/proc/sys/vm/dirty_writeback_centisecs
# cat /proc/sys/vm/dirty_writeback_centisecs
499
So, we have 5000 jiffies converted to only 499 clock ticks and reported
back.
TICK_NSEC = 999848
ACTHZ = 256039
Keeping in-kernel variable in units passed from userspace will fix issue
of course, but this probably won't be right for every sysctl.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CONFIG_DEBUG_PAGEALLOC is now supported by x86, powerpc, sparc64, and
s390. This patch implements it for the rest of the architectures by
filling the pages with poison byte patterns after free_pages() and
verifying the poison patterns before alloc_pages().
This generic one cannot detect invalid page accesses immediately but
invalid read access may cause invalid dereference by poisoned memory and
invalid write access can be detected after a long delay.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I notice there are many places doing copy_from_user() which follows
kmalloc():
dst = kmalloc(len, GFP_KERNEL);
if (!dst)
return -ENOMEM;
if (copy_from_user(dst, src, len)) {
kfree(dst);
return -EFAULT
}
memdup_user() is a wrapper of the above code. With this new function, we
don't have to write 'len' twice, which can lead to typos/mistakes. It
also produces smaller code and kernel text.
A quick grep shows 250+ places where memdup_user() *may* be used. I'll
prepare a patchset to do this conversion.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
chg is unsigned, so it cannot be less than 0.
Also, since region_chg returns long, let vma_needs_reservation() forward
this to alloc_huge_page(). Store it as long as well. all callers cast it
to long anyway.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Johannes Weiner <hannes@saeurebad.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The pagevec_swap_free() at the end of shrink_active_list() was introduced
in 68a22394 "vmscan: free swap space on swap-in/activation" when
shrink_active_list() was still rotating referenced active pages.
In 7e9cd48 "vmscan: fix pagecache reclaim referenced bit check" this was
changed, the rotating removed but the pagevec_swap_free() after the
rotation loop was forgotten, applying now to the pagevec of the
deactivation loop instead.
Now swap space is freed for deactivated pages. And only for those that
happen to be on the pagevec after the deactivation loop.
Complete 7e9cd48 and remove the rest of the swap freeing.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In shrink_active_list() after the deactivation loop, we strip buffer heads
from the potentially remaining pages in the pagevec.
Currently, this drops the zone's lru lock for stripping, only to reacquire
it again afterwards to update statistics.
It is not necessary to strip the pages before updating the stats, so move
the whole thing out of the protected region and save the extra locking.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: MinChan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>