Commit Graph

3064 Commits

Author SHA1 Message Date
Thomas Gleixner
cde898fa80 futex: correctly return -EFAULT not -EINVAL
return -EFAULT not -EINVAL. Found by review.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-12-05 15:46:09 +01:00
Oleg Nesterov
54561783ee lockdep: in_range() fix
Torsten Kaiser wrote:

| static inline int in_range(const void *start, const void *addr, const void *end)
| {
|         return addr >= start && addr <= end;
| }
| This  will return true, if addr is in the range of start (including)
| to end (including).
|
| But debug_check_no_locks_freed() seems does:
| const void *mem_to = mem_from + mem_len
| -> mem_to is the last byte of the freed range, that fits in_range
| lock_from = (void *)hlock->instance;
| -> first byte of the lock
| lock_to = (void *)(hlock->instance + 1);
| -> first byte of the next lock, not last byte of the lock that is being checked!
|
| The test is:
| if (!in_range(mem_from, lock_from, mem_to) &&
|                                         !in_range(mem_from, lock_to, mem_to))
|                         continue;
| So it tests, if the first byte of the lock is in the range that is freed ->OK
| And if the first byte of the *next* lock is in the range that is freed
| -> Not OK.

We can also simplify in_range checks, we need only 2 comparisons, not 4.
If the lock is not in memory range, it should be either at the left of range
or at the right.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2007-12-05 15:46:09 +01:00
Ingo Molnar
856848737b lockdep: fix debug_show_all_locks()
fix the oops that can be seen in:

   http://bugzilla.kernel.org/attachment.cgi?id=13828&action=view

it is not safe to print the locks of running tasks.

(even with this fix we have a small race - but this is a debug
 function after all.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2007-12-05 15:46:09 +01:00
Ingo Molnar
41a2d6cfa3 sched: style cleanups
style cleanup of various changes that were done recently.

no code changed:

      text    data     bss     dec     hex filename
     23680    2542      28   26250    668a sched.o.before
     23680    2542      28   26250    668a sched.o.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-12-05 15:46:09 +01:00
Steven Rostedt
ce6bd420f4 futex: fix for futex_wait signal stack corruption
David Holmes found a bug in the -rt tree with respect to
pthread_cond_timedwait. After trying his test program on the latest git
from mainline, I found the bug was there too.  The bug he was seeing
that his test program showed, was that if one were to do a "Ctrl-Z" on a
process that was in the pthread_cond_timedwait, and then did a "bg" on
that process, it would return with a "-ETIMEDOUT" but early. That is,
the timer would go off early.

Looking into this, I found the source of the problem. And it is a rather
nasty bug at that.

Here's the relevant code from kernel/futex.c: (not in order in the file)

[...]
smlinkage long sys_futex(u32 __user *uaddr, int op, u32 val,
                          struct timespec __user *utime, u32 __user *uaddr2,
                          u32 val3)
{
        struct timespec ts;
        ktime_t t, *tp = NULL;
        u32 val2 = 0;
        int cmd = op & FUTEX_CMD_MASK;

        if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI)) {
                if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
                        return -EFAULT;
                if (!timespec_valid(&ts))
                        return -EINVAL;

                t = timespec_to_ktime(ts);
                if (cmd == FUTEX_WAIT)
                        t = ktime_add(ktime_get(), t);
                tp = &t;
        }
[...]
        return do_futex(uaddr, op, val, tp, uaddr2, val2, val3);
}

[...]

long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
                u32 __user *uaddr2, u32 val2, u32 val3)
{
        int ret;
        int cmd = op & FUTEX_CMD_MASK;
        struct rw_semaphore *fshared = NULL;

        if (!(op & FUTEX_PRIVATE_FLAG))
                fshared = &current->mm->mmap_sem;

        switch (cmd) {
        case FUTEX_WAIT:
                ret = futex_wait(uaddr, fshared, val, timeout);

[...]

static int futex_wait(u32 __user *uaddr, struct rw_semaphore *fshared,
                      u32 val, ktime_t *abs_time)
{
[...]
               struct restart_block *restart;
                restart = &current_thread_info()->restart_block;
                restart->fn = futex_wait_restart;
                restart->arg0 = (unsigned long)uaddr;
                restart->arg1 = (unsigned long)val;
                restart->arg2 = (unsigned long)abs_time;
                restart->arg3 = 0;
                if (fshared)
                        restart->arg3 |= ARG3_SHARED;
                return -ERESTART_RESTARTBLOCK;
[...]

static long futex_wait_restart(struct restart_block *restart)
{
        u32 __user *uaddr = (u32 __user *)restart->arg0;
        u32 val = (u32)restart->arg1;
        ktime_t *abs_time = (ktime_t *)restart->arg2;
        struct rw_semaphore *fshared = NULL;

        restart->fn = do_no_restart_syscall;
        if (restart->arg3 & ARG3_SHARED)
                fshared = &current->mm->mmap_sem;
        return (long)futex_wait(uaddr, fshared, val, abs_time);
}

So when the futex_wait is interrupt by a signal we break out of the
hrtimer code and set up or return from signal. This code does not return
back to userspace, so we set up a RESTARTBLOCK.  The bug here is that we
save the "abs_time" which is a pointer to the stack variable "ktime_t t"
from sys_futex.

This returns and unwinds the stack before we get to call our signal. On
return from the signal we go to futex_wait_restart, where we update all
the parameters for futex_wait and call it. But here we have a problem
where abs_time is no longer valid.

I verified this with print statements, and sure enough, what abs_time
was set to ends up being garbage when we get to futex_wait_restart.

The solution I did to solve this (with input from Linus Torvalds)
was to add unions to the restart_block to allow system calls to
use the restart with specific parameters.  This way the futex code now
saves the time in a 64bit value in the restart block instead of storing
it on the stack.

Note: I'm a bit nervious to add "linux/types.h" and use u32 and u64
in thread_info.h, when there's a #ifdef __KERNEL__ just below that.
Not sure what that is there for.  If this turns out to be a problem, I've
tested this with using "unsigned int" for u32 and "unsigned long long" for
u64 and it worked just the same. I'm using u32 and u64 just to be
consistent with what the futex code uses.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-05 15:46:09 +01:00
David S. Miller
874a5f87f5 [SYSCTL_CHECK]: Fix typo in KERN_SPARC_SCONS_PWROFF entry string.
Based upon a report by Mikael Pettersson.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-12-05 05:37:56 -08:00
Ingo Molnar
db292ca302 sched: default to more agressive yield for SCHED_BATCH tasks
do more agressive yield for SCHED_BATCH tuned tasks: they are all
about throughput anyway. This allows a gentler migration path for
any apps that relied on stronger yield.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-12-04 17:04:39 +01:00
Ingo Molnar
77034937dc sched: fix crash in sys_sched_rr_get_interval()
Luiz Fernando N. Capitulino reported that sched_rr_get_interval()
crashes for SCHED_OTHER tasks that are on an idle runqueue.

The fix is to return a 0 timeslice for tasks that are on an idle
runqueue. (and which are not running, obviously)

this also shrinks the code a bit:

   text    data     bss     dec     hex filename
  47903    3934     336   52173    cbcd sched.o.before
  47885    3934     336   52155    cbbb sched.o.after

Reported-by: Luiz Fernando N. Capitulino <lcapitulino@mandriva.com.br>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-12-04 17:04:39 +01:00
Linus Torvalds
ca6435f188 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched
* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched:
  sched: cpu accounting controller (V2)
2007-12-03 08:21:06 -08:00
Al Viro
b00296fb78 uml: add !UML dependencies
The previous commit ("uml: keep UML Kconfig in sync with x86") is not
enough, unfortunately.  If we go that way, we need to add dependencies
on !UML for several options.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Dike <jdike@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-03 08:13:17 -08:00
Srivatsa Vaddagiri
d842de871c sched: cpu accounting controller (V2)
Commit cfb5285660 removed a useful feature for
us, which provided a cpu accounting resource controller.  This feature would be
useful if someone wants to group tasks only for accounting purpose and doesnt
really want to exercise any control over their cpu consumption.

The patch below reintroduces the feature. It is based on Paul Menage's
original patch (Commit 62d0df6406), with
these differences:

        - Removed load average information. I felt it needs more thought (esp
	  to deal with SMP and virtualized platforms) and can be added for
	  2.6.25 after more discussions.
        - Convert group cpu usage to be nanosecond accurate (as rest of the cfs
	  stats are) and invoke cpuacct_charge() from the respective scheduler
	  classes
	- Make accounting scalable on SMP systems by splitting the usage
	  counter to be per-cpu
	- Move the code from kernel/cpu_acct.c to kernel/sched.c (since the
	  code is not big enough to warrant a new file and also this rightly
	  needs to live inside the scheduler. Also things like accessing
	  rq->lock while reading cpu usage becomes easier if the code lived in
	  kernel/sched.c)

The patch also modifies the cpu controller not to provide the same accounting
information.

Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com>

 Tested the patches on top of 2.6.24-rc3. The patches work fine. Ran
 some simple tests like cpuspin (spin on the cpu), ran several tasks in
 the same group and timed them. Compared their time stamps with
 cpuacct.usage.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-12-02 20:04:49 +01:00
Scott James Remnant
e6ceb32aa2 wait_task_stopped(): pass correct exit_code to wait_noreap_copyout()
In wait_task_stopped() exit_code already contains the right value for the
si_status member of siginfo, and this is simply set in the non WNOWAIT
case.

If you call waitid() with a stopped or traced process, you'll get the signal
in siginfo.si_status as expected -- however if you call waitid(WNOWAIT) at the
same time, you'll get the signal << 8 | 0x7f

Pass it unchanged to wait_noreap_copyout(); we would only need to shift it
and add 0x7f if we were returning it in the user status field and that
isn't used for any function that permits WNOWAIT.

Signed-off-by: Scott James Remnant <scott@ubuntu.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-29 09:24:55 -08:00
David Howells
9e6c1e6333 FRV: fix the extern declaration of kallsyms_num_syms
Fix the extern declaration of kallsyms_num_syms to indicate that the symbol
does not reside in the small-data storage space, and so may not be accessed
relative to the small data base register.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-29 09:24:54 -08:00
Pavel Emelyanov
32df81cbd5 Isolate the UTS namespace's domainname and hostname back
Commit 7d69a1f4a7 ("remove CONFIG_UTS_NS
and CONFIG_IPC_NS") by Cedric Le Goater accidentally removed the code
that prevented the uts->hostname and uts->domainname values from being
overwritten from another namespace.

In other words, setting hostname/domainname via sysfs (echo xxx >
/proc/sys/kernel/(host|domain)name) cased the new value to be set in
init UTS namespace only.

Return the isolation back.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Cedric Le Goater <clg@fr.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-29 09:24:53 -08:00
Oleg Nesterov
c895078355 wait_task_stopped(): don't use task_pid_nr_ns() lockless
wait_task_stopped(WNOWAIT) does task_pid_nr_ns() without tasklist/rcu lock,
we can read an already freed memory.  Use the cached pid_t value.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Looks-good-to: Roland McGrath <roland@redhat.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-29 09:24:52 -08:00
Ingo Molnar
f95e0d1c2a sched: clean up kernel/sched_stat.h
clean up kernel/sched_stat.h.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-28 15:52:56 +01:00
Ingo Molnar
c1a89740da sched: clean up overlong line in kernel/sched_debug.c
clean up overlong line in kernel/sched_debug.c.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-28 15:52:56 +01:00
Ingo Molnar
deaf2227dd sched: clean up, move __sched_text_start/end to sched.h
move __sched_text_start/end to sched.h. No code changed:

   text    data     bss     dec     hex filename
  26582    2310      28   28920    70f8 sched.o.before
  26582    2310      28   28920    70f8 sched.o.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-28 15:52:56 +01:00
Ingo Molnar
9a4e715914 sched: clean up sd_alloc_ctl_cpu_table() definition
clean up sd_alloc_ctl_cpu_table() definition.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-28 15:52:56 +01:00
Thomas Gleixner
d393820446 softlockup: fix false positives on CONFIG_NOHZ
David Miller reported soft lockup false-positives that trigger
on NOHZ due to CPUs idling for more than 10 seconds.

The solution is touch the softlockup watchdog when we return from
idle. (by definition we are not 'locked up' when we were idle)

 http://bugzilla.kernel.org/show_bug.cgi?id=9409

Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-28 15:52:56 +01:00
Linus Torvalds
8c27eba549 Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/net-2.6: (41 commits)
  [XFRM]: Fix leak of expired xfrm_states
  [ATM]: [he] initialize lock and tasklet earlier
  [IPV4]: Remove bogus ifdef mess in arp_process
  [SKBUFF]: Free old skb properly in skb_morph
  [IPV4]: Fix memory leak in inet_hashtables.h when NUMA is on
  [IPSEC]: Temporarily remove locks around copying of non-atomic fields
  [TCP] MTUprobe: Cleanup send queue check (no need to loop)
  [TCP]: MTUprobe: receiver window & data available checks fixed
  [MAINTAINERS]: tlan list is subscribers-only
  [SUNRPC]: Remove SPIN_LOCK_UNLOCKED
  [SUNRPC]: Make xprtsock.c:xs_setup_{udp,tcp}() static
  [PFKEY]: Sending an SADB_GET responds with an SADB_GET
  [IRDA]: Compilation for CONFIG_INET=n case
  [IPVS]: Fix compiler warning about unused register_ip_vs_protocol
  [ARP]: Fix arp reply when sender ip 0
  [IPV6] TCPMD5: Fix deleting key operation.
  [IPV6] TCPMD5: Check return value of tcp_alloc_md5sig_pool().
  [IPV4] TCPMD5: Use memmove() instead of memcpy() because we have overlaps.
  [IPV4] TCPMD5: Omit redundant NULL check for kfree() argument.
  ieee80211: Stop net_ratelimit/IEEE80211_DEBUG_DROP log pollution
  ...
2007-11-26 20:09:07 -08:00
Linus Torvalds
0685ab4fb8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched
* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched:
  sched: bump version of kernel/sched_debug.c
  sched: fix minimum granularity tunings
  sched: fix RLIMIT_CPU comment
  sched: fix kernel/acct.c comment
  sched: fix prev_stime calculation
  sched: don't forget to unlock uids_mutex on error paths
2007-11-26 19:42:08 -08:00
Linus Torvalds
ff1ea52fa3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86
* git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86:
  x86: fix APIC related bootup crash on Athlon XP CPUs
  time: add ADJ_OFFSET_SS_READ
  x86: export the symbol empty_zero_page on the 32-bit x86 architecture
  x86: fix kprobes_64.c inlining borkage
  pci: use pci=bfsort for HP DL385 G2, DL585 G2
  x86: correctly set UTS_MACHINE for "make ARCH=x86"
  lockdep: annotate do_debug() trap handler
  x86: turn off iommu merge by default
  x86: fix ACPI compile for LOCAL_APIC=n
  x86: printk kernel version in WARN_ON and other dump_stack users
  ACPI: Set max_cstate to 1 for early Opterons.
  x86: fix NMI watchdog & 'stopped time' problem
2007-11-26 19:41:28 -08:00
Linus Torvalds
f4d53cedce Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
  virtio: fix net driver loop case where we fail to restart
  module: fix and elaborate comments
  virtio: fix module/device unloading
  lguest: Fix uninitialized members in example launcher
2007-11-26 19:17:19 -08:00
Ingo Molnar
f7b9329e55 sched: bump version of kernel/sched_debug.c
bump version of kernel/sched_debug.c and remove CFS version
information from it.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-26 21:21:49 +01:00
Zou Nan hai
722aab0c3b sched: fix minimum granularity tunings
increase the default minimum granularity some more - this gives us
more performance in aim7 benchmarks.

also correct some comments: we scale with ilog(ncpus) + 1.

Signed-off-by: Zou Nan hai <nanhai.zou@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-26 21:21:49 +01:00
Ingo Molnar
bcbe4a0766 sched: fix kernel/acct.c comment
fix kernel/acct.c comment.

noticed by Lin Tan. Comment suggested by Olaf Kirch.

also see:

  http://bugzilla.kernel.org/show_bug.cgi?id=8220

Reported-by: tammy000@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-26 21:21:49 +01:00
Pavel Emelyanov
5e8869bb69 sched: don't forget to unlock uids_mutex on error paths
The commit

 commit 5cb350baf5
 Author: Dhaval Giani <dhaval@linux.vnet.ibm.com>
 Date:   Mon Oct 15 17:00:14 2007 +0200

    sched: group scheduling, sysfs tunables

introduced the uids_mutex and the helpers to lock/unlock it.
Unfortunately, the error paths of alloc_uid() were not patched
to unlock it.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-26 21:21:49 +01:00
John Stultz
52bfb36050 time: add ADJ_OFFSET_SS_READ
Michael Kerrisk reported that a long standing bug in the adjtimex()
system call causes glibc's adjtime(3) function to deliver the wrong
results if 'delta' is NULL.

add the ADJ_OFFSET_SS_READ API detail, which will be used by glibc
to fix this API compatibility bug.

Also see: http://bugzilla.kernel.org/show_bug.cgi?id=6761

[ mingo@elte.hu: added patch description and made it backwards compatible ]

NOTE: the new flag is defined 0xa001 so that it returns -EINVAL on
older kernels - this way glibc can use it safely. Suggested by Ulrich
Drepper.

Acked-by: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2007-11-26 20:42:19 +01:00
Heiko Carstens
37e3a6ac5a [S390] appldata: remove unused binary sysctls.
Remove binary sysctls that never worked due to missing strategy functions.

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <geraldsc@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2007-11-20 11:13:45 +01:00
Heiko Carstens
43ebbf119a [S390] cmm: remove unused binary sysctls.
Remove binary sysctls that never worked due to missing strategy functions.

Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2007-11-20 11:13:45 +01:00
Simon Horman
9055fa1f3d [IPVS]: Move remaining sysctl handlers over to CTL_UNNUMBERED
Switch the remaining IPVS sysctl entries over to to use CTL_UNNUMBERED,
I stronly doubt that anyone is using the sys_sysctl interface to
these variables.

Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-19 21:51:13 -08:00
Simon Horman
9e103fa6bd [IPVS]: Fix sysctl warnings about missing strategy in schedulers
sysctl table check failed: /net/ipv4/vs/lblc_expiration .3.5.21.19 Missing strategy
[...]
sysctl table check failed: /net/ipv4/vs/lblcr_expiration .3.5.21.20 Missing strategy

Switch these entried over to use CTL_UNNUMBERED as clearly
the sys_syscal portion wasn't working.

This is along the same lines as Christian Borntraeger's patch that fixes
up entries with no stratergy in net/ipv4/ipvs/ip_vs_ctl.c

Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-19 21:50:21 -08:00
Christian Borntraeger
611cd55b15 [IPVS]: Fix sysctl warnings about missing strategy
Running the latest git code I get the following messages during boot:
sysctl table check failed: /net/ipv4/vs/drop_entry .3.5.21.4 Missing strategy
[...]		  
sysctl table check failed: /net/ipv4/vs/drop_packet .3.5.21.5 Missing strategy
[...]
sysctl table check failed: /net/ipv4/vs/secure_tcp .3.5.21.6 Missing strategy
[...]
sysctl table check failed: /net/ipv4/vs/sync_threshold .3.5.21.24 Missing strategy

I removed the binary sysctl handler for those messages and also removed
the definitions in ip_vs.h. The alternative would be to implement a 
proper strategy handler, but syscall sysctl is deprecated.

There are other sysctl definitions that are commented out or work with 
the default sysctl_data strategy. I did not touch these. 

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-19 21:49:25 -08:00
Matti Linnanvuori
9a4b9708f1 module: fix and elaborate comments
Fix and elaborate comments.

Signed-off-by: Matti Linnanvuori <mattilinnanvuori@yahoo.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2007-11-19 11:20:43 +11:00
David P. Reed
fa6a1a554b ntp: fix typo that makes sync_cmos_clock erratic
Fix a typo in ntp.c that has caused updating of the persistent (RTC)
clock when synced to NTP to behave erratically.

When debugging a freeze that arises on my AMD64 machines when I
run the ntpd service, I added a number of printk's to monitor the
sync_cmos_clock procedure.  I discovered that it was not syncing to
cmos RTC every 11 minutes as documented, but instead would keep trying
every second for hours at a time.  The reason turned out to be a typo
in sync_cmos_clock, where it attempts to ensure that
update_persistent_clock is called very close to 500 msec. after a 1
second boundary (required by the PC RTC's spec). That typo referred to
"xtime" in one spot, rather than "now", which is derived from "xtime"
but not equal to it.  This makes the test erratic, creating a
"coin-flip" that decides when update_persistent_clock is called - when
it is called, which is rarely, it may be at any time during the one
second period, rather than close to 500 msec, so the value written is
needlessly incorrect, too.

Signed-off-by: David P. Reed
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2007-11-17 16:27:01 +01:00
Ingo Molnar
4307d1e5ad x86: ignore the sys_getcpu() tcache parameter
dont use the vgetcpu tcache - it's causing problems with tasks
migrating, they'll see the old cache up to a jiffy after the
migration, further increasing the costs of the migration.

In the worst case they see a complete bogus information from
the tcache, when a sys_getcpu() call "invalidated" the cache
info by incrementing the jiffies _and_ the cpuid info in the
cache and the following vdso_getcpu() call happens after
vdso_jiffies have been incremented.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2007-11-17 16:27:00 +01:00
Linus Torvalds
3c72f526df Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched
* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched:
  sched: reorder SCHED_FEAT_ bits
  sched: make sched_nr_latency static
  sched: remove activate_idle_task()
  sched: fix __set_task_cpu() SMP race
  sched: fix SCHED_FIFO tasks & FAIR_GROUP_SCHED
  sched: fix accounting of interrupts during guest execution on s390
2007-11-15 12:14:52 -08:00
Ingo Molnar
9612633a21 sched: reorder SCHED_FEAT_ bits
reorder SCHED_FEAT_ bits so that the used ones come first. Makes
tuning instructions easier.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-15 20:57:40 +01:00
Adrian Bunk
518b22e990 sched: make sched_nr_latency static
sched_nr_latency can now become static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-15 20:57:40 +01:00
Dmitry Adamushko
94bc9a7bd9 sched: remove activate_idle_task()
cpu_down() code is ok wrt sched_idle_next() placing the 'idle' task not
at the beginning of the queue.

So get rid of activate_idle_task() and make use of activate_task() instead.
It is the same as activate_task(), except for the update_rq_clock(rq) call
that is redundant.

Code size goes down:

   text    data     bss     dec     hex filename
  47853    3934     336   52123    cb9b sched.o.before
  47828    3934     336   52098    cb82 sched.o.after

Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-15 20:57:40 +01:00
Dmitry Adamushko
ce96b5ac74 sched: fix __set_task_cpu() SMP race
Grant Wilson has reported rare SCHED_FAIR_USER crashes on his quad-core
system, which crashes can only be explained via runqueue corruption.

there is a narrow SMP race in __set_task_cpu(): after ->cpu is set up to
a new value, task_rq_lock(p, ...) can be successfuly executed on another
CPU. We must ensure that updates of per-task data have been completed by
this moment.

this bug has been hiding in the Linux scheduler for an eternity (we never
had any explicit barrier for task->cpu in set_task_cpu() - so the bug was
introduced in 2.5.1), but only became visible via set_task_cfs_rq() being
accidentally put after the task->cpu update. It also probably needs a
sufficiently out-of-order CPU to trigger.

Reported-by: Grant Wilson <grant.wilson@zen.co.uk>
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-15 20:57:40 +01:00
Oleg Nesterov
dae51f5620 sched: fix SCHED_FIFO tasks & FAIR_GROUP_SCHED
Suppose that the SCHED_FIFO task does

	switch_uid(new_user);

Now, p->se.cfs_rq and p->se.parent both point into the old
user_struct->tg because sched_move_task() doesn't call set_task_cfs_rq()
for !fair_sched_class case.

Suppose that old user_struct/task_group is freed/reused, and the task
does

	sched_setscheduler(SCHED_NORMAL);

__setscheduler() sets fair_sched_class, but doesn't update
->se.cfs_rq/parent which point to the freed memory.

This means that check_preempt_wakeup() doing

		while (!is_same_group(se, pse)) {
			se = parent_entity(se);
			pse = parent_entity(pse);
		}

may OOPS in a similar way if rq->curr or p did something like above.

Perhaps we need something like the patch below, note that
__setscheduler() can't do set_task_cfs_rq().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-15 20:57:40 +01:00
Christian Borntraeger
9778385db3 sched: fix accounting of interrupts during guest execution on s390
Currently the scheduler checks for PF_VCPU to decide if this timeslice
has to be accounted as guest time. On s390 host interrupts are not
disabled during guest execution. This causes theses interrupts to be
accounted as guest time if CONFIG_VIRT_CPU_ACCOUNTING is set. Solution
is to check if an interrupt triggered account_system_time. As the tick
is timer interrupt based, we have to subtract hardirq_offset.

I tested the patch on s390 with CONFIG_VIRT_CPU_ACCOUNTING and on
x86_64. Seems to work.

CC: Avi Kivity <avi@qumranet.com>
CC: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-15 20:57:39 +01:00
Roland McGrath
a3474224e6 wait_task_stopped: Check p->exit_state instead of TASK_TRACED
The original meaning of the old test (p->state > TASK_STOPPED) was
"not dead", since it was before TASK_TRACED existed and before the
state/exit_state split.  It was a wrong correction in commit
14bf01bb05 to make this test for
TASK_TRACED instead.  It should have been changed when TASK_TRACED
was introducted and again when exit_state was introduced.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Kees Cook <kees@ubuntu.com>
Acked-by: Scott James Remnant <scott@ubuntu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-15 08:36:27 -08:00
Linus Torvalds
6f37ac793d Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
  [NET]: rt_check_expire() can take a long time, add a cond_resched()
  [ISDN] sc: Really, really fix warning
  [ISDN] sc: Fix sndpkt to have the correct number of arguments
  [TCP] FRTO: Clear frto_highmark only after process_frto that uses it
  [NET]: Remove notifier block from chain when register_netdevice_notifier fails
  [FS_ENET]: Fix module build.
  [TCP]: Make sure write_queue_from does not begin with NULL ptr
  [TCP]: Fix size calculation in sk_stream_alloc_pskb
  [S2IO]: Fixed memory leak when MSI-X vector allocation fails
  [BONDING]: Fix resource use after free
  [SYSCTL]: Fix warning for token-ring from sysctl checker
  [NET] random : secure_tcp_sequence_number should not assume CONFIG_KTIME_SCALAR
  [IWLWIFI]: Not correctly dealing with hotunplug.
  [TCP] FRTO: Plug potential LOST-bit leak
  [TCP] FRTO: Limit snd_cwnd if TCP was application limited
  [E1000]: Fix schedule while atomic when called from mii-tool.
  [NETX]: Fix build failure added by 2.6.24 statistics cleanup.
  [EP93xx_ETH]: Build fix after 2.6.24 NAPI changes.
  [PKT_SCHED]: Check subqueue status before calling hard_start_xmit
2007-11-14 18:51:48 -08:00
Adrian Bunk
f96159840b kernel/taskstats.c: fix bogus nlmsg_free()
We'd better not nlmsg_free on a pointer containing an undefined value
(and without having anything allocated).

Spotted by the Coverity checker.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-14 18:45:44 -08:00
Johannes Berg
60a0d23386 hibernate: fix lockdep report
Lockdep reports a circular locking dependency in the hibernate code
because
 - during system boot hibernate code (from an initcall) locks pm_mutex
   and then a sysfs buffer mutex via name_to_dev_t
 - during regular operation hibernate code locks pm_mutex under a
   sysfs buffer mutex because it's called from sysfs methods.

The deadlock can never happen because during initcall invocation nothing
can write to sysfs yet. This removes the lockdep report by marking the
initcall locking as being in a different class.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-14 18:45:43 -08:00
Russ Anderson
c642b8391c __do_IRQ does not check IRQ_DISABLED when IRQ_PER_CPU is set
In __do_IRQ(), the normal case is that IRQ_DISABLED is checked and if set
the handler (handle_IRQ_event()) is not called.

Earlier in __do_IRQ(), if IRQ_PER_CPU is set the code does not check
IRQ_DISABLED and calls the handler even though IRQ_DISABLED is set.  This
behavior seems unintentional.

One user encountering this behavior is the CPE handler (in
arch/ia64/kernel/mca.c).  When the CPE handler encounters too many CPEs
(such as a solid single bit error), it sets up a polling timer and disables
the CPE interrupt (to avoid excessive overhead logging the stream of single
bit errors).  disable_irq_nosync() is called which sets IRQ_DISABLED.  The
IRQ_PER_CPU flag was previously set (in ia64_mca_late_init()).  The net
result is the CPE handler gets called even though it is marked disabled.

If the behavior of not checking IRQ_DISABLED when IRQ_PER_CPU is set is
intentional, it would be worthy of a comment describing the intended
behavior.  disable_irq_nosync() does call chip->disable() to provide a
chipset specifiec interface for disabling the interrupt, which avoids this
issue when used.

Signed-off-by: Russ Anderson <rja@sgi.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-14 18:45:43 -08:00
Eric W. Biederman
57d5f66b86 pidns: Place under CONFIG_EXPERIMENTAL
This is my trivial patch to swat innumerable little bugs with a single
blow.

After some intensive review (my apologies for not having gotten to this
sooner) what we have looks like a good base to build on with the current
pid namespace code but it is not complete, and it is still much to simple
to find issues where the kernel does the wrong thing outside of the initial
pid namespace.

Until the dust settles and we are certain we have the ABI and the
implementation is as correct as humanly possible let's keep process ID
namespaces behind CONFIG_EXPERIMENTAL.

Allowing us the option of fixing any ABI or other bugs we find as long as
they are minor.

Allowing users of the kernel to avoid those bugs simply by ensuring their
kernel does not have support for multiple pid namespaces.

[akpm@linux-foundation.org: coding-style cleanups]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Adrian Bunk <bunk@kernel.org>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Kir Kolyshkin <kir@swsoft.com>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-14 18:45:43 -08:00
Jan Kiszka
22800a2830 fix param_sysfs_builtin name length check
Commit faf8c714f4 caused a regression:
parameter names longer than MAX_KBUILD_MODNAME will now be rejected,
although we just need to keep the module name part that short.  This patch
restores the old behaviour while still avoiding that memchr is called with
its length parameter larger than the total string length.

Signed-off-by: Jan Kiszka <jan.kiszka@web.de>
Cc: Dave Young <hidave.darkstar@gmail.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-14 18:45:42 -08:00
Mathieu Desnoyers
314de8a9e1 Linux Kernel Markers: fix marker mutex not taken upon module load
Upon module load, we must take the markers mutex.  It implies that the marker
mutex must be nested inside the module mutex.

It implies changing the nesting order : now the marker mutex nests inside the
module mutex.  Make the necessary changes to reverse the order in which the
mutexes are taken.

Includes some cleanup from Dave Hansen <haveblue@us.ibm.com>.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-14 18:45:40 -08:00
Andrew Morton
cfb5285660 revert "Task Control Groups: example CPU accounting subsystem"
Revert 62d0df6406.

This was originally intended as a simple initial example of how to create a
control groups subsystem; it wasn't intended for mainline, but I didn't make
this clear enough to Andrew.

The CFS cgroup subsystem now has better functionality for the per-cgroup usage
accounting (based directly on CFS stats) than the "usage" status file in this
patch, and the "load" status file is rather simplistic - although having a
per-cgroup load average report would be a useful feature, I don't believe this
patch actually provides it.  If it gets into the final 2.6.24 we'd probably
have to support this interface for ever.

Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-14 18:45:40 -08:00
Yasunori Goto
887c3cb188 Add IORESOUCE_BUSY flag for System RAM
i386 and x86-64 registers System RAM as IORESOURCE_MEM | IORESOURCE_BUSY.

But ia64 registers it as IORESOURCE_MEM only.
In addition, memory hotplug code registers new memory as IORESOURCE_MEM too.

This difference causes a failure of memory unplug of x86-64.  This patch
fixes it.

This patch adds IORESOURCE_BUSY to avoid potential overlap mapping by PCI
device.

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Luck, Tony" <tony.luck@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-14 18:45:39 -08:00
Diego Calleja
cfe36bde59 Improve cgroup printks
When I boot with the 'quiet' parameter, I see on the screen:

[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[   39.036026] Initializing cgroup subsys cpuacct
[   39.036080] Initializing cgroup subsys debug
[   39.036118] Initializing cgroup subsys ns

This patch lowers the priority of those messages, adds a "cgroup: " prefix
to another couple of printks and kills the useless reference to the source
file.

Signed-off-by: Diego Calleja <diegocg@gmail.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-14 18:45:37 -08:00
Tetsuo Handa
6fc48af82c sysctl: check length at deprecated_sysctl_warning
Original patch assumed args->nlen < CTL_MAXNAME, but it can be false.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-14 18:45:37 -08:00
Olof Johansson
ce1d18e006 [SYSCTL]: Fix warning for token-ring from sysctl checker
As seen when booting ppc64_defconfig:

sysctl table check failed: /net/token-ring .3.14 procname does not match binary path procname

Signed-off-by: Olof Johansson <olof@lixom.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-13 21:15:24 -08:00
Roland McGrath
325d22df7b sigwait eats blocked default-ignore signals
While a signal is blocked, it must be posted even if its action is
SIG_IGN or is SIG_DFL with the default action to ignore.  This works
right most of the time, but is broken when a sigwait (rt_sigtimedwait)
is in progress.  This changes the early-discard check to respect
real_blocked.  ~blocked is the set to check for "should wake up now",
but ~(blocked|real_blocked) is the set for "blocked" semantics as
defined by POSIX.

This fixes bugzilla entry 9347, see

	http://bugzilla.kernel.org/show_bug.cgi?id=9347

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-12 16:05:23 -08:00
David Miller
3c5fd9c77d [FUTEX] Fix address computation in compat code.
compat_exit_robust_list() computes a pointer to the
futex entry in userspace as follows:

	(void __user *)entry + futex_offset

'entry' is a 'struct robust_list __user *', and
'futex_offset' is a 'compat_long_t' (typically a 's32').

Things explode if the 32-bit sign bit is set in futex_offset.

Type promotion sign extends futex_offset to a 64-bit value before
adding it to 'entry'.

This triggered a problem on sparc64 running 32-bit applications which
would lock up a cpu looping forever in the fault handling for the
userspace load in handle_futex_death().

Compat userspace runs with address masking (wherein the cpu zeros out
the top 32-bits of every effective address given to a memory operation
instruction) so the sparc64 fault handler accounts for this by
zero'ing out the top 32-bits of the fault address too.

Since the kernel properly uses the compat_uptr interfaces, kernel side
accesses to compat userspace work too since they will only use
addresses with the top 32-bit clear.

Because of this compat futex layer bug we get into the following loop
when executing the get_user() load near the top of handle_futex_death():

1) load from address '0xfffffffff7f16bd8', FAULT
2) fault handler clears upper 32-bits, processes fault
   for address '0xf7f16bd8' which succeeds
3) goto #1

I want to thank Bernd Zeimetz, Josip Rodin, and Fabio Massimo Di Nitto
for their tireless efforts helping me track down this bug.

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-09 16:13:08 -08:00
Adrian Bunk
e6fe6649b4 sched: proper prototype for kernel/sched.c:migration_init()
This patch adds a proper prototype for migration_init() in
include/linux/sched.h

Since there's no point in always returning 0 to a caller that doesn't check
the return value it also changes the function to return void.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-09 22:39:39 +01:00
Peter Zijlstra
b82d9fdd84 sched: avoid large irq-latencies in smp-balancing
SMP balancing is done with IRQs disabled and can iterate the full rq.
When rqs are large this can cause large irq-latencies. Limit the nr of
iterations on each run.

This fixes a scheduling latency regression reported by the -rt folks.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-09 22:39:39 +01:00
Srivatsa Vaddagiri
3c90e6e99b sched: fix copy_namespace() <-> sched_fork() dependency in do_fork
Sukadev Bhattiprolu reported a kernel crash with control groups.
There are couple of problems discovered by Suka's test:

- The test requires the cgroup filesystem to be mounted with
  atleast the cpu and ns options (i.e both namespace and cpu 
  controllers are active in the same hierarchy). 

	# mkdir /dev/cpuctl
	# mount -t cgroup -ocpu,ns none cpuctl
	(or simply)
	# mount -t cgroup none cpuctl -> Will activate all controllers
					 in same hierarchy.

- The test invokes clone() with CLONE_NEWNS set. This causes a a new child
  to be created, also a new group (do_fork->copy_namespaces->ns_cgroup_clone->
  cgroup_clone) and the child is attached to the new group (cgroup_clone->
  attach_task->sched_move_task). At this point in time, the child's scheduler 
  related fields are uninitialized (including its on_rq field, which it has
  inherited from parent). As a result sched_move_task thinks its on
  runqueue, when it isn't.

  As a solution to this problem, I moved sched_fork() call, which
  initializes scheduler related fields on a new task, before
  copy_namespaces(). I am not sure though whether moving up will
  cause other side-effects. Do you see any issue?

- The second problem exposed by this test is that task_new_fair()
  assumes that parent and child will be part of the same group (which 
  needn't be as this test shows). As a result, cfs_rq->curr can be NULL
  for the child.

  The solution is to test for curr pointer being NULL in
  task_new_fair().

With the patch below, I could run ns_exec() fine w/o a crash.

Reported-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-09 22:39:39 +01:00
Ingo Molnar
502d26b524 sched: clean up the wakeup preempt check, #2
clean up the preemption check to not use unnecessary 64-bit
variables. This improves code size:

   text    data     bss     dec     hex filename
  44227    3326      36   47589    b9e5 sched.o.before
  44201    3326      36   47563    b9cb sched.o.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-09 22:39:39 +01:00
Ingo Molnar
77d9cc44b5 sched: clean up the wakeup preempt check
clean up the wakeup preemption check. No code changed:

   text    data     bss     dec     hex filename
  44227    3326      36   47589    b9e5 sched.o.before
  44227    3326      36   47589    b9e5 sched.o.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-09 22:39:39 +01:00
Ingo Molnar
8bc6767acb sched: wakeup preemption fix
wakeup preemption fix: do not make it dependent on p->prio.
Preemption purely depends on ->vruntime.

This improves preemption in mixed-nice-level workloads.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-09 22:39:39 +01:00
Ingo Molnar
3e3e13f399 sched: remove PREEMPT_RESTRICT
remove PREEMPT_RESTRICT. (this is a separate commit so that any
regression related to the removal itself is bisectable)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-09 22:39:39 +01:00
Ingo Molnar
52d3da1ad4 sched: turn off PREEMPT_RESTRICT
PREEMPT_RESTRICT was a method aimed at reducing the amount of wakeup
related preemption. It has a disadvantage though, it can prevent
legitimate wakeups if a task is 'unlucky' to be hit too early by a tick
that clears peer_preempt.

Now that the wakeup preemption has been cleaned up we dont seem to have
excessive preemptions anymore, so this feature can be turned off. (and
removed in the next patch)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-09 22:39:39 +01:00
Eric Dumazet
d6322faf29 sched: cleanup, use NSEC_PER_MSEC and NSEC_PER_SEC
1) hardcoded 1000000000 value is used five times in places where
   NSEC_PER_SEC might be more readable.

2) A conversion from nsec to msec uses the hardcoded 1000000 value,
   which is a candidate for NSEC_PER_MSEC.

no code changed:

    text    data     bss     dec     hex filename
   44359    3326      36   47721    ba69 sched.o.before
   44359    3326      36   47721    ba69 sched.o.after

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-09 22:39:38 +01:00
Ingo Molnar
19978ca610 sched: reintroduce SMP tunings again
Yanmin Zhang reported an aim7 regression and bisected it down to:

 |  commit 38ad464d41
 |  Author: Ingo Molnar <mingo@elte.hu>
 |  Date:   Mon Oct 15 17:00:02 2007 +0200
 |
 |     sched: uniform tunings
 |
 |     use the same defaults on both UP and SMP.

fix this by reintroducing similar SMP tunings again. This resolves
the regression.

(also update the comments to match the ilog2(nr_cpus) tuning effect)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-09 22:39:38 +01:00
Paul Mackerras
fa13a5a1f2 sched: restore deterministic CPU accounting on powerpc
Since powerpc started using CONFIG_GENERIC_CLOCKEVENTS, the
deterministic CPU accounting (CONFIG_VIRT_CPU_ACCOUNTING) has been
broken on powerpc, because we end up counting user time twice: once in
timer_interrupt() and once in update_process_times().

This fixes the problem by pulling the code in update_process_times
that updates utime and stime into a separate function called
account_process_tick.  If CONFIG_VIRT_CPU_ACCOUNTING is not defined,
there is a version of account_process_tick in kernel/timer.c that
simply accounts a whole tick to either utime or stime as before.  If
CONFIG_VIRT_CPU_ACCOUNTING is defined, then arch code gets to
implement account_process_tick.

This also lets us simplify the s390 code a bit; it means that the s390
timer interrupt can now call update_process_times even when
CONFIG_VIRT_CPU_ACCOUNTING is turned on, and can just implement a
suitable account_process_tick().

account_process_tick() now takes the task_struct * as an argument.
Tested both with and without CONFIG_VIRT_CPU_ACCOUNTING.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-09 22:39:38 +01:00
Balbir Singh
9a41785cc4 sched: fix delay accounting regression
Fix the delay accounting regression introduced by commit
75d4ef16a6. rq no longer has sched_info
data associated with it. task_struct sched_info structure is used by delay
accounting to provide back statistics to user space.

also remove direct use of sched_clock() (which is not a valid thing to
do anymore) and use rq->clock instead.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-09 22:39:37 +01:00
Peter Zijlstra
b2be5e96dc sched: reintroduce the sched_min_granularity tunable
we lost the sched_min_granularity tunable to a clever optimization
that uses the sched_latency/min_granularity ratio - but the ratio
is quite unintuitive to users and can also crash the kernel if the
ratio is set to 0. So reintroduce the min_granularity tunable,
while keeping the ratio maintained internally.

no functionality changed.

[ mingo@elte.hu: some fixlets. ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-09 22:39:37 +01:00
Peter Zijlstra
2cb8600e6b sched: documentation: place_entity() comments
Add a few comments to place_entity(). No code changed.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-09 22:39:37 +01:00
Peter Zijlstra
10b777246c sched: fix vslice
vslice was missing a factor NICE_0_LOAD, as weight is in
weight*NICE_0_LOAD units.

the effect of this bug was larger initial slices and
thus latency-noisier forks.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-09 22:39:37 +01:00
Li Zefan
8dce39c231 time: fix inconsistent function names in comments
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-05 15:12:33 -08:00
Alexey Dobriyan
5db6a4dac1 Dump stack during sysctl registration failure
Let's make immediately obvious from where sysctl comes from and messages
itself more noticeable.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-05 15:12:31 -08:00
Adrian Bunk
fad23fc78b kernel/futex.c: make 3 functions static
The following functions can now become static again:
- get_futex_key()
- get_futex_key_refs()
- drop_futex_key_refs()

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2007-11-05 21:53:46 +11:00
Linus Torvalds
a7e1e001f4 Merge branch 'v2.6.24-rc1-lockdep' of git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-lockdep
* 'v2.6.24-rc1-lockdep' of git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-lockdep:
  lockdep: fix a typo in the __lock_acquire comment
  sched: fix unconditional irq lock
  lockdep: fixup irq tracing
2007-11-03 12:42:52 -07:00
David S. Miller
f3baa4827a [COMPAT]: Fix build on COMPAT platforms when CONFIG_NET is disabled.
Add some missing cond_syscall() entries for this case.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-30 21:29:56 -07:00
Rafael J. Wysocki
cc5f916e90 Freezer: do not allow freezing processes to clear TIF_SIGPENDING
Do not allow processes to clear their TIF_SIGPENDING if TIF_FREEZE is set,
so that they will not race with the freezer (like mysqld does, for example).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Nigel Cunningham <nigel@suspend2.net>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-30 08:06:55 -07:00
Balbir Singh
9301899be7 sched: fix /proc/<PID>/stat stime/utime monotonicity, part 2
Extend Peter's patch to fix accounting issues, by keeping stime
monotonic too.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Frans Pop <elendil@planet.nl>
2007-10-30 00:26:32 +01:00
Ingo Molnar
38605cae99 sched: fix style in kernel/sched.c
fallout of recent commits: small coding style fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-29 21:18:11 +01:00
Ingo Molnar
8eb172d941 sched: fix style of swap() macro in kernel/sched_fair.c
fix style of swap() macro in kernel/sched_fair.c.

( this macro should eventually move to a general header, as ext3 uses
  a similar construct too. )

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-29 21:18:11 +01:00
Paul Menage
fe5c7cc228 sched: report CPU usage in CFS cgroup directories
Adds a cpu.usage file to the CFS cgroup that reports CPU usage in
milliseconds for that cgroup's tasks

[ mingo@elte.hu: style cleanups. ]

Signed-off-by: Paul Menage <menage@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-29 21:18:11 +01:00
Srivatsa Vaddagiri
ae8393e508 sched: move rcu_head to task_group struct
Peter Zijlstra noticed that the rcu_head object need not be present
in every cfs_rq of a group. Move it to the task_group structure
instead.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-29 21:18:11 +01:00
James Bottomley
7bae49d498 sched: fix incorrect assumption that cpu 0 exists
This patch:

commit 9b5b77512d
Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Date:   Mon Oct 15 17:00:09 2007 +0200

    sched: clean up code under CONFIG_FAIR_GROUP_SCHED

Introduced an assumption of the existence of CPU0 via this line

cfs_rq = tg->cfs_rq[0];

If you have no CPU0, that will be NULL.  The fix seems to be just to
take whatever cfs_rq queue comes out of the for_each_possible_cpu()
loop, since they're all equally good for the destruction operation.

Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-29 21:18:11 +01:00
Peter Zijlstra
73a2bcb0ed sched: keep utime/stime monotonic
keep utime/stime monotonic.

cpustats use utime/stime as a ratio against sum_exec_runtime, as a
consequence it can happen - when the ratio changes faster than time
accumulates - that either can be appear to go backwards.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-29 21:18:11 +01:00
Adrian Bunk
f7402e0361 sched: make kernel/sched.c:account_guest_time() static
account_guest_time() can become static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-29 21:18:10 +01:00
Linus Torvalds
93400708db Merge git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt
* git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt:
  Quieten hrtimer printk: "Switched to high resolution mode .."
  timer_list: Fix printk format strings
  clockevents: unexport tick_nohz_get_sleep_length
2007-10-29 07:47:05 -07:00
Al Viro
ca5cd877ae x86 merge fallout: uml
Don't undef __i386__/__x86_64__ in uml anymore, make sure that (few) places
that required adjusting the ifdefs got those.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-29 07:41:32 -07:00
Michael Ellerman
edfed66e17 Quieten hrtimer printk: "Switched to high resolution mode .."
Change the hrtimer printk "Switched to high resolution mode .." to
be KERN_DEBUG, rather than KERN_INFO. If users need to see this they
can pass "loglevel" or "debug" on the command line, or check dmesg.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

 kernel/hrtimer.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
2007-10-29 09:39:38 +01:00
Vegard Nossum
129f1d2c53 timer_list: Fix printk format strings
This makes sure printk format strings contain no more than a single
line.

Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-29 09:39:38 +01:00
Adrian Bunk
64e38eb082 clockevents: unexport tick_nohz_get_sleep_length
This patch removes the unused 
EXPORT_SYMBOL_GPL(tick_nohz_get_sleep_length).

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-29 09:39:38 +01:00
Gautham R Shenoy
17aacfb9cd lockdep: fix a typo in the __lock_acquire comment
Fix a typo in the __lock_acquire comment.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-28 20:47:01 +01:00
Peter Zijlstra
ab63a633cf sched: fix unconditional irq lock
Lockdep noticed that this lock can also be taken from hardirq context, and can
thus not unconditionally disable/enable irqs.

 WARNING: at kernel/lockdep.c:2033 trace_hardirqs_on()
  [show_trace_log_lvl+26/48] show_trace_log_lvl+0x1a/0x30
  [show_trace+18/32] show_trace+0x12/0x20
  [dump_stack+22/32] dump_stack+0x16/0x20
  [trace_hardirqs_on+405/416] trace_hardirqs_on+0x195/0x1a0
  [_read_unlock_irq+34/48] _read_unlock_irq+0x22/0x30
  [sched_debug_show+2615/4224] sched_debug_show+0xa37/0x1080
  [show_state_filter+326/368] show_state_filter+0x146/0x170
  [sysrq_handle_showstate+10/16] sysrq_handle_showstate+0xa/0x10
  [__handle_sysrq+123/288] __handle_sysrq+0x7b/0x120
  [handle_sysrq+40/64] handle_sysrq+0x28/0x40
  [kbd_event+1045/1680] kbd_event+0x415/0x690
  [input_pass_event+206/208] input_pass_event+0xce/0xd0
  [input_handle_event+170/928] input_handle_event+0xaa/0x3a0
  [input_event+95/112] input_event+0x5f/0x70
  [atkbd_interrupt+434/1456] atkbd_interrupt+0x1b2/0x5b0
  [serio_interrupt+59/128] serio_interrupt+0x3b/0x80
  [i8042_interrupt+263/576] i8042_interrupt+0x107/0x240
  [handle_IRQ_event+40/96] handle_IRQ_event+0x28/0x60
  [handle_edge_irq+175/320] handle_edge_irq+0xaf/0x140
  [do_IRQ+64/128] do_IRQ+0x40/0x80
  [common_interrupt+46/52] common_interrupt+0x2e/0x34

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-25 14:02:45 +02:00
Peter Williams
681f3e6854 sched: isolate SMP balancing code a bit more
At the moment, a lot of load balancing code that is irrelevant to non
SMP systems gets included during non SMP builds.

This patch addresses this issue and reduces the binary size on non
SMP systems:

   text    data     bss     dec     hex filename
  10983      28    1192   12203    2fab sched.o.before
  10739      28    1192   11959    2eb7 sched.o.after

Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-24 18:23:51 +02:00
Peter Williams
e1d1484f72 sched: reduce balance-tasks overhead
At the moment, balance_tasks() provides low level functionality for both
  move_tasks() and move_one_task() (indirectly) via the load_balance()
function (in the sched_class interface) which also provides dual
functionality.  This dual functionality complicates the interfaces and
internal mechanisms and makes the run time overhead of operations that
are called with two run queue locks held.

This patch addresses this issue and reduces the overhead of these
operations.

Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-24 18:23:51 +02:00
Adrian Bunk
a0f846aa76 sched: make cpu_shares_{show,store}() static
cpu_shares_{show,store}() can become static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-24 18:23:50 +02:00
Paul Menage
2b01dfe372 sched: clean up some control group code
- replace "cont" with "cgrp" in a few places in the CFS cgroup code, 
- use write_uint rather than write for cpu.shares write function

Signed-off-by: Paul Menage <menage@google.com>
Acked-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-24 18:23:50 +02:00
Mel Gorman
b3da2a73ff sched: document profile=sleep requiring CONFIG_SCHEDSTATS
profile=sleep only works if CONFIG_SCHEDSTATS is set. This patch notes
the limitation in Documentation/kernel-parameters.txt and prints a
warning at boot-time if profile=sleep is used without CONFIG_SCHEDSTAT.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-24 18:23:50 +02:00