Jeff Dike <jdike@addtoit.com>,
Paolo 'Blaisorblade' Giarrusso <blaisorblade_spam@yahoo.it>,
Bodo Stroesser <bstroesser@fujitsu-siemens.com>
Adds a new ptrace(2) mode, called PTRACE_SYSEMU, resembling PTRACE_SYSCALL
except that the kernel does not execute the requested syscall; this is useful
to improve performance for virtual environments, like UML, which want to run
the syscall on their own.
In fact, using PTRACE_SYSCALL means stopping child execution twice, on entry
and on exit, and each time you also have two context switches; with SYSEMU you
avoid the 2nd stop and so save two context switches per syscall.
Also, some architectures don't have support in the host for changing the
syscall number via ptrace(), which is currently needed to skip syscall
execution (UML turns any syscall into getpid() to avoid it being executed on
the host). Fixing that is hard, while SYSEMU is easier to implement.
* This version of the patch includes some suggestions of Jeff Dike to avoid
adding any instructions to the syscall fast path, plus some other little
changes, by myself, to make it work even when the syscall is executed with
SYSENTER (but I'm unsure about them). It has been widely tested for quite a
lot of time.
* Various fixed were included to handle the various switches between
various states, i.e. when for instance a syscall entry is traced with one of
PT_SYSCALL / _SYSEMU / _SINGLESTEP and another one is used on exit.
Basically, this is done by remembering which one of them was used even after
the call to ptrace_notify().
* We're combining TIF_SYSCALL_EMU with TIF_SYSCALL_TRACE or TIF_SINGLESTEP
to make do_syscall_trace() notice that the current syscall was started with
SYSEMU on entry, so that no notification ought to be done in the exit path;
this is a bit of a hack, so this problem is solved in another way in next
patches.
* Also, the effects of the patch:
"Ptrace - i386: fix Syscall Audit interaction with singlestep"
are cancelled; they are restored back in the last patch of this series.
Detailed descriptions of the patches doing this kind of processing follow (but
I've already summed everything up).
* Fix behaviour when changing interception kind #1.
In do_syscall_trace(), we check the status of the TIF_SYSCALL_EMU flag
only after doing the debugger notification; but the debugger might have
changed the status of this flag because he continued execution with
PTRACE_SYSCALL, so this is wrong. This patch fixes it by saving the flag
status before calling ptrace_notify().
* Fix behaviour when changing interception kind #2:
avoid intercepting syscall on return when using SYSCALL again.
A guest process switching from using PTRACE_SYSEMU to PTRACE_SYSCALL
crashes.
The problem is in arch/i386/kernel/entry.S. The current SYSEMU patch
inhibits the syscall-handler to be called, but does not prevent
do_syscall_trace() to be called after this for syscall completion
interception.
The appended patch fixes this. It reuses the flag TIF_SYSCALL_EMU to
remember "we come from PTRACE_SYSEMU and now are in PTRACE_SYSCALL", since
the flag is unused in the depicted situation.
* Fix behaviour when changing interception kind #3:
avoid intercepting syscall on return when using SINGLESTEP.
When testing 2.6.9 and the skas3.v6 patch, with my latest patch and had
problems with singlestepping on UML in SKAS with SYSEMU. It looped
receiving SIGTRAPs without moving forward. EIP of the traced process was
the same for all SIGTRAPs.
What's missing is to handle switching from PTRACE_SYSCALL_EMU to
PTRACE_SINGLESTEP in a way very similar to what is done for the change from
PTRACE_SYSCALL_EMU to PTRACE_SYSCALL_TRACE.
I.e., after calling ptrace(PTRACE_SYSEMU), on the return path, the debugger is
notified and then wake ups the process; the syscall is executed (or skipped,
when do_syscall_trace() returns 0, i.e. when using PTRACE_SYSEMU), and
do_syscall_trace() is called again. Since we are on the return path of a
SYSEMU'd syscall, if the wake up is performed through ptrace(PTRACE_SYSCALL),
we must still avoid notifying the parent of the syscall exit. Now, this
behaviour is extended even to resuming with PTRACE_SINGLESTEP.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Clean code up a bit, and only show suspend to disk as available when
it is configured in.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If process freezing fails, some processes are frozen, and rest are left in
"were asked to be frozen" state. Thats wrong, we should leave it in some
consistent state.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Drop printing during normal boot (when no image exists in swap), print
message when drivers fail, fix error paths and consolidate near-identical
functions in disk.c (and functions with just one statement).
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It is trying to protect swsusp_resume_device and software_resume() from two
users banging it from userspace at the same time.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The function calc_nr uses an iterative algorithm to calculate the number of
pages needed for the image and the pagedir. Exactly the same result can be
obtained with a one-line expression.
Note that this was even proved correct ;-).
Signed-off-by: Michal Schmidt <xschmi00@stud.feec.vutbr.cz>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The patch protects from leaking sensitive data after resume from suspend.
During suspend a temporary key is created and this key is used to encrypt the
data written to disk. When, during resume, the data was read back into memory
the temporary key is destroyed which simply means that all data written to
disk during suspend are then inaccessible so they can't be stolen lateron.
Think of the following: you suspend while an application is running that keeps
sensitive data in memory. The application itself prevents the data from being
swapped out. Suspend, however, must write these data to swap to be able to
resume lateron. Without suspend encryption your sensitive data are then
stored in plaintext on disk. This means that after resume your sensitive data
are accessible to all applications having direct access to the swap device
which was used for suspend. If you don't need swap after resume these data
can remain on disk virtually forever. Thus it can happen that your system
gets broken in weeks later and sensitive data which you thought were encrypted
and protected are retrieved and stolen from the swap device.
Signed-off-by: Andreas Steinmetz <ast@domdv.de>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This should make refrigerator sleep properly, not busywait after the first
schedule() returns.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Aha, swsusp dips into swap_info[], better update it to swap_lock. It's
bitflipping flags with 0xFF, so get_swap_page will allocate from only the one
chosen device: let's change that to flip SWP_WRITEOK.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Of this type, mostly:
CHECK net/ipv6/netfilter.c
net/ipv6/netfilter.c:96:12: warning: symbol 'ipv6_netfilter_init' was not declared. Should it be static?
net/ipv6/netfilter.c:101:6: warning: symbol 'ipv6_netfilter_fini' was not declared. Should it be static?
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Remove bogus code for compiling netlink as module
- Add module refcounting support for modules implementing a netlink
protocol
- Add support for autoloading modules that implement a netlink protocol
as soon as someone opens a socket for that protocol
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
At the suggestion of Nick Piggin and Dinakar, totally disable
the facility to allow cpu_exclusive cpusets to define dynamic
sched domains in Linux 2.6.13, in order to avoid problems
first reported by John Hawkes (corrupt sched data structures
and kernel oops).
This has been built for ppc64, i386, ia64, x86_64, sparc, alpha.
It has been built, booted and tested for cpuset functionality
on an SN2 (ia64).
Dinakar or Nick - could you verify that it for sure does avoid
the problems Hawkes reported. Hawkes is out of town, and I don't
have the recipe to reproduce what he found.
Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The partial disabling of Dinakar's new facility to allow
cpu_exclusive cpusets to define dynamic sched domains
doesn't go far enough. At the suggestion of Nick Piggin
and Dinakar, let us instead totally disable this facility
for 2.6.13, in order to avoid problems first reported
by John Hawkes (corrupt sched data structures and kernel oops).
This patch removes the partial disabling code in 2.6.13-rc7,
in anticipation of the next patch, which will totally disable
it instead.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Build issues were mostly in the ACPI=n case -- don't do that.
Select ACPI from IA64_GENERIC.
Add some missing dependencies on ACPI.
Mark BLACKLIST_YEAR and some laptop-only ACPI drivers
as X86-only. Let me know when you get an IA64 Laptop.
Signed-off-by: Len Brown <len.brown@intel.com>
As reported by Paul Mackerras <paulus@samba.org>, the previous patch
"cpu_exclusive sched domains fix" broke the ppc64 build with
CONFIC_CPUSET, yielding error messages:
kernel/cpuset.c: In function 'update_cpu_domains':
kernel/cpuset.c:648: error: invalid lvalue in unary '&'
kernel/cpuset.c:648: error: invalid lvalue in unary '&'
On some arch's, the node_to_cpumask() is a function, returning
a cpumask_t. But the for_each_cpu_mask() requires an lvalue mask.
The following patch fixes this build failure by making a copy
of the cpumask_t on the stack.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This keeps the kernel/cpuset.c routine update_cpu_domains() from
invoking the sched.c routine partition_sched_domains() if the cpuset in
question doesn't fall on node boundaries.
I have boot tested this on an SN2, and with the help of a couple of ad
hoc printk's, determined that it does indeed avoid calling the
partition_sched_domains() routine on partial nodes.
I did not directly verify that this avoids setting up bogus sched
domains or avoids the oops that Hawkes saw.
This patch imposes a silent artificial constraint on which cpusets can
be used to define dynamic sched domains.
This patch should allow proceeding with this new feature in 2.6.13 for
the configurations in which it is useful (node alligned sched domains)
while avoiding trying to setup sched domains in the less useful cases
that can cause the kernel corruption and oops.
Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Dinakar Guniguntala <dino@in.ibm.com>
Acked-by: John Hawkes <hawkes@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With CONFIG_PREEMPT && !CONFIG_SMP, it's possible for sys_getppid to
return a bogus value if the parent's task_struct gets reallocated after
current->group_leader->real_parent is read:
asmlinkage long sys_getppid(void)
{
int pid;
struct task_struct *me = current;
struct task_struct *parent;
parent = me->group_leader->real_parent;
RACE HERE => for (;;) {
pid = parent->tgid;
#ifdef CONFIG_SMP
{
struct task_struct *old = parent;
/*
* Make sure we read the pid before re-reading the
* parent pointer:
*/
smp_rmb();
parent = me->group_leader->real_parent;
if (old != parent)
continue;
}
#endif
break;
}
return pid;
}
If the process gets preempted at the indicated point, the parent process
can go ahead and call exit() and then get wait()'d on to reap its
task_struct. When the preempted process gets resumed, it will not do any
further checks of the parent pointer on !CONFIG_SMP: it will read the
bad pid and return.
So, the same algorithm used when SMP is enabled should be used when
preempt is enabled, which will recheck ->real_parent in this case.
Signed-off-by: David Meybohm <dmeybohmlkml@bellsouth.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As suggested by Michael Kerrisk <mtk-manpages@gmx.net>, make RLIMIT_NICE
consistent with getpriority before it becomes available in released glibc.
Signed-off-by: Matt Mackall <mpm@selenic.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This bug is quite subtle and only happens in a very interesting
situation where a real-time threaded process is in the middle of a
coredump when someone whacks it with a SIGKILL. However, this deadlock
leaves the system pretty hosed and you have to reboot to recover.
Not good for real-time priority-preemption applications like our
telephony application, with 90+ real-time (SCHED_FIFO and SCHED_RR)
processes, many of them multi-threaded, interacting with each other for
high volume call processing.
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following patch against audit.81 prevents duplicate syscall rules in
a given filter list by walking the list on each rule add.
I also removed the unused struct audit_entry in audit.c and made the
static inlines in auditsc.c consistent.
Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
It was showing up fairly high on profiles even when no rules were set.
Make sure the common path stays as fast as possible.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
We have a chek in there to make sure that the name won't overflow
task_struct.comm[], but it's triggering for scsi with lots of HBAs, only
scsi is using single-threaded workqueues which don't append the "/%d"
anyway.
All too hard. Just kill the BUG_ON.
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
[ kthread_create() uses vsnprintf() and limits the thing, so no
actual overflow can actually happen regardless ]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix possible cpuset_sem ABBA deadlock if 'notify_on_release' set.
For a particular usage pattern, creating and destroying cpusets fairly
frequently using notify_on_release, on a very large system, this deadlock
can be seen every few days. If you are not using the cpuset
notify_on_release feature, you will never see this deadlock.
The existing code, on task exit (or cpuset deletion) did:
get cpuset_sem
if cpuset marked notify_on_release and is ready to release:
compute cpuset path relative to /dev/cpuset mount point
call_usermodehelper() forks /sbin/cpuset_release_agent with path
drop cpuset_sem
Unfortunately, the fork in call_usermodehelper can allocate memory, and
allocating memory can require cpuset_sem, if the mems_generation values
changed in the interim. This results in an ABBA deadlock, trying to obtain
cpuset_sem when it is already held by the current task.
To fix this, I put the cpuset path (which must be computed while holding
cpuset_sem) in a temporary buffer, to be used in the call_usermodehelper
call of /sbin/cpuset_release_agent only _after_ dropping cpuset_sem.
So the new logic is:
get cpuset_sem
if cpuset marked notify_on_release and is ready to release:
compute cpuset path relative to /dev/cpuset mount point
stash path in kmalloc'd buffer
drop cpuset_sem
call_usermodehelper() forks /sbin/cpuset_release_agent with path
free path
The sharp eyed reader might notice that this patch does not contain any
calls to kmalloc. The existing code in the check_for_release() routine was
already kmalloc'ing a buffer to hold the cpuset path. In the old code, it
just held the buffer for a few lines, over the cpuset_release_agent() call
that in turn invoked call_usermodehelper(). In the new code, with the
application of this patch, it returns that buffer via the new char
**ppathbuf parameter, for later use and freeing in cpuset_release_agent(),
which is called after cpuset_sem is dropped. Whereas the old code has just
one call to cpuset_release_agent(), right in the check_for_release()
routine, the new code has three calls to cpuset_release_agent(), from the
various places that a cpuset can be released.
This patch has been build and booted on SN2, and passed a stress test that
previously hit the deadlock within a few seconds.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Revert this June 17 patch: it broke persistence of timers across execve().
Cc: Roland McGrath <roland@redhat.com>
Cc: george anzinger <george@mvista.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This removes the calls to device_suspend() from the shutdown path that
were added sometime during 2.6.13-rc*. They aren't working properly on
a number of configs (I got reports from both ppc powerbook users and x86
users) causing the system to not shutdown anymore.
I think it isn't the right approach at the moment anyway. We have
already a shutdown() callback for the drivers that actually care about
shutdown and the suspend() code isn't yet in a good enough shape to be
so much generalized. Also, the semantics of suspend and shutdown are
slightly different on a number of setups and the way this was patched in
provides little way for drivers to cleanly differenciate. It should
have been at least a different message.
For 2.6.13, I think we should revert to 2.6.12 behaviour and have a
working suspend back.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The module code assumes noone will ever ask for a per-cpu area more than
SMP_CACHE_BYTES aligned. However, as these cases show, gcc asks sometimes
asks for 32-byte alignment for the per-cpu section on a module, and if
CONFIG_X86_L1_CACHE_SHIFT is 4, we hit that BUG_ON(). This is obviously an
unusual combination, as there have been few reports, but better to warn
than die.
See:
http://www.ussg.iu.edu/hypermail/linux/kernel/0409.0/0768.html
And more recently:
http://bugs.gentoo.org/show_bug.cgi?id=97006
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This removes sys_set_zone_reclaim() for now. While i'm sure Martin is
trying to solve a real problem, we must not hard-code an incomplete and
insufficient approach into a syscall, because syscalls are pretty much
for eternity. I am quite strongly convinced that this syscall must not
hit v2.6.13 in its current form.
Firstly, the syscall lacks basic syscall design: e.g. it allows the
global setting of VM policy for unprivileged users. (!) [ Imagine an
Oracle installation and a SAP installation on the same NUMA box fighting
over the 'optimal' setting for this flag. What will they do? Will they
try to set the flag to their own preferred value every second or so? ]
Secondly, it was added based on a single datapoint from Martin:
http://marc.theaimsgroup.com/?l=linux-mm&m=111763597218177&w=2
where Martin characterizes the numbers the following way:
' Run-to-run variability for "make -j" is huge, so these numbers aren't
terribly useful except to see that with reclaim the benchmark still
finishes in a reasonable amount of time. '
in other words: the fundamental problem has likely not been solved, only
a tendential move into the right direction has been observed, and a
handful of numbers were picked out of a set of hugely variable results,
without showing the variability data. How much variance is there
run-to-run?
I'd really suggest to first walk the walk and see what's needed to get
stable & predictable kernel compilation numbers on that NUMA box, before
adding random syscalls to tune a particular aspect of the VM ... which
approach might not even matter once the whole picture has been analyzed
and understood!
The third, most important point is that the syscall exposes VM tuning
internals in a completely unstructured way. What sense does it make to
have a _GLOBAL_ per-node setting for 'should we go to another node for
reclaim'? If then it might make sense to do this per-app, via numalib or
so.
The change is minimalistic in that it doesnt remove the syscall and the
underlying infrastructure changes, only the user-visible changes. We
could perhaps add a CAP_SYS_ADMIN-only sysctl for this hack, a'ka
/proc/sys/vm/swappiness, but even that looks quite counterproductive
when the generic approach is that we are trying to reduce the number of
external factors in the VM balance picture.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This snuck in with an x86_64 change. Thanks to Richard Purdie
<rpurdie@rpsys.net> for spotting it.
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If device_suspend(PMSG_FREEZE) is not ready to be called in
kernel_restart it is definitely not ready to be called in the even more
fickle kernel_kexec.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
(We found this (after a customer complained) and it is in the kernel.org
kernel. Seems that for CLOCK_MONOTONIC absolute timers and clock_nanosleep
calls both the request time and wall_to_monotonic are subtracted prior to
the normalize resulting in an overflow in the existing normalize test.
This causes the result to be shifted ~4 seconds ahead instead of ~2 seconds
back in time.)
The normalize code in posix-timers.c fails when the tv_nsec member is ~1.2
seconds negative. This can happen on absolute timers (and
clock_nanosleeps) requested on CLOCK_MONOTONIC (both the request time and
wall_to_monotonic are subtracted resulting in the possibility of a number
close to -2 seconds.)
This fix uses the set_normalized_timespec() (which does not have an
overflow problem) to fix the problem and as a side effect makes the code
cleaner.
Signed-off-by: George Anzinger <george@mvista.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This avoids some potential stack overflows with very deep softirq callchains.
i386 does this too.
TOADD CFI annotation
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
My fairly ordinary x86 test box gets stuck during reboot on the
wait_for_completion() in ide_do_drive_cmd():
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
`gcc -W' likes to complain if the static keyword is not at the beginning of
the declaration. This patch fixes all remaining occurrences of "inline
static" up with "static inline" in the entire kernel tree (140 occurrences in
47 files).
While making this change I came across a few lines with trailing whitespace
that I also fixed up, I have also added or removed a blank line or two here
and there, but there are no functional changes in the patch.
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add kerneldoc to kernel/cpuset.c
Fix cpuset typos in init/Kconfig
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Split spin lock and r/w lock implementation into a single try which is done
inline and an out of line function that repeatedly tries to get the lock
before doing the cpu_relax(). Add a system control to set the number of
retries before a cpu is yielded.
The reason for the spin lock retry is that the diagnose 0x44 that is used to
give up the virtual cpu is quite expensive. For spin locks that are held only
for a short period of time the costs of the diagnoses outweights the savings
for spin locks that are held for a longer timer. The default retry count is
1000.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix the recent off-by-one fix in the itimer code:
1. The repeating timer is figured using the requested time
(not +1 as we know where we are in the jiffie).
2. The tests for interval too large are left to the time_val to jiffie code.
Signed-off-by: George Anzinger <george@mvista.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch fixes a warning in the disable_nonboot_cpus call in
kernel/power/smp.c.
Signed-off by: Nigel Cunningham <nigel@suspend2.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Here's the patch again to fix the code to handle if the values between
MAX_USER_RT_PRIO and MAX_RT_PRIO are different.
Without this patch, an SMP system will crash if the values are
different.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Dean Nelson <dcn@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
RLIMIT_RTPRIO is supposed to grant non privileged users the right to use
SCHED_FIFO/SCHED_RR scheduling policies with priorites bounded by the
RLIMIT_RTPRIO value via sched_setscheduler(). This is usually used by
audio users.
Unfortunately this is broken in 2.6.13rc3 as you can see in the excerpt
from sched_setscheduler below:
/*
* Allow unprivileged RT tasks to decrease priority:
*/
if (!capable(CAP_SYS_NICE)) {
/* can't change policy */
if (policy != p->policy)
return -EPERM;
After the above unconditional test which causes sched_setscheduler to
fail with no regard to the RLIMIT_RTPRIO value the following check is made:
/* can't increase priority */
if (policy != SCHED_NORMAL &&
param->sched_priority > p->rt_priority &&
param->sched_priority >
p->signal->rlim[RLIMIT_RTPRIO].rlim_cur)
return -EPERM;
Thus I do believe that the RLIMIT_RTPRIO value must be taken into
account for the policy check, especially as the RLIMIT_RTPRIO limit is
of no use without this change.
The attached patch fixes this problem.
Signed-off-by: Andreas Steinmetz <ast@domdv.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The suspend to disk code was a poor copy of the code in
sys_reboot now that we have kernel_power_off, kernel_restart
and kernel_halt use them instead of poorly duplicating them inline.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We know the system is in trouble so there is no question if this
is an emergecy :)
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We already do all of the gymnastics to run from process context
to call the power off code so call into the power off code cleanly.
This especially helps acpi as part of it's shutdown logic should
run acpi_shutdown called from device_shutdown which was not
being called from here.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When the kernel is working well and we want to restart cleanly
kernel_restart is the function to use. But in many instances
the kernel wants to reboot when thing are expected to be working
very badly such as from panic or a software watchdog handler.
This patch adds the function emergency_restart() so that
callers can be clear what semantics they expect when calling
restart. emergency_restart() is expected to be callable
from interrupt context and possibly reliable in even more
trying circumstances.
This is an initial generic implementation for all architectures.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It is obvious we wanted to call kernel_restart here
but since we don't have it the code was expanded inline and hasn't
been correct since sometime in 2.4.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Because the factors of sys_reboot don't exist people calling
into the reboot path duplicate the code badly, leading to
inconsistent expectations of code in the reboot path.
This patch should is just code motion.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In the recent addition of device_suspend calls into
sys_reboot two code paths were missed.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
... by generating serial numbers only if an audit context is actually
_used_, rather than doing so at syscall entry even when the context
isn't necessarily marked auditable.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
When we flush a pending syscall audit record due to audit_free(), we
might be doing that in the context of the idle thread. So use GFP_ATOMIC
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Fix the sparse warning "implicit cast to nocast type"
Signed-off-by: Victor Fusco <victor@cetuc.puc-rio.br>
Signed-off-by: Domen Puncer <domen@coderock.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
This moves the inotify sysctl knobs to "/proc/sys/fs/inotify" from
"/proc/sys/fs". Also some related cleanup.
Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
inotify is intended to correct the deficiencies of dnotify, particularly
its inability to scale and its terrible user interface:
* dnotify requires the opening of one fd per each directory
that you intend to watch. This quickly results in too many
open files and pins removable media, preventing unmount.
* dnotify is directory-based. You only learn about changes to
directories. Sure, a change to a file in a directory affects
the directory, but you are then forced to keep a cache of
stat structures.
* dnotify's interface to user-space is awful. Signals?
inotify provides a more usable, simple, powerful solution to file change
notification:
* inotify's interface is a system call that returns a fd, not SIGIO.
You get a single fd, which is select()-able.
* inotify has an event that says "the filesystem that the item
you were watching is on was unmounted."
* inotify can watch directories or files.
Inotify is currently used by Beagle (a desktop search infrastructure),
Gamin (a FAM replacement), and other projects.
See Documentation/filesystems/inotify.txt.
Signed-off-by: Robert Love <rml@novell.com>
Cc: John McCutchan <ttb@tentacle.dhs.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
dup_mmap of a VM_DONTCOPY vma forgot to lower the child's total_vm. (But
no way does this account for the recent report of total_vm seen too low.)
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
kernel/power/disk.c needs a declaration of name_to_dev_t() in scope. mount.h
seems like an appropriate choice.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Free some RAM before entering S3 so that upon
resume we can be sure early allocations will succeed.
http://bugzilla.kernel.org/show_bug.cgi?id=3469
Signed-off-by: David Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Register an "acpi" system device to be notified of shutdown preparation.
This depends on CONFIG_PM
http://bugzilla.kernel.org/show_bug.cgi?id=4041
Signed-off-by: Alexey Starikovskiy <alexey.y.starikovskiy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Len Brown <len.brown@intel.com>
The BKS might be reacquired before we have dropped PREEMPT_ACTIVE, which
could trigger a second could trigger a second cond_resched() call. Bug
found by Hirofumi Ogawa.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a new section called ".data.read_mostly" for data items that are read
frequently and rarely written to like cpumaps etc.
If these maps are placed in the .data section then these frequenly read
items may end up in cachelines with data is is frequently updated. In that
case all processors in an SMP system must needlessly reload the cachelines
again and again containing elements of those frequently used variables.
The ability to share these cachelines will allow each cpu in an SMP system
to keep local copies of those shared cachelines thereby optimizing
performance.
Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com>
Signed-off-by: Christoph Lameter <christoph@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
freezeable() already tests for TRACED/STOPPED processes, no need to do it
twice.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix error handling and whitespace in swsusp.c. swsusp_free() was called when
there was nothing allocating, leading to oops.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move device name resolution code around so that it is not called from
resume-from-initrd. name_to_dev_t may be unavailable at that point.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following renames arch_init, a kprobes function for performing any
architecture specific initialization, to arch_init_kprobes in order to
cleanup the namespace.
Also, this patch adds arch_init_kprobes to sparc64 to fix the sparc64 kprobes
build from the last return probe patch.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The pid in the audit context isn't always set up. Use tsk->pid when
checking whether it's auditd in audit_filter_syscall(), instead of
ctx->pid. Remove a band-aid which did the same elsewhere.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
We force a rate-limit on auditable events by making them wait for space
on the backlog queue. However, if auditd really is AWOL then this could
potentially bring the entire system to a halt, depending on the audit
rules in effect.
Firstly, make sure the wait time is honoured correctly -- it's the
maximum time the process should wait, rather than the time to wait
_each_ time round the loop. We were getting re-woken _each_ time a
packet was dequeued, and the timeout was being restarted each time.
Secondly, reset the wait time after audit_panic() is called. In general
this will be reset to zero, to allow progress to be made. If the system
is configured to _actually_ panic on audit_panic() then that will
already have happened; otherwise we know that audit records are being
lost anyway.
These two tunables can't be exposed via AUDIT_GET and AUDIT_SET because
those aren't particularly well-designed. It probably should have been
done by sysctls or sysfs anyway -- one for a later patch.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Anyone reporting a stuck IRQ should try these options. Its effectiveness
varies we've found in the Fedora case. Quite a few systems with misdescribed
IRQ routing just work when you use irqpoll. It also fixes up the VIA systems
although thats now fixed with the VIA quirk (which we could just make default
as its what Redmond OS does but Linus didn't like it historically).
A small number of systems have jammed IRQ sources or misdescribes that cause
an IRQ that we have no handler registered anywhere for. In those cases it
doesn't help.
Signed-off-by: Alan Cox <number6@the-village.bc.nu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As Steven Rostedt pointed out, there are 2 problems with ITIMER_REAL
timers.
1. do_setitimer() does not call del_timer_sync() in case
when the timer is not pending (it_real_value() returns 0).
This is wrong, the timer may still be running, and it can
rearm itself.
2. It calls del_timer_sync() with tsk->sighand->siglock held.
This is deadlockable, because timer's handler needs this
lock too.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use msleep() in a few places.
Signed-off-by: Luca Falavigna <dktrkranz@gmail.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jeff Garzik <jgarzik@pobox.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch tweaks idle thread setup semantics a bit: instead of setting
NEED_RESCHED in init_idle(), we do an explicit schedule() before calling
into cpu_idle().
This patch, while having no negative side-effects, enables wider use of
cond_resched()s. (which might happen in the stock kernel too, but it's
particulary important for voluntary-preempt)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following is the second version of the function return probe patches
I sent out earlier this week. Changes since my last submission include:
* Fix in ppc64 code removing an unneeded call to re-enable preemption
* Fix a build problem in ia64 when kprobes was turned off
* Added another BUG_ON check to each of the architecture trampoline
handlers
My initial patch description ==>
From my experiences with adding return probes to x86_64 and ia64, and the
feedback on LKML to those patches, I think we can simplify the design
for return probes.
The following patch tweaks the original design such that:
* Instead of storing the stack address in the return probe instance, the
task pointer is stored. This gives us all we need in order to:
- find the correct return probe instance when we enter the trampoline
(even if we are recursing)
- find all left-over return probe instances when the task is going away
This has the side effect of simplifying the implementation since more
work can be done in kernel/kprobes.c since architecture specific knowledge
of the stack layout is no longer required. Specifically, we no longer have:
- arch_get_kprobe_task()
- arch_kprobe_flush_task()
- get_rp_inst_tsk()
- get_rp_inst()
- trampoline_post_handler() <see next bullet>
* Instead of splitting the return probe handling and cleanup logic across
the pre and post trampoline handlers, all the work is pushed into the
pre function (trampoline_probe_handler), and then we skip single stepping
the original function. In this case the original instruction to be single
stepped was just a NOP, and we can do without the extra interruption.
The new flow of events to having a return probe handler execute when a target
function exits is:
* At system initialization time, a kprobe is inserted at the beginning of
kretprobe_trampoline. kernel/kprobes.c use to handle this on it's own,
but ia64 needed to do this a little differently (i.e. a function pointer
is really a pointer to a structure containing the instruction pointer and
a global pointer), so I added the notion of arch_init(), so that
kernel/kprobes.c:init_kprobes() now allows architecture specific
initialization by calling arch_init() before exiting. Each architecture
now registers a kprobe on it's own trampoline function.
* register_kretprobe() will insert a kprobe at the beginning of the targeted
function with the kprobe pre_handler set to arch_prepare_kretprobe
(still no change)
* When the target function is entered, the kprobe is fired, calling
arch_prepare_kretprobe (still no change)
* In arch_prepare_kretprobe() we try to get a free instance and if one is
available then we fill out the instance with a pointer to the return probe,
the original return address, and a pointer to the task structure (instead
of the stack address.) Just like before we change the return address
to the trampoline function and mark the instance as used.
If multiple return probes are registered for a given target function,
then arch_prepare_kretprobe() will get called multiple times for the same
task (since our kprobe implementation is able to handle multiple kprobes
at the same address.) Past the first call to arch_prepare_kretprobe,
we end up with the original address stored in the return probe instance
pointing to our trampoline function. (This is a significant difference
from the original arch_prepare_kretprobe design.)
* Target function executes like normal and then returns to kretprobe_trampoline.
* kprobe inserted on the first instruction of kretprobe_trampoline is fired
and calls trampoline_probe_handler() (no change here)
* trampoline_probe_handler() consumes each of the instances associated with
the current task by calling the registered handler function and marking
the instance as unused until an instance is found that has a return address
different then the trampoline function.
(change similar to my previous ia64 RFC)
* If the task is killed with some left-over return probe instances (meaning
that a target function was entered, but never returned), then we just
free any instances associated with the task. (Not much different other
then we can handle this without calling architecture specific functions.)
There is a known problem that this patch does not yet solve where
registering a return probe flush_old_exec or flush_thread will put us
in a bad state. Most likely the best way to handle this is to not allow
registering return probes on these two functions.
(Significant change)
This patch series applies to the 2.6.12-rc6-mm1 kernel, and provides:
* kernel/kprobes.c changes
* i386 patch of existing return probes implementation
* x86_64 patch of existing return probe implementation
* ia64 implementation
* ppc64 implementation (provided by Ananth)
This patch implements the architecture independant changes for a reworking
of the kprobes based function return probes design. Changes include:
* Removing functions for querying a return probe instance off a stack address
* Removing the stack_addr field from the kretprobe_instance definition,
and adding a task pointer
* Adding architecture specific initialization via arch_init()
* Removing extern definitions for the architecture trampoline functions
(this isn't needed anymore since the architecture handles the
initialization of the kprobe in the return probe trampoline function.)
Signed-off-by: Rusty Lynch <rusty.lynch@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Now that PPC64 has no-execute support, here is a second try to fix the
single step out of line during kprobe execution. Kprobes on x86_64 already
solved this problem by allocating an executable page and using it as the
scratch area for stepping out of line. Reuse that.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This updates the CFQ io scheduler to the new time sliced design (cfq
v3). It provides full process fairness, while giving excellent
aggregate system throughput even for many competing processes. It
supports io priorities, either inherited from the cpu nice value or set
directly with the ioprio_get/set syscalls. The latter closely mimic
set/getpriority.
This import is based on my latest from -mm.
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1. Establish a simple API for process freezing defined in linux/include/sched.h:
frozen(process) Check for frozen process
freezing(process) Check if a process is being frozen
freeze(process) Tell a process to freeze (go to refrigerator)
thaw_process(process) Restart process
frozen_process(process) Process is frozen now
2. Remove all references to PF_FREEZE and PF_FROZEN from all
kernel sources except sched.h
3. Fix numerous locations where try_to_freeze is manually done by a driver
4. Remove the argument that is no longer necessary from two function calls.
5. Some whitespace cleanup
6. Clear potential race in refrigerator (provides an open window of PF_FREEZE
cleared before setting PF_FROZEN, recalc_sigpending does not check
PF_FROZEN).
This patch does not address the problem of freeze_processes() violating the rule
that a task may only modify its own flags by setting PF_FREEZE. This is not clean
in an SMP environment. freeze(process) is therefore not SMP safe!
Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch makes use of ALIGN() to remove duplicate round-up code.
Signed-off-by: Nick Wilson <njw@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The comment for msleep_interruptible() is wrong, as it will ignore
wait-queue events, but will wake up early for signals.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Domen Puncer <domen@coderock.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
o Following patch provides purely cosmetic changes and corrects CodingStyle
guide lines related certain issues like below in kexec related files
o braces for one line "if" statements, "for" loops,
o more than 80 column wide lines,
o No space after "while", "for" and "switch" key words
o Changes:
o take-2: Removed the extra tab before "case" key words.
o take-3: Put operator at the end of line and space before "*/"
Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Makes kexec_crashdump() take a pt_regs * as an argument. This allows to
get exact register state at the point of the crash. If we come from direct
panic assertion NULL will be passed and the current registers saved before
crashdump.
This hooks into two places:
die(): check the conditions under which we will panic when calling
do_exit and go there directly with the pt_regs that caused the fatal
fault.
die_nmi(): If we receive an NMI lockup while in the kernel use the
pt_regs and go directly to crash_kexec(). We're probably nested up badly
at this point so this might be the only chance to escape with proper
information.
Signed-off-by: Alexander Nyberg <alexn@telia.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
From: "Vivek Goyal" <vgoyal@in.ibm.com>
o Support for /proc/vmcore interface. This interface exports elf core image
either in ELF32 or ELF64 format, depending on the format in which elf headers
have been stored by crashed kernel.
o Added support for CONFIG_VMCORE config option.
o Removed the dependency on /proc/kcore.
From: "Eric W. Biederman" <ebiederm@xmission.com>
This patch has been refactored to more closely match the prevailing style in
the affected files. And to clearly indicate the dependency between
/proc/kcore and proc/vmcore.c
From: Hariprasad Nellitheertha <hari@in.ibm.com>
This patch contains the code that provides an ELF format interface to the
previous kernel's memory post kexec reboot.
Signed off by Hariprasad Nellitheertha <hari@in.ibm.com>
Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds support for retrieving the address of elf core header if one
is passed in command line.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch provides the interfaces necessary to read the dump contents,
treating it as a high memory device.
Signed off by Hariprasad Nellitheertha <hari@in.ibm.com>
Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
o Following patch exports kexec global variable "crash_notes" to user space
through sysfs as kernel attribute in /sys/kernel.
Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is a minor bug fix in kexec to resolve the problem of loading panic
kernel with initrd.
o Problem: Loading a capture kenrel fails if initrd is also being loaded.
This has been observed for vmlinux image for kexec on panic case.
o This patch fixes the problem. In segment location and size verification
logic, minor correction has been done. Segment memory end (mend) should be
mstart + memsz - 1. This one byte offset was source of failure for initrd
loading which was being loaded at hole boundary.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch introduces the architecture independent implementation the
sys_kexec_load, the compat_sys_kexec_load system calls.
Kexec on panic support has been integrated into the core patch and is
relatively clean.
In addition the hopefully architecture independent option
crashkernel=size@location has been docuemented. It's purpose is to reserve
space for the panic kernel to live, and where no DMA transfer will ever be
setup to access.
Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Alexander Nyberg <alexn@telia.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds a new preemption model: 'Voluntary Kernel Preemption'. The
3 models can be selected from a new menu:
(X) No Forced Preemption (Server)
( ) Voluntary Kernel Preemption (Desktop)
( ) Preemptible Kernel (Low-Latency Desktop)
we still default to the stock (Server) preemption model.
Voluntary preemption works by adding a cond_resched()
(reschedule-if-needed) call to every might_sleep() check. It is lighter
than CONFIG_PREEMPT - at the cost of not having as tight latencies. It
represents a different latency/complexity/overhead tradeoff.
It has no runtime impact at all if disabled. Here are size stats that show
how the various preemption models impact the kernel's size:
text data bss dec hex filename
3618774 547184 179896 4345854 424ffe vmlinux.stock
3626406 547184 179896 4353486 426dce vmlinux.voluntary +0.2%
3748414 548640 179896 4476950 445016 vmlinux.preempt +3.5%
voluntary-preempt is +0.2% of .text, preempt is +3.5%.
This feature has been tested for many months by lots of people (and it's
also included in the RHEL4 distribution and earlier variants were in Fedora
as well), and it's intended for users and distributions who dont want to
use full-blown CONFIG_PREEMPT for one reason or another.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The only sane way to clean up the current 3 lock_kernel() variants seems to
be to remove the spinlock-based BKL implementations altogether, and to keep
the semaphore-based one only. If we dont want to do that for whatever
reason then i'm afraid we have to live with the current complexity. (but
i'm open for other cleanup suggestions as well.)
To explore this possibility we'll (at a minimum) have to know whether the
semaphore-based BKL works fine on plain SMP too. The patch below enables
this.
The patch may make sense in isolation as well, as it might bring
performance benefits: code that would formerly spin on the BKL spinlock
will now schedule away and give up the CPU. It might introduce performance
regressions as well, if any performance-critical code uses the BKL heavily
and gets overscheduled due to the semaphore. I very much hope there is no
such performance-critical codepath left though.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch consolidates the CONFIG_PREEMPT and CONFIG_PREEMPT_BKL
preemption options into kernel/Kconfig.preempt. This, besides reducing
source-code, also enables more centralized tweaking of preemption related
options.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Adds the core update_cpu_domains code and updated cpusets documentation
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Acked-by: Paul Jackson <pj@sgi.com>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following patches add dynamic sched domains functionality that was
extensively discussed on lkml and lse-tech. I would like to see this added to
-mm
o The main advantage with this feature is that it ensures that the scheduler
load balacing code only balances against the cpus that are in the sched
domain as defined by an exclusive cpuset and not all of the cpus in the
system. This removes any overhead due to load balancing code trying to
pull tasks outside of the cpu exclusive cpuset only to be prevented by
the tasks' cpus_allowed mask.
o cpu exclusive cpusets are useful for servers running orthogonal
workloads such as RT applications requiring low latency and HPC
applications that are throughput sensitive
o It provides a new API partition_sched_domains in sched.c
that makes dynamic sched domains possible.
o cpu_exclusive cpusets sets are now associated with a sched domain.
Which means that the users can dynamically modify the sched domains
through the cpuset file system interface
o ia64 sched domain code has been updated to support this feature as well
o Currently, this does not support hotplug. (However some of my tests
indicate hotplug+preempt is currently broken)
o I have tested it extensively on x86.
o This should have very minimal impact on performance as none of
the fast paths are affected
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Acked-by: Paul Jackson <pj@sgi.com>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Matthew Dobson <colpatch@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Presently, a process without the capability CAP_SYS_NICE can not change
its own policy, which is OK.
But it can also not decrease its RT priority (if scheduled with policy
SCHED_RR or SCHED_FIFO), which is what this patch changes.
The rationale is the same as for the nice value: a process should be
able to require less priority for itself. Increasing the priority is
still not allowed.
This is for example useful if you give a multithreaded user process a RT
priority, and the process would like to organize its internal threads
using priorities also. Then you can give the process the highest
priority needed N, and the process starts its threads with lower
priorities: N-1, N-2...
The POSIX norm says that the permissions are implementation specific, so
I think we can do that.
In a sense, it makes the permissions consistent whatever the policy is:
with this patch, process scheduled by SCHED_FIFO, SCHED_RR and
SCHED_OTHER can all decrease their priority.
From: Ingo Molnar <mingo@elte.hu>
cleaned up and merged to -mm.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The maximum rebalance interval allowed by the multiprocessor balancing
backoff is often not large enough to handle corner cases where there are
lots of tasks pinned on a CPU. Suresh reported:
I see system livelock's if for example I have 7000 processes
pinned onto one cpu (this is on the fastest 8-way system I
have access to).
After this patch, the machine is reported to go well above this number.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Consolidate balance-on-exec with balance-on-fork. This is made easy by the
sched-domains RCU patches.
As well as the general goodness of code reduction, this allows the runqueues
to be unlocked during balance-on-fork.
schedstats is a problem. Maybe just have balance-on-event instead of
distinguishing fork and exec?
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
One of the problems with the multilevel balance-on-fork/exec is that it needs
to jump through hoops to satisfy sched-domain's locking semantics (that is,
you may traverse your own domain when not preemptable, and you may traverse
others' domains when holding their runqueue lock).
balance-on-exec had to potentially migrate between more than one CPU before
finding a final CPU to migrate to, and balance-on-fork needed to potentially
take multiple runqueue locks.
So bite the bullet and make sched-domains go completely RCU. This actually
simplifies the code quite a bit.
From: Ingo Molnar <mingo@elte.hu>
schedstats RCU fix, and a nice comment on for_each_domain, from Ingo.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The fundamental problem that Suresh has with balance on exec and fork is that
it only tries to balance the top level domain with the flag set.
This was worked around by removing degenerate domains, but is still a problem
if people want to start using more complex sched-domains, especially
multilevel NUMA that ia64 is already using.
This patch makes balance on fork and exec try balancing over not just the top
most domain with the flag set, but all the way down the domain tree.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove degenerate scheduler domains during the sched-domain init.
For example on x86_64, we always have NUMA configured in. On Intel EM64T
systems, top most sched domain will be of NUMA and with only one sched_group
in it.
With fork/exec balances(recent Nick's fixes in -mm tree), we always endup
taking wrong decisions because of this topmost domain (as it contains only one
group and find_idlest_group always returns NULL). We will endup loading HT
package completely first, letting active load balance kickin and correct it.
In general, this patch also makes sense with out recent Nick's fixes in -mm.
From: Nick Piggin <nickpiggin@yahoo.com.au>
Modified to account for more than just sched_groups when scanning for
degenerate domains by Nick Piggin. And allow a runqueue's sd to go NULL
rather than keep a single degenerate domain around (this happens when you run
with maxcpus=1).
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix the last 2 places that directly access a runqueue's sched-domain and
assume it cannot be NULL.
That allows the use of NULL for domain, instead of a dummy domain, to signify
no balancing is to happen. No functional changes.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Instead of requiring architecture code to interact with the scheduler's
locking implementation, provide a couple of defines that can be used by the
architecture to request runqueue unlocked context switches, and ask for
interrupts to be enabled over the context switch.
Also replaces the "switch_lock" used by these architectures with an oncpu
flag (note, not a potentially slow bitflag). This eliminates one bus
locked memory operation when context switching, and simplifies the
task_running function.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Reimplement the balance on exec balancing to be sched-domains aware. Use this
to also do balance on fork balancing. Make x86_64 do balance on fork over the
NUMA domain.
The problem that the non sched domains aware blancing became apparent on dual
core, multi socket opterons. What we want is for the new tasks to be sent to
a different socket, but more often than not, we would first load up our
sibling core, or fill two cores of a single remote socket before selecting a
new one.
This gives large improvements to STREAM on such systems.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove the very aggressive idle stuff that has recently gone into 2.6 - it is
going against the direction we are trying to go. Hopefully we can regain
performance through other methods.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Do less affine wakeups. We're trying to reduce dbt2-pgsql idle time
regressions here... make sure we don't don't move tasks the wrong way in an
imbalance condition. Also, remove the cache coldness requirement from the
calculation - this seems to induce sharp cutoff points where behaviour will
suddenly change on some workloads if the load creeps slightly over or under
some point. It is good for periodic balancing because in that case have
otherwise have no other context to determine what task to move.
But also make a minor tweak to "wake balancing" - the imbalance tolerance is
now set at half the domain's imbalance, so we get the opportunity to do wake
balancing before the more random periodic rebalancing gets preformed.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Do CPU load averaging over a number of different intervals. Allow each
interval to be chosen by sending a parameter to source_load and target_load.
0 is instantaneous, idx > 0 returns a decaying average with the most recent
sample weighted at 2^(idx-1). To a maximum of 3 (could be easily increased).
So generally a higher number will result in more conservative balancing.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove the special casing for idle CPU balancing. Things like this are
hurting for example on SMT, where are single sibling being idle doesn't really
warrant a really aggressive pull over the NUMA domain, for example.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
These conditions should now be impossible, and we need to fix them if they
happen.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
SMT balancing has a couple of problems. Firstly, active_load_balance is too
complex - basically it should be a dumb helper for when the periodic balancer
has determined there is an imbalance, but gets stuck because the task is
running.
So rip out all its "smarts", and just make it move one task to the target CPU.
Second, the busy CPU's sched-domain tree was being used for active balancing.
This means that it may not see that nr_balance_failed has reached a critical
level. So use the target CPU's sched-domain tree for this. We can do this
because we hold its runqueue lock.
Lastly, reset nr_balance_failed to a point where we allow cache hot migration.
This will help ensure active load balancing is successful.
Thanks to Suresh Siddha for pointing out these issues.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix up active load balancing a bit so it doesn't get called when it shouldn't.
Reset the nr_balance_failed counter at more points where we have found
conditions to be balanced. This reduces too aggressive active balancing seen
on some workloads.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
John Hawkes explained the problem best:
A large number of processes that are pinned to a single CPU results
in every other CPU's load_balance() seeing this overloaded CPU as
"busiest", yet move_tasks() never finds a task to pull-migrate. This
condition occurs during module unload, but can also occur as a
denial-of-service using sys_sched_setaffinity(). Several hundred
CPUs performing this fruitless load_balance() will livelock on the
busiest CPU's runqueue lock. A smaller number of CPUs will livelock
if the pinned task count gets high.
Expanding slightly on John's patch, this one attempts to work out whether the
balancing failure has been due to too many tasks pinned on the runqueue. This
allows it to be basically invisible to the regular blancing paths (ie. when
there are no pinned tasks). We can use this extra knowledge to shut down the
balancing faster, and ensure the migration threads don't start running which
is another problem observed in the wild.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
New sched-domains code means we don't get spans with offline CPUs in
them.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Show swsuspend only on .config where it can compile. I got this on PPC32 &&
SMP:
kernel/power/smp.c:24: error: storage size of `ctxt' isn't known
Also mark swsusp as no longer experimental.
Signed-off-by: Olaf Hering <olh@suse.de>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In the cpu hotplug case, per-cpu data possibly isn't initialized even the
system state is 'running'. As the comments say in the original code, some
console drivers assume per-cpu resources have been allocated. radeon fb is
one such driver, which uses kmalloc. After a CPU is down, the per-cpu data
of slab is freed, so the system crashed when printing some info.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following patch moves the recalculation of nr_copy_pages so that the right
number is used in the calculation of the size of memory and swap needed.
It prevents swsusp from attempting to suspend if there is not enough memory
and/or swap (which is unlikely anyway).
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following patch cleans up whitespace in swsusp.c (a bit):
- removes any trailing whitespace
- adds spaces after if, for, for_each_pbe, for_each_zone etc., wherever
necessary.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following patch removes the unnecessary function does_collide_order().
This function is no longer necessary, as currently there are only 0-order
allocations in swsusp, and the use of it is confusing.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Without this patch, Linux provokes emergency disk shutdowns and
similar nastiness. It was in SuSE kernels for some time, IIRC.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Using CPU hotplug to support suspend/resume SMP. Both S3 and S4 use
disable/enable_nonboot_cpus API. The S4 part is based on Pavel's original S4
SMP patch.
Signed-off-by: Li Shaohua<shaohua.li@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
(The i386 CPU hotplug patch provides infrastructure for some work which Pavel
is doing as well as for ACPI S3 (suspend-to-RAM) work which Li Shaohua
<shaohua.li@intel.com> is doing)
The following provides i386 architecture support for safely unregistering and
registering processors during runtime, updated for the current -mm tree. In
order to avoid dumping cpu hotplug code into kernel/irq/* i dropped the
cpu_online check in do_IRQ() by modifying fixup_irqs(). The difference being
that on cpu offline, fixup_irqs() is called before we clear the cpu from
cpu_online_map and a long delay in order to ensure that we never have any
queued external interrupts on the APICs. There are additional changes to s390
and ppc64 to account for this change.
1) Add CONFIG_HOTPLUG_CPU
2) disable local APIC timer on dead cpus.
3) Disable preempt around irq balancing to prevent CPUs going down.
4) Print irq stats for all possible cpus.
5) Debugging check for interrupts on offline cpus.
6) Hacky fixup_irqs() to redirect irqs when cpus go off/online.
7) play_dead() for offline cpus to spin inside.
8) Handle offline cpus set in flush_tlb_others().
9) Grab lock earlier in smp_call_function() to prevent CPUs going down.
10) Implement __cpu_disable() and __cpu_die().
11) Enable local interrupts in cpu_enable() after fixup_irqs()
12) Don't fiddle with NMI on dead cpu, but leave intact on other cpus.
13) Program IRQ affinity whilst cpu is still in cpu_online_map on offline.
Signed-off-by: Zwane Mwaikambo <zwane@linuxpower.ca>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Don't look up the task by its pid and then use the syscall filtering
helper. Just implement our own filter helper which operates solely on
the information in the netlink_skb_parms.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
When the task refcounting was added to audit_filter_rules() it became
more of a problem that this function was violating the 'only one
return from each function' rule. In fixing it to use a variable to store
'ret' I stupidly neglected to actually change the 'return 1;' at the
end. This makes it not work very well.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Another rollup of patches which give various symbols static scope
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds version and srcversion files to
/sys/module/${modulename} containing the version and srcversion fields
of the module's modinfo section (if present).
/sys/module/e1000
|-- srcversion
`-- version
This patch differs slightly from the version posted in January, as it
now uses the new kstrdup() call in -mm.
Why put this in sysfs?
a) Tools like DKMS, which deal with changing out individual kernel
modules without replacing the whole kernel, can behave smarter if they
can tell the version of a given module. The autoinstaller feature, for
example, which determines if your system has a "good" version of a
driver (i.e. if the one provided by DKMS has a newer verson than that
provided by the kernel package installed), and to automatically compile
and install a newer version if DKMS has it but your kernel doesn't yet
have that version.
b) Because sysadmins manually, or with tools like DKMS, can switch out
modules on the file system, you can't count on 'modinfo foo.ko', which
looks at /lib/modules/${kernelver}/... actually matching what is loaded
into the kernel already. Hence asking sysfs for this.
c) as the unbind-driver-from-device work takes shape, it will be
possible to rebind a driver that's built-in (no .ko to modinfo for the
version) to a newly loaded module. sysfs will have the
currently-built-in version info, for comparison.
d) tech support scripts can then easily grab the version info for what's
running presently - a question I get often.
There has been renewed interest in this patch on linux-scsi by driver
authors.
As the idea originated from GregKH, I leave his Signed-off-by: intact,
though the implementation is nearly completely new. Compiled and run on
x86 and x86_64.
From: Matthew Dobson <colpatch@us.ibm.com>
build fix
From: Thierry Vignaud <tvignaud@mandriva.com>
build fix
From: Matthew Dobson <colpatch@us.ibm.com>
warning fix
Signed-off-by: Greg Kroah-Hartman <greg@kroah.com>
Signed-off-by: Matt Domsch <Matt_Domsch@dell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The attached patch makes the following changes:
(1) There's a new special key type called ".request_key_auth".
This is an authorisation key for when one process requests a key and
another process is started to construct it. This type of key cannot be
created by the user; nor can it be requested by kernel services.
Authorisation keys hold two references:
(a) Each refers to a key being constructed. When the key being
constructed is instantiated the authorisation key is revoked,
rendering it of no further use.
(b) The "authorising process". This is either:
(i) the process that called request_key(), or:
(ii) if the process that called request_key() itself had an
authorisation key in its session keyring, then the authorising
process referred to by that authorisation key will also be
referred to by the new authorisation key.
This means that the process that initiated a chain of key requests
will authorise the lot of them, and will, by default, wind up with
the keys obtained from them in its keyrings.
(2) request_key() creates an authorisation key which is then passed to
/sbin/request-key in as part of a new session keyring.
(3) When request_key() is searching for a key to hand back to the caller, if
it comes across an authorisation key in the session keyring of the
calling process, it will also search the keyrings of the process
specified therein and it will use the specified process's credentials
(fsuid, fsgid, groups) to do that rather than the calling process's
credentials.
This allows a process started by /sbin/request-key to find keys belonging
to the authorising process.
(4) A key can be read, even if the process executing KEYCTL_READ doesn't have
direct read or search permission if that key is contained within the
keyrings of a process specified by an authorisation key found within the
calling process's session keyring, and is searchable using the
credentials of the authorising process.
This allows a process started by /sbin/request-key to read keys belonging
to the authorising process.
(5) The magic KEY_SPEC_*_KEYRING key IDs when passed to KEYCTL_INSTANTIATE or
KEYCTL_NEGATE will specify a keyring of the authorising process, rather
than the process doing the instantiation.
(6) One of the process keyrings can be nominated as the default to which
request_key() should attach new keys if not otherwise specified. This is
done with KEYCTL_SET_REQKEY_KEYRING and one of the KEY_REQKEY_DEFL_*
constants. The current setting can also be read using this call.
(7) request_key() is partially interruptible. If it is waiting for another
process to finish constructing a key, it can be interrupted. This permits
a request-key cycle to be broken without recourse to rebooting.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-Off-By: Benoit Boissinot <benoit.boissinot@ens-lyon.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The attached patch makes it possible to pass a session keyring through to the
process spawned by call_usermodehelper(). This allows patch 3/3 to pass an
authorisation key through to /sbin/request-key, thus permitting better access
controls when doing just-in-time key creation.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In the upcoming aio_down patch, it is useful to store a private data
pointer in the kiocb's wait_queue. Since we provide our own wake up
function and do not require the task_struct pointer, it makes sense to
convert the task pointer into a generic private pointer.
Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Avoid taking the tasklist_lock in sys_times if the process is single
threaded. In a NUMA system taking the tasklist_lock may cause a bouncing
cacheline if multiple independent processes continually call sys_times to
measure their performance.
Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch fixes recalc_sigpending() to work correctly with tasks which are
being freezed.
The problem is that freeze_processes() sets PF_FREEZE and TIF_SIGPENDING
flags on tasks, but recalc_sigpending() called from e.g.
sys_rt_sigtimedwait or any other kernel place will clear TIF_SIGPENDING due
to no pending signals queued and the tasks won't be freezed until it
recieves a real signal or freezed_processes() fail due to timeout.
Signed-Off-By: Kirill Korotaev <dev@sw.ru>
Signed-Off-By: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a new `suid_dumpable' sysctl:
This value can be used to query and set the core dump mode for setuid
or otherwise protected/tainted binaries. The modes are
0 - (default) - traditional behaviour. Any process which has changed
privilege levels or is execute only will not be dumped
1 - (debug) - all processes dump core when possible. The core dump is
owned by the current user and no security is applied. This is intended
for system debugging situations only. Ptrace is unchecked.
2 - (suidsafe) - any binary which normally would not be dumped is dumped
readable by root only. This allows the end user to remove such a dump but
not access it directly. For security reasons core dumps in this mode will
not overwrite one another or other files. This mode is appropriate when
adminstrators are attempting to debug problems in a normal environment.
(akpm:
> > +EXPORT_SYMBOL(suid_dumpable);
>
> EXPORT_SYMBOL_GPL?
No problem to me.
> > if (current->euid == current->uid && current->egid == current->gid)
> > current->mm->dumpable = 1;
>
> Should this be SUID_DUMP_USER?
Actually the feedback I had from last time was that the SUID_ defines
should go because its clearer to follow the numbers. They can go
everywhere (and there are lots of places where dumpable is tested/used
as a bool in untouched code)
> Maybe this should be renamed to `dump_policy' or something. Doing that
> would help us catch any code which isn't using the #defines, too.
Fair comment. The patch was designed to be easy to maintain for Red Hat
rather than for merging. Changing that field would create a gigantic
diff because it is used all over the place.
)
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>