Replace all remaining callers of alloc_maybe_bootmem with
zalloc_maybe_bootmem. The callsite in pci_dn is followed with a
memset to clear the memory, and not zeroing at the other callsites
in the celleb fake pci code could lead to following uninitialized
memory as pointers or even freeing said pointers on error paths.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Now that smp_ops->smp_message_pass is always called with an (online) cpu
number for the target remove the checks for MSG_ALL and MSG_ALL_BUT_SELF.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The only user of MSG_ALL_BUT_SELF in the whole kernel tree is powerpc,
and it only uses it to start the debugger. Both debuggers always call
smp_send_debugger_break with MSG_ALL_BUT_SELF, and only mpic can do
anything more optimal than a loop over all online cpus, but all message
passing implementations have to code for this special delivery target.
Convert smp_send_debugger_break to take void and loop calling the smp_ops
message_pass function for each of the other cpus in the online cpumask.
Use raw_smp_processor_id() because we are either entering the debugger
or trying to start kdump and the additional warning it not useful were
it to trigger.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
c1854e0072 (powerpc: Set nr_cpu_ids early
and use it to free PACAs) copied the formerly static setup_nr_cpu_ids
from init/main.c but 34db18a054 (smp:
move smp setup functions to kernel/smp.c) moved it to kernel/smp.c
with a declaration in include/linux/smp.h, so we can call it instead of
replicating it.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Now that we never set a cpu above nr_cpu_ids possible we can
limit our initial paca allocation to nr_cpu_ids. We can then
clamp the number of cpus in platforms/iseries/setup.c.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We should not set cpus above nr_cpu_ids to possible. While we
will trigger a warning with CONFIG_CPUMASK_DEBUG, even then the mask
initializers will set the bits beyond what the iterators check and cause
nr_cpu_ids to increase.
Respecting nr_cpu_ids during setup will allow us to use it in our initial
paca allocation. It can be reduced from NR_CPUS by the existing early param
nr_cpus=, which was added in 2b633e3fac (smp:
Use nr_cpus= to set nr_cpu_ids early). We already call parse_early_parms
between finding the command line and allocating the pacas.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Starting with 1426d5a3bd (powerpc:
Dynamically allocate pacas) the space for pacas beyond cpu_possible
is freed, but we failed to update the loop in crash.c.
Since c1854e0072 (powerpc: Set nr_cpu_ids
early and use it to free PACAs) the number of pacas allocated is
always nr_cpu_ids.
Signed-off-by: Milton Miller <miltonm@bga.com>
Cc: <stable@kernel.org> # .34.x
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Starting with 1426d5a3bd (powerpc:
Dynamically allocate pacas) we free the memory for pacas beyond
cpu_possible, but we failed to update the loop the secondary cpus use
to find their paca. If the system has running cpu threads for which
the kernel did not allocate a paca for they will search the memory that
was freed. For instance this could happen when the device tree for
a kdump kernel was not updated after a cpu hotplug, or the kernel is
running with more cpus than the kernel was configured.
Since c1854e0072 (powerpc: Set nr_cpu_ids
early and use it to free PACAs) we set nr_cpu_ids before telling the
cpus to advance, so use that to limit the search.
We can't reference nr_cpu_ids without CONFIG_SMP because it is defined
as 1 instead of a memory location, but any extra threads should be sent
to kexec_wait in that case anyways, so make that explicit and remove
the search loop for UP.
Note to stable: The fix also requires
c1854e0072 (powerpc: Set
nr_cpu_ids early and use it to free PACAs) to function. Also
9d07bc841c (Properly handshake CPUs going
out of boot spin loop) affects the second chunk, specifically the branch
target was 3b before and is 4b after that patch, and there was a blank
line before the #ifdef CONFIG_SMP that was removed
Cc: <stable@kernel.org> # .34.x: c1854e0072 powerpc: Set nr_cpu_ids early
Cc: <stable@kernel.org> # .34.x
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Commit 1fc711f7ff (powerpc/kexec: Fix race
in kexec shutdown) moved the write to signal the cpu had exited the kernel
from before the transition to real mode in kexec_smp_wait to kexec_wait.
Unfornately it missed that kexec_wait is used both by cpus leaving the
kernel and by secondary slave cpus that were not allocated a paca for
what ever reason -- they could be beyond nr_cpus or not described in
the current device tree for whatever reason (for example, kexec-load
was not refreshed after a cpu hotplug operation). Cpus coming through
that path they will write to paca[NR_CPUS] which is beyond the space
allocated for the paca data and overwrite memory not allocated to pacas
but very likely still real mode accessable).
Move the write back to kexec_smp_wait, which is used only by cpus that
found their paca, but after the transition to real mode.
Signed-off-by: Milton Miller <miltonm@bga.com>
Cc: <stable@kernel.org> # (1fc711f was backported to 2.6.32)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
After looking at our system call path, Mary Brown suggested that we
should put all mfspr SRR* instructions before any mtspr SRR*.
To test this I used a very simple null syscall (actually getppid)
testcase at http://ozlabs.org/~anton/junkcode/null_syscall.c
I tested with the following changes against the pseries_defconfig:
CONFIG_VIRT_CPU_ACCOUNTING=n
CONFIG_AUDIT=n
to remove the overhead of virtual CPU accounting and syscall
auditing.
POWER6:
baseline: mean = 757.2 cycles sd = 2.108
modified: mean = 759.1 cycles sd = 2.020
POWER7:
baseline: mean = 411.4 cycles sd = 0.138
modified: mean = 404.1 cycles sd = 0.109
So we have 1.77% improvement on POWER7 which looks significant. The
POWER6 suggest a 0.25% slowdown, but the results are within 1
standard deviation and may be in the noise.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
A static branch hint will override dynamic branch prediction on
recent POWER CPUs. Since we are about to use more altivec in the
kernel remove the static hint in giveup_altivec that assumes
a userspace task is using altivec.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
To make it easier to add optimised versions of copy_page, remove
the 4kB loop for 64kB pages and just do all the work in copy_page.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch below removes an unused config variable found by using a kernel
cleanup script.
Note: I did try to cross compile these but hit erros while doing so..
(gcc is not setup to cross compile) and am unsure if anymore needs to be done.
Please have a look if/when anybody has free time.
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Jack Miller <jack@codezen.org>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Add a platform for the Wire Speed Processor, based on the PPC A2.
This includes code for the ICS & OPB interrupt controllers, as well
as a SCOM backend, and SCOM based cpu bringup.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jack Miller <jack@codezen.org>
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
slb0_limit() wasn't a very descriptive name. This changes it along with
a comment explaining what it's used for, and provides a 64-bit BookE
implementation.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Adds support for page coalescing, which is a feature on IBM Power servers
which allows for coalescing identical pages between logical partitions.
Hint text pages as coalesce candidates, since they are the most likely
pages to be able to be coalesced between partitions. This patch also
exports some page coalescing statistics available from firmware via
lparcfg.
[BenH: Moved a couple of things around to fix compile problems]
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Commit b987812b3f left
crash_kexec_wait_realmode() undefined for UP.
Commit 7c7a81b53e defined it for UP but
left it undefined for 32-bit SMP.
Seems like people are getting confused by nested #ifdef's, so move the
definitions of crash_kexec_wait_realmode() after the #ifdef CONFIG_SMP
section.
Compile-tested with 32-bit UP, 32-bit SMP and 64-bit SMP configurations.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Adapt new API.
Almost change is trivial. Most important change is the below line
because we plan to change task->cpus_allowed implementation.
- ctx->cpus_allowed = current->cpus_allowed;
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Recent 64-bit server processors (POWER6 and POWER7) have a "Come-From
Address Register" (CFAR), that records the address of the most recent
branch or rfid (return from interrupt) instruction for debugging purposes.
This saves the value of the CFAR in the exception entry code and stores
it in the exception frame. We also make xmon print the CFAR value in
its register dump code.
Rather than extend the pt_regs struct at this time, we steal the orig_gpr3
field, which is only used for system calls, and use it for the CFAR value
for all exceptions/interrupts other than system calls. This means we
don't save the CFAR on system calls, which is not a great problem since
system calls tend not to happen unexpectedly, and also avoids adding the
overhead of reading the CFAR to the system call entry path.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When we take an interrupt or exception from kernel mode and the stack
pointer is obviously not a kernel address (i.e. the top bit is 0), we
switch to an emergency stack, save register values and panic. However,
on 64-bit server machines, we don't actually save the values of r9 - r13
at the time of the interrupt, but rather values corrupted by the
exception entry code for r12-r13, and nothing at all for r9-r11.
This fixes it by passing a pointer to the register save area in the paca
through to the bad_stack code in r3. The register values are saved in
one of the paca register save areas (depending on which exception this
is). Using the pointer in r3, the bad_stack code now retrieves the
saved values of r9 - r13 and stores them in the exception frame on the
emergency stack. This also stores the normal exception frame marker
("regshere") in the exception frame.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Some of the 64bit PPC CPU features are MMU-related, so this patch moves
them to MMU_FTR_ bits. All cpu_has_feature()-style tests are moved to
mmu_has_feature(), and seven feature bits are freed as a result.
Signed-off-by: Matt Evans <matt@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
RTAS returns extended error codes as a hint of how long the
OS might want to wait before retrying a call. If we have nothing
else useful to do we may as well call back straight away.
This was found when testing the new dynamic dma window feature.
Firmware split the zeroing of the TCE table into 32k chunks but
returned 9901 (which is a suggested wait of 10ms). All up this took
about 10 minutes to complete since msleep is jiffies based and will
round 10ms up to 20ms.
With the patch below we take 3 seconds to complete the same test.
The hint firmware is returning in the RTAS call should definitely
be decreased, but even if we slept 1ms each iteration this would
take 32s.
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Use the new MSR_64BIT in a few places. Some of these are already ifdef'ed
for BOOKE vs BOOKS, but it's still clearer, MSR_SF does not immediately
parse as "MSR bit for 64bit".
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This can be useful for differentiating interrupts on the same host
but with different chip data.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Even when no initfunc is provided.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The DSCR (aka Data Stream Control Register) is supported on some
server PowerPC chips and allow some control over the prefetch
of data streams.
This patch allows the value to be specified per thread by emulating
the corresponding mfspr and mtspr instructions. Children of such
threads inherit the value. Other threads use a default value that
can be specified in sysfs - /sys/devices/system/cpu/dscr_default.
If a thread starts with non default value in the sysfs entry,
all children threads inherit this non default value even if
the sysfs value is changed later.
Signed-off-by: Alexey Kardashevskiy <aik@au1.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When we set up the TLB for ourselves on Book3E, we need to flush out any
old mappings established by the firmware or bootloader. At present we
attempt this with a tlbilx to flush everything, but this will leave behind
any entries with the IPROT bit set.
There are several good reason firmware might establish mappings with IPROT,
and in fact ePAPR compliant firmwares are required to establish their
initial mapped area with IPROT.
This patch, therefore adds more complex code to scan through the TLB upon
entry and flush away any entries that are not our own.
Signed-off-by: Jack Miller <jack@codezen.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
An erratum on A2 can lead to the bolted entry we insert for the linear
mapping being evicted, to avoid that write the bolted entry to way 3.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In exc_lvl_ctx_init() we index into the crit/dbg/mcheck stacks using
the hard cpu id, but that assumes the hard cpu id is zero based and
contiguous. That is not the case on A2.
The root of the problem is that the 32bit code has no equivalent of the
paca to allow it to do the hard->soft mapping in assembler. Until the
32bit code is updated to handle that, index the stacks using the soft
cpu ids on 64bit and hard on 32 bit.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Add the cputable entry, regs and setup & restore entries for
the PowerPC A2 core.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When we start a cpu we use smp_ops->kick_cpu(), which currently
returns void, it should be able to fail. Convert it to return
int, and update all uses.
Convert all the current error cases to return -ENOENT, which is
what would eventually be returned by __cpu_up() currently when
it doesn't detect the cpu as coming up in time.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We need to do that to guarantee they see any code change done by
dynamic patching during boot.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Wakeup comes from the system reset handler with a potential loss of
the non-hypervisor CPU state. We save the non-volatile state on the
stack and a pointer to it in the PACA, which the system reset handler
uses to restore things
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We need to wait a bit for them to have done their CPU setup
or we might end up with translation and EE on with different
LPCR values between threads
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We do it before we loop on the PACA start flag. This way, we get a
chance to set critical SPRs on all CPUs before Linux tries to start
them up, which avoids problems when changing some bits such as LPCR
bits that need to be identical on all threads of a core or similar
things like that. Ideally, some of that should also be done before
the MMU is enabled, but that's a separate issue which would require
moving some of the SMP startup code earlier, let's not get there
for now, it works with that change alone.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This sets the default data stream prefetch size for operating
systems that don't set their own value in DSCR. We use 4 which
is "medium".
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This uses feature sections to arrange that we always use HSPRG1
as the scratch register in the interrupt entry code rather than
SPRG2 when we're running in hypervisor mode on POWER7. This will
ensure that we don't trash the guest's SPRG2 when we are running
KVM guests. To simplify the code, we define GET_SCRATCH0() and
SET_SCRATCH0() macros like the GET_PACA/SET_PACA macros.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Rework exception macros a bit to split offset from vector and add
some basic support for HDEC, HDSI, HISI and a few more.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pass the register type to the prolog, also provides alternate "HV"
version of hardware interrupt (0x500) and adjust LPES accordingly
We tag those interrupts by setting bit 0x2 in the trap number
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When running in Hypervisor mode (arch 2.06 or later), we store the PACA
in HSPRG0 instead of SPRG1. The architecture specifies that SPRGs may be
lost during a "nap" power management operation (though they aren't
currently on POWER7) and this enables use of SPRG1 by KVM guests.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This bit indicates that we are operating in hypervisor mode on a CPU
compliant to architecture 2.06 or later (currently server only).
We set it on POWER7 and have a boot-time CPU setup function that
clears it if MSR:HV isn't set (booting under a hypervisor).
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Because of speculative event roll back, it is possible for some event coutners
to decrease between reads on POWER7. This causes a problem with the way that
counters are updated. Delta calues are calculated in a 64 bit value and the
top 32 bits are masked. If the register value has decreased, this leaves us
with a very large positive value added to the kernel counters. This patch
protects against this by skipping the update if the delta would be negative.
This can lead to a lack of precision in the coutner values, but from my testing
the value is typcially fewer than 10 samples at a time.
Signed-off-by: Eric B Munson <emunson@mgebm.net>
Cc: stable@kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We currently enable interrupts before the dispatch log for the boot
cpu is setup. If a timer interrupt comes in early enough we oops in
scan_dispatch_log:
Unable to handle kernel paging request for data at address 0x00000010
...
.scan_dispatch_log+0xb0/0x170
.account_system_vtime+0xa0/0x220
.irq_enter+0x88/0xc0
.do_IRQ+0x48/0x230
The patch below adds a check to scan_dispatch_log to ensure the
dispatch log has been allocated.
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: <stable@kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Recent commit b987812b3f caused
a compile failure on UP because a considerably large block
of the file was included within CONFIG_SMP, hence making a stub
function not exposed on UP builds when it needed to be.
Relocate the stub to the #else /* ! CONFIG_SMP */ section
and also annotate the relevant else/endif so that nobody
else falls into the same trap I did.
Reported-by: Michael Guntsche <mike@it-loops.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>