forked from luck/tmp_suning_uos_patched
db9a0975a2
Rename the ia64 documentation files to ReST, add an index for them and adjust in order to produce a nice html output via the Sphinx build system. There are two upper case file names. Rename them to lower case, as we're working to avoid upper case file names at Documentation. At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings. Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
304 lines
12 KiB
ReStructuredText
304 lines
12 KiB
ReStructuredText
===================================
|
|
Light-weight System Calls for IA-64
|
|
===================================
|
|
|
|
Started: 13-Jan-2003
|
|
|
|
Last update: 27-Sep-2003
|
|
|
|
David Mosberger-Tang
|
|
<davidm@hpl.hp.com>
|
|
|
|
Using the "epc" instruction effectively introduces a new mode of
|
|
execution to the ia64 linux kernel. We call this mode the
|
|
"fsys-mode". To recap, the normal states of execution are:
|
|
|
|
- kernel mode:
|
|
Both the register stack and the memory stack have been
|
|
switched over to kernel memory. The user-level state is saved
|
|
in a pt-regs structure at the top of the kernel memory stack.
|
|
|
|
- user mode:
|
|
Both the register stack and the kernel stack are in
|
|
user memory. The user-level state is contained in the
|
|
CPU registers.
|
|
|
|
- bank 0 interruption-handling mode:
|
|
This is the non-interruptible state which all
|
|
interruption-handlers start execution in. The user-level
|
|
state remains in the CPU registers and some kernel state may
|
|
be stored in bank 0 of registers r16-r31.
|
|
|
|
In contrast, fsys-mode has the following special properties:
|
|
|
|
- execution is at privilege level 0 (most-privileged)
|
|
|
|
- CPU registers may contain a mixture of user-level and kernel-level
|
|
state (it is the responsibility of the kernel to ensure that no
|
|
security-sensitive kernel-level state is leaked back to
|
|
user-level)
|
|
|
|
- execution is interruptible and preemptible (an fsys-mode handler
|
|
can disable interrupts and avoid all other interruption-sources
|
|
to avoid preemption)
|
|
|
|
- neither the memory-stack nor the register-stack can be trusted while
|
|
in fsys-mode (they point to the user-level stacks, which may
|
|
be invalid, or completely bogus addresses)
|
|
|
|
In summary, fsys-mode is much more similar to running in user-mode
|
|
than it is to running in kernel-mode. Of course, given that the
|
|
privilege level is at level 0, this means that fsys-mode requires some
|
|
care (see below).
|
|
|
|
|
|
How to tell fsys-mode
|
|
=====================
|
|
|
|
Linux operates in fsys-mode when (a) the privilege level is 0 (most
|
|
privileged) and (b) the stacks have NOT been switched to kernel memory
|
|
yet. For convenience, the header file <asm-ia64/ptrace.h> provides
|
|
three macros::
|
|
|
|
user_mode(regs)
|
|
user_stack(task,regs)
|
|
fsys_mode(task,regs)
|
|
|
|
The "regs" argument is a pointer to a pt_regs structure. The "task"
|
|
argument is a pointer to the task structure to which the "regs"
|
|
pointer belongs to. user_mode() returns TRUE if the CPU state pointed
|
|
to by "regs" was executing in user mode (privilege level 3).
|
|
user_stack() returns TRUE if the state pointed to by "regs" was
|
|
executing on the user-level stack(s). Finally, fsys_mode() returns
|
|
TRUE if the CPU state pointed to by "regs" was executing in fsys-mode.
|
|
The fsys_mode() macro is equivalent to the expression::
|
|
|
|
!user_mode(regs) && user_stack(task,regs)
|
|
|
|
How to write an fsyscall handler
|
|
================================
|
|
|
|
The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers
|
|
(fsyscall_table). This table contains one entry for each system call.
|
|
By default, a system call is handled by fsys_fallback_syscall(). This
|
|
routine takes care of entering (full) kernel mode and calling the
|
|
normal Linux system call handler. For performance-critical system
|
|
calls, it is possible to write a hand-tuned fsyscall_handler. For
|
|
example, fsys.S contains fsys_getpid(), which is a hand-tuned version
|
|
of the getpid() system call.
|
|
|
|
The entry and exit-state of an fsyscall handler is as follows:
|
|
|
|
Machine state on entry to fsyscall handler
|
|
------------------------------------------
|
|
|
|
========= ===============================================================
|
|
r10 0
|
|
r11 saved ar.pfs (a user-level value)
|
|
r15 system call number
|
|
r16 "current" task pointer (in normal kernel-mode, this is in r13)
|
|
r32-r39 system call arguments
|
|
b6 return address (a user-level value)
|
|
ar.pfs previous frame-state (a user-level value)
|
|
PSR.be cleared to zero (i.e., little-endian byte order is in effect)
|
|
- all other registers may contain values passed in from user-mode
|
|
========= ===============================================================
|
|
|
|
Required machine state on exit to fsyscall handler
|
|
--------------------------------------------------
|
|
|
|
========= ===========================================================
|
|
r11 saved ar.pfs (as passed into the fsyscall handler)
|
|
r15 system call number (as passed into the fsyscall handler)
|
|
r32-r39 system call arguments (as passed into the fsyscall handler)
|
|
b6 return address (as passed into the fsyscall handler)
|
|
ar.pfs previous frame-state (as passed into the fsyscall handler)
|
|
========= ===========================================================
|
|
|
|
Fsyscall handlers can execute with very little overhead, but with that
|
|
speed comes a set of restrictions:
|
|
|
|
* Fsyscall-handlers MUST check for any pending work in the flags
|
|
member of the thread-info structure and if any of the
|
|
TIF_ALLWORK_MASK flags are set, the handler needs to fall back on
|
|
doing a full system call (by calling fsys_fallback_syscall).
|
|
|
|
* Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11,
|
|
r15, b6, and ar.pfs) because they will be needed in case of a
|
|
system call restart. Of course, all "preserved" registers also
|
|
must be preserved, in accordance to the normal calling conventions.
|
|
|
|
* Fsyscall-handlers MUST check argument registers for containing a
|
|
NaT value before using them in any way that could trigger a
|
|
NaT-consumption fault. If a system call argument is found to
|
|
contain a NaT value, an fsyscall-handler may return immediately
|
|
with r8=EINVAL, r10=-1.
|
|
|
|
* Fsyscall-handlers MUST NOT use the "alloc" instruction or perform
|
|
any other operation that would trigger mandatory RSE
|
|
(register-stack engine) traffic.
|
|
|
|
* Fsyscall-handlers MUST NOT write to any stacked registers because
|
|
it is not safe to assume that user-level called a handler with the
|
|
proper number of arguments.
|
|
|
|
* Fsyscall-handlers need to be careful when accessing per-CPU variables:
|
|
unless proper safe-guards are taken (e.g., interruptions are avoided),
|
|
execution may be pre-empted and resumed on another CPU at any given
|
|
time.
|
|
|
|
* Fsyscall-handlers must be careful not to leak sensitive kernel'
|
|
information back to user-level. In particular, before returning to
|
|
user-level, care needs to be taken to clear any scratch registers
|
|
that could contain sensitive information (note that the current
|
|
task pointer is not considered sensitive: it's already exposed
|
|
through ar.k6).
|
|
|
|
* Fsyscall-handlers MUST NOT access user-memory without first
|
|
validating access-permission (this can be done typically via
|
|
probe.r.fault and/or probe.w.fault) and without guarding against
|
|
memory access exceptions (this can be done with the EX() macros
|
|
defined by asmmacro.h).
|
|
|
|
The above restrictions may seem draconian, but remember that it's
|
|
possible to trade off some of the restrictions by paying a slightly
|
|
higher overhead. For example, if an fsyscall-handler could benefit
|
|
from the shadow register bank, it could temporarily disable PSR.i and
|
|
PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as
|
|
needed. In other words, following the above rules yields extremely
|
|
fast system call execution (while fully preserving system call
|
|
semantics), but there is also a lot of flexibility in handling more
|
|
complicated cases.
|
|
|
|
Signal handling
|
|
===============
|
|
|
|
The delivery of (asynchronous) signals must be delayed until fsys-mode
|
|
is exited. This is accomplished with the help of the lower-privilege
|
|
transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user()
|
|
checks whether the interrupted task was in fsys-mode and, if so, sets
|
|
PSR.lp and returns immediately. When fsys-mode is exited via the
|
|
"br.ret" instruction that lowers the privilege level, a trap will
|
|
occur. The trap handler clears PSR.lp again and returns immediately.
|
|
The kernel exit path then checks for and delivers any pending signals.
|
|
|
|
PSR Handling
|
|
============
|
|
|
|
The "epc" instruction doesn't change the contents of PSR at all. This
|
|
is in contrast to a regular interruption, which clears almost all
|
|
bits. Because of that, some care needs to be taken to ensure things
|
|
work as expected. The following discussion describes how each PSR bit
|
|
is handled.
|
|
|
|
======= =======================================================================
|
|
PSR.be Cleared when entering fsys-mode. A srlz.d instruction is used
|
|
to ensure the CPU is in little-endian mode before the first
|
|
load/store instruction is executed. PSR.be is normally NOT
|
|
restored upon return from an fsys-mode handler. In other
|
|
words, user-level code must not rely on PSR.be being preserved
|
|
across a system call.
|
|
PSR.up Unchanged.
|
|
PSR.ac Unchanged.
|
|
PSR.mfl Unchanged. Note: fsys-mode handlers must not write-registers!
|
|
PSR.mfh Unchanged. Note: fsys-mode handlers must not write-registers!
|
|
PSR.ic Unchanged. Note: fsys-mode handlers can clear the bit, if needed.
|
|
PSR.i Unchanged. Note: fsys-mode handlers can clear the bit, if needed.
|
|
PSR.pk Unchanged.
|
|
PSR.dt Unchanged.
|
|
PSR.dfl Unchanged. Note: fsys-mode handlers must not write-registers!
|
|
PSR.dfh Unchanged. Note: fsys-mode handlers must not write-registers!
|
|
PSR.sp Unchanged.
|
|
PSR.pp Unchanged.
|
|
PSR.di Unchanged.
|
|
PSR.si Unchanged.
|
|
PSR.db Unchanged. The kernel prevents user-level from setting a hardware
|
|
breakpoint that triggers at any privilege level other than
|
|
3 (user-mode).
|
|
PSR.lp Unchanged.
|
|
PSR.tb Lazy redirect. If a taken-branch trap occurs while in
|
|
fsys-mode, the trap-handler modifies the saved machine state
|
|
such that execution resumes in the gate page at
|
|
syscall_via_break(), with privilege level 3. Note: the
|
|
taken branch would occur on the branch invoking the
|
|
fsyscall-handler, at which point, by definition, a syscall
|
|
restart is still safe. If the system call number is invalid,
|
|
the fsys-mode handler will return directly to user-level. This
|
|
return will trigger a taken-branch trap, but since the trap is
|
|
taken _after_ restoring the privilege level, the CPU has already
|
|
left fsys-mode, so no special treatment is needed.
|
|
PSR.rt Unchanged.
|
|
PSR.cpl Cleared to 0.
|
|
PSR.is Unchanged (guaranteed to be 0 on entry to the gate page).
|
|
PSR.mc Unchanged.
|
|
PSR.it Unchanged (guaranteed to be 1).
|
|
PSR.id Unchanged. Note: the ia64 linux kernel never sets this bit.
|
|
PSR.da Unchanged. Note: the ia64 linux kernel never sets this bit.
|
|
PSR.dd Unchanged. Note: the ia64 linux kernel never sets this bit.
|
|
PSR.ss Lazy redirect. If set, "epc" will cause a Single Step Trap to
|
|
be taken. The trap handler then modifies the saved machine
|
|
state such that execution resumes in the gate page at
|
|
syscall_via_break(), with privilege level 3.
|
|
PSR.ri Unchanged.
|
|
PSR.ed Unchanged. Note: This bit could only have an effect if an fsys-mode
|
|
handler performed a speculative load that gets NaTted. If so, this
|
|
would be the normal & expected behavior, so no special treatment is
|
|
needed.
|
|
PSR.bn Unchanged. Note: fsys-mode handlers may clear the bit, if needed.
|
|
Doing so requires clearing PSR.i and PSR.ic as well.
|
|
PSR.ia Unchanged. Note: the ia64 linux kernel never sets this bit.
|
|
======= =======================================================================
|
|
|
|
Using fast system calls
|
|
=======================
|
|
|
|
To use fast system calls, userspace applications need simply call
|
|
__kernel_syscall_via_epc(). For example
|
|
|
|
-- example fgettimeofday() call --
|
|
|
|
-- fgettimeofday.S --
|
|
|
|
::
|
|
|
|
#include <asm/asmmacro.h>
|
|
|
|
GLOBAL_ENTRY(fgettimeofday)
|
|
.prologue
|
|
.save ar.pfs, r11
|
|
mov r11 = ar.pfs
|
|
.body
|
|
|
|
mov r2 = 0xa000000000020660;; // gate address
|
|
// found by inspection of System.map for the
|
|
// __kernel_syscall_via_epc() function. See
|
|
// below for how to do this for real.
|
|
|
|
mov b7 = r2
|
|
mov r15 = 1087 // gettimeofday syscall
|
|
;;
|
|
br.call.sptk.many b6 = b7
|
|
;;
|
|
|
|
.restore sp
|
|
|
|
mov ar.pfs = r11
|
|
br.ret.sptk.many rp;; // return to caller
|
|
END(fgettimeofday)
|
|
|
|
-- end fgettimeofday.S --
|
|
|
|
In reality, getting the gate address is accomplished by two extra
|
|
values passed via the ELF auxiliary vector (include/asm-ia64/elf.h)
|
|
|
|
* AT_SYSINFO : is the address of __kernel_syscall_via_epc()
|
|
* AT_SYSINFO_EHDR : is the address of the kernel gate ELF DSO
|
|
|
|
The ELF DSO is a pre-linked library that is mapped in by the kernel at
|
|
the gate page. It is a proper ELF shared object so, with a dynamic
|
|
loader that recognises the library, you should be able to make calls to
|
|
the exported functions within it as with any other shared library.
|
|
AT_SYSINFO points into the kernel DSO at the
|
|
__kernel_syscall_via_epc() function for historical reasons (it was
|
|
used before the kernel DSO) and as a convenience.
|