8ce156deca
Documentation/process/botching-up-ioctls.rst was orignally written as a blog post for DRM driver writers, so it it misses some points while going into a lot of detail on others. Try to provide a replacement that addresses typical issues across a wider range of subsystems, and follows the style of the core-api documentation better. Many improvements to the document are suggested by Ben Hutchings <ben.hutchings@codethink.co.uk>, Jonathan Corbet <corbet@lwn.net> and Geert Uytterhoeven <geert@linux-m68k.org>. Reviewed-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
254 lines
10 KiB
ReStructuredText
254 lines
10 KiB
ReStructuredText
======================
|
||
ioctl based interfaces
|
||
======================
|
||
|
||
ioctl() is the most common way for applications to interface
|
||
with device drivers. It is flexible and easily extended by adding new
|
||
commands and can be passed through character devices, block devices as
|
||
well as sockets and other special file descriptors.
|
||
|
||
However, it is also very easy to get ioctl command definitions wrong,
|
||
and hard to fix them later without breaking existing applications,
|
||
so this documentation tries to help developers get it right.
|
||
|
||
Command number definitions
|
||
==========================
|
||
|
||
The command number, or request number, is the second argument passed to
|
||
the ioctl system call. While this can be any 32-bit number that uniquely
|
||
identifies an action for a particular driver, there are a number of
|
||
conventions around defining them.
|
||
|
||
``include/uapi/asm-generic/ioctl.h`` provides four macros for defining
|
||
ioctl commands that follow modern conventions: ``_IO``, ``_IOR``,
|
||
``_IOW``, and ``_IOWR``. These should be used for all new commands,
|
||
with the correct parameters:
|
||
|
||
_IO/_IOR/_IOW/_IOWR
|
||
The macro name specifies how the argument will be used. It may be a
|
||
pointer to data to be passed into the kernel (_IOW), out of the kernel
|
||
(_IOR), or both (_IOWR). _IO can indicate either commands with no
|
||
argument or those passing an integer value instead of a pointer.
|
||
It is recommended to only use _IO for commands without arguments,
|
||
and use pointers for passing data.
|
||
|
||
type
|
||
An 8-bit number, often a character literal, specific to a subsystem
|
||
or driver, and listed in :doc:`../userspace-api/ioctl/ioctl-number`
|
||
|
||
nr
|
||
An 8-bit number identifying the specific command, unique for a give
|
||
value of 'type'
|
||
|
||
data_type
|
||
The name of the data type pointed to by the argument, the command number
|
||
encodes the ``sizeof(data_type)`` value in a 13-bit or 14-bit integer,
|
||
leading to a limit of 8191 bytes for the maximum size of the argument.
|
||
Note: do not pass sizeof(data_type) type into _IOR/_IOW/IOWR, as that
|
||
will lead to encoding sizeof(sizeof(data_type)), i.e. sizeof(size_t).
|
||
_IO does not have a data_type parameter.
|
||
|
||
|
||
Interface versions
|
||
==================
|
||
|
||
Some subsystems use version numbers in data structures to overload
|
||
commands with different interpretations of the argument.
|
||
|
||
This is generally a bad idea, since changes to existing commands tend
|
||
to break existing applications.
|
||
|
||
A better approach is to add a new ioctl command with a new number. The
|
||
old command still needs to be implemented in the kernel for compatibility,
|
||
but this can be a wrapper around the new implementation.
|
||
|
||
Return code
|
||
===========
|
||
|
||
ioctl commands can return negative error codes as documented in errno(3);
|
||
these get turned into errno values in user space. On success, the return
|
||
code should be zero. It is also possible but not recommended to return
|
||
a positive 'long' value.
|
||
|
||
When the ioctl callback is called with an unknown command number, the
|
||
handler returns either -ENOTTY or -ENOIOCTLCMD, which also results in
|
||
-ENOTTY being returned from the system call. Some subsystems return
|
||
-ENOSYS or -EINVAL here for historic reasons, but this is wrong.
|
||
|
||
Prior to Linux 5.5, compat_ioctl handlers were required to return
|
||
-ENOIOCTLCMD in order to use the fallback conversion into native
|
||
commands. As all subsystems are now responsible for handling compat
|
||
mode themselves, this is no longer needed, but it may be important to
|
||
consider when backporting bug fixes to older kernels.
|
||
|
||
Timestamps
|
||
==========
|
||
|
||
Traditionally, timestamps and timeout values are passed as ``struct
|
||
timespec`` or ``struct timeval``, but these are problematic because of
|
||
incompatible definitions of these structures in user space after the
|
||
move to 64-bit time_t.
|
||
|
||
The ``struct __kernel_timespec`` type can be used instead to be embedded
|
||
in other data structures when separate second/nanosecond values are
|
||
desired, or passed to user space directly. This is still not ideal though,
|
||
as the structure matches neither the kernel's timespec64 nor the user
|
||
space timespec exactly. The get_timespec64() and put_timespec64() helper
|
||
functions can be used to ensure that the layout remains compatible with
|
||
user space and the padding is treated correctly.
|
||
|
||
As it is cheap to convert seconds to nanoseconds, but the opposite
|
||
requires an expensive 64-bit division, a simple __u64 nanosecond value
|
||
can be simpler and more efficient.
|
||
|
||
Timeout values and timestamps should ideally use CLOCK_MONOTONIC time,
|
||
as returned by ktime_get_ns() or ktime_get_ts64(). Unlike
|
||
CLOCK_REALTIME, this makes the timestamps immune from jumping backwards
|
||
or forwards due to leap second adjustments and clock_settime() calls.
|
||
|
||
ktime_get_real_ns() can be used for CLOCK_REALTIME timestamps that
|
||
need to be persistent across a reboot or between multiple machines.
|
||
|
||
32-bit compat mode
|
||
==================
|
||
|
||
In order to support 32-bit user space running on a 64-bit machine, each
|
||
subsystem or driver that implements an ioctl callback handler must also
|
||
implement the corresponding compat_ioctl handler.
|
||
|
||
As long as all the rules for data structures are followed, this is as
|
||
easy as setting the .compat_ioctl pointer to a helper function such as
|
||
compat_ptr_ioctl() or blkdev_compat_ptr_ioctl().
|
||
|
||
compat_ptr()
|
||
------------
|
||
|
||
On the s390 architecture, 31-bit user space has ambiguous representations
|
||
for data pointers, with the upper bit being ignored. When running such
|
||
a process in compat mode, the compat_ptr() helper must be used to
|
||
clear the upper bit of a compat_uptr_t and turn it into a valid 64-bit
|
||
pointer. On other architectures, this macro only performs a cast to a
|
||
``void __user *`` pointer.
|
||
|
||
In an compat_ioctl() callback, the last argument is an unsigned long,
|
||
which can be interpreted as either a pointer or a scalar depending on
|
||
the command. If it is a scalar, then compat_ptr() must not be used, to
|
||
ensure that the 64-bit kernel behaves the same way as a 32-bit kernel
|
||
for arguments with the upper bit set.
|
||
|
||
The compat_ptr_ioctl() helper can be used in place of a custom
|
||
compat_ioctl file operation for drivers that only take arguments that
|
||
are pointers to compatible data structures.
|
||
|
||
Structure layout
|
||
----------------
|
||
|
||
Compatible data structures have the same layout on all architectures,
|
||
avoiding all problematic members:
|
||
|
||
* ``long`` and ``unsigned long`` are the size of a register, so
|
||
they can be either 32-bit or 64-bit wide and cannot be used in portable
|
||
data structures. Fixed-length replacements are ``__s32``, ``__u32``,
|
||
``__s64`` and ``__u64``.
|
||
|
||
* Pointers have the same problem, in addition to requiring the
|
||
use of compat_ptr(). The best workaround is to use ``__u64``
|
||
in place of pointers, which requires a cast to ``uintptr_t`` in user
|
||
space, and the use of u64_to_user_ptr() in the kernel to convert
|
||
it back into a user pointer.
|
||
|
||
* On the x86-32 (i386) architecture, the alignment of 64-bit variables
|
||
is only 32-bit, but they are naturally aligned on most other
|
||
architectures including x86-64. This means a structure like::
|
||
|
||
struct foo {
|
||
__u32 a;
|
||
__u64 b;
|
||
__u32 c;
|
||
};
|
||
|
||
has four bytes of padding between a and b on x86-64, plus another four
|
||
bytes of padding at the end, but no padding on i386, and it needs a
|
||
compat_ioctl conversion handler to translate between the two formats.
|
||
|
||
To avoid this problem, all structures should have their members
|
||
naturally aligned, or explicit reserved fields added in place of the
|
||
implicit padding. The ``pahole`` tool can be used for checking the
|
||
alignment.
|
||
|
||
* On ARM OABI user space, structures are padded to multiples of 32-bit,
|
||
making some structs incompatible with modern EABI kernels if they
|
||
do not end on a 32-bit boundary.
|
||
|
||
* On the m68k architecture, struct members are not guaranteed to have an
|
||
alignment greater than 16-bit, which is a problem when relying on
|
||
implicit padding.
|
||
|
||
* Bitfields and enums generally work as one would expect them to,
|
||
but some properties of them are implementation-defined, so it is better
|
||
to avoid them completely in ioctl interfaces.
|
||
|
||
* ``char`` members can be either signed or unsigned, depending on
|
||
the architecture, so the __u8 and __s8 types should be used for 8-bit
|
||
integer values, though char arrays are clearer for fixed-length strings.
|
||
|
||
Information leaks
|
||
=================
|
||
|
||
Uninitialized data must not be copied back to user space, as this can
|
||
cause an information leak, which can be used to defeat kernel address
|
||
space layout randomization (KASLR), helping in an attack.
|
||
|
||
For this reason (and for compat support) it is best to avoid any
|
||
implicit padding in data structures. Where there is implicit padding
|
||
in an existing structure, kernel drivers must be careful to fully
|
||
initialize an instance of the structure before copying it to user
|
||
space. This is usually done by calling memset() before assigning to
|
||
individual members.
|
||
|
||
Subsystem abstractions
|
||
======================
|
||
|
||
While some device drivers implement their own ioctl function, most
|
||
subsystems implement the same command for multiple drivers. Ideally the
|
||
subsystem has an .ioctl() handler that copies the arguments from and
|
||
to user space, passing them into subsystem specific callback functions
|
||
through normal kernel pointers.
|
||
|
||
This helps in various ways:
|
||
|
||
* Applications written for one driver are more likely to work for
|
||
another one in the same subsystem if there are no subtle differences
|
||
in the user space ABI.
|
||
|
||
* The complexity of user space access and data structure layout is done
|
||
in one place, reducing the potential for implementation bugs.
|
||
|
||
* It is more likely to be reviewed by experienced developers
|
||
that can spot problems in the interface when the ioctl is shared
|
||
between multiple drivers than when it is only used in a single driver.
|
||
|
||
Alternatives to ioctl
|
||
=====================
|
||
|
||
There are many cases in which ioctl is not the best solution for a
|
||
problem. Alternatives include:
|
||
|
||
* System calls are a better choice for a system-wide feature that
|
||
is not tied to a physical device or constrained by the file system
|
||
permissions of a character device node
|
||
|
||
* netlink is the preferred way of configuring any network related
|
||
objects through sockets.
|
||
|
||
* debugfs is used for ad-hoc interfaces for debugging functionality
|
||
that does not need to be exposed as a stable interface to applications.
|
||
|
||
* sysfs is a good way to expose the state of an in-kernel object
|
||
that is not tied to a file descriptor.
|
||
|
||
* configfs can be used for more complex configuration than sysfs
|
||
|
||
* A custom file system can provide extra flexibility with a simple
|
||
user interface but adds a lot of complexity to the implementation.
|