Go to file
Thomas Gleixner b5e26be407 cpu/hotplug: Cure the cpusets trainwreck
commit b22afcdf04c96ca58327784e280e10288cfd3303 upstream.

Alexey and Joshua tried to solve a cpusets related hotplug problem which is
user space visible and results in unexpected behaviour for some time after
a CPU has been plugged in and the corresponding uevent was delivered.

cpusets delegate the hotplug work (rebuilding cpumasks etc.) to a
workqueue. This is done because the cpusets code has already a lock
nesting of cgroups_mutex -> cpu_hotplug_lock. A synchronous callback or
waiting for the work to finish with cpu_hotplug_lock held can and will
deadlock because that results in the reverse lock order.

As a consequence the uevent can be delivered before cpusets have consistent
state which means that a user space invocation of sched_setaffinity() to
move a task to the plugged CPU fails up to the point where the scheduled
work has been processed.

The same is true for CPU unplug, but that does not create user observable
failure (yet).

It's still inconsistent to claim that an operation is finished before it
actually is and that's the real issue at hand. uevents just make it
reliably observable.

Obviously the problem should be fixed in cpusets/cgroups, but untangling
that is pretty much impossible because according to the changelog of the
commit which introduced this 8 years ago:

 3a5a6d0c2b03("cpuset: don't nest cgroup_mutex inside get_online_cpus()")

the lock order cgroups_mutex -> cpu_hotplug_lock is a design decision and
the whole code is built around that.

So bite the bullet and invoke the relevant cpuset function, which waits for
the work to finish, in _cpu_up/down() after dropping cpu_hotplug_lock and
only when tasks are not frozen by suspend/hibernate because that would
obviously wait forever.

Waiting there with cpu_add_remove_lock, which is protecting the present
and possible CPU maps, held is not a problem at all because neither work
queues nor cpusets/cgroups have any lockchains related to that lock.

Waiting in the hotplug machinery is not problematic either because there
are already state callbacks which wait for hardware queues to drain. It
makes the operations slightly slower, but hotplug is slow anyway.

This ensures that state is consistent before returning from a hotplug
up/down operation. It's still inconsistent during the operation, but that's
a different story.

Add a large comment which explains why this is done and why this is not a
dump ground for the hack of the day to work around half thought out locking
schemes. Document also the implications vs. hotplug operations and
serialization or the lack of it.

Thanks to Alexy and Joshua for analyzing why this temporary
sched_setaffinity() failure happened.

Fixes: 3a5a6d0c2b03("cpuset: don't nest cgroup_mutex inside get_online_cpus()")
Reported-by: Alexey Klimov <aklimov@redhat.com>
Reported-by: Joshua Baker <jobaker@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Alexey Klimov <aklimov@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87tuowcnv3.ffs@nanos.tec.linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-19 09:44:59 +02:00
arch arm64: tlb: fix the TTL value of tlb_get_level 2021-07-19 09:44:59 +02:00
block blk-mq: update hctx->dispatch_busy in case of real scheduler 2021-07-14 16:56:13 +02:00
certs certs: Add ability to preload revocation certs 2021-06-30 08:47:30 -04:00
crypto crypto: sm2 - fix a memory leak in sm2 2021-07-14 16:56:06 +02:00
Documentation powerpc/papr_scm: Make 'perf_stats' invisible if perf-stats unavailable 2021-07-14 16:56:49 +02:00
drivers ata: ahci_sunxi: Disable DIPM 2021-07-19 09:44:59 +02:00
fs io_uring: convert io_buffer_idr to XArray 2021-07-19 09:44:56 +02:00
include scsi: iscsi: Fix race condition between login and sync thread 2021-07-19 09:44:56 +02:00
init sched/core: Initialize the idle task with preemption disabled 2021-07-14 16:55:50 +02:00
ipc ipc/mqueue, msg, sem: avoid relying on a stack reference past its expiry 2021-05-26 12:06:54 +02:00
kernel cpu/hotplug: Cure the cpusets trainwreck 2021-07-19 09:44:59 +02:00
lib lib/math/rational.c: fix divide by zero 2021-07-14 16:56:51 +02:00
LICENSES LICENSES/deprecated: add Zlib license text 2020-09-16 14:33:49 +02:00
mm mm,hwpoison: return -EBUSY when migration fails 2021-07-19 09:44:56 +02:00
net sctp: add size validation when walking chunks 2021-07-19 09:44:55 +02:00
samples samples/bpf: Fix the error return code of xdp_redirect's main() 2021-07-14 16:56:23 +02:00
scripts kbuild: Fix objtool dependency for 'OBJECT_FILES_NON_STANDARD_<obj> := n' 2021-07-14 16:56:04 +02:00
security selinux: use __GFP_NOWARN with GFP_NOWAIT in the AVC 2021-07-19 09:44:49 +02:00
sound ALSA: firewire-lib: Fix 'amdtp_domain_start()' when no AMDTP_OUT_STREAM stream is found 2021-07-14 16:56:49 +02:00
tools selftests/resctrl: Fix incorrect parsing of option "-t" 2021-07-19 09:44:55 +02:00
usr Merge branch 'work.fdpic' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-08-07 13:29:39 -07:00
virt KVM: do not allow mapping valid but non-reference-counted pages 2021-06-30 08:47:25 -04:00
.clang-format RDMA 5.10 pull request 2020-10-17 11:18:18 -07:00
.cocciconfig
.get_maintainer.ignore Opt out of scripts/get_maintainer.pl 2019-05-16 10:53:40 -07:00
.gitattributes .gitattributes: use 'dts' diff driver for dts files 2019-12-04 19:44:11 -08:00
.gitignore kbuild: generate Module.symvers only when vmlinux exists 2021-05-19 10:12:59 +02:00
.mailmap mailmap: add two more addresses of Uwe Kleine-König 2020-12-06 10:19:07 -08:00
COPYING COPYING: state that all contributions really are covered by this file 2020-02-10 13:32:20 -08:00
CREDITS MAINTAINERS: Move Jason Cooper to CREDITS 2020-11-30 10:20:34 +01:00
Kbuild kbuild: rename hostprogs-y/always to hostprogs/always-y 2020-02-04 01:53:07 +09:00
Kconfig kbuild: ensure full rebuild when the compiler is updated 2020-05-12 13:28:33 +09:00
MAINTAINERS f2fs: move ioctl interface definitions to separated file 2021-05-19 10:13:00 +02:00
Makefile Linux 5.10.50 2021-07-14 16:56:55 +02:00
README Drop all 00-INDEX files from Documentation/ 2018-09-09 15:08:58 -06:00

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.