kernel_optimize_test/kernel
Balbir Singh 568ac88821 cgroup: reduce read locked section of cgroup_threadgroup_rwsem during fork
cgroup_threadgroup_rwsem is acquired in read mode during process exit
and fork.  It is also grabbed in write mode during
__cgroup_procs_write().  I've recently run into a scenario with lots
of memory pressure and OOM and I am beginning to see

systemd

 __switch_to+0x1f8/0x350
 __schedule+0x30c/0x990
 schedule+0x48/0xc0
 percpu_down_write+0x114/0x170
 __cgroup_procs_write.isra.12+0xb8/0x3c0
 cgroup_file_write+0x74/0x1a0
 kernfs_fop_write+0x188/0x200
 __vfs_write+0x6c/0xe0
 vfs_write+0xc0/0x230
 SyS_write+0x6c/0x110
 system_call+0x38/0xb4

This thread is waiting for the reader of cgroup_threadgroup_rwsem to
exit.  The reader itself is under memory pressure and has gone into
reclaim after fork.  At times the reader also ends up waiting on
oom_lock.

 __switch_to+0x1f8/0x350
 __schedule+0x30c/0x990
 schedule+0x48/0xc0
 jbd2_log_wait_commit+0xd4/0x180
 ext4_evict_inode+0x88/0x5c0
 evict+0xf8/0x2a0
 dispose_list+0x50/0x80
 prune_icache_sb+0x6c/0x90
 super_cache_scan+0x190/0x210
 shrink_slab.part.15+0x22c/0x4c0
 shrink_zone+0x288/0x3c0
 do_try_to_free_pages+0x1dc/0x590
 try_to_free_pages+0xdc/0x260
 __alloc_pages_nodemask+0x72c/0xc90
 alloc_pages_current+0xb4/0x1a0
 page_table_alloc+0xc0/0x170
 __pte_alloc+0x58/0x1f0
 copy_page_range+0x4ec/0x950
 copy_process.isra.5+0x15a0/0x1870
 _do_fork+0xa8/0x4b0
 ppc_clone+0x8/0xc

In the meantime, all processes exiting or forking are blocked, almost
stalling the system.

This patch moves threadgroup_change_begin() from before
cgroup_fork() to just before cgroup_can_fork().  There is no need to
worry about threadgroup changes until the task is actually added to the
threadgroup.  This avoids having to call reclaim with
cgroup_threadgroup_rwsem held.
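
The effect of the reordering can be illustrated with a minimal userspace
sketch.  This is an analogy under stated assumptions, not the kernel
patch: a plain reader counter stands in for cgroup_threadgroup_rwsem, and
the names toy_rwsem, copy_mm_may_reclaim(), fork_old_order() and
fork_new_order() are hypothetical helpers invented for the example.

```c
#include <assert.h>

/* Toy stand-in for cgroup_threadgroup_rwsem: it only counts readers. */
struct toy_rwsem {
	int readers;
};

static struct toy_rwsem threadgroup_rwsem;

/* Number of potentially slow allocations made with the read side held. */
static int reclaim_under_lock;

static void threadgroup_change_begin(void)
{
	threadgroup_rwsem.readers++;
}

static void threadgroup_change_end(void)
{
	assert(threadgroup_rwsem.readers > 0);
	threadgroup_rwsem.readers--;
}

/*
 * Hypothetical stand-in for the part of fork that copies page tables.
 * In the kernel this step can allocate memory and enter reclaim; here
 * it only records whether the read side was held at that point.
 */
static void copy_mm_may_reclaim(void)
{
	if (threadgroup_rwsem.readers > 0)
		reclaim_under_lock++;
}

/* Old ordering: the read side is taken before the slow copy step. */
int fork_old_order(void)
{
	reclaim_under_lock = 0;
	threadgroup_change_begin();
	copy_mm_may_reclaim();
	/* ... the new task would be added to the threadgroup here ... */
	threadgroup_change_end();
	return reclaim_under_lock;
}

/* New ordering: the slow copy step runs before the read side is taken. */
int fork_new_order(void)
{
	reclaim_under_lock = 0;
	copy_mm_may_reclaim();
	threadgroup_change_begin();
	/* ... the new task would be added to the threadgroup here ... */
	threadgroup_change_end();
	return reclaim_under_lock;
}
```

In this sketch fork_old_order() returns 1 and fork_new_order() returns 0:
with the new ordering a writer never has to wait for a reader that is
stuck in the slow step.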

tj: Subject and description edits.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-08-17 09:54:52 -04:00
..
bpf Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2016-07-27 12:03:20 -07:00
configs config: add android config fragments 2016-08-02 19:35:42 -04:00
debug
events perf/core: Change log level for duration warning to KERN_INFO 2016-08-02 10:23:57 +02:00
gcov gcov: add support for gcc version >= 6 2016-07-15 14:54:27 +09:00
irq genirq: Fix missing irq allocation affinity hint 2016-07-19 10:49:47 +02:00
livepatch modules: add ro_after_init support 2016-08-04 10:16:55 +09:30
locking Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2016-07-25 12:41:29 -07:00
power mm, vmscan: move LRU lists to node 2016-07-28 16:07:41 -07:00
printk printk: add kernel parameter to control writes to /dev/kmsg 2016-08-02 19:35:06 -04:00
rcu Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2016-07-29 13:55:30 -07:00
sched xen: features and fixes for 4.8-rc0 2016-07-27 11:35:37 -07:00
time Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2016-07-29 13:55:30 -07:00
trace block: rename bio bi_rw to bi_opf 2016-08-07 14:41:02 -06:00
.gitignore
acct.c
async.c
audit_fsnotify.c
audit_tree.c
audit_watch.c
audit.c Merge branch 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit 2016-07-29 17:54:17 -07:00
audit.h Merge branch 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit 2016-07-29 17:54:17 -07:00
auditfilter.c audit: add fields to exclude filter by reusing user filter 2016-06-27 11:01:00 -04:00
auditsc.c Merge branch 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit 2016-07-29 17:54:17 -07:00
backtracetest.c
bounds.c
capability.c kernel: Add noaudit variant of ns_capable() 2016-06-06 20:16:18 +10:00
cgroup_freezer.c
cgroup_pids.c cgroup: Use lld instead of ld when printing pids controller events_limit 2016-06-21 15:03:36 -04:00
cgroup.c Merge branch 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2016-07-29 14:29:04 -07:00
compat.c
configs.c
context_tracking.c
cpu_pm.c
cpu.c timers/core: Correct callback order during CPU hot plug 2016-07-28 18:56:22 +02:00
cpuset.c cpuset: make sure new tasks conform to the current config of the cpuset 2016-08-09 23:58:01 -04:00
crash_dump.c
cred.c cred: Reject inodes with invalid ids in set_create_file_as() 2016-06-30 18:05:09 -05:00
delayacct.c
dma.c
elfcore.c
exec_domain.c
exit.c kernel/exit.c: quieten greatest stack depth printk 2016-08-02 19:35:23 -04:00
extable.c
fork.c cgroup: reduce read locked section of cgroup_threadgroup_rwsem during fork 2016-08-17 09:54:52 -04:00
freezer.c freezer, oom: check TIF_MEMDIE on the correct task 2016-07-28 16:07:41 -07:00
futex_compat.c
futex.c futex: Calculate the futex key based on a tail page for file-based futexes 2016-06-08 19:23:54 +02:00
groups.c
hung_task.c
irq_work.c
jump_label.c powerpc updates for 4.8 #2 2016-08-05 09:00:54 -04:00
kallsyms.c
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt
kcov.c kernel/kcov: unproxify debugfs file's fops 2016-06-15 04:56:35 -07:00
kexec_core.c kexec: add restriction on kexec_load() segment sizes 2016-08-02 19:35:31 -04:00
kexec_file.c kexec: introduce a protection mechanism for the crashkernel reserved memory 2016-05-23 17:04:14 -07:00
kexec_internal.h
kexec.c kexec: allow architectures to override boot mapping 2016-08-02 19:35:27 -04:00
kmod.c
kprobes.c
ksysfs.c kexec: add a kexec_crash_loaded() function 2016-08-02 19:35:30 -04:00
kthread.c
latencytop.c
Makefile ELF/MIPS build fix 2016-05-23 17:04:14 -07:00
membarrier.c
memremap.c libnvdimm for 4.8 2016-07-28 17:38:16 -07:00
module_signing.c
module-internal.h
module.c Removed the MODULE_SIG_FORCE-means-no-MODULE_FORCE_LOAD patch. 2016-08-04 09:14:38 -04:00
notifier.c
nsproxy.c
padata.c
panic.c kexec: use core_param for crash_kexec_post_notifiers boot option 2016-08-02 19:35:29 -04:00
params.c
pid_namespace.c
pid.c remove lots of IS_ERR_VALUE abuses 2016-05-27 15:26:11 -07:00
profile.c profile: Convert to hotplug state machine 2016-07-15 10:41:42 +02:00
ptrace.c tree-wide: replace config_enabled() with IS_ENABLED() 2016-08-04 08:50:07 -04:00
range.c
reboot.c
relay.c relay: add global mode support for buffer-only channels 2016-08-02 19:35:41 -04:00
resource.c
seccomp.c tree-wide: replace config_enabled() with IS_ENABLED() 2016-08-04 08:50:07 -04:00
signal.c signals: Use hrtimer for sigtimedwait() 2016-07-07 10:35:07 +02:00
smp.c Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2016-07-29 13:55:30 -07:00
smpboot.c
smpboot.h
softirq.c
stacktrace.c
stop_machine.c stop_machine: Touch_nmi_watchdog() after MULTI_STOP_PREPARE 2016-07-27 11:12:11 +02:00
sys_ni.c
sys.c prctl: make PR_SET_THP_DISABLE wait for mmap_sem killable 2016-05-23 17:04:14 -07:00
sysctl_binary.c kernel/sysctl_binary.c: use generic UUID library 2016-05-20 17:58:30 -07:00
sysctl.c printk: add kernel parameter to control writes to /dev/kmsg 2016-08-02 19:35:06 -04:00
task_work.c task_work: use READ_ONCE/lockless_dereference, avoid pi_lock if !task_works 2016-08-02 19:35:02 -04:00
taskstats.c
test_kprobes.c
torture.c torture: Stop onoff task if there is only one cpu 2016-06-14 16:03:28 -07:00
tracepoint.c
tsacct.c
uid16.c
up.c
user_namespace.c fs: Limit file caps to the user namespace of the super block 2016-06-24 10:40:31 -05:00
user-return-notifier.c
user.c
utsname_sysctl.c
utsname.c
watchdog.c Revert "perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86" 2016-07-10 20:58:36 +02:00
workqueue_internal.h
workqueue.c Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2016-07-29 13:55:30 -07:00