kernel_optimize_test/mm
Johannes Weiner dd0a41bc17 Revert "mm: memcontrol: avoid workload stalls when lowering memory.high"
commit e82553c10b0899994153f9bf0af333c0a1550fd7 upstream.

This reverts commit 536d3bf261, as it can
cause writers to memory.high to get stuck in the kernel forever,
performing page reclaim and consuming excessive amounts of CPU cycles.

Before the patch, a write to memory.high would first put the new limit
in place for the workload, and then reclaim the requested delta.  After
the patch, the kernel tries to reclaim the delta before putting the new
limit into place, in order to not overwhelm the workload with a sudden,
large excess over the limit.  However, if reclaim is actively racing
with new allocations from the uncurbed workload, it can keep the write()
working inside the kernel indefinitely.
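
For illustration, a rough sketch of the two orderings.  It is built on
the real page_counter_set_high() and try_to_free_mem_cgroup_pages()
helpers, with a hypothetical reclaim_down_to() standing in for the
write path's reclaim loop; signal checks, stock draining and retry
limits are omitted, so this is not the literal mm/memcontrol.c code:

	/* Hypothetical stand-in for the write path's reclaim loop. */
	static void reclaim_down_to(struct mem_cgroup *memcg,
				    unsigned long high)
	{
		unsigned long nr_pages;

		while ((nr_pages = page_counter_read(&memcg->memory)) > high)
			try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
						     GFP_KERNEL, true);
	}

	/*
	 * Proven sequence (pre-536d3bf261, restored by this revert):
	 * publish the lowered limit first, then reclaim the delta.  New
	 * allocations are charged against the lower limit right away,
	 * so reclaim converges on the target.
	 */
	page_counter_set_high(&memcg->memory, high);
	reclaim_down_to(memcg, high);

	/*
	 * Reverted sequence (536d3bf261): reclaim down to the target
	 * before publishing it.  The workload keeps allocating against
	 * the old, higher limit while reclaim runs, so the loop can
	 * race with new allocations indefinitely and the write() may
	 * never return.
	 */
	reclaim_down_to(memcg, high);
	page_counter_set_high(&memcg->memory, high);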

This is causing problems in Facebook production.  A privileged
system-level daemon that adjusts memory.high for various workloads
running on a host can get unexpectedly stuck in the kernel and
essentially turn into a sort of involuntary kswapd for one of the
workloads.  We've observed the daemon busy-spinning in a write() for
minutes at a time, neglecting its other duties on the system and
expending privileged system resources on behalf of a workload.

To remedy this, we first considered changing the reclaim logic to
break out after a couple of loops - whether the workload has converged
to the new limit or not - and bound the write() call this way.  However,
the root cause that inspired the sequence change in the first place has
been fixed through other means, and so a revert back to the proven
limit-setting sequence, also used by memory.max, is preferable.

The sequence was changed to avoid extreme latencies in the workload when
the limit was lowered: the sudden, large excess created by the limit
lowering would erroneously trigger the penalty sleeping code that is
meant to throttle excessive growth from below.  Allocating threads could
end up sleeping long after the write() had already reclaimed the delta
for which they were being punished.

However, erroneous throttling also caused problems in other scenarios at
around the same time.  This resulted in commit b3ff92916a ("mm, memcg:
reclaim more aggressively before high allocator throttling"), included
in the same release as the offending commit.  When allocating threads
now encounter large excess caused by a racing write() to memory.high,
instead of entering punitive sleeps, they will simply be tasked with
helping reclaim down the excess, and will be held no longer than it
takes to accomplish that.  This is in line with regular limit
enforcement - i.e.  if the workload allocates up against or over an
otherwise unchanged limit from below.
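
Roughly, the allocator-side path after b3ff92916a has this shape (a
condensed sketch of mem_cgroup_handle_over_high(), using its real
reclaim_high(), calculate_high_delay() and mem_find_max_overage()
helpers; swap overage, reclaim batching and PSI accounting omitted):

	int nr_retries = MAX_RECLAIM_RETRIES;

retry_reclaim:
	/* First, work the excess down ourselves. */
	nr_reclaimed = reclaim_high(memcg, nr_pages, GFP_KERNEL);

	/* A small or vanished excess is not worth throttling over. */
	penalty_jiffies = calculate_high_delay(memcg, nr_pages,
					       mem_find_max_overage(memcg));
	if (penalty_jiffies <= HZ / 100)
		return;

	/*
	 * As long as reclaim makes progress, keep reclaiming rather
	 * than sleeping: the thread is held only as long as it takes
	 * to remove the excess.
	 */
	if (nr_reclaimed || nr_retries--)
		goto retry_reclaim;

	/*
	 * Only persistent overgrowth that reclaim cannot keep up with
	 * results in a penalty sleep.
	 */
	schedule_timeout_killable(penalty_jiffies);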

With the patch breaking userspace, and the root cause addressed by other
means already, revert it again.

Link: https://lkml.kernel.org/r/20210122184341.292461-1-hannes@cmpxchg.org
Fixes: 536d3bf261 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Tejun Heo <tj@kernel.org>
Acked-by: Chris Down <chris@chrisdown.name>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: <stable@vger.kernel.org>	[5.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-13 13:55:17 +01:00
kasan kasan: fix incorrect arguments passing in kasan_add_zero_shadow 2021-01-27 11:55:23 +01:00
backing-dev.c
balloon_compaction.c
cleancache.c
cma_debug.c
cma.c
cma.h
compaction.c mm, compaction: move high_pfn to the for loop scope 2021-02-10 09:29:21 +01:00
debug_page_ref.c
debug_vm_pgtable.c mm/debug_vm_pgtable: avoid doing memory allocation with pgtable_t mapped. 2020-10-16 11:11:14 -07:00
debug.c mm, dump_page: rename head_mapcount() --> head_compound_mapcount() 2020-10-13 18:38:29 -07:00
dmapool.c mm/dmapool.c: replace hard coded function name with __func__ 2020-10-13 18:38:32 -07:00
early_ioremap.c
fadvise.c mm, fadvise: improve the expensive remote LRU cache draining after FADV_DONTNEED 2020-10-13 18:38:29 -07:00
failslab.c
filemap.c mm/filemap: add missing mem_cgroup_uncharge() to __add_to_page_cache_locked() 2021-02-10 09:29:21 +01:00
frame_vector.c
frontswap.c
gup_benchmark.c mm/gup_benchmark: take the mmap lock around GUP 2020-10-18 09:27:09 -07:00
gup.c mm/gup: combine put_compound_head() and unpin_user_page() 2020-12-30 11:53:54 +01:00
highmem.c mm/highmem.c: clean up endif comments 2020-10-16 11:11:18 -07:00
hmm.c
huge_memory.c mm: thp: fix MADV_REMOVE deadlock on shmem THP 2021-02-10 09:29:21 +01:00
hugetlb_cgroup.c hugetlb_cgroup: fix offline of hugetlb cgroup with reservations 2020-12-06 10:19:07 -08:00
hugetlb.c mm: hugetlb: remove VM_BUG_ON_PAGE from page_huge_active 2021-02-10 09:29:21 +01:00
hwpoison-inject.c mm,hwpoison-inject: don't pin for hwpoison_filter 2020-10-16 11:11:16 -07:00
init-mm.c mm/gup: prevent gup_fast from racing with COW during fork 2020-12-30 11:53:54 +01:00
internal.h mm: rename page_order() to buddy_order() 2020-10-16 11:11:19 -07:00
interval_tree.c
ioremap.c
Kconfig mm/zsmalloc.c: drop ZSMALLOC_PGTABLE_MAPPING 2020-12-06 10:19:07 -08:00
Kconfig.debug
khugepaged.c mm: remove the now-unnecessary mmget_still_valid() hack 2020-10-16 11:11:22 -07:00
kmemleak.c mm/kmemleak: rely on rcu for task stack scanning 2020-10-13 18:38:27 -07:00
ksm.c docs: get rid of :c:type explicit declarations for structs 2020-10-15 07:49:40 +02:00
list_lru.c mm: list_lru: set shrinker map bit when child nr_items is not zero 2020-12-06 10:19:07 -08:00
maccess.c
madvise.c mm,memory_failure: always pin the page in madvise_inject_error 2020-12-30 11:53:55 +01:00
Makefile mm,kmemleak-test.c: move kmemleak-test.c to samples dir 2020-10-13 18:38:27 -07:00
mapping_dirty_helpers.c
memblock.c memblock: do not start bottom-up allocations with kernel_end 2021-02-10 09:29:15 +01:00
memcontrol.c Revert "mm: memcontrol: avoid workload stalls when lowering memory.high" 2021-02-13 13:55:17 +01:00
memfd.c
memory_hotplug.c mm: memmap defer init doesn't work as expected 2021-01-06 14:56:50 +01:00
memory-failure.c mm,memory_failure: always pin the page in madvise_inject_error 2020-12-30 11:53:55 +01:00
memory.c mm/gup: prevent gup_fast from racing with COW during fork 2020-12-30 11:53:54 +01:00
mempolicy.c mm: mempolicy: fix potential pte_unmap_unlock pte error 2020-11-02 12:14:19 -08:00
mempool.c mm/mempool: add 'else' to split mutually exclusive case 2020-10-13 18:38:34 -07:00
memremap.c mm/mremap_pages: fix static key devmap_managed_key updates 2020-11-02 12:14:18 -08:00
memtest.c
migrate.c mm: fix numa stats for thp migration 2021-01-27 11:55:14 +01:00
mincore.c mm: factor find_get_incore_page out of mincore_page 2020-10-13 18:38:29 -07:00
mlock.c
mm_init.c
mmap.c mm/mmap.c: fix mmap return value when vma is merged after call_mmap() 2020-12-06 10:19:07 -08:00
mmu_gather.c
mmu_notifier.c mm/mmu_notifier: fix mmget() assert in __mmu_interval_notifier_insert 2020-10-16 11:11:17 -07:00
mmzone.c
mprotect.c
mremap.c
msync.c
nommu.c mm: remove alloc_vm_area 2020-10-18 09:27:10 -07:00
oom_kill.c mm, oom_adj: don't loop through tasks in __set_oom_adj when not necessary 2020-10-13 18:38:35 -07:00
page_alloc.c mm/page_alloc: add a missing mm_page_alloc_zone_locked() tracepoint 2021-01-30 13:55:19 +01:00
page_counter.c mm/page_counter: correct the obsolete func name in the comment of page_counter_try_charge() 2020-10-13 18:38:30 -07:00
page_ext.c
page_idle.c
page_io.c mm/page_io.c: remove useless out label in __swap_writepage() 2020-10-13 18:38:30 -07:00
page_isolation.c mm: rename page_order() to buddy_order() 2020-10-16 11:11:19 -07:00
page_owner.c mm: rename page_order() to buddy_order() 2020-10-16 11:11:19 -07:00
page_poison.c mm/page_poison.c: replace bool variable with static key 2020-10-16 11:11:17 -07:00
page_reporting.c mm: rename page_order() to buddy_order() 2020-10-16 11:11:19 -07:00
page_reporting.h
page_vma_mapped.c
page-writeback.c mm: make wait_on_page_writeback() wait for multiple pending writebacks 2021-01-12 20:18:22 +01:00
pagewalk.c
percpu-internal.h
percpu-km.c
percpu-stats.c
percpu-vm.c
percpu.c percpu: convert flexible array initializers to use struct_size() 2020-10-30 23:02:28 +00:00
pgalloc-track.h
pgtable-generic.c
process_vm_access.c mm/process_vm_access.c: include compat.h 2021-01-19 18:27:21 +01:00
ptdump.c
readahead.c mm: use limited read-ahead to satisfy read 2020-10-17 13:49:08 -06:00
rmap.c mm/rmap: always do TTU_IGNORE_ACCESS 2020-12-30 11:53:55 +01:00
rodata_test.c
shmem.c fs: add a filesystem flag for THPs 2020-10-16 11:11:15 -07:00
shuffle.c mm: rename page_order() to buddy_order() 2020-10-16 11:11:19 -07:00
shuffle.h
slab_common.c
slab.c mm: fix some comments formatting 2020-10-16 11:11:19 -07:00
slab.h mm: memcg/slab: fix obj_cgroup_charge() return value handling 2020-12-06 10:19:07 -08:00
slob.c
slub.c Revert "mm/slub: fix a memory leak in sysfs_slab_add()" 2021-01-30 13:55:16 +01:00
sparse-vmemmap.c
sparse.c mm/memory_hotplug: guard more declarations by CONFIG_MEMORY_HOTPLUG 2020-10-16 11:11:18 -07:00
swap_cgroup.c
swap_slots.c mm/swap_slots.c: remove always zero and unused return value of enable_swap_slots_cache() 2020-10-13 18:38:30 -07:00
swap_state.c mm: fix some broken comments 2020-10-16 11:11:19 -07:00
swap.c mm: move call to compound_head() in release_pages() 2020-10-13 18:38:33 -07:00
swapfile.c mm: fix a race on nr_swap_pages 2021-01-30 13:55:19 +01:00
truncate.c mm/truncate.c: make __invalidate_mapping_pages() static 2020-11-02 12:14:19 -08:00
usercopy.c
userfaultfd.c
util.c mm/util.c: update the kerneldoc for kstrdup_const() 2020-10-16 11:11:17 -07:00
vmacache.c
vmalloc.c mm/vmalloc.c: fix potential memory leak 2021-01-19 18:27:21 +01:00
vmpressure.c
vmscan.c mm: don't put pinned pages into the swap cache 2021-01-19 18:27:29 +01:00
vmstat.c mm/vmstat.c: use helper macro abs() 2020-10-16 11:11:17 -07:00
workingset.c XArray updates for 5.9 2020-10-20 14:39:37 -07:00
z3fold.c z3fold: stricter locking and more careful reclaim 2020-12-30 11:54:10 +01:00
zbud.c mm/zbud: remove redundant initialization 2020-10-13 18:38:34 -07:00
zpool.c
zsmalloc.c mm/zsmalloc.c: drop ZSMALLOC_PGTABLE_MAPPING 2020-12-06 10:19:07 -08:00
zswap.c