kernel_optimize_test/drivers/md
Arne Welzel 7509c4cb7c dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc()
commit 528b16bfc3ae5f11638e71b3b63a81f9999df727 upstream.

On systems with many cores using dm-crypt, heavy spinlock contention in
percpu_counter_compare() can be observed when the page allocation limit
for a given device is reached or close to be reached. This is due
to percpu_counter_compare() taking a spinlock to compute an exact
result on potentially many CPUs at the same time.

Switch to non-exact comparison of allocated and allowed pages by using
the value returned by percpu_counter_read_positive() to avoid taking
the percpu_counter spinlock.

This may over/under estimate the actual number of allocated pages by at
most (batch-1) * num_online_cpus().

Currently, batch is bounded by 32. The system on which this issue was
first observed has 256 CPUs and 512GB of RAM. With a 4k page size, this
change may over/under estimate by 31MB. With ~10G (2%) allowed dm-crypt
allocations, this seems an acceptable error. Certainly preferred over
running into the spinlock contention.

This behavior was reproduced on an EC2 c5.24xlarge instance with 96 CPUs
and 192GB RAM as follows, but can be provoked on systems with less CPUs
as well.

 * Disable swap
 * Tune vm settings to promote regular writeback
     $ echo 50 > /proc/sys/vm/dirty_expire_centisecs
     $ echo 25 > /proc/sys/vm/dirty_writeback_centisecs
     $ echo $((128 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes

 * Create 8 dmcrypt devices based on files on a tmpfs
 * Create and mount an ext4 filesystem on each crypt devices
 * Run stress-ng --hdd 8 within one of above filesystems

Total %system usage collected from sysstat goes to ~35%. Write throughput
on the underlying loop device is ~2GB/s. perf profiling an individual
kworker kcryptd thread shows the following profile, indicating spinlock
contention in percpu_counter_compare():

    99.98%     0.00%  kworker/u193:46  [kernel.kallsyms]  [k] ret_from_fork
      |
      --ret_from_fork
        kthread
        worker_thread
        |
         --99.92%--process_one_work
            |
            |--80.52%--kcryptd_crypt
            |    |
            |    |--62.58%--mempool_alloc
            |    |  |
            |    |   --62.24%--crypt_page_alloc
            |    |     |
            |    |      --61.51%--__percpu_counter_compare
            |    |        |
            |    |         --61.34%--__percpu_counter_sum
            |    |           |
            |    |           |--58.68%--_raw_spin_lock_irqsave
            |    |           |  |
            |    |           |   --58.30%--native_queued_spin_lock_slowpath
            |    |           |
            |    |            --0.69%--cpumask_next
            |    |                |
            |    |                 --0.51%--_find_next_bit
            |    |
            |    |--10.61%--crypt_convert
            |    |          |
            |    |          |--6.05%--xts_crypt
            ...

After applying this patch and running the same test, %system usage is
lowered to ~7% and write throughput on the loop device increases
to ~2.7GB/s. perf report shows mempool_alloc() as ~8% rather than ~62%
in the profile and not hitting the percpu_counter() spinlock anymore.

    |--8.15%--mempool_alloc
    |    |
    |    |--3.93%--crypt_page_alloc
    |    |    |
    |    |     --3.75%--__alloc_pages
    |    |         |
    |    |          --3.62%--get_page_from_freelist
    |    |              |
    |    |               --3.22%--rmqueue_bulk
    |    |                   |
    |    |                    --2.59%--_raw_spin_lock
    |    |                      |
    |    |                       --2.57%--native_queued_spin_lock_slowpath
    |    |
    |     --3.05%--_raw_spin_lock_irqsave
    |               |
    |                --2.49%--native_queued_spin_lock_slowpath

Suggested-by: DJ Gregor <dj@corelight.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Arne Welzel <arne.welzel@corelight.com>
Fixes: 5059353df8 ("dm crypt: limit the number of allocated pages")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-18 13:40:08 +02:00
..
bcache bcache: add proper error unwinding in bcache_device_init 2021-09-15 09:50:25 +02:00
persistent-data dm btree remove: assign new_root only when removal succeeds 2021-07-19 09:45:01 +02:00
dm-bio-prison-v1.c
dm-bio-prison-v1.h
dm-bio-prison-v2.c
dm-bio-prison-v2.h
dm-bio-record.h
dm-bufio.c dm bufio: subtract the number of initial sectors in dm_bufio_get_device_size 2021-03-09 11:11:12 +01:00
dm-builtin.c
dm-cache-background-tracker.c
dm-cache-background-tracker.h
dm-cache-block-types.h
dm-cache-metadata.c dm cache metadata: Avoid returning cmd->bm wild pointer on error 2020-09-02 13:38:24 -04:00
dm-cache-metadata.h
dm-cache-policy-internal.h
dm-cache-policy-smq.c
dm-cache-policy.c
dm-cache-policy.h
dm-cache-target.c Revert "dm cache: fix arm link errors with inline" 2020-12-01 15:43:36 -05:00
dm-clone-metadata.c
dm-clone-metadata.h
dm-clone-target.c writeback: remove bdi->congested_fn 2020-07-08 17:20:46 -06:00
dm-core.h dm: fix deadlock when swapping to encrypted device 2021-03-04 11:38:44 +01:00
dm-crypt.c dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc() 2021-09-18 13:40:08 +02:00
dm-delay.c block: rename generic_make_request to submit_bio_noacct 2020-07-01 07:27:24 -06:00
dm-dust.c dm dust: add interface to list all badblocks 2020-07-20 11:17:41 -04:00
dm-ebs-target.c dm ebs: Fix incorrect checking for REQ_OP_FLUSH 2020-08-04 16:01:40 -04:00
dm-era-target.c dm era: only resize metadata in preresume 2021-03-04 11:38:46 +01:00
dm-exception-store.c
dm-exception-store.h
dm-flakey.c
dm-historical-service-time.c
dm-init.c dm init: Set file local variable static 2020-08-04 15:51:28 -04:00
dm-integrity.c dm integrity: fix missing goto in bitmap_flush_interval error handling 2021-05-11 14:47:40 +02:00
dm-io.c treewide: Remove uninitialized_var() usage 2020-07-16 12:35:15 -07:00
dm-ioctl.c dm ioctl: fix out of bounds array access when no devices 2021-03-30 14:31:56 +02:00
dm-kcopyd.c
dm-linear.c dm: add support for REQ_NOWAIT and enable it for linear target 2020-09-25 08:20:03 -06:00
dm-log-userspace-base.c
dm-log-userspace-transfer.c
dm-log-userspace-transfer.h
dm-log-writes.c
dm-log.c
dm-mpath.c dm: use dm_table_get_device_name() where appropriate in targets 2020-09-29 16:33:08 -04:00
dm-mpath.h
dm-path-selector.c
dm-path-selector.h
dm-queue-length.c
dm-raid.c dm raid: fix inconclusive reshape layout on fast raid4/5/6 table reload sequences 2021-05-11 14:47:36 +02:00
dm-raid1.c block: rename generic_make_request to submit_bio_noacct 2020-07-01 07:27:24 -06:00
dm-region-hash.c
dm-round-robin.c
dm-rq.c dm rq: fix double free of blk_mq_tag_set in dev remove after table load fails 2021-05-11 14:47:40 +02:00
dm-rq.h
dm-service-time.c
dm-snap-persistent.c dm snap persistent: simplify area_io() 2020-09-29 16:33:12 -04:00
dm-snap-transient.c
dm-snap.c dm snapshot: properly fix a crash when an origin has no snapshots 2021-06-03 09:00:30 +02:00
dm-stats.c
dm-stats.h
dm-stripe.c
dm-switch.c
dm-sysfs.c
dm-table.c dm table: Fix zoned model check and zone sectors check 2021-03-30 14:32:06 +02:00
dm-target.c
dm-thin-metadata.c dm thin metadata: Remove unused local variable when create thin and snap 2020-09-29 16:33:11 -04:00
dm-thin-metadata.h
dm-thin.c writeback: remove bdi->congested_fn 2020-07-08 17:20:46 -06:00
dm-uevent.c
dm-uevent.h
dm-unstripe.c
dm-verity-fec.c dm verity fec: fix misaligned RS roots IO 2021-04-21 13:00:54 +02:00
dm-verity-fec.h dm verity fec: fix misaligned RS roots IO 2021-04-21 13:00:54 +02:00
dm-verity-target.c dm verity: fix DM_VERITY_OPTS_MAX value 2021-03-30 14:31:56 +02:00
dm-verity-verify-sig.c dm verity: fix require_signatures module_param permissions 2021-06-16 12:01:37 +02:00
dm-verity-verify-sig.h dm verity: Fix compilation warning 2020-08-04 15:48:13 -04:00
dm-verity.h dm verity: add "panic_on_corruption" error handling mode 2020-07-13 11:47:33 -04:00
dm-writecache.c dm writecache: write at least 4k when committing 2021-07-19 09:45:02 +02:00
dm-zero.c
dm-zoned-metadata.c dm zoned: check zone capacity 2021-07-19 09:45:01 +02:00
dm-zoned-reclaim.c dm zoned: Fix zone reclaim trigger 2020-07-08 12:21:53 -04:00
dm-zoned-target.c dm table: Fix zoned model check and zone sectors check 2021-03-30 14:32:06 +02:00
dm-zoned.h dm zoned: select reclaim zone based on device index 2020-06-05 14:59:53 -04:00
dm.c dm: Fix dm_accept_partial_bio() relative to zone management commands 2021-07-19 09:44:46 +02:00
dm.h dm table: fix DAX iterate_devices based device capability checks 2021-03-04 11:38:44 +01:00
Kconfig dm integrity: select CRYPTO_SKCIPHER 2021-01-27 11:54:57 +01:00
Makefile md: move the early init autodetect code to drivers/md/ 2020-07-16 15:34:47 +02:00
md-autodetect.c treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
md-bitmap.c md/bitmap: wait for external bitmap writes to complete during tear down 2021-05-14 09:49:59 +02:00
md-bitmap.h
md-cluster.c md/cluster: fix deadlock when node is doing resync job 2020-12-30 11:54:25 +01:00
md-cluster.h
md-faulty.c block: rename generic_make_request to submit_bio_noacct 2020-07-01 07:27:24 -06:00
md-linear.c block: add a new revalidate_disk_size helper 2020-09-02 08:00:07 -06:00
md-linear.h
md-multipath.c writeback: remove bdi->congested_fn 2020-07-08 17:20:46 -06:00
md-multipath.h
md.c md: Fix missing unused status line of /proc/mdstat 2021-05-14 09:49:59 +02:00
md.h Revert "md: change mddev 'chunk_sectors' from int to unsigned" 2020-12-14 19:33:01 +01:00
raid1-10.c
raid1.c md/raid10: properly indicate failure when ending a failed write request 2021-08-12 13:22:17 +02:00
raid1.h
raid5-cache.c raid5-cache: hold spinlock instead of mutex in r5c_journal_mode_show 2020-08-02 23:03:52 -07:00
raid5-log.h
raid5-ppl.c md/raid456: convert macro STRIPE_* to RAID5_STRIPE_* 2020-07-21 17:18:12 -07:00
raid5.c md/raid5: fix oops during stripe resizing 2020-10-08 22:38:10 -07:00
raid5.h md/raid5: let multiple devices of stripe_head share page 2020-09-24 16:44:44 -07:00
raid10.c md/raid10: properly indicate failure when ending a failed write request 2021-08-12 13:22:17 +02:00
raid10.h Revert "md/raid10: improve discard request for far layout" 2020-12-09 20:46:00 -08:00
raid0.c Revert "md: add md_submit_discard_bio() for submitting discard bio" 2020-12-09 20:46:01 -08:00
raid0.h