kernel_optimize_test/include
Rasmus Villemoes 94a58c360a slab.h: sprinkle __assume_aligned attributes
The various allocators return aligned memory.  Telling the compiler that
allows it to generate better code in many cases, for example when the
return value is immediately passed to memset().

Some code does become larger, but at least we win twice as much as we lose:

$ scripts/bloat-o-meter /tmp/vmlinux vmlinux
add/remove: 0/0 grow/shrink: 13/52 up/down: 995/-2140 (-1145)

An example of the different (and smaller) code can be seen in mm_alloc(). Before:

:       48 8d 78 08             lea    0x8(%rax),%rdi
:       48 89 c1                mov    %rax,%rcx
:       48 89 c2                mov    %rax,%rdx
:       48 c7 00 00 00 00 00    movq   $0x0,(%rax)
:       48 c7 80 48 03 00 00    movq   $0x0,0x348(%rax)
:       00 00 00 00
:       31 c0                   xor    %eax,%eax
:       48 83 e7 f8             and    $0xfffffffffffffff8,%rdi
:       48 29 f9                sub    %rdi,%rcx
:       81 c1 50 03 00 00       add    $0x350,%ecx
:       c1 e9 03                shr    $0x3,%ecx
:       f3 48 ab                rep stos %rax,%es:(%rdi)

After:

:       48 89 c2                mov    %rax,%rdx
:       b9 6a 00 00 00          mov    $0x6a,%ecx
:       31 c0                   xor    %eax,%eax
:       48 89 d7                mov    %rdx,%rdi
:       f3 48 ab                rep stos %rax,%es:(%rdi)

So gcc's strategy is to do two possibly (but not really, of course)
unaligned stores to the first and last word, then do an aligned rep stos
covering the middle part with a little overlap.  Maybe arches which do not
allow unaligned stores gain even more.

I don't know if gcc can actually make use of alignments greater than 8 for
anything, so one could probably drop the __assume_xyz_alignment macros and
just use __assume_aligned(8).

The increases in code size are mostly caused by gcc deciding to
opencode strlen() using the check-four-bytes-at-a-time trick when it
knows the buffer is sufficiently aligned (one function grew by 200
bytes). Now it turns out that many of these strlen() calls showing up
were in fact redundant, and they're gone from -next. Applying the two
patches to next-20151001 bloat-o-meter instead says

add/remove: 0/0 grow/shrink: 6/52 up/down: 244/-2140 (-1896)

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-20 16:17:32 -08:00
..
acpi
asm-generic h8300 update for v4.4 2015-11-12 15:26:39 -08:00
clocksource
crypto
drm drm/atomic: add a drm_atomic_clean_old_fb helper. 2015-11-17 13:02:14 +02:00
dt-bindings ARM: DT updates for v4.4 2015-11-10 15:06:26 -08:00
keys
kvm
linux slab.h: sprinkle __assume_aligned attributes 2015-11-20 16:17:32 -08:00
math-emu
media
memory
misc
net net: switchdev: fix return code of fdb_dump stub 2015-11-16 15:24:37 -05:00
pcmcia
ras
rdma
rxrpc
scsi scsi: use host wide tags by default 2015-11-09 17:11:57 -08:00
soc ARM: SoC driver updates for v4.4 2015-11-10 15:00:03 -08:00
sound
target Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending 2015-11-13 20:04:17 -08:00
trace Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux 2015-11-11 09:03:01 -08:00
uapi VFIO updates for v4.4-rc1 2015-11-13 17:05:32 -08:00
video
xen
Kbuild