kernel_optimize_test/Documentation
Eric Dumazet c9bee3b7fd tcp: TCP_NOTSENT_LOWAT socket option
Idea of this patch is to add optional limitation of number of
unsent bytes in TCP sockets, to reduce usage of kernel memory.

TCP receiver might announce a big window, and TCP sender autotuning
might allow a large amount of bytes in write queue, but this has little
performance impact if a large part of this buffering is wasted :

Write queue needs to be large only to deal with large BDP, not
necessarily to cope with scheduling delays (incoming ACKS make room
for the application to queue more bytes)

For most workloads, using a value of 128 KB or less is OK to give
applications enough time to react to POLLOUT events in time
(or being awaken in a blocking sendmsg())

This patch adds two ways to set the limit :

1) Per socket option TCP_NOTSENT_LOWAT

2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.

This changes poll()/select()/epoll() to report POLLOUT
only if number of unsent bytes is below tp->nosent_lowat

Note this might increase number of sendmsg()/sendfile() calls
when using non blocking sockets,
and increase number of context switches for blocking sockets.

Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
defined as :
 Specify the minimum number of bytes in the buffer until
 the socket layer will pass the data to the protocol)

Tested:

netperf sessions, and watching /proc/net/protocols "memory" column for TCP

With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

Using 128KB has no bad effect on the throughput or cpu usage
of a single flow, although there is an increase of context switches.

A bonus is that we hold socket lock for a shorter amount
of time and should improve latencies of ACK processing.

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

           412,514 context-switches

     200.034645535 seconds time elapsed

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

         2,675,818 context-switches

     200.029651391 seconds time elapsed

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-By: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24 17:54:48 -07:00
..
ABI Merge branch 'for_linus' of git://cavan.codon.org.uk/platform-drivers-x86 2013-07-13 18:08:23 -07:00
accounting
acpi
aoe
arm
arm64
auxdisplay
backlight
blackfin
block
blockdev
bus-devices
cdrom
cgroups Merge branch 'for-3.11/core' of git://git.kernel.dk/linux-block 2013-07-11 13:03:24 -07:00
connector
console
cpu-freq
cpuidle
cris
crypto
development-process
device-mapper Add a device-mapper target called dm-switch to provide a multipath 2013-07-11 13:05:40 -07:00
devicetree Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2013-07-13 18:05:13 -07:00
DocBook Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media 2013-07-13 12:09:57 -07:00
driver-model
dvb
early-userspace
EDID
extcon
fault-injection
fb
filesystems xfs: update (#2) for 3.11-rc1 2013-07-13 11:40:24 -07:00
firmware_class
fmc
frv
hid
hwmon
i2c
i2o
ia64
ide
infiniband
input
ioctl
isdn
ja_JP
kbuild
kdump
ko_KR
laptops
leds
m68k
make
memory-devices
metag
mips
misc-devices
mmc
mn10300
mtd
namespaces
netlabel
networking tcp: TCP_NOTSENT_LOWAT socket option 2013-07-24 17:54:48 -07:00
nfc
parisc
PCI
pcmcia
power
powerpc
pps
prctl
pti
ptp
rapidio
RCU
s390
scheduler
scsi
security
serial
sh
sound
spi
sysctl net: rename busy poll socket op and globals 2013-07-10 17:08:27 -07:00
target
thermal Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux 2013-07-11 12:26:08 -07:00
timers
trace The majority of the changes here are cleanups for the large changes that 2013-07-11 09:02:09 -07:00
usb
vDSO
video4linux Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media 2013-07-13 12:09:57 -07:00
virtual
vm zswap: add documentation 2013-07-10 18:11:34 -07:00
w1
watchdog watchdog: delete mpcore_wdt driver 2013-07-11 21:47:58 +02:00
wimax
x86
xtensa
zh_CN
.gitignore
00-INDEX
applying-patches.txt
atomic_ops.txt
bad_memory.txt
basic_profiling.txt
bcache.txt
binfmt_misc.txt
braille-console.txt
bt8xxgpio.txt
btmrvl.txt
BUG-HUNTING
bus-virt-phys-mapping.txt
cachetlb.txt
Changes
circular-buffers.txt
clk.txt
coccinelle.txt
CodingStyle
cpu-hotplug.txt kernel: delete __cpuinit usage from all core kernel files 2013-07-14 19:36:59 -04:00
cpu-load.txt
cputopology.txt
crc32.txt
dcdbas.txt
debugging-modules.txt
debugging-via-ohci1394.txt
dell_rbu.txt
devices.txt
digsig.txt
DMA-API-HOWTO.txt
DMA-API.txt
DMA-attributes.txt
dma-buf-sharing.txt
DMA-ISA-LPC.txt
dmaengine.txt
dmatest.txt
dontdiff
dynamic-debug-howto.txt
edac.txt
eisa.txt
email-clients.txt
flexible-arrays.txt
futex-requeue-pi.txt
gcov.txt
gpio.txt
highuid.txt
HOWTO
hw_random.txt
hwspinlock.txt
init.txt
initrd.txt
intel_txt.txt
Intel-IOMMU.txt
io_ordering.txt
io-mapping.txt
iostats.txt
IPMI.txt
IRQ-affinity.txt
IRQ-domain.txt
IRQ.txt
irqflags-tracing.txt
isapnp.txt
java.txt
kernel-doc-nano-HOWTO.txt
kernel-docs.txt
kernel-parameters.txt The majority of the changes here are cleanups for the large changes that 2013-07-11 09:02:09 -07:00
kernel-per-CPU-kthreads.txt
kmemcheck.txt
kmemleak.txt
kobject.txt
kprobes.txt
kref.txt
ldm.txt
local_ops.txt
lockdep-design.txt
lockstat.txt
lockup-watchdogs.txt
logo.gif
logo.txt
magic-number.txt
Makefile
ManagementStyle
md.txt
media-framework.txt Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media 2013-07-13 12:09:57 -07:00
memory-barriers.txt
memory-hotplug.txt
mono.txt
mutex-design.txt
nommu-mmap.txt
numastat.txt
oops-tracing.txt
padata.txt
parport-lowlevel.txt
parport.txt
percpu-rw-semaphore.txt
pi-futex.txt
pinctrl.txt
pnp.txt
preempt-locking.txt
printk-formats.txt
pwm.txt
ramoops.txt
rbtree.txt
remoteproc.txt
rfkill.txt
robust-futex-ABI.txt
robust-futexes.txt
rpmsg.txt
rt-mutex-design.txt
rt-mutex.txt
rtc.txt
SAK.txt
SecurityBugs
serial-console.txt
sgi-ioc4.txt
sgi-visws.txt
SM501.txt
smsc_ece1099.txt
sparse.txt
spinlocks.txt
stable_api_nonsense.txt
stable_kernel_rules.txt
static-keys.txt
SubmitChecklist
SubmittingDrivers
SubmittingPatches
svga.txt
sysfs-rules.txt
sysrq.txt
this_cpu_ops.txt
unaligned-memory-access.txt
unicode.txt
unshare.txt
vfio.txt
VGA-softcursor.txt
vgaarbiter.txt
video-output.txt
vme_api.txt
volatile-considered-harmful.txt
workqueue.txt
ww-mutex-design.txt
xz.txt
zorro.txt