kernel_optimize_test/net/ipv4
Neal Cardwell 4648dc97af tcp: fix tcp_shift_skb_data() to not shift SACKed data below snd_una
This commit fixes tcp_shift_skb_data() so that it does not shift
SACKed data below snd_una.

This fixes an issue whose symptoms exactly match reports showing
tp->sacked_out going negative since 3.3.0-rc4 (see "WARNING: at
net/ipv4/tcp_input.c:3418" thread on netdev).

Since 2008 (832d11c5cd) tcp_shift_skb_data() had been shifting SACKed
ranges that were below snd_una. It checked that the *end* of the skb it
was about to shift from was above snd_una, but did not check that the
end of the actual shifted range was above snd_una; this commit adds
that check.
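
The difference between the two checks can be sketched in isolation.
This is a simplified user-space model, not the kernel code:
shift_range_ok() and its parameters are hypothetical names, and
seq_after() is assumed to mirror the wrap-safe signed-difference
comparison normally used for 32-bit TCP sequence numbers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe "seq1 is after seq2" for 32-bit TCP sequence numbers. */
static bool seq_after(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq2 - seq1) < 0;
}

/*
 * Hypothetical model of the two conditions described above.
 * skb_seq and skb_end_seq delimit the skb we would shift from;
 * len is the number of bytes we intend to shift from its head.
 */
static bool shift_range_ok(uint32_t skb_seq, uint32_t skb_end_seq,
			   uint32_t len, uint32_t snd_una)
{
	/* Pre-existing check: the skb as a whole must end above snd_una. */
	if (!seq_after(skb_end_seq, snd_una))
		return false;

	/* Added check: the shifted range itself must end above snd_una. */
	if (!seq_after(skb_seq + len, snd_una))
		return false;

	return true;
}
```

With snd_una = 2000 and an skb covering [1000, 3000), shifting a
500-byte prefix passes the first check but fails the second, because
the shifted range ends at 1500.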

Shifting SACKed ranges below snd_una is problematic because for such
ranges tcp_sacktag_one() short-circuits: it does not declare anything
as SACKed and does not increase sacked_out.
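
As a rough illustration of that short-circuit, here is a reduced
model. The struct tp_model type and sacktag_one_model() are
hypothetical stand-ins; the real tcp_sacktag_one() takes more
arguments and does considerably more bookkeeping.

```c
#include <stdint.h>

/* Hypothetical, heavily reduced stand-in for the sender state. */
struct tp_model {
	uint32_t snd_una;	/* first unacknowledged sequence number */
	uint32_t sacked_out;	/* count of SACKed packets */
};

/*
 * Model of the behavior described above: a range that ends at or
 * below snd_una is ignored, so sacked_out is left untouched.
 */
static void sacktag_one_model(struct tp_model *tp, uint32_t end_seq,
			      uint32_t pcount)
{
	/* Wrap-safe test for "end_seq is not after snd_una". */
	if ((int32_t)(tp->snd_una - end_seq) >= 0)
		return;

	tp->sacked_out += pcount;
}
```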

Before the fixes in commits cc9a672ee5
and daef52bab1, shifting SACKed ranges
below snd_una happened to work because tcp_shifted_skb() was always
(incorrectly) passing in to tcp_sacktag_one() an skb whose end_seq
tcp_shift_skb_data() had already guaranteed was beyond snd_una. Hence
tcp_sacktag_one() never short-circuited and always increased
tp->sacked_out in this case.

After those two fixes, my testing has verified that shifting SACKed
ranges below snd_una could cause tp->sacked_out to go negative with
the following sequence of events:

(1) tcp_shift_skb_data() sees an skb whose end_seq is beyond snd_una,
    then shifts a prefix of that skb that is below snd_una

(2) tcp_shifted_skb() increments the packet count of the
    already-SACKed prev sk_buff

(3) tcp_sacktag_one() sees the end of the new SACKed range is below
    snd_una, so it short-circuits and doesn't increase tp->sacked_out

(4) tcp_clean_rtx_queue() sees the SACKed skb has been ACKed,
    decrements tp->sacked_out by this "inflated" pcount that was
    missing a matching increase in tp->sacked_out, and hence
    tp->sacked_out underflows to a u32 like 0xFFFFFFFF, which, cast
    to s32, is negative.

(5) this leads to the warnings seen in the recent "WARNING: at
    net/ipv4/tcp_input.c:3418" thread on the netdev list; e.g.:
    tcp_input.c:3418  WARN_ON((int)tp->sacked_out < 0);
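
The arithmetic behind the underflow in the tcp_clean_rtx_queue() step
can be reproduced in user space. The helper names below are
hypothetical; only the u32 subtraction and the signed reinterpretation
are taken from the description above.

```c
#include <stdint.h>

/*
 * Hypothetical helper: value of sacked_out after ACK cleanup
 * subtracts an skb's pcount. Unsigned arithmetic wraps modulo 2^32.
 */
static uint32_t sacked_out_after_clean(uint32_t sacked_out, uint32_t pcount)
{
	return sacked_out - pcount;
}

/*
 * Same test as the WARN_ON() condition quoted above. Converting an
 * out-of-range value to a signed type is implementation-defined in
 * C; this assumes the usual two's complement behavior.
 */
static int sacked_out_is_negative(uint32_t sacked_out)
{
	return (int32_t)sacked_out < 0;
}
```

If sacked_out was never increased (0) and the cleanup subtracts a
pcount of 1, the result is 0xFFFFFFFF, which the signed test reports
as negative.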

More generally, I think this bug can be tickled in some cases where
two or more ACKs from the receiver are lost and then a DSACK arrives
that is immediately above an existing SACKed skb in the write queue.

This fix changes tcp_shift_skb_data() to abort this sequence at step
(1) in the scenario above by noticing that the bytes are below snd_una
and not shifting them.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-06 14:43:49 -05:00
netfilter Merge branch 'for-linus' of git://selinuxproject.org/~jmorris/linux-security 2012-01-14 18:36:33 -08:00
af_inet.c per-netns ipv4 sysctl_tcp_mem 2011-12-12 19:04:11 -05:00
ah4.c ah: Don't return NET_XMIT_DROP on input. 2011-11-12 18:13:32 -05:00
arp.c net: Don't proxy arp respond if iif == rt->dst.dev if private VLAN is disabled 2012-02-10 15:13:36 -05:00
cipso_ipv4.c
datagram.c
devinet.c net: reintroduce missing rcu_assign_pointer() calls 2012-01-12 12:26:56 -08:00
esp4.c
fib_frontend.c
fib_lookup.h
fib_rules.c net: ipv4: export fib_lookup and fib_table_lookup 2011-12-04 22:43:33 +01:00
fib_semantics.c
fib_trie.c net: reintroduce missing rcu_assign_pointer() calls 2012-01-12 12:26:56 -08:00
gre.c
icmp.c net: more accurate skb truesize 2011-10-13 16:05:07 -04:00
igmp.c net: reintroduce missing rcu_assign_pointer() calls 2012-01-12 12:26:56 -08:00
inet_connection_sock.c tcp: bind() optimize port allocation 2012-01-25 21:50:43 -05:00
inet_diag.c inet_diag: Rename inet_diag_req_compat into inet_diag_req 2012-01-11 12:56:06 -08:00
inet_fragment.c
inet_hashtables.c
inet_lro.c net: add skb frag size accessors 2011-10-19 03:10:46 -04:00
inet_timewait_sock.c net: Fix files explicitly needing to include module.h 2011-10-31 19:30:28 -04:00
inetpeer.c inetpeer: initialize ->redirect_genid in inet_getpeer() 2012-01-17 15:52:12 -05:00
ip_forward.c ipv4: Save nexthop address of LSRR/SSRR option to IPCB. 2011-11-23 19:19:32 -05:00
ip_fragment.c treewide: Fix typos in various parts of the kernel, and fix some comments. 2011-12-02 14:57:31 +01:00
ip_gre.c gre: fix spelling in comments 2012-02-24 17:41:11 -05:00
ip_input.c
ip_options.c ipv4: Fix wrong order of ip_rt_get_source() and update iph->daddr. 2012-02-10 15:12:12 -05:00
ip_output.c net: Rename dst_get_neighbour{, _raw} to dst_get_neighbour_noref{, _raw}. 2011-12-05 15:20:19 -05:00
ip_sockglue.c net: use IS_ENABLED(CONFIG_IPV6) 2011-12-11 18:25:16 -05:00
ipcomp.c
ipconfig.c net: fix some sparse errors 2012-01-17 10:31:12 -05:00
ipip.c net: reintroduce missing rcu_assign_pointer() calls 2012-01-12 12:26:56 -08:00
ipmr.c net: reintroduce missing rcu_assign_pointer() calls 2012-01-12 12:26:56 -08:00
Kconfig net: Fix build regression when INET_UDP_DIAG=y and IPV6=m 2012-02-07 13:35:28 -05:00
Makefile tcp memory pressure controls 2011-12-12 19:04:10 -05:00
netfilter.c netfilter: possible unaligned packet header in ip_route_me_harder 2011-11-21 18:46:18 +01:00
ping.c ipv4: ping: Fix recvmsg MSG_OOB error handling. 2012-02-21 17:59:19 -05:00
proc.c tcp: detect loss above high_seq in recovery 2012-01-22 15:08:44 -05:00
protocol.c
raw.c ipv4: Remove all uses of LL_ALLOCATED_SPACE 2011-11-18 14:37:08 -05:00
route.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2011-12-23 17:13:56 -05:00
syncookies.c tcp: Replace constants with #define macros 2011-12-21 01:03:23 -05:00
sysctl_net_ipv4.c tcp: properly initialize tcp memory limits 2012-02-02 14:34:41 -05:00
tcp_bic.c tcp: fix undo after RTO for BIC 2012-01-20 14:17:26 -05:00
tcp_cong.c tcp: do not scale TSO segment size with reordering degree 2011-11-29 00:29:41 -05:00
tcp_cubic.c tcp: fix undo after RTO for CUBIC 2012-01-20 14:17:26 -05:00
tcp_diag.c inet_diag: Rename inet_diag_req into inet_diag_req_v2 2012-01-11 12:56:06 -08:00
tcp_highspeed.c
tcp_htcp.c
tcp_hybla.c
tcp_illinois.c
tcp_input.c tcp: fix tcp_shift_skb_data() to not shift SACKed data below snd_una 2012-03-06 14:43:49 -05:00
tcp_ipv4.c tcp_v4_send_reset: binding oif to iif in no sock case 2012-02-04 18:20:05 -05:00
tcp_lp.c
tcp_memcontrol.c net: decrement memcg jump label when limit, not usage, is changed 2012-01-12 12:27:59 -08:00
tcp_minisocks.c net: use IS_ENABLED(CONFIG_IPV6) 2011-12-11 18:25:16 -05:00
tcp_output.c tcp: fix tcp_trim_head() to adjust segment count with skb MSS 2012-01-30 12:42:58 -05:00
tcp_probe.c
tcp_scalable.c
tcp_timer.c net: Disambiguate kernel message 2012-02-01 14:41:50 -05:00
tcp_vegas.c
tcp_vegas.h
tcp_veno.c
tcp_westwood.c
tcp_yeah.c
tcp.c vfs: fix panic in __d_lookup() with high dentry hashtable counts 2012-02-13 20:45:38 -05:00
tunnel4.c net: use IS_ENABLED(CONFIG_IPV6) 2011-12-11 18:25:16 -05:00
udp_diag.c net: kill duplicate included header 2012-01-17 10:31:12 -05:00
udp_impl.h
udp.c udp: Export code sk lookup routines 2011-12-09 14:14:08 -05:00
udplite.c Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux 2011-11-06 19:44:47 -08:00
xfrm4_input.c
xfrm4_mode_beet.c ipsec: be careful of non existing mac headers 2012-02-23 16:50:45 -05:00
xfrm4_mode_transport.c
xfrm4_mode_tunnel.c ipsec: be careful of non existing mac headers 2012-02-23 16:50:45 -05:00
xfrm4_output.c
xfrm4_policy.c ipv4: fix ipsec forward performance regression 2011-10-24 03:01:22 -04:00
xfrm4_state.c net: Add export.h for EXPORT_SYMBOL/THIS_MODULE to non-modules 2011-10-31 19:30:30 -04:00
xfrm4_tunnel.c net: use IS_ENABLED(CONFIG_IPV6) 2011-12-11 18:25:16 -05:00