kernel_optimize_test

Author	SHA1	Message	Date
Eric Dumazet	588f033075	net: use jump_label for netstamp_needed netstamp_needed seems a good candidate to jump_label conversion. This avoids 3 conditional branches per incoming packet in fast path. No measurable difference, given that these conditional branches are predicted on modern cpus. Only a small icache reduction, thanks to the unlikely() stuff. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-16 17:30:06 -05:00
Eric Dumazet	b2b5ce9d1c	net: introduce build_skb() One of the thing we discussed during netdev 2011 conference was the idea to change some network drivers to allocate/populate their skb at RX completion time, right before feeding the skb to network stack. In old days, we allocated skbs when populating the RX ring. This means bringing into cpu cache sk_buff and skb_shared_info cache lines (since we clear/initialize them), then 'queue' skb->data to NIC. By the time NIC fills a frame in skb->data buffer and host can process it, cpu probably threw away the cache lines from its caches, because lot of things happened between the allocation and final use. So the deal would be to allocate only the data buffer for the NIC to populate its RX ring buffer. And use build_skb() at RX completion to attach a data buffer (now filled with an ethernet frame) to a new skb, initialize the skb_shared_info portion, and give the hot skb to network stack. build_skb() is the function to allocate an skb, caller providing the data buffer that should be attached to it. Drivers are expected to call skb_reserve() right after build_skb() to adjust skb->data to the Ethernet frame (usually skipping NET_SKB_PAD and NET_IP_ALIGN, but some drivers might add a hardware provided alignment) Data provided to build_skb() MUST have been allocated by a prior kmalloc() call, with enough room to add SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) bytes at the end of the data without corrupting incoming frame. data = kmalloc(NET_SKB_PAD + NET_IP_ALIGN + 1536 + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)), GFP_ATOMIC); ... skb = build_skb(data); if (!skb) { recycle_data(data); } else { skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN); ... } Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Eilon Greenstein <eilong@broadcom.com> CC: Ben Hutchings <bhutchings@solarflare.com> CC: Tom Herbert <therbert@google.com> CC: Jamal Hadi Salim <hadi@mojatatu.com> CC: Stephen Hemminger <shemminger@vyatta.com> CC: Thomas Graf <tgraf@infradead.org> CC: Herbert Xu <herbert@gondor.apana.org.au> CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-14 14:13:30 -05:00
Eric Dumazet	8b5c171bb3	neigh: new unresolved queue limits Le mercredi 09 novembre 2011 à 16:21 -0500, David Miller a écrit : > From: David Miller <davem@davemloft.net> > Date: Wed, 09 Nov 2011 16:16:44 -0500 (EST) > > > From: Eric Dumazet <eric.dumazet@gmail.com> > > Date: Wed, 09 Nov 2011 12:14:09 +0100 > > > >> unres_qlen is the number of frames we are able to queue per unresolved > >> neighbour. Its default value (3) was never changed and is responsible > >> for strange drops, especially if IP fragments are used, or multiple > >> sessions start in parallel. Even a single tcp flow can hit this limit. > > ... > > > > Ok, I've applied this, let's see what happens :-) > > Early answer, build fails. > > Please test build this patch with DECNET enabled and resubmit. The > decnet neigh layer still refers to the removed ->queue_len member. > > Thanks. Ouch, this was fixed on one machine yesterday, but not the other one I used this morning, sorry. [PATCH V5 net-next] neigh: new unresolved queue limits unres_qlen is the number of frames we are able to queue per unresolved neighbour. Its default value (3) was never changed and is responsible for strange drops, especially if IP fragments are used, or multiple sessions start in parallel. Even a single tcp flow can hit this limit. $ arp -d 192.168.20.108 ; ping -c 2 -s 8000 192.168.20.108 PING 192.168.20.108 (192.168.20.108) 8000(8028) bytes of data. 8008 bytes from 192.168.20.108: icmp_seq=2 ttl=64 time=0.322 ms Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-14 00:47:54 -05:00
Eric Dumazet	e56c57d0d3	net: rename sk_clone to sk_clone_lock Make clear that sk_clone() and inet_csk_clone() return a locked socket. Add _lock() prefix and kerneldoc. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-08 17:07:07 -05:00
Linus Torvalds	32aaeffbd4	Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits) Revert "tracing: Include module.h in define_trace.h" irq: don't put module.h into irq.h for tracking irqgen modules. bluetooth: macroize two small inlines to avoid module.h ip_vs.h: fix implicit use of module_get/module_put from module.h nf_conntrack.h: fix up fallout from implicit moduleparam.h presence include: replace linux/module.h with "struct module" wherever possible include: convert various register fcns to macros to avoid include chaining crypto.h: remove unused crypto_tfm_alg_modname() inline uwb.h: fix implicit use of asm/page.h for PAGE_SIZE pm_runtime.h: explicitly requires notifier.h linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h miscdevice.h: fix up implicit use of lists and types stop_machine.h: fix implicit use of smp.h for smp_processor_id of: fix implicit use of errno.h in include/linux/of.h of_platform.h: delete needless include <linux/module.h> acpi: remove module.h include from platform/aclinux.h miscdevice.h: delete unnecessary inclusion of module.h device_cgroup.h: delete needless include <linux/module.h> net: sch_generic remove redundant use of <linux/module.h> net: inet_timewait_sock doesnt need <linux/module.h> ... Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in - drivers/media/dvb/frontends/dibx000_common.c - drivers/media/video/{mt9m111.c,ov6650.c} - drivers/mfd/ab3550-core.c - include/linux/dmaengine.h	2011-11-06 19:44:47 -08:00
Tony Lindgren	bc417e30f8	net: Add back alignment for size for __alloc_skb Commit `87fb4b7b53` (net: more accurate skb truesize) changed the alignment of size. This can cause problems at least on some machines with NFS root: Unhandled fault: alignment exception (0x801) at 0xc183a43a Internal error: : 801 [#1] PREEMPT Modules linked in: CPU: 0 Not tainted (3.1.0-08784-g5eeee4a #733) pc : [<c02fbba0>] lr : [<c02fbb9c>] psr: 60000013 sp : c180fef8 ip : 00000000 fp : c181f580 r10: 00000000 r9 : c044b28c r8 : 00000001 r7 : c183a3a0 r6 : c1835be0 r5 : c183a412 r4 : 000001f2 r3 : 00000000 r2 : 00000000 r1 : ffffffe6 r0 : c183a43a Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment kernel Control: 0005317f Table: 10004000 DAC: 00000017 Process swapper (pid: 1, stack limit = 0xc180e270) Stack: (0xc180fef8 to 0xc1810000) fee0: 00000024 00000000 ff00: 00000000 c183b9c0 c183b8e0 c044b28c c0507ccc c019dfc4 c180ff2c c0503cf8 ff20: c180ff4c c180ff4c 00000000 c1835420 c182c740 c18349c0 c05233c0 00000000 ff40: 00000000 c00e6bb8 c180e000 00000000 c04dd82c c0507e7c c050cc18 c183b9c0 ff60: c05233c0 00000000 00000000 c01f34f4 c0430d70 c019d364 c04dd898 c04dd898 ff80: c04dd82c c0507e7c c180e000 00000000 c04c584c c01f4918 c04dd898 c04dd82c ffa0: c04ddd28 c180e000 00000000 c0008758 c181fa60 3231d82c 00000037 00000000 ffc0: 00000000 c04dd898 c04dd82c c04ddd28 00000013 00000000 00000000 00000000 ffe0: 00000000 c04b2224 00000000 c04b21a0 c001056c c001056c 00000000 00000000 Function entered at [<c02fbba0>] from [<c019dfc4>] Function entered at [<c019dfc4>] from [<c01f34f4>] Function entered at [<c01f34f4>] from [<c01f4918>] Function entered at [<c01f4918>] from [<c0008758>] Function entered at [<c0008758>] from [<c04b2224>] Function entered at [<c04b2224>] from [<c001056c>] Code: e1a00005 e3a01028 ebfa7cb0 e35a0000 (e5858028) Here PC is at __alloc_skb and &shinfo->dataref is unaligned because skb->end can be unaligned without this patch. As explained by Eric Dumazet <eric.dumazet@gmail.com>, this happens only with SLOB, and not with SLAB or SLUB: * Eric Dumazet <eric.dumazet@gmail.com> [111102 15:56]: > > Your patch is absolutely needed, I completely forgot about SLOB :( > > since, kmalloc(386) on SLOB gives exactly ksize=386 bytes, not nearest > power of two. > > [ 60.305763] malloc(size=385)->ffff880112c11e38 ksize=386 -> nsize=2 > [ 60.305921] malloc(size=385)->ffff88007c92ce28 ksize=386 -> nsize=2 > [ 60.306898] malloc(size=656)->ffff88007c44ad28 ksize=656 -> nsize=272 > [ 60.325385] malloc(size=656)->ffff88007c575868 ksize=656 -> nsize=272 > [ 60.325531] malloc(size=656)->ffff88011c777230 ksize=656 -> nsize=272 > [ 60.325701] malloc(size=656)->ffff880114011008 ksize=656 -> nsize=272 > [ 60.346716] malloc(size=385)->ffff880114142008 ksize=386 -> nsize=2 > [ 60.346900] malloc(size=385)->ffff88011c777690 ksize=386 -> nsize=2 Signed-off-by: Tony Lindgren <tony@atomide.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-03 18:09:16 -04:00
David S. Miller	045f7b3b00	neigh: Kill bogus SMP protected debugging message. Whatever situations make this state legitimate when SMP also would be legitimate when !SMP and f.e. preemption is enabled. This is dubious enough that we should just delete it entirely. If we want to add debugging for neigh timer races, better more thorough mechanisms are needed. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-01 17:45:55 -04:00
Paul Gortmaker	bc3b2d7fb9	net: Add export.h for EXPORT_SYMBOL/THIS_MODULE to non-modules These files are non modular, but need to export symbols using the macros now living in export.h -- call out the include so that things won't break when we remove the implicit presence of module.h from everywhere. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2011-10-31 19:30:30 -04:00
Paul Gortmaker	3a9a231d97	net: Fix files explicitly needing to include module.h With calls to modular infrastructure, these files really needs the full module.h header. Call it out so some of the cleanups of implicit and unrequired includes elsewhere can be cleaned up. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2011-10-31 19:30:28 -04:00
Eric Dumazet	6a32e4f9dd	vlan: allow nested vlan_do_receive() commit `2425717b27` (net: allow vlan traffic to be received under bond) broke ARP processing on vlan on top of bonding. +-------+ eth0 --\| bond0 \|---bond0.103 eth1 --\| \| +-------+ 52870.115435: skb_gro_reset_offset <-napi_gro_receive 52870.115435: dev_gro_receive <-napi_gro_receive 52870.115435: napi_skb_finish <-napi_gro_receive 52870.115435: netif_receive_skb <-napi_skb_finish 52870.115435: get_rps_cpu <-netif_receive_skb 52870.115435: __netif_receive_skb <-netif_receive_skb 52870.115436: vlan_do_receive <-__netif_receive_skb 52870.115436: bond_handle_frame <-__netif_receive_skb 52870.115436: vlan_do_receive <-__netif_receive_skb 52870.115436: arp_rcv <-__netif_receive_skb 52870.115436: kfree_skb <-arp_rcv Packet is dropped in arp_rcv() because its pkt_type was set to PACKET_OTHERHOST in the first vlan_do_receive() call, since no eth0.103 exists. We really need to change pkt_type only if no more rx_handler is about to be called for the packet. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Reviewed-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-30 04:43:30 -04:00
Thomas Gleixner	b0691c8ee7	net: Unlock sock before calling sk_free() Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-25 19:17:25 -04:00
Linus Torvalds	8a9ea3237e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1745 commits) dp83640: free packet queues on remove dp83640: use proper function to free transmit time stamping packets ipv6: Do not use routes from locally generated RAs \|PATCH net-next] tg3: add tx_dropped counter be2net: don't create multiple RX/TX rings in multi channel mode be2net: don't create multiple TXQs in BE2 be2net: refactor VF setup/teardown code into be_vf_setup/clear() be2net: add vlan/rx-mode/flow-control config to be_setup() net_sched: cls_flow: use skb_header_pointer() ipv4: avoid useless call of the function check_peer_pmtu TCP: remove TCP_DEBUG net: Fix driver name for mdio-gpio.c ipv4: tcp: fix TOS value in ACK messages sent from TIME_WAIT rtnetlink: Add missing manual netlink notification in dev_change_net_namespaces ipv4: fix ipsec forward performance regression jme: fix irq storm after suspend/resume route: fix ICMP redirect validation net: hold sock reference while processing tx timestamps tcp: md5: add more const attributes Add ethtool -g support to virtio_net ... Fix up conflicts in: - drivers/net/Kconfig: The split-up generated a trivial conflict with removal of a stale reference to Documentation/networking/net-modules.txt. Remove it from the new location instead. - fs/sysfs/dir.c: Fairly nasty conflicts with the sysfs rb-tree usage, conflicting with Eric Biederman's changes for tagged directories.	2011-10-25 13:25:22 +02:00
Linus Torvalds	2d03423b23	Merge branch 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core * 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (38 commits) mm: memory hotplug: Check if pages are correctly reserved on a per-section basis Revert "memory hotplug: Correct page reservation checking" Update email address for stable patch submission dynamic_debug: fix undefined reference to `__netdev_printk' dynamic_debug: use a single printk() to emit messages dynamic_debug: remove num_enabled accounting dynamic_debug: consolidate repetitive struct _ddebug descriptor definitions uio: Support physical addresses >32 bits on 32-bit systems sysfs: add unsigned long cast to prevent compile warning drivers: base: print rejected matches with DEBUG_DRIVER memory hotplug: Correct page reservation checking memory hotplug: Refuse to add unaligned memory regions remove the messy code file Documentation/zh_CN/SubmitChecklist ARM: mxc: convert device creation to use platform_device_register_full new helper to create platform devices with dma mask docs/driver-model: Update device class docs docs/driver-model: Document device.groups kobj_uevent: Ignore if some listeners cannot handle message dynamic_debug: make netif_dbg() call __netdev_printk() dynamic_debug: make netdev_dbg() call __netdev_printk() ...	2011-10-25 12:13:59 +02:00
David S. Miller	1805b2f048	Merge branch 'master' of ra.kernel.org:/pub/scm/linux/kernel/git/davem/net	2011-10-24 18:18:09 -04:00
Eric W. Biederman	d2237d3574	rtnetlink: Add missing manual netlink notification in dev_change_net_namespaces Renato Westphal noticed that since commit `a2835763e1` "rtnetlink: handle rtnl_link netlink notifications manually" was merged we no longer send a netlink message when a networking device is moved from one network namespace to another. Fix this by adding the missing manual notification in dev_change_net_namespaces. Since all network devices that are processed by dev_change_net_namspaces are in the initialized state the complicated tests that guard the manual rtmsg_ifinfo calls in rollback_registered and register_netdevice are unnecessary and we can just perform a plain notification. Cc: stable@kernel.org Tested-by: Renato Westphal <renatowestphal@gmail.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-24 03:03:37 -04:00
Richard Cochran	da92b194cc	net: hold sock reference while processing tx timestamps The pair of functions, * skb_clone_tx_timestamp() * skb_complete_tx_timestamp() were designed to allow timestamping in PHY devices. The first function, called during the MAC driver's hard_xmit method, identifies PTP protocol packets, clones them, and gives them to the PHY device driver. The PHY driver may hold onto the packet and deliver it at a later time using the second function, which adds the packet to the socket's error queue. As pointed out by Johannes, nothing prevents the socket from disappearing while the cloned packet is sitting in the PHY driver awaiting a timestamp. This patch fixes the issue by taking a reference on the socket for each such packet. In addition, the comments regarding the usage of these function are expanded to highlight the rule that PHY drivers must use skb_complete_tx_timestamp() to release the packet, in order to release the socket reference, too. These functions first appeared in v2.6.36. Reported-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Cc: <stable@vger.kernel.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-24 02:54:50 -04:00
Eric Dumazet	cf533ea53e	tcp: add const qualifiers where possible Adding const qualifiers to pointers can ease code review, and spot some bugs. It might allow compiler to optimize code further. For example, is it legal to temporary write a null cksum into tcphdr in tcp_md5_hash_header() ? I am afraid a sniffer could catch the temporary null value... Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-21 05:22:42 -04:00
Mihai Maruseac	f04565ddf5	dev: use name hash for dev_seq_ops Instead of using the dev->next chain and trying to resync at each call to dev_seq_start, use the name hash, keeping the bucket and the offset in seq->private field. Tests revealed the following results for ifconfig > /dev/null * 1000 interfaces: * 0.114s without patch * 0.089s with patch * 3000 interfaces: * 0.489s without patch * 0.110s with patch * 5000 interfaces: * 1.363s without patch * 0.250s with patch * 128000 interfaces (other setup): * ~100s without patch * ~30s with patch Signed-off-by: Mihai Maruseac <mmaruseac@ixiacom.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-21 02:54:38 -04:00
Ian Campbell	a8605c6063	net: add opaque struct around skb frag page I've split this bit out of the skb frag destructor patch since it helps enforce the use of the fragment API. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-21 02:52:53 -04:00
Eric Dumazet	33136d12be	pktgen: remove ndelay() call Daniel Turull reported inaccuracies in pktgen when using low packet rates, because we call ndelay(val) with values bigger than 20000. Instead of calling ndelay() for delays < 100us, we can instead loop calling ktime_now() only. Reported-by: Daniel Turull <daniel.turull@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-20 17:00:21 -04:00
Ian Campbell	a0bec1cd8f	net: do not take an additional reference in skb_frag_set_page I audited all of the callers in the tree and only one of them (pktgen) expects it to do so. Taking this reference is pretty obviously confusing and error prone. In particular I looked at the following commits which switched callers of (__)skb_frag_set_page to the skb paged fragment api: `6a930b9f16` cxgb3: convert to SKB paged frag API. `5dc3e196ea` myri10ge: convert to SKB paged frag API. `0e0634d20d` vmxnet3: convert to SKB paged frag API. `86ee8130a4` virtionet: convert to SKB paged frag API. `4a22c4c919` sfc: convert to SKB paged frag API. `18324d690d` cassini: convert to SKB paged frag API. `b061b39e3a` benet: convert to SKB paged frag API. `b7b6a688d2` bnx2: convert to SKB paged frag API. `804cf14ea5` net: xfrm: convert to SKB frag APIs `ea2ab69379` net: convert core to skb paged frag APIs Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-19 19:40:39 -04:00
roy.qing.li@gmail.com	e049f28883	neigh: fix rcu splat in neigh_update() when use dst_get_neighbour to get neighbour, we need rcu_read_lock to protect, since dst_get_neighbour uses rcu_dereference. The bug was reported by Ari Savolainen <ari.m.savolainen@gmail.com> [ 105.612095] [ 105.612096] =================================================== [ 105.612100] [ INFO: suspicious rcu_dereference_check() usage. ] [ 105.612101] --------------------------------------------------- [ 105.612103] include/net/dst.h:91 invoked rcu_dereference_check() without protection! [ 105.612105] [ 105.612106] other info that might help us debug this: [ 105.612106] [ 105.612108] [ 105.612108] rcu_scheduler_active = 1, debug_locks = 0 [ 105.612110] 1 lock held by dnsmasq/2618: [ 105.612111] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff815df8c7>] rtnl_lock+0x17/0x20 [ 105.612120] [ 105.612121] stack backtrace: [ 105.612123] Pid: 2618, comm: dnsmasq Not tainted 3.1.0-rc1 #41 [ 105.612125] Call Trace: [ 105.612129] [<ffffffff810ccdcb>] lockdep_rcu_dereference+0xbb/0xc0 [ 105.612132] [<ffffffff815dc5a9>] neigh_update+0x4f9/0x5f0 [ 105.612135] [<ffffffff815da001>] ? neigh_lookup+0xe1/0x220 [ 105.612139] [<ffffffff81639298>] arp_req_set+0xb8/0x230 [ 105.612142] [<ffffffff8163a59f>] arp_ioctl+0x1bf/0x310 [ 105.612146] [<ffffffff810baa40>] ? lock_hrtimer_base.isra.26+0x30/0x60 [ 105.612150] [<ffffffff8163fb75>] inet_ioctl+0x85/0x90 [ 105.612154] [<ffffffff815b5520>] sock_do_ioctl+0x30/0x70 [ 105.612157] [<ffffffff815b55d3>] sock_ioctl+0x73/0x280 [ 105.612162] [<ffffffff811b7698>] do_vfs_ioctl+0x98/0x570 [ 105.612165] [<ffffffff811a5c40>] ? fget_light+0x340/0x3a0 [ 105.612168] [<ffffffff811b7bbf>] sys_ioctl+0x4f/0x80 [ 105.612172] [<ffffffff816fdcab>] system_call_fastpath+0x16/0x1b Reported-by: Ari Savolainen <ari.m.savolainen@gmail.com> Signed-off-by: RongQing <roy.qing.li@gmail.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-19 19:38:51 -04:00
Dan Carpenter	4f25af2782	filter: use unsigned int to silence static checker warning This is just a cleanup. My testing version of Smatch warns about this: net/core/filter.c +380 check_load_and_stores(6) warn: check 'flen' for negative values flen comes from the user. We try to clamp the values here between 1 and BPF_MAXINSNS but the clamp doesn't work because it could be negative. This is a bug, but it's not exploitable. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-19 19:35:51 -04:00
Yan, Zheng	afaef734e5	fib_rules: fix unresolved_rules counting we should decrease ops->unresolved_rules when deleting a unresolved rule. Signed-off-by: Zheng Yan <zheng.z.yan@intel.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-19 19:17:41 -04:00
Richard Cochran	4dc360c5e7	net: validate HWTSTAMP ioctl parameters This patch adds a sanity check on the values provided by user space for the hardware time stamping configuration. If the values lie outside of the absolute limits, then the ioctl request will be denied. Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-19 17:00:35 -04:00
Eric W. Biederman	850a545bd8	net: Move rcu_barrier from rollback_registered_many to netdev_run_todo. This patch moves the rcu_barrier from rollback_registered_many (inside the rtnl_lock) into netdev_run_todo (just outside the rtnl_lock). This allows us to gain the full benefit of sychronize_net calling synchronize_rcu_expedited when the rtnl_lock is held. The rcu_barrier in rollback_registered_many was originally a synchronize_net but was promoted to be a rcu_barrier() when it was found that people were unnecessarily hitting the 250ms wait in netdev_wait_allrefs(). Changing the rcu_barrier back to a synchronize_net is therefore safe. Since we only care about waiting for the rcu callbacks before we get to netdev_wait_allrefs() it is also safe to move the wait into netdev_run_todo. This was tested by creating and destroying 1000 tap devices and observing /proc/lock_stat. /proc/lock_stat reports this change reduces the hold times of the rtnl_lock by a factor of 10. There was no observable difference in the amount of time it takes to destroy a network device. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-19 16:59:42 -04:00
Andy Fleming	3d153a7c8b	net: Allow skb_recycle_check to be done in stages skb_recycle_check resets the skb if it's eligible for recycling. However, there are times when a driver might want to optionally manipulate the skb data with the skb before resetting the skb, but after it has determined eligibility. We do this by splitting the eligibility check from the skb reset, creating two inline functions to accomplish that task. Signed-off-by: Andy Fleming <afleming@freescale.com> Acked-by: David Daney <david.daney@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-19 15:59:45 -04:00
Eric Dumazet	9e903e0852	net: add skb frag size accessors To ease skb->truesize sanitization, its better to be able to localize all references to skb frags size. Define accessors : skb_frag_size() to fetch frag size, and skb_frag_size_{set\|add\|sub}() to manipulate it. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-19 03:10:46 -04:00
John Fastabend	2425717b27	net: allow vlan traffic to be received under bond The following configuration used to work as I expected. At least we could use the fcoe interfaces to do MPIO and the bond0 iface to do load balancing or failover. ---eth2.228-fcoe \| eth2 -----\| \| \|---- bond0 \| eth3 -----\| \| ---eth3.228-fcoe This worked because of a change we added to allow inactive slaves to rx 'exact' matches. This functionality was kept intact with the rx_handler mechanism. However now the vlan interface attached to the active slave never receives traffic because the bonding rx_handler updates the skb->dev and goto's another_round. Previously, the vlan_do_receive() logic was called before the bonding rx_handler. Now by the time vlan_do_receive calls vlan_find_dev() the skb->dev is set to bond0 and it is clear no vlan is attached to this iface. The vlan lookup fails. This patch moves the VLAN check above the rx_handler. A VLAN tagged frame is now routed to the eth2.228-fcoe iface in the above schematic. Untagged frames continue to the bond0 as normal. This case also remains intact, eth2 --> bond0 --> vlan.228 Here the skb is VLAN tagged but the vlan lookup fails on eth2 causing the bonding rx_handler to be called. On the second pass the vlan lookup is on the bond0 iface and completes as expected. Putting a VLAN.228 on both the bond0 and eth2 device will result in eth2.228 receiving the skb. I don't think this is completely unexpected and was the result prior to the rx_handler result. Note, the same setup is also used for other storage traffic that MPIO is used with eg. iSCSI and similar setups can be contrived without storage protocols. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Jesse Gross <jesse@nicira.com> Reviewed-by: Jiri Pirko <jpirko@redhat.com> Tested-by: Hans Schillstrom <hams.schillstrom@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-18 23:46:46 -04:00
huajun li	6ccc3abdc9	net/flow: Fix potential memory leak While preparing net flow caches, once a fail may cause potential memory leak , fix it. Signed-off-by: Huajun Li <huajun.li.lee@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-17 19:18:42 -04:00
Greg Rose	5f8444a3fa	if_link: Add additional parameter to IFLA_VF_INFO for spoof checking Add configuration setting for drivers to turn spoof checking on or off for discrete VFs. v2 - Fix indentation problem, wrap the ifla_vf_info structure in #ifdef __KERNEL__ to prevent user space from accessing and change function paramater for the spoof check setting netdev op from u8 to bool. v3 - Preset spoof check setting to -1 so that user space tools such as ip can detect that the driver didn't report a spoofcheck setting. Prevents incorrect display of spoof check settings for drivers that don't report it. Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2011-10-16 13:15:38 -07:00
Eric Dumazet	87fb4b7b53	net: more accurate skb truesize skb truesize currently accounts for sk_buff struct and part of skb head. kmalloc() roundings are also ignored. Considering that skb_shared_info is larger than sk_buff, its time to take it into account for better memory accounting. This patch introduces SKB_TRUESIZE(X) macro to centralize various assumptions into a single place. At skb alloc phase, we put skb_shared_info struct at the exact end of skb head, to allow a better use of memory (lowering number of reallocations), since kmalloc() gives us power-of-two memory blocks. Unless SLUB/SLUB debug is active, both skb->head and skb_shared_info are aligned to cache lines, as before. Note: This patch might trigger performance regressions because of misconfigured protocol stacks, hitting per socket or global memory limits that were previously not reached. But its a necessary step for a more accurate memory accounting. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Andi Kleen <ak@linux.intel.com> CC: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-13 16:05:07 -04:00
Johannes Berg	8083f0fc96	net: use sock_valbool_flag to set/clear SOCK_RXQ_OVFL There's no point in open-coding sock_valbool_flag(). Signed-off-by: Johannes Berg <johannes.berg@intel.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-07 13:27:07 -04:00
Ben Hutchings	09994d1b09	RPS: Ensure that an expired hardware filter can be re-added later Amir Vadai wrote: > When a stream is paused, and its rule is expired while it is paused, > no new rule will be configured to the HW when traffic resume. [...] > - When stream was resumed, traffic was steered again by RSS, and > because current-cpu was equal to desired-cpu, ndo_rx_flow_steer > wasn't called and no rule was configured to the HW. Fix this by setting the flow's current CPU only in the table for the newly selected RX queue. Reported-and-tested-by: Amir Vadai <amirv@dev.mellanox.co.il> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-03 12:14:45 -04:00
Changli Gao	5dd17e08f3	net: rps: fix the support for PPPOE The upper protocol numbers of PPPOE are different, and should be treated specially. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-28 13:34:25 -04:00
Eric Dumazet	16e5726269	af_unix: dont send SCM_CREDENTIALS by default Since commit `7361c36c52` (af_unix: Allow credentials to work across user and pid namespaces) af_unix performance dropped a lot. This is because we now take a reference on pid and cred in each write(), and release them in read(), usually done from another process, eventually from another cpu. This triggers false sharing. # Events: 154K cycles # # Overhead Command Shared Object Symbol # ........ ....... .................. ......................... # 10.40% hackbench [kernel.kallsyms] [k] put_pid 8.60% hackbench [kernel.kallsyms] [k] unix_stream_recvmsg 7.87% hackbench [kernel.kallsyms] [k] unix_stream_sendmsg 6.11% hackbench [kernel.kallsyms] [k] do_raw_spin_lock 4.95% hackbench [kernel.kallsyms] [k] unix_scm_to_skb 4.87% hackbench [kernel.kallsyms] [k] pid_nr_ns 4.34% hackbench [kernel.kallsyms] [k] cred_to_ucred 2.39% hackbench [kernel.kallsyms] [k] unix_destruct_scm 2.24% hackbench [kernel.kallsyms] [k] sub_preempt_count 1.75% hackbench [kernel.kallsyms] [k] fget_light 1.51% hackbench [kernel.kallsyms] [k] __mutex_lock_interruptible_slowpath 1.42% hackbench [kernel.kallsyms] [k] sock_alloc_send_pskb This patch includes SCM_CREDENTIALS information in a af_unix message/skb only if requested by the sender, [man 7 unix for details how to include ancillary data using sendmsg() system call] Note: This might break buggy applications that expected SCM_CREDENTIAL from an unaware write() system call, and receiver not using SO_PASSCRED socket option. If SOCK_PASSCRED is set on source or destination socket, we still include credentials for mere write() syscalls. Performance boost in hackbench : more than 50% gain on a 16 thread machine (2 quad-core cpus, 2 threads per core) hackbench 20 thread 2000 4.228 sec instead of 9.102 sec Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-28 13:29:50 -04:00
David S. Miller	8decf86879	Merge branch 'master' of github.com:davem330/net Conflicts: MAINTAINERS drivers/net/Kconfig drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c drivers/net/ethernet/broadcom/tg3.c drivers/net/wireless/iwlwifi/iwl-pci.c drivers/net/wireless/iwlwifi/iwl-trans-tx-pcie.c drivers/net/wireless/rt2x00/rt2800usb.c drivers/net/wireless/wl12xx/main.c	2011-09-22 03:23:13 -04:00
Gao feng	561dac2d41	fib:fix BUG_ON in fib_nl_newrule when add new fib rule add new fib rule can cause BUG_ON happen the reproduce shell is ip rule add pref 38 ip rule add pref 38 ip rule add to 192.168.3.0/24 goto 38 ip rule del pref 38 ip rule add to 192.168.3.0/24 goto 38 ip rule add pref 38 then the BUG_ON will happen del BUG_ON and use (ctarget == NULL) identify whether this rule is unresolved Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-21 15:16:40 -04:00
dpward	aa1c366e4f	net: Handle different key sizes between address families in flow cache With the conversion of struct flowi to a union of AF-specific structs, some operations on the flow cache need to account for the exact size of the key. Signed-off-by: David Ward <david.ward@ll.mit.edu> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-16 17:47:28 -04:00
Jiri Pirko	4bc71cb983	net: consolidate and fix ethtool_ops->get_settings calling This patch does several things: - introduces __ethtool_get_settings which is called from ethtool code and from drivers as well. Put ASSERT_RTNL there. - dev_ethtool_get_settings() is replaced by __ethtool_get_settings() - changes calling in drivers so rtnl locking is respected. In iboe_get_rate was previously ->get_settings() called unlocked. This fixes it. Also prb_calc_retire_blk_tmo() in af_packet.c had the same problem. Also fixed by calling __dev_get_by_index() instead of dev_get_by_index() and holding rtnl_lock for both calls. - introduces rtnl_lock in bnx2fc_vport_create() and fcoe_vport_create() so bnx2fc_if_create() and fcoe_if_create() are called locked as they are from other places. - use __ethtool_get_settings() in bonding code Signed-off-by: Jiri Pirko <jpirko@redhat.com> v2->v3: -removed dev_ethtool_get_settings() -added ASSERT_RTNL into __ethtool_get_settings() -prb_calc_retire_blk_tmo - use __dev_get_by_index() and lock around it and __ethtool_get_settings() call v1->v2: add missing export_symbol Reviewed-by: Ben Hutchings <bhutchings@solarflare.com> [except FCoE bits] Acked-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-15 17:32:26 -04:00
Eric Dumazet	c37e0c9930	net: linkwatch: allow vlans to get carrier changes faster There is a time-lag of IFF_RUNNING flag consistency between vlan and real devices when the real devices are in problem such as link or cable broken. This leads to a degradation of Availability such as a delay of failover in HA systems using vlan since the detection of the problem at real device is delayed. We can avoid the linkwatch delay (~1 sec) for devices linked to another ones, since delay is already done for the realdev. Based on a previous patch from Mitsuo Hayasaka Reported-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Patrick McHardy <kaber@trash.net> Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl> Cc: Tom Herbert <therbert@google.com> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: Jesse Gross <jesse@nicira.com> Tested-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-15 15:36:34 -04:00
Michael S. Tsirkin	48c830120f	net: copy userspace buffers on device forwarding dev_forward_skb loops an skb back into host networking stack which might hang on the memory indefinitely. In particular, this can happen in macvtap in bridged mode. Copy the userspace fragments to avoid blocking the sender in that case. As this patch makes skb_copy_ubufs extern now, I also added some documentation and made it clear the SKBTX_DEV_ZEROCOPY flag automatically instead of doing it in all callers. This can be made into a separate patch if people feel it's worth it. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-15 14:49:44 -04:00
dpward	0542b69e2c	net: Make flow cache namespace-aware flow_cache_lookup will return a cached object (or null pointer) that the resolver (i.e. xfrm_policy_lookup) previously found for another namespace using the same key/family/dir. Instead, make the namespace part of what identifies entries in the cache. As before, flow_entry_valid will return 0 for entries where the namespace has been deleted, and they will be removed from the cache the next time flow_cache_gc_task is run. Reported-by: Andrew Dickinson <whydna@whydna.net> Signed-off-by: David Ward <david.ward@ll.mit.edu> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-15 14:49:44 -04:00
Eric Dumazet	e9278a475f	netpoll: fix incorrect access to skb data in __netpoll_rx __netpoll_rx() doesnt properly handle skbs with small header pskb_may_pull() or pskb_trim_rcsum() can change skb->data, we must reload it. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-26 12:49:04 -04:00
Eric Dumazet	20e6074eb8	arp: fix rcu lockdep splat in arp_process() Dave Jones reported a lockdep splat triggered by an arp_process() call from parp_redo(). Commit `faa9dcf793` (arp: RCU changes) is the origin of the bug, since it assumed arp_process() was called under rcu_read_lock(), which is not true in this particular path. Instead of adding rcu_read_lock() in parp_redo(), I chose to add it in neigh_proxy_process() to take care of IPv6 side too. =================================================== [ INFO: suspicious rcu_dereference_check() usage. ] --------------------------------------------------- include/linux/inetdevice.h:209 invoked rcu_dereference_check() without protection! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 4 locks held by setfiles/2123: #0: (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff8114cbc4>] walk_component+0x1ef/0x3e8 #1: (&isec->lock){+.+.+.}, at: [<ffffffff81204bca>] inode_doinit_with_dentry+0x3f/0x41f #2: (&tbl->proxy_timer){+.-...}, at: [<ffffffff8106a803>] run_timer_softirq+0x157/0x372 #3: (class){+.-...}, at: [<ffffffff8141f256>] neigh_proxy_process +0x36/0x103 stack backtrace: Pid: 2123, comm: setfiles Tainted: G W 3.1.0-0.rc2.git7.2.fc16.x86_64 #1 Call Trace: <IRQ> [<ffffffff8108ca23>] lockdep_rcu_dereference+0xa7/0xaf [<ffffffff8146a0b7>] __in_dev_get_rcu+0x55/0x5d [<ffffffff8146a751>] arp_process+0x25/0x4d7 [<ffffffff8146ac11>] parp_redo+0xe/0x10 [<ffffffff8141f2ba>] neigh_proxy_process+0x9a/0x103 [<ffffffff8106a8c4>] run_timer_softirq+0x218/0x372 [<ffffffff8106a803>] ? run_timer_softirq+0x157/0x372 [<ffffffff8141f220>] ? neigh_stat_seq_open+0x41/0x41 [<ffffffff8108f2f0>] ? mark_held_locks+0x6d/0x95 [<ffffffff81062bb6>] __do_softirq+0x112/0x25a [<ffffffff8150d27c>] call_softirq+0x1c/0x30 [<ffffffff81010bf5>] do_softirq+0x4b/0xa2 [<ffffffff81062f65>] irq_exit+0x5d/0xcf [<ffffffff8150dc11>] smp_apic_timer_interrupt+0x7c/0x8a [<ffffffff8150baf3>] apic_timer_interrupt+0x73/0x80 <EOI> [<ffffffff8108f439>] ? trace_hardirqs_on_caller+0x121/0x158 [<ffffffff814fc285>] ? __slab_free+0x30/0x24c [<ffffffff814fc283>] ? __slab_free+0x2e/0x24c [<ffffffff81204e74>] ? inode_doinit_with_dentry+0x2e9/0x41f [<ffffffff81204e74>] ? inode_doinit_with_dentry+0x2e9/0x41f [<ffffffff81204e74>] ? inode_doinit_with_dentry+0x2e9/0x41f [<ffffffff81130cb0>] kfree+0x108/0x131 [<ffffffff81204e74>] inode_doinit_with_dentry+0x2e9/0x41f [<ffffffff81204fc6>] selinux_d_instantiate+0x1c/0x1e [<ffffffff81200f4f>] security_d_instantiate+0x21/0x23 [<ffffffff81154625>] d_instantiate+0x5c/0x61 [<ffffffff811563ca>] d_splice_alias+0xbc/0xd2 [<ffffffff811b17ff>] ext4_lookup+0xba/0xeb [<ffffffff8114bf1e>] d_alloc_and_lookup+0x45/0x6b [<ffffffff8114cbea>] walk_component+0x215/0x3e8 [<ffffffff8114cdf8>] lookup_last+0x3b/0x3d [<ffffffff8114daf3>] path_lookupat+0x82/0x2af [<ffffffff8110fc53>] ? might_fault+0xa5/0xac [<ffffffff8110fc0a>] ? might_fault+0x5c/0xac [<ffffffff8114c564>] ? getname_flags+0x31/0x1ca [<ffffffff8114dd48>] do_path_lookup+0x28/0x97 [<ffffffff8114df2c>] user_path_at+0x59/0x96 [<ffffffff811467ad>] ? cp_new_stat+0xf7/0x10d [<ffffffff811469a6>] vfs_fstatat+0x44/0x6e [<ffffffff811469ee>] vfs_lstat+0x1e/0x20 [<ffffffff81146b3d>] sys_newlstat+0x1a/0x33 [<ffffffff8108f439>] ? trace_hardirqs_on_caller+0x121/0x158 [<ffffffff812535fe>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff8150af82>] system_call_fastpath+0x16/0x1b Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-24 17:55:00 -07:00
Ian Campbell	ea2ab69379	net: convert core to skb paged frag APIs Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-24 17:52:11 -07:00
Eric Dumazet	ec5efe7946	rps: support IPIP encapsulation Skip IPIP header to get proper layer-4 information. Like GRE tunnels, this only works if rxhash is not already provided by the device itself (ethtool -K ethX rxhash off), to allow kernel compute a software rxhash. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-24 16:13:55 -07:00
Jason Baron	ffa10cb47a	dynamic_debug: make netdev_dbg() call __netdev_printk() Previously, if dynamic debug was enabled netdev_dbg() was using dynamic_dev_dbg() to print out the underlying msg. Fix this by making sure netdev_dbg() uses __netdev_printk(). Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Jason Baron <jbaron@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2011-08-22 18:23:06 -07:00
Jiri Pirko	0dfe178239	net: vlan: goto another_round instead of calling __netif_receive_skb Now, when vlan tag on untagged in non-accelerated path is stripped from skb, headers are reset right away. Benefit from that and avoid calling __netif_receive_skb recursivelly and just use another_round. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-22 12:43:22 -07:00
Changli Gao	6461be3a54	net: Preserve ooo_okay when copying skb header Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-20 14:20:49 -07:00

1 2 3 4 5 ...

2291 Commits