tmp_suning_uos_patched

Author	SHA1	Message	Date
Lennert Buytenhek	8a70cefa30	ieee802154: Fix sockaddr_ieee802154 implicit padding information leak. The AF_IEEE802154 sockaddr looks like this: struct sockaddr_ieee802154 { sa_family_t family; /* AF_IEEE802154 / struct ieee802154_addr_sa addr; }; struct ieee802154_addr_sa { int addr_type; u16 pan_id; union { u8 hwaddr[IEEE802154_ADDR_LEN]; u16 short_addr; }; }; On most architectures there will be implicit structure padding here, in two different places: In struct sockaddr_ieee802154, two bytes of padding between 'family' (unsigned short) and 'addr', so that 'addr' starts on a four byte boundary. * In struct ieee802154_addr_sa, two bytes at the end of the structure, to make the structure 16 bytes. When calling recvmsg(2) on a PF_IEEE802154 SOCK_DGRAM socket, the ieee802154 stack constructs a struct sockaddr_ieee802154 on the kernel stack without clearing these padding fields, and, depending on the addr_type, between four and ten bytes of uncleared kernel stack will be copied to userspace. We can't just insert two 'u16 __pad's in the right places and zero those before copying an address to userspace, as not all architectures insert this implicit padding -- from a quick test it seems that avr32, cris and m68k don't insert this padding, while every other architecture that I have cross compilers for does insert this padding. The easiest way to plug the leak is to just memset the whole struct sockaddr_ieee802154 before filling in the fields we want to fill in, and that's what this patch does. Cc: stable@vger.kernel.org Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org> Acked-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-06-04 12:26:58 +02:00
Varka Bhadram	07bd77fa4c	cfg802154: fix rdev-ops naming convension and format specifiers This patch make to use the same naming convention that mac802154 tracing follows and fixes the format specifier for extended addr. Signed-off-by: Varka Bhadram <varkab@cdac.in> Acked-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-06-04 12:26:58 +02:00
Varka Bhadram	0ecc4e688b	mac802154: add trace functionality for driver ops This patch adds trace events for driver operations. Signed-off-by: Varka Bhadram <varkab@cdac.in> Acked-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-06-02 19:21:09 +02:00
Alexander Aring	1caf6f476e	ieee802154: 6lowpan: set ackreq when needed This patch sets the acknowledge request bit inside the 802.15.4 mac header when frame retries is 0 or above. The other frame retries value which is -1 indicates that the transmitter doesn't care about an acknowledge frame which will be ignored after transmitting if the node sends anyway an ack frame after receiving. This is currently unnecessary traffic if the max frame retries parameter is -1. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-06-02 17:09:35 +02:00
Lennert Buytenhek	daf4e2c892	ieee802154: Fix EUI-64 station address validation. Refuse to allow setting an EUI-64 group address as an interface address, as those are not valid station addresses. Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org> Acked-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-31 13:40:53 +02:00
David S. Miller	a9ab2184f4	Included changes: - checkpatch fixes - code cleanup - debugfs component is now compiled only if DEBUG_FS is selected - update copyright years - disable by default not-so-user-safe features -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJVaCjMAAoJEOb/4TMchkvfQS0QANDW/0eOT8azlbik5+MZTC5i d+K1Xbc7Qn7ebo5F27eRGNrgV5a8Wwx1JUCANXhAfjSURItj3KoHbjLYN2lJLn5L mBoU7IWwqUzX2garm7xKm94TTaN3Q6t/NGYVeQqJXNcWBDJQcNAr7ECg8tpV16Ec +o6FPsuZBX1dKNijvcy77VNGAaauhAbfMuAYRJDx6CtCIyWg+f/vcAeTR2PCmbMD FP2qD2zHBnR5feQF9YtrCOUHX3SzKlnCBQ1DyUzWbC40eGJWQPZiml+CC0r7fNrI buOlk2yDI1Pc0/TIDrm3B3f0LqoQhmC4h0EDP/tazoiHAe/Vh06D4dmsC81XBM+H 9wEzU+C20DUjDVIyTzboIDjcSNwTN5TxK0dG72vc+yDfSSAmJVtLQ8dqQevRp6cd NPVebjCyJKXoBZWd1o7KO0s41dTbFBVHrA5ZLaEu5TcCMpKHzicJJyMr+OLgqTQE tqLMzqR+7VPmJfIwXuHX+wqHlsJCkrU1zyiuOyBn6uQ4rvbg503eadJffOAaLeCH FpOtKkQ34HNDUchgmiFVWWV1w6r3Si3/a7WRJN55B49sIZqJxxQfB2Evlk8vYNzT sVDFsNk8QnbaL2yCwxJEXj/Kgyfxj/PLAoxDnkt+cHWOF6nbGPHyIdDJQGSAHFrp NcZisqImn5iJS+2QV68a =2UBm -----END PGP SIGNATURE----- Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge Antonio Quartulli says: ==================== Included changes: - checkpatch fixes - code cleanup - debugfs component is now compiled only if DEBUG_FS is selected - update copyright years - disable by default not-so-user-safe features ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-31 01:07:06 -07:00
Wang Long	282c320d33	netevent: remove automatic variable in register_netevent_notifier() Remove automatic variable 'err' in register_netevent_notifier() and return the result of atomic_notifier_chain_register() directly. Signed-off-by: Wang Long <long.wanglong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-31 00:03:21 -07:00
David S. Miller	583d3f5af2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next, they are: 1) default CONFIG_NETFILTER_INGRESS to y for easier compile-testing of all options. 2) Allow to bind a table to net_device. This introduces the internal NFT_AF_NEEDS_DEV flag to perform a mandatory check for this binding. This is required by the next patch. 3) Add the 'netdev' table family, this new table allows you to create ingress filter basechains. This provides access to the existing nf_tables features from ingress. 4) Kill unused argument from compat_find_calc_{match,target} in ip_tables and ip6_tables, from Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-31 00:02:30 -07:00
Julia Lawall	3d2f6d41d1	ipv6: drop unneeded goto Delete jump to a label on the next line, when that label is not used elsewhere. A simplified version of the semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @r@ identifier l; @@ -if (...) goto l; -l: // </smpl> Also remove the unnecessary ret variable. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-30 23:48:36 -07:00
David S. Miller	9d52bf0a23	Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Johan Hedberg says: ==================== pull request: bluetooth-next 2015-05-28 Here's a set of patches intended for 4.2. The majority of the changes are on the 802.15.4 side of things rather than Bluetooth related: - All sorts of cleanups & fixes to ieee802154 and related drivers - Rework of tx power support in ieee802154 and its drivers - Support for setting ieee802154 tx power through nl802154 - New IDs for the btusb driver - Various cleanups & smaller fixes to btusb - New btrtl driver for Realtec devices - Fix suspend/resume for Realtek devices Please let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-30 23:26:45 -07:00
Ying Xue	1ea23a2117	tipc: unconditionally put sock refcnt when sock timer to be deleted is pending As sock refcnt is taken when sock timer is started in sk_reset_timer(), the sock refcnt should be put when sock timer to be deleted is in pending state no matter what "probing_state" value of tipc sock is. Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-30 18:08:37 -07:00
Alexei Starovoitov	37e82c2f97	bpf: allow BPF programs access skb->skb_iif and skb->dev->ifindex fields classic BPF already exposes skb->dev->ifindex via SKF_AD_IFINDEX extension. Allow eBPF program to access it as well. Note that classic aborts execution of the program if 'skb->dev == NULL' (which is inconvenient for program writers), whereas eBPF returns zero in such case. Also expose the 'skb_iif' field, since programs triggered by redirected packet need to known the original interface index. Summary: __skb->ifindex -> skb->dev->ifindex __skb->ingress_ifindex -> skb->skb_iif Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-30 17:51:13 -07:00
Sorin Dumitru	8133534c76	net: limit tcp/udp rmem/wmem to SOCK_{RCV,SND}BUF_MIN This is similar to b1cb59cf2efe(net: sysctl_net_core: check SNDBUF and RCVBUF for min length). I don't think too small values can cause crashes in the case of udp and tcp, but I've seen this set to too small values which triggered awful performance. It also makes the setting consistent across all the wmem/rmem sysctls. Signed-off-by: Sorin Dumitru <sdumitru@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-30 17:37:44 -07:00
Antonio Quartulli	8ea64e2708	batman-adv: Use common declaration order in *_send_skb_(packet\|unicast) Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>	2015-05-29 10:13:37 +02:00
Markus Pargmann	01b97a3eed	batman-adv: iv_ogm_orig_update, remove unnecessary brackets Remove these unnecessary brackets inside a condition. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:37 +02:00
Markus Pargmann	8f34b38878	batman-adv: iv_ogm_can_aggregate, code readability This patch tries to increase code readability by negating the first if block and rearranging some of the other conditional blocks. This way we save an indentation level, we also save some allocation that is not necessary for one of the conditions. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:37 +02:00
Marek Lindner	fc1f869366	batman-adv: checkpatch - spaces preferred around that '*' Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:37 +02:00
Marek Lindner	00f548bf54	batman-adv: checkpatch - comparison to NULL could be rewritten Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:37 +02:00
Sven Eckelmann	dab7b62190	batman-adv: Use safer default config for optional features The current default settings for optional features in batman-adv seems to be based around the idea that the user only compiles what he requires. They will automatically enabled when they are compiled in. For example the network coding part of batman-adv is by default disabled in the out-of-tree module but will be enabled when the code is compiled during the module build. But distributions like Debian just enable all features of the batman-adv kernel module and hope that more experimental features or features with possible negative effects have to be enabled using some runtime configuration interface. The network_coding feature can help in specific setups but also has drawbacks and is not disabled by default in the out-of-tree module. Disabling by default in the runtime config seems to be also quite sane. The bridge_loop_avoidance is the only feature which is disabled by default but may be necessary even in simple setups. Packet loops may even be created during the initial node setup when this is not enabled. This is different than STP on bridges because mesh is usually used on Adhoc WiFi. Having two nodes (by accident) in the same LAN segment and in the same mesh network is rather common in this situation. Signed-off-by: Sven Eckelmann <sven@narfation.org> Acked-by: Martin Hundebøll <martin@hundeboll.net> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:36 +02:00
Markus Pargmann	de12baece9	batman-adv: iv_ogm_send_to_if, declare char* as const This string pointer is later assigned to a constant string, so it should be defined constant at the beginning. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:36 +02:00
Markus Pargmann	9fd9b19ea0	batman-adv: iv_ogm_aggr_packet, bool return value This function returns bool values, so it should be defined to return them instead of the whole int range. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:36 +02:00
Markus Pargmann	42d9f2cbd4	batman-adv: iv_ogm_iface_enable, direct return values Directly return error values. No need to use a return variable. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:36 +02:00
Markus Pargmann	9fc1883ef2	batman-adv: Makefile, Sort alphabetically The whole Makefile is sorted, just the multicast rule is not at the right position. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:36 +02:00
Markus Pargmann	16b9ce83fb	batman-adv: tvlv realloc, move error handling into if block Instead of hiding the normal function flow inside an if block, we should just put the error handling into the if block. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:36 +02:00
Markus Pargmann	9bb218828c	batman-adv: debugfs, avoid compiling for !DEBUG_FS Normally the debugfs framework will return error pointer with -ENODEV for function calls when DEBUG_FS is not set. batman does not notice this error code and continues trying to create debugfs files and executes more code. We can avoid this code execution by disabling compiling debugfs.c when DEBUG_FS is not set. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:35 +02:00
Sven Eckelmann	83e8b87721	batman-adv: Use only queued fragments when merging The fragment queueing code now validates the total_size of each fragment, checks when enough fragments are queued to allow to merge them into a single packet and if the fragments have the correct size. Therefore, it is not required to have any other parameter for the merging function than a list of queued fragments. This change should avoid problems like in the past when the different skb from the list and the function parameter were mixed incorrectly. Signed-off-by: Sven Eckelmann <sven@narfation.org> Acked-by: Martin Hundebøll <martin@hundeboll.net> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:35 +02:00
Sven Eckelmann	53e771457e	batman-adv: Check total_size when queueing fragments The fragmentation code was replaced in `610bfc6bc9` ("batman-adv: Receive fragmented packets and merge") by an implementation which handles the queueing+merging of fragments based on their size and the total_size of the non-fragmented packet. This total_size is announced by each fragment. The new implementation doesn't check if the the total_size information of the packets inside one chain is consistent. This is consistency check is recommended to allow using any of the packets in the queue to decide whether all fragments of a packet are received or not. Signed-off-by: Sven Eckelmann <sven@narfation.org> Acked-by: Martin Hundebøll <martin@hundeboll.net> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:35 +02:00
Sven Eckelmann	9f6446c7f9	batman-adv: update copyright years for 2015 Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:35 +02:00
Simon Wunderlich	70e717762d	batman-adv: Start new development cycle Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2015-05-29 10:13:35 +02:00
David S. Miller	a74eab639e	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2015-05-28 1) Remove xfrm_queue_purge as this is the same as skb_queue_purge. 2) Optimize policy and state walk. 3) Use a sane return code if afinfo registration fails. 4) Only check fori a acquire state if the state is not valid. 5) Remove a unnecessary NULL check before xfrm_pol_hold as it checks the input for NULL. 6) Return directly if the xfrm hold queue is empty, avoid to take a lock as it is nothing to do in this case. 7) Optimize the inexact policy search and allow for matching of policies with priority ~0U. All from Li RongQing. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-28 20:23:01 -07:00
Eric Dumazet	ed2dfd9009	tcp/dccp: warn user for preferred ip_local_port_range After commit `07f4c90062` ("tcp/dccp: try to not exhaust ip_local_port_range in connect()") it is advised to have an even number of ports described in /proc/sys/net/ipv4/ip_local_port_range This means start/end values should have a different parity. Let's warn sysadmins of this, so that they can update their settings if they want to. Suggested-by: David S. Miller <davem@davemloft.net> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-27 14:35:36 -04:00
Eric Dumazet	e2baad9e4b	tcp: connect() from bound sockets can be faster __inet_hash_connect() does not use its third argument (port_offset) if socket was already bound to a source port. No need to perform useless but expensive md5 computations. Reported-by: Crestez Dan Leonard <cdleonard@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-27 14:30:10 -04:00
Eric Dumazet	07f4c90062	tcp/dccp: try to not exhaust ip_local_port_range in connect() A long standing problem on busy servers is the tiny available TCP port range (/proc/sys/net/ipv4/ip_local_port_range) and the default sequential allocation of source ports in connect() system call. If a host is having a lot of active TCP sessions, chances are very high that all ports are in use by at least one flow, and subsequent bind(0) attempts fail, or have to scan a big portion of space to find a slot. In this patch, I changed the starting point in __inet_hash_connect() so that we try to favor even [1] ports, leaving odd ports for bind() users. We still perform a sequential search, so there is no guarantee, but if connect() targets are very different, end result is we leave more ports available to bind(), and we spread them all over the range, lowering time for both connect() and bind() to find a slot. This strategy only works well if /proc/sys/net/ipv4/ip_local_port_range is even, ie if start/end values have different parity. Therefore, default /proc/sys/net/ipv4/ip_local_port_range was changed to 32768 - 60999 (instead of 32768 - 61000) There is no change on security aspects here, only some poor hashing schemes could be eventually impacted by this change. [1] : The odd/even property depends on ip_local_port_range values parity Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-27 13:30:44 -04:00
Alexander Aring	b69644c1c7	nl802154: add support to set cca ed level This patch adds support for setting the current cca ed level value over nl802154. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Reviewed-by: Varka Bhadram <varkabhadram@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-27 19:29:42 +02:00
Alexander Aring	e4390592a4	nl802154: add support for cca ed level info This patch adds information about the current cca ed level when the phy is dumped over nl802154. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Reviewed-by: Varka Bhadram <varkabhadram@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-27 19:29:42 +02:00
Florian Westphal	d6b915e29f	ip_fragment: don't forward defragmented DF packet We currently always send fragments without DF bit set. Thus, given following setup: mtu1500 - mtu1500:1400 - mtu1400:1280 - mtu1280 A R1 R2 B Where R1 and R2 run linux with netfilter defragmentation/conntrack enabled, then if Host A sent a fragmented packet _with_ DF set to B, R1 will respond with icmp too big error if one of these fragments exceeded 1400 bytes. However, if R1 receives fragment sizes 1200 and 100, it would forward the reassembled packet without refragmenting, i.e. R2 will send an icmp error in response to a packet that was never sent, citing mtu that the original sender never exceeded. The other minor issue is that a refragmentation on R1 will conceal the MTU of R2-B since refragmentation does not set DF bit on the fragments. This modifies ip_fragment so that we track largest fragment size seen both for DF and non-DF packets, and set frag_max_size to the largest value. If the DF fragment size is larger or equal to the non-df one, we will consider the packet a path mtu probe: We set DF bit on the reassembled skb and also tag it with a new IPCB flag to force refragmentation even if skb fits outdev mtu. We will also set DF bit on each fragment in this case. Joint work with Hannes Frederic Sowa. Reported-by: Jesse Gross <jesse@nicira.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-27 13:03:31 -04:00
Florian Westphal	c5501eb340	net: ipv4: avoid repeated calls to ip_skb_dst_mtu helper ip_skb_dst_mtu is small inline helper, but its called in several places. before: 17061 44 0 17105 42d1 net/ipv4/ip_output.o after: 16805 44 0 16849 41d1 net/ipv4/ip_output.o Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-27 13:03:30 -04:00
Varka Bhadram	dec169eccc	ieee802154: fix typo for file name Signed-off-by: Varka Bhadram <varkab@cdac.in> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-27 13:32:46 +02:00
Varka Bhadram	0f999b09f5	ieee802154: add set transmit power support This patch adds transmission power setting support for IEEE-802.15.4 devices via nl802154. Signed-off-by: Varka Bhadram <varkab@cdac.in> Acked-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-27 13:29:25 +02:00
David S. Miller	ffa915d071	ipv4: Fix fib_trie.c build, missing linux/vmalloc.h include. We used to get this indirectly I supposed, but no longer do. Either way, an explicit include should have been done in the first place. net/ipv4/fib_trie.c: In function '__node_free_rcu': >> net/ipv4/fib_trie.c:293:3: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration] vfree(n); ^ net/ipv4/fib_trie.c: In function 'tnode_alloc': >> net/ipv4/fib_trie.c:312:3: error: implicit declaration of function 'vzalloc' [-Werror=implicit-function-declaration] return vzalloc(size); ^ >> net/ipv4/fib_trie.c:312:3: warning: return makes pointer from integer without a cast cc1: some warnings being treated as errors Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-27 00:19:03 -04:00
Eric Dumazet	d6a4e26afb	tcp: tcp_tso_autosize() minimum is one packet By making sure sk->sk_gso_max_segs minimal value is one, and sysctl_tcp_min_tso_segs minimal value is one as well, tcp_tso_autosize() will return a non zero value. We can then revert `843925f33f` ("tcp: Do not apply TSO segment limit to non-TSO packets") and save few cpu cycles in fast path. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-26 23:21:29 -04:00
Eric Dumazet	095dc8e0c3	tcp: fix/cleanup inet_ehash_locks_alloc() If tcp ehash table is constrained to a very small number of buckets (eg boot parameter thash_entries=128), then we can crash if spinlock array has more entries. While we are at it, un-inline inet_ehash_locks_alloc() and make following changes : - Budget 2 cache lines per cpu worth of 'spinlocks' - Try to kmalloc() the array to avoid extra TLB pressure. (Most servers at Google allocate 8192 bytes for this hash table) - Get rid of various #ifdef Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-26 19:48:46 -04:00
Jon Paul Maloy	f3903bcc00	tipc: fix bug in link protocol message create function In commit `dd3f9e70f5` ("tipc: add packet sequence number at instant of transmission") we made a change with the consequence that packets in the link backlog queue don't contain valid sequence numbers. However, when we create a link protocol message, we still use the sequence number of the first packet in the backlog, if there is any, as "next_sent" indicator in the message. This may entail unnecessary retransissions or stale packet transmission when there is very low traffic on the link. This commit fixes this issue by only using the current value of tipc_link::snd_nxt as indicator. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-26 19:43:03 -04:00
Alexander Aring	fc4f805243	nl802154: fix cca mode wpan phy flag This patch fix the handling to call cca mode setting. If the phy isn't flag then the driver doesn't support this setting. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Reported-by: Varka Bhadram <varkabhadram@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-26 23:45:15 +02:00
Lennert Buytenhek	641459ca33	mac802154: mac802154_mlme_start_req() optimisation. mac802154_mlme_start_req() calls ieee802154_mlme_ops(dev)->llsec->set_params() on the net_device passed into it, however, this net_device will always be a mac802154 net_device, so just call mac802154_set_params() directly instead. Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org> Acked-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-26 20:26:10 +02:00
Lennert Buytenhek	66a3297f6d	ieee802154 socket: No need to check for ARPHRD_IEEE802154 in raw_bind(). ieee802154_get_dev() only returns devices that have dev->type == ARPHRD_IEEE802154, therefore, there is no need to check this again in raw_bind(). Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org> Acked-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-26 20:26:10 +02:00
Lennert Buytenhek	01c8d2bbd4	ieee802154: Remove 802.15.4/6LoWPAN checks for interface MTU. In the past, 802.15.4 interfaces and 6LoWPAN interfaces used the same dev->type (ARPHRD_IEEE802154), and 802.15.4 interfaces were distinguished from 6LoWPAN interfaces by their differing dev->mtu. 6LoWPAN interfaces have their own ARPHRD type now, so there is no longer any need to check dev->mtu to distinguish 802.15.4 devices from 6LoWPAN devices. Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org> Acked-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-26 20:26:09 +02:00
Pablo Neira Ayuso	ed6c4136f1	netfilter: nf_tables: add netdev table to filter from ingress This allows us to create netdev tables that contain ingress chains. Use skb_header_pointer() as we may see shared sk_buffs at this stage. This change provides access to the existing nf_tables features from the ingress hook. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-26 18:41:23 +02:00
Pablo Neira Ayuso	ebddf1a8d7	netfilter: nf_tables: allow to bind table to net_device This patch adds the internal NFT_AF_NEEDS_DEV flag to indicate that you must attach this table to a net_device. This change is required by the follow up patch that introduces the new netdev table. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-26 18:41:17 +02:00
Pablo Neira Ayuso	529985de20	netfilter: default CONFIG_NETFILTER_INGRESS to y Useful to compile-test all options. Suggested-by: Alexei Stavoroitov <ast@plumgrid.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-26 18:41:06 +02:00
Florian Westphal	2f06550b3b	netfilter: remove unused comefrom hookmask argument Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-26 18:40:30 +02:00
Lennert Buytenhek	c032705ebf	ieee802154 socket: Return EMSGSIZE from raw_sendmsg() if packet too big. The proper return code for trying to send a packet that exceeds the outgoing interface's MTU is EMSGSIZE, not EINVAL, so patch ieee802154's raw_sendmsg() to do the right thing. (Its dgram_sendmsg() was already returning EMSGSIZE for this case.) Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org> Acked-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-26 18:07:39 +02:00
Lennert Buytenhek	e34fd879f5	mac802154: Avoid rtnl deadlock in mac802154_wpan_ioctl(). ->ndo_do_ioctl() can be entered with the rtnl lock already held, for example when sending a wext ioctl to a device (in which case the rtnl lock is taken by wext_ioctl_dispatch()), but mac802154_wpan_ioctl() currently unconditionally takes the rtnl lock on entry, which can cause deadlocks. To fix this, bail out of mac802154_wpan_ioctl() before taking the rtnl lock if the ioctl cmd is not one of the cmds we implement. Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org> Acked-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-26 18:03:01 +02:00
Eric Dumazet	05c985436d	net: fix inet_proto_csum_replace4() sparse errors make C=2 CF=-D__CHECK_ENDIAN__ net/core/utils.o ... net/core/utils.c:307:72: warning: incorrect type in argument 2 (different base types) net/core/utils.c:307:72: expected restricted __wsum [usertype] addend net/core/utils.c:307:72: got restricted __be32 [usertype] from net/core/utils.c:308:34: warning: incorrect type in argument 2 (different base types) net/core/utils.c:308:34: expected restricted __wsum [usertype] addend net/core/utils.c:308:34: got restricted __be32 [usertype] to net/core/utils.c:310:70: warning: incorrect type in argument 2 (different base types) net/core/utils.c:310:70: expected restricted __wsum [usertype] addend net/core/utils.c:310:70: got restricted __be32 [usertype] from net/core/utils.c:310:77: warning: incorrect type in argument 2 (different base types) net/core/utils.c:310:77: expected restricted __wsum [usertype] addend net/core/utils.c:310:77: got restricted __be32 [usertype] to net/core/utils.c:312:72: warning: incorrect type in argument 2 (different base types) net/core/utils.c:312:72: expected restricted __wsum [usertype] addend net/core/utils.c:312:72: got restricted __be32 [usertype] from net/core/utils.c:313:35: warning: incorrect type in argument 2 (different base types) net/core/utils.c:313:35: expected restricted __wsum [usertype] addend net/core/utils.c:313:35: got restricted __be32 [usertype] to Note we can use csum_replace4() helper Fixes: `58e3cac561` ("net: optimise inet_proto_csum_replace4()") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 22:56:47 -04:00
Eric Dumazet	68319052d1	net: remove a sparse error in secure_dccpv6_sequence_number() make C=2 CF=-D__CHECK_ENDIAN__ net/core/secure_seq.o net/core/secure_seq.c:157:50: warning: restricted __be32 degrades to integer Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 22:55:37 -04:00
Florian Grandel	f72186d22a	Bluetooth: mgmt: fix typos A few comments had minor typos. These are being fixed. Signed-off-by: Florian Grandel <fgrandel@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-26 03:57:56 +02:00
Wilson Kok	eb8d7baae2	bridge: skip fdb add if the port shouldn't learn Check in fdb_add_entry() if the source port should learn, similar check is used in br_fdb_update. Note that new fdb entries which are added manually or as local ones are still permitted. This patch has been tested by running traffic via a bridge port and switching the port's state, also by manually adding/removing entries from the bridge's fdb. Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 20:29:54 -04:00
Eric Dumazet	d496958145	pktgen: remove one sparse error net/core/pktgen.c:2672:43: warning: incorrect type in assignment (different base types) net/core/pktgen.c:2672:43: expected unsigned short [unsigned] [short] [usertype] <noident> net/core/pktgen.c:2672:43: got restricted __be16 [usertype] protocol Let's use proper struct ethhdr instead of hard coding everything. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 20:27:50 -04:00
Eric Dumazet	7f1598678d	ipv6: ipv6_select_ident() returns a __be32 ipv6_select_ident() returns a 32bit value in network order. Fixes: `286c2349f6` ("ipv6: Clean up ipv6_select_ident() and ip6_fragment()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: kbuild test robot <fengguang.wu@intel.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 20:27:11 -04:00
Nicholas Mc Guire	005e8709c6	irda: use msecs_to_jiffies for conversion to jiffies API compliance scanning with coccinelle flagged: ./net/irda/timer.c:63:35-37: use of msecs_to_jiffies probably perferable Converting milliseconds to jiffies by "val * HZ / 1000" technically is not a clean solution as it does not handle all corner cases correctly. By changing the conversion to use msecs_to_jiffies(val) conversion is correct in all cases. Further the () around the arithmetic expression was dropped. Patch was compile tested for x86_64_defconfig + CONFIG_IRDA=m Patch is against 4.1-rc4 (localversion-next is -next-20150522) Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 17:46:21 -04:00
Linus Lüssing	6ae4ae8e51	bridge: allow setting hash_max + multicast_router if interface is down Network managers like netifd (used in OpenWRT for instance) try to configure interface options after creation but before setting the interface up. Unfortunately the sysfs / bridge currently only allows to configure the hash_max and multicast_router options when the bridge interface is up. But since br_multicast_init() doesn't start any timers and only sets default values and initializes timers it should be save to reconfigure the default values after that, before things actually get active after the bridge is set up. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 17:28:01 -04:00
Florian Westphal	485fca664d	ipv6: don't increase size when refragmenting forwarded ipv6 skbs since commit `6aafeef03b` ("netfilter: push reasm skb through instead of original frag skbs") we will end up sometimes re-fragmenting skbs that we've reassembled. ipv6 defrag preserves the original skbs using the skb frag list, i.e. as long as the skb frag list is preserved there is no problem since we keep original geometry of fragments intact. However, in the rare case where the frag list is munged or skb is linearized, we might send larger fragments than what we originally received. A router in the path might then send packet-too-big errors even if sender never sent fragments exceeding the reported mtu: mtu 1500 - 1500:1400 - 1400:1280 - 1280 A R1 R2 B 1 - A sends to B, fragment size 1400 2 - R2 sends pkttoobig error for 1280 3 - A sends to B, fragment size 1280 4 - R2 sends pkttoobig error for 1280 again because it sees fragments of size 1400. make sure ip6_fragment always caps MTU at largest packet size seen when defragmented skb is forwarded. Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 17:22:23 -04:00
Martin KaFai Lau	d52d3997f8	ipv6: Create percpu rt6_info After the patch 'ipv6: Only create RTF_CACHE routes after encountering pmtu exception', we need to compensate the performance hit (bouncing dst->__refcnt). Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 13:25:35 -04:00
Martin KaFai Lau	83a09abd1a	ipv6: Break up ip6_rt_copy() This patch breaks up ip6_rt_copy() into ip6_rt_copy_init() and ip6_rt_cache_alloc(). In the later patch, we need to create a percpu rt6_info copy. Hence, refactor the common rt6_info init codes to ip6_rt_copy_init(). Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 13:25:34 -04:00
Martin KaFai Lau	8d0b94afdc	ipv6: Keep track of DST_NOCACHE routes in case of iface down/unregister This patch keeps track of the DST_NOCACHE routes in a list and replaces its dev with loopback during the iface down/unregister event. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 13:25:34 -04:00
Martin KaFai Lau	3da59bd945	ipv6: Create RTF_CACHE clone when FLOWI_FLAG_KNOWN_NH is set This patch always creates RTF_CACHE clone with DST_NOCACHE when FLOWI_FLAG_KNOWN_NH is set so that the rt6i_dst is set to the fl6->daddr. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Julian Anastasov <ja@ssi.bg> Tested-by: Julian Anastasov <ja@ssi.bg> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 13:25:34 -04:00
Martin KaFai Lau	48e8aa6e31	ipv6: Set FLOWI_FLAG_KNOWN_NH at flowi6_flags The neighbor look-up used to depend on the rt6i_gateway (if there is a gateway) or the rt6i_dst (if it is a RTF_CACHE clone) as the nexthop address. Note that rt6i_dst is set to fl6->daddr for the RTF_CACHE clone where fl6->daddr is the one used to do the route look-up. Now, we only create RTF_CACHE clone after encountering exception. When doing the neighbor look-up with a route that is neither a gateway nor a RTF_CACHE clone, the daddr in skb will be used as the nexthop. In some cases, the daddr in skb is not the one used to do the route look-up. One example is in ip_vs_dr_xmit_v6() where the real nexthop server address is different from the one in the skb. This patch is going to follow the IPv4 approach and ask the ip6_pol_route() callers to set the FLOWI_FLAG_KNOWN_NH properly. In the next patch, ip6_pol_route() will honor the FLOWI_FLAG_KNOWN_NH and create a RTF_CACHE clone. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Julian Anastasov <ja@ssi.bg> Tested-by: Julian Anastasov <ja@ssi.bg> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 13:25:34 -04:00
Martin KaFai Lau	b197df4f0f	ipv6: Add rt6_get_cookie() function Instead of doing the rt6->rt6i_node check whenever we need to get the route's cookie. Refactor it into rt6_get_cookie(). It is a prep work to handle FLOWI_FLAG_KNOWN_NH and also percpu rt6_info later. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 13:25:34 -04:00
Martin KaFai Lau	45e4fd2668	ipv6: Only create RTF_CACHE routes after encountering pmtu exception This patch creates a RTF_CACHE routes only after encountering a pmtu exception. After ip6_rt_update_pmtu() has inserted the RTF_CACHE route to the fib6 tree, the rt->rt6i_node->fn_sernum is bumped which will fail the ip6_dst_check() and trigger a relookup. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 13:25:33 -04:00
Martin KaFai Lau	8b9df26577	ipv6: Combine rt6_alloc_cow and rt6_alloc_clone A prep work for creating RTF_CACHE on exception only. After this patch, the same condition (rt->rt6i_flags & (RTF_NONEXTHOP \| RTF_GATEWAY)) is checked twice. This redundancy will be removed in the later patch. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 13:25:33 -04:00
Martin KaFai Lau	2647a9b070	ipv6: Remove external dependency on rt6i_gateway and RTF_ANYCAST When creating a RTF_CACHE route, RTF_ANYCAST is set based on rt6i_dst. Also, rt6i_gateway is always set to the nexthop while the nexthop could be a gateway or the rt6i_dst.addr. After removing the rt6i_dst and rt6i_src dependency in the last patch, we also need to stop the caller from depending on rt6i_gateway and RTF_ANYCAST. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 13:25:33 -04:00
Martin KaFai Lau	fd0273d793	ipv6: Remove external dependency on rt6i_dst and rt6i_src This patch removes the assumptions that the returned rt is always a RTF_CACHE entry with the rt6i_dst and rt6i_src containing the destination and source address. The dst and src can be recovered from the calling site. We may consider to rename (rt6i_dst, rt6i_src) to (rt6i_key_dst, rt6i_key_src) later. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 13:25:32 -04:00
Martin KaFai Lau	286c2349f6	ipv6: Clean up ipv6_select_ident() and ip6_fragment() This patch changes the ipv6_select_ident() signature to return a fragment id instead of taking a whole frag_hdr as a param to only set the frag_hdr->identification. It also cleans up ip6_fragment() to obtain the fragment id at the beginning instead of using multiple "if" later to check fragment id has been generated or not. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 13:25:32 -04:00
Florian Westphal	cf82624432	ip: reject too-big defragmented DF-skb when forwarding Send icmp pmtu error if we find that the largest fragment of df-skb exceeded the output path mtu. The ip output path will still catch this later on but we can avoid the forward/postrouting hook traversal by rejecting right away. This is what ipv6 already does. Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 00:08:48 -04:00
Hannes Frederic Sowa	2b514574f7	net: af_unix: implement splice for stream af_unix sockets unix_stream_recvmsg is refactored to unix_stream_read_generic in this patch and enhanced to deal with pipe splicing. The refactoring is inneglible, we mostly have to deal with a non-existing struct msghdr argument. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 00:06:59 -04:00
Hannes Frederic Sowa	a60e3cc7c9	net: make skb_splice_bits more configureable Prepare skb_splice_bits to be able to deal with AF_UNIX sockets. AF_UNIX sockets don't use lock_sock/release_sock and thus we have to use a callback to make the locking and unlocking configureable. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 00:06:59 -04:00
Hannes Frederic Sowa	869e7c6248	net: af_unix: implement stream sendpage support This patch implements sendpage support for AF_UNIX SOCK_STREAM sockets. This is also required for a complete splice implementation. The implementation is a bit tricky because we append to already existing skbs and so have to hold unix_sk->readlock to protect the reading side from either advancing UNIXCB.consumed or freeing the skb at the socket receive tail. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 00:06:58 -04:00
Hannes Frederic Sowa	be12a1fe29	net: skbuff: add skb_append_pagefrags and use it Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 00:06:58 -04:00
Alexander Aring	c947f7e1e3	mac802154: remove mib lock This patch removes the mib lock. The new locking mechanism is to protect the mib values with the rtnl lock. Note that this isn't always necessary if we have an interface up the most mib values are readonly (e.g. address settings). With this behaviour we can remove locking in hotpath like frame parsing completely. It depends on context if we need to hold the rtnl lock or not, this makes the callbacks of ieee802154_mlme_ops unnecessary because these callbacks hols always the locks. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-23 17:57:08 +02:00
Alexander Aring	344f8c119d	mac802154: use atomic ops for sequence incrementation This patch will use atomic operations for sequence number incrementation while MAC header generation. Upper layers like af_802154 or 6LoWPAN could call this function in a parallel context while generating 802.15.4 MAC header before queuing into wpan interfaces transmit queue. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-23 17:57:08 +02:00
Alexander Aring	4a3a8c0c3a	mac802154: remove pib lock This patch removes the pib lock which is now replaced by rtnl lock. The new interface already use the rtnl lock only. Nevertheless this patch will fix issues while using new and old interface at the same time. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-23 17:57:08 +02:00
Alexander Aring	4a669f7d72	mac802154: fix hold rtnl while ioctl This patch fixes an issue to set address configuration with ioctl. Accessing the mib requires rtnl lock and the ndo_do_ioctl doesn't hold the rtnl lock while this callback is called. This patch do that manually. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Reported-by: Matteo Petracca <matteo.petracca@sssup.it> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-23 17:57:07 +02:00
David S. Miller	36583eb54d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/cadence/macb.c drivers/net/phy/phy.c include/linux/skbuff.h net/ipv4/tcp.c net/switchdev/switchdev.c Switchdev was a case of RTNH_H_{EXTERNAL --> OFFLOAD} renaming overlapping with net-next changes of various sorts. phy.c was a case of two changes, one adding a local variable to a function whilst the second was removing one. tcp.c overlapped a deadlock fix with the addition of new tcp_info statistic values. macb.c involved the addition of two zyncq device entries. skbuff.h involved adding back ipv4_daddr to nf_bridge_info whilst net-next changes put two other existing members of that struct into a union. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-23 01:22:35 -04:00
Jesper Dangaard Brouer	4020726479	pktgen: make /proc/net/pktgen/pgctrl report fail on invalid input Giving /proc/net/pktgen/pgctrl an invalid command just returns shell success and prints a warning in dmesg. This is not very useful for shell scripting, as it can only detect the error by parsing dmesg. Instead return -EINVAL when the command is unknown, as this provides userspace shell scripting a way of detecting this. Also bump version tag to 2.75, because (1) reading /proc/net/pktgen/pgctrl output this version number which would allow to detect this small semantic change, and (2) because the pktgen version tag have not been updated since 2010. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-22 23:59:16 -04:00
Jesper Dangaard Brouer	d079abd181	pktgen: adjust spacing in proc file interface output Too many spaces were introduced in commit `63adc6fb8a` ("pktgen: cleanup checkpatch warnings"), thus misaligning "src_min:" to other columns. Fixes: `63adc6fb8a` ("pktgen: cleanup checkpatch warnings") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-22 23:59:16 -04:00
Eric Dumazet	93a33a584e	bridge: fix lockdep splat Following lockdep splat was reported : [ 29.382286] =============================== [ 29.382315] [ INFO: suspicious RCU usage. ] [ 29.382344] 4.1.0-0.rc0.git11.1.fc23.x86_64 #1 Not tainted [ 29.382380] ------------------------------- [ 29.382409] net/bridge/br_private.h:626 suspicious rcu_dereference_check() usage! [ 29.382455] other info that might help us debug this: [ 29.382507] rcu_scheduler_active = 1, debug_locks = 0 [ 29.382549] 2 locks held by swapper/0/0: [ 29.382576] #0: (((&p->forward_delay_timer))){+.-...}, at: [<ffffffff81139f75>] call_timer_fn+0x5/0x4f0 [ 29.382660] #1: (&(&br->lock)->rlock){+.-...}, at: [<ffffffffa0450dc1>] br_forward_delay_timer_expired+0x31/0x140 [bridge] [ 29.382754] stack backtrace: [ 29.382787] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.1.0-0.rc0.git11.1.fc23.x86_64 #1 [ 29.382838] Hardware name: LENOVO 422916G/LENOVO, BIOS A1KT53AUS 04/07/2015 [ 29.382882] 0000000000000000 3ebfc20364115825 ffff880666603c48 ffffffff81892d4b [ 29.382943] 0000000000000000 ffffffff81e124e0 ffff880666603c78 ffffffff8110bcd7 [ 29.383004] ffff8800785c9d00 ffff88065485ac58 ffff880c62002800 ffff880c5fc88ac0 [ 29.383065] Call Trace: [ 29.383084] <IRQ> [<ffffffff81892d4b>] dump_stack+0x4c/0x65 [ 29.383130] [<ffffffff8110bcd7>] lockdep_rcu_suspicious+0xe7/0x120 [ 29.383178] [<ffffffffa04520f9>] br_fill_ifinfo+0x4a9/0x6a0 [bridge] [ 29.383225] [<ffffffffa045266b>] br_ifinfo_notify+0x11b/0x4b0 [bridge] [ 29.383271] [<ffffffffa0450d90>] ? br_hold_timer_expired+0x70/0x70 [bridge] [ 29.383320] [<ffffffffa0450de8>] br_forward_delay_timer_expired+0x58/0x140 [bridge] [ 29.383371] [<ffffffffa0450d90>] ? br_hold_timer_expired+0x70/0x70 [bridge] [ 29.383416] [<ffffffff8113a033>] call_timer_fn+0xc3/0x4f0 [ 29.383454] [<ffffffff81139f75>] ? call_timer_fn+0x5/0x4f0 [ 29.383493] [<ffffffff8110a90f>] ? lock_release_holdtime.part.29+0xf/0x200 [ 29.383541] [<ffffffffa0450d90>] ? br_hold_timer_expired+0x70/0x70 [bridge] [ 29.383587] [<ffffffff8113a6a4>] run_timer_softirq+0x244/0x490 [ 29.383629] [<ffffffff810b68cc>] __do_softirq+0xec/0x670 [ 29.383666] [<ffffffff810b70d5>] irq_exit+0x145/0x150 [ 29.383703] [<ffffffff8189f506>] smp_apic_timer_interrupt+0x46/0x60 [ 29.383744] [<ffffffff8189d523>] apic_timer_interrupt+0x73/0x80 [ 29.383782] <EOI> [<ffffffff816f131f>] ? cpuidle_enter_state+0x5f/0x2f0 [ 29.383832] [<ffffffff816f131b>] ? cpuidle_enter_state+0x5b/0x2f0 Problem here is that br_forward_delay_timer_expired() is a timer handler, calling br_ifinfo_notify() which assumes either rcu_read_lock() or RTNL are held. Simplest fix seems to add rcu read lock section. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Josh Boyer <jwboyer@fedoraproject.org> Reported-by: Dominick Grift <dac.override@gmail.com> Cc: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-22 16:23:56 -04:00
Arun Parameswaran	f96dee13b8	net: core: 'ethtool' issue with querying phy settings When trying to configure the settings for PHY1, using commands like 'ethtool -s eth0 phyad 1 speed 100', the 'ethtool' seems to modify other settings apart from the speed of the PHY1, in the above case. The ethtool seems to query the settings for PHY0, and use this as the base to apply the new settings to the PHY1. This is causing the other settings of the PHY 1 to be wrongly configured. The issue is caused by the '_ethtool_get_settings()' API, which gets called because of the 'ETHTOOL_GSET' command, is clearing the 'cmd' pointer (of type 'struct ethtool_cmd') by calling memset. This clears all the parameters (if any) passed for the 'ETHTOOL_GSET' cmd. So the driver's callback is always invoked with 'cmd->phy_address' as '0'. The '_ethtool_get_settings()' is called from other files in the 'net/core'. So the fix is applied to the 'ethtool_get_settings()' which is only called in the context of the 'ethtool'. Signed-off-by: Arun Parameswaran <aparames@broadcom.com> Reviewed-by: Ray Jui <rjui@broadcom.com> Reviewed-by: Scott Branden <sbranden@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-22 16:14:17 -04:00
Thadeu Lima de Souza Cascardo	47cc84ce0c	bridge: fix parsing of MLDv2 reports When more than a multicast address is present in a MLDv2 report, all but the first address is ignored, because the code breaks out of the loop if there has not been an error adding that address. This has caused failures when two guests connected through the bridge tried to communicate using IPv6. Neighbor discoveries would not be transmitted to the other guest when both used a link-local address and a static address. This only happens when there is a MLDv2 querier in the network. The fix will only break out of the loop when there is a failure adding a multicast address. The mdb before the patch: dev ovirtmgmt port vnet0 grp ff02::1:ff7d:6603 temp dev ovirtmgmt port vnet1 grp ff02::1:ff7d:6604 temp dev ovirtmgmt port bond0.86 grp ff02::2 temp After the patch: dev ovirtmgmt port vnet0 grp ff02::1:ff7d:6603 temp dev ovirtmgmt port vnet1 grp ff02::1:ff7d:6604 temp dev ovirtmgmt port bond0.86 grp ff02::fb temp dev ovirtmgmt port bond0.86 grp ff02::2 temp dev ovirtmgmt port bond0.86 grp ff02::d temp dev ovirtmgmt port vnet0 grp ff02::1:ff00:76 temp dev ovirtmgmt port bond0.86 grp ff02::16 temp dev ovirtmgmt port vnet1 grp ff02::1:ff00:77 temp dev ovirtmgmt port bond0.86 grp ff02::1:ff00:def temp dev ovirtmgmt port bond0.86 grp ff02::1:ffa1:40bf temp Fixes: `08b202b672` ("bridge br_multicast: IPv6 MLD support.") Reported-by: Rik Theys <Rik.Theys@esat.kuleuven.be> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Tested-by: Rik Theys <Rik.Theys@esat.kuleuven.be> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-22 15:08:20 -04:00
Michal Kubeček	d4e64c2909	ipv4: fill in table id when replacing a route When replacing an IPv4 route, tb_id member of the new fib_alias structure is not set in the replace code path so that the new route is ignored. Fixes: `0ddcf43d5d` ("ipv4: FIB Local/MAIN table collapse") Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-22 14:33:17 -04:00
David S. Miller	572152adfb	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contain Netfilter fixes for your net tree, they are: 1) Fix a race in nfnetlink_log and nfnetlink_queue that can lead to a crash. This problem is due to wrong order in the per-net registration and netlink socket events. Patch from Francesco Ruggeri. 2) Make sure that counters that userspace pass us are higher than 0 in all the x_tables frontends. Discovered via Trinity, patch from Dave Jones. 3) Revert a patch for br_netfilter to rely on the conntrack status bits. This breaks stateless IPv6 NAT transformations. Patch from Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-22 14:25:45 -04:00
Eric W. Biederman	381c759d99	ipv4: Avoid crashing in ip_error ip_error does not check if in_dev is NULL before dereferencing it. IThe following sequence of calls is possible: CPU A CPU B ip_rcv_finish ip_route_input_noref() ip_route_input_slow() inetdev_destroy() dst_input() With the result that a network device can be destroyed while processing an input packet. A crash was triggered with only unicast packets in flight, and forwarding enabled on the only network device. The error condition was created by the removal of the network device. As such it is likely the that error code was -EHOSTUNREACH, and the action taken by ip_error (if in_dev had been accessible) would have been to not increment any counters and to have tried and likely failed to send an icmp error as the network device is going away. Therefore handle this weird case by just dropping the packet if !in_dev. It will result in dropping the packet sooner, and will not result in an actual change of behavior. Fixes: `251da41301` ("ipv4: Cache ip_error() routes even when not forwarding.") Reported-by: Vittorio Gambaletta <linuxbugs@vittgam.net> Tested-by: Vittorio Gambaletta <linuxbugs@vittgam.net> Signed-off-by: Vittorio Gambaletta <linuxbugs@vittgam.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-22 14:23:40 -04:00
Jiri Pirko	12c227ec89	flow_dissector: do not break if ports are not needed in flowlabel This restored previous behaviour. If caller does not want ports to be filled, we should not break. Fixes: `06635a35d1` ("flow_dissect: use programable dissector in skb_flow_dissect and friends") Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-22 13:59:02 -04:00
Eric Dumazet	d654976cbf	tcp: fix a potential deadlock in tcp_get_info() Taking socket spinlock in tcp_get_info() can deadlock, as inet_diag_dump_icsk() holds the &hashinfo->ehash_locks[i], while packet processing can use the reverse locking order. We could avoid this locking for TCP_LISTEN states, but lockdep would certainly get confused as all TCP sockets share same lockdep classes. [ 523.722504] ====================================================== [ 523.728706] [ INFO: possible circular locking dependency detected ] [ 523.734990] 4.1.0-dbg-DEV #1676 Not tainted [ 523.739202] ------------------------------------------------------- [ 523.745474] ss/18032 is trying to acquire lock: [ 523.750002] (slock-AF_INET){+.-...}, at: [<ffffffff81669d44>] tcp_get_info+0x2c4/0x360 [ 523.758129] [ 523.758129] but task is already holding lock: [ 523.763968] (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff816bcb75>] inet_diag_dump_icsk+0x1d5/0x6c0 [ 523.774661] [ 523.774661] which lock already depends on the new lock. [ 523.774661] [ 523.782850] [ 523.782850] the existing dependency chain (in reverse order) is: [ 523.790326] -> #1 (&(&hashinfo->ehash_locks[i])->rlock){+.-...}: [ 523.796599] [<ffffffff811126bb>] lock_acquire+0xbb/0x270 [ 523.802565] [<ffffffff816f5868>] _raw_spin_lock+0x38/0x50 [ 523.808628] [<ffffffff81665af8>] __inet_hash_nolisten+0x78/0x110 [ 523.815273] [<ffffffff816819db>] tcp_v4_syn_recv_sock+0x24b/0x350 [ 523.822067] [<ffffffff81684d41>] tcp_check_req+0x3c1/0x500 [ 523.828199] [<ffffffff81682d09>] tcp_v4_do_rcv+0x239/0x3d0 [ 523.834331] [<ffffffff816842fe>] tcp_v4_rcv+0xa8e/0xc10 [ 523.840202] [<ffffffff81658fa3>] ip_local_deliver_finish+0x133/0x3e0 [ 523.847214] [<ffffffff81659a9a>] ip_local_deliver+0xaa/0xc0 [ 523.853440] [<ffffffff816593b8>] ip_rcv_finish+0x168/0x5c0 [ 523.859624] [<ffffffff81659db7>] ip_rcv+0x307/0x420 Lets use u64_sync infrastructure instead. As a bonus, 64bit arches get optimized, as these are nop for them. Fixes: `0df48c26d8` ("tcp: add tcpi_bytes_acked to tcp_info") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-22 13:46:06 -04:00
Marcelo Ricardo Leitner	2efd055c53	tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info This patch tracks the total number of inbound and outbound segments on a TCP socket. One may use this number to have an idea on connection quality when compared against the retransmissions. RFC4898 named these : tcpEStatsPerfSegsIn and tcpEStatsPerfSegsOut These are a 32bit field each and can be fetched both from TCP_INFO getsockopt() if one has a handle on a TCP socket, or from inet_diag netlink facility (iproute2/ss patch will follow) Note that tp->segs_out was placed near tp->snd_nxt for good data locality and minimal performance impact, while tp->segs_in was placed near tp->bytes_received for the same reason. Join work with Eric Dumazet. Note that received SYN are accounted on the listener, but sent SYNACK are not accounted. Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-21 23:25:21 -04:00
Florian Westphal	48ed7b26fa	ipv6: reject locally assigned nexthop addresses ip -6 addr add dead::1/128 dev eth0 sleep 5 ip -6 route add default via dead::1/128 -> fails ip -6 addr add dead::1/128 dev eth0 ip -6 route add default via dead::1/128 -> succeeds reason is that if (nonsensensical) route above is added, dead::1 is still subject to DAD, so the route lookup will pick eth0 as outdev due to the prefix route that is added before DAD work is started. Add explicit test that checks if nexthop gateway is a local address. Link: https://bugzilla.redhat.com/show_bug.cgi?id=1167969 Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-21 23:23:38 -04:00
Eric Dumazet	946f9eb226	tcp: improve REUSEADDR/NOREUSEADDR cohabitation inet_csk_get_port() randomization effort tends to spread sockets on all the available range (ip_local_port_range) This is unfortunate because SO_REUSEADDR sockets have less requirements than non SO_REUSEADDR ones. If an application uses SO_REUSEADDR hint, it is to try to allow source ports being shared. So instead of picking a random port number in ip_local_port_range, lets try first in first half of the range. This gives more chances to use upper half of the range for the sockets with strong requirements (not using SO_REUSEADDR) Note this patch does not add a new sysctl, and only changes the way we try to pick port number. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Marcelo Ricardo Leitner <mleitner@redhat.com> Cc: Flavio Leitner <fbl@redhat.com> Acked-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-21 18:55:32 -04:00
Eric Dumazet	f5af1f57a2	inet_hashinfo: remove bsocket counter We no longer need bsocket atomic counter, as inet_csk_get_port() calls bind_conflict() regardless of its value, after commit `2b05ad33e1` ("tcp: bind() fix autoselection to share ports") This patch removes overhead of maintaining this counter and double inet_csk_get_port() calls under pressure. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Marcelo Ricardo Leitner <mleitner@redhat.com> Cc: Flavio Leitner <fbl@redhat.com> Acked-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-21 18:55:32 -04:00
Jason Baron	ce5ec44099	tcp: ensure epoll edge trigger wakeup when write queue is empty We currently rely on the setting of SOCK_NOSPACE in the write() path to ensure that we wake up any epoll edge trigger waiters when acks return to free space in the write queue. However, if we fail to allocate even a single skb in the write queue, we could end up waiting indefinitely. Fix this by explicitly issuing a wakeup when we detect the condition of an empty write queue and a return value of -EAGAIN. This allows userspace to re-try as we expect this to be a temporary failure. I've tested this approach by artificially making sk_stream_alloc_skb() return NULL periodically. In that case, epoll edge trigger waiters will hang indefinitely in epoll_wait() without this patch. Signed-off-by: Jason Baron <jbaron@akamai.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-21 18:52:47 -04:00
Daniel Borkmann	c78e1746d3	net: sched: fix call_rcu() race on classifier module unloads Vijay reported that a loop as simple as ... while true; do tc qdisc add dev foo root handle 1: prio tc filter add dev foo parent 1: u32 match u32 0 0 flowid 1 tc qdisc del dev foo root rmmod cls_u32 done ... will panic the kernel. Moreover, he bisected the change apparently introducing it to `78fd1d0ab0` ("netlink: Re-add locking to netlink_lookup() and seq walker"). The removal of synchronize_net() from the netlink socket triggering the qdisc to be removed, seems to have uncovered an RCU resp. module reference count race from the tc API. Given that RCU conversion was done after `e341694e3e` ("netlink: Convert netlink_lookup() to use RCU protected hash table") which added the synchronize_net() originally, occasion of hitting the bug was less likely (not impossible though): When qdiscs that i) support attaching classifiers and, ii) have at least one of them attached, get deleted, they invoke tcf_destroy_chain(), and thus call into ->destroy() handler from a classifier module. After RCU conversion, all classifier that have an internal prio list, unlink them and initiate freeing via call_rcu() deferral. Meanhile, tcf_destroy() releases already reference to the tp->ops->owner module before the queued RCU callback handler has been invoked. Subsequent rmmod on the classifier module is then not prevented since all module references are already dropped. By the time, the kernel invokes the RCU callback handler from the module, that function address is then invalid. One way to fix it would be to add an rcu_barrier() to unregister_tcf_proto_ops() to wait for all pending call_rcu()s to complete. synchronize_rcu() is not appropriate as under heavy RCU callback load, registered call_rcu()s could be deferred longer than a grace period. In case we don't have any pending call_rcu()s, the barrier is allowed to return immediately. Since we came here via unregister_tcf_proto_ops(), there are no users of a given classifier anymore. Further nested call_rcu()s pointing into the module space are not being done anywhere. Only cls_bpf_delete_prog() may schedule a work item, to unlock pages eventually, but that is not in the range/context of cls_bpf anymore. Fixes: `25d8c0d55f` ("net: rcu-ify tcf_proto") Fixes: `9888faefe1` ("net: sched: cls_basic use RCU") Reported-by: Vijay Subramanian <subramanian.vijay@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: John Fastabend <john.r.fastabend@intel.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Thomas Graf <tgraf@suug.ch> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Alexei Starovoitov <ast@plumgrid.com> Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-21 18:48:18 -04:00
Alexei Starovoitov	04fd61ab36	bpf: allow bpf programs to tail-call other bpf programs introduce bpf_tail_call(ctx, &jmp_table, index) helper function which can be used from BPF programs like: int bpf_prog(struct pt_regs ctx) { ... bpf_tail_call(ctx, &jmp_table, index); ... } that is roughly equivalent to: int bpf_prog(struct pt_regs ctx) { ... if (jmp_table[index]) return (jmp_table[index])(ctx); ... } The important detail that it's not a normal call, but a tail call. The kernel stack is precious, so this helper reuses the current stack frame and jumps into another BPF program without adding extra call frame. It's trivially done in interpreter and a bit trickier in JITs. In case of x64 JIT the bigger part of generated assembler prologue is common for all programs, so it is simply skipped while jumping. Other JITs can do similar prologue-skipping optimization or do stack unwind before jumping into the next program. bpf_tail_call() arguments: ctx - context pointer jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table index - index in the jump table Since all BPF programs are idenitified by file descriptor, user space need to populate the jmp_table with FDs of other BPF programs. If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere and program execution continues as normal. New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can populate this jmp_table array with FDs of other bpf programs. Programs can share the same jmp_table array or use multiple jmp_tables. The chain of tail calls can form unpredictable dynamic loops therefore tail_call_cnt is used to limit the number of calls and currently is set to 32. Use cases: Acked-by: Daniel Borkmann <daniel@iogearbox.net> ========== - simplify complex programs by splitting them into a sequence of small programs - dispatch routine For tracing and future seccomp the program may be triggered on all system calls, but processing of syscall arguments will be different. It's more efficient to implement them as: int syscall_entry(struct seccomp_data ctx) { bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number /); ... default: process unknown syscall ... } int sys_write_event(struct seccomp_data ctx) {...} int sys_read_event(struct seccomp_data ctx) {...} syscall_jmp_table[__NR_write] = sys_write_event; syscall_jmp_table[__NR_read] = sys_read_event; For networking the program may call into different parsers depending on packet format, like: int packet_parser(struct __sk_buff skb) { ... parse L2, L3 here ... __u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol)); bpf_tail_call(skb, &ipproto_jmp_table, ipproto); ... default: process unknown protocol ... } int parse_tcp(struct __sk_buff skb) {...} int parse_udp(struct __sk_buff skb) {...} ipproto_jmp_table[IPPROTO_TCP] = parse_tcp; ipproto_jmp_table[IPPROTO_UDP] = parse_udp; - for TC use case, bpf_tail_call() allows to implement reclassify-like logic - bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table are atomic, so user space can build chains of BPF programs on the fly Implementation details: ======================= - high performance of bpf_tail_call() is the goal. It could have been implemented without JIT changes as a wrapper on top of BPF_PROG_RUN() macro, but with two downsides: . all programs would have to pay performance penalty for this feature and tail call itself would be slower, since mandatory stack unwind, return, stack allocate would be done for every tailcall. . tailcall would be limited to programs running preempt_disabled, since generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would need to be either global per_cpu variable accessed by helper and by wrapper or global variable protected by locks. In this implementation x64 JIT bypasses stack unwind and jumps into the callee program after prologue. - bpf_prog_array_compatible() ensures that prog_type of callee and caller are the same and JITed/non-JITed flag is the same, since calling JITed program from non-JITed is invalid, since stack frames are different. Similarly calling kprobe type program from socket type program is invalid. - jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map' abstraction, its user space API and all of verifier logic. It's in the existing arraymap.c file, since several functions are shared with regular array map. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-21 17:07:59 -04:00

1 2 3 4 5 ...

37829 Commits