// SPDX-License-Identifier: GPL-2.0-only
/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
 */
#include <linux/bpf.h>
#include <linux/rcupdate.h>
#include <linux/random.h>
#include <linux/smp.h>
#include <linux/topology.h>
#include <linux/ktime.h>
#include <linux/sched.h>
#include <linux/uidgid.h>
#include <linux/filter.h>
#include <linux/ctype.h>
#include <linux/jiffies.h>
#include <linux/pid_namespace.h>
#include <linux/proc_ns.h>
#include <linux/security.h>

#include "../../lib/kstrtox.h"

/* If kernel subsystem is allowing eBPF programs to call this function,
 * inside its own verifier_ops->get_func_proto() callback it should return
 * bpf_map_lookup_elem_proto, so that verifier can properly check the arguments
 *
 * Different map implementations will rely on rcu in map methods
 * lookup/update/delete, therefore eBPF programs must run under rcu lock
 * if program is allowed to access maps, so check rcu_read_lock_held in
 * all three functions.
 */
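/* Illustrative sketch (not part of this file): a subsystem's
 * get_func_proto() callback would typically hand these protos to the
 * verifier along these lines; "my_prog_func_proto" is a hypothetical
 * name used only for illustration:
 *
 *	static const struct bpf_func_proto *
 *	my_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 *	{
 *		switch (func_id) {
 *		case BPF_FUNC_map_lookup_elem:
 *			return &bpf_map_lookup_elem_proto;
 *		case BPF_FUNC_map_update_elem:
 *			return &bpf_map_update_elem_proto;
 *		case BPF_FUNC_map_delete_elem:
 *			return &bpf_map_delete_elem_proto;
 *		default:
 *			return NULL;
 *		}
 *	}
 */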
BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key)
{
	WARN_ON_ONCE(!rcu_read_lock_held());
	return (unsigned long) map->ops->map_lookup_elem(map, key);
}

const struct bpf_func_proto bpf_map_lookup_elem_proto = {
	.func		= bpf_map_lookup_elem,
	.gpl_only	= false,
	.pkt_access	= true,
	.ret_type	= RET_PTR_TO_MAP_VALUE_OR_NULL,
	.arg1_type	= ARG_CONST_MAP_PTR,
	.arg2_type	= ARG_PTR_TO_MAP_KEY,
};
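
/* Note on bpf_map_lookup_elem() above: RET_PTR_TO_MAP_VALUE_OR_NULL means
 * the verifier requires BPF programs to NULL-check the returned pointer
 * before dereferencing it, e.g. (illustrative program snippet, names are
 * hypothetical):
 *
 *	val = bpf_map_lookup_elem(&my_map, &key);
 *	if (!val)
 *		return 0;
 */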

BPF_CALL_4(bpf_map_update_elem, struct bpf_map *, map, void *, key,
	   void *, value, u64, flags)
{
	WARN_ON_ONCE(!rcu_read_lock_held());
	return map->ops->map_update_elem(map, key, value, flags);
}

const struct bpf_func_proto bpf_map_update_elem_proto = {
	.func		= bpf_map_update_elem,
	.gpl_only	= false,
	.pkt_access	= true,
	.ret_type	= RET_INTEGER,
	.arg1_type	= ARG_CONST_MAP_PTR,
	.arg2_type	= ARG_PTR_TO_MAP_KEY,
	.arg3_type	= ARG_PTR_TO_MAP_VALUE,
	.arg4_type	= ARG_ANYTHING,
};
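
/* Note on bpf_map_update_elem() above: the flags argument carries the usual
 * update mode (BPF_ANY, BPF_NOEXIST or BPF_EXIST); ARG_ANYTHING leaves it
 * unconstrained here, and the individual map implementations validate it in
 * their ->map_update_elem() callbacks.
 */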

BPF_CALL_2(bpf_map_delete_elem, struct bpf_map *, map, void *, key)
{
	WARN_ON_ONCE(!rcu_read_lock_held());
	return map->ops->map_delete_elem(map, key);
}

const struct bpf_func_proto bpf_map_delete_elem_proto = {
	.func		= bpf_map_delete_elem,
	.gpl_only	= false,
	.pkt_access	= true,
	.ret_type	= RET_INTEGER,
	.arg1_type	= ARG_CONST_MAP_PTR,
	.arg2_type	= ARG_PTR_TO_MAP_KEY,
};
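
/* The push/pop/peek helpers below back keyless, queue/stack style maps
 * (e.g. BPF_MAP_TYPE_QUEUE and BPF_MAP_TYPE_STACK). They do not assert
 * rcu_read_lock_held() themselves but still run under the program's
 * RCU read-side protection.
 */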

BPF_CALL_3(bpf_map_push_elem, struct bpf_map *, map, void *, value, u64, flags)
{
	return map->ops->map_push_elem(map, value, flags);
}

const struct bpf_func_proto bpf_map_push_elem_proto = {
	.func		= bpf_map_push_elem,
	.gpl_only	= false,
	.pkt_access	= true,
	.ret_type	= RET_INTEGER,
	.arg1_type	= ARG_CONST_MAP_PTR,
	.arg2_type	= ARG_PTR_TO_MAP_VALUE,
	.arg3_type	= ARG_ANYTHING,
};

BPF_CALL_2(bpf_map_pop_elem, struct bpf_map *, map, void *, value)
{
	return map->ops->map_pop_elem(map, value);
}

const struct bpf_func_proto bpf_map_pop_elem_proto = {
	.func		= bpf_map_pop_elem,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
	.arg1_type	= ARG_CONST_MAP_PTR,
	.arg2_type	= ARG_PTR_TO_UNINIT_MAP_VALUE,
};

BPF_CALL_2(bpf_map_peek_elem, struct bpf_map *, map, void *, value)
{
	return map->ops->map_peek_elem(map, value);
}

const struct bpf_func_proto bpf_map_peek_elem_proto = {
	.func		= bpf_map_peek_elem,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
	.arg1_type	= ARG_CONST_MAP_PTR,
	.arg2_type	= ARG_PTR_TO_UNINIT_MAP_VALUE,
};
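
/* bpf_get_prandom_u32() is backed by bpf_user_rnd_u32(), which keeps its
 * own per-cpu PRNG state separate from the kernel's prandom_u32(), so BPF
 * programs cannot observe or influence the sequences used by core
 * networking code.
 */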

const struct bpf_func_proto bpf_get_prandom_u32_proto = {
	.func		= bpf_user_rnd_u32,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
};

BPF_CALL_0(bpf_get_smp_processor_id)
{
	return smp_processor_id();
}

const struct bpf_func_proto bpf_get_smp_processor_id_proto = {
	.func		= bpf_get_smp_processor_id,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
};

BPF_CALL_0(bpf_get_numa_node_id)
{
	return numa_node_id();
}

const struct bpf_func_proto bpf_get_numa_node_id_proto = {
	.func		= bpf_get_numa_node_id,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
};

BPF_CALL_0(bpf_ktime_get_ns)
{
	/* NMI safe access to clock monotonic */
	return ktime_get_mono_fast_ns();
}

const struct bpf_func_proto bpf_ktime_get_ns_proto = {
	.func		= bpf_ktime_get_ns,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
};
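
/* bpf_ktime_get_boot_ns() below differs from bpf_ktime_get_ns() above in
 * that CLOCK_BOOTTIME keeps counting across system suspend, whereas
 * CLOCK_MONOTONIC does not.
 */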

BPF_CALL_0(bpf_ktime_get_boot_ns)
{
	/* NMI safe access to clock boottime */
	return ktime_get_boot_fast_ns();
}

const struct bpf_func_proto bpf_ktime_get_boot_ns_proto = {
	.func		= bpf_ktime_get_boot_ns,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
};
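
/* The value returned by bpf_get_current_comm's sibling below packs the
 * thread group id in the upper 32 bits and the thread id in the lower 32
 * bits, i.e. what user space calls the process ID and thread ID.
 */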

BPF_CALL_0(bpf_get_current_pid_tgid)
{
	struct task_struct *task = current;

	if (unlikely(!task))
		return -EINVAL;

	return (u64) task->tgid << 32 | task->pid;
}

const struct bpf_func_proto bpf_get_current_pid_tgid_proto = {
	.func		= bpf_get_current_pid_tgid,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
};

BPF_CALL_0(bpf_get_current_uid_gid)
{
	struct task_struct *task = current;
	kuid_t uid;
	kgid_t gid;

	if (unlikely(!task))
		return -EINVAL;

	current_uid_gid(&uid, &gid);
	return (u64) from_kgid(&init_user_ns, gid) << 32 |
		     from_kuid(&init_user_ns, uid);
}

const struct bpf_func_proto bpf_get_current_uid_gid_proto = {
	.func		= bpf_get_current_uid_gid,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
};
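
/* Illustrative BPF-program-side use of the helper below (hypothetical
 * snippet, not part of this file):
 *
 *	char comm[16] = {};
 *	bpf_get_current_comm(comm, sizeof(comm));
 */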

BPF_CALL_2(bpf_get_current_comm, char *, buf, u32, size)
{
	struct task_struct *task = current;

	if (unlikely(!task))
		goto err_clear;

	strncpy(buf, task->comm, size);

	/* Verifier guarantees that size > 0. For task->comm exceeding
	 * size, guarantee that buf is %NUL-terminated. Unconditionally
	 * done here to save the size test.
	 */
	buf[size - 1] = 0;
	return 0;
err_clear:
	memset(buf, 0, size);
	return -EINVAL;
}

const struct bpf_func_proto bpf_get_current_comm_proto = {
	.func		= bpf_get_current_comm,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
	.arg2_type	= ARG_CONST_SIZE,
};
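
/* bpf_spin_lock is a 4-byte lock embedded in map values. When the
 * architecture provides a suitable arch_spinlock_t (queued spinlocks or
 * BPF_ARCH_SPINLOCK), it is used directly; otherwise a minimal
 * test-and-set lock built on atomic_t serves as the fallback below.
 */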

#if defined(CONFIG_QUEUED_SPINLOCKS) || defined(CONFIG_BPF_ARCH_SPINLOCK)

static inline void __bpf_spin_lock(struct bpf_spin_lock *lock)
{
	arch_spinlock_t *l = (void *)lock;
	union {
		__u32 val;
		arch_spinlock_t lock;
	} u = { .lock = __ARCH_SPIN_LOCK_UNLOCKED };

	compiletime_assert(u.val == 0, "__ARCH_SPIN_LOCK_UNLOCKED not 0");
	BUILD_BUG_ON(sizeof(*l) != sizeof(__u32));
	BUILD_BUG_ON(sizeof(*lock) != sizeof(__u32));
	arch_spin_lock(l);
}

static inline void __bpf_spin_unlock(struct bpf_spin_lock *lock)
{
	arch_spinlock_t *l = (void *)lock;

	arch_spin_unlock(l);
}

#else

static inline void __bpf_spin_lock(struct bpf_spin_lock *lock)
{
	atomic_t *l = (void *)lock;

	BUILD_BUG_ON(sizeof(*l) != sizeof(*lock));
	do {
		atomic_cond_read_relaxed(l, !VAL);
	} while (atomic_xchg(l, 1));
}

static inline void __bpf_spin_unlock(struct bpf_spin_lock *lock)
{
	atomic_t *l = (void *)lock;

	atomic_set_release(l, 0);
}

#endif
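
/* The helper wrappers below disable interrupts around the raw lock, so a
 * critical section guarded by bpf_spin_lock() cannot be re-entered from
 * IRQ context on the same CPU; the saved flags live in per-cpu storage
 * until the matching bpf_spin_unlock().
 */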

static DEFINE_PER_CPU(unsigned long, irqsave_flags);

notrace BPF_CALL_1(bpf_spin_lock, struct bpf_spin_lock *, lock)
{
	unsigned long flags;

	local_irq_save(flags);
	__bpf_spin_lock(lock);
	__this_cpu_write(irqsave_flags, flags);
	return 0;
}

const struct bpf_func_proto bpf_spin_lock_proto = {
	.func		= bpf_spin_lock,
	.gpl_only	= false,
	.ret_type	= RET_VOID,
	.arg1_type	= ARG_PTR_TO_SPIN_LOCK,
};

notrace BPF_CALL_1(bpf_spin_unlock, struct bpf_spin_lock *, lock)
{
	unsigned long flags;

	flags = __this_cpu_read(irqsave_flags);
	__bpf_spin_unlock(lock);
	local_irq_restore(flags);
	return 0;
}

const struct bpf_func_proto bpf_spin_unlock_proto = {
	.func		= bpf_spin_unlock,
	.gpl_only	= false,
	.ret_type	= RET_VOID,
	.arg1_type	= ARG_PTR_TO_SPIN_LOCK,
};
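
/* copy_map_value_locked() is used by map lookup/update paths that honour
 * BPF_F_LOCK: it takes the bpf_spin_lock embedded in either the source or
 * the destination value while copying, so readers never observe a torn
 * value.
 */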

void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
			   bool lock_src)
{
	struct bpf_spin_lock *lock;

	if (lock_src)
		lock = src + map->spin_lock_off;
	else
		lock = dst + map->spin_lock_off;
	preempt_disable();
	____bpf_spin_lock(lock);
	copy_map_value(map, dst, src);
	____bpf_spin_unlock(lock);
	preempt_enable();
}

BPF_CALL_0(bpf_jiffies64)
{
	return get_jiffies_64();
}

const struct bpf_func_proto bpf_jiffies64_proto = {
	.func		= bpf_jiffies64,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
};
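
/* Cgroup helpers: only built when CONFIG_CGROUPS is enabled. Both helpers
 * below operate on the task's cgroup in the default (v2) hierarchy, as
 * returned by task_dfl_cgroup().
 */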

#ifdef CONFIG_CGROUPS
BPF_CALL_0(bpf_get_current_cgroup_id)
{
	struct cgroup *cgrp = task_dfl_cgroup(current);

	return cgroup_id(cgrp);
}

const struct bpf_func_proto bpf_get_current_cgroup_id_proto = {
	.func		= bpf_get_current_cgroup_id,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
};

BPF_CALL_1(bpf_get_current_ancestor_cgroup_id, int, ancestor_level)
{
	struct cgroup *cgrp = task_dfl_cgroup(current);
	struct cgroup *ancestor;

	ancestor = cgroup_ancestor(cgrp, ancestor_level);
	if (!ancestor)
		return 0;
	return cgroup_id(ancestor);
}

const struct bpf_func_proto bpf_get_current_ancestor_cgroup_id_proto = {
	.func		= bpf_get_current_ancestor_cgroup_id,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
	.arg1_type	= ARG_ANYTHING,
};

#ifdef CONFIG_CGROUP_BPF
DECLARE_PER_CPU(struct bpf_cgroup_storage_info,
		bpf_cgroup_storage_info[BPF_CGROUP_STORAGE_NEST_MAX]);
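
/* bpf_get_local_storage() looks up the per-cpu slot that was claimed for
 * the currently running program/task. The slots are scanned from the
 * highest index down so that nested program runs (e.g. an interrupt
 * firing while a program runs for the same task) find the most recently
 * claimed slot for 'current' first.
 */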
|

BPF_CALL_2(bpf_get_local_storage, struct bpf_map *, map, u64, flags)
{
	/* The flags argument is not used now, but provides an ability to
	 * extend the API. The verifier checks that its value is correct.
	 */
	enum bpf_cgroup_storage_type stype = cgroup_storage_type(map);
	struct bpf_cgroup_storage *storage = NULL;
	void *ptr;
	int i;

	/* Traverse the slots in reverse, from the last to the first, to match
	 * the order used by bpf_cgroup_storage_unset(). Otherwise a nested
	 * program run (e.g. from an interrupt) that reuses the same task
	 * could release the wrong slot and incorrect local storage could be
	 * returned; see commit a2baf4e8bb0f ("bpf: Fix potentially incorrect
	 * results with bpf_get_local_storage()").
	 */
	for (i = BPF_CGROUP_STORAGE_NEST_MAX - 1; i >= 0; i--) {
		if (likely(this_cpu_read(bpf_cgroup_storage_info[i].task) != current))
			continue;

		storage = this_cpu_read(bpf_cgroup_storage_info[i].storage[stype]);
		break;
	}

	if (stype == BPF_CGROUP_STORAGE_SHARED)
		ptr = &READ_ONCE(storage->buf)->data[0];
	else
		ptr = this_cpu_ptr(storage->percpu_buf);

	return (unsigned long)ptr;
}

const struct bpf_func_proto bpf_get_local_storage_proto = {
	.func		= bpf_get_local_storage,
	.gpl_only	= false,
	.ret_type	= RET_PTR_TO_MAP_VALUE,
	.arg1_type	= ARG_CONST_MAP_PTR,
	.arg2_type	= ARG_ANYTHING,
};
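
/* Illustrative only: a sketch of the BPF side, assuming a map of type
 * BPF_MAP_TYPE_CGROUP_STORAGE (here named cgrp_storage) attached to the same
 * cgroup/attach type as the program, with a hypothetical value type; flags
 * is currently expected to be 0.
 *
 *	struct counters *c = bpf_get_local_storage(&cgrp_storage, 0);
 *
 * The returned pointer is the map value for the cgroup the program runs
 * against, so it can be dereferenced directly without a separate lookup.
 */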

#endif /* CONFIG_CGROUP_BPF */

#define BPF_STRTOX_BASE_MASK 0x1F

static int __bpf_strtoull(const char *buf, size_t buf_len, u64 flags,
			  unsigned long long *res, bool *is_negative)
{
	unsigned int base = flags & BPF_STRTOX_BASE_MASK;
	const char *cur_buf = buf;
	size_t cur_len = buf_len;
	unsigned int consumed;
	size_t val_len;
	char str[64];

	if (!buf || !buf_len || !res || !is_negative)
		return -EINVAL;

	if (base != 0 && base != 8 && base != 10 && base != 16)
		return -EINVAL;

	if (flags & ~BPF_STRTOX_BASE_MASK)
		return -EINVAL;

	while (cur_buf < buf + buf_len && isspace(*cur_buf))
		++cur_buf;

	*is_negative = (cur_buf < buf + buf_len && *cur_buf == '-');
	if (*is_negative)
		++cur_buf;

	consumed = cur_buf - buf;
	cur_len -= consumed;
	if (!cur_len)
		return -EINVAL;

	cur_len = min(cur_len, sizeof(str) - 1);

	memcpy(str, cur_buf, cur_len);
	str[cur_len] = '\0';
	cur_buf = str;

	cur_buf = _parse_integer_fixup_radix(cur_buf, &base);
	val_len = _parse_integer(cur_buf, base, res);

	if (val_len & KSTRTOX_OVERFLOW)
		return -ERANGE;

	if (val_len == 0)
		return -EINVAL;

	cur_buf += val_len;
	consumed += cur_buf - str;

	return consumed;
}

static int __bpf_strtoll(const char *buf, size_t buf_len, u64 flags,
			 long long *res)
{
	unsigned long long _res;
	bool is_negative;
	int err;

	err = __bpf_strtoull(buf, buf_len, flags, &_res, &is_negative);
	if (err < 0)
		return err;
	if (is_negative) {
		if ((long long)-_res > 0)
			return -ERANGE;
		*res = -_res;
	} else {
		if ((long long)_res < 0)
			return -ERANGE;
		*res = _res;
	}
	return err;
}

BPF_CALL_4(bpf_strtol, const char *, buf, size_t, buf_len, u64, flags,
	   long *, res)
{
	long long _res;
	int err;

	err = __bpf_strtoll(buf, buf_len, flags, &_res);
	if (err < 0)
		return err;
	if (_res != (long)_res)
		return -ERANGE;
	*res = _res;
	return err;
}

const struct bpf_func_proto bpf_strtol_proto = {
	.func		= bpf_strtol,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
	.arg1_type	= ARG_PTR_TO_MEM,
	.arg2_type	= ARG_CONST_SIZE,
	.arg3_type	= ARG_ANYTHING,
	.arg4_type	= ARG_PTR_TO_LONG,
};
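
/* Illustrative only: a sketch of parsing a decimal number inside a BPF
 * program (e.g. text read from a sysctl hook); buf and its contents are
 * assumed to have been filled in by the program beforehand. The low bits of
 * the flags argument select the base, as with BPF_STRTOX_BASE_MASK above.
 *
 *	long val;
 *	char buf[16] = "12345";
 *
 *	if (bpf_strtol(buf, sizeof(buf), 10, &val) < 0)
 *		return 0;	// not a valid number
 *
 * On success the return value is the number of characters consumed,
 * mirroring __bpf_strtoll() above; bpf_strtoul() behaves the same for
 * unsigned values.
 */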

BPF_CALL_4(bpf_strtoul, const char *, buf, size_t, buf_len, u64, flags,
	   unsigned long *, res)
{
	unsigned long long _res;
	bool is_negative;
	int err;

	err = __bpf_strtoull(buf, buf_len, flags, &_res, &is_negative);
	if (err < 0)
		return err;
	if (is_negative)
		return -EINVAL;
	if (_res != (unsigned long)_res)
		return -ERANGE;
	*res = _res;
	return err;
}

const struct bpf_func_proto bpf_strtoul_proto = {
	.func		= bpf_strtoul,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
	.arg1_type	= ARG_PTR_TO_MEM,
	.arg2_type	= ARG_CONST_SIZE,
	.arg3_type	= ARG_ANYTHING,
	.arg4_type	= ARG_PTR_TO_LONG,
};

#endif /* CONFIG_CGROUPS */

BPF_CALL_4(bpf_get_ns_current_pid_tgid, u64, dev, u64, ino,
	   struct bpf_pidns_info *, nsdata, u32, size)
{
	struct task_struct *task = current;
	struct pid_namespace *pidns;
	int err = -EINVAL;

	if (unlikely(size != sizeof(struct bpf_pidns_info)))
		goto clear;

	if (unlikely((u64)(dev_t)dev != dev))
		goto clear;

	if (unlikely(!task))
		goto clear;

	pidns = task_active_pid_ns(task);
	if (unlikely(!pidns)) {
		err = -ENOENT;
		goto clear;
	}

	if (!ns_match(&pidns->ns, (dev_t)dev, ino))
		goto clear;

	nsdata->pid = task_pid_nr_ns(task, pidns);
	nsdata->tgid = task_tgid_nr_ns(task, pidns);
	return 0;
clear:
	memset((void *)nsdata, 0, (size_t) size);
	return err;
}

const struct bpf_func_proto bpf_get_ns_current_pid_tgid_proto = {
	.func		= bpf_get_ns_current_pid_tgid,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
	.arg1_type	= ARG_ANYTHING,
	.arg2_type	= ARG_ANYTHING,
	.arg3_type	= ARG_PTR_TO_UNINIT_MEM,
	.arg4_type	= ARG_CONST_SIZE,
};
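
/* Illustrative only: a sketch of resolving the pid/tgid as seen inside a
 * given pid namespace. PIDNS_DEV and PIDNS_INO are hypothetical constants
 * that user space would obtain (e.g. by stat()ing /proc/<pid>/ns/pid) and
 * pass into the program.
 *
 *	struct bpf_pidns_info ns = {};
 *
 *	if (bpf_get_ns_current_pid_tgid(PIDNS_DEV, PIDNS_INO, &ns,
 *					sizeof(ns)))
 *		return 0;
 *	// ns.pid and ns.tgid are the namespace-local ids
 */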

static const struct bpf_func_proto bpf_get_raw_smp_processor_id_proto = {
	.func		= bpf_get_raw_cpu_id,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
};

BPF_CALL_5(bpf_event_output_data, void *, ctx, struct bpf_map *, map,
	   u64, flags, void *, data, u64, size)
{
	if (unlikely(flags & ~(BPF_F_INDEX_MASK)))
		return -EINVAL;

	return bpf_event_output(map, flags, data, size, NULL, 0, NULL);
}

const struct bpf_func_proto bpf_event_output_data_proto = {
	.func		= bpf_event_output_data,
	.gpl_only	= true,
	.ret_type	= RET_INTEGER,
	.arg1_type	= ARG_PTR_TO_CTX,
	.arg2_type	= ARG_CONST_MAP_PTR,
	.arg3_type	= ARG_ANYTHING,
	.arg4_type	= ARG_PTR_TO_MEM,
	.arg5_type	= ARG_CONST_SIZE_OR_ZERO,
};
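
/* Illustrative only: bpf_event_output_data() is used as the
 * bpf_perf_event_output() implementation for program types that have no
 * packet data to append. A sketch of the BPF side, assuming a
 * BPF_MAP_TYPE_PERF_EVENT_ARRAY map named events and a sample struct
 * defined by the program:
 *
 *	struct event e = { .type = 1 };
 *
 *	bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
 *			      &e, sizeof(e));
 */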

BPF_CALL_3(bpf_copy_from_user, void *, dst, u32, size,
	   const void __user *, user_ptr)
{
	int ret = copy_from_user(dst, user_ptr, size);

	if (unlikely(ret)) {
		memset(dst, 0, size);
		ret = -EFAULT;
	}

	return ret;
}

const struct bpf_func_proto bpf_copy_from_user_proto = {
	.func		= bpf_copy_from_user,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
	.arg3_type	= ARG_ANYTHING,
};
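
/* Illustrative only: copy_from_user() may sleep, so this helper is limited
 * to sleepable BPF programs. A sketch reading a user buffer whose address
 * was taken from the traced context (uaddr here is hypothetical):
 *
 *	char buf[64] = {};
 *
 *	if (bpf_copy_from_user(buf, sizeof(buf), uaddr))
 *		return 0;	// faulted; buf was zeroed by the helper
 */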

BPF_CALL_2(bpf_per_cpu_ptr, const void *, ptr, u32, cpu)
{
	if (cpu >= nr_cpu_ids)
		return (unsigned long)NULL;

	return (unsigned long)per_cpu_ptr((const void __percpu *)ptr, cpu);
}

const struct bpf_func_proto bpf_per_cpu_ptr_proto = {
	.func		= bpf_per_cpu_ptr,
	.gpl_only	= false,
	.ret_type	= RET_PTR_TO_MEM_OR_BTF_ID_OR_NULL,
	.arg1_type	= ARG_PTR_TO_PERCPU_BTF_ID,
	.arg2_type	= ARG_ANYTHING,
};

BPF_CALL_1(bpf_this_cpu_ptr, const void *, percpu_ptr)
{
	return (unsigned long)this_cpu_ptr((const void __percpu *)percpu_ptr);
}

const struct bpf_func_proto bpf_this_cpu_ptr_proto = {
	.func		= bpf_this_cpu_ptr,
	.gpl_only	= false,
	.ret_type	= RET_PTR_TO_MEM_OR_BTF_ID,
	.arg1_type	= ARG_PTR_TO_PERCPU_BTF_ID,
};
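
/* Illustrative only: both helpers take a pointer to a per-CPU kernel
 * variable obtained through a BTF ksym declaration in the BPF program.
 * bpf_per_cpu_ptr() may return NULL (e.g. for an out-of-range CPU), while
 * bpf_this_cpu_ptr() never does. A sketch, assuming the per-CPU runqueues
 * variable is visible via BTF:
 *
 *	extern const struct rq runqueues __ksym;
 *
 *	struct rq *rq = bpf_per_cpu_ptr(&runqueues, cpu);
 *	if (!rq)
 *		return 0;
 *	struct rq *this_rq = bpf_this_cpu_ptr(&runqueues);
 */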

const struct bpf_func_proto bpf_get_current_task_proto __weak;
const struct bpf_func_proto bpf_probe_read_user_proto __weak;
const struct bpf_func_proto bpf_probe_read_user_str_proto __weak;
const struct bpf_func_proto bpf_probe_read_kernel_proto __weak;
const struct bpf_func_proto bpf_probe_read_kernel_str_proto __weak;

const struct bpf_func_proto *
bpf_base_func_proto(enum bpf_func_id func_id)
{
	switch (func_id) {
	case BPF_FUNC_map_lookup_elem:
		return &bpf_map_lookup_elem_proto;
	case BPF_FUNC_map_update_elem:
		return &bpf_map_update_elem_proto;
	case BPF_FUNC_map_delete_elem:
		return &bpf_map_delete_elem_proto;
	case BPF_FUNC_map_push_elem:
		return &bpf_map_push_elem_proto;
	case BPF_FUNC_map_pop_elem:
		return &bpf_map_pop_elem_proto;
	case BPF_FUNC_map_peek_elem:
		return &bpf_map_peek_elem_proto;
	case BPF_FUNC_get_prandom_u32:
		return &bpf_get_prandom_u32_proto;
	case BPF_FUNC_get_smp_processor_id:
		return &bpf_get_raw_smp_processor_id_proto;
	case BPF_FUNC_get_numa_node_id:
		return &bpf_get_numa_node_id_proto;
	case BPF_FUNC_tail_call:
		return &bpf_tail_call_proto;
	case BPF_FUNC_ktime_get_ns:
		return &bpf_ktime_get_ns_proto;
	case BPF_FUNC_ktime_get_boot_ns:
		return &bpf_ktime_get_boot_ns_proto;
	/* BPF ring buffer (BPF_MAP_TYPE_RINGBUF): an MPSC buffer shared
	 * across CPUs, with bpf_ringbuf_output() for copy-based submission,
	 * bpf_ringbuf_reserve()/submit()/discard() for zero-copy two-step
	 * submission, and bpf_ringbuf_query() for buffer state.
	 */
	case BPF_FUNC_ringbuf_output:
		return &bpf_ringbuf_output_proto;
	case BPF_FUNC_ringbuf_reserve:
		return &bpf_ringbuf_reserve_proto;
	case BPF_FUNC_ringbuf_submit:
		return &bpf_ringbuf_submit_proto;
	case BPF_FUNC_ringbuf_discard:
		return &bpf_ringbuf_discard_proto;
	case BPF_FUNC_ringbuf_query:
		return &bpf_ringbuf_query_proto;
	default:
		break;
	}

	if (!bpf_capable())
		return NULL;

	switch (func_id) {
	case BPF_FUNC_spin_lock:
		return &bpf_spin_lock_proto;
	case BPF_FUNC_spin_unlock:
		return &bpf_spin_unlock_proto;
	case BPF_FUNC_jiffies64:
		return &bpf_jiffies64_proto;
	case BPF_FUNC_per_cpu_ptr:
		return &bpf_per_cpu_ptr_proto;
	case BPF_FUNC_this_cpu_ptr:
		return &bpf_this_cpu_ptr_proto;
	default:
		break;
	}

	if (!perfmon_capable())
		return NULL;

	switch (func_id) {
	case BPF_FUNC_trace_printk:
		return bpf_get_trace_printk_proto();
	case BPF_FUNC_get_current_task:
		return &bpf_get_current_task_proto;
	case BPF_FUNC_probe_read_user:
		return &bpf_probe_read_user_proto;
	case BPF_FUNC_probe_read_kernel:
		/* The lockdown/audit check is performed here, at load time
		 * when the helper proto is resolved, rather than in the
		 * helper fast path; see commit ff40e51043af ("bpf, lockdown,
		 * audit: Fix buggy SELinux lockdown permission checks").
		 */
		return security_locked_down(LOCKDOWN_BPF_READ) < 0 ?
		       NULL : &bpf_probe_read_kernel_proto;
	case BPF_FUNC_probe_read_user_str:
		return &bpf_probe_read_user_str_proto;
	case BPF_FUNC_probe_read_kernel_str:
		return security_locked_down(LOCKDOWN_BPF_READ) < 0 ?
		       NULL : &bpf_probe_read_kernel_str_proto;
	case BPF_FUNC_snprintf_btf:
		return &bpf_snprintf_btf_proto;
	default:
		return NULL;
	}
}