[PATCH bpf-next] bpf, capabilities: introduce CAP_BPF

Andy Lutomirski luto at kernel.org
Wed Aug 28 00:55:41 UTC 2019


On Tue, Aug 27, 2019 at 5:34 PM Alexei Starovoitov
<alexei.starovoitov at gmail.com> wrote:
>
> On Tue, Aug 27, 2019 at 04:01:08PM -0700, Andy Lutomirski wrote:
> > [adding some security and tracing folks to cc]
> >
> > On Tue, Aug 27, 2019 at 1:52 PM Alexei Starovoitov <ast at kernel.org> wrote:
> > >
> > > Introduce CAP_BPF that allows loading all types of BPF programs,
> > > create most map types, load BTF, iterate programs and maps.
> > > CAP_BPF alone is not enough to attach or run programs.
> > >
> > > Networking:
> > >
> > > CAP_BPF and CAP_NET_ADMIN are necessary to:
> > > - attach to cgroup-bpf hooks like INET_INGRESS, INET_SOCK_CREATE, INET4_CONNECT
> > > - run networking bpf programs (like xdp, skb, flow_dissector)
> > >
> > > Tracing:
> > >
> > > CAP_BPF and perf_paranoid_tracepoint_raw() (which is kernel.perf_event_paranoid == -1)
> > > are necessary to:
> > > - attach bpf program to raw tracepoint
> > > - use bpf_trace_printk() in all program types (not only tracing programs)
> > > - create bpf stackmap
> > >
> > > To attach bpf to perf_events perf_event_open() needs to succeed as usual.
> > >
> > > CAP_BPF controls BPF side.
> > > CAP_NET_ADMIN controls intersection where BPF calls into networking.
> > > perf_paranoid_tracepoint_raw controls intersection where BPF calls into tracing.
> > >
> > > In the future CAP_TRACING could be introduced to control
> > > creation of kprobe/uprobe and attaching bpf to perf_events.
> > > In such case bpf_probe_read() thin wrapper would be controlled by CAP_BPF.
> > > Whereas probe_read() would be controlled by CAP_TRACING.
> > > CAP_TRACING would also control generic kprobe+probe_read.
> > > CAP_BPF and CAP_TRACING would be necessary for tracing bpf programs
> > > that want to use bpf_probe_read.
> >
> > First, some high-level review:
> >
> > Can you write up some clear documentation aimed at administrators that
> > says what CAP_BPF does?  For example, is it expected that CAP_BPF by
> > itself permits reading all kernel memory?
>
> hmm. the answer is in the sentence you quoted right above.

I was hoping for something in Documentation/admin-guide, not in a
changelog that's hard to find.

>
> > Can you give at least one fully described use case where CAP_BPF
> > solves a real-world problem that is not solved by existing mechanisms?
>
> bpftrace binary would be installed with CAP_BPF and CAP_TRACING.
> bcc tools would be installed with CAP_BPF and CAP_TRACING.
> perf binary would be installed with CAP_TRACING only.
> XDP networking daemon would be installed with CAP_BPF and CAP_NET_ADMIN.
> None of them would need full root.

As in just setting these bits in fscaps?  What does this achieve
beyond just installing them setuid-root or with CAP_SYS_ADMIN and
judicious use of capset internally?  For that matter, what prevents
unauthorized users from tracing the system if you do this?  This just
lets anyone trace the system, which seems like a mistake.

Can you clarify your example or give another one?

>
> > Changing the capability that some existing operation requires could
> > break existing programs.  The old capability may need to be accepted
> > as well.
>
> As far as I can see there is no ABI breakage. Please point out
> which line of the patch may break it.

As a more or less arbitrary selection:

 void bpf_prog_kallsyms_add(struct bpf_prog *fp)
 {
        if (!bpf_prog_kallsyms_candidate(fp) ||
-           !capable(CAP_SYS_ADMIN))
+           !capable(CAP_BPF))
                return;

Before your patch, a task with CAP_SYS_ADMIN could do this.  Now it
can't.  Per the usual Linux definition of "ABI break", this is an ABI
break if and only if someone actually did this in a context where they
have CAP_SYS_ADMIN but not all capabilities.  How confident are you
that no one does things like this?
 void bpf_prog_kallsyms_add(struct bpf_prog *fp)
 {
        if (!bpf_prog_kallsyms_candidate(fp) ||
-           !capable(CAP_SYS_ADMIN))
+           !capable(CAP_BPF))
                return;

> > > ---
> > > I would prefer to introduce CAP_TRACING soon, since it
> > > will make tracing and networking permission model symmetrical.
> > >
> >
> > Here's my proposal for CAP_TRACING, documentation-style:
> >
> > --- begin ---
> >
> > CAP_TRACING enables a task to use various kernel features to trace
> > running user programs and the kernel itself.  CAP_TRACING also enables
> > a task to bypass some speculation attack countermeasures.  A task in
> > the init user namespace with CAP_TRACING will be able to tell exactly
> > what kernel code is executed and when, and will be able to read kernel
> > registers and kernel memory.  It will, similarly, be able to read the
> > state of other user tasks.
> >
> > Specifically, CAP_TRACING allows the following operations.  It may
> > allow more operations in the future:
> >
> >  - Full use of perf_event_open(), similarly to the effect of
> > kernel.perf_event_paranoid == -1.
>
> +1
>
> >  - Loading and attaching tracing BPF programs, including use of BPF
> > raw tracepoints.
>
> -1
>
> >  - Use of BPF stack maps.
>
> -1
>
> >  - Use of bpf_probe_read() and bpf_trace_printk().
>
> -1
>
> >  - Use of unsafe pointer-to-integer conversions in BPF.
>
> -1
>
> >  - Bypassing of BPF's speculation attack hardening measures and
> > constant blinding.  (Note: other mechanisms might also allow this.)
>
> -1
> All of the above are allowed by CAP_BPF.
> They are not allowed by CAP_TRACING.

Why?  I don't mean to discount your -1, and you may well have a
compelling reason.  If so, I'll change my proposal.

>From the previous discussion, you want to make progress toward solving
a lot of problems with CAP_BPF.  One of them was making BPF
firewalling more generally useful. By making CAP_BPF grant the ability
to read kernel memory, you will make administrators much more nervous
to grant CAP_BPF.  Similarly, and correct me if I'm wrong, most of
these capabilities are primarily or only useful for tracing, so I
don't see why users without CAP_TRACING should get them.
bpf_trace_printk(), in particular, even has "trace" in its name :)

Also, if a task has CAP_TRACING, it's expected to be able to trace the
system -- that's the whole point.  Why shouldn't it be able to use BPF
to trace the system better?

>
> > CAP_TRACING does not override normal permissions on sysfs or debugfs.
> > This means that, unless a new interface for programming kprobes and
> > such is added, it does not directly allow use of kprobes.
>
> kprobes can be created via perf_event_open already.
> So above statement contradicts your first statement.

Hmm.  The way of using perf with kprobes that I'm familiar with is:

# perf probe --add func_name

And this uses "/sys/kernel/debug/tracing//kprobe_events" (with the
double slash!).  Is there indeed another way to do this?

Anyway, I didn't mean to exclude any existing perf_event_open()
mechanism -- what I meant was that, without some extension to my
proposal, /sys/kernel/debug/tracing wouldn't magically become
accessible due to CAP_TRACING.

>
> > If CAP_TRACING, by itself, enables a task to crash or otherwise
> > corrupt the kernel or other tasks, this will be considered a kernel
> > bug.
>
> +1
>
> > CAP_TRACING in a non-init user namespace may, in the future, allow
> > tracing of other tasks in that user namespace or its descendants.  It
> > will not enable kernel tracing or tracing of tasks outside the user
> > namespace in question.
>
> I would avoid describing user ns for now.
> There is enough confusion without it.
>
> > --- end ---
> >
> > Does this sound good?  The idea here is that CAP_TRACING should be
> > very useful even without CAP_BPF, which allows CAP_BPF to be less
> > powerful.
>
> As proposed CAP_BPF does not allow tracing or networking on its own.
> CAP_BPF only controls BPF side.
>
> For example:
> BPF_CALL_3(bpf_probe_read, void *, dst, u32, size, const void *, unsafe_ptr)
> {
>         int ret;
>
>         ret = probe_kernel_read(dst, unsafe_ptr, size);
>         if (unlikely(ret < 0))
>                 memset(dst, 0, size);
>
>         return ret;
> }
>
> All of BPF (including prototype of bpf_probe_read) is controlled by CAP_BPF.
> But the kernel primitives its using (probe_kernel_read) is controlled by CAP_TRACING.
> Hence a task needs _both_ CAP_BPF and CAP_TRACING to attach and run bpf program
> that uses bpf_probe_read.
>
> Similar with all other kernel code that BPF helpers may call directly or indirectly.
> If there is a way for bpf program to call into piece of code controlled by CAP_TRACING
> such helper would need CAP_BPF and CAP_TRACING.
> If bpf helper calls into something that may mangle networking packet
> such helper would need both CAP_BPF and CAP_NET_ADMIN to execute.

Why do you want to require CAP_BPF to call into functions like
bpf_probe_read()?  I understand why you want to limit access to bpf,
but I think that CAP_TRACING should be sufficient to allow the tracing
parts of BPF.  After all, a lot of your concerns, especially the ones
involving speculation, don't really apply to users with CAP_TRACING --
users with CAP_TRACING can read kernel memory with or without bpf.

>
> > > @@ -2080,7 +2083,10 @@ static int bpf_prog_test_run(const union bpf_attr *attr,
> > >         struct bpf_prog *prog;
> > >         int ret = -ENOTSUPP;
> > >
> > > -       if (!capable(CAP_SYS_ADMIN))
> > > +       if (!capable(CAP_NET_ADMIN) || !capable(CAP_BPF))
> > > +               /* test_run callback is available for networking progs only.
> > > +                * Add cap_bpf_tracing() above when tracing progs become runable.
> > > +                */
> >
> > I think test_run should probably be CAP_SYS_ADMIN forever.  test_run
> > is the only way that one can run a bpf program and call helper
> > functions via the program if one doesn't have permission to attach the
> > program.
>
> Since CAP_BPF + CAP_NET_ADMIN allow attach. It means that a task
> with these two permissions will have programs running anyway.
> (traffic will flow through netdev, socket events will happen, etc)
> Hence no reason to disallow running program via test_run.
>

test_run allows fully controlled inputs, in a context where a program
can trivially flush caches, mistrain branch predictors, etc first.  It
seems to me that, if a JITted bpf program contains an exploitable
speculation gadget (MDS, Spectre v1, RSB, or anything else), it will
be *much* easier to exploit it using test_run than using normal
network traffic.  Similarly, normal network traffic will have network
headers that are valid enough to have caused the BPF program to be
invoked in the first place.  test_run can inject arbitrary garbage.



More information about the Linux-security-module-archive mailing list