[PATCH v2 bpf-next 00/18] BPF token
Andrii Nakryiko
andrii.nakryiko at gmail.com
Mon Jun 26 22:08:41 UTC 2023
On Thu, Jun 22, 2023 at 6:03 PM Andy Lutomirski <luto at kernel.org> wrote:
>
>
>
> On Thu, Jun 22, 2023, at 11:40 AM, Andrii Nakryiko wrote:
> > On Thu, Jun 22, 2023 at 10:38 AM Maryam Tahhan <mtahhan at redhat.com> wrote:
> >
> > For CAP_BPF too broad. It is broad, yes. If you have good ideas how to
> > break it down some more -- please propose. But this is all orthogonal,
> > because the blocking problem is fundamental incompatibility of user
> > namespaces (and their implied isolation and sandboxing of workloads)
> > and BPF functionality, which is global by its very nature. The latter
> > is unavoidable in principle.
>
> How, exactly, is BPF global by its very nature?
>
> The *implementation* has some issues with globalness. Much of it should be fixable.
>
bpf_probe_read_kernel() is widely used and required for real-world
applications. It's global by its nature and in principle not
restrictable. We can say that we'll just disable applications that use
bpf_probe_read_kernel(), but the goal is to enable applications that
are *practically useful*, not just some restricted set of programs
that are provably contained.
> >
> > No matter how much you break down CAP_BPF, you can't enforce that BPF
> > program won't interfere with applications in other containers. Or that
> > it won't "spy" on them. It's just not what BPF can enforce in
> > principle.
>
> The WHOLE POINT of the verifier is to attempt to constrain what BPF programs can and can't do. There are bugs -- I get that. There are helper functions that are fundamentally global. But, in the absence of verifier bugs, BPF has actual boundaries to its functionality.
Looking at your other replies, I think you realized yourself that
there are valid use cases where it's impossible to statically validate
boundaries.
>
> >
> > So that comes back down to a question of trust and then controlled
> > delegation of BPF functionality. You trust workload with BPF usage
> > because you reviewed the BPF code, workload, testing, etc? Grant BPF
> > token and let that container use limited subset of BPF. Employ BPF LSM
> > to further restrict it beyond what BPF token can control.
> >
> > You cannot trust an application to not do something harmful? You
> > shouldn't grant it either CAP_BPF in init namespace, nor BPF token in
> > user namespace. That's it. Pick your poison.
>
> I think what's lost here is hardening vs restricting intended functionality.
>
> We have access control to restrict intended functionality. We have other (and generally fairly ad-hoc and awkward) ways to flip off functionality because we want to reduce exposure to any bugs in it.
>
> BPF needs hardening -- this is well established. Right now, this is accomplished by restricting it to global root (effectively). It should have access controls, too, but it doesn't.
>
> >
> > But all this cannot be mechanically decided or enforced. There has to
> > be some humans involved in making these decisions. Kernel's job is to
> > provide building blocks to grant and control BPF functionality to the
> > extent that it is technically possible.
> >
>
> Exactly. And it DOES NOT. bpf maps, etc do not have sensible access controls. Things that should not be global are global. I'm saying the kernel should fix THAT. Once it's in a state that it's at least credible to allow BPF in a user namespace, then come up with a way to allow it.
>
> > As for "something to isolate the pinned maps/progs by different apps
> > (why not DAC rules?)", there is no such thing, as I've explained
> > already.
> >
> > I can install sched_switch raw_tracepoint BPF program (if I'm allowed
> > to), and that program has system-wide observability. It cannot be
> > bound to an application.
>
> Great, a real example!
>
> Either:
>
> (a) don't run this in a container. Have a service for the container to request the help of this program.
>
> (b) have a way to have root approve a particular program and expose *that* program to the container, and let the program have its own access controls internally (e.g. only output info that belongs to that container).
>
> > then what do we do when we switch from process A in container
> > X to process B in container Y? Is that event belonging to container X?
> > Or container Y?
>
> I don't know, but you had better answer this question before you run this thing in a container, not just for security but for basic functionality. If you haven't defined what your program is even supposed to do in a container, don't run it there.
I think you are missing the point I'm making. A specific BPF program
that uses sched_switch is doing the correct and right thing (for
whatever that means in a specific case). We as humans designed,
implemented, validated, reviewed it and are confident enough (as much
as we can be with software) that it does the right thing. It doesn't
try to spy on things, doesn't try to disrupt things.
We know this as humans thanks to our internal development process.
But this is not *provable* in a mechanical sense such that the kernel
can validate and enforce this. And yet it's a practically useful
application which we'd like to be able to launch from inside the
container without rearchitecting and rewriting the entire world and
proxying everything through some external root service.
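
To make the sched_switch attribution question from earlier in this
thread concrete, here is a minimal user-space sketch. It is not kernel
or BPF code. The struct, the container IDs, and all function names are
hypothetical and exist only for illustration. It models a switch event
that carries both the outgoing and the incoming task, and shows that
two equally plausible per-container filtering policies disagree as soon
as a switch crosses a container boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one sched_switch event: the CPU moves from
 * task `prev` to task `next`. Container IDs are illustrative labels,
 * not a kernel data structure. */
struct switch_event {
    uint32_t prev_pid;
    uint32_t prev_container;
    uint32_t next_pid;
    uint32_t next_container;
};

/* True when the outgoing and incoming tasks live in different
 * containers, i.e. when the event has no single obvious owner. */
static inline bool is_cross_container(const struct switch_event *ev)
{
    return ev->prev_container != ev->next_container;
}

/* Count the events a per-container filter would show to `container`.
 * by_next == false: attribute each event to the task switched out.
 * by_next == true:  attribute each event to the task switched in. */
static inline size_t count_visible(const struct switch_event *evs, size_t n,
                                   uint32_t container, bool by_next)
{
    size_t count = 0;

    assert(evs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        uint32_t owner = by_next ? evs[i].next_container
                                 : evs[i].prev_container;
        if (owner == container)
            count++;
    }
    return count;
}
```

With three events (a switch inside container 1, a switch from
container 1 to container 2, and a switch inside container 2), the
"switched out" policy shows two events to container 1 while the
"switched in" policy shows only one. Neither answer is more correct
than the other, and a mechanical filter has no basis for choosing.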
>
>
> > Hopefully you can see where I'm going with this. And this is just one
> > random tiny example. We can think up tons of other cases to prove BPF
> > is not isolatable to any sort of "container".
>
> No. You have not come up with an example of why BPF is not isolatable to a container. You have come up with an example of why binding to a sched_switch raw tracepoint does not make sense in a container without additional mechanisms to give it well defined functionality and appropriate security.
>
> Please stop conflating BPF (programs, maps, etc) with *attachments* of BPF programs to systemwide things. They're both under the BPF umbrella. They're not the same thing.
I'm not conflating things. Thinking about BPF maps and BPF programs in
isolation from them being attached somewhere in the kernel and doing
actual and useful work is not useful.
It's the end-to-end functionality, including attaching and running BPF
programs, that matters.
Pedantically drawing the line at the BPF program load step and saying
"this is BPF and everything else is not BPF" isn't really helpful. No
one cares about just loading and validating BPF programs. Developers
care about attaching and running them; that's what it is all about.
>
> Passing a token into a container that allows that container to do things like loading its own programs *and attaching them to raw tracepoints* is IMO a complete nonstarter. It makes no sense.