[PATCH v2 bpf-next 00/18] BPF token

Fri Jun 23 23:23:10 UTC 2023

On 6/23/23 5:10 PM, Andy Lutomirski wrote:
> On Thu, Jun 22, 2023, at 6:02 PM, Andy Lutomirski wrote:
>> On Thu, Jun 22, 2023, at 11:40 AM, Andrii Nakryiko wrote:
>>
>>> Hopefully you can see where I'm going with this. And this is just one
>>> random tiny example. We can think up tons of other cases to prove BPF
>>> is not isolatable to any sort of "container".
>>
>> No.  You have not come up with an example of why BPF is not isolatable
>> to a container.  You have come up with an example of why binding to a
>> sched_switch raw tracepoint does not make sense in a container without
>> additional mechanisms to give it well defined functionality and
>> appropriate security.

One big blocker for the case of BPF is not isolatable to a container are
CPU hardware bugs. There has been plenty of mitigation effort so that the
flexibility cannot be abused as a tool e.g. discussed in [0], but ultimately
it's a cat and mouse game and vendors are also not really transparent. So
actual reasonable discussion can be resumed once CPU vendors gets their
stuff fixed.

   [0] https://popl22.sigplan.org/details/prisc-2022-papers/11/BPF-and-Spectre-Mitigating-transient-execution-attacks

> Thinking about this some more:
> 
> Suppose the goal is to allow a workload in a container to monitor itself by attaching to a tracepoint (something in the scheduler, for example).  The workload is in the container.  The tracepoint is global.  Kernel memory is global unless something that is trusted and understands the containers is doing the reading.  And proxying BPF is a mess.

Agree that proxy is a mess for various reasons stated earlier.

> So here are a couple of possible solutions:
> 
> (a) Improve BPF maps a bit so that BPF maps work well in containers.  It should be possible to create a map and share it (the file descriptor!) between the outside and the container without running into various snags.  (IIRC my patch series was a decent step in this direction,)  Now load the BPF program and attach it to the tracepoint outside the container but have it write its gathered data to the map that's in the container.  So you end up with a daemon outside the container that gets a request like "help me monitor such-and-such by running BPF program such-and-such (where the BPF program code presumably comes from a library outside the container", and the daemon arranges for the requesting container to have access to the map it needs to get the data.

I don't think it's very practical, meaning the vast majority of applications
out there today are tightly coupled BPF code + user space application, and in
a lot of cases programs are dynamically created. This would require somehow
splitting up parts of your application to run outside the container in hostns
and other parts inside the container.. for the sake of the mentioned example
it's something fairly static, but real-world applications look different and
are much more complex.

> (b) Make a way to pass a pre-approved program into a container.  So a daemon outside loads the program and does some new magic to say "make an fd that can be used to attach this particular program to this particular tracepoint" and pass that into the container.

Same as above. Programs are in most cases very tightly coupled to the application
itself. I'm not sure if the ask is to redesign/implement all the existing user
space infra.

> I think (a) is better.  In particular, if you have a workload with many containers, and they all want to monitor the same tracepoint as it relates to their container, you will get much better performance if a single BPF program does the monitoring and sends the data out to each container as needed instead of having one copy of the program per container.
> 
> For what it's worth, BPF tokens seem like they'll have the same performance problem -- without coordination, you can end up with N containers generating N hooks all targeting the same global resource, resulting in overhead that scales linearly with the number of containers.

Worst case, sure, but it's not the point. These containers which would receive
the tokens are part of your trusted compute base.. so its up to the specific
applications and their surrounding infrastructure with regards to what problem
they solve where and approved by operators/platform engs to deploy in your cluster.
I don't particularly see that there's a performance problem. Andrii specifically
mentioned /trusted unprivileged applications/.

> And, again, I'm not an XDP expert, but if you have one NIC, and you attach N XDP programs to it, and each one is inspecting packets and sending some to one particular container's AF_XDP socket, you are not going to get good performance.  You want *one* XDP program fanning the packets out to the relevant containers.
> 
> If this is hard right now, perhaps you could add new kernel mechanisms as needed to improve the situation.
> 
> --Andy
>