[PATCH v2 bpf-next 00/18] BPF token

Fri Jun 23 15:10:32 UTC 2023

On Thu, Jun 22, 2023, at 6:02 PM, Andy Lutomirski wrote:
> On Thu, Jun 22, 2023, at 11:40 AM, Andrii Nakryiko wrote:
>
>> Hopefully you can see where I'm going with this. And this is just one
>> random tiny example. We can think up tons of other cases to prove BPF
>> is not isolatable to any sort of "container".
>
> No.  You have not come up with an example of why BPF is not isolatable 
> to a container.  You have come up with an example of why binding to a 
> sched_switch raw tracepoint does not make sense in a container without 
> additional mechanisms to give it well defined functionality and 
> appropriate security.

Thinking about this some more:

Suppose the goal is to allow a workload in a container to monitor itself by attaching to a tracepoint (something in the scheduler, for example).  The workload is in the container.  The tracepoint is global.  Kernel memory is global unless something that is trusted and understands the containers is doing the reading.  And proxying BPF is a mess.

So here are a couple of possible solutions:

(a) Improve BPF maps a bit so that BPF maps work well in containers.  It should be possible to create a map and share it (the file descriptor!) between the outside and the container without running into various snags.  (IIRC my patch series was a decent step in this direction.)  Now load the BPF program and attach it to the tracepoint outside the container but have it write its gathered data to the map that's in the container.  So you end up with a daemon outside the container that gets a request like "help me monitor such-and-such by running BPF program such-and-such" (where the BPF program code presumably comes from a library outside the container), and the daemon arranges for the requesting container to have access to the map it needs to get the data.
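
To make the fd-sharing part of (a) concrete, here is a minimal sketch of handing a file descriptor across a Unix socket with SCM_RIGHTS.  It assumes the daemon and the container already share a connected AF_UNIX socket; the helper names send_fd() and recv_fd() are made up for illustration, and nothing here is BPF-specific.  A BPF map fd would travel the same way as any other fd, but whether the map is then fully usable inside the container is exactly the "various snags" question above.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: send one fd over a connected AF_UNIX socket.
 * Returns 0 on success, -1 on failure. */
int send_fd(int sock, int fd)
{
	char byte = 0;	/* at least one byte of real data must accompany the fd */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment of buf */
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd sent by send_fd().
 * Returns the new fd (valid in the receiving process), or -1 on failure. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
	return fd;
}
```

In this sketch the daemon would call send_fd() with the map fd after creating the map and attaching the program, and the workload would call recv_fd() and read its data from the fd it gets back.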

(b) Make a way to pass a pre-approved program into a container.  So a daemon outside loads the program and does some new magic to say "make an fd that can be used to attach this particular program to this particular tracepoint" and pass that into the container.

I think (a) is better.  In particular, if you have a workload with many containers, and they all want to monitor the same tracepoint as it relates to their container, you will get much better performance if a single BPF program does the monitoring and sends the data out to each container as needed instead of having one copy of the program per container.

For what it's worth, BPF tokens seem like they'll have the same performance problem -- without coordination, you can end up with N containers generating N hooks all targeting the same global resource, resulting in overhead that scales linearly with the number of containers.

And, again, I'm not an XDP expert, but if you have one NIC, and you attach N XDP programs to it, and each one is inspecting packets and sending some to one particular container's AF_XDP socket, you are not going to get good performance.  You want *one* XDP program fanning the packets out to the relevant containers.
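
To put rough numbers on the scaling argument, here is a toy userspace model.  It is not BPF or XDP code, and every name in it (struct event, NCONT, per_container_hooks(), single_fanout()) is invented for illustration.  It only counts how many times "a program" runs per event under the two designs, assuming each event belongs to exactly one container and that a per-container program has to look at every event to find its own.

```c
#include <stddef.h>

#define NCONT 8	/* hypothetical number of containers */

struct event {
	unsigned int container;	/* which container the event belongs to */
	unsigned long payload;
};

struct stats {
	unsigned long invocations;	/* how many times a program ran */
	unsigned long delivered[NCONT];	/* events handed to each container */
};

/* Design 1: one program per container, all attached to the same hook.
 * Every program runs on every event and keeps only its own events. */
void per_container_hooks(const struct event *ev, size_t n,
			 unsigned int ncont, struct stats *st)
{
	for (size_t i = 0; i < n; i++) {
		for (unsigned int c = 0; c < ncont && c < NCONT; c++) {
			st->invocations++;
			if (ev[i].container == c)
				st->delivered[c]++;
		}
	}
}

/* Design 2: one program on the hook that looks up the destination
 * container and fans the event out to it directly. */
void single_fanout(const struct event *ev, size_t n,
		   unsigned int ncont, struct stats *st)
{
	for (size_t i = 0; i < n; i++) {
		st->invocations++;
		if (ev[i].container < ncont && ev[i].container < NCONT)
			st->delivered[ev[i].container]++;
	}
}
```

Both designs deliver the same events to the same containers, but in this model the first runs ncont programs per event and the second runs one.  Real costs depend on what each program does, so treat this as an illustration of the shape of the problem, not a measurement.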

If this is hard right now, perhaps you could add new kernel mechanisms as needed to improve the situation.

--Andy