[PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf

Tue Aug 13 21:58:25 UTC 2019

On Tue, Aug 06, 2019 at 10:24:25PM -0700, Andy Lutomirski wrote:
> >
> > Inside containers and inside nested containers we need to start processes
> > that will use bpf. All of the processes are trusted.
> 
> Trusted by whom?  In a non-nested container, the container manager
> *might* be trusted by the outside world.  In a *nested* container,
> unless the inner container management is controlled from outside the
> outer container, it's not trusted.  I don't know much about how
> Facebook's containers work, but the LXC/LXD/Podman world is moving
> very strongly toward user namespaces and maximally-untrusted
> containers, and I think bpf() should work in that context.

I agree that containers (namespaces) reduce the amount of trust necessary
for apps to run, but the end goal is not security.
Linux has become a single user system.
If a user can ssh into the host, they can become root.
If arbitrary code can run on the host, it will break out of any sandbox.
Containers do not provide the level of security needed
to run arbitrary code. VMs can do it better, but CPU bugs don't make it easy.
Containers are used to make production systems safer.
Some people call it more 'secure', but it's clearly not secure for
arbitrary code, and that is what kernel.unprivileged_bpf_disabled=0 allows.
When we say 'unprivileged bpf' we really mean an arbitrary malicious bpf program.
It's been a constant source of pain. The constant blinding, randomization,
verifier speculative analysis, and all the Spectre v1, v2, v4 mitigations
are simply not worth it. It's a lot of complex kernel code without users.
There is not a single use case for allowing an arbitrary malicious bpf
program to be loaded and executed.
As soon as we have /dev/bpf to allow all of bpf to be used without root,
we will set sysctl kernel.unprivileged_bpf_disabled=1.
Hence I prefer this /dev/bpf mechanism to be as simple as possible.
The applications that will use it are going to be just as trusted as systemd.
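
For reference, the current state of that sysctl can be checked from userspace
through procfs. This is only a minimal sketch: the helper names below are made
up for illustration, and error handling is reduced to returning -1.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: parse the text form of a sysctl integer.
 * Returns -1 if the text does not start with a number. */
static int parse_sysctl_int(const char *text)
{
	char *end;
	long v = strtol(text, &end, 10);

	if (end == text)
		return -1;
	return (int)v;
}

/* Hypothetical helper: read kernel.unprivileged_bpf_disabled via procfs.
 * Returns the value, or -1 if the file is missing or unreadable. */
static int read_unprivileged_bpf_disabled(void)
{
	char buf[32];
	FILE *f = fopen("/proc/sys/kernel/unprivileged_bpf_disabled", "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return parse_sysctl_int(buf);
}
```

A value of 0 means unprivileged users may call bpf(); 1 means they may not.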

> > To solve your concern of bypassing all capable checks...
> > How about we do /dev/bpf/full_verifier first?
> > It will replace capable() checks in the verifier only.
> 
> I'm not convinced that "in the verifier" is the right distinction.
> Telling administrators that some setting lets certain users bypass
> bpf() verifier checks doesn't have a clear enough meaning.  

Linux is a single user system. There are no administrators any more.
No doubt, folks will disagree, but that game is over.
At least on bpf side it's done.

> I propose,
> instead, that the current capable() checks be divided into three
> categories:

I don't see a use case for these categories.
All bpf programs extend the kernel in some way.
The kernel vs user is one category.
Conceptually CAP_BPF is enough. It would be similar to CAP_NET_ADMIN.
When an application has CAP_NET_ADMIN, it covers all of the networking knobs.
There is no use case that would warrant fine-grained CAP_ROUTE_ADMIN,
CAP_ETHTOOL_ADMIN, CAP_ETH0_ADMIN, etc.
Similarly, CAP_BPF as the only knob is enough.
The only disadvantage of CAP_BPF is that it's not possible to
pass it from one systemd-like daemon to another systemd-like daemon.
Hence the /dev/bpf idea and passing a file descriptor.
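
To make the fd passing concrete: one daemon can hand an open descriptor to
another over an AF_UNIX socket with SCM_RIGHTS. The sketch below shows only
that generic mechanism. send_fd() and recv_fd() are illustrative names, and
how the receiver would then use a /dev/bpf descriptor is whatever this series
ends up defining; it is not shown here.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send 'fd' over a connected AF_UNIX socket. Returns 0 or -1. */
static int send_fd(int sock, int fd)
{
	char dummy = 'x';	/* at least one byte of real data is sent */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from 'sock'. Returns the new fd or -1. */
static int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) <= 0)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The received descriptor refers to the same open file as the sender's, which is
what lets one trusted daemon delegate access to another without a capability.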

> This type of thing actually fits quite nicely into an idea I've been
> thinking about for a while called "implicit rights". In very brief
> summary, there would be objects called /dev/rights/xyz, where xyz is
> the name of a "right".  If there is a readable object of the right
> type at the literal path "/dev/rights/xyz", then you have right xyz.
> There's a bit more flexibility on top of this.  BPF could use
> /dev/rights/bpf/maptypes/lpm and
> /dev/rights/bpf/verifier/bounded_loops, for example.  Other non-BPF
> use cases include a biggie:
> /dev/rights/namespace/create_unprivileged_userns.
> /dev/rights/bind_port/80 would be nice, too.

The concept of "implicit rights" is very nice and I'm sure it will
be a good fit somewhere, but I don't see why we would use it in bpf space.
There is no use case for fine-grained partitioning of bpf features.