[PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf

Christian Brauner christian.brauner at ubuntu.com
Fri Aug 16 22:22:53 UTC 2019


On Fri, Aug 16, 2019 at 02:45:44PM -0700, Alexei Starovoitov wrote:
> On Thu, Aug 15, 2019 at 05:54:59PM -0700, Andy Lutomirski wrote:
> > 
> > 
> > > On Aug 15, 2019, at 4:46 PM, Alexei Starovoitov <alexei.starovoitov at gmail.com> wrote:
> > 
> > 
> > >> 
> > >> I'm not sure why you draw the line for VMs -- they're just as buggy
> > >> as anything else. Regardless, I reject this line of thinking: yes,
> > >> all software is buggy, but that isn't a reason to give up.
> > > 
> > > hmm. are you saying you want kernel community to work towards
> > > making containers (namespaces) being able to run arbitrary code
> > > downloaded from the internet?
> > 
> > Yes.

If I may weigh in here too: Yes. In fact, we already do that large
scale!

> > 
> > As an example, Sandstorm uses a combination of namespaces (user, network, mount, ipc) and a moderately permissive seccomp policy to run arbitrary code. Not just little snippets, either — node.js, Mongo, MySQL, Meteor, and other fairly heavyweight stacks can all run under Sandstorm, with the whole stack (database engine binaries, etc) supplied by entirely untrusted customers.  During the time Sandstorm was under active development, I can recall *one* bug that would have allowed a sandbox escape. That’s a pretty good track record.  (Also, Meltdown and Spectre, sigh.)
> 
> exactly: "meltdown", "spectre", "sigh".
> Side channels effectively stalled the work on secure containers.
> And killed projects like sandstorm.

If I may, Sandstorm's death has very likely nothing to do with
Meltdown/Spectre etc. since that should've also killed Qemu, Crosvm and
all the others in one fell swoop. It's also not a very good example (no
offense, Andy :)) and probably a bit dated.
We have LXD which is a full-fledged container manager that runs
*unprivileged system* containers on a large scale and is very much
alive. That is it runs systemd, openrc, what have you, i.e. simply
unmodifed distro images at this point.

It's used to run Linux workloads on all Chromebooks and in a bunch of
other places. Since its inception we did not have a single
*unprivileged* container to init userns/host breakout.
At this point in time the really bad CVEs are almost exclusively against
*privileged* containers (see this year's leading nomination for
container CVE grand mal of the year: CVE-2019-5736) which were never and
will never be considered root safe despite everyone pretending
otherwise.


> Why work on improving kaslr when there are several ways to
> get kernel addresses through hw bugs? Patch mouse holes when roof is leaking ?
> In case of unprivileged bpf I'm confident that all known holes are patched.
> Until disclosures stop happening with the frequency they do now the time
> of bpf developers is better spent on something other than unprivileged bpf.
> 
> > I’m suggesting that you engage the security community ...
> > .. so that normal users can use bpf filtering
> 
> yes, but not soon. unfortunately.

Tbh, I do not have a strong opinion on unprivileged bpf at this moment.
If the community thinks that the bits and pieces are not in place or the
timing is not right that's fine with me.
What we should make sure though is that we don't end up with something
halfbaked. And this /dev/bpf thing very much feels like a hack. If
unprivileged bpf is not a thing why bother with /dev/bpf or CAP_BPF at
all.

(The one usecase I'd care about is to extend seccomp to do pointer-based
syscall filtering. Whether or not that'd require (unprivileged) ebpf is
up for discussion at KSummit.)

Christian



More information about the Linux-security-module-archive mailing list