[RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters

Tue May 11 05:21:17 UTC 2021

On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto at kernel.org> wrote:
> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999 at gmail.com> wrote:
> >
> > From: YiFei Zhu <yifeifz2 at illinois.edu>
> >
> > Based on: https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html
> >
> > This patchset enables seccomp filters to be written in eBPF.
> > Supporting eBPF filters has been proposed a few times in the past.
> > The main concerns were (1) use cases and (2) security. We have
> > identified many use cases that can benefit from advanced eBPF
> > filters, such as:
>
> I haven't reviewed this carefully, but I think we need to distinguish
> a few things:
>
> 1. Using the eBPF *language*.
>
> 2. Allowing the use of stateful / non-pure eBPF features.
>
> 3. Allowing the eBPF programs to read the target process' memory.
>
> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
> even less convinced by (3).
>
> >
> >   * exec-only-once filter / apply filter after exec
>
> This is (2).  I'm not sure it's a good idea.

The basic idea is that a container runtime may want to execute a
program in a container without that program being able to execve
another program, stopping any attack that involves loading another
binary. With cBPF alone, the container runtime can block any syscall
except execve in the exec-ed process, because the filter must let the
initial execve through.

The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
@Andrea and @Giuseppe, could you clarify more in case I missed
something?
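
As a rough illustration of why this use case needs state, the
exec-only-once decision can be modelled in plain C. This is a sketch,
not code from the patchset: exec_once_state, exec_once_filter and
MODEL_NR_EXECVE are hypothetical names, the per-task state is passed in
explicitly instead of living in a map, and the RET_* values mirror
SECCOMP_RET_ALLOW and SECCOMP_RET_ERRNO from <linux/seccomp.h>.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Local copies of the seccomp action values from <linux/seccomp.h>. */
#define RET_ALLOW 0x7fff0000u
#define RET_ERRNO 0x00050000u

/* Assumption for this model: x86-64, where execve is syscall 59. */
#define MODEL_NR_EXECVE 59

/* Hypothetical per-task state; a real filter would keep it in a map. */
struct exec_once_state {
    bool exec_seen;
};

/* Allow the first execve, reject every later one, allow everything else. */
unsigned int exec_once_filter(struct exec_once_state *st, int nr)
{
    assert(st != NULL);

    if (nr != MODEL_NR_EXECVE)
        return RET_ALLOW;

    if (!st->exec_seen) {
        st->exec_seen = true;   /* remember that the one exec was used */
        return RET_ALLOW;
    }

    return RET_ERRNO | EPERM;
}
```

The model counts an execve attempt even if the call later fails, and a
real filter would also have to cover execveat. A stateless cBPF filter
cannot express the "first call only" distinction at all.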

> >   * syscall logging (eg. via maps)
>
> This is (2).  Probably useful, but doesn't obviously belong in
> seccomp, or at least not as part of the same seccomp feature as
> regular filtering.
>
> >   * expressiveness & better tooling (no need for DSLs like easyseccomp)
>
> (1).  Sounds good.
>
> >   * contained syscall fault injection
>
> (2)?  We can already do this with notifiers.

To clarify, “we can already do this with notifiers” isn’t the point
here. We can do almost everything with notifiers and ptrace, but they
may impose significant overhead (see the microbenchmark results).

Overhead matters here because of the reproduction / testing of certain
race conditions. A syscall that fails quickly in a userspace
application could, in the presence of a race condition, produce a
completely different trace than a syscall that fails after a few
context switches. eBPF makes quick fault injection possible.

> > For security, for an unprivileged caller, our implementation is as
> > restrictive as user notifier + ptrace, in regards to capabilities.
> > eBPF helpers follow the privilege model of original eBPF helpers.
>
> eBPF doesn't really have a privilege model yet.  There was a long and
> disappointing thread about this awhile back.

The idea is that “seccomp-eBPF does not make life easier for an
adversary”. Any attack an adversary could mount through seccomp-eBPF,
they could also mount through other eBPF features, i.e. it would be an
issue with eBPF in general rather than specifically seccomp’s use of
eBPF.

Here I am referring to the fact that the helpers fall back to the base
bpf_base_func_proto if the caller is unprivileged (!bpf_capable ||
!perfmon_capable). In this case, if the adversary could use eBPF
helpers to perform an attack, they could do it via another
unprivileged prog type.

That said, there are a few additional helpers this patchset is adding:
* get_current_uid_gid
* get_current_pid_tgid
  These two provide public information (are namespaces a concern?). I
have no idea what kind of exploit they could enable unless the
adversary somehow side-channels the task_struct? But in that case, how
is the reading of task_struct different from how the rest of the
kernel reads task_struct?
  Though, if knowing the global uid / pid is a concern then the eBPF
progs will need to keep track of namespaces, and that might not be
trivial.
* probe_read_user
* probe_read_user_str
  Reduction to ptrace. The privilege model of reading another
process’s data (via process_vm_readv or
ptrace(PTRACE_PEEK{TEXT,DATA})) is guarded by
PTRACE_MODE_ATTACH_REALCREDS. However, unprivileged seccomp is
safeguarded by no_new_privs, so the ruid of the seccomp-eBPF executor
(the task whose syscalls are being filtered) would be the same as the
ruid of the applier (the task that set the seccomp mode, at the time
of setting it). The exception is a caller with non-uniform resuid &
fsuid, in which case it’s the caller’s failure to relinquish
privileges.
  The main concern here is LSMs. LSMs can further restrict the scope
of ptrace, so I also allow LSMs to deny all use of “stateful /
non-pure eBPF features”.
  As for side channels... copy_from_user_nofault may allow an
adversary to observe what’s in resident memory and what’s swapped out,
but the adversary can already do this by observing the timing of
memory accesses. The non-nofault variant copy_from_user is used
everywhere in the kernel, so if an adversary wanted to side-channel the
kernel via copy_from_user against an address, they could already do it
by using a syscall with a pointer that would be passed to
copy_from_user.
* task_storage_get
* task_storage_delete
  This is what I’m least sure about. The implementation of
task_storage is more complex than the other helpers, and also assumes
a privileged eBPF loader. It would slightly extend the attack surface.
If this is a big issue then an eBPF program can emulate such a map
with a hashmap keyed by PID...
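
To make the PID-keyed fallback concrete, here is a minimal userspace
model of per-task storage emulated with a hash table. It is only a
sketch under stated assumptions: pid_map, pid_map_get and
pid_map_delete are hypothetical names (not kernel or libbpf APIs), the
capacity is arbitrary, and a real eBPF program would use a BPF hash map
rather than its own table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PID_MAP_SLOTS 64   /* hypothetical capacity, must be a power of two */
#define PID_EMPTY      0   /* never a valid PID */
#define PID_TOMBSTONE -1   /* deleted entry, keeps probe chains intact */

struct pid_slot {
    int32_t pid;
    uint64_t value;
};

struct pid_map {
    struct pid_slot slots[PID_MAP_SLOTS];
};

/* Find the slot holding pid, or NULL. On a miss, *reuse (if non-NULL)
 * receives the first free slot on the probe path, or NULL if full. */
static struct pid_slot *pid_map_find(struct pid_map *m, int32_t pid,
                                     struct pid_slot **reuse)
{
    size_t h = ((uint32_t)pid * 2654435761u) & (PID_MAP_SLOTS - 1);
    struct pid_slot *free_slot = NULL;

    for (size_t i = 0; i < PID_MAP_SLOTS; i++) {
        struct pid_slot *s = &m->slots[(h + i) & (PID_MAP_SLOTS - 1)];

        if (s->pid == pid)
            return s;
        if (s->pid == PID_TOMBSTONE || s->pid == PID_EMPTY) {
            if (!free_slot)
                free_slot = s;
            if (s->pid == PID_EMPTY)
                break;          /* end of this probe chain */
        }
    }
    if (reuse)
        *reuse = free_slot;
    return NULL;
}

/* Return the value for pid; with create != 0, add a zeroed entry on a miss. */
uint64_t *pid_map_get(struct pid_map *m, int32_t pid, int create)
{
    struct pid_slot *free_slot = NULL;
    struct pid_slot *s;

    assert(m != NULL && pid > 0);
    s = pid_map_find(m, pid, &free_slot);
    if (s)
        return &s->value;
    if (!create || !free_slot)
        return NULL;
    free_slot->pid = pid;
    free_slot->value = 0;
    return &free_slot->value;
}

/* Remove the entry for pid; returns 0 on success, -1 if it was absent. */
int pid_map_delete(struct pid_map *m, int32_t pid)
{
    struct pid_slot *s;

    assert(m != NULL && pid > 0);
    s = pid_map_find(m, pid, NULL);
    if (!s)
        return -1;
    s->pid = PID_TOMBSTONE;
    return 0;
}
```

Unlike task storage, nothing here removes an entry when the task exits,
so stale entries and PID reuse would have to be handled by the filter
author.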

> > Moreover, a mechanism for reading user memory is added. The same
> > prototypes of bpf_probe_read_user{,str} from tracing are used. However,
> > when the loader of bpf program does not have CAP_PTRACE, the helper
> > will return -EPERM if the task under seccomp filter is non-dumpable.
> > The reason for this is that if we perform reduction from seccomp-eBPF
> > to user notifier + ptrace, ptrace requires CAP_PTRACE to read from
> > a non-dumpable process. However, eBPF does not solve the TOCTOU problem
> > of user notifier, so users should not use this to enforce a policy
> > based on memory contents.
>
> What is this for?

Memory reading opens up lots of use cases. For example, logging what
files are being opened without the heavy performance penalty of
strace. Or as an accelerator for user notify emulation, where syscalls
can be rejected on a fast path if we know the memory contents do not
satisfy certain conditions that user notify will check.
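
The accelerator idea can be sketched as a plain C decision function.
This is an illustrative model only: open_fast_path and the prefix rule
are hypothetical, the path is passed in as an argument (an actual
filter would first copy it from user memory, e.g. with
probe_read_user_str), and the RET_* values mirror SECCOMP_RET_ERRNO and
SECCOMP_RET_USER_NOTIF from <linux/seccomp.h>.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Local copies of the seccomp action values from <linux/seccomp.h>. */
#define RET_ERRNO      0x00050000u
#define RET_USER_NOTIF 0x7fc00000u

/* Reject early when the path cannot pass the supervisor's check;
 * otherwise hand the syscall to the user notify supervisor. */
unsigned int open_fast_path(const char *path, const char *allowed_prefix)
{
    assert(path != NULL && allowed_prefix != NULL);

    /* Cheap precondition: the path must start with the allowed prefix. */
    if (strncmp(path, allowed_prefix, strlen(allowed_prefix)) != 0)
        return RET_ERRNO | EACCES;

    /* Anything that survives still gets the full user notify check. */
    return RET_USER_NOTIF;
}
```

Only the rejection is decided in the filter. Every syscall that is not
rejected still goes through the supervisor, so the fast path saves
context switches without replacing the slower check.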

YiFei Zhu