[RFC PATCH bpf-next seccomp 10/12] seccomp-ebpf: Add ability to read user memory

Tue May 11 07:14:01 UTC 2021

On Mon, May 10, 2021 at 9:04 PM Alexei Starovoitov
<alexei.starovoitov at gmail.com> wrote:
>
> On Mon, May 10, 2021 at 12:22:47PM -0500, YiFei Zhu wrote:
> >
> > +BPF_CALL_3(bpf_probe_read_user_dumpable, void *, dst, u32, size,
> > +        const void __user *, unsafe_ptr)
> > +{
> > +     int ret = -EPERM;
> > +
> > +     if (get_dumpable(current->mm))
> > +             ret = copy_from_user_nofault(dst, unsafe_ptr, size);
>
> Could you explain a bit more how dumpable flag makes it safe for unpriv?
> The unpriv prog is attached to the children tasks only, right?
> and dumpable gets cleared if euid changes?

This is the "reduction to ptrace". The model here is that the eBPF
seccomp filter is doing the equivalent of ptracing the user process
using the privileges of the task at the time of loading the seccomp
filter.

ptrace access control is governed by ptrace.c:__ptrace_may_access. The
requirements are:
* always allow thread group introspection -- assume false so we are
more restrictive than ptrace.
* tracer has CAP_PTRACE in the target user namespace or tracer
r/fsu/gidid equal target resu/gid -- discuss below
* tracer has CAP_PTRACE in the target user namespace or target is
SUID_DUMP_USER (I realized I should probably change the condition to
== SUID_DUMP_USER).
* passes LSM checks (eg yama ptrace_scope) -- we expose a hook to LSM
but it's more of a "disable all advanced seccomp-eBPF features". How
would a better interface to LSM look like?

The dumpable check handles the "target is SUID_DUMP_USER" condition,
in the circumstance that the loader does not have CAP_PTRACE in its
namespace at the time of load. Why would this imply its CAP_PTRACE
capability in target namespace? This is based on my understanding on
how capabilities and user namespaces interact:
For the sake of simplicity, let's first assume that loader is the same
task as the task that attaches the filter (via prctl or seccomp
syscall).
* Case 1: target and loader are the same user namespace. Trivial case,
the two operations are the same.
* Case 2: target is loader's parent namespace. Can't happen under
assumption. Seccomp affects itself and children only, and it is only
possible to join a descendant user ns.
* Case 3: target is loader's descendant namespace. Loader would have
full CAP_PTRACE on target. We are more restrictive than ptrace.
* Case 4: target and loader are on unrelated namespace branches. Can't
happen under assumption. Same as case 2.

Let's break this assumption and see what happens if the loader and
attacher are in different contexts:
* Case 1: attacher is less capable (as a general term of "what it can
do") than loader then all of the above applies, since the model
concerns and checks the capabilities of the loader.
* Case 2: attacher is more capable than loader. The attacher would
need an fd to the prog to attach it:
  * subcase 1: attacher inherited the fd after an exec and became more
capable. uh... why is it trusting fds from a less capable context?
  * subcase 2: attacher has CAP_SYS_ADMIN and gets the fd via
BPF_PROG_GET_FD_BY_ID. uh... why is it trusting random fds and
attaching it?
  * subcase 3: attacher received the fd via a domain socket from a
process which may be in a different user namespace. On my first
thought, I thought, why is it trusting random fds from a less capable
context? Except I just thought of an adversary could:
    * Clone into new userns,
    * Load filter in child, which has CAP_PTRACE in new userns
    * Send filter to the parent which doesn't have CAP_PTRACE in its userns
    * It's broken :(
We'll think more about this case. One way is to check against init
namespace, which means unpriv container runtimes won't have the
non-dumpable override. Though, it shouldn't be affecting most of the
use cases. Alternatively we can store which userns it was loaded from
and reject attaching from a different userns.

Regarding u/gids, for an attacher to attach a seccomp filter, whether
cBPF or eBPF, if it doesn't have CAP_SYS_ADMIN in its current ns, it
will have to set no_new_privs on itself before it can attach. (Unlike
the previous discussion, this check is done at attach time rather than
load.) With no_new_privs the target's privs is a subset of the
attacher's, so the attacher should have a way to match the target's
resuid, so this condition is not a concern.

YiFei Zhu