[PATCH] nsproxy: attach to namespaces via pidfds

Mon Apr 27 17:28:56 UTC 2020

On Mon, Apr 27, 2020 at 4:47 PM Christian Brauner
<christian.brauner at ubuntu.com> wrote:
> For quite a while we have been thinking about using pidfds to attach to
> namespaces. This patchset has existed for about a year already but we've
> wanted to wait to see how the general api would be received and adopted.
> Now that more and more programs in userspace have started using pidfds
> for process management it's time to send this one out.

You can already reliably switch to a specific namespace of another
process like this given a pidfd and the pid of the process (which, if
you don't have it, you can get via fdinfo), right?

int switch_ns_to(int pidfd, int pid, char *nstypename) {
  char ns_path[100];
  snprintf(ns_path, sizeof(ns_path), "/proc/%d/ns/%s", pid, nstypename);
  int fd = open(ns_path, O_RDONLY|O_CLOEXEC);
  int errno_after_open = errno;

  if (pidfd_send_signal(pidfd, 0, NULL, 0))
    return -1;

  if (fd == -1) {
    errno = errno_after_open;
    return -1;
  }

  int ret = setns(fd, 0);
  close(fd);
  return ret;
}

> This patch makes it possible to use pidfds to attach to the namespaces
> of another process, i.e. they can be passed as the first argument to the
> setns() syscall. When only a single namespace type is specified the
> semantics are equivalent to passing an nsfd.

This introduces a difference in terms of security: With the old API,
you need PTRACE_MODE_READ_FSCREDS on the task whose namespace you're
attaching to (to dereference the link /proc/*/ns/*) *AND* whatever
access checks the namespace itself enforces (always includes a check
for CAP_SYS_ADMIN on the namespace). The ptrace check has the
advantage, among other things, that it allows an LSM to see the
relationship between the task that's accessing the namespace (subject)
and the task whose namespace is being accessed (object).

I feel slightly twitchy about this relaxation, and I'm wondering
whether we can add a ptrace access check analogous to what you'd have
needed via procfs.

> That means
> setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However,
> when a pidfd is passed, multiple namespace flags can be specified in the
> second setns() argument and setns() will attach the caller to all the
> specified namespaces all at once or to none of them. If 0 is specified
> together with a pidfd then setns() will interpret it the same way 0 is
> interpreted together with a nsfd argument, i.e. attach to any/all
> namespaces.
[...]
> Apart from significiantly reducing the number of syscalls from double
> digit to single digit which is a decent reason post-spectre/meltdown
> this also allows to switch to a set of namespaces atomically, i.e.
> either attaching to all the specified namespaces succeeds or we fail.

Apart from the issues I've pointed out below, I think it's worth
calling out explicitly that with the current design, the switch will
not, in fact, be fully atomic - the process will temporarily be in
intermediate stages where the switches to some namespaces have
completed while the switches to other namespaces are still pending;
and while there will be less of these intermediate stages than before,
it also means that they will be less explicit to userspace.

[...]
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
[...]
> +/*
> + * Ordering is equivalent to the standard ordering used everywhere
> + * else during unshare and process creation.
> + */
> +static int ns_install(struct nsproxy *nsproxy, struct pid *pid, int flags)
> +{
> +       int ret = 0;
> +       struct task_struct *tsk;
> +       struct nsproxy *nsp;
> +
> +       tsk = get_pid_task(pid, PIDTYPE_PID);
> +       if (!tsk)
> +               return -ESRCH;
> +
> +       get_nsproxy(tsk->nsproxy);
> +       nsp = tsk->nsproxy;

How is this correct? Are you holding any locks that protect tsk->nsproxy?

> +#ifdef CONFIG_USER_NS
> +       if (wants_ns(flags, CLONE_NEWUSER)) {
> +               struct user_namespace *user_ns;
> +
> +               user_ns = get_user_ns(__task_cred(tsk)->user_ns);
> +               ret = __ns_install(nsproxy, &user_ns->ns);

If ret == 0, then at this point you've already committed the user
namespace change *to the calling process*. The ->install handler of
user namespaces doesn't touch the nsproxy at all.

> +               put_user_ns(user_ns);
> +       }
> +#else
> +       if (flags & CLONE_NEWUSER)
> +               ret = -EINVAL;
> +#endif
> +
> +       if (!ret && wants_ns(flags, CLONE_NEWNS))
> +               ret = __ns_install(nsproxy, mnt_ns_to_common(nsp->mnt_ns));

And this one might be even worse, because the mount namespace change
itself is only stored in the nsproxy at this point, but the cwd and
root paths have already been overwritten on the task's fs_struct.

To actually make sys_set_ns() atomic, I think you'd need some
moderately complicated prep work, splitting the ->install handlers up
into prep work and a commit phase that can't fail.

[...]
> +#ifdef CONFIG_PID_NS
> +       if (!ret && wants_ns(flags, CLONE_NEWPID)) {
> +               struct pid_namespace *pidns;
> +
> +               pidns = task_active_pid_ns(tsk);
> +               if (pidns) {
> +                       get_pid_ns(pidns);
> +                       ret = __ns_install(nsproxy, &pidns->ns);
> +                       put_pid_ns(pidns);
> +               }

If you can't get the task's pidns, shouldn't that be an error?

> +       }
[...]