[PATCH v5 0/4] Introduce security_create_user_ns()

Thu Aug 18 14:05:21 UTC 2022

On Wed, Aug 17, 2022 at 04:24:28PM -0500, Eric W. Biederman wrote:
> Paul Moore <paul at paul-moore.com> writes:
> 
> > On Wed, Aug 17, 2022 at 4:56 PM Eric W. Biederman <ebiederm at xmission.com> wrote:
> >> Paul Moore <paul at paul-moore.com> writes:
> >> > On Wed, Aug 17, 2022 at 3:58 PM Eric W. Biederman <ebiederm at xmission.com> wrote:
> >> >> Paul Moore <paul at paul-moore.com> writes:
> >> >>
> >> >> > At the end of the v4 patchset I suggested merging this into lsm/next
> >> >> > so it could get a full -rc cycle in linux-next, assuming no issues
> >> >> > were uncovered during testing
> >> >>
> >> >> What in the world can be uncovered in linux-next for code that has no in
> >> >> tree users.
> >> >
> >> > The patchset provides both BPF LSM and SELinux implementations of the
> >> > hooks along with a BPF LSM test under tools/testing/selftests/bpf/.
> >> > If no one beats me to it, I plan to work on adding a test to the
> >> > selinux-testsuite as soon as I'm done dealing with other urgent
> >> > LSM/SELinux issues (io_uring CMD passthrough, SCTP problems, etc.); I
> >> > run these tests multiple times a week (multiple times a day sometimes)
> >> > against the -rcX kernels with the lsm/next, selinux/next, and
> >> > audit/next branches applied on top.  I know others do similar things.
> >>
> >> A layer of hooks that leaves all of the logic to userspace is not an
> >> in-tree user for purposes of understanding the logic of the code.
> >
> > The BPF LSM selftests which are part of this patchset live in-tree.
> > The SELinux hook implementation is completely in-tree with the
> > subject/verb/object relationship clearly described by the code itself.
> > After all, the selinux_userns_create() function consists of only two
> > lines, one of which is an assignment.  Yes, it is true that the
> > SELinux policy lives outside the kernel, but that is because there is
> > no singular SELinux policy for everyone.  From a practical
> > perspective, the SELinux policy is really just a configuration file
> > used to setup the kernel at runtime; it is not significantly different
> > than an iptables script, /etc/sysctl.conf, or any of the other myriad
> > of configuration files used to configure the kernel during boot.
> 
> I object to adding the new system configuration knob.

I do strongly sympathize with Eric's points.  It will be very easy, once
user namespace creation has been further restricted in some distros, to
say "well see this stuff is silly" and go back to simply requiring root
to create all containers and namespaces, which is generally quite a bit
easier anyway.  And then, of course, give everyone root so they can
start containers.

As Eric said,

 | Further adding a random failure mode to user namespace creation if it is
 | used at all will just encourage userspace to use a setuid application to
 | perform the namespace creation instead.  Creating a less secure system
 | overall.

However, I'm also looking at e.g. CVE-2022-2588 and CVE-2022-2586, and
yes there are two issues which do require discussion (three if you
count reportability, which is mainly a tool in guarding against the others).

The first is, indeed, configuration knobs.  There are tools, including
Chrome, which use user namespaces to make things better.  The hope is
that more and more tools will do so.

The second is damage control.  When an 0day has been announced, things
change.  You can say "well the bug was there all along", but it is
different when every lazy ne'erdowell can pick an exploit off a mailing
list and use it against a product for which spinning a new version with
a new kernel and getting customers to update is probably a months-long
endeavor.  Some of these products do in fact require namespaces (user
and otherwise) as part of their function.  And - to my chagrin - I
suspect most of them create user namespaces as the root user, before
possibly processing untrusted user input, so the
unprivileged_userns_clone sysctl isn't a good fit.

SELinux (and LSMs in general) do in fact seem like a useful place to
add some configuration, because they tend to assign different domains
to tasks with different purposes and trust levels.  But another such
place is the init system / service manager.  And in most cases these
days, this will use cgroups to collect tasks of certain types.  So I
wonder (this is ALMOST ENTIRELY thinking out loud, not thought through
sufficiently) whether we should be setting a cgroup.nslock or
somesuch.

Of course, kernel livepatch is another potentially useful mitigation.
Currently that's not possible for everyone.

Maybe there is a more fundamental way we can approach this.  Part of me
still likes the idea of splitting the id mapping and capability-in-userns
parts, but that's not sufficient.  Maybe looking over all the relevant
CVEs would give a better hint.

Eric, you said

 | If the concern is to reduce the attack surface everything this
 | proposed hook can do is already possible with the security_capable
 | security hook.

I suppose I could envision an LSM which gets activated when we find
out there was a net-ns-exacerbated 0-day, which refuses CAP_NET_ADMIN
for a task not in init_user_ns?  Ideally it would be more flexible
than that.
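
To make that idea a little more concrete, here is a rough userspace
model of such a check.  It is only a sketch, not kernel code: every
type and name in it (model_user_ns, model_policy, model_capable,
lockdown_active) is hypothetical and made up for illustration, and a
real LSM would hook security_capable and work with the kernel's own
credential and namespace structures instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these types are kernel APIs. */
#define MODEL_CAP_NET_ADMIN 12  /* same value as CAP_NET_ADMIN in linux/capability.h */
#define MODEL_EPERM 1

struct model_user_ns {
	const struct model_user_ns *parent;  /* NULL only for the initial namespace */
};

struct model_policy {
	bool lockdown_active;  /* assumed switch, flipped on once a relevant 0-day is known */
};

static bool model_is_init_user_ns(const struct model_user_ns *ns)
{
	return ns->parent == NULL;
}

/*
 * Returns 0 to defer to the normal capability checks, or -MODEL_EPERM to
 * refuse CAP_NET_ADMIN for a task whose user namespace is not the initial one.
 */
static int model_capable(const struct model_policy *policy,
			 const struct model_user_ns *task_ns, int cap)
{
	assert(policy != NULL && task_ns != NULL);

	if (policy->lockdown_active && cap == MODEL_CAP_NET_ADMIN &&
	    !model_is_init_user_ns(task_ns))
		return -MODEL_EPERM;

	return 0;
}
```

With the switch off, nothing changes; with it on, only CAP_NET_ADMIN
requests from outside the initial user namespace are refused, which is
exactly the inflexibility noted above.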

> idea.  What is userspace going to do with this new feature that makes it
> worth maintaining in the kernel?
> 
> That is always the conversation we have when adding new features, and
> that is exactly the conversation that has not happened here.

Eric and Paul, I wonder, will you - or some people you'd like to represent
you - be at Plumbers in September?  Should there be a BoF session there?  (I
won't be there, but could join over video.)  I think a brainstorming session
for solutions to the above problems would be good.

> Adding a layer of indirection should not exempt a new feature from
> needing to justify itself.
> 
> Eric