[PATCH v3 00/25] user_namespace: introduce fsid mappings

Wed Feb 19 15:36:30 UTC 2020

On Wed, 2020-02-19 at 13:27 +0100, Christian Brauner wrote:
> On Tue, Feb 18, 2020 at 03:50:56PM -0800, James Bottomley wrote:
> > On Tue, 2020-02-18 at 15:33 +0100, Christian Brauner wrote:
[...]
> > > With fsid mappings we can solve this by writing an id mapping of
> > > 0 100000 100000 and an fsid mapping of 0 300000 100000. On
> > > filesystem access the kernel will now lookup the mapping for
> > > 300000 in the fsid mapping tables of the user namespace. And
> > > since such a mapping exists, the corresponding files will have
> > > correct ownership.
> > 
> > So I did compile this up in order to run the shiftfs tests over it
> > to see how it coped with the various corner cases.  However, what I
> > find is it simply fails the fsid reverse mapping in the
> > setup.  Trying to use a simple uid of 0 100000 1000 and a fsid of
> > 100000 0 1000 fails the entry setuid(0) call because of this code:
> 
> This is easy to fix. But what's the exact use-case?

Well, the use case I'm looking to solve is the same one it's always
been: getting a deprivileged fake root in a user_ns to be able to write
an image at fsuid 0.

I don't think it's solvable in your current framework, although
allowing the domain to be disjoint might possibly hack around it.  The
problem with the proposed framework is that there are no backshifts
from the filesystem view, there are only forward shifts to the
filesystem view.  This means that to get your framework to write a
filesystem at fsuid 0 you have to have an identity map for fsuid. Which
I can do: I tested uid shift 0 100000 1000 and fsuid shift 0 0 1000. 
It does all work, as you'd expect because the container has real fs
root not a fake root.  And that's the whole problem:  Firstly, I'm fs
root for any filesystem my userns can see, so any imprecision in
setting up the mount namespace of the container and I own your host and
secondly any containment break and I'm privileged with respect to the
fs uid wherever I escape to so I will likewise own your host.

The only way to keep containment is to have a zero fsuid inside the
container corresponding to a non-zero one outside.  And the only way to
solve the imprecision in mount namespace issue is to strictly control
the entry point at which the writing at fsuid 0 becomes active.

James