[RFC 20/20] ima: Setup securityfs_ns for IMA namespace

Wed Dec 1 21:34:35 UTC 2021

On 12/1/21 16:11, James Bottomley wrote:
> On Wed, 2021-12-01 at 15:25 -0500, Stefan Berger wrote:
>> On 12/1/21 14:21, James Bottomley wrote:
>>> On Wed, 2021-12-01 at 13:11 -0500, Stefan Berger wrote:
>>>> On 12/1/21 12:56, James Bottomley wrote:
>>> [...]
>>>> I tried this with runc and a user namespace active mapping uid
>>>> 1000 on the host to uid 0 in the container. There I run into the
>>>> problem that all of the files and directories without the above
>>>> work-around are mapped to 'nobody', just like all the files in
>>>> sysfs in this case are also mapped to nobody. This code resolved
>>>> the issue.
>>> So I applied your patches with the permission shift commented out
>>> and instrumented inode_alloc() to see where it might be failing and
>>> I actually find it all works as expected for me:
>>>
>>> ejb@testdeb:~> unshare -r --user --mount --ima
>>> root@testdeb:~# mount -t securityfs_ns none /sys/kernel/security
>>> root@testdeb:~# ls -l /sys/kernel/security/ima/
>>> total 0
>>> -r--r----- 1 root root 0 Dec  1 19:11 ascii_runtime_measurements
>>> -r--r----- 1 root root 0 Dec  1 19:11 binary_runtime_measurements
>>> -rw------- 1 root root 0 Dec  1 19:11 policy
>>> -r--r----- 1 root root 0 Dec  1 19:11 runtime_measurements_count
>>> -r--r----- 1 root root 0 Dec  1 19:11 violations
>>>
>>> I think your problem is something to do with how runc is installing
>>> the uid/gid mappings.  If it's installing them after the
>>> security_ns inodes are created then they get the -1 value (because
>>> no mappings exist in s_user_ns).  I can even demonstrate this by
>>> forcing unshare to enter the IMA namespace before writing the
>>> mapping values and I'll see "nobody nogroup" above like you do.
>> I am surprised you get this mapping even after commenting the
>> permission adjustments... it doesn't work for me when I comment them
>> out:
>>
>> [stefanb@ima-ns-dev rootfs]$ unshare -r --user --mount
>> [root@ima-ns-dev rootfs]# mount -t securityfs_ns none /sys/kernel/security/
>> [root@ima-ns-dev rootfs]# cd /sys/kernel/security/ima/
>> [root@ima-ns-dev ima]# ls -l
>> total 0
>> -r--r-----. 1 nobody nobody 0 Dec  1 15:20 ascii_runtime_measurements
>> -r--r-----. 1 nobody nobody 0 Dec  1 15:20 binary_runtime_measurements
>> -rw-------. 1 nobody nobody 0 Dec  1 15:20 policy
>> -r--r-----. 1 nobody nobody 0 Dec  1 15:20 runtime_measurements_count
>> -r--r-----. 1 nobody nobody 0 Dec  1 15:20 violations
>> [root@ima-ns-dev ima]# cat /proc/self/uid_map
>>            0       1000          1
>> [root@ima-ns-dev ima]# cat /proc/self/gid_map
>>            0       1000          1
>>
>> The initialization of securityfs and setup of files and directories
>> happens at the same time as the IMA namespace is created. At this
>> time there are no user mappings available, so that's why I need to
>> make the adjustments 'late'.
> There is one other possible difference:  To get the correct s_user_ns

I am currently wondering why I cannot re-create your setup with the
remapping disabled...
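
For reference, here is a minimal sketch (Python, illustration only, not
kernel code) of how an id map such as the "0 1000 1" shown above decides
what 'ls -l' reports inside the namespace. The helper names parse_id_map
and displayed_id are made up for this example, and 65534 is assumed to be
the overflow id (the default of /proc/sys/kernel/overflowuid, usually
shown as 'nobody'). An owner id that has no mapping falls through to the
overflow id, which is consistent with the 'nobody' ownership seen when
the inodes are set up before any mapping exists.

```python
OVERFLOW_ID = 65534  # assumed default of /proc/sys/kernel/overflowuid


def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside_start, outside_start, count) tuples."""
    extents = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            extents.append(tuple(int(field) for field in fields))
    return extents


def displayed_id(outside_id, extents):
    """Translate an id from the parent namespace into the id seen inside.

    Returns OVERFLOW_ID when no extent covers outside_id, which is how an
    unmapped owner ends up being displayed as 'nobody'.
    """
    for inside_start, outside_start, count in extents:
        if outside_start <= outside_id < outside_start + count:
            return inside_start + (outside_id - outside_start)
    return OVERFLOW_ID


if __name__ == "__main__":
    extents = parse_id_map("         0       1000          1\n")
    print(displayed_id(1000, extents))  # host uid 1000 is covered by the map
    print(displayed_id(0, extents))     # host uid 0 is not covered by the map
    print(displayed_id(1000, []))       # nothing is mapped yet
```

With the map above, only host id 1000 translates to 0 (root) inside the
namespace; every other owner, and every owner while the map is still
empty, comes out as the overflow id.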

> on the securityfs_ns mount, the mount namespace itself has to be owned
> by the user namespace ... is runc doing that correctly?  I always

Following an strace of 'runc create' I see an unshare(CLONE_NEWUSER) by 
a process before it does an 
unshare(CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWPID|CLONE_NEWNET), 
so this seems to be doing it in the order you suggest.
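
A rough way to repeat this check on other traces, as a sketch only: the
helper below (userns_precedes_mountns is a made-up name) scans
strace-style lines and reports whether CLONE_NEWUSER was unshared before,
or in the same call as, CLONE_NEWNS. It assumes the trace of a single
process, looks only at unshare() calls, and ignores clone() and return
values, so it is a simplification of what runc really does.

```python
import re


def userns_precedes_mountns(strace_lines):
    """Return True if CLONE_NEWUSER is unshared before, or together with, CLONE_NEWNS.

    Returns False if the mount namespace is unshared first, or if no
    unshare() call with CLONE_NEWNS appears at all.
    """
    have_userns = False
    for line in strace_lines:
        match = re.search(r"unshare\(([^)]*)\)", line)
        if not match:
            continue
        flags = {flag.strip() for flag in match.group(1).split("|")}
        if "CLONE_NEWUSER" in flags:
            have_userns = True
        if "CLONE_NEWNS" in flags:
            # The first mount-namespace unshare decides the answer.
            return have_userns
    return False


if __name__ == "__main__":
    trace = [
        "unshare(CLONE_NEWUSER) = 0",
        "unshare(CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWPID|CLONE_NEWNET) = 0",
    ]
    print(userns_precedes_mountns(trace))
```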

Also, runc seems to have its own set of struggles. I am not sure we
would be able to ask them to accommodate us by doing it 'correctly';
judging by this comment, getting everything right under the hood
doesn't sound easy for them either:

https://github.com/opencontainers/runc/blob/master/libcontainer/nsenter/nsexec.c#L919

      * In order for this unsharing code to be more extensible we need to split
      * up unshare(CLONE_NEWUSER) and clone() in various ways. The ideal case
      * would be if we did clone(CLONE_NEWUSER) and the other namespaces
      * separately, but because of SELinux issues we cannot really do that. But

[...]

      * However, if we unshare(2) the user namespace *before* we clone(2), then
      * all hell breaks loose.

sounds like fun

So I am not sure whether I am working around a runc issue. To find
out, I would first like to re-create your successful setup and see
what's different.

    Stefan

> forget this detail because unshare does it correctly automatically but
> it means you must unshare the user namespace first and then unshare the
> mount namespace (or do it in the same sys call because the kernel will
> get the correct order).
>
> James
>
>