[RFC PATCH 02/27] containers: Implement containers as kernel objects

Sun Feb 17 18:57:54 UTC 2019

Hi David,

On Fri, 2019-02-15 at 16:07 +0000, David Howells wrote:
> Implement a kernel container object such that it contains the
> following
> things:
> 
>  (1) Namespaces.
> 
>  (2) A root directory.
> 
>  (3) A set of processes, including one designated as the 'init'
> process.
> 
> A container is created and attached to a file descriptor by:
> 
> 	int cfd = container_create(const char *name, unsigned int
> flags);
> 
> this inherits all the namespaces of the parent container unless
> otherwise
> the mask calls for new namespaces.
> 
> 	CONTAINER_NEW_FS_NS
> 	CONTAINER_NEW_EMPTY_FS_NS
> 	CONTAINER_NEW_CGROUP_NS [root only]
> 	CONTAINER_NEW_UTS_NS
> 	CONTAINER_NEW_IPC_NS
> 	CONTAINER_NEW_USER_NS
> 	CONTAINER_NEW_PID_NS
> 	CONTAINER_NEW_NET_NS
> 
> Other flags include:
> 
> 	CONTAINER_KILL_ON_CLOSE
> 	CONTAINER_CLOSE_ON_EXEC
> 
> Note that I've added a pointer to the current container to
> task_struct.
> This doesn't make the nsproxy pointer redundant as you can still make
> new
> namespaces with clone().
> 
> I've also added a list_head to task_struct to form a list in the
> container
> of its member processes.  This is convenient, but redundant since the
> code
> could iterate over all the tasks looking for ones that have a
> matching
> task->container.
> 
> It might make sense to use fsconfig() to configure the container:
> 
> 	fsconfig(cfd, FSCONFIG_SET_NAMESPACE, "user", NULL, userns_fd);
> 	fsconfig(cfd, FSCONFIG_SET_NAMESPACE, "mnt", NULL, mntns_fd);
> 	fsconfig(cfd, FSCONFIG_SET_FD, "rootfs", NULL, root_fd);
> 	fsconfig(cfd, FSCONFIG_CMD_CREATE_CONTAINER, NULL, NULL, 0);
> 
> 
> ==================
> FUTURE DEVELOPMENT
> ==================
> 
>  (1) Setting up the container.
> 
>      A container would be created with, say:
> 
> 	int cfd = container_create("fred", CONTAINER_NEW_EMPTY_FS_NS);
> 
>      Once created, it should then be possible for the supervising
> process
>      to modify the new container.  Mounts can be created inside of
> the
>      container's namespaces:
> 
> 	fsfd = fsopen("ext4", 0);
> 	fsconfig(fsfd, FSCONFIG_SET_CONTAINER, NULL, NULL, cfd);
> 	fsconfig(fsfd, FSCONFIG_SET_STRING, "source", "/dev/sda3", 0);
> 	fsconfig(fsfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
> 	fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> 	mfd = fsmount(fsfd, 0, 0);
> 
>      and then mounted into the namespace:
> 
> 	move_mount(mfd, "", cfd, "/",
> 		   MOVE_MOUNT_F_EMPTY_PATH |
> MOVE_MOUNT_T_CONTAINER_ROOT);
> 
>      Further mounts can be added by:
> 
> 	move_mount(mfd, "", cfd, "proc", MOVE_MOUNT_F_EMPTY_PATH);
> 
>      Files and devices can be created by supplying the container fd
> as the
>      dirfd argument:
> 
> 	mkdirat(int cfd, const char *path, mode_t mode);
> 	mknodat(int cfd, const char *path, mode_t mode, dev_t dev);
> 	int fd = openat(int cfd, const char *path,
> 			unsigned int flags, mode_t mode);
> 
>      [*] Note that when using cfd as dirfd, the path must not contain
> a '/'
>      	 at the front.
> 
>      Sockets, such as netlink, can be opened inside of the
> container's
>      namespaces:
> 
> 	int fd = container_socket(int cfd, int domain, int type,
> 				  int protocol);
> 
>      This should allow management of the container's network
> namespace from
>      outside.
> 
>  (2) Starting the container.
> 
>      Once all modifications are complete, the container's 'init'
> process
>      can be started by:
> 
> 	fork_into_container(int cfd);
> 
>      This precludes further external modification of the mount tree
> within
>      the container.  Before this point, the container is simply
> destroyed
>      if the container fd is closed.
> 
>  (3) Waiting for the container to complete.
> 
>      The container fd can then be polled to wait for init process
> therein
>      to complete and the exit code collected by:
> 
> 	container_wait(int container_fd, int *_wstatus, unsigned int
> wait,
> 		       struct rusage *rusage);
> 
>      The container and everything in it can be terminated or killed
> off:
> 
> 	container_kill(int container_fd, int initonly, int signal);
> 
>      If 'init' dies, all other processes in the container are
> preemptively
>      SIGKILL'd by the kernel.
> 
>      By default, if the container is active and its fd is closed, the
>      container is left running and wil be cleaned up when its 'init'
> exits.
>      The default can be changed with the CONTAINER_KILL_ON_CLOSE
> flag.
> 
>  (4) Supervising the container.
> 
>      Given that we have an fd attached to the container, we could
> make it
>      such that the supervising process could monitor and override
> EPERM
>      returns for mount and other privileged operations within the
>      container.
> 
>  (5) Per-container keyring.
> 
>      Each container can point to a per-container keyring for the
> holding of
>      integrity keys and filesystem keys for use inside the
> container.  This
>      would be attached:
> 
> 	keyctl(KEYCTL_SET_CONTAINER_KEYRING, cfd, keyring)
> 
>      This keyring would be searched by request_key() after it has
> searched
>      the thread, process and session keyrings.
> 
>  (6) Running different LSM policies by container.  This might
> particularly
>      make sense with something like Apparmor where different path-
> based
>      rules might be required inside a container to inside the parent.
> 
> Signed-off-by: David Howells <dhowells at redhat.com>
> ---

Do we really need a new system call to set up containers? That would
force changes to all existing orchestration software.

Given that the main thing we want to achieve is to direct messages from
the kernel to an appropriate handler, why not focus on adding
functionality to do just that?

Is there any reason why a syscall to allow an appropriately privileged
process to add a keyring-specific message queue to its own
user_namespace and obtain a file descriptor to that message queue might
not work? That forces the container to use a daemon if it cares to
intercept keyring traffic, rather than worrying about the kernel
running request_key (in fact, it might make sense to allow a trivial
implementation of the daemon to be to just read the messages, parse
them and run request_key).

With such an implementation, the fallback mechanism could be to walk
back up the hierarchy of user_namespaces until a message queue is
found, and to invoke the existing request_key mechanism if not.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust at hammerspace.com