[RFC PATCH 4/9] User-space API for creating a supervisor-fd

Fri Apr 11 10:55:32 UTC 2025

On Wed, Mar 26, 2025 at 12:06:11AM +0000, Tingmao Wang wrote:
> On 3/11/25 19:28, Mickaël Salaün wrote:
> > On Mon, Mar 10, 2025 at 12:41:28AM +0000, Tingmao Wang wrote:
> > > On 3/5/25 16:09, Mickaël Salaün wrote:
> > > > On Tue, Mar 04, 2025 at 01:13:00AM +0000, Tingmao Wang wrote:
> > > > > We allow the user to pass in an additional flag to landlock_create_ruleset
> > > > > which will make the ruleset operate in "supervise" mode, with a supervisor
> > > > > attached. We create additional space in the landlock_ruleset_attr
> > > > > structure to pass the newly created supervisor fd back to user-space.
> > > > > 
> > > > > The intention, while not implemented yet, is that the user-space will read
> > > > > events from this fd and write responses back to it.
> > > > > 
> > > > > Note: need to investigate if fd clone on fork() is handled correctly, but
> > > > > should be fine if it shares the struct file. We might also want to let the
> > > > > user customize the flags on this fd, so that they can request no
> > > > > O_CLOEXEC.
> > > > > 
> > > > > NOTE: despite this patch having a new uapi, I'm still very open to e.g.
> > > > > re-using fanotify stuff instead (if that makes sense in the end). This is
> > > > > just a PoC.
> > > > 
> > > > The main security risk of this feature is for this FD to leak and be
> > > > used by a sandboxed process to bypass all its restrictions.  This should
> > > > be highlighted in the UAPI documentation.
> 
> In particular, if for some reason the supervisor does a fork without exec,
> it must close this fd in the "about-to-be-untrusted" child.

Yes...

> 
> (I wonder if it would be worth enforcing that the child calling
> landlock_restrict_self must not have any open supervisor fd that can
> supervise its own domain (returning an error if it does), but that can be
> difficult to implement so nevermind)

That would mean that a call can fail according to the caller's context
(e.g. FDs), which is not good for reproducibility (i.e. not idempotent).

Being able to tie a supervisor FD to a set of rulesets and then to a set
of domains is interesting too.  We might want to also add a "cookie"
value when creating a ruleset for the supervisor to identify which
ruleset it received a request from.

I was also thinking about pidfd, but they do not refer to a domain but
to a process (which may be sandboxed several times).  I found a better
idea, see below.

> 
> > > > 
> > > > > 
> > > > > Signed-off-by: Tingmao Wang <m at maowtm.org>
> > > > > ---
> > > > >    include/uapi/linux/landlock.h |  10 ++++
> > > > >    security/landlock/syscalls.c  | 102 +++++++++++++++++++++++++++++-----
> > > > >    2 files changed, 98 insertions(+), 14 deletions(-)
> > > > > 
> > > > > diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h
> > > > > index e1d2c27533b4..7bc1eb4859fb 100644
> > > > > --- a/include/uapi/linux/landlock.h
> > > > > +++ b/include/uapi/linux/landlock.h
> > > > > @@ -50,6 +50,15 @@ struct landlock_ruleset_attr {
> > > > >    	 * resources (e.g. IPCs).
> > > > >    	 */
> > > > >    	__u64 scoped;
> > > > > +	/**
> > > > > +	 * @supervisor_fd: Placeholder to store the supervisor file
> > > > > +	 * descriptor when %LANDLOCK_CREATE_RULESET_SUPERVISE is set.
> > > > > +	 */
> > > > > +	__s32 supervisor_fd;
> > > > 
> > > > This interface would require the ruleset_attr becoming updatable by the
> > > > kernel, which might be OK in theory but requires current syscall wrapper
> > > > signature update, see sandboxer.c change.  It also creates a FD which
> > > > might not be useful (e.g. if an error occurs before the actual
> > > > enforcement).
> > > > 
> > > > I see a few alternatives.  We could just use/extend the ruleset FD
> > > > instead of creating a new one, but because leaking current rulesets is
> > > > not currently a security risk, we should be careful to not change that.
> > > > 
> > > > Another approach, similar to seccomp unotify, is to get a
> > > > "[landlock-domain]" FD returned by the landlock_restrict_self(2) when a
> > > > new LANDLOCK_RESTRICT_SELF_DOMAIN_FD flag is set.  This FD would be a
> > > > reference to the newly created domain, which is more specific than the
> > > > ruleset used to created this domain (and that can be used to create
> > > > other domains).  This domain FD could be used for introspection (i.e.
> > > > to get read-only properties such as domain ID), but being able to
> > > > directly supervise the referenced domain only with this FD would be a
> > > > risk that we should limit.
> > > > 
> > > > What we can do is to implement an IOCTL command for such domain FD that
> > > > would return a supervisor FD (if the LANDLOCK_RESTRICT_SELF_SUPERVISED
> > > > flag was also set).  The key point is to check (one time) that the
> > > > process calling this IOCTL is not restricted by the related domain (see
> > > > the scope helpers).
> > > 
> > > Is LANDLOCK_RESTRICT_SELF_DOMAIN_FD part of your (upcoming?) introspection
> > > patch? (thinking about when will someone pass that only and not
> > > LANDLOCK_RESTRICT_SELF_SUPERVISED, or vice versa)
> > 
> > I don't plan to work on such LANDLOCK_RESTRICT_SELF_DOMAIN_FD flag for
> > now, but the introspection feature(s) would help for this supervisor
> > feature.
> > 
> > > 
> > > By the way, is it alright to conceptually relate the supervisor to a domain?
> > > It really would be a layer inside a domain - the domain could have earlier
> > > or later layers which can deny access without supervision, or the supervisor
> > > for earlier layers can deny access first. Therefore having supervisor fd
> > > coming out of the ruleset felt sensible to me at first.
> > 
> > Good question.  I've been using the name "domain" to refer to the set of
> > restrictions enforced on a set of processes, but these restrictions are
> > composed of inherited ones plus the latest layer.  In this case, a
> > domain FD should refer to all the restrictions, but the supervisor FD
> > should indeed only refer to the latest layer of a domain (created by
> > landlock_restrict_self).
> > 
> > > 
> > > Also, isn't "check that process calling this IOCTL is not restricted by the
> > > related domain" and the fact that the IOCTL is on the domain fd, which is a
> > > return value of landlock_restrict_self, kind of contradictory?  I mean it is
> > > a sensible check, but that kind of highlights that this interface is
> > > slightly awkward - basically all callers are forced to have a setup where
> > > the child sends the domain fd back to the parent.
> > 
> > I agree that its confusing.  I'd like to avoid the ruleset to gain any
> > control on domains after they are created.
> > 
> > Another approach would be to create a supervisor FD with the
> > landlock_create_ruleset() syscall, and pass this FD to the ruleset,
> > potentially with landlock_add_rule() calls to only request this
> > supervisor when matching specific rules (that could potentially be
> > catch-all rules)?
> 
> Maybe passing in a fd per landlock_add_rule calls, and thus potentially
> allowing different supervisor fd tied to different rules in the same
> ruleset, is a bit overkill (as now each rule needs to store a supervisor
> pointer?) and I don't really see the use of it.

I though about this approach too but being able to update the domain
with new rules would be more useful and powerful.

> I think it would be better
> to just pass it once in the landlock_ruleset_attr, which gets around the
> signature having const for the ruleset_attr problem. (I'm also open to the
> ioctl on domain fd idea, but I'm slightly wary of making this more
> complicated then necessary for the user space, as it now has to set up a
> socket (?) and pass a fd with scm_rights (?))

OK, here is another proposal: supervisor rulesets and supervisee FDs.
The idea is to add a new flag to landlock_restrict_self(2) to created a
ruleset marked as "supervisor".  This ruleset could not be passed to
landlock_restrict_self(2), but a dedicated IOCTL would create a
supervisee file descriptor.  This supervisee could be passed to a
landlock_ruleset_attr to created a supervised ruleset.

This approach is interesting because it makes it explicit the access
rights which are handled by the supervisor, which enables us to only
supervise a set of actions and update the supervisor ruleset with
landlock_add_rule(2).

Another interesting property is that because we have at least two file
descriptors for a supervisor, it's easy to create a ruleset supervisor
in process A and then only pass a supervisee FD to process B.  A leaked
supervisee FD could not give more privileges, and it is unlikely that a
supervisor FD is passed to process B because it could not be usable as a
supervisee and should then be detected early in the development cycle.

> 
> The other aspect of this is whether we want to have the supervisor mark
> specific rules as supervised, rather than having all denied access (from
> this layer) result in a supervisor invocation.  I also don't think this is
> necessary, as denials are supposed to be "abnormal" in some sense, and I
> would imagine most supervisors would want to find out about these (at least
> to print/show a warning of some sort, if it knows that the requested access
> is bad).  If a supervisor really wants to have the kernel just "silently"
> (from its perspective, but maybe there would be audit logs) deny any access
> outside of some known rules, it can also create a nested, unsupervised
> landlock domain that has the right effect. Avoiding having some sort of
> tri-state rules would simplify implementation, I imagine.

Because this supervisor use case is mainly about sandboxing programs
which may not be aware of such restrictions, they could legitimately
request a lot of time the same denied actions.  To avoid overloading the
supervisor, we need a way to filter such requests.  But being able to
initially get these request would be useful too, which is why being able
to dynamically update the supervisor ruleset is interesting.

> 
> > 
> > Overall, my main concern about this patch series is that the supervisor
> > could get a lot of requests, which will make the sandbox unusable
> > because always blocked by some thread/process.  This latest approach and
> > the ability to update the domain somehow could make it workable.
> > 
> > > 
> > > > 
> > > > Relying on IOCTL commands (for all these FD types) instead of read/write
> > > > operations should also limit the risk of these FDs being misused through
> > > > a confused deputy attack (because such IOCTL command would convey an
> > > > explicit intent):
> > > > https://docs.kernel.org/security/credentials.html#open-file-credentials
> > > > https://lore.kernel.org/all/CAG48ez0HW-nScxn4G5p8UHtYy=T435ZkF3Tb1ARTyyijt_cNEg@mail.gmail.com/
> > > > We should get inspiration from seccomp unotify for this too:
> > > > https://lore.kernel.org/all/20181209182414.30862-1-tycho@tycho.ws/
> > > 
> > > I think in the seccomp unotify case the problem arises from what the setuid
> > > binary thinks is just normal data getting interpreted by the kernel as a fd,
> > > and thus having different effect if the attacker writes it vs. if the suid
> > > app writes it.  In our case I *think* we should be alright, but maybe we
> > > should go with ioctl anyway...
> > 
> > I don't see why Jann's attack scenario could work for this Landlock
> > supervisor too.  The main point that it the read/write interfaces are
> > used by a lot of different FDs, and we may not need them.
> > 
> > > However, how does using netlink messages (a
> > > suggestion from a different thread) affect this (if we do end up using it)?
> > > Would we have to do netlink msgs via IOCTL?
> > 
> > Because all requests should be synchronous, one IOCTL could be used to
> > both acknowledge a previous event (or just start) and read the next one.
> > 
> > I was thinking about an IOCTL with these arguments:
> > 1. supervisor FD
> > 2. (extensible) IOCTL command (see PIDFD_GET_INFO for instance)
> > 3. pointer to a fixed-size control structure
> > 
> > The fixed-size control structure could contain:
> > - handled access rights, used to only get event related to specific
> >    access.
> > - flags, to specify which kind of FD we would like to get (e.g. only
> >    directory FD, pidfd...)
> > - fd[6]: an array of received file descriptors.
> > - pointer to a variable-size data buffer that would contain all the
> >    records (e.g. source dir FD, source file name, destination dir FD,
> >    destination file name) for one event, potentially formatted with NLA.
> > - the size of this buffer
> > 
> > I'm not sure about the content of this buffer and the NLA format, and
> > the related API might not be usable without netlink sockets though.
> > Taking inspiration from the fanotify message format is another option.
> > 
> > > 
> > > 
> > > > > +	/**
> > > > > +	 * @pad: Unused, must be zero.
> > > > > +	 */
> > > > > +	__u32 pad;
> > > > 
> > > > In this case we should pack the struct instead.
> > > > 
> > > > >    };
> > > > >    /*
> > > > > @@ -60,6 +69,7 @@ struct landlock_ruleset_attr {
> > > > >     */
> > > > >    /* clang-format off */
> > > > >    #define LANDLOCK_CREATE_RULESET_VERSION			(1U << 0)
> > > > > +#define LANDLOCK_CREATE_RULESET_SUPERVISE		(1U << 1)
> > > > >    /* clang-format on */
> > > > >    /**
> > > > 
> > > > [...]
> > > 
> > > 
> 
>