[PATCH v2 1/2] landlock: Multithreading support for landlock_restrict_self()

Günther Noack gnoack at google.com
Thu Nov 27 09:32:21 UTC 2025


Hello!

Thank you for the review!  It has taken a while, but now V3 is almost
ready and I have the answers to these questions.

On Fri, Oct 17, 2025 at 05:04:23PM +0200, Mickaël Salaün wrote:
> On Wed, Oct 01, 2025 at 01:18:06PM +0200, Günther Noack wrote:
> > Introduce the LANDLOCK_RESTRICT_SELF_TSYNC flag.  With this flag, a
> > given Landlock ruleset is applied to all threads of the calling
> > process, instead of only the current one.
> > 
> > Without this flag, multithreaded userspace programs currently resort
> > to using the nptl(7)/libpsx hack for multithreaded policy enforcement,
> > which is also used by libcap and for setuid(2).  Using this scheme,
> > the threads of a process enforce the same Landlock ruleset, but the
> > resulting Landlock domains are still separate, which makes a
> > difference for Landlock's "scoped" access rights, where the domain
> > identity and nesting is used.  As a result, when using
> > LANDLOCK_SCOPE_SIGNAL, signaling between sibling threads stops
> > working.  This is a problem for programming languages and frameworks
> > which are inherently multithreaded (e.g. Go).
> 
> Having different Landlock domains also means that we get different
> domains ID, and the audit logs might confuse users.

True, I added that to the commit message.


> > Cc: Mickaël Salaün <mic at digikod.net>
> > Cc: Paul Moore <paul at paul-moore.com>
> > Cc: linux-security-module at vger.kernel.org
> > Suggested-by: Jann Horn <jannh at google.com>
> > Signed-off-by: Günther Noack <gnoack at google.com>
> > ---
> >  include/uapi/linux/landlock.h |   4 +
> >  security/landlock/cred.h      |  12 +
> >  security/landlock/limits.h    |   2 +-
> >  security/landlock/syscalls.c  | 433 +++++++++++++++++++++++++++++++++-
> 
> You can move most of this code to a new tsync.c file.

Done.


> > --- a/include/uapi/linux/landlock.h
> > +++ b/include/uapi/linux/landlock.h
> > @@ -117,11 +117,15 @@ struct landlock_ruleset_attr {
> >   *     future nested domains, not the one being created. It can also be used
> >   *     with a @ruleset_fd value of -1 to mute subdomain logs without creating a
> >   *     domain.
> > + *
> > + * %LANDLOCK_RESTRICT_SELF_TSYNC
> > + *    Apply the given ruleset atomically to all threads of the current process.
> 
> Please bump the Landlock ABI version and update the doc.

Done.

Docs: I added a flag description to the landlock.h header so that it
appears in the documentation for the system call, and added an entry
to the section where we describe the differences between the ABI
versions.


> > diff --git a/security/landlock/cred.h b/security/landlock/cred.h
> > index c82fe63ec598..eb28eeade760 100644
> > --- a/security/landlock/cred.h
> > +++ b/security/landlock/cred.h
> > @@ -65,6 +65,18 @@ landlock_cred(const struct cred *cred)
> >  	return cred->security + landlock_blob_sizes.lbs_cred;
> >  }
> >  
> > +static inline void landlock_cred_copy(struct landlock_cred_security *dst,
> > +				      const struct landlock_cred_security *src)
> > +{
> 
> You can simplify by removing the domain checks which are already done by
> landlock_put/get_ruleset().
> 
> > +	if (dst->domain)
> > +		landlock_put_ruleset(dst->domain);
> > +
> > +	*dst = *src;
> > +
> > +	if (dst->domain)
> > +		landlock_get_ruleset(src->domain);

Done.
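
The simplified helper then looks roughly like this (a sketch; it
relies on landlock_put_ruleset() and landlock_get_ruleset() accepting
a NULL domain, which is the point of the review comment):

  static inline void landlock_cred_copy(struct landlock_cred_security *dst,
                                        const struct landlock_cred_security *src)
  {
          /* Both helpers are no-ops for a NULL domain. */
          landlock_put_ruleset(dst->domain);
          *dst = *src;
          landlock_get_ruleset(dst->domain);
  }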


> > +/*
> > + * restrict_one_thread - update a thread's Landlock domain in lockstep with the
> > + * other threads in the same process
> > + *
> > + * When this is run, the same function gets run in all other threads in the same
> > + * process (except for the calling thread which called landlock_restrict_self).
> > + * The concurrently running invocations of restrict_one_thread coordinate
> > + * through the shared ctx object to do their work in lockstep to implement
> > + * all-or-nothing semantics for enforcing the new Landlock domain.
> > + *
> > + * Afterwards, depending on the presence of an error, all threads either commit
> > + * or abort the prepared credentials.  The commit operation can not fail any more.
> > + */
> > +static void restrict_one_thread(struct tsync_shared_context *ctx)
> > +{
> > +	int res;
> 
> For consistency with existing code, please use "err" instead of "res".

Done.

> > +	struct cred *cred = NULL;
> > +	const struct cred *current_cred = current_cred();
> 
> No need for this variable.

Done.

> > +
> > +	if (current_cred == ctx->old_cred) {
> > +		/*
> > +		 * As a shortcut, switch out old_cred with new_cred, if
> > +		 * possible.
> > +		 *
> > +		 * Note: We are intentionally dropping the const qualifier here,
> > +		 * because it is required by commit_creds() and abort_creds().
> > +		 */
> > +		cred = (struct cred *)get_cred(ctx->new_cred);
> 
> Good.  You can extend the comment to explain that this optimization
> avoid creating new credentials in most cases, and then save memory.

Sounds good, I extended that comment a bit.
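
The comment now explains the memory angle along these lines
(paraphrased, not the literal wording):

  /*
   * As a shortcut, switch out old_cred with new_cred where possible.
   * This avoids a prepare_creds() call, and thus a credential
   * allocation, for every thread that shares the caller's old
   * credentials, which is the common case.
   */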


> > +	} else {
> > +		/* Else, prepare new creds and populate them. */
> > +		cred = prepare_creds();
> > +
> > +		if (!cred) {
> > +			atomic_set(&ctx->preparation_error, -ENOMEM);
> > +
> > +			/*
> > +			 * Even on error, we need to adhere to the protocol and
> > +			 * coordinate with concurrently running invocations.
> > +			 */
> > +			if (atomic_dec_return(&ctx->num_preparing) == 0)
> > +				complete_all(&ctx->all_prepared);
> > +
> > +			goto out;
> > +		}
> > +
> > +		landlock_cred_copy(landlock_cred(cred),
> > +				   landlock_cred(ctx->new_cred));
> > +	}
> > +
> > +	/*
> > +	 * Barrier: Wait until all threads are done preparing.
> > +	 * After this point, we can have no more failures.
> > +	 */
> > +	if (atomic_dec_return(&ctx->num_preparing) == 0)
> > +		complete_all(&ctx->all_prepared);
> > +
> > +	/*
> > +	 * Wait for signal from calling thread that it's safe to read the
> > +	 * preparation error now and we are ready to commit (or abort).
> > +	 */
> > +	wait_for_completion(&ctx->ready_to_commit);
> > +
> > +	/* Abort the commit if any of the other threads had an error. */
> > +	res = atomic_read(&ctx->preparation_error);
> > +	if (res) {
> > +		abort_creds(cred);
> > +		goto out;
> > +	}
> > +
> > +	/* If needed, establish enforcement prerequisites. */
> > +	if (!ns_capable_noaudit(current_user_ns(), CAP_SYS_ADMIN))
> > +		task_set_no_new_privs(current);
> 
> We should always set PR_SET_NO_NEW_PRIVS if it is set on the calling
> thread as done by seccomp.  We should just store the result of
> task_no_new_privs() in tsync_shared_context and use it as condition here.
> This should be explained in the documentation.
> 
> This also mean that if the calling thread has CAP_SYS_ADMIN but not
> PR_SET_NO_NEW_PRIVS, then a sibling thread could not have CAP_SYS_ADMIN
> nor PR_SET_NO_NEW_PRIVS.  This would be a risky state but mainly because
> of the CAP_SYS_ADMIN inconsistency, not really the missing
> PR_SET_NO_NEW_PRIVS.

OK, done.  In line with seccomp now, where the logic is (pseudocode):

  if (task_no_new_privs(caller))
      task_set_no_new_privs(current);

As you say, the situation where the caller has CAP_SYS_ADMIN but not
PR_SET_NO_NEW_PRIVS results in sibling threads that also end up
without PR_SET_NO_NEW_PRIVS.  But this is consistent with seccomp.
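
In code terms, the shared context now carries the caller's
no_new_privs bit; a sketch (the field name is illustrative):

  /* In landlock_restrict_self(), before queueing the task works: */
  ctx.no_new_privs = task_no_new_privs(current);

  /* In restrict_one_thread(), mirroring seccomp's TSYNC behavior: */
  if (ctx->no_new_privs)
          task_set_no_new_privs(current);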


> > +
> > +	commit_creds(cred);
> > +
> > +out:
> > +	/* Notify the calling thread once all threads are done */
> > +	if (atomic_dec_return(&ctx->num_unfinished) == 0)
> > +		complete_all(&ctx->all_finished);
> > +}
> > +
> > +/*
> > + * restrict_one_thread_callback - task_work callback for restricting a thread
> > + *
> > + * Calls restrict_one_thread with the struct landlock_shared_tsync_context.
> > + */
> > +static void restrict_one_thread_callback(struct callback_head *work)
> > +{
> > +	struct tsync_work *ctx = container_of(work, struct tsync_work, work);
> > +
> > +	restrict_one_thread(ctx->shared_ctx);
> > +}
> > +
> > +/*
> > + * struct tsync_works - a growable array of per-task contexts
> > + *
> > + * The zero-initialized struct represents the empty array.
> > + */
> > +struct tsync_works {
> > +	struct tsync_work **works;
> > +	size_t size;
> > +	size_t capacity;
> > +};
> > +
> > +/*
> > + * tsync_works_provide - provides a preallocated tsync_work for the given task
> > + *
> > + * This also stores a task pointer in the context and increments the reference
> > + * count of the task.
> > + *
> > + * Returns:
> > + *   A pointer to the preallocated context struct, with task filled in.
> > + *
> > + *   NULL, if we ran out of preallocated context structs.
> > + */
> > +static struct tsync_work *tsync_works_provide(struct tsync_works *s,
> > +					      struct task_struct *task)
> > +{
> > +	struct tsync_work *ctx;
> > +
> > +	if (s->size >= s->capacity)
> 
> In which case can this happen?  Should we wrap this in a WARN_ON_ONCE()?

As Jann also said in his reply, it can happen when the number of
threads changes between the first for_each_thread() loop, where we
count the threads, and the second for_each_thread() loop, where we
fill the allocated slots in the array.  If that happens, we deal with
it by doing an additional loop to discover the newly created threads,
roughly as sketched below.
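
A sketch of that retry (RCU locking shown coarsely; error paths and
the exclusion of the calling thread are elided; "works" is a
zero-initialized struct tsync_works):

  struct task_struct *thread;
  size_t nr_missing;

  do {
          nr_missing = 0;

          /* Count threads that do not have a preallocated slot yet. */
          rcu_read_lock();
          for_each_thread(current, thread)
                  if (!tsync_works_contains_task(&works, thread))
                          nr_missing++;
          rcu_read_unlock();

          if (tsync_works_grow_by(&works, nr_missing, GFP_KERNEL))
                  goto out;  /* -ENOMEM */

          /*
           * Hand out slots; this can come up short when threads were
           * created in the meantime, in which case we loop again.
           */
          rcu_read_lock();
          for_each_thread(current, thread) {
                  if (tsync_works_contains_task(&works, thread))
                          continue;
                  if (!tsync_works_provide(&works, thread))
                          break;
          }
          rcu_read_unlock();
  } while (nr_missing > 0);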


> > +		return NULL;
> > +
> > +	ctx = s->works[s->size];
> > +	s->size++;
> > +
> > +	ctx->task = get_task_struct(task);
> > +	return ctx;
> > +}
> > +
> > +/*
> > + * tsync_works_grow_by - preallocates space for n more contexts in s
> > + *
> > + * Returns:
> > + *   -ENOMEM if the (re)allocation fails
> > + *   0       if the allocation succeeds, partially succeeds, or no reallocation was needed
> > + */
> > +static int tsync_works_grow_by(struct tsync_works *s, size_t n, gfp_t flags)
> > +{
> > +	int i;
> 
> size_t i;

Done.


> > +	size_t new_capacity = s->capacity + n;
> 
> > +	struct tsync_work **works;
> > +
> > +	if (new_capacity <= s->capacity)
> 
> This integer overflow check should return an error instead.

Oh, this was not meant to check overflow, actually.

As Jann remarked in [1], the real fix is a few lines above: the new
capacity should be computed as s->size + n instead of s->capacity + n.
With that logic, it also makes sense to check whether the new capacity
is less than or equal to the old one: this can legitimately happen
when a previous allocation was not used up in its entirety, so the
existing capacity may already cover the additional n items.  Returning
0 is then correct, because the semantics of the function are "make
sure that there is enough space for n additional items", and that is
fulfilled without error.

[1] http://lore.kernel.org/all/CAG48ez1oS9kANZBq1bt+D76MX03DPHAFp76GJt7z5yx-Na1VLQ@mail.gmail.com

I did add an overflow check as well, for good measure (it is
practically free).
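
The corrected prologue then looks roughly like this (a sketch; the
error code for the overflow case is illustrative):

  static int tsync_works_grow_by(struct tsync_works *s, size_t n, gfp_t flags)
  {
          size_t new_capacity;

          /* Ensure capacity for n items beyond the currently used size. */
          if (check_add_overflow(s->size, n, &new_capacity))
                  return -ENOMEM;

          /* Leftover capacity from an earlier grow may already suffice. */
          if (new_capacity <= s->capacity)
                  return 0;

          /* ... krealloc_array() and the per-item allocations follow ... */
  }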

> 
> > +		return 0;
> > +
> > +	works = krealloc_array(s->works, new_capacity, sizeof(s->works[0]),
> > +			       flags);
> > +	if (IS_ERR(works))
> > +		return PTR_ERR(works);
> > +
> > +	s->works = works;
> > +
> > +	for (i = s->capacity; i < new_capacity; i++) {
> > +		s->works[i] = kzalloc(sizeof(*s->works[i]), flags);
> 
> We should use a local variable to avoid storing an error code in
> s->works[i] and potentially dereferencing it later (e.g. in
> tsync_work_free).

As Jann also pointed out in the other mail, this was a mistake:
kzalloc() returns NULL on failure, not an error pointer.  (Fixed.)

I extracted a local variable anyway; it is a bit clearer to read.
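
With both fixes, the allocation loop becomes roughly (a sketch, not
necessarily the literal V3 code):

  for (i = s->capacity; i < new_capacity; i++) {
          struct tsync_work *work = kzalloc(sizeof(*work), flags);

          if (!work) {
                  /* Leave the object consistent: only count what was
                   * actually allocated. */
                  s->capacity = i;
                  return -ENOMEM;
          }
          s->works[i] = work;
  }
  s->capacity = new_capacity;
  return 0;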


> Why can't we avoid this loop entirely and allocate a flat array with
> only one call to krealloc_array()?  Why struct tsync_works->works needs
> to be a pointer to a pointer?

We pass out pointers to the struct tsync_work elements when calling
task_work_add(), and these pointers are used by the task_work logic
and by our own implementation of that task_work,
restrict_one_thread_callback().  If these elements were directly in
the array, realloc would move them to new locations and these pointers
would then point into freed memory.

Discussed by Jann in [2] as well.

[2] https://lore.kernel.org/all/CAG48ez2KoF6hVSJwdPfUpN3oroMww6Mu1+-bsBSbO8C5f91P6Q@mail.gmail.com/

If we took a very close look at all the affected code, we could
perhaps convince ourselves that these objects may be discarded after
the all_prepared point, but that would be a brittle pattern which goes
against normal conventions and could break in a future version: when
task_work_run() invokes "work->func(work)", the memory behind that
"work" pointer would get freed halfway through the invocation.  I'd
prefer not to base our correctness on such subtle assumptions about
other corners of the code.
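
In code terms, the constraint is that each work item escapes by
address to the task_work machinery and must stay at that address until
its callback has run; a sketch (the notify mode is a guess, the struct
fields follow the patch):

  struct tsync_work {
          struct callback_head work;  /* handed out by address below */
          struct task_struct *task;
          struct tsync_shared_context *shared_ctx;
  };

  init_task_work(&w->work, restrict_one_thread_callback);

  /*
   * &w->work must remain valid until restrict_one_thread_callback()
   * has run; krealloc_array() over an array of embedded structs would
   * move it.  Hence the array of pointers to individually allocated
   * tsync_work objects.
   */
  err = task_work_add(w->task, &w->work, TWA_SIGNAL);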


> > +		if (IS_ERR(s->works[i])) {
> > +			/*
> > +			 * Leave the object in a consistent state,
> > +			 * but return an error.
> > +			 */
> > +			s->capacity = i;
> > +			return PTR_ERR(s->works[i]);
> > +		}
> > +	}
> > +	s->capacity = new_capacity;
> > +	return 0;
> > +}
> > +
> > +/*
> > + * tsync_works_contains - checks for presence of task in s
> > + */
> > +static bool tsync_works_contains_task(struct tsync_works *s,
> > +				      struct task_struct *task)
> > +{
> > +	size_t i;
> > +
> > +	for (i = 0; i < s->size; i++)
> > +		if (s->works[i]->task == task)
> > +			return true;
> > +	return false;
> > +}
> > +
> > +/*
> > + * tsync_works_free - free memory held by s and drop all task references
> > + */
> > +static void tsync_works_free(struct tsync_works *s)
> 
> tsync_works_put() would be more appropriate since this function doesn't
> free s.

put() is usually a reference count decrement, and we don't do that either.
After a bit of thinking, I settled on tsync_works_release(), which is
a slightly more generic wording that does not sound like a refcount.


> > @@ -566,5 +987,13 @@ SYSCALL_DEFINE2(landlock_restrict_self, const int, ruleset_fd, const __u32,
> >  	new_llcred->domain_exec |= BIT(new_dom->num_layers - 1);
> >  #endif /* CONFIG_AUDIT */
> >  
> > +	if (flags & LANDLOCK_RESTRICT_SELF_TSYNC) {
> > +		res = restrict_sibling_threads(current_cred(), new_cred);
> > +		if (res != 0) {
> 
> if (err) {

Done.

(I corrected it in another place, in landlock_restrict_sibling_threads(),
as well.)


—Günther


