[PATCH v2 1/2] landlock: Multithreading support for landlock_restrict_self()
Günther Noack
gnoack at google.com
Thu Nov 27 10:32:07 UTC 2025
On Fri, Oct 24, 2025 at 11:11:10PM +0200, Jann Horn wrote:
> On Wed, Oct 1, 2025 at 1:18 PM Günther Noack <gnoack at google.com> wrote:
> > Introduce the LANDLOCK_RESTRICT_SELF_TSYNC flag. With this flag, a
> > given Landlock ruleset is applied to all threads of the calling
> > process, instead of only the current one.
> >
> > Without this flag, multithreaded userspace programs currently resort
> > to using the nptl(7)/libpsx hack for multithreaded policy enforcement,
> > which is also used by libcap and for setuid(2). Using this scheme,
> > the threads of a process enforce the same Landlock ruleset, but the
> > resulting Landlock domains are still separate, which makes a
> > difference for Landlock's "scoped" access rights, where the domain
> > identity and nesting is used. As a result, when using
> > LANDLOCK_SCOPE_SIGNAL, signaling between sibling threads stops
> > working. This is a problem for programming languages and frameworks
> > which are inherently multithreaded (e.g. Go).
>
> This looks good to me overall, though there are a couple details to fix.
>
> [...]
> > +static inline void landlock_cred_copy(struct landlock_cred_security *dst,
> > + const struct landlock_cred_security *src)
> > +{
> > + if (dst->domain)
> > + landlock_put_ruleset(dst->domain);
> > +
> > + *dst = *src;
>
> nit: I would add a short comment at the definition of struct
> landlock_cred_security noting that this function memcpy's the entire
> struct
Sounds good. I added a small remark "when updating this, also update
landlock_cred_copy() if needed".
> > +
> > + if (dst->domain)
> > + landlock_get_ruleset(src->domain);
> > +}
> [...]
> > +/*
> > + * tsync_works_grow_by - preallocates space for n more contexts in s
> > + *
> > + * Returns:
> > + * -ENOMEM if the (re)allocation fails
> > + * 0 if the allocation succeeds, partially succeeds, or no reallocation was needed
> > + */
> > +static int tsync_works_grow_by(struct tsync_works *s, size_t n, gfp_t flags)
> > +{
> > + int i;
> > + size_t new_capacity = s->capacity + n;
>
> (You only have to grow to `s->size + n` but I guess this works too.)
Thanks, well spotted. Growing to `s->size + n` was indeed the intended
behavior; the new_capacity <= s->capacity check also makes much more
sense that way. (I have a more detailed answer in another reply.) I
fixed this and also added an overflow check for good measure.
> > + struct tsync_work **works;
> > +
> > + if (new_capacity <= s->capacity)
> > + return 0;
> > +
> > + works = krealloc_array(s->works, new_capacity, sizeof(s->works[0]),
> > + flags);
> > + if (IS_ERR(works))
> > + return PTR_ERR(works);
>
> The kmalloc function family returns NULL on failure, so you have to
> check for NULL here instead of IS_ERR(), and then return -ENOMEM
> instead of PTR_ERR().
Thanks, fixed.
> > + s->works = works;
> > +
> > + for (i = s->capacity; i < new_capacity; i++) {
> > + s->works[i] = kzalloc(sizeof(*s->works[i]), flags);
> > + if (IS_ERR(s->works[i])) {
>
> (again, kzalloc() returns NULL on failure)
Done.
> > + /*
> > + * Leave the object in a consistent state,
> > + * but return an error.
> > + */
> > + s->capacity = i;
> > + return PTR_ERR(s->works[i]);
> > + }
> > + }
> > + s->capacity = new_capacity;
> > + return 0;
> > +}
> [...]
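For reference, the three fixes discussed for this function (grow to
size + n, guard against overflow, check for NULL rather than IS_ERR())
can be modeled in userspace. This is only a sketch under assumptions:
`struct work_list`, `struct work_item` and `work_list_grow_by()` are
hypothetical stand-ins, realloc()/calloc() replace
krealloc_array()/kzalloc(), and returning -EOVERFLOW on overflow is an
illustrative choice, not necessarily what the next patch version does.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins for the structures in the patch. */
struct work_item {
	int payload;
};

struct work_list {
	struct work_item **items;
	size_t size;     /* slots currently in use */
	size_t capacity; /* slots currently allocated */
};

/*
 * Make room for n more items beyond the ones already in use.
 *
 * Returns 0 on success or if no growth was needed, -ENOMEM if an
 * allocation fails, and -EOVERFLOW (an assumption of this sketch) if
 * the requested capacity cannot be represented.
 */
int work_list_grow_by(struct work_list *s, size_t n)
{
	size_t i, new_capacity;
	struct work_item **items;

	assert(s != NULL);

	/* Overflow check for size + n. */
	if (n > SIZE_MAX - s->size)
		return -EOVERFLOW;
	new_capacity = s->size + n;

	/* Already enough spare slots: nothing to do. */
	if (new_capacity <= s->capacity)
		return 0;

	/* Overflow check for the byte count passed to realloc(). */
	if (new_capacity > SIZE_MAX / sizeof(items[0]))
		return -EOVERFLOW;

	/* Like the kmalloc family, realloc() reports failure as NULL. */
	items = realloc(s->items, new_capacity * sizeof(items[0]));
	if (!items)
		return -ENOMEM;
	s->items = items;

	for (i = s->capacity; i < new_capacity; i++) {
		s->items[i] = calloc(1, sizeof(*s->items[i]));
		if (!s->items[i]) {
			/* Leave the list consistent: slots below i are valid. */
			s->capacity = i;
			return -ENOMEM;
		}
	}
	s->capacity = new_capacity;
	return 0;
}
```

With one slot in use and a capacity of three, asking for two more slots
is a no-op, whereas growing by `s->capacity + n` would have
reallocated unnecessarily.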
> > +/*
> > + * tsync_works_free - free memory held by s and drop all task references
> > + */
> > +static void tsync_works_free(struct tsync_works *s)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < s->size; i++)
> > + put_task_struct(s->works[i]->task);
>
> You'll need a NULL check before calling put_task_struct(), since the
> task_work_add() failure path can NULL out ->task. (Alternatively you
> could leave the task pointer intact in the task_work_add() failure
> path, since task_work_add() only fails if the task is already
> PF_EXITING. The &work_exited marker which causes task_work_add() to
> fail is only put on the task work list when task_work_run() runs on a
> PF_EXITING task.)
Thanks for spotting this, this is correct! I added a NULL check.
> > + for (i = 0; i < s->capacity; i++)
> > + kfree(s->works[i]);
> > + kfree(s->works);
> > + s->works = NULL;
> > + s->size = 0;
> > + s->capacity = 0;
> > +}
> > +
> > +/*
> > + * restrict_sibling_threads - enables a Landlock policy for all sibling threads
> > + */
> > +static int restrict_sibling_threads(const struct cred *old_cred,
> > + const struct cred *new_cred)
> > +{
> > + int res;
> > + struct task_struct *thread, *caller;
> > + struct tsync_shared_context shared_ctx;
> > + struct tsync_works works = {};
> > + size_t newly_discovered_threads;
> > + bool found_more_threads;
> > + struct tsync_work *ctx;
> > +
> > + atomic_set(&shared_ctx.preparation_error, 0);
> > + init_completion(&shared_ctx.all_prepared);
> > + init_completion(&shared_ctx.ready_to_commit);
> > + atomic_set(&shared_ctx.num_unfinished, 0);
>
> I think num_unfinished should be initialized to 1 here and decremented
> later on, I think, similar to how num_preparing works. Though it only
> matters in the edge case where the first thread we send task work to
> immediately fails the memory allocation. (And then you can also remove
> that "if (works.size)" check before
> "wait_for_completion(&shared_ctx.all_finished)".)
Thank you, good catch!
The works.size check was inaccurate: when task_work_add() failed, it
did not count the number of task works that were actually scheduled,
but overestimated it. The scenario is a bit obscure, but initializing
num_unfinished to 1 is a more robust approach that rules out this
class of bugs.
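The "initialize to 1, decrement later" idea can be shown with a small
userspace model. This is a sketch only: `struct pending` and its
helpers are hypothetical names, C11 atomics stand in for the kernel's
atomic_t, and a boolean flag stands in for complete_all(). The
coordinator holds one reference of its own while it is still handing
out work, so a worker that finishes immediately cannot signal
completion too early.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a counter biased by one. */
struct pending {
	atomic_int count; /* scheduled works, plus 1 held by the coordinator */
	atomic_bool done; /* stand-in for complete_all() on a completion */
};

/* Start at 1: the coordinator's own reference. */
void pending_init(struct pending *p)
{
	assert(p != NULL);
	atomic_init(&p->count, 1);
	atomic_init(&p->done, false);
}

/* Account for one more scheduled work. */
void pending_get(struct pending *p)
{
	atomic_fetch_add(&p->count, 1);
}

/*
 * Drop one reference. Whoever drops the last one signals completion.
 * Returns true if this call was the last one.
 */
bool pending_put(struct pending *p)
{
	/* atomic_fetch_sub() returns the value before the subtraction. */
	if (atomic_fetch_sub(&p->count, 1) == 1) {
		atomic_store(&p->done, true);
		return true;
	}
	return false;
}
```

In this model the coordinator calls pending_put() once after it has
finished scheduling, and only needs to wait if that call returns
false. No separate "did we schedule anything?" check is required.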
> > + init_completion(&shared_ctx.all_finished);
> > + shared_ctx.old_cred = old_cred;
> > + shared_ctx.new_cred = new_cred;
> > +
> > + caller = current;
> [...]
> > + init_task_work(&ctx->work,
> > + restrict_one_thread_callback);
> > + res = task_work_add(thread, &ctx->work, TWA_SIGNAL);
> > + if (res) {
> > + /*
> > + * Remove the task from ctx so that we will
> > + * revisit the task at a later stage, if it
> > + * still exists.
> > + */
> > + put_task_struct_rcu_user(ctx->task);
>
> The complement to get_task_struct() is put_task_struct(), which I see
> you also used in tsync_works_free(). put_task_struct_rcu_user() is for
> a different, special type of task_struct reference.
Thanks, done.
> > + ctx->task = NULL;
> > +
> > + atomic_set(&shared_ctx.preparation_error, res);
>
> I think you don't want to set preparation_error here - that would
> cause the syscall to return -ESRCH if we happen to race with an
> exiting thread. Just remove that line - in the next iteration, we'll
> skip this thread even if it still exists, because it has PF_EXITING
> set by this point.
Thanks, that is correct, and I fixed it as you suggested. An exiting
thread is the only reason why task_work_add() can fail. In the
(perfectly valid) case where one of the sibling threads happens to
exit, we do not want the landlock_restrict_self() syscall to fail just
because of that.
> > + atomic_dec(&shared_ctx.num_preparing);
> > + atomic_dec(&shared_ctx.num_unfinished);
> > + }
> > + }
> > + rcu_read_unlock();
> > +
> > + /*
> > + * Decrement num_preparing for current, to undo that we
> > + * initialized it to 1 at the beginning of the inner loop.
> > + */
> > + if (atomic_dec_return(&shared_ctx.num_preparing) > 0)
> > + wait_for_completion(&shared_ctx.all_prepared);
>
> I'm sorry, because this will make the patch a little bit more
> complicated, but... I don't think you can use wait_for_completion()
> here. Consider the scenario where two userspace threads of the same
> process call this functionality (or a kernel subsystem that does
> something similar) simultaneously. Each thread will wait for the other
> indefinitely, and userspace won't even be able to resolve the deadlock
> by killing the processes.
> Similar issues would probably apply if, for example, GDB tried to
> attach to the process with bad timing - if GDB ptrace-stops another
> thread before you schedule task work for it, and then tries to
> ptrace-stop this thread, I think this thread could essentially be in a
> deadlock with GDB.
>
> You'll have to do something else here. I think the best solution would
> be to use wait_for_completion_interruptible() instead; then if that
> fails, tear down all the task work stuff that was already scheduled,
> and return with error -ERESTARTNOINTR. Something like (entirely
> untested):
>
> /* interruptible wait to avoid deadlocks while waiting for other tasks
> to enter our task work */
> if (wait_for_completion_interruptible(&shared_ctx.all_prepared)) {
> atomic_set(&shared_ctx.preparation_error, -ERESTARTNOINTR);
> for (int i=0; i<works.size; i++) {
> if (task_work_cancel(works.works[i]->task, &works.works[i]->work))
> if (atomic_dec_return(&shared_ctx.num_preparing))
> complete_all(&shared_ctx.all_prepared);
> }
> /* at this point we're only waiting for tasks that are already
> executing the task work */
> wait_for_completion(&shared_ctx.all_prepared);
> }
>
> Note that if the syscall returns -ERESTARTNOINTR, that won't be
> visible to userspace (except for debugging tools like strace/gdb); the
> kernel ensures that the syscall will transparently re-execute
> immediately. (It literally decrements the saved userspace instruction
> pointer by the size of a syscall instruction, so that when the kernel
> returns to userspace, the next instruction that executes will redo the
> syscall.) This allows us to break the deadlock without having to write
> any ugly retry logic or throwing userspace-visible errors.
Thank you again for catching this!
I used your suggestion, with the following (minor) differences:
1. Factored it out as a function (indentation level got too high...)
2. typo: call complete_all() only if atomic_dec_return() returns 0
3. Do the same barrier synchronization dance with num_unfinished/all_finished as well.
...and I also implemented a selftest for the case where
landlock_restrict_self() gets called by two sibling threads at the
same time.
I'll write down my reasoning why this works for reference:
The problem in V2 is that we can run into a deadlock in the case where
two threads call landlock_restrict_self() at the same time. In that
case, they'll both become uninterruptible at syscall entry and try to
schedule a task_work for each other. Then, they proceed to wait for
each other's task_work to execute, which never happens because neither
thread ever gets to run the task_work queued for it.
This is resolved by using an interruptible wait. With the
interruptible wait, the task can detect the condition where a signal
(or task work) comes in, execute that task_work and bail out of the
system call cleanly (we un-schedule the task_works for other threads
that are still pending and we abort all the task_works that have
already started to run by setting the shared_ctx.preparation_error).
After returning from the system call with -ERESTARTNOINTR, it gets
retried automatically to recover from the problem.
In all the other cases where we wait_for_completion() uninterruptibly,
we can reason that the wait will return, because these waits happen
under circumstances where we know that the task works we are waiting
for have all been started already.
I have reproduced the deadlock and verified the fixed implementation
with the selftest, by temporarily adding mdelay() calls and a lot of
logging in strategic places.
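The accounting in the cancellation path (point 2 above: complete only
when the counter reaches 0) can be modeled without any threads. This
is a single-threaded sketch with hypothetical names, not the kernel
code: `model_cancel()` mimics task_work_cancel() as it is used in the
quoted suggestion, succeeding only for works that have not started
running, and a boolean stands in for complete_all(&all_prepared).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: a work is still queued, running, or cancelled. */
enum work_state { WORK_QUEUED, WORK_RUNNING, WORK_CANCELLED };

struct model {
	int num_preparing; /* works that have not reported "prepared" yet */
	bool all_prepared; /* stand-in for complete_all(&all_prepared) */
};

/* Succeeds only while the work has not started running. */
bool model_cancel(enum work_state *w)
{
	if (*w != WORK_QUEUED)
		return false;
	*w = WORK_CANCELLED;
	return true;
}

/* One work is accounted for; the last one signals the completion. */
void model_dec_preparing(struct model *m)
{
	assert(m->num_preparing > 0);
	if (--m->num_preparing == 0)
		m->all_prepared = true;
}

/*
 * Cancel every work that is still queued. Returns the number of works
 * that are already running and therefore still have to be waited for.
 */
int model_cancel_queued(struct model *m, enum work_state *works, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (model_cancel(&works[i]))
			model_dec_preparing(m);
	}
	return m->num_preparing;
}
```

After the cancellation loop, only works that are already running
remain in the counter, which is why the subsequent uninterruptible
wait is expected to return.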
>
> > + } while (found_more_threads &&
> > + !atomic_read(&shared_ctx.preparation_error));
> > +
> > + /*
> > + * We now have all sibling threads blocking and in "prepared" state in
> > + * the task work. Ask all threads to commit.
> > + */
> > + complete_all(&shared_ctx.ready_to_commit);
> > +
> > + if (works.size)
> > + wait_for_completion(&shared_ctx.all_finished);
> > +
> > + tsync_works_free(&works);
> > +
> > + return atomic_read(&shared_ctx.preparation_error);
> > +}
> [...]
> > @@ -566,5 +987,13 @@ SYSCALL_DEFINE2(landlock_restrict_self, const int, ruleset_fd, const __u32,
> > new_llcred->domain_exec |= BIT(new_dom->num_layers - 1);
> > #endif /* CONFIG_AUDIT */
> >
> > + if (flags & LANDLOCK_RESTRICT_SELF_TSYNC) {
> > + res = restrict_sibling_threads(current_cred(), new_cred);
> > + if (res != 0) {
> > + abort_creds(new_cred);
> > + return res;
> > + }
> > + }
>
> Annoyingly, there is a special-case path above for the case where
> LANDLOCK_RESTRICT_SELF_LOG_SUBDOMAINS_OFF is set without actually
> applying any ruleset. In that case you won't reach this point, and so
> LANDLOCK_RESTRICT_SELF_LOG_SUBDOMAINS_OFF would only affect the
> current thread in that case. I doubt it'd be very noticeable, but
> still, it might be a good idea to rearrange things here a bit... maybe
> instead of the current `if (!ruleset) return commit_creds(new_cred);`,
> put some of the subsequent stuff in a `if (ruleset) {` block?
Thanks, I fixed that one as well.
—Günther