[PATCH v2 1/2] landlock: Multithreading support for landlock_restrict_self()
Jann Horn
jannh at google.com
Fri Oct 24 21:11:10 UTC 2025
On Wed, Oct 1, 2025 at 1:18 PM Günther Noack <gnoack at google.com> wrote:
> Introduce the LANDLOCK_RESTRICT_SELF_TSYNC flag. With this flag, a
> given Landlock ruleset is applied to all threads of the calling
> process, instead of only the current one.
>
> Without this flag, multithreaded userspace programs currently resort
> to using the nptl(7)/libpsx hack for multithreaded policy enforcement,
> which is also used by libcap and for setuid(2). Using this scheme,
> the threads of a process enforce the same Landlock ruleset, but the
> resulting Landlock domains are still separate, which makes a
> difference for Landlock's "scoped" access rights, where the domain
> identity and nesting are used. As a result, when using
> LANDLOCK_SCOPE_SIGNAL, signaling between sibling threads stops
> working. This is a problem for programming languages and frameworks
> which are inherently multithreaded (e.g. Go).
This looks good to me overall, though there are a couple of details to fix.
[...]
> +static inline void landlock_cred_copy(struct landlock_cred_security *dst,
> + const struct landlock_cred_security *src)
> +{
> + if (dst->domain)
> + landlock_put_ruleset(dst->domain);
> +
> + *dst = *src;
nit: I would add a short comment at the definition of struct
landlock_cred_security noting that this function memcpy's the entire
struct.
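Something like this, for example (exact wording up to you):

        struct landlock_cred_security {
                /*
                 * NOTE: landlock_cred_copy() copies this whole struct with a
                 * plain assignment; keep that in mind when adding members
                 * that need deep copying or extra refcounting.
                 */
                struct landlock_ruleset *domain;
                /* ... */
        };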
> +
> + if (dst->domain)
> + landlock_get_ruleset(src->domain);
> +}
[...]
> +/*
> + * tsync_works_grow_by - preallocates space for n more contexts in s
> + *
> + * Returns:
> + * -ENOMEM if the (re)allocation fails
> + * 0 if the allocation succeeds, partially succeeds, or no reallocation was needed
> + */
> +static int tsync_works_grow_by(struct tsync_works *s, size_t n, gfp_t flags)
> +{
> + int i;
> + size_t new_capacity = s->capacity + n;
(You only have to grow to `s->size + n` but I guess this works too.)
> + struct tsync_work **works;
> +
> + if (new_capacity <= s->capacity)
> + return 0;
> +
> + works = krealloc_array(s->works, new_capacity, sizeof(s->works[0]),
> + flags);
> + if (IS_ERR(works))
> + return PTR_ERR(works);
The kmalloc function family returns NULL on failure, so you have to
check for NULL here instead of IS_ERR(), and then return -ENOMEM
instead of PTR_ERR().
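I.e. roughly:

        works = krealloc_array(s->works, new_capacity, sizeof(s->works[0]),
                               flags);
        if (!works)
                return -ENOMEM;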
> +
> + s->works = works;
> +
> + for (i = s->capacity; i < new_capacity; i++) {
> + s->works[i] = kzalloc(sizeof(*s->works[i]), flags);
> + if (IS_ERR(s->works[i])) {
(again, kzalloc() returns NULL on failure)
> + /*
> + * Leave the object in a consistent state,
> + * but return an error.
> + */
> + s->capacity = i;
> + return PTR_ERR(s->works[i]);
> + }
> + }
> + s->capacity = new_capacity;
> + return 0;
> +}
[...]
> +/*
> + * tsync_works_free - free memory held by s and drop all task references
> + */
> +static void tsync_works_free(struct tsync_works *s)
> +{
> + int i;
> +
> + for (i = 0; i < s->size; i++)
> + put_task_struct(s->works[i]->task);
You'll need a NULL check before calling put_task_struct(), since the
task_work_add() failure path can NULL out ->task. (Alternatively you
could leave the task pointer intact in the task_work_add() failure
path, since task_work_add() only fails if the task is already
PF_EXITING. The &work_exited marker which causes task_work_add() to
fail is only put on the task work list when task_work_run() runs on a
PF_EXITING task.)
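With a NULL check, roughly:

        for (i = 0; i < s->size; i++)
                if (s->works[i]->task)
                        put_task_struct(s->works[i]->task);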
> + for (i = 0; i < s->capacity; i++)
> + kfree(s->works[i]);
> + kfree(s->works);
> + s->works = NULL;
> + s->size = 0;
> + s->capacity = 0;
> +}
> +
> +/*
> + * restrict_sibling_threads - enables a Landlock policy for all sibling threads
> + */
> +static int restrict_sibling_threads(const struct cred *old_cred,
> + const struct cred *new_cred)
> +{
> + int res;
> + struct task_struct *thread, *caller;
> + struct tsync_shared_context shared_ctx;
> + struct tsync_works works = {};
> + size_t newly_discovered_threads;
> + bool found_more_threads;
> + struct tsync_work *ctx;
> +
> + atomic_set(&shared_ctx.preparation_error, 0);
> + init_completion(&shared_ctx.all_prepared);
> + init_completion(&shared_ctx.ready_to_commit);
> + atomic_set(&shared_ctx.num_unfinished, 0);
I think num_unfinished should be initialized to 1 here and decremented
later on, similar to how num_preparing works. Though it only
matters in the edge case where the first thread we send task work to
immediately fails the memory allocation. (And then you can also remove
that "if (works.size)" check before
"wait_for_completion(&shared_ctx.all_finished)".)
> + init_completion(&shared_ctx.all_finished);
> + shared_ctx.old_cred = old_cred;
> + shared_ctx.new_cred = new_cred;
> +
> + caller = current;
[...]
> + init_task_work(&ctx->work,
> + restrict_one_thread_callback);
> + res = task_work_add(thread, &ctx->work, TWA_SIGNAL);
> + if (res) {
> + /*
> + * Remove the task from ctx so that we will
> + * revisit the task at a later stage, if it
> + * still exists.
> + */
> + put_task_struct_rcu_user(ctx->task);
The complement to get_task_struct() is put_task_struct(), which I see
you also used in tsync_works_free(). put_task_struct_rcu_user() is for
a different, special type of task_struct reference.
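I.e. just:

        put_task_struct(ctx->task);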
> + ctx->task = NULL;
> +
> + atomic_set(&shared_ctx.preparation_error, res);
I think you don't want to set preparation_error here - that would
cause the syscall to return -ESRCH if we happen to race with an
exiting thread. Just remove that line - in the next iteration, we'll
skip this thread even if it still exists, because it has PF_EXITING
set by this point.
> + atomic_dec(&shared_ctx.num_preparing);
> + atomic_dec(&shared_ctx.num_unfinished);
> + }
> + }
> + rcu_read_unlock();
> +
> + /*
> + * Decrement num_preparing for current, to undo that we
> + * initialized it to 1 at the beginning of the inner loop.
> + */
> + if (atomic_dec_return(&shared_ctx.num_preparing) > 0)
> + wait_for_completion(&shared_ctx.all_prepared);
I'm sorry, because this will make the patch a little bit more
complicated, but... I don't think you can use wait_for_completion()
here. Consider the scenario where two userspace threads of the same
process call this functionality (or a kernel subsystem that does
something similar) simultaneously. Each thread will wait for the other
indefinitely, and userspace won't even be able to resolve the deadlock
by killing the processes.
Similar issues would probably apply if, for example, GDB tried to
attach to the process with bad timing - if GDB ptrace-stops another
thread before you schedule task work for it, and then tries to
ptrace-stop this thread, I think this thread could essentially be in a
deadlock with GDB.
You'll have to do something else here. I think the best solution would
be to use wait_for_completion_interruptible() instead; then if that
fails, tear down all the task work stuff that was already scheduled,
and return with error -ERESTARTNOINTR. Something like (entirely
untested):
        /*
         * Interruptible wait to avoid deadlocks while waiting for other
         * tasks to enter our task work.
         */
        if (wait_for_completion_interruptible(&shared_ctx.all_prepared)) {
                atomic_set(&shared_ctx.preparation_error, -ERESTARTNOINTR);
                for (int i = 0; i < works.size; i++) {
                        if (task_work_cancel(works.works[i]->task,
                                             &works.works[i]->work)) {
                                if (atomic_dec_return(&shared_ctx.num_preparing) == 0)
                                        complete_all(&shared_ctx.all_prepared);
                        }
                }
                /*
                 * At this point we're only waiting for tasks that are
                 * already executing the task work.
                 */
                wait_for_completion(&shared_ctx.all_prepared);
        }
Note that if the syscall returns -ERESTARTNOINTR, that won't be
visible to userspace (except for debugging tools like strace/gdb); the
kernel ensures that the syscall will transparently re-execute
immediately. (It literally decrements the saved userspace instruction
pointer by the size of a syscall instruction, so that when the kernel
returns to userspace, the next instruction that executes will redo the
syscall.) This allows us to break the deadlock without having to write
any ugly retry logic or throwing userspace-visible errors.
> + } while (found_more_threads &&
> + !atomic_read(&shared_ctx.preparation_error));
> +
> + /*
> + * We now have all sibling threads blocking and in "prepared" state in
> + * the task work. Ask all threads to commit.
> + */
> + complete_all(&shared_ctx.ready_to_commit);
> +
> + if (works.size)
> + wait_for_completion(&shared_ctx.all_finished);
> +
> + tsync_works_free(&works);
> +
> + return atomic_read(&shared_ctx.preparation_error);
> +}
[...]
> @@ -566,5 +987,13 @@ SYSCALL_DEFINE2(landlock_restrict_self, const int, ruleset_fd, const __u32,
> new_llcred->domain_exec |= BIT(new_dom->num_layers - 1);
> #endif /* CONFIG_AUDIT */
>
> + if (flags & LANDLOCK_RESTRICT_SELF_TSYNC) {
> + res = restrict_sibling_threads(current_cred(), new_cred);
> + if (res != 0) {
> + abort_creds(new_cred);
> + return res;
> + }
> + }
Annoyingly, there is a special-case path above for the case where
LANDLOCK_RESTRICT_SELF_LOG_SUBDOMAINS_OFF is set without actually
applying any ruleset. In that case you won't reach this point, and so
LANDLOCK_RESTRICT_SELF_LOG_SUBDOMAINS_OFF would only affect the
current thread in that case. I doubt it'd be very noticeable, but
still, it might be a good idea to rearrange things here a bit... maybe
instead of the current `if (!ruleset) return commit_creds(new_cred);`,
put some of the subsequent stuff in an `if (ruleset) {` block?
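Something like this, maybe (untested sketch, details depend on the
surrounding code):

        if (ruleset) {
                /* ... create the new domain and update new_llcred ... */
        }

        if (flags & LANDLOCK_RESTRICT_SELF_TSYNC) {
                res = restrict_sibling_threads(current_cred(), new_cred);
                if (res) {
                        abort_creds(new_cred);
                        return res;
                }
        }

        return commit_creds(new_cred);

That way the TSYNC handling is also reached when no ruleset is being
applied.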