[syzbot] [kernel?] INFO: task hung in restrict_one_thread_callback

Günther Noack gnoack3000 at gmail.com
Mon Feb 23 09:42:45 UTC 2026


On Sat, Feb 21, 2026 at 02:19:53PM +0100, Günther Noack wrote:
> OK, I think I understand now.  Our existing recovery code for this
> conflict is this:
> 
> /*
>  * Decrement num_preparing for current, to undo that we initialized it
>  * to 1 a few lines above.
>  */
> if (atomic_dec_return(&shared_ctx.num_preparing) > 0) {
> 	if (wait_for_completion_interruptible(
> 		    &shared_ctx.all_prepared)) {
> 		/* In case of interruption, we need to retry the system call. */
> 		atomic_set(&shared_ctx.preparation_error,
> 			   -ERESTARTNOINTR);
> 
> 		/*
> 		 * Cancel task works for tasks that did not start running yet,
> 		 * and decrement all_prepared and num_unfinished accordingly.
> 		 */
> 		cancel_tsync_works(&works, &shared_ctx);
> 
> 		/*
> 		 * The remaining task works have started running, so waiting for
> 		 * their completion will finish.
> 		 */
> 		wait_for_completion(&shared_ctx.all_prepared);
> 	}
> }
> 
> When I wrote this, I assumed, as the last comment states, that the
> task works which we could not cancel are already running.
> 
> I was wrong there, because I had misunderstood task_work_run().  When
> the task works get run there, it first *atomically dequeues the entire
> queue of scheduled task works*, and then runs them sequentially.
> 
> That is why, if we have one task work that belongs to the first
> landlock_restrict_self() call and one which belongs to the other, the
> task work which is scheduled later (a) can no longer be dequeued with
> cancel_tsync_works(), and (b) also has not started running yet.
> 
> Now the only thing needed to produce the deadlock is a pair of
> threads on which the task works for the restriction calls have been
> scheduled in different orders.  When the two
> landlock_restrict_self() calls end up in the recovery path quoted
> above, each waits for one of its own task works to run, which is
> blocked from running by a task work that was scheduled before it and
> does not finish either.
> 
> (Just pasting a brain dump here to save you some time hunting for the
> root cause. I don't know the best solution yet either.)

Let me propose the following fixes:

1. Immediate fix for that specific issue
----------------------------------------

Proposal:
* Remove the wait_for_completion(&shared_ctx.all_prepared)
  call in the code snippet above.
* Rewrite surrounding comments: Be clear about the fact that
  cancel_tsync_works() is an opportunistic improvement, but we don't
  have a guarantee at all that it cancels any of the enqueued task
  works (because task_work_run might already have popped them off).

This removes the hold-and-wait dependency cycle between the threads,
which produces the observed deadlock.  Shutdown then happens by
exiting the main loop (which we already do without that wait, but we
might also "break" to be explicit).

I think that this fix or an equivalent one is needed here, because
either way, our assumptions in the quoted code above were wrong.


2. Can we reason constructively about correctness?
--------------------------------------------------

The remaining question: if, on the shutdown path, we cannot actually
remove all of the enqueued task works, under what circumstances are we
even able to interrupt and return from the landlock_restrict_self()
system call?

2.1 For n competing restrict_self calls, n-1 of them need to get interrupted
----------------------------------------------------------------------------

To answer this, consider a multithreaded process with threads named
"red", "green" and "blue" and many additional threads: When "red",
"green" and "blue" enforce landlock_restrict_self() concurrently, due
to differing iteration order, we might end up enqueueing the task
works on other threads in all of the following combinations:

  t0:  R G B  <- front of queue
  t1:  R B G
  t2:  G R B
  t3:  G B R
  t4:  B R G
  t5:  B G R

In this configuration, for any of the landlock_restrict_self() system
calls to return at all (successfully or unsuccessfully), at least two
of the calling threads must receive an interrupt, so that their
enqueued task works disappear from the fronts of the queues.  Assuming
those are green and blue, we get:

  t0:  R      <- front of queue
  t1:  R
  t2:  G R
  t3:  G B R
  t4:  B R
  t5:  B G R

(This works because after the patch above, all of the enqueued G and B
works finish even if there are remaining G and B works that are still
blocked by an "R" entry.)

Now, "R" is in the front of the queue, and the
landlock_restrict_self() call for the red thread can finish normally,
even without it being interrupted.

Once the "R" task works are done as well, the remaining G and B works
can run and finish as well.

This scheme generalizes: If we have n competing
landlock_restrict_self() calls, then in worst case, at least n-1 of
these system calls need to be interrupted so that they can all
terminate.

2.2 Can we guarantee that two system calls get interrupted?
-----------------------------------------------------------

In the case of competing landlock_restrict_self() calls, I think it is
possible that not all relevant sibling threads get a task work
enqueued.  The scenario is one where a "red" and a "green" thread call
landlock_restrict_self():

  (a set of additional threads)
  t0: task_works: R G
  t1: task_works: G R
  tR: red thread
  tG: green thread

In the red thread, the following happens:
 * Under RCU, count the total number of threads => get a number that
   is too low
 * Allocate space for that number of task works
 * Under RCU:
   * Enqueue "R" into t0 and t1
   * Enqueue "R" for some of the "additional threads"
   * But we do not have enough pre-allocated space left to enqueue "R"
     for the green thread tG.

The same thing happens in the green thread as well.

The result is that we still have a deadlock between t0 and t1, but
neither the red nor the green thread gets interrupted, so neither can
resolve it.

(FWIW, you could resolve it from the outside by sending a signal to
the red or green thread manually, but it is not guaranteed to happen
on its own.)

Caveat: I am making pessimistic assumptions about the iteration order
of the task list here, and I am assuming that the number of
"additional threads" fluctuates during the competing enforcement, so
that the enforcing threads underestimate the space needed for the
memory pre-allocation.

2.3 Possible resolutions
------------------------

* We could try to interrupt all sibling threads during the teardown,
  to fix the issue discussed in 2.2. (Downside: Complicated, more
  expensive)
* The reason why landlock_restrict_self() can't return early is that
  it needs to wait until all task works are done before it can free
  their memory.  Alternatively, we could make the task works take
  ownership of these memory structures (refcounting the shared_ctx).
  (Downside: the memory used is no longer linear in the number of
  threads.)

Side remark: in testing, I had the impression that competing
landlock_restrict_self() calls can go into a retry loop for a while,
in which all of them keep getting interrupted; in a debug build, where
the Syzkaller test prints a line for each attempt, it sometimes hung
for seconds and *then* resolved itself again.

3. Conclusion
-------------

I would prefer a final solution that does not require deadlock
reasoning at that level and can be done in a simpler way.  I therefore
propose to do what Ding Yihan suggested, and what we had also
discussed previously in the code review:

* Let's serialize the landlock_restrict_self()-with-TSYNC operations
  through the cred_guard_mutex.

This will resolve the issue where competing landlock_restrict_self()
calls with TSYNC can deadlock.  It will also remove the jittery
behavior for that worst case where the conflict is resolved through
retry.


So in my mind, we need both patches:

 * The fix to the cleanup path from 1. above, to make interruption
   work more reliably and to correct the misunderstandings in the
   comments.
 * cred_guard_mutex to serialize the TSYNC invocations.

Please let me know what you think.

Thanks,
–Günther
