[syzbot] Monthly lsm report (Sep 2024)
Kent Overstreet
kent.overstreet at linux.dev
Sat Sep 28 00:08:43 UTC 2024
On Fri, Sep 27, 2024 at 02:18:10PM GMT, Roberto Sassu wrote:
> On Tue, 2024-09-24 at 13:53 +0200, Roberto Sassu wrote:
> > On Mon, 2024-09-23 at 08:06 -0400, Paul Moore wrote:
> > > On Mon, Sep 23, 2024 at 5:02 AM syzbot
> > > <syzbot+listfc277c7cb94932601d96 at syzkaller.appspotmail.com> wrote:
> > > >
> > > > Hello lsm maintainers/developers,
> > > >
> > > > This is a 31-day syzbot report for the lsm subsystem.
> > > > All related reports/information can be found at:
> > > > https://syzkaller.appspot.com/upstream/s/lsm
> > > >
> > > > During the period, 0 new issues were detected and 0 were fixed.
> > > > In total, 4 issues are still open and 27 have been fixed so far.
> > > >
> > > > Some of the still happening issues:
> > > >
> > > > Ref Crashes Repro Title
> > > > <1> 306 No INFO: task hung in process_measurement (2)
> > > > https://syzkaller.appspot.com/bug?extid=1de5a37cb85a2d536330
> > >
> > > Mimi, Roberto,
> > >
> > > Any chance this is related in any way to this report:
> > >
> > > https://lore.kernel.org/linux-security-module/CALAgD-4hkHVcCq2ycdwnA2hYDBMqijLUOfZgvf1WfFpU-8+42w@mail.gmail.com/
> >
> > I reproduced the latter, but I got a different result (the kernel
> > crashed in a different place).
> >
> > It seems to be a corruption case, while the former looks more like a
> > lock inversion issue. I will check further.
>
> + Kent Overstreet
>
> https://syzkaller.appspot.com/bug?extid=1de5a37cb85a2d536330
>
> It happens a few times per day, since commit 4a39ac5b7d62 (which is
> followed by a lot of merges). The bug was likely introduced there.
>
> In all recent reports, I noticed that there is always the following
> lock sequence:
>
> [ 291.584319][ T30] 5 locks held by syz.0.75/5970:
> [ 291.594487][ T30] #0: ffff888064066420 (sb_writers#25){.+.+}-{0:0}, at: mnt_want_write+0x3f/0x90
> [ 291.603984][ T30] #1: ffff88805d8b0148 (&sb->s_type->i_mutex_key#30){++++}-{3:3}, at: do_truncate+0x20c/0x310
> [ 291.614497][ T30] #2: ffff888054700a38 (&c->snapshot_create_lock){.+.+}-{3:3}, at: bch2_truncate+0x16d/0x2c0
> [ 291.624871][ T30] #3: ffff888054704398 (&c->btree_trans_barrier){.+.+}-{0:0}, at: __bch2_trans_get+0x7de/0xd20
> [ 291.635446][ T30] #4: ffff8880547266d0 (&c->gc_lock){.+.+}-{3:3}, at: bch2_btree_update_start+0x682/0x14e0
>
> IMA is stuck too, since it is waiting for the inode lock to be released:
>
> [ 291.645689][ T30] 1 lock held by syz.0.75/6010:
> [ 291.650622][ T30] #0: ffff88805d8b0148 (&sb->s_type->i_mutex_key#30){++++}-{3:3}, at: process_measurement+0x439/0x1fb0
>
> It seems that the super block is locked by someone else that is not
> able to unlock it. Maybe it is related to bch2_journal_reclaim_thread(),
> but I don't know for sure.
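>
> To make the dependency concrete, here is a minimal userspace analogue
> of the hang (purely illustrative, not kernel code; the names only
> mirror the functions in the traces above):
>
> #include <pthread.h>
> #include <stdio.h>
>
> static pthread_mutex_t inode_lock = PTHREAD_MUTEX_INITIALIZER;
> static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;
> static pthread_cond_t freelist_wait = PTHREAD_COND_INITIALIZER;
> static int free_buckets;	/* stays 0: the allocator never makes progress */
>
> static void *truncate_path(void *arg)	/* like do_truncate() -> bch2_truncate() */
> {
> 	(void)arg;
> 	pthread_mutex_lock(&inode_lock);
> 	pthread_mutex_lock(&alloc_lock);
> 	while (free_buckets == 0)	/* like blocking in the allocator */
> 		pthread_cond_wait(&freelist_wait, &alloc_lock);
> 	pthread_mutex_unlock(&alloc_lock);
> 	pthread_mutex_unlock(&inode_lock);
> 	return NULL;
> }
>
> static void *ima_path(void *arg)	/* like process_measurement() */
> {
> 	(void)arg;
> 	pthread_mutex_lock(&inode_lock);	/* waits forever: held above */
> 	puts("ima: got the inode lock");	/* never reached */
> 	pthread_mutex_unlock(&inode_lock);
> 	return NULL;
> }
>
> int main(void)
> {
> 	pthread_t a, b;
> 	pthread_create(&a, NULL, truncate_path, NULL);
> 	pthread_create(&b, NULL, ima_path, NULL);
> 	pthread_join(a, NULL);	/* hangs, like the "task hung" reports */
> 	pthread_join(b, NULL);
> 	return 0;
> }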
>
> Kent, do you have time to look at this report?
If you scroll back in the console log, you see this:
[ 230.372988][ T6772] Allocator stuck? Waited for 30 seconds
[ 230.373025][ T6772] Allocator debug:
[ 230.373039][ T6772] capacity 1536
[ 230.373051][ T6772] reserved 31232
[ 230.373064][ T6772] hidden 0
[ 230.373077][ T6772] btree 0
[ 230.373090][ T6772] data 0
[ 230.373102][ T6772] cached 0
[ 230.373114][ T6772] reserved 0
[ 230.373127][ T6772] online_reserved 768
[ 230.373140][ T6772] nr_inodes 0
[ 230.373152][ T6772]
[ 230.373164][ T6772] freelist_wait waiting
[ 230.373176][ T6772] open buckets allocated 1
[ 230.373189][ T6772] open buckets total 1024
[ 230.373202][ T6772] open_buckets_wait empty
[ 230.373214][ T6772] open_buckets_btree 0
[ 230.373227][ T6772] open_buckets_user 0
[ 230.373239][ T6772] btree reserve cache 0
[ 230.373251][ T6772]
[ 230.373262][ T6772] Dev 0:
[ 230.373274][ T6772] buckets sectors fragmented
[ 230.373288][ T6772] free 0 0 0
[ 230.373303][ T6772] sb 0 0 0
[ 230.373318][ T6772] journal 0 0 0
[ 230.373332][ T6772] btree 0 0 0
[ 230.373347][ T6772] user 0 0 0
[ 230.373361][ T6772] cached 0 0 0
[ 230.373375][ T6772] parity 0 0 0
[ 230.373390][ T6772] stripe 0 0 0
[ 230.373404][ T6772] need_gc_gens 0 0 0
[ 230.373418][ T6772] need_discard 0 0 0
[ 230.373432][ T6772] unstriped 0 0 0
[ 230.373446][ T6772] capacity 128
[ 230.373459][ T6772]
[ 230.373470][ T6772] reserves:
[ 230.373481][ T6772] stripe 60
[ 230.373494][ T6772] normal 58
[ 230.373506][ T6772] copygc 56
[ 230.373519][ T6772] btree 28
[ 230.373531][ T6772] btree_copygc 0
[ 230.373543][ T6772] reclaim 0
[ 230.373556][ T6772] interior_updates 0
[ 230.373569][ T6772]
[ 230.373580][ T6772] open buckets 0
[ 230.373592][ T6772] buckets to invalidate 0
[ 230.373605][ T6772]
[ 230.373616][ T6772] Copygc debug:
[ 230.373627][ T6772] running: 1
[ 230.373656][ T6772] copygc_wait: 0
[ 230.373675][ T6772] copygc_wait_at: 0
[ 230.373694][ T6772] Currently waiting for: 0 B
[ 230.373708][ T6772] Currently waiting since: 644 KiB
[ 230.373722][ T6772] Currently calculated wait: 0 B
[ 230.373735][ T6772]
This looks like a bug in bcachefs where the alloc counters didn't get
initialized correctly, which makes sense given that 6.11 landed the disk
accounting rewrite. I'll dig into it more as I have time.
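
Roughly, the failure mode I have in mind looks like this (a hedged
userspace sketch with made-up names, not actual bcachefs code): bucket
allocation is driven by the per-device accounting counters, so if those
never get filled in from the on-disk accounting at mount time, the
computed number of free buckets stays at zero and the allocator sleeps
on freelist_wait forever, which matches the splat above (nonzero
capacity, all bucket-state counters zero).

#include <stdint.h>
#include <stdio.h>

/* Made-up stand-in for the per-device accounting read back at mount. */
struct dev_usage_sketch {
	uint64_t capacity;		/* total buckets on the device */
	uint64_t free;			/* buckets currently free */
	uint64_t btree, user, cached;	/* buckets in other states */
};

static int have_free_bucket(const struct dev_usage_sketch *u)
{
	return u->free > 0;	/* the condition freelist_wait waits on */
}

int main(void)
{
	/* Bug analogue: the counters were never initialized from the
	 * on-disk accounting, so every bucket-state count reads 0 even
	 * though the device has capacity (cf. "capacity 128", "free 0"
	 * in the debug output above). */
	struct dev_usage_sketch u = { .capacity = 128 };

	if (!have_free_bucket(&u))
		printf("allocator would sleep on freelist_wait forever\n");
	return 0;
}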