[syzbot] Monthly lsm report (Sep 2024)
Kent Overstreet
kent.overstreet at linux.dev
Sat Sep 28 01:25:36 UTC 2024
On Fri, Sep 27, 2024 at 08:08:48PM GMT, Kent Overstreet wrote:
> On Fri, Sep 27, 2024 at 02:18:10PM GMT, Roberto Sassu wrote:
> > On Tue, 2024-09-24 at 13:53 +0200, Roberto Sassu wrote:
> > > On Mon, 2024-09-23 at 08:06 -0400, Paul Moore wrote:
> > > > On Mon, Sep 23, 2024 at 5:02 AM syzbot
> > > > <syzbot+listfc277c7cb94932601d96 at syzkaller.appspotmail.com> wrote:
> > > > >
> > > > > Hello lsm maintainers/developers,
> > > > >
> > > > > This is a 31-day syzbot report for the lsm subsystem.
> > > > > All related reports/information can be found at:
> > > > > https://syzkaller.appspot.com/upstream/s/lsm
> > > > >
> > > > > During the period, 0 new issues were detected and 0 were fixed.
> > > > > In total, 4 issues are still open and 27 have been fixed so far.
> > > > >
> > > > > Some of the still happening issues:
> > > > >
> > > > > Ref Crashes Repro Title
> > > > > <1> 306 No INFO: task hung in process_measurement (2)
> > > > > https://syzkaller.appspot.com/bug?extid=1de5a37cb85a2d536330
> > > >
> > > > Mimi, Roberto,
> > > >
> > > > Any chance this is this related in any way to this report:
> > > >
> > > > https://lore.kernel.org/linux-security-module/CALAgD-4hkHVcCq2ycdwnA2hYDBMqijLUOfZgvf1WfFpU-8+42w@mail.gmail.com/
> > >
> > > I reproduced the latter, but got a different result (the kernel
> > > crashed in a different place).
> > >
> > > It seems to be a corruption case, while the former looks more like a
> > > lock inversion issue. Will check more.
> >
> > + Kent Overstreet
> >
> > https://syzkaller.appspot.com/bug?extid=1de5a37cb85a2d536330
> >
> > It happens a few times per day, since commit 4a39ac5b7d62 (which is
> > followed by a lot of merges). The bug was likely introduced there.
> >
> > In all recent reports, I noticed that there is always the following
> > lock sequence:
> >
> > [ 291.584319][ T30] 5 locks held by syz.0.75/5970:
> > [ 291.594487][ T30] #0: ffff888064066420 (sb_writers#25){.+.+}-{0:0}, at: mnt_want_write+0x3f/0x90
> > [ 291.603984][ T30] #1: ffff88805d8b0148 (&sb->s_type->i_mutex_key#30){++++}-{3:3}, at: do_truncate+0x20c/0x310
> > [ 291.614497][ T30] #2: ffff888054700a38 (&c->snapshot_create_lock){.+.+}-{3:3}, at: bch2_truncate+0x16d/0x2c0
> > [ 291.624871][ T30] #3: ffff888054704398 (&c->btree_trans_barrier){.+.+}-{0:0}, at: __bch2_trans_get+0x7de/0xd20
> > [ 291.635446][ T30] #4: ffff8880547266d0 (&c->gc_lock){.+.+}-{3:3}, at: bch2_btree_update_start+0x682/0x14e0
> >
> > IMA is stuck too, since it is waiting for the inode lock to be released:
> >
> > [ 291.645689][ T30] 1 lock held by syz.0.75/6010:
> > [ 291.650622][ T30] #0: ffff88805d8b0148 (&sb->s_type->i_mutex_key#30){++++}-{3:3}, at: process_measurement+0x439/0x1fb0
> >
> > It seems that the super block is locked by someone else, who is then
> > unable to unlock it. Maybe it is related to
> > bch2_journal_reclaim_thread(), but I don't know for sure.
> >
> > Kent, do you have time to look at this report?
>
> If you scroll back in the console log, you see this:
>
> [ 230.372988][ T6772] Allocator stuck? Waited for 30 seconds
> [ 230.373025][ T6772] Allocator debug:
> [ 230.373039][ T6772] capacity 1536
> [ 230.373051][ T6772] reserved 31232
> [ 230.373064][ T6772] hidden 0
> [ 230.373077][ T6772] btree 0
> [ 230.373090][ T6772] data 0
> [ 230.373102][ T6772] cached 0
> [ 230.373114][ T6772] reserved 0
> [ 230.373127][ T6772] online_reserved 768
> [ 230.373140][ T6772] nr_inodes 0
> [ 230.373152][ T6772]
> [ 230.373164][ T6772] freelist_wait waiting
> [ 230.373176][ T6772] open buckets allocated 1
> [ 230.373189][ T6772] open buckets total 1024
> [ 230.373202][ T6772] open_buckets_wait empty
> [ 230.373214][ T6772] open_buckets_btree 0
> [ 230.373227][ T6772] open_buckets_user 0
> [ 230.373239][ T6772] btree reserve cache 0
> [ 230.373251][ T6772]
> [ 230.373262][ T6772] Dev 0:
> [ 230.373274][ T6772] buckets sectors fragmented
> [ 230.373288][ T6772] free 0 0 0
> [ 230.373303][ T6772] sb 0 0 0
> [ 230.373318][ T6772] journal 0 0 0
> [ 230.373332][ T6772] btree 0 0 0
> [ 230.373347][ T6772] user 0 0 0
> [ 230.373361][ T6772] cached 0 0 0
> [ 230.373375][ T6772] parity 0 0 0
> [ 230.373390][ T6772] stripe 0 0 0
> [ 230.373404][ T6772] need_gc_gens 0 0 0
> [ 230.373418][ T6772] need_discard 0 0 0
> [ 230.373432][ T6772] unstriped 0 0 0
> [ 230.373446][ T6772] capacity 128
> [ 230.373459][ T6772]
> [ 230.373470][ T6772] reserves:
> [ 230.373481][ T6772] stripe 60
> [ 230.373494][ T6772] normal 58
> [ 230.373506][ T6772] copygc 56
> [ 230.373519][ T6772] btree 28
> [ 230.373531][ T6772] btree_copygc 0
> [ 230.373543][ T6772] reclaim 0
> [ 230.373556][ T6772] interior_updates 0
> [ 230.373569][ T6772]
> [ 230.373580][ T6772] open buckets 0
> [ 230.373592][ T6772] buckets to invalidate 0
> [ 230.373605][ T6772]
> [ 230.373616][ T6772] Copygc debug:
> [ 230.373627][ T6772] running: 1
> [ 230.373656][ T6772] copygc_wait: 0
> [ 230.373675][ T6772] copygc_wait_at: 0
> [ 230.373694][ T6772] Currently waiting for: 0 B
> [ 230.373708][ T6772] Currently waiting since: 644 KiB
> [ 230.373722][ T6772] Currently calculated wait: 0 B
> [ 230.373735][ T6772]
>
> this looks like a bug in bcachefs where alloc counters didn't get
> initialized correctly, which makes sense given that 6.11 landed the disk
> accounting rewrite. Will dig more as I have time.
And looking further, I don't see anything in the console log from when
bcachefs actually mounted (???), which means I don't think I have enough
to go on. It's clearly an upgrade path issue - we didn't run
check_allocations as is required when upgrading to 1.11 - but it's not
reproducing for me when I run tests with old tools.
Can we get some more information about the syzbot reproducer? The exact
tools version, format command, and mount command.