[PATCH 0/2 v2] remove PF_MEMALLOC_NORECLAIM
Dave Chinner
david at fromorbit.com
Wed Sep 4 22:34:39 UTC 2024
On Wed, Sep 04, 2024 at 02:03:13PM -0400, Kent Overstreet wrote:
> On Wed, Sep 04, 2024 at 06:46:00PM GMT, Michal Hocko wrote:
> > On Wed 04-09-24 12:05:56, Kent Overstreet wrote:
> > > But it seems to me that the limit should be lower if you're on e.g. a 2
> > > GB machine (not failing with a warning, just failing immediately rather
> > > than oom killing a bunch of stuff first) - and it's going to need to be
> > > raised above INT_MAX as large memory machines keep growing; I keep
> > > hitting it in the bcachefs fsck code.
> >
> > Do we have an actual use case that would require more than a couple
> > of MB? The amount of memory wouldn't play any actual role then.
>
> Which "amount of memory?" - not parsing that.
>
> For large allocations in bcachefs: in journal replay we read all the
> keys in the journal, and then we create a big flat array with references
> to all of those keys to sort and dedup them.
>
> We haven't hit the INT_MAX size limit there yet, but filesystem sizes
> being what they are, we will soon. I've heard of users with 150 TB
> filesystems, and once the fsck scalability issues are sorted we'll be
> aiming for petabytes. The number of dirty keys in the journal scales
> more with system memory, but I'm leasing machines right now with a
> quarter terabyte of RAM.
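For concreteness, the sort-and-dedup pass described above is
conceptually just the following. This is an illustrative userspace
sketch, not bcachefs code; 'struct jkey', its fields and the comparison
rule are hypothetical stand-ins:

#include <stdlib.h>

struct jkey {
	unsigned long long	pos;	/* position the key applies to */
	unsigned long long	seq;	/* journal sequence it came from */
	/* ... key payload ... */
};

/* Sort by position; for equal positions the newest journal entry sorts first. */
static int jkey_ref_cmp(const void *_l, const void *_r)
{
	const struct jkey *l = *(const struct jkey **)_l;
	const struct jkey *r = *(const struct jkey **)_r;

	if (l->pos != r->pos)
		return l->pos < r->pos ? -1 : 1;
	return l->seq > r->seq ? -1 : l->seq < r->seq ? 1 : 0;
}

/*
 * refs[] is the big flat array of references to every key read from the
 * journal; sort it and keep only the newest key for each position.
 */
static size_t sort_and_dedup(struct jkey **refs, size_t nr)
{
	size_t i, out = 0;

	if (!nr)
		return 0;

	qsort(refs, nr, sizeof(*refs), jkey_ref_cmp);

	for (i = 0; i < nr; i++)
		if (!out || refs[out - 1]->pos != refs[i]->pos)
			refs[out++] = refs[i];

	return out;
}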
I've seen xfs_repair require a couple of TB of RAM to repair
metadata-heavy filesystems of relatively small size (sub-20TB).
Once you get above a few hundred GB of metadata in the filesystem,
the fsck cross-reference data set size can easily run into the TBs.
So 256GB might *seem* like a lot of memory, but we were seeing
xfs_repair exceed that amount of RAM for metadata-heavy filesystems
at least a decade ago...
Indeed, we recently heard about a 6TB filesystem with 15 *billion*
hardlinks in it. The cross-reference for resolving all those
hardlinks would require somewhere on the order of 1.5TB of RAM to
hold. The only way to reliably handle random access data sets this
large is with pageable memory....
> Another more pressing one is the extents -> backpointers and
> backpointers -> extents passes of fsck; we do a linear scan through one
> btree checking references to another btree. For the btree we're checking
> references to, the lookups are random, so we need to cache and pin the
> entire btree in RAM if possible, or, if not, whatever will fit, and run
> in multiple passes.
>
> This is the #1 scalability issue hitting a number of users right now, so
> I may need to rewrite it to pull backpointers into an eytzinger array
> and do our random lookups for backpointers on that - but that will be
> "the biggest vmalloc array we can possible allocate", so the INT_MAX
> size limit is clearly an issue there...
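(Aside: the eytzinger array mentioned above stores a sorted set in
breadth-first / heap order, 1-indexed, so binary search walks
contiguous cache lines near the root. bcachefs carries its own
implementation; the lookup below is only a minimal, hypothetical
sketch of the idea, not that code.)

#include <stddef.h>

/*
 * Search @nr keys laid out in eytzinger (BFS) order in tree[1..nr].
 * Children of node i live at 2i and 2i + 1.  Returns the index of
 * @search, or 0 if it is not present.
 */
static size_t eytzinger_search(const unsigned long long *tree, size_t nr,
			       unsigned long long search)
{
	size_t i = 1;

	while (i <= nr) {
		if (tree[i] == search)
			return i;
		i = 2 * i + (tree[i] < search);
	}

	return 0;
}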
Given my above comments, I think you are approaching this problem
the wrong way. It is known that the data set can exceed physical
kernel memory size, hence it needs to be swappable. That way users
can extend the kernel memory capacity via swapfiles when bcachefs
fsck needs more memory than the system has physical RAM.
This is a problem Darrick had to address for the XFS online repair
code - we've known for a long time that repair needs to hold a data
set larger than physical memory to complete successfully. Hence for
online repair we needed a mechanism that provided us with pageable
kernel memory. vmalloc() is not an option - it has hard size limits
(both API based and physical capacity based).
Hence Darrick designed and implemented pageable, shmem-backed memory
files (xfiles) to hold these data sets. As a result, the size limit
of the online repair data set is physical RAM + swap space, the same
as it is for offline repair. You can find the xfile code in
fs/xfs/scrub/xfile.[ch].
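The core idea is small: back the data set with an anonymous shmem
file so cold pages can be written to swap, and copy records in and
out through the page cache. A heavily simplified sketch of that idea
follows; it is not the actual xfile implementation (which handles
errors, partial pages and the write path properly), just the shape of
the approach:

#include <linux/shmem_fs.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>
#include <linux/minmax.h>
#include <linux/mm.h>
#include <linux/err.h>
#include <linux/file.h>
#include <linux/string.h>

struct pageable_buf {
	struct file	*file;		/* anonymous shmem file, swappable */
};

static int pageable_buf_create(const char *name, loff_t size,
			       struct pageable_buf *buf)
{
	struct file *file = shmem_kernel_file_setup(name, size, VM_NORESERVE);

	if (IS_ERR(file))
		return PTR_ERR(file);
	buf->file = file;
	return 0;
}

/* Copy @len bytes at @pos out of the buffer, faulting pages back in as needed. */
static int pageable_buf_read(struct pageable_buf *buf, void *dst,
			     size_t len, loff_t pos)
{
	struct address_space *mapping = buf->file->f_mapping;

	while (len > 0) {
		unsigned int off = offset_in_page(pos);
		unsigned int count = min_t(size_t, len, PAGE_SIZE - off);
		struct page *page;
		void *kaddr;

		page = shmem_read_mapping_page(mapping, pos >> PAGE_SHIFT);
		if (IS_ERR(page))
			return PTR_ERR(page);

		kaddr = kmap_local_page(page);
		memcpy(dst, kaddr + off, count);
		kunmap_local(kaddr);
		put_page(page);

		dst += count;
		pos += count;
		len -= count;
	}
	return 0;
}

static void pageable_buf_destroy(struct pageable_buf *buf)
{
	fput(buf->file);
}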
Support for large, sortable arrays of fixed size records built on
xfiles can be found in xfarray.[ch], and blob storage in
xfblob.[ch].
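To give a feel for it, building and sorting a record set on an
xfarray looks roughly like the sketch below. The record type is made
up for illustration, and the xfarray function names and signatures
here are paraphrased from memory of fs/xfs/scrub/xfarray.h and may
not match the current kernel exactly; treat the in-tree header as
authoritative.

/* Assumes it is built inside fs/xfs/scrub/ with its usual includes. */

struct link_rec {
	unsigned long long	ino;
	unsigned long long	parent_ino;
	unsigned int		links_seen;
};

static int link_rec_cmp(const void *a, const void *b)
{
	const struct link_rec *l = a;
	const struct link_rec *r = b;

	return l->ino < r->ino ? -1 : l->ino > r->ino ? 1 : 0;
}

static int build_and_check_link_index(unsigned long long max_records)
{
	struct xfarray *array;
	struct link_rec rec;
	uint64_t i;
	int error;

	/* Records live in a pageable, shmem-backed xfile, not kmalloc memory. */
	error = xfarray_create("link records", max_records,
			       sizeof(struct link_rec), &array);
	if (error)
		return error;

	/* ... xfarray_append(array, &rec) for every link discovered ... */

	error = xfarray_sort(array, link_rec_cmp, 0);
	if (error)
		goto out;

	for (i = 0; i < xfarray_length(array); i++) {
		error = xfarray_load(array, i, &rec);
		if (error)
			goto out;
		/* ... cross-reference rec against the on-disk inode ... */
	}
out:
	xfarray_destroy(array);
	return error;
}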
vmalloc() is really not a good solution for holding arbitrarily sized
data sets in kernel memory....
-Dave.
--
Dave Chinner
david at fromorbit.com