[PATCH v3 2/2] vfs: avoid duplicating creds in faccessat if possible

Sat Mar 4 20:31:43 UTC 2023

On 3/4/23, Mateusz Guzik <mjguzik at gmail.com> wrote:
> On 3/3/23, Linus Torvalds <torvalds at linux-foundation.org> wrote:
>> On Fri, Mar 3, 2023 at 12:39 PM Mateusz Guzik <mjguzik at gmail.com> wrote:
>>>
>>> I think there is a systemic problem which comes with the kzalloc API
>>
>> Well, it's not necessarily the API that is bad, but the implementation.
>>
>> We could easily make kzalloc() with a constant size just expand to
>> kmalloc+memset, and get the behavior you want.
>>
>> We already do magical things for "find the right slab bucket" part of
>> kmalloc too for constant sizes. It's changed over the years, but that
>> policy goes back a long long time. See
>>
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=95203fe78007f9ab3aebb96606473ae18c00a5a8
>>
>> from the BK history tree.
>>
>> Exactly because some things are worth optimizing for when the size is
>> known at compile time.
>>
>> Maybe just extending kzalloc() similarly? Trivial and entirely untested
>> patch:
>>
>>    --- a/include/linux/slab.h
>>    +++ b/include/linux/slab.h
>>    @@ -717,6 +717,12 @@ static inline void *kmem_cache_zalloc(struct
>> kmem_cache *k, gfp_t flags)
>>      */
>>     static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags)
>>     {
>>    +    if (__builtin_constant_p(size)) {
>>    +            void *ret = kmalloc(size, flags);
>>    +            if (ret)
>>    +                    memset(ret, 0, size);
>>    +            return ret;
>>    +    }
>>         return kmalloc(size, flags | __GFP_ZERO);
>>     }
>>
>
> So I played with this and have a rather nasty summary. Bullet points:
> 1. patched kzalloc does not reduce memsets calls during kernel build
> 2. patched kmem_cache_zalloc_ptr + 2 consumers converted *does* drop
> it significantly (36150671 -> 14414454)
> 3. ... inline memset generated by gcc sucks by resorting to rep stosq
> around 48 bytes
> 4. memsets not sorted out have sizes not known at compilation time and
> are not necessarily perf bugs on their own [read: would benefit from
> faster memset]
>
> Onto the the meat:
>
> I patched the kernel with a slightly tweaked version of the above:
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 45af70315a94..7abb5490690f 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -717,6 +717,12 @@ static inline void *kmem_cache_zalloc(struct
> kmem_cache *k, gfp_t flags)
>   */
>  static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags)
>  {
> +       if (__builtin_constant_p(size)) {
> +               void *ret = kmalloc(size, flags);
> +               if (likely(ret))
> +                       memset(ret, 0, size);
> +               return ret;
> +       }
>         return kmalloc(size, flags | __GFP_ZERO);
>  }
>
> and verified it indeed zeroes inline:
>
> void kztest(void)
> {
>         void *ptr;
>
>         ptr = kzalloc(32, GFP_KERNEL);
>         if (unlikely(!ptr))
>                 return;
>         memsettest_rec(ptr);
> }
>
> $ objdump --disassemble=kztest vmlinux
> [snip]
> call   ffffffff8135e130 <kmalloc_trace>
> test   %rax,%rax
> je     ffffffff81447d5f <kztest+0x4f>
> movq   $0x0,(%rax)
> mov    %rax,%rdi
> movq   $0x0,0x8(%rax)
> movq   $0x0,0x10(%rax)
> movq   $0x0,0x18(%rax)
> call   ffffffff81454060 <memsettest_rec>
> [snip]
>
> This did *NOT* lead to reduction of memset calls when building the kernel.
>
> I verified few cases by hand, it is all either kmem_cache_zalloc or
> explicitly added memsets with sizes not known at compilation time.
>
> Two most frequent callers:
> @[
>     memset+5
>     __alloc_file+40
>     alloc_empty_file+73
>     path_openat+77
>     do_filp_open+182
>     do_sys_openat2+159
>     __x64_sys_openat+89
>     do_syscall_64+93
>     entry_SYSCALL_64_after_hwframe+114
> ]: 11028994
> @[
>     memset+5
>     security_file_alloc+45
>     __alloc_file+89
>     alloc_empty_file+73
>     path_openat+77
>     do_filp_open+182
>     do_sys_openat2+159
>     __x64_sys_openat+89
>     do_syscall_64+93
>     entry_SYSCALL_64_after_hwframe+114
> ]: 11028994
>
> My wip addition is:
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 45af70315a94..12b5b02ef3d3 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -710,6 +710,17 @@ static inline void *kmem_cache_zalloc(struct
> kmem_cache *k, gfp_t flags)
>         return kmem_cache_alloc(k, flags | __GFP_ZERO);
>  }
>
> +#define kmem_cache_zalloc_ptr(k, f, retp) ({                           \
> +       __typeof(retp) _retp = kmem_cache_alloc(k, f);                  \
> +       bool _rv = false;                                               \
> +       retp = _retp;                                                   \
> +       if (likely(_retp)) {                                            \
> +               memset(_retp, 0, sizeof(*_retp));                       \
> +               _rv = true;                                             \
> +       }                                                               \
> +       _rv;                                                            \
> +})
> +
> diff --git a/security/security.c b/security/security.c
> index cf6cc576736f..0f769ede0e54 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -600,8 +600,7 @@ static int lsm_file_alloc(struct file *file)
>                 return 0;
>         }
>
> -       file->f_security = kmem_cache_zalloc(lsm_file_cache, GFP_KERNEL);
> -       if (file->f_security == NULL)
> +       if (!kmem_cache_zalloc_ptr(lsm_file_cache, GFP_KERNEL,
> file->f_security))
>                 return -ENOMEM;
>         return 0;
>  }

This one is actually buggy -- f_security is a void * pointer and
sizeof(*file->f_security) returns just 1. The macro does not have any
safety belts against that -- it should probably check for void * at
compilation time and get a BUG_ON for runtime mismatch. Does not
affect the idea though.

Good news: gcc provides a lot of control as to how it inlines string
ops, most notably:
       -mstringop-strategy=alg
           Override the internal decision heuristic for the particular
algorithm to use for inlining string
           operations.  The allowed values for alg are:

           rep_byte
           rep_4byte
           rep_8byte
               Expand using i386 "rep" prefix of the specified size.

           byte_loop
           loop
           unrolled_loop
               Expand into an inline loop.

           libcall
               Always use a library call.

I'm going to play with it and send something more presentable.

> diff --git a/fs/file_table.c b/fs/file_table.c
> index 372653b92617..8e0dabf9530e 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -136,8 +136,7 @@ static struct file *__alloc_file(int flags, const
> struct cred *cred)
>         struct file *f;
>         int error;
>
> -       f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL);
> -       if (unlikely(!f))
> +       if (!kmem_cache_zalloc_ptr(filp_cachep, GFP_KERNEL, f))
>                 return ERR_PTR(-ENOMEM);
>
>         f->f_cred = get_cred(cred);
>
> As mentioned above it cuts total calls in more than half.
>
> The problem is it is it rolls with rep stosq way too easily, partially
> defeating the point of inlining anything. clang does not have this
> problem.
>
> Take a look at __alloc_file:
> [snip]
> mov    0x19cab05(%rip),%rdi        # ffffffff82df4318 <filp_cachep>
> call   ffffffff813dd610 <kmem_cache_alloc>
> test   %rax,%rax
> je     ffffffff814298b7 <__alloc_file+0xc7>
> mov    %rax,%r12
> mov    $0x1d,%ecx
> xor    %eax,%eax
> mov    %r12,%rdi
> rep stos %rax,%es:(%rdi)
> [/snip]
>
> Here is a sample consumer which can't help but have a variable size --
> select, used by gmake:
> @[
>     memset+5
>     do_pselect.constprop.0+202
>     __x64_sys_pselect6+101
>     do_syscall_64+93
>     entry_SYSCALL_64_after_hwframe+114
> ]: 13160
>
> In conclusion:
> 1. fixing up memsets is worthwhile regardless of what happens to its
> current consumers -- not all of them are necessarily doing something
> wrong
> 2. inlining memset can be a pessimization vs non-plain-erms memset as
> evidenced above. will have to figure out how to convince gcc to be
> less eager to use it.
>
> Sometimes I hate computers.
>
> --
> Mateusz Guzik <mjguzik gmail.com>
>

-- 
Mateusz Guzik <mjguzik gmail.com>