[PATCH bpf-next 0/4] Make inode storage available to tracing prog
Song Liu
songliubraving at meta.com
Tue Nov 19 22:35:04 UTC 2024
> On Nov 19, 2024, at 10:14 AM, Casey Schaufler <casey at schaufler-ca.com> wrote:
>
> On 11/19/2024 4:27 AM, Dr. Greg wrote:
>> On Sun, Nov 17, 2024 at 10:59:18PM +0000, Song Liu wrote:
>>
>>> Hi Christian, James and Jan,
>> Good morning, I hope the day is starting well for everyone.
>>
>>>> On Nov 14, 2024, at 1:49 PM, James Bottomley <James.Bottomley at HansenPartnership.com> wrote:
>>> [...]
>>>
>>>>> We can address this with something like following:
>>>>>
>>>>> #ifdef CONFIG_SECURITY
>>>>> void *i_security;
>>>>> #elif defined(CONFIG_BPF_SYSCALL)
>>>>> struct bpf_local_storage __rcu *i_bpf_storage;
>>>>> #endif
>>>>>
>>>>> This will help catch any misuse of i_bpf_storage at compile
>>>>> time, as i_bpf_storage doesn't exist with CONFIG_SECURITY=y.
>>>>>
>>>>> Does this make sense?
>>>> Got to say I'm with Casey here; this will generate horrible and
>>>> failure-prone code.
>>>>
>>>> Since effectively you're making i_security always present anyway,
>>>> simply do that and also pull the allocation code out of security.c in a
>>>> way that it's always available? That way you don't have to special
>>>> case the code depending on whether CONFIG_SECURITY is defined.
>>>> Effectively this would give everyone a generic way to attach some
>>>> memory area to an inode. I know it's more complex than this because
>>>> there are LSM hooks that run from security_inode_alloc() but if you can
>>>> make it work generically, I'm sure everyone will benefit.
>>> On a second thought, I think making i_security generic is not
>>> the right solution for "BPF inode storage in tracing use cases".
>>>
>>> This is because i_security serves a very specific use case: it
>>> points to a piece of memory whose size is calculated at boot
>>> time. If one of the supported LSMs is not enabled by the
>>> lsm= kernel arg, the kernel will not allocate memory in
>>> i_security for it. The only way to change lsm= is to reboot
>>> the system. BPF LSM programs can be disabled at boot time,
>>> which fits well with i_security. However, BPF tracing programs
>>> cannot be disabled at boot time (even if we changed the code to
>>> make that possible, we are not likely to disable BPF tracing).
>>> IOW, as long as CONFIG_BPF_SYSCALL is enabled, we expect some
>>> BPF tracing programs to load at some point, and these
>>> programs may use BPF inode storage.
>>>
>>> Therefore, with CONFIG_BPF_SYSCALL enabled, some extra memory
>>> will always be attached to i_security (maybe under a different
>>> name, say, i_generic) for every inode. In that case, we should
>>> really add i_bpf_storage directly to the inode, because another
>>> pointer jump via i_generic adds nothing but overhead.
>>>
>>> Does this make sense? Or did I misunderstand the suggestion?
>> There is a colloquialism that seems relevant here: "Pick your poison".
>>
>> In the greater interests of the kernel, it seems that a generic
>> mechanism for attaching per inode information is the only realistic
>> path forward, unless Christian changes his position on expanding
>> the size of struct inode.
>>
>> There are two pathways forward.
>>
>> 1.) Attach a constant size 'blob' of storage to each inode.
>>
>> This is a similar approach to what the LSM uses where each blob is
>> sized as follows:
>>
>> S = U * sizeof(void *)
>>
>> Where U is the number of sub-systems that have a desire to use inode
>> specific storage.
>
> I can't tell for sure, but it looks like you don't understand how
> LSM i_security blobs are used. It is *not* the case that each LSM
> gets a pointer in the i_security blob. Each LSM that wants storage
> tells the infrastructure at initialization time how much space it
> wants in the blob. That can be a pointer, but usually it's a struct
> with flags, pointers and even lists.
>
>> Each sub-system uses its pointer slot to manage any additional
>> storage that it desires to attach to the inode.
>
> Again, an LSM may choose to do it that way, but most don't.
> SELinux and Smack need data on every inode. It makes much more sense
> to put it directly in the blob than to allocate a separate chunk
> for every inode.
AFAICT, i_security is unique in that its size is calculated
at boot time. I guess we will just keep most LSM users behind
it.
>
>> This has the obvious advantage of O(1) cost complexity for any
>> sub-system that wants to access its inode specific storage.
>>
>> The disadvantage, as you note, is that it wastes memory if a
>> sub-system does not elect to attach per inode information, for example
>> the tracing infrastructure.
>
> To be clear, that disadvantage only comes up if the sub-system uses
> inode data on an occasional basis. If it never uses inode data there
> is no need to have a pointer to it.
>
>> This disadvantage is parried by the fact that it reduces the size of
>> the inode proper by 24 bytes (4 pointers down to 1) and allows future
>> extensibility without colliding with the interests and desires of the
>> VFS maintainers.
>
> You're adding a level of indirection. Even I would object based on
> the performance impact.
>
>> 2.) Implement key/value mapping for inode specific storage.
>>
>> The key would be a sub-system-specific numeric value that maps to a
>> pointer the sub-system uses to manage its inode-specific memory for a
>> particular inode.
>>
>> A participating sub-system in turn uses its identifier to register an
>> inode specific pointer for its sub-system.
>>
>> This strategy loses O(1) lookup complexity but reduces total memory
>> consumption and only imposes memory costs for inodes when a sub-system
>> desires to use inode specific storage.
>
> SELinux and Smack use an inode blob for every inode. The performance
> regression boggles the mind. Not to mention the additional complexity
> of managing the memory.
>
>> Approach 2 requires the introduction of generic infrastructure that
>> allows an inode's key/value mappings to be located, presumably based
>> on the inode's pointer value. We could probably just resurrect the
>> old IMA iint code for this purpose.
>>
>> In the end it comes down to a rather standard trade-off in this
>> business, memory vs. execution cost.
>>
>> We would posit that option 2 is the only viable scheme if the design
>> metric is overall good for the Linux kernel eco-system.
>
> No. Really, no. You need look no further than secmarks to understand
> how a key based blob allocation scheme leads to tears. Keys are fine
> in the case where use of data is sparse. They have no place when data
> use is the norm.
OTOH, I think some on-demand key-value storage makes sense for many
other use cases, such as BPF (LSM and tracing), file locks, fanotify,
etc.
Overall, I think we have 3 types of storage attached to an inode:
1. Embedded in struct inode, gated by CONFIG_*.
2. Behind i_security (or maybe call it a different name if we
   can find other uses for it). The size is calculated at boot
   time.
3. Behind a key-value storage.
To evaluate these categories, we have:
  Speed:       1 > 2 > 3
  Flexibility: 3 > 2 > 1
We don't really have 3 right now. I think the direction is to
create it. BPF inode storage is a key-value store. If we can
get another user for 3, in addition to BPF inode storage, it
should be a net win.
Does this sound like a viable path forward?
Thanks,
Song