[PATCH bpf-next 0/4] Make inode storage available to tracing prog
Song Liu
songliubraving at meta.com
Tue Nov 19 22:35:04 UTC 2024
> On Nov 19, 2024, at 10:14 AM, Casey Schaufler <casey at schaufler-ca.com> wrote:
>
> On 11/19/2024 4:27 AM, Dr. Greg wrote:
>> On Sun, Nov 17, 2024 at 10:59:18PM +0000, Song Liu wrote:
>>
>>> Hi Christian, James and Jan,
>> Good morning, I hope the day is starting well for everyone.
>>
>>>> On Nov 14, 2024, at 1:49 PM, James Bottomley <James.Bottomley at HansenPartnership.com> wrote:
>>> [...]
>>>
>>>>> We can address this with something like following:
>>>>>
>>>>> #ifdef CONFIG_SECURITY
>>>>> void *i_security;
>>>>> #elif defined(CONFIG_BPF_SYSCALL)
>>>>> struct bpf_local_storage __rcu *i_bpf_storage;
>>>>> #endif
>>>>>
>>>>> This will help catch any misuse of i_bpf_storage at compile
>>>>> time, as i_bpf_storage doesn't exist with CONFIG_SECURITY=y.
>>>>>
>>>>> Does this make sense?
>>>> Got to say I'm with Casey here; this will generate horrible and
>>>> failure-prone code.
>>>>
>>>> Since effectively you're making i_security always present anyway,
>>>> simply do that and also pull the allocation code out of security.c in a
>>>> way that it's always available? That way you don't have to special
>>>> case the code depending on whether CONFIG_SECURITY is defined.
>>>> Effectively this would give everyone a generic way to attach some
>>>> memory area to an inode. I know it's more complex than this because
>>>> there are LSM hooks that run from security_inode_alloc() but if you can
>>>> make it work generically, I'm sure everyone will benefit.
>>> On a second thought, I think making i_security generic is not
>>> the right solution for "BPF inode storage in tracing use cases".
>>>
>>> This is because i_security serves a very specific use case: it
>>> points to a piece of memory whose size is calculated at boot
>>> time. If one of the supported LSMs is not enabled by the
>>> lsm= kernel arg, the kernel will not allocate memory in
>>> i_security for it. The only way to change lsm= is to reboot
>>> the system. BPF LSM programs can be disabled at boot time,
>>> which fits well with i_security. However, BPF tracing programs
>>> cannot be disabled at boot time (even if we changed the code to
>>> make that possible, we are not likely to disable BPF tracing).
>>> IOW, as long as CONFIG_BPF_SYSCALL is enabled, we expect some
>>> BPF tracing programs to load at some point, and these
>>> programs may use BPF inode storage.
>>>
>>> Therefore, with CONFIG_BPF_SYSCALL enabled, some extra memory
>>> will always be attached to i_security (maybe under a different
>>> name, say, i_generic) for every inode. In that case, we should
>>> really add i_bpf_storage directly to the inode, because another
>>> pointer jump via i_generic adds nothing but overhead.
>>>
>>> Does this make sense? Or did I misunderstand the suggestion?
>> There is a colloquialism that seems relevant here: "Pick your poison".
>>
>> In the greater interests of the kernel, it seems that a generic
>> mechanism for attaching per inode information is the only realistic
>> path forward, unless Christian changes his position on expanding
>> the size of struct inode.
>>
>> There are two pathways forward.
>>
>> 1.) Attach a constant size 'blob' of storage to each inode.
>>
>> This is a similar approach to what the LSM uses where each blob is
>> sized as follows:
>>
>> S = U * sizeof(void *)
>>
>> Where U is the number of sub-systems that have a desire to use inode
>> specific storage.
>
> I can't tell for sure, but it looks like you don't understand how
> LSM i_security blobs are used. It is *not* the case that each LSM
> gets a pointer in the i_security blob. Each LSM that wants storage
> tells the infrastructure at initialization time how much space it
> wants in the blob. That can be a pointer, but usually it's a struct
> with flags, pointers and even lists.
>
>> Each sub-system uses its pointer slot to manage any additional
>> storage that it desires to attach to the inode.
>
> Again, an LSM may choose to do it that way, but most don't.
> SELinux and Smack need data on every inode. It makes much more sense
> to put it directly in the blob than to allocate a separate chunk
> for every inode.
AFAICT, i_security is unique in that its size is calculated
at boot time. I guess we will just keep most LSM users behind
it.
>
>> This has the obvious advantage of O(1) cost complexity for any
>> sub-system that wants to access its inode specific storage.
>>
>> The disadvantage, as you note, is that it wastes memory if a
>> sub-system does not elect to attach per inode information, for example
>> the tracing infrastructure.
>
> To be clear, that disadvantage only comes up if the sub-system uses
> inode data on an occasional basis. If it never uses inode data there
> is no need to have a pointer to it.
>
>> This disadvantage is parried by the fact that it reduces the size of
>> the inode proper by 24 bytes (4 pointers down to 1) and allows future
>> extensibility without colliding with the interests and desires of the
>> VFS maintainers.
>
> You're adding a level of indirection. Even I would object based on
> the performance impact.
>
>> 2.) Implement key/value mapping for inode specific storage.
>>
>> The key would be a sub-system-specific numeric value that maps to a
>> pointer the sub-system uses to manage its inode-specific memory for a
>> particular inode.
>>
>> A participating sub-system in turn uses its identifier to register an
>> inode specific pointer for its sub-system.
>>
>> This strategy loses O(1) lookup complexity but reduces total memory
>> consumption and only imposes memory costs for inodes when a sub-system
>> desires to use inode specific storage.
>
> SELinux and Smack use an inode blob for every inode. The performance
> regression boggles the mind. Not to mention the additional complexity
> of managing the memory.
>
>> Approach 2 requires the introduction of generic infrastructure that
>> allows an inode's key/value mappings to be located, presumably based
>> on the inode's pointer value. We could probably just resurrect the
>> old IMA iint code for this purpose.
>>
>> In the end it comes down to a rather standard trade-off in this
>> business, memory vs. execution cost.
>>
>> We would posit that option 2 is the only viable scheme if the design
>> metric is overall good for the Linux kernel eco-system.
>
> No. Really, no. You need look no further than secmarks to understand
> how a key based blob allocation scheme leads to tears. Keys are fine
> in the case where use of data is sparse. They have no place when data
> use is the norm.
OTOH, I think some on-demand key-value storage makes sense for many
other use cases, such as BPF (LSM and tracing), file locks, fanotify,
etc.
Overall, I think we have 3 types of storage attached to an inode:
1. Embedded in struct inode, gated by CONFIG_*.
2. Behind i_security (or maybe call it a different name if we
   can find other uses for it). The size is calculated at boot
   time.
3. Behind a key-value storage.
To evaluate these categories, we have:
  Speed:       1 > 2 > 3
  Flexibility: 3 > 2 > 1
We don't really have 3 right now. I think the direction is to
create it. BPF inode storage is a key-value store. If we can
get another user for 3, in addition to BPF inode storage, it
should be a net win.
Does this sound like a viable path forward?
Thanks,
Song