[RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf

Casey Schaufler casey at schaufler-ca.com
Wed Nov 15 17:09:06 UTC 2023


On 11/15/2023 6:26 AM, Yafang Shao wrote:
> On Wed, Nov 15, 2023 at 5:33 PM Yafang Shao <laoar.shao at gmail.com> wrote:
>> On Wed, Nov 15, 2023 at 4:45 PM Michal Hocko <mhocko at suse.com> wrote:
>>> On Wed 15-11-23 09:52:38, Yafang Shao wrote:
>>>> On Wed, Nov 15, 2023 at 12:58 AM Casey Schaufler <casey at schaufler-ca.com> wrote:
>>>>> On 11/14/2023 3:59 AM, Yafang Shao wrote:
>>>>>> On Tue, Nov 14, 2023 at 6:15 PM Michal Hocko <mhocko at suse.com> wrote:
>>>>>>> On Mon 13-11-23 11:15:06, Yafang Shao wrote:
>>>>>>>> On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler <casey at schaufler-ca.com> wrote:
>>>>>>>>> On 11/11/2023 11:34 PM, Yafang Shao wrote:
>>>>>>>>>> Background
>>>>>>>>>> ==========
>>>>>>>>>>
>>>>>>>>>> In our containerized environment, we've identified unexpected OOM events
>>>>>>>>>> where the OOM-killer terminates tasks despite having ample free memory.
>>>>>>>>>> This anomaly is traced back to tasks within a container using mbind(2) to
>>>>>>>>>> bind memory to a specific NUMA node. When the allocated memory on this node
>>>>>>>>>> is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
>>>>>>>>>> indiscriminately kills tasks. This becomes more critical with guaranteed
>>>>>>>>>> tasks (oom_score_adj: -998) aggravating the issue.
>>>>>>>>> Is there some reason why you can't fix the callers of mbind(2)?
>>>>>>>>> This looks like a user space configuration error rather than a
>>>>>>>>> system security issue.
>>>>>>>> It appears my initial description may have caused confusion. In this
>>>>>>>> scenario, the caller is an unprivileged user lacking any capabilities.
>>>>>>>> While a privileged user, such as root, experiencing this issue might
>>>>>>>> indicate a user space configuration error, the concerning aspect is
>>>>>>>> the potential for an unprivileged user to disrupt the system easily.
>>>>>>>> If this is perceived as a misconfiguration, the question arises: What
>>>>>>>> is the correct configuration to prevent an unprivileged user from
>>>>>>>> utilizing mbind(2)?
>>>>>>> How is this any different than a non NUMA (mbind) situation?
>>>>>> In a UMA system, each gigabyte of memory carries the same cost.
>>>>>> Conversely, in a NUMA architecture, opting to confine processes within
>>>>>> a specific NUMA node incurs additional costs. In the worst-case
>>>>>> scenario, if all containers opt to bind their memory exclusively to
>>>>>> specific nodes, it will result in significant memory wastage.
>>>>> That still sounds like you've misconfigured your containers such
>>>>> that they expect to get more memory than is available, and that
>>>>> they have more control over it than they really do.
>>>> And again: What configuration method is suitable to limit user control
>>>> over memory policy adjustments, besides the heavyweight seccomp
>>>> approach?

What makes seccomp "heavyweight"? The overhead? The infrastructure required?
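
For reference, the seccomp approach under discussion amounts to a short classic-BPF filter. What follows is only a rough sketch, not anything taken from the patch set: it assumes x86_64 (where mbind is syscall 237 and set_mempolicy is 238), the FILTER, pack_filter() and simulate() names are made up for illustration, and the filter is interpreted in user space rather than installed with prctl(2) or seccomp(2).

```python
import struct

# Assumed values for x86_64 only; other architectures use different numbers.
AUDIT_ARCH_X86_64 = 0xC000003E
NR_MBIND = 237
NR_SET_MEMPOLICY = 238
EPERM = 1

# Classic BPF opcodes and seccomp return values from the Linux UAPI headers.
BPF_LD_W_ABS = 0x20   # BPF_LD | BPF_W | BPF_ABS
BPF_JEQ_K = 0x15      # BPF_JMP | BPF_JEQ | BPF_K
BPF_RET_K = 0x06      # BPF_RET | BPF_K
SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ERRNO = 0x00050000

# Each tuple mirrors one struct sock_filter: (code, jt, jf, k).
# Jump offsets are relative to the instruction after the jump.
FILTER = [
    (BPF_LD_W_ABS, 0, 0, 4),                       # 0: load seccomp_data.arch
    (BPF_JEQ_K, 0, 3, AUDIT_ARCH_X86_64),          # 1: foreign arch -> 5
    (BPF_LD_W_ABS, 0, 0, 0),                       # 2: load seccomp_data.nr
    (BPF_JEQ_K, 2, 0, NR_MBIND),                   # 3: mbind -> 6
    (BPF_JEQ_K, 1, 0, NR_SET_MEMPOLICY),           # 4: set_mempolicy -> 6
    (BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW),          # 5: allow everything else
    (BPF_RET_K, 0, 0, SECCOMP_RET_ERRNO | EPERM),  # 6: fail with EPERM
]


def pack_filter(prog):
    """Serialize the program into the 8-byte struct sock_filter layout."""
    return b"".join(struct.pack("=HBBI", *insn) for insn in prog)


def simulate(prog, nr, arch=AUDIT_ARCH_X86_64):
    """Interpret the three opcodes used above against a fake seccomp_data."""
    data = struct.pack("=iI", nr, arch)  # nr at offset 0, arch at offset 4
    acc = 0
    pc = 0
    while True:
        code, jt, jf, k = prog[pc]
        pc += 1
        if code == BPF_LD_W_ABS:
            acc = struct.unpack_from("=I", data, k)[0]
        elif code == BPF_JEQ_K:
            pc += jt if acc == k else jf
        elif code == BPF_RET_K:
            return k
        else:
            raise ValueError("opcode not handled by this sketch: %#x" % code)


if __name__ == "__main__":
    for name, nr in (("read", 0), ("mbind", NR_MBIND),
                     ("set_mempolicy", NR_SET_MEMPOLICY)):
        print(name, hex(simulate(FILTER, nr)))
```

This sketch lets a foreign architecture through for brevity; a real filter would normally deny or kill in that case, and would need to cover any other policy-related syscalls as well. The filter also has to be installed by whatever launches the container and is inherited by its children.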

>>> This really depends on the workloads. What is the reason mbind is used
>>> in the first place?
>> It can improve their performance.

How much? You've already demonstrated that using mbind can degrade their performance.

>>
>>> Is it acceptable to partition the system so that
>>> there is a numa node reserved for NUMA aware workloads?
>> As highlighted in the commit log, our preference is to configure this
>> memory policy through kubelet using cpuset.mems in the cpuset
>> controller, rather than allowing individual users to set it
>> independently.
>>
>>> If not, have you
>>> considered (already proposed numa=off)?
>> The challenge at hand isn't solely about whether users should bind to
>> a memory node or the deployment of workloads. What we're genuinely
>> dealing with is the fact that users can bind to a specific node
>> without our explicit agreement or authorization.
> BTW, the same principle should also apply to sched_setaffinity(2).
> While there's already a security_task_setscheduler() in place, it's
> undeniable that we should also consider adding a
> security_set_mempolicy() for consistency.

	"A foolish consistency is the hobgoblin of little minds"
	- Ralph Waldo Emerson



