[RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf

Yafang Shao laoar.shao at gmail.com
Tue Nov 14 11:59:53 UTC 2023


On Tue, Nov 14, 2023 at 6:15 PM Michal Hocko <mhocko at suse.com> wrote:
>
> On Mon 13-11-23 11:15:06, Yafang Shao wrote:
> > On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler <casey at schaufler-ca.com> wrote:
> > >
> > > On 11/11/2023 11:34 PM, Yafang Shao wrote:
> > > > Background
> > > > ==========
> > > >
> > > > In our containerized environment, we've identified unexpected OOM events
> > > > where the OOM-killer terminates tasks despite having ample free memory.
> > > > This anomaly is traced back to tasks within a container using mbind(2) to
> > > > bind memory to a specific NUMA node. When the allocated memory on this node
> > > > is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > > > indiscriminately kills tasks. This becomes more critical with guaranteed
> > > > tasks (oom_score_adj: -998) aggravating the issue.
> > >
> > > Is there some reason why you can't fix the callers of mbind(2)?
> > > This looks like a user space configuration error rather than a
> > > system security issue.
> >
> > It appears my initial description may have caused confusion. In this
> > scenario, the caller is an unprivileged user lacking any capabilities.
> > While a privileged user, such as root, experiencing this issue might
> > indicate a user space configuration error, the concerning aspect is
> > the potential for an unprivileged user to disrupt the system easily.
> > If this is perceived as a misconfiguration, the question arises: What
> > is the correct configuration to prevent an unprivileged user from
> > utilizing mbind(2)?
>
> How is this any different from a non-NUMA (mbind) situation?

In a UMA system, each gigabyte of memory carries the same cost.
In a NUMA system, by contrast, confining a process to a specific node
carries an extra cost: memory that is free on other nodes cannot be
used to satisfy its allocations. In the worst case, if all containers
bind their memory exclusively to specific nodes, a significant amount
of memory will be wasted.
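
To illustrate the cost of strict binding, here is a toy model (not
kernel code). The two 64 GiB nodes, the request sizes, and the
allocate() helper are all made up for illustration. It only captures
one property: a strictly bound request cannot fall back to another
node, so it fails even while the other node is idle.

```python
# Toy model only: node sizes and request sizes are hypothetical numbers,
# and allocate() is an illustrative helper, not a kernel interface.
GIB = 1 << 30


class Node:
    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    @property
    def free(self):
        return self.capacity - self.used


def allocate(nodes, size, bound_to=None):
    """Return the index of the node that served the request, or None.

    bound_to=None models the default policy: any node with room will do.
    bound_to=i models a strict bind to node i with no fallback.
    """
    candidates = [bound_to] if bound_to is not None else range(len(nodes))
    for i in candidates:
        if nodes[i].free >= size:
            nodes[i].used += size
            return i
    return None


nodes = [Node(64 * GIB), Node(64 * GIB)]

# Containers bound to node 0 fill it completely.
allocate(nodes, 64 * GIB, bound_to=0)

# One more bound request fails, although node 1 is entirely free.
bound_result = allocate(nodes, 1 * GIB, bound_to=0)
free_at_failure = sum(n.free for n in nodes)

# The same request without a binding is served from node 1.
unbound_result = allocate(nodes, 1 * GIB)

print("bound request served by:", bound_result)
print("free memory at failure (GiB):", free_at_failure // GIB)
print("unbound request served by node:", unbound_result)
```

In this model half of the machine is free at the moment the bound
request fails, which is the kind of waste described above.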

> You can
> still have an unprivileged user allocate just until the OOM triggers
> and disrupt other workload consuming more memory. Sure the mempolicy
> based OOM is less precise and it might select a victim with only a small
> consumption on a target NUMA node but fundamentally the situation is
> very similar. I do not think disallowing mbind specifically is solving a
> real problem.

How would you recommend addressing this more effectively?

-- 
Regards
Yafang


