[GIT PULL] Block fixes for 6.18-rc3
Christian Brauner
brauner at kernel.org
Sat Nov 1 13:33:29 UTC 2025
On Fri, Oct 31, 2025 at 09:30:11AM -0700, Linus Torvalds wrote:
> On Fri, 31 Oct 2025 at 08:44, Christian Brauner <brauner at kernel.org> wrote:
> >
> > Hm, two immediate observations before I go off and write the series.
> >
> > (1) The thing is that init_cred would have to be exposed to modules via
> > EXPORT_SYMBOL() for this to work. It would be easier to just force
> > the use of init_task->cred instead.
>
> Yea, I guess we already export that.
>
> > That pointer deref won't matter in the face of the allocations and
> > refcounts we wipe out with this. Then we should also move init_cred
> > to init/init_task.c and make it static const. Nobody really needs it
> > currently.
>
> Well, I did the "does it compile ok" with it marked as 'const', but as
> mentioned, those 'struct cred' instances aren't *really* const, they
> are only pseudo-const things in that they are *marked* const so that
> nobody modifies them by mistake, but then the ref-counting will cast
> the constness away in order to update references.
>
> So I don't think we can *actually* mark it "static const", because
> that will put the data structure in the const data section, and then
> the refcounting will trigger kernel page faults.
>
> End result: I think we can indeed move it to init/init_task.c. And
> yes, we can and should make it static to that file, but not plain
> 'const'.
>
> If we expose it to others - but I think you're right that maybe it's
> not a good idea - we should *expose* it as a 'const' data structure.
>
> But we should probably put it in some explicitly writable section (I
> was going to suggest marking it "__read_mostly", but it turns out some
> architectures #define that to be empty, so a "const __read_mostly"
> data structure could still end up in a read-only section).
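A minimal userspace sketch of the pattern described above: the object is
defined non-const so it lands in writable storage, but it is only ever
exposed through a const pointer, and the refcount helper casts the
constness away. All demo_* names are hypothetical stand-ins, not kernel
APIs, and C11 atomics stand in for the kernel's refcounting.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for struct cred; only the refcount matters here. */
struct demo_cred {
	atomic_int usage;
	int uid;
};

/*
 * Deliberately NOT defined const: the object must live in writable
 * storage because the refcount inside it gets updated.
 */
static struct demo_cred demo_init_cred = { .usage = 1, .uid = 0 };

/* Callers only ever see a const pointer, so accidental writes don't compile. */
const struct demo_cred *demo_get_init_cred(void)
{
	return &demo_init_cred;
}

/*
 * Taking a reference casts the constness away. This is well defined only
 * because the underlying object was not defined const.
 */
const struct demo_cred *demo_get_cred(const struct demo_cred *cred)
{
	struct demo_cred *writable = (struct demo_cred *)cred;

	assert(cred);
	atomic_fetch_add(&writable->usage, 1);
	return cred;
}

/* Read the current reference count (for illustration only). */
int demo_cred_usage(const struct demo_cred *cred)
{
	return atomic_load(&((struct demo_cred *)cred)->usage);
}
```

Had demo_init_cred been defined "static const", the write in
demo_get_cred() would be undefined behavior and would typically fault,
which is the failure mode discussed above.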
For some heavily used init data structures, such as init_pid_ns, it
often makes sense to just skip the refcounting completely because we
know they are always around. Take the pid namespace as an example:
static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
{
if (ns != &init_pid_ns)
ns_ref_inc(ns);
return ns;
}
void put_pid_ns(struct pid_namespace *ns)
{
if (ns && ns != &init_pid_ns && ns_ref_put(ns))
schedule_work(&ns->work);
}
While it has the obvious disadvantage that it introduces a special case
into the refcounting, and it would be more elegant if we just did:
void put_pid_ns(struct pid_namespace *ns)
{
if (ns_ref_put(ns))
schedule_work(&ns->work);
}
it does elide a ton of refcount increments and decrements during task
creation.
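To make the trade-off concrete, here is a small userspace sketch of the
same special-casing, assuming C11 atomics in place of the kernel's
refcount helpers. The demo_* names are made up for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted namespace. */
struct demo_ns {
	atomic_int refs;
};

/* The always-present initial instance; its count is never touched. */
static struct demo_ns demo_init_ns = { .refs = 1 };

struct demo_ns *demo_get_ns(struct demo_ns *ns)
{
	/* Skip the atomic increment for the initial instance. */
	if (ns != &demo_init_ns)
		atomic_fetch_add(&ns->refs, 1);
	return ns;
}

/* Returns true when the caller dropped the last reference. */
bool demo_put_ns(struct demo_ns *ns)
{
	int old;

	if (!ns || ns == &demo_init_ns)
		return false;

	old = atomic_fetch_sub(&ns->refs, 1);
	assert(old > 0); /* an underflow would indicate a refcounting bug */
	return old == 1;
}
```

Every get/put on the initial instance then costs a pointer comparison
instead of an atomic read-modify-write on a shared cacheline.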
While that's not true for init_cred, it would still be easy to just not
refcount it at all if it's worth it.
Now that I think about it: given that I reworked all the namespace
reference counting completely, it should be easy to make all initial
namespaces not get or put reference counts at all, like:
static __always_inline bool is_initial_namespace(struct ns_common *ns)
{
VFS_WARN_ON_ONCE(ns->ns_id == 0);
/* initial namespaces have fixed ids and the ids aren't recycled */
return ns->ns_id <= NS_LAST_INIT_ID;
}
diff --git a/include/linux/ns_common.h b/include/linux/ns_common.h
index 241eb1e98e1d..fe9c81963786 100644
--- a/include/linux/ns_common.h
+++ b/include/linux/ns_common.h
@@ -136,9 +136,8 @@ struct ns_common *__must_check ns_owner(struct ns_common *ns);
#define to_ns_common(__ns) \
@@ -225,6 +224,8 @@ static __always_inline __must_check int __ns_ref_active_read(const struct ns_com
static __always_inline __must_check bool __ns_ref_put(struct ns_common *ns)
{
+ if (is_initial_namespace(ns))
+ return false;
if (refcount_dec_and_test(&ns->__ns_ref)) {
VFS_WARN_ON_ONCE(__ns_ref_active_read(ns));
return true;
@@ -234,6 +235,8 @@ static __always_inline __must_check bool __ns_ref_put(struct ns_common *ns)
static __always_inline __must_check bool __ns_ref_get(struct ns_common *ns)
{
+ if (is_initial_namespace(ns))
+ return true;
if (refcount_inc_not_zero(&ns->__ns_ref))
return true;
VFS_WARN_ON_ONCE(__ns_ref_active_read(ns));
@@ -246,7 +249,8 @@ static __always_inline __must_check int __ns_ref_read(const struct ns_common *ns
}
#define ns_ref_read(__ns) __ns_ref_read(to_ns_common((__ns)))
-#define ns_ref_inc(__ns) refcount_inc(&to_ns_common((__ns))->__ns_ref)
+#define ns_ref_inc(__ns) \
+ do { if (!is_initial_namespace(to_ns_common(__ns))) refcount_inc(&to_ns_common((__ns))->__ns_ref); } while (0)
#define ns_ref_get(__ns) __ns_ref_get(to_ns_common((__ns)))
#define ns_ref_put(__ns) __ns_ref_put(to_ns_common((__ns)))
#define ns_ref_put_and_lock(__ns, __lock) \
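The idea in the patch above can be tried outside the kernel. The sketch
below assumes a made-up DEMO_LAST_INIT_ID threshold and demo_* types. It
moves the "is this an initial namespace" check into the common helpers,
so a type-specific wrapper needs no special case of its own.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold: ids 1..8 are reserved for initial namespaces. */
#define DEMO_LAST_INIT_ID 8u

struct demo_ns_common {
	uint64_t ns_id;
	atomic_int refs;
};

static inline bool demo_is_initial(const struct demo_ns_common *ns)
{
	assert(ns->ns_id != 0); /* id 0 is treated as "never assigned" */
	return ns->ns_id <= DEMO_LAST_INIT_ID;
}

static inline void demo_ref_inc(struct demo_ns_common *ns)
{
	if (!demo_is_initial(ns))
		atomic_fetch_add(&ns->refs, 1);
}

/* Returns true when the last reference was dropped. */
static inline bool demo_ref_put(struct demo_ns_common *ns)
{
	if (demo_is_initial(ns))
		return false;
	return atomic_fetch_sub(&ns->refs, 1) == 1;
}

/* A type-specific namespace embedding the common part. */
struct demo_pid_ns {
	struct demo_ns_common ns;
};

struct demo_pid_ns *demo_get_pid_ns(struct demo_pid_ns *p)
{
	demo_ref_inc(&p->ns);
	return p;
}

bool demo_put_pid_ns(struct demo_pid_ns *p)
{
	return p && demo_ref_put(&p->ns);
}
```

The comparison is against an id rather than an address, so one check
covers every namespace type that embeds the common structure.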
This effectively means we can drop all the special-casing in the
namespace helpers like:
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 445517a72ad0..ef06c3d3fb52 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -61,9 +61,8 @@ static inline struct pid_namespace *to_pid_ns(struct ns_common *ns)
static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
{
- if (ns != &init_pid_ns)
- ns_ref_inc(ns);
+ ns_ref_inc(ns);
return ns;
}
#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 650be58d8d18..e48f5de41361 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -184,7 +184,7 @@ struct pid_namespace *copy_pid_ns(u64 flags,
void put_pid_ns(struct pid_namespace *ns)
{
- if (ns && ns != &init_pid_ns && ns_ref_put(ns))
+ if (ns && ns_ref_put(ns))
schedule_work(&ns->work);
}
EXPORT_SYMBOL_GPL(put_pid_ns);
All the other namespaces would get the same behavior as the pidns for
free, though I haven't looked into any potential pitfalls. Worth it?
I think it might avoid a bunch of cacheline ping-pong, especially for
the network namespace. It's just a theory, but it's easy enough to
implement.
>
> > (2) I think the plain override_creds() would work but we can do better.
> > I envision we can leverage CLASS() to completely hide any access to
> > init_cred and force a scope with kernel creds.
>
> Ack.
>
> Linus
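For (2), a rough userspace sketch of what a scope-based override could
look like. It uses the GCC/Clang cleanup attribute, which is the
mechanism the kernel's CLASS() and guard() helpers build on. Everything
named demo_* or DEMO_* is hypothetical and only illustrates the shape:
callers never name the kernel creds directly, and the previous creds are
restored on every exit path from the scope.

```c
#include <assert.h>

/* Hypothetical stand-in for struct cred. */
struct demo_cred {
	int uid;
};

static struct demo_cred demo_kernel_cred = { .uid = 0 };
static struct demo_cred demo_user_cred = { .uid = 1000 };
static const struct demo_cred *demo_current_cred = &demo_user_cred;

/* Install new creds and return the previous ones. */
static const struct demo_cred *demo_override_creds(const struct demo_cred *new)
{
	const struct demo_cred *old = demo_current_cred;

	assert(new);
	demo_current_cred = new;
	return old;
}

/* Cleanup handler: receives a pointer to the guard variable. */
static void demo_revert_creds(const struct demo_cred **saved)
{
	demo_current_cred = *saved;
}

/* Declares a guard; the old creds come back when it goes out of scope. */
#define DEMO_WITH_KERNEL_CREDS(name)					\
	const struct demo_cred *name					\
		__attribute__((cleanup(demo_revert_creds))) =		\
		demo_override_creds(&demo_kernel_cred)

int demo_uid_inside_scope(void)
{
	DEMO_WITH_KERNEL_CREDS(guard);

	(void)guard; /* only its scope matters */
	return demo_current_cred->uid; /* evaluated before the cleanup runs */
}

int demo_current_uid(void)
{
	return demo_current_cred->uid;
}
```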