[RFC PATCH 2/3] add statmnt(2) syscall
Miklos Szeredi
miklos at szeredi.hu
Thu Sep 14 10:13:54 UTC 2023
On Thu, 14 Sept 2023 at 11:28, Christian Brauner <brauner at kernel.org> wrote:
>
> On Wed, Sep 13, 2023 at 05:22:35PM +0200, Miklos Szeredi wrote:
> > Add a way to query attributes of a single mount instead of having to parse
> > the complete /proc/$PID/mountinfo, which might be huge.
> >
> > Look up the mount by the old (32bit) or new (64bit) mount ID. If a mount
> > needs to be queried based on path, then statx(2) can be used to first query
> > the mount ID belonging to the path.
> >
> > Design is based on a suggestion by Linus:
> >
> > "So I'd suggest something that is very much like "statfsat()", which gets
> > a buffer and a length, and returns an extended "struct statfs" *AND*
> > just a string description at the end."
>
> So what we agreed to at LSFMM was that we split filesystem option
> retrieval into a separate system call and just have a very focused
> statx() for mounts with just binary and non-variable sized information.
> We even gave David a hard time about this. :) I would really love if we
> could stick to that.
>
> Linus, I realize this was your suggestion a long time ago but I would
> really like us to avoid structs with variable sized fields at the end of
> a struct. That's just so painful for userspace and universally disliked.
> If you care I can even find the LSFMM video where we have users of that
> api requesting that we please don't do this. So it'd be great if you
> wouldn't insist on it.

I completely missed that.

What I'm thinking is making it even simpler for userspace:

struct statmnt {
	...
	char *mnt_root;
	char *mountpoint;
	char *fs_type;
	u32 num_opts;
	char *opts;
};

I'd still just keep options nul delimited.

Is there a good reason not to return pointers (pointing to within the
supplied buffer obviously) to userspace?

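
A minimal userspace sketch of the layout being asked about may help make the
question concrete. It only simulates the idea: the struct sits at the start of
the caller's buffer, the strings are copied in behind it, and the pointer
fields point at those copies. The names statmnt_sketch, put_bytes, fill_sketch
and nth_opt are hypothetical and exist only for this illustration; none of
this is the patch's code or an existing UAPI. The buffer is assumed to be
suitably aligned for the struct.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout modelled on the struct proposed above; not a real UAPI. */
struct statmnt_sketch {
	char *mnt_root;
	char *mountpoint;
	char *fs_type;
	uint32_t num_opts;
	char *opts;	/* num_opts strings, each nul terminated, back to back */
};

/* Copy len bytes to buf + *pos; return the copy, or NULL if it does not fit. */
static char *put_bytes(char *buf, size_t bufsize, size_t *pos,
		       const char *src, size_t len)
{
	char *dst;

	if (len > bufsize - *pos)	/* *pos never exceeds bufsize */
		return NULL;
	dst = buf + *pos;
	memcpy(dst, src, len);
	*pos += len;
	return dst;
}

/*
 * Stand-in for the filling side: place the struct at the start of the
 * (suitably aligned) buffer and make every pointer refer to a string
 * stored in the same buffer. Returns NULL if the buffer is too small.
 */
static struct statmnt_sketch *fill_sketch(char *buf, size_t bufsize,
		const char *root, const char *mountpoint, const char *fs_type,
		const char *opts, size_t opts_len, uint32_t num_opts)
{
	struct statmnt_sketch *sm = (struct statmnt_sketch *)buf;
	size_t pos = sizeof(*sm);

	if (bufsize < sizeof(*sm))
		return NULL;
	sm->mnt_root = put_bytes(buf, bufsize, &pos, root, strlen(root) + 1);
	if (!sm->mnt_root)
		return NULL;
	sm->mountpoint = put_bytes(buf, bufsize, &pos, mountpoint,
				   strlen(mountpoint) + 1);
	if (!sm->mountpoint)
		return NULL;
	sm->fs_type = put_bytes(buf, bufsize, &pos, fs_type,
				strlen(fs_type) + 1);
	if (!sm->fs_type)
		return NULL;
	sm->opts = put_bytes(buf, bufsize, &pos, opts, opts_len);
	if (!sm->opts)
		return NULL;
	sm->num_opts = num_opts;
	return sm;
}

/* Walk the nul delimited option list; return option n or NULL if out of range. */
static const char *nth_opt(const struct statmnt_sketch *sm, uint32_t n)
{
	const char *p = sm->opts;

	if (n >= sm->num_opts)
		return NULL;
	while (n--)
		p += strlen(p) + 1;
	return p;
}
```

With this shape a caller reads sm->fs_type directly and steps through options
with nth_opt(), with no offset arithmetic of its own. The cost is that the
pointers are only valid for that buffer at that address, so copying the buffer
elsewhere would leave them pointing at the old copy.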
>
> This will also allow us to turn statmnt() into an extensible argument
> system call versioned by size just like we do any new system calls with
> struct arguments (e.g., mount_setattr(), clone3(), openat2() and so on).
> Which is how we should do things like that.
The mask mechanism also allows versioning of the struct.
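
One way to read the mask argument, in the style of stx_mask in statx(2), is
sketched below. The caller requests groups of fields, the filling side reports
in mask only the groups it actually knows about, and the caller checks a bit
before trusting the matching field. The SKETCH_* bits, the struct and both
functions are invented for this illustration and are not taken from the patch.

```c
#include <stdint.h>

/* Hypothetical request/result bits; the names are invented for this sketch. */
#define SKETCH_SB_BASIC		0x01ULL
#define SKETCH_MNT_BASIC	0x02ULL
#define SKETCH_NEW_FIELD	0x04ULL	/* only known to a later "kernel" */

struct statmnt_mask_sketch {
	uint64_t mask;		/* which groups below were filled in */
	uint64_t sb_magic;	/* valid if SKETCH_SB_BASIC is set */
	uint64_t mnt_id;	/* valid if SKETCH_MNT_BASIC is set */
	uint64_t new_field;	/* valid if SKETCH_NEW_FIELD is set */
};

/* Simulated older "kernel": it only understands the first two groups. */
static void old_kernel_fill(struct statmnt_mask_sketch *sm, uint64_t request)
{
	const uint64_t supported = SKETCH_SB_BASIC | SKETCH_MNT_BASIC;

	/* Unknown request bits are silently dropped from the result mask. */
	sm->mask = request & supported;
	if (sm->mask & SKETCH_SB_BASIC)
		sm->sb_magic = 0xEF53;	/* example value */
	if (sm->mask & SKETCH_MNT_BASIC)
		sm->mnt_id = 42;	/* example value */
}

/* Caller side: a field may be read only if its bit came back in mask. */
static int have_field(const struct statmnt_mask_sketch *sm, uint64_t bit)
{
	return (sm->mask & bit) != 0;
}
```

A newer program that asks for SKETCH_NEW_FIELD on the older side simply gets
the bit back cleared and falls back, which is the sense in which the mask acts
as a version. What it does not cover by itself is a caller built against a
smaller struct than the one being written, which is the case the size-based
scheme addresses.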
>
> Other than that I really think this is on track for what we ultimately
> want.
>
> > +struct stmt_str {
> > + __u32 off;
> > + __u32 len;
> > +};
> > +
> > +struct statmnt {
> > + __u64 mask; /* What results were written [uncond] */
> > + __u32 sb_dev_major; /* Device ID */
> > + __u32 sb_dev_minor;
> > + __u64 sb_magic; /* ..._SUPER_MAGIC */
> > + __u32 sb_flags; /* MS_{RDONLY,SYNCHRONOUS,DIRSYNC,LAZYTIME} */
> > + __u32 __spare1;
> > + __u64 mnt_id; /* Unique ID of mount */
> > + __u64 mnt_parent_id; /* Unique ID of parent (for root == mnt_id) */
> > + __u32 mnt_id_old; /* Reused IDs used in proc/.../mountinfo */
> > + __u32 mnt_parent_id_old;
> > + __u64 mnt_attr; /* MOUNT_ATTR_... */
> > + __u64 mnt_propagation; /* MS_{SHARED,SLAVE,PRIVATE,UNBINDABLE} */
> > + __u64 mnt_peer_group; /* ID of shared peer group */
> > + __u64 mnt_master; /* Mount receives propagation from this ID */
> > + __u64 propagate_from; /* Propagation from in current namespace */
> > + __u64 __spare[20];
> > + struct stmt_str mnt_root; /* Root of mount relative to root of fs */
> > + struct stmt_str mountpoint; /* Mountpoint relative to root of process */
> > + struct stmt_str fs_type; /* Filesystem type[.subtype] */
>
> I think if we want to do this here we should add:
>
> __u64 fs_type
> __u64 fs_subtype
>
> fs_type can just be our filesystem magic number and we introduce magic
It's already there: sb_magic.
However it's not a 1:1 mapping (ext* only has one magic).
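
To illustrate the ext* remark: EXT2_SUPER_MAGIC, EXT3_SUPER_MAGIC and
EXT4_SUPER_MAGIC in <linux/magic.h> all have the value 0xEF53, so the magic
number alone cannot tell those filesystem types apart. The small table and the
helper count_types_for_magic below are hypothetical and exist only for this
sketch; the constants are written out literally so the example does not depend
on kernel headers.

```c
#include <stddef.h>
#include <stdint.h>

struct magic_entry {
	const char *fs_type;
	uint64_t magic;
};

/* Illustrative table; values as found in <linux/magic.h>. */
static const struct magic_entry sketch_table[] = {
	{ "ext2",  0xEF53 },		/* EXT2_SUPER_MAGIC */
	{ "ext3",  0xEF53 },		/* EXT3_SUPER_MAGIC */
	{ "ext4",  0xEF53 },		/* EXT4_SUPER_MAGIC */
	{ "tmpfs", 0x01021994 },	/* TMPFS_MAGIC */
};

/* Count how many filesystem type names in the table share one magic value. */
static size_t count_types_for_magic(uint64_t magic)
{
	size_t i, n = 0;

	for (i = 0; i < sizeof(sketch_table) / sizeof(sketch_table[0]); i++)
		if (sketch_table[i].magic == magic)
			n++;
	return n;
}
```

A result greater than one for a given magic means a type string, or some other
extra field, is still needed to name the filesystem type unambiguously.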
> numbers for sub types as well. So we don't need to use strings here.
Ugh.
Thanks,
Miklos