[PATCH v2 00/15] Make the user mode driver code a better citizen
Tetsuo Handa
penguin-kernel at i-love.sakura.ne.jp
Sat Jul 4 06:57:38 UTC 2020
On 2020/07/04 7:25, Eric W. Biederman wrote:
> Tetsuo Handa <penguin-kernel at i-love.sakura.ne.jp> writes:
>
>> On 2020/07/02 22:08, Eric W. Biederman wrote:
>>>> By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says
>>>> that use of flush_delayed_fput() has to be careful. Al, is it safe to call
>>>> flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might be
>>>> called from both kernel thread and from process context (e.g. init_module()
>>>> syscall by /sbin/insmod )) ?
>>>
>>> And __fput_sync needs to be even more careful.
>>> umd_load_blob is called in these changes without any locks held.
>>
>> But where is the guarantee that a thread which called flush_delayed_fput() waits for
>> the completion of processing _all_ "struct file" linked into delayed_fput_list ?
>> If some other thread or delayed_fput_work (scheduled by fput_many()) called
>> flush_delayed_fput() between blob_to_mnt()'s fput(file) and flush_delayed_fput()
>> sequence? blob_to_mnt()'s flush_delayed_fput() can miss the "struct file" which
>> needs to be processed before execve(), can't it?
>
> As a module the guarantee is we call task_work_run.
No. It is possible that blob_to_mnt() is called by a kernel thread which was
started by init_module() syscall by /sbin/insmod .
> Built into the kernel the guarantee as best I can trace it is that
> kthreadd hasn't started, and as such nothing that is scheduled has run
> yet.
Have you ever checked how early the kthreadd (PID=2) gets started?
----------
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2306,6 +2306,7 @@ static __latent_entropy struct task_struct *copy_process(
trace_task_newtask(p, clone_flags);
uprobe_copy_process(p, clone_flags);
+ printk(KERN_INFO "Created PID: %u Comm: %s\n", p->pid, p->comm);
return p;
bad_fork_cancel_cgroup:
----------
----------
[ 0.090757][ T0] pid_max: default: 65536 minimum: 512
[ 0.090890][ T0] LSM: Security Framework initializing
[ 0.090890][ T0] Mount-cache hash table entries: 8192 (order: 4, 65536 bytes, linear)
[ 0.090890][ T0] Mountpoint-cache hash table entries: 8192 (order: 4, 65536 bytes, linear)
[ 0.090890][ T0] Disabled fast string operations
[ 0.090890][ T0] Last level iTLB entries: 4KB 1024, 2MB 1024, 4MB 1024
[ 0.090890][ T0] Last level dTLB entries: 4KB 1024, 2MB 1024, 4MB 1024, 1GB 4
[ 0.090890][ T0] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
[ 0.090890][ T0] Spectre V2 : Spectre mitigation: kernel not compiled with retpoline; no mitigation available!
[ 0.090890][ T0] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl and seccomp
[ 0.090890][ T0] SRBDS: Unknown: Dependent on hypervisor status
[ 0.090890][ T0] MDS: Mitigation: Clear CPU buffers
[ 0.090890][ T0] Freeing SMP alternatives memory: 24K
[ 0.090890][ T0] Created PID: 1 Comm: swapper/0
[ 0.090890][ T0] Created PID: 2 Comm: swapper/0
[ 0.090890][ T1] smpboot: CPU0: Intel(R) Core(TM) i5-4440S CPU @ 2.80GHz (family: 0x6, model: 0x3c, stepping: 0x3)
[ 0.091000][ T2] Created PID: 3 Comm: kthreadd
[ 0.091995][ T2] Created PID: 4 Comm: kthreadd
[ 0.093028][ T2] Created PID: 5 Comm: kthreadd
[ 0.093997][ T2] Created PID: 6 Comm: kthreadd
[ 0.094995][ T2] Created PID: 7 Comm: kthreadd
[ 0.096037][ T2] Created PID: 8 Comm: kthreadd
(...snipped...)
[ 0.135716][ T2] Created PID: 13 Comm: kthreadd
[ 0.135716][ T1] smp: Bringing up secondary CPUs ...
[ 0.135716][ T2] Created PID: 14 Comm: kthreadd
[ 0.135716][ T2] Created PID: 15 Comm: kthreadd
[ 0.135716][ T2] Created PID: 16 Comm: kthreadd
[ 0.135716][ T2] Created PID: 17 Comm: kthreadd
[ 0.135716][ T2] Created PID: 18 Comm: kthreadd
[ 0.135716][ T1] x86: Booting SMP configuration:
(...snipped...)
[ 0.901990][ T1] pci 0000:00:00.0: Limiting direct PCI/PCI transfers
[ 0.902145][ T1] pci 0000:00:0f.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
[ 0.902213][ T1] pci 0000:02:00.0: CLS mismatch (32 != 64), using 64 bytes
[ 0.902224][ T1] Trying to unpack rootfs image as initramfs...
[ 1.107993][ T1] Freeing initrd memory: 18876K
[ 1.109049][ T1] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[ 1.111003][ T1] software IO TLB: mapped [mem 0xab000000-0xaf000000] (64MB)
[ 1.112136][ T1] check: Scanning for low memory corruption every 60 seconds
[ 1.115040][ T2] Created PID: 52 Comm: kthreadd
[ 1.116110][ T1] workingset: timestamp_bits=46 max_order=20 bucket_order=0
[ 1.120936][ T1] SGI XFS with ACLs, security attributes, verbose warnings, quota, no debug enabled
[ 1.129626][ T2] Created PID: 53 Comm: kthreadd
[ 1.131403][ T2] Created PID: 54 Comm: kthreadd
----------
kthreadd (PID=2) is created by swapper/0 (PID=0) immediately after init (PID=1) was created by
swapper/0 (PID=0). It is even before secondary CPUs are brought up, and far earlier than unpacking
initramfs.
And how can we prove that blob_to_mnt() is only called by a kernel thread before some kernel
thread that interferes fput() starts running? blob_to_mnt() needs to be prepared for being
called after many processes already started running.
>
>> Also, I don't know how convoluted the dependency of all "struct file" linked into
>> delayed_fput_list might be, for there can be "struct file" which will not be a
>> simple close of tmpfs file created by blob_to_mnt()'s file_open_root() request.
>>
>> On the other hand, although __fput_sync() cannot be called from !PF_KTHREAD threads,
>> there is a guarantee that __fput_sync() waits for the completion of "struct file"
>> which needs to be flushed before execve(), isn't there?
>
> There is really not a good helper or helpers, and this code suggests we
> have something better. Right now I have used the existing helpers to
> the best of my ability. If you or someone else wants to write a better
> version of flushing so that exec can happen be my guest.
>
> As far as I can tell what I have is good enough.
Just saying what you think is not a "review". I'm waiting for answer from Al Viro
because I consider that Al will be the most familiar with fput()'s behavior.
At least I consider that
if (current->flags & PF_KTHREAD) {
__fput_sync(file);
} else {
fput(file);
task_work_run();
}
is a candidate for closing the race window. And depending on Al's answer,
removing
BUG_ON(!(task->flags & PF_KTHREAD));
from __fput_sync() and unconditionally using
__fput_sync(file);
from blob_to_mnt() might be the better choice. Anyway, I consider that
Al's response is important for this "review".
>
>>> We fundamentally AKA in any correct version of this code need to flush
>>> the file descriptor before we call exec or exec can not open it a
>>> read-only denying all writes from any other opens.
>>>
>>> The use case of flush_delayed_fput is exactly the same as that used
>>> when loading the initramfs.
>>
>> When loading the initramfs, the number of threads is quite few (which
>> means that the possibility of hitting the race window and convoluted
>> dependency is small).
>
> But the reality is the code run very early, before the initramfs is
> initialized in practice.
Such expectation is not a reality.
>
>> But like EXPORT_SYMBOL_GPL(umd_load_blob) indicates, blob_to_mnt()'s
>> flush_delayed_fput() might be called after many number of threads already
>> started running.
>
> At which point the code probably won't be runnig from a kernel thread
> but instead will be running in a thread where task_work_run is relevant.
No. It is possible that blob_to_mnt() is called by a kernel thread which was
started by init_module() syscall by /sbin/insmod .
>
> At worst it is a very small race, where someone else in another thread
> starts flushing the file. Which means the file could still be
> completely close before exec. Even that is not necessarily fatal,
> as the usermode driver code has a respawn capability.
>
> Code that is used enough that it hits that race sounds like a very
> good problem to have from the perspective of the usermode driver code.
In general, unconditionally retrying call_usermodehelper() when it returned
a negative value (e.g. -ENOENT, -ENOMEM, -EBUSY) is bad. I don't know which
code is an implementation of "a respawn capability"; I'd like to check where
that code is and whether that code is checking -ETXTBSY.
More information about the Linux-security-module-archive
mailing list