[PATCH 10/17] prmem: documentation

Tue Oct 30 20:43:14 UTC 2018

On 30/10/2018 21:20, Matthew Wilcox wrote:
> On Tue, Oct 30, 2018 at 12:28:41PM -0600, Tycho Andersen wrote:
>> On Tue, Oct 30, 2018 at 10:58:14AM -0700, Matthew Wilcox wrote:
>>> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
>>>>> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook at chromium.org> wrote:
>>>> I support the addition of a rare-write mechanism to the upstream kernel.
>>>> And I think that there is only one sane way to implement it: using an
>>>> mm_struct. That mm_struct, just like any sane mm_struct, should only
>>>> differ from init_mm in that it has extra mappings in the *user* region.
>>>
>>> I'd like to understand this approach a little better.  In a syscall path,
>>> we run with the user task's mm.  What you're proposing is that when we
>>> want to modify rare data, we switch to rare_mm which contains a
>>> writable mapping to all the kernel data which is rare-write.
>>>
>>> So the API might look something like this:
>>>
>>> 	void *p = rare_alloc(...);	/* writable pointer */
>>> 	p->a = x;
>>> 	q = rare_protect(p);		/* read-only pointer */

With pools and memory allocated from vmap_areas, I was able to say

protect(pool)

and that would do a swipe on all the pages currently in use.
In the SELinux policyDB, for example, one doesn't really want to 
individually protect each allocation.

The loading phase happens usually at boot, when the system can be 
assumed to be sane (one might even preload a bare-bone set of rules from 
initramfs and then replace it later on, with the full blown set).

There is no need to process each of these tens of thousands allocations 
and initialization as write-rare.

Would it be possible to do the same here?

>>>
>>> To subsequently modify q,
>>>
>>> 	p = rare_modify(q);
>>> 	q->a = y;
>>
>> Do you mean
>>
>>      p->a = y;
>>
>> here? I assume the intent is that q isn't writable ever, but that's
>> the one we have in the structure at rest.
> 
> Yes, that was my intent, thanks.
> 
> To handle the list case that Igor has pointed out, you might want to
> do something like this:
> 
> 	list_for_each_entry(x, &xs, entry) {
> 		struct foo *writable = rare_modify(entry);

Would this mapping be impossible to spoof by other cores?

I'm asking this because, from what I understand, local interrupts are 
enabled here, so an attack could freeze the core performing the 
write-rare operation, while another scrapes the memory.

But blocking interrupts for the entire body of the loop would make RT 
latency unpredictable.

> 		kref_get(&writable->ref);
> 		rare_protect(writable);
> 	}
> 
> but we'd probably wrap it in list_for_each_rare_entry(), just to be nicer.

This seems suspiciously close to the duplication of kernel interfaces 
that I was roasted for :-)

--
igor