[RFC PATCH v4 01/12] x86/sgx: Use mmu_notifier.release() instead of per-vma refcounting

Mon Jul 8 14:57:07 UTC 2019

On Fri, Jun 21, 2019 at 12:03:36AM +0300, Jarkko Sakkinen wrote:
> On Wed, Jun 19, 2019 at 03:23:50PM -0700, Sean Christopherson wrote:
> > Using per-vma refcounting to track mm_structs associated with an enclave
> > requires hooking .vm_close(), which in turn prevents the mm from merging
> > vmas (precisely to allow refcounting).
> 
> Why having sgx_vma_close() prevents that? I do not understand the
> problem statement.

vmas that define .vm_close() cannot be merged.

  /*
   * If the vma has a ->close operation then the driver probably needs to release
   * per-vma resources, so we don't attempt to merge those.
   */
  static inline int is_mergeable_vma(struct vm_area_struct *vma,
				struct file *file, unsigned long vm_flags,
				struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
  {
	...

	if (vma->vm_ops && vma->vm_ops->close)
		return 0;
	if (!is_mergeable_vm_userfaultfd_ctx(vma, vm_userfaultfd_ctx))
		return 0;
	return 1;
  }

> 
> > Avoid refcounting encl_mm altogether by registering an mmu_notifier at
> > .mmap(), removing the dying encl_mm at mmu_notifier.release() and
> > protecting mm_list during reclaim via a per-enclave SRCU.
> 
> Right, there is the potential collision with my changes:
> 
> 1. Your patch: enclave life-cycle equals life-cycle of all processes
>    that are associated with the enclave.
> 2. My (yet be sent) patch: enclave life-cycle equals the life cycle.
> 
> I won't rush with my patch. I rather merge neither at this point and
> you can review mine after you come back from your vacation.
> 
> > Removing refcounting/vm_close() allows merging of enclave vmas, at the
> > cost of delaying removal of encl_mm structs from mm_list, i.e. an mm is
> > disassociated from an enclave when the mm exits or the enclave dies, as
> > opposed to when the last vma (in a given mm) is closed.
> > 
> > The impact of delying encl_mm removal is its memory footprint and
> > whatever overhead is incurred during EPC reclaim (to walk an mm's vmas).
> > Practically speaking, a stale encl_mm will exist for a meaningful amount
> > of time if and only if the enclave is mapped in a long-lived process and
> > then passed off to another long-lived process.  It is expected that the
> > vast majority of use cases will not encounter this condition, e.g. even
> > using a daemon to build enclaves should not result in a stale encl_mm as
> > the builder should never need to mmap() the enclave.
> 
> This paragraph speaks only about "well behaving" software.

Malicious software isn't all that interesting as there are far easier ways
to waste system resources.  That being said, the encl_mm allocation can
use GFP_KERNEL_ACCOUNT.

> > Even if there are scenarios that lead to defunct encl_mms, the cost is
> > likely far outweighed by the benefits of reducing the number of vmas
> > across all enclaves.
> > 
> > Note, using SRCU to protect mm_list is not strictly necessary, i.e. the
> > existing walker with encl_mm refcounting could be massaged to work with
> > mmu_notifier.release(), but the resulting code is subtle and fragile (I
> > never actually got it working).  The primary issue is that an encl_mm
> > can't be moved off the list until its refcount goes to zero, otherwise
> > the custom walker goes off into the weeds.  The refcount requirement
> > then prevents using mm_list to identify if an mmu_notifier.release()
> > has fired, i.e. another mechanism is needed to guard against races
> > between exit_mmap() and sgx_release().
> 
> Is it really impossible to send a separate SRCU patch?

I can split out the SRCU as a precursor.  It'll likely take me a few days
to get it sent.

> I fully agree with the SRCU whereas rest of this patch is still
> under debate.
> 
> If you could do that, I can merge it in no time. It is a small
> step into better direction.
> 
> > Cc: Dave Hansen <dave.hansen at intel.com>
> > Cc: Andy Lutomirski <luto at kernel.org>
> > Signed-off-by: Sean Christopherson <sean.j.christopherson at intel.com>
> 
> Needs to be rebased because the master missing your earlier bug fix.

...

> > diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> > index 9566eb72d417..c6436bbd4a68 100644
> > --- a/arch/x86/kernel/cpu/sgx/encl.c
> > +++ b/arch/x86/kernel/cpu/sgx/encl.c
> > @@ -132,103 +132,125 @@ static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
> >  	return entry;
> >  }
> >  
> > -struct sgx_encl_mm *sgx_encl_mm_add(struct sgx_encl *encl,
> > -				    struct mm_struct *mm)
> > +static void sgx_encl_mm_release_wq(struct work_struct *work)
> > +{
> > +	struct sgx_encl_mm *encl_mm =
> > +		container_of(work, struct sgx_encl_mm, release_work);
> > +
> > +	sgx_encl_mm_release(encl_mm);
> > +}
> > +
> > +/*
> > + * Being a call_srcu() callback, this needs to be short, and sgx_encl_release()
> > + * is anything but short.  Do the final freeing in yet another async callback.
> > + */
> > +static void sgx_encl_mm_release_delayed(struct rcu_head *rcu)
> 
> Would rename this either as *_tail() or *_deferred().

Deferred works for me.

> > +{
> > +	struct sgx_encl_mm *encl_mm =
> > +		container_of(rcu, struct sgx_encl_mm, rcu);
> > +
> > +	INIT_WORK(&encl_mm->release_work, sgx_encl_mm_release_wq);
> > +	schedule_work(&encl_mm->release_work);
> > +}
> > +

...

> > @@ -118,11 +123,13 @@ void sgx_encl_destroy(struct sgx_encl *encl);
> >  void sgx_encl_release(struct kref *ref);
> >  pgoff_t sgx_encl_get_index(struct sgx_encl *encl, struct sgx_encl_page *page);
> >  struct page *sgx_encl_get_backing_page(struct sgx_encl *encl, pgoff_t index);
> > -struct sgx_encl_mm *sgx_encl_next_mm(struct sgx_encl *encl,
> > -				     struct sgx_encl_mm *encl_mm, int *iter);
> > -struct sgx_encl_mm *sgx_encl_mm_add(struct sgx_encl *encl,
> > -				    struct mm_struct *mm);
> > -void sgx_encl_mm_release(struct kref *ref);
> > +int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm);
> > +static inline void sgx_encl_mm_release(struct sgx_encl_mm *encl_mm)
> > +{
> > +	kref_put(&encl_mm->encl->refcount, sgx_encl_release);
> > +
> > +	kfree(encl_mm);
> > +}
> 
> Please just open code this to the two call sites. Makes the code hard to
> follow.

Heh, I waffled between a helper and open coding.  I chose poorly :-)

> Right now I did not find anything else questionable from the code
> changes. Repeating myself but if it is by any means possible before
> going away, can you construct a pure SRCU patch.
>
> I could then reconstruct my changes on top off that, which would
> make evalution of both heck a lot easier.
> 
> /Jarkko