[PATCH 10/17] prmem: documentation

Mike Rapoport rppt at linux.ibm.com
Wed Oct 24 23:04:26 UTC 2018


Hi Igor,

On Wed, Oct 24, 2018 at 12:34:57AM +0300, Igor Stoppa wrote:
> Documentation for protected memory.
> 
> Topics covered:
> * static memory allocation
> * dynamic memory allocation
> * write-rare
> 
> Signed-off-by: Igor Stoppa <igor.stoppa at huawei.com>
> CC: Jonathan Corbet <corbet at lwn.net>
> CC: Randy Dunlap <rdunlap at infradead.org>
> CC: Mike Rapoport <rppt at linux.vnet.ibm.com>
> CC: linux-doc at vger.kernel.org
> CC: linux-kernel at vger.kernel.org
> ---
>  Documentation/core-api/index.rst |   1 +
>  Documentation/core-api/prmem.rst | 172 +++++++++++++++++++++++++++++++

Thanks for having docs as a part of the patchset!

>  MAINTAINERS                      |   1 +
>  3 files changed, 174 insertions(+)
>  create mode 100644 Documentation/core-api/prmem.rst
> 
> diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
> index 26b735cefb93..1a90fa878d8d 100644
> --- a/Documentation/core-api/index.rst
> +++ b/Documentation/core-api/index.rst
> @@ -31,6 +31,7 @@ Core utilities
>     gfp_mask-from-fs-io
>     timekeeping
>     boot-time-mm
> +   prmem
> 
>  Interfaces for kernel debugging
>  ===============================
> diff --git a/Documentation/core-api/prmem.rst b/Documentation/core-api/prmem.rst
> new file mode 100644
> index 000000000000..16d7edfe327a
> --- /dev/null
> +++ b/Documentation/core-api/prmem.rst
> @@ -0,0 +1,172 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +.. _prmem:
> +
> +Memory Protection
> +=================
> +
> +:Date: October 2018
> +:Author: Igor Stoppa <igor.stoppa at huawei.com>
> +
> +Foreword
> +--------
> +- In a typical system using some sort of RAM as execution environment,
> +  **all** memory is initially writable.
> +
> +- It must be initialized with the appropriate content, be it code or data.
> +
> +- Said content typically undergoes modifications, e.g. relocations or
> +  relocation-induced changes.
> +
> +- The present document doesn't address such transient states.
> +
> +- Kernel code is protected at system level and, unlike data, it doesn't
> +  require special attention.
> +

I feel that foreword should include a sentence or two saying why we need
the memory protection and when it can/should be used.

> +Protection mechanism
> +--------------------
> +
> +- When available, the MMU can write protect memory pages that would be
> +  otherwise writable.
> +
> +- The protection has page-level granularity.
> +
> +- An attempt to overwrite a protected page will trigger an exception.
> +- **Write protected data must go exclusively to write protected pages**
> +- **Writable data must go exclusively to writable pages**
> +
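To make the page-granularity point concrete, here is a minimal userspace analog, assuming Linux with glibc. `mmap()` and `mprotect()` stand in for the kernel's own page-table manipulation, and a `SIGSEGV` handler stands in for the kernel's exception handling. The function name `page_protect_demo()` is made up for illustration; nothing here is part of prmem.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_env;

/* SIGSEGV handler: jump back to the probe instead of crashing. */
static void on_segv(int sig)
{
	(void)sig;
	siglongjmp(fault_env, 1);
}

/*
 * Illustrative only: returns 1 if a write to a page made read-only with
 * mprotect() faulted and left the original byte intact, 0 otherwise,
 * -1 on setup error.
 */
int page_protect_demo(void)
{
	long page = sysconf(_SC_PAGESIZE);
	struct sigaction sa, old;
	volatile int faulted = 0;
	char *p;

	assert(page > 0);
	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	p[0] = 42;                      /* initialization: page is writable */
	if (mprotect(p, (size_t)page, PROT_READ) != 0) {
		munmap(p, (size_t)page);
		return -1;
	}

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_segv;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGSEGV, &sa, &old);

	if (sigsetjmp(fault_env, 1) == 0)
		((volatile char *)p)[0] = 43;   /* write to a read-only page */
	else
		faulted = 1;                    /* the MMU raised an exception */

	sigaction(SIGSEGV, &old, NULL);
	/* Reading is still allowed and the old value is untouched. */
	faulted = faulted && p[0] == 42;
	munmap(p, (size_t)page);
	return faulted;
}
```

Since `mprotect()` only operates on whole pages, protecting one object protects everything else sharing its page, which is why protected and writable data must never be mixed within a page.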
> +Available protections for kernel data
> +-------------------------------------
> +
> +- **constant**
> +   Labelled as **const**, the data is never supposed to be altered.
> +   It is statically allocated - if it has any memory footprint at all.
> +   The compiler can even optimize it away, where possible, by replacing
> +   references to a **const** with its actual value.
> +
> +- **read only after init**
> +   Tagging an otherwise ordinary statically allocated variable with
> +   **__ro_after_init** places it in a special segment that will
> +   become write protected at the end of the kernel init phase.
> +   The compiler has no notion of this restriction and will treat any
> +   write operation on such a variable as legal. However, assignments
> +   attempted after the write protection is in place will cause
> +   exceptions.
> +
> +- **write rare after init**
> +   This can be seen as a variant of read only after init, which uses the
> +   tag **__wr_after_init**. It is also limited to statically allocated
> +   memory. It is still possible to alter this type of variables, after

                                                         no comma ^

> +   the kernel init phase is complete; however, it can be done exclusively
> +   with special functions, instead of the assignment operator. Using the
> +   assignment operator after conclusion of the init phase will still
> +   trigger an exception. It is not possible to transition a certain
> +   variable from __wr_ater_init to a permanent read-only status, at

                    __wr_aFter_init

> +   runtime.
> +
> +- **dynamically allocated write-rare / read-only**
> +   After defining a pool, memory can be obtained through it, primarily
> +   through the **pmalloc()** allocator. The exact writability state of the
> +   memory obtained from **pmalloc()** and friends can be configured when
> +   creating the pool. At any point, the memory currently associated with
> +   the pool can be transitioned to a less permissive write status.
> +   Once memory has become read-only, it the only valid operation, beside

... become read-only, the only valid operation

> +   reading, is to release it by destroying the pool it belongs to.
> +
> +
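The write-rare idea (data that is normally read-only but can still be changed through a dedicated function) can be sketched in userspace as below. This is an illustration under assumptions: `wr_memcpy_sketch()` and `wr_demo()` are hypothetical names, and toggling the protection of the page in place, as done here, is only one possible technique. It is not necessarily how the patchset implements write rare.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper, not the prmem API: temporarily lifts write
 * protection on the pages spanned by [dst, dst + len), copies, then
 * restores read-only protection. Returns 0 on success.
 */
int wr_memcpy_sketch(void *dst, const void *src, size_t len)
{
	uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
	uintptr_t start, end;

	/* The rounding below relies on the page size being a power of two. */
	assert(page != 0 && (page & (page - 1)) == 0);
	start = (uintptr_t)dst & ~(page - 1);
	end = ((uintptr_t)dst + len + page - 1) & ~(page - 1);

	if (mprotect((void *)start, end - start, PROT_READ | PROT_WRITE))
		return -1;
	memcpy(dst, src, len);
	return mprotect((void *)start, end - start, PROT_READ);
}

/*
 * Puts an int in its own page, write-protects it after "init", updates
 * it through the helper and returns the stored value (-1 on error).
 */
int wr_demo(int new_value)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	int *v = mmap(NULL, page, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int result = -1;

	if (v == MAP_FAILED)
		return -1;
	*v = 1;                                 /* init phase: plain assignment */
	if (mprotect(v, page, PROT_READ) == 0 &&      /* init done */
	    /* "*v = 2;" would fault here; only the helper may write. */
	    wr_memcpy_sketch(v, &new_value, sizeof(new_value)) == 0)
		result = *v;
	munmap(v, page);
	return result;
}
```

A weakness of this naive approach is that, while the page is writable, any thread can write to it, not only the caller of the helper.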
> +Protecting dynamically allocated memory
> +---------------------------------------
> +
> +When dealing with dynamically allocated memory, three options are
> +available for configuring its writability state:
> +
> +- **Options selected when creating a pool**
> +   When creating the pool, it is possible to choose one of the following:
> +    - **PMALLOC_MODE_RO**
> +       - Writability at allocation time: *WRITABLE*
> +       - Writability at protection time: *NONE*
> +    - **PMALLOC_MODE_WR**
> +       - Writability at allocation time: *WRITABLE*
> +       - Writability at protection time: *WRITE-RARE*
> +    - **PMALLOC_MODE_AUTO_RO**
> +       - Writability at allocation time:
> +           - the latest allocation: *WRITABLE*
> +           - every other allocation: *NONE*
> +       - Writability at protection time: *NONE*
> +    - **PMALLOC_MODE_AUTO_WR**
> +       - Writability at allocation time:
> +           - the latest allocation: *WRITABLE*
> +           - every other allocation: *WRITE-RARE*
> +       - Writability at protection time: *WRITE-RARE*
> +    - **PMALLOC_MODE_START_WR**
> +       - Writability at allocation time: *WRITE-RARE*
> +       - Writability at protection time: *WRITE-RARE*

For me this part is quite hard to read. Maybe arranging this as a table would
make the states more clearly visible.

> +
> +   **Remarks:**
> +    - The "AUTO" modes perform automatic protection of the content, whenever
> +       the current vmap_area is used up and a new one is allocated.
> +        - At that point, the vmap_area being phased out is protected.
> +        - The size of the vmap_area depends on various parameters.
> +        - It might not be possible to know for sure *when* certain data will
> +          be protected.
> +        - The functionality is provided as a tradeoff between hardening and speed.
> +        - Its usefulness depends on the specific use case at hand.
> +    - The "START_WR" mode is the only one that provides immediate protection, at the cost of speed.
> +
> +- **Protecting the pool**
> +   This is achieved with **pmalloc_protect_pool()**
> +    - Any vmap_area currently in the pool is write-protected according to its initial configuration.
> +    - Any residual space still available from the current vmap_area is lost, as the area is protected.
> +    - **protecting a pool after every allocation will likely be very wasteful**
> +    - Using PMALLOC_MODE_START_WR is likely a better choice.
> +
> +- **Upgrading the protection level**
> +   This is achieved with **pmalloc_make_pool_ro()**
> +    - it turns the present content of a write-rare pool into read-only
> +    - can be useful when the content of the memory has settled
> +
> +
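For the dynamic case, a rough userspace model of a pool might look like the following. Everything in it is hypothetical: `struct sketch_pool`, `sketch_pool_alloc()` and friends are not the pmalloc API, mmap'ed chunks stand in for vmap_areas, and only read-only protection is modelled (no write-rare). It shows the properties described above: allocation is a bump pointer within the current area, protecting the pool throws away the residual space of the current area, and memory is only returned when the whole pool is destroyed. Setting `auto_protect` roughly approximates the "AUTO" modes by protecting an area as soon as it is phased out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define SKETCH_MAX_AREAS 16

/* Hypothetical pool, loosely modelled on the text above. */
struct sketch_pool {
	void *area[SKETCH_MAX_AREAS];   /* page-aligned chunks ("vmap_areas") */
	size_t area_size;
	int nr_areas;
	size_t used;                    /* bytes used in the current area */
	int protected_up_to;            /* areas [0, n) are read-only */
	int auto_protect;               /* seal an area when it is phased out */
};

void sketch_pool_init(struct sketch_pool *p, size_t pages_per_area,
		      int auto_protect)
{
	assert(pages_per_area > 0);
	p->area_size = pages_per_area * (size_t)sysconf(_SC_PAGESIZE);
	p->nr_areas = 0;
	p->used = 0;
	p->protected_up_to = 0;
	p->auto_protect = auto_protect;
}

/* Write-protect every area; leftover space in the current area is lost,
 * because the next allocation has to start a fresh writable area. */
int sketch_pool_protect(struct sketch_pool *p)
{
	for (int i = p->protected_up_to; i < p->nr_areas; i++)
		if (mprotect(p->area[i], p->area_size, PROT_READ))
			return -1;
	p->protected_up_to = p->nr_areas;
	return 0;
}

/* Bump allocation with 16-byte alignment; there is no per-object free. */
void *sketch_pool_alloc(struct sketch_pool *p, size_t size)
{
	size = (size + 15) & ~(size_t)15;
	if (size > p->area_size)
		return NULL;
	if (p->nr_areas == 0 || p->protected_up_to == p->nr_areas ||
	    p->used + size > p->area_size) {
		void *a;

		if (p->nr_areas == SKETCH_MAX_AREAS)
			return NULL;
		/* "AUTO"-like behaviour: seal the area being phased out. */
		if (p->auto_protect && p->protected_up_to < p->nr_areas &&
		    sketch_pool_protect(p) != 0)
			return NULL;
		a = mmap(NULL, p->area_size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (a == MAP_FAILED)
			return NULL;
		p->area[p->nr_areas++] = a;
		p->used = 0;
	}
	p->used += size;
	return (char *)p->area[p->nr_areas - 1] + p->used - size;
}

/* The only way to give memory back: release the whole pool. */
void sketch_pool_destroy(struct sketch_pool *p)
{
	for (int i = 0; i < p->nr_areas; i++)
		munmap(p->area[i], p->area_size);
	p->nr_areas = 0;
	p->protected_up_to = 0;
	p->used = 0;
}
```

Calling `sketch_pool_protect()` after every single allocation would burn a whole area per object, which is the waste the "Protecting the pool" bullet warns about.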
> +Caveats
> +-------
> +- Freeing of memory is not supported. Pages will be returned to the
> +  system upon destruction of their memory pool.
> +
> +- The address range available for vmalloc (and thus for pmalloc too) is
> +  limited, on 32-bit systems. However it shouldn't be an issue, since not

   no comma ^

> +  much data is expected to be dynamically allocated and then
> +  write-protected.
> +
> +- On SMP systems, changing the state of pages and altering mappings
> +  requires cross-processor synchronization of page tables.
> +  This is an additional reason for limiting the use of write rare.
> +
> +- Not only must the pmalloc memory be protected, but also any reference to
> +  it that might become the target of an attack. Such an attack would replace
> +  a reference to the protected memory with a reference to some other,
> +  unprotected, memory.
> +
> +- The users of rare write must ensure the atomicity of the action
> +  with respect to the way they use the data being altered; for example,
> +  take a lock before making a copy of the value to modify (if it's
> +  relevant), then alter it, issue the call to rare write and finally
> +  release the lock. Some special scenarios might be exempt from the need
> +  for locking, but in general rare write must be treated as an operation
> +  that is subject to races.
> +
> +- pmalloc relies on virtual memory areas and will therefore use more
> +  TLB entries. It still does a better job of it than invoking
> +  vmalloc for each allocation, but it is undeniably less optimized wrt
> +  TLB use than using the physmap directly, through kmalloc or similar.
> +
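The locking rule for rare write (lock, copy, modify, rare-write, unlock) can be illustrated with the userspace sketch below, assuming POSIX threads. `rare_write_counter()` is a hypothetical stand-in for a rare-write primitive, built on `mprotect()`; it is not the prmem API. Without the mutex, two threads could read the same old value and one increment would be lost, even though each individual rare write completes correctly.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* A counter living in a write-protected page, and the lock that
 * serialises every read-modify-write of it. */
static long *counter;
static size_t page_size;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical rare-write stand-in: unprotect, copy, re-protect. */
static void rare_write_counter(long val)
{
	int rc = mprotect(counter, page_size, PROT_READ | PROT_WRITE);

	assert(rc == 0);
	memcpy(counter, &val, sizeof(val));
	rc = mprotect(counter, page_size, PROT_READ);
	assert(rc == 0);
	(void)rc;
}

/* Each iteration: lock -> copy -> modify -> rare write -> unlock. */
static void *bump(void *arg)
{
	int n = *(int *)arg;

	for (int i = 0; i < n; i++) {
		pthread_mutex_lock(&counter_lock);
		long copy = *counter;           /* 1. copy the current value */
		copy++;                         /* 2. modify the copy */
		rare_write_counter(copy);       /* 3. publish it */
		pthread_mutex_unlock(&counter_lock);
	}
	return NULL;
}

/* Runs 'threads' workers (1..8), each doing 'iters' increments. */
long locked_rare_write_demo(int threads, int iters)
{
	pthread_t tid[8];
	long result;

	assert(threads > 0 && threads <= 8);
	page_size = (size_t)sysconf(_SC_PAGESIZE);
	/* Anonymous pages start zero-filled, so the counter starts at 0. */
	counter = mmap(NULL, page_size, PROT_READ,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (counter == MAP_FAILED)
		return -1;
	for (int t = 0; t < threads; t++)
		pthread_create(&tid[t], NULL, bump, &iters);
	for (int t = 0; t < threads; t++)
		pthread_join(tid[t], NULL);
	result = *counter;
	munmap(counter, page_size);
	return result;
}
```

With the lock held across the whole sequence, `locked_rare_write_demo(4, 1000)` always ends with the counter at 4000.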
> +
> +Utilization
> +-----------
> +
> +**add examples here**
> +
> +API
> +---
> +
> +.. kernel-doc:: include/linux/prmem.h
> +.. kernel-doc:: mm/prmem.c
> +.. kernel-doc:: include/linux/prmemextra.h
> diff --git a/MAINTAINERS b/MAINTAINERS
> index ea979a5a9ec9..246b1a1cc8bb 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -9463,6 +9463,7 @@ F:	include/linux/prmemextra.h
>  F:	mm/prmem.c
>  F:	mm/test_write_rare.c
>  F:	mm/test_pmalloc.c
> +F:	Documentation/core-api/prmem.rst

I think the MAINTAINERS update can go in one chunk as the last patch in the
series.
 
>  MEMORY MANAGEMENT
>  L:	linux-mm at kvack.org
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.


