[PATCH RFC 0/9] sk_buff: optimize layout for GRO

Thu Jul 22 07:10:51 UTC 2021

Hello,

On Wed, 2021-07-21 at 11:15 -0700, Casey Schaufler wrote:
> On 7/21/2021 9:44 AM, Paolo Abeni wrote:
> > This is a very early draft - in a different world would be
> > replaced by hallway discussion at in-person conference - aimed at
> > outlining some ideas and collect feedback on the overall outlook.
> > There are still bugs to be fixed, more test and benchmark need, etc.
> > 
> > There are 3 main goals:
> > - [try to] avoid the overhead for uncommon conditions at GRO time
> >   (patches 1-4)
> > - enable backpressure for the veth GRO path (patches 5-6)
> > - reduce the number of cacheline used by the sk_buff lifecycle
> >   from 4 to 3, at least in some common scenarios (patches 1,7-9).
> >   The idea here is avoid the initialization of some fields and
> >   control their validity with a bitmask, as presented by at least
> >   Florian and Jesper in the past.
> 
> If I understand correctly, you're creating an optimized case
> which excludes ct, secmark, vlan and UDP tunnel. Is this correct,
> and if so, why those particular fields? What impact will this have
> in the non-optimal (with any of the excluded fields) case?

Thank you for the feedback.

There are 2 different relevant points:

- the GRO stage.
  packets carring any of CT, dst, sk or skb_ext will do 2 additional
conditionals per gro_receive WRT the current code. My understanding is
that having any of such field set at GRO receive time is quite
exceptional for real nic. All others packet will do 4 or 5 less
conditionals, and will traverse a little less code.

- sk_buff lifecycle
  * packets carrying vlan and UDP will not see any differences: sk_buff
lifecycle will stil use 4 cachelines, as currently does, and no
additional conditional is introduced.
  * packets carring nfct or secmark will see an additional conditional
every time such field is accessed. The number of cacheline used will
still be 4, as in the current code. My understanding is that when such
access happens, there is already a relevant amount of "additional" code
to be executed, the conditional overhead should not be measurable.

Cheers,

Paolo