[PATCH RFC 0/9] sk_buff: optimize layout for GRO

Thu Jul 22 16:04:32 UTC 2021

On 7/22/2021 12:10 AM, Paolo Abeni wrote:
> Hello,
>
> On Wed, 2021-07-21 at 11:15 -0700, Casey Schaufler wrote:
>> On 7/21/2021 9:44 AM, Paolo Abeni wrote:
>>> This is a very early draft - in a different world would be
>>> replaced by hallway discussion at in-person conference - aimed at
>>> outlining some ideas and collect feedback on the overall outlook.
>>> There are still bugs to be fixed, more test and benchmark need, etc.
>>>
>>> There are 3 main goals:
>>> - [try to] avoid the overhead for uncommon conditions at GRO time
>>>   (patches 1-4)
>>> - enable backpressure for the veth GRO path (patches 5-6)
>>> - reduce the number of cacheline used by the sk_buff lifecycle
>>>   from 4 to 3, at least in some common scenarios (patches 1,7-9).
>>>   The idea here is avoid the initialization of some fields and
>>>   control their validity with a bitmask, as presented by at least
>>>   Florian and Jesper in the past.
>> If I understand correctly, you're creating an optimized case
>> which excludes ct, secmark, vlan and UDP tunnel. Is this correct,
>> and if so, why those particular fields? What impact will this have
>> in the non-optimal (with any of the excluded fields) case?
> Thank you for the feedback.

You're most welcome. You did request comments.

>
> There are 2 different relevant points:
>
> - the GRO stage.
>   packets carring any of CT, dst, sk or skb_ext will do 2 additional
> conditionals per gro_receive WRT the current code. My understanding is
> that having any of such field set at GRO receive time is quite
> exceptional for real nic. All others packet will do 4 or 5 less
> conditionals, and will traverse a little less code.
>
> - sk_buff lifecycle
>   * packets carrying vlan and UDP will not see any differences: sk_buff
> lifecycle will stil use 4 cachelines, as currently does, and no
> additional conditional is introduced.
>   * packets carring nfct or secmark will see an additional conditional
> every time such field is accessed. The number of cacheline used will
> still be 4, as in the current code. My understanding is that when such
> access happens, there is already a relevant amount of "additional" code
> to be executed, the conditional overhead should not be measurable.

I'm responsible for some of that "additonal" code. If the secmark
is considered to be outside the performance critical data there are
changes I would like to make that will substantially improve the
performance of that "additional" code that would include a u64
secmark. If use of a secmark is considered indicative of a "slow"
path, the rationale for restricting it to u32, that it might impact
the "usual" case performance, seems specious. I can't say that I
understand all the nuances and implications involved. It does
appear that the changes you've suggested could negate the classic
argument that requires the u32 secmark.

>
> Cheers,
>
> Paolo
>