Skip to content

XDP & AF_XDP for IAVF based on libie and page-pool#19

Open
michalQb wants to merge 40 commits into
alobakin:net-nextfrom
michalQb:xdp-for-iavf-libie-pp
Open

XDP & AF_XDP for IAVF based on libie and page-pool#19
michalQb wants to merge 40 commits into
alobakin:net-nextfrom
michalQb:xdp-for-iavf-libie-pp

Conversation

@michalQb

Copy link
Copy Markdown

No description provided.

alobakin and others added 30 commits March 24, 2023 20:33
Not a secret there's a ton of code duplication between two and more Intel
ethernet modules.
Before introducing new changes, which would need to be copied over again,
start decoupling the already existing duplicate functionality into a new
module, which will be shared between several Intel Ethernet drivers.
Add the lookup table which converts 8/10-bit hardware packet type into
a parsed bitfield structure for easy checking packet format parameters,
such as payload level, IP version, etc. This is currently used by i40e,
ice and iavf and it's all the same in all three drivers.
The only difference introduced in this implementation is that instead of
defining a 256 (or 1024 in case of ice) element array, add unlikely()
condition to limit the input to 154 (current maximum non-reserved packet
type). There's no reason to waste 600 (or even 3600) bytes only to not
hurt very unlikely exception packets.
The hash computation function now takes payload level directly as a
pkt_hash_type. There's a couple cases when non-IP ptypes are marked as
L3 payload and in the previous versions their hash level would be 2, not
3. But skb_set_hash() only sees difference between L4 and non-L4, thus
this won't change anything at all.
The module is behind the hidden Kconfig symbol, which the drivers will
select when needed. The exports are behind 'LIBIE' namespace to limit
the scope of the functions.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Ever since build_skb() became stable, the old way with allocating an skb
for storing the headers separately, which will be then copied manually,
was slower, less flexible and thus obsolete.

* it had higher pressure on MM since it actually allocates new pages,
  which then get split and refcount-biased (NAPI page cache);
* it implies memcpy() of packet headers (40+ bytes per each frame);
* the actual header length was calculated via eth_get_headlen(), which
  invokes Flow Dissector and thus wastes a bunch of CPU cycles;
* XDP makes it even more weird since it requires headroom for long and
  also tailroom for some time (since mbuf landed). Take a look at the
  ice driver, which is built around work-arounds to make XDP work with
  it.

Even on some quite low-end hardware (not a common case for 100G NICs) it
was performing worse.
The only advantage "legacy-rx" had is that it didn't require any
reserved headroom and tailroom. But iavf didn't use this, as it always
splits pages into two halves of 2k, while that save would only be useful
when striding. And again, XDP effectively removes that sole pro.

There's a train of features to land in IAVF soon: Page Pool, XDP, XSk,
multi-buffer etc. Each new would require adding more and more Danse
Macabre for absolutely no reason, besides making hotpath less and less
effective.
Remove the "feature" with all the related code. This includes at least
one very hot branch (typically hit on each new frame), which was either
always-true or always-false at least for a complete NAPI bulk of 64
frames, the whole private flags cruft and so on. Some stats:

Function: add/remove: 0/2 grow/shrink: 0/7 up/down: 0/-774 (-774)
RO Data: add/remove: 0/1 grow/shrink: 0/0 up/down: 0/-40 (-40)

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
The Rx hotpath code of IAVF is not well-optimized TBH. Before doing any
further buffer model changes, shake it up a bit. Notably:

1. Cache more variables on the stack.
   DMA device, Rx page size, NTC -- these are the most common things
   used all throughout the hotpath, often in loops on each iteration.
   Instead of fetching (or even calculating, as with the page size) them
   from the ring all the time, cache them on the stack at the beginning
   of the NAPI polling callback. NTC will be written back at the end,
   the rest are used read-only, so no sync needed.
2. Don't move the recycled buffers around the ring.
   The idea of passing the page of the right-now-recycled-buffer to a
   different buffer, in this case, the first one that needs to be
   allocated, moreover, on each new frame, is fundamentally wrong. It
   involves a few o' fetches, branches and then writes (and one Rx
   buffer struct is at least 32 bytes) where they're completely unneeded,
   but gives no good -- the result is the same as if we'd recycle it
   inplace, at the same position where it was used. So drop this and let
   the main refilling function take care of all the buffers, which were
   processed and now need to be recycled/refilled.
3. Don't allocate with %GPF_ATOMIC on ifup.
   This involved introducing the @gfp parameter to a couple functions.
   Doesn't change anything for Rx -> softirq.
4. 1 budget unit == 1 descriptor, not skb.
   There could be underflow when receiving a lot of fragmented frames.
   If each of them would consist of 2 frags, it means that we'd process
   64 descriptors at the point where we pass the 32th skb to the stack.
   But the driver would count that only as a half, which could make NAPI
   re-enable interrupts prematurely and create unnecessary CPU load.
5. Shortcut !size case.
   It's super rare, but possible -- for example, if the last buffer of
   the fragmented frame contained only FCS, which was then stripped by
   the HW. Instead of checking for size several times when processing,
   quickly reuse the buffer and jump to the skb fields part.
6. Refill the ring after finishing the polling loop.
   Previously, the loop wasn't starting a new iteration after the 64th
   desc, meaning that we were always leaving 16 buffers non-refilled
   until the next NAPI poll. It's better to refill them while they're
   still hot, so do that right after exiting the loop as well.
   For a full cycle of 64 descs, there will be 4 refills of 16 descs
   from now on.

Function: add/remove: 4/2 grow/shrink: 0/5 up/down: 473/-647 (-174)

+ up to 2% performance.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
As an intermediate step, remove all page splitting/recyclig code. Just
always allocate a new page and don't touch its refcount, so that it gets
freed by the core stack later.
The change allows to greatly simplify certain parts of the code:

Function: add/remove: 2/3 grow/shrink: 0/5 up/down: 543/-963 (-420)

&iavf_rx_buf can even now retire in favor of just storing an array of
pages used for Rx. Their DMA addresses can be stored in page::dma_addr
-- use Page Pool's function for that.
No surprise perf loses up to 30% here, but that regression will go away
once PP lands.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
The current scheme with trying to pick the smallest buffer possible for
the current MTU in order to flip/split pages is not very optimal.
For example, on default MTU of 1500 it gives only 192 bytes of headroom,
while XDP may require up to 258. But this also involves unnecessary code
complication, which sometimes is even hard to follow.
As page split is no more, always allocate order-0 pages. This optimizes
performance a bit and drops some bytes off the object code. Next, always
pick the maximum buffer length available for this %PAGE_SIZE to set it
up in the hardware. This means it now becomes a constant value, which
also has its positive impact.
On x64 this means (without XDP):

4096 page
64 head, 320 tail
3712 HW buffer size
3686 max MTU w/o frags

Previously, the maximum MTU w/o splitting a frame into several buffers
was 3046.
Increased buffer size allows us to reach the maximum frame size w/ frags
supported by HW: 16382 bytes (MTU 16356). Reflect it in the netdev
config as well. Relying on max single buffer size when calculating MTU
was not correct.
Move around a couple of fields in &iavf_ring after ::rx_buf_len removal
to reduce holes and improve cache locality.
Instead of providing the Rx definitions, which can and will be reused in
rest of the drivers, exclusively for IAVF, do that in the libie header.
Non-PP drivers could still use at least some of them and lose a couple
copied lines.

Function: add/remove: 0/0 grow/shrink: 3/9 up/down: 18/-265 (-247)

+ even reclaims a half percent of performance, nice.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Add a new flag, %PP_FLAG_DMA_MAP_WEAK, whill will tell PP to map pages
with %DMA_ATTR_WEAK_ORDERING.
To keep the code simple and optimized, map the following PP flags to DMA
map attr flags:

%PP_FLAG_DMA_MAP	=> %DMA_ATTR_SKIP_CPU_SYNC
%PP_FLAG_DMA_MAP_WEAK	=> %DMA_ATTR_WEAK_ORDERING

The first pair is done to be able to just pass it directly to
dma_map_page_attrs(). When a driver wants Page Pool to maintain DMA
mappings, it always sets this flag. Page Pool always skips CPU syncs
when mapping to do that separately later, so having those two 1:1 avoids
introducing ifs and/or bit-ors and keeps the code more compact.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Each driver is responsible for syncing buffers written by HW for CPU
before accessing them. Almost each PP-enabled driver uses the same
pattern, which could be shorthanded into a static inline to make driver
code a little bit more compact.
Introduce a pair of such functions. The first one takes the actual size
of the data written by HW and is the main one to be used on Rx. The
second picks max_len from the PP params and is designed for more extreme
cases when the size is unknown, but the buffer still needs to be synced.
Also constify pointer arguments of page_pool_get_dma_dir() and
page_pool_get_dma_addr() to give a bit more room for optimization, as
both of them are read-only.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Now that the IAVF driver simply uses dev_alloc_page() + free_page() with
no custom recycling logics and one whole page per frame, it can easily
be switched to using Page Pool API instead.
Introduce libie_rx_page_pool_create(), a wrapper for creating a PP with
the default libie settings applicable to all Intel hardware, and replace
the alloc/free calls with the corresponding PP functions, including the
newly added sync-for-CPU helpers. Use skb_mark_for_recycle() to bring
back the recycling and restore the initial performance.

From the important object code changes, worth mentioning that
__iavf_alloc_rx_pages() is now inlined due to the greatly reduced size.
The resulting driver is on par with the pre-series code and 1-2% slower
than the "optimized" version right before the recycling removal.
But the number of locs and object code bytes slaughtered is much more
important here after all, not speaking of that there's still a vast
space for optimization and improvements.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Next stop, per-queue private stats. They have only subtle differences
from driver to driver and can easily be resolved.
Define common structures, inline helpers and Ethtool helpers to collect,
update and export the statistics. Use u64_stats_t right from the start,
as well as the corresponding helpers to ensure tear-free operations.
For the NAPI parts of both Rx and Tx, also define small onstack
containers to update them in polling loops and then sync the actual
containers once a loop ends.
The drivers will be switched to use this API later on a per-driver
basis, along with conversion to PP.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Expand the libie generic per-queue stats with the generic Page Pool
stats provided by the API itself, when CONFIG_PAGE_POOL is enable.
When it's not, there'll be no such fields in the stats structure, so
no space wasted.
They are also a bit special in terms of how they are obtained. One
&page_pool accumulates statistics until it's destroyed obviously,
which happens on ifdown. So, in order to not lose any statistics,
get the stats and store in the queue container before destroying
a pool. This container survives ifups/downs, so it basically stores
the statistics accumulated since the very first pool was allocated
on this queue. When it's needed to export the stats, first get the
numbers from this container and then add the "live" numbers -- the
ones that the current active pool returns. The result values will
always represent the actual device-lifetime* stats.
There's a cast from &page_pool_stats to `u64 *` in a couple functions,
but they are guarded with stats asserts to make sure it's safe to do.
FWIW it saves a lot of object code.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
iavf is pretty much ready for using the generic libie stats, so drop all
the custom code and just use generic definitions. The only thing is that
it previously lacked the counter of Tx queue stops. It's present in the
other drivers, so add it here as well.
The rest is straightforward. There were two fields in the Tx stats
struct, which didn't belong there. The first one has never been used,
wipe it; and move the other to the queue structure. Plus move around
a couple fields in &iavf_ring to account stats structs' alignment.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Currently, the test relies on that only dropped ("xmitted") frames will
be recycled and if a frame became an skb, it will be freed later by the
stack and never come back to its page_pool.
So, it easily gets broken by trying to recycle skbs:

  test_xdp_do_redirect:PASS:pkt_count_xdp 0 nsec
  test_xdp_do_redirect:FAIL:pkt_count_zero unexpected pkt_count_zero:
actual 9936 != expected 2
  test_xdp_do_redirect:PASS:pkt_count_tc 0 nsec

That huge mismatch happened because after the TC ingress hook zeroes the
magic, the page gets recycled when skb is freed, not returned to the MM
layer. "Live frames" mode initializes only new pages and keeps the
recycled ones as is by design, so they appear with zeroed magic on the
Rx path again.
Expand the possible magic values from two: 0 (was "xmitted"/dropped or
did hit the TC hook) and 0x42 (hit the input XDP prog) to three: the new
one will mark frames hit the TC hook, so that they will elide both
@pkt_count_zero and @pkt_count_xdp. They can then be recycled to their
page_pool or returned to the page allocator, this won't affect the
counters anyhow. Just make sure to mark them as "input" (0x42) when they
appear on the Rx path again.
Also make an enum from those magics, so that they will be always visible
and can be changed in just one place anytime. This also eases adding any
new marks later on.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
skb_mark_for_recycle() is guarded with CONFIG_PAGE_POOL, this creates
unneeded complication when using it in the generic code. For now, it's
only used in the drivers always selecting Page Pool, so this works.
Move the guards so that preprocessor will cut out only the operation
itself and the function will still be a noop on !PAGE_POOL systems,
but available there as well.
No functional changes.

Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202303020342.Wi2PRFFH-lkp@intel.com
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
__xdp_build_skb_from_frame() state(d):

/* Until page_pool get SKB return path, release DMA here */

Page Pool got skb pages recycling in April 2021, but missed this
function.

xdp_release_frame() is relevant only for Page Pool backed frames and it
detaches the page from the corresponding page_pool in order to make it
freeable via page_frag_free(). It can instead just mark the output skb
as eligible for recycling if the frame is backed by a pp. No change for
other memory model types (the same condition check as before).
cpumap redirect and veth on Page Pool drivers now become zero-alloc (or
almost).

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
__xdp_build_skb_from_frame() was the last user of
{__,}xdp_release_frame(), which detaches pages from the page_pool.
All the consumers now recycle Page Pool skbs and page, except mlx5,
stmmac and tsnep drivers, which use page_pool_release_page() directly
(might change one day). It's safe to assume this functionality is not
needed anymore and can be removed (in favor of recycling).

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
The current version of Intel 'ice' driver allows for using zero
for the ring lenghth in 'configure queue' VIRTCHNL message.
Such a value indicates the ring should not be configured.

Implement the same handling in i40e driver. Instead of returning
an 'invalid parameter' error for zero-sized rings, just skip
that ring during queue pair configuration.

That unified handling is needed for AF_XDP implementation for
'iavf' driver. In that use case we sometimes need to configure
Tx ring only for a given queue pair.

Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
This flag was never set, so remove it and simplify buffer
cleaning process.

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Replace the existing ring mask (common for RX and TX rings)
in iavf_q_vector with two masks dedicated to handling RX and TX
rings separately.

The virtchnl interface allows separate masks to be used for different
ring types, so there is no need to merge them into a single mask.
Also, after adding XDP support to iavf, the number of RX and TX
rings can be asymmetric. Therefore, this patch is a necessary
preparation for XDP support.

Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
The XDP and AF_XDP feature is initialized using .ndo functions. Those
functions are always synchronous and may require some serious queues
reconfiguration including changing the number of queues.

Performing such a reconfiguration implies sending a bunch of VIRTCHNL
messages to the PF in order to disable queues, re-enable and re-configure
them, or update the RSS LUT.
By definition, those VIRTCHNL messages are sent asynchronously, so the
result of each VIRTCHNL operation can be received from the PF via admin
queue after some time.
Moreover, the previous implementation of some VIRTCHNL functions (e.g.
'iavf_disable_queues()' or 'iavf_enable_queues()' does not allow to call
them selectively for specific queues only.

In order to addres those problems and cover all scenarios of XDP and
AF_XDP initialization, implement a polling mechanism with a timeout for
blocking the execution of XDP .ndo functions until the result of
VIRTCHNL operation on PF is known to the driver.
Also, refactor the existing VIRTCHNL API by adding functions for
selective queue enabling, disabling and configuration.

Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
Introduce modular functions to allocate and initialize Rx and Tx rings
in order to prepare the initialization procedure to easily fit the XDP
setup.

Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
Extend basic structures of the driver (e.g. 'iavf_adapter', 'iavf_ring')
by adding members necessary to support XDP. Register those members using
required functions from BPF API.
Implement a support for XDP_TX and XDP_REDIRECT actions by adding
additional XDP Tx queues to transmit packets without interferring a
regular Tx traffic.
Finally, add required XDP setup and release calls to queue allocation
and deallocation functions respectively.

Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Add .ndo_bpf function to handle XDP_SETUP_PROG command.

In order to avoid synchronization issues, implement functions
dedicated to re-initialize only those parts of the interface which
are really necessary to setup the XDP program.
Such an approach is much lighter than performing a full reset of the
driver and thanks to it we can immediately know the result of traffic
initialization comparing to the reset task which triggers some
asynchronous events (e.g. link speed negotiation).

Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Implement basic XDP program setup, refactor data path
to use xdp_buff, implement XDP_PASS and XDP_DROP actions.

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Implement sending the packet from an XDP ring.
XDP path functions are separate from the general TX routines,
because this allows to simplify and therefore speedup the process.
It also makes code more friendly to future XDP-specific optimizations.

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Implement XDP_REDIRECT action and ndo_xdp_xmit() callback.

For now, packets redirected from CPU with index greater than
XDP queues number are just dropped with an error.
This is a rather common situation, especially when VF is configured
to run on host and will be addressed in later patches.

Patch also refactors RX XDP handling to use switch statement due to
increased number of actions.

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Port of commit 22bf877 ("ice: introduce XDP_TX fallback path").
The patch handles the case, when queue number is not sufficient for
the current number of CPUs. To avoid dropping some packets
redirected from other interfaces, XDP TxQs are allowed to be shared
between CPUs, which imposes the locking requirement.
Static key approach has little to none performance penalties
when sharing is not needed.

This mechanism is much more applicable when dealing with VFs.
In fact, maximum number of queue pairs that ice PF can give to
an iavf VF is 16, which allows up to 8 XDP TxQs, so without
XDP TxQ sharing, some redirected packets can be dropped even
on a 10 CPU system.

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Enable NETDEV_XDP_ACT_BASIC and NETDEV_XDP_ACT_REDIRECT
XDP features in netdev.

Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
michalQb and others added 10 commits April 14, 2023 15:10
Add necessary functions and data structures to support
AF_XDP feature.
Implement handling of 'XDP_SETUP_XSK_POOL' in .ndo_bpf().
Also, implement functions for selectively stopping only
those queues which take part in XDP socket creation.

Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
Implement Tx handling for AF_XDP feature in zero-copy mode.
Add '.ndo_xdp_xmit()' and '.ndo_xsk_wakeup()' implementations
to support AF_XDP Tx path.
Also, add Tx interrupt handling function for zero-copy mode.

Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
Implement RX packet processing specific to AF_XDP ZC.
All actions except XDP_PASS are supported, the skb path will
be implemented in later patches.

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
For now, filling the skb fields on Rx is a bit scattered across RQ
polling function. This makes it harder to reuse the code on XSk Rx
path and also sometimes costs some CPU (e.g. doing a lookup for the
decoded packet type two times).
Make it consistent and do everything in iavf_process_skb_fields(). First
of all, get the packet type and decode it. Then, move to hash, csum and
VLAN, which is moved here too. iavf_receive_skb() becomes then the
classic eth_type_trans() + napi_gro_receive() pair.
Finally, make the fields processing function global and the skb receive
function static inline in order to call them from a different file later
on.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Construct skb and fill in its fields, when AF_XDP
is enabled on the ring, if XDP program returns XDP_PASS.
(will be fixed up).

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
The existing implementation of 'iavf_request_traffic_irqs()'
function does not request any interrupt for q_vectors that
have no Tx nor Rx queues assigned to it. However, the function
'iavf_free_traffic_irqs()' releases interrupts for all q_vectors
unconditionally.
Such an approach may result in showing kernel warning about
an attempt of releasing the interrupt that was not requested.

In order to solve that potential issue make both functions
fully symmetric. Therefore, add the logic to 'iavf_free_traffic_irqs()'
for skipping not used q_vectors.

Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
When the number of queues is being changed by the user, the information
about a new queue number is kept in the adapter structure member
(num_req_queues). Such an information was always reset to zero just
after setting queue number request is processed.

However, that structure member should always provide an information
about user's preference regarding the requested queue number, so
it should be preserved for future driver reinitializations or
setting up the adapter for XDP program.

Remove setting the number of requested queues to zero and use that
value as a priority one during next reinitializations of the adapter,
in order to avoid the scenario when the queue count can be changed
automatically out of user's control.

Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
When XDP is enabled, our true maximum number of queue pairs
can be reduced up to being cut in half, ex. for system with 10 CPUs
and 16 queue pairs allowed by PF, normally maximum would be 10 queues,
but XDP requires 2 queue pairs per channel, so the maximum of 8 queues
can be used if program is attached.

The above fact has to be reflected in ethtool.

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
To avoid race between .ndo_xdp_xmit(), normal ZC TX processing
and XDP_TX in ZC mode when also sharing queues,
add locking to the later two.

Locking in .ndo_xdp_xmit() is already present.

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Enable NETDEV_XDP_ACT_XSK_ZEROCOPY feature in
netdev structure.

Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
return !!READ_ONCE(adapter->xdp_prog);
}

static inline struct xsk_buff_pool *iavf_xsk_pool(struct iavf_ring *ring)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants