Commit cc91702

Merge branch 'for-7.1/block' into for-next
* for-7.1/block:
  selftests/ublk: add read-only buffer registration test
  selftests/ublk: add filesystem fio verify test for shmem_zc
  selftests/ublk: add hugetlbfs shmem_zc test for loop target
  selftests/ublk: add shared memory zero-copy test
  selftests/ublk: add UBLK_F_SHMEM_ZC support for loop target
  selftests/ublk: add shared memory zero-copy support in kublk
  ublk: eliminate permanent pages[] array from struct ublk_buf
  ublk: enable UBLK_F_SHMEM_ZC feature flag
  ublk: add PFN-based buffer matching in I/O path
  ublk: add UBLK_U_CMD_REG_BUF/UNREG_BUF control commands
2 parents dec615f + affb5f6 commit cc91702

12 files changed

Lines changed: 1287 additions & 8 deletions

Documentation/block/ublk.rst

Lines changed: 117 additions & 0 deletions
@@ -485,6 +485,123 @@ Limitations
in case that too many ublk devices are handled by this single io_ring_ctx
and each one has very large queue depth

Shared Memory Zero Copy (UBLK_F_SHMEM_ZC)
------------------------------------------

The ``UBLK_F_SHMEM_ZC`` feature provides an alternative zero-copy path
that works by sharing physical memory pages between the client application
and the ublk server. Unlike the io_uring fixed buffer approach above,
shared memory zero copy does not require io_uring buffer registration
per I/O. Instead, it relies on the kernel matching page frame numbers
(PFNs) at I/O time. This allows the ublk server to access the shared
buffer directly through its own mapping, unlike the io_uring fixed buffer
approach.

Motivation
~~~~~~~~~~

Shared memory zero copy takes a different approach: if the client
application and the ublk server both map the same physical memory, there is
nothing to copy. The kernel detects the shared pages automatically and
tells the server where the data already lives.

``UBLK_F_SHMEM_ZC`` can be thought of as a supplement for optimized client
applications: when the client is willing to allocate I/O buffers from
shared memory, the entire data path becomes zero-copy without any per-I/O
buffer registration overhead.

Use Cases
~~~~~~~~~

This feature is useful when the client application can be configured to
use a specific shared memory region for its I/O buffers:

- **Custom storage clients** that allocate I/O buffers from shared memory
  (memfd, hugetlbfs) and issue direct I/O to the ublk device
- **Database engines** that use pre-allocated buffer pools with ``O_DIRECT``

How It Works
~~~~~~~~~~~~

1. The ublk server and client both ``mmap()`` the same file (memfd or
   hugetlbfs) with ``MAP_SHARED``. This gives both processes access to the
   same physical pages.

2. The ublk server registers its mapping with the kernel::

     /* pseudocode: issue the control command with ctrl_cmd.addr = &buf */
     struct ublk_shmem_buf_reg buf = { .addr = mmap_va, .len = size };
     ublk_ctrl_cmd(UBLK_U_CMD_REG_BUF, .addr = &buf);

   The kernel pins the pages and builds a PFN lookup tree.

3. When the client issues direct I/O (``O_DIRECT``) to ``/dev/ublkb*``,
   the kernel checks whether the I/O buffer pages match any registered
   pages by comparing PFNs.

4. On a match, the kernel sets ``UBLK_IO_F_SHMEM_ZC`` in the I/O
   descriptor and encodes the buffer index and offset in ``addr``::

     if (iod->op_flags & UBLK_IO_F_SHMEM_ZC) {
         /* Data is already in our shared mapping: zero copy */
         index = ublk_shmem_zc_index(iod->addr);
         offset = ublk_shmem_zc_offset(iod->addr);
         buf = shmem_table[index].mmap_base + offset;
     }

5. If pages do not match (e.g., the client used a non-shared buffer),
   the I/O falls back to the normal copy path silently.

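Step 4 above can be extended with bounds checking before the server touches the mapping. The following is a minimal sketch, not the kublk implementation: ``struct shmem_entry``, its ``len`` field and ``shmem_resolve()`` are hypothetical names introduced for illustration, and only the ``shmem_table[index].mmap_base + offset`` expression comes from the snippet above. Decoding ``iod->addr`` is left to ``ublk_shmem_zc_index()`` and ``ublk_shmem_zc_offset()`` and is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-buffer record kept by the server, one per registered buffer. */
struct shmem_entry {
    unsigned char *mmap_base;  /* server's own MAP_SHARED mapping */
    size_t len;                /* registered length in bytes; 0 means slot unused */
};

/*
 * Translate an already decoded (index, offset) pair into a pointer inside
 * the server's mapping. Returns NULL when the pair does not describe a
 * valid range, so the caller can fail the request instead of touching
 * memory outside the registered buffer.
 */
static void *shmem_resolve(const struct shmem_entry *table, size_t nr_entries,
                           uint32_t index, uint64_t offset, uint32_t nr_bytes)
{
    const struct shmem_entry *e;

    assert(table != NULL);
    if (index >= nr_entries)
        return NULL;
    e = &table[index];
    if (e->mmap_base == NULL || e->len == 0)
        return NULL;
    /* Overflow-safe check that [offset, offset + nr_bytes) fits in the buffer. */
    if (offset > e->len || nr_bytes > e->len - offset)
        return NULL;
    return e->mmap_base + offset;
}
```

A server would call ``shmem_resolve()`` only for descriptors that carry ``UBLK_IO_F_SHMEM_ZC``. Requests without the flag take the normal copy path described in step 5.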
The shared memory can be set up via two methods:

- **Socket-based**: the client sends a memfd to the ublk server via
  ``SCM_RIGHTS`` on a unix socket. The server maps it with ``mmap()`` and
  registers the mapping.
- **Hugetlbfs-based**: both processes ``mmap(MAP_SHARED)`` the same
  hugetlbfs file. No IPC is needed, because the same file gives the same
  physical pages.

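The socket-based method can be illustrated with a small sketch that runs both ends in one process over a ``socketpair()``. It uses only standard Linux calls (``memfd_create(2)`` through ``syscall(2)``, ``sendmsg(2)`` and ``recvmsg(2)`` with ``SCM_RIGHTS``, ``mmap(2)``). The helper names ``send_fd()``, ``recv_fd()`` and ``shmem_share_demo()`` are made up for this example, and a real client and server would be separate processes connected by a named unix socket.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send fd over a unix socket as SCM_RIGHTS ancillary data; 0 on success. */
static int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = u.buf, .msg_controllen = sizeof(u.buf) };
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd sent by send_fd(); returns the new descriptor or -1. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = u.buf, .msg_controllen = sizeof(u.buf) };
    struct cmsghdr *cmsg;
    int fd = -1;

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg && cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS)
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/*
 * Single-process demo: the "client" end creates and maps a memfd, the
 * "server" end receives the fd and maps it too. Returns 0 when a store
 * through the client mapping is visible through the server mapping.
 */
static int shmem_share_demo(size_t size)
{
    int sv[2], memfd, server_fd = -1, ret = -1;
    char *client_map = MAP_FAILED, *server_map = MAP_FAILED;

    assert(size >= 16);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    /* memfd_create(2) via syscall(2), so no libc wrapper is required. */
    memfd = (int)syscall(SYS_memfd_create, "ublk-shmem-demo", 0u);
    if (memfd < 0 || ftruncate(memfd, (off_t)size) < 0)
        goto out;
    client_map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0);
    if (client_map == MAP_FAILED || send_fd(sv[0], memfd) < 0)
        goto out;
    server_fd = recv_fd(sv[1]);
    if (server_fd < 0)
        goto out;
    server_map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, server_fd, 0);
    if (server_map == MAP_FAILED)
        goto out;
    strcpy(client_map, "hello ublk");
    ret = strcmp(server_map, "hello ublk") == 0 ? 0 : -1;
out:
    if (server_map != MAP_FAILED) munmap(server_map, size);
    if (client_map != MAP_FAILED) munmap(client_map, size);
    if (server_fd >= 0) close(server_fd);
    if (memfd >= 0) close(memfd);
    close(sv[0]);
    close(sv[1]);
    return ret;
}
```

In a real deployment the server would pass the address and length of its mapping to ``UBLK_U_CMD_REG_BUF`` after the ``mmap()`` call, as described in step 2.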
Advantages
~~~~~~~~~~

- **Simple**: no per-I/O buffer registration or unregistration commands.
  Once the shared buffer is registered, all matching I/O is zero-copy
  automatically.
- **Direct buffer access**: the ublk server can read and write the shared
  buffer directly via its own mmap, without going through io_uring fixed
  buffer operations. This makes server implementations simpler.
- **Fast**: PFN matching is a single maple tree lookup per bvec. No
  io_uring command round trips are needed for buffer management.
- **Compatible**: non-matching I/O silently falls back to the copy path.
  The device works normally for any client, with zero-copy as an
  optimization when shared memory is available.

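The per-bvec lookup mentioned under **Fast** maps a page frame number to a registered buffer and an offset inside it. As a rough user-space analogy only, and not the kernel's maple tree code, the same idea can be sketched with a sorted array of PFN ranges and a binary search. ``struct pfn_range``, ``pfn_lookup()`` and ``DEMO_PAGE_SHIFT`` are hypothetical names, and 4 KiB pages are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/* One registered run of physically contiguous pages (hypothetical layout). */
struct pfn_range {
    uint64_t first_pfn;   /* first page frame number in the run */
    uint64_t nr_pages;    /* number of pages in the run */
    uint32_t buf_index;   /* registered buffer the run belongs to */
    uint64_t buf_offset;  /* byte offset of the run inside that buffer */
};

/*
 * Binary search over ranges that are sorted by first_pfn and do not overlap.
 * Returns 1 and fills *index / *offset on a hit. Returns 0 on a miss, which
 * corresponds to falling back to the copy path.
 */
static int pfn_lookup(const struct pfn_range *r, size_t n, uint64_t pfn,
                      uint32_t *index, uint64_t *offset)
{
    size_t lo = 0, hi = n;

    assert(index != NULL && offset != NULL);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (pfn < r[mid].first_pfn) {
            hi = mid;
        } else if (pfn - r[mid].first_pfn >= r[mid].nr_pages) {
            lo = mid + 1;
        } else {
            *index = r[mid].buf_index;
            *offset = r[mid].buf_offset +
                      ((pfn - r[mid].first_pfn) << DEMO_PAGE_SHIFT);
            return 1;
        }
    }
    return 0;
}
```

A range-keyed structure keeps the cost logarithmic in the number of registered runs rather than in the number of pages, which is the property the maple tree provides in the kernel.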
Limitations
~~~~~~~~~~~

- **Requires client cooperation**: the client must allocate its I/O
  buffers from the shared memory region. This requires a custom or
  configured client; standard applications using their own buffers
  will not benefit.
- **Direct I/O only**: buffered I/O (without ``O_DIRECT``) goes through
  the page cache, which allocates its own pages. These kernel-allocated
  pages will never match the registered shared buffer. Only ``O_DIRECT``
  puts the client's buffer pages directly into the block I/O.

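Both limitations above come down to where the client's I/O buffers live. One way a cooperating client might satisfy them is to carve its ``O_DIRECT`` buffers out of the shared mapping with a small allocator. The sketch below is illustrative only: ``struct shm_pool`` and ``shm_pool_alloc()`` are hypothetical names, and it assumes the pool base comes from ``mmap()`` (so it is page aligned) and that the requested alignment matches whatever the device requires for direct I/O.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bump allocator over a client's MAP_SHARED region. */
struct shm_pool {
    unsigned char *base;  /* start of the shared mapping */
    size_t size;          /* total bytes in the mapping */
    size_t used;          /* bytes handed out so far */
};

/*
 * Hand out len bytes whose offset from base is a multiple of align.
 * align must be a power of two. Returns NULL when the pool is exhausted,
 * in which case a client could fall back to a private buffer and accept
 * the normal copy path for that request.
 */
static void *shm_pool_alloc(struct shm_pool *p, size_t len, size_t align)
{
    size_t start;

    assert(align != 0 && (align & (align - 1)) == 0);
    start = (p->used + align - 1) & ~(align - 1);
    if (start > p->size || len > p->size - start)
        return NULL;
    p->used = start + len;
    return p->base + start;
}
```

Buffers returned by such a pool can be passed to ``read()`` or ``write()`` on a descriptor opened with ``O_DIRECT``. Their pages are the ones the server registered, so the PFN match described earlier can succeed.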
Control Commands
~~~~~~~~~~~~~~~~

- ``UBLK_U_CMD_REG_BUF``

  Register a shared memory buffer. ``ctrl_cmd.addr`` points to a
  ``struct ublk_shmem_buf_reg`` containing the buffer virtual address and size.
  Returns the assigned buffer index (>= 0) on success. The kernel pins
  the pages and builds the PFN lookup tree. The kernel handles queue
  freezing internally.

- ``UBLK_U_CMD_UNREG_BUF``

  Unregister a previously registered buffer. ``ctrl_cmd.data[0]`` is the
  buffer index. The kernel unpins the pages and removes their PFN entries
  from the lookup tree.

References
==========
