in case too many ublk devices are handled by this single io_ring_ctx
and each one has a very large queue depth

Shared Memory Zero Copy (UBLK_F_SHMEM_ZC)
-----------------------------------------
490+
The ``UBLK_F_SHMEM_ZC`` feature provides an alternative zero-copy path
that works by sharing physical memory pages between the client application
and the ublk server. Unlike the io_uring fixed buffer approach above,
shared memory zero copy does not require per-I/O io_uring buffer
registration; instead, the kernel matches page frame numbers (PFNs) at
I/O time. This also allows the ublk server to access the shared buffer
directly through its own mapping, which is not possible with the io_uring
fixed buffer approach.
499+
Motivation
~~~~~~~~~~
502+
Shared memory zero copy takes a different approach: if the client
application and the ublk server both map the same physical memory, there is
nothing to copy. The kernel detects the shared pages automatically and
tells the server where the data already lives.
507+
``UBLK_F_SHMEM_ZC`` can be thought of as a supplement for optimized client
applications: when the client is willing to allocate I/O buffers from
shared memory, the entire data path becomes zero-copy without any per-I/O
overhead.
512+
Use Cases
~~~~~~~~~
515+
This feature is useful when the client application can be configured to
use a specific shared memory region for its I/O buffers:

- **Custom storage clients** that allocate I/O buffers from shared memory
  (memfd, hugetlbfs) and issue direct I/O to the ublk device
- **Database engines** that use pre-allocated buffer pools with ``O_DIRECT``
522+
How It Works
~~~~~~~~~~~~
525+
1. The ublk server and client both ``mmap()`` the same file (memfd or
   hugetlbfs) with ``MAP_SHARED``. This gives both processes access to the
   same physical pages.
529+
2. The ublk server registers its mapping with the kernel::

       struct ublk_shmem_buf_reg buf = { .addr = mmap_va, .len = size };

       /* issue UBLK_U_CMD_REG_BUF with ctrl_cmd.addr set to &buf */

   The kernel pins the pages and builds a PFN lookup tree.
536+
3. When the client issues direct I/O (``O_DIRECT``) to ``/dev/ublkb*``,
   the kernel checks whether the I/O buffer pages match any registered
   pages by comparing PFNs.
540+
4. On a match, the kernel sets ``UBLK_IO_F_SHMEM_ZC`` in the I/O
   descriptor and encodes the buffer index and offset in ``addr``::

       if (iod->op_flags & UBLK_IO_F_SHMEM_ZC) {
               /* Data is already in our shared mapping - zero copy */
               index = ublk_shmem_zc_index(iod->addr);
               offset = ublk_shmem_zc_offset(iod->addr);
               buf = shmem_table[index].mmap_base + offset;
       }
550+
5. If the pages do not match (e.g., the client used a non-shared buffer),
   the I/O silently falls back to the normal copy path.
553+
The shared memory can be set up in two ways:

- **Socket-based**: the client sends a memfd to the ublk server via
  ``SCM_RIGHTS`` on a unix socket. The server mmaps and registers it.
- **Hugetlbfs-based**: both processes ``mmap(MAP_SHARED)`` the same
  hugetlbfs file. No IPC is needed; mapping the same file yields the same
  physical pages.
560+
Advantages
~~~~~~~~~~
563+
- **Simple**: no per-I/O buffer registration or unregistration commands.
  Once the shared buffer is registered, all matching I/O is zero-copy
  automatically.
- **Direct buffer access**: the ublk server can read and write the shared
  buffer directly via its own mmap, without going through io_uring fixed
  buffer operations. This is friendlier for server implementations.
- **Fast**: PFN matching is a single maple tree lookup per bvec, with no
  io_uring command round-trips for buffer management.
- **Compatible**: non-matching I/O silently falls back to the copy path.
  The device works normally for any client, with zero-copy applied as an
  optimization when shared memory is available.
575+
Limitations
~~~~~~~~~~~
578+
- **Requires client cooperation**: the client must allocate its I/O
  buffers from the shared memory region. This requires a custom or
  specially configured client; standard applications using their own
  buffers will not benefit.
- **Direct I/O only**: buffered I/O (without ``O_DIRECT``) goes through
  the page cache, which allocates its own pages. These kernel-allocated
  pages will never match the registered shared buffer. Only ``O_DIRECT``
  puts the client's buffer pages directly into the block I/O.
587+
Control Commands
~~~~~~~~~~~~~~~~
590+
- ``UBLK_U_CMD_REG_BUF``

  Register a shared memory buffer. ``ctrl_cmd.addr`` points to a
  ``struct ublk_shmem_buf_reg`` containing the buffer virtual address and
  size. Returns the assigned buffer index (>= 0) on success. The kernel
  pins the pages and builds the PFN lookup tree. Queue freeze is handled
  internally.

- ``UBLK_U_CMD_UNREG_BUF``

  Unregister a previously registered buffer. ``ctrl_cmd.data[0]`` is the
  buffer index. Unpins the pages and removes their PFN entries from the
  lookup tree.
604+
References
==========
