Commit 7fe6ac1

Merge tag 'for-7.1/block-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:

 - Add shared memory zero-copy I/O support for ublk, bypassing per-I/O
   copies between kernel and userspace by matching registered buffer
   PFNs at I/O time. Includes selftests.

 - Refactor bio integrity to support filesystem initiated integrity
   operations and arbitrary buffer alignment.

 - Clean up bio allocation, splitting bio_alloc_bioset() into clear
   fast and slow paths. Add bio_await() and bio_submit_or_kill()
   helpers, unify synchronous bi_end_io callbacks.

 - Fix zone write plug refcount handling and plug removal races. Add
   support for serializing zone writes at QD=1 for rotational zoned
   devices, yielding significant throughput improvements.

 - Add SED-OPAL ioctls for Single User Mode management and a
   STACK_RESET command.

 - Add io_uring passthrough (uring_cmd) support to the BSG layer.

 - Replace pp_buf in partition scanning with struct seq_buf.

 - zloop improvements and cleanups.

 - drbd genl cleanup, switching to pre_doit/post_doit.

 - NVMe pull request via Keith:
     - Fabrics authentication updates
     - Enhanced block queue limits support
     - Workqueue usage updates
     - A new write zeroes device quirk
     - Tagset cleanup fix for loop device

 - MD pull requests via Yu Kuai:
     - Fix raid5 soft lockup in retry_aligned_read()
     - Fix raid10 deadlock with check operation and nowait requests
     - Fix raid1 overlapping writes on writemostly disks
     - Fix sysfs deadlock on array_state=clear
     - Proactive RAID-5 parity building with llbitmap, with
       write_zeroes_unmap optimization for initial sync
     - Fix llbitmap barrier ordering, rdev skipping, and bitmap_ops
       version mismatch fallback
     - Fix bcache use-after-free and uninitialized closure
     - Validate raid5 journal metadata payload size
     - Various cleanups

 - Various other fixes, improvements, and cleanups

* tag 'for-7.1/block-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (146 commits)
  ublk: fix tautological comparison warning in ublk_ctrl_reg_buf
  scsi: bsg: fix buffer overflow in scsi_bsg_uring_cmd()
  block: refactor blkdev_zone_mgmt_ioctl
  MAINTAINERS: update ublk driver maintainer email
  Documentation: ublk: address review comments for SHMEM_ZC docs
  ublk: allow buffer registration before device is started
  ublk: replace xarray with IDA for shmem buffer index allocation
  ublk: simplify PFN range loop in __ublk_ctrl_reg_buf
  ublk: verify all pages in multi-page bvec fall within registered range
  ublk: widen ublk_shmem_buf_reg.len to __u64 for 4GB buffer support
  xfs: use bio_await in xfs_zone_gc_reset_sync
  block: add a bio_submit_or_kill helper
  block: factor out a bio_await helper
  block: unify the synchronous bi_end_io callbacks
  xfs: fix number of GC bvecs
  selftests/ublk: add read-only buffer registration test
  selftests/ublk: add filesystem fio verify test for shmem_zc
  selftests/ublk: add hugetlbfs shmem_zc test for loop target
  selftests/ublk: add shared memory zero-copy test
  selftests/ublk: add UBLK_F_SHMEM_ZC support for loop target
  ...
2 parents b8f82cb + 36446de commit 7fe6ac1

121 files changed

Lines changed: 5468 additions & 3088 deletions

Documentation/ABI/stable/sysfs-block

Lines changed: 15 additions & 0 deletions
@@ -886,6 +886,21 @@ Description:
 zone commands, they will be treated as regular block devices and
 zoned will report "none".

+What: /sys/block/<disk>/queue/zoned_qd1_writes
+Date: January 2026
+Contact: Damien Le Moal <dlemoal@kernel.org>
+Description:
+[RW] zoned_qd1_writes indicates if write operations to a zoned
+block device are being handled using a single issuer context (a
+kernel thread) operating at a maximum queue depth of 1. This
+attribute is visible only for zoned block devices. The default
+value for zoned block devices that are not rotational devices
+(e.g. ZNS SSDs or zoned UFS devices) is 0. For rotational zoned
+block devices (e.g. SMR HDDs) the default value is 1. Since
+this default may not be appropriate for some devices, e.g.
+remotely connected devices over high latency networks, the user
+can disable this feature by setting this attribute to 0.
+

 What: /sys/block/<disk>/hidden
 Date: March 2023
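
Reviewer sketch (not part of the commit): the attribute documented above is a plain read/write sysfs file, so a userspace tool could plausibly query or clear it as shown below. The helper names `qd1_attr_path()`, `read_int_attr()` and `write_int_attr()` are hypothetical and exist only for this illustration; only the attribute path itself comes from the documentation hunk.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build the attribute path for a disk name such as
 * "sdb". Returns 0 on success, -1 on an empty name or a short buffer. */
static int qd1_attr_path(char *buf, size_t len, const char *disk)
{
    assert(buf && disk);
    if (strlen(disk) == 0)
        return -1;
    int n = snprintf(buf, len, "/sys/block/%s/queue/zoned_qd1_writes", disk);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Hypothetical helper: read an integer attribute. Returns -1 if the file
 * is missing (the attribute is visible only for zoned block devices) or
 * does not contain an integer. */
static int read_int_attr(const char *path, int *value)
{
    assert(path && value);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = (fscanf(f, "%d", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Hypothetical helper: write an integer attribute, e.g. 0 to disable
 * the QD=1 write serialization. */
static int write_int_attr(const char *path, int value)
{
    assert(path);
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int ok = (fprintf(f, "%d\n", value) > 0);
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

With these helpers, a tool would build the path for a disk, call `read_int_attr()` to see the current setting, and call `write_int_attr(path, 0)` to opt out on a high-latency device. Writing to sysfs normally requires root privileges.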
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+What: /sys/devices/virtual/nvme-fabrics/ctl/.../tls_configured_key
+Date: November 2025
+KernelVersion: 6.19
+Contact: Linux NVMe mailing list <linux-nvme@lists.infradead.org>
+Description:
+The file is avaliable when using a secure concatanation
+connection to a NVMe target. Reading the file will return
+the serial of the currently negotiated key.
+
+Writing 0 to the file will trigger a PSK reauthentication
+(REPLACETLSPSK) with the target. After a reauthentication
+the value returned by tls_configured_key will be the new
+serial.

Documentation/admin-guide/blockdev/zoned_loop.rst

Lines changed: 9 additions & 1 deletion
@@ -62,7 +62,7 @@ The options available for the add command can be listed by reading the
 /dev/zloop-control device::

 $ cat /dev/zloop-control
-add id=%d,capacity_mb=%u,zone_size_mb=%u,zone_capacity_mb=%u,conv_zones=%u,base_dir=%s,nr_queues=%u,queue_depth=%u,buffered_io
+add id=%d,capacity_mb=%u,zone_size_mb=%u,zone_capacity_mb=%u,conv_zones=%u,max_open_zones=%u,base_dir=%s,nr_queues=%u,queue_depth=%u,buffered_io,zone_append=%u,ordered_zone_append,discard_write_cache
 remove id=%d

 In more details, the options that can be used with the "add" command are as
@@ -80,6 +80,9 @@ zone_capacity_mb Device zone capacity (must always be equal to or lower
 conv_zones Total number of conventioanl zones starting from
 sector 0
 Default: 8
+max_open_zones Maximum number of open sequential write required zones
+(0 for no limit).
+Default: 0
 base_dir Path to the base directory where to create the directory
 containing the zone files of the device.
 Default=/var/local/zloop.
@@ -104,6 +107,11 @@ ordered_zone_append Enable zloop mitigation of zone append reordering.
 (extents), as when enabled, this can significantly reduce
 the number of data extents needed to for a file data
 mapping.
+discard_write_cache Discard all data that was not explicitly persisted using a
+flush operation when the device is removed by truncating
+each zone file to the size recorded during the last flush
+operation. This simulates power fail events where
+uncommitted data is lost.
 =================== =========================================================

 3) Deleting a Zoned Device
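
Reviewer sketch (not part of the commit): the option string above is a comma-separated list in which some options take a value (`max_open_zones=%u`) and others are bare flags (`discard_write_cache`). The snippet below shows one way a tool might assemble an "add" command using two of the options introduced here. The struct `zloop_add_opts` and the function `zloop_format_add()` are hypothetical names for this illustration, and only a subset of the options is covered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical subset of the "add" options listed by /dev/zloop-control. */
struct zloop_add_opts {
    int id;
    unsigned int capacity_mb;
    unsigned int zone_size_mb;
    unsigned int max_open_zones;   /* 0 means no limit */
    bool discard_write_cache;      /* bare flag, takes no value */
};

/* Format an "add" command into buf. Returns 0 on success, -1 if the
 * buffer cannot hold the whole command. */
static int zloop_format_add(char *buf, size_t len,
                            const struct zloop_add_opts *o)
{
    assert(buf && o);
    int n = snprintf(buf, len,
                     "add id=%d,capacity_mb=%u,zone_size_mb=%u,max_open_zones=%u",
                     o->id, o->capacity_mb, o->zone_size_mb,
                     o->max_open_zones);
    if (n <= 0 || (size_t)n >= len)
        return -1;

    if (o->discard_write_cache) {
        const char *flag = ",discard_write_cache";
        /* Keep room for the flag plus the terminating NUL. */
        if ((size_t)n + strlen(flag) >= len)
            return -1;
        strcat(buf, flag);
    }
    return 0;
}
```

The resulting string is what a tool would then write to `/dev/zloop-control`. Options left out of the string keep the defaults given in the table, for example `max_open_zones` defaulting to 0.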

Documentation/block/inline-encryption.rst

Lines changed: 1 addition & 1 deletion
@@ -153,7 +153,7 @@ blk-crypto-fallback completes the original bio. If the original bio is too
 large, multiple bounce bios may be required; see the code for details.

 For decryption, blk-crypto-fallback "wraps" the bio's completion callback
-(``bi_complete``) and private data (``bi_private``) with its own, unsets the
+(``bi_end_io``) and private data (``bi_private``) with its own, unsets the
 bio's encryption context, then submits the bio. If the read completes
 successfully, blk-crypto-fallback restores the bio's original completion
 callback and private data, then decrypts the bio's data in-place using the

Documentation/block/ublk.rst

Lines changed: 119 additions & 0 deletions
@@ -485,6 +485,125 @@ Limitations
 in case that too many ublk devices are handled by this single io_ring_ctx
 and each one has very large queue depth

+Shared Memory Zero Copy (UBLK_F_SHMEM_ZC)
+------------------------------------------
+
+The ``UBLK_F_SHMEM_ZC`` feature provides an alternative zero-copy path
+that works by sharing physical memory pages between the client application
+and the ublk server. Unlike the io_uring fixed buffer approach above,
+shared memory zero copy does not require io_uring buffer registration
+per I/O — instead, it relies on the kernel matching physical pages
+at I/O time. This allows the ublk server to access the shared
+buffer directly, which is unlikely for the io_uring fixed buffer
+approach.
+
+Motivation
+~~~~~~~~~~
+
+Shared memory zero copy takes a different approach: if the client
+application and the ublk server both map the same physical memory, there is
+nothing to copy. The kernel detects the shared pages automatically and
+tells the server where the data already lives.
+
+``UBLK_F_SHMEM_ZC`` can be thought of as a supplement for optimized client
+applications — when the client is willing to allocate I/O buffers from
+shared memory, the entire data path becomes zero-copy.
+
+Use Cases
+~~~~~~~~~
+
+This feature is useful when the client application can be configured to
+use a specific shared memory region for its I/O buffers:
+
+- **Custom storage clients** that allocate I/O buffers from shared memory
+(memfd, hugetlbfs) and issue direct I/O to the ublk device
+- **Database engines** that use pre-allocated buffer pools with O_DIRECT
+
+How It Works
+~~~~~~~~~~~~
+
+1. The ublk server and client both ``mmap()`` the same file (memfd or
+hugetlbfs) with ``MAP_SHARED``. This gives both processes access to the
+same physical pages.
+
+2. The ublk server registers its mapping with the kernel::
+
+struct ublk_shmem_buf_reg buf = { .addr = mmap_va, .len = size };
+ublk_ctrl_cmd(UBLK_U_CMD_REG_BUF, .addr = &buf);
+
+The kernel pins the pages and builds a PFN lookup tree.
+
+3. When the client issues direct I/O (``O_DIRECT``) to ``/dev/ublkb*``,
+the kernel checks whether the I/O buffer pages match any registered
+pages by comparing PFNs.
+
+4. On a match, the kernel sets ``UBLK_IO_F_SHMEM_ZC`` in the I/O
+descriptor and encodes the buffer index and offset in ``addr``::
+
+if (iod->op_flags & UBLK_IO_F_SHMEM_ZC) {
+/* Data is already in our shared mapping — zero copy */
+index = ublk_shmem_zc_index(iod->addr);
+offset = ublk_shmem_zc_offset(iod->addr);
+buf = shmem_table[index].mmap_base + offset;
+}
+
+5. If pages do not match (e.g., the client used a non-shared buffer),
+the I/O falls back to the normal copy path silently.
+
+The shared memory can be set up via two methods:
+
+- **Socket-based**: the client sends a memfd to the ublk server via
+``SCM_RIGHTS`` on a unix socket. The server mmaps and registers it.
+- **Hugetlbfs-based**: both processes ``mmap(MAP_SHARED)`` the same
+hugetlbfs file. No IPC needed — same file gives same physical pages.
+
+Advantages
+~~~~~~~~~~
+
+- **Simple**: no per-I/O buffer registration or unregistration commands.
+Once the shared buffer is registered, all matching I/O is zero-copy
+automatically.
+- **Direct buffer access**: the ublk server can read and write the shared
+buffer directly via its own mmap, without going through io_uring fixed
+buffer operations. This is more friendly for server implementations.
+- **Fast**: PFN matching is a single maple tree lookup per bvec. No
+io_uring command round-trips for buffer management.
+- **Compatible**: non-matching I/O silently falls back to the copy path.
+The device works normally for any client, with zero-copy as an
+optimization when shared memory is available.
+
+Limitations
+~~~~~~~~~~~
+
+- **Requires client cooperation**: the client must allocate its I/O
+buffers from the shared memory region. This requires a custom or
+configured client — standard applications using their own buffers
+will not benefit.
+- **Direct I/O only**: buffered I/O (without ``O_DIRECT``) goes through
+the page cache, which allocates its own pages. These kernel-allocated
+pages will never match the registered shared buffer. Only ``O_DIRECT``
+puts the client's buffer pages directly into the block I/O.
+- **Contiguous data only**: each I/O request's data must be contiguous
+within a single registered buffer. Scatter/gather I/O that spans
+multiple non-adjacent registered buffers cannot use the zero-copy path.
+
+Control Commands
+~~~~~~~~~~~~~~~~
+
+- ``UBLK_U_CMD_REG_BUF``
+
+Register a shared memory buffer. ``ctrl_cmd.addr`` points to a
+``struct ublk_shmem_buf_reg`` containing the buffer virtual address and size.
+Returns the assigned buffer index (>= 0) on success. The kernel pins
+pages and builds the PFN lookup tree. Queue freeze is handled
+internally.
+
+- ``UBLK_U_CMD_UNREG_BUF``
+
+Unregister a previously registered buffer. ``ctrl_cmd.data[0]`` is the
+buffer index. Unpins pages and removes PFN entries from the lookup
+tree.
+
 References
 ==========
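
Reviewer sketch (not part of the commit): step 1 of "How It Works" rests on the fact that two `MAP_SHARED` mappings of the same memfd are backed by the same physical pages. The snippet below illustrates only that property, using two mappings inside one process as a stand-in for the client and server processes. It does not touch any ublk interface, and the names `shared_views`, `shared_views_create()` and `shared_views_destroy()` are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Two MAP_SHARED views of one memfd. In a real setup the "client" and
 * "server" views would live in different processes. */
struct shared_views {
    int fd;
    size_t len;
    void *client;
    void *server;
};

/* Unmap both views and close the memfd; safe on a partly built object. */
static void shared_views_destroy(struct shared_views *v)
{
    assert(v);
    if (v->client != MAP_FAILED)
        munmap(v->client, v->len);
    if (v->server != MAP_FAILED)
        munmap(v->server, v->len);
    if (v->fd >= 0)
        close(v->fd);
    v->client = v->server = MAP_FAILED;
    v->fd = -1;
}

/* Create a memfd of len bytes and map it twice with MAP_SHARED.
 * Returns 0 on success, -1 on failure. */
static int shared_views_create(struct shared_views *v, size_t len)
{
    assert(v && len > 0);
    v->len = len;
    v->client = v->server = MAP_FAILED;
    v->fd = memfd_create("shmem-zc-demo", MFD_CLOEXEC);
    if (v->fd < 0)
        return -1;
    if (ftruncate(v->fd, (off_t)len) < 0)
        goto fail;
    v->client = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, v->fd, 0);
    v->server = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, v->fd, 0);
    if (v->client == MAP_FAILED || v->server == MAP_FAILED)
        goto fail;
    return 0;
fail:
    shared_views_destroy(v);
    return -1;
}
```

Data stored through `client` becomes visible through `server` without any copy, even though the two virtual addresses differ. In the socket-based method described above, the memfd would be passed to the server over `SCM_RIGHTS` rather than mapped twice in one process.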

MAINTAINERS

Lines changed: 1 addition & 1 deletion
@@ -27015,7 +27015,7 @@ F: Documentation/filesystems/ubifs.rst
 F: fs/ubifs/

 UBLK USERSPACE BLOCK DRIVER
-M: Ming Lei <ming.lei@redhat.com>
+M: Ming Lei <tom.leiming@gmail.com>
 L: linux-block@vger.kernel.org
 S: Maintained
 F: Documentation/block/ublk.rst
