[BUG] vrcompositor crashes on theater mode: vkAcquireNextImageKHR with pending semaphore + uninitialized descriptors cause GPU page fault #906

@Spacefish

Description

I think I found a new issue introduced by the timeline semaphores.
When I launch theater mode, SteamVR crashes reliably.

If I enable the Vulkan validation layers, it crashes before compositing the first frame (the display is gray, then it crashes with error code -203). So this is probably a timing-sensitive race condition.

Summary

When launching theater mode, the vrcompositor crashes reliably with a GPU page fault in the RenderThread. The crash chain involves multiple Vulkan API specification violations by vrcompositor, caught by VK_LAYER_KHRONOS_validation.


System Information

  • SteamVR version: 2.17.2 (build 1781214772)
  • Distribution: Ubuntu 26.04 LTS
  • Kernel: 7.1.0-rc4-spacy2026052101 (custom, kernel.org upstream + config changes)
  • GPU: AMD Radeon RX 7800 XT (Navi 32, RDNA3)
  • Vulkan driver: Mesa 26.2.0 (RADV for NAVI32), Mesa commit 7b286abe336
  • Desktop: KDE Plasma 6, Wayland
  • Headset: Valve Index (direct mode via wp_drm_lease_device_v1)

Crash Description

Sequence

  1. vrcompositor starts, initializes Vulkan, acquires DRM display lease via Wayland
  2. Theater mode is launched
  3. First crash (vrcompositor PID 27461): vkAcquireNextImageKHR returns VK_ERROR_SURFACE_LOST_KHR (-1000000000), and the compositor segfaults with a NULL pointer dereference (mov rax, [rsi] where RSI=0); the error handling path is missing
  4. vrcompositor restarts (PID 28694)
  5. GPU page fault: SQC (data) read at VRAM address 0x8001399000 with PERMISSION_FAULTS=0x3 (PTE exists but read access denied)
  6. GFX ring timeout: ring gfx_0.0.0 timeout, signaled seq=370360, emitted seq=370362
  7. Watchdog abort: Failed Watchdog timeout in thread Render in Present after 6.78s. Aborting. -> SIGABRT

Validation Layer Output (from vrcompositor-linux.txt)

1. vkAcquireNextImageKHR with busy semaphore

VUID-vkAcquireNextImageKHR-semaphore-01779
vkAcquireNextImageKHR(): Semaphore must not have any pending operations.
Semaphore 0xf600000000f6

This causes VK_ERROR_SURFACE_LOST_KHR, triggering the segfault in the error path.

2. Draw calls with uninitialized descriptors (repeated 10+ times)

VUID-vkCmdDrawIndexed-None-08114
vkCmdDrawIndexed(): the descriptor [VkDescriptorSet 0x4840000000484,
Set 0, Binding 9, Index 1] is being used in draw but has never been
updated via vkUpdateDescriptorSets() or a similar call.
VUID-vkCmdDraw-None-08114
vkCmdDraw(): the descriptor [VkDescriptorSet 0x4890000000489,
Set 0, Binding 9, Index 1] is being used in draw...

This is the direct cause of the GPU page fault: the shader reads from an uninitialized descriptor, which contains a garbage GPU VA. When the shader's SQC cache tries to read from that address, it hits a VRAM page without read access -> PERMISSION_FAULTS=0x3 -> GFX ring timeout.

3. Shader writes gl_Layer past framebuffer layer count

Undefined-Value-Layer-Written
Shader stage VK_SHADER_STAGE_VERTEX_BIT writes to Layer (gl_Layer)
but the framebuffer was created with layer count of 1
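
To see how often each violation shows up in a log such as vrcompositor-linux.txt, the messages can be tallied by ID. This is only a minimal sketch: `count_vuids` is a hypothetical helper, it assumes the log contains the IDs in the textual form quoted above, and it only matches IDs that end in a five-digit number (so the Undefined-Value-Layer-Written warning is not counted).

```python
import re
from collections import Counter

# Matches numbered validation IDs such as VUID-vkCmdDrawIndexed-None-08114.
# Assumption: the IDs appear verbatim in the log text, as in the excerpts above.
VUID_RE = re.compile(r"VUID-[A-Za-z0-9_-]+?-\d{5}\b")


def count_vuids(log_text: str) -> Counter:
    """Count how often each numbered VUID appears in validation-layer output."""
    return Counter(VUID_RE.findall(log_text))


# Small sample built from the messages quoted in this report.
sample_log = """\
VUID-vkAcquireNextImageKHR-semaphore-01779
vkAcquireNextImageKHR(): Semaphore must not have any pending operations.
VUID-vkCmdDrawIndexed-None-08114
VUID-vkCmdDraw-None-08114
VUID-vkCmdDrawIndexed-None-08114
"""

for vuid, n in count_vuids(sample_log).most_common():
    print(f"{n:4d}  {vuid}")
```

For the real log, read the file contents and pass them to `count_vuids` instead of `sample_log`.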

GPU Coredump (from /sys/class/drm/card1/device/devcoredump/data)

The fault is deterministic - same VRAM address 0x0000008001399000 across multiple runs:

[gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:255)
  Process vrcompositor pid 28694 thread RenderThread pid 28771
  in page starting at address 0x0000008001399000
  Faulty UTCL2 client ID: SQC (data) (0xa)
  PERMISSION_FAULTS: 0x3
  MAPPING_ERROR: 0x0
  RW: 0x0
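
The "same address across runs" check can be scripted if several dumps are saved. The following is a rough sketch, assuming each dump contains the "in page starting at address" line in the form shown above; `fault_addresses` and `is_deterministic` are hypothetical helper names, not part of any existing tool.

```python
import re

# Assumption: the faulting page is reported in the same textual form
# as in the excerpt above.
ADDR_RE = re.compile(r"in page starting at address (0x[0-9a-fA-F]+)")


def fault_addresses(dump_text: str) -> list[int]:
    """Return every faulting page address found in one dump, as integers."""
    return [int(addr, 16) for addr in ADDR_RE.findall(dump_text)]


def is_deterministic(dumps: list[str]) -> bool:
    """True if all dumps together report exactly one distinct fault address."""
    distinct = {addr for dump in dumps for addr in fault_addresses(dump)}
    return len(distinct) == 1


# Excerpt copied from the dump quoted above.
SAMPLE_DUMP = """\
[gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:255)
  Process vrcompositor pid 28694 thread RenderThread pid 28771
  in page starting at address 0x0000008001399000
  Faulty UTCL2 client ID: SQC (data) (0xa)
"""

print([hex(a) for a in fault_addresses(SAMPLE_DUMP)])
print(is_deterministic([SAMPLE_DUMP, SAMPLE_DUMP]))
```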

Root Cause

The vrcompositor has a race condition or missing synchronization in its render loop:

  1. Acquire semaphore reuse: vkAcquireNextImageKHR is called with a semaphore that still has a pending signal operation from the previous vkAcquireNextImageKHR. The spec requires that the semaphore passed in has no uncompleted signal or wait operations pending (VUID-vkAcquireNextImageKHR-semaphore-01779). This causes the swapchain to enter an error state.

  2. Descriptors used before update: Descriptor sets are bound to the pipeline and draw commands are issued before vkUpdateDescriptorSets() is called for those descriptor sets. The GPU reads garbage descriptor data, which contains invalid GPU addresses, causing the SQC page fault.
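
The semaphore rule in point 1 can be illustrated with a toy model. I have no insight into how vrcompositor manages its semaphores, so this is only a sketch of the usual pattern (one acquire semaphore per slot, reused only after the work that waited on it has finished). `AcquireSemaphoreRing` and its methods are hypothetical names, and nothing here calls into a real Vulkan driver.

```python
class AcquireSemaphoreRing:
    """Toy model of acquire semaphores rotated once per frame.

    pending[i] stands in for "semaphore i still has an uncompleted
    signal or wait operation". This is an illustration only.
    """

    def __init__(self, slots: int):
        self.pending = [False] * slots
        self.next_slot = 0

    def acquire(self) -> int:
        slot = self.next_slot
        if self.pending[slot]:
            # The situation VUID-vkAcquireNextImageKHR-semaphore-01779 reports.
            raise RuntimeError(f"semaphore {slot} still has pending operations")
        self.pending[slot] = True  # the acquire queues a signal on this semaphore
        self.next_slot = (slot + 1) % len(self.pending)
        return slot

    def retire(self, slot: int) -> None:
        # In real code: call this only after the submission that waited on
        # the semaphore has completed, e.g. after waiting on that frame's fence.
        self.pending[slot] = False


# Reusing the only semaphore before its work has retired trips the check.
single = AcquireSemaphoreRing(slots=1)
single.acquire()
try:
    single.acquire()
    reuse_detected = False
except RuntimeError:
    reuse_detected = True

# Retiring each slot before it comes around again keeps every acquire valid.
ring = AcquireSemaphoreRing(slots=2)
first = ring.acquire()
second = ring.acquire()
ring.retire(first)
third = ring.acquire()
```

The same idea applies to point 2: a descriptor set should only be bound for a draw after every binding the shader reads has been written.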


Workaround

The crash is timing-sensitive. Switching the Mesa RADV driver version can mask the race by changing GPU scheduling behavior, but the root cause is in vrcompositor.


Attachments

vrcompositor-linux.txt

dump_steamvr_crash.txt
