Enable SME2 Streaming SVE in ARM #9126
Added:
- Target::SME2 definition
- streaming_vector_bits in Target for SME2
- Auto-detect SME2 and streaming_vector_bits
- sme_streaming() scheduling directive in Func and Pipeline
- DeviceAPI::Host_SMEStreaming in IR "For"
- LowerSMEStreamingTasks pass to extract streaming closure
- Attribute in LoweredFunc for streaming closure
- LLVM Function attribute to control streaming mode
- NoInline to prevent the streaming closure from being inlined
- "aarch64_pstate_sm_body" to emit smstart/smstop transition
- Disable gather/scatter in SME streaming mode

Tests:
- Add correctness/sme_streaming
- Run simd_op_check_sve2 in SME streaming mode
- Add test to assert runtime streaming vscale
This PR is ready for review. I will touch on this in the dev meeting if I have a chance.
Reason:
While vector_bits is used across multiple target architectures,
streaming_vector_bits is AArch64-specific, so we chose to
use Target::Feature rather than a new member holding an arbitrary bit count.
- Removed the Target::streaming_vector_bits member variable
- Added Feature::SME_SVL{128,256,512,1024,2048}
Revert the changes in halide_error_vscale_invalid to avoid potential runtime-breaking changes,
because the streaming_vector_bits member variable has been removed.
Based on the feedback in dev meeting, |
Codecov Report

@@           Coverage Diff           @@
##             main    #9126   +/-   ##
=======================================
  Coverage        ?   69.83%
=======================================
  Files           ?      256
  Lines           ?    77781
  Branches        ?    18606
=======================================
  Hits            ?    54317
  Misses          ?    18007
  Partials        ?     5457
The high-order question is whether this should just be another top-level target feature for ARM processors. If so, no special loop annotation is required. It means one cannot generate both SME2 and NEON in the same pipeline, but having looked over the architecture spec I'm not convinced that is useful. It is slightly convenient in terms of bounds inference, but performance-wise, switching between the modes clears the entire vector state, so imposing a function call boundary there is hardly a problem. (The same is true for SVE/SVE2.) My reading of the architecture spec is that fine-grained switching between SME2 and one of the other vector extensions is not a great idea. Also, because it is a singular resource, its interaction with parallelism requires care at a level outside of Halide-generated code. I expect doing it this way limits the processing that can be specified, but that would be true inside the loop labelled SME2 anyway. This may have been discussed already, along with the question of whether to fail compilation or fall back to e.g. NEON. Really, the initial use case is specialized kernels written specifically for the SME2 hardware, so failing compilation is fine.
It is true that switching into and out of streaming mode has some overhead, so very frequent transitions (e.g. in the innermost loop) should be avoided for performance.
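To make the cost argument concrete, here is a minimal sketch that counts how often a streaming closure would be entered depending on which loop level is marked. It is a standalone model, not Halide code: `streaming_entries` and the loop-nest representation are hypothetical, and it assumes each entry costs one smstart/smstop pair.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: `extents` lists loop extents from outermost to
// innermost, and `streaming_level` is the index of the loop marked as
// streaming. The streaming closure is entered once per iteration of
// every loop that encloses the marked loop, so the number of entries
// is the product of the enclosing extents.
inline std::int64_t streaming_entries(const std::vector<int> &extents,
                                      std::size_t streaming_level) {
    assert(streaming_level < extents.size());
    std::int64_t entries = 1;
    for (std::size_t i = 0; i < streaming_level; ++i) {
        entries *= extents[i];
    }
    return entries;
}
```

Under this model, marking the outermost loop of a 64 x 64 x 16 nest gives a single entry, while marking the innermost loop gives 64 * 64 entries, each paying the mode-transition overhead.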
Enable SME2 Streaming SVE in ARM
This PR adds initial ARM SME2 streaming-mode support to Halide,
which lets us compute with longer SVE vector lengths on targets with SME2.

A new `sme_streaming(enable, var)` scheduling directive gives users the option to control which loop is computed in streaming mode.

The change introduces a new `Target::SME2` feature with supplemental features `Target::SME_SVLDDD`, where DDD represents the streaming vector length in bits (e.g. 128, 256, 512, ...). If `Target::SME2` is enabled, exactly one `Target::SME_SVLDDD` feature must be enabled as well. `natural_vector_size()` now depends on whether streaming mode is active, because the streaming vector length may differ from the non-streaming vector length.
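The feature rule and the mode-dependent vector size can be sketched in a few lines. This is a standalone model under stated assumptions, not the PR's implementation: `TargetModel`, `count_svl_features`, `features_are_consistent`, and this `natural_vector_size` signature are hypothetical names for illustration, and the lane count is assumed to be simply vector bits divided by element bits.

```cpp
#include <cassert>
#include <set>

// Hypothetical stand-ins for the target features named above.
enum class Feature { SME2, SME_SVL128, SME_SVL256, SME_SVL512, SME_SVL1024, SME_SVL2048 };

struct TargetModel {
    int vector_bits;             // non-streaming vector length in bits
    std::set<Feature> features;  // enabled features
    bool has(Feature f) const { return features.count(f) != 0; }
};

// Counts enabled SME_SVL* features; *bits receives the width of the last match.
inline int count_svl_features(const TargetModel &t, int *bits) {
    const Feature svl[] = {Feature::SME_SVL128, Feature::SME_SVL256, Feature::SME_SVL512,
                           Feature::SME_SVL1024, Feature::SME_SVL2048};
    const int widths[] = {128, 256, 512, 1024, 2048};
    int n = 0;
    for (int i = 0; i < 5; ++i) {
        if (t.has(svl[i])) { ++n; *bits = widths[i]; }
    }
    return n;
}

// The rule from the description: SME2 requires exactly one SME_SVL* feature.
inline bool features_are_consistent(const TargetModel &t) {
    int bits = 0;
    return !t.has(Feature::SME2) || count_svl_features(t, &bits) == 1;
}

// Lanes per vector for an element of `elem_bytes` bytes, in or out of streaming mode.
inline int natural_vector_size(const TargetModel &t, int elem_bytes, bool streaming) {
    assert(elem_bytes > 0 && features_are_consistent(t));
    int bits = t.vector_bits;
    if (streaming) {
        assert(t.has(Feature::SME2));
        count_svl_features(t, &bits);  // switch to the streaming vector length
    }
    return bits / (8 * elem_bytes);
}
```

For a model target with 128-bit non-streaming vectors plus `SME2` and `SME_SVL512`, a 4-byte element gets 4 lanes outside streaming mode and 16 lanes inside it.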
In Halide lowering, a new `LowerSMEStreamingTasks` pass is added, which extracts each streaming-mode loop into an internal closure function so that we can attach the LLVM function attributes that transition into and out of streaming mode:

- `aarch64_pstate_sm_body` to emit the smstart/smstop transition
- `NoInline` to prevent the streaming closure from being inlined into a non-streaming function

In CodeGen, `target_vscale()` depends on whether streaming mode is active, so it can vary within a Module, although it is constant within a Function boundary.
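As a rough picture of the closure extraction described above, the sketch below outlines a loop into a separate non-inlined function that receives everything it needs through a closure struct. It is plain C++ rather than Halide IR, and `StreamingClosure`, `streaming_task`, `run_pipeline`, and the `SKETCH_NOINLINE` macro are hypothetical names; the streaming attribute itself appears only as a comment, since applying it needs an SME-capable toolchain.

```cpp
#include <cassert>
#include <vector>

#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_NOINLINE __attribute__((noinline))
#else
#define SKETCH_NOINLINE
#endif

// Everything the outlined loop needs from the enclosing scope.
struct StreamingClosure {
    const float *in;
    float *out;
    int min;
    int extent;
};

// Outlined loop body. In the design described above, the generated
// function would carry "aarch64_pstate_sm_body" (so smstart/smstop are
// emitted around the body) plus NoInline; only the noinline part is
// modelled here.
SKETCH_NOINLINE void streaming_task(const StreamingClosure &c) {
    assert(c.extent >= 0);
    for (int x = c.min; x < c.min + c.extent; ++x) {
        c.out[x] = c.in[x] * 2.0f;
    }
}

// The caller stays in non-streaming mode and only crosses the function
// boundary to run the streaming loop.
inline std::vector<float> run_pipeline(const std::vector<float> &input) {
    std::vector<float> output(input.size());
    StreamingClosure c = {input.data(), output.data(), 0,
                          static_cast<int>(input.size())};
    streaming_task(c);
    return output;
}
```

Keeping the task out of line matters because the mode is a property of the whole function: if the body were inlined into a non-streaming caller, code generated for the streaming vector length would run with the non-streaming one.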
In streaming mode, vector type code generation and intrinsic selection are based on `Target::sme_streaming_vector_bits()` (streaming vscale).

In terms of coverage, this is almost the same as the existing SVE2 code generation; SME2-specific instructions have not been enabled yet.

Additionally, the following changes are implemented:

- Auto-detect `SME2` and `SME_SVLDDD` target features on the host CPU

Checklist