API Reference
Vectorized Memory Access
Basic Operations
KernelIntrinsics.vload — Function
vload(A::AbstractArray{T}, idx, ::Val{Nitem}, ::Val{Rebase}=Val(true)) -> NTuple{Nitem,T}

Load Nitem elements from array A as a tuple, using vectorized memory operations on GPU.
Arguments
- A: Source array
- idx: Starting index
- Nitem: Number of elements to load (must be a power of 2)
- Rebase: Indexing mode (default: Val(true))
Indexing Modes
- Val(true) (rebased): Index is multiplied by Nitem, so idx=2 loads elements [5,6,7,8] for Nitem=4. This mode generates optimal aligned vector loads (ld.global.v4).
- Val(false) (direct): Loads starting directly at idx, so idx=2 loads elements [2,3,4,5]. Handles misaligned access by decomposing into smaller aligned loads.
Example
a = CuArray{Int32}(1:16)
# Rebased indexing (default): idx=2 → loads elements 5,6,7,8
values = vload(a, 2, Val(4)) # returns (5, 6, 7, 8)
# Direct indexing: idx=2 → loads elements 2,3,4,5
values = vload(a, 2, Val(4), Val(false)) # returns (2, 3, 4, 5)

See also: vstore!
KernelIntrinsics.vstore! — Function
vstore!(A::AbstractArray{T}, idx, values::NTuple{Nitem,T}, ::Val{Rebase}=Val(true)) -> Nothing

Store Nitem elements from a tuple to array A, using vectorized memory operations on GPU.
Arguments
- A: Destination array
- idx: Starting index
- values: Tuple of Nitem elements to store
- Rebase: Indexing mode (default: Val(true))
Indexing Modes
- Val(true) (rebased): Index is multiplied by Nitem, so idx=2 stores to elements [5,6,7,8] for Nitem=4. This mode generates optimal aligned vector stores (st.global.v4).
- Val(false) (direct): Stores starting directly at idx, so idx=2 stores to elements [2,3,4,5]. Handles misaligned access by decomposing into smaller aligned stores.
Example
b = CUDA.zeros(Int32, 16)
# Rebased indexing (default): idx=2 → stores to elements 5,6,7,8
vstore!(b, 2, (Int32(10), Int32(20), Int32(30), Int32(40)))
# Direct indexing: idx=2 → stores to elements 2,3,4,5
vstore!(b, 2, (Int32(10), Int32(20), Int32(30), Int32(40)), Val(false))

See also: vload
Dynamic Alignment Operations
These functions handle arbitrary starting indices by computing alignment at runtime and dispatching to the appropriate vectorized instruction pattern.
KernelIntrinsics.vload_multi — Function
vload_multi(A::AbstractArray{T}, i, ::Val{Nitem}) -> NTuple{Nitem,T}

Load Nitem elements from array A starting at index i, automatically handling arbitrary alignment.
Computes alignment at runtime (mod = (pointer_offset + i - 1) % Nitem + 1) and dispatches to a statically-compiled load pattern via a switch table. This generates a mix of ld.global.v4, ld.global.v2, and scalar loads to maximize throughput.
Example
src = cu(Int32.(1:100))
# Works for any starting index — alignment handled automatically
values = vload_multi(src, 7, Val(8)) # loads elements 7:14

See also: vload, vstore_multi!
KernelIntrinsics.vstore_multi! — Function
vstore_multi!(A::AbstractArray{T}, i, values::NTuple{Nitem,T}) -> Nothing

Store Nitem elements to array A starting at index i, automatically handling arbitrary alignment.
Computes alignment at runtime (mod = (pointer_offset + i - 1) % Nitem + 1) and dispatches to a statically-compiled store pattern via a switch table. This generates a mix of st.global.v4, st.global.v2, and scalar stores to maximize throughput.
Example
dst = cu(zeros(Int32, 100))
# Works for any starting index — alignment handled automatically
vstore_multi!(dst, 7, (Int32(1), Int32(2), Int32(3), Int32(4)))

See also: vstore!, vload_multi
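The two functions compose naturally into a copy kernel that tolerates arbitrary starting offsets. Below is a minimal sketch under stated assumptions: the kernel name `copy_window!` is hypothetical, and CUDA.jl plus KernelAbstractions.jl are assumed to be loaded alongside KernelIntrinsics.

```julia
using CUDA, KernelAbstractions, KernelIntrinsics

# Hypothetical sketch: each work-item copies 8 elements starting at an
# arbitrary (possibly misaligned) offset. vload_multi/vstore_multi! pick
# aligned v4/v2/scalar accesses at runtime, so `offset` need not be a
# multiple of 8.
@kernel function copy_window!(dst, src, offset)
    i = @index(Global, Linear)
    start = offset + (i - 1) * 8
    vals = vload_multi(src, start, Val(8))
    vstore_multi!(dst, start, vals)
end

src = cu(Int32.(1:128))
dst = CUDA.zeros(Int32, 128)
# offset=3 is deliberately misaligned; 8 work-items cover elements 3:66
copy_window!(CUDABackend())(dst, src, 3; ndrange=8)
```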
Memory Ordering
KernelIntrinsics.@access — Macro
@access [Scope] [Ordering] expr

Perform a memory load or store with specified scope and ordering semantics.
This macro provides fine-grained control over memory ordering for lock-free synchronization patterns on GPU. It generates appropriate ld.acquire or st.release PTX instructions.
Arguments
- Scope (optional): Visibility scope, one of Device (default), Workgroup, or System
- Ordering (optional): Memory ordering (see below)
- expr: Either a load (var = array[idx]) or store (array[idx] = value) expression
Orderings
For loads (default: Acquire):
- Acquire: Subsequent reads see all writes before the corresponding release
- Relaxed: No ordering guarantees
- Volatile: Volatile load (scope-less)
- Weak: Weak load (scope-less)
For stores (default: Release):
- Release: Prior writes are visible before this store
- Relaxed: No ordering guarantees
- Volatile: Volatile store (scope-less)
- Weak: Weak store (scope-less)
AcqRel and SeqCst are not valid for individual loads/stores (use @fence instead). Volatile and Weak cannot have an explicit scope.
Syntax Forms
@access array[idx] = value # Release store (default)
@access var = array[idx] # Acquire load (default)
@access array[idx] # Acquire load, returns value directly
@access Release array[idx] = value # Explicit ordering
@access Acquire var = array[idx] # Explicit ordering
@access Device Release array[idx] = value # Explicit scope and ordering

Example
@kernel function producer_consumer(X, Flag)
if @index(Global, Linear) == 1
X[1] = 42
@access Flag[1] = 1 # Release store: X[1]=42 visible before Flag[1]=1
end
# Other threads wait
while (@access Acquire Flag[1]) != 1
end
# Now X[1] is guaranteed to be 42
end

See also: @fence
KernelIntrinsics.@fence — Macro
@fence [Scope] [Ordering]

Insert a memory fence with specified scope and ordering.
A memory fence ensures that memory operations before the fence are visible to other threads before operations after the fence. This is essential for correct synchronization in parallel GPU code.
Arguments
- Scope (optional): Visibility scope, one of Device (default), Workgroup, or System
- Ordering (optional): Memory ordering, one of Acquire, Release, AcqRel (default), or SeqCst
Arguments can be specified in any order. Weak, Volatile, and Relaxed orderings are not valid for fences.
Generated PTX
- @fence → fence.acq_rel.gpu
- @fence Workgroup → fence.acq_rel.cta
- @fence System SeqCst → fence.sc.sys
Example
@kernel function synchronized_kernel(X, Flag)
X[1] = 10
@fence # Ensure X[1]=10 is visible before continuing
Flag[1] = 1
end
# Explicit scope and ordering
@fence Device AcqRel
@fence Workgroup Release
@fence System SeqCst
@fence SeqCst Device # Order doesn't matter

See also: @access
Warp Operations
KernelIntrinsics.@shfl — Macro
@shfl(direction, val, src, [warpsize=32], [mask=0xffffffff])

Perform a warp shuffle operation, exchanging values between lanes within a warp.
Arguments
- direction: Shuffle direction (Up, Down, Xor, or Idx)
- val: Value to shuffle (supports primitives, structs, and NTuples)
- src: Offset (for Up/Down), XOR mask (for Xor), or source lane (for Idx)
- warpsize: Warp size (default: 32)
- mask: Lane participation mask (default: 0xffffffff for all lanes)
Example
@kernel function shfl_kernel(dst, src)
I = @index(Global, Linear)
val = src[I]
shuffled = @shfl(Up, val, 1) # Get value from lane below
shuffled = @shfl(Down, val, 1) # Get value from lane above
shuffled = @shfl(Xor, val, 1) # Swap with adjacent lane
shuffled = @shfl(Idx, val, 0) # Broadcast lane 0 to all
dst[I] = shuffled
end

See also: @warpreduce, @warpfold
KernelIntrinsics.@warpreduce — Macro
@warpreduce(val, lane, [op=+], [warpsize=32], [mask=0xffffffff])

Perform an inclusive prefix scan within a warp.
Each lane i accumulates values from lanes 1 to i using the specified operator. Uses shuffle-up operations internally.
Arguments
- val: Value to reduce (modified in-place)
- lane: Current lane index (1-based)
- op: Binary reduction operator (default: +)
- warpsize: Warp size (default: 32)
- mask: Lane participation mask (default: 0xffffffff)
Example
@kernel function scan_kernel(dst, src)
I = @index(Global, Linear)
val = src[I]
lane = (I - 1) % 32 + 1
@warpreduce(val, lane, +)
dst[I] = val # Contains prefix sum
end
# Input: [1, 2, 3, 4, ..., 32]
# Output: [1, 3, 6, 10, ..., 528]

KernelIntrinsics.@warpfold — Macro
@warpfold(val, lane, [op=+], [warpsize=32], [mask=0xffffffff])

Perform a warp-wide reduction, folding all values to a single result in lane 1.
Combines all values across the warp using the specified operator. Uses shuffle-down operations internally.
Arguments
- val: Value to reduce (modified in-place)
- lane: Current lane index (1-based, unused but kept for API consistency)
- op: Binary reduction operator (default: +)
- warpsize: Warp size (default: 32)
- mask: Lane participation mask (default: 0xffffffff)
Example
@kernel function reduce_kernel(dst, src)
I = @index(Global, Linear)
val = src[I]
lane = (I - 1) % 32 + 1
@warpfold(val, lane, +)
if lane == 1
dst[1] = val # Contains sum of all 32 values
end
end
# Input: [1, 2, 3, ..., 32]
# Output: dst[1] = 528

See also: @warpreduce, @shfl
KernelIntrinsics.@vote — Macro
@vote(mode, predicate, [mask=0xffffffff])

Perform a warp vote operation, evaluating a predicate across all lanes.
Arguments
- mode: Vote mode (All, Any, Uni, or Ballot)
- predicate: Boolean predicate to evaluate
- mask: Lane participation mask (default: 0xffffffff)
Example
@kernel function vote_kernel(dst, src, threshold)
I = @index(Global, Linear)
val = src[I]
all_above = @vote(All, val > threshold) # True if all lanes above threshold
any_above = @vote(Any, val > threshold) # True if any lane above threshold
uniform = @vote(Uni, val > threshold) # True if all lanes have same result
mask = @vote(Ballot, val > threshold) # Bitmask of which lanes are above
dst[I] = mask
end

See also: @shfl
KernelIntrinsics.Up — Type
Up <: Direction

Shuffle direction where each lane receives a value from a lane with a lower index.
@shfl(Up, val, offset): Lane i receives the value from lane i - offset. Lanes where i < offset keep their original value.
Result for offset=1, with each lane's value equal to its lane index: [1, 1, 2, 3, 4, ..., 31]
KernelIntrinsics.Down — Type
Down <: Direction

Shuffle direction where each lane receives a value from a lane with a higher index.
@shfl(Down, val, offset): Lane i receives the value from lane i + offset. Lanes where i + offset >= warpsize keep their original value.
Result for offset=1, with each lane's value equal to its lane index: [2, 3, 4, ..., 31, 32, 32]
KernelIntrinsics.Xor — Type
Xor <: Direction

Shuffle direction where each lane exchanges values based on XOR of lane indices.
@shfl(Xor, val, mask): Lane i receives the value from lane i ⊻ mask.
Common patterns:
- mask=1: Swap adjacent pairs (0↔1, 2↔3, ...)
- mask=16: Swap first and second half of warp
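Halving the XOR mask each round yields the classic butterfly reduction, which leaves the full warp sum in every lane. A minimal sketch (the kernel name `butterfly_sum!` is hypothetical; assumes CUDA.jl and KernelAbstractions.jl are loaded alongside KernelIntrinsics):

```julia
using CUDA, KernelAbstractions, KernelIntrinsics

# Illustrative butterfly (XOR) warp reduction: after log2(32) = 5 rounds,
# every lane holds the sum over all 32 lanes, not just lane 1.
@kernel function butterfly_sum!(dst, src)
    I = @index(Global, Linear)
    val = src[I]
    # Exchange distance halves each round: 16, 8, 4, 2, 1.
    for m in (16, 8, 4, 2, 1)
        val += @shfl(Xor, val, m)
    end
    dst[I] = val # every lane ends up with the warp-wide sum
end
```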
KernelIntrinsics.Idx — Type
Idx <: Direction

Shuffle direction where all lanes receive a value from a specific lane index.
@shfl(Idx, val, lane): All lanes receive the value from lane lane.
Useful for broadcasting a value from one lane to all others.
KernelIntrinsics.All — Type
All <: Mode

Vote mode that returns true if the predicate is true for all participating lanes.
KernelIntrinsics.Any — Type
Any <: Mode

Vote mode that returns true if the predicate is true for any participating lane.
KernelIntrinsics.Uni — Type
Uni <: Mode

Vote mode that returns true if the predicate has the same value across all participating lanes.
KernelIntrinsics.Ballot — Type
Ballot <: Mode

Vote mode that returns a UInt32 bitmask where bit i is set if lane i's predicate is true.
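Because the result is a plain bitmask, counting the lanes that satisfy a predicate reduces to a popcount of the ballot. A hedged sketch (the kernel name `count_positive!` is hypothetical; assumes CUDA.jl and KernelAbstractions.jl are loaded alongside KernelIntrinsics):

```julia
using CUDA, KernelAbstractions, KernelIntrinsics

# Illustrative sketch: each lane obtains the same ballot mask, so
# count_ones (Julia's popcount) gives the number of lanes whose
# element is positive.
@kernel function count_positive!(dst, src)
    I = @index(Global, Linear)
    ballot = @vote(Ballot, src[I] > 0)
    dst[I] = count_ones(ballot) # lanes with src > 0 in this warp
end
```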
Index
- KernelIntrinsics.All
- KernelIntrinsics.Any
- KernelIntrinsics.Ballot
- KernelIntrinsics.Direction
- KernelIntrinsics.Down
- KernelIntrinsics.Idx
- KernelIntrinsics.Mode
- KernelIntrinsics.Uni
- KernelIntrinsics.Up
- KernelIntrinsics.Xor
- KernelIntrinsics.vload
- KernelIntrinsics.vload_multi
- KernelIntrinsics.vstore!
- KernelIntrinsics.vstore_multi!
- KernelIntrinsics.@access
- KernelIntrinsics.@fence
- KernelIntrinsics.@shfl
- KernelIntrinsics.@vote
- KernelIntrinsics.@warpfold
- KernelIntrinsics.@warpreduce