API Reference

Vectorized Memory Access

Basic Operations

KernelIntrinsics.vload (Function)
vload(A::AbstractArray{T}, idx, ::Val{Nitem}, ::Val{Rebase}=Val(true)) -> NTuple{Nitem,T}

Load Nitem elements from array A as a tuple, using vectorized memory operations on the GPU.

Arguments

  • A: Source array
  • idx: Starting index
  • Nitem: Number of elements to load (must be a power of 2)
  • Rebase: Indexing mode (default: Val(true))

Indexing Modes

  • Val(true) (rebased): idx is treated as a block index covering elements (idx-1)*Nitem+1 through idx*Nitem, so idx=2 loads elements [5,6,7,8] for Nitem=4. This mode generates optimal aligned vector loads (ld.global.v4).
  • Val(false) (direct): Loads starting directly at idx, so idx=2 loads elements [2,3,4,5]. Handles misaligned access by decomposing into smaller aligned loads.

Example

a = CuArray{Int32}(1:16)

# Rebased indexing (default): idx=2 → loads elements 5,6,7,8
values = vload(a, 2, Val(4))  # returns (5, 6, 7, 8)

# Direct indexing: idx=2 → loads elements 2,3,4,5
values = vload(a, 2, Val(4), Val(false))  # returns (2, 3, 4, 5)

See also: vstore!

KernelIntrinsics.vstore! (Function)
vstore!(A::AbstractArray{T}, idx, values::NTuple{Nitem,T}, ::Val{Rebase}=Val(true)) -> Nothing

Store Nitem elements from a tuple to array A, using vectorized memory operations on the GPU.

Arguments

  • A: Destination array
  • idx: Starting index
  • values: Tuple of Nitem elements to store
  • Rebase: Indexing mode (default: Val(true))

Indexing Modes

  • Val(true) (rebased): idx is treated as a block index covering elements (idx-1)*Nitem+1 through idx*Nitem, so idx=2 stores to elements [5,6,7,8] for Nitem=4. This mode generates optimal aligned vector stores (st.global.v4).
  • Val(false) (direct): Stores starting directly at idx, so idx=2 stores to elements [2,3,4,5]. Handles misaligned access by decomposing into smaller aligned stores.

Example

b = CUDA.zeros(Int32, 16)

# Rebased indexing (default): idx=2 → stores to elements 5,6,7,8
vstore!(b, 2, (Int32(10), Int32(20), Int32(30), Int32(40)))

# Direct indexing: idx=2 → stores to elements 2,3,4,5
vstore!(b, 2, (Int32(10), Int32(20), Int32(30), Int32(40)), Val(false))

See also: vload


Dynamic Alignment Operations

These functions handle arbitrary starting indices by computing alignment at runtime and dispatching to the appropriate vectorized instruction pattern.
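
To illustrate the decomposition, here is a host-side sketch of one way an access can split into aligned power-of-2 chunks. This is not the library's implementation; alignment_chunks is a hypothetical helper, and it assumes the array's base pointer is vector-aligned (pointer_offset = 0).

# Hypothetical helper: greedy split of Nitem elements starting at index i
# into power-of-2 chunks (capped at the v4 width) that are self-aligned.
function alignment_chunks(i, Nitem)
    chunks = Int[]
    pos, remaining = i, Nitem
    while remaining > 0
        # Widest chunk (1, 2, or 4) that both fits and is aligned at `pos`.
        width = 1
        while width < 4 && width * 2 <= remaining && (pos - 1) % (width * 2) == 0
            width *= 2
        end
        push!(chunks, width)
        pos += width
        remaining -= width
    end
    return chunks
end

alignment_chunks(7, 8)  # [2, 4, 2]: a v2, a v4, and a v2 access for elements 7:14
alignment_chunks(1, 8)  # [4, 4]: two aligned v4 accesses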

KernelIntrinsics.vload_multi (Function)
vload_multi(A::AbstractArray{T}, i, ::Val{Nitem}) -> NTuple{Nitem,T}

Load Nitem elements from array A starting at index i, automatically handling arbitrary alignment.

Computes alignment at runtime (mod = (pointer_offset + i - 1) % Nitem + 1) and dispatches to a statically-compiled load pattern via a switch table. This generates a mix of ld.global.v4, ld.global.v2, and scalar loads to maximize throughput.

Example

src = cu(Int32.(1:100))

# Works for any starting index — alignment handled automatically
values = vload_multi(src, 7, Val(8))  # loads elements 7:14

See also: vload, vstore_multi!

KernelIntrinsics.vstore_multi! (Function)
vstore_multi!(A::AbstractArray{T}, i, values::NTuple{Nitem,T}) -> Nothing

Store Nitem elements to array A starting at index i, automatically handling arbitrary alignment.

Computes alignment at runtime (mod = (pointer_offset + i - 1) % Nitem + 1) and dispatches to a statically-compiled store pattern via a switch table. This generates a mix of st.global.v4, st.global.v2, and scalar stores to maximize throughput.

Example

dst = cu(zeros(Int32, 100))

# Works for any starting index — alignment handled automatically
vstore_multi!(dst, 7, (Int32(1), Int32(2), Int32(3), Int32(4)))

See also: vstore!, vload_multi


Memory Ordering

KernelIntrinsics.@access (Macro)
@access [Scope] [Ordering] expr

Perform a memory load or store with specified scope and ordering semantics.

This macro provides fine-grained control over memory ordering for lock-free synchronization patterns on the GPU. It generates the appropriate PTX instructions, such as ld.acquire or st.release.

Arguments

  • Scope (optional): Visibility scope, one of Device (default), Workgroup, or System
  • Ordering (optional): Memory ordering (see below)
  • expr: Either a load (var = array[idx]) or store (array[idx] = value) expression

Orderings

For loads (default: Acquire):

  • Acquire: Subsequent reads see all writes before the corresponding release
  • Relaxed: No ordering guarantees
  • Volatile: Volatile load (scope-less)
  • Weak: Weak load (scope-less)

For stores (default: Release):

  • Release: Prior writes are visible before this store
  • Relaxed: No ordering guarantees
  • Volatile: Volatile store (scope-less)
  • Weak: Weak store (scope-less)

AcqRel and SeqCst are not valid for individual loads/stores (use @fence instead). Volatile and Weak cannot have an explicit scope.

Syntax Forms

@access array[idx] = value          # Release store (default)
@access var = array[idx]            # Acquire load (default)
@access array[idx]                  # Acquire load, returns value directly

@access Release array[idx] = value  # Explicit ordering
@access Acquire var = array[idx]    # Explicit ordering
@access Device Release array[idx] = value  # Explicit scope and ordering

Example

@kernel function producer_consumer(X, Flag)
    if @index(Global, Linear) == 1
        X[1] = 42
        @access Flag[1] = 1  # Release store: X[1]=42 visible before Flag[1]=1
    end

    # Other threads wait
    while (@access Acquire Flag[1]) != 1
    end
    # Now X[1] is guaranteed to be 42
end

See also: @fence

KernelIntrinsics.@fence (Macro)
@fence [Scope] [Ordering]

Insert a memory fence with specified scope and ordering.

A memory fence ensures that memory operations before the fence are visible to other threads before operations after the fence. This is essential for correct synchronization in parallel GPU code.

Arguments

  • Scope (optional): Visibility scope, one of Device (default), Workgroup, or System
  • Ordering (optional): Memory ordering, one of Acquire, Release, AcqRel (default), or SeqCst

Arguments can be specified in any order. Weak, Volatile, and Relaxed orderings are not valid for fences.

Generated PTX

  • @fence → fence.acq_rel.gpu
  • @fence Workgroup → fence.acq_rel.cta
  • @fence System SeqCst → fence.sc.sys

Example

@kernel function synchronized_kernel(X, Flag)
    X[1] = 10
    @fence  # Ensure X[1]=10 is visible before continuing
    Flag[1] = 1
end

# Explicit scope and ordering
@fence Device AcqRel
@fence Workgroup Release
@fence System SeqCst
@fence SeqCst Device  # Order doesn't matter

See also: @access


Warp Operations

KernelIntrinsics.@shfl (Macro)
@shfl(direction, val, src, [warpsize=32], [mask=0xffffffff])

Perform a warp shuffle operation, exchanging values between lanes within a warp.

Arguments

  • direction: Shuffle direction (Up, Down, Xor, or Idx)
  • val: Value to shuffle (supports primitives, structs, and NTuples)
  • src: Offset (for Up/Down), XOR mask (for Xor), or source lane (for Idx)
  • warpsize: Warp size (default: 32)
  • mask: Lane participation mask (default: 0xffffffff for all lanes)

Example

@kernel function shfl_kernel(dst, src)
    I = @index(Global, Linear)
    val = src[I]
    
    shuffled = @shfl(Up, val, 1)      # Get value from lane below
    shuffled = @shfl(Down, val, 1)    # Get value from lane above
    shuffled = @shfl(Xor, val, 1)     # Swap with adjacent lane
    shuffled = @shfl(Idx, val, 0)     # Broadcast lane 0 to all
    
    dst[I] = shuffled
end

See also: @warpreduce, @warpfold

KernelIntrinsics.@warpreduce (Macro)
@warpreduce(val, lane, [op=+], [warpsize=32], [mask=0xffffffff])

Perform an inclusive prefix scan within a warp.

Each lane i accumulates values from lanes 1 to i using the specified operator. Uses shuffle-up operations internally.

Arguments

  • val: Value to reduce (modified in-place)
  • lane: Current lane index (1-based)
  • op: Binary reduction operator (default: +)
  • warpsize: Warp size (default: 32)
  • mask: Lane participation mask (default: 0xffffffff)

Example

@kernel function scan_kernel(dst, src)
    I = @index(Global, Linear)
    val = src[I]
    lane = (I - 1) % 32 + 1
    
    @warpreduce(val, lane, +)
    
    dst[I] = val  # Contains prefix sum
end

# Input:  [1, 2, 3, 4, ..., 32]
# Output: [1, 3, 6, 10, ..., 528]
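
The docstring notes that shuffle-up operations are used internally; the classic shape of such a scan, sketched here with val, lane, and op as in the signature above (not necessarily the macro's exact expansion, and assuming @shfl accepts a runtime offset):

for s in (1, 2, 4, 8, 16)
    other = @shfl(Up, val, s)
    if lane > s
        val = op(val, other)  # lanes 1:s have no lane at distance s below them
    end
end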

See also: @warpfold, @shfl

KernelIntrinsics.@warpfold (Macro)
@warpfold(val, lane, [op=+], [warpsize=32], [mask=0xffffffff])

Perform a warp-wide reduction, folding all values to a single result in lane 1.

Combines all values across the warp using the specified operator. Uses shuffle-down operations internally.

Arguments

  • val: Value to reduce (modified in-place)
  • lane: Current lane index (1-based, unused but kept for API consistency)
  • op: Binary reduction operator (default: +)
  • warpsize: Warp size (default: 32)
  • mask: Lane participation mask (default: 0xffffffff)

Example

@kernel function reduce_kernel(dst, src)
    I = @index(Global, Linear)
    val = src[I]
    lane = (I - 1) % 32 + 1
    
    @warpfold(val, lane, +)
    
    if lane == 1
        dst[1] = val  # Contains sum of all 32 values
    end
end

# Input:  [1, 2, 3, ..., 32]
# Output: dst[1] = 528
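
The docstring notes that shuffle-down operations are used internally; the classic halving-offset shape of such a fold, sketched with the same caveats as the scan sketch above:

for s in (16, 8, 4, 2, 1)
    val = op(val, @shfl(Down, val, s))  # after the last step, lane 1 holds the full result
end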

See also: @warpreduce, @shfl

KernelIntrinsics.@vote (Macro)
@vote(mode, predicate, [mask=0xffffffff])

Perform a warp vote operation, evaluating a predicate across all lanes.

Arguments

  • mode: Vote mode (All, Any, Uni, or Ballot)
  • predicate: Boolean predicate to evaluate
  • mask: Lane participation mask (default: 0xffffffff)

Example

@kernel function vote_kernel(dst, src, threshold)
    I = @index(Global, Linear)
    val = src[I]
    
    all_above = @vote(All, val > threshold)   # True if all lanes above threshold
    any_above = @vote(Any, val > threshold)   # True if any lane above threshold
    uniform = @vote(Uni, val > threshold)     # True if all lanes have same result
    mask = @vote(Ballot, val > threshold)     # Bitmask of which lanes are above
    
    dst[I] = mask
end
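
Since a Ballot result is a bitmask, it composes with Julia's count_ones; for instance, continuing the kernel above (a sketch):

n_above = count_ones(@vote(Ballot, val > threshold))  # number of lanes above threshold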

See also: @shfl

KernelIntrinsics.Direction (Type)
Direction

Abstract type representing warp shuffle directions.

Subtypes:

  • Up: Shuffle values from lower lane indices
  • Down: Shuffle values from higher lane indices
  • Xor: Shuffle values using XOR of lane indices
  • Idx: Shuffle values from a specific lane index
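
As an illustration of these four mappings, here is a host-side simulation (illustrative only, not GPU code). Lanes are numbered 0:ws-1, matching the hardware convention implied by the @shfl(Idx, val, 0) example above, and vals[l + 1] holds lane l's value:

function simulate_shfl(dir::Symbol, vals, src; ws = 32)
    map(0:ws-1) do i
        j = dir === :Up   ? i - src :
            dir === :Down ? i + src :
            dir === :Xor  ? i ⊻ src :
            src                          # :Idx broadcasts from lane `src`
        0 <= j < ws ? vals[j + 1] : vals[i + 1]  # out-of-range lanes keep their value
    end
end

simulate_shfl(:Up,   collect(1:32), 1)  # [1, 1, 2, ..., 31]
simulate_shfl(:Down, collect(1:32), 1)  # [2, 3, ..., 32, 32]
simulate_shfl(:Xor,  collect(1:32), 1)  # [2, 1, 4, 3, ..., 32, 31]
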
KernelIntrinsics.Up (Type)
Up <: Direction

Shuffle direction where each lane receives a value from a lane with a lower index.

@shfl(Up, val, offset): Lane i receives the value from lane i - offset. Lanes where i < offset keep their original value.

Result for offset=1 with input [1, 2, ..., 32]: [1, 1, 2, 3, 4, ..., 31]

KernelIntrinsics.Down (Type)
Down <: Direction

Shuffle direction where each lane receives a value from a lane with a higher index.

@shfl(Down, val, offset): Lane i receives the value from lane i + offset. Lanes where i + offset >= warpsize keep their original value.

Result for offset=1 with input [1, 2, ..., 32]: [2, 3, 4, ..., 31, 32, 32]

KernelIntrinsics.Xor (Type)
Xor <: Direction

Shuffle direction where each lane exchanges values based on XOR of lane indices.

@shfl(Xor, val, mask): Lane i receives the value from lane i ⊻ mask.

Common patterns:

  • mask=1: Swap adjacent pairs (0↔1, 2↔3, ...)
  • mask=16: Swap first and second half of warp
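
A classic use of the Xor direction is a butterfly reduction; a sketch for a 32-lane warp, after which every lane (not just lane 1) holds the warp-wide sum:

val += @shfl(Xor, val, 16)
val += @shfl(Xor, val, 8)
val += @shfl(Xor, val, 4)
val += @shfl(Xor, val, 2)
val += @shfl(Xor, val, 1)
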
KernelIntrinsics.Idx (Type)
Idx <: Direction

Shuffle direction where all lanes receive a value from a specific lane index.

@shfl(Idx, val, lane): all lanes receive the value from the lane whose index is given by the lane argument.

Useful for broadcasting a value from one lane to all others.
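
For example, a warp-wide result produced by @warpfold (which lands in 1-based lane 1) can be shared with the whole warp; a sketch, assuming that lane corresponds to hardware lane 0 as in the @shfl example above:

@warpfold(val, lane, +)     # warp sum lands in (1-based) lane 1
total = @shfl(Idx, val, 0)  # broadcast it so every lane holds the sum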

KernelIntrinsics.Mode (Type)
Mode

Abstract type representing warp vote modes.

Subtypes:

  • All: True if predicate is true for all lanes
  • Any: True if predicate is true for any lane
  • Uni: True if predicate is uniform across all lanes
  • Ballot: Returns a bitmask of predicate values
KernelIntrinsics.Uni (Type)
Uni <: Mode

Vote mode that returns true if the predicate has the same value across all participating lanes.
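
A minimal sketch of how Uni differs from All, assuming val::Int32 as in the @vote example above: a predicate that is false on every lane is still uniform.

p = val > typemax(Int32)   # false on every lane
all_true   = @vote(All, p) # false: not every lane's predicate is true
is_uniform = @vote(Uni, p) # true: the predicate is uniform (uniformly false)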
