(ns org.soulspace.arrayfire.ffi.transpose
  "Bindings for the ArrayFire matrix transpose functions.
   
   Matrix transposition is a fundamental operation in linear algebra that
   flips a matrix over its diagonal, swapping rows and columns. ArrayFire
   provides both standard and in-place transpose operations, with optional
   complex conjugation for complex-valued matrices.
   
   Mathematical Definition:
   
   For a matrix A of size m×n, the transpose Aᵀ is an n×m matrix where:
   
   Aᵀ[i, j] = A[j, i]
   
   Visually:
   ```
   A = ┌         ┐       Aᵀ = ┌         ┐
       │ 1  2  3 │            │ 1  4  7 │
       │ 4  5  6 │            │ 2  5  8 │
       │ 7  8  9 │            │ 3  6  9 │
       └         ┘            └         ┘
       (3×3)                  (3×3)
   ```
   
   For non-square matrices:
   ```
   A = ┌         ┐       Aᵀ = ┌       ┐
       │ 1  2  3 │            │ 1  4 │
       │ 4  5  6 │            │ 2  5 │
       └         ┘            │ 3  6 │
       (2×3)                  └       ┘
                              (3×2)
   ```
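
   The definition Aᵀ[i, j] = A[j, i] can be reproduced on plain Clojure
   nested vectors, which is useful for checking results read back from the
   device. This is a host-side sketch only: `transpose-ref` is a hypothetical
   helper name and does not call ArrayFire.

   ```clojure
   ;; Reference transpose on a vector of row vectors (not an ArrayFire call).
   (defn transpose-ref [rows]
     (apply mapv vector rows))

   (transpose-ref [[1 2 3]
                   [4 5 6]])
   ;; => [[1 4] [2 5] [3 6]]
   ```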
   
   **Conjugate Transpose (Hermitian Transpose)**:
   
   For complex matrices, conjugate transpose (denoted A* or A†) combines
   transposition with complex conjugation:
   
   A*[i, j] = conj(A[j, i])
   
   Where conj(a + bi) = a - bi
   
   Example with complex matrix:
   ```
   A = ┌           ┐       A* = ┌           ┐
       │ 1+2i  3-i │            │ 1-2i  5+0i│
       │ 5+0i  4+3i│            │ 3+i   4-3i│
       └           ┘            └           ┘
   ```
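
   The same example can be checked on the host. The sketch below assumes a
   complex number is modelled as a [re im] pair purely for illustration;
   `conj-pair` and `conjugate-transpose-ref` are hypothetical helpers, not
   part of these bindings.

   ```clojure
   ;; Conjugate of a complex number stored as a [re im] pair.
   (defn conj-pair [[re im]]
     [re (- im)])

   ;; A*[i, j] = conj(A[j, i]): transpose, conjugating every entry.
   (defn conjugate-transpose-ref [rows]
     (apply mapv (fn [& col] (mapv conj-pair col)) rows))

   (conjugate-transpose-ref [[[1 2] [3 -1]]
                             [[5 0] [4 3]]])
   ;; => [[[1 -2] [5 0]] [[3 1] [4 -3]]]
   ```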
   
   Properties of Transpose:
   
   1. **(Aᵀ)ᵀ = A**: Double transpose returns original matrix
   2. **(A + B)ᵀ = Aᵀ + Bᵀ**: Transpose distributes over addition
   3. **(AB)ᵀ = BᵀAᵀ**: Transpose reverses matrix multiplication order
   4. **(kA)ᵀ = kAᵀ**: Scalar multiplication commutes with transpose
   5. **det(Aᵀ) = det(A)**: Determinant preserved under transpose
   6. **rank(Aᵀ) = rank(A)**: Rank preserved under transpose
   
   For complex conjugate transpose:
   1. **(A*)* = A**: Double conjugate transpose returns original
   2. **(A + B)* = A* + B***: Distributes over addition
   3. **(AB)* = B*A***: Reverses order
   4. **(kA)* = k̄A***: Scalar is also conjugated
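
   The order-reversal rule (AB)ᵀ = BᵀAᵀ listed above can be spot-checked on
   small integer matrices. This is a host-side sketch with hypothetical
   helpers (`transpose-ref`, `matmul-ref`); it does not use ArrayFire.

   ```clojure
   (defn transpose-ref [rows]
     (apply mapv vector rows))

   ;; Naive row-by-column product of two nested-vector matrices.
   (defn matmul-ref [a b]
     (let [b-cols (transpose-ref b)]
       (mapv (fn [row]
               (mapv (fn [col] (reduce + (map * row col))) b-cols))
             a)))

   (let [a [[1 2] [3 4]]
         b [[0 1] [5 6]]]
     (= (transpose-ref (matmul-ref a b))
        (matmul-ref (transpose-ref b) (transpose-ref a))))
   ;; => true
   ```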
   
   Special Matrices:
   
   - **Symmetric**: A = Aᵀ (equal to own transpose)
   - **Skew-symmetric**: A = -Aᵀ
   - **Hermitian**: A = A* (complex matrices equal to conjugate transpose)
   - **Skew-Hermitian**: A = -A*
   - **Orthogonal**: AAᵀ = AᵀA = I (real matrices)
   - **Unitary**: AA* = A*A = I (complex matrices)
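
   The first two definitions translate directly into predicates. The sketch
   below works on nested vectors with exact (integer) entries; for
   floating-point data a tolerance-based comparison would be needed instead.
   The helper names are hypothetical.

   ```clojure
   (defn transpose-ref [rows]
     (apply mapv vector rows))

   ;; A = Aᵀ
   (defn symmetric? [a]
     (= a (transpose-ref a)))

   ;; A = -Aᵀ
   (defn skew-symmetric? [a]
     (= a (mapv (fn [row] (mapv - row)) (transpose-ref a))))

   (symmetric? [[1 2] [2 3]])        ;; => true
   (skew-symmetric? [[0 2] [-2 0]])  ;; => true
   (symmetric? [[1 2] [3 4]])        ;; => false
   ```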
   
   Implementation Details:
   
   **Standard Transpose (af-transpose)**:
   - Creates new output array
   - Handles all array dimensions (2D and higher)
   - Special optimization for vectors (uses moddims instead)
   - For 1D vectors: [n] → [1, n] or [1, n] → [n]
   - For batches: Transposes each matrix in batch independently
   - Dimensions: [m, n, p, q] → [n, m, p, q]
   
   **In-Place Transpose (af-transpose-inplace)**:
   - Modifies array in-place (no new allocation)
   - **Requires square matrices** (m = n)
   - More memory efficient for square matrices
   - Faster for large square matrices
   - Cannot change array dimensions
   - For batches: Each matrix in batch must be square
   
   Algorithmic Complexity:
   
   For m×n matrix:
   - Time: O(m×n) - must access every element
   - Space (standard): O(m×n) - new array allocation
   - Space (in-place): O(1) - no extra allocation
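
   To see why every element is visited exactly once, consider a flat buffer
   in column-major order (the layout ArrayFire uses), where element (i, j) of
   an m×n matrix sits at index i + j*m. The sketch below is a sequential
   host-side illustration with a hypothetical helper name, not the library's
   kernel.

   ```clojure
   ;; data: flat column-major buffer of an m×n matrix.
   ;; Returns the flat column-major buffer of the n×m transpose.
   (defn transpose-col-major [data m n]
     (vec (for [i (range m)
                j (range n)]
            ;; Output index is j + i*n, which is the iteration order here.
            (nth data (+ i (* j m))))))

   ;; [[1 2 3] [4 5 6]] stored column-major is [1 4 2 5 3 6].
   (transpose-col-major [1 4 2 5 3 6] 2 3)
   ;; => [1 2 3 4 5 6]
   ```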
   
   GPU Implementation:
   - Parallel processing with thread per element
   - Memory coalescing optimizations
   - Shared memory tiling for cache efficiency
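   - Usually much faster than a CPU transpose for large matrices; the actual
     speedup depends on device, backend, and matrix size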
   
   For large matrices (> 1000×1000):
   - Tiled transpose algorithms
   - Optimized memory access patterns
   - Cache-friendly implementations
   
   Use Cases:
   
   **Linear Algebra**:
   - Matrix equation solving
   - Computing normal equations (AᵀA)
   - Gram matrix computation
   - Linear least squares: (AᵀA)⁻¹Aᵀb
     ```clojure
     ;; Solve overdetermined system Ax = b
     (defn least-squares [A b]
       (let [At-ptr (mem/alloc-pointer ::mem/pointer)
             _ (af-transpose At-ptr A 0)  ; no conjugate
             At (mem/read-pointer At-ptr ::mem/pointer)
             ;; Compute AtA
             AtA-ptr (mem/alloc-pointer ::mem/pointer)
             _ (af-matmul AtA-ptr At A AF_MAT_NONE AF_MAT_NONE)
             AtA (mem/read-pointer AtA-ptr ::mem/pointer)
             ;; Compute Atb
             Atb-ptr (mem/alloc-pointer ::mem/pointer)
             _ (af-matmul Atb-ptr At b AF_MAT_NONE AF_MAT_NONE)
             Atb (mem/read-pointer Atb-ptr ::mem/pointer)
             ;; Solve AtA x = Atb
             x-ptr (mem/alloc-pointer ::mem/pointer)
             _ (af-solve x-ptr AtA Atb AF_MAT_NONE)  ; af_solve takes an options flag
             x (mem/read-pointer x-ptr ::mem/pointer)]
         x))
     ```
   
   - QR decomposition
   - SVD computation
   - Eigenvalue problems
   
   **Signal Processing**:
   - Cross-correlation computation
   - Convolution matrix construction
   - Filter bank design
     ```clojure
     ;; Zero-lag cross-correlation (inner product) of two column vectors: s1ᵀ × s2
     (defn cross-correlate [signal1 signal2]
       (let [s1-T-ptr (mem/alloc-pointer ::mem/pointer)
             _ (af-transpose s1-T-ptr signal1 0)
             s1-T (mem/read-pointer s1-T-ptr ::mem/pointer)
             corr-ptr (mem/alloc-pointer ::mem/pointer)
             _ (af-matmul corr-ptr s1-T signal2 AF_MAT_NONE AF_MAT_NONE)]
         (mem/read-pointer corr-ptr ::mem/pointer)))
     ```
   
   **Machine Learning**:
   - Neural network backpropagation
   - Gradient computation
   - Weight matrix operations
   - Sample covariance matrix: XᵀX / (n - 1), with X centered
     ```clojure
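     ;; Compute covariance matrix. af-mean, af-sub, af-div and get-dimensions
     ;; stand for value-returning wrapper helpers, not the raw FFI functions.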
     (defn covariance-matrix [data]
       (let [[n m] (get-dimensions data)  ; n samples, m features
             ;; Center the data
             mean (af-mean data 0)
             centered (af-sub data mean)
             ;; Compute XtX
             Xt-ptr (mem/alloc-pointer ::mem/pointer)
             _ (af-transpose Xt-ptr centered 0)
             Xt (mem/read-pointer Xt-ptr ::mem/pointer)
             cov-ptr (mem/alloc-pointer ::mem/pointer)
             _ (af-matmul cov-ptr Xt centered AF_MAT_NONE AF_MAT_NONE)
             cov (mem/read-pointer cov-ptr ::mem/pointer)
             ;; Normalize by n-1
             normalized (af-div cov (- n 1))]
         normalized))
     ```
   
   - Principal Component Analysis (PCA)
   - Feature transformation
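
   The covariance formula listed above, XᵀX / (n - 1) on centered data, can
   be verified on the host for small inputs. This sketch uses nested vectors
   and hypothetical helper names, with rows as samples and columns as
   features; it is a reference for checking device results, not a
   replacement for them.

   ```clojure
   (defn transpose-ref [rows]
     (apply mapv vector rows))

   (defn matmul-ref [a b]
     (let [b-cols (transpose-ref b)]
       (mapv (fn [row]
               (mapv (fn [col] (reduce + (map * row col))) b-cols))
             a)))

   ;; x: n samples (rows) by m features (columns).
   (defn covariance-ref [x]
     (let [n        (count x)
           means    (mapv (fn [col] (/ (reduce + col) n)) (transpose-ref x))
           centered (mapv (fn [row] (mapv - row means)) x)
           xtx      (matmul-ref (transpose-ref centered) centered)]
       (mapv (fn [row] (mapv (fn [v] (/ v (dec n))) row)) xtx)))

   (covariance-ref [[1 2] [3 4] [5 6]])
   ;; => [[4 4] [4 4]]
   ```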
   
   **Image Processing**:
   - Image rotation (90° rotations via transpose + flip)
     ```clojure
     ;; Rotate image 90° clockwise: transpose, then flip left-right (dim 1)
     (defn rotate-90-cw [img]
       (let [T-ptr (mem/alloc-pointer ::mem/pointer)
             _ (af-transpose T-ptr img 0)
             transposed (mem/read-pointer T-ptr ::mem/pointer)
             rotated-ptr (mem/alloc-pointer ::mem/pointer)
             _ (af-flip rotated-ptr transposed 1)  ; flip along dim 1 (columns)
             rotated (mem/read-pointer rotated-ptr ::mem/pointer)]
         rotated))
     
     ;; Rotate image 90° counter-clockwise: flip left-right (dim 1), then transpose
     (defn rotate-90-ccw [img]
       (let [flipped-ptr (mem/alloc-pointer ::mem/pointer)
             _ (af-flip flipped-ptr img 1)  ; flip along dim 1 (columns)
             flipped (mem/read-pointer flipped-ptr ::mem/pointer)
             T-ptr (mem/alloc-pointer ::mem/pointer)
             _ (af-transpose T-ptr flipped 0)
             rotated (mem/read-pointer T-ptr ::mem/pointer)]
         rotated))
     ```
   
   - Separable filter optimization
   - Tensor reshaping
   
   **Quantum Computing**:
   - Hermitian operator conjugation
   - Quantum state manipulation
   - Density matrix operations
     ```clojure
     ;; Compute outer product |ψ⟩⟨ψ| (density matrix)
     (defn density-matrix [state-vector]
       (let [ket state-vector  ; Column vector |ψ⟩
             bra-ptr (mem/alloc-pointer ::mem/pointer)
             _ (af-transpose bra-ptr ket 1)  ; Conjugate transpose ⟨ψ|
             bra (mem/read-pointer bra-ptr ::mem/pointer)
             rho-ptr (mem/alloc-pointer ::mem/pointer)
             _ (af-matmul rho-ptr ket bra AF_MAT_NONE AF_MAT_NONE)
             rho (mem/read-pointer rho-ptr ::mem/pointer)]
         rho))
     ```
   
   - Unitary matrix verification
   
   **Graph Algorithms**:
   - Adjacency matrix symmetrization
   - Graph transpose (reverse edges)
   - Bipartite graph operations
   
   Common Patterns:
   
   **Pattern 1: Matrix-Vector Multiplication (Row Space)**
   ```clojure
   ;; Efficiently compute xᵀA (row vector times matrix)
   (defn row-vector-mult [x A]
     (let [xT-ptr (mem/alloc-pointer ::mem/pointer)
           _ (af-transpose xT-ptr x 0)  ; Column to row
           xT (mem/read-pointer xT-ptr ::mem/pointer)
           result-ptr (mem/alloc-pointer ::mem/pointer)
           _ (af-matmul result-ptr xT A AF_MAT_NONE AF_MAT_NONE)]
       (mem/read-pointer result-ptr ::mem/pointer)))
   ```
   
   **Pattern 2: Symmetric Matrix Construction**
   ```clojure
   ;; Symmetric part of a square matrix: (A + Aᵀ) / 2
   (defn make-symmetric [A]
     (let [AT-ptr (mem/alloc-pointer ::mem/pointer)
           _ (af-transpose AT-ptr A 0)
           AT (mem/read-pointer AT-ptr ::mem/pointer)
           ;; Symmetric: (A + Aᵀ) / 2
           sum (af-add A AT)
           symmetric (af-div sum 2.0)]
       symmetric))
   ```
   
   **Pattern 3: In-Place Symmetrization**
   ```clojure
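   ;; Symmetrize a square matrix using the in-place transpose.
   ;; Note: A is left holding Aᵀ, and the copy still costs one extra array.
   (defn symmetrize-inplace! [A]
     ;; Copy A first: the in-place call below overwrites A with Aᵀ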
     (let [A-copy (af-copy A)
           ;; Transpose in-place
           _ (af-transpose-inplace A 0)
           ;; Average with original
           symmetric (af-div (af-add A A-copy) 2.0)]
       symmetric))
   ```
   
   **Pattern 4: Batch Matrix Transpose**
   ```clojure
   ;; Transpose batch of matrices
   (defn transpose-batch [matrices]
     (let [[m n batches] (get-dimensions matrices)  ; [m, n, k] batch
           out-ptr (mem/alloc-pointer ::mem/pointer)
           ;; Single call transposes all matrices in batch
           _ (af-transpose out-ptr matrices 0)
           transposed (mem/read-pointer out-ptr ::mem/pointer)]
       transposed))  ; Output: [n, m, k]
   ```
   
   **Pattern 5: Vector to Row/Column**
   ```clojure
   ;; Convert 1D vector to row or column vector
   (defn to-column-vector [vec]
     ;; vec is [n] → output is [n, 1]
     (let [dims (af-get-dims vec)]
       (if (= 1 (second dims))
         vec  ; Already column vector
         (let [out-ptr (mem/alloc-pointer ::mem/pointer)
               _ (af-transpose out-ptr vec 0)]
           (mem/read-pointer out-ptr ::mem/pointer)))))
   
   (defn to-row-vector [vec]
     ;; vec is [n] → output is [1, n]
     (let [dims (af-get-dims vec)]
       (if (= 1 (first dims))
         vec  ; Already row vector
         (let [out-ptr (mem/alloc-pointer ::mem/pointer)
               _ (af-transpose out-ptr vec 0)]
           (mem/read-pointer out-ptr ::mem/pointer)))))
   ```
   
   **Pattern 6: Complex Hermitian Check**
   ```clojure
   ;; Check if complex matrix is Hermitian (A = A*)
   (defn hermitian? [A]
     (let [A-star-ptr (mem/alloc-pointer ::mem/pointer)
           _ (af-transpose A-star-ptr A 1)  ; Conjugate transpose
           A-star (mem/read-pointer A-star-ptr ::mem/pointer)
           ;; Check if A ≈ A*
           diff (af-sub A A-star)
           norm (af-norm diff AF_NORM_EUCLID)
           threshold 1e-10]
       (< norm threshold)))
   ```
   
   Type Support:
   
   Both functions support all ArrayFire types:
   - Floating-point: f32, f64, f16
   - Complex: c32, c64
   - Signed integers: s8, s16, s32, s64
   - Unsigned integers: u8, u16, u32, u64
   - Boolean: b8
   
   Note: conjugate parameter only affects complex types (c32, c64).
   For real types, conjugate has no effect.
   
   Performance Characteristics:
   
   **Standard Transpose**:
   - Memory: Allocates new array (2× memory temporarily)
   - Speed: GPU accelerated; roughly 0.1-10 ms for typical matrices, depending
     on size, element type, and hardware
   - Works on any rectangular matrix
   - Optimized for vectors (uses fast moddims)
   
   **In-Place Transpose**:
   - Memory: No extra allocation
   - Speed: Slightly faster than standard for large square matrices
   - Requires square matrix (returns AF_ERR_SIZE otherwise)
   - Best for: Large square matrices, memory-constrained scenarios
   
   Illustrative performance comparison (1000×1000 matrix; actual timings vary
   with device and backend):
   ```text
   Operation              | Time (GPU) | Memory
   -----------------------|------------|------------------
   Standard transpose     | 0.5 ms     | 2× (temporary)
   In-place transpose     | 0.3 ms     | 1× (no allocation)
   Vector transpose       | 0.01 ms    | 1× (moddims only)
   Batch transpose (×10)  | 1.2 ms     | 2× (per batch)
   ```
   
   Dimension Handling:
   
   **2D Arrays** (matrices):
   - Input: [m, n] → Output: [n, m]
   - Standard case for transpose
   
   **1D Arrays** (vectors):
   - Input: [n] → Output: [1, n] or [n, 1]
   - Uses optimized moddims internally
   - No data movement, just dimension metadata change
   
   **3D/4D Arrays** (batches):
   - Input: [m, n, p, q] → Output: [n, m, p, q]
   - Batch dimensions (p, q) preserved
   - Each [m, n] slice transposed independently
   - Efficient batch processing
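
   The shape rules above reduce to swapping the first two dimensions. The
   sketch below assumes shapes are given as plain Clojure vectors padded to
   four dimensions; `pad-dims` and `transposed-dims` are hypothetical helpers
   for reasoning about shapes on the host.

   ```clojure
   ;; ArrayFire arrays carry up to four dimensions; missing ones count as 1.
   (defn pad-dims [dims]
     (vec (take 4 (concat dims (repeat 1)))))

   ;; Shape of the transpose: swap dims 0 and 1, keep batch dims 2 and 3.
   (defn transposed-dims [dims]
     (let [[m n p q] (pad-dims dims)]
       [n m p q]))

   (transposed-dims [2 3])        ;; => [3 2 1 1]
   (transposed-dims [5])          ;; => [1 5 1 1]
   (transposed-dims [100 50 10])  ;; => [50 100 10 1]
   ```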
   
   Conjugate Parameter:
   
   - **conjugate = false (0)**: Standard transpose
     * For real matrices: Aᵀ
     * For complex: Transpose without conjugation
   
   - **conjugate = true (1)**: Conjugate transpose
     * For real matrices: Same as standard (no effect)
     * For complex: A* or A† (Hermitian transpose)
     * Essential for quantum mechanics, signal processing
   
   Error Handling:
   
   Common errors:
   - **AF_ERR_SIZE**: In-place transpose on non-square matrix
   - **AF_ERR_ARG**: Invalid array handle
   - **AF_ERR_TYPE**: Unsupported array type (rare)
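
   Since both bindings return the raw af_err integer, callers may want to
   translate it before acting on it. The numeric values below are assumed
   from ArrayFire's af/defines.h and should be verified against the installed
   headers; `af-error-names` and `af-err->keyword` are hypothetical helpers.

   ```clojure
   ;; Assumed af_err values (check af/defines.h for your ArrayFire version).
   (def af-error-names
     {0   :af-success
      202 :af-err-arg
      203 :af-err-size
      204 :af-err-type})

   ;; Translate a raw return code; codes not listed map to :af-err-other.
   (defn af-err->keyword [code]
     (get af-error-names code :af-err-other))

   (af-err->keyword 0)    ;; => :af-success
   (af-err->keyword 203)  ;; => :af-err-size
   ```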
   
   Best Practices:
   
   1. **Choose appropriate variant**:
      - Use standard transpose for rectangular matrices
      - Use in-place for large square matrices when memory is a concern
      - Consider memory vs speed tradeoff
   
   2. **Optimize matrix operations**:
      - Combine transpose with matmul flags when possible
      - Use AF_MAT_TRANS (or AF_MAT_CTRANS for the conjugate transpose) in
        matmul instead of an explicit transpose
      - Reduces intermediate array allocations
   
   3. **Complex matrices**:
      - Use conjugate=true for Hermitian operations
      - Check if Hermitian/symmetric before expensive operations
      - Conjugate transpose essential for adjoint operators
   
   4. **Batch processing**:
      - A single transpose call on a batched array is more efficient than one
        call per matrix
      - Leverages parallelism across batch dimension
   
   5. **Memory management**:
      - Release intermediate transpose results
      - Use in-place when possible for square matrices
      - Consider memory footprint for large matrices
   
   6. **Performance optimization**:
      - Profile standard vs in-place for your use case
      - Cache transpose results if reused
      - Use vector optimization for 1D arrays
   
   Alternative Approaches:
   
   Instead of explicit transpose, consider:
   - **af-matmul with transpose flags**: More efficient
     ```clojure
     ;; Instead of: C = Aᵀ × B
     ;; Do: (af-matmul C A B AF_MAT_TRANS AF_MAT_NONE)
     ```
   
   - **af-moddims**: For dimension reordering without data movement
   
   - **af-reorder**: For general dimension permutation
   
   Limitations:
   
   - In-place transpose requires square matrices
   - In-place transpose of a batch requires every slice to be square
     (dims[0] == dims[1])
   - Conjugate only affects complex types
   - Maximum matrix size limited by GPU memory
   
   Mathematical Notes:
   
   For numerical stability:
   - Transposing does not change the 2-norm condition number
   - Transpose preserves the eigenvalues of a square matrix; conjugate
     transpose yields their complex conjugates
   - For orthogonal matrices Aᵀ = A⁻¹; for unitary matrices A* = A⁻¹
   
   For complex matrices:
   - Hermitian matrices have real eigenvalues
   - Conjugate transpose preserves positive definiteness
   - Essential for quantum mechanics observables
   
   See also:
   - af-matmul: Matrix multiplication with transpose flags
   - af-conjg: Complex conjugation without transpose
   - af-moddims: Dimension reshaping without data movement
   - af-reorder: General dimension permutation
   - af-flip: Matrix flipping (combined with transpose for rotation)"
  (:require [coffi.ffi :as ffi :refer [defcfn]]
            [coffi.mem :as mem]
            [org.soulspace.arrayfire.ffi.loader]))

;; Matrix transpose

;; af_err af_transpose(af_array *out, af_array in, const bool conjugate)
(defcfn af-transpose
  "Transpose a matrix, optionally with complex conjugation.
   
   Flips matrix over its diagonal, swapping rows and columns. For complex
   matrices, optionally applies conjugation (Hermitian transpose).
   
   Parameters:
   - out: Output pointer for transposed matrix
   - in: Input array to transpose
   - conjugate: Complex conjugation flag (bool, use 1 or 0)
     * false (0): Standard transpose (Aᵀ)
     * true (1): Conjugate transpose for complex (A* or A†)
   
   Operation:
   - Input dimensions: [m, n, p, q]
   - Output dimensions: [n, m, p, q]
   - For each element: out[i, j, k, l] = in[j, i, k, l]
   - If conjugate=true and complex: out[i,j] = conj(in[j,i])
   
   Dimension Behavior:
   
   **2D Matrix** [m, n]:
   ```
   Input:  ┌         ┐       Output: ┌       ┐
           │ 1  2  3 │               │ 1  4 │
           │ 4  5  6 │               │ 2  5 │
           └         ┘               │ 3  6 │
           [2, 3]                    └       ┘
                                     [3, 2]
   ```
   
   **1D Vector** [n]:
   - Optimized path using moddims (no data movement)
   - [n] becomes [1, n] or vice versa
   - Much faster than matrix transpose
   
   **Batch** [m, n, k]:
   - Transposes each [m, n] matrix independently
   - Output: [n, m, k]
   - Processes k matrices in parallel
   
   Type Support:
   - All types: f32, f64, f16, c32, c64, integers, b8
   - Conjugate only affects c32 and c64
   - For real types, conjugate parameter ignored
   
   Performance:
   - Time: O(m×n) per matrix
   - Space: O(m×n) new allocation
   - GPU accelerated with tiled implementation
   - Vector optimization for 1D arrays
   - Batch parallelization across matrices
   
   Example (Basic Transpose):
   ```clojure
   ;; Transpose a matrix
   (let [A (create-array [[1 2 3]
                          [4 5 6]] [2 3])
         AT-ptr (mem/alloc-pointer ::mem/pointer)
         _ (af-transpose AT-ptr A 0)  ; Standard transpose
         AT (mem/read-pointer AT-ptr ::mem/pointer)]
     AT)  ; Returns [3, 2] matrix [[1 4] [2 5] [3 6]]
   ```
   
   Example (Least Squares):
   ```clojure
   ;; Solve Ax = b via normal equations: x = (AᵀA)⁻¹Aᵀb
   (defn solve-least-squares [A b]
     (let [AT-ptr (mem/alloc-pointer ::mem/pointer)
           _ (af-transpose AT-ptr A 0)
           AT (mem/read-pointer AT-ptr ::mem/pointer)
           ;; AᵀA
           ATA-ptr (mem/alloc-pointer ::mem/pointer)
           _ (af-matmul ATA-ptr AT A AF_MAT_NONE AF_MAT_NONE)
           ATA (mem/read-pointer ATA-ptr ::mem/pointer)
           ;; Aᵀb
           ATb-ptr (mem/alloc-pointer ::mem/pointer)
           _ (af-matmul ATb-ptr AT b AF_MAT_NONE AF_MAT_NONE)
           ATb (mem/read-pointer ATb-ptr ::mem/pointer)
           ;; Solve
           x-ptr (mem/alloc-pointer ::mem/pointer)
           _ (af-solve x-ptr ATA ATb AF_MAT_NONE)]  ; af_solve takes an options flag
       (mem/read-pointer x-ptr ::mem/pointer)))
   ```
   
   Example (Hermitian Transpose):
   ```clojure
   ;; Complex conjugate transpose (quantum mechanics)
   (let [psi (create-complex-array ket-data [4 1])  ; State |ψ⟩
         psi-dagger-ptr (mem/alloc-pointer ::mem/pointer)
         _ (af-transpose psi-dagger-ptr psi 1)  ; Conjugate transpose
         psi-dagger (mem/read-pointer psi-dagger-ptr ::mem/pointer)]
     psi-dagger)  ; Returns bra ⟨ψ|
   ```
   
   Example (Batch Transpose):
   ```clojure
   ;; Transpose batch of matrices
   (let [batch (create-array data [100 50 10])  ; 10 matrices 100×50
         batch-T-ptr (mem/alloc-pointer ::mem/pointer)
         _ (af-transpose batch-T-ptr batch 0)
         batch-T (mem/read-pointer batch-T-ptr ::mem/pointer)]
     batch-T)  ; Returns [50, 100, 10] - all transposed
   ```
   
   Example (90° Image Rotation):
   ```clojure
   ;; Rotate image 90° clockwise: transpose + left-right flip (dim 1)
   (defn rotate-90-cw [img]
     (let [T-ptr (mem/alloc-pointer ::mem/pointer)
           _ (af-transpose T-ptr img 0)
           transposed (mem/read-pointer T-ptr ::mem/pointer)
           rotated-ptr (mem/alloc-pointer ::mem/pointer)
           _ (af-flip rotated-ptr transposed 1)]
       (mem/read-pointer rotated-ptr ::mem/pointer)))
   ```
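
   The rotation identities can be checked on the host with nested vectors.
   The sketch assumes rows are listed top to bottom, so reversing each row is
   a left-right flip; `flip-columns`, `rotate-cw-ref` and `rotate-ccw-ref`
   are hypothetical helpers, not ArrayFire calls.

   ```clojure
   (defn transpose-ref [rows]
     (apply mapv vector rows))

   ;; Reverse every row: a left-right flip along the column dimension.
   (defn flip-columns [rows]
     (mapv (comp vec reverse) rows))

   ;; Clockwise: transpose, then left-right flip.
   (defn rotate-cw-ref [rows]
     (flip-columns (transpose-ref rows)))

   ;; Counter-clockwise: left-right flip, then transpose.
   (defn rotate-ccw-ref [rows]
     (transpose-ref (flip-columns rows)))

   (rotate-cw-ref [[1 2] [3 4]])   ;; => [[3 1] [4 2]]
   (rotate-ccw-ref [[1 2] [3 4]])  ;; => [[2 4] [1 3]]
   ```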
   
   Use Cases:
   - Linear algebra: Normal equations, Gram matrix
   - Matrix multiplication: xᵀA, Aᵀb
   - Covariance matrices: XᵀX
   - Quantum mechanics: Hermitian conjugation
   - Signal processing: Correlation matrices
   - Machine learning: Gradient computation
   - Image processing: 90° rotations
   
   Conjugate Transpose (Complex Matrices):
   - Essential for Hermitian operators in quantum mechanics
   - Inner products: ⟨ψ|φ⟩ = ψ*φ
   - Density matrices: ρ = |ψ⟩⟨ψ|
   - Adjoint operators in functional analysis
   
   Special Cases:
   - Empty array: Returns input unchanged
   - 1D vector: Uses fast moddims optimization
   - Square matrix: Consider af-transpose-inplace for memory savings
   
   Optimization Tips:
   - For matmul(Aᵀ, B): Use af-matmul with AF_MAT_TRANS flag
   - For repeated transpose: Cache result
   - For square matrices: Consider in-place variant
   - For vectors: Automatic optimization applied
   
   Notes:
   - Creates new output array
   - Preserves batch dimensions (dims 2, 3)
   - Conjugate only affects complex types
   - GPU memory limits maximum matrix size
   - Vector transpose is metadata-only operation
   
   Returns:
   ArrayFire error code (af_err enum):
   - AF_SUCCESS (0): Transpose successful
   - AF_ERR_ARG: Invalid array handle
   - AF_ERR_SIZE: Invalid dimensions
   
   See also:
   - af-transpose-inplace: In-place transpose for square matrices
   - af-matmul: Matrix multiply with transpose flags
   - af-conjg: Complex conjugation without transpose
   - af-moddims: Dimension reshaping
   - af-reorder: General dimension permutation"
  "af_transpose" [::mem/pointer ::mem/pointer ::mem/int] ::mem/int)

;; af_err af_transpose_inplace(af_array in, const bool conjugate)
(defcfn af-transpose-inplace
  "Transpose a square matrix in-place, optionally with complex conjugation.
   
   Modifies the input array directly by transposing it in-place. More memory
   efficient than standard transpose for square matrices, but requires the
   matrix to be square (m = n).
   
   Parameters:
   - in: Input/output array (modified in-place)
     * Must be square matrix (m×m dimensions)
   - conjugate: Complex conjugation flag (bool, use 1 or 0)
     * false (0): Standard transpose (Aᵀ)
     * true (1): Conjugate transpose for complex (A*)
   
   Operation:
   - Swaps elements across diagonal: A[i,j] ↔ A[j,i]
   - If conjugate=true: Also conjugates complex values
   - No new array allocation
   - Dimensions remain [m, m, p, q]
   
   Constraints:
   - **MUST be square matrix**: dims[0] == dims[1]
   - Non-square matrices cause AF_ERR_SIZE error
   - For batches: Each matrix in batch must be square
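
   Because a non-square input is reported only through the return code, a
   cheap shape check before the call can make the failure explicit. The
   sketch assumes the shape is already available as a vector of dimensions
   (for example read via af_get_dims); `square-slices?` is a hypothetical
   helper.

   ```clojure
   ;; dims: [d0 d1 d2 d3]; every [d0, d1] slice is square iff d0 = d1.
   (defn square-slices? [dims]
     (let [[m n] dims]
       (= m n)))

   (square-slices? [100 100 10 1])  ;; => true
   (square-slices? [100 50 1 1])    ;; => false
   ```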
   
   Type Support:
   - All types: f32, f64, f16, c32, c64, integers, b8
   - Conjugate only affects c32 and c64
   - For real types, conjugate parameter ignored
   
   Performance:
   - Time: O(m²); often faster than the standard transpose because no output
     array is allocated
   - Space: O(1) - no allocation
   - Best for large square matrices
   - Memory savings significant for large matrices
   - Still GPU accelerated
   
   Illustrative performance comparison (1000×1000 f64 matrix; actual timings
   vary with device and backend):
   ```text
   Standard transpose:  0.5 ms, 8 MB temporary allocation
   In-place transpose:  0.3 ms, 0 MB allocation
   Memory savings:      100% (no temporary array)
   ```
   
   Example (Basic In-Place):
   ```clojure
   ;; Transpose square matrix in-place
   (let [A (create-array data [100 100])  ; Square matrix
         A-copy (af-copy A)  ; Keep original if needed
         _ (af-transpose-inplace A 0)]  ; Modifies A
     A)  ; Now contains Aᵀ
   ```
   
   Example (Symmetrize Matrix):
   ```clojure
   ;; Create symmetric matrix: (A + Aᵀ) / 2
   (defn symmetrize [A]
     (let [A-original (af-copy A)
           _ (af-transpose-inplace A 0)  ; A ← Aᵀ
           symmetric (af-div (af-add A A-original) 2.0)]
       symmetric))
   ```
   
   Example (Hermitian Conjugation):
   ```clojure
   ;; In-place Hermitian transpose for complex matrix
   (let [H (create-complex-array hamiltonian-data [50 50])
         _ (af-transpose-inplace H 1)]  ; Conjugate transpose
     H)  ; Now contains H†
   ```
   
   Example (Memory-Constrained Scenario):
   ```clojure
   ;; Process large square matrices with minimal memory
   (defn process-large-matrices [matrices]
     (doseq [mat matrices]
       (when (square? mat)
         ;; In-place transpose saves memory
         (af-transpose-inplace mat 0)
         (process-transposed mat)
         ;; Transpose back if needed
         (af-transpose-inplace mat 0))))
   ```
   
   Example (Covariance Computation):
   ```clojure
   ;; Compute covariance efficiently
   (defn fast-covariance [X]
     (let [[n m] (get-dimensions X)
           ;; Center data
           mean (af-mean X 0)
           centered (af-sub X mean)
           ;; In-place transpose only applies here when X is square (n = m)
           XT (af-copy centered)
           _ (af-transpose-inplace XT 0)
           cov-ptr (mem/alloc-pointer ::mem/pointer)
           _ (af-matmul cov-ptr XT centered AF_MAT_NONE AF_MAT_NONE)]
       (af-div (mem/read-pointer cov-ptr ::mem/pointer) (- n 1))))
   ```
   
   Use Cases:
   - Large square matrix operations
   - Memory-constrained environments
   - Repeated transpose operations
   - Symmetric matrix construction
   - Hermitian matrix operations
   - In-place matrix algorithms
   
   When to Use:
   - **Use in-place when**:
     * Matrix is square
     * Memory is limited
     * Original matrix not needed
     * Processing batches of square matrices
   
   - **Use standard when**:
     * Matrix is rectangular
     * Need to preserve original
     * Matrix is small (allocation overhead negligible)
   
   Common Errors:
   - **AF_ERR_SIZE**: Matrix not square (most common)
     * Check dimensions before calling
     * dims[0] must equal dims[1]
   - **AF_ERR_ARG**: Invalid array handle
   
   Special Cases:
   - 1×1 matrix: No operation needed, returns immediately
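   - Symmetric matrix: Unchanged when conjugate=false; with conjugate=true the
     result is the element-wise complex conjugate
   - Hermitian matrix: Unchanged when conjugate=true (A = A*)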
   - Identity matrix: No change (I = Iᵀ)
   
   Batching:
   - For batched arrays [m, m, k]: Each of k matrices transposed
   - All matrices in batch must be square
   - Single error if any matrix non-square
   
   Memory Efficiency:
   For n×n matrix of type T:
   - Standard: 2×n²×sizeof(T) memory (temporary)
   - In-place: n²×sizeof(T) memory (no temporary)
   - Savings: 100% temporary allocation avoided
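
   The figures above follow from simple arithmetic. The sketch below counts
   only the array payloads and ignores any pooling done by the memory
   manager; `transpose-peak-bytes` is a hypothetical helper.

   ```clojure
   ;; Bytes held while transposing an n×n matrix of elem-size-byte elements.
   (defn transpose-peak-bytes [n elem-size in-place?]
     (let [one-copy (* n n elem-size)]
       (if in-place?
         one-copy          ;; input only
         (* 2 one-copy)))) ;; input plus freshly allocated output

   (transpose-peak-bytes 1000 8 false)  ;; => 16000000
   (transpose-peak-bytes 1000 8 true)   ;; => 8000000
   ```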
   
   Implementation Details:
   - Swaps elements above/below diagonal
   - Performs n(n-1)/2 swaps, one per off-diagonal pair; diagonal elements
     stay in place (and are conjugated when conjugate=true)
   - Cache-efficient tiled algorithm
   - Thread per diagonal pair
   
   Alternatives:
   - Use standard af-transpose if not square
   - Use af-matmul transpose flags to avoid explicit transpose
   - Copy first if need to preserve original
   
   Notes:
   - **Modifies input array directly**
   - Original data lost unless copied beforehand
   - More efficient than standard for large square matrices
   - Essential for memory-constrained applications
   - Batch dimensions preserved
   
   Returns:
   ArrayFire error code (af_err enum):
   - AF_SUCCESS (0): In-place transpose successful
   - AF_ERR_SIZE: Matrix not square (dims[0] ≠ dims[1])
   - AF_ERR_ARG: Invalid array handle
   
   See also:
   - af-transpose: Standard transpose (creates new array)
   - af-copy: Copy array before in-place modification
   - af-matmul: Matrix multiply with transpose flags
   - af-conjg: Complex conjugation without transpose"
  "af_transpose_inplace" [::mem/pointer ::mem/int] ::mem/int)
