Matrix layouts

DataAxesFormats.MatrixLayouts Module

All stored Daf matrix data has a clear matrix layout, that is, a major_axis, regardless of whether it is dense or sparse.

That is, for Columns-major data, the values of each column are laid out consecutively in memory (each column is a single contiguous vector), so any operation that works on whole columns will be fast (e.g., summing the values of each column). In contrast, the values of each row are stored far apart from each other, so any operation that works on whole rows will be very slow in comparison (e.g., summing the values of each row).

For Rows-major data, the values of each row are laid out consecutively in memory (each row is a single contiguous vector). In contrast, the values of each column are stored far apart from each other. In this case, summing columns would be slow, and summing rows would be fast.
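To see the columns-major layout concretely, flattening a base Julia Matrix (which is columns-major) reveals that each column occupies consecutive memory positions:

```julia
m = [1 2 3;
     4 5 6]

# vec exposes the underlying storage order of the (columns-major) matrix:
# each column's values appear consecutively.
storage_order = vec(m)  # [1, 4, 2, 5, 3, 6]
```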

This is much simpler than the ArrayLayouts module, which attempts to fully describe the layout of N-dimensional arrays, a much more ambitious goal that is overkill for our needs.

Note

The "default" layout in Julia is column-major, which it inherits from MATLAB, which inherits it from FORTRAN, allegedly because this is more efficient for some linear algebra operations. In contrast, Python's numpy uses row-major layout by default. In either case, this is just an arbitrary convention, and all systems work just fine with data of either memory layout; the key consideration is to keep track of the layout and to apply operations "with the grain" rather than "against the grain" of the data.

Symbolic names for axes

Checking layout

DataAxesFormats.MatrixLayouts.major_axis Function
major_axis(matrix::AbstractMatrix)::Maybe{Int8}

Return the index of the major axis of a matrix, that is, the axis one should keep fixed for an efficient inner loop accessing the matrix elements. If the matrix doesn't support any efficient access axis, returns nothing.

DataAxesFormats.MatrixLayouts.minor_axis Function
minor_axis(matrix::AbstractMatrix)::Maybe{Int8}

Return the index of the minor axis of a matrix, that is, the axis one should vary for an efficient inner loop accessing the matrix elements. If the matrix doesn't support any efficient access axis, returns nothing.
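As a rough base-Julia sketch of the distinction (the sketch_ names here are hypothetical, not part of the package; the actual functions dispatch on many more matrix types, including sparse ones):

```julia
using LinearAlgebra

# Hypothetical sketch only: a plain Matrix is columns-major (axis 2 is major),
# and its zero-copy transpose is rows-major (axis 1 is major).
sketch_major_axis(::Matrix) = Int8(2)
sketch_major_axis(::Transpose{<:Any, <:Matrix}) = Int8(1)
sketch_minor_axis(matrix::AbstractMatrix) = Int8(3) - sketch_major_axis(matrix)

m = rand(3, 4)
sketch_major_axis(m)             # 2 (keep the column fixed in the inner loop)
sketch_major_axis(transpose(m))  # 1 (keep the row fixed in the inner loop)
```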

Changing layout

DataAxesFormats.MatrixLayouts.relayout! Function
relayout!(destination::AbstractMatrix, source::AbstractMatrix)::AbstractMatrix
relayout!(destination::AbstractMatrix, source::NamedMatrix)::NamedMatrix

Return the same matrix data, but in the other memory layout.

Suppose you have a column-major UMIs matrix, whose rows are cells and whose columns are genes. Summing the UMIs of a gene will therefore be fast, but summing the UMIs of a cell will be slow. A transpose (no !) of a matrix is fast; it creates a zero-copy wrapper of the matrix with flipped axes, so its rows will be genes and its columns will be cells, but in row-major layout. Therefore, summing the UMIs of a gene is still fast, and summing the UMIs of a cell is still slow.
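The zero-copy nature of transpose can be verified directly (these are base Julia names, not part of this package):

```julia
using LinearAlgebra

umis = rand(0:100, 5, 3)   # rows are cells, columns are genes; columns-major
flipped = transpose(umis)  # rows are genes, columns are cells; rows-major

@assert parent(flipped) === umis  # same underlying memory, no copy was made
umis[2, 3] = 999
@assert flipped[3, 2] == 999      # mutations are visible through the wrapper
```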

In contrast, transpose! (with a !) (or transposer) is slow; it creates a rearranged copy of the data, also returning a matrix whose rows are genes and whose columns are cells, but this time in column-major layout. Therefore, in this case, summing the UMIs of a gene will be slow, and summing the UMIs of a cell will be fast.

Note

It is almost always worthwhile to relayout! a matrix and then perform operations "with the grain" of the data, instead of skipping it and performing operations "against the grain" of the data. This is because (in Julia at least) the implementation of transpose! is optimized for the task, while the other operations typically don't provide any specific optimizations for working "against the grain" of the data. The benefits of a relayout! become even more significant when performing a series of operations (e.g., summing the gene UMIs in each cell, converting gene UMIs to fractions out of these totals, then computing the log base 2 of this fraction).

If you transpose (no !) the result of transpose! (with a !), you end up with a matrix that appears to be the same as the original (rows are cells and columns are genes) but behaves differently: summing the UMIs of a gene will be slow, and summing the UMIs of a cell will be fast. This transpose of transpose! is a common idiom and is essentially what relayout! does for you. In addition, relayout! works for both sparse and dense matrices, and if destination is not specified, a similar matrix is allocated automatically for it.
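This idiom can be sketched for the dense case using base Julia only (the real relayout! also handles sparse matrices and NamedMatrix axes):

```julia
using LinearAlgebra

umis = rand(5, 3)                                        # cells x genes, columns-major
rearranged = transpose!(similar(transpose(umis)), umis)  # genes x cells, columns-major copy
relayouted = transpose(rearranged)                       # cells x genes again, now rows-major

@assert relayouted == umis           # same values as the original matrix
@assert parent(relayouted) !== umis  # but the data itself was copied and rearranged
```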

Note

The caller is responsible for providing a sensible destination matrix (sparse for a sparse source , dense for a non-sparse source ). This can be a transposed matrix. If source is a NamedMatrix , then the result will be a NamedMatrix with the same axes. If destination is also a NamedMatrix , then its axes must match source .

DataAxesFormats.MatrixLayouts.transposer Function
transposer(matrix::AbstractMatrix)::AbstractMatrix
transposer(matrix::NamedMatrix)::NamedMatrix

This is a shorthand for LinearAlgebra.transpose!(similar(transpose(m)), m). That is, this will return a transpose of a matrix, but instead of simply using a zero-copy wrapper, it actually rearranges the data. See relayout!.

DataAxesFormats.MatrixLayouts.copy_array Function
copy_array(array::AbstractArray)::AbstractArray

Create a mutable copy of an array. This differs from Base.copy in the following:

  • Copying a read-only array returns a mutable array. In contrast, both Base.copy and Base.deepcopy of a read-only array will return a read-only array, which is technically correct, but rather pointless for Base.copy.
  • Copying preserves the layout of the data; for example, copying a Transpose array returns a Transpose array. In contrast, while Base.deepcopy will preserve the layout, Base.copy will silently relayout! the matrix, which is both expensive and confusing.
  • Copying a sparse vector or matrix gives the same type of sparse array or matrix. Copying anything else gives a simple dense array, regardless of the original type. This is done because a deepcopy of PyArray will still share the underlying buffer. Sigh.
  • Copying a vector of anything derived from AbstractString returns a vector of AbstractString.
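The layout point can be demonstrated with base Julia's own copy functions (this shows the Base behavior that copy_array works around, not copy_array itself):

```julia
using LinearAlgebra

m = rand(3, 4)
t = transpose(m)  # rows-major zero-copy wrapper

@assert copy(t) isa Matrix         # Base.copy silently rearranges into columns-major
@assert deepcopy(t) isa Transpose  # Base.deepcopy preserves the rows-major wrapper
```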

Changing format

DataAxesFormats.MatrixLayouts.bestify Function
bestify(
    matrix::AbstractMatrix{T};
    copy::Bool = false,
    sparse_if_saves_storage_fraction::AbstractFloat = 0.25,
)::AbstractMatrix{T} where {T <: StorageReal}
bestify(
    matrix::AbstractVector{T};
    copy::Bool = false,
    sparse_if_saves_storage_fraction::AbstractFloat = 0.25,
)::AbstractVector{T} where {T <: StorageReal}

Return a "best" (dense or sparse) version of an array. The sparse format is chosen if it saves at least sparse_if_saves_storage_fraction of the storage of the dense format. If copy is set, this will create a copy even if the array is already in the best format.

Note

If copy is not set and the matrix is already sparse, we do not change the integer index type, even though doing so might save space.
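As a back-of-the-envelope illustration of the storage trade-off, here is a hypothetical estimate for the CSC format (the function name and accounting are this sketch's own, not necessarily the package's exact criterion):

```julia
using SparseArrays

# Estimated fraction of dense storage saved by the sparse (CSC) format:
# each stored value costs its own bytes plus a row index, and each column
# costs one extra column-pointer entry.
function estimated_sparse_savings(matrix::AbstractMatrix{T}) where {T}
    dense_bytes = length(matrix) * sizeof(T)
    index_bytes = sizeof(Int)
    stored = count(!iszero, matrix)
    sparse_bytes = stored * (sizeof(T) + index_bytes) + (size(matrix, 2) + 1) * index_bytes
    return 1 - sparse_bytes / dense_bytes
end

mostly_zeros = zeros(Float64, 100, 100)
mostly_zeros[1, 1] = 1.0
estimated_sparse_savings(mostly_zeros) > 0.25  # sparse is clearly worthwhile here
```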

DataAxesFormats.MatrixLayouts.densify Function
densify(matrix::AbstractMatrix{T}; copy::Bool = false)::AbstractMatrix{T} where {T <: StorageReal}
densify(vector::AbstractVector{T}; copy::Bool = false)::AbstractVector{T} where {T <: StorageReal}

Return a dense version of an array. This preserves the matrix layout. If copy is set, this will create a copy even if the array is already dense.

DataAxesFormats.MatrixLayouts.sparsify Function
sparsify(matrix::AbstractMatrix{T}; copy::Bool = false)::AbstractMatrix{T} where {T <: StorageReal}
sparsify(vector::AbstractVector{T}; copy::Bool = false)::AbstractVector{T} where {T <: StorageReal}

Return a sparse version of an array. This preserves the matrix layout. If copy is set, this will create a copy even if the array is already sparse.
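In base Julia terms (using SparseArrays directly rather than the package's functions), the round trip between the two formats preserves the values:

```julia
using SparseArrays

dense = [1.0 0.0 0.0;
         0.0 2.0 0.0]
sparse_version = sparse(dense)  # compressed sparse column (CSC) format

@assert nnz(sparse_version) == 2         # only the non-zero entries are stored
@assert Matrix(sparse_version) == dense  # densifying recovers the same values
```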

Assertions

DataAxesFormats.MatrixLayouts.@assert_matrix Macro
@assert_matrix(matrix::Any, [n_rows::Integer, n_columns::Integer], [major_axis::Int8])

Assert that the matrix is an AbstractMatrix, and optionally that it has n_rows and n_columns. If major_axis is given, also calls check_efficient_action to verify that the matrix is in an efficient layout.

DataAxesFormats.MatrixLayouts.check_efficient_action Function
check_efficient_action(
    source_file::AbstractString,
    source_line::Integer,
    operand::AbstractString,
    matrix::AbstractMatrix,
    axis::Integer,
)::Nothing

This checks whether the code about to be executed for the operand, which is matrix, works "with the grain" of the data, which requires the matrix to be in axis-major layout. If it isn't, the inefficient_action_handler is applied. Typically this isn't invoked directly; instead, use @assert_matrix.

In general, you really want operations to go "with the grain" of the data. Unfortunately, Julia (and Python, and R, and MATLAB) will silently run operations "against the grain", which can be painfully slow. A liberal application of this function in your code helps detect such slowdowns without having to resort to profiling the code to isolate the problem.

Note

This will not prevent the code from performing "against the grain" operations such as selectdim(matrix, Rows, 1) for a column-major matrix, but if you add this check before performing any (series of) operations on a matrix, you will have a clear indication of whether such operations occur. You can then consider whether to invoke relayout! on the data or (for data fetched from Daf) simply query for the other memory layout.
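For instance, both of the following loops compute correct sums for a columns-major matrix, but only the first streams through contiguous memory:

```julia
umis = rand(1000, 1000)  # a plain Matrix is columns-major

# With the grain: each column is a contiguous slice of memory.
column_sums = [sum(column) for column in eachcol(umis)]

# Against the grain: each row is strided, so access is cache-unfriendly
# and noticeably slower on large matrices.
row_sums = [sum(row) for row in eachrow(umis)]

@assert column_sums ≈ vec(sum(umis; dims = 1))
@assert row_sums ≈ vec(sum(umis; dims = 2))
```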
