Matrix layouts
DataAxesFormats.MatrixLayouts — Module
All stored `Daf` matrix data has a clear matrix layout, that is, a `major_axis`, regardless of whether it is dense or sparse.
That is, for `Columns`-major data, the values of each column are laid out consecutively in memory (each column is a single contiguous vector), so any operation that works on whole columns will be fast (e.g., summing the values of each column). In contrast, the values of each row are stored far apart from each other, so any operation that works on whole rows will be very slow in comparison (e.g., summing the values of each row).
For `Rows`-major data, the values of each row are laid out consecutively in memory (each row is a single contiguous vector). In contrast, the values of each column are stored far apart from each other. In this case, summing columns would be slow, and summing rows would be fast.
This is much simpler than the ArrayLayouts module, which attempts to fully describe the layout of N-dimensional arrays; that is a much more ambitious goal, and overkill for our needs.
The "default" layout in Julia is column-major, which inherits this from matlab, which inherits this from FORTRAN, allegedly because this is more efficient for some linear algebra operations. In contrast, Python
numpy
uses row-major layout by default. In either case, this is just an arbitrary convention, and all systems work just fine with data of either memory layout; the key consideration is to keep track of the layout, and to apply operations "with the grain" rather than "against the grain" of the data.
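As a small illustration of working with the grain, summing a plain Julia `Matrix` (which is column-major) column by column touches consecutive memory, while summing it row by row does not:

```julia
# Julia's built-in Matrix type is column-major, so iterating over columns goes
# "with the grain" of memory while iterating over rows goes "against" it.
m = reshape(collect(1.0:6.0), 2, 3)          # 2×3 column-major matrix

col_sums = [sum(col) for col in eachcol(m)]  # contiguous accesses: fast
row_sums = [sum(row) for row in eachrow(m)]  # strided accesses: slow at scale
```

For a 2×3 matrix the difference is negligible, but for the large matrices typical of `Daf` data it can dominate the run time.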
Symbolic names for axes
DataAxesFormats.MatrixLayouts.Rows — Constant
A symbolic name for the rows axis. It is much more readable to write, say, `size(matrix, Rows)` instead of `size(matrix, 1)`.
DataAxesFormats.MatrixLayouts.Columns — Constant
A symbolic name for the columns axis. It is much more readable to write, say, `size(matrix, Columns)` instead of `size(matrix, 2)`.
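A self-contained sketch of how the constants read in practice. Their values (`Rows == 1`, `Columns == 2`) follow from the equivalences stated above; they are defined locally here rather than imported from DataAxesFormats.MatrixLayouts:

```julia
# Defined locally for this sketch; normally these come from
# DataAxesFormats.MatrixLayouts. The values follow from the documented
# equivalence size(matrix, Rows) == size(matrix, 1), etc.
const Rows = 1
const Columns = 2

matrix = zeros(4, 3)
@assert size(matrix, Rows) == 4      # number of rows
@assert size(matrix, Columns) == 3   # number of columns
```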
DataAxesFormats.MatrixLayouts.axis_name — Function
axis_name(axis::Maybe{Integer})::String
Return the name of the axis (for messages).
Checking layout
DataAxesFormats.MatrixLayouts.major_axis — Function
major_axis(matrix::AbstractMatrix)::Maybe{Int8}
Return the index of the major axis of a matrix, that is, the axis one should keep *fixed* for an efficient inner loop accessing the matrix elements. If the matrix doesn't support any efficient access axis, returns `nothing`.
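To make "keep fixed" concrete: for a plain column-major `Matrix` the major axis is `Columns`, so an efficient nested loop keeps the column fixed and varies the row. This sketch uses only Base Julia:

```julia
# Sum a column-major matrix "with the grain": the outer loop walks the major
# axis (Columns, kept fixed), the inner loop walks the minor axis (Rows), so
# consecutive iterations touch consecutive memory locations.
function sum_with_the_grain(m::AbstractMatrix)
    total = zero(eltype(m))
    for column in 1:size(m, 2)   # major axis: kept fixed for each inner loop
        for row in 1:size(m, 1)  # minor axis: varied in the inner loop
            total += m[row, column]
        end
    end
    return total
end

sum_with_the_grain([1.0 2.0; 3.0 4.0])  # == 10.0
```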
DataAxesFormats.MatrixLayouts.require_major_axis — Function
require_major_axis(matrix::AbstractMatrix)::Int8
Similar to `major_axis`, but will `error` if the matrix isn't in either row-major or column-major layout.
DataAxesFormats.MatrixLayouts.minor_axis — Function
minor_axis(matrix::AbstractMatrix)::Maybe{Int8}
Return the index of the minor axis of a matrix, that is, the axis one should *vary* for an efficient inner loop accessing the matrix elements. If the matrix doesn't support any efficient access axis, returns `nothing`.
DataAxesFormats.MatrixLayouts.require_minor_axis — Function
require_minor_axis(matrix::AbstractMatrix)::Int8
Similar to `minor_axis`, but will `error` if the matrix isn't in either row-major or column-major layout.
DataAxesFormats.MatrixLayouts.other_axis — Function
other_axis(axis::Maybe{Integer})::Maybe{Int8}
Return the other matrix axis (that is, `Rows` for `Columns` and vice versa). Returns `nothing` for `nothing`.
Changing layout
DataAxesFormats.MatrixLayouts.relayout! — Function
relayout!(destination::AbstractMatrix, source::AbstractMatrix)::AbstractMatrix
relayout!(destination::AbstractMatrix, source::NamedMatrix)::NamedMatrix
Return the same `matrix` data, but in the other memory layout.
Suppose you have a column-major UMIs matrix, whose rows are cells and columns are genes. Summing the UMIs of a gene will therefore be fast, but summing the UMIs of a cell will be slow. A `transpose` (no `!`) of a matrix is fast; it creates a zero-copy wrapper of the matrix with flipped axes, so its rows will be genes and its columns will be cells, but in row-major layout. Therefore, summing the UMIs of a gene is still fast, and summing the UMIs of a cell is still slow.
In contrast, `transpose!` (with a `!`) (or `transposer`) is slow; it creates a rearranged copy of the data, also returning a matrix whose rows are genes and columns are cells, but this time in column-major layout. Therefore, in this case summing the UMIs of a gene will be slow, and summing the UMIs of a cell will be fast.
It is almost always worthwhile to `relayout!` a matrix and then perform operations "with the grain" of the data, instead of skipping it and performing operations "against the grain" of the data. This is because (in Julia, at least) the implementation of `transpose!` is optimized for the task, while the other operations typically don't provide any specific optimizations for working "against the grain" of the data. The benefits of a `relayout!` become even more significant when performing a series of operations (e.g., summing the gene UMIs of each cell, converting gene UMIs to fractions out of these totals, then computing the log base 2 of these fractions).
If you `transpose` (no `!`) the result of `transpose!` (with a `!`), you end up with a matrix that appears to be the same as the original (rows are cells and columns are genes) but behaves differently: summing the UMIs of a gene will be slow, and summing the UMIs of a cell will be fast. This `transpose` of `transpose!` is a common idiom and is basically what `relayout!` does for you. In addition, `relayout!` works for both sparse and dense matrices, and if `destination` is not specified, a `similar` matrix is allocated automatically for it.
The caller is responsible for providing a sensible `destination` matrix (sparse for a sparse `source`, dense for a non-sparse `source`). This can be a transposed matrix. If `source` is a `NamedMatrix`, then the result will be a `NamedMatrix` with the same axes. If `destination` is also a `NamedMatrix`, then its axes must match `source`.
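The `transpose`-of-`transpose!` idiom described above can be seen with plain `LinearAlgebra` calls (no `Daf` types involved):

```julia
using LinearAlgebra

m = [1 2 3; 4 5 6]                            # 2×3 Matrix, column-major
wrapper = transpose(m)                        # zero-copy 3×2 view, row-major layout
rearranged = transpose!(similar(wrapper), m)  # real 3×2 Matrix, column-major copy
relayouted = transpose(rearranged)            # 2×3 again, now in row-major layout
@assert relayouted == m                       # same values, the other memory layout
```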
DataAxesFormats.MatrixLayouts.relayout — Function
relayout(matrix::AbstractMatrix)::AbstractMatrix
relayout(matrix::NamedMatrix)::NamedMatrix
Same as `relayout!`, but allocates the destination matrix for you. It is equivalent to `transpose(transposer(matrix))`.
DataAxesFormats.MatrixLayouts.transposer — Function
transposer(matrix::AbstractMatrix)::AbstractMatrix
transposer(matrix::NamedMatrix)::NamedMatrix
This is a shorthand for `LinearAlgebra.transpose!(similar(transpose(m)), m)`. That is, this will return a transpose of a matrix, but instead of simply using a zero-copy wrapper, it actually rearranges the data. See `relayout!`.
DataAxesFormats.MatrixLayouts.copy_array — Function
copy_array(array::AbstractArray)::AbstractArray
Create a mutable copy of an array. This differs from `Base.copy` in the following:

- Copying a read-only array returns a mutable array. In contrast, both `Base.copy` and `Base.deepcopy` of a read-only array will return a read-only array, which is technically correct, but is rather pointless for `Base.copy`.
- Copying will preserve the layout of the data; for example, copying a `Transpose` array is still a `Transpose` array. In contrast, while `Base.deepcopy` will preserve the layout, `Base.copy` will silently `relayout!` the matrix, which is both expensive and confusing.
- Copying a sparse vector or matrix gives the same type of sparse array or matrix. Copying anything else gives a simple dense array regardless of the original type. This is done because a `deepcopy` of a `PyArray` will still share the underlying buffer. Sigh.
- Copying a vector of anything derived from `AbstractString` returns a vector of `AbstractString`.
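The "preserve the layout" point can be mimicked in plain Julia: copying a `Transpose` wrapper's parent and re-wrapping it keeps the memory layout, unlike `collect`, which silently materializes a column-major `Matrix`:

```julia
using LinearAlgebra

t = transpose([1 2; 3 4])               # row-major view of a column-major parent
preserved = transpose(copy(parent(t)))  # copy that is still a Transpose wrapper
materialized = collect(t)               # plain Matrix: layout silently changed

@assert preserved isa Transpose
@assert materialized isa Matrix
@assert preserved == materialized == t  # same values either way
```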
Changing format
DataAxesFormats.MatrixLayouts.bestify — Function
bestify(
matrix::AbstractMatrix{T};
copy::Bool = false,
sparse_if_saves_storage_fraction::AbstractFloat = 0.25,
)::AbstractMatrix{T} where {T <: StorageReal}
bestify(
matrix::AbstractVector{T};
copy::Bool = false,
sparse_if_saves_storage_fraction::AbstractFloat = 0.25,
)::AbstractVector{T} where {T <: StorageReal}
Return a "best" (dense or sparse) version of an array. The sparse format is chosen if it saves at least
sparse_if_saves_storage_fraction
of the storage of the dense format. If
copy
, this will create a copy even if it is already in the best format.
If not
copy
and the matrix is already sparse, we do not change the integer index type, even though this may save space.
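A hedged sketch of the storage arithmetic behind the threshold; the exact formula `bestify` uses is not shown here, but for CSC-style sparse storage a reasonable estimate is:

```julia
# Estimate the fraction of dense storage saved by a CSC-style sparse format:
# nnz values + nnz row indices + (ncols + 1) column pointers, versus
# nrows * ncols dense values. (This approximates, not reproduces, bestify's rule.)
function sparse_saves_fraction(nrows, ncols, nnz; value_bytes = 4, index_bytes = 4)
    dense = nrows * ncols * value_bytes
    sparse = nnz * (value_bytes + index_bytes) + (ncols + 1) * index_bytes
    return (dense - sparse) / dense
end

@assert sparse_saves_fraction(100, 100, 100) > 0.25   # 1% non-zeros: go sparse
@assert sparse_saves_fraction(100, 100, 5000) < 0.25  # 50% non-zeros: stay dense
```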
DataAxesFormats.MatrixLayouts.densify — Function
densify(matrix::AbstractMatrix{T}; copy::Bool = false)::AbstractMatrix{T} where {T <: StorageReal}
densify(vector::AbstractVector{T}; copy::Bool = false)::AbstractVector{T} where {T <: StorageReal}
Return a dense version of an array. This will preserve the matrix layout. If `copy`, this will create a copy even if the array is already dense.
DataAxesFormats.MatrixLayouts.sparsify — Function
sparsify(matrix::AbstractMatrix{T}; copy::Bool = false)::AbstractMatrix{T} where {T <: StorageReal}
sparsify(vector::AbstractVector{T}; copy::Bool = false)::AbstractVector{T} where {T <: StorageReal}
Return a sparse version of an array. This will preserve the matrix layout. If `copy`, this will create a copy even if the array is already sparse.
Assertions
DataAxesFormats.MatrixLayouts.@assert_vector — Macro
@assert_vector(vector::Any, [n_elements::Integer])
Assert that the `vector` is an `AbstractVector`, and optionally that it has `n_elements`, with a friendly error message if it fails.
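A plain-function analogue of what the macro checks (the real macro additionally reports the argument's expression in its error message, which a function cannot do; `assert_vector` here is a hypothetical stand-in):

```julia
# Hypothetical stand-in for @assert_vector, checking the same two conditions.
function assert_vector(vector::Any, n_elements::Union{Integer, Nothing} = nothing)
    @assert vector isa AbstractVector "not a vector: $(typeof(vector))"
    if n_elements !== nothing
        @assert length(vector) == n_elements "expected $(n_elements) elements, got $(length(vector))"
    end
    return nothing
end

assert_vector([1, 2, 3], 3)  # passes silently
```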
DataAxesFormats.MatrixLayouts.@assert_matrix — Macro
@assert_matrix(matrix::Any, [n_rows::Integer, n_columns::Integer], [major_axis::Int8])
Assert that the `matrix` is an `AbstractMatrix`, and optionally that it has `n_rows` and `n_columns`. If the `major_axis` is given, also calls `check_efficient_action` to verify that the matrix is in an efficient layout.
DataAxesFormats.MatrixLayouts.check_efficient_action — Function
check_efficient_action(
source_file::AbstractString,
source_line::Integer,
operand::AbstractString,
matrix::AbstractMatrix,
axis::Integer,
)::Nothing
This will check whether the code about to be executed for an `operand` which is a `matrix` works "with the grain" of the data, which requires the `matrix` to be in `axis`-major layout. If it isn't, the `inefficient_action_handler` is applied. Typically this isn't invoked directly; instead, use `@assert_matrix`.
In general, you really want operations to go "with the grain" of the data. Unfortunately, Julia (and Python, and R, and MATLAB) will silently run operations "against the grain", which can be painfully slow. A liberal application of this function in your code will help detect such slowdowns, without having to resort to profiling the code to isolate the problem.
This will not prevent the code from performing "against the grain" operations such as `selectdim(matrix, Rows, 1)` for a column-major matrix, but if you add this check before performing any (series of) operations on a matrix, then you will have a clear indication of whether such operations occur. You can then consider whether to invoke `relayout!` on the data or (for data fetched from `Daf`) simply query for the other memory layout.
DataAxesFormats.MatrixLayouts.inefficient_action_handler — Function
inefficient_action_handler(handler::AbnormalHandler)::AbnormalHandler
Specify the `AbnormalHandler` to use when accessing a matrix in an inefficient way ("against the grain"). Returns the previous handler. The default handler is `WarnHandler`.
Index
- DataAxesFormats.MatrixLayouts
- DataAxesFormats.MatrixLayouts.Columns
- DataAxesFormats.MatrixLayouts.Rows
- DataAxesFormats.MatrixLayouts.axis_name
- DataAxesFormats.MatrixLayouts.bestify
- DataAxesFormats.MatrixLayouts.check_efficient_action
- DataAxesFormats.MatrixLayouts.copy_array
- DataAxesFormats.MatrixLayouts.densify
- DataAxesFormats.MatrixLayouts.inefficient_action_handler
- DataAxesFormats.MatrixLayouts.major_axis
- DataAxesFormats.MatrixLayouts.minor_axis
- DataAxesFormats.MatrixLayouts.other_axis
- DataAxesFormats.MatrixLayouts.relayout
- DataAxesFormats.MatrixLayouts.relayout!
- DataAxesFormats.MatrixLayouts.require_major_axis
- DataAxesFormats.MatrixLayouts.require_minor_axis
- DataAxesFormats.MatrixLayouts.sparsify
- DataAxesFormats.MatrixLayouts.transposer
- DataAxesFormats.MatrixLayouts.@assert_matrix
- DataAxesFormats.MatrixLayouts.@assert_vector