HighDimMixedModels.jl
HighDimMixedModels.jl is a package for fitting regularized linear mixed-effect models to high-dimensional, clustered data. It is a Julia implementation of the estimation approach in
Schelldorfer, J., Bühlmann, P., & van de Geer, S. (2011). Estimation for high-dimensional linear mixed-effects models using ℓ1-penalization. Scandinavian Journal of Statistics, 38(2), 197-214.
Two options for penalties are provided, the original LASSO and the smoothly clipped absolute deviation (SCAD) penalty described in
Ghosh, A., & Thoresen, M. (2018). Non-concave penalization in linear mixed-effect models and regularized selection of fixed effects. AStA Advances in Statistical Analysis, 102, 179-210.
Quick Start
The cognitive dataset contains data from a study of the effect of an intervention in school lunches among schools in Kenya, accessed via the R package splmm. The data is longitudinal, with measurements of students' performance on various tests taken at different points in time. We will fit a model with random intercepts and random growth slopes for each student. Note that while this is a low-dimensional example ($p < n$), the algorithm that this package implements was designed and tested with the high-dimensional use case ($p > n$) in mind.
First, we load the data into Julia and form a categorical variable for the treatment in the study, which was the type of lunch served (assigned at the school level).
using CSV
using DataFrames
using CategoricalArrays
cog_df = CSV.read("data/cognitive.csv", DataFrame)
# form categorical variable for treatment
cog_df.treatment = categorical(cog_df.treatment, levels=["control", "calorie", "meat", "milk"])

Next, we form model matrices with the help of the StatsModels formula syntax:
using StatsModels
f = @formula(ravens ~ 1 + treatment + year + sex + age_at_time0 +
height + weight + head_circ + ses + mom_read + mom_write + mom_edu)
model_frame = ModelFrame(f, cog_df)
model_mat = ModelMatrix(model_frame).m

We form two model matrices. One is low-dimensional and includes only the columns that will have associated random effects; the other is higher-dimensional and includes the many features whose effects will be regularized.
X = model_mat[:, 1:2] # Non-penalized, random effect columns (one for intercept, and the other for year)
G = model_mat[:, 3:end] # High-dimensional set of covariates whose effects are regularized

Finally, we get the cluster (in this case, student) ids and the response, the students' Ravens test scores, and fit the model with the main function from the package, hdmm:
using HighDimMixedModels
student_id = cog_df.id
y = cog_df.ravens
fit = hdmm(X, G, y, student_id)

HDMModel fit with 1562 observations
Log-likelihood at convergence: -3770.3
Random effect covariance matrix:
2×2 Matrix{Float64}:
1.67215 0.0
0.0 2.19103
Estimated 7 non-zero fixed effects:
──────────────
Estimate
──────────────
1 9.50882
2 -0.261202
3 0.0220777
4 -0.358626
5 1.07749
10 0.157324
13 0.0128426
──────────────
Estimated σ²: 5.9733
Note that by default, hdmm fits a model with the SCAD penalty. To apply the LASSO penalty instead, simply specify penalty = "lasso" in the call that fits the model.
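For example, the following call (a sketch reusing the objects created above; the λ value is illustrative) refits the model with the LASSO penalty:

```julia
# Refit with the LASSO penalty; λ sets the regularization strength
# (10.0 is the package default, so λ = 25.0 here is purely illustrative).
fit_lasso = hdmm(X, G, y, student_id; penalty = "lasso", λ = 25.0)
```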
By default, the features that are assigned random slopes are all those that appear as columns in the matrix X, i.e. those whose coefficients in the model are not penalized. In the above code, X contains the intercept column and the year column, so we are fitting a model with random intercepts and random slopes on year. It's possible to include additional columns in X; if you do so, each corresponding feature will receive a random slope by default. If you wish to include a feature whose coefficient is not penalized, but do not wish to assign this feature a random slope, you can specify the argument Z in the call to hdmm to be a matrix whose columns contain only the variables in X that you wish to assign random slopes.
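As an illustrative sketch (the column selections here are hypothetical), suppose we wanted to leave a third column of the model matrix unpenalized while still giving random slopes only to the intercept and year:

```julia
# Hypothetical: keep three unpenalized fixed effects in X_wide, but let only
# its first two columns (intercept and year) have random slopes via Z.
X_wide = model_mat[:, 1:3]
G_rest = model_mat[:, 4:end]  # remaining penalized covariates
Z = model_mat[:, 1:2]         # random intercept and random slope on year
fit_custom = hdmm(X_wide, G_rest, y, student_id; Z = Z)
```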
We can inspect the model using common extraction functions from StatsBase.jl. For example, to get the residuals and fitted values,
julia> residuals(fit)
1562-element Vector{Float64}:
 -2.704169909459875
  0.9833594225342424
  2.6385642026656875
 -1.1479998926594597
  1.2593833231914573
 -0.03635319476662602
  0.7050503903319552
 -0.6828442320201695
  0.7245389838307439
  0.9702994403682759
  ⋮
 -0.5335857071654466
  0.05696996924064379
 -0.6864947236009336
 -0.46228396830519003
  3.4742611246677555
  1.1402407554200877
 -0.2799784187947161
  0.11662994643531022
 -3.745358103236086

julia> fitted(fit)
1562-element Vector{Float64}:
 17.704169909459875
 18.016640577465758
 18.361435797334313
 19.14799989265946
 19.740616676808543
 18.036353194766626
 18.294949609668045
 18.68284423202017
 19.275461016169256
 20.029700559631724
  ⋮
 17.533585707165447
 17.943030030759356
 18.686494723600934
 19.46228396830519
 17.525738875332245
 17.859759244579912
 18.279978418794716
 18.88337005356469
 19.745358103236086
Note that these fitted values and residuals take the random effects into account by incorporating the best linear unbiased predictions (BLUPs) of the random effects for each student into the predictions.
To print a table with the names of the selected variables and their estimated coefficients:
coeftable(fit, coefnames(model_frame))

| | Estimate |
|---|---|
| (Intercept) | 9.50882 |
| treatment: calorie | -0.261202 |
| treatment: meat | 0.0220777 |
| treatment: milk | -0.358626 |
| year | 1.07749 |
| head_circ | 0.157324 |
| mom_write | 0.0128426 |
Here, we plot the observed scores and our model's predictions for five different students over time:
using Plots
mask = student_id .== 1
plot(cog_df.year[mask], cog_df.ravens[mask], seriestype = :scatter, label = "student 1", color = 1 )
plot!(cog_df.year[mask], fitted(fit)[mask], seriestype = :line, color = 1, linestyle = :dash, linewidth = 3, label = "")
for i in [2,4,5,6]
mask = student_id .== i
# add student to plot
plot!(cog_df.year[mask], cog_df.ravens[mask], seriestype = :scatter, label = "student $i", color = i)
plot!(cog_df.year[mask], fitted(fit)[mask], seriestype = :line, color = i, linestyle = :dash, linewidth = 3, label = "")
end
plot!(legend=:outerbottom, legendcolumns=3, xlabel = "Year", ylabel = "Ravens Score")

Function Documentation
HighDimMixedModels.Control — Type

Algorithm Hyper-parameters
- tol :: Convergence tolerance
- seed :: Random seed for the cross-validation used to estimate initial fixed effect parameters via Lasso
- trace :: Integer. 1 prints no output, 2 prints issues, and 3 prints issues as well as the objective function values during the algorithm
- max_iter :: Integer. Maximum number of iterations
- max_armijo :: Integer. Maximum number of steps in the Armijo rule algorithm. If the maximum is reached, the algorithm doesn't update the current coordinate and proceeds to the next coordinate
- actnum :: Integer between 1 and 5. All fixed effect parameters are updated only every actnum iterations; in between, only the parameters in the current active set are updated
- a₀ :: a₀ in the Armijo step. See Schelldorfer et al. (2011)
- δ :: δ in the Armijo step. See Schelldorfer et al. (2011)
- ρ :: ρ in the Armijo step. See Schelldorfer et al. (2011)
- γ :: γ in the Armijo step. See Schelldorfer et al. (2011)
- lower :: Lower bound for the Hessian
- upper :: Upper bound for the Hessian
- var_int :: Tuple with bounds of the interval on which to optimize the variance parameters in the optimize function. See Optim.jl, section "minimizing a univariate function on a bounded interval"
- cov_int :: Tuple with bounds of the interval on which to optimize the covariance parameters in the optimize function. See Optim.jl, section "minimizing a univariate function on a bounded interval"
- optimize_method :: Symbol denoting the method for performing the univariate optimization, either :Brent or :GoldenSection
- thres :: If a variance or covariance parameter has smaller absolute value than thres, the parameter is set to 0
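For instance, assuming Control exposes a keyword constructor for the fields above (a sketch, not verified against the package internals), custom hyperparameters could be passed to hdmm like so:

```julia
# Hypothetical keyword construction of the Control struct; the field
# names follow the list above, but the values shown are illustrative.
ctrl = Control(tol = 1e-3, max_iter = 500, trace = 2)
fit_ctrl = hdmm(X, G, y, student_id; control = ctrl)
```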
HighDimMixedModels.HDMModel — Type

Fitted model object
HighDimMixedModels.L_diag_update! — Method

Update of coordinate s of L for diagonal covariance structure
ARGUMENTS
- L :: A vector of parameters which will be updated by the function
- s :: The coordinate of L that is being updated (number between 1 and length(L))
HighDimMixedModels.L_ident_update — Method

Update of L for identity covariance structure
HighDimMixedModels.L_sym_update! — Method

Update of L for general symmetric positive definite covariance structure
ARGUMENTS
- L :: A lower triangular matrix of parameters which will be updated by the function
- coords :: Tuple representing the coordinates of the entry of L that is being updated
HighDimMixedModels.armijo! — Method

Armijo Rule
HighDimMixedModels.cov_start — Method

Finds an initial value for the variance and covariance parameters
ARGUMENTS
- XGgrp :: Vector of fixed effect design matrices for each group
- ygrp :: Vector of vectors of responses for each group
- Zgrp :: Vector of random effects design matrices for each group
- β :: Initial iterate for the fixed effect parameter vector (computed with Lasso, ignoring group structure)
OUTPUT
- Assuming β is the true fixed effect vector, the MLE estimates of scalar L and σ², returned as the tuple (L, σ²)
HighDimMixedModels.get_cost — Function

Calculates the objective function
HighDimMixedModels.get_negll — Method

Calculates the negative log-likelihood -l(ϕ̃) = -l(β, θ, σ²) = 0.5(Ntot*log(2π) + log|V| + (y-Xβ)'V⁻¹(y-Xβ))
ARGUMENTS
- invVgrp :: Vector of length the number of groups, each of whose elements is the precision matrix of the responses within a group
- ygrp :: Vector of vectors of responses for each group
- X :: Vector of fixed effect design matrices for each group
- β :: Fixed effects
OUTPUT
- Value of the negative log-likelihood
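The formula above can be sketched directly in Julia. This is a reference implementation of the stated formula under the assumption that precision matrices (so log|V| = -log|V⁻¹|) are supplied per group; it is not the package's internal code, and negll_sketch is a hypothetical name.

```julia
using LinearAlgebra

# -ℓ(β, θ, σ²) = 0.5 * (Ntot*log(2π) + Σᵢ log|Vᵢ| + Σᵢ (yᵢ - Xᵢβ)' Vᵢ⁻¹ (yᵢ - Xᵢβ))
# invVgrp holds the per-group precision matrices, so log|Vᵢ| = -logdet(invVᵢ).
function negll_sketch(invVgrp, ygrp, Xgrp, β)
    Ntot = sum(length, ygrp)
    val = Ntot * log(2π)
    for (invV, y, X) in zip(invVgrp, ygrp, Xgrp)
        r = y - X * β                       # residuals within this group
        val += -logdet(invV) + dot(r, invV * r)
    end
    return 0.5 * val
end
```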
HighDimMixedModels.get_scad — Function

Calculates the SCAD penalty
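For reference, the standard SCAD penalty (Fan & Li, 2001) is the piecewise function sketched below; the parameters λ and a correspond to the λ and scada arguments of hdmm, but this is a reference sketch, not the package's internal implementation.

```julia
# Standard SCAD penalty: linear (LASSO-like) near zero, quadratic blend on
# (λ, aλ], and constant beyond aλ so large coefficients are not shrunk.
function scad_penalty(β, λ, a = 3.7)
    absβ = abs(β)
    if absβ <= λ
        return λ * absβ
    elseif absβ <= a * λ
        return (2a * λ * absβ - absβ^2 - λ^2) / (2 * (a - 1))
    else
        return (a + 1) * λ^2 / 2
    end
end
```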
HighDimMixedModels.hdmm — Function

Fits a penalized linear mixed effect model
ARGUMENTS
Positional:
- X :: Low dimensional design matrix for unpenalized fixed effects (assumed to include column of ones) (REQUIRED)
- G :: High dimensional design matrix for penalized fixed effects (assumed to not include column of ones) (REQUIRED)
- y :: Vector of responses (REQUIRED)
- grp :: Vector of strings of same length as y assigning each observation to a particular group (REQUIRED)
- Z :: Design matrix for random effects (default is all columns of X)
NOTE: Z is not expected to be given in block diagonal form. It should be a vertical stack of subject design matrices Z₁, Z₂, ...
Keyword:
- standardize :: boolean (default true), whether to standardize design matrices before performing algorithm.
- penalty :: One of "scad" (default) or "lasso"
- λ :: Positive regularizing penalty (default is 10.0)
- scada :: Extra tuning parameter for the SCAD penalty (default is 3.7; ignored if penalty is "lasso")
- wts :: Vector whose length is the number of penalized coefficients. The strength of the penalty on covariate j is λ/wⱼ (default is a vector of 1's)
- init_coef :: Named tuple of the form (β, L, σ²) giving initial values for the parameters. If unspecified, initial values are calculated as follows: first, a cross-validated LASSO that ignores the grouping structure is fit to obtain initial estimates of the fixed effect parameters. Then, the random effect parameters are initialized as MLEs, assuming the LASSO estimates are the true fixed effect parameters.
- ψstr :: One of "diag" (default), "ident", or "sym", specifying covariance structure of random effects
- control :: Struct with fields for hyperparameters of the algorithm
OUTPUT
- Fitted model
HighDimMixedModels.hessian_diag! — Method

Calculates active_set entries of the diagonal of the Hessian matrix for the fixed effect parameters
HighDimMixedModels.invV! — Method

Updates precision matrices of the responses, by group
ARGUMENTS
- invVgrp :: Container for precision matrices of the responses, by group
- Zgrp :: Container of random effects design matrices, by group
- L :: Parameters for random effect covariance matrix (can be scalar, vector, or lower triangular matrix)
- σ² :: Variance of error
OUTPUT
- invVgrp :: Vector of length the number of groups, each of whose elements is the precision matrix of the responses within a group
HighDimMixedModels.scad_dir — Method

Calculates descent direction with SCAD penalty
HighDimMixedModels.scad_solution — Method

Gets the analytical solution for the CGD iterate with the SCAD penalty when the Hessian hasn't been truncated
HighDimMixedModels.soft_thresh — Method

Soft Threshold
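For reference, the standard soft-thresholding operator used in LASSO coordinate descent is S(z, λ) = sign(z) · max(|z| − λ, 0); a one-line sketch (not the package's internal code):

```julia
# Soft threshold: shrink z toward zero by λ, clipping to exactly zero
# once |z| falls below λ.
soft_threshold(z, λ) = sign(z) * max(abs(z) - λ, 0)
```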
HighDimMixedModels.special_quad — Method

Calculates (y-ỹ)'(invV)X[:,j], where ỹ are the fitted values obtained by ignoring the jth column, i.e. XG[:,Not(j)]β[Not(j)]. To improve performance, ỹ is calculated with the entire dataset; we then split into groups and calculate (y-ỹ)'(invV)X[:,j] for each group.
HighDimMixedModels.σ²update — Method

Update of σ²