Package overview
This page complements the home page and Getting started with repository conventions, nomenclature, and I/O details that apply across workflows.
Project root, data/, and results/
StochasticGene expects a project root (root keyword, often "."). Under it:
data/holds experimental inputs: histograms, trace folders, condition labels, and optional tables such asCellType_alleles.csvandCellType_halflife.csv.results/<resultfolder>/holds outputs from fit: rate files, diagnostics, and optional run-spec files.
The canonical resolution uses folder_path (see Utilities): if joinpath(root, resultfolder) exists it is used; otherwise joinpath(root, "results", resultfolder) is used (and created if needed). So fits usually land in root/results/<name>/.
Version control: the repository’s .gitignore excludes results/ so large MCMC outputs stay local. Archive what you need for papers or collaboration separately (e.g. exported CSV summaries, small info_*.toml markers without huge binaries).
Use rna_setup("dirname") (see the API reference) to create a minimal tree with example data for learning the layout.
Run specification files (info_<key>)
For key-based workflows, each run can be described by:
info_<key>.toml— small marker pointing at the JLD2 companion.info_<key>.jld2— full keyword dict (including types likemethod,probfn) read byread_run_spec(see Run specification (info TOML)).
fit with keyword key loads this spec when present and merges with explicit keywords (keywords win). Batch helpers such as write_run_spec_preset, makeswarm, and makeswarmfiles generate these files (see Cluster and batch workflows).
States vs steps (nomenclature)
- G states are mutually exclusive configurations of the promoter / gene regulatory model (the gene is in exactly one of G states at a time).
- R steps are positions along the transcription unit; many occupancy patterns are possible at once (which steps are loaded), so the “state space” of the elongation ladder is combinatorial. For R steps, the full GR state is a tensor product of G and the R occupancy pattern (and mRNA counts when tracked).
- With splicing, each R step can be represented in a higher alphabet (e.g. unoccupied / spliced / unspliced); state-space size grows with G, R, and S.
This distinction matters when setting onstates, interpreting simulator output, and reading coupling specs.
Rate vector ordering (typical)
For telegraph-style models, the rate vector stacks contributions in a fixed order (see docstrings and papers): G transitions, then R transitions, S if present, then decay (and noise parameters for traces). Not every index is necessarily fitted; use fittedparam to select indices. Coupled models use tuple layouts for per-unit parameters and a coupling block.
Data types (datatype)
Common values for fit:
datatype | Meaning (short) |
|---|---|
"rna" | Stationary RNA count histogram |
"rnaonoff" | RNA + ON/OFF dwell histograms |
"rnadwelltime" | RNA + multiple dwell-time types |
"trace" | Intensity traces (HMM likelihood) |
"tracerna" | Traces + RNA histogram |
"tracejoint" | Coupled / joint traces between units |
"tracegrid" | Grid-based trace likelihood |
datapath may be a file, folder, or vector of paths depending on type. traceinfo encodes frame interval, time window, active fraction, and background handling for traces.
Output file families
Under each resultfolder, names encode cell, condition, gene, model string (G, R, S, insertstep), and alleles. Typical prefixes:
rates_*.txt— posterior summaries and ML row (see file header / docs).measures_*.txt,param-stats_*.txt— diagnostics and parameter summaries.proposal-cov_*.jld2— proposal covariance matrix and metadata (see MCMC proposal & warmup below).burst_*.txt— optional burst statistics when requested.optimized_*.txt— optional optimizer output.
Underscore _ separates fields in filenames; avoid _ inside user labels where the naming convention would become ambiguous.
MCMC proposal covariance and warmup
When fitting expensive models (e.g., with ODE-based likelihood evaluation that takes minutes per step), proposal covariance reuse can significantly speed up workflows:
Proposal Covariance Reuse
The propcv keyword controls the proposal distribution:
propcv=0.01(positive): Use fixed coefficient of variation. MCMC will compute empirical covariance during warmup (ifwarmupsteps > 0) and save it toproposal-cov_*.jld2.propcv=-0.01(negative): Attempt to load covariance fromproposal-cov_*.jld2if it exists and model parameters match exactly (G, R, S, transitions, fittedparam, nalleles all equal).- If loading succeeds: Warmup is automatically skipped (even if
warmupsteps > 0), and sampling proceeds immediately with the loaded proposal. - If loading fails: Falls back to
abs(propcv)and warmup proceeds normally.
- If loading succeeds: Warmup is automatically skipped (even if
Workflow example:
# First run: compute and save covariance
fits1 = fit(; G=2, R=0, transitions=([1,2],[2,1]),
propcv=0.01, warmupsteps=10000, ...)
# Subsequent run: reuse covariance (warmup skipped automatically)
fits2 = fit(; G=2, R=0, transitions=([1,2],[2,1]),
propcv=-0.01, warmupsteps=10000, ...) # warmup still specified but skippedNo need to remember to set warmupsteps=0 on the second run — the presence of a loaded covariance automatically prevents warmup from running.
Adaptive Warmup
When warmupsteps > 0 and no covariance is loaded, the warmup phase adapts the proposal covariance to improve MCMC efficiency:
- Periodic Adaptation: Adapts every
max(1000, samplesteps ÷ 3)steps (typically 2–3 times per warmup). - Acceptance Rate Targeting:
- Acceptance target scales with problem dimension: 44% (d=1) → 30% (d=5–20) → 23.4% (d>>1).
- If current rate < 15%, shrinks proposals; if > 40%, expands them.
- Time Allocation: Warmup time is proportional to step count:
warmup_time = maxtime × (warmupsteps / total_steps). For expensive steps, increasemaxtimeor reducewarmupstepsif warmup times out before adaptation triggers.
The saved covariance is stored as proposal-cov_<name>.jld2 with metadata validation, ensuring proposals are only reused when model structure matches.
Cluster workflows (pointer)
For NIH Biowulf swarm generation, makeswarm / makeswarm_genes / makeswarmfiles, and the coupled pipeline (single-unit fits → create_combined_file → coupled fit), read Cluster and batch workflows.
Coupled CSV (Coupled_models_to_test)
Batch coupled jobs can read CSVs processed in coupled_csv.jl. Each row defines a coupled model by specifying connections between units.
Column naming (by pattern; order-independent):
Model_name— required, key for the modelenhancer_to_gene_1,enhancer_sign_1— optional, enhancer→gene connections for unit 1enhancer_to_gene_2,enhancer_sign_2— optional, enhancer→gene connections for unit 2gene_to_enhancerorgene_to_enhancer_sign— optional, gene→enhancer connectionsbackground_gene,background_gene_sign—optional, genetic background connections- Sign columns (
*_sign) accept:">0"(activate), anything else except empty/0/"free"(inhibit), or empty/0/"free"(:free mode)
Minimal example: Model_name, enhancer_to_gene_1, enhancer_sign_1 (other couplings default to :free mode).
See docstrings for csv_row_to_connections_simple, build_coupled_fit_spec_from_csv_cells, and makeswarmfiles_coupled (use ? in REPL after using StochasticGene), and the Coupled CSV section in the long docstring of makeswarmfiles (source: biowulf.jl; overview: Cluster and batch workflows).