Package overview
This page complements the home page and Getting started with repository conventions, nomenclature, and I/O details that apply across workflows.
Project root, data/, and results/
StochasticGene expects a project root (root keyword, often "."). Under it:
- data/ holds experimental inputs: histograms, trace folders, condition labels, and optional tables such as CellType_alleles.csv and CellType_halflife.csv.
- results/<resultfolder>/ holds outputs from fit: rate files, diagnostics, and optional run-spec files.
The canonical resolution uses folder_path (see Utilities): if joinpath(root, resultfolder) exists it is used; otherwise joinpath(root, "results", resultfolder) is used (and created if needed). So fits usually land in root/results/<name>/.
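The resolution rule can be sketched in a few lines. This is a minimal illustration, not the package's implementation: `resolve_resultfolder` is a hypothetical name, and the real `folder_path` may handle more cases.

```julia
# Hypothetical helper mirroring the rule described above;
# the package's own `folder_path` may differ in options and edge cases.
function resolve_resultfolder(root::AbstractString, resultfolder::AbstractString)
    direct = joinpath(root, resultfolder)
    isdir(direct) && return direct            # an existing folder directly under root wins
    nested = joinpath(root, "results", resultfolder)
    mkpath(nested)                            # otherwise use root/results/<name>, creating it if needed
    return nested
end

# Try it in a throwaway project root.
root = mktempdir()
mkpath(joinpath(root, "scratch"))
resolved_existing = resolve_resultfolder(root, "scratch")   # resolves directly under root
resolved_new = resolve_resultfolder(root, "fit-test")       # resolves under root/results/
```

In the second call the folder does not exist under the root, so the sketch falls through to the results/ subfolder, which is the usual outcome for a fresh fit.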
Version control: the repository’s .gitignore excludes results/ so large fit outputs (MCMC, NUTS, ADVI) stay local. Archive what you need for papers or collaboration separately (e.g. exported CSV summaries, small info_*.toml markers without huge binaries).
Use rna_setup("dirname") (see the API reference) to create a minimal tree with example data for learning the layout.
Run specification files (info_<key>)
For key-based workflows, each run can be described by:
- info_<key>.toml: small marker pointing at the JLD2 companion.
- info_<key>.jld2: full keyword dict (including types like method, probfn) read by read_run_spec (see Run specification (info TOML)).
fit with keyword key loads this spec when present and merges with explicit keywords (keywords win). Stage batch helpers such as make_fitscript, make_fitscripts_from_csv, make_commandfile, and make_commandfile_from_csv use these keys directly; Biowulf-oriented makeswarm* helpers also support this layout (see Cluster and batch workflows).
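The "keywords win" rule behaves like a dictionary merge in which the explicit keywords are applied last. The sketch below uses plain `Dict`s as stand-ins; the keys and values are illustrative and are not read from a real run spec.

```julia
# Stand-in for a stored run spec; keys and values are illustrative only.
stored_spec = Dict{Symbol,Any}(:G => 2, :R => 0, :datatype => "rna", :maxtime => 60.0)

# Keywords passed explicitly at the call site (also illustrative).
explicit = Dict{Symbol,Any}(:maxtime => 600.0, :warmupsteps => 5000)

# `merge` keeps the value from the last collection on key conflicts,
# so explicit keywords override the stored spec.
effective = merge(stored_spec, explicit)
```

Here `effective[:maxtime]` comes from the explicit keywords, while `:G`, `:R`, and `:datatype` are carried over from the stored spec unchanged.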
States vs steps (nomenclature)
- G states are mutually exclusive configurations of the promoter / gene regulatory model (the gene is in exactly one of G states at a time).
- R steps are positions along the transcription unit; many occupancy patterns are possible at once (which steps are loaded), so the “state space” of the elongation ladder is combinatorial. For R steps, the full GR state is a tensor product of G and the R occupancy pattern (and mRNA counts when tracked).
- With splicing, each R step can be represented in a higher alphabet (e.g. unoccupied / spliced / unspliced); state-space size grows with G, R, and S.
This distinction matters when setting onstates, interpreting simulator output, and reading coupling specs.
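To get a feel for how quickly the combinatorial part grows, the following sketch counts states under a deliberately simple assumption: every R step independently takes one of `levels` values. The function `n_states` is hypothetical, and the package's actual state space may be smaller (for example, depending on insertstep).

```julia
# Illustrative count only. Assumes each R step independently takes one of
# `levels` values: 2 for unoccupied/occupied, 3 for unoccupied/unspliced/spliced.
n_states(G::Integer, R::Integer; levels::Integer = 2) = G * levels^R

no_splicing = n_states(2, 3)                 # 2 * 2^3
with_splicing = n_states(2, 3; levels = 3)   # 2 * 3^3
```

G enters the count linearly because the G states are mutually exclusive, while R enters as an exponent because several steps can be occupied at once.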
Rate vector ordering (typical)
For telegraph-style models, the rate vector stacks contributions in a fixed order (see docstrings and papers): G transitions, then R transitions, S if present, then decay (and noise parameters for traces). Not every index is necessarily fitted; use fittedparam to select indices. Coupled models use tuple layouts for per-unit parameters and a coupling block.
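One way to keep track of a stacked rate vector is to compute the index range of each block from its length. The block lengths below are hypothetical placeholders chosen for illustration; the real counts depend on the model, and `block_ranges` is not a package function.

```julia
# Map each named block to its index range in a stacked vector.
# Block names follow the ordering described above; lengths are made up.
function block_ranges(blocks::Vector{Pair{Symbol,Int}})
    ranges = Dict{Symbol,UnitRange{Int}}()
    start = 1
    for (name, len) in blocks
        ranges[name] = start:(start + len - 1)   # empty range when len == 0
        start += len
    end
    return ranges
end

layout = block_ranges([:G => 2, :R => 3, :S => 0, :decay => 1, :noise => 4])

# Example selection: fit only the G and R blocks, leaving decay and noise fixed.
fitted = vcat(collect(layout[:G]), collect(layout[:R]))
```

A vector of indices like `fitted` is the kind of value that fittedparam expects, though the specific indices for a real model should be taken from the package documentation.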
Data types (datatype)
Common values for fit:
| datatype | Meaning (short) |
|---|---|
| "rna" | Stationary RNA count histogram |
| "rnaonoff" | RNA + ON/OFF dwell histograms |
| "rnadwelltime" | RNA + multiple dwell-time types |
| "trace" | Intensity traces (HMM likelihood) |
| "tracerna" | Traces + RNA histogram |
| "tracejoint" | Coupled / joint traces between units |
| "tracegrid" | Grid-based trace likelihood |
For v1.10 and later, datatype may also be a tuple or vector of elementary modalities to request CombinedData:
datatype = (:rna, :dwelltime)
datapath = (
    rna = "HBEC_smFISH",
    dwelltime = ["dwelltime/CANX_ON.csv", "dwelltime/CANX_OFF.csv"],
)

The tuple order is canonicalized, so (:dwelltime, :rna) is equivalent to (:rna, :dwelltime). Each modality is loaded independently; likelihoods are evaluated per modality and combined for total likelihood and WAIC.
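A rough sketch of what canonicalization and per-modality combination mean in practice follows. It rests on two assumptions: alphabetical sorting stands in for whatever canonical order the package uses, and the modalities are treated as independent so their log-likelihoods add. The numbers are placeholders, not results.

```julia
# Assumption: alphabetical order stands in for the package's canonical order.
canonical(modalities) = Tuple(sort(collect(modalities)))

# Both orderings reduce to the same canonical request.
same_request = canonical((:dwelltime, :rna)) == canonical((:rna, :dwelltime))

# Placeholder per-modality log-likelihoods (made-up numbers).
loglik = (rna = -120.5, dwelltime = -48.25)

# Assumption: independent modalities, so log-likelihoods are summed.
total_loglik = sum(values(loglik))
```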
datapath may be a file, folder, vector/tuple of paths, or for CombinedData a modality-keyed NamedTuple. trace_specs and dwell_specs are the preferred metadata containers for trace and dwell-time observations. Legacy traceinfo / dttype values can be read from old run specs during migration.
infolder and inlabel are retired. Use root + datapath for inputs and label + resultfolder for output naming/routing.
Output file families
Under each resultfolder, names encode cell, condition, gene, model string (G, R, S, insertstep), and alleles. Typical prefixes:
- rates_*.txt: posterior summaries and ML row (see file header / docs).
- measures_*.txt, param-stats_*.txt: diagnostics and parameter summaries.
- proposal-cov_*.jld2: proposal covariance matrix and metadata (see MCMC proposal covariance and warmup below).
- burst_*.txt: optional burst statistics when requested.
- optimized_*.txt: optional optimizer output.
Underscore _ separates fields in filenames; avoid _ inside user labels where the naming convention would become ambiguous.
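The ambiguity is easy to reproduce. The sketch below builds and splits a filename using a hypothetical field order (cell, condition, gene, model, alleles); it is not the package's naming code, and the real convention should be checked against the docs.

```julia
# Hypothetical naming helpers, for illustration only.
make_name(fields...) = "rates_" * join(fields, "_") * ".txt"
parse_name(name) = split(replace(name, r"^rates_|\.txt$" => ""), "_")

# Five fields in, five fields out.
clean = parse_name(make_name("HBEC", "ctrl", "CANX", "2", "1"))

# An underscore inside the condition label yields six fields,
# and a parser can no longer tell which pieces belong together.
ambiguous = parse_name(make_name("HBEC", "ctrl_rep1", "CANX", "2", "1"))
```

Using a hyphen in the label instead (for example "ctrl-rep1") keeps the field count stable.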
MCMC proposal covariance and warmup
When fitting expensive models (e.g., with ODE-based likelihood evaluation that takes minutes per step), reusing the proposal covariance can significantly speed up workflows.
Proposal Covariance Reuse
The propcv keyword controls the proposal distribution:
- propcv=0.01 (positive): Use a fixed coefficient of variation. MCMC will compute the empirical covariance during warmup (if warmupsteps > 0) and save it to proposal-cov_*.jld2.
- propcv=-0.01 (negative): Attempt to load the covariance from proposal-cov_*.jld2 if it exists and model parameters match exactly (G, R, S, transitions, fittedparam, nalleles all equal).
  - If loading succeeds: Warmup is automatically skipped (even if warmupsteps > 0), and sampling proceeds immediately with the loaded proposal.
  - If loading fails: Falls back to abs(propcv) and warmup proceeds normally.
Workflow example:
# First run: compute and save covariance
fits1 = fit(; G=2, R=0, transitions=([1,2],[2,1]),
            propcv=0.01, warmupsteps=10000, ...)
# Subsequent run: reuse covariance (warmup skipped automatically)
fits2 = fit(; G=2, R=0, transitions=([1,2],[2,1]),
            propcv=-0.01, warmupsteps=10000, ...)  # warmup still specified but skipped

No need to remember to set warmupsteps=0 on the second run: the presence of a loaded covariance automatically prevents warmup from running.
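The decision rule behind propcv can be summarized as a small function. This is a sketch under stated assumptions: `proposal_plan` and the metadata NamedTuple are hypothetical, and the package's validation may compare more or different fields than shown here.

```julia
# Hypothetical decision helper: `saved_meta` is `nothing` when no
# proposal-cov file exists, otherwise the metadata stored with it.
function proposal_plan(propcv::Real, warmupsteps::Integer, saved_meta, current_meta)
    if propcv < 0 && saved_meta !== nothing && saved_meta == current_meta
        # Saved covariance matches the model: reuse it and skip warmup.
        return (source = :loaded, cv = nothing, warmup = 0)
    end
    # Positive propcv, missing file, or mismatch: fixed CV and normal warmup.
    return (source = :fixed_cv, cv = abs(propcv), warmup = warmupsteps)
end

meta = (G = 2, R = 0, S = 0, transitions = ([1, 2], [2, 1]),
        fittedparam = [1, 2], nalleles = 2)

first_run = proposal_plan(0.01, 10_000, nothing, meta)                  # nothing saved yet
reuse = proposal_plan(-0.01, 10_000, meta, meta)                        # exact match
mismatch = proposal_plan(-0.01, 10_000, merge(meta, (G = 3,)), meta)    # saved file is for G=3
```

The `mismatch` case shows the fallback: a negative propcv with non-matching metadata behaves like the positive value of the same magnitude.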
Adaptive Warmup
When warmupsteps > 0 and no covariance is loaded, the warmup phase adapts the proposal covariance to improve MCMC efficiency:
- Periodic Adaptation: Adapts every max(1000, samplesteps ÷ 3) steps (typically 2–3 times per warmup).
- Acceptance Rate Targeting:
  - Acceptance target scales with problem dimension: 44% (d=1) → 30% (d=5–20) → 23.4% (d>>1).
  - If current rate < 15%, shrinks proposals; if > 40%, expands them.
- Time Allocation: Warmup time is proportional to step count: warmup_time = maxtime × (warmupsteps / total_steps). For expensive steps, increase maxtime or reduce warmupsteps if warmup times out before adaptation triggers.
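The formulas above are simple enough to check by hand before launching a long run. The helper names below are hypothetical, the thresholds are the ones quoted in the list, and total_steps is assumed here to be warmup plus sampling steps.

```julia
# Adaptation interval, as stated in the list above.
adapt_interval(samplesteps::Integer) = max(1000, samplesteps ÷ 3)

# Share of the wall-clock budget given to warmup.
warmup_time(maxtime::Real, warmupsteps::Integer, total_steps::Integer) =
    maxtime * warmupsteps / total_steps

# Direction of the proposal-scale change for an observed acceptance rate.
function adapt_direction(rate::Real)
    rate < 0.15 && return :shrink
    rate > 0.40 && return :expand
    return :keep
end

interval = adapt_interval(30_000)
# Assumption: total_steps = warmupsteps + samplesteps = 10_000 + 30_000.
budget = warmup_time(3600.0, 10_000, 40_000)
```

If `budget` divided by the per-step cost is smaller than `interval`, warmup will time out before the first adaptation, which is the situation the Time Allocation item warns about.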
The saved covariance is stored as proposal-cov_<name>.jld2 with metadata validation, ensuring proposals are only reused when model structure matches.
Cluster workflows (pointer)
For stage-native command-file generation (make_commandfile*), Biowulf swarm generation (make_swarmfile_from_csv, makeswarm*), and the coupled pipeline (single-unit fits → create_combined_file → coupled fit), read Cluster and batch workflows.
Coupled CSV (Coupled_models_to_test)
Batch coupled jobs can read CSVs processed in coupled_csv.jl. Each row defines a coupled model by specifying connections between units.
Column naming (by pattern; order-independent):
- Model_name: required, key for the model
- enhancer_to_gene_1, enhancer_sign_1: optional, enhancer→gene connections for unit 1
- enhancer_to_gene_2, enhancer_sign_2: optional, enhancer→gene connections for unit 2
- gene_to_enhancer or gene_to_enhancer_sign: optional, gene→enhancer connections
- background_gene, background_gene_sign: optional, genetic background connections
- Sign columns (*_sign) accept: ">0" (activate), anything else except empty/0/"free" (inhibit), or empty/0/"free" (:free mode)
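The sign rule can be written as a three-way classifier. `parse_sign` is a hypothetical function written from the rule above, not the code in coupled_csv.jl; treating a `missing` cell as empty, trimming whitespace, and ignoring case are assumptions.

```julia
# Hypothetical classifier for a *_sign cell.
function parse_sign(cell)
    ismissing(cell) && return :free           # assumption: a missing cell counts as empty
    s = lowercase(strip(string(cell)))        # assumption: whitespace and case are ignored
    s in ("", "0", "free") && return :free
    s == ">0" && return :activate
    return :inhibit                           # anything else
end

signs = [parse_sign(x) for x in (">0", "<0", "", 0, "free", missing)]
```

Note the asymmetry in the rule: only the literal ">0" activates, so a cell such as "+" or "1" would be classified as inhibit under this reading.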
Minimal example: Model_name, enhancer_to_gene_1, enhancer_sign_1 (other couplings default to :free mode).
See docstrings for csv_row_to_connections_simple, build_coupled_fit_spec_from_csv_cells, and makeswarmfiles_coupled (use ? in REPL after using StochasticGene), and the Coupled CSV section in the long docstring of makeswarmfiles (source: biowulf.jl; overview: Cluster and batch workflows).