Package overview

This page complements the home page and Getting started with repository conventions, nomenclature, and I/O details that apply across workflows.

Project root, data/, and results/

StochasticGene expects a project root (root keyword, often "."). Under it:

  • data/ holds experimental inputs: histograms, trace folders, condition labels, and optional tables such as CellType_alleles.csv and CellType_halflife.csv.
  • results/<resultfolder>/ holds outputs from fit: rate files, diagnostics, and optional run-spec files.

The canonical resolution uses folder_path (see Utilities): if joinpath(root, resultfolder) exists it is used; otherwise joinpath(root, "results", resultfolder) is used (and created if needed). So fits usually land in root/results/<name>/.

Version control: the repository’s .gitignore excludes results/ so large fit outputs (MCMC, NUTS, ADVI) stay local. Archive what you need for papers or collaboration separately (e.g. exported CSV summaries, small info_*.toml markers without huge binaries).

Use rna_setup("dirname") (see the API reference) to create a minimal tree with example data for learning the layout.

Run specification files (info_<key>)

For key-based workflows, each run can be described by:

  • info_<key>.toml — small marker pointing at the JLD2 companion.
  • info_<key>.jld2 — full keyword dict (including types like method, probfn) read by read_run_spec (see Run specification (info TOML)).

fit with keyword key loads this spec when present and merges with explicit keywords (keywords win). Batch helpers such as write_run_spec_preset, makeswarm, and makeswarmfiles generate these files (see Cluster and batch workflows).

States vs steps (nomenclature)

  • G states are mutually exclusive configurations of the promoter / gene regulatory model (the gene is in exactly one of G states at a time).
  • R steps are positions along the transcription unit; many occupancy patterns are possible at once (which steps are loaded), so the “state space” of the elongation ladder is combinatorial. For R steps, the full GR state is a tensor product of G and the R occupancy pattern (and mRNA counts when tracked).
  • With splicing, each R step can be represented in a higher alphabet (e.g. unoccupied / spliced / unspliced); state-space size grows with G, R, and S.

This distinction matters when setting onstates, interpreting simulator output, and reading coupling specs.

Rate vector ordering (typical)

For telegraph-style models, the rate vector stacks contributions in a fixed order (see docstrings and papers): G transitions, then R transitions, S if present, then decay (and noise parameters for traces). Not every index is necessarily fitted; use fittedparam to select indices. Coupled models use tuple layouts for per-unit parameters and a coupling block.

Data types (datatype)

Common values for fit:

datatypeMeaning (short)
"rna"Stationary RNA count histogram
"rnaonoff"RNA + ON/OFF dwell histograms
"rnadwelltime"RNA + multiple dwell-time types
"trace"Intensity traces (HMM likelihood)
"tracerna"Traces + RNA histogram
"tracejoint"Coupled / joint traces between units
"tracegrid"Grid-based trace likelihood

For v1.10 and later, datatype may also be a tuple or vector of elementary modalities to request CombinedData:

datatype = (:rna, :dwelltime)
datapath = (
    rna = "HBEC_smFISH",
    dwelltime = ["dwelltime/CANX_ON.csv", "dwelltime/CANX_OFF.csv"],
)

The tuple order is canonicalized, so (:dwelltime, :rna) is equivalent to (:rna, :dwelltime). Each modality is loaded independently; likelihoods are evaluated per modality and combined for total likelihood and WAIC.

datapath may be a file, folder, vector/tuple of paths, or for CombinedData a modality-keyed NamedTuple. trace_specs and dwell_specs are the preferred metadata containers for trace and dwell-time observations. Legacy traceinfo / dttype values can be read from old run specs during migration.

infolder and inlabel are retired. Use root + datapath for inputs and label + resultfolder for output naming/routing.

Output file families

Under each resultfolder, names encode cell, condition, gene, model string (G, R, S, insertstep), and alleles. Typical prefixes:

  • rates_*.txt — posterior summaries and ML row (see file header / docs).
  • measures_*.txt, param-stats_*.txt — diagnostics and parameter summaries.
  • proposal-cov_*.jld2 — proposal covariance matrix and metadata (see MCMC proposal & warmup below).
  • burst_*.txt — optional burst statistics when requested.
  • optimized_*.txt — optional optimizer output.

Underscore _ separates fields in filenames; avoid _ inside user labels where the naming convention would become ambiguous.

MCMC proposal covariance and warmup

When fitting expensive models (e.g., with ODE-based likelihood evaluation that takes minutes per step), proposal covariance reuse can significantly speed up workflows:

Proposal Covariance Reuse

The propcv keyword controls the proposal distribution:

  • propcv=0.01 (positive): Use fixed coefficient of variation. MCMC will compute empirical covariance during warmup (if warmupsteps > 0) and save it to proposal-cov_*.jld2.
  • propcv=-0.01 (negative): Attempt to load covariance from proposal-cov_*.jld2 if it exists and model parameters match exactly (G, R, S, transitions, fittedparam, nalleles all equal).
    • If loading succeeds: Warmup is automatically skipped (even if warmupsteps > 0), and sampling proceeds immediately with the loaded proposal.
    • If loading fails: Falls back to abs(propcv) and warmup proceeds normally.

Workflow example:

# First run: compute and save covariance
fits1 = fit(; G=2, R=0, transitions=([1,2],[2,1]), 
            propcv=0.01, warmupsteps=10000, ...)

# Subsequent run: reuse covariance (warmup skipped automatically)
fits2 = fit(; G=2, R=0, transitions=([1,2],[2,1]), 
            propcv=-0.01, warmupsteps=10000, ...)  # warmup still specified but skipped

No need to remember to set warmupsteps=0 on the second run — the presence of a loaded covariance automatically prevents warmup from running.

Adaptive Warmup

When warmupsteps > 0 and no covariance is loaded, the warmup phase adapts the proposal covariance to improve MCMC efficiency:

  • Periodic Adaptation: Adapts every max(1000, samplesteps ÷ 3) steps (typically 2–3 times per warmup).
  • Acceptance Rate Targeting:
    • Acceptance target scales with problem dimension: 44% (d=1) → 30% (d=5–20) → 23.4% (d>>1).
    • If current rate < 15%, shrinks proposals; if > 40%, expands them.
  • Time Allocation: Warmup time is proportional to step count: warmup_time = maxtime × (warmupsteps / total_steps). For expensive steps, increase maxtime or reduce warmupsteps if warmup times out before adaptation triggers.

The saved covariance is stored as proposal-cov_<name>.jld2 with metadata validation, ensuring proposals are only reused when model structure matches.

Cluster workflows (pointer)

For NIH Biowulf swarm generation, makeswarm / makeswarm_genes / makeswarmfiles, and the coupled pipeline (single-unit fits → create_combined_file → coupled fit), read Cluster and batch workflows.

Coupled CSV (Coupled_models_to_test)

Batch coupled jobs can read CSVs processed in coupled_csv.jl. Each row defines a coupled model by specifying connections between units.

Column naming (by pattern; order-independent):

  • Model_name — required, key for the model
  • enhancer_to_gene_1, enhancer_sign_1 — optional, enhancer→gene connections for unit 1
  • enhancer_to_gene_2, enhancer_sign_2 — optional, enhancer→gene connections for unit 2
  • gene_to_enhancer or gene_to_enhancer_sign — optional, gene→enhancer connections
  • background_gene, background_gene_sign —optional, genetic background connections
  • Sign columns (*_sign) accept: ">0" (activate), anything else except empty/0/"free" (inhibit), or empty/0/"free" (:free mode)

Minimal example: Model_name, enhancer_to_gene_1, enhancer_sign_1 (other couplings default to :free mode).

See docstrings for csv_row_to_connections_simple, build_coupled_fit_spec_from_csv_cells, and makeswarmfiles_coupled (use ? in REPL after using StochasticGene), and the Coupled CSV section in the long docstring of makeswarmfiles (source: biowulf.jl; overview: Cluster and batch workflows).