Contributing to LabTrust-Gym

Development setup

git clone https://github.com/fraware/LabTrust-Gym.git
cd LabTrust-Gym
pip install -e ".[dev]"
labtrust --version   # optional: check version + git SHA

For the full verification command sequence, see Evaluation checklist. To test and audit the repo (lint, format, typecheck, tests, benchmarks, quick-eval, coordination, reproduce, docs), run those steps individually or use make verify.

Documentation

Comments and docstrings should be clear and free of unexplained jargon. See Documentation standards for module/class/function docstrings, structure, and style. New or modified public modules, classes, or functions must have docstrings that meet those standards (module purpose, no unexplained jargon, and for functions: summary and Args/Returns where applicable). Existing code is being brought up to standard incrementally.

Code quality

Before opening a PR:

  • ruff format and ruff check (lines must not exceed 120 characters; E501 is enforced). For naming exceptions (N802, N806), see Code style and lint.
  • mypy src/ (must pass; CI fails on type errors)
  • pytest -q -m "not slow" (quick check; excludes slow tests). For full env-dependent tests (security, CLI smoke): pip install -e ".[dev,env]" then run the security and CLI smoke steps from CI.
  • labtrust validate-policy
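The four checks above can be chained so that one run reports every failure instead of stopping at the first. The following is a minimal sketch, not a script shipped with the repo: the command lists are copied from the checklist, while `PRE_PR_CHECKS`, `run_checks`, and `_exit_code` are hypothetical names introduced here. It assumes the tools are already on `PATH` (for example after `pip install -e ".[dev]"`).

```python
import subprocess
from typing import Callable, Sequence

# Commands copied from the checklist above, in the order listed.
# Note: "ruff format" rewrites files in place.
PRE_PR_CHECKS: list[list[str]] = [
    ["ruff", "format"],
    ["ruff", "check"],
    ["mypy", "src/"],
    ["pytest", "-q", "-m", "not slow"],
    ["labtrust", "validate-policy"],
]


def _exit_code(cmd: Sequence[str]) -> int:
    """Run one command and return its exit status."""
    return subprocess.run(list(cmd)).returncode


def run_checks(
    checks: Sequence[Sequence[str]] = PRE_PR_CHECKS,
    run: Callable[[Sequence[str]], int] = _exit_code,
) -> list[str]:
    """Run every check, even after a failure, and return the failed commands."""
    failed = []
    for cmd in checks:
        if run(cmd) != 0:
            failed.append(" ".join(cmd))
    return failed
```

Calling `run_checks()` from the repo root returns an empty list when everything passes. The `run` parameter exists only so the loop can be exercised without invoking the real tools.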

Policy files under policy/ must validate against the JSON schemas in policy/schemas/. validate-policy checks schema and structural validity only; it does not check logical correctness (e.g. zone connectivity, invariant feasibility, or that controls match risks). After validate-policy, review policy for logical consistency and appropriateness of controls; validation is necessary but not sufficient. New or modified policy files must pass validation. Legacy and design-only YAML (e.g. override matrix, compiler contracts) live under docs/architecture/design/ and are not loaded by the runtime.
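To illustrate the gap between schema validity and logical correctness, here is a sketch of a zone connectivity check of the kind a reviewer might do by hand after validate-policy passes. The data shape is an assumption: zones as a list of names and links as name pairs, treated as two-way. It is not the format of the files under policy/, and `unreachable_zones` is a hypothetical helper, not part of labtrust.

```python
from collections import deque


def unreachable_zones(zones, links, start):
    """Return the zones that cannot be reached from `start`, sorted by name."""
    neighbours = {zone: set() for zone in zones}
    for a, b in links:
        # Assumption for this sketch: every link can be walked both ways.
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # Breadth-first search from the starting zone.
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(set(zones) - seen)


# Illustrative data only: "archive" is declared but has no link to it.
example = unreachable_zones(
    zones=["reception", "lab", "freezer", "archive"],
    links=[("reception", "lab"), ("lab", "freezer")],
    start="reception",
)
```

A layout like the example would pass a schema that only requires a list of zones and a list of links, yet one zone is isolated. Schema validation does not catch this kind of defect.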

Testing and contracts

  • Frozen contracts: Do not weaken runner output, queue contract, coordination interface, or risk register schema without a version bump and doc update. See Frozen contracts for the canonical list. For coordination: when N <= N_max only propose_actions is used; combine_submissions is never called (backward-compat guarantee).
  • Implementation audit: For what is covered by automated tests versus manual checklists, see Evaluation checklist and CI.
  • Troubleshooting: For common failures (verify-bundle, policy validation, pack gate, E2E chain), see Troubleshooting.

Keep the repo root minimal: do not commit CLI or build artifacts (e.g. results.json, out.json, bench_smoke_*.json, quick_eval_*/, site/). Use labtrust_runs/ or --out <path> for benchmark and study outputs. See Repository structure.
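A quick way to spot stray artifacts before committing is to match the root directory against the example names above. This is a sketch only: the patterns come from the examples in the preceding paragraph and are not a complete list, and `find_root_artifacts` is a hypothetical helper rather than a repo tool.

```python
from fnmatch import fnmatchcase
from pathlib import Path

# Patterns taken from the examples in the paragraph above.
FILE_PATTERNS = ["results.json", "out.json", "bench_smoke_*.json"]
DIR_PATTERNS = ["quick_eval_*", "site"]


def find_root_artifacts(root):
    """List entries directly under `root` that look like CLI or build artifacts."""
    hits = []
    for entry in sorted(Path(root).iterdir()):
        is_dir = entry.is_dir()
        patterns = DIR_PATTERNS if is_dir else FILE_PATTERNS
        if any(fnmatchcase(entry.name, pattern) for pattern in patterns):
            # Mark directories with a trailing slash, as in the text above.
            hits.append(entry.name + ("/" if is_dir else ""))
    return hits
```

Running `find_root_artifacts(".")` from the repo root should return an empty list on a clean checkout. Anything it reports belongs under labtrust_runs/ or a path given with --out.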

Golden suite

The golden scenarios in policy/golden/golden_scenarios.v0.1.yaml define correctness. Do not weaken expectations. When adding engine behaviour, extend the suite only with new scenarios or new assertions; do not relax existing ones.
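The "extend, never relax" rule can be partly checked mechanically by comparing the suite before and after a change. The sketch below assumes a simplified shape, a mapping from scenario id to a list of assertion strings. That shape is not the actual layout of golden_scenarios.v0.1.yaml, and `removed_expectations` is a hypothetical name. It catches deleted scenarios and deleted assertions only. An assertion whose threshold was loosened appears as one removal plus one addition and still needs human review.

```python
def removed_expectations(old, new):
    """Describe every scenario or assertion present in `old` but missing in `new`."""
    problems = []
    for scenario_id, old_assertions in old.items():
        if scenario_id not in new:
            problems.append(f"scenario removed: {scenario_id}")
            continue
        for assertion in old_assertions:
            if assertion not in new[scenario_id]:
                problems.append(
                    f"assertion removed from {scenario_id}: {assertion}"
                )
    return problems


# Illustrative data only: adding a scenario and an assertion is allowed.
before = {"s1": ["a", "b"]}
after = {"s1": ["a", "b", "c"], "s2": ["d"]}
```

With this data, `removed_expectations(before, after)` returns an empty list because the change only adds expectations.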

PR checklist

  • New or modified policy files validated
  • New emit types added to policy/emits/emits_vocab.v0.1.yaml (or confirmed that the change adds none)
  • Golden suite impact explained
  • Tests added or updated
  • New or modified public functions/methods have docstrings in Google style (summary + Args/Returns/Raises where applicable)
  • New LLM coordination method or backend: Confirm the LLM excellence checklist (schema-valid decisions, hard-fail to NOOP, metadata, integration) and add the entry to the table in that section.

Preferred PR size: under 400 lines where practical.

Before tagging a release, run the full E2E artifacts chain and ensure it passes (package-release → export-risk-register into release dir → build-release-manifest → verify-release --strict-fingerprints → schema/crosswalk). Run python scripts/validate_security_safety_refs.py to ensure risk_registry, security_attack_suite, and safety case claims stay aligned. See CI and Trust verification.

Optional smoke tests (env vars)

  • Quick-eval — Run 1 episode each of throughput_sla, adversarial_disruption, multi_site_stat: labtrust quick-eval --seed 42 (requires .[env,plots]). CI runs this on every push/PR.
  • LABTRUST_BENCH_SMOKE=1 — Run benchmark smoke (1 episode per task): labtrust bench-smoke --seed 42 (requires .[env]).
  • LABTRUST_REPRO_SMOKE=1 — Run reproduce smoke: labtrust reproduce --profile minimal --out runs/repro_smoke (requires .[env,plots]).
  • Coordination tests — Run all coordination-related tests: pytest -q tests/ -k coordination (requires .[env]). CI coordination-smoke job (when LABTRUST_COORDINATION_SMOKE=1) runs validate-policy, these tests, and one-episode coord_scale + coord_risk. See Coordination methods (Coordination done checklist section).
  • LABTRUST_PAPER_SMOKE=1 — Run package-release paper profile smoke (1 episode baselines, 2 episodes insider_key_misuse study): labtrust package-release --profile paper_v0.1 --seed-base 100 --out /tmp/paper_smoke (requires .[env,plots]). Determinism: pytest tests/test_package_release.py -v (includes paper_v0.1 smoke and CLI test).
  • LABTRUST_MARL_SMOKE=1 — Run MARL smoke: pytest tests/test_marl_smoke.py -v (requires .[marl]).
  • Package-release: labtrust package-release --profile minimal --out /tmp/labtrust_release --seed-base 100 (requires .[env,plots]). For paper-ready artifact: --profile paper_v0.1 (see Paper provenance). Determinism: pytest tests/test_package_release.py -v with LABTRUST_REPRO_SMOKE=1 for minimal/full; paper_v0.1 tests use LABTRUST_PAPER_SMOKE=1.
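The opt-in pattern behind the LABTRUST_*_SMOKE variables can be sketched with the standard library alone. This illustrates the gating idea and is not the repo's implementation: the actual tests may use pytest markers, and treating only the exact value "1" as enabled is an assumption. `smoke_enabled` and `MarlSmokeExample` are hypothetical names.

```python
import os
import unittest


def smoke_enabled(var, env=None):
    """Return True only when `var` is set to exactly "1"."""
    env = os.environ if env is None else env
    return env.get(var) == "1"


class MarlSmokeExample(unittest.TestCase):
    # Skipped unless the variable is exported before the test run starts.
    @unittest.skipUnless(
        smoke_enabled("LABTRUST_MARL_SMOKE"),
        "set LABTRUST_MARL_SMOKE=1 to run",
    )
    def test_placeholder(self):
        # Placeholder body; a real smoke test would exercise the feature.
        self.assertTrue(True)
```

The skip condition is evaluated when the module is imported, so the variable has to be set in the environment before the test runner starts.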