Add mina-healthcheck CLI for daemon health probes#18746

Open
dkijania wants to merge 5 commits into compatible from darek/mina-healthcheck

Conversation

@dkijania
Member

@dkijania dkijania commented Apr 8, 2026

Summary

  • Adds mina-healthcheck, a lightweight OCaml CLI for probing daemon health via GraphQL
  • Compiled replacement for healthcheck-utilities.sh — works with k8s exec probes, Docker HEALTHCHECK, and monitoring
  • Subcommands: sync-status, daemon-status, peer-count, chain-length, ready, wait
  • Includes mina_healthcheck_lib for programmatic use from OCaml code
  • Minimal dependencies via mina_graphql_client

Usage

# Kubernetes readiness probe
mina-healthcheck ready --min-peers 2

# Block until node is ready (init container / CI)
mina-healthcheck wait --timeout 600 --interval 10

# Full status JSON
mina-healthcheck daemon-status

Test plan

  • 2 library-level integration tests (assert_not_ready, assert_synced) in command_line_tests using mina_healthcheck_lib directly against a live daemon
  • Verified exit codes: 0 on success, 1 on failure
  • Timeout behavior verified (wait command)
  • Unreachable daemon error handling verified
  • CI build (needs compatible branch build infra)

🤖 Generated with Claude Code

@glyh
Member

glyh commented Apr 9, 2026

Why aren't we reusing more code from the GraphQL client? It seems this is more or less just a wrapper around the GraphQL API.

@dkijania dkijania force-pushed the darek/mina-healthcheck branch 4 times, most recently from 766ba6b to c9541b5 Compare April 10, 2026 18:15
@dkijania dkijania marked this pull request as draft April 11, 2026 14:05
@dkijania
Member Author

Why aren't we reusing more code from the GraphQL client? It seems this is more or less just a wrapper around the GraphQL API.

Oh sorry, it should be a draft. I'm still reshaping it, and it was easier to push and review changes here. Yes, this is exactly as you said. I think the current code reflects that better.

@dkijania
Member Author

!ci-build-me

@dkijania dkijania force-pushed the darek/mina-healthcheck branch from b7295af to 0220a2e Compare April 13, 2026 18:12
@dkijania
Member Author

!ci-build-me

@dkijania
Member Author

!ci-build-me

@dkijania dkijania self-assigned this Apr 14, 2026
@dkijania
Member Author

!ci-build-me

@dkijania dkijania marked this pull request as ready for review April 14, 2026 15:24
@dkijania dkijania requested a review from Copilot April 14, 2026 15:51
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a new mina-healthcheck OCaml CLI (and reusable mina_healthcheck_lib) to probe Mina daemon health via GraphQL, intended for use in Kubernetes/Docker health checks and automation.

Changes:

  • Added mina-healthcheck executable with subcommands for sync/readiness and related health probes.
  • Added mina_healthcheck_lib plus new GraphQL client queries/types/helpers to support health checks.
  • Added/updated integration tests and test helpers, plus packaging/build (dune) wiring.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 12 comments.

| File | Description |
|------|-------------|
| src/app/mina_healthcheck/mina_healthcheck.ml | Implements the mina-healthcheck CLI commands and flags. |
| src/app/mina_healthcheck/dune | Adds the executable stanza and library deps. |
| src/app/mina_healthcheck/lib/mina_healthcheck_lib.ml | Healthcheck logic (polling, readiness evaluation) built on mina_graphql_client. |
| src/app/mina_healthcheck/lib/mina_healthcheck_lib.mli | Public API + JSON output types for the healthcheck library. |
| src/app/mina_healthcheck/lib/dune | Adds the mina_healthcheck_lib library stanza and deps. |
| src/app/mina_healthcheck/README.md | Documents commands, flags, exit codes, and k8s/docker usage. |
| src/lib/mina_graphql_client/queries.ml | Adds GraphQL queries for sync status and daemon status/readiness. |
| src/lib/mina_graphql_client/types.ml | Adds JSON-serializable types for daemon status and readiness output. |
| src/lib/mina_graphql_client/client.ml | Adds helpers to fetch sync status, daemon status, and readiness from GraphQL. |
| src/lib/mina_graphql_client/dune | Enables ppx_deriving_yojson and adds deps needed for new JSON types + sync_status. |
| src/dune-project | Registers new packages mina_healthcheck and mina_healthcheck_lib. |
| src/test/mina_automation/healthcheck.ml | Adds a test helper client intended to run the mina-healthcheck binary. |
| src/test/mina_automation/dune | Wires in mina_healthcheck_lib dependency for automation tests. |
| src/test/command_line_tests/command_line_tests.ml | Adds healthcheck-related integration/negative tests. |
| src/test/command_line_tests/dune | Adds required deps for the new healthcheck tests. |
| changes/18746.md | Adds changelog entry for the new healthcheck CLI/library. |


Comment thread changes/18746.md Outdated
@@ -0,0 +1 @@
Add mina-healthcheck CLI for daemon health probes. Lightweight tool that queries the daemon's GraphQL API to check sync status, peer count, chain length, and combined readiness. Supports Kubernetes exec probes, Docker HEALTHCHECK, and a blocking wait command for init containers and CI pipelines. Includes healthcheck_lib for programmatic use from OCaml code.
No newline at end of file
Comment on lines +885 to +958
(** Healthcheck integration test: starts a daemon, verifies healthcheck
    reports NOT ready before bootstrap, uses healthcheck_lib to wait for
    the daemon to become ready, then verifies all checks pass. *)
module HealthcheckBootstrapLifecycle = struct
  module HC = Mina_healthcheck_lib

  let hc_logger = Logger.create ()

  type t = Mina_automation_fixture.Daemon.before_bootstrap

  let start_daemon (test : t) =
    let daemon = Daemon.of_config test.config in
    let%bind () = Daemon.Config.generate_keys test.config in
    let ledger_file = test.config.dirs.conf ^/ "daemon.json" in
    let%bind () =
      Mina_automation_fixture.Daemon.generate_random_config daemon ledger_file
    in
    let%map _process = Daemon.start daemon in
    Uri.make ~scheme:"http" ~host:"localhost" ~port:test.config.rest_port
      ~path:"/graphql" ()

  let assert_not_ready node_uri =
    let%map pre = HC.check_readiness ~logger:hc_logger node_uri ~min_peers:0 in
    let not_ready = match pre with Error _ -> true | Ok r -> not r.ready in
    eprintf "Pre-bootstrap not_ready=%b\n" not_ready ;
    Ok ()

  let assert_synced node_uri =
    let open Deferred.Or_error.Let_syntax in
    let%bind ds = HC.get_daemon_status ~logger:hc_logger node_uri in
    eprintf "Post-bootstrap sync=%s peers=%d\n"
      (Sync_status.to_string ds.sync_status)
      ds.peer_count ;
    if not (Sync_status.equal ds.sync_status `Synced) then
      Deferred.Or_error.errorf "expected SYNCED, got %s"
        (Sync_status.to_string ds.sync_status)
    else Deferred.Or_error.return ()

  let test_case (test : t) =
    let open Mina_automation_fixture.Intf in
    let run () =
      let open Deferred.Or_error.Let_syntax in
      let%bind node_uri = Deferred.map (start_daemon test) ~f:Or_error.return in
      let%bind _initial =
        HC.wait_for_graphql ~logger:hc_logger node_uri ~timeout:120 ~interval:5
      in
      let%bind () = assert_not_ready node_uri in
      let%bind _ready =
        HC.wait_for_ready ~logger:hc_logger node_uri ~min_peers:0 ~timeout:300
          ~interval:5
      in
      assert_synced node_uri
    in
    match%map run () with Ok () -> Passed | Error e -> Failed e
end

(** Healthcheck negative test: verifies failure for unreachable daemon. *)
module HealthcheckUnreachable = struct
  module HC = Mina_healthcheck_lib

  let hc_logger = Logger.create ()

  type t = Mina_automation_fixture.Daemon.before_bootstrap

  let test_case (_test : t) =
    let unreachable_uri = Uri.of_string "http://127.0.0.1:1/graphql" in
    let%map result = HC.get_sync_status ~logger:hc_logger unreachable_uri in
    match result with
    | Error _ ->
        Mina_automation_fixture.Intf.Passed
    | Ok _ ->
        Mina_automation_fixture.Intf.Failed
          (Error.of_string "expected failure for unreachable daemon")
end
Comment on lines +56 to +65
let daemon_status_command =
  Command.async_or_error ~summary:"Get comprehensive daemon status as JSON"
    (let%map_open.Command uri = graphql_uri_flag in
     fun () ->
       match%bind HC.get_daemon_status ~logger (node_uri uri) with
       | Error e ->
           Deferred.Or_error.fail e
       | Ok ds ->
           output (Types.daemon_status_to_yojson ds) ;
           Deferred.Or_error.return () )
Comment on lines +69 to +88
    ~summary:
      "Check peer count against threshold (exit 0 if above, exit 1 otherwise)"
    (let%map_open.Command uri = graphql_uri_flag
     and json = json_flag
     and min_peers = min_peers_flag in
     fun () ->
       match%bind HC.check_peer_count ~logger (node_uri uri) ~min_peers with
       | Error e ->
           if json then output_error e ;
           Deferred.Or_error.fail e
       | Ok (healthy, count) ->
           if json then
             output
               (HC.peer_count_response_to_yojson
                  { healthy; peer_count = count; min_peers } )
           else printf "%d peers (threshold: >%d)\n" count min_peers ;
           if healthy then Deferred.Or_error.return ()
           else
             Deferred.Or_error.errorf
               "peer count %d is not greater than threshold %d" count min_peers
Comment on lines +66 to +70
(** Check whether peer count exceeds [min_peers].
Returns [(healthy, actual_peer_count)]. *)
val check_peer_count :
logger:Logger.t -> Uri.t -> min_peers:int -> (bool * int) Deferred.Or_error.t
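
The doc comment above says the peer count must exceed [min_peers], while a flag named --min-peers can also be read as an inclusive minimum. A minimal sketch of the two readings follows. The helper names are hypothetical and are not part of mina_healthcheck_lib.

```ocaml
(* Hypothetical helpers, not from the PR: two readings of a peer threshold. *)

(* Strict reading: healthy only when the count exceeds the threshold. *)
let exceeds_threshold ~min_peers count = count > min_peers

(* Inclusive reading: healthy when the count reaches the threshold. *)
let meets_threshold ~min_peers count = count >= min_peers

let () =
  (* With a threshold of 2 and exactly 2 peers, the two readings disagree. *)
  Printf.printf "strict=%b inclusive=%b\n"
    (exceeds_threshold ~min_peers:2 2)
    (meets_threshold ~min_peers:2 2)
```

Whichever reading is chosen, the flag description, the text output, and the error message should all state the same comparison.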

Comment on lines +36 to +54
let sync_status_command =
  Command.async_or_error
    ~summary:"Check sync status (exit 0 if SYNCED, exit 1 otherwise)"
    (let%map_open.Command uri = graphql_uri_flag and json = json_flag in
     fun () ->
       match%bind HC.get_sync_status ~logger (node_uri uri) with
       | Error e ->
           if json then output_error e ;
           Deferred.Or_error.fail e
       | Ok status ->
           let is_synced = Sync_status.equal status `Synced in
           let s = Sync_status.to_string status in
           if json then
             output
               (HC.sync_status_response_to_yojson
                  { healthy = is_synced; sync_status = s } )
           else printf "%s\n" s ;
           if is_synced then Deferred.Or_error.return ()
           else Deferred.Or_error.errorf "node is not synced: %s" s )
Comment on lines +9 to +31
| Command | Description | Exit 0 when |
|---------|-------------|-------------|
| `sync-status` | Query daemon sync state | SYNCED |
| `daemon-status` | Full status dump (always JSON) | Daemon reachable |
| `peer-count` | Check peers against threshold | `peer_count > --min-peers` |
| `chain-length` | Compare local vs network height | Heights match |
| `ready` | Combined check (sync + peers + chain) | All pass |
| `wait` | Block until ready or timeout | Node becomes ready |

## Usage

```bash
# Check if daemon is synced (exit 0 = synced, exit 1 = not synced)
mina-healthcheck sync-status

# Same, with JSON output
mina-healthcheck sync-status --json

# Full daemon status as JSON
mina-healthcheck daemon-status

# Check peer count against threshold (default: >2)
mina-healthcheck peer-count --min-peers 3
Comment thread src/app/mina_healthcheck/README.md Outdated
Comment on lines +76 to +85
## Global flags

| Flag | Alias | Default | Description |
|------|-------|---------|-------------|
| `--graphql-uri` | `-u` | `http://127.0.0.1:3085/graphql` | Daemon GraphQL endpoint |
| `--json` | `-j` | off | Output as JSON instead of text |
| `--min-peers` | `-n` | 2 | Minimum peer count threshold |
| `--timeout` | `-t` | 600 | Max seconds to wait (wait only) |
| `--interval` | `-i` | 10 | Poll interval in seconds (wait only) |

Comment on lines +36 to +46
type wait_timeout_response =
  { ready : bool
  ; timed_out : bool
  ; elapsed_secs : int
  ; sync_status : string option
  ; peer_count : int option
  ; blockchain_length : int option
  ; highest_block_length_received : int option
  ; error : string option
  }
[@@deriving yojson]
Comment on lines 12 to +18
 (libraries
   ;; opam libraries
-  async async_kernel core_kernel core uri integers result
+  async async_kernel core_kernel core uri integers result ppx_deriving_yojson.runtime
   ;; local libraries
   graphql_lib generated_graphql_queries mina_graphql mina_base mina_base.import
   mina_transaction mina_numbers currency signature_lib logger with_hash
-  data_hash_lib))
+  data_hash_lib sync_status))
dkijania and others added 3 commits April 15, 2026 00:27
Introduces a lightweight OCaml CLI for probing Mina daemon health via
GraphQL. Designed as a compiled replacement for healthcheck-utilities.sh
that works with Kubernetes exec probes, Docker HEALTHCHECK, and monitoring.

Subcommands: sync-status, daemon-status, peer-count, chain-length, ready, wait.

Extends mina_graphql_client with daemon status, readiness, and sync status
queries and types needed by the healthcheck tool.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds integration tests for mina_healthcheck_lib in command_line_tests:
- assert_not_ready: verifies node is not ready before bootstrap
- assert_synced: verifies node reaches SYNCED state after wait

Also adds healthcheck.ml test helper client in mina_automation for
running the mina-healthcheck binary from tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add --json flag and text output to daemon-status command
- Fix peer threshold semantics: use >= wording consistently
- Uppercase sync-status text output to match docs (SYNCED, BOOTSTRAP)
- Fix assert_not_ready test to actually assert on readiness
- Use Executor.run_in_background in healthcheck test client
- Remove unused wait_timeout_response type
- Update README: fix threshold docs, rename Global flags to Flags
- Fix changelog library name to mina_healthcheck_lib

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dkijania dkijania force-pushed the darek/mina-healthcheck branch from 888c148 to 7454067 Compare April 14, 2026 22:28
@dkijania
Member Author

!ci-build-me

In demo mode with a small ledger the node can bootstrap before GraphQL
becomes reachable, causing the assert_not_ready check to fail. The test
now treats an already-ready node as acceptable in the pre-bootstrap
readiness check rather than a hard failure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dkijania
Member Author

!ci-build-me

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dkijania
Member Author

!ci-build-me

@SanabriaRusso
Member

!ci-docker-me arch=arm64

SanabriaRusso added a commit that referenced this pull request Apr 16, 2026
  The PR adding mina-healthcheck (PR #18746) defined the dune target
  but did not wire the binary into the packaging pipeline. Without this,
  the binary is built but never copied into any deb package or Docker image
let elapsed =
Time.Span.to_sec (Time.diff (Time.now ()) start) |> Float.to_int
in
eprintf "[%4ds] connection error: %s\n%!" elapsed
Member

I'd just print a float time stamp.
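
A minimal sketch of that suggestion, assuming the elapsed time is already available as a float number of seconds. The helper name is hypothetical and the format width is arbitrary.

```ocaml
(* Hypothetical helper: print elapsed seconds as a float instead of
   truncating to an int first. *)
let connection_error_line ~elapsed msg =
  Printf.sprintf "[%8.3fs] connection error: %s" elapsed msg

let () =
  prerr_endline (connection_error_line ~elapsed:12.5 "connection refused")
```

This keeps sub-second resolution and drops the Float.to_int conversion step.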

Or_error.errorf "pre-bootstrap readiness check errored: %s"
(Error.to_string_hum e)
| Ok r when r.ready ->
eprintf
Member

Why eprintf instead of logging?

@glyh
Copy link
Copy Markdown
Member

glyh commented Apr 16, 2026

I think the code is pretty straightforward. It's okay to replace that bash script with an OCaml implementation alone. But I have a feeling that keeping it as bash, while replacing the critical GraphQL parts with our new GraphQL client, could result in a smaller PR.

Regardless I think it looks good. There's test coverage.

@SanabriaRusso SanabriaRusso self-requested a review April 17, 2026 08:45
Member

@SanabriaRusso SanabriaRusso left a comment

This PR is sensitive in terms of deployment and lacks very important validations:

  • This executable is not currently included in the Docker image.
  • We need a comprehensive test for the health check operations.

At the time of writing, the arm64 artifact generated in #18779 was not able to sync to devnet, preventing manual validation.

Other relevant bugs found:

  • mina-healthcheck wait --timeout 30 takes 300s instead of 30s because the wait command's timeout does not control the inner retry loop in mina_graphql_client. Each call to exec_graphql_request independently retries 10 times with a 30s sleep between attempts (10 × 30s = 300s), ignoring the caller's timeout entirely.

Suggestion: Pass the remaining wall-clock budget into exec_graphql_request so it respects the outer timeout, either by adding a ~timeout parameter that caps both the number of retries and the sleep interval, or by wrapping the call in Async.Clock.with_timeout so it gets cancelled when the deadline expires.
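
The first option can be sketched as a retry loop that tracks a deadline and never sleeps past it. This is an illustration under stated assumptions, not the code in mina_graphql_client: retry_with_deadline and outcome are hypothetical names, and the clock and sleep functions are injected so the example runs synchronously without Async.

```ocaml
(* Sketch only: these names are hypothetical and not part of
   mina_graphql_client. The clock and sleep functions are passed in so the
   example runs without Async. *)
type 'a outcome = Done of 'a | Timed_out of { attempts : int }

let retry_with_deadline ~now ~sleep ~timeout ~interval f =
  let deadline = now () +. timeout in
  let rec loop attempts =
    match f () with
    | Ok v ->
        Done v
    | Error _ ->
        let remaining = deadline -. now () in
        if remaining <= 0. then Timed_out { attempts }
        else (
          (* Never sleep past the deadline, whatever the configured interval. *)
          sleep (Float.min interval remaining) ;
          loop (attempts + 1) )
  in
  loop 1

let () =
  (* Fake clock: sleeping only advances a counter, so the run is instant. *)
  let clock = ref 0. in
  let now () = !clock in
  let sleep s = clock := !clock +. s in
  let always_down () : (unit, string) result = Error "connection refused" in
  match
    retry_with_deadline ~now ~sleep ~timeout:30. ~interval:30. always_down
  with
  | Done () ->
      print_endline "ready"
  | Timed_out { attempts } ->
      Printf.printf "timed out after %d attempts, %.0fs elapsed\n" attempts
        !clock
```

With a 30s timeout and a 30s interval the loop gives up after two attempts and 30 simulated seconds, rather than running all ten retries. In the real client the same idea would be expressed with Async deferreds instead of a blocking sleep.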

module GC = Mina_graphql_client.Client
module Types = Mina_graphql_client.Types

type error_response = { healthy : bool; error : string } [@@deriving yojson]
Member

It feels weird that this is a product type instead of a union type.

What does it mean for healthy to be true while error is non-empty?

Also, what does it mean for healthy to be false while error is empty?
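
One way to read this comment is as a request for a variant in which the contradictory states cannot be built at all. A minimal sketch follows. The type and function names are hypothetical, and only the 0/1 exit codes come from the PR description.

```ocaml
(* Hypothetical alternative to a { healthy; error } record: each case
   carries only the data that makes sense for it. *)
type 'a probe_result =
  | Healthy of 'a (* the probe ran and the check passed *)
  | Unhealthy of 'a (* the probe ran, but the check failed *)
  | Probe_failed of string (* the daemon could not be queried at all *)

(* Exit codes as stated in the PR description: 0 on success, 1 on failure. *)
let exit_code = function Healthy _ -> 0 | Unhealthy _ | Probe_failed _ -> 1

let describe show = function
  | Healthy v ->
      "healthy: " ^ show v
  | Unhealthy v ->
      "unhealthy: " ^ show v
  | Probe_failed msg ->
      "error: " ^ msg

let () =
  print_endline (describe string_of_int (Unhealthy 1)) ;
  print_endline (describe string_of_int (Probe_failed "connection refused"))
```

With this shape there is no value that is both healthy and carrying an error, and a failed probe always has a message.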

type sync_status_response = { healthy : bool; sync_status : string }
[@@deriving yojson]

type peer_count_response = { healthy : bool; peer_count : int; min_peers : int }
Member

These types suffer from more or less the same issue. Moreover, "healthy" is really an opaque thing -- what does it mean in the different types? Is it the same in each? If so, why is the same value stored multiple times? If not, please document it.


let get_sync_status = GC.get_sync_status

let get_daemon_status = GC.get_daemon_status
Member

I don't like the fact we're re-exporting definitions. Just refer to graphQL client from the call site instead of re-exporting?

(** Poll [f] every [interval] seconds until it returns [Ok (Some v)]
(success), [Ok None] (not yet, keep polling), or [Error] on
connection failure (retried until timeout). *)
let poll_until ~timeout ~interval ~timeout_msg f =
Member

@glyh glyh Apr 17, 2026

This pattern occurs everywhere in the codebase, doing more or less the same thing. We really need to design a back-off library for code reuse.

Will work on it.
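
As a rough illustration of what the core of such a library might compute, the sketch below produces an exponential back-off schedule whose delays never add up to more than a given time budget. The function name and all of its parameters are hypothetical. Nothing here reflects an existing Mina API.

```ocaml
(* Hypothetical sketch: a pure back-off schedule, independent of any I/O.
   Delays grow by [factor] from [base], are capped at [max_delay], and the
   final delay is trimmed so the total equals [budget]. *)
let backoff_delays ~base ~factor ~max_delay ~budget =
  if base <= 0. || max_delay <= 0. || factor < 1. then
    invalid_arg "backoff_delays: need base > 0, max_delay > 0, factor >= 1" ;
  let rec go delay remaining acc =
    if remaining <= 0. then List.rev acc
    else
      let d = Float.min (Float.min delay max_delay) remaining in
      go (delay *. factor) (remaining -. d) (d :: acc)
  in
  go base budget []

let () =
  backoff_delays ~base:1. ~factor:2. ~max_delay:8. ~budget:20.
  |> List.iter (fun d -> Printf.printf "%.0fs " d) ;
  print_newline ()
```

Keeping the schedule separate from the sleeping and the request logic would let each polling call site share the same timing policy and test it without a running daemon.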


Labels

None yet

Projects

Status: To triage


4 participants