Add mina-healthcheck CLI for daemon health probes#18746
Add mina-healthcheck CLI for daemon health probes#18746dkijania wants to merge 5 commits intocompatiblefrom
Conversation
|
Why aren't we reusing more code from graphQL client? it seems this is more or less just a wrapper around graphQL API. |
766ba6b to
c9541b5
Compare
Oh sorry it should be a draft. I'm still reshaping it and it was easier to push and review changes here. Yes this is exactly as you said. I think current code more reflect it |
|
!ci-build-me |
b7295af to
0220a2e
Compare
|
!ci-build-me |
1 similar comment
|
!ci-build-me |
|
!ci-build-me |
There was a problem hiding this comment.
Pull request overview
This PR introduces a new mina-healthcheck OCaml CLI (and reusable mina_healthcheck_lib) to probe Mina daemon health via GraphQL, intended for use in Kubernetes/Docker health checks and automation.
Changes:
- Added
mina-healthcheckexecutable with subcommands for sync/readiness and related health probes. - Added
mina_healthcheck_libplus new GraphQL client queries/types/helpers to support health checks. - Added/updated integration tests and test helpers, plus packaging/build (dune) wiring.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| src/app/mina_healthcheck/mina_healthcheck.ml | Implements the mina-healthcheck CLI commands and flags. |
| src/app/mina_healthcheck/dune | Adds the executable stanza and library deps. |
| src/app/mina_healthcheck/lib/mina_healthcheck_lib.ml | Healthcheck logic (polling, readiness evaluation) built on mina_graphql_client. |
| src/app/mina_healthcheck/lib/mina_healthcheck_lib.mli | Public API + JSON output types for the healthcheck library. |
| src/app/mina_healthcheck/lib/dune | Adds the mina_healthcheck_lib library stanza and deps. |
| src/app/mina_healthcheck/README.md | Documents commands, flags, exit codes, and k8s/docker usage. |
| src/lib/mina_graphql_client/queries.ml | Adds GraphQL queries for sync status and daemon status/readiness. |
| src/lib/mina_graphql_client/types.ml | Adds JSON-serializable types for daemon status and readiness output. |
| src/lib/mina_graphql_client/client.ml | Adds helpers to fetch sync status, daemon status, and readiness from GraphQL. |
| src/lib/mina_graphql_client/dune | Enables ppx_deriving_yojson and adds deps needed for new JSON types + sync_status. |
| src/dune-project | Registers new packages mina_healthcheck and mina_healthcheck_lib. |
| src/test/mina_automation/healthcheck.ml | Adds a test helper client intended to run the mina-healthcheck binary. |
| src/test/mina_automation/dune | Wires in mina_healthcheck_lib dependency for automation tests. |
| src/test/command_line_tests/command_line_tests.ml | Adds healthcheck-related integration/negative tests. |
| src/test/command_line_tests/dune | Adds required deps for the new healthcheck tests. |
| changes/18746.md | Adds changelog entry for the new healthcheck CLI/library. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -0,0 +1 @@ | |||
| Add mina-healthcheck CLI for daemon health probes. Lightweight tool that queries the daemon's GraphQL API to check sync status, peer count, chain length, and combined readiness. Supports Kubernetes exec probes, Docker HEALTHCHECK, and a blocking wait command for init containers and CI pipelines. Includes healthcheck_lib for programmatic use from OCaml code. No newline at end of file | |||
| (** Healthcheck integration test: starts a daemon, verifies healthcheck | ||
| reports NOT ready before bootstrap, uses healthcheck_lib to wait for | ||
| the daemon to become ready, then verifies all checks pass. *) | ||
| module HealthcheckBootstrapLifecycle = struct | ||
| module HC = Mina_healthcheck_lib | ||
|
|
||
| let hc_logger = Logger.create () | ||
|
|
||
| type t = Mina_automation_fixture.Daemon.before_bootstrap | ||
|
|
||
| let start_daemon (test : t) = | ||
| let daemon = Daemon.of_config test.config in | ||
| let%bind () = Daemon.Config.generate_keys test.config in | ||
| let ledger_file = test.config.dirs.conf ^/ "daemon.json" in | ||
| let%bind () = | ||
| Mina_automation_fixture.Daemon.generate_random_config daemon ledger_file | ||
| in | ||
| let%map _process = Daemon.start daemon in | ||
| Uri.make ~scheme:"http" ~host:"localhost" ~port:test.config.rest_port | ||
| ~path:"/graphql" () | ||
|
|
||
| let assert_not_ready node_uri = | ||
| let%map pre = HC.check_readiness ~logger:hc_logger node_uri ~min_peers:0 in | ||
| let not_ready = match pre with Error _ -> true | Ok r -> not r.ready in | ||
| eprintf "Pre-bootstrap not_ready=%b\n" not_ready ; | ||
| Ok () | ||
|
|
||
| let assert_synced node_uri = | ||
| let open Deferred.Or_error.Let_syntax in | ||
| let%bind ds = HC.get_daemon_status ~logger:hc_logger node_uri in | ||
| eprintf "Post-bootstrap sync=%s peers=%d\n" | ||
| (Sync_status.to_string ds.sync_status) | ||
| ds.peer_count ; | ||
| if not (Sync_status.equal ds.sync_status `Synced) then | ||
| Deferred.Or_error.errorf "expected SYNCED, got %s" | ||
| (Sync_status.to_string ds.sync_status) | ||
| else Deferred.Or_error.return () | ||
|
|
||
| let test_case (test : t) = | ||
| let open Mina_automation_fixture.Intf in | ||
| let run () = | ||
| let open Deferred.Or_error.Let_syntax in | ||
| let%bind node_uri = Deferred.map (start_daemon test) ~f:Or_error.return in | ||
| let%bind _initial = | ||
| HC.wait_for_graphql ~logger:hc_logger node_uri ~timeout:120 ~interval:5 | ||
| in | ||
| let%bind () = assert_not_ready node_uri in | ||
| let%bind _ready = | ||
| HC.wait_for_ready ~logger:hc_logger node_uri ~min_peers:0 ~timeout:300 | ||
| ~interval:5 | ||
| in | ||
| assert_synced node_uri | ||
| in | ||
| match%map run () with Ok () -> Passed | Error e -> Failed e | ||
| end | ||
|
|
||
| (** Healthcheck negative test: verifies failure for unreachable daemon. *) | ||
| module HealthcheckUnreachable = struct | ||
| module HC = Mina_healthcheck_lib | ||
|
|
||
| let hc_logger = Logger.create () | ||
|
|
||
| type t = Mina_automation_fixture.Daemon.before_bootstrap | ||
|
|
||
| let test_case (_test : t) = | ||
| let unreachable_uri = Uri.of_string "http://127.0.0.1:1/graphql" in | ||
| let%map result = HC.get_sync_status ~logger:hc_logger unreachable_uri in | ||
| match result with | ||
| | Error _ -> | ||
| Mina_automation_fixture.Intf.Passed | ||
| | Ok _ -> | ||
| Mina_automation_fixture.Intf.Failed | ||
| (Error.of_string "expected failure for unreachable daemon") | ||
| end |
| let daemon_status_command = | ||
| Command.async_or_error ~summary:"Get comprehensive daemon status as JSON" | ||
| (let%map_open.Command uri = graphql_uri_flag in | ||
| fun () -> | ||
| match%bind HC.get_daemon_status ~logger (node_uri uri) with | ||
| | Error e -> | ||
| Deferred.Or_error.fail e | ||
| | Ok ds -> | ||
| output (Types.daemon_status_to_yojson ds) ; | ||
| Deferred.Or_error.return () ) |
| ~summary: | ||
| "Check peer count against threshold (exit 0 if above, exit 1 otherwise)" | ||
| (let%map_open.Command uri = graphql_uri_flag | ||
| and json = json_flag | ||
| and min_peers = min_peers_flag in | ||
| fun () -> | ||
| match%bind HC.check_peer_count ~logger (node_uri uri) ~min_peers with | ||
| | Error e -> | ||
| if json then output_error e ; | ||
| Deferred.Or_error.fail e | ||
| | Ok (healthy, count) -> | ||
| if json then | ||
| output | ||
| (HC.peer_count_response_to_yojson | ||
| { healthy; peer_count = count; min_peers } ) | ||
| else printf "%d peers (threshold: >%d)\n" count min_peers ; | ||
| if healthy then Deferred.Or_error.return () | ||
| else | ||
| Deferred.Or_error.errorf | ||
| "peer count %d is not greater than threshold %d" count min_peers |
| (** Check whether peer count exceeds [min_peers]. | ||
| Returns [(healthy, actual_peer_count)]. *) | ||
| val check_peer_count : | ||
| logger:Logger.t -> Uri.t -> min_peers:int -> (bool * int) Deferred.Or_error.t | ||
|
|
| let sync_status_command = | ||
| Command.async_or_error | ||
| ~summary:"Check sync status (exit 0 if SYNCED, exit 1 otherwise)" | ||
| (let%map_open.Command uri = graphql_uri_flag and json = json_flag in | ||
| fun () -> | ||
| match%bind HC.get_sync_status ~logger (node_uri uri) with | ||
| | Error e -> | ||
| if json then output_error e ; | ||
| Deferred.Or_error.fail e | ||
| | Ok status -> | ||
| let is_synced = Sync_status.equal status `Synced in | ||
| let s = Sync_status.to_string status in | ||
| if json then | ||
| output | ||
| (HC.sync_status_response_to_yojson | ||
| { healthy = is_synced; sync_status = s } ) | ||
| else printf "%s\n" s ; | ||
| if is_synced then Deferred.Or_error.return () | ||
| else Deferred.Or_error.errorf "node is not synced: %s" s ) |
| | Command | Description | Exit 0 when | | ||
| |---------|-------------|-------------| | ||
| | `sync-status` | Query daemon sync state | SYNCED | | ||
| | `daemon-status` | Full status dump (always JSON) | Daemon reachable | | ||
| | `peer-count` | Check peers against threshold | `peer_count > --min-peers` | | ||
| | `chain-length` | Compare local vs network height | Heights match | | ||
| | `ready` | Combined check (sync + peers + chain) | All pass | | ||
| | `wait` | Block until ready or timeout | Node becomes ready | | ||
|
|
||
| ## Usage | ||
|
|
||
| ```bash | ||
| # Check if daemon is synced (exit 0 = synced, exit 1 = not synced) | ||
| mina-healthcheck sync-status | ||
|
|
||
| # Same, with JSON output | ||
| mina-healthcheck sync-status --json | ||
|
|
||
| # Full daemon status as JSON | ||
| mina-healthcheck daemon-status | ||
|
|
||
| # Check peer count against threshold (default: >2) | ||
| mina-healthcheck peer-count --min-peers 3 |
| ## Global flags | ||
|
|
||
| | Flag | Alias | Default | Description | | ||
| |------|-------|---------|-------------| | ||
| | `--graphql-uri` | `-u` | `http://127.0.0.1:3085/graphql` | Daemon GraphQL endpoint | | ||
| | `--json` | `-j` | off | Output as JSON instead of text | | ||
| | `--min-peers` | `-n` | 2 | Minimum peer count threshold | | ||
| | `--timeout` | `-t` | 600 | Max seconds to wait (wait only) | | ||
| | `--interval` | `-i` | 10 | Poll interval in seconds (wait only) | | ||
|
|
| type wait_timeout_response = | ||
| { ready : bool | ||
| ; timed_out : bool | ||
| ; elapsed_secs : int | ||
| ; sync_status : string option | ||
| ; peer_count : int option | ||
| ; blockchain_length : int option | ||
| ; highest_block_length_received : int option | ||
| ; error : string option | ||
| } | ||
| [@@deriving yojson] |
| (libraries | ||
| ;; opam libraries | ||
| async async_kernel core_kernel core uri integers result | ||
| async async_kernel core_kernel core uri integers result ppx_deriving_yojson.runtime | ||
| ;; local libraries | ||
| graphql_lib generated_graphql_queries mina_graphql mina_base mina_base.import | ||
| mina_transaction mina_numbers currency signature_lib logger with_hash | ||
| data_hash_lib)) | ||
| data_hash_lib sync_status)) |
Introduces a lightweight OCaml CLI for probing Mina daemon health via GraphQL. Designed as a compiled replacement for healthcheck-utilities.sh that works with Kubernetes exec probes, Docker HEALTHCHECK, and monitoring. Subcommands: sync-status, daemon-status, peer-count, chain-length, ready, wait. Extends mina_graphql_client with daemon status, readiness, and sync status queries and types needed by the healthcheck tool. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds integration tests for mina_healthcheck_lib in command_line_tests: - assert_not_ready: verifies node is not ready before bootstrap - assert_synced: verifies node reaches SYNCED state after wait Also adds healthcheck.ml test helper client in mina_automation for running the mina-healthcheck binary from tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add --json flag and text output to daemon-status command - Fix peer threshold semantics: use >= wording consistently - Uppercase sync-status text output to match docs (SYNCED, BOOTSTRAP) - Fix assert_not_ready test to actually assert on readiness - Use Executor.run_in_background in healthcheck test client - Remove unused wait_timeout_response type - Update README: fix threshold docs, rename Global flags to Flags - Fix changelog library name to mina_healthcheck_lib Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
888c148 to
7454067
Compare
|
!ci-build-me |
In demo mode with a small ledger the node can bootstrap before GraphQL becomes reachable, causing the assert_not_ready check to fail. The test now treats an already-ready node as acceptable in the pre-bootstrap readiness check rather than a hard failure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
!ci-build-me |
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
!ci-build-me |
|
!ci-docker-me arch=arm64 |
The PR adding mina-healthcheck (PR #18746) defined the dune target but did not wire the binary into the packaging pipeline. Without this, the binary is built but never copied into any deb package or Docker image
| let elapsed = | ||
| Time.Span.to_sec (Time.diff (Time.now ()) start) |> Float.to_int | ||
| in | ||
| eprintf "[%4ds] connection error: %s\n%!" elapsed |
There was a problem hiding this comment.
I'd just print a float time stamp.
| Or_error.errorf "pre-bootstrap readiness check errored: %s" | ||
| (Error.to_string_hum e) | ||
| | Ok r when r.ready -> | ||
| eprintf |
|
I think the code is pretty straight forward. It's okay to replace that bash script with ocaml implementation alone. But I have a feeling that keeping it as bash, but replacing critical graphQL parts with our new graphQL client could result in smaller PR sizes. Regardless I think it looks good. There's test coverage. |
SanabriaRusso
left a comment
There was a problem hiding this comment.
This PR is sensitive in terms of deployment and lacks very important validations:
- This executable is not currently being included in the Docker
- We need a comprehensible test for the health check operations.
At time of writing of this comment, the arm64 artifact generated in #18779 was not able to sync to devnet, preventing manual validations.
Other relevant bugs found:
- mina-healthcheck wait --timeout 30 takes 300s instead of 30s because the wait command's timeout doesn't control the inner retry loop in mina_graphql_client. Each call to exec_graphql_request independently retries 10 times with a 30s sleep between attempts, ignoring the caller's timeout entirely
Suggestion: Pass the remaining wall-clock budget into exec_graphql_request so it respects the outer timeout — either by adding a ~timeout parameter that caps both the number of retries and the sleep interval, or by wrapping the call in Async.Clock.with_timeout so it gets cancelled when the deadline expires
| module GC = Mina_graphql_client.Client | ||
| module Types = Mina_graphql_client.Types | ||
|
|
||
| type error_response = { healthy : bool; error : string } [@@deriving yojson] |
There was a problem hiding this comment.
It feels weird this is a product type instead of a union type.
What does it mean for healthy to be true and error being non-empty?
Also what does it mean for healthy to be false while error being empty?
| type sync_status_response = { healthy : bool; sync_status : string } | ||
| [@@deriving yojson] | ||
|
|
||
| type peer_count_response = { healthy : bool; peer_count : int; min_peers : int } |
There was a problem hiding this comment.
These types suffer from more or less the same issue. Moreover, "healthy" is really an opaque thing -- what does it mean in different types? Are they same? If so, why are we having the same value stored multiple times? If not, please document it?
|
|
||
| let get_sync_status = GC.get_sync_status | ||
|
|
||
| let get_daemon_status = GC.get_daemon_status |
There was a problem hiding this comment.
I don't like the fact we're re-exporting definitions. Just refer to graphQL client from the call site instead of re-exporting?
| (** Poll [f] every [interval] seconds until it returns [Ok (Some v)] | ||
| (success), [Ok None] (not yet, keep polling), or [Error] on | ||
| connection failure (retried until timeout). *) | ||
| let poll_until ~timeout ~interval ~timeout_msg f = |
There was a problem hiding this comment.
Such pattern occurs everywhere in the codebase doing more or less the samething. We really need to design a back-off library for code reuse.
Will work on it.
Summary
mina-healthcheck, a lightweight OCaml CLI for probing daemon health via GraphQLhealthcheck-utilities.sh— works with k8s exec probes, Docker HEALTHCHECK, and monitoringsync-status,daemon-status,peer-count,chain-length,ready,waitmina_healthcheck_libfor programmatic use from OCaml codemina_graphql_clientUsage
Test plan
assert_not_ready,assert_ready) incommand_line_testsusingmina_healthcheck_libdirectly against a live daemon🤖 Generated with Claude Code