Prometheus metrics for libp2p protocols #1199
metrics-2026-02-08_19.57.51.mp4: ping latency metrics (histogram) in Grafana
gossipsub-metrics.mp4: Screencast of the gossipsub metrics. The following metrics are recorded:
3ab8490 to 1592d66
@lla-dane: Hi Abhinav, this is a really strong and impactful PR, great work 👏 Love how you’ve brought Prometheus/Grafana observability directly into py-libp2p; the coverage across Ping, Gossipsub, Kad-DHT, and Swarm gives a solid, end-to-end view of protocol behavior. The metrics feel well chosen and immediately useful for debugging and performance analysis. The metrics-demo + Docker setup is a big win for DX as well: it makes it super easy to spin things up and actually see what’s happening across nodes. Overall, this is a big step toward production-grade observability for py-libp2p. Happy to help test or review further, and excited to see this land. We will discuss this in detail tomorrow. On the same note, it would be great if you could resolve the CI/CD issues.
f9d9854 to 66fd7d6
@lla-dane: Great work, Abhinav. Please resolve the merge conflicts. Also, add a tracking issue for metrics specific to circuit relay. Please include a newsfragment. This PR is ready to merge.
Fixed the merge conflicts and added the newsfragment. @seetadev
    return RoutedHost(
        network=swarm,
        router=disc_opt,
        enable_mDNS=enable_mDNS,
        enable_upnp=enable_upnp,
        bootstrap=bootstrap,
        resource_manager=resource_manager,
        bootstrap_allow_ipv6=bootstrap_allow_ipv6,
        bootstrap_dns_timeout=bootstrap_dns_timeout,
        bootstrap_dns_max_retries=bootstrap_dns_max_retries,
Should the RoutedHost branch also get metric_recv_channel defined?
I was unsure about this, so I attached it to RoutedHost, but it seems redundant. I will remove it.
In general, there are a lot of changes being made and code being added with very little testing. Large chunks of the code you're adding could be deleted and we'd never know from the CI run. It also appears that the |
…sage in prometheus
The major internal change was to add a new component. Would you please flag one or two places so I get an idea of the redundant chunks? I will then start removing them. Thanks! @pacrob
Thanks @lla-dane for the PR. This is strong and impactful work. The protocol coverage (Ping, Gossipsub, Kad-DHT, Swarm), demo setup, and documentation updates make observability much more practical for py-libp2p users.
Required improvements
Suggested improvements
1. Summary of Changes
This PR adds Prometheus-based observability across core protocols and runtime flows in
Related issue context exists as issue
No explicit breaking API changes are declared, but interface surface was expanded (
2. Branch Sync Status and Merge Conflicts
Branch Sync Status
Merge Conflict Analysis
✅ No merge conflicts detected. The PR branch can be merged cleanly into
3. Strengths
4. Issues Found
Critical
Major
RoutedHost vs BasicHost:
Sure sure, thanks for flagging the issues @acul71, will fix them shortly.
Sorry, I was unclear. It's not that your code is redundant. Because your code is not hit when tests are run, there's no way for us to know if some future PR changes or breaks the work you've done here.
Aah I see, I misunderstood. I will start writing tests so that all of my code is included in the CI runs for future PRs.
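To make the coverage point concrete, here is a minimal sketch of the kind of test that would exercise metrics code in CI. `PingMetrics`, `record`, and `run_tests` are hypothetical names chosen for illustration, not the PR's actual API, and the recorder is an in-memory stand-in rather than a real Prometheus client.

```python
import io
import unittest


class PingMetrics:
    """Hypothetical in-memory recorder standing in for the real ping metrics."""

    def __init__(self):
        self.rtts = []
        self.failures = 0

    def record(self, rtt_seconds):
        # None marks a failed ping; anything else is a measured RTT in seconds.
        if rtt_seconds is None:
            self.failures += 1
        else:
            self.rtts.append(rtt_seconds)


class TestPingMetrics(unittest.TestCase):
    def test_success_records_rtt(self):
        metrics = PingMetrics()
        metrics.record(0.012)
        self.assertEqual(metrics.rtts, [0.012])
        self.assertEqual(metrics.failures, 0)

    def test_failure_increments_counter(self):
        metrics = PingMetrics()
        metrics.record(None)
        self.assertEqual(metrics.failures, 1)
        self.assertEqual(metrics.rtts, [])


def run_tests():
    # Run the suite programmatically so the sketch also works outside pytest.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestPingMetrics)
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

If someone later deleted the failure branch in `record`, the second test would fail, which is exactly the signal the reviewer is asking for.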
Introduction
This pull request introduces Prometheus metrics, visualized through Grafana, for core py-libp2p protocols to support real-time monitoring and analysis.
It enables developers to run a libp2p node and directly inspect internal protocol behavior—such as latency, message propagation, and DHT activity—through standard metrics pipelines.
A working demo (metrics-demo) is included in the examples directory, to showcase how multiple services operate together and how their metrics can be visualized using Prometheus and Grafana.
What's included
The following libp2p services are currently instrumented and exposed via Prometheus metrics:
Ping
- ping: Round-trip time (RTT) measurements.
- ping_failure: Failed ping attempts.
Provides visibility into peer-to-peer latency and connectivity reliability.
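As a rough illustration of how an RTT histogram and a failure counter behave, the following dependency-free sketch mimics Prometheus-style cumulative buckets. `RttHistogram`, its bucket bounds, and the sample values are assumptions made for this example; the PR itself may use different buckets and a real Prometheus client.

```python
import bisect
import math


class RttHistogram:
    """Hypothetical stand-in for a Prometheus histogram of ping RTTs (seconds)."""

    def __init__(self, bounds=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0)):
        # Upper bound of each bucket; +Inf catches every remaining sample.
        self.bounds = sorted(bounds) + [math.inf]
        self.counts = [0] * len(self.bounds)
        self.sum = 0.0
        self.failures = 0  # plays the role of the ping_failure counter

    def observe(self, rtt_seconds):
        # A sample belongs to the first bucket whose upper bound is >= the value.
        index = bisect.bisect_left(self.bounds, rtt_seconds)
        self.counts[index] += 1
        self.sum += rtt_seconds

    def record_failure(self):
        self.failures += 1

    def cumulative(self):
        # Exposed bucket counts are cumulative: each includes all smaller buckets.
        running, rows = 0, []
        for bound, count in zip(self.bounds, self.counts):
            running += count
            rows.append((bound, running))
        return rows


# Made-up sample RTTs, for illustration only.
hist = RttHistogram()
for rtt in (0.003, 0.02, 0.02, 2.0):
    hist.observe(rtt)
hist.record_failure()
```

Cumulative buckets are what let a dashboard estimate latency quantiles from the scraped counts.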
Gossipsub / Pubsub
- gossipsub_received_total: Messages received
- gossipsub_publish_total: Messages published
- gossipsub_subopts_total: Subscription updates
- gossipsub_control_total: Control messages
- gossipsub_message_bytes: Message sizes
Enables monitoring of message propagation, throughput, and pubsub activity.
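The counters above could be modeled roughly as follows. This is a sketch only: `LabeledCounter`, the `topic` label, and the `on_receive`/`on_publish` hooks are hypothetical, and the source does not say whether the real metrics carry labels.

```python
from collections import defaultdict


class LabeledCounter:
    """Hypothetical monotonically increasing counter with optional labels."""

    def __init__(self, name):
        self.name = name
        self._values = defaultdict(float)

    def inc(self, amount=1, **labels):
        if amount < 0:
            raise ValueError("counters can only increase")
        # Sort label pairs so the same label set always maps to the same key.
        self._values[tuple(sorted(labels.items()))] += amount

    def value(self, **labels):
        return self._values.get(tuple(sorted(labels.items())), 0.0)


received = LabeledCounter("gossipsub_received_total")
published = LabeledCounter("gossipsub_publish_total")


def on_receive(topic):
    # Assumed hook: called once per message delivered from a peer.
    received.inc(topic=topic)


def on_publish(topic):
    # Assumed hook: called once per locally published message.
    published.inc(topic=topic)


on_receive("blocks")
on_receive("blocks")
on_publish("txs")
```

Keeping counters monotonic is what allows rate-style queries over them to be meaningful between scrapes.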
Kademlia (Kad-DHT)
- kad_inbound_total: Total inbound requests
- kad_inbound_find_node: FIND_NODE requests
- kad_inbound_get_value: GET_VALUE requests
- kad_inbound_put_value: PUT_VALUE requests
- kad_inbound_get_providers: GET_PROVIDERS requests
- kad_inbound_add_provider: ADD_PROVIDER requests
Swarm / Connection Lifecycle
- swarm_incoming_conn: Incoming connections
- swarm_incoming_conn_error: Incoming connection failures
- swarm_dial_attempt: Outgoing dial attempts
- swarm_dial_attempt_error: Dial failures
Tracks connection establishment behavior and network stability.
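One way to picture the attempt/error pairing for dials is a thin wrapper that counts every attempt and, separately, every failure. The sketch below is synchronous for brevity, while real swarm dialing is asynchronous; `counted_dial` and `flaky_dial` are hypothetical helpers, and only the two metric names come from the list above.

```python
from collections import Counter

# In-memory stand-in for the swarm dial counters.
swarm_counters = Counter()


def counted_dial(dial, peer_id):
    """Call dial(peer_id), counting the attempt and any failure."""
    swarm_counters["swarm_dial_attempt"] += 1
    try:
        return dial(peer_id)
    except Exception:
        # Count the failure, then re-raise so callers still see the error.
        swarm_counters["swarm_dial_attempt_error"] += 1
        raise


def flaky_dial(peer_id):
    # Hypothetical dialer used only for this example.
    if peer_id == "unreachable":
        raise ConnectionError(f"cannot reach {peer_id}")
    return f"conn-to-{peer_id}"


counted_dial(flaky_dial, "peerA")
try:
    counted_dial(flaky_dial, "unreachable")
except ConnectionError:
    pass
```

Counting attempts and errors separately lets a dashboard derive a failure ratio without the code having to compute one.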
Demo & Observability Setup
A metrics-demo CLI is included to:
A Docker-based setup is provided to launch:
This allows real-time inspection of protocol-level behavior across nodes.
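For readers unfamiliar with how Prometheus collects data, the sketch below shows the general idea using only the standard library: a node exposes a `/metrics` endpoint in the text exposition format, and Prometheus scrapes it periodically. The `METRICS` values, `render_metrics`, `MetricsHandler`, `serve`, and the port are all assumptions for illustration; the actual demo presumably relies on a Prometheus client library rather than hand-rolled rendering.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Made-up sample values; the names come from the swarm metrics in this PR.
METRICS = {
    "swarm_incoming_conn": ("Incoming connections", 5),
    "swarm_dial_attempt": ("Outgoing dial attempts", 3),
}


def render_metrics(counters):
    """Render {name: (help_text, value)} in the Prometheus text format."""
    lines = []
    for name, (help_text, value) in sorted(counters.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    # Blocks forever; the port is an arbitrary choice for this sketch.
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a Prometheus scrape job at `127.0.0.1:8000/metrics` would be enough to see these two sample counters appear in Grafana.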
Necessity
Currently, diagnosing issues in py-libp2p (e.g., latency spikes, dropped messages, or DHT inconsistencies) relies heavily on logs, which are:
This PR introduces structured, queryable metrics that:
Reference
Inspired by the metrics design in the Rust implementation:
https://github.com/libp2p/rust-libp2p/tree/master/misc/metrics