
Hardcoded 4096-byte STATUS_QUERY_MAX_SIZE_BYTES rejects legitimate gateway /status polling #937

@jimwavefive

Description

Bug Report: Hardcoded 4096-byte STATUS_QUERY_MAX_SIZE_BYTES rejects legitimate gateway /status polling (1.8.1+)

Summary

The /status endpoint in indexer-service-rs has a hardcoded 4096-byte body size limit (STATUS_QUERY_MAX_SIZE_BYTES) that rejects the indexingProgress query sent by Graph protocol gateways when the indexer has more than ~75 active allocations. The gateway batches deployment IDs into groups of 100, but 100 deployment IDs produce a ~5,275-byte body that exceeds the limit. This results in HTTP 400 errors for the majority of status polling requests.

We have observed 394,716 rejected requests over 61.5 hours (~107/min) from multiple Graph gateway IPs.

Environment

  • indexer-service-rs: latest release (Rust rewrite)
  • Active allocations: ~825
  • Affected endpoint: POST /status

Source IPs (Graph Gateway Infrastructure)

All rejected requests originate from known Graph protocol gateway infrastructure across multiple cloud providers and regions:

| Source IP | Cloud Provider | Region | Rejected Requests |
|---|---|---|---|
| 34.185.191.203 | GCP | US | 49,158 |
| 116.202.192.158 | Hetzner | DE (Germany) | 47,829 |
| 34.116.224.212 | GCP | PL (Poland) | 46,848 |
| 35.221.213.148 | GCP | TW (Taiwan) | 40,191 |
| 35.226.217.37 | GCP | US | 36,487 |
| 34.106.165.124 | GCP | US | 29,090 |
| 35.245.28.4 | GCP | US | 28,330 |
| 34.106.30.165 | GCP | US | 27,424 |
| 35.221.97.60 | GCP | TW (Taiwan) | 25,040 |
| 34.86.113.72 | GCP | US | 23,600 |
| 136.111.129.249 | GCP | US | 21,256 |
| 35.200.111.181 | GCP | TW (Taiwan) | 17,400 |
| (+ 5 others) | | | 2,063 |
| **Total** | | | **394,716** |

Observation window: 2026-02-06T00:30:13Z to 2026-02-08T13:59:26Z (61.5 hours).

The Rejected Query (captured via tcpdump)

The gateway sends the following indexingProgress query to poll subgraph sync status. The query template is small (~210 bytes of GraphQL), but it includes all deployment IDs for the current batch in the $deployments variable:

```json
{
  "query": "\n        query indexingProgress($deployments: [String!]!) {\n            indexingStatuses(subgraphs: $deployments) {\n                subgraph\n                chains {\n                    network\n                    latestBlock { number }\n                    earliestBlock { number }\n                }\n            }\n        }",
  "variables": {
    "deployments": [
      "QmXZQLa1fVsZyTU1asb4NaHy1WJxDMpTGSP5RmZcdm379u",
      "QmdCKcx2br3W7XbcMjLwwfXRZxrgL6WvRsqLnmAh4qkGGH",
      "... (100 deployment IDs total)"
    ]
  }
}
```

Total body size: 5,275 bytes (exceeds the 4,096-byte limit by 1,179 bytes).

Batching Behavior

The gateway splits deployment IDs into batches of 100. With ~825 allocations:

| Batch | Deployments | Body Size | HTTP Status | Result |
|---|---|---|---|---|
| 1–8 | 100 each | 5,275 bytes | 400 | Rejected |
| 9 (remainder) | 25 | 1,600 bytes | 200 | Success |
| Version check | 0 | 35 bytes | 200 | Success |

8 out of 9 batches are rejected per polling cycle. The gateway only receives indexing status for 25 out of 825 subgraphs.
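The split above can be reproduced with a small sketch. The byte sizes here are estimates (~380 bytes of query/envelope overhead plus ~49 bytes per quoted deployment ID, assumptions that approximate the observed 5,275-byte bodies), not exact wire sizes:

```rust
// Sketch: how 825 deployment IDs split into batches of 100, and which
// batches exceed the 4096-byte limit. Byte sizes are estimates
// (~380 bytes of envelope + ~49 bytes per ID), not exact wire sizes.
const STATUS_QUERY_MAX_SIZE_BYTES: usize = 4096;

fn estimated_body_size(num_ids: usize) -> usize {
    380 + num_ids * 49
}

fn batch_sizes(total_ids: usize, batch: usize) -> Vec<usize> {
    let full = total_ids / batch;
    let rem = total_ids % batch;
    let mut sizes = vec![batch; full];
    if rem > 0 {
        sizes.push(rem);
    }
    sizes
}

fn main() {
    for (i, n) in batch_sizes(825, 100).iter().enumerate() {
        let bytes = estimated_body_size(*n);
        let status = if bytes > STATUS_QUERY_MAX_SIZE_BYTES { 400 } else { 200 };
        println!("batch {}: {} IDs, ~{} bytes -> HTTP {}", i + 1, n, bytes, status);
    }
}
```

Batches 1–8 land at ~5,280 estimated bytes and fail; only the 25-ID remainder (~1,605 bytes) fits under the limit.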

Root Cause in Code

The limit is a hardcoded constant, not configurable via TOML config or environment variables:

crates/service/src/constants.rs:

```rust
/// 4KB is generous for legitimate status queries, which are typically
/// under 500 bytes. Complex queries with many fields rarely exceed 2KB.
pub const STATUS_QUERY_MAX_SIZE_BYTES: usize = 4096;
```

crates/service/src/routes/status.rs:

```rust
pub async fn status(
    State(state): State<GraphNodeState>,
    body: Bytes,
) -> Result<impl IntoResponse, SubgraphServiceError> {
    if body.len() > STATUS_QUERY_MAX_SIZE_BYTES {
        return Err(SubgraphServiceError::InvalidStatusQuery(anyhow::anyhow!(
            "Query exceeds maximum size of {} bytes",
            STATUS_QUERY_MAX_SIZE_BYTES
        )));
    }
    // ...
```

The check is on the raw HTTP body (JSON envelope + query + all variables), not just the GraphQL query string.

Note: the existing max_request_body_size config option (default 2MB) only applies to the main /subgraphs/id/{id} query endpoint, not to the /status route.

Why the Limit Is Too Low

  • Each IPFS CID (deployment ID) is 46 characters, plus ~3 bytes of JSON encoding (two quotes and a comma) ≈ 49 bytes per ID
  • Query template overhead: ~380 bytes
  • Maximum deployments per batch at 4,096 bytes: (4096 - 380) / 49 = ~75 deployments
  • The gateway sends 100 per batch, producing 5,275 bytes
  • Any indexer with more than ~75 allocations will hit this on every status poll
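The capacity arithmetic above can be checked directly, under the same assumptions (~380 bytes of overhead, ~49 bytes per ID):

```rust
// Capacity check: how many ~49-byte deployment IDs fit under a given body
// limit after ~380 bytes of query/envelope overhead (estimates from above).
const TEMPLATE_OVERHEAD: usize = 380;
const BYTES_PER_ID: usize = 49;

fn max_deployments_per_batch(limit_bytes: usize) -> usize {
    (limit_bytes - TEMPLATE_OVERHEAD) / BYTES_PER_ID
}

fn main() {
    // 4096-byte limit: 75 IDs, below the gateway's batch size of 100.
    println!("{}", max_deployments_per_batch(4096));
    // 8192-byte limit: 159 IDs, comfortable headroom over a 100-ID batch.
    println!("{}", max_deployments_per_batch(8192));
}
```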

The comment in the code ("typically under 500 bytes") reflects a status query without the $deployments variable (e.g., { version { version } }, which is 35 bytes). The indexingProgress query with deployment IDs is a legitimate and expected gateway query pattern.

Secondary Issue: costModels Batch Limit

During investigation, a related issue was also observed. The gateway sends a costModels query with all 825 deployment IDs in a single request, which hits a separate batch limit:

```json
{"data":null,"errors":[{"message":"Batch size 825 exceeds maximum allowed (200)","locations":[{"line":3,"column":17}],"path":["costModels"]}]}
```

This limit (max_cost_model_batch_size) is configurable in the TOML config (default: 200), but the gateway does not batch this query — it sends all deployment IDs at once.

Suggested Fix

  1. Make STATUS_QUERY_MAX_SIZE_BYTES configurable via the [service] TOML config section (e.g., status_query_max_size_bytes), similar to how max_request_body_size is configurable for the query endpoint.

  2. Increase the default to something that accommodates the gateway's batch size of 100 deployment IDs. A value of 8,192 bytes (8KB) would support up to ~159 deployments per batch, giving comfortable headroom. Alternatively, align it with max_request_body_size since the same DoS protections (rate limiting, authentication) apply to the /status route.

  3. Consider the costModels batching gap: the gateway sends all deployment IDs in a single costModels request without respecting max_cost_model_batch_size. Either the gateway needs to batch, or the default needs to be higher.
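Fix (1) could look roughly like the following. This is a sketch only: `ServiceConfig`, `status_query_max_size_bytes`, and the helper are illustrative names, not the actual indexer-service-rs types:

```rust
// Hypothetical sketch of a configurable limit replacing the hardcoded
// constant. Names (ServiceConfig, status_query_max_size_bytes) are
// illustrative, not the real indexer-service-rs config types.
pub struct ServiceConfig {
    /// Max accepted /status request body, in bytes.
    /// TOML: [service] status_query_max_size_bytes (proposed default: 8192).
    pub status_query_max_size_bytes: usize,
}

impl Default for ServiceConfig {
    fn default() -> Self {
        Self {
            status_query_max_size_bytes: 8192,
        }
    }
}

/// The handler's size check would read the configured limit instead of
/// the STATUS_QUERY_MAX_SIZE_BYTES constant.
fn status_body_within_limit(body_len: usize, config: &ServiceConfig) -> bool {
    body_len <= config.status_query_max_size_bytes
}
```

With an 8,192-byte default, the observed 5,275-byte gateway bodies pass while the limit still bounds oversized requests.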

Impact

  • Gateways cannot determine indexing progress for the vast majority of allocated subgraphs
  • This likely affects query routing decisions, as the gateway has incomplete sync status data
  • Every indexer with >75 allocations is affected
  • The error rate scales with the number of allocations and the number of gateway nodes polling

How This Was Captured

Traffic was captured on the reverse proxy host using tcpdump on the plain HTTP segment between Traefik and the upstream indexer-service (port 7610), then reassembled with tshark:

```shell
tcpdump -i enp6s18 -s 0 -w capture.pcap 'port 7610'
tshark -r capture.pcap -qz "follow,tcp,ascii,<stream_id>"
```
