
Hardcoded 4096-byte STATUS_QUERY_MAX_SIZE_BYTES rejects legitimate gateway /status polling #937

@jimwavefive

Description

Bug Report: Hardcoded 4096-byte STATUS_QUERY_MAX_SIZE_BYTES rejects legitimate gateway /status polling (1.8.1+)

Summary

The /status endpoint in indexer-service-rs has a hardcoded 4096-byte body size limit (STATUS_QUERY_MAX_SIZE_BYTES) that rejects the indexingProgress query sent by Graph protocol gateways when the indexer has more than ~75 active allocations. The gateway batches deployment IDs into groups of 100, but 100 deployment IDs produce a ~5,275-byte body that exceeds the limit. This results in HTTP 400 errors for the majority of status polling requests.

We have observed 394,716 rejected requests over 61.5 hours (~107/min) from multiple Graph gateway IPs.

Environment

  • indexer-service-rs: latest release (Rust rewrite)
  • Active allocations: ~825
  • Affected endpoint: POST /status

Source IPs (Graph Gateway Infrastructure)

All rejected requests originate from known Graph protocol gateway infrastructure across multiple cloud providers and regions:

| Source IP | Cloud Provider | Region | Rejected Requests |
|---|---|---|---|
| 34.185.191.203 | GCP | US | 49,158 |
| 116.202.192.158 | Hetzner | DE (Germany) | 47,829 |
| 34.116.224.212 | GCP | PL (Poland) | 46,848 |
| 35.221.213.148 | GCP | TW (Taiwan) | 40,191 |
| 35.226.217.37 | GCP | US | 36,487 |
| 34.106.165.124 | GCP | US | 29,090 |
| 35.245.28.4 | GCP | US | 28,330 |
| 34.106.30.165 | GCP | US | 27,424 |
| 35.221.97.60 | GCP | TW (Taiwan) | 25,040 |
| 34.86.113.72 | GCP | US | 23,600 |
| 136.111.129.249 | GCP | US | 21,256 |
| 35.200.111.181 | GCP | TW (Taiwan) | 17,400 |
| (+ 5 others) | | | 2,063 |
| **Total** | | | **394,716** |

Observation window: 2026-02-06T00:30:13Z to 2026-02-08T13:59:26Z (61.5 hours).

The Rejected Query (captured via tcpdump)

The gateway sends the following indexingProgress query to poll subgraph sync status. The query template is small (~210 bytes of GraphQL), but it includes all deployment IDs for the current batch in the $deployments variable:

```json
{
  "query": "\n        query indexingProgress($deployments: [String!]!) {\n            indexingStatuses(subgraphs: $deployments) {\n                subgraph\n                chains {\n                    network\n                    latestBlock { number }\n                    earliestBlock { number }\n                }\n            }\n        }",
  "variables": {
    "deployments": [
      "QmXZQLa1fVsZyTU1asb4NaHy1WJxDMpTGSP5RmZcdm379u",
      "QmdCKcx2br3W7XbcMjLwwfXRZxrgL6WvRsqLnmAh4qkGGH",
      "... (100 deployment IDs total)"
    ]
  }
}
```

Total body size: 5,275 bytes (exceeds the 4,096-byte limit by 1,179 bytes).

Batching Behavior

The gateway splits deployment IDs into batches of 100. With ~825 allocations:

| Batch | Deployments | Body Size | HTTP Status | Result |
|---|---|---|---|---|
| 1–8 | 100 each | 5,275 bytes | 400 | Rejected |
| 9 (remainder) | 25 | 1,600 bytes | 200 | Success |
| Version check | 0 | 35 bytes | 200 | Success |

8 out of 9 batches are rejected per polling cycle. The gateway only receives indexing status for 25 out of 825 subgraphs.
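The split above can be reproduced with a small sketch. The byte sizes here are estimates (~380 bytes of query/envelope overhead plus ~49 bytes per quoted deployment ID, assumptions that approximate the observed 5,275-byte bodies), not exact wire sizes:

```rust
// Sketch: how 825 deployment IDs split into batches of 100, and which
// batches exceed the 4096-byte limit. Byte sizes are estimates
// (~380 bytes of envelope + ~49 bytes per ID), not exact wire sizes.
const STATUS_QUERY_MAX_SIZE_BYTES: usize = 4096;

fn estimated_body_size(num_ids: usize) -> usize {
    380 + num_ids * 49
}

fn batch_sizes(total_ids: usize, batch: usize) -> Vec<usize> {
    let full = total_ids / batch;
    let rem = total_ids % batch;
    let mut sizes = vec![batch; full];
    if rem > 0 {
        sizes.push(rem);
    }
    sizes
}

fn main() {
    for (i, n) in batch_sizes(825, 100).iter().enumerate() {
        let bytes = estimated_body_size(*n);
        let status = if bytes > STATUS_QUERY_MAX_SIZE_BYTES { 400 } else { 200 };
        println!("batch {}: {} IDs, ~{} bytes -> HTTP {}", i + 1, n, bytes, status);
    }
}
```

Batches 1–8 land at ~5,280 estimated bytes and fail; only the 25-ID remainder (~1,605 bytes) fits under the limit.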

Root Cause in Code

The limit is a hardcoded constant, not configurable via TOML config or environment variables:

crates/service/src/constants.rs:

```rust
/// 4KB is generous for legitimate status queries, which are typically
/// under 500 bytes. Complex queries with many fields rarely exceed 2KB.
pub const STATUS_QUERY_MAX_SIZE_BYTES: usize = 4096;
```

crates/service/src/routes/status.rs:

```rust
pub async fn status(
    State(state): State<GraphNodeState>,
    body: Bytes,
) -> Result<impl IntoResponse, SubgraphServiceError> {
    if body.len() > STATUS_QUERY_MAX_SIZE_BYTES {
        return Err(SubgraphServiceError::InvalidStatusQuery(anyhow::anyhow!(
            "Query exceeds maximum size of {} bytes",
            STATUS_QUERY_MAX_SIZE_BYTES
        )));
    }
    // ...
```

The check is on the raw HTTP body (JSON envelope + query + all variables), not just the GraphQL query string.

Note: the existing max_request_body_size config option (default 2MB) only applies to the main /subgraphs/id/{id} query endpoint, not to the /status route.

Why the Limit Is Too Low

  • Each IPFS CID (deployment ID) is 46 characters, plus ~3 bytes of JSON encoding (two quotes and a comma) ≈ 49 bytes per ID
  • Query template overhead: ~380 bytes
  • Maximum deployments per batch at 4,096 bytes: (4096 - 380) / 49 = ~75 deployments
  • The gateway sends 100 per batch, producing 5,275 bytes
  • Any indexer with more than ~75 allocations will hit this on every status poll
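The capacity arithmetic above can be checked directly, under the same assumptions (~380 bytes of overhead, ~49 bytes per ID):

```rust
// Capacity check: how many ~49-byte deployment IDs fit under a given body
// limit after ~380 bytes of query/envelope overhead (estimates from above).
const TEMPLATE_OVERHEAD: usize = 380;
const BYTES_PER_ID: usize = 49;

fn max_deployments_per_batch(limit_bytes: usize) -> usize {
    (limit_bytes - TEMPLATE_OVERHEAD) / BYTES_PER_ID
}

fn main() {
    // 4096-byte limit: 75 IDs, below the gateway's batch size of 100.
    println!("{}", max_deployments_per_batch(4096));
    // 8192-byte limit: 159 IDs, comfortable headroom over a 100-ID batch.
    println!("{}", max_deployments_per_batch(8192));
}
```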

The comment in the code ("typically under 500 bytes") reflects a status query without the $deployments variable (e.g., { version { version } }, which is 35 bytes). The indexingProgress query with deployment IDs is a legitimate and expected gateway query pattern.

Secondary Issue: costModels Batch Limit

During investigation, a related issue was also observed. The gateway sends a costModels query with all 825 deployment IDs in a single request, which hits a separate batch limit:

```json
{"data":null,"errors":[{"message":"Batch size 825 exceeds maximum allowed (200)","locations":[{"line":3,"column":17}],"path":["costModels"]}]}
```

This limit (max_cost_model_batch_size) is configurable in the TOML config (default: 200), but the gateway does not batch this query — it sends all deployment IDs at once.

Suggested Fix

  1. Make STATUS_QUERY_MAX_SIZE_BYTES configurable via the [service] TOML config section (e.g., status_query_max_size_bytes), similar to how max_request_body_size is configurable for the query endpoint.

  2. Increase the default to something that accommodates the gateway's batch size of 100 deployment IDs. A value of 8,192 bytes (8KB) would support up to ~159 deployments per batch, giving comfortable headroom. Alternatively, align it with max_request_body_size since the same DoS protections (rate limiting, authentication) apply to the /status route.

  3. Consider the costModels batching gap: the gateway sends all deployment IDs in a single costModels request without respecting max_cost_model_batch_size. Either the gateway needs to batch, or the default needs to be higher.
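Fix (1) could look roughly like the following. This is a sketch only: `ServiceConfig`, `status_query_max_size_bytes`, and the helper are illustrative names, not the actual indexer-service-rs types:

```rust
// Hypothetical sketch of a configurable limit replacing the hardcoded
// constant. Names (ServiceConfig, status_query_max_size_bytes) are
// illustrative, not the real indexer-service-rs config types.
pub struct ServiceConfig {
    /// Max accepted /status request body, in bytes.
    /// TOML: [service] status_query_max_size_bytes (proposed default: 8192).
    pub status_query_max_size_bytes: usize,
}

impl Default for ServiceConfig {
    fn default() -> Self {
        Self {
            status_query_max_size_bytes: 8192,
        }
    }
}

/// The handler's size check would read the configured limit instead of
/// the STATUS_QUERY_MAX_SIZE_BYTES constant.
fn status_body_within_limit(body_len: usize, config: &ServiceConfig) -> bool {
    body_len <= config.status_query_max_size_bytes
}
```

With an 8,192-byte default, the observed 5,275-byte gateway bodies pass while the limit still bounds oversized requests.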

Impact

  • Gateways cannot determine indexing progress for the vast majority of allocated subgraphs
  • This likely affects query routing decisions, as the gateway has incomplete sync status data
  • Every indexer with >75 allocations is affected
  • The error rate scales with the number of allocations and the number of gateway nodes polling

How This Was Captured

Traffic was captured on the reverse proxy host using tcpdump on the plain HTTP segment between Traefik and the upstream indexer-service (port 7610), then reassembled with tshark:

```shell
tcpdump -i enp6s18 -s 0 -w capture.pcap 'port 7610'
tshark -r capture.pcap -qz "follow,tcp,ascii,<stream_id>"
```
