[BUG]: After switching backend from SGLang to TensorRT-LLM, requests fail with no instances for backend/generate until frontend restart #8263

@Monokaix

Description

Describe the Bug

This issue manifests specifically after changing the deployment backend from SGLang to TensorRT-LLM (while keeping the same served model name / Dynamo namespace). On Kubernetes via DynamoGraphDeployment (DGD), the graph can report Ready and both frontend and worker pods can be Running/Ready, but /v1/chat/completions (or completions) returns HTTP 500 with an error like:

Failed to generate completions: no instances found for endpoint ".../backend/generate"

Frontend logs may also show a model card checksum mismatch relative to the model’s canonical checksum (i.e., a newly registered worker’s Model Deployment Card (MDC) is rejected).

Workaround observed: deleting and recreating the frontend pod restores successful traffic. This suggests the frontend holds incorrect or stuck state for discovery events / the canonical MDC across the backend swap, compounded by the contents of the discovery store (KV/etcd) and event ordering during the rollout.
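The suspected failure mode can be modeled with a minimal, hypothetical sketch (all names invented; this is not the actual dynamo code): the first accepted Model Deployment Card pins a canonical checksum per model name, later cards with a different checksum are rejected, and the pinned checksum is never cleared when the last instance deregisters.

```rust
use std::collections::HashMap;

/// Hypothetical per-model frontend state (names invented for illustration).
#[derive(Default)]
struct ModelState {
    canonical_checksum: Option<String>,
    instances: Vec<String>, // instance ids serving backend/generate
}

#[derive(Default)]
struct Frontend {
    models: HashMap<String, ModelState>,
}

impl Frontend {
    /// Register a worker's MDC; reject it if its checksum differs from the
    /// canonical one already pinned for this model name.
    fn register(&mut self, model: &str, instance: &str, mdcsum: &str) -> Result<(), String> {
        let state = self.models.entry(model.to_string()).or_default();
        if let Some(canon) = &state.canonical_checksum {
            if canon != mdcsum {
                return Err(format!(
                    "MDC checksum mismatch: canonical={canon}, got={mdcsum}"
                ));
            }
        } else {
            state.canonical_checksum = Some(mdcsum.to_string());
        }
        state.instances.push(instance.to_string());
        Ok(())
    }

    fn deregister(&mut self, model: &str, instance: &str) {
        if let Some(state) = self.models.get_mut(model) {
            state.instances.retain(|i| i != instance);
            // Bug hypothesis: canonical_checksum is NOT cleared here, so a
            // later worker with a different MDC fingerprint is rejected even
            // though zero instances remain for the model.
        }
    }

    fn instance_count(&self, model: &str) -> usize {
        self.models.get(model).map_or(0, |s| s.instances.len())
    }
}

fn main() {
    let mut fe = Frontend::default();
    // SGLang worker registers, then goes away during the rollout.
    fe.register("qwen", "sglang-0", "sha:aaa").unwrap();
    fe.deregister("qwen", "sglang-0");
    // TensorRT-LLM worker for the same model name produces a different mdcsum.
    let res = fe.register("qwen", "trtllm-0", "sha:bbb");
    assert!(res.is_err());
    assert_eq!(fe.instance_count("qwen"), 0); // routable instances stay at zero
    println!("rejected: {res:?}");
}
```

Under this model, the frontend-restart workaround makes sense: a fresh frontend starts with no pinned checksum, so the TensorRT-LLM worker's MDC becomes the new canonical card.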

Steps to Reproduce

  1. Deploy a DGD with SGLang as the backend (frontend + worker, typical single-replica setup).
  2. Wait until the DGD is Ready; call /v1/chat/completions (or completions) and confirm success.
  3. Change the DGD spec to TensorRT-LLM (e.g., switch backendFramework / worker image / args), keeping the same served-model-name and dynamoNamespace as before.
  4. Wait until the DGD is Ready again (confirm new frontend/worker pods exist and are Ready).
  5. Repeat the same API request as step 2.
  6. Observe HTTP 500 and no instances found for endpoint ".../backend/generate" in frontend logs; optionally observe MDC checksum / canonical checksum mismatch logs.
  7. Delete the frontend pod only, let it recreate, repeat step 5: requests succeed again.

Expected Behavior

After the DGD reports Ready following an SGLang → TensorRT-LLM backend change, the system should converge to routable backend/generate instances without manual pod surgery, either through a correct discovery lifecycle (old registrations removed before incompatible MDCs appear) and/or through defined upgrade semantics (operator/docs: drain order, discovery cleanup, forced frontend roll).
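One possible convergence rule, sketched below with the same invented names as above (this is a suggestion, not the actual dynamo implementation): drop the pinned canonical checksum once the model has no live instances, so the next registered MDC, even a differently fingerprinted one, becomes the new canonical card.

```rust
#[derive(Default, Debug)]
struct ModelState {
    canonical_checksum: Option<String>,
    instances: Vec<String>,
}

/// Sketch of a fix: clearing the canonical checksum when the last instance
/// deregisters lets a subsequent, incompatible-but-valid MDC be accepted.
fn deregister(state: &mut ModelState, instance: &str) {
    state.instances.retain(|i| i != instance);
    if state.instances.is_empty() {
        state.canonical_checksum = None;
    }
}

fn main() {
    let mut s = ModelState {
        canonical_checksum: Some("sha:aaa".into()),
        instances: vec!["sglang-0".into()],
    };
    deregister(&mut s, "sglang-0");
    assert!(s.canonical_checksum.is_none()); // next MDC is accepted fresh
    println!("state after rollout: {s:?}");
}
```

This depends on the old worker deregistering before the new one registers; if registrations can interleave during a rollout, the operator may additionally need to enforce drain order or roll the frontend.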

Actual Behavior

Following the SGLang → TensorRT-LLM switch, requests fail with no instances found for endpoint ".../backend/generate" despite Kubernetes reporting Ready. Restarting only the frontend pod clears the failure.

Environment

dynamo v0.7.1

Additional Context

  • Likely related to strict per-model-name canonical MDC checksum handling in lib/llm/src/discovery/watcher.rs and ModelDeploymentCard::mdcsum() in lib/llm/src/model_card.rs, plus DiscoveryInstance::Model { card_json, ... } in lib/runtime/src/discovery/mod.rs.

  • Error string originates from lib/runtime/src/pipeline/network/egress/push_router.rs when zero available instances exist for the endpoint client.

  • Asymmetry: vLLM ↔ SGLang transitions may not hit this as often; SGLang → TensorRT-LLM appears more prone (TRT stack often produces a materially different MDC fingerprint vs SGLang for the same HF model name).
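For illustration, the router-side failure has roughly this shape (a hypothetical sketch, not the actual push_router.rs code; the endpoint string is a placeholder): with an empty instance list for the endpoint, dispatch can only return the error seen in the frontend logs.

```rust
/// Hypothetical dispatch: pick any instance for the endpoint, or fail with
/// the "no instances found" error observed in the logs.
fn route<'a>(instances: &'a [String], endpoint: &str) -> Result<&'a String, String> {
    instances
        .first()
        .ok_or_else(|| format!("no instances found for endpoint \"{endpoint}\""))
}

fn main() {
    // Zero instances: the request can only fail, regardless of pod Readiness.
    let err = route(&[], "backend/generate").unwrap_err();
    assert!(err.contains("no instances found"));
    println!("{err}");
}
```

This is why Kubernetes Readiness is misleading here: the pods are healthy, but the frontend's view of discoverable instances for the endpoint is empty.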

Screenshots

No response
