Context
We are developing a Crossplane controller for Slurm dynamic node migration. Integration tests need a kind (Kubernetes-in-Docker) cluster that can resolve and reach sind Slurm nodes — pods must be able to SSH into sind controllers and run Slurm commands.
To validate this, we wrote a bridge script that creates two sind clusters in the same realm and connects a single kind cluster to both. The script works, but it exposes several places where sind's API forces consumers to reach behind the abstraction into raw Docker.
What the script does
- Creates two sind clusters (same realm, shared mesh/DNS)
- Creates a kind cluster
- Attaches kind node containers to the sind mesh + both cluster networks
- Patches CoreDNS with a stub zone forwarding <realm>.sind to the sind DNS server
- Creates a K8s secret from the sind SSH volume (private key + known_hosts)
- Verifies by running sinfo on both controllers via SSH from kind pods
Abstraction leaks
1. Naming conventions reimplemented in the script
The script manually reconstructs internal naming:
SIND_MESH_NET="${SIND_REALM}-mesh"
SIND_NET_A="${SIND_REALM}-${SIND_CLUSTER_A}-net"
SIND_DNS_CONTAINER="${SIND_REALM}-dns"
SIND_SSH_VOLUME="${SIND_REALM}-ssh-config"
SIND_DNS_ZONE="${SIND_REALM}.sind"
These mirror pkg/mesh/mesh.go and pkg/cluster/naming.go. If sind changes its naming scheme, every consumer breaks.
2. DNS server IP requires docker inspect
docker inspect sind-dns \
--format '{{(index .NetworkSettings.Networks "sind-mesh").IPAddress}}'
sind get dns shows records but not the server address itself.
3. SSH credentials require Docker volume gymnastics
docker run --rm \
-v sind-ssh-config:/ssh:ro \
-v "${tmpdir}:/out:Z" \
alpine sh -c 'cp /ssh/id_ed25519 /ssh/known_hosts /out/'
sind get ssh-config returns the host file path, which is inaccessible from inside containers or CI runners.
4. Network connect/disconnect is raw Docker
sind has no awareness of external consumers on its networks. The teardown ordering is critical — forgetting to disconnect kind before sind delete cluster causes Docker errors because sind can't remove networks with active endpoints.
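Until sind tracks external endpoints itself, consumers can at least make the ordering robust on their side. A minimal sketch (the docker/kind/sind calls are stubbed with echo so only the control flow is visible): register the disconnect as an EXIT trap as soon as the connect succeeds, so it runs even when the test run aborts partway through.

```shell
# Sketch: tie the disconnect to an EXIT trap so the critical ordering
# (disconnect before delete) survives early failures under `set -e`.
# The real docker/kind/sind commands are stubbed with echo here.
connect()    { echo "docker network connect $1 kind-node"; }
disconnect() { echo "docker network disconnect $1 kind-node"; }

cleanup() {
    # Runs on every exit path, including failures under set -e.
    disconnect sind-mesh
    echo "sind delete cluster --all   # safe: no foreign endpoints left"
}

connect sind-mesh
trap cleanup EXIT

echo "... integration tests run here ..."
```

A `sind network connect` that records the endpoint would let sind do this bookkeeping itself instead of every consumer reinventing the trap.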
5. SSH image is a hardcoded constant
The script copies ghcr.io/gsi-hpc/sind-node:latest from pkg/mesh/ssh.go. No way to query it at runtime.
Suggested improvements
| Suggestion | Eliminates |
|---|---|
| `sind get mesh [--output json]` — expose DNS IP, mesh network name, SSH volume, SSH image | Naming reimplementation, `docker inspect` for DNS IP, hardcoded image |
| `sind get ssh-key --private` / `--known-hosts` — dump credentials to stdout | Docker volume extraction hack |
| `--output json` on all `get` commands | `awk` parsing of human-readable tables |
| `sind network connect/disconnect <container>` — let sind track external consumers | Raw `docker network connect`, fragile teardown ordering |
| `sind get coredns-config` — emit a CoreDNS stub zone snippet | Manually constructing the Corefile block, documenting the `ndots` trap |
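To make the payoff concrete, here is how a consumer might use a hypothetical `sind get mesh --output json` instead of reconstructing names and running `docker inspect`. The command and its field names are illustrative, not an existing sind interface:

```shell
# Hypothetical JSON a future `sind get mesh --output json` could emit.
# Field names are invented for illustration.
mesh_json='{"realm":"sind","mesh_network":"sind-mesh","dns_ip":"172.18.0.2","ssh_volume":"sind-ssh-config","ssh_image":"ghcr.io/gsi-hpc/sind-node:latest"}'

# One call replaces the naming reimplementation and the docker inspect.
# jq would be the natural consumer; sed shown here to stay dependency-free.
dns_ip=$(printf '%s' "${mesh_json}" | sed -n 's/.*"dns_ip":"\([^"]*\)".*/\1/p')
ssh_image=$(printf '%s' "${mesh_json}" | sed -n 's/.*"ssh_image":"\([^"]*\)".*/\1/p')

echo "${dns_ip}"     # 172.18.0.2
echo "${ssh_image}"  # ghcr.io/gsi-hpc/sind-node:latest
```

Every derived value in the bridge script's "Derived names" section would then come from one queryable source of truth.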
Broader ideas for Crossplane integration testing
- sind status --watch or a readiness signal — the controller will add/remove workers and needs to know when slurmd has registered, not just when the container is running.
- sind delete realm — nuke all resources in a realm (mesh included) in one command for CI cleanup.
- ndots documentation — Kubernetes defaults to ndots:5, which causes musl-based images to cycle through all search domains before trying bare FQDNs. .sind names have too few dots and time out. This is a gotcha anyone connecting kind to sind will hit.
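The ndots behavior is easy to demonstrate without a cluster. A rough simulation of the resolver's search-list expansion (the search domains shown are the typical kind defaults for the default namespace):

```shell
# With ndots:5, any name containing fewer than 5 dots is first expanded
# with every search domain; the bare name is only tried last. For a musl
# resolver with an unreachable search path, each expansion can cost a
# full timeout before "controller.alpha.sind.sind." is ever queried.
name="controller.alpha.sind.sind"
ndots=5
search="default.svc.cluster.local svc.cluster.local cluster.local"

dots=${name//[^.]/}                  # strip everything except the dots
if (( ${#dots} < ndots )); then
    for domain in ${search}; do
        echo "query: ${name}.${domain}"
    done
fi
echo "query: ${name}."               # bare FQDN, tried last
```

With dnsConfig ndots=1 (as the verification pods below use), the name has more than one dot, so the bare FQDN is queried first and the search-list walk is skipped.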
sind-kind-bridge.sh
#!/usr/bin/env bash
#
# sind-kind-bridge.sh — Create two sind Slurm clusters (same realm) and one
# kind Kubernetes cluster, then connect them so that kind pods can resolve
# all sind node hostnames and reach them by IP.
#
# Architecture:
#
#   sind creates a shared mesh network per realm that carries DNS and SSH.
#   Each cluster gets its own network for node traffic. Two clusters in
#   the same realm share the mesh, so a single DNS server holds records
#   for both:
#
#       <hostname>.<cluster>.<realm>.sind
#
#   kind runs on a separate Docker network. By attaching the kind node
#   containers to the sind mesh and both cluster networks, they gain L2
#   connectivity to every sind container. A CoreDNS stub zone forwards
#   the "<realm>.sind" domain to the sind DNS server.
#
#   Docker's embedded DNS does NOT honour the host's systemd-resolved
#   per-link routing, so the CoreDNS stub zone is necessary.
#
# Usage:
#   ./sind-kind-bridge.sh            # create everything
#   ./sind-kind-bridge.sh teardown   # disconnect kind and delete both clusters
#
set -euo pipefail
# ---------------------------------------------------------------------------
# Parameters
# ---------------------------------------------------------------------------

# Realm — all sind resources (mesh, DNS, clusters) live under this namespace.
SIND_REALM="sind"

# sind cluster names — two clusters in the same realm.
SIND_CLUSTER_A="alpha"
SIND_CLUSTER_B="beta"

# Optional: paths to sind config files. Leave empty to use default layout
# (one controller + one worker per cluster).
SIND_CONFIG_A=""
SIND_CONFIG_B=""

# Mount mode for /data inside sind containers.
# Use "volume" for a Docker volume or a host path for a bind mount.
SIND_DATA="volume"

# kind cluster name.
KIND_CLUSTER="sind-bridge"

# kind node image (leave empty for the kind default).
KIND_IMAGE=""

# SSH client image used by verification pods. Uses the same image as sind's
# own SSH relay container — it already has openssh-client installed.
SSH_IMAGE="ghcr.io/gsi-hpc/sind-node:latest"

# Slurm command to run on each controller during verification.
SLURM_VERIFY_CMD="sinfo"

# ---------------------------------------------------------------------------
# Derived names — follow sind's naming conventions
# ---------------------------------------------------------------------------

# Mesh network (shared by all clusters in the realm, hosts DNS + SSH).
SIND_MESH_NET="${SIND_REALM}-mesh"

# Per-cluster networks (carry the actual node IPs).
SIND_NET_A="${SIND_REALM}-${SIND_CLUSTER_A}-net"
SIND_NET_B="${SIND_REALM}-${SIND_CLUSTER_B}-net"

# DNS container that serves the <realm>.sind zone.
SIND_DNS_CONTAINER="${SIND_REALM}-dns"

# sind SSH config volume (shared across all clusters in the realm, contains
# the private key and known_hosts for passwordless SSH to every sind node).
SIND_SSH_VOLUME="${SIND_REALM}-ssh-config"

# CoreDNS zone that covers all clusters in the realm.
SIND_DNS_ZONE="${SIND_REALM}.sind"

# kubectl context for the kind cluster.
KIND_CONTEXT="kind-${KIND_CLUSTER}"

# Name of the Kubernetes secret that mirrors the sind SSH volume.
K8S_SSH_SECRET="sind-ssh"

# ---------------------------------------------------------------------------
# Functions
# ---------------------------------------------------------------------------
create_sind_cluster() {
    # Create a single sind cluster.
    #   $1 — cluster name
    #   $2 — config file path (empty string for defaults)
    local name=$1 config=$2
    local flags=(--data "${SIND_DATA}")
    [[ -n "${config}" ]] && flags+=(--config "${config}")
    echo "==> Creating sind cluster '${name}'"
    sind --realm "${SIND_REALM}" create cluster "${name}" "${flags[@]}"
    sind --realm "${SIND_REALM}" status "${name}"
}

discover_sind_dns_ip() {
    # Look up the sind DNS container's IP on the mesh network.
    # Prints the IP to stdout.
    docker inspect "${SIND_DNS_CONTAINER}" \
        --format "{{(index .NetworkSettings.Networks \"${SIND_MESH_NET}\").IPAddress}}"
}

create_kind_cluster() {
    # Create a kind cluster.
    echo "==> Creating kind cluster '${KIND_CLUSTER}'"
    local flags=(--name "${KIND_CLUSTER}")
    [[ -n "${KIND_IMAGE}" ]] && flags+=(--image "${KIND_IMAGE}")
    kind create cluster "${flags[@]}"
}

connect_kind_to_sind() {
    # Attach every kind node container to the sind Docker networks.
    # Each node needs connectivity to:
    #   - the mesh network     → to reach the sind DNS server
    #   - each cluster network → to reach the node IPs that DNS resolves to
    local nodes
    mapfile -t nodes < <(kind get nodes --name "${KIND_CLUSTER}")
    for node in "${nodes[@]}"; do
        for net in "${SIND_MESH_NET}" "${SIND_NET_A}" "${SIND_NET_B}"; do
            echo "==> Connecting ${node} to ${net}"
            docker network connect "${net}" "${node}"
        done
    done
}

patch_coredns() {
    # Add a stub zone to CoreDNS that forwards the sind realm domain to the
    # sind DNS server. Without this, pods cannot resolve sind hostnames —
    # CoreDNS would fall through to Docker's embedded DNS, which ignores
    # the host's systemd-resolved routing rules.
    #   $1 — sind DNS server IP
    local dns_ip=$1
    echo "==> Patching CoreDNS: forwarding '${SIND_DNS_ZONE}' to ${dns_ip}"
    kubectl --context "${KIND_CONTEXT}" apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    ${SIND_DNS_ZONE}:53 {
        errors
        cache 30
        forward . ${dns_ip}
    }
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
            max_concurrent 1000
        }
        cache 30 {
            disable success cluster.local
            disable denial cluster.local
        }
        loop
        reload
        loadbalance
    }
EOF
    kubectl --context "${KIND_CONTEXT}" rollout restart deployment coredns -n kube-system
    kubectl --context "${KIND_CONTEXT}" rollout status deployment coredns -n kube-system --timeout=60s
}

load_ssh_image() {
    # Load the SSH client image into kind so verification pods can use
    # imagePullPolicy=Never and skip pulling from the registry. The sind
    # node image is usually already present locally because sind pulled it
    # when creating the clusters.
    echo "==> Loading SSH image into kind"
    if ! docker image inspect "${SSH_IMAGE}" &>/dev/null; then
        docker pull "${SSH_IMAGE}"
    fi
    kind load docker-image "${SSH_IMAGE}" --name "${KIND_CLUSTER}"
}

create_ssh_secret() {
    # Copy the sind SSH private key and known_hosts from the sind-ssh-config
    # Docker volume into a Kubernetes secret. Verification pods mount this
    # secret to authenticate against sind nodes.
    #
    # The volume is owned by root inside the container, so we use the :Z
    # flag to handle SELinux relabelling when bind-mounting the temp dir.
    echo "==> Creating Kubernetes secret '${K8S_SSH_SECRET}' from sind SSH volume"
    local tmpdir
    tmpdir=$(mktemp -d)
    docker run --rm \
        -v "${SIND_SSH_VOLUME}:/ssh:ro" \
        -v "${tmpdir}:/out:Z" \
        alpine sh -c 'cp /ssh/id_ed25519 /ssh/known_hosts /out/ && chmod 644 /out/*'
    kubectl --context "${KIND_CONTEXT}" create secret generic "${K8S_SSH_SECRET}" \
        --from-file="${tmpdir}/id_ed25519" \
        --from-file="${tmpdir}/known_hosts"
    rm -rf "${tmpdir}"
}

verify() {
    # Verify the full pipeline: DNS resolution, IP connectivity, and SSH
    # authentication by running a Slurm command on each controller from a
    # kind pod.
    #
    # Each pod:
    #   - mounts the sind SSH secret (private key + known_hosts)
    #   - uses dnsConfig to set ndots=1 (Kubernetes defaults to ndots=5,
    #     which causes musl-based images to cycle through all search domains
    #     before trying the bare FQDN — and that times out for .sind names)
    #   - SSHes into the controller and runs the configured Slurm command
    local controller
    for controller in \
        "controller.${SIND_CLUSTER_A}.${SIND_DNS_ZONE}" \
        "controller.${SIND_CLUSTER_B}.${SIND_DNS_ZONE}"
    do
        echo "==> Verifying: ${SLURM_VERIFY_CMD} on ${controller}"
        kubectl --context "${KIND_CONTEXT}" run -i --rm "verify-${RANDOM}" \
            --image="${SSH_IMAGE}" --image-pull-policy=Never --restart=Never \
            --overrides="$(cat <<OJSON
{
  "spec": {
    "dnsConfig": {
      "options": [{"name": "ndots", "value": "1"}]
    },
    "containers": [{
      "name": "ssh",
      "image": "${SSH_IMAGE}",
      "imagePullPolicy": "Never",
      "command": ["ssh",
        "-o", "StrictHostKeyChecking=yes",
        "-o", "UserKnownHostsFile=/ssh/known_hosts",
        "-i", "/ssh/id_ed25519",
        "root@${controller}",
        "${SLURM_VERIFY_CMD}"
      ],
      "volumeMounts": [{
        "name": "ssh",
        "mountPath": "/ssh",
        "readOnly": true
      }]
    }],
    "volumes": [{
      "name": "ssh",
      "secret": {
        "secretName": "${K8S_SSH_SECRET}",
        "defaultMode": 384
      }
    }]
  }
}
OJSON
)"
    done
}

disconnect_kind_from_sind() {
    # Detach every kind node container from the sind Docker networks.
    # This MUST run before deleting sind clusters, otherwise sind cannot
    # remove its networks (Docker refuses to delete a network that still
    # has connected endpoints).
    local nodes=()
    mapfile -t nodes < <(kind get nodes --name "${KIND_CLUSTER}" 2>/dev/null) || true
    # Nothing to do if the kind cluster is already gone.
    [[ ${#nodes[@]} -eq 0 ]] && return 0
    for node in "${nodes[@]}"; do
        for net in "${SIND_MESH_NET}" "${SIND_NET_A}" "${SIND_NET_B}"; do
            echo "==> Disconnecting ${node} from ${net}"
            docker network disconnect "${net}" "${node}" 2>/dev/null || true
        done
    done
}

teardown() {
    # Tear down everything in the correct order:
    #   1. Disconnect kind nodes from sind networks
    #   2. Delete the kind cluster
    #   3. Delete both sind clusters (sind cleans up networks/volumes)
    echo "==> Tearing down"
    disconnect_kind_from_sind
    echo "==> Deleting kind cluster '${KIND_CLUSTER}'"
    kind delete cluster --name "${KIND_CLUSTER}" 2>/dev/null || true
    echo "==> Deleting sind clusters"
    sind --realm "${SIND_REALM}" delete cluster --all
}

summary() {
    echo ""
    echo "==> Setup complete"
    echo "    sind realm   : ${SIND_REALM}"
    echo "    sind clusters: ${SIND_CLUSTER_A}, ${SIND_CLUSTER_B}"
    echo "    kind cluster : ${KIND_CLUSTER} (context: ${KIND_CONTEXT})"
    echo "    sind DNS     : ${SIND_DNS_IP} (zone: ${SIND_DNS_ZONE})"
    echo ""
    echo "    DNS records reachable from kind pods:"
    sind --realm "${SIND_REALM}" get dns | sed 's/^/    /'
}

# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
case "${1:-up}" in
    up)
        create_sind_cluster "${SIND_CLUSTER_A}" "${SIND_CONFIG_A}"
        create_sind_cluster "${SIND_CLUSTER_B}" "${SIND_CONFIG_B}"
        SIND_DNS_IP=$(discover_sind_dns_ip)
        echo "==> sind DNS at ${SIND_DNS_IP} on ${SIND_MESH_NET}"
        create_kind_cluster
        connect_kind_to_sind
        patch_coredns "${SIND_DNS_IP}"
        load_ssh_image
        create_ssh_secret
        verify
        summary
        ;;
    teardown)
        teardown
        ;;
    *)
        echo "Usage: $0 [up|teardown]" >&2
        exit 1
        ;;
esac
Example output: up
==> Creating sind cluster 'alpha'
CLUSTER STATUS (R/S/P/T)
alpha running (2/0/0/2)
NETWORKS
NAME DRIVER SUBNET GATEWAY STATUS
sind-mesh bridge 172.18.0.0/16 172.18.0.1 ✓
sind-alpha-net bridge 172.19.0.0/16 172.19.0.1 ✓
MESH SERVICES
NAME CONTAINER STATUS
dns sind-dns ✓
MOUNTS
MOUNT SOURCE TYPE STATUS
/etc/slurm sind-alpha-config volume ✓
/etc/munge sind-alpha-munge volume ✓
/data sind-alpha-data volume ✓
NODES
NAME ROLE IP CONTAINER MUNGE SSHD SERVICES
controller.alpha controller 172.19.0.4 running ✓ ✓ slurmctld ✓
worker-0.alpha worker 172.19.0.3 running ✓ ✓ slurmd ✓
==> Creating sind cluster 'beta'
CLUSTER STATUS (R/S/P/T)
beta running (2/0/0/2)
NETWORKS
NAME DRIVER SUBNET GATEWAY STATUS
sind-mesh bridge 172.18.0.0/16 172.18.0.1 ✓
sind-beta-net bridge 172.21.0.0/16 172.21.0.1 ✓
MESH SERVICES
NAME CONTAINER STATUS
dns sind-dns ✓
MOUNTS
MOUNT SOURCE TYPE STATUS
/etc/slurm sind-beta-config volume ✓
/etc/munge sind-beta-munge volume ✓
/data sind-beta-data volume ✓
NODES
NAME ROLE IP CONTAINER MUNGE SSHD SERVICES
controller.beta controller 172.21.0.3 running ✓ ✓ slurmctld ✓
worker-0.beta worker 172.21.0.4 running ✓ ✓ slurmd ✓
==> sind DNS at 172.18.0.2 on sind-mesh
==> Creating kind cluster 'sind-bridge'
Creating cluster "sind-bridge" ...
✓ Ensuring node image (kindest/node:v1.35.0) 🖼
✓ Preparing nodes 📦
✓ Writing configuration 📜
✓ Starting control-plane 🕹️
✓ Installing CNI 🔌
✓ Installing StorageClass 💾
Set kubectl context to "kind-sind-bridge"
==> Connecting sind-bridge-control-plane to sind-mesh
==> Connecting sind-bridge-control-plane to sind-alpha-net
==> Connecting sind-bridge-control-plane to sind-beta-net
==> Patching CoreDNS: forwarding 'sind.sind' to 172.18.0.2
configmap/coredns configured
deployment.apps/coredns restarted
deployment "coredns" successfully rolled out
==> Loading SSH image into kind
Image: "ghcr.io/gsi-hpc/sind-node:latest" [...] loading...
==> Creating Kubernetes secret 'sind-ssh' from sind SSH volume
secret/sind-ssh created
==> Verifying: sinfo on controller.alpha.sind.sind
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up infinite 1 idle worker-0
pod "verify-28197" deleted from default namespace
==> Verifying: sinfo on controller.beta.sind.sind
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up infinite 1 idle worker-0
pod "verify-5643" deleted from default namespace
==> Setup complete
sind realm : sind
sind clusters: alpha, beta
kind cluster : sind-bridge (context: kind-sind-bridge)
sind DNS : 172.18.0.2 (zone: sind.sind)
DNS records reachable from kind pods:
HOSTNAME IP
controller.alpha.sind.sind 172.19.0.4
worker-0.alpha.sind.sind 172.19.0.3
controller.beta.sind.sind 172.21.0.3
worker-0.beta.sind.sind 172.21.0.4
Example output: teardown
==> Tearing down
==> Disconnecting sind-bridge-control-plane from sind-mesh
==> Disconnecting sind-bridge-control-plane from sind-alpha-net
==> Disconnecting sind-bridge-control-plane from sind-beta-net
==> Deleting kind cluster 'sind-bridge'
==> Deleting sind clusters