Commit 08d0a96

feat: Add --fail-fast option to stop on first failure (pdsh -k compatible) (#103)
* feat: Add --fail-fast option to stop on first failure (pdsh -k compatible)

  Add a new --fail-fast / -k option that stops execution immediately when any node fails (connection error or non-zero exit code). This provides pdsh compatibility and is useful for critical operations where partial execution is unacceptable.

  Implementation details:
  - Uses tokio::sync::watch channel for cancellation signaling
  - Uses tokio::select! to race between execution and cancellation
  - Cancels pending tasks waiting in semaphore queue
  - Reports which node caused the failure with error details
  - Works with existing parallel execution infrastructure

  Features:
  - Short option -k for pdsh compatibility
  - Long option --fail-fast for clarity
  - Can be combined with --require-all-success and --check-all-nodes
  - Supports all parallelism settings (including --parallel N)

  Closes #94

* fix(quality): Use floor_char_boundary() for safe UTF-8 string truncation

  Priority: MEDIUM
  Issue: String slicing with byte index could panic on multi-byte UTF-8 chars

  The error message truncation in execute_with_fail_fast() used direct byte indexing (&first_line[..47]) which can panic if the index falls in the middle of a multi-byte UTF-8 character.

  Fixed by using floor_char_boundary(47) to find the largest valid char boundary at or before byte 47, ensuring safe string truncation for all Unicode content including CJK characters and emoji.

* docs: Add fail-fast option documentation to manpage and ARCHITECTURE.md

  - Add --fail-fast / -k option description in manpage OPTIONS section
  - Add fail-fast examples in manpage EXAMPLES section
  - Add fail-fast mode implementation details in ARCHITECTURE.md

* fix(test): Prefer debug binary in connect_timeout_test to avoid stale cache

  The test helper was finding stale release binaries in CI, causing tests to fail because the old binary didn't have --connect-timeout option. Changed to prefer debug binary since `cargo test` builds debug binaries.
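The UTF-8 truncation fix described in the commit message can be illustrated with a small standalone sketch. It is not the code from this commit: `truncate_at_char_boundary` is a hypothetical helper that walks back with `str::is_char_boundary` to approximate what the message says `floor_char_boundary(47)` does, so that slicing never lands inside a multi-byte character.

```rust
/// Hypothetical helper (not part of bssh): return the longest prefix of `s`
/// that is at most `max_bytes` bytes long and ends on a UTF-8 char boundary.
fn truncate_at_char_boundary(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    let mut end = max_bytes;
    // Index 0 is always a char boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    // Safe: `end` is a char boundary, unlike a raw `&s[..max_bytes]`,
    // which would panic if `max_bytes` fell inside a multi-byte char.
    &s[..end]
}
```

With a three-byte character such as `日`, asking for 4 bytes yields only the first character, because byte 4 falls inside the second one.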
1 parent 290741e commit 08d0a96

9 files changed

+689
-7
lines changed

ARCHITECTURE.md

Lines changed: 30 additions & 0 deletions
@@ -393,6 +393,36 @@ The batch flag is passed through the executor chain:
 - Applied in both normal mode (`execute()`) and stream mode (`handle_stream_mode()`)
 - TUI mode maintains its own quit handling and ignores this flag

+**Fail-Fast Mode (Added 2025-12):**
+
+The `--fail-fast` / `-k` option enables immediate termination when any node fails. This is compatible with pdsh's `-k` flag and useful for:
+- Critical operations where partial execution is unacceptable
+- Deployment scripts where all nodes must succeed
+- Validation checks across clusters
+
+Implementation uses:
+```rust
+// Cancellation signaling via tokio::sync::watch
+let (cancel_tx, cancel_rx) = watch::channel(false);
+
+// Task selection with cancellation check
+tokio::select! {
+    biased; // Prioritize cancellation check
+    _ = cancel_rx.changed() => {
+        // Task cancelled due to fail-fast
+        return Err(anyhow!("Execution cancelled due to fail-fast"));
+    }
+    permit = semaphore.acquire() => {
+        // Execute task normally
+    }
+}
+```
+
+The fail-fast mode integrates with:
+- `--require-all-success`: Both require all nodes to succeed, but fail-fast stops early
+- `--check-all-nodes`: Fail-fast stops early, check-all-nodes affects final exit code
+- `--parallel N`: Cancels pending tasks waiting in the semaphore queue
+
 ### 4. SSH Client (`ssh/client/*`, `ssh/tokio_client/*`)

 **SSH Client Module Structure (Refactored 2025-10-17):**
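
The fail-fast behaviour documented in the ARCHITECTURE.md hunk above (stop starting queued work once any node fails) can be sketched without tokio. The following is a simplified stand-in, not the commit's implementation: it swaps the `watch` channel and `select!` for an `AtomicBool` and plain worker threads, and `run_fail_fast` and `Outcome` are hypothetical names introduced only for illustration. Unlike the real feature, it does not interrupt tasks that are already running.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Per-node result in this sketch (hypothetical type, not from bssh).
#[derive(Debug, Clone, PartialEq)]
enum Outcome {
    Success,
    Failed(i32),
    Cancelled,
}

/// Run `task` on every node with at most `parallel` worker threads.
/// After the first non-zero exit code, nodes still waiting in the queue
/// are left as `Cancelled` instead of being started.
fn run_fail_fast<F>(nodes: &[&str], parallel: usize, task: F) -> Vec<(String, Outcome)>
where
    F: Fn(&str) -> i32 + Send + Sync + 'static,
{
    let queue: Arc<Mutex<VecDeque<(usize, String)>>> = Arc::new(Mutex::new(
        nodes.iter().enumerate().map(|(i, n)| (i, n.to_string())).collect(),
    ));
    let results = Arc::new(Mutex::new(vec![Outcome::Cancelled; nodes.len()]));
    let cancelled = Arc::new(AtomicBool::new(false));
    let task = Arc::new(task);

    let workers: Vec<_> = (0..parallel.max(1))
        .map(|_| {
            let (queue, results, cancelled, task) = (
                Arc::clone(&queue),
                Arc::clone(&results),
                Arc::clone(&cancelled),
                Arc::clone(&task),
            );
            thread::spawn(move || loop {
                // Take the next pending node; stop when the queue is empty.
                let next = queue.lock().unwrap().pop_front();
                let Some((idx, node)) = next else { break };
                // Pending work is skipped once a failure has been signalled.
                if cancelled.load(Ordering::SeqCst) {
                    continue;
                }
                let code = task(&node);
                let outcome = if code == 0 {
                    Outcome::Success
                } else {
                    cancelled.store(true, Ordering::SeqCst);
                    Outcome::Failed(code)
                };
                results.lock().unwrap()[idx] = outcome;
            })
        })
        .collect();

    for worker in workers {
        worker.join().unwrap();
    }

    let finished = results.lock().unwrap();
    let out: Vec<(String, Outcome)> = nodes
        .iter()
        .zip(finished.iter())
        .map(|(n, o)| (n.to_string(), o.clone()))
        .collect();
    out
}
```

With `parallel` set to 1 the nodes run strictly in order, which mirrors the `--parallel 1` example in the man page: a failure on the second node leaves the third one unstarted. With more workers, which nodes end up cancelled depends on scheduling.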

README.md

Lines changed: 8 additions & 0 deletions
@@ -15,6 +15,7 @@ A high-performance SSH client with **SSH-compatible syntax** for both single-hos
 - **Port Forwarding**: Full support for local (-L), remote (-R), and dynamic (-D) SSH port forwarding
 - **Jump Host Support**: Connect through bastion hosts using OpenSSH ProxyJump syntax (`-J`)
 - **Parallel Execution**: Execute commands across multiple nodes simultaneously
+- **Fail-Fast Mode**: Stop immediately on first failure with `-k` flag (pdsh compatible)
 - **Interactive Terminal UI (TUI)**: Real-time monitoring with 4 view modes (Summary/Detail/Split/Diff) for multi-node operations
 - **Cluster Management**: Define and manage node clusters via configuration files
 - **Progress Tracking**: Real-time progress indicators with smart detection (percentages, fractions, apt/dpkg)
@@ -219,6 +220,13 @@ bssh -C production --connect-timeout 10 "uptime"

 # Different timeouts for connection and command
 bssh -C production --connect-timeout 5 --timeout 600 "long-running-job"
+
+# Fail-fast mode: stop immediately on any failure (pdsh -k compatible)
+bssh -k -H "web1,web2,web3" "deploy.sh"
+bssh --fail-fast -C production "critical-script.sh"
+
+# Combine fail-fast with require-all-success for critical operations
+bssh -k --require-all-success -C production "service-restart.sh"
 ```

 ### Output Modes

docs/man/bssh.1

Lines changed: 42 additions & 0 deletions
@@ -247,6 +247,25 @@ which is useful for programmatic parsing or cleaner display. Works
 with both stream mode (--stream) and file mode (--output-dir).
 Example: bssh -H host1,host2 --stream -N "uname -a"

+.TP
+.BR \-k ", " \-\-fail\-fast
+Stop execution immediately on first failure (pdsh -k compatible).
+When enabled, bssh cancels pending commands when any node fails due to
+connection error or non-zero exit code. This is useful for:
+.RS
+.IP \[bu] 2
+Critical operations where partial execution is unacceptable
+.IP \[bu] 2
+Deployment scripts where all nodes must succeed
+.IP \[bu] 2
+Validation checks across clusters
+.RE
+.IP
+Running tasks are terminated gracefully, and the error message clearly
+indicates which node caused the failure. Can be combined with
+.B --require-all-success
+for strict error handling.
+
 .TP
 .BR \-v ", " \-\-verbose
 Increase verbosity (can be used multiple times: -v, -vv, -vvv)
@@ -1240,6 +1259,29 @@ Example output:
 Useful for monitoring long-running commands or when piping output.
 .RE

+.SS Fail-Fast Mode Examples
+.TP
+Stop on first failure during critical deployment:
+.B bssh -k -C production "deploy.sh"
+.RS
+Execution stops immediately if any node fails the deployment script
+.RE
+
+.TP
+Combine fail-fast with require-all-success:
+.B bssh --fail-fast --require-all-success -C production "service-restart.sh"
+.RS
+Stops early on failure AND ensures final exit code reflects any failures
+.RE
+
+.TP
+Sequential fail-fast with limited parallelism:
+.B bssh -k --parallel 1 -H "node1,node2,node3" "critical-operation"
+.RS
+Runs commands one at a time, stopping on first failure
+.RE
+
+.SS File Transfer Examples
 .TP
 Upload configuration file to all nodes:
 .B bssh -H "node1,node2,node3" upload /etc/myapp.conf /etc/myapp.conf

src/app/dispatcher.rs

Lines changed: 1 addition & 0 deletions
@@ -404,6 +404,7 @@ async fn handle_exec_command(cli: &Cli, ctx: &AppContext, command: &str) -> Resu
         check_all_nodes: cli.check_all_nodes,
         sudo_password,
         batch: cli.batch,
+        fail_fast: cli.fail_fast,
     };
     execute_command(params).await
 }

src/cli.rs

Lines changed: 8 additions & 1 deletion
@@ -23,7 +23,7 @@ use std::path::PathBuf;
2323
before_help = "\n\nBroadcast SSH - Parallel command execution across cluster nodes",
2424
about = "Broadcast SSH - SSH-compatible parallel command execution tool",
2525
long_about = "bssh is a high-performance SSH client with parallel execution capabilities.\nIt can be used as a drop-in replacement for SSH (single host) or as a powerful cluster management tool (multiple hosts).\n\nThe tool provides secure file transfer using SFTP and supports SSH keys, SSH agent, and password authentication.\nIt automatically detects Backend.AI multi-node session environments.\n\nOutput Modes:\n- TUI Mode (default): Interactive terminal UI with real-time monitoring (auto-enabled in terminals)\n- Stream Mode (--stream): Real-time output with [node] prefixes\n- File Mode (--output-dir): Save per-node output to timestamped files\n- Normal Mode: Traditional output after all nodes complete\n\nSSH Configuration Support:\n- Reads standard SSH config files (defaulting to ~/.ssh/config)\n- Supports Host patterns, HostName, User, Port, IdentityFile, StrictHostKeyChecking\n- ProxyJump, and many other SSH configuration directives\n- CLI arguments override SSH config values following SSH precedence rules",
26-
after_help = "EXAMPLES:\n SSH Mode:\n bssh user@host # Interactive shell\n bssh admin@server.com \"uptime\" # Execute command\n bssh -p 2222 -i ~/.ssh/key user@host # Custom port and key\n bssh -F ~/.ssh/myconfig webserver # Use custom SSH config\n\n Port Forwarding:\n bssh -L 8080:example.com:80 user@host # Local forward: localhost:8080 → example.com:80\n bssh -R 8080:localhost:80 user@host # Remote forward: remote:8080 → localhost:80\n bssh -D 1080 user@host # SOCKS5 proxy on localhost:1080\n bssh -L 3306:db:3306 -R 80:web:80 user@host # Multiple forwards\n bssh -D *:1080/4 user@host # SOCKS4 proxy on all interfaces\n\n Multi-Server Mode:\n bssh -C production \"systemctl status\" # Execute on cluster (TUI mode auto-enabled)\n bssh -H \"web1,web2,web3\" \"df -h\" # Execute on multiple hosts\n bssh -H \"web1,web2,web3\" -f \"web1\" \"df -h\" # Filter to web1 only\n bssh -C production -f \"web*\" \"uptime\" # Filter cluster nodes\n bssh --parallel 20 -H web* \"apt update\" # Increase parallelism\n\n Host Exclusion (--exclude):\n bssh -H \"node1,node2,node3\" --exclude \"node2\" \"uptime\" # Exclude single host\n bssh -C production --exclude \"web1,web2\" \"apt update\" # Exclude multiple hosts\n bssh -C production --exclude \"db*\" \"systemctl restart\" # Exclude with wildcard pattern\n bssh -C production --exclude \"*-backup\" \"df -h\" # Exclude backup nodes\n\n Output Modes:\n bssh -C prod \"apt-get update\" # TUI mode (default, interactive monitoring)\n bssh -C prod --stream \"tail -f log\" # Stream mode (real-time with [node] prefixes)\n bssh -C prod --output-dir ./logs \"ps\" # File mode (save to timestamped files)\n bssh -C prod \"uptime\" | tee log.txt # Normal mode (auto-detected when piped)\n\n Batch Mode (Ctrl+C Handling):\n bssh -C prod \"long-running-command\" # Default: first Ctrl+C shows status, second terminates\n bssh -C prod -b \"long-command\" # Batch mode: single Ctrl+C terminates immediately\n bssh -H nodes --batch --stream \"cmd\" # Useful for 
CI/CD and non-interactive scripts\n\n TUI Mode Controls (when in TUI):\n 1-9 Jump to node detail view\n s Enter split view (2-4 nodes)\n d Enter diff view (compare nodes)\n f Toggle auto-scroll\n ↑/↓ Scroll output\n ←/→ Switch nodes\n Esc Return to summary\n ? Show help\n q Quit\n\n File Operations:\n bssh -C staging upload file.txt /tmp/ # Upload to cluster\n bssh -H host1,host2 download /etc/hosts ./backups/\n\n Other Commands:\n bssh list # List configured clusters\n bssh -C production ping # Test connectivity\n bssh -H hosts interactive # Interactive mode\n\n SSH Config Example (~/.ssh/config):\n Host web*\n HostName web.example.com\n User webuser\n Port 2222\n IdentityFile ~/.ssh/web_key\n StrictHostKeyChecking yes\n\nDeveloped and maintained as part of the Backend.AI project.\nFor more information: https://github.com/lablup/bssh"
26+
after_help = "EXAMPLES:\n SSH Mode:\n bssh user@host # Interactive shell\n bssh admin@server.com \"uptime\" # Execute command\n bssh -p 2222 -i ~/.ssh/key user@host # Custom port and key\n bssh -F ~/.ssh/myconfig webserver # Use custom SSH config\n\n Port Forwarding:\n bssh -L 8080:example.com:80 user@host # Local forward: localhost:8080 → example.com:80\n bssh -R 8080:localhost:80 user@host # Remote forward: remote:8080 → localhost:80\n bssh -D 1080 user@host # SOCKS5 proxy on localhost:1080\n bssh -L 3306:db:3306 -R 80:web:80 user@host # Multiple forwards\n bssh -D *:1080/4 user@host # SOCKS4 proxy on all interfaces\n\n Multi-Server Mode:\n bssh -C production \"systemctl status\" # Execute on cluster (TUI mode auto-enabled)\n bssh -H \"web1,web2,web3\" \"df -h\" # Execute on multiple hosts\n bssh -H \"web1,web2,web3\" -f \"web1\" \"df -h\" # Filter to web1 only\n bssh -C production -f \"web*\" \"uptime\" # Filter cluster nodes\n bssh --parallel 20 -H web* \"apt update\" # Increase parallelism\n\n Host Exclusion (--exclude):\n bssh -H \"node1,node2,node3\" --exclude \"node2\" \"uptime\" # Exclude single host\n bssh -C production --exclude \"web1,web2\" \"apt update\" # Exclude multiple hosts\n bssh -C production --exclude \"db*\" \"systemctl restart\" # Exclude with wildcard pattern\n bssh -C production --exclude \"*-backup\" \"df -h\" # Exclude backup nodes\n\n Fail-Fast Mode (pdsh -k compatible):\n bssh -k -H \"web1,web2,web3\" \"deploy.sh\" # Stop on first failure\n bssh --fail-fast -C prod \"apt upgrade\" # Critical deployment - stop if any node fails\n bssh -k --require-all-success -C prod cmd # Fail-fast + require all success\n\n Output Modes:\n bssh -C prod \"apt-get update\" # TUI mode (default, interactive monitoring)\n bssh -C prod --stream \"tail -f log\" # Stream mode (real-time with [node] prefixes)\n bssh -C prod --output-dir ./logs \"ps\" # File mode (save to timestamped files)\n bssh -C prod \"uptime\" | tee log.txt # Normal mode (auto-detected 
when piped)\n\n Batch Mode (Ctrl+C Handling):\n bssh -C prod \"long-running-command\" # Default: first Ctrl+C shows status, second terminates\n bssh -C prod -b \"long-command\" # Batch mode: single Ctrl+C terminates immediately\n bssh -H nodes --batch --stream \"cmd\" # Useful for CI/CD and non-interactive scripts\n\n TUI Mode Controls (when in TUI):\n 1-9 Jump to node detail view\n s Enter split view (2-4 nodes)\n d Enter diff view (compare nodes)\n f Toggle auto-scroll\n ↑/↓ Scroll output\n ←/→ Switch nodes\n Esc Return to summary\n ? Show help\n q Quit\n\n File Operations:\n bssh -C staging upload file.txt /tmp/ # Upload to cluster\n bssh -H host1,host2 download /etc/hosts ./backups/\n\n Other Commands:\n bssh list # List configured clusters\n bssh -C production ping # Test connectivity\n bssh -H hosts interactive # Interactive mode\n\n SSH Config Example (~/.ssh/config):\n Host web*\n HostName web.example.com\n User webuser\n Port 2222\n IdentityFile ~/.ssh/web_key\n StrictHostKeyChecking yes\n\nDeveloped and maintained as part of the Backend.AI project.\nFor more information: https://github.com/lablup/bssh"
2727
)]
2828
pub struct Cli {
2929
/// SSH destination in format: [user@]hostname[:port] or ssh://[user@]hostname[:port]
@@ -196,6 +196,13 @@ pub struct Cli {
     )]
     pub check_all_nodes: bool,

+    #[arg(
+        short = 'k',
+        long = "fail-fast",
+        help = "Stop execution immediately on first failure (pdsh -k compatible)\nCancels pending commands when any node fails (connection error or non-zero exit)\nUseful for critical operations where partial execution is unacceptable"
+    )]
+    pub fail_fast: bool,
+
     #[arg(
         trailing_var_arg = true,
         help = "Command to execute on remote hosts",

src/commands/exec.rs

Lines changed: 3 additions & 1 deletion
@@ -46,6 +46,7 @@ pub struct ExecuteCommandParams<'a> {
     pub check_all_nodes: bool,
     pub sudo_password: Option<Arc<SudoPassword>>,
     pub batch: bool,
+    pub fail_fast: bool,
 }

 pub async fn execute_command(params: ExecuteCommandParams<'_>) -> Result<()> {
@@ -212,7 +213,8 @@ async fn execute_command_without_forwarding(params: ExecuteCommandParams<'_>) ->
         .with_connect_timeout(params.connect_timeout)
         .with_jump_hosts(params.jump_hosts.map(|s| s.to_string()))
         .with_sudo_password(params.sudo_password)
-        .with_batch_mode(params.batch);
+        .with_batch_mode(params.batch)
+        .with_fail_fast(params.fail_fast);

     // Set keychain usage if on macOS
     #[cfg(target_os = "macos")]
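
The `exec.rs` hunk above threads the new flag through a chained builder call. As a rough illustration of that pattern only, here is a minimal sketch; `ExecutorSketch` is a hypothetical stand-in, and the real executor type, its fields, and the bodies of `with_batch_mode` / `with_fail_fast` are not shown in this diff.

```rust
/// Hypothetical stand-in for the executor configured in `exec.rs`;
/// only the two flags visible in the hunk are modelled here.
#[derive(Debug, Default, Clone, PartialEq)]
struct ExecutorSketch {
    batch: bool,
    fail_fast: bool,
}

impl ExecutorSketch {
    fn new() -> Self {
        Self::default()
    }

    /// Consuming setter, so calls can be chained as in the diff.
    fn with_batch_mode(mut self, batch: bool) -> Self {
        self.batch = batch;
        self
    }

    /// Assumed to simply store the flag for the execution path to read later.
    fn with_fail_fast(mut self, fail_fast: bool) -> Self {
        self.fail_fast = fail_fast;
        self
    }
}
```

Because each setter takes and returns `self`, appending `.with_fail_fast(...)` to the chain only requires moving the terminating semicolon, which matches the one-line removal and two-line addition in the hunk.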
