Skip to content

Fix smck bug#2525

Open
jeromekelleher wants to merge 2 commits into
tskit-dev:mainfrom
jeromekelleher:fix-smck-bug
Open

Fix smck bug#2525
jeromekelleher wants to merge 2 commits into
tskit-dev:mainfrom
jeromekelleher:fix-smck-bug

Conversation

@jeromekelleher
Copy link
Copy Markdown
Member

Closes #2523

This was a bit more subtle than the fix in #2524 implied (which would probably have led to biases). The issue is the rng producing a value of exactly zero.

In msp_smc_k_common_ancestor_event, gsl_ran_flat occasionally draws
random_mass == 0 (probability ~1/2^32 per CA event). With random_mass
of zero, fenwick_find returns the first non-zero hull and
remaining_mass = cum_sum - 0 lands at exactly x_hull->count. The walk
loop with predicate `remaining_mass >= 0` then over-iterates by one
match, runs off the head of the AVL tree, and trips the
tsk_bug_assert at line 7694.

Clamp remaining_mass strictly below x_hull->count so the loop
terminates after exactly `count` matches. The fix is local; it does
not change the RNG draw count or pattern, so seeded simulations that
did not previously trip the bug produce identical trajectories.

Add a regression test (test_smc_k_bug_repro) that replays the
user-reported failure (50 diploid samples, seq_length 2e7, recomb
1e-8, demography change at t=30, seed 3940783591). Pre-fix the test
aborts after ~74821 events; post-fix it completes in about a second.
@codecov
Copy link
Copy Markdown

codecov Bot commented May 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.83%. Comparing base (1b14a8a) to head (577cd77).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2525   +/-   ##
=======================================
  Coverage   93.83%   93.83%           
=======================================
  Files          20       20           
  Lines       12104    12106    +2     
  Branches     2242     2243    +1     
=======================================
+ Hits        11358    11360    +2     
  Misses        567      567           
  Partials      179      179           
Flag Coverage Δ
C 84.11% <100.00%> (+<0.01%) ⬆️
c-python 72.89% <50.00%> (-0.01%) ⬇️
python-tests 98.58% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
Python API 98.61% <ø> (ø)
Python C interface 92.85% <ø> (ø)
C library 91.05% <100.00%> (+<0.01%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

The prior test_smc_k_bug_repro replayed the user-reported failing
simulation (50 diploid samples, seq_length 2e7, ~75k events) to hit
the bug after about one second. Replace it with
test_smc_k_zero_random_mass, which uses a custom GSL rng_type whose
get_double callback returns successive doubles from a fixed array.
With the script {0.5, 0.0} and a 2-sample SMCk simulation, the first
common-ancestor event has gsl_ran_flat return 0 deterministically,
making remaining_mass == x_hull->count exactly — the same trigger the
clamp at lib/msprime.c handles. Pre-clamp the test aborts at line
7694; post-clamp it completes in about a millisecond.

The scripted RNG type is defined as static helpers in test_ancestry.c
next to the test that uses it.

Add chained-event SMCk clamp test

test_smc_k_zero_random_mass triggers the clamp at lib/msprime.c on
the very first CA event of a fresh 2-sample simulation, only walking
the loop once. Add a sibling test_smc_k_zero_random_mass_chained that
runs one CA event first (coalescing one pair of 4 lineages, freeing
and reusing hull slots, rearranging AVL trees) and then drives
gsl_ran_flat to 0 on the second CA event. Pre-fix the second event
aborts at line 7694; post-fix both complete in about a millisecond.

The test covers the clamp working from a non-trivial post-event
state. We initially hoped to also drive the second event to a hull
with count > 1 via the fenwick zero-skip path, but fenwick_find
always stops at the first non-zero slot and any non-zero predecessor
blocks the route to a higher-count hull. The honest scope (clamp in
chained events, count == 1) is documented in the test comment.

Update CHANGELOG
@jeromekelleher
Copy link
Copy Markdown
Member Author

The change here seems quite straightforward, so I think we can issue a bugfix release once this is merged.

@molpopgen
Copy link
Copy Markdown
Member

That's quite surprising -- if the proper distribution is uniform on [0, x), why is 0 a problem?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SMCK simulation fails with boundary condition

2 participants