Fix smck bug by jeromekelleher · Pull Request #2525 · tskit-dev/msprime

jeromekelleher · 2026-05-08T12:15:15Z

This was a bit more subtle than the fix in #2524 implied (which would probably have led to biases). The issue is the rng producing a value of exactly zero.

In msp_smc_k_common_ancestor_event, gsl_ran_flat occasionally draws random_mass == 0 (probability ~1/2^32 per CA event). With random_mass of zero, fenwick_find returns the first non-zero hull and remaining_mass = cum_sum - 0 lands at exactly x_hull->count. The walk loop with predicate `remaining_mass >= 0` then over-iterates by one match, runs off the head of the AVL tree, and trips the tsk_bug_assert at line 7694. Clamp remaining_mass strictly below x_hull->count so the loop terminates after exactly `count` matches. The fix is local; it does not change the RNG draw count or pattern, so seeded simulations that did not previously trip the bug produce identical trajectories. Add a regression test (test_smc_k_bug_repro) that replays the user-reported failure (50 diploid samples, seq_length 2e7, recomb 1e-8, demography change at t=30, seed 3940783591). Pre-fix the test aborts after ~74821 events; post-fix it completes in about a second.

codecov · 2026-05-08T12:23:54Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.83%. Comparing base (1b14a8a) to head (577cd77).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2525   +/-   ##
=======================================
  Coverage   93.83%   93.83%           
=======================================
  Files          20       20           
  Lines       12104    12106    +2     
  Branches     2242     2243    +1     
=======================================
+ Hits        11358    11360    +2     
  Misses        567      567           
  Partials      179      179

Flag	Coverage Δ
C	`84.11% <100.00%> (+<0.01%)`	⬆️
c-python	`72.89% <50.00%> (-0.01%)`	⬇️
python-tests	`98.58% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
Python API	`98.61% <ø> (ø)`
Python C interface	`92.85% <ø> (ø)`
C library	`91.05% <100.00%> (+<0.01%)`	⬆️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

The prior test_smc_k_bug_repro replayed the user-reported failing simulation (50 diploid samples, seq_length 2e7, ~75k events) to hit the bug after about one second. Replace it with test_smc_k_zero_random_mass, which uses a custom GSL rng_type whose get_double callback returns successive doubles from a fixed array. With the script {0.5, 0.0} and a 2-sample SMCk simulation, the first common-ancestor event has gsl_ran_flat return 0 deterministically, making remaining_mass == x_hull->count exactly — the same trigger the clamp at lib/msprime.c handles. Pre-clamp the test aborts at line 7694; post-clamp it completes in about a millisecond. The scripted RNG type is defined as static helpers in test_ancestry.c next to the test that uses it. Add chained-event SMCk clamp test test_smc_k_zero_random_mass triggers the clamp at lib/msprime.c on the very first CA event of a fresh 2-sample simulation, only walking the loop once. Add a sibling test_smc_k_zero_random_mass_chained that runs one CA event first (coalescing one pair of 4 lineages, freeing and reusing hull slots, rearranging AVL trees) and then drives gsl_ran_flat to 0 on the second CA event. Pre-fix the second event aborts at line 7694; post-fix both complete in about a millisecond. The test covers the clamp working from a non-trivial post-event state. We initially hoped to also drive the second event to a hull with count > 1 via the fenwick zero-skip path, but fenwick_find always stops at the first non-zero slot and any non-zero predecessor blocks the route to a higher-count hull. The honest scope (clamp in chained events, count == 1) is documented in the test comment. Update CHANGELOG

jeromekelleher · 2026-05-08T13:22:58Z

The change here seems quite straightforward, so I think we can issue a bugfix release once this is merged.

molpopgen · 2026-05-08T14:04:02Z

That's quite surprising -- if the proper distribution is uniform on [0, x), why is 0 a problem?

jeromekelleher force-pushed the fix-smck-bug branch from 323edf3 to d50953d Compare May 8, 2026 13:19

jeromekelleher mentioned this pull request May 8, 2026

Fix boundary condition in SMCK #2524

Closed

jeromekelleher force-pushed the fix-smck-bug branch from d50953d to 577cd77 Compare May 8, 2026 13:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix smck bug#2525

Fix smck bug#2525
jeromekelleher wants to merge 2 commits into
tskit-dev:mainfrom
jeromekelleher:fix-smck-bug

jeromekelleher commented May 8, 2026

Uh oh!

codecov Bot commented May 8, 2026 •

edited

Loading

Uh oh!

jeromekelleher commented May 8, 2026

Uh oh!

molpopgen commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jeromekelleher commented May 8, 2026

Uh oh!

codecov Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jeromekelleher commented May 8, 2026

Uh oh!

molpopgen commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented May 8, 2026 •

edited

Loading