
Performance: cap failed actions retention period to three months by default#1310

Open
kalessil wants to merge 7 commits into trunk from performance/cap-failed-actions-retention-by-default

Conversation


@kalessil kalessil commented Mar 7, 2026

Partially addresses ACTSCH-15 #912.

The issue resurfaces in stats shared under p1772090743156589-slack-C08SVLULL3B. This PR limits the retention period for failed actions to three months to prevent failed actions from continuously increasing the A-S table size.

We are addressing A-S scalability issues in HVM caused by the accumulation of failed actions, which increases database table size and slows A-S performance. As the database is often the main bottleneck, slower A-S queries further impact performance, especially during BFCM.

Note: Action Scheduler allows customizing retention periods for action statuses, but these settings apply to all statuses collectively. Failed actions, however, need a longer retention period than other statuses to support troubleshooting while keeping the actions table at a manageable size. This PR addresses that need.

Testing Instructions:

  • Build the plugin from this branch and install it on your test system.
  • Modify a completed action by setting its status to failed and updating all date fields to twelve months in the past.
  • Navigate to the admin area.
  • Confirm that the action record has been deleted.
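
For step two, something like the following could be run (e.g. via WP-CLI's `wp eval-file`). This is an illustrative sketch only: the table and column names assume the default custom-tables store, and the action ID (123) is a placeholder.

```
<?php
// Sketch: flip an existing action to "failed" and backdate it by twelve months.
global $wpdb;

$backdated = gmdate( 'Y-m-d H:i:s', time() - 12 * MONTH_IN_SECONDS );

$wpdb->update(
    "{$wpdb->prefix}actionscheduler_actions",
    array(
        'status'               => 'failed',
        'scheduled_date_gmt'   => $backdated,
        'scheduled_date_local' => $backdated,
        'last_attempt_gmt'     => $backdated,
        'last_attempt_local'   => $backdated,
    ),
    array( 'action_id' => 123 ) // Placeholder ID: use a real completed action.
);
```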

@kalessil kalessil self-assigned this Mar 7, 2026
@kalessil kalessil mentioned this pull request Mar 8, 2026
5 tasks
@kalessil kalessil changed the title Performance: cap failed actions retention to three months by default Performance: cap failed actions retention period to three months by default Mar 8, 2026
@kalessil kalessil force-pushed the performance/cap-failed-actions-retention-by-default branch from b4b5d24 to ffb7a65 Compare March 8, 2026 14:35
@kalessil kalessil marked this pull request as ready for review March 9, 2026 07:58
$deleted_failed_entries = $this->clean_actions( array( ActionScheduler_Store::STATUS_FAILED ), $cutoff_failed, $batch_size );
$count_statuses = empty( $statuses_to_purge ) ? count( $this->default_statuses_to_purge ) : count( $statuses_to_purge );
$batch_size = (int) ( ( ( $count_statuses * $batch_size ) - count( $deleted_failed_entries ) ) / $count_statuses );
}
Contributor

The batch_size recalculation here can produce zero when $count_statuses is 1 (a single status via the action_scheduler_default_cleaner_statuses filter) and all $batch_size items are deleted. A per_page of 0 in query_actions() skips the LIMIT clause entirely, resulting in an unbounded query that returns all matching rows for deletion.

With the default 2 statuses this can't happen (minimum result is 10), but an early return when the budget is exhausted would be safer:

if ( $batch_size <= 0 ) {
    return $deleted_failed_entries;
}
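
To make the arithmetic concrete, here is a sketch with illustrative numbers, assuming a default batch size of 20 (which the "minimum result is 10" remark implies):

```
<?php
// Illustrative numbers only.
$batch_size = 20;

// Default case: two statuses, all 20 failed entries deleted.
$count_statuses = 2;
$deleted        = 20;
$remaining      = (int) ( ( ( $count_statuses * $batch_size ) - $deleted ) / $count_statuses );
// $remaining is 10, so other statuses still get a share of the budget.

// Single status via the action_scheduler_default_cleaner_statuses filter.
$count_statuses = 1;
$remaining      = (int) ( ( ( $count_statuses * $batch_size ) - $deleted ) / $count_statuses );
// $remaining is 0, which without an early return would skip the LIMIT clause.
```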

Author

Good defensive addition: added in af7d701

/**
* Default queue cleaner process used by queue runner.
*
* @since 3.9.4 by default, failed actions are removed after three months.
Contributor

This is a backward-incompatible change to documented behavior. docs/perf.md currently states: "By default, Action Scheduler does not automatically delete old failed actions." — and this PR directly reverses that guarantee.

The change is well-justified (failed action accumulation is a documented problem across P2s and merchant cases), but it would be good to:

  1. Update docs/perf.md to reflect the new default behavior.
  2. Document a clean opt-out path for sites that intentionally keep failed actions (e.g., for debugging, auditing, or compliance). Currently returning a very large value from action_scheduler_retention_period_for_failed is the only workaround, which isn't obvious.
  3. Consider a changelog/upgrade notice since this changes data retention defaults.
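
For item 2, the workaround mentioned would look roughly like this (sketch only; `YEAR_IN_SECONDS` is WordPress's core time constant):

```
<?php
// Sketch of the current, non-obvious opt-out: return a very large retention
// period so failed actions are effectively never purged.
add_filter( 'action_scheduler_retention_period_for_failed', function () {
    return 100 * YEAR_IN_SECONDS;
} );
```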

Author

I'll update the file to include a note about the BC break and instructions for replicating the original behavior. I have also added the corresponding filter for this case.

Author

Updating the docs in af7d701

@kalessil kalessil requested a review from prettyboymp March 9, 2026 16:15
Member

@jorgeatorres jorgeatorres left a comment

I left some comments for discussion, @kalessil, but nothing too serious. I really like the improvements! 💯

Comment on lines +69 to +77
$lifespan_default = apply_filters( 'action_scheduler_retention_period_by_default', $lifespan );

/**
* Set the retention period in seconds for actions with a failed status. If the action_scheduler_default_cleaner_statuses filter includes
* a failed status, this filter result will be ignored, and the retention period for failed actions will match that of other statuses.
*
* @param int $retention_period Retention period in seconds.
*/
$lifespan_failed = apply_filters( 'action_scheduler_retention_period_for_failed', 3 * $this->month_in_seconds );
Member

Probably excessive caution, and the existing $lifespan didn't consider this, but I wonder if we should pass these lifespans through absint() or at least abs() so there's no chance we end up with a cutoff date in the future.
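
A minimal sketch of the suggestion, applied to the failed-status lifespan from the quoted snippet (the same treatment would apply to the other lifespan variables):

```
<?php
// Sketch: clamp the filtered lifespan to a non-negative integer so a negative
// filter return value cannot push the cutoff date into the future.
$lifespan_failed = absint(
    apply_filters( 'action_scheduler_retention_period_for_failed', 3 * $this->month_in_seconds )
);
```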

Author

Do you mean only for this specific filter @jorgeatorres ?

Member

I think it should apply to all lifespan variables, unless I'm missing something. My concern is that a negative value could result in a date in the future, so we end up deleting more than necessary. Am I wrong?

Author

With the piggy-back style of cleaning as it is now, that's technically possible. I'll add extra defence against the cases you mentioned.

Author

Addressed in 24eab61

Comment on lines +112 to +118
$batch_size = (int) ( ( ( $count_statuses * $batch_size ) - count( $deleted_failed_entries ) ) / $count_statuses );
if ( $batch_size <= 0 ) {
return $deleted_failed_entries;
}
}

$deleted_entries = $this->clean_actions( $statuses_to_purge, $cutoff_default, $batch_size );
Member

While this is clever, I wonder if we should always let actions in other statuses be deleted. If there's a large backlog of failed actions, we will only be removing failed actions for a while. Maybe we can split the 'budget' since the beginning (even if we give more to failed) so that we always clean up at least some actions of the other statuses?

Author

Removing failed actions will use part of the execution budget, but at least half will remain available for cleaning other statuses.

$batch_size = (int) ( ( ( $count_statuses * $batch_size ) - count( $deleted_failed_entries ) ) / $count_statuses ); ensures that, even after removing failed entries, the remaining batch size is distributed evenly among the other statuses.

This approach ensures we remain within the same execution budget and that cleaning failed statuses does not prevent other statuses from being processed.

Author

@kalessil kalessil Mar 17, 2026

For improved execution budget management, please refer to #1312. In this case (here), I prefer a simpler approach.

Member

@jorgeatorres jorgeatorres Mar 17, 2026

Yes, that's true when the action_scheduler_default_cleaner_statuses filter returns at least two statuses (the default). But when there's just one, the total budget is $batch_size * 1 = $batch_size and the failed budget is also $batch_size, so we're left with no budget to process the remaining statuses, assuming there were at least $batch_size failed actions, which wouldn't be uncommon.

Author

Gotcha, I see your point. I'll look into handling this edge case better.

Author

Addressed in 5c4deed

Comment on lines +136 to +138
add_filter( 'action_scheduler_retention_period_for_failed', function( $retention_period ) {
return 12 * MONTH_IN_SECONDS;
} );
Member

I wonder if we could change the filter so that 0 actually prevents removal and we document it here as a way to "bring back" the old behavior. I prefer the new one, but might be a way to soften the backwards incompatible change.
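
A rough sketch of what that could look like in the cleaner (names follow the quoted snippets; this illustrates the proposal, not the merged code):

```
<?php
// Sketch: treat a retention period of 0 as "never delete failed actions",
// restoring the previous behaviour for sites that opt in explicitly.
$lifespan_failed = (int) apply_filters( 'action_scheduler_retention_period_for_failed', 3 * $this->month_in_seconds );

if ( 0 === $lifespan_failed ) {
    // Skip the failed-actions cleanup entirely; nothing was deleted.
    return array();
}
```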

Author

Sure, that leaves a loophole where A-S can be blamed for bloating tables, but it's a fair point to provide an opportunity to restore original behaviour. I'll add the necessary changes today.

Member

Yes, I don't think this is a blocker, but with the filter in place, anyone who needs the old behaviour can use it instead of resorting to very strange workarounds. And, in any case, it'd be a conscious choice.

Author

Addressed in 24eab61

@kalessil kalessil requested a review from jorgeatorres March 18, 2026 08:57