Performance: cap failed actions retention period to three months by default#1310
Conversation
```php
$deleted_failed_entries = $this->clean_actions( array( ActionScheduler_Store::STATUS_FAILED ), $cutoff_failed, $batch_size );
$count_statuses         = empty( $statuses_to_purge ) ? count( $this->default_statuses_to_purge ) : count( $statuses_to_purge );
$batch_size             = (int) ( ( ( $count_statuses * $batch_size ) - count( $deleted_failed_entries ) ) / $count_statuses );
}
```
The `$batch_size` recalculation here can produce zero when `$count_statuses` is 1 (a single status via the `action_scheduler_default_cleaner_statuses` filter) and all `$batch_size` items are deleted. A `per_page` of 0 in `query_actions()` skips the `LIMIT` clause entirely, resulting in an unbounded query that returns all matching rows for deletion.
With the default 2 statuses this can't happen (the minimum result is 10), but an early return when the budget is exhausted would be safer:
```php
if ( $batch_size <= 0 ) {
	return $deleted_failed_entries;
}
```

```php
/**
 * Default queue cleaner process used by queue runner.
 *
 * @since 3.9.4 By default, failed actions are removed after three months.
 */
```
This is a backward-incompatible change to documented behavior. `docs/perf.md` currently states: "By default, Action Scheduler does not automatically delete old failed actions." This PR directly reverses that guarantee.
The change is well-justified (failed-action accumulation is a documented problem across P2s and merchant cases), but it would be good to:
- Update `docs/perf.md` to reflect the new default behavior.
- Document a clean opt-out path for sites that intentionally keep failed actions (e.g., for debugging, auditing, or compliance). Currently, returning a very large value from `action_scheduler_retention_period_for_failed` is the only workaround, which isn't obvious.
- Consider a changelog/upgrade notice, since this changes data retention defaults.
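As a sketch of that non-obvious opt-out (assuming the filter name and signature introduced in this PR), a site that intentionally keeps failed actions could register something like:

```php
// Hypothetical opt-out sketch: effectively disable cleanup of failed actions
// by returning a very large retention period (here, ~100 years).
// Assumes WordPress' YEAR_IN_SECONDS constant and the new filter from this PR.
add_filter( 'action_scheduler_retention_period_for_failed', function ( $retention_period ) {
	return 100 * YEAR_IN_SECONDS;
} );
```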
I'll update the file to include a note about the BC break and instructions for replicating the original behavior. I have also added the corresponding filter for this case.
jorgeatorres left a comment:
I left some comments for discussion, @kalessil, but nothing too serious. I really like the improvements! 💯
```php
$lifespan_default = apply_filters( 'action_scheduler_retention_period_by_default', $lifespan );

/**
 * Set the retention period in seconds for actions with a failed status. If the
 * action_scheduler_default_cleaner_statuses filter includes the failed status,
 * this filter's result will be ignored, and the retention period for failed
 * actions will match that of the other statuses.
 *
 * @param int $retention_period Retention period in seconds.
 */
$lifespan_failed = apply_filters( 'action_scheduler_retention_period_for_failed', 3 * $this->month_in_seconds );
```
This is probably excessive caution, and the existing `$lifespan` didn't consider this, but I wonder if we should pass these lifespans through `absint()`, or at least `abs()`, so there's no chance we end up with a cutoff date in the future.
Do you mean only for this specific filter, @jorgeatorres?
I think it should apply to all lifespan variables, unless I'm missing something. My concern is that a negative value could result in a date in the future, so we end up deleting more than necessary. Am I wrong?
With the piggy-back style of cleaning as it is now, that's technically possible. I'll go ahead with extra defence against the mentioned cases.
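A minimal sketch of that extra defence, reusing the lifespan variables from the diff above (clamping the filtered values to non-negative integers is the only change):

```php
// Sketch only: clamp filtered retention periods to non-negative integers so a
// negative filter return value can never produce a cutoff date in the future.
// absint() is WordPress' absolute-integer helper.
$lifespan_default = absint( apply_filters( 'action_scheduler_retention_period_by_default', $lifespan ) );
$lifespan_failed  = absint( apply_filters( 'action_scheduler_retention_period_for_failed', 3 * $this->month_in_seconds ) );
```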
```php
$batch_size = (int) ( ( ( $count_statuses * $batch_size ) - count( $deleted_failed_entries ) ) / $count_statuses );
if ( $batch_size <= 0 ) {
	return $deleted_failed_entries;
}
}

$deleted_entries = $this->clean_actions( $statuses_to_purge, $cutoff_default, $batch_size );
```
While this is clever, I wonder if we should always let actions in other statuses be deleted. If there's a large backlog of failed actions, we would only be removing failed actions for a while. Maybe we can split the 'budget' from the start (even if we give more to failed) so that we always clean up at least some actions of the other statuses?
Removing failed actions will use part of the execution budget, but at least half will remain available for cleaning other statuses.
The recalculation `$batch_size = (int) ( ( ( $count_statuses * $batch_size ) - count( $deleted_failed_entries ) ) / $count_statuses );` ensures that, even after removing failed entries, the remaining batch size is distributed evenly among the other statuses.
This way we remain within the same execution budget, and cleaning failed statuses does not prevent other statuses from being processed.
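The split can be illustrated with concrete numbers (assuming a per-status batch size of 20, which matches the "minimum result is 10" observation earlier in this thread):

```php
// Hypothetical walk-through of the budget split with the default two statuses.
$batch_size     = 20; // assumed default per-status batch size
$count_statuses = 2;  // the default number of statuses to purge

// Failed cleanup runs first and may consume up to one full batch.
$deleted_failed = 20;

// The total budget is $count_statuses * $batch_size = 40. After failed cleanup,
// the remainder is split evenly across the statuses still to be processed:
$remaining = (int) ( ( ( $count_statuses * $batch_size ) - $deleted_failed ) / $count_statuses );
// $remaining is 10, so each remaining status still gets a bounded LIMIT.
```

With a single status, the same arithmetic yields `(20 - 20) / 1 = 0`, which is the edge case discussed below.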
For improved execution budget management, please refer to #1312. Here, I prefer a simpler approach.
Yes, that's true when there are at least two statuses returned by the `action_scheduler_default_cleaner_statuses` filter (which is the default). But when there's just one, the total budget is `$batch_size * 1 = $batch_size` and the failed budget is also `$batch_size`, so we're left with no budget for the remaining statuses whenever there are at least `$batch_size` failed actions, which wouldn't be uncommon.
Gotcha, I see your point. I'll look into handling this edge case better.
```php
add_filter( 'action_scheduler_retention_period_for_failed', function ( $retention_period ) {
	return 12 * MONTH_IN_SECONDS;
} );
```
I wonder if we could change the filter so that returning `0` actually prevents removal, and document that here as a way to bring back the old behavior. I prefer the new one, but it might be a way to soften the backwards-incompatible change.
Sure, that leaves a loophole where A-S can be blamed for bloating tables, but it's a fair point to provide a way to restore the original behaviour. I'll add the necessary changes today.
Yes, I don't think this is a blocker, but the filter exists so if for some reason someone needs to do this they can use it instead of having to do very strange things. And, in any case, it'd be a conscious choice.
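One way the cleaner could honour `0` as an explicit opt-out (a hypothetical sketch; variable names taken from the diffs above, and the `' seconds ago'` cutoff construction mirrors the existing cleaner code):

```php
$lifespan_failed = (int) apply_filters( 'action_scheduler_retention_period_for_failed', 3 * $this->month_in_seconds );

// Hypothetical handling: treat 0 as "never delete failed actions", restoring
// the pre-change behaviour as a conscious, documented choice.
if ( 0 === $lifespan_failed ) {
	$deleted_failed_entries = array();
} else {
	$cutoff_failed          = as_get_datetime_object( $lifespan_failed . ' seconds ago' );
	$deleted_failed_entries = $this->clean_actions( array( ActionScheduler_Store::STATUS_FAILED ), $cutoff_failed, $batch_size );
}
```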
Partially addresses ACTSCH-15 #912.
The issue resurfaces in stats shared under p1772090743156589-slack-C08SVLULL3B. This PR limits the retention period for failed actions to three months to prevent failed actions from continuously increasing the A-S table size.
We are addressing A-S scalability issues in HVM caused by the accumulation of failed actions, which increases database table size and slows A-S performance. As the database is often the main bottleneck, slower A-S queries further impact performance, especially during BFCM.
Testing Instructions: