Skip to content

Extend Redis lock coverage#1455

Open
psilabs-dev wants to merge 9 commits intoDifegue:devfrom
psilabs-dev:bugfix/shinobulog
Open

Extend Redis lock coverage#1455
psilabs-dev wants to merge 9 commits intoDifegue:devfrom
psilabs-dev:bugfix/shinobulog

Conversation

@psilabs-dev
Copy link
Contributor

add_to_filemap is triggered on file events, which can frequently race against active file changes resulting in missing files during ID computation. If this happens, we don't really have an "error"-severity log.

This is one of the main invalid integration test case failure reasons, so tightening this condition will get rid of some noise. Or we can keep the logs if a file doesn't exist and demote that case to warning.

@psilabs-dev psilabs-dev marked this pull request as draft January 23, 2026 07:38
@psilabs-dev psilabs-dev changed the title Raise on compute_id only if file exists Extend Redis lock coverage Jan 24, 2026
@psilabs-dev psilabs-dev marked this pull request as ready for review January 24, 2026 06:25
@psilabs-dev
Copy link
Contributor Author

Well, this PR kind of went a different direction. Might as well just go all the way (but pinky swear this isn't another RotatingLog v2 lol)

The current exec_with_lock mechanism is limited to Mojo API but didn't really need to be, as locking applies to any "resource". So we can take out the "lock" mechanism and refactor it to exec_with_lock_pure, which handles all the locking logic instead, while doing a better job at it (e.g. redis connection resilience, lock acquisition rollback).

Multi-lock support is added. This is in anticipation of writes involving multiple resources, e.g. "add archive to category" may be a candidate for this, but that's something for another day. Also, the order should be deterministic.

Lock ownership with tokens is added. This means that even if a lock is acquired by a different worker, they still cannot release the lock as they don't "own" it. A sha strategy was chosen, but tbh we can probably get away with raw rands (though the latency of sha will be drowned by those of redis calls).

Optional TTL is added. This is for Shinobu, where I feel the job could take longer than 10s, especially with those category operations..

(exec_with_lock_pure isn't a good name, I'd rather have exec_with_lock be renamed to exec_with_lock_render/mojo, but cosmetic changes aside I'd rather not update more code than I need)

This exec_with_lock_pure covers few areas which touch archives: Shinobu at add_filemap , metadata plugin, and batch delete. The way current architecture is, we only need locking coverage of "API"-ish locations: Controller, Shinobu, Minion. However, manual upload isn't covered; there are quite a few differences between manual and API, and I don't think it's worth adding it.

The previous limitation with exec_with_lock not knowing what to do after TTL still exists. There's no convenient way to introduce TTL extensions, and the benefits for doing so aren't worth the trouble imo.

These changes should beat down these log assertion failures once and for all! Well of course, they would also make LRR more reliable and predictable to use...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant