
Tests(py): speed up local restore flows #10499

Draft
nmanovic wants to merge 1 commit into v3-pr4a-runtime-foundation from v3-pr4b-fast-restore

Conversation

@nmanovic
Contributor

Summary

This is the next stacked PR after v3-pr4a-runtime-foundation.

Scope is intentionally limited to faster and safer local test restore/reset behavior:

  • replace shell-based PostgreSQL restore with a persistent psycopg client
  • replace shell-based Redis restore with persistent Redis clients
  • drain active RQ jobs before DB restore so queued work cannot repopulate state after reset
  • move restore fixtures into an explicit pytest plugin module
  • keep kube behavior compatible by delegating kube restores through the existing legacy helpers

Out of scope:

  • profiles
  • dynamic ports / runtime URL rewriting
  • parallel runtime
  • kube runtime redesign
  • CI workflow changes

Why

The foundation PR introduced explicit local runtime ownership, but restore/reset was still paying shell/process startup cost on every operation and still relied on legacy wildcard fixture exports.

This PR keeps semantics the same while making repeated local reset paths materially cheaper and less flaky.

It also includes the local CVAT data restore optimization so repeated data resets do not repack/copy the baseline archive every time.

What changed

  • added tests/python/infra/db_restore.py
    • persistent psycopg-based DB restore using CREATE DATABASE ... WITH TEMPLATE ...
  • added tests/python/infra/redis_restore.py
    • persistent Redis client restore for in-memory and on-disk state
  • added tests/python/infra/rq_cleanup.py
    • drains active RQ jobs before DB restore
  • added tests/python/infra/fixtures.py
    • explicit restore fixtures for local and kube
  • updated tests/python/conftest.py
    • register infra.fixtures explicitly instead of re-exporting restore fixtures from the legacy module
  • updated local runtime wiring in tests/python/infra/instances/local_instance.py
    • use persistent restore helpers
    • close helper clients on runtime shutdown/recreate
  • updated restore-heavy auth/access-token tests to also reset Redis in-memory state
  • added restore helper dependencies to tests/python/requirements.txt

Measured speedup

Measured on the same local stack, same machine, 120 iterations each.

PostgreSQL restore

  • legacy shell path: 79.25 ms
  • new psycopg restore: 22.13 ms
  • speedup: 3.6x

Redis in-memory restore

  • legacy shell path: 55.94 ms
  • new Redis client restore: 0.35 ms
  • speedup: 161x

Redis on-disk restore

  • legacy shell path: 54.11 ms
  • new Redis client restore: 0.19 ms
  • speedup: 289x

CVAT data restore

  • legacy repeated copy/extract path: 102.35 ms
  • cached restore path: 68.51 ms
  • speedup: 1.5x

Validation

Static checks:

  • py_compile
  • black --check
  • isort --check --diff --resolve-all-configs
  • pylint
  • typos
  • remark

Targeted restore-heavy tests:

  • pytest tests/python/rest_api/test_auth.py tests/python/rest_api/test_access_tokens.py --platform=local --infra=reuse
  • result: 55 passed

Full Python suite:

  • pytest tests/python --platform=local --infra=reuse
  • result: 2178 passed, 20 skipped, 1 xfailed

CI-like Django unit tests:

  • docker compose -f docker-compose.yml -f docker-compose.dev.yml -f docker-compose.ci.yml run ... cvat_ci python manage.py test cvat/apps -v 2
  • result: 412 tests, OK (skipped=58)

Kube compatibility smoke:

  • deployed the current chart to minikube using the same deploy-first assumption as CI
  • ran:
    • pytest tests/python/rest_api/test_auth.py::TestBasicAuth::test_can_use_basic_auth --platform=kube -q
  • result: 1 passed

Review notes

  • This PR is local-restore focused only.
  • Kube restore logic is intentionally not redesigned here; kube compatibility is preserved through the existing legacy restore helpers.
  • Profiles, dynamic runtime URLs, and parallelism are deferred to later PRs to keep this increment reviewable.

@nmanovic nmanovic requested a review from zhiltsov-max as a code owner April 19, 2026 09:25

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b8f310d4c9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +55 to +59
"SELECT pg_terminate_backend(pg_stat_activity.pid) "
"FROM pg_stat_activity "
"WHERE pg_stat_activity.datname = %s "
"AND pid <> pg_backend_pid()",
(target,),


P1: Terminate source DB sessions before templated restore

CREATE DATABASE ... WITH TEMPLATE requires the template database to have no active sessions, but this code only terminates connections to target_db. During local startup we call restore_from_template(source_db="cvat", target_db="test_db"), and cvat can still have live connections from server/worker processes, which makes restore fail with “source database is being accessed by other users.” The legacy restore.sql explicitly terminated :from when cloning from cvat, so this is a regression in restore reliability.
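A minimal sketch of the fix suggested here: terminate sessions on both the template (source) and the target database before the templated CREATE, since either having active sessions makes it fail. The helper name is hypothetical, and `conn` is assumed to be any psycopg-style connection exposing `execute()`:

```python
_TERMINATE_SQL = (
    "SELECT pg_terminate_backend(pid) "
    "FROM pg_stat_activity "
    # Match any of the given databases, never our own backend.
    "WHERE datname = ANY(%s) AND pid <> pg_backend_pid()"
)

def terminate_sessions(conn, databases):
    """Kick every other backend connected to any of the given databases."""
    conn.execute(_TERMINATE_SQL, (list(databases),))
```

Calling `terminate_sessions(conn, [source_db, target_db])` immediately before the templated restore mirrors what the legacy restore.sql did when cloning from cvat.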


Comment on lines +75 to +77
if len(keys_to_delete) >= 1000:
    self._inmem_db0.delete(*keys_to_delete)
    keys_to_delete.clear()


P2: Avoid deleting keys while SCAN iteration is in progress

This loop mutates DB0 (delete) before SCAN has completed, but Redis only guarantees full key coverage for SCAN when the keyspace is not changing during iteration. With large keysets (where the >= 1000 branch is hit), some keys can be skipped and survive the restore, leaking Redis state between tests. The previous local restore path used a single key listing + delete pass, so this introduces a completeness risk in reset behavior.
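One way to close this coverage gap is to finish the SCAN before issuing any deletes. A sketch, assuming a redis-py-style client exposing `scan_iter` and `delete` (the function name is illustrative):

```python
def delete_all_keys(client, batch_size=1000):
    """Delete every key, but only after the SCAN cursor has completed.

    Redis guarantees full keyspace coverage for SCAN only when the keyspace
    is stable during iteration, so materialize the key list first and delete
    in batches afterwards.
    """
    keys = list(client.scan_iter(count=batch_size))
    for start in range(0, len(keys), batch_size):
        client.delete(*keys[start : start + batch_size])
    return len(keys)
```

The extra memory cost of holding all key names is negligible for test databases, and batching keeps individual DELETE commands bounded.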


@codecov

codecov Bot commented Apr 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@nmanovic nmanovic force-pushed the v3-pr4a-runtime-foundation branch from 3bc10fb to f2c8e67 Compare April 21, 2026 07:17
@nmanovic nmanovic force-pushed the v3-pr4b-fast-restore branch from b8f310d to fec0185 Compare April 21, 2026 07:17

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fec018501a


Comment on lines +60 to +64
    queue.scheduled_job_registry.remove(job, delete_job=False)
elif status == JobStatus.DEFERRED:
    queue.deferred_job_registry.remove(job, delete_job=False)

job.cancel(enqueue_dependents=False)


P2: Preserve periodic scheduled jobs during queue drain

The drain logic unconditionally removes and cancels every scheduled/deferred job in the selected queues, and local restores call it for _BACKGROUND_JOB_QUEUES (which includes cleaning). In this repo, cleaning is also where PERIODIC_RQ_JOBS are scheduled, so the first DB reset cancels cron jobs like clean_up_sessions and they are not re-created until containers restart. That changes runtime behavior for the rest of the test session and can hide bugs in flows that rely on periodic RQ tasks.
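A simple shape for the suggested fix is to partition jobs before draining, so periodic/cron jobs survive the reset. Everything here is hypothetical: `periodic_job_ids` is assumed to be derived from the repo's PERIODIC_RQ_JOBS configuration, and the helper name is illustrative:

```python
def partition_jobs(job_ids, periodic_job_ids):
    """Split job ids into (to_cancel, preserved).

    Jobs whose ids belong to periodic schedules are preserved so a DB reset
    does not permanently cancel cron jobs like session cleanup; everything
    else is drained as before.
    """
    periodic = set(periodic_job_ids)
    to_cancel = [job_id for job_id in job_ids if job_id not in periodic]
    preserved = [job_id for job_id in job_ids if job_id in periodic]
    return to_cancel, preserved
```

The drain loop would then only remove and cancel the `to_cancel` partition, leaving the scheduled-job registry entries for periodic work untouched.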


@nmanovic nmanovic force-pushed the v3-pr4b-fast-restore branch from fec0185 to 6719c1e Compare April 21, 2026 12:47
@nmanovic nmanovic force-pushed the v3-pr4a-runtime-foundation branch from f2c8e67 to 0a878d3 Compare April 21, 2026 12:47
@nmanovic nmanovic marked this pull request as draft April 21, 2026 12:48
@nmanovic nmanovic force-pushed the v3-pr4b-fast-restore branch from 6719c1e to 943f5b3 Compare April 21, 2026 12:51
@nmanovic nmanovic force-pushed the v3-pr4a-runtime-foundation branch from 0a878d3 to ce7a4fe Compare April 21, 2026 12:51
@nmanovic nmanovic force-pushed the v3-pr4b-fast-restore branch from 943f5b3 to 86c7dba Compare April 21, 2026 15:30
@nmanovic nmanovic force-pushed the v3-pr4a-runtime-foundation branch from ce7a4fe to 29cef23 Compare April 21, 2026 15:30
@nmanovic nmanovic force-pushed the v3-pr4b-fast-restore branch from 86c7dba to 753f499 Compare April 21, 2026 16:06
@nmanovic nmanovic force-pushed the v3-pr4a-runtime-foundation branch from 29cef23 to 360c5ac Compare April 21, 2026 16:06