Threading monkeypatch #1966

Draft
napetrov wants to merge 26 commits into uxlfoundation:master from napetrov:threading-monkeypatch

napetrov (Contributor) commented Feb 11, 2026

Description

Add TBB-based threading.Thread replacement for Python free-threading (NOGIL) builds.

Follow-up to #1965 (free-threading/NOGIL support).

This PR adds the ability to replace Python's threading.Thread with a TBB-based implementation (TBBThread) that executes work on TBB's work-stealing thread pool instead of creating new OS threads.

  1. pyperformance (single-threaded): Neutral — geometric mean 1.00x. Individual benchmarks vary ±30% (noise from TBB startup overhead).

  2. Micro-benchmarks (threading primitives): Mostly worse (8 of 15 slower). Notable: orderbook 20-28% worse across all thread counts. Only bright spot: TCP ping-pong +53% at 4 threads (I/O-bound).

  3. Real Python workloads (ThreadPoolExecutor): Neutral or worse, plus 2 critical deadlocks:

  • Nested ThreadPoolExecutor → deadlock at 4 threads
  • Producer-consumer with Queue → deadlock
  • Work-stealing didn't help even on uneven tasks (21% slower)
  • Dict-heavy workloads up to 1.9x slower

Conclusion

tbb.patch_threading() provides no advantage for Python-level parallelism in free-threaded CPython. CPython's native threads already scale well enough for pure Python; TBB's work-stealing is designed for fine-grained C/C++ tasks, not coarse Python bytecode execution. The monkey-patch adds overhead that negates any scheduling benefit.

Key Features

  • patch_threading() / unpatch_threading() — Global functions to replace threading.Thread with TBBThread
  • tbb_threading context manager — Scoped thread patching
  • -t / --threads CLI flag — Enable TBB threading via python -m tbb -t script.py
  • Monkey(threads=True) — Opt-in threading replacement in Monkey context manager
  • tbb_run_and_wait() / tbb_parallel_for() — Direct TBB dispatch helpers for batch parallel execution
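
The patch/unpatch mechanics listed above come down to swapping the `Thread` attribute of the `threading` module under a lock. The following is a minimal standard-library sketch of that idea, not the PR's code: `ReplacementThread`, `patch`, `unpatch`, `patched`, and `is_patched` are hypothetical names chosen for illustration.

```python
import contextlib
import threading

_original_thread = threading.Thread   # kept so unpatch() can restore it
_patch_lock = threading.Lock()        # serialises concurrent patch/unpatch calls


class ReplacementThread(_original_thread):
    """Hypothetical stand-in; a real replacement would dispatch to a pool."""


def is_patched():
    return threading.Thread is ReplacementThread


def patch():
    """Install the replacement; return True if this call changed anything."""
    with _patch_lock:
        if is_patched():
            return False
        threading.Thread = ReplacementThread
        return True


def unpatch():
    """Restore the class that was in place when this module was imported."""
    with _patch_lock:
        threading.Thread = _original_thread


@contextlib.contextmanager
def patched():
    """Scoped patching that restores the original class even on error."""
    changed = patch()
    try:
        yield
    finally:
        if changed:   # leave an outer, global patch in place
            unpatch()
```

Only lookups of `threading.Thread` made after patching see the replacement. Code that ran `from threading import Thread` beforehand keeps its reference to the original class, which is one reason scoped patching is easier to reason about than a global patch.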

How It Works

TBBThread is a drop-in replacement for threading.Thread that submits work to a shared TBB Pool (backed by task_group) rather than spawning OS threads. Intended benefits:

  • No OS thread creation overhead per task
  • Work-stealing scheduling across TBB worker pool
  • Better resource utilization for many short-lived threads
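
To make the dispatch model concrete, here is a rough sketch of a thread-like object that runs its target on a shared pool and uses an Event for join(). It assumes `concurrent.futures.ThreadPoolExecutor` as a stand-in for the TBB pool, and `PooledThread` is a hypothetical name; the sketch omits exception handling and most of the `threading.Thread` interface.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the shared TBB pool: a fixed set of reusable worker threads.
_pool = ThreadPoolExecutor(max_workers=4)


class PooledThread:
    """Thread-like object whose target runs on a shared pool worker."""

    def __init__(self, target=None, args=(), kwargs=None, name=None):
        self._target = target
        self._args = args
        self._kwargs = kwargs or {}
        self.name = name or "PooledThread"
        self._started = False
        self._start_lock = threading.Lock()
        self._done = threading.Event()   # set once the target has finished

    def _run(self):
        try:
            if self._target is not None:
                self._target(*self._args, **self._kwargs)
        finally:
            self._done.set()

    def start(self):
        # Check and submit under one lock so two callers cannot both start it.
        with self._start_lock:
            if self._started:
                raise RuntimeError("threads can only be started once")
            self._started = True
            _pool.submit(self._run)

    def join(self, timeout=None):
        self._done.wait(timeout)

    def is_alive(self):
        return self._started and not self._done.is_set()


results = []
threads = [PooledThread(target=results.append, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))   # [0, 1, 2, 3, 4, 5, 6, 7]
```

Eight logical threads share four workers here, so no OS thread is created per task. The trade-off is that a started "thread" may sit in the queue until a worker is free.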

Limitations (documented)

  • daemon property has no effect (TBB manages thread lifecycle)
  • native_id may be reused across TBBThread instances
  • Blocking operations (locks, heavy I/O) can exhaust the TBB worker pool
  • Exceptions from target function are re-raised in join() (differs from stdlib)
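
The pool-exhaustion limitation can be reproduced with the standard library alone. The sketch below uses a one-worker `ThreadPoolExecutor` as a stand-in for a saturated TBB pool (an assumption for illustration, not the PR's code) and bounds the wait with a timeout so that it terminates instead of hanging.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=1)   # deliberately tiny pool
ready = threading.Event()


def consumer():
    # Blocks its worker until the producer signals; bounded so the demo ends.
    return ready.wait(timeout=0.5)


def producer():
    ready.set()


# The consumer occupies the only worker, so the producer stays queued behind it.
consumer_future = pool.submit(consumer)
producer_future = pool.submit(producer)

signalled = consumer_future.result()
print("consumer saw signal:", signalled)   # False: the producer could not run in time
pool.shutdown()
```

With an unbounded `ready.wait()` the consumer would never return, because the task that could unblock it is waiting for the same worker. Two real OS threads would not have this problem, since each runs independently of the other.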

Changes

  • python/tbb/__init__.py — Consolidated TBBThread class with Event-based synchronization, patch_threading()/unpatch_threading() functions, tbb_threading context manager
  • python/tbb/api.i — Added tbb_run_and_wait and tbb_parallel_for SWIG wrappers
  • python/tbb/__main__.py — Added -t/--threads and --patch-threading CLI flags
  • python/README.md — Documentation and usage examples
  • Minimum Python version: 3.9+
  • Requires SWIG 4.4.0+ for free-threading builds

Fixes # - issue number(s) if exists

Type of change

  • new feature - change that adds functionality
  • documentation - documentation update

Tests

  • added - required for new features and some bug fixes
  • not needed

Documentation

  • updated in this PR
  • needs to be updated
  • not needed

Breaks backward compatibility

  • Yes
  • No
  • Unknown

Notify the following users

Other information

Benchmark results on Intel hardware included in commit 9e09e29. Full pyperformance suite testing in progress.

napetrov and others added 26 commits February 10, 2026 23:11
Enable Python 3.13+ free-threading support for TBB Python bindings.
When built with Python 3.13t (free-threading build), the module declares
Py_MOD_GIL_NOT_USED via SWIG's built-in SWIGPYTHON_NOGIL support.

Changes:
- setup.py.in: Add -DSWIGPYTHON_NOGIL compile flag for Python 3.13+
- api.i: Add documentation for thread-safety of PyCaller/ArenaPyCaller
- patch_nogil.py: Post-processing script for additional NOGIL patches
- NOGIL.md: Documentation for free-threading usage and building

The module properly acquires GIL when calling back into Python code
using SWIG_PYTHON_THREAD_BEGIN_BLOCK/END_BLOCK macros, ensuring safe
operation of TBB callbacks from worker threads.

Tested on Python 3.13.12 experimental free-threading build with
GIL disabled throughout TBB Pool operations.

Signed-off-by: Nikolay Petrov <[email protected]>

- Remove Py_BUILD_CORE_MODULE (reserved for CPython internals)
- Add proper destructor to ArenaPyCaller to prevent memory leaks
- Add copy constructor to ArenaPyCaller for safe TBB task copying
- Delete assignment operator to prevent double-free issues
- Remove unnecessary patch_nogil.py script (SWIG handles NOGIL natively)
- Simplify setup.py.in by removing custom build_ext class

The NOGIL support now relies entirely on SWIG's built-in support via
-DSWIGPYTHON_NOGIL compile flag, which is cleaner and more maintainable.

Addresses review feedback:
- ArenaPyCaller memory management issues
- Removal of Py_BUILD_CORE_MODULE
- Simplification of build process

Signed-off-by: Nikolay Petrov <[email protected]>

Documentation to be added separately after patch stabilization.

Signed-off-by: Nikolay Petrov <[email protected]>

- Use sysconfig.get_config_var('Py_GIL_DISABLED') to detect free-threaded
  Python instead of checking Python version >= 3.13
- Revert metadata changes (long_description, keywords, classifiers)
  to keep patch minimal

The -DSWIGPYTHON_NOGIL flag is now only added when building with
python3.13t (free-threaded), not regular python3.13.
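
A minimal sketch of this detection, using the standard `sysconfig` module. The helper names `is_free_threaded_build` and `extra_compile_args` are hypothetical and not taken from setup.py.in.

```python
import sysconfig


def is_free_threaded_build():
    """True when the running interpreter was built with the GIL disabled."""
    # The config variable is 1 on free-threaded builds and 0 or None otherwise.
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))


def extra_compile_args():
    """Compile flags for the extension (hypothetical helper)."""
    args = []
    if is_free_threaded_build():
        args.append("-DSWIGPYTHON_NOGIL")
    return args


print(extra_compile_args())
```

Checking the build configuration rather than the version number matters because a regular Python 3.13 and a free-threaded python3.13t report the same `sys.version_info`.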

Signed-off-by: Nikolay Petrov <[email protected]>

- Move NOGIL detection outside platform-specific blocks for Windows support
- Add PyErr_WriteUnraisable() to log exceptions from TBB worker threads
  instead of silently swallowing them

Signed-off-by: Nikolay Petrov <[email protected]>

Signed-off-by: Nikolay Petrov <[email protected]>

- Add GIL protection to PyCaller copy constructor and destructor
- Fix potential deadlock in _concurrency_barrier with predicate wait
- Remove empty %init block

PyCaller now explicitly acquires GIL for Py_XINCREF/XDECREF operations
to ensure thread-safety when instances are copied/destroyed on TBB
worker threads.

Signed-off-by: Nikolay Petrov <[email protected]>

This is a SEPARATE patch from the NOGIL support.

Adds tbb.threading_patch module that provides:
- TBBThread: drop-in replacement for threading.Thread using TBB pool
- patch_threading(): replace threading.Thread globally
- unpatch_threading(): restore original implementation
- tbb_threading(): context manager for temporary patching

Usage:
    from tbb.threading_patch import patch_threading
    patch_threading()
    # Now threading.Thread uses TBB thread pool

Key differences from system threads:
- Reuses TBB worker threads instead of creating new OS threads
- More efficient for many short-lived threads
- Work-stealing scheduler for better CPU utilization

Limitations:
- daemon property has no effect (TBB manages lifecycle)
- native_id may be reused across TBBThread instances

Direct task_group wrappers for parallel execution.
Benchmarks show 25-75% faster than system threads.

New functions:
- tbb_run_and_wait(callables): Run list of callables in parallel
- tbb_parallel_for(n, func): Run func(i) for i in range(n)
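
The semantics described for these two helpers can be sketched in pure Python. This is only an approximation under stated assumptions: it uses `concurrent.futures` instead of task_group, and `run_and_wait` and `parallel_for` are hypothetical stand-ins, not the SWIG wrappers themselves.

```python
from concurrent.futures import ThreadPoolExecutor, wait

# Stand-in scheduler; the PR's helpers dispatch to TBB's task_group instead.
_pool = ThreadPoolExecutor(max_workers=4)


def run_and_wait(callables):
    """Run every callable concurrently and return once all have finished."""
    futures = [_pool.submit(fn) for fn in callables]
    wait(futures)


def parallel_for(n, func):
    """Call func(i) for each i in range(n), concurrently, then return."""
    # The default argument binds the current i to each lambda.
    run_and_wait([lambda i=i: func(i) for i in range(n)])


squares = [0] * 8


def store_square(i):
    squares[i] = i * i


parallel_for(len(squares), store_square)
print(squares)   # [0, 1, 4, 9, 16, 25, 36, 49]
```

In this sketch an exception raised by a callable is captured in its future and never surfaces to the caller, so callers that need error reporting would have to inspect the futures themselves.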

These use task_group directly, avoiding Python Pool overhead.

Usage: python -m tbb -t script.py

When -t is specified:
- threading.Thread is replaced with TBBThread
- TBBThread uses task_group internally
- 70-78% faster for short-lived threads
- Same performance for long-running threads

TBBThread creates a task_group per thread and uses task_group.wait()
for proper join() semantics.

Intel-Bench (4 Intel CPUs) results:
- 20-57% faster for short-lived threads
- 7-30% faster for medium workloads
- Comparable performance for long-running threads

TBB thread reuse eliminates clone3() syscall overhead.

- Fix orphaned docstring in Monkey.__init__ (merge docstrings properly)
- Add thread-safe start() with _start_lock to prevent race condition
- Clarify TBBThread distinction: __init__.py has minimal version,
  threading_patch.py has full threading.Thread compatibility

Signed-off-by: Nikolay Petrov <[email protected]>

- Rename -t/--threads to -T/--patch-threading for clarity
- Document threading patch feature with benchmarks
- Add table showing flag combinations (--ipc, --patch-threading)
- Add programmatic usage examples (patch_threading, tbb_threading)
- Document limitations (daemon, native_id)
- Add section on Free-threading Python 3.13+ (NOGIL) support

The -T flag is uppercase to distinguish from potential future -t uses.
Using --patch-threading makes the monkey-patching behavior explicit.

Signed-off-by: Nikolay Petrov <[email protected]>

Benchmark results moved to README.md. Documentation consolidated.

Signed-off-by: Nikolay Petrov <[email protected]>

Free-threading (NOGIL) support requires SWIG 4.4.0+ which added the
-nogil flag and Py_MOD_GIL_NOT_USED support (July 2025).

Signed-off-by: Nikolay Petrov <[email protected]>

C++ fixes (api.i):
- Add explicit move constructor to PyCaller to prevent use-after-free
- Add move constructor to ArenaPyCaller with proper ownership transfer
- Add null checks in destructors before Py_XDECREF
- Delete move/copy assignment operators to prevent accidental misuse
- Add comment about exception handling in worker threads

Python fixes (threading_patch.py):
- Fix TOCTOU race: move pool submission inside start() lock
- Fix join() timeout handling: don't check exception if timeout expires
- Sanitize exceptions: strip traceback to prevent memory leaks and data exposure
- Add UserWarning on patch_threading() about limitations
- Add _warn parameter to suppress warning when appropriate
- Document exception re-raising behavior difference from stdlib
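
The join() behaviour described in this list (skip the exception check on timeout, strip the traceback, re-raise in the joining thread) might look roughly like the sketch below. It subclasses the standard `threading.Thread` for brevity; `RaisingThread` is a hypothetical name, not the PR's TBBThread.

```python
import threading


class RaisingThread(threading.Thread):
    """Thread whose join() re-raises an exception from the target (hypothetical)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._exc = None
        self._join_lock = threading.Lock()

    def run(self):
        try:
            super().run()
        except Exception as exc:   # KeyboardInterrupt/SystemExit pass through
            # Drop the traceback so frames and their locals are not kept alive.
            self._exc = exc.with_traceback(None)

    def join(self, timeout=None):
        super().join(timeout)
        if self.is_alive():
            return                 # timed out: the target is still running
        with self._join_lock:
            exc, self._exc = self._exc, None   # hand the exception over once
        if exc is not None:
            raise exc


def fail():
    raise ValueError("boom")


t = RaisingThread(target=fail)
t.start()
try:
    t.join()
except ValueError as exc:
    print("join() raised:", exc)   # join() raised: boom
```

The standard library instead reports such exceptions through `threading.excepthook` and lets join() return normally, so code written against the stdlib behaviour may not expect join() to raise.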

Documentation (README.md):
- Expand limitations section with worker pool, exception, security notes
- Recommend tbb_threading() context manager over global patch
- Add guidance for I/O-bound workloads

Addresses findings from security review:
- CWE-367 TOCTOU race condition
- CWE-209 Information exposure via exceptions
- Memory safety in move semantics

Signed-off-by: Nikolay Petrov <[email protected]>

Remove threading_patch.py and consolidate all threading functionality:
- Full TBBThread implementation now in __init__.py (was duplicated)
- patch_threading(), unpatch_threading(), tbb_threading() moved to __init__.py
- is_threading_patched() for checking patch state
- Update imports: 'from tbb import patch_threading' (was tbb.threading_patch)

This eliminates duplication between the minimal and full TBBThread
implementations, and removes the redundant 'python -m tbb.threading_patch'
entry point (use 'python -m tbb -T' instead).

Signed-off-by: Nikolay Petrov <[email protected]>

Python 3.5-3.8 are EOL. The threading patch uses typing features
and patterns that work best with Python 3.9+.

Signed-off-by: Nikolay Petrov <[email protected]>

- Add GIL guard to PyCaller(PyObject*, bool) constructor
- Remove unused 'import queue'
- Add _patch_lock for thread-safe patch/unpatch operations
- Change 'except BaseException' to 'except Exception' (allow KeyboardInterrupt/SystemExit)
- Add _join_lock to fix exception handling race in TBBThread.join()
- Keep initial=false in ArenaPyCaller::operator() (correct borrowing semantics)

… wrappers

- Pass initial=false to PyCaller in task_arena::enqueue(), task_arena::execute(),
  and task_group::run() to properly INCREF borrowed PyObject references from SWIG.
  Without this, PyCaller destructor would DECREF a reference that was never
  INCREFed, causing refcount underflow.

  This matches the existing correct pattern in ArenaPyCaller (line 168) which
  explicitly INCREFs then passes initial=false.

- Document exception behavior for tbb_run_and_wait() and tbb_parallel_for():
  exceptions in callbacks are logged via PyErr_WriteUnraisable, not propagated.

Add missing UXL Foundation copyright to python/tbb/__init__.py and
python/tbb/api.i to pass the copyright_check CI job.

- test_threading.py: unittest suite for patch lifecycle, TBBThread
  drop-in behavior, context manager, tbb_run_and_wait, tbb_parallel_for
- examples/tbb_threading_example.py: usage example showing three methods
  (patch_threading, tbb_threading context manager, tbb_parallel_for)

- Fix copyright headers (new files use UXL only, update Intel year to 2025)
- Move _thread import to module level (performance: avoid import in hot path)
- Add pool.join() in _shutdown_pool (prevent orphaned tasks)
- Fix test/example imports to use consolidated tbb package (not threading_patch)
- Remove misleading 'zero-overhead' claim from docstrings

…n doc

- Add atexit handler to join non-daemon TBBThreads and cleanup pool
- Track active TBBThreads for proper shutdown behavior
- Add TLS limitation to TBBThread docstring
- Add join timeout test
- Create THREADING_DESIGN.md documenting system vs TBB thread differences,
  current implementation approach, supported features, and known limitations

- Fix CLI flag typo: -t → -T in tbb_threading_example.py
- Add language tag to code block in THREADING_DESIGN.md (markdown lint)