Draft
Enable Python 3.13+ free-threading support for TBB Python bindings.
When built with Python 3.13t (free-threading build), the module declares Py_MOD_GIL_NOT_USED via SWIG's built-in SWIGPYTHON_NOGIL support.
Changes:
- setup.py.in: Add -DSWIGPYTHON_NOGIL compile flag for Python 3.13+
- api.i: Add documentation for thread-safety of PyCaller/ArenaPyCaller
- patch_nogil.py: Post-processing script for additional NOGIL patches
- NOGIL.md: Documentation for free-threading usage and building
The module properly acquires GIL when calling back into Python code using SWIG_PYTHON_THREAD_BEGIN_BLOCK/END_BLOCK macros, ensuring safe operation of TBB callbacks from worker threads.
Tested on Python 3.13.12 experimental free-threading build with GIL disabled throughout TBB Pool operations.
Signed-off-by: Nikolay Petrov <[email protected]>
- Remove Py_BUILD_CORE_MODULE (reserved for CPython internals)
- Add proper destructor to ArenaPyCaller to prevent memory leaks
- Add copy constructor to ArenaPyCaller for safe TBB task copying
- Delete assignment operator to prevent double-free issues
- Remove unnecessary patch_nogil.py script (SWIG handles NOGIL natively)
- Simplify setup.py.in by removing custom build_ext class
The NOGIL support now relies entirely on SWIG's built-in support via -DSWIGPYTHON_NOGIL compile flag, which is cleaner and more maintainable.
Addresses review feedback:
- ArenaPyCaller memory management issues
- Removal of Py_BUILD_CORE_MODULE
- Simplification of build process
Signed-off-by: Nikolay Petrov <[email protected]>
Documentation to be added separately after patch stabilization. Signed-off-by: Nikolay Petrov <[email protected]>
- Use sysconfig.get_config_var('Py_GIL_DISABLED') to detect free-threaded
Python instead of checking Python version >= 3.13
- Revert metadata changes (long_description, keywords, classifiers)
to keep patch minimal
The -DSWIGPYTHON_NOGIL flag is now only added when building with
python3.13t (free-threaded), not regular python3.13.
Signed-off-by: Nikolay Petrov <[email protected]>
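The detection described in this commit can be sketched as follows. This is a minimal illustration, not the contents of setup.py.in: `is_free_threaded_build` and `nogil_compile_args` are hypothetical helper names, and only `sysconfig.get_config_var('Py_GIL_DISABLED')` and the `-DSWIGPYTHON_NOGIL` flag come from the commit message.

```python
import sysconfig


def is_free_threaded_build() -> bool:
    """Return True when the running interpreter was built with the GIL disabled."""
    # Py_GIL_DISABLED is 1 on free-threaded builds (python3.13t) and
    # 0 or None on regular builds, so bool() covers every case.
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))


def nogil_compile_args(free_threaded: bool) -> list[str]:
    """Hypothetical helper: extra compiler flags for the SWIG-generated module."""
    return ["-DSWIGPYTHON_NOGIL"] if free_threaded else []


# Decided once, outside any platform-specific branch.
extra_compile_args = nogil_compile_args(is_free_threaded_build())
```

Checking the build configuration rather than the version number is what keeps the flag off a regular python3.13 build.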
- Move NOGIL detection outside platform-specific blocks for Windows support
- Add PyErr_WriteUnraisable() to log exceptions from TBB worker threads instead of silently swallowing them
Signed-off-by: Nikolay Petrov <[email protected]>
Signed-off-by: Nikolay Petrov <[email protected]>
- Add GIL protection to PyCaller copy constructor and destructor
- Fix potential deadlock in _concurrency_barrier with predicate wait
- Remove empty %init block
PyCaller now explicitly acquires GIL for Py_XINCREF/XDECREF operations to ensure thread-safety when instances are copied/destroyed on TBB worker threads.
Signed-off-by: Nikolay Petrov <[email protected]>
This is a SEPARATE patch from the NOGIL support.
Adds tbb.threading_patch module that provides:
- TBBThread: drop-in replacement for threading.Thread using TBB pool
- patch_threading(): replace threading.Thread globally
- unpatch_threading(): restore original implementation
- tbb_threading(): context manager for temporary patching
Usage:
    from tbb.threading_patch import patch_threading
    patch_threading()
    # Now threading.Thread uses TBB thread pool
Key differences from system threads:
- Reuses TBB worker threads instead of creating new OS threads
- More efficient for many short-lived threads
- Work-stealing scheduler for better CPU utilization
Limitations:
- daemon property has no effect (TBB manages lifecycle)
- native_id may be reused across TBBThread instances
Direct task_group wrappers for parallel execution. Benchmarks show 25-75% faster than system threads.
New functions:
- tbb_run_and_wait(callables): Run list of callables in parallel
- tbb_parallel_for(n, func): Run func(i) for i in range(n)
These use task_group directly, avoiding Python Pool overhead.
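The intended semantics of the two helpers can be illustrated with a pure-Python reference. This sketch assumes only what the commit message states (run callables in parallel and wait; call func(i) for each i in range(n)). The names `run_and_wait_reference` and `parallel_for_reference` are hypothetical, and a ThreadPoolExecutor stands in for task_group, so it says nothing about the performance of the real wrappers.

```python
from concurrent.futures import ThreadPoolExecutor


def run_and_wait_reference(callables):
    """Run every callable and return only after all of them have finished."""
    with ThreadPoolExecutor() as pool:
        for fn in callables:
            pool.submit(fn)
    # Leaving the with-block waits for all submitted work. Exceptions stay
    # inside the discarded futures and are not re-raised to the caller.


def parallel_for_reference(n, func):
    """Call func(i) for each i in range(n), in no guaranteed order."""
    # The default argument binds the current value of i to each lambda.
    run_and_wait_reference([lambda i=i: func(i) for i in range(n)])


squares = [0] * 8
parallel_for_reference(8, lambda i: squares.__setitem__(i, i * i))
```

Each index writes to its own slot, so the result does not depend on scheduling order.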
Usage: python -m tbb -t script.py
When -t is specified:
- threading.Thread is replaced with TBBThread
- TBBThread uses task_group internally
- 70-78% faster for short-lived threads
- Same performance for long-running threads
TBBThread creates a task_group per thread and uses task_group.wait() for proper join() semantics.
Intel-Bench (4 Intel CPUs) results:
- 20-57% faster for short-lived threads
- 7-30% faster for medium workloads
- Comparable performance for long-running threads
TBB thread reuse eliminates clone3() syscall overhead.
- Fix orphaned docstring in Monkey.__init__ (merge docstrings properly)
- Add thread-safe start() with _start_lock to prevent race condition
- Clarify TBBThread distinction: __init__.py has minimal version, threading_patch.py has full threading.Thread compatibility
Signed-off-by: Nikolay Petrov <[email protected]>
- Rename -t/--threads to -T/--patch-threading for clarity
- Document threading patch feature with benchmarks
- Add table showing flag combinations (--ipc, --patch-threading)
- Add programmatic usage examples (patch_threading, tbb_threading)
- Document limitations (daemon, native_id)
- Add section on Free-threading Python 3.13+ (NOGIL) support
The -T flag is uppercase to distinguish from potential future -t uses. Using --patch-threading makes the monkey-patching behavior explicit.
Signed-off-by: Nikolay Petrov <[email protected]>
Benchmark results moved to README.md. Documentation consolidated. Signed-off-by: Nikolay Petrov <[email protected]>
Free-threading (NOGIL) support requires SWIG 4.4.0+ which added the -nogil flag and Py_MOD_GIL_NOT_USED support (July 2025). Signed-off-by: Nikolay Petrov <[email protected]>
C++ fixes (api.i):
- Add explicit move constructor to PyCaller to prevent use-after-free
- Add move constructor to ArenaPyCaller with proper ownership transfer
- Add null checks in destructors before Py_XDECREF
- Delete move/copy assignment operators to prevent accidental misuse
- Add comment about exception handling in worker threads
Python fixes (threading_patch.py):
- Fix TOCTOU race: move pool submission inside start() lock
- Fix join() timeout handling: don't check exception if timeout expires
- Sanitize exceptions: strip traceback to prevent memory leaks and data exposure
- Add UserWarning on patch_threading() about limitations
- Add _warn parameter to suppress warning when appropriate
- Document exception re-raising behavior difference from stdlib
Documentation (README.md):
- Expand limitations section with worker pool, exception, security notes
- Recommend tbb_threading() context manager over global patch
- Add guidance for I/O-bound workloads
Addresses findings from security review:
- CWE-367 TOCTOU race condition
- CWE-209 Information exposure via exceptions
- Memory safety in move semantics
Signed-off-by: Nikolay Petrov <[email protected]>
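Two of the Python fixes above (stripping tracebacks, and not inspecting the exception when a join() timeout expires) can be sketched as follows. `CapturingWorker` is a hypothetical stand-in built on a plain thread and an Event; it is an assumption about the general shape of such code, not the TBBThread implementation.

```python
import threading


class CapturingWorker:
    """Hypothetical stand-in: captures a task's exception for join()."""

    def __init__(self, target):
        self._target = target
        self._done = threading.Event()
        self._exc = None
        self._thread = threading.Thread(target=self._run)

    def _run(self):
        try:
            self._target()
        except Exception as exc:
            # Drop the traceback so worker frames and their locals
            # are not kept alive or exposed to the caller.
            self._exc = exc.with_traceback(None)
        finally:
            self._done.set()

    def start(self):
        self._thread.start()

    def join(self, timeout=None):
        if not self._done.wait(timeout):
            return  # timed out: the task is still running, report nothing
        self._thread.join()
        exc, self._exc = self._exc, None
        if exc is not None:
            raise exc  # differs from stdlib, where join() never raises


def boom():
    raise ValueError("bad input")


worker = CapturingWorker(boom)
worker.start()
try:
    worker.join()
except ValueError as exc:
    caught = exc
```

Catching `Exception` rather than `BaseException` lets KeyboardInterrupt and SystemExit pass through untouched.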
Remove threading_patch.py and consolidate all threading functionality:
- Full TBBThread implementation now in __init__.py (was duplicated)
- patch_threading(), unpatch_threading(), tbb_threading() moved to __init__.py
- is_threading_patched() for checking patch state
- Update imports: 'from tbb import patch_threading' (was tbb.threading_patch)
This eliminates duplication between the minimal and full TBBThread implementations, and removes the redundant 'python -m tbb.threading_patch' entry point (use 'python -m tbb -T' instead).
Signed-off-by: Nikolay Petrov <[email protected]>
Python 3.5-3.8 are EOL. The threading patch uses typing features and patterns that work best with Python 3.9+. Signed-off-by: Nikolay Petrov <[email protected]>
- Add GIL guard to PyCaller(PyObject*, bool) constructor
- Remove unused 'import queue'
- Add _patch_lock for thread-safe patch/unpatch operations
- Change 'except BaseException' to 'except Exception' (allow KeyboardInterrupt/SystemExit)
- Add _join_lock to fix exception handling race in TBBThread.join()
- Keep initial=false in ArenaPyCaller::operator() (correct borrowing semantics)
… wrappers
- Pass initial=false to PyCaller in task_arena::enqueue(), task_arena::execute(), and task_group::run() to properly INCREF borrowed PyObject references from SWIG. Without this, PyCaller destructor would DECREF a reference that was never INCREFed, causing refcount underflow. This matches the existing correct pattern in ArenaPyCaller (line 168) which explicitly INCREFs then passes initial=false.
- Document exception behavior for tbb_run_and_wait() and tbb_parallel_for(): exceptions in callbacks are logged via PyErr_WriteUnraisable, not propagated.
Add missing UXL Foundation copyright to python/tbb/__init__.py and python/tbb/api.i to pass the copyright_check CI job.
- test_threading.py: unittest suite for patch lifecycle, TBBThread drop-in behavior, context manager, tbb_run_and_wait, tbb_parallel_for
- examples/tbb_threading_example.py: usage example showing three methods (patch_threading, tbb_threading context manager, tbb_parallel_for)
- Fix copyright headers (new files use UXL only, update Intel year to 2025)
- Move _thread import to module level (performance: avoid import in hot path)
- Add pool.join() in _shutdown_pool (prevent orphaned tasks)
- Fix test/example imports to use consolidated tbb package (not threading_patch)
- Remove misleading 'zero-overhead' claim from docstrings
…n doc
- Add atexit handler to join non-daemon TBBThreads and cleanup pool
- Track active TBBThreads for proper shutdown behavior
- Add TLS limitation to TBBThread docstring
- Add join timeout test
- Create THREADING_DESIGN.md documenting system vs TBB thread differences, current implementation approach, supported features, and known limitations
- Fix CLI flag typo: -t → -T in tbb_threading_example.py
- Add language tag to code block in THREADING_DESIGN.md (markdown lint)
Description
Add TBB-based threading.Thread replacement for Python free-threading (NOGIL) builds.
Follow-up to #1965 (free-threading/NOGIL support).
This PR adds the ability to replace Python's threading.Thread with a TBB-based implementation (TBBThread) that executes work on TBB's work-stealing thread pool instead of creating new OS threads.
pyperformance (single-threaded): Neutral, with a geometric mean of 1.00x. Individual benchmarks vary ±30% (noise from TBB startup overhead).
Micro-benchmarks (threading primitives): Mostly worse (8 of 15 slower). Notable: orderbook 20-28% worse across all thread counts. Only bright spot: TCP ping-pong +53% at 4 threads (I/O-bound).
Real Python workloads (ThreadPoolExecutor): Neutral or worse, plus 2 critical deadlocks:
• Nested ThreadPoolExecutor → deadlock at 4 threads
• Producer-consumer with Queue → deadlock
• Work-stealing didn't help even on uneven tasks (21% slower)
• Dict-heavy workloads up to 1.9x slower
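The description does not state why the nested ThreadPoolExecutor case deadlocks. One common mechanism, assumed here purely for illustration, is a fixed-size worker pool whose workers all block waiting on tasks that can only run on that same pool. The sketch below reproduces that pattern with the standard library, using a one-worker pool and a timeout so that it terminates instead of hanging.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# A single worker makes the starvation easy to reproduce.
pool = ThreadPoolExecutor(max_workers=1)


def inner():
    return "inner done"


def outer():
    # outer occupies the only worker, so inner cannot start until outer returns.
    fut = pool.submit(inner)
    try:
        return fut.result(timeout=0.2)
    except FutureTimeout:
        # Without the timeout, this wait would never end.
        return "would deadlock"


outcome = pool.submit(outer).result()
pool.shutdown(wait=True)  # inner runs here, once the worker is free again
```

With OS threads created on demand, the blocked caller does not consume the capacity the nested task needs, which is one reason a shared bounded pool can behave differently from stdlib threads.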
Conclusion
tbb.patch_threading() provides no advantage for Python-level parallelism in free-threaded CPython. CPython's native threads already scale well enough for pure Python; TBB's work-stealing is designed for fine-grained C/C++ tasks, not coarse Python bytecode execution. The monkey-patch adds overhead that negates any scheduling benefit.
Key Features
• patch_threading() / unpatch_threading(): Global functions to replace threading.Thread with TBBThread
• tbb_threading context manager: Scoped thread patching
• -t/--threads CLI flag: Enable TBB threading via python -m tbb -t script.py
• Monkey(threads=True): Opt-in threading replacement in Monkey context manager
• tbb_run_and_wait() / tbb_parallel_for(): Direct TBB dispatch helpers for batch parallel execution
How It Works
TBBThread is a drop-in replacement for threading.Thread that submits work to a shared TBB Pool (backed by task_group) rather than spawning OS threads. Benefits:
Limitations (documented)
• daemon property has no effect (TBB manages thread lifecycle)
• native_id may be reused across TBBThread instances
• join() (differs from stdlib)
Changes
• python/tbb/__init__.py: Consolidated TBBThread class with Event-based synchronization, patch_threading() / unpatch_threading() functions, tbb_threading context manager
• python/tbb/api.i: Added tbb_run_and_wait and tbb_parallel_for SWIG wrappers
• python/tbb/__main__.py: Added -t/--threads and --patch-threading CLI flags
• python/README.md: Documentation and usage examples
Fixes # - issue number(s) if exists
Type of change
Tests
Documentation
Breaks backward compatibility
Notify the following users
Other information
Benchmark results on Intel hardware included in commit 9e09e29. Full pyperformance suite testing in progress.