Conversation

@billybasass

This PR adds cross-vendor GPU support to the PyIsolate benchmarking suite, enabling seamless benchmarking on NVIDIA (CUDA), AMD (ROCm), and Intel (XPU/oneAPI) GPUs, as well as on CPU. It also improves device selection, serialization, and reporting for all supported platforms.
Key Features & Changes

Multi-GPU Device Selection:
  • Detects all available GPUs (NVIDIA, AMD, Intel) in the system.
  • Prompts the user to select which GPU to use if more than one is present.
  • Prints the name of the selected device at the start of each benchmark run.

Device-Aware Tensor Serialization:
  • CPU and CUDA: uses PyTorch’s native (zero-copy) serialization for inter-process communication.
  • Intel XPU: implements a fallback serialization for XPU tensors (see the sketch after this list): XPU tensors are moved to CPU, serialized as a buffer (with shape and dtype), sent through the multiprocessing queue, and reconstructed as XPU tensors on the receiving side. This avoids PyCapsule/pickle errors and ensures all tests run on Intel GPUs, even though the transfer is not zero-copy.
  • All benchmarks (including shared-torch and RPC) now run on Intel GPUs.

Benchmark Script Enhancements:
  • Benchmarks run on all available backends (CPU, CUDA, XPU).
  • Status messages and results include the backend and device name for clarity and reproducibility.

Extension Venv and Dependency Handling:
  • Extension venvs now install the correct PyTorch wheel and index for each backend.
  • Ensures the correct torch version and index for Intel GPUs, and removes invalid version specifiers.

Linter and Syntax Fixes:
  • Fixed all linter and syntax errors, including stray else: statements and indentation issues.
  • The codebase is now linter-clean and CI-ready.
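
For reference, a minimal sketch of the XPU fallback described above, assuming a benchmark environment where torch and numpy are installed. The function names are illustrative, not the PR's actual API; only the move-to-CPU, buffer-plus-metadata, rebuild-on-XPU flow comes from the description.

```python
import torch  # numpy is also assumed available for the buffer round trip

# Illustrative sketch only: these names are hypothetical, not the PR's API.

def serialize_xpu_tensor(t: torch.Tensor):
    """Move an XPU tensor to CPU and flatten it to a raw buffer plus metadata."""
    cpu = t.detach().to("cpu").contiguous()
    return cpu.numpy().tobytes(), tuple(cpu.shape), cpu.dtype

def deserialize_xpu_tensor(buf: bytes, shape, dtype) -> torch.Tensor:
    """Rebuild the tensor from the buffer and move it back onto the XPU device."""
    cpu = torch.frombuffer(bytearray(buf), dtype=dtype).reshape(shape)
    return cpu.to("xpu")  # not zero-copy: one host round trip per transfer
```
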
Motivation
  • Enable robust benchmarking and extension support across all major GPU vendors.
  • Ensure that all tests run on all supported hardware, with clear device selection and reporting.
  • Provide a future-proof serialization mechanism for device tensors, especially for Intel XPU, until native support is available in PyTorch.

Testing
  • Benchmarks were run on systems with NVIDIA, AMD, and Intel GPUs.
  • All tests pass, and results are correctly reported for each device.
  • No PyCapsule/pickle errors on Intel GPUs.

Closes Issues
  • Fixes device selection and reporting for multi-GPU systems.
  • Fixes XPU tensor serialization for multiprocessing.
  • Fixes linter and syntax errors in the codebase.
Ready for review!

@kevinpaik1

Hi @billybasass, thanks for this PR! I've completed a review of your multi-GPU support implementation.

The Good: Your multi-vendor GPU support (NVIDIA/AMD/Intel) is excellent, and the device-aware tensor serialization is well-architected. The comprehensive benchmarking with memory tracking is particularly impressive, and the Intel XPU support shows great forward thinking.

Critical Issue: Unfortunately, the PR currently breaks the core library by adding mandatory numpy and torch imports to pyisolate/_internal/shared.py. This makes the library unusable for anyone without these packages installed - even import pyisolate fails with ModuleNotFoundError.

Required Changes:

  1. Make numpy/torch imports conditional (try/except blocks) so core functionality works without ML dependencies (see the sketch below)
  2. Restore the removed .pre-commit-config.yaml file
  3. Fix Python version inconsistencies (some files hardcode 3.12 while pyproject.toml allows 3.9+)
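
For item 1, a minimal sketch of the guarded-import pattern; the None sentinel and the require_torch helper name are illustrative, not prescriptive:

```python
# Sketch of conditional imports for pyisolate/_internal/shared.py.
# The sentinel and helper names here are illustrative only.
try:
    import torch
except ImportError:  # core library must still import without ML deps
    torch = None

try:
    import numpy as np
except ImportError:
    np = None

def require_torch() -> None:
    """Raise a clear error only when a torch-dependent feature is used."""
    if torch is None:
        raise ImportError("This feature requires torch; install the ML extras.")
```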

Next Steps: The GPU features themselves are valuable and well-implemented. Once the import issues are resolved to maintain backward compatibility, this will be a great addition to the project.

I've prepared detailed feedback documents that I'll share separately. Please let me know if you need any clarification on the architectural changes needed to make the ML dependencies optional.

… added back pre commit config, and clean venv handling
…d environment setup is now handled in main(). run_benchmarks() is now device-agnostic and will not throw UnboundLocalError
@billybasass
Author

Thank you, Kevin.

Sorry for these mistakes; after trying some of the suggested changes I had to revert to my original plan, and things got a little messy. I have made the changes you asked for here and in your email. Here is a list of the changes; let me know if I missed something or need to add more. The changes were retested on Intel and NVIDIA systems with multiple GPUs.

PyIsolate Benchmark & Core Library: Summary of Changes

  1. Optional GPU/Numpy Dependencies
    Moved all top-level torch and numpy imports in core files into functions or wrapped them in try/except blocks.
    Made GPU-specific features optional by isolating them in a new module: pyisolate/_internal/gpu_utils.py.
    Core library can now be imported in minimal Python environments (no torch/numpy required for CPU-only use).

  2. Python Version Consistency
    Virtual environment creation in the core now uses the current Python interpreter version rather than a hardcoded version (the hardcoding was due to Python 3.13 not being supported).

  3. GPU Utility Refactor
    Created pyisolate/_internal/gpu_utils.py for all GPU/XPU/torch-specific utilities (maybe_serialize_tensor, etc.; see the sketch after this list).
    Updated all references and imports in the codebase and documentation to use the new module.

  4. Device Selection Robustness
    PowerShell benchmark launcher now always passes the user’s device index to the Python benchmark scripts.
    Device index is propagated to all extension subprocesses via the PYISOLATE_CUDA_DEVICE environment variable.
    Extension code updated to set the device for both CUDA (NVIDIA) and XPU (Intel) backends.

  5. Editable Install and venv Hygiene
    Ensured all extension venvs use the latest source via editable install (-e .).
    Provided and used a PowerShell cleanup script to remove all old venvs, .test_temps, and Python bytecode caches. To use it, open the main folder in PowerShell and run powershell -ExecutionPolicy Bypass -File .\cleanup_pyisolate.ps1

  6. Core Library Import Fix
    Fixed a critical bug in pyisolate/_internal/shared.py: the core library now imports and uses the real GPU utility functions from gpu_utils.py at runtime, not the stub that always raises ImportError.

  7. Miscellaneous Improvements
    Improved error handling and output for missing dependencies and device selection.
    Ensured tabulate import is always handled gracefully for summary output.

  8. Added back the .pre-commit-config.yaml (I had thought it was a file I added)
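
A rough sketch of how items 3 and 4 could fit together. Only the maybe_serialize_tensor name and the PYISOLATE_CUDA_DEVICE variable come from the list above; the function bodies and set_device_from_env are illustrative assumptions:

```python
# Illustrative sketch of pyisolate/_internal/gpu_utils.py; not the actual code.
import os

try:
    import torch
except ImportError:
    torch = None  # core paths must work without torch

def maybe_serialize_tensor(obj):
    """Pass non-tensors through; apply the device-aware path to tensors."""
    if torch is None or not isinstance(obj, torch.Tensor):
        return obj
    if obj.device.type == "xpu":
        # Fall back to the CPU round trip (see the earlier serialization sketch).
        return ("xpu", obj.detach().to("cpu"))
    return obj  # CPU/CUDA tensors use torch's native serialization

def set_device_from_env() -> None:
    """Apply the device index handed down by the benchmark launcher."""
    idx = os.environ.get("PYISOLATE_CUDA_DEVICE")
    if idx is None or torch is None:
        return
    if torch.cuda.is_available():
        torch.cuda.set_device(int(idx))
    elif hasattr(torch, "xpu") and torch.xpu.is_available():
        torch.xpu.set_device(int(idx))
```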

@kevinpaik1

PR Review Summary - v3

Status: ✅ Approved for Core Team Review (Domain Expert Validation Required)

For the Contributor

Good work, @billybasass! You've addressed the critical architectural issues from the previous review. The core library import problems are resolved, and the GPU functionality is now properly separated into an optional module.

The core team will conduct validation of the benchmarking components to ensure they align with the project's specific technical requirements.

Technical Summary

  • Scope: Multi-GPU support with cross-vendor compatibility + architectural refactoring
  • Testing: ✅ Core library imports verified, linting passes, architectural fixes confirmed
  • Impact: Adds GPU functionality while maintaining backward compatibility

Review Highlights

Architectural Changes:

  • ✅ Core library import issues resolved (verified through testing)
  • ✅ GPU utilities separated into optional gpu_utils.py module
  • ✅ Pre-commit configuration restored and functional
  • ✅ All linting and formatting requirements met
  • ✅ Systematic implementation of previous feedback

Domain Expert Review Required:

  • Benchmarking accuracy and completeness validation
  • Performance characteristics alignment with project requirements
  • Coverage of critical GPU use cases
  • Integration with existing testing infrastructure

Code Quality:

  • Clean architectural separation following Python best practices
  • Appropriate error handling for missing dependencies
  • Consistent documentation and code style
  • Backward compatibility maintained

Next Steps

The core team will conduct validation of the benchmarking implementation.

Your multi-GPU architectural work provides a foundation for the core team to validate and build upon. Thank you!

@billybasass billybasass marked this pull request as ready for review July 23, 2025 14:43
Member

@guill guill left a comment


A number of changes are requested. Put into broad categories, they are generally related to:

  1. Type hint warnings that have been suppressed but should probably just be fixed to not need suppression.
  2. Silently falling back to CPU-only rather than failing a test if we are, for example, trying to test CUDA but don't have CUDA installed.
  3. Various one-off changes.

…tel XPU and AMD ROCm GPUs, not just CUDA. Fixed unresolved import for ExampleExtensionBase by importing from example.shared.
@billybasass
Author

Thank you for the great review, @guill. Some of the unneeded things were left over from when I tried a different approach you suggested; I thought I had removed them all (I've been working under the weather), so thank you for finding them. In more than one place you asked why I was using type: ignore: it was because some things were annotated and some were not, so Python treats them as dynamic and no error is raised; I have fixed this. You asked which ruff version I used; I updated to the latest before running. Regarding the hardcoding of PyTorch for Intel GPUs: this should only apply to Intel GPUs, which need PyTorch 2.7 or later because Intel was not fully integrated with PyTorch until 2.7. I asked Bob Duffy from Intel about this, and we will probably have problems if we allow a lower version.

Here is a summary of changes I made after checking every line you pointed out.

  1. Type hint warnings and suppressions
    Updated type annotations for benchmark_ext and benchmark_ext_shared to use Optional types from example.shared instead of object or leaving them untyped.
    Removed unnecessary # type: ignore suppressions for these assignments and for runner assertions in tests.
    Updated imports to include example.shared where needed.

  2. Failing on missing GPU/ROCm instead of falling back to CPU
    In both benchmarks/benchmark.py and benchmarks/memory_benchmark.py, the code now calls sys.exit(1) if no supported GPU backend is available, or if ROCm is not supported on Windows, instead of silently falling back to CPU (see the sketch after this list).
    In tests/test_benchmarks.py, the test now fails (exits) if CUDA is not available, rather than printing a warning and continuing.

  3. One-off changes and cleanups
    Removed unnecessary getattr(..., 'manager', None) checks; now directly uses test_instance.manager.
    Removed unnecessary type: ignore[attr-defined] suppressions for NVML calls in memory_benchmark.py.
    Removed redundant explicit float casts where Python's type inference is sufficient.
    Removed unnecessary getattr(self.test_base, "test_root", None) checks, using self.test_base.test_root directly.
    Cleaned up comments and added clarifying comments where appropriate.
    Removed tabulate from the [project.optional-dependencies.test] section in pyproject.toml, keeping it only in [bench].
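
A minimal sketch of the fail-loud behavior in item 2; the function name and message text are illustrative, not the benchmarks' exact code:

```python
import sys

# Illustrative only: shows the exit-instead-of-fallback policy from item 2.
def pick_backend() -> str:
    import torch

    if torch.cuda.is_available():
        # torch.version.hip is set on ROCm builds of torch
        if torch.version.hip and sys.platform == "win32":
            print("ERROR: ROCm is not supported on Windows.")
            sys.exit(1)
        return "cuda"
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return "xpu"
    print("ERROR: no supported GPU backend found; refusing to fall back to CPU.")
    sys.exit(1)
```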

@billybasass billybasass requested a review from guill July 29, 2025 22:23
pollockjj added a commit to pollockjj/pyisolate that referenced this pull request Dec 16, 2025
… decoupling (comfy decoupling part 1)

## PR Comfy-Org#3 Review Comments Addressed (38/38)

### Hardcoded Paths Removed (Comfy-Org#2, Comfy-Org#3, #13, #14, #36)
- Remove /home/johnj references from tests and documentation
- Use COMFYUI_ROOT env var with $HOME/ComfyUI fallback in conftest.py
- Delete SETUP_GUIDE.md; consolidate in README.md
- Add comfy_hello_world/node-venvs/ to .gitignore

### Error Messages & Documentation (Comfy-Org#1, #5, #7, #8)
- Improve site-packages error with host_prefix, valid_parents, candidates
- Restore editable install comment explaining -e flag behavior
- Add 50-line architecture docstring to shared.py documenting three-layer
  serialization (Layer 1: _prepare_for_rpc, Layer 2: serialize_for_isolation,
  Layer 3: RPC envelope)
- Clarify zero-copy: CUDA IPC on Linux keeps tensors on-device; otherwise
  CPU shared memory fallback

### Fail-Loud Pattern (Comfy-Org#4, #21, #22, #23, #24, #25, #26, #27, #29, #33, #35)
- Restore try/except in client.py with fail-loud logging
- Replace hasattr duck-typing with exact type_name matching for CLIP/VAE
- Delete unused move_tensors_to_device function (VRAM leak risk)
- Remove ALL truncation/defensive code; raise RuntimeError when objects
  lack _instance_id instead of silently returning an opaque repr (sketched below)
- Remove module allowlists entirely (ALLOWED_MODULES patterns eliminated)
- _tensor_to_cuda now passthrough; no auto-migration of tensors to CUDA
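
A hypothetical illustration of the _instance_id rule above; the function name is invented, only the fail-loud behavior matches the description:

```python
# Invented name; only the raise-instead-of-repr behavior is from this commit.
def resolve_remote_object(obj: object) -> str:
    instance_id = getattr(obj, "_instance_id", None)
    if instance_id is None:
        # Fail loud instead of silently returning an opaque repr.
        raise RuntimeError(
            f"{type(obj).__name__} has no _instance_id; cannot serialize for RPC"
        )
    return instance_id
```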

### Code Clarity (#18, #19, #20, #28, #30, #31, #32, #34)
- Remove "ComfyUI" hardcoded assumption; derive preferred_root from snapshot
- Remove original_register_route replacement logic (simplified)
- Add 5-line comment explaining .replace(".", "_x_") for Python identifiers
- Restore singleton bypass comment (object.__new__ to prevent recursion)
- Confirm handling_call_id IS read for parent_call_id tracking
- Rename _tensor_to_cpu → _prepare_for_rpc (12 occurrences)
- Add RPC message protocol comments (RPCRequest/RPCCallback/RPCResponse)
- Add else clause raising ValueError for unknown RPC request kinds

### Code Quality (#15, #16, #17, #37, #38)
- Move get_torch_ecosystem_packages() to _internal/torch_utils.py
- Propagate only host env vars (no defaults added) - already correct
- Ensure ruff/mypy pass with zero errors
- Compare BOTH fingerprint AND full descriptor before skipping install

## New Features

### Adapter Decoupling Architecture
- NEW: pyisolate/interfaces.py - IsolationAdapter protocol + SerializerRegistryProtocol
- NEW: pyisolate/_internal/bootstrap.py - Child process bootstrap resolving
  "config before path" paradox
- NEW: pyisolate/_internal/loader.py - Entry point discovery for adapters
  (pyisolate.adapters group)
- NEW: pyisolate/_internal/serialization_registry.py - Dynamic serializer
  registry with O(1) lookup (sketched below)
- host.py: build_extension_snapshot() integrates adapter metadata
- client.py: bootstrap_child() applies snapshot before heavy imports
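
A minimal sketch of the registry idea; the class and method names are invented, only the dynamic-registration and O(1)-lookup behavior comes from the commit:

```python
# Illustrative sketch of a serializer registry with O(1) dispatch.
from typing import Any, Callable

class SerializerRegistry:
    def __init__(self) -> None:
        self._by_type: dict[type, Callable[[Any], Any]] = {}

    def register(self, typ: type, fn: Callable[[Any], Any]) -> None:
        """Register a serializer for one concrete type."""
        self._by_type[typ] = fn

    def serialize(self, obj: Any) -> Any:
        # Exact-type dict lookup keeps dispatch O(1).
        fn = self._by_type.get(type(obj))
        if fn is None:
            raise ValueError(f"No serializer registered for {type(obj).__name__}")
        return fn(obj)
```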

### Test Coverage ~40% → 96%
- NEW: 23 test files added
- tests/test_bootstrap.py, test_bootstrap_additional.py
- tests/test_loader.py, test_loader_additional.py
- tests/test_serialization_registry.py
- tests/test_model_serialization.py, test_model_serialization_additional.py
- tests/test_shared_unit.py, test_shared_additional.py, test_shared_rpc_flows.py
- tests/test_host_unit.py, test_host_additional.py, test_host_public.py,
  test_host_internal_ext.py, test_host_integration.py
- tests/test_client_entrypoint_extra.py
- tests/test_remaining_coverage.py (comprehensive edge cases)
- tests/test_fail_loud.py
- Reorganize integration tests to tests/integration/

### Type Annotations
- Add type hints to AttrDict, AttributeContainer, AsyncRPC
- Add __all__ export list to host.py
- Use dict[str, Any] instead of Dict for modern Python