Conversation

@billybasass

This PR adds cross-vendor GPU support to the PyIsolate benchmarking suite, enabling seamless benchmarking on NVIDIA (CUDA), AMD (ROCm), and Intel (XPU/oneAPI) GPUs, as well as on CPU. It also improves device selection, serialization, and reporting for all supported platforms.
Key Features & Changes

Multi-GPU Device Selection:
  • Detects all available GPUs (NVIDIA, AMD, Intel) in the system.
  • Prompts the user to select which GPU to use if more than one is present.
  • Prints the name of the selected device at the start of each benchmark run.

Device-Aware Tensor Serialization:
  • CPU and CUDA: uses PyTorch’s native (zero-copy) serialization for inter-process communication.
  • Intel XPU: implements a fallback serialization for XPU tensors (see the sketch after this list): XPU tensors are moved to CPU, serialized as a buffer (with shape and dtype), sent through the multiprocessing queue, and reconstructed as XPU tensors on the receiving side. This avoids PyCapsule/pickle errors and ensures all tests run on Intel GPUs, even though the transfer is not zero-copy.
  • All benchmarks (including shared-torch and RPC) now run on Intel GPUs.

Benchmark Script Enhancements:
  • Benchmarks run on all available backends (CPU, CUDA, XPU).
  • Status messages and results include the backend and device name for clarity and reproducibility.

Extension Venv and Dependency Handling:
  • Extension venvs now install the correct PyTorch wheel and index for each backend.
  • Ensures the correct torch version and index for Intel GPUs, and removes invalid version specifiers.

Linter and Syntax Fixes:
  • Fixed all linter and syntax errors, including stray else: statements and indentation issues.
  • The codebase is now linter-clean and CI-ready.
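
For reference, a minimal sketch of the XPU fallback described above, assuming a benchmark environment where torch and numpy are installed. The function names are illustrative, not the PR's actual API; only the move-to-CPU, buffer-plus-metadata, rebuild-on-XPU flow comes from the description.

```python
import torch  # numpy is also assumed available for the buffer round trip

# Illustrative sketch only: these names are hypothetical, not the PR's API.

def serialize_xpu_tensor(t: torch.Tensor):
    """Move an XPU tensor to CPU and flatten it to a raw buffer plus metadata."""
    cpu = t.detach().to("cpu").contiguous()
    return cpu.numpy().tobytes(), tuple(cpu.shape), cpu.dtype

def deserialize_xpu_tensor(buf: bytes, shape, dtype) -> torch.Tensor:
    """Rebuild the tensor from the buffer and move it back onto the XPU device."""
    cpu = torch.frombuffer(bytearray(buf), dtype=dtype).reshape(shape)
    return cpu.to("xpu")  # not zero-copy: one host round trip per transfer
```
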
Motivation
  • Enable robust benchmarking and extension support across all major GPU vendors.
  • Ensure that all tests run on all supported hardware, with clear device selection and reporting.
  • Provide a future-proof serialization mechanism for device tensors, especially for Intel XPU, until native support is available in PyTorch.

Testing
  • Benchmarks were run on systems with NVIDIA, AMD, and Intel GPUs.
  • All tests pass, and results are correctly reported for each device.
  • No PyCapsule/pickle errors on Intel GPUs.

Closes Issues
  • Fixes device selection and reporting for multi-GPU systems.
  • Fixes XPU tensor serialization for multiprocessing.
  • Fixes linter and syntax errors in the codebase.
Ready for review!

@kevinpaik1

Hi @billybasass, thanks for this PR! I've completed a review of your multi-GPU support implementation.

The Good: Your multi-vendor GPU support (NVIDIA/AMD/Intel) is excellent, and the device-aware tensor serialization is well-architected. The comprehensive benchmarking with memory tracking is particularly impressive, and the Intel XPU support shows great forward thinking.

Critical Issue: Unfortunately, the PR currently breaks the core library by adding mandatory numpy and torch imports to pyisolate/_internal/shared.py. This makes the library unusable for anyone without these packages installed - even import pyisolate fails with ModuleNotFoundError.

Required Changes:

  1. Make numpy/torch imports conditional (try/except blocks) so core functionality works without ML dependencies (see the sketch below)
  2. Restore the removed .pre-commit-config.yaml file
  3. Fix Python version inconsistencies (some files hardcode 3.12 while pyproject.toml allows 3.9+)
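
For item 1, a minimal sketch of the guarded-import pattern; the None sentinel and the require_torch helper name are illustrative, not prescriptive:

```python
# Sketch of conditional imports for pyisolate/_internal/shared.py.
# The sentinel and helper names here are illustrative only.
try:
    import torch
except ImportError:  # core library must still import without ML deps
    torch = None

try:
    import numpy as np
except ImportError:
    np = None

def require_torch() -> None:
    """Raise a clear error only when a torch-dependent feature is used."""
    if torch is None:
        raise ImportError("This feature requires torch; install the ML extras.")
```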

Next Steps: The GPU features themselves are valuable and well-implemented. Once the import issues are resolved to maintain backward compatibility, this will be a great addition to the project.

I've prepared detailed feedback documents that I'll share separately. Please let me know if you need any clarification on the architectural changes needed to make the ML dependencies optional.

… added back pre commit config, and clean venv handling
…d environment setup is now handled in main(). run_benchmarks() is now device-agnostic and will not throw UnboundLocalError
@billybasass
Author

Thank you, Kevin.

Sorry for these mistakes; after trying some of the suggested changes I had to revert to my original plan, and things got a little messy. I have made the changes you asked for here and in your email. Here is a list of the changes; let me know if I missed something or need to add more. The changes were retested on Intel and NVIDIA systems with multiple GPUs.

PyIsolate Benchmark & Core Library: Summary of Changes

  1. Optional GPU/Numpy Dependencies
    Moved all top-level torch and numpy imports in core files into functions or wrapped them in try/except blocks.
    Made GPU-specific features optional by isolating them in a new module: pyisolate/_internal/gpu_utils.py.
    Core library can now be imported in minimal Python environments (no torch/numpy required for CPU-only use).

  2. Python Version Consistency
    Virtual environment creation in the core now uses the current Python interpreter version rather than a hardcoded version (the hardcoding was due to Python 3.13 not being supported).

  3. GPU Utility Refactor
    Created pyisolate/_internal/gpu_utils.py for all GPU/XPU/torch-specific utilities (maybe_serialize_tensor, etc.; see the sketch after this list).
    Updated all references and imports in the codebase and documentation to use the new module.

  4. Device Selection Robustness
    PowerShell benchmark launcher now always passes the user’s device index to the Python benchmark scripts.
    Device index is propagated to all extension subprocesses via the PYISOLATE_CUDA_DEVICE environment variable.
    Extension code updated to set the device for both CUDA (NVIDIA) and XPU (Intel) backends.

  5. Editable Install and venv Hygiene
    Ensured all extension venvs use the latest source via editable install (-e .).
    Provided and used a PowerShell cleanup script to remove all old venvs, .test_temps, and Python bytecode caches. To use it, open the main folder in PowerShell and run powershell -ExecutionPolicy Bypass -File .\cleanup_pyisolate.ps1

  6. Core Library Import Fix
    Fixed a critical bug in pyisolate/_internal/shared.py: the core library now imports and uses the real GPU utility functions from gpu_utils.py at runtime, not the stub that always raises ImportError.

  7. Miscellaneous Improvements
    Improved error handling and output for missing dependencies and device selection.
    Ensured tabulate import is always handled gracefully for summary output.

  8. Added back the .pre-commit-config.yaml (I had thought it was a file I added)
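
A rough sketch of how items 3 and 4 could fit together. Only the maybe_serialize_tensor name and the PYISOLATE_CUDA_DEVICE variable come from the list above; the function bodies and set_device_from_env are illustrative assumptions:

```python
# Illustrative sketch of pyisolate/_internal/gpu_utils.py; not the actual code.
import os

try:
    import torch
except ImportError:
    torch = None  # core paths must work without torch

def maybe_serialize_tensor(obj):
    """Pass non-tensors through; apply the device-aware path to tensors."""
    if torch is None or not isinstance(obj, torch.Tensor):
        return obj
    if obj.device.type == "xpu":
        # Fall back to the CPU round trip (see the earlier serialization sketch).
        return ("xpu", obj.detach().to("cpu"))
    return obj  # CPU/CUDA tensors use torch's native serialization

def set_device_from_env() -> None:
    """Apply the device index handed down by the benchmark launcher."""
    idx = os.environ.get("PYISOLATE_CUDA_DEVICE")
    if idx is None or torch is None:
        return
    if torch.cuda.is_available():
        torch.cuda.set_device(int(idx))
    elif hasattr(torch, "xpu") and torch.xpu.is_available():
        torch.xpu.set_device(int(idx))
```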

@kevinpaik1

PR Review Summary - v3

Status: ✅ Approved for Core Team Review (Domain Expert Validation Required)

For the Contributor

Good work, @billybasass! You've addressed the critical architectural issues from the previous review. The core library import problems are resolved, and the GPU functionality is now properly separated into an optional module.

The core team will conduct validation of the benchmarking components to ensure they align with the project's specific technical requirements.

Technical Summary

  • Scope: Multi-GPU support with cross-vendor compatibility + architectural refactoring
  • Testing: ✅ Core library imports verified, linting passes, architectural fixes confirmed
  • Impact: Adds GPU functionality while maintaining backward compatibility

Review Highlights

Architectural Changes:

  • ✅ Core library import issues resolved (verified through testing)
  • ✅ GPU utilities separated into optional gpu_utils.py module
  • ✅ Pre-commit configuration restored and functional
  • ✅ All linting and formatting requirements met
  • ✅ Systematic implementation of previous feedback

Domain Expert Review Required:

  • Benchmarking accuracy and completeness validation
  • Performance characteristics alignment with project requirements
  • Coverage of critical GPU use cases
  • Integration with existing testing infrastructure

Code Quality:

  • Clean architectural separation following Python best practices
  • Appropriate error handling for missing dependencies
  • Consistent documentation and code style
  • Backward compatibility maintained

Next Steps

The core team will conduct validation of the benchmarking implementation.

Your multi-GPU architectural work provides a foundation for the core team to validate and build upon. Thank you!

@billybasass billybasass marked this pull request as ready for review July 23, 2025 14:43
Member

@guill guill left a comment


A number of changes are requested. Put into broad categories, they are generally related to:

  1. Type hint warnings that have been suppressed but should probably just be fixed to not need suppression.
  2. Silently falling back to CPU-only rather than failing a test if we are, for example, trying to test CUDA but don't have CUDA installed.
  3. Various one-off changes.

…tel XPU and AMD ROCm GPUs, not just CUDA. Fixed unresolved import for ExampleExtensionBase by importing from example.shared.
@billybasass
Author

Thank you for the great review, @guill. Some of the unneeded things were left over from when I tried a different approach you suggested; I thought I had removed them all (I've been working under the weather), so thank you for finding them. In more than one place you asked why I was using type: ignore: it was because some things were annotated and some were not, so Python treats them as dynamic and no error is raised; I have fixed this. You asked which ruff version I used; I updated to the latest before running. Regarding the hardcoding of PyTorch for Intel GPUs: this should only apply to Intel GPUs, which need PyTorch 2.7 or later because Intel was not fully integrated with PyTorch until 2.7. I asked Bob Duffy from Intel about this, and we will probably have problems if we allow a lower version.

Here is a summary of changes I made after checking every line you pointed out.

  1. Type hint warnings and suppressions
    Updated type annotations for benchmark_ext and benchmark_ext_shared to use Optional types from example.shared instead of object or leaving them untyped.
    Removed unnecessary # type: ignore suppressions for these assignments and for runner assertions in tests.
    Updated imports to include example.shared where needed.

  2. Failing on missing GPU/ROCm instead of falling back to CPU
    In both benchmarks/benchmark.py and benchmarks/memory_benchmark.py, the code now calls sys.exit(1) if no supported GPU backend is available, or if ROCm is not supported on Windows, instead of silently falling back to CPU (see the sketch after this list).
    In tests/test_benchmarks.py, the test now fails (exits) if CUDA is not available, rather than printing a warning and continuing.

  3. One-off changes and cleanups
    Removed unnecessary getattr(..., 'manager', None) checks; now directly uses test_instance.manager.
    Removed unnecessary type: ignore[attr-defined] suppressions for NVML calls in memory_benchmark.py.
    Removed redundant explicit float casts where Python's type inference is sufficient.
    Removed unnecessary getattr(self.test_base, "test_root", None) checks, using self.test_base.test_root directly.
    Cleaned up comments and added clarifying comments where appropriate.
    Removed tabulate from the [project.optional-dependencies.test] section in pyproject.toml, keeping it only in [bench].
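
A minimal sketch of the fail-loud behavior in item 2; the function name and message text are illustrative, not the benchmarks' exact code:

```python
import sys

# Illustrative only: shows the exit-instead-of-fallback policy from item 2.
def pick_backend() -> str:
    import torch

    if torch.cuda.is_available():
        # torch.version.hip is set on ROCm builds of torch
        if torch.version.hip and sys.platform == "win32":
            print("ERROR: ROCm is not supported on Windows.")
            sys.exit(1)
        return "cuda"
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return "xpu"
    print("ERROR: no supported GPU backend found; refusing to fall back to CPU.")
    sys.exit(1)
```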

@billybasass billybasass requested a review from guill July 29, 2025 22:23
pollockjj added a commit to pollockjj/pyisolate that referenced this pull request Dec 16, 2025
… decoupling (comfy decoupling part 1)

## PR Comfy-Org#3 Review Comments Addressed (38/38)

### Hardcoded Paths Removed (Comfy-Org#2, Comfy-Org#3, #13, #14, #36)
- Remove /home/johnj references from tests and documentation
- Use COMFYUI_ROOT env var with $HOME/ComfyUI fallback in conftest.py
- Delete SETUP_GUIDE.md; consolidate in README.md
- Add comfy_hello_world/node-venvs/ to .gitignore

### Error Messages & Documentation (Comfy-Org#1, #5, #7, #8)
- Improve site-packages error with host_prefix, valid_parents, candidates
- Restore editable install comment explaining -e flag behavior
- Add 50-line architecture docstring to shared.py documenting three-layer
  serialization (Layer 1: _prepare_for_rpc, Layer 2: serialize_for_isolation,
  Layer 3: RPC envelope)
- Clarify zero-copy: CUDA IPC on Linux keeps tensors on-device; otherwise
  CPU shared memory fallback

### Fail-Loud Pattern (Comfy-Org#4, #21, #22, #23, #24, #25, #26, #27, #29, #33, #35)
- Restore try/except in client.py with fail-loud logging
- Replace hasattr duck-typing with exact type_name matching for CLIP/VAE
- Delete unused move_tensors_to_device function (VRAM leak risk)
- Remove ALL truncation/defensive code; raise RuntimeError when objects
  lack _instance_id instead of silently returning an opaque repr (sketched below)
- Remove module allowlists entirely (ALLOWED_MODULES patterns eliminated)
- _tensor_to_cuda now passthrough; no auto-migration of tensors to CUDA
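
A hypothetical illustration of the _instance_id rule above; the function name is invented, only the fail-loud behavior matches the description:

```python
# Invented name; only the raise-instead-of-repr behavior is from this commit.
def resolve_remote_object(obj: object) -> str:
    instance_id = getattr(obj, "_instance_id", None)
    if instance_id is None:
        # Fail loud instead of silently returning an opaque repr.
        raise RuntimeError(
            f"{type(obj).__name__} has no _instance_id; cannot serialize for RPC"
        )
    return instance_id
```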

### Code Clarity (#18, #19, #20, #28, #30, #31, #32, #34)
- Remove "ComfyUI" hardcoded assumption; derive preferred_root from snapshot
- Remove original_register_route replacement logic (simplified)
- Add 5-line comment explaining .replace(".", "_x_") for Python identifiers
- Restore singleton bypass comment (object.__new__ to prevent recursion)
- Confirm handling_call_id IS read for parent_call_id tracking
- Rename _tensor_to_cpu → _prepare_for_rpc (12 occurrences)
- Add RPC message protocol comments (RPCRequest/RPCCallback/RPCResponse)
- Add else clause raising ValueError for unknown RPC request kinds

### Code Quality (#15, #16, #17, #37, #38)
- Move get_torch_ecosystem_packages() to _internal/torch_utils.py
- Propagate only host env vars (no defaults added) - already correct
- Ensure ruff/mypy pass with zero errors
- Compare BOTH fingerprint AND full descriptor before skipping install

## New Features

### Adapter Decoupling Architecture
- NEW: pyisolate/interfaces.py - IsolationAdapter protocol + SerializerRegistryProtocol
- NEW: pyisolate/_internal/bootstrap.py - Child process bootstrap resolving
  "config before path" paradox
- NEW: pyisolate/_internal/loader.py - Entry point discovery for adapters
  (pyisolate.adapters group)
- NEW: pyisolate/_internal/serialization_registry.py - Dynamic serializer
  registry with O(1) lookup (sketched below)
- host.py: build_extension_snapshot() integrates adapter metadata
- client.py: bootstrap_child() applies snapshot before heavy imports
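
A minimal sketch of the registry idea; the class and method names are invented, only the dynamic-registration and O(1)-lookup behavior comes from the commit:

```python
# Illustrative sketch of a serializer registry with O(1) dispatch.
from typing import Any, Callable

class SerializerRegistry:
    def __init__(self) -> None:
        self._by_type: dict[type, Callable[[Any], Any]] = {}

    def register(self, typ: type, fn: Callable[[Any], Any]) -> None:
        """Register a serializer for one concrete type."""
        self._by_type[typ] = fn

    def serialize(self, obj: Any) -> Any:
        # Exact-type dict lookup keeps dispatch O(1).
        fn = self._by_type.get(type(obj))
        if fn is None:
            raise ValueError(f"No serializer registered for {type(obj).__name__}")
        return fn(obj)
```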

### Test Coverage ~40% → 96%
- NEW: 23 test files added
- tests/test_bootstrap.py, test_bootstrap_additional.py
- tests/test_loader.py, test_loader_additional.py
- tests/test_serialization_registry.py
- tests/test_model_serialization.py, test_model_serialization_additional.py
- tests/test_shared_unit.py, test_shared_additional.py, test_shared_rpc_flows.py
- tests/test_host_unit.py, test_host_additional.py, test_host_public.py,
  test_host_internal_ext.py, test_host_integration.py
- tests/test_client_entrypoint_extra.py
- tests/test_remaining_coverage.py (comprehensive edge cases)
- tests/test_fail_loud.py
- Reorganize integration tests to tests/integration/

### Type Annotations
- Add type hints to AttrDict, AttributeContainer, AsyncRPC
- Add __all__ export list to host.py
- Use dict[str, Any] instead of Dict for modern Python