Explicit gpu global copies #2260
base: main
Conversation
cscs-ci run
alexnick83
left a comment
Very good work. Some comments and questions follow.
memlet = edge.data
...
self.copy_shape = memlet.subset.size_exact()
Extraneous due to lines 37, 39, and 42?
True, I will update.
| """Remove size-1 dims; keep tile strides; default to [1] if none remain.""" | ||
| n = len(subset) | ||
| collapsed = [st for st, sz in zip(strides, subset.size()) if sz != 1] | ||
| collapsed.extend(strides[n:]) # include tiles |
What are these tile strides exactly, and why are they at strides[n:]? This implies that the length of strides may be greater than the length of the subset (n), but then how would the zip in the line above work?
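For illustration, a minimal sketch (with made-up values) of how the collapse behaves when strides carries trailing tile entries: zip truncates silently to the shorter input, which is presumably why the extra entries have to be re-attached explicitly via strides[n:].

    # Made-up example values; only meant to illustrate the zip/extend behavior.
    subset_sizes = [1, 64, 1, 32]      # per-dimension sizes of the subset (n = 4)
    strides = [2048, 32, 32, 1, 16]    # one extra "tile" stride appended at the end

    n = len(subset_sizes)
    collapsed = [st for st, sz in zip(strides, subset_sizes) if sz != 1]
    collapsed.extend(strides[n:])      # re-attach the trailing tile strides
    print(collapsed)                   # [32, 1, 16]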
state = copy_context.state
src_node, dst_node = copy_context.src_node, copy_context.dst_node

# 1. Ensure copy is not occurring within a kernel
I recall that in another PR, there was a discussion about the is_devicelevel helper methods. If the investigated issues have been resolved there, shouldn't you be using them here?
I think it was solved; I will use the devicelevel GPU helper and re-run the unit tests.
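As a rough sketch of what that check could look like if the scope helper is used (the is_devicelevel_gpu name and signature are assumed from the main branch; the wrapper function here is hypothetical):

    from dace.sdfg.scope import is_devicelevel_gpu  # assumed helper location

    def copy_is_outside_kernel(sdfg, state, src_node, dst_node) -> bool:
        # The explicit out-of-kernel copy only applies when neither endpoint
        # of the copy sits inside device-level (kernel) code.
        return not (is_devicelevel_gpu(sdfg, state, src_node)
                    or is_devicelevel_gpu(sdfg, state, dst_node))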
- We are not currently generating kernel code
- The copy occurs between two AccessNodes
- The data descriptors of source and destination are not views.
- The storage types of either src or dst is CPU_Pinned or GPU_Device
GPU global?
Yes, you are correct; it should be GPU global.
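For reference, a minimal sketch of the corrected storage-type condition (the helper name is hypothetical; the enum members are from dace.dtypes.StorageType):

    from dace import dtypes

    def involves_pinned_or_gpu_global(src_desc, dst_desc) -> bool:
        # GPU_Global (not GPU_Device) is the storage type for GPU global memory.
        relevant = (dtypes.StorageType.CPU_Pinned, dtypes.StorageType.GPU_Global)
        return src_desc.storage in relevant or dst_desc.storage in relevant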
else:
    # sanity check
    assert num_dims > 2, f"Expected copy shape with more than 2 dimensions, but got {num_dims}."
A bit overzealous; is this supposed to catch num_dims == 0?
| "Please implement this case if it is valid, or raise a more descriptive error if this path should not be taken." | ||
| ) | ||
# Potentially snychronization required if syncdebug is set to true in configurations
Suggested change:
- # Potentially snychronization required if syncdebug is set to true in configurations
+ # Potentially synchronization required if syncdebug is set to true in configurations
| """ | ||
| Generates GPU code for copying N-dimensional arrays using 2D memory copies. | ||
|
|
||
| Uses {backend}Memcpy2DAsync for the last two dimensions, with nested loops |
The description makes it sound like this does not handle all ND copies. Isn't this an issue since this is the "fallback" copy method? Maybe this is not relevant, i.e., exotic copies should be generated in the first place with a mapped tasklet?
I guess it is confusing: we try what we can using Memcpy2D, but for the exotic cases I just generate maps. I will update accordingly.
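To make the intended structure concrete, here is a rough, hypothetical sketch of the lowering the docstring describes: the last two dimensions become a single {backend}Memcpy2DAsync call and any leading dimensions become nested loops around it. Per-iteration pointer offsets, pitches, and variable names are placeholders, and exotic layouts would instead go through a mapped tasklet as described above.

    def emit_nd_copy_as_2d(copy_shape, backend="cuda"):
        """Emit backend code copying an N-D shape via nested loops + a 2D memcpy."""
        leading, (height, width) = copy_shape[:-2], copy_shape[-2:]
        lines, indent = [], ""
        for i, extent in enumerate(leading):          # one loop per leading dimension
            lines.append(f"{indent}for (int i{i} = 0; i{i} < {extent}; ++i{i}) {{")
            indent += "    "
        lines.append(f"{indent}{backend}Memcpy2DAsync(dst, dst_pitch, src, src_pitch, "
                     f"{width} * sizeof(T), {height}, {backend}MemcpyDeviceToDevice, stream);")
        for _ in leading:
            indent = indent[:-4]
            lines.append(f"{indent}}}")
        return "\n".join(lines)

    print(emit_nd_copy_as_2d((4, 8, 128, 256)))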
dace/transformation/passes/gpu_specialization/helpers/copy_strategies.py
continue

# If the subset has more than 2 dimensions and is not contiguous (represented as a 1D memcpy) then fallback to a copy kernel
if len(edge.data.subset) > 2 and not edge.data.subset.is_contiguous_subset(
Why isn't this check in OutOfKernelCopyStrategy.applicable?
GPU_ThreadBlock = ()  #: Thread-block code
GPU_ThreadBlock_Dynamic = ()  #: Allows rescheduling work within a block
GPU_Persistent = ()
GPU_Warp = ()
I believe we need this, but is it actually used anywhere in this PR?
It is not utilized for the copies, but it is part of the new GPU codegen. I may have made an error when porting dtypes from the GPU codegen branch to this branch.
…ategies.py Co-authored-by: alexnick83 <[email protected]>
…ategies.py Co-authored-by: alexnick83 <[email protected]>
This pass inserts GPU global copies as tasklets for explicit scheduling later.
Since this is a pass aimed at GPU specialization, I decided not to use copy nodes. (Note: right now, copy library nodes do not exist in the main branch, but I have a separate upcoming PR for copy and memset library nodes, with a pass that converts memcpy and memset kernels to use these library nodes.)
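As a very rough, hypothetical illustration of what representing such a copy as a tasklet can look like in an SDFG (this is not the actual pass from this PR; the array names, connector names, and placeholder tasklet body are made up):

    import dace
    from dace import dtypes

    sdfg = dace.SDFG("explicit_copy_sketch")
    sdfg.add_array("A", [128], dace.float64, storage=dtypes.StorageType.CPU_Pinned)
    sdfg.add_array("B", [128], dace.float64, storage=dtypes.StorageType.GPU_Global)

    state = sdfg.add_state()
    a = state.add_read("A")
    b = state.add_write("B")

    # Route the copy through a tasklet instead of a plain A -> B memlet edge, so a
    # later GPU-specialization pass can schedule and lower it explicitly (e.g. to a
    # backend memcpy call or a copy kernel).
    tasklet = state.add_tasklet("copy_A_to_B", {"_in"}, {"_out"},
                                "// body filled in by a later lowering step",
                                language=dtypes.Language.CPP)
    state.add_edge(a, None, tasklet, "_in", dace.Memlet("A[0:128]"))
    state.add_edge(tasklet, "_out", b, None, dace.Memlet("B[0:128]"))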