@ThrudPrimrose (Collaborator) commented Jan 6, 2026

This pass inserts GPU global copies as tasklets for explicit scheduling later.

Since this is a pass aimed at GPU specialization, I decided not to use copy library nodes. (Note: copy library nodes do not currently exist in the main branch, but I have a separate upcoming PR for copy and memset library nodes, together with a pass that converts memcpy and memset kernels to use them.)
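
As a rough illustration of what such a pass looks for (a minimal sketch, not this PR's implementation; the helper name and structure are assumptions), the candidates are AccessNode-to-AccessNode edges that touch GPU global memory, which the pass would then rewrite into explicit copy tasklets:

import dace
from dace import dtypes
from dace.sdfg import nodes

def find_gpu_global_copies(sdfg: dace.SDFG):
    # Collect AccessNode-to-AccessNode edges where either side lives in GPU
    # global memory; these are the copies the pass would turn into explicit
    # tasklets so they can be scheduled later.
    candidates = []
    for state in sdfg.states():
        for edge in state.edges():
            src, dst = edge.src, edge.dst
            if not (isinstance(src, nodes.AccessNode) and isinstance(dst, nodes.AccessNode)):
                continue
            storages = {sdfg.arrays[src.data].storage, sdfg.arrays[dst.data].storage}
            if dtypes.StorageType.GPU_Global in storages:
                candidates.append((state, edge))
    return candidates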

@phschaad (Collaborator) commented:

cscs-ci run

@alexnick83 (Contributor) left a comment:

Very good work. Some comments and questions follow.


memlet = edge.data

self.copy_shape = memlet.subset.size_exact()
Contributor:

Extraneous due to lines 37, 39, and 42?

@ThrudPrimrose (Collaborator, Author):

True, I will update.

"""Remove size-1 dims; keep tile strides; default to [1] if none remain."""
n = len(subset)
collapsed = [st for st, sz in zip(strides, subset.size()) if sz != 1]
collapsed.extend(strides[n:]) # include tiles
Contributor:

What are these tile strides exactly, and why do they sit in strides[n:]? This implies that strides may be longer than the subset (length n), but then how does the zip on the line above work?
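
For reference, a minimal stand-alone sketch of the collapse logic quoted above (plain lists stand in for the real subset and stride objects; names are illustrative only):

def collapse_strides(strides, sizes):
    # Drop strides of size-1 dimensions, keep any extra trailing ("tile")
    # strides beyond the subset, and default to [1] if nothing remains.
    n = len(sizes)
    collapsed = [st for st, sz in zip(strides, sizes) if sz != 1]
    collapsed.extend(strides[n:])
    return collapsed or [1]

# A [1, 64, 1, 32] copy with strides [2048, 32, 32, 1] keeps only the
# strides of the non-degenerate dimensions:
print(collapse_strides([2048, 32, 32, 1], [1, 64, 1, 32]))  # -> [32, 1]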

state = copy_context.state
src_node, dst_node = copy_context.src_node, copy_context.dst_node

# 1. Ensure the copy is not occurring within a kernel
Contributor:

I recall that in another PR there was a discussion about the is_devicelevel helper methods. If the issues investigated there have been resolved, shouldn't you be using them here?

@ThrudPrimrose (Collaborator, Author):

I think it was solved; I will use is_devicelevel_gpu and re-run the unit tests.
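
For context, a minimal sketch of how that check could look with the existing helper (the wrapper below is illustrative, not the PR's code):

from dace.sdfg.scope import is_devicelevel_gpu

def copy_is_outside_kernel(sdfg, state, src_node, dst_node) -> bool:
    # The copy qualifies only if neither endpoint sits inside a GPU kernel scope.
    return not (is_devicelevel_gpu(sdfg, state, src_node)
                or is_devicelevel_gpu(sdfg, state, dst_node))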

- We are not currently generating kernel code
- The copy occurs between two AccessNodes
- The data descriptors of source and destination are not views.
- The storage types of either src or dst is CPU_Pinned or GPU_Device
Contributor:

GPU global?

@ThrudPrimrose (Collaborator, Author):

Yes, you are correct, it should be GPU_Global.
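
Putting the listed conditions together (and leaving out the in-kernel check sketched earlier), a hedged sketch of such an applicability test; the function name and structure are assumptions, and it uses GPU_Global per the correction above:

import dace
from dace import data as dt, dtypes
from dace.sdfg import nodes

_ELIGIBLE = {dtypes.StorageType.CPU_Pinned, dtypes.StorageType.GPU_Global}

def copy_applicable(sdfg: dace.SDFG, src, dst) -> bool:
    # The copy must run between two AccessNodes ...
    if not (isinstance(src, nodes.AccessNode) and isinstance(dst, nodes.AccessNode)):
        return False
    src_desc, dst_desc = sdfg.arrays[src.data], sdfg.arrays[dst.data]
    # ... neither data descriptor may be a view ...
    if isinstance(src_desc, dt.View) or isinstance(dst_desc, dt.View):
        return False
    # ... and at least one side must live in pinned host or GPU global memory.
    return src_desc.storage in _ELIGIBLE or dst_desc.storage in _ELIGIBLE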


else:
# sanity check
assert num_dims > 2, f"Expected copy shape with more than 2 dimensions, but got {num_dims}."
Contributor:

A bit overzealous; is this supposed to catch num_dims == 0?

"Please implement this case if it is valid, or raise a more descriptive error if this path should not be taken."
)

# Potentially snychronization required if syncdebug is set to true in configurations
Contributor:

Suggested change
# Potentially snychronization required if syncdebug is set to true in configurations
# Potentially synchronization required if syncdebug is set to true in configurations

"""
Generates GPU code for copying N-dimensional arrays using 2D memory copies.

Uses {backend}Memcpy2DAsync for the last two dimensions, with nested loops
Contributor:

The description makes it sound like this does not handle all ND copies. Isn't this an issue since this is the "fallback" copy method? Maybe this is not relevant, i.e., exotic copies should be generated in the first place with a mapped tasklet?

@ThrudPrimrose (Collaborator, Author) commented Jan 28, 2026:

I agree it is confusing: we handle what we can with Memcpy2D, but for exotic copies I just generate maps. I will update the description accordingly.
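
To make the decomposition concrete, here is a conceptual sketch of the loop structure (NumPy stands in for the device copies; this is not the generated backend code):

import itertools
import numpy as np

def copy_nd_as_2d(dst: np.ndarray, src: np.ndarray) -> None:
    # Host-side loops over the leading N-2 dimensions; each iteration issues
    # one "2D memcpy" over the trailing two dimensions, the role played by
    # {backend}Memcpy2DAsync in the generated code.
    assert src.shape == dst.shape and src.ndim >= 2
    for idx in itertools.product(*(range(s) for s in src.shape[:-2])):
        dst[idx] = src[idx]

a = np.arange(2 * 3 * 4 * 5, dtype=np.float64).reshape(2, 3, 4, 5)
b = np.empty_like(a)
copy_nd_as_2d(b, a)
assert np.array_equal(a, b)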

continue

# If the subset has more than 2 dimensions and is not contiguous (i.e., not representable as a single 1D memcpy), fall back to a copy kernel
if len(edge.data.subset) > 2 and not edge.data.subset.is_contiguous_subset(
Contributor:

Why isn't this check in OutOfKernelCopyStrategy.applicable?
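
For readers unfamiliar with the contiguity condition, a small NumPy illustration of why a contiguous subset collapses to a single 1D memcpy while a strided one does not (illustrative only):

import numpy as np

a = np.arange(4 * 8 * 16, dtype=np.float32).reshape(4, 8, 16)
dst = np.empty((2, 8, 16), dtype=np.float32)

# a[1:3, :, :] covers the full trailing dimensions, so its elements are
# adjacent in memory: one flat copy of 2*8*16 elements at offset 1*8*16 works.
dst.reshape(-1)[:] = a.reshape(-1)[1 * 8 * 16:3 * 8 * 16]
assert np.array_equal(dst, a[1:3])

# a[:, 2:4, :] is strided in memory, so no single flat copy can express it;
# that is the case that falls back to a copy kernel (or strided 2D copies).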

GPU_ThreadBlock = () #: Thread-block code
GPU_ThreadBlock_Dynamic = () #: Allows rescheduling work within a block
GPU_Persistent = ()
GPU_Warp = ()
Contributor:

I believe we need this, but is it actually used anywhere in this PR?

@ThrudPrimrose (Collaborator, Author):

It is not used for the copies, but it is part of the new GPU codegen; I may have made an error when porting dtypes from the GPU codegen branch to this branch.
