Explicit gpu global copies #2260
base: main
Conversation
cscs-ci run
alexnick83
left a comment
Very good work. Some comments and questions follow.
memlet = edge.data
...
self.copy_shape = memlet.subset.size_exact()
Extraneous due to lines 37, 39, and 42?
True, I will update.
| """Remove size-1 dims; keep tile strides; default to [1] if none remain.""" | ||
| n = len(subset) | ||
| collapsed = [st for st, sz in zip(strides, subset.size()) if sz != 1] | ||
| collapsed.extend(strides[n:]) # include tiles |
What are these tile strides exactly, and why are they at strides[n:]? This implies that the length of strides may be greater than the length of the subset (n), but then how would the zip in the line above work?
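For illustration, a minimal sketch (with made-up values) of how the collapse behaves when strides carries trailing tile entries: zip truncates silently to the shorter input, which is presumably why the extra entries have to be re-attached explicitly via strides[n:].

    # Made-up example values; only meant to illustrate the zip/extend behavior.
    subset_sizes = [1, 64, 1, 32]      # per-dimension sizes of the subset (n = 4)
    strides = [2048, 32, 32, 1, 16]    # one extra "tile" stride appended at the end

    n = len(subset_sizes)
    collapsed = [st for st, sz in zip(strides, subset_sizes) if sz != 1]
    collapsed.extend(strides[n:])      # re-attach the trailing tile strides
    print(collapsed)                   # [32, 1, 16]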
state = copy_context.state
src_node, dst_node = copy_context.src_node, copy_context.dst_node

# 1. Ensure copy is not occurring within a kernel
I recall that in another PR, there was a discussion about the is_devicelevel helper methods. If the investigated issues have been resolved there, shouldn't you be using them here?
I think it was solved; I will use the devicelevel GPU helper and re-run the unit tests.
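As a rough sketch of what that check could look like if the scope helper is used (the is_devicelevel_gpu name and signature are assumed from the main branch; the wrapper function here is hypothetical):

    from dace.sdfg.scope import is_devicelevel_gpu  # assumed helper location

    def copy_is_outside_kernel(sdfg, state, src_node, dst_node) -> bool:
        # The explicit out-of-kernel copy only applies when neither endpoint
        # of the copy sits inside device-level (kernel) code.
        return not (is_devicelevel_gpu(sdfg, state, src_node)
                    or is_devicelevel_gpu(sdfg, state, dst_node))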
- We are not currently generating kernel code
- The copy occurs between two AccessNodes
- The data descriptors of source and destination are not views.
- The storage types of either src or dst is CPU_Pinned or GPU_Device
GPU global?
Yes, you are correct; it should be GPU global.
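For reference, a minimal sketch of the corrected storage-type condition (the helper name is hypothetical; the enum members are from dace.dtypes.StorageType):

    from dace import dtypes

    def involves_pinned_or_gpu_global(src_desc, dst_desc) -> bool:
        # GPU_Global (not GPU_Device) is the storage type for GPU global memory.
        relevant = (dtypes.StorageType.CPU_Pinned, dtypes.StorageType.GPU_Global)
        return src_desc.storage in relevant or dst_desc.storage in relevant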
else:
    # sanity check
    assert num_dims > 2, f"Expected copy shape with more than 2 dimensions, but got {num_dims}."
A bit overzealous; is this supposed to catch num_dims == 0?
| "Please implement this case if it is valid, or raise a more descriptive error if this path should not be taken." | ||
| ) | ||
# Potentially snychronization required if syncdebug is set to true in configurations
Suggested change:
- # Potentially snychronization required if syncdebug is set to true in configurations
+ # Potentially synchronization required if syncdebug is set to true in configurations
| """ | ||
| Generates GPU code for copying N-dimensional arrays using 2D memory copies. | ||
|
|
||
| Uses {backend}Memcpy2DAsync for the last two dimensions, with nested loops |
The description makes it sound like this does not handle all ND copies. Isn't this an issue since this is the "fallback" copy method? Maybe this is not relevant, i.e., exotic copies should be generated in the first place with a mapped tasklet?
I guess it is confusing: we try what we can using Memcpy2D, but for the exotic cases I just generate maps. I will update accordingly.
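To make the intended structure concrete, here is a rough, hypothetical sketch of the lowering the docstring describes: the last two dimensions become a single {backend}Memcpy2DAsync call and any leading dimensions become nested loops around it. Per-iteration pointer offsets, pitches, and variable names are placeholders, and exotic layouts would instead go through a mapped tasklet as described above.

    def emit_nd_copy_as_2d(copy_shape, backend="cuda"):
        """Emit backend code copying an N-D shape via nested loops + a 2D memcpy."""
        leading, (height, width) = copy_shape[:-2], copy_shape[-2:]
        lines, indent = [], ""
        for i, extent in enumerate(leading):          # one loop per leading dimension
            lines.append(f"{indent}for (int i{i} = 0; i{i} < {extent}; ++i{i}) {{")
            indent += "    "
        lines.append(f"{indent}{backend}Memcpy2DAsync(dst, dst_pitch, src, src_pitch, "
                     f"{width} * sizeof(T), {height}, {backend}MemcpyDeviceToDevice, stream);")
        for _ in leading:
            indent = indent[:-4]
            lines.append(f"{indent}}}")
        return "\n".join(lines)

    print(emit_nd_copy_as_2d((4, 8, 128, 256)))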
dace/transformation/passes/gpu_specialization/helpers/copy_strategies.py
continue

# If the subset has more than 2 dimensions and is not contiguous (represented as a 1D memcpy) then fallback to a copy kernel
if len(edge.data.subset) > 2 and not edge.data.subset.is_contiguous_subset(
Why isn't this check in OutOfKernelCopyStrategy.applicable?
GPU_ThreadBlock = ()  #: Thread-block code
GPU_ThreadBlock_Dynamic = ()  #: Allows rescheduling work within a block
GPU_Persistent = ()
GPU_Warp = ()
I believe we need this, but is it actually used anywhere in this PR?
It is not utilized for the copies, but it is part of the new GPU codegen. I may have made an error when porting dtypes from the GPU codegen branch to this branch.
…ategies.py Co-authored-by: alexnick83 <[email protected]>
…ategies.py Co-authored-by: alexnick83 <[email protected]>
This pass inserts GPU global copies as tasklets for explicit scheduling later.
Since this is a pass aimed at GPU specialization, I decided not to use copy nodes. (Note: right now, copy library nodes do not exist in the main branch, but I have a separate upcoming PR for copy and memset library nodes, with a pass that converts memcpy and memset kernels to use these library nodes.)
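As a very rough, hypothetical illustration of what representing such a copy as a tasklet can look like in an SDFG (this is not the actual pass from this PR; the array names, connector names, and placeholder tasklet body are made up):

    import dace
    from dace import dtypes

    sdfg = dace.SDFG("explicit_copy_sketch")
    sdfg.add_array("A", [128], dace.float64, storage=dtypes.StorageType.CPU_Pinned)
    sdfg.add_array("B", [128], dace.float64, storage=dtypes.StorageType.GPU_Global)

    state = sdfg.add_state()
    a = state.add_read("A")
    b = state.add_write("B")

    # Route the copy through a tasklet instead of a plain A -> B memlet edge, so a
    # later GPU-specialization pass can schedule and lower it explicitly (e.g. to a
    # backend memcpy call or a copy kernel).
    tasklet = state.add_tasklet("copy_A_to_B", {"_in"}, {"_out"},
                                "// body filled in by a later lowering step",
                                language=dtypes.Language.CPP)
    state.add_edge(a, None, tasklet, "_in", dace.Memlet("A[0:128]"))
    state.add_edge(tasklet, "_out", b, None, dace.Memlet("B[0:128]"))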