Nvidia target via NVRTC and Nim ↦ CUDA DSL #487
Conversation
Sorry for taking so long to review and discuss.

Approach

I think this is the best way forward for several reasons:
Regarding OpenCL, I spent some time looking into how to generate it from LLVM IR, but it seems like we need external tools to convert LLVM IR to SPIR-V (https://llvm.org/devmtg/2021-11/slides/2021-SPIR-V-SupportinLLVMandClang.pdf), and generating source code would avoid that. This is only relevant for Intel GPUs though, but with the rise of AI, it might be that they become fast enough to accelerate KZG commitments. Trusted external binaries should be reduced to the minimum, as we need the threat model/attack surface to be as small as possible, i.e. we trust Nim, Clang, nvcc and that's all.

Regarding Apple Metal, I spent a lot of time investigating Julia's compilation pipeline (see https://github.com/JuliaGPU/GPUCompiler.jl), but maintaining a fork of LLVM to allow downgrading IR to be compatible with Metal IR is a bit too much.

On the header file

The main appeal of the NVRTC approach is not needing a header file and so not dealing with the path issues that plagued Arraymancer (BLAS and CUDA config, see: https://github.com/mratsim/Arraymancer/blob/v0.7.33/nim.cfg). There should be a

On Nim feature support

It would be much cleaner to support generics, static int and static enums, which should be possible if we use a typed macro, and should be possible with untyped if we rewrite the generics to C++ templates, since CUDA/HIP works in C++ mode by default (but not OpenCL and maybe not Metal, so not portable).
Another note

I would still keep the LLVM IR code generator, as it would enable the following use cases:
Just updated the code to:
The main thing missing now is smarter detection of the CUDA paths on the user's machine (right now they are hardcoded). We can in principle also adapt the
mratsim left a comment
I wonder if by using `dynlib: "libnvrtc-builtins.so"` we can avoid hardcoding the path.
I think it should work for the .so, and it helps on Windows as well. I'm unsure about the static library linking step though.
```nim
# Add the device runtime (provides printf support)
## NOTE: Linking requires you to pass the path to `libcudadevrt.a` at CT
res = cuLinkAddFile(linkState, CU_JIT_INPUT_LIBRARY,
  "/usr/local/cuda/targets/x86_64-linux/lib/libcudadevrt.a", # Adjust path as needed
```
```nim
let threadIdx* = NvThreadIdx()

## Similar for procs. They don't need any implementation, as they won't ever be actually called.
proc printf*(fmt: string) {.varargs.} = discard
```
I think we can declare it with something like

```nim
proc cu_printf*(fmt: cstring): cint {.sideeffect, importc: "printf", dynlib: "libnvrtc-builtins.so", varargs, discardable, tags:[WriteIOEffect].}
```

and not need to import cuda.h and deal with paths.
```nim
## Similar for procs. They don't need any implementation, as they won't ever be actually called.
proc printf*(fmt: string) {.varargs.} = discard
proc memcpy*(dst, src: pointer, size: int) = discard
```
IMPORTANT NOTE: For LLVM we generate `array_t` types for the finite field elements. By doing this we make it impossible to just copy over a Constantine finite field element or elliptic curve point (which are also an `array_t` type). Therefore, we have a `CUfunctionLLVM` type, which is used to differentiate between different `execCuda` calls, based on their "origin" (i.e. LLVM or NVRTC backends). Based on that backend we either allow passing simple structs by their host pointer or force a copy.
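A minimal sketch of the distinction (illustrative only: the real `execCuda` is a macro and `CUfunction` comes from the CUDA driver bindings; both are stubbed here):

```nim
type
  CUfunction = object                   # stand-in for the driver API kernel handle
  CUfunctionLLVM = distinct CUfunction  # marks kernels coming from the LLVM backend

# Dispatch on the kernel's origin to pick the copy behaviour:
proc execCuda(kernel: CUfunction; res, inputs: tuple) = discard     # NVRTC path: host pointers allowed
proc execCuda(kernel: CUfunctionLLVM; res, inputs: tuple) = discard # LLVM path: forces a copy
```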
This was added for a reason after all in 5d66b52
By mapping them to a regular cast.
```nim
proc setZero(a: var BigInt) {.device.}
proc setOne(a: var BigInt) {.device.}
proc add(r: var BigInt, a, b: BigInt) {.device.}
proc sub(r: var BigInt, a, b: BigInt) {.device.}
proc mul(r: var BigInt, a, b: BigInt) {.device.}
proc ccopy(a: var BigInt, b: BigInt, condition: bool) {.device.}
proc csetZero(r: var BigInt, condition: bool) {.device.}
proc csetOne(r: var BigInt, condition: bool) {.device.}
proc cadd(r: var BigInt, a: BigInt, condition: bool) {.device.}
proc csub(r: var BigInt, a: BigInt, condition: bool) {.device.}
proc doubleElement(r: var BigInt, a: BigInt) {.device.}
proc nsqr(r: var BigInt, a: BigInt, count: int) {.device.}
proc isZero(r: var bool, a: BigInt) {.device.}
proc isOdd(r: var bool, a: BigInt) {.device.}
proc neg(r: var BigInt, a: BigInt) {.device.}
proc cneg(r: var BigInt, a: BigInt, condition: bool) {.device.}
proc shiftRight(r: var BigInt, k: uint32) {.device.}
proc div2(r: var BigInt) {.device.}
```
To fix:
```nim
const code = cuda:
  proc sum() {.device.} =
    let inputIdx = 0
    let rateIdx = 0
    if inputIdx > 0 and rateIdx > 0:
      discard
    return
echo code
```
which previously produced:
```
extern "C" __device__ void sum(){
long long inputIdx = 0;
long long rateIdx = 0;
if (((0 < inputIdx); && (0 < rateIdx);)) {
;
};
return ;
};
```
and now:
```
extern "C" __device__ void sum(){
long long inputIdx = 0;
long long rateIdx = 0;
if (((0 < inputIdx) && (0 < rateIdx))) {
;
};
return ;
};
```
This allows the user to e.g. allocate and copy memory to a CUDA device before calling `execute`.
so that one can e.g. copy to a global symbol before execution
i.e. to write
```cuda
extern __shared__ int foo[];
```
we will now support
```nim
var foo {.cuExtern, shared.}: array[0, int]
```
(the 0 size is the current placeholder for how to designate a `[]` array from Nim)
Allows mapping a proc to a custom name. Useful for names that we can't write due to Nim limitations (e.g. starting with an underscore, or names that match Nim keywords).
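A hedged sketch of what this looks like; `cuName` is a placeholder for the PR's actual pragma spelling:

```nim
# Hypothetical pragma name: maps the Nim-legal identifier `syncthreads` to the
# CUDA name `__syncthreads`, which Nim cannot spell (leading underscores).
proc syncthreads*() {.device, cuName: "__syncthreads".} = discard
```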
One can either define a Nim `const` for a variable that is already a constant at Nim compile time, or use
```nim
var foo {.constant.}: theType
```
if one wishes to copy to the symbol before kernel execution. This is
useful for global constants that are not filled in at CUDA compile
time, but before execution. For example:
```nim
# Filled with `copyToSymbol` at runtime from host!
var rc16 {.constant.}: array[30, array[BABYBEAR_WIDTH, BigInt]]
var matInternalDiagM1 {.constant.}: array[BABYBEAR_WIDTH, BigInt]
var montyInverse {.constant.}: BigInt
```
And in the host code:
```nim
var nvrtc = initNvrtc(CudaCode)
nvrtc.compile()
nvrtc.getPtx()
nvrtc.load()
var p2bb = Poseidon2BabyBear.init()
# copy Poseidon2 constants to CUDA kernel
nvrtc.copyToSymbol("rc16", p2bb.rc16)
nvrtc.copyToSymbol("matInternalDiagM1", p2bb.matInternalDiagM1)
nvrtc.copyToSymbol("montyInverse", p2bb.montyInverse)
```
Need to finalize the logic of mapping 64 bit limbs to 32 bit limb constants
Merging this for now. Next steps:
(Note: this is a draft, because it is a) a proof of concept and b) still depends on `nimcuda` for simplicity)

Table of contents & introduction

- CUDA execution
- NVRTC compiler helper
- Important note about CUDA library
- CUDA code generator
- Notes on the `cuda` design
- Profiling Nvidia code
- A more complex example

This (draft) PR adds an experimental alternative to generating code targeting Nvidia GPUs. Instead of relying on LLVM to generate Nvidia PTX instructions, this PR adds 3 pieces:
- an `execCuda` macro to handle executing compiled CUDA kernels, similar to the one defined in `codegen_nvidia.nim`: https://github.com/mratsim/constantine/blob/master/constantine/math_compiler/codegen_nvidia.nim (if we decide to go ahead with this PR, I'll merge the two. They are compatible, the new one just has a few extra features),
- a helper to interface with the NVRTC (runtime compilation library) compiler and compile a string of CUDA code,
- a `cuda` macro that generates CUDA code from Nim.
A few words on each of these first:
CUDA execution
Starting with this as it is already present in Constantine. Once one has
a compiled CUDA kernel and wishes to execute it, in principle one needs
to:
- copy the data to the device for all arguments that are not pure value types
- call `cuLaunchKernel`, making sure to pass all parameters as an array of pointers
Instead of having to do this manually, we use a typed macro, which
determines the required action based on the parameters passed to it.
The basic usage looks like:
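Roughly like the following sketch (the exact parameter names and how the launch configuration is passed are assumptions here; `fn` stands for a loaded kernel handle):

```nim
# `r` receives the output, `a` and `b` are inputs
var r: array[4, uint32]
let a = [1'u32, 2, 3, 4]
let b = [5'u32, 6, 7, 8]
execCuda(fn, res = (r,), inputs = (a, b))
# after the call `r` holds the data copied back from the device
```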
where `res` and `inputs` are tuples (to support heterogeneous types).

Arguments passed as `res` are treated as output parameters. They will both be copied to the device and afterwards back to the local identifiers.

`inputs` will either be passed by value or copied, depending on whether the data is a `ref` type or not. NOTE: A currently not implemented feature is deep copying data structures which contain references / pointers themselves. This is important in particular if one wishes to pass data as a struct of arrays (SoA).
In practice, in the context of the code of this PR, you don't directly interact with `execCuda`. This is done via the NVRTC compiler in the next section.

NOTE: The parameters will be passed in the order:

- the `res` tuple in its tuple order
- the `inputs` tuple in their order

This means that your output arguments currently must be the first arguments of the kernel!
NVRTC compiler helper
This is essentially an equivalent of the LLVM based
`NvidiaAssembler` part of the LLVM backend,
https://github.com/mratsim/constantine/blob/master/constantine/math_compiler/codegen_nvidia.nim#L501-L512
Similarly to all CUDA work, lots of boilerplate code is required to
initialize the device, set up the compilation pipeline, call the
compiler on the CUDA string etc. As most of this is identical in the
majority of use cases, we can automate it away. NOTE: We will likely want to eventually add some context or config object to store e.g. specific parameters to pass to the NVRTC compiler.
As an example, let's look at what the Saxpy
example
from the CUDA documentation looks like for us now.
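A hedged sketch of the flow (the `execute` call's exact signature is an assumption; the rest follows the API used further below):

```nim
const SaxpyCuda = """
extern "C" __global__ void saxpy(float *out, float a, float *x, float *y, size_t n) {
  size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < n)
    out[tid] = a * x[tid] + y[tid];
}
"""

var nvrtc = initNvrtc(SaxpyCuda)
nvrtc.compile()
nvrtc.getPtx()
nvrtc.load()

# set up the input data for the kernel
const N = 1 shl 20
var
  a = 2.0'f32
  x = newSeq[float32](N)
  y = newSeq[float32](N)
  output = newSeq[float32](N)
for i in 0 ..< N:
  x[i] = float32 i
  y[i] = 1.0'f32

# output arguments come first (see the parameter-order note above)
nvrtc.execute("saxpy", res = (output,), inputs = (a, x, y, N.csize_t))
```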
Clearly, most of the steps (`compile`, `getPtx`) could also just be done as part of the `execute`. I just haven't merged them yet.

We can see that the majority of the code is now setting up the input data for the kernel.
Important note about CUDA library
To fully support all CUDA features using NVRTC, we need to use the `header` pragma in the CUDA wrapper. See this `nimcuda` issue about the problem: SciNim/nimcuda#27

(Note: the current existing CUDA wrapper in Constantine also avoids the `header` pragma. Once we switch to using our own, we'll have to make that change and thus need the code below)
This implies that we need to know the path to the CUDA libraries at
compile time. Given that most people on linux systems tend to install
CUDA outside their package manager, this implies we need to pass the
path to the compiler.
The `runtime_compile.nim` file contains the following variables:
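A hedged sketch of the kind of constants meant (the actual names and defaults in `runtime_compile.nim` may differ):

```nim
# `-d:CudaPath=...` overrides this default at compile time
const CudaPath* {.strdefine.} = "/usr/local/cuda"
const CudaIncludePath* = CudaPath & "/include"
const CudaLibPath* = CudaPath & "/lib64"
```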
You can compile a program using `-d:CudaPath=<path/to/your/cuda>` to set the paths accordingly.
CUDA code generator
This brings us to the most interesting part of this PR. In the example
above we simply had a string of raw CUDA code. But for anyone who tends
to write Nim, this is likely not the most attractive nor elegant
solution. So instead for the Saxpy example from above, we can write:
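A hedged sketch (pragma and builtin spellings follow the mapping described below; the PR's actual example may differ in details):

```nim
const SaxpyCuda = cuda:
  proc saxpy(output: ptr UncheckedArray[cfloat],  # output first, see the `execCuda` note
             a: cfloat,
             x, y: ptr UncheckedArray[cfloat],
             n: cint) {.global.} =
    let tid = int(blockIdx.x * blockDim.x + threadIdx.x)
    if tid < n:
      output[tid] = a * x[tid] + y[tid]
```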
Due to the somewhat restricted nature of writing CUDA code anyhow, the vast majority of practical code is already supported. You likely won't think of CUDA devices as your first choice for complex string handling or `ref object` madness. Note that the features you'd expect to see all work. We can access arrays, we have more sane types (`UncheckedArray` instead of raw pointers), can access the CUDA special block / thread related variables etc. The latter is implemented by defining dummy types in `runtime_compile.nim`, which are only there to make the Nim compiler happy as part of the typed macro pass. Also, typical CUDA annotations like `__global__` are mapped to Nim pragmas as you can see.
Important Nim features that are currently not supported:
- the `result` variable
- `seq`, `string` etc.
- `openArray`
- `while` loops (simple)
- `case` statements (should be straightforward, but likely not very useful)
- `echo` on device (but you can `printf`, see below!)
- `staticFor`. Constantine's is currently slightly broken in the macro.
Important Nim features that do work:
- `if` statements
- `for` loops
- passing a `seq[T]` (for T being value types!) to a kernel (technically a feature of `execCuda`) and using `seq[T]` as a return type
- `templates` in the `cuda` macro
- to access a constant defined outside the `cuda` macro, you can create a template with a `static` body accessing the constant. The template body will be replaced by the constant value
- basic types (`cfloat`, `cint` etc)
- `when` statements to avoid a runtime branch
- common operations (dereferencing, casting, object constructors, …)

Note: you cannot construct an object containing a statically sized array from a runtime value (or in C / C++ for that matter). So `BigInt(limbs: someArray)` is invalid. You'll need to `memcpy` / manually assign data. Statically known arrays work though.
Important CUDA features currently not supported:
- `__shared__` memory (just needs to implement the pragma)
- `__syncthreads` and similar functions (also just need a Nim name for them and then map them to their CUDA name)
Important CUDA features that do work:
- `blockIdx`, `blockDim`, `threadIdx`
- `__global__`, `__device__`, `__forceinline__` pragmas (via equivalent Nim pragmas without the `_`)
- the `asm` statement
- `printf` on device (obviously only use this to debug; see the short sketch after this list)
- `memcpy`
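A small hedged sketch of device-side `printf` together with the thread/block builtins:

```nim
const DebugCode = cuda:
  proc whoAmI() {.global.} =
    # prints which block / thread a given lane runs on (debugging only!)
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x)
```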
Notes on the `cuda` design

Initially I started out with an `untyped` macro and thought I'd just have the Nim code be only one layer above being a pure string literal, essentially just mapping Nim constructs directly to fixed strings. But I quickly realized that having a `typed` macro would be much better, because we could actually access type information and use templates in the body (as they are expanded before the typed macro is executed!).
I think it is likely possible to go one step further than the current code and access Nim procs defined outside the `cuda` macro, as long as they are in scope (and not overloaded!). With a typed macro we can get their bodies, insert them into the CUDA context and treat them as `__device__` functions.

I mainly think about this not really for the purpose of sharing lots of
code between the CUDA target and other targets. While the code sharing
could theoretically be quite beneficial, I think it likely won't be very
practical. Most likely different targets require a very different
approach in many details. E.g. the low level primitives using inline PTX
instructions. At a higher level one will need different approaches due
to the trade offs needed for efficient parallelism on Nvidia GPUs
compared to a CPU approach.
However, what I do think would be very useful is to be able to split the
`cuda` macro into multiple pieces (similar to how one writes Nim macros really). Say one `cuda` call for type definitions, one for some device functions etc. But due to the typed nature, this implies all the defined
types and functions would need to be visible in a global scope, which
currently would not be the case.
Profiling Nvidia code
Although it is probably obvious, it is worth mentioning that you can of
course use an Nvidia profiler (
`nvprof` or `ncu`) on Nim binaries which use this feature.
A more complex example
For a more complex example, see the BigInt example file part of this PR.
There we implement modular addition for finite field elements, similar
to the current existing implementation for the LLVM target (using inline
PTX instructions).
It shows how one defines a type on the CUDA device, accesses a constant
from Constantine (the field modulus) using a
`template` with a `static` body, how to construct objects on device and more. You'll see that the
code essentially looks like normal host code.
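As a rough illustration of the pattern (not the actual example file; the constant and type layout here are only illustrative):

```nim
# BN254_Snarks field modulus as 8 little-endian uint32 limbs, defined on the host
const M = [0xD87CFD47'u32, 0x3C208C16'u32, 0x6871CA8D'u32, 0x97816A91'u32,
           0x8181585D'u32, 0xB85045B6'u32, 0xE131A029'u32, 0x30644E72'u32]

const FieldAddCuda = cuda:
  type BigInt = object
    limbs: array[8, uint32]

  # the template's `static` body is replaced by the constant's value
  template modulus(): untyped =
    static: M

  proc addmod(r: var BigInt, a, b: BigInt) {.device.} =
    # add limbs with carries, then conditionally subtract `modulus()` (elided)
    discard
```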
Be aware of course that to actually achieve really high performance, just
launching lots of blocks with many threads won't give you an O(1-10k)
(depending on # of CUDA cores) speedup over a single thread. You'll
need to make sure to first go down the rabbit hole of thinking about
memory coalescence, blocks, warps and all that… As an example, a simple
benchmark performing additions of 2^25 pairs of finite field elements
of `BN254_Snarks` using a `BigInt` type which stores 8 `uint32` limbs leads to only a 10x speedup compared to a single CPU core (using our very optimized CPU code of course). `nvprof` shows that the memory performance in that case is only 12.5%, because each thread has to jump
over the 8 limbs of the neighboring threads/lanes. This leads to non
coalesced memory access and causes a massive performance penalty. I
mention this in particular, because implementing a structure of arrays (SoA) approach for the data (where we have a single `BigInts` type, which has one array for limb 0, one for limb 1 and so on) is currently not supported in the context of copying data to the device via `execCuda`. We need to extend the "when and what to copy" logic in the macro first.
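For reference, the struct-of-arrays layout meant here would look roughly like this (names illustrative):

```nim
# limbs[l][i] holds limb `l` of element `i`, so neighbouring threads reading the
# same limb index access contiguous (coalesced) memory
type BigInts[N: static int] = object
  limbs: array[8, array[N, uint32]]
```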