Numerical differences between C and OpenCL #2353

llandsmeer · 2026-01-22T22:34:08Z

Hi, I usually target GPUs via JAX, but I now need to target multiple devices (AMD, NVIDIA, and Intel) and speed up some non-trivial code that doesn't play nicely with existing ML libraries. Futhark looks very promising, but after trying it for a bit I have some questions.

I'm seeing a big difference between the C and the OpenCL output, but since I have only just started using Futhark, I'm not sure whether it's a bug or whether I'm doing something wrong.

This is the code I'm currently using (neural simulation):

def step dt (a:f32) (du:f32) (v:f32,u:f32) (i0:f32,_) =
  if v > 30.0
  then
    (-65.0, u + du)
  else
    let v1 = v + 0.5 * dt * ((0.04 * v + 5.0) * v + 140.0 - u + i0)
    let v2 = v1 + 0.5 * dt * ((0.04 * v1 + 5.0) * v1 + 140.0 - u + i0)
    let u1 = u + dt * a * (0.2 * v2 - u)
    in if v2 > 30.0
    then
      (100.0, u1)
    else
      (v2, u1)

def rs = step 1 0.02 8.0

def main = scan (rs) (-65.0, -13.0) ((replicate 100 (4.0:f32,0.0:f32))) |> map (.0)

which has very different outputs on opencl & c:

Sequential C (expected output):

$ futhark c test.fut && echo | ./test
[-64.045006f32, -63.225807f32, -62.473724f32, -61.741394f32, -60.987762f32, -60.167908f32, -59.221390f32, -58.052788f32, -56.489525f32, -54.173897f32, -50.226768f32, -41.876823f32, -15.734381f32, 100.000000f32, -65.000000f32, -72.748299f32, -76.124908f32, -76.784073f32, -76.746216f32, -76.578667f32, -76.389313f32, -76.197487f32, -76.006729f32, -75.817673f32, -75.630447f32, -75.445045f32, -75.261467f32, -75.079666f32, -74.899651f32, -74.721390f32, -74.544868f32, -74.370056f32, -74.196953f32, -74.025536f32, -73.855774f32, -73.687653f32, -73.521172f32, -73.356293f32, -73.192993f32, -73.031281f32, -72.871117f32, -72.712494f32, -72.555389f32, -72.399796f32, -72.245674f32, -72.093040f32, -71.941849f32, -71.792099f32, -71.643761f32, -71.496834f32]

OpenCL (tested for both Intel(R) Iris(R) Xe Graphics (laptop), NVIDIA RTX PRO 6000 (workstation), wrong output):

$ futhark opencl test.fut && echo | ./test
[-64.045006f32, -106.590317f32, -110.178589f32, -105.268761f32, -110.339531f32, -105.525635f32, -109.471146f32, -104.199127f32, -110.456985f32, -105.720779f32, -109.601990f32, -104.413399f32, -109.782440f32, -104.668556f32, -108.836899f32, -103.409035f32, -110.536407f32, -105.857605f32, -109.691315f32, -104.564346f32, -109.869431f32, -104.816589f32, -108.934814f32, -103.570450f32, -109.999634f32, -105.009689f32, -109.077614f32, -103.771988f32, -109.275146f32, -104.014389f32, -108.269485f32, -102.873672f32, -103.122574f32, -101.693314f32, -100.745041f32, -102.010101f32, -100.699509f32, -101.949875f32, -100.942200f32, -102.253708f32, -100.666107f32, -101.903717f32, -100.906120f32, -102.205826f32, -100.856064f32, -102.148209f32, -101.114677f32, -102.426315f32, -100.643425f32, -101.871078f32]

At first I thought it was a numerical error due to the hardware, but:

* Lowering the timestep does not make a difference (and the system is known to be stable for this timestep)
* Switching to f64 on the workstation GPU doesn't change the output
* Running the same code in JAX on the workstation GPU gives the correct answer

Futhark version (latest non-nightly precompiled binary)

$ futhark --version
Futhark 0.25.34.
git: b88976e26b4a85a7e46138038d09a2be5a342d6b
Compiled with GHC 9.8.4.

Also running on the GPU in CUDA mode failed with

$ futhark cuda test.fut && echo | ./test
./test: During CUDA initialisation:
NVRTC compilation failed.

nvrtc: error: invalid value for --gpu-architecture (-arch)

1. What is going on? Is such behaviour expected for Futhark?
2. (Bit more general question) Why does scan have type a -> a -> a and not a -> b -> a?
3. (Completely different question) Is full support for custom derivatives for reverse AD planned? Being able to define custom derivatives is needed for gradient calculation in neural models. Same for checkpointing.
4. How can I enable compilation for CUDA on the PRO 6000?

coancea (Collaborator) · 2026-01-23T04:18:42Z

This is a very quick and incomplete answer: scan in Futhark represents a prefix-sum computation, hence it requires an associative operator. Yours is not, which I suspect is why it leads to different results. Moreover, the differentiation of scan also requires an associative operator, i.e. it gives incorrect results if the operator is not associative.

Since we are talking about associativity, this also answers the type question: a -> b -> a does not make sense as far as associativity is concerned. (Folds have sequential semantics, and their parallelization can be achieved by a map-reduce or map-scan composition, where both reduce and scan require associative operators, i.e. a -> a -> a.)

With best regards, Cosmin
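
To make the remark about map-scan composition concrete, here is a minimal sketch. It is not taken from the Futhark documentation, and the names `affine`, `compose`, `linrec`, and `linrec_seq` are made up for illustration. It assumes a linear recurrence x_i = ms[i] * x_{i-1} + cs[i], which can be rewritten as a scan over affine maps because function composition is associative. The neuron update in the original post is nonlinear and has a reset branch, so this trick does not carry over to it directly.

```futhark
-- An affine map x -> m * x + c, stored as the pair (m, c).
-- (Hypothetical helper type, introduced only for this sketch.)
type affine = (f32, f32)

-- Compose two affine maps: the first argument is applied first, then the
-- second. Function composition is associative, so this is a valid scan
-- operator (up to floating-point rounding). Its neutral element is (1, 0).
def compose ((m1, c1): affine) ((m2, c2): affine) : affine =
  (m2 * m1, m2 * c1 + c2)

-- Parallel version: scan composes the maps, then map applies each
-- composed map to the initial value x0.
def linrec [n] (x0: f32) (ms: [n]f32) (cs: [n]f32) : [n]f32 =
  scan compose (1.0, 0.0) (zip ms cs)
  |> map (\(m, c) -> m * x0 + c)

-- Sequential reference with the same meaning, written as a loop.
def linrec_seq [n] (x0: f32) (ms: [n]f32) (cs: [n]f32) : [n]f32 =
  let (_, xs) =
    loop (x, xs) = (x0, replicate n 0f32)
    for i < n do
      let x' = ms[i] * x + cs[i]
      let xs' = xs with [i] = x'
      in (x', xs')
  in xs
```

For example, with x0 = 1, ms = [2, 2, 2] and cs = [1, 1, 1], both functions should produce 3, 7, 15.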

llandsmeer (Author) · 2026-01-23T11:17:24Z

Seems like I blindly assumed futhark scan() and JAX scan() would behave the same; thanks a lot for the quick clarification! I guess a loop statement would be the correct implementation then as it has to be sequential anyway.
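
A minimal sketch of what such a loop could look like for the model in the original post. The body of `step` is copied from that post, except that the input current is assumed here to be a plain f32 instead of half of a pair; `simulate` and its parameter names are hypothetical.

```futhark
-- One update of the neuron model, as in the original post, but taking
-- the input current i0 as a plain f32 (an assumption made for this sketch).
def step (dt: f32) (a: f32) (du: f32) (v: f32, u: f32) (i0: f32) : (f32, f32) =
  if v > 30.0
  then (-65.0, u + du)
  else
    let v1 = v + 0.5 * dt * ((0.04 * v + 5.0) * v + 140.0 - u + i0)
    let v2 = v1 + 0.5 * dt * ((0.04 * v1 + 5.0) * v1 + 140.0 - u + i0)
    let u1 = u + dt * a * (0.2 * v2 - u)
    in if v2 > 30.0 then (100.0, u1) else (v2, u1)

-- Run the model strictly left to right and record v after every step.
def simulate [n] (v0: f32) (u0: f32) (currents: [n]f32) : [n]f32 =
  let (_, _, vs) =
    loop (v, u, vs) = (v0, u0, replicate n 0f32)
    for i < n do
      let (v', u') = step 1.0 0.02 8.0 (v, u) currents[i]
      let vs' = vs with [i] = v'
      in (v', u', vs')
  in vs

-- Same initial state and input as the original scan-based main.
def main : [100]f32 = simulate (-65.0) (-13.0) (replicate 100 4.0)
```

Because the loop applies `step` in a fixed order, the result should no longer depend on how a backend groups the operator applications.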

athas (Maintainer) · 2026-01-23T14:28:23Z

Cosmin already answered the interesting question, so I'll answer the other ones:

What is going on? Is such behaviour expected for futhark?

The error is a bug somewhere. Futhark does run-time compilation of embedded CUDA kernels, and queries the platform/hardware in order to pass options to the run-time compiler provided by NVIDIA (NVRTC). There is a function that determines the value of the -arch argument. Since we do not have access to every GPU, some of this is essentially untested and based on our reading of the documentation. If you pass -D to the Futhark program, it should print what options it is passing to NVRTC. If you use `--nvrtc-option=--gpu-architecture=foo`, you can override the GPU architecture. If you can figure out what works for the PRO 6000, please let us know. It's a fancier GPU than any of the Futhark developers have ever used.

llandsmeer (Author) · 2026-01-24T01:56:05Z

-D output:

NVRTC compile options: -arch compute_120

This looks like it should be correct according to the docs, but my nvcc (Ubuntu 24 LTS default) is apparently too old.

This runs correctly though:

echo | ./test --nvrtc-option=-arch=compute_90

Maybe futhark could also look at the supported archs (nvcc --list-gpu-arch) at runtime, but that could also be out of scope for the compiler

Thanks for pointing out the compiler flag!

llandsmeer (Author) · 2026-01-24T02:03:44Z

Btw, I'll be happy to run some benchmarks if that's useful (maybe after upgrading cuda...)

llandsmeer (Author) · 2026-01-24T16:50:20Z

I ended up with this JAX-style scan (in case anyone else ends up needing it):

def seqscan 'carry 'inp 'out [n]
    (f:carry -> inp -> (carry, out))
    (carry0:carry)
    (inp:[n]inp)
    : [n]out =
  if n == 0 then ([]:>[n]out) else
  let update = \(trace:*[]out) (i:i64) (v:out) -> trace with [i] = v
  let (carry1, out0) = f carry0 inp[0]
  in (.0) <|
  loop (trace, carry, i) = (replicate n out0, carry1, 1i64)
  while i < n do
    let (carry', out) = f carry inp[i]
    in (update trace i out, carry', i + 1)



def main =
  seqscan (\(v, u) inp -> (rs (v, u) inp, v))
  (-70.0, -14.0)
  (replicate 100 1.0 ++ replicate 100 4.0 ++ replicate 100 1.0)

It would be nice to get rid of the replicate and turn it into an 'empty' allocation, but I don't think Futhark supports that (?)
