Numerical differences between C and OpenCL #2353
Replies: 6 comments
-
This is a very quick and incomplete answer:
Scan in Futhark represents a prefix sum computation, hence it requires an associative operator. Yours is not, which I suspect is why it leads to different results. Moreover, the differentiation of scan also requires an associative operator, i.e. it gives incorrect results if the operator is not associative.
Since we are talking about associativity, this also answers the type question: a -> b -> a does not make sense as far as associativity is concerned.
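To make the non-associativity concrete, here is a minimal sketch that evaluates both groupings of three values under the posted operator. The update rule is copied from the original post; the explicit type annotations and the entry name `groupings` are additions for illustration only.

```futhark
-- Update rule from the original post, with explicit types added.
def step (dt: f32) (a: f32) (du: f32) (v: f32, u: f32) (i0: f32, _: f32) : (f32, f32) =
  if v > 30.0
  then (-65.0, u + du)
  else
    let v1 = v + 0.5 * dt * ((0.04 * v + 5.0) * v + 140.0 - u + i0)
    let v2 = v1 + 0.5 * dt * ((0.04 * v1 + 5.0) * v1 + 140.0 - u + i0)
    let u1 = u + dt * a * (0.2 * v2 - u)
    in if v2 > 30.0 then (100.0, u1) else (v2, u1)

def rs = step 1 0.02 8.0

-- Hypothetical entry point: returns the first component of
-- (x `rs` y) `rs` z and of x `rs` (y `rs` z).
-- An associative operator would give the same value for both.
entry groupings : (f32, f32) =
  let x = (-65.0, -13.0)
  let y = (4.0, 0.0)
  let z = (4.0, 0.0)
  let left = rs (rs x y) z
  let right = rs x (rs y z)
  in (left.0, right.0)
```

Working this through by hand, the left grouping should match the second value of the sequential output (about -63.23), while the right grouping ends in a spike (100.0). A parallel scan is free to use either grouping. Note also that (-65.0, -13.0) is not a neutral element for `rs`, which scan assumes as well.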
(Folds have sequential semantics, and their parallelization can be achieved by a map-reduce or map-scan composition, where both reduce and scan require associative operators, i.e. of type a -> a -> a.)
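As an illustration of such a map-scan composition, a first-order linear recurrence x_{i+1} = a_i * x_i + b_i looks sequential, but it can be computed by scanning over compositions of affine maps, which is an associative operation. This is only a sketch for a linear model, not a parallelization of the posted neuron update (which is nonlinear); the names `compose` and `linrec` are made up for the example.

```futhark
-- Compose two affine maps x |-> a*x + b, applying the left one first:
-- a2 * (a1 * x + b1) + b2 = (a2 * a1) * x + (a2 * b1 + b2).
-- Function composition is associative, and (1, 0) is the identity map.
def compose (a1: f32, b1: f32) (a2: f32, b2: f32) : (f32, f32) =
  (a2 * a1, a2 * b1 + b2)

-- Hypothetical helper: returns x_1 .. x_n of x_{i+1} = a_i * x_i + b_i,
-- given the start value x0 and the n coefficient pairs (a_i, b_i).
def linrec [n] (x0: f32) (coefs: [n](f32, f32)) : [n]f32 =
  scan compose (1, 0) coefs      -- prefix compositions of the affine maps
  |> map (\(a, b) -> a * x0 + b) -- apply each composed map to x0
```

With floating-point numbers the associativity holds only up to rounding, so sequential and parallel results may still differ in the last digits, but not by the large amounts seen in the OpenCL output above.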
With Best regards,
Cosmin
…________________________________
From: Lennart Landsmeer ***@***.***>
Sent: Thursday, January 22, 2026 11:34:33 PM
To: diku-dk/futhark
Cc: Subscribed
Subject: [diku-dk/futhark] Numerical differences between C and OpenCL (Discussion #2353)
Hi, I usually target GPUs via JAX, but now I'm in a position where I need to target multiple devices (AMD+NVIDIA+Intel) and need to speed up some non-trivial code that doesn't play nicely with existing ML-libraries; futhark looks very promising, but after trying it for a bit I now have some questions
I'm seeing a big difference between the c and the opencl output, but since I just tried using futhark I'm not sure if it's a bug or that I'm doing something wrong
This is the code I'm currently using (neural simulation):
def step dt (a:f32) (du:f32) (v:f32,u:f32) (i0:f32,_) =
  if v > 30.0
  then
    (-65.0, u + du)
  else
    let v1 = v + 0.5 * dt * ((0.04 * v + 5.0) * v + 140.0 - u + i0)
    let v2 = v1 + 0.5 * dt * ((0.04 * v1 + 5.0) * v1 + 140.0 - u + i0)
    let u1 = u + dt * a * (0.2 * v2 - u)
    in if v2 > 30.0
       then
         (100.0, u1)
       else
         (v2, u1)
def rs = step 1 0.02 8.0
def main = scan (rs) (-65.0, -13.0) ((replicate 100 (4.0:f32,0.0:f32))) |> map (.0)
which has very different outputs on opencl & c:
Sequential C (expected output):
$ futhark c test.fut && echo | ./test
[-64.045006f32, -63.225807f32, -62.473724f32, -61.741394f32, -60.987762f32, -60.167908f32, -59.221390f32, -58.052788f32, -56.489525f32, -54.173897f32, -50.226768f32, -41.876823f32, -15.734381f32, 100.000000f32, -65.000000f32, -72.748299f32, -76.124908f32, -76.784073f32, -76.746216f32, -76.578667f32, -76.389313f32, -76.197487f32, -76.006729f32, -75.817673f32, -75.630447f32, -75.445045f32, -75.261467f32, -75.079666f32, -74.899651f32, -74.721390f32, -74.544868f32, -74.370056f32, -74.196953f32, -74.025536f32, -73.855774f32, -73.687653f32, -73.521172f32, -73.356293f32, -73.192993f32, -73.031281f32, -72.871117f32, -72.712494f32, -72.555389f32, -72.399796f32, -72.245674f32, -72.093040f32, -71.941849f32, -71.792099f32, -71.643761f32, -71.496834f32]
OpenCL (tested for both Intel(R) Iris(R) Xe Graphics (laptop), NVIDIA RTX PRO 6000 (workstation), wrong output):
$ futhark opencl test.fut && echo | ./test
[-64.045006f32, -106.590317f32, -110.178589f32, -105.268761f32, -110.339531f32, -105.525635f32, -109.471146f32, -104.199127f32, -110.456985f32, -105.720779f32, -109.601990f32, -104.413399f32, -109.782440f32, -104.668556f32, -108.836899f32, -103.409035f32, -110.536407f32, -105.857605f32, -109.691315f32, -104.564346f32, -109.869431f32, -104.816589f32, -108.934814f32, -103.570450f32, -109.999634f32, -105.009689f32, -109.077614f32, -103.771988f32, -109.275146f32, -104.014389f32, -108.269485f32, -102.873672f32, -103.122574f32, -101.693314f32, -100.745041f32, -102.010101f32, -100.699509f32, -101.949875f32, -100.942200f32, -102.253708f32, -100.666107f32, -101.903717f32, -100.906120f32, -102.205826f32, -100.856064f32, -102.148209f32, -101.114677f32, -102.426315f32, -100.643425f32, -101.871078f32]
At first I thought it was a numerical error due to the hardware, but
* Lowering the timestep does not make a difference (and system is known to be stable for this timestep)
* Switching to f64 on the workstation GPU doesn't change output
* Running the same code in JAX on the workstation GPU gives the correct answer
Futhark version (latest non-nightly precompiled binary)
$ futhark --version
Futhark 0.25.34.
git: b88976e
Compiled with GHC 9.8.4.
Also running on the GPU in CUDA mode failed with
$ futhark cuda test.fut && echo | ./test
./test: During CUDA initialisation:
NVRTC compilation failed.
nvrtc: error: invalid value for --gpu-architecture (-arch)
1. What is going on? Is such behaviour expected for futhark?
2. (Bit more general question) Why does scan have type a -> a -> a and not a -> b -> a?
3. (Completely different question) Is full support for custom derivatives for reverse AD planned? Being able to define custom derivatives is needed for gradient calculation in neural models. Same for checkpointing.
4. How can I enable compilation for CUDA on the PRO 6000?
-
Seems like I blindly assumed futhark scan() and JAX scan() would behave the same; thanks a lot for the quick clarification! I guess a
-
Cosmin already answered the interesting question, so I'll answer the other ones:
The error is a bug somewhere. Futhark does run-time compilation of embedded CUDA kernels, and queries the platform/hardware in order to pass options to the run-time compiler provided by NVIDIA (NVRTC). This is the function that determines the value of the
-
-D output:
Which looks like it should be correct according to the docs, but my nvcc (Ubuntu 24 LTS default) is apparently too old. This runs correctly though:
Maybe futhark could also look at the supported archs (
Thanks for pointing out the compiler flag!
-
Btw, I'll be happy to run some benchmarks if that's useful (maybe after upgrading cuda...)
-
I ended up with this JAX-style scan (in case anyone else ends up needing it):
def seqscan 'carry 'inp 'out [n]
    (f:carry -> inp -> (carry, out))
    (carry0:carry)
    (inp:[n]inp)
    : [n]out =
  if n == 0 then ([]:>[n]out) else
  let update = \(trace:*[]out) (i:i64) (v:out) -> trace with [i] = v
  let (carry1, out0) = f carry0 inp[0]
  in (.0) <|
    loop (trace, carry, i) = (replicate n out0, carry1, 1i64)
    while i < n do
      let (carry', out) = f carry inp[i]
      in (update trace i out, carry', i + 1)
def main =
  seqscan (\(v, u) inp -> (rs (v, u) inp, v))
    (-70.0, -14.0)
    (replicate 100 1.0 ++ replicate 100 4.0 ++ replicate 100 1.0)
It would be nice to get rid of the replicate and turn it into an 'empty' allocation, but I don't think futhark supports that (?)