RK4 tracking with analytical magnetic fields #231
ndwang wants to merge 88 commits into bmad-sim:main
Conversation
…eld calculations.
…d add alive state check; changed while loop to for loop in rk4_kernel
is_solenoid = (mm[1] == 0)
bz = vifelse(is_solenoid, kn[1], zero(x))

# Convert from normalized (field/Bρ) to physical units (Tesla)
Is this the best way to do it? Aren't the normalized fields what appear in the equations of motion?
The equation of motion uses physical units to calculate the Lorentz force.
In accelerator coordinates, you only need the normalized fields. On lines 95-97, you multiply by the charge and on lines 122-123, you divide by p0.
E needs to be in physical units. If the electric fields are non-zero, the return value will be in mixed units.
Why? E is also multiplied by the charge and divided by p0
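For reference, dividing the Lorentz force equation by the reference momentum $p_0$ gives (a sketch in standard notation, with $B\rho = p_0/q$):

$$
\frac{d}{dt}\!\left(\frac{\mathbf{p}}{p_0}\right) = \frac{q}{p_0}\left(\mathbf{E} + \mathbf{v}\times\mathbf{B}\right) = \frac{\mathbf{E}}{B\rho} + \mathbf{v}\times\frac{\mathbf{B}}{B\rho}
$$

so both fields end up divided by $B\rho$ in the normalized equation of motion; the disagreement below is about where that division should live and what units the field routines should return.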
@JosephPeterDevlin @mattsignorelli
Normalized field strength is for magnetic fields. It's not a thing for electric fields. E/Brho is not a physical quantity and will only confuse people in the future. If we decide to use E/Brho for this kernel, it means every other field solver (e.g. space charge, wakefield, CSR) needs to return fields in V/(T*m^2) or 1/s.
The performance improvement is an illusion. In most cases, it's just hiding the multiplication/division somewhere else. The only case where it's better is analytical magnets and it only saves 6 multiplications.
To put this into numbers, I benchmarked normalized units against physical units on CPU and GPU for three scenarios (a minimal sketch of the harness follows this list; the actual script is linked further down):
- Analytical magnetic field
- Field map
- Superposition of both (an analytical magnet plus any other field source that returns physical units, such as space charge or a wakefield)
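A minimal sketch of how such a comparison can be set up (illustrative only; `track_physical!`, `track_normalized!`, `make_bunch`, and `make_beamline` are placeholder names, and the actual script is linked further down):

```julia
using Statistics

# Time one tracking variant `samples` times and report min / median / std in μs.
function bench(track!, bunch, bl; samples = 20)
    ts = map(1:samples) do _
        b = deepcopy(bunch)               # fresh bunch each run
        1e6 * @elapsed(track!(b, bl))     # μs
    end
    (min = minimum(ts), med = median(ts), sd = std(ts))
end

for N in (1_000, 10_000, 100_000)
    bunch = make_bunch(N)        # placeholder: build an N-particle bunch
    bl    = make_beamline()      # placeholder: multipole, field map, or both
    phys  = bench(track_physical!,   bunch, bl)
    norm  = bench(track_normalized!, bunch, bl)
    @show N phys norm
end
```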
tl;dr
Normalized units are marginally better for analytical magnets, but drastically worse on CPU when you have anything that lives in physical units land.
Here are the results:
On my PC (CPU i7 13700K, GPU 4070)
CPU — Analytical Multipole
| N | Physical min (μs) | Normalized min (μs) | Physical med ± std (μs) | Normalized med ± std (μs) | Δ med |
|---|---|---|---|---|---|
| 1,000 | 3,012 | 3,032 | 3,039 ± 369 | 3,055 ± 98 | -0.5% |
| 10,000 | 30,291 | 30,509 | 30,853 ± 4,412 | 30,953 ± 3,050 | -0.3% |
| 100,000 | 307,193 | 308,359 | 309,472 ± 8,690 | 310,878 ± 11,590 | -0.5% |
CPU — Field Map (3D Rectangular Grid)
| N | Physical min (μs) | Normalized min (μs) | Physical med ± std (μs) | Normalized med ± std (μs) | Δ med |
|---|---|---|---|---|---|
| 1,000 | 15,317 | 15,908 | 15,630 ± 563 | 16,220 ± 666 | -3.8% |
| 10,000 | 155,262 | 161,201 | 156,225 ± 4,181 | 162,160 ± 1,176 | -3.8% |
| 100,000 | 1,570,321 | 1,621,898 | 1,576,434 ± 17,162 | 1,634,477 ± 17,525 | -3.7% |
CPU — Superposition (Multipole + Field Map)
| N | Physical min (μs) | Normalized min (μs) | Physical med ± std (μs) | Normalized med ± std (μs) | Δ med |
|---|---|---|---|---|---|
| 1,000 | 14,880 | 16,495 | 15,200 ± 676 | 16,851 ± 697 | -10.9% |
| 10,000 | 150,374 | 166,601 | 151,653 ± 3,100 | 168,686 ± 3,492 | -11.2% |
| 100,000 | 1,521,921 | 1,689,968 | 1,530,000 ± 6,404 | 1,696,674 ± 6,507 | -10.9% |
GPU (CUDA) — Analytical Multipole
| N | Physical min (μs) | Normalized min (μs) | Physical med ± std (μs) | Normalized med ± std (μs) | Δ med |
|---|---|---|---|---|---|
| 1,000 | 4,135 | 4,026 | 4,207 ± 178 | 4,090 ± 179 | +2.8% |
| 10,000 | 4,147 | 4,038 | 4,213 ± 190 | 4,094 ± 154 | +2.8% |
| 100,000 | 37,009 | 36,043 | 37,785 ± 466 | 36,778 ± 434 | +2.7% |
GPU (CUDA) — Field Map (3D Rectangular Grid)
| N | Physical min (μs) | Normalized min (μs) | Physical med ± std (μs) | Normalized med ± std (μs) | Δ med |
|---|---|---|---|---|---|
| 1,000 | 6,243 | 6,259 | 6,356 ± 217 | 6,357 ± 223 | -0.0% |
| 10,000 | 6,258 | 6,269 | 6,341 ± 193 | 6,359 ± 188 | -0.3% |
| 100,000 | 56,123 | 56,303 | 57,429 ± 543 | 57,621 ± 629 | -0.3% |
GPU (CUDA) — Superposition (Multipole + Field Map)
| N | Physical min (μs) | Normalized min (μs) | Physical med ± std (μs) | Normalized med ± std (μs) | Δ med |
|---|---|---|---|---|---|
| 1,000 | 6,761 | 6,709 | 6,854 ± 176 | 6,804 ± 197 | +0.7% |
| 10,000 | 6,781 | 6,750 | 6,959 ± 284 | 6,972 ± 243 | -0.2% |
| 100,000 | 61,253 | 60,773 | 62,293 ± 544 | 61,829 ± 633 | +0.7% |
On NERSC
CPU — Analytical Multipole
| N | Physical min (μs) | Normalized min (μs) | Physical med ± std (μs) | Normalized med ± std (μs) | Δ med |
|---|---|---|---|---|---|
| 1,000 | 4,592 | 4,412 | 4,688 ± 147 | 4,469 ± 113 | +4.7% |
| 10,000 | 46,488 | 44,556 | 46,895 ± 233 | 45,868 ± 8,410 | +2.2% |
| 100,000 | 468,548 | 447,922 | 471,795 ± 1,951 | 458,366 ± 325,462 | +2.9% |
CPU — Field Map (3D Rectangular Grid)
| N | Physical min (μs) | Normalized min (μs) | Physical med ± std (μs) | Normalized med ± std (μs) | Δ med |
|---|---|---|---|---|---|
| 1,000 | 25,752 | 28,822 | 26,100 ± 220 | 29,211 ± 256 | -11.9% |
| 10,000 | 258,509 | 289,861 | 259,329 ± 1,016 | 291,629 ± 1,288 | -12.5% |
| 100,000 | 2,599,207 | 2,909,775 | 2,603,184 ± 5,625 | 2,912,697 ± 4,132 | -11.9% |
CPU — Superposition (Multipole + Field Map)
| N | Physical min (μs) | Normalized min (μs) | Physical med ± std (μs) | Normalized med ± std (μs) | Δ med |
|---|---|---|---|---|---|
| 1,000 | 28,777 | 31,652 | 29,264 ± 351 | 32,007 ± 202 | -9.4% |
| 10,000 | 288,959 | 320,165 | 291,804 ± 2,853 | 321,247 ± 836 | -10.1% |
| 100,000 | 2,914,413 | 3,200,890 | 2,918,006 ± 5,082 | 3,216,311 ± 21,808 | -10.2% |
GPU (CUDA) — Analytical Multipole
| N | Physical min (μs) | Normalized min (μs) | Physical med ± std (μs) | Normalized med ± std (μs) | Δ med |
|---|---|---|---|---|---|
| 1,000 | 964 | 949 | 977 ± 13 | 957 ± 11 | +2.1% |
| 10,000 | 972 | 952 | 980 ± 18 | 959 ± 13 | +2.1% |
| 100,000 | 3,760 | 3,697 | 3,788 ± 17 | 3,709 ± 15 | +2.1% |
GPU (CUDA) — Field Map (3D Rectangular Grid)
| N | Physical min (μs) | Normalized min (μs) | Physical med ± std (μs) | Normalized med ± std (μs) | Δ med |
|---|---|---|---|---|---|
| 1,000 | 1,278 | 1,275 | 1,305 ± 4,477 | 1,297 ± 14 | +0.6% |
| 10,000 | 1,295 | 1,295 | 1,313 ± 14 | 1,309 ± 15 | +0.3% |
| 100,000 | 5,077 | 5,094 | 5,117 ± 4,403 | 5,113 ± 20 | +0.1% |
GPU (CUDA) — Superposition (Multipole + Field Map)
| N | Physical min (μs) | Normalized min (μs) | Physical med ± std (μs) | Normalized med ± std (μs) | Δ med |
|---|---|---|---|---|---|
| 1,000 | 1,565 | 1,586 | 1,599 ± 42 | 1,606 ± 16 | -0.4% |
| 10,000 | 1,614 | 1,629 | 1,644 ± 18 | 1,643 ± 11 | +0.0% |
| 100,000 | 6,258 | 6,241 | 6,314 ± 31 | 6,282 ± 26 | +0.5% |
Summary
My PC
| Scenario | CPU (avg) | GPU (avg) |
|---|---|---|
| Analytical multipole | -0.4% | +2.8% |
| Field map | -3.8% | -0.2% |
| Superposition | -11.0% | +0.4% |
NERSC
| Scenario | CPU | GPU |
|---|---|---|
| Analytical multipole | +3.2% | +2.1% |
| Field map | -12.1% | +0.3% |
| Superposition | -9.9% | +0.0% |
It's not a matter of physicality. The Hamiltonian only contains the electric field (actually the scalar potential) divided by p_ref/q. No matter what, you have to divide by p_ref/q. I therefore am very suspicious of this benchmark.
> If we decide to use E/Brho for this kernel, it means every other field solver (e.g. space charge, wakefield, CSR) needs to return fields in V/(T*m^2) or 1/s.
Sounds fine to me... the phase space coordinates are normalized and not kinetic, so that actually makes sense
> The performance improvement is an illusion. In most cases, it's just hiding the multiplication/division somewhere else. The only case where it's better is analytical magnets and it only saves 6 multiplications.
This doesn't make sense to me: with normalized fields the multiplication/division is done outside the kernel, so it happens once rather than for every particle inside the kernel, whereas in the other case every particle has to do it.
I also am extremely suspicious of the benchmark, because logically with normalized fields, the multiplication/division is done outside of the kernel, not in. So any performance "losses" would have to be explained thoroughly.
Field maps come from simulations (CST) or measurements. They are properties of the hardware, not the beam. The RF engineer will give you a file in V/m and have no idea what p_over_q_ref is.
OK, what if we preprocess the field map? Well, you can't store them normalized. Every time p_over_q_ref changes, you invalidate the field map. A cavity field map used for 12 cavities in a linac would need 12 separate copies because they have a different p_over_q_ref after acceleration. You want to optimize a cavity phase upstream? You have to re-normalize all field maps downstream. How about ramping? Magnets ramp B with p, so the normalized field strengths stay constant. But you don't ramp RF cavity voltages with p: E/Brho is not a constant, and you can't make a new field map for every ramping step. If EIC wants to reuse a magnet from RHIC, you cannot use a normalized field map unless you also have a lattice model and calculate its p_over_q_ref.
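To make the sharing argument concrete, here is a rough sketch (hypothetical struct and field names, not the BeamTracking.jl API):

```julia
# Hypothetical types for illustration only.
struct FieldMap3D{T}
    E::Array{T,4}   # electric field in V/m, indexed (component, ix, iy, iz)
    B::Array{T,4}   # magnetic field in T
end

struct Cavity{T}
    fmap::FieldMap3D{T}   # shared reference: a property of the hardware
    p_over_q_ref::T       # a property of the beam at this element
end

# One field map in physical units is shared by every cavity; each element
# applies its own p_over_q_ref at tracking time.
fmap  = FieldMap3D(rand(3, 11, 11, 101), zeros(3, 11, 11, 101))
linac = [Cavity(fmap, pq) for pq in (0.5, 1.0, 1.5)]   # Bρ grows along the linac

# A map stored as E/Bρ would instead need a separate copy (or a re-normalization
# pass) for every distinct p_over_q_ref, and again whenever that value changes.
```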
The benchmark script is https://github.com/ndwang/BeamTracking.jl/blob/fieldmap/benchmark/rk4_units_benchmark.jl
Compare it with https://github.com/ndwang/BeamTracking.jl/blob/fieldmap/src/modules/RungeKuttaTracking.jl
The only difference is where p_over_q_ref is applied.
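Schematically, the difference is (a sketch with made-up function names, not code from either file):

```julia
using LinearAlgebra  # for cross

# Physical-units variant: fields arrive in V/m and T, and the kernel divides
# by p_over_q_ref (= Bρ) once per field evaluation, per particle.
force_physical(E, B, v, p_over_q_ref) = (E .+ cross(v, B)) ./ p_over_q_ref

# Normalized variant: the field routine is expected to return E/Bρ and B/Bρ,
# so the kernel only assembles the force.
force_normalized(E_norm, B_norm, v) = E_norm .+ cross(v, B_norm)
```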
You should point to specific issues or benchmark it yourselves instead of handwaving about performance. "I'm suspicious" is not a counterargument.
You gave us a (probably AI-generated) benchmark without the code, and made a claim that doing more FLOPS inside a kernel/per-particle loop is somehow faster than doing it outside the kernel. That claim, which is the specific issue we are pointing to, doesn't make sense to either me or @JosephPeterDevlin, hence our "I'm suspicious" response to the benchmark. We kindly asked you to clarify exactly what is causing the slowdown, and that question, at least to me, remains unanswered.
All I see are things like this:
https://github.com/ndwang/BeamTracking.jl/blob/cb875ef6018602f8478cf898a3d6932ca5273f47/src/modules/RungeKuttaTracking.jl#L92
which are extra operations inside a kernel. Until you point out exactly what is causing the slowdown, a table of numbers from a benchmark that neither of us understands doesn't mean anything to me or @JosephPeterDevlin.
This PR is for analytical magnetic fields. The entire codebase uses normalized coordinates because the canonical coordinates are normalized. In my view, this is such a minor request from @JosephPeterDevlin that I don't understand the strong pushback.
Furthermore, ramping is actually exactly the case where normalized magnetic fields make sense, because then I don't have to update every single B field at every time step. When you ramp a ring, the normalized magnetic field strengths remain constant, even though the B fields change.
@mattsignorelli Can you do a quick review?

@JosephPeterDevlin Fixed the 1s
E_force_z = charge * Ez
B_force_x = charge * (vy*Bz - vz*By)
B_force_y = charge * (vz*Bx - vx*Bz)
B_force_z = charge * (vx*By - vy*Bx)
Is B_force_z used?
RungeKuttaTracking method for fixed-step, s-based 4th-order Runge-Kutta integration.