Big memory allocations in L1GS #238

Abdelrahman912 · 2025-11-29T22:04:18Z

closes #230

Abdelrahman912 · 2025-11-29T22:09:19Z

============================================================
Preconditioner Construction Timings:
============================================================
─────────────────────────────────────────────────────────────────────────────────
                                        Time                    Allocations      
                               ───────────────────────   ────────────────────────
       Tot / % measured:            152ms /  99.9%           6.26MiB / 100.0%    

Section                ncalls     time    %tot     avg     alloc    %tot      avg
─────────────────────────────────────────────────────────────────────────────────
_precompute_blocks          1    152ms  100.0%   152ms   6.26MiB  100.0%  6.26MiB
  kernel execution          1    100ms   65.5%   100ms   8.20KiB    0.1%  8.20KiB
  allocate SLbuffer         1   52.1ms   34.2%  52.1ms   5.22MiB   83.4%  5.22MiB
  isapprox(A, A')           1    464μs    0.3%   464μs   1.00MiB   16.0%  1.00MiB
  kernel setup              1   25.1μs    0.0%  25.1μs      128B    0.0%     128B
  allocate D_Dl1            1   24.6μs    0.0%  24.6μs   25.9KiB    0.4%  25.9KiB
  adapt(backend, _A)        1   40.0ns    0.0%  40.0ns     0.00B    0.0%    0.00B
  synchronize               1   36.0ns    0.0%  36.0ns     0.00B    0.0%    0.00B
─────────────────────────────────────────────────────────────────────────────────
============================================================
ldiv! Timings during solve:
============================================================
───────────────────────────────────────────────────────────────────────────────────────────
                                                  Time                    Allocations      
                                         ───────────────────────   ────────────────────────
            Tot / % measured:                 136ms /  83.0%           4.22MiB /  22.1%    

Section                          ncalls     time    %tot     avg     alloc    %tot      avg
───────────────────────────────────────────────────────────────────────────────────────────
ldiv! (CPU)                         120    113ms  100.0%   942μs    955KiB  100.0%  7.96KiB
  _forward_sweep!                   120    113ms   99.7%   939μs    954KiB   99.9%  7.95KiB
    _forward_sweep! (internal)      120    113ms   99.6%   938μs    953KiB   99.8%  7.94KiB
      kernel call                   120    111ms   98.6%   929μs    936KiB   98.0%  7.80KiB
      kernel construction           120    722μs    0.6%  6.01μs   15.0KiB    1.6%     128B
      synchronize                   120   10.6μs    0.0%  87.9ns     0.00B    0.0%    0.00B
      convert size_A                120   4.75μs    0.0%  39.6ns     0.00B    0.0%    0.00B
      @unpack partitioning          120   4.34μs    0.0%  36.2ns     0.00B    0.0%    0.00B
      ndrange calc                  120   4.18μs    0.0%  34.9ns     0.00B    0.0%    0.00B
      @unpack P                     120   3.37μs    0.0%  28.0ns     0.00B    0.0%    0.00B
  y .= x                            120    208μs    0.2%  1.73μs     0.00B    0.0%    0.00B
───────────────────────────────────────────────────────────────────────────────────────────============================================================
Unprec. no. iters: 981, time: 1.05442742
Prec. no. iters: 119, time: 0.136057039

============================================================
Preconditioner Construction Timings:
============================================================
─────────────────────────────────────────────────────────────────────────────────
                                        Time                    Allocations      
                               ───────────────────────   ────────────────────────
       Tot / % measured:            355ms /  99.8%           11.3MiB /  91.5%    

Section                ncalls     time    %tot     avg     alloc    %tot      avg
─────────────────────────────────────────────────────────────────────────────────
_precompute_blocks          1    355ms  100.0%   355ms   10.3MiB  100.0%  10.3MiB
  kernel execution          1    349ms   98.3%   349ms   8.20KiB    0.1%  8.20KiB
  allocate SLbuffer         1   4.19ms    1.2%  4.19ms   7.42MiB   72.0%  7.42MiB
  isapprox(A, A')           1   1.71ms    0.5%  1.71ms   2.85MiB   27.6%  2.85MiB
  kernel setup              1   53.5μs    0.0%  53.5μs      128B    0.0%     128B
  allocate D_Dl1            1   15.2μs    0.0%  15.2μs   30.9KiB    0.3%  30.9KiB
  synchronize               1   37.0ns    0.0%  37.0ns     0.00B    0.0%    0.00B
  adapt(backend, _A)        1   36.0ns    0.0%  36.0ns     0.00B    0.0%    0.00B
─────────────────────────────────────────────────────────────────────────────────
============================================================
ldiv! Timings during solve:
============================================================
───────────────────────────────────────────────────────────────────────────────────────────
                                                  Time                    Allocations      
                                         ───────────────────────   ────────────────────────
            Tot / % measured:                 797ms /  62.8%           18.0MiB /  17.9%    

Section                          ncalls     time    %tot     avg     alloc    %tot      avg
───────────────────────────────────────────────────────────────────────────────────────────
ldiv! (CPU)                         415    501ms  100.0%  1.21ms   3.21MiB  100.0%  7.93KiB
  _forward_sweep!                   415    499ms   99.8%  1.20ms   3.21MiB  100.0%  7.93KiB
    _forward_sweep! (internal)      415    499ms   99.7%  1.20ms   3.21MiB  100.0%  7.93KiB
      kernel call                   415    495ms   98.9%  1.19ms   3.16MiB   98.3%  7.80KiB
      kernel construction           415   2.63ms    0.5%  6.33μs   51.9KiB    1.6%     128B
      synchronize                   415   14.9μs    0.0%  36.0ns     0.00B    0.0%    0.00B
      convert size_A                415   12.3μs    0.0%  29.6ns     0.00B    0.0%    0.00B
      @unpack P                     415   12.1μs    0.0%  29.1ns     0.00B    0.0%    0.00B
      ndrange calc                  415   11.2μs    0.0%  27.0ns     0.00B    0.0%    0.00B
      @unpack partitioning          415   11.0μs    0.0%  26.6ns     0.00B    0.0%    0.00B
  y .= x                            415    796μs    0.2%  1.92μs     0.00B    0.0%    0.00B
───────────────────────────────────────────────────────────────────────────────────────────============================================================
Unprec. no. iters: 3067, time: 17.799613799
Prec. no. iters: 414, time: 0.796364037

@termi-official these are the outputs from the two examples in the test_preconditioners. There is the initial time which is done once and for all and there is the ldiv! which is called multiple times. the huge chunk of allocations comes from kernel constructions basically.

Also one thing that might be optimized is isapprox(A, A') but we need a prior info from the user I believe.

src/solver/linear/preconditioners/l1_gauss_seidel.jl

codecov · 2025-11-29T22:14:46Z

Codecov Report

❌ Patch coverage is 83.67347% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.73%. Comparing base (009811a) to head (8f872e3).

Files with missing lines	Patch %	Lines
...c/solver/linear/preconditioners/l1_gauss_seidel.jl	83.67%	8 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #238      +/-   ##
==========================================
+ Coverage   71.70%   71.73%   +0.02%     
==========================================
  Files          78       78              
  Lines        6482     6491       +9     
==========================================
+ Hits         4648     4656       +8     
- Misses       1834     1835       +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

src/solver/linear/preconditioners/l1_gauss_seidel.jl

termi-official

Thanks for investigating Abdelrahman! I left some comments on things which I catched on the fly.

Also take a look at CodeCov, it looks like not all of your code is covered in the CI.

src/solver/linear/preconditioners/l1_gauss_seidel.jl

termi-official

Thanks for the changes Abdelrahman. There are some minor things left tho.

@AbdAlazezAhmed can you confirm that this works now on the ECG?

src/solver/linear/preconditioners/l1_gauss_seidel.jl

test/test_preconditioners.jl

termi-official · 2025-12-03T11:25:45Z

test/test_preconditioners.jl

I think we are missing tests for the symmetric case.

Thunderbolt.jl/test/test_preconditioners.jl

Lines 157 to 163 in 8f872e3

@testset "Symmetric A" begin

md = mdopen("HB/bcsstk15")

A = md.A # type here is Symmetric{Float64, SparseMatrixCSC{Float64, Int64}}

b = ones(size(A, 1))

test_l1gs_prec(A, b)

end

end

matrix here is already of type Symmetric
also for the test algorithmic logic I added this

Thunderbolt.jl/test/test_preconditioners.jl

Lines 21 to 33 in 8f872e3

function test_sym(testname, A, x, y_exp, D_Dl1_exp, SLbuffer_exp, partsize)

@testset "$testname Symmetric" begin

total_ncores = 8 # Assuming 8 cores for testing

for ncores in 1:total_ncores # testing for multiple cores to check that the answer is independent of the number of cores

builder = L1GSPrecBuilder(PolyesterDevice(ncores))

P = A isa Symmetric ? builder(A, partsize) : builder(A, partsize; isSymA=true)

@test P.D_Dl1 ≈ D_Dl1_exp

@test P.SLbuffer ≈ SLbuffer_exp

y = P \ x

@test y ≈ y_exp

end

end

end

I see. Thanks!

However, we should use CG for the symmetric problems to ensure that the preconditioner actually conserves the symmetry.

src/solver/linear/preconditioners/Preconditioners.jl

termi-official

There are still some issues.

ThreadedSparseMatrixCSC support is missing
The preconditioner tanks the symmetry
There is too much of memory usage

Lets try to make the following work

using Thunderbolt
using SparseArrays,OrderedCollections,TimerOutputs,LinearAlgebra

# TimerOutputs.enable_debug_timings(Thunderbolt)
# TimerOutputs.enable_debug_timings(Thunderbolt.Preconditioners)
geo = Hexahedron
nel = 2

size = 2.
signal_strength = 0.04
nel_heart = (6, 6, 6) .* 1 .* nel
nel_torso = (8, 8, 8) .* Int(size) .* nel
ground_vertex = Vec(0.0, 0.0, 0.0)
electrodes = [
    ground_vertex,
    Vec(-size,  0.,  0.), 
    Vec( size,  0.,  0.),
    Vec( 0., -size,  0.),
    Vec( 0.,  size,  0.),
    Vec( 0.,  0., -size),
    Vec( 0.,  0.,  size),
]
electrode_pairs = [[i,1] for i in 2:length(electrodes)]

heart_grid = generate_mesh(geo, nel_heart)
Ferrite.transform_coordinates!(heart_grid, x->Vec{3}(sign.(x) .* x.^2))

κ  = ConstantCoefficient(SymmetricTensor{2,3,Float64}((1.0, 0, 0, 1.0, 0, 1.0)))
κᵢ = AnalyticalCoefficient(
    (x,t) -> norm(x,Inf) ≤ 1.0 ? SymmetricTensor{2,3,Float64}((1.0, 0, 0, 1.0, 0, 1.0)) : SymmetricTensor{2,3,Float64}((0.0, 0, 0, 0.0, 0, 0.0)),
    CartesianCoordinateSystem{3}()
)

heart_model = TransientDiffusionModel(
    κᵢ,
    NoStimulationProtocol(), # Poisoning to detecte if we accidentally touch these
    :φₘ
)
heart_fun = semidiscretize(
    heart_model,
    FiniteElementDiscretization(
        Dict(:φₘ => LagrangeCollection{1}()),
        Dirichlet[]
    ),
    heart_grid
)

op = Thunderbolt.setup_assembled_operator(
    Thunderbolt.PerColorAssemblyStrategy(Thunderbolt.PolyesterDevice(8)),
    Thunderbolt.BilinearDiffusionIntegrator(
        κ,
        QuadratureRuleCollection(2),
        :φₘ,
    ),
    Thunderbolt.SparseMatrixCSC,
    heart_fun.dh,
)
Thunderbolt.update_operator!(op, 0.0) # trigger assembly


torso_grid_ = generate_grid(geo, nel_torso, Vec((-size,-size,-size)), Vec((size,size,size)))
addcellset!(torso_grid_, "heart", x->norm(x,Inf) ≤ 1.0)
# addcellset!(torso_grid_, "surrounding-tissue", x->norm(x,Inf) ≥ 1.0)
torso_grid_.cellsets["surrounding-tissue"] = OrderedSet([i for i in 1:getncells(torso_grid_) if i ∉ torso_grid_.cellsets["heart"]])
torso_grid = Thunderbolt.to_mesh(torso_grid_)
u = zeros(Thunderbolt.solution_size(heart_fun))
geselowitz_electrodes = [[electrodes[1], electrodes[i]] for i in 2:length(electrodes)]

TimerOutputs.reset_timer!()
@info prod(nel_torso)
geselowitz_ecg = Thunderbolt.Geselowitz1989ECGLeadCache(
    heart_fun,
    torso_grid,
    κᵢ, κ,
    geselowitz_electrodes; 
    ground = OrderedSet([Thunderbolt.get_closest_vertex(ground_vertex, torso_grid)]),
    torso_heart_domain=["heart"],
    ipc                  = LagrangeCollection{1}(),
    qrc                  = QuadratureRuleCollection(3),
    linear_solver        = Thunderbolt.LinearSolve.KrylovJL_CG(precs = (A, p) -> (L1GSPrecBuilder(PolyesterDevice(8))(A, ceil(Int, SparseArrays.size(A,1)/8); isSymA=true), LinearAlgebra.I)),
    # linear_solver        = Thunderbolt.LinearSolve.KrylovJL_GMRES(precs = (A, p) -> (L1GSPrecBuilder(PolyesterDevice(8))(A, ceil(Int, 24)), LinearAlgebra.I)),
    system_matrix_type   = Thunderbolt.SparseMatrixCSR{Float64,Int64}, #Thunderbolt.ThreadedSparseMatrixCSR{Float64,Int64},
    strategy = Thunderbolt.PerColorAssemblyStrategy(Thunderbolt.PolyesterDevice(8)),
)

TimerOutputs.print_timer()

You need to apply these patches.

diff --git a/src/solver/interface.jl b/src/solver/interface.jl
index 8234949c..4279e657 100644
--- a/src/solver/interface.jl
+++ b/src/solver/interface.jl
@@ -148,13 +148,17 @@ update_constraints_block!(f::NullFunction, i::Block, solver_cache::AbstractTimeS
 
 create_system_matrix(T::Type{<:AbstractMatrix}, f::AbstractSemidiscreteFunction) = create_system_matrix(T, f.dh)
 
-function create_system_matrix(::Type{<:ThreadedSparseMatrixCSR{Tv,Ti}}, dh::AbstractDofHandler) where {Tv,Ti}
-    Acsct = transpose(convert(SparseMatrixCSC{Tv,Ti}, allocate_matrix(dh)))
+function create_system_matrix(::Type{<:ThreadedSparseMatrixCSR}, dh::AbstractDofHandler)
+    Acsct = allocate_matrix(SparseMatrixCSR{1,Float64,Int64}, dh)
     return ThreadedSparseMatrixCSR(Acsct)
 end
 
 function create_system_matrix(SpMatType::Type{<:SparseMatrixCSC}, dh::AbstractDofHandler)
-    A = convert(SpMatType, allocate_matrix(dh))
+    A = allocate_matrix(SparseMatrixCSC{Float64,Int64}, dh)
+    return A
+end
+function create_system_matrix(SpMatType::Type{<:SparseMatrixCSR}, dh::AbstractDofHandler)
+    A = allocate_matrix(SparseMatrixCSR{1,Float64,Int64}, dh)
     return A
 end

diff --git a/src/modeling/electrophysiology/ecg.jl b/src/modeling/electrophysiology/ecg.jl
index b0f47170..9db426db 100644
--- a/src/modeling/electrophysiology/ecg.jl
+++ b/src/modeling/electrophysiology/ecg.jl
@@ -402,6 +402,7 @@ function Geselowitz1989ECGLeadCache(
     linear_solver        = LinearSolve.KrylovJL_CG(),
     solution_vector_type = Vector{Float64},
     system_matrix_type   = ThreadedSparseMatrixCSR{Float64,Int64},
+    strategy             = SequentialAssemblyStrategy(SequentialCPUDevice()),
 )
     return Geselowitz1989ECGLeadCache(
         heart_fun,
@@ -416,6 +417,7 @@ function Geselowitz1989ECGLeadCache(
         linear_solver       ,
         solution_vector_type,
         system_matrix_type  ,
+        strategy            ,
     )
 end

termi-official · 2025-12-03T20:17:27Z

test/test_preconditioners.jl

I see. Thanks!

However, we should use CG for the symmetric problems to ensure that the preconditioner actually conserves the symmetry.

termi-official · 2025-12-03T21:56:36Z

src/solver/linear/preconditioners/l1_gauss_seidel.jl

+        η = convert(Tf, η)
+        D_Dl1 = adapt(backend, zeros(Tf, N)) # D + Dˡ
+        last_partsize = N - (nparts - 1) * partsize # size of the last partition
+        SLbuffer_size = (partsize * (partsize - 1) * (nparts-1)) ÷ 2 +  last_partsize * (last_partsize - 1) ÷ 2


We should store the lower/upper triangular parts in some sparse format. I do not have opinions about which one exactly.

termi-official · 2025-12-03T21:59:26Z

src/solver/linear/preconditioners/l1_gauss_seidel.jl

+        (; backend) = partitioning
+        # The following code is required because there is no assumption on the compatibality of x with the backend.
+        @timeit_debug "adapt(backend, y)" _y = adapt(backend, y)
+        @timeit_debug "_forward_sweep!" _forward_sweep!(_y, P)


The type of sweep should be ideally a user option -- and must be symmetric if the problem is symmetric, as we tank the symmetry needed for the solver otherwise.

add timeroutputs

9715068

Abdelrahman912 commented Nov 29, 2025

View reviewed changes

src/solver/linear/preconditioners/l1_gauss_seidel.jl Outdated Show resolved Hide resolved

Abdelrahman912 requested a review from termi-official November 29, 2025 22:11

Abdelrahman912 added 3 commits November 30, 2025 01:38

add hot fix for sym check

131471d

big mat test

77e4e3a

minor fix

8a8e5f6

Abdelrahman912 commented Nov 30, 2025

View reviewed changes

src/solver/linear/preconditioners/l1_gauss_seidel.jl Outdated Show resolved Hide resolved

minor changes

1e6fb37

Abdelrahman912 commented Nov 30, 2025

View reviewed changes

src/solver/linear/preconditioners/l1_gauss_seidel.jl Outdated Show resolved Hide resolved

termi-official requested changes Nov 30, 2025

View reviewed changes

Abdelrahman912 and others added 5 commits December 1, 2025 04:59

fixes

2a8b1e0

minor fix

5679c04

add eq 6.3

1620a1b

fix test

c656c08

Merge branch 'main' into l1gs-mem-fix

5172c8d

termi-official requested changes Dec 3, 2025

View reviewed changes

Abdelrahman912 added 3 commits December 3, 2025 17:36

minor fixes

00de37e

add eta and docstring

3fc68b2

typo

8f872e3

termi-official requested changes Dec 3, 2025

View reviewed changes

	@testset "Symmetric A" begin
	md = mdopen("HB/bcsstk15")
	A = md.A # type here is Symmetric{Float64, SparseMatrixCSC{Float64, Int64}}
	b = ones(size(A, 1))
	test_l1gs_prec(A, b)
	end
	end

	function test_sym(testname, A, x, y_exp, D_Dl1_exp, SLbuffer_exp, partsize)
	@testset "$testname Symmetric" begin
	total_ncores = 8 # Assuming 8 cores for testing
	for ncores in 1:total_ncores # testing for multiple cores to check that the answer is independent of the number of cores
	builder = L1GSPrecBuilder(PolyesterDevice(ncores))
	P = A isa Symmetric ? builder(A, partsize) : builder(A, partsize; isSymA=true)
	@test P.D_Dl1 ≈ D_Dl1_exp
	@test P.SLbuffer ≈ SLbuffer_exp
	y = P \ x
	@test y ≈ y_exp
	end
	end
	end

Big memory allocations in L1GS #238

Are you sure you want to change the base?

Big memory allocations in L1GS #238

Conversation

Abdelrahman912 commented Nov 29, 2025

Uh oh!

Abdelrahman912 commented Nov 29, 2025

Uh oh!

Uh oh!

codecov bot commented Nov 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

termi-official left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

termi-official left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

termi-official Dec 3, 2025

Choose a reason for hiding this comment

Uh oh!

Abdelrahman912 Dec 3, 2025

Choose a reason for hiding this comment

Uh oh!

termi-official Dec 3, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

termi-official left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

termi-official Dec 3, 2025

Choose a reason for hiding this comment

Uh oh!

termi-official Dec 3, 2025

Choose a reason for hiding this comment

Uh oh!

termi-official Dec 3, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov bot commented Nov 29, 2025 •

edited

Loading

termi-official left a comment •

edited

Loading

termi-official left a comment •

edited

Loading