Skip to content

Conversation

@Abdelrahman912
Copy link
Collaborator

closes #230

@Abdelrahman912
Copy link
Collaborator Author

============================================================
Preconditioner Construction Timings:
============================================================
─────────────────────────────────────────────────────────────────────────────────
                                        Time                    Allocations      
                               ───────────────────────   ────────────────────────
       Tot / % measured:            152ms /  99.9%           6.26MiB / 100.0%    

Section                ncalls     time    %tot     avg     alloc    %tot      avg
─────────────────────────────────────────────────────────────────────────────────
_precompute_blocks          1    152ms  100.0%   152ms   6.26MiB  100.0%  6.26MiB
  kernel execution          1    100ms   65.5%   100ms   8.20KiB    0.1%  8.20KiB
  allocate SLbuffer         1   52.1ms   34.2%  52.1ms   5.22MiB   83.4%  5.22MiB
  isapprox(A, A')           1    464μs    0.3%   464μs   1.00MiB   16.0%  1.00MiB
  kernel setup              1   25.1μs    0.0%  25.1μs      128B    0.0%     128B
  allocate D_Dl1            1   24.6μs    0.0%  24.6μs   25.9KiB    0.4%  25.9KiB
  adapt(backend, _A)        1   40.0ns    0.0%  40.0ns     0.00B    0.0%    0.00B
  synchronize               1   36.0ns    0.0%  36.0ns     0.00B    0.0%    0.00B
─────────────────────────────────────────────────────────────────────────────────
============================================================
ldiv! Timings during solve:
============================================================
───────────────────────────────────────────────────────────────────────────────────────────
                                                  Time                    Allocations      
                                         ───────────────────────   ────────────────────────
            Tot / % measured:                 136ms /  83.0%           4.22MiB /  22.1%    

Section                          ncalls     time    %tot     avg     alloc    %tot      avg
───────────────────────────────────────────────────────────────────────────────────────────
ldiv! (CPU)                         120    113ms  100.0%   942μs    955KiB  100.0%  7.96KiB
  _forward_sweep!                   120    113ms   99.7%   939μs    954KiB   99.9%  7.95KiB
    _forward_sweep! (internal)      120    113ms   99.6%   938μs    953KiB   99.8%  7.94KiB
      kernel call                   120    111ms   98.6%   929μs    936KiB   98.0%  7.80KiB
      kernel construction           120    722μs    0.6%  6.01μs   15.0KiB    1.6%     128B
      synchronize                   120   10.6μs    0.0%  87.9ns     0.00B    0.0%    0.00B
      convert size_A                120   4.75μs    0.0%  39.6ns     0.00B    0.0%    0.00B
      @unpack partitioning          120   4.34μs    0.0%  36.2ns     0.00B    0.0%    0.00B
      ndrange calc                  120   4.18μs    0.0%  34.9ns     0.00B    0.0%    0.00B
      @unpack P                     120   3.37μs    0.0%  28.0ns     0.00B    0.0%    0.00B
  y .= x                            120    208μs    0.2%  1.73μs     0.00B    0.0%    0.00B
───────────────────────────────────────────────────────────────────────────────────────────============================================================
Unprec. no. iters: 981, time: 1.05442742
Prec. no. iters: 119, time: 0.136057039

============================================================
Preconditioner Construction Timings:
============================================================
─────────────────────────────────────────────────────────────────────────────────
                                        Time                    Allocations      
                               ───────────────────────   ────────────────────────
       Tot / % measured:            355ms /  99.8%           11.3MiB /  91.5%    

Section                ncalls     time    %tot     avg     alloc    %tot      avg
─────────────────────────────────────────────────────────────────────────────────
_precompute_blocks          1    355ms  100.0%   355ms   10.3MiB  100.0%  10.3MiB
  kernel execution          1    349ms   98.3%   349ms   8.20KiB    0.1%  8.20KiB
  allocate SLbuffer         1   4.19ms    1.2%  4.19ms   7.42MiB   72.0%  7.42MiB
  isapprox(A, A')           1   1.71ms    0.5%  1.71ms   2.85MiB   27.6%  2.85MiB
  kernel setup              1   53.5μs    0.0%  53.5μs      128B    0.0%     128B
  allocate D_Dl1            1   15.2μs    0.0%  15.2μs   30.9KiB    0.3%  30.9KiB
  synchronize               1   37.0ns    0.0%  37.0ns     0.00B    0.0%    0.00B
  adapt(backend, _A)        1   36.0ns    0.0%  36.0ns     0.00B    0.0%    0.00B
─────────────────────────────────────────────────────────────────────────────────
============================================================
ldiv! Timings during solve:
============================================================
───────────────────────────────────────────────────────────────────────────────────────────
                                                  Time                    Allocations      
                                         ───────────────────────   ────────────────────────
            Tot / % measured:                 797ms /  62.8%           18.0MiB /  17.9%    

Section                          ncalls     time    %tot     avg     alloc    %tot      avg
───────────────────────────────────────────────────────────────────────────────────────────
ldiv! (CPU)                         415    501ms  100.0%  1.21ms   3.21MiB  100.0%  7.93KiB
  _forward_sweep!                   415    499ms   99.8%  1.20ms   3.21MiB  100.0%  7.93KiB
    _forward_sweep! (internal)      415    499ms   99.7%  1.20ms   3.21MiB  100.0%  7.93KiB
      kernel call                   415    495ms   98.9%  1.19ms   3.16MiB   98.3%  7.80KiB
      kernel construction           415   2.63ms    0.5%  6.33μs   51.9KiB    1.6%     128B
      synchronize                   415   14.9μs    0.0%  36.0ns     0.00B    0.0%    0.00B
      convert size_A                415   12.3μs    0.0%  29.6ns     0.00B    0.0%    0.00B
      @unpack P                     415   12.1μs    0.0%  29.1ns     0.00B    0.0%    0.00B
      ndrange calc                  415   11.2μs    0.0%  27.0ns     0.00B    0.0%    0.00B
      @unpack partitioning          415   11.0μs    0.0%  26.6ns     0.00B    0.0%    0.00B
  y .= x                            415    796μs    0.2%  1.92μs     0.00B    0.0%    0.00B
───────────────────────────────────────────────────────────────────────────────────────────============================================================
Unprec. no. iters: 3067, time: 17.799613799
Prec. no. iters: 414, time: 0.796364037

@termi-official these are the outputs from the two examples in the test_preconditioners. There is the initial time which is done once and for all and there is the ldiv! which is called multiple times. the huge chunk of allocations comes from kernel constructions basically.

Also one thing that might be optimized is isapprox(A, A') but we need a prior info from the user I believe.

@codecov
Copy link

codecov bot commented Nov 29, 2025

Codecov Report

❌ Patch coverage is 83.67347% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.73%. Comparing base (009811a) to head (8f872e3).

Files with missing lines Patch % Lines
...c/solver/linear/preconditioners/l1_gauss_seidel.jl 83.67% 8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #238      +/-   ##
==========================================
+ Coverage   71.70%   71.73%   +0.02%     
==========================================
  Files          78       78              
  Lines        6482     6491       +9     
==========================================
+ Hits         4648     4656       +8     
- Misses       1834     1835       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Collaborator

@termi-official termi-official left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for investigating Abdelrahman! I left some comments on things which I catched on the fly.

Also take a look at CodeCov, it looks like not all of your code is covered in the CI.

Copy link
Collaborator

@termi-official termi-official left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes Abdelrahman. There are some minor things left tho.

@AbdAlazezAhmed can you confirm that this works now on the ECG?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are missing tests for the symmetric case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@testset "Symmetric A" begin
md = mdopen("HB/bcsstk15")
A = md.A # type here is Symmetric{Float64, SparseMatrixCSC{Float64, Int64}}
b = ones(size(A, 1))
test_l1gs_prec(A, b)
end
end

matrix here is already of type Symmetric
also for the test algorithmic logic I added this

function test_sym(testname, A, x, y_exp, D_Dl1_exp, SLbuffer_exp, partsize)
@testset "$testname Symmetric" begin
total_ncores = 8 # Assuming 8 cores for testing
for ncores in 1:total_ncores # testing for multiple cores to check that the answer is independent of the number of cores
builder = L1GSPrecBuilder(PolyesterDevice(ncores))
P = A isa Symmetric ? builder(A, partsize) : builder(A, partsize; isSymA=true)
@test P.D_Dl1 D_Dl1_exp
@test P.SLbuffer SLbuffer_exp
y = P \ x
@test y y_exp
end
end
end

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thanks!

However, we should use CG for the symmetric problems to ensure that the preconditioner actually conserves the symmetry.

Copy link
Collaborator

@termi-official termi-official left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are still some issues.

  1. ThreadedSparseMatrixCSC support is missing
  2. The preconditioner tanks the symmetry
  3. There is too much of memory usage

Lets try to make the following work

using Thunderbolt
using SparseArrays,OrderedCollections,TimerOutputs,LinearAlgebra

# TimerOutputs.enable_debug_timings(Thunderbolt)
# TimerOutputs.enable_debug_timings(Thunderbolt.Preconditioners)
geo = Hexahedron
nel = 2

size = 2.
signal_strength = 0.04
nel_heart = (6, 6, 6) .* 1 .* nel
nel_torso = (8, 8, 8) .* Int(size) .* nel
ground_vertex = Vec(0.0, 0.0, 0.0)
electrodes = [
    ground_vertex,
    Vec(-size,  0.,  0.), 
    Vec( size,  0.,  0.),
    Vec( 0., -size,  0.),
    Vec( 0.,  size,  0.),
    Vec( 0.,  0., -size),
    Vec( 0.,  0.,  size),
]
electrode_pairs = [[i,1] for i in 2:length(electrodes)]

heart_grid = generate_mesh(geo, nel_heart)
Ferrite.transform_coordinates!(heart_grid, x->Vec{3}(sign.(x) .* x.^2))

κ  = ConstantCoefficient(SymmetricTensor{2,3,Float64}((1.0, 0, 0, 1.0, 0, 1.0)))
κᵢ = AnalyticalCoefficient(
    (x,t) -> norm(x,Inf)  1.0 ? SymmetricTensor{2,3,Float64}((1.0, 0, 0, 1.0, 0, 1.0)) : SymmetricTensor{2,3,Float64}((0.0, 0, 0, 0.0, 0, 0.0)),
    CartesianCoordinateSystem{3}()
)

heart_model = TransientDiffusionModel(
    κᵢ,
    NoStimulationProtocol(), # Poisoning to detecte if we accidentally touch these
    :φₘ
)
heart_fun = semidiscretize(
    heart_model,
    FiniteElementDiscretization(
        Dict(:φₘ => LagrangeCollection{1}()),
        Dirichlet[]
    ),
    heart_grid
)

op = Thunderbolt.setup_assembled_operator(
    Thunderbolt.PerColorAssemblyStrategy(Thunderbolt.PolyesterDevice(8)),
    Thunderbolt.BilinearDiffusionIntegrator(
        κ,
        QuadratureRuleCollection(2),
        :φₘ,
    ),
    Thunderbolt.SparseMatrixCSC,
    heart_fun.dh,
)
Thunderbolt.update_operator!(op, 0.0) # trigger assembly


torso_grid_ = generate_grid(geo, nel_torso, Vec((-size,-size,-size)), Vec((size,size,size)))
addcellset!(torso_grid_, "heart", x->norm(x,Inf)  1.0)
# addcellset!(torso_grid_, "surrounding-tissue", x->norm(x,Inf) ≥ 1.0)
torso_grid_.cellsets["surrounding-tissue"] = OrderedSet([i for i in 1:getncells(torso_grid_) if i  torso_grid_.cellsets["heart"]])
torso_grid = Thunderbolt.to_mesh(torso_grid_)
u = zeros(Thunderbolt.solution_size(heart_fun))
geselowitz_electrodes = [[electrodes[1], electrodes[i]] for i in 2:length(electrodes)]

TimerOutputs.reset_timer!()
@info prod(nel_torso)
geselowitz_ecg = Thunderbolt.Geselowitz1989ECGLeadCache(
    heart_fun,
    torso_grid,
    κᵢ, κ,
    geselowitz_electrodes; 
    ground = OrderedSet([Thunderbolt.get_closest_vertex(ground_vertex, torso_grid)]),
    torso_heart_domain=["heart"],
    ipc                  = LagrangeCollection{1}(),
    qrc                  = QuadratureRuleCollection(3),
    linear_solver        = Thunderbolt.LinearSolve.KrylovJL_CG(precs = (A, p) -> (L1GSPrecBuilder(PolyesterDevice(8))(A, ceil(Int, SparseArrays.size(A,1)/8); isSymA=true), LinearAlgebra.I)),
    # linear_solver        = Thunderbolt.LinearSolve.KrylovJL_GMRES(precs = (A, p) -> (L1GSPrecBuilder(PolyesterDevice(8))(A, ceil(Int, 24)), LinearAlgebra.I)),
    system_matrix_type   = Thunderbolt.SparseMatrixCSR{Float64,Int64}, #Thunderbolt.ThreadedSparseMatrixCSR{Float64,Int64},
    strategy = Thunderbolt.PerColorAssemblyStrategy(Thunderbolt.PolyesterDevice(8)),
)

TimerOutputs.print_timer()

You need to apply these patches.

diff --git a/src/solver/interface.jl b/src/solver/interface.jl
index 8234949c..4279e657 100644
--- a/src/solver/interface.jl
+++ b/src/solver/interface.jl
@@ -148,13 +148,17 @@ update_constraints_block!(f::NullFunction, i::Block, solver_cache::AbstractTimeS
 
 create_system_matrix(T::Type{<:AbstractMatrix}, f::AbstractSemidiscreteFunction) = create_system_matrix(T, f.dh)
 
-function create_system_matrix(::Type{<:ThreadedSparseMatrixCSR{Tv,Ti}}, dh::AbstractDofHandler) where {Tv,Ti}
-    Acsct = transpose(convert(SparseMatrixCSC{Tv,Ti}, allocate_matrix(dh)))
+function create_system_matrix(::Type{<:ThreadedSparseMatrixCSR}, dh::AbstractDofHandler)
+    Acsct = allocate_matrix(SparseMatrixCSR{1,Float64,Int64}, dh)
     return ThreadedSparseMatrixCSR(Acsct)
 end
 
 function create_system_matrix(SpMatType::Type{<:SparseMatrixCSC}, dh::AbstractDofHandler)
-    A = convert(SpMatType, allocate_matrix(dh))
+    A = allocate_matrix(SparseMatrixCSC{Float64,Int64}, dh)
+    return A
+end
+function create_system_matrix(SpMatType::Type{<:SparseMatrixCSR}, dh::AbstractDofHandler)
+    A = allocate_matrix(SparseMatrixCSR{1,Float64,Int64}, dh)
     return A
 end
diff --git a/src/modeling/electrophysiology/ecg.jl b/src/modeling/electrophysiology/ecg.jl
index b0f47170..9db426db 100644
--- a/src/modeling/electrophysiology/ecg.jl
+++ b/src/modeling/electrophysiology/ecg.jl
@@ -402,6 +402,7 @@ function Geselowitz1989ECGLeadCache(
     linear_solver        = LinearSolve.KrylovJL_CG(),
     solution_vector_type = Vector{Float64},
     system_matrix_type   = ThreadedSparseMatrixCSR{Float64,Int64},
+    strategy             = SequentialAssemblyStrategy(SequentialCPUDevice()),
 )
     return Geselowitz1989ECGLeadCache(
         heart_fun,
@@ -416,6 +417,7 @@ function Geselowitz1989ECGLeadCache(
         linear_solver       ,
         solution_vector_type,
         system_matrix_type  ,
+        strategy            ,
     )
 end

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thanks!

However, we should use CG for the symmetric problems to ensure that the preconditioner actually conserves the symmetry.

η = convert(Tf, η)
D_Dl1 = adapt(backend, zeros(Tf, N)) # D + Dˡ
last_partsize = N - (nparts - 1) * partsize # size of the last partition
SLbuffer_size = (partsize * (partsize - 1) * (nparts-1)) ÷ 2 + last_partsize * (last_partsize - 1) ÷ 2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should store the lower/upper triangular parts in some sparse format. I do not have opinions about which one exactly.

(; backend) = partitioning
# The following code is required because there is no assumption on the compatibality of x with the backend.
@timeit_debug "adapt(backend, y)" _y = adapt(backend, y)
@timeit_debug "_forward_sweep!" _forward_sweep!(_y, P)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type of sweep should be ideally a user option -- and must be symmetric if the problem is symmetric, as we tank the symmetry needed for the solver otherwise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Big memory allocation in the L1GS

3 participants