Implement silhouette_score from scikit-learn#2199
Conversation
brownbaerchen
left a comment
This is already looking quite good. I had a brief look and highlighted some general things. Mainly using assert statements in the tests.
Since you mentioned you had trouble debugging, I want to advertise the debugger once more. Just write breakpoint() in the code whenever you want to stop and explore. There can be trouble with parallel debugging. So when running in parallel, I only call a breakpoint on rank 0:
```python
if ht.comm.rank == 0:
    breakpoint()
```
If you're already familiar with this stuff, just ignore this. But I know many python programmers, myself included, who were introduced to this way too late, so it doesn't hurt to mention it :D
for more information, see https://pre-commit.ci
Co-authored-by: Thomas Saupe <[email protected]>
Thanks a lot @Shabasovich ! I have a few high-level comments :
- The fundamental requirement for any Heat functionality is that it must support massive memory-distributed arrays.
- This PR calls `DNDarray.resplit_(None)` often in the code, which means that the DNDarray gets gathered onto each MPI process. After a `resplit_(None)`, any operation on that array will be single-process.
- Most importantly, for our use cases there won't be an "after `resplit_(None)`", because the process runs out of memory.

At the current stage, the code might be passing the tests, but it doesn't exploit memory distribution. The good thing is that all of the Heat operations you're using do support memory distribution, so, next steps:

- Find the spots in the code where arrays are being "gathered" (all the `resplit_(None)` calls).
- Ask yourself: what size do we expect this array to be if the original input data have, say, ~1e6 or 1e7 rows? Does it make sense to try to copy it entirely onto each process?
- If not: don't resplit. Just apply the follow-up Heat operation to the distributed array.
By the way, check out our new LLM and AI usage guidelines. It's really common for LLMs not to grasp the memory-distributed requirement of Heat operations.
My recommendation is not to use AI to write code for the next iteration, but to work on adapting the existing code. Since you've just taken the MPI course, you might want to go through the tutorials again; especially check out Internals.
Thanks again for all the work!
```python
def check_array(X, accept_sparse=False, input_name="X"):
    if not isinstance(X, ht.DNDarray):
        # Convert to heat array
        X = ht.array(X, split=0)
```
In general I wouldn't impose split=0 by default, as it comes with all kinds of communication overhead downstream (i.e. every operation on the array will be performed in distributed mode).
Unless you've assessed that it has advantages from an algorithmic point of view.
I guess I will do a length check before converting it to a Heat array, because the communication overhead is then reasonable.
Personally, I would either not distribute in this case or require the input to be a heat array.
I am not sure what the workflow would be where you pass a torch tensor or whatever and use multiple tasks. But in this case, I would not be surprised if different tasks do different things. Since heat uses the world communicator within the function, this can lead to deadlocks.
```python
def _check_y(y):
    if not isinstance(y, ht.DNDarray):
        y = ht.array(y, split=0)
```
```python
def silhouette_samples(X, labels, *, metric="euclidean", **kwds):
    X, labels = check_X_y(
        X, labels, accept_sparse=["csr"]
    )  # think about accept_sparse, i have no idea what it is and what csr means
```
hints:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html
We also support sparse CSR matrices with very few arithmetic operations; see `ht.sparse`.
But let's leave the sparse functionality for later, it's lower priority for now
Sparse matrices are matrices with few entries relative to the overall size. This is the case for the masking matrix, where some clusters only contain few entries. It is notably not the case for the distance matrix, because every point has a distance with respect to every other point. So it doesn't make sense to me to accept sparse matrices as input, and I agree that we should just remove it from this PR.
Having a sparse masking matrix would be nice, but I don't know if there is much to be gained when multiplying the dense distance matrix with a sparse masking matrix. Let's worry about this once it becomes a problem.
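For later reference, here is a minimal SciPy sketch (not part of this PR, and assuming `scipy.sparse.csr_array` from SciPy >= 1.8) of multiplying the dense distance matrix by a sparse one-hot masking matrix; the distance values are made up for illustration, and the labels reuse the small example from the PR description:

```python
import numpy as np
from scipy.sparse import csr_array

# Toy values: 4 samples in 3 clusters (labels_encoded = [0, 2, 1, 0]).
labels_encoded = np.array([0, 2, 1, 0])
n_samples, n_clusters = 4, 3

# A symmetric dense distance matrix (every off-diagonal entry is nonzero).
D = np.array([
    [0.0, 1.0, 2.0, 0.5],
    [1.0, 0.0, 1.5, 1.2],
    [2.0, 1.5, 0.0, 2.2],
    [0.5, 1.2, 2.2, 0.0],
])

# Sparse one-hot masking matrix: mask[i, k] = 1 if sample i is in cluster k.
onehot = np.zeros((n_samples, n_clusters))
onehot[np.arange(n_samples), labels_encoded] = 1.0
mask = csr_array(onehot)

# (mask.T @ D.T).T equals D @ mask; keeping the sparse operand on the left
# guarantees a dense ndarray result. Entry [i, k] is the summed distance
# from sample i to all members of cluster k.
cluster_dist_sums = (mask.T @ D.T).T
```

Whether this beats a dense mask depends on the cluster sizes; as noted above, it is only worth pursuing once the dense mask actually becomes a problem.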
```python
denominator_a = labels_freqs.larray[labels_encoded.larray] - 1
denominator_a = ht.where(
    ht.array(denominator_a, split=0).astype(ht.float32) > 0,
    ht.array(denominator_a, split=0),
    1.0,
)
```
With `.larray` you're operating directly on the underlying torch tensor (a slice of the global array). In this snippet of code:

- On each process, we define `denominator_a` by indexing the local slice of `labels_freqs` by the local slice of `labels_encoded`. Is that what we want, no info required from the other processes?
- In the following line, I think we want to read the local `denominator_a` tensors into a global distributed DNDarray. The argument for that operation is `is_split=0`. Check out the `ht.array()` docs.
- Because `ht.array(is_split=0)` must exchange information on slice sizes among processes (communication overhead), better define it only once.
It turns out you can completely remove `.larray` and it still works. I also deleted the line `denominator_a = ht.array(denominator_a, split=0)` and it works fine as well.
…heck_random_state, reworked checks of X and Y
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #2199      +/-   ##
==========================================
+ Coverage   91.71%   91.73%   +0.01%
==========================================
  Files          86       87       +1
  Lines       14221    14286      +65
==========================================
+ Hits        13043    13105      +62
- Misses       1178     1181       +3
```

Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry.
brownbaerchen
left a comment
This review mostly contains what we discussed in the meeting. It may look like a lot of comments, but most of them should be very quick to implement and some of them maybe you choose not to implement, so don't worry :D
c653d35 to 615bd64
There are a few things left to do before merging this, from my point of view:
- Figure out how to handle the scikit-learn dependency of the tests
- Remove or clean up the benchmarks
- Merge #2253, which this depends on
- Merge #2262, which this depends on
I think the implementation and the tests are ready to be merged. I did a lot of refactoring, since the scikit-learn code that some of this is based on could, in my opinion, be more expressive and concise.
Note that the tests are probably skipped in CI so far, because scikit-learn is not installed there and I am using pytest.importorskip here.
I am approving this PR in this review, even though it still needs a little work, since my previous review was blocking and I don't need to block this PR at this time.
Again, I want to thank @Shabasovich for the work and applaud the creative implementation that works in parallel with no MPI calls at all.
…:helmholtz-analytics/heat into features/2190-Implement_silhouette_score
Description
Type of change
#2190
Additional
Explanation of calculation
Main goal
Example
When I refer to the example, I am using the following values:

- `X = [[0, 0], [10, 10], [20, 20], [1, 1]]`
- `labels = [10, 30, 20, 10]`
- `unique_labels = [10, 20, 30]`
- `labels_encoded = [0, 2, 1, 0]`

Explanation of a(i) calculation
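As a side note, the `unique_labels` and `labels_encoded` values above can be reproduced with NumPy's `np.unique` (a plain-NumPy sketch for illustration; the PR itself works on Heat DNDarrays):

```python
import numpy as np

labels = np.array([10, 30, 20, 10])

# return_inverse gives, for each sample, the index of its label
# within the sorted unique labels -- exactly labels_encoded.
unique_labels, labels_encoded = np.unique(labels, return_inverse=True)

print(unique_labels)   # [10 20 30]
print(labels_encoded)  # [0 2 1 0]
```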
Formula:

$$a(i) = \frac{1}{|C_{i}|-1} \cdot \sum_{j \in C_{i}, j \neq i} d(i,j)$$
Mask
The mask is built with the `==` operator, comparing the label vector with its transpose. Where two labels are the same, the mask has a 1, meaning the two samples belong to the same cluster.

Example
Distance Filtering
Multiplying the distance matrix $D$ with the mask gives a matrix where, in row $i$, every column $j$ that is not in $i$'s cluster is set to 0.
Example
Sum
`sums = [sum_dist_0, sum_dist_1, sum_dist_2, sum_dist_3]`

Example
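The mask, distance-filtering, and sum steps can be sketched in plain NumPy with the example values (illustrative only; the PR uses the distributed Heat equivalents):

```python
import numpy as np

X = np.array([[0, 0], [10, 10], [20, 20], [1, 1]], dtype=float)
labels = np.array([10, 30, 20, 10])

# Pairwise Euclidean distance matrix.
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

# Mask: 1 where two samples share a label (same cluster), else 0.
mask = (labels[:, None] == labels[None, :]).astype(float)

# Distance filtering: zero out distances to other clusters,
# then sum each row; d(i, i) = 0 contributes nothing.
sums = (D * mask).sum(axis=1)
```

For the example, only samples 0 and 3 share a cluster, so `sums` comes out as `[sqrt(2), 0, 0, sqrt(2)]`.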
Denominator
`labels_freqs` has all the cluster sizes; you just need to map them to each sample with `labels_encoded` and subtract 1.

b(i) calculation
Formula:
$$b(i)= \min_{j \neq i} \frac{1}{|C_{j}|} \sum_{l \in C_{j}} d(i,l) $$
Mask
`mask[i, k] = 1` if sample $i$ does not belong to cluster $k$.

Example
Distance Filtering
`min` to choose the neighbor cluster.

Example
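The b(i) steps can likewise be sketched in plain NumPy with the running example (an illustration, not the PR's Heat code):

```python
import numpy as np

X = np.array([[0, 0], [10, 10], [20, 20], [1, 1]], dtype=float)
labels = np.array([10, 30, 20, 10])

unique_labels, labels_encoded = np.unique(labels, return_inverse=True)
n_clusters = len(unique_labels)

# Pairwise Euclidean distance matrix.
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

# One-hot cluster membership (n_samples x n_clusters).
onehot = np.eye(n_clusters)[labels_encoded]

# Mean distance from each sample i to every cluster k.
cluster_sizes = onehot.sum(axis=0)
mean_dists = (D @ onehot) / cluster_sizes

# Mask out each sample's own cluster, then min picks the neighbor cluster.
mean_dists[onehot.astype(bool)] = np.inf
b = mean_dists.min(axis=1)
```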