Implement silhouette_score from scikit-learn#2199

Open
Shabasovich wants to merge 64 commits into main from features/2190-Implement_silhouette_score

Conversation

Collaborator

@Shabasovich Shabasovich commented Mar 13, 2026

Description

  • tried to implement silhouette_samples from scikit-learn

Type of change

  • new function

#2190

Additional

Explanation of calculation

Main goal

  • Calculation of the silhouette score using the formula $$s(i) = \frac{b(i)-a(i)}{\max\{\, a(i),\, b(i) \,\}}$$
Example

When I refer to the example, I am using the following values:

  • X=[[0, 0], [10, 10], [20, 20], [1, 1]]
  • labels=[10, 30, 20, 10]
    • unique_labels = [10,20,30]
    • labels_encoded = [0, 2, 1, 0]
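The encoding step above can be sketched with NumPy (the PR itself operates on Heat DNDarrays; this snippet is only an illustrative sketch, not the PR's actual code):

```python
import numpy as np

# np.unique returns the sorted unique labels; with return_inverse=True it
# also returns each sample's index into that sorted list.
labels = np.array([10, 30, 20, 10])
unique_labels, labels_encoded = np.unique(labels, return_inverse=True)

print(unique_labels)   # [10 20 30]
print(labels_encoded)  # [0 2 1 0]
```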

Explanation of a(i) calculation

Formula: $$a(i) = \frac{1}{|C_{i}|-1} \cdot \sum_{j \in C_{i}, j \neq i} d(i,j)$$

Mask

  • Goal: isolate distances within the same cluster with boolean matrix
  • Uses the == operator to compare the encoded label vector with its transpose; entry $(i,j)$ is 1 if the labels match, meaning $M_{i,j}=1$ if $v_{i}=v_{j}$
Example

$$\begin{bmatrix} 0 \\ 2 \\ 1 \\ 0 \end{bmatrix} == \begin{bmatrix}0 & 2 & 1 & 0 \end{bmatrix} \implies \begin{bmatrix} (0==0)=1 & (0==2)=0 & (0==1)=0 & (0==0)=1 \\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 1 & 0 &0&1\end{bmatrix}$$
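In NumPy terms, the broadcasted comparison above is a one-liner (again a sketch; the PR works on Heat arrays):

```python
import numpy as np

labels_encoded = np.array([0, 2, 1, 0])

# Comparing the column vector against the row vector broadcasts to an
# n x n boolean matrix: mask[i, j] is True iff samples i and j share a cluster.
mask = labels_encoded[:, None] == labels_encoded[None, :]

print(mask.astype(int))
```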

Distance Filtering

Element-wise multiplication of the distance matrix $D$ with the mask gives a matrix where, in row $i$, every column $j$ that is not in $i$'s cluster is set to 0.

Example

$$ D = \begin{bmatrix} d_{0,0} & d_{0,1} & d_{0,2} & d_{0,3} \\ d_{1,0} & d_{1,1} & d_{1,2} & d_{1,3} \\ d_{2,0} & d_{2,1} & d_{2,2} & d_{2,3} \\ d_{3,0} & d_{3,1} & d_{3,2} & d_{3,3} \\ \end{bmatrix} \odot \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} d_{0,0} & 0 & 0 & d_{0,3} \\ 0 & d_{1,1} & 0 & 0 \\ 0 & 0 & d_{2,2} & 0 \\ d_{3,0} & 0 & 0 & d_{3,3} \end{bmatrix} $$

Sum

  • Since $d(i,i)=0$, the self-distance naturally drops out
  • So if we sum along each row, we get exactly $\sum_{j \in C_{i}, j \neq i} d(i,j)$, where $C_{i}$ is the cluster of $i$ and $d(i,j)$ is the distance between samples $i$ and $j$
  • We can collect these in an array of sums: [sum_dist_0, sum_dist_1, sum_dist_2, sum_dist_3]
Example
  • Summing row 0 gives $d_{0,0} + d_{0,3} = d_{0,3}$, since sample 3 is the only other member of sample 0's cluster
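Putting the mask, filtering, and row-sum steps together for the running example (a NumPy sketch assuming Euclidean distances; the PR's metric is configurable):

```python
import numpy as np

X = np.array([[0, 0], [10, 10], [20, 20], [1, 1]], dtype=float)
labels_encoded = np.array([0, 2, 1, 0])

# Pairwise Euclidean distance matrix via broadcasting.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# Same-cluster mask, element-wise filtering, then row sums.
mask = labels_encoded[:, None] == labels_encoded[None, :]
sums = (D * mask).sum(axis=1)  # sums[i] = sum of d(i, j) over j in C_i
```

For this data, sums[0] and sums[3] both equal $d_{0,3}=\sqrt{2}$, while the singleton clusters give 0.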

Denominator

  • Now we need a denominator for every point $i$, so we need to determine the value of $|C_{i}|-1$ for each sample
  • labels_freqs holds all the cluster sizes; we just need to map them to each sample via labels_encoded and subtract 1
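The denominator step, including a guard for singleton clusters where $|C_{i}|-1=0$, can be sketched as follows (NumPy sketch; the names labels_freqs/labels_encoded mirror the PR's, the rest is illustrative):

```python
import numpy as np

labels_encoded = np.array([0, 2, 1, 0])

# Cluster sizes, mapped back to each sample, minus one for the sample itself.
labels_freqs = np.bincount(labels_encoded)        # [2, 1, 1]
denominator_a = labels_freqs[labels_encoded] - 1  # [1, 0, 0, 1]

# Avoid division by zero for singleton clusters.
denominator_a = np.where(denominator_a > 0, denominator_a, 1)
```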

b(i) calculation

Formula:
$$b(i)= \min_{k \neq c_{i}} \frac{1}{|C_{k}|} \sum_{l \in C_{k}} d(i,l) $$
where $c_{i}$ is the cluster index of sample $i$, i.e. the minimum mean distance to any cluster that $i$ does not belong to.

Mask

  • Same as in the calculation of $a(i)$, but comparing against the vector of unique (encoded) labels
  • gives a matrix where mask[i, k] = 1 if sample $i$ belongs to cluster $k$
Example

$$ \begin{bmatrix} 0 \\ 2 \\ 1 \\ 0 \end{bmatrix} == \begin{bmatrix} 0 & 1 & 2\end{bmatrix} \implies mask= \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix}$$
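As a sketch, this rectangular mask is again a broadcasted comparison, now against the cluster indices (illustrative NumPy, not the PR's Heat code):

```python
import numpy as np

labels_encoded = np.array([0, 2, 1, 0])
n_clusters = 3

# mask[i, k] == 1 iff sample i belongs to cluster k (an n x k one-hot matrix).
mask = (labels_encoded[:, None] == np.arange(n_clusters)[None, :]).astype(float)

print(mask)
```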

Distance Filtering

  • Matrix multiplication of the distance matrix $D$ with this mask gives a matrix where entry $(i,k)$ is the sum of distances from sample $i$ to all points in cluster $k$, i.e. equivalent to $\sum_{l\in C_{k}} d(i,l)$
  • we want the neighboring cluster, so we set each sample's entry for its own cluster to infinity to force the min to choose another cluster
Example

$$ D = \begin{bmatrix} d_{0,0} & d_{0,1} & d_{0,2} & d_{0,3} \\ d_{1,0} & d_{1,1} & d_{1,2} & d_{1,3} \\ d_{2,0} & d_{2,1} & d_{2,2} & d_{2,3} \\ d_{3,0} & d_{3,1} & d_{3,2} & d_{3,3} \\ \end{bmatrix} \times \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0\\ 1 & 0 & 0 \end{bmatrix} = \begin{bmatrix} d_{0,0}+d_{0,3} & d_{0,2} & d_{0,1} \\ d_{1,0}+d_{1,3} & d_{1,2} & d_{1,1} \\ d_{2,0}+d_{2,3} & d_{2,2} & d_{2,1} \\ d_{3,0}+d_{3,3} & d_{3,2} & d_{3,1} \end{bmatrix} $$ After setting inf: $$ \begin{bmatrix} \infty & d_{0,2} & d_{0,1} \\ d_{1,0}+d_{1,3} & d_{1,2} & \infty \\ d_{2,0}+d_{2,3} & \infty & d_{2,1} \\ \infty & d_{3,2} & d_{3,1} \end{bmatrix} $$
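The whole $b(i)$ computation for the running example can be sketched as follows (NumPy, Euclidean distances assumed; here the division by $|C_{k}|$ is applied before the infinity masking, which is equivalent):

```python
import numpy as np

X = np.array([[0, 0], [10, 10], [20, 20], [1, 1]], dtype=float)
labels_encoded = np.array([0, 2, 1, 0])
n_clusters = 3

D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
onehot = (labels_encoded[:, None] == np.arange(n_clusters)[None, :]).astype(float)

# (n x n) @ (n x k): entry (i, k) sums distances from sample i to cluster k.
cluster_dist_sums = D @ onehot

# Mean distance to each cluster, then blank out each sample's own cluster
# so that min picks the nearest *other* cluster.
cluster_means = cluster_dist_sums / onehot.sum(axis=0)
cluster_means[onehot.astype(bool)] = np.inf
b = cluster_means.min(axis=1)
```

For sample 0, the nearest other cluster is $\{(10,10)\}$, giving $b(0)=10\sqrt{2}$.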


@brownbaerchen brownbaerchen left a comment


This is already looking quite good. I had a brief look and highlighted some general things. Mainly using assert statements in the tests.

Since you mentioned you had trouble debugging, I want to advertise the debugger once more. Just write breakpoint() in the code whenever you want to stop and explore. There can be trouble with parallel debugging. So when running in parallel, I only call a breakpoint on rank 0:

if ht.comm.rank == 0:
    breakpoint()

If you're already familiar with this stuff, just ignore this. But I know many python programmers, myself included, who were introduced to this way too late, so it doesn't hurt to mention it :D

Comment thread tests/cluster/test_metrics.py
Comment thread heat/cluster/tests/test_metrics.py Outdated
Comment thread heat/cluster/tests/test_metrics.py Outdated
Comment thread heat/cluster/tests/test_metrics.py Outdated
Comment thread heat/cluster/metrics.py Outdated
Comment thread heat/cluster/metrics.py Outdated
Comment thread heat/cluster/metrics.py Outdated

@ClaudiaComito ClaudiaComito left a comment


Thanks a lot @Shabasovich ! I have a few high-level comments :

  • the fundamental requirement for any Heat functionality, is that it must support massive memory-distributed arrays;
  • this PR is calling DNDarray.resplit_(None) often in the code, which means that the DNDarray gets gathered onto each MPI process. After a resplit_(None), any operation on that array will be single-process.
  • most importantly, for our use cases there won't be an "after resplit_(None)", because the process runs out of memory.

At the current stage, the code might be passing the tests, but doesn't exploit memory distribution. The good thing is that all of the Heat operations you're using do support memory distribution, so, next steps:

  • find the spots in the code where arrays are being "gathered" (all the resplit_(None) calls)
  • ask yourself: what size do we expect this array to be if the original input data have, say, ~1e6 or 1e7 rows? Does it make sense to try to copy it entirely onto each process?
  • if not: don't resplit. Just apply the follow up Heat operation to the distributed array.

By the way, check out our new LLM and AI usage guidelines. It's really common for the LLMs to not grasp the memory-distributed requirement for Heat operations.

My recommendation is to not use the AI to write code for the next iteration, but to work on adapting the existing code. Since you've just taken the MPI course, you might want to refresh the tutorials again, especially check out Internals

Thanks again for all the work!

Comment thread heat/cluster/metrics.py Outdated
def check_array(X, accept_sparse=False, input_name="X"):
    if not isinstance(X, ht.DNDarray):
        # Convert to heat array
        X = ht.array(X, split=0)
Member


In general I wouldn't impose split=0 by default, as it comes with all kind of communication overhead downstream (i.e. every operation on the array will be performed in distributed mode).

Unless you've assessed that it has advantages from an algorithmic point of view.

Collaborator Author


I guess I will do a length check before converting it to a heat array, because the communication overhead is then reasonable

Collaborator

@brownbaerchen brownbaerchen Apr 7, 2026


Personally, I would either not distribute in this case or require the input to be a heat array.
I am not sure what the workflow would be where you pass a torch tensor or whatever and use multiple tasks. But in this case, I would not be surprised if different tasks do different things. Since heat uses the world communicator within the function, this can lead to deadlocks.

Comment thread heat/cluster/metrics.py Outdated

def _check_y(y):
    if not isinstance(y, ht.DNDarray):
        y = ht.array(y, split=0)
Member


same as above

Comment thread heat/cluster/metrics.py Outdated
def silhouette_samples(X, labels, *, metric="euclidean", **kwds):
    X, labels = check_X_y(
        X, labels, accept_sparse=["csr"]
    )  # think about accept_sparse, i have no idea what it is and what csr means
Member


hints:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html

We also support sparse csr matrices with very few arithmetic operations, see ht.sparse .

But let's leave the sparse functionality for later, it's lower priority for now

Collaborator


Sparse matrices are matrices with few entries relative to the overall size. This is the case for the masking matrix, where some clusters only contain few entries. It is notably not the case for the distance matrix because every point has a distance wrt to any other point. So it doesn't make sense to me to accept sparse matrices as input and I agree that we should just remove it from this PR.
Having a sparse masking matrix would be nice, but I don't know if there is much to be gained when multiplying the dense distance matrix with a sparse masking matrix. Let's worry about this once it becomes a problem.

Comment thread heat/cluster/metrics.py Outdated
Comment thread heat/cluster/metrics.py Outdated
Comment thread heat/cluster/metrics.py Outdated
Comment thread heat/cluster/metrics.py Outdated
Comment on lines +170 to +175
denominator_a = labels_freqs.larray[labels_encoded.larray] - 1
denominator_a = ht.where(
    ht.array(denominator_a, split=0).astype(ht.float32) > 0,
    ht.array(denominator_a, split=0),
    1.0,
)
Member


with .larray you're operating directly on the underlying torch tensor (a slice of the global array). In this snippet of code:

  • on each process, we define denominator_a by indexing the local slice of label_freqs by the local slice of labels_encoded. Is that what we want, no info required from the other processes?
  • in the following line, I think we want to read the local denominator_a tensors into a global distributed DNDarray. The argument for that operation is is_split = 0. Check out the ht.array() docs.
  • because ht.array(is_split=0) must exchange information on slice size among processes (communication overhead), better define it only once.

Collaborator Author


It turns out you can completely remove larray and it still works. I also deleted the line denominator_a = ht.array(denominator_a, split=0) and it also works fine

Comment thread heat/cluster/metrics.py Outdated
Comment thread heat/cluster/metrics.py Outdated
Comment thread heat/cluster/metrics.py Outdated
@github-project-automation github-project-automation Bot moved this from Todo to In Progress in Roadmap Mar 25, 2026
@ClaudiaComito ClaudiaComito marked this pull request as ready for review April 7, 2026 08:50
@codecov

codecov Bot commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 95.38462% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.73%. Comparing base (991f913) to head (a442d31).

Files with missing lines Patch % Lines
heat/cluster/metrics.py 95.31% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2199      +/-   ##
==========================================
+ Coverage   91.71%   91.73%   +0.01%     
==========================================
  Files          86       87       +1     
  Lines       14221    14286      +65     
==========================================
+ Hits        13043    13105      +62     
- Misses       1178     1181       +3     
Flag Coverage Δ
unit 91.73% <95.38%> (+0.01%) ⬆️



@brownbaerchen brownbaerchen left a comment


This review mostly contains what we discussed in the meeting. It may look like a lot of comments, but most of them should be very quick to implement and some of them maybe you choose not to implement, so don't worry :D

Comment thread heat/cluster/metrics.py
Comment thread heat/cluster/metrics.py Outdated
Comment thread heat/cluster/metrics.py Outdated
Comment thread heat/cluster/metrics.py Outdated
Comment thread heat/cluster/metrics.py Outdated
Comment thread heat/cluster/metrics.py Outdated
Comment thread tests/cluster/test_metrics.py
Comment thread heat/tests/test_metrics.py Outdated
Comment thread heat/tests/test_metrics.py Outdated
Comment thread heat/tests/test_metrics.py Outdated
@Shabasovich Shabasovich force-pushed the features/2190-Implement_silhouette_score branch from c653d35 to 615bd64 Compare April 7, 2026 12:26
brownbaerchen
brownbaerchen previously approved these changes Apr 21, 2026

@brownbaerchen brownbaerchen left a comment


There are three things left to do before merging this from my point of view

  • Figure out how to handle the scikit-learn dependency of the tests
  • Remove or clean up the benchmarks
  • Merge #2253, which this depends on
  • Merge #2262, which this depends on

I think the implementation and the tests are ready to be merged. I did a lot of refactoring since the scikit-learn code that some of this is based on could be more expressive and concise in my opinion.

Note that the tests are probably skipped in CI so far because scikit-learn is not installed there and I am using pytest.importorskip here.

I am approving this PR in this review even though it still needs a little work since my previous one was blocking and I don't need to be blocking this PR at this time.

Again, I want to thank @Shabasovich for the work and applaud the creative implementation that works in parallel with no MPI calls at all.

Comment thread pyproject.toml Outdated
Comment thread tests/core/test_manipulations.py
Comment thread heat/core/manipulations.py
Comment thread benchmarks/cb/cluster.py
@JuanPedroGHM JuanPedroGHM self-requested a review April 27, 2026 07:57
Comment thread heat/cluster/metrics.py
Comment thread heat/cluster/metrics.py
JuanPedroGHM
JuanPedroGHM previously approved these changes Apr 29, 2026
