Implement silhouette_score from scikit-learn#2199
Conversation
brownbaerchen
left a comment
This is already looking quite good. I had a brief look and highlighted some general things. Mainly using assert statements in the tests.
Since you mentioned you had trouble debugging, I want to advertise the debugger once more. Just write breakpoint() in the code whenever you want to stop and explore. There can be trouble with parallel debugging. So when running in parallel, I only call a breakpoint on rank 0:
```python
if ht.comm.rank == 0:
    breakpoint()
```
If you're already familiar with this stuff, just ignore this. But I know many python programmers, myself included, who were introduced to this way too late, so it doesn't hurt to mention it :D
for more information, see https://pre-commit.ci
Co-authored-by: Thomas Saupe <[email protected]>
Thanks a lot @Shabasovich ! I have a few high-level comments :
- The fundamental requirement for any Heat functionality is that it must support massive memory-distributed arrays.
- This PR calls `DNDarray.resplit_(None)` often in the code, which means that the DNDarray gets gathered onto each MPI process. After a `resplit_(None)`, any operation on that array will be single-process.
- Most importantly, for our use cases there won't be an "after `resplit_(None)`", because the process runs out of memory.

At the current stage, the code might be passing the tests, but it doesn't exploit memory distribution. The good thing is that all of the Heat operations you're using do support memory distribution, so, next steps:

- Find the spots in the code where arrays are being "gathered" (all the `resplit_(None)` calls).
- Ask yourself: what size do we expect this array to be if the original input data have, say, ~1e6 or 1e7 rows? Does it make sense to try to copy it entirely onto each process?
- If not: don't resplit. Just apply the follow-up Heat operation to the distributed array.
By the way, check out our new LLM and AI usage guidelines. It's really common for LLMs not to grasp the memory-distributed requirement of Heat operations.
My recommendation is not to use AI to write code for the next iteration, but to work on adapting the existing code. Since you've just taken the MPI course, you might want to go through the tutorials again; especially check out Internals.
Thanks again for all the work!
```python
def check_array(X, accept_sparse=False, input_name="X"):
    if not isinstance(X, ht.DNDarray):
        # Convert to heat array
        X = ht.array(X, split=0)
```
In general I wouldn't impose split=0 by default, as it comes with all kinds of communication overhead downstream (i.e. every operation on the array will be performed in distributed mode).
Unless you've assessed that it has advantages from an algorithmic point of view.
I guess I will do a length check before converting it to a Heat array, because the communication overhead is then reasonable.
Personally, I would either not distribute in this case or require the input to be a heat array.
I am not sure what the workflow would be where you pass a torch tensor or whatever and use multiple tasks. But in this case, I would not be surprised if different tasks do different things. Since heat uses the world communicator within the function, this can lead to deadlocks.
```python
def _check_y(y):
    if not isinstance(y, ht.DNDarray):
        y = ht.array(y, split=0)
```
```python
def silhouette_samples(X, labels, *, metric="euclidean", **kwds):
    X, labels = check_X_y(
        X, labels, accept_sparse=["csr"]
    )  # think about accept_sparse, i have no idea what it is and what csr means
```
hints:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html
We also support sparse CSR matrices with very few arithmetic operations; see `ht.sparse`.
But let's leave the sparse functionality for later, it's lower priority for now
Sparse matrices are matrices with few entries relative to the overall size. This is the case for the masking matrix, where some clusters only contain few entries. It is notably not the case for the distance matrix, because every point has a distance with respect to every other point. So it doesn't make sense to me to accept sparse matrices as input, and I agree that we should just remove it from this PR.
Having a sparse masking matrix would be nice, but I don't know if there is much to be gained when multiplying the dense distance matrix with a sparse masking matrix. Let's worry about this once it becomes a problem.
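For later reference, here is a minimal SciPy sketch (not part of this PR, and assuming `scipy.sparse.csr_array` from SciPy >= 1.8) of multiplying the dense distance matrix by a sparse one-hot masking matrix; the distance values are made up for illustration, and the labels reuse the small example from the PR description:

```python
import numpy as np
from scipy.sparse import csr_array

# Toy values: 4 samples in 3 clusters (labels_encoded = [0, 2, 1, 0]).
labels_encoded = np.array([0, 2, 1, 0])
n_samples, n_clusters = 4, 3

# A symmetric dense distance matrix (every off-diagonal entry is nonzero).
D = np.array([
    [0.0, 1.0, 2.0, 0.5],
    [1.0, 0.0, 1.5, 1.2],
    [2.0, 1.5, 0.0, 2.2],
    [0.5, 1.2, 2.2, 0.0],
])

# Sparse one-hot masking matrix: mask[i, k] = 1 if sample i is in cluster k.
onehot = np.zeros((n_samples, n_clusters))
onehot[np.arange(n_samples), labels_encoded] = 1.0
mask = csr_array(onehot)

# (mask.T @ D.T).T equals D @ mask; keeping the sparse operand on the left
# guarantees a dense ndarray result. Entry [i, k] is the summed distance
# from sample i to all members of cluster k.
cluster_dist_sums = (mask.T @ D.T).T
```

Whether this beats a dense mask depends on the cluster sizes; as noted above, it is only worth pursuing once the dense mask actually becomes a problem.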
```python
denominator_a = labels_freqs.larray[labels_encoded.larray] - 1
denominator_a = ht.where(
    ht.array(denominator_a, split=0).astype(ht.float32) > 0,
    ht.array(denominator_a, split=0),
    1.0,
)
```
With `.larray` you're operating directly on the underlying torch tensor (a slice of the global array). In this snippet of code:

- On each process, we define `denominator_a` by indexing the local slice of `labels_freqs` by the local slice of `labels_encoded`. Is that what we want, no info required from the other processes?
- In the following line, I think we want to read the local `denominator_a` tensors into a global distributed DNDarray. The argument for that operation is `is_split=0`. Check out the `ht.array()` docs.
- Because `ht.array(is_split=0)` must exchange information on slice sizes among processes (communication overhead), better define it only once.
It turns out you can completely remove `.larray` and it still works. I also deleted the line `denominator_a = ht.array(denominator_a, split=0)` and it works fine as well.
…heck_random_state, reworked checks of X and Y
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #2199      +/-   ##
==========================================
+ Coverage   91.71%   91.73%   +0.01%
==========================================
  Files          86       87       +1
  Lines       14221    14286      +65
==========================================
+ Hits        13043    13105      +62
- Misses       1178     1181       +3
```

Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry.
brownbaerchen
left a comment
This review mostly contains what we discussed in the meeting. It may look like a lot of comments, but most of them should be very quick to implement and some of them maybe you choose not to implement, so don't worry :D
c653d35 to 615bd64
There are a few things left to do before merging this, from my point of view:
- Figure out how to handle the scikit-learn dependency of the tests
- Remove or clean up the benchmarks
- Merge #2253, which this depends on
- Merge #2262, which this depends on
I think the implementation and the tests are ready to be merged. I did a lot of refactoring, since the scikit-learn code that some of this is based on could, in my opinion, be more expressive and concise.
Note that the tests are probably skipped in CI so far, because scikit-learn is not installed there and I am using pytest.importorskip here.
I am approving this PR in this review, even though it still needs a little work, since my previous review was blocking and I don't need to block this PR at this time.
Again, I want to thank @Shabasovich for the work and applaud the creative implementation that works in parallel with no MPI calls at all.
…:helmholtz-analytics/heat into features/2190-Implement_silhouette_score
Description
Type of change
#2190
Additional
Explanation of calculation
Main goal
Example
When I refer to the example, I am using the following values:

- `X = [[0, 0], [10, 10], [20, 20], [1, 1]]`
- `labels = [10, 30, 20, 10]`
- `unique_labels = [10, 20, 30]`
- `labels_encoded = [0, 2, 1, 0]`

Explanation of a(i) calculation
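As a side note, the `unique_labels` and `labels_encoded` values above can be reproduced with NumPy's `np.unique` (a plain-NumPy sketch for illustration; the PR itself works on Heat DNDarrays):

```python
import numpy as np

labels = np.array([10, 30, 20, 10])

# return_inverse gives, for each sample, the index of its label
# within the sorted unique labels -- exactly labels_encoded.
unique_labels, labels_encoded = np.unique(labels, return_inverse=True)

print(unique_labels)   # [10 20 30]
print(labels_encoded)  # [0 2 1 0]
```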
Formula:

$$a(i) = \frac{1}{|C_{i}|-1} \cdot \sum_{j \in C_{i}, j \neq i} d(i,j)$$
Mask
The mask is built with the `==` operator, comparing the label vector with its transpose. Where two labels are the same, the mask has a 1, meaning the two samples belong to the same cluster.

Example
Distance Filtering
Multiplying the distance matrix $D$ with the mask gives a matrix where, in row $i$, every column $j$ that is not in $i$'s cluster is set to 0.
Example
Sum
`sums = [sum_dist_0, sum_dist_1, sum_dist_2, sum_dist_3]`

Example
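The mask, distance-filtering, and sum steps can be sketched in plain NumPy with the example values (illustrative only; the PR uses the distributed Heat equivalents):

```python
import numpy as np

X = np.array([[0, 0], [10, 10], [20, 20], [1, 1]], dtype=float)
labels = np.array([10, 30, 20, 10])

# Pairwise Euclidean distance matrix.
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

# Mask: 1 where two samples share a label (same cluster), else 0.
mask = (labels[:, None] == labels[None, :]).astype(float)

# Distance filtering: zero out distances to other clusters,
# then sum each row; d(i, i) = 0 contributes nothing.
sums = (D * mask).sum(axis=1)
```

For the example, only samples 0 and 3 share a cluster, so `sums` comes out as `[sqrt(2), 0, 0, sqrt(2)]`.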
Denominator
`labels_freqs` has all the cluster sizes; you just need to map them to each sample with `labels_encoded` and subtract 1.

b(i) calculation
Formula:
$$b(i)= \min_{j \neq i} \frac{1}{|C_{j}|} \sum_{l \in C_{j}} d(i,l) $$
Mask
`mask[i, k] = 1` if sample $i$ does not belong to cluster $k$.

Example
Distance Filtering
`min` to choose the neighbor cluster.

Example
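The b(i) steps can likewise be sketched in plain NumPy with the running example (an illustration, not the PR's Heat code):

```python
import numpy as np

X = np.array([[0, 0], [10, 10], [20, 20], [1, 1]], dtype=float)
labels = np.array([10, 30, 20, 10])

unique_labels, labels_encoded = np.unique(labels, return_inverse=True)
n_clusters = len(unique_labels)

# Pairwise Euclidean distance matrix.
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

# One-hot cluster membership (n_samples x n_clusters).
onehot = np.eye(n_clusters)[labels_encoded]

# Mean distance from each sample i to every cluster k.
cluster_sizes = onehot.sum(axis=0)
mean_dists = (D @ onehot) / cluster_sizes

# Mask out each sample's own cluster, then min picks the neighbor cluster.
mean_dists[onehot.astype(bool)] = np.inf
b = mean_dists.min(axis=1)
```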