General discussions: Problem with task 205 Clustering the Stars ambiguity in formulation

Alexandr Milovantsev 2019-07-31 16:25:41

I have found that task 205 have a little ambiguity in the formulation.
The culprit is with border points – those, which have not enough neighbours to become the core.
If border point is positioned between two clusters and have neighbours in both of them, should it belong simultaneous to both or only to one of them?

The first variant is indifferent to order of clusterisation.

In the second variant sizes of clusters will depend on order in which clusters are formed. It can lead to unwanted variations in answers.

Unfortunately, author had chosen the second way. Double hit is that it’s not mentioned in task formulation :)

I had chosen first variant and struggled to understand why my cluster randomly tends to become bigger then checker expected. And I have passed by luck alone :) .

So this condition definitely should be mentioned in the task.

Rodion (admin) 2019-07-31 19:26:03

Hello!

Well, I again feel a bit unsure as I just vaguely remember the task :)

However.

should it belong simultaneous to both

I don't think any definition of clustering allows item to belong to more than one cluster (independent of clustering algorithm).

Note that theoretically it may be not the border but even corner point with 3 or more equally pretending clusters. But including point to more than one will lead to controversies (like sum of point in clusters greater than total number).

Also this usually doesn't make good "business-sense". Suppose each point represents the set of symptoms of given patient and we want to cluster them to "ill" or "healthy". Making patient both ill and healthy leads us to nowhere. Though other classification approaches (besides clustering) may allow this, of course.

Unfortunately, author had chosen the second way

Well, I think author (me) supposed that, besides the thoughts explained above, the situation when the point appears at exact border are beyond probability. Author (I) will try to dedicate some time to this question and find out whether it is possible either to create dataset which doesn't give such ambiguities, or to make checker more intelligent :)

Thanks for sharing this!

Alexandr Milovantsev 2019-08-01 03:20:13

If we go to stay on second way, than simplest way to resolve this on my opinion is:
1. Add explicit note about border points and only one possibile "owner" for such points.
2. Define order of clusterisation by starting making clusters from first unprocessed core point according to order as they appear in original dataset.

mikamove 2024-03-03 10:54:56

Just to visually support the discussion:

  ___         ___
 /   \       /   \
|     |     |     |
|     |  O  |     |
 \___/       \___/
core 1        core 2

         |---|
          eps

If You have two cores (circles), it is not clear to which core the "O" point will be added. In Your proposed algorithm it is simply added to the first core considered in the loop. Hence the results depends on the order of input-data, which is a subideal property IMO and something one wouldn't expect, although one could find this out.

(So I hope this clarification is not considered as spoiling! but it has been stated less directly before in this post)

I first wrote up the entire algorithm using python sets of points, which was more elegant but behave non-deterministic in iteration order. After rewriting it with lists, I get exactly the correct results.

mikamove 2024-03-04 08:27:37

Note that in my previous post it is essential that the "O"-point is a non-core point. So in the end You will have 2 clusters where "O" is added to one or the other depending on the order of lines in the CSV-file.