Clustering the Stars
Popular task in data processing is about clustering the samples - i.e. splitting the whole set of data points into the groups of related or close ones. There exist different approaches to clustering and we'll now try one of the most popular - based on density of points - it is called DBSCAN. Besides data processing it could be very handy in finding regions in images for example.
Let us take the Gliese Catalogue of Nearest Stars, its 2-nd edition. For every star we'll take only:
x-s are some digits);
1.4, going through blue, white, yellow, red stars;
33light years (roughly from
10- the more, the dimmer).
If we try to plot this data on a chart, we'll get famous picture, similar to one shown above - Hertzsprung-Russell diagram. It shows that stars of given color usually tend to have some specific magnitude (and therefore some specific size) - i.e. there is a strong correlation.
However it also shows that there is more than one "line" of this correlation. Most of stars fall into the line called Main Sequence, but there are also a group of hot but small stars (in the lower left corner) - they are White Dwarfs. Really there could be more separate groups found, but for now it is not that important.
To find such groups not with our eyes, but with the help of computer, we can apply clustering to these data. Let us try the approach which is explained below.
2Dspace since we have only two axes - color and magnitude).
minPoints. The cluster could not have less than
epsis a kind of maximum distance between neighbor points in the cluster.
core-point any which have at least
minPoints - 1neighbors at distance not exceeding
core-point then proceed adding any other
core-points which lay not further than
epsfrom any of points already included in the core of the cluster. It is like growing the snowball.
core-points could be found to add to the cluster - we should add any other
non-corepoints found within
epsdistance from the points of the cluster's core.
3until no more unvisited
The very important step yet omitted is to normalize axes before step
2. Why is it necessary? You see that
even for our example, the range of colors (
-0.3 ... 1.4) is roughly 5 times less than the range of magnitudes
0 ... 10) so the distances by one axis are incomparable with distances by another. This surely will spoil
calculation of distance between points by Pythagorean formula. Normalization will scale and shift axes to roughly
similar values. Here is the simple algorithm:
average(i.e. sum and divide by amount);
s = sqrt( sum( (Xi - avg)^2 ) / (n-1) )- here
Xiare values and
nis their amount;
Xi' = (Xi - avg) / s- that's all.
For example if we have color values of
4 stars like
10, -20, 5, -5 then after such normalization they become
0.945 -1.323 0.567 -0.189.
The data are in CSV format, the first line is a header and should be skipped. You will be told how many data
samples to take from the beginning of this list (e.g. first
500 lines), and what values of
to use. You need to return sizes of clusters you get in decreasing order.
Input data: will contain
N (amount of lines to take) followed by
Output: should contain several integers - sizes of clusters in decreasing order.
input data: 700 3 0.375000 output data: 666 10 6 5 3
input data: 900 3 0.250000 output data: 818 17 13 8 6 4 3