July 2005 - Posts

30 July 2005
TF-IDF weighting source code
Hi all.

I have about 200 documents.
I want to represent this document in vector space model using the weighting scheme - TF IDF weighting before I can move to the next step - clustering.
Is there anyone here could help me with the full source code or a free software so that I can do the preprocessing with the documents.

Thanks in advance.


18 July 2005
Questions about my K-means C++ implementation
I have a couple questions about my implementation of the k-means algorithm (main source attached).

1) Notice in section (3), it will print out a warning statement if no points have been assigned to a cluster. Is this typical behaviour seen, and if so how do you handle it? Currently I just do nothing about it, but should I re-initialize a center if no points get assigned to it? For the particular data I'm clustering, I've found that in general, a value of k above 10 is likely to have a handful of empty clusters.


2) What I do with this code is cluster a bunch of floating point data (dimensionality of 17 to be exact). When I have the final means, I find the real data point that is closest to each mean and assign it to be that cluster's representative. Oddly enough, however, I discovered that sometimes a data point will be assigned as a representative as more than once cluster. I fixed this in my code to make sure that no point is used as a representative more than once, but do you think this is odd? Obviously it depends on the nature of the data (which I can't really visualize), but discovering this made me feel rather uneasy, as I thought the cluster centers would have a fairly large distance between each other.


3) This is a more general k-means question. About 2-4 of my data fields are very low (0 or almost nearly 0 for each data point). Would this characteristic of the data 'skew' or somehow negatively effect the clustering, or should I keep anything in mind about this? (So far the results of my clustering algorithm are unbelievably good though, and I verified it by hand 4 times just to make sure it was as good as I thought it was).


4) This is a more difficult question, but is there any way I can speed up this code? I can't think of any improvements to it, and I already compile it with full optimization but it's still annoyingly slow. (The ComputeDistance() function is actually a function pointer to a user-specified function. The two arguments to the function are both references to a vector of floats, and return a single float value. For example:

float SimCluster::EuclideanDistance(const vector<float> &a, const vector<float> &b) {
float result = 0.0;
for (unsigned int i = 0; i < a.size(); i++) {
result += pow(a[ i ] - b[ i ], 2);
}
return sqrt(result);
}



Thanks in advance for any advice/insight/knowledge you can share with me. I'd really appreciate it, because I have no one around for a 'sanity check'. :)