Questions about my K-means C++ implementation

I have a couple questions about my implementation of the k-means algorithm (main source attached).



1) Notice in section (3), it will print out a warning statement if no
points have been assigned to a cluster. Is this typical behaviour, and
if so, how do you handle it? Currently I do nothing about it, but
should I re-initialize a center if no points get assigned to it?
For the particular data I'm clustering, I've found that in general a
value of k above 10 is likely to produce a handful of empty clusters.





2) What I do with this code is cluster a bunch of floating point data
(dimensionality of 17, to be exact). When I have the final means, I find
the real data point that is closest to each mean and assign it to be
that cluster's representative. Oddly enough, however, I discovered that
sometimes a data point will be assigned as the representative of more
than one cluster. I fixed this in my code to make sure that no point
is used as a representative more than once (a sketch of such a check
appears after question 4), but do you think this is odd? Obviously it
depends on the nature of the data (which I can't really visualize), but
discovering this made me feel rather uneasy, as I thought the cluster
centers would have a fairly large distance between each other.





3) This is a more general k-means question. About 2-4 of my data fields
are very low (0 or nearly 0 for each data point). Would this
characteristic of the data 'skew' or otherwise negatively affect the
clustering, or is there anything I should keep in mind about it? (So far
the results of my clustering algorithm are unbelievably good, though; I
verified them by hand 4 times just to make sure they were as good as I
thought.)





4) This is a more difficult question, but is there any way I can speed
up this code? I can't think of any improvements to it, and I already
compile it with full optimization, but it's still annoyingly slow. (The
ComputeDistance() function is actually a function pointer to a
user-specified function. The two arguments to the function are both
references to a vector of floats, and it returns a single float value.
For example:









float SimCluster::EuclideanDistance(const vector<float> &a, const vector<float> &b) {
    float result = 0.0;
    for (unsigned int i = 0; i < a.size(); i++) {
        result += pow(a[i] - b[i], 2);
    }
    return sqrt(result);
}
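Regarding question 2: for illustration, here is a minimal sketch of a representative-picking step with a uniqueness check, in the style of the code above. The function name, the free-function distance pointer, and the vector-of-vectors layout are assumptions for the example, not the code from the post:

#include <vector>
#include <limits>

using namespace std;

// Pick the closest real data point to each final mean, greedily
// skipping points already claimed by another cluster.
vector<int> PickRepresentatives(const vector< vector<float> > &data,
                                const vector< vector<float> > &centers,
                                float (*Distance)(const vector<float> &, const vector<float> &)) {
    vector<int> reps(centers.size(), -1);
    vector<bool> used(data.size(), false);
    for (unsigned int c = 0; c < centers.size(); c++) {
        float best = numeric_limits<float>::max();
        for (unsigned int i = 0; i < data.size(); i++) {
            if (used[i]) continue;  // no point may represent two clusters
            float d = Distance(data[i], centers[c]);
            if (d < best) {
                best = d;
                reps[c] = static_cast<int>(i);
            }
        }
        if (reps[c] >= 0) used[reps[c]] = true;  // claim the point
    }
    return reps;
}

Note that with a greedy check like this, the result depends on the order in which the clusters are processed; two means sharing a nearest point is itself a hint that those two centers sit close together.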





Thanks in advance for any advice/insight/knowledge you can share with
me. I'd really appreciate it, because I have no one around for a
'sanity check'. :)



: Roots

Replies (46)

It was a .cpp file. Meh, I'll just paste the source here since it's not very big. It's just annoying because [ i ] without spaces becomes [i] so I have to put spaces to stop that from happening. I can't really paste any more code than this because it's for work, but I'm going to try to ask my supervisor to let me make it open source. Basically I built a clustering framework that's very flexible. You can easily modify the code to add a new file parser, select various distance metrics, or select a number of clustering algorithms. The function below is the clustering algorithm implementation for k-means, which is tied into a class that handles all the data clustering. The variable names from other parts of the code seen in here should be pretty intuitive, I hope.

#include <iostream>
#include <vector>

#include "cluster.h"
#include "data.h"

using namespace std;

// This is the K-Means clustering algorithm
void SimCluster::KMeansClustering() {
    float min_distance;  // Temporary variable holding the minimum distance between a data point and a mean
    vector< vector<float> > new_means(centers);  // 2D vector with the means to update the 2D vector to
    bool not_done = true;    // When set to false, the clustering has converged and is finished
    bool no_change = false;  // Boolean to hold whether the means have moved since the last iteration
    float tmp_dist;          // Temporary distance variable

    while (not_done) {
        // (1) Clear the counts from the last run
        for (unsigned int i = 0; i < now_k; i++) {
            counts[i] = 0;
        }

        // (2) Assign each data point to a cluster
        for (unsigned int i = 0; i < DataManager->num_data_pts; i++) {
            // Initially assume the first cluster has the minimum distance to the point
            min_distance = ComputeDistance(DataManager->data[i], centers[0]);
            labels[i] = 0;

            // Now find the cluster with the true minimum distance to the point
            for (unsigned int j = 1; j < now_k; j++) {
                tmp_dist = ComputeDistance(DataManager->data[i], centers[j]);
                if (tmp_dist < min_distance) {
                    min_distance = tmp_dist;
                    labels[i] = j;  // Assign the data point to belong to cluster j
                }
            }

            // Increase the number of data points with its closest mean as cluster "labels[i]"
            counts[labels[i]] += 1;
        }

        // (3) Update means based on the labeling
        for (unsigned int i = 0; i < DataManager->num_data_pts; i++) {
            for (unsigned int j = 0; j < DataManager->dimensions; j++) {
                new_means[labels[i]][j] += DataManager->data[i][j];
            }
        }

        for (unsigned int i = 0; i < now_k; i++) {
            if (counts[i] != 0) {
                for (unsigned int j = 0; j < DataManager->dimensions; j++) {
                    new_means[i][j] = new_means[i][j] / static_cast<float>(counts[i]);
                }
            }
            else {
                if (WARNINGS) cerr << "WARNING: no points for cluster #" << i << endl;
            }
        }

        // (4) Check for convergence (means didn't move)
        no_change = true;
        for (unsigned int i = 0; i < now_k; i++) {
            for (unsigned int j = 0; j < DataManager->dimensions; j++) {
                if (new_means[i][j] != centers[i][j]) {
                    no_change = false;
                    break;
                }
            }
            if (!no_change) {
                break;
            }
        }

        if (no_change) {  // Austin, we have convergence
            not_done = false;
        }
        else {
            centers = new_means;
        }
    }  // while (not_done)
}

: Roots

KINGSLEY TAGBO

Making it open-source has other benefits as well. You will be able to have other people analyze the code and help you re-design it if necessary. Why did you choose C++, and why not another language like C#? Thanks


Because I don't like Microsoft, and I don't like using solutions to problems that are only applicable in a Microsoft development environment. I use Linux at home and at work, as well as UNIX and Mac OS X at work. The world doesn't belong to Microsoft, you know :) Besides, I don't know C# and have absolutely no desire to learn it.

: Roots

KINGSLEY TAGBO

I didn't mean to strike a nerve [:)] I learn as much as possible about not just systems but people as well.

I just wanted to know the guidelines, assumptions, and policies you considered in deciding to work with the current framework.

There is nothing wrong with using C/C++; as you may know, it outperforms C# in terms of speed.


No worries. :) I'm just a person who really dislikes being locked down into a single software/platform, and I want my code to (hopefully) be useful to someone else out there, who may not be running Windows, or MacOS, or Linux, or BSD, etc. Anyway, the primary reasons I chose C++ were:

1) I know C++ =D

2) Computational speed, like you said

3) I'm writing this code in Linux on my Pentium M laptop, and I need it to also run on PowerPC architectures (where my data is saved), so portability

4) Organization. Even though I like C and all, it makes me feel dirty when I use it nowadays. Then again, C++ isn't the prettiest beast either...

5) Modularity. Like I said earlier, I designed the framework so that it is easy to write extensions to it (file parsers, clustering algorithms, etc.) and C++ does this rather well.

: Roots

- That's what I believed too. Glad to get some confirmation on it.

- Yeah, I thought about scoring the clusters based on the variance in distances or something. The way I score them now, though, is that I multiply the cluster center by its weight and the number of data points, and then compare that to the true sum of all points belonging to the cluster. In my specific case, I guess you could say I'm not really interested in how tight the clusters are, but how well the cluster centers represent the set of data points in that cluster. :)

- A colleague of mine ran gprof on the code yesterday and found that removing the pow() function used to square the difference of two points and replacing it with `delta = a[i] - b[i]; result += delta * delta;` cut the runtime from 351 to 297 seconds, so that's not bad. (A sketch of the resulting function appears after this list.)

- I have support for dimension reduction built into the code, but I haven't implemented it yet. I know it will make the program much faster, but I'm more concerned with getting the highest possible accuracy than cutting down the runtime. It's not unbearably slow, just unsatisfactory.

- Yeah, I suppose I should normalize this data. It's not too variant though. Only one of the fields can range from about 1.0 to 20.0, and the rest all hover between 0.0 and 0.7 or so. I will look into it and see how it affects the results, but my program has such low error already I doubt it will make much of a difference. :)

- I thought about trying out k-medoids, but since the results are so excellent already I decided to play around with different algorithms and distance metrics at another time. :)
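For reference, the colleague's change amounts to something like the following sketch (the same function from the top post with the pow() call replaced; when the distances are only compared against each other, the sqrt() could be dropped as well, since it doesn't change the ordering):

float SimCluster::EuclideanDistance(const vector<float> &a, const vector<float> &b) {
    float result = 0.0f;
    for (unsigned int i = 0; i < a.size(); i++) {
        // A plain multiply avoids the general-purpose pow() call,
        // which was the hot spot gprof found.
        float delta = a[i] - b[i];
        result += delta * delta;
    }
    return sqrt(result);
}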

Thanks for the comments. If I end up publishing a paper on the work I'm doing (which is highly likely), I'm pretty sure my employer will allow me to open-source this. If that happens, I'll try to remember to come back here and post about it. :D

: Roots

- Looking at how the centroids vary with held-out data is a good test, but the key thing is to look at the clusters with new data and verify that the clusters describe the new data the same way that they described the old data. This means movement of the centroids, change in the distance distribution, and change in the relative frequencies of assignment to each cluster. But the most important thing is the part about new data.

- viva profiling.

- Spectral techniques are not just for dimensionality reduction. They do a highly non-linear transformation on your data that allows elongated clusters to be handled well. If your data doesn't have these, then you shouldn't worry. Just don't fall into the assumption that dimensionality reduction (usually by SVD to find a linear projection into a lower dimensional space) is the same as spectral reduction (which is something like SVD on the inter-point distances rather than the coordinates).

- Your data ranges make it sound like you definitely need to either eliminate the variable that goes from 1 to 20, eliminate all of the other axes, or scale your data. I think that something much simpler than you might like is happening with your data. You might try to determine whether your clusters are preferentially aligned on certain axes. You can do this by looking at the eigenvalues of the positions relative to the centroid. If you have a few large eigenvalues and mostly very small ones, then your data is degenerate.

: tdunning

I notice you are using a std::vector, but it looks like you only ever walk through the data items from the start (as far as I can tell at a brief glance). Is it quicker if you use a std::list instead and walk through with an iterator, rather than using the random access [] operator? This may improve the speed (no promises though).

Frances.

: Buontempo

Hi again;

If you list the header files cluster.h and data.h, or try zipping up the whole app and posting that, I'll have a look through to see what is slow. Of course, this is not platform independent. I'm running Windows XP. What are you running? Also, the compiler may (or may not) optimise things. Again, what are you using?

Frances.

: Buontempo

Hi all,

Just to throw my two cents' worth in: why not use an old-fashioned array?

OO might be fine, but I found that reducing the number of function calls speeds things up too.

Udo

: usiebig

I doubt seriously that using an array would speed things up much at all unless something terrible is being done like accessing a linked list using [].

I would recommend permuting the list of training examples and then using an iterator. That gives you the best of both worlds. I would doubt very seriously if you will be able to detect the difference in speed between an iterator and native array access. Keeping the implementation abstract is a good idea to make it easier to re-use.

: tdunning

Wow, I have a lot to reply to suddenly. :)

- Why not use a linked list instead of a vector for better performance?

Because it won't give me better performance. IIRC, elements in a vector are located contiguously in memory (thus exploiting spatial locality and giving optimum cache hit rates), while elements in a list are not guaranteed to be contiguous. The advantage of using a list is that it is easy to dynamically resize (push front, push back, remove from the middle, etc.) at no excessive computational cost (unlike a vector), but I don't need to do that in my code, so a vector is the best choice.

- Why not use an array instead of a vector?

An array would give the same performance as a vector (or maybe just a *tiny* bit more) because, like a vector, the elements are contiguous in memory.

- What are you running this code on?

Linux on both x86 and PowerPC platforms. It won't run on Windows without some work, because I use a wrapper script in bash and that would have to be re-written to use DOS commands (sick), and I also use the getopt function to process command line arguments, so you'd need to install GNU libraries on Windows to do so. I don't plan on making a Windows version (mostly because I have no desire to, and in my entire life I've never compiled something on Windows and don't wish to start :). But when I eventually make the code publicly available, of course anyone will be able to port it to whatever they like. But for the experiments I'm doing (well, for any experiment I've ever done in my life) all I need is a *nix-compatible OS.

- Post data.h and cluster.h and your other code...

Like I said, I can't really make this code any more public than I have right now, because I don't want to get in trouble with my company or anything. (If it weren't for that, though, I would share it in a heartbeat.) Also, I'm still making minor to moderate changes to the code as I complete more studies, and I'm writing a user's guide and a developer guide for it too, so it's worth it to wait. :) (By the way, I'm also going to start working on a visualization tool for easy human analysis of the results using PCA to reduce points to 2 dimensions, so that will be cool to have around too.)

Having said that, I started running another study that I believe will have much better results for us, in terms of what we are looking for. The clustering results themselves were already excellent (about 1.02% average sum error across all metrics in 46 different test sets), but the best k was very small, and that did not sit well with the kind of data we're using. After removing one of the data fields from the clustering (which was more like a field that was an approximate result of all the other fields), the best k shot up.

One final note: one of the features I had planned was to enable an option for "max k auto-scaling". Basically, what this meant was that after the clustering finished, the code would look at what max k was during execution of k-means and compare that with the best k found. If the best k is very near to max k, the code will do some additional runs until it finds a best k that is sufficiently far from max k.

This is one of my more advanced features and harder to work into the current framework, but if I do eventually implement it, I'm wondering what you all think. Specifically, regarding the implementation:

- When is best k "too close" to max k? Is it when the difference between the two is 3 or less? Is it when best k is 95% or more of max k?

- What should max k be scaled up to on detection of an insufficiently low max k? Should it increase by 10? Should it increase by 10%? Should it double?

: Roots

If you are starting to see issues that depend on scaling your axes, you would be very well advised to also look at spectral algorithms. The results can be really good, and they are less dependent on (but not independent of) scaling.

For that matter, your investigation of which axes to pick to optimize cluster fit to data is very similar to maximum a posteriori fits using general multivariate normal distributions. I think I have mentioned it before, but MacKay's book has an excellent treatment of clustering in this framework. You might find it a useful general method for unifying your efforts on variable and scaling selection. Don't forget evaluation on held-out data as a useful technique.

Good luck!

: tdunning

Hi,

just one more word on speeding up the calculations:

- If an if-statement does not depend on loop variables, put it outside the loop. (It will blow up the code, I know.)

- If you are accessing an array element inside a loop where the index does not depend on the loop variable, put it into a separate variable.

- Check if using another data type (long instead of int, double instead of float, or vice versa) helps

and

- Do a check on the performance of operations. The example below is for VB 6; the loop was like this:

getSystemTime start
for i from 0 to 1000000
    to = from
end loop
getSystemTime end

Especially the result for the Private Type was surprising to me:

' Long
' toVar = 7: 6984 msec 1,029481132
' toVar = 7: 8629 msec 1,271963443
' toVar = Variant/Integer: 11090 msec 1,634728774
' toVar = toVar + 1: 8826 msec 1,301002358
' toVar = toVar + 1: 8829 msec 1,301444575
' toVar = toVar - 1: 9244 msec 1,362617925
' toVar = toVar + 3: 8844 msec 1,30365566
' toVar = toVar - 3: 9647 msec 1,422022406
' toVar = fromVar: 6784 msec 1
' toVar = fromArr(7): 16427 msec 2,421432783
' toVar = fromArr(j): 16237 msec 2,393425708
' toVar = fromPrivType.peElem: 6780 msec 0,999410377
' toArr(3) = 7: 19345 msec 2,8515625
' toArr(3) = 7: 19178 msec 2,826945755
' toArr(3) = Variant/Long: 18497 msec 2,7265625
' toArr(3) = toArr(3) + 1: 36133 msec 5,326208726
' toArr(3) = toArr(3) + 1: 36273 msec 5,346845519
' toArr(3) = toArr(3) - 1: 35548 msec 5,239976415
' toArr(3) = toArr(3) + 3: 35745 msec 5,26901533
' toArr(3) = toArr(3) - 3: 35311 msec 5,205041274
' toArr(3) = fromVar: 16865 msec 2,485996462
' toArr(3) = fromArr(7): 27930 msec 4,117040094
' toArr(3) = fromArr(j): 27721 msec 4,086232311
' toArr(3) = fromPrivType.peElem: 16448 msec 2,424528302
' toArr(k) = 7: 16417 msec 2,419958726
' toArr(k) = 7: 16441 msec 2,423496462
' toArr(k) = Variant/Long: 19703 msec 2,904333726
' toArr(k) = toArr(k) + 1: 29593 msec 4,362175708
' toArr(k) = toArr(k) + 1: 29569 msec 4,358637972
' toArr(k) = toArr(k) - 1: 30615 msec 4,512824292
' toArr(k) = toArr(k) + 3: 29570 msec 4,358785377
' toArr(k) = toArr(k) - 3: 29583 msec 4,360701651
' toArr(k) = fromVar: 19011 msec 2,802329009
' toArr(k) = fromArr(7): 27934 msec 4,117629717
' toArr(k) = fromArr(j): 27727 msec 4,087116745
' toArr(3) = fromPrivType.peElem: 19001 msec 2,800854953
' toPrivType.peElem = 7: 6987 msec 1,029923349
' toPrivType.peElem = 7: 6987 msec 1,029923349
' toPrivType.peElem = Variant/Long: 11083 msec 1,633696934
' toPrivType.peElem = toPrivType.peElem + 1: 8837 msec 1,302623821
' toPrivType.peElem = toPrivType.peElem + 1: 8829 msec 1,301444575
' toPrivType.peElem = toPrivType.peElem - 1: 8830 msec 1,301591981
' toPrivType.peElem = toPrivType.peElem + 3: 8836 msec 1,302476415
' toPrivType.peElem = toPrivType.peElem - 3: 8827 msec 1,301149764
' toPrivType.peElem = fromVar: 6783 msec 0,999852594
' toPrivType.peElem = fromArr(7): 16424 msec 2,420990566
' toPrivType.peElem = fromArr(j): 16237 msec 2,393425708
' toPrivType.peElem = fromPrivType.peElem: 6787 msec 1,000442217

Regards

Udo

: usiebig

I received a little criticism at a meeting at work today for choosing my initial cluster centers to be random data points. Basically, what I do is:

1) Initialize the random number generator with srand(time(NULL))

2) For each cluster center that needs to be initialized, choose a random integer between 0 and the number of data points - 1

3) That point then becomes the initial cluster center

One of my co-workers believed that picking a random point in the data space (a ghost point, not a real data point) would have a significant effect on the results. Do you agree or disagree (I know that it depends on the nature of the data, but in general ;) )? When I first wrote this software I was initially going to compute my initial cluster centers by finding a random floating point number for each data field, between the maximum and minimum value for that field, but I thought just picking a real data point would work just as well and be simpler from a design perspective. Is it even worth investigating how different initial points affect the results? (FYI: I do 3 iterations for each value of k as well, and each iteration has new random starting points for the centers.) Thanks =D
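For concreteness, the three steps above might look like this in the style of the posted code. This is a sketch only; InitializeCenters is a hypothetical name, the members (DataManager, centers) are assumed from the earlier listing, and it assumes k does not exceed the number of data points:

#include <cstdlib>
#include <ctime>

void SimCluster::InitializeCenters(unsigned int k) {
    // (1) Seed the random number generator (ideally once, at startup)
    srand(time(NULL));
    vector<bool> taken(DataManager->num_data_pts, false);
    for (unsigned int c = 0; c < k; c++) {
        // (2) Choose a random integer between 0 and num_data_pts - 1,
        // re-drawing if that point is already a center
        unsigned int idx;
        do {
            idx = rand() % DataManager->num_data_pts;
        } while (taken[idx]);
        taken[idx] = true;
        // (3) That point becomes the initial cluster center
        centers[c] = DataManager->data[idx];
    }
}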

: Roots

These comments largely miss the mark with modern compilers. If you are using C++ or Java, then moving simple invariants outside of a loop will probably make no difference at all to speed.

The good advice from Udo is that you should test your attempted optimizations. I think you should test and profile both before and after you try anything. Remember that speeding up code that doesn't take much time to begin with will make almost no difference. Likewise, getting a very small win when there are big problems elsewhere doesn't make a measurable difference. The answer is to understand your code.

: tdunning

Starting with a random data point is very standard practice with clustering algorithms.

There are definitely going to be some pathological starting conditions, but this will generally work well if clustering makes sense for your data at all.

Running the clustering multiple times and checking for consistency is the best medicine for the issue that they are complaining about. If you get very similar cluster positions and shapes for different starting points, you will have found a robust clustering. If not, you may still be ok.
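One simple way to act on that advice, sketched in the style of the posted code: restart several times, keep the lowest-cost run, and compare the resulting centers across runs as the consistency test itself. InitializeCenters and TotalWithinClusterDistance are hypothetical helpers standing in for whatever the real framework exposes:

void SimCluster::BestOfNRuns(unsigned int n_runs, unsigned int k) {
    float best_score = -1.0f;
    vector< vector<float> > best_centers;
    for (unsigned int r = 0; r < n_runs; r++) {
        InitializeCenters(k);   // new random starting points
        KMeansClustering();     // run to convergence
        // Score the run by the sum of point-to-center distances
        float score = TotalWithinClusterDistance();
        if (best_score < 0.0f || score < best_score) {
            best_score = score;
            best_centers = centers;
        }
    }
    centers = best_centers;  // keep the lowest-cost run
}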

As an example of a case where the starting point matters, imagine that you are clustering 1000 points from a 2-dimensional normal distribution into 3 clusters. What will generally happen is that you will wind up splitting the data into three roughly equal pieces, but which pieces are in the same cluster is not very stable for anything but near neighbors. Any stopping point will be a roughly equivalently good quantization of the data set, however. It just won't be a robust result.

On the other hand, if you are clustering 1000 points each from well separated 2-dimensional normal distributions, then your clustering will be nearly the same regardless of your starting points.

The moral here is that the data is pretty important.

Note that deterministic starting points in this sort of problem are almost always bad. For that matter, a fixed ordering of training data isn't even all that good. Permuting the training data is a very good idea.

Hope this helps.

: tdunning

Please send me the full-version C++ file of k-means.

 

: sunilcrsun

Hi,

I am new here. I have an assignment to implement k-means in C++, so I tried to work on your code, but the problem is: where are the cluster.h and data.h files?

: aamirrind

[quote user="aamirrind"]

I am new here. I have an assignment to implement k-means in C++, so I tried to work on your code, but the problem is: where are the cluster.h and data.h files?

[/quote]

I need it too!!! Please.


Stop asking for the code. I am not going to give it to you.

It's not that hard to implement k-means. I suggest you do your own homework/assignment for your job. Besides, the code I wrote belongs to my former employer, and for me to share it would violate intellectual property law.

: Roots

oh, another Bill Gates lol

guy, do u know open source? and sharing information... blah

 


I want to get Java source code for the CN2 rule learning algorithm (CN2 rule induction algorithm). Can you tell me where I can download Java source code for the CN2 rule learning algorithm? Please help me.

: sawwai

I want C code for the Apriori algorithm in data mining. Can you help me, please?

: Deepz

Java code to implement classification through information gain in data mining and warehousing.

: mani

I want to get Java source code for the CN2 rule learning algorithm for data classification. If anyone knows, please tell me where I can download it. Please help me.

: sawwai

Can I have PrefixSpan and GSP code in Java? Thanks

: Manel

Hi, I built a C5.0 model and I'm now trying to score a data set. Just clicking on the calculate propensity scores boxes does not seem to result in accurate probability scores. Would appreciate any useful suggestions. Adrian

: Adrian

Hi sir, I need FP association mining source code in Java. We have to give an input flat file from a random generator function, we have to give the minimal support, and generate a tree.

: priya

Hello friends, I am in urgent need of source code for a naive Bayes classifier. Can anybody help, please? My email id is: gaakansha62@yahoo.com

: aakansha

I need the formula for bilinear or bicubic interpolation and how to write it in a computer language, in C/C++. I am unable to write it in a computer language.

: krishna

Interested in your code, but what I exactly need is k-means using arrays alone in Java, and it should not use vectors, Swing, or JAMA. No need to plot the graph and all; I just have to display each cluster and its corresponding points.

: ramya

I am in need of the APFT algorithm in C.

: Anand

Hi, I want Java code for the FP-growth algorithm. My project title is frequent itemsets using the FP-growth algorithm. Please send me the code as soon as possible; mail me the code at my mail id adodka@hotmail.com


Hi, I need source code for Apriori, naive Bayesian classification, the back-propagation algorithm, k-means, and the k-medoids clustering algorithm, in Java.

: Pooja garg

Code for a decision tree algorithm in data mining, exactly in the Java language.


Hi, I need a Java implementation of FP-growth and Apriori as soon as possible, please.


I want to implement particle swarm optimization for data clustering in Java... so please guide me.


I need code to construct an FP-tree based on dynamic user input.


Hmm, it seems the forums won't let me post an attachment. Oh well. Forget about my 4th question about speeding up the code.

: Roots

KINGSLEY TAGBO

What is the file type and file size of the attachment you are posting? We can change the allowed file type and file size to accommodate your upload. Alternatively, you can zip the file as a *.zip and the site will accept it. We are looking forward to the source code so that we can analyze it. Thanks,


I found a bug that I think was responsible for the odd behavior I was seeing in #1 and #2 in the top post. I forgot to reset the new_means 2D vector before re-calculating the means (the old means were being accumulated into the new means), which as far as I can tell was responsible for clusters getting isolated and having no members. Kinda sucks that such a simple error was causing me so much grief. After speaking with a colleague yesterday, I also added a "no point" checker just in case the situation ever does occur even with this bug fix. What it does is: if it finds a cluster with no members, it re-assigns the cluster point to the farthest outlier of all the points (that is, the point with the greatest distance from its cluster center).
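In code, that checker might look something like the following sketch, using the names from the listing above and running after the means have been updated (an illustration, not the author's actual implementation):

// (3b) Re-seed any empty cluster with the current farthest outlier:
// the point with the greatest distance from its own cluster center.
for (unsigned int c = 0; c < now_k; c++) {
    if (counts[c] != 0) continue;  // cluster c has members; nothing to do
    float worst = -1.0f;
    unsigned int outlier = 0;
    for (unsigned int i = 0; i < DataManager->num_data_pts; i++) {
        float d = ComputeDistance(DataManager->data[i], centers[labels[i]]);
        if (d > worst) {
            worst = d;
            outlier = i;
        }
    }
    centers[c] = DataManager->data[outlier];
}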

: Roots

You asked a number of good questions about your clustering code. Here are some comments (more than answers).

- Empty clusters may be an indication that you have too many clusters. They can also just happen.

- One very good test for the quality of the clusters that you have obtained is to see if the sizes and average distances to centroids for the training data are similar to the same statistics for held-out data. A good example of how this can go bad is if you ask for as many clusters as you have training points. The clustering you will get is one cluster per data point. New data will almost certainly not be in the same place, so the sizes and mean distances will be dramatically different.

- Speeding up your code might best be done by computing the distances using as much vectorization as possible (consider getting a good matrix package to do this).

- You might try using a spectral technique to decrease the dimensionality of your problem. This will speed up the clustering (2-4 dimensional distances are much faster than 17 dimensional ones) at the cost of doing the eigenvector decomposition. See Ng and Jordan's excellent paper on this sort of technique.

- Normalization is a serious issue in almost all clusterings. Try normalizing all axes to zero mean and unit variance (see the sketch after this list). The spectral techniques don't require this.

- K-medoids can be an interesting alternative to k-means. This also gives you a natural representative of each cluster.
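A minimal sketch of that zero-mean, unit-variance normalization, assuming the vector-of-vectors layout used elsewhere in the thread (the function name is hypothetical):

#include <vector>
#include <cmath>

// Scale every dimension to zero mean and unit variance, in place.
void NormalizeZeroMeanUnitVariance(std::vector< std::vector<float> > &data) {
    if (data.empty()) return;
    const unsigned int n = data.size();
    const unsigned int dims = data[0].size();
    for (unsigned int j = 0; j < dims; j++) {
        float mean = 0.0f;
        for (unsigned int i = 0; i < n; i++) mean += data[i][j];
        mean /= n;
        float var = 0.0f;
        for (unsigned int i = 0; i < n; i++) {
            float d = data[i][j] - mean;
            var += d * d;
        }
        float sd = std::sqrt(var / n);
        if (sd == 0.0f) continue;  // constant field: leave it alone
        for (unsigned int i = 0; i < n; i++) {
            data[i][j] = (data[i][j] - mean) / sd;
        }
    }
}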

I hope this helps.

: tdunning

Hi friends, I am doing a project on the CURE data clustering algorithm. Can anyone send the code to mca_chandu@ymail.com? Please.


Hi, I would like to do my M.Tech project on qualitative factors in an educational institution using the fuzzy k-means algorithm. For this I need Java source code. Can anyone show me the way to get the code? Can you suggest how to use this algorithm effectively?


Hey guys, can any one of you send me the C code for the k-nearest-neighbour algorithm? My mail id is kurrakrishnachaitanya@gmail.com. Please send it soon; I have to submit my assignment.

: krishna

Hi sir, I am Sadananda. I am studying for a BE, and I am doing a project based on the fuzzy c-means algorithm.

I am finding difficulty in coding the fuzzy c-means. I am requesting you:

please send Java source code for fuzzy c-means.


