I've taken the K-Means clustering code and reworked it a bit to be more optimized. The changes I made include:
- I made the clustering calculations run on multiple threads. The multi-threaded clustering runs anywhere from 40% to 60% faster, depending on how many vectors you're clustering and how many dimensions each vector contains. I've found that one thread per processor (two per physical processor if running with hyper-threaded procs) was optimal. Basically, I divide the vectors by the number of threads it's going to run on, then kick off each thread to figure out which cluster each of the vectors it is responsible for belongs to. When a thread is finished, it returns an array of cluster indexes, which is then used to put each vector in the correct new cluster.
- I also changed the multi-dimensional arrays to be jagged arrays. This helps with performance and cleans up the code a bit, because you don't have to create a new double[] and copy the values from the main vector array. You can just pass the nth vector in the jagged array directly. Also, the CLR has optimizations built in for working with plain one-dimensional arrays, but not with multi-dimensional arrays, so jagged arrays (regular arrays of regular arrays) can take advantage of these optimizations.
- I've also made it easier to use the unit test class to create different-sized sets of vectors with different numbers of dimensions.
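To make the first two changes concrete, here is a rough sketch of the idea, not the actual code. All the names (ParallelAssignSketch, Assign, NearestCentroid, SquaredDistance) are made up for illustration, and it assumes squared Euclidean distance as the measure. It also simplifies one thing: instead of each thread returning its own array of cluster indexes, each thread writes into its own slice of one shared array, which comes to the same result.

```csharp
using System;
using System.Threading;

// Hypothetical sketch; names and structure are assumptions, not the original code.
public static class ParallelAssignSketch
{
    // Squared Euclidean distance between two vectors of equal length (assumed measure).
    static double SquaredDistance(double[] a, double[] b)
    {
        double sum = 0.0;
        for (int i = 0; i < a.Length; i++)
        {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the centroid closest to the given vector.
    static int NearestCentroid(double[] vector, double[][] centroids)
    {
        int best = 0;
        double bestDist = double.MaxValue;
        for (int c = 0; c < centroids.Length; c++)
        {
            double dist = SquaredDistance(vector, centroids[c]);
            if (dist < bestDist)
            {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    // Splits the vectors into contiguous slices, one per thread, and returns
    // the cluster index chosen for every vector.
    public static int[] Assign(double[][] vectors, double[][] centroids, int threadCount)
    {
        if (threadCount < 1) threadCount = 1;
        int[] assignments = new int[vectors.Length];
        int chunk = (vectors.Length + threadCount - 1) / threadCount; // ceiling division
        Thread[] workers = new Thread[threadCount];

        for (int t = 0; t < threadCount; t++)
        {
            // Declared inside the loop so each thread's lambda captures its own copies.
            int start = Math.Min(t * chunk, vectors.Length);
            int end = Math.Min(start + chunk, vectors.Length);
            workers[t] = new Thread(() =>
            {
                for (int i = start; i < end; i++)
                {
                    // vectors[i] is already a double[], so the row is passed as-is, no copy.
                    assignments[i] = NearestCentroid(vectors[i], centroids);
                }
            });
            workers[t].Start();
        }

        // Wait for every slice to finish before reading the results.
        foreach (Thread worker in workers) worker.Join();
        return assignments;
    }
}
```

You could call it as ParallelAssignSketch.Assign(vectors, centroids, Environment.ProcessorCount). Environment.ProcessorCount reports logical processors, so on a hyper-threaded machine that already works out to two threads per physical processor. Because every thread only writes to its own range of indexes, no locking is needed.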
John (Turbo)