Data Mining

K-MEANS MULTI-THREADED CLUSTERING ALGORITHM WITH SOURCE CODE BY JCONWELL

Last post 09-30-2005, 16:03 by jconwell. 0 replies.

    I've taken the K-Means clustering code and reworked it a bit to make it faster.  The changes include:

    • The clustering calculations now run on multiple threads.  Multi-threading makes the clustering run anywhere from 40% to 60% faster, depending on how many vectors you're clustering and how many dimensions each vector contains.  I've found that one thread per processor (two per physical processor if running on hyper-threaded processors) was optimal.  Basically I divide the vectors by the number of threads, then kick off each thread to figure out which cluster each of its vectors belongs to.  When a thread finishes, it returns an array of cluster indexes, which are then used to put each vector in its correct new cluster.
    • I also changed the multi-dimensional arrays to jagged arrays.  This helps with performance and cleans up the code a bit, because you don't have to create a new double[] and copy the values out of the main vector array.  You can just pass the nth vector in the jagged array.  Also, the CLR has optimizations built in for working with straight (single-dimensional) arrays, but not for multi-dimensional arrays.  So jagged arrays (a regular array of regular arrays) can take advantage of these optimizations.
    • I've also made it easier for the unit test class to create sets of vectors with different sizes and dimensions.
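
    The first two changes can be sketched roughly as follows.  This is a minimal illustration, not the code from the download: the class name ParallelAssigner, the method names, and the use of squared Euclidean distance are all assumptions made for the example.  It also writes results into one shared array at non-overlapping indexes instead of returning a separate array from each thread, which is a simplification of what the post describes.

```csharp
using System;
using System.Threading;

// Hypothetical helper, named for this example only.
public static class ParallelAssigner
{
    // Squared Euclidean distance between two vectors of equal length
    // (assumed metric; the original code may use a different one).
    static double SquaredDistance(double[] a, double[] b)
    {
        double sum = 0.0;
        for (int i = 0; i < a.Length; i++)
        {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the centroid closest to the given vector.
    static int NearestCentroid(double[] vector, double[][] centroids)
    {
        int best = 0;
        double bestDistance = double.MaxValue;
        for (int c = 0; c < centroids.Length; c++)
        {
            double distance = SquaredDistance(vector, centroids[c]);
            if (distance < bestDistance)
            {
                bestDistance = distance;
                best = c;
            }
        }
        return best;
    }

    // Splits the vectors into contiguous slices, one per thread, and
    // returns the cluster index chosen for each vector.
    public static int[] Assign(double[][] vectors, double[][] centroids, int threadCount)
    {
        int[] assignments = new int[vectors.Length];
        Thread[] threads = new Thread[threadCount];
        int sliceSize = (vectors.Length + threadCount - 1) / threadCount;

        for (int t = 0; t < threadCount; t++)
        {
            // Declared inside the loop so each lambda captures its own range.
            int start = t * sliceSize;
            int end = Math.Min(start + sliceSize, vectors.Length);
            threads[t] = new Thread(() =>
            {
                for (int i = start; i < end; i++)
                {
                    // vectors[i] is already a double[], so it is passed
                    // directly with no temporary copy.
                    assignments[i] = NearestCentroid(vectors[i], centroids);
                }
            });
            threads[t].Start();
        }

        // Wait for every slice to finish before reading the results.
        foreach (Thread thread in threads)
        {
            thread.Join();
        }
        return assignments;
    }

    public static void Main()
    {
        double[][] vectors =
        {
            new double[] { 0, 0 }, new double[] { 1, 1 },
            new double[] { 9, 9 }, new double[] { 10, 10 }
        };
        double[][] centroids =
        {
            new double[] { 0.5, 0.5 }, new double[] { 9.5, 9.5 }
        };

        // One thread per logical processor, as suggested above.
        int[] result = Assign(vectors, centroids, Environment.ProcessorCount);
        Console.WriteLine(string.Join(", ", result)); // prints 0, 0, 1, 1
    }
}
```

    Because each thread only writes to its own slice of the assignments array, no locking is needed in this sketch; the Join calls make sure all results are in place before the array is read.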

    John (Turbo)
