Data Mining Project Quick Start Guide

If you want to implement a data mining solution and are not sure how to start, the following guide may help:

(1.) What programming language will you create the data mining application with? Some possible languages are:

Java, C/C++, Visual Basic 6, Delphi, SQL Server, Visual Basic .NET, C#.NET.

(2.) What data mining algorithm will you implement?

Some popular data mining algorithms include:

(a.) Naive Bayes Classifiers

(b.) Decision Tree Algorithms (CHAID, C4.5)

(c.) One Attribute Rule (1R or 1Rule) algorithm

(d.) Apriori Algorithm

(e.) Linear Regression Algorithms

(f.) Nearest Neighbor Instance Based Learning Algorithms, e.g. K Nearest Neighbor

(g.) Clustering Algorithms

(3.) Define the type of data you want to data mine, get samples of the data and mine it.
 : visual basic .net     Reply  
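As an illustration of what implementing one of the listed algorithms involves, here is a minimal Python sketch of the One Attribute Rule (1R) idea: for each attribute, predict the most frequent class for each of its values, then keep the attribute whose rule makes the fewest mistakes on the training data. The function name `one_rule` and the toy weather rows are assumptions made up for this sketch; it is not the implementation offered on the site.

```python
from collections import Counter, defaultdict

def one_rule(rows, target):
    """Return (best_attribute, rule, errors) using the 1R idea.

    rows is a list of dicts; target is the key holding the class label.
    """
    best = None
    attributes = [a for a in rows[0] if a != target]
    for attr in attributes:
        # Count class labels for every value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # Each attribute value predicts its most frequent class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors are the training rows that this rule misclassifies.
        errors = sum(1 for row in rows if rule[row[attr]] != row[target])
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

# Hypothetical toy data, invented for illustration only.
weather = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]

attribute, rule, errors = one_rule(weather, "play")
print(attribute, errors)
```

On this toy data the rule built on `outlook` misclassifies one row, while the rule built on `windy` misclassifies two, so `outlook` is selected.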

Replies (57)

Dear KT,

I have gone through the 1Rule algorithm explanation and found that I have gained a better understanding of data mining. Thank you.

I have also visited the UCI Repository of machine learning databases. Most of the data sets are from the medical field, and a lot of them could not be downloaded (with the error "ftp access has been terminated").

My lecturer is looking for something technical, not management-related. So maybe I will choose College Student Assessment as my research data. Can you suggest any other data that is suitable?

Again, please accept my thanks. You have really helped me a lot.


: QiuTing    Reply


I probably can't suggest any other datasets that are both purely technical and useful.

However, try searching for datasets using Google.

Visual Basic .NET

Dear Andy,

I will attempt to give my opinion.

I will post a reply within the next 48 hours.




The Naive Bayes Algorithm on the site can handle both numeric (continuous) data and discrete data.

You do not have to bin your continuous data in order to analyze it using the Naive Bayes Algorithm.

By default, the Naive Bayes Algorithm implementation assumes that your continuous data follows a Normal (Gaussian) distribution.

You may change the Density Estimation from Normal or Gaussian Distribution to Kernel Density Estimation. There are different types of Kernel Density Estimators and these do not assume any specific distribution for your data.
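To make the point about continuous data concrete, here is a minimal Python sketch of Naive Bayes with a Normal (Gaussian) density per feature, so raw measurements are used without binning. It is only an illustration under stated assumptions: the function names `fit_gaussian_nb` and `predict`, the variance floor, and the sample measurements are all invented here and are not taken from the implementation on the site.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(samples, labels):
    """Estimate a prior plus per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for x, y in zip(samples, labels):
        grouped[y].append(x)
    model = {}
    for label, rows in grouped.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # A small floor keeps the variance from collapsing to zero.
        variances = [
            max(sum((v - m) ** 2 for v in col) / n, 1e-9)
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(samples), means, variances)
    return model

def predict(model, x):
    """Return the class with the highest log posterior for sample x."""
    def log_posterior(params):
        prior, means, variances = params
        total = math.log(prior)
        for value, m, var in zip(x, means, variances):
            # Log of the normal density, evaluated on the raw continuous value.
            total += -0.5 * math.log(2 * math.pi * var) - (value - m) ** 2 / (2 * var)
        return total
    return max(model, key=lambda label: log_posterior(model[label]))

# Hypothetical measurements: two continuous features per record.
samples = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
labels = ["negative", "negative", "negative", "positive", "positive", "positive"]

model = fit_gaussian_nb(samples, labels)
print(predict(model, [1.1, 2.0]))
print(predict(model, [3.1, 4.0]))
```

Switching to a kernel density estimator would replace only the density term inside `log_posterior`; the rest of the structure stays the same.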

What type of measurements are you taking? You said that about 50 different measurements are taken; what is it about these measurements that makes them not well suited for analysis using Naive Bayes, e.g.




I am doing a project on data mining using fuzzy association rules in VB, and I would also like to include a chart to display the values in order to define the fuzzy sets. How can I achieve this? Please help. Is there any source code? Thanks!

: kimleng    Reply


I want to do clustering. The algorithm I want to use is BIRCH. I have the source code for Unix, but I want to port it to C++. How should I do this?

Thanks a lot

: happyever    Reply


I read that you have the source code for BIRCH on Unix. If you don't mind, can you send me the source code? Maybe I can get it working in C++, with your help of course. Thank you.


: Robert_wu    Reply

I was wondering if there is any VB source code available for computing rough set reducts, equivalence classes, and extraction of rules. I understand there is a C++ library available (ROSETTA). I am not very good at C++, but I want to use these classes in VB.NET. Any ideas or suggestions on how to get C++ classes into VB.NET seamlessly?

: t    Reply


I'm interested in data mining with rough sets. My thesis is about it and I would like to know anything about it. If you can help me, please send it to . I would be glad to share it with you.

I do have the paper on hand. But since the paper is large (820 KB), I cannot post it here.

I can't open the PDF file.

Can you send it to ?

: kimleng    Reply

Dear kimleng,

What error did you get in opening the downloaded file?

I downloaded the same file from the site. If you cannot open the file you downloaded from the site yourself, then even if I give you mine, you will still be unable to open it.



: stephenlai    Reply

Hi happyever, can I get the BIRCH algorithm in C++? I am now developing a system using this method and I can't find the algorithm for BIRCH. Can you send it to me?

: Hans    Reply

Hello Constantinos,

Do you have any source code for rough set theory? Can you mail it to me?

: lizaliez    Reply

Help me, I am doing a project to demonstrate prepruning and postpruning in C4.5. Does anybody have source code in C#? Please send it to me : Thanks a lot

: duymap    Reply

I want source code for the BIDE algorithm. BIDE is an algorithm for finding frequent closed sequential patterns.

I have a dataset with 20 labelled records of football games. The labels are 'Won' or 'Lost'. I put 5 of these records aside and build a decision tree with the remaining 15 records. I then test how the tree predicts the labels of the 5 records which I set aside. 4 of the 5 labels are predicted correctly. What is the re-substitution error? How do I calculate it? What is my 'training set'? Is it the 15 records from the original dataset or the 5 records I set aside? Finally, why do I get a message 'IP address blocked' when I try to register on this site?
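The arithmetic in this question can be sketched in a few lines of Python. In standard usage, the training set is the 15 records used to build the tree, and the re-substitution error is the error rate measured when the tree predicts those same 15 records. The 4-out-of-5 figure is a hold-out (test) error instead. The post does not give the tree's predictions on the 15 training records, so only the hold-out figure is computed below; the individual labels are invented so that exactly one of five is wrong, and `error_rate` is a hypothetical helper name.

```python
def error_rate(actual, predicted):
    """Fraction of records whose predicted label differs from the actual label."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

# The 5 held-out games: 4 predicted correctly, 1 wrong (labels are illustrative).
holdout_actual = ["Won", "Lost", "Won", "Won", "Lost"]
holdout_predicted = ["Won", "Lost", "Won", "Lost", "Lost"]

holdout_error = error_rate(holdout_actual, holdout_predicted)
print(holdout_error)  # prints 0.2
```

The re-substitution error would come from calling the same function with the 15 training labels and the tree's predictions for those 15 records.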

I'm working on a cluster ensemble project. I have generated results by applying the k-means algorithm to the iris dataset. This process is repeated n times. Now I need C code for a cluster ensemble using the co-association method.

: Nisha    Reply

I have a dataset for a chemistry exam. What interesting analyses or data mining activities can be performed on it?

: jj    Reply

I am currently working on a project for which I need a CLARANS implementation for WEKA. If you have that or simple Java code for CLARANS, please mail it to me on

: Jatin    Reply

I need a classification algorithm, termed the rough set algorithm, for my project, which involves both text and numerical data giving how many times it has occurred.

I'm in need of k-means source code.

: Uma    Reply

Can anyone provide me with C code for the KNN, SVD, OLS, and PLS algorithms for missing value estimation in DNA microarray data? Thanks in advance.

I urgently need sequential pattern source code. Please, can anyone help me?

I am totally new to this data mining business. I would like someone to point me in the right direction, please.

In my business I will have a database with several tables (let's assume MS SQL Server). For the sake of example, I have a 'person' table and a 'purchases' table (each person can make several different purchases).

Here is an example of the 'person' table:

* ID
* First Name
* Last Name
* Address
* Date of Birth

Here is an example of the 'purchase' table:

* Date of Purchase
* Time of Purchase
* Store Location
* Number of Items Purchased
* Total Purchase Amount

What I would like to do is perform data mining in order to find the "categories", or "clusters" of characteristics, that define each 'purchase amount'. Meaning, my target column is the 'purchase amount' column. Now I want to know all the different clusters of fields that produce each 'amount' in this column.

For example, I would like to know what is common to all the people who bought between $2,000 and $2,500. Maybe they all came from the same neighbourhood? Maybe all of them were young and came from the same neighbourhood? Maybe there are 2 distinct groups that fit the '$2,000-$2,500' spending group?

I am almost sure there are different categories (or distinct clusters) that match each price range: one might be old people who come to buy at night, another might be middle-aged people who buy more than 10 items, and another might be middle-aged people who buy on Sundays.

From what I understand, this falls under the 'data mining' domain (clustering, maybe?). Is this true? And if so, what tools must I use in order to find the sets of parameters that define the different groups for each pricing category?

As you can see, I have no idea which tools or applications I should use. Someone recommended installing an add-in for Excel that could do the trick, but this seems rather too simple. Someone else mentioned BI (I don't know what exactly to do there). I know there is also 'Microsoft Analysis Services'. Does that help me in any way?

Thank you very much!

John

: John Ming    Reply

Hi, I would like to ask: which algorithm selects the optimal number of clusters? Regards

: m.alroby    Reply

I have centers U1(1,2,3) and U2(4,5,6) and distance d1(1,2,3), d2(1,2,3), d3(1,2,3), d4(1,2,3). How can I calculate k-means with Euclidean distance? Thanks

: giakomo    Reply
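One k-means iteration with Euclidean distance can be sketched as follows, in Python. The two centers U1 and U2 are taken from the question; the data points are hypothetical, because the four values in the question are all identical, and the helper names `euclidean`, `assign`, and `update` are assumptions made for this sketch. Each point is assigned to its nearest center, then each center moves to the mean of its assigned points.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign(points, centers):
    """Index of the nearest center for each point."""
    return [
        min(range(len(centers)), key=lambda i: euclidean(p, centers[i]))
        for p in points
    ]

def update(points, assignment, centers):
    """Move each center to the mean of its points; an empty cluster keeps its center."""
    new_centers = []
    for i, old in enumerate(centers):
        members = [p for p, a in zip(points, assignment) if a == i]
        if members:
            new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new_centers.append(old)
    return new_centers

centers = [(1, 2, 3), (4, 5, 6)]  # U1 and U2 from the question
points = [(1, 2, 3), (2, 3, 4), (5, 5, 5), (4, 6, 7)]  # hypothetical data points

assignment = assign(points, centers)
new_centers = update(points, assignment, centers)
print(assignment)
print(new_centers)
```

Repeating the assign and update steps until the assignment stops changing gives the usual k-means loop. The same distance function works for any number of dimensions, not just two or three.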

Source code for K-means algorithm

Hi, I need the source code and the related data set for the FP-growth algorithm in Java as soon as possible. Please reply through email.

: mahi    Reply

Hi. I have just started my final year project, so as of now I am at the data preprocessing step. I have been asked to use a genetic algorithm for attribute reduction. Can anyone here please explain how to perform data preprocessing using this algorithm and how it differs from other algorithms? Thank you.

: sujji    Reply

I need Java source code for a K nearest neighbour algorithm implementation. Please, can anyone provide code for this? It would help my project a lot. My mail ID is

Please help me...


I need an implementation of FP-growth. Please, can you help me?

I'm doing my final project for this semester. I plan to build a decision tree using the SLIQ algorithm, but I have run into problems doing that. If you have source code using SLIQ or any paper, please contact me. My email :

: nosier    Reply

Can someone please help me write simple code for analysing data from a web page with data mining in .NET?

Hello Kingsley Tagbo,

First, how should I address you?

Second, please accept my thanks for replying to me. I really appreciate it. : )

I think I will choose VB as my language because I'm more familiar with it.

Java would be my second choice because I'm planning to learn Java over these two semesters.

As for the choice of algorithm, I really have no idea, because I don't know how each algorithm works or which algorithm is better to use in which case.

As for the type of data, I'm thinking of doing something commercial, maybe stock or share forecasting. But I don't know whether it's practical or not, because there is only one attribute for each record. From what I know, there is already a stock forecasting tool on the market, called NetProphet.

Actually, I'm not sure whether I'm right to choose data mining. During my research, I found it difficult to get resources. Furthermore, most of the algorithms are really difficult for me to understand. For example, neural networks.



: QiuTing    Reply

Dear QiuTing:

I appreciate your regards.

You don't need to address me. Just post a reply to my message. We have already been introduced.

I have a Visual Basic application that uses the Naive Bayes Algorithm.

It is a fully functional data mining program.

I understand your desire to work with commercial data. It can be hard to get, but here is a link to a data mining algorithm explanation that I have written. It was written for the One Rule (1Rule) algorithm.

Here is a link to the UCI Repository of machine learning databases []. Irvine, CA: University of California, Department of Information and Computer Science

I will be putting up some information and articles on the actual workings of data mining algorithms in the following weeks. It will probably help you. The other option is to buy books on data mining; you can look into that.

Finally, neural networks are fairly extensive and may not necessarily be better than the other algorithms. As a beginner, you probably should not worry about them.



Thanks for your reply. My project is a study of data mining and data visualisation. I have about 12,000 financial data records (e.g. fund info, return, NAV, date), for which I am going to develop a system involving 3 parts:

1) Connect the data from MS Access to my application and let the user search and sort the data through SQL queries.

2) Display the data in various chart formats for analysts to analyse.

3) The main part: after studying data mining concepts and research, I chose to implement k-means clustering to mine my data (focusing mainly on return and pricing) and to visualise my results, to see whether any patterns can be found in the financial data to help users make better decisions.

I have been referring to research papers and some notes on the web about the k-means concept, and I want to try to implement it in Java, linked to my data. However, it seems quite difficult to implement the algorithm in Java with JDBC, since the data are not saved in a flat file.

After reading materials online, I have some questions about clustering, and I wonder if you could give me some ideas. I have 12,000 data records, and the purpose of data mining is to discover interesting patterns in the data. However, if I choose to narrow the type of data and sample it before running k-means (e.g. select only data from 1999, or data where the company is USA-based) and then mine these groups, would the idea of data mining be lost? I did initially assume that date and country would have an effect. Or should clustering not assume the outcome in advance?

As k-means only takes numerical samples for clustering, if I do it in 2 dimensions, does that mean I can only use 2 types of variables as input, e.g. return and fund assets? Does that also limit the full purpose of data mining, which is to mine any possible factor, since I cannot do it or visualise the result in multiple dimensions? Or do I just not know how to do it? I am interested in this point and want to focus more research on multidimensional visualisation.

Once again, thanks for considering my question!



: zenith    Reply


I am using Visual Basic 6 to analyze a large collection of research data. The data consists of around 3000 records, each with 50 fields, and each field contains continuous data. The data comes from analyzing digital images of biological specimens (cells) and recording the 50 measurements produced by our analysis program. A cell can either be a cancer cell (positive) or a normal cell (negative). The hypothesis is that positive cell images contain characteristics that differ from those of normal cells. Using data mining, I would like to analyze the data and discover these rules/characteristics. I have downloaded and studied the Naive Bayes algorithm from this website. It seems to work very well, with a few exceptions.

Nevertheless, I am interested in further exploring the different algorithms out there and was wondering if anyone has an opinion on which algorithm might best suit my needs. I need an algorithm that can handle a large number of records, 50 fields per record, and continuous values for each field. The biggest drawback I have seen for many of the algorithms is that they are better suited to discrete data than to continuous data. Although it might be possible to bin my data and make it all discrete values, I would rather avoid this, as the data set can get very large and creating proper bins would be very time-consuming. Any opinions or recommendations would be great!

Thank you in advance,


: acsher    Reply


First of all, my English is not too good, so I'm going to do my best.

I'm doing a thesis about data mining on a point-of-sale database. I'm trying to find behavior patterns in the data, and I think I can achieve that with a decision tree or a rule-based system, but I'm not sure that is the most accurate way to do it.

I've studied the data mining process, but I don't yet understand the difference between a regular set of queries and data mining, so if you can help me with these issues I will be very thankful.



Hi Bernardo,

My project was on "Data Mining by using rough set theory". Rough set theory is one of the 5-6 methods that are being used for data mining nowadays. In particular, I have used this method to find equivalence relations among several records of data. I would very much like to help, but I'm not confident enough to say which method best suits you. I know that decision trees are quite a reliable method. Maybe if you help me understand more about what a "point-of-sale database" is, I might be able to help more! Or, if you like, I can send you some notes on "rough set theory" to see how this method works.



Decision trees do suffer from the following problems.

Prepruning (stop growing a branch when information becomes unreliable)

- Hard to find a good early stopping point

Postpruning (build a fully grown decision tree and discard unreliable parts)

- Slow and inefficient

- A local optimum can cause important information to be overlooked

I have used another method. To determine whether an interesting relationship (an association) exists between two events or items, the adjusted residual is employed.

This paper is a reference for you:

K.C.C. Chan, and A.K.C. Wong, "A Statistical Technique for Extracting Classificatory Knowledge from Databases," in Knowledge Discovery in Databases, G. Piatetsky-Shapiro and W.J. Frawley, Eds., Cambridge, MA:MIT Press, pp. 107 - 123, 1991.
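For readers who have not met the adjusted residual before, here is a small Python sketch of the standard calculation on a contingency table of counts: the difference between the observed and expected count in a cell is divided by the square root of the expected count, and then by the square root of its estimated variance. The function name `adjusted_residuals` and the purchase counts are made up for illustration, and the sketch assumes every row and column total is greater than zero and smaller than the grand total. It is not code from the cited paper.

```python
import math

def adjusted_residuals(table):
    """Adjusted residual for every cell of a contingency table (list of rows)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    result = []
    for i, row in enumerate(table):
        out = []
        for j, observed in enumerate(row):
            # Count expected if the row event and column event were independent.
            expected = row_totals[i] * col_totals[j] / n
            standardized = (observed - expected) / math.sqrt(expected)
            variance = (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            out.append(standardized / math.sqrt(variance))
        result.append(out)
    return result

# Hypothetical counts: rows = bought item A (yes, no), columns = bought item B (yes, no).
table = [[30, 10], [20, 40]]
d = adjusted_residuals(table)
print(round(d[0][0], 2))
```

Under independence the adjusted residual is approximately standard normal, so an absolute value above 1.96 suggests an association at roughly the 5% significance level. For the counts above, the first cell gives about 4.08, which would indicate that buying A and buying B occur together more often than chance predicts.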

I'm working on a project on visualization of data and data mining.

Right now I'm studying decision trees.

I'm confused about which direction I should go.

Can you suggest the best way to implement visualization of the data?

Please give me some information regarding data mining algorithms and their implementations.

I will wait for your kind reply.


In practical terms, data mining visualization can be implemented for the Decision Trees Algorithm by using components in Visual Basic and the .NET Framework, such as the TreeView control.

A TreeView control allows you to visualize the nodes in a decision tree and show the relationships between a parent node, its child nodes, and the siblings of the parent node.
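The parent and child structure that a TreeView displays can be sketched without any GUI. The Python snippet below is a plain-text stand-in, not the TreeView control itself: it walks a nested node structure and indents each node by its depth, which is the same traversal you would use to populate a tree control. The `render` function and the example tree are hypothetical.

```python
def render(node, depth=0):
    """Return the tree as indented lines, one line per node."""
    lines = ["  " * depth + node["label"]]
    for child in node.get("children", []):
        lines.extend(render(child, depth + 1))
    return lines

# Hypothetical decision tree, invented for illustration only.
tree = {
    "label": "outlook?",
    "children": [
        {"label": "sunny -> no"},
        {"label": "overcast -> yes"},
        {"label": "rainy: windy?", "children": [
            {"label": "yes -> no"},
            {"label": "no -> yes"},
        ]},
    ],
}

lines = render(tree)
print("\n".join(lines))
```

Sibling nodes share the same indentation level and appear under their common parent, mirroring the relationships described above.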


Does anyone have existing source code in VB 6 to do fuzzy c-means clustering?

Many thanks.



: miao    Reply

Sir, I am doing a final project. I want to implement the Apriori algorithm. If the code is available, where is it?


Sir :

The available Apriori Algorithm is for sale. You can read more about the Apriori Algorithm and the implementation at

Hi sureshkumar, do you know how to implement the Apriori algorithm?


: moon    Reply

I want to do my final work. I have to build a data mining application using census data. I am familiar with C++, but one of my classmates suggested me If you have any suggestions for me or some examples of such an application, please contact me!

Regards, Ghenam

: ghenam    Reply

Hi KT, I plan to implement a decision tree in VB. I have downloaded your decision tree, and it works well. But what can I do further in this project? How about pruning? I would like to know more about visualisation and regression. How can I implement them? Please reply soon.

: sri    Reply


Please refer to this excellent resource on classification and regression trees:

Hello Kingsley Tagbo, I am working on my final year project with the title "Intrusion Detection using Data Mining approach". Could you give me some advice, web sites, papers, or source code for reference? Have you seen any application for intrusion detection using data mining? Please tell me about your experience with this. Regards, Sokandy

: sokandy    Reply

Very helpful checklist.

How about doing step 3 first and then following the other 2? I always study the data that I want to mine before proceeding further.

I'd like to suggest this site (most of you must have already visited it)


: yogeshsr11    Reply


I read that you have the source code for fuzzy clustering. If you don't mind, can you send me the source code? Maybe I can get it working in C/C++, with your help of course. Thank you.

Please send it to



: vikingvn    Reply

I am doing a final year project in data mining. If anybody has a document or an idea about "HIDING SENSITIVE ASSOCIATION RULE USING CLUSTERS OF SENSITIVE ASSOCIATION RULE", please reply to this mail ID: research.

: RAMYA    Reply

Hi, I want to perform data mining in my project, as I want to track user purchases on my website and then recommend related products to the user so that the user doesn't have to search further.

: tanu    Reply
