(1.) Which programming language will you use to build the data mining application? Some possible choices are:

Java, C/C++, Visual Basic 6, Delphi, SQL Server, Visual Basic .NET, C#.NET.

(2.) What data mining algorithm will you implement?

Some popular data mining algorithms include:

(a.) Naive Bayes Classifiers

(b.) Decision Tree Algorithms (CHAID, C4.5)

(c.) One Attribute Rule (1R or 1Rule) algorithm

(d.) Apriori Algorithm

(e.) Linear Regression Algorithms

(f.) Nearest Neighbor Instance-Based Learning Algorithms, e.g. K-Nearest Neighbor

(g.) Clustering Algorithms

(3.) Define the type of data you want to mine, obtain samples of that data, and mine them.
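
To make step (2.) more concrete, here is a minimal sketch of the One Attribute Rule (1R) algorithm from the list above. It is written in Python for brevity rather than in one of the languages from step (1.), and the `weather` records, the attribute names and the `one_r` function are hypothetical illustrations, not code from this site. For each attribute, 1R builds one rule per attribute value (predict the most frequent class for that value) and keeps the attribute whose rules make the fewest errors on the training data.

```python
from collections import Counter, defaultdict

def one_r(rows, target):
    """Return (best_attribute, rules, errors) chosen by the 1R algorithm."""
    best = None
    attributes = [name for name in rows[0] if name != target]
    for attr in attributes:
        # Count the class labels seen with each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # Each attribute value predicts its most frequent class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors are the records that do not match the majority class.
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

# Hypothetical toy data: does a game get played?
weather = [
    {"outlook": "sunny",    "windy": "no",  "play": "no"},
    {"outlook": "sunny",    "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no",  "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
    {"outlook": "rainy",    "windy": "no",  "play": "yes"},
    {"outlook": "rainy",    "windy": "yes", "play": "no"},
    {"outlook": "rainy",    "windy": "no",  "play": "yes"},
]

attribute, rules, errors = one_r(weather, "play")
print(attribute, rules, errors)
```

On this toy data the rules built on `outlook` misclassify one record and the rules built on `windy` misclassify two, so `outlook` is selected.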

### Replies (57)

Andy:

The Naive Bayes Algorithm can handle both numeric (continuous) data and discrete data.

You do not have to bin your continuous data in order to analyze it with the Naive Bayes Algorithm.

By default, the Naive Bayes Algorithm implementation assumes that each continuous attribute follows a Normal (Gaussian) distribution.

You may change the Density Estimation from Normal or Gaussian Distribution to Kernel Density Estimation. There are different types of Kernel Density Estimators and these do not assume any specific distribution for your data.
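
As a rough illustration of how Naive Bayes can work on continuous values without binning, here is a minimal Python sketch. It is not the implementation from this site: the function names and the two-measurement cell data are hypothetical. Each attribute is modelled per class with a Normal density defined by its mean and variance. Switching to Kernel Density Estimation would replace only the per-attribute density term inside `log_posterior`.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(samples, labels):
    """Estimate a prior plus a mean and variance per attribute for each class."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The small constant keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(samples), means, variances)
    return model

def predict(model, x):
    """Return the class with the highest log posterior for record x."""
    def log_posterior(prior, means, variances):
        total = math.log(prior)
        for v, m, var in zip(x, means, variances):
            # Log of the Normal density for this attribute value.
            total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return total
    return max(model, key=lambda label: log_posterior(*model[label]))

# Hypothetical measurements: two continuous values per cell image.
cells = [[5.1, 2.0], [4.9, 2.2], [5.3, 1.9], [1.0, 0.5], [1.2, 0.4], [0.9, 0.6]]
kinds = ["positive", "positive", "positive", "negative", "negative", "negative"]

model = fit_gaussian_nb(cells, kinds)
print(predict(model, [5.0, 2.1]))
print(predict(model, [1.1, 0.5]))
```

The same code handles 50 measurements per record; only the length of each row changes.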

What type of measurements are you taking? You said that about 50 different measurements are taken. What is it about these measurements that makes them not well suited for analysis using Naive Bayes, e.g.

Thanks,

VISUAL BASIC .NET

I was wondering if there is any VB source code available for computing rough set reducts and equivalence classes and for extracting rules. I understand there is a C++ library available (ROSETTA), but I am not very good at C++ and I want to use these classes in VB.Net. Any ideas or suggestions on how to use C++ classes from VB.Net seamlessly?

I have a dataset with 20 labelled records of football games. The labels are 'Won' or 'Lost'. I put 5 of these records aside and build a decision tree with the remaining 15 records. I then test how the tree predicts the labels of the 5 records which I set aside. 4 of the 5 labels are predicted correctly. What is the re-substitution error? How do I calculate it? What is my 'training set'? Is it the 15 records from the original dataset or the 5 records I set aside? Finally, why do I get a message 'IP address blocked' when I try to register on this site?
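
A short worked example may help with the question above. By the usual definition, the training set is the set of records used to build the model (here the 15 records), and the re-substitution error is the error rate obtained by re-applying the tree to those same 15 records. The 5 records set aside form a test (holdout) set. The Python sketch below computes an error rate. The individual labels are hypothetical, and only the "4 of 5 correct" count comes from the post.

```python
def error_rate(actual, predicted):
    """Fraction of records whose predicted label differs from the actual label."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

# Hypothetical labels for the 5 held-out games: 4 of 5 are predicted correctly.
holdout_actual    = ["Won", "Lost", "Won", "Won",  "Lost"]
holdout_predicted = ["Won", "Lost", "Won", "Lost", "Lost"]

holdout_error = error_rate(holdout_actual, holdout_predicted)
print(holdout_error)  # 0.2, i.e. a 20% test error
```

To obtain the re-substitution error, call `error_rate` with the actual labels of the 15 training records and the tree's predictions for those same 15 records. That figure is usually optimistic compared with the test error.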

I am totally new to this data-mining business and would like someone to point me in the right direction, please.

In my business I will have a database with several tables (let's assume MS SQL Server). For the sake of example, I have a 'person' table and a 'purchase' table (each person can make several different purchases).

Here is an example of the 'person' table:

* ID
* First Name
* Last Name
* Address
* Date of Birth

Here is an example of the 'purchase' table:

* Date of Purchase
* Time of Purchase
* Store Location
* Number of Items Purchased
* Total Purchase Amount

What I would like to do is perform 'data-mining' in order to find the "categories", or "clusters" of characteristics, that define each 'purchase amount'. In other words, my target 'column' is the 'purchase amount' column, and I want to know all the different clusters of fields that produce each 'amount' in this column.

For example, I would like to know what is common to all the people who spent between $2,000 and $2,500. Maybe they all came from the same neighbourhood? Maybe all of them were young and came from the same neighbourhood? Maybe there are 2 distinct groups that fit the '$2,000-$2,500' spending group?

I am almost sure there are different categories (or distinct clusters) that match each price group. One might be old people who come to buy at night, another might be middle-aged people who buy more than 10 items, and another might be middle-aged people who buy on Sundays.

From what I understand, this falls under 'data-mining' (clustering, maybe?). Is this true? And if so, what tools must I use to find the sets of parameters that define the different groups for each pricing category?

As you can see, I have no idea which tools / applications I should use. Someone recommended installing an add-in for Excel that could do the trick, but this seems rather too simple. Someone else mentioned BI (I don't know what exactly to do there). I know there is also 'Microsoft Analysis Services'. Does that help me in any way?

Thank you very much!

John

Hi. I have just started my final year project and am currently at the data preprocessing step. I have been asked to use a genetic algorithm for attribute reduction. Can anyone here please explain how to perform data preprocessing using this algorithm and how it differs from other algorithms? Thank you.

Hello Kingsley Tagbo,

First, how should I address you?

Second, please accept my thanks for replying to me. I really appreciate it. : )

I think I will choose VB as my language because I'm more familiar with it.

Java would be my second choice because I'm planning to learn Java over the next two semesters.

As for the choice of algorithm, I really have no idea, because I don't know how each algorithm works or which algorithm is better suited to which case.

As for the type of data, I'm thinking of doing something commercial, maybe stock or share forecasting. But I don't know whether it's practical or not, because there is only one attribute for each record. From what I know, there is already a stock forecasting tool on the market, called NetProphet.

Actually, I'm not sure whether I was right to choose data mining. During my research I have found it difficult to get resources. Furthermore, most of the algorithms are really difficult for me to understand, for example neural networks.

regards,

QiuTing

Dear QiuTing:

I appreciate your regards.

You don't need to address me in any particular way. Just post a reply to my message. We have already been introduced.

I have a Visual Basic application that uses the Naive Bayes Algorithm.

It is a fully functional data mining program.

I understand your desire to work with commercial data, but it can be hard to get.

Here is a link to a data mining algorithm explanation that I have written. It was written for the One Rule (1Rule) algorithm: http://www.kdkeys.net/data-mining-project-quick-start-guide/#link-6777 .

Here is a link to the UCI Repository of machine learning databases [http://www.kdkeys.net/data-mining-project-quick-start-guide/#link-6778]. Irvine, CA: University of California, Department of Information and Computer Science

I will be putting up some information and articles on the actual working of data mining algorithms in the following weeks. It will probably help you. The other option for you will be to buy books on data mining and you can check Amazon.com for that.

Finally, Neural Networks are fairly extensive and may not necessarily be better than the other algorithms. As a beginner, you probably should not worry about them.

Thanks,

Hi,

Thanks for your reply. My project is a study of data mining and data visualisation. I have about 12,000 financial records (e.g. fund info, return, NAV, date), and I am going to develop a system with 3 parts:

1) Connect my application to the data in MS Access and let the user search and sort the data through SQL queries.

2) Display the data in various chart formats for analysts to analyse.

3) The main part: after studying data mining concepts and research, I have chosen to implement k-means clustering to mine my data (focusing mainly on return and pricing) and to visualise the results, to see whether any patterns in the financial data can help users make better decisions.

I have been referring to research papers and some notes on the web about the concept of k-means, and I want to try to implement it in Java so that it can link to my data. However, it seems quite difficult to implement the algorithm in Java with JDBC, since the data are not saved in a flat file.

After reading materials online, I have some questions about clustering, and I wonder if you could give me some ideas. I have 12,000 records, and the purpose of data mining is to discover interesting patterns in the data. However, if I narrow the data down by sampling before running k-means (e.g. select only data from 1999, or data where the company is USA based) and then mine only those groups, would the point of data mining be lost? I ask because I did initially assume that date and country would have an effect. Or should clustering not assume the outcome in advance?

As k-means only takes numerical samples, if I do the clustering in 2 dimensions, does that mean I am limited to 2 input variables, e.g. return and fund asset? Does that also limit the full purpose of data mining, which is to mine any possible factor, since I cannot cluster or visualise the result in multiple dimensions? Or do I just not know how to do it? I am interested in this point and want to focus more research on dimensional visualisation.
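
On the dimensionality question above: k-means itself is not limited to 2 dimensions, because the Euclidean distance it relies on is defined for any number of numeric attributes. Only the plotting is constrained to 2 or 3 axes. The Python sketch below is a minimal illustration and not a reference implementation. The `funds` values (return, NAV, fund assets) and the explicit starting centroids are hypothetical, and in practice the attributes would normally be scaled first so that no single one dominates the distance.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm for points with any number of numeric attributes."""
    centroids = [list(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        new_assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # No point changed cluster, so the algorithm has converged.
        assignment = new_assignment
        # Move each centroid to the mean of the points assigned to it.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical records with three attributes: (return, NAV, fund assets).
funds = [
    (0.02, 10.0, 1.0), (0.03, 11.0, 1.2), (0.01, 9.5, 0.9),
    (0.15, 50.0, 8.0), (0.14, 52.0, 7.5), (0.16, 49.0, 8.2),
]

labels, centers = kmeans(funds, centroids=[funds[0], funds[3]])
print(labels)
```

The cluster labels can then be shown on any pair of attributes in a 2-D chart, for example by colouring the points by cluster.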

Once again, thanks for considering my question!

regards,

Zenith

Hi,

I am using Visual Basic 6 to analyze a large collection of research data. The data consists of around 3000 records, each with 50 fields, and each field contains continuous data. The data comes from analyzing digital images of biological specimens (cells) and recording the 50 measurements produced by our analysis program. A cell can either be a cancer cell (positive) or a normal cell (negative). The hypothesis is that positive cell images contain characteristics that differ from those of normal cells. Using data mining, I would like to analyze the data and discover these rules/characteristics. I have downloaded and studied the naive bayes algorithm from this website. It seems to work very well, with a few exceptions.

Nevertheless, I am interested in further exploring the different algorithms out there and was wondering if anyone has an opinion on which algorithm might best suit my needs, namely one that can handle a large number of records, 50 fields per record, and continuous values in each field. The biggest drawback I have seen with many of the algorithms is that they are better suited to discrete data than to continuous data. Although it might be possible to bin my data and make it all discrete, I would rather avoid this, as the data set can get very large and creating proper bins would be very time-consuming. Any opinions or recommendations would be great!

Thank you in advance,

Andy

hello:

First of all, my English is not too good, so I'm going to do my best.

I'm doing a thesis on data mining a point-of-sale database. I'm trying to find behavior patterns in the data, and I think I can achieve that with a decision tree or a rule-based system, but I'm not sure whether that is the most accurate way to do it.

I've studied the data mining process, but I don't yet understand the difference between a regular set of queries and data mining, so if you can help me with these issues I will be very thankful.

thanks

Bernardo

Hi Bernardo,

My project was on "Data Mining by using rough set theory". Rough set theory is one of the 5-6 methods currently being used for data mining. In particular, I used this method to find equivalence relations among several records of data. I would very much like to help, but I'm not confident enough to say which method best suits you. I know that decision trees are quite a reliable method. If you tell me more about what the "point-of-sale database" contains, I might be able to help more! Or, if you like, I can send you some notes on "rough set theory" so you can see how this method works.

regards,

Constantinos
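
For readers unfamiliar with the equivalence relations mentioned in the reply above, rough set theory groups records that cannot be told apart on a chosen set of attributes (an indiscernibility relation). The following Python sketch shows only that grouping step. The `sales` records and attribute names are hypothetical, and the code is not taken from ROSETTA or any other rough set library.

```python
from collections import defaultdict

def equivalence_classes(records, attributes):
    """Group record ids that have identical values on the given attributes."""
    classes = defaultdict(list)
    for record_id, record in records.items():
        key = tuple(record[a] for a in attributes)
        classes[key].append(record_id)
    return list(classes.values())

# Hypothetical point-of-sale records.
sales = {
    "r1": {"store": "north", "time": "night", "amount": "high"},
    "r2": {"store": "north", "time": "night", "amount": "high"},
    "r3": {"store": "south", "time": "day",   "amount": "low"},
    "r4": {"store": "north", "time": "day",   "amount": "low"},
}

print(equivalence_classes(sales, ["store", "time"]))
# [['r1', 'r2'], ['r3'], ['r4']]
```

Records `r1` and `r2` fall into the same class because they are indistinguishable on `store` and `time`. Dropping an attribute and checking whether the classes stay consistent with the decision attribute (`amount`) is the basic idea behind searching for reducts.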

Decision trees do suffer from the following problems.

Prepruning (Stop growing a branch when information becomes unreliable)

- Hard to find a good early stopping point

Postpruning (Build a fully-grown decision tree and discard unreliable parts)

- Slow and inefficient

- A local optimum can cause important information to be overlooked

I have used another method. To determine whether an interesting relationship (an association) exists between two events or items, the adjusted residual is used.

This paper is a reference for you:

K.C.C. Chan, and A.K.C. Wong, "A Statistical Technique for Extracting Classificatory Knowledge from Databases," in Knowledge Discovery in Databases, G. Piatetsky-Shapiro and W.J. Frawley, Eds., Cambridge, MA:MIT Press, pp. 107 - 123, 1991.
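
To give a feel for the adjusted residual mentioned above, here is a small Python sketch that uses the standard contingency-table definition: the standardized residual (observed minus expected, divided by the square root of expected) divided by the square root of its estimated variance. Whether this matches the exact formulation in the cited paper is an assumption, and the 2x2 counts are hypothetical.

```python
import math

def adjusted_residuals(table):
    """Adjusted residual for every cell of a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    result = []
    for i, row in enumerate(table):
        out_row = []
        for j, observed in enumerate(row):
            # Count expected if the row and column events were independent.
            expected = row_totals[i] * col_totals[j] / n
            standardized = (observed - expected) / math.sqrt(expected)
            # Estimated variance of the standardized residual.
            variance = (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            out_row.append(standardized / math.sqrt(variance))
        result.append(out_row)
    return result

# Hypothetical counts: rows = item A bought / not bought,
# columns = item B bought / not bought.
table = [[30, 10],
         [10, 30]]

residuals = adjusted_residuals(table)
print(round(residuals[0][0], 2))  # 4.47
```

Under independence the adjusted residual is approximately standard normal, so an absolute value above 1.96 suggests an association at the 5% level. The value of about 4.47 here indicates that the two items are bought together more often than chance would predict.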

I'm working on a project on data visualization and data mining.

Right now I am studying decision trees.

I am confused about which direction I should go.

Can you suggest the best way to implement visualization of the data?

Please give me some information about data mining algorithms and their implementations.

I will wait for your kind reply.

In practical terms, Data Mining Visualization can be implemented for the Decision Trees Algorithm by using components in Visual Basic and the .NET Framework, such as the TreeView control.

A TreeView control allows you to visualize the nodes in a decision tree and show the relationships between a parent node, its child nodes and the siblings of the parent node.
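
The parent/child structure that such a control displays can be sketched independently of any GUI toolkit. The Python example below is not TreeView code. It is a hypothetical, text-only stand-in that walks a nested decision tree and indents each node by its depth, which is essentially what a tree control renders graphically.

```python
def render(node, depth=0):
    """Return the tree as a list of lines, indenting each node by its depth."""
    lines = ["    " * depth + node["label"]]
    for child in node.get("children", []):
        lines.extend(render(child, depth + 1))
    return lines

# Hypothetical decision tree for the outcome of a game.
tree = {
    "label": "outlook",
    "children": [
        {"label": "sunny -> Lost"},
        {"label": "rainy", "children": [
            {"label": "windy -> Lost"},
            {"label": "calm -> Won"},
        ]},
    ],
}

print("\n".join(render(tree)))
```

In a GUI version, each recursive call would add a child node to the control's node collection instead of appending a line of text.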

Sir:

The available Apriori Algorithm is for sale. You can read more about the Apriori Algorithm and its implementation at http://www.kdkeys.net/data-mining-project-quick-start-guide/#link-6779
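
For orientation only, the core idea of Apriori can be sketched in a few lines of Python. This is a hypothetical teaching sketch, not the implementation offered for sale above: it finds frequent itemsets level by level, and it builds a candidate of size k only when every one of its subsets of size k-1 is already frequent. The `baskets` data and the minimum support count are made up.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support_count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join frequent itemsets of size k-1 into candidates of size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune candidates that have an infrequent subset of size k-1.
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

frequent = apriori(baskets, min_support=2)
for itemset, count in frequent.items():
    print(sorted(itemset), count)
```

With a minimum support of 2, the pair milk and butter appears in only one basket and is dropped, so the three-item set is pruned without ever being counted.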

Please refer to this excellent resource on classification and regression trees: http://www.kdkeys.net/ShowPost.aspx?PostID=2447

Thanks

Hello Kingsley Tagbo,

I am working on my final year project, titled "Intrusion Detection using Data Mining approach". Could you give me some advice, or point me to websites, papers or source code for reference? Have you seen any applications for intrusion detection using data mining? Please tell me about your experience with this.

Regards

Sokandy

Dear KT,

I have gone through the 1Rule algorithm explanation and found that I have gained a better understanding of data mining. Thank you.

I have visited the UCI Repository of machine learning databases as well. Most of the data sets are from the medical field, and a lot of them could not be downloaded (with the error "ftp access has been terminated").

My lecturer is looking for something technical rather than management oriented. So maybe I will choose College Student Assessment as my research data. Can you suggest any other data that would be suitable?

Again, please accept my thanks. You have really helped me a lot.

QiuTing