The term “data mining” is primarily used by statisticians, database researchers, and the MIS and business communities. Data mining is an extension of traditional data analysis and statistical approaches in that it incorporates analytical techniques drawn from a range of disciplines including, but not limited to,
1. numerical analysis,
2. pattern matching and areas of artificial intelligence such as machine learning,
3. neural networks and genetic algorithms.
While many data mining tasks follow a traditional, hypothesis-driven data analysis approach, it is commonplace to employ an opportunistic, data-driven approach that encourages the pattern detection algorithms to find useful trends, patterns, and relationships. Essentially, the two types of data mining approaches differ in whether they seek to build models or to find patterns. The first approach, concerned with building models, is, apart from the problems inherent in the large sizes of the data sets, similar to conventional exploratory statistical methods. The objective is to produce an overall summary of a set of data to identify and describe the main features of the shape of the distribution [Hand 1998]. Examples of such models include a cluster analysis partition of a set of data, a regression model for prediction, and a tree-based classification rule. In model building, a distinction is sometimes made between empirical and mechanistic models (Box and Hunter 1965; Cox 1990; Hand 1995). The former (also sometimes called operational) seek to model relationships without basing them on any underlying theory. The latter (sometimes called substantive or phenomenological) are based on some theory or mechanism for the underlying data-generating process. Data mining, almost by definition, is primarily concerned with the operational. The second type of data mining approach, pattern detection, seeks to identify small (but nonetheless possibly important) departures from the norm, that is, to detect unusual patterns of behaviour. Examples include unusual spending patterns in credit card usage (for fraud detection), sporadic waveforms in electroencephalograph (EEG) traces, and objects with patterns of characteristics unlike others. It is this class of strategies that led to the notion of data mining as seeking “nuggets” of information among the mass of data. In general, business databases pose a unique problem for pattern extraction because of their complexity.
Complexity arises from anomalies such as discontinuity, noise, ambiguity, and incompleteness [Fayyad, Piatetsky-Shapiro, and Smyth, 1996]. While most data mining algorithms are able to separate out the effects of such irrelevant attributes when determining the actual pattern, the predictive power of the mining algorithms may decrease as the number of these anomalies increases (Rajagopalan and Krovi, 2002).
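The pattern-detection idea discussed above can be sketched with a deliberately simple example: flagging credit card spending amounts that depart markedly from the norm. The data, the standard-deviation rule, and the two-sigma threshold below are illustrative assumptions for this sketch, not a method prescribed by the literature cited here.

```python
# A minimal pattern-detection sketch: flag values that lie far from
# the mean of the data, in the spirit of finding "nuggets" (small but
# important departures from the norm). Threshold is an assumption.
import statistics

def flag_unusual(amounts, threshold=2.0):
    """Return the amounts lying more than `threshold` standard
    deviations from the mean of the sample."""
    mean = statistics.mean(amounts)
    stdev = statistics.stdev(amounts)
    return [a for a in amounts if abs(a - mean) > threshold * stdev]

# hypothetical daily credit card spending, with one anomalous entry
spending = [42.0, 37.5, 51.2, 44.9, 39.0, 46.3, 41.8, 980.0]
unusual = flag_unusual(spending)  # the 980.0 transaction is flagged
```

Note that a single extreme value inflates the sample standard deviation, which is one reason real fraud-detection systems use more robust statistics than this sketch does.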
2.4.4 Data Mining and Machine Learning
Machine learning is the study of computational methods for improving performance by mechanizing the acquisition of knowledge from experience [Langley and Simon 1995]. Machine learning aims to provide increasing levels of automation in the knowledge engineering process, replacing much time-consuming human activity with automatic techniques that improve accuracy or efficiency by discovering and exploiting regularities in training data. This section briefly discusses the basic machine learning algorithms used in data mining.
2.4.4.1 Neural Networks (NN)
These are a class of systems modelled after the human brain. Just as the human brain consists of millions of neurons interconnected by synapses, a NN is formed from large numbers of simulated neurons, connected to one another in a manner similar to brain neurons. As in the human brain, the strength of the neuron interconnections may change (or be changed by the learning algorithm) in response to a presented stimulus or an obtained output, which enables the network to “learn”. A disadvantage of NN is that building the initial neural network model can be especially time-intensive, because input processing almost always means that raw data must be transformed. Variable screening and selection require large amounts of the analyst’s time and skill. Also, for the user without a technical background, figuring out how neural networks operate is far from obvious (Peacock et al., 1998).
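The weight-adjustment idea described above can be sketched with the simplest possible network: a single simulated neuron (a perceptron) whose connection strengths are changed by the learning algorithm in response to each stimulus/output pair. The AND task, learning rate, and epoch count below are illustrative choices, not drawn from the sources cited.

```python
# A minimal single-neuron sketch: connection strengths (weights) are
# adjusted in response to the error between obtained and desired
# output, which is how the network "learns".

def train_perceptron(samples, epochs=20, lr=0.1):
    """Train one simulated neuron on (inputs, target) pairs."""
    w = [0.0, 0.0]   # connection strengths
    b = 0.0          # bias term
    for _ in range(epochs):
        for (x1, x2), target in samples:
            output = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            error = target - output
            # strengthen or weaken connections based on the error
            w[0] += lr * error * x1
            w[1] += lr * error * x2
            b += lr * error
    return w, b

def predict(w, b, x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0

# learn the logical AND function from four training stimuli
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND_SAMPLES)
```

A practical NN stacks many such neurons in layers and uses gradient-based weight updates, but the principle, iteratively changing interconnection strengths in response to outputs, is the same.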
2.4.4.2 Case-Based Reasoning (CBR)
This is a technology that tries to solve a given problem by making direct use of past experiences and solutions. A case is usually a specific problem that was encountered and solved previously. Given a particular new problem, CBR examines the set of stored cases and finds similar ones. If similar cases exist, their solution is applied to the new problem, and the problem is added to the case base for future reference. A disadvantage of CBR is that the solutions included in the case database may not be optimal in any sense, because they are limited to what was actually done in the past, not necessarily what should have been done under similar circumstances. Therefore, using them may simply perpetuate earlier mistakes (Peacock et al., 1998).
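The retrieve-reuse-retain cycle described above can be sketched as follows. The fault-diagnosis cases and the choice of Euclidean distance over numeric features are hypothetical illustrations, not part of any cited CBR system.

```python
# A minimal case-based reasoning sketch: stored cases are
# (feature-vector, solution) pairs; a new problem is matched to the
# most similar stored case, that case's solution is reused, and the
# new problem is then retained in the case base for future reference.
import math

def similarity(a, b):
    """Higher is more similar: negated Euclidean distance."""
    return -math.dist(a, b)

def solve(case_base, problem):
    """Reuse the solution of the most similar stored case, then retain
    the new problem in the case base."""
    features, solution = max(case_base, key=lambda c: similarity(c[0], problem))
    case_base.append((problem, solution))
    return solution

# hypothetical fault-diagnosis cases: (symptom readings, diagnosis)
cases = [((0.9, 0.1), "overheating"), ((0.1, 0.8), "low pressure")]
diagnosis = solve(cases, (0.85, 0.2))  # nearest case is "overheating"
```

The retain step is also where the weakness noted above enters: if a past solution was a mistake, the sketch stores and propagates it without question.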
2.4.4.3 Genetic Algorithms (GA)
They operate through procedures modelled upon the evolutionary biological processes of selection, reproduction, mutation, and survival of the fittest to search for very good solutions to prediction and classification problems. GA are used in data mining to formulate hypotheses about dependencies between variables, in the form of association rules or some other internal formalism. A disadvantage of GA is that the solutions are difficult to explain. Also, they do not provide interpretive statistical measures that enable the user to understand why the procedure arrived at a particular solution (Peacock et al., 1998).
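The evolutionary loop of selection, reproduction, mutation, and survival of the fittest can be sketched on a toy bit-string maximisation problem rather than on association-rule discovery; the population size, mutation rate, and other parameters below are illustrative choices.

```python
# A minimal genetic algorithm sketch: evolve binary strings toward
# all 1-bits. Selection keeps the fittest half; reproduction combines
# two parents by single-point crossover; mutation flips random bits.
import random

def evolve(bits=20, pop_size=30, generations=60, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    fitness = lambda ind: sum(ind)  # count of 1-bits
    pop = [[rng.randint(0, 1) for _ in range(bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # survival of the fittest: keep the better half as parents
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randrange(1, bits)       # single-point crossover
            child = p1[:cut] + p2[cut:]
            for i in range(bits):              # mutation
                if rng.random() < mutation_rate:
                    child[i] = 1 - child[i]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
```

The sketch also illustrates the interpretability drawback noted above: the final bit string gives no statistical account of why the search converged where it did.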