Comparative Study Of Learning From Imbalanced Data
1.5 Scope of the study
The study is restricted to the nature of imbalanced data and provides a comparative study of schemes for learning from such data. In broad terms, the scope also covers topics related to learning from imbalanced data. A few of them are:
• Machine learning algorithmic approaches to learning from imbalanced data, such as decision trees (the Naïve Bayes Tree) and artificial neural networks (the Multilayer Perceptron)
• Machine learning performance evaluation measures
• Performance and monitoring measures used in evaluating imbalanced data learning
• Creation of models to be used for learning from imbalanced data
1.6 Organization of the study
This study consists of the following chapters:
Chapter 1 – Introduction: This chapter introduces the entire report. It presents the historical background of the study, the rationale behind the work, imbalanced data and learning from such data, the problem definition, and the aims and objectives of the study.
Chapter 2 – Literature Review: This chapter gives a detailed review of related studies and establishes the theoretical framework upon which this research is built.
Chapter 3 – Research Methodology and Application: This chapter considers a few methodologies used in the analysis of imbalanced data, focusing on imbalanced data learning algorithms. Datasets from the KEEL repository with different imbalance ratios (IRs) are used.
Chapter 4 – Implementation and Evaluation: In this chapter, two machine learning algorithms, the Naïve Bayes Tree and the Multilayer Perceptron, are implemented and evaluated on imbalanced datasets, together with evaluation metrics for the imbalanced data classification problem. The chapter presents the experimental study of the behavior of some algorithms and examines the use of non-parametric tests for statistical comparison of the results of the classifiers. It also analyzes the behavior of the best combination of components under different IR levels.
Chapter 5 – Discussion, Evaluation and Conclusion: This chapter gives a detailed summary of the results, draws conclusions and recommendations from the findings, and offers suggestions for future research by other investigators in related areas.
1.7 Operational Definition
1.7.1 Concepts
Algorithm – A finite, step-by-step sequence of well-defined instructions used to solve a problem on a computer; a computational procedure that takes values as input and produces values as output in order to solve a well-defined computational problem.
Data – Numbers, characters, images, or other recordings in a form that can be assessed by a human or (especially) input into a computer, stored and processed there, or transmitted over a digital channel.
Data Mining – An analytic process designed to explore data (usually large amounts of data, typically business or market related, also known as “big data”) in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data.
Imbalanced Dataset – A dataset is imbalanced if its classification categories (classes) are not approximately equally represented.
Learning – The act of acquiring new knowledge, behaviors, skills, values, or preferences, or of modifying and reinforcing existing ones; it may involve synthesizing different types of information.
Machine – An apparatus using mechanical power and having several parts, each with a definite function, that together perform a particular task.
Machine Learning – A scientific discipline that explores the construction and study of algorithms that can learn from data and make decisions on unseen data based on what they have learned from previous data.
Mining – A term for the process of finding a small set of precious patterns in a great deal of raw material (big data).
Comparative – A comparative study is a research methodology that aims to make comparisons across different fields; in this case, across algorithms used in learning from imbalanced data.
Attribute – A piece of information that determines the properties of a field or tag in a database, or of a string of characters in a display.
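To make the definition of an imbalanced dataset concrete, the sketch below measures how unequally the classes are represented. It assumes the common convention that the imbalance ratio (IR) is the number of majority-class examples divided by the number of minority-class examples; the function name `imbalance_ratio` and the sample labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return majority-class count divided by minority-class count.

    Assumption: IR = (largest class size) / (smallest class size).
    """
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to measure imbalance")
    return max(counts.values()) / min(counts.values())


# Hypothetical labels: 90 negative and 10 positive examples.
labels = ["negative"] * 90 + ["positive"] * 10
ratio = imbalance_ratio(labels)
print(ratio)  # 9.0
```

Under this convention a perfectly balanced dataset has an IR of 1.0, and larger values indicate stronger imbalance.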
1.7.2 Technology
• Decision Tree – A predictive model that maps observations about an item to conclusions about the item’s target value. It is one of the predictive modelling approaches used in statistics, data mining and machine learning.
• Cross Validation – Cross validation, sometimes called rotation estimation, is a model validation technique for assessing how accurately the results of a statistical analysis will generalize to an independent dataset.
• Artificial Neural Network – A family of statistical learning algorithms inspired by biological neural networks (the central nervous systems of animals, in particular the brain) and used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown. Artificial neural networks are generally presented as systems of interconnected “neurons” that compute values from inputs and are capable of machine learning as well as pattern recognition. Their adaptive nature is what makes them interesting.
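The rotation idea behind cross validation can be sketched as follows: the data are split into k folds, and each fold takes one turn as the test set while the rest form the training set. Stratification (keeping class proportions similar in every fold) is added here as an assumption, since it is commonly used with imbalanced data; it is not prescribed by the definition above. The function name `stratified_k_fold` and the sample labels are hypothetical.

```python
from collections import defaultdict


def stratified_k_fold(labels, k):
    """Yield (train_indices, test_indices) pairs for k stratified folds."""
    # Group example indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Deal each class's indices round-robin so every fold keeps
    # roughly the same class proportions as the full dataset.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    # Rotate: each fold is the test set exactly once.
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train, test


# Hypothetical labels: 20 majority and 5 minority examples.
labels = ["majority"] * 20 + ["minority"] * 5
splits = list(stratified_k_fold(labels, 5))
for train, test in splits:
    minority_in_test = sum(1 for i in test if labels[i] == "minority")
    print(len(train), len(test), minority_in_test)
```

A real experiment would usually shuffle the indices within each class before dealing them into folds; that step is left out to keep the sketch deterministic.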
1.7.3 Tools
KEEL (Knowledge Extraction based on Evolutionary Learning) – An open source (GPLv3) Java software tool that lets the user assess the behavior of evolutionary learning and soft computing based techniques for different kinds of data mining problems: regression, classification, clustering, pattern mining and so on.
Datasets – A dataset corresponds to the contents of a single database table, or a single statistical data matrix, where every column of the table represents a particular variable and each row corresponds to a given member of the dataset. The dataset lists values for each of the variables, such as the height and weight of an object, for each member.
WEKA (Waikato Environment for Knowledge Analysis) – WEKA is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. WEKA contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
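The row-and-column view of a dataset described above can be sketched in a few lines. This is only an illustration, not the file format used by KEEL or WEKA; the variable names `height_cm`, `weight_kg` and `label`, the helper `column`, and all values are hypothetical.

```python
# Hypothetical data: each row is one member of the dataset,
# and each key is one variable (a column of the table).
dataset = [
    {"height_cm": 170.0, "weight_kg": 65.0, "label": "negative"},
    {"height_cm": 182.0, "weight_kg": 80.0, "label": "negative"},
    {"height_cm": 158.0, "weight_kg": 52.0, "label": "positive"},
]


def column(rows, name):
    """Return the values of one variable across all members."""
    return [row[name] for row in rows]


heights = column(dataset, "height_cm")
print(heights)  # [170.0, 182.0, 158.0]
```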
1.8 Conclusion
Machine learning is growing and expanding at a very rapid pace. Its importance and remarkable growth come from combining collaborative activities with sophisticated pattern recognition and with intelligent, self-modifying and self-learning decision making. This chapter has given an overview and a preliminary study of learning from imbalanced datasets and of evaluating the algorithms that support the learning process.