• Comparative Study Of Learning From Imbalanced Data

  • CHAPTER ONE

    • 1.1 Background of the Study

      In recent years, information and its transformation into knowledge have become crucial as more and more data is generated in real-world situations. This growth is changing how services are provided and is driving the use of predictive analytics and other advanced methods that extract value from data and that seldom depend on a particular size of data set. Machine learning is the scientific discipline that explores the construction and study of algorithms that can learn from data. Such algorithms operate by building a model from example inputs and using it to make predictions or decisions, rather than following strictly static program instructions. Machine learning has become one of the mainstays of information technology and, with that, a rather central, albeit usually hidden, part of our lives. With ever-increasing amounts of data becoming available, there is good reason to believe that smart data analysis will become even more pervasive as a necessary ingredient of technological progress.

      With this rapid growth, several difficult “real-world” machine learning problems have emerged. These problems are characterized by imbalanced learning data, in which at least one class is under-represented relative to the others. Examples include (but are not limited to) fraud and intrusion detection, medical diagnosis and monitoring, bioinformatics, and text categorization. The imbalanced learning problem has drawn a significant amount of interest from academia, industry, and government funding agencies. The fundamental issue is that imbalanced data can significantly compromise the performance of most standard learning algorithms, which assume or expect balanced class distributions or equal misclassification costs. When presented with complex imbalanced data sets, these algorithms therefore fail to properly represent the distributive characteristics of the data and consequently yield poor accuracy across the classes of the data. When translated to real-world domains, the imbalanced learning problem is a recurring problem of high importance with wide-ranging implications, and it warrants increasing exploration.

      On this basis, this project seeks to provide a detailed comparative study of the current understanding of the imbalanced learning problem and of the state-of-the-art solutions created to address it, including ensemble methods for class imbalance and the assessment metrics for imbalanced learning, and to highlight the major opportunities and challenges for learning from imbalanced data.


      1.2 Statement of the Problem

      In recent years, the problem of imbalanced data has been recognized as a crucial problem in data mining and machine learning. It occurs when there are significantly fewer training instances of one class than of another, and it is often associated with asymmetric costs of misclassifying elements of the different classes. In addition, the distribution of the test data may differ from that of the learning sample, and the true misclassification costs may be unknown at learning time. The difficulty with class imbalance is that standard learners are often biased towards the majority class, because these classifiers attempt to reduce global quantities such as the error rate without taking the data distribution into consideration. Although much awareness of the issues related to data imbalance has been raised, many of the key problems remain open and are in fact encountered more often, especially with massive data sets. In this project, we concentrate on the two-class case.
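
      As a minimal sketch of this bias, the Python fragment below builds a hypothetical two-class sample (990 majority and 10 minority instances, figures chosen only for illustration) and evaluates a baseline that always predicts the majority class. The function names are our own and do not come from any library. Under these assumptions, the baseline achieves a low error rate while never identifying a single minority instance.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return a classifier that always predicts the most frequent training label."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    return lambda _instance: majority_label


def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of instances of class `positive` that were predicted as `positive`."""
    preds_for_positive = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positive:
        return 0.0
    return sum(p == positive for p in preds_for_positive) / len(preds_for_positive)


# Hypothetical data: label 0 is the majority class, label 1 the minority class.
labels = [0] * 990 + [1] * 10

predict = majority_baseline(labels)
predictions = [predict(None) for _ in labels]

overall_accuracy = accuracy(labels, predictions)
minority_recall = recall(labels, predictions, positive=1)

print(f"accuracy:        {overall_accuracy:.3f}")
print(f"minority recall: {minority_recall:.3f}")
```

      A learner that minimizes only the global error rate has little incentive to move away from such a solution, which is why class-specific measures are reported alongside accuracy.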


      1.3 Objectives of the Study

      In this project, we seek to:

      i. Provide a survey of the current understanding of the imbalanced learning problem and the state-of-the-art solutions created to address this problem.

      ii. Recognize and state crucial real-world problems with imbalanced data.

      iii. Provide strategies for dealing with data in imbalanced domains.

      iv. Provide a critical review of the innovative research developments targeting the imbalanced learning problem.

      v. Stimulate future research in this field, highlighting the major opportunities and challenges for learning from imbalanced data.

      vi. Comparatively study and determine the most efficient algorithm for learning from imbalanced data.

      vii. Present the various suggested methods used to compare and evaluate the performance of different imbalanced learning algorithms.

      viii. Provide strategies to deal with imbalanced data sets.
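
      One simple data-level strategy of the kind these objectives refer to is random oversampling, in which minority instances are duplicated at random until the two classes are the same size. The fragment below is an illustrative sketch for the two-class case only. The function name `random_oversample`, the seed, and the toy data are our own assumptions, and this is not presented as the method evaluated in this project.

```python
import random


def random_oversample(instances, labels, minority_label, seed=0):
    """Duplicate randomly chosen minority instances until both classes have equal size."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    pairs = list(zip(instances, labels))
    minority = [pair for pair in pairs if pair[1] == minority_label]
    majority = [pair for pair in pairs if pair[1] != minority_label]
    if not minority:
        raise ValueError("no minority instances to sample from")

    # Draw, with replacement, as many extra minority copies as are needed.
    shortfall = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(shortfall)]

    balanced = majority + minority + extra
    rng.shuffle(balanced)
    balanced_instances = [x for x, _ in balanced]
    balanced_labels = [y for _, y in balanced]
    return balanced_instances, balanced_labels


# Hypothetical data: 95 majority instances (label 0) and 5 minority instances (label 1).
toy_instances = list(range(100))
toy_labels = [0] * 95 + [1] * 5

balanced_x, balanced_y = random_oversample(toy_instances, toy_labels, minority_label=1)
print(balanced_y.count(0), balanced_y.count(1))
```

      Resampling of this kind is normally applied to the training portion only, so that the test data keeps its original class distribution.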


      1.4 Significance of the Study

      With the constant expansion of data availability in many large-scale, complex, and networked systems, such as surveillance, security, the Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw data in order to support decision-making processes. A great deal of attention therefore needs to be devoted to the imbalanced learning problem, yet given the high rate of advancement in this field, remaining knowledgeable of all current developments can be an overwhelming task. Because the field is relatively young and expanding rapidly, consistent assessments of past and current work, together with projections for future research, are essential for its long-term development. In this work, we analyze the imbalanced learning problem, which is concerned with the performance of learning algorithms in the presence of underrepresented data and severe class distribution skews, and we provide a comprehensive review of the development of research in learning from imbalanced data. Our focus is to provide a critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario. Furthermore, in order to stimulate future research in this field, we also highlight the major opportunities and challenges, as well as potentially important research directions, for learning from imbalanced data.
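
      This chapter does not list the assessment metrics themselves. Assuming two measures that are commonly reported for imbalanced data, the F-measure and the G-mean, the sketch below shows how both can be derived from the four confusion-matrix counts of a two-class problem. The function names and the example predictions are hypothetical and serve only to make the calculation concrete.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, true negatives and false negatives."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def imbalance_metrics(tp, fp, tn, fn):
    """Compute precision, recall, specificity, F1 and G-mean from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    g_mean = math.sqrt(recall * specificity)  # geometric mean of the per-class recalls
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "g_mean": g_mean,
    }


# Hypothetical test set: 10 minority (1) and 90 majority (0) instances.
y_true = [1] * 10 + [0] * 90
# Hypothetical predictions: 6 minority hits, 4 misses, 9 false alarms, 81 correct rejections.
y_pred = [1] * 6 + [0] * 4 + [1] * 9 + [0] * 81

scores = imbalance_metrics(*confusion_counts(y_true, y_pred))
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

      Unlike overall accuracy, both measures fall to zero when no minority instance is recognized, so they penalize a classifier that ignores the under-represented class.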

