2.3    ASSORTMENT OF MACHINE LEARNING ALGORITHMS FOR TEXT CLASSIFICATION
After feature selection and transformation, documents can be represented in a form that a machine learning algorithm can use. Most of the text classifiers proposed in the literature are built on machine learning techniques and probabilistic models, and they vary chiefly in the approach taken: decision trees, naïve Bayes, rule induction, neural networks, nearest neighbours and, more recently, support vector machines. Despite the many approaches that have been proposed, automated text classification remains a major area of research, primarily because the effectiveness of current automated text classifiers is not perfect and still needs improvement.
Naive Bayes is frequently used in text classification applications and experiments because of its simplicity and effectiveness (Kim, 2002). Its performance suffers, however, because it does not model text well.
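To make the idea concrete, the following is a minimal sketch of a multinomial naive Bayes text classifier, assuming Python and the scikit-learn library are available; the toy corpus, labels and test sentence are hypothetical stand-ins for real health documents.

    # Minimal sketch of multinomial naive Bayes text classification (assumes scikit-learn).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Hypothetical toy corpus; a real experiment would use labelled medical documents.
    train_docs = [
        "malaria is spread by mosquitoes",
        "malaria causes fever and chills",
        "diabetes raises blood sugar levels",
        "insulin controls blood sugar in diabetes",
    ]
    train_labels = ["malaria", "malaria", "diabetes", "diabetes"]

    # Bag-of-words counts feed the multinomial model; Laplace smoothing (alpha=1.0)
    # keeps unseen words from producing zero probabilities.
    model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
    model.fit(train_docs, train_labels)

    print(model.predict(["fever after a mosquito bite"]))
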
Schneider addressed these problems and showed that they can be resolved by a few simple corrections. Klopotek and Woch presented an empirical evaluation of a Bayesian multinet classifier based on a novel method for learning very large tree-like Bayesian networks (Klopotek, 2003). Their study suggests that tree-like Bayesian networks can handle a text classification task with one hundred thousand variables at sufficient speed and accuracy.
When support vector machines (SVMs) are applied to text classification they typically deliver excellent precision but poor recall. Customizing an SVM to improve recall therefore means adjusting the decision threshold (the bias term) associated with the SVM. Shanahan and Roma described an automatic process for adjusting the thresholds of generic SVMs to obtain improved results (Shanahan, 2003). Johnson et al. described a fast decision tree construction algorithm that takes advantage of the sparsity of text data, together with a rule simplification method that translates the decision tree into a logically equivalent rule set.
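The threshold-adjustment idea can be illustrated with a short sketch, assuming Python and scikit-learn; the documents and the relaxed threshold value (-0.25) are hypothetical.

    # Sketch of threshold relaxation for a linear SVM (assumes scikit-learn).
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    # Hypothetical binary task: health-related documents (1) versus everything else (0).
    docs = [
        "malaria fever mosquito parasite", "football match results",
        "diabetes insulin blood sugar", "stock market report",
        "cancer tumour therapy", "weather forecast sunny",
    ]
    labels = np.array([1, 0, 1, 0, 1, 0])

    X = TfidfVectorizer().fit_transform(docs)
    svm = LinearSVC().fit(X, labels)

    scores = svm.decision_function(X)            # signed distance from the hyperplane
    default_pred = (scores > 0.0).astype(int)    # the standard SVM threshold
    relaxed_pred = (scores > -0.25).astype(int)  # lower threshold: more positives, higher recall

    print(default_pred)
    print(relaxed_pred)
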
Lim introduced a method that improves the performance of kNN-based text classification by using well-estimated parameters. Several variants of the kNN method with different decision functions, k values and feature sets were also introduced and evaluated in order to find suitable parameters.
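A simple way to search for such parameters is a grid search over k and the neighbour-weighting function, sketched below under the assumption that Python and scikit-learn are available; the toy corpus and parameter grid are hypothetical.

    # Sketch of parameter search for a kNN text classifier (assumes scikit-learn).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline

    # Hypothetical two-class corpus of short health documents.
    docs = [
        "malaria mosquito fever", "malaria parasite blood", "malaria bite chills",
        "malaria anopheles infection", "malaria plasmodium tropics",
        "diabetes insulin sugar", "diabetes glucose blood", "diabetes diet exercise",
        "diabetes pancreas insulin", "diabetes type two sugar",
    ]
    labels = ["malaria"] * 5 + ["diabetes"] * 5

    pipeline = Pipeline([("tfidf", TfidfVectorizer()),
                         ("knn", KNeighborsClassifier())])

    # Try several k values and two neighbour-weighting (decision) functions.
    grid = GridSearchCV(pipeline,
                        param_grid={"knn__n_neighbors": [1, 3, 5],
                                    "knn__weights": ["uniform", "distance"]},
                        cv=2)
    grid.fit(docs, labels)
    print(grid.best_params_)
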
For rapid document classification, the corner classification (CC) network, a feed-forward neural network, has been used, together with a training algorithm for it called TextCC. The complexity of text classification tasks generally varies: as the number of classes grows, so do the complexity of the task and the size of training set required. In a multi-class text classification task, some classes are inevitably harder to classify than others, typically because there are very few positive training examples for the class or because the class lacks good predictive features.
When training a binary classifier per category in text categorization, all documents in the training corpus that belong to the category are used as relevant (positive) training data, and all documents that belong to the other categories are used as non-relevant (negative) training data. It is common for the non-relevant documents to vastly outnumber the relevant ones, especially when there is a large number of categories with only a few documents assigned to each; this is the “imbalanced data problem”. The imbalance poses a particular risk to classification algorithms, which can achieve seemingly high accuracy simply by classifying every example as negative. To address this problem, cost-sensitive learning is required.
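A common cost-sensitive remedy is to weight the rare positive class more heavily; the sketch below assumes Python and scikit-learn, and the documents and the single "cardiology" category are hypothetical.

    # Sketch of a class-weighted ("cost-sensitive") binary classifier for one category
    # (assumes scikit-learn).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Hypothetical imbalanced data: one relevant "cardiology" document among many negatives.
    docs = [
        "heart attack cardiac arrest symptoms",
        "football match results", "stock market report",
        "weather forecast sunny", "election campaign speech",
        "new smartphone release", "holiday travel guide",
    ]
    labels = [1, 0, 0, 0, 0, 0, 0]

    # class_weight="balanced" raises the misclassification cost of the rare positive class,
    # discouraging the trivial solution of labelling every document as negative.
    clf = make_pipeline(TfidfVectorizer(),
                        LinearSVC(class_weight="balanced"))
    clf.fit(docs, labels)

    print(clf.predict(["patient with cardiac symptoms", "latest football scores"]))
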
2.4    REVIEW OF RELATED WORK
Li et al. investigated four different methods for document classification: the naive Bayes classifier, the nearest neighbour classifier, decision trees and a subspace method. These were applied to seven-class Yahoo news groups (business, entertainment, health, international, politics, sports and technology), individually and in combination. They studied three classifier combination approaches: simple voting, dynamic classifier selection and adaptive classifier combination. Their experimental results indicate that the naive Bayes classifier and the subspace method outperform the other two classifiers on their data sets. Combinations of multiple classifiers did not always improve classification accuracy compared with the best individual classifier. Among the three combination approaches, their adaptive classifier combination method performed best. The best classification accuracy they achieved on this seven-class problem is approximately 83%, which is comparable to the performance of other similar studies; however, their classification problem is more difficult because the pattern classes used in their experiments have a large overlap of words in the corresponding documents (Li, 1998).
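As a rough illustration of the simplest of these combination schemes, majority voting over several base classifiers can be sketched as follows, assuming Python and scikit-learn; the corpus and the choice of base classifiers are hypothetical and far smaller than the Yahoo news-group setting described above.

    # Sketch of simple (majority) voting over three base classifiers (assumes scikit-learn).
    from sklearn.ensemble import VotingClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical two-class corpus; the cited study used seven news-group classes.
    docs = [
        "malaria mosquito fever parasite", "malaria bite blood infection",
        "malaria anopheles plasmodium", "diabetes insulin sugar glucose",
        "diabetes diet blood sugar", "diabetes pancreas insulin",
    ]
    labels = ["malaria"] * 3 + ["diabetes"] * 3

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)

    # Each base classifier votes; the majority label wins ("simple voting").
    ensemble = VotingClassifier(
        estimators=[("nb", MultinomialNB()),
                    ("knn", KNeighborsClassifier(n_neighbors=3)),
                    ("tree", DecisionTreeClassifier(random_state=0))],
        voting="hard",
    )
    ensemble.fit(X, labels)

    print(ensemble.predict(vectorizer.transform(["fever after a mosquito bite"])))
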
Goller et al. thoroughly evaluated a wide variety of methods on a document classification task for German text. They evaluated different feature construction and selection methods and various classifiers. Their main results are: feature selection is necessary not only to reduce learning and classification time but also to avoid overfitting (even for support vector machines); surprisingly, morphological analysis did not improve classification quality compared with a letter 5-gram approach; and support vector machines were significantly better than all the other classification methods (Goller, 2002).
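The combination of letter 5-gram features with feature selection can be sketched as follows, assuming Python and scikit-learn; the German toy sentences, the number of selected features and the classifier settings are hypothetical.

    # Sketch of letter 5-gram features with chi-squared feature selection (assumes scikit-learn).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    # Character 5-grams within word boundaries avoid explicit morphological analysis.
    pipeline = Pipeline([
        ("ngrams", TfidfVectorizer(analyzer="char_wb", ngram_range=(5, 5))),
        ("select", SelectKBest(chi2, k=20)),  # keep only the most informative n-grams
        ("svm", LinearSVC()),
    ])

    # Hypothetical toy sentences; the cited study used a large German document collection.
    docs = [
        "Malaria wird durch Muecken uebertragen", "Malaria verursacht hohes Fieber",
        "Diabetes erhoeht den Blutzucker", "Diabetes erfordert regelmaessig Insulin",
    ]
    labels = ["malaria", "malaria", "diabetes", "diabetes"]

    pipeline.fit(docs, labels)
    print(pipeline.predict(["Hohes Fieber durch Malaria"]))
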
Ankit et al. discussed the different types of feature vectors with which a document can be represented and later classified. They compared the binary, count and TF-IDF feature vectors and their impact on document classification. To test how well each of the three feature vectors performs, they used the 20-newsgroup dataset and converted the documents into all three representations. For each representation, they trained a naïve Bayes classifier and then tested the generated classifier on test documents. They found that TF-IDF performed 4% better than the count vectorizer and 6% better than the binary vectorizer when stop words were removed; when stop words were not removed, TF-IDF performed 6% better than the binary vectorizer and 11% better than the count vectorizer. The count vectorizer also performed 2% better than the binary vectorizer when stop words were removed, but lagged behind by 5% when they were not. They concluded that TF-IDF should be the preferred vectorizer for document representation and classification (Ankit, 2017).
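This comparison of the three representations can be reproduced in miniature with the sketch below, assuming Python and scikit-learn; the toy corpus stands in for the 20-newsgroup data, and the stop-word handling shown is only one of the settings the study varied.

    # Sketch comparing binary, count and TF-IDF representations with naive Bayes
    # (assumes scikit-learn).
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Hypothetical toy corpus standing in for the 20-newsgroup dataset used in the study.
    train_docs = [
        "malaria mosquito fever parasite", "malaria bite blood infection",
        "diabetes insulin sugar glucose", "diabetes diet blood sugar",
    ]
    train_labels = ["malaria", "malaria", "diabetes", "diabetes"]
    test_docs = ["fever after a mosquito bite", "high blood sugar and insulin"]
    test_labels = ["malaria", "diabetes"]

    # Stop-word removal (stop_words="english") is applied uniformly here; the cited study
    # reports results both with and without it.
    representations = {
        "binary": CountVectorizer(binary=True, stop_words="english"),  # word present/absent
        "count": CountVectorizer(stop_words="english"),                # raw term frequencies
        "tfidf": TfidfVectorizer(stop_words="english"),                # frequency weighted by rarity
    }

    for name, vectorizer in representations.items():
        model = make_pipeline(vectorizer, MultinomialNB())
        model.fit(train_docs, train_labels)
        print(name, model.score(test_docs, test_labels))
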