• A System For Health Document Classification Using Machine Learning

  • CHAPTER TWO -- [Total Page(s) 3]

    • 2.2.2    STEMMING
      In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their stem, or base form. The stem need not be identical to the morphological root of the word; it is usually enough that related words map to the same stem, even if this stem is not a valid root. In computer science, algorithms for stemming have been studied since 1968. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.
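      As a rough illustration of the idea, the sketch below strips a few common English suffixes and groups words by the resulting stem. It is a toy, not the Porter algorithm or any stemmer used in this project; the suffix list and the names `toy_stem` and `conflate` are assumptions made for the example.

```python
# Toy suffix-stripping stemmer (illustrative only; not the Porter algorithm).
SUFFIXES = ("ing", "ed", "es", "s")  # assumed suffix list, tried in this order


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def conflate(words):
    """Group words that share a stem, preserving input order within a group."""
    groups = {}
    for w in words:
        groups.setdefault(toy_stem(w), []).append(w)
    return groups


if __name__ == "__main__":
    print(conflate(["connect", "connected", "connecting", "connects"]))
    # "studies" and "studied" both map to "studi", which is not a valid root.
    print(toy_stem("studies"), toy_stem("studied"))
```

      Note that "studies" and "studied" conflate to "studi": the stem is shared even though it is not itself a word, which is all that conflation requires.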
      2.2.3    STOP WORD REMOVAL
      In computing, stop words are typically filtered out before natural language data (text written by people rather than by machines) is processed. No single prepared list of stop words exists that every tool can use. Even a tool that uses a stop word list may ignore it in order to support phrase search.
      Any group of words can be chosen as the stop words for a given purpose. For some search engines, these are common, short function words, such as the, is, at, which and on, which cause problems when mining text for phrases that contain them. Some stop word lists also include common lexical words, such as "want", which are removed from phrases to improve performance.
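      A minimal sketch of stop word filtering follows. The stop list contains only the five function words named above, tokenization is a plain whitespace split, and the function name `remove_stop_words` is hypothetical; a real system would use a fuller list and a proper tokenizer.

```python
# Stop list taken from the examples in the text; real lists are much longer.
STOP_WORDS = frozenset({"the", "is", "at", "which", "on"})


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, and drop tokens found in the stop list."""
    return [token for token in text.lower().split() if token not in stop_words]


if __name__ == "__main__":
    print(remove_stop_words("The patient is at the clinic"))
    # Passing an empty stop list keeps every token, as a tool that
    # supports phrase search might choose to do.
    print(remove_stop_words("The patient is at the clinic", stop_words=frozenset()))
```

      Making the stop list a parameter reflects the point above that the list depends on the purpose: passing an empty set disables filtering for phrase search.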
      2.2.4    VECTOR REPRESENTATION OF THE DOCUMENTS
      The vector representation of documents (the vector space model) is an algebraic model for representing text documents (and objects in general) as vectors of identifiers, such as index terms. It is used in information filtering, information retrieval, indexing and relevancy ranking, and it was first used in the SMART Information Retrieval System.
      A document is a sequence of words (Leopold, 2002), so every document is generally represented by an array of words. The set of all the words in a training set is called the vocabulary, or feature set. A document can therefore be represented by a binary vector that assigns the value 1 if the document contains the feature word and 0 if it does not.
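      The binary representation described above can be sketched as follows. The two training sentences and the function names `build_vocabulary` and `binary_vector` are invented for illustration, and words are split on whitespace only.

```python
def build_vocabulary(training_docs):
    """Return the sorted set of all words seen in the training documents."""
    return sorted({word for doc in training_docs for word in doc.lower().split()})


def binary_vector(doc, vocabulary):
    """Return 1 for each vocabulary word present in the document, else 0."""
    words = set(doc.lower().split())
    return [1 if term in words else 0 for term in vocabulary]


if __name__ == "__main__":
    # Hypothetical training documents, not taken from the project's dataset.
    training = ["malaria is spread by mosquitoes", "fever is a symptom"]
    vocab = build_vocabulary(training)
    print(vocab)
    print(binary_vector("malaria fever", vocab))
```

      Each position in the vector corresponds to one vocabulary word, so every document maps to a vector of the same length regardless of how many words it contains.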
      2.2.5    FEATURE SELECTION AND TRANSFORMATION
      The main objective of feature selection methods is to reduce the dimensionality of the dataset by eliminating features that are irrelevant to the classification (Forman, 2003). This procedure has been shown to offer a number of benefits, including a smaller dataset, lower computational requirements for the text categorization algorithms (especially those that do not scale well with the feature set size) and a considerable shrinking of the search space. The goal is to reduce the curse of dimensionality and so improve classification accuracy. Another advantage of feature selection is its tendency to reduce overfitting, i.e. the phenomenon by which a classifier is tuned to the contingent characteristics of the training data rather than the constitutive characteristics of the categories, and therefore to improve generalization.
      Feature transformation differs considerably from feature selection, but like it aims to reduce the size of the feature set. Rather than weighting terms and discarding the lowest-weighted ones, it compacts the vocabulary based on feature co-occurrences.
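      The text does not name a particular scoring method, so the sketch below uses the chi-square statistic, one commonly used filter, as a stand-in. Documents are modelled as sets of tokens; the function names `chi_square` and `select_features` and the sample data are assumptions made for the example.

```python
def chi_square(term, docs, labels, target):
    """Chi-square score of a term against one class, from a 2x2 contingency table."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        has_term = term in doc
        in_class = label == target
        if has_term and in_class:
            a += 1  # term present, document in target class
        elif has_term:
            b += 1  # term present, document in another class
        elif in_class:
            c += 1  # term absent, document in target class
        else:
            d += 1  # term absent, document in another class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_features(docs, labels, target, k):
    """Keep the k terms with the highest chi-square score for the target class."""
    vocabulary = sorted({term for doc in docs for term in doc})
    ranked = sorted(
        vocabulary,
        key=lambda term: chi_square(term, docs, labels, target),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical labelled documents, each a set of tokens.
    docs = [
        {"malaria", "mosquito", "is"},
        {"malaria", "fever", "is"},
        {"diabetes", "insulin", "is"},
        {"diabetes", "sugar", "is"},
    ]
    labels = ["malaria", "malaria", "diabetes", "diabetes"]
    print(select_features(docs, labels, "malaria", 2))
```

      In this sample the word "is" appears in every document, so it carries no information about the class and scores zero, while words that occur in only one class score highest.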
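      The text does not say which transformation technique is meant. As a deliberately simple stand-in, the sketch below merges terms that occur in exactly the same set of documents into a single combined feature, which compacts the vocabulary using co-occurrence alone. The names `merge_cooccurring_terms` and `transform` and the sample documents are hypothetical.

```python
from collections import defaultdict


def merge_cooccurring_terms(docs):
    """Group terms that appear in exactly the same set of documents."""
    occurrences = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for term in doc:
            occurrences[term].add(doc_id)
    groups = defaultdict(list)
    for term in sorted(occurrences):
        groups[frozenset(occurrences[term])].append(term)
    return sorted(groups.values())


def transform(doc, merged_features):
    """Binary vector over merged features: 1 if any term of the group is present."""
    return [1 if any(term in doc for term in group) else 0 for group in merged_features]


if __name__ == "__main__":
    # Hypothetical documents, each a set of tokens.
    docs = [
        {"plasmodium", "parasite", "fever"},
        {"plasmodium", "parasite"},
        {"fever", "insulin"},
    ]
    features = merge_cooccurring_terms(docs)
    print(features)
    print(transform({"parasite"}, features))
```

      Here four distinct terms collapse into three features because "parasite" and "plasmodium" always occur together. No term is discarded, which is the difference from feature selection.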
