• A System For Health Document Classification Using Machine Learning

    • CHAPTER TWO
      LITERATURE REVIEW
      2.0    DOCUMENT CLASSIFICATION
      Classification can be divided into two principal phases. The first phase is document representation, and the second is classification proper. The standard document representation used in text classification is the vector space model. Classification systems differ mainly in their document representation models: the more relevant the representation, the more relevant the classification will be. The second phase includes learning from a training corpus, building a model for the classes, and classifying new documents according to that model.
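      The two phases can be illustrated with a minimal sketch. It assumes whitespace tokenization, raw term-frequency vectors, and a nearest-centroid classifier with cosine similarity; these choices and all function names (vectorize, train, classify) are illustrative assumptions, not the method used in this project.

```python
import math
from collections import Counter

def vectorize(text, vocabulary):
    """Phase 1: represent a text as a term-frequency vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def train(labelled_docs):
    """Phase 2a: learn one centroid vector per class from (text, label) pairs."""
    vocabulary = sorted({w for text, _ in labelled_docs for w in text.lower().split()})
    sums, sizes = {}, Counter()
    for text, label in labelled_docs:
        vec = vectorize(text, vocabulary)
        acc = sums.setdefault(label, [0] * len(vocabulary))
        sums[label] = [a + b for a, b in zip(acc, vec)]
        sizes[label] += 1
    centroids = {label: [x / sizes[label] for x in total]
                 for label, total in sums.items()}
    return vocabulary, centroids

def classify(text, vocabulary, centroids):
    """Phase 2b: assign a new document to the class with the most similar centroid."""
    vec = vectorize(text, vocabulary)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Toy training corpus (invented for illustration only).
corpus = [
    ("malaria is spread by mosquitoes", "malaria"),
    ("malaria is caused by a parasite", "malaria"),
    ("diabetes affects blood sugar levels", "diabetes"),
    ("insulin controls blood sugar", "diabetes"),
]
vocabulary, centroids = train(corpus)
print(classify("mosquitoes carry the malaria parasite", vocabulary, centroids))
```

      Words that never occur in the training corpus are simply ignored by vectorize, which is one reason the quality of the representation limits the quality of the classification.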
      2.1    TEXT CATEGORIZATION
      Text categorization, the activity of labeling natural language texts with thematic categories from a predefined set, has gained an important status in the information systems field because of the increased availability of documents in digital form and the resulting need to access them in flexible ways. Currently, text categorization is applied in many contexts, ranging from document indexing based on a controlled vocabulary to document filtering, automated metadata creation, word sense disambiguation, and in general any application that requires document organization or selective and adaptive document dispatching. Today text categorization is a discipline at the crossroads of machine learning (ML) and information retrieval (IR), and it shares a number of characteristics with other tasks such as information/knowledge extraction from texts and text mining (Pazienza, 1997). "Text mining" is mostly used to denote all the tasks that, by analyzing large quantities of text and identifying usage patterns, try to extract potentially useful (although only probably correct) information. On this view, text categorization is an instance of text mining, which includes:
      1.    the automatic assignment of documents to a predetermined set of categories,
      2.    the automatic reorganization of such a set of categories, or
      3.    the automatic identification
      Text classification is a crucial part of the information management process. As net resources constantly grow, increasing the effectiveness of text classifiers is necessary. Document retrieval, categorization, routing, and the aforementioned information filtering are often based on text categorization (Hull, 1996).
      2.2    TAXONOMY OF TEXT CLASSIFICATION PROCESS
      The task of building a classifier for documents does not differ from other Machine Learning tasks. The main point is the representation of a document (Leopold, 2002).
      One particular characteristic of the text categorization problem is that the number of features (unique words or phrases) easily reaches the order of tens of thousands. This poses major obstacles to applying many sophisticated learning algorithms to text categorization, so dimension reduction methods are used. These either select a subset of the original features (Brank, 2002) or transform the features into new ones, that is, derive new features from the original ones.
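      As a minimal sketch of the first kind of dimension reduction (choosing a subset of the original features), the following keeps only terms that occur in at least a minimum number of documents. Document-frequency thresholding is assumed here purely for illustration; it is not necessarily the criterion used in (Brank, 2002), and the names select_features, min_df and max_features are hypothetical.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): count each term once per document
    return df

def select_features(docs, min_df=2, max_features=None):
    """Keep terms appearing in at least min_df documents, most frequent first."""
    df = document_frequency(docs)
    kept = [term for term, n in df.items() if n >= min_df]
    kept.sort(key=lambda term: (-df[term], term))  # ties broken alphabetically
    return kept[:max_features]  # max_features=None keeps every surviving term

# Toy tokenized documents (invented for illustration only).
docs = [
    ["malaria", "fever", "mosquito"],
    ["malaria", "parasite"],
    ["fever", "headache"],
]
print(select_features(docs, min_df=2))
```

      Terms that appear in only one document are dropped, which shrinks the vocabulary before any learning algorithm is applied.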
      2.2.1    TOKENIZATION
      Tokenization is the process of breaking a stream of text up into tokens, that is, words, phrases, symbols, or other meaningful elements. The list of tokens becomes the input to the next stage of text classification. Generally, tokenization occurs at the word level.
      Nevertheless, it is not easy to define what is meant by a "word". A tokenizer often relies on simple heuristics, for instance:
      All contiguous strings of alphabetic characters are part of one token; similarly with numbers. Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters. Punctuation and whitespace may or may not be included in the resulting list of tokens.
      In languages like English (and most programming languages), where words are separated by whitespace, this approach is straightforward. Tokenization is difficult, however, for languages with no explicit word boundaries, such as Chinese. Simple whitespace-delimited tokenization also has difficulty with word collocations like New York, which should be treated as a single token. Some ways to address this problem are developing more complex heuristics, querying a table of common collocations, or fitting the tokens to a language model that identifies collocations in a later processing step.
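      The heuristics above, together with the table-lookup approach to collocations, can be sketched as follows. The regular expression assumes ASCII letters and digits only, and the function names and the collocation table are hypothetical and chosen for illustration.

```python
import re

# A run of letters, a run of digits, or a single punctuation character.
# Assumption: ASCII text only; whitespace is never emitted as a token.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+|[0-9]+|[^\sA-Za-z0-9]")

def tokenize(text, keep_punctuation=False):
    """Split text into word and number tokens, optionally keeping punctuation."""
    tokens = TOKEN_PATTERN.findall(text)
    if not keep_punctuation:
        tokens = [t for t in tokens if t[0].isalnum()]
    return tokens

def merge_collocations(tokens, collocations):
    """Join adjacent token pairs that appear in a table of known collocations."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in collocations:
            merged.append(" ".join(pair))
            i += 2  # skip both halves of the collocation
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = tokenize("She moved to New York in 2019.")
print(merge_collocations(tokens, {("New", "York")}))
```

      This sketch handles only two-word collocations; longer phrases or languages without whitespace word boundaries would need the more complex heuristics or language models mentioned above.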

    • REFERENCES
      Russell Power, Jay Chen, Trishank Karthik and Lakshminarayanan Subramanian (2018), "Document Classification for Focused Topics," https://cs.nyu.edu/~jchen/publications/aaai4d-power.pdf.
      Hull D., J. Pedersen, and H. Schutze (1996), "Document routing as statistical classification," in AAAI Spring Symp. On Machine Learning in Information Access Technical Papers, Palo Alto.
      Fox C. (1992), "Lexical analysis and stoplist," in Information Retrieval Data Structur ...