A System For Health Document Classification Using Machine Learning
CHAPTER THREE SYSTEM ANALYSIS AND DESIGN
3.0 INTRODUCTION
This chapter describes the modules and components used to design the system and how they work together. It also shows how users interact with the system.
3.1 ANALYSIS OF THE EXISTING SYSTEM
The existing system is manual: health workers classify health documents by stacking physical files in file cabinets. This makes it difficult to retrieve a file when one of a particular category is required.
3.2 ANALYSIS OF THE PROPOSED SYSTEM
System analysis and design deal with planning the development of information systems through understanding and specifying in detail what a system should do and how the components of the system should be implemented and work together. System analysts solve business problems through analyzing the requirements of information systems and designing such systems by applying analysis and design techniques.
3.2.1 REQUIREMENTS OF THE SYSTEM
For the system to serve its intended purpose, it must meet the following requirements.
1. It should be able to accept as input text documents with the following extensions: .txt, .doc and .pdf.
2. It should be able to search for defined text in documents.
3. It should be able to summarize documents.
4. It should be able to categorize and summarize text.
5. It should be able to tokenize text and carry out stemming and lemmatization.
6. It should be able to identify sentences.
7. It should be able to perform coreference resolution, word sense disambiguation and sentence boundary disambiguation.
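As a rough illustration of requirements 1 and 6, the sketch below checks a file name against the accepted extensions and splits text into sentences using the standard java.text.BreakIterator class. The class and method names are hypothetical, and BreakIterator is only a stand-in from the Java standard library; it is not the sentence detector used in this project.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of requirements 1 and 6; not part of the project code. */
final class RequirementSketch {

    /** Returns true if the file name ends with .txt, .doc or .pdf (case-insensitive). */
    static boolean isSupported(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        return lower.endsWith(".txt") || lower.endsWith(".doc") || lower.endsWith(".pdf");
    }

    /** Splits text into sentences with the JDK's rule-based sentence iterator. */
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                out.add(sentence);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(isSupported("report.PDF"));
        System.out.println(sentences("Malaria is common. It is treatable."));
    }
}
```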
3.3 TRAINING A MODEL
In machine learning, a learning algorithm is trained on example data to produce a model. The model learns from the training data to the point that it produces similar results when it is presented with similar data. In this project we use the Apache OpenNLP API for document classification. OpenNLP is a set of Java tools from the Apache Software Foundation for natural language processing, a field that draws heavily on machine learning and is the domain of this project.
In order to carry out the classification, we first train a model. Our model is built to identify diseases such as malaria, hypertension and diarrhea. We chose to start with these three diseases because a brief Google search indicated that they are the most common diseases in Nigeria. To construct a model in OpenNLP, you need to create a file of training data. The training file consists of a series of lines; the first word of each line is the category, and it is followed by the text, separated by whitespace. We use numerous lines of text containing the words malaria, hypertension and diarrhea, sourced online mainly from Wikipedia, to create a training file called "en-diseases.train". The en-diseases.train file is passed to the train method of the DocumentCategorizerME class. The train method learns from the file and produces a model, which is saved as a model file with a .bin file name extension.
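The training file format described above (a category label, then whitespace, then the text) can be sketched with a small parser. This is a minimal illustration using only the Java standard library; the class name, method names and sample lines are made up for the example and are not taken from the actual en-diseases.train file or from OpenNLP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that splits lines of a "category text..." training file. */
final class TrainingLineParser {

    /** One training sample: a category label and the text that follows it. */
    static final class Sample {
        final String category;
        final String text;

        Sample(String category, String text) {
            this.category = category;
            this.text = text;
        }
    }

    /** Parses one line; returns null for blank lines or lines with a label but no text. */
    static Sample parseLine(String line) {
        String[] parts = line.trim().split("\\s+", 2);
        if (parts.length < 2 || parts[0].isEmpty()) {
            return null;
        }
        return new Sample(parts[0], parts[1]);
    }

    /** Parses every line and skips the ones that are not valid samples. */
    static List<Sample> parseAll(List<String> lines) {
        List<Sample> samples = new ArrayList<>();
        for (String line : lines) {
            Sample sample = parseLine(line);
            if (sample != null) {
                samples.add(sample);
            }
        }
        return samples;
    }

    public static void main(String[] args) {
        // Invented example lines in the assumed format, one sample per line.
        List<String> lines = Arrays.asList(
            "malaria Fever and chills after a mosquito bite",
            "hypertension Persistently raised blood pressure",
            "diarrhea Frequent loose stools and dehydration");
        for (Sample s : parseAll(lines)) {
            System.out.println(s.category + " -> " + s.text);
        }
    }
}
```

Checking each line this way before training makes it easier to spot lines that are missing a label or have no text.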
3.4 CLASSIFYING THE DOCUMENT
After training, the model file is used to classify the health documents. The "categorize" method of the DocumentCategorizerME class is used to classify each document as Malaria, Diarrhea or Hypertension.
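A classifier of this kind typically produces one score per category and assigns the document to the highest-scoring one. The sketch below illustrates only that final selection step. It replaces the trained model with a simple keyword count, so the keyword lists, class name and method names are all assumptions made for illustration and do not reflect how OpenNLP or the project's model computes its scores.

```java
import java.util.Locale;

/** Hypothetical stand-in for a trained categorizer: keyword counts, then best score. */
final class BestCategorySketch {

    static final String[] CATEGORIES = {"Malaria", "Diarrhea", "Hypertension"};

    // Invented keyword lists standing in for what a trained model would learn.
    static final String[][] KEYWORDS = {
        {"malaria", "mosquito", "fever"},
        {"diarrhea", "stool", "dehydration"},
        {"hypertension", "blood", "pressure"}
    };

    /** Returns one score per category; the scores sum to 1. */
    static double[] score(String[] tokens) {
        double[] scores = new double[CATEGORIES.length];
        double total = 0;
        for (String token : tokens) {
            String t = token.toLowerCase(Locale.ROOT);
            for (int c = 0; c < KEYWORDS.length; c++) {
                for (String keyword : KEYWORDS[c]) {
                    if (keyword.equals(t)) {
                        scores[c]++;
                        total++;
                    }
                }
            }
        }
        for (int c = 0; c < scores.length; c++) {
            // With no keyword hits, fall back to a uniform distribution.
            scores[c] = (total == 0) ? 1.0 / scores.length : scores[c] / total;
        }
        return scores;
    }

    /** Picks the category with the highest score (the first one wins ties). */
    static String bestCategory(double[] scores) {
        int best = 0;
        for (int c = 1; c < scores.length; c++) {
            if (scores[c] > scores[best]) {
                best = c;
            }
        }
        return CATEGORIES[best];
    }

    public static void main(String[] args) {
        String[] tokens = "The patient has fever after a mosquito bite".split("\\s+");
        System.out.println(bestCategory(score(tokens)));
    }
}
```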
3.5 USE CASE DIAGRAMS
A use case diagram shows the interaction between the system's use cases and its clients without going into much detail. It displays each actor together with its use cases; the actors are the users of the system.
The users or actors of our document classification system include: Health Worker
ABSTRACT
Due to the massive increase in medical documents every day (including books, journals, blogs, articles, doctors' instructions and prescriptions, emails from patients, etc.), it is becoming very challenging to handle and to categorize them manually. One of the most challenging projects in information systems is extracting information from unstructured texts, including medical document classification. The discovery of knowledge from medical datasets is important in order to make effective ...