• Development Of An Information Retrieval System Using Tree-structured Clustering
    [FRSC Benue State]

  • CHAPTER TWO -- [Total Page(s) 4]

    Page 3 of 4

    Previous   1 2 3 4    Next
    • 2.4    Similarity Measures for Categorical Data
      The concept of similarity is fundamentally important in almost every scientific field. Clustering, distance-based outlier detection, classification and regression are major data mining techniques which compute the similarities between instances and hence choice of a particular similarity measure can turn out to be a major cause of success or failure of the algorithm. For these tasks, the choice of a similarity measure can be as important as the choice of data representation or feature selection. Most algorithms typically treat the similarity computation as an orthogonal step and can make use of any measure. The notion of similarity or distance for categorical data is not as straightforward as for continuous data.
      The key characteristic of categorical data is that different values that a categorical attribute takes are not inherently ordered. Thus it is not possible to directly compare two different categorical values. The simplest way to find similarity between two categorical attributes is to assign a similarity of one (1) if the values are identical and a similarity of zero (0) if the values are not identical. For two multivariate categorical data points, the similarity between them will be directly proportional to the number of attributes in which they match (Chandola, 2007).
      Although there is no inherent ordering in categorical data, there are other factors like co- occurrence statistics that can be effectively used to define what should be considered more similar and vice-versa. This observation has motivated researchers to come up with data-driven similarity measures for categorical attributes. Such measures take into account the frequency distribution of different attribute values in a given data set but most of these algorithms fail to capture any other feature in the dataset apart from frequency distribution of different attribute values in a given data set. One solution to the problem is to build a common repository of similarity measures for all commonly occurring concepts. Thus, the notion of similarity varies from one domain to another and hence the assignment of similarity must involve a thorough understanding of the domain.
      2.4.1    Characteristics of Categorical Data
      Here, we identify the key characteristics of a categorical data set that can potentially affect the behavior of a data-driven similarity measure.
      i.        Size of Data, (N): most measures are typically invariant of the size of the data, though there are some measures that do make use of this information.
      ii.    Number of attributes, (d): most measures are invariant of this characteristic, since they typically normalize the similarity over the number of attributes. But in our experimental results we observe that the number of attributes does affect the performance of the outlier detection algorithms.
      iii.    Number of values taken by each attribute, (nk):a data set might contain attributes that take several values and attributes that take very few values. For example, one attribute might take several hundred possible values, while another attribute might take very few values. A similarity measure might give more importance to the second attribute, while ignoring the first one.
      iv.    Distribution of fk(x): this refers to the distribution of frequency of values taken by an attribute in the given data set. In certain data sets an attribute might be distributed uniformly over the set Ak, while in others the distribution might be skewed. A similarity measure might give more importance to attribute values that occur rarely, while another similarity measure might give more importance to frequently occurring attribute values.
      The study of similarity between data objects with categorical variables has had a long history. Pearson proposed a chi-square statistic in the late 1800s which is often used to test independence between categorical variables in a contingency table. Pearson's chi- square statistic was later modified and extended, leading to several other measures. (Chandola, 2007). More recently, however, the overlap measure has become the most commonly used similarity measure for categorical data. Its popularity is perhaps related to its simplicity and ease of use. In this section, we will discuss the overlap measure and several data-driven similarity measures for categorical data.
      2.5    Information Retrieval System
      Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources. Automated information retrieval systems are used to reduce what has been called "information overload". Many universities and public libraries use IR systems to provide access to books, journals and other documents. Web search engines are the most visible Information Retrieval applications.
      An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. In information retrieval a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevancy. An object is an entity that is represented by information in a database. User queries are matched against the database information. Depending on the application the data objects may be, for example, text documents, images, audio, mind maps or videos. Often the documents themselves are not kept or stored directly in the IR system, but are instead represented in the system by document surrogates or metadata. Most IR systems compute a numeric score on how well each object in the database matches the query, and rank the objects according to this value. The top ranking objects are then shown to the user. The process may then be iterated if the user wishes to refine the query.
      The idea of using computers to search for relevant pieces of information was popularized  in the article
      As We May Think by Vannevar Bush in 1945. The first automated information retrieval systems were introduced in the 1950s and 1960s. By 1970 several different techniques had been shown to perform well on small text corpora such as the Cranfield collection (several thousand documents). Large-scale retrieval systems, such as the Lockheed Dialog system, came into use early in the 1970s.
      In 1992, the US Department of Defense along with the National Institute of Standards and Technology (NIST), cosponsored the Text Retrieval Conference (TREC) as part of the TIPSTER text program. The aim of this was to look into the information retrieval community by supplying the infrastructure that was needed for evaluation of text retrieval methodologies on a very large text collection. This catalyzed research on methods that scale to huge corpora. The introduction of web search engines has boosted the need for very large scale retrieval systems even further.

  • CHAPTER TWO -- [Total Page(s) 4]

    Page 3 of 4

    Previous   1 2 3 4    Next
    • ABSRACT - [ Total Page(s): 1 ]Coming Soon ... Continue reading---

         

      APPENDIX A - [ Total Page(s): 2 ]REGISTRATION PAGE ... Continue reading---

         

      CHAPTER ONE - [ Total Page(s): 2 ]1.3    Justification for the StudyThis study provides a means of easy storage and retrieval of information of vehicles and their owners for the FRSC in Benue State. It eases the stress of searching through the entire directory when retrieving information on an existing record; it will ensure the provision of a clear statistics of vehicle owners in a particular local government in the state. The output of the study shall serve as a benchmark for the Federal Road Safety Corps on the ... Continue reading---

         

      CHAPTER THREE - [ Total Page(s): 7 ]Quality improvement and cost reduction:platform.due to a central communicationv.        Use of Less Space for Record Storage: There will be elimination of much space used in storing records by introducing a computer storage media (disks) which can keep vast volume of information in a less space.vi.Speed Optimization:This will eliminate the problems of time wasting in registering records, checking from one line to the next as well as preparing a revenue report which is faster than using man ... Continue reading---

         

      CHAPTER FOUR - [ Total Page(s): 2 ]CHAPTER FOURRESULT AND IMPLEMENTATION4.1    IntroductionSystems design could be seen as the application of systems theory to product development. According to Wikipedia it is defined as the process of defining the architecture, components, modules, interfaces, and data for a system to satisfy specified requirements.4.2    System RequirementIn developing any system, there is need to specify some system requirements for minimum performance. However, with respect to this work the system requi ... Continue reading---

         

      CHAPTER FIVE - [ Total Page(s): 1 ]CHAPTER FIVESUMMARY, CONCLUSION AND RECOMMENDATION5.1    SUMMARYThis project work is aimed at providing a software model for grouping a set of related records in the Federal Road Safety Commission. The system has been designed to automate data for which vehicle owners are being registered. Consistency, reliability, fairness and quick turnaround time is ensured with the use of this system. Based on the model used in this software, further improvements can be made in order to include other feat ... Continue reading---

         

      REFRENCES - [ Total Page(s): 1 ]REFERENCES1.    William B. Frakes and Ricardo Baeza-Yates.(1992). Information Retrieval    Data Structures & Algorithms. Prentice-Hall, Inc. ISBN 0-13-463837-9.2.    Ahmad, A. and Dey, L. (2007). A method to compute distance between two categorical values of some attributes in unsupervised learning for categorical data set.3.    Anderberg M.R. (1973). Cluster Analysis for Applications. Academic Press, New York.4.        Chandola Varun, Boriah Shyam and Kumar Vipin (2007). Simil ... Continue reading---