2.4 Similarity Measures for Categorical Data
The concept of similarity is fundamentally important in almost every scientific field. Clustering, distance-based outlier detection, classification, and regression are major data mining techniques that compute similarities between instances, so the choice of a particular similarity measure can be a major cause of an algorithm's success or failure; it can matter as much as the choice of data representation or feature selection. Most algorithms treat the similarity computation as an orthogonal step and can make use of any measure. The notion of similarity or distance for categorical data, however, is not as straightforward as it is for continuous data.
The key characteristic of categorical data is that the different values a categorical attribute takes are not inherently ordered, so two different categorical values cannot be compared directly. The simplest way to measure similarity between two categorical attributes is to assign a similarity of one (1) if the values are identical and a similarity of zero (0) if they are not. For two multivariate categorical data points, the overall similarity is then directly proportional to the number of attributes in which they match (Chandola, 2007).
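A minimal sketch of this overlap measure in Python follows; the function name, data layout, and toy records are illustrative assumptions, not taken from the source.

```python
def overlap_similarity(x, y):
    """Overlap measure: per-attribute similarity is 1 on a match, 0 otherwise;
    the overall similarity is the fraction of matching attributes."""
    assert len(x) == len(y), "records must be defined over the same attributes"
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)

# Two records over the attributes (colour, shape, size):
print(overlap_similarity(("red", "round", "small"),
                         ("red", "square", "small")))  # 2 of 3 match -> 0.666...
```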
Although categorical data has no inherent ordering, other information, such as co-occurrence statistics, can be used effectively to decide which values should be considered more similar and which less so. This observation has motivated researchers to develop data-driven similarity measures for categorical attributes. Such measures take into account the frequency distribution of the different attribute values in a given data set, but most of them capture no feature of the data beyond this frequency distribution. One proposed solution is to build a common repository of similarity measures for all commonly occurring concepts. However, the notion of similarity varies from one domain to another, so assigning similarities requires a thorough understanding of the domain.
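As a concrete illustration of the statistic such data-driven measures rely on, the sketch below tabulates fk(x), the number of times each value x appears in attribute k; the variable names and toy data set are our own assumptions.

```python
from collections import Counter

# Toy categorical data set: each row is a record, each column an attribute.
data = [
    ("red",  "round",  "small"),
    ("red",  "square", "small"),
    ("blue", "round",  "large"),
    ("red",  "round",  "large"),
]

# f[k][x] = frequency of value x in attribute k.
f = [Counter(row[k] for row in data) for k in range(len(data[0]))]
print(f[0]["red"])  # 3 -- "red" occurs three times in attribute 0
```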
2.4.1 Characteristics of Categorical Data
Here, we identify the key characteristics of a categorical data set that can potentially affect the behavior of a data-driven similarity measure.
i. Size of Data, (N): most measures are invariant to the size of the data, though there are some measures that do make use of this information.
ii. Number of attributes, (d): most measures are invariant to this characteristic, since they typically normalize the similarity over the number of attributes. In our experimental results, however, we observe that the number of attributes does affect the performance of the outlier detection algorithms.
iii. Number of values taken by each attribute, (nk): a data set might contain attributes that take many possible values alongside attributes that take very few; for example, one attribute might take several hundred values while another takes only a handful. A similarity measure might give more importance to the second attribute while ignoring the first.
iv. Distribution of fk(x): this refers to the distribution of the frequencies of the values taken by an attribute in the given data set. In some data sets an attribute might be distributed uniformly over the set Ak, while in others the distribution might be skewed. One similarity measure might give more importance to attribute values that occur rarely, while another might favor frequently occurring values; a sketch of one such frequency-sensitive measure follows this list.
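The sketch below implements a frequency-sensitive similarity in the spirit of the inverse occurrence frequency (IOF) measure surveyed by Chandola (2007): matches score 1, while mismatches between frequently occurring values are penalized more heavily than mismatches between rare ones. The exact weighting, names, and toy data are illustrative assumptions, not the definitive formulation.

```python
import math
from collections import Counter

def iof_similarity(x, y, freq):
    """Per-attribute similarity: 1 on a match; on a mismatch,
    1 / (1 + log fk(xk) * log fk(yk)), so disagreement on common values
    is penalized more than disagreement on rare ones. Assumes both values
    occur in the data set (counts >= 1), otherwise log(0) fails."""
    d = len(x)
    total = 0.0
    for k in range(d):
        if x[k] == y[k]:
            total += 1.0
        else:
            total += 1.0 / (1.0 + math.log(freq[k][x[k]]) * math.log(freq[k][y[k]]))
    return total / d  # normalize over the number of attributes

data = [
    ("red",  "round",  "small"),
    ("red",  "square", "small"),
    ("blue", "round",  "large"),
    ("red",  "round",  "large"),
]
freq = [Counter(row[k] for row in data) for k in range(len(data[0]))]
print(iof_similarity(data[0], data[2], freq))  # ~0.89
```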
The study of similarity between data objects with categorical variables has a long history. Pearson proposed the chi-square statistic in 1900; it is often used to test independence between categorical variables in a contingency table, and it was later modified and extended, leading to several other measures (Chandola, 2007). More recently, the overlap measure has become the most commonly used similarity measure for categorical data, a popularity perhaps owed to its simplicity and ease of use. In this section, we discuss the overlap measure and several data-driven similarity measures for categorical data.
2.5 Information Retrieval System
Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Automated information retrieval systems are used to reduce what has been called "information overload". Many universities and public libraries use IR systems to provide access to books, journals, and other documents. Web search engines are the most visible IR applications.
An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, such as search strings in web search engines. In information retrieval a query does not uniquely identify a single object in the collection; instead, several objects may match the query, perhaps with different degrees of relevancy. An object is an entity represented by information in a database, and user queries are matched against this database information. Depending on the application, the data objects may be text documents, images, audio, mind maps, or videos. Often the documents themselves are not stored directly in the IR system but are instead represented by document surrogates or metadata. Most IR systems compute a numeric score for how well each object in the database matches the query and rank the objects by this value. The top-ranking objects are then shown to the user, and the process may be iterated if the user wishes to refine the query.
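A toy sketch of this score-and-rank loop is shown below. The scoring function here is a simple term-overlap count chosen for brevity; real systems use more refined weightings such as TF-IDF or BM25, and all names and documents below are illustrative assumptions.

```python
def score(query_terms, doc_terms):
    """Toy relevance score: the number of query terms the document contains."""
    return len(set(query_terms) & set(doc_terms))

# Document surrogates: an identifier plus the terms extracted from each document.
docs = {
    "d1": "information retrieval systems rank documents".split(),
    "d2": "web search engines index the web".split(),
    "d3": "libraries provide access to books and journals".split(),
}

query = "web search ranking".split()

# Score every object against the query and present them in ranked order.
ranked = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
for d in ranked:
    print(d, score(query, docs[d]))
```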
The idea of using computers to search for relevant pieces of information was popularized in the article "As We May Think" by Vannevar Bush in 1945. The first automated information retrieval systems were introduced in the 1950s and 1960s. By 1970 several different techniques had been shown to perform well on small text corpora such as the Cranfield collection (several thousand documents). Large-scale retrieval systems, such as the Lockheed Dialog system, came into use early in the 1970s.
In 1992, the US Department of Defense, along with the National Institute of Standards and Technology (NIST), cosponsored the Text Retrieval Conference (TREC) as part of the TIPSTER text program. Its aim was to support the information retrieval community by supplying the infrastructure needed for the evaluation of text retrieval methodologies on a very large text collection. This catalyzed research on methods that scale to huge corpora. The introduction of web search engines has boosted the need for very large scale retrieval systems even further.