Development Of An Information Retrieval System Using Tree-structured Clustering
[FRSC Benue State]
2.3 Hierarchical Agglomerative Clustering
Hierarchical Agglomerative Clustering (HAC) is a similarity-based, bottom-up clustering technique in which, at the beginning, every term forms a cluster of its own. The algorithm then repeatedly merges the two most similar clusters still available, until it arrives at a single universal cluster that contains all the terms.
In our experiments, we use three different strategies to calculate the similarity between clusters: complete-, average- and single-linkage. All three strategies are based on the same similarity measure between terms (the cosine measure in our experiments), but they measure the similarity between two non-trivial clusters in different ways.
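The underlying cosine measure between two term vectors can be sketched as follows (a minimal illustration; the function name and the use of plain Python lists as term vectors are assumptions for exposition, not part of the system described here):

```python
import math

def cosine(u, v):
    """Cosine similarity between two term vectors u and v,
    i.e. the dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Identical vectors score 1.0, orthogonal vectors score 0.0, so higher values mean more similar terms.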
2.3.1 Single Link: The single link method joins, at each step, the most similar pair of objects that are not yet in the same cluster. It has some attractive theoretical properties and can be implemented relatively efficiently, so it has been widely used. However, it has a tendency toward formation of long straggly clusters, or chaining, which makes it suitable for delineating ellipsoidal clusters but unsuitable for isolating spherical or poorly separated clusters. Single-link clustering defines the distance between two clusters A and B as the minimum distance between their members.
2.3.2 Group Average Link: As the name implies, the group average link method uses the average of the pairwise links within a cluster to determine similarity. All objects contribute to intercluster similarity, resulting in a structure intermediate between the loosely bound single link clusters and the tightly bound complete link clusters. The group average method has ranked well in evaluative studies of clustering methods.
2.3.3 Complete Link: The complete link method uses the least similar pair between each of two clusters to determine the intercluster similarity; it is called complete link because all entities in a cluster are linked to one another within some minimum similarity. Small, tightly bound clusters are characteristic of this method. Here the distance between clusters is the maximum distance between their members.
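The three linkage strategies above can be sketched as functions over two clusters A and B and a pairwise similarity measure `sim` (such as the cosine measure). This is an illustrative sketch; the function names are assumptions, and note that with similarities (rather than distances) single link takes the maximum pairwise value and complete link the minimum:

```python
def single_link(A, B, sim):
    """Similarity of the MOST similar pair across the two clusters."""
    return max(sim(a, b) for a in A for b in B)

def complete_link(A, B, sim):
    """Similarity of the LEAST similar pair across the two clusters."""
    return min(sim(a, b) for a in A for b in B)

def average_link(A, B, sim):
    """Average similarity over all cross-cluster pairs."""
    return sum(sim(a, b) for a in A for b in B) / (len(A) * len(B))
```

By construction, for any two clusters the complete-link score is never higher than the average-link score, which is never higher than the single-link score; this is why complete link yields small, tightly bound clusters while single link tends to chain.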
Two other HACMs are sometimes used: the centroid and median methods. In the centroid method, each cluster as it is formed is represented by the coordinates of a group centroid, and at each stage in the clustering the pair of clusters with the most similar centroids is merged. The median method is similar, except that the centroids of the two merging clusters are not weighted proportionally to the sizes of the clusters. A disadvantage of these two methods is that a newly formed cluster may be more similar to some existing point than either of its constituent clusters was, resulting in reversals or inversions in the cluster hierarchy.
2.3.4 General Algorithm for the HACM
All of the hierarchical agglomerative clustering methods can be described by a general algorithm:
1. Identify the two closest points and combine them in a cluster.
2. Identify and combine the next two closest points (treating existing clusters as points).
3. If more than one cluster remains, return to step 2.
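The general algorithm above can be sketched as a naive agglomerative loop. This is a minimal illustration, not the system's actual implementation: the function names are assumptions, distances stand in for (inverse) similarities, and a single-link criterion is supplied as one example of the pluggable linkage step:

```python
def single_linkage(A, B, dist):
    """Single link as a distance criterion: minimum pairwise distance."""
    return min(dist(a, b) for a in A for b in B)

def hac(points, dist, linkage):
    """Generic HACM: start with singleton clusters, repeatedly merge the
    closest pair under the given linkage, and record the merge history."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters (clusters act as points).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j], dist)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Step 3: replace the two clusters by their merger and repeat.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges
```

The recorded merge history is exactly the information a dendrogram displays: which clusters were joined, and at what distance. Swapping in a different `linkage` function yields the complete-link or group-average variants.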
Individual HACMs differ in the way in which the most similar pair is defined, and in the means used to represent a cluster. Lance and Williams (1966) proposed a general combinatorial formula, the Lance-Williams dissimilarity update formula, for calculating dissimilarities between new clusters and existing points, based on the dissimilarities prior to formation of the new cluster.
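The Lance-Williams update formula can be stated as follows (standard form; the text above does not give the formula itself, so this reconstruction is offered for reference). When clusters $i$ and $j$ are merged, the dissimilarity between the new cluster $(i \cup j)$ and any other cluster $k$ is:

```latex
d(k,\, i \cup j) \;=\; \alpha_i\, d(k, i) \;+\; \alpha_j\, d(k, j)
\;+\; \beta\, d(i, j) \;+\; \gamma\, \lvert d(k, i) - d(k, j) \rvert
```

Each HACM corresponds to a choice of coefficients; for example, single link uses $\alpha_i = \alpha_j = \tfrac{1}{2}$, $\beta = 0$, $\gamma = -\tfrac{1}{2}$ (recovering the minimum), and complete link uses the same $\alpha$ values with $\gamma = +\tfrac{1}{2}$ (recovering the maximum).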
The dendrogram is a useful representation when considering retrieval from a clustered set of documents, since it indicates the paths that the retrieval process may follow.