Choosing Hierarthical Clustering for Walks in a Graph
I decided to go ahead with a hierarchical, agglomerative, single-link approach to clustering walks in a graph. The reasons for this are as follows.
- My dataset is not that large about 2500 walks, so the versatile but high-cost (both in time and space) hierarchical approach is preferred.
- I am a little unclear about this but what is the centroid of a group of walks? When we are dealing with structured data the notion of a centroid is not clear, so applying K-Means is a little tricky.
- As far as the agglomerative choice is concerned, this is strictly an implementation issue. A divisive approach will involve constructing a minimal spanning tree (MST) from the fully connected graph (vertices are walks edges are distances between walks) and then removing edges from the MST one by one. With the agglomerative approach, I simply compute the distances between clusters and merge them.
- The single-link choice basically deals with how to compute the distance between two clusters, when they have more than one points in them. Single-link approach computes this as the minimum distance between any two points (one drawn from the first cluster the second drawn from the second cluster). The complete-link approach uses the maximum distance between any two points from both the clusters. The survey paper on clustering in the earlier post mentions that single link is more versatile but produces bridges with noisy data. I have not thought about this much but I doubt that this will make a big difference.
- I think the big concern is how to compute the distance between two walks. The rest of the stuff is not that important.
So the next thing to do is to implement a general (with respect to the distance computation) hierarchical, agglomerative, single-link clustering algorithm. Make sure the code allows to compare the clustering will the known categories (the particular scenario or if the test was a base test or a transfer test).
Categories: Uncategorized