There was some problem with Greedy approach implementation. As seen from the previous results, for very low training sets, the greedy approach is performing worse than others. The problem was that after getting 100% coverage on the training set, the greedy implementation was not properly picking vertices at random. I think I fixed that and here are the update results.

1% Training on 100 partitions between -3000 and 3000
For really small training sizes here is what is observed. Note that results are no better that random for all three approaches.

25 partitions between -3000 and 3000, 0.25% used for training
Welcome to
Game Intelligence Group. This is your first post. Edit or delete it, then start blogging!
Completed experiment comparing Greedy, Top-N and Markov placement. Seems like Greedy outperforms others except in the case of small training sets.

1% Training, 100 partitions (-3000 to 3000)
Here is a pdf, 100-0.01

5% Training, 100 partitions between (-3000, 3000)
Here is a pdf. 100-0.05
For larger training sets and smaller number of partitions the results are pretty much the same.
Here is a list, but need to explore further. I am not sure how this works if the distance measure is bad.
- Dunn’s Index
- Davies-Boldin’s Index
- Partition Coefficient
- Classification Entropy
- Seperation Index
- Fuzzy Hypervolume
- CS Index
- Calinsky-Harbnasz Index
- I-Index
A good reference is A new index of cluster validity. Also look at Performance evaluation of some Clustering Algorithms and Validity Indices and A new cluster validity index for the fuzzy c-mean
The
paper proposes a technique to cluster webusers based on longest common subsequence of their clickstreams taking into account both the walk through the website as well as the time spent on each page. The motivation is to find groups of users based on similar interests or motivations behing vising the website. This is assuming that similarity in clickstream indicates similarity in interests or motivations behing vising the website.
The similarity measure between two walks is based on longest common subsequence (LCS). This is quite well studied and through dynamic programming it is possible to compute LCS in time
where
and
are the lengths of input sequences. The key contribution of this work is to exted this approach where in there is a real number (time spent on each page) assocated with each vertex in the walk. Here is a snippet from the paper.

Key idea behind extending LCS for clickstreams
I am not quite sure why geometric mean is used, other than that the approach is solid. The authors use a graph-based clustering approach and perform experiments with data from sulekha.com.
Here is a
paper on clustering Walks in a Graph, the application domain is clickstreams. There are a lot of ideas which apply to clustering player walks in FPS/MMORPGs. The distance computation (between) two walks is particularly important. I need to make sure I look at this when deciding on distance measures.
@conference{banerjee2001clickstream,
title={{Clickstream clustering using weighted longest common subsequences}},
author={Banerjee, A. and Ghosh, J.},
booktitle={Proc. of the Workshop on Web Mining, SIAM Conference on Data Mining},
pages={33--40},
year={2001}
}
I decided to go ahead with a hierarchical, agglomerative, single-link approach to clustering walks in a graph. The reasons for this are as follows.
- My dataset is not that large about 2500 walks, so the versatile but high-cost (both in time and space) hierarchical approach is preferred.
- I am a little unclear about this but what is the centroid of a group of walks? When we are dealing with structured data the notion of a centroid is not clear, so applying K-Means is a little tricky.
- As far as the agglomerative choice is concerned, this is strictly an implementation issue. A divisive approach will involve constructing a minimal spanning tree (MST) from the fully connected graph (vertices are walks edges are distances between walks) and then removing edges from the MST one by one. With the agglomerative approach, I simply compute the distances between clusters and merge them.
- The single-link choice basically deals with how to compute the distance between two clusters, when they have more than one points in them. Single-link approach computes this as the minimum distance between any two points (one drawn from the first cluster the second drawn from the second cluster). The complete-link approach uses the maximum distance between any two points from both the clusters. The survey paper on clustering in the earlier post mentions that single link is more versatile but produces bridges with noisy data. I have not thought about this much but I doubt that this will make a big difference.
- I think the big concern is how to compute the distance between two walks. The rest of the stuff is not that important.
So the next thing to do is to implement a general (with respect to the distance computation) hierarchical, agglomerative, single-link clustering algorithm. Make sure the code allows to compare the clustering will the known categories (the particular scenario or if the test was a base test or a transfer test).