Archive

Archive for September, 2009

Research Programming: Remember the Following

September 23rd, 2009 ketkar No comments

  1. For C++ code,  read and write data in the simplest of formats. If complicated formats are necessary write python scripts. You should be able  to make do with ifstream and ofstream. If you need Boost tokenizer, Sprit or LibXML, you are opening qa can of worms, what you need a python script and a simple intermediate format.
  2. Extract a small subset of real data for testing and debugging. Seriously, don’t do to work with the entire dataset.
  3. Don’t trust libraries with core stuff. Write all code from scratch. You should know exactly what is going on in your code.
  4. Save raw output data, all of it.
  5. Separate plotting from the execution of the algorithm. You should not have to rerun the experiment to get a different plot.
  6. Every once in a while extract and put away frequently used code in a library.
  7. Do not use a unit testing framework, it’s an overkill. Write small tests as you go along in the main, when successful delete and move on.
  8. Try to exactly replicate the algorithm as possible. No shortcuts.
Categories: Uncategorized Tags:

Problems with results

September 23rd, 2009 ketkar No comments

There was some problem with Greedy approach implementation. As seen from the previous results, for very low training sets, the greedy approach is performing worse than others. The problem was that after getting 100% coverage on the training set, the greedy implementation was not properly picking vertices at random. I think I fixed that and here are the update results.

1% Training on 100 partitions between -3000 and 3000

1% Training on 100 partitions between -3000 and 3000

For really small training sizes here is what is observed. Note that results are no better that random for all three approaches.

25 partitions between -3000 and 3000, 0.25% used for training

25 partitions between -3000 and 3000, 0.25% used for training

Categories: Uncategorized Tags:

Kernel/Distance Measure for Walks in a Graph for Classifying and Clustering Player Data

September 23rd, 2009 ketkar No comments

Given that we have a set of walks of the form W = \{(l_1, t_1), (l_2, t_2)…\} , where l is a vertex and t is the time spent of that vertex. Now we need a distance measure or kernel or similarity score between two walks W_i and W_j so that we can cluster or classify them with. Here is an idea based on this paper to do the same. The overall idea is based on Longest Common Subsequence (LCS) which basically doing a diff on two walks. To give an example, the longest common subsequence of AABBAAD and ABBDDDA is ABBD. Basically deletions are allowed. The problem with directly applying this idea is that it does not allow us to consider the time spent on each vertex. Obviously we could duplicate vertices so as to indicate the time  spent, so that  \{(A_1, 3), (B, 2)…\} becomes AAABB. This is a bad idea. The complexity of LCS is O(mn) where m and n are lengths of input sequences. So we are going to get clobbered pretty bad if users spend too much time in some locations. Here is a trick to do this quickly.

First compute the LCS, ignoring the time spent on each vertex. Note that for walks for W_A and W_B we will refer to the sequence of locations as L_A and L_B and timings as T_A and T_B. Suppose the output of LCS for walks L_A and L_B is in the form to two functions P_A and P_B such that L_A[P_A(i)] = L_B[P_B(i)], for i going from 0 to n which is the length of the LCS of L_A and L_B. Now a measure of the time spent in common on a particular vertex with index i can be computed as follows.

\frac{min(T_A[P_A(i)], T_B[P_B(i)])}{max(T_A[P_A(i)], T_B[P_B(i)])}

Then we compute a mean of such time spent on the same vertex in common for all indices.

C = \frac{1}{n} \sum_{i = 0}^n\frac{min(T_A[P_A(i)], T_B[P_B(i)])}{max(T_A[P_A(i)], T_B[P_B(i)])}

This does not take into consideration what fraction of the walk is common. So even if a very small part is exactly common we will get a complete match score of 1. To address this we multiply C by the following.

F = (\frac{\sum^n_{i=0}{T_A[P_A(i)]}}{\sum^{|A|}_{i=0}{T_A[i]}} \times \frac{\sum^n_{i=0}{T_B[P_B(i)]}}{\sum^{|B|}_{i=0}{T_B[i]}})^{\frac{1}{2}}

Clearly,

  1. F \times C = 0 if there are no locations in common.
  2. F \times C = 1 if walks are exactly identical.
  3. In general F \times C is between 0 and 1.

Categories: Uncategorized Tags:

Kernel/Distance Measure for Walks in a Graph for Classifying and Clustering Player Data

September 23rd, 2009 ketkar No comments

Given that we have a set of walks of the form W = \{(l_1, t_1), (l_2, t_2)…\} , where l is a vertex and t is the time spent of that vertex. Now we need a distance measure or kernel or similarity score between two walks W_i and W_j so that we can cluster or classify them with. Here is an idea to do the same. The overall idea is based on Longest Common Subsequence (LCS) which basically doing a diff on two walks. To give an example, the longest common subsequence of AABBAAD and ABBDDDA is ABBD. Basically deletions are allowed. The problem with directly applying this idea is that it does not allow us to consider the time spent on each vertex. Obviously we could duplicate vertices so as to indicate the time  spent, so that  \{(A_1, 3), (B, 2)…\} becomes AAABB. This is a bad idea. The complexity of LCS is O(mn) where m and n are lengths of input sequences. So we are going to get clobbered pretty bad if users spend too much time in some locations. Here is a trick to do this quickly.

First compute the LCS, ignoring the time spent on each vertex. Note that for walks for W_A and W_B we will refer to the sequence of locations as L_A and L_B and timings as T_A and T_B. Suppose the output of LCS for walks L_A and L_B is in the form to two functions P_A and P_B such that L_A[P_A(i)] = L_B[P_B(i)], for i going from 0 to n which is the length of the LCS of L_A and L_B. Now a measure of the time spent in common on a particular vertex with index i can be computed as \begin{equation} \frac{min(T_A[P_A(i)],T_B[P_B(i)])}{min(T_A[P_A(i)],T_B[P_B(i)])} \end{equation}

Categories: Uncategorized Tags:

Hello world!

September 21st, 2009 ketkar 1 comment

Welcome to Game Intelligence Group. This is your first post. Edit or delete it, then start blogging!

Categories: Uncategorized Tags:

Advertisement Placement Results

September 21st, 2009 ketkar No comments

Completed experiment comparing Greedy, Top-N and Markov placement. Seems like Greedy outperforms others except in the case of small training sets.

1% Training, 100 partitions (-3000 to 3000)

1% Training, 100 partitions (-3000 to 3000)

Here is a pdf, 100-0.01

5% Training, 100 partitions between (-3000, 3000)

5% Training, 100 partitions between (-3000, 3000)

Here is a pdf. 100-0.05

For larger training sets and smaller number of partitions the results are pretty much the same.

Categories: Uncategorized Tags:

How are clustering algorithms evaluated?

September 19th, 2009 ketkar No comments

Here is a list, but need to explore further. I am not sure how this works if the distance measure is bad.

  1. Dunn’s Index
  2. Davies-Boldin’s Index
  3. Partition Coefficient
  4. Classification Entropy
  5. Seperation Index
  6. Fuzzy Hypervolume
  7. CS Index
  8. Calinsky-Harbnasz Index
  9. I-Index

A good reference is A new index of cluster validity. Also look at Performance evaluation of some Clustering Algorithms and Validity Indices and A new cluster validity index for the fuzzy c-mean

Categories: Uncategorized Tags:

Notes on Clickstream Clustering Paper

September 19th, 2009 ketkar No comments

The paper proposes a technique to cluster webusers based on longest common subsequence of their clickstreams taking into account both the walk through the website as well as the time spent on each page. The motivation is to find groups of users based on similar interests or motivations behing vising the website. This is assuming that similarity in clickstream indicates similarity in interests or motivations behing vising the website.

The similarity measure between two walks is based on longest common subsequence (LCS). This is quite well studied and through dynamic programming it is possible to compute LCS in time O(mn) where m and n are the lengths of input sequences. The key contribution of this work is to exted this approach where in there is a real number (time spent on each page) assocated with each vertex in the walk.  Here is a snippet from the paper.

Key idea behind extending LCS for clickstreams

Key idea behind extending LCS for clickstreams

I am not quite sure why geometric mean is used, other than that the approach is solid. The authors use a graph-based clustering approach and perform experiments with data from sulekha.com.

Categories: Uncategorized Tags:

Paper on Clustering Walks in a Graph, Application domain is clickstreams

September 19th, 2009 ketkar No comments

Here is a paper on clustering Walks in a Graph, the application domain is clickstreams. There are a lot of ideas which apply to clustering player walks in FPS/MMORPGs. The distance computation (between) two walks is particularly important. I need to make sure I look at this when deciding on distance measures.

@conference{banerjee2001clickstream,
  title={{Clickstream clustering using weighted longest common subsequences}},
  author={Banerjee, A. and Ghosh, J.},
  booktitle={Proc. of the Workshop on Web Mining, SIAM Conference on Data Mining},
  pages={33--40},
  year={2001}
}
Categories: Uncategorized Tags:

Choosing Hierarthical Clustering for Walks in a Graph

September 19th, 2009 ketkar No comments

I decided to go ahead with a hierarchical,  agglomerative, single-link approach to clustering walks in a graph.  The reasons for this are as follows.

  1. My dataset is not that large about 2500 walks, so the versatile but high-cost (both in time and space) hierarchical approach is preferred.
  2. I am a little unclear about this but what is the centroid of a group of walks? When we are dealing with structured data the notion of a centroid is not clear,  so applying K-Means  is a little tricky.
  3. As far as the  agglomerative choice is concerned, this is strictly an implementation issue. A divisive approach will involve constructing a minimal spanning tree (MST) from the fully connected graph (vertices are walks edges are distances between walks) and then removing edges from the MST one by one.  With the agglomerative approach, I simply compute the distances between clusters and merge them.
  4. The single-link choice basically deals with how to compute the distance between two clusters, when they have more than one points in them. Single-link approach computes this as the minimum distance between any two points (one drawn from the first cluster the second drawn from the second cluster). The complete-link approach uses the maximum distance between any two points from both the clusters. The survey paper on clustering in the earlier post mentions that single link is more versatile but produces bridges with noisy data. I have not thought about this much but I doubt that this will make a big difference.
  5. I think the big concern is how to compute the distance between two walks. The rest of the stuff is not that important.

So the next thing to do is to implement a general (with respect to the distance computation)  hierarchical,  agglomerative, single-link clustering algorithm. Make sure the code allows to compare the clustering will the known categories (the particular scenario or if the test was a base test or a transfer test).

Categories: Uncategorized Tags: