Archive

Archive for the ‘Uncategorized’ Category

Partitioning with KMeans

October 2nd, 2009 ketkar No comments

Earlier I was using KMeans to get discrete locations from continuous 3D points. Dropped the idea as this is going to bias the results. Basically, if I have used KMeans instead of a uniform grid and then I do a test on an unseen sample, the sample is not really unseen as the entire dataset was used to do the discretization.

Here are the points in each location.
kmeans points hist

Here are the walks going through each location.

kmean hist

Note that in both cased K was set to 1000.

Categories: Uncategorized Tags:

UCT Transfer Learning dataset 3D Plot

October 2nd, 2009 ketkar No comments

Here are some points from the UCT Transfer Learning dataset in 3D. Its funny that we can almost see the map.

allpoints

Categories: Uncategorized Tags:

Frequencies of points and walks in discrete location

October 2nd, 2009 ketkar No comments

I have been superimposing a grid to get discrete locations from continuous 3D points. The following diagram should give a general idea.

figure41

Now the question is how main points lie in each discrete location, and perhaps more importantly, how many walks pass through each location?
Here is a plot for the number of points in each location.
points frequency1

Here is the plot for number of walks in each location.
walks frequency1

Categories: Uncategorized Tags:

Research Programming: Remember the Following

September 23rd, 2009 ketkar No comments

  1. For C++ code,  read and write data in the simplest of formats. If complicated formats are necessary write python scripts. You should be able  to make do with ifstream and ofstream. If you need Boost tokenizer, Sprit or LibXML, you are opening qa can of worms, what you need a python script and a simple intermediate format.
  2. Extract a small subset of real data for testing and debugging. Seriously, don’t do to work with the entire dataset.
  3. Don’t trust libraries with core stuff. Write all code from scratch. You should know exactly what is going on in your code.
  4. Save raw output data, all of it.
  5. Separate plotting from the execution of the algorithm. You should not have to rerun the experiment to get a different plot.
  6. Every once in a while extract and put away frequently used code in a library.
  7. Do not use a unit testing framework, it’s an overkill. Write small tests as you go along in the main, when successful delete and move on.
  8. Try to exactly replicate the algorithm as possible. No shortcuts.
Categories: Uncategorized Tags:

Problems with results

September 23rd, 2009 ketkar No comments

There was some problem with Greedy approach implementation. As seen from the previous results, for very low training sets, the greedy approach is performing worse than others. The problem was that after getting 100% coverage on the training set, the greedy implementation was not properly picking vertices at random. I think I fixed that and here are the update results.

1% Training on 100 partitions between -3000 and 3000

1% Training on 100 partitions between -3000 and 3000

For really small training sizes here is what is observed. Note that results are no better that random for all three approaches.

25 partitions between -3000 and 3000, 0.25% used for training

25 partitions between -3000 and 3000, 0.25% used for training

Categories: Uncategorized Tags:

Kernel/Distance Measure for Walks in a Graph for Classifying and Clustering Player Data

September 23rd, 2009 ketkar No comments

Given that we have a set of walks of the form W = \{(l_1, t_1), (l_2, t_2)…\} , where l is a vertex and t is the time spent of that vertex. Now we need a distance measure or kernel or similarity score between two walks W_i and W_j so that we can cluster or classify them with. Here is an idea based on this paper to do the same. The overall idea is based on Longest Common Subsequence (LCS) which basically doing a diff on two walks. To give an example, the longest common subsequence of AABBAAD and ABBDDDA is ABBD. Basically deletions are allowed. The problem with directly applying this idea is that it does not allow us to consider the time spent on each vertex. Obviously we could duplicate vertices so as to indicate the time  spent, so that  \{(A_1, 3), (B, 2)…\} becomes AAABB. This is a bad idea. The complexity of LCS is O(mn) where m and n are lengths of input sequences. So we are going to get clobbered pretty bad if users spend too much time in some locations. Here is a trick to do this quickly.

First compute the LCS, ignoring the time spent on each vertex. Note that for walks for W_A and W_B we will refer to the sequence of locations as L_A and L_B and timings as T_A and T_B. Suppose the output of LCS for walks L_A and L_B is in the form to two functions P_A and P_B such that L_A[P_A(i)] = L_B[P_B(i)], for i going from 0 to n which is the length of the LCS of L_A and L_B. Now a measure of the time spent in common on a particular vertex with index i can be computed as follows.

\frac{min(T_A[P_A(i)], T_B[P_B(i)])}{max(T_A[P_A(i)], T_B[P_B(i)])}

Then we compute a mean of such time spent on the same vertex in common for all indices.

C = \frac{1}{n} \sum_{i = 0}^n\frac{min(T_A[P_A(i)], T_B[P_B(i)])}{max(T_A[P_A(i)], T_B[P_B(i)])}

This does not take into consideration what fraction of the walk is common. So even if a very small part is exactly common we will get a complete match score of 1. To address this we multiply C by the following.

F = (\frac{\sum^n_{i=0}{T_A[P_A(i)]}}{\sum^{|A|}_{i=0}{T_A[i]}} \times \frac{\sum^n_{i=0}{T_B[P_B(i)]}}{\sum^{|B|}_{i=0}{T_B[i]}})^{\frac{1}{2}}

Clearly,

  1. F \times C = 0 if there are no locations in common.
  2. F \times C = 1 if walks are exactly identical.
  3. In general F \times C is between 0 and 1.

Categories: Uncategorized Tags:

Kernel/Distance Measure for Walks in a Graph for Classifying and Clustering Player Data

September 23rd, 2009 ketkar No comments

Given that we have a set of walks of the form W = \{(l_1, t_1), (l_2, t_2)…\} , where l is a vertex and t is the time spent of that vertex. Now we need a distance measure or kernel or similarity score between two walks W_i and W_j so that we can cluster or classify them with. Here is an idea to do the same. The overall idea is based on Longest Common Subsequence (LCS) which basically doing a diff on two walks. To give an example, the longest common subsequence of AABBAAD and ABBDDDA is ABBD. Basically deletions are allowed. The problem with directly applying this idea is that it does not allow us to consider the time spent on each vertex. Obviously we could duplicate vertices so as to indicate the time  spent, so that  \{(A_1, 3), (B, 2)…\} becomes AAABB. This is a bad idea. The complexity of LCS is O(mn) where m and n are lengths of input sequences. So we are going to get clobbered pretty bad if users spend too much time in some locations. Here is a trick to do this quickly.

First compute the LCS, ignoring the time spent on each vertex. Note that for walks for W_A and W_B we will refer to the sequence of locations as L_A and L_B and timings as T_A and T_B. Suppose the output of LCS for walks L_A and L_B is in the form to two functions P_A and P_B such that L_A[P_A(i)] = L_B[P_B(i)], for i going from 0 to n which is the length of the LCS of L_A and L_B. Now a measure of the time spent in common on a particular vertex with index i can be computed as \begin{equation} \frac{min(T_A[P_A(i)],T_B[P_B(i)])}{min(T_A[P_A(i)],T_B[P_B(i)])} \end{equation}

Categories: Uncategorized Tags:

Hello world!

September 21st, 2009 ketkar 1 comment

Welcome to Game Intelligence Group. This is your first post. Edit or delete it, then start blogging!

Categories: Uncategorized Tags:

Advertisement Placement Results

September 21st, 2009 ketkar No comments

Completed experiment comparing Greedy, Top-N and Markov placement. Seems like Greedy outperforms others except in the case of small training sets.

1% Training, 100 partitions (-3000 to 3000)

1% Training, 100 partitions (-3000 to 3000)

Here is a pdf, 100-0.01

5% Training, 100 partitions between (-3000, 3000)

5% Training, 100 partitions between (-3000, 3000)

Here is a pdf. 100-0.05

For larger training sets and smaller number of partitions the results are pretty much the same.

Categories: Uncategorized Tags:

How are clustering algorithms evaluated?

September 19th, 2009 ketkar No comments

Here is a list, but need to explore further. I am not sure how this works if the distance measure is bad.

  1. Dunn’s Index
  2. Davies-Boldin’s Index
  3. Partition Coefficient
  4. Classification Entropy
  5. Seperation Index
  6. Fuzzy Hypervolume
  7. CS Index
  8. Calinsky-Harbnasz Index
  9. I-Index

A good reference is A new index of cluster validity. Also look at Performance evaluation of some Clustering Algorithms and Validity Indices and A new cluster validity index for the fuzzy c-mean

Categories: Uncategorized Tags: