Kernel/Distance Measure for Walks in a Graph for Classifying and Clustering Player Data
Given that we have a set of walks of the form
, where
is a vertex and
is the time spent of that vertex. Now we need a distance measure or kernel or similarity score between two walks
and
so that we can cluster or classify them with. Here is an idea based on this paper to do the same. The overall idea is based on Longest Common Subsequence (LCS) which basically doing a diff on two walks. To give an example, the longest common subsequence of AABBAAD and ABBDDDA is ABBD. Basically deletions are allowed. The problem with directly applying this idea is that it does not allow us to consider the time spent on each vertex. Obviously we could duplicate vertices so as to indicate the time spent, so that
becomes AAABB. This is a bad idea. The complexity of LCS is
where
and
are lengths of input sequences. So we are going to get clobbered pretty bad if users spend too much time in some locations. Here is a trick to do this quickly.
First compute the LCS, ignoring the time spent on each vertex. Note that for walks for and
we will refer to the sequence of locations as
and
and timings as
and
. Suppose the output of LCS for walks
and
is in the form to two functions
and
such that
, for
going from
to
which is the length of the LCS of
and
. Now a measure of the time spent in common on a particular vertex with index
can be computed as follows.
Then we compute a mean of such time spent on the same vertex in common for all indices.
This does not take into consideration what fraction of the walk is common. So even if a very small part is exactly common we will get a complete match score of 1. To address this we multiply by the following.
Clearly,
if there are no locations in common.
if walks are exactly identical.
- In general
is between
and
.
Categories: Uncategorized