1 Introduction

This report describes our research, conducted together with Prof. Tadeusz Morzy, on measuring distance between sequences of sets. A sequence of sets is a data structure appearing in many scenarios, including retail (e.g., the sequence of purchases of a given client), co-authorship of papers (e.g., the sequence of co-author sets of a given researcher), and medicine (e.g., the sequence of drugs taken by a given patient). Its recent popularity in the scientific community was sparked by Benson et al. in a 2018 KDD paper [1]. However, dealing with such sequences in clustering or classification scenarios is difficult because, to the best of our knowledge, there is currently no dedicated similarity or distance measure for this type of data. In this research, we aim to address this problem by considering the edit distance for sequences of sets.

Edit distance is the go-to measure when it comes to comparing symbolic sequences and is defined as the minimum number of operations required to transform one sequence into another. There are three such operations: insertion, deletion, and relabeling, each of which can have an associated cost. In the classic scenario this cost is equal to 1; however, it can be defined separately for each operation. In such a case, the interpretation of edit distance changes slightly to the minimum cost of a sequence of operations transforming one sequence into another.

Our goal in this research is to:

  • formalize the edit distance measure for sequences of sets [spoiler alert: it will be very similar to the original edit distance, but with differently defined costs of edit operations, probably built around Jaccard distance or another set-oriented metric],
  • prove it’s a metric,
  • implement it,
  • test it in a clustering and possibly classification scenario.

2 Measure definition

By a sequence \(s_i=\langle s_{i1}, s_{i2},...,s_{in}\rangle\) we understand an ordered multiset of elements, where each element \(s_{ik}\), \(k=1,...,|s_i|\), is drawn from some set \(\mathcal{E}\). Given two sequences \(s_1\), \(s_2\), the edit distance \(\delta(s_1, s_2)\) between them is defined using the following recursive formula: \[ \delta(s_1, s_2) = \min \begin{cases} \delta\left(s_1,s_2-s_{2\rightarrow}\right) + c_{ins}\left(s_{2\rightarrow}\right) \\ \delta\left(s_1-s_{1\rightarrow},s_2\right) + c_{del}\left(s_{1\rightarrow}\right) \\ \delta\left(s_1-s_{1\rightarrow},s_2-s_{2\rightarrow}\right) + c_{rel}\left(s_{1\rightarrow}, s_{2\rightarrow}\right) \\ \end{cases} \] with stop conditions defined as: \[\delta(\emptyset, \emptyset) = 0\] \[\delta(s_1, \emptyset) = \delta(s_1-s_{1\rightarrow}, \emptyset) + c_{del}\left(s_{1\rightarrow}\right)\] \[\delta(\emptyset, s_2) = \delta(\emptyset, s_2-s_{2\rightarrow}) + c_{ins}\left(s_{2\rightarrow}\right)\] where \(s_{i\rightarrow}\) is the rightmost element of sequence \(s_i\), \(s_i-s_{i\rightarrow}\) is \(s_i\) without its rightmost element, and \(c_{del/ins/rel}\) are cost functions for deletion, insertion, and relabeling, respectively.

Now, it’s all fun and games until we change the definition of an element. In our case, we define each element \(s_{ik}\) to be a subset of \(\mathcal{E}\) rather than its element. To me, this suggests that we have to adjust the cost functions of the edit operations to reflect the contents of elements, but herein lies the problem: we know that edit distance is a metric if all costs are equal and greater than 0. However, I believe that we can loosen this requirement and still be left with a metric if: a) given the same element, the cost of deletion is equal to the cost of insertion, and b) the cost of relabeling is itself a metric w.r.t. elements. The first requirement is easy since we wanted the deletion and insertion costs of any element to be equal anyway. So, regardless of whether we are deleting/inserting an element with 5 items or 100 items, the cost will always be 1. We will discuss this decision later. The second requirement can be fulfilled by using any distance metric for sets. In our case, we’ll start with Jaccard distance as a go-to set-oriented measure and see how it goes (very scientific…). Given two sets \(A\) and \(B\), the Jaccard distance is defined as follows: \[J(A, B) = 1 - \frac{\left|A\cap B\right|}{\left|A\cup B\right|}\]
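For example, for \(A=\{1,2\}\) and \(B=\{2,3\}\) we get \[J(A, B) = 1 - \frac{|\{2\}|}{|\{1,2,3\}|} = 1 - \frac{1}{3} = \frac{2}{3},\] so two sets sharing one of their three distinct items are two thirds of the way apart.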

Now that we have all the pieces in place, we should probably prove that the resulting measure is still a metric. It feels obvious and seems trivial; however, I have no idea how to do it yet, so let’s see how it works first and we’ll worry about the proof later ;-P

3 Implementation

Here’s the code for the measure.

editDistance = function(s1, s2) {
    # M[i, j] holds the distance between the first i - 1 elements of s1
    # and the first j - 1 elements of s2
    M = matrix(nrow = length(s1) + 1, ncol = length(s2) + 1)
    
    # stop conditions: transforming to/from an empty sequence costs one
    # deletion/insertion per element (this also covers empty inputs)
    M[, 1] = 0:length(s1)
    M[1, ] = 0:length(s2)
    
    for(i in seq_along(s1)) {
        for(j in seq_along(s2)) {
            M[i + 1, j + 1] = min(M[i, j + 1] + 1,                      # deletion
                                  M[i + 1, j] + 1,                      # insertion
                                  M[i, j] + jaccard(s1[[i]], s2[[j]]))  # relabeling
        }
    }
    
    M[length(s1) + 1, length(s2) + 1]
}

jaccard = function(set1, set2) {
    # Jaccard distance; by convention, two empty sets are identical
    if(length(set1) == 0 && length(set2) == 0) return(0)
    1 - (length(intersect(set1, set2)) / length(union(set1, set2)))
}

Sample sequences.

## Sequence 1:  (1 2) (1) (3 4)
## Sequence 2:  (5 1 6 2) (5 6 3 7)
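The construction code isn’t shown above; one plausible encoding, compatible with the editDistance() function, represents each sequence as a list of vectors:

s1 = list(c(1, 2), c(1), c(3, 4))        # (1 2) (1) (3 4)
s2 = list(c(5, 1, 6, 2), c(5, 6, 3, 7))  # (5 1 6 2) (5 6 3 7)
editDistance(s1, s2)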

And the distance between the two sequences is… (drumroll)… 2.3! I’d say it works pretty well :-D

4 Clustering

With our measure defined and implemented, let us test it in a clustering scenario.

We’ll start with a dataset containing sets of tags from posts of MathOverflow users [1].

There are 1594 sequences in the dataset. Let’s look at the distribution of their lengths.

I think it would be useful to get rid of the long tail of the data. The reason is that very long sequences will inevitably be at a very large distance from all other sequences due to their sheer length (you have to delete many elements to get from a long sequence to a short one). This will make visualization tricky and will also affect some clustering algorithms. DBSCAN would be fine, since it automatically deals with outliers, but k-means and AHC would be in trouble… There are only 12 sequences longer than 200, so we’ll get rid of those.

There. Much better. Let’s calculate the distance matrix between all sequences and visualize it using a heatmap. Now, I’ll be honest with you. I didn’t actually use the edit distance code I showed you before… I know… I mean, I did at first, but then I realized it would probably take a few days to compute the matrix, so I re-implemented it in Rcpp. Now it runs much faster. I’ll probably do a runtime comparison sometime soon, but for now, let’s see the results.

And the distribution of distances looks like this.

Using the computed distance matrix, we can cluster the dataset using DBSCAN, setting \(eps = 10\) and \(MinPts = 5\).
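As a sketch of this step, assuming the dbscan package and a precomputed distance matrix D (both names are assumptions on my part; passing a dist object makes dbscan use the precomputed distances rather than computing its own):

library(dbscan)

# cluster on the precomputed distances; cluster 0 collects the noise points
clustering = dbscan(as.dist(D), eps = 10, minPts = 5)
table(clustering$cluster)

Here’s the result.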

Well. This looks pretty terrible, but we have no idea if this dataset is in any way “clusterable” (it certainly wasn’t designed with this task in mind). In fact, I’ve just realized that this whole clustering experiment is kind of pointless, since we have neither the ground truth nor any other approach to compare against. This just proves that edit distance can be used in clustering… Bravo… What we could do is try classification using some distance-based algorithm and compare the result against a different distance metric and different representations. That way we’ll see if there’s any benefit to using edit distance over other approaches.

5 Classification

To test our edit distance on the classification task, we need a labeled dataset. However, to the best of my knowledge, classification of sequences of sets hasn’t been attempted yet, so there are none. Luckily, Benson et al. [1] provide several datasets for the purposes of their research, two of which are closely related, so we can try to create one based on those two. The two datasets in question contain sets of tags from posts of a) MathOverflow and b) Mathematics Stack Exchange users, the first of which we already used in the clustering experiment. Naturally, the goal will be to predict whether a given sequence of sets of tags comes from MathOverflow or Mathematics Stack Exchange. Since both of these datasets are math-oriented, there should be enough overlap between their tags to make the task non-trivial, but not enough to make it impossible. This should be fun!

5.1 Dataset

First, let’s read both datasets and see how many overlapping tags are there.

The Jaccard coefficient between the two sets of tags is 0.2000779. It’s a fair bit of overlap, but I’m afraid it may not be enough to make the task in any way challenging for a simple bag-of-words representation. However, there is also a big difference in the sizes of these datasets: MathOverflow has 1594 sequences, while Math SX has 15726. Since we’re not aiming at solving the problem of class imbalance, we can actually kill two birds with one stone here by selecting only the 1594 sequences from Math SX most similar to the ones from MathOverflow in terms of the used tags. This way, we will end up with a perfectly balanced dataset containing examples which should not be trivial to identify.
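The exact selection procedure isn’t spelled out above; one plausible sketch ranks each Math SX sequence by the Jaccard similarity of its tag set to the pooled MathOverflow tags and keeps the top 1594 (the variable names mo and msx are hypothetical):

# all tags used on MathOverflow, pooled into a single set
moTags = unique(unlist(mo))

# similarity of each Math SX sequence's tag set to the pooled tags
sims = sapply(msx, function(s) 1 - jaccard(unique(unlist(s)), moTags))

# keep the 1594 most similar sequences
msx = msx[order(sims, decreasing = TRUE)[1:1594]]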

After the operation, we get a dataset containing 3188 sequences divided into two classes 50:50. Interestingly, the distribution of their lengths is quite similar, as you can observe in the figure below. In the plot, we have omitted the sequences longer than 200 for better visualization (there are only 18 such sequences); however, they will not be removed from the analysis. There are now 1478 unique tags in the dataset, and the Jaccard coefficient between the tags in the two classes is 0.7699594.

With our dataset ready, let’s start with a simple bag of words representation to establish a point of reference.

5.2 Baseline — bag of words

After performing bag of words classification using a linear SVM, I’m afraid the dataset is simply too easy, since this simple model was enough to achieve 0.99841 average test accuracy over 50 runs. I think we have to distort the dataset a little to make the task more challenging, but before we do that, let’s do one more thing: let’s compute the distance matrix for this dataset using a) Jaccard distance on the bag of words representation and b) our edit distance.
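As a side note: with binary bag of words vectors (tag present or not), the Jaccard distance between two such vectors reduces to the Jaccard distance between the unioned tag sets of the two sequences, so we can reuse jaccard() from Section 3 (a sketch under that binary assumption):

bowJaccard = function(seq1, seq2) {
    jaccard(unique(unlist(seq1)), unique(unlist(seq2)))
}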

Figure: Jaccard distance on bag of words (heatmap).

Figure: Edit distance (heatmap).

Now let’s test these matrices in action using the kNN classifier.
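For reference, a minimal sketch of kNN working directly on a precomputed distance matrix (the matrix D, the label vector y, and the index vectors are assumed names, not code from the actual experiment):

knnPredict = function(D, y, trainIdx, testIdx, k = 3) {
    sapply(testIdx, function(i) {
        # indices of the k nearest training examples of test example i
        nn = trainIdx[order(D[i, trainIdx])[1:k]]
        # majority vote over their labels
        names(which.max(table(y[nn])))
    })
}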

The average accuracy over 50 runs of kNN (k = 3) using the Jaccard distance on bag of words is 0.9987657, while the accuracy using our edit distance is 0.9797699. A more detailed breakdown of the accuracy over all runs looks like this.

It seems like the bag of words approach produces better and more stable results than the edit distance approach. Now let’s look at a 2D MDS of the bag of words distance matrix.
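The projection itself is one line of base R; a sketch assuming the distance matrix D and labels y:

# classical (metric) MDS into 2 dimensions
xy = cmdscale(as.dist(D), k = 2)
plot(xy, col = as.integer(as.factor(y)), pch = 19, xlab = "MDS 1", ylab = "MDS 2")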

Well, this doesn’t look that hard to classify. Just keep in mind that we’ve projected a ~1478-dimensional space down to 2D. Let’s see how the same projection turns out for our edit distance.

Holy cow! I did not expect that! However, one can clearly observe one of the main drawbacks of edit distance, namely, its sensitivity w.r.t. the length of the sequences. There’s absolutely no doubt that the furthest sequences are the longest ones. Of course, this property is desirable in some scenarios; after all, there is a big difference between a sequence containing 100 elements and one containing 10, namely, 90 elements… However, in some cases we may not want that. Think about a history of customer purchases in a supermarket and imagine two customers buying nearly identical sets of products every time they visit. Are they similar? Well, according to edit distance, we don’t know! If one of those customers visits the shop twice as often as the other, their similarity is at most 50%! Is that “correct”? I don’t know, but I think in some cases it might be undesirable. The funny thing is that we already have a solution to this issue, which we developed for trees. It’s called partial tree-edit distance [2] and it measures the number of operations required to transform one tree into a subtree of the other tree. I know that “when all you have is a hammer, everything looks like a nail”, but I think in this case the same approach is warranted. If applied to our two-customers scenario, the difference in similarity stemming from sequence lengths would go away! Let’s try this approach and see what we get.

5.3 Partial edit distance

The only difference between the regular and the partial edit distance is that in the case of the latter, the cost of inserting an element into the first sequence is 0. This means that we are effectively looking for the subsequence of the second sequence which matches the first sequence best. Since we don’t know in advance which sequence should play the role of the subsequence, and the measure is not symmetric, given two sequences \(s_1\), \(s_2\) we have to compute it twice, \(\delta^\rightarrow(s_1, s_2)\) and \(\delta^\rightarrow(s_2, s_1)\), and take the smaller of the two results.
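A sketch of the change against the editDistance() implementation from Section 3 (insertions into the first sequence become free, and we symmetrize by taking the minimum of both directions):

partialEditDistanceDirected = function(s1, s2) {
    M = matrix(nrow = length(s1) + 1, ncol = length(s2) + 1)
    M[, 1] = 0:length(s1)
    M[1, ] = 0                  # inserting into s1 costs nothing
    for(i in seq_along(s1)) {
        for(j in seq_along(s2)) {
            M[i + 1, j + 1] = min(M[i, j + 1] + 1,    # deletion
                                  M[i + 1, j],        # free insertion
                                  M[i, j] + jaccard(s1[[i]], s2[[j]]))
        }
    }
    M[length(s1) + 1, length(s2) + 1]
}

partialEditDistance = function(s1, s2) {
    min(partialEditDistanceDirected(s1, s2),
        partialEditDistanceDirected(s2, s1))
}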

Let’s begin by calculating the partial edit distance matrix of our dataset and visualizing it.

Figure: Partial edit distance (heatmap).

And now let’s use this matrix in kNN classification and plot the result together with our two previous approaches.

It kinda worked. Using partial edit distance improved the classification quality over the standard edit distance, with average accuracy equal to 0.9898326. However, it still wasn’t enough to beat the bag of words approach. This tells me two things: 1) the dataset is too simple, and 2) the sets in each sequence are probably not dependent on the previous sets, i.e., the order of sets doesn’t matter, which means the dataset is more of a set of sets than a sequence of sets situation.

6 Scratchpad

Clustering into 2 clusters using k-medoids.
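A sketch of this step, assuming the cluster package and a precomputed distance matrix D; the external validation indices can then be computed with, e.g., the aricode package (all names here are assumptions):

library(cluster)

# k-medoids on the precomputed distances
clustering = pam(as.dist(D), k = 2, diss = TRUE)

# external validation against the class labels y, e.g.:
# aricode::ARI(clustering$clustering, y)
# aricode::AMI(clustering$clustering, y)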

The results are as follows. For Jaccard distance on BOW, ARI = 0.0111991, AMI = 0.0083427. For edit distance, ARI = 0.1915544, AMI = 0.2040623. For partial edit distance, ARI = 0.4648676, AMI = 0.3839137.

The above presented results on clustering give me confidence to start working on a paper. Roadmap:

  1. Gather more datasets (twitter?)
  2. Run classification and clustering on new datasets.
  3. Prepare awesome figures and examples illustrating:
    • three sample sequences of sets (two of similar lengths, one much longer)
    • bow representations of these sequences
    • distance between bow representations
    • edit distance on these sequences
    • partial edit distance on these sequences

Ok. I couldn’t find ANY labeled datasets containing sequences of sets, nor could I construct my own based on the existing ones. Therefore, I decided to write my own dataset generator, and I feel it was the right call. Now we can test how our measure behaves under controlled conditions! The generator is highly parametrizable:

  • number of sequences per class,
  • min/max length of a sequence,
  • max set size,
  • number of distinct items,
  • mean item ids in each class.

The length of each sequence, the size of each set, and the items in each set are drawn from normal distributions. Since items are numbers, we can control how the sequences differ between the classes using the means of their distributions in each class. Below is a hedged sketch of such a generator, followed by two sample distributions of items in two classes.
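In this sketch, the function and parameter names, the clipping, and the default standard deviations are my illustrative choices, not the exact implementation:

generateSequence = function(minLen, maxLen, maxSetSize, nItems,
                            itemMean, itemSd = nItems / 10) {
    # sequence length drawn from a normal distribution, clipped to range
    len = max(minLen, min(maxLen, round(rnorm(1, (minLen + maxLen) / 2))))
    lapply(seq_len(len), function(k) {
        # set size and item ids are drawn analogously
        size = max(1, min(maxSetSize, round(rnorm(1, maxSetSize / 2))))
        unique(pmax(1, pmin(nItems, round(rnorm(size, itemMean, itemSd)))))
    })
}

# one class of a two-class dataset; shifting itemMean between the
# classes controls how much their items overlap
class1 = replicate(100, generateSequence(1, 10, 5, 100, itemMean = 30),
                   simplify = FALSE)
class2 = replicate(100, generateSequence(1, 10, 5, 100, itemMean = 70),
                   simplify = FALSE)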

Figure: Highly distinct items in each class.

Figure: Highly overlapping items between the two classes.

Let’s play around with this generator. We’ll start by generating datasets consisting of two classes differentiated only by the means of the items selected, going from easily identifiable classes (highly distinct items in each class) to classes that are impossible to identify (exactly the same items in both classes). The results are as follows.

This plot clearly shows that for easily distinguishable classes, the choice of measure doesn’t matter that much. However, as the overlap in items between the classes grows, bag of words retains better accuracy than edit distance. This actually makes sense, since the items in each consecutive set are selected independently of the previous sets, which means this dataset represents sets of sets rather than sequences of sets. And since each set in each class is drawn from the same items and the same distribution, each sequence can be treated as a single multiset, which is exactly how the bag of words representation views it!

In order to generate truly sequential data, we need to introduce some dependence between the sets so that their order actually matters. Also, if we start playing around with different average lengths of sequences between classes or different average numbers of items in sets, we should start seeing the advantages of the edit distance over bag of words. Let’s start with average length of sequences. We’ll go from fully overlapping lengths (normally distributed between 1 and 10) to fully distinct lengths (normally distributed between 1 and 10 for class 1 and between 11 and 20 for class 2).

Yep. As predicted, once we introduce even a minor difference in the average lengths of the sequences between the classes, the edit distance immediately starts to show its superiority. Now let’s see how it goes with the average size of each set. Similarly, we’ll go from fully overlapping sizes to fully distinct ones.

As expected, the number of elements in each set is impossible to track by the bag of words representation but easy to pick up by edit distance.

Let’s carry out the same tests as previously but for clustering, i.e., changing the means, lengths, and set sizes. This time, however, let’s switch to affinity propagation, as I’ve noticed it produces more stable results.
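A sketch of the switch, assuming the apcluster package and a precomputed distance matrix D; affinity propagation works on similarities, hence the negated distances, and apclusterK() fixes the number of clusters:

library(apcluster)

# affinity propagation with exactly 2 clusters on negated distances
ap = apclusterK(-D, K = 2)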

Now, let us try to introduce some dependence between the sets in each sequence. We will do this by replicating the Correlated Repeated Unions (CRU) model from [1]. In particular, the model is built on 3 key observations regarding sequences of sets:

  1. Sets tend to repeat items from previous sets [repeat behaviour].
  2. Items tend to repeat in groups [subset correlation].
  3. The more recently an item appeared in a previous set, the more likely it is to appear in the current set [recency bias].

Let’s test our sequence generator by generating datasets with the decay parameter decreasing from 1 to 0.1 in steps of 0.1, and illustrating how much each set depends on its \(k\)th preceding set. The dependence is measured using the Jaccard coefficient between a given set and its \(k\)th preceding set, which is exactly the approach the authors of [1] used to illustrate the recency bias.
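The dependence measurement itself can be sketched in a few lines by reusing jaccard() from Section 3 (one minus the distance gives the coefficient):

recencyProfile = function(s, maxK = 5) {
    # average Jaccard coefficient between each set and its k-th predecessor
    sapply(seq_len(maxK), function(k) {
        if(length(s) <= k) return(NA)
        mean(sapply((k + 1):length(s), function(i) {
            1 - jaccard(s[[i]], s[[i - k]])
        }))
    })
}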

Ok. This looks absolutely spot on! Now let’s use this new model to generate datasets with increasing decay rates in one class and a very low decay rate in the second class to see whether bag of words or edit distance will pick it up.

References

[1] A. R. Benson, R. Kumar, and A. Tomkins, “Sequences of sets,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1148–1157, doi: 10.1145/3219819.3220100.

[2] M. Piernik and T. Morzy, “Partial tree-edit distance: A solution to the default class problem in pattern-based tree classification,” in Proceedings of the 21st Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, 2017, pp. 208–219.