1 Introduction

In our previous research we discussed the possibility of using clustering for feature extraction for classification purposes. We described a general methodology which requires only a distance measure and allows the use of any clustering algorithm, provided it generates representatives for each cluster. The methodology involves the following steps: a) clustering the training dataset, b) calculating the distances between all training examples and the cluster representatives (centers), c) encoding the distances as new features (one new feature per cluster), d) training a classifier on the new dataset. Afterwards, each new example has to be transformed into the new feature space, using the cluster representatives discovered from the training dataset, prior to being classified. However, when formulated this way, the role of clustering is de facto reduced to a method for selecting points in \(m\)-dimensional space, where \(m\) is the number of features.
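To make the pipeline concrete, here is a minimal sketch in R (k-means and the built-in iris data are used purely for illustration; any clustering algorithm and distance measure satisfying the assumptions above would do):

```r
# a) cluster the training dataset
set.seed(42)
train <- iris[, 1:4]
k <- 5
centers <- kmeans(train, centers = k)$centers

# b) + c) distance from every training example to every cluster center,
# encoded as one new feature per cluster
dist_features <- sapply(seq_len(k), function(j) {
  sqrt(rowSums(sweep(as.matrix(train), 2, centers[j, ])^2))
})
colnames(dist_features) <- paste0("dist_to_center_", seq_len(k))

# d) any classifier can now be trained on this transformed dataset;
# new examples are transformed with the same `centers` before prediction
new_train <- data.frame(dist_features, Class = iris$Species)
```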

In this research, we aim to explore this idea further to check whether clustering-generated points are in any way better than other point configurations. Other possible point selection strategies we aim to explore include:

  • uniformly distributed random points
  • random training examples
  • points selected using various measures of their representativeness, in particular measures from network analysis, like degree, betweenness, or closeness, but also measures that allow incorporating class information, like the ones we used in our study on network analysis for non-network data.

First of all, we should acknowledge that it is possible to select points using the above-described method in such a way that the training examples in the transformed feature space become linearly separable (the easiest way to achieve this is to pick as many points as there are examples). Generally, the fewer points we select, the better the generalization of the resulting model should be. However, the fewer points we select, the more important their positioning becomes. Given the optimal positioning for a given number of points, the number of points can be viewed as a regularization parameter. Consequently, our main focus should be to determine the optimal positioning for a given number of points. Afterwards, the number of points can be tuned during training as a hyper-parameter. This should be a slightly simpler optimization problem than determining the optimal number of points and their positions at once.

2 Example

First, let’s just play around with this idea a little bit on a dataset which is not linearly separable, but easily visualizable in 2D. For this purpose we will use a subset of features from the wine dataset, namely x1, x2, and Class.

Let’s start by visualizing the distribution of the clust.coef measure across all training examples.

Now, let’s show which points this framework would select when taking the top 3 points per class according to the clust.coef measure.
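For reference, the per-class top-k selection itself boils down to something like the following sketch (assuming the clust.coef values have already been computed for every training example and are passed in as `measure`):

```r
# pick the k highest-scoring examples within each class
top_k_per_class <- function(data, measure, k = 3) {
  picked <- lapply(split(seq_len(nrow(data)), data$Class), function(idx) {
    idx[head(order(measure[idx], decreasing = TRUE), k)]
  })
  data[unlist(picked), ]
}
```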

Finally, let’s see how clust.coef-generated points influence classification compared to original features and random points.

3 Notes on interpretation

Now, I’ve been thinking about what the above approach actually does, and it seems strikingly similar to a very basic neural net, but with a slightly different operation at each node and a vastly different (?) interpretation. Let’s explore this idea for a while. What our approach is basically doing is embedding the input space into \(R^m\), where each dimension corresponds to the Euclidean distance from one of the \(m\) selected points. So, for every input vector \(x\) and selected point \(w\), the respective dimension is generated as \(\sqrt{\sum_{i=1}^{n}(x_i-w_i)^2} = ||x-w||\), where \(n\) is the dimensionality of the input space. Now, one can look at it as a sort of neural network where, for each single neuron, \(x\) are the inputs and \(w\) the weights, but without the dot product and the activation function. In our case, there is no need for an activation function, since the non-linearity is introduced by the Euclidean distance; the dot product and the activation function are together substituted by \(||x-w||\). In the example analyzed above we have 2 input dimensions and 9 reference points (3 per class). If we simply plug the embedding into the standard softmax function in the output layer (3 outputs, because we have 3 classes), the corresponding network looks like this.

Figure 3.1: Sample neural network architecture
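As a compact sketch of this embedding (a hypothetical helper, with the \(m\) reference points stored as rows of \(W\)):

```r
# map every row of X to its vector of Euclidean distances from the rows of W
embed_inputs <- function(X, W) {
  t(apply(X, 1, function(x) sqrt(colSums((t(W) - x)^2))))
}

# e.g., 5 inputs in 2D against 3 reference points -> a 5 x 3 embedding
X <- matrix(rnorm(10), ncol = 2)
W <- matrix(rnorm(6), ncol = 2)
dim(embed_inputs(X, W))  # 5 3
```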

Remember, the only difference between this network and a “real” one is that the hidden layer aggregates its inputs a little differently. Naturally, we could cut the last layer and plug the embedding into further layers of a “real” neural network (or any other classifier, for that matter). Even with the standard dot product, which can be viewed as a similarity measure of sorts, this interpretation of a neuron still holds. This leads me to think that this interpretation was probably introduced in the 1950s and I was simply ignorant of it… I don’t know… I still find looking at a neural net’s weights as reference points in n-dimensional space pretty cool and definitely helpful for the interpretability of the model. Through this lens, each layer of a neural network represents a set of \(m\) reference points \(w^{(1)}, \ldots, w^{(m)}\), and training aims at finding the best positioning of these points with respect to whatever the next layer is doing.

Now, what the activation function originally does is introduce some non-linearity to the otherwise linear combination of components in the dot product. The Euclidean distance introduces its own non-linearity (a square root of a sum of squared differences). However, one could use other distance measures with this method to introduce other non-linearities (just like with other activation functions).
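For instance, here are two such interchangeable node operations (the function names are mine, just for illustration):

```r
# the non-linearity comes from the distance itself, so swapping the measure
# plays the role of swapping the activation function
euclidean_node <- function(x, w) sqrt(sum((x - w)^2))
manhattan_node <- function(x, w) sum(abs(x - w))

euclidean_node(c(1, 2), c(0, 0))  # 2.236068
manhattan_node(c(1, 2), c(0, 0))  # 3
```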

4 Developing the neural network interpretation

What’s nice about the above-described interpretation is that the resulting network with Euclidean distances at nodes can actually be trained using backpropagation. We don’t have to change that much. The only two differences are: a) in the feed-forward phase, the output of a neuron is calculated as \(\sqrt{\sum_{i=1}^{n}(x_i-w_i)^2} = ||x-w||\), b) in the backward phase we have to change one derivative. Originally, in the backward phase, for each weight \(w_i\) we calculate its impact on the total error, \(\frac{\partial Error}{\partial w_i}\), by expanding this expression using the chain rule: \[\frac{\partial Error}{\partial w_i} = \frac{\partial Error}{\partial out}\frac{\partial out}{\partial sig}\frac{\partial sig}{\partial w_i}\] where \(sig\) (signal) is the linear combination of inputs to a given neuron (\(sig=\sum_{i=0}^{n}w_ix_i\)) and \(out\) is the signal put through some activation function (e.g., \(\sigma(sig)\)). Now, since we don’t have separate signal and activation, we get: \[\frac{\partial Error}{\partial w_i} = \frac{\partial Error}{\partial dist}\frac{\partial dist}{\partial w_i}\] where \(dist = \sqrt{\sum_{i=1}^{n}(x_i-w_i)^2}\). The first derivative does not change at all from the original setting, since in our case \(dist\) is simply the output of the neuron. As for the second derivative, treating \(dist\) as a composite function \(dist=\sqrt{tmp}\) with \(tmp=\sum_{i=1}^{n}(x_i-w_i)^2\) and applying the chain rule again, we obtain: \[\frac{\partial dist}{\partial w_i}=\frac{\partial \sqrt{tmp}}{\partial tmp} \frac{\partial tmp}{\partial w_i}=\frac{1}{2\sqrt{tmp}}\cdot\big(-2(x_i-w_i)\big)=-\frac{x_i-w_i}{\sqrt{tmp}}=\frac{w_i-x_i}{dist}\]
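As a sanity check on the derivative above, here is a small numerical gradient check in R (a standalone sketch, not the implementation used later):

```r
# verify that d dist / d w_i = (w_i - x_i) / dist
x <- c(1.0, 2.0, 3.0)
w <- c(0.5, -1.0, 2.5)
dist_fun <- function(w, x) sqrt(sum((x - w)^2))

analytic <- (w - x) / dist_fun(w, x)

eps <- 1e-6
numeric <- sapply(seq_along(w), function(i) {
  w_plus <- w;  w_plus[i]  <- w_plus[i]  + eps
  w_minus <- w; w_minus[i] <- w_minus[i] - eps
  (dist_fun(w_plus, x) - dist_fun(w_minus, x)) / (2 * eps)
})

all.equal(analytic, numeric, tolerance = 1e-5)  # TRUE
```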

Now this is something we could actually try to implement to see what results a single hidden layer of such madness would produce compared to a regular neural net layer. We’ll get to that in a second, but wait — there’s more! Ever since I read the awesome lottery ticket hypothesis paper I can’t stop thinking about it in the context of this research…

Ok, I read it like a week ago…

Ok, I didn’t actually read it, I only listened to the talk from the 2019 ICLR conference, but still, I can’t stop thinking about how it could be used to select the appropriate number of reference points by pruning dead neurons. If you’re not familiar with the lottery ticket hypothesis, I highly recommend you read about it because it’s freakin’ awesome! The basic premise is to shrink a neural net to a minimal number of parameters (weights) by repeatedly training and pruning the connections (and eventually, possibly, neurons) with the lowest weights. It turns out that in many instances you can prune it pretty heavily without sacrificing accuracy; however, there is a catch! And this catch is where the brilliance of the lottery ticket hypothesis really comes in. If you do this process like you normally would, that is: pick random weights, train, prune, pick random weights, train, prune, etc., it won’t work! It works only if you preserve the initial random weights from the first run.

I KNOW! When you think about it, it totally makes sense, since different initial weights would probably light up a completely different combination of weights and neurons, but coming up with this idea must have felt like a major eureka moment. It actually reminds me a lot of the Monty Hall problem, but I digress… My point is, by pruning our neural net where each neuron represents a reference point, we could potentially have a mechanism for determining the minimal number of such points. This is something we’ll definitely try later, but first, let’s go back and test our concept of a Euclidean layer in a neural network.

5 Euclidean layer

Following the suggestion of Dr. Dariusz Brzeziński, I implemented the above-described idea as a custom layer in keras instead of writing everything from scratch. Thanks to this move, I can compare the Euclidean layer with a standard dense layer under the same conditions, while being able to experiment with various activation functions, optimizers, evaluation measures, etc. Let’s start with the iris dataset reduced to only two features (+ the decision attribute). 70% of the data is used for training while the rest is used for testing. In both runs, the architecture will be the same as the one presented in Fig. 3.1. However, in the first run the hidden layer will be an ordinary dense layer with the ReLU activation function, while in the second run it will be the Euclidean layer. We will use SGD as the optimizer, categorical crossentropy as the loss function, and accuracy as the evaluation measure.
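For reference, a sketch of the baseline (dense) setup in R keras could look as follows; the layer sizes follow Fig. 3.1, and `x_train`/`y_train` are placeholders. The second run would simply swap the hidden `layer_dense()` for the custom Euclidean layer.

```r
library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 9, activation = "relu", input_shape = c(2)) %>%  # hidden layer, 9 units as in Fig. 3.1
  layer_dense(units = 3, activation = "softmax")                       # 3 classes

model %>% compile(
  optimizer = optimizer_sgd(),
  loss = "categorical_crossentropy",
  metrics = "accuracy"
)

# history <- model %>% fit(x_train, y_train, epochs = 200, verbose = 0)
```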

Here’s what the training with a regular dense layer looks like.

And here’s the training with the Euclidean layer.

Now let’s see how these two models handle the test data.

             Regular layer   Euclidean layer
loss             0.1216545         0.1955114
accuracy         0.9555556         0.9111111

Well, there is some difference in accuracy, but I would have to repeat this experiment multiple times and take the average to get a real sense of this outcome, since currently the results vary drastically between runs (for now you’ll have to take my word for it). However, the results themselves are actually not the point. Sure, they should be at least comparable, but the main point is the interpretation of the Euclidean layer. Each neuron represents a reference point in the input feature space. We can plot these reference points and see what they look like w.r.t. the training data.
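A sketch of how such a plot could be produced (assuming the Euclidean layer is the first layer of `model`, that its weight matrix stores one reference point per row, and that `x1`, `x2`, and `Class` are the column names of the training data — all of these are assumptions, not the exact code used here):

```r
ref_points <- get_weights(model)[[1]]                    # learned reference points
plot(train$x1, train$x2, col = train$Class, pch = 19)    # training data
points(ref_points[, 1], ref_points[, 2], pch = 4, cex = 2, lwd = 2)  # reference points
```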

How cool is that! We have just trained a network to pick the best reference points from the point of view of accuracy.

6 Taking this idea further

We’re back live! The last time I looked into this research was 2020-02-03, so it’s been a while. Now, after this forced hiatus caused by teaching and another research project I’ve been working on, I have a new perspective. First of all, the neural net interpretation is fun and all, but now I think we need to focus on getting this thing to actually work. Currently, it kinda works, that is, it is capable of classifying any set of data given a distance function. Unfortunately, the same can be said about a coin toss (even without the distance function constraint!), so we have to make it perform better, or at least comparably to other classifiers capable of tackling arbitrary types of data (I’m looking at you, kNN and Similarity Forests). So, here are some ideas for the road ahead.

  • It would be nice to have some sort of ceiling analysis showing what this approach is capable of if maxed out. We’ve already established that we take the number of points as a given (hyper-parameter tuned during training) and focus on selecting the optimal position for each point. So the question is - what is the optimal position of the points?
  • After we answer the previous question, we can start figuring out how to find these points efficiently. We’ve already tried clustering, random points, and network centrality measures, but these were all more gut-feeling choices (OK, clustering might have been a bit more than that) rather than informed decisions. The neural net approach seems like a good candidate, but we lose the universality of the method, as it requires numeric features.
  • Finally, we should compare our developed approach against kNN and Similarity Forests.

Let’s go!

7 Angle neighborhood

If you find the title of this section enigmatic - bear with me - it’ll start making sense by the end.

When trying to find out which reference points would be optimal, I stumbled across a big problem… Namely… I couldn’t do it… However, progress has been made as I have several important insights into the problem.

The first major insight is that I think we should emphasize the universality of this approach w.r.t. the objects, that is, we don’t require the objects to be defined in any particular space. We just require a distance measure. This has a very important consequence - we’re not actually looking for reference points in space but rather for reference training examples! This makes the problem way simpler (still hard, though). Now, with this in mind, I played around with various network centrality measures and came to realize that selecting the top-k points according to any measure is a big mistake!

Why, you ask?

Well, let me explain. Consider the following example 2D dataset (remember - we don’t actually have the space - just the examples, but the following would still hold if we had the space).

And here’s the same dataset in 3D so you can better understand my point.

Now, look at the top 5 nodes according to the degree (18, 5, 6, 14, 15). What you’ll find is that they all come from the same region, and if they were even closer to each other, they would still have been selected. Do they look like the best reference points for the whole dataset? Of course not! That’s because we shouldn’t actually be looking for a global ranking, but for local maxima. What we want to discover in this landscape are the peaks! By the way, this also solves the problem of deciding how many of these points we should select, as there are only as many peaks to choose from. For now, I suggest we simply select all peaks from this data and, as far as I’m concerned, there are 5: 2, 6, 15, 24, 33.

I know what you’re thinking right now. Let’s just take the second derivative of this “function” and we’ll have the local maxima. Great! But now let’s go back to our assumption. First of all - we don’t actually have a function - just points. Secondly - we don’t actually have points - just objects! We can even simplify it a bit more and say we just have a distance matrix - that’s it. You can view it as a full graph with distance on the edges. So how to find peaks in such a setting?

I’ve been thinking about it for the past few days and came up with several possible solutions - all unsatisfying - until I finally found a satisfying one. First, I’ll describe the intuition behind it and then I’ll show you how to actually do it.

7.1 Intuition

The intuition is simple. For each node, we check if it has any higher nodes in its nearest neighborhood. If it doesn’t - it’s a peak!

Simple, right?

However, one question remains - what counts as the nearest neighborhood of a node? The answer to this question is surprisingly complicated and is the key to understanding this method. By the way - it’s not what you think - it’s not the k closest nodes, and here’s why. Consider the following example. Say we want to find the nearest neighbors of the blue node. Traditionally, if we wanted to find the top 4 nearest neighbors, we would pick nodes (0, 1), (1, 2), (2, 1), and (2, 0). However, when thinking about topography and deciding whether our blue node is a peak or not, I think we should be considering nodes (1, 2), (2, 1), (2, 6), and (6, 5). If all of them are at a lower altitude than our node - our node is a peak.

Now that I think of it, maybe calling it a nearest neighborhood is not the best name? Maybe we should call it a closest surrounding or something like that? Either way, the question remains: how to find such a surrounding? And this goes back to the weird angle neighborhood title of this section. Since we want to sample the closest points in all directions from a given point (however many directions there are in our undefined space), first we need to define a direction. Let’s look at it from the perspective of any point (object) \(p\) and a candidate surrounding point \(k\) shown in the figure below. What we want to know is whether \(k\) is the nearest point to \(p\) in its general direction, or whether there are other points in this direction which are closer to \(p\). In other words, we are looking for any point \(x\) such that it is closer to \(p\) than \(k\) and lies in the general direction from \(p\) to \(k\). If we find one, it means that \(k\) is not in the closest surrounding of \(p\). Now, why do I keep calling it the general direction? Because if we just looked for the points lying in the precise direction from \(p\) to \(k\), we would probably find very few, if any, and would effectively include all points in the closest surrounding of any given point. What we do instead is define an area between \(p\) and \(k\) which we’ll call… ahhh, why not… the general direction from \(p\) to \(k\). This area takes the shape of a triangle defined by the angle \(\theta\) presented in the figure below (and this is where the name of this section comes from). On a 2D plane the whole actual area looks like a diamond, and in 3D it would look like two cones joined at their base, but in the general case, since we’re always considering only 3 nodes at a time, it’s an isosceles triangle. And this is very important, because we don’t actually have a space. Remember - we just have objects and the distances between them! In the figure below you can see that, with such a definition of a closest surrounding, \(k\) would be disqualified from the closest surrounding of \(p\) by the red points and wouldn’t be disqualified by the green ones.

Figure 7.1: Angle neighborhood

And that’s it as far as the intuition is concerned! What’s neat about this solution is that it determines the number of peaks itself - just as it should!

Now, if we use this method for finding peaks in our previous example with degrees and \(\theta\) defined intuitively at \(45^{\circ}\), we get 5 perfectly defined peaks: 2, 6, 15, 24, 33. I have to say - the more I write about this idea, the more I like it:-)

7.2 Technical description

Once you understand the intuition behind the angle neighborhood (closest surrounding? - I have to decide on the adequate name sometime soon…), the technical explanation is rather straightforward. For any point \(p\) and a candidate point \(k\), \(k\) is NOT in the closest surrounding of \(p\) if there exists any other point \(x\) such that the angle at \(p\) between the segments \(pk\) and \(px\) is smaller than \(\theta\), and the angle at \(k\) between the segments \(kp\) and \(kx\) is smaller than \(\theta\). Luckily, we have all the necessary distances, so the above is easy to calculate using the law of cosines. So, to rephrase it in a more computable way, \(k\) is NOT in the closest surrounding of \(p\) if there exists any other point \(x\) such that: \[\begin{aligned} & \delta(p,x) < \delta(p,k) \\ & \text{and} \quad \delta(k,x) < \delta(p,k) \\ & \text{and} \quad \frac{\delta(p,k)^2 + \delta(p,x)^2 - \delta(k,x)^2}{2\cdot\delta(p,k)\cdot\delta(p,x)} > \cos\theta \\ & \text{and} \quad \frac{\delta(p,k)^2 + \delta(k,x)^2 - \delta(p,x)^2}{2\cdot\delta(p,k)\cdot\delta(k,x)} > \cos\theta \end{aligned}\] where \(\delta(a,b)\) denotes the distance between points \(a\) and \(b\).
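A direct translation of these conditions into R, operating purely on a distance matrix `D`, could look like this (a sketch; `measure` stands for whatever centrality measure defines the “altitude”):

```r
# TRUE if k is in the closest surrounding of p, i.e., no other point x
# disqualifies it according to the four conditions above
in_closest_surrounding <- function(D, p, k, theta = 45 * pi / 180) {
  cos_theta <- cos(theta)
  for (x in seq_len(nrow(D))) {
    if (x == p || x == k) next
    if (D[p, x] < D[p, k] && D[k, x] < D[p, k] &&
        (D[p, k]^2 + D[p, x]^2 - D[k, x]^2) / (2 * D[p, k] * D[p, x]) > cos_theta &&
        (D[p, k]^2 + D[k, x]^2 - D[p, x]^2) / (2 * D[p, k] * D[k, x]) > cos_theta) {
      return(FALSE)
    }
  }
  TRUE
}

# a point is a peak if no point in its closest surrounding has a higher measure value
is_peak <- function(D, p, measure, theta = 45 * pi / 180) {
  surrounding <- Filter(function(k) k != p && in_closest_surrounding(D, p, k, theta),
                        seq_len(nrow(D)))
  all(measure[surrounding] <= measure[p])
}
```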

8 Graph measures example re-run using angle neighborhood

Let’s now take the above-defined concepts and apply them to the experiment we’ve already run before, so we can compare the outcomes. We’ll compare the performance of our reference points defined using peaks vs the traditional top-\(k\) approach based on degree. Here are the results.

Ok - this is nice. Not much, but still nice:-)


8.1 Peak degree vs topk degree

Now, let’s do a real comparison - peak degree vs top-\(k\) degree, multiple datasets (16), multiple repeats (30) - and see who is who. We’ll use a linear SVM.

I’d say this is a pretty conclusive win for the peaks.

8.2 Peak degree through linear SVM vs knn

Ok, the results are slightly in favor of knn, but definitely not conclusively so (although in a few cases the difference is significant). However, now that I think of it, I’m not entirely sure what this test was supposed to show… The idea was to put our generated features to the test against a standard distance-based approach (knn), but the choice of linear SVM is arbitrary. If the results were the other way around (indicating the superiority of our features), that would show that the generated space is linearly separable, which would be an awesome result and I’d probably welcome it with open arms, but I’m not sure we should even expect that.

8.3 Comments on evaluation

To be honest, I don’t really know how to comparatively evaluate our approach, since it doesn’t fit into any of the usual categories. It’s neither pure feature extraction (that’s domain specific) nor distance-based classification (that involves… well… classification). What we’re doing is actually a blend of the two - a universal distance-based feature extraction method. I hate tap dancing around the results when they don’t pan out the way I’d wished for, but - since this happens literally always - maybe I’m setting my expectations too high in the first place? Maybe “the best method that’ll beat all competition” is not the target I should be aiming for and “a new approach which makes sense in some cases and has certain advantages” is a better goal. I don’t know. Either way - my attitude doesn’t really matter here - what matters is to go forward! So, since I’ve already started working on a paper and I don’t want this issue to stall me, I’ll just carry on and hope that I’ll figure it out as I go.

So far, the contributions of the paper look as follows.

  1. Feature extraction method using distances from reference examples (the Method)
  2. Neural network interpretation of the method along with the necessary modifications to the backpropagation algorithm (the Euclidean layer)
  3. A new neighborhood definition related to topography of objects without explicitly defined space (\(\theta\)-neighborhood)
  4. Experimental evaluation
    • Top vs peak [done]
    • Comparison of network centrality measures: degree, clustering coefficient, page rank, status, PN
    • Comparison of various methods to generate points: peak centrality, random examples, clustering
    • Comparative evaluation against competition - yeah, this I’ll need to figure out

I’ll update the report once I have some new thoughts or results.

8.4 Comparison of centrality measures

The uniformity of these results is extremely interesting. Why? Well, one thing you don’t see here is the number of peaks generated by each of these methods. And these numbers differ between the methods and datasets. Like, a lot! Here it is.

Since the number of peaks translates 1:1 to the number of features, I would expect much greater differences between the approaches. But it’s totally not the case! Therefore… Fascinating!

Now that I think of it - actually not that fascinating, as PN mostly comes out on top and it’s the one with the most reference points. So this result is actually pretty consistent with my expectations.

Below you can find some other results. I’m not too happy with how they turned out, so I’m leaving them without a comment… for now.

8.5 Comparison of classifiers on our features

8.6 Knn on original distances vs knn on distances calculated using our features

8.7 Knn on original distances vs SVM + RBF using our features

9 A new idea?

When thinking about what went wrong in the experiments above, I stumbled upon a simple realization, and I’m not sure I thought about it this way before. In the limit, when we select all training examples as reference points (\(m=n\)), we simply supply the whole distance matrix as the dataset. I’m not saying this would make sense, just that it is true. So when selecting a given example as a reference point, we simply take its corresponding row/column from the distance matrix and declare it a feature. Computing the whole distance matrix may be prohibitive in many situations, especially when dealing with objects for which computing the distance is costly. This led me to think about picking random reference points together with random examples to train a simple classifier on, and doing so multiple times… See where I’m going with this? I realize the name “Similarity Forests” is already taken, but “Distance Forests” is not ;-P The method would work like this.

For each tree in the forest:

  1. pick \(k\) random reference points \(D_{ref}\) (\(k \ll n\)),
  2. pick \(b\) random training examples \(D_b\) (\(b \leq n\)),
  3. calculate the distances from all examples in \(D_{b}\) to all reference points \(D_{ref}\) and encode them as features,
  4. train the tree.

Naturally, classification is carried out based on the majority voting of the trees.
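A rough sketch of how this could look in R (rpart as an arbitrary base learner, `delta` an arbitrary distance function between two objects; none of this is implemented or tested yet):

```r
library(rpart)

distance_forest <- function(train, delta, n_trees = 50, k = 5, b = nrow(train)) {
  objs <- train[, setdiff(names(train), "Class"), drop = FALSE]
  lapply(seq_len(n_trees), function(t) {
    ref_idx <- sample(nrow(objs), k)   # k random reference points (k << n)
    bag_idx <- sample(nrow(objs), b)   # b random training examples (b <= n)
    feats <- sapply(ref_idx, function(r)
      sapply(bag_idx, function(i) delta(objs[i, ], objs[r, ])))
    feats <- matrix(feats, nrow = length(bag_idx),
                    dimnames = list(NULL, paste0("ref_", seq_len(k))))
    list(ref_idx = ref_idx,
         tree = rpart(Class ~ ., method = "class",
                      data = data.frame(feats, Class = train$Class[bag_idx])))
  })
}

predict_forest <- function(forest, train, delta, newdata) {
  objs <- train[, setdiff(names(train), "Class"), drop = FALSE]
  votes <- sapply(forest, function(f) {
    feats <- sapply(f$ref_idx, function(r)
      sapply(seq_len(nrow(newdata)), function(i) delta(newdata[i, ], objs[r, ])))
    feats <- matrix(feats, nrow = nrow(newdata),
                    dimnames = list(NULL, paste0("ref_", seq_along(f$ref_idx))))
    as.character(predict(f$tree, data.frame(feats), type = "class"))
  })
  votes <- matrix(votes, nrow = nrow(newdata))
  apply(votes, 1, function(v) names(which.max(table(v))))  # majority voting
}
```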

Now I’m also curious what is the relationship between this approach and Similarity Forests. However, to answer this question we’ll have to wait a bit since I’m taking 3 weeks off. I’ll see you on the \(7^{th}\) of September.

Still there? Ok, one last thing before I go. I think I know what went “wrong” in the experiments above. When you look at the comparison of centrality measures it actually becomes clear. We’re selecting too few points! After all, this was our conclusion from our experiments with clustering - the exact placement of the reference points is not as important as their number. A simple way to check it would be to change degree to PN and run the experiments again. I’m sure I’ll be able to refrain from testing it till the end of my vacation…