Unlocking Distances: Finding the Furthest Pairs in Your Data

Ever found yourself staring at a matrix of numbers, trying to pinpoint the most significant relationships? It's a common puzzle, especially when you're working with data points and want to understand which ones are furthest apart. Let's say you have a collection of vectors, like these:

data = np.array([[5, 6, 1], [2, 0, 8], [4, 9, 3]])

And you've used a handy tool like sklearn.pairwise_distances to calculate how far apart each pair of these vectors is. What you get back is a distance matrix. It looks something like this:

dists = np.array([
    [ 0.        ,  9.69535971,  3.74165739],
    [ 9.69535971,  0.        , 10.48808848],
    [ 3.74165739, 10.48808848,  0.        ]
])

Notice how the diagonal is all zeros? That's because a vector is always zero distance from itself. And see how the matrix is symmetrical? The distance from vector A to vector B is the same as from B to A. This symmetry is great, but it can sometimes trip you up when you're trying to find the largest distances.

Your goal is to find the indices that correspond to the biggest numbers in this dists matrix. These indices will tell you which original vectors are the most distant from each other. You might think, 'Easy! I'll just find the maximum in each row or column.' And you'd be on the right track, but there's a little snag.

If you try something like np.argmax(np.max(dists, axis=1)), you're essentially asking, 'For each row, which column has the biggest value?' Because of the symmetry and how argmax works (it returns the first maximum it finds), you might end up pointing to a zero on the diagonal instead of the actual pair of vectors that are furthest apart. It's like looking for the tallest person in a room and accidentally pointing at yourself if you're the tallest, rather than pointing to the person who is tallest relative to you.

So, how do we get around this and truly find those top N furthest pairs? The trick is to flatten the distance matrix and then use argsort. Think of argsort as sorting the values and giving you back the indices of where those sorted values would go. When you flatten the matrix, you're essentially creating a single list of all the distances.

Let's say we want the top 2 furthest pairs. We can flatten the dists matrix and then use argsort to get the indices of the largest values. However, we need to be careful to exclude the diagonal zeros. A common approach is to get all the indices using np.triu_indices (which gives you the upper triangle, excluding the diagonal) and then use those indices to select the relevant distances before sorting.

Here’s a more robust way to think about it: you want to find the indices of the largest values in the upper triangle of the distance matrix (since the lower triangle is just a mirror image). Once you have those top distances, you can map them back to the original row and column indices, which then tell you which pairs of vectors are the most separated.

It’s a bit like sifting through a pile of pebbles to find the two largest ones. You don't just pick the biggest pebble you see; you compare them all to find the true pair that stands out. This process helps ensure you're not just finding local maximums but the global ones that truly represent the greatest distances in your dataset.

You Might Also Like

Leave a Reply Cancel reply