Given user data, one often wants to find approximate matches in a large database. A good example of such a task is finding images similar to a given image in a large collection of images. We focus on the important and technically difficult case where each data element is high dimensional or, more generally, is represented by a point in a large metric space and distance calculations are computationally expensive.
In this paper we introduce a data structure to solve this problem called a GNAT -- Geometric Near-neighbor Access Tree. It is based on the philosophy that the data structure should act as a hierarchical geometrical model of the data as opposed to a simple decomposition of the data which doesn't use its intrinsic geometry. In experiments, we find that GNAT's outperform previous data structures in a number of applications.
Keywords -- near neighbor, metric space, approximate queries, data mining, Dirichlet domains, Voronoi regions.
The problem of finding the near neighbors of a given point in a large data set has been well studied and has a number of good solutions if the data is in a simple (e.g., Euclidean), low-dimensional space. However, if the data lies in a large metric space the problem becomes much more difficult. Consider the following examples as a small sample of where this problem occurs:
Information Retrieval -- Finding sentences similar to a user's query from a given database of sentences.
Genetics -- Finding similar DNA or protein sequences in one of a number of large genetics databases.
Speaker Recognition -- Finding similar vocal patterns (e.g., under Fourier transforms) from a database of vocal patterns.
Image Recognition -- Finding images similar (using the Hausdorff metric [HKR93]) to a given one from a large image library.
Video Compression -- Finding the image blocks of a previous frame that are similar to blocks in a new frame (using a simple L1 or L2 metric, possibly after a DCT transform) to generate motion vectors in MPEG video compression.
Data Mining -- Finding approximate time series matches (e.g., stock histories or year-long temperature records).
*Supported by a Fellowship from the NSF.
1 By a large metric space we mean a space such that the volume of a ball grows very rapidly as its radius increases. High dimensional vector spaces are an example (a ball of radius 2 in a 20 dimensional Euclidean space is over a million times larger than a ball of radius 1). See Section 3.
All of the examples above fit into the model of finding near neighbors in a large metric space. In particular, the first two examples above find near neighbors in the metric space of strings under some edit distance function. The speaker, video, and possibly data mining examples find near neighbors in a high dimensional vector space under the L1 or L2 metric (more sophisticated metrics could be imagined). The image recognition example does not fit as nicely into either of these classes but still qualifies as near-neighbor search in a large metric space.
Every data type above has some degree of correlation in its distribution. While it may be small (i.e., the data resembles random vectors), it must be exploited to get good performance in a near neighbor search. To do this in an application-independent manner requires that the data structure capture the intrinsic geometry of the data. As we will see (Section 4), our data structure, the GNAT, captures the geometry of data collections such as the ones mentioned above by hierarchically breaking them down into regions that try to preserve fundamental geometric structure.
The other category of solutions is hierarchical and typically has an O(log n) query time given a sufficiently small range (typically too small to be practical). They are of the following form: The space is broken up hierarchically. At the top node, one or several data points are chosen. Then the distance between each of these to each of the remaining points is computed. Based on these distances, the points are separated into two or several different branches. For each branch, the structure is constructed recursively.
J. K. Uhlmann outlined the foundation for two different methods, generally described as metric trees [Uhl91]. One of these methods, subsequently called vp-trees, was implemented by P. N. Yianilos [Yia93]. The basic construction of a vp-tree is to break the space up using spherical cuts. To build it, pick a point in the data set (this is called the vantage point, hence the name vp-tree). Now, consider the median sphere centered at the vantage point with a radius such that half the remaining points fall inside it and half fall outside. For every other point, put it in one branch if it is inside the sphere and in another branch if it is outside the sphere. Now, recursively construct the lower level branches.
This approach has the benefits of requiring only one distance calculation per node and automatically creating balanced trees. However, it suffers from the regions inside and outside the median sphere being very asymmetric, especially in a high-dimensional space. Since volume grows rapidly as the radius of a sphere increases, the outside of the sphere will tend to be very thin, given that there are as many points on the inside as on the outside, thus worsening search performance. In our work, we try to avoid such asymmetries. While the limited branching factor of 2 can also be viewed as a weakness, we have conducted experiments with higher degree variations of vp-trees and find little improvement in performance.
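The vp-tree construction and search just described can be sketched in a few lines. This is an illustrative reimplementation (not Yianilos's code): it simply takes the first point of each subset as the vantage point rather than sampling, and prunes with the triangle inequality against the median sphere.

```python
import statistics

def build_vp_tree(points, dist):
    """Split points by the median distance to a vantage point, recursively."""
    if not points:
        return None
    vantage, rest = points[0], points[1:]
    if not rest:
        return {"vp": vantage, "mu": None, "inside": None, "outside": None}
    dists = [dist(vantage, p) for p in rest]
    mu = statistics.median(dists)            # radius of the median sphere
    inside = [p for p, d in zip(rest, dists) if d <= mu]
    outside = [p for p, d in zip(rest, dists) if d > mu]
    return {"vp": vantage, "mu": mu,
            "inside": build_vp_tree(inside, dist),
            "outside": build_vp_tree(outside, dist)}

def search_vp_tree(node, dist, x, r, out):
    """Collect every stored point within distance r of x."""
    if node is None:
        return
    d = dist(x, node["vp"])
    if d <= r:
        out.append(node["vp"])
    if node["mu"] is None:
        return
    if d - r <= node["mu"]:       # the query ball may reach inside the sphere
        search_vp_tree(node["inside"], dist, x, r, out)
    if d + r > node["mu"]:        # ...and it may reach outside the sphere
        search_vp_tree(node["outside"], dist, x, r, out)
```

Note that a single query often has to descend both branches, which is precisely the asymmetry problem discussed above.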
The other method, the generalized hyperplane tree (gh-tree), is constructed as follows. At the top node, pick two points. Then, divide the remaining points based on which of these two they are closer to. Now, recursively build both branches. This method is an improvement in that it is symmetric and the tree structure still tends to be well balanced (assuming sufficiently random selection of the two points). However, it has a weakness in that it requires two distance computations at every node and is limited to a branching factor of two.
A variation of gh-trees was implemented at ETH Zurich [BFR+93] as monotonous bisector trees (MBT's) to deal specifically with text. However, nothing in the method would have prevented them from dealing with arbitrary metric spaces. The key difference between MBT's and gh-trees is that MBT's only select one new point at each node, reusing a point of the parent node as the other.
2 Some of the papers we mention below address the problem of finding nearest neighbors. However, their methods can be applied to finding all near neighbors with minimal change.
3 We do not look at the enhancement of vp-trees called vpsb-trees.
The most relevant works, however, are also the oldest. Burkhard and Keller suggested several data structures (and algorithms) [BK73] for approximate search. The first is very similar to vp-trees except that it requires a finite number of discrete distance values. Essentially, for every vantage point, a separate branch is allocated for every possible distance value. This method, however, suffers from the same asymmetry problem as the vp-trees. The other two data structures, which are the closest to the GNAT, break up the space into a number of balls, storing the radii and centers. More specifically, divide the data points into groups using some method (this was left as a parameter). Pick a representative of each group and call it the center of the group. Then, calculate the radius (the maximal distance from the center to another point) of each group; pruning is performed based on these radii. Recursion is briefly mentioned but not analysed. The third method, an enhancement of the second, additionally requires that the diameter (the maximal distance between any two points) of the points in any group be less than a constant, k, and the group is then called a clique. In this case a minimal subset of the set of all maximal cliques is used as the set of all groups. These two schemes act as reasonably good models of the data space they store and, if extended to a hierarchical structure, they have an arbitrary branching factor. However, they have several weaknesses. First, they do not work well with nonhomogeneous data, since we could easily end up with a lot of cliques containing only one point and several cliques containing very many points. Additionally, distance computations are not fully exploited in that the distance to the center of one clique is not used to prune other cliques. Finally, while we do not focus on the cost of preprocessing in this paper, this cost was reported to be extremely high in the third method.
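Burkhard and Keller's first structure lends itself to a compact sketch: each node stores one vantage point and one child per discrete distance value, and the triangle inequality confines a search to the branches whose keys lie within r of the query's distance to the vantage point. A minimal illustration, assuming integer-valued distances:

```python
def bk_insert(node, point, dist):
    """Insert a point into a Burkhard-Keller tree; children are keyed by distance."""
    if node is None:
        return {"point": point, "children": {}}
    d = dist(point, node["point"])
    node["children"][d] = bk_insert(node["children"].get(d), point, dist)
    return node

def bk_search(node, query, r, dist, out):
    """Report every stored point within distance r of query."""
    if node is None:
        return
    d = dist(query, node["point"])
    if d <= r:
        out.append(node["point"])
    # The triangle inequality confines matches to branches keyed in [d - r, d + r].
    for key, child in node["children"].items():
        if d - r <= key <= d + r:
            bk_search(child, query, r, dist, out)
```

With many distinct distance values per node, the branch keyed at the query's own distance tends to hold far fewer points than the rest, which is the asymmetry noted above.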
K. Fukunaga and P. Narendra worked out a very similar scheme, which requires more than just a metric space, to create a tree structure with an arbitrary branching factor in 1975 [FN75] as follows. Divide the data points into k groups. (How this is done is left as a parameter of the structure, but in tests they used a clustering algorithm which requires more than just a metric space.) Then compute the mean of each group (once again a departure from a metric space) and the farthest distance from that mean to a point in the group. Then recursively create the structure for each group. While this method tends to have nice symmetric properties (given a reasonable clustering algorithm) that reflect the space and it has an arbitrary branching factor, it has several weaknesses. First, it relies on more than just a metric space; second, it requires many distance computations at each node and does not use them fully; and third, it does not deal effectively with balancing.
In this paper we present GNAT's, which can be viewed as both a generalization of Fukunaga's method and a generalization of gh-trees. GNAT's provide good query performance by exploiting the geometry of the data. Unfortunately, while query time is reduced, the build time of the data structure is sharply increased. However, if the application is query dominant (or even if there are roughly as many queries as data points) the relative cost of building a GNAT becomes negligible. In tests, we find that GNAT's almost always perform better than both vp-trees and gh-trees, and scale better.
3 Large Metric Spaces
A metric space is a set X with a distance function d : X x X -> R such that, for all x, y, z in X:
1. d(x, y) >= 0, and d(x, y) = 0 iff x = y. (Positivity)
2. d(x, y) = d(y, x). (Symmetry)
3. d(x, y) + d(y, z) >= d(x, z). (Triangle Inequality)
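For a concrete check, the three axioms can be verified numerically on a sample of points. The sketch below (illustrative only; not part of the system described here) does so for the L1 metric on random vectors:

```python
import itertools
import random

def l1(x, y):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

def check_metric(dist, points, tol=1e-9):
    """Return True if dist satisfies the metric axioms on the given sample."""
    for x, y, z in itertools.product(points, repeat=3):
        if dist(x, y) < -tol:                           # positivity
            return False
        if (dist(x, y) < tol) != (x == y):              # zero iff identical
            return False
        if abs(dist(x, y) - dist(y, x)) > tol:          # symmetry
            return False
        if dist(x, y) + dist(y, z) + tol < dist(x, z):  # triangle inequality
            return False
    return True

random.seed(0)
sample = [tuple(random.random() for _ in range(5)) for _ in range(8)]
```

Such a check is of course only a sanity test on a finite sample, not a proof that the function is a metric.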
Since we are dealing with arbitrary metric spaces, we assume the following model of computation: there is a large number of data points and a ``black box'' to compute distance between them.
The first important observation is that it is impossible to deal efficiently with all metric spaces. In particular, consider the metric space where the distance between two points is 0 if they are the same and 1 if they are different. Then our only option in answering a query is a linear search, and no fancy data structure will save us. In fact, the more any space resembles such a metric space, the more difficult it will be to search.
4 The mean of a set of points (vectors) in a vector space is simply their sum divided by their number. The concept of a mean is not meaningful for arbitrary metric spaces.
Since visualizing high-dimensional data is difficult, we look at some simple measures to help us understand the geometry of a given data space. A particularly useful measurement is the distribution of distances between points in the space. While the scales of these distributions vary greatly, we can compare them by considering at what range we would be interested in finding near neighbors. In each of the graphs of the distributions that follow, we have made 5000 random distance calculations in the data space and distributed them into a number of buckets. The y axis represents the number of distances which fell into that bucket divided by the size of the bucket.
Figure 1: Distribution of distances under the L1 metric in 20 and 50 dimensions.
Figure 2: Distribution of distances under the L2 metric in 20 and 50 dimensions.
The distributions of distances between random, uniformly chosen vectors in 20 and 50 dimensional hypercubes of side 1 under the L1 metric are very close to Gaussian distributions because of the Central Limit Theorem (Figure 1). For the L2 metric, we obtain a Gaussian-like (though not exactly Gaussian) limit distribution (Figure 2). Note that the distributions for 50 dimensions should be viewed in relation to their larger ranges and hence are really quite narrow. The fact that the peaks are narrow indicates that the distance function has low entropy and that distance calculations will carry little information to guide a search.
5 Recall that the L1 metric is the sum of the absolute values of the differences of corresponding vector dimensions and the L2, or Euclidean, metric is the square root of the sum of the squares.
Figure 3: Distribution of distances under the L2 metric in 16 by 16 and 50 by 50 images.
Correlated data has somewhat different properties and tends to have a much flatter distance distribution. For example, taking random 16 by 16 or 50 by 50 blocks from an image, then treating them as 256 or 2500 dimensional vectors respectively and taking the L2 distances between them creates a distribution with two major maxima (Figure 3). The first maximum, near 0, indicates a great deal of clustering in the data since small distances are so probable; the second major maximum is one that is common to all large metric spaces we will investigate, which indicates that average distances are fairly likely.
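The histogram experiments above are easy to reproduce in miniature. The sketch below (an illustration; the sample sizes and bucket count are stand-ins for those used in the figures) samples random pairs and normalizes each bucket count by the bucket width, matching the y axis described earlier:

```python
import random

def distance_histogram(points, dist, n_samples=5000, n_buckets=30):
    """Sample random pairs and bucket their distances, normalized by bucket width."""
    dists = [dist(*random.sample(points, 2)) for _ in range(n_samples)]
    lo, hi = min(dists), max(dists)
    width = (hi - lo) / n_buckets or 1.0
    counts = [0] * n_buckets
    for d in dists:
        counts[min(int((d - lo) / width), n_buckets - 1)] += 1
    # y axis: number of distances in the bucket divided by the bucket size.
    return [c / width for c in counts]

# Uniform random vectors in a 20-dimensional unit hypercube under the L1 metric.
random.seed(1)
cube = [tuple(random.random() for _ in range(20)) for _ in range(200)]
l1 = lambda x, y: sum(abs(a - b) for a, b in zip(x, y))
hist = distance_histogram(cube, l1)
```

For uncorrelated vectors the resulting histogram shows the single narrow peak of Figure 1; running it on image blocks instead would show the bimodal shape of Figure 3.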
Figure 4: Distribution of edit distances in lines of text.
As another example, consider taking lines of text from a large text document (``A Tale of Two Cities'' in this example) and using a simple edit distance function. We considered two different such functions. Both counted the minimum number of operations needed to get from one line to the other. The first distance function, InsDel, allowed only inserts and deletes of single characters as operations. The second, Edit, added the operation of replacing one character and is consistent with what is usually used to compute distance between strings. Both distance functions have the unfortunate property of requiring O(n^2) steps to compute, where n is the number of characters, hence minimizing such calculations is very important. When we map out their distributions (Figure 4), both turn out to be considerably less correlated than the images and look roughly like the uniformly chosen vectors (Figures 1 and 2).
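Both distance functions are standard dynamic programs over an O(n^2) table. The sketch below implements them with a single flag (the function name and flag are illustrative choices, not taken from the original system):

```python
def edit_dist(s, t, allow_replace=True):
    """Minimum number of operations turning s into t.
    allow_replace=False gives the InsDel distance (inserts and deletes only);
    allow_replace=True gives the usual Edit (Levenshtein) distance."""
    n, m = len(s), len(t)
    prev = list(range(m + 1))              # distances from s[:0] to each t[:j]
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                cur[j] = prev[j - 1]       # characters match: no cost
            else:
                cur[j] = 1 + min(prev[j], cur[j - 1])      # delete / insert
                if allow_replace:
                    cur[j] = min(cur[j], 1 + prev[j - 1])  # replace
        prev = cur
    return prev[m]
```

Keeping only two rows of the table reduces memory to O(n), but the O(n^2) time per distance calculation remains, which is why pruning calculations matters so much for text.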
Figure 5: A Simple GNAT
Our goal when designing GNAT's was to have a data structure that reflected the intrinsic geometry of the underlying data. More specifically, the top node of our hierarchical structure should give us a very brief summary of the data as a metric space, and as we progress down the hierarchy we get a more and more accurate sense of the geometry of the data. This is achieved as a hierarchical, Dirichlet-domain based, structure. Given a number of points, the Dirichlet domain of one of those points is all possible points in the space which are closest to that point. At the top node, several distinguished split points are chosen and the space is broken up into Dirichlet domains. The remaining points are classified into groups depending on what Dirichlet domain they fall in. Each group is then structured recursively. A simple example of such a structure is illustrated in Figure 5. The heavier points represent the split points of the top node and finer points are split points of the subnodes. The thick lines represent the boundaries of the regions of space that belong to the top level split points and the thin lines are those of the low level split points.
Another goal of GNAT's is to make full use of the distances we calculate (at least within one node). Therefore, instead of relying on the Dirichlet structure to perform our search, we also prune branches by storing the ranges of distance values from split points to the data points associated with other split points during the build. Then, if query points fall outside these ranges by more than the range of the search, we can prune the split point and everything under it from the search. For example, suppose we know that all points in the region associated with split point q have distance between 1 and 2 from split point p and we want to search for points within a radius of 0.5 from a query point x. Then if x is more than 2.5 from p, we can apply the triangle inequality (to the points x, y, and p, where y is any point in the region of q) and safely prune the node associated with q.
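The pruning test in this example reduces to a simple interval check. A small sketch, with range(p, q) passed in as a (min, max) pair (the function name is an illustrative choice):

```python
def can_prune(dist_x_p, r, range_p_q):
    """True if the region of split point q cannot hold a point within r of x,
    given the stored range of distances from p to that region."""
    lo, hi = range_p_q
    # Prune when [dist(x,p) - r, dist(x,p) + r] misses [lo, hi] entirely.
    return dist_x_p + r < lo or dist_x_p - r > hi
```

With the numbers from the text (range (1, 2), radius 0.5), a query point more than 2.5 from p is pruned; note that a query point closer than 0.5 to p is pruned for the symmetric reason.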
A GNAT is constructed as follows:
1. Choose k split points, p1, ..., pk, from the dataset we wish to index (this number, known as the degree, can vary throughout the tree; see Section 4.3). They are chosen randomly, but we make sure they are fairly far apart (see Section 4.2).
2. Split the remaining points into groups D1, ..., Dk, placing each point x in the group Di of the split point pi closest to it.
3. For each pair of split points (pi, pj), calculate range(pi, pj) = [min_ij, max_ij], the minimum and maximum of Dist(pi, x) over all x in Dj or x = pj.
4. Recursively build the tree for each of the Di's, possibly using a different degree (see Section 4.3).
A search in a GNAT is performed recursively as follows:
1. Assume we want to find all points with distance <= r to a given point x. Let P represent the set of split points of the current node (initially the top node of the GNAT) which may still lead to a near neighbor of x. Initially, P contains all the split points of the current node.
2. Pick a point p in P (never the same one twice). Compute the distance Dist(x, p). If it is less than or equal to r, add p to the result.
3. For all points q in P, if the interval [Dist(x, p) - r, Dist(x, p) + r] does not intersect range(p, q), then remove q from P.
We can do this for the following reason. Let y be any point in the region associated with q. If Dist(y, p) < Dist(x, p) - r, then, by the triangle inequality, we have Dist(x, y) + Dist(y, p) >= Dist(x, p) and hence Dist(x, y) > r. Alternatively, if Dist(y, p) > Dist(x, p) + r, we can use the triangle inequality, Dist(y, x) + Dist(x, p) >= Dist(y, p), to deduce Dist(x, y) > r. Dist(y, p) cannot fall inside [Dist(x, p) - r, Dist(x, p) + r], because Dist(y, p) lies in range(p, q), which does not intersect that interval.
4. Repeat steps 2 and 3 until all remaining points inP have been tried.
5. For all remaining pi in P, recursively search Di.
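Putting the build and search steps together, a GNAT can be sketched as follows. This is a minimal illustration, not the paper's implementation: split points are chosen uniformly at random, the degree is fixed, and the far-apart sampling of Section 4.2 and the adaptive degrees of Section 4.3 are omitted.

```python
import random

def build_gnat(points, dist, k=3):
    """Build a GNAT node: k split points, k Dirichlet-domain groups, k x k ranges."""
    if len(points) <= k:
        return {"splits": list(points), "ranges": None, "groups": None}
    splits = random.sample(points, k)
    rest = [p for p in points if p not in splits]
    groups = [[] for _ in range(k)]
    for x in rest:
        # Step 2: place x in the group of its closest split point.
        i = min(range(k), key=lambda s: dist(x, splits[s]))
        groups[i].append(x)
    ranges = [[None] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            # Step 3: min and max of Dist(p_i, x) over x in D_j or x = p_j.
            ds = [dist(splits[i], y) for y in groups[j]] + [dist(splits[i], splits[j])]
            ranges[i][j] = (min(ds), max(ds))
    # Step 4: recurse on each group.
    return {"splits": splits, "ranges": ranges,
            "groups": [build_gnat(g, dist, k) for g in groups]}

def search_gnat(node, dist, x, r, out):
    """Append to out every indexed point within distance r of x."""
    splits, ranges, groups = node["splits"], node["ranges"], node["groups"]
    alive = set(range(len(splits)))   # the set P of surviving split points
    tried = set()
    while alive - tried:
        p = min(alive - tried)        # pick a point of P, never the same one twice
        tried.add(p)
        d = dist(x, splits[p])
        if d <= r:
            out.append(splits[p])
        if ranges is None:            # leaf node: nothing to prune
            continue
        # Step 3: drop every q whose stored range from p misses [d - r, d + r].
        alive = {q for q in alive
                 if not (d + r < ranges[p][q][0] or d - r > ranges[p][q][1])}
    if groups is not None:
        for q in alive:               # step 5: recurse into the survivors
            search_gnat(groups[q], dist, x, r, out)
```

Because every pruning decision is justified by the triangle inequality, the search is exact: it returns the same points a linear scan would, usually after far fewer distance calculations.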
4.2 Selecting Split Points
One of the issues we had to deal with was the selection of split points (step 1 in the construction). We wanted split points to be more or less random, since this way they are likely to be near the centers of clusters. However, we did not want the split points to bunch up, since if they did, distance calculations from them would not have as much information content (distances measured from one member of a bunch would be similar to those of another) and they would not model the geometry of the data well. Furthermore, if several split points were in the same cluster, they would likely divide the cluster at too high a level.
The strategy we settled on is to sample about 3 times the number of split points we wanted, and then pick those that were the farthest apart (according to a greedy algorithm). The number 3 was arrived at empirically. More specifically, this is done as follows:
Pick one of the sample points at random. Then pick the sample point which is farthest away from this one. Then pick the sample point which is the farthest from these two (by farthest we mean that the minimum distance to the two is the greatest). Then pick the one farthest from these three. And so on, until there are as many split points as desired. A simple dynamic programming type of algorithm can do this in O(nm) time, where n is the number of sample points and m is the eventual number of split points.
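This greedy farthest-point selection can be sketched as follows (an illustration: here the first point is simply the first candidate rather than a randomly chosen one). Keeping, for each candidate, its minimum distance to the points chosen so far gives the O(nm) bound:

```python
def pick_split_points(candidates, dist, m):
    """Greedily choose m points that are far apart: each new point maximizes
    its minimum distance to those already chosen. O(n*m) distance scans."""
    chosen = [candidates[0]]
    # min_d[i]: distance from candidate i to the nearest chosen point so far.
    min_d = [dist(candidates[0], c) for c in candidates]
    while len(chosen) < m:
        i = max(range(len(candidates)), key=lambda j: min_d[j])
        chosen.append(candidates[i])
        for j, c in enumerate(candidates):
            min_d[j] = min(min_d[j], dist(candidates[i], c))
    return chosen
```

In the scheme above, candidates would be a random sample of about 3m points from the node's data.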
4.3 Choosing the Degree of a Node and Balancing the Tree
For our initial experiments we chose a degree and kept it constant over all the nodes in a tree. This worked reasonably well with uncorrelated data. However, for correlated data, it sometimes unbalanced the tree substantially. Attempts to balance the size of branches (by weighting distances to split points) tended to harm performance more than they helped. So we considered ``balancing'' the tree by assigning higher degrees to those nodes that contained too many data points. This is done as follows:
The top node is allocated a degree k. Then each of its children is allocated a degree proportional to the number of data points it contains (with a certain maximum and minimum) so that the average is the global degree k. This process works recursively so that the children of each node have average degree k (ignoring the maximum and minimum degrees). The minimum used in tests was 2 (or the number of points contained, if that was smaller) and the maximum was 5k or 200, whichever was smaller.
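A sketch of this degree allocation, with the clamping limits described above (the function name and the use of rounding are illustrative choices):

```python
def child_degrees(sizes, k):
    """Give each child a degree proportional to its size, averaging k overall.
    Clamps follow the text: minimum 2 (or the child's size, if smaller),
    maximum min(5k, 200)."""
    max_deg = min(5 * k, 200)
    total = sum(sizes) or 1
    degs = []
    for s in sizes:
        d = round(k * len(sizes) * s / total)  # proportional share of k per child
        d = max(d, min(2, s))                  # lower clamp
        d = min(d, max_deg)                    # upper clamp
        degs.append(d)
    return degs
```

Before clamping, the degrees sum to about k times the number of children, so their average stays near the global degree k.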
Unfortunately, GNAT's are sufficiently complex that they are unwieldy to analyze formally. Therefore the results we present here are weak and limited. We use the following notation:
N -- the number of data points.
n -- the number of nodes. Empty nodes are not counted.
s -- the maximum size (space requirement) of a data point.
d -- the maximum degree of a node.
k -- the average degree (equal to N/n).
k2 -- the second moment (the average of the squares) of the degree.
l -- the average depth of a point.
We produce the following simple results:
Space (memory) A GNAT takes O(Nk2 + Ns) space. In practice, k2 does not turn out to be much higher than k^2, the square of the average degree.
Preprocessing Time This is the main disadvantage of GNAT's as compared to other data structures. In a perfect scenario, when the tree is completely balanced, we get a preprocessing time of O(Nk log_k N) distance calculations, which is a factor of k / log k more than binary trees requiring only one distance calculation per node. In the real world, this can be substantially more due to the fluctuating degrees and can be bounded only by O(Nld) distance calculations. In tests, l tended to be very close to, though slightly more than, log_k N, and certainly not all nodes were of degree d, so the real preprocessing time lies somewhere between those two extremes.
Query Time This is the most important and most difficult performance attribute to evaluate. While there have been upper bounds in previous works, they have either been restricted to particular domains or have made assumptions which make them of no practical use. As a result of this and the added complexity of GNAT's we rely on the experimental results (see Section 6 ).
The system used for testing GNAT's and other data structures underwent several major revisions, being implemented in Mathematica, then C, and finally C++. In each of these versions, the benefit of having the data structure rely only on the distance function was a tremendous advantage.
In the final version, the code to handle vectors and text (including generation and/or loading and distance functions) is under a hundred lines each. The code for images is just over a hundred lines, mainly to deal with the file format.
5.1 Data Types Supported
The data types (metric spaces) with which the system works are as follows:
Vectors -- The simplest of the data types, these are n-dimensional vectors from a hypercube of side 1 and can be chosen in two different ways -- uniformly from the hypercube in R^n, or uniformly from R^2 and then mapped into R^n using a simple continuous function. Both L1 and L2 metrics are supported.
Lines of Text -- While the vectors and images both fit into vector spaces, lines of text do not. These were lines taken from a text document. Two different documents were attempted: all five acts of ``Hamlet'', which unfortunately totaled only about 4000 lines, and Dickens' ``A Tale of Two Cities''. For Dickens, this was a less meaningful test since taking lines out of a novel is not particularly reasonable (sentences would have been better, but they were too long to deal with). Both texts were preprocessed by normalizing whitespace and getting rid of very short lines. Additionally, in Hamlet, speaker names were stripped. Results for these two were not as similar as one might expect (see Section 6). Both the InsDel and the Edit distance functions (see Section 3) were implemented and tested.
5.2 Data Structures Supported
The final implementation supports a number of data structures including GNAT's.
VP-Trees -- See Section 2 for a brief description. In the tests presented here, we did not use any sampling technique to choose vantage points since we could not be sure that we would do it identically to [Yia93]. However, some limited tests with sampling indicated that savings were in the 10% range for images and were negligible for text and random vectors.
VPk-Trees -- A generalization of VP-Trees which differs in that at each node, instead of the remaining data points being split into two halves based on their distance from the vantage point, they are split into k sections of equal size (also based on distance from the vantage point). These were found to perform very similarly to vp-trees (sometimes even a little better) but there was not a sufficiently large difference to warrant further investigation.
GH-Trees -- See Section 2 for a brief description. They are essentially GNAT's of constant degree 2 without the sampling for split points and degree variation throughout the tree. Since they perform worse than GNAT's of degree 2, not many experiments were performed.
OPT-Trees -- These use a much smaller number of distance computations for queries than any other structure but they lose out by having far more costly other computations (even superlinear in the number of data points). The idea here is to pick a number of vantage points. Measure the distances from each to all the other points and store these in a table. When a query comes along, measure its distance to the first vantage point and based on that weed out all of the impossible data points. Then take the next vantage point and do the same. Continue until no more data points are pruned. Then, check each of the remaining ones individually. Up to the choice of vantage points this gives more or less the optimal performance, in terms of distance calculations. This structure can serve as a lower bound for distance calculations but is not a realistic goal to shoot for if one wants a scalable structure.
GNAT's -- This is the main data structure of this paper, described in Section 4 .
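Of the structures above, the opt-tree admits the shortest sketch: precompute a table of distances from each vantage point to every data point, then filter candidates one vantage point at a time until no more are pruned. An illustration (the names are ours, not from the original system):

```python
def build_opt_table(points, vantage, dist):
    """Precompute dist(v, p) for every vantage point v and data point p."""
    return [[dist(v, p) for p in points] for v in vantage]

def opt_query(points, vantage, table, dist, x, r):
    """Filter candidates vantage point by vantage point, then scan the survivors."""
    candidates = set(range(len(points)))
    for vi, v in enumerate(vantage):
        before = len(candidates)
        d = dist(x, v)
        # p can only be within r of x if dist(v, p) lies in [d - r, d + r].
        candidates = {p for p in candidates if d - r <= table[vi][p] <= d + r}
        if len(candidates) == before:   # stop once no more points are pruned
            break
    return [points[p] for p in candidates if dist(x, points[p]) <= r]
```

The distance calculations here are few, but each filtering pass touches every surviving candidate, which is the superlinear bookkeeping cost noted above.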
For each of these, we check how many distance calculations are used to both build the structure and perform queries.
7 Both vpk-trees and opt-trees were developed during this project (though there has been previous work similar to opt-trees).
Given the flexibility of the testbed system, the number of possible interesting tests to run exceeded by far the number which could be run (in a reasonable amount of time) which in turn exceeded by far the number of test results presented in this section. Note that the benefits of GNAT's varied and in this section we try to present the range of results that were obtained.
Also, while opt-tree plots are in some of the graphs, it is important to keep in mind that these structures are only efficient in terms of distance calculations but are very inefficient otherwise. They are provided just to serve as a lower bound.
6.1 Random Vectors
Figure 6: Varying Query Range for 3000 Vectors
Figure 7: Varying Query Range for 20000 Vectors
A number of tests were performed on random vectors. The dimension was set at 50, uniformly chosen vectors were produced, and the L2 metric was used. The range was varied and a number of different data structures were tested. The number of data points was 3000 (Figure 6) in one test and 20000 (Figure 7) in another. The number of test queries used in every case was 100.
The first thing to note is how difficult it actually is to perform these searches. All of the data structures seemed to struggle with ranges above 0.3 (reasonable queries could easily have ranges considerably above 0.5), looking at more than 50% of the data points in many cases. This is caused by the low information content of the distance calculations since they tend to return very similar numbers (Figure 2 ).
Figure 8: Varying Query Range for 3000 Lines of Hamlet Using the InsDel Distance
[Plot: distance calculations versus query range (0 to 10) for GNAT100, GNAT50, GNAT20, GNAT10, and vp2-tree.]
Figure 9: Varying Query Range for 10000 Lines of Dickens Using the Edit Distance
We ran a number of experiments on text with varied success. When testing 3000 lines of Hamlet using the InsDel distance, speedups over vp-trees were around a factor of 2 (Figure 8). Testing on 10000 lines of ``A Tale of Two Cities'' with the Edit distance yielded more dramatic yet varied speedups, ranging from around 50% with a query range of 10 to more than a factor of 6 with a range of 2 (Figure 9).
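Both string metrics used in these experiments are standard dynamic programs; the sketches below are textbook implementations, not the testbed's code. The Edit distance allows insertions, deletions, and substitutions, while the InsDel distance allows only insertions and deletions and reduces to the longest common subsequence.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

def insdel_distance(a, b):
    """InsDel distance: insertions and deletions only, equal to
    len(a) + len(b) - 2 * LCS(a, b)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb
                       else max(prev[j], cur[j - 1]))
        prev = cur
    return len(a) + len(b) - 2 * prev[-1]
```

Both satisfy the triangle inequality, which is all a GNAT needs; the InsDel distance is never smaller than the edit distance on the same pair, since it must simulate each substitution with a deletion plus an insertion.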
In working with large metric spaces, we have proved our intuition from low-dimensional spaces wrong in many ways. In many cases explored in this paper the data lies in a very large metric space whose only easily recognizable and readily usable structure is the distance between its points. In other words, the space is so large that it is meaningless to consider and use its geometry and one should concentrate on the intrinsic geometry of the actual set of data points.
Consequently, it is important to exploit the constraints of the distribution of the data rather than rely on those of the whole space, and GNAT's therefore try to model the data they are indexing. There are several important issues involved in doing this.
First, does one break up the space by occupation or by population? In other words, if the data is composed of a large cluster and a few outliers, should the top node assign one branch to the cluster and a few branches to the outliers, or should it split the cluster immediately and not worry about the outliers until later? In GNAT's we decided to compromise when choosing split points: first sampling (the by-population approach) and then picking out the sample points that were far apart (the by-occupation approach). The current method of selecting points that are far apart can become asymmetric, and some pathological behavior was observed (though it didn't impact query performance much). This remains a problem for future work.
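One plausible reading of this compromise can be sketched as follows. The oversampling factor of 3 and the greedy farthest-first rule below are assumptions made for illustration, not necessarily the exact procedure used in the experiments.

```python
import random

def pick_split_points(points, k, dist, seed=0):
    """Sketch of the sample-then-spread compromise: draw a random
    sample (the by-population step), then greedily keep the sample
    point farthest from those already chosen (the by-occupation step)."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(3 * k, len(points)))
    chosen = [sample[0]]
    while len(chosen) < k:
        # Farthest-first: maximize the minimum distance to the
        # split points already chosen.
        best = max(sample, key=lambda p: min(dist(p, c) for c in chosen))
        chosen.append(best)
    return chosen
```

The sampling step keeps a dense cluster from being ignored (it dominates the sample), while the farthest-first step keeps the chosen split points from all landing inside that cluster. The asymmetry mentioned above shows up here: the first chosen point is arbitrary, so the greedy order can produce lopsided configurations.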
A second issue is how to handle balancing. In our experiments, we found that good balance was not crucial to the performance of the structure. We attempted to improve the structure by using ``weighted'' Dirichlet domains, but these tended to decrease performance rather than improve it. (They did reduce build time, though.) Intuitively, when the tree structure is altered so that it is balanced rather than so that it reflects the geometry of the space, searches tend to descend down all branches. As a result, we decided simply to keep the tree depth from varying too much by adjusting the degrees of the nodes.
For future work, we are considering new methods of building the tree. Bottom-up constructions could lead to very good query performance, but their O(n^2) construction cost will not scale well. Consequently, we are considering schemes where a top-down construction is used but is then iteratively improved until it converges to a bottom-up style construction.
Another important research direction is to begin to use approximate distance metrics. For example, in order to compute near neighbors in text using the edit distance (an expensive computation), we can first use the q-gram distance [Ukk92] (a relatively fast computation) to narrow the search quickly and then apply the true edit distance to complete the search. The key is that the q-gram distance is a lower bound for the edit distance. Similarly, we could linearly project a very high-dimensional space (such as 50 by 50 pixel images) down to a somewhat lower-dimensional space (e.g., by averaging together 2 by 2 pixel blocks) and use that as an approximation (L2 distance in the projection is a lower bound for L2 distance in the original space). All of these techniques, of course, rely on special knowledge of the metric space to construct the approximations. However, given the approximations, a general method could be applied.
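The filter-and-verify scheme can be sketched as follows. The names `range_search` and its parameters are illustrative, and the precise form of the lower bound used here is Ukkonen's inequality, edit(x, y) >= d_q(x, y) / (2q): a single edit operation changes at most q q-grams in each profile, so a q-gram distance above 2*q*r rules a candidate out of any range-r query.

```python
from collections import Counter

def edit_distance(a, b):
    """Standard dynamic-programming edit distance (the expensive step)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def qgram_distance(a, b, q=2):
    """q-gram distance [Ukk92]: L1 difference of q-gram count profiles."""
    pa = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    pb = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    return sum(abs(pa[g] - pb[g]) for g in pa.keys() | pb.keys())

def range_search(query, candidates, r, q=2):
    """Filter with the cheap bound, verify with the expensive metric."""
    hits = []
    for c in candidates:
        if qgram_distance(query, c, q) > 2 * q * r:
            continue  # the lower bound already exceeds the range
        if edit_distance(query, c) <= r:
            hits.append(c)
    return hits
```

In a GNAT, the same idea would apply the cheap metric while descending the tree and reserve the expensive metric for the surviving leaves; the general method only requires that the approximation never overestimate.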
I thank Prof. Michael Brin (my father), Prof. Hector Garcia-Molina, Luis Gravano, and Edouard Bugnion for helpful discussions and for listening to my endless ramblings.
[BFR+93] E. Bugnion, S. Fei, T. Roos, P. Widmayer, and F. Widmer. A spatial index for approximate multiple string matching. In Proc. First South American Workshop on String Processing, Belo Horizonte, Brazil, September 1993.
[BK73] W. A. Burkhard and R. M. Keller. Some approaches to best-match file searching. Communications of the ACM, 16(4), April 1973.
[FN75] K. Fukunaga and P. M. Narendra. A branch and bound algorithm for computing k-nearest neighbors. IEEE Trans. Comput., C-24:750--753, 1975.
[HKR93] D. P. Huttenlocher, G. A. Klanderman, and W. J. Rucklidge. Comparing images using the Hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15, 1993.
[SW90] Dennis Shasha and Tsong-Li Wang. New techniques for best-match retrieval. ACM Transactions on Information Systems, 8(2):140, 1990.
[Uhl91] J. K. Uhlmann. Satisfying general proximity/similarity queries with metric trees. Information Processing Letters, 40, 1991.
[Ukk92] E. Ukkonen. Approximate string matching with q-grams and maximal matches. Theoretical Computer Science, 92, 1992.
[Yia93] P. N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In ACM-SIAM Symposium on Discrete Algorithms (A Conference on Theoretical and Experimental Analysis of Discrete Algorithms), 1993.
Figure 10: Performance for 3000 16 by 16 images
[Plot: distance calculations versus query range (0 to 500) for GNAT100, GNAT50, GNAT20, GNAT10, and vp2-tree.]
Figure 11: Performance for 3000 50 by 50 images