[

Stanford University

November 20, 1995

**Abstract**

**
**

**
**

**
**Given user data, one often wants to find approximate matches in a large database. A good example of such a
task is finding images similar to a given image in a large collection of images. We focus on the important and
technically difficult case where each data element is high dimensional, or more generally, is represented by a
point in a large metric space1 and distance calculations are computationally expensive.

In this paper we introduce a data structure to solve this problem called a GNAT -- Geometric Near-neighbor Access Tree. It is based on the philosophy that the data structure should act as a hierarchical geometrical model of the data as opposed to a simple decomposition of the data which doesn't use its intrinsic geometry. In experiments, we find that GNAT's outperform previous data structures in a number of applications.

*Keywords *-- near neighbor, metric space, approximate queries, data mining, Dirichlet domains, Voronoi regions.

**
**

**
**

**
**The problem of finding the near neighbors of a given point in a large data set has been studied well and has a
number of good solutions, *if *the data is in a simple (e.g. Euclidean), low-dimensional space. However, if the
data lies in a large metric space1 the problem becomes much more difficult. Consider the following examples as
a small sample of where this problem occurs:

**Information Retrieval **-- Finding sentences similar to a user's query from a given database of sentences.

**Genetics **-- Finding similar DNA or protein sequences in one of a number of large genetics databases.

**Speaker Recognition **-- Finding similar vocal patterns (e.g., under Fourier transforms) from a database of
vocal patterns.

**Image Recognition **-- Finding images similar (using the Hausdorff metric [HKR93]) to a given one from a large
image library.

**Video Compression **-- Finding the image blocks of a previous frame that are similar to blocks in a new frame
(using a simple *L*1 or *L*2 metric, possibly after a DCT transform) to generate motion vectors in MPEG
video compression.

**Data Mining **-- Finding approximate time series matches (e.g., stock histories or year long temperature).

__*__Supported by a Fellowship from the NSF. 1 By a large metric space we mean a space such that the volume of a ball grows very rapidly as its radius increases. High
dimensional vector spaces are an example (a ball of radius 2 in a 20 dimensional Euclidean space is over a million times large than
a ball of radius 1). See Section 3 .

All of the examples above fit into the model of finding near neighbors in a large metric space. In particular,
the first two examples above find near neighbors in the metric space of strings under some *edit *distance function.
The speaker, video, and possibly data mining examples find near-neighbors in a high dimensional vector space
under the *L*1 or *L*2 metric (more sophisticated metrics could be imagined). The image recognition example does
not fit as nicely into either of these classes but still qualifies as near-neighbor search in a large metric space.

Every data type above has some degree of correlation in its distribution. While it may be small (i.e., the data resembles random vectors), it must be exploited to get good performance in a near neighbor search. To do this in an application independent manner requires that the data structure capture the intrinsic geometry of the data. As we will see (Section 4 ), our data structure, the GNAT, captures the geometry of data collections such as the ones mentioned above by hierarchically breaking it down into regions which try to preserve fundamental geometric structure.

**
**

**
**

**
**A very large amount of work has been done to solve specific instances of near-neighbor finding problems.
Numerous articles have been written regarding finding similar vectors (e.g., time-series and geographic data),
text (files and documents), images, sounds (word recognition), etc. A more limited but still substantial amount
of work has addressed the general problem2. This work has mostly fallen into two categories. In one category,
we assume that distance calculations are so expensive that even an *O*(*n*) or *O*(*n *log *n*) search algorithm is
acceptable as long as it reduces the number of distance calculations. This is the case as long as the database size
is fairly small compared to the range of the search [FS82] or if preprocessing is not allowed and only arbitrary
precomputed distances are given [SW90].

The other category of solutions are hierarchical and typically have an *O*(log *n*) query time *given a sufficiently
small range *(typically too small to be practical). They are of the following form: The space is broken up
hierarchically. At the top node, one or several data points are chosen. Then the distance between each of these
to each of the remaining points is computed. Based on these distances, the points are separated into two or
several different branches. For each branch, the structure is constructed recursively.

J. K. Uhlmann outlined the foundation for two different methods, generally described as metric trees [Uhl91].
One of these methods, subsequently called vp-trees3, was implemented by P. N. Yianilos [Yia93]. The basic
construction of a vp-tree is to break the space up using spherical cuts. To build it, pick a point in the data
set (this is called the *vantage point*, hence the name *vp*-tree). Now, consider the *median *sphere centered at the
vantage point with a radius such that half the remaining points fall inside it and half fall outside. For every
other point, put it in one branch if it is inside the sphere and in another branch if it is outside the sphere. Now,
recursively construct the lower level branches.

This approach has the benefits of requiring only one distance calculation per node and automatically creating balanced trees. However, it suffers from regions inside the median sphere and outside the median sphere being very asymmetric, especially in a high-dimensional space. Since volume grows rapidly as the radius of a sphere increases, the outside of the sphere will tend to be very thin, given that there are as many points on the inside as on the outside, thus worsening search performance. In our work, we try to avoid such asymmetries. While the limited branching factor of 2 can also be viewed as a weakness, we have conducted experiments with higher degree variations of vp-trees and find little improvement in performance.

The other method, a generalized hyperplane trees (gh-tree), is constructed as follows. At the top node, pick two points. Then, divide the remaining points based on which of these two they are closer to. Now, recursively build both branches. This method is an improvement in that it is symmetric and the tree structure still tends to be well balanced (assuming sufficiently random selection of the two points). However, it has a weakness in that it requires two computations at every node and is limited to a branching factor of two.

A variation of gh-trees was implemented at ETH Zurich [BFR+93] as monotonous bisector trees (MBT's)
to deal specifically with text. However, nothing in the method would have prevented them from dealing with
arbitrary metric spaces. The key difference between MBT's and gh-trees is that MBT's only select *one *new point 2Some of the papers we mention below address the problem of finding *nearest *neighbors. However, their methods can be applied
to finding all near neighbors with minimal change. 3We do not look at the enhancement of vp-trees called vp*sb*-trees.

at each new node. They do this by reusing the point they are associated with in the parent node. As a result, MBT's overcome the first weakness but the branching factor remains a problem.

The most relevant works, however, are also the oldest. Burkhard and Keller suggested several data structures
(and algorithms) [BK73] for approximate search. The first is very similar to vp-trees except that it requires a
finite number of discrete distance values. Essentially, for every vantage point, a separate branch is allocated for
every possible distance value. This method, however, suffers from the same asymmetry problem as the vp-trees.
The other two data structures, which are the closest to the GNAT, break up the space into a number of balls,
storing the radii and centers. More specifically, divide the data points into groups using some method (this was
left as a parameter). Pick a representative of each group and call it the center of the group. Then, calculate
the radius (the maximal distance to another point) from the center for each group and pruning is performed
based on these radii. Recursion is briefly mentioned but not analysed. The third method, an enhancement of
the second, additionally requires that the diameter (the maximal distance between any two points) of the points
in any group be less than a constant, *k*, and the group is then called a clique. In this case a minimal subset of
the set of all maximal cliques is used as the set of all groups. These two schemes act as reasonably good models
of the data space they store and if extended to a hierarchical structure, they have an arbitrary branching factor.
However, they have several weaknesses. First, they do not work well with nonhomogeneous data, since we could
easily end up with a lot of cliques containing only one point and several cliques containing very many points.
Additionally, distance computations are not fully exploited in that distance to the center of one clique is not
used to prune other cliques. Finally, while we do not focus on the cost of preprocessing in this paper, this cost
was reported to be extremely high in the third method.

K. Fukunaga and P. Narendra worked out a very similar scheme, which requires more than just a metric space,
to create a tree structure with an arbitrary branching factor in 1975 [FN75] as follows. Divide the data points into
*k *groups. (How this is done is left as a parameter of the structure but in tests they used a clustering algorithm
which requires more than just a metric space.) Then compute the mean4 of each group (once again a departure
from a metric space) and the farthest distance from that mean to a point in the group. Then recursively create
the structure for each group. While this method tends to have nice symmetric properties (given a reasonable
clustering algorithm) that reflect the space and it has an arbitrary branching factor, it has several weaknesses.
First, it relies on more than just a metric space; second, it requires many distance computations at each node
and does not use them fully; and third, it does not deal effectively with balancing.

In this paper we present GNAT's which can be viewed as both a generalization of Fukunaga's method and a generalization of gh-trees. GNAT's provide good query performance by exploiting the geometry of the data. Unfortunately, while query time is reduced, the build time of the data structure is sharply increased. However, if the application is query dominant (or even if there are roughly as many queries as data points) the relative cost of building a GNAT becomes negligible. In tests, we find that GNAT's almost always perform better than both vp-trees and gh-trees, and scale better.

**
**

**
**

**
**A *metric space *is a set *X *with a distance function *d*: *X*2 __-> R __such that: __ forall __*x, y, z *__in __*X*,

1. *d*(*x, y*) __>= __0 and *d*(*x, y*) = 0 iff *x *= *y*. (Positivity)

2. *d*(*x, y*) = *d*(*y, x*). (Symmetry)

3. *d*(*x, y*) + *d*(*y, z*) __>= __*d*(*x, z*). (Triangle Inequality)

Since we are dealing with arbitrary metric spaces, we assume the following model of computation: there is a large number of data points and a ``black box'' to compute distance between them.

The first important observation is that it is impossible to deal efficiently with all metric spaces. In particular, consider the metric space where the distance between two points is 0 if they are the same and 1 if they are different. Then our only option in finding a query point is a linear search and no fancy data structure will save us. In fact, the more any space resembles such a metric space, the more difficult it will be to search.

4The mean of a set of points (vectors) in a vector space is simply their sum divided by their number. The concept of a mean is not meaningful for arbitrary metric spaces.

Furthermore, the distribution of data in the metric space is more important than the metric space itself. If the data lies on a two-dimensional surface that is embedded in a 50-dimensional space, query times will behave more like that of a two-dimensional space than that of a 50-dimensional space given an intelligent data structure. In a sense, for a high-dimensional space, the data determines the ``geometry'' of the space more than the constraints of the space itself.

Since visualizing high-dimensional data is difficult we look at some simple measures to help us understand
the geometry of a given data space. A particularly useful measurement is the distribution of distances between
points in the space. While the scales of these distributions vary greatly, we can compare them by considering
at what range we would be interested in finding near neighbors. In each of the graphs of the distributions that
follow, we have made 5000 random distance calculations in the data space and distributed them into a number
of buckets. The *y *axis represents the number of distances which fell into that bucket divided by the size of the
bucket.

Dim 50 Dim 20 Distance 25 20 15 10 5 0 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0

Figure 1: Distribution of distances under *L*1 metric in 20 and 50 dimensions.

Dim 50 Dim 20 Distance 4 3.5 3 2.5 2 1.5 1 0.5 0 1.8 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0

Figure 2: Distribution of distances under *L*2 metric in 20 and 50 dimensions.

The distributions of distances between random, uniformly chosen vectors in 20 and 50 dimensional hypercubes
of side 1 under the *L*15distributions because of the Central Limit Theorem (Figure 1 ). For the *L*2 metric,
we obtain a Gaussian-like (though not exactly Gaussian) distribution limit distribution (Figure 2 ). Note that
the distributions for 50 dimensions should be viewed in relation to their larger ranges and hence are really quite
narrow. The fact that the peaks are narrow indicate that the distance function has low entropy and that it 5Recall that the *L*1 metric is the sum of the absolute values of the differences of corresponding vector dimensions and the *L*2, or
Euclidean, metric is the square root of the sum of the squares.

may be difficult to index the data since arbitrary distance measurements will provide us with little information. However, by wisely choosing the distance computations, we can greatly improve the efficiency of the system.

50 by 50 16 by 16 Distance 8000 6000 4000 2000 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

Figure 3: Distribution of distances under *L*2 metric in 16 by 16 and 50 by 50 images.

Correlated data has somewhat different properties and tends to have a much flatter distance distribution. For
example, taking random 16 by 16 or 50 by 50 blocks from an image, then treating them as 256 or 2500 dimensional
vectors respectively and taking the *L*2 distances between them creates a distribution with two major maxima
(Figure 3 ). The first maximum, near 0, indicates a great deal of clustering in the data since small distances
are so probable; the second major maximum is one that is common to all large metric spaces we will investigate,
which indicates that average distances are fairly likely.

Edit InsDel Distance 100 90 80 70 60 50 40 30 20 10 0 0.09 0.08 0.07 0.06 0.05 0.04 0.03 0.02 0.01 0

Figure 4: Distribution of edit distances in lines of text.

As another example, consider taking lines of text from a large text document (``A Tale of Two Cities'' in this
example) and using a simple edit distance function. We considered two different such functions. Both counted
the minimum number of operations needed to get from one line to the other. The first distance function, *InsDel
*allowed only inserts and deletes of single characters as operations. The second, *Edit*, added the operation of
replacing one character and is consistent with what is usually used to compute distance between strings. Both
distance functions have the unfortunate property of requiring *O*(*n*2) steps to compute where *n *is the number of
characters, hence minimizing such calculations is very important. When we map out their distributions (Figure
4 ), both turn out to be considerably less correlated than the images and look roughly like the uniformly chosen
vectors (Figures 1 and 2 ).

**
**Figure 5: A Simple GNAT

Our goal when designing GNAT's was to have a data structure that reflected the intrinsic geometry of the
underlying data. More specifically, the top node of our hierarchical structure should give us a very brief summary
of the data as a metric space, and as we progress down the hierarchy we get a more and more accurate sense of
the geometry of the data. This is achieved as a hierarchical, Dirichlet-domain6 based, structure. Given a number
of points, the Dirichlet domain of one of those points is all possible points in the space which are closest to that
point. At the top node, several distinguished *split *points are chosen and the space is broken up into Dirichlet
domains. The remaining points are classified into groups depending on what Dirichlet domain they fall in. Each
group is then structured recursively. A simple example of such a structure is illustrated in Figure 5 . The
heavier points represent the split points of the top node and finer points are split points of the subnodes. The
thick lines represent the boundaries of the regions of space that belong to the top level split points and the thin
lines are those of the low level split points.

Another goal of GNAT's is to make full use of the distances we calculate (at least within one node). Therefore,
instead of relying on the Dirichlet structure to perform our search, we also prune branches by storing the ranges
of distance values from split points to the data points associated with other split points during the build. Then,
if query points fall outside these ranges by more than the range of the search, we can prune the split point and
everything under it from the search. For example, suppose we know that all points in the region associated with
split point *q *have distance between 1 and 2 from split point *p *and we want to search for points within a radius of
0.5 from a query point *x*. Then if *x *is more than 2.5 from *p*, we can apply the triangle inequality (to the points
*x*,*y*, and *p *where *y *is any point in the region of *q*) and safely prune the node associated with *q*.

**
**

**
**

**
**The basic GNAT construction is as follows:

1. Choose *k **split *points, *p*1*, ..., pk *from the dataset we wish to index (this number, known as the *degree *can
vary throughout the tree, see Section 4.3 ). They are chosen randomly but we make sure they are fairly
far apart (see Section 4.2 ).

6Computer scientists may know these better as the cells of a Voronoi diagram but these are more generally known as Dirichlet domains.

2. Associate each of the remaining points in the dataset with the closest split point. Let the set of points associated with a split point

3. For each pair of split points, (*pi, pj*), calculate the *range*(*pi, pj*) = [*min**ij, **max**ij*], a minimum and a maximum
of *Dist*(*pi, x*) where *x *__in __*Dj *or *x *= *pj*.

4. Recursively build the tree for each of *Di*'s, possibly using a different degree (see Section 4.2 ).

A search in a GNAT is performed recursively as follows:

1. Assume we want to find all points with distance __<= __*r *to a given point *x*. Let *P *represent the set of split
points of the current node (initially the top node of the GNAT) which possibly contain a near neighbor of
*x*. Initially *P *contains all the split points of the current node.

2. Pick a point *p *in *P *(never the same one twice). Compute the distance *Dist*(*x, p*). If it is less than or equal
to *r *add *p *to the result.

3. For all points *q *__in __*P*, if [*Dist*(*x, p*) __- __*r, Dist*(*x, p*) + *r*] __\ __*range*(*p, q*) is empty, then remove *q *from *P*.

We can do this for the following reason. Let *y *be any point in the region associated with *q*. If
*Dist*(*y, p*) *< Dist*(*x, p*) __- __*r*, then, by the triangle inequality, we have *Dist*(*x, y*) + *Dist*(*y, p*) __>= __*Dist*(*x, p*)
and hence *Dist*(*x, y*) *> r*. Alternatively, if *Dist*(*y, p*) *> Dist*(*x, p*) + *r*, we can use the triangle
inequality, *Dist*(*y, x*) + *Dist*(*x, p*) __>= __*Dist*(*y, p*) to deduce *Dist*(*x, y*) *> r*. *y *cannot fall into the range
[*Dist*(*x, p*) __- __*r, Dist*(*x, p*) + *r*] because then *range*(*p, q*) would intersect it.

4. Repeat steps 2 and 3 until all remaining points in *P *have been tried.

5. For all remaining *pi *__in __*P*, recursively search *Di*.

**
**

**
**

**
**One of the issues we had to deal with was the selection of split points (step 1 in the construction). We wanted
split points to be more or less random, since this way they are likely to be near the centers of clusters. However,
we did not want the split points to bunch up, since if they did, distance calculations from them would not have
as much information content (distances measured from one member of a bunch would be similar to those of
another) and they would not model the geometry of the data well. Furthermore, if several split points were in
the same cluster, they would likely divide the cluster at too high a level.

The strategy we settled on is to sample about 3 times the number of split points we wanted, and then pick those that were the farthest apart (according to a greedy algorithm). The number 3 was arrived at empirically. More specifically this is done as follows:

Pick one of the sample points at random. Then pick the sample point which is farthest away from this one.
Then pick the sample point which is the farthest from these two (by farthest we mean that the minimum distance
from the two is the greatest). Then pick the one farthest from these three. And so on, until there as many split
points as desired. A simple dynamic programming type of algorithm can do this in *O*(*nm*) time where *n *is the
number of sample points and *m *is the eventual number of split points.

**4.3 Choosing the Degree of a Node and Balancing the Tree**

**
**

**
**

**
**For out initial experiments we chose a degree and kept it constant over all the nodes in a tree. This worked
reasonably well with uncorrelated data. However, for correlated data, it sometimes unbalanced the tree
substantially. Attempts to balance the size of branches (by weighting distances to split points) tended to harm
performance more than they helped. So we considered ``balancing'' the tree by assigning higher degrees to those
nodes that contained too many data points. This is done as follows:

The top node is allocated a degree *k*. Then each of its children is allocated a degree proportional to the
number of data points it contains (with a certain maximum and minimum) so that the average is the global
degree of *k*. This process works recursively so that the children of each node have average degree *k *(ignoring
the maximum and minimum degrees). The minimum used in tests was 2 (or the number of points contained, if
that was smaller) and the maximum was 5*k *or 200, whichever was smaller.

**
**Unfortunately, GNAT's are sufficiently complex that they are unwieldy to formal analysis. Therefore the results
we present here are weak and limited. We use the following notation:

__ __*N *-- the number of data points.

__ __*n *-- the number of nodes. Empty nodes are not counted.

__ __*s *-- the maximum size (space requirement) of a data point.

__ __*d *-- the maximum degree of a node.

__ __*k *-- the average degree (equal to *N=n*).

__ __*k*2 -- the second moment (the average of the squares) of the degree.

__ __*l *-- the average depth of a point.

__ __*s *-- the amount of memory needed to store a data point.

We produce the following simple results:

**Space (memory) **An GNAT takes *O*(*Nk*2 + *Ns*) space. In practice, *k*2 does not turn out to be much higher
than *k*2.

**Preprocessing Time **This is the main disadvantage of GNAT's as compared to other data structures. In a
perfect scenario, when the tree is completely balanced, we get a preprocessing time of *O*(*Nk *log*k N*) distance
calculations which is a factor of *k= *log *k *more than binary trees requiring only one distance calculation per
node. In the real world, this can be substantially more due to the fluctuating degrees and can be bound only
by *O*(*Nld*) distance calculations. In tests, *l *tended to be very close to, though slightly more than *logkN
*and certainly not all nodes were of degree *d *so the real preprocessing time lies somewhere between those
two extremes.

**Query Time **This is the most important and most difficult performance attribute to evaluate. While there
have been upper bounds in previous works, they have either been restricted to particular domains or have
made assumptions which make them of no practical use. As a result of this and the added complexity of
GNAT's we rely on the experimental results (see Section 6 ).

**
**

**
**

**
**The system used for testing GNAT's and other data structures underwent several major revisions, being
implemented in Mathematica, then C, and finally C++. In each of these versions, the benefit of having the
data structure rely only on the distance function was a tremendous advantage.

In the final version, the code to handle vectors and text (including generation and/or loading and distance functions) is under a hundred lines each. The code for images is just over a hundred lines, mainly to deal with the file format.

**
**

**
**

**
**The data types (metric spaces) with which the system works are as follows:

**Vectors **-- The simplest of the data types, these N-dimensional vectors from a hypercube of side 1 and can be
chosen in two different ways -- chosen uniformly from __R__*n*, and chosen uniformly from __R__2 and then mapped
into __R__*n *using a simple continuous function. Both *L*1 and *L*2 metrics are supported.

**Lines of Text **-- While the vectors and images both fit into vector spaces, lines of text do not. These were
lines taken from a text document. Two different documents were attempted - all five acts of ``Hamlet'',
which unfortunately totaled only about 4000 lines, and Dickens' ``A Tale of Two Cities''. For Dickens, this
was a less meaningful test since taking lines out of a novel is not particularly reasonable (sentences would
have been better but they were too long to deal with). Both texts were processed before being read by
normalizing whitespace and getting rid of very short lines. Additionally, in Hamlet, speaker names were
stripped. Results for these two were not as similar as one might expect (see Section 6 ). Both the *InsDel
*and the *Edit *distance functions (see Section 3 ) were implemented and tested.

**
**

**
**

**
**The final implementation supports a number of data structures including GNAT's7.

**VP-Trees **-- See Section 2 for a brief description. In the tests presented here, we did not use any sampling
technique to chose vantage points since we could not be sure that we would do it identically to [Yia93].
However, some limited tests with sampling indicated that savings were in the 10% range for images and
were negligible for text and random vectors.

**VP***k***-Trees **A generalization of VP-Trees which differs in that at each node, instead of the remaining data points
being split into two halves based on their distance from the vantage point, they are split into *k *sections of
equal size (also based on distance from the vantage point). These were found to perform very similarly to
vp-trees (sometimes a little better even) but there was not a sufficiently large difference to warrant further
investigation.

**GH-Trees **-- See Section 2 for a brief description. They are essentially GNAT's of constant degree 2 without
the sampling for split points and degree variation throughout the tree. Since they perform worse than
GNAT's of degree 2, not many experiments were performed.

**OPT-Trees **-- These use a much smaller number of distance computations for queries than any other structure
but they lose out by having far more costly other computations (even superlinear in the number of data
points). The idea here is to pick a number of vantage points. Measure the distances from each to all the
other points and store these in a table. When a query comes along, measure its distance to the first vantage
point and based on that weed out all of the impossible data points. Then take the next vantage point and
do the same. Continue until no more data points are pruned. Then, check each of the remaining ones
individually. Up to the choice of vantage points this gives more or less the optimal performance, in terms
of distance calculations. This structure can serve as a lower bound for distance calculations but is not a
realistic goal to shoot for if one wants a scalable structure.

**GNAT's **-- This is the main data structure of this paper, described in Section 4 .

For each of these, we check how many distance calculations are used to both build the structure and perform queries.

7Both vp*k*-trees and opt-trees were developed during this project (though there has been previous work similar to opt-trees).

**
**Given the flexibility of the testbed system, the number of possible interesting tests to run exceeded by far the
number which could be run (in a reasonable amount of time) which in turn exceeded by far the number of test
results presented in this section. Note that the benefits of GNAT's varied and in this section we try to present
the range of results that were obtained.

Also, while opt-tree plots are in some of the graphs, it is important to keep in mind that these structures are only efficient in terms of distance calculations but are very inefficient otherwise. They are provided just to serve as a lower bound.

**
**

**
**

**
**opt-tree GNAT100 GNAT50 GNAT20 GNAT10 vp4-tree vp2-tree GNAT2 gh-tree Query Range 0.5 0.4 0.3 0.2 0.1 0 30 25 20 15 10 5 0

Figure 6: Varying Query Range for 3000 Vectors

opt-tree GNAT100 GNAT50 GNAT20 GNAT10 vp4-tree vp2-tree GNAT2 gh-tree Query Range 0.5 0.4 0.3 0.2 0.1 0 20 18 16 14 12 10 8 6 4 2 0

Figure 7: Varying Query Range for 20000 Vectors

A number of tests were performed on random vectors. The dimension was set at 50 and uniformly chosen vectors
were produced and the *L*2 metric was used.. The range was varied and a number of different data structures
were tested. The number of data points was 3000 (Figure 6 ) in one test and 20000 (Figure 7 ) in another.
The number of test queries used in every case was 100.

The first thing to note is how difficult it actually is to perform these searches. All of the data structures seemed to struggle with ranges above 0.3 (reasonable queries could easily have ranges considerably above 0.5), looking at more than 50% of the data points in many cases. This is caused by the low information content of the distance calculations since they tend to return very similar numbers (Figure 2 ).

Despite the difficulty that all these methods have, high degree GNAT's come out far ahead. In particular, the GNAT's of degrees 50 and 100 had more than a factor of 3 improvement over vp-trees in many cases.

**
**

**
**

**
**GNAT100 GNAT50 GNAT20 GNAT10 vp2-tree Query Range 10 9 8 7 6 5 4 3 2 20 18 16 14 12 10 8 6 4 2 0

Figure 8: Varying Query Range for 3000 Lines of Hamlet Using the *InsDel *Distance

GNAT100 GNAT50 GNAT20 GNAT10 vp2-tree Query Range 10 8 6 4 2 0 9 8 7 6 5 4 3 2 1 0

Figure 9: Varying Query Range for 10000 Lines of Dickens Using the *Edit *Distance

We ran a number of experiments with text with varied success. When testing 3000 lines of Hamlet using the
*InsDel *distance, speedups above vp-trees were in the factor of 2 range (Figure 8 ). Testing on 10000 lines of
``A Tale of Two Cities'' with the *Edit *distance yielded much more dramatic yet varied speedups, ranging from
around 50% with a query range of 10 to more than a factor of 6 with a range of 2 (Figure 9 ).

**
**

**
**

**
**The performance of GNAT's versus vp-trees on images varied greatly but unfortunately they were not nearly as
dramatic as the results for random vectors and text. For example, for 16 by 16 block images and the *L*2 metric,
the higher degree GNAT's gave only about 15% to 25% improvement over vp-trees (Figure 10 ). When using
50 by 50 blocks, results were a little better, with improvements in the 15% to 35% range for ranges above 200.
This is a clear indication that more work needs to be done to deal with clustered data. (Figure 11 ).

**
**In working with large metric spaces, we have proved our intuition from low-dimensional spaces wrong in many
ways. In many cases explored in this paper the data lies in a very large metric space whose only easily recognizable
and readily usable structure is the distance between its points. In other words, the space is so large that it is
meaningless to consider and use its geometry and one should concentrate on the intrinsic geometry of the actual
set of data points.

Consequently, it is important to exploit the constraints of the distribution of data rather than rely on those of the whole space. Therefore, GNAT's try to model the data they are indexing. There are several important issues involved in doing this.

First, does one break up the space by occupation or by population? In other words, if the data is composed of a large cluster and a few outliers, should the top node assign one branch for the cluster and a few branches for the outliers or should it split the cluster immediately and not worry about the outliers until later? In GNAT'swe decided to compromise by first sampling (the by-population approach) and then by picking out the points that were far apart (the by-occupation approach) when choosing split points. The current method of selecting points that are far apart can become asymmetric and some pathological behavior was observed (though it didn't impact query performance much). This remains a problem for future work.

A second issue is how to handle balancing. In our experiments, we found that good balance was not crucial to the performance of the structure. We attempted to improve the structure by using ``weighted'' Dirichlet domains but these tended to decrease performance rather than improving it. (They did reduce build time though.) Intuitively, when the tree structure is altered so that it is balanced rather than so it reflects the geometry of the space, searches tend to descend down all branches. As a result we decided to keep the tree depth from varying too much by adjusting the degrees of the nodes.

For future work, we are considering new methods of building the tree. Bottom-up constructions could lead
to very good query performance but their *O*(*n*2) construction cost will not scale well. Consequently, we are
considering schemes where a top-down construction is used but then is iteratively improved until it converges to
a bottom-up type construction.

Another important research direction is to begin to use approximate distance metrics. For example, in order
to compute near neighbors in text using the edit distance (an expensive computation), we can first use the
*q*-gram distance [Ukk92] (a relatively fast computation) to narrow the search quickly and then apply proper edit
distance to complete the search. The key is that *q*-gram distance is a lower bound for edit distance. Similarly, we
could linearly project down a very high-dimensional space (such as 50 by 50 pixel images) to a somewhat lower
dimensional space (e.g. by averaging together 2 by 2 pixel blocks) and use that as an approximation (*L*2 distance
in the projection is a lower bound for *L*2 distance in the original space). All of these techniques, of course, rely
on special knowledge of the metric space to construct the approximations. However, given the approximations,
a general method could be applied.

**
**

**
**I thank Prof. Michael Brin (my father), Prof. Hector Garcia-Molina, Luis Gravano, and Edouard Bugnion for
helpful discussions and for listening to my endless ramblings.

**References**

**
**

**
**[BFR+93] E. Bugnion, S. Fei, T. Roos, P. Widmayer, and F. Widmer. A spatial index for approximate multiple
string matching. In *Proc. First South American Workshop on String Processing, Belo Horizonte,
Brazil*, September 1993.

[BK73] W. A. Burkhard and R. M. Keller. Some approaches to best-match file searching. *Communications
of the ACM*, 16(4), April 1973.

[FN75] K. Fukunaga and P. M. Narendra. A branch and bound algorithm for computing *k*-nearest neighbors.
*IEEE Trans. Comput.*, C-24:750--753, 1975.

[FS82] C.D. Feustel and L. G. Shapiro. The nearest neighbor problem in an abstract metric space.

[HKR93] Huttenlocher, Klanderman, and Rucklidge. Comparing images using the hausdorff distance. *IEEE
Transactions on Pattern Analysis and Machine Intelligence*, 15, 1993.

[SW90] Dennis Shasha and Tsong-Li Wang. New techniques for best-match retrieval. *ACM Transactions on
Information Systems.*, 8(2):140, 1990.

[Uhl91] Uhlmann. Satisfying general proximity / similarity queries with metric trees. *Information Processing
Letters*, 40, 1991.

[Ukk92] Ukkonen. Approximate string matching with q-grams and maximal matches. *Theoretical Computer
Science*, 92, 1992.

[Yia93] Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In *ACM-
SIAM Symposium on Discrete Algorithms (A Conference on Theoretical and Experimental Analysis
of Discrete Algorithms)*, 1993.

GNAT100 GNAT50 GNAT20 GNAT10 vp2-tree Query Range 500 450 400 350 300 250 200 150 100 50 0 18 16 14 12 10 8 6 4 2

Figure 10: Performance for 3000 16 by 16 images

GNAT100 GNAT50 GNAT20 GNAT10 vp2-tree Query Range 500 450 400 350 300 250 200 150 100 50 0 7 6 5 4 3 2 1 0

Figure 11: Performance for 3000 50 by 50 images