The cool thing about VP trees is that your space doesn't have to be Euclidean; all the tree ever needs is a distance function. That lets you have dimensions that depend on other dimensions, e.g. in RGBA color space the RGB values are meaningless when A is fully transparent. You can't search that kind of space with KD trees, but you can with VP trees.
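To make that concrete, here is a minimal VP-tree sketch (mine, not from the article; the function names and the RGBA distance are made up for illustration). Notice the tree never looks at coordinates, it only ever calls dist(a, b), so any distance obeying the triangle inequality works:

    import random

    def build_vp_tree(points, dist):
        """Pick a random vantage point, split the rest by the median
        distance to it, and recurse. Only dist() is ever consulted."""
        if not points:
            return None
        points = list(points)
        vp = points.pop(random.randrange(len(points)))
        if not points:
            return {"vp": vp, "mu": 0.0, "inner": None, "outer": None}
        dists = [dist(vp, p) for p in points]
        mu = sorted(dists)[len(dists) // 2]   # median distance to the vantage point
        inner = [p for p, d in zip(points, dists) if d < mu]
        outer = [p for p, d in zip(points, dists) if d >= mu]
        return {"vp": vp, "mu": mu,
                "inner": build_vp_tree(inner, dist),
                "outer": build_vp_tree(outer, dist)}

    def nearest(node, query, dist, best=None):
        """1-NN search. The triangle inequality lets us skip a whole
        subtree whenever |d(query, vp) - mu| >= current best distance."""
        if node is None:
            return best
        d = dist(query, node["vp"])
        if best is None or d < best[0]:
            best = (d, node["vp"])
        near, far = ((node["inner"], node["outer"]) if d < node["mu"]
                     else (node["outer"], node["inner"]))
        best = nearest(near, query, dist, best)
        if abs(d - node["mu"]) < best[0]:
            best = nearest(far, query, dist, best)
        return best

    def rgba_dist(c1, c2):
        """Hypothetical premultiplied-alpha distance: scale RGB by alpha
        before comparing, so when alpha is 0 the RGB channels stop
        mattering and every fully transparent color collapses to the
        same point. (Strictly a pseudometric, which VP trees tolerate.)"""
        (r1, g1, b1, a1), (r2, g2, b2, a2) = c1, c2
        return ((r1 * a1 - r2 * a2) ** 2 +
                (g1 * a1 - g2 * a2) ** 2 +
                (b1 * a1 - b2 * a2) ** 2 +
                (a1 - a2) ** 2) ** 0.5

    colors = [(1.0, 0.0, 0.0, 1.0), (0.0, 1.0, 0.0, 0.5), (0.3, 0.7, 0.9, 0.0)]
    tree = build_vp_tree(colors, rgba_dist)
    print(nearest(tree, (0.9, 0.1, 0.0, 1.0), rgba_dist))   # -> (distance, nearest color)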
VP trees are well known, but they're not necessarily the best approach for large datasets or high-dimensional objects. I am creating a startup that offers the fastest similarity search engine built so far (10X faster than LSH from MIT). See more here: http://simmachines.com
You probably can't go into any details, but can you describe the rough approach R-01 is taking? Is it based on some existing paper or something completely new?
It does sound interesting. These metric-space-based methods usually make poor use of caches at all levels, because their access pattern is essentially random. So while they may indeed touch fewer than 10% of the objects in the index, they can nevertheless be slower than a simple, cache-happy linear scan (given a simple enough distance function).
I wonder how that last example (120M strings in RAM, 5-NN search) compares against an optimized linear scan.
EDIT: at the end of the white paper there are also several references to the author's academic articles. Apparently the method is based on speeding up sequential scans by compression ("sketches"), so there :)
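For comparison, the "optimized linear scan" baseline is worth spelling out, because it is so simple: one sequential pass over a contiguous array, keeping the k best candidates in a tiny heap. A rough sketch (Euclidean distance and the toy data are placeholders of mine, not the white paper's method):

    import heapq

    def euclidean(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def knn_linear_scan(query, points, dist, k=5):
        """Single streaming pass; heapq.nsmallest keeps only the k best
        (distance, point) pairs, so memory access stays sequential."""
        return heapq.nsmallest(k, ((dist(query, p), p) for p in points),
                               key=lambda t: t[0])

    points = [(0.01 * i, 0.02 * i, 0.03 * i) for i in range(100_000)]
    print(knn_linear_scan((5.0, 10.0, 15.0), points, euclidean, k=5))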
I don't understand how this is related to the article. The article has to do with searching spatial data structures, such as you would build for partitioning the space of a large scene in a graphics or mapping application (or a game, etc.). Maybe I'm wrong and someone could enlighten me...
It doesn't have to be spatial data; that's just the easiest way to get N-dimensional data.
Basically, every variable in your problem becomes a dimension. You could think of grocery shopping as a 10-dimensional problem involving cost, dietary restrictions, nutritional value (we'll say there are 7 nutritional dimensions for this example), and brand preference. You could potentially use a VP tree to search for all products that fall within a 10-D sphere whose radius encodes your tolerance.
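As a toy sketch of that idea (all of the feature names and numbers below are invented), the "sphere" query is just a distance threshold around your ideal product; a VP tree would answer the same query without scanning everything:

    import math

    def product_vector(cost, dietary_fit, nutrition7, brand_pref):
        # 1 + 1 + 7 + 1 = 10 dimensions, all scaled to comparable ranges.
        return [cost, dietary_fit, *nutrition7, brand_pref]

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def within_tolerance(products, ideal, radius):
        """Naive stand-in for a VP-tree range query: everything inside
        the 10-D sphere of the given radius around the ideal product."""
        return [name for name, vec in products
                if euclidean(vec, ideal) <= radius]

    products = [
        ("store-brand oats", product_vector(0.2, 1.0, [0.8] * 7, 0.3)),
        ("fancy granola",    product_vector(0.9, 0.6, [0.7] * 7, 0.9)),
    ]
    ideal = product_vector(0.3, 1.0, [0.8] * 7, 0.5)
    print(within_tolerance(products, ideal, radius=0.5))   # -> ['store-brand oats']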
Right. There are plenty of cases where you ONLY have the distance (or similarity) between two objects and don't actually have an n-dimensional point in space.
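A classic example is strings under edit distance: there are no coordinates anywhere, just a metric, and that is all a VP tree needs. A small sketch (plain dynamic-programming Levenshtein; a brute-force min here for brevity, but the same function could be passed straight into a VP tree as dist):

    def levenshtein(a, b):
        """Edit distance: a true metric on strings, with no notion of
        dimensions or coordinates."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    words = ["kitten", "sitting", "flour", "flower", "flows"]
    print(min(words, key=lambda w: levenshtein("florr", w)))   # -> 'flour'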