The cool thing about VP trees is that your space doesn't have to be Euclidean; all the tree ever needs is a distance function. That lets you have dimensions that depend on other dimensions, e.g. in RGBA color space the RGB values are meaningless when A is fully transparent. You can't search that kind of space with KD trees, but you can with VP trees.
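To make that concrete, here is a minimal VP-tree sketch (mine, not from the article; the function names and the RGBA distance are made up for illustration). Notice the tree never looks at coordinates, it only ever calls dist(a, b), so any distance obeying the triangle inequality works:

    import random

    def build_vp_tree(points, dist):
        """Pick a random vantage point, split the rest by the median
        distance to it, and recurse. Only dist() is ever consulted."""
        if not points:
            return None
        points = list(points)
        vp = points.pop(random.randrange(len(points)))
        if not points:
            return {"vp": vp, "mu": 0.0, "inner": None, "outer": None}
        dists = [dist(vp, p) for p in points]
        mu = sorted(dists)[len(dists) // 2]   # median distance to the vantage point
        inner = [p for p, d in zip(points, dists) if d < mu]
        outer = [p for p, d in zip(points, dists) if d >= mu]
        return {"vp": vp, "mu": mu,
                "inner": build_vp_tree(inner, dist),
                "outer": build_vp_tree(outer, dist)}

    def nearest(node, query, dist, best=None):
        """1-NN search. The triangle inequality lets us skip a whole
        subtree whenever |d(query, vp) - mu| >= current best distance."""
        if node is None:
            return best
        d = dist(query, node["vp"])
        if best is None or d < best[0]:
            best = (d, node["vp"])
        near, far = ((node["inner"], node["outer"]) if d < node["mu"]
                     else (node["outer"], node["inner"]))
        best = nearest(near, query, dist, best)
        if abs(d - node["mu"]) < best[0]:
            best = nearest(far, query, dist, best)
        return best

    def rgba_dist(c1, c2):
        """Hypothetical premultiplied-alpha distance: scale RGB by alpha
        before comparing, so when alpha is 0 the RGB channels stop
        mattering and every fully transparent color collapses to the
        same point. (Strictly a pseudometric, which VP trees tolerate.)"""
        (r1, g1, b1, a1), (r2, g2, b2, a2) = c1, c2
        return ((r1 * a1 - r2 * a2) ** 2 +
                (g1 * a1 - g2 * a2) ** 2 +
                (b1 * a1 - b2 * a2) ** 2 +
                (a1 - a2) ** 2) ** 0.5

    colors = [(1.0, 0.0, 0.0, 1.0), (0.0, 1.0, 0.0, 0.5), (0.3, 0.7, 0.9, 0.0)]
    tree = build_vp_tree(colors, rgba_dist)
    print(nearest(tree, (0.9, 0.1, 0.0, 1.0), rgba_dist))   # -> (distance, nearest color)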
VP trees are well known, but they're not necessarily the best approach for large datasets or high-dimensional objects. I am creating a startup that offers the fastest similarity search engine built so far (10X faster than LSH from MIT). See more here: http://simmachines.com
You probably can't go into any details, but can you describe the rough approach R-01 is taking? Is it based on some existing paper or something completely new?
It does sound interesting. These metric-space-based methods usually make poor use of caches at all levels, because their access pattern is essentially random. So while they may indeed touch fewer than 10% of the objects in the index, they can nevertheless be slower than a simple, cache-happy linear scan (given a simple enough distance function).
I wonder how that last example (120M strings in RAM, 5-NN search) compares against an optimized linear scan.
EDIT: at the end of the white paper there are also several references to the author's academic articles. Apparently the method is based on speeding up sequential scans by compression ("sketches"), so there :)
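For comparison, the "optimized linear scan" baseline is worth spelling out, because it is so simple: one sequential pass over a contiguous array, keeping the k best candidates in a tiny heap. A rough sketch (Euclidean distance and the toy data are placeholders of mine, not the white paper's method):

    import heapq

    def euclidean(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def knn_linear_scan(query, points, dist, k=5):
        """Single streaming pass; heapq.nsmallest keeps only the k best
        (distance, point) pairs, so memory access stays sequential."""
        return heapq.nsmallest(k, ((dist(query, p), p) for p in points),
                               key=lambda t: t[0])

    points = [(0.01 * i, 0.02 * i, 0.03 * i) for i in range(100_000)]
    print(knn_linear_scan((5.0, 10.0, 15.0), points, euclidean, k=5))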
I don't understand how this is related to the article. The article has to do with searching spatial data structures, such as you would build for partitioning the space of a large scene in a graphics or mapping application (or a game, etc.). Maybe I'm wrong and someone could enlighten me...
It doesn't have to be spatial data; that's just the easiest way to get N-dimensional data.
Basically, every variable in your problem becomes a dimension. You could think of grocery shopping as a 10-dimensional problem involving cost, dietary restrictions, nutritional value (we'll say there are 7 nutritional dimensions for this example), and brand preference. You could potentially use a VP tree to search for all products that fall within a 10-D sphere whose radius encodes your tolerance.
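As a toy sketch of that idea (all of the feature names and numbers below are invented), the "sphere" query is just a distance threshold around your ideal product; a VP tree would answer the same query without scanning everything:

    import math

    def product_vector(cost, dietary_fit, nutrition7, brand_pref):
        # 1 + 1 + 7 + 1 = 10 dimensions, all scaled to comparable ranges.
        return [cost, dietary_fit, *nutrition7, brand_pref]

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def within_tolerance(products, ideal, radius):
        """Naive stand-in for a VP-tree range query: everything inside
        the 10-D sphere of the given radius around the ideal product."""
        return [name for name, vec in products
                if euclidean(vec, ideal) <= radius]

    products = [
        ("store-brand oats", product_vector(0.2, 1.0, [0.8] * 7, 0.3)),
        ("fancy granola",    product_vector(0.9, 0.6, [0.7] * 7, 0.9)),
    ]
    ideal = product_vector(0.3, 1.0, [0.8] * 7, 0.5)
    print(within_tolerance(products, ideal, radius=0.5))   # -> ['store-brand oats']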
Right. There are plenty of cases where you ONLY have the distance (or similarity) between two objects and don't actually have an n-dimensional point in space.
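A classic example is strings under edit distance: there are no coordinates anywhere, just a metric, and that is all a VP tree needs. A small sketch (plain dynamic-programming Levenshtein; a brute-force min here for brevity, but the same function could be passed straight into a VP tree as dist):

    def levenshtein(a, b):
        """Edit distance: a true metric on strings, with no notion of
        dimensions or coordinates."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    words = ["kitten", "sitting", "flour", "flower", "flows"]
    print(min(words, key=lambda w: levenshtein("florr", w)))   # -> 'flour'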