Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
VP trees: A data structure for finding stuff fast (stevehanov.ca)
102 points by rcfox on Dec 2, 2011 | hide | past | favorite | 9 comments


The cool thing about VP trees is that your space doesn't have to be Euclidean, i.e. you can have dimensions that depend on other dimensions, e.g. in RGBA color space values of RGB are meaningless when A=transparent. You can't search that space with KD trees, but you can with VP trees.


VP trees are well known but not necessarily the best approach for large datasets or high dimensional objects. I am creating a startup that offers the fastest similarity search engine built so far (10X faster than LSH from MIT). See more here: http://simmachines.com


You probably can't go into any details, but can you describe the rough approach R-01 is taking? Is it based on some existing paper or something completely new?


He has a white paper (understandably vague) on the site: http://www.simmachines.com/resources/whitePaper.pdf

It does sound interesting. These metric-space based methods usually make poor use of caches at all levels (random access). So while they may indeed access only <10% of objects in the index, they can nevertheless be slower than a simple, cache-happy linear scan (given a simple enough distance function).

I wonder how that last example -- 120M strings in RAM, 5-NN search -- compares against an optimized linear scan?

EDIT: at the end of the white paper there are also several references to the author's academic articles. Apparently the method is based on speeding up sequential scans by compression ("sketches"), so there :)


I don't understand how this is related to the article. The article has to do with searching spatial data structures, such as you would make for partitioning the space of a large scene in a graphics or mapping application (or game, etc.). Maybe I'm wrong and someone could enlighten me....


It doesn't have to be for spatial data, that's just the easiest way to get N-dimensional data.

Basically, every variable in your problem becomes a dimension. You could think of grocery shopping as a 10-dimensional problem involving cost, dietary restrictions, nutritional value (we'll say there are 7 for this example) and brand preference. You could potentially use a VP tree to search for all products that fall within a 10-D sphere that encodes your product tolerance.


3D spatial problems are just three dimensions. True you can do that with VP but you can do that with octtree too.

Imagine beyond 3D like clustering of friends or likes or recommendations or all those other N-dimensional data-points and you'll see the overlap.


Right. There are plenty of cases where you ONLY have the distance (or similarity) between two points and don't actually have an n-dimensional point in space.


Am I the only one this link isn't working for? When I click on it I get "index.php". file(1) says this:

    /Users/zrail/Downloads/index.php: data
Edit: Here's a link to what appears to be the original paper:

http://pnylab.com/pny/papers/vptree/vptree/




Consider applying for YC's Winter 2026 batch! Applications are open till Nov 10

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: