How to Partition Data for Linear Scalability in Geospatial Queries?
7 points by ninjakeyboard on July 17, 2016 | 4 comments
How do you partition geospatial data for horizontal scalability? It seems the best option is less about partitioning and more about defining geographic regions and then duplicating the data so you can query within a region. Otherwise you end up with awkward borders: a query can land right at the corner of 4 tiles, so a single geospatial query would have to hit 4 nodes to collect the data from all 4 tiles. I wonder how the Google Places API etc. handle this sort of problem.
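
To make the border problem concrete, here's a rough sketch (Python; the 1-degree tile size and the function name are just assumptions for illustration) of how a bounding-box query near a tile corner fans out to 4 tiles:

    import math

    TILE_SIZE_DEG = 1.0  # hypothetical tile size: 1 x 1 degree

    def tiles_for_bbox(min_lat, min_lon, max_lat, max_lon, tile=TILE_SIZE_DEG):
        """Return the set of (lat_idx, lon_idx) tiles a bounding box touches."""
        return {(la, lo)
                for la in range(math.floor(min_lat / tile),
                                math.floor(max_lat / tile) + 1)
                for lo in range(math.floor(min_lon / tile),
                                math.floor(max_lon / tile) + 1)}

    # A small query box centered on a tile corner touches 4 tiles, so a
    # single query has to fan out to up to 4 partitions/nodes:
    print(tiles_for_bbox(42.95, -78.05, 43.05, -77.95))
    # {(42, -79), (42, -78), (43, -79), (43, -78)}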

The other potential solution is to overlap data, so that each node also holds the edge tiles of its neighboring nodes. I'm not 100% sure how to handle this, or what the best technology for it is.

Any recommendations are welcome. I'm probably looking at the problem wrong. E.g., in a columnar database (e.g. Cassandra), the partition key could be the floored lat/long integers, with the less significant digits queried as a column range. But maybe there is another way of looking at the data/problem space?
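
For what it's worth, here's a sketch of that key scheme, assuming a Cassandra-style layout where the floored integer degrees form the partition key and the fractional digits become clustering columns for range slices (the function names and the 4-digit precision are arbitrary choices):

    import math

    def partition_key(lat, lon):
        """One partition per 1x1-degree cell: the floored integer degrees."""
        return (math.floor(lat), math.floor(lon))

    def clustering_cols(lat, lon, precision=4):
        """The fractional part quantized to `precision` decimal digits,
        so a slice query can select a sub-range within the cell."""
        scale = 10 ** precision
        return (round((lat - math.floor(lat)) * scale),
                round((lon - math.floor(lon)) * scale))

    lat, lon = 43.6532, -79.3832
    print(partition_key(lat, lon))    # (43, -80)
    print(clustering_cols(lat, lon))  # (6532, 6168)

The catch is still the border problem: a query box that straddles cell boundaries has to hit several partitions.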



This podcast talks about scaling Second Life, which has a strong geographic component:

http://www.se-radio.net/2009/07/episode-141-second-life-and-...

My naive intuition is that sharding on two or more axes with some denormalization makes sense: e.g. sharding on both geospatial location and information layers. Infrequently modified elements that overlap several geospatial regions could be stored alongside each region. This implies eventual consistency and high availability. On the other hand, some elements might need higher consistency and therefore have lower availability.
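
As a sketch of that denormalization (all names here are hypothetical, and regions are 1x1-degree cells for simplicity), an element is simply written to every region it overlaps:

    import math

    def overlapping_regions(min_lat, min_lon, max_lat, max_lon):
        """All 1x1-degree regions an element's bounding box touches."""
        return [(la, lo)
                for la in range(math.floor(min_lat), math.floor(max_lat) + 1)
                for lo in range(math.floor(min_lon), math.floor(max_lon) + 1)]

    def write_element(shards, element, bbox):
        """Denormalized write: a copy goes to each overlapping region's shard.
        An update must touch every copy, which is why this fits infrequently
        modified elements and eventual consistency."""
        for region in overlapping_regions(*bbox):
            shards.setdefault(region, []).append(element)

    shards = {}  # region -> list of elements; stands in for per-shard storage
    write_element(shards, "road-123", (42.9, -78.1, 43.1, -77.9))
    print(sorted(shards))  # the element lands in 4 region shards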

Which is to say: the right architecture is one that exposes accurate metrics and allows a high degree of tuning based on actual usage and application requirements.

Good luck.


Having to query 2 or 4 nodes is not bad, because you can (and should) run the queries concurrently, so you still get roughly the latency of a single query. I wouldn't want to overlap data, because that opens a new door for inconsistencies to occur.
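
A minimal sketch of that fan-out (Python; query_fn is a hypothetical per-node query function):

    from concurrent.futures import ThreadPoolExecutor

    def fan_out_query(query_fn, tiles, bbox):
        """Run the same bbox query against every tile's node concurrently
        and merge the results; wall-clock latency is roughly that of the
        slowest single node, not the sum of all of them."""
        with ThreadPoolExecutor(max_workers=len(tiles)) as pool:
            futures = [pool.submit(query_fn, tile, bbox) for tile in tiles]
            return [row for f in futures for row in f.result()]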


Yeah, that was my fear with the duplication as well.


I found this; it may be relevant: http://arxiv.org/abs/1509.00910



