Hacker News

This Uber H3 has been public for a while, and it gets a mixed reaction from me -- it seems like Uber wanted to 'invent' something really cool, as in, the showing-off part is every bit as important as the tech, or more so. The reason to point that out is that spatial indexing is one of the most seriously studied topics in computational geometry. Yes, the classical papers can be a bit much, but there is a reason they are that way -- the topic really does benefit from that kind of attention.

If Uber needed a different kind of spatially indexed filter and search, then a system adaptive to data density would seem like the performant thing to build. A nested-hexagon space is certainly fun, but real car data is distributed in ways that are neither rectangular nor hexagonal.




I don't understand.

The events are associated with coordinates and then binned into their containing hexagon at a given resolution.

So if 10 people request a ride in the same hexagon in a given time frame, that hexagon is weighted according to surge-pricing rules.
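Roughly, that binning reduces to a counter keyed by cell id. A minimal sketch, where the cell ids and the surge threshold are made up for illustration:

```python
from collections import Counter

# Hypothetical: events already resolved to H3 cell ids (strings).
# In practice a lat/lng pair is converted to a cell at a chosen
# resolution by the H3 library before counting.
ride_requests = ["8928308280fffff"] * 10 + ["8928308280bffff"] * 2

counts = Counter(ride_requests)

SURGE_THRESHOLD = 5  # made-up rule: surge once demand exceeds this

surging = {cell for cell, n in counts.items() if n >= SURGE_THRESHOLD}
```

Here only the first cell crosses the threshold and gets surge weighting applied.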


I just want to make it clear that there's a real reason behind H3, not just "invent something really cool." I don't work at Uber anymore, but I did work on the team that originated H3, and I worked on H3 itself.

To condense it down to a few bullet points, we realized:

1. We needed to move from geofences to a grid, for both data science and scaling reasons, to perform realtime aggregated and anonymized activity analysis for things like surge pricing.

2. With a grid, along with quantized time ranges (e.g. 10-second or 1-minute intervals), we can reduce data collection to distributed increments across a small cluster of machines with O(1) compute time, rather than a more complicated R-Tree (O(log n)) plus point-in-polygon (O(m)) system for geofences (which also can't backfill prior results when new geofences are added).

3. With a grid, data science can be assured of approximately equal area and time across these space-time buckets so normalizing the data for analysis between them is not necessary, regardless of where on the planet it came from (cross-city analysis and forecasting). With a hexagon grid, data science can also be assured that all neighboring cells are the same distance from each other and that all neighboring cells share a measurable edge rather than an infinitesimal point (like squares) so flow analysis between the cells is similarly simplified and needs no normalizing.

4. With a hierarchical hexagon grid we can quantize the data at the finest granularity (which is higher resolution than commercial GPS -- about the size of a coffee table in diameter), and it can be rapidly reaggregated upwards to coarser resolutions (with some small error introduced, since child hexagons do not nest exactly inside their parents) for data science to determine what the "right" resolution is for the analysis at hand; the realtime aggregating system can then be updated to index at that resolution natively as well, to improve bucketing accuracy (H3 resolution 9 was one such blessed coarser resolution).
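Point 2's increment scheme can be sketched as a dict keyed by (cell, time window): each incoming event is a single O(1) dictionary increment, with no spatial lookup per event. Cell names and timestamps below are made up:

```python
from collections import defaultdict

def bucket_key(cell, unix_ts, window_s=60):
    # Quantize the timestamp to its window (e.g. 1-minute intervals)
    # so the space-time bucket key is (cell, window start).
    return (cell, unix_ts - unix_ts % window_s)

counts = defaultdict(int)

# Each event is one O(1) dict increment -- no R-Tree lookup or
# point-in-polygon test on the hot path.
events = [("cellA", 0), ("cellA", 30), ("cellA", 65), ("cellB", 10)]
for cell, ts in events:
    counts[bucket_key(cell, ts)] += 1
```

The first two cellA events share the 0-59s window; the third lands in the next one.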
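The equidistant-neighbor property from point 3 can be checked directly with axial hex coordinates (the convention described by Red Blob Games). This is a plain-geometry sketch, not the H3 projection itself:

```python
import math

# The six axial-coordinate neighbor offsets of any hex cell.
NEIGHBORS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]

def center(q, r, size=1.0):
    # Cartesian center of axial cell (q, r) for pointy-top hexes.
    x = size * (math.sqrt(3) * q + math.sqrt(3) / 2 * r)
    y = size * (3 / 2 * r)
    return (x, y)

# Every neighbor center is exactly the same distance away -- unlike a
# square grid, where the four diagonal "neighbors" touch only at a
# point and sit sqrt(2) times farther out.
dists = [math.dist(center(0, 0), center(dq, dr)) for dq, dr in NEIGHBORS]
```

All six distances come out equal (sqrt(3) times the cell size), which is why flow analysis between hexagon cells needs no per-direction normalizing.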
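Point 4's upward reaggregation is, in a sketch, one pass over the fine-resolution buckets. `PARENT` below is a made-up stand-in for the library's real fine-cell-to-coarse-cell lookup:

```python
from collections import defaultdict

# Hypothetical parent mapping: in a hierarchical grid every fine cell
# has a well-defined parent at any coarser resolution.
PARENT = {"f1": "c1", "f2": "c1", "f3": "c2"}

fine_counts = {"f1": 4, "f2": 6, "f3": 5}

# Roll the fine buckets up to the coarser resolution in one pass.
coarse_counts = defaultdict(int)
for cell, n in fine_counts.items():
    coarse_counts[PARENT[cell]] += n
```

Because indexing happened at the finest granularity, any coarser candidate resolution can be evaluated this way before the realtime system is switched to index at it natively.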

When we determined exactly what we needed, we didn't invent it to feel like we did something really cool; we reached out to Dr. Kevin Sahr in academia to help us make it -- he used part of his DGGRID code to do so[1] -- and then spent nearly two years making sure everything was legal for open sourcing.

What we did at Uber was focus on making his core more easily consumable: ensure the core does not allocate memory on its own but has memory passed in (so integration with memory-managed languages would be much simpler), implement many of the algorithms described by Red Blob Games to make manipulating data in the grid system more amenable, update the build and test system to a more modern standard (CMake, code coverage, and unit tests), and write bindings to the C library in various languages.

[1](https://uber.github.io/h3/#/documentation/overview/use-cases) (references the paper that covers the work he did)


excellent -- reading carefully; this one is fun .. https://github.com/uber/h3-py/blob/master/docs/UnifiedDataLa...


Using H3 to analyze data originally indexed in different formats is definitely cool, but you can totally do that with S2 as well.

I personally feel the unidirectional edges[1] are a real differentiator. If you have these space-time buckets and compare counts between two points in time, you can see the change in density, but you don't know how active it is. Tracking transitions between hexagons in an anonymous, aggregated fashion lets you see if the "stagnant" areas of your map are stagnant because the actors actually aren't moving, or if there's a large "mixing" of the actors but you've reached a quasi-static-equilibrium in the system.

It also lets you see which regions are more connected to each other: if, for instance, hexagon A always flows back and forth with hexagon B, which also flows back and forth with hexagon C bordering both, but there is no such flow between C and A, then from the data alone you can spot potential rivers, deduce roadblocks or accidents, infer that certain businesses must be closed, etc.
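That kind of edge-level analysis amounts to counting directed transitions between neighboring cells. A minimal sketch with made-up cell names:

```python
from collections import Counter

# Hypothetical trace of anonymized, aggregated actor movements,
# recorded as (from_cell, to_cell) transitions between neighbors.
transitions = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "B"), ("B", "A")]

flow = Counter(transitions)

def net_flow(a, b):
    # Net directed flow across the a->b edge. Near-zero net flow with
    # high traffic in both directions means a quasi-static equilibrium
    # ("mixing"); zero traffic both ways means truly stagnant.
    return flow[(a, b)] - flow[(b, a)]
```

Here A and B exchange traffic (nonzero counts both ways), while the C-to-A edge carries nothing at all, which is the kind of asymmetry that hints at a river or roadblock between them.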

[1](https://uber.github.io/h3/#/documentation/api-reference/unid...)



