Opened 8 years ago

Last modified 4 months ago

#712 new defect

Geospatial API thoughts

Reported by: Olly Betts
Owned by: Olly Betts
Priority: normal
Milestone: 2.0.0
Component: Library API
Version:
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description

I've been looking at making the geospatial API not force you to specify a metric (the only user feedback I've heard about it so far is that it's annoying to have to do so when there's a sane default). Looking at it, the pluggable metric feels like flexibility which close to nobody is actually going to want, but which forces inefficiencies on all users.

The LatLongMetric class is essentially a pluggable algorithm for "distance between two coordinates".

We don't actually seem to document any requirements for it, though presumably it ought to fulfil the requirements of a "metric" in the mathematical sense (https://en.wikipedia.org/wiki/Metric_%28mathematics%29) - see the checker sketch after the list:

  • d(x,y) >= 0
  • d(x,y) == 0 <=> x == y
  • d(x,y) == d(y,x)
  • d(x,z) <= d(x,y) + d(y,z)
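
For concreteness, a sanity check along these lines could go in the documentation or in a user metric's tests (just a sketch; check_metric_axioms is not part of the API, and it assumes LatLongCoord's public latitude/longitude members):

{{{
#include <xapian.h>
#include <cassert>
#include <cmath>
#include <vector>

// Sketch only: check the four requirements above for a user metric over a
// handful of sample coordinates.
void check_metric_axioms(const Xapian::LatLongMetric& m,
                         const std::vector<Xapian::LatLongCoord>& samples)
{
    const double EPS = 1e-9;
    for (const Xapian::LatLongCoord& x : samples) {
        for (const Xapian::LatLongCoord& y : samples) {
            double dxy = m.pointwise_distance(x, y);
            assert(dxy >= 0.0);                                        // d(x,y) >= 0
            assert(std::fabs(dxy - m.pointwise_distance(y, x)) < EPS); // d(x,y) == d(y,x)
            if (x.latitude == y.latitude && x.longitude == y.longitude)
                assert(dxy < EPS);                                     // d(x,y) == 0 when x == y
            for (const Xapian::LatLongCoord& z : samples) {
                // d(x,z) <= d(x,y) + d(y,z), with a little slack for rounding
                assert(m.pointwise_distance(x, z) <=
                       dxy + m.pointwise_distance(y, z) + EPS);
            }
        }
    }
}
}}}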

If we documented these as requirements, we might be able to take advantage of them to optimise - e.g. when handling distance from a polygon, along the lines of the sketch below.
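
For example, roughly this sort of prune becomes legitimate if the triangle inequality is guaranteed (a sketch only; can_skip, reference and radius are hypothetical names, not part of the API):

{{{
#include <xapian.h>

// Precompute, per stored set of coordinates (e.g. a polygon's vertices):
//   reference - any fixed coordinate from the set
//   radius    - maximum distance from reference to any vertex
// The triangle inequality then gives
//   d(q, nearest vertex) >= d(q, reference) - radius
// so the whole set can be rejected with a single metric evaluation.
bool can_skip(const Xapian::LatLongMetric& metric,
              const Xapian::LatLongCoord& query_point,
              const Xapian::LatLongCoord& reference,
              double radius,
              double threshold)
{
    return metric.pointwise_distance(query_point, reference) - radius > threshold;
}
}}}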

The documentation only seems to suggest you might want to use a different metric which calculates great circle distances more exactly. The provided metric assumes that the planet is a sphere and uses the haversine function, which suffers from numerical errors for points close to opposite each other - I struggle to believe that, for the purposes of ranking or limiting searches, either of these is a problem in reality.
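
For reference, the haversine formulation in question looks roughly like this (a sketch, not the actual GreatCircleMetric code; the radius here is just an approximate mean Earth radius in metres, and the library's default may differ):

{{{
#include <cmath>

// Well-conditioned for nearby points, but loses precision as h approaches 1,
// i.e. for nearly antipodal points - the numerical issue mentioned above.
double haversine_distance(double lat1, double lon1,
                          double lat2, double lon2,
                          double radius = 6371008.8)
{
    const double DEG = 3.14159265358979323846 / 180.0;
    double sdlat = std::sin((lat2 - lat1) * DEG / 2);
    double sdlon = std::sin((lon2 - lon1) * DEG / 2);
    double h = sdlat * sdlat +
               std::cos(lat1 * DEG) * std::cos(lat2 * DEG) * sdlon * sdlon;
    return 2 * radius * std::asin(std::sqrt(h));
}
}}}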

Providing a metric which calculates "distance by road" or similar seems a more plausible reason an API user might want to provide their own, but the documentation doesn't even hint that this is something you could do. You would need to take care that the metric wasn't too expensive to calculate, of course - running a full routing algorithm for every pair of points being considered is unlikely to scale well. A cheaper example is sketched below.
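
A more realistic documentation example might be something cheap like scaling great-circle distance by a constant "winding factor" rather than running a router. A sketch, assuming the current set of pure virtuals (pointwise_distance, clone, name, serialise, unserialise) - check the geospatial header for the exact interface; RoadFactorMetric and its factor are purely illustrative:

{{{
#include <xapian.h>
#include <string>

// Great-circle distance scaled by a constant factor to roughly approximate
// road distance - cheap enough to evaluate per pair of points.
class RoadFactorMetric : public Xapian::LatLongMetric {
    Xapian::GreatCircleMetric great_circle;
    double factor;

  public:
    explicit RoadFactorMetric(double factor_ = 1.3) : factor(factor_) {}

    double pointwise_distance(const Xapian::LatLongCoord& a,
                              const Xapian::LatLongCoord& b) const {
        return factor * great_circle.pointwise_distance(a, b);
    }

    Xapian::LatLongMetric* clone() const {
        return new RoadFactorMetric(factor);
    }

    std::string name() const { return "RoadFactorMetric"; }

    // Needed so the metric can be passed to a remote database.
    std::string serialise() const {
        return Xapian::sortable_serialise(factor);
    }

    Xapian::LatLongMetric* unserialise(const std::string& s) const {
        return new RoadFactorMetric(Xapian::sortable_unserialise(s));
    }
};
}}}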

The forced inefficiency issue I noted is that the metric has to calculate "pointwise_distance" in metres. At least for sorting and filtering, we really just need a value which is strictly monotonic in distance (and, for filtering, to reverse-map the threshold into the same domain). For haversine, we could sort by the intermediate value h, and avoid having to calculate an arcsin and a square root each time. Not exposing metrics would easily allow that, but redefining the interface carefully could also allow it.
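
To illustrate the monotonic surrogate (a sketch, not proposed API): d = 2*R*asin(sqrt(h)) is monotone increasing in h, so ranking can compare h values directly, and a filter threshold in metres can be mapped into h-space once rather than mapping every candidate back into metres:

{{{
#include <cmath>

namespace {

const double DEG = 3.14159265358979323846 / 180.0;

// The haversine intermediate value h - no asin, no sqrt.
double haversine_h(double lat1, double lon1, double lat2, double lon2)
{
    double sdlat = std::sin((lat2 - lat1) * DEG / 2);
    double sdlon = std::sin((lon2 - lon1) * DEG / 2);
    return sdlat * sdlat +
           std::cos(lat1 * DEG) * std::cos(lat2 * DEG) * sdlon * sdlon;
}

}

// Filter: is the pair within threshold_metres on a sphere of radius R?
bool within(double threshold_metres, double R,
            double lat1, double lon1, double lat2, double lon2)
{
    // One-off mapping of the threshold into h-space: h_T = sin^2(T / (2R)).
    // Valid for thresholds up to half the circumference, which covers every
    // possible great-circle distance.
    double s = std::sin(threshold_metres / (2 * R));
    return haversine_h(lat1, lon1, lat2, lon2) <= s * s;
}
}}}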

None of this is feasible before 1.4.0 (except perhaps dropping metrics from the public API entirely), but I wanted to write down my thoughts while they were still fresh.

I think for 1.4.0 we should default the metric if not specified, and either just leave the metric classes marked as experimental, or improve the documentation to actually say what the requirements are and give a better example of a user metric.

Change History (3)

comment:1 by Olly Betts, 18 months ago

From #186:

the reason why LatLongMetric has clone() is so that LatLongPostingSource::clone() can in turn clone the metric. Optionally reference counting the metric would provide the same ability with less object copying and better consistency, though this only happens once per sub-database per query run, so it really isn't a hot spot.

If we change that, it would make sense to change to passing a pointer instead of a reference. The bindings would also need updating to keep a reference to the currently set metric.

comment:2 by Olly Betts, 11 months ago

Milestone: 1.5.0

1.5.0 is an opportunity to make API changes, so setting the milestone to make sure we consider this.

comment:3 by Olly Betts, 4 months ago

Milestone: 1.5.0 → 2.0.0