Opened 9 years ago

Closed 8 years ago

#481 closed enhancement (fixed)

Merge geospatial branch

Reported by: richard Owned by: richard
Priority: highest Milestone: 1.3.0
Component: Library API Version: SVN trunk
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description (last modified by olly)

We've had a few requests for the features in the geospatial branch, so I've put together a clean patch against trunk of what I think are the stable and reasonably "uncontroversial" parts of the geospatial branch. This patch omits several things from the geospatial branch:

  • The LatLongDistanceMatchDecider, since this is superceded by functionality in the LatLongDistancePostingSource, and we're moving towards deprecating MatchDeciders.
  • The parser for latitude-longitude coordinates. This is quite handy to allow lots of arbitrary formats for coordinates to be parsed into numeric values, but is a bit complex, and is a second lemon parser to be maintained, so I'd rather consider merging it at a later date. Without the parser, coordinates have to be supplied as a pair of doubles, rather than having the option of providing them as strings.
  • The half-completed HTM based generation of "acceleration" terms.

I've tidied up the documentation and tests, and fixed a few small issues in the patch, so I think this is ready to be reviewed and considered for merging to trunk. I'd be happy if it went in for 1.2.1, or at least got considered for it, so marking that as the milestone for now.

Attachments (7)

geospatial_clean1.patch (60.6 KB) - added by richard 9 years ago.
First patch to merge the geospatial branch.
geospatial_clean2.patch (60.5 KB) - added by richard 9 years ago.
Updated patch
geospatial_clean3.patch (60.6 KB) - added by richard 9 years ago.
Further updates to the patch
geocode.py (2.4 KB) - added by olly 9 years ago.
prototype compact encoding in python
geocode.cc (6.6 KB) - added by richard 8 years ago.
Implementation of geo encoding in C++
geocode.2.cc (6.6 KB) - added by richard 8 years ago.
Improved (speed) implementation of geo encoding for C++.
geocode.3.cc (7.1 KB) - added by richard 8 years ago.
Updated encoding to swap priority of lat and long, to make the end rounding safe, to make the encoding append to a string, and to make the decoding take a pointer and length

Download all attachments as: .zip

Change History (35)

Changed 9 years ago by richard

First patch to merge the geospatial branch.

comment:1 Changed 9 years ago by richard

  • Description modified (diff)

(Removed accidental start of sentence at the end of the description.)

comment:2 Changed 9 years ago by olly

  • Description modified (diff)

Some minor code-style issues:

  • We're using C++ forms of C headers now (so <cmath> not <math.h>) unless there's a good reason.
  • Use str() not om_tostring() - the latter is being phased out.
  • Generally in .cc files we use "using namespace std;" rather than explicitly qualifying with std:: everywhere, as that's more readable.

Using XAPIAN_VISIBILITY_DEFAULT on inlined API methods seems rather odd. I slightly wonder if we really want/need a Xapian API feature to convert from nautical miles to metres, but at least it is inlined so isn't adding more symbol relocations.

Unserialisation errors should now throw UnserialisationError rather than NetworkError. This patch catches NetworkError in at least two places and rethrows as InvalidArgumentError. I think this catch and rethrow should just be removed now we have UnserialisationError.

I wonder if it would be better to specify in the documentation that the lat/long coordinates should use the WGS84 datum rather than risking adding to the sadly rather widespread misunderstanding that a lat/long coordinate pair alone specifies a geographical location. WGS84 is what they almost inevitably will be in anyway given the prevalance of GPS, and means that if we actually later want to extend the geospatial support in a way which requires the datum, then we haven't tied our own hands. Meanwhile users can ignore the datum, but it's then their problem not ours.

What does code coverage look like?

comment:3 Changed 9 years ago by richard

  • Status changed from new to assigned

I've updated the patch with the style issues, and I've removed the nautical miles conversion. I've also removed the "meters-feet" conversions - I think the only one most users are likely to want is the "miles-meters" conversion, and if they need any others they just need to look up the constants.

The main place that non-WGS84 data I've come across comes from is from Ordnance Survey maps - but I think it's reasonable to recommend that WGS84 is used unless there's a good reason not to, so I've updated the documentation to say that.

Code coverage was fairly good when I last checked, but I'll generate an updated report shortly.

Changed 9 years ago by richard

Updated patch

Changed 9 years ago by richard

Further updates to the patch

comment:4 Changed 9 years ago by richard

I failed to fix the unserialisation error stuff in the previous patch - now updated (in geospatial_clean3.patch). I also removed the

LatLongCoords::unserialise(const char * ptr, const char * end)

form, since it doesn't really add anything over the

LatLongCoords::unserialise(const std::string & value)

form (other than avoiding the need to make a string if you only have the ptr and end - but that's not the case where this is normally called (ie, when unserialising a value) so I think it's better to keep the API a bit cleaner.

comment:5 Changed 9 years ago by richard

Code coverage is fairly good, except for the PostingSource? which has no coverage of the skip_to() and check() methods. I'll try and find time to write some tests to cover those soon, but other than that I think this patch is a pretty good candidate for pushing to trunk.

One potential API addition you might not be keen on, is in include/xapian/postingsource.h, where I've added a

virtual PostingSource * unserialise_with_registry(const std::string &s,
                                      const Registry & registry) const;

method. I think we need this method, but the name is a bit horrible: unlike the Query::unserialise() method, we can't just overload the unserialise() method with a variant which takes the registry, because this doesn't work with inheritance (as far as I can see). Any better ideas welcome, but I don't think this is too horrible an API to live with.

comment:6 Changed 9 years ago by olly

Some initial thoughts:

  • It seems we aren't likely to be ready to merge this without a fair bit more fettling, so I'd suggest we create a "geo-stuff to merge branch" for this. A VCS is good for controlling versions - multiple patches attached to a ticket isn't! I have to describe any suggested changes in a comment, or else check out another copy of the source tree, apply your patch, take a copy, make the change, and generate a patch to attach here for you to apply to your patched tree!
  • The planet name should really be "Earth" not "earth".
  • In GreatCircleMetric it doesn't seem useful to be able to specify the radius of the Earth - I'm struggling to see why anyone would (the closest I can come is indexing features on another planet). But if we do allow it, the constructor for it really should be marked explicit, and the description fixed ("The radius of to use, in metres").
  • We decided to use @ for doxygen comments (possibly this code predates that) - see HACKING:

We've decided to prefer to use @ rather than \ to introduce doxygen commands (the choice is essentially arbitrary, though \ introduces C/C++ escape sequences so @ is likely to make for easier to read mark up for C/C++ coders). A *lot* of existing comments use \ currently but please use @ for new comments.

  • GreatCircleMetric::operator() should probably clamp to 1.0 before sqrt() rather than after (since that avoids an expensive sqrt() calculation in the clamping case).
  • At least one name() method returns string("literal") whereas at least one other returns "literal" - probably better to be consistent, and I think we currently just return a string literal in similar cases.
  • There seem to be two identical LatLongDistancePostingSource ctor definitions. If they are different, I can't see how, though I'm surprised that the compiler doesn't complain.
  • Why does longitude of -0.0 need special treatment? I think that deserves a comment...
  • It's better not to use + multiple time to build up a string as this is likely to create a lot of temporary strings - at least with the versions of GCC I've experimented with, you get noticeably smaller code by using "+=" to build up a string (e.g. throw InvalidArgumentError("LatLongMetric " + new_metric_name + " not registered");)
  • LatLongCoords seems very clumsy. I'd imagine the common case is just dealing with single locations, but the API forces users to wrap those in this class. Also we don't currently expose STL iterators (or even STL structures apart from std::string I think) elsewhere in our API, so starting to do so is something to consider.
  • I wonder if there should be a way to say what the maximum interesting distance is? For example, in 2D Euclidean space, if |x1-x2| > max_dist, there's no need to go to the effort of computing the exact distance. I suspect there are similar crude bounds possible with most metrics (e.g. (difference in latitude) * min(metres_per_latitude) > max_dist).

comment:7 Changed 9 years ago by richard

Thanks for the initial review:

  • I've created the geotomerge branch, and applied the latest patch to it. I've then applied some further changes based on your comments.
  • Changed earth to Earth.
  • The use of being able to specify the radius for the GreatCircleMetric was precisely to be able to index features on another planet. For example, I played with a small database of features on Mars, which this worked nicely for. I've changed the constructor to be marked explict, and fixed the documentation comment, though.
  • The code does predate the decision to use @ - I've updated all the occurrences, I believe.
  • Changed the clamping to 1.0 for GreatCircleMetric to be before the sqrt(). Also wrapped it in "rare()".
  • Changed the return value of the name() methods to just return a string literal, which is what other classes already in trunk with name() methods do.
  • The private LatLongDistancePostingSource constructor takes a LatLongMetric *, the external takes a LatLongMetric &. The public ctor clones the metric immediately (so gets a metric that it is responsible for deleting). The internal one is passed a metric as the result of serialisation or cloning, and assumes ownership of the pointer. I'm not sure this is a problem; we could just allow the user to pass the metric as a pointer, but the current interface avoids confusion about object ownership - the reference passed only needs to be valid for the duration of the constructor call, which is a fairly normal assumption for values passed by reference.
  • I don't recall why a longitude of -0.0 was being converted to 0.0. Removing that doesn't seem to cause any problems (all tests pass OK still, both on the original geospatial branch, and the "geotomerge" one), and the serialised forms are identical. Therefore, I've removed this special-case.
  • I've changed all the occurrences of string building using "+" that I could find to use += instead.
  • I don't think there's a particular need performance-wise to be able to specify the maximum interesting distance. It would be quite possible to make a metric subclass in future which took an additional parameter (at construction time) to indicate the maximum interesting distance (or perhaps a bounding box of coordinates worth considering). I don't see that there's any need to do this now, though.

Whilst going through these comments, I also noticed that LatLongDistanceKeyMaker had an unfinished documentation comment: "If a document contains no". I've finished it off.

This leaves two of your comments unaddressed:

  • Can we tidy up the API to avoid forcing users to wrap single locations in LatLongCoords?
  • Should we avoid exposing STL iterators?

I'll take a look at these remaining two shortly.

comment:8 Changed 9 years ago by richard

Users of the C++ API aren't actually forced to use LatLongCoords? explicitly:

  • when serialising, to store values, a serialised LatLongCoord? has the same representation as a serialised LatLongCoords? object containing the same single coordinate, so a LatLongCoord? can be serialised and stored directly. (This is noted in the documentation, too.)
  • when creating a PostingSource?, Metric or KeyMaker?, a LatLongCoord? can be passed in place of the LatLongCoords? reference required, and the non-explicit LatLongCoords(const LatLongCoord & coord) constructor will be used to convert the parameter automatically.

However, the automatic conversion won't work in other languages; so, for example, python users will be forced to do the wrapping explicitly. We could provide a wrapper in such languages, of course.

I'm not sure what a cleaner approach would be, whilst retaining the ability to calculate distances based on sets of points.

Regarding exposing of STL iterators, it would be quite easy to write a simple iterator for iterating the contents of a LatLongCoords?; I'll take a go at that in a bit. Doing so should make the SWIG wrappers better, so it's probably worth it just for that.

comment:9 Changed 9 years ago by richard

I've changed LatLongCoords to use a vector internally, and preserve the order of coordinates in it. This makes it possible to use one to represent an boundary, instead of just a set of points, so future metrics could do nearest distance between boundary lines, instead of points.

Of course, for a metric to do a calculation like that would require making the double LatLongMetric::operator()(const LatLongCoords & a, const LatLongCoords &b) const; method virtual, which would then conflict with the operator() which takes LatLongCoord arguments, so one of these would need to be renamed.

comment:10 Changed 9 years ago by richard

An alternative approach would be to get rid of LatLongCoords altogether. Instead, LatLongMetric would take a set of coordinates in serialised form. To store multiple coordinates in a slot, users would simply append serialised coordinates together (which is what LatLongCoords currently does).

Also, most of the time one of the coordinates supplied to LatLongMetric is a "center" coordinate, and is fixed for the duration of a query. If that was supplied to the metric (at construction time, or shortly after that), the metric might be able to optimise calculations for that center (by building bounding boxes of interest, etc). So perhaps we want metrics which have a distance calculation method which takes a single coordinate, and returns the distance from the stored center.

I think we'd want to preserve the ability of LatLongMetric to calculate a pointwise distance based on a pair of LatLongCoord objects, though: future code to generate acceleration terms (like that on the geospatial branch) needs to do such calculations frequently.

On a related note, the LatLongCoord::unserialise(const char ** ptr, const char * end) method can't currently be used from the bindings, since the const char ** ptr type can't be created there. I think this probably needs special wrapping, to return a list of the coordinates.

comment:11 Changed 9 years ago by olly

  • Milestone changed from 1.2.1 to 1.2.2

Bumping to 1.2.2.

comment:12 Changed 9 years ago by olly

  • Milestone changed from 1.2.2 to 1.2.3

We're releasing a quick 1.2.2 to fix some portability issues in 1.2.1, so updating milestone.

comment:13 Changed 9 years ago by olly

  • Milestone changed from 1.2.3 to 1.2.4

I want to release 1.2.3, so bumping.

Changed 9 years ago by olly

prototype compact encoding in python

comment:14 Changed 9 years ago by olly

The attachment is a python prototype of a geocoding format which gives better than 2m precision in at most 6 bytes, and can be truncated to as little as 2 bytes to give a more approximate but shorter encoding of the same point.

comment:15 Changed 9 years ago by olly

  • Milestone changed from 1.2.4 to 1.2.5

Sadly this isn't going to be ready for 1.2.4.

comment:16 Changed 9 years ago by olly

  • Milestone changed from 1.2.5 to 1.2.6

Bumping.

comment:17 Changed 8 years ago by jbergstroem

For what its worth: This has been running in production over at my place since november last year (well, roughly 4-5 months). I have a patched version of 1.2.2 and we haven't run into any issues (yet? :). Thanks for the great work!

comment:18 Changed 8 years ago by olly

  • Milestone changed from 1.2.6 to 1.2.x
  • Priority changed from normal to highest

Rather than repegging this repeatedly, setting milestone to 1.2.x and priority to highest.

comment:19 Changed 8 years ago by olly

  • Milestone changed from 1.2.x to 1.3.0

Targetting at 1.3.0.

comment:20 Changed 8 years ago by olly

Richard pointed out on IRC that there's no copyright or licensing information on my prototype - to be clear I'm happy to license the code in geocode.py as MIT/X.

Changed 8 years ago by richard

Implementation of geo encoding in C++

comment:21 Changed 8 years ago by richard

Added a C++ implementation; not particular optimised, but does 1,000,000 encoding-decoding cycles a second for random coordinates on my (slowish) machine, so probably good enough.

Changed 8 years ago by richard

Improved (speed) implementation of geo encoding for C++.

comment:22 Changed 8 years ago by richard

The special cases are fragile, as implemented; I think that something at latitude -89.999999999 will be rounded to having latitude -90, but not trigger the special case (so the longitude will be stored, too). This should be easy to fix.

In performance testing, I noticed that a significant part of the time taken by geo_decode (around 30%) is doing the string comparison for "if (value == "\xff\xff")". It occurs to me that it might be better not to have the special short representations of -90 and +90: instead, always use 6 bytes for the representation. This has the advantage that multiple values could be stored in the slot with no need to mark terminators; and that when encoding multiple values, a 6 byte fixed buffer could be used, instead of a string (which would probably be faster). We could also just do the special case handling at encoding time, checking the integer representation of the number of 16ths of a second in the latitude, and setting the longitude to zero if that is the value for -90 or +90.

It also occurs to me that it might be better to store longitude values as more significant that latitude. With the current code, a range query on the encoded values can be used to check if the coordinate is in a given degree range of latitude values, but I suspect being able to do a quick range query to check a range of longitudes would be more useful in many cases; for example, it would enable US and european coordinates to be distinguished quickly; there is probably a greater range of longitude degree values in use than latitudes.

Changed 8 years ago by richard

Updated encoding to swap priority of lat and long, to make the end rounding safe, to make the encoding append to a string, and to make the decoding take a pointer and length

comment:23 Changed 8 years ago by richard

geocode format should probably have the discontinuity in longitude values at -180 instead of 0, since the world is less populated there. We should also implement a bounding box checker (possibly that works only at the degree level), or at least a checker for longitude ranges, so that all users don't need to write their own code to cope with the discontinuity in encoded ranges. Not sure of the appropriate API for a bbox checker, since we want to avoid effectively decoding the coordinate twice. It seems that decoding could perhaps be done which doing a bbox check, so I think a good API might be a function takes a bbox, and which either returns a decoded coordinate, or a value of false to indicate that the coordinate is outside the box.

comment:24 Changed 8 years ago by richard

I've put this code up as a git repository on github now (and I've separated the test code from the core code, and added a documented header file for it, too).

https://github.com/rboulton/geoencode

comment:25 Changed 8 years ago by richard

I've added the a class for checking bounding boxes to the git repository. This does a range check first on the first encoded byte, and only performs a full decode if neccessary. The range check optimisation appears to give around a 10% speedup in a very rough test (running a lightly modified testsuite under "time") for decoding evenly distributed coordinates and bounding boxes, even if the class is only used once, and gives a greater speedup for smaller boxes, and if the class is used many times. Decoding 10,000,000 coordinates took 9 seconds without the optimisation, 8.1 seconds with the optimisation, and 5.5 seconds if each bounding box was limited to 10% of the circumference of the earth in size. Further optimisations are probably possible with the same API, but I think this code is ready to be integrated into the geospatial code in Xapian now.

comment:26 Changed 8 years ago by richard

I've integrated this code with the geospatial code in Xapian, and done considerable cleanup now. My integration work is in a "geotomerge2" branch in my xapian fork on github. To see the differences compared to current master, go to:

https://github.com/rboulton/xapian/compare/master...geotomerge2

I believe this code is now ready to be applied to master; I would welcome review, particularly of the interfaces in the public header file - given the impending GSoC summit I understand that may take a couple of weeks, but am keen for it to go into trunk before it bitrots again.

comment:27 Changed 8 years ago by olly

LatLongCoord::operator< does:

    bool operator<(const LatLongCoord & other) const
    {
        if (latitude < other.latitude) return true;
        return (longitude < other.longitude);
    }

I think this probably should have a middle line of:

        if (latitude > other.latitude) return false;

As it stands, if a has lat 10, long 20 and b has lat 20, long 10 then both a < b and b < a are true, which is (a) surprising and (b) I believe leads to undefined behaviour with the STL. Is this just an oversight? Presumably this method exists so that they can be put in containers like std::map, so the actual ordering is mostly arbitrary, so long as it's consistent.

A bit unsure about having a naked struct for LatLongCoord - we've avoided this elsewhere in the API.

The unserialise() methods for LatLongCoord and LatLongCoords work differently to our existing unserialise methods, which all return a new object or object pointer. Perhaps that's sensible, but the inconsistency grates a bit.

There's a documented return value here for a (private) method which actually returns void:

    /** Calculate the distance for the current document.
     *
     *  Returns true if the distance was calculated ok, or false if the
     *  document didn't contain a valid serialised sequence of coordinates in
     *  the appropriate value slot.
     */
    void calc_distance();

LatLongDistanceKeyMaker?'s ctors lack documentation comments.

10E10 seems a bit of an odd default value (1e11 seems a more natural way to write the same thing). But perhaps we should just split the ctors and initialise defkey(9, '\xff') when no default distance is specified (which is more efficient than defaulting the distance to HUGE_VAL and then converting it, and avoids pulling in <cmath> in an API header).

I've also fixed a pile of documentation comment typos in my local tree. I'll try to sort out a merge request for changes.

comment:28 Changed 8 years ago by olly

  • Resolution set to fixed
  • Status changed from assigned to closed

Merged xapian-core changes in r16319 and bindings changes in r16320.

Note: See TracTickets for help on using tickets.