Opened 15 years ago
Closed 13 years ago
#481 closed enhancement (fixed)
Merge geospatial branch
Reported by: | Richard Boulton | Owned by: | Richard Boulton |
---|---|---|---|
Priority: | highest | Milestone: | 1.3.0 |
Component: | Library API | Version: | SVN trunk |
Severity: | normal | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description (last modified by )
We've had a few requests for the features in the geospatial branch, so I've put together a clean patch against trunk of what I think are the stable and reasonably "uncontroversial" parts of the geospatial branch. This patch omits several things from the geospatial branch:
- The LatLongDistanceMatchDecider, since this is superceded by functionality in the LatLongDistancePostingSource, and we're moving towards deprecating MatchDeciders.
- The parser for latitude-longitude coordinates. This is quite handy to allow lots of arbitrary formats for coordinates to be parsed into numeric values, but is a bit complex, and is a second lemon parser to be maintained, so I'd rather consider merging it at a later date. Without the parser, coordinates have to be supplied as a pair of doubles, rather than having the option of providing them as strings.
- The half-completed HTM based generation of "acceleration" terms.
I've tidied up the documentation and tests, and fixed a few small issues in the patch, so I think this is ready to be reviewed and considered for merging to trunk. I'd be happy if it went in for 1.2.1, or at least got considered for it, so marking that as the milestone for now.
Attachments (7)
Change History (35)
by , 15 years ago
Attachment: | geospatial_clean1.patch added |
---|
comment:1 by , 15 years ago
Description: | modified (diff) |
---|
(Removed accidental start of sentence at the end of the description.)
comment:2 by , 15 years ago
Description: | modified (diff) |
---|
Some minor code-style issues:
- We're using C++ forms of C headers now (so <cmath> not <math.h>) unless there's a good reason.
- Use str() not om_tostring() - the latter is being phased out.
- Generally in .cc files we use "using namespace std;" rather than explicitly qualifying with std:: everywhere, as that's more readable.
Using XAPIAN_VISIBILITY_DEFAULT on inlined API methods seems rather odd. I slightly wonder if we really want/need a Xapian API feature to convert from nautical miles to metres, but at least it is inlined so isn't adding more symbol relocations.
Unserialisation errors should now throw UnserialisationError rather than NetworkError. This patch catches NetworkError in at least two places and rethrows as InvalidArgumentError. I think this catch and rethrow should just be removed now we have UnserialisationError.
I wonder if it would be better to specify in the documentation that the lat/long coordinates should use the WGS84 datum rather than risking adding to the sadly rather widespread misunderstanding that a lat/long coordinate pair alone specifies a geographical location. WGS84 is what they almost inevitably will be in anyway given the prevalance of GPS, and means that if we actually later want to extend the geospatial support in a way which requires the datum, then we haven't tied our own hands. Meanwhile users can ignore the datum, but it's then their problem not ours.
What does code coverage look like?
comment:3 by , 15 years ago
Status: | new → assigned |
---|
I've updated the patch with the style issues, and I've removed the nautical miles conversion. I've also removed the "meters-feet" conversions - I think the only one most users are likely to want is the "miles-meters" conversion, and if they need any others they just need to look up the constants.
The main place that non-WGS84 data I've come across comes from is from Ordnance Survey maps - but I think it's reasonable to recommend that WGS84 is used unless there's a good reason not to, so I've updated the documentation to say that.
Code coverage was fairly good when I last checked, but I'll generate an updated report shortly.
comment:4 by , 15 years ago
I failed to fix the unserialisation error stuff in the previous patch - now updated (in geospatial_clean3.patch). I also removed the
LatLongCoords::unserialise(const char * ptr, const char * end)
form, since it doesn't really add anything over the
LatLongCoords::unserialise(const std::string & value)
form (other than avoiding the need to make a string if you only have the ptr and end - but that's not the case where this is normally called (ie, when unserialising a value) so I think it's better to keep the API a bit cleaner.
comment:5 by , 15 years ago
Code coverage is fairly good, except for the PostingSource which has no coverage of the skip_to() and check() methods. I'll try and find time to write some tests to cover those soon, but other than that I think this patch is a pretty good candidate for pushing to trunk.
One potential API addition you might not be keen on, is in include/xapian/postingsource.h, where I've added a
virtual PostingSource * unserialise_with_registry(const std::string &s, const Registry & registry) const;
method. I think we need this method, but the name is a bit horrible: unlike the Query::unserialise() method, we can't just overload the unserialise() method with a variant which takes the registry, because this doesn't work with inheritance (as far as I can see). Any better ideas welcome, but I don't think this is too horrible an API to live with.
comment:6 by , 14 years ago
Some initial thoughts:
- It seems we aren't likely to be ready to merge this without a fair bit more fettling, so I'd suggest we create a "geo-stuff to merge branch" for this. A VCS is good for controlling versions - multiple patches attached to a ticket isn't! I have to describe any suggested changes in a comment, or else check out another copy of the source tree, apply your patch, take a copy, make the change, and generate a patch to attach here for you to apply to your patched tree!
- The planet name should really be "Earth" not "earth".
- In GreatCircleMetric it doesn't seem useful to be able to specify the radius of the Earth - I'm struggling to see why anyone would (the closest I can come is indexing features on another planet). But if we do allow it, the constructor for it really should be marked
explicit
, and the description fixed ("The radius of to use, in metres").
- We decided to use @ for doxygen comments (possibly this code predates that) - see HACKING:
We've decided to prefer to use @ rather than \ to introduce doxygen commands (the choice is essentially arbitrary, though \ introduces C/C++ escape sequences so @ is likely to make for easier to read mark up for C/C++ coders). A *lot* of existing comments use \ currently but please use @ for new comments.
GreatCircleMetric::operator()
should probably clamp to 1.0 before sqrt() rather than after (since that avoids an expensive sqrt() calculation in the clamping case).
- At least one name() method returns string("literal") whereas at least one other returns "literal" - probably better to be consistent, and I think we currently just return a string literal in similar cases.
- There seem to be two identical
LatLongDistancePostingSource
ctor definitions. If they are different, I can't see how, though I'm surprised that the compiler doesn't complain.
- Why does longitude of -0.0 need special treatment? I think that deserves a comment...
- It's better not to use + multiple time to build up a string as this is likely to create a lot of temporary strings - at least with the versions of GCC I've experimented with, you get noticeably smaller code by using "+=" to build up a string (e.g.
throw InvalidArgumentError("LatLongMetric " + new_metric_name + " not registered");
)
- LatLongCoords seems very clumsy. I'd imagine the common case is just dealing with single locations, but the API forces users to wrap those in this class. Also we don't currently expose STL iterators (or even STL structures apart from std::string I think) elsewhere in our API, so starting to do so is something to consider.
- I wonder if there should be a way to say what the maximum interesting distance is? For example, in 2D Euclidean space, if |x1-x2| > max_dist, there's no need to go to the effort of computing the exact distance. I suspect there are similar crude bounds possible with most metrics (e.g. (difference in latitude) * min(metres_per_latitude) > max_dist).
comment:7 by , 14 years ago
Thanks for the initial review:
- I've created the geotomerge branch, and applied the latest patch to it. I've then applied some further changes based on your comments.
- Changed earth to Earth.
- The use of being able to specify the radius for the
GreatCircleMetric
was precisely to be able to index features on another planet. For example, I played with a small database of features on Mars, which this worked nicely for. I've changed the constructor to be markedexplict
, and fixed the documentation comment, though. - The code does predate the decision to use @ - I've updated all the occurrences, I believe.
- Changed the clamping to 1.0 for
GreatCircleMetric
to be before the sqrt(). Also wrapped it in "rare()". - Changed the return value of the
name()
methods to just return a string literal, which is what other classes already in trunk with name() methods do. - The private
LatLongDistancePostingSource
constructor takes aLatLongMetric *
, the external takes aLatLongMetric &
. The public ctor clones the metric immediately (so gets a metric that it is responsible for deleting). The internal one is passed a metric as the result of serialisation or cloning, and assumes ownership of the pointer. I'm not sure this is a problem; we could just allow the user to pass the metric as a pointer, but the current interface avoids confusion about object ownership - the reference passed only needs to be valid for the duration of the constructor call, which is a fairly normal assumption for values passed by reference. - I don't recall why a longitude of -0.0 was being converted to 0.0. Removing that doesn't seem to cause any problems (all tests pass OK still, both on the original geospatial branch, and the "geotomerge" one), and the serialised forms are identical. Therefore, I've removed this special-case.
- I've changed all the occurrences of string building using "+" that I could find to use += instead.
- I don't think there's a particular need performance-wise to be able to specify the maximum interesting distance. It would be quite possible to make a metric subclass in future which took an additional parameter (at construction time) to indicate the maximum interesting distance (or perhaps a bounding box of coordinates worth considering). I don't see that there's any need to do this now, though.
Whilst going through these comments, I also noticed that LatLongDistanceKeyMaker
had an unfinished documentation comment: "If a document contains no". I've finished it off.
This leaves two of your comments unaddressed:
- Can we tidy up the API to avoid forcing users to wrap single locations in
LatLongCoords
? - Should we avoid exposing STL iterators?
I'll take a look at these remaining two shortly.
comment:8 by , 14 years ago
Users of the C++ API aren't actually forced to use LatLongCoords explicitly:
- when serialising, to store values, a serialised LatLongCoord has the same representation as a serialised LatLongCoords object containing the same single coordinate, so a LatLongCoord can be serialised and stored directly. (This is noted in the documentation, too.)
- when creating a PostingSource, Metric or KeyMaker, a LatLongCoord can be passed in place of the LatLongCoords reference required, and the non-explicit
LatLongCoords(const LatLongCoord & coord)
constructor will be used to convert the parameter automatically.
However, the automatic conversion won't work in other languages; so, for example, python users will be forced to do the wrapping explicitly. We could provide a wrapper in such languages, of course.
I'm not sure what a cleaner approach would be, whilst retaining the ability to calculate distances based on sets of points.
Regarding exposing of STL iterators, it would be quite easy to write a simple iterator for iterating the contents of a LatLongCoords; I'll take a go at that in a bit. Doing so should make the SWIG wrappers better, so it's probably worth it just for that.
comment:9 by , 14 years ago
I've changed LatLongCoords
to use a vector internally, and preserve the order of coordinates in it. This makes it possible to use one to represent an boundary, instead of just a set of points, so future metrics could do nearest distance between boundary lines, instead of points.
Of course, for a metric to do a calculation like that would require making the double LatLongMetric::operator()(const LatLongCoords & a, const LatLongCoords &b) const;
method virtual, which would then conflict with the operator()
which takes LatLongCoord
arguments, so one of these would need to be renamed.
comment:10 by , 14 years ago
An alternative approach would be to get rid of LatLongCoords
altogether. Instead, LatLongMetric
would take a set of coordinates in serialised form. To store multiple coordinates in a slot, users would simply append serialised coordinates together (which is what LatLongCoords
currently does).
Also, most of the time one of the coordinates supplied to LatLongMetric
is a "center" coordinate, and is fixed for the duration of a query. If that was supplied to the metric (at construction time, or shortly after that), the metric might be able to optimise calculations for that center (by building bounding boxes of interest, etc). So perhaps we want metrics which have a distance calculation method which takes a single coordinate, and returns the distance from the stored center.
I think we'd want to preserve the ability of LatLongMetric
to calculate a pointwise distance based on a pair of LatLongCoord
objects, though: future code to generate acceleration terms (like that on the geospatial branch) needs to do such calculations frequently.
On a related note, the LatLongCoord::unserialise(const char ** ptr, const char * end)
method can't currently be used from the bindings, since the const char ** ptr
type can't be created there. I think this probably needs special wrapping, to return a list of the coordinates.
comment:12 by , 14 years ago
Milestone: | 1.2.2 → 1.2.3 |
---|
We're releasing a quick 1.2.2 to fix some portability issues in 1.2.1, so updating milestone.
comment:14 by , 14 years ago
The attachment is a python prototype of a geocoding format which gives better than 2m precision in at most 6 bytes, and can be truncated to as little as 2 bytes to give a more approximate but shorter encoding of the same point.
comment:17 by , 14 years ago
For what its worth: This has been running in production over at my place since november last year (well, roughly 4-5 months). I have a patched version of 1.2.2 and we haven't run into any issues (yet? :). Thanks for the great work!
comment:18 by , 13 years ago
Milestone: | 1.2.6 → 1.2.x |
---|---|
Priority: | normal → highest |
Rather than repegging this repeatedly, setting milestone to 1.2.x and priority to highest.
comment:20 by , 13 years ago
Richard pointed out on IRC that there's no copyright or licensing information on my prototype - to be clear I'm happy to license the code in geocode.py as MIT/X.
comment:21 by , 13 years ago
Added a C++ implementation; not particular optimised, but does 1,000,000 encoding-decoding cycles a second for random coordinates on my (slowish) machine, so probably good enough.
by , 13 years ago
Attachment: | geocode.2.cc added |
---|
Improved (speed) implementation of geo encoding for C++.
comment:22 by , 13 years ago
The special cases are fragile, as implemented; I think that something at latitude -89.999999999 will be rounded to having latitude -90, but not trigger the special case (so the longitude will be stored, too). This should be easy to fix.
In performance testing, I noticed that a significant part of the time taken by geo_decode (around 30%) is doing the string comparison for "if (value == "\xff\xff")". It occurs to me that it might be better not to have the special short representations of -90 and +90: instead, always use 6 bytes for the representation. This has the advantage that multiple values could be stored in the slot with no need to mark terminators; and that when encoding multiple values, a 6 byte fixed buffer could be used, instead of a string (which would probably be faster). We could also just do the special case handling at encoding time, checking the integer representation of the number of 16ths of a second in the latitude, and setting the longitude to zero if that is the value for -90 or +90.
It also occurs to me that it might be better to store longitude values as more significant that latitude. With the current code, a range query on the encoded values can be used to check if the coordinate is in a given degree range of latitude values, but I suspect being able to do a quick range query to check a range of longitudes would be more useful in many cases; for example, it would enable US and european coordinates to be distinguished quickly; there is probably a greater range of longitude degree values in use than latitudes.
by , 13 years ago
Attachment: | geocode.3.cc added |
---|
Updated encoding to swap priority of lat and long, to make the end rounding safe, to make the encoding append to a string, and to make the decoding take a pointer and length
comment:23 by , 13 years ago
geocode format should probably have the discontinuity in longitude values at -180 instead of 0, since the world is less populated there. We should also implement a bounding box checker (possibly that works only at the degree level), or at least a checker for longitude ranges, so that all users don't need to write their own code to cope with the discontinuity in encoded ranges. Not sure of the appropriate API for a bbox checker, since we want to avoid effectively decoding the coordinate twice. It seems that decoding could perhaps be done which doing a bbox check, so I think a good API might be a function takes a bbox, and which either returns a decoded coordinate, or a value of false to indicate that the coordinate is outside the box.
comment:24 by , 13 years ago
I've put this code up as a git repository on github now (and I've separated the test code from the core code, and added a documented header file for it, too).
comment:25 by , 13 years ago
I've added the a class for checking bounding boxes to the git repository. This does a range check first on the first encoded byte, and only performs a full decode if neccessary. The range check optimisation appears to give around a 10% speedup in a very rough test (running a lightly modified testsuite under "time") for decoding evenly distributed coordinates and bounding boxes, even if the class is only used once, and gives a greater speedup for smaller boxes, and if the class is used many times. Decoding 10,000,000 coordinates took 9 seconds without the optimisation, 8.1 seconds with the optimisation, and 5.5 seconds if each bounding box was limited to 10% of the circumference of the earth in size. Further optimisations are probably possible with the same API, but I think this code is ready to be integrated into the geospatial code in Xapian now.
comment:26 by , 13 years ago
I've integrated this code with the geospatial code in Xapian, and done considerable cleanup now. My integration work is in a "geotomerge2" branch in my xapian fork on github. To see the differences compared to current master, go to:
https://github.com/rboulton/xapian/compare/master...geotomerge2
I believe this code is now ready to be applied to master; I would welcome review, particularly of the interfaces in the public header file - given the impending GSoC summit I understand that may take a couple of weeks, but am keen for it to go into trunk before it bitrots again.
comment:27 by , 13 years ago
LatLongCoord::operator<
does:
bool operator<(const LatLongCoord & other) const { if (latitude < other.latitude) return true; return (longitude < other.longitude); }
I think this probably should have a middle line of:
if (latitude > other.latitude) return false;
As it stands, if a has lat 10, long 20 and b has lat 20, long 10 then both a < b
and b < a
are true, which is (a) surprising and (b) I believe leads to undefined behaviour with the STL. Is this just an oversight? Presumably this method exists so that they can be put in containers like std::map, so the actual ordering is mostly arbitrary, so long as it's consistent.
A bit unsure about having a naked struct for LatLongCoord - we've avoided this elsewhere in the API.
The unserialise() methods for LatLongCoord and LatLongCoords work differently to our existing unserialise methods, which all return a new object or object pointer. Perhaps that's sensible, but the inconsistency grates a bit.
There's a documented return value here for a (private) method which actually returns void:
/** Calculate the distance for the current document. * * Returns true if the distance was calculated ok, or false if the * document didn't contain a valid serialised sequence of coordinates in * the appropriate value slot. */ void calc_distance();
LatLongDistanceKeyMaker's ctors lack documentation comments.
10E10 seems a bit of an odd default value (1e11 seems a more natural way to write the
same thing). But perhaps we should just split the ctors and initialise
defkey(9, '\xff')
when no default distance is specified (which is more efficient
than defaulting the distance to HUGE_VAL and then converting it, and avoids pulling in
<cmath> in an API header).
I've also fixed a pile of documentation comment typos in my local tree. I'll try to sort out a merge request for changes.
comment:28 by , 13 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
First patch to merge the geospatial branch.