Opened 15 years ago

Closed 6 years ago

Last modified 2 years ago

#374 closed defect (fixed)

IPv6 support

Reported by: Olly Betts Owned by: Olly Betts
Priority: low Milestone: 1.5.0
Component: Backend-Remote Version: SVN trunk
Severity: normal Keywords:
Cc: james@… Blocked By:
Blocking: Operating System: All

Description

Currently the TCP variant of the remote backend assumes IPv4, as does replication (which uses the same networking code).

Ideally we should work with IPv6 on platforms which support it (which should be most modern ones), but still support IPv4 on platforms which don't.

This isn't a hugely high priority though, as the vast majority of remote backend users will be using it over local networks. Also, we haven't had any user enquiries about this so far.

Marking for milestone:1.1.7 for now, though I don't feel strongly that this has to be done before 1.2.0.

Change History (18)

comment:1 by James Aylett, 15 years ago

Supporting ipv6 per se isn't terribly difficult; the key thing is to stop using gethostbyname() and start using getaddrinfo(), which conveniently returns all the details required to both call socket() and then either connect() or bind() (depending on how it's called; there's a AI_PASSIVE flag that switches between the two modes).

However there's deeper work that needs doing to do this properly. getaddrinfo() returns a linked list of struct addrinfo*s, and really we should bind to all of them (in the server) or try each in turn (in the client). The client isn't terribly difficult, but the server will need to move from a simply blocking accept() call to use something like select() in order to be able to service things appropriately.

That will probably be okay by itself (select() was in 4.2BSD and is supported by Windows, see some sample code and the Winsock function reference, both on MSDN), and I'd argue we don't need to worry about things like pselect() under linux, or kevent() / epoll() and friends *right now*. (At some point we may do, given that we're using the networking code in the replication system. It might be worth pulling the APR in to handle that once we hit there.)

I've tested the getaddrinfo() approach with the simplest possible approach, of calling bind() only on the first result of getaddrinfo(); however this isn't a great idea, because people will expect (eg) xapian-tcpsrv --port 9000 --interface localhost testdb to be accessible over ipv4. On my Mac, at least, the first localhost address returned is the ipv6 ::1. (We could add logic that prefers ipv4, or a switch or something, but that all feels unnatural and moving away from the ideal of it just doing the most obvious thing.)

Incidentally, getaddrinfo() has the added benefit of resolving service names as well as numeric ports. This suggests that the API of TcpServer needs to change to accommodate this.

I'm quite happy to do this work at some point (say on a branch we can point the autobuilders at to get the portability working). We should discuss what the "right" behaviour of the server is when a hostname resolves to multiple addresses, and when a hostname resolves to multiple addresses across different address families. As suggested above, my feeling is that binding to all of them across all families, and exiting if any fail, is the correct approach. (For clients, trying each in turn until one connects is probably fine; there may be some interactions with replication clients that may affect this, particularly around reconnecting?)

comment:2 by James Aylett, 15 years ago

Cc: james@… added

comment:3 by Olly Betts, 15 years ago

We already use select() in portable code, there's no reason to mess around here. APR is Apache 2 anyway, which (probably) isn't GPL2 compatible (IIRC Apache think it is, but the FSF don't).

As to the "right" behaviour, surely this must already have been decided? If not by a formal spec, then by the behaviour of the majority of existing clients/servers with IPv6 support.

But FWIW, it doesn't actually seem to be a problem if xapian-tcpsrv --port 9000 --interface localhost testdb isn't visible via 127.0.0.1 as the only client is ours which would also be IPv6 capable...

comment:4 by James Aylett, 15 years ago

Good point about APR and select(); that makes that side simple.

The most common behaviour (AFAICT) is to bind to everything that matches. bind9 works differently by instead evaluating all active interfaces and binding to each available ip address directly (at least by default on Debian), which always struck me as a little weird.

The only time I can see it making a real difference is if people need to port forward or encrypted tunnel the connections. That's rare, and providing we document behaviour not a huge problem.

On atreus, the following do so (when binding to 0.0.0.0):

rsync, bind (for a second port, bound to localhost), exim

sshd, imapd, apache only bind to ipv6. Since this still works for ipv4, we could see what happens if we favour ipv6 over ipv4. (linux appears to have extension flags to getaddrinfo() to make this easier, but it's not terribly difficult to do ourselves.)

comment:5 by Olly Betts, 15 years ago

Milestone: 1.1.71.2.0
Priority: normallow

I don't think any users have asked about ipv6 support, and remote is usually used on LANs, so bumping to 1.2.x. James said he might look at his, and if we have a plausible patch before 1.2.0, I'm still happy to consider it.

comment:6 by Olly Betts, 11 years ago

Milestone: 1.2.x1.3.x

I've still not seen any user demand, so -> 1.3.x for now. If we have a clean patch, we can consider backporting if 1.2.x is still alive then.

comment:7 by Olly Betts, 9 years ago

Milestone: 1.3.x1.3.3
Status: newassigned

I've been looking into this some more.

The Linux gethostbyname(3) man page says:

gethostbyname(), gethostbyaddr(), and h_errno are marked obsolescent in [the POSIX.1-2001] standard. POSIX.1-2008 removes the specifications of gethostbyname(), gethostbyaddr(), and h_errno, recommending the use of getaddrinfo(3) and getnameinfo(3) instead.

And the getaddrinfo(3) man page notes that this is specified by POSIX.1-2001.

Given this was standardised 13 years ago and is the standard way to support IPv6, it seems hard to conceive that there are relevant systems without getaddrinfo() at this point.

The Gnulib docs note about getaddrinfo():

This function is missing on some platforms: HP-UX 11.11, IRIX 6.5, OSF/1 5.1, Solaris 7, Cygwin 1.5.x, mingw, MSVC 9, Interix 3.5, BeOS.

https://www.gnu.org/software/gnulib/manual/html_node/getaddrinfo.html

Looking up the versions on that list, they're all out of support - often by many years.

It's unclear to me if MSVC 9 is in support or not, but it's irrelevant as we decided to require C++11 support for 1.3.x and up. As for OS versions, MSDN seems to indicate that getaddrinfo() was added in XP.

Checking the headers of the MXE install on the buildbot vm, that has getaddrinfo().

So I'd say we don't try to fall back to gethostbyname() - it's just going to complicate the code and probably for no reason. If there are really relevant systems with a C++11 compiler and without getaddrinfo(), we can look at supporting them (and in that case, Gnulib's approach of a compatibility implementation of getaddrinfo() which calls gethostbyname() under the covers seems a reasonable approach).

Another benefit of this change is that getaddrinfo() is thread-safe, while gethostbyname() isn't.

Incidentally, getaddrinfo() has the added benefit of resolving service names as well as numeric ports. This suggests that the API of TcpServer needs to change to accommodate this.

I don't think we need to worry about named ports - it doesn't make much sense to run a xapian-tcpsrv on a port reserved for another protocol, which is what that would allow, so I suggest we just set the AI_NUMERICSERV flag.

comment:8 by Olly Betts, 9 years ago

I have a patch for this, which is mostly there.

The main remaining issue is that I don't see how you're meant to handle binding to a specified hostname/ip address. The latest RFC and POSIX and the Linux man page all carefully say that you need to use AI_PASSIVE if you want to use bind() and that if node isn't NULL then AI_PASSIVE is ignored.

Trying to check the entries in the returned list against the hostname/ip address string seems rather tricky to get right.

The previous RFC didn't have the bit about ignore AI_PASSIVE when node isn't NULL, which makes a lot more sense. AFAICT, the RFC was changed to align it with POSIX.

comment:9 by Olly Betts, 9 years ago

It seems the wording is just rather poor, and it works as you would expect, rather than how the documentation appears to say.

It's passing the testsuite, but I want to do some additional tests to make sure that different combinations work smoothly - the testsuite can only really test on localhost.

comment:10 by Olly Betts, 9 years ago

Still some teething issues with IPv6, but as a first step, I've merged the changes to use getaddrinfo() instead of gethostbyname(), but set to still look for only IPv4 addresses [902cff350432e110c1b134dafa846c005451e383].

comment:11 by Olly Betts, 9 years ago

Milestone: 1.3.31.3.4

comment:12 by Olly Betts, 8 years ago

Milestone: 1.3.41.4.x

IPv6 support is a worthy goal, but I still don't think we've had anybody actually ask about it.

This really shouldn't block 1.4.0, especially as it shouldn't break the ABI.

comment:13 by Olly Betts, 8 years ago

I've pushed the WIP patch to branch ipv6.

comment:14 by Olly Betts, 7 years ago

Milestone: 1.4.x1.5.0

1.4.1 will give a saner error for IPv6 in stub database files - previously the first ':' in the IPv6 address was taken as the delimiter for the port number ([94771d0a2459d0550e68ba88b3f8eb820af8ef3d]).

I'm intending to merge the ipv6 branch into master, but not backport to 1.4.x until (a) there's some evidence of demand, or (b) we're very confident that it's not going to cause issues for people using IPv4.

comment:15 by Olly Betts, 6 years ago

Resolution: fixed
Status: assignedclosed

I've updated the patch and merged to current master in f51eaf9bba4a8de234fe6eecc30e4d6e4e73e986.

It passes the testsuite at least.

Testing in real world use may shake out some teething problems, but I think we can close this ticket now.

There's still not been anyone actually asking for this that I've seen, so I don't currently plan to backport to 1.4.x given the potential disruption to existing working setups.

comment:16 by Olly Betts, 6 years ago

A few follow-up commits have improved things, but this has still broken the appveyor builds, not entirely clear why.

I've not tested yet but I suspect this breaks the build on OpenBSD, which apparently doesn't support AI_V4MAPPED for rather specious "security" reasons.

comment:17 by Olly Betts, 5 years ago

The appveyor failures also seem to be due to AI_V4MAPPED-related limitations - Microsoft Windows defaults to setting IPV6_V6ONLY on IPv6 sockets so AI_V4MAPPED gives you an mapped IPv4 address which then fails to work.

You can unset IPV6_V6ONLY with setsockopt() but the constant isn't reliably defined in mingw headers.

I've instead gone for specifying AF_UNSPEC in the hints to getaddrinfo() so we resolve IPv4 addresses as AF_INET (in 260e18bc7b6219d6fff74e44b6eb652197ac4eec).

comment:18 by Olly Betts, 2 years ago

We've had our first inquiry about backporting this:

https://lists.xapian.org/pipermail/xapian-discuss/2022-April/009944.html

Sam seems to have found a workaround by reverse proxying connections.

At this point I'm leaning towards not backporting given the risks of breaking currently working setups and that the next release series will hopefully commence later this year.

If there's wider demand we can think about how to try to minimise the risks of such breakage (perhaps off by default with a way to enable support at runtime).

I can also probably pull out a patch combining all the relevant changes which applies to 1.4.x if that's helpful to anyone wanting IPv6 who's happy to build a custom 1.4.x version to support it.

Note: See TracTickets for help on using tickets.