Opened 10 years ago

Closed 5 years ago

#346 closed defect (fixed)

Python 3 support

Reported by: olly Owned by: richard
Priority: highest Milestone: 1.3.3
Component: Xapian-bindings (Python) Version: SVN trunk
Severity: normal Keywords:
Cc: kelm, lilydjwg@…, jocl1@… Blocked By:
Blocking: Operating System: All

Description (last modified by olly)

This is what I'm aware of that still needs to be done:

  • Resolve why "postingsource" testcase segfaults under threads (it's fine in the non-threaded case). As of r17524, we've disabled running it under threads. It's unknown why it fails.

In ToDoFor1.1.0?, Richard wrote:

Support for python 3. May already work, but needs testing, and may need some tweaking. Could do with someone going through all the places where strings are supplied and returned, and deciding whether the API should expect byte strings or unicode strings (or both).

Assigning to you, since you added it to the list and I don't really know what to look for...

Attachments (7)

py3-xapian12765.patch (13.6 KB) - added by kelm 10 years ago.
py3-xapian12765-updated.patch (6.3 KB) - added by olly 10 years ago.
Patch with merged and obsoleted changes removed
py3-xapian12765-updated2.patch (3.3 KB) - added by olly 10 years ago.
Second update of patch
sabrina-python3-updated.patch (7.1 KB) - added by olly 7 years ago.
Patch from Sabrina, updated after updating python3 tests from python2 with 2to3
sabrina-python3-update2.patch (14.3 KB) - added by sabrina 7 years ago.
This patch makes smoketest3.py work. It also make sortable_serialise and sortable_unserialise functions work well
sabrina-python3-update.patch (13.3 KB) - added by sabrina 7 years ago.
This patch makes smoketest3.py work. It also make sortable_serialise and sortable_unserialise functions work well
sabrina-python3-git-trunk.patch (9.5 KB) - added by sabrina 7 years ago.
This patch is the latest git trunk patch that makes smoketest3.py work.

Download all attachments as: .zip

Change History (95)

comment:1 Changed 10 years ago by olly

Python 3 support is a new feature in 1.1.0, so this could be marked as experimental (or just have possible bugs) in 1.1.0 (and I think you'd added it to the "could be experimental" section on the todo list).

comment:2 Changed 10 years ago by olly

  • Milestone changed from 1.1.0 to 1.1.1

Richard said on IRC that he's happy to mark Python 3 support as "experimental" for now.

comment:3 Changed 10 years ago by olly

Richard said on IRC before the 1.1.0 release:

python3.0 bindings don't even compile, currently with python3.0.1, anyway (the latest release)

Likely fixed by a newer SWIG - there have been some recent Python 3 fixes.

comment:4 Changed 10 years ago by olly

SVN trunk r12540 uses a newer SWIG and the bindings do now compile (at least with the Python "3.0.1+" in Ubuntu jaunty. The testsuite fails though.

The initial problem with that both smoketest3.py and pythontest3.py have from .testsuite import * instead of from testsuite import *. Fixing that, make check VERBOSE=1 gives me:

Running test: all... FAILED

smoketest3.py:35:<class 'TypeError'>: in method 'new_Stem', argument 1 of type '
std::string const &'
    33     expect(v2, v, "Unexpected version output")
    34 
->  35     stem = xapian.Stem("english")
    36     expect(str(stem), "Xapian::Stem(english)", "Unexpected str(stem)")
    37 
Traceback (most recent call last):
  File "/home/olly/svn/xapian/xapian-bindings/python/testsuite3.py", line 201, i
n runtest
    test_fn()
  File "/home/olly/svn/xapian/xapian-bindings/python/smoketest3.py", line 35, in
 test_all
    stem = xapian.Stem("english")
  File "/home/olly/svn/xapian/xapian-bindings/python/xapian/__init__.py", line 4
879, in __init__

comment:5 Changed 10 years ago by olly

  • Milestone changed from 1.1.1 to 1.1.2

Changed 10 years ago by kelm

comment:6 Changed 10 years ago by kelm

With the attached patch I am able to run (most of) the testsuite successfully. This is the output from running it on Windows, MSVC 9.0:

------ Build started: Project: xapian_python.mak, Configuration: Debug Win32 ------
Performing Makefile project actions
...snip...
 "c:\Program Files\Python30\python_d.exe" smoketest.py
Running test: all... ok
c:\Users\kelmpte\My Documents\projects\dev\xapian-12765\xapian-core\win32\Debug\Python30\testsuite3.py:115: DeprecationWarning: object.__new__() takes no parameters
  callable(*args)
1 tests passed, no failures
[50288 refs]
 "c:\Program Files\Python30\python_d.exe" pythontest.py
Running test: allterms_iter... ok
Running test: dbdocument_iter... ok
Running test: director_exception... ok
Running test: eset_iter... ok
Running test: exception_base... ok
Running test: get_uuid... ok
Running test: matchingterms_iter... ok
Running test: metadata_keys_iter... ok
Running test: mset_iter... ok
Running test: newdocument_iter... ok
Running test: position_iter... ok
Running test: postinglist_iter... ok
Running test: postingsource... FAILED
pythontest3.py:981:<class 'TypeError'>: Swig director type mismatch in output value of type 'Xapian::doccount'
   979 
   980     enquire = mkenq(db)
-> 981     mset = enquire.get_mset(0, 10)
   982 
   983     expect([item.docid for item in mset], [1, 3, 5, 7, 9])
Traceback (most recent call last):
  File "c:\Users\kelmpte\My Documents\projects\dev\xapian-12765\xapian-core\win32\Debug\Python30\testsuite3.py", line 202, in runtest
    test_fn()
  File "c:\Users\kelmpte\My Documents\projects\dev\xapian-12765\xapian-core\win32\Debug\Python30\pythontest3.py", line 981, in test_postingsource
    mset = enquire.get_mset(0, 10)
Xapian version: 1.1.0
Platform: Windows XP (5.1.2600)
When reporting this problem, please quote all the preceding lines from
"pythontest3.py:981" onwards.
Running test: postingsource2... FAILED
pythontest3.py:997:<class 'UnicodeDecodeError'>: 'utf8' codec can't decode byte 0xaa in position 0: unexpected code byte
   995     for id in range(10):
   996         doc = xapian.Document()
-> 997         doc.add_value(1, xapian.sortable_serialise(vals[id]))
   998         db.add_document(doc)
   999 
Traceback (most recent call last):
  File "c:\Users\kelmpte\My Documents\projects\dev\xapian-12765\xapian-core\win32\Debug\Python30\testsuite3.py", line 202, in runtest
    test_fn()
  File "c:\Users\kelmpte\My Documents\projects\dev\xapian-12765\xapian-core\win32\Debug\Python30\pythontest3.py", line 997, in test_postingsource2
    doc.add_value(1, xapian.sortable_serialise(vals[id]))
Xapian version: 1.1.0
Platform: Windows XP (5.1.2600)
When reporting this problem, please quote all the preceding lines from
"pythontest3.py:997" onwards.
Running test: preserve_enquire_sorter... ok
Running test: preserve_query_parser_stopper... ok
Running test: preserve_term_generator_stopper... ok
Running test: queryparser_custom_vrp... ok
Running test: queryparser_custom_vrp_deallocation... ok
Running test: queryparser_stoplist_iter... ok
Running test: queryparser_unstem_iter... ok
Running test: queryterms_iter... ok
Running test: scale_weight... ok
Running test: serialise_document... ok
Running test: serialise_query... ok
Running test: spell... ok
Running test: synonyms_iter... ok
Running test: termlist_iter... ok
Running test: value_iter... ok
Running test: value_mods... ok
Running test: value_stats... FAILED
pythontest3.py:1023:<class 'UnicodeDecodeError'>: 'utf8' codec can't decode byte 0xaa in position 0: unexpected code byte
  1021     for id in range(10):
  1022         doc = xapian.Document()
->1023         doc.add_value(1, xapian.sortable_serialise(vals[id]))
  1024         db.add_document(doc)
  1025 
Traceback (most recent call last):
  File "c:\Users\kelmpte\My Documents\projects\dev\xapian-12765\xapian-core\win32\Debug\Python30\testsuite3.py", line 202, in runtest
    test_fn()
  File "c:\Users\kelmpte\My Documents\projects\dev\xapian-12765\xapian-core\win32\Debug\Python30\pythontest3.py", line 1023, in test_value_stats
    doc.add_value(1, xapian.sortable_serialise(vals[id]))
Xapian version: 1.1.0
Platform: Windows XP (5.1.2600)
When reporting this problem, please quote all the preceding lines from
"pythontest3.py:1023" onwards.
Running test: valuesetmatchdecider... ok
Running test: weight_normalise... ok
c:\Users\kelmpte\My Documents\projects\dev\xapian-12765\xapian-core\win32\Debug\Python30\xapian.py:4879: DeprecationWarning: object.__new__() takes no parameters
  self._termfreq = self._iter._iter.get_termfreq()
c:\Users\kelmpte\My Documents\projects\dev\xapian-12765\xapian-core\win32\Debug\Python30\xapian.py:3414: DeprecationWarning: object.__new__() takes no parameters
  _xapian.Query_swiginit(self,_xapian.new_Query(*args))
c:\Users\kelmpte\My Documents\projects\dev\xapian-12765\xapian-core\win32\Debug\Python30\pythontest3.py:1097: DeprecationWarning: object.__new__() takes no parameters
  val = db.get_document(docid).get_value(1)
30 tests passed, 3 tests failed
[54773 refs]
NMAKE : fatal error U1077: '"c:\Program Files\Python30\python_d.exe"' : return code '0x1'
Stop.
Project : error PRJ0019: A tool returned an error code from "Performing Makefile project actions"
Build log was saved at "file://C:\Users\kelmpte\My Documents\projects\dev\xapian-12765\xapian-core\win32\Debug\Python30\BuildLog.htm"
xapian_python.mak - 2 error(s), 0 warning(s)
========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========

Unfortunately I haven't had a chance to look into the remaining errors. The sortable_serialise related error seems to stem from improper unicode handling. Patch against svn trunk 12765.

comment:7 Changed 10 years ago by olly

sortable_serialise() doesn't return UTF-8 encoded data.

comment:8 Changed 10 years ago by olly

  • Priority changed from normal to high

Not supporting Python 3 would be a sad omission for 1.2.0 I feel, and we've had actual user interest in it. Also, there's a patch here which sounds like it gets us much closer, so this probably doesn't need a lot more work. But this could be delayed, so high priority.

Richard: please feel free to override as you know more about the Python stuff than I do.

comment:9 Changed 10 years ago by olly

  • Cc kelm added

Adding Peter to the cc: as I noticed he isn't there (note comment:7).

I've applied the msvc changes (blindly) and worked in some of the others, or addressed them differently.

I'll attach a patch with the changes I haven't dealt with yet in a moment.

Changed 10 years ago by olly

Patch with merged and obsoleted changes removed

comment:10 Changed 10 years ago by olly

This is what I'm aware of that still needs to be done:

  • The next stuff doesn't seem to work for me with Python 3, and it looks like it breaks Python 2 support as it currently is. It seems we should be able to define both next() and __next__() methods and each of Python 2 and Python 3 will use its own one and ignore the other.
  • The __len__() stuff in the patch only needs to handle a single argument ever it seems to me, so this should do (I didn't check if this fix is compatible with Python 2, but I think it should be):
MSet.__len__ = lambda self: MSet.size(self)
  • Unicode handling needs sorting out properly and not just by commenting out testcases which detect the current problems! I think Richard has some ideas.
  • I get lots of warnings like:
xapian/__init__.py:4221: DeprecationWarning: object.__new__() takes no parameters

comment:11 Changed 10 years ago by kelm

The patch was just enough to get Xapian functional for me (for the moment). There was no intention on my side to claim that it it solves everything. Actually, it is obvious that commenting out (very few) testcases cannot be "it". Nevertheless it demonstrated progress in moving towards Py3k support.

Please note that I haven't delved into it for a while, so please bear with me if I forgot something: The new warning stems from an incomplete iterator interface (from py3ks point of view). The next and len changes are essential to comply to the modified py3k iterator interface. I'm no too surprised that it did not work without them...

The reason for changes like:

MSet.__len__ = MSet.size 

to

MSet.__len__ = lambda self, *args: MSet.size(self, *args) 

was to work around a change in py3k - the removal of support for "unbound methods". It might well be that the *args parameter is not needed in this case...

Changed 10 years ago by olly

Second update of patch

comment:12 Changed 10 years ago by olly

  • Description modified (diff)

OK, I've incorporated the __len__ change and sorted out the __next__ changes so that they don't break Python 2 support but still work with Python 3.

I've updated the description with the remaining two items, and pruned the patch again.

Richard seemed to have some ideas on IRC for sorting out the Unicode stuff, and I've no idea about the __new__ stuff, so I'm not planning to work on this further for now, so feel free to.

comment:13 Changed 10 years ago by olly

  • Milestone changed from 1.1.3 to 1.2.0

While it would be nice to have this for 1.2.0, we should be able to do it for 1.2.x without problems. The only issue would be if we wanted to change the behaviour with Python 2.x slightly but that's probably a bad idea anyway.

So bumping to stay on track.

comment:14 Changed 9 years ago by olly

  • Component changed from Xapian-bindings to Xapian-bindings (Python)

comment:15 Changed 9 years ago by richard

  • Status changed from new to assigned

We should probably get the remaining issues sorted out quite soon - python 3.1 has been out for nearly a year now. Marking for milestone 1.2.3, and I'll try and get it done soon.

comment:16 Changed 9 years ago by richard

  • Milestone changed from 1.2.x to 1.2.3

comment:17 Changed 9 years ago by richard

With the latest swig, most of the DeprecationWarnings? disappear: running the testsuite now only produces two. I've tracked their source down, and both come from the SWIG_Python_NewShadowInstance() function, where it calls """PyBaseObject_Type.tp_new""" (ie, calls the new method on the base "object" object). According to http://bugs.python.org/issue1683368 since 2.6 passing any arguments to object.new() gives a warning (see http://www.python.org/download/releases/2.6/NEWS.txt and search for 1683368). The warning won't happen with 2.6 because SWIG only uses that code for python 3.0 or later.

I think we could stop the warning by stopping passing args to tp_new(), but then we wouldn't be using the args at all. I'm not quite sure what this code is trying to do, but I'm sure it's meant to use the args supplied to it!

comment:18 Changed 9 years ago by olly

  • Milestone changed from 1.2.3 to 1.2.4

I want to release 1.2.3, so bumping.

comment:19 Changed 9 years ago by olly

  • Milestone changed from 1.2.4 to 1.2.5

Richard said on IRC to bump this.

comment:20 Changed 8 years ago by olly

  • Milestone changed from 1.2.5 to 1.2.6

Support for Python 3.2 requires patching SWIG (since some C API calls got removed from Python). Richard said on IRC he's done some more work on this, but it's unlikely to be ready in time for 1.2.5, so bumping again.

comment:21 Changed 8 years ago by olly

  • Milestone changed from 1.2.6 to 1.2.x
  • Priority changed from high to highest

Rather than repegging this repeatedly, setting milestone to 1.2.x and priority to highest.

comment:22 Changed 8 years ago by olly

We're now using the latest SWIG (SVN trunk) for Xapian trunk - the last SWIG release included some Python 3.2 fixes, so hopefully that's fixed any issues Xapian was having with Python 3.2

comment:23 Changed 8 years ago by olly

  • Milestone changed from 1.2.x to 1.3.0

Richard and I discussed this briefly in person, and concluded we should try to get it working for the next release series. Once it is, we can consider backporting to 1.2.x. So updating milestone.

comment:24 Changed 8 years ago by olly

  • Milestone changed from 1.3.0 to 1.3.2

Changed 7 years ago by olly

Patch from Sabrina, updated after updating python3 tests from python2 with 2to3

Changed 7 years ago by sabrina

This patch makes smoketest3.py work. It also make sortable_serialise and sortable_unserialise functions work well

Changed 7 years ago by sabrina

This patch makes smoketest3.py work. It also make sortable_serialise and sortable_unserialise functions work well

comment:25 follow-up: Changed 7 years ago by sabrina

Sorry updated twice. You can just check the latest patch sabrina-python3-update.patch. This patch is tested in Python3.2, and smoketest3 works well. This patch mainly adds some functions to deal with PyUnicode? and PyBytes? issue. Another way to fix this is to patch SWIG, which may be more simple. But it need SWIG developer to updated it. See my SWIG patch http://pastebin.com/XDiqSMsf

Changed 7 years ago by sabrina

This patch is the latest git trunk patch that makes smoketest3.py work.

comment:26 in reply to: ↑ 25 ; follow-up: Changed 7 years ago by barry

I tested this latest branch on Ubuntu 12.10 with git/svn head. It does seem to build just fine with the following command:

PYTHON=python3.2 ./configure --disable-documentation --without-ruby --without-tcl --without-perl

I need to do more testing, but I hope this works. I need to ship a Python 3 compatible binding in Ubuntu 12.10 (currently in development).

comment:27 in reply to: ↑ 26 Changed 7 years ago by barry

Replying to barry:

I tested this latest branch on Ubuntu 12.10 with git/svn head.

I meant of course, that I tested with the sabrina-python3-git-trunk.patch against git head.

comment:28 follow-up: Changed 7 years ago by olly

There are certainly some useful changes in Sabrina's patch, which we hopefully should get in for the 1.3.2 development snapshot, such as the PEP3147 support (though that should use imp.get_tag() rather than hard-coding cpython-32) and probably the __next__ rename (though I've not looked at the reasons for that). However, I'm afraid it doesn't address the major remaining issue (Unicode strings) in the right way. If we had a patch which fixed all the remaining issues, we would have applied it by now! The patch changes the testcases to match what the output it produces, so passing the tests is largely meaningless.

The fundamental issue (as mentioned in the bug description) is that the Xapian C++ API uses std::string as both a UTF-8 string and a byte string. Some methods will only ever return one of these - e.g. Xapian::sortable_serialise() always returns a byte string, while Xapian::Stem::operator() always returns a UTF-8 string (well, unless you create a user stemming algorithm which doesn't...), but some can return either, generally depending what you stored earlier (e.g. Xapian::Document::get_value()). Similarly, some methods which take strings can take only one sort, or either, but in this case we can just handle whichever we are passed when the C++ API accepts a std::string. The key difference is that for a return value, we have to pick a Python type to return.

So to fix this, for each API method which returns std::string we need to decide whether it returns Unicode, bytes, or both. If it's both, the best solution is probably to add a second form (e.g. xapian.Document.get_value_unicode()) which does the conversion for the user, rather than forcing them to sprinkle explicit conversions around calls to xapian.Document.get_value() in their code. SWIG's %extend makes this pretty easy to do.

Or perhaps the standard should be for get_value() to return Unicode with a get_value_bytes() or get_value_raw() alternative. Or perhaps what we do should depend on how the method will usually be used (e.g. terms can be arbitrary binary strings, but in practice they're almost always UTF-8).

I guess if you're trying to get everyone onto Python 3 for Ubuntu, you've looked at quite a few upstreams already - has a standard pattern for resolving such situations already emerged?

One further complication may be the user sub-classable API classes (which SWIG calls "directors"). Here C++ calls back to Python, so it's the arguments rather than the return types which matter. I'm not sure if there are any cases there which could take either Unicode or bytes, but if there are, I think we probably have to always pass bytes and let the Python subclass explicitly convert if it wants to.

It looks like the feature freeze date for 12.10 is 23rd August, which is only just over 2 months away - if you want to see Python 3 support in a stable Xapian release by then, realistically you're going to have to be the one to actually make that happen. As things are currently, it's not looking at all likely it would even be fixed on trunk by then. It would certainly be good to sort out Python 3 support, but there's not yet much evidence of actual user demand, and Richard was the main one driving this, but isn't very active in Xapian development right now.

comment:29 Changed 7 years ago by olly

BTW, you can just configure "--with-python" to only enable Python. It will only auto-detect what languages are installed if you don't specify any "--with-<language>" options.

comment:30 in reply to: ↑ 28 Changed 7 years ago by barry

Thanks for the detailed response. A few comments:

Replying to olly:

There are certainly some useful changes in Sabrina's patch, which we hopefully should get in for the 1.3.2 development snapshot, such as the PEP3147 support (though that should use imp.get_tag() rather than hard-coding cpython-32) and probably the __next__ rename (though I've not looked at the reasons for that). However, I'm afraid it doesn't address the major remaining issue (Unicode strings) in the right way. If we had a patch which fixed all the remaining issues, we would have applied it by now! The patch changes the testcases to match what the output it produces, so passing the tests is largely meaningless.

Right, mainly I was responding just on the currency of the latest patch.

The fundamental issue (as mentioned in the bug description) is that the Xapian C++ API uses std::string as both a UTF-8 string and a byte string. Some methods will only ever return one of these - e.g. Xapian::sortable_serialise() always returns a byte string, while Xapian::Stem::operator() always returns a UTF-8 string (well, unless you create a user stemming algorithm which doesn't...), but some can return either, generally depending what you stored earlier (e.g. Xapian::Document::get_value()). Similarly, some methods which take strings can take only one sort, or either, but in this case we can just handle whichever we are passed when the C++ API accepts a std::string. The key difference is that for a return value, we have to pick a Python type to return.

Wow, this does make it even more challenging. Usually if the API you're interfacing to has a strong model of bytes v. strings, it's not too hard to work out the details (e.g. my earlier work on dbus-python), but it's certainly more difficult if the semantics are ambiguous.

So to fix this, for each API method which returns std::string we need to decide whether it returns Unicode, bytes, or both. If it's both, the best solution is probably to add a second form (e.g. xapian.Document.get_value_unicode()) which does the conversion for the user, rather than forcing them to sprinkle explicit conversions around calls to xapian.Document.get_value() in their code. SWIG's %extend makes this pretty easy to do.

Or perhaps the standard should be for get_value() to return Unicode with a get_value_bytes() or get_value_raw() alternative. Or perhaps what we do should depend on how the method will usually be used (e.g. terms can be arbitrary binary strings, but in practice they're almost always UTF-8).

I don't know the Xapian API very well, but on first blush I do like the idea of .get_value() returning one or the other consistently, and either adding a new API for the alternative, or leaving it to the user to do the conversion, though as you say, you don't want to make client code horribly less readable. I'm not sure whether get_value() should always return bytes or strings, though it seems like get_value_bytes() might be a good first approach.

I guess if you're trying to get everyone onto Python 3 for Ubuntu, you've looked at quite a few upstreams already - has a standard pattern for resolving such situations already emerged?

Well, the only upstream I currently have to support is software-center, since we're only converting to Python 3 on the standard desktop image (for 12.10 anyway). So its use case will be my primary driver. We have maybe a dozen reverse depends on python-xapian in total.

One further complication may be the user sub-classable API classes (which SWIG calls "directors"). Here C++ calls back to Python, so it's the arguments rather than the return types which matter. I'm not sure if there are any cases there which could take either Unicode or bytes, but if there are, I think we probably have to always pass bytes and let the Python subclass explicitly convert if it wants to.

That sounds reasonable.

It looks like the feature freeze date for 12.10 is 23rd August, which is only just over 2 months away - if you want to see Python 3 support in a stable Xapian release by then, realistically you're going to have to be the one to actually make that happen. As things are currently, it's not looking at all likely it would even be fixed on trunk by then. It would certainly be good to sort out Python 3 support, but there's not yet much evidence of actual user demand, and Richard was the main one driving this, but isn't very active in Xapian development right now.

Note that we don't necessarily need a stable Xapian release supporting Python 3. It would be okay if upstream blessed the patches (ideally, by committing them to svn), the we could package that up and feel fairly confident that when the stable release is made, we can switch to it without a ton of churn.

One big question is this: what version of Python 2 do you still need to support (please tell me, nothing earlier than 2.6 :), and how should we handle cases where the API has to change for Python 3?

comment:31 follow-up: Changed 7 years ago by richard

To chip in here; after much reflection, I've been thinking that the strings/bytes problem for Xapian is essentially that python 3 requires us to build an abstraction on top of Xapian's API to separate strings and bytes. However, because Xapian doesn't store information about the types of strings that many of its methods were given, such an abstraction must be leaky. Therefore, we should aim to make the abstraction as thin as possible, to make it easy to document the leaks.

The easiest and cleanest way to do this is to avoid doing any magic conversions, use the bytes type in all places where binary strings may be used (both in parameters and in return values), and use the unicode string type in any places where only encoded strings may be used. Specifically:

  • all methods which it can ever make sense to pass arbitrary binary strings to should accept only bytes. Those which are sometimes used to store text strings will therefore need the user to explicitly call .encode('utf8') before passing the arguments. However, this is actually a good thing, since the user will get that encoded value back if they later do a call to extract the string from Xapian again.
  • return types of methods which can return arbitrary binary strings should return only bytes. No magic "return unicode if the arguments passed to the call were unicode" or similar hackery, because this just makes the API harder to document and understand, and leads to subtle and hard to track down bugs.
  • there should be no special method variants added as syntax sugar to do conversions. ie, no "add_value_unicode" methods. It's clearer to tell the user to use .encode('utf8') before making the call, because then the user knows what actually happened without having to learn about each special API method in turn.
  • methods in the C++ API which always expect UTF-8 encoded strings (such as the QueryParser::parse() method), should accept only unicode strings. This will ensure that we're never making assumptions about the character set of the input, and is appropriate because users should be using unicode strings when manipulating text already (so there shouldn't be a need to do foo.decode('utf8') or anything similar).
  • callbacks from C++ to python which include strings in the parameters which come from Xapian should pass byte strings if the parameters can be arbitrary binary strings, and should pass unicode strings if there is no situation where that may happen.

Following this scheme will cause some backwards compatibility problems, since in Xapian's python 2 bindings, many methods accept unicode strings and automatically convert them to UTF8 before passing to Xapian, but we wouldn't carry this over to the python 3 bindings. However, the bytes / unicode support in python 2 is a mess and needs sorting out, so what better time than the python 3 transition for doing this.

comment:32 in reply to: ↑ 31 Changed 7 years ago by barry

Replying to richard:

The easiest and cleanest way to do this is to avoid doing any magic conversions, use the bytes type in all places where binary strings may be used (both in parameters and in return values), and use the unicode string type in any places where only encoded strings may be used.

+1. Let me see how much progress I can make with this approach.

comment:33 Changed 7 years ago by olly

The easiest and cleanest way to do this is to avoid doing any magic conversions, use the bytes type in all places where binary strings may be used (both in parameters and in return values), and use the unicode string type in any places where only encoded strings may be used.

I can see the appeal, but it's really not easiest, at least not to implement, since it requires specifying different std::string input typemaps for the two cases, whereas if we always accept either Python string type for any use of std::string, then a single std::string input typemap can do that. It'll mean wrapping new methods for Python which accept std::string becomes a chore too, rather than just happening automatically.

all methods which it can ever make sense to pass arbitrary binary strings to should accept only bytes.

I'm not sure it's quite so clear cut though - for example, Xapian::Stem seems "obviously UTF-8", but as I mentioned above, you could create a user stemmer which expects non-UTF-8 data, and then suddenly passing binary data to it makes some sense. This example is arguably a bit far-fetched, but the worrying part to me is that the status of Xapian::Stem potentially changed as the result of an addition to the C++ API (the user-defined stemmer feature).

And you still end up with the situation where there API isn't consistent, because some places where you want to pass Unicode take only Unicode strings, but other places where you want to pass Unicode insist on bytes, so you need to know which wants which, which makes the API a lot harder to learn and use.

Really, the only consistent thing to do would be only converting bytes to std::string and back, but I don't see that as a great approach. It's really just consistently unfriendly.

No magic "return unicode if the arguments passed to the call were unicode" or similar hackery, because this just makes the API harder to document and understand, and leads to subtle and hard to track down bugs.

I wasn't proposing anything of the sort. Pretty much all I'm suggesting is that we accept Unicode or bytes for std::string on input, and return bytes on output, with an alternative helper method which does the conversion for you.

If we force everyone to have to learn which string type they have to pass where, and write explicit conversions all over the shop, all I can really see us achieving is a proliferation in the number of self-proclaimed "nicer" Python interfaces to Xapian.

comment:34 follow-up: Changed 7 years ago by james

I've been thinking about this again during this discussion, and I'm definitely on Olly's side over trying to avoid lots of wrappers. As a python user, being unable to pass in Unicode objects (ie str) is just unacceptable, and will result in bug reports from everyone who doesn't realise they need a wrapper.

However returning bytes consistently has exactly the same problem and using different functions to get the behaviour most people will expect seems icky to me; the most sensible solution I've been able to come up with is that you could set an encoding at the library level, with None meaning use bytes. It's unclean conceptually but means you can achieve whatever you need with mostly little fuss; if you need different encodings with different databases for instance you'd have to manage everything manually anyway because so many things aren't associated with a Database object (Query, for instance). Doubly so if you want Document data to be UTF 8 but terms to be raw bytes.

By default I'd favour no encoding, which is the behaviour Olly is describing. For the purposes of getting Python 3 support for Ubuntu 12.10 that seems reasonable and would be forward compatible if (hopefully when) we implement output encoding in the python layer ourselves.

This leaves user implemented functions and the like. For these I'd definitely pass byte always, as they are more complex and rare anyway. If we documented a way of getting the user-set encoding from the wrapper people could write directors that took advantage of that if they wanted to make their code reusable.

comment:35 follow-up: Changed 7 years ago by olly

A global "encoding switch" doesn't really seem workable to me.

For starters, Xapian::sortable_unserialise() definitely needs to be passed bytes (it takes a binary string as returned by Xapian::sortable_serialise(), so passing a Unicode string converted to UTF-8 makes no sense at all, and having to convert your binary input to Unicode to pass it isn't going to work well), so there will need to be exceptions which the switch doesn't affect, or else some settings of the switch will render such functions unusable.

The problem with "manage everything manually" when a single setting of the encoding isn't enough is that the case of wanting both text and binary data isn't at all esoteric. For example, look at omindex, which indexes text but adds a (binary) document checksum as a value for collapsing identical documents. Alternatively, you could change the global encoding each time you want to pass the other sort, but then you're flipping it back and forth for every document you index.

And if you want to write something reusable you'll find yourself having to save the encoding state, and then set it to what you want, do your calls to Xapian, and then restore the encoding state. That really seems worse than having to specify the encoding at every call site.

In terms of text encodings, xapian-core only really supports UTF-8. In a lot of places, you just get back what you put in, but anything that actually looks at the contents as text expects UTF-8. So the only settings of the "encoding" switch which make sense on the C++ side are UTF-8 and binary data.

In response to Barry:

I guess if you're trying to get everyone onto Python 3 for Ubuntu, you've looked at quite a few upstreams already - has a standard pattern for resolving such situations already emerged?

Well, the only upstream I currently have to support is software-center, since we're only converting to Python 3 on the standard desktop image (for 12.10 anyway). So its use case will be my primary driver. We have maybe a dozen reverse depends on python-xapian in total.

I meant other upstream Python projects you're needing to get on to Python 3, not reverse dependencies of python-xapian (though getting the reverse dependencies ported to "python3-xapian" could take significant work, especially if we totally throw out compatibility with the current python-xapian API). I was wondering if there was a standard pattern for handling wrapping an interface like this.

One big question is this: what version of Python 2 do you still need to support (please tell me, nothing earlier than 2.6 :), and how should we handle cases where the API has to change for Python 3?

Adding Python 3 support really shouldn't change which Python 2 versions are supported. If the changes are invasive enough that this is really an issue, then I suggest we split the Python 3 bindings into a separate subdirectory - we already have different versions of all the tests, though currently they're mostly the result of 2to3, plus a few tweaks. Especially for 1.2.x, we really don't want to be risking breaking Python 2 support - our general policy is not to break compatibility with a version of other software within a Xapian stable release series without a very good reason.

We've not made a final decision on the versions of things we'll aim to support in 1.4.x - 2.6 may well be a sane cut-off there, but that doesn't really help since you want this support in 1.2.x.

As for the minimum 3.x to support, I certainly wouldn't worry about 3.0 - my impression is that the early adopters who actually tried it will have quickly moved on to the new cutting edge, while the conservative types will have feared the ".0".

comment:36 Changed 7 years ago by james

Hmm, your 80-20 is different to mine, suggesting it's more like a 50-50 split. However if we emit bytes and accept bytes + str (UTF-8 encoding), except for directors & internal places like the serialisation API, that's not incompatible with either explicit _utf8() functions or some other approach in the future. I'm still worried that people are going to feel the "need" to write wrappers, but we can tackle that later once we have some clear feedback from people running into problems.

I *think* that it'll be possible to write code that's compatible with both py2 and py3 if we return bytes, because you want to do all your encoding in client code for py2 anyway. (This is important for things like django-haystack, where the community seems to be about getting py3 compatibility without needing forks.)

comment:37 Changed 7 years ago by olly

Sorry, what's more like a 50-50 split? I can't see where I mentioned any ratios, so I'm unclear what you're talking about there.

I worry that doing half the job risks us discovering later that the planned approach for the second stage doesn't work so well in practice, at which point we either have a suboptimal API or have to go through the pain (as both developers and users) of deprecation.

I'm also not really sure whether the "returns bytes" version should always be the one named after the wrapped C++ method with a _utf8() or _unicode() or whatever variant, or if it's better to switch them and have the unsuffixed name return a Unicode string, with _bytes() or _raw() or something for the bytes version. Handling text is the more common case, and the parts where you're handling raw bytes seem like the "doing something special" case.

I think we need to decide if making it easy to write code which works with both is important, and also whether minimising changes required to make existing code work with the python3 version is important (the two aren't the same). If you have to make extensive modifications to your code to make it work with both, you may be happier to simply have a parallel version of the module for each version, and a simple wrapper to load the right one (like our smoketest.py).

comment:38 in reply to: ↑ 35 Changed 7 years ago by barry

Replying to olly:

In response to Barry:

I guess if you're trying to get everyone onto Python 3 for Ubuntu, you've looked at quite a few upstreams already - has a standard pattern for resolving such situations already emerged?

Well, the only upstream I currently have to support is software-center, since we're only converting to Python 3 on the standard desktop image (for 12.10 anyway). So its use case will be my primary driver. We have maybe a dozen reverse depends on python-xapian in total.

I meant other upstream Python projects you're needing to get on to Python 3, not reverse dependencies of python-xapian (though getting the reverse dependencies ported to "python3-xapian" could take significant work, especially if we totally throw out compatibility with the current python-xapian API). I was wondering if there was a standard pattern for handling wrapping an interface like this.

Ah, I see what you're getting at. It's difficult to come up with a one-size-fits-all rule because each package has its own unique view of bytes v strings. Some packages have a clear, consistent model, so even if that's manifest in a fuzzy way for Python 2, it's not hard to do the right thing in Python 3. Some, like dbus-python have an obvious default once you discover it (i.e. I made the wrong choice at first, but it was pretty evident by the scope of the changes both to the package and to clients of the library that it was the wrong choice; once reverted, it was much easier to port things).

It sounds to me like Xapian has a clear model most of the time, with some fuzzy areas, but this information gets lost when mapped to the C++ type, so that gets further fuzzy when mapping that single type to its Python equivalent.

Unfortunately, because Python 2 has let us all get away with sloppiness for so many years (if you ignore the mystifying UnicodeDecodeErrors? in non-English domains ;), it gets pretty tricky to clarify these for Python 3. Often, you *can't* because things can be either bytes or strings, which again, Python 2 let you be sloppy about (not saying that in a derogatory way - I'm the biggest culprit of that in my old code :).

One big question is this: what version of Python 2 do you still need to support (please tell me, nothing earlier than 2.6 :), and how should we handle cases where the API has to change for Python 3?

Adding Python 3 support really shouldn't change which Python 2 versions are supported. If the changes are invasive enough that this is really an issue, then I suggest we split the Python 3 bindings into a separate subdirectory - we already have different versions of all the tests, though currently they're mostly the result of 2to3, plus a few tweaks. Especially for 1.2.x, we really don't want to be risking breaking Python 2 support - our general policy is not to break compatibility with a version of other software within a Xapian stable release series without a very good reason.

Choosing a minimum Python 2 version does make a difference when implementing a single-source bilingual Python layer, which I think works best all things considered. Python 2.6 has *lots* of features for making it easier to support both Python 2 and 3 in the same code base. Python 2.7 makes it a little bit easier, but usually not enough to recommend it over also supporting 2.6. Supporting 2.5 or earlier makes things progressively more difficult with each older version; people have had success with that, but mostly for simpler libraries. I think the Xapian bindings are complex enough that it will *probably* be easier to support a single codebase with Python 2.6 as the minimum version. I completely understand the constraints on that though. An example is Twisted, which just released 12.1, the last to support anything older than Python 2.6.

We've not made a final decision on the versions of things we'll aim to support in 1.4.x - 2.6 may well be a sane cut-off there, but that doesn't really help since you want this support in 1.2.x.

We actually have some options here. Say we targeted a version that was not released. I'd probably take a git/svn snapshot and package that up as a separate transitional package. As I mentioned, my only *requirement* right now is software-center, so doing something interim there is fine. In fact, even if the API changes once Python 3 support is official, updating s-c won't be terribly difficult. We'll have others that we need to support after 12.10, but hopefully by then Xapian's Python 3 API will be more stable even if it's not yet released.

As for the minimum 3.x to support, I certainly wouldn't worry about 3.0 - my impression is that the early adopters who actually tried it will have quickly moved on to the new cutting edge, while the conservative types will have feared the ".0".

3.0 is an ex-parrot. I'd say even supporting Python 3.1 is questionable. 3.2 makes a good baseline and I don't think it will be terribly more difficult to support 3.3 once that's released.

comment:39 in reply to: ↑ 34 Changed 7 years ago by barry

Replying to james:

However returning bytes consistently has exactly the same problem and using different functions to get the behaviour most people will expect seems icky to me; the most sensible solution I've been able to come up with is that you could set an encoding at the library level, with None meaning use bytes. It's unclean conceptually but means you can achieve whatever you need with mostly little fuss; if you need different encodings with different databases for instance you'd have to manage everything manually anyway because so many things aren't associated with a Database object (Query, for instance). Doubly so if you want Document data to be UTF 8 but terms to be raw bytes.

By default I'd favour no encoding, which is the behaviour Olly is describing. For the purposes of getting Python 3 support for Ubuntu 12.10 that seems reasonable and would be forward compatible if (hopefully when) we implement output encoding in the python layer ourselves.

This leaves user implemented functions and the like. For these I'd definitely pass byte always, as they are more complex and rare anyway. If we documented a way of getting the user-set encoding from the wrapper people could write directors that took advantage of that if they wanted to make their code reusable.

I'm no fan of global magical state, but here's one way it *could* work. You could expose context managers which control the global state, one for input and one for output. This might actually be easier if your choices are utf-8 or bytes (though other encodings could probably be supported), and if you make a default consistent choice, e.g. of always accepting and returning bytes. I honestly don't know if in practice this would make your life easier, but it would work something like this:

value = blah.get_value() assert isinstance(value, bytes)

with xapian.as_utf_8():

value = blah.get_value()

assert isinstance(value, str)

Maybe you don't need one for input and you can auto-detect and auto-convert. The details of course would have to be worked out, but it would be possible to nest these for different input and output types, and you'd probably want a xapian.as_bytes() context manager to temporarily switch back, along with introspection methods, and all implemented as a stack of contexts. I've done something very similar with my flufl.i18n package which has to manage a global stack of locale states. It was a bit tricky to get the API right, but now it's quite useful and easily explained.

Again, not saying it *should* work this way, just that it *could* :)

comment:40 follow-up: Changed 7 years ago by olly

It sounds to me like Xapian has a clear model most of the time, with some fuzzy areas, but this information gets lost when mapped to the C++ type, so that gets further fuzzy when mapping that single type to its Python equivalent.

I guess Xapian's model is that strings are mostly opaque objects, and you get out what you put in. When it needs to look at them as text, then it assumes they are UTF-8 encoded. And serialisation functions/methods also obviously care about the contents of the string too, but treat it as binary data.

So simply mapping std::string to/from bytes in Python is probably the closest to that model, but I think that wouldn't seem very comfortable to use.

If we're changing 1.2.x to add Python 3 support, I'd rather be conservative and minimise the risks that we break Python 2 support, so simply splitting the two directories seems a sane approach, and avoids needing to consider whether to raise the minimum Python 2 version supported in the middle of a release series. It's unhelpful if we force users to upgrade Python just to be able to upgrade to a Xapian point release.

For 1.3.x, 2.6 seems a reasonable minimum requirement, especially as 1.4 is probably at least 6 months away. I'd prefer to solicit feedback on the mailing list before finalising the versions of the various languages we're aiming to support in 1.4, but I can't see this being controversial.

The context managers approach doesn't handle a case where you want to pass one string argument as utf-8 and the other as bytes well - you might want to do that when setting user metadata, for example key as utf-8, value as bytes.

Using something based on 1.3.x is probably a reasonable approach for software-centre in 12.10, provided it's made clear that the API is likely to change, so we don't end up with lots of other packages depending on it (or at least if we do, they have been warned). 1.3.x should be pretty stable for a "development branch" - mostly there are no ABI compatibility promises, and some API features are marked as experimental. Our record is good so far - almost all bugs in the development branches have either been in new code or also present in the previous release series. It would also be good to see a significant piece of real software using the Python 3 API.

I'd say the first step there is to get it all working but always returning bytes for std::string. That should give you something usable, and close enough to the final API that updating shouldn't be painful.

comment:41 in reply to: ↑ 40 Changed 7 years ago by barry

Replying to olly:

So simply mapping std::string to/from bytes in Python is probably the closest to that model, but I think that wouldn't seem very comfortable to use.

So, here's a plan: start with that naive mapping, and get some real-world experience (e.g. software-center) at what the pain points are. I think until we put a stake in the ground and see how easy or hard it is to use, we won't have a great sense about how to make it better. I can work with the s-c maintainers to adopt this interim API and adjust it as 1.3.x makes any future changes to its Python 3 support.

If we're changing 1.2.x to add Python 3 support, I'd rather be conservative and minimise the risks that we break Python 2 support, so simply splitting the two directories seems a sane approach, and avoids needing to consider whether to raise the minimum Python 2 version supported in the middle of a release series. It's unhelpful if we force users to upgrade Python just to be able to upgrade to a Xapian point release.

For 1.3.x, 2.6 seems a reasonable minimum requirement, especially as 1.4 is probably at least 6 months away. I'd prefer to solicit feedback on the mailing list before finalising the versions of the various languages we're aiming to support in 1.4, but I can't see this being controversial.

Sounds good.

The context managers approach doesn't handle a case where you want to pass one string argument as utf-8 and the other as bytes well - you might want to do that when setting user metadata, for example key as utf-8, value as bytes.

Good point.

Using something based on 1.3.x is probably a reasonable approach for software-centre in 12.10, provided it's made clear that the API is likely to change, so we don't end up with lots of other packages depending on it (or at least if we do, they have been warned). 1.3.x should be pretty stable for a "development branch" - mostly there are no ABI compatibility promises, and some API features are marked as experimental. Our record is good so far - almost all bugs in the development branches have either been in new code or also present in the previous release series. It would also be good to see a significant piece of real software using the Python 3 API.

I'd say the first step there is to get it all working but always returning bytes for std::string. That should give you something usable, and close enough to the final API that updating shouldn't be painful.

So looking at the git/svn head, it's the Xapian 1.3.x development branch. One option would be to create a separate xapian-bindings/python3 directory which would *only* be for Python 3. On the plus side, we could drop Python 2 support there and not worry about minimum versions. It will also be easier to develop and maintain if we only have to worry about the Python 3 API. The downside of course is that bugs affecting either binding will have to be fixed twice. I'm not sure which is better - opinions?

comment:42 Changed 7 years ago by olly

If you're planning to start wrapping from scratch, then definitely start in a new directory, and keep careful track of any code you take from the current bindings. The Python bindings still involve some code we can't relicense (in this case the BrightStation? copyright).

Otherwise I'm not sure what's best. It sounds like we may end up with an API which isn't so close to the python 2 one, which tends to suggest splitting them as the tests can't be the same.

comment:43 Changed 7 years ago by barry

Just want to mention that I took a crack at re-implementing Python 3 support using Cython. I did it mostly as an experimental contrast to SWIG wrapping, and while it's been a little slow going, I'm fairly happy with it. I'm going to continue down this path until I either get a complete working example (or in the short term, enough for Software Center) or hit an immovable road block. You can track my work at git@…:warsaw/xapian.git in the py3 branch.

I think this code is actually more or less independent of the rest of xapian-bindings, which means we could do one of two things, assuming the cython approach pans out. It could either be included in your package, definitely in a separate xapian-bindings/python3 directory, or I could release it as a separate package. I'm fine either way. I'll follow up here when I have something more complete.

comment:44 Changed 7 years ago by barry

Make that git at github dot com colon warsaw/xapian.git

comment:45 Changed 7 years ago by olly

We've tried to maintain non-SWIG based bindings for other languages (Perl and Java), and it doesn't work out well - they inevitably end up lagging significantly behind the C++ API. The key things SWIG gives us are that it largely automates wrapping new classes and methods, and that it's just one tool to have to learn.

So if you want to go that route, I'm afraid you get to maintain them as a separate project.

comment:46 Changed 7 years ago by barry

I've gotten back to this and spent more time on the swig-based bindings. It's not going well. One of the big problems I have is that I'm not sure swig's Python 3 mappings for std::string are entirely right, but maybe I don't know swig or c++ that well. An example: iiuc, std::string is just a container for bytes not unicodes, but swig wants to convert to and from unicodes to std::strings. E.g. SWIG_AsCharPtrAndSize() only checks its first arg for PyUnicode?-ness, not PyBytes?-ness, but I don't see any reason why it shouldn't accept bytes, or maybe *only* bytes. In Python 3, it then tries to decode it as a utf8 string, but there's no reason why it must be utf8 encoded data.

As another example, look at smoketest3.py. There's code that does the equivalent of xapian.Query(xapian.Query.OP_OR, ('foo, 'bar\xa3')). A strict translation of that to Python 3 ought to be xapian.Query(xapian.Query.OP_OR, (b'foo', b'bar\xa3')) which incidentally will work exactly the same in Python >= 2.6. With some hacker to accept the bytes object, I can make this work, but then swig once again gets in the way because it wants to convert the Query's .get_description() std::string to a Python 3 unicode using utf-8. It's the moral equivalent of b'bar\xa3'.decode('utf-8') which isn't value utf-8.

Suggestions welcome.

comment:47 Changed 7 years ago by olly

If you don't like what SWIG does by default, you can specify your own typemaps for std::string and SWIG should just use those instead of the built-in ones. We already do this for the Python 2 bindings, for example the input typemap (in python/util.i) is:

%typemap(in, fragment="XapianSWIG_anystring_as_ptr") std::string {
    std::string *ptr = (std::string *)0;
    int res = XapianSWIG_anystring_as_ptr(&($input), &ptr);
    if (!SWIG_IsOK(res) || !ptr) {
	%argument_fail((ptr ? res : SWIG_TypeError), "$type", $symname, $argnum);
    }
    $1 = *ptr;
    if (SWIG_IsNewObj(res)) delete ptr;
}

XapianSWIG_anystring_as_ptr is a helper function we provide which handles both unicode and str strings - this is how we get the Python2 bindings to accept either sort of string.

comment:48 Changed 7 years ago by barry

Here's another fundamental problem. get_description() for any type return std::string, and these can contain arbitrary byte sequences. The SWIG bindings do this:

%rename(str) get_description;

but this seems wrong for Python 3. If get_description() returns a std::string with non-UTF8 characters in it, as happens with Query objects in smoketest3.py, then str() of that will raise a UnicodeDecodeError?. What you really want in Python 3 is for get_description() to be %rename(bytes) so that you can then do bytes(my_query) and be assured of getting a reasonable answer. In Python 3 you cannot return a bytes object from str().

So what should str() do when there are non-UTF8 characters in the query (i.e. arbitrary bytes)? You could argue that raising a UDE still makes sense, or, since this is just the str() of an object, it could use 'replace' decoding to just ignore the bogus characters. Of course, this means that smoketest3.py is not correct for Python 3 anyway, and we still need a better %rename for get_description. On top of all that, we still need a better %typemap for std::string, or raise the issue in the SWIG tracker.

comment:49 Changed 7 years ago by olly

It's arguably a bug that get_description() doesn't always return valid UTF-8, so we could fix that in the C++ layer instead. The main case I can think of is Query objects - particularly OP_VALUE_RANGE, but even terms can be arbitrary byte sequences. We could instead return C-style \-escaped strings for such cases, which would be more readable too.

comment:50 Changed 6 years ago by olly

I have created #620 for fixing get_description() methods to always return valid UTF-8, and marked it for 1.3.2.

comment:51 Changed 6 years ago by olly

I've fixed #620 - get_description() will now always return UTF-8.

Also committed a change to rename next() to next() for Python 3 in r17475.

Still various testcase failures.

comment:52 Changed 6 years ago by olly

Added support for the PEP3147 name in r17476.

Sorted out "unicode or bytes for std::string" input typemaps for Python 3 in r17477.

comment:53 Changed 6 years ago by olly

r17478 and r17479 progress us a bit further - I've split python3 into a separate directory to make things more manageable. If the two turn out similar enough, we can merge later; if not, Python upstream seem determined to kill python 2 off so it shouldn't be a long term situation.

Last edited 6 years ago by olly (previous) (diff)

comment:54 Changed 6 years ago by olly

OK, so we need to decide what the python3 API is actually going to look like.

I agree we don't want anything too clever, but just mapping std::string to bytes everywhere seems too simplistic to me still - I just tried rewriting smoketest.py for that API as an experimental, and it's full of b'...', x.encode('utf-8') and y.decode('utf-8'). That's inevitably going to lead to people writing wrapper functions or classes so they can pass in Unicode strings without having to convert at every single call site so we'd end up with a proliferation of python wrappers, or at best one dominant wrapper which everyone uses. So it seems saner to just create the bindings with that wrapper included. It's likely to be more efficient if integrated too.

But I'm not a big python user, so it doesn't seem entirely sane for me to be designing the python API alone.

FWIW, adding "_raw" (or "_unicode") variants of methods returning std::string looks quite feasible to do - sabrina's trick of using typedef std::string pybytes; and then having an output typemap for bytes allows this to be done for a %extend method easily, and we can generate a set of forwarding wrappers from a list of methods to be handled this way (possibly via markup in the C++ API headers even - this would also be useful for some other languages - e.g. the Java bindings currently blindly convert std::string return values assuming UTF-8).

If we go that route, I'm not sure which is the better default. The more used case is almost certain to be unicode, and making the common case the shorter one seems sane, but "bytes as default with a unicode variant where it makes sense" avoids having to decide what to do with things like sortable_serialise() which definitely only return binary data. However, there are very few of those (possibly it's unique), so it could be an exception to the rule, or we could only have the bytes form for that function. I can't really think of a case where the bytes variant makes no sense.

Also not sure what the naming should be. Something short would be good: "_raw" is tolerable; "_unicode" seems too long. I wonder about "_b" or "_u" (to match the b'...' or u'...' notation).

comment:55 Changed 6 years ago by richard

I'm fairly strongly of the opinion that xapian methods which can accept arbitrary bytes should only accept bytes, and return byte arrays; this makes it clear what's going on, and avoids confusion when a value which is added as a unicode string comes back from Xapian as bytes. In python2, this change of type would tend not to be noticed, because almost everything accepted str or unicode. Or at least, it wouldn't be noticed until something far downstream of the conversion from unicode to bytes fell over on a rare piece of unicode data.

In python3, most things are much more fussy, and won't accept both bytes and str values. Therefore, it's more important to be explicit when values are changing type.

I'm thinking in particular about things like Xapian::Document, where it's not unreasonable for a programmer to expect that if a term is added to a document, and then that term is read out of the document again, the returned term will be the same as the added term.

I agree with the concerns about this leading to lots of different "thin" wrappers being created, and therefore think that having variants which accept or return (unicode) strings (and error if a return value cannot be converted to unicode) would be a good idea; the defaults should be to be strict about types though. The variants would just save some typing, but still make the programmer be explicit about what they're passing in.

I suggest the suffix "_str" for the unicode versions; in Python 3, "str" is a unicode string, and "_str" is reasonably short.

There are two objects which should accept unicode strings as some of their parameters - the query parser and term generator. These shouldn't have "bytes" variants; they always interpret the data as a string, and a python 3 programmer should be working with str objects if the data is a string.

query_parser.set_prefix is slightly awkward; the prefix typed by the user should be a "str", but the prefix applied internally to terms should be a "bytes". I'm not sure if a .set_prefix_str() variant, in which the internal prefix can be specified as a str, makes sense.

comment:56 Changed 6 years ago by james

I'm +1 on defaulting to using bytes, and +1 talking about "string" methods (whether _str, _s; I'm indifferent between them) instead of "unicode", since this is Py3.

I wonder if TermGenerator and QueryParser would ideally be namespace-separated (xapian.termparsing or something), since they go together and are an optional higher-level interface.

I wonder separately if since it doesn't affect stuff stored on disk, having query_parser.set_prefix accept both bytes and str types and serialising str -> bytes using an extra parameter (which defaults to 'utf8') might not be a good solution to that problem. I'm guessing that most people are going to want to think in terms of strings across the board; if you're just dealing in unicode data and are happy with UTF-8 (which is probably most Western applications?) then not having to think about bytes serialisation at all is a lot clearer.

I also wonder still if having an explicit wrapper (from xapian.utf8 import Document, Database) bundled as part of our own bindings which uses str everywhere that you'd usually have bytes and a _s variant, might stop people writing almost-trivial wrappers. Although honestly if all _s methods and functions take only str and return only str, with conversion in and out via utf8, then we're probably fine for new applications, having a xapian.utf8 wrapper set built-in would be useful in porting from Py2. (An alternative would be something like 2to3 for Xapian client code.)

comment:57 Changed 6 years ago by olly

I discussed this some with michelp on IRC yesterday. His feeling was making the conversions explicit was better, but having methods accept str or bytes for convenience would be reasonable. The "accept bytes or str, return bytes" is exactly what dbm (in the standard python libraries) does:

import dbm

# Open database, creating it if necessary.
db = dbm.open('cache', 'c')

# Record some values
db[b'hello'] = b'there'
db['www.python.org'] = 'Python Website'
db['www.cnn.com'] = 'Cable News Network'

# Note that the keys are considered bytes now.
assert db[b'www.python.org'] == b'Python Website'
# Notice how the value is now in bytes.
assert db['www.cnn.com'] == b'Cable News Network'

# Often-used methods of the dict interface work too.
print(db.get('python.org', b'not present'))

# Storing a non-string key or value will raise an exception (most
# likely a TypeError).
db['www.yahoo.com'] = 4

# Close when done.
db.close()

Example taken from: http://docs.python.org/3.1/library/dbm.html

I tried to see if anyone's criticised this API, but failed to find anything online, so I've been thinking this seemed a reasonable approach. There's a clear logic to it, which is simple to explain and to understand (and as a bonus, also to implement):

The API is wrapped as "C++ std::string <-> Python bytes". For convenience, if you pass in str, then rather than raising an exception, we convert to it to bytes containing UTF-8 (since that's Xapian's standard representation for Unicode text).

Returning bytes works for UTF-8 or binary data (and we could leave adding variants returning Unicode until later without breaking code written to the bytes returning API). It also matches what we do for python2, so should also be familiar to existing users. It suffers from a round-tripping issue (add a term as str, get it back as bytes), but so does the dbm API, and nobody seems to be ranting about how confusing they find that.

It's possible to pick what happens on a function by function basis with SWIG, but unless there's some way to generate the lists of methods and parameters automatically, that will be a maintenance headache. It's also a more complicated rule for users to grasp.

Richard mentions text input to the QueryParser or TermGenerator as a case where only str should be accepted, but for the common case where your source data is in UTF-8, that would force you to convert from UTF-8 to <whatever Python uses internally> to get a Python str object, and then you'd pass that to the bindings which would have to convert it back to UTF-8 again to pass to xapian-core. Even if the changes in PEP 393 mean that Python 3.3 would keep the data as UTF-8 (I'm not clear on if that's actually the case), it still complicates the rules for the API user to remember.

The number of cases which inherently only take binary data is very small - I think it's only sortable_unserialise() currently. I think not accepting str for those cases would be desirable and feasible to do - here Unicode makes no sense, so an error really is better.

I don't think it's really feasible to have a "unicode version" of classes like Document or Database - the distinction can't really be made at the class level - e.g. values will often be binary but terms usually text.

I'm not sure you can make it at the function/method level either - a function could have two std::string parameters, or a std::string parameter and return type, but these be "text" and "binary data". E.g. the former pattern matches QueryParser's parse_query() and add_prefix() methods if you think term prefixes are binary data (as suggested above); the latter pattern is what an API for serialising a string would look like. So if "foo_str()" means all std::string parameters and return values are unicode, that doesn't really work - this is one reason why I think accepting either str or bytes for parameters is more workable - then the unicode variant only affects the return value, and there's only one of those (well, strictly speaking, returning std::pair or via passed in pointers/references is possible, and would naturally map to returning a tuple in Python, but only worrying about return values greatly reduces the number of cases, and multiple return values in C++ are a bit awkward so tend to be rare).

QueryParser assumes the encoding of text is utf-8, so providing a way to specify that str should be converted to a different encoding just seems to be setting a trap for users. For those who really want to do this, converting to bytes in the encoding they want in their own code allows this.

comment:58 Changed 6 years ago by james

Agree on QueryParser and specified encodings.

You've convinced me (the values argument is the key one) that having a wrapper ourselves isn't really going to be feasible. I still think we're going to see people generating wrappers though, as people work through the same line of thinking I came from and come to similar unsafe conclusions. (I wasn't thinking either at a method or class level, but that you'd choose to use the entire API in str-mode. For an awful lot of simple use cases that would suffice; except for values, which are too common and useful to ignore.)

I suspect an awful lot of people are going to move everything into py3 str objects anyway, but I do like the idea of not requiring it. Although I don't think that lack of criticism of the dbm API tells us much (my guess is most people don't need to use it, and those who do probably have a better understanding of what they're doing than a lot of people picking up python — or even xapian via python — and wanting to get stuff working), the utility of auto-conversion on the way in seems a good tradeoff. I still feel odd about it always ejecting bytes, but I can't think of a better approach. And as you say, it's easy to explain. (It will cause similar unicode problems to those people encounter with py2, so we'll probably need a FAQ at some point. Depends a bit on where you send the bytes if you fail to convert to a str, and what the libraries you're using do in those situations.)

comment:59 Changed 6 years ago by richard

After reviewing this discussion and thinking for a bit, I think the only safeish option for input is to accept both bytes and str, and automatically encode str as utf8. The sole exception would be for sortable_unserialise(), which should error if given a str.

For ease of explanation, implementation, and to avoid having to maintain lots of custom wrappers, I think all return values should be bytes. We should simply document this, and people can do ".decode('utf8')" for those return values where they expect that to work; eg, they can do this for a returned term if they're only storing UTF8 term values in the database.

comment:60 Changed 6 years ago by barry

Totally agree that inputs should accept either (utf-8) str or bytes.

Output could return a hybrid object that knows how to convert between (utf-8) str and bytes. Very roughly something like the following:

def foo():
    return b'abc'

def bar():
    return foo() + b'def'


class BS:
    def __init__(self, some_bytes):
        self._value = some_bytes

    @property
    def str(self):
        return self._value.decode('utf-8')

    @property
    def bytes(self):
        return self._value

    def __str__(self):
        return self.str

    def __bytes__(self):
        return self.bytes


v = BS(bar())
print(repr(v.bytes))
print(repr(v.str))
print(repr(bytes(v)))
print(repr(str(v)))

Users wouldn't have to write their own wrappers, but they'd have to be explicit about whether they wanted bytes or str in their own code.

% python3 /tmp/foo.py 
b'abcdef'
'abcdef'
b'abcdef'
'abcdef'

comment:61 Changed 6 years ago by richard

Interesting idea. It seems mildly insane to me at first impression, but it might possibly work well. My concern is that it would cause more work for users than necessary; I'm not actually sure how often users would be getting data back from Xapian and actually want to treat it as a string, anyway.

If we did this, we'd certainly want to make sure that Xapian accepted these BS objects anywhere where a str or bytes was accepted, too, so that data coming out of xapian could be fed back in (after whatever munging the user desired).

I'm -0 on this idea, because I think it's unnecessary complexity, but reserve the right to downgrade or upgrade that after thinking a bit more.

comment:62 Changed 6 years ago by olly

Richard's description in comment:59 is pretty much identical to what I'm currently thinking, except that get_description() (as renamed to __str__) needs to return a Unicode string because that's what Python requires (in comment:48 barry said: "In Python 3 you cannot return a bytes object from str()").

We should probably reject str passed to unserialise() methods as well as sortable_unserialise(), especially as SWIG can apply a typemap to all methods with the same name in one go.

So it seems like we're converging on a solution!

ISTR we considered a magic return value along those lines before and rejected it, but I think it was trying to be cleverer in some way - barry's suggestion seems worth experimenting with I think. It seems cleaner than having both "get_termname()" and "get_termname_str()"

So a plan - let's implement what's described in the first two paragraphs of this comment, which will give us a workable Python3 API, and we can try converting some Python 2 code to see what it feels like, and compare with how it would be if we returned the BS class. The work to implement the BS class approach would be on top of the work required for the simpler approach anyway.

Last edited 6 years ago by olly (previous) (diff)

comment:63 Changed 6 years ago by olly

I've committed progress so far in r17505.

Still need to stop str being accepted for unserialise() methods, and "make check" currently fails (though I have updated a lot of the testsuite for the "return bytes" change).

One issue I'm unsure on - should Xapian::version_string() really return a bytes string? It seems it's more a meta-data thing than really part of the Xapian API, so perhaps shouldn't be included in the "everything returns bytes" rule.

Perhaps we should rework all the version stuff for python - being able to get a string or a tuple would feel more natural I think, but currently we offer a byte string, or 3 integer components (each via a separate function). sys.version and sys.version_info seems a good model.

comment:64 Changed 6 years ago by james

+1 on xapian.version and xapian.version_info or similar (str and tuple respectively).

comment:65 Changed 6 years ago by richard

str seems right for reporting versions. However, the version should be available (as a str) as xapian.__version__.

There's a PEP about this http://www.python.org/dev/peps/pep-0396/

Basically, it says put a str holding the version number in module.__version__

Last edited 6 years ago by richard (previous) (diff)

comment:66 Changed 6 years ago by olly

It sounds like we should provide the bindings version in xapian.__version__ but that's a different thing to xapian.version_string() - this is the version of xapian-core in use. For example, you could have xapian-core 1.2.15 but xapian-bindings 1.2.14 - in this situation xapian.version_string() would give '1.2.15' but xapian.__version__ would give '1.2.14'.

There's also the version of xapian-core which the bindings were actually built against, which could be different again, but is probably less interesting. I don't think that's available to the python bindings currently (not sure the bindings version is either, at least in any direct way).

comment:67 Changed 6 years ago by olly

As of r17516, the python3 tests all pass except for a segfault in pythontest.py when it runs under threads (all the testcases are run without threads and pass, and then again with threads and we get this failure).

To do:

  • Reject str from unserialise methods
  • Provide __version__
  • Return str for version_string()
  • Check and update examples
  • Check and update docs
  • Decide on minimum supported version - I'm tempted to say 3.2, unless/until anyone cares enough to test with 3.1 and provide any patches required to make it work. Debian stable has 3.2, and barry suggested >= 3.2 above (and that was 16 months ago).
Last edited 6 years ago by olly (previous) (diff)

comment:68 Changed 6 years ago by olly

  • Description modified (diff)

Implemented xapian.__version__ (for 2 and 3) in r17521, and documented 3.2 as the minimum supported version in r17522.

Updated the todo list in the description.

comment:69 Changed 6 years ago by olly

  • Description modified (diff)

r17563 rejects str from unserialise methods and functions.

Last edited 6 years ago by olly (previous) (diff)

comment:70 Changed 6 years ago by olly

  • Description modified (diff)

And r17564 returns unicode from xapian.version_string().

Last edited 6 years ago by olly (previous) (diff)

comment:71 Changed 6 years ago by olly

  • Description modified (diff)

And r17565 updates the documentation (and converts it to .rst). I'd appreciate a review by people with a better grasp of Python than me:

http://trac.xapian.org/browser/svn/trunk/xapian-bindings/python3/docs/index.rst

Also, do we want to keep both:

  for t in document:
    print(t.term)

and:

  for t in document.termlist():
    print(t.term)

(and the other methods which duplicate __iter__ on the same object).

comment:72 Changed 6 years ago by olly

I guess we probably want a "migrating from the python2 bindings" section too.

comment:73 Changed 6 years ago by olly

  • Description modified (diff)
  • Milestone changed from 1.3.2 to 1.3.x

Examples updated in r17568, but I haven't tried running them all yet.

The rest of this ticket doesn't need to hold up 1.3.2 any longer.

comment:74 Changed 6 years ago by barry

Hi all - really fantastic to see all the great progress since I last had the opportunity to look. I just re-grabbed the svn trunk, but I had to make one change to get Xapian Python 3 bindings to build on Ubuntu Trusty:

diff --git a/configure.in b/configure.in
index d04c76f..851b644 100644
--- a/configure.in
+++ b/configure.in
@@ -764,7 +764,7 @@ else
     # Find the directory for libraries this is necessary to deal with
     # platforms that can have apps built for multiple archs: e.g. x86_64
     AC_MSG_CHECKING([for Python 3.x lib dir])
-    PY3LIBDIR=`($PYTHON3 -c "import sys; print(sys.lib)") 2>/dev/null`
+    PY3LIBDIR=`($PYTHON3 -c "import sysconfig; print(sysconfig.get_config_var('LIBPL'))") 2>/dev/null`
     if test -z "$PY3LIBDIR"; then
       # some dists don't have sys.lib  so the best we can do is assume lib
       PY3LIBDIR="lib" 

The original code doesn't make any sense to me, since sys.lib is certainly a non-standard attribute. Where did you find that?

In any case, using sysconfig is a more standard way of locating the multi-arch location for the Python 3 libraries on Debian and Ubuntu (and very likely other Linux distros).

With this change, I was able to build Python 3 support and actually import xapian._xapian. I haven't tried running the tests yet. What's the ETA for releasing an official Python 3 supporting Xapian version?

comment:75 Changed 6 years ago by barry

Also note that I don't get segfaults (afaict) when running make check on r17620, but I do get failures in pythontest.py for both Python 2 and Python 3:

pythontest.py:1223:Exception string not as expected: got 'Error reading block 4: Bad file descriptor', expected 'Database has been closed'
  1221 
  1222     db.close()
->1223     expect_exception(xapian.DatabaseError, "Database has been closed", check_vals, db, vals)
  1224     shutil.rmtree(dbpath)
  1225 

comment:76 Changed 6 years ago by barry

I don't know Xapian's C++ API all that well, but index.rst looks pretty reasonable to me.

comment:77 Changed 6 years ago by olly

The code you quote in comment:74 is in swig, not xapian. If swig fails to detect python3 correctly, that shouldn't be a problem for us - it should just mean that if you tried to run SWIG's testsuite, it wouldn't run Python3 tests. The built SWIG should still support generating wrappers for Python3. I'll forward the comments about sys.lib to the SWIG bug tracker though.

We don't currently have an ETA for a stable 1.4.0 release, though this should be in the next development release (1.3.2) which should be out before the end of the year. If you're just wanting bindings to use in Ubuntu software-centre, you could use 1.3.2 as a parallel installed version (I have packaging for 1.3.x). I wouldn't recommend a wholesale replacement of Xapian 1.2.x by 1.3.x in Ubuntu though.

Backporting the changes to 1.2.x is another option for you I guess, though it's not going to just be a trivial diff and patch as there are some structural changes to our SWIG interface files between the branches.

comment:78 Changed 6 years ago by olly

Looks like sys.lib was a distro patch which I guess didn't get merged:

http://bugs.python.org/issue1294959

(Looks like you were involved in that discussion!)

OpenSUSE at least seem to still be carrying it:

https://build.opensuse.org/package/view_file?file=Python-3.3.0b2-multilib.patch&package=python3&project=devel%3Alanguages%3Apython%3AFactory&rev=cb7b94b33478cb8add5a5eb3ab3068dc

comment:79 Changed 6 years ago by barry

Yeah, OpenSUSE really should deprecate that. It's completely non-standard and now there *is* a standard way of doing it. But that's not Xapian's problem. ;)

I'm a little confused about comment 77, when you say "If swig fails to detect python3 correctly, that shouldn't be a problem for us - it should just mean that if you tried to run SWIG's testsuite, it wouldn't run Python3 tests. The built SWIG should still support generating wrappers for Python3."

When I checked out from svn, it grabbed swig's git and built that, even though I have swig 2.0.10 installed on my system. Maybe bootstrap didn't find it or doesn't look for it. I'll have to take a closer look at how Xapian builds from svn to see if it's possible to use system swig instead (which I'd think would DTRT, but haven't checked yet).

Agreed about the packaging comments. Since you're the maintainer in Debian, I'd love to work with you on getting a snapshot of 1.3.2 into Ubuntu for software-center. Is the packaging you mention in the Vcs-Svn for xapian-core, or somewhere else?

comment:80 Changed 6 years ago by olly

We've had a few cases where changes to SWIG introduced bugs in our bindings, so we have a particular version of SWIG pulled in by bootstrap which snapshots and releases are generated with, rather than using whatever version of SWIG happens to be installed when the release is generated.

You can probably do mkdir swig; touch swig/.nobootstrap to get it to use the system/your own install of swig.

The 1.3 packaging is in the debian subdirectory of the main xapian SVN/git repo, which I think is where the Vcs-Svn headers point. I've actually just been updating it this week.

comment:81 Changed 6 years ago by olly

  • Milestone changed from 1.3.x to 1.3.2

comment:82 Changed 6 years ago by lilydjwg

  • Cc lilydjwg@… added

comment:83 Changed 5 years ago by olly

Currently the testsuite fails on some exception tests, which turns out to be a SWIG bug, fix for which is here: https://github.com/swig/swig/pull/137

This issue was the main thing delaying 1.3.2, so hopefully that'll finally emerge fairly soon.

Barry: BTW, packaging has now moved to collab-maint git and there's a 1.3 branch there. Where's Ubuntu at with this? It looks to me like software-center is still using python-xapian, so presumably this change isn't happening in 14.04 (and I guess a non-LTS makes more sense for switching so you don't end up supporting a pre-release of the python3 bindings for years).

comment:84 Changed 5 years ago by littlepig

  • Cc jocl1@… added

Is there any update on this? I'm improving Xapian backend (xapian-haystack) to django-haystack and I believe this may be the only thing missing[1] to deploy it on python 3.3 (it works correctly on python 2.7.6).

[1] https://travis-ci.org/jorgecarleitao/xapian-haystack/jobs/24926464

comment:85 Changed 5 years ago by olly

The information ticket is up to date. 1.3.2 has still not quite yet appeared for various other reasons, but should be out this month.

comment:86 Changed 5 years ago by olly

  • Description modified (diff)
  • Milestone changed from 1.3.2 to 1.3.3

I've checked the python3 examples, and fixed the few minor issues I found in r18118.

The remaining issue with a testcase not working under threads shouldn't delay 1.3.2, so bumping milestone.

comment:87 Changed 5 years ago by olly

Assem has discovered that the issue with the failing testcase seems to be related to the postingsource being in a list:

https://github.com/xapian/xapian/pull/50

It should work in a list (and the testcase is written as it is to test that we keep hold of a reference to the Python object in such cases), but I think this might provide a useful clue in discovering what's going wrong.

comment:88 Changed 5 years ago by olly

  • Resolution set to fixed
  • Status changed from assigned to closed

I had another look at this. Assem's discovery made me look closely XapianSWIGQueryItor and I realised it uses Python C API calls while the GIL is unlocked. Fixing that the testcase passes under threads. Fixed on trunk in [28b5660] and reenabled the testcase in [2e77f32].

I will apply the same fix for Python 2 - I think we're just lucky (or unlucky) we don't see a manifestation of the problem there.

That was the final issue in this ticket, so closing.

Note: See TracTickets for help on using tickets.