Opened 5 years ago

Last modified 5 months ago

#785 assigned enhancement

Use std::string_view whenever possible

Reported by: German M. Bravo Owned by: Olly Betts
Priority: highest Milestone: 1.5.0
Component: Other Version:
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description

Using std::string_view would require c++17, but I do believe it'd bring a whole lot of measurable performance improvements if used appropriately.

Change History (10)

comment:1 by Olly Betts, 5 years ago

https://en.cppreference.com/w/cpp/compiler_support suggests that to rely on std::string_view being available would require GCC >= 7, Clang >= 4 or MSVC 2017 (and probably not supporting any other compilers, though those are at least probably the current top 3). That means you couldn't build with the system compiler on Debian stretch (the current stable release, though a new one is in freeze currently) or Ubuntu xenial (which is the newest travis-ci currently offers).

So bear in mind that's significantly more aggressive in terms of compiler requirements than we've ever previously been.

I'm not totally ruling it out as a possibility, but I think you need to demonstrate that we could deliver significant performance improvements (just being measurable is rather a low bar).

Or is there a way we could use std::string_view if available, but fall back if not?

comment:2 by German M. Bravo, 5 years ago

We could start by exposing an API which receives const char*, size_t instead of const std::string&, and internally try passing/using only functions receiving the former. Afterwards, adding support for string views is a matter of just exposing APIs that convert the string view to const char*, size_t:

void api_method(const char*, size_t);

void api_method(std::string_view arg) {
    api_method(arg.data(), arg.size());
}

Or we can create our own std::string_view (a mere wrapper maybe, when the real thing is available). String views are in basically that: const char*, size_t (or rather, actually, const char* begin, const char* end).

String views have its quirks and at times get tricky, but the essence is at the end not ever constructing new strings out of substrings.

Last edited 5 years ago by German M. Bravo (previous) (diff)

comment:3 by Olly Betts, 5 years ago

Such a refactor to an API method isn't the hard part though - to actually be useful that then requires chasing through everywhere inside the library where that std::string is currently passed and changing all of those to also take a pointer and length in the same way.

Unless someone with an awful lot of spare time is keen to make it happen, I think planning to do this "whenever possible" isn't realistic. Targetting a few places where it would actually deliver a significant gain might be though.

One method which comes to mind where someone might want to pass a substring of a large string or a large value from a buffer is Xapian::TermGenerator::index_text(), but that already supports passing a Xapian::Utf8Iterator and that can be constructed from a const char* and size_t.

comment:4 by Olly Betts, 5 years ago

The other way to attack this would be from the bottom up - find hot spot cases where substrings of std::string objects are created or std::string objects are created from buffers just to pass them to lower-level internal functions that require data in a std::string. We almost never use substr() to make a substring (there are only 8 uses in core if you ignore the test suite and development tools), so this means looking for places that call std::string constructors a lot in profile data.

comment:5 by German M. Bravo, 5 years ago

The other thing that could gain improvements is storing string_views inside objects in place of strings when we know or can make sure the actual string memory will be alive during the whole life of the object. i.e. classes which have a member std::string will get a copy of the string, instead, we could potentially save a lot of time if we keep a string view instead... unless we know the “parent” string object is doomed to be destroyed before the object currently containing the string is.

I’ll try to profile this to see where string copy constructors and string copy assignments are called from within xapian during a real world usage and report back the results.

comment:6 by Olly Betts, 4 years ago

I’ll try to profile this to see where string copy constructors and string copy assignments are called from within xapian during a real world usage and report back the results.

Did you get anywhere with that?

Thinking about this some more, the cases where this is likely to help are where strings can be long and currently get copied to store them internally, but I can't think of any examples of that. We'd also need to have a copy of the string elsewhere with at least as long a lifetime.

There are places which takes a potentially long string and process it, but we almost always pass strings by const reference, so passing a string_view is unlikely to be much different (if you pass a const reference to a string_view it's really no different, which if you pass by value it avoids a level of indirection but requires passing a structure rather than a pointer).

And a lot of strings are very short - e.g. most common terms are short enough to be stored inline via SSO so the amount of copying is typically just sizeof(std::string). Both libstd++ and libc++ implement SSO, though somewhat differently, and even the shortest SSO limit of 10 bytes is enough for most terms:

On x86-64:

  • with libstdc++ sizeof(std::string) is 32 and SSO is used for <= 15 bytes
  • with libc++ sizeof(std::string) is 24 and SSO is used for <= 22 bytes
  • sizeof(std::string_view) is presumably 16 bytes, but requires the string is already stored elsewhere

On x86:

  • with libstdc++ sizeof(std::string) is 24 and SSO is used for <= 15 bytes
  • with libc++ sizeof(std::string) is 12, and it looks like SSO is used for <= 10 bytes
  • sizeof(std::string_view) is presumably 8 bytes, but requires the string is already stored elsewhere

So libc++'s implementation is clearly cleverer space-wise, though there may be time costs to the more complex encoding scheme it seems to use.

We could provide overloaded forms for API methods like TermGenerator::index_text() which accepted std::string_view for convenience, and only declare them in the API header based on the value of __cplusplus. That case at least is really just syntactic sugar for:

termgenerator.index_text(Xapian::Utf8Iterator(strview.data(), strview.size()));

So there's really no runtime efficiency improvement to be had there. Other cases which currently require a std::string might deliver a speed-up, but may need more internal replumbing.

comment:7 by Olly Betts, 15 months ago

Status: newassigned

We now require C++17 for git master.

7224b8304169b9b21bbe0d613cf0963773952a87 and 4dc268bdc4df0e98f50276205f26cc5191113edb are a start to making use of std::string_view, and deal with an easy gain from allowing using Omega's HTML parser without having to copy the input into a std::string.

I think the key win for xapian-core is just adding an overload to the public API and allow TermGenerator::index_text() to take std::string_view and I'll definitely do that. There are probably gains to be had beyond that, but somebody needs to do that work, or at least provide evidence (rather than just belief) of gains to motivate somebody to do the work. Literally using std::string_view "whenever possible" is a lot of development work for unproved benefits and would add a lot of additional ownership lifetimes to worry about.

comment:8 by Olly Betts, 13 months ago

Milestone: 1.5.0

We should do TermGenerator::index_text() for 1.5.0, and check the rest of the API to see if anything else would benefit.

comment:9 by Olly Betts, 13 months ago

Working on this, I've spotted something - any API method which takes a string as a parameter and stores it internally would arguably benefit from taking std::string_view because it avoids constructing a temporary std::string in cases where the data isn't already a std::string (notably this is always the case in the bindings; it's also the case in C++ if you pass const char*).

Also this typically seems to be mostly a matter of just changing the parameter type - we currently use const std::string& so the parameter isn't modifiable inside the function anyway, and we are zero-byte safe so don't use c_str() (or rely on data() pointing to a nul-terminated string since C++11). The other difference I can see is that std::string_view::data() can be NULL for an empty string, which again shouldn't be a problem anywhere.

The one place this is more awkward is virtual methods which are in the API for users to override - I think it's worth changing these too, and suggest users overriding such methods in their code use a construct like this to support old and new API versions (along which checking the code using the parameter):

// XAPIAN_AT_LEAST was added in 1.4.2
#if defined(XAPIAN_AT_LEAST) && XAPIAN_AT_LEAST(1,5,0)
# define STRING_TYPE std::string_view
#else
# define STRING_TYPE const std::string&
#endif

// ...

class MyExpandDecider : public Xapian::ExpandDecider {
  public:
    bool operator()(STRING_TYPE term) const {
        // ...
    }
};

We could even define XAPIAN_STRING_TYPE for new versions, but none of the 1.4.x releases made so far would define that for you so then you'd need something like this, which doesn't really seem better than the above:

#ifndef XAPIAN_STRING_TYPE
# define XAPIAN_STRING_TYPE const std::string&
#endif

comment:10 by Olly Betts, 5 months ago

Priority: normalhighest
Note: See TracTickets for help on using tickets.