Opened 5 years ago

Last modified 5 years ago

#792 new enhancement

Remove Omega restriction on databases in subdirectories

Reported by: Olivier Hallot
Owned by: Olly Betts
Priority: normal
Milestone:
Component: Omega
Version:
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description

This is a ticket for enhancement.

I have many Omega databases organized as follows:

VersionN

  • pt-BR
  • fr
  • ru
  • es
  • zn
  • ...

VersionNplusOne

  • pt-BR
  • fr
  • ru
  • es
  • zn
  • ...

That is, my databases are built separately and organized by version and language. Searches are to be done in version/lang.

The suggestion is to add a 'LOC'ation param in the CGI parsing to be able to open the database at the right place, such as

database_dir/LOC/DB

where database_dir comes from /etc/omega.conf

ref: https://xapian.org/docs/omega/cgiparams.html

Note: I'll be glad to try building xapian+omega and hacking on it, if I get a hint about where to look for the CGI parsing code.

Change History (5)

comment:1 by James Aylett, 5 years ago

So you have different databases _within_ each language directory? One approach would be to do something like fr-DATABASE for each named DATABASE. It doesn't separate by two levels of directories, but can be done now without modification in omega (but of course would need changes at your end — although you could create your existing structure and then use symlinks, which would be a minor finishing-up step).

If we do want to go down this road, the main place you'll want to look in the code is omega.cc:map_dbname_to_dir(). CGI parameters are stored in a multimap called cgi_params (which is mostly set up in cgiparam.cc but is used a lot in omega.cc so I'd look there).

Some other thoughts:

  1. LOC would have to not contain a slash, but also shouldn't be "..". There may be other characters we'd have to sanitise for security.
  2. You'll probably need xLOC, acting as xDB (it's documented in cgiparams.rst / cgiparams.html).
  3. LOC may not be the best name. DBGROUP might make more logical sense.
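To make point 1 concrete, here is a minimal sketch of the kind of check a LOC value might need before it is joined into database_dir/LOC/DB. The names is_safe_component() and loc_db_path() are hypothetical and are not existing Omega functions; rejecting backslashes and NUL bytes is an extra assumption on top of the slash and ".." cases mentioned above.

```cpp
#include <string>

// Hypothetical helper (not part of Omega): returns true if `name` looks safe
// to use as a single directory component below database_dir.
bool is_safe_component(const std::string& name) {
    if (name.empty()) return false;
    // "." and ".." would stay in, or escape from, the parent directory.
    if (name == "." || name == "..") return false;
    // No path separators; backslash and NUL are rejected as an assumption.
    if (name.find('/') != std::string::npos) return false;
    if (name.find('\\') != std::string::npos) return false;
    if (name.find('\0') != std::string::npos) return false;
    return true;
}

// Hypothetical: build database_dir/LOC/DB, returning an empty string if
// either LOC or DB fails the check above.
std::string loc_db_path(const std::string& database_dir,
                        const std::string& loc,
                        const std::string& db) {
    if (!is_safe_component(loc) || !is_safe_component(db)) {
        return std::string();
    }
    std::string path = database_dir;
    if (!path.empty() && path.back() != '/') path += '/';
    path += loc;
    path += '/';
    path += db;
    return path;
}
```

With a database_dir of /var/lib/omega/data (an example value only), LOC=6.4 and DB=fr would map to /var/lib/omega/data/6.4/fr, while LOC=.. would be refused.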

comment:2 by Olly Betts, 5 years ago

james: I think it's the other way around - pt-BR, etc are Xapian databases, but there's a set of those within each version directory.

I assume this follows on from a recent IRC discussion about why DB=6.4/fr didn't work.

I'm not sure adding another CGI variable is really the best way to resolve the current restriction. It allows LOC=6.4&DB=fr so solves the immediate problem, but it imposes needless restrictions since you can't specify a different LOC for each DB, so you couldn't search the fr database for every version together (which is potentially useful here, and certainly is in the wider context).

My thought would be to instead add a config file option to allow specifying the separator used when generating the value for $dbname and when parsing DB parameters. Then you could pick a different one which would allow / to be used in DB parameters. We could probably even support // for that separator which would allow all valid relative Unix path names to be specified (a relative path so can't validly start with /, and double / between directory components or a trailing / on it would each be redundant).
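As a rough illustration of the idea, the sketch below splits a DB parameter value on a configurable separator. split_db_param() is a hypothetical name, not Omega's actual parsing code, and skipping empty pieces is an assumption made here for simplicity. Each resulting name would still need vetting (for example for a leading / or a .. component) before being opened.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch (not Omega's code): split the value of a DB parameter
// into database names using the configured separator, e.g. "/" or "//".
std::vector<std::string> split_db_param(const std::string& value,
                                        const std::string& sep) {
    std::vector<std::string> names;
    if (sep.empty()) {
        // No separator configured: treat the whole value as one name.
        if (!value.empty()) names.push_back(value);
        return names;
    }
    std::string::size_type start = 0;
    while (true) {
        std::string::size_type end = value.find(sep, start);
        std::string name = (end == std::string::npos)
                               ? value.substr(start)
                               : value.substr(start, end - start);
        // Assumption: empty pieces are skipped rather than reported.
        if (!name.empty()) names.push_back(name);
        if (end == std::string::npos) break;
        start = end + sep.size();
    }
    return names;
}
```

With the separator set to //, a value such as 6.4/fr//6.5/fr would yield the two names 6.4/fr and 6.5/fr, whereas with the separator / a value such as a/b still yields a and b.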

The separator could default to / for compatibility and to avoid surprising people on upgrade by suddenly allowing access to databases that couldn't previously be reached, but potentially we could change that default in a new major version.

comment:3 by James Aylett, 5 years ago

I assumed that the version flip would be done by updating the config file, because I didn't see a reason for being able to query against different versions. (If the version represents different concurrently available versions of a corpus rather than an index or set-of-indexes version, that assumption wouldn't hold.)

Isn't it most common these days to use colons to separate paths? We could move DB to DBS over a number of releases using a different separator. That feels more usual to me than being able to configure the separator (which feels a bit sed s!!! ish).

We'd still need to be careful about .. and friends, which I assumed was to avoid this issue in DB entirely.

comment:4 by Olly Betts, 5 years ago

It doesn't seem justified to disrupt existing users to address this - it's been this way for 20+ years, and this is the first time somebody's actually complained about it that I'm aware of. Probably a few other people have hit it and just worked with it, and I am sympathetic to trying to lift the restriction.

Changing CGI parameter names is rather disruptive - for example, it means bookmarked searches and scripted use stop working. So we've so far generally avoided such changes, and also tried to ensure that any changes in meaning do something sensible with an existing bookmarked search - e.g. the encoding of $filters has evolved over time, which means incompatible changes to the behaviour of xFILTERS, but (a) we've carried code to also compare xFILTERS with what the old version would have given for $filters for a transitional period, and (b) if Omega thinks the filters have changed it forces the first page of results, which is generally reasonable (people are unlikely to intentionally bookmark page 2 of a search except perhaps in the short term) and also plays well with automated use (where you'd generally just not pass xFILTERS at all).

: is a common separator for listing several pathnames on Unix-like platforms, though there doesn't seem to be a universal standard - e.g. Microsoft ones seem to use ;, and ISTR classic Macs used : for the role / has on Unix.

: might be a reasonable fixed choice if we were designing this from a clean slate, but I'm not sure I like the idea of suddenly changing from / to : - that's again a potentially disruptive change to people who probably don't care that they can't currently have / in database names. Fixing it as : would also mean that you then couldn't have : in a database name, which is currently allowed, and there are probably users with databases so named.

It seems rather contrary to solve the inability to have / in a database name by disallowing some other character instead - what would we do when someone opens a ticket asking to allow : again? Change to another fixed value and round and round we go until people stop complaining?

Making it completely configurable is unnecessarily general, and some characters would just be a bad idea to use because they'd result in URL escaping (e.g. & -> DB=a%26b%26c) or just be confusing (e.g. an alphanumeric such as a), so perhaps it would be better to restrict the allowed choices. / is compatible with current deployments, and // has the nice property of allowing any valid relative database pathname to be specified. So we could allow those two and :.
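A minimal sketch of what restricting the choices could look like, assuming a config value that is checked against the three candidates named above and falls back to / otherwise. Both function names are hypothetical, and silently falling back (rather than reporting a config error) is just one possible choice. The second function shows the same separator being used when generating a $dbname-style value.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch (not Omega's code): accept only "/", "//" or ":" as
// the database name separator, defaulting to "/" for compatibility.
std::string choose_separator(const std::string& configured) {
    if (configured == "/" || configured == "//" || configured == ":") {
        return configured;
    }
    // Assumption: anything else silently falls back to the current behaviour.
    return "/";
}

// Hypothetical: join database names with the chosen separator, as would be
// needed when generating a value like $dbname.
std::string join_dbnames(const std::vector<std::string>& names,
                         const std::string& sep) {
    std::string out;
    for (std::vector<std::string>::size_type i = 0; i < names.size(); ++i) {
        if (i != 0) out += sep;
        out += names[i];
    }
    return out;
}
```

For example, a configured value of & would be ignored in favour of /, and joining 6.4/fr and 6.5/fr with // gives 6.4/fr//6.5/fr.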

We already need to vet filenames for .. (in CGI parameter FMT and arguments to $include{} and $log{}) so that's not a big deal.
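For illustration only, a check for .. as a path component could look something like the sketch below. This is not the code Omega actually uses for FMT, $include{} or $log{}; has_dotdot_component() is a hypothetical name. It only matches .. as a whole component, so names such as a..b are left alone.

```cpp
#include <string>

// Hypothetical sketch (not Omega's code): returns true if any
// '/'-separated component of `path` is exactly "..".
bool has_dotdot_component(const std::string& path) {
    std::string::size_type start = 0;
    while (start <= path.size()) {
        std::string::size_type end = path.find('/', start);
        if (end == std::string::npos) end = path.size();
        // A component of length 2 consisting of ".." is the case to refuse.
        if (end - start == 2 && path.compare(start, 2, "..") == 0) {
            return true;
        }
        start = end + 1;
    }
    return false;
}
```

Under this sketch 6.4/fr passes, while ../etc, a/../b and a/.. are all flagged.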

comment:5 by Olly Betts, 5 years ago

Summary changed from: Add a CGI param "LOC" to indicate a path to a leaf from database_dir folder
Summary changed to: Remove Omega restriction on databases in subdirectories

Retitling to reflect the actual problem rather than a particular proposed solution to it.
