Opened 5 years ago
Last modified 5 years ago
#792 new enhancement
Remove Omega restriction on databases in subdirectories
Reported by: | Olivier Hallot | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | Omega | Version: | |
Severity: | normal | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description
This is an enhancement request.
I have many Omega databases organized as
versioN
- pt-BR
- fr
- ru
- es
- zn
- ...
- pt-BR
- fr
- ru
- es
- zn
- ...
That is, my databases are built separately and organized by version and language. Searches are to be done in version/lang.
The suggestion is to add a 'LOC'ation param to the CGI parsing so that the database can be opened at the right place, such as
database_dir/LOC/DB
where database_dir comes from /etc/omega.conf
ref: https://xapian.org/docs/omega/cgiparams.html
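To make the proposal concrete, here is a minimal sketch of how a `LOC` value could be combined with `DB` under `database_dir`. The function names `is_safe_component` and `resolve_db_path` are hypothetical and are not part of Omega; the checks shown (no slash, not `..`) are only an assumption about what validation would be needed.

```cpp
#include <string>

// Hypothetical helper, not part of Omega: a single path component is
// accepted only if it is non-empty, is not "." or "..", and has no '/'.
static bool is_safe_component(const std::string& s) {
    return !s.empty() && s != "." && s != ".." &&
           s.find('/') == std::string::npos;
}

// Hypothetical sketch of the proposed mapping: database_dir/LOC/DB.
// Returns false (leaving path_out untouched) if LOC or DB is unsafe.
bool resolve_db_path(const std::string& database_dir,
                     const std::string& loc,
                     const std::string& db,
                     std::string& path_out) {
    if (!is_safe_component(loc) || !is_safe_component(db))
        return false;
    std::string path = database_dir;
    if (!path.empty() && path.back() != '/')
        path += '/';
    path += loc;
    path += '/';
    path += db;
    path_out = path;
    return true;
}
```

With `database_dir` set to `/var/lib/omega/data` (an example value), `LOC=6.4&DB=fr` would resolve to `/var/lib/omega/data/6.4/fr`, while `LOC=..` would be rejected.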
Note: I'd be glad to try building xapian+omega and hacking on it, if I get a hint about where to look for the CGI parsing code.
Change History (5)
comment:1 by , 5 years ago
comment:2 by , 5 years ago
james: I think it's the other way around - `pt-BR`, etc are Xapian databases, but there's a set of those within each version directory.
I assume this follows on from a recent IRC discussion about why `DB=6.4/fr` didn't work.
I'm not sure adding another CGI variable is really the best way to resolve the current restriction. It allows `LOC=6.4&DB=fr` so solves the immediate problem, but it imposes needless restrictions since you can't specify a different `LOC` for each `DB`, so you couldn't search the `fr` database for every version together (which is potentially useful here, and certainly is in the wider context).
My thought would be to instead add a config file option to allow specifying the separator used when generating the value for `$dbname` and when parsing `DB` parameters. Then you could pick a different one which would allow `/` to be used in `DB` parameters. We could probably even support `//` for that separator, which would allow all valid relative Unix path names to be specified (a relative path can't validly start with `/`, and a double `/` between directory components or a trailing `/` would each be redundant).
The separator could default to `/` for compatibility and to avoid surprising people on upgrade by suddenly allowing access to databases that couldn't previously be reached, but potentially we could change that default in a new major version.
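As a rough illustration of the separator idea, the sketch below splits one `DB` parameter value into database names using a configurable separator string. The function name `split_db_param` is hypothetical; this is not Omega's actual parsing code, and each resulting name would still need vetting before being used as a path.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch, not Omega's actual code: split one DB parameter
// value into database names on a configurable separator such as "/" or
// "//". Empty names (e.g. from a trailing separator) are skipped.
std::vector<std::string>
split_db_param(const std::string& value, const std::string& sep) {
    std::vector<std::string> names;
    if (sep.empty()) {
        // No separator configured: treat the whole value as one name.
        if (!value.empty()) names.push_back(value);
        return names;
    }
    std::string::size_type start = 0;
    while (true) {
        std::string::size_type end = value.find(sep, start);
        std::string name = (end == std::string::npos)
            ? value.substr(start)
            : value.substr(start, end - start);
        if (!name.empty()) names.push_back(name);
        if (end == std::string::npos) break;
        start = end + sep.size();
    }
    return names;
}
```

With the separator set to `//`, a value such as `DB=6.4/fr//7.0/fr` would yield the two names `6.4/fr` and `7.0/fr`, whereas with the default `/` the same value would be split at every slash, matching the current behaviour.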
comment:3 by , 5 years ago
I assumed that the version flip would be done by updating the config file, because I didn't see a reason for being able to query against different versions. (If the version represents different concurrently available versions of a corpus rather than an index or set-of-indexes version, that assumption wouldn't hold.)
Isn't it most common these days to use colons to separate paths? We could move `DB` to `DBS` over a number of releases using a different separator. That feels more usual to me than being able to configure the separator (which feels a bit `sed s!!!`-ish).
We'd still need to be careful about `..` and friends; I assumed the current restriction was there to avoid this issue in `DB` entirely.
comment:4 by , 5 years ago
It doesn't seem justified to disrupt existing users to address this - it's been this way for 20+ years, and this is the first time somebody's actually complained about it that I'm aware of. Probably a few other people have hit it and just worked with it, and I am sympathetic to trying to lift the restriction.
Changing CGI parameter names is rather disruptive - for example, it means bookmarked searches and scripted use stop working. So we've so far generally avoided such changes, and also tried to ensure that any changes in meaning do something sensible with an existing bookmarked search - e.g. the encoding of `$filters` has evolved over time, which means incompatible changes to the behaviour of `xFILTERS`, but (a) we've carried code to also compare `xFILTERS` with what the old version would have given for `$filters` for a transitional period, and (b) if Omega thinks the filters have changed it forces the first page of results, which is generally reasonable (people are unlikely to intentionally bookmark page 2 of a search except perhaps in the short term) and also plays well with automated use (where you'd generally just not pass `xFILTERS` at all).
`:` is a common separator for listing several pathnames on Unix-like platforms, though there doesn't seem to be a universal standard - e.g. Microsoft ones seem to use `;`, and ISTR classic Macs used `:` for the role `/` has on Unix.
`:` might be a reasonable fixed choice if we were designing this from a clean slate, but I'm not sure I like the idea of suddenly changing from `/` to `:` - that's again a potentially disruptive change to people who probably don't care that they can't currently have `/` in database names. Fixing it as `:` would also mean that you then couldn't have `:` in a database name, which is currently allowed, and there are probably users with databases so named.
It seems rather contrary to solve the inability to have `/` in a database name by disallowing some other character instead - what would we do when someone opens a ticket asking to allow `:` again? Change to another fixed value and round and round we go until people stop complaining?
Making it completely configurable is unnecessarily general, and some characters would just be a bad idea to use because they'd result in URL escaping (e.g. `&` -> `DB=a%26B%26c`) or just be confusing (e.g. an alphanumeric such as `a`), so perhaps it would be better to restrict the allowed choices. `/` is compatible with current deployments, and `//` has the nice property of allowing any valid relative database pathname to be specified. So we could allow those two and `:`.
We already need to vet filenames for `..` (in CGI parameter `FMT` and arguments to `$include{}` and `$log{}`) so that's not a big deal.
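A sketch of what those two checks might look like, under the assumption that the separator is restricted to `/`, `//` and `:` and that a database name is accepted only if it is a clean relative path. Both function names are hypothetical, and this is not the vetting code Omega currently uses for `FMT`, `$include{}` or `$log{}`.

```cpp
#include <string>

// Hypothetical: accept only the three separator choices suggested above.
bool is_allowed_separator(const std::string& sep) {
    return sep == "/" || sep == "//" || sep == ":";
}

// Hypothetical: accept a database name only if it is a relative path with
// no empty components (so no leading, trailing or doubled '/') and no
// ".." component.
bool is_safe_relative_path(const std::string& path) {
    if (path.empty()) return false;
    std::string::size_type start = 0;
    while (true) {
        std::string::size_type end = path.find('/', start);
        std::string comp = (end == std::string::npos)
            ? path.substr(start)
            : path.substr(start, end - start);
        if (comp.empty() || comp == "..") return false;
        if (end == std::string::npos) return true;
        start = end + 1;
    }
}
```

Under these assumptions `6.4/fr` would be accepted, while `/etc`, `6.4//fr`, `fr/` and `../fr` would all be rejected.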
comment:5 by , 5 years ago
Summary: | Add a CGI param "LOC" to indicate a path to a leaf from database_dir folder → Remove Omega restriction on databases in subdirectories |
---|
Retitling to reflect the actual problem rather than a particular proposed solution to it.
So you have different databases _within_ each language directory? One approach would be to do something like `fr-DATABASE` for each named DATABASE. It doesn't separate by two levels of directories, but can be done now without modification in omega (though of course it would need changes at your end, although you could create your existing structure and then use symlinks, which would be a minor finishing-up step).
If we do want to go down this road, the main place you'll want to look in the code is `omega.cc:map_dbname_to_dir()`. CGI parameters are managed in a multimap called `cgi_params` (which is mostly set up in `cgiparam.cc` but is used a lot in `omega.cc` so I'd look there).
Some other thoughts:
- `LOC` would have to not contain a slash, but also shouldn't be `..`. There may be other characters we'd have to sanitise for security.
- `xLOC`, acting as `xDB` (it's documented in `cgiparams.rst`/`cgiparams.html`).
- `LOC` may not be the best name. `DBGROUP` might make more logical sense.