Opened 9 years ago
Closed 9 years ago
#698 closed defect (fixed)
invalid cross-device link (NFS?)
Reported by: | mark dufour | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.2.23 |
Component: | Backend-Chert | Version: | 1.2.12 |
Severity: | normal | Keywords: | |
Cc: | | Blocked By: | |
Blocking: | | Operating System: | Linux |
Description
hi,
we've recently switched to using xapian for version 7.2 of our groupware solution called zarafa (https://www.zarafa.com/), and are very happy with it so far!
we've run into a weird error recently though, which at first glance looks like a bug in xapian when used with NFS. we don't have an strace yet, but we do have one traceback:
  File "/usr/lib/python2.7/dist-packages/zarafa_search/__init__.py", line 159, in main
    plugin.commit()
  File "/usr/lib/python2.7/dist-packages/zarafa_search/plugin_xapian.py", line 115, in commit
    db.delete_document('XK:'+doc['sourcekey'].lower())
  File "/usr/lib/python2.7/contextlib.py", line 154, in __exit__
    self.thing.close()
DatabaseError: Couldn't update base file /srv/zarafa/index/1909D712B7DF49A0B1253DC64DD954CF-7B15C461919C4934A43DFC2D7479B7B8/spelling.baseB: Invalid cross-device link
the customer has recently moved their xapian databases to NFS, and is experiencing this issue now and then.
according to someone here who is more into file systems, xapian should perhaps check for this error, and if it occurs retry the respective operation in a safer way..?
Change History (9)
comment:1 by , 9 years ago
Component: | Other → Backend-Chert |
---|---|
Description: | modified (diff) |
comment:2 by , 9 years ago
Ping - if you don't provide the requested information, there's not much we can do to help.
comment:3 by , 9 years ago
thanks for the reminder! unfortunately we are also waiting for more information from our customer. I will see if I can ping somebody.
comment:4 by , 9 years ago
Thanks.
If it's just an occasional bogus error from that rename() (e.g. due to that NFS bug, or something similar), we could simply retry the rename() if it fails with EXDEV (it would need to be a limited number of retries to avoid the risk of an infinite loop). But I'd rather not just make that sort of change without understanding more about the situation it's trying to solve - it's not good to end up with a codebase full of hacks which we aren't sure about the reasons for.
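The bounded retry described above could be sketched roughly as follows. This is a Python illustration only, not the change made in Xapian (which is C++); `rename_with_retry`, its `max_retries` default and the injectable `rename` parameter are hypothetical names chosen for this example.

```python
import errno
import os


def rename_with_retry(src, dst, max_retries=5, rename=os.rename):
    """Rename src to dst, retrying a bounded number of times on EXDEV.

    Hypothetical helper for illustration; not Xapian's actual code.
    Returns the number of retries that were needed.
    """
    attempt = 0
    while True:
        try:
            rename(src, dst)
            return attempt
        except OSError as e:
            # Only retry the specific error under discussion, and only a
            # limited number of times, so a genuine cross-device rename
            # cannot turn into an infinite loop.
            if e.errno != errno.EXDEV or attempt >= max_retries:
                raise
            attempt += 1
```

Any error other than EXDEV is raised immediately, and a persistent EXDEV is still reported once the retries are used up, so a real misconfiguration is not hidden.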
Also probably worth mentioning that the default backend (glass) in 1.4.x will do less of this renaming (one file per commit rather than 3-5 files per commit) so will probably suffer less from this issue.
comment:5 by , 9 years ago
thanks! from the customer (will try to ping them sooner next time):
---
So, unfortunately we are using a very old NFS server, the server is running Debian Lenny.
Kernel version 2.6.26-1-amd64
nfs-kernel-server package version:
Architecture: i386
Source: nfs-utils
Version: 1:1.1.2-6lenny2
At the Zarafa server side, we are using Debian Wheezy, so the versions are:
root@zarafa1:~# nfsstat -m
/mnt/backup from 192.168.1.199:/backup/zarafa
Flags: rw,relatime,vers=3,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.1.199,mountvers=3,mountport=44108,mountproto=udp,local_lock=none,addr=192.168.1.199
root@zarafa1:~# uname -a
Linux zarafa1 3.2.0-4-amd64 #1 SMP Debian 3.2.46-1+deb7u1 x86_64 GNU/Linux
root@zarafa1:~# modinfo nfs
filename: /lib/modules/3.2.0-4-amd64/kernel/fs/nfs/nfs.ko
license: GPL
author: Olaf Kirch <okir@monad.swb.de>
alias: nfs4
depends: fscache,sunrpc,lockd,auth_rpcgss,nfs_acl
intree: Y
vermagic: 3.2.0-4-amd64 SMP mod_unload modversions
parm: callback_tcpport:portnr
parm: cache_getent:Path to the client cache upcall program (string)
parm: cache_getent_timeout:Timeout (in seconds) after which the cache upcall is assumed to have failed (ulong)
parm: enable_ino64:bool
parm: nfs4_disable_idmapping:Turn off NFSv4 idmapping when using 'sec=sys' (bool)
root@zarafa1:~# nfsstat -v -m
/mnt/backup from 192.168.1.199:/backup/zarafa
Flags: rw,relatime,vers=3,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.1.199,mountvers=3,mountport=44108,mountproto=udp,local_lock=none,addr=192.168.1.199
As you mentioned it can be an nfs bug, so we are working on the new storage area to provide local filesystem for the indexes. :)
---
so I guess the used xapian version is 1.2.12.
comment:6 by , 9 years ago
Version: | → 1.2.12 |
---|
I guess they're aware Debian lenny has been out of support for nearly 4 years now, since they said very old.
We could retry on EXDEV as a workaround - workarounds for bugs in such old kernels seem a bit crazy, but it isn't clear what kernel version this actually got fixed in, and the overhead of the workaround is tiny (just a few lines of code on an error handling path). Looks like 2.6.32 is just about to finally reach its LTS end of life:
https://en.wikipedia.org/wiki/Linux_kernel
But since they're using that old an OS, I'd guess they're pretty conservative about changes so might be reluctant to install the latest Xapian 1.2.x once that change is in a release. Their other option is to rebuild the Xapian package version they're using with just the patch for this in (it will probably apply to 1.2.12 without difficulty - I don't think there have been major changes in this area).
Or just move to local storage, as it sounds like they're working on.
comment:7 by , 9 years ago
Milestone: | → 1.2.23 |
---|---|
Status: | new → assigned |
Overall it seems it's probably worth the effort to handle this, so I've fixed it in [3b5fae4719a706a357174203c26b0b17c0233bc5/git] for 1.3.5.
Will backport to 1.2.x if it's not complex to do so.
comment:9 by , 9 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Backported for 1.2.23 in [a5f0a55c123d679f43ad90899fe55cf9b0162507/git].
I'd generally not recommend hosting databases on NFS. There are many corner cases that NFS doesn't really handle correctly, and good performance is rather too dependent on the exact configuration.
But aside from NFS infelicities, I would expect it to work.
The operation which fails works like so: the new contents are written to a temporary file, then rename() is used to move the temporary file to its final name.
This is a very standard pattern for creating a file without having a partial file in place - it effectively allows atomic creation of a file.
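For illustration, the general write-to-temporary-then-rename pattern looks roughly like this in Python. It is a minimal sketch assuming a POSIX system (where rename() atomically replaces an existing destination); `atomic_write` and the `.tmp-` prefix are hypothetical, and this is not how Xapian itself names or writes its base files.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write bytes to path so readers see the old file or the new one,
    never a partially written one. Hypothetical helper for illustration."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the destination directory so that the
    # rename is within a single directory (and hence a single filesystem).
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the data is on disk first
        # On POSIX, rename() replaces any existing file at path atomically.
        os.rename(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise
```

Because the temporary file lives in the same directory as the final name, an EXDEV failure from the rename step is unexpected, which is what makes the reported error look odd.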
It seems rename is failing because the source and destination aren't on the same filing system (that's what EXDEV means according to man rename), which doesn't seem like it should be the case here.
I suspect your "someone" is suggesting the file should be moved in a way which works across filing systems, but that breaks it being an atomic update, which is the whole point of creating it as a temporary file in the first place.
I think you need to work out why rename() thinks it's being asked to rename across filing systems when the rename is within a directory.
Perhaps there's some sort of overlay or union filing system in play too? If so, I think it needs to be configured such that rename() within a directory works.
Or it could be an NFS bug - older kernels had bugs which could return EXDEV incorrectly, such as:
http://www.spinics.net/lists/linux-nfs/msg17306.html
That's from 2010, and a quick look at recent kernel source suggests that it's since been addressed, but perhaps they're running an old enough kernel to be affected, or it's not entirely fixed, or there's another similar bug.
I think we need more info to determine what's actually going on.
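One cheap piece of information to gather is whether the files inside the database directory report the same device ID as the directory itself, which may hint at an overlay or union filing system. The sketch below is only a diagnostic idea; `entries_on_other_devices` is a hypothetical helper, and it would not reveal a kernel bug that returns EXDEV spuriously.

```python
import os


def entries_on_other_devices(directory):
    """Return the names in directory whose st_dev differs from the
    directory's own st_dev. Hypothetical diagnostic helper."""
    dir_dev = os.stat(directory).st_dev
    return sorted(
        name
        for name in os.listdir(directory)
        # lstat so that symlinks are reported as themselves, not their targets
        if os.lstat(os.path.join(directory, name)).st_dev != dir_dev
    )
```

An empty list from the database directory would suggest the directory and its files are seen as one filing system, pointing more towards a bogus EXDEV than a real cross-device rename.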
Also, for completeness, which Xapian version is being used?