Opened 10 years ago
Last modified 7 months ago
#651 assigned enhancement
Protect writable fds by setting filepos very high
Reported by: | Olly Betts | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 2.0.0 |
Component: | Backend-Glass | Version: | |
Severity: | normal | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description
I noticed that sqlite has SQLITE_MINIMUM_FILE_DESCRIPTOR and by default avoids using fds less than 3 in case something else in the same process tries to write to them, thinking output is going to stdout or stderr (e.g. a call to write(2, ...)
):
http://sqlite.org/compile.html#minimum_file_descriptor
Elsewhere I read that this was apparently added in response to actual database corruption from such situations (I bet that was a pain to debug!)
I think Xapian should definitely avoid at least fds 1 and 2 for writable handles, probably avoid fd 0 for reading, and perhaps just avoid anything < 3 for read or write like sqlite seems to.
I'll note we already have special case code to handle the lock file being opened as a low fd on Unix, where we fork a child process and want to hook up its stdin and stdout to the parent.
Marking for brass initially, but we should do this for all backends.
Attachments (1)
Change History (21)
comment:1 by , 10 years ago
Component: | Backend-Brass → Backend-Glass |
---|
comment:3 by , 10 years ago
Status: | new → assigned |
---|
This code sets the file position to the maximum possible off_t
value, which seems to work well at least on x86 Linux 2.6.32-68-generic (attempts to write a single byte with write()
fail with EFBIG
) and x86_64 Linux 3.2.0-4-amd64 (fails with EINVAL
):
#define SHIFT (sizeof(off_t) * 8 - 2) lseek(fd, (((off_t)1 << SHIFT)-1)|((off_t)1 << SHIFT), SEEK_SET);
Meanwhile pwrite()
works correctly.
If a system's libc emulates pwrite()
with lseek()
followed by write()
then this initial lseek()
will be harmless, but we won't be protected from stray writes.
Theoretically, we could protect ourselves by opening the file with O_APPEND
which shouldn't affect pwrite()
. However on Linux it does, and even where this worked it would result in extra garbage appearing in the file, whereas setting the offset absurdly high makes stray writes fail. The only downside I can see is that if some system did support such a large offset, the file size would increase to a ludicrous size. But most Unix FSes don't explicitly store all the zero-filled blocks this would create, so it wouldn't generally fill up huge amounts of disk, and the huge file size is a clear clue that there's a stray write somewhere.
This technique really needs testing on more systems - I'll tidy up the test program to more clearly report the outcome and attach it here.
comment:4 by , 10 years ago
If compiled with -D_FILE_OFFSET_BITS=64
, the x86 Linux machine fails on the initial lseek() with EINVAL
. It seems the maximum value we can seek to is somewhere between 1<<40 and 1<<41, but at 1<<40 we can successfully call write()
.
GNU/kFreeBSD 9.0-2-amd64 (Debian combining freebsd kernel and GNU libc) works - write()
fails with EFBIG
.
comment:5 by , 10 years ago
OS X 10.10 runs without warnings until the write(fd, "", 1)
line at which point it exits with:
Filesize limit exceeded: 25
This is SIGXFSZ (which also exists under linux), with ulimit -f (which I think is RLIMIT_FSIZE?) unlimited (both soft & hard).
SIGXFSZ is part of the 4.2BSD spec and POSIX.1-2001. There doesn't seem to be great documentation of this for OS X:
A file I/O operation that would create a file larger that the process' soft limit will cause the write to fail and a signal SIGXFSZ to be generated; this normally terminates the process, but may be caught.
Stevens is possibly more helpful: http://books.google.co.uk/books?id=kCTMFpEcIOwC&pg=PA925&lpg=PA925&dq=SIGXFSZ+mac+os+x&source=bl&ots=zvECZPZtkC&sig=8kbBKmpCO75jCRD98XieFkveoOY&hl=en&sa=X&ei=EhaDVMaBN8atU6SbhOAP&ved=0CDsQ6AEwBA#v=onepage&q=SIGXFSZ%20mac%20os%20x&f=false
comment:6 by , 10 years ago
The link doesn't seem to work for me, but I think the meaning is clear enough for our purposes here.
My initial thought was that this sinks the idea (at least on OS X) but actually this does seem a better reaction than allowing the process to corrupt the database, and in some ways better than the stray write()
returning -1 and some errno value which the caller may just quietly ignore.
I'll extend the test program to catch that signal and continue with checks.
comment:7 by , 10 years ago
The link to Stevens is just to the latest (I believe) Programming in the UNIX Environment, so the index of any edition should yield the same (limited) passage. The quote was from the OS X man pages.
Catching the signal & doing something suitable should also work on eg Linux where RLIMIT_FSIZE is set lower than expected?
comment:8 by , 10 years ago
I've updated the test program to catch that signal and report it nicely, then continue.
Also, if the maximum possible off_t value isn't lseekable(), it'll binary chop to find the actual seekable maximum. On my x86_64 Debian unstable machine, I get:
lseek failed: Invalid argument attempted to set pos to 0x7fffffffffffffff binary chopping to find actual max: lseek to 0x3fffffffffffffff fails, errno=22 (Invalid argument) lseek to 0x1fffffffffffffff fails, errno=22 (Invalid argument) lseek to 0xfffffffffffffff fails, errno=22 (Invalid argument) lseek to 0x7ffffffffffffff fails, errno=22 (Invalid argument) lseek to 0x3ffffffffffffff fails, errno=22 (Invalid argument) lseek to 0x1ffffffffffffff fails, errno=22 (Invalid argument) lseek to 0xffffffffffffff fails, errno=22 (Invalid argument) lseek to 0x7fffffffffffff fails, errno=22 (Invalid argument) lseek to 0x3fffffffffffff fails, errno=22 (Invalid argument) lseek to 0x1fffffffffffff fails, errno=22 (Invalid argument) lseek to 0xfffffffffffff fails, errno=22 (Invalid argument) lseek to 0x7ffffffffffff fails, errno=22 (Invalid argument) lseek to 0x3ffffffffffff fails, errno=22 (Invalid argument) lseek to 0x1ffffffffffff fails, errno=22 (Invalid argument) lseek to 0xffffffffffff fails, errno=22 (Invalid argument) lseek to 0x7fffffffffff fails, errno=22 (Invalid argument) lseek to 0x3fffffffffff fails, errno=22 (Invalid argument) lseek to 0x1fffffffffff fails, errno=22 (Invalid argument) lseek to 0xfffffffffff fails, errno=22 (Invalid argument) lseek to 0x7ffffffffff works lseek to 0xbffffffffff works lseek to 0xdffffffffff works lseek to 0xeffffffffff works lseek to 0xf7fffffffff works lseek to 0xfbfffffffff works lseek to 0xfdfffffffff works lseek to 0xfefffffffff works lseek to 0xff7ffffffff works lseek to 0xffbffffffff works lseek to 0xffdffffffff works lseek to 0xffeffffffff works lseek to 0xfff7fffffff works lseek to 0xfffbfffffff works lseek to 0xfffdfffffff works lseek to 0xfffefffffff works lseek to 0xffff7ffffff works lseek to 0xffffbffffff works lseek to 0xffffdffffff works lseek to 0xffffeffffff works lseek to 0xfffff7fffff works lseek to 0xfffffbfffff works lseek to 0xfffffdfffff works lseek to 0xfffffefffff works lseek to 0xffffff7ffff works lseek to 0xffffffbffff works lseek to 0xffffffdffff works lseek to 0xffffffeffff works lseek to 0xfffffff7fff works lseek to 0xfffffffbfff works lseek to 0xfffffffdfff works lseek to 0xfffffffefff works lseek to 0xffffffff7ff fails, errno=22 (Invalid argument) lseek to 0xffffffff3ff fails, errno=22 (Invalid argument) lseek to 0xffffffff1ff fails, errno=22 (Invalid argument) lseek to 0xffffffff0ff fails, errno=22 (Invalid argument) lseek to 0xffffffff07f fails, errno=22 (Invalid argument) lseek to 0xffffffff03f fails, errno=22 (Invalid argument) lseek to 0xffffffff01f fails, errno=22 (Invalid argument) lseek to 0xffffffff00f fails, errno=22 (Invalid argument) lseek to 0xffffffff007 fails, errno=22 (Invalid argument) lseek to 0xffffffff003 fails, errno=22 (Invalid argument) lseek to 0xffffffff001 fails, errno=22 (Invalid argument) lseek to 0xffffffff000 works Max lseekable off_t = 0xffffffff000 info: write() failed with errno=27 (File too large) Cool, that basically worked - please report the info and/or warnings given above
I tested the signal handler on Linux by setting the limit with:
ulimit -f 2000000000
I don't think we want to go installing a signal handler in the actual library though - in general I'd find it surprising if a library implicitly install a signal handler like that, and I'm not sure what we'd do after catching the signal anyway, as it most likely comes from someone else's code so we can't return control to anywhere sensible.
by , 10 years ago
Attachment: | lseek-safety-test.c added |
---|
test program for lseek() to max off_t value
comment:10 by , 10 years ago
OS X:
$ ./a.out test-file info: write() cause SIGXFSZ to be raised Cool, that basically worked - please report the info and/or warnings given above ulimit -f 1000000 ./a.out test-file RLIMIT_FSIZE cur = 0x3d090000 RLIMIT_FSIZE max = 0x3d090000 warning: write() succeeded, writing 0 bytes lseek SEEK_END now returns 0x7ffffffffffff000 (off_t_max was 0x7fffffffffffffff) warning: write changed file pos to 0x7ffffffffffff000 Cool, that basically worked - please report the info and/or warnings given above
…and then hangs (and disappears from the process listing?). Looks like an OS X bug somewhere here (first time I ran it however, it was fine; just caught the signal at write).
comment:11 by , 10 years ago
Milestone: | 1.3.3 → 1.3.4 |
---|
I've committed a fix which means we no longer use fds < 3 for writable database tables in master. We can still use them for other writable files, and I've not tried to implement the seek trick, as it seems we need to check portability more first.
comment:12 by , 10 years ago
Backported the fix to not use fds < 3 for 1.2.21 (for chert and brass) in [d6e1aa32aa8dfb3ed203cd203d6f6e0cad8a6ef4].
comment:13 by , 9 years ago
Milestone: | 1.3.4 → 1.4.x |
---|---|
Summary: | Avoid low fds → Protect writable fds by setting filepos very high |
Retitling to reflect what the issue is now about.
And time to get brutal and re-milestone this - it could be slotted into 1.4.x without problems.
comment:14 by , 8 years ago
Milestone: | 1.4.x → 1.4.2 |
---|
OK, so let's plan to roll this out per platform, carefully checking for any quirks of each platform as we do - otherwise we'll put this off forever while we worry about how it behaves on platforms we can't easily directly test on, whereas we can at least provide extra protection for some common platforms. Based on the above, it looks like Linux and OS X could be in the first batch.
comment:15 by , 8 years ago
Milestone: | 1.4.2 → 1.5.0 |
---|
Looking at different Linux FSes on x86-64 with kernel 4.7.0, these are the max lseek-able offsets for some FSes:
- btrfs - 0x7fffffffffffffff (max positive
off_t
value) - ext4 (4KB blocksize) - 0xffffffff000 (one block below 16TB, which is the documented max filesize)
- ext2/3 (1KB blocksize) - 0x404043000
- ext2/3 (4KB blocksize) - 0x1feff7fc000
Even with an enhanced version of the probing code, some of these take a lot of lseek() calls to probe, and that is multiplied by the number of tables in use (up to 6) so naively probing on every open to write doesn't seem a good idea.
The extra protection seems nice, but given we now avoid fds < 3 this is just an interesting idea which isn't actually worth the trouble. It also doesn't protect if the code writing to a stale fd calls lseek()
first.
But to think it through a bit more, the lseek offset to use could be cached in the DB (e.g. saved in the version file with the other per-table info like root block number, freelist data, etc). We need to detect if the file is moved to a different partition, or the FS rebuilt, or other things which might invalidate our cached value (e.g. also cache device ID and inode of the DB file). Or we just check with lseek()
first to one more than the cached value (which should fail), then to the cached value (which should succeed). Most of the time that's all we need, but if we don't get the failure or success, we would need to reprobe.
The only alternative to caching I can see if to build a lot of FS-specific knowledge into the library and check the type of FS and things like the block size if necessary, which sounds like a bad idea.
So caching seems to be needed but adding info to the version file would need to wait for glass+1 - that means that unless someone can think of a better plan, this isn't 1.4.x material (or at least not unless we have a development backend in 1.4.x).
comment:16 by , 22 months ago
The only alternative to caching I can see if to build a lot of FS-specific knowledge into the library and check the type of FS and things like the block size if necessary, which sounds like a bad idea.
We could potentially just have a list of offsets to try seeking to (generated in advance), and use the largest which is seekable. It works best if we use the highest offset supported, but it isn't vital.
Probably we could binary chop the candidate list (a failed lseek() presumably leaves the offset unchanged, though the man page doesn't clearly say so AFAICS on Linux at least) so probably we'd only need 3 or 4 lseek() calls per table opened to write (which would allow up to 7 or 15 values).
And as noted above we could also cache a successfully probed offset in the version file - copying the database to a new filing system would potentially invalidate the cached value, but it's cheap to check for that - just try to lseek to an offset one higher and if that fails try to lseek to the cached offset which means two lseek calls in the common case (one failed, one successful). It probably needs to be per-table since tables could be symlinked (or bind-mounted) to be on different FSes. Poking a bit around the linux kernel code I also saw something that suggested the ext4 filesize limit might vary even on a single FS (it looked like if a file isn't "extent-mapped" the size limit is a 32 bit value). We could use the probed result for the first table as the starting point for others though.
comment:17 by , 7 months ago
More results from testing on cfarm.net machines:
- macOS 12.6 is able to seek to the maximum off_t value (same as result reported for 10.10 above).
- OpenBSD 7.5 (cfarm220) is able to seek to the maximum off_t value.
- FreeBSD 14.0 (cfarm240) is able to seek to the maximum off_t value.
- NetBSD 10.0 (tested via CI) is able to seek to the maximum off_t value.
- AIX 7.1 and 7.3 (on cfarm111 and cfarm119) max is
0xffffffff000
(same as Linux ext4); we don't have CI coverage though so would need a full manual test build. - Solaris 11.4 (cfarm215) is able to seek to the maximum off_t value; we don't have CI coverage though so would need a full manual test build.
comment:18 by , 7 months ago
Milestone: | 1.5.0 → 2.0.0 |
---|
Pushed a simple implementation for Linux, FreeBSD, macOS and OpenBSD in 73b1b80749fc6e71d2a5934449727b1387e0ec75.
We can enable for more platforms as we are able to test them.
This is only done for glass currently since honey isn't block based and writes at the file position. We could consider changing that, but glass is still the default backend for now so that's where it's most useful to add protection.
No longer a blocker for 1.5.0.
comment:19 by , 7 months ago
Tested NetBSD too. An emerging pattern seems to be that on modern platforms we can lseek to the maximum possible off_t
value, so trying to seek to that and not doing anything further seems a reasonable default approach.
Linux is actually something of an oddity here in that it doesn't allow this for some FSes. A notable FS which doesn't is ext4 which is widely used (I suspect it's still the most widely used FS on Linux) so adding a fallback to the ext4 value on Linux seems reasonable, and that is what I've already implemented.
comment:20 by , 7 months ago
I wondered about mingw. There's no pread/pwrite there it seems, but we can lseek
to the max off_t value and then write()
fails with EINVAL
(tested with _FILE_OFFSET_BITS
set to 64 which gives 64-bit off_t
).
We could probably use ReadFile()
and WriteFile()
instead of pread/pwrite.
It occurs to me that a simple general measure to make database tables much safer from bogus writes to their fd (e.g. by user code or another library that previously had that fd open) would be to
lseek()
to way beyond the end of the file right after opening it for writing. Doing so is documented to not change the file size (unless data is actually written), and calls topread()
andpwrite()
are documented to not change the file offset.