Opened 11 years ago
Last modified 3 weeks ago
#651 assigned enhancement
Protect writable fds by setting filepos very high
| Reported by: | Olly Betts | Owned by: | Olly Betts |
|---|---|---|---|
| Priority: | normal | Milestone: | 3.0.0 |
| Component: | Backend-Glass | Version: | |
| Severity: | normal | Keywords: | |
| Cc: | Blocked By: | ||
| Blocking: | Operating System: | All |
Description
I noticed that sqlite has SQLITE_MINIMUM_FILE_DESCRIPTOR and by default avoids using fds less than 3 in case something else in the same process tries to write to them, thinking output is going to stdout or stderr (e.g. a call to write(2, ...)):
http://sqlite.org/compile.html#minimum_file_descriptor
Elsewhere I read that this was apparently added in response to actual database corruption from such situations (I bet that was a pain to debug!)
I think Xapian should definitely avoid at least fds 1 and 2 for writable handles, probably avoid fd 0 for reading, and perhaps just avoid anything < 3 for read or write like sqlite seems to.
I'll note we already have special case code to handle the lock file being opened as a low fd on Unix, where we fork a child process and want to hook up its stdin and stdout to the parent.
Marking for brass initially, but we should do this for all backends.
Attachments (1)
Change History (27)
comment:1 by , 11 years ago
| Component: | Backend-Brass → Backend-Glass |
|---|
comment:3 by , 11 years ago
| Status: | new → assigned |
|---|
This code sets the file position to the maximum possible off_t value, which seems to work well at least on x86 Linux 2.6.32-68-generic (attempts to write a single byte with write() fail with EFBIG) and x86_64 Linux 3.2.0-4-amd64 (fails with EINVAL):
#define SHIFT (sizeof(off_t) * 8 - 2) lseek(fd, (((off_t)1 << SHIFT)-1)|((off_t)1 << SHIFT), SEEK_SET);
Meanwhile pwrite() works correctly.
If a system's libc emulates pwrite() with lseek() followed by write() then this initial lseek() will be harmless, but we won't be protected from stray writes.
Theoretically, we could protect ourselves by opening the file with O_APPEND which shouldn't affect pwrite(). However on Linux it does, and even where this worked it would result in extra garbage appearing in the file, whereas setting the offset absurdly high makes stray writes fail. The only downside I can see is that if some system did support such a large offset, the file size would increase to a ludicrous size. But most Unix FSes don't explicitly store all the zero-filled blocks this would create, so it wouldn't generally fill up huge amounts of disk, and the huge file size is a clear clue that there's a stray write somewhere.
This technique really needs testing on more systems - I'll tidy up the test program to more clearly report the outcome and attach it here.
comment:4 by , 11 years ago
If compiled with -D_FILE_OFFSET_BITS=64, the x86 Linux machine fails on the initial lseek() with EINVAL. It seems the maximum value we can seek to is somewhere between 1<<40 and 1<<41, but at 1<<40 we can successfully call write().
GNU/kFreeBSD 9.0-2-amd64 (Debian combining freebsd kernel and GNU libc) works - write() fails with EFBIG.
comment:5 by , 11 years ago
OS X 10.10 runs without warnings until the write(fd, "", 1) line at which point it exits with:
Filesize limit exceeded: 25
This is SIGXFSZ (which also exists under linux), with ulimit -f (which I think is RLIMIT_FSIZE?) unlimited (both soft & hard).
SIGXFSZ is part of the 4.2BSD spec and POSIX.1-2001. There doesn't seem to be great documentation of this for OS X:
A file I/O operation that would create a file larger that the process' soft limit will cause the write to fail and a signal SIGXFSZ to be generated; this normally terminates the process, but may be caught.
Stevens is possibly more helpful: http://books.google.co.uk/books?id=kCTMFpEcIOwC&pg=PA925&lpg=PA925&dq=SIGXFSZ+mac+os+x&source=bl&ots=zvECZPZtkC&sig=8kbBKmpCO75jCRD98XieFkveoOY&hl=en&sa=X&ei=EhaDVMaBN8atU6SbhOAP&ved=0CDsQ6AEwBA#v=onepage&q=SIGXFSZ%20mac%20os%20x&f=false
comment:6 by , 11 years ago
The link doesn't seem to work for me, but I think the meaning is clear enough for our purposes here.
My initial thought was that this sinks the idea (at least on OS X) but actually this does seem a better reaction than allowing the process to corrupt the database, and in some ways better than the stray write() returning -1 and some errno value which the caller may just quietly ignore.
I'll extend the test program to catch that signal and continue with checks.
comment:7 by , 11 years ago
The link to Stevens is just to the latest (I believe) Programming in the UNIX Environment, so the index of any edition should yield the same (limited) passage. The quote was from the OS X man pages.
Catching the signal & doing something suitable should also work on eg Linux where RLIMIT_FSIZE is set lower than expected?
comment:8 by , 11 years ago
I've updated the test program to catch that signal and report it nicely, then continue.
Also, if the maximum possible off_t value isn't lseekable(), it'll binary chop to find the actual seekable maximum. On my x86_64 Debian unstable machine, I get:
lseek failed: Invalid argument attempted to set pos to 0x7fffffffffffffff binary chopping to find actual max: lseek to 0x3fffffffffffffff fails, errno=22 (Invalid argument) lseek to 0x1fffffffffffffff fails, errno=22 (Invalid argument) lseek to 0xfffffffffffffff fails, errno=22 (Invalid argument) lseek to 0x7ffffffffffffff fails, errno=22 (Invalid argument) lseek to 0x3ffffffffffffff fails, errno=22 (Invalid argument) lseek to 0x1ffffffffffffff fails, errno=22 (Invalid argument) lseek to 0xffffffffffffff fails, errno=22 (Invalid argument) lseek to 0x7fffffffffffff fails, errno=22 (Invalid argument) lseek to 0x3fffffffffffff fails, errno=22 (Invalid argument) lseek to 0x1fffffffffffff fails, errno=22 (Invalid argument) lseek to 0xfffffffffffff fails, errno=22 (Invalid argument) lseek to 0x7ffffffffffff fails, errno=22 (Invalid argument) lseek to 0x3ffffffffffff fails, errno=22 (Invalid argument) lseek to 0x1ffffffffffff fails, errno=22 (Invalid argument) lseek to 0xffffffffffff fails, errno=22 (Invalid argument) lseek to 0x7fffffffffff fails, errno=22 (Invalid argument) lseek to 0x3fffffffffff fails, errno=22 (Invalid argument) lseek to 0x1fffffffffff fails, errno=22 (Invalid argument) lseek to 0xfffffffffff fails, errno=22 (Invalid argument) lseek to 0x7ffffffffff works lseek to 0xbffffffffff works lseek to 0xdffffffffff works lseek to 0xeffffffffff works lseek to 0xf7fffffffff works lseek to 0xfbfffffffff works lseek to 0xfdfffffffff works lseek to 0xfefffffffff works lseek to 0xff7ffffffff works lseek to 0xffbffffffff works lseek to 0xffdffffffff works lseek to 0xffeffffffff works lseek to 0xfff7fffffff works lseek to 0xfffbfffffff works lseek to 0xfffdfffffff works lseek to 0xfffefffffff works lseek to 0xffff7ffffff works lseek to 0xffffbffffff works lseek to 0xffffdffffff works lseek to 0xffffeffffff works lseek to 0xfffff7fffff works lseek to 0xfffffbfffff works lseek to 0xfffffdfffff works lseek to 0xfffffefffff works lseek to 0xffffff7ffff works lseek to 0xffffffbffff works lseek to 0xffffffdffff works lseek to 0xffffffeffff works lseek to 0xfffffff7fff works lseek to 0xfffffffbfff works lseek to 0xfffffffdfff works lseek to 0xfffffffefff works lseek to 0xffffffff7ff fails, errno=22 (Invalid argument) lseek to 0xffffffff3ff fails, errno=22 (Invalid argument) lseek to 0xffffffff1ff fails, errno=22 (Invalid argument) lseek to 0xffffffff0ff fails, errno=22 (Invalid argument) lseek to 0xffffffff07f fails, errno=22 (Invalid argument) lseek to 0xffffffff03f fails, errno=22 (Invalid argument) lseek to 0xffffffff01f fails, errno=22 (Invalid argument) lseek to 0xffffffff00f fails, errno=22 (Invalid argument) lseek to 0xffffffff007 fails, errno=22 (Invalid argument) lseek to 0xffffffff003 fails, errno=22 (Invalid argument) lseek to 0xffffffff001 fails, errno=22 (Invalid argument) lseek to 0xffffffff000 works Max lseekable off_t = 0xffffffff000 info: write() failed with errno=27 (File too large) Cool, that basically worked - please report the info and/or warnings given above
I tested the signal handler on Linux by setting the limit with:
ulimit -f 2000000000
I don't think we want to go installing a signal handler in the actual library though - in general I'd find it surprising if a library implicitly install a signal handler like that, and I'm not sure what we'd do after catching the signal anyway, as it most likely comes from someone else's code so we can't return control to anywhere sensible.
by , 11 years ago
| Attachment: | lseek-safety-test.c added |
|---|
test program for lseek() to max off_t value
comment:10 by , 11 years ago
OS X:
$ ./a.out test-file info: write() cause SIGXFSZ to be raised Cool, that basically worked - please report the info and/or warnings given above ulimit -f 1000000 ./a.out test-file RLIMIT_FSIZE cur = 0x3d090000 RLIMIT_FSIZE max = 0x3d090000 warning: write() succeeded, writing 0 bytes lseek SEEK_END now returns 0x7ffffffffffff000 (off_t_max was 0x7fffffffffffffff) warning: write changed file pos to 0x7ffffffffffff000 Cool, that basically worked - please report the info and/or warnings given above
…and then hangs (and disappears from the process listing?). Looks like an OS X bug somewhere here (first time I ran it however, it was fine; just caught the signal at write).
comment:11 by , 11 years ago
| Milestone: | 1.3.3 → 1.3.4 |
|---|
I've committed a fix which means we no longer use fds < 3 for writable database tables in master. We can still use them for other writable files, and I've not tried to implement the seek trick, as it seems we need to check portability more first.
comment:12 by , 11 years ago
Backported the fix to not use fds < 3 for 1.2.21 (for chert and brass) in [d6e1aa32aa8dfb3ed203cd203d6f6e0cad8a6ef4].
comment:13 by , 10 years ago
| Milestone: | 1.3.4 → 1.4.x |
|---|---|
| Summary: | Avoid low fds → Protect writable fds by setting filepos very high |
Retitling to reflect what the issue is now about.
And time to get brutal and re-milestone this - it could be slotted into 1.4.x without problems.
comment:14 by , 9 years ago
| Milestone: | 1.4.x → 1.4.2 |
|---|
OK, so let's plan to roll this out per platform, carefully checking for any quirks of each platform as we do - otherwise we'll put this off forever while we worry about how it behaves on platforms we can't easily directly test on, whereas we can at least provide extra protection for some common platforms. Based on the above, it looks like Linux and OS X could be in the first batch.
comment:15 by , 9 years ago
| Milestone: | 1.4.2 → 1.5.0 |
|---|
Looking at different Linux FSes on x86-64 with kernel 4.7.0, these are the max lseek-able offsets for some FSes:
- btrfs - 0x7fffffffffffffff (max positive
off_tvalue) - ext4 (4KB blocksize) - 0xffffffff000 (one block below 16TB, which is the documented max filesize)
- ext2/3 (1KB blocksize) - 0x404043000
- ext2/3 (4KB blocksize) - 0x1feff7fc000
Even with an enhanced version of the probing code, some of these take a lot of lseek() calls to probe, and that is multiplied by the number of tables in use (up to 6) so naively probing on every open to write doesn't seem a good idea.
The extra protection seems nice, but given we now avoid fds < 3 this is just an interesting idea which isn't actually worth the trouble. It also doesn't protect if the code writing to a stale fd calls lseek() first.
But to think it through a bit more, the lseek offset to use could be cached in the DB (e.g. saved in the version file with the other per-table info like root block number, freelist data, etc). We need to detect if the file is moved to a different partition, or the FS rebuilt, or other things which might invalidate our cached value (e.g. also cache device ID and inode of the DB file). Or we just check with lseek() first to one more than the cached value (which should fail), then to the cached value (which should succeed). Most of the time that's all we need, but if we don't get the failure or success, we would need to reprobe.
The only alternative to caching I can see if to build a lot of FS-specific knowledge into the library and check the type of FS and things like the block size if necessary, which sounds like a bad idea.
So caching seems to be needed but adding info to the version file would need to wait for glass+1 - that means that unless someone can think of a better plan, this isn't 1.4.x material (or at least not unless we have a development backend in 1.4.x).
comment:16 by , 3 years ago
The only alternative to caching I can see if to build a lot of FS-specific knowledge into the library and check the type of FS and things like the block size if necessary, which sounds like a bad idea.
We could potentially just have a list of offsets to try seeking to (generated in advance), and use the largest which is seekable. It works best if we use the highest offset supported, but it isn't vital.
Probably we could binary chop the candidate list (a failed lseek() presumably leaves the offset unchanged, though the man page doesn't clearly say so AFAICS on Linux at least) so probably we'd only need 3 or 4 lseek() calls per table opened to write (which would allow up to 7 or 15 values).
And as noted above we could also cache a successfully probed offset in the version file - copying the database to a new filing system would potentially invalidate the cached value, but it's cheap to check for that - just try to lseek to an offset one higher and if that fails try to lseek to the cached offset which means two lseek calls in the common case (one failed, one successful). It probably needs to be per-table since tables could be symlinked (or bind-mounted) to be on different FSes. Poking a bit around the linux kernel code I also saw something that suggested the ext4 filesize limit might vary even on a single FS (it looked like if a file isn't "extent-mapped" the size limit is a 32 bit value). We could use the probed result for the first table as the starting point for others though.
follow-up: 22 comment:17 by , 19 months ago
More results from testing on cfarm.net machines:
- macOS 12.6 is able to seek to the maximum off_t value (same as result reported for 10.10 above).
- OpenBSD 7.5 (cfarm220) is able to seek to the maximum off_t value.
- FreeBSD 14.0 (cfarm240) is able to seek to the maximum off_t value.
- NetBSD 10.0 (tested via CI) is able to seek to the maximum off_t value.
- AIX 7.1 and 7.3 (on cfarm111 and cfarm119) max is
0xffffffff000(same value as Linux ext4; the filing system for both these machines is "jfs2"); we don't have CI coverage though so would need a full manual test build. - Solaris 11.4 (cfarm215) is able to seek to the maximum off_t value; we don't have CI coverage though so would need a full manual test build.
comment:18 by , 19 months ago
| Milestone: | 1.5.0 → 2.0.0 |
|---|
Pushed a simple implementation for Linux, FreeBSD, macOS and OpenBSD in 73b1b80749fc6e71d2a5934449727b1387e0ec75.
We can enable for more platforms as we are able to test them.
This is only done for glass currently since honey isn't block based and writes at the file position. We could consider changing that, but glass is still the default backend for now so that's where it's most useful to add protection.
No longer a blocker for 1.5.0.
comment:19 by , 19 months ago
Tested NetBSD too. An emerging pattern seems to be that on modern platforms we can lseek to the maximum possible off_t value, so trying to seek to that and not doing anything further seems a reasonable default approach.
Linux is actually something of an oddity here in that it doesn't allow this for some FSes. A notable FS which doesn't is ext4 which is widely used (I suspect it's still the most widely used FS on Linux) so adding a fallback to the ext4 value on Linux seems reasonable, and that is what I've already implemented.
comment:20 by , 19 months ago
I wondered about mingw. There's no pread/pwrite there it seems, but we can lseek to the max off_t value and then write() fails with EINVAL (tested with _FILE_OFFSET_BITS set to 64 which gives 64-bit off_t).
We could probably use ReadFile() and WriteFile() instead of pread/pwrite.
comment:21 by , 4 weeks ago
Solaris 11.4 (cfarm215) is able to seek to the maximum off_t value; we don't have CI coverage though so would need a full manual test build.
We now have a CI job running on Solaris so I've enabled this for Solaris in 0a31cadf4db67708860d0f343cc84d1a01a2b4b7.
comment:22 by , 4 weeks ago
Replying to Olly Betts:
- AIX 7.1 and 7.3 (on cfarm111 and cfarm119) max is
0xffffffff000(same value as Linux ext4; the filing system for both these machines is "jfs2"); we don't have CI coverage though so would need a full manual test build.
I've done a full manual test build of xapian-core of cfarm119 and the testsuite runs cleanly with fd protection enabled so I've enabled this for AIX as well. I went for trying to seek to the maximum off_t value followed by trying 0xffffffff000 in case the latter limit is specific to the filing system in use here (JFS2): 305c9a4d7353c04f28d644d336626836a42a0ba2
That's all the platforms I previously tested here enabled.
In CI we also have emscripten, dragonflybsd and cygwin we could test via CI.
There are also jobs for android (but that only tests building as it's a cross-build and we don't currently have an emulator hooked up to run tests), and mingw/msvc (which don't have pread()/pwrite() so would probably need to use ReadFile()/WriteFile() instead).
If we don't have pread() and pwrite() there's no point setting the file position high as it'll just get reset by the first block read or write, so I've added a check for that too: a42fc853899aef77e4f156e4222b0cd24c21bef7
comment:23 by , 4 weeks ago
The test works in CI for cygwin and dragonflybsd and can seek to the maximum off_t value.
Emscripten's highest lseek-able offset seems to be 0x20000000000000, and trying to write there fails with:
/home/runner/work/xapian/xapian/lseektest.js:1469
node.contents = new Uint8Array(newCapacity); // Allocate new storage.
^
RangeError: Invalid typed array length: 9007199254740992
Not the clearest error message, but better than creating a vast file and better than corrupting the database.
comment:24 by , 4 weeks ago
Enabled for Emscripten (ea797bcd340472d781dff3092f8a028c34f62f04), and for Cygwin and DragonFlyBSD (ad1a8e6fa2692ab70a21588d57a9522d59492c3e).
I looked at using ReadFile()/WriteFile() for Microsoft Windows. Just using these isn't actually enough as by default they update the file pointer (so aren't drop-in pread()/pwrite() replacements, and in particular setting the file pointer high after opening won't help because it'll get reset by the first block read or written). If we use "overlapped IO" then the file pointer isn't updated by ReadFile() or WriteFile(), though it means we need to open the file using CreateFile() so we can set FILE_FLAG_OVERLAPPED, but we can then at least wrap it in a file descriptor. If we do this, write() on that fd will just fail with EINVAL and we don't need to set the file pointer high to protect the file, so that part is actually nicer (sadly not enough to make up for the rest being a mess of platform-specific poorly designed and underdocumented APIs though).
I pretty much have that working. I should do a simple quick performance test before actually merging though, just to make sure that using overlapped API doesn't suck for performance. We don't make use of the parallelism it's really aimed to provide, and it's quite possible there's overhead from that.
comment:25 by , 3 weeks ago
The way to wrap a proprietary HANDLE as a standard file descriptor is:
int fd = _open_osfhandle((intptr_t)(handle), O_RDWR|O_BINARY);
I wondered if we could specify O_RDONLY instead of O_RDWR to get a read-only fd wrapping a read/write HANDLE (since we extract the HANDLE and use WriteFile() on it to actually write to the file), but this doesn't seem to make a difference as a call to write() still succeeds even though we said O_RDONLY. The documentation mentions _O_RDONLY (though oddly not _O_RDWR or _O_WRONLY) but sadly doesn't discuss what effect it actually has in this context: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/open-osfhandle?view=msvc-170
Anyway, I've pushed a version which protects fds from stray writes on Microsoft Windows as 30004ee6f7554cf6ddc6950022597d7fb706372f.
I'm going to (really) leave the rest for now, unless someone has access to another platform and wants to test that this works there too. If so download, compile and run the test program and report results. If you're able to, test it on multiple different filing systems (at least on Linux that makes a difference).

It occurs to me that a simple general measure to make database tables much safer from bogus writes to their fd (e.g. by user code or another library that previously had that fd open) would be to
lseek()to way beyond the end of the file right after opening it for writing. Doing so is documented to not change the file size (unless data is actually written), and calls topread()andpwrite()are documented to not change the file offset.