Opened 10 years ago

Last modified 7 months ago

#651 assigned enhancement

Protect writable fds by setting filepos very high

Reported by: Olly Betts Owned by: Olly Betts
Priority: normal Milestone: 2.0.0
Component: Backend-Glass Version:
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description

I noticed that sqlite has SQLITE_MINIMUM_FILE_DESCRIPTOR and by default avoids using fds less than 3 in case something else in the same process tries to write to them, thinking output is going to stdout or stderr (e.g. a call to write(2, ...)):

http://sqlite.org/compile.html#minimum_file_descriptor

Elsewhere I read that this was apparently added in response to actual database corruption from such situations (I bet that was a pain to debug!)

I think Xapian should definitely avoid at least fds 1 and 2 for writable handles, probably avoid fd 0 for reading, and perhaps just avoid anything < 3 for read or write like sqlite seems to.

I'll note we already have special case code to handle the lock file being opened as a low fd on Unix, where we fork a child process and want to hook up its stdin and stdout to the parent.

Marking for brass initially, but we should do this for all backends.

Attachments (1)

lseek-safety-test.c (4.1 KB ) - added by Olly Betts 10 years ago.
test program for lseek() to max off_t value

Download all attachments as: .zip

Change History (21)

comment:1 by Olly Betts, 10 years ago

Component: Backend-BrassBackend-Glass

comment:2 by Olly Betts, 10 years ago

It occurs to me that a simple general measure to make database tables much safer from bogus writes to their fd (e.g. by user code or another library that previously had that fd open) would be to lseek() to way beyond the end of the file right after opening it for writing. Doing so is documented to not change the file size (unless data is actually written), and calls to pread() and pwrite() are documented to not change the file offset.

Last edited 10 years ago by Olly Betts (previous) (diff)

comment:3 by Olly Betts, 10 years ago

Status: newassigned

This code sets the file position to the maximum possible off_t value, which seems to work well at least on x86 Linux 2.6.32-68-generic (attempts to write a single byte with write() fail with EFBIG) and x86_64 Linux 3.2.0-4-amd64 (fails with EINVAL):

#define SHIFT (sizeof(off_t) * 8 - 2)
lseek(fd, (((off_t)1 << SHIFT)-1)|((off_t)1 << SHIFT), SEEK_SET);

Meanwhile pwrite() works correctly.

If a system's libc emulates pwrite() with lseek() followed by write() then this initial lseek() will be harmless, but we won't be protected from stray writes.

Theoretically, we could protect ourselves by opening the file with O_APPEND which shouldn't affect pwrite(). However on Linux it does, and even where this worked it would result in extra garbage appearing in the file, whereas setting the offset absurdly high makes stray writes fail. The only downside I can see is that if some system did support such a large offset, the file size would increase to a ludicrous size. But most Unix FSes don't explicitly store all the zero-filled blocks this would create, so it wouldn't generally fill up huge amounts of disk, and the huge file size is a clear clue that there's a stray write somewhere.

This technique really needs testing on more systems - I'll tidy up the test program to more clearly report the outcome and attach it here.

comment:4 by Olly Betts, 10 years ago

If compiled with -D_FILE_OFFSET_BITS=64, the x86 Linux machine fails on the initial lseek() with EINVAL. It seems the maximum value we can seek to is somewhere between 1<<40 and 1<<41, but at 1<<40 we can successfully call write().

GNU/kFreeBSD 9.0-2-amd64 (Debian combining freebsd kernel and GNU libc) works - write() fails with EFBIG.

comment:5 by James Aylett, 10 years ago

OS X 10.10 runs without warnings until the write(fd, "", 1) line at which point it exits with:

Filesize limit exceeded: 25

This is SIGXFSZ (which also exists under linux), with ulimit -f (which I think is RLIMIT_FSIZE?) unlimited (both soft & hard).

SIGXFSZ is part of the 4.2BSD spec and POSIX.1-2001. There doesn't seem to be great documentation of this for OS X:

A file I/O operation that would create a file larger that the process' soft limit will cause the write to fail and a signal SIGXFSZ to be generated; this normally terminates the process, but may be caught.

Stevens is possibly more helpful: http://books.google.co.uk/books?id=kCTMFpEcIOwC&pg=PA925&lpg=PA925&dq=SIGXFSZ+mac+os+x&source=bl&ots=zvECZPZtkC&sig=8kbBKmpCO75jCRD98XieFkveoOY&hl=en&sa=X&ei=EhaDVMaBN8atU6SbhOAP&ved=0CDsQ6AEwBA#v=onepage&q=SIGXFSZ%20mac%20os%20x&f=false

comment:6 by Olly Betts, 10 years ago

The link doesn't seem to work for me, but I think the meaning is clear enough for our purposes here.

My initial thought was that this sinks the idea (at least on OS X) but actually this does seem a better reaction than allowing the process to corrupt the database, and in some ways better than the stray write() returning -1 and some errno value which the caller may just quietly ignore.

I'll extend the test program to catch that signal and continue with checks.

comment:7 by James Aylett, 10 years ago

The link to Stevens is just to the latest (I believe) Programming in the UNIX Environment, so the index of any edition should yield the same (limited) passage. The quote was from the OS X man pages.

Catching the signal & doing something suitable should also work on eg Linux where RLIMIT_FSIZE is set lower than expected?

comment:8 by Olly Betts, 10 years ago

I've updated the test program to catch that signal and report it nicely, then continue.

Also, if the maximum possible off_t value isn't lseekable(), it'll binary chop to find the actual seekable maximum. On my x86_64 Debian unstable machine, I get:

lseek failed: Invalid argument
attempted to set pos to 0x7fffffffffffffff
binary chopping to find actual max:
lseek to 0x3fffffffffffffff fails, errno=22 (Invalid argument)
lseek to 0x1fffffffffffffff fails, errno=22 (Invalid argument)
lseek to 0xfffffffffffffff fails, errno=22 (Invalid argument)
lseek to 0x7ffffffffffffff fails, errno=22 (Invalid argument)
lseek to 0x3ffffffffffffff fails, errno=22 (Invalid argument)
lseek to 0x1ffffffffffffff fails, errno=22 (Invalid argument)
lseek to 0xffffffffffffff fails, errno=22 (Invalid argument)
lseek to 0x7fffffffffffff fails, errno=22 (Invalid argument)
lseek to 0x3fffffffffffff fails, errno=22 (Invalid argument)
lseek to 0x1fffffffffffff fails, errno=22 (Invalid argument)
lseek to 0xfffffffffffff fails, errno=22 (Invalid argument)
lseek to 0x7ffffffffffff fails, errno=22 (Invalid argument)
lseek to 0x3ffffffffffff fails, errno=22 (Invalid argument)
lseek to 0x1ffffffffffff fails, errno=22 (Invalid argument)
lseek to 0xffffffffffff fails, errno=22 (Invalid argument)
lseek to 0x7fffffffffff fails, errno=22 (Invalid argument)
lseek to 0x3fffffffffff fails, errno=22 (Invalid argument)
lseek to 0x1fffffffffff fails, errno=22 (Invalid argument)
lseek to 0xfffffffffff fails, errno=22 (Invalid argument)
lseek to 0x7ffffffffff works
lseek to 0xbffffffffff works
lseek to 0xdffffffffff works
lseek to 0xeffffffffff works
lseek to 0xf7fffffffff works
lseek to 0xfbfffffffff works
lseek to 0xfdfffffffff works
lseek to 0xfefffffffff works
lseek to 0xff7ffffffff works
lseek to 0xffbffffffff works
lseek to 0xffdffffffff works
lseek to 0xffeffffffff works
lseek to 0xfff7fffffff works
lseek to 0xfffbfffffff works
lseek to 0xfffdfffffff works
lseek to 0xfffefffffff works
lseek to 0xffff7ffffff works
lseek to 0xffffbffffff works
lseek to 0xffffdffffff works
lseek to 0xffffeffffff works
lseek to 0xfffff7fffff works
lseek to 0xfffffbfffff works
lseek to 0xfffffdfffff works
lseek to 0xfffffefffff works
lseek to 0xffffff7ffff works
lseek to 0xffffffbffff works
lseek to 0xffffffdffff works
lseek to 0xffffffeffff works
lseek to 0xfffffff7fff works
lseek to 0xfffffffbfff works
lseek to 0xfffffffdfff works
lseek to 0xfffffffefff works
lseek to 0xffffffff7ff fails, errno=22 (Invalid argument)
lseek to 0xffffffff3ff fails, errno=22 (Invalid argument)
lseek to 0xffffffff1ff fails, errno=22 (Invalid argument)
lseek to 0xffffffff0ff fails, errno=22 (Invalid argument)
lseek to 0xffffffff07f fails, errno=22 (Invalid argument)
lseek to 0xffffffff03f fails, errno=22 (Invalid argument)
lseek to 0xffffffff01f fails, errno=22 (Invalid argument)
lseek to 0xffffffff00f fails, errno=22 (Invalid argument)
lseek to 0xffffffff007 fails, errno=22 (Invalid argument)
lseek to 0xffffffff003 fails, errno=22 (Invalid argument)
lseek to 0xffffffff001 fails, errno=22 (Invalid argument)
lseek to 0xffffffff000 works
Max lseekable off_t = 0xffffffff000
info: write() failed with errno=27 (File too large)
Cool, that basically worked - please report the info and/or warnings given above

I tested the signal handler on Linux by setting the limit with:

ulimit -f 2000000000

I don't think we want to go installing a signal handler in the actual library though - in general I'd find it surprising if a library implicitly install a signal handler like that, and I'm not sure what we'd do after catching the signal anyway, as it most likely comes from someone else's code so we can't return control to anywhere sensible.

by Olly Betts, 10 years ago

Attachment: lseek-safety-test.c added

test program for lseek() to max off_t value

comment:9 by Olly Betts, 10 years ago

Added code to report any currently set resource limit on file size.

comment:10 by James Aylett, 10 years ago

OS X:

$ ./a.out test-file 
info: write() cause SIGXFSZ to be raised
Cool, that basically worked - please report the info and/or warnings given above
ulimit -f 1000000
./a.out test-file
RLIMIT_FSIZE cur = 0x3d090000
RLIMIT_FSIZE max = 0x3d090000
warning: write() succeeded, writing 0 bytes
lseek SEEK_END now returns 0x7ffffffffffff000 (off_t_max was 0x7fffffffffffffff)
warning: write changed file pos to 0x7ffffffffffff000
Cool, that basically worked - please report the info and/or warnings given above

…and then hangs (and disappears from the process listing?). Looks like an OS X bug somewhere here (first time I ran it however, it was fine; just caught the signal at write).

comment:11 by Olly Betts, 10 years ago

Milestone: 1.3.31.3.4

I've committed a fix which means we no longer use fds < 3 for writable database tables in master. We can still use them for other writable files, and I've not tried to implement the seek trick, as it seems we need to check portability more first.

comment:12 by Olly Betts, 10 years ago

Backported the fix to not use fds < 3 for 1.2.21 (for chert and brass) in [d6e1aa32aa8dfb3ed203cd203d6f6e0cad8a6ef4].

comment:13 by Olly Betts, 9 years ago

Milestone: 1.3.41.4.x
Summary: Avoid low fdsProtect writable fds by setting filepos very high

Retitling to reflect what the issue is now about.

And time to get brutal and re-milestone this - it could be slotted into 1.4.x without problems.

comment:14 by Olly Betts, 8 years ago

Milestone: 1.4.x1.4.2

OK, so let's plan to roll this out per platform, carefully checking for any quirks of each platform as we do - otherwise we'll put this off forever while we worry about how it behaves on platforms we can't easily directly test on, whereas we can at least provide extra protection for some common platforms. Based on the above, it looks like Linux and OS X could be in the first batch.

comment:15 by Olly Betts, 8 years ago

Milestone: 1.4.21.5.0

Looking at different Linux FSes on x86-64 with kernel 4.7.0, these are the max lseek-able offsets for some FSes:

  • btrfs - 0x7fffffffffffffff (max positive off_t value)
  • ext4 (4KB blocksize) - 0xffffffff000 (one block below 16TB, which is the documented max filesize)
  • ext2/3 (1KB blocksize) - 0x404043000
  • ext2/3 (4KB blocksize) - 0x1feff7fc000

Even with an enhanced version of the probing code, some of these take a lot of lseek() calls to probe, and that is multiplied by the number of tables in use (up to 6) so naively probing on every open to write doesn't seem a good idea.

The extra protection seems nice, but given we now avoid fds < 3 this is just an interesting idea which isn't actually worth the trouble. It also doesn't protect if the code writing to a stale fd calls lseek() first.

But to think it through a bit more, the lseek offset to use could be cached in the DB (e.g. saved in the version file with the other per-table info like root block number, freelist data, etc). We need to detect if the file is moved to a different partition, or the FS rebuilt, or other things which might invalidate our cached value (e.g. also cache device ID and inode of the DB file). Or we just check with lseek() first to one more than the cached value (which should fail), then to the cached value (which should succeed). Most of the time that's all we need, but if we don't get the failure or success, we would need to reprobe.

The only alternative to caching I can see if to build a lot of FS-specific knowledge into the library and check the type of FS and things like the block size if necessary, which sounds like a bad idea.

So caching seems to be needed but adding info to the version file would need to wait for glass+1 - that means that unless someone can think of a better plan, this isn't 1.4.x material (or at least not unless we have a development backend in 1.4.x).

comment:16 by Olly Betts, 22 months ago

The only alternative to caching I can see if to build a lot of FS-specific knowledge into the library and check the type of FS and things like the block size if necessary, which sounds like a bad idea.

We could potentially just have a list of offsets to try seeking to (generated in advance), and use the largest which is seekable. It works best if we use the highest offset supported, but it isn't vital.

Probably we could binary chop the candidate list (a failed lseek() presumably leaves the offset unchanged, though the man page doesn't clearly say so AFAICS on Linux at least) so probably we'd only need 3 or 4 lseek() calls per table opened to write (which would allow up to 7 or 15 values).

And as noted above we could also cache a successfully probed offset in the version file - copying the database to a new filing system would potentially invalidate the cached value, but it's cheap to check for that - just try to lseek to an offset one higher and if that fails try to lseek to the cached offset which means two lseek calls in the common case (one failed, one successful). It probably needs to be per-table since tables could be symlinked (or bind-mounted) to be on different FSes. Poking a bit around the linux kernel code I also saw something that suggested the ext4 filesize limit might vary even on a single FS (it looked like if a file isn't "extent-mapped" the size limit is a 32 bit value). We could use the probed result for the first table as the starting point for others though.

Last edited 7 months ago by Olly Betts (previous) (diff)

comment:17 by Olly Betts, 7 months ago

More results from testing on cfarm.net machines:

  • macOS 12.6 is able to seek to the maximum off_t value (same as result reported for 10.10 above).
  • OpenBSD 7.5 (cfarm220) is able to seek to the maximum off_t value.
  • FreeBSD 14.0 (cfarm240) is able to seek to the maximum off_t value.
  • NetBSD 10.0 (tested via CI) is able to seek to the maximum off_t value.
  • AIX 7.1 and 7.3 (on cfarm111 and cfarm119) max is 0xffffffff000 (same as Linux ext4); we don't have CI coverage though so would need a full manual test build.
  • Solaris 11.4 (cfarm215) is able to seek to the maximum off_t value; we don't have CI coverage though so would need a full manual test build.
Last edited 7 months ago by Olly Betts (previous) (diff)

comment:18 by Olly Betts, 7 months ago

Milestone: 1.5.02.0.0

Pushed a simple implementation for Linux, FreeBSD, macOS and OpenBSD in 73b1b80749fc6e71d2a5934449727b1387e0ec75.

We can enable for more platforms as we are able to test them.

This is only done for glass currently since honey isn't block based and writes at the file position. We could consider changing that, but glass is still the default backend for now so that's where it's most useful to add protection.

No longer a blocker for 1.5.0.

comment:19 by Olly Betts, 7 months ago

Tested NetBSD too. An emerging pattern seems to be that on modern platforms we can lseek to the maximum possible off_t value, so trying to seek to that and not doing anything further seems a reasonable default approach.

Linux is actually something of an oddity here in that it doesn't allow this for some FSes. A notable FS which doesn't is ext4 which is widely used (I suspect it's still the most widely used FS on Linux) so adding a fallback to the ext4 value on Linux seems reasonable, and that is what I've already implemented.

comment:20 by Olly Betts, 7 months ago

I wondered about mingw. There's no pread/pwrite there it seems, but we can lseek to the max off_t value and then write() fails with EINVAL (tested with _FILE_OFFSET_BITS set to 64 which gives 64-bit off_t).

We could probably use ReadFile() and WriteFile() instead of pread/pwrite.

Note: See TracTickets for help on using tickets.