| 1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> |
|---|
| 2 | <HTML> |
|---|
| 3 | <HEAD> |
|---|
| 4 | <TITLE>Xapian: Scalability</TITLE> |
|---|
| 5 | </HEAD> |
|---|
| 6 | <BODY BGCOLOR="white" TEXT="black"> |
|---|
| 7 | |
|---|
| 8 | <H1>Scalability</H1> |
|---|
| 9 | |
|---|
| 10 | <P>People often want to know how Xapian will scale. The short answer is "very |
|---|
| 11 | well" - a previous version of the software powered BrightStation's Webtop |
|---|
| 12 | search engine, which offered a search over around 500 million web pages (around |
|---|
| 13 | 1.5 terabytes of database files). Searches took less than a second.</P> |
|---|
| 14 | |
|---|
| 15 | <P>The largest recent installation we're aware of is probably <A |
|---|
| 16 | HREF="http://search.gmane.org/">gmane</A>, which currently indexes |
|---|
| 17 | over 50 million mail messages.</P> |
|---|
| 18 | |
|---|
| 19 | <H2>Benchmarking</H2> |
|---|
| 20 | |
|---|
| 21 | <P>One effect to be aware of when designing benchmarks is that queries will be |
|---|
| 22 | a lot slower when nothing is cached. So the first few queries on a database |
|---|
| 23 | which hasn't been searched recently will be unrepresentatively slow compared to |
|---|
| 24 | the typical case.</P> |
|---|
| 25 | |
|---|
| 26 | <P>In real use, pretty much all the non-leaf blocks from the B-trees being used |
|---|
| 27 | for the search will be cached pretty quickly, as well as many commonly used |
|---|
| 28 | leaf blocks.</P> |
|---|
| 29 | |
|---|
| 30 | <H2>General Scalability Considerations</H2> |
|---|
| 31 | |
|---|
| 32 | <P>In a large search application, I/O will end up being |
|---|
| 33 | the limiting factor. So you want a RAID setup optimised for fast reading, |
|---|
| 34 | lots of RAM in the box so the OS can cache lots of disk blocks (the access |
|---|
| 35 | patterns typically mean that you only need to cache a few percent of the |
|---|
| 36 | database to eliminate most disk cache misses).</P> |
|---|
| 37 | |
|---|
| 38 | <P>It also means that reducing the database size is usually a win. The Flint |
|---|
| 39 | and Quartz backends both compress the information in the tables in ways which |
|---|
| 40 | work well given the nature of the data but aren't too expensive to unpack (e.g. |
|---|
| 41 | lists of sorted docids are stored as differences with smaller values encoded in |
|---|
| 42 | fewer bytes). There is further potential for improving the encodings used.</P> |
|---|
| 43 | |
|---|
| 44 | <P>Another way to reduce disk I/O is to run databases through xapian-compact. |
|---|
| 45 | The Btree manager usually leaves some spare space in each block so that updates |
|---|
| 46 | are more efficient (though there are heuristics which will fill blocks fuller |
|---|
| 47 | if they detect a long sequence of sequential insertions so adding documents |
|---|
| 48 | to the end of an empty database will produce fairly compact tables, apart from |
|---|
| 49 | the postlist table). Compacting makes all blocks as full as possible, and so |
|---|
| 50 | reduces the size of the database. It also produces a database with revision 1 |
|---|
| 51 | which is inherently faster to search. The penalty is that updates will be slow |
|---|
| 52 | for a while, as they'll result in a lot of block splitting when all blocks are |
|---|
| 53 | full.</P> |
|---|
| 54 | |
|---|
| 55 | <P>Splitting the data over several databases is generally a good strategy. |
|---|
| 56 | Once each has finished being updated, compact it |
|---|
| 57 | to make it small and faster to search.</P> |
|---|
| 58 | |
|---|
| 59 | <P>A multiple-database scheme works particularly well if you want a rolling web |
|---|
| 60 | index where the contents of the oldest database can be rechecked and live links |
|---|
| 61 | put back into a new database which, once built, replaces the oldest database. |
|---|
| 62 | It's also good for a news-type application where older documents should expire |
|---|
| 63 | from the index.</P> |
|---|
| 64 | |
|---|
| 65 | <P>You could take this idea further by implementing an ultra-compact read-only |
|---|
| 66 | backend which would take a flint Btree and convert it to something like |
|---|
| 67 | a cdb hashed file. The same information would be stored, but with less |
|---|
| 68 | overhead than the flint Btrees (which need to allow for update). If |
|---|
| 69 | you need serious performance, implementing such a backend is worth |
|---|
| 70 | considering.</P> |
|---|
| 71 | |
|---|
| 72 | <H2>Size Limits in Xapian</H2> |
|---|
| 73 | |
|---|
| 74 | <P>The flint backend (which is now the recommended backend) stores |
|---|
| 75 | the indexes in several files containing Btree tables. If you're indexing with |
|---|
| 76 | positional information (for phrase searching) the term positions table is |
|---|
| 77 | usually the largest.</P> |
|---|
| 78 | |
|---|
| 79 | <P> |
|---|
| 80 | The current limits are: |
|---|
| 81 | |
|---|
| 82 | <ul> |
|---|
| 83 | <li> |
|---|
| 84 | Xapian uses unsigned 32-bit ints for document ids, so you're limited to |
|---|
| 85 | just over 4 billion documents in a database (as of 2003-09-10 that's |
|---|
| 86 | more than a third larger than Google). The other limits will cut in first |
|---|
| 87 | for a single database, but searches over multiple databases are done by |
|---|
| 88 | interleaving the document ids, so this might start to matter (especially if |
|---|
| 89 | one database is much larger than the others). This interleaving |
|---|
| 90 | technique could be changed fairly easily if it proves problematic. |
|---|
| 91 | |
|---|
| 92 | <li> |
|---|
| 93 | If you search many databases concurrently, you may hit the per-process |
|---|
| 94 | file-descriptor limit - each flint database uses 5 fds. Some Unix-like OSes |
|---|
| 95 | allow this limit to be raised. Another way to avoid it (and to spread the |
|---|
| 96 | search load) is to use the remote backend to search databases on a cluster of |
|---|
| 97 | machines. Flint could be made to not open fds for tables which aren't being |
|---|
| 98 | used during search (values and positions may not be), or to juggle fds - the |
|---|
| 99 | record table is typically only used for results, while the posting table is |
|---|
| 100 | typically only used during matching. |
|---|
| 101 | |
|---|
| 102 | <li> |
|---|
| 103 | If the OS has a filesize limit, that obviously applies to Xapian (a 2GB limit |
|---|
| 104 | is common for older operating systems). The xapian-core configure script |
|---|
| 105 | will attempt to detect and automatically enable support for "LARGE FILES" |
|---|
| 106 | where possible. |
|---|
| 107 | <P> |
|---|
| 108 | So what is the limit for a modern OS? Taking Linux 2.4 with the ext2 |
|---|
| 109 | filing system as an example, quoting from |
|---|
| 110 | linux/Documentation/filesystems/ext2.txt: |
|---|
| 111 | <blockquote><pre> |
|---|
| 112 | Filesystem block size: 1kB 2kB 4kB 8kB |
|---|
| 113 | |
|---|
| 114 | File size limit: 16GB 256GB 2048GB 2048GB |
|---|
| 115 | Filesystem size limit: 2047GB 8192GB 16384GB 32768GB |
|---|
| 116 | |
|---|
| 117 | There is a 2.4 kernel limit of 2048GB for a single block device, so no |
|---|
| 118 | filesystem larger than that can be created at this time. There is also |
|---|
| 119 | an upper limit on the block size imposed by the page size of the kernel, |
|---|
| 120 | so 8kB blocks are only allowed on Alpha systems (and other architectures |
|---|
| 121 | which support larger pages). |
|---|
| 122 | </pre></blockquote> |
|---|
| 123 | |
|---|
| 124 | <li> |
|---|
| 125 | The B-trees use a 32-bit unsigned block count. The default blocksize is 8K |
|---|
| 126 | which limits you to 32TB tables. You can increase the blocksize if this |
|---|
| 127 | is a problem, but it's best to do it before you create the database as |
|---|
| 128 | otherwise you need to use xapian-compact to rebuild the |
|---|
| 129 | database with the new blocksize, and that will take a while for such a large |
|---|
| 130 | database. The maximum blocksize allowed is 64K, which limits you to |
|---|
| 131 | 256TB tables. |
|---|
| 132 | |
|---|
| 133 | <li> |
|---|
| 134 | Flint stores the total length (i.e. number of terms) of all the documents in |
|---|
| 135 | a database so it can calculate the average document length. This is |
|---|
| 136 | currently stored as an unsigned 64-bit quantity so it's not likely to be a |
|---|
| 137 | limit you'll hit. It's listed for completeness. |
|---|
| 138 | </ul> |
|---|
| 139 | |
|---|
| 140 | If you've further questions about scalability, ask on the mailing lists - |
|---|
| 141 | people using Xapian to search large databases may be able to make further |
|---|
| 142 | suggestions. |
|---|
| 143 | |
|---|
| 144 | <!-- FOOTER $Author$ $Date$ $Id$ --> |
|---|
| 145 | </BODY> |
|---|
| 146 | </HTML> |
|---|