Full Text Search Engines, Part I
Openismus asked me to research how best to index media files and provide full text searching. For the last two years, I have used Tracker for this kind of thing. I like Tracker, but I want to avoid being biased. Therefore, I decided to evaluate alternatives.
Performance is an obvious requirement. We also want to provide a library to permit other applications to access the data we collected. Therefore, SQLite and Lucene (in its C++ incarnations) are obvious contenders. Lucene++ is an emerging project that got suggested by Mikkel Kamstrup Erlandsen at Canonical. QtCLucene is a bit special: So far Qt doesn't provide official support for this library and doesn't install its headers files. Still it is used by Qt's help system, which makes QtCLucene a widely deployed and well tested C++ implementation of Lucene.
Sadly, the big names like MySQL or PostgreSQL do not fit: MySQL's embedded server library is licensed under GPL (instead of LGPL, for instance), which greatly limits legal use cases. PostgreSQL doesn't provide any embedding at all. Because I enjoy RDF and SPARQL I also wondered about testing the Redland RDF libraries, but I found that they don't provide any full text search at all.
- Tracker 0.14.0-2ubuntu1
- SQLite 3.7.9-2ubuntu1
- Lucene++ 220.127.116.11 (e28b15b02ff9de2208965e9af8eb80983380cdcd)
- QtCLucene as provided by libqtcore4 4.8.1-0ubuntu4
- Xapian 1.2.8-1
- Ubuntu 12.04
- Intel Core 2 Duo P8400 (2.26GHz), 4 GiB RAM
- HDD: WDC WD2500BEVT-2, encrypted (aes-cbc-essiv:sha256)
To get somewhat realistic data I've fetched a copy of the Internet Movie Database from ftp.fu-berlin.de. Since it is a quite huge database (about 1 GiB when compressed with gzip) I've extracted a few subsets of it: All movies with at least 500,000, 50,000, 15,000 1,000 and 50 user votes. This data then got imported into a fresh instance of Tracker, SQLite, Lucene++ and QtCLucene. After that I've run a few trivial full text searches:
"The Matrix" Fast Furious "Star Trek" OR "Star Wars" Lord Rings King Keanu Reeves "Brad Pitt" OR "Bruce Willis" Jackson Samuel Quentin Tarantino Wachowski Thomas Neo Anderson Neo
Each scenario was repeated five times. To avoid cache effects each engine was tested after the others for a given set of parameters. Tracker was tested with two different scenarios: First I've tried the Nepomuk based multimedia ontology shipped with Tracker (nmm), after that I've also tried a flattened ontology (fmm) which is a much better fit for the data model of pure full text search indices like Lucene. All engines where used with default parameters. No magic configuration options or pragmas were applied. Feel free to repeat the tests with your own optimized settings, and report the results when doing so.
Source Code and Data
The source code of these benchmarks can be found at Gitorious, and can be built using autotools or qmake. Just like you prefer.
The charts have been created with LibreOffice:
It should be sufficient to copy the CSV data into the data sheet of
Select "English (USA)" as language in the import dialog, to ensure that
numbers are recognized properly. After that you still might have to sort
the rows by the columns
experiment. The data
sorting dialog provides an option for marking the first row as column header.
Update: I've pushed some more changes, so to exactly reproduce the results
discussed in this post, checkout the tags
releases/0.1 for the
initial results, and
releases/0.2 to also include Xapian tests.
|Lucene++||QtCLucene||SQLite||Tracker (Nepomuk)||Tracker (Flat)||Xapian|
|9||6.84 ms||3.46 ms||43.2 ms¹⁾||36.2 ms||7.13 ms||52.561 ms¹⁾|
|1,099||2.93 ms||5.72 ms||3.63 ms||26.4 ms||3.32 ms||5.94 ms|
|3,216||2.32 ms||5.37 ms||2.87 ms||21.2 ms||2.89 ms||4.97 ms|
|17,251||1.98 ms||5.10 ms||2.50 ms||14.2 ms||2.19 ms||3.58 ms|
|121,587||1.21 ms||5.21 ms||3.96 ms²⁾||10.4 ms||1.80 ms||2.30 ms|
- The dataset is tiny. I suspect that some startup overhead is invalidating this result.
- We might see first signs of a memory barrier here.
|Lucene++||QtCLucene||SQLite||Tracker (Nepomuk)||Tracker (Flat)||Xapian|
|9||2.23 ms||0.572 ms||0.159 ms||1.33 ms||0.494 ms||0.271 ms|
|1,099||6.06 ms||2.18 ms||1.17 ms||90.3 ms||1.67 ms||0.955 ms|
|3,216||8.72 ms||3.41 ms||1.55 ms||335 ms||3.57 ms||1.50 ms|
|17,251||13.1 ms||5.33 ms||1.92 ms||2,380 ms||7.52 ms||2.35 ms|
|121,587||17.0 ms||44.2 ms||17.4 ms||86,800 ms||19.885 ms||18.1 ms|
QtCLucene, SQLite, Tracker (Nepomuk) and Xapian seem to hit a memory barrier at 121,587 movies.
|Lucene++||QtCLucene||SQLite||Tracker (Nepomuk)||Tracker (Flat)||Xapian||Raw Data|
|9||80 KiB||76 KiB||368 KiB||4.4 MiB||2.3 MiB||424 KiB||104 KiB|
|1,099||4.9 MiB||4.8 MiB||32 MiB||59 MiB||29 MiB||21 MiB||7.8 MiB|
|3,216||12 MiB||12 MiB||75 MiB||114 MiB||53 MiB||47 MiB||18 MiB|
|17,251||39 MiB||39 MiB||257 MiB||305 MiB||155 MiB||170 MiB||57 MiB|
|121,587||154 MiB||154 MiB||1.0 GiB||906 MiB||521 MiB||683 MiB||198 MiB|
The performance of Tracker is devastating. Entirely not the result you want to see for a project you actually like and enjoy using. You clearly see the bad impact of the many joins it must perform for mapping the ontologies and queries to SQL. This is surprising since in my opinion Nepomuk's multimedia ontology is a quite typical ontology. Also the datasets itself are not that huge for something that initially started as file indexer. The (sadly quite unrealistic) flat ontology might give a few hints on how to improve Tracker. The execution times with this ontology are comparable with them of the other engines. Still the observed (and only estimated) complexity class for executing queries is worrying.
Lucene++ shines at writing data, it is just incredibly fast when building its index. In contrast to the other engines it even spends less time per movie, the bigger its index grows. It is noticable slower than QtCLucene or SQLite when looking up terms. Still I'd call an average time of 17 ms for finding matches within 122k documents a quite good achievement. Additionally Lucene++ seems to be implemented sufficiently efficient to not hit any memory barrier yet at this scale.
QtCLucene is about two times slower than Lucene++ or SQLite when building its index, still the index size doesn't seem to impact insertion time per movie. It pays back with good lookup performance. It is about 2 to 3 times faster than Lucene++. It seems to hit a memory barrier at 122k documents.
SQLite's performance is just in the middle between Lucene++ and QtCLucene when building the index. When searching terms it even beats QtCLucene, again by a factor of 2 to 3.
Lucene++ and QtCLucene consume less disk space than the original files, most probably because the raw data stores movies and artists in separate files. The records in this files must be linked with each other. Lucene just does this more efficiently. SQLite and Tracker consume significantly more disk space than Lucene or the original data. Partly this can be explained by fields being stored twice: Once in their table and another time in the full text search index. Column indexes also play a role. Still this doesn't explain why disk consumption is significantly higher.
Xapian's characteristics are quite similar to those of SQLite. It doesn't hit yet that memory barrier that affects SQLite's insert performance at 122k documents, maybe because it consumes only 2/3 of the disk space. Enjoyed its API for being much closer to modern C++ than any other engine. It gives more low-level access to all the FTS mechanics: For instance you have to attach values and feed the indexer yourself. Also you have to deal with token prefixes. Details that Lucene just hides behind a Field class and its attributes. Not sure yet, what approach I prefer.
Tracker is out. Lucene++, QtCLucene and SQLite are quite comparable in terms of performance, with Lucene++ being the fastest engine when building the index, and with SQLite being the fastest when performing full text searches. There are some first signs that Lucene++ is more memory efficient than its competitors. This needs further investigation. Also we should investigate capabilities for doing point and range searches, instead of full text searches.blog comments powered by Disqus