Mathias Hasselmann

Full Text Search Engines, Part I

Openismus asked me to research how best to index media files and provide full text searching. For the last two years, I have used Tracker for this kind of thing. I like Tracker, but I want to avoid being biased. Therefore, I decided to evaluate alternatives.

Performance is an obvious requirement. We also want to provide a library to permit other applications to access the data we collected. Therefore, SQLite and Lucene (in its C++ incarnations) are obvious contenders. Lucene++ is an emerging project that got suggested by Mikkel Kamstrup Erlandsen at Canonical. QtCLucene is a bit special: So far Qt doesn't provide official support for this library and doesn't install its headers files. Still it is used by Qt's help system, which makes QtCLucene a widely deployed and well tested C++ implementation of Lucene.

Sadly, the big names like MySQL or PostgreSQL do not fit: MySQL's embedded server library is licensed under GPL (instead of LGPL, for instance), which greatly limits legal use cases. PostgreSQL doesn't provide any embedding at all. Because I enjoy RDF and SPARQL I also wondered about testing the Redland RDF libraries, but I found that they don't provide any full text search at all.

Contenders

Test Platform

Test Scenario

To get somewhat realistic data I've fetched a copy of the Internet Movie Database from ftp.fu-berlin.de. Since it is a quite huge database (about 1 GiB when compressed with gzip) I've extracted a few subsets of it: All movies with at least 500,000, 50,000, 15,000 1,000 and 50 user votes. This data then got imported into a fresh instance of Tracker, SQLite, Lucene++ and QtCLucene. After that I've run a few trivial full text searches:

"The Matrix"
Fast Furious
"Star Trek" OR "Star Wars"
Lord Rings King
Keanu Reeves
"Brad Pitt" OR "Bruce Willis"
Jackson Samuel
Quentin Tarantino
Wachowski
Thomas Neo Anderson
Neo

Each scenario was repeated five times. To avoid cache effects each engine was tested after the others for a given set of parameters. Tracker was tested with two different scenarios: First I've tried the Nepomuk based multimedia ontology shipped with Tracker (nmm), after that I've also tried a flattened ontology (fmm) which is a much better fit for the data model of pure full text search indices like Lucene. All engines where used with default parameters. No magic configuration options or pragmas were applied. Feel free to repeat the tests with your own optimized settings, and report the results when doing so.

Source Code and Data

The source code of these benchmarks can be found at Gitorious, and can be built using autotools or qmake. Just like you prefer.

Run src/benchmark.sh to reproduce the tests. The log files can be turned into a CSV file by running src/report.sh.

The charts have been created with LibreOffice: It should be sufficient to copy the CSV data into the data sheet of logs/report.ods. Select "English (USA)" as language in the import dialog, to ensure that numbers are recognized properly. After that you still might have to sort the rows by the columns suite, num_movies and experiment. The data sorting dialog provides an option for marking the first row as column header.

Update: I've pushed some more changes, so to exactly reproduce the results discussed in this post, checkout the tags releases/0.1 for the initial results, and releases/0.2 to also include Xapian tests.

Results

Populating the Index

Lucene++QtCLuceneSQLiteTracker (Nepomuk)Tracker (Flat)Xapian
96.84 ms3.46 ms43.2 ms¹⁾36.2 ms7.13 ms52.561 ms¹⁾
1,0992.93 ms5.72 ms3.63 ms26.4 ms3.32 ms5.94 ms
3,2162.32 ms5.37 ms2.87 ms21.2 ms2.89 ms4.97 ms
17,2511.98 ms5.10 ms2.50 ms14.2 ms2.19 ms3.58 ms
121,5871.21 ms5.21 ms3.96 ms²⁾10.4 ms1.80 ms2.30 ms
  1. The dataset is tiny. I suspect that some startup overhead is invalidating this result.
  2. We might see first signs of a memory barrier here.

Query Execution Time

Lucene++QtCLuceneSQLiteTracker (Nepomuk)Tracker (Flat)Xapian
92.23 ms0.572 ms0.159 ms1.33 ms0.494 ms0.271 ms
1,0996.06 ms2.18 ms1.17 ms90.3 ms1.67 ms0.955 ms
3,2168.72 ms3.41 ms1.55 ms335 ms3.57 ms1.50 ms
17,25113.1 ms5.33 ms1.92 ms2,380 ms7.52 ms2.35 ms
121,58717.0 ms44.2 ms17.4 ms86,800 ms19.885 ms18.1 ms
ComplexityO(log(n)²)O(log(n)²)O(log(n)²)O(n log(n))O(sqrt(n))O(log(n)²)

QtCLucene, SQLite, Tracker (Nepomuk) and Xapian seem to hit a memory barrier at 121,587 movies.

Consumed Disk Space

Lucene++QtCLuceneSQLiteTracker (Nepomuk)Tracker (Flat)XapianRaw Data
980 KiB76 KiB368 KiB4.4 MiB2.3 MiB424 KiB104 KiB
1,0994.9 MiB4.8 MiB32 MiB59 MiB29 MiB21 MiB7.8 MiB
3,21612 MiB12 MiB75 MiB114 MiB53 MiB47 MiB18 MiB
17,25139 MiB39 MiB257 MiB305 MiB155 MiB170 MiB57 MiB
121,587154 MiB154 MiB1.0 GiB906 MiB521 MiB683 MiB198 MiB

Discussion

The performance of Tracker is devastating. Entirely not the result you want to see for a project you actually like and enjoy using. You clearly see the bad impact of the many joins it must perform for mapping the ontologies and queries to SQL. This is surprising since in my opinion Nepomuk's multimedia ontology is a quite typical ontology. Also the datasets itself are not that huge for something that initially started as file indexer. The (sadly quite unrealistic) flat ontology might give a few hints on how to improve Tracker. The execution times with this ontology are comparable with them of the other engines. Still the observed (and only estimated) complexity class for executing queries is worrying.

Lucene++ shines at writing data, it is just incredibly fast when building its index. In contrast to the other engines it even spends less time per movie, the bigger its index grows. It is noticable slower than QtCLucene or SQLite when looking up terms. Still I'd call an average time of 17 ms for finding matches within 122k documents a quite good achievement. Additionally Lucene++ seems to be implemented sufficiently efficient to not hit any memory barrier yet at this scale.

QtCLucene is about two times slower than Lucene++ or SQLite when building its index, still the index size doesn't seem to impact insertion time per movie. It pays back with good lookup performance. It is about 2 to 3 times faster than Lucene++. It seems to hit a memory barrier at 122k documents.

SQLite's performance is just in the middle between Lucene++ and QtCLucene when building the index. When searching terms it even beats QtCLucene, again by a factor of 2 to 3.

Lucene++ and QtCLucene consume less disk space than the original files, most probably because the raw data stores movies and artists in separate files. The records in this files must be linked with each other. Lucene just does this more efficiently. SQLite and Tracker consume significantly more disk space than Lucene or the original data. Partly this can be explained by fields being stored twice: Once in their table and another time in the full text search index. Column indexes also play a role. Still this doesn't explain why disk consumption is significantly higher.

Xapian's characteristics are quite similar to those of SQLite. It doesn't hit yet that memory barrier that affects SQLite's insert performance at 122k documents, maybe because it consumes only 2/3 of the disk space. Enjoyed its API for being much closer to modern C++ than any other engine. It gives more low-level access to all the FTS mechanics: For instance you have to attach values and feed the indexer yourself. Also you have to deal with token prefixes. Details that Lucene just hides behind a Field class and its attributes. Not sure yet, what approach I prefer.

Conclusion

Tracker is out. Lucene++, QtCLucene and SQLite are quite comparable in terms of performance, with Lucene++ being the fastest engine when building the index, and with SQLite being the fastest when performing full text searches. There are some first signs that Lucene++ is more memory efficient than its competitors. This needs further investigation. Also we should investigate capabilities for doing point and range searches, instead of full text searches.

blog comments powered by Disqus