Taschenorakel.de

Latest Postings

GNOME.Asia Summit 2013

I win the prize for last-to-write-about-conference, hands down. I had enough time to come up with solid excuses of course: Right after GNOME.Asia Summit in Seoul I went on to Tokyo, for two more conferences. And after that, I've been busy with my vacations.

With a direct flight from London (LHR) to Seoul (ICN), I was lucky enough to fly on Korean Air. Only Quatar Airways comes close in terms of service, though I still like their aubergine-colored uniforms more. Incheon Airport won a "best airport" award several times in a row, and I think it's well-earned. It starts with the gorgeous view when flying in, but the best argument going for the airport is probably A'REX, an affordable, fast and reliable railway line that connects Icheon with the most important spots in the city. Taxi drivers will try to convince you otherwise but don't be fooled, there's no competition to that line.

I stayed near Hongik University station, which is serviced by A'REX. It's a great location, especially at night. Tristan, our famous GTK+/Glade (and now also EDS) hacker, decided to stay nearby so we could go out for drinks and cure my jetlag with great food and a hangover. The hangover didn't happen (too much food?) and neither did the jetlag. As I arrived a couple of days early, I had enough time to discover the city a bit and prepare for the conference (read: slides and stuff). My talk about Wayland input methods seemed to be well received. It lead to a couple of good discussions about text input in general.

The venue was difficult to find in this maze of office buildings. It was good to see old & new friends at the summit though I had hoped for a lot more attendees. Perhaps (local?) advertising the event at universites would have helped. At least in my perception, the target audience for F/OSS development have always been students with enough spare time on their hands to try out stupid things (such as writing free software). I have come to prefer 3-day conferences over 2-day conferences. Two days just doesn't allow for enough hallway discussions as your time to attend talks is very limited. Spreading talks out over three days gives me a better opportunity to balance between talks and hallway track. I wish for instance that FOSDEM wasn't crammed into two days. The other extreme are probably Akademy and GUADEC that go over a week if you count BoFs and workshops as part of the confernence. But I am pretty sure that two days are too short ;-)

Thanks to the GNOME Foundation for 1) accepting my membership \o/ 2) approving my travel sponsorship. This would not have happened without my talk being accepted, which as I understand from talking to others who submitted session proposals was a quite lucky circumstance on its own.

Sponsored by GNOME Foundation

Direct Read Access for Evolution Addressbooks in LibreOffice

With Tristan just landing his work on direct-read-access for EDS 3.8 we at Openismus started to look for use-cases of DRA outside the customer work we did. A first project that came into mind was LibreOffice: It provides a connectivity driver for Evolution's address books. Basically this component is to support standard letters. Now mass printing letters isn't exactly dominated by the address book's reading performance, but to the user the addressbook appears like any other database, so it surely can used for more interesting tasks. Also the EDS function LibreOffice uses for accessing addressbooks got deprecated with EDS 3.8, so we gave it a try.

Downloading and building LibreOffice was a simple walk thanks to LibreOffice's great build instructions. Somewhat interested in build systems I was positively surprised when seeing LibreOffice's sophisticated build system that's entirely based on Autoconf and pure GNU make. The relevant code was easy to find and understand, so I patched it quickly. Getting environment variables and D-Bus configuration right for testing was the biggest effort. After being sufficiently confident about the code, I did a quick push to a new gerrit topic branch. Shortly later the code was reviewed and merged. Nice experience.

So what do you get from this patch: Accessing Evolution address books from LibreOffice should be more efficient now, as we eliminated a few layers of complexity for reading. This also should reduce worries about GLib main loops or threads interacting badly with LibreOffice, as direct-read-access skips D-Bus and just runs the backend drivers.

Still the database wizard doesn't list all available Evolution address books: There are local, LDAP and Groupwise address books, but Google address books are missing for instance. Additionally only the first address book of each backend is available. Looking at the database wizard now to fix this, and getting a first impression of the older code's complexity.

The Movie Database, Grilo and mock testing

This year at Openismus had a strong focus on media files and related metadata. So it isn't surprising that we also got in touch with Grilo. Grilo is a framework for discovering media files and decorating it with relevant metadata. One missing piece was background information about movies, like story summaries, artwork, information about artists. With the support of our customers we decided to change that.

Now an obvious choice for getting information on movies is Amazon's incredibly comprehensive Internet Movie Database. Sadly there doesn't seem to be any official web API that could be used by open source software. If I missed something, please tell me in the comments.

The Movie Database plugin for Grilo

Anyway, with IMDb out of reach we had to look for an alternative - and we've found a great one: The Movie Database. This database is a community driven project, built with the needs of media players in mind. It comes with a great and easy to use API. Like IMDb it seems to know just each and every relevant and irrelevant movie ever made. Some entertaining stuff like movie trivia, goofs and quotes are missing, but for most movies the database knows the IMDb id - so all the entertainment is just one click away.

Now that we've found a good movie database Jens started implementing a TMDb plugin for Grilo, and a first version of the plugin was released with grilo-plugins 0.2.2. I've later finished it while integrating it with our customer's project. Murray took care of examples and documentation.

The plugin implements a metadata resolver: It takes a media object, checks for a title or the unique movie id and then fills the media object with information it found on TMDb. Biggest obstacle when using the plugin is the need for an API key. It should be easy to get. The database operator hope these keys will help them to deal with misbehaving clients.

Another issue is related to Grilo's architecture. Grilo doesn't permit its sources to improve any metadata. So if you interpolated a movie title from the filename and pass a media object with that title to the TMDb plugin, the plugin is not permitted to replace your usually ugly title with the pretty and official title it found in the database. To work around that issue you would not resolve the TMDb data in a single step. Instead you'd do a first resolve operation to just retrieve the TMDb's movie id. In a second step you'd delete your own title, but keep the resolved movie id and let the plugin work with that data:

/* Create dummy media from filename */
media = grl_media_new ();
grl_media_set_title (media, "the_last_unicorn_720p");

/* Ask grilo to resolve the TMDb id */
metadata_key_tmdb_id = grl_registry_lookup_metadata_key (registry, "tmdb-id");
keys = grl_metadata_key_list_new (metadata_key_tmdb_id, GRL_METADATA_KEY_INVALID);
grl_source_resolve_sync (source, media, keys, options, &error);
/* Exercise: Release memory and handle possible errors */

/* Use the resolved id to get full metadata */
movie_id = grl_data_get_string (GRL_DATA (media), metadata_key_tmdb_id);

if (movie_id) {
    grl_data_remove (GRL_DATA (media), GRL_METADATA_KEY_TITLE);
    grl_source_resolve_sync (source, media, keys, options, &error);
    /* ... */
}

Behind your back the same number of network requests is needed for both approaches.

Testing the Plugin

At Openismus we have a mantra we strongly believe in:

No code is done before it has proper tests.

Or more radically:

It doesn't work if it isn't tested.

In the case of the TMDb plugin this commitment caused us a bit of work. Grilo didn't have any infrastructure for testing network based plugins yet. Also if we'd do those tests we'd like to run them offline:

So we took the challenge and also implemented a mock testing framework for Grilo. It became handy that Grilo already routes all its network access through a dedicated abstraction, so that it can implement request throttling. So we just hooked into that layer and introduced a few environment variables that influence behavior of that layer: First you'd set GRL_NET_CAPTURE_DIR to some folder name and then let your plugin perform a few interesting operations. In a next step you'd edit the recorded files and rename them nicely to suite your needs. Essential in that work is a generated key-value file of the name grl-net-mock-data-$PID.ini which maps URLs to captured responses. Once done you can set GRL_NET_MOCKED to point to that file. Grilo will then stop doing real network requests and answer all requests with the information this file provides.

Details about testing with the Grilo mock framework are in the reference manual. A few examples for using the framework can be found in Grilo's test folder.

Media Discovery with QtGStreamer

Earlier this year we at Openismus proposed a Qt based project that would utilize GStreamer for handling media files. Especially we were interested in using the GstDiscoverer class which provides a really nice and easy to use API for discovering properties of media files, such as the container format and the audio and video formats, but also more interesting things like EXIF information, when used with photos.

Now combining code from different worlds with their different paradigms isn't exactly fun. The resulting code often is a disgusting Frankenstein monster not fitting at any place, unless you wrap one of the libraries to match the project's preferred code style. Luckily in the case of Qt and GStreamer Collabora's George Kiagiadakis created QtGStreamer and therefore did most of the hard work already. Still that library didn't support our beloved GstDiscoverer class yet. So we had the choice: Use something different, or wrap that thing. Now we love doing free software, also we use GstDiscoverer with great success in the Rygel UPnP AV/DLNA Media Server already, and in the end the media files shall get played via GStreamer in the end. So we decided to just wrap that class for QtGStreamer.

Doing that work actually was surprisingly easy: A few loose ends here (#680235), a bit of nitpicking there (#680233, #GB680237). Biggest effort was doing the regression tests. This tests also demonstrate how easy the wrapped GstDiscoverer is to use. Synchronous media discovery is done like that:

QGst::DiscovererPtr discoverer = QGst::Discoverer::create(QGst::ClockTime::fromSeconds(1));
QGst::DiscovererInfoPtr info;

try {
    info = discoverer->discoverUri("file:///home/mathias/blockbuster.ogv");
} catch(const QGlib::Error &error) {
    qWarning("Discovery failed: %s", qPrintable(error.message()));
    // ...maybe also check error.domain() and .code()
}

You also can try asynchronous discovery if you have a Qt build that integrates GMainLoop:

QGst::DiscovererPtr discoverer = QGst::Discoverer::create(QGst::ClockTime::fromSeconds(1));

// Connect C++ member methods to the signals
QGlib::connect(discoverer, "starting", this, &DiscovererTest::onStartingDiscovery);
QGlib::connect(discoverer, "discovered", this, &DiscovererTest::onUriDiscovered);
QGlib::connect(discoverer, "finished", this, &DiscovererTest::onDiscoveryFinished, QGlib::PassSender);

discoverer->start();

QEventLoop loop;
loop.exec();

Usually only X11 builds match that requirement, but it should be possible to just hook QEventDispatcherGlib into your own application if needed.

The discovered data is accessible by the various attributes and methods of QGst::DiscovererInfo:

QGst::DiscovererInfoPtr info = ...;

qDebug() << info->uri();
qDebug() << info->tags();
qDebug() << info->duration();
// ...

Q_FOREACH(const QGst::DiscovererVideoInfoPtr &info, info->videoStreams()) {
    ...
}

Sadly our customer wasn't that much a fan of Qt as we thought, so we didn't have much use of our own for this work yet. This situation also delayed finishing the last few bits of that patches. Luckily Murray just took the time recently to do that last bits of work, and to get the patches merged. The code is in the git repository now and should get released with QtGStreamer 0.10.3. So whenever your Qt application needs to discover media file properties you also can use QtGStreamer now.

Performance and Memory Usage of Evolution Data Server

Openismus asked me to perform some benchmarks on Evolution Data Server. We wanted to track the progress of recent performance improvements and identify possible improvements. Therefore, I tested these versions of EDS:

The code is in a phonebooks-benchmarks repository on Gitorious with a full auotools build, and with a script to build and test all these versions of EDS. So you can try this too, and you can help to correct or improve the benchmarks. See below for details.

EDS offers various APIs to access the address book. The traditional interface was EBook, which has served us well in GNOME 2 and Maemo 5. However, some APIs such as batch saving are missing in the upstream version. Also its asynchronous API doesn't follow the conventions established later by GIO. To overcome these EBook shortcomings, and to make efficient use of GDBus, the EBookClient API was added in EDS 3.2. We can even use the backend EDataBook API, and that lets us measure the overhead imposed by using D-Bus.

I tested the back-ends with different numbers of contacts. For each benchmark, and for each contact count, we create an entirely new private database. D-Bus communication was moved to a private D-Bus session. To avoid swapping, the maximum virtual memory size was limited to 2 GiB per ulimit command. This limit probably caused the crash of Maemo 5's EDS in the test with 12,800 contacts, but I have not checked that yet.

Contact Saving

These benchmarks simply store a list of parsed contacts in the address book. This simulates use cases such as the initial import of contacts upon migration or synchronization.

To avoid possible side effects from lazy parsing, we retrieve the attribute list for each contact before starting the benchmark. With EBook from Maemo 5 and EBookClient since EDS 3.4, contacts are saved in batches of 3,200 contacts. This partitioning was needed to deal with resource limitations in the file backend. All other EDS variants must save their contacts one by one.

Contact saving, without batching

EBook, EBookClient, EDataBook implementation

As expected, the effort for contact saving grows quickly when not using a batch API. This is because a new database transaction must be created and committed for each contact. Things look much better when using the batch saving API which was available in Maemo 5 already, and was recently added to EBookClient:

Contact saving in batches

Batch saving performance of EDS 3.4+ is just excellent: Although slowly growing with the number of contacts, it remains below a threshold of 3 ms per contact even for 12,800 contacts. That growing effort can be accounted to growing attribute indices. The initial peak (until 50 contacts for Maemo 5, and until 400 contacts for EDS 3.4+) can be accounted to database setup cost.

In terms of performance there is no difference between using EBookClient or EDataBook (which avoids D-Bus).

Contact Fetching

A very basic, but essential, benchmark is fetching all contacts. To get a first impression I just fetched all contacts without any constraints.

Fetch all contacts

EBook, EBookClient, EDataBook implementation

Contact fetching performance decreased significantly during the EDS 3 series and then got better again: Fetching all contacts with 3.4 takes about 133% of the time that EDS 2.32 needs and even 225% of Maemo 5's time. With EDS 3.5 contact loading time is improving again, making the EBook version of EDS 3.5 comparable in performance to EDS 2.32. Git archeology and quick testing identifies Christophe Dumez's bugfix as the relevant change. Apparently the file backend didn't make optimal use of Berkeley DB transactions.

Still there is significant room for improvement, because:

  1. simple contact fetching with EBook 3.5 still takes 175% of the time Maemo 5 needs.
  2. EBookClient 3.5 is still 20% slower than EBook 3.5, and 64% slower than EDataBook.

This basic test shows already that the chosen client-server architecture of EDS causes a non-ignorable overhead.

It would be absolutely worth investigating how Maemo 5 reaches its far superior performance: After all it even beats EDataBook. I remember having spent significant time on avoiding vCard parsing and string copying. I also remember having replaced the rather inefficient GList with GPtrArray at several places. Some of the ideas have been ported upstream during Openismus' work for Intel. Apparently there are more gems to recover.

Fetching by UID

Fetching contacts by UID simulates lazy loading of contact lists: Initially, we only fetch contact IDs. We only fetch the full contacts when the device becomes idle, or when contacts become visible on screen. This approach is needed because even the fastest implementation (Maemo 5) needs several seconds to fetch any contact list of realistical size on mobile hardware. Another useful optimization we implemented on the Nokia N9 is fetching of partial contacts, that only contain relevant information, like for instance the person's name. EDS doesn't support this optimization.

As a first test we fetch contacts one-by-one, without using any kind of batch API:

Fetch by UID without batches

EBook, EBookClient, EDataBook implementation

The good news is that this chart shows quite constant performance for each client.

The bad news is that contact fetching is pretty slow: 3.9 ms per contact, as seen with EDS 3.5, translates roughly to 390 ms to fetch only 100 contacts on this hardware. Considering that typical mobile devices are roughly 10 times slower than my notebook, these numbers are disappointing. Especially if you consider that EDS 2.32 was about 4 times, and Maemo 5 even about 13 times faster. This are entirely different worlds. It should be investigated what causes this significant performance regression from EDS 2.32 to EDS 3.2+. One also should try to port the performance fixes of Maemo 5.

The performance reachable under ideal circumstances is shown by the EDataBook client. This only needs about 50 µs (0.05 ms) to fetch one contact by its id. Directly accessing the address book via EDataBook is about two orders of magnitude faster than the current EBookClient. That's the goal that EDS can, and should, aim for. Apparently a significant amount of time is spent on performing D-Bus communication, whereas the requested task can be performed by the backend within almost no time.

However, this data was acquired by fetching the contacts one by one. We can surely do better by using a batch API. That should drastically reduce the overhead caused, for instance, by D-Bus. But neither EBook or EBookClient provide an API to fetch contacts by lists of contact IDs. The thing that comes closest is filtering the contacts by a disjunction of UID field tests:

(or (is "id" "uid1") (is "id" "id2") ...)

So I tried that. The results for such queries, using batches of UIDs, look like this:

Fetch by UID in huge batches

EBook, EBookClient, EDataBook implementation

This chart speaks for itself. To remain responsive and appear fluid while scrolling, applications should render at 60 frames per second. To reach that framerate newly visible contacts must be fetched and rendered within less than 16 ms. EDS apparently cannot meet that performance goal even on desktop computers. Considering the huge performance differences between client-server access and direct access, as seen when fetching contacts one by one, it seems very worthwhile to add dedicated support for fetching multiple contacts by UID. The most obvious approach would be adding a batch API in the spirit of e_book_client_add_contacts(). Another solution would be adding more fast paths to the query processing code.

Filtering

EBook, EBookClient, EDataBook implementation, the queries used

Contact filtering is relatively efficient when using fields such as last-name, for which indices are hit. Still, the D-Bus overhead is noticeable: EDataBook needs less than 60% of EBook's or EBookClient's time.

The times to match long prefixes and suffixes look quite similar when hitting indices.

The behavior of EBook for short name prefixes is a bit strange. The EBook API is now deprecated, but it could still be worthwhile to identify the issue causing this strange behavior, so that it can be avoided in the future:

Interestingly, there seem to be no functional database indices for email addresses or phone numbers in more recent versions of EDS:

The behavior of Maemo 5's EDS is a bit surprising, as I know that Rob spent significant amounts of time on adding Berkeley DB based indices to that EDS version.

It might be worth optimizing index usage in EDS, because prefix and suffix searches are commonly used in mobile applications. Prefix searches need to be fast, for quick feedback during auto completion. Suffix searches need to be fast, for instance to provide instant caller identification.

Memory Usage

Memory is cheap those days. Still, especially on embedded devices, you should keep a close eye on the memory consumption of your software. Obviously, memory consumption grows with the number of contacts used:

Resident Set Size (RSS)

It's nice to see how memory consumption has reduced from release to release. It's also good to see that EBookClient seems to use slightly less memory than EBook.

You might miss the graphs for Maemo 5 and EDS 2.32. I had to drop them for this chart as they show serious memory leaks, preventing any useful examination. Apparently the leak really is in those versions of EDS: The EBook benchmarks for EDS 3.2+ are using exactly the same code but don't show this bad behavior.

Notice that I've accumulated the client's and the backend's memory consumption. This allows us to estimate the cost of EDS's client-server architecture. Apparently this architecture increases memory consumption by about 40% in these benchmarks.

While the RSS usage gives us information about the efficiency and correctness of the code, it's also interesting to check the virtual memory size needed by the benchmarks. Please take the following numbers with a reasonable grain of salt: I got these numbers by simply adding together the virtual memory size of the client and of the backend process, as reported by the processes' status file. A proper measurement would process the maps file to properly account for shared memory regions.

Virtual Memory Size (VMS)

The first issue we notice is the massively increased memory usage of EBookClient 3.2. It's almost 40% more than the others. Fortunately, the issue seems to have been fixed already in EDS 3.4.

At first glance, the very low virtual memory usage of the EDataBook benchmark is impressive. It seems to consume only 40% of the client-server based benchmarks. Still, there is a fairly high chance that this huge delta must be attributed to my poor measurement here: Assuming perfect code segment sharing there only remains a delta of about 20 MiB, which would be nothing but the RSS delta of EDataBook and EBookClient. It would be nice to investigate this in more detail.

RSS usage per contact

This chart shows the memory per contact after the contact saving benchmark. The overall memory usage per contact has grown dramatically by almost 40% in EDS 3+. The most efficient approach is apparently to directly access EDataBook, which consumes only 55% of the RSS per contact, compared to the client-server approaches.

RSS usage per contact, model only

This high memory usage per contact is a bit surprising since, after subtracting effects from library and database initialization, the memory usage per contact remained constant between EDS 2.32 and EDS 3.5. The parallel usage of both Berkeley DB and SQLite in the file backend might be to blame, but this is currently pure speculation from me.

The temporary regression in EDS 3.2 was apparently fixed. The increased memory usage of EBookClient and EDataBook over EBook is because the EBookClient and EDataBook benchmarks, in a slightly unrealistic bias for performance, store both the EContact and the VCard string for each contact.

Conclusions

The developers of Evolution Data Server have paid good attention to performance and have successfully implemented significant improvements. However, EDS releases regularly suffer performance regressions, and the performance of EDS still isn't good enough for usage in mobile devices. Fortunately the performance problems are solvable. Some fixes will be straightforward, such as adding more batch API (or fast paths) for query processing. Others will need careful performance monitoring: For instance when activating more database indices, to speed up queries, we must be careful not to slow down contact saving.

A not so trivial improvement would be adding a direct access API for reading the local database. The speed and memory usage measurements show the value of such API: Direct access is significantly faster than via D-Bus in most usage cases, and it seems to significantly reduce memory usage.

Another significant improvement should be finishing the file backend's transition to SQLite: Using two database backends in parallel significantly increases code complexity and has measurably bad impact on memory consumption.

Usage Instructions

The full source code of this project is in our phonebooks-benchmarks repository on Gitorious. You'll need a fairly recent C++ compiler because I also used this project to get more familiar with the new features of C++11. I've successfully tested g++ 4.6.3 and g++ 4.7.0. Clang 3.0 definitely won't work because of incomplete lambda support.

Other than that, you'll need Boost 1.48 to compile the benchmarks. The optional chart-drawing module uses ImageMagick++ and MathGL 2.0.2.

There is a simple script for building and installing the tested EDS versions. The configure script will give instructions.

To finally run the benchmarks just call src/phonebook-benchmarks, and to draw the charts run src/phonebook-benchmarks-mkcharts.

When doing your own tests that needs a non-trivial vCard generator take look at src/phonebook-benchmarks-mkvcards.

Outlook

It would be interesting to take a more detailed look at the virtual memory usage.

Also it would be educational to compare these results with other address book implementations. The first candidates I have in mind are QtContacts on Tracker and Tizen's native contacts API.

We didn't cover EBookView and EBookClientView yet. These views take a query and notify the application when the contact set matching the query has changed. Typically, every user interface needs to use them.

We also didn't talk about the calendar API yet.

Well, and most importantly we at Openismus would enjoy fixing the identified performance problems.

Using Full Text Search Engines as Datastore

It's a common design to use full text search engines only for free text searches, but to store the actual structured data in a separate database. Such designs come at a cost. Therefore Openismus asked me to build upon my previous post, where I analyzed several FTS engines. This time I'll research if we could use the full text search index itself as our primary data store.

Relations

A first obvious limitation is the lack of joins. So to use the FTS index as data store, you must denormalize your data. That is, instead of storing your movie database in distinct entity tables like Movie and Artist, linked by relationship tables like isLeadActor or isDirector, you must find a way to put everything into one single flat table. This isn't entirely nice in terms of redundancy and consistency. On the other hand joining tables is what makes relational databases slow and hinders distributing them across servers. Is there someone whispering "NoSQL"? Well. Yes, while I absolutely dislike their striking marketing: They are on to something, and with our journey today we enter their land.

Seems I've lost myself in chatting, so back on topic. So to store data in a FTS index we must denormalize our data. Luckily they make it easier than it sounds. In opposition to the relational model, there is no need to create complex relationships, just to assign more than only one actor or director to a movie: When adding artists to your movie you just tag each name with the proper field prefix before adding it to the index, and you are done. FTS engines natively support multi-value fields!

With some additional effort it also should be possible to store more structured data in those multi-value fields, things like (release-date, country), or (actor, role): You'd add more prefixes and use the positional information stored for phrase searches to reliably identify those fields. Sadly my time is too limited to research this more in detail, but the Internet surely has documents about this. Well, or for additional fun you can try to figure it out yourself.

Exact Matches

Now a match reported by an FTS engine only tells us that the document or the chosen field contains the phrase we were looking for. When searching for `title:"The Matrix"` any FTS engine will not only return the first movie of the Wachowskis' triology, it also will give matches for the other two movies, and works like *"Making 'The Matrix'"*. So for doing exact lookups we'll have to filter the initial result, and drop any document that doesn't exactly match our requirements. Sadly we really must check the field value instead of just checking the computed score: For instance with Lucene both *"The Matrix"* and *"Making 'The Matrix'"* will get a score of 100%, since both documents fully satisfy all terms of the query. Also we cannot use the score as indicator to only check fields for documents that got at least 100%: When searching for `director:"Quentin Tarantino"` the movie *"Inglourious Basterds"* will get less than 100%, since Tarantino was working with Eli Roth for this movie. So this additional filtering sounds expensive at the first moment, but remember that our index lookup dramatically reduced the data set already. When looking for `title:"The Matrix"` in the *imdb-50* data set, we talk about checking 9 documents instead of 121,587 documents for example. For useful data sets we won't notice the overhead, like the test results below are showing.

You can just add unanalyzed fields and use term queries on them like kamstrup pointed out.

Data Types

So we've learned that lack of relations isn't much of a problem for many useful datasets, but structured data is not only about relationships, it also is about data types. Full Text Search engines only support lexicographical order, so they surely fail for dates and numbers. You surely cannot use them to find documents within a given range!

I am sorry to disappoint you. The people researching FTS are smarter than that. Actually properly sorting and ranging dates, while only using lexicographic order is trivial. Most probably you have done it yourself already. Simply store your dates in ISO format, that is YYYY-MM-DDThh:mm:ss.SSSNNN or any prefix of this, and you are done. Omit the separators if you prefer. ISO-8601 explicitly is designed for lexicographic sorting.

So how do you do this with numbers? You could prefix them, for instance with zeros, to get a fixed width. This works reasonably if you know your number ranges, and in most cases you do. Sometimes you know the range from your application's context, e.g. the first known celluloid film was recorded in 1888. More easily you just use your technical limits, like [-263..263-1] for long integers. While first experiments really followed that approach, padding numbers with up to 18 zeros isn't exactly efficient or pretty. Also we didn't talk about floating point numbers yet. Therefore FTS engines like Lucene or Xapian provide more efficient mechanisms for turning numbers into sortable strings. First they write a prefix indicating number precision (64 bit, 32 bit, 10 bit, ...). Then they convert the numbers to some unsigned format, and apply some kind of base-128 encoding to the resulting bytes. The most significant bit gets stored first. For floating point numbers they shuffle some bits of the number's IEEE-754 representation. The resulting, sortable 64 bit integer then is encoded like any other number. You can consult Lucene's documentation, and the source code of Lucene::NumericUtils, or Xapian::sortable_serialise for details.

Benchmarks

Hope I didn't lose you with all this theory, now it is benchmark time!

To test how useful FTS engines are for storing arbitrary data I've extended my previous benchmark to better support range searches, and to support exact matching of fields. I've also added Michal Hruby's patch for supporting prefix searches. Since the prefix search gives countless hits, the query results consistently are limited to 10.000 rows now. I've dropped QtCLucene for now since it doesn't seem to support numeric range searches and such. It was forked from Java Lucene a long time ago. For SQLite I ran two sets of tests: bm_sqlite doesn't create indices for fields like movie title or artist names. Since such setup is unfair when comparing with FTS engines, the second set bm_sqlite_index creates indices for all fields we perform lookups for. For tracker we again test the Nepomok media ontology (bm_tracker) and a optimized ontology (bm_tracker_flat), that attaches all properties to the same RDF class. I had to disable prefix searches for bm_tracker: The query ran for more than 2 hours on the dataset with 17k movies. I seriously wish I'd get sponsored to improve Tracker's data model!

Source code still is in the fts-benchmark repository, tagged as release/0.3.

Results and Discussion

Each query got run 7 times on 5 different data sets. This time I didn't take the mean of the query execution times. The individual results of each dataset are grouped together and labeled with qxx_t1 to qxx_t7. Data and result sets grow with each group.

Also be careful when reading the charts as time is scaled logarithmically. You might want to consult the raw data tables below for details. Please keep in mind that the basic goal of this benchmarks is to test scalability, not raw performance. Therefore I don't mind much if an engine is 10 times slower than another for small data sets. Constant performance is the ideal result.

You'll also notice that some charts have gaps for bm_tracker. Like explained above I had to skip bm_tracker for few data sets, as tracker took way to long to perform those benchmarks.

rating:[90 TO 99]

Lucene++ appears significantly slower than its competition for small data sets, but then gives comparable results for data sets with more than 3,000 movies. Still I would not overrate this finding: We are talking about lookup times in the range of 10 ms. That's still pretty fast and close to measurement limits like the spikes in the other engine's results show.

release:[1999/01/01 TO 1999/09/30]

This results are similar to the rating:[90 TO 99] query.

release=1999/03/31

For this query you see the importance of having an index for your lookup keys: Performance of bm_lucene++ and bm_sqlite_index remains almost constant, while effort of the other engines grows dramatically as the data size grows.

Xapian's bad performance comes as a surprise, but actually I am to blame here: For stupid reasons I've implemented this very search as range search in Lucene++ and Xapian (release:[1999/03/31 TO 1999/03/31]). As the results indicate Lucene++ seems to putting more effort into optimizing range searches, and compensates my mistake.

title=The Matrix

Similar results as for release=1999/03/31, only that Xapian behaves as expected now. When given a proper query it also shows constant lookup time for exact phrase searches.

director=Quentin Tarantino

With this query you see the advantage you get from using denormalized tables: Lucene++ and Xapian just are as efficient as in the previous tests, but as a not so big surprise Tracker with the flat ontology now beats all remaining engines, including bm_sqlite_index.

T*

Performance of the different engines is similar to each other when performing prefix searches.

Raw Result Data

rating:[90 TO 99] - 9 movies, 3 matches
t1t2t3t4t5t6t7
bm_lucene++12.333 ms10.409 ms9.885 ms9.821 ms10.221 ms9.840 ms9.986 ms
bm_sqlite0.196 ms0.169 ms0.169 ms0.173 ms0.166 ms0.167 ms0.167 ms
bm_sqlite_index0.207 ms0.183 ms0.172 ms0.192 ms0.193 ms0.173 ms0.172 ms
bm_tracker0.992 ms0.655 ms0.582 ms0.589 ms0.554 ms0.549 ms0.525 ms
bm_tracker_flat0.693 ms0.463 ms0.437 ms0.461 ms0.450 ms0.443 ms0.436 ms
bm_xapian0.242 ms0.201 ms0.200 ms0.198 ms0.200 ms0.199 ms0.197 ms
rating:[90 TO 99] - 1,099 movies, 17 matches
t1t2t3t4t5t6t7
bm_lucene++12.949 ms13.057 ms12.981 ms13.018 ms13.150 ms12.840 ms12.644 ms
bm_sqlite0.696 ms0.546 ms0.516 ms0.530 ms0.515 ms0.518 ms0.522 ms
bm_sqlite_index0.448 ms0.234 ms0.231 ms0.237 ms0.236 ms0.231 ms0.231 ms
bm_tracker5.051 ms4.485 ms4.441 ms4.486 ms4.425 ms4.831 ms4.828 ms
bm_tracker_flat1.465 ms1.133 ms1.110 ms1.104 ms1.108 ms1.108 ms1.108 ms
bm_xapian1.445 ms1.285 ms1.159 ms7.824 ms1.878 ms1.669 ms1.393 ms
rating:[90 TO 99] - 3,216 movies, 35 matches
t1t2t3t4t5t6t7
bm_lucene++14.287 ms13.596 ms13.453 ms13.912 ms13.875 ms14.559 ms13.981 ms
bm_sqlite3.524 ms4.110 ms4.129 ms1.916 ms1.732 ms2.300 ms9.584 ms
bm_sqlite_index0.423 ms2.036 ms4.617 ms4.577 ms0.388 ms1.957 ms7.981 ms
bm_tracker12.776 ms11.816 ms12.449 ms11.755 ms11.762 ms11.983 ms11.764 ms
bm_tracker_flat2.935 ms2.517 ms2.374 ms2.264 ms2.250 ms2.261 ms2.258 ms
bm_xapian9.292 ms2.702 ms10.573 ms6.773 ms3.098 ms11.438 ms3.035 ms
rating:[90 TO 99] - 17,251 movies, 260 matches
t1t2t3t4t5t6t7
bm_lucene++58.996 ms56.894 ms62.172 ms57.028 ms57.255 ms57.540 ms57.259 ms
bm_sqlite36.682 ms28.260 ms34.116 ms34.786 ms35.195 ms35.813 ms35.221 ms
bm_sqlite_index45.802 ms62.460 ms31.603 ms32.982 ms33.302 ms31.904 ms31.656 ms
bm_tracker67.022 ms64.609 ms64.649 ms65.243 ms64.183 ms64.887 ms64.283 ms
bm_tracker_flat14.730 ms14.179 ms14.132 ms14.221 ms14.248 ms20.225 ms35.888 ms
bm_xapian94.872 ms47.067 ms85.202 ms28.575 ms142.854 ms48.562 ms52.567 ms
rating:[90 TO 99] - 121,587 movies, 1,510 matches
t1t2t3t4t5t6t7
bm_lucene++283.122 ms392.801 ms382.164 ms403.929 ms384.512 ms408.292 ms361.548 ms
bm_sqlite293.488 ms236.636 ms249.677 ms232.674 ms270.198 ms282.806 ms218.726 ms
bm_sqlite_index231.638 ms311.523 ms198.781 ms279.063 ms219.294 ms192.589 ms276.822 ms
bm_tracker-------
bm_tracker_flat181.478 ms272.453 ms251.730 ms256.744 ms293.067 ms230.615 ms245.113 ms
bm_xapian376.176 ms417.637 ms411.263 ms366.596 ms393.168 ms372.888 ms412.411 ms
release:[1999/01/01 TO 1999/09/30] - 9 movies, 2 matches
t1t2t3t4t5t6t7
bm_lucene++18.768 ms10.167 ms10.799 ms10.215 ms10.443 ms10.917 ms10.210 ms
bm_sqlite0.165 ms0.166 ms0.164 ms0.164 ms0.168 ms0.164 ms0.164 ms
bm_sqlite_index0.175 ms0.175 ms0.170 ms0.169 ms0.169 ms0.169 ms0.170 ms
bm_tracker1.074 ms0.569 ms0.546 ms0.561 ms0.544 ms0.549 ms0.546 ms
bm_tracker_flat0.877 ms0.480 ms0.460 ms0.458 ms0.461 ms0.458 ms0.456 ms
bm_xapian0.183 ms0.175 ms0.175 ms0.178 ms0.178 ms0.180 ms0.175 ms
release:[1999/01/01 TO 1999/09/30] - 1,099 movies, 34 matches
t1t2t3t4t5t6t7
bm_lucene++19.154 ms19.449 ms18.811 ms19.419 ms19.692 ms19.315 ms18.862 ms
bm_sqlite0.691 ms0.686 ms0.684 ms0.687 ms0.690 ms0.702 ms0.698 ms
bm_sqlite_index0.365 ms0.311 ms0.317 ms0.312 ms0.311 ms0.312 ms0.313 ms
bm_tracker6.231 ms5.543 ms5.734 ms5.522 ms5.663 ms5.538 ms5.465 ms
bm_tracker_flat1.998 ms1.494 ms1.466 ms1.469 ms1.470 ms1.454 ms1.469 ms
bm_xapian5.336 ms1.590 ms7.241 ms1.977 ms2.651 ms4.013 ms2.544 ms
release:[1999/01/01 TO 1999/09/30] - 3,216 movies, 84 matches
t1t2t3t4t5t6t7
bm_lucene++32.202 ms31.513 ms31.362 ms30.894 ms31.345 ms31.741 ms31.518 ms
bm_sqlite6.169 ms2.645 ms7.560 ms20.764 ms10.385 ms13.278 ms10.206 ms
bm_sqlite_index19.176 ms4.358 ms12.576 ms15.448 ms15.745 ms5.572 ms5.770 ms
bm_tracker15.507 ms14.803 ms13.629 ms15.465 ms13.930 ms14.515 ms13.652 ms
bm_tracker_flat3.956 ms3.488 ms3.183 ms3.176 ms3.213 ms3.193 ms3.157 ms
bm_xapian18.414 ms5.874 ms11.902 ms12.932 ms19.995 ms21.098 ms13.009 ms
release:[1999/01/01 TO 1999/09/30] - 17,251 movies, 374 matches
t1t2t3t4t5t6t7
bm_lucene++93.892 ms93.900 ms93.549 ms93.555 ms93.924 ms94.396 ms93.795 ms
bm_sqlite37.831 ms44.905 ms47.617 ms45.894 ms43.796 ms45.752 ms47.048 ms
bm_sqlite_index48.475 ms47.805 ms43.046 ms47.393 ms44.689 ms47.842 ms54.208 ms
bm_tracker72.507 ms72.667 ms72.233 ms73.570 ms72.997 ms72.991 ms72.527 ms
bm_tracker_flat29.351 ms48.892 ms55.351 ms49.793 ms88.375 ms55.393 ms45.917 ms
bm_xapian59.522 ms168.591 ms55.750 ms83.424 ms113.679 ms62.803 ms127.895 ms
release:[1999/01/01 TO 1999/09/30] - 121,587 movies, 2,265 matches
t1t2t3t4t5t6t7
bm_lucene++543.495 ms564.582 ms609.045 ms519.248 ms561.844 ms663.549 ms590.518 ms
bm_sqlite165.617 ms387.256 ms293.285 ms335.219 ms324.528 ms324.022 ms371.839 ms
bm_sqlite_index375.504 ms315.671 ms321.115 ms371.228 ms300.951 ms344.073 ms356.366 ms
bm_tracker-------
bm_tracker_flat241.569 ms316.626 ms398.308 ms349.426 ms398.289 ms318.078 ms363.809 ms
bm_xapian529.377 ms556.989 ms577.643 ms576.194 ms626.388 ms545.251 ms570.695 ms
release=1999/03/31 - 9 movies, 1 matches
t1t2t3t4t5t6t7
bm_lucene++10.065 ms10.068 ms9.702 ms9.974 ms9.837 ms9.751 ms10.356 ms
bm_sqlite0.164 ms0.165 ms0.171 ms0.168 ms0.167 ms0.164 ms0.162 ms
bm_sqlite_index0.171 ms0.169 ms0.171 ms0.172 ms0.175 ms0.165 ms0.164 ms
bm_tracker0.659 ms0.476 ms0.473 ms0.469 ms0.464 ms0.468 ms0.468 ms
bm_tracker_flat0.510 ms0.395 ms0.385 ms0.384 ms0.389 ms0.383 ms0.389 ms
bm_xapian0.154 ms0.152 ms0.151 ms0.153 ms0.152 ms0.156 ms0.152 ms
release=1999/03/31 - 1,099 movies, 2 matches
t1t2t3t4t5t6t7
bm_lucene++10.853 ms10.545 ms10.718 ms10.390 ms10.521 ms10.754 ms10.661 ms
bm_sqlite0.515 ms0.528 ms0.505 ms0.512 ms0.502 ms0.507 ms0.505 ms
bm_sqlite_index3.139 ms0.184 ms0.175 ms3.440 ms0.183 ms0.212 ms0.205 ms
bm_tracker4.559 ms4.229 ms4.177 ms4.220 ms4.383 ms4.532 ms4.464 ms
bm_tracker_flat0.977 ms0.830 ms0.800 ms0.808 ms0.802 ms0.811 ms0.802 ms
bm_xapian0.672 ms0.685 ms0.774 ms0.752 ms0.916 ms1.285 ms0.663 ms
release=1999/03/31 - 3,216 movies, 2 matches
t1t2t3t4t5t6t7
bm_lucene++10.799 ms10.762 ms11.399 ms10.676 ms10.704 ms10.169 ms10.325 ms
bm_sqlite1.912 ms1.462 ms1.453 ms1.163 ms1.151 ms1.157 ms4.858 ms
bm_sqlite_index0.366 ms0.350 ms0.355 ms1.883 ms0.364 ms0.345 ms0.371 ms
bm_tracker11.707 ms11.548 ms11.433 ms11.425 ms11.465 ms11.450 ms11.912 ms
bm_tracker_flat1.661 ms1.511 ms1.513 ms1.714 ms1.507 ms1.612 ms1.510 ms
bm_xapian1.278 ms1.364 ms1.359 ms1.821 ms1.994 ms1.429 ms3.192 ms
release=1999/03/31 - 17,251 movies, 3 matches
t1t2t3t4t5t6t7
bm_lucene++12.485 ms12.281 ms12.323 ms11.981 ms12.137 ms11.808 ms12.552 ms
bm_sqlite8.247 ms6.259 ms6.007 ms6.300 ms6.125 ms5.958 ms5.921 ms
bm_sqlite_index0.379 ms0.297 ms0.285 ms0.284 ms0.252 ms0.254 ms0.251 ms
bm_tracker61.537 ms60.815 ms61.014 ms60.821 ms61.013 ms60.850 ms60.820 ms
bm_tracker_flat11.063 ms8.021 ms8.414 ms8.690 ms7.798 ms7.811 ms8.313 ms
bm_xapian5.545 ms4.561 ms4.956 ms4.388 ms4.321 ms4.687 ms4.396 ms
release=1999/03/31 - 121,587 movies, 12 matches
t1t2t3t4t5t6t7
bm_lucene++14.005 ms14.031 ms12.792 ms14.354 ms12.736 ms13.862 ms13.374 ms
bm_sqlite64.517 ms61.783 ms61.669 ms62.418 ms61.377 ms61.326 ms62.036 ms
bm_sqlite_index9.994 ms0.403 ms0.358 ms0.351 ms0.368 ms0.363 ms3.368 ms
bm_tracker-------
bm_tracker_flat62.160 ms62.760 ms56.630 ms60.929 ms54.310 ms53.189 ms58.016 ms
bm_xapian29.180 ms28.239 ms28.080 ms28.054 ms27.777 ms27.615 ms27.505 ms
title=The Matrix - 9 movies, 1 matches
t1t2t3t4t5t6t7
bm_lucene++9.248 ms8.929 ms9.139 ms9.455 ms9.609 ms9.128 ms9.110 ms
bm_sqlite0.163 ms0.163 ms0.163 ms0.161 ms0.160 ms0.163 ms0.164 ms
bm_sqlite_index0.167 ms0.165 ms0.178 ms0.164 ms0.164 ms0.163 ms0.165 ms
bm_tracker0.733 ms0.484 ms0.475 ms0.478 ms0.481 ms0.475 ms0.476 ms
bm_tracker_flat0.575 ms0.400 ms0.380 ms0.382 ms0.379 ms0.387 ms0.379 ms
bm_xapian0.226 ms0.197 ms0.194 ms0.191 ms0.191 ms0.194 ms0.190 ms
title=The Matrix - 1,099 movies, 1 matches
t1t2t3t4t5t6t7
bm_lucene++10.758 ms10.578 ms10.083 ms10.230 ms10.555 ms10.630 ms10.831 ms
bm_sqlite0.728 ms0.524 ms0.504 ms0.501 ms0.506 ms0.500 ms0.501 ms
bm_sqlite_index0.218 ms0.203 ms0.201 ms0.198 ms0.199 ms0.277 ms0.233 ms
bm_tracker5.906 ms5.409 ms5.426 ms5.453 ms5.420 ms5.410 ms5.344 ms
bm_tracker_flat1.685 ms1.471 ms1.455 ms1.455 ms1.440 ms1.448 ms1.439 ms
bm_xapian0.445 ms0.385 ms0.398 ms0.373 ms0.836 ms0.451 ms0.374 ms
title=The Matrix - 3,216 movies, 1 matches
t1t2t3t4t5t6t7
bm_lucene++10.138 ms10.144 ms10.652 ms10.124 ms10.169 ms10.070 ms10.547 ms
bm_sqlite2.587 ms1.180 ms1.198 ms2.202 ms1.411 ms1.422 ms1.288 ms
bm_sqlite_index0.323 ms0.300 ms0.306 ms0.298 ms0.493 ms0.304 ms0.304 ms
bm_tracker15.097 ms14.727 ms14.692 ms14.759 ms14.840 ms14.888 ms14.791 ms
bm_tracker_flat3.727 ms3.529 ms3.558 ms3.545 ms3.504 ms3.504 ms3.520 ms
bm_xapian0.432 ms0.353 ms0.345 ms0.349 ms0.348 ms0.342 ms0.692 ms
title=The Matrix - 17,251 movies, 1 matches
t1t2t3t4t5t6t7
bm_lucene++12.462 ms11.871 ms12.020 ms11.603 ms12.469 ms11.850 ms11.823 ms
bm_sqlite6.093 ms6.096 ms6.130 ms5.941 ms5.882 ms5.959 ms6.789 ms
bm_sqlite_index1.431 ms0.304 ms0.201 ms0.200 ms0.201 ms0.199 ms0.199 ms
bm_tracker79.019 ms78.831 ms78.514 ms78.491 ms79.423 ms78.506 ms78.759 ms
bm_tracker_flat19.173 ms20.160 ms19.373 ms19.043 ms18.992 ms18.961 ms19.207 ms
bm_xapian0.422 ms0.344 ms0.339 ms0.335 ms0.336 ms0.339 ms0.345 ms
title=The Matrix - 121,587 movies, 1 matches
t1t2t3t4t5t6t7
bm_lucene++13.367 ms13.395 ms12.906 ms13.164 ms12.856 ms13.348 ms12.862 ms
bm_sqlite62.625 ms61.341 ms61.296 ms61.361 ms61.248 ms61.195 ms61.607 ms
bm_sqlite_index0.328 ms0.312 ms0.300 ms0.303 ms0.301 ms7.473 ms0.330 ms
bm_tracker-------
bm_tracker_flat138.148 ms131.762 ms130.937 ms131.431 ms131.471 ms130.975 ms130.770 ms
bm_xapian0.833 ms0.681 ms0.674 ms0.687 ms0.665 ms0.667 ms0.665 ms
director=Quentin Tarantino - 9 movies, 1 matches
t1t2t3t4t5t6t7
bm_lucene++9.112 ms9.540 ms9.671 ms9.258 ms9.510 ms9.597 ms9.126 ms
bm_sqlite0.273 ms0.243 ms0.243 ms0.241 ms0.239 ms0.239 ms0.239 ms
bm_sqlite_index0.282 ms0.243 ms0.257 ms0.244 ms0.245 ms0.243 ms0.337 ms
bm_tracker0.810 ms0.547 ms0.542 ms0.544 ms0.541 ms0.554 ms0.567 ms
bm_tracker_flat0.606 ms0.410 ms0.398 ms0.403 ms0.383 ms0.459 ms0.392 ms
bm_xapian0.215 ms0.204 ms0.195 ms0.197 ms0.195 ms0.208 ms0.194 ms
director=Quentin Tarantino - 1,099 movies, 9 matches
t1t2t3t4t5t6t7
bm_lucene++11.574 ms12.063 ms11.780 ms12.169 ms12.253 ms11.801 ms11.939 ms
bm_sqlite13.775 ms8.831 ms9.583 ms9.506 ms9.193 ms9.154 ms9.452 ms
bm_sqlite_index13.332 ms8.963 ms10.201 ms9.064 ms8.925 ms10.095 ms8.756 ms
bm_tracker5.173 ms4.644 ms4.546 ms4.473 ms4.552 ms4.472 ms4.455 ms
bm_tracker_flat1.137 ms0.857 ms0.851 ms0.855 ms0.844 ms0.842 ms0.844 ms
bm_xapian0.898 ms0.878 ms0.893 ms0.873 ms1.000 ms0.882 ms0.843 ms
director=Quentin Tarantino - 3,216 movies, 10 matches
t1t2t3t4t5t6t7
bm_lucene++12.343 ms12.175 ms12.307 ms12.004 ms12.235 ms12.947 ms12.194 ms
bm_sqlite40.967 ms37.867 ms38.607 ms37.618 ms37.487 ms37.124 ms38.147 ms
bm_sqlite_index43.470 ms36.820 ms37.027 ms36.779 ms36.957 ms36.585 ms36.782 ms
bm_tracker13.707 ms13.074 ms12.763 ms12.740 ms12.848 ms12.779 ms12.855 ms
bm_tracker_flat2.015 ms1.559 ms1.531 ms1.525 ms1.530 ms1.545 ms1.511 ms
bm_xapian0.933 ms0.886 ms0.908 ms2.944 ms1.023 ms1.030 ms0.799 ms
director=Quentin Tarantino - 17,251 movies, 13 matches
t1t2t3t4t5t6t7
bm_lucene++13.704 ms14.413 ms14.331 ms15.096 ms14.026 ms14.492 ms14.205 ms
bm_sqlite307.961 ms308.146 ms308.565 ms307.942 ms308.342 ms308.387 ms308.991 ms
bm_sqlite_index308.011 ms305.433 ms305.347 ms304.567 ms304.920 ms305.567 ms304.404 ms
bm_tracker72.690 ms72.075 ms72.005 ms71.999 ms71.938 ms71.946 ms72.108 ms
bm_tracker_flat7.489 ms6.996 ms6.877 ms6.987 ms7.148 ms7.088 ms7.021 ms
bm_xapian1.087 ms0.963 ms1.010 ms1.151 ms1.088 ms0.965 ms0.959 ms
director=Quentin Tarantino - 121,587 movies, 14 matches
t1t2t3t4t5t6t7
bm_lucene++13.546 ms13.955 ms13.981 ms13.854 ms13.740 ms14.114 ms15.816 ms
bm_sqlite4,752.853 ms2,793.690 ms2,800.197 ms2,795.611 ms2,800.578 ms2,794.765 ms2,801.000 ms
bm_sqlite_index2,806.890 ms2,789.648 ms2,788.729 ms2,791.168 ms2,788.102 ms2,790.845 ms2,789.475 ms
bm_tracker-------
bm_tracker_flat47.801 ms46.303 ms46.701 ms46.640 ms46.467 ms46.862 ms46.448 ms
bm_xapian20.098 ms1.260 ms1.176 ms1.162 ms1.156 ms1.149 ms1.148 ms
T* - 9 movies, 9 matches
t1t2t3t4t5t6t7
bm_lucene++17.303 ms17.072 ms16.927 ms16.539 ms16.816 ms16.758 ms16.797 ms
bm_sqlite0.547 ms0.544 ms0.547 ms0.541 ms0.541 ms0.546 ms0.544 ms
bm_sqlite_index0.553 ms0.549 ms0.554 ms0.553 ms0.658 ms0.547 ms0.544 ms
bm_tracker-------
bm_tracker_flat2.525 ms2.302 ms2.423 ms2.415 ms2.372 ms2.356 ms2.305 ms
bm_xapian3.086 ms2.871 ms2.947 ms2.893 ms3.104 ms3.022 ms3.126 ms
T* - 1,099 movies, 1,098 matches
t1t2t3t4t5t6t7
bm_lucene++358.775 ms355.830 ms350.287 ms349.816 ms347.998 ms356.585 ms347.143 ms
bm_sqlite64.679 ms142.927 ms143.776 ms142.847 ms145.319 ms147.244 ms135.600 ms
bm_sqlite_index62.383 ms151.941 ms144.456 ms144.108 ms141.330 ms173.728 ms169.799 ms
bm_tracker-------
bm_tracker_flat199.108 ms213.355 ms202.793 ms196.659 ms194.937 ms194.708 ms195.267 ms
bm_xapian419.323 ms516.929 ms677.357 ms591.280 ms599.091 ms643.124 ms497.649 ms
T* - 3,216 movies, 3,204 matches
t1t2t3t4t5t6t7
bm_lucene++842.413 ms968.828 ms958.367 ms1,002.383 ms932.222 ms946.388 ms1,004.821 ms
bm_sqlite327.669 ms415.921 ms440.198 ms408.543 ms432.575 ms537.572 ms412.061 ms
bm_sqlite_index310.218 ms432.201 ms413.221 ms404.165 ms479.691 ms431.758 ms436.533 ms
bm_tracker-------
bm_tracker_flat727.867 ms711.970 ms722.046 ms717.685 ms719.927 ms713.077 ms713.843 ms
bm_xapian1,442.238 ms1,470.821 ms1,415.183 ms1,392.164 ms1,437.493 ms1,464.149 ms1,520.747 ms
T* - 17,251 movies, ≥ 10,000 matches
t1t2t3t4t5t6t7
bm_lucene++3,006.139 ms3,127.174 ms3,136.617 ms3,151.197 ms3,131.469 ms3,141.155 ms3,056.497 ms
bm_sqlite1,481.321 ms1,388.573 ms1,468.062 ms1,533.263 ms1,422.012 ms1,442.638 ms1,456.166 ms
bm_sqlite_index1,346.717 ms1,451.410 ms1,508.228 ms1,411.643 ms1,460.563 ms1,514.390 ms1,391.342 ms
bm_tracker-------
bm_tracker_flat2,945.536 ms2,938.230 ms2,957.149 ms2,959.569 ms2,972.291 ms2,933.668 ms2,936.655 ms
bm_xapian3,391.825 ms3,490.307 ms3,474.203 ms3,483.310 ms3,560.886 ms3,505.060 ms3,398.937 ms
T* - 121,587 movies, ≥ 10,000 matches
t1t2t3t4t5t6t7
bm_lucene++3,627.408 ms3,625.588 ms3,546.610 ms3,508.233 ms3,599.160 ms4,597.857 ms4,101.686 ms
bm_sqlite2,182.548 ms2,109.730 ms2,109.812 ms2,121.573 ms2,104.320 ms2,117.912 ms2,145.342 ms
bm_sqlite_index2,108.863 ms2,103.648 ms2,131.009 ms2,132.823 ms2,109.655 ms2,137.286 ms2,106.779 ms
bm_tracker-------
bm_tracker_flat8,757.130 ms9,316.640 ms8,708.298 ms8,781.584 ms8,788.042 ms8,699.770 ms8,721.099 ms
bm_xapian4,805.474 ms4,528.004 ms4,692.763 ms4,640.065 ms4,618.215 ms4,647.170 ms4,674.588 ms

Full Text Search Engines, Part I

Openismus asked me to research how best to index media files and provide full text searching. For the last two years, I have used Tracker for this kind of thing. I like Tracker, but I want to avoid being biased. Therefore, I decided to evaluate alternatives.

Performance is an obvious requirement. We also want to provide a library to permit other applications to access the data we collected. Therefore, SQLite and Lucene (in its C++ incarnations) are obvious contenders. Lucene++ is an emerging project that got suggested by Mikkel Kamstrup Erlandsen at Canonical. QtCLucene is a bit special: So far Qt doesn't provide official support for this library and doesn't install its headers files. Still it is used by Qt's help system, which makes QtCLucene a widely deployed and well tested C++ implementation of Lucene.

Sadly, the big names like MySQL or PostgreSQL do not fit: MySQL's embedded server library is licensed under GPL (instead of LGPL, for instance), which greatly limits legal use cases. PostgreSQL doesn't provide any embedding at all. Because I enjoy RDF and SPARQL I also wondered about testing the Redland RDF libraries, but I found that they don't provide any full text search at all.

Contenders

Test Platform

Test Scenario

To get somewhat realistic data I've fetched a copy of the Internet Movie Database from ftp.fu-berlin.de. Since it is a quite huge database (about 1 GiB when compressed with gzip) I've extracted a few subsets of it: All movies with at least 500,000, 50,000, 15,000 1,000 and 50 user votes. This data then got imported into a fresh instance of Tracker, SQLite, Lucene++ and QtCLucene. After that I've run a few trivial full text searches:

"The Matrix"
Fast Furious
"Star Trek" OR "Star Wars"
Lord Rings King
Keanu Reeves
"Brad Pitt" OR "Bruce Willis"
Jackson Samuel
Quentin Tarantino
Wachowski
Thomas Neo Anderson
Neo

Each scenario was repeated five times. To avoid cache effects each engine was tested after the others for a given set of parameters. Tracker was tested with two different scenarios: First I've tried the Nepomuk based multimedia ontology shipped with Tracker (nmm), after that I've also tried a flattened ontology (fmm) which is a much better fit for the data model of pure full text search indices like Lucene. All engines where used with default parameters. No magic configuration options or pragmas were applied. Feel free to repeat the tests with your own optimized settings, and report the results when doing so.

Source Code and Data

The source code of these benchmarks can be found at Gitorious, and can be built using autotools or qmake. Just like you prefer.

Run src/benchmark.sh to reproduce the tests. The log files can be turned into a CSV file by running src/report.sh.

The charts have been created with LibreOffice: It should be sufficient to copy the CSV data into the data sheet of logs/report.ods. Select "English (USA)" as language in the import dialog, to ensure that numbers are recognized properly. After that you still might have to sort the rows by the columns suite, num_movies and experiment. The data sorting dialog provides an option for marking the first row as column header.

Update: I've pushed some more changes, so to exactly reproduce the results discussed in this post, checkout the tags releases/0.1 for the initial results, and releases/0.2 to also include Xapian tests.

Results

Populating the Index

Lucene++QtCLuceneSQLiteTracker (Nepomuk)Tracker (Flat)Xapian
96.84 ms3.46 ms43.2 ms¹⁾36.2 ms7.13 ms52.561 ms¹⁾
1,0992.93 ms5.72 ms3.63 ms26.4 ms3.32 ms5.94 ms
3,2162.32 ms5.37 ms2.87 ms21.2 ms2.89 ms4.97 ms
17,2511.98 ms5.10 ms2.50 ms14.2 ms2.19 ms3.58 ms
121,5871.21 ms5.21 ms3.96 ms²⁾10.4 ms1.80 ms2.30 ms
  1. The dataset is tiny. I suspect that some startup overhead is invalidating this result.
  2. We might see first signs of a memory barrier here.

Query Execution Time

Lucene++QtCLuceneSQLiteTracker (Nepomuk)Tracker (Flat)Xapian
92.23 ms0.572 ms0.159 ms1.33 ms0.494 ms0.271 ms
1,0996.06 ms2.18 ms1.17 ms90.3 ms1.67 ms0.955 ms
3,2168.72 ms3.41 ms1.55 ms335 ms3.57 ms1.50 ms
17,25113.1 ms5.33 ms1.92 ms2,380 ms7.52 ms2.35 ms
121,58717.0 ms44.2 ms17.4 ms86,800 ms19.885 ms18.1 ms
ComplexityO(log(n)²)O(log(n)²)O(log(n)²)O(n log(n))O(sqrt(n))O(log(n)²)

QtCLucene, SQLite, Tracker (Nepomuk) and Xapian seem to hit a memory barrier at 121,587 movies.

Consumed Disk Space

Lucene++QtCLuceneSQLiteTracker (Nepomuk)Tracker (Flat)XapianRaw Data
980 KiB76 KiB368 KiB4.4 MiB2.3 MiB424 KiB104 KiB
1,0994.9 MiB4.8 MiB32 MiB59 MiB29 MiB21 MiB7.8 MiB
3,21612 MiB12 MiB75 MiB114 MiB53 MiB47 MiB18 MiB
17,25139 MiB39 MiB257 MiB305 MiB155 MiB170 MiB57 MiB
121,587154 MiB154 MiB1.0 GiB906 MiB521 MiB683 MiB198 MiB

Discussion

The performance of Tracker is devastating. Entirely not the result you want to see for a project you actually like and enjoy using. You clearly see the bad impact of the many joins it must perform for mapping the ontologies and queries to SQL. This is surprising since in my opinion Nepomuk's multimedia ontology is a quite typical ontology. Also the datasets itself are not that huge for something that initially started as file indexer. The (sadly quite unrealistic) flat ontology might give a few hints on how to improve Tracker. The execution times with this ontology are comparable with them of the other engines. Still the observed (and only estimated) complexity class for executing queries is worrying.

Lucene++ shines at writing data, it is just incredibly fast when building its index. In contrast to the other engines it even spends less time per movie, the bigger its index grows. It is noticable slower than QtCLucene or SQLite when looking up terms. Still I'd call an average time of 17 ms for finding matches within 122k documents a quite good achievement. Additionally Lucene++ seems to be implemented sufficiently efficient to not hit any memory barrier yet at this scale.

QtCLucene is about two times slower than Lucene++ or SQLite when building its index, still the index size doesn't seem to impact insertion time per movie. It pays back with good lookup performance. It is about 2 to 3 times faster than Lucene++. It seems to hit a memory barrier at 122k documents.

SQLite's performance is just in the middle between Lucene++ and QtCLucene when building the index. When searching terms it even beats QtCLucene, again by a factor of 2 to 3.

Lucene++ and QtCLucene consume less disk space than the original files, most probably because the raw data stores movies and artists in separate files. The records in this files must be linked with each other. Lucene just does this more efficiently. SQLite and Tracker consume significantly more disk space than Lucene or the original data. Partly this can be explained by fields being stored twice: Once in their table and another time in the full text search index. Column indexes also play a role. Still this doesn't explain why disk consumption is significantly higher.

Xapian's characteristics are quite similar to those of SQLite. It doesn't hit yet that memory barrier that affects SQLite's insert performance at 122k documents, maybe because it consumes only 2/3 of the disk space. Enjoyed its API for being much closer to modern C++ than any other engine. It gives more low-level access to all the FTS mechanics: For instance you have to attach values and feed the indexer yourself. Also you have to deal with token prefixes. Details that Lucene just hides behind a Field class and its attributes. Not sure yet, what approach I prefer.

Conclusion

Tracker is out. Lucene++, QtCLucene and SQLite are quite comparable in terms of performance, with Lucene++ being the fastest engine when building the index, and with SQLite being the fastest when performing full text searches. There are some first signs that Lucene++ is more memory efficient than its competitors. This needs further investigation. Also we should investigate capabilities for doing point and range searches, instead of full text searches.

FOSDEM 2012

FOSDEM is only real with Belgian Waffles FOSDEM in 2012 was an exciting (and naturally, exhaustive) conference again. It's great to have so many relevant people who are all active in the free software world together in one place. It's also a great opportunity to discuss radical new ideas, ideally while experimenting with Belgium beer. Which is what we usually did when we weren't at the conference site.

It was nice to see Jarno and Esko at the conference, too. We even stayed in the same hotel. I hope they enjoyed the Ethiopian lunch as much as I did. And perhaps they're not too angry any more that we lead them to drink Absinthe ;-)

Jon and I gave two talks. Jon's talk (slides) was about Maliit as a project, explaining what Maliit is (and what it is not), combined with a short history lesson about the project. I tried to outline the difficulties of mobile text input in general (slides), picking some use-cases that are known from the desktop world and showing why simply copying the use-cases and their known interaction models does not work very well. I honestly liked Jon's talk more though.

Neither of us two actually managed to visit other talks, even though we wanted to. We had to ask Jarno, Esko and others about what great talks we missed. Apparently there were quite a few :-(

Our Maliit T-Shirts were well received, though we usually only handed them out when someone listened to our Maliit ramblings long enough.

We were asked about accessibility several times, which is currently not within the scope of Maliit but perhaps something to think about in the future.

We also got to talk with the people working on (text) input in Redhat and Intel, mostly in the context of Wayland. There are some interesting opportunities to get things (more) right this time around.

Thanks to our employer, Openismus, for sending us there!

The infrastructure of the Maliit project

Maliit T-Shirts! It took us a while to transform the Maliit project into a real opensource project. At first there was only public code, later some wiki pages @ meego.com together with constantly changing components in the official MeeGo bugtracker, then a public mailing list.

After that we tried to become independent of MeeGo, but neither freedesktop.org nor the GNOME project could give us a suitable home. So we had to go with our own infrastructure in the end, which probably was the best we could do, in any case. We now enjoy our own website (mostly a wiki, for which we can also analyze the traffic), our own IRC channel, our own public bugtracker, our own mailing lists and a build bot. We also make use of other services such as launchpad.org and the openSUSE Build Service, both for packaging but also as part of our continouous integration setup. Both services provide nightly builds for Maliit, for example (though we still lack packages for ARM).

But there was always one thing missing: T-Shirts. Now that this is solved, too, we can finally call Maliit a real opensource project ;-) Hopefully we'll soon have another group photo of the people who've been involved in the project over the years. I'll make sure to bring a couple of T-Shirts to FOSDEM, so make sure grab Jon or me if you want one.

How we enable others to write 3rd party plugins with Maliit

We finally published a video about Maliit - an input method framework including a virtual keyboard - and 3rd party plugins. Kudos goes to Jon for making time for that.

This video highlights one of Maliit's key features: pluggable input methods which come with their very own user interfaces. The Chinese input methods show how Maliit offers support for composed characters. The video is proof that 3rd party development for Maliit (open-source and proprietary) is not only possible but also happening.

maliit.org states that "it should be easy to customize existing input methods or develop powerful new input methods, whether for profit, research or fun", we actually mean it.

The harder question is of course how to motivate others to actually get started on input method development with Maliit. For that, we have a multipronged strategy:

  1. Provide sufficiently polished reference plugins that can show off Maliit capabilities but also serve as inspiration for new plugins (hence the BSD license for reference plugins). Our reference plugins are currently using Qt/C++ (Maliit Keyboard) and QML (Nemo Keyboard). We also have PySide support, but no one contributed a reference plugin yet. This gives choice to interested input method developers, and we think that's important. The reference plugins serve another role when it comes to designing new API: They become our testbed, allowing us to verify our API proposals.

  2. Ship Maliit with a bunch of example plugins and example applications. None of them try to be complete. They are all self-contained though and usually show one feature at a time. This can be tedious to maintain, but we believe that examples need to stay small and focused, otherwise developers won't look at them.

  3. Documentation that is easy to consume. Our documentation is not as concise and clear as we'd like it to be, but it's slowly improving. We also experiment with videos that can serve as an introduction to more in-depth (text) documentation.

  4. Packages for most common Linux distributions. This one seems obvious, but sadly, it's quite a lot of work for us to keep up with it (and we already use automated services such as Launchpad and OpenSuse Build Service). In the hope to attract dedicated packagers we wrote down some packaging guidelines

  5. An architecture that had 3rd party plugins and multiple toolkit support in mind from the start. The plugin developer facing API needs to be easy to use and clearly documented. This will be the focus of the upcoming 0.9x series.

We will demo Maliit @ FOSDEM 2012, hope to see you there!

Introducing libstarred

Inspired by someone's lazyweb call I felt like doing something entirely unrelated to my regular work.

libstarred provides a GTK+ cell renderer and a widget for showing and editing five-star ratings.

Star Ratings Demo

Feel free to use it. Merge requests welcome. Ideally someone finds time to polish it for inclusion into GTK+ itself.

Btw, hacked using QtCreator's Autotools Plugin.

Into the Wild

We kicked off the new 0.81 release series together with a nice announcement: We have our own bugtracker now!

This means that Maliit has a near complete project infrastructure, all available under *.maliit.org, and all that thanks to Karsten, the always professional and very experienced hostmaster here at Openismus.

There is one more thing that we need (as demonstrated by me when I made my first broken release), and that's a simple build bot setup for continuous integration. Right now, we still rely on Nokia's infrastructure. I am confident that buildbot will fit all of our requirements, as long as it is trivial to maintain.

I am also happy that we improved our documentation significantly, thanks to Dave. He translated the important documentation bits into proper English and made it more accessible, demonstrating his doxygen skills. As a bonus, he also updated the project's README files, something we had neglected for a long time.

Future development

Regarding the Maliit development, I think we have simplified things a lot. D-Bus activation for the Maliit server (which finally means one server instance per user session) and the new support for plain QML plugins makes it almost trivial to get started with Maliit. We also let go of the critically acclaimed MeeGo Keyboard in Maliit upstream, which made me a bit sad of course.

Still, it probably was the right decision: MeeGo Keyboard is heavily fine-tuned for Nokia's N9 and it has some dependencies that are hard to satisfy outside of Harmattan. Over time, with ever changing requirements, the code has naturally evolved into a rather complex design. The result, however, is a very polished product, and ultimately, that's the only thing that matters (even though many opensource developers will disagree, strangely enough). Everyone in the team is proud of what we have achieved.

At the same time, I can understand how new contributors will be put off by all the complexity. So Maliit upstream will instead focus on the very basic but almost trivial to build Maliit Keyboard. For new contributors, that's a good thing. For us, it means the possibility to fix shortcomings in the plugin API. This is important, as one of our main goals has always been to enable others to write great input method plugins for Maliit, which will then run on any platform that Maliit supports. The Swype VKB plugin and the Japanese VKB plugin for example both demonstrate that we are on a good way, but I think we can do better.

Maliit itself still needs a good reference plugin, of course, even if only as a showcase (though I want it to be more). All this doesn't mean that MeeGo Keyboard goes away; its development will continue in the MeeGo Touch repositories, just as before (effectively degraded as just another Maliit plugin). But what we can take over, hopefully, is our experience when it comes to creating one of the best virtual keyboards currently available.

Qt Quick best practices: Using Components

The next article in the series about Qt Quick best practices has been published (but don't miss out the other one about property bindings). This time, I talked about Components, and how they can help to keep your QML code clean and maintainable. The team behind the N9 Developer blog has been a great help to me, especially Ville Lavonious and Matti Airas. I am also thankful for the additional input (and proof reading!) from Jon Nordby and Sauli Kauppi. Thanks guys!

Miniature 0.5 'London 1851' released

From the release notes: "Miniature now supports different languages thanks to a determined community of translators. Thank you for your effort! This is why we are dedicating this release to the first international chess tournament, celebrated in London on 1851.

Miniature 0.5 is being released for MeeGo Harmattan (Nokia N9 & N950) and Maemo (Nokia N900). Thanks to everybody involved in the initial Maemo attempts and the experimental version that was made available after the Miniature 0.4 release."

We also improved usability, compared to the previous release, but there's still a ton of work left.

A bit of history

I started working on Miniature – a chess client for freechess.org – in November 2009, after reading the Call for Contributors. Even though we had a pretty cool P2P feature (based on Telepathy and developed mostly by Dariusz Mikulski), it never quite reached the original goal: playing chess online. Back then I was learning how to create UI's with Qt Graphics View, which was all the rage at the time. Well, we now know that writing real UI's with that technology is a major PITA, but for my pet project, it was just too much. I got lost in the struggle.

For the next 18 months, Miniature was basically dead. Another failed project that started so promising. Quim did not want to give up though. After the N9 announcement, he launched a second Call for Contributors.

Perhaps I responded to his mail because I was embarrased at the idea of people wasting time trying to salvage the working parts of Miniature; there simply wasn't much to salvage! So I started again, this time with a very clear goal: online chess, and online chess only. Let others create the actual UI and whatnot. Focusing on one prominent feature and not having to worry about the UI worked well for me, even though I had to iterate over some architecture ideas until I felt comfortable. Quim in the meantime started to prototype the UI with QML. It was impressive to see his results, a level of polish I could have never achieved with my Qt Graphics View approach. At some point the backend was good enough to be sewn together with the frontend and suddenly we had achieved where I failed before: A touch enabled chess client for the N9 that can play chess online.

Having my own useful application available on the N9, published through OVI store, means a lot to me. I hope others will enjoy Miniature as much as we enjoyed re-creating it the second time around.

Using MeeGo Keyboard from git on your Nokia N9

Usually AEGIS, the N9's security framework, protects system packages from being replaced. As such, files belonging to a system package can't be overwritten. And that's definitely a good thing, because otherwise each download from OVI store would put the user at a considerable risk.

Maliit is such a system package, but its flexible architecture allows for a creative way to replace the MeeGo Keyboard with a more recent version. This can be useful if you want to testdrive new features and to … nah whom am I kidding, it's purely for fun!

Be warned though, the following hack requires you to enable developer mode on your N9. Don't ever activate it unless you're absolutely sure what you're doing to your N9. It would be unforgivable to brick this beauty because of some misguided hack the planet attitude.

First we need to find a MeeGo Keyboard tag that will be compatible with the installed Maliit framework version on your device. Check that the output of

$ apt-cache showpkg meego-keyboard

matches the dependencies mentioned in the tag's Debian control file and the packages installed in your scratchbox ARMEL target.

Apply the community patch on top of the chosen tag. It renames the package to meego-keyboard-community and only installs the plug-in's .so file, together with a renamed CSS file (libmeegotouch requires that CSS file names match with library names).

This mean that we won't uninstall the regular package, as we still depend on most the other files that meego-keyboard installs.

Now build the Debian package. Copy it over and login to the device, then gain root access via devel-su. It's recommended to make a backup of /usr/lib/meego-im-plugins before installing the package.

After installing libmeego-keyboard-community, remove libmeego-keyboard.so from /usr/lib/meego-im-plugins, to avoid in-fights between the two plug-ins. Use

$ gconftool-2 -s /meegotouch/inputmethods/onscreen/enabled -t list --list-type string [libmeego-keyboard-community.so, en_gb.xml]
$ gconftool-2 -s /meegotouch/inputmethods/onscreen/active -t list --list-type string [libmeego-keyboard-community.so, en_gb.xml]

to activate the community plug-in. The language settings applets will most likely get confused, so be prepared that enabling new language layouts might only work directly via GConf from now on.

Gain user access and kill meego-im-uiserver. It should now load the new community plug-in. If you want to get the original MeeGo Keyboard back, uninstall the community package and copy the .so back from your backup. Alternately, you can try to reinstall it:

$ apt-get install --reinstall meego-keyboard

Have fun!