Mathias Hasselmann

iconv transliterations

This is a lazy web request.

In the GNOME world we use g_convert, which conveniently wraps the iconv library, to convert between character sets. A feature I considered quite useful was the built-in transliteration support: when the "//TRANSLIT" suffix is added to the target character set's name, iconv tries to convert characters not present in the target charset to their most reasonable equivalent. For example, "Schlüssel" should become "Schluessel" when converting from UTF-8 to ASCII, and "доброй вечер" could become "dobroj vecher" (assuming the Cyrillic input method of GTK+ works reasonably). This should be tested:

$ echo Schlüssel | LC_ALL=de_DE.UTF-8 iconv -t ASCII//TRANSLIT -f UTF-8
Schluessel

Wow!

$ echo Schlüssel | LC_ALL=en_US.UTF-8 iconv -t ASCII//TRANSLIT -f UTF-8
Schlussel

Huh?

$ echo доброй вечер | LC_ALL=de_DE.UTF-8 iconv -t ASCII//TRANSLIT -f UTF-8
?????? ?????
$ echo Schlüssel | LC_ALL=C iconv -t ASCII//TRANSLIT -f UTF-8
Schl?ssel

WTF?

It seems the transliteration support of iconv is highly locale dependent.

So I ask the lazy web: Are there functions in the GNOME stack that allow locale-independent UTF-8 to ASCII transliteration?

Comments

Oded Arbel commented on November 6, 2007 at 12:21 p.m.

I don't think it's possible, with very similar letters meaning different things in different languages.

BTW - your captcha is really annoying.

Murray Cumming commented on November 6, 2007 at 12:46 p.m.

> Seems the transliteration support of iconv is highly locale dependent.

That makes sense to me. For instance, I doubt that ü can be transliterated as ue in every language that uses ü.

Philip Hofstetter commented on November 6, 2007 at 1:25 p.m.

Hi

> > Seems the transliteration support of iconv is highly locale dependent.

> That makes sense to me. For instance, I doubt that ü can
> be transliterated as ue in every language that uses ü.

Yep. And the en_US one was correct too. My home city, Zürich, is written Zurich in English, not Zuerich.

Philip
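The point made in this exchange, that the right replacement for ü depends on the language of the text, can be sketched with explicit per-language tables. This is only an illustration in Python, not something the GNOME stack or iconv provides: the `RULES` tables and the `transliterate` helper are hypothetical and cover just a handful of characters.

```python
import unicodedata

# Hypothetical per-language rule tables (assumption: deliberately incomplete).
RULES = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue",
           "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"},
    "en": {"ä": "a", "ö": "o", "ü": "u",
           "Ä": "A", "Ö": "O", "Ü": "U", "ß": "ss"},
}


def transliterate(text, lang):
    """Apply the table for `lang`; any other non-ASCII character becomes '?'."""
    # Compose first so that 'u' + combining diaeresis matches the 'ü' key.
    composed = unicodedata.normalize("NFC", text)
    replaced = composed.translate(str.maketrans(RULES[lang]))
    return replaced.encode("ascii", errors="replace").decode("ascii")


print(transliterate("Schlüssel", "de"))  # Schluessel
print(transliterate("Zürich", "en"))     # Zurich
```

Because the language is an explicit argument, the result does not depend on the process locale, at the cost of maintaining the tables by hand.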

Mathias Hasselmann commented on November 6, 2007 at 2:06 p.m.

Murray, Oded: OK, agreed. So, considering the context of my question, I have to rephrase it as "How to reliably transliterate to English?".

Surrounding the conversion with setlocale calls fails because:

1) setlocale changes the locale for the entire process, not just the current thread
2) accidentally passing an unknown locale causes fallback to the "C" locale, which transliterates everything to question marks
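To make the two problems concrete, here is a rough Python sketch of the "surround it with setlocale calls" pattern. Python's `locale.setlocale` wraps the C function, so it is just as process-wide (problem 1 remains), but it raises `locale.Error` for an unknown locale name instead of failing quietly, which at least makes problem 2 visible. The `temporary_locale` helper is a hypothetical name, not an existing API.

```python
import locale
from contextlib import contextmanager


@contextmanager
def temporary_locale(name, category=locale.LC_CTYPE):
    """Switch the locale for the whole process and restore it on exit.

    Not thread-safe: every thread sees the change while the block runs.
    """
    previous = locale.setlocale(category)   # one argument: query only
    locale.setlocale(category, name)        # raises locale.Error if unknown
    try:
        yield
    finally:
        locale.setlocale(category, previous)


try:
    with temporary_locale("xx_XX.no-such-locale"):
        pass
except locale.Error as error:
    print("refused:", error)
```

Checking for the failure explicitly, here via the exception, is what keeps a mistyped locale name from silently producing question marks.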

Philip: Indeed, but I guess that's more a translation thing if you consider other city names, for instance München/Munich, Prag/Praha, Moskau/Moscow/Москва.

Oded: Regarding the captcha: yes, technically it's not the best one, but by using reCAPTCHA this boring and pointless spam-prevention thing at least gets a small meaning.

Joe Buck commented on November 6, 2007 at 5:08 p.m.

The conversion of ü to ue is not a general property of ü; it is specific to the German language.

The umlaut symbol used to be used for a different purpose in English, though the use is fading out: it marks the second vowel in a pair as a new syllable, not a diphthong. For example, the word coöperation, which in modern usage is written without the umlaut. Given this usage, you see why your second example is correct: given an umlaut in text that is supposed to be US English, when converting to ASCII you should drop the umlaut, not add an e.
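The "drop the mark, keep the base letter" rule can be approximated without any locale by Unicode decomposition. The following is a minimal Python sketch using the standard `unicodedata` module; it is a swapped-in technique, not what iconv does internally, and `strip_diacritics` is a hypothetical helper name.

```python
import unicodedata


def strip_diacritics(text):
    """Drop combining marks; replace whatever is still non-ASCII with '?'."""
    # NFKD splits 'ü' into 'u' followed by U+0308 COMBINING DIAERESIS.
    decomposed = unicodedata.normalize("NFKD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.encode("ascii", errors="replace").decode("ascii")


print(strip_diacritics("coöperation"))  # cooperation
print(strip_diacritics("Schlüssel"))    # Schlussel
```

Letters with no decomposition, such as ß or Cyrillic characters, still end up as question marks, so this only covers the accent-dropping case described above.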

rollsappletree commented on November 6, 2007 at 6:14 p.m.

I'd like to know more on this issue ... I'll stay tuned!

Mohammad commented on November 7, 2007 at 10:55 a.m.

Use libicu instead.