iconv transliterations
This is a lazy web request.
In the GNOME world we use g_convert(), which conveniently wraps the iconv library, to convert between character sets. A feature I considered quite useful was the built-in transliteration support: when the "//TRANSLIT" suffix is appended to the target character set's name, iconv tries to convert characters not present in the target charset to their most reasonable equivalent. For example "Schlüssel" should become "Schluessel" when converting from UTF-8 to ASCII, and "доброй вечер" could become "dobroj vecher" (assuming the Cyrillic input method of GTK+ works reasonably). This should be tested:
$ echo Schlüssel | LC_ALL=de_DE.UTF-8 iconv -t ASCII//TRANSLIT -f UTF-8
Schluessel
Wow!
$ echo Schlüssel | LC_ALL=en_US.UTF-8 iconv -t ASCII//TRANSLIT -f UTF-8
Schlussel
Huch?
$ echo доброй вечер | LC_ALL=de_DE.UTF-8 iconv -t ASCII//TRANSLIT -f UTF-8
?????? ?????
$ echo Schlüssel | LC_ALL=C iconv -t ASCII//TRANSLIT -f UTF-8
Schl?ssel
WTF?
It seems the transliteration support of iconv is highly locale dependent.
So I ask the lazy web: Are there functions in the GNOME stack that allow locale-independent UTF-8 to ASCII transliteration?
I don't think it's possible, with very similar letters meaning different things in different languages.
BTW - your captcha is really annoying.
> Seems the transliteration support of iconv is highly locale dependent.
That makes sense to me. For instance, I doubt that ü can be transliterated as ue in every language that uses ü.
Hi
> > Seems the transliteration support of iconv is highly locale dependent.
> That makes sense to me. For instance, I doubt that ü can
> be transliterated as ue in every language that uses ü.
Yep. And the en_US one was correct too. My home city, Zürich, is written Zurich in English, not Zuerich.
Philip
Murray, Oded: OK, agreed. So considering the context of my question I have to rephrase it to "How to reliably transliterate to English?".
Wrapping the conversion in setlocale() calls fails because:
1) setlocale() changes the locale for the entire process, not just the current thread
2) accidentally passing an unknown locale causes a fallback to the "C" locale, which transliterates everything to question marks
Philip: Indeed, but I guess that's more a translation thing if you consider other city names, for instance München/Munich, Prag/Praha, Moskau/Moscow/Москва.
Oded: Regarding the captcha: yes, technically it's not the best one, but by using reCAPTCHA this boring and pointless spam prevention at least gets a small purpose.
The conversion of ü to ue is not a general property of ü; it is specific to the German language.
The umlaut symbol used to serve a different purpose in English, though that use is fading out: it marks the second vowel in a pair as a new syllable, not a diphthong. An example is the word coöperation, which in modern usage is written without the umlaut. Given this usage, you can see why your second example is correct: given an umlaut in text that is supposed to be US English, when converting to ASCII you should drop the umlaut, not add an e.
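The "drop the mark, don't add an e" rule can be approximated without touching the process locale at all. What follows is only a minimal sketch in Python, not something from the GNOME stack: it assumes that Unicode compatibility decomposition (NFKD) followed by removal of combining marks is an acceptable stand-in for English-style transliteration. The function name to_ascii and its placeholder parameter are made up for illustration.

```python
import unicodedata

def to_ascii(text, placeholder="?"):
    """Hypothetical helper: fold text to ASCII without consulting the locale.

    Characters are decomposed (NFKD), combining marks are dropped, and
    anything still outside ASCII is replaced by the placeholder.
    """
    out = []
    for ch in unicodedata.normalize("NFKD", text):
        if unicodedata.combining(ch):
            continue  # drop diacritics such as U+0308 (combining diaeresis)
        out.append(ch if ord(ch) < 128 else placeholder)
    return "".join(out)

print(to_ascii("Schlüssel"))     # Schlussel
print(to_ascii("Zürich"))        # Zurich
print(to_ascii("доброй вечер"))  # ?????? ?????
```

This gives the same answers as the en_US run for Latin text regardless of LC_ALL, but it is no real transliterator: letters without a decomposition (ß, Cyrillic, and so on) still end up as the placeholder, so non-Latin scripts need a proper mapping table or a dedicated library.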
I'd like to know more on this issue ... I'll stay tuned!
Use libicu instead.