Sort order of languages in Kiwix serve / library.kiwix.org is wrong #980

Open
Jaifroid opened this issue Apr 2, 2023 · 17 comments

@Jaifroid
Member

Jaifroid commented Apr 2, 2023

See screenshot below. The language "Español" is not sorted under "E", but (I suppose) by its English name "Spanish", i.e., it appears in the list along with other languages beginning with "S". This is unintuitive. It would be better for the list to be sorted in UTF-8 order of localized language names, with English only being used as a fallback if we don't have the localized version (I'm not sure why we wouldn't, however).
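
A minimal sketch of the behaviour described above, assuming a made-up catalogue of (code, English name, localized name) rows. The row layout, the `xx` entry and the function names are illustrative assumptions, not the actual Kiwix data model. It contrasts sorting by the hidden English name with sorting by the displayed name, falling back to English when no localized name exists.

```python
# Hypothetical catalogue rows: (ISO code, English name, localized self-name or None).
LANGUAGES = [
    ("de", "German", "Deutsch"),
    ("en", "English", "English"),
    ("es", "Spanish", "Español"),
    ("fr", "French", "Français"),
    ("sv", "Swedish", "Svenska"),
    ("xx", "Example", None),  # made-up entry with no localized name
]


def display_name(row):
    """Return the localized name, falling back to the English name."""
    _code, english, local = row
    return local if local is not None else english


def sort_by_english(rows):
    """Order by the (hidden) English name but show the localized name."""
    return [display_name(r) for r in sorted(rows, key=lambda r: r[1])]


def sort_by_displayed(rows):
    """Order by the string that is actually shown to the user."""
    return sorted(display_name(r) for r in rows)


if __name__ == "__main__":
    print(sort_by_english(LANGUAGES))    # "Español" lands next to "Svenska"
    print(sort_by_displayed(LANGUAGES))  # "Español" lands next to "English"
```

Sorting by the displayed string keeps the visible order and the sort key in agreement, which is the point of the request; it does not by itself address names written in non-Latin scripts.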

image

@Jaifroid
Member Author

Jaifroid commented Apr 2, 2023

For info, this is how it is done in pwa.kiwix.org, where the sort order is by local language name, and the English designation is provided in parentheses for those who may be looking for a particular language but don't know the local name or script. (The mapping isn't perfect, because it is provided by an array that I had to populate manually in places where info was missing.)

image

@kelson42
Collaborator

kelson42 commented Apr 2, 2023

This seems clearly buggy, though I wonder a bit why it has not been detected earlier. The list should be sorted alphabetically, based on what is displayed (not on a technical ISO language code). I'm also not in favour of printing anything additional. To me this should be done in the backend.

@veloman-yunkan
Collaborator

@Jaifroid This is a complex topic (unless there is an established common way to do it).

> It would be better for the list to be sorted in UTF-8 order of localized language names

Then languages with non-Latin scripts will come after languages with Latin scripts.
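
A small sketch of that effect, using a handful of self-names chosen for illustration. Python compares strings by code point, which gives the same order as comparing their UTF-8 bytes, so every Latin-script name precedes the Greek, Cyrillic and CJK ones regardless of how speakers of those languages would expect them to be ordered.

```python
import unicodedata

# Illustrative self-names in four different scripts.
SELF_NAMES = ["Русский", "English", "Ελληνικά", "中文", "Español"]


def first_script(name):
    """Rough script label taken from the Unicode name of the first character."""
    return unicodedata.name(name[0]).split()[0]


# Plain string sort (code point order) and explicit UTF-8 byte order agree.
by_code_point = sorted(SELF_NAMES)
by_utf8_bytes = sorted(SELF_NAMES, key=lambda s: s.encode("utf-8"))

if __name__ == "__main__":
    for name in by_code_point:
        print(first_script(name), name)
```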

> For info, this is how it is done in pwa.kiwix.org, where the sort order is by local language name

I guess that in pwa.kiwix.org you map different scripts to the Latin alphabet through some loose phonetic correspondence and treat the order of letters in the Latin alphabet as a good reference for sorting. However, the Latin alphabet lacks dedicated letters for a lot of sounds and has to use digraphs (like sh, dz, ts, etc.). If a language's self-name starts with such a sound (e.g. dz), it is counterintuitive to look it up among the languages starting with the first letter of the digraph (d). Moreover, the order of sounds differs between alphabets. For example, in Russian the phonetic analog of V (the Cyrillic letter В) is the third letter of the alphabet.

I definitely agree that the current sort order is silly for languages with Latin-based scripts. However the general problem can hardly be solved in a way that doesn't raise similar concerns by speakers of other languages.

@Jaifroid
Member Author

Jaifroid commented Apr 9, 2023

@veloman-yunkan Thanks for the explanation, which makes a lot of sense. Looking at it again, I think my solution was simply to order the list alphabetically by the ISO language codes (en, el, es) that Wikipedia uses, while making sure that same-language groups [like zh: '中文 (Chinese)', lzh: '文言 (Classical Chinese)'] appear together. This seems to give a better result than ordering by English names for languages but displaying localized names. I agree it's not perfect.
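
The ordering described in this comment could be sketched roughly as follows. The `GROUP_OF` mapping and the function names are assumptions made for illustration; this is not the actual pwa.kiwix.org code. Each code is sorted by the code of the group it belongs to, with the group's main code placed before its variants.

```python
# Hypothetical mapping from a variant code to the code it is grouped with.
GROUP_OF = {"lzh": "zh"}


def code_sort_key(code):
    """Sort by group code first; the group's main code precedes its variants."""
    group = GROUP_OF.get(code, code)
    return (group, code != group, code)


def sort_codes(codes):
    return sorted(codes, key=code_sort_key)


if __name__ == "__main__":
    codes = ["zh", "es", "lzh", "mk", "en", "ru", "el"]
    print(sorted(codes))      # plain order separates lzh from zh
    print(sort_codes(codes))  # grouped order keeps them together
```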

@kelson42
Collaborator

> @Jaifroid This is a complex topic (unless there is an established common way to do it).

There is a common way: use libicu (https://unicode-org.github.io/icu/userguide/collation/).

@veloman-yunkan
Collaborator

@kelson42 I still think that the issue at hand is different - we are not talking about different locale-dependent collation methods. The challenge is to sort a list of languages that is composed of strings (language self-names) in different languages. The essence of the conflict can be illustrated by a fictional language Zxcvb - in the alphabet of that language Z is the first letter, thus for the list of languages to be intuitive for the speakers of Zxcvb that language must appear in the beginning (but that is absolutely confusing for other users of the Latin-based alphabets).

@kelson42
Collaborator

kelson42 commented Apr 10, 2023

@veloman-yunkan I understand, but there is no concrete evidence that sorting the language strings (each in its own language) using the collation of the UI language is going to give a wrong result, or is there? If it does, what viable alternative do you propose?

To me, there is clearly no perfect solution AFAIK, but we need to be pragmatic.
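
To make the pragmatic option a little more concrete, here is a rough sketch. It does not use libicu (which is not in Python's standard library); instead it swaps in a crude accent-folding key built on `unicodedata`, which only approximates what a real collator does for Latin-script names and says nothing about other scripts. The name list and `fold_key` are illustrative assumptions.

```python
import unicodedata

# Illustrative Latin-script self-names, some starting with accented letters.
NAMES = ["Zulu", "Čeština", "Español", "English", "Íslenska", "Deutsch"]


def fold_key(name):
    """Crude stand-in for a collation key: strip combining marks, ignore case.

    The original string is kept as a tie-breaker so the order is deterministic.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (stripped.casefold(), name)


plain_order = sorted(NAMES)                 # code point order
folded_order = sorted(NAMES, key=fold_key)  # accent-insensitive approximation

if __name__ == "__main__":
    print(plain_order)   # accented initials end up after "Zulu"
    print(folded_order)  # accented initials sort with their base letters
```

A real implementation would presumably obtain sort keys from an ICU collator for the UI locale rather than from this folding trick.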

@kelson42 kelson42 changed the title Sort order of lanugages in Kiwix Serve / library.kiwix.org is wrong Sort order of languages in Kiwix serve / library.kiwix.org is wrong Apr 10, 2023
@Jaifroid
Member Author

Short of allowing users to set their locale and then using a locale-specific sort algorithm (which would probably be nonsensical unless we knew the names and spellings of languages in each of the many Wikipedia languages), we have to settle for an approximation. The ISO 639 language codes as used by Wikipedia have a natural relationship to the ZIM archives we make. These are mostly two-letter codes, but three-letter codes are occasionally used for more esoteric languages or variants thereof. I know we don't only do Wikipedia, but it is the most multilingual target we have. Unless someone has a better idea, or libicu provides an internationally accepted, global way of sorting international language names that is not specific to any one locale (and still allows giving language names in their native scripts)...

@kelson42
Collaborator

@Jaifroid Wikipedia doesn't use ISO codes for languages in a visible manner AFAIK, except in URLs... and most users don't know how URLs work. Here you really have a tech. trope IMO.

@kelson42
Collaborator

Another option would be to use a more sophisticated selector, which makes the question of sorting less relevant. The language selector of Wikipedia is reusable AFAIK.

@Jaifroid
Member Author

> Another option would be to use a more sophisticated selector, which makes the question of sorting less relevant. The language selector of Wikipedia is reusable AFAIK.

That sounds interesting!

@mgautierfr
Member

> I'm also not in favour of printing anything additional. To me this should be done in the backend.

I'm not sure if the last sentence is related to the previous one, but:

  • Printing additional information can be useful, for example displaying each language's name both in that language and in the current UI language.
  • Displaying data is the role of the frontend. I see no real reason why it should be done in the backend rather than in the frontend.

> The language selector of Wikipedia is reusable AFAIK.

This is interesting, but I have just switched to a language I don't know (it seems to be Farsi, judging from the URL: https://fa.wikipedia.org/wiki/%D8%B5%D9%81%D8%AD%D9%87%D9%94_%D8%A7%D8%B5%D9%84%DB%8C) and I'm totally lost. I cannot change the language to anything else, as I don't know where to click. The only way I can change it is by editing the URL myself. And since the language can be set by a cookie (in our case), it could be really difficult to reset it.

> The challenge is to sort a list of languages that is composed of strings (language self-names) in different languages.

We will have some sort order at some point. Either we choose it deliberately and adapt it as well as we can, or we don't. And as you say, the sorting will never be perfect, if only because we are sorting for two different people at the same time: the one who speaks the current UI language and the one who speaks the wanted language. As we don't know the wanted language, I think it is OK to sort languages by the current UI language.

@Jaifroid
Member Author

The best solution would probably be to sort by language names as displayed in the user's selected locale, but I think that would be really complex to programme and would involve a huge matrix of at least 321 x 321 language names, only counting Wikipedia languages and not the different language codes and names used by Gutenberg, PhET, etc. Can we even get those data in a standardized form, that doesn't require using an online API (since Kiwix Serve must work offline)?

My imperfect/pragmatic solution was to use the Wikipedia list of languages, giving the local name and script first, and the English name in parentheses, supplemented with the language codes used for non-Wikimedia projects (Gutenberg, PhET...). And I didn't have a better sort order than the language codes (el, eml, en, eo, es). The (98%-complete) table I compiled is here. Sorting and pruning of this list is done at display time according to the displayed archives, so it doesn't really matter in what order this array/object is sorted. Unquoted JavaScript object keys can't include dashes, so I had to adapt those (be-tarask becomes beTarask in this table).

NB I'm not advocating for this: just documenting it in case any part of it is useful.

@mgautierfr
Member

mgautierfr commented Apr 11, 2023

> but I think that would be really complex to programme and would involve a huge matrix of at least 321 x 321 language names, ...

Not really.

First, there are two kinds of languages:

  • The languages the UI is translated into (29 for now). This set is static (known at compilation time).
  • The languages displayed in the select box. This set is dynamic, as it depends on the books served.

We would only have something like 29xN languages.

Second, we mostly don't need a matrix.
We already use ICU to get the display name of languages. For now we use it to get the display name of each language in its own language, but nothing prevents us from displaying it in another language.
Only a few languages are not translated in ICU (https://github.com/kiwix/libkiwix/blob/main/src/library_dumper.cpp#L29-L61).
So N seems to be 33 and the matrix is 29*33. This is still huge, but we don't need a matrix.
We have to include those languages in the translation files and let translators translate them into their languages.
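
As a rough illustration of this proposal (not the libkiwix implementation), the sketch below looks up each language's name in the current UI language, falls back to the self-name when no translation exists, and sorts on the resulting label. The dictionaries stand in for ICU data plus the translation files; their contents and the function names are assumptions.

```python
UI_LANGUAGE_COUNT = 29   # UI translations, figure taken from the comment above
UNTRANSLATED_COUNT = 33  # languages lacking ICU names, figure from the comment above

# Hypothetical excerpts standing in for ICU data and translation files.
SELF_NAMES = {"de": "Deutsch", "el": "Ελληνικά", "en": "English", "es": "Español"}
UI_NAMES = {
    "en": {"de": "German", "el": "Greek", "en": "English", "es": "Spanish"},
    "fr": {"de": "allemand", "el": "grec", "en": "anglais"},  # "es" left out on purpose
}


def name_in_ui_language(code, ui_lang):
    """Name of `code` in `ui_lang`, else its self-name, else the bare code."""
    return UI_NAMES.get(ui_lang, {}).get(code, SELF_NAMES.get(code, code))


def selector_entries(codes, ui_lang):
    """(label, code) pairs ordered by the label shown in the UI language."""
    named = [(name_in_ui_language(c, ui_lang), c) for c in codes]
    return sorted(named, key=lambda pair: pair[0].casefold())


if __name__ == "__main__":
    print(UI_LANGUAGE_COUNT * UNTRANSLATED_COUNT, "strings to translate at most")
    print(selector_entries(["es", "en", "de", "el"], "fr"))
```

The `casefold()` key is only a placeholder; in practice the labels would be compared with a collator for the UI locale.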

> Can we even get those data in a standardized form, that doesn't require using an online API (since Kiwix Serve must work offline)?

We already have a lot of data embedded in our binary (from ICU data to our own translations) so it should not be a problem.

@Jaifroid
Member Author

Ah, OK, so it's less complex than I thought. That's good.

@veloman-yunkan
Collaborator

BTW, a similar problem is present (as of writing this comment) in the language selector of the main page of Wikipedia (https://www.wikipedia.org) - the language list starts with Afrikaans followed by Polski:

screenshot

Also Bahasa Indonesia is between Hrvatski and Italiano.

On the one hand it looks like trying to present a long list of languages in a sorted order is a wrong idea - if one wants to select a particular language the right tool is a text box with suggestions. But if one wants to see what languages are available they must have access to the full list, however in that case I don't think that the order matters (or, rather, the order is stipulated by some other criteria, e.g. count of books, count of speakers, etc).
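
The "text box with suggestions" idea could look something like the following sketch, where a query matches a language if it is a prefix of either the self-name or the English name. The catalogue rows and the `suggest` function are hypothetical and only meant to show why the list order matters less with this kind of control.

```python
# Hypothetical catalogue rows: (code, self-name, English name).
CATALOGUE = [
    ("de", "Deutsch", "German"),
    ("el", "Ελληνικά", "Greek"),
    ("es", "Español", "Spanish"),
    ("sv", "Svenska", "Swedish"),
]


def suggest(query, catalogue=CATALOGUE):
    """Codes whose self-name or English name starts with the query (case-insensitive)."""
    needle = query.casefold()
    if not needle:
        return []
    return [
        code
        for code, self_name, english in catalogue
        if self_name.casefold().startswith(needle)
        or english.casefold().startswith(needle)
    ]


if __name__ == "__main__":
    for query in ("s", "esp", "ελ", "g"):
        print(repr(query), suggest(query))
```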

@Jaifroid
Member Author

Well, I'd classify placing "Polski" (in Polish or English) after "Afrikaans" as a bug. I can't see any logic to it, and it's confusing. I don't think the fact that it's not easy to find a sorting mechanism should mean that we use random sorting.

While I take the point that a sorted dropdown is probably not the right approach, and an auto-complete text-box would avoid the sorting issue, I also agree that it's good to see the languages that are available.

In any case, sorting by the English-language spelling of the localized language names is probably quite insulting to some nationalities, and it's plain bizarre if the list doesn't show the sort key (i.e. doesn't give the language spelling by which it is sorted).

@kelson42 kelson42 added this to the 12.2.0 milestone Jul 26, 2023
@kelson42 kelson42 transferred this issue from kiwix/kiwix-tools Jul 26, 2023
@kelson42 kelson42 modified the milestones: 13.0.0, 13.1.0 Jul 26, 2023
@kelson42 kelson42 modified the milestones: 13.2.0, 13.3.0 Jun 1, 2024
Projects
None yet
Development

No branches or pull requests

4 participants