feature: handle multiple alphabets #543

Open
adamdecaf opened this issue Mar 20, 2024 · 0 comments
Labels
bug Something isn't working enhancement New feature or request

Comments

adamdecaf commented Mar 20, 2024

Slack: https://moov-io.slack.com/archives/CFUCEBGH2/p1710500854485369

I get results with both curl 'http://localhost:8084/search?q=wiam+wahhab' and curl 'http://localhost:8084/search?q=الخليلي+سيف'.
It's the same person, and even though the results aren't identical, it shows that you already handle other alphabets to some extent.
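For anyone reproducing the two queries above from code rather than curl, here is a minimal sketch of building the same URLs with non-Latin names safely encoded. The `searchURL` helper is hypothetical; only the host, port, and `/search?q=` path come from the curl commands in this issue.

```go
package main

import (
	"fmt"
	"net/url"
)

// searchURL is a hypothetical helper that builds a /search request URL.
// url.Values.Encode percent-encodes non-ASCII bytes and turns spaces into "+",
// so Arabic (or any other script) survives the trip unchanged.
func searchURL(base, name string) string {
	q := url.Values{}
	q.Set("q", name)
	return base + "/search?" + q.Encode()
}

func main() {
	fmt.Println(searchURL("http://localhost:8084", "wiam wahhab"))
	fmt.Println(searchURL("http://localhost:8084", "الخليلي سيف"))
}
```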

The first link is a study of the phonetization rules of Arabic, and the second is a table of the different spellings used when Arabic is transcribed phonetically into English.
https://ccc.inaoep.mx/~villasen/bib/reglas%20de%20fonetizacion%20Arabe.pdf
http://www.aurint.de/phonetic_transcription.htm
The goal is not a 100% reliable transcription; that is impossible with phonetic transcription. Fortunately, a Jaro-Winkler pass runs on the results.
Most of the list data is in Latin script. Transcribing per person would probably be too expensive, BUT if we run one big transcription over all of the list data, once, so that every entry has a phonetic transcription in each alphabet, the cost would not be too big.
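To illustrate why an imperfect transcription can still be useful when a Jaro-Winkler pass follows, here is a sketch of the textbook Jaro-Winkler similarity (prefix scale 0.1, prefix capped at 4). It is not taken from this project's code, and the sample spellings in `main` are only illustrative.

```go
package main

import "fmt"

// jaro returns the Jaro similarity of a and b, compared rune by rune.
func jaro(a, b string) float64 {
	r1, r2 := []rune(a), []rune(b)
	if len(r1) == 0 && len(r2) == 0 {
		return 1
	}
	if len(r1) == 0 || len(r2) == 0 {
		return 0
	}
	// Two equal runes count as a match only if they are at most `window` apart.
	longest := len(r1)
	if len(r2) > longest {
		longest = len(r2)
	}
	window := longest/2 - 1
	if window < 0 {
		window = 0
	}
	m1 := make([]bool, len(r1))
	m2 := make([]bool, len(r2))
	matches := 0
	for i := range r1 {
		lo, hi := i-window, i+window+1
		if lo < 0 {
			lo = 0
		}
		if hi > len(r2) {
			hi = len(r2)
		}
		for j := lo; j < hi; j++ {
			if !m2[j] && r1[i] == r2[j] {
				m1[i], m2[j] = true, true
				matches++
				break
			}
		}
	}
	if matches == 0 {
		return 0
	}
	// Count matched runes that appear in a different order in the two strings.
	outOfOrder, k := 0, 0
	for i := range r1 {
		if !m1[i] {
			continue
		}
		for !m2[k] {
			k++
		}
		if r1[i] != r2[k] {
			outOfOrder++
		}
		k++
	}
	m := float64(matches)
	t := float64(outOfOrder) / 2 // transpositions
	return (m/float64(len(r1)) + m/float64(len(r2)) + (m-t)/m) / 3
}

// jaroWinkler boosts the Jaro score for strings sharing a common prefix.
func jaroWinkler(a, b string) float64 {
	j := jaro(a, b)
	r1, r2 := []rune(a), []rune(b)
	prefix := 0
	for prefix < 4 && prefix < len(r1) && prefix < len(r2) && r1[prefix] == r2[prefix] {
		prefix++
	}
	return j + float64(prefix)*0.1*(1-j)
}

func main() {
	// Two Latin spellings that different phonetization tables might produce.
	fmt.Printf("%.3f\n", jaroWinkler("wahhab", "wahab"))
	fmt.Printf("%.3f\n", jaroWinkler("wahhab", "ouahab"))
}
```

Spellings that differ by a doubled or dropped letter still score high, which is what lets an approximate transcription pass a reasonable threshold.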
The execution flow would be:
1. Get the list data.
2. Transcribe it into the other alphabets.
3. STORE the transcriptions in the database as tables "arabic", "latin", "mandarin", etc., and mark each row as either original data or a transcription.
4. Get the person to check.
5. Detect the alphabet/language of the person's data (you already do that with the "stopwords" package).
6. Search only in the tables of the same alphabet, AND lower the minimum score if that table's alphabet isn't the original one from the list.
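The lookup half of the flow above could be sketched roughly as follows. Everything here is an assumption for illustration: the `Entry` type, the in-memory `tables` map standing in for database tables, the threshold numbers, and the use of the standard `unicode` package for script detection instead of the "stopwords" package. The sample entries are placeholders, not real list data.

```go
package main

import (
	"fmt"
	"unicode"
)

// Entry is a hypothetical stored row: a name plus a flag saying whether it is
// the list's original spelling or a generated transcription.
type Entry struct {
	Name     string
	Original bool
}

// detectScript returns the table name for the script most letters of s use.
// "mandarin" follows the table naming in this issue; strictly speaking the
// check is for Han characters, which several languages share.
func detectScript(s string) string {
	counts := map[string]int{}
	for _, r := range s {
		switch {
		case unicode.Is(unicode.Latin, r):
			counts["latin"]++
		case unicode.Is(unicode.Arabic, r):
			counts["arabic"]++
		case unicode.Is(unicode.Han, r):
			counts["mandarin"]++
		}
	}
	best, bestN := "unknown", 0
	// Fixed iteration order keeps the result deterministic on ties.
	for _, name := range []string{"latin", "arabic", "mandarin"} {
		if counts[name] > bestN {
			best, bestN = name, counts[name]
		}
	}
	return best
}

// minScore lowers the match threshold for transcribed rows, since a
// transcription is expected to be less exact than the original spelling.
func minScore(e Entry, base, penalty float64) float64 {
	if e.Original {
		return base
	}
	return base - penalty
}

// candidates returns only the rows stored under the query's own script.
func candidates(tables map[string][]Entry, query string) []Entry {
	return tables[detectScript(query)]
}

func main() {
	// Placeholder data: one original Latin row and its Arabic transcription.
	tables := map[string][]Entry{
		"latin":  {{Name: "example name", Original: true}},
		"arabic": {{Name: "مثال", Original: false}},
	}
	for _, q := range []string{"example", "مثال"} {
		for _, e := range candidates(tables, q) {
			fmt.Println(detectScript(q), e.Name, minScore(e, 0.90, 0.05))
		}
	}
}
```

Keeping the original/transcription flag on each row means the threshold adjustment is a per-row decision, so a table can mix both kinds of entries.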
Of course, transcribing into every alphabet will be a lot of work, AND a single alphabet can have several phonetizations (English vs. French, for example). But after a lot of thinking and research, I believe this is the best solution: it is neither too expensive nor too unreliable.

Projects:

Arabic Phonetic Mapping Algorithm.pdf
Arabic Phonetization .pdf

Related: #150
