Valency: saving sentence text, token spans #960

Open
myrix opened this issue Mar 30, 2023 · 0 comments
Labels
backend: bug is related to backend
enhancement: this label means that resolving the issue would improve some part of the system

Comments

myrix (Contributor) commented Mar 30, 2023

Currently we store only the token text of the sentences in valency data, so when we need to reconstruct a sentence's text we have no option but to use the tokens themselves, resulting in imperfect reconstruction.

E.g. in verb valency instance approval at /valency, we reconstruct sentences by simply joining tokens with spaces, which is not quite right around punctuation such as periods and commas, producing e.g. "кутске ай ." instead of "кутске ай." and "пуке , гуртэз" instead of "пуке, гуртэз":
[screenshot of the /valency verb valency instance approval view]
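To illustrate the problem (this is a standalone sketch, not Lingvodoc code), compare the current space-join reconstruction with a small heuristic detokenizer that attaches closing punctuation to the preceding token; the punctuation set here is an assumption and would depend on the corpus conventions:

```python
# Punctuation that should attach to the preceding token without a space
# (illustrative set, not taken from the actual corpora).
CLOSING_PUNCT = {'.', ',', ';', ':', '!', '?', ')'}

def naive_join(tokens):
    """Reconstruction as currently done in /valency approval: space-join."""
    return ' '.join(tokens)

def detokenize(tokens):
    """Heuristic reconstruction: no space before closing punctuation."""
    parts = []
    for token in tokens:
        if parts and token not in CLOSING_PUNCT:
            parts.append(' ')
        parts.append(token)
    return ''.join(parts)

print(naive_join(['кутске', 'ай', '.']))   # кутске ай .
print(detokenize(['кутске', 'ай', '.']))   # кутске ай.
```

Even this heuristic stays imperfect (e.g. opening quotes or parentheses need the opposite rule), which is why storing the full sentence text is the more reliable fix.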

Or in verb valency case analysis (perspective view -> Tool -> Verb valency cases), sentences are reconstructed by an ad-hoc algorithm by Mikhail, see https://github.com/ispras/lingvodoc/blob/b644a0a4256af4fd613b3c2fbf72203e0bed8eb6/lingvodoc/scripts/valency_verb_cases.py#L41.

We should save full sentence texts, making such imperfect reconstructions unnecessary. To do that, we would probably need to modify valency data extraction at https://github.com/ispras/lingvodoc/blob/heavy_refactor/lingvodoc/scripts/export_parser_result.py and in process_eaf() at https://github.com/ispras/lingvodoc/blob/heavy_refactor/lingvodoc/schema/query.py#L16591. Perhaps by adding follow-up text to tokens?
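The "follow-up text" idea could look roughly like this sketch: next to each token we store the exact text (spaces, punctuation) between it and the next token, so concatenation reproduces the sentence verbatim. The field names ('text', 'follow') and the helper are hypothetical, not actual Lingvodoc schema:

```python
def split_with_followups(sentence, tokens):
    """Given the original sentence and its tokens in order, compute for each
    token the text between it and the next token (or the sentence end)."""
    result = []
    pos = 0
    for i, token in enumerate(tokens):
        start = sentence.index(token, pos)
        end = start + len(token)
        next_start = (sentence.index(tokens[i + 1], end)
                      if i + 1 < len(tokens) else len(sentence))
        result.append({'text': token, 'follow': sentence[end:next_start]})
        pos = end
    return result

def reconstruct(token_records):
    """Exact reconstruction from stored tokens plus follow-up text."""
    return ''.join(r['text'] + r['follow'] for r in token_records)

records = split_with_followups('кутске ай.', ['кутске', 'ай'])
print(reconstruct(records))  # кутске ай.
```

One limitation of attaching only follow-up text: any text before the first token would also need to be stored, which is an argument for simply saving the whole sentence string as this issue proposes.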

If we store full sentence texts, we should probably also store token spans indicating where each token is located in the sentence.
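Token spans could be stored as (start, end) character offsets into the saved sentence text, so each token can be located (e.g. for highlighting in the approval UI) without re-tokenizing. A minimal sketch, assuming this representation rather than any actual Lingvodoc schema:

```python
def compute_spans(sentence, tokens):
    """Find each token's (start, end) character offsets, scanning left to
    right so repeated tokens map to successive occurrences."""
    spans = []
    pos = 0
    for token in tokens:
        start = sentence.index(token, pos)
        end = start + len(token)
        spans.append((start, end))
        pos = end
    return spans

sentence = 'пуке, гуртэз'
tokens = ['пуке', ',', 'гуртэз']
spans = compute_spans(sentence, tokens)
# Every span slices back to exactly its token, so no reconstruction
# heuristics are ever needed once the sentence text is stored.
assert all(sentence[s:e] == t for (s, e), t in zip(spans, tokens))
```

Spans computed once at extraction time would also survive any later changes to the tokenizer, since they refer to the stored sentence text rather than to a tokenization procedure.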
