Valency: saving sentence text, token spans #960

Open
myrix opened this issue Mar 30, 2023 · 0 comments
Labels
backend: bug is related to backend
enhancement: this label means that resolving the issue would improve some part of the system

Comments

myrix (Contributor) commented Mar 30, 2023

Currently we store only the token text of the sentences in valency data, so when we need to reconstruct a sentence's text we have no option but to use the tokens themselves, resulting in imperfect reconstruction.

E.g. in verb valency instance approval at /valency, we reconstruct sentences by simply joining tokens with spaces, which is not quite right around punctuation such as periods and commas, producing e.g. "кутске ай ." instead of "кутске ай." and "пуке , гуртэз" instead of "пуке, гуртэз":
[screenshot of the /valency verb valency instance approval view]
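To illustrate the problem (this is a standalone sketch, not Lingvodoc code), compare the current space-join reconstruction with a small heuristic detokenizer that attaches closing punctuation to the preceding token; the punctuation set here is an assumption and would depend on the corpus conventions:

```python
# Punctuation that should attach to the preceding token without a space
# (illustrative set, not taken from the actual corpora).
CLOSING_PUNCT = {'.', ',', ';', ':', '!', '?', ')'}

def naive_join(tokens):
    """Reconstruction as currently done in /valency approval: space-join."""
    return ' '.join(tokens)

def detokenize(tokens):
    """Heuristic reconstruction: no space before closing punctuation."""
    parts = []
    for token in tokens:
        if parts and token not in CLOSING_PUNCT:
            parts.append(' ')
        parts.append(token)
    return ''.join(parts)

print(naive_join(['кутске', 'ай', '.']))   # кутске ай .
print(detokenize(['кутске', 'ай', '.']))   # кутске ай.
```

Even this heuristic stays imperfect (e.g. opening quotes or parentheses need the opposite rule), which is why storing the full sentence text is the more reliable fix.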

Or in verb valency case analysis (perspective view -> Tool -> Verb valency cases), sentences are reconstructed by an ad-hoc algorithm by Mikhail, see https://github.com/ispras/lingvodoc/blob/b644a0a4256af4fd613b3c2fbf72203e0bed8eb6/lingvodoc/scripts/valency_verb_cases.py#L41.

We should save full sentence texts, making such imperfect reconstructions unnecessary. To do that, we would probably need to modify valency data extraction at https://github.com/ispras/lingvodoc/blob/heavy_refactor/lingvodoc/scripts/export_parser_result.py and in process_eaf() at https://github.com/ispras/lingvodoc/blob/heavy_refactor/lingvodoc/schema/query.py#L16591. Perhaps by adding follow-up text to tokens?
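The "follow-up text" idea could look roughly like this sketch: next to each token we store the exact text (spaces, punctuation) between it and the next token, so concatenation reproduces the sentence verbatim. The field names ('text', 'follow') and the helper are hypothetical, not actual Lingvodoc schema:

```python
def split_with_followups(sentence, tokens):
    """Given the original sentence and its tokens in order, compute for each
    token the text between it and the next token (or the sentence end)."""
    result = []
    pos = 0
    for i, token in enumerate(tokens):
        start = sentence.index(token, pos)
        end = start + len(token)
        next_start = (sentence.index(tokens[i + 1], end)
                      if i + 1 < len(tokens) else len(sentence))
        result.append({'text': token, 'follow': sentence[end:next_start]})
        pos = end
    return result

def reconstruct(token_records):
    """Exact reconstruction from stored tokens plus follow-up text."""
    return ''.join(r['text'] + r['follow'] for r in token_records)

records = split_with_followups('кутске ай.', ['кутске', 'ай'])
print(reconstruct(records))  # кутске ай.
```

One limitation of attaching only follow-up text: any text before the first token would also need to be stored, which is an argument for simply saving the whole sentence string as this issue proposes.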

If we store full sentence texts, we should probably also store token spans indicating where each token is located in the sentence.
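Token spans could be stored as (start, end) character offsets into the saved sentence text, so each token can be located (e.g. for highlighting in the approval UI) without re-tokenizing. A minimal sketch, assuming this representation rather than any actual Lingvodoc schema:

```python
def compute_spans(sentence, tokens):
    """Find each token's (start, end) character offsets, scanning left to
    right so repeated tokens map to successive occurrences."""
    spans = []
    pos = 0
    for token in tokens:
        start = sentence.index(token, pos)
        end = start + len(token)
        spans.append((start, end))
        pos = end
    return spans

sentence = 'пуке, гуртэз'
tokens = ['пуке', ',', 'гуртэз']
spans = compute_spans(sentence, tokens)
# Every span slices back to exactly its token, so no reconstruction
# heuristics are ever needed once the sentence text is stored.
assert all(sentence[s:e] == t for (s, e), t in zip(spans, tokens))
```

Spans computed once at extraction time would also survive any later changes to the tokenizer, since they refer to the stored sentence text rather than to a tokenization procedure.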
