Takes very long for intents that have entities #30
That's ... interesting. I'm using nlpaug under the hood, and the code that deals with entities can be found here. I'm basically telling nlpaug that there are stopwords to ignore. It could be that nlpaug uses a regex under the hood, and if so, comparing 10K+ values may certainly take a while. Could you check how many different entity values you have?
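To illustrate the suspicion above, here's a minimal sketch (hypothetical, not nlpaug's actual implementation): matching stopwords with one large regex alternation gets slower as the value list grows, while a per-token set-membership check stays cheap.

```python
import re
import time

# Hypothetical illustration -- not nlpaug's actual code. If stopwords
# are matched via a single big regex alternation, every scan may try
# many branches, so cost grows with the number of values.
values = [f"value_{i}" for i in range(10_000)]
pattern = re.compile("|".join(re.escape(v) for v in values))

sentence = "please remind me about value_9999 tomorrow " * 20

start = time.perf_counter()
for _ in range(50):
    pattern.findall(sentence)
print(f"50 regex scans: {time.perf_counter() - start:.3f}s")

# By contrast, a set-membership test is roughly O(1) per token.
stop = set(values)
start = time.perf_counter()
for _ in range(50):
    kept = [t for t in sentence.split() if t not in stop]
print(f"50 set-based passes: {time.perf_counter() - start:.3f}s")
```

The absolute numbers don't matter; the point is that the regex approach scales with the number of entity values, which would explain why files with many annotated entities are slower.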
I have around 25 different entity values in 140 examples.
Strange. Can you share any details on the kinds of entities? 140 examples don't feel like much at all.
The entities are days of the week (monday-sunday) and numbers (0-9). I ran the command to create typos about 2 hours ago and it's still running.
Are you sure it's the entities that make it so slow? It's not like you've got a file that has many intent examples? Or perhaps intent examples with very long sentences?
Maybe the sentences; they are a bit long. I have 140 examples for this specific intent, but the sentences are indeed long. Thank you!
How long? |
The longest has 50 words. Update: I let it run overnight and the script still didn't finish executing. Should I keep waiting?
No, feel free to cancel. Something strange is happening. Did you have multiple nlu files? If so, do all of them take that long? How many intents/intent examples do you have?
No, the ones without entities took < 1min. I have multiple nlu files and the metrics are like this:
Earlier you mentioned:
Were you referring to a single file earlier? Is it possible for you to share the one file that takes so long? |
Yes, that was a single file. I don't think I can share the file.
Could you try to separate the files into separate intents? |
They are separated; currently each file represents a different intent. For the files whose intents don't have any entities, the script ran fine, but this other file, which has long sentences and some entities, is taking forever.
Could you split that one intent file? Unless we can narrow the problem down, it's going to be hard for me to figure out what is going wrong. |
OK, I just split the file line by line, and it seems that if I remove this sentence, everything works fine:

Also, it seems like if I have just
So it's something to do with the numbers ... interesting. One thing about numbers as entities: are you using DIET to detect these? It feels like a RegexEntityExtractor should work just fine.
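For reference, a sketch of what that could look like in a Rasa config (component names follow Rasa's documented pipeline; adapt to your own project):

```yaml
# Sketch only -- adapt to your config.yml. RegexEntityExtractor picks
# up closed-class entities like single digits via regexes defined in
# the NLU data, so DIET doesn't have to learn them.
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexEntityExtractor
    use_regexes: true
    use_lookup_tables: true
  # ... featurizers and DIETClassifier for intent classification ...

# And in the NLU data, a regex for the extractor to use:
# nlu:
#   - regex: number
#     examples: |
#       - \b[0-9]\b
```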
@koaning I think the issue comes from the function gen_curly_ents in combination with entities consisting of a single character. Debugging through an example that contains the utterance […], the first pass strips the text to the remainder […], cutting the text just to the opening square bracket (this happens because the entity value is just one character). Would a possible fix be to ignore the […] instead of this line, from where on the […]?
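For context, Rasa's entity annotations wrap the value in square brackets; a parser that searches the raw text for the entity *value* can land on the wrong occurrence when the value is a single character that also appears elsewhere. A minimal, hypothetical sketch (not taipo's actual gen_curly_ents) of anchoring on the bracket syntax instead:

```python
import re

# Hypothetical sketch -- not taipo's actual code. Rasa annotations
# come in two shapes: [value](label) and [value]{"entity": "label"}.
text = 'set an alarm for [8]{"entity": "number"} on [monday](day)'

# Anchoring on the brackets finds each annotation unambiguously,
# regardless of how short the entity value is:
CURLY = re.compile(r"\[([^\]]+)\]\{[^}]*\}")
PAREN = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")

for m in CURLY.finditer(text):
    print("curly entity value:", m.group(1))
for m in PAREN.finditer(text):
    print("paren entity value:", m.group(1), "label:", m.group(2))
```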
Are you able to confirm locally that this fixes things? If so, feel free to PR. But also ... let's do that tomorrow! Evenings are best spent not doing serious code ;)
The above changes make the single-character entities pass, and the file processes instantly. However, in the result there's some strange reformatting of the entity annotations:
which messes up the training afterwards due to added whitespace between square and round brackets. I don't think this is related to the change, since this also happens when running the unchanged version against the included test file (which does not have any single-character entities):
produces those lines
I wonder if it's related to this: makcedward/nlpaug#14 and the token pattern applied by […]
If you're able to get the benchmark running locally with some IDE-fu to remove the spaces, I'd argue that's fine for now. The codebase here needs to be refactored for a bunch of reasons, but I wasn't able to make time for it during the 3.0 release. Ideas are certainly welcome though. I think […]
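If you'd rather script the cleanup than do it by hand, a one-off pass like this (a hypothetical helper, not part of taipo) can collapse the stray whitespace around `[value](label)` annotations:

```python
import re

# Hypothetical cleanup helper -- not part of taipo. Collapses stray
# whitespace inside and between the brackets of [value](label)
# entity annotations, e.g. '[ monday ] ( day )' -> '[monday](day)'.
# Note: it only handles the (label) form, not the curly-brace form,
# and it would also touch ordinary prose that happens to match.
ANNOTATION = re.compile(r"\[\s*([^\]]*?)\s*\]\s*\(\s*([^)]*?)\s*\)")

def fix_annotation_spacing(line: str) -> str:
    return ANNOTATION.sub(r"[\1](\2)", line)

print(fix_annotation_spacing("remind me on [ monday ] ( day )"))
# -> remind me on [monday](day)
```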
In my project I have the intents split across multiple .yml files, and for the intents that have no entities,
python -m taipo keyboard augment data/nlu.yml data/typo-nlu.yml
works fast. If I have entity annotations, even a few, it seems like it's taking forever. Is this an issue, or is it just the way it works when you have entities?