Takes very long for intents that have entities #30

Closed
OctaM opened this issue Sep 30, 2021 · 22 comments · Fixed by #31
Comments

OctaM commented Sep 30, 2021

In my project I have the intents split across multiple .yml files, and for the intents that have no entities,
python -m taipo keyboard augment data/nlu.yml data/typo-nlu.yml runs fast. If I have entity annotations, even just a few, it seems to take forever.

Is this a bug, or is this just the way it works when you have entities?

koaning commented Sep 30, 2021

That's ... interesting.

I'm using nlpaug under the hood, and the code that deals with entities can be found here. I'm basically telling nlpaug that there are stopwords to ignore. It could be that nlpaug uses a regex under the hood, and if so, comparing 10K+ values may certainly take a while.

Could you check how many different entity values you have?

OctaM commented Sep 30, 2021

I have around 25 different entity values in 140 examples

koaning commented Sep 30, 2021

Strange.

Can you share any details on the kinds of entities? 140 examples doesn't feel like much at all.

OctaM commented Sep 30, 2021

The entities are days of the week (monday-sunday) and numbers (0-9)

I ran the command to create typos about 2 hours ago and it's still running.

koaning commented Sep 30, 2021

Are you sure it's the entities that make it so slow? Could it be that you've got a file with many intent examples? Or perhaps intent examples with very long sentences?

OctaM commented Sep 30, 2021

Maybe it's the sentences; they are a bit long. I have 140 examples for this specific intent, and the sentences are indeed long.

Thank you!

koaning commented Sep 30, 2021

How long?

OctaM commented Oct 1, 2021

The longest has 50 words

Update: I let it run overnight and the script still didn't finish. Should I keep waiting?

koaning commented Oct 1, 2021

No, feel free to cancel. Something strange is happening.

Did you have multiple nlu files? If so, do all of them take that long? How many intents/intent examples do you have?

OctaM commented Oct 1, 2021

No, the ones without entities took < 1min.

I have multiple nlu files and the metrics are like this:
Number of intent examples: 1394 (14 distinct intents)
Number of entity examples: 499 (5 distinct entities)

koaning commented Oct 1, 2021

Earlier you mentioned:

I have around 25 different entity values in 140 examples

Were you referring to a single file earlier?

Is it possible for you to share the one file that takes so long?

OctaM commented Oct 1, 2021

Yes, that was a single file.

I don't think I can share the file.

koaning commented Oct 1, 2021

Could you try to separate the files into separate intents?

OctaM commented Oct 1, 2021

They are already separated; currently each file represents a different intent.

For the files whose intents don't have any entities, the script ran fine, but this other file, which has long sentences and some entities, takes forever.

koaning commented Oct 1, 2021

Could you split that one intent file? Unless we can narrow the problem down, it's going to be hard for me to figure out what is going wrong.

OctaM commented Oct 1, 2021

OK, I just split the file line by line, and it seems that if I remove this sentence, everything works fine:

  • you just sent me a message saying i had [9](count) drinks [today](moment) after i had said i had my first beer, after i had a message that said i had [8](count) and i put in that i'd had [0](count)

OctaM commented Oct 1, 2021

Also, it seems that if I have just - [1](count) as the only example in a file, I run into the same problem.

koaning commented Oct 11, 2021

So it's something to do with the numbers ... interesting.

One thing about numbers as entities: are you using DIET to detect them? It feels like a RegexEntityExtractor should work just fine.
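For reference, that setup could look roughly like the following (a hypothetical Rasa 2.x+ sketch; the regex itself is only illustrative and would need to match your actual number format):

```yaml
# nlu.yml: define a regex with the same name as the entity
nlu:
- regex: count
  examples: |
    - \b[0-9]\b

# config.yml: add the extractor to the pipeline
pipeline:
- name: RegexEntityExtractor
  use_regexes: True
```

With this in place, DIET no longer needs to learn the numeric entity from examples at all.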

mleimeister commented Dec 7, 2021

@koaning I think the issue comes from the function gen_curly_ents in combination with entities consisting of a single character. Debugging through an example that contains the utterance

    - what is [X](product) ?

the first pass strips the text down to the remainder [X](product) ?, which then gets stuck in an endless loop. This happens because, when there are no curly braces, the indices returned by the find function are br1 = -1 and br2 = -1, so the line

text = text[sq1 + sq2 + br1 + br2 :]

cuts the text only back to the opening square bracket (the entity value is just one character, so br1 + br2 = -2 cancels out the advance). See the screenshots from the debugger for the concrete index values.

Would a possible fix be to ignore the br1 and br2 indices if there are no curly braces? This could look like

        if curly_bit != "":
            text = text[sq1 + sq2 + br1 + br2 :]
        else:
            text = text[sq1 + sq2 :]

instead of this line.

First pass:
[screenshot: debugger showing the concrete index values on the first pass]

Second and subsequent passes:
[screenshot: debugger showing the indices on the second and later passes]

from which point on the text stays the same and is processed endlessly.
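The index arithmetic can be reproduced in isolation. Below is a minimal sketch (hypothetical, not the actual taipo source; `advance` only models the slicing rule under discussion) showing that with a single-character entity and no curly braces, the buggy rule never shrinks the text, while the proposed fix does:

```python
def advance(text: str, fixed: bool = False) -> str:
    """Compute the next value of `text` under the buggy or the fixed slicing rule."""
    sq1 = text.find("[")
    sq2 = text.find("]")
    br1 = text.find("{")  # -1 when there are no curly braces
    br2 = text.find("}")  # -1 when there are no curly braces
    curly_bit = text[br1 : br2 + 1] if br1 != -1 else ""
    if fixed and curly_bit == "":
        # proposed fix: ignore br1/br2 when there is no curly config
        return text[sq1 + sq2 :]
    # original rule: with br1 = br2 = -1 and sq2 = sq1 + 2 (one-char entity),
    # the slice start equals sq1 and the text never shrinks
    return text[sq1 + sq2 + br1 + br2 :]

remainder = "[X](product) ?"
print(advance(remainder))              # unchanged -> endless loop in the real code
print(advance(remainder, fixed=True))  # text shrinks -> loop terminates
```

For a multi-character entity value, sq1 + sq2 + br1 + br2 still lands past the opening bracket, which is why only single-character entities hang.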

koaning commented Dec 7, 2021

Are you able to confirm locally that this fixes things? If so, feel free to PR. But also ... let's do that tomorrow! Evenings are best spent not doing serious code ;)

mleimeister commented Dec 8, 2021

The above changes make the single-character entities pass and the file processes instantly. However, the result shows some strange reformatting of the entity annotations:

   - wtzt is different xbouy [xkre] (product) compared to [nlu] (product )?
   - dtat is the dibfefenfe between [cpge] (product) and [nlu] (product )
   - ahqt is the difrrdence between [dkre] (product) and [nlu] (product )?
   - what is the riffersnse betwrrh [nlu] (product) and [clge] (product )
   - stat is the vidferejce between [nlu] (product) and [doge] (product )?

which messes up the training afterwards due to the added whitespace between the square and round brackets. I don't think this is related to the change, since it also happens when running the unchanged version against the included test file (which does not have any single-character entities):

python -m taipo keyboard augment data/nlu-entities.yml out.yml

produces those lines

   - MVC and nQuerg validation, where to ' weave ' the [javascript] (proglang) and how to fhbed knti master page?
   - list of fmaol addresses tuag can be heed to test a [javascript] (proglang) validation script
   - 'How to reskldr a. lnk in [c #] (proglang )'
   - 'Creating local users on gemotw windows sedcer usjbg [c #] (proglang )'
   - 'Forcibly doilbxck an lnsrallee in [c #] eetkp projects (proglang )'

I wonder if it's related to makcedward/nlpaug#14: the token pattern applied by nlpaug might somehow separate the square and round brackets of the annotation.

koaning commented Dec 8, 2021

If you're able to get the benchmark running locally with some IDE-fu to remove the spaces, I'd argue that's fine for now. The codebase here needs to be refactored for a bunch of reasons, but I wasn't able to make time for it during the 3.0 release.

Ideas are certainly welcome though. I think taipo could be a general data augmentation tool for Rasa.
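As a stopgap, the stray spaces could also be stripped with a small post-processing pass instead of by hand. A hypothetical sketch (`fix_annotation_spaces` is not part of taipo; note that the second substitution trims whitespace inside any parenthesised group, which is fine for these augmented lines but is not a general-purpose fix):

```python
import re

def fix_annotation_spaces(line: str) -> str:
    """Collapse whitespace the augmentation inserts into entity annotations,
    e.g. "[nlu] (product )" -> "[nlu](product)"."""
    line = re.sub(r"\]\s+\(", "](", line)                  # "] ("  -> "]("
    line = re.sub(r"\(\s*([^()]*?)\s*\)", r"(\1)", line)   # "( x )" -> "(x)"
    return line

print(fix_annotation_spaces("- what is [nlu] (product )?"))
# -> "- what is [nlu](product)?"
```

Running every augmented line through such a function before writing the output file would keep the Rasa annotations parseable.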
