Takes very long for intents that have entities #30

Closed
OctaM opened this issue Sep 30, 2021 · 22 comments · Fixed by #31
Comments

OctaM commented Sep 30, 2021

In my project I have the intents split across multiple .yml files, and for the intents that have no entities,
python -m taipo keyboard augment data/nlu.yml data/typo-nlu.yml runs fast. If I have entity annotations, even just a few, it seems to take forever.

Is this a bug, or is this just the way it works when you have entities?

koaning commented Sep 30, 2021

That's ... interesting.

I'm using nlpaug under the hood, and the code that deals with entities can be found here. I'm basically telling nlpaug that there are stopwords to ignore. It could be that nlpaug uses a regex under the hood, and if so, comparing 10K+ values may certainly take a while.

Could you check how many different entity values you have?

OctaM commented Sep 30, 2021

I have around 25 different entity values in 140 examples

koaning commented Sep 30, 2021

Strange.

Can you share any details on the kinds of entities? 140 examples doesn't feel like much at all.

OctaM commented Sep 30, 2021

The entities are days of the week (monday-sunday) and numbers (0-9)

I ran the command to create typos about 2 hours ago and it's still running.

koaning commented Sep 30, 2021

Are you sure it's the entities that make it so slow? Could it be that you've got a file with many intent examples? Or perhaps intent examples with very long sentences?

OctaM commented Sep 30, 2021

Maybe it's the sentences; they are a bit long. I have 140 examples for this specific intent, and the sentences are indeed long.

Thank you!

koaning commented Sep 30, 2021

How long?

OctaM commented Oct 1, 2021

The longest has 50 words

Update: I let it run overnight and the script still didn't finish. Should I keep waiting?

koaning commented Oct 1, 2021

No, feel free to cancel. Something strange is happening.

Did you have multiple nlu files? If so, do all of them take that long? How many intents/intent examples do you have?

OctaM commented Oct 1, 2021

No, the ones without entities took < 1min.

I have multiple nlu files and the metrics are like this:
Number of intent examples: 1394 (14 distinct intents)
Number of entity examples: 499 (5 distinct entities)

koaning commented Oct 1, 2021

Earlier you mentioned:

I have around 25 different entity values in 140 examples

Were you referring to a single file earlier?

Is it possible for you to share the one file that takes so long?

OctaM commented Oct 1, 2021

Yes, that was a single file.

I don't think I can share the file.

koaning commented Oct 1, 2021

Could you try to separate the files into separate intents?

OctaM commented Oct 1, 2021

They are already separated; currently each file represents a different intent.

For the files whose intents don't have any entities, the script ran fine, but this other file, which has long sentences and some entities, takes forever.

koaning commented Oct 1, 2021

Could you split that one intent file? Unless we can narrow the problem down, it's going to be hard for me to figure out what is going wrong.

OctaM commented Oct 1, 2021

OK, I just split the file line by line, and it seems that if I remove this sentence, everything works fine:

  • you just sent me a message saying i had [9](count) drinks [today](moment) after i had said i had my first beer, after i had a message that said i had [8](count) and i put in that i'd had [0](count)

OctaM commented Oct 1, 2021

Also, it seems that if I have just - [1](count) as the only example in a file, I run into the same problem.

koaning commented Oct 11, 2021

So it's something to do with the numbers ... interesting.

One thing about numbers as entities: are you using DIET to detect them? It feels like a RegexEntityExtractor should work just fine.
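For reference, that setup could look roughly like the following (a hypothetical Rasa 2.x+ sketch; the regex itself is only illustrative and would need to match your actual number format):

```yaml
# nlu.yml: define a regex with the same name as the entity
nlu:
- regex: count
  examples: |
    - \b[0-9]\b

# config.yml: add the extractor to the pipeline
pipeline:
- name: RegexEntityExtractor
  use_regexes: True
```

With this in place, DIET no longer needs to learn the numeric entity from examples at all.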

mleimeister commented Dec 7, 2021

@koaning I think the issue comes from the function gen_curly_ents in combination with entities consisting of a single character. Debugging through an example that contains the utterance

    - what is [X](product) ?

the first pass strips the text down to the remainder [X](product) ?, which then gets stuck in an endless loop. This happens because, when there are no curly braces, the indices returned by the find function are br1 = -1 and br2 = -1, so the line

text = text[sq1 + sq2 + br1 + br2 :]

cuts the text only back to the opening square bracket (the entity value is just one character, so br1 + br2 = -2 cancels out the advance). See the screenshots from the debugger for the concrete index values.

Would a possible fix be to ignore the br1 and br2 indices if there are no curly braces? This could look like

        if curly_bit != "":
            text = text[sq1 + sq2 + br1 + br2 :]
        else:
            text = text[sq1 + sq2 :]

instead of this line.

First pass:
[screenshot: debugger showing the concrete index values on the first pass]

Second and subsequent passes:
[screenshot: debugger showing the indices on the second and later passes]

from which point on the text stays the same and is processed endlessly.
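The index arithmetic can be reproduced in isolation. Below is a minimal sketch (hypothetical, not the actual taipo source; `advance` only models the slicing rule under discussion) showing that with a single-character entity and no curly braces, the buggy rule never shrinks the text, while the proposed fix does:

```python
def advance(text: str, fixed: bool = False) -> str:
    """Compute the next value of `text` under the buggy or the fixed slicing rule."""
    sq1 = text.find("[")
    sq2 = text.find("]")
    br1 = text.find("{")  # -1 when there are no curly braces
    br2 = text.find("}")  # -1 when there are no curly braces
    curly_bit = text[br1 : br2 + 1] if br1 != -1 else ""
    if fixed and curly_bit == "":
        # proposed fix: ignore br1/br2 when there is no curly config
        return text[sq1 + sq2 :]
    # original rule: with br1 = br2 = -1 and sq2 = sq1 + 2 (one-char entity),
    # the slice start equals sq1 and the text never shrinks
    return text[sq1 + sq2 + br1 + br2 :]

remainder = "[X](product) ?"
print(advance(remainder))              # unchanged -> endless loop in the real code
print(advance(remainder, fixed=True))  # text shrinks -> loop terminates
```

For a multi-character entity value, sq1 + sq2 + br1 + br2 still lands past the opening bracket, which is why only single-character entities hang.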

koaning commented Dec 7, 2021

Are you able to confirm locally that this fixes things? If so, feel free to PR. But also ... let's do that tomorrow! Evenings are best spent not doing serious code ;)

mleimeister commented Dec 8, 2021

The above changes make the single-character entities pass and the file processes instantly. However, the result shows some strange reformatting of the entity annotations:

   - wtzt is different xbouy [xkre] (product) compared to [nlu] (product )?
   - dtat is the dibfefenfe between [cpge] (product) and [nlu] (product )
   - ahqt is the difrrdence between [dkre] (product) and [nlu] (product )?
   - what is the riffersnse betwrrh [nlu] (product) and [clge] (product )
   - stat is the vidferejce between [nlu] (product) and [doge] (product )?

which messes up the training afterwards due to the added whitespace between the square and round brackets. I don't think this is related to the change, since it also happens when running the unchanged version against the included test file (which does not have any single-character entities):

python -m taipo keyboard augment data/nlu-entities.yml out.yml

produces those lines

   - MVC and nQuerg validation, where to ' weave ' the [javascript] (proglang) and how to fhbed knti master page?
   - list of fmaol addresses tuag can be heed to test a [javascript] (proglang) validation script
   - 'How to reskldr a. lnk in [c #] (proglang )'
   - 'Creating local users on gemotw windows sedcer usjbg [c #] (proglang )'
   - 'Forcibly doilbxck an lnsrallee in [c #] eetkp projects (proglang )'

I wonder if it's related to makcedward/nlpaug#14: the token pattern applied by nlpaug might somehow separate the square and round brackets of the annotation.

koaning commented Dec 8, 2021

If you're able to get the benchmark running locally with some IDE-fu to remove the spaces, I'd argue that's fine for now. The codebase here needs to be refactored for a bunch of reasons, but I wasn't able to make time for it during the 3.0 release.

Ideas are certainly welcome though. I think taipo could be a general data augmentation tool for Rasa.
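As a stopgap, the stray spaces could also be stripped with a small post-processing pass instead of by hand. A hypothetical sketch (`fix_annotation_spaces` is not part of taipo; note that the second substitution trims whitespace inside any parenthesised group, which is fine for these augmented lines but is not a general-purpose fix):

```python
import re

def fix_annotation_spaces(line: str) -> str:
    """Collapse whitespace the augmentation inserts into entity annotations,
    e.g. "[nlu] (product )" -> "[nlu](product)"."""
    line = re.sub(r"\]\s+\(", "](", line)                  # "] ("  -> "]("
    line = re.sub(r"\(\s*([^()]*?)\s*\)", r"(\1)", line)   # "( x )" -> "(x)"
    return line

print(fix_annotation_spaces("- what is [nlu] (product )?"))
# -> "- what is [nlu](product)?"
```

Running every augmented line through such a function before writing the output file would keep the Rasa annotations parseable.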
