Suggest a simple random crop augmenter #126

ddxgz · 2020-04-26T13:42:49Z

A very simple and naive augmenter that just randomly crops part of the original text. It can work at the char, word, or sentence level.

I found it useful with tf-idf, especially when you have only a very small dataset. I can provide an implementation if you'd like.

makcedward · 2020-04-26T17:11:37Z

Thank you for the offer. Could you share more detail about it?

ddxgz · 2020-04-27T07:05:16Z

Here is an example implementation. If set to crop by token with a ratio of 0.1, it returns about 0.9 of the original text. For example:

Original:
The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.

Augmented text might be (1):
brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.

Augmented text might be (2):
The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the

Of course, this might break the syntactic structure of the text, but it introduces a little noise into a small dataset. In my own use case, classifying a few thousand samples with tf-idf, it brought an improvement.

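import random

import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer


def text_random_crop(text, crop_by: str = 'token', crop_ratio: float = 0.1):
    # Token and sentence modes need NLTK's tokenizer data (e.g. 'punkt').
    if crop_by == 'token':
        seq = nltk.word_tokenize(text)
    elif crop_by == 'sentence':
        seq = nltk.sent_tokenize(text)
    else:  # char
        seq = text

    size = len(seq)
    chop_size = int(size * crop_ratio)  # number of units to drop in total
    # Split the dropped units randomly between the start and the end, so that
    # one contiguous window of size - chop_size units is kept.
    chop_offset = random.randint(0, chop_size)
    cropped = seq[chop_offset:size - (chop_size - chop_offset)]

    if crop_by == 'token':
        return TreebankWordDetokenizer().detokenize(cropped)
    if crop_by == 'sentence':
        return ' '.join(cropped)
    return cropped  # char level: the slice is already a string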
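
To make the ratio concrete, here is a small sketch that enumerates every contiguous window a crop could return for the example sentence. It assumes the crop drops int(len * ratio) units in total, split between the two ends, and it uses plain whitespace splitting instead of NLTK. `possible_crops` is a hypothetical helper name used only for illustration; it is not part of nlpaug.

```python
def possible_crops(tokens, crop_ratio=0.1):
    """Return every contiguous window left after dropping
    int(len(tokens) * crop_ratio) tokens from the two ends combined."""
    size = len(tokens)
    drop = int(size * crop_ratio)
    keep = size - drop
    # A window may start at any offset from 0 to drop, inclusive.
    return [tokens[start:start + keep] for start in range(drop + 1)]


text = ("The quick brown fox jumps over the lazy dog. "
        "The quick brown fox jumps over the lazy dog.")
tokens = text.split()  # 18 whitespace-separated tokens

for window in possible_crops(tokens, crop_ratio=0.1):
    print(" ".join(window))
```

With 18 tokens and a ratio of 0.1, one token is dropped, so there are only two possible outcomes: the text without its first token or without its last token. A larger ratio gives more possible windows, which is where the randomness starts to matter.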

makcedward · 2020-04-28T03:20:36Z

Thank you for sharing. RandomCharAug's Delete augmenter and RandomWordAug's Delete augmenter should serve the purpose. For the sentence level, I will implement it in a later release.

ddxgz · 2020-04-28T07:52:10Z

I just checked RandomWordAug with action=delete. The behavior is different from what I suggested.

RandomWordAug deletes words at random positions in a text, while what I suggested randomly crops out one contiguous span of the text.

Maybe I didn't express it well in the previous example. In examples 1 and 2 below, the text within square brackets [ ] is the text that remains after augmentation.

Original:
The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
Crop-augmented text might be (1):
The quick [brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.]
Crop-augmented text might be (2):
[The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the] lazy dog.
Delete-augmented text might be (from RandomWordAug with action=delete):
The [quick brown fox jumps over the lazy] dog. [The quick brown fox jumps over] the lazy [dog.]
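
The contrast between the two behaviors can be sketched in a few lines. This is only an illustration: `crop_words` and `delete_words` are hypothetical helpers, and `delete_words` is a simplified stand-in for random deletion, not nlpaug's actual implementation.

```python
import random


def crop_words(words, crop_ratio, rng):
    """Keep one contiguous window; drop the rest from the two ends."""
    drop = int(len(words) * crop_ratio)
    start = rng.randint(0, drop)
    return words[start:len(words) - (drop - start)]


def delete_words(words, delete_ratio, rng):
    """Drop the same number of words, but at scattered positions."""
    drop = int(len(words) * delete_ratio)
    dropped = set(rng.sample(range(len(words)), drop))
    return [w for i, w in enumerate(words) if i not in dropped]


rng = random.Random(0)  # fixed seed only so the demo is repeatable
words = "The quick brown fox jumps over the lazy dog.".split()
print("crop:  ", " ".join(crop_words(words, 0.3, rng)))
print("delete:", " ".join(delete_words(words, 0.3, rng)))
```

Both functions remove the same number of words. The crop result is always a contiguous slice of the input, while the delete result can have gaps anywhere.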

This crop behavior could be added as a fourth action for the action parameter of RandomCharAug and RandomWordAug, and also at the sentence level.

makcedward · 2020-08-06T00:27:30Z

Got what you mean; it is similar to CropAug (for audio). I will include it in a coming release.

makcedward added the "enhancement" (New feature or request) label Aug 6, 2020

makcedward closed this as completed in 903ec68 Aug 8, 2020
