Suggest a simple random crop augmenter #126

Closed
ddxgz opened this issue Apr 26, 2020 · 5 comments
Labels
enhancement New feature or request

Comments

@ddxgz

ddxgz commented Apr 26, 2020

A very simple, naive augmenter that randomly crops out part of the original text. It can work at the character, word, or sentence level.

I have found it useful with tf-idf, especially when only a very small dataset is available. I can provide an implementation if you'd like.

@makcedward
Owner

Thank you for the offer. Could you share more details?

@ddxgz
Author

ddxgz commented Apr 27, 2020

Here is an example implementation. When set to crop by token with a ratio of 0.1, it returns about 90% of the original text. For example:

Original:
The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.

Augmented text might be (1):
brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.

Augmented text might be (2):
The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the

Of course, this might break the syntactic structure of the text, but it introduces a little noise into a small dataset. In my own use case, classifying a few thousand samples with tf-idf, it improved the results.

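import random

import nltk  # word_tokenize / sent_tokenize need the NLTK tokenizer data (e.g. 'punkt')
from nltk.tokenize.treebank import TreebankWordDetokenizer


def text_random_crop(text, crop_by: str = 'token', crop_ratio: float = 0.1):
    if crop_by == 'token':
        seq = nltk.word_tokenize(text)
    elif crop_by == 'sentence':
        seq = nltk.sent_tokenize(text)
    else:  # char
        seq = text

    size = len(seq)
    chop_size = int(size * crop_ratio)  # total number of units to drop
    # Place a window of the remaining length at a random offset, so the
    # result is one contiguous span of about (1 - crop_ratio) of the input.
    chop_offset = random.randint(0, chop_size)
    cropped = seq[chop_offset:chop_offset + size - chop_size]

    if crop_by == 'token':
        return TreebankWordDetokenizer().detokenize(cropped)
    if crop_by == 'sentence':
        return ' '.join(cropped)
    return cropped  # char level: the slice is already a string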
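
For the small-dataset use case mentioned above, cropped copies could be appended to the labeled data before vectorizing. The following is only a minimal sketch under stated assumptions: it splits on whitespace instead of using NLTK, and the names random_crop_tokens and augment_dataset are hypothetical helpers, not part of any library.

```python
import random


def random_crop_tokens(text, crop_ratio=0.1, rng=random):
    """Keep one contiguous window covering about (1 - crop_ratio) of the tokens."""
    tokens = text.split()  # assumption: plain whitespace split, not the NLTK tokenizer
    drop = int(len(tokens) * crop_ratio)  # number of tokens to remove in total
    start = rng.randint(0, drop)  # randint is inclusive on both ends
    return " ".join(tokens[start:start + len(tokens) - drop])


def augment_dataset(samples, copies=2, crop_ratio=0.1, seed=0):
    """Return the original (text, label) pairs plus `copies` cropped variants of each."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    augmented = list(samples)
    for text, label in samples:
        for _ in range(copies):
            augmented.append((random_crop_tokens(text, crop_ratio, rng), label))
    return augmented


if __name__ == "__main__":
    data = [("The quick brown fox jumps over the lazy dog. " * 2, "animals")]
    for text, label in augment_dataset(data):
        print(label, "|", text)
```

Each cropped variant keeps its original label, so the augmented list can be passed to a tf-idf vectorizer and classifier in the same way as the original samples.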

@makcedward
Owner

Thank you for sharing. The delete action of RandomCharAug and of RandomWordAug should serve the purpose. For the sentence level, I will implement it in a later release.

@ddxgz
Author

ddxgz commented Apr 28, 2020

I just checked RandomWordAug with action=delete. The behavior differs from what I suggested.

RandomWordAug deletes words at random positions throughout a text, while what I suggested crops the text to a single contiguous span, removing words only from its ends.

Maybe I didn't express it well in the previous example. In the examples below, the text within square brackets [ ] is the part that remains after augmentation.

  • Original:
    The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.

  • Crop-augmented text might be (1):
    The quick [brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.]

  • Crop-augmented text might be (2):
    [The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the] lazy dog.

  • Delete-augmented text might be (from RandomWordAug with action=delete):
    The [quick brown fox jumps over the lazy] dog. [The quick brown fox jumps over] the lazy [dog.]

This crop behavior could be added as a fourth option for the action parameter of RandomCharAug and RandomWordAug, and also at the sentence level.
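
To make the difference between the two behaviors concrete, here is a small sketch under stated assumptions. The functions crop_words and delete_words are hypothetical, and delete_words is a simplified stand-in for random deletion, not the library's actual implementation.

```python
import random


def crop_words(words, ratio, rng):
    """Crop: keep a single contiguous run, dropping words only at the two ends."""
    drop = int(len(words) * ratio)
    start = rng.randint(0, drop)  # how many of the dropped words come off the front
    return words[start:start + len(words) - drop]


def delete_words(words, ratio, rng):
    """Delete: drop the same number of words, at positions scattered through the text."""
    drop = int(len(words) * ratio)
    removed = set(rng.sample(range(len(words)), drop))
    return [w for i, w in enumerate(words) if i not in removed]


if __name__ == "__main__":
    rng = random.Random(0)
    words = "The quick brown fox jumps over the lazy dog.".split()
    print("crop:  ", " ".join(crop_words(words, 0.3, rng)))
    print("delete:", " ".join(delete_words(words, 0.3, rng)))
```

Both functions remove the same number of words. The crop result is always an unbroken slice of the input, while the delete result may have gaps anywhere.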

@makcedward added the enhancement (New feature or request) label on Aug 6, 2020
@makcedward
Owner

Got what you mean; it is similar to CropAug (for audio). I will include it in a coming release.
