Stopwords #14

Closed
abcp4 opened this issue Jul 22, 2019 · 2 comments
Labels
bug Something isn't working

Comments

Contributor

abcp4 commented Jul 22, 2019

Hello, it seems like the stopwords aren't being filtered correctly:

image

The word 'quick' is not being ignored. It would be nice if the augmenter simply skipped over stopwords.

makcedward added the bug label Jul 22, 2019
makcedward added a commit that referenced this issue Jul 23, 2019
Add StopWordsAug and Fix #14
Contributor Author

abcp4 commented Jul 25, 2019

A new issue with stopwords:
image

image

image
It seems like trailing punctuation turns a stopword into a different token. For example, 'dog' is not being filtered because 'dog.' is treated as the word.

makcedward reopened this Jul 26, 2019
Owner

makcedward commented Jul 26, 2019

You are right. The default tokenizer splits the text on spaces:
tokens = text.split(' ')
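
To illustrate, here is a minimal sketch in plain Python of why a stopword followed by punctuation slips through. It is not the library's actual filtering code, and the `stopwords` set is a hypothetical example:

```python
text = 'The quick brown fox jumps over the lazy dog.'
stopwords = {'quick', 'dog'}  # hypothetical stopword set, for illustration only

# Same splitting strategy as the default tokenizer.
tokens = text.split(' ')

# Only tokens that match a stopword exactly would be skipped by a filter.
skipped = [t for t in tokens if t in stopwords]

print(tokens[-1])  # the last token still carries the trailing period
print(skipped)     # 'dog' is missing because the token is 'dog.'
```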

I will improve the tokenizer implementation. Until then, there are two workarounds:

  1. Separate punctuation from words with spaces. For example, change the input to 'The quick brown fox , jumps over the lazy dog .'.
  2. Override the augmenter's tokenizer with a custom one:
import re
import nlpaug.augmenter.char as nac

# Note: this tokenizer is not ideal. The pattern keeps only runs of two or
# more word characters, so punctuation (and single-character words) are
# dropped from the returned tokens.
def _tokenizer(text, token_pattern=r"(?u)\b\w\w+\b"):
    return re.compile(token_pattern).findall(text)

aug = nac.QwertyAug()
aug.tokenizer = _tokenizer


2 participants