[ci] [python] reduce unnecessary data loading in tests #3486

Merged: 16 commits, Oct 29, 2020

jameslamb
Collaborator

I spent some time this weekend learning snakeviz, a browser-based viewer for Python profiler output. I tried it on the profiles of the Python package tests here, and I think I found an opportunity to cut the test run time.

I realized we're spending time loading and reloading the same datasets. This PR proposes reducing those calls by caching the results of sklearn.datasets.load_* calls. The datasets are really small so I don't think holding them all in memory will cause an issue.
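The caching idea can be sketched as follows. This is a minimal illustration, not the code from this PR: `slow_loader` is a hypothetical stand-in for a `sklearn.datasets.load_*` function (so the block runs without scikit-learn), and `load_cached` is a hypothetical wrapper name.

```python
from functools import lru_cache

# Counts how often the underlying loader really runs.
CALLS = {"count": 0}


def slow_loader(return_X_y=False):
    """Hypothetical stand-in for a sklearn.datasets.load_* function."""
    CALLS["count"] += 1
    data = [[0.0, 1.0], [1.0, 0.0]]
    target = [0, 1]
    if return_X_y:
        return data, target
    return {"data": data, "target": target}


@lru_cache(maxsize=None)
def load_cached(return_X_y=False):
    # The first call per distinct argument hits the loader;
    # later calls return the stored result.
    return slow_loader(return_X_y=return_X_y)


first = load_cached(return_X_y=True)
second = load_cached(return_X_y=True)
print(CALLS["count"], first is second)
```

One caveat with this approach: every caller receives the same cached object, so a test that mutates the returned data in place would affect later tests.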

UPDATE: OK, from some experiments the speedup from this seems really small on my laptop, but it might be larger in CI environments, where we know that disk I/O is a lot slower (#2965 (comment)).

Results

These results were obtained on my laptop. The blockquote results come from one run of each test. Then I ran them all again 10 times to get a better estimate of the mean time.

test_basic.py

before (mean = 1.33s)

58216 function calls (57604 primitive calls) in 0.462 seconds
==== 10 passed, 3 warnings in 2.18s ====

  • trials (seconds): [1.37, 1.40, 1.34, 1.37, 1.31, 1.30, 1.31, 1.31, 1.31, 1.31]

after (mean = 1.36s)

53470 function calls (52858 primitive calls) in 0.396 seconds
==== 10 passed, 3 warnings in 1.36s ====

  • trials (seconds): [1.39, 1.39, 1.40, 1.36, 1.35, 1.41, 1.40, 1.29, 1.31, 1.35]

test_engine.py

before (mean = 12.78s)

2749043 function calls (2651956 primitive calls) in 9.456 seconds
==== 63 passed, 2 skipped, 20 warnings in 13.69s ====

  • trials (seconds): [14.16, 13.13, 12.85, 12.66, 12.34, 12.60, 13.17, 12.98, 12.40, 11.50]

after (mean = 11.84s)

2749043 function calls (2651956 primitive calls) in 9.456 seconds
==== 63 passed, 2 skipped, 20 warnings in 11.29s ====

  • trials (seconds): [12.18, 10.71, 11.09, 10.88, 10.77, 12.64, 12.47, 14.90, 11.62, 11.14]

test_sklearn.py

before (mean = 9.68s)

4292703 function calls (4172793 primitive calls) in 8.044 seconds
==== 34 passed, 15 warnings in 11.07s ====

  • trials (seconds): [10.17, 9.52, 9.34, 9.58, 9.22, 9.34, 9.75, 10.21, 9.67, 9.99]

after (mean = 9.50s)

2945200 function calls (2843392 primitive calls) in 8.867 seconds
==== 34 passed, 15 warnings in 9.26s ====

  • trials (seconds): [9.67, 8.64, 9.32, 8.71, 8.73, 9.29, 8.97, 12.73, 9.48, 9.44]
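The reported means can be recomputed from the trial lists above. The sketch below copies those timings verbatim; the `summarize` helper and the percent-change column are additions for illustration only.

```python
from statistics import mean

# Trial timings in seconds, copied from the results above.
trials = {
    "test_basic.py": {
        "before": [1.37, 1.40, 1.34, 1.37, 1.31, 1.30, 1.31, 1.31, 1.31, 1.31],
        "after": [1.39, 1.39, 1.40, 1.36, 1.35, 1.41, 1.40, 1.29, 1.31, 1.35],
    },
    "test_engine.py": {
        "before": [14.16, 13.13, 12.85, 12.66, 12.34, 12.60, 13.17, 12.98, 12.40, 11.50],
        "after": [12.18, 10.71, 11.09, 10.88, 10.77, 12.64, 12.47, 14.90, 11.62, 11.14],
    },
    "test_sklearn.py": {
        "before": [10.17, 9.52, 9.34, 9.58, 9.22, 9.34, 9.75, 10.21, 9.67, 9.99],
        "after": [9.67, 8.64, 9.32, 8.71, 8.73, 9.29, 8.97, 12.73, 9.48, 9.44],
    },
}


def summarize(all_trials):
    """Return {file: (mean_before, mean_after, percent_change)}."""
    rows = {}
    for name, runs in all_trials.items():
        before = mean(runs["before"])
        after = mean(runs["after"])
        rows[name] = (before, after, (after - before) / before * 100.0)
    return rows


summary = summarize(trials)
for name, (before, after, change) in summary.items():
    print("{}: before={:.2f}s after={:.2f}s change={:+.1f}%".format(name, before, after, change))
```

On these numbers the mean for test_basic.py is essentially unchanged, test_engine.py drops by roughly 7%, and test_sklearn.py by roughly 2%. The "after" runs for the last two each include one slow outlier (14.90s and 12.73s), so ten trials give only a rough estimate.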

How to reproduce these tests

# install lightgbm
pushd python-package
python setup.py install
popd

# install dependencies
pip install snakeviz pytest-profiling

# profile tests
pytest --profile tests/python_package_test/test_basic.py
pytest --profile tests/python_package_test/test_engine.py
pytest --profile tests/python_package_test/test_sklearn.py

# (optional) visualize profiling data
snakeviz prof/combined.prof
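The profile data can also be inspected without a browser, using the standard-library `pstats` module. A minimal sketch, assuming a profile file such as `prof/combined.prof` from the commands above; the `top_functions` helper name is hypothetical, and the demo at the bottom profiles a toy computation only so that the block runs on its own.

```python
import cProfile
import os
import pstats
import tempfile


def top_functions(prof_path, limit=10):
    """Print the `limit` most expensive entries by cumulative time."""
    stats = pstats.Stats(prof_path)
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return stats


# Self-contained demo: write a small profile file, then read it back.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.prof")
profiler = cProfile.Profile()
profiler.enable()
sum(i * i for i in range(10000))
profiler.disable()
profiler.dump_stats(demo_path)

demo_stats = top_functions(demo_path, limit=5)
```

After a real run, `top_functions("prof/combined.prof")` would list the slowest calls; repeated dataset loading should show up there if it is a significant cost.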

@jameslamb
Collaborator Author

jameslamb commented Oct 26, 2020

interesting, I see this in some tests:

../../../../miniconda/envs/test-env/lib/python3.6/functools.py:477: in lru_cache
    raise TypeError('Expected maxsize to be an integer or None')
E   TypeError: Expected maxsize to be an integer or None

Maybe there was not a default value in older versions of Python? Because today it has a default of 128: https://docs.python.org/3/library/functools.html#functools.lru_cache

Anyway, I'm going to try switching these to just @cache. Since they're just tests that we completely control, not user code, I think it's ok to have an unbounded cache. And that should be faster. From https://docs.python.org/3/library/functools.html#functools.cache:

Returns the same as lru_cache(maxsize=None), creating a thin wrapper around a dictionary lookup for the function arguments. Because it never needs to evict old values, this is smaller and faster than lru_cache() with a size limit.

UPDATE: never mind, functools.cache was only added in Python 3.9.
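One explanation consistent with this traceback, offered as an assumption since the decorated code is not shown here: the decorator was applied bare, as `@lru_cache`. Before Python 3.8 that form passes the decorated function itself as `maxsize`, which raises exactly this TypeError; the default of 128 already existed in 3.6. Calling the decorator explicitly works on every Python 3 version, as in this sketch (`square` is a hypothetical example function):

```python
from functools import lru_cache


# Explicit call: valid on all Python 3 versions.
# A bare "@lru_cache" is only accepted from Python 3.8 onward.
@lru_cache(maxsize=None)
def square(x):
    return x * x


square(3)
square(3)
info = square.cache_info()
print(info)
```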

@StrikerRUS
Collaborator

@jameslamb I think we can set maxsize equal to 2 for functions with only one bool argument (return_X_y=True/False) and to 32 for load_digits. None doesn't look like a good default value:
https://github.com/python/cpython/blob/920cb647ba23feab7987d0dac1bd63bfc2ffc4c0/Lib/functools.py#L549-L562

@jameslamb
Collaborator Author

@StrikerRUS can you explain why you think None is a bad default?

From that code snippet, it looks like it would be faster than setting a maxsize (since there is no code about evicting things from the cache).

I understand why it would be bad in user code, since the cache can grow without limit, but for these unit tests where we completely control the set of unique combinations and know it to be small, I think we should have a preference for the faster option.
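The difference between a bounded and an unbounded cache can be seen with `cache_info()`. This is a small sketch with hypothetical functions `bounded` and `unbounded`, not code from the test suite:

```python
from functools import lru_cache


@lru_cache(maxsize=2)
def bounded(key):
    return key.upper()


@lru_cache(maxsize=None)
def unbounded(key):
    return key.upper()


# Three distinct keys, then the first key again.
for key in ["a", "b", "c", "a"]:
    bounded(key)
    unbounded(key)

bounded_info = bounded.cache_info()
unbounded_info = unbounded.cache_info()
print(bounded_info)
print(unbounded_info)
```

With `maxsize=2`, the entry for `"a"` is evicted when `"c"` arrives, so the final call is a miss again. With `maxsize=None` nothing is evicted and the final call is a hit. When the set of distinct arguments is small and known, as in these tests, the unbounded cache never needs eviction.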

@StrikerRUS
Collaborator

@jameslamb
maxsize=None means no LRU feature. I thought you wanted to use it.

@jameslamb
Collaborator Author

> @jameslamb
> maxsize=None means no LRU feature. I thought you want to use it.

oh I see. No I really just cared more about the caching than using Least Recently Used (LRU), since we have so few variations of kwargs for each dataset.

I did forget though that functools.lru_cache isn't available in Python 2.7. Will push something to fix that 😬
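A fallback for interpreters without `functools.lru_cache` could look roughly like the sketch below. The warning text and the `lru_cache(maxsize=None)` signature match the snippet visible in the review of tests/python_package_test/utils.py; the no-op body is an assumption for illustration, not necessarily what was merged. `add_one` is a hypothetical example function.

```python
try:
    from functools import lru_cache
except ImportError:
    import warnings
    warnings.warn("Could not import functools.lru_cache", RuntimeWarning)

    def lru_cache(maxsize=None):
        """Assumed fallback: a decorator factory that does no caching."""
        def decorator(func):
            # Return the function unchanged, so decorated code still runs.
            return func
        return decorator


@lru_cache(maxsize=None)
def add_one(x):
    return x + 1


result = add_one(1)
print(result)
```

With a shim like this, decorated loaders behave the same everywhere; on old interpreters they simply run uncached.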

@StrikerRUS
Collaborator

@jameslamb Ah OK! I misunderstood you then.

Collaborator

@StrikerRUS StrikerRUS left a comment

@jameslamb Please check my comments below.

tests/python_package_test/utils.py (resolved)
tests/python_package_test/utils.py (outdated, resolved)
import warnings
warnings.warn("Could not import functools.lru_cache", RuntimeWarning)

def lru_cache(maxsize=None):
Collaborator

Please memoize this too

self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(*load_breast_cancer(return_X_y=True),

Collaborator Author

what would be the purpose? That's the only call to load_breast_cancer() in that module. So I think the caching would add a tiny bit of overhead for no benefit.

Collaborator

@StrikerRUS StrikerRUS Oct 28, 2020

How does it differ from the other calls you've memoized? 5 calls of load_breast_cancer() in test_plotting.py is even more than the 3 calls of the same function in test_basic.py, for example.

Collaborator Author

there is only 1 call in test_plotting.py

git grep load_breast_cancer tests/python_package_test/test_plotting.py

image

Collaborator

Yes, you are right, but the method in which this call is performed (setUp) is called before each test. So, actually we have 5 calls.

Collaborator Author

OOOOOOOOOOO haha ok. I haven't used unittest.TestCase in a while, I forgot which one was a "run before every test" setup and which one was a "run exactly once, before any tests" one.

Ok yes I'll update this

Collaborator Author

added in dfb0fd3

Collaborator

Thanks! Actually, I think we should refactor this to "run exactly once, before any tests" (setUpClass()), but that is a separate issue, of course.
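The difference between the two hooks discussed in this thread can be checked with a small counter. A minimal sketch with hypothetical test classes `PerTest` and `PerClass`, unrelated to the real test_plotting.py:

```python
import io
import unittest

CALLS = {"setUp": 0, "setUpClass": 0}


class PerTest(unittest.TestCase):
    def setUp(self):
        # Runs before every test method.
        CALLS["setUp"] += 1

    def test_a(self):
        pass

    def test_b(self):
        pass

    def test_c(self):
        pass


class PerClass(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Runs once, before any test method in the class.
        CALLS["setUpClass"] += 1

    def test_a(self):
        pass

    def test_b(self):
        pass

    def test_c(self):
        pass


def run(case):
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    # Send runner output to a buffer to keep the demo quiet.
    return unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)


run(PerTest)
run(PerClass)
per_test_calls = CALLS["setUp"]
per_class_calls = CALLS["setUpClass"]
print(per_test_calls, per_class_calls)
```

With three test methods each, `setUp` runs three times and `setUpClass` once, which is why a single `load_breast_cancer()` call inside `setUp` becomes one load per test.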

tests/python_package_test/test_sklearn.py (outdated, resolved)
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Collaborator

@StrikerRUS StrikerRUS left a comment

LGTM, thanks a lot!

@github-actions

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023