missing values during prediction should throw an Exception if missing data wasn't present during training #4040

Closed
rmminusrslash opened this issue Mar 2, 2021 · 1 comment


rmminusrslash commented Mar 2, 2021

Summary

When missing values appear at prediction time but the model was trained on data without any missing values, the model predicts these examples without logging a warning or raising an exception (see #2921):

  • Missing numerical values are set to zero
  • Missing categorical values are sent to the right leaf

As a result, several or all features could be missing and the model would still return a prediction (of unknown quality).

I propose to at least log a warning and allow the model to be configured in a strict mode where unexpected missing values lead to an exception (I would argue this should be the default, but it might not work for compatibility reasons).

Motivation

Changing the current behavior is important for using LightGBM in production. With a train-test split, missing data in the test set is easy to spot, and a mismatch between the training and test sets is less likely than a mismatch between training and production data.

In production, data or code bugs can lead to one or multiple features being missing. In my experience, bugs that change the data happen as commonly as other bugs.

The current behavior silently imputes them to zero (numerical case) or assigns them to an existing leaf (categorical case). The model then misbehaves without any signal, which can be hard to detect, especially if the bug affects only the inference side and not the training data (which is typical when the data does not come from a common feature store).

Description

Throw an exception when missing values are seen during inference but not during training. Value imputing should probably be done before calling the model, so I propose to make throwing an exception the default behaviour.

You might not agree (hence the current implementation), so maybe it could be an option to log a warning?
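Until something like this exists in the library, the check can be approximated outside the model. The following is only a minimal sketch of the idea, assuming dense numeric feature matrices: `MissingValueGuard`, `fit`, `check`, and `strict` are hypothetical names made up for illustration and are not part of the LightGBM API.

```python
import warnings

import numpy as np


class MissingValueGuard:
    """Hypothetical helper (not a LightGBM class).

    Remembers which feature columns contained NaN in the training data and
    flags NaN in any other column at prediction time.
    """

    def __init__(self, strict=True):
        # strict=True raises an exception, strict=False only logs a warning.
        self.strict = strict
        self.seen_missing_ = None

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        # One boolean per column: did this feature have any NaN in training?
        self.seen_missing_ = np.isnan(X).any(axis=0)
        return self

    def check(self, X):
        if self.seen_missing_ is None:
            raise RuntimeError("call fit() before check()")
        X = np.asarray(X, dtype=float)
        # Columns with NaN now that had no NaN during training.
        unexpected = np.isnan(X).any(axis=0) & ~self.seen_missing_
        if unexpected.any():
            cols = np.flatnonzero(unexpected).tolist()
            msg = (
                f"missing values in feature columns {cols}, "
                "which had none during training"
            )
            if self.strict:
                raise ValueError(msg)
            warnings.warn(msg)
        return X
```

In use, the guard would be fitted on the training matrix and `check` called on each batch right before it is passed to the model's `predict`. Categorical features encoded in a way other than NaN for "missing" would need their own check.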

It would also be great to improve the documentation on the use_missing=false flag:

set this to false to disable the special handle of missing value

The docstring doesn't explain what is done instead, during training and during inference, when the special handling of missing values is disabled.

References

#2921
https://lightgbm.readthedocs.io/en/latest/Parameters.html

rmminusrslash changed the title from "missing values during prediction can throw an Exception missing data wasn't present during training" to "missing values during prediction should throw an Exception if missing data wasn't present during training" on Mar 2, 2021
@StrikerRUS
Collaborator

Closed in favor of being in #2302. We decided to keep all feature requests in one place.

Welcome to contribute this feature! Please re-open this issue (or post a comment if you are not a topic starter) if you are actively working on implementing this feature.
