
Missing values: NaN prediction does not match expectation from dump_model() #6139

ahuber21 opened this issue Oct 10, 2023 · 6 comments


ahuber21 commented Oct 10, 2023

Description

For a lightgbm.basic.Booster (regression) created with lightgbm.train(), the output of model.predict() does not match the prediction obtained by walking the tree returned by dump_model().

Reproducible example

import lightgbm
import numpy as np
from sklearn.datasets import make_regression


X, y = make_regression(n_samples=100, n_features=10, random_state=42)
params = {"task": "train", "boosting": "gbdt", "objective": "regression", "num_leaves": 4, "verbose": -1, "n_estimators": 1}
model = lightgbm.train(params, train_set=lightgbm.Dataset(X, y))
X_nan = np.array([np.nan] * 20, dtype=np.float32).reshape(2, 10)
prediction = model.predict(X_nan)[0]

# there's only one tree, go to the left child node until we reach a leaf
# to emulate the missing value case (asserting default_left is True)
node = model.dump_model()["tree_info"][0]["tree_structure"]
while "leaf_value" not in node:
    assert node["default_left"] is True
    node = node["left_child"]
expected = node["leaf_value"]

print(f"{prediction=}, {expected=}")
# prediction=3.6268868912259737, expected=-7.990239036381247

assert prediction == expected
# AssertionError

Environment info

LightGBM version or commit hash:

>>> lightgbm.__version__
'4.0.0'

Command(s) you used to install LightGBM

conda install lightgbm

# resulting in
conda list | grep lightgbm
# lightgbm                  4.0.0           py310hc6cd4ac_0    conda-forge

Additional Comments

Edit: Also reproduced with v4.1.0 from PyPI

JSON dump of the tree:
{
    "name": "tree",
    "version": "v4",
    "num_class": 1,
    "num_tree_per_iteration": 1,
    "label_index": 0,
    "max_feature_idx": 9,
    "objective": "regression",
    "average_output": false,
    "feature_names": [
        "Column_0",
        "Column_1",
        "Column_2",
        "Column_3",
        "Column_4",
        "Column_5",
        "Column_6",
        "Column_7",
        "Column_8",
        "Column_9"
    ],
    "monotone_constraints": [],
    "feature_infos": {
        "Column_0": {
            "min_value": -2.211135309007885,
            "max_value": 2.632382064837391,
            "values": []
        },
        "Column_1": {
            "min_value": -2.650969808393012,
            "max_value": 3.0788808084552377,
            "values": []
        },
        "Column_2": {
            "min_value": -2.6197451040897444,
            "max_value": 2.5733598032498604,
            "values": []
        },
        "Column_3": {
            "min_value": -3.2412673400690726,
            "max_value": 2.5600845382687947,
            "values": []
        },
        "Column_4": {
            "min_value": -1.9875689146008928,
            "max_value": 3.852731490654721,
            "values": []
        },
        "Column_5": {
            "min_value": -2.301921164735585,
            "max_value": 2.075400798645439,
            "values": []
        },
        "Column_6": {
            "min_value": -2.198805956620082,
            "max_value": 2.463242112485286,
            "values": []
        },
        "Column_7": {
            "min_value": -1.9187712152990417,
            "max_value": 2.5269324258736217,
            "values": []
        },
        "Column_8": {
            "min_value": -2.6968866429415717,
            "max_value": 1.8861859012105302,
            "values": []
        },
        "Column_9": {
            "min_value": -2.4238793266289567,
            "max_value": 2.4553001399108942,
            "values": []
        }
    },
    "tree_info": [
        {
            "tree_index": 0,
            "num_leaves": 4,
            "num_cat": 0,
            "shrinkage": 1,
            "tree_structure": {
                "split_index": 0,
                "split_feature": 6,
                "split_gain": 1223880,
                "threshold": 0.8576786455955732,
                "decision_type": "<=",
                "default_left": true,
                "missing_type": "None",
                "internal_value": 10.9999,
                "internal_weight": 0,
                "internal_count": 100,
                "left_child": {
                    "split_index": 1,
                    "split_feature": 0,
                    "split_gain": 475907,
                    "threshold": 0.4374422005998846,
                    "decision_type": "<=",
                    "default_left": true,
                    "missing_type": "None",
                    "internal_value": 4.61272,
                    "internal_weight": 75,
                    "internal_count": 75,
                    "left_child": {
                        "split_index": 2,
                        "split_feature": 1,
                        "split_gain": 161949,
                        "threshold": -0.2550224746204132,
                        "decision_type": "<=",
                        "default_left": true,
                        "missing_type": "None",
                        "internal_value": -1.01996,
                        "internal_weight": 50,
                        "internal_count": 50,
                        "left_child": {
                            "leaf_index": 0,
                            "leaf_value": -7.990239036381247,
                            "leaf_weight": 20,
                            "leaf_count": 20
                        },
                        "right_child": {
                            "leaf_index": 3,
                            "leaf_value": 3.6268868912259737,
                            "leaf_weight": 30,
                            "leaf_count": 30
                        }
                    },
                    "right_child": {
                        "leaf_index": 2,
                        "leaf_value": 15.878088150918485,
                        "leaf_weight": 25,
                        "leaf_count": 25
                    }
                },
                "right_child": {
                    "leaf_index": 1,
                    "leaf_value": 30.16146995943785,
                    "leaf_weight": 25,
                    "leaf_count": 25
                }
            }
        }
    ],
    "feature_importances": {
        "Column_0": 1,
        "Column_1": 1,
        "Column_6": 1
    },
    "pandas_categorical": null
}

jmoralez (Collaborator) commented:

Hey @ahuber21, thanks for using LightGBM. The prediction comes from the right leaf of the last split; you can see the decision criteria here:

def _determine_direction_for_numeric_split(
    fval: float,
    threshold: float,
    missing_type_str: str,
    default_left: bool,
) -> str:
    missing_type = _MissingType(missing_type_str)
    if math.isnan(fval) and missing_type != _MissingType.NAN:
        fval = 0.0
    if ((missing_type == _MissingType.ZERO and _is_zero(fval))
            or (missing_type == _MissingType.NAN and math.isnan(fval))):
        direction = 'left' if default_left else 'right'
    else:
        direction = 'left' if fval <= threshold else 'right'
    return direction

In this case the feature value is NaN and the missing type is "None", so the value is replaced with 0 and then compared against the thresholds.
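
To make that concrete, here is a minimal sketch (reusing model and X_nan from the reproducible example above) that walks the dumped tree with fval replaced by 0.0, mirroring the rule in the snippet:

# Emulate LightGBM's traversal for missing_type "None":
# a NaN feature value is replaced with 0.0 before the threshold comparison.
node = model.dump_model()["tree_info"][0]["tree_structure"]
while "leaf_value" not in node:
    fval = 0.0  # NaN -> 0.0 because missing_type is "None"
    node = node["left_child"] if fval <= node["threshold"] else node["right_child"]
print(node["leaf_value"])  # 3.6268868912259737, matching model.predict(X_nan)[0]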

Please let us know if you have any further questions.

jmoralez (Collaborator) commented:

Just to complement the answer a bit: the missing type is None because you didn't have any missing values in your training set.

From LightGBM/src/io/bin.cpp, lines 322 to 333 at commit 8ed371c:

if (!use_missing) {
  missing_type_ = MissingType::None;
} else if (zero_as_missing) {
  missing_type_ = MissingType::Zero;
} else {
  if (non_na_cnt == num_sample_values) {
    missing_type_ = MissingType::None;
  } else {
    missing_type_ = MissingType::NaN;
    na_cnt = num_sample_values - non_na_cnt;
  }
}
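
For illustration, a small sketch (reusing X, y, and params from the reproducible example; injecting an all-NaN row is an assumption chosen to put missing values into every feature) showing how the dumped missing_type changes once the training data contains NaNs:

import numpy as np

X_with_nan = X.copy()
X_with_nan[0, :] = np.nan  # inject missing values into every feature

model_nan = lightgbm.train(params, train_set=lightgbm.Dataset(X_with_nan, y))
root = model_nan.dump_model()["tree_info"][0]["tree_structure"]
print(root["missing_type"])  # expected "NaN" here, versus "None" above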

ahuber21 (Author) commented:

Hi @jmoralez, the docs suggested that NaN would be used:

LightGBM uses NA (NaN) to represent missing values by default. Change it to use zero by setting zero_as_missing=true.
https://lightgbm.readthedocs.io/en/v4.1.0/Advanced-Topics.html

Moreover, I tried to use X_none = np.array([None] * 20, dtype=np.float32).reshape(2, 10) with the same result.

At this point, though, I agree this is not a bug. I will modify my code / training sample so that MissingType::NaN ends up being selected.
Nevertheless, the behavior feels a bit inconsistent; maybe the docs could be aligned a bit better with the code.

Thank you!

jmoralez (Collaborator) commented:

That refers to the training part. If you have NaNs in your training set, they will be represented as missing and the missing type will be set to MissingType::NaN (the C++ enum). If you don't have any missing values in your training set, the missing type will be MissingType::None unless you set zero_as_missing=True. For inference, both None and NaN (the Python values) should produce the same results, as the sketch below shows.
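
A quick check of that last point, as a sketch reusing model and X_nan from the reproducible example:

import numpy as np

# None entries in a float array are converted to NaN by NumPy,
# so both inputs should yield identical predictions.
X_none = np.array([None] * 20, dtype=np.float32).reshape(2, 10)
assert np.array_equal(model.predict(X_none), model.predict(X_nan))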

ahuber21 (Author) commented:

Thanks for the details.

I was mostly surprised because the behavior was different from similar models, e.g. classifiers from XGBoost. After adding NaN values to my training data set, everything works as you explained. Great!

Please consider this issue resolved. But allow me one more question out of curiosity.
It looks like LightGBM is making a couple of assumptions about what is zero and what is missing. Effectively, each of 0, None, and NaN could mean either the literal value zero or a missing value. That's a lot of possibilities, and I doubt users will notice when this goes wrong, since the model will still produce valid-looking results. I only discovered my issue/misunderstanding in a unit test. Do you think the average LightGBM user is aware of these intricacies?
(Also, how are these prioritized? What happens if there are Nones and NaNs in the training set? What happens when there are neither, but both Nones and NaNs appear in the inference data, etc.?)

jmoralez (Collaborator) commented Oct 11, 2023:

Hey. I agree that the rules can be confusing; #2921 was exactly about trying to clarify that. We also have #4040 to warn the user about this behavior, which might have helped you in this case.

About your questions:

  • Python's None and NaN are treated the same way (None is converted to NaN).
  • At inference time, when there were no missing values during training, exactly what happened to you happens: missing values are replaced with 0 (see the sketch below).
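
That second bullet can be checked directly; a sketch assuming the original model from the reproducible example (trained without missing values):

import numpy as np

# With missing_type "None", NaNs at inference are replaced with 0,
# so an all-NaN row should predict the same as an all-zero row.
X_zeros = np.zeros((2, 10), dtype=np.float32)
assert np.array_equal(model.predict(X_zeros), model.predict(X_nan))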
