
Missing values: NaN prediction does not match expectation from dump_model() #6139

ahuber21 opened this issue Oct 10, 2023 · 6 comments


ahuber21 commented Oct 10, 2023

Description

For a lightgbm.basic.Booster (regression) created with lightgbm.train(), the output of model.predict() does not match the prediction obtained by walking the tree returned by dump_model().

Reproducible example

import lightgbm
import numpy as np
from sklearn.datasets import make_regression


X, y = make_regression(n_samples=100, n_features=10, random_state=42)
params = {"task": "train", "boosting": "gbdt", "objective": "regression", "num_leaves": 4, "verbose": -1, "n_estimators": 1}
model = lightgbm.train(params, train_set=lightgbm.Dataset(X, y))
X_nan = np.array([np.nan] * 20, dtype=np.float32).reshape(2, 10)
prediction = model.predict(X_nan)[0]

# there's only one tree, go to the left child node until we reach a leaf
# to emulate the missing value case (asserting default_left is True)
node = model.dump_model()["tree_info"][0]["tree_structure"]
while "leaf_value" not in node:
    assert node["default_left"] is True
    node = node["left_child"]
expected = node["leaf_value"]

print(f"{prediction=}, {expected=}")
# prediction=3.6268868912259737, expected=-7.990239036381247

assert prediction == expected
# AssertionError

Environment info

LightGBM version or commit hash:

>>> lightgbm.__version__
'4.0.0'

Command(s) you used to install LightGBM

conda install lightgbm

# resulting in
conda list | grep lightgbm
# lightgbm                  4.0.0           py310hc6cd4ac_0    conda-forge

Additional Comments

Edit: Also reproduced with v4.1.0 from PyPI

JSON dump of the tree:
{
    "name": "tree",
    "version": "v4",
    "num_class": 1,
    "num_tree_per_iteration": 1,
    "label_index": 0,
    "max_feature_idx": 9,
    "objective": "regression",
    "average_output": false,
    "feature_names": [
        "Column_0",
        "Column_1",
        "Column_2",
        "Column_3",
        "Column_4",
        "Column_5",
        "Column_6",
        "Column_7",
        "Column_8",
        "Column_9"
    ],
    "monotone_constraints": [],
    "feature_infos": {
        "Column_0": {
            "min_value": -2.211135309007885,
            "max_value": 2.632382064837391,
            "values": []
        },
        "Column_1": {
            "min_value": -2.650969808393012,
            "max_value": 3.0788808084552377,
            "values": []
        },
        "Column_2": {
            "min_value": -2.6197451040897444,
            "max_value": 2.5733598032498604,
            "values": []
        },
        "Column_3": {
            "min_value": -3.2412673400690726,
            "max_value": 2.5600845382687947,
            "values": []
        },
        "Column_4": {
            "min_value": -1.9875689146008928,
            "max_value": 3.852731490654721,
            "values": []
        },
        "Column_5": {
            "min_value": -2.301921164735585,
            "max_value": 2.075400798645439,
            "values": []
        },
        "Column_6": {
            "min_value": -2.198805956620082,
            "max_value": 2.463242112485286,
            "values": []
        },
        "Column_7": {
            "min_value": -1.9187712152990417,
            "max_value": 2.5269324258736217,
            "values": []
        },
        "Column_8": {
            "min_value": -2.6968866429415717,
            "max_value": 1.8861859012105302,
            "values": []
        },
        "Column_9": {
            "min_value": -2.4238793266289567,
            "max_value": 2.4553001399108942,
            "values": []
        }
    },
    "tree_info": [
        {
            "tree_index": 0,
            "num_leaves": 4,
            "num_cat": 0,
            "shrinkage": 1,
            "tree_structure": {
                "split_index": 0,
                "split_feature": 6,
                "split_gain": 1223880,
                "threshold": 0.8576786455955732,
                "decision_type": "<=",
                "default_left": true,
                "missing_type": "None",
                "internal_value": 10.9999,
                "internal_weight": 0,
                "internal_count": 100,
                "left_child": {
                    "split_index": 1,
                    "split_feature": 0,
                    "split_gain": 475907,
                    "threshold": 0.4374422005998846,
                    "decision_type": "<=",
                    "default_left": true,
                    "missing_type": "None",
                    "internal_value": 4.61272,
                    "internal_weight": 75,
                    "internal_count": 75,
                    "left_child": {
                        "split_index": 2,
                        "split_feature": 1,
                        "split_gain": 161949,
                        "threshold": -0.2550224746204132,
                        "decision_type": "<=",
                        "default_left": true,
                        "missing_type": "None",
                        "internal_value": -1.01996,
                        "internal_weight": 50,
                        "internal_count": 50,
                        "left_child": {
                            "leaf_index": 0,
                            "leaf_value": -7.990239036381247,
                            "leaf_weight": 20,
                            "leaf_count": 20
                        },
                        "right_child": {
                            "leaf_index": 3,
                            "leaf_value": 3.6268868912259737,
                            "leaf_weight": 30,
                            "leaf_count": 30
                        }
                    },
                    "right_child": {
                        "leaf_index": 2,
                        "leaf_value": 15.878088150918485,
                        "leaf_weight": 25,
                        "leaf_count": 25
                    }
                },
                "right_child": {
                    "leaf_index": 1,
                    "leaf_value": 30.16146995943785,
                    "leaf_weight": 25,
                    "leaf_count": 25
                }
            }
        }
    ],
    "feature_importances": {
        "Column_0": 1,
        "Column_1": 1,
        "Column_6": 1
    },
    "pandas_categorical": null
}

jmoralez (Collaborator) commented:

Hey @ahuber21, thanks for using LightGBM. The prediction comes from the right leaf of the last split; you can see the decision criteria here:

def _determine_direction_for_numeric_split(
    fval: float,
    threshold: float,
    missing_type_str: str,
    default_left: bool,
) -> str:
    missing_type = _MissingType(missing_type_str)
    if math.isnan(fval) and missing_type != _MissingType.NAN:
        fval = 0.0
    if ((missing_type == _MissingType.ZERO and _is_zero(fval))
            or (missing_type == _MissingType.NAN and math.isnan(fval))):
        direction = 'left' if default_left else 'right'
    else:
        direction = 'left' if fval <= threshold else 'right'
    return direction

In this case the feature value is NaN and the missing type is "None", so the value is replaced with 0 and then compared against the thresholds.
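
To make that concrete, here is a minimal sketch (reusing model and X_nan from the reproducible example above) that walks the dumped tree with fval replaced by 0.0, mirroring the rule in the snippet:

# Emulate LightGBM's traversal for missing_type "None":
# a NaN feature value is replaced with 0.0 before the threshold comparison.
node = model.dump_model()["tree_info"][0]["tree_structure"]
while "leaf_value" not in node:
    fval = 0.0  # NaN -> 0.0 because missing_type is "None"
    node = node["left_child"] if fval <= node["threshold"] else node["right_child"]
print(node["leaf_value"])  # 3.6268868912259737, matching model.predict(X_nan)[0]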

Please let us know if you have any further questions.

jmoralez (Collaborator) commented:

Just to complement the answer a bit: the missing type is None because you didn't have any missing values in your training set.

From LightGBM/src/io/bin.cpp, lines 322 to 333 at commit 8ed371c:

if (!use_missing) {
  missing_type_ = MissingType::None;
} else if (zero_as_missing) {
  missing_type_ = MissingType::Zero;
} else {
  if (non_na_cnt == num_sample_values) {
    missing_type_ = MissingType::None;
  } else {
    missing_type_ = MissingType::NaN;
    na_cnt = num_sample_values - non_na_cnt;
  }
}
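
For illustration, a small sketch (reusing X, y, and params from the reproducible example; injecting an all-NaN row is an assumption chosen to put missing values into every feature) showing how the dumped missing_type changes once the training data contains NaNs:

import numpy as np

X_with_nan = X.copy()
X_with_nan[0, :] = np.nan  # inject missing values into every feature

model_nan = lightgbm.train(params, train_set=lightgbm.Dataset(X_with_nan, y))
root = model_nan.dump_model()["tree_info"][0]["tree_structure"]
print(root["missing_type"])  # expected "NaN" here, versus "None" above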

ahuber21 (Author) commented:

Hi @jmoralez, the docs suggested that NaN would be used:

LightGBM uses NA (NaN) to represent missing values by default. Change it to use zero by setting zero_as_missing=true.
https://lightgbm.readthedocs.io/en/v4.1.0/Advanced-Topics.html

Moreover, I tried to use X_none = np.array([None] * 20, dtype=np.float32).reshape(2, 10) with the same result.

At this point, though, I agree this is not a bug. I will modify my code / training sample so that MissingType::NaN ends up being selected.
Nevertheless, the behavior feels a bit inconsistent; maybe the docs could be aligned a bit better with the code.

Thank you!

jmoralez (Collaborator) commented:

That refers to the training part. If you have NaNs in your training set, they will be represented as missing and the missing type will be set to MissingType::NaN (the C++ enum). If you don't have any missing values in your training set, the missing type will be MissingType::None unless you set zero_as_missing=True. For inference, both None and NaN (the Python values) should produce the same results, as the sketch below shows.
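
A quick check of that last point, as a sketch reusing model and X_nan from the reproducible example:

import numpy as np

# None entries in a float array are converted to NaN by NumPy,
# so both inputs should yield identical predictions.
X_none = np.array([None] * 20, dtype=np.float32).reshape(2, 10)
assert np.array_equal(model.predict(X_none), model.predict(X_nan))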

ahuber21 (Author) commented:

Thanks for the details.

I was mostly surprised because the behavior was different from similar models, e.g. classifiers from XGBoost. After adding NaN values to my training data set, everything works as you explained. Great!

Please consider this issue resolved. But allow me one more question out of curiosity.
It looks like LightGBM is making a couple of assumptions about what is zero and what is missing. Effectively, each of 0, None, and NaN could mean either the literal value zero or a missing value. That's a lot of possibilities, and I doubt users will notice when this goes wrong, since the model will still produce valid-looking results. I only discovered my issue/misunderstanding in a unit test. Do you think the average LightGBM user is aware of these intricacies?
(Also, how are these prioritized? What happens if there are Nones and NaNs in the training set? What happens when there are neither, but both Nones and NaNs appear in the inference data, etc.?)

jmoralez (Collaborator) commented Oct 11, 2023:

Hey. I agree that the rules can be confusing; #2921 was exactly about trying to clarify that. We also have #4040 to warn the user about this behavior, which might have helped you in this case.

About your questions:

  • Python's None and NaN are treated the same way (None is converted to NaN).
  • At inference time, when there were no missing values during training, exactly what happened to you happens: missing values are replaced with 0 (see the sketch below).
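
That second bullet can be checked directly; a sketch assuming the original model from the reproducible example (trained without missing values):

import numpy as np

# With missing_type "None", NaNs at inference are replaced with 0,
# so an all-NaN row should predict the same as an all-zero row.
X_zeros = np.zeros((2, 10), dtype=np.float32)
assert np.array_equal(model.predict(X_zeros), model.predict(X_nan))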
