Error saving very large LightGBM models #3858

Closed
CHDev93 opened this issue Jan 26, 2021 · 17 comments
Comments

@CHDev93

CHDev93 commented Jan 26, 2021

How are you using LightGBM?

Python package

LightGBM component:

Environment info

Operating System: Windows 10

CPU/GPU model: GPU

C++ compiler version: NA

CMake version: NA

Java version: NA

Python version: 3.6.6

R version: NA

Other: NA

LightGBM version or commit hash: 3.1.0

Error message and / or logs

I'm observing errors when trying to train sufficiently large tree models (on either CPU or GPU). Namely, when max_leaves and num_boosting_rounds are sufficiently high, the boosting rounds all finish, but an error occurs when trying to serialise the model and deserialise it back.

To avoid the automatic to and from_string calls after the final boosting round, I've tried setting keep_training_booster=True and then saving the model out to disk, then reloading it. Saving the model as text or as pickle both succeed on save but then fail on model load.

I've investigated this issue and found that when writing out to a text file, the last tree written is "Tree=4348" even though I've requested more boosting rounds than this. When loading the model, there's obviously a mismatch between the number of elements in the "tree_sizes" attribute of the file (5000) and the actual number of trees in the file (4348), which causes an error.

I believe the underlying issue is the same as here: #2828
I also found this comment alluding to a 2GB limit of string streams, and my text file is almost exactly 2GB: #372 (comment)
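
A mismatch like this can be checked directly from the text dump. Below is a minimal sketch (it assumes the usual LightGBM text-model layout, with a space-separated tree_sizes= header line and one Tree=N header per tree; "model.txt" is a placeholder for the file written by save_model):

# Minimal sketch: compare the number of entries in the tree_sizes= header
# against the number of Tree= blocks actually written to the text dump.
# "model.txt" is a placeholder path for the file produced by save_model.
n_listed = None
n_written = 0
with open("model.txt", "r") as f:
    for line in f:
        if line.startswith("tree_sizes="):
            n_listed = len(line.split("=", 1)[1].split())
        elif line.startswith("Tree="):
            n_written += 1

print(f"trees listed in tree_sizes: {n_listed}")
print(f"trees actually in file:     {n_written}")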

I added some of my own logging inside the lightgbm python layer and have the following logs

....
[LightGBM] [Debug] Trained a tree with leaves = 5000 and max_depth = 38
[LightGBM] [Debug] Trained a tree with leaves = 5000 and max_depth = 43
[LightGBM] [Debug] Trained a tree with leaves = 5000 and max_depth = 46
CASEY: Finished final boosting iteration
Training complete: 21154.90s
Attempting to save model as pickle
Attempting to convert model with default buffer len 1048576
Creating string buffer for actual len (2147483648)
Converting model with actual buffer len (2147483648)
Converted model to string
Decode model string to utf-8
Successfully saved model as pickle
Attempting to load model from pickle
<program hangs and pulls up dialog box indicating python has stopped working>

Reproducible example(s)

Note the model must be really large to observe this error. This took almost 6 hours on a V100 GPU. If model size is not dependent on number of rows or columns, you might be able to use smaller numbers than I did and speed things up a little.
Before reaching enough boosting rounds for the model to crash, the performance of the model continues to increase, so there's reason to believe a model this big is really necessary.

import time

import numpy as np
import lightgbm as lgb

n = int(2e7)
m = 250
max_leaves = 5000
max_bin = 255
x_train = np.random.randn(n, m).astype(np.float32)
A = np.random.randint(-5, 5, size=(m, 1))
y_train = (x_train @ A).astype(np.float32)

print(f"x_train.shape = {x_train.shape}, y_train.shape = {y_train.shape}")
n_boosting_rounds = 5000
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'device': 'gpu',
    'metric': {'rmse'},
    'num_leaves': max_leaves,
    'bagging_fraction': 0.5,
    'feature_fraction': 0.5,
    'learning_rate': 0.01,
    'verbose': 2,
    'max_bin': max_bin,
}
ds_train = lgb.Dataset(x_train, y_train.ravel(), free_raw_data=False).construct()
start = time.perf_counter()
gbm = lgb.train(
    params,
    ds_train,
    num_boost_round=n_boosting_rounds,
    keep_training_booster=True, # set this to False and the code will crash here
)

print(f"Training complete: {elapsed_train_time:.2f}")

model_file = f"{train_start}_model.txt"
print(f"Attempting to save model as {model_format}", flush=True)
gbm.save_model(model_file)
print(f"Successfully saved model as {model_format}!", flush=True)

print(f"Attempting to load model from {model_format}", flush=True)
gbm = lgb.Booster(model_file=model_file) # program dies here!!
print(f"Successfully loaded model from {model_format}", flush=True)

Steps to reproduce

  1. Generate some fake linear data
  2. Train a gbdt with sufficient boosting_rounds and max_leaves to cause an error (note the boosting is fine, it's the clean up bit after boosting that's problematic)
@StrikerRUS
Collaborator

Hi @CHDev93 !

We already have a feature request to support extremely large models: #2265.
Are the errors you observe the same as the ones described there?

@CHDev93
Author

CHDev93 commented Jan 26, 2021

Thanks for the quick follow-up @StrikerRUS . I don't know how I missed this issue when looking through tickets over the past couple of days. Yes, this problem looks very similar, though I don't get an error as helpful as the one shown in that ticket (or any error at all, actually).

Setting keep_training_booster=True as you mention in that issue does allow the lgb.train call to succeed, but then I still have no way to persist the model. Basically, if I set this flag, how can I put the model on disk such that I can load it back and make predictions? This comment seems to imply saving and loading should work but in practice I'm finding the file is corrupt #2265 (comment)

@StrikerRUS
Collaborator

From your logs:

...
Attempting to convert model with default buffer len 1048576
Creating string buffer for actual len (2147483648)
Converting model with actual buffer len (2147483648)
...

2147483648 is exactly INT_MAX + 1. I don't think this is a coincidence...
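
A quick check of the arithmetic (a two-line sketch):

# 2147483648 is 2**31, i.e. one past the largest 32-bit signed int (INT_MAX = 2147483647)
assert 2_147_483_648 == 2**31 == 2_147_483_647 + 1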

This comment seems to imply saving and loading should work but in practice I'm finding the file is corrupt #2265 (comment)

Unfortunately, it is only a potential fix. It is not implemented yet. This feature request is still open.
#2302


@CHDev93
Author

CHDev93 commented Jan 26, 2021

Okay, I was just checking whether there was a workaround involving keep_training_booster combined with some method of getting the model to disk. If I understand correctly, there's no way around it at the current time.

Thanks for your help on this. Feel free to close this issue

@StrikerRUS
Collaborator

Setting keep_training_booster=True is a workaround to complete training successfully.

I'm quite surprised that

Saving the model as text or as pickle both succeed on save but then fail on model load.

for the part about pickle. Maybe you hit the following pickle issue or something similar? Could you please try joblib or some other alternative to pickle?

@CHDev93
Author

CHDev93 commented Jan 26, 2021

Thanks @StrikerRUS ! I did know pickle has some issues at the 4GB limit but thought I might be safe at 2GB. I will kick off a run now with joblib to see if that helps.

I'm not certain exactly how these serialisation libraries work, so hopefully they're not calling some of the object's methods during serialisation, which could lead to the string conversion issue again. Will comment here when I have some results, though.

@CHDev93
Author

CHDev93 commented Jan 27, 2021

Reran my minimum working example above with the replacement code below in the save portion (I also reduced the data size to 3e6 and the learning rate to 0.001, which just speeds up the cycle time but should keep the model size the same).

...
model_file = f"{train_start}_model.pkl"
print(f"Attempting to save model as {model_format}", flush=True)
with open(model_file, 'wb') as fout:
    joblib.dump(gbm, fout)
    
print(f"Attempting to load model from {model_format}", flush=True)
with open(model_file, 'rb') as fin:
    gbm = joblib.load(fin)
print(f"Successfully loaded model from {model_format}!", flush=True)

Boosting does again finish, saving works, and then loading the model causes the python process to crash:

[LightGBM] [Debug] Trained a tree with leaves = 5000 and max_depth = 25
[LightGBM] [Debug] Trained a tree with leaves = 5000 and max_depth = 25
[LightGBM] [Debug] Trained a tree with leaves = 5000 and max_depth = 23
[LightGBM] [Debug] Trained a tree with leaves = 5000 and max_depth = 25
Finished final boosting iteration
Training complete: 5838.44
Attempting to save model as joblib
Attempting to convert model with default buffer len 1048576
Creating string buffer for actual len (2147483648)
Converting model with actual buffer len (2147483648)
Converted model to string
Decode model string to utf-8
Attempting to load model from joblib

@StrikerRUS
Collaborator

StrikerRUS commented Jan 27, 2021

Ah OK, I see now.
A trained Booster internally converts itself to a string during serialization:

def __getstate__(self):
    this = self.__dict__.copy()
    handle = this['handle']
    this.pop('train_set', None)
    this.pop('valid_sets', None)
    if handle is not None:
        this["handle"] = self.model_to_string(num_iteration=-1)
    return this

So I'm afraid that without the workaround for #2265 implemented, it is not actually possible to save a huge trained model in binary format either.
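
In other words, any pickle-based serializer ends up going through the same string conversion. A minimal sketch of what happens (gbm is the trained Booster from the example above):

import pickle

# Pickling the Booster invokes the __getstate__ shown above, which serializes
# the whole model via model_to_string(); the binary container does not avoid
# the huge in-memory string, so it hits the same size limit as save_model.
payload = pickle.dumps(gbm)  # joblib.dump(gbm, ...) goes through the same path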

@StrikerRUS
Collaborator

StrikerRUS commented Jan 27, 2021

Hmmm, however, this issue is marked as closed.
https://bugs.python.org/issue16865

Python version: 3.6.6

Maybe you could try a newer Python version?

@CHDev93
Author

CHDev93 commented Jan 27, 2021

Updated to python 3.8.7 and ran one job with keep_training_booster=True and one with the default. The first one fails when trying to load the model after dumping it to disk with joblib. The second one fails after finishing boosting but before reaching the code I have to save the model.

I forgot to add back all the print statements I had put into the lightgbm python code, so I don't have any more details, but I'm fairly confident this is the same issue as with 3.6.6. The pickle file is also 2,097,153 KB (so very close to 2GB, as before).

@StrikerRUS
Collaborator

OK, got it! Thanks a lot for all the details! I'm going to link this issue to the feature request for supporting huge models so that these details will be available there.

@nhirschey

An additional data point: I had a similar issue that was fixed by setting keep_training_booster=True, except that python would crash with no error at all (whether at the terminal or in a jupyter kernel).

I could train in R and from the command line, but loading the model output by lightgbm.exe crashed python too, which led me to find this solution in the repo.

R could train the model, but if I tried to save the model for input into python (or load the model trained externally by lightgbm.exe), R crashed.

@CHDev93
Author

CHDev93 commented Feb 11, 2021

@StrikerRUS Although this issue is closed, I'll leave this here for reference in case there are plans for a fix.

The original issue was seen on Windows with 5000 leaves and 5000 boosting rounds being sufficient to observe the problem consistently on data of shape (3e6, 250). I reran the same experiment on a Linux machine with the CPU using both 5000 boosting rounds and 8000 boosting rounds. Both models produced an output text file over 2GB (which I never observed on Windows) and didn't produce any python crashes.

The larger of the two files was 3.7GB. I manually checked the tail of the file and found "Tree=7999", indicating the full model is contained without the truncation I was seeing previously.

All of this strongly suggests this is the same issue another user referenced in a previous comment.

@CHDev93
Author

CHDev93 commented Feb 16, 2021

One more thing to add: I can train a huge model on Linux (larger than 2GB), then load the model on Windows and do inference. I cross-referenced the predictions with the Linux ones on a few thousand random data points and the L1 norm of the error is 0, so I'm fairly confident the model loaded on Windows is not corrupt (I was worried it was silently loading only 2GB of trees). The model load function appears to use string streams as well, so I'm less sure about my previous hypothesis about the cause.
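
For anyone wanting to repeat the cross-platform check, it was roughly the following (a minimal sketch; the file names are placeholders, and preds_linux.npy stands for predictions saved from the Linux run on the same rows):

import numpy as np
import lightgbm as lgb

# Load the >2GB text model trained on Linux, predict on the same random rows,
# and compare against the predictions saved on Linux. Paths are placeholders.
x_check = np.load("x_check.npy")            # a few thousand random data points
preds_linux = np.load("preds_linux.npy")    # reference predictions from Linux

gbm = lgb.Booster(model_file="model.txt")
preds_windows = gbm.predict(x_check)

l1_error = np.abs(preds_windows - preds_linux).sum()
print(f"L1 norm of the prediction error: {l1_error}")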

@qwertyuu

Forgot to add back all the print statements I had put into the lightgbm python code so I don't have any more details but I'm fairly confident this is the same issue as with 3.6.6. The pickle file is also 2,097,153KB (so very close to 2GB as before).

I can attest to the same issue. I first used pickle to save my models on disk, then reverted to "save_model" into a text file.


Both are 2,097,153 KB on the Windows machine I'm using to train. This means that my model can never leave RAM without being corrupted past the 2GB file-size mark, which is frustrating. I might try running this on linux/docker at some point just to be able to finish the training, but this makes LGBM a poor choice for very large models.


@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2023
@jameslamb
Collaborator

Sorry, this was locked accidentally. Just unlocked it. We'd still love help with this feature!

@microsoft microsoft unlocked this conversation Aug 18, 2023