Error saving very large LightGBM models #3858

Closed
CHDev93 opened this issue Jan 26, 2021 · 17 comments
Comments

@CHDev93

CHDev93 commented Jan 26, 2021

How are you using LightGBM?

Python package

LightGBM component:

Environment info

Operating System: Windows 10

CPU/GPU model: GPU

C++ compiler version: NA

CMake version: NA

Java version: NA

Python version: 3.6.6

R version: NA

Other: NA

LightGBM version or commit hash: 3.1.0

Error message and / or logs

I'm observing errors when trying to train sufficiently large tree models (on either CPU or GPU). Namely, when max_leaves and num_boosting_rounds are sufficiently high, the boosting rounds all finish, but an error occurs when trying to serialise the model and deserialise it back.

To avoid the automatic to and from_string calls after the final boosting round, I've tried setting keep_training_booster=True and then saving the model out to disk, then reloading it. Saving the model as text or as pickle both succeed on save but then fail on model load.

I've investigated this issue and found that when writing out to a text file, the last tree written is "Tree=4348" even though I've requested more boosting rounds than this. When loading the model, there's obviously a mismatch between the number of elements in the "tree_sizes" attribute of the file (5000) and the actual number of trees in the file (4348), which causes an error.

I believe the underlying issue is the same as here: #2828
I also found this comment alluding to a 2GB limit of string streams, and my text file is almost exactly 2GB: #372 (comment)
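
A mismatch like this can be checked directly from the text dump. Below is a minimal sketch (it assumes the usual LightGBM text-model layout, with a space-separated tree_sizes= header line and one Tree=N header per tree; "model.txt" is a placeholder for the file written by save_model):

# Minimal sketch: compare the number of entries in the tree_sizes= header
# against the number of Tree= blocks actually written to the text dump.
# "model.txt" is a placeholder path for the file produced by save_model.
n_listed = None
n_written = 0
with open("model.txt", "r") as f:
    for line in f:
        if line.startswith("tree_sizes="):
            n_listed = len(line.split("=", 1)[1].split())
        elif line.startswith("Tree="):
            n_written += 1

print(f"trees listed in tree_sizes: {n_listed}")
print(f"trees actually in file:     {n_written}")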

I added some of my own logging inside the lightgbm python layer and have the following logs

....
[LightGBM] [Debug] Trained a tree with leaves = 5000 and max_depth = 38
[LightGBM] [Debug] Trained a tree with leaves = 5000 and max_depth = 43
[LightGBM] [Debug] Trained a tree with leaves = 5000 and max_depth = 46
CASEY: Finished final boosting iteration
Training complete: 21154.90s
Attempting to save model as pickle
Attempting to convert model with default buffer len 1048576
Creating string buffer for actual len (2147483648)
Converting model with actual buffer len (2147483648)
Converted model to string
Decode model string to utf-8
Successfully saved model as pickle
Attempting to load model from pickle
<program hangs and pulls up dialog box indicating python has stopped working>

Reproducible example(s)

Note the model must be really large to observe this error. This took almost 6 hours on a V100 GPU. If model size is not dependent on number of rows or columns, you might be able to use smaller numbers than I did and speed things up a little.
Before reaching enough boosting rounds for the model to crash, the performance of the model continues to increase, so there's reason to believe a model this big is really necessary.

import time

import numpy as np
import lightgbm as lgb

n = int(2e7)
m = 250
max_leaves = 5000
max_bin = 255
x_train = np.random.randn(n, m).astype(np.float32)
A = np.random.randint(-5, 5, size=(m, 1))
y_train = (x_train @ A).astype(np.float32)

print(f"x_train.shape = {x_train.shape}, y_train.shape = {y_train.shape}")
n_boosting_rounds = 5000
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'device': 'gpu',
    'metric': {'rmse'},
    'num_leaves': max_leaves,
    'bagging_fraction': 0.5,
    'feature_fraction': 0.5,
    'learning_rate': 0.01,
    'verbose': 2,
    'max_bin': max_bin,
}
ds_train = lgb.Dataset(x_train, y_train.ravel(), free_raw_data=False).construct()
start = time.perf_counter()
gbm = lgb.train(
    params,
    ds_train,
    num_boost_round=n_boosting_rounds,
    keep_training_booster=True, # set this to False and the code will crash here
)

print(f"Training complete: {elapsed_train_time:.2f}")

model_file = f"{train_start}_model.txt"
print(f"Attempting to save model as {model_format}", flush=True)
gbm.save_model(model_file)
print(f"Successfully saved model as {model_format}!", flush=True)

print(f"Attempting to load model from {model_format}", flush=True)
gbm = lgb.Booster(model_file=model_file) # program dies here!!
print(f"Successfully loaded model from {model_format}", flush=True)

Steps to reproduce

  1. Generate some fake linear data
  2. Train a gbdt with sufficient boosting_rounds and max_leaves to cause an error (note the boosting is fine, it's the clean up bit after boosting that's problematic)
@StrikerRUS
Collaborator

Hi @CHDev93 !

We already have a feature request to support extremely large models: #2265.
Are the errors you observe the same as the ones described there?

@CHDev93
Author

CHDev93 commented Jan 26, 2021

Thanks for the quick follow-up @StrikerRUS . I don't know how I missed this issue when looking through tickets over the past couple of days. Yes, this problem looks very similar, though I don't get an error as helpful as the one shown in that ticket (or any error at all, actually).

Setting keep_training_booster=True as you mention in that issue does allow the lgb.train call to succeed, but then I still have no way to persist the model. Basically, if I set this flag, how can I put the model on disk such that I can load it back and make predictions? This comment seems to imply saving and loading should work but in practice I'm finding the file is corrupt #2265 (comment)

@StrikerRUS
Collaborator

From your logs:

...
Attempting to convert model with default buffer len 1048576
Creating string buffer for actual len (2147483648)
Converting model with actual buffer len (2147483648)
...

2147483648 is exactly INT_MAX + 1. I don't think this is a coincidence...
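
A quick check of the arithmetic (a two-line sketch):

# 2147483648 is 2**31, i.e. one past the largest 32-bit signed int (INT_MAX = 2147483647)
assert 2_147_483_648 == 2**31 == 2_147_483_647 + 1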

This comment seems to imply saving and loading should work but in practice I'm finding the file is corrupt #2265 (comment)

Unfortunately, it is only a potential fix. It is not implemented yet. This feature request is still open.
#2302


@CHDev93
Author

CHDev93 commented Jan 26, 2021

Okay, I was just checking whether there was a workaround involving keep_training_booster combined with some method of getting the model to disk. If I understand correctly, there's no way around it at the current time.

Thanks for your help on this. Feel free to close this issue

@StrikerRUS
Collaborator

Setting keep_training_booster=True is a workaround to complete training successfully.

I'm quite surprised that

Saving the model as text or as pickle both succeed on save but then fail on model load.

for the part about pickle. Maybe you hit the following pickle issue or something similar? Could you please try joblib or some other alternative to pickle?

@CHDev93
Author

CHDev93 commented Jan 26, 2021

Thanks @StrikerRUS ! I did know pickle has some issues at the 4GB limit but thought I might be safe at 2GB. I will kick off a run now with joblib to see if that helps.

I'm not certain exactly how these serialisation libraries work, so hopefully they're not calling some of the object's methods during serialisation, which could lead to the string conversion issue again. Will comment here when I have some results, though.

@CHDev93
Author

CHDev93 commented Jan 27, 2021

Reran my minimum working example above with the replacement code below in the save portion (I also reduced the data size to 3e6 and the learning rate to 0.001, which just speeds up the cycle time but should keep the model size the same).

...
model_file = f"{train_start}_model.pkl"
print(f"Attempting to save model as {model_format}", flush=True)
with open(model_file, 'wb') as fout:
    joblib.dump(gbm, fout)
    
print(f"Attempting to load model from {model_format}", flush=True)
with open(model_file, 'rb') as fin:
    gbm = joblib.load(fin)
print(f"Successfully loaded model from {model_format}!", flush=True)

Boosting does again finish, saving works, and then loading the model causes the python process to crash:

[LightGBM] [Debug] Trained a tree with leaves = 5000 and max_depth = 25
[LightGBM] [Debug] Trained a tree with leaves = 5000 and max_depth = 25
[LightGBM] [Debug] Trained a tree with leaves = 5000 and max_depth = 23
[LightGBM] [Debug] Trained a tree with leaves = 5000 and max_depth = 25
Finished final boosting iteration
Training complete: 5838.44
Attempting to save model as joblib
Attempting to convert model with default buffer len 1048576
Creating string buffer for actual len (2147483648)
Converting model with actual buffer len (2147483648)
Converted model to string
Decode model string to utf-8
Attempting to load model from joblib

@StrikerRUS
Collaborator

StrikerRUS commented Jan 27, 2021

Ah OK, I see now.
A trained Booster internally converts itself to a string during serialization:

def __getstate__(self):
    this = self.__dict__.copy()
    handle = this['handle']
    this.pop('train_set', None)
    this.pop('valid_sets', None)
    if handle is not None:
        this["handle"] = self.model_to_string(num_iteration=-1)
    return this

So I'm afraid that without the workaround for #2265 implemented, it is not actually possible to save a huge trained model in binary format either.
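
In other words, any pickle-based serializer ends up going through the same string conversion. A minimal sketch of what happens (gbm is the trained Booster from the example above):

import pickle

# Pickling the Booster invokes the __getstate__ shown above, which serializes
# the whole model via model_to_string(); the binary container does not avoid
# the huge in-memory string, so it hits the same size limit as save_model.
payload = pickle.dumps(gbm)  # joblib.dump(gbm, ...) goes through the same path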

@StrikerRUS
Collaborator

StrikerRUS commented Jan 27, 2021

Hmmm, however, this issue is marked as closed.
https://bugs.python.org/issue16865

Python version: 3.6.6

Maybe you could try a newer Python version?

@CHDev93
Author

CHDev93 commented Jan 27, 2021

Updated to python 3.8.7 and ran one job with keep_training_booster=True and one with the default. The first one fails when trying to load the model after dumping it to disk with joblib. The second one fails after finishing boosting but before reaching the code I have to save the model.

I forgot to add back all the print statements I had put into the lightgbm python code, so I don't have any more details, but I'm fairly confident this is the same issue as with 3.6.6. The pickle file is also 2,097,153 KB (so very close to 2GB, as before).

@StrikerRUS
Collaborator

OK, got it! Thanks a lot for all the details! I'm going to link this issue to the feature request for supporting huge models so that these details will be available there.

@nhirschey

An additional data point: I had a similar issue that was fixed by setting keep_training_booster=True, except that python would crash with no error at all (whether at the terminal or in a jupyter kernel).

I could train in R and from the command line, but loading the model output by lightgbm.exe crashed python too, which led me to find this solution in the repo.

R could train the model, but if I tried to save the model for input into python (or load the model trained externally by lightgbm.exe), R crashed.

@CHDev93
Author

CHDev93 commented Feb 11, 2021

@StrikerRUS Although this issue is closed, I'll leave this here for reference in case there are plans for a fix.

The original issue was seen on Windows with 5000 leaves and 5000 boosting rounds being sufficient to observe the problem consistently on data of shape (3e6, 250). I reran the same experiment on a Linux machine with the CPU using both 5000 boosting rounds and 8000 boosting rounds. Both models produced an output text file over 2GB (which I never observed on Windows) and didn't produce any python crashes.

The larger of the two files was 3.7GB. I manually checked the tail of the file and found "Tree=7999", indicating the full model is contained without the truncation I was seeing previously.

All of this strongly suggests this is the same issue another user referenced in a previous comment.

@CHDev93
Author

CHDev93 commented Feb 16, 2021

One more thing to add: I can train a huge model on Linux (larger than 2GB), then load the model on Windows and do inference. I cross-referenced the predictions with the Linux ones on a few thousand random data points and the L1 norm of the error is 0, so I'm fairly confident the model loaded on Windows is not corrupt (I was worried it was silently loading only 2GB of trees). The model load function appears to use string streams as well, so I'm less sure about my previous hypothesis about the cause.
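
For anyone wanting to repeat the cross-platform check, it was roughly the following (a minimal sketch; the file names are placeholders, and preds_linux.npy stands for predictions saved from the Linux run on the same rows):

import numpy as np
import lightgbm as lgb

# Load the >2GB text model trained on Linux, predict on the same random rows,
# and compare against the predictions saved on Linux. Paths are placeholders.
x_check = np.load("x_check.npy")            # a few thousand random data points
preds_linux = np.load("preds_linux.npy")    # reference predictions from Linux

gbm = lgb.Booster(model_file="model.txt")
preds_windows = gbm.predict(x_check)

l1_error = np.abs(preds_windows - preds_linux).sum()
print(f"L1 norm of the prediction error: {l1_error}")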

@qwertyuu

Forgot to add back all the print statements I had put into the lightgbm python code so I don't have any more details but I'm fairly confident this is the same issue as with 3.6.6. The pickle file is also 2,097,153KB (so very close to 2GB as before).

I can attest to the same issue. I first used pickle to save my models on disk, then reverted to "save_model" into a text file.


Both are 2,097,153 KB on the Windows machine I'm using to train. This means that my model can never leave RAM without being corrupted past the 2GB file-size mark, which is frustrating. I might try running this on linux/docker at some point just to be able to finish the training, but this makes LGBM a poor choice for very large models.


@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2023
@jameslamb
Collaborator

Sorry, this was locked accidentally. Just unlocked it. We'd still love help with this feature!

@microsoft microsoft unlocked this conversation Aug 18, 2023