
[xgboost4j-spark] Incremental training #1859

Closed
fc1plusx opened this issue Dec 9, 2016 · 7 comments
fc1plusx commented Dec 9, 2016

In this Stack Overflow thread, http://stackoverflow.com/questions/38079853/how-can-i-implement-incremental-training-for-xgboost, I found some information about incremental training using the Python interface, but I was not able to find anything about such a feature in the Spark version. Is it available? Will it be in the future? Many thanks!

@xydrolase (Contributor)

Internally there is an update function in ml.dmlc.xgboost4j.scala.Booster, which allows you to perform updates iteration by iteration, but there is currently no wrapper for incremental updates in xgboost4j-spark.

Are you envisioning a Spark Streaming application? If you are concerned about memory, you can try setting useExternalMemory = true when you train with XGBoost on Spark.

fc1plusx commented Dec 9, 2016

No, a standard Spark application. For my experiments, though, it would be better to train the classifier incrementally on the new data that I receive from time to time. Any plans for this feature in the future? Thanks!

@xydrolase (Contributor)

From a technical point of view it is certainly doable, perhaps with an interface similar to the Python one. But could you elaborate a bit more on the use case?

So you have, say, 50 million samples today, and you train a model with 50 trees.
A week later you have another 50 million samples, and you want to update the model for another 50 iterations?


fc1plusx commented Dec 9, 2016

Exactly. To elaborate: I would like to update the previous model with the new data as it arrives. I think it is interesting to see how the model performs when trained incrementally versus re-trained on the full updated dataset (i.e. train on data1 and update with data2, versus train on data1 and then train again on data1+data2). Also, from a computational point of view, incremental training should be less expensive than re-training on the full updated dataset.

I'm not aware of the XGBoost internals, but if I understand correctly, the GBT algorithm is inherently incremental, while quite a few of XGBoost's optimisations require all the data, making it something of a hybrid incremental/batch classification and regression framework. Am I right? In that case, is incremental training still possible?


xydrolase commented Dec 9, 2016

As I mentioned earlier, internally every iteration is trained by the update() function, so training an entire model decomposes into a sequence of update() stages. You can therefore certainly break the training into multiple sessions.

If you train your model with subsample < 1.0, each tree is already trained on slightly different data, so I don't see how incremental training would be that much different.
Do keep an eye on the distribution of the features, though. If the distribution drifts between your incremental updates, it may not end up working very well, even though GBT is not sensitive to feature scaling.

It seems like something interesting and worth exploring, but I don't have the time to implement it in the short term.


fc1plusx commented Dec 9, 2016

Thanks a lot for your feedback!

I was aware of the feature-distribution problem, and I don't think it is something you can easily solve without allowing a complete reorganisation of the model. Nevertheless, this issue is shared by many incremental models, and I believe it is something you have to learn to live with.

I would be really interested in seeing a (perhaps preliminary) incremental training method in XGBoost on Spark (as you said, maybe similar to what's already available in the Python interface). In the meantime, I'll see what I can do on my side :-)

@CodingCat (Member)

Instead of providing the interface shown on Stack Overflow, perhaps #1670 offers a better solution?
