
[xgboost4j-spark] Incremental training #1859

Closed
fc1plusx opened this issue Dec 9, 2016 · 7 comments
fc1plusx commented Dec 9, 2016

In this Stack Overflow thread, http://stackoverflow.com/questions/38079853/how-can-i-implement-incremental-training-for-xgboost, I found some information about incremental training using the Python interface, but I was not able to find anything about such a feature in the Spark version. Is it available? Will it be in the future? Many thanks!

@xydrolase (Contributor)

Internally there is an update function in ml.dmlc.xgboost4j.scala.Booster, which allows you to perform updates iteration by iteration, but there is currently no wrapper for incremental updates in xgboost4j-spark.

Are you envisioning a Spark Streaming application? If you are concerned about memory, you can try setting useExternalMemory = true when you train with XGBoost on Spark.

fc1plusx commented Dec 9, 2016

No, a standard Spark application. For my experiments, though, it would be better to train the classifier incrementally on the new data that I receive from time to time. Any plans for this feature in the future? Thanks!

@xydrolase (Contributor)

From a technical point of view it is certainly doable, perhaps with an interface similar to the Python one. But could you elaborate a bit more on the use case?

So you have, say, 50 million samples today, and you train a model with 50 trees.
A week later you have another 50 million samples, and you want to update the model for another 50 iterations?


fc1plusx commented Dec 9, 2016

Exactly. To elaborate: I would like to update the previous model with the new data as it arrives. I think it is interesting to see how the model performs when trained incrementally versus re-trained on the full updated dataset (i.e. train on data1 and update with data2, versus train on data1 and then train again on data1+data2). Also, from a computational point of view, incremental training should be less expensive than re-training on the full updated dataset.

I'm not aware of the XGBoost internals, but if I understand correctly, the GBT algorithm is inherently incremental, while quite a few of XGBoost's optimisations require all the data, making it something of a hybrid incremental/batch classification and regression framework. Am I right? In that case, is incremental training still possible?


xydrolase commented Dec 9, 2016

As I mentioned earlier, internally every iteration is trained by the update() function, so training an entire model decomposes into a sequence of update() stages. You can therefore certainly break the training into multiple sessions.

If you train your model with subsample < 1.0, each tree is already trained on slightly different data, so I don't see how incremental training would be that much different.
Do keep an eye on the distribution of the features, though. If the distribution drifts between your incremental updates, it may not end up working very well, even though GBT is not sensitive to feature scaling.

It seems like something interesting and worth exploring, but I don't have the time to implement it in the short term.


fc1plusx commented Dec 9, 2016

Thanks a lot for your feedback!

I was aware of the feature-distribution problem, and I don't think it is something you can easily solve without allowing a complete reorganisation of the model. Nevertheless, this issue is shared by many incremental models, and I believe it is something you have to learn to live with.

I would be really interested in seeing a (perhaps preliminary) incremental training method in XGBoost on Spark (as you said, maybe similar to what's already available in the Python interface). In the meantime, I'll see what I can do on my side :-)

@CodingCat (Member)

Instead of providing the interface shown on Stack Overflow, perhaps #1670 offers a better solution?
