[xgboost4j-spark] Incremental training #1859
Internally, training already proceeds round by round. Are you envisioning a Spark-streaming application? If you're concerned about memory, you can try training with external memory.
No, a standard Spark application. But for my experimentation it would be better to train the classifier incrementally on the new data that I get from time to time. Any plans for this feature in the future? Thanks!
From a technical point of view, it is certainly doable, perhaps with an interface similar to the Python one. But could you elaborate a bit more on the use case? So you have, say, 50 million samples today and train a model with 50 trees, and then you want to update that model as new data arrives?
Exactly. To elaborate: I would like to update the previous model with the new data I get. I think it is interesting to see how the model performs when trained incrementally vs. re-trained on the full updated dataset (i.e. train with data1 and update with data2, vs. train with data1 and then train again from scratch with data1 + data2). Also, from a computational point of view, incremental training should be less expensive than re-training on the updated dataset. I'm not aware of XGBoost's internal details, but if I understand correctly the GBT algorithm is inherently incremental, yet there are quite a lot of optimisations in XGBoost that require all the data, making it somewhat of a hybrid incremental/batch classification and regression framework. Am I right? In such a case, is incremental training still possible?
As I mentioned earlier, internally every iteration already trains one tree on top of the existing ensemble. If you train your model with one dataset and later continue training with new data, the additional trees are fitted only to the new batch, whose feature distribution may differ from the original training set. It seems interesting and worth exploring, but I don't have the time to implement it in the short term.
Thanks a lot for your feedback! I was aware of the feature-distribution problem, and I don't think it is something you can easily solve without allowing a complete reorganisation of the model. Nevertheless, this issue is shared by many incremental models, and I believe it is something you somehow need to learn to live with. I would be really interested in seeing a (perhaps preliminary) incremental training method in XGBoost Spark (as you said, maybe similar to what's already available in the Python interface). In the meanwhile, I'll see what I can do on my side :-)
Instead of providing the interface shown on Stack Overflow, wouldn't #1670 be a better solution?
In this SO thread, http://stackoverflow.com/questions/38079853/how-can-i-implement-incremental-training-for-xgboost, I could find some information about incremental training using the Python interface, but I was not able to find anything about such a feature in the Spark version. Is it available? Will it be in the future? Many thanks!