[CORE] The update process for a tree model, and its application to feature importance #1670

khotilov · 2016-10-16T00:21:57Z

The main part of this PR adds the infrastructure to "update" an existing tree model. It is a simple modification of the GBTree booster that introduces a process_type parameter, which switches between the default full boosting process (which grows new trees and updates them) and the update process for an existing model (which passes the model through a desired set of updater modules using some specific data). I've made this parameter an enum rather than a bool, to keep the possibility of other process types open. Overall, it is still the same booster, which basically allows for a different starting point, so I don't think it would make sense to use inheritance to create a separate booster. The process switch should work seamlessly for DART as well.

There could be various applications. E.g., it could be useful to adapt an existing model to a dataset that is somewhat different than the original training data, while still keeping the tree structure mostly the same. This would save time by not needing to rebuild all the trees, and it offers some elements of transfer learning.
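As a rough illustration of this adaptation use case, here is a minimal sketch that refreshes an existing model on a second dataset while keeping its split structure. It assumes the refresh updater with refresh_leaf=TRUE is the updater of choice, and it uses the agaricus data bundled with the R package as a stand-in for the "original" and the "somewhat different" data. The names d_old, d_new, uparam and ubst are made up for the illustration.

```r
library(xgboost)

# Toy data shipped with the xgboost R package; the two halves stand in for
# the original training data and the new, somewhat different data.
data(agaricus.train, package = "xgboost")
data(agaricus.test, package = "xgboost")
d_old <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
d_new <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)

param <- list(max_depth = 2, eta = 0.3, nthread = 1, base_score = 0.5,
              objective = "binary:logistic")
nrounds <- 10

# Default process: grow new trees on the original data.
bst <- xgb.train(param, d_old, nrounds = nrounds)

# Update process: pass the existing trees through the refresh updater,
# re-estimating node stats and leaf values on the new data.
# nrounds is set to the number of trees in the existing model.
uparam <- modifyList(param, list(process_type = "update",
                                 updater = "refresh",
                                 refresh_leaf = TRUE))
ubst <- xgb.train(uparam, d_new, nrounds = nrounds, xgb_model = bst)

# Compare predictions of the original and the refreshed model on the new data.
p_old <- predict(bst, d_new)
p_upd <- predict(ubst, d_new)
summary(p_upd - p_old)
```

How much the predictions move depends on how different the new data is. No new trees are grown here, so the cost is a pass of the data through the existing trees.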

Another useful application is understanding the out-of-sample feature importance and the local feature importance of a model. So far, the feature importance ranking has been calculated from the loss gains learned within the training data, which could carry significant overfitting and may unfairly inflate some importances. Updating the model trees' stats in a hold-out sample would give a fairer importance ranking. Also, the current feature importance is a global importance, computed over the whole training sample. But after building a model in heterogeneous data, I frequently want to see which sets of features are the most important in certain subsets of the data for this specific model (i.e., without creating a new model in each subset from scratch). E.g., what drives this model's predictions at the upper end of the regression outcome? Or what factors are the most influential on predictions within some cluster? A quick update of the model's stats, by passing the data from a specific subset down the trees, makes it possible to estimate the local importance in this data using re-calculated gains. An example is given below.

I have also modified the refresh updater by adding an option to not update the leaf values. This way we can update only the tree stats (for the importance estimation and some other sorts of tree-introspection analysis) while keeping the splits and leaf values intact, which results in the same predictions as from the original model. One current limitation (or feature, as there are pros and cons to it) to keep in mind is that the refresh updater does not support random instance subsampling, so a model that initially used subsampling for its training would not get updated in a similarly random manner. One case where no subsampling could be beneficial is a stats update within the same training sample: the gain of each split would be updated using all the data rather than the subsamples that were used during training, thus resulting in less "overfitted" importances.

Some more work would be needed to fully complete this functionality, but I'm putting it up, hoping to get some feedback.

library(xgboost)
library(data.table)
library(mlbench)

# predicting the outcome of a diabetes test
data(PimaIndiansDiabetes2)
dt <- PimaIndiansDiabetes2
str(dt)
setDT(dt)
fnames <- colnames(dt)[-9]

set.seed(1)
tr <- sample.int(nrow(dt), 0.7*nrow(dt))
dtrain <- xgb.DMatrix(as.matrix(dt[tr, -9, with = FALSE]), label = as.numeric(dt$diabetes[tr]) - 1)
dtest <- xgb.DMatrix(as.matrix(dt[-tr, -9, with = FALSE]), label = as.numeric(dt$diabetes[-tr]) - 1)
wl <- list(train = dtrain, test = dtest)

param <- list(max_depth = 2, eta = 0.05, nthread = 2, subsample = 0.5, min_child_weight = 5, 
              objective = "binary:logistic", eval_metric = "auc",
              base_score = mean(getinfo(dtrain,"label")))

bst <- xgb.train(param, dtrain, 50, wl)
# some significant overfitting is happening...

# Refresh the model within the same training data (without pruning)
rparam <- modifyList(param, list(process_type='update', updater='refresh', refresh_leaf=FALSE))
rbst <- xgb.train(rparam, dtrain, nrounds = bst$niter, watchlist = wl, xgb_model = bst)
# Note how the AUCs are still the same.
# The feature importances are now less affected by the overfitted gains from subsampling:
xgb.importance(fnames, rbst)
# compare to the original model:
xgb.importance(fnames, bst)
# The splits and leaf values remain the same, only the split gains and cover values have changed:
xgb.plot.tree(fnames, rbst, n_first_tree = 5)
xgb.plot.tree(fnames, bst, n_first_tree = 5)

# Also, can do the same but against the test data:
tbst <- xgb.train(rparam, dtest, nrounds = bst$niter, watchlist = wl, xgb_model = bst)
xgb.importance(fnames, tbst)

# And, say, we want to see how the feature importances change within the BMI>30 cohort:
dtrain <- xgb.DMatrix(as.matrix(dt[tr, -9, with = FALSE][mass > 30]),
                      label = as.numeric(dt[tr][mass > 30]$diabetes) - 1)
dtest <- xgb.DMatrix(as.matrix(dt[-tr, -9, with = FALSE][mass > 30]),
                     label = as.numeric(dt[-tr][mass > 30]$diabetes) - 1)
wl <- list(train = dtrain, test = dtest)
mbst <- xgb.train(rparam, dtrain, nrounds = bst$niter, watchlist = wl, xgb_model = bst)
# The role of glucose test in this cohort is significantly higher:
xgb.importance(fnames, mbst)
# We can also observe how the 'mass' splits became non-splits in many trees (with Gain==0):
xgb.plot.tree(fnames, mbst, n_first_tree = 5)
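
The claim that a stats-only refresh leaves predictions intact can also be checked directly rather than through the AUCs. The following is a small self-contained sketch of such a check, assuming the agaricus data bundled with the R package in place of the diabetes data. The names dtr, dte, max_diff and gains are illustrative only.

```r
library(xgboost)

data(agaricus.train, package = "xgboost")
data(agaricus.test, package = "xgboost")
dtr <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dte <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)

param <- list(max_depth = 2, eta = 0.3, nthread = 1, subsample = 0.5,
              base_score = 0.5, objective = "binary:logistic")
nrounds <- 10
bst <- xgb.train(param, dtr, nrounds = nrounds)

# Stats-only refresh against the hold-out data: leaf values are kept as-is.
rparam <- modifyList(param, list(process_type = "update",
                                 updater = "refresh",
                                 refresh_leaf = FALSE))
rbst <- xgb.train(rparam, dte, nrounds = nrounds, xgb_model = bst)

# Largest absolute difference between predictions of the two models.
max_diff <- max(abs(predict(bst, dte) - predict(rbst, dte)))
print(max_diff)

# Side-by-side gains: learned in training vs. re-calculated on the hold-out data.
imp_train <- as.data.frame(xgb.importance(model = bst))[, c("Feature", "Gain")]
imp_hold <- as.data.frame(xgb.importance(model = rbst))[, c("Feature", "Gain")]
gains <- merge(imp_train, imp_hold, by = "Feature", all = TRUE,
               suffixes = c("_train", "_holdout"))
print(gains)
```

Since only the split stats are rewritten, max_diff should be zero up to floating-point noise, while the two Gain columns are free to differ.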

tqchen · 2016-10-17T20:20:54Z

I think being able to "update" is pretty cool. We need to be careful about the prediction cache in the existing GBM, since it may no longer be valid when we refresh a leaf.

khotilov · 2016-11-14T05:17:57Z

@tqchen : good point. I've moved the trees initialization for the update into Configure, which should be a better place for it, I think.

And I've added some tests and documentation.

khotilov · 2016-11-29T17:12:15Z

@tqchen: what is needed to wrap this one up? Do you want me to change the Python interface as well? Do you want some unrelated documentation changes to be in a separate PR?

tqchen · 2016-11-29T17:48:27Z

The current logic looks good to me, though it does not cover the general case of refreshing, which could cycle through the trees. Nevertheless, we could merge this in first, as a first step.

Here is a short checklist before merging:

Let us first make sure the Travis build passes; I think rebasing against the latest master will do.
Add an enum type for the process type, instead of using 1 and 0 to indicate it, which will make the code more readable:

enum TreeProcessType {
   kDefault,
   kUpdate
};

khotilov · 2016-11-30T06:56:04Z

Thanks, I've added a TreeProcessType enum.

There are many imaginable ways that one can update/refresh/modify/torture the trees, also using various samples of data. It could be interesting to design a robust set of essential building blocks for such exercises (as in your "unix philosophy" approach). The existing modular plugin system for updaters is already a good start. My addition, however, so far mostly addresses certain practical needs.

tqchen · 2016-12-01T18:18:21Z

Please fix the lint error: https://travis-ci.org/dmlc/xgboost/jobs/180001971

khotilov · 2016-12-04T02:14:36Z

@tqchen : Travis checks were finally done and were passing (why is it taking over a day now?). And I've rebased this PR one more time.

AbhishekSinghVerma · 2017-02-09T14:39:46Z

What is the date or R package version of the planned release that will contain this commit? In other words, when can I expect this change to be available in the R package on CRAN?

khotilov · 2017-02-10T07:30:57Z

@AbhishekSinghVerma The current CRAN release does contain this PR.

khotilov force-pushed the update_process branch from 68ccfe9 to 20b6f55 Compare November 14, 2016 05:14

khotilov force-pushed the update_process branch from 5331976 to 8f466c3 Compare November 30, 2016 06:48

khotilov force-pushed the update_process branch from 8f466c3 to bec476c Compare December 2, 2016 05:39

khotilov added 11 commits December 3, 2016 20:13

[CORE] allow updating trees in an existing model

1260032

[CORE] in refresh updater, allow keeping old leaf values and update s…

902545d

…tats only

[R-package] xgb.train mod to allow updating trees in an existing model

ada8b35

[R-package] added check for nrounds when is_update

b69e347

[CORE] merge parameter declaration changes; unify their code style

c372208

[CORE] move the update-process trees initialization to Configure; ren…

56cd066

…ame default process_type to 'default'; fix the trees and trees_to_update sizes comparison check

[R-package] unit tests for the update process type

8d346fe

[DOC] documentation for process_type parameter; improved docs for upd…

f8b6fe5

…ater, Gamma and Tweedie; added some parameter aliases; metrics indentation and some were non-documented

fix my sloppy merge conflict resolutions

e87a59b

[CORE] add a TreeProcessType enum

f8e4942

whitespace fix

1c9a174

khotilov force-pushed the update_process branch from bec476c to 1c9a174 Compare December 4, 2016 02:13

tqchen merged commit a44032d into dmlc:master Dec 4, 2016

khotilov mentioned this pull request Dec 6, 2016

Continue training from an existing model in R #1843

Closed

This was referenced Dec 10, 2016

[xgboost4j-spark] Incremental training #1859

Closed

[jvm-packages] provide interface for latest model update functionality #1861

Closed

khotilov mentioned this pull request Feb 21, 2017

Add prediction of feature contributions #2003

Merged

snawara mentioned this pull request Jul 25, 2017

Continue training model more than once (R) #2545

Closed

khotilov mentioned this pull request Feb 6, 2018

Variable importance on a validation set, by tree #3088

Closed

khotilov mentioned this pull request Mar 16, 2018

Documentation not clear #3175

Closed

lock bot locked as resolved and limited conversation to collaborators Jan 19, 2019
