Read the XGBoost paper (https://arxiv.org/pdf/1603.02754.pdf) and summarize it in my own words, with some analysis of the "Selection of loss function" part.
1. What XGBoost can solve:
XGBoost can solve binary classification, multi-class classification, and regression problems.
E.g. a binary classification case:
Input and output:
Each person is a sample; the number of samples n is 5.
Each person has m features, e.g. age and gender.
The output y is 0 (does not like playing games) or 1 (likes playing games).
Model training: generating the trees and leaf weights from the training set.
Each tree starts by choosing one of the m features to split on. After several splits, T leaves are formed; each leaf is the set of samples routed to it by the feature splits, and all samples in leaf j share the leaf weight w_j.
After one tree is fully grown, the next tree is grown (trees are added one at a time).
Model application:
Feed a sample with its m features into the generated trees, find the leaf it falls into in each tree, and add up all the leaf weights w_j to get the raw prediction y_i. For binary classification, y_i is passed through the sigmoid function and compared with a threshold to output 0 or 1; for multi-class classification, the scores are passed through softmax and the class with the highest probability is output.
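As a minimal sketch of this flow (the tiny age/gender dataset and all parameter values below are made up for illustration, not taken from the paper):

import numpy as np
import xgboost as xgb

# 5 samples, 2 features: [age, is_male]; label 1 = likes playing games, 0 = does not
X = np.array([[13, 1], [14, 0], [35, 1], [60, 0], [25, 1]], dtype=float)
y = np.array([1, 1, 0, 0, 1])

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "max_depth": 2, "eta": 0.3}
model = xgb.train(params, dtrain, num_boost_round=10)

# Each prediction is the sum of the leaf weights w_j over all trees, passed
# through the sigmoid, so predict() returns a probability in (0, 1).
prob = model.predict(xgb.DMatrix(X))
pred = (prob > 0.5).astype(int)  # threshold at 0.5 to get the 0/1 label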
Feature types XGBoost can handle:
Input features may be continuous, binary, nominal (more than two unordered values), or ordinal (e.g. small/medium/large clothing sizes).
2. Model training 1: tree splitting
Reading the paper (https://arxiv.org/pdf/1603.02754.pdf) directly, it can be confusing why Section 2.2 seems to give both the tree structure and the leaf weights at the same time. In fact the tree structure is obtained by the split-finding algorithms in Section 3, and only then does the formula derived in Section 2.2 give the leaf weights.
Example of a traditional decision tree:
In CART, the Gini index is used to choose splits. The Gini index measures impurity: a value of 0 means the node is pure. It measures how mixed the samples are after a split, and the split with the smallest impurity is chosen. CART uses binary splits to avoid the zero impurity that too many branches would cause.
Decision Tree:https://blog.csdn.net/bjjoy2009/article/details/80841657
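For reference, a minimal sketch of the Gini impurity that CART uses (the node labels below are a made-up example):

import numpy as np

def gini(labels):
    # Gini impurity of a node: 1 - sum(p_k^2) over the class proportions p_k.
    # 0 means the node is pure; 0.5 is the most mixed a binary node can be.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 0, 0]))  # 0.0, pure node
print(gini([0, 0, 1, 1]))  # 0.5, maximally mixed binary node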
The XGBoost tree model:
Gradient information is introduced into the tree-building optimization. The split criterion is no longer the Gini index; instead it uses g_i and h_i of each sample, the first and second derivatives of the loss function with respect to the prediction made by the previously built trees, which depend only on that prediction and the actual value. They are derived in section 3 below and are plugged into the construction of the current tree.
Exact greedy algorithm: traverse all m features, traverse the possible split points of each feature, and choose the split with the largest gain. The score (gain) is the objective before the split minus the objective of the left and right children after the split, minus the penalty γ for the extra leaf. If the score is negative the node is not split; if it is positive the node is split. A larger γ makes nodes harder to split, a smaller γ makes them easier to split.
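A small sketch of this gain calculation, where G_L, H_L and G_R, H_R are the sums of g_i and h_i over the samples going left and right, and lam, gamma are the regularization parameters (the values in the example call are illustrative):

def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    # gain = 1/2 * [G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma
    # If the gain is <= 0 the node is left unsplit; a larger gamma makes splitting harder.
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R)) - gamma

print(split_gain(G_L=2.0, H_L=3.0, G_R=-4.0, H_R=5.0, lam=1.0, gamma=0.1))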
Approximate algorithm: propose candidate split points, e.g. tertiles: after sorting the feature values, take the 1/3 and 2/3 quantile points as candidate splits. See Section 3.2 of the paper and the weighted quantile sketch.
Global: candidate split points are proposed once before learning each tree and reused at every level of that tree; more candidate points are needed.
Local: candidate split points are re-proposed before each split, which requires more proposal steps but fewer candidates per proposal.
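A minimal sketch of proposing candidate split points for one feature by plain quantiles (the paper's weighted quantile sketch in Section 3.3 actually uses h_i as the weights, which this simplification ignores):

import numpy as np

def candidate_splits(feature_values, n_bins=3):
    # e.g. n_bins=3 takes the 1/3 and 2/3 quantile points as candidate splits
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]
    return np.quantile(feature_values, qs)

print(candidate_splits(np.array([13.0, 14.0, 25.0, 35.0, 60.0]), n_bins=3))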
3. Model training 2: obtaining the sample gradients and leaf weights from the loss function
For each sample, the prediction is the sum of the leaf weights it receives from each tree: ŷ_i = f_1(x_i) + f_2(x_i) + ... + f_K(x_i), where f_k(x_i) is the leaf weight the k-th tree assigns to sample i.
The objective when adding tree t is obj^(t) = sum_i l(y_i, ŷ_i^(t-1) + f_t(x_i)) + Ω(f_t). The loss l is not fixed (e.g. logloss for classification, squared error for regression), which is why both regression and classification can be handled; the choices are summarized later.
The regularization term has two parts, Ω(f) = γT + (1/2) λ sum_j w_j^2: a penalty on the number of leaves T plus the sum of squares of all leaf weights, to prevent overfitting.
After the second-order Taylor expansion (and dropping constant terms), obj^(t) ≈ sum_i [ g_i f_t(x_i) + (1/2) h_i f_t(x_i)^2 ] + Ω(f_t) = sum_j [ G_j w_j + (1/2)(H_j + λ) w_j^2 ] + γT, where G_j and H_j are the sums of g_i and h_i over the samples falling in leaf j. Setting the derivative with respect to w_j to zero gives the leaf weight w_j* = -G_j / (H_j + λ) and the objective obj* = -(1/2) sum_j G_j^2 / (H_j + λ) + γT.
It should be noted that when this formula is used, the tree has already been split: the number of leaves is fixed and the set of samples in each leaf j is fixed, so the formula is only used to compute the leaf weights.
It can be seen that the objective of the current tree t depends only on the G and H of its leaves, which is exactly what the split score in section 2 ("Model training 1: tree splitting") uses; a short sketch follows.
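A short sketch of these formulas for a fixed tree structure, where G and H hold the per-leaf sums G_j and H_j (values in the example calls are illustrative):

def optimal_leaf_weight(G_j, H_j, lam=1.0):
    # w_j* = -G_j / (H_j + lam)
    return -G_j / (H_j + lam)

def structure_score(G, H, lam=1.0, gamma=0.0):
    # obj* = -1/2 * sum_j G_j^2 / (H_j + lam) + gamma * T, with T = len(G) leaves
    return -0.5 * sum(g * g / (h + lam) for g, h in zip(G, H)) + gamma * len(G)

print(optimal_leaf_weight(2.0, 3.0))             # weight of one leaf
print(structure_score([2.0, -4.0], [3.0, 5.0]))  # quality of the whole structure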
4. Selection of loss function: 'objective':
The process is as follows: sum the leaf weights w_j that sample i receives from each tree, take the resulting score as the raw prediction, map it through the objective's link function (e.g. sigmoid), and evaluate it with the evaluation metric, e.g. logloss.
a. Binary classifier: binary:logistic
objective function: the sigmoid p_i = 1 / (1 + e^(-ŷ_i)), which maps the raw score ŷ_i to a probability
eval_metric: logloss, L = -(1/N) sum_i [ y_i ln p_i + (1 - y_i) ln(1 - p_i) ]
Substituting the sigmoid into the logloss and differentiating with respect to the raw score:
first derivative: g_i = p_i - y_i
second derivative: h_i = p_i (1 - p_i)
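A sketch of these two derivatives as a custom objective for xgb.train (equivalent in spirit to the built-in binary:logistic; preds are the raw scores before the sigmoid):

import numpy as np
import xgboost as xgb

def logistic_obj(preds, dtrain):
    labels = dtrain.get_label()
    p = 1.0 / (1.0 + np.exp(-preds))   # sigmoid of the raw score
    grad = p - labels                  # first derivative  g_i = p_i - y_i
    hess = p * (1.0 - p)               # second derivative h_i = p_i * (1 - p_i)
    return grad, hess

# Usage: model = xgb.train(params, dtrain, num_boost_round=10, obj=logistic_obj)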
Classifier output: https://blog.csdn.net/u014033218/article/details/90516849
binary:logistic and 'objective': 'reg:logistic' give the same output: the predicted probability.
binary:logitraw outputs the raw score (margin), which matches the two above after applying the sigmoid() function.
In XGBClassifier, the predicted probability is thresholded at 0.5: values below 0.5 become class 0 and values above become class 1.
b. Multi-class classifier: multi:softmax
objective function softmax: p_ij = e^(ŷ_ij) / sum_{k=1..m} e^(ŷ_ik), where ŷ_ij is the raw score of sample i for class j
eval_metric: mlogloss: L = -(1/N) sum_i sum_j y_ij ln p_ij
For sample i:
N: number of samples; m: number of categories;
y_ij: 1 if the i-th sample belongs to category j, otherwise 0;
p_ij: the probability that the i-th sample is predicted as category j (1/m on average), i.e. the j-th component of the softmax output
Differentiating with respect to the raw score of class j:
first derivative: g_ij = p_ij - y_ij
second derivative: h_ij = p_ij (1 - p_ij) (the diagonal approximation)
Code:https://github.com/dmlc/xgboost/blob/master/demo/guide-python/custom_softmax.py
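In the spirit of that demo, a hedged sketch of the per-class gradients as a custom objective (this assumes preds arrives as an (n_samples, n_classes) array of raw scores; older versions pass a flattened array that must be reshaped first):

import numpy as np

def softmax_obj(preds, dtrain):
    labels = dtrain.get_label().astype(int)
    # row-wise softmax: p_ij = exp(s_ij) / sum_k exp(s_ik)
    e = np.exp(preds - preds.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)
    y = np.zeros_like(p)
    y[np.arange(labels.size), labels] = 1.0  # one-hot y_ij
    grad = p - y                             # g_ij = p_ij - y_ij
    hess = np.maximum(p * (1.0 - p), 1e-6)   # diagonal approximation h_ij = p_ij * (1 - p_ij)
    return grad, hess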
Multi-class output: https://blog.csdn.net/phyllisyuell/article/details/81011660
Take three classes labeled 0, 1, 2 as an example.
Each class has its own trees; the leaf weights can be written w_jc, where j is the leaf index and c is the class/tree index.
multi:softmax outputs the class label obtained after softmax: 0, 1, or 2.
multi:softprob outputs the probability matrix after softmax, of shape n × 3 (n is the number of test samples).
The class with the maximum probability is selected, e.g. [0.34, 0.56, 0.42] corresponds to class 1.
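A short sketch contrasting the two outputs (random data, 3 classes labeled 0, 1, 2):

import numpy as np
import xgboost as xgb

X = np.random.rand(30, 4)
y = np.random.randint(0, 3, size=30)          # labels 0, 1, 2
dtrain = xgb.DMatrix(X, label=y)

labels = xgb.train({"objective": "multi:softmax", "num_class": 3},
                   dtrain, num_boost_round=5).predict(dtrain)  # shape (30,), values in {0, 1, 2}
probs = xgb.train({"objective": "multi:softprob", "num_class": 3},
                  dtrain, num_boost_round=5).predict(dtrain)   # shape (30, 3), each row sums to 1
pred = probs.argmax(axis=1)                   # pick the class with the largest probability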
Taylor expansion for the multi-class case: the derivation below refines this article on the principle of multi-class classification: https://www.jianshu.com/p/2698db68f2e7
The difficulty is that the same sample falls into different leaves on the trees of different classes, so the per-leaf sums cannot be merged under a single leaf index j.
In the derivation we should therefore fix a class c for the differentiation, separate out w_jc, and then merge the sample set I_j on the tree of class c.
Taking the derivative for class c = 1, cross-class terms appear. Should they be ignored? If so, the gradient reduces to the per-class form g_ic = p_ic - y_ic used above.
We need to look into the code:
https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/training.py
c. Linear regression: reg:linear
loss function: squared error; eval_metric: RMSE (root mean square error). With the (1/2)(ŷ_i - y_i)^2 convention, the derivatives are g_i = ŷ_i - y_i and h_i = 1, as sketched below.
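A tiny sketch of these derivatives as a custom objective (assuming the 1/2*(pred - y)^2 convention):

import numpy as np

def squared_error_obj(preds, dtrain):
    labels = dtrain.get_label()
    grad = preds - labels        # g_i = predicted - actual
    hess = np.ones_like(preds)   # h_i = 1 (constant second derivative)
    return grad, hess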
5. XGBoost parameters: https://xgboost.readthedocs.io/en/latest/parameter.html
gamma: controls pruning; the larger it is, the more conservative the model. Typical values are around 0.1 - 0.2.
max_depth: the depth of the tree, which has a large impact on the results; the deeper the tree, the easier it is to overfit.
alpha: L1 regularization. When the tree depth is large, this parameter can be increased appropriately.
lambda: L2 regularization. The larger the value, the less likely the model is to overfit.
subsample: the row sampling ratio, i.e. what fraction of the samples is used to train each tree; a ratio below 1 reduces variance, that is, helps prevent overfitting.
colsample_bytree: the column sampling ratio, i.e. what fraction of the features is used for each tree.
min_child_weight: the minimum sum of sample weights (h_i) required in a leaf node. A larger value keeps the model from learning overly specific local samples, but too large a value leads to underfitting.
eta: learning rate.
silent: whether to print information during training; 0 means print, 1 means do not print.
nthread: number of threads; -1 uses all threads. This needs to be tuned to the situation; the thread count has a small effect on the final result (in past tests, more threads gave slightly worse results).
seed: a fixed random seed so that results are consistent between runs.
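An illustrative parameter dictionary combining the settings above (every value here is an example, not a recommendation):

params = {
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "gamma": 0.1,             # minimum loss reduction needed to make a split
    "max_depth": 6,           # deeper trees overfit more easily
    "alpha": 0,               # L1 regularization on leaf weights
    "lambda": 1,              # L2 regularization on leaf weights
    "subsample": 0.8,         # fraction of samples used per tree
    "colsample_bytree": 0.8,  # fraction of features used per tree
    "min_child_weight": 1,    # minimum sum of h_i required in a leaf
    "eta": 0.1,               # learning rate
    "silent": 0,              # 0 prints training info (replaced by "verbosity" in newer versions)
    "nthread": 7,             # number of CPU threads
    "seed": 2019,             # fixed random seed for reproducible results
}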
6. Project demonstration in another article
7. Parallel computing:
To generate the next tree, the g_i and h_i of all samples are obtained from the current predicted values and the actual values. To find the best split point, the feature values need to be sorted in advance, the post-split gain of each feature is computed, and the feature with the largest gain is chosen for the split. XGBoost pre-sorts the data by each feature before training, stores it in a block structure, and reuses this structure across iterations, so the gain calculation for different features can be carried out in parallel with multiple threads.
8. Remaining problems:
Missing value handling: for samples with missing feature values, XGBoost automatically learns a default split direction.
Generation of the first tree http://iphil.cc/?p=482
Whether the formula obtained by differentiating with respect to w in the multi-class case is convex
Derivation of the regression loss function