.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/plot_piecewise_linear_regression_criterion.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end ` to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_plot_piecewise_linear_regression_criterion.py:

Custom DecisionTreeRegressor adapted to a linear regression
===========================================================

A :class:`sklearn.tree.DecisionTreeRegressor` can be trained with a couple of
built-in criteria, but it is also possible to implement a custom one (see
`hellinger_distance_criterion `_). See also the tutorial `Cython example of
exposing C-computed arrays in Python without data copies `_, which describes
a way to implement fast :epkg:`Cython` extensions.

Piecewise data
++++++++++++++

Let's build a toy problem based on two linear models.

.. GENERATED FROM PYTHON SOURCE LINES 19-42

.. code-block:: Python

    import matplotlib.pyplot as plt
    import numpy
    import numpy.random as npr

    from mlinsights.ext_test_case import measure_time
    from mlinsights.mlmodel.piecewise_tree_regression import PiecewiseTreeRegressor
    from mlinsights.mlmodel.piecewise_tree_regression_criterion import (
        SimpleRegressorCriterion,
    )
    from mlinsights.mlmodel.piecewise_tree_regression_criterion_fast import (
        SimpleRegressorCriterionFast,
    )
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    # two linear regimes depending on the sign of X[:, 0] + 0.5 * X[:, 3]
    X = npr.normal(size=(1000, 4))
    alpha = [4, -2]
    t = (X[:, 0] + X[:, 3] * 0.5) > 0
    switch = numpy.zeros(X.shape[0])
    switch[t] = 1
    y = alpha[0] * X[:, 0] * t + alpha[1] * X[:, 0] * (1 - t) + X[:, 2]

.. GENERATED FROM PYTHON SOURCE LINES 44-51

.. code-block:: Python

    fig, ax = plt.subplots(1, 1)
    ax.plot(X[:, 0], y, ".")
    ax.set_title("Piecewise examples")

.. image-sg:: /auto_examples/images/sphx_glr_plot_piecewise_linear_regression_criterion_001.png
   :alt: Piecewise examples
   :srcset: /auto_examples/images/sphx_glr_plot_piecewise_linear_regression_criterion_001.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Text(0.5, 1.0, 'Piecewise examples')

.. GENERATED FROM PYTHON SOURCE LINES 52-54

DecisionTreeRegressor
+++++++++++++++++++++

.. GENERATED FROM PYTHON SOURCE LINES 54-59

.. code-block:: Python

    X_train, X_test, y_train, y_test = train_test_split(X[:, :1], y)

.. GENERATED FROM PYTHON SOURCE LINES 61-67

.. code-block:: Python

    model = DecisionTreeRegressor(min_samples_leaf=100)
    model.fit(X_train, y_train)

.. rst-class:: sphx-glr-script-out

.. code-block:: none
    DecisionTreeRegressor(min_samples_leaf=100)
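
A regression tree approximates the data with a stepwise function: every leaf
predicts a constant, the average of the training targets that fall into it.
The short check below is an addition to this example; it only relies on the
``model`` fitted above and on the standard
:meth:`apply <sklearn.tree.DecisionTreeRegressor.apply>` method.

.. code-block:: Python

    # Sanity check (not part of the original script): the prediction for a
    # test point is the mean of the training targets sharing its leaf.
    leaves_train = model.apply(X_train)
    leaf_mean = {
        leaf: y_train[leaves_train == leaf].mean() for leaf in set(leaves_train)
    }
    expected = numpy.array([leaf_mean[leaf] for leaf in model.apply(X_test)])
    assert numpy.allclose(expected, model.predict(X_test))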

.. GENERATED FROM PYTHON SOURCE LINES 69-75

.. code-block:: Python

    pred = model.predict(X_test)
    pred[:5]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([ 1.22084396, -0.00420038,  2.94896867, -0.00420038,  0.55651447])

.. GENERATED FROM PYTHON SOURCE LINES 77-86

.. code-block:: Python

    fig, ax = plt.subplots(1, 1)
    ax.plot(X_test[:, 0], y_test, ".", label="data")
    ax.plot(X_test[:, 0], pred, ".", label="predictions")
    ax.set_title("DecisionTreeRegressor")
    ax.legend()

.. image-sg:: /auto_examples/images/sphx_glr_plot_piecewise_linear_regression_criterion_002.png
   :alt: DecisionTreeRegressor
   :srcset: /auto_examples/images/sphx_glr_plot_piecewise_linear_regression_criterion_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 87-89

DecisionTreeRegressor with custom implementation
++++++++++++++++++++++++++++++++++++++++++++++++

.. GENERATED FROM PYTHON SOURCE LINES 93-101

.. code-block:: Python

    model2 = DecisionTreeRegressor(
        min_samples_leaf=100, criterion=SimpleRegressorCriterion(1, X_train.shape[0])
    )
    model2.fit(X_train, y_train)

.. rst-class:: sphx-glr-script-out

.. code-block:: none
    DecisionTreeRegressor(criterion=<mlinsights.mlmodel.piecewise_tree_regression_criterion.SimpleRegressorCriterion object at 0x55c473336390>,
                          min_samples_leaf=100)

.. GENERATED FROM PYTHON SOURCE LINES 103-109

.. code-block:: Python

    pred = model2.predict(X_test)
    pred[:5]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([1.22084396, 0.02146014, 2.94896867, 0.02146014, 0.02146014])

.. GENERATED FROM PYTHON SOURCE LINES 111-120

.. code-block:: Python

    fig, ax = plt.subplots(1, 1)
    ax.plot(X_test[:, 0], y_test, ".", label="data")
    ax.plot(X_test[:, 0], pred, ".", label="predictions")
    ax.set_title("DecisionTreeRegressor\nwith custom criterion")
    ax.legend()

.. image-sg:: /auto_examples/images/sphx_glr_plot_piecewise_linear_regression_criterion_003.png
   :alt: DecisionTreeRegressor with custom criterion
   :srcset: /auto_examples/images/sphx_glr_plot_piecewise_linear_regression_criterion_003.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 121-133

Computation time
++++++++++++++++

The custom criterion is not really efficient, but it was written that way on
purpose. The code can be found in `piecewise_tree_regression_criterion `_.
Basically, it is slow because every time the algorithm growing the tree asks
the class ``Criterion`` to evaluate the impurity reduction for a split, the
computation runs over all the data under the node being split. The
implementation in `_criterion.pyx `_ does it once.

.. GENERATED FROM PYTHON SOURCE LINES 133-138

.. code-block:: Python

    measure_time("model.fit(X_train, y_train)", globals())

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    {'average': np.float64(0.0006277020659908886), 'deviation': np.float64(0.00013720017019929376), 'min_exec': np.float64(0.0004980177799734519), 'max_exec': np.float64(0.0008900149600231089), 'repeat': 10, 'number': 50, 'ttime': np.float64(0.006277020659908886), 'context_size': 1176, 'warmup_time': 0.000983060999715235}

.. GENERATED FROM PYTHON SOURCE LINES 140-145

.. code-block:: Python

    measure_time("model2.fit(X_train, y_train)", globals())

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    {'average': np.float64(0.0029189347680076026), 'deviation': np.float64(0.00021396753727527702), 'min_exec': np.float64(0.002569425140027306), 'max_exec': np.float64(0.003275515240020468), 'repeat': 10, 'number': 50, 'ttime': np.float64(0.029189347680076028), 'context_size': 1176, 'warmup_time': 0.0035421940010564867}

.. GENERATED FROM PYTHON SOURCE LINES 146-179

A loop over all samples in the node runs every time the criterion is
evaluated, which raises the computation cost a lot. The method ``_mse`` is
called each time the algorithm training the decision tree needs to evaluate
a cut; one cut involves the elements between positions ``[start, end[``.

::

    ctypedef double float64_t

    cdef void _mean(self, intp_t start, intp_t end,
                    float64_t *mean, float64_t *weight) nogil:
        if start == end:
            mean[0] = 0.
            return
        cdef float64_t m = 0.
        cdef float64_t w = 0.
        cdef int k
        for k in range(start, end):
            m += self.sample_wy[k]
            w += self.sample_w[k]
        weight[0] = w
        mean[0] = 0. if w == 0. else m / w

    cdef float64_t _mse(self, intp_t start, intp_t end,
                        float64_t mean, float64_t weight) nogil:
        if start == end:
            return 0.
        cdef float64_t squ = 0.
        cdef int k
        for k in range(start, end):
            squ += (self.y[self.sample_i[k], 0] - mean) ** 2 * self.sample_w[k]
        return 0. if weight == 0. else squ / weight

.. GENERATED FROM PYTHON SOURCE LINES 181-216

Better implementation
+++++++++++++++++++++

I rewrote my first implementation to be closer to what :epkg:`scikit-learn`
is doing. The criterion is computed once for all possible cuts and then
retrieved on demand.
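
The idea is to precompute cumulative sums so that the statistics of any
interval come out of two array lookups instead of a loop. Here is a minimal
NumPy sketch of that trick, added for illustration; the names mirror the
Cython arrays shown below but this is not the library code.

.. code-block:: Python

    # Illustration only: weighted mean and MSE of any interval [start, end[
    # recovered in constant time from cumulative sums.
    rng = npr.RandomState(0)
    y_ = rng.normal(size=100)
    w_ = numpy.ones_like(y_)

    wy_left = numpy.cumsum(w_ * y_)        # cumulated w * y
    wy2_left = numpy.cumsum(w_ * y_ ** 2)  # cumulated w * y^2
    w_left = numpy.cumsum(w_)              # cumulated weights

    def interval_mean_mse(start, end):
        m = wy_left[end - 1] - (wy_left[start - 1] if start > 0 else 0)
        s = wy2_left[end - 1] - (wy2_left[start - 1] if start > 0 else 0)
        w = w_left[end - 1] - (w_left[start - 1] if start > 0 else 0)
        mean = m / w
        # E[(y - mean)^2] = E[y^2] - mean^2 when mean is taken on the same interval
        return mean, s / w - mean ** 2

    mean, mse = interval_mean_mse(10, 60)
    assert abs(mean - y_[10:60].mean()) < 1e-10
    assert abs(mse - y_[10:60].var()) < 1e-10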

The Cython code is below; the array ``sample_wy_left`` is the cumulated sum
of :math:`weight \times Y` computed from the left side. The loop disappeared.

::

    ctypedef double float64_t

    cdef void _mean(self, intp_t start, intp_t end,
                    float64_t *mean, float64_t *weight) nogil:
        if start == end:
            mean[0] = 0.
            return
        cdef float64_t m = self.sample_wy_left[end-1] - (
            self.sample_wy_left[start-1] if start > 0 else 0)
        cdef float64_t w = self.sample_w_left[end-1] - (
            self.sample_w_left[start-1] if start > 0 else 0)
        weight[0] = w
        mean[0] = 0. if w == 0. else m / w

    cdef float64_t _mse(self, intp_t start, intp_t end,
                        float64_t mean, float64_t weight) nogil:
        if start == end:
            return 0.
        cdef float64_t squ = self.sample_wy2_left[end-1] - (
            self.sample_wy2_left[start-1] if start > 0 else 0)
        # This formula only holds if mean is computed on the same interval.
        # Otherwise, it is squ / weight - true_mean ** 2 + (mean - true_mean) ** 2.
        return 0. if weight == 0. else squ / weight - mean ** 2

.. GENERATED FROM PYTHON SOURCE LINES 219-229

.. code-block:: Python

    model3 = DecisionTreeRegressor(
        min_samples_leaf=100, criterion=SimpleRegressorCriterionFast(1, X_train.shape[0])
    )
    model3.fit(X_train, y_train)
    pred = model3.predict(X_test)
    pred[:5]

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([1.22084396, 0.02146014, 2.94896867, 0.02146014, 0.02146014])

.. GENERATED FROM PYTHON SOURCE LINES 231-240

.. code-block:: Python

    fig, ax = plt.subplots(1, 1)
    ax.plot(X_test[:, 0], y_test, ".", label="data")
    ax.plot(X_test[:, 0], pred, ".", label="predictions")
    ax.set_title("DecisionTreeRegressor\nwith fast custom criterion")
    ax.legend()

.. image-sg:: /auto_examples/images/sphx_glr_plot_piecewise_linear_regression_criterion_004.png
   :alt: DecisionTreeRegressor with fast custom criterion
   :srcset: /auto_examples/images/sphx_glr_plot_piecewise_linear_regression_criterion_004.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 242-247

.. code-block:: Python

    measure_time("model3.fit(X_train, y_train)", globals())

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    {'average': np.float64(0.0005952476640013629), 'deviation': np.float64(9.200797957516122e-05), 'min_exec': np.float64(0.0004608890599774895), 'max_exec': np.float64(0.0008051304800028447), 'repeat': 10, 'number': 50, 'ttime': np.float64(0.005952476640013629), 'context_size': 1176, 'warmup_time': 0.0019230220023018774}

.. GENERATED FROM PYTHON SOURCE LINES 248-251

Much better, even though this implementation can still be slower than
scikit-learn's (the gap depends on the run; the timings above are close).
Let's check with a dataset three times bigger to see whether the overhead is
a fixed cost or grows with the data size.

.. GENERATED FROM PYTHON SOURCE LINES 251-257

.. code-block:: Python

    X_train3 = numpy.vstack([X_train, X_train, X_train])
    y_train3 = numpy.hstack([y_train, y_train, y_train])

.. GENERATED FROM PYTHON SOURCE LINES 259-264

.. code-block:: Python

    X_train.shape, X_train3.shape, y_train3.shape

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    ((750, 1), (2250, 1), (2250,))

.. GENERATED FROM PYTHON SOURCE LINES 266-270

.. code-block:: Python

    measure_time("model.fit(X_train3, y_train3)", globals())

.. rst-class:: sphx-glr-script-out
.. code-block:: none

    {'average': np.float64(0.0013359763760090574), 'deviation': np.float64(0.00012585663500117283), 'min_exec': np.float64(0.0011745783000515076), 'max_exec': np.float64(0.0016119259800325381), 'repeat': 10, 'number': 50, 'ttime': np.float64(0.013359763760090574), 'context_size': 1176, 'warmup_time': 0.0013088390005577821}

.. GENERATED FROM PYTHON SOURCE LINES 271-275

The criterion needs to be reinstantiated since it depends on the features
*X*. The computation does not, but the design does. This was introduced to
compare the current output with a decision tree optimizing for a piecewise
linear regression and not a stepwise regression.

.. GENERATED FROM PYTHON SOURCE LINES 275-283

.. code-block:: Python

    try:
        model3.fit(X_train3, y_train3)
    except Exception as e:
        print(e)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    n_samples=750 -- y.shape=[2250, 1, 0, 0, 0, 0, 0, 0]

.. GENERATED FROM PYTHON SOURCE LINES 285-293

.. code-block:: Python

    model3 = DecisionTreeRegressor(
        min_samples_leaf=100, criterion=SimpleRegressorCriterionFast(1, X_train3.shape[0])
    )

    measure_time("model3.fit(X_train3, y_train3)", globals())

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    {'average': np.float64(0.0015539457760241932), 'deviation': np.float64(0.00025411727907129666), 'min_exec': np.float64(0.0012243494600261329), 'max_exec': np.float64(0.002184680420032237), 'repeat': 10, 'number': 50, 'ttime': np.float64(0.015539457760241931), 'context_size': 1176, 'warmup_time': 0.0022764530003769323}

.. GENERATED FROM PYTHON SOURCE LINES 294-314

Still slower, but of the same order of magnitude. We could go further and
investigate why, or continue and introduce a criterion which optimizes a
piecewise linear regression instead of a stepwise regression.

Criterion adapted for a linear regression
+++++++++++++++++++++++++++++++++++++++++

The previous examples are all about decision trees which approximate a
function with a stepwise function. On every interval :math:`[r_1, r_2]`, the
model minimizes
:math:`\sum_i (y_i - C)^2 \mathbb{1}_{r_1 \leqslant x_i \leqslant r_2}` and
finds the best constant (= the average) approximating the function on this
interval. We would now like to approximate the function with a regression
line instead of a constant, which means minimizing
:math:`\sum_i (y_i - X_i \beta)^2 \mathbb{1}_{r_1 \leqslant x_i \leqslant r_2}`.
Doing this requires changing the criterion used to split the feature space
into buckets, and also the prediction function of the decision tree, which
now needs to return a dot product.
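
Before turning to ``PiecewiseTreeRegressor``, here is a small brute-force
sketch of such a criterion, added for illustration and independent from the
mlinsights implementation (``linear_split_loss`` is a helper defined only
for this sketch): for every candidate cut along the sorted feature, fit a
linear regression on each side and keep the cut minimizing the total sum of
squared residuals.

.. code-block:: Python

    # Illustration only: exhaustive search for the best piecewise-linear cut.
    def linear_split_loss(Xs, ys, split):
        loss = 0.0
        for part in (slice(0, split), slice(split, None)):
            # least-squares fit with an intercept on one side of the cut
            A = numpy.column_stack([Xs[part], numpy.ones(len(ys[part]))])
            beta, *_ = numpy.linalg.lstsq(A, ys[part], rcond=None)
            loss += ((A @ beta - ys[part]) ** 2).sum()
        return loss

    order = numpy.argsort(X_train[:, 0])
    Xs, ys = X_train[order], y_train[order]
    best = min(
        range(100, len(ys) - 100),  # mimic min_samples_leaf=100
        key=lambda s: linear_split_loss(Xs, ys, s),
    )
    print("best cut around x =", Xs[best, 0])

On this toy dataset the best cut should land near the boundary between the
two linear regimes. The tree criterion below generalizes this idea to every
node of the tree.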

.. GENERATED FROM PYTHON SOURCE LINES 314-322

.. code-block:: Python

    fixed = False
    if fixed:
        # It does not work yet.
        piece = PiecewiseTreeRegressor(criterion="mselin", min_samples_leaf=100)
        piece.fit(X_train, y_train)

.. GENERATED FROM PYTHON SOURCE LINES 324-331

.. code-block:: Python

    if fixed:
        pred = piece.predict(X_test)
        pred[:5]

.. GENERATED FROM PYTHON SOURCE LINES 333-342

.. code-block:: Python

    if fixed:
        fig, ax = plt.subplots(1, 1)
        ax.plot(X_test[:, 0], y_test, ".", label="data")
        ax.plot(X_test[:, 0], pred, ".", label="predictions")
        ax.set_title("DecisionTreeRegressor\nwith criterion adapted to linear regression")
        ax.legend()

.. GENERATED FROM PYTHON SOURCE LINES 343-344

The coefficients of the linear regressions are kept in the following attribute:

.. GENERATED FROM PYTHON SOURCE LINES 344-350

.. code-block:: Python

    if fixed:
        piece.betas_

.. GENERATED FROM PYTHON SOURCE LINES 351-352

They are mapped to the following leaves:

.. GENERATED FROM PYTHON SOURCE LINES 352-358

.. code-block:: Python

    if fixed:
        piece.leaves_index_, piece.leaves_mapping_

.. GENERATED FROM PYTHON SOURCE LINES 359-360

We can get the leaf each observation falls into:

.. GENERATED FROM PYTHON SOURCE LINES 360-366

.. code-block:: Python

    if fixed:
        piece.predict_leaves(X_test)[:5]

.. GENERATED FROM PYTHON SOURCE LINES 367-369

The training is quite slow as it trains many linear regressions each time a
split is evaluated.

.. GENERATED FROM PYTHON SOURCE LINES 369-375

.. code-block:: Python

    if fixed:
        measure_time("piece.fit(X_train, y_train)", globals())

.. GENERATED FROM PYTHON SOURCE LINES 377-382

.. code-block:: Python

    if fixed:
        measure_time("piece.fit(X_train3, y_train3)", globals())

.. GENERATED FROM PYTHON SOURCE LINES 383-399

It works but it is slow, slower than the slow implementation of the standard
criterion for decision tree regression.

Next
++++

PR `Model trees (M5P and co) `_ and issue `Model trees (M5P) `_ propose an
implementation of piecewise regression with any kind of regression model. It
is based on `Building Model Trees `_. It fits many models to find the best
splits and should be slower than this implementation in the case of a
decision tree regressor associated with linear regressions.

.. rst-class:: sphx-glr-timing

**Total running time of the script:** (0 minutes 3.894 seconds)

.. _sphx_glr_download_auto_examples_plot_piecewise_linear_regression_criterion.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_piecewise_linear_regression_criterion.ipynb `

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_piecewise_linear_regression_criterion.py `

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_piecewise_linear_regression_criterion.zip `

.. only:: html

  .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_