Trees

Digging into the tree structure

mlinsights.mltree.tree_structure.predict_leaves(model, X)[source]

Returns the leaf every observation of X falls into.

Parameters:
  • model – a decision tree

  • X – observations

Returns:

array of leaves
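
A minimal usage sketch, importing the function through its documented module path and reusing the small grid from the examples below:

<<<

import numpy
from sklearn.tree import DecisionTreeClassifier
from mlinsights.mltree.tree_structure import predict_leaves

X = numpy.array(
    [[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2], [2, 0], [2, 1], [2, 2]]
)
y = list(range(X.shape[0]))
clr = DecisionTreeClassifier(max_depth=4)
clr.fit(X, y)

# one leaf index per row of X
print(predict_leaves(clr, X))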

mlinsights.mltree.tree_structure.tree_find_common_node(tree, i, j, parents=None)[source]

Finds the common ancestor of nodes i and j.

Parameters:
  • tree – the tree (typically the tree_ attribute of a fitted estimator)

  • i – node index (tree.nodes[i])

  • j – node index (tree.nodes[j])

  • parents – precomputed parents (None -> calls tree_node_parents())

Returns:

common root, remaining path to i, remaining path to j
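
A minimal sketch, assuming the tree argument is the fitted estimator's tree_ attribute (the parameter description refers to tree.nodes[i]) and picking two leaf indices with tree_leave_index:

<<<

import numpy
from sklearn.tree import DecisionTreeClassifier
from mlinsights.mltree import tree_leave_index
from mlinsights.mltree.tree_structure import tree_find_common_node

X = numpy.array(
    [[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2], [2, 0], [2, 1], [2, 2]]
)
y = list(range(X.shape[0]))
clr = DecisionTreeClassifier(max_depth=4)
clr.fit(X, y)

leaves = tree_leave_index(clr)
# deepest node shared by the paths to the first two leaves,
# plus the remaining path to each of them
root, path_i, path_j = tree_find_common_node(clr.tree_, leaves[0], leaves[1])
print(root, path_i, path_j)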

mlinsights.mltree.tree_structure.tree_find_path_to_root(tree, i, parents=None)[source]

Lists the nodes involved in the path leading to node i.

Parameters:
  • tree – the tree (typically the tree_ attribute of a fitted estimator)

  • i – node index (tree.nodes[i])

  • parents – precomputed parents (None -> calls tree_node_parents())

Returns:

the path, i.e. the list of node indices from node i up to the root
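
A minimal sketch under the same assumption that the tree argument is the estimator's tree_ attribute:

<<<

import numpy
from sklearn.tree import DecisionTreeClassifier
from mlinsights.mltree import tree_leave_index
from mlinsights.mltree.tree_structure import tree_find_path_to_root

X = numpy.array(
    [[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2], [2, 0], [2, 1], [2, 2]]
)
y = list(range(X.shape[0]))
clr = DecisionTreeClassifier(max_depth=4)
clr.fit(X, y)

leaves = tree_leave_index(clr)
# nodes involved in the path leading to the first leaf
print(tree_find_path_to_root(clr.tree_, leaves[0]))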

mlinsights.mltree.tree_structure.tree_node_parents(tree)[source]

Returns a dictionary {node_id: parent_id}.

Parameters:

tree – the tree (typically the tree_ attribute of a fitted estimator)

Returns:

parents, a dictionary {node_id: parent_id}
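
A minimal sketch, again assuming the tree argument is the estimator's tree_ attribute:

<<<

import numpy
from sklearn.tree import DecisionTreeClassifier
from mlinsights.mltree.tree_structure import tree_node_parents

X = numpy.array(
    [[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2], [2, 0], [2, 1], [2, 2]]
)
y = list(range(X.shape[0]))
clr = DecisionTreeClassifier(max_depth=4)
clr.fit(X, y)

# maps every node to its parent; the root does not appear as a key
print(tree_node_parents(clr.tree_))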

mlinsights.mltree.tree_structure.tree_node_range(tree, i, parents=None)[source]

Determines the ranges of a node over all dimensions. nan means infinity.

Parameters:
  • tree – the tree (typically the tree_ attribute of a fitted estimator)

  • i – node index (tree.nodes[i])

  • parents – precomputed parents (None -> calls tree_node_parents())

Returns:

one array of size (D, 2) where D is the number of dimensions

The following example shows what the function returns in the case of a simple grid in two dimensions.

<<<

import numpy
from sklearn.tree import DecisionTreeClassifier
from mlinsights.mltree import tree_leave_index, tree_node_range

X = numpy.array(
    [[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2], [2, 0], [2, 1], [2, 2]]
)
y = list(range(X.shape[0]))
clr = DecisionTreeClassifier(max_depth=4)
clr.fit(X, y)

leaves = tree_leave_index(clr)
ra = tree_node_range(clr, leaves[0])

print(ra)

>>>

    [[nan 0.5]
     [nan 0.5]]

mlinsights.mltree.tree_structure.tree_leave_index(model)[source]

Returns the indices of every leaf in a tree.

Parameters:

model – something which has a member tree_

Returns:

leaf indices
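
A minimal sketch; the function takes the fitted estimator itself since it only needs its tree_ member:

<<<

import numpy
from sklearn.tree import DecisionTreeClassifier
from mlinsights.mltree import tree_leave_index

X = numpy.array(
    [[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2], [2, 0], [2, 1], [2, 2]]
)
y = list(range(X.shape[0]))
clr = DecisionTreeClassifier(max_depth=4)
clr.fit(X, y)

# indices of the leaves of the fitted tree
print(tree_leave_index(clr))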

mlinsights.mltree.tree_structure.tree_leave_neighbors(model)[source]

The function determines which leaves are neighbors. The method uses some memory as it creates a grid of the feature space; each split multiplies the number of cells by two.

Parameters:

model – a sklearn.tree.DecisionTreeRegressor, a sklearn.tree.DecisionTreeClassifier, or any model which has a member tree_

Returns:

a dictionary {(i, j): (dimension, x1, x2)} where i, j are node indices with i < j; if \(X_d \cdot sign < th \cdot sign\), the observation goes to node i, otherwise to node j. The border lies somewhere in the segment [x1, x2].

The following example shows what the function returns in the case of a simple grid in two dimensions.

<<<

import numpy
from sklearn.tree import DecisionTreeClassifier
from mlinsights.mltree import tree_leave_neighbors

X = numpy.array(
    [[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2], [2, 0], [2, 1], [2, 2]]
)
y = list(range(X.shape[0]))
clr = DecisionTreeClassifier(max_depth=4)
clr.fit(X, y)

nei = tree_leave_neighbors(clr)

import pprint

pprint.pprint(nei)

>>>

    {(np.int32(2), np.int32(4)): [(np.int64(0),
                                   (np.float64(0.0), np.float64(0.0)),
                                   (np.float64(1.0), np.float64(0.0)))],
     (np.int32(2), np.int32(8)): [(np.int64(1),
                                   (np.float64(0.0), np.float64(0.0)),
                                   (np.float64(0.0), np.float64(1.0)))],
     (np.int32(4), np.int32(5)): [(np.int64(0),
                                   (np.float64(1.0), np.float64(0.0)),
                                   (np.float64(2.0), np.float64(0.0)))],
     (np.int32(4), np.int32(12)): [(np.int64(1),
                                    (np.float64(1.0), np.float64(0.0)),
                                    (np.float64(1.0), np.float64(1.0)))],
     (np.int32(5), np.int32(13)): [(np.int64(1),
                                    (np.float64(2.0), np.float64(0.0)),
                                    (np.float64(2.0), np.float64(1.0)))],
     (np.int32(8), np.int32(9)): [(np.int64(1),
                                   (np.float64(0.0), np.float64(1.0)),
                                   (np.float64(0.0), np.float64(2.0)))],
     (np.int32(8), np.int32(12)): [(np.int64(0),
                                    (np.float64(0.0), np.float64(1.0)),
                                    (np.float64(1.0), np.float64(1.0)))],
     (np.int32(9), np.int32(15)): [(np.int64(0),
                                    (np.float64(0.0), np.float64(2.0)),
                                    (np.float64(1.0), np.float64(2.0)))],
     (np.int32(12), np.int32(13)): [(np.int64(0),
                                     (np.float64(1.0), np.float64(1.0)),
                                     (np.float64(2.0), np.float64(1.0)))],
     (np.int32(12), np.int32(15)): [(np.int64(1),
                                     (np.float64(1.0), np.float64(1.0)),
                                     (np.float64(1.0), np.float64(2.0)))],
     (np.int32(13), np.int32(16)): [(np.int64(1),
                                     (np.float64(2.0), np.float64(1.0)),
                                     (np.float64(2.0), np.float64(2.0)))],
     (np.int32(15), np.int32(16)): [(np.int64(0),
                                     (np.float64(1.0), np.float64(2.0)),
                                     (np.float64(2.0), np.float64(2.0)))]}

Experiments, exercise

mlinsights.mltree.tree_digitize.digitize2tree(bins, right=False)[source]

Builds a decision tree which returns the same result as lambda x: numpy.digitize(x, bins, right=right) (see numpy.digitize).

Parameters:
  • bins – array of bins. It has to be 1-dimensional and monotonic.

  • right – Indicating whether the intervals include the right or the left bin edge. Default behavior is (right==False) indicating that the interval does not include the right edge. The left bin end is open in this case, i.e., bins[i-1] <= x < bins[i] is the default behavior for monotonically increasing bins.

Returns:

decision tree

Note

The implementation of decision trees in scikit-learn only allows one type of decision (<=). That’s why the function throws an exception when right=False. However, this limitation could be overcome by using ONNX, where all kinds of decision rules are implemented. The default value of right is still False to follow the numpy API even though this value raises an exception in digitize2tree.
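
A short sketch of the behavior described in the note; the exact exception type is not documented here, so the example catches any exception:

<<<

import numpy
from mlinsights.mltree import digitize2tree

bins = numpy.array([0.0, 1.0, 2.5, 4.0, 7.0])
try:
    # right=False is the numpy default but is not supported here
    digitize2tree(bins, right=False)
except Exception as e:
    print("raised:", e)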

The following example shows what the tree looks like.

<<<

import numpy
from sklearn.tree import export_text
from mlinsights.mltree import digitize2tree

x = numpy.array([0.2, 6.4, 3.0, 1.6])
bins = numpy.array([0.0, 1.0, 2.5, 4.0, 7.0])
expected = numpy.digitize(x, bins, right=True)
tree = digitize2tree(bins, right=True)
pred = tree.predict(x.reshape((-1, 1)))
print("Comparison with numpy:")
print(expected, pred)
print("Tree:")
print(export_text(tree, feature_names=["x"]))

>>>

    Comparison with numpy:
    [1 4 3 2] [1. 4. 3. 2.]
    Tree:
    |--- x <= 2.50
    |   |--- x <= 1.00
    |   |   |--- x <= 0.00
    |   |   |   |--- value: [0.00]
    |   |   |--- x >  0.00
    |   |   |   |--- value: [1.00]
    |   |--- x >  1.00
    |   |   |--- value: [2.00]
    |--- x >  2.50
    |   |--- x <= 4.00
    |   |   |--- x <= 2.50
    |   |   |   |--- value: [2.00]
    |   |   |--- x >  2.50
    |   |   |   |--- value: [3.00]
    |   |--- x >  4.00
    |   |   |--- x <= 7.00
    |   |   |   |--- x <= 4.00
    |   |   |   |   |--- value: [3.00]
    |   |   |   |--- x >  4.00
    |   |   |   |   |--- value: [4.00]
    |   |   |--- x >  7.00
    |   |   |   |--- value: [5.00]