Classification and ROC curves

The ROC curve is a way to visualize the performance of a classifier, or rather its relevance. Let's see how.

[1]:
%matplotlib inline
[2]:
from teachpyx.datasets import load_wines_dataset

data = load_wines_dataset()
X = data.drop(["quality", "color"], axis=1)
y = data["color"]
[3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
[4]:
from sklearn.linear_model import LogisticRegression

clr = LogisticRegression()
clr.fit(X_train, y_train)
~/vv/this312/lib/python3.12/site-packages/sklearn/linear_model/_logistic.py:406: ConvergenceWarning: lbfgs failed to converge after 100 iteration(s) (status=1):
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[4]:
LogisticRegression()
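The warning above only means that the lbfgs solver stopped before converging. A minimal sketch of the usual remedy, following the suggestion in the warning itself: scale the features and raise the iteration budget. The pipeline below (the name pipe and the value max_iter=1000 are choices made here, not part of the original notebook) is not reused afterwards; the rest of the notebook keeps the unscaled model clr, which is good enough for this example.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Sketch: scaling the features usually lets lbfgs converge well before max_iter.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)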
[5]:
import pandas

score = clr.decision_function(X_test)
dfsc = pandas.DataFrame(score, columns=["score"])
dfsc["color"] = y_test.values
dfsc.head()
[5]:
       score  color
0   4.959154  white
1  -5.770727    red
2  10.241925  white
3   2.645375  white
4  11.503256  white
[6]:
ax = dfsc[dfsc["color"] == "white"]["score"].hist(
    bins=25, figsize=(6, 3), label="white", alpha=0.5
)
dfsc[dfsc["color"] == "red"]["score"].hist(bins=25, ax=ax, label="red", alpha=0.5)
ax.set_title("Distribution des scores pour les deux classes")
ax.legend();
../../_images/practice_ml_winesc_color_roc_6_0.png
[7]:
from sklearn.metrics import roc_auc_score, roc_curve, auc

fpr0, tpr0, thresholds0 = roc_curve(
    y_test, score, pos_label=clr.classes_[1], drop_intermediate=False
)
fpr0.shape
[7]:
(1538,)
[8]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1, figsize=(4, 4))
ax.plot([0, 1], [0, 1], "k--")
# aucf = roc_auc_score(y_test == clr.classes_[0], probas[:, 0])  # first way
aucf = auc(fpr0, tpr0)  # second way
ax.plot(fpr0, tpr0, label=clr.classes_[1] + " auc=%1.5f" % aucf)
ax.set_title("Courbe ROC - classifieur couleur des vins")
ax.legend();
../../_images/practice_ml_winesc_color_roc_8_0.png
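As a quick cross-check (a sketch, not an original cell): roc_auc_score applied to the raw scores with the same positive class should return the same value as auc(fpr0, tpr0) computed above.

# Same AUC computed directly from the decision scores of cell [5].
roc_auc_score(y_test == clr.classes_[1], score)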

Let us see what happens in general if we decide to evaluate not a model that classifies one class well, but both classes at once. The difference is subtle. The score is that of the predicted class and no longer that of a single class.

[9]:
list_classes = list(clr.classes_)

pred = clr.predict(X_test)
scores = clr.decision_function(X_test)
pred_i = [list_classes.index(c) for c in pred]
probas = clr.predict_proba(X_test)
proba = [probas[i, c] for i, c in enumerate(pred_i)]
score = [-scores[i] if c == 0 else scores[i] for i, c in enumerate(pred_i)]
[10]:
dfpr = pandas.DataFrame(proba, columns=["proba"])
dfpr["color"] = y_test.values

import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 2, figsize=(10, 3))
dfpr[pred == dfpr["color"]]["proba"].hist(bins=25, label="bon", alpha=0.5, ax=ax[0])
dfpr[pred != dfpr["color"]]["proba"].hist(bins=25, label="faux", alpha=0.5, ax=ax[0])
ax[0].set_title("Distribution des probabilités faux / bon")
ax[0].legend()
dfpr[pred == dfpr["color"]]["proba"].hist(bins=25, label="bon", alpha=0.5, ax=ax[1])
dfpr[pred != dfpr["color"]]["proba"].hist(bins=25, label="faux", alpha=0.5, ax=ax[1])
ax[1].set_yscale("log")
ax[1].set_title("Distribution des probabilités faux / bon\néchelle logarithmique")
ax[1].legend();
../../_images/practice_ml_winesc_color_roc_11_0.png

And now with the scores rather than the probabilities.

[11]:
scores = clr.decision_function(X_test)
dfsc = pandas.DataFrame(scores, columns=["score"])
dfsc.loc[pred == "red", "score"] *= -1
dfsc["color"] = y_test.sort_values

import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 2, figsize=(10, 3))
dfsc[pred == dfsc["color"]]["score"].hist(bins=25, label="bon", alpha=0.5, ax=ax[0])
dfsc[pred != dfsc["color"]]["score"].hist(bins=25, label="faux", alpha=0.5, ax=ax[0])
ax[0].set_title("Distribution des scores faux / bon")
ax[0].legend()
dfsc[pred == dfsc["color"]]["score"].hist(bins=25, label="bon", alpha=0.5, ax=ax[1])
dfsc[pred != dfsc["color"]]["score"].hist(bins=25, label="faux", alpha=0.5, ax=ax[1])
ax[1].set_yscale("log")
ax[1].set_title("Distribution des scores faux / bon\néchelle logarithmique")
ax[1].legend();
../../_images/practice_ml_winesc_color_roc_13_0.png
[12]:
fpr, tpr, thresholds = roc_curve(pred == y_test, proba)
[13]:
fig, ax = plt.subplots(1, 1, figsize=(4, 4))
ax.plot([0, 1], [0, 1], "k--")
aucf = auc(fpr, tpr)  # second way
ax.plot(fpr, tpr, label="bonne classification auc=%1.5f" % aucf)
ax.set_title("Courbe ROC - faux / bon")
ax.legend();
../../_images/practice_ml_winesc_color_roc_15_0.png

But why are the results worse? First AUC:

[14]:
pred = clr.predict(X_test)
scores = clr.decision_function(X_test)
roc_auc_score(y_test == "white", scores)
[14]:
0.9973206433743456

Second AUC: we assume that 0 is the boundary between the two classes. If the score is greater than 0, the wine is white, otherwise it is red.

[15]:
import numpy

pred = numpy.array(["red" if s < 0 else "white" for s in scores])
fpr0, tpr0, thresholds0 = roc_curve(
    y_test == pred, numpy.abs(scores), drop_intermediate=False
)
auc(fpr0, tpr0)
[15]:
0.951975
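For comparison, here is a sketch (not an original cell) of the plain accuracy of the sign-of-score rule at threshold 0, i.e. the fraction of wines whose color is correctly predicted:

# Accuracy at threshold 0.
(pred == y_test).mean()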

But maybe that is not the best threshold? Let's have a look.

[16]:
ths = []
aucs = []
n = 100
a, b = -1, 2
for thi in range(0, n + 1):
    th = a + thi * (b - a) * 1.0 / n
    ths.append(th)
    pred = numpy.array(["red" if s - th < 0 else "white" for s in scores])
    fpr, tpr, _ = roc_curve(
        y_test == pred, numpy.abs(scores - th), drop_intermediate=False
    )
    aucs.append(auc(fpr, tpr))
[17]:
dfa = pandas.DataFrame(dict(th=ths, AUC=aucs))
dfa.describe().T
[17]:
      count      mean       std       min       25%      50%      75%       max
th    101.0  0.500000  0.879005 -1.000000 -0.250000  0.50000  1.25000  2.000000
AUC   101.0  0.946323  0.008481  0.926092  0.941026  0.94812  0.95325  0.958874
[18]:
ax = dfa.plot(x="th", y="AUC", figsize=(3, 3), kind="scatter")
ax.set_title("Evolution de l'AUC en fonction\ndu seuil de décision");
../../_images/practice_ml_winesc_color_roc_23_0.png
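As a sketch (assuming the dfa DataFrame built above), the threshold with the highest AUC can be read directly from the grid:

# Row of the grid with the best AUC.
dfa.loc[dfa["AUC"].idxmax()]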

That helps but it is still not quite there. Changing the threshold helps; we would now need to change the scale of the score, which we will not do, but nothing prevents us from plotting the cumulative distribution functions of the negative and positive scores… Here are the distributions.

[19]:
score = clr.decision_function(X_test)
dfsc = pandas.DataFrame(score, columns=["score"])
dfsc["color"] = y_test.values.ravel()
dfsc["score_abs"] = dfsc["score"].abs()
[20]:
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
dfsc[dfsc["color"] == "white"]["score"].hist(
    bins=25, label="white", alpha=0.5, ax=ax[0]
)
dfsc[dfsc["color"] == "red"]["score"].hist(bins=25, label="red", alpha=0.5, ax=ax[0])
ax[0].set_title("Distribution des scores pour les deux classes")
ax[0].legend()
dfsc[dfsc["color"] == "white"]["score_abs"].hist(
    bins=25, label="white", alpha=0.5, ax=ax[1]
)
dfsc[dfsc["color"] == "red"]["score_abs"].hist(
    bins=25, label="red", alpha=0.5, ax=ax[1]
)
ax[1].set_title("Distribution des scores en valeur absolue pour les deux classes")
ax[1].legend();
../../_images/practice_ml_winesc_color_roc_26_0.png
[21]:
red = dfsc[dfsc["color"] == "red"].sort_values("score_abs").reset_index(drop=True)
red["count"] = 1
red["count_sum"] = red["count"].cumsum() * 1.0 / red.shape[0]
red.tail(n=2)
[21]:
         score color  score_abs  count  count_sum
394 -10.560451   red  10.560451      1   0.997475
395 -12.879144   red  12.879144      1   1.000000
[22]:
white = dfsc[dfsc["color"] == "white"].sort_values("score_abs").reset_index(drop=True)
white["count"] = 1
white["count_sum"] = white["count"].cumsum() * 1.0 / white.shape[0]
white.tail(n=2)
[22]:
          score  color  score_abs  count  count_sum
1227  12.948889  white  12.948889      1   0.999186
1228  13.120824  white  13.120824      1   1.000000
[23]:
dfscsumabs = pandas.concat([red, white])

We do the same for the score.

[24]:
red = (
    dfsc[dfsc["color"] == "red"]
    .sort_values("score", ascending=False)
    .reset_index(drop=True)
)
red["count"] = 1
red["count_sum"] = red["count"].cumsum() * 1.0 / red.shape[0]
white = dfsc[dfsc["color"] == "white"].sort_values("score").reset_index(drop=True)
white["count"] = 1
white["count_sum"] = white["count"].cumsum() * 1.0 / white.shape[0]
dfscsum = pandas.concat([red, white])
[25]:
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
dfscsum[dfscsum["color"] == "white"].plot(
    x="score", y="count_sum", kind="scatter", label="white", alpha=0.5, ax=ax[0]
)
dfscsum[dfscsum["color"] == "red"].plot(
    x="score",
    y="count_sum",
    kind="scatter",
    color="red",
    label="red",
    alpha=0.5,
    ax=ax[0],
)
ax[0].set_title("Répartition des scores pour les deux classes")
ax[0].legend()
dfscsumabs[dfscsumabs["color"] == "white"].plot(
    x="score_abs", y="count_sum", kind="scatter", label="white", alpha=0.5, ax=ax[1]
)
dfscsumabs[dfscsumabs["color"] == "red"].plot(
    x="score_abs",
    y="count_sum",
    kind="scatter",
    color="red",
    label="red",
    alpha=0.5,
    ax=ax[1],
)
ax[1].set_title("Répartition des scores en valeur absolue pour les deux classes")
ax[1].legend();
../../_images/practice_ml_winesc_color_roc_32_0.png

The two curves do not coincide. This means that the scores of white wines and those of red wines do not follow the same distribution.
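To quantify that difference, a minimal sketch (assuming scipy is installed, which is not required by the original notebook) is a two-sample Kolmogorov-Smirnov test on the absolute scores of the two colors; a small p-value would confirm that the two distributions differ.

from scipy.stats import ks_2samp

# Compare the empirical distributions of |score| for white and red wines.
res = ks_2samp(
    dfsc.loc[dfsc["color"] == "white", "score_abs"],
    dfsc.loc[dfsc["color"] == "red", "score_abs"],
)
res.statistic, res.pvalue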

