Multi-class classification

We try to predict the quality rating of a wine with a multi-class classifier.

[1]:
%matplotlib inline
[3]:
from teachpyx.datasets import load_wines_dataset

df = load_wines_dataset()
X = df.drop(["quality", "color"], axis=1)
y = df["quality"]
[4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
[6]:
from sklearn.linear_model import LogisticRegression

clr = LogisticRegression(solver="liblinear")
clr.fit(X_train, y_train)
[6]:
LogisticRegression(solver='liblinear')
[7]:
import numpy

numpy.mean(clr.predict(X_test).ravel() == y_test.ravel()) * 100
[7]:
55.07692307692308

Let us look at the confusion matrix.

[8]:
from sklearn.metrics import confusion_matrix
import pandas

pandas.DataFrame(confusion_matrix(y_test, clr.predict(X_test)))
[8]:
0 1 2 3 4 5 6
0 0 0 4 7 0 0 0
1 0 0 47 22 0 0 0
2 0 0 332 184 0 0 0
3 0 0 169 541 10 0 0
4 0 0 19 217 22 0 0
5 0 0 3 42 5 0 0
6 0 0 0 1 0 0 0

We display it differently, with the class names.

[9]:
conf = confusion_matrix(y_test, clr.predict(X_test))
dfconf = pandas.DataFrame(conf)
labels = list(clr.classes_)
if len(labels) < dfconf.shape[1]:
    # Class 9 is very rare; it is sometimes absent from the training set.
    labels += [9]
elif len(labels) > dfconf.shape[1]:
    # Or the reverse: a class seen in training is absent from the test set.
    labels = labels[: dfconf.shape[1]]
dfconf.columns = labels
dfconf.index = labels
dfconf
[9]:
3 4 5 6 7 8 9
3 0 0 4 7 0 0 0
4 0 0 47 22 0 0 0
5 0 0 332 184 0 0 0
6 0 0 169 541 10 0 0
7 0 0 19 217 22 0 0
8 0 0 3 42 5 0 0
9 0 0 0 1 0 0 0
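
The label patching above relies on the missing class being the last one. A less fragile sketch, assuming only the `labels` argument of scikit-learn's `confusion_matrix`, fixes the row and column order explicitly from the union of observed and predicted classes. The small arrays below are made up for illustration and are not the wine data.

```python
import numpy
import pandas
from sklearn.metrics import confusion_matrix

# Made-up ratings for illustration only (not the wine data).
y_true = numpy.array([5, 6, 6, 7, 9, 5, 6])
y_pred = numpy.array([5, 6, 5, 6, 6, 5, 6])

# Every class seen either in the truth or in the predictions, sorted.
labels = numpy.union1d(y_true, y_pred)

# Passing labels= fixes the order of rows (truth) and columns (predictions).
conf = confusion_matrix(y_true, y_pred, labels=labels)
dfconf = pandas.DataFrame(conf, index=labels, columns=labels)
print(dfconf)
```

With this approach a class that appears only in the test set, such as 9 here, still gets its own row and column.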

Not great: the model only ever predicts classes 5, 6 and 7. We apply the OneVsRestClassifier strategy.

[10]:
from sklearn.multiclass import OneVsRestClassifier

clr = OneVsRestClassifier(LogisticRegression(solver="liblinear"))
clr.fit(X_train, y_train)
[10]:
OneVsRestClassifier(estimator=LogisticRegression(solver='liblinear'))
[11]:
numpy.mean(clr.predict(X_test).ravel() == y_test.ravel()) * 100
[11]:
54.95384615384615
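
To make the strategy concrete, here is a minimal sketch of one-vs-rest done by hand: one binary logistic regression per class, then the class whose binary model returns the highest decision score wins. It uses synthetic data from `make_classification` (an assumption for illustration, not the wine data) and compares the result with `OneVsRestClassifier`.

```python
import numpy
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 3-class problem, for illustration only.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

classes = numpy.unique(y)
scores = numpy.empty((X.shape[0], len(classes)))
for j, c in enumerate(classes):
    # One binary problem per class: "is it class c or not?"
    binary = LogisticRegression(solver="liblinear")
    binary.fit(X, (y == c).astype(int))
    scores[:, j] = binary.decision_function(X)

# The predicted class is the one whose binary model is the most confident.
manual_pred = classes[scores.argmax(axis=1)]

# Reference: the scikit-learn wrapper around the same base estimator.
ovr = OneVsRestClassifier(LogisticRegression(solver="liblinear")).fit(X, y)
agreement = numpy.mean(manual_pred == ovr.predict(X))
print(agreement)
```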

With the liblinear solver, multi-class logistic regression is fitted one-vs-rest, so it is equivalent to the OneVsRest strategy. Let us try the other one, OneVsOne.

[12]:
from sklearn.multiclass import OneVsOneClassifier

clr = OneVsOneClassifier(LogisticRegression(solver="liblinear"))
clr.fit(X_train, y_train)
[12]:
OneVsOneClassifier(estimator=LogisticRegression(solver='liblinear'))
[13]:
numpy.mean(clr.predict(X_test).ravel() == y_test.ravel()) * 100
[13]:
55.138461538461534
[14]:
conf = confusion_matrix(y_test, clr.predict(X_test))
dfconf = pandas.DataFrame(conf)
labels = list(clr.classes_)
if len(labels) < dfconf.shape[1]:
    # Class 9 is very rare; it is sometimes absent from the training set.
    labels += [9]
elif len(labels) > dfconf.shape[1]:
    # Or the reverse: a class seen in training is absent from the test set.
    labels = labels[: dfconf.shape[1]]
dfconf.columns = labels
dfconf.index = labels
dfconf
[14]:
3 4 5 6 7 8 9
3 0 0 5 6 0 0 0
4 0 0 46 23 0 0 0
5 0 0 332 183 1 0 0
6 0 0 169 524 27 0 0
7 0 0 18 200 40 0 0
8 0 0 6 32 12 0 0
9 0 0 0 1 0 0 0
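
A rough way to judge the gap between two accuracies measured on one test set is the binomial standard error of an accuracy. The sketch below uses the test-set size read from the confusion matrix above (its entries sum to 1625) and the two scores printed earlier. Treating the test examples as independent draws is an assumption, and a paired comparison would be more precise since both models are scored on the same examples.

```python
import math

n_test = 1625  # sum of all entries of the confusion matrix
acc_logreg = 55.07692307692308 / 100  # LogisticRegression
acc_ovo = 55.138461538461534 / 100  # OneVsOneClassifier

# Binomial standard error of an accuracy estimated on n_test examples.
std_err = math.sqrt(acc_logreg * (1 - acc_logreg) / n_test)

# Approximate 95% interval half-width, in percentage points.
half_width = 1.96 * std_err * 100
gap = (acc_ovo - acc_logreg) * 100

print(f"half-width: {half_width:.2f} points, gap: {gap:.2f} points")
```

The gap between the two models is well under a tenth of a point, while the uncertainty on each score is more than two points.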

Roughly the same: the difference is probably not significant. Let us try a decision tree.

[15]:
from sklearn.tree import DecisionTreeClassifier

clr = DecisionTreeClassifier()
clr.fit(X_train, y_train)
[15]:
DecisionTreeClassifier()
[16]:
numpy.mean(clr.predict(X_test).ravel() == y_test.ravel()) * 100
[16]:
59.323076923076925

And with OneVsRestClassifier:

[17]:
clr = OneVsRestClassifier(DecisionTreeClassifier())
clr.fit(X_train, y_train)
[17]:
OneVsRestClassifier(estimator=DecisionTreeClassifier())
[18]:
numpy.mean(clr.predict(X_test).ravel() == y_test.ravel()) * 100
[18]:
53.35384615384615

And with OneVsOneClassifier:

[19]:
clr = OneVsOneClassifier(DecisionTreeClassifier())
clr.fit(X_train, y_train)
[19]:
OneVsOneClassifier(estimator=DecisionTreeClassifier())
[20]:
numpy.mean(clr.predict(X_test).ravel() == y_test.ravel()) * 100
[20]:
62.58461538461538

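Better: 62.6% against 59.3% for a single tree. Let us try a random forest, alone and then wrapped in OneVsRestClassifier.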

[21]:
from sklearn.ensemble import RandomForestClassifier

clr = RandomForestClassifier()
clr.fit(X_train, y_train)
[21]:
RandomForestClassifier()
[23]:
numpy.mean(clr.predict(X_test).ravel() == y_test.ravel()) * 100
[23]:
69.2923076923077
[24]:
clr = OneVsRestClassifier(RandomForestClassifier())
clr.fit(X_train, y_train)
[24]:
OneVsRestClassifier(estimator=RandomForestClassifier())
[25]:
numpy.mean(clr.predict(X_test).ravel() == y_test.ravel()) * 100
[25]:
69.41538461538461

Close. Cross-validation is needed to refine the comparison.
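
Such a cross-validation could look like the sketch below, which compares a random forest with and without the one-vs-rest wrapper using `cross_val_score`. The data is synthetic (`make_classification`), and the number of folds and trees are arbitrary choices made to keep the example fast. None of this reproduces the wine results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 3-class problem, for illustration only.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

models = {
    "RandomForest": RandomForestClassifier(n_estimators=30, random_state=0),
    "OneVsRest(RandomForest)": OneVsRestClassifier(
        RandomForestClassifier(n_estimators=30, random_state=0)
    ),
}

results = {}
for name, model in models.items():
    # 5 folds; classifiers get stratified folds by default.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Comparing the mean scores against their standard deviations across folds gives a first idea of whether the two variants really differ.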

[26]:
from sklearn.neural_network import MLPClassifier

clr = MLPClassifier(hidden_layer_sizes=30, max_iter=600)
clr.fit(X_train, y_train)
[26]:
MLPClassifier(hidden_layer_sizes=30, max_iter=600)
[27]:
numpy.mean(clr.predict(X_test).ravel() == y_test.ravel()) * 100
[27]:
52.800000000000004
[28]:
clr = OneVsRestClassifier(MLPClassifier(hidden_layer_sizes=30, max_iter=600))
clr.fit(X_train, y_train)
[28]:
OneVsRestClassifier(estimator=MLPClassifier(hidden_layer_sizes=30,
                                            max_iter=600))
[29]:
numpy.mean(clr.predict(X_test).ravel() == y_test.ravel()) * 100
[29]:
52.800000000000004

Not impressive: both variants score 52.8%, below the logistic regression.
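
One thing worth trying, offered as a suggestion rather than a result from this notebook: neural networks are usually sensitive to the scale of their inputs, and the features above were passed to the model without scaling. The sketch below, on synthetic data, assumes a `StandardScaler` placed in front of the network with `make_pipeline`. The rescaled first feature is an artificial way to mimic columns with very different ranges.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem, for illustration only.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X[:, 0] *= 1000  # put one feature on a very different scale

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler is fitted on the training data only, then applied to the test data.
pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(30,), max_iter=600, random_state=0),
)
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test) * 100
print(score)
```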
