Multi-class classification
We want to predict the quality score of a wine with a multi-class classifier.
[1]:
%matplotlib inline
[3]:
from teachpyx.datasets import load_wines_dataset
df = load_wines_dataset()
X = df.drop(["quality", "color"], axis=1)
y = df["quality"]
[4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
[6]:
from sklearn.linear_model import LogisticRegression
clr = LogisticRegression(solver="liblinear")
clr.fit(X_train, y_train)
[6]:
LogisticRegression(solver='liblinear')
[7]:
import numpy
numpy.mean(clr.predict(X_test).ravel() == y_test.ravel()) * 100
[7]:
55.07692307692308
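The same score can be obtained with scikit-learn's accuracy_score, an equivalent computation to the numpy.mean line above:

from sklearn.metrics import accuracy_score

# Fraction of exact matches, same value as the comparison above.
accuracy_score(y_test, clr.predict(X_test)) * 100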
Let's look at the confusion matrix.
[8]:
from sklearn.metrics import confusion_matrix
import pandas
pandas.DataFrame(confusion_matrix(y_test, clr.predict(X_test)))
[8]:
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 4 | 7 | 0 | 0 | 0 |
| 1 | 0 | 0 | 47 | 22 | 0 | 0 | 0 |
| 2 | 0 | 0 | 332 | 184 | 0 | 0 | 0 |
| 3 | 0 | 0 | 169 | 541 | 10 | 0 | 0 |
| 4 | 0 | 0 | 19 | 217 | 22 | 0 | 0 |
| 5 | 0 | 0 | 3 | 42 | 5 | 0 | 0 |
| 6 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
Let's display it differently, with the class names.
[9]:
conf = confusion_matrix(y_test, clr.predict(X_test))
dfconf = pandas.DataFrame(conf)
labels = list(clr.classes_)
if len(labels) < dfconf.shape[1]:
    # Class 9 is very rare, so it is sometimes absent from the training
    # set and therefore from clr.classes_.
    labels += [9]
elif len(labels) > dfconf.shape[1]:
    # The opposite case: classes_ has labels missing from the matrix.
    labels = labels[: dfconf.shape[1]]
dfconf.columns = labels
dfconf.index = labels
dfconf
[9]:
|   | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|
| 3 | 0 | 0 | 4 | 7 | 0 | 0 | 0 |
| 4 | 0 | 0 | 47 | 22 | 0 | 0 | 0 |
| 5 | 0 | 0 | 332 | 184 | 0 | 0 | 0 |
| 6 | 0 | 0 | 169 | 541 | 10 | 0 | 0 |
| 7 | 0 | 0 | 19 | 217 | 22 | 0 | 0 |
| 8 | 0 | 0 | 3 | 42 | 5 | 0 | 0 |
| 9 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
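Note in passing that confusion_matrix accepts a labels parameter, which avoids the fix-up above: passing the label set computed on the full dataset yields a matrix indexed by every class, including those absent from the training split. A minimal sketch:

import numpy

all_labels = numpy.unique(y)  # every class present in the full dataset
conf_full = confusion_matrix(y_test, clr.predict(X_test), labels=all_labels)
pandas.DataFrame(conf_full, index=all_labels, columns=all_labels)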
Not great: the model only ever predicts classes 5, 6 and 7. Let's apply the OneVsRestClassifier strategy.
[10]:
from sklearn.multiclass import OneVsRestClassifier
clr = OneVsRestClassifier(LogisticRegression(solver="liblinear"))
clr.fit(X_train, y_train)
[10]:
OneVsRestClassifier(estimator=LogisticRegression(solver='liblinear'))
[11]:
numpy.mean(clr.predict(X_test).ravel() == y_test.ravel()) * 100
[11]:
54.95384615384615
The multi-class logistic regression is equivalent to the OneVsRest strategy here: with solver="liblinear", scikit-learn already fits one binary model per class, hence the nearly identical score. Let's try the other strategy.
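A quick way to verify the equivalence is to compare the predictions of the two fitted models on the same split; a small check, reusing the objects defined above:

direct = LogisticRegression(solver="liblinear").fit(X_train, y_train)
ovr = OneVsRestClassifier(LogisticRegression(solver="liblinear")).fit(X_train, y_train)

# Both fit one binary model per class, so the predictions should agree
# on (almost) every test sample.
numpy.mean(direct.predict(X_test) == ovr.predict(X_test)) * 100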
[12]:
from sklearn.multiclass import OneVsOneClassifier
clr = OneVsOneClassifier(LogisticRegression(solver="liblinear"))
clr.fit(X_train, y_train)
[12]:
OneVsOneClassifier(estimator=LogisticRegression(solver='liblinear'))
[13]:
numpy.mean(clr.predict(X_test).ravel() == y_test.ravel()) * 100
[13]:
55.138461538461534
[14]:
conf = confusion_matrix(y_test, clr.predict(X_test))
dfconf = pandas.DataFrame(conf)
labels = list(clr.classes_)
if len(labels) < dfconf.shape[1]:
    # Class 9 is very rare, so it is sometimes absent from the training
    # set and therefore from clr.classes_.
    labels += [9]
elif len(labels) > dfconf.shape[1]:
    # The opposite case: classes_ has labels missing from the matrix.
    labels = labels[: dfconf.shape[1]]
dfconf.columns = labels
dfconf.index = labels
dfconf
[14]:
|   | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|
| 3 | 0 | 0 | 5 | 6 | 0 | 0 | 0 |
| 4 | 0 | 0 | 46 | 23 | 0 | 0 | 0 |
| 5 | 0 | 0 | 332 | 183 | 1 | 0 | 0 |
| 6 | 0 | 0 | 169 | 524 | 27 | 0 | 0 |
| 7 | 0 | 0 | 18 | 200 | 40 | 0 | 0 |
| 8 | 0 | 0 | 6 | 32 | 12 | 0 | 0 |
| 9 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
Roughly the same, and probably not significantly different. Let's try a decision tree.
[15]:
from sklearn.tree import DecisionTreeClassifier
clr = DecisionTreeClassifier()
clr.fit(X_train, y_train)
[15]:
DecisionTreeClassifier()
[16]:
numpy.mean(clr.predict(X_test).ravel() == y_test.ravel()) * 100
[16]:
59.323076923076925
And with OneVsRestClassifier:
[17]:
clr = OneVsRestClassifier(DecisionTreeClassifier())
clr.fit(X_train, y_train)
[17]:
OneVsRestClassifier(estimator=DecisionTreeClassifier())
[18]:
numpy.mean(clr.predict(X_test).ravel() == y_test.ravel()) * 100
[18]:
53.35384615384615
And with OneVsOneClassifier:
[19]:
clr = OneVsOneClassifier(DecisionTreeClassifier())
clr.fit(X_train, y_train)
[19]:
OneVsOneClassifier(estimator=DecisionTreeClassifier())
[20]:
numpy.mean(clr.predict(X_test).ravel() == y_test.ravel()) * 100
[20]:
62.58461538461538
Better.
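The gain is consistent with how the strategy works: one-versus-one fits one classifier per pair of classes, K(K-1)/2 models instead of the K models of one-versus-rest, and each pairwise problem is smaller and easier. A quick check on the fitted model:

# With K = 7 quality levels in the training set, one-versus-one trains
# 7 * 6 / 2 = 21 trees (fewer if a rare class is missing from the split).
len(clr.estimators_)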
[21]:
from sklearn.ensemble import RandomForestClassifier
clr = RandomForestClassifier()
clr.fit(X_train, y_train)
[21]:
RandomForestClassifier()
[23]:
numpy.mean(clr.predict(X_test).ravel() == y_test.ravel()) * 100
[23]:
69.2923076923077
[24]:
clr = OneVsRestClassifier(RandomForestClassifier())
clr.fit(X_train, y_train)
[24]:
OneVsRestClassifier(estimator=RandomForestClassifier())
[25]:
numpy.mean(clr.predict(X_test).ravel() == y_test.ravel()) * 100
[25]:
69.41538461538461
Very close; a cross-validation would be needed to decide between the two.
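A minimal sketch of that cross-validation, comparing both variants with cross_val_score on the same data (5 folds, an arbitrary choice):

from sklearn.model_selection import cross_val_score

for name, model in [
    ("RandomForest", RandomForestClassifier()),
    ("OvR(RandomForest)", OneVsRestClassifier(RandomForestClassifier())),
]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean() * 100, 2), "+/-", round(scores.std() * 100, 2))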
[26]:
from sklearn.neural_network import MLPClassifier
clr = MLPClassifier(hidden_layer_sizes=30, max_iter=600)
clr.fit(X_train, y_train)
[26]:
MLPClassifier(hidden_layer_sizes=30, max_iter=600)
[27]:
numpy.mean(clr.predict(X_test).ravel() == y_test.ravel()) * 100
[27]:
52.800000000000004
[28]:
clr = OneVsRestClassifier(MLPClassifier(hidden_layer_sizes=30, max_iter=600))
clr.fit(X_train, y_train)
[28]:
OneVsRestClassifier(estimator=MLPClassifier(hidden_layer_sizes=30, max_iter=600))
[29]:
numpy.mean(clr.predict(X_test).ravel() == y_test.ravel()) * 100
[29]:
52.800000000000004
Not impressive.
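The neural network is the only model tried here that is sensitive to the scale of the features, which likely explains the low score on raw data. A possible follow-up, not run in this notebook, is to standardize the inputs first; a sketch assuming a StandardScaler pipeline:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardizing the features usually helps gradient-based models
# such as MLPClassifier converge.
pipe = make_pipeline(
    StandardScaler(), MLPClassifier(hidden_layer_sizes=30, max_iter=600)
)
pipe.fit(X_train, y_train)
numpy.mean(pipe.predict(X_test) == y_test) * 100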