Multiclass classification

We want to predict a wine's quality grade with a multiclass classifier.

[1]:
%matplotlib inline
[2]:
from teachpyx.datasets import load_wines_dataset

df = load_wines_dataset()
X = df.drop(["quality", "color"], axis=1)
y = df["quality"]
[3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
[4]:
from sklearn.linear_model import LogisticRegression

clr = LogisticRegression()
clr.fit(X_train, y_train)
~/vv/this312/lib/python3.12/site-packages/sklearn/linear_model/_logistic.py:406: ConvergenceWarning: lbfgs failed to converge after 100 iteration(s) (status=1):
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[4]:
LogisticRegression()
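The warning suggests increasing max_iter and scaling the data. A minimal sketch of that fix, not part of the original run (the resulting score may differ):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardized features usually let lbfgs converge within its iteration budget.
clr_scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clr_scaled.fit(X_train, y_train)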
[5]:
import numpy

numpy.mean(clr.predict(X_test).ravel() == y_test.values.ravel()) * 100
[5]:
np.float64(45.84615384615385)
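This average of exact matches is simply the accuracy; sklearn.metrics.accuracy_score returns the same value:

from sklearn.metrics import accuracy_score

# Equivalent to the numpy.mean comparison above, as a percentage.
accuracy_score(y_test, clr.predict(X_test)) * 100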

Let's look at the confusion matrix.

[6]:
from sklearn.metrics import confusion_matrix
import pandas

pandas.DataFrame(confusion_matrix(y_test, clr.predict(X_test)))
[6]:
0 1 2 3 4 5 6
0 0 0 0 5 0 0 0
1 0 0 23 37 1 0 0
2 0 0 224 296 0 0 0
3 0 0 196 521 0 0 0
4 0 0 36 233 0 0 0
5 0 0 6 45 0 0 0
6 0 0 0 2 0 0 0

Let's display it differently, with the class names.

[7]:
conf = confusion_matrix(y_test, clr.predict(X_test))
dfconf = pandas.DataFrame(conf)
labels = list(clr.classes_)
if len(labels) < dfconf.shape[1]:
    # Quality 9 is very rare, so it is sometimes absent from the training set.
    labels += [9]
elif len(labels) > dfconf.shape[1]:
    labels = labels[: dfconf.shape[1]]  # or the other way around
dfconf.columns = labels
dfconf.index = labels
dfconf
[7]:
3 4 5 6 7 8 9
3 0 0 0 5 0 0 0
4 0 0 23 37 1 0 0
5 0 0 224 296 0 0 0
6 0 0 196 521 0 0 0
7 0 0 36 233 0 0 0
8 0 0 6 45 0 0 0
9 0 0 0 2 0 0 0
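Patching the labels after the fact is fragile. confusion_matrix accepts a labels parameter, so a more robust sketch fixes the label set upfront, assuming every quality level appears somewhere in the full target y:

all_labels = numpy.unique(y)  # every quality level in the full dataset
conf = confusion_matrix(y_test, clr.predict(X_test), labels=all_labels)
pandas.DataFrame(conf, columns=all_labels, index=all_labels)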

Not great. Let's apply the OneVsRestClassifier strategy.

[8]:
from sklearn.multiclass import OneVsRestClassifier

clr = OneVsRestClassifier(LogisticRegression(solver="liblinear"))
clr.fit(X_train, y_train)
[8]:
OneVsRestClassifier(estimator=LogisticRegression(solver='liblinear'))
[9]:
numpy.mean(clr.predict(X_test).ravel() == y_test.values.ravel()) * 100
[9]:
np.float64(51.323076923076925)

scikit-learn's LogisticRegression handles the multiclass case natively and plays the same role as the OneVsRest strategy: with the liblinear solver it actually trains one binary model per class. Let's try the other strategy, OneVsOne, which trains one classifier per pair of classes (with 7 quality levels, 7 × 6 / 2 = 21 models instead of 7).

[10]:
from sklearn.multiclass import OneVsOneClassifier

clr = OneVsOneClassifier(LogisticRegression(solver="liblinear"))
clr.fit(X_train, y_train)
[10]:
OneVsOneClassifier(estimator=LogisticRegression(solver='liblinear'))
[11]:
numpy.mean(clr.predict(X_test).ravel() == y_test.values.ravel()) * 100
[11]:
np.float64(52.246153846153845)
[12]:
conf = confusion_matrix(y_test, clr.predict(X_test))
dfconf = pandas.DataFrame(conf)
labels = list(clr.classes_)
if len(labels) < dfconf.shape[1]:
    # Quality 9 is very rare, so it is sometimes absent from the training set.
    labels += [9]
elif len(labels) > dfconf.shape[1]:
    labels = labels[: dfconf.shape[1]]  # or the other way around
dfconf.columns = labels
dfconf.index = labels
dfconf
[12]:
3 4 5 6 7 8 9
3 0 0 3 2 0 0 0
4 0 0 40 20 1 0 0
5 0 0 304 212 3 1 0
6 0 0 182 508 27 0 0
7 0 0 22 210 37 0 0
8 0 0 7 35 9 0 0
9 0 0 0 2 0 0 0

About the same, and probably not significantly different. Let's try a decision tree.

[13]:
from sklearn.tree import DecisionTreeClassifier

clr = DecisionTreeClassifier()
clr.fit(X_train, y_train)
[13]:
DecisionTreeClassifier()
[14]:
numpy.mean(clr.predict(X_test).ravel() == y_test.values.ravel()) * 100
[14]:
np.float64(59.07692307692308)

And with OneVsRestClassifier:

[15]:
clr = OneVsRestClassifier(DecisionTreeClassifier())
clr.fit(X_train, y_train)
[15]:
OneVsRestClassifier(estimator=DecisionTreeClassifier())
[16]:
numpy.mean(clr.predict(X_test).ravel() == y_test.values.ravel()) * 100
[16]:
np.float64(53.41538461538462)

And with OneVsOneClassifier:

[17]:
clr = OneVsOneClassifier(DecisionTreeClassifier())
clr.fit(X_train, y_train)
[17]:
OneVsOneClassifier(estimator=DecisionTreeClassifier())
[18]:
numpy.mean(clr.predict(X_test).ravel() == y_test.values.ravel()) * 100
[18]:
np.float64(58.83076923076923)

Better than OneVsRest, and almost back to the plain decision tree's score.

[19]:
from sklearn.ensemble import RandomForestClassifier

clr = RandomForestClassifier()
clr.fit(X_train, y_train)
[19]:
RandomForestClassifier()
[20]:
numpy.mean(clr.predict(X_test).ravel() == y_test.values.ravel()) * 100
[20]:
np.float64(66.4)
[21]:
clr = OneVsRestClassifier(RandomForestClassifier())
clr.fit(X_train, y_train)
[21]:
OneVsRestClassifier(estimator=RandomForestClassifier())
[22]:
numpy.mean(clr.predict(X_test).ravel() == y_test.values.ravel()) * 100
[22]:
np.float64(66.27692307692308)

Close; deciding between the two would require a finer comparison with cross-validation.
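A sketch of such a comparison with cross_val_score; the 5-fold choice is arbitrary and the scores vary from run to run:

from sklearn.model_selection import cross_val_score

for model in [RandomForestClassifier(), OneVsRestClassifier(RandomForestClassifier())]:
    scores = cross_val_score(model, X, y, cv=5)
    print(model.__class__.__name__, scores.mean(), "+/-", scores.std())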

[23]:
from sklearn.neural_network import MLPClassifier

clr = MLPClassifier(hidden_layer_sizes=30, max_iter=600)
clr.fit(X_train, y_train)
[23]:
MLPClassifier(hidden_layer_sizes=30, max_iter=600)
[24]:
numpy.mean(clr.predict(X_test).ravel() == y_test.values.ravel()) * 100
[24]:
np.float64(49.66153846153846)
[25]:
clr = OneVsRestClassifier(MLPClassifier(hidden_layer_sizes=30, max_iter=600))
clr.fit(X_train, y_train)
[25]:
OneVsRestClassifier(estimator=MLPClassifier(hidden_layer_sizes=30,
                                            max_iter=600))
[26]:
numpy.mean(clr.predict(X_test).ravel() == y_test.values.ravel()) * 100
[26]:
np.float64(50.153846153846146)

Not impressive. Neural networks are sensitive to the scale of their inputs, which a scaled pipeline can address, as sketched below.
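A sketch with standardized inputs, keeping the hidden layer size from above; the improvement is plausible but not guaranteed:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# MLPs are trained by gradient descent, so standardized features help convergence.
pipe = make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes=30, max_iter=600))
pipe.fit(X_train, y_train)
numpy.mean(pipe.predict(X_test).ravel() == y_test.values.ravel()) * 100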

