Multi-class classification
We want to predict a wine's quality score with a multi-class classifier.
[1]:
%matplotlib inline
[2]:
from teachpyx.datasets import load_wines_dataset
df = load_wines_dataset()
X = df.drop(["quality", "color"], axis=1)
y = df["quality"]
[3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
[4]:
from sklearn.linear_model import LogisticRegression
clr = LogisticRegression()
clr.fit(X_train, y_train)
~/vv/this312/lib/python3.12/site-packages/sklearn/linear_model/_logistic.py:406: ConvergenceWarning: lbfgs failed to converge after 100 iteration(s) (status=1):
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT
Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
[4]:
LogisticRegression()
[5]:
import numpy
numpy.mean(clr.predict(X_test).ravel() == y_test.values.ravel()) * 100
[5]:
np.float64(45.84615384615385)
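About 46% accuracy, and the warning above shows the solver did not converge. As the warning itself suggests, scaling the features usually fixes this; a minimal sketch with a pipeline (the max_iter value is arbitrary):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardized features usually let lbfgs converge within its iteration budget.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)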
Let's look at the confusion matrix.
[6]:
from sklearn.metrics import confusion_matrix
import pandas
pandas.DataFrame(confusion_matrix(y_test, clr.predict(X_test)))
[6]:
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 |
| 1 | 0 | 0 | 23 | 37 | 1 | 0 | 0 |
| 2 | 0 | 0 | 224 | 296 | 0 | 0 | 0 |
| 3 | 0 | 0 | 196 | 521 | 0 | 0 | 0 |
| 4 | 0 | 0 | 36 | 233 | 0 | 0 | 0 |
| 5 | 0 | 0 | 6 | 45 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
Let's display it differently, with the class names.
[7]:
conf = confusion_matrix(y_test, clr.predict(X_test))
dfconf = pandas.DataFrame(conf)
labels = list(clr.classes_)
if len(labels) < dfconf.shape[1]:
    # Class 9 is very rare, so it is sometimes absent from the train set.
    labels += [9]
elif len(labels) > dfconf.shape[1]:
    # The opposite case: drop the labels beyond the matrix size.
    labels = labels[: dfconf.shape[1]]
dfconf.columns = labels
dfconf.index = labels
dfconf
[7]:
|   | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|
| 3 | 0 | 0 | 0 | 5 | 0 | 0 | 0 |
| 4 | 0 | 0 | 23 | 37 | 1 | 0 | 0 |
| 5 | 0 | 0 | 224 | 296 | 0 | 0 | 0 |
| 6 | 0 | 0 | 196 | 521 | 0 | 0 | 0 |
| 7 | 0 | 0 | 36 | 233 | 0 | 0 | 0 |
| 8 | 0 | 0 | 6 | 45 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
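scikit-learn can also plot a labeled confusion matrix directly; a sketch with ConfusionMatrixDisplay, which infers the labels from the data:
from sklearn.metrics import ConfusionMatrixDisplay

# Draws the confusion matrix as a heatmap with class labels on both axes.
ConfusionMatrixDisplay.from_predictions(y_test, clr.predict(X_test))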
Not great. Let's apply the OneVsRestClassifier strategy.
[8]:
from sklearn.multiclass import OneVsRestClassifier
clr = OneVsRestClassifier(LogisticRegression(solver="liblinear"))
clr.fit(X_train, y_train)
[8]:
OneVsRestClassifier(estimator=LogisticRegression(solver='liblinear'))
[9]:
numpy.mean(clr.predict(X_test).ravel() == y_test.values.ravel()) * 100
[9]:
np.float64(51.323076923076925)
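The gain over the first model likely comes from the solver rather than from the wrapper: with solver="liblinear", LogisticRegression already fits one binary model per class internally. A quick sanity check (sketch, reusing the split above):
# Both models fit one-vs-rest binary logistic regressions,
# so their predictions should agree on (almost) every test point.
lr = LogisticRegression(solver="liblinear").fit(X_train, y_train)
ovr = OneVsRestClassifier(LogisticRegression(solver="liblinear")).fit(X_train, y_train)
numpy.mean(lr.predict(X_test) == ovr.predict(X_test))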
Multi-class logistic regression is thus essentially equivalent to the OneVsRest strategy here. Let's try the other one.
[10]:
from sklearn.multiclass import OneVsOneClassifier
clr = OneVsOneClassifier(LogisticRegression(solver="liblinear"))
clr.fit(X_train, y_train)
[10]:
OneVsOneClassifier(estimator=LogisticRegression(solver='liblinear'))
[11]:
numpy.mean(clr.predict(X_test).ravel() == y_test.values.ravel()) * 100
[11]:
np.float64(52.246153846153845)
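One difference worth noting: OneVsOneClassifier trains one binary classifier per pair of classes, k(k-1)/2 of them for k classes, against k for OneVsRest. They can be counted on the fitted model:
# With k classes in the training set, this list holds k * (k - 1) / 2 estimators.
len(clr.estimators_)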
[12]:
conf = confusion_matrix(y_test, clr.predict(X_test))
dfconf = pandas.DataFrame(conf)
labels = list(clr.classes_)
if len(labels) < dfconf.shape[1]:
    # Class 9 is very rare, so it is sometimes absent from the train set.
    labels += [9]
elif len(labels) > dfconf.shape[1]:
    # The opposite case: drop the labels beyond the matrix size.
    labels = labels[: dfconf.shape[1]]
dfconf.columns = labels
dfconf.index = labels
dfconf
[12]:
|   | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|
| 3 | 0 | 0 | 3 | 2 | 0 | 0 | 0 |
| 4 | 0 | 0 | 40 | 20 | 1 | 0 | 0 |
| 5 | 0 | 0 | 304 | 212 | 3 | 1 | 0 |
| 6 | 0 | 0 | 182 | 508 | 27 | 0 | 0 |
| 7 | 0 | 0 | 22 | 210 | 37 | 0 | 0 |
| 8 | 0 | 0 | 7 | 35 | 9 | 0 | 0 |
| 9 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
Roughly the same, and probably not significantly different. Let's try a decision tree.
[13]:
from sklearn.tree import DecisionTreeClassifier
clr = DecisionTreeClassifier()
clr.fit(X_train, y_train)
[13]:
DecisionTreeClassifier()
[14]:
numpy.mean(clr.predict(X_test).ravel() == y_test.values.ravel()) * 100
[14]:
np.float64(59.07692307692308)
And with OneVsRestClassifier:
[15]:
clr = OneVsRestClassifier(DecisionTreeClassifier())
clr.fit(X_train, y_train)
[15]:
OneVsRestClassifier(estimator=DecisionTreeClassifier())
[16]:
numpy.mean(clr.predict(X_test).ravel() == y_test.values.ravel()) * 100
[16]:
np.float64(53.41538461538462)
And with OneVsOneClassifier:
[17]:
clr = OneVsOneClassifier(DecisionTreeClassifier())
clr.fit(X_train, y_train)
[17]:
OneVsOneClassifier(estimator=DecisionTreeClassifier())
[18]:
numpy.mean(clr.predict(X_test).ravel() == y_test.values.ravel()) * 100
[18]:
np.float64(58.83076923076923)
Better.
[19]:
from sklearn.ensemble import RandomForestClassifier
clr = RandomForestClassifier()
clr.fit(X_train, y_train)
[19]:
RandomForestClassifier()
[20]:
numpy.mean(clr.predict(X_test).ravel() == y_test.values.ravel()) * 100
[20]:
np.float64(66.4)
[21]:
clr = OneVsRestClassifier(RandomForestClassifier())
clr.fit(X_train, y_train)
[21]:
OneVsRestClassifier(estimator=RandomForestClassifier())
[22]:
numpy.mean(clr.predict(X_test).ravel() == y_test.values.ravel()) * 100
[22]:
np.float64(66.27692307692308)
Close; deciding between the two would require cross-validation.
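A minimal sketch of such a comparison (5 folds, accuracy as the score; both calls use the same deterministic folds):
from sklearn.model_selection import cross_val_score

# Mean and spread of the accuracy over the folds, for each strategy.
direct = cross_val_score(RandomForestClassifier(), X, y, cv=5)
ovr = cross_val_score(OneVsRestClassifier(RandomForestClassifier()), X, y, cv=5)
direct.mean(), direct.std(), ovr.mean(), ovr.std()
Let's now try a neural network.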
[23]:
from sklearn.neural_network import MLPClassifier
clr = MLPClassifier(hidden_layer_sizes=30, max_iter=600)
clr.fit(X_train, y_train)
[23]:
MLPClassifier(hidden_layer_sizes=30, max_iter=600)
[24]:
numpy.mean(clr.predict(X_test).ravel() == y_test.values.ravel()) * 100
[24]:
np.float64(49.66153846153846)
[25]:
clr = OneVsRestClassifier(MLPClassifier(hidden_layer_sizes=30, max_iter=600))
clr.fit(X_train, y_train)
[25]:
OneVsRestClassifier(estimator=MLPClassifier(hidden_layer_sizes=30,
                                            max_iter=600))
[26]:
numpy.mean(clr.predict(X_test).ravel() == y_test.values.ravel()) * 100
[26]:
np.float64(50.153846153846146)
Not dazzling.
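Neural networks are sensitive to feature scale, and the inputs were never standardized here. A sketch of the same architecture behind a scaler, which would likely change the picture:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The scaler is fit on the train set only and reapplied to the test set.
pipe = make_pipeline(
    StandardScaler(), MLPClassifier(hidden_layer_sizes=30, max_iter=600)
)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)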