{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Gradient and discrete variables\n", "\n", "Gradient-based optimization methods rely on a differentiable error function, which should preferably be applied to real-valued random variables. This notebook explores a few ideas." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A small, simple problem\n", "\n", "We use the *iris* dataset available in [scikit-learn](http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from sklearn import datasets\n", "\n", "iris = datasets.load_iris()\n", "X = iris.data[:, :2] # we only take the first two features.\n", "Y = iris.target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We fit a logistic regression. We do not separate training and test sets because that is not the point here." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/xadupre/vv/this/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:1256: FutureWarning: 'multi_class' was deprecated in version 1.5 and will be removed in 1.7. Use OneVsRestClassifier(LogisticRegression(..)) instead. Leave it to its default value to avoid this warning.\n", " warnings.warn(\n" ] }, { "data": { "text/html": [ "
LogisticRegression(multi_class='ovr', solver='liblinear')
" ], "text/plain": [ "LogisticRegression(multi_class='ovr', solver='liblinear')" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "\n", "clf = LogisticRegression(multi_class=\"ovr\", solver=\"liblinear\")\n", "clf.fit(X, Y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we compute the confusion matrix." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[49, 1, 0],\n", " [ 2, 21, 27],\n", " [ 1, 4, 45]])" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.metrics import confusion_matrix\n", "\n", "pred = clf.predict(X)\n", "confusion_matrix(Y, pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multiplying the observations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The parameter ``multi_class='ovr'`` means that the model actually hides the estimation of 3 binary logistic regressions. Let's try to fit a single one by adding the label ``Y`` to the features. Let $(X_i \\in \\mathbb{R}^d, Y_i \\in \\mathbb{N})$ be one observation of a multi-class problem. Since there are $C$ classes, this row is replicated $C$ times, once per candidate class $c$, to obtain:\n", "\n", "$$\\forall c \\in \\{1, ..., C\\}, \\; \\left\\{ \\begin{array}{l} X_i' = (X_{i,1}, ..., X_{i,d}, Y_{i,1}, ..., Y_{i,C}) \\\\ Y_i' = \\mathbb{1}_{Y_i = c} \\\\ Y_{i,k} = \\mathbb{1}_{c = k}\\end{array} \\right.$$\n", "\n", "Let's see what this gives on an example:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
   X1   X2   Y0   Y1   Y2   Y'
0  5.1  3.5  1.0  0.0  0.0  1.0
1  5.1  3.5  0.0  1.0  0.0  0.0
2  5.1  3.5  0.0  0.0  1.0  0.0
\n", "
" ], "text/plain": [ " X1 X2 Y0 Y1 Y2 Y'\n", "0 5.1 3.5 1.0 0.0 0.0 1.0\n", "1 5.1 3.5 0.0 1.0 0.0 0.0\n", "2 5.1 3.5 0.0 0.0 1.0 0.0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy\n", "import pandas\n", "\n", "\n", "def multiplie(X, Y, classes=None):\n", "    # Stack one copy of X per candidate class i, with a one-hot encoding of i appended;\n", "    # the new binary target is 1 when i is the true class.\n", "    if classes is None:\n", "        classes = numpy.unique(Y)\n", "    XS = []\n", "    YS = []\n", "    for i in classes:\n", "        X2 = numpy.zeros((X.shape[0], 3))\n", "        X2[:, i] = 1\n", "        Yb = i == Y\n", "        XS.append(numpy.hstack([X, X2]))\n", "        Yb = Yb.reshape((len(Yb), 1))\n", "        YS.append(Yb)\n", "\n", "    Xext = numpy.vstack(XS)\n", "    Yext = numpy.vstack(YS)\n", "    return Xext, Yext\n", "\n", "\n", "x, y = multiplie(X[:1, :], Y[:1], [0, 1, 2])\n", "df = pandas.DataFrame(numpy.hstack([x, y]))\n", "df.columns = [\"X1\", \"X2\", \"Y0\", \"Y1\", \"Y2\", \"Y'\"]\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Three columns were added on the $X$ side and the row was replicated 3 times. The last column is $Y'$, which equals 1 only when the 1 in the added columns sits at the position of the true class. The classification problem, which was to predict the right class, becomes: is the class to predict $k$? Applying this to every row of the dataset gives:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
     X1   X2   Y0   Y1   Y2   Y'
381  5.5  2.4  0.0  0.0  1.0  0.0
52   6.9  3.1  1.0  0.0  0.0  0.0
153  4.6  3.1  0.0  1.0  0.0  0.0
189  5.1  3.4  0.0  1.0  0.0  0.0
397  6.2  2.9  0.0  0.0  1.0  0.0
239  5.5  2.5  0.0  1.0  0.0  1.0
108  6.7  2.5  1.0  0.0  0.0  0.0
398  5.1  2.5  0.0  0.0  1.0  0.0
22   4.6  3.6  1.0  0.0  0.0  1.0
13   4.3  3.0  1.0  0.0  0.0  1.0
\n", "
" ], "text/plain": [ " X1 X2 Y0 Y1 Y2 Y'\n", "381 5.5 2.4 0.0 0.0 1.0 0.0\n", "52 6.9 3.1 1.0 0.0 0.0 0.0\n", "153 4.6 3.1 0.0 1.0 0.0 0.0\n", "189 5.1 3.4 0.0 1.0 0.0 0.0\n", "397 6.2 2.9 0.0 0.0 1.0 0.0\n", "239 5.5 2.5 0.0 1.0 0.0 1.0\n", "108 6.7 2.5 1.0 0.0 0.0 0.0\n", "398 5.1 2.5 0.0 0.0 1.0 0.0\n", "22 4.6 3.6 1.0 0.0 0.0 1.0\n", "13 4.3 3.0 1.0 0.0 0.0 1.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Xext, Yext = multiplie(X, Y)\n", "numpy.hstack([Xext, Yext])\n", "df = pandas.DataFrame(numpy.hstack([Xext, Yext]))\n", "df.columns = [\"X1\", \"X2\", \"Y0\", \"Y1\", \"Y2\", \"Y'\"]\n", "df.iloc[numpy.random.permutation(df.index), :].head(n=10)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
GradientBoostingClassifier()
" ], "text/plain": [ "GradientBoostingClassifier()" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.ensemble import GradientBoostingClassifier\n", "\n", "clf = GradientBoostingClassifier()\n", "clf.fit(Xext, Yext.ravel())" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[278, 22],\n", " [ 25, 125]])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pred = clf.predict(Xext)\n", "confusion_matrix(Yext, pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Adding noise\n", "\n", "One issue with this method is that it adds binary variables to a problem solved with gradient-based optimization, which is not ideal. No problem, let's change things a little." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
   X1   X2   Y0        Y1        Y2        Y'
0  5.1  3.5  1.004920  0.004532  0.039157  1.0
1  5.1  3.5  0.085563  1.129575  0.121337  0.0
2  5.1  3.5  0.130275  0.174763  1.074460  0.0
\n", "
" ], "text/plain": [ " X1 X2 Y0 Y1 Y2 Y'\n", "0 5.1 3.5 1.004920 0.004532 0.039157 1.0\n", "1 5.1 3.5 0.085563 1.129575 0.121337 0.0\n", "2 5.1 3.5 0.130275 0.174763 1.074460 0.0" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def multiplie_bruit(X, Y, classes=None):\n", "    # Same as multiplie, but the one-hot columns receive uniform noise in [0, 0.2).\n", "    if classes is None:\n", "        classes = numpy.unique(Y)\n", "    XS = []\n", "    YS = []\n", "    for i in classes:\n", "        # X2 = numpy.random.randn((X.shape[0]* 3)).reshape(X.shape[0], 3) * 0.1\n", "        X2 = numpy.random.random((X.shape[0], 3)) * 0.2\n", "        X2[:, i] += 1\n", "        Yb = i == Y\n", "        XS.append(numpy.hstack([X, X2]))\n", "        Yb = Yb.reshape((len(Yb), 1))\n", "        YS.append(Yb)\n", "\n", "    Xext = numpy.vstack(XS)\n", "    Yext = numpy.vstack(YS)\n", "    return Xext, Yext\n", "\n", "\n", "x, y = multiplie_bruit(X[:1, :], Y[:1], [0, 1, 2])\n", "df = pandas.DataFrame(numpy.hstack([x, y]))\n", "df.columns = [\"X1\", \"X2\", \"Y0\", \"Y1\", \"Y2\", \"Y'\"]\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The problem is the same as before, except that the variables $Y_i$ are now real-valued. Instead of being exactly zero, they take a value $Y_i < 0.2$, and the column of the candidate class takes a value between 1 and 1.2." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
     X1   X2   Y0        Y1        Y2        Y'
212  6.0  2.2  0.149054  1.155596  0.109413  1.0
116  6.5  3.0  1.071760  0.092802  0.013911  0.0
391  6.1  3.0  0.084143  0.137336  1.063657  0.0
16   5.4  3.9  1.098201  0.064308  0.032878  1.0
229  5.7  2.6  0.126999  1.065582  0.127480  1.0
38   4.4  3.0  1.164621  0.050779  0.009277  1.0
213  6.1  2.9  0.061990  1.034818  0.047033  1.0
334  4.9  3.1  0.031713  0.141205  1.043195  0.0
54   6.5  2.8  1.066118  0.158271  0.187764  0.0
379  5.7  2.6  0.033443  0.055818  1.008779  0.0
\n", "
" ], "text/plain": [ " X1 X2 Y0 Y1 Y2 Y'\n", "212 6.0 2.2 0.149054 1.155596 0.109413 1.0\n", "116 6.5 3.0 1.071760 0.092802 0.013911 0.0\n", "391 6.1 3.0 0.084143 0.137336 1.063657 0.0\n", "16 5.4 3.9 1.098201 0.064308 0.032878 1.0\n", "229 5.7 2.6 0.126999 1.065582 0.127480 1.0\n", "38 4.4 3.0 1.164621 0.050779 0.009277 1.0\n", "213 6.1 2.9 0.061990 1.034818 0.047033 1.0\n", "334 4.9 3.1 0.031713 0.141205 1.043195 0.0\n", "54 6.5 2.8 1.066118 0.158271 0.187764 0.0\n", "379 5.7 2.6 0.033443 0.055818 1.008779 0.0" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Xextb, Yextb = multiplie_bruit(X, Y)\n", "df = pandas.DataFrame(numpy.hstack([Xextb, Yextb]))\n", "df.columns = [\"X1\", \"X2\", \"Y0\", \"Y1\", \"Y2\", \"Y'\"]\n", "df.iloc[numpy.random.permutation(df.index), :].head(n=10)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
GradientBoostingClassifier()
" ], "text/plain": [ "GradientBoostingClassifier()" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.ensemble import GradientBoostingClassifier\n", "\n", "clfb = GradientBoostingClassifier()\n", "clfb.fit(Xextb, Yextb.ravel())" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[295, 5],\n", " [ 9, 141]])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predb = clfb.predict(Xextb)\n", "confusion_matrix(Yextb, predb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is slightly better: 14 misclassified rows out of 450 on the training data, instead of 47." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparing several models\n", "\n", "We now compare the gain obtained by adding noise for different models." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 11/11 [00:01<00:00, 6.17it/s]\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
    model                       err1      err2
10  AdaBoostClassifier          0.333333  0.333333
3   DecisionTreeClassifier      0.048889  0.000000
4   ExtraTreeClassifier         0.048889  0.000000
6   ExtraTreesClassifier        0.048889  0.000000
8   GaussianNB                  0.333333  0.333333
1   GradientBoostingClassifier  0.104444  0.022222
9   KNeighborsClassifier        0.108889  0.097778
7   MLPClassifier               0.333333  0.333333
0   OneVsRestClassifier         0.333333  0.333333
2   RandomForestClassifier      0.053333  0.002222
5   XGBClassifier               0.333333  0.000000
\n", "
" ], "text/plain": [ " model err1 err2\n", "10 AdaBoostClassifier 0.333333 0.333333\n", "3 DecisionTreeClassifier 0.048889 0.000000\n", "4 ExtraTreeClassifier 0.048889 0.000000\n", "6 ExtraTreesClassifier 0.048889 0.000000\n", "8 GaussianNB 0.333333 0.333333\n", "1 GradientBoostingClassifier 0.104444 0.022222\n", "9 KNeighborsClassifier 0.108889 0.097778\n", "7 MLPClassifier 0.333333 0.333333\n", "0 OneVsRestClassifier 0.333333 0.333333\n", "2 RandomForestClassifier 0.053333 0.002222\n", "5 XGBClassifier 0.333333 0.000000" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def error(model, x, y):\n", "    # Misclassification rate of the binary problem: off-diagonal counts over the total.\n", "    p = model.predict(x)\n", "    cm = confusion_matrix(y, p)\n", "    return (cm[1, 0] + cm[0, 1]) / cm.sum()\n", "\n", "\n", "def comparaison(model, X, Y):\n", "    if isinstance(model, tuple):\n", "        clf = model[0](**model[1])\n", "        clfb = model[0](**model[1])\n", "        model = model[0]\n", "    else:\n", "        clf = model()\n", "        clfb = model()\n", "\n", "    Xext, Yext = multiplie(X, Y)\n", "    clf.fit(Xext, Yext.ravel())\n", "    err = error(clf, Xext, Yext)\n", "\n", "    Xextb, Yextb = multiplie_bruit(X, Y)\n", "    clfb.fit(Xextb, Yextb.ravel())\n", "    errb = error(clfb, Xextb, Yextb)\n", "    return dict(model=model.__name__, err1=err, err2=errb)\n", "\n", "\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier\n", "from sklearn.ensemble import (\n", "    RandomForestClassifier,\n", "    ExtraTreesClassifier,\n", "    AdaBoostClassifier,\n", ")\n", "from sklearn.neural_network import MLPClassifier\n", "from sklearn.naive_bayes import GaussianNB\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.multiclass import OneVsRestClassifier\n", "from xgboost import XGBClassifier\n", "from tqdm import tqdm\n", "\n", "models = [\n", "    (OneVsRestClassifier, dict(estimator=LogisticRegression(solver=\"liblinear\"))),\n", "    GradientBoostingClassifier,\n", "    (RandomForestClassifier, dict(n_estimators=20)),\n", "    DecisionTreeClassifier,\n", "    ExtraTreeClassifier,\n", "    XGBClassifier,\n", "    (ExtraTreesClassifier, dict(n_estimators=20)),\n", "    (MLPClassifier, dict(activation=\"logistic\")),\n", "    GaussianNB,\n", "    KNeighborsClassifier,\n", "    (\n", "        AdaBoostClassifier,\n", "        dict(\n", "            estimator=LogisticRegression(solver=\"liblinear\"),\n", "            algorithm=\"SAMME\",\n", "        ),\n", "    ),\n", "]\n", "\n", "res = []\n", "for model in tqdm(models):\n", "    res.append(comparaison(model, X, Y))\n", "df = pandas.DataFrame(res)\n", "df.sort_values(\"model\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*err1* corresponds to binary $Y_0, Y_1, Y_2$, *err2* to the same variables with a little noise. Adding noise does not seem to decrease performance and improves it in some cases. It is a lead worth following. It remains to be seen whether the models are simply learning the noise." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## With a PCA\n", "\n", "The number of components can be varied; I kept only one. The PCA is applied after adding the binary or noisy binary variables. The result is unambiguous: no model manages to learn without the added noise." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 11/11 [00:01<00:00, 5.83it/s]\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
    model                       err1      err2      modelACP                    errACP1   errACP2
10  AdaBoostClassifier          0.333333  0.333333  AdaBoostClassifier          0.333333  0.333333
3   DecisionTreeClassifier      0.048889  0.000000  DecisionTreeClassifier      0.333333  0.000000
4   ExtraTreeClassifier         0.048889  0.000000  ExtraTreeClassifier         0.333333  0.000000
6   ExtraTreesClassifier        0.048889  0.000000  ExtraTreesClassifier        0.333333  0.000000
8   GaussianNB                  0.333333  0.333333  GaussianNB                  0.333333  0.333333
1   GradientBoostingClassifier  0.104444  0.022222  GradientBoostingClassifier  0.333333  0.231111
9   KNeighborsClassifier        0.108889  0.097778  KNeighborsClassifier        0.335556  0.302222
7   MLPClassifier               0.333333  0.333333  MLPClassifier               0.333333  0.333333
0   OneVsRestClassifier         0.333333  0.333333  OneVsRestClassifier         0.333333  0.333333
2   RandomForestClassifier      0.053333  0.002222  RandomForestClassifier      0.335556  0.020000
5   XGBClassifier               0.333333  0.000000  XGBClassifier               0.333333  0.262222
\n", "
" ], "text/plain": [ " model err1 err2 \\\n", "10 AdaBoostClassifier 0.333333 0.333333 \n", "3 DecisionTreeClassifier 0.048889 0.000000 \n", "4 ExtraTreeClassifier 0.048889 0.000000 \n", "6 ExtraTreesClassifier 0.048889 0.000000 \n", "8 GaussianNB 0.333333 0.333333 \n", "1 GradientBoostingClassifier 0.104444 0.022222 \n", "9 KNeighborsClassifier 0.108889 0.097778 \n", "7 MLPClassifier 0.333333 0.333333 \n", "0 OneVsRestClassifier 0.333333 0.333333 \n", "2 RandomForestClassifier 0.053333 0.002222 \n", "5 XGBClassifier 0.333333 0.000000 \n", "\n", " modelACP errACP1 errACP2 \n", "10 AdaBoostClassifier 0.333333 0.333333 \n", "3 DecisionTreeClassifier 0.333333 0.000000 \n", "4 ExtraTreeClassifier 0.333333 0.000000 \n", "6 ExtraTreesClassifier 0.333333 0.000000 \n", "8 GaussianNB 0.333333 0.333333 \n", "1 GradientBoostingClassifier 0.333333 0.231111 \n", "9 KNeighborsClassifier 0.335556 0.302222 \n", "7 MLPClassifier 0.333333 0.333333 \n", "0 OneVsRestClassifier 0.333333 0.333333 \n", "2 RandomForestClassifier 0.335556 0.020000 \n", "5 XGBClassifier 0.333333 0.262222 " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.decomposition import PCA\n", "\n", "\n", "def comparaison_ACP(model, X, Y):\n", "    if isinstance(model, tuple):\n", "        clf = model[0](**model[1])\n", "        clfb = model[0](**model[1])\n", "        model = model[0]\n", "    else:\n", "        clf = model()\n", "        clfb = model()\n", "\n", "    axes = 1\n", "    solver = \"full\"\n", "    Xext, Yext = multiplie(X, Y)\n", "    Xext = PCA(n_components=axes, svd_solver=solver).fit_transform(Xext)\n", "    clf.fit(Xext, Yext.ravel())\n", "    err = error(clf, Xext, Yext)\n", "\n", "    Xextb, Yextb = multiplie_bruit(X, Y)\n", "    Xextb = PCA(n_components=axes, svd_solver=solver).fit_transform(Xextb)\n", "    clfb.fit(Xextb, Yextb.ravel())\n", "    errb = error(clfb, Xextb, Yextb)\n", "    return dict(modelACP=model.__name__, errACP1=err, errACP2=errb)\n", "\n", "\n", "res = []\n", "for model in tqdm(models):\n", "    res.append(comparaison_ACP(model, X, Y))\n", "dfb = pandas.DataFrame(res)\n", "pandas.concat([df.sort_values(\"model\"), dfb.sort_values(\"modelACP\")], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training and test sets\n", "\n", "This time, we look at the quality of the decision boundaries found by the models, by checking on a test set that training went well." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 11/11 [00:02<00:00, 5.40it/s]\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
    modelTT                     err_train  err2_train  err2b_train_clean  err_test  err2_test  err2b_test_clean
10  AdaBoostClassifier          0.333333   0.333333    0.333333           0.333333  0.333333   0.333333
3   DecisionTreeClassifier      0.046667   0.000000    0.566667           0.220000  0.300000   0.553333
4   ExtraTreeClassifier         0.046667   0.000000    0.263333           0.186667  0.173333   0.266667
6   ExtraTreesClassifier        0.046667   0.000000    0.213333           0.166667  0.186667   0.193333
8   GaussianNB                  0.333333   0.333333    0.333333           0.333333  0.333333   0.333333
1   GradientBoostingClassifier  0.093333   0.023333    0.306667           0.173333  0.186667   0.246667
9   KNeighborsClassifier        0.103333   0.106667    0.123333           0.133333  0.146667   0.146667
7   MLPClassifier               0.333333   0.333333    0.333333           0.333333  0.333333   0.333333
0   OneVsRestClassifier         0.333333   0.333333    0.333333           0.333333  0.333333   0.333333
2   RandomForestClassifier      0.053333   0.006667    0.183333           0.173333  0.193333   0.153333
5   XGBClassifier               0.053333   0.000000    0.210000           0.206667  0.206667   0.233333
\n", "
" ], "text/plain": [ " modelTT err_train err2_train err2b_train_clean \\\n", "10 AdaBoostClassifier 0.333333 0.333333 0.333333 \n", "3 DecisionTreeClassifier 0.046667 0.000000 0.566667 \n", "4 ExtraTreeClassifier 0.046667 0.000000 0.263333 \n", "6 ExtraTreesClassifier 0.046667 0.000000 0.213333 \n", "8 GaussianNB 0.333333 0.333333 0.333333 \n", "1 GradientBoostingClassifier 0.093333 0.023333 0.306667 \n", "9 KNeighborsClassifier 0.103333 0.106667 0.123333 \n", "7 MLPClassifier 0.333333 0.333333 0.333333 \n", "0 OneVsRestClassifier 0.333333 0.333333 0.333333 \n", "2 RandomForestClassifier 0.053333 0.006667 0.183333 \n", "5 XGBClassifier 0.053333 0.000000 0.210000 \n", "\n", " err_test err2_test err2b_test_clean \n", "10 0.333333 0.333333 0.333333 \n", "3 0.220000 0.300000 0.553333 \n", "4 0.186667 0.173333 0.266667 \n", "6 0.166667 0.186667 0.193333 \n", "8 0.333333 0.333333 0.333333 \n", "1 0.173333 0.186667 0.246667 \n", "9 0.133333 0.146667 0.146667 \n", "7 0.333333 0.333333 0.333333 \n", "0 0.333333 0.333333 0.333333 \n", "2 0.173333 0.193333 0.153333 \n", "5 0.206667 0.206667 0.233333 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "\n", "def comparaison_train_test(models, X, Y, mbruit=multiplie_bruit, acp=None):\n", "    axes = acp\n", "    solver = \"full\"\n", "\n", "    ind = numpy.random.permutation(numpy.arange(X.shape[0]))\n", "    X = X[ind, :]\n", "    Y = Y[ind]\n", "    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1.0 / 3)\n", "\n", "    res = []\n", "    for model in tqdm(models):\n", "        if isinstance(model, tuple):\n", "            clf = model[0](**model[1])\n", "            clfb = model[0](**model[1])\n", "            model = model[0]\n", "        else:\n", "            clf = model()\n", "            clfb = model()\n", "\n", "        Xext_train, Yext_train = multiplie(X_train, Y_train)\n", "        Xext_test, Yext_test = multiplie(X_test, Y_test)\n", "        if acp:\n", "            Xext_train_ = Xext_train\n", "            Xext_test_ = Xext_test\n", "            acp_model = PCA(n_components=axes, svd_solver=solver).fit(Xext_train)\n", "            Xext_train = acp_model.transform(Xext_train)\n", "            Xext_test = acp_model.transform(Xext_test)\n", "        clf.fit(Xext_train, Yext_train.ravel())\n", "\n", "        err_train = error(clf, Xext_train, Yext_train)\n", "        err_test = error(clf, Xext_test, Yext_test)\n", "\n", "        Xextb_train, Yextb_train = mbruit(X_train, Y_train)\n", "        Xextb_test, Yextb_test = mbruit(X_test, Y_test)\n", "        if acp:\n", "            acp_model = PCA(n_components=axes, svd_solver=solver).fit(Xextb_train)\n", "            Xextb_train = acp_model.transform(Xextb_train)\n", "            Xextb_test = acp_model.transform(Xextb_test)\n", "            Xext_train = acp_model.transform(Xext_train_)\n", "            Xext_test = acp_model.transform(Xext_test_)\n", "        clfb.fit(Xextb_train, Yextb_train.ravel())\n", "\n", "        errb_train = error(clfb, Xextb_train, Yextb_train)\n", "        errb_train_clean = error(clfb, Xext_train, Yext_train)\n", "        errb_test = error(clfb, Xextb_test, Yextb_test)\n", "        errb_test_clean = error(clfb, Xext_test, Yext_test)\n", "\n", "        res.append(\n", "            dict(\n", "                modelTT=model.__name__,\n", "                err_train=err_train,\n", "                err2_train=errb_train,\n", "                err_test=err_test,\n", "                err2_test=errb_test,\n", "                err2b_test_clean=errb_test_clean,\n", "                err2b_train_clean=errb_train_clean,\n", "            )\n", "        )\n", "\n", "    dfb = pandas.DataFrame(res)\n", "    dfb = dfb[\n", "        [\n", "            \"modelTT\",\n", "            \"err_train\",\n", "            \"err2_train\",\n", "            \"err2b_train_clean\",\n", "            \"err_test\",\n", "            \"err2_test\",\n", "            \"err2b_test_clean\",\n", "        ]\n", "    ]\n", "    dfb = dfb.sort_values(\"modelTT\")\n", "    return dfb\n", "\n", "\n", "dfb = comparaison_train_test(models, X, Y)\n", "dfb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The columns *err2b_train_clean* and *err2b_test_clean* are the errors of models trained on noisy columns and evaluated on noise-free columns, which is the real test. Performance is clearly degraded on the test set. One reason is that the added noise is not centered. Let's fix that." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 11/11 [00:02<00:00, 4.58it/s]\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
    modelTT                     err_train  err2_train  err2b_train_clean  err_test  err2_test  err2b_test_clean
10  AdaBoostClassifier          0.333333   0.333333    0.333333           0.333333  0.333333   0.333333
3   DecisionTreeClassifier      0.033333   0.000000    0.256667           0.200000  0.293333   0.260000
4   ExtraTreeClassifier         0.033333   0.000000    0.186667           0.186667  0.200000   0.180000
6   ExtraTreesClassifier        0.033333   0.000000    0.143333           0.146667  0.153333   0.106667
8   GaussianNB                  0.333333   0.333333    0.333333           0.333333  0.333333   0.333333
1   GradientBoostingClassifier  0.096667   0.016667    0.153333           0.140000  0.166667   0.146667
9   KNeighborsClassifier        0.113333   0.110000    0.100000           0.173333  0.140000   0.146667
7   MLPClassifier               0.333333   0.333333    0.333333           0.333333  0.333333   0.333333
0   OneVsRestClassifier         0.333333   0.333333    0.333333           0.333333  0.333333   0.333333
2   RandomForestClassifier      0.043333   0.006667    0.183333           0.153333  0.193333   0.153333
5   XGBClassifier               0.043333   0.000000    0.193333           0.206667  0.200000   0.186667
\n", "
" ], "text/plain": [ " modelTT err_train err2_train err2b_train_clean \\\n", "10 AdaBoostClassifier 0.333333 0.333333 0.333333 \n", "3 DecisionTreeClassifier 0.033333 0.000000 0.256667 \n", "4 ExtraTreeClassifier 0.033333 0.000000 0.186667 \n", "6 ExtraTreesClassifier 0.033333 0.000000 0.143333 \n", "8 GaussianNB 0.333333 0.333333 0.333333 \n", "1 GradientBoostingClassifier 0.096667 0.016667 0.153333 \n", "9 KNeighborsClassifier 0.113333 0.110000 0.100000 \n", "7 MLPClassifier 0.333333 0.333333 0.333333 \n", "0 OneVsRestClassifier 0.333333 0.333333 0.333333 \n", "2 RandomForestClassifier 0.043333 0.006667 0.183333 \n", "5 XGBClassifier 0.043333 0.000000 0.193333 \n", "\n", " err_test err2_test err2b_test_clean \n", "10 0.333333 0.333333 0.333333 \n", "3 0.200000 0.293333 0.260000 \n", "4 0.186667 0.200000 0.180000 \n", "6 0.146667 0.153333 0.106667 \n", "8 0.333333 0.333333 0.333333 \n", "1 0.140000 0.166667 0.146667 \n", "9 0.173333 0.140000 0.146667 \n", "7 0.333333 0.333333 0.333333 \n", "0 0.333333 0.333333 0.333333 \n", "2 0.153333 0.193333 0.153333 \n", "5 0.206667 0.200000 0.186667 " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def multiplie_bruit_centree(X, Y, classes=None):\n", "    # Same as multiplie_bruit, but the uniform noise is centered: it lies in [-0.1, 0.1).\n", "    if classes is None:\n", "        classes = numpy.unique(Y)\n", "    XS = []\n", "    YS = []\n", "    for i in classes:\n", "        # X2 = numpy.random.randn((X.shape[0]* 3)).reshape(X.shape[0], 3) * 0.1\n", "        X2 = numpy.random.random((X.shape[0], 3)) * 0.2 - 0.1\n", "        X2[:, i] += 1\n", "        Yb = i == Y\n", "        XS.append(numpy.hstack([X, X2]))\n", "        Yb = Yb.reshape((len(Yb), 1))\n", "        YS.append(Yb)\n", "\n", "    Xext = numpy.vstack(XS)\n", "    Yext = numpy.vstack(YS)\n", "    return Xext, Yext\n", "\n", "\n", "dfb = comparaison_train_test(models, X, Y, mbruit=multiplie_bruit_centree, acp=None)\n", "dfb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is better, but we conclude that in most cases the better performance on the training set with added noise comes from the models memorizing the data. On the test set, performance is not better. An error of 33% means that the classifier's answer is constant: always answering 0 is wrong exactly on the one replicated row out of three whose label is 1. Next, we multiply the examples." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 11/11 [00:02<00:00, 3.96it/s]\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>modelTT</th><th>err_train</th><th>err2_train</th><th>err2b_train_clean</th><th>err_test</th><th>err2_test</th><th>err2b_test_clean</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>10</th><td>AdaBoostClassifier</td><td>0.333333</td><td>0.333333</td><td>0.333333</td><td>0.333333</td><td>0.333333</td><td>0.333333</td></tr>\n",
"    <tr><th>3</th><td>DecisionTreeClassifier</td><td>0.020000</td><td>0.000000</td><td>0.090000</td><td>0.326667</td><td>0.280000</td><td>0.253333</td></tr>\n",
"    <tr><th>4</th><td>ExtraTreeClassifier</td><td>0.020000</td><td>0.000000</td><td>0.183333</td><td>0.226667</td><td>0.204000</td><td>0.293333</td></tr>\n",
"    <tr><th>6</th><td>ExtraTreesClassifier</td><td>0.020000</td><td>0.000000</td><td>0.050000</td><td>0.213333</td><td>0.194667</td><td>0.180000</td></tr>\n",
"    <tr><th>8</th><td>GaussianNB</td><td>0.333333</td><td>0.333333</td><td>0.333333</td><td>0.333333</td><td>0.333333</td><td>0.333333</td></tr>\n",
"    <tr><th>1</th><td>GradientBoostingClassifier</td><td>0.080000</td><td>0.089333</td><td>0.120000</td><td>0.186667</td><td>0.169333</td><td>0.160000</td></tr>\n",
"    <tr><th>9</th><td>KNeighborsClassifier</td><td>0.096667</td><td>0.088000</td><td>0.130000</td><td>0.173333</td><td>0.150667</td><td>0.146667</td></tr>\n",
"    <tr><th>7</th><td>MLPClassifier</td><td>0.333333</td><td>0.333333</td><td>0.333333</td><td>0.333333</td><td>0.333333</td><td>0.333333</td></tr>\n",
"    <tr><th>0</th><td>OneVsRestClassifier</td><td>0.333333</td><td>0.333333</td><td>0.333333</td><td>0.333333</td><td>0.333333</td><td>0.333333</td></tr>\n",
"    <tr><th>2</th><td>RandomForestClassifier</td><td>0.023333</td><td>0.000667</td><td>0.080000</td><td>0.206667</td><td>0.161333</td><td>0.186667</td></tr>\n",
"    <tr><th>5</th><td>XGBClassifier</td><td>0.033333</td><td>0.000000</td><td>0.076667</td><td>0.226667</td><td>0.188000</td><td>0.200000</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " modelTT err_train err2_train err2b_train_clean \\\n", "10 AdaBoostClassifier 0.333333 0.333333 0.333333 \n", "3 DecisionTreeClassifier 0.020000 0.000000 0.090000 \n", "4 ExtraTreeClassifier 0.020000 0.000000 0.183333 \n", "6 ExtraTreesClassifier 0.020000 0.000000 0.050000 \n", "8 GaussianNB 0.333333 0.333333 0.333333 \n", "1 GradientBoostingClassifier 0.080000 0.089333 0.120000 \n", "9 KNeighborsClassifier 0.096667 0.088000 0.130000 \n", "7 MLPClassifier 0.333333 0.333333 0.333333 \n", "0 OneVsRestClassifier 0.333333 0.333333 0.333333 \n", "2 RandomForestClassifier 0.023333 0.000667 0.080000 \n", "5 XGBClassifier 0.033333 0.000000 0.076667 \n", "\n", " err_test err2_test err2b_test_clean \n", "10 0.333333 0.333333 0.333333 \n", "3 0.326667 0.280000 0.253333 \n", "4 0.226667 0.204000 0.293333 \n", "6 0.213333 0.194667 0.180000 \n", "8 0.333333 0.333333 0.333333 \n", "1 0.186667 0.169333 0.160000 \n", "9 0.173333 0.150667 0.146667 \n", "7 0.333333 0.333333 0.333333 \n", "0 0.333333 0.333333 0.333333 \n", "2 0.206667 0.161333 0.186667 \n", "5 0.226667 0.188000 0.200000 " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def multiplie_bruit_centree_duplique(X, Y, classes=None):\n", " if classes is None:\n", " classes = numpy.unique(Y)\n", " XS = []\n", " YS = []\n", " for i in classes:\n", " for k in range(5):\n", " # X2 = numpy.random.randn((X.shape[0]* 3)).reshape(X.shape[0], 3) * 0.3\n", " X2 = numpy.random.random((X.shape[0], 3)) * 0.8 - 0.4\n", " X2[:, i] += 1\n", " Yb = i == Y\n", " XS.append(numpy.hstack([X, X2]))\n", " Yb = Yb.reshape((len(Yb), 1))\n", " YS.append(Yb)\n", "\n", " Xext = numpy.vstack(XS)\n", " Yext = numpy.vstack(YS)\n", " return Xext, Yext\n", "\n", "\n", "dfb = comparaison_train_test(\n", " models, X, Y, mbruit=multiplie_bruit_centree_duplique, acp=None\n", ")\n", "dfb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cela fonctionne un peu mieux le fait d'ajouter 
du hasard ne permet pas d'obtenir des gains significatifs à part pour le modèle [SVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 11/11 [00:02<00:00, 4.74it/s]\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>modelTT</th><th>err_train</th><th>err2_train</th><th>err2b_train_clean</th><th>err_test</th><th>err2_test</th><th>err2b_test_clean</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>10</th><td>AdaBoostClassifier</td><td>0.333333</td><td>0.333333</td><td>0.333333</td><td>0.333333</td><td>0.333333</td><td>0.333333</td></tr>\n",
"    <tr><th>3</th><td>DecisionTreeClassifier</td><td>0.026667</td><td>0.000000</td><td>0.180000</td><td>0.220000</td><td>0.353333</td><td>0.173333</td></tr>\n",
"    <tr><th>4</th><td>ExtraTreeClassifier</td><td>0.026667</td><td>0.000000</td><td>0.163333</td><td>0.206667</td><td>0.313333</td><td>0.220000</td></tr>\n",
"    <tr><th>6</th><td>ExtraTreesClassifier</td><td>0.026667</td><td>0.000000</td><td>0.120000</td><td>0.226667</td><td>0.206667</td><td>0.206667</td></tr>\n",
"    <tr><th>8</th><td>GaussianNB</td><td>0.333333</td><td>0.333333</td><td>0.333333</td><td>0.333333</td><td>0.333333</td><td>0.333333</td></tr>\n",
"    <tr><th>1</th><td>GradientBoostingClassifier</td><td>0.063333</td><td>0.026667</td><td>0.163333</td><td>0.213333</td><td>0.246667</td><td>0.200000</td></tr>\n",
"    <tr><th>9</th><td>KNeighborsClassifier</td><td>0.093333</td><td>0.103333</td><td>0.103333</td><td>0.173333</td><td>0.193333</td><td>0.160000</td></tr>\n",
"    <tr><th>7</th><td>MLPClassifier</td><td>0.333333</td><td>0.333333</td><td>0.333333</td><td>0.333333</td><td>0.333333</td><td>0.333333</td></tr>\n",
"    <tr><th>0</th><td>OneVsRestClassifier</td><td>0.333333</td><td>0.333333</td><td>0.333333</td><td>0.333333</td><td>0.333333</td><td>0.333333</td></tr>\n",
"    <tr><th>2</th><td>RandomForestClassifier</td><td>0.033333</td><td>0.003333</td><td>0.143333</td><td>0.200000</td><td>0.233333</td><td>0.246667</td></tr>\n",
"    <tr><th>5</th><td>XGBClassifier</td><td>0.053333</td><td>0.000000</td><td>0.160000</td><td>0.200000</td><td>0.246667</td><td>0.193333</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " modelTT err_train err2_train err2b_train_clean \\\n", "10 AdaBoostClassifier 0.333333 0.333333 0.333333 \n", "3 DecisionTreeClassifier 0.026667 0.000000 0.180000 \n", "4 ExtraTreeClassifier 0.026667 0.000000 0.163333 \n", "6 ExtraTreesClassifier 0.026667 0.000000 0.120000 \n", "8 GaussianNB 0.333333 0.333333 0.333333 \n", "1 GradientBoostingClassifier 0.063333 0.026667 0.163333 \n", "9 KNeighborsClassifier 0.093333 0.103333 0.103333 \n", "7 MLPClassifier 0.333333 0.333333 0.333333 \n", "0 OneVsRestClassifier 0.333333 0.333333 0.333333 \n", "2 RandomForestClassifier 0.033333 0.003333 0.143333 \n", "5 XGBClassifier 0.053333 0.000000 0.160000 \n", "\n", " err_test err2_test err2b_test_clean \n", "10 0.333333 0.333333 0.333333 \n", "3 0.220000 0.353333 0.173333 \n", "4 0.206667 0.313333 0.220000 \n", "6 0.226667 0.206667 0.206667 \n", "8 0.333333 0.333333 0.333333 \n", "1 0.213333 0.246667 0.200000 \n", "9 0.173333 0.193333 0.160000 \n", "7 0.333333 0.333333 0.333333 \n", "0 0.333333 0.333333 0.333333 \n", "2 0.200000 0.233333 0.246667 \n", "5 0.200000 0.246667 0.193333 " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def multiplie_bruit_centree_duplique_rebalance(X, Y, classes=None):\n", " if classes is None:\n", " classes = numpy.unique(Y)\n", " XS = []\n", " YS = []\n", " for i in classes:\n", " X2 = numpy.random.random((X.shape[0], 3)) * 0.8 - 0.4\n", " X2[:, i] += 1 # * ((i % 2) * 2 - 1)\n", " Yb = i == Y\n", " XS.append(numpy.hstack([X, X2]))\n", " Yb = Yb.reshape((len(Yb), 1))\n", " YS.append(Yb)\n", "\n", " Xext = numpy.vstack(XS)\n", " Yext = numpy.vstack(YS)\n", " return Xext, Yext\n", "\n", "\n", "dfb = comparaison_train_test(\n", " models, X, Y, mbruit=multiplie_bruit_centree_duplique_rebalance\n", ")\n", "dfb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Petite explication\n", "\n", "Dans tout le notebook, le score de la régression logistique est nul. 
Elle ne parvient pas à apprendre tout simplement parce que le problème choisi n'est pas linéaire séparable. S'il l'était, cela voudrait dire que le problème suivant l'est aussi." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array([[1., 0., 0., 1., 0., 0.],\n", " [1., 0., 0., 0., 1., 0.],\n", " [1., 0., 0., 0., 0., 1.],\n", " [0., 1., 0., 1., 0., 0.],\n", " [0., 1., 0., 0., 1., 0.],\n", " [0., 1., 0., 0., 0., 1.],\n", " [0., 0., 1., 1., 0., 0.],\n", " [0., 0., 1., 0., 1., 0.],\n", " [0., 0., 1., 0., 0., 1.]]),\n", " array([[1.],\n", " [0.],\n", " [0.],\n", " [0.],\n", " [1.],\n", " [0.],\n", " [0.],\n", " [0.],\n", " [1.]]))" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "M = numpy.zeros((9, 6))\n", "Y = numpy.zeros((9, 1))\n", "for i in range(9):\n", " M[i, i // 3] = 1\n", " M[i, i % 3 + 3] = 1\n", " Y[i] = 1 if i // 3 == i % 3 else 0\n", "M, Y" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/xadupre/vv/this/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:1256: FutureWarning: 'multi_class' was deprecated in version 1.5 and will be removed in 1.7. Use OneVsRestClassifier(LogisticRegression(..)) instead. Leave it to its default value to avoid this warning.\n", " warnings.warn(\n" ] }, { "data": { "text/html": [ "
LogisticRegression(multi_class='ovr', solver='liblinear')
" ], "text/plain": [ "LogisticRegression(multi_class='ovr', solver='liblinear')" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf = LogisticRegression(multi_class=\"ovr\", solver=\"liblinear\")\n", "clf.fit(M, Y.ravel())" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0., 0., 0., 0., 0., 0., 0., 0., 0.])" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.predict(M)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A revisiter." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 2 }