{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Prétraitement des catégories ou des dates\n", "\n", "Comment convertir des catégories ou des dates en features ? That is the question.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TableReport\n", "\n", "Le module [skrub](https://skrub-data.org/sable/) propose des outils assez pratiques pour prendre vite la mesure d'un jeu de données." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Processing column 13 / 13\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "

Please enable javascript

\n", "

\n", " The skrub table reports need javascript to display correctly. If you are\n", " displaying a report in a Jupyter notebook and you see this message, you may need to\n", " re-execute the cell or to trust the notebook (button on the top right or\n", " \"File > Trust notebook\").\n", "

\n", "
\n", "\n", "" ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from skrub import TableReport\n", "from teachpyx.datasets import load_wines_dataset\n", "\n", "df = load_wines_dataset()\n", "TableReport(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Catégories\n", "\n", "Les catégories sont assez simples à transformer en variable numériques. La plus populaire des transformations est celle ou une catégorie est diluée sur plusieurs colonnes, une par catégorie : [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas\n", "from sklearn.preprocessing import OneHotEncoder\n", "\n", "data = pandas.DataFrame([{\"A\": \"cat1\"}, {\"A\": \"cat2\"}, {\"A\": \"cat3\"}, {\"A\": \"cat1\"}])\n", "ohe = OneHotEncoder()\n", "ohe.fit(data)\n", "ohe.transform(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Sparse](https://fr.wikipedia.org/wiki/Matrice_creuse) avez-vous dit ?" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "matrix([[1., 0., 0.],\n", " [0., 1., 0.],\n", " [0., 0., 1.],\n", " [1., 0., 0.]])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ohe.transform(data).todense()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cette approche fonctionne bien excepté que si il y a beaucoup de catégories ou beaucoup de colonnes catégorielles, le nombre de colonnes explose. 
To mitigate this, one can compress the number of columns by hashing, by using a binary representation, by dropping under-represented categories, by grouping them, or by replacing each category with a numeric value ([TargetEncoder](https://contrib.scikit-learn.org/category_encoders/targetencoder.html)). A PCA is another option... There is no good or bad solution in the general case; one has to experiment." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Misspelled categories\n", "\n", "When they are misspelled, categories multiply. In that case, one can either try to fix the errors by hand or live with them, for example with [SimilarityEncoder](https://skrub-data.org/stable/reference/generated/skrub.SimilarityEncoder.html). This estimator relies on the similarity between words or characters." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.04545455, 1. , 0.8125 , 0.6875 ],\n", " [0.14285714, 0.8125 , 1. , 0.55555556],\n", " [0.04761905, 0.6875 , 0.55555556, 1. ],\n", " [1. , 0.04545455, 0.14285714, 0.04761905]])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas\n", "from skrub import SimilarityEncoder\n", "\n", "data = pandas.DataFrame(\n", " [\n", " {\"A\": \"data scientist\"},\n", " {\"A\": \"data scientiste\"},\n", " {\"A\": \"datascientist\"},\n", " {\"A\": \"alpiniste\"},\n", " ]\n", ")\n", "sim = SimilarityEncoder()\n", "sim.fit(data)\n", "sim.transform(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Other options, such as skrub's StringEncoder:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
A_0A_1
00.943822-0.123520
10.9078260.113797
20.824866-0.170536
30.1570730.980073
\n", "
" ], "text/plain": [ " A_0 A_1\n", "0 0.943822 -0.123520\n", "1 0.907826 0.113797\n", "2 0.824866 -0.170536\n", "3 0.157073 0.980073" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas\n", "from skrub import StringEncoder\n", "\n", "data = pandas.DataFrame(\n", " [\n", " {\"A\": \"data scientist\"},\n", " {\"A\": \"data scientiste\"},\n", " {\"A\": \"datascientist\"},\n", " {\"A\": \"alpiniste\"},\n", " ]\n", ")\n", "sim = StringEncoder(2)\n", "sim.fit(data.A)\n", "sim.transform(data.A)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dates\n", "\n", "Les dates sont toujours à prendre avec des pincettes. Si les données sont corrélées avec le temps, cela montre qu'il y a une tendance mais il y a toujours un risque que le modèle apprennent un comportement attaché à une époque précise, altérant ses performances dans le futur. Il faut donc distinguer ce qui est une tendance et ce qui est lié à la saisonnalité, le jour de la semaine, le mois de l'année. La saisonnalité est une information qui se répète. Aucune année passé ne revient donc l'année est une information qui ne devrait pas faire partie des bases d'apprentissage. L'objet [DatetimeEncoder](https://skrub-data.org/stable/reference/generated/skrub.DatetimeEncoder.html) automatise cela mais le plus simple est sans doute d'utiliser le module [datetime](https://docs.python.org/3/library/datetime.html)." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
login_yearlogin_monthlogin_daylogin_hourlogin_total_secondslogin_weekday
02024.05.013.012.01.715602e+091.0
1NaNNaNNaNNaNNaNNaN
22024.05.015.013.01.715781e+093.0
\n", "
" ], "text/plain": [ " login_year login_month login_day login_hour login_total_seconds \\\n", "0 2024.0 5.0 13.0 12.0 1.715602e+09 \n", "1 NaN NaN NaN NaN NaN \n", "2 2024.0 5.0 15.0 13.0 1.715781e+09 \n", "\n", " login_weekday \n", "0 1.0 \n", "1 NaN \n", "2 3.0 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas\n", "from skrub import DatetimeEncoder\n", "\n", "login = pandas.to_datetime(\n", " pandas.Series([\"2024-05-13T12:05:36\", None, \"2024-05-15T13:46:02\"], name=\"login\")\n", ")\n", "dt = DatetimeEncoder(add_weekday=True)\n", "dt.fit(login)\n", "dt.transform(login)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.8" } }, "nbformat": 4, "nbformat_minor": 2 }