{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Plus proches voisins en grande dimension\n", "\n", "La méthodes des [plus proches voisins](https://fr.wikipedia.org/wiki/Recherche_des_plus_proches_voisins) est un algorithme assez simple. Que se passe-t-il quand la dimension de l'espace des features augmente ? Comment y remédier ? Le profiling [memory_profiler](https://pypi.python.org/pypi/memory_profiler) ou [cprofile](https://docs.python.org/3.7/library/profile.html#module-cProfile) sont des outils utiles pour savoir où le temps est perdu. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q1 : k-nn : mesurer la performance" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 3.62557523, -3.92972784, -2.19327029, -3.01669145, -3.66440003,\n", " 0.05373302, -0.09569564, -1.62733 , -3.05437465, 3.43404744],\n", " [-1.88137987, 1.1603541 , 1.97569429, 2.28962313, 1.06727548,\n", " 0.81364917, -2.15972723, -1.99923386, 0.25393473, 2.67807834],\n", " [ 1.74986482, -2.68848993, 0.83230911, -0.15836161, 0.71428315,\n", " -2.53155132, -0.49799497, -1.53866452, -2.55477724, 2.79401366]])" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.datasets import make_classification\n", "\n", "datax, datay = make_classification(\n", " 10000, n_features=10, n_classes=3, n_clusters_per_class=2, n_informative=8\n", ")\n", "datax[:3]" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
KNeighborsClassifier(algorithm='brute')
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
" ], "text/plain": [ "KNeighborsClassifier(algorithm='brute')" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "model = KNeighborsClassifier(5, algorithm=\"brute\")\n", "model.fit(datax, datay)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 0, 0, ..., 0, 2, 0])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict(datax)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "139 ms ± 3.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" ] } ], "source": [ "%timeit model.predict(datax)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/home/xadupre/.local/lib/python3.10/site-packages\n" ] } ], "source": [ "import numpy\n", "import os\n", "\n", "path = os.path.normpath(os.path.join(numpy.__file__, \"..\", \"..\"))\n", "print(path)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 448 function calls in 0.219 seconds\n", "\n", " Ordered by: cumulative time\n", "\n", " ncalls tottime percall cumtime percall filename:lineno(function)\n", " 2 0.000 0.000 0.219 0.110 usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py:3472(run_code)\n", " 2 0.000 0.000 0.219 0.110 {built-in method builtins.exec}\n", " 1 0.000 0.000 0.219 0.219 tmp/ipykernel_4917/1382514021.py:1()\n", " 1 0.000 0.000 0.219 0.219 neighbors/_classification.py:239(predict)\n", " 1 0.001 0.001 0.219 0.219 neighbors/_classification.py:300(predict_proba)\n", " 1 0.000 0.000 0.217 0.217 neighbors/_base.py:738(kneighbors)\n", " 1 0.212 0.212 0.214 0.214 metrics/_pairwise_distances_reduction/_dispatcher.py:174(compute)\n", " 1 0.000 0.000 0.003 0.003 base.py:509(_validate_data)\n", " 1 0.000 0.000 0.003 0.003 utils/validation.py:660(check_array)\n", " 1 0.000 0.000 0.001 0.001 utils/validation.py:96(_assert_all_finite)\n", " 1 0.000 0.000 0.001 0.001 utils/fixes.py:91(threadpool_limits)\n", " 1 0.000 0.000 0.001 0.001 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:455(limit)\n", " 2 0.001 0.001 0.001 0.001 {method 'reduce' of 'numpy.ufunc' objects}\n", " 1 0.000 0.000 0.001 0.001 numpy/core/fromnumeric.py:2177(sum)\n", " 1 0.000 0.000 0.001 0.001 numpy/core/fromnumeric.py:71(_wrapreduction)\n", " 1 0.000 0.000 0.001 0.001 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:160(__init__)\n", " 1 0.001 0.001 0.001 0.001 usr/lib/python3.10/warnings.py:458(__enter__)\n", " 1 0.000 0.000 0.001 0.001 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:412(info)\n", " 1 0.000 0.000 0.001 0.001 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:414()\n", " 2 0.000 0.000 0.000 0.000 numpy/core/numeric.py:274(full)\n", " 3 0.000 0.000 0.000 0.000 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:813(info)\n", " 37 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}\n", " 3 0.000 0.000 0.000 0.000 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:818(num_threads)\n", " 3 0.000 0.000 0.000 0.000 metrics/_pairwise_distances_reduction/_dispatcher.py:83(is_usable_for)\n", " 2 0.000 0.000 0.000 0.000 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:846(get_num_threads)\n", " 4 0.000 0.000 0.000 0.000 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:856(set_num_threads)\n", " 6 0.000 0.000 0.000 0.000 usr/lib/python3.10/ctypes/__init__.py:384(__getattr__)\n", " 1 0.000 0.000 0.000 0.000 {method 'sum' of 'numpy.ndarray' objects}\n", " 1 0.000 0.000 0.000 0.000 numpy/core/_methods.py:47(_sum)\n", " 3 0.000 0.000 0.000 0.000 utils/validation.py:1417(check_is_fitted)\n", " 1 0.000 0.000 0.000 0.000 numpy/core/fromnumeric.py:1140(argmax)\n", " 6 0.000 0.000 0.000 0.000 usr/lib/python3.10/ctypes/__init__.py:391(__getitem__)\n", " 1 0.000 0.000 0.000 0.000 numpy/core/fromnumeric.py:53(_wrapfunc)\n", " 1 0.000 0.000 0.000 0.000 {method 'argmax' of 'numpy.ndarray' objects}\n", " 1 0.000 0.000 0.000 0.000 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:278(_set_threadpool_limits)\n", " 1 0.000 0.000 0.000 0.000 usr/local/lib/python3.10/dist-packages/joblib/parallel.py:411(effective_n_jobs)\n", " 44 0.000 0.000 0.000 0.000 {built-in method builtins.hasattr}\n", " 3 0.000 0.000 0.000 0.000 utils/validation.py:1379(_is_fitted)\n", " 1 0.000 0.000 0.000 0.000 usr/local/lib/python3.10/dist-packages/joblib/parallel.py:89(get_active_backend)\n", " 30 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}\n", " 1 0.000 0.000 0.000 0.000 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:171(__exit__)\n", " 2 0.000 0.000 0.000 0.000 utils/validation.py:334(_num_samples)\n", " 1 0.000 0.000 0.000 0.000 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:181(restore_original_limits)\n", " 2 0.000 0.000 0.000 0.000 usr/lib/python3.10/codeop.py:117(__call__)\n", " 1 0.000 0.000 0.000 0.000 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:1017(get_num_threads)\n" ] } ], "source": [ "import os\n", "import cProfile\n", "import pstats\n", "from io import StringIO\n", "from sklearn import __file__ as sklpathf\n", "\n", "sklpath = os.path.dirname(sklpathf)\n", "pr = cProfile.Profile()\n", "pr.enable()\n", "model.predict(datax)\n", "pr.disable()\n", "s = StringIO()\n", "ps = pstats.Stats(pr, stream=s).sort_stats(\"cumulative\")\n", "ps.print_stats()\n", "res = (\n", " s.getvalue()\n", " .replace(sklpath, \"\")\n", " .replace(path, \"\")\n", " .replace(\"\\\\\", \"/\")\n", " .replace(\" /\", \" \")\n", ")\n", "print(\"\\n\".join(res.split(\"\\n\")[:50]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Etudier l'évolution du temps de prédiction en fonction du nombre d'observations, de la dimension, du nombre de classes ? Qu'en déduisez-vous ? Le code sur GitHub :\n", "\n", "* [predict](https://github.com/scikit-learn/scikit-learn/blob/ef5cb84a/sklearn/neighbors/classification.py#L129)\n", "* [kneighbors](https://github.com/scikit-learn/scikit-learn/blob/ef5cb84a805efbe4bb06516670a9b8c690992bd7/sklearn/neighbors/base.py#L273)\n", "* [pairwise_distance](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/pairwise.py#L1141)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q2 : k-nn avec sparse features\n", "\n", "On recommence cette mesure de temps mais en créant des jeux de données [sparses](https://fr.wikipedia.org/wiki/Matrice_creuse)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q3 : Imaginez une façon d'aller plus vite ?\n", "\n", "Aller plus vite veut parfois dire perdre un peu en performance et dans notre cas, on veut accélérer la prédiction." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 2 }