Nearest neighbors in high dimension#

The nearest neighbors method is a fairly simple algorithm. What happens when the dimension of the feature space increases? How can this be mitigated? Profiling tools such as memory_profiler or cProfile are useful to find out where the time is being spent.

Q1: k-NN: measuring performance#

[2]:
from sklearn.datasets import make_classification

datax, datay = make_classification(
    10000, n_features=10, n_classes=3, n_clusters_per_class=2, n_informative=8
)
datax[:3]
[2]:
array([[ 3.62557523, -3.92972784, -2.19327029, -3.01669145, -3.66440003,
         0.05373302, -0.09569564, -1.62733   , -3.05437465,  3.43404744],
       [-1.88137987,  1.1603541 ,  1.97569429,  2.28962313,  1.06727548,
         0.81364917, -2.15972723, -1.99923386,  0.25393473,  2.67807834],
       [ 1.74986482, -2.68848993,  0.83230911, -0.15836161,  0.71428315,
        -2.53155132, -0.49799497, -1.53866452, -2.55477724,  2.79401366]])
[3]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5, algorithm="brute")
model.fit(datax, datay)
[3]:
KNeighborsClassifier(algorithm='brute')
[4]:
model.predict(datax)
[4]:
array([1, 0, 0, ..., 0, 2, 0])
[5]:
%timeit model.predict(datax)
139 ms ± 3.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
[6]:
import numpy
import os

path = os.path.normpath(os.path.join(numpy.__file__, "..", ".."))
print(path)
/home/xadupre/.local/lib/python3.10/site-packages
[8]:
import os
import cProfile
import pstats
from io import StringIO
from sklearn import __file__ as sklpathf

sklpath = os.path.dirname(sklpathf)
# profile a single call to predict
pr = cProfile.Profile()
pr.enable()
model.predict(datax)
pr.disable()

# render the statistics sorted by cumulative time,
# stripping the installation paths to shorten the report
s = StringIO()
ps = pstats.Stats(pr, stream=s).sort_stats("cumulative")
ps.print_stats()
res = (
    s.getvalue()
    .replace(sklpath, "")
    .replace(path, "")
    .replace("\\", "/")
    .replace(" /", " ")
)
print("\n".join(res.split("\n")[:50]))
         448 function calls in 0.219 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000    0.219    0.110 usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py:3472(run_code)
        2    0.000    0.000    0.219    0.110 {built-in method builtins.exec}
        1    0.000    0.000    0.219    0.219 tmp/ipykernel_4917/1382514021.py:1(<module>)
        1    0.000    0.000    0.219    0.219 neighbors/_classification.py:239(predict)
        1    0.001    0.001    0.219    0.219 neighbors/_classification.py:300(predict_proba)
        1    0.000    0.000    0.217    0.217 neighbors/_base.py:738(kneighbors)
        1    0.212    0.212    0.214    0.214 metrics/_pairwise_distances_reduction/_dispatcher.py:174(compute)
        1    0.000    0.000    0.003    0.003 base.py:509(_validate_data)
        1    0.000    0.000    0.003    0.003 utils/validation.py:660(check_array)
        1    0.000    0.000    0.001    0.001 utils/validation.py:96(_assert_all_finite)
        1    0.000    0.000    0.001    0.001 utils/fixes.py:91(threadpool_limits)
        1    0.000    0.000    0.001    0.001 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:455(limit)
        2    0.001    0.001    0.001    0.001 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.000    0.000    0.001    0.001 numpy/core/fromnumeric.py:2177(sum)
        1    0.000    0.000    0.001    0.001 numpy/core/fromnumeric.py:71(_wrapreduction)
        1    0.000    0.000    0.001    0.001 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:160(__init__)
        1    0.001    0.001    0.001    0.001 usr/lib/python3.10/warnings.py:458(__enter__)
        1    0.000    0.000    0.001    0.001 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:412(info)
        1    0.000    0.000    0.001    0.001 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:414(<listcomp>)
        2    0.000    0.000    0.000    0.000 numpy/core/numeric.py:274(full)
        3    0.000    0.000    0.000    0.000 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:813(info)
       37    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        3    0.000    0.000    0.000    0.000 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:818(num_threads)
        3    0.000    0.000    0.000    0.000 metrics/_pairwise_distances_reduction/_dispatcher.py:83(is_usable_for)
        2    0.000    0.000    0.000    0.000 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:846(get_num_threads)
        4    0.000    0.000    0.000    0.000 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:856(set_num_threads)
        6    0.000    0.000    0.000    0.000 usr/lib/python3.10/ctypes/__init__.py:384(__getattr__)
        1    0.000    0.000    0.000    0.000 {method 'sum' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 numpy/core/_methods.py:47(_sum)
        3    0.000    0.000    0.000    0.000 utils/validation.py:1417(check_is_fitted)
        1    0.000    0.000    0.000    0.000 numpy/core/fromnumeric.py:1140(argmax)
        6    0.000    0.000    0.000    0.000 usr/lib/python3.10/ctypes/__init__.py:391(__getitem__)
        1    0.000    0.000    0.000    0.000 numpy/core/fromnumeric.py:53(_wrapfunc)
        1    0.000    0.000    0.000    0.000 {method 'argmax' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:278(_set_threadpool_limits)
        1    0.000    0.000    0.000    0.000 usr/local/lib/python3.10/dist-packages/joblib/parallel.py:411(effective_n_jobs)
       44    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
        3    0.000    0.000    0.000    0.000 utils/validation.py:1379(_is_fitted)
        1    0.000    0.000    0.000    0.000 usr/local/lib/python3.10/dist-packages/joblib/parallel.py:89(get_active_backend)
       30    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        1    0.000    0.000    0.000    0.000 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:171(__exit__)
        2    0.000    0.000    0.000    0.000 utils/validation.py:334(_num_samples)
        1    0.000    0.000    0.000    0.000 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:181(restore_original_limits)
        2    0.000    0.000    0.000    0.000 usr/lib/python3.10/codeop.py:117(__call__)
        1    0.000    0.000    0.000    0.000 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:1017(get_num_threads)

The profile above shows that almost all of the time is spent computing pairwise distances. Study how the prediction time evolves with the number of observations, the dimension, and the number of classes. What do you conclude?
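
A minimal sketch of such a study, assuming we only vary the dimension; the same loop can be repeated for the number of observations or the number of classes, and the sizes below are arbitrary choices.

[ ]:
import time

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

for n_features in [10, 20, 50, 100, 200]:
    X, y = make_classification(
        5000, n_features=n_features, n_classes=3,
        n_clusters_per_class=2, n_informative=8
    )
    knn = KNeighborsClassifier(n_neighbors=5, algorithm="brute")
    knn.fit(X, y)
    # one timed call is a rough estimate; use %timeit for a stable one
    begin = time.perf_counter()
    knn.predict(X)
    print(f"n_features={n_features:4d}, time={time.perf_counter() - begin:.3f}s")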

Q2: k-NN with sparse features#

We repeat the same timing measurement, this time with sparse datasets.
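
A minimal sketch, assuming random CSR matrices from scipy.sparse and arbitrary labels; brute-force k-NN in scikit-learn accepts sparse input directly.

[ ]:
import time

import numpy
import scipy.sparse as sp
from sklearn.neighbors import KNeighborsClassifier

rng = numpy.random.RandomState(0)
n, d = 10000, 100
# sparse feature matrix with 5% non-zero entries
Xs = sp.random(n, d, density=0.05, format="csr", random_state=rng)
ys = rng.randint(0, 3, n)  # arbitrary labels, enough for a timing test

knn = KNeighborsClassifier(n_neighbors=5, algorithm="brute")
knn.fit(Xs, ys)

begin = time.perf_counter()
knn.predict(Xs)
print(f"sparse prediction time: {time.perf_counter() - begin:.3f}s")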

Q3: Can you think of a way to go faster?#

Going faster sometimes means losing a bit of accuracy, and in our case, we want to speed up prediction.
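
One possible answer, sketched under the assumption that reducing the dimension is acceptable: project the features onto a lower-dimensional space with a random projection before the brute-force search, then compare time and accuracy. Trees (kd-tree, ball tree) or approximate nearest-neighbor libraries are other options.

[ ]:
import time

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.random_projection import GaussianRandomProjection

X, y = make_classification(
    10000, n_features=100, n_classes=3, n_clusters_per_class=2, n_informative=8
)

baseline = KNeighborsClassifier(n_neighbors=5, algorithm="brute")
projected = make_pipeline(
    # 100 -> 10 dimensions: the brute-force search gets cheaper,
    # at the price of distorted distances
    GaussianRandomProjection(n_components=10),
    KNeighborsClassifier(n_neighbors=5, algorithm="brute"),
)

for name, model in [("brute 100d", baseline), ("projection 10d", projected)]:
    model.fit(X, y)
    begin = time.perf_counter()
    pred = model.predict(X)
    duration = time.perf_counter() - begin
    print(f"{name:15s} time={duration:.3f}s accuracy={(pred == y).mean():.3f}")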

[ ]:

