Nearest neighbors in high dimension¶
The nearest neighbors method is a fairly simple algorithm. What happens when the dimension of the feature space grows? How can this be mitigated? Profilers such as memory_profiler or cProfile are useful tools to find out where the time is spent.
Q1: k-nn: measuring performance¶
[2]:
from sklearn.datasets import make_classification
datax, datay = make_classification(
10000, n_features=10, n_classes=3, n_clusters_per_class=2, n_informative=8
)
datax[:3]
[2]:
array([[ 3.62557523, -3.92972784, -2.19327029, -3.01669145, -3.66440003,
0.05373302, -0.09569564, -1.62733 , -3.05437465, 3.43404744],
[-1.88137987, 1.1603541 , 1.97569429, 2.28962313, 1.06727548,
0.81364917, -2.15972723, -1.99923386, 0.25393473, 2.67807834],
[ 1.74986482, -2.68848993, 0.83230911, -0.15836161, 0.71428315,
-2.53155132, -0.49799497, -1.53866452, -2.55477724, 2.79401366]])
[3]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(5, algorithm="brute")
model.fit(datax, datay)
[3]:
KNeighborsClassifier(algorithm='brute')
[4]:
model.predict(datax)
[4]:
array([1, 0, 0, ..., 0, 2, 0])
[5]:
%timeit model.predict(datax)
139 ms ± 3.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
[6]:
import numpy
import os
path = os.path.normpath(os.path.join(numpy.__file__, "..", ".."))
print(path)
/home/xadupre/.local/lib/python3.10/site-packages
[8]:
import os
import cProfile
import pstats
from io import StringIO
from sklearn import __file__ as sklpathf
sklpath = os.path.dirname(sklpathf)
pr = cProfile.Profile()
pr.enable()
model.predict(datax)
pr.disable()
s = StringIO()
ps = pstats.Stats(pr, stream=s).sort_stats("cumulative")
ps.print_stats()
res = (
s.getvalue()
.replace(sklpath, "")
.replace(path, "")
.replace("\\", "/")
.replace(" /", " ")
)
print("\n".join(res.split("\n")[:50]))
448 function calls in 0.219 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
2 0.000 0.000 0.219 0.110 usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py:3472(run_code)
2 0.000 0.000 0.219 0.110 {built-in method builtins.exec}
1 0.000 0.000 0.219 0.219 tmp/ipykernel_4917/1382514021.py:1(<module>)
1 0.000 0.000 0.219 0.219 neighbors/_classification.py:239(predict)
1 0.001 0.001 0.219 0.219 neighbors/_classification.py:300(predict_proba)
1 0.000 0.000 0.217 0.217 neighbors/_base.py:738(kneighbors)
1 0.212 0.212 0.214 0.214 metrics/_pairwise_distances_reduction/_dispatcher.py:174(compute)
1 0.000 0.000 0.003 0.003 base.py:509(_validate_data)
1 0.000 0.000 0.003 0.003 utils/validation.py:660(check_array)
1 0.000 0.000 0.001 0.001 utils/validation.py:96(_assert_all_finite)
1 0.000 0.000 0.001 0.001 utils/fixes.py:91(threadpool_limits)
1 0.000 0.000 0.001 0.001 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:455(limit)
2 0.001 0.001 0.001 0.001 {method 'reduce' of 'numpy.ufunc' objects}
1 0.000 0.000 0.001 0.001 numpy/core/fromnumeric.py:2177(sum)
1 0.000 0.000 0.001 0.001 numpy/core/fromnumeric.py:71(_wrapreduction)
1 0.000 0.000 0.001 0.001 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:160(__init__)
1 0.001 0.001 0.001 0.001 usr/lib/python3.10/warnings.py:458(__enter__)
1 0.000 0.000 0.001 0.001 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:412(info)
1 0.000 0.000 0.001 0.001 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:414(<listcomp>)
2 0.000 0.000 0.000 0.000 numpy/core/numeric.py:274(full)
3 0.000 0.000 0.000 0.000 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:813(info)
37 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
3 0.000 0.000 0.000 0.000 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:818(num_threads)
3 0.000 0.000 0.000 0.000 metrics/_pairwise_distances_reduction/_dispatcher.py:83(is_usable_for)
2 0.000 0.000 0.000 0.000 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:846(get_num_threads)
4 0.000 0.000 0.000 0.000 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:856(set_num_threads)
6 0.000 0.000 0.000 0.000 usr/lib/python3.10/ctypes/__init__.py:384(__getattr__)
1 0.000 0.000 0.000 0.000 {method 'sum' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 numpy/core/_methods.py:47(_sum)
3 0.000 0.000 0.000 0.000 utils/validation.py:1417(check_is_fitted)
1 0.000 0.000 0.000 0.000 numpy/core/fromnumeric.py:1140(argmax)
6 0.000 0.000 0.000 0.000 usr/lib/python3.10/ctypes/__init__.py:391(__getitem__)
1 0.000 0.000 0.000 0.000 numpy/core/fromnumeric.py:53(_wrapfunc)
1 0.000 0.000 0.000 0.000 {method 'argmax' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:278(_set_threadpool_limits)
1 0.000 0.000 0.000 0.000 usr/local/lib/python3.10/dist-packages/joblib/parallel.py:411(effective_n_jobs)
44 0.000 0.000 0.000 0.000 {built-in method builtins.hasattr}
3 0.000 0.000 0.000 0.000 utils/validation.py:1379(_is_fitted)
1 0.000 0.000 0.000 0.000 usr/local/lib/python3.10/dist-packages/joblib/parallel.py:89(get_active_backend)
30 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
1 0.000 0.000 0.000 0.000 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:171(__exit__)
2 0.000 0.000 0.000 0.000 utils/validation.py:334(_num_samples)
1 0.000 0.000 0.000 0.000 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:181(restore_original_limits)
2 0.000 0.000 0.000 0.000 usr/lib/python3.10/codeop.py:117(__call__)
1 0.000 0.000 0.000 0.000 usr/local/lib/python3.10/dist-packages/threadpoolctl.py:1017(get_num_threads)
Study how the prediction time evolves with the number of observations, the dimension, and the number of classes. What do you conclude? The code on GitHub:
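A minimal sketch of such a study, assuming a hypothetical helper `predict_time` that builds a dataset of the requested shape, fits a brute-force k-nn model, and times a single call to `predict`:

```python
import time
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier


def predict_time(n_obs, n_features, n_classes=3):
    # Build a synthetic dataset of the requested shape.
    X, y = make_classification(
        n_obs,
        n_features=n_features,
        n_classes=n_classes,
        n_clusters_per_class=1,
        n_informative=max(n_features - 2, n_classes),
    )
    model = KNeighborsClassifier(5, algorithm="brute").fit(X, y)
    # Time one full prediction pass over the training set.
    begin = time.perf_counter()
    model.predict(X)
    return time.perf_counter() - begin


for dim in [5, 10, 20]:
    print(dim, predict_time(2000, dim))
```

The same loop can be repeated over `n_obs` or `n_classes` to see which parameter dominates the cost.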
Q2: k-nn with sparse features¶
We repeat the timing measurement, this time with sparse datasets.
Q3: Can you think of a way to go faster?¶
Going faster sometimes means losing a little accuracy; in our case, the goal is to speed up prediction.
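One possible trade-off, sketched below under the assumption that a random projection to fewer dimensions is acceptable: distances become cheaper to compute, at the cost of predictions that may occasionally differ from the exact model.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.random_projection import GaussianRandomProjection

X, y = make_classification(
    5000, n_features=10, n_classes=3, n_clusters_per_class=2, n_informative=8
)

# Exact model on the full feature space.
exact = KNeighborsClassifier(5, algorithm="brute").fit(X, y)

# Approximate model: project to fewer dimensions first, then run k-nn there.
proj = GaussianRandomProjection(n_components=4, random_state=0)
Xp = proj.fit_transform(X)
approx = KNeighborsClassifier(5, algorithm="brute").fit(Xp, y)

# Fraction of points where the fast model agrees with the exact one.
agree = (exact.predict(X) == approx.predict(Xp)).mean()
print(f"agreement with exact predictions: {agree:.2f}")
```

Other directions worth timing: `algorithm="kd_tree"` or `"ball_tree"` instead of `"brute"`, or reducing the training set.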
[ ]: