Nearest neighbors in high dimension#
The nearest neighbors method is a fairly simple algorithm. What happens when the dimension of the feature space grows? How can we work around it? Profilers such as memory_profiler or cProfile are useful tools to find out where the time is spent.
Q1: k-nn: measuring performance#
[1]:
from sklearn.datasets import make_classification
datax, datay = make_classification(
    10000, n_features=10, n_classes=3, n_clusters_per_class=2, n_informative=8
)
datax[:3]
[1]:
array([[-0.85382076, 0.22046675, 1.24910001, 2.94596312, 0.66829759,
-1.20552856, -1.72023578, -1.84674932, -0.26846378, 0.20075415],
[-1.59306412, 1.88866079, -0.76923866, -2.32519462, -2.94535057,
-1.47877141, -2.2276281 , 0.02957725, 1.85438519, 0.55194846],
[ 5.58758523, 2.80964683, 0.32608346, 4.12806316, -1.3248342 ,
1.06996005, 2.31117628, 3.99525892, -1.47020431, -4.13399841]])
[2]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(5, algorithm="brute")
model.fit(datax, datay)
[2]:
KNeighborsClassifier(algorithm='brute')
[3]:
model.predict(datax)
[3]:
array([0, 1, 0, ..., 2, 0, 1])
[4]:
%timeit model.predict(datax)
271 ms ± 22.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
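The introduction also mentions memory_profiler for measuring memory rather than time. A minimal sketch, assuming the memory_profiler package is installed and reusing the model and datax defined above:
[ ]:
# A minimal sketch, assuming memory_profiler is installed.
from memory_profiler import memory_usage

# Sample memory every 0.1 s while predict runs; returns a list of MiB values.
usage = memory_usage((model.predict, (datax,), {}), interval=0.1)
print(f"peak memory: {max(usage):.1f} MiB")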
[5]:
import numpy
import os
# folder containing the installed packages, used to shorten paths in the profile below
path = os.path.normpath(os.path.join(numpy.__file__, "..", ".."))
print(path)
C:\Python3105_x64\lib\site-packages
[6]:
import cProfile
import pstats
from io import StringIO
pr = cProfile.Profile()
pr.enable()
model.predict(datax)  # the call being profiled
pr.disable()
s = StringIO()
ps = pstats.Stats(pr, stream=s).sort_stats("cumulative")
ps.print_stats()
# shorten file paths in the report and keep the first 50 lines
res = s.getvalue().replace(path, "").replace("\\", "/").replace(" /", " ")
print("\n".join(res.split("\n")[:50]))
445 function calls in 0.287 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
2 0.000 0.000 0.287 0.143 IPython/core/interactiveshell.py:3397(run_code)
2 0.000 0.000 0.287 0.143 {built-in method builtins.exec}
1 0.000 0.000 0.287 0.287 C:/Users/xavie/AppData/Local/Temp/ipykernel_22912/2929265561.py:1(<module>)
1 0.000 0.000 0.287 0.287 C:/xavierdupre/__home_/github_fork/scikit-learn/sklearn/neighbors/_classification.py:230(predict)
1 0.001 0.001 0.286 0.286 C:/xavierdupre/__home_/github_fork/scikit-learn/sklearn/neighbors/_classification.py:291(predict_proba)
1 0.000 0.000 0.285 0.285 C:/xavierdupre/__home_/github_fork/scikit-learn/sklearn/neighbors/_base.py:730(kneighbors)
1 0.000 0.000 0.283 0.283 C:/xavierdupre/__home_/github_fork/scikit-learn/sklearn/metrics/_pairwise_distances_reduction/_dispatcher.py:165(compute)
1 0.283 0.283 0.283 0.283 {built-in method compute}
1 0.000 0.000 0.001 0.001 C:/xavierdupre/__home_/github_fork/scikit-learn/sklearn/base.py:508(_validate_data)
1 0.000 0.000 0.001 0.001 C:/xavierdupre/__home_/github_fork/scikit-learn/sklearn/utils/validation.py:647(check_array)
1 0.000 0.000 0.001 0.001 C:/xavierdupre/__home_/github_fork/scikit-learn/sklearn/utils/validation.py:95(_assert_all_finite)
1 0.000 0.000 0.001 0.001 numpy/core/fromnumeric.py:2177(sum)
1 0.000 0.000 0.001 0.001 numpy/core/fromnumeric.py:71(_wrapreduction)
3 0.000 0.000 0.001 0.000 C:/xavierdupre/__home_/github_fork/scikit-learn/sklearn/metrics/_pairwise_distances_reduction/_dispatcher.py:75(is_usable_for)
2 0.000 0.000 0.000 0.000 C:/xavierdupre/__home_/github_fork/scikit-learn/sklearn/metrics/_pairwise_distances_reduction/_dispatcher.py:448(is_usable_for)
2 0.000 0.000 0.000 0.000 {method 'reduce' of 'numpy.ufunc' objects}
7 0.000 0.000 0.000 0.000 C:/xavierdupre/__home_/github_fork/scikit-learn/sklearn/_config.py:32(get_config)
1 0.000 0.000 0.000 0.000 numpy/core/fromnumeric.py:1140(argmax)
1 0.000 0.000 0.000 0.000 numpy/core/fromnumeric.py:53(_wrapfunc)
1 0.000 0.000 0.000 0.000 {method 'argmax' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 C:/xavierdupre/__home_/github_fork/scikit-learn/sklearn/utils/fixes.py:67(threadpool_limits)
1 0.000 0.000 0.000 0.000 threadpoolctl.py:455(limit)
1 0.000 0.000 0.000 0.000 threadpoolctl.py:160(__init__)
1 0.000 0.000 0.000 0.000 {method 'sum' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 numpy/core/_methods.py:47(_sum)
3 0.000 0.000 0.000 0.000 C:/xavierdupre/__home_/github_fork/scikit-learn/sklearn/metrics/_pairwise_distances_reduction/_dispatcher.py:58(valid_metrics)
1 0.000 0.000 0.000 0.000 threadpoolctl.py:412(info)
1 0.000 0.000 0.000 0.000 threadpoolctl.py:414(<listcomp>)
29 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
3 0.000 0.000 0.000 0.000 threadpoolctl.py:813(info)
1 0.000 0.000 0.000 0.000 joblib/parallel.py:411(effective_n_jobs)
3 0.000 0.000 0.000 0.000 C:/xavierdupre/__home_/github_fork/scikit-learn/sklearn/utils/validation.py:1406(check_is_fitted)
2 0.000 0.000 0.000 0.000 numpy/core/numeric.py:274(full)
6 0.000 0.000 0.000 0.000 C:/Python3105_x64/lib/ctypes/__init__.py:384(__getattr__)
2 0.000 0.000 0.000 0.000 C:/Python3105_x64/lib/codeop.py:117(__call__)
1 0.000 0.000 0.000 0.000 joblib/parallel.py:89(get_active_backend)
3 0.000 0.000 0.000 0.000 threadpoolctl.py:818(num_threads)
4 0.000 0.000 0.000 0.000 threadpoolctl.py:856(set_num_threads)
1 0.000 0.000 0.000 0.000 threadpoolctl.py:171(__exit__)
2 0.000 0.000 0.000 0.000 C:/xavierdupre/__home_/github_fork/scikit-learn/sklearn/utils/validation.py:333(_num_samples)
1 0.000 0.000 0.000 0.000 threadpoolctl.py:181(restore_original_limits)
2 0.000 0.000 0.000 0.000 threadpoolctl.py:846(get_num_threads)
2 0.000 0.000 0.000 0.000 {built-in method builtins.compile}
33 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
3 0.000 0.000 0.000 0.000 C:/xavierdupre/__home_/github_fork/scikit-learn/sklearn/utils/validation.py:1368(_is_fitted)
Study how the prediction time evolves with the number of observations, the dimension, and the number of classes. What do you conclude? The code on GitHub:
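The profile above shows that almost all of the 0.287 s is spent in {built-in method compute}, i.e. the pairwise distance computation. A minimal sketch of such a measurement, here varying the dimension only (the grid of dimensions and the number of repetitions are arbitrary choices):
[ ]:
# A minimal sketch: time predict while varying the dimension.
from timeit import timeit
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

for n_features in [10, 50, 100, 200]:
    X, y = make_classification(
        10000, n_features=n_features, n_classes=3,
        n_clusters_per_class=2, n_informative=8,
    )
    knn = KNeighborsClassifier(5, algorithm="brute")
    knn.fit(X, y)
    # average over 5 runs
    t = timeit(lambda: knn.predict(X), number=5) / 5
    print(f"n_features={n_features}: {t:.3f} s")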
Q2: k-nn with sparse features#
We repeat this timing measurement, but this time with sparse datasets. One way to build them is sketched below.
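A minimal sketch, assuming we sparsify the dense matrix from Q1 by zeroing most of its entries (the 10% density is an arbitrary choice); algorithm="brute" accepts CSR matrices:
[ ]:
# A minimal sketch: sparsify datax, then time prediction on the sparse data.
import numpy
from scipy import sparse
from sklearn.neighbors import KNeighborsClassifier
from timeit import timeit

rnd = numpy.random.RandomState(0)
# Keep roughly 10% of the entries, set the rest to zero.
mask = rnd.uniform(size=datax.shape) <= 0.1
datax_sparse = sparse.csr_matrix(datax * mask)

model_sparse = KNeighborsClassifier(5, algorithm="brute")
model_sparse.fit(datax_sparse, datay)
t = timeit(lambda: model_sparse.predict(datax_sparse), number=5) / 5
print(f"sparse prediction: {t:.3f} s")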
Q3: Can you think of a way to go faster?#
Going faster sometimes means losing a little accuracy; in our case, the goal is to speed up prediction. One possible direction is sketched after the empty cell below.
[ ]:
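One possible direction, as a sketch only and not the only answer: reduce the dimension with a random projection before searching neighbors. Distances are only approximately preserved, so some accuracy may be lost, but the brute-force search then operates on fewer features. The choice of 5 components is arbitrary.
[ ]:
# A sketch of one idea: random projection followed by k-nn.
from timeit import timeit
from sklearn.pipeline import make_pipeline
from sklearn.random_projection import GaussianRandomProjection
from sklearn.neighbors import KNeighborsClassifier

fast_model = make_pipeline(
    GaussianRandomProjection(n_components=5, random_state=0),
    KNeighborsClassifier(5, algorithm="brute"),
)
fast_model.fit(datax, datay)
t = timeit(lambda: fast_model.predict(datax), number=5) / 5
print(f"prediction after projection: {t:.3f} s")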