{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 2A.algo - Plus proches voisins en grande dimension\n", "\n", "La m\u00e9thodes des [plus proches voisins](https://fr.wikipedia.org/wiki/Recherche_des_plus_proches_voisins) est un algorithme assez simple. Que se passe-t-il quand la dimension de l'espace des features augmente ? Comment y rem\u00e9dier ? Le profiling [memory_profiler](https://pypi.python.org/pypi/memory_profiler) ou [cprofile](https://docs.python.org/3.7/library/profile.html#module-cProfile) sont des outils utiles pour savoir o\u00f9 le temps est perdu. "]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
run previous cell, wait for 2 seconds
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Q1 : k-nn : mesurer la performance"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[-0.8475772 , -2.32538375, 2.85493495, 0.80844826, -0.22859889,\n", " -1.04841583, 0.02968567, 0.64623341, 0.80613674, -2.23389406],\n", " [-0.98432181, -0.06661461, 7.75513731, -0.68528612, 2.91266715,\n", " -2.42866215, -1.30340144, -2.10535336, 2.30057811, -0.16914582],\n", " [ 2.5080994 , 0.78644825, 2.64918709, 1.47316878, -6.35328966,\n", " -0.82007342, -0.08550633, -5.23436533, -0.56694263, -2.1252314 ]])"]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.datasets import make_classification\n", "datax, datay = make_classification(10000, n_features=10, n_classes=3, \n", " n_clusters_per_class=2, n_informative=8)\n", "datax[:3]"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/plain": ["KNeighborsClassifier(algorithm='brute', leaf_size=30, metric='minkowski',\n", " metric_params=None, n_jobs=None, n_neighbors=5, p=2,\n", " weights='uniform')"]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.neighbors import KNeighborsClassifier\n", "model = KNeighborsClassifier(5, algorithm=\"brute\")\n", "model.fit(datax, datay)"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([2, 1, 0, ..., 1, 0, 0])"]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["model.predict(datax)"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["2.78 s \u00b1 27.3 ms per loop (mean \u00b1 std. dev. of 7 runs, 1 loop each)\n"]}], "source": ["%timeit model.predict(datax)"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["c:\\Python372_x64\\lib\\site-packages\n"]}], "source": ["import numpy\n", "import os\n", "path = os.path.normpath(os.path.join(numpy.__file__, '..', '..'))\n", "print(path)"]}, {"cell_type": "code", "execution_count": 7, "metadata": {"scrolled": false}, "outputs": [{"name": "stdout", "output_type": "stream", "text": [" 290457 function calls in 2.919 seconds\n", "\n", " Ordered by: cumulative time\n", "\n", " ncalls tottime percall cumtime percall filename:lineno(function)\n", " 2 0.000 0.000 2.918 1.459 IPython/core/interactiveshell.py:3259(run_code)\n", " 2 0.000 0.000 2.918 1.459 {built-in method builtins.exec}\n", " 1 0.000 0.000 2.918 2.918 :6()\n", " 1 0.000 0.000 2.918 2.918 sklearn/neighbors/classification.py:133(predict)\n", " 1 0.000 0.000 2.600 2.600 sklearn/neighbors/base.py:331(kneighbors)\n", " 2 0.074 0.037 2.599 1.299 sklearn/metrics/pairwise.py:1154(pairwise_distances_chunked)\n", " 1 0.080 0.080 1.277 1.277 sklearn/neighbors/base.py:296(_kneighbors_reduce_func)\n", " 1 0.000 0.000 1.248 1.248 sklearn/metrics/pairwise.py:1315(pairwise_distances)\n", " 1 0.000 0.000 1.248 1.248 sklearn/metrics/pairwise.py:1057(_parallel_pairwise)\n", " 1 0.328 0.328 1.248 1.248 sklearn/metrics/pairwise.py:165(euclidean_distances)\n", " 10003 0.006 0.000 1.210 0.000 numpy/core/fromnumeric.py:54(_wrapfunc)\n", " 1 0.000 0.000 1.197 1.197 numpy/core/fromnumeric.py:742(argpartition)\n", " 1 1.197 1.197 1.197 1.197 {method 'argpartition' of 'numpy.ndarray' objects}\n", " 1 0.000 0.000 0.919 0.919 sklearn/utils/extmath.py:117(safe_sparse_dot)\n", " 1 0.919 0.919 0.919 0.919 {built-in method numpy.dot}\n", " 1 0.017 0.017 0.318 0.318 scipy/stats/stats.py:389(mode)\n", " 10000 0.016 0.000 0.291 0.000 scipy/stats/stats.py:460(_mode1D)\n", " 10000 0.014 0.000 0.229 0.000 numpy/lib/arraysetops.py:151(unique)\n", " 10000 0.069 0.000 0.204 0.000 numpy/lib/arraysetops.py:299(_unique1d)\n", " 10000 0.052 0.000 0.064 0.000 numpy/lib/function_base.py:1149(diff)\n", " 10000 0.005 0.000 0.037 0.000 {method 'max' of 'numpy.ndarray' objects}\n", " 10000 0.003 0.000 0.032 0.000 numpy/core/_methods.py:26(_amax)\n", " 10002 0.030 0.000 0.030 0.000 {built-in method numpy.concatenate}\n", " 10004 0.029 0.000 0.029 0.000 {method 'reduce' of 'numpy.ufunc' objects}\n", " 30005 0.008 0.000 0.017 0.000 numpy/core/numeric.py:541(asanyarray)\n", " 10000 0.004 0.000 0.017 0.000 numpy/core/fromnumeric.py:1694(nonzero)\n", " 10001 0.007 0.000 0.010 0.000 numpy/lib/index_tricks.py:653(__next__)\n", " 30012 0.009 0.000 0.009 0.000 {built-in method numpy.array}\n", " 10000 0.008 0.000 0.008 0.000 {method 'argmax' of 'numpy.ndarray' objects}\n", " 10002 0.008 0.000 0.008 0.000 {built-in method numpy.empty}\n", " 10000 0.007 0.000 0.007 0.000 {method 'sort' of 'numpy.ndarray' objects}\n", " 10000 0.007 0.000 0.007 0.000 {method 'flatten' of 'numpy.ndarray' objects}\n", " 10000 0.005 0.000 0.005 0.000 {method 'nonzero' of 'numpy.ndarray' objects}\n", " 10000 0.003 0.000 0.004 0.000 numpy/lib/arraysetops.py:138(_unpack_tuple)\n", " 10000 0.003 0.000 0.003 0.000 {built-in method numpy.core._multiarray_umath.normalize_axis_index}\n", " 10001 0.003 0.000 0.003 0.000 {built-in method builtins.next}\n", " 20014 0.002 0.000 0.002 0.000 {built-in method builtins.len}\n", " 10012 0.002 0.000 0.002 0.000 {built-in method builtins.getattr}\n", " 10002 0.002 0.000 0.002 0.000 {method 'append' of 'list' objects}\n", " 3 0.000 0.000 0.001 0.000 sklearn/utils/validation.py:332(check_array)\n", " 3 0.000 0.000 0.001 0.000 sklearn/utils/validation.py:36(_assert_all_finite)\n", " 4 0.000 0.000 0.001 0.000 numpy/core/fromnumeric.py:1966(sum)\n", " 4 0.000 0.000 0.001 0.000 numpy/core/fromnumeric.py:69(_wrapreduction)\n", " 3 0.000 0.000 0.001 0.000 sklearn/utils/extmath.py:663(_safe_accumulator_op)\n", " 1 0.000 0.000 0.000 0.000 numpy/core/fromnumeric.py:942(argsort)\n"]}], "source": ["import cProfile\n", "import pstats\n", "from io import StringIO\n", "pr = cProfile.Profile()\n", "pr.enable()\n", "model.predict(datax)\n", "pr.disable()\n", "s = StringIO()\n", "ps = pstats.Stats(pr, stream=s).sort_stats('cumulative')\n", "ps.print_stats()\n", "res = s.getvalue().replace(path, '').replace(\"\\\\\", \"/\").replace(\" /\", \" \")\n", "print('\\n'.join(res.split('\\n')[:50]))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Etudier l'\u00e9volution du temps de pr\u00e9diction en fonction du nombre d'observations, de la dimension, du nombre de classes ? Qu'en d\u00e9duisez-vous ? Le code sur GitHub :\n", "\n", "* [predict](https://github.com/scikit-learn/scikit-learn/blob/ef5cb84a/sklearn/neighbors/classification.py#L129)\n", "* [kneighbors](https://github.com/scikit-learn/scikit-learn/blob/ef5cb84a805efbe4bb06516670a9b8c690992bd7/sklearn/neighbors/base.py#L273)\n", "* [pairwise_distance](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/pairwise.py#L1141)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Q2 : k-nn avec sparse features\n", "\n", "On recommence cette mesure de temps mais en cr\u00e9ant des jeux de donn\u00e9es [sparses](https://fr.wikipedia.org/wiki/Matrice_creuse)."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Q3 : Imaginez une fa\u00e7on d'aller plus vite ?\n", "\n", "Aller plus vite veut parfois dire perdre un peu en performance et dans notre cas, on veut acc\u00e9l\u00e9rer la pr\u00e9diction."]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2"}}, "nbformat": 4, "nbformat_minor": 2}