{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Nearest Neighbours and Sparse Features\n", "\n", "When applying a k-nearest neighbours classifier to the MNIST database, we may run into a tricky issue. Let's find out what it is, why it happens, and how to solve it."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["c:\\python370_x64\\lib\\site-packages\\ipykernel\\parentpoller.py:116: UserWarning: Parent poll failed. If the frontend dies,\n", " the kernel may be left running. Please let us know\n", " about your system (bitness, Python, etc.) at\n", " ipython-dev@scipy.org\n", " ipython-dev@scipy.org\"\"\")\n"]}, {"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Get the data\n", "\n", "We use the package [mnist](https://github.com/datapythonista/mnist)."]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": ["import mnist\n", "train_images = mnist.train_images()\n", "train_labels = mnist.train_labels()\n", "test_images = mnist.test_images()\n", "test_labels = mnist.test_labels()"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/plain": ["((60000, 28, 28), (60000,))"]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["train_images.shape, train_labels.shape"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": ["train_X = train_images.reshape((train_images.shape[0], train_images.shape[1] * train_images.shape[2]))\n", "test_X = test_images.reshape((test_images.shape[0], test_images.shape[1] * test_images.shape[2]))"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/plain": ["((60000, 784), (60000,))"]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["train_X.shape, train_labels.shape"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[0, 0, 0, ..., 0, 0, 0],\n", " [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)"]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["train_X[:2]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Train a classifier"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"text/plain": ["KNeighborsClassifier(algorithm='kd_tree', leaf_size=30, metric='minkowski',\n", " metric_params=None, n_jobs=1, n_neighbors=5, 
p=2,\n", " weights='uniform')"]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.neighbors import KNeighborsClassifier\n", "knn = KNeighborsClassifier(algorithm=\"kd_tree\")\n", "knn"]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/plain": ["KNeighborsClassifier(algorithm='kd_tree', leaf_size=30, metric='minkowski',\n", " metric_params=None, n_jobs=1, n_neighbors=5, p=2,\n", " weights='uniform')"]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["knn.fit(train_X, train_labels)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Memory consumption is very high. In the plot below, the first peak corresponds to training and the second to the beginning of prediction."]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": [""]}, "execution_count": 10, "metadata": {}, "output_type": "execute_result"}], "source": ["from pyquickhelper.helpgen.utils_sphinx_config import NbImage\n", "NbImage(\"images/train.png\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Predict"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": ["# Do not run this line: prediction takes forever.\n", "# yest = knn.predict(test_X)"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": [""]}, "execution_count": 12, "metadata": {}, "output_type": "execute_result"}], "source": ["NbImage(\"images/test.png\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Enigma"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The prediction practically never finishes. We chose a [k-d tree](https://en.wikipedia.org/wiki/K-d_tree) to speed up the neighbours search. Why does it take so much memory and so much time? 
What would you do to optimize it?"]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0"}}, "nbformat": 4, "nbformat_minor": 2}