{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Factorisation de matrice et recommandations\n", "\n", "Le notebook utilise la factorisation de matrice pour calculer des recommandations sur la base [movielens](https://grouplens.org/datasets/movielens/). On utilise le jeu de donn\u00e9es [ml-latest-small.zip](http://files.grouplens.org/datasets/movielens/ml-latest-small.zip)."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": ["%matplotlib inline"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/plain": ["['links', 'movies', 'ratings', 'tags']"]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["from papierstat.datasets import load_movielens_dataset\n", "data = load_movielens_dataset(cache='movielens.zip')\n", "list(sorted(data))"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", "
\n", " \n", " \n", " | \n", " movieId | \n", " title | \n", " genres | \n", "
\n", " \n", " \n", " \n", " 9123 | \n", " 164977 | \n", " The Gay Desperado (1936) | \n", " Comedy | \n", "
\n", " \n", " 9124 | \n", " 164979 | \n", " Women of '69, Unboxed | \n", " Documentary | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" movieId title genres\n", "9123 164977 The Gay Desperado (1936) Comedy\n", "9124 164979 Women of '69, Unboxed Documentary"]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["data['movies'].tail(n=2)"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
\n", " \n", " \n", " | \n", " userId | \n", " movieId | \n", " rating | \n", " timestamp | \n", " dt | \n", "
\n", " \n", " \n", " \n", " 100002 | \n", " 671 | \n", " 6385 | \n", " 2.5 | \n", " 1070979663 | \n", " 2003-12-09 14:21:03 | \n", "
\n", " \n", " 100003 | \n", " 671 | \n", " 6565 | \n", " 3.5 | \n", " 1074784724 | \n", " 2004-01-22 15:18:44 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" userId movieId rating timestamp dt\n", "100002 671 6385 2.5 1070979663 2003-12-09 14:21:03\n", "100003 671 6565 3.5 1074784724 2004-01-22 15:18:44"]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["import pandas\n", "rate = data[\"ratings\"]\n", "rate[\"dt\"] = pandas.to_datetime(rate[\"timestamp\"], unit='s')\n", "rate.tail(n=2)"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/plain": ["(671, 9066)"]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["len(set(rate['userId'])), len(set(rate['movieId']))"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
\n", " \n", " \n", " | \n", " userId | \n", " movieId | \n", " rating | \n", " timestamp | \n", "
\n", " \n", " \n", " \n", " count | \n", " 100004.000000 | \n", " 100004.000000 | \n", " 100004.000000 | \n", " 1.000040e+05 | \n", "
\n", " \n", " mean | \n", " 347.011310 | \n", " 12548.664363 | \n", " 3.543608 | \n", " 1.129639e+09 | \n", "
\n", " \n", " std | \n", " 195.163838 | \n", " 26369.198969 | \n", " 1.058064 | \n", " 1.916858e+08 | \n", "
\n", " \n", " min | \n", " 1.000000 | \n", " 1.000000 | \n", " 0.500000 | \n", " 7.896520e+08 | \n", "
\n", " \n", " 25% | \n", " 182.000000 | \n", " 1028.000000 | \n", " 3.000000 | \n", " 9.658478e+08 | \n", "
\n", " \n", " 50% | \n", " 367.000000 | \n", " 2406.500000 | \n", " 4.000000 | \n", " 1.110422e+09 | \n", "
\n", " \n", " 75% | \n", " 520.000000 | \n", " 5418.000000 | \n", " 4.000000 | \n", " 1.296192e+09 | \n", "
\n", " \n", " max | \n", " 671.000000 | \n", " 163949.000000 | \n", " 5.000000 | \n", " 1.476641e+09 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" userId movieId rating timestamp\n", "count 100004.000000 100004.000000 100004.000000 1.000040e+05\n", "mean 347.011310 12548.664363 3.543608 1.129639e+09\n", "std 195.163838 26369.198969 1.058064 1.916858e+08\n", "min 1.000000 1.000000 0.500000 7.896520e+08\n", "25% 182.000000 1028.000000 3.000000 9.658478e+08\n", "50% 367.000000 2406.500000 4.000000 1.110422e+09\n", "75% 520.000000 5418.000000 4.000000 1.296192e+09\n", "max 671.000000 163949.000000 5.000000 1.476641e+09"]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["rate.describe()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["671 utilisateurs et 9066 films. C'est petit mais assez pour voir la factorisation et le temps que cela prend. Quelques id\u00e9es sur les donn\u00e9es."]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"image/png": "iVBORw0KGgoAAAANSUhEUgAAAN8AAADSCAYAAADKZxXyAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAFEFJREFUeJzt3Xu0XGV9xvHvY7iayM1AGgEJlmhBqBERaKktt0Lk0ugSXGHJVVgoBasVq6HewECJqwiWFnF5iRiwRsqljRCFFDkiLdcot4BICgECgcglkAMIBH/9433H7AwzOTPn9k5mns9aszLz7j17v2fn/Gbv/c4+z1ZEYGaj7w2lO2DWq1x8ZoW4+MwKcfGZFeLiMyvExWdWiIuvAUnflPTFYVrWWyX1SxqTX/dJOmE4lp2X9xNJxwzX8taynmMl3TjS62lX/fZdl/Rc8UlaIuklSSslrZD0v5I+LukP2yIiPh4RM1tc1v5rmyciHomIcRHx2jD0/XRJl9Qt//0R8f2hLntdUb/Nh3P7jraeK77s0Ih4E7AdMAv4HPDd4V6JpPWGe5ndrOe2V0T01ANYAuxf17Y78Htg5/z6IuDM/Hw8cBWwAngG+AXpQ+vi/J6XgH7gs8AkIIDjgUeAGypt6+Xl9QFnA7cCzwH/BWyRp+0NLG3UX2Aq8Arwal7fnZXlnZCfvwH4AvAwsByYA2yap9X6cUzu21PA59eynd4MzAOez32dCdxYmf4nwIK8Te4HPlyZdhBwL7ASeAz4TJN1HAv8D3BeXs6ZwB8DPwOezn38AbBZnn9t27y6fWfm5a4ErgXGV9Z5dN4+TwNfrP4+5N+D2/PP/CRw7oj+LpYuhk4ovtz+CHBSg+I7G/gmsH5+vA9Qo2VVfhHmAGOBjZv8cjwG7JznuRy4ZKDiy89Pr81bmd7H6uL7KLAYeBswDrgCuLiub9/O/XoX8DKwY5PtNBe4NPdx59znG/O0scCjwHHAesCuuVDemacvA96Xn28O7LqW4lsFfCIvZ2NgB+CvgQ2BLUkfYF9v9v/XZPv+H/D2vLw+YFaethOpaP8C2AA4h/RhVtu+NwFH5efjgD1H8nexVw87G3kc2KJB+6vARGC7iHg1In4R+X9nLU6PiBci4qUm0y+OiHsi4gXSp++Hh2nA4COkT+sHI6IfOA2YXnc4d0ZEvBQRdwJ3kopwDbkvHwK+lH+Oe4DqeeUhwJKI+F5ErIqIX5I+RA7L018FdpK0SUQ8m6c383hE/GtezksRsTgiFkTEyxHxW+Bc4K/a3A7fi4jf5O1/KTAltx8G/DgiboyIV4AvkQq35lVgB0njI6I/Im5uc71tcfGttjXp0KfeP5P2JtdKelDSjBaW9Wgb0x8m7VHHt9TLtXtLXl512esBEyptT1Sev0j6hK+3ZX5ffT9rtgP2yANWKyStIBX+H+XpHyIdej4s6eeS/mwtfV5jW0naStJcSY9Jeh64hPa3TbOf8S3V9UXEi6TDz5rjSXvMX0u6TdIhba63LS4+QNJ7ScX3uqH0iFgZEadGxNuAQ4FPS9qvNrnJIgfaM25bef5W0ifuU8ALwBsr/RpDKoRWl/s4qTCqy15FOn9px2/z++r7WfMo8POI2KzyGBcRJwFExG0RMQ3YCvhP0t6nmfqf6ezc9qcRsQlwJKC1zN+OZcA2tReSNiad25L7/UBEHJH7/VXgMkljh7C+terp4pO0Sf50m0s6l7q7wTyHSNpBkkgn4q/lB6Rf6rcNYtVHStpJ0huBrwCXRRoq/w2wkaSDJa1PGjzZsPK+J4FJ1a9F6vwQ+HtJ20saB/wT8KOIWNVO53JfrgBOl/RGSTuRBmpqrgLeLukoSevnx3sl7ShpA0kfkbRpRLzK6m3WqjeRzstWSNoa+Ie66YPd5gCXAYdK+nNJGwBnUClsSUdK2jIifk8aYKPNvrelV4vvx5JWkj7BP086rziuybyTgf8m/ULcBHwjIvrytLOBL+RDr8+0sf6LSYM6TwAbAX8HEBHPAX8LfIc0wPECsLTyvv/I/z4tqdF51Oy87BuAh4DfkQYzBuMU0uHaE7mv36tNiIiVwAHAdNLe9gnSnqL2QXEUsCQfNn6ctPdq1RmkAZzngKtJHwJVg93mRMQi0vaYS9oLriSNCr+cZ5kKLJLUD/wLMD0iftfOOtpRG7Uz6zn56GAFMDkiHhrt9ffqns96lKRD86H0WNJXDXeTvr4YdS4+6zXTSIfKj5NOKaa38NXRiBiw+CRtJOlWSXdKWiTpjNy+vaRbJD0g6Uf5BBZJG+bXi/P0SZVlnZbb75d0YKV9am5b3OJQvtmgRMQJeXR204jYLyLuL9WXVvZ8LwP7RsS7SF9WTpW0J+kE+7yImAw8S/qOhPzvsxGxA+myoa8C5BGz6cA7SSe235A0Jg+nXwC8n3QFwhF5XrOuNmDxRdKfX9YusQpgX9LQLaSrHz6Qn09j9dUQlwH75WH6acDcfOXCQ6QvrnfPj8X5qoxXSCNR04b8k5l1uJauIs97p4Wk6+4uIF07t6Ly/dFS0pfU5H8fBYiIVZKeI32RuTVQvVyn+p5H69r3GKhP48ePj0mTJrXS/UF54YUXGDt2xL5fdT/W4X4M1IeFCxc+FRFbNp0ha6n48peuUyRtBlwJ7Nhotvyvmkxr1t5o79vwBFjSicCJABMmTOCcc84ZoOeD19/fz7hxja68Gl3uR+f1Y6A+7LPPPg83nVjR1t9PRcQKSX3AnsBmktbLe79tSKNHkPZc2wJL8wW9m5Kumay111Tf06y9fv3fAr4FsNtuu8Xee+/dTvfb0tfXx0gu3/1Yd/sxXH1oZbRzy7zHq10Ltz9wH3A9q69iP4b0d2mQ/gasdinSYcDP8lDuPNIV9htK2p40zHsrcBswOY+ebkAalJk35J/MrMO1suebCHw/n/e9Abg0Iq6SdC8wV9KZwK9Y/Zfg3wUulrSYtMebDunSHkmXkv7IchVwcj6cRdIpwDXAGGB2vgzIrKsNWHwRcRfw7gbtD5JGKuvbfwcc3mRZZwFnNWifD8xvob9mXcNXuJgV0luBNdZxJs24uq35T91lFcfm9yyZdfBIdGnUeM9nVoiLz6wQF59ZIS4+s0JcfGaFuPjMCnHxmRXi4jMrxMVnVoiLz6wQF59ZIS4+s0JcfGaFuPjMCnHxmRXi4jMrpJUApW0lXS/pvhwX/8ncfnq+e+gd+XFQ5T1txcI3i54362at7PlWAadGxI6kyMCTK3Hu50XElPyYD4OOhW8WPW/WtVqJi19Wu6F9vinifaxOmm6krVj4HCXfLHrerGu1dc6X7zj0buCW3HSKpLskzZa0eW77Q1x8VouFb9b+ZppHz5t1rZYDlPJdPC8HPhURz0u6EJhJinafCXwN+Cjtx8I3m79RH9aIi+/r62u1+23r7+8f0eW7H8mpu7R1u3gmbLz6PaW2y3Bti1ZvlLI+qfB+EBFXAETEk5Xp3wauyi/bjYV/iubR82twXHz39ePYQaSXfe3u9Gu75CPD359WjGZcvEgp1PdFxLmV9omV2T4I3JOftxULn6Pkm0XPm3WtVvZ8ewFHAXdLuiO3/SNptHIK6RBxCfAxGHQs/OdoHD1v1rVaiYu/kcbnZU3j3duNhW8WPW/WzXyFi1khLj6zQlx8ZoW4+MwKcfGZFeLiMyvExWdWiIvPrBAXn1khLj6zQlx8ZoW4+MwKcfGZFeLiMyvExWdWiIvPrBAXn1khLj6zQoYSF7+FpAU54n1BLbdTyfk5Ev4uSbtWlnVMnv8BScdU2t8j6e78nvNzaJNZVxtKXPwM4Loc8X5dfg0pDn5yfpwIXAipWIEvA3uQ8lq+XAnavTDPW3vf1KH/aGadbShx8dNI0e6wZsT7NGBOJDeTMjknAgcCCyLimYh4FlgATM3TNomIm3KM4BwcF289oOXEanhdXPyEiFgGqUAlbZVnazcufuv8vL7dRsmkFoJrT91lVdOA2yWzDh7uLvWEocTFN521QdvaYuEdF1+4H61Etldj2usNpX+Oix9Ao7h44ElJE/NebyKwPLc3i4tfCuxd196X27dpMP/rOC5+ZLQS2V6Naa83lNh2x8WvRbO4eFIsfG3EshrxPg84Oo967gk8lw9PrwEOkLR5Hmg5ALgmT1spac+8rqNxXLz1gKHExc8CLpV0PPAIcHieNh84iHRfvheB4wAi4hlJM0n3bAD4SkQ8k5+fBFwEbAz8JD/MutpQ4uIB9mswfwAnN1nWbGB2g/bbgZ0H6otZN/EVLmaFuPjMCnHxmRXi4jMrxMVnVoiLz6wQF59ZIS4+s0JcfGaFuPjMCnHxmRXi4jMrxMVnVoiLz6wQF59ZIS4+s0JcfGaFuPjMCmklQGm2pOWS7qm0nS7pMUl35MdBlWmn5dj3+yUdWGmfmtsWS5pRad9e0i05Qv5HkjYYzh/QrFO1sue7iMbx7edFxJT8mA+QY+SnA+/M7/mGpDGSxgAXkKLkdwKOyPMCfDUvazLwLHD8UH4gs3VFK3HxNwDPDDRfNg2YGxEvR8RDpASz3fNjcUQ8GBGvAHOBaTkqcF/gsvz+auy8WVdrKy6+zimSjgZuJ91I5VlSzPvNlXmq0e/1UfF7AG8GVkTEqgbzm42YViLym7lo6thh6cNgi+9CYCYp1n0m8DXgozSPfm+0h20rKh4cFz9SejEuvt31Vo1qXHy9iHiy9lzSt4Gr8stmUfE0aX+KdBej9fLer2lUfF6v4+JHQC/Gxbe73qqLpo4dnbj4RvK9GWo+CNRGQucB0yVtKGl70r32biWlVE/OI5sbkAZl5uWA3euBw/L7q7HzZl1twD2fpB+SbnAyXtJS0g0u95Y0hXSIuAT4GEBELJJ0KXAv6aaaJ0fEa3k5p5Du1zAGmB0Ri/IqPgfMlXQm8CvSfSHMul4rcfFHNGhuWiARcRZwVoP2+aT7ONS3P0gaDTXrKb7CxawQF59ZIS4+s0JcfGaFuPjMCnHxmRXi4jMrxMVnVoiLz6wQF59ZIS4+s0JcfGaFuPjMCnHxmRXi4jMrxMVnVoiLz6wQF59ZIYONi99C0oIc8b5A0ua5XZLOz5Hwd0natfKeY/L8D0g6ptL+Hkl35/ecn4N0zbpeK9GBFwH/BsyptM0ArouIWfm+CzNIQUjvJyWWTSaF4l4I7CFpC1Lw0m6k0KWFkubloN0LSVmcN5MyXqYCPxn6j2ajZSgBtL1ssHHx00jR7rBmxPs0YE4kN5MyOScCBwILIuKZXHALgKl52iYRcVOOEZyD4+KtRwz2nG9CRCwDyP9uldu35vWx8FsP0L60QbtZ1xvKvRoaaRb/3m5744U7Ln5EDDUufjT1fFw88KSkiRGxLB86Ls/tzeLil5KCd6vtfbl9mwbzN+S4+JEx1Lj40dTzcfGkWPjaiGU14n0ecHQe9dwTeC4fll4DHCBp8zwyegBwTZ62UtKeeZTzaBwXbz1isHHxs4BLJR0PPAIcnmefDxxEui/fi8BxABHxjKSZpHs2AHwlImqDOCeRRlQ3Jo1yeqTTesJg4+IB9mswbwAnN1nObGB2g/bbgZ0H6odZt/EVLmaFuPjMCnHxmRXi4jMrxMVnVoiLz6wQF59ZIeWvFzIbpHX9T5m85zMrxMVnVogPOztIo8OoU3dZ1dIV+EtmHTwSXbIR5D2fWSEuPrNCXHxmhbj4zApx8ZkV4uIzK8TFZ1bIkIpP0pIc9X6HpNtz27BFyZt1s+HY8+0TEVMiYrf8uhYlPxm4Lr+GNaPkTyTFxFOJkt8D2B34cq1gzbrZSBx2DkuU/Aj0y6yjDLX4ArhW0sKcJg3DFyVv1tWGem3nXhHxuKStgAWSfr2WeYccGd/tcfGNIsxbjWkf6ej0ToyLL6V0XDwAEfF4/ne5pCtJ52zDFSXfaH1dHRff6ALqVmPaRzo6vRPj4kspHRePpLGS3lR7ToqAv4dhipIfbL/M1hVD+QiZAFyZbyS7HvDvEfFTSbcxfFHy1qJ1/a+6e9Ggiy8iHgTe1aD9aYYpSt6sm/kKF7NCXHxmhbj4zApx8ZkVUv6Lmw40acbVLQcX1XOQkbXKez6zQlx8ZoW4+MwKcfGZFeLiMyvExWdWiIvPrBAXn1kh/pJ9mPlPe6xV3vOZFeLiMyukKw87fehn6wLv+cwK6ZjikzRV0v05Tn7GwO8wW7d1RPFJGgNcQIqU3wk4QtJOZXtlNrI6ovhIeZ+LI+LBiHgFmEuKlzfrWp1SfI6Mt56jlOhXuBPS4cCBEXFCfn0UsHtEfKJuvj/ExQPvAO4fwW6NB54aweW3yv1YUyf0Y6A+bBcRWw60kE75qqFZlPwaqnHxI03S7ZXbnhXjfnReP4arD51y2HkbMFnS9pI2AKaT4uXNulZH7PkiYpWkU0j3aBgDzI6IRYW7ZTaiOqL4ACJiPul+Dp1iVA5vW+B+rKkT+jEsfeiIARezXtQp53xmPcfFV0fSbEnLJd1TuB/bSrpe0n2SFkn6ZIE+bCTpVkl35j6cMdp9qOvPGEm/knRVwT4skXS3pDsk3T6kZfmwc02S/hLoB+ZExM4F+zERmBgRv8w3IV0IfCAi7h3FPggYGxH9ktYHbgQ+GRE3j1Yf6vrzaWA3YJOIOKRQH5YAu0XEkL9r9J6vTkTcABS/OWdELIuIX+bnK4H7GOWrfiLpzy/Xz48in9aStgEOBr5TYv0jwcW3DpA0CXg3cEuBdY+RdAewHFgQEaPeh+zrwGeB3xdaf00A10pamK+4GjQXX4eTNA64HPhURDw/2uuPiNciYgrpqqPdJY36obikQ4DlEbFwtNfdwF4RsSvpL3BOzqcpg+Li62D5POty4AcRcUXJvkTECqAPmFpg9XsBf5PPt+YC+0q6pEA/iIjH87/LgStJf5EzKC6+DpUHO74L3BcR5xbqw5aSNsvPNwb2B3492v2IiNMiYpuImES69PBnEXHkaPdD0tg8+IWkscABwKBHxV18dST9ELgJeIekpZKOL9SVvYCjSJ/yd+THQaPch4nA9ZLuIl1/uyAiig3zd4AJwI2S7gRuBa6OiJ8OdmH+qsGsEO/5zApx8ZkV4uIzK8TFZ1aIi8+sEBefWSEuPrNCXHxmhfw/Bf4qkV+MISwAAAAASUVORK5CYII=\n", "text/plain": [""]}, "metadata": {}, "output_type": "display_data"}], "source": ["ax = rate['rating'].hist(bins=10, figsize=(3,3))\n", "ax.set_title('Distribution des ratings');"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Les gens pr\u00e9f\u00e8rent les ratings arrondis."]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": [""]}, "metadata": {}, "output_type": "display_data"}], "source": ["ax = rate['dt'].hist(bins=50, figsize=(10,3))\n", "ax.set_title('Distribution des dates');"]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"data": {"image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAADhCAYAAAAkqmXdAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAHVJJREFUeJzt3Xu4HFWd7vHva7hf5KoRQiRAkMcoHoU9omfUieNlgoD6eLyQwREUQc8cnPEc5mgQxsEZL+gZPCOiB6NiRJSIOGoiYdBR9iAjAwheuEQwYjQxSrhIYGe8EPidP9ZqqHS69+7e3V3Vu+v9PE8/u7suq1ZVr/7tVauq1lJEYGZm9fC4qjNgZmblcdA3M6sRB30zsxpx0DczqxEHfTOzGnHQNzOrkaEL+pIukPS3fUrryZImJM3Kn8clvbkfaef0rpB0Yr/Sm2Q7J0m6ZtDb6ZWktZJeXOL2+vp9VsnlvuV2XO5bb6+n73O7fmZmKpLWArOBLcDDwG3ARcDSiHgEICLe2kVab46If223TET8Atitt1w/ur2zgfkR8fpC+kf3I20bbS73NkyqqOkfFxG7AwcC5wDvBD7d741IKvUfmnWnht+Py70NxfdTWfNORGyKiBXA64ATJT0dQNIySe/N7/eV9HVJ90u6T9J3JD1O0ueAJwMr82nsOyTNkxSSTpb0C+DbhWnFA32IpOslbZL0NUl7520tlLS+mMfGaZukRcC7gNfl7f0wz3/0NCvn6yxJP5e0UdJFkvbI8xr5OFHSLyTdI+nMdsdG0j6SVkh6QNL1wCGFedvs02Sne5LOlnRpzs+Dkm6VNFaY/9S8/v153ssL85ZJ+ng+nZ+Q9O+SniTpnyT9RtKPJT2raZN/JOm2PP8zknYqHl9J75T0a+Azefqxkn6Qt/9dSc+Y5Li8JG9zk6TzATXNf5Ok1XnbV0o6sF1aVXG5d7nP0ysr95W36UfE9cB64PktZp+e5z2BdHr8rrRK/AXwC1LtabeI+FBhnT8Bngr8WZtNvgF4E7A/6XT7vA7y+C/A+4Ev5u39lxaLnZRfLwQOJp1en9+0zPOAw4AXAe+W9NQ2m/wY8Dtgv5zXN02Vxym8HFgO7AmsaORL0vbASuAbwBOBtwGfl3RYYd3XAmcB+wK/B64FbsqfLwM+3LStE0jH/hDgKXndhicBe5Nqu6dKOgK4EHgLsA/wCWCFpB2bd0DSvsCXC3n5KfDHhfmvJJWPV5HKy3eASzo7POVzuW/J5b7JIMp95UE/20A6KM0eIhWAAyPioYj4TkzdWdDZEbE5In7bZv7nIuKWiNgM/C3wWuULXj06AfhwRNwZERPAGcDxTbWt90TEbyPih8APgW1+RDkv/w14d96PW4DP9pi3ayJiVUQ8DHyusN3nkH6k50TEHyLi28DXgcWFdb8SETdGxO+ArwC/i4iLclpfBJprPOdHxLqIuA94X1NajwB/FxG/z9/PKcAnIuK6iHg4Ij5L+oE9p8U+vAy4LSIui4iHgH8Cfl2Y/xbgAxGxOiK2kILVM4extl/gcp+53JdX7ocl6M8B7msx/f8Aa4BvSLpT0pIO0lrXxfyfA9uT/oP2av+cXjHt7Ug1tYbil/WftL7Y9oS8XnM+e9G83Z3yj3J/YF3jYmJhW3MKn+8qvP9ti8/N+9Cc7/0Ln+/OP6KGA4HT8ynu/ZLuB+Y2rdOwfzHtHASL2zoQ+EghnftIp8FzGF4u949xuS+p3Fce9CX9ESmD29yaFREPRsTpEXEwcBzwvyS9qDG7TZJT1YjmFt4/mVSrugfYDOxSyNcsUkHsNN0NpC+gmPYWti4snbg7r9ecz4bN+e8uhWlP6nIbDRuAuZKK5eDJwC+nmR5sm+8Nhc/Nx3Ad8L6I2LPw2iUiWp2e/qqYtiQ1bWsd8JamtHaOiO/2sC8D43K/DZf7ksp9ZUFf0uMlHUtqc7s4Im5uscyxkubnHX2AdLvbw3n2XaQ2xG69XtICSbsAfw9clk/Z7iDVBI7JbX5nAcU2truAeU0FpegS4H9KOkjSbjzWFrqlm8zlvPwzcLakXSQtAE4szL+bVDhfL2mWpDdRuODVpetIP6Z3SNpe0kJSkFk+zfQA/oekA5QuFL6LdCrczieBt0o6Ssmu+fjv3mLZy4GnSXpVrq39FVv/6C8AzpD0NABJe0h6TQ/7MRAu96253JdX7qsI+islPUj6D3Um6YLIG9sseyjwr8AE6ULKxyNiPM/7AHBWPq35my62/zlgGenUbyfSQSQiNgF/CXyKVLg2ky6mNXwp/71X0k0t0r0wp3018DPSBam3dZGvotNIp4+/znn9TNP8U4D/DdwLPA2YVm02Iv5Auth1NKnW93HgDRHx42nlOvkC6QLZnfn13km2/z3SvpwP/IbUpHFSm2XvAV5Dut3xXlLZ+PfC/K8AHwSWS3oAuCXv17BwuZ+ay/22y/a93Gvq60NmZjYqKm/TNzOz8jjom5nViIO+mVmNOOibmdVIJUFf0mGSvq/UJ8ZfqdCtrFr0BTKTSPpjSTfkW7dmNEnPl3R74fOjXchKepekT1WXO7PuSNpZ0kqlPmy+JOkESd8ozA9J86vMYxmq6vHtHcB4RDQ/yjwjSArg0IhY0zR9Luk+5WPy49gzSvN+RcR3SH2mbCMi3t+H7S0k3at+QK9pWXnalf8Z4NWkJ4X3KTxH8PkK81OJqpp3DgRurWjbA5P73viTiNhYdV40BF24Vs3HoL7afPcHAnd0++DYyImIUl/At0lPF/6O9PDJU0gPYrw3z18IrC8sv5b0QMaPSA+OfJr03/oK4EHSQyx75WV3Ai4mPcRwP3ADMLtNPtYCf5PT3UR6gm6nwvxTSA9N3EfqoW//PP1q0mPVm3P+X5enHwv8IG/3u8AzCmm9k/Tgy4PA7cCL2uRpGekJu2/mZf+N1OlWY/5HSA/3PADcCDy/MO9sUu9/F+f5b26R/nhxOumBkGva7Veb7+LFhe1dPNVxJz2AtDrvz52kR8YBdiX1YfJI3t4EqZ+RxwFLSL0J3gtcCuzdqmy0ydOkx8CvepZ/4D3AH0jdT0wAJxfLf14mSAPGNPLycVKcmSA9EPUkUodnvwF+DDyr6u9xWt99RQVunK2DzzImD/r/QQr0c4CNpC5On0V6XPzbpB7sIPU4t5LUP8cs4Ejg8ZMU+utJgWZvUmB6a573p6Qn9Y7I2/gocHWrwpE/H5HzdVTe7ok5/R1JzSPrCj+aecAhkxT6B4EX5HU/0lQoX0/qinU7Uve7vyb/UHOhfwh4JSlw7tzBcW9b6Cf5LloF/bbHHTiG9Li8SN3//idwRKv087S35+/7gHwMPgFcMsnyzXma9Bj4Vevy/2iZnar857zck8vyTqQ48zNSF9WzSE/cXlX19zid10y5e+ejEXFXRPyS1F/0dRHx/Yj4Panb08a1gYdIhWJ+pC5Lb4yIByZJ97yI2BCp/X0l8Mw8/QTgwoi4KW/jDOC5kua1SWeyrlIfJhXgBZK2j4i1EfHTSfJ0eURcnbd7Zt7uXICIuDgi7o2ILRFxLo/9qBqujYivRsQj0b6L3UFoe9wj4vKI+Gkk/0Z6VL1VH/INbwHOjIj1+RicDby6i6aaqo7BTOTyP7luu1eeEWZK0O+0m9PPAVeS+qHYIOlDuROpdtp1+bpVd7GR+gm/l/bdlbbtKjXSxa63k4LXRknLJbXqQrWh2I3qBOn0en8ASacrjZCzKW9jD7buHneq7nUHpe1xl3S0pP9QGgHqflL/4JN16Xsg8JXCcVxNChyzJ1mnqKpjMBO5/E+u2+6VZ4SZEvQ7EmnAifdExALgv5LaGd8wjaS26i5W0q6kmmy7rlcn7So1Ir4QEc/LaQapg6R2it2o7kY69d4g6fmkttHXkq5h7Elqiy0OnTZVR0pbdaPL9Lum3Uq74640EtCXgX8ktfHvCawq5LlVftcBRzcdy53yWd5U3QC3S9O6M6rl3xixoC/phZIOz8HgAVKzw8NTrNbKF4A3SnpmDlzvJzUprc3zm7u3bdtVan4m4U9zOr8j1RAmy9PLJD1P0g7AP+TtrgN2J/U3fjewnaR3A4/vcr9+ALxKqeva+aSLWUXT6rZ3kuO+A+kU/G5gi6SjgZc2bW8f5TFVswuA9ymP/CPpCZJekedN1Q2w9ceoln9jxII+qeZ6GSnwrCZd/b+420Qi4lukIeW+TBrE4BDg+MIiZwOfzaeyr43Ju0rdkdQt6j2k0+knkvrbbucLwN+RTmuPJLWvQmo+uYIU+H5O+gF1ezr7f0l3MNxFGoqu+R7lrfari3RbHveIeJDUhe+lpOPy56Q7QQCI1JXtJcCdeZv7ky7erSCNGvUg6aLuUXn5qboBtj4Y4fJvuGvloSJpGenulLOmWtZs1Lj8l2PUavpmZjYJB32zPsjt2DcqDYVoNrTcvGPWgqQLSXchbYyIpxemLyJdd5gFfCoizsnT/550jeHWiPh6BVk264iDvlkLkl5Aevz+okbQz3cn3QG8hHQB+QZgMek+8n1JT27e46Bvw8wdUpm1EBFXt3gC9dnAmoi4E0DScuAVpId0dgUWAL+VtCoiHikxu2YdG4qgv++++8a8efO2mb5582Z23XXX8jNUglHeN6hm/2688cZ7IqL5Ya1+msPWtwmuB46KiNMAJJ1Equm3DPiSTgVOBdh5552PnDt37jbLPPLIIzzuccN5qc15m55B5u2OO+7ovsw3d8ZTxevII4+MVq666qqW00fBKO9bRDX7B3wv+tsp2TzglsLn15Da8Ruf/4LUL1QtyrzzNj2DzNt0ynyl/xolHSdp6aZNm6rMhlmn1lPoJoDUE+iGbhJwmbeqVRr0I2JlRJy6xx57TL2wWfVuAA6VdFDuJuB4Ck8Yd8Jl3qo2nI1gZhWTdAlwLXCYpPWSTo404tJppC4BVgOXRkRXI8C5pm9VG4oLuWbDJiIWt5m+itRT6HTTXQmsHBsbO2W6aZj1wjV9sxK5pm9VG/qa/rwllwOw9pxjKs6JWe86qek3yjy43Fv/uaZvViLX9K1qvmXTrES+e8eq5ls2zcxqxM07ZmY14qBvViI3aVrVHPTNSuQmTauag76ZWY046JuVyM07VjUHfbMSuXnHquagb2ZWIw76ZmY14qBvZlYj7obBrEQu81Y1d8NgViKXeauam3fMzGrEQd/MrEYc9M3MasRB38ysRhz0zcxqxEHfrES+ZdOq5qBvViLfsmlVc9A3M6sRB30zsxpx0DczqxEHfTOzGnHQNzOrkYEEfUmvlPRJSV+T9NJBbMPMzLq3XacLSroQOBbYGBFPL0xfBHwEmAV8KiLOiYivAl+VtBfwj8A3es3ovCWXP/p+7TnH9JqcWV9Jeirw18C+wLci4v9VnCWzlrqp6S8DFhUnSJoFfAw4GlgALJa0oLDIWXm+2Ywj6UJJGyXd0jR9kaTbJa2RtAQgIlZHxFuB1wJjVeTXrBMd1/Qj4mpJ85omPxtYExF3AkhaDrxC0mrgHOCKiLipVXqSTgVOBZg9ezbj4+PbLDMxMcHphz+8zfRWy840ExMTI7Ef7YzI/i0DzgcuakwoVHReAqwHbpC0IiJuk/RyYElex2wodRz025gDrCt8Xg8cBbwNeDGwh6T5EXFB84oRsRRYCjA2NhYLFy7cJvHx8XHOvWbzNtPXnrDtsjPN+Pg4rfZ5VIzC/nVT0QFui4gVwApJlwNfKDOvZp3qNeirxbSIiPOA83pM22wYtazoSFoIvArYEVjVasXpnN0O09nSMJ+9OW+d6zXorwfmFj4fAGzodGVJxwHHzZ8/v8dsmJWmXUVnHBifbMWIWCrpV8Bxu++++5Ednd3enN4Pw80Lw3z25rx1rtdbNm8ADpV0kKQdgOOBFZ2u7M6nbAbqqaLjMm9V6zjoS7oEuBY4TNJ6SSdHxBbgNOBKYDVwaUTcOpismg2Fnio67lrZqtZx0I+IxRGxX0RsHxEHRMSn8/RVEfGUiDgkIt7Xzcb9A7BhNoiKjmv6VrVe2/R7EhErgZVjY2OnVJkPs1YiYnGb6atoc7HWbNi57x2zEvns1qpWaU1/unfvuEsGm6l8dmtVq7Sm7/ZNqxvX9K1qbt4xK5ErOlY1B30zsxqpNOj7VNfqxmXequY2fbMSucxb1Sq9e8fMOuM71qxf3KZvZlYjDvpmJXKbvlXNF3LNSuQ2fauaL+SamdWIm3fMzGrEQd/MrEYc9M1K1I/rWPOWXL7VLZxm3XDQNyuRr2NZ1Wb83Tuu9ZiZdc5375iZ1Yibd8zMasRB38ysRhz0zcxqxEHfrETuesSq5qBvViLfvGBVm/G3bJqZWecqHUQlIlYCK8fGxk7pNS0PMmFmNjWPnGU2Q7miY9PhNn0zsxpx0DczqxE375iNADf1WKdc0zfrA0mvlPRJSV+T9NKq82PWjoO+WRuSLpS0UdItTdMXSbpd0hpJSwAi4qsRcQpwEvC6CrL7KPc8a5Nx0DdrbxmwqDhB0izgY8DRwAJgsaQFhUXOyvPNhpKDvlkbEXE1cF/T5GcDayLizoj4A7AceIWSDwJXRMRNZefVrFO+kGvWnTnAusLn9cBRwNuAFwN7SJofERc0ryjpVOBUgNmzZzM+Pr5N4hMTE5x++MN9yWir9HsxMTHR9zT7xXnrXKVBX9JxwHHz58/va7q+k8EGSC2mRUScB5w32YoRsVTSr4Djdt999yMXLly4zTLj4+Oce83mvmSUmx9Lpx+/g/HxcVrleRg4b53zyFlm3VkPzC18PgDY0OnKLvNWNbfpm3XnBuBQSQdJ2gE4HljR6cruZNCq5qBv1oakS4BrgcMkrZd0ckRsAU4DrgRWA5dGxK2dpumavlVt5C/kNtr33bZv3YqIxW2mrwJWlZwds75wTd+sRG7esaqNfE3fbJj0cwyJbviONmtwTd+sRK7pW9Uc9M1K5Au5VjUHfTOzGnHQNyuRm3esag76ZiVy845VzUHfzKxGfMumWU35Ns566nvQl3QwcCawR0S8ut/pm81kg+pZthseVaveOmre6XLYuDsj4uRBZNZspnObvlWt05r+MuB84KLGhMKwcS8hdTd7g6QVEXFbvzNpZoPlpp766CjoR8TVkuY1TX502DgAScuBVwAdBf0qRxG6+ZfpdrnD51RX2xq20XT6bdT3z2ym6qVNv+WwcZL2Ad4HPEvSGRHxgVYrR8RSYCnA2NhYDHoUobUnPJb+SY2eNwvTyjZso+n026jvn9lM1UvQbzds3L3AW3tI12xkDcOFXKu3Xu7T72nYOPDTiVY/vpBrVesl6Pc0bBz4B2BmVrZOb9ns+7BxZmZWvk7v3hnIsHFltm/6gRSz/vAQpDNbpX3vuHnHzKxc7nvHzLbS6qzYtfrR4V42zUrkO9asapUGff8ArG7cpGlVc5u+mVmNuHnHzKxGKr2QO0yPpE/Vy6B7ITRrzxd/Z45Kg35ErARWjo2NnVJlPsyse372ZWZy846ZTWneksu5+ZebHOhHgIO+mVmNOOib9YGkgyV9WtJlVefFbDK1vpDrU1WbjKQLgWOBjRHx9ML0RcBHgFnApyLinDyC3MkO+jbsfJ++WXvLgEXFCYWxoY8GFgCLJS0oP2tm0+O+d8za6PfY0GWPC91vs3eG0w/f0vHyrfavMT419HeM6mEek3nY8uagb9adaY8NXfa40P12+uFbOPfmzkNGqzGoTyo+79LHMaqHeUzmYcubg75Zd3oaG7rq61hmvnvHrDs9jQ3t61hWNfeyOQ3zllzuO3/qq6exoWdqmZ+Oxu+kjN+KHxzrnO/eMWtjEGNDu8xb1dymb9bGoMaGNquS2/TNSlSn5h0bTg76ZiVy845VzUHfrESu6bfmmyPK46BvViLX9K1qte5wbSr9rHl45C0zGwa+ZdOsRG7esaq5ecesRK7oWNUc9M3MasRB38ysRvxErlmJhv3mhUFpdVPEVDdKTHXzQ3H+6Yd3v35duaZvViK36VvVHPTNzGrEQd/MrEYc9M3MasQXcs1KVNcLuf0y3afkG+v5gq67YRioXu9OsNETESuBlWNjY6dUnRerJ3fDYGZWI27TNzOrEQd9M7MacdA3M6sRB30zsxrxLZtmJRr1O9Z6NeghE9ulX6e751zTNyuR71izqjnom5nViIO+mVmNOOibmdWIg76ZWY046JuZ1YiDvplZjfT9Pn1JuwIfB/4AjEfE5/u9DbNh4jJvM0lHNX1JF0raKOmWpumLJN0uaY2kJXnyq4DLIuIU4OV9zq9ZKVzmbVR12ryzDFhUnCBpFvAx4GhgAbBY0gLgAGBdXuzh/mTTrHTLcJm3EaSI6GxBaR7w9Yh4ev78XODsiPiz/PmMvOh64DcR8XVJyyPi+DbpnQqcCjB79uwjly9fvs0yExMT/GzT8P6GDp/z2FOVN/9yU1frzt4Z7vpt67Qm0247na7fLq3G+lOl32qdViYmJthtt926zlMrnW7zhS984Y0RMdaXjeIy36y5zA6TRt56+U22067MNdJvt83G9I33bWp53Kb6TbVKv9l0ynwvbfpzeKx2A6ngHwWcB5wv6RhgZbuVI2IpsBRgbGwsFi5cuM0y4+PjnHvN5h6yOFhrT1j46PuTuuwz5PTDt3DuzY8d/mJak2m3nU7Xb5dWY/2p0m+1Tivj4+O0+k6no9NtlqDWZb65zA6TRt56+U22067MNdJvt83G9I9+/mstj9tUv6lW6fdDL9+gWkyLiNgMvLGHdM2GVc9l3h2uWdV6uWVzPTC38PkAYEM3CUg6TtLSTZv6cxpmNmA9l3l3uGZV6yXo3wAcKukgSTsAxwMruknAPwCbYXou867oWNU6vWXzEuBa4DBJ6yWdHBFbgNOAK4HVwKURcevgsmpWnkGVeVd0rGodtelHxOI201cBq6a7cbdv2rBymbdRVWk3DK71WN24zFvVOr5Pf6CZkO4Gft5i1r7APSVnpyyjvG9Qzf4dGBFPKHmb0zJDy7zzNj2DzFvXZX4ogn47kr7Xz4dthsko7xuM/v4NyjAfN+dteoYtb+5l08ysRhz0zcxqZNiD/tKqMzBAo7xvMPr7NyjDfNyct+kZqrwNdZu+mZn117DX9M3MrI+GMui3Gahi6LUaeEPS3pK+Kekn+e9eeboknZf38UeSjiisc2Je/ieSTqxiX5pJmivpKkmrJd0q6a/z9JHYv2FQdbnvV/kdUN76Vv4GkLedJF0v6Yc5b+/J0w+SdF3O2xdz1x1I2jF/XpPnzxtU3lqKiKF6AbOAnwIHAzsAPwQWVJ2vDvP+AuAI4JbCtA8BS/L7JcAH8/uXAVeQem58DnBdnr43cGf+u1d+v9cQ7Nt+wBH5/e7AHaSBREZi/6p+DUO570f5HfbyN6C8Cdgtv98euC5v81Lg+Dz9AuC/5/d/CVyQ3x8PfLHU77nMjXV4AJ8LXFn4fAZwRtX56iL/85p+NLcD++X3+wG35/efABY3LwcsBj5RmL7VcsPyAr4GvGRU96+C4zkU5b7X8ltiPqdV/krI1y7ATaRxFu4Btmv+fkl9Nz03v98uL6eyjt0wNu+0GqhiTkV56YfZEfErgPz3iXl6u/0c+v3Pp6PPItVoRm7/KjKsx6Xb73fgeix/g8rTLEk/ADYC3ySdtd0fqZO+5u0/mrc8fxOwz6Dy1mwYg37LgSpKz8XgtdvPod5/SbsBXwbeHhEPTLZoi2lDv38VmmnHpZL89qH8DUREPBwRzySNsfBs4KmTbL/S73oYg37PA1UMmbsk7QeQ/27M09vt59Duv6TtST+4z0fEP+fJI7N/FRvW49Lt9zswfSp/AxUR9wPjpDb9PSU1ejIubv/RvOX5ewD3DTpvDcMY9HseqGLIrAAad6icSGqLbEx/Q77L4DnApnx6eiXwUkl75TsRXpqnVUqSgE8DqyPiw4VZI7F/Q2BYy3233+9A9LH8DSJvT5C0Z36/M/Bi0ngLVwGvbpO3Rp5fDXw7cgN/Kcq6eNDlxZCXka7O/xQ4s+r8dJHvS4BfAQ+R/pufTGqr+xbwk/x377ysgI/lfbwZGCuk8yZgTX69ser9ynl6HukU9EfAD/LrZaOyf8Pwqrrc96v8Dnv5G0DengF8P+ftFuDdefrBwPW5nH8J2DFP3yl/XpPnH1zm9+wncs3MamQYm3fMzGxAHPTNzGrEQd/MrEYc9M3MasRB38ysRhz0zcxqxEHfzKxGHPTNzGrk/wM2EthLUJG1TAAAAABJRU5ErkJggg==\n", "text/plain": [""]}, "metadata": {}, "output_type": "display_data"}], "source": ["import matplotlib.pyplot as plt\n", "fig, ax = plt.subplots(1,2, figsize=(6,3))\n", "gr = rate[[\"userId\", \"movieId\"]].groupby('userId').count()\n", "gr.hist('movieId', bins=50, figsize=(3,3), ax=ax[0])\n", "ax[0].set_yscale('log')\n", "ax[0].set_title('Distribution du nombre de\\nfilms not\u00e9s par utilisateur')\n", "gr = rate[[\"userId\", \"movieId\"]].groupby('movieId').count()\n", "gr.hist('userId', bins=50, figsize=(3,3), ax=ax[1])\n", "ax[1].set_yscale('log')\n", "ax[1].set_title('Distribution du nombre de\\nnotes par film');"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Il y a quelques utilisateurs z\u00e9l\u00e9s et quelques films suscitant beaucoup d'int\u00e9r\u00eat. Ce ne sont pas des valeurs aberrantes mais il faudra songer \u00e0 regarder \u00e7a de plus pr\u00e8s un jour. Noter plus de 2000 films para\u00eet suspect. M\u00eame si les votes s'\u00e9talent sur les 20 ans de collecte, cela fait un film tous les 3-4 jours. Il faut transformer les donn\u00e9es sous la forme d'une matrice [sparse](https://docs.scipy.org/doc/scipy/reference/sparse.html)."]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"data": {"text/plain": ["matrix([[1., 0., 0., 0., 0.],\n", " [0., 2., 3., 0., 0.],\n", " [0., 0., 0., 4., 5.]])"]}, "execution_count": 11, "metadata": {}, "output_type": "execute_result"}], "source": ["from scipy.sparse import csr_matrix\n", "import numpy\n", "\n", "\n", "def conversion(rating, shape=None, movieId_col=None, userId_row=None):\n", " rating = rating[['userId', 'movieId', 'rating']].dropna()\n", " coefs = {}\n", " posix = {}\n", " movieId_col = movieId_col.copy() if movieId_col is not None else {}\n", " userId_row = userId_row.copy() if userId_row is not None else {}\n", " for ind, uid, mid, note in rating.itertuples():\n", " if uid not in userId_row:\n", " userId_row[uid] = len(userId_row)\n", " if mid not in movieId_col:\n", " movieId_col[mid] = len(movieId_col)\n", " row = userId_row[uid]\n", " col = movieId_col[mid]\n", " if row not in coefs:\n", " coefs[row] = []\n", " posix[row] = []\n", " coefs[row].append(note)\n", " posix[row].append(col)\n", " \n", " nbcoefs = sum(map(len, coefs.values()))\n", " indptr = numpy.zeros(len(coefs)+1)\n", " indices = numpy.zeros(nbcoefs)\n", " data = numpy.zeros(nbcoefs) \n", " nb = 0\n", " for row in range(len(userId_row)):\n", " cs = coefs[row]\n", " ps = posix[row]\n", " indptr[row] = nb\n", " for i, (p, c) in enumerate(sorted(zip(ps, cs))):\n", " indices[nb] = p\n", " data[nb] = c\n", " nb += 1 \n", " \n", " indptr[-1] = nb\n", " if shape is None:\n", " shape = (len(userId_row), len(movieId_col))\n", " mat = csr_matrix((data, indices, indptr), shape=shape)\n", " if mat.max() != data.max():\n", " end = min(10, len(indptr))\n", " raise RuntimeError(\"La conversion s'est mal pass\u00e9e.\\ndata={0}\\nindices={1}\\nindptr={2}\".format(\n", " data[:end], indices[:end], indptr[:end]))\n", " return mat, userId_row, movieId_col\n", "\n", "\n", "petit = pandas.DataFrame(dict(userId=[0, 1, 1, 5, 5], movieId=[0, 1, 2, 4, 10], \n", " rating=[1, 2, 3, 4, 5]))\n", "\n", "mat, userId_row, movieId_col = conversion(petit)\n", "numpy.nan_to_num(mat.todense())"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"data": {"text/plain": ["({0: 0, 1: 1, 5: 2}, '*', {0: 0, 1: 1, 2: 2, 4: 3, 10: 4})"]}, "execution_count": 12, "metadata": {}, "output_type": "execute_result"}], "source": ["userId_row, '*', movieId_col"]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/plain": ["matrix([[2.5, 3. , 3. , 2. , 4. ],\n", " [0. , 0. , 0. , 0. , 0. ],\n", " [0. , 0. , 0. , 0. , 0. ],\n", " [0. , 0. , 0. , 0. , 0. ],\n", " [0. , 0. , 0. , 0. , 0. ]])"]}, "execution_count": 13, "metadata": {}, "output_type": "execute_result"}], "source": ["mat, userId_row, movieId_col = conversion(rate)\n", "numpy.nan_to_num(mat[:5,:5].todense())"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On cale une factorisation de matrice."]}, {"cell_type": "code", "execution_count": 13, "metadata": {"scrolled": false}, "outputs": [{"data": {"text/plain": ["NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=400,\n", " n_components=20, random_state=None, shuffle=True, solver='cd',\n", " tol=0.0001, verbose=0)"]}, "execution_count": 14, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.decomposition import NMF\n", "mf = NMF(n_components=20, shuffle=True, max_iter=400)\n", "mf.fit(mat)"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"data": {"text/plain": ["897.9791684368183"]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["mf.reconstruction_err_ "]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[0. , 0.05039834, 0. , 0.00358679, 0. ],\n", " [0. , 0.01736155, 1.91278876, 0. , 0. ],\n", " [0.20592908, 0. , 0.29535495, 0. , 0. ],\n", " [0. , 0.11052953, 0. , 0.39806458, 0. ],\n", " [0.71489676, 0. , 0.20115088, 0.02163221, 0. ]])"]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["wh = mf.transform(mat)\n", "wh[:5,:5]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["L'erreur ne dit pas grand chose sur la pertinence de la recommandation. Le plus simple est d'enlever des notes pour voir si on les retrouve."]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": ["from sklearn.model_selection import train_test_split\n", "rate_train, rate_test = train_test_split(rate)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Il faut quand m\u00eame s'assurer que la matrice \u00e0 d\u00e9composer a les m\u00eames dimensions que la pr\u00e9c\u00e9dente avec toutes les donn\u00e9es."]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": ["shape0 = mat.shape\n", "mat_train, userId_row_train, movieId_col_train = conversion(rate, \n", " shape=shape0, userId_row=userId_row, movieId_col=movieId_col)"]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"data": {"text/plain": ["NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=400,\n", " n_components=20, random_state=None, shuffle=True, solver='cd',\n", " tol=0.0001, verbose=0)"]}, "execution_count": 19, "metadata": {}, "output_type": "execute_result"}], "source": ["mf.fit(mat_train)"]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [{"data": {"text/plain": ["898.1781492558509"]}, "execution_count": 20, "metadata": {}, "output_type": "execute_result"}], "source": ["mf.reconstruction_err_"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On calcule l'erreur sur les bases d'apprentissage et de test."]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
\n", " \n", " \n", " | \n", " userId | \n", " movieId | \n", " rating | \n", " prediction | \n", "
\n", " \n", " index | \n", " | \n", " | \n", " | \n", " | \n", "
\n", " \n", " \n", " \n", " 13240 | \n", " 85 | \n", " 356 | \n", " 4.0 | \n", " 4.934174 | \n", "
\n", " \n", " 75789 | \n", " 527 | \n", " 1276 | \n", " 4.5 | \n", " 0.345162 | \n", "
\n", " \n", " 42713 | \n", " 306 | \n", " 1513 | \n", " 4.0 | \n", " 2.439180 | \n", "
\n", " \n", " 56562 | \n", " 407 | \n", " 3635 | \n", " 3.0 | \n", " 0.406101 | \n", "
\n", " \n", " 63681 | \n", " 457 | \n", " 122882 | \n", " 0.5 | \n", " 0.903360 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" userId movieId rating prediction\n", "index \n", "13240 85 356 4.0 4.934174\n", "75789 527 1276 4.5 0.345162\n", "42713 306 1513 4.0 2.439180\n", "56562 407 3635 3.0 0.406101\n", "63681 457 122882 0.5 0.903360"]}, "execution_count": 21, "metadata": {}, "output_type": "execute_result"}], "source": ["def predict(mf, mat_train, test, userId_row, movieId_col): \n", " W = mf.transform(mat_train)\n", " H = mf.components_\n", " wh = W @ H\n", " test = test[['userId', 'movieId', 'rating']]\n", " predictions = []\n", " for ind, uid, mid, note in test.itertuples(): \n", " row = userId_row[uid]\n", " col = movieId_col[mid]\n", " try:\n", " pred = wh[row, col] \n", " except Exception as e:\n", " raise Exception(\"Issue with uid={} mid={} row={} col={} shape={}\".format(uid, mid, row, col, wh.shape))\n", " predictions.append((ind, pred))\n", " dfpred = pandas.DataFrame(data=predictions, columns=['index', 'prediction']).set_index('index')\n", " dfall = pandas.concat([test, dfpred], axis=1)\n", " return dfall\n", "\n", "pred = predict(mf, mat_train, rate_test, userId_row_train, movieId_col_train)\n", "pred.head()"]}, {"cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [{"data": {"text/plain": ["-4.659895568960519"]}, "execution_count": 22, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.metrics import r2_score\n", "r2_score(pred['rating'], pred['prediction'])"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Pas extraordinaire. Faisons varier *k*."]}, {"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["5 {'train_time': 0.7479112074008754, 'k': 5, 'r2': -6.014138361502092, 'err_test': 7.929004708705666, 'err_train': 981.0205349504541}\n", "10 {'train_time': 1.910340634477734, 'k': 10, 'r2': -5.388353487045178, 'err_test': 7.221597617417111, 'err_train': 943.8568240851894}\n", "15 {'train_time': 2.2207374467998306, 'k': 15, 'r2': -4.96900640010828, 'err_test': 6.747554355716667, 'err_train': 918.0198374341521}\n", "20 {'train_time': 5.637187125555101, 'k': 20, 'r2': -4.694288211755458, 'err_test': 6.437004193066304, 'err_train': 897.9357561628665}\n", "25 {'train_time': 7.713239363839193, 'k': 25, 'r2': -4.420273628728207, 'err_test': 6.127249408216037, 'err_train': 878.84542031377}\n", "30 {'train_time': 12.43995074364534, 'k': 30, 'r2': -4.195368644607753, 'err_test': 5.873009673241581, 'err_train': 862.2473126443812}\n", "35 {'train_time': 15.610665020047463, 'k': 35, 'r2': -3.9997376232229183, 'err_test': 5.6518621552167065, 'err_train': 846.2449517351943}\n"]}], "source": ["from time import perf_counter as clock\n", "from sklearn.metrics import mean_squared_error\n", "values = []\n", "\n", "for k in [5, 10, 15, 20, 25, 30, 35]:\n", " mem = {}\n", " mf = NMF(n_components=k, shuffle=True, max_iter=400)\n", " cl = clock()\n", " mf.fit(mat_train)\n", " mem['train_time'] = clock() - cl\n", " pred = predict(mf, mat_train, rate_test, userId_row_train, movieId_col_train)\n", " mem['k'] = k\n", " mem['r2'] = r2_score(pred['rating'], pred['prediction'])\n", " mem['err_test'] = mean_squared_error(pred['rating'], pred['prediction'])\n", " mem['err_train'] = mf.reconstruction_err_\n", " values.append(mem)\n", " print(k, mem) "]}, {"cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
\n", " \n", " \n", " | \n", " err_test | \n", " err_train | \n", " k | \n", " r2 | \n", " train_time | \n", "
\n", " \n", " \n", " \n", " 0 | \n", " 7.929005 | \n", " 981.020535 | \n", " 5 | \n", " -6.014138 | \n", " 0.747911 | \n", "
\n", " \n", " 1 | \n", " 7.221598 | \n", " 943.856824 | \n", " 10 | \n", " -5.388353 | \n", " 1.910341 | \n", "
\n", " \n", " 2 | \n", " 6.747554 | \n", " 918.019837 | \n", " 15 | \n", " -4.969006 | \n", " 2.220737 | \n", "
\n", " \n", " 3 | \n", " 6.437004 | \n", " 897.935756 | \n", " 20 | \n", " -4.694288 | \n", " 5.637187 | \n", "
\n", " \n", " 4 | \n", " 6.127249 | \n", " 878.845420 | \n", " 25 | \n", " -4.420274 | \n", " 7.713239 | \n", "
\n", " \n", " 5 | \n", " 5.873010 | \n", " 862.247313 | \n", " 30 | \n", " -4.195369 | \n", " 12.439951 | \n", "
\n", " \n", " 6 | \n", " 5.651862 | \n", " 846.244952 | \n", " 35 | \n", " -3.999738 | \n", " 15.610665 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" err_test err_train k r2 train_time\n", "0 7.929005 981.020535 5 -6.014138 0.747911\n", "1 7.221598 943.856824 10 -5.388353 1.910341\n", "2 6.747554 918.019837 15 -4.969006 2.220737\n", "3 6.437004 897.935756 20 -4.694288 5.637187\n", "4 6.127249 878.845420 25 -4.420274 7.713239\n", "5 5.873010 862.247313 30 -4.195369 12.439951\n", "6 5.651862 846.244952 35 -3.999738 15.610665"]}, "execution_count": 24, "metadata": {}, "output_type": "execute_result"}], "source": ["df = pandas.DataFrame(values)\n", "df"]}, {"cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": [""]}, "metadata": {}, "output_type": "display_data"}], "source": ["fig, ax = plt.subplots(1, 4, figsize=(12, 3))\n", "df.plot(x='k', y=\"r2\", style='o-', ax=ax[0])\n", "ax[0].set_title(\"NMF\\nr2 base de test\\net k\")\n", "df.plot(x='k', y=\"err_test\", style='o-', ax=ax[1])\n", "ax[1].set_title(\"NMF\\nerreur de test\\net k\");\n", "df.plot(x='k', y=\"err_train\", style='o-', ax=ax[2])\n", "ax[2].set_title(\"NMF\\nerreur d'apprentissage\\net k\")\n", "df.plot(y='train_time', x=\"k\", style='o-', ax=ax[3])\n", "ax[3].set_title(\"NMF\\nk\\net temps d'apprentissage\");"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Il faudrait explorer de plus grandes valeurs de *k*, il faudrait aussi faire de la cross-validation puis regarder si on peut corr\u00e9l\u00e9rer les plus autres erreurs \u00e0 certains type d'utilisateurs ou de films, si on arrive \u00e0 d\u00e9terminer s'ils se distingue des autres par un faible ou fort taux de *ratings*, moyenne, plus ou moins proches des utilisateurs typiques (~*H*) ou des films typiques (~*W*). Bref, ce n'est pas fini."]}, {"cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4"}}, "nbformat": 4, "nbformat_minor": 2}