{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# A short introduction to machine learning\n", "\n", "The [Wine Quality Data Set](https://archive.ics.uci.edu/ml/datasets/Wine+Quality) records the chemical composition of wines together with a quality score given by experts. Can this score be predicted from the chemical composition? If we manage to build a function that predicts the score, it may help us understand how the experts rate the wines."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": ["%matplotlib inline"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Data and a first linear regression\n", "\n", "The data can be loaded with the function ``load_wines_dataset`` from the ``papierstat`` module."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/html": ["
"], "text/plain": [" fixed_acidity volatile_acidity citric_acid residual_sugar chlorides \\\n", "0 7.4 0.70 0.00 1.9 0.076 \n", "1 7.8 0.88 0.00 2.6 0.098 \n", "2 7.8 0.76 0.04 2.3 0.092 \n", "3 11.2 0.28 0.56 1.9 0.075 \n", "4 7.4 0.70 0.00 1.9 0.076 \n", "\n", " free_sulfur_dioxide total_sulfur_dioxide density pH sulphates \\\n", "0 11.0 34.0 0.9978 3.51 0.56 \n", "1 25.0 67.0 0.9968 3.20 0.68 \n", "2 15.0 54.0 0.9970 3.26 0.65 \n", "3 17.0 60.0 0.9980 3.16 0.58 \n", "4 11.0 34.0 0.9978 3.51 0.56 \n", "\n", " alcohol quality color \n", "0 9.4 5 0 \n", "1 9.8 5 0 \n", "2 9.8 5 0 \n", "3 9.8 6 0 \n", "4 9.4 5 0 "]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["from papierstat.datasets import load_wines_dataset\n", "df = load_wines_dataset()\n", "df[\"color2\"] = 0\n", "df.loc[df[\"color\"] == \"white\", \"color2\"] = 1\n", "df[\"color\"] = df[\"color2\"]\n", "df = df.drop('color2', axis=1)\n", "df.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Alternatively, the data can be downloaded from the website and assembled into the same table."]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": ["# import pandas\n", "# df_red = pandas.read_csv('winequality-red.csv', sep=';')\n", "# df_red['color'] = 0\n", "# df_white = pandas.read_csv('winequality-white.csv', sep=';')\n", "# df_white['color'] = 1\n", "# df = pandas.concat([df_red, df_white])\n", "# df.shape, df_red.shape, df_white.shape"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/html": ["
"], "text/plain": [" count mean std min 25% \\\n", "fixed_acidity 6497.0 7.215307 1.296434 3.80000 6.40000 \n", "volatile_acidity 6497.0 0.339666 0.164636 0.08000 0.23000 \n", "citric_acid 6497.0 0.318633 0.145318 0.00000 0.25000 \n", "residual_sugar 6497.0 5.443235 4.757804 0.60000 1.80000 \n", "chlorides 6497.0 0.056034 0.035034 0.00900 0.03800 \n", "free_sulfur_dioxide 6497.0 30.525319 17.749400 1.00000 17.00000 \n", "total_sulfur_dioxide 6497.0 115.744574 56.521855 6.00000 77.00000 \n", "density 6497.0 0.994697 0.002999 0.98711 0.99234 \n", "pH 6497.0 3.218501 0.160787 2.72000 3.11000 \n", "sulphates 6497.0 0.531268 0.148806 0.22000 0.43000 \n", "alcohol 6497.0 10.491801 1.192712 8.00000 9.50000 \n", "quality 6497.0 5.818378 0.873255 3.00000 5.00000 \n", "color 6497.0 0.753886 0.430779 0.00000 1.00000 \n", "\n", " 50% 75% max \n", "fixed_acidity 7.00000 7.70000 15.90000 \n", "volatile_acidity 0.29000 0.40000 1.58000 \n", "citric_acid 0.31000 0.39000 1.66000 \n", "residual_sugar 3.00000 8.10000 65.80000 \n", "chlorides 0.04700 0.06500 0.61100 \n", "free_sulfur_dioxide 29.00000 41.00000 289.00000 \n", "total_sulfur_dioxide 118.00000 156.00000 440.00000 \n", "density 0.99489 0.99699 1.03898 \n", "pH 3.21000 3.32000 4.01000 \n", "sulphates 0.51000 0.60000 2.00000 \n", "alcohol 10.30000 11.30000 14.90000 \n", "quality 6.00000 6.00000 9.00000 \n", "color 1.00000 1.00000 1.00000 "]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["df.describe().T"]}, {"cell_type": "markdown", "metadata": {}, "source": ["I tend to reuse the name ``df`` everywhere, at the risk of overwriting the first one. 
Let's keep it in a separate variable."]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": ["df_data = df"]}, {"cell_type": "markdown", "metadata": {}, "source": ["How are the scores distributed?"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": ["
"]}, "metadata": {"needs_background": "light"}, "output_type": "display_data"}], "source": ["df['quality'].hist();"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The scores for red wines and for white wines."]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": ["
"]}, "metadata": {"needs_background": "light"}, "output_type": "display_data"}], "source": ["import matplotlib.pyplot as plt\n", "fig, ax = plt.subplots(1, 2, figsize=(12, 3))\n", "df[df['color'] == 0]['quality'].hist(ax=ax[0])\n", "df[df['color'] == 1]['quality'].hist(ax=ax[1])\n", "ax[0].set_title('red')\n", "ax[1].set_title('white');"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We now build the dataset: on one side what we know, the features X, and on the other what we want to predict, the target y."]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"data": {"text/plain": ["Index(['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar',\n", " 'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density',\n", " 'pH', 'sulphates', 'alcohol', 'quality', 'color'],\n", " dtype='object')"]}, "execution_count": 10, "metadata": {}, "output_type": "execute_result"}], "source": ["df.columns"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": ["X = df.drop(\"quality\", axis=1)\n", "y = df['quality']"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We split the data into a training set and a test set, since it is customary to fit a model on one part of the data and to check its predictions on another."]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": ["from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(\n", "    X, y, random_state=42)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We fit a first model, a linear regression."]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/plain": ["LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)"]}, "execution_count": 13, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.linear_model import 
LinearRegression\n", "clr = LinearRegression()\n", "clr.fit(X_train, y_train)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We retrieve the coefficients."]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([ 9.55561389e-02, -1.53182004e+00, -9.60658321e-02, 6.51351208e-02,\n", " -3.21323223e-01, 6.06114885e-03, -1.60663994e-03, -1.05342354e+02,\n", " 5.14593092e-01, 7.84057766e-01, 2.32175504e-01, -3.29941606e-01])"]}, "execution_count": 14, "metadata": {}, "output_type": "execute_result"}], "source": ["clr.coef_"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"data": {"text/plain": ["105.86698928549437"]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["clr.intercept_"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Then we compute the $R^2$ coefficient on the test set."]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.26585260463659766"]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.metrics import r2_score\n", "pred = clr.predict(X_test)\n", "r2_score(y_test, pred)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Or the mean absolute error."]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.5682450595415709"]}, "execution_count": 17, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.metrics import mean_absolute_error\n", "mean_absolute_error(y_test, clr.predict(X_test))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On average, the model is off by a little more than half a point (about 0.57) on the score."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Regression tree\n", "\n", "Let's see what a regression tree can do."]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": 
[{"data": {"text/plain": ["DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,\n", " max_features=None, max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=10, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, presort='deprecated',\n", " random_state=None, splitter='best')"]}, "execution_count": 18, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.tree import DecisionTreeRegressor\n", "dt = DecisionTreeRegressor(min_samples_leaf=10)\n", "dt.fit(X_train, y_train)"]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.2423038841196432"]}, "execution_count": 19, "metadata": {}, "output_type": "execute_result"}], "source": ["r2_score(y_test, dt.predict(X_test))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The regression tree shows why separate training and test sets matter: a tree can reproduce the data it was fitted on almost exactly, whereas its predictive performance on the test set is rather poor."]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.6288975399079262"]}, "execution_count": 20, "metadata": {}, "output_type": "execute_result"}], "source": ["r2_score(y_train, dt.predict(X_train))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["To avoid this, we tune the parameter *min_samples_leaf*. It means that each prediction of the regression tree is the average of at least *min_samples_leaf* scores taken from the training set. 
Overfitting is then much less likely."]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 49/49 [00:03<00:00, 14.25it/s]\n"]}, {"data": {"text/html": ["
"], "text/plain": [" minl r2_train_dt r2_test_dt r2_train_reg r2_test_reg\n", "0 1 1.000000 0.050087 0.30522 0.265853\n", "1 2 0.930993 0.130765 0.30522 0.265853"]}, "execution_count": 21, "metadata": {}, "output_type": "execute_result"}], "source": ["import pandas\n", "from sklearn.ensemble import RandomForestRegressor\n", "from tqdm import tqdm\n", "res = []\n", "for i in tqdm(range(1, 50)):\n", "    dt = DecisionTreeRegressor(min_samples_leaf=i)\n", "    reg = LinearRegression()\n", "    dt.fit(X_train, y_train)\n", "    reg.fit(X_train, y_train)\n", "    r = {\n", "        'minl': i,\n", "        'r2_train_dt': r2_score(y_train, dt.predict(X_train)),\n", "        'r2_test_dt': r2_score(y_test, dt.predict(X_test)),\n", "        'r2_train_reg': r2_score(y_train, reg.predict(X_train)),\n", "        'r2_test_reg': r2_score(y_test, reg.predict(X_test)),\n", "    }\n", "    res.append(r)\n", "df = pandas.DataFrame(res)\n", "df.head(2)"]}, {"cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": ["
"]}, "metadata": {"needs_background": "light"}, "output_type": "display_data"}], "source": ["df.plot(x=\"minl\", y=[\"r2_train_dt\", \"r2_test_dt\",\n", "                     \"r2_train_reg\", \"r2_test_reg\"]);"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The performance on the test set rises quickly and then levels off, without ever catching up with the performance on the training set. It does not exceed that of the linear model, which is disappointing. Let's try a random forest."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Random forest"]}, {"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 25/25 [00:20<00:00, 1.54it/s]\n"]}, {"data": {"text/html": ["
"], "text/plain": [" minl r2_train_dt r2_test_dt r2_train_reg r2_test_reg r2_train_rf \\\n", "0 1 1.000000 0.030211 0.30522 0.265853 0.920318 \n", "1 3 0.864086 0.130133 0.30522 0.265853 0.836299 \n", "\n", " r2_test_rf \n", "0 0.472317 \n", "1 0.455444 "]}, "execution_count": 23, "metadata": {}, "output_type": "execute_result"}], "source": ["import pandas\n", "from sklearn.ensemble import RandomForestRegressor\n", "from tqdm import tqdm\n", "res = []\n", "for i in tqdm(range(1, 50, 2)):\n", "    dt = DecisionTreeRegressor(min_samples_leaf=i)\n", "    reg = LinearRegression()\n", "    rf = RandomForestRegressor(n_estimators=25, min_samples_leaf=i)\n", "    dt.fit(X_train, y_train)\n", "    reg.fit(X_train, y_train)\n", "    rf.fit(X_train, y_train)\n", "    r = {\n", "        'minl': i,\n", "        'r2_train_dt': r2_score(y_train, dt.predict(X_train)),\n", "        'r2_test_dt': r2_score(y_test, dt.predict(X_test)),\n", "        'r2_train_reg': r2_score(y_train, reg.predict(X_train)),\n", "        'r2_test_reg': r2_score(y_test, reg.predict(X_test)),\n", "        'r2_train_rf': r2_score(y_train, rf.predict(X_train)),\n", "        'r2_test_rf': r2_score(y_test, rf.predict(X_test)),\n", "    }\n", "    res.append(r)\n", "df = pandas.DataFrame(res)\n", "df.head(2)"]}, {"cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": ["
"]}, "metadata": {"needs_background": "light"}, "output_type": "display_data"}], "source": ["df.plot(x=\"minl\", y=[\"r2_train_dt\", \"r2_test_dt\",\n", "                     \"r2_train_reg\", \"r2_test_reg\",\n", "                     \"r2_train_rf\", \"r2_test_rf\"]);"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Unlike the regression tree, the random forest performs best when this parameter is small. A forest is an average of models, each fitted on a subsample of the original dataset. Even if one tree learns its subsample by heart, its neighbor is unlikely to have learned the same one. Averaging them yields a compromise."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Cross-validation\n", "\n", "It remains to check that the model is robust. This is the purpose of cross-validation, which splits the dataset into 5 parts, trains on 4, tests on the remaining one, and repeats this 5 times so that each part serves as the test set once."]}, {"cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 5.4s finished\n"]}, {"data": {"text/plain": ["array([0.05037733, 0.24594631, 0.25811598, 0.348578 , 0.2462281 ])"]}, "execution_count": 25, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.model_selection import cross_val_score\n", "cross_val_score(\n", "    RandomForestRegressor(n_estimators=25), X, y, cv=5,\n", "    verbose=1)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["This result should give you pause, because the scores are far from stable. There are two possibilities: either the model is not robust, or the methodology is wrong somewhere. 
Since the problem is fairly simple, the second explanation is the more likely one: the dataset is sorted, red wines first and white wines after. Without shuffling, cross-validation may fit a model on wines of one color and apply it to wines of the other. Clearly, that does not work. It also means that white and red wines are very different, and that the color is probably redundant with the other features. Let's shuffle the data at random."]}, {"cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": ["from sklearn.utils import shuffle\n", "X2, y2 = shuffle(X, y)"]}, {"cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 5.6s finished\n"]}, {"data": {"text/plain": ["array([0.47975777, 0.50951094, 0.49514404, 0.51110336, 0.51584857])"]}, "execution_count": 27, "metadata": {}, "output_type": "execute_result"}], "source": ["cross_val_score(\n", "    RandomForestRegressor(n_estimators=25), X2, y2, cv=5,\n", "    verbose=1)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Much better. 
The shuffling can also be delegated to the splitter, here ``ShuffleSplit``."]}, {"cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 6.6s finished\n"]}, {"data": {"text/plain": ["array([0.53754932, 0.54227221, 0.5442236 , 0.57726314, 0.53393994])"]}, "execution_count": 28, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.model_selection import ShuffleSplit\n", "cross_val_score(\n", "    RandomForestRegressor(n_estimators=25), X, y, cv=ShuffleSplit(5),\n", "    verbose=1)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Pipeline\n", "\n", "A model can be fitted after a PCA (principal component analysis), but every intermediate step must then be remembered and replayed before predicting with the final model."]}, {"cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [{"data": {"text/plain": ["PCA(copy=True, iterated_power='auto', n_components=6, random_state=None,\n", " svd_solver='auto', tol=0.0, whiten=False)"]}, "execution_count": 29, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.decomposition import PCA\n", "pca = PCA(6)\n", "pca.fit(X_train, y_train)"]}, {"cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [{"data": {"text/plain": ["RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',\n", " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", " max_samples=None, min_impurity_decrease=0.0,\n", " min_impurity_split=None, min_samples_leaf=1,\n", " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", " n_estimators=100, n_jobs=None, oob_score=False,\n", " random_state=None, verbose=0, warm_start=False)"]}, "execution_count": 30, "metadata": {}, "output_type": "execute_result"}], "source": ["rf = RandomForestRegressor(n_estimators=100)\n", "X_train_pca = 
pca.transform(X_train)\n", "rf.fit(X_train_pca, y_train)"]}, {"cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.421429956568139"]}, "execution_count": 31, "metadata": {}, "output_type": "execute_result"}], "source": ["X_test_pca = pca.transform(X_test)\n", "pred = rf.predict(X_test_pca)\n", "r2_score(y_test, pred)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Alternatively, we can use a *pipeline*, which assembles the preprocessing steps and the predictive model into a single sequence of operations that behaves as one model."]}, {"cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": ["from sklearn.pipeline import Pipeline\n", "pipe = Pipeline([\n", "    ('acp', PCA(n_components=6)),\n", "    ('rf', RandomForestRegressor(n_estimators=100))\n", "])\n", "pipe.fit(X_train, y_train);"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Grid search\n", "\n", "With a pipeline, searching for the best hyperparameters of the whole model becomes straightforward."]}, {"cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Fitting 3 folds for each of 12 candidates, totalling 36 fits\n"]}, {"name": "stderr", "output_type": "stream", "text": ["[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 36 out of 36 | elapsed: 44.4s finished\n"]}, {"data": {"text/plain": ["GridSearchCV(cv=ShuffleSplit(n_splits=3, random_state=None, test_size=None, train_size=None),\n", " error_score=nan,\n", " estimator=Pipeline(memory=None,\n", " steps=[('acp',\n", " PCA(copy=True, iterated_power='auto',\n", " n_components=6, random_state=None,\n", " svd_solver='auto', tol=0.0,\n", " whiten=False)),\n", " ('rf',\n", " RandomForestRegressor(bootstrap=True,\n", " ccp_alpha=0.0,\n", " criterion='mse',\n", " max_depth=None,\n", " m...\n", 
" min_samples_leaf=1,\n", " min_samples_split=2,\n", " min_weight_fraction_leaf=0.0,\n", " n_estimators=100,\n", " n_jobs=None,\n", " oob_score=False,\n", " random_state=None,\n", " verbose=0,\n", " warm_start=False))],\n", " verbose=False),\n", " iid='deprecated', n_jobs=None,\n", " param_grid={'acp__n_components': [1, 4, 7, 10],\n", " 'rf__n_estimators': [10, 20, 50]},\n", " pre_dispatch='2*n_jobs', refit=True, return_train_score=False,\n", " scoring=None, verbose=1)"]}, "execution_count": 33, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.model_selection import GridSearchCV\n", "param_grid = {'acp__n_components': list(range(1, 11, 3)),\n", "              'rf__n_estimators': [10, 20, 50]}\n", "grid = GridSearchCV(pipe, param_grid=param_grid, verbose=1,\n", "                    cv=ShuffleSplit(3))\n", "grid.fit(X, y)"]}, {"cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [{"data": {"text/plain": ["{'acp__n_components': 10, 'rf__n_estimators': 50}"]}, "execution_count": 34, "metadata": {}, "output_type": "execute_result"}], "source": ["grid.best_params_"]}, {"cell_type": "code", "execution_count": 34, "metadata": {"scrolled": false}, "outputs": [{"data": {"text/plain": ["array([7.1 , 5.06, 7. , ..., 6.74, 4.12, 5.98])"]}, "execution_count": 35, "metadata": {}, "output_type": "execute_result"}], "source": ["grid.predict(X_test)"]}, {"cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.9275290318700775"]}, "execution_count": 36, "metadata": {}, "output_type": "execute_result"}], "source": ["r2_score(y_test, grid.predict(X_test))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["This number looks far too good to be true. 
It most likely means that the test data was used during the search: the grid was fitted on `X` and `y` rather than on the training split."]}, {"cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.49487646056265816"]}, "execution_count": 37, "metadata": {}, "output_type": "execute_result"}], "source": ["grid.best_score_"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The cross-validated score is much more plausible."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Saving and restoring\n", "\n", "The simplest way to keep models in Python is to serialize them: the in-memory object is written to disk and restored later."]}, {"cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": ["import pickle\n", "\n", "with open('piperf.pickle', 'wb') as f:\n", " pickle.dump(grid, f)"]}, {"cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [{"data": {"text/plain": ["['piperf.pickle']"]}, "execution_count": 39, "metadata": {}, "output_type": "execute_result"}], "source": ["import glob\n", "glob.glob('*.pickle')"]}, {"cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": ["with open(\"piperf.pickle\", 'rb') as f:\n", " grid2 = pickle.load(f)"]}, {"cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([7.1 , 5.06, 7. , ..., 6.74, 4.12, 5.98])"]}, "execution_count": 41, "metadata": {}, "output_type": "execute_result"}], "source": ["grid2.predict(X_test)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Predicting the color\n", "\n", "The failure of the first cross-validation was a sign that the color is easy to predict. 
Let's check."]}, {"cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": ["Xc = df_data.drop(['quality', 'color'], axis=1)\n", "yc = df_data[\"color\"]"]}, {"cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": ["Xc_train, Xc_test, yc_train, yc_test = train_test_split(Xc, yc)"]}, {"cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": ["from sklearn.linear_model import LogisticRegression\n", "log = LogisticRegression(solver='lbfgs', max_iter=1500)\n", "log.fit(Xc_train, yc_train);"]}, {"cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.04459922717947637"]}, "execution_count": 45, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.metrics import log_loss\n", "log_loss(yc_test, log.predict_proba(Xc_test))"]}, {"cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[ 391, 14],\n", " [ 9, 1211]], dtype=int64)"]}, "execution_count": 46, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.metrics import confusion_matrix\n", "confusion_matrix(yc_test, log.predict(Xc_test))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The confusion matrix speaks for itself: only 23 of the 1625 test wines are misclassified."]}, {"cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2"}}, "nbformat": 4, "nbformat_minor": 2}