{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 2A.ml - Machine Learning and Marketing\n", "\n", "Predict whether a customer subscribes to a contract, using the [Bank Marketing Data Set](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing)."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Populating the interactive namespace from numpy and matplotlib\n"]}], "source": ["%matplotlib inline"]}, {"cell_type": "code", "execution_count": 2, "metadata": {"collapsed": true}, "outputs": [], "source": ["import matplotlib.pyplot as plt"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/html": ["<b>Plan</b>\n", "<div id=\"my_menu_id\">run previous cell, wait for 2 seconds</div>\n", "<script>\n", "function repeat_indent_string(n){\n", "    var a = \"\" ;\n", "    for ( ; n > 0 ; --n) {\n", "        a += \"    \";\n", "    }\n", "    return a;\n", "}\n", "var update_menu_string = function(begin, lfirst, llast, sformat, send) {\n", "    var anchors = document.getElementsByClassName(\"section\");\n", "    if (anchors.length == 0) {\n", "        anchors = document.getElementsByClassName(\"text_cell_render rendered_html\");\n", "    }\n", "    var i,t;\n", "    var text_menu = begin;\n", "    var text_memo = \"<pre>\\nlength:\" + anchors.length + \"\\n\";\n", "    var ind = \"\";\n", "    var memo_level = 1;\n", "    var href;\n", "    var tags = [];\n", "    for (i = 0; i <= llast; i++) {\n", "        tags.push(\"h\" + i);\n", "    }\n", "\n", "    for (i = 0; i < anchors.length; i++) {\n", "        text_memo += \"**\" + anchors[i].id + \"--\\n\";\n", "\n", "        var child = null;\n", "        for(t = 0; t < tags.length; t++) {\n", "            var r = anchors[i].getElementsByTagName(tags[t]);\n", "            if (r.length > 0) {\n", "child = r[0];\n", "break;\n", "            }\n", "        }\n", "        if (child == null){\n", 
"            text_memo += \"null\\n\";\n", "            continue;\n", "        }\n", "\n", "        if (anchors[i].hasAttribute(\"id\")) {\n", "            // when converted in RST\n", "            href = anchors[i].id;\n", "            text_memo += \"#1-\" + href;\n", "            // passer \u00e0 child suivant (le chercher)\n", "        }\n", "        else if (child.hasAttribute(\"id\")) {\n", "            // in a notebook\n", "            href = child.id;\n", "            text_memo += \"#2-\" + href;\n", "        }\n", "        else {\n", "            text_memo += \"#3-\" + \"*\" + \"\\n\";\n", "            continue;\n", "        }\n", "        var title = child.textContent;\n", "        var level = parseInt(child.tagName.substring(1,2));\n", "\n", "        text_memo += \"--\" + level + \"?\" + lfirst + \"--\" + title + \"\\n\";\n", "\n", "        if ((level < lfirst) || (level > llast)) {\n", "            continue ;\n", "        }\n", "        if (title.endsWith('\u00b6')) {\n", "            title = title.substring(0,title.length-1).replace(\"<\", \"&lt;\").replace(\">\", \"&gt;\").replace(\"&\", \"&amp;\")\n", "        }\n", "\n", "        if (title.length == 0) {\n", "            continue;\n", "        }\n", "        while (level > memo_level) {\n", "            text_menu += \"<ul>\\n\";\n", "            memo_level += 1;\n", "        }\n", "        while (level < memo_level) {\n", "            text_menu += \"</ul>\\n\";\n", "            memo_level -= 1;\n", "        }\n", "        text_menu += repeat_indent_string(level-2) + sformat.replace(\"__HREF__\", href).replace(\"__TITLE__\", title);\n", "    }\n", "    while (1 < memo_level) {\n", "        text_menu += \"</ul>\\n\";\n", "        memo_level -= 1;\n", "    }\n", "    text_menu += send;\n", "    //text_menu += \"\\n\" + text_memo;\n", "    return text_menu;\n", "};\n", "var update_menu = function() {\n", "    var sbegin = \"\";\n", "    var sformat = '<li><a href=\"#__HREF__\">__TITLE__</a></li>';\n", "  
  var send = \"\";\n", "    var text_menu = update_menu_string(sbegin, 2, 4, sformat, send);\n", "    var menu = document.getElementById(\"my_menu_id\");\n", "    menu.innerHTML=text_menu;\n", "};\n", "window.setTimeout(update_menu,2000);\n", "            </script>"], "text/plain": ["<IPython.core.display.HTML object>"]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Data"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The [Bank Marketing Data Set](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing) contains data used to assess whether a person will subscribe to a contract. The full database contains about 45,000 observations with 16 attributes and a binary variable, `y`, which is the outcome to predict (the `bank.csv` sample loaded below has 4,521 rows). First, we download the data."]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["    downloading of  https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip  to  bank.zip\n", "    unzipped  bank-full.csv  to  .\\bank-full.csv\n"]}], "source": ["url = \"https://archive.ics.uci.edu/ml/machine-learning-databases/00222/\"\n", "file = \"bank.zip\"\n", "import pyensae.datasource\n", "data = pyensae.datasource.download_data(file, website=url)"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["\"age\";\"job\";\"marital\";\"education\";\"default\";\"balance\";\"housing\";\"loan\";\"contact\";\"day\";\"month\";\"duration\";\"campaign\";\"pdays\";\"previous\";\"poutcome\";\"y\"\n", "30;\"unemployed\";\"married\";\"primary\";\"no\";1787;\"no\";\"no\";\"cellular\";19;\"oct\";79;1;-1;0;\"unknown\";\"no\"\n", 
"33;\"services\";\"married\";\"secondary\";\"no\";4789;\"yes\";\"yes\";\"cellular\";11;\"may\";220;1;339;4;\"failure\";\"no\"\n", "35;\"management\";\"single\";\"tertiary\";\"no\";1350;\"yes\";\"no\";\"cellular\";16;\"apr\";185;1;330;1;\"failure\";\"no\"\n", "30;\"management\";\"married\";\"tertiary\";\"no\";1476;\"yes\";\"yes\";\"unknown\";3;\"jun\";199;4;-1;0;\"unknown\";\"no\"\n", "59;\"blue-collar\";\"married\";\"secondary\";\"no\";0;\"yes\";\"no\";\"unknown\";5;\"may\";226;1;-1;0;\"unknown\";\"no\"\n"]}], "source": ["with open(\"bank.csv\",\"r\") as fo :\n", "    n = 0\n", "    for row in fo :\n", "        print(row.strip(\"\\r\\n \"))\n", "        n += 1\n", "        if n > 5 : break"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/html": ["<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>age</th>\n", "      <th>job</th>\n", "      <th>marital</th>\n", "      <th>education</th>\n", "      <th>default</th>\n", "      <th>balance</th>\n", "      <th>housing</th>\n", "      <th>loan</th>\n", "      <th>contact</th>\n", "      <th>day</th>\n", "      <th>month</th>\n", "      <th>duration</th>\n", "      <th>campaign</th>\n", "      <th>pdays</th>\n", "      <th>previous</th>\n", "      <th>poutcome</th>\n", "      <th>y</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>4516</th>\n", "      <td>33</td>\n", "      <td>services</td>\n", "      <td>married</td>\n", "      <td>secondary</td>\n", "      <td>no</td>\n", "      <td>-333</td>\n", "      <td>yes</td>\n", "      <td>no</td>\n", "      <td>cellular</td>\n", "      <td>30</td>\n", "      <td>jul</td>\n", "      <td>329</td>\n", "      <td>5</td>\n", "      <td>-1</td>\n", "      <td>0</td>\n", "      <td>unknown</td>\n", "      <td>no</td>\n", "    </tr>\n", "    <tr>\n", "      <th>4517</th>\n", "      <td>57</td>\n", "      
<td>self-employed</td>\n", "      <td>married</td>\n", "      <td>tertiary</td>\n", "      <td>yes</td>\n", "      <td>-3313</td>\n", "      <td>yes</td>\n", "      <td>yes</td>\n", "      <td>unknown</td>\n", "      <td>9</td>\n", "      <td>may</td>\n", "      <td>153</td>\n", "      <td>1</td>\n", "      <td>-1</td>\n", "      <td>0</td>\n", "      <td>unknown</td>\n", "      <td>no</td>\n", "    </tr>\n", "    <tr>\n", "      <th>4518</th>\n", "      <td>57</td>\n", "      <td>technician</td>\n", "      <td>married</td>\n", "      <td>secondary</td>\n", "      <td>no</td>\n", "      <td>295</td>\n", "      <td>no</td>\n", "      <td>no</td>\n", "      <td>cellular</td>\n", "      <td>19</td>\n", "      <td>aug</td>\n", "      <td>151</td>\n", "      <td>11</td>\n", "      <td>-1</td>\n", "      <td>0</td>\n", "      <td>unknown</td>\n", "      <td>no</td>\n", "    </tr>\n", "    <tr>\n", "      <th>4519</th>\n", "      <td>28</td>\n", "      <td>blue-collar</td>\n", "      <td>married</td>\n", "      <td>secondary</td>\n", "      <td>no</td>\n", "      <td>1137</td>\n", "      <td>no</td>\n", "      <td>no</td>\n", "      <td>cellular</td>\n", "      <td>6</td>\n", "      <td>feb</td>\n", "      <td>129</td>\n", "      <td>4</td>\n", "      <td>211</td>\n", "      <td>3</td>\n", "      <td>other</td>\n", "      <td>no</td>\n", "    </tr>\n", "    <tr>\n", "      <th>4520</th>\n", "      <td>44</td>\n", "      <td>entrepreneur</td>\n", "      <td>single</td>\n", "      <td>tertiary</td>\n", "      <td>no</td>\n", "      <td>1136</td>\n", "      <td>yes</td>\n", "      <td>yes</td>\n", "      <td>cellular</td>\n", "      <td>3</td>\n", "      <td>apr</td>\n", "      <td>345</td>\n", "      <td>2</td>\n", "      <td>249</td>\n", "      <td>7</td>\n", "      <td>other</td>\n", "      <td>no</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>"], "text/plain": ["      age            job  marital  education default  balance housing loan  \\\n", "4516   33    
   services  married  secondary      no     -333     yes   no   \n", "4517   57  self-employed  married   tertiary     yes    -3313     yes  yes   \n", "4518   57     technician  married  secondary      no      295      no   no   \n", "4519   28    blue-collar  married  secondary      no     1137      no   no   \n", "4520   44   entrepreneur   single   tertiary      no     1136     yes  yes   \n", "\n", "       contact  day month  duration  campaign  pdays  previous poutcome   y  \n", "4516  cellular   30   jul       329         5     -1         0  unknown  no  \n", "4517   unknown    9   may       153         1     -1         0  unknown  no  \n", "4518  cellular   19   aug       151        11     -1         0  unknown  no  \n", "4519  cellular    6   feb       129         4    211         3    other  no  \n", "4520  cellular    3   apr       345         2    249         7    other  no  "]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["import pandas\n", "df = pandas.read_csv(\"bank.csv\", sep=\";\")\n", "df.tail()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Exercise 1: predict y from the attributes\n", "\n", "Use a [GradientBoostingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) after splitting the data into a training set and a test set. 
Some links:\n", "\n", "* [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)\n", "* [DictVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html)\n", "* [Using DictVectorizer with sklearn DecisionTreeClassifier](http://stackoverflow.com/questions/15181311/using-dictvectorizer-with-sklearn-decisiontreeclassifier)\n", "* [iterrows](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iterrows.html)"]}, {"cell_type": "code", "execution_count": 7, "metadata": {"collapsed": true}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["### Exercise 2: plot the ROC curve\n", "\n", "Most classifiers produce two outputs: the predicted class and a score. The score is often a probability and reflects how confident the classifier is in its prediction. It is then used to accept or reject the prediction (see [Sensitivity and specificity](http://en.wikipedia.org/wiki/Sensitivity_and_specificity)). The terms ``true positive``, ``false positive``, ... are rather misleading. 
It is better to remember the definitions of [precision and recall](http://fr.wikipedia.org/wiki/Pr%C3%A9cision_et_rappel).\n", "\n", "Since the classifier returns a confidence score, we accept or reject its answer depending on whether the score is above or below a chosen threshold:\n", "\n", "* if score >= threshold, the answer is accepted, and it is either correct (TP: True Positive) or wrong (FP: False Positive)\n", "* if score < threshold, the answer is rejected, and it is either correct (FN: False Negative) or wrong (TN: True Negative)\n", "\n", "Precision is the number of correct accepted answers divided by the number of accepted answers:\n", "\n", "$$precision = \\frac{TP}{TP + FP}$$\n", "\n", "Recall is the number of correct accepted answers divided by the total number of correct answers, accepted or not:\n", "\n", "\n", "$$recall = \\frac{TP}{TP + FN}$$\n", "\n", "Here, an answer is correct when the predicted class is the right one. Some links:\n", "\n", "* [Receiver Operating Characteristic (ROC)](http://scikit-learn.org/stable/auto_examples/plot_roc.html)\n", "\n", "Plot on the same figure the ROC curve of:\n", "\n", "* the whole test set\n", "* two random samples of the test set\n", "\n", "The curves of the two random samples should illustrate how stable the ROC curve is. 
A confidence interval can even be computed with a [bootstrap](http://fr.wikipedia.org/wiki/Bootstrap_%28statistiques%29).\n", "\n", "Some links about matplotlib:\n", "\n", "* [Our Favorite Recipes](http://matplotlib.org/users/recipes.html)\n", "* [How to make several plots on a single page using matplotlib?](http://stackoverflow.com/questions/1358977/how-to-make-several-plots-on-a-single-page-using-matplotlib)"]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1"}}, "nbformat": 4, "nbformat_minor": 2}