{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 2A.ml - Machine Learning et Marketting\n", "\n", "Pr\u00e9dire la souscription d'un contrat sur le jeu de donn\u00e9es [Bank Marketing Data Set ](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing)."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Populating the interactive namespace from numpy and matplotlib\n"]}], "source": ["%matplotlib inline"]}, {"cell_type": "code", "execution_count": 2, "metadata": {"collapsed": true}, "outputs": [], "source": ["import matplotlib.pyplot as plt"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/html": ["Plan\n", "
\n", ""], "text/plain": [""]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Data"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The [Bank Marketing Data Set](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing) contains data used to assess whether a person subscribes to a contract. The full database contains about 45,000 observations with 16 attributes and one binary variable, which is the outcome to predict. The file ``bank.csv`` loaded below is a smaller extract of 4,521 rows. First, we download the data."]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": [" downloading of https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip to bank.zip\n", " unzipped bank-full.csv to .\\bank-full.csv\n"]}], "source": ["url = \"https://archive.ics.uci.edu/ml/machine-learning-databases/00222/\"\n", "file = \"bank.zip\"\n", "import pyensae.datasource\n", "data = pyensae.datasource.download_data(file, website=url)"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["\"age\";\"job\";\"marital\";\"education\";\"default\";\"balance\";\"housing\";\"loan\";\"contact\";\"day\";\"month\";\"duration\";\"campaign\";\"pdays\";\"previous\";\"poutcome\";\"y\"\n", "30;\"unemployed\";\"married\";\"primary\";\"no\";1787;\"no\";\"no\";\"cellular\";19;\"oct\";79;1;-1;0;\"unknown\";\"no\"\n", "33;\"services\";\"married\";\"secondary\";\"no\";4789;\"yes\";\"yes\";\"cellular\";11;\"may\";220;1;339;4;\"failure\";\"no\"\n", "35;\"management\";\"single\";\"tertiary\";\"no\";1350;\"yes\";\"no\";\"cellular\";16;\"apr\";185;1;330;1;\"failure\";\"no\"\n", 
"30;\"management\";\"married\";\"tertiary\";\"no\";1476;\"yes\";\"yes\";\"unknown\";3;\"jun\";199;4;-1;0;\"unknown\";\"no\"\n", "59;\"blue-collar\";\"married\";\"secondary\";\"no\";0;\"yes\";\"no\";\"unknown\";5;\"may\";226;1;-1;0;\"unknown\";\"no\"\n"]}], "source": ["with open(\"bank.csv\",\"r\") as fo :\n", " n = 0\n", " for row in fo :\n", " print(row.strip(\"\\r\\n \"))\n", " n += 1\n", " if n > 5 : break"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/html": ["
<table>\n", "<thead>\n", "<tr><th></th><th>age</th><th>job</th><th>marital</th><th>education</th><th>default</th><th>balance</th><th>housing</th><th>loan</th><th>contact</th><th>day</th><th>month</th><th>duration</th><th>campaign</th><th>pdays</th><th>previous</th><th>poutcome</th><th>y</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>4516</th><td>33</td><td>services</td><td>married</td><td>secondary</td><td>no</td><td>-333</td><td>yes</td><td>no</td><td>cellular</td><td>30</td><td>jul</td><td>329</td><td>5</td><td>-1</td><td>0</td><td>unknown</td><td>no</td></tr>\n", "<tr><th>4517</th><td>57</td><td>self-employed</td><td>married</td><td>tertiary</td><td>yes</td><td>-3313</td><td>yes</td><td>yes</td><td>unknown</td><td>9</td><td>may</td><td>153</td><td>1</td><td>-1</td><td>0</td><td>unknown</td><td>no</td></tr>\n", "<tr><th>4518</th><td>57</td><td>technician</td><td>married</td><td>secondary</td><td>no</td><td>295</td><td>no</td><td>no</td><td>cellular</td><td>19</td><td>aug</td><td>151</td><td>11</td><td>-1</td><td>0</td><td>unknown</td><td>no</td></tr>\n", "<tr><th>4519</th><td>28</td><td>blue-collar</td><td>married</td><td>secondary</td><td>no</td><td>1137</td><td>no</td><td>no</td><td>cellular</td><td>6</td><td>feb</td><td>129</td><td>4</td><td>211</td><td>3</td><td>other</td><td>no</td></tr>\n", "<tr><th>4520</th><td>44</td><td>entrepreneur</td><td>single</td><td>tertiary</td><td>no</td><td>1136</td><td>yes</td><td>yes</td><td>cellular</td><td>3</td><td>apr</td><td>345</td><td>2</td><td>249</td><td>7</td><td>other</td><td>no</td></tr>\n", "</tbody>\n", "</table>
"], "text/plain": [" age job marital education default balance housing loan \\\n", "4516 33 services married secondary no -333 yes no \n", "4517 57 self-employed married tertiary yes -3313 yes yes \n", "4518 57 technician married secondary no 295 no no \n", "4519 28 blue-collar married secondary no 1137 no no \n", "4520 44 entrepreneur single tertiary no 1136 yes yes \n", "\n", " contact day month duration campaign pdays previous poutcome y \n", "4516 cellular 30 jul 329 5 -1 0 unknown no \n", "4517 unknown 9 may 153 1 -1 0 unknown no \n", "4518 cellular 19 aug 151 11 -1 0 unknown no \n", "4519 cellular 6 feb 129 4 211 3 other no \n", "4520 cellular 3 apr 345 2 249 7 other no "]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["import pandas\n", "df = pandas.read_csv(\"bank.csv\",sep=\";\")\n", "df.tail()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Exercice 1 : pr\u00e9dire y en fonction des attributs\n", "\n", "On utilisera pour cela un [GradientBoostingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) apr\u00e8s avoir sciender la base en base d'apprentissage et tests. 
A few links:\n", "\n", "* [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)\n", "* [DictVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html)\n", "* [Using DictVectorizer with sklearn DecisionTreeClassifier](http://stackoverflow.com/questions/15181311/using-dictvectorizer-with-sklearn-decisiontreeclassifier)\n", "* [iterrows](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iterrows.html)"]}, {"cell_type": "code", "execution_count": 7, "metadata": {"collapsed": true}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["### Exercise 2: draw the ROC curve\n", "\n", "Most classifiers return two values: the predicted class and a score. The score is often a probability and reflects how confident the classifier is in its prediction. It is used to accept or reject the prediction (see [Sensitivity and specificity](http://en.wikipedia.org/wiki/Sensitivity_and_specificity)). The terms ``true positive``, ``false positive``, ... are rather misleading. 
It is better to remember the definitions of [precision and recall](http://fr.wikipedia.org/wiki/Pr%C3%A9cision_et_rappel).\n", "\n", "Since the classifier returns a confidence score, we validate or reject its answer depending on whether the score is above or below a threshold:\n", "\n", "* if score >= threshold, the answer is validated; it is either right (TP: True Positive) or wrong (FP: False Positive)\n", "* if score < threshold, the answer is rejected; it is either right (FN: False Negative) or wrong (TN: True Negative)\n", "\n", "Precision is defined as the number of validated right answers over the number of validated answers:\n", "\n", "$$precision = \\frac{TP}{TP + FP}$$\n", "\n", "Recall is defined as the number of validated right answers over the total number of right answers:\n", "\n", "$$recall = \\frac{TP}{TP + FN}$$\n", "\n", "In our case, a right answer means that the predicted class is the true class. A useful link:\n", "\n", "* [Receiver Operating Characteristic (ROC)](http://scikit-learn.org/stable/auto_examples/plot_roc.html)\n", "\n", "The ROC curves of the following sets must be drawn on the same figure:\n", "\n", "* the whole test set\n", "* two random samples of the test set\n", "\n", "The curves of the two random samples should illustrate how stable the ROC curve is. 
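A minimal sketch of these definitions and of the three curves, under stated assumptions: `y_test` and `proba` are hypothetical random stand-ins for the test labels and for the probabilities a fitted classifier would return, the threshold 0.7 and the sample size 300 are arbitrary, and a right answer is taken to mean that the predicted class equals the true class, as defined above.

```python
import numpy
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

# Hypothetical stand-ins for y_test and clf.predict_proba(X_test)[:, 1].
rng = numpy.random.RandomState(0)
y_test = rng.randint(0, 2, 1000)
proba = numpy.clip(0.3 + 0.4 * y_test + rng.normal(0, 0.25, 1000), 0, 1)

# Predicted class, confidence score, and whether the answer is right.
predicted = (proba >= 0.5).astype(int)
score = numpy.maximum(proba, 1 - proba)
right = (predicted == y_test).astype(int)

# Precision and recall at one threshold, following the definitions above.
threshold = 0.7
validated = score >= threshold
tp = numpy.sum(validated & (right == 1))
fp = numpy.sum(validated & (right == 0))
fn = numpy.sum(~validated & (right == 1))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(precision, recall)

# ROC curve of the whole test set and of two random samples of it.
fig, ax = plt.subplots(figsize=(5, 5))
fpr, tpr, _ = roc_curve(right, score)
auc_full = auc(fpr, tpr)
ax.plot(fpr, tpr, label='whole test set, AUC=%0.3f' % auc_full)
auc_samples = []
for i in range(2):
    index = rng.choice(len(y_test), size=300, replace=False)
    fpr_s, tpr_s, _ = roc_curve(right[index], score[index])
    auc_samples.append(auc(fpr_s, tpr_s))
    ax.plot(fpr_s, tpr_s, '--',
            label='sample %d, AUC=%0.3f' % (i + 1, auc_samples[-1]))
ax.plot([0, 1], [0, 1], 'k:', label='random')
ax.set_xlabel('false positive rate')
ax.set_ylabel('true positive rate')
ax.legend()
```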
It is even possible to compute a confidence interval with a [bootstrap](http://fr.wikipedia.org/wiki/Bootstrap_%28statistiques%29).\n", "\n", "A few links about matplotlib:\n", "\n", "* [Our Favorite Recipes](http://matplotlib.org/users/recipes.html)\n", "* [How to make several plots on a single page using matplotlib?](http://stackoverflow.com/questions/1358977/how-to-make-several-plots-on-a-single-page-using-matplotlib)"]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1"}}, "nbformat": 4, "nbformat_minor": 2}