{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 2A.ml - 2016 - Comp\u00e9tition ENSAE - Premiers mod\u00e8les\n", "\n", "Une comp\u00e9tition \u00e9tait propos\u00e9e dans le cadre du cours *Python pour un Data Scientist* \u00e0 l'[ENSAE](http://www.ensae.fr/ensae/fr/). Ce notebook facilite la prise en main des donn\u00e9es et propose de mettre en oeuvre un mod\u00e8le logit."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["<div id=\"my_id_menu_nb\">run previous cell, wait for 2 seconds</div>\n", "<script>\n", "function repeat_indent_string(n){\n", "    var a = \"\" ;\n", "    for ( ; n > 0 ; --n)\n", "        a += \"    \";\n", "    return a;\n", "}\n", "var update_menu_string = function(begin, lfirst, llast, sformat, send, keep_item, begin_format, end_format) {\n", "    var anchors = document.getElementsByClassName(\"section\");\n", "    if (anchors.length == 0) {\n", "        anchors = document.getElementsByClassName(\"text_cell_render rendered_html\");\n", "    }\n", "    var i,t;\n", "    var text_menu = begin;\n", "    var text_memo = \"<pre>\\nlength:\" + anchors.length + \"\\n\";\n", "    var ind = \"\";\n", "    var memo_level = 1;\n", "    var href;\n", "    var tags = [];\n", "    var main_item = 0;\n", "    var format_open = 0;\n", "    for (i = 0; i <= llast; i++)\n", "        tags.push(\"h\" + i);\n", "\n", "    for (i = 0; i < anchors.length; i++) {\n", "        text_memo += \"**\" + anchors[i].id + \"--\\n\";\n", "\n", "        var child = null;\n", "        for(t = 0; t < tags.length; t++) {\n", "            var r = anchors[i].getElementsByTagName(tags[t]);\n", "            if (r.length > 0) {\n", "child = r[0];\n", "break;\n", "            }\n", "        }\n", "        if (child == null) {\n", "            text_memo += \"null\\n\";\n", "            continue;\n", "        }\n", "        if (anchors[i].hasAttribute(\"id\")) {\n", "            // when converted in RST\n", "            href = anchors[i].id;\n", "            text_memo += \"#1-\" + href;\n", "            // passer \u00e0 child suivant (le chercher)\n", "        }\n", "        else if (child.hasAttribute(\"id\")) {\n", "            // in a notebook\n", "            href = child.id;\n", "            text_memo += \"#2-\" + href;\n", "        }\n", "        else {\n", "            text_memo += \"#3-\" + \"*\" + \"\\n\";\n", "            continue;\n", "        }\n", "        var title = child.textContent;\n", "        var level = parseInt(child.tagName.substring(1,2));\n", "\n", "        text_memo += \"--\" + level + \"?\" + lfirst + \"--\" + title + \"\\n\";\n", "\n", "        if ((level < lfirst) || (level > llast)) {\n", "            continue ;\n", "        }\n", "        if (title.endsWith('\u00b6')) {\n", "            title = title.substring(0,title.length-1).replace(\"<\", \"&lt;\")\n", "         .replace(\">\", \"&gt;\").replace(\"&\", \"&amp;\");\n", "        }\n", "        if (title.length == 0) {\n", "            continue;\n", "        }\n", "\n", "        while (level < memo_level) {\n", "            text_menu += end_format + \"</ul>\\n\";\n", "            format_open -= 1;\n", "            memo_level -= 1;\n", "        }\n", "        if (level == lfirst) {\n", "            main_item += 1;\n", "        }\n", "        if (keep_item != -1 && main_item != keep_item + 1) {\n", "            // alert(main_item + \" - \" + level + \" - \" + keep_item);\n", "            continue;\n", "        }\n", "        while (level > memo_level) {\n", "            text_menu += \"<ul>\\n\";\n", "            memo_level += 1;\n", "        }\n", "        text_menu += repeat_indent_string(level-2);\n", "        text_menu += begin_format + sformat.replace(\"__HREF__\", href).replace(\"__TITLE__\", title);\n", "        format_open += 1;\n", "    }\n", "    while (1 < memo_level) {\n", "        text_menu += end_format + \"</ul>\\n\";\n", "        memo_level -= 1;\n", "        format_open -= 1;\n", "    }\n", "    text_menu += send;\n", "    //text_menu += \"\\n\" + text_memo;\n", "\n", "    while (format_open > 0) {\n", "        text_menu += end_format;\n", "        format_open -= 1;\n", "    }\n", "    return text_menu;\n", "};\n", "var update_menu = function() {\n", "    var sbegin = \"\";\n", "    var sformat = '<a href=\"#__HREF__\">__TITLE__</a>';\n", "    var send = \"\";\n", "    var begin_format = '<li>';\n", "    var end_format = '</li>';\n", "    var keep_item = -1;\n", "    var text_menu = update_menu_string(sbegin, 2, 4, sformat, send, keep_item,\n", "       begin_format, end_format);\n", "    var menu = document.getElementById(\"my_id_menu_nb\");\n", "    menu.innerHTML=text_menu;\n", "};\n", "window.setTimeout(update_menu,2000);\n", "            </script>"], "text/plain": ["<IPython.core.display.HTML object>"]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["La premi\u00e8re \u00e9tape constiste \u00e0 mettre en forme les donn\u00e9es pour que les fonctions des modules statsmodels et skitlearn fonctionnent comme on le souhaite"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/plain": ["['ensae_competition_test_X.txt', 'ensae_competition_train.txt']"]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["from pyensae.datasource import download_data\n", "download_data(\"ensae_competition_2016.zip\",\n", "              url=\"https://github.com/sdpython/ensae_teaching_cs/raw/master/_doc/competitions/2016_ENSAE_2A/\")"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["c:\\python36_x64\\lib\\site-packages\\statsmodels\\compat\\pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.\n", "  from pandas.core import datetools\n"]}], "source": ["import pandas as pd\n", "import statsmodels.api as sm\n", "import pylab as pl\n", "import numpy as np\n", "\n", "fichier_train = \"./ensae_competition_train.txt\"\n", "fichier_test = \"./ensae_competition_test_X.txt\"\n", "\n", "df = pd.read_csv(fichier_train, header=[0,1], sep=\"\\t\", index_col=0)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Mise en forme des donn\u00e9es pour la r\u00e9gression \n", "\n", "Pour mieux analyser les donn\u00e9es pr\u00e9sentes dans la base de donn\u00e9es, il faut passer par quelques manipulations.\n", "\n", "Par exemple, il faut transformer les variables SEX, EDUCATION, MARRIAGE en indictatrices afin qu'elles ne soient pas prise en compte comme des variables continues dans le mod\u00e8le."]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": ["#### Gender dummies\n", "df['X2'] = df['X2'].applymap(str)\n", "gender_dummies = pd.get_dummies(df['X2'] )"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": ["### education dummies\n", "df['X3'] = df['X3'].applymap(str)\n", "educ_dummies = pd.get_dummies(df['X3'] )"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": ["#### marriage dummies\n", "df['X4'] = df['X4'].applymap(str)\n", "mariage_dummies = pd.get_dummies(df['X4'] )"]}, {"cell_type": "code", "execution_count": 7, "metadata": {"scrolled": true}, "outputs": [], "source": ["### On va aussi supprimer les multi index de la table\n", "df.columns = df.columns.droplevel(0)"]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/html": ["<div>\n", "<style scoped>\n", "    .dataframe tbody tr th:only-of-type {\n", "        vertical-align: middle;\n", "    }\n", "\n", "    .dataframe tbody tr th {\n", "        vertical-align: top;\n", "    }\n", "\n", "    .dataframe thead th {\n", "        text-align: right;\n", "    }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>LIMIT_BAL</th>\n", "      <th>SEX</th>\n", "      <th>EDUCATION</th>\n", "      <th>MARRIAGE</th>\n", "      <th>AGE</th>\n", "      <th>PAY_0</th>\n", "      <th>PAY_2</th>\n", "      <th>PAY_3</th>\n", "      <th>PAY_4</th>\n", "      <th>PAY_5</th>\n", "      <th>...</th>\n", "      <th>EDUCATION_1</th>\n", "      <th>EDUCATION_2</th>\n", "      <th>EDUCATION_3</th>\n", "      <th>EDUCATION_4</th>\n", "      <th>EDUCATION_5</th>\n", "      <th>EDUCATION_6</th>\n", "      <th>MARRIAGE_0</th>\n", "      <th>MARRIAGE_1</th>\n", "      <th>MARRIAGE_2</th>\n", "      <th>MARRIAGE_3</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>180000</td>\n", "      <td>1</td>\n", "      <td>2</td>\n", "      <td>1</td>\n", "      <td>47</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>...</td>\n", "      <td>0</td>\n", "      <td>1</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>1</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>110000</td>\n", "      <td>2</td>\n", "      <td>2</td>\n", "      <td>1</td>\n", "      <td>35</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>...</td>\n", "      <td>0</td>\n", "      <td>1</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>1</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "<p>2 rows \u00d7 37 columns</p>\n", "</div>"], "text/plain": ["   LIMIT_BAL SEX EDUCATION MARRIAGE  AGE  PAY_0  PAY_2  PAY_3  PAY_4  PAY_5  \\\n", "0     180000   1         2        1   47      0      0      0      0      0   \n", "1     110000   2         2        1   35      0      0      0      0      0   \n", "\n", "      ...      EDUCATION_1  EDUCATION_2  EDUCATION_3  EDUCATION_4  \\\n", "0     ...                0            1            0            0   \n", "1     ...                0            1            0            0   \n", "\n", "   EDUCATION_5  EDUCATION_6  MARRIAGE_0  MARRIAGE_1  MARRIAGE_2  MARRIAGE_3  \n", "0            0            0           0           1           0           0  \n", "1            0            0           0           1           0           0  \n", "\n", "[2 rows x 37 columns]"]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["#### on aggr\u00e8ge ensuite les 3 tables ensemble\n", "data = df.join(gender_dummies).join(educ_dummies).join(mariage_dummies)\n", "data.head(n=2)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## R\u00e9gression logistique en utilisant statsmodels \n", "\n", "Pour notre premier mod\u00e8le, nous allons voir comment fonctionne le module statsmodels\n", "\n", "Par d\u00e9faut, la regression logit n'a pas de beta z\u00e9ro pour le module statsmodels : il faut donc l'ajouter"]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"data": {"text/plain": ["Index(['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2',\n", "       'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',\n", "       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',\n", "       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'Y',\n", "       'SEX_1', 'SEX_2', 'EDUCATION_0', 'EDUCATION_1', 'EDUCATION_2',\n", "       'EDUCATION_3', 'EDUCATION_4', 'EDUCATION_5', 'EDUCATION_6',\n", "       'MARRIAGE_0', 'MARRIAGE_1', 'MARRIAGE_2', 'MARRIAGE_3', 'intercept'],\n", "      dtype='object')"]}, "execution_count": 10, "metadata": {}, "output_type": "execute_result"}], "source": ["# premi\u00e8re \u00e9tape pour ce module, il faut ajouter \u00e0 la main le beta z\u00e9ro - l'intercept\n", "data['intercept'] = 1.0\n", "data.rename(columns = {'default payment next month' : \"Y\"}, inplace = True)\n", "data.columns"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": ["# variable = ['AGE', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4',\n", "#       'BILL_AMT5', 'BILL_AMT6', 'LIMIT_BAL', 'PAY_0',\n", "#       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'PAY_AMT1', 'PAY_AMT2',\n", "#       'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'SEX_1',\n", "#       'EDUCATION_0', 'EDUCATION_1', 'EDUCATION_2', 'EDUCATION_3',\n", "#       'EDUCATION_4', 'EDUCATION_5', 'MARRIAGE_0', 'MARRIAGE_1',\n", "#       'MARRIAGE_2', 'intercept']\n", "\n", "train_cols = [\"SEX_1\", \"AGE\", \"MARRIAGE_0\", 'PAY_0','intercept']"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": ["# Cette cellule n'est n\u00e9cessaire que si vous utilisez scipy 1.0 avec statsmodels 0.8.\n", "from pymyinstall.fix import fix_scipy10_for_statsmodels08\n", "fix_scipy10_for_statsmodels08()"]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Optimization terminated successfully.\n", "         Current function value: 0.475306\n", "         Iterations 6\n", "                           Logit Regression Results                           \n", "==============================================================================\n", "Dep. Variable:                      Y   No. Observations:                22500\n", "Model:                          Logit   Df Residuals:                    22495\n", "Method:                           MLE   Df Model:                            4\n", "Date:                Wed, 01 Nov 2017   Pseudo R-squ.:                  0.1008\n", "Time:                        20:48:54   Log-Likelihood:                -10694.\n", "converged:                       True   LL-Null:                       -11893.\n", "                                        LLR p-value:                     0.000\n", "==============================================================================\n", "                 coef    std err          z      P>|z|      [0.025      0.975]\n", "------------------------------------------------------------------------------\n", "SEX_1          0.1093      0.035      3.144      0.002       0.041       0.177\n", "AGE            0.0076      0.002      4.202      0.000       0.004       0.011\n", "MARRIAGE_0    -0.8095      0.527     -1.536      0.124      -1.842       0.223\n", "PAY_0          0.7304      0.016     44.554      0.000       0.698       0.763\n", "intercept     -1.7150      0.067    -25.435      0.000      -1.847      -1.583\n", "==============================================================================\n"]}], "source": ["logit = sm.Logit(data['Y'], data[train_cols].astype(float))\n", "\n", "# fit the model\n", "result = logit.fit()\n", "print(result.summary())"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Pr\u00e9diction sur la base de test\n", "\n", "On va pr\u00e9dire les probabilit\u00e9s de d\u00e9faut de paiement sur notre base de test.\n", "\n", "Il faut commencer par transformer la base de test comme on l'a fait pour la base de train."]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": ["data_test = pd.read_csv(fichier_test, header=[0,1], sep=\"\\t\", index_col=0)"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": ["#### Gender dummies\n", "data_test['X2'] = data_test['X2'].applymap(str)\n", "gender_dummies_test = pd.get_dummies(data_test['X2'] )"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": ["### education dummies\n", "data_test['X3'] = data_test['X3'].applymap(str)\n", "educ_dummies_test = pd.get_dummies(data_test['X3'] )"]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": ["#### marriage dummies\n", "data_test['X4'] = data_test['X4'].applymap(str)\n", "mariage_dummies_test = pd.get_dummies(data_test['X4'] )"]}, {"cell_type": "code", "execution_count": 17, "metadata": {"scrolled": true}, "outputs": [], "source": ["### On va aussi supprimer les multi index de la table\n", "data_test.columns = data_test.columns.droplevel(0)"]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": ["#### on aggr\u00e8ge ensuite les 3 tables ensemble\n", "data_test = data_test.join(gender_dummies_test).join(educ_dummies_test).join(mariage_dummies_test)"]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": ["data_test['intercept'] = 1.0"]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [{"data": {"text/html": ["<div>\n", "<style scoped>\n", "    .dataframe tbody tr th:only-of-type {\n", "        vertical-align: middle;\n", "    }\n", "\n", "    .dataframe tbody tr th {\n", "        vertical-align: top;\n", "    }\n", "\n", "    .dataframe thead th {\n", "        text-align: right;\n", "    }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>SEX_1</th>\n", "      <th>AGE</th>\n", "      <th>MARRIAGE_0</th>\n", "      <th>PAY_0</th>\n", "      <th>intercept</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>0</td>\n", "      <td>35</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>1.0</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>0</td>\n", "      <td>55</td>\n", "      <td>0</td>\n", "      <td>-1</td>\n", "      <td>1.0</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>1</td>\n", "      <td>33</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>1.0</td>\n", "    </tr>\n", "    <tr>\n", "      <th>3</th>\n", "      <td>1</td>\n", "      <td>23</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>1.0</td>\n", "    </tr>\n", "    <tr>\n", "      <th>4</th>\n", "      <td>1</td>\n", "      <td>42</td>\n", "      <td>0</td>\n", "      <td>-2</td>\n", "      <td>1.0</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>"], "text/plain": ["   SEX_1  AGE  MARRIAGE_0  PAY_0  intercept\n", "0      0   35           0      0        1.0\n", "1      0   55           0     -1        1.0\n", "2      1   33           0      0        1.0\n", "3      1   23           0      0        1.0\n", "4      1   42           0     -2        1.0"]}, "execution_count": 21, "metadata": {}, "output_type": "execute_result"}], "source": ["data_test[train_cols].head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Maintenant que la base de test est \u00e9galement transform\u00e9e, nous allons appliqu\u00e9 les r\u00e9sultats de notre mod\u00e8le \u00e0 cette table en utilisant la fonction predict"]}, {"cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [{"data": {"text/plain": ["count    7500.000000\n", "mean        0.220305\n", "std         0.136834\n", "min         0.025511\n", "25%         0.114870\n", "50%         0.194184\n", "75%         0.225708\n", "max         0.988777\n", "Name: prediction_statsmodel, dtype: float64"]}, "execution_count": 22, "metadata": {}, "output_type": "execute_result"}], "source": ["data_test['prediction_statsmodel'] = result.predict(data_test[train_cols])\n", "data_test['prediction_statsmodel'].describe()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On trouve un taux moyen de d\u00e9faut de 22%, ce qui est tr\u00e8s proche du taux observ\u00e9 dans la base de train"]}, {"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": ["# puis on l'exporte\n", "data_test['prediction_statsmodel'].to_csv(\"./answer.csv\", index=False)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Regression logistique en utilisant Scikit-learn\n", "\n", "A pr\u00e9sent, nous allons utiliser le module scikit learn pour estimer le m\u00eame mod\u00e8le que pr\u00e9c\u00e9demment et comparer les r\u00e9sultats.\n", "\n", "Ici pas besoin d'ajouter une variable \u00e9gale \u00e0 1 (l'intercept) car Scikit learn consid\u00e8re qu'il y a un intercept par d\u00e9faut."]}, {"cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": ["from sklearn import linear_model\n", "logistic = linear_model.LogisticRegression()"]}, {"cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["l'estimation des coefficients [[ 0.10899192  0.00751036 -0.56310513  0.72994068 -0.85555264]] \n", "\n", "l'intercept :  [-0.85555264]\n"]}], "source": ["print(\"l'estimation des coefficients\", logistic.fit(data[train_cols], data['Y']).coef_, \"\\n\")\n", "\n", "print(\"l'intercept : \", logistic.fit(data[train_cols], data['Y']).intercept_)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["A la diff\u00e9rence de statsmodels, Scikit-learn ne propose pas une belle table de r\u00e9sultat : seulement les arrays qui contiennent les coefficients\n", "\n", "Pour le d\u00e9tail des p-value et intervalles de confiance, il faudra les recalculer \u00e0 la main. \n", "\n", "\n", "Par contre, la fonction de pr\u00e9diction existe et elle renvoit  la probabilit\u00e9 de Y = 0 et Y = 1 "]}, {"cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[ 0.80972784,  0.19027216],\n", "       [ 0.88370325,  0.11629675],\n", "       [ 0.79482709,  0.20517291],\n", "       ..., \n", "       [ 0.80972784,  0.19027216],\n", "       [ 0.88899819,  0.11100181],\n", "       [ 0.79667299,  0.20332701]])"]}, "execution_count": 26, "metadata": {}, "output_type": "execute_result"}], "source": ["logistic.predict_proba(data_test[train_cols])"]}, {"cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([ 0.77965411,  0.22034589])"]}, "execution_count": 27, "metadata": {}, "output_type": "execute_result"}], "source": ["# pour calculer la moyenne du taux de d\u00e9faut pr\u00e9dit \n", "logistic.predict_proba(data_test[train_cols]).mean(axis=0)\n", "\n", "# on trouve \u00e0 nouveau 22%"]}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1"}}, "nbformat": 4, "nbformat_minor": 2}