{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 2A.ml - 2016 - Comp\u00e9tition ENSAE - Premiers mod\u00e8les\n", "\n", "Une comp\u00e9tition \u00e9tait propos\u00e9e dans le cadre du cours *Python pour un Data Scientist* \u00e0 l'[ENSAE](http://www.ensae.fr/ensae/fr/). Ce notebook facilite la prise en main des donn\u00e9es et propose de mettre en oeuvre un mod\u00e8le logit."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
run previous cell, wait for 2 seconds
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["La premi\u00e8re \u00e9tape constiste \u00e0 mettre en forme les donn\u00e9es pour que les fonctions des modules statsmodels et skitlearn fonctionnent comme on le souhaite"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/plain": ["['ensae_competition_test_X.txt', 'ensae_competition_train.txt']"]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["from pyensae.datasource import download_data\n", "download_data(\"ensae_competition_2016.zip\",\n", " url=\"https://github.com/sdpython/ensae_teaching_cs/raw/master/_doc/competitions/2016_ENSAE_2A/\")"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["c:\\python36_x64\\lib\\site-packages\\statsmodels\\compat\\pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.\n", " from pandas.core import datetools\n"]}], "source": ["import pandas as pd\n", "import statsmodels.api as sm\n", "import pylab as pl\n", "import numpy as np\n", "\n", "fichier_train = \"./ensae_competition_train.txt\"\n", "fichier_test = \"./ensae_competition_test_X.txt\"\n", "\n", "df = pd.read_csv(fichier_train, header=[0,1], sep=\"\\t\", index_col=0)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Mise en forme des donn\u00e9es pour la r\u00e9gression \n", "\n", "Pour mieux analyser les donn\u00e9es pr\u00e9sentes dans la base de donn\u00e9es, il faut passer par quelques manipulations.\n", "\n", "Par exemple, il faut transformer les variables SEX, EDUCATION, MARRIAGE en indictatrices afin qu'elles ne soient pas prise en compte comme des variables continues dans le mod\u00e8le."]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": ["#### Gender dummies\n", "df['X2'] = df['X2'].applymap(str)\n", "gender_dummies = pd.get_dummies(df['X2'] )"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": ["### education dummies\n", "df['X3'] = df['X3'].applymap(str)\n", "educ_dummies = pd.get_dummies(df['X3'] )"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": ["#### marriage dummies\n", "df['X4'] = df['X4'].applymap(str)\n", "mariage_dummies = pd.get_dummies(df['X4'] )"]}, {"cell_type": "code", "execution_count": 7, "metadata": {"scrolled": true}, "outputs": [], "source": ["### On va aussi supprimer les multi index de la table\n", "df.columns = df.columns.droplevel(0)"]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
LIMIT_BALSEXEDUCATIONMARRIAGEAGEPAY_0PAY_2PAY_3PAY_4PAY_5...EDUCATION_1EDUCATION_2EDUCATION_3EDUCATION_4EDUCATION_5EDUCATION_6MARRIAGE_0MARRIAGE_1MARRIAGE_2MARRIAGE_3
01800001214700000...0100000100
11100002213500000...0100000100
\n", "

2 rows \u00d7 37 columns

\n", "
"], "text/plain": [" LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 \\\n", "0 180000 1 2 1 47 0 0 0 0 0 \n", "1 110000 2 2 1 35 0 0 0 0 0 \n", "\n", " ... EDUCATION_1 EDUCATION_2 EDUCATION_3 EDUCATION_4 \\\n", "0 ... 0 1 0 0 \n", "1 ... 0 1 0 0 \n", "\n", " EDUCATION_5 EDUCATION_6 MARRIAGE_0 MARRIAGE_1 MARRIAGE_2 MARRIAGE_3 \n", "0 0 0 0 1 0 0 \n", "1 0 0 0 1 0 0 \n", "\n", "[2 rows x 37 columns]"]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["#### on aggr\u00e8ge ensuite les 3 tables ensemble\n", "data = df.join(gender_dummies).join(educ_dummies).join(mariage_dummies)\n", "data.head(n=2)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## R\u00e9gression logistique en utilisant statsmodels \n", "\n", "Pour notre premier mod\u00e8le, nous allons voir comment fonctionne le module statsmodels\n", "\n", "Par d\u00e9faut, la regression logit n'a pas de beta z\u00e9ro pour le module statsmodels : il faut donc l'ajouter"]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"data": {"text/plain": ["Index(['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2',\n", " 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',\n", " 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',\n", " 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'Y',\n", " 'SEX_1', 'SEX_2', 'EDUCATION_0', 'EDUCATION_1', 'EDUCATION_2',\n", " 'EDUCATION_3', 'EDUCATION_4', 'EDUCATION_5', 'EDUCATION_6',\n", " 'MARRIAGE_0', 'MARRIAGE_1', 'MARRIAGE_2', 'MARRIAGE_3', 'intercept'],\n", " dtype='object')"]}, "execution_count": 10, "metadata": {}, "output_type": "execute_result"}], "source": ["# premi\u00e8re \u00e9tape pour ce module, il faut ajouter \u00e0 la main le beta z\u00e9ro - l'intercept\n", "data['intercept'] = 1.0\n", "data.rename(columns = {'default payment next month' : \"Y\"}, inplace = True)\n", "data.columns"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": ["# variable = ['AGE', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4',\n", "# 'BILL_AMT5', 'BILL_AMT6', 'LIMIT_BAL', 'PAY_0',\n", "# 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'PAY_AMT1', 'PAY_AMT2',\n", "# 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'SEX_1',\n", "# 'EDUCATION_0', 'EDUCATION_1', 'EDUCATION_2', 'EDUCATION_3',\n", "# 'EDUCATION_4', 'EDUCATION_5', 'MARRIAGE_0', 'MARRIAGE_1',\n", "# 'MARRIAGE_2', 'intercept']\n", "\n", "train_cols = [\"SEX_1\", \"AGE\", \"MARRIAGE_0\", 'PAY_0','intercept']"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": ["# Cette cellule n'est n\u00e9cessaire que si vous utilisez scipy 1.0 avec statsmodels 0.8.\n", "from pymyinstall.fix import fix_scipy10_for_statsmodels08\n", "fix_scipy10_for_statsmodels08()"]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Optimization terminated successfully.\n", " Current function value: 0.475306\n", " Iterations 6\n", " Logit Regression Results \n", "==============================================================================\n", "Dep. Variable: Y No. Observations: 22500\n", "Model: Logit Df Residuals: 22495\n", "Method: MLE Df Model: 4\n", "Date: Wed, 01 Nov 2017 Pseudo R-squ.: 0.1008\n", "Time: 20:48:54 Log-Likelihood: -10694.\n", "converged: True LL-Null: -11893.\n", " LLR p-value: 0.000\n", "==============================================================================\n", " coef std err z P>|z| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "SEX_1 0.1093 0.035 3.144 0.002 0.041 0.177\n", "AGE 0.0076 0.002 4.202 0.000 0.004 0.011\n", "MARRIAGE_0 -0.8095 0.527 -1.536 0.124 -1.842 0.223\n", "PAY_0 0.7304 0.016 44.554 0.000 0.698 0.763\n", "intercept -1.7150 0.067 -25.435 0.000 -1.847 -1.583\n", "==============================================================================\n"]}], "source": ["logit = sm.Logit(data['Y'], data[train_cols].astype(float))\n", "\n", "# fit the model\n", "result = logit.fit()\n", "print(result.summary())"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Pr\u00e9diction sur la base de test\n", "\n", "On va pr\u00e9dire les probabilit\u00e9s de d\u00e9faut de paiement sur notre base de test.\n", "\n", "Il faut commencer par transformer la base de test comme on l'a fait pour la base de train."]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": ["data_test = pd.read_csv(fichier_test, header=[0,1], sep=\"\\t\", index_col=0)"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": ["#### Gender dummies\n", "data_test['X2'] = data_test['X2'].applymap(str)\n", "gender_dummies_test = pd.get_dummies(data_test['X2'] )"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": ["### education dummies\n", "data_test['X3'] = data_test['X3'].applymap(str)\n", "educ_dummies_test = pd.get_dummies(data_test['X3'] )"]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": ["#### marriage dummies\n", "data_test['X4'] = data_test['X4'].applymap(str)\n", "mariage_dummies_test = pd.get_dummies(data_test['X4'] )"]}, {"cell_type": "code", "execution_count": 17, "metadata": {"scrolled": true}, "outputs": [], "source": ["### On va aussi supprimer les multi index de la table\n", "data_test.columns = data_test.columns.droplevel(0)"]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": ["#### on aggr\u00e8ge ensuite les 3 tables ensemble\n", "data_test = data_test.join(gender_dummies_test).join(educ_dummies_test).join(mariage_dummies_test)"]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": ["data_test['intercept'] = 1.0"]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SEX_1AGEMARRIAGE_0PAY_0intercept
0035001.0
10550-11.0
2133001.0
3123001.0
41420-21.0
\n", "
"], "text/plain": [" SEX_1 AGE MARRIAGE_0 PAY_0 intercept\n", "0 0 35 0 0 1.0\n", "1 0 55 0 -1 1.0\n", "2 1 33 0 0 1.0\n", "3 1 23 0 0 1.0\n", "4 1 42 0 -2 1.0"]}, "execution_count": 21, "metadata": {}, "output_type": "execute_result"}], "source": ["data_test[train_cols].head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Maintenant que la base de test est \u00e9galement transform\u00e9e, nous allons appliqu\u00e9 les r\u00e9sultats de notre mod\u00e8le \u00e0 cette table en utilisant la fonction predict"]}, {"cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [{"data": {"text/plain": ["count 7500.000000\n", "mean 0.220305\n", "std 0.136834\n", "min 0.025511\n", "25% 0.114870\n", "50% 0.194184\n", "75% 0.225708\n", "max 0.988777\n", "Name: prediction_statsmodel, dtype: float64"]}, "execution_count": 22, "metadata": {}, "output_type": "execute_result"}], "source": ["data_test['prediction_statsmodel'] = result.predict(data_test[train_cols])\n", "data_test['prediction_statsmodel'].describe()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On trouve un taux moyen de d\u00e9faut de 22%, ce qui est tr\u00e8s proche du taux observ\u00e9 dans la base de train"]}, {"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": ["# puis on l'exporte\n", "data_test['prediction_statsmodel'].to_csv(\"./answer.csv\", index=False)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Regression logistique en utilisant Scikit-learn\n", "\n", "A pr\u00e9sent, nous allons utiliser le module scikit learn pour estimer le m\u00eame mod\u00e8le que pr\u00e9c\u00e9demment et comparer les r\u00e9sultats.\n", "\n", "Ici pas besoin d'ajouter une variable \u00e9gale \u00e0 1 (l'intercept) car Scikit learn consid\u00e8re qu'il y a un intercept par d\u00e9faut."]}, {"cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": ["from sklearn import linear_model\n", "logistic = linear_model.LogisticRegression()"]}, {"cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["l'estimation des coefficients [[ 0.10899192 0.00751036 -0.56310513 0.72994068 -0.85555264]] \n", "\n", "l'intercept : [-0.85555264]\n"]}], "source": ["print(\"l'estimation des coefficients\", logistic.fit(data[train_cols], data['Y']).coef_, \"\\n\")\n", "\n", "print(\"l'intercept : \", logistic.fit(data[train_cols], data['Y']).intercept_)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["A la diff\u00e9rence de statsmodels, Scikit-learn ne propose pas une belle table de r\u00e9sultat : seulement les arrays qui contiennent les coefficients\n", "\n", "Pour le d\u00e9tail des p-value et intervalles de confiance, il faudra les recalculer \u00e0 la main. \n", "\n", "\n", "Par contre, la fonction de pr\u00e9diction existe et elle renvoit la probabilit\u00e9 de Y = 0 et Y = 1 "]}, {"cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[ 0.80972784, 0.19027216],\n", " [ 0.88370325, 0.11629675],\n", " [ 0.79482709, 0.20517291],\n", " ..., \n", " [ 0.80972784, 0.19027216],\n", " [ 0.88899819, 0.11100181],\n", " [ 0.79667299, 0.20332701]])"]}, "execution_count": 26, "metadata": {}, "output_type": "execute_result"}], "source": ["logistic.predict_proba(data_test[train_cols])"]}, {"cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([ 0.77965411, 0.22034589])"]}, "execution_count": 27, "metadata": {}, "output_type": "execute_result"}], "source": ["# pour calculer la moyenne du taux de d\u00e9faut pr\u00e9dit \n", "logistic.predict_proba(data_test[train_cols]).mean(axis=0)\n", "\n", "# on trouve \u00e0 nouveau 22%"]}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1"}}, "nbformat": 4, "nbformat_minor": 2}