{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Refresher on scikit-learn and machine learning (solutions)\n", "\n", "A few simple exercises on *scikit-learn*. The notebook is long for newcomers to machine learning and probably holds no surprises for those who have already practiced it."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": ["%matplotlib inline"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Synthetic data\n", "\n", "We simulate a random dataset."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[0.33151303, 0.70686719],\n", " [0.13039027, 0.58941167],\n", " [0.612744 , 0.37799233],\n", " [0.20215973, 0.11095186],\n", " [0.56857961, 0.10783821]])"]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["from numpy import random\n", "n = 1000\n", "X = random.rand(n, 2)\n", "X[:5]"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([0.10121972, 0.49414321, 2.19975264, 0.74372472, 2.27103021])"]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["y = X[:, 0] * 3 - 2 * X[:, 1] ** 2 + random.rand(n)\n", "y[:5]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 1: split into training and test sets\n", "\n", "A plain [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) is enough."]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": ["from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 2: fit a linear regression\n", "\n", "And compute the $R^2$ coefficient. 
For those who do not know how to use a search engine: [LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html), [r2_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html)."]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/plain": ["LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)"]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.linear_model import LinearRegression\n", "reg = LinearRegression()\n", "reg.fit(X_train, y_train)"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.908988490753245"]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.metrics import r2_score\n", "score = r2_score(y_test, reg.predict(X_test))\n", "score"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 3: improve the model with a well-chosen transformation\n", "\n", "The data were generated by the model $Y = 3 X_1 - 2 X_2^2 + \\epsilon$. 
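
To fit this model with a linear regression, the squared term has to exist as an explicit column. Here is a minimal sketch, on a made-up single row (the values 2 and 3 are arbitrary and not taken from the notebook), of what a degree-2 expansion produces for two inputs:

```python
import numpy
from sklearn.preprocessing import PolynomialFeatures

# One illustrative row: x1 = 2, x2 = 3 (arbitrary values, not from the notebook).
row = numpy.array([[2.0, 3.0]])

# Degree 2 is the default; it is spelled out here for clarity.
expanded = PolynomialFeatures(degree=2).fit_transform(row)

# Columns are: 1, x1, x2, x1^2, x1*x2, x2^2
print(expanded[0])
```

The last column, $X_2^2$, is the one the generating model needs; the other added columns are extra candidates whose coefficients the regression should learn to keep close to zero.
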
Adding polynomial features with [PolynomialFeatures](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) is enough."]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/plain": ["LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)"]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.preprocessing import PolynomialFeatures\n", "poly = PolynomialFeatures()\n", "poly.fit(X_train)\n", "X_train2 = poly.transform(X_train)\n", "reg2 = LinearRegression()\n", "reg2.fit(X_train2, y_train)"]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.9362394073926681"]}, "execution_count": 10, "metadata": {}, "output_type": "execute_result"}], "source": ["score2 = r2_score(y_test, reg2.predict(poly.transform(X_test)))\n", "score2"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The $R^2$ coefficient is higher because the regression now uses the same variables as the generating model. 
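
A rough sketch of where the ceiling on $R^2$ sits, assuming the simulation used above: $X_1$, $X_2$ and the noise are independent and uniform on $[0, 1]$. The best achievable $R^2$ is one minus the share of the variance of $Y$ that comes from the noise. The names below are illustrative and not part of the notebook.

```python
from fractions import Fraction

import numpy

# Variance of a uniform variable on [0, 1] is 1/12.
var_uniform = Fraction(1, 12)
# Var(U^2) = E[U^4] - E[U^2]^2 = 1/5 - 1/9 for U uniform on [0, 1].
var_square = Fraction(1, 5) - Fraction(1, 9)

# Y = 3 X1 - 2 X2^2 + noise, with independent terms.
var_y = 9 * var_uniform + 4 * var_square + var_uniform
r2_max = 1 - var_uniform / var_y
print(float(r2_max))

# Sanity check by simulation (seed chosen arbitrarily).
rng = numpy.random.default_rng(0)
x = rng.random((200000, 2))
noise = rng.random(200000)
y = 3 * x[:, 0] - 2 * x[:, 1] ** 2 + noise
r2_sim = 1 - noise.var() / y.var()
print(r2_sim)
```

On a finite test set the measured $R^2$ fluctuates around this value, so a score slightly above it is not a contradiction.
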
In theory it is not possible to do better, because what remains unexplained is the noise $\\epsilon$."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 4: fit a random forest"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"data": {"text/plain": ["RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,\n", " max_features='auto', max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,\n", " oob_score=False, random_state=None, verbose=0, warm_start=False)"]}, "execution_count": 11, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.ensemble import RandomForestRegressor\n", "rf = RandomForestRegressor()\n", "rf.fit(X_train, y_train)"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.9153506166386053"]}, "execution_count": 12, "metadata": {}, "output_type": "execute_result"}], "source": ["r2_score(y_test, rf.predict(X_test))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The linear model is the best model here since the data were built that way. The random forest's $R^2$ is not expected to be higher, or at least not significantly higher. 
Let's look at the result with the polynomial features..."]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.9119002367619022"]}, "execution_count": 13, "metadata": {}, "output_type": "execute_result"}], "source": ["rf2 = RandomForestRegressor()\n", "rf2.fit(X_train2, y_train)\n", "r2_score(y_test, rf2.predict(poly.transform(X_test)))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Before drawing hasty conclusions, the experiment should be repeated several times before claiming that performance is higher or lower with these features. This notebook does not do so since the theoretical answer is known in this case."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 5: a bit of math\n", "\n", "Compare the two models (linear regression and random forest) on the following data. What do you notice? Explain why."]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": ["X_test2 = random.rand(n, 2) + 0.5\n", "y_test2 = X_test2[:, 0] * 3 - 2 * X_test2[:, 1] ** 2 + random.rand(n)"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"data": {"text/html": ["
<table>\n",
"<thead><tr><th></th><th>name</th><th>r2</th><th>r2_jeu2</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>LinearRegression</td><td>0.908988</td><td>0.682467</td></tr>\n",
"<tr><th>1</th><td>LinearRegression + X^2</td><td>0.936239</td><td>0.948110</td></tr>\n",
"<tr><th>2</th><td>RandomForestRegressor</td><td>0.915351</td><td>0.493273</td></tr>\n",
"<tr><th>3</th><td>RandomForestRegressor + X^2</td><td>0.911900</td><td>0.517105</td></tr>\n",
"</tbody>\n",
"</table>
\n", "
"], "text/plain": [" name r2 r2_jeu2\n", "0 LinearRegression 0.908988 0.682467\n", "1 LinearRegression + X^2 0.936239 0.948110\n", "2 RandomForestRegressor 0.915351 0.493273\n", "3 RandomForestRegressor + X^2 0.911900 0.517105"]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["res = []\n", "for model in [reg, reg2, rf, rf2]:\n", "    name = model.__class__.__name__\n", "    try:\n", "        pred = model.predict(X_test)\n", "        pred2 = model.predict(X_test2)\n", "    except Exception:\n", "        # models fitted on polynomial features need the expanded inputs\n", "        pred = model.predict(poly.transform(X_test))\n", "        pred2 = model.predict(poly.transform(X_test2))\n", "        name += \" + X^2\"\n", "    res.append(dict(name=name, r2=r2_score(y_test, pred),\n", "                    r2_jeu2=r2_score(y_test2, pred2)))\n", "\n", "import pandas\n", "df = pandas.DataFrame(res)\n", "df"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The only model that really holds up is the linear regression with polynomial features. Since it has the same form as the theoretical model, it is no surprise that it does not fail badly, even though its coefficients are not identical to the theoretical ones (more data would be needed for them to converge). The next cell prints them in the order $1, X_1, X_2, X_1^2, X_1 X_2, X_2^2$."]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/plain": ["(array([ 0. , 2.81692538, -0.29768531, 0.08662761, 0.13367719,\n", " -1.7515442 ]), 0.5889925538787228)"]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["reg2.coef_, reg2.intercept_"]}, {"cell_type": "markdown", "metadata": {}, "source": ["For the other models, let's first look at what happens visually."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 6: draw a graph with...\n", "\n", "I let the code describe the approach chosen to illustrate the shortcomings of the previous models. 
The commentary follows the graph for the lazy."]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": ["
"]}, "metadata": {}, "output_type": "display_data"}], "source": ["import matplotlib.pyplot as plt\n", "fig, ax = plt.subplots(1, 2, figsize=(14, 4))\n", "\n", "a, b = 0.9, 1.1\n", "index1 = (X_test2[:, 0] >= a) & (X_test2[:, 0] <= b)\n", "index2 = (X_test2[:, 1] >= a) & (X_test2[:, 1] <= b)\n", "# noise-free part of the generating model: 3 X1 - 2 X2^2\n", "yth = X_test2[:, 0] * 3 - 2 * X_test2[:, 1] ** 2\n", "\n", "\n", "ax[0].set_xlabel(\"X1\")\n", "ax[0].set_ylabel(\"Y\")\n", "ax[0].plot(X_test2[index2, 0], yth[index2], '.', label='theoretical Y')\n", "ax[1].set_xlabel(\"X2\")\n", "ax[1].set_ylabel(\"Y\")\n", "ax[1].plot(X_test2[index1, 1], yth[index1], '.', label='theoretical Y')\n", "\n", "for model in [reg, reg2, rf, rf2]:\n", "    name = model.__class__.__name__\n", "    try:\n", "        pred2 = model.predict(X_test2)\n", "    except Exception:\n", "        pred2 = model.predict(poly.transform(X_test2))\n", "        name += \" + X^2\"\n", "    ax[0].plot(X_test2[index2, 0], pred2[index2], '.', label=name)\n", "    ax[1].plot(X_test2[index1, 1], pred2[index1], '.', label=name)\n", "ax[0].legend()\n", "ax[1].legend();"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The graph plots the models' predictions along one coordinate while restricting the other to a given interval. It is immediately visible that the random forest becomes constant beyond a certain threshold. Once again this is perfectly normal since the training set only contains $X_1, X_2$ in the interval $[0, 1]$. Outside of it, each decision tree returns a constant value simply because trees are step functions: a random forest is an average of step functions, so it is bounded. As for the first linear regression, it cannot capture second-degree effects because it is linear in the original variables. 
The first linear regression drifts less than the random forest, but it still drifts away from the target.\n", "\n", "This exercise illustrates that a machine learning model is estimated on a dataset that follows a certain distribution. When the data the model predicts on no longer follow that distribution, models return answers that are very likely wrong, each model in its own way.\n", "\n", "This is why machine learning models are said to need regular retraining, especially when they are applied to data generated by human activity rather than data coming from physical problems."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 7: illustrate overfitting with a decision tree\n", "\n", "On the first dataset."]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [{"data": {"text/html": ["
<table>\n",
"<thead><tr><th></th><th>profondeur</th><th>r2_test</th><th>r2_train</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>1</td><td>0.380103</td><td>0.446387</td></tr>\n",
"<tr><th>1</th><td>2</td><td>0.636346</td><td>0.672284</td></tr>\n",
"<tr><th>2</th><td>3</td><td>0.786778</td><td>0.826594</td></tr>\n",
"<tr><th>3</th><td>4</td><td>0.872799</td><td>0.892911</td></tr>\n",
"<tr><th>4</th><td>5</td><td>0.896276</td><td>0.931297</td></tr>\n",
"</tbody>\n",
"</table>
\n", "
"], "text/plain": [" profondeur r2_test r2_train\n", "0 1 0.380103 0.446387\n", "1 2 0.636346 0.672284\n", "2 3 0.786778 0.826594\n", "3 4 0.872799 0.892911\n", "4 5 0.896276 0.931297"]}, "execution_count": 18, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.tree import DecisionTreeRegressor\n", "\n", "res = []\n", "for md in range(1, 20):\n", "    tree = DecisionTreeRegressor(max_depth=md)\n", "    tree.fit(X_train, y_train)\n", "    r2_train = r2_score(y_train, tree.predict(X_train))\n", "    r2_test = r2_score(y_test, tree.predict(X_test))\n", "    res.append(dict(profondeur=md, r2_train=r2_train, r2_test=r2_test))\n", "\n", "df = pandas.DataFrame(res)\n", "df.head()"]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": ["
"]}, "metadata": {}, "output_type": "display_data"}], "source": ["ax = df.plot(x='profondeur', y=['r2_train', 'r2_test'])\n", "ax.set_title(\"R2 as a function of tree depth\");"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 8: increase the number of features and regularize a linear regression\n", "\n", "The goal is to look at the impact of regularizing the coefficients of a linear regression as the number of features grows. We use polynomial features and a [Ridge](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) or [Lasso](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) regression."]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [{"data": {"text/html": ["
<table>\n",
"<thead><tr><th></th><th>degre</th><th>nb_features</th><th>nnul_las</th><th>nnul_reg</th><th>nnul_rid</th><th>norm_las</th><th>norm_reg</th><th>norm_rid</th><th>r2_las</th><th>r2_reg</th><th>r2_rid</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>1</td><td>3</td><td>2</td><td>2</td><td>2</td><td>3.403924</td><td>3.568168e+00</td><td>3.079896</td><td>0.906854</td><td>0.908988</td><td>0.892232</td></tr>\n",
"<tr><th>1</th><td>2</td><td>6</td><td>4</td><td>5</td><td>5</td><td>3.280738</td><td>3.334211e+00</td><td>2.350989</td><td>0.934247</td><td>0.936239</td><td>0.919758</td></tr>\n",
"<tr><th>2</th><td>3</td><td>10</td><td>4</td><td>9</td><td>9</td><td>3.280888</td><td>5.090698e+00</td><td>2.072340</td><td>0.934247</td><td>0.935657</td><td>0.921656</td></tr>\n",
"<tr><th>3</th><td>4</td><td>15</td><td>4</td><td>14</td><td>14</td><td>3.280963</td><td>1.180973e+01</td><td>1.992448</td><td>0.934248</td><td>0.934934</td><td>0.921532</td></tr>\n",
"<tr><th>4</th><td>5</td><td>21</td><td>4</td><td>20</td><td>20</td><td>3.281022</td><td>6.657031e+01</td><td>1.972762</td><td>0.934248</td><td>0.935177</td><td>0.921656</td></tr>\n",
"<tr><th>5</th><td>6</td><td>28</td><td>4</td><td>27</td><td>26</td><td>3.281037</td><td>3.242570e+02</td><td>1.965189</td><td>0.934248</td><td>0.934388</td><td>0.921823</td></tr>\n",
"<tr><th>6</th><td>7</td><td>36</td><td>4</td><td>35</td><td>35</td><td>3.281040</td><td>1.428490e+03</td><td>1.958965</td><td>0.934248</td><td>0.933246</td><td>0.921927</td></tr>\n",
"<tr><th>7</th><td>8</td><td>45</td><td>4</td><td>44</td><td>43</td><td>3.281041</td><td>1.441240e+04</td><td>1.952981</td><td>0.934248</td><td>0.931307</td><td>0.921976</td></tr>\n",
"<tr><th>8</th><td>9</td><td>55</td><td>4</td><td>55</td><td>53</td><td>3.281041</td><td>4.893616e+13</td><td>1.947737</td><td>0.934248</td><td>0.616562</td><td>0.921997</td></tr>\n",
"<tr><th>9</th><td>10</td><td>66</td><td>4</td><td>65</td><td>65</td><td>3.281041</td><td>4.659569e+05</td><td>1.943600</td><td>0.934248</td><td>0.929751</td><td>0.922013</td></tr>\n",
"<tr><th>10</th><td>11</td><td>78</td><td>4</td><td>77</td><td>75</td><td>3.281041</td><td>2.360073e+06</td><td>1.940623</td><td>0.934248</td><td>0.925344</td><td>0.922035</td></tr>\n",
"<tr><th>11</th><td>12</td><td>91</td><td>4</td><td>91</td><td>88</td><td>3.281041</td><td>1.708086e+07</td><td>1.938666</td><td>0.934248</td><td>0.921123</td><td>0.922067</td></tr>\n",
"<tr><th>12</th><td>13</td><td>105</td><td>4</td><td>105</td><td>99</td><td>3.281041</td><td>9.955733e+07</td><td>1.937518</td><td>0.934248</td><td>0.874136</td><td>0.922108</td></tr>\n",
"<tr><th>13</th><td>14</td><td>120</td><td>4</td><td>120</td><td>111</td><td>3.281041</td><td>6.854497e+08</td><td>1.936961</td><td>0.934248</td><td>0.927716</td><td>0.922158</td></tr>\n",
"<tr><th>14</th><td>15</td><td>136</td><td>4</td><td>136</td><td>127</td><td>3.281041</td><td>3.786997e+09</td><td>1.936808</td><td>0.934248</td><td>0.842488</td><td>0.922212</td></tr>\n",
"<tr><th>15</th><td>16</td><td>153</td><td>4</td><td>153</td><td>145</td><td>3.281041</td><td>4.467998e+10</td><td>1.936913</td><td>0.934248</td><td>0.664588</td><td>0.922269</td></tr>\n",
"<tr><th>16</th><td>17</td><td>171</td><td>4</td><td>171</td><td>161</td><td>3.281041</td><td>2.361809e+11</td><td>1.937165</td><td>0.934248</td><td>-0.726442</td><td>0.922325</td></tr>\n",
"<tr><th>17</th><td>18</td><td>190</td><td>4</td><td>190</td><td>179</td><td>3.281041</td><td>1.599035e+12</td><td>1.937489</td><td>0.934248</td><td>0.582385</td><td>0.922379</td></tr>\n",
"<tr><th>18</th><td>19</td><td>210</td><td>4</td><td>210</td><td>203</td><td>3.281041</td><td>4.455355e+13</td><td>1.937834</td><td>0.934248</td><td>-25.406536</td><td>0.922429</td></tr>\n",
"<tr><th>19</th><td>20</td><td>231</td><td>4</td><td>231</td><td>223</td><td>3.281041</td><td>2.262080e+13</td><td>1.938168</td><td>0.934248</td><td>-21.684447</td><td>0.922475</td></tr>\n",
"</tbody>\n",
"</table>
\n", "
"], "text/plain": [" degre nb_features nnul_las nnul_reg nnul_rid norm_las norm_reg \\\n", "0 1 3 2 2 2 3.403924 3.568168e+00 \n", "1 2 6 4 5 5 3.280738 3.334211e+00 \n", "2 3 10 4 9 9 3.280888 5.090698e+00 \n", "3 4 15 4 14 14 3.280963 1.180973e+01 \n", "4 5 21 4 20 20 3.281022 6.657031e+01 \n", "5 6 28 4 27 26 3.281037 3.242570e+02 \n", "6 7 36 4 35 35 3.281040 1.428490e+03 \n", "7 8 45 4 44 43 3.281041 1.441240e+04 \n", "8 9 55 4 55 53 3.281041 4.893616e+13 \n", "9 10 66 4 65 65 3.281041 4.659569e+05 \n", "10 11 78 4 77 75 3.281041 2.360073e+06 \n", "11 12 91 4 91 88 3.281041 1.708086e+07 \n", "12 13 105 4 105 99 3.281041 9.955733e+07 \n", "13 14 120 4 120 111 3.281041 6.854497e+08 \n", "14 15 136 4 136 127 3.281041 3.786997e+09 \n", "15 16 153 4 153 145 3.281041 4.467998e+10 \n", "16 17 171 4 171 161 3.281041 2.361809e+11 \n", "17 18 190 4 190 179 3.281041 1.599035e+12 \n", "18 19 210 4 210 203 3.281041 4.455355e+13 \n", "19 20 231 4 231 223 3.281041 2.262080e+13 \n", "\n", " norm_rid r2_las r2_reg r2_rid \n", "0 3.079896 0.906854 0.908988 0.892232 \n", "1 2.350989 0.934247 0.936239 0.919758 \n", "2 2.072340 0.934247 0.935657 0.921656 \n", "3 1.992448 0.934248 0.934934 0.921532 \n", "4 1.972762 0.934248 0.935177 0.921656 \n", "5 1.965189 0.934248 0.934388 0.921823 \n", "6 1.958965 0.934248 0.933246 0.921927 \n", "7 1.952981 0.934248 0.931307 0.921976 \n", "8 1.947737 0.934248 0.616562 0.921997 \n", "9 1.943600 0.934248 0.929751 0.922013 \n", "10 1.940623 0.934248 0.925344 0.922035 \n", "11 1.938666 0.934248 0.921123 0.922067 \n", "12 1.937518 0.934248 0.874136 0.922108 \n", "13 1.936961 0.934248 0.927716 0.922158 \n", "14 1.936808 0.934248 0.842488 0.922212 \n", "15 1.936913 0.934248 0.664588 0.922269 \n", "16 1.937165 0.934248 -0.726442 0.922325 \n", "17 1.937489 0.934248 0.582385 0.922379 \n", "18 1.937834 0.934248 -25.406536 0.922429 \n", "19 1.938168 0.934248 -21.684447 0.922475 "]}, "execution_count": 20, "metadata": {}, "output_type": "execute_result"}], 
"source": ["from sklearn.linear_model import Ridge, Lasso\n", "import numpy.linalg as nplin\n", "import numpy\n", "\n", "def coef_non_nuls(coef):\n", " return sum(numpy.abs(coef) > 0.001)\n", "\n", "res = []\n", "for d in range(1, 21):\n", " poly = PolynomialFeatures(degree=d) \n", " poly.fit(X_train)\n", " X_test2 = poly.transform(X_test)\n", " \n", " reg = LinearRegression()\n", " reg.fit(poly.transform(X_train), y_train)\n", " r2_reg = r2_score(y_test, reg.predict(X_test2))\n", " \n", " rid = Ridge(alpha=10)\n", " rid.fit(poly.transform(X_train), y_train)\n", " r2_rid = r2_score(y_test, rid.predict(X_test2))\n", " \n", " las = Lasso(alpha=0.01)\n", " las.fit(poly.transform(X_train), y_train)\n", " r2_las = r2_score(y_test, las.predict(X_test2))\n", " \n", " res.append(dict(degre=d, nb_features=X_test2.shape[1],\n", " r2_reg=r2_reg, r2_las=r2_las, r2_rid=r2_rid,\n", " norm_reg=nplin.norm(reg.coef_),\n", " norm_rid=nplin.norm(rid.coef_),\n", " norm_las=nplin.norm(las.coef_),\n", " nnul_reg=coef_non_nuls(reg.coef_),\n", " nnul_rid=coef_non_nuls(rid.coef_),\n", " nnul_las=coef_non_nuls(las.coef_),\n", " ))\n", "\n", "df = pandas.DataFrame(res)\n", "df"]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": ["
"]}, "metadata": {}, "output_type": "display_data"}], "source": ["fig, ax = plt.subplots(1, 2, figsize=(12, 4))\n", "df.plot(x=\"nb_features\", y=[\"r2_reg\", \"r2_las\", \"r2_rid\"], ax=ax[0])\n", "ax[0].set_xlabel(\"Nombre de features\")\n", "ax[0].set_ylim([0, 1])\n", "ax[0].set_title(\"r2\")\n", "df.plot(x=\"nb_features\", y=[\"nnul_reg\", \"nnul_las\", \"nnul_rid\"], ax=ax[1])\n", "ax[1].set_xlabel(\"Nombre de features\")\n", "ax[1].set_title(\"Nombre de coefficients non nuls\");"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Num\u00e9riquement, la r\u00e9gression lin\u00e9aire devient difficile \u00e0 estimer lorsque le nombre de features augmente. Th\u00e9oriquement, il ne devrait pas y avoir de baisse de performances mais le graphe montre des erreurs \u00e9videntes. Cela se traduit par une norme des coefficients qui explose. La r\u00e9gularisation parvient \u00e0 contraindre les mod\u00e8les. La r\u00e9gression *Ridge* produira beaucoup de petits coefficients non nuls, la r\u00e9gression *Lasso* pr\u00e9f\u00e8rera concentrer la norme sur quelques coefficients seulement. Cette observation n'est vraie que dans le cas d'une r\u00e9gression lin\u00e9aire avec une erreur quadratique."]}, {"cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0"}}, "nbformat": 4, "nbformat_minor": 2}