{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 2A.ml - Pipeline for reducing a random forest - solution\n", "\n", "A Lasso model selects variables, and a random forest computes its prediction as the average of its regression trees. Fitting a Lasso on the outputs of the trees therefore selects a subset of trees. This idea was covered in the notebook [Reduction d'une for\u00eat al\u00e9atoire](http://www.xavierdupre.fr/app/ensae_teaching_cs/helpsphinx/notebooks/td2a_tree_selection_correction.html). The goal here is to automate the process."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": ["%matplotlib inline"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Datasets\n", "\n", "Since we always need data, we use the [Diabetes](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html) dataset."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": ["from sklearn.datasets import load_diabetes\n", "data = load_diabetes()\n", "X, y = data.data, data.target"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": ["from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Random forest followed by Lasso\n", "\n", "The method consists of training a random forest and then fitting a regression on the predictions of its individual estimators."]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([0.00516931, 0. , 0. , 0. , 0. ,\n", " 0. , 0. , 0. , 0.05150952, 0. ,\n", " 0.0114454 , 0.00778913, 0. , 0.04239907, 0.01882099,\n", " 0.02956967, 0. , 0.04699227, 0. , 0.04588009,\n", " 0.00476672, 0.05276899, 0. , 0. , 0.00719994,\n", " 0. , 0.02817731, 0. , 0. , 0.03606261,\n", " 0.00228349, 0.01204062, 0.02018557, 0. , 0. ,\n", " 0.03759611, 0.04608785, 0. , 0.00316996, 0. ,\n", " 0. , 0. , 0.01678394, 0. , 0. ,\n", " 0. , 0.00801926, 0.07006079, 0.03263025, 0. ,\n", " 0.00770145, 0. , 0.00351302, 0. , 0. ,\n", " 0. , 0. , 0. , 0.00183299, 0. ,\n", " 0. , 0. , 0. , 0. , 0. ,\n", " 0. , 0.02545205, 0.05789703, 0. , 0. ,\n", " 0. , 0.0065516 , 0. , 0. , 0. ,\n", " 0.07234827, 0. 
, 0.03547108, 0. , 0. ,\n", " 0.03080198, 0.00930293, 0.04231454, 0. , 0.01124574,\n", " 0. , 0. , 0. , 0. , 0. ,\n", " 0.00108674, 0.02485889, 0.01839299, 0. , 0. ,\n", " 0.03118312, 0. , 0. , 0. , 0. ])"]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["import numpy\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.linear_model import Lasso\n", "\n", "# Train a random forest\n", "clr = RandomForestRegressor()\n", "clr.fit(X_train, y_train)\n", "\n", "# Collect the prediction of each tree\n", "X_train_2 = numpy.zeros((X_train.shape[0], len(clr.estimators_)))\n", "estimators = numpy.array(clr.estimators_).ravel()\n", "for i, est in enumerate(estimators):\n", "    pred = est.predict(X_train)\n", "    X_train_2[:, i] = pred\n", "\n", "# Fit a Lasso regression on the tree predictions\n", "lrs = Lasso(max_iter=10000)\n", "lrs.fit(X_train_2, y_train)\n", "lrs.coef_"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We managed to reproduce the whole process. 
It is not always easy to remember every step, which is why it is simpler to bundle everything into a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## First pipeline\n", "\n", "The idea is to end up with something like the following."]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["name 'fct' is not defined\n"]}], "source": ["from sklearn.pipeline import Pipeline\n", "\n", "try:\n", "    pipe = Pipeline(steps=[\n", "        ('rf', RandomForestRegressor()),\n", "        (\"a function that does not exist yet\", fct),\n", "        (\"lasso\", Lasso()),\n", "    ])\n", "except Exception as e:\n", "    print(e)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Every step of a pipeline must be a transformer (a scaler, for example), except the last one, which may be a predictive model (classifier or regressor). The function that extracts the tree predictions must therefore be wrapped in a *transformer*. That is the role of a [FunctionTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html)."]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[23.8, 31.2, 32. , ..., 32. , 29.9, 28. ],\n", " [21.6, 22.9, 21.6, ..., 21.6, 22. , 24.4],\n", " [33.8, 37.2, 34.7, ..., 34.9, 34.7, 34.7],\n", " ...,\n", " [23.9, 23.9, 31.5, ..., 29.9, 23.9, 23.9],\n", " [23.9, 22. , 22. , ..., 29.9, 23.9, 22. 
],\n", " [11.9, 11.9, 11.9, ..., 11.9, 20.6, 11.9]])"]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.preprocessing import FunctionTransformer\n", "\n", "def random_forest_tree_prediction(rf, X):\n", "    preds = numpy.zeros((X.shape[0], len(rf.estimators_)))\n", "    estimators = numpy.array(rf.estimators_).ravel()\n", "    for i, est in enumerate(estimators):\n", "        pred = est.predict(X)\n", "        preds[:, i] = pred\n", "    return preds\n", "\n", "\n", "random_forest_tree_prediction(clr, X)"]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[ 8.5, 8.5, 5.6, ..., 5. , 5. , 8.5],\n", " [14.8, 15. , 13.4, ..., 19.5, 21.7, 16.6],\n", " [13.1, 13.1, 15.1, ..., 13.1, 11.3, 12.6],\n", " ...,\n", " [21.4, 22. , 21.4, ..., 20. , 22.6, 21.4],\n", " [25.1, 29.9, 25.1, ..., 25.1, 25.1, 25.1],\n", " [28.4, 28.4, 28.4, ..., 28.4, 28.4, 28.4]])"]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["fct = FunctionTransformer(lambda X, rf=clr: random_forest_tree_prediction(rf, X))\n", "\n", "fct.transform(X_train)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Everything works. 
All that remains is to insert it into the pipeline."]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',\n", " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=100,\n", " n_jobs=None, oob_score=False, random_state=None,\n", " verbose=0, warm_start=False)' (type ) doesn't\n"]}], "source": ["try:\n", "    pipe = Pipeline(steps=[\n", "        ('rf', RandomForestRegressor()),\n", "        (\"tree_pred\", fct),\n", "        (\"lasso\", Lasso()),\n", "    ])\n", "except Exception as e:\n", "    print(e)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["It still does not work, because a pipeline requires every step except the last one to be a *transformer* that implements the ``transform`` method, and a random forest does not. There is a second problem: the function only works if it receives the random forest as an argument. We passed the forest that was already trained, which is not the one the pipeline would have trained."]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"data": {"text/plain": ["False"]}, "execution_count": 11, "metadata": {}, "output_type": "execute_result"}], "source": ["hasattr(clr, 'transform')"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": ["from jyquickhelper import RenderJsDot"]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 13, "metadata": {}, "output_type": "execute_result"}], "source": ["RenderJsDot(\"\"\"digraph {\n", "    A [label=\"RandomForestRegressor pipeline\"];\n", "    A2 [label=\"RandomForestRegressor - pretrained\"];\n", "    B [label=\"FunctionTransformer\"]; C [label=\"Lasso\"];\n", "    A -> B [label=\"X\"]; B -> C [label=\"X2\"]; A2 -> B [label=\"rf\"]; }\"\"\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Since that does not work, we move on to a second idea."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Second pipeline\n", "\n", "We disguise the random forest as a transformer."]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[ 8.5, 8.5, 10.4, ..., 8.5, 8.5, 5.6],\n", " [13.4, 15. , 21.7, ..., 21.7, 21.7, 20. ],\n", " [ 8.3, 13.1, 15.1, ..., 13.6, 13.1, 17.1],\n", " ...,\n", " [21.4, 21.2, 21.4, ..., 21.4, 21.4, 21.4],\n", " [23.8, 25.1, 25.1, ..., 24.3, 25.1, 22. ],\n", " [28.4, 28.4, 28.4, ..., 28.1, 28.4, 28.4]])"]}, "execution_count": 14, "metadata": {}, "output_type": "execute_result"}], "source": ["class RandomForestRegressorAsTransformer:\n", "\n", "    def __init__(self, **kwargs):\n", "        self.rf = RandomForestRegressor(**kwargs)\n", "\n", "    def fit(self, X, y):\n", "        self.rf.fit(X, y)\n", "        return self\n", "\n", "    def transform(self, X):\n", "        preds = numpy.zeros((X.shape[0], len(self.rf.estimators_)))\n", "        estimators = numpy.array(self.rf.estimators_).ravel()\n", "        for i, est in enumerate(estimators):\n", "            pred = est.predict(X)\n", "            preds[:, i] = pred\n", "        return preds\n", "\n", "\n", "trrf = RandomForestRegressorAsTransformer()\n", "trrf.fit(X_train, y_train)\n", "trrf.transform(X_train)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Everything works. 
We rebuild the pipeline."]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["C:\\xavierdupre\\__home_\\github_fork\\scikit-learn\\sklearn\\linear_model\\coordinate_descent.py:475: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 15.605865570498736, tolerance: 3.566623625329815\n", " positive)\n"]}, {"data": {"text/plain": ["Pipeline(memory=None,\n", " steps=[('trrf',\n", " <__main__.RandomForestRegressorAsTransformer object at 0x000002D7FBD6B0F0>),\n", " ('lasso',\n", " Lasso(alpha=1.0, copy_X=True, fit_intercept=True,\n", " max_iter=1000, normalize=False, positive=False,\n", " precompute=False, random_state=None, selection='cyclic',\n", " tol=0.0001, warm_start=False))],\n", " verbose=False)"]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["pipe = Pipeline(steps=[('trrf', RandomForestRegressorAsTransformer()),\n", "                       (\"lasso\", Lasso())])\n", "\n", "pipe.fit(X_train, y_train)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We retrieve the coefficients."]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([0.00000000e+00, 8.18725785e-03, 2.57107281e-02, 2.64260468e-02,\n", " 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,\n", " 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,\n", " 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,\n", " 0.00000000e+00, 2.91496034e-02, 0.00000000e+00, 2.59224355e-03,\n", " 0.00000000e+00, 0.00000000e+00, 1.11199737e-02, 2.25351658e-02,\n", " 0.00000000e+00, 1.89481812e-02, 1.02779896e-01, 0.00000000e+00,\n", " 6.25993012e-03, 2.88645052e-02, 2.26525053e-02, 0.00000000e+00,\n", " 1.58723695e-02, 2.17116677e-02, 5.73111769e-02, 4.07723945e-02,\n", " 3.07676159e-02, 0.00000000e+00, 0.00000000e+00, 2.96368833e-02,\n", " 6.31627239e-03, 
3.05513736e-04, 0.00000000e+00, 0.00000000e+00,\n", " 0.00000000e+00, 2.61832331e-02, 0.00000000e+00, 0.00000000e+00,\n", " 1.95009449e-02, 3.88476951e-02, 1.12862592e-02, 1.97136005e-02,\n", " 0.00000000e+00, 5.67052346e-02, 9.39029327e-03, 0.00000000e+00,\n", " 0.00000000e+00, 0.00000000e+00, 6.86248078e-03, 0.00000000e+00,\n", " 0.00000000e+00, 0.00000000e+00, 5.22709050e-02, 1.56786096e-02,\n", " 0.00000000e+00, 1.06189159e-02, 0.00000000e+00, 0.00000000e+00,\n", " 0.00000000e+00, 7.99152616e-02, 0.00000000e+00, 1.05299329e-02,\n", " 0.00000000e+00, 0.00000000e+00, 4.70392340e-02, 0.00000000e+00,\n", " 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,\n", " 1.04728058e-03, 3.60665273e-02, 0.00000000e+00, 0.00000000e+00,\n", " 1.21597852e-02, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,\n", " 0.00000000e+00, 4.73818504e-02, 1.70113005e-02, 0.00000000e+00,\n", " 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,\n", " 0.00000000e+00, 0.00000000e+00, 2.29949689e-06, 0.00000000e+00])"]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["pipe.steps[1][1].coef_"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## What is it good for: GridSearchCV\n", "\n", "Since all the processing steps now live in a single pipeline that *scikit-learn* treats like any other model, we can search for the best hyperparameters of the model, such as the initial number of trees, the *alpha* parameter, the depth of the trees..."]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Fitting 5 folds for each of 12 candidates, totalling 60 fits\n", "'RandomForestRegressorAsTransformer' object has no attribute 'set_params'\n"]}, {"name": "stderr", "output_type": "stream", "text": ["[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n"]}], "source": ["from 
sklearn.model_selection import GridSearchCV\n", "\n", "param_grid = {'trrf__n_estimators': [30, 50, 80, 100],\n", "              'lasso__alpha': [0.5, 1.0, 1.5]}\n", "\n", "try:\n", "    grid = GridSearchCV(pipe, cv=5, verbose=1, param_grid=param_grid)\n", "    grid.fit(X_train, y_train)\n", "except Exception as e:\n", "    print(e)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The class ``RandomForestRegressorAsTransformer`` needs a *set_params* method, which forwards the parameters to the wrapped forest... No problem."]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": ["class RandomForestRegressorAsTransformer:\n", "\n", "    def __init__(self, **kwargs):\n", "        self.rf = RandomForestRegressor(**kwargs)\n", "\n", "    def fit(self, X, y):\n", "        self.rf.fit(X, y)\n", "        return self\n", "\n", "    def transform(self, X):\n", "        preds = numpy.zeros((X.shape[0], len(self.rf.estimators_)))\n", "        estimators = numpy.array(self.rf.estimators_).ravel()\n", "        for i, est in enumerate(estimators):\n", "            pred = est.predict(X)\n", "            preds[:, i] = pred\n", "        return preds\n", "\n", "    def set_params(self, **params):\n", "        # Forward the parameters to the wrapped forest.\n", "        self.rf.set_params(**params)\n", "        return self"]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n"]}, {"name": "stdout", "output_type": "stream", "text": ["Fitting 5 folds for each of 6 candidates, totalling 30 fits\n", "[CV] lasso__alpha=0.5, trrf__n_estimators=50 .........................\n", "[CV] .......... lasso__alpha=0.5, trrf__n_estimators=50, total= 0.3s\n", "[CV] lasso__alpha=0.5, trrf__n_estimators=50 .........................\n"]}, {"name": "stderr", "output_type": "stream", "text": ["[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.2s remaining: 0.0s\n"]}, {"name": "stdout", "output_type": "stream", "text": ["[CV] .......... 
lasso__alpha=0.5, trrf__n_estimators=50, total= 0.3s\n", "[CV] lasso__alpha=0.5, trrf__n_estimators=50 .........................\n", "[CV] .......... lasso__alpha=0.5, trrf__n_estimators=50, total= 0.3s\n", "[CV] lasso__alpha=0.5, trrf__n_estimators=50 .........................\n", "[CV] .......... lasso__alpha=0.5, trrf__n_estimators=50, total= 0.2s\n", "[CV] lasso__alpha=0.5, trrf__n_estimators=50 .........................\n", "[CV] .......... lasso__alpha=0.5, trrf__n_estimators=50, total= 0.2s\n", "[CV] lasso__alpha=0.5, trrf__n_estimators=100 ........................\n", "[CV] ......... lasso__alpha=0.5, trrf__n_estimators=100, total= 0.4s\n", "[CV] lasso__alpha=0.5, trrf__n_estimators=100 ........................\n", "[CV] ......... lasso__alpha=0.5, trrf__n_estimators=100, total= 0.4s\n", "[CV] lasso__alpha=0.5, trrf__n_estimators=100 ........................\n", "[CV] ......... lasso__alpha=0.5, trrf__n_estimators=100, total= 0.4s\n", "[CV] lasso__alpha=0.5, trrf__n_estimators=100 ........................\n", "[CV] ......... lasso__alpha=0.5, trrf__n_estimators=100, total= 0.5s\n", "[CV] lasso__alpha=0.5, trrf__n_estimators=100 ........................\n", "[CV] ......... lasso__alpha=0.5, trrf__n_estimators=100, total= 0.4s\n", "[CV] lasso__alpha=1.0, trrf__n_estimators=50 .........................\n", "[CV] .......... lasso__alpha=1.0, trrf__n_estimators=50, total= 0.2s\n", "[CV] lasso__alpha=1.0, trrf__n_estimators=50 .........................\n", "[CV] .......... lasso__alpha=1.0, trrf__n_estimators=50, total= 0.2s\n", "[CV] lasso__alpha=1.0, trrf__n_estimators=50 .........................\n", "[CV] .......... lasso__alpha=1.0, trrf__n_estimators=50, total= 0.3s\n", "[CV] lasso__alpha=1.0, trrf__n_estimators=50 .........................\n", "[CV] .......... lasso__alpha=1.0, trrf__n_estimators=50, total= 0.2s\n", "[CV] lasso__alpha=1.0, trrf__n_estimators=50 .........................\n", "[CV] .......... 
lasso__alpha=1.0, trrf__n_estimators=50, total= 0.2s\n", "[CV] lasso__alpha=1.0, trrf__n_estimators=100 ........................\n", "[CV] ......... lasso__alpha=1.0, trrf__n_estimators=100, total= 0.5s\n", "[CV] lasso__alpha=1.0, trrf__n_estimators=100 ........................\n", "[CV] ......... lasso__alpha=1.0, trrf__n_estimators=100, total= 0.4s\n", "[CV] lasso__alpha=1.0, trrf__n_estimators=100 ........................\n", "[CV] ......... lasso__alpha=1.0, trrf__n_estimators=100, total= 0.5s\n", "[CV] lasso__alpha=1.0, trrf__n_estimators=100 ........................\n", "[CV] ......... lasso__alpha=1.0, trrf__n_estimators=100, total= 0.5s\n", "[CV] lasso__alpha=1.0, trrf__n_estimators=100 ........................\n", "[CV] ......... lasso__alpha=1.0, trrf__n_estimators=100, total= 0.4s\n", "[CV] lasso__alpha=1.5, trrf__n_estimators=50 .........................\n", "[CV] .......... lasso__alpha=1.5, trrf__n_estimators=50, total= 0.2s\n", "[CV] lasso__alpha=1.5, trrf__n_estimators=50 .........................\n", "[CV] .......... lasso__alpha=1.5, trrf__n_estimators=50, total= 0.2s\n", "[CV] lasso__alpha=1.5, trrf__n_estimators=50 .........................\n", "[CV] .......... lasso__alpha=1.5, trrf__n_estimators=50, total= 0.2s\n", "[CV] lasso__alpha=1.5, trrf__n_estimators=50 .........................\n", "[CV] .......... lasso__alpha=1.5, trrf__n_estimators=50, total= 0.2s\n", "[CV] lasso__alpha=1.5, trrf__n_estimators=50 .........................\n", "[CV] .......... lasso__alpha=1.5, trrf__n_estimators=50, total= 0.2s\n", "[CV] lasso__alpha=1.5, trrf__n_estimators=100 ........................\n", "[CV] ......... lasso__alpha=1.5, trrf__n_estimators=100, total= 0.4s\n", "[CV] lasso__alpha=1.5, trrf__n_estimators=100 ........................\n", "[CV] ......... lasso__alpha=1.5, trrf__n_estimators=100, total= 0.4s\n", "[CV] lasso__alpha=1.5, trrf__n_estimators=100 ........................\n", "[CV] ......... 
lasso__alpha=1.5, trrf__n_estimators=100, total= 0.5s\n", "[CV] lasso__alpha=1.5, trrf__n_estimators=100 ........................\n", "[CV] ......... lasso__alpha=1.5, trrf__n_estimators=100, total= 0.7s\n", "[CV] lasso__alpha=1.5, trrf__n_estimators=100 ........................\n", "[CV] ......... lasso__alpha=1.5, trrf__n_estimators=100, total= 0.5s\n"]}, {"name": "stderr", "output_type": "stream", "text": ["[Parallel(n_jobs=1)]: Done 30 out of 30 | elapsed: 10.6s finished\n"]}], "source": ["import warnings\n", "from sklearn.exceptions import ConvergenceWarning\n", "\n", "pipe = Pipeline(steps=[('trrf', RandomForestRegressorAsTransformer()),\n", "                       (\"lasso\", Lasso())])\n", "\n", "param_grid = {'trrf__n_estimators': [50, 100],\n", "              'lasso__alpha': [0.5, 1.0, 1.5]}\n", "\n", "grid = GridSearchCV(pipe, cv=5, verbose=2, param_grid=param_grid)\n", "\n", "with warnings.catch_warnings(record=False) as w:\n", "    # Ignore convergence warnings because there are many of them.\n", "    warnings.simplefilter(\"ignore\", ConvergenceWarning)\n", "    grid.fit(X_train, y_train)"]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [{"data": {"text/plain": ["{'lasso__alpha': 0.5, 'trrf__n_estimators': 50}"]}, "execution_count": 20, "metadata": {}, "output_type": "execute_result"}], "source": ["grid.best_params_"]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([ 0.01661755, 0.09608553, -0. , 0.04337892, 0.00256722,\n", " 0.04875441, 0.0022436 , 0.00757652, 0. , 0. ,\n", " 0. , 0.04868714, 0.00878259, 0. , 0.01989812,\n", " 0.0123234 , 0. , 0.06432313, 0.00565488, 0. ,\n", " 0.00119269, 0. , 0.00611262, 0. , 0. ,\n", " 0. , 0.01786513, 0. , 0.026654 , 0. ,\n", " 0. , 0. , 0.09583967, 0.00722895, 0.05395944,\n", " 0.063898 , 0.0586511 , 0. , 0. , 0.1290402 ,\n", " 0. , 0. , 0. , 0. , 0. ,\n", " 0.00750308, 0. 
, 0.08633491, 0.03593556, 0.05771344])"]}, "execution_count": 21, "metadata": {}, "output_type": "execute_result"}], "source": ["grid.best_estimator_.steps[1][1].coef_"]}, {"cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.8321710116268228"]}, "execution_count": 22, "metadata": {}, "output_type": "execute_result"}], "source": ["grid.best_score_"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We evaluate on the test set."]}, {"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.8772908724536076"]}, "execution_count": 23, "metadata": {}, "output_type": "execute_result"}], "source": ["grid.score(X_test, y_test)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["And how many nonzero coefficients are there exactly..."]}, {"cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [{"data": {"text/plain": ["((50,), 27)"]}, "execution_count": 24, "metadata": {}, "output_type": "execute_result"}], "source": ["coef = grid.best_estimator_.steps[1][1].coef_\n", "coef.shape, sum(coef != 0)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We cannot check this for the other candidates because the search does not keep the other models. One issue remains, more or less important depending on how the model is used: 23 of the 50 coefficients are zero, which means the model computes the predictions of 23 regression trees that are never used, since the regression assigns them a zero coefficient. The prediction time could therefore easily be cut roughly in half. 
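
The remark about unused trees can be illustrated with a rough sketch. It is not the notebook's method: it refits a small forest and a Lasso directly on the Diabetes data, outside the pipeline, and the helper name reduced_predict as well as the values alpha=0.5 and random_state=0 are assumptions chosen only for illustration.

```python
import numpy
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# Fit a small forest, then a Lasso on the per-tree predictions.
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
tree_preds = numpy.column_stack([est.predict(X) for est in rf.estimators_])
lasso = Lasso(alpha=0.5, max_iter=10000).fit(tree_preds, y)

# Indices of the trees that received a nonzero coefficient.
kept = numpy.flatnonzero(lasso.coef_)


def reduced_predict(X_new):
    # Hypothetical helper: evaluate only the trees the Lasso actually uses.
    preds = numpy.column_stack(
        [rf.estimators_[i].predict(X_new) for i in kept])
    return preds @ lasso.coef_[kept] + lasso.intercept_


full = lasso.predict(tree_preds)   # uses all 50 trees
reduced = reduced_predict(X)       # uses only the kept trees
print(len(kept), len(rf.estimators_), numpy.allclose(full, reduced))
```

Both predictions agree because the dropped trees only ever contributed a term multiplied by zero. Wiring this into the transformer itself would require the transformer to know the Lasso coefficients, which the pipeline does not give it.
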
It is not that hard to do, but that is a story for another time."]}, {"cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5"}}, "nbformat": 4, "nbformat_minor": 2}