{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# A short introduction to machine learning\n", "\n", "The [Wine Quality Data Set](https://archive.ics.uci.edu/ml/datasets/Wine+Quality) lists the chemical components of wines along with the score given by experts. Can that score be predicted from the chemical components? If we manage to build a function that predicts it, we may be able to understand how the expert rates the wines."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": ["%matplotlib inline"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Data and a first linear regression\n", "\n", "We can use the function implemented in this module."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/html": ["
"]}, "metadata": {"needs_background": "light"}, "output_type": "display_data"}], "source": ["df.plot(x=\"minl\", y=[\"r2_train_dt\", \"r2_test_dt\",\n", "                     \"r2_train_reg\", \"r2_test_reg\"]);"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Performance on the test set rises quickly, then levels off without ever catching up with performance on the training set. It does not exceed that of a linear model, which is disappointing. Let's try a random forest."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Random forest"]}, {"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 25/25 [00:20<00:00, 1.54it/s]\n"]}, {"data": {"text/html": ["
<table>\n", "  <thead>\n", "    <tr><th></th><th>minl</th><th>r2_train_dt</th><th>r2_test_dt</th><th>r2_train_reg</th><th>r2_test_reg</th><th>r2_train_rf</th><th>r2_test_rf</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>1</td><td>1.000000</td><td>0.030211</td><td>0.30522</td><td>0.265853</td><td>0.920318</td><td>0.472317</td></tr>\n", "    <tr><th>1</th><td>3</td><td>0.864086</td><td>0.130133</td><td>0.30522</td><td>0.265853</td><td>0.836299</td><td>0.455444</td></tr>\n", "  </tbody>\n", "</table>
\n", ""], "text/plain": [" minl r2_train_dt r2_test_dt r2_train_reg r2_test_reg r2_train_rf \\\n", "0 1 1.000000 0.030211 0.30522 0.265853 0.920318 \n", "1 3 0.864086 0.130133 0.30522 0.265853 0.836299 \n", "\n", " r2_test_rf \n", "0 0.472317 \n", "1 0.455444 "]}, "execution_count": 23, "metadata": {}, "output_type": "execute_result"}], "source": ["import pandas\n", "from sklearn.ensemble import RandomForestRegressor\n", "from tqdm import tqdm\n", "res = []\n", "for i in tqdm(range(1, 50, 2)):\n", "    dt = DecisionTreeRegressor(min_samples_leaf=i)\n", "    reg = LinearRegression()\n", "    rf = RandomForestRegressor(n_estimators=25, min_samples_leaf=i)\n", "    dt.fit(X_train, y_train)\n", "    reg.fit(X_train, y_train)\n", "    rf.fit(X_train, y_train)\n", "    r = {\n", "        'minl': i,\n", "        'r2_train_dt': r2_score(y_train, dt.predict(X_train)),\n", "        'r2_test_dt': r2_score(y_test, dt.predict(X_test)),\n", "        'r2_train_reg': r2_score(y_train, reg.predict(X_train)),\n", "        'r2_test_reg': r2_score(y_test, reg.predict(X_test)),\n", "        'r2_train_rf': r2_score(y_train, rf.predict(X_train)),\n", "        'r2_test_rf': r2_score(y_test, rf.predict(X_test)),\n", "    }\n", "    res.append(r)\n", "df = pandas.DataFrame(res)\n", "df.head(2)"]}, {"cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": [""]}, "metadata": {"needs_background": "light"}, "output_type": "display_data"}], "source": ["df.plot(x=\"minl\", y=[\"r2_train_dt\", \"r2_test_dt\",\n", "                     \"r2_train_reg\", \"r2_test_reg\",\n", "                     \"r2_train_rf\", \"r2_test_rf\"]);"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Unlike the regression tree, the random forest performs better when this parameter (`min_samples_leaf`) is small. A forest is an average of models, each trained on a subsample of the original dataset. Even if one tree memorizes its subsample, its neighbor is unlikely to have been trained on the same one. Averaging them yields a compromise."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Cross-validation\n", "\n", "It remains to check that the model is robust. That is the purpose of cross-validation, which splits the dataset into 5 parts, trains on 4 of them, tests on the remaining one, and repeats the process 5 times, changing the part used for testing each time."]}, {"cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 5.4s finished\n"]}, {"data": {"text/plain": ["array([0.05037733, 0.24594631, 0.25811598, 0.348578 , 0.2462281 ])"]}, "execution_count": 25, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.model_selection import cross_val_score\n", "cross_val_score(\n", "    RandomForestRegressor(n_estimators=25), X, y, cv=5,\n", "    verbose=1)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["This result should make you stop, because the scores are far from stable. There are two options: either the model is not robust, or the methodology is wrong somewhere. Since the problem is fairly simple, the second option is the likely one: the dataset is sorted, red wines first, white wines next. The cross-validation may therefore fit a model on red wines and apply it to white wines, which evidently does not work. It also means that white and red wines are very different, and that color is probably redundant with the other features. Let's shuffle the data at random."]}, {"cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": ["from sklearn.utils import shuffle\n", "X2, y2 = shuffle(X, y)"]}, {"cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 5.6s finished\n"]}, {"data": {"text/plain": ["array([0.47975777, 0.50951094, 0.49514404, 0.51110336, 0.51584857])"]}, "execution_count": 27, "metadata": {}, "output_type": "execute_result"}], "source": ["cross_val_score(\n", "    RandomForestRegressor(n_estimators=25), X2, y2, cv=5,\n", "    verbose=1)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Much better. The same thing can also be done this way."]}, {"cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 6.6s finished\n"]}, {"data": {"text/plain": ["array([0.53754932, 0.54227221, 0.5442236 , 0.57726314, 0.53393994])"]}, "execution_count": 28, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.model_selection import ShuffleSplit\n", "cross_val_score(\n", "    RandomForestRegressor(n_estimators=25), X, y, cv=ShuffleSplit(5),\n", "    verbose=1)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Pipeline\n", "\n", "A model can be fitted after a PCA, but every intermediate step must then be remembered and replayed before predicting with the final model."]}, {"cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [{"data": {"text/plain": ["PCA(copy=True, iterated_power='auto', n_components=6, random_state=None,\n", " svd_solver='auto', tol=0.0, whiten=False)"]}, "execution_count": 29, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.decomposition import PCA\n", "pca = PCA(6)\n", "pca.fit(X_train, y_train)"]}, {"cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [{"data": {"text/plain": ["RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',\n", " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", " max_samples=None, min_impurity_decrease=0.0,\n", " min_impurity_split=None, min_samples_leaf=1,\n", " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", " n_estimators=100, n_jobs=None, oob_score=False,\n", " random_state=None, verbose=0, warm_start=False)"]}, "execution_count": 30, "metadata": {}, "output_type": "execute_result"}], "source": ["rf = RandomForestRegressor(n_estimators=100)\n", "X_train_pca = pca.transform(X_train)\n", "rf.fit(X_train_pca, y_train)"]}, {"cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.421429956568139"]}, "execution_count": 31, "metadata": {}, "output_type": "execute_result"}], "source": ["X_test_pca = pca.transform(X_test)\n", "pred = rf.predict(X_test_pca)\n", "r2_score(y_test, pred)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Alternatively, a *pipeline* assembles the preprocessing steps and the predictive model into a single processing sequence, which then becomes the one model to fit and to predict with."]}, {"cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": ["from sklearn.pipeline import Pipeline\n", "pipe = Pipeline([\n", "    ('acp', PCA(n_components=6)),\n", "    ('rf', RandomForestRegressor(n_estimators=100))\n", "])\n", "pipe.fit(X_train, y_train);"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Grid search\n", "\n", "With a pipeline, searching for the model's best hyperparameters becomes straightforward."]}, {"cell_type": "code", "execution_count": 32, "metadata": 
{}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Fitting 3 folds for each of 12 candidates, totalling 36 fits\n"]}, {"name": "stderr", "output_type": "stream", "text": ["[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 36 out of 36 | elapsed: 44.4s finished\n"]}, {"data": {"text/plain": ["GridSearchCV(cv=ShuffleSplit(n_splits=3, random_state=None, test_size=None, train_size=None),\n", " error_score=nan,\n", " estimator=Pipeline(memory=None,\n", " steps=[('acp',\n", " PCA(copy=True, iterated_power='auto',\n", " n_components=6, random_state=None,\n", " svd_solver='auto', tol=0.0,\n", " whiten=False)),\n", " ('rf',\n", " RandomForestRegressor(bootstrap=True,\n", " ccp_alpha=0.0,\n", " criterion='mse',\n", " max_depth=None,\n", " m...\n", " min_samples_leaf=1,\n", " min_samples_split=2,\n", " min_weight_fraction_leaf=0.0,\n", " n_estimators=100,\n", " n_jobs=None,\n", " oob_score=False,\n", " random_state=None,\n", " verbose=0,\n", " warm_start=False))],\n", " verbose=False),\n", " iid='deprecated', n_jobs=None,\n", " param_grid={'acp__n_components': [1, 4, 7, 10],\n", " 'rf__n_estimators': [10, 20, 50]},\n", " pre_dispatch='2*n_jobs', refit=True, return_train_score=False,\n", " scoring=None, verbose=1)"]}, "execution_count": 33, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.model_selection import GridSearchCV\n", "param_grid = {'acp__n_components': list(range(1, 11, 3)),\n", "              'rf__n_estimators': [10, 20, 50]}\n", "grid = GridSearchCV(pipe, param_grid=param_grid, verbose=1,\n", "                    cv=ShuffleSplit(3))\n", "grid.fit(X, y)"]}, {"cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [{"data": {"text/plain": ["{'acp__n_components': 10, 'rf__n_estimators': 50}"]}, "execution_count": 34, "metadata": {}, "output_type": "execute_result"}], "source": ["grid.best_params_"]}, {"cell_type": "code", "execution_count": 34, "metadata": {"scrolled": false}, "outputs": [{"data": {"text/plain": ["array([7.1 , 5.06, 7. , ..., 6.74, 4.12, 5.98])"]}, "execution_count": 35, "metadata": {}, "output_type": "execute_result"}], "source": ["grid.predict(X_test)"]}, {"cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.9275290318700775"]}, "execution_count": 36, "metadata": {}, "output_type": "execute_result"}], "source": ["r2_score(y_test, grid.predict(X_test))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["This number looks far too good to be true. It most likely means that the test data were used during the search: the grid was fitted on the whole dataset `X, y`, which contains the rows of `X_test`."]}, {"cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.49487646056265816"]}, "execution_count": 37, "metadata": {}, "output_type": "execute_result"}], "source": ["grid.best_score_"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Much more plausible: `best_score_` is the mean cross-validated score of the best candidate."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Saving and restoring\n", "\n", "The simplest way to keep models in Python is to serialize them: the in-memory object is copied to disk and restored later."]}, {"cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": ["import pickle\n", "\n", "with open('piperf.pickle', 'wb') as f:\n", "    pickle.dump(grid, f)"]}, {"cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [{"data": {"text/plain": ["['piperf.pickle']"]}, "execution_count": 39, "metadata": {}, "output_type": "execute_result"}], "source": ["import glob\n", "glob.glob('*.pickle')"]}, {"cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": ["with open(\"piperf.pickle\", 'rb') as f:\n", "    grid2 = pickle.load(f)"]}, {"cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([7.1 , 5.06, 7. 
, ..., 6.74, 4.12, 5.98])"]}, "execution_count": 41, "metadata": {}, "output_type": "execute_result"}], "source": ["grid2.predict(X_test)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Predicting the color\n", "\n", "The failure of the first cross-validation was a sign that the color is easy to predict. Let's check."]}, {"cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": ["Xc = df_data.drop(['quality', 'color'], axis=1)\n", "yc = df_data[\"color\"]"]}, {"cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": ["Xc_train, Xc_test, yc_train, yc_test = train_test_split(Xc, yc)"]}, {"cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": ["from sklearn.linear_model import LogisticRegression\n", "log = LogisticRegression(solver='lbfgs', max_iter=1500)\n", "log.fit(Xc_train, yc_train);"]}, {"cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.04459922717947637"]}, "execution_count": 45, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.metrics import log_loss\n", "log_loss(yc_test, log.predict_proba(Xc_test))"]}, {"cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[ 391, 14],\n", " [ 9, 1211]], dtype=int64)"]}, "execution_count": 46, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.metrics import confusion_matrix\n", "confusion_matrix(yc_test, log.predict(Xc_test))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The confusion matrix speaks for itself: only 23 of the 1625 test wines are misclassified."]}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, 
"language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2"}}, "nbformat": 4, "nbformat_minor": 2}