.. _winesmultistackingrst:

=======================================
Multi-class classification and stacking
=======================================

.. only:: html

    **Links:** :download:`notebook `, :downloadlink:`html `, :download:`PDF `, :download:`python `, :downloadlink:`slides `, :githublink:`GitHub|_doc/notebooks/lectures/wines_multi_stacking.ipynb|*`

We try to predict the grade of a wine with a multi-class classifier,
then to improve the obtained score with a method called
`stacking `__.

.. code:: ipython3

    from jyquickhelper import add_notebook_menu
    add_notebook_menu()

.. contents::
    :local:

The problem
-----------

There is no guarantee that the scores of the models trained on each
class are comparable. If the model does not perform well enough, one
can add a final model that makes the decision based on the output of
every intermediate model.

.. code:: ipython3

    from pyquickhelper.helpgen import NbImage
    NbImage('images/stackmulti.png', width=400)

.. image:: wines_multi_stacking_3_0.png
    :width: 400px

.. code:: ipython3

    %matplotlib inline

.. code:: ipython3

    from papierstat.datasets import load_wines_dataset
    df = load_wines_dataset()
    X = df.drop(['quality', 'color'], axis=1)
    y = df['quality']

.. code:: ipython3

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y)

.. code:: ipython3

    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    clr = OneVsRestClassifier(LogisticRegression(max_iter=1500))
    clr.fit(X_train, y_train)

.. parsed-literal::

    OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                     dual=False, fit_intercept=True,
                                                     intercept_scaling=1,
                                                     l1_ratio=None, max_iter=1500,
                                                     multi_class='auto',
                                                     n_jobs=None, penalty='l2',
                                                     random_state=None,
                                                     solver='lbfgs', tol=0.0001,
                                                     verbose=0, warm_start=False),
                        n_jobs=None)
.. code:: ipython3

    import numpy
    numpy.mean(clr.predict(X_test).ravel() == y_test.ravel()) * 100

.. parsed-literal::

    53.907692307692315

Let's look at the confusion matrix.

.. code:: ipython3

    from sklearn.metrics import confusion_matrix
    import pandas
    df = pandas.DataFrame(confusion_matrix(y_test, clr.predict(X_test)))
    try:
        df.columns = [str(_) for _ in clr.classes_][:df.shape[1]]
        df.index = [str(_) for _ in clr.classes_][:df.shape[0]]
    except ValueError:
        # A class may be missing from the training set.
        print("erreur", df.shape, clr.classes_)
    df

.. parsed-literal::

       3  4    5    6   7  8  9
    3  0  0    2    1   0  1  0
    4  0  0   39   15   1  0  0
    5  0  0  314  201   2  0  0
    6  0  0  170  537   9  0  0
    7  0  0   17  233  25  0  0
    8  0  0    2   47   6  0  0
    9  0  0    0    2   1  0  0
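Reading the matrix row by row, the recall of each class is the diagonal
term divided by the row sum. A minimal sketch on a small illustrative
matrix (the three rows mimic classes 5, 6 and 7 above, but the numbers
serve only as an example):

```python
import numpy

# Illustrative confusion matrix (rows = true class, columns = predicted class).
cm = numpy.array([[314, 201,   2],
                  [170, 537,   9],
                  [ 17, 233,  25]])

# Recall of each class: diagonal term divided by the row sum.
recall = numpy.diag(cm) / cm.sum(axis=1)
print(recall.round(3))
```

The dominant middle classes absorb most of the predictions, which is
why the overall accuracy hides very poor recall on the rare grades.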
We first fit a random forest on the raw features.

.. code:: ipython3

    from sklearn.ensemble import RandomForestClassifier
    rfc = RandomForestClassifier()
    rfc.fit(X_train, y_train)
    numpy.mean(rfc.predict(X_test).ravel() == y_test.ravel()) * 100

.. parsed-literal::

    67.44615384615385

We then fit a random forest on the outputs of the logistic regressions.

.. code:: ipython3

    rf_train = clr.decision_function(X_train)
    rfc_y = RandomForestClassifier()
    rfc_y.fit(rf_train, y_train)

.. parsed-literal::

    RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                           criterion='gini', max_depth=None, max_features='auto',
                           max_leaf_nodes=None, max_samples=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_jobs=None, oob_score=False, random_state=None,
                           verbose=0, warm_start=False)

We compute the error rate.

.. code:: ipython3

    rf_test = clr.decision_function(X_test)
    numpy.mean(rfc_y.predict(rf_test).ravel() == y_test.ravel()) * 100

.. parsed-literal::

    64.8

This is almost equivalent to a random forest fitted on the raw
features. We draw the ROC curves for class 4. Note that
``clr.classes_`` is ``[3, 4, 5, 6, 7, 8, 9]``, so the scores of class 4
sit in column index 1; the code below uses column index 2, which
actually holds the scores of class 5.

.. code:: ipython3

    from sklearn.metrics import roc_curve, roc_auc_score

    fpr_lr, tpr_lr, th_lr = roc_curve(y_test == 4,
                                      clr.decision_function(X_test)[:, 2])
    fpr_rfc, tpr_rfc, th_rfc = roc_curve(y_test == 4,
                                         rfc.predict_proba(X_test)[:, 2])
    fpr_rfc_y, tpr_rfc_y, th_rfc_y = roc_curve(y_test == 4,
                                               rfc_y.predict_proba(rf_test)[:, 2])
    auc_lr = roc_auc_score(y_test == 4, clr.decision_function(X_test)[:, 2])
    auc_rfc = roc_auc_score(y_test == 4, rfc.predict_proba(X_test)[:, 2])
    auc_rfc_y = roc_auc_score(y_test == 4, rfc_y.predict_proba(rf_test)[:, 2])
    auc_lr, auc_rfc, auc_rfc_y

.. parsed-literal::

    (0.7485466126230457, 0.6752634626519977, 0.6984597568037059)
.. code:: ipython3

    import matplotlib.pyplot as plt
    fig, ax = plt.subplots(1, 1, figsize=(4, 4))
    ax.plot([0, 1], [0, 1], 'k--')
    ax.plot(fpr_lr, tpr_lr, label="OneVsRest + LR")
    ax.plot(fpr_rfc, tpr_rfc, label="RF")
    ax.plot(fpr_rfc_y, tpr_rfc_y, label="OneVsRest + LR + RF")
    ax.set_title('Courbe ROC - comparaison de deux\nmodèles pour la classe 4')
    ax.legend();

.. image:: wines_multi_stacking_19_0.png

The ROC curve shows nothing conclusive. This should be checked with a
cross-validation, which would be convenient to run with a
`pipeline `__, but pipelines only accept a single final predictor.

.. code:: ipython3

    from sklearn.pipeline import make_pipeline
    try:
        pipe = make_pipeline(
            OneVsRestClassifier(LogisticRegression(max_iter=1500)),
            RandomForestClassifier())
    except Exception as e:
        print('ERREUR :')
        print(e)

.. parsed-literal::

    ERREUR :
    All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                     dual=False, fit_intercept=True,
                                                     intercept_scaling=1,
                                                     l1_ratio=None, max_iter=1500,
                                                     multi_class='auto',
                                                     n_jobs=None, penalty='l2',
                                                     random_state=None,
                                                     solver='lbfgs', tol=0.0001,
                                                     verbose=0, warm_start=False),
                        n_jobs=None)' (type ) doesn't

We build a ROC curve over all the classes.
.. code:: ipython3

    fpr_lr, tpr_lr, th_lr = roc_curve(y_test == clr.predict(X_test),
                                      clr.predict_proba(X_test).max(axis=1),
                                      drop_intermediate=False)
    fpr_rfc, tpr_rfc, th_rfc = roc_curve(y_test == rfc.predict(X_test),
                                         rfc.predict_proba(X_test).max(axis=1),
                                         drop_intermediate=False)
    fpr_rfc_y, tpr_rfc_y, th_rfc_y = roc_curve(y_test == rfc_y.predict(rf_test),
                                               rfc_y.predict_proba(rf_test).max(axis=1),
                                               drop_intermediate=False)
    auc_lr = roc_auc_score(y_test == clr.predict(X_test),
                           clr.decision_function(X_test).max(axis=1))
    auc_rfc = roc_auc_score(y_test == rfc.predict(X_test),
                            rfc.predict_proba(X_test).max(axis=1))
    auc_rfc_y = roc_auc_score(y_test == rfc_y.predict(rf_test),
                              rfc_y.predict_proba(rf_test).max(axis=1))
    auc_lr, auc_rfc, auc_rfc_y

.. parsed-literal::

    (0.5510406569489914, 0.753562533633215, 0.7354245943989535)

.. code:: ipython3

    import matplotlib.pyplot as plt
    fig, ax = plt.subplots(1, 1, figsize=(4, 4))
    ax.plot([0, 1], [0, 1], 'k--')
    ax.plot(fpr_lr, tpr_lr, label="OneVsRest + LR")
    ax.plot(fpr_rfc, tpr_rfc, label="RF")
    ax.plot(fpr_rfc_y, tpr_rfc_y, label="OneVsRest + LR + RF")
    ax.set_title('Courbe ROC - comparaison de deux\nmodèles pour toutes les classes')
    ax.legend();

.. image:: wines_multi_stacking_24_0.png

On this model, the score produced by the final classifier looks more
relevant than the score obtained by taking the maximum score over all
the classes. We try one last approach where the final model validates
or rejects the answer: it is a binary classifier, so every estimated
classifier is binary.

.. code:: ipython3

    rf_train_bin = clr.decision_function(X_train)
    y_train_bin = clr.predict(X_train) == y_train
    rfc = RandomForestClassifier()
    rfc.fit(rf_train_bin, y_train_bin)
.. parsed-literal::

    RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                           criterion='gini', max_depth=None, max_features='auto',
                           max_leaf_nodes=None, max_samples=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_jobs=None, oob_score=False, random_state=None,
                           verbose=0, warm_start=False)

We look at the first answers.

.. code:: ipython3

    rf_test_bin = clr.decision_function(X_test)
    rfc.predict_proba(rf_test_bin)[:3]

.. parsed-literal::

    array([[0.32, 0.68],
           [0.5 , 0.5 ],
           [0.89, 0.11]])

.. code:: ipython3

    y_test_bin = clr.predict(X_test) == y_test

.. code:: ipython3

    fpr_rfc_bin, tpr_rfc_bin, th_rfc_bin = roc_curve(
        y_test_bin, rfc.predict_proba(rf_test_bin)[:, 1])
    auc_rfc_bin = roc_auc_score(y_test_bin, rfc.predict_proba(rf_test_bin)[:, 1])
    auc_lr, auc_rfc_bin

.. parsed-literal::

    (0.5510406569489914, 0.7668718108162481)

.. code:: ipython3

    import matplotlib.pyplot as plt
    fig, ax = plt.subplots(1, 1, figsize=(4, 4))
    ax.plot([0, 1], [0, 1], 'k--')
    ax.plot(fpr_lr, tpr_lr, label="OneVsRest + LR")
    ax.plot(fpr_rfc, tpr_rfc, label="OneVsRest + LR + RF")
    ax.plot(fpr_rfc_bin, tpr_rfc_bin, label="OneVsRest + LR + RF binaire")
    ax.set_title('Courbe ROC - comparaison de deux\nmodèles pour toutes les classes')
    ax.legend();

.. image:: wines_multi_stacking_31_0.png

Slightly better, but this should still be validated with a
cross-validation and on several datasets, including artificial ones.
The idea remains nonetheless.

Automating it with an implementation
------------------------------------

Since doing all of this by hand is tedious, we implement a class that
turns a machine-learning model into a *transform* that can be inserted
into a pipeline.
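Before switching to the library, the idea can be sketched by hand: a
small transformer, here a hypothetical ``ClassifierAsTransformer``
class that is not part of scikit-learn, exposes a classifier's
``decision_function`` as ``transform`` so the classifier can sit in
the middle of a pipeline. The sketch below runs on the iris data
rather than the wines:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline


class ClassifierAsTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: exposes a classifier's decision_function
    as transform so the classifier can be an intermediate pipeline step."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator.fit(X, y)
        return self

    def transform(self, X):
        # One score column per class feeds the next estimator.
        return self.estimator.decision_function(X)


X, y = load_iris(return_X_y=True)
pipe = make_pipeline(
    ClassifierAsTransformer(OneVsRestClassifier(LogisticRegression(max_iter=1500))),
    RandomForestClassifier())
pipe.fit(X, y)
print(pipe.score(X, y))
```

This is only a sketch of the mechanism; the ``mlinsights``
implementation used below handles several models and several
score methods at once.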
.. code:: ipython3

    from mlinsights.sklapi import SkBaseTransformStacking
    model = make_pipeline(
        SkBaseTransformStacking(
            [OneVsRestClassifier(LogisticRegression(max_iter=1500))],
            'decision_function'),
        RandomForestClassifier())
    model.fit(X_train, y_train)

.. parsed-literal::

    Pipeline(memory=None,
             steps=[('skbasetransformstacking',
                     SkBaseTransformStacking([OneVsRestClassifier(estimator=LogisticRegression(C=1.0,
                         class_weight=None, dual=False, fit_intercept=True,
                         intercept_scaling=1, l1_ratio=None, max_iter=1500,
                         multi_class='auto', n_jobs=None, penalty='l2',
                         random_state=None, solver='lbfgs', tol=0.0001,
                         verbose=0, warm_start=False), n_j...
                    RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                           class_weight=None, criterion='gini',
                                           max_depth=None, max_features='auto',
                                           max_leaf_nodes=None, max_samples=None,
                                           min_impurity_decrease=0.0,
                                           min_impurity_split=None,
                                           min_samples_leaf=1, min_samples_split=2,
                                           min_weight_fraction_leaf=0.0,
                                           n_estimators=100, n_jobs=None,
                                           oob_score=False, random_state=None,
                                           verbose=0, warm_start=False))],
             verbose=False)

.. code:: ipython3

    fpr_pipe, tpr_pipe, th_pipe = roc_curve(y_test == model.predict(X_test),
                                            model.predict_proba(X_test).max(axis=1),
                                            drop_intermediate=False)
    auc_pipe = roc_auc_score(y_test == model.predict(X_test),
                             model.predict_proba(X_test).max(axis=1))
    auc_pipe

.. parsed-literal::

    0.7330188804547779

We do not forget to shuffle the data before running the cross-validation.

.. code:: ipython3

    df = load_wines_dataset(shuffle=True)
    X = df.drop(['quality', 'color'], axis=1)
    y = df['quality']

We get the same results, but we can now run a cross-validation.

.. code:: ipython3

    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import cross_validate
    cross_val_score(model, X, y, cv=5)
.. parsed-literal::

    array([0.66923077, 0.66153846, 0.67513472, 0.66897614, 0.65588915])

scikit-learn 0.22 - stacking
----------------------------

Starting with version 0.22, *scikit-learn* introduced the
`StackingClassifier `__ model, with a slightly different design.

.. code:: ipython3

    try:
        from sklearn.ensemble import StackingClassifier
        skl = True
    except ImportError:
        # scikit-learn is too old
        skl = False

    if skl:
        model = StackingClassifier([
            ('ovrlr', OneVsRestClassifier(LogisticRegression(max_iter=1500))),
            ('rf', RandomForestClassifier())
        ])
        model.fit(X_train, y_train)
    else:
        model = None
    model

.. parsed-literal::

    C:\xavierdupre\__home_\github_fork\scikit-learn\sklearn\model_selection\_split.py:667: UserWarning: The least populated class in y has only 2 members, which is less than n_splits=5.
      % (min_groups, self.n_splits)), UserWarning)
    C:\xavierdupre\__home_\github_fork\scikit-learn\sklearn\model_selection\_split.py:667: UserWarning: The least populated class in y has only 2 members, which is less than n_splits=5.
      % (min_groups, self.n_splits)), UserWarning)
    C:\xavierdupre\__home_\github_fork\scikit-learn\sklearn\linear_model\_logistic.py:939: ConvergenceWarning: lbfgs failed to converge (status=1):
    STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
    Increase the number of iterations (max_iter) or scale the data as shown in:
        https://scikit-learn.org/stable/modules/preprocessing.html.
    Please also refer to the documentation for alternative solver options:
        https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
      extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)

.. parsed-literal::

    StackingClassifier(cv=None,
                       estimators=[('ovrlr',
                                    OneVsRestClassifier(estimator=LogisticRegression(C=1.0,
                                        class_weight=None, dual=False,
                                        fit_intercept=True, intercept_scaling=1,
                                        l1_ratio=None, max_iter=1500,
                                        multi_class='auto', n_jobs=None,
                                        penalty='l2', random_state=None,
                                        solver='lbfgs', tol=0.0001, verbose=0,
                                        warm_start=False), n_jobs=None)),
                                   ('rf',
                                    RandomForestCla...
                                        max_depth=None, max_features='auto',
                                        max_leaf_nodes=None, max_samples=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        n_estimators=100, n_jobs=None,
                                        oob_score=False, random_state=None,
                                        verbose=0, warm_start=False))],
                       final_estimator=None, n_jobs=None, passthrough=False,
                       stack_method='auto', verbose=0)

.. code:: ipython3

    if model is not None:
        fpr_pipe, tpr_pipe, th_pipe = roc_curve(y_test == model.predict(X_test),
                                                model.predict_proba(X_test).max(axis=1),
                                                drop_intermediate=False)
        auc_pipe = roc_auc_score(y_test == model.predict(X_test),
                                 model.predict_proba(X_test).max(axis=1))
    else:
        auc_pipe = None
    auc_pipe

.. parsed-literal::

    0.7301920159931777
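For reference, a self-contained sketch of the new design on the iris
data; the parameter values below (``final_estimator``,
``stack_method``, ``cv``) are illustrative choices, not the settings
used above:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

model = StackingClassifier(
    estimators=[
        ('ovrlr', OneVsRestClassifier(LogisticRegression(max_iter=1500))),
        ('rf', RandomForestClassifier()),
    ],
    # Final model fed with the stacked scores; None would also default
    # to a LogisticRegression.
    final_estimator=LogisticRegression(max_iter=1500),
    # 'auto' picks predict_proba or decision_function for each estimator.
    stack_method='auto',
    # Out-of-fold predictions train the final estimator.
    cv=5)
model.fit(X, y)
print(model.predict_proba(X[:2]).shape)
```

Unlike the manual stacking above, ``StackingClassifier`` trains the
final estimator on out-of-fold predictions, which limits the leakage
of the training data into the final model.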