.. _td2aecocompetitioncomparerclassifieursrst:

====================================================
2A.ml - 2016 - Compétition - Préparation des données
====================================================

.. only:: html

    **Links:** :download:`notebook `, :downloadlink:`html `, :download:`python `, :downloadlink:`slides `, :githublink:`GitHub|_doc/notebooks/competitions/2016/td2a_eco_competition_comparer_classifieurs.ipynb|*`

A competition was run as part of the *Python pour un Data Scientist* course at the `ENSAE `__. This notebook helps you get started with the data and shows how to compare several classifiers with a ROC curve.

.. code:: ipython3

    from jyquickhelper import add_notebook_menu
    add_notebook_menu()

.. contents::
    :local:

.. code:: ipython3

    from pyensae.datasource import download_data
    download_data("ensae_competition_2016.zip",
                  url="https://github.com/sdpython/ensae_teaching_cs/raw/master/_doc/competitions/2016_ENSAE_2A/")

.. parsed-literal::

    ['ensae_competition_test_X.txt', 'ensae_competition_train.txt']

.. code:: ipython3

    %matplotlib inline

Data
----

.. code:: ipython3

    import pandas as p
    import numpy as np

    df = p.read_csv('./ensae_competition_train.txt', header=[0, 1], sep="\t", index_col=0)

    # gender dummies
    df['X2'] = df['X2'].applymap(str)
    gender_dummies = p.get_dummies(df['X2'])

    # education dummies
    df['X3'] = df['X3'].applymap(str)
    educ_dummies = p.get_dummies(df['X3'])

    # marriage dummies
    df['X4'] = df['X4'].applymap(str)
    mariage_dummies = p.get_dummies(df['X4'])

    # drop the first level of the column MultiIndex
    df.columns = df.columns.droplevel(0)

    # join the three dummy tables to the main one
    data = df.join(gender_dummies).join(educ_dummies).join(mariage_dummies)
    data.rename(columns={'default payment next month': "Y"}, inplace=True)
    data = data.drop(['SEX', 'EDUCATION', 'MARRIAGE'], axis=1)

    # rebalance the classes by downsampling the majority class (Y = 0)
    data_resample = p.concat([data[data['Y'] == 1],
                              data[data['Y'] == 0].sample(len(data[data['Y'] == 1]))])
    data.head(n=2)
.. parsed-literal::

       LIMIT_BAL  AGE  PAY_0  PAY_2  PAY_3  PAY_4  PAY_5  PAY_6  BILL_AMT1  BILL_AMT2  ...  EDUCATION_1  EDUCATION_2  EDUCATION_3  EDUCATION_4  EDUCATION_5  EDUCATION_6  MARRIAGE_0  MARRIAGE_1  MARRIAGE_2  MARRIAGE_3
    0     180000   47      0      0      0      0      0      0     179253      95170  ...            0            1            0            0            0            0           0           1           0           0
    1     110000   35      0      0      0      0      0      0       6137       7040  ...            0            1            0            0            0            0           0           1           0           0

    [2 rows x 34 columns]
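The resampled table ``data_resample`` keeps every default (``Y = 1``) and draws, at random, the same number of non-defaults. Below is a minimal sanity check of the class distributions before and after the downsampling, reusing the ``data`` and ``data_resample`` tables built above (the exact counts depend on the downloaded file):

.. code:: ipython3

    # class distribution in the full training table: Y = 1 (default) is the minority class
    print(data['Y'].value_counts())
    # after downsampling, both classes have exactly the same number of rows
    print(data_resample['Y'].value_counts())
    # note: .sample(...) is random; pass random_state=0 to make the resampling reproducible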
.. code:: ipython3

    Y = data['Y']
    Y = data_resample['Y']
    X = data.drop('Y', axis=1)
    # X = data[["SEX_1", "AGE", "MARRIAGE_0", 'PAY_0']]
    X = data_resample.drop('Y', axis=1)
    X.columns

.. parsed-literal::

    Index(['LIMIT_BAL', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5',
           'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4',
           'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3',
           'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'SEX_1', 'SEX_2', 'EDUCATION_0',
           'EDUCATION_1', 'EDUCATION_2', 'EDUCATION_3', 'EDUCATION_4',
           'EDUCATION_5', 'EDUCATION_6', 'MARRIAGE_0', 'MARRIAGE_1', 'MARRIAGE_2',
           'MARRIAGE_3'],
          dtype='object')

.. code:: ipython3

    from sklearn.model_selection import train_test_split
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33)

Choosing the classifier
-----------------------

.. code:: ipython3

    from sklearn.linear_model import LogisticRegression
    from sklearn import svm
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.linear_model import SGDClassifier, Perceptron

    # type_classifier = LogisticRegression
    # type_classifier = svm.SVC
    type_classifier = GradientBoostingClassifier
    # type_classifier = RandomForestClassifier
    # type_classifier = Perceptron
    clf = type_classifier()
    # clf = SGDClassifier(loss="hinge", penalty="l2")
    clf = clf.fit(X_train, Y_train.ravel())

.. code:: ipython3

    # confusion matrix

.. code:: ipython3

    %matplotlib inline

.. code:: ipython3

    from sklearn.metrics import confusion_matrix

    for x, y in [(X_train, Y_train), (X_test, Y_test)]:
        yp = clf.predict(x)
        cm = confusion_matrix(y.ravel(), yp.ravel())
        print(cm.transpose())

    import matplotlib.pyplot as plt
    import seaborn as sns
    sns.set_style('ticks')
    plt.matshow(cm.transpose())
    plt.title('Confusion matrix sur Test')
    plt.colorbar()
    plt.ylabel('Predicted label')
    plt.xlabel('True label')

.. parsed-literal::

    [[2713 1112]
     [ 641 2207]]
    [[1272  590]
     [ 354 1071]]

.. image:: td2a_eco_competition_comparer_classifieurs_12_2.png

.. code:: ipython3

    (cm.transpose()[0, 0] + cm.transpose()[1, 1]) / (cm[0].sum() + cm[1].sum())

.. parsed-literal::

    0.71280803163979312

Computing the AUC criterion
---------------------------

.. code:: ipython3

    from sklearn.metrics import roc_curve, auc
    probas = clf.predict_proba(X_test)
    probas

.. parsed-literal::

    array([[ 0.14376934,  0.85623066],
           [ 0.78068073,  0.21931927],
           [ 0.76203158,  0.23796842],
           ...,
           [ 0.36898867,  0.63101133],
           [ 0.45767261,  0.54232739],
           [ 0.21353388,  0.78646612]])

.. code:: ipython3

    rep = []
    yt = Y_test.ravel()
    for i in range(probas.shape[0]):
        p0, p1 = probas[i, :]
        exp = yt[i]
        if p0 > p1:
            if exp == 0:
                # correct answer, true positive (tp)
                rep.append((1, p0))
            else:
                # wrong answer, false positive (fp)
                rep.append((0, p0))
        else:
            if exp == 0:
                # wrong answer, false negative (fn)
                rep.append((0, p1))
            else:
                # correct answer, true negative (tn)
                rep.append((1, p1))

    mat_rep = np.array(rep)
    print("AUC : Taux de bonnes réponses", sum(mat_rep[:, 0]) / len(mat_rep[:, 0]))

.. parsed-literal::

    AUC : Taux de bonnes réponses 0.71280803164

The quantity printed above is in fact the proportion of correct answers (the same accuracy as before); the pairs stored in ``mat_rep`` (correct or not, confidence of the prediction) are what the ROC curve below is built from. All the criteria are detailed `there `__. Beware of the orientation of the confusion matrix: it changes from one article to another.
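The same quantities can be obtained directly from scikit-learn. A short sketch, reusing the ``clf``, ``X_test`` and ``Y_test`` objects defined above, computes the accuracy and the AUC without the manual loop:

.. code:: ipython3

    from sklearn.metrics import accuracy_score, roc_auc_score

    # proportion of correct predictions: the same accuracy as computed above
    print("accuracy:", accuracy_score(Y_test, clf.predict(X_test)))
    # AUC of the ROC curve, computed from the probability of class Y = 1
    print("AUC:", roc_auc_score(Y_test, clf.predict_proba(X_test)[:, 1]))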
ROC curve
---------

.. code:: ipython3

    fpr, tpr, thresholds = roc_curve(mat_rep[:, 0], mat_rep[:, 1])
    roc_auc = auc(fpr, tpr)

    plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate (or recall)')
    plt.title('ROC')
    plt.legend(loc="lower right")

.. image:: td2a_eco_competition_comparer_classifieurs_19_1.png

At the top right, TPR and FPR are both equal to 1 (it suffices to always predict the positive class, i.e. no default, Y_hat = 0); at the bottom left, TPR and FPR are both 0, because it suffices to always predict the negative class (the default, Y_hat = 1). Another commonly tracked pair of metrics is Precision versus Recall (the latter being the TPR). The trade-off is much the same, and it should remind you of the one between the type I error rate and the power of a statistical test.

Precision-Recall, F1 score
--------------------------

.. code:: ipython3

    tp = 0
    fp = 0
    fn = 0
    tn = 0
    for i in range(len(probas[:, 0])):
        if probas[i, 0] >= 0.5 and yt[i] == 0:
            tp += 1
        elif probas[i, 0] >= 0.5 and yt[i] == 1:
            fp += 1
        elif probas[i, 0] <= 0.5 and yt[i] == 0:
            fn += 1
        else:
            tn += 1

    print("On retrouve la matrice de confusion :\n",
          "TP : ", tp, "FP : ", fp, "\n",
          " FN : ", fn, "TN : ", tn)
    print("Precision : TP / (TP + FP) = ", tp / (tp + fp))
    print("Recall : TP / (TP + FN) = ", tp / (tp + fn))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print("F1 Score : 2 * P * R / (P + R) = ", 2 * precision * recall / (precision + recall))
    print("False Positive rate : FP / (FP + TN) = ", fp / (fp + tn))

.. parsed-literal::

    On retrouve la matrice de confusion :
     TP :  1272 FP :  590
      FN :  354 TN :  1071
    Precision : TP / (TP + FP) =  0.6831364124597207
    Recall : TP / (TP + FN) =  0.7822878228782287
    F1 Score : 2 * P * R / (P + R) =  0.7293577981651376
    False Positive rate : FP / (FP + TN) =  0.35520770620108366

.. code:: ipython3

    from sklearn.metrics import precision_recall_curve

    precision, recall, _ = precision_recall_curve(Y_test.ravel(), yp.ravel())
    lw = 2
    plt.plot(recall, precision, lw=lw, color='navy', label='Precision-Recall curve')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.ylim([0.0, 1.05])
    plt.xlim([0.0, 1.0])
    plt.title('Precision-Recall')
    plt.legend(loc="lower left")

.. image:: td2a_eco_competition_comparer_classifieurs_23_1.png
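Note that ``precision_recall_curve`` expects a continuous score rather than the hard 0/1 predictions stored in ``yp``, which is why the curve above only contains a few points. A minimal sketch using the probabilities ``probas`` computed earlier, with the default (``Y = 1``) treated as the positive class:

.. code:: ipython3

    from sklearn.metrics import average_precision_score, precision_recall_curve

    # use the probability of class 1 (default) as the score
    scores = probas[:, 1]
    precision, recall, _ = precision_recall_curve(Y_test.ravel(), scores)
    print("average precision:", average_precision_score(Y_test.ravel(), scores))

    plt.plot(recall, precision, color='navy', label='Precision-Recall (probabilities)')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision-Recall')
    plt.legend(loc="lower left")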