.. _rocexamplerst:

===
ROC
===

A few graphs about ROC curves on the Iris dataset.

.. code:: ipython3

    %matplotlib inline
    import matplotlib.pyplot as plt
    plt.style.use('ggplot')

.. code:: ipython3

    from jyquickhelper import add_notebook_menu
    add_notebook_menu()

.. contents::
    :local:

Iris dataset
------------

.. code:: ipython3

    from sklearn import datasets

    iris = datasets.load_iris()
    X = iris.data[:, :2]  # keep only the first two features
    y = iris.target

.. code:: ipython3

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

.. code:: ipython3

    from sklearn.linear_model import LogisticRegression

    clf = LogisticRegression()
    clf.fit(X_train, y_train)

.. parsed-literal::

    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
              intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
              penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
              verbose=0, warm_start=False)

.. code:: ipython3

    import numpy

    ypred = clf.predict(X_test)
    yprob = clf.predict_proba(X_test)
    # score of the predicted class for each test example
    score = numpy.array(list(yprob[i, ypred[i]] for i in range(len(ypred))))

.. code:: ipython3

    # first column: score, second column: 1 if the prediction is correct
    data = numpy.zeros((len(ypred), 2))
    data[:, 0] = score.ravel()
    data[ypred == y_test, 1] = 1
    data[:5]

.. parsed-literal::

    array([[ 0.70495209,  1.        ],
           [ 0.56148737,  0.        ],
           [ 0.56148737,  1.        ],
           [ 0.77416227,  1.        ],
           [ 0.58631799,  0.        ]])

ROC with scikit-learn
---------------------

We use the following example:
`Receiver Operating Characteristic (ROC)
<https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html>`__.

.. code:: ipython3

    from sklearn.metrics import roc_curve

    fpr, tpr, th = roc_curve(y_test == ypred, score)

.. code:: ipython3

    import matplotlib.pyplot as plt

    plt.plot(fpr, tpr, label='ROC curve')
    plt.plot([0, 1], [0, 1], linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend(loc="lower right")

.. image:: roc_example_11_1.png

.. code:: ipython3

    import pandas

    df = pandas.DataFrame(dict(fpr=fpr, tpr=tpr, threshold=th))
    df

.. parsed-literal::

             fpr  threshold       tpr
    0   0.000000   0.910712  0.032258
    1   0.000000   0.869794  0.096774
    2   0.000000   0.863174  0.161290
    3   0.000000   0.805864  0.258065
    4   0.000000   0.790909  0.387097
    5   0.000000   0.650510  0.612903
    6   0.052632   0.634499  0.612903
    7   0.052632   0.620319  0.709677
    8   0.105263   0.615015  0.709677
    9   0.210526   0.607975  0.741935
    10  0.210526   0.604496  0.774194
    11  0.263158   0.586318  0.774194
    12  0.263158   0.584172  0.806452
    13  0.315789   0.561487  0.838710
    14  0.421053   0.556499  0.838710
    15  0.578947   0.525449  0.838710
    16  0.578947   0.522579  0.870968
    17  0.631579   0.522551  0.870968
    18  0.684211   0.520835  0.903226
    19  0.736842   0.516687  0.903226
    20  0.736842   0.453826  0.967742
    21  0.842105   0.417941  0.967742
    22  0.842105   0.412703  1.000000
    23  1.000000   0.375573  1.000000
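
As a sanity check before switching to the implementation this module
provides, the area under this curve can also be computed with
scikit-learn. This is a minimal sketch reusing the ``fpr``, ``tpr`` and
``score`` arrays defined above.

.. code:: ipython3

    from sklearn.metrics import auc, roc_auc_score

    # area under the curve, computed from the (fpr, tpr) points above
    print(auc(fpr, tpr))
    # the same quantity, computed directly from the labels and the scores
    print(roc_auc_score(y_test == ypred, score))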

ROC - TPR / FPR
---------------

We do the same with the ``ROC`` class this module provides
(``mlstatpy.ml.roc.ROC``).

- TPR = True Positive Rate
- FPR = False Positive Rate

You can read the TPR as the distribution function of the score of a
positive example, and the FPR as the same function for a negative
example.

.. code:: ipython3

    from mlstatpy.ml.roc import ROC

.. code:: ipython3

    roc = ROC(df=data)

.. code:: ipython3

    roc

.. parsed-literal::

    Overall precision: 0.63 - AUC=0.850594
    --------------
          score  label  weight
    0  0.375573    0.0     1.0
    1  0.385480    0.0     1.0
    2  0.412314    0.0     1.0
    3  0.412703    1.0     1.0
    4  0.417941    0.0     1.0
    --------------
           score  label  weight
    45  0.863174    1.0     1.0
    46  0.863174    1.0     1.0
    47  0.869794    1.0     1.0
    48  0.903335    1.0     1.0
    49  0.910712    1.0     1.0
    --------------
        False Positive Rate  True Positive Rate  threshold
    0              0.000000            0.032258   0.910712
    1              0.000000            0.193548   0.828617
    2              0.000000            0.354839   0.790909
    3              0.000000            0.516129   0.737000
    4              0.052632            0.645161   0.627589
    5              0.157895            0.741935   0.607975
    6              0.263158            0.838710   0.561487
    7              0.526316            0.838710   0.542211
    8              0.684211            0.903226   0.520835
    9              0.842105            0.967742   0.417941
    10             1.000000            1.000000   0.375573
    --------------
           error  recall  threshold
    0   0.000000    0.02   0.910712
    1   0.000000    0.12   0.828617
    2   0.000000    0.22   0.790909
    3   0.000000    0.32   0.737000
    4   0.047619    0.42   0.627589
    5   0.115385    0.52   0.607975
    6   0.161290    0.62   0.561487
    7   0.277778    0.72   0.542211
    8   0.317073    0.82   0.520835
    9   0.347826    0.92   0.417941
    10  0.380000    1.00   0.375573

.. code:: ipython3

    roc.auc()

.. parsed-literal::

    0.85059422750424452

.. code:: ipython3

    roc.plot(nb=10)

.. image:: roc_example_18_1.png

This function draws the curve with only 10 points, but we can ask for
more.

.. code:: ipython3

    roc.plot(nb=100)

.. image:: roc_example_20_1.png

We can also ask to draw bootstrapped curves to get a sense of the
confidence interval.

.. code:: ipython3

    roc.plot(nb=10, bootstrap=10)

.. image:: roc_example_22_1.png

ROC - score distribution
------------------------

This is another representation of the metrics FPR and TPR.
:math:`P(x > s)` is the probability that the score :math:`x` of a
negative example is higher than :math:`s`. We assume in this case that
the higher the score, the better.

.. code:: ipython3

    roc.plot(curve=ROC.CurveType.PROBSCORE, thresholds=True)

.. image:: roc_example_24_1.png

When the curves intersect at a score :math:`s^*`, the error rates on
positive and negative examples are equal. If we show the confusion
matrix for this particular score :math:`s^*`, it gives:

.. code:: ipython3

    conf = roc.confusion()
    # P(+<s): proportion of positive examples whose score falls below the threshold
    conf["P(+<s)"] = conf["False Negative"] / (conf["True Positive"] + conf["False Negative"])
    # P(->s): proportion of negative examples whose score falls above the threshold
    conf["P(->s)"] = 1 - conf["True Negative"] / conf.loc[0, "True Negative"]
    conf

.. parsed-literal::

        True Positive  False Positive  False Negative  True Negative  threshold    P(+<s)    P(->s)
    0             0.0             0.0            31.0           19.0   0.910712  1.000000  0.000000
    1             1.0             0.0            30.0           19.0   0.910712  0.967742  0.000000
    2             6.0             0.0            25.0           19.0   0.828617  0.806452  0.000000
    3            11.0             0.0            20.0           19.0   0.790909  0.645161  0.000000
    4            16.0             0.0            15.0           19.0   0.737000  0.483871  0.000000
    5            20.0             1.0            11.0           18.0   0.627589  0.354839  0.052632
    6            23.0             3.0             8.0           16.0   0.607975  0.258065  0.157895
    7            26.0             5.0             5.0           14.0   0.561487  0.161290  0.263158
    8            26.0            10.0             5.0            9.0   0.542211  0.161290  0.526316
    9            28.0            13.0             3.0            6.0   0.520835  0.096774  0.684211
    10           30.0            16.0             1.0            3.0   0.417941  0.032258  0.842105
    11           31.0            19.0             0.0            0.0   0.375573  0.000000  1.000000
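
The intersection score :math:`s^*` can be located numerically by
looking for the row of this table where the two error rates are
closest. This is a minimal sketch based on the ``conf`` dataframe
computed above; ``closest`` is just an illustrative variable name.

.. code:: ipython3

    # row where P(+<s) and P(->s) are the closest, i.e. where the
    # error rates on positive and negative examples are nearly equal
    closest = (conf["P(+<s)"] - conf["P(->s)"]).abs().idxmin()
    conf.loc[closest, "threshold"]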

ROC - recall / precision
------------------------

In this representation, we also show the score: the left curve displays
the thresholds along the curve, the right one does not.

.. code:: ipython3

    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(ncols=2, nrows=1, figsize=(14, 4))
    roc.plot(curve=ROC.CurveType.RECPREC, thresholds=True, ax=axes[0])
    roc.plot(curve=ROC.CurveType.RECPREC, ax=axes[1])

.. image:: roc_example_28_1.png
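
scikit-learn draws a similar curve with ``precision_recall_curve``; the
following sketch, which reuses the arrays defined in the first
sections, is one way to compare both outputs. Here again the positive
class is "the prediction is correct", as in the rest of this notebook.

.. code:: ipython3

    from sklearn.metrics import precision_recall_curve

    # precision and recall as functions of the decision threshold
    precision, recall, thresholds = precision_recall_curve(y_test == ypred, score)

    plt.plot(recall, precision, label='precision/recall curve')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.legend(loc="lower left")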