{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 2A.data - Classification, regression, anomalies - problem statement\n", "\n", "The [Wine Quality Data Set](https://archive.ics.uci.edu/ml/datasets/Wine+Quality) contains about 5000 wines described by their chemical characteristics and rated by an expert. Can a machine learning model come close to the expert's ratings?"]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": ["%matplotlib inline\n", "import matplotlib.pyplot as plt"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## The data\n", "\n", "The data can be downloaded from [github...data_2a](https://github.com/sdpython/ensae_teaching_cs/tree/master/src/ensae_teaching_cs/data/data_1a)."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/html": ["
<table>\n", "  <thead>\n",
"    <tr><th></th><th>fixed_acidity</th><th>volatile_acidity</th><th>citric_acid</th><th>residual_sugar</th><th>chlorides</th><th>free_sulfur_dioxide</th><th>total_sulfur_dioxide</th><th>density</th><th>pH</th><th>sulphates</th><th>alcohol</th><th>quality</th><th>color</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>7.4</td><td>0.70</td><td>0.00</td><td>1.9</td><td>0.076</td><td>11.0</td><td>34.0</td><td>0.9978</td><td>3.51</td><td>0.56</td><td>9.4</td><td>5</td><td>red</td></tr>\n",
"    <tr><th>1</th><td>7.8</td><td>0.88</td><td>0.00</td><td>2.6</td><td>0.098</td><td>25.0</td><td>67.0</td><td>0.9968</td><td>3.20</td><td>0.68</td><td>9.8</td><td>5</td><td>red</td></tr>\n",
"    <tr><th>2</th><td>7.8</td><td>0.76</td><td>0.04</td><td>2.3</td><td>0.092</td><td>15.0</td><td>54.0</td><td>0.9970</td><td>3.26</td><td>0.65</td><td>9.8</td><td>5</td><td>red</td></tr>\n",
"    <tr><th>3</th><td>11.2</td><td>0.28</td><td>0.56</td><td>1.9</td><td>0.075</td><td>17.0</td><td>60.0</td><td>0.9980</td><td>3.16</td><td>0.58</td><td>9.8</td><td>6</td><td>red</td></tr>\n",
"    <tr><th>4</th><td>7.4</td><td>0.70</td><td>0.00</td><td>1.9</td><td>0.076</td><td>11.0</td><td>34.0</td><td>0.9978</td><td>3.51</td><td>0.56</td><td>9.4</td><td>5</td><td>red</td></tr>\n",
"  </tbody>\n", "</table>\n", "
"], "text/plain": [" fixed_acidity volatile_acidity citric_acid residual_sugar chlorides \\\n", "0 7.4 0.70 0.00 1.9 0.076 \n", "1 7.8 0.88 0.00 2.6 0.098 \n", "2 7.8 0.76 0.04 2.3 0.092 \n", "3 11.2 0.28 0.56 1.9 0.075 \n", "4 7.4 0.70 0.00 1.9 0.076 \n", "\n", " free_sulfur_dioxide total_sulfur_dioxide density pH sulphates \\\n", "0 11.0 34.0 0.9978 3.51 0.56 \n", "1 25.0 67.0 0.9968 3.20 0.68 \n", "2 15.0 54.0 0.9970 3.26 0.65 \n", "3 17.0 60.0 0.9980 3.16 0.58 \n", "4 11.0 34.0 0.9978 3.51 0.56 \n", "\n", " alcohol quality color \n", "0 9.4 5 red \n", "1 9.8 5 red \n", "2 9.8 5 red \n", "3 9.8 6 red \n", "4 9.4 5 red "]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["from ensae_teaching_cs.data import wines_quality\n", "from pandas import read_csv\n", "df = read_csv(wines_quality(local=True, filename=True))\n", "df.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercice 1 : afficher la distribution des notes\n", "\n", "La fonction [hist](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.hist.html) est simple, efficice."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercice 2 : s\u00e9paration train / test\n", "\n", "La fonction est tellement utilis\u00e9e que vous la trouverez rapidement."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercice 3 : la variable couleur n'est pas num\u00e9rique\n", "\n", "M... [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)."]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercice 3 : premier classifieur\n", "\n", "Vous trouverez aussi tout seul. 
A few functions can help you evaluate the model: [confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html), [classification_report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)."]}, {"cell_type": "markdown", "metadata": {}, "source": ["Much better."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 4: ROC curve\n", "\n", "A few hints..."]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": ["from sklearn.metrics import roc_curve, auc\n", "\n", "# Uncomment once pipe, X_test and y_test are defined (previous exercises).\n", "# labels = pipe.steps[1][1].classes_\n", "# y_score = pipe.predict_proba(X_test)\n", "\n", "fpr = dict()\n", "tpr = dict()\n", "roc_auc = dict()\n", "# One ROC curve per class (one-vs-rest).\n", "# for i, cl in enumerate(labels):\n", "#     fpr[cl], tpr[cl], _ = roc_curve(y_test == cl, y_score[:, i])\n", "#     roc_auc[cl] = auc(fpr[cl], tpr[cl])"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": ["# fig, ax = plt.subplots(1, 1, figsize=(8,4))\n", "# for k in roc_auc:\n", "#     ax.plot(fpr[k], tpr[k], label=\"c%d = %1.2f\" % (k, roc_auc[k]))\n", "# ax.legend();"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 5: anomalies\n", "\n", "An anomaly is an outlier. In other words, the probability that such an event occurs again is low. A fairly well-known model is [EllipticEnvelope](https://scikit-learn.org/stable/auto_examples/plot_anomaly_comparison.html). The assumption is that if this model flags an observation as an anomaly, a prediction model will have a harder time predicting it. 
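
A minimal sketch of the idea, assuming synthetic data in place of the wine features and an arbitrary `contamination` of 5%: the pipeline keeps a preprocessing step (here a `StandardScaler`, chosen for illustration) and ends with `EllipticEnvelope` instead of a classifier.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine features (assumption): 200 ordinary
# points around the origin plus 5 points placed far away from them.
rng = np.random.RandomState(0)
X_inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 3))
X_outliers = rng.uniform(low=8.0, high=10.0, size=(5, 3))
X = np.vstack([X_inliers, X_outliers])

# Same shape as a classification pipeline, with the last step swapped
# for an anomaly detector. No target is needed to fit it.
anomaly_pipe = make_pipeline(
    StandardScaler(),
    EllipticEnvelope(contamination=0.05, random_state=0),
)
anomaly_pipe.fit(X)

# predict returns +1 for inliers and -1 for points flagged as anomalies.
labels = anomaly_pipe.predict(X)
n_flagged = int((labels == -1).sum())
print('flagged as anomalies:', n_flagged, 'out of', len(X))
```

One way to test the assumption stated above would be to compare the prediction error of a model on the rows labelled -1 with its error on the rows labelled +1.
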
We reuse the previous pipeline, changing only the last step."]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": ["from sklearn.covariance import EllipticEnvelope\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 6: regression\n", "\n", "The rating is numeric, so why not try a regression?"]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": ["from sklearn.ensemble import RandomForestRegressor"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 7: confidence interval\n", "\n", "How can a confidence interval be built with a classifier and with a regressor? Nothing theoretical, just ideas and a bit of tinkering."]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2"}}, "nbformat": 4, "nbformat_minor": 2}