{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Stratified train / test split\n", "\n", "When a class is under-represented, a purely random train/test split is unlikely to preserve the class distribution."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": ["%matplotlib inline"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": ["from papierstat.datasets import load_wines_dataset\n", "df = load_wines_dataset()\n", "# Features: every column except the target and the wine color\n", "X = df.drop(['quality', 'color'], axis=1)\n", "# Target: the quality grade\n", "y = df['quality']"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The data is split into a training set and a test set with the function [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": ["from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y)"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/html": ["
<table>\n",
"  <thead>\n",
"    <tr><th>base</th><th>test</th><th>train</th><th>ratio</th></tr>\n",
"    <tr><th>y</th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>3</th><td>13.0</td><td>17.0</td><td>0.764706</td></tr>\n",
"    <tr><th>4</th><td>54.0</td><td>162.0</td><td>0.333333</td></tr>\n",
"    <tr><th>5</th><td>539.0</td><td>1599.0</td><td>0.337086</td></tr>\n",
"    <tr><th>6</th><td>713.0</td><td>2123.0</td><td>0.335846</td></tr>\n",
"    <tr><th>7</th><td>267.0</td><td>812.0</td><td>0.328818</td></tr>\n",
"    <tr><th>8</th><td>39.0</td><td>154.0</td><td>0.253247</td></tr>\n",
"    <tr><th>9</th><td>NaN</td><td>5.0</td><td>NaN</td></tr>\n",
"  </tbody>\n",
"</table>"], "text/plain": ["base   test   train     ratio\n", "y\n", "3      13.0    17.0  0.764706\n", "4      54.0   162.0  0.333333\n", "5     539.0  1599.0  0.337086\n", "6     713.0  2123.0  0.335846\n", "7     267.0   812.0  0.328818\n", "8      39.0   154.0  0.253247\n", "9       NaN     5.0       NaN"]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["import pandas\n", "# Label every grade with the set it belongs to, then count per (set, grade)\n", "ys = pandas.DataFrame(dict(y=y_train))\n", "ys['base'] = 'train'\n", "ys2 = pandas.DataFrame(dict(y=y_test))\n", "ys2['base'] = 'test'\n", "ys = pandas.concat([ys, ys2])\n", "ys['compte'] = 1\n", "# Keyword arguments: recent pandas versions no longer accept positional ones in pivot\n", "piv = ys.groupby(['base', 'y'], as_index=False).count().pivot(index='y', columns='base', values='compte')\n", "piv['ratio'] = piv['test'] / piv['train']\n", "piv"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The test/train ratio is close to 1/3 for every grade, which is what the default `test_size=0.25` implies (0.25 / 0.75), except for the under-represented grades: grade 3 reaches 0.76 and grade 9 is absent from the test set. We therefore use a stratified split: the distribution of one variable, here the labels, will be the same in the training and test sets.
We follow the [StratifiedShuffleSplit](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html) example."]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/plain": ["(4352, 2145)"]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.model_selection import StratifiedShuffleSplit\n", "split = StratifiedShuffleSplit(n_splits=1, test_size=0.33)\n", "# n_splits=1, so the generator yields a single (train, test) pair of index arrays\n", "train_index, test_index = next(split.split(X, y))\n", "len(train_index), len(test_index)"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/plain": ["((4352,), (2145,))"]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["# The indices are positional, hence iloc for both X and y\n", "X_train, y_train = X.iloc[train_index, :], y.iloc[train_index]\n", "X_test, y_test = X.iloc[test_index, :], y.iloc[test_index]\n", "y_train.shape, y_test.shape"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
<table>\n",
"  <thead>\n",
"    <tr><th>base</th><th>test</th><th>train</th><th>ratio</th></tr>\n",
"    <tr><th>y</th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>3</th><td>10</td><td>20</td><td>0.500000</td></tr>\n",
"    <tr><th>4</th><td>71</td><td>145</td><td>0.489655</td></tr>\n",
"    <tr><th>5</th><td>706</td><td>1432</td><td>0.493017</td></tr>\n",
"    <tr><th>6</th><td>936</td><td>1900</td><td>0.492632</td></tr>\n",
"    <tr><th>7</th><td>356</td><td>723</td><td>0.492393</td></tr>\n",
"    <tr><th>8</th><td>64</td><td>129</td><td>0.496124</td></tr>\n",
"    <tr><th>9</th><td>2</td><td>3</td><td>0.666667</td></tr>\n",
"  </tbody>\n",
"</table>"], "text/plain": ["base  test  train     ratio\n", "y\n", "3       10     20  0.500000\n", "4       71    145  0.489655\n", "5      706   1432  0.493017\n", "6      936   1900  0.492632\n", "7      356    723  0.492393\n", "8       64    129  0.496124\n", "9        2      3  0.666667"]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["# Same count per (set, grade) as before, on the stratified split\n", "ys = pandas.DataFrame(dict(y=y_train))\n", "ys['base'] = 'train'\n", "ys2 = pandas.DataFrame(dict(y=y_test))\n", "ys2['base'] = 'test'\n", "ys = pandas.concat([ys, ys2])\n", "ys['compte'] = 1\n", "piv = ys.groupby(['base', 'y'], as_index=False).count().pivot(index='y', columns='base', values='compte')\n", "piv['ratio'] = piv['test'] / piv['train']\n", "piv"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The test/train ratio is now the same for every grade: the test set holds about half as many individuals as the training set in each class. The only exception is grade 9, which has only 5 elements, too few for the ratio to be matched."]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/plain": ["KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',\n", " metric_params=None, n_jobs=1, n_neighbors=1, p=2,\n", " weights='uniform')"]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.neighbors import KNeighborsRegressor\n", "knn = KNeighborsRegressor(n_neighbors=1)\n", "knn.fit(X_train, y_train)"]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": ["prediction = knn.predict(X_test)"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"data": {"text/plain": ["-0.1007330402006481"]}, "execution_count": 11, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.metrics import r2_score\n", "r2_score(y_test, prediction)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["This does not improve the quality of the model (the $R^2$ above is negative), but we can be sure that the under-represented classes are handled better by this stratified random split."]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4"}}, "nbformat": 4, "nbformat_minor": 2}