{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Le gradient et le discret\n", "\n", "Les m\u00e9thodes d'optimisation \u00e0 base de gradient s'appuie sur une fonction d'erreur d\u00e9rivable qu'on devrait appliquer de pr\u00e9f\u00e9rence sur des variables al\u00e9atoires r\u00e9elles. Ce notebook explore quelques id\u00e9es."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Un petit probl\u00e8me simple\n", "\n", "On utilise le jeu de donn\u00e9es *iris* disponible dans [scikit-learn](http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html)."]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": ["from sklearn import datasets\n", "\n", "iris = datasets.load_iris()\n", "X = iris.data[:, :2] # we only take the first two features.\n", "Y = iris.target"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On cale une r\u00e9gression logistique. On ne distingue pas apprentissage et test car ce n'est pas le propos de ce notebook."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/plain": ["LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, max_iter=100, multi_class='ovr',\n", " n_jobs=None, penalty='l2', random_state=None, solver='liblinear',\n", " tol=0.0001, verbose=0, warm_start=False)"]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.linear_model import LogisticRegression\n", "clf = LogisticRegression(multi_class=\"ovr\", solver=\"liblinear\")\n", "clf.fit(X, Y)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Puis on calcule la matrice de confusion."]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[49, 1, 0],\n", " [ 2, 21, 27],\n", " [ 1, 4, 45]], dtype=int64)"]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.metrics import confusion_matrix\n", "pred = clf.predict(X)\n", "confusion_matrix(Y, pred)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Multiplication 
des observations"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Le param\u00e8tre ``multi_class='ovr'`` stipule que le mod\u00e8le cache en fait l'estimation de 3 r\u00e9gressions logistiques binaire. Essayons de n'en faire qu'une seule en ajouter le label ``Y`` aux variables. Soit un couple $(X_i \\in \\mathbb{R^d}, Y_i \\in \\mathbb{N})$ qui correspond \u00e0 une observation pour un probl\u00e8me multi-classe. Comme il y a $C$ classes, on multiplie cette ligne par le nombre de classes $C$ pour obtenir :\n", "\n", "$$\\forall c \\in \\mathbb{[}1, ..., C\\mathbb{]}, \\; \\left\\{ \\begin{array}{ll} X_i' = (X_{i,1}, ..., X_{i,d}, Y_{i,1}, ..., Y_{i,C}) \\\\ Y_i' = \\mathbb{1}_{Y_i = c} \\\\ Y_{i,k} = \\mathbb{1}_{c = k}\\end{array} \\right.$$\n", "\n", "Voyons ce que cela donne sur un exemple :"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
\n", " \n", " \n", " | \n", " X1 | \n", " X2 | \n", " Y0 | \n", " Y1 | \n", " Y2 | \n", " Y' | \n", "
\n", " \n", " \n", " \n", " 0 | \n", " 5.1 | \n", " 3.5 | \n", " 1.0 | \n", " 0.0 | \n", " 0.0 | \n", " 1.0 | \n", "
\n", " \n", " 1 | \n", " 5.1 | \n", " 3.5 | \n", " 0.0 | \n", " 1.0 | \n", " 0.0 | \n", " 0.0 | \n", "
\n", " \n", " 2 | \n", " 5.1 | \n", " 3.5 | \n", " 0.0 | \n", " 0.0 | \n", " 1.0 | \n", " 0.0 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" X1 X2 Y0 Y1 Y2 Y'\n", "0 5.1 3.5 1.0 0.0 0.0 1.0\n", "1 5.1 3.5 0.0 1.0 0.0 0.0\n", "2 5.1 3.5 0.0 0.0 1.0 0.0"]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["import numpy\n", "import pandas\n", "\n", "def multiplie(X, Y, classes=None):\n", " if classes is None:\n", " classes = numpy.unique(Y)\n", " XS = []\n", " YS = []\n", " for i in classes:\n", " X2 = numpy.zeros((X.shape[0], 3))\n", " X2[:,i] = 1\n", " Yb = Y == i\n", " XS.append(numpy.hstack([X, X2]))\n", " Yb = Yb.reshape((len(Yb), 1))\n", " YS.append(Yb)\n", "\n", " Xext = numpy.vstack(XS)\n", " Yext = numpy.vstack(YS)\n", " return Xext, Yext\n", "\n", "x, y = multiplie(X[:1,:], Y[:1], [0, 1, 2])\n", "df = pandas.DataFrame(numpy.hstack([x, y]))\n", "df.columns = [\"X1\", \"X2\", \"Y0\", \"Y1\", \"Y2\", \"Y'\"]\n", "df"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Trois colonnes ont \u00e9t\u00e9 ajout\u00e9es c\u00f4t\u00e9 $X$, la ligne a \u00e9t\u00e9 multipli\u00e9e 3 fois, la derni\u00e8re colonne est $Y$ qui ne vaut 1 que lorsque le 1 est au bon endroit dans une des colonnes ajout\u00e9es. Le probl\u00e8me de classification qui \u00e9t\u00e9 de pr\u00e9dire la bonne classe devient : est-ce la classe \u00e0 pr\u00e9dire est $k$ ? On applique cela sur toutes les lignes de la base et cela donne :"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
\n", " \n", " \n", " | \n", " X1 | \n", " X2 | \n", " Y0 | \n", " Y1 | \n", " Y2 | \n", " Y' | \n", "
\n", " \n", " \n", " \n", " 414 | \n", " 6.7 | \n", " 3.3 | \n", " 0.0 | \n", " 0.0 | \n", " 1.0 | \n", " 1.0 | \n", "
\n", " \n", " 125 | \n", " 5.5 | \n", " 2.5 | \n", " 0.0 | \n", " 0.0 | \n", " 1.0 | \n", " 0.0 | \n", "
\n", " \n", " 394 | \n", " 6.7 | \n", " 3.1 | \n", " 0.0 | \n", " 0.0 | \n", " 1.0 | \n", " 0.0 | \n", "
\n", " \n", " 411 | \n", " 6.0 | \n", " 2.2 | \n", " 1.0 | \n", " 0.0 | \n", " 0.0 | \n", " 0.0 | \n", "
\n", " \n", " 95 | \n", " 7.6 | \n", " 3.0 | \n", " 0.0 | \n", " 0.0 | \n", " 1.0 | \n", " 1.0 | \n", "
\n", " \n", " 64 | \n", " 5.8 | \n", " 2.6 | \n", " 0.0 | \n", " 0.0 | \n", " 1.0 | \n", " 0.0 | \n", "
\n", " \n", " 309 | \n", " 5.0 | \n", " 3.4 | \n", " 0.0 | \n", " 0.0 | \n", " 1.0 | \n", " 0.0 | \n", "
\n", " \n", " 7 | \n", " 6.7 | \n", " 3.0 | \n", " 0.0 | \n", " 1.0 | \n", " 0.0 | \n", " 0.0 | \n", "
\n", " \n", " 182 | \n", " 6.1 | \n", " 2.8 | \n", " 1.0 | \n", " 0.0 | \n", " 0.0 | \n", " 0.0 | \n", "
\n", " \n", " 49 | \n", " 4.7 | \n", " 3.2 | \n", " 0.0 | \n", " 1.0 | \n", " 0.0 | \n", " 0.0 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" X1 X2 Y0 Y1 Y2 Y'\n", "414 6.7 3.3 0.0 0.0 1.0 1.0\n", "125 5.5 2.5 0.0 0.0 1.0 0.0\n", "394 6.7 3.1 0.0 0.0 1.0 0.0\n", "411 6.0 2.2 1.0 0.0 0.0 0.0\n", "95 7.6 3.0 0.0 0.0 1.0 1.0\n", "64 5.8 2.6 0.0 0.0 1.0 0.0\n", "309 5.0 3.4 0.0 0.0 1.0 0.0\n", "7 6.7 3.0 0.0 1.0 0.0 0.0\n", "182 6.1 2.8 1.0 0.0 0.0 0.0\n", "49 4.7 3.2 0.0 1.0 0.0 0.0"]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["Xext, Yext = multiplie(X, Y)\n", "numpy.hstack([Xext, Yext])\n", "df = pandas.DataFrame(numpy.hstack([Xext, Yext]))\n", "df.columns = [\"X1\", \"X2\", \"Y0\", \"Y1\", \"Y2\", \"Y'\"]\n", "df.iloc[numpy.random.permutation(df.index), :].head(n=10)"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"text/plain": ["GradientBoostingClassifier(criterion='friedman_mse', init=None,\n", " learning_rate=0.1, loss='deviance', max_depth=3,\n", " max_features=None, max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=100,\n", " n_iter_no_change=None, presort='auto', random_state=None,\n", " subsample=1.0, tol=0.0001, validation_fraction=0.1,\n", " verbose=0, warm_start=False)"]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.ensemble import GradientBoostingClassifier\n", "clf = GradientBoostingClassifier()\n", "clf.fit(Xext, Yext.ravel())"]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[278, 22],\n", " [ 25, 125]], dtype=int64)"]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["pred = clf.predict(Xext)\n", "confusion_matrix(Yext, pred)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Introduire du bruit\n", "\n", "Un des probl\u00e8mes de cette m\u00e9thode est qu'on ajoute une variable binaire pour un probl\u00e8me 
r\u00e9solu \u00e0 l'aide d'une optimisation \u00e0 base de gradient. C'est moyen. Pas de probl\u00e8me, changeons un peu la donne."]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
\n", " \n", " \n", " | \n", " X1 | \n", " X2 | \n", " Y0 | \n", " Y1 | \n", " Y2 | \n", " Y' | \n", "
\n", " \n", " \n", " \n", " 0 | \n", " 5.1 | \n", " 3.5 | \n", " 1.107461 | \n", " 0.166893 | \n", " 0.018765 | \n", " 1.0 | \n", "
\n", " \n", " 1 | \n", " 5.1 | \n", " 3.5 | \n", " 0.162464 | \n", " 1.187359 | \n", " 0.187721 | \n", " 0.0 | \n", "
\n", " \n", " 2 | \n", " 5.1 | \n", " 3.5 | \n", " 0.086876 | \n", " 0.178472 | \n", " 1.179201 | \n", " 0.0 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" X1 X2 Y0 Y1 Y2 Y'\n", "0 5.1 3.5 1.107461 0.166893 0.018765 1.0\n", "1 5.1 3.5 0.162464 1.187359 0.187721 0.0\n", "2 5.1 3.5 0.086876 0.178472 1.179201 0.0"]}, "execution_count": 10, "metadata": {}, "output_type": "execute_result"}], "source": ["def multiplie_bruit(X, Y, classes=None):\n", " if classes is None:\n", " classes = numpy.unique(Y)\n", " XS = []\n", " YS = []\n", " for i in classes:\n", " # X2 = numpy.random.randn((X.shape[0]* 3)).reshape(X.shape[0], 3) * 0.1\n", " X2 = numpy.random.random((X.shape[0], 3)) * 0.2\n", " X2[:,i] += 1\n", " Yb = Y == i\n", " XS.append(numpy.hstack([X, X2]))\n", " Yb = Yb.reshape((len(Yb), 1))\n", " YS.append(Yb)\n", "\n", " Xext = numpy.vstack(XS)\n", " Yext = numpy.vstack(YS)\n", " return Xext, Yext\n", "\n", "x, y = multiplie_bruit(X[:1,:], Y[:1], [0, 1, 2])\n", "df = pandas.DataFrame(numpy.hstack([x, y]))\n", "df.columns = [\"X1\", \"X2\", \"Y0\", \"Y1\", \"Y2\", \"Y'\"]\n", "df"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Le probl\u00e8me est le m\u00eame qu'avant except\u00e9 les variables $Y_i$ qui sont maintenant r\u00e9el. Au lieu d'\u00eatre nul, on prend une valeur $Y_i < 0.4$."]}, {"cell_type": "code", "execution_count": 10, "metadata": {"scrolled": false}, "outputs": [{"data": {"text/html": ["\n", "\n", "
\n", " \n", " \n", " | \n", " X1 | \n", " X2 | \n", " Y0 | \n", " Y1 | \n", " Y2 | \n", " Y' | \n", "
\n", " \n", " \n", " \n", " 295 | \n", " 5.5 | \n", " 2.6 | \n", " 0.197643 | \n", " 1.199976 | \n", " 0.180766 | \n", " 1.0 | \n", "
\n", " \n", " 46 | \n", " 5.2 | \n", " 3.4 | \n", " 0.178395 | \n", " 0.190600 | \n", " 1.159765 | \n", " 0.0 | \n", "
\n", " \n", " 187 | \n", " 6.7 | \n", " 3.1 | \n", " 0.188947 | \n", " 1.093288 | \n", " 0.139723 | \n", " 1.0 | \n", "
\n", " \n", " 210 | \n", " 6.9 | \n", " 3.1 | \n", " 0.095428 | \n", " 0.182643 | \n", " 1.037533 | \n", " 1.0 | \n", "
\n", " \n", " 29 | \n", " 5.5 | \n", " 3.5 | \n", " 1.131419 | \n", " 0.077241 | \n", " 0.177483 | \n", " 1.0 | \n", "
\n", " \n", " 315 | \n", " 6.4 | \n", " 3.2 | \n", " 0.099738 | \n", " 0.197291 | \n", " 1.035431 | \n", " 1.0 | \n", "
\n", " \n", " 152 | \n", " 5.8 | \n", " 2.7 | \n", " 0.069061 | \n", " 0.045325 | \n", " 1.061221 | \n", " 0.0 | \n", "
\n", " \n", " 168 | \n", " 6.5 | \n", " 2.8 | \n", " 0.093164 | \n", " 1.177413 | \n", " 0.095890 | \n", " 1.0 | \n", "
\n", " \n", " 348 | \n", " 6.9 | \n", " 3.1 | \n", " 1.094184 | \n", " 0.196944 | \n", " 0.083975 | \n", " 0.0 | \n", "
\n", " \n", " 261 | \n", " 6.3 | \n", " 2.8 | \n", " 0.197558 | \n", " 0.080273 | \n", " 1.009379 | \n", " 1.0 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" X1 X2 Y0 Y1 Y2 Y'\n", "295 5.5 2.6 0.197643 1.199976 0.180766 1.0\n", "46 5.2 3.4 0.178395 0.190600 1.159765 0.0\n", "187 6.7 3.1 0.188947 1.093288 0.139723 1.0\n", "210 6.9 3.1 0.095428 0.182643 1.037533 1.0\n", "29 5.5 3.5 1.131419 0.077241 0.177483 1.0\n", "315 6.4 3.2 0.099738 0.197291 1.035431 1.0\n", "152 5.8 2.7 0.069061 0.045325 1.061221 0.0\n", "168 6.5 2.8 0.093164 1.177413 0.095890 1.0\n", "348 6.9 3.1 1.094184 0.196944 0.083975 0.0\n", "261 6.3 2.8 0.197558 0.080273 1.009379 1.0"]}, "execution_count": 11, "metadata": {}, "output_type": "execute_result"}], "source": ["Xextb, Yextb = multiplie_bruit(X, Y)\n", "df = pandas.DataFrame(numpy.hstack([Xextb, Yextb]))\n", "df.columns = [\"X1\", \"X2\", \"Y0\", \"Y1\", \"Y2\", \"Y'\"]\n", "df.iloc[numpy.random.permutation(df.index), :].head(n=10)"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"data": {"text/plain": ["GradientBoostingClassifier(criterion='friedman_mse', init=None,\n", " learning_rate=0.1, loss='deviance', max_depth=3,\n", " max_features=None, max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=100,\n", " n_iter_no_change=None, presort='auto', random_state=None,\n", " subsample=1.0, tol=0.0001, validation_fraction=0.1,\n", " verbose=0, warm_start=False)"]}, "execution_count": 12, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.ensemble import GradientBoostingClassifier\n", "clfb = GradientBoostingClassifier()\n", "clfb.fit(Xextb, Yextb.ravel())"]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[299, 1],\n", " [ 10, 140]], dtype=int64)"]}, "execution_count": 13, "metadata": {}, "output_type": "execute_result"}], "source": ["predb = clfb.predict(Xextb)\n", "confusion_matrix(Yextb, predb)"]}, {"cell_type": "markdown", "metadata": {}, "source": 
["C'est un petit peu mieux."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Comparaisons de plusieurs mod\u00e8les\n", "\n", "On cherche maintenant \u00e0 comparer le gain en introduisant du bruit pour diff\u00e9rents mod\u00e8les."]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
\n", " \n", " \n", " | \n", " err1 | \n", " err2 | \n", " model | \n", "
\n", " \n", " \n", " \n", " 10 | \n", " 0.333333 | \n", " 0.333333 | \n", " AdaBoostClassifier | \n", "
\n", " \n", " 3 | \n", " 0.048889 | \n", " 0.000000 | \n", " DecisionTreeClassifier | \n", "
\n", " \n", " 4 | \n", " 0.048889 | \n", " 0.000000 | \n", " ExtraTreeClassifier | \n", "
\n", " \n", " 6 | \n", " 0.048889 | \n", " 0.000000 | \n", " ExtraTreesClassifier | \n", "
\n", " \n", " 8 | \n", " 0.333333 | \n", " 0.333333 | \n", " GaussianNB | \n", "
\n", " \n", " 1 | \n", " 0.104444 | \n", " 0.044444 | \n", " GradientBoostingClassifier | \n", "
\n", " \n", " 9 | \n", " 0.104444 | \n", " 0.091111 | \n", " KNeighborsClassifier | \n", "
\n", " \n", " 0 | \n", " 0.333333 | \n", " 0.333333 | \n", " LogisticRegression | \n", "
\n", " \n", " 7 | \n", " 0.333333 | \n", " 0.333333 | \n", " MLPClassifier | \n", "
\n", " \n", " 2 | \n", " 0.053333 | \n", " 0.002222 | \n", " RandomForestClassifier | \n", "
\n", " \n", " 5 | \n", " 0.333333 | \n", " 0.053333 | \n", " XGBClassifier | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" err1 err2 model\n", "10 0.333333 0.333333 AdaBoostClassifier\n", "3 0.048889 0.000000 DecisionTreeClassifier\n", "4 0.048889 0.000000 ExtraTreeClassifier\n", "6 0.048889 0.000000 ExtraTreesClassifier\n", "8 0.333333 0.333333 GaussianNB\n", "1 0.104444 0.044444 GradientBoostingClassifier\n", "9 0.104444 0.091111 KNeighborsClassifier\n", "0 0.333333 0.333333 LogisticRegression\n", "7 0.333333 0.333333 MLPClassifier\n", "2 0.053333 0.002222 RandomForestClassifier\n", "5 0.333333 0.053333 XGBClassifier"]}, "execution_count": 14, "metadata": {}, "output_type": "execute_result"}], "source": ["def error(model, x, y):\n", " p = model.predict(x)\n", " cm = confusion_matrix(y, p)\n", " return (cm[1,0] + cm[0,1]) / cm.sum()\n", "\n", "def comparaison(model, X, Y):\n", "\n", " if isinstance(model, tuple):\n", " clf = model[0](**model[1])\n", " clfb = model[0](**model[1])\n", " model = model[0]\n", " else: \n", " clf = model()\n", " clfb = model()\n", " \n", " Xext, Yext = multiplie(X, Y)\n", " clf.fit(Xext, Yext.ravel())\n", " err = error(clf, Xext, Yext)\n", " \n", " Xextb, Yextb = multiplie_bruit(X, Y)\n", " clfb.fit(Xextb, Yextb.ravel())\n", " errb = error(clfb, Xextb, Yextb)\n", " return dict(model=model.__name__, err1=err, err2=errb)\n", "\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier\n", "from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier\n", "from sklearn.neural_network import MLPClassifier\n", "from sklearn.naive_bayes import GaussianNB\n", "from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier\n", "from xgboost import XGBClassifier\n", "\n", "models = [(LogisticRegression, dict(multi_class=\"ovr\", solver=\"liblinear\")),\n", " GradientBoostingClassifier,\n", " (RandomForestClassifier, dict(n_estimators=20)),\n", " DecisionTreeClassifier,\n", " ExtraTreeClassifier,\n", " XGBClassifier,\n", " 
(ExtraTreesClassifier, dict(n_estimators=20)),\n", " (MLPClassifier, dict(activation=\"logistic\")),\n", " GaussianNB, KNeighborsClassifier, \n", " (AdaBoostClassifier, dict(base_estimator=LogisticRegression(multi_class=\"ovr\", solver=\"liblinear\"), \n", " algorithm=\"SAMME\"))]\n", "\n", "res = [comparaison(model, X, Y) for model in models]\n", "df = pandas.DataFrame(res)\n", "df.sort_values(\"model\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["*err1* correspond \u00e0 $Y_0, Y_1, Y_2$ binaire, *err2* aux m\u00eames variables mais avec un peu de bruit. L'ajout ne semble pas faire d\u00e9cro\u00eetre la performance et l'am\u00e9liore dans certains cas. C'est une piste \u00e0 suivre. Reste \u00e0 savoir si les mod\u00e8les n'apprennent pas le bruit."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Avec une ACP\n", "\n", "On peut faire varier le nombre de composantes, j'en ai gard\u00e9 qu'une. L'ACP est appliqu\u00e9e apr\u00e8s avoir ajout\u00e9 les variables binaires ou binaires bruit\u00e9es. Le r\u00e9sultat est sans \u00e9quivoque. Aucun mod\u00e8le ne parvient \u00e0 apprendre sans l'ajout de bruit."]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
\n", " \n", " \n", " | \n", " err1 | \n", " err2 | \n", " model | \n", " errACP1 | \n", " errACP2 | \n", " modelACP | \n", "
\n", " \n", " \n", " \n", " 10 | \n", " 0.333333 | \n", " 0.333333 | \n", " AdaBoostClassifier | \n", " 0.333333 | \n", " 0.333333 | \n", " AdaBoostClassifier | \n", "
\n", " \n", " 3 | \n", " 0.048889 | \n", " 0.000000 | \n", " DecisionTreeClassifier | \n", " 0.333333 | \n", " 0.000000 | \n", " DecisionTreeClassifier | \n", "
\n", " \n", " 4 | \n", " 0.048889 | \n", " 0.000000 | \n", " ExtraTreeClassifier | \n", " 0.333333 | \n", " 0.000000 | \n", " ExtraTreeClassifier | \n", "
\n", " \n", " 6 | \n", " 0.048889 | \n", " 0.000000 | \n", " ExtraTreesClassifier | \n", " 0.333333 | \n", " 0.000000 | \n", " ExtraTreesClassifier | \n", "
\n", " \n", " 8 | \n", " 0.333333 | \n", " 0.333333 | \n", " GaussianNB | \n", " 0.333333 | \n", " 0.333333 | \n", " GaussianNB | \n", "
\n", " \n", " 1 | \n", " 0.104444 | \n", " 0.044444 | \n", " GradientBoostingClassifier | \n", " 0.333333 | \n", " 0.224444 | \n", " GradientBoostingClassifier | \n", "
\n", " \n", " 9 | \n", " 0.104444 | \n", " 0.091111 | \n", " KNeighborsClassifier | \n", " 0.335556 | \n", " 0.340000 | \n", " KNeighborsClassifier | \n", "
\n", " \n", " 0 | \n", " 0.333333 | \n", " 0.333333 | \n", " LogisticRegression | \n", " 0.333333 | \n", " 0.333333 | \n", " LogisticRegression | \n", "
\n", " \n", " 7 | \n", " 0.333333 | \n", " 0.333333 | \n", " MLPClassifier | \n", " 0.333333 | \n", " 0.333333 | \n", " MLPClassifier | \n", "
\n", " \n", " 2 | \n", " 0.053333 | \n", " 0.002222 | \n", " RandomForestClassifier | \n", " 0.333333 | \n", " 0.024444 | \n", " RandomForestClassifier | \n", "
\n", " \n", " 5 | \n", " 0.333333 | \n", " 0.053333 | \n", " XGBClassifier | \n", " 0.333333 | \n", " 0.315556 | \n", " XGBClassifier | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" err1 err2 model errACP1 errACP2 \\\n", "10 0.333333 0.333333 AdaBoostClassifier 0.333333 0.333333 \n", "3 0.048889 0.000000 DecisionTreeClassifier 0.333333 0.000000 \n", "4 0.048889 0.000000 ExtraTreeClassifier 0.333333 0.000000 \n", "6 0.048889 0.000000 ExtraTreesClassifier 0.333333 0.000000 \n", "8 0.333333 0.333333 GaussianNB 0.333333 0.333333 \n", "1 0.104444 0.044444 GradientBoostingClassifier 0.333333 0.224444 \n", "9 0.104444 0.091111 KNeighborsClassifier 0.335556 0.340000 \n", "0 0.333333 0.333333 LogisticRegression 0.333333 0.333333 \n", "7 0.333333 0.333333 MLPClassifier 0.333333 0.333333 \n", "2 0.053333 0.002222 RandomForestClassifier 0.333333 0.024444 \n", "5 0.333333 0.053333 XGBClassifier 0.333333 0.315556 \n", "\n", " modelACP \n", "10 AdaBoostClassifier \n", "3 DecisionTreeClassifier \n", "4 ExtraTreeClassifier \n", "6 ExtraTreesClassifier \n", "8 GaussianNB \n", "1 GradientBoostingClassifier \n", "9 KNeighborsClassifier \n", "0 LogisticRegression \n", "7 MLPClassifier \n", "2 RandomForestClassifier \n", "5 XGBClassifier "]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.decomposition import PCA\n", "\n", "def comparaison_ACP(model, X, Y):\n", "\n", " if isinstance(model, tuple):\n", " clf = model[0](**model[1])\n", " clfb = model[0](**model[1])\n", " model = model[0]\n", " else: \n", " clf = model()\n", " clfb = model()\n", " \n", " axes = 1\n", " solver = \"full\"\n", " Xext, Yext = multiplie(X, Y)\n", " Xext = PCA(n_components=axes, svd_solver=solver).fit_transform(Xext)\n", " clf.fit(Xext, Yext.ravel())\n", " err = error(clf, Xext, Yext)\n", " \n", " Xextb, Yextb = multiplie_bruit(X, Y)\n", " Xextb = PCA(n_components=axes, svd_solver=solver).fit_transform(Xextb)\n", " clfb.fit(Xextb, Yextb.ravel())\n", " errb = error(clfb, Xextb, Yextb)\n", " return dict(modelACP=model.__name__, errACP1=err, errACP2=errb)\n", "\n", "res = [comparaison_ACP(model, X, Y) for model in 
models]\n", "dfb = pandas.DataFrame(res)\n", "pandas.concat([ df.sort_values(\"model\"), dfb.sort_values(\"modelACP\")], axis=1)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Base d'apprentissage et de test\n", "\n", "Cette fois-ci, on s'int\u00e9resse \u00e0 la qualit\u00e9 des fronti\u00e8res que les mod\u00e8les trouvent en v\u00e9rifiant sur une base de test que l'apprentissage s'est bien pass\u00e9."]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
\n", " \n", " \n", " | \n", " modelTT | \n", " err_train | \n", " err2_train | \n", " err2b_train_clean | \n", " err_test | \n", " err2_test | \n", " err2b_test_clean | \n", "
\n", " \n", " \n", " \n", " 10 | \n", " AdaBoostClassifier | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", "
\n", " \n", " 3 | \n", " DecisionTreeClassifier | \n", " 0.026667 | \n", " 0.000000 | \n", " 0.226667 | \n", " 0.206667 | \n", " 0.273333 | \n", " 0.313333 | \n", "
\n", " \n", " 4 | \n", " ExtraTreeClassifier | \n", " 0.026667 | \n", " 0.000000 | \n", " 0.253333 | \n", " 0.213333 | \n", " 0.253333 | \n", " 0.273333 | \n", "
\n", " \n", " 6 | \n", " ExtraTreesClassifier | \n", " 0.026667 | \n", " 0.000000 | \n", " 0.140000 | \n", " 0.200000 | \n", " 0.213333 | \n", " 0.220000 | \n", "
\n", " \n", " 8 | \n", " GaussianNB | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", "
\n", " \n", " 1 | \n", " GradientBoostingClassifier | \n", " 0.080000 | \n", " 0.013333 | \n", " 0.176667 | \n", " 0.186667 | \n", " 0.246667 | \n", " 0.240000 | \n", "
\n", " \n", " 9 | \n", " KNeighborsClassifier | \n", " 0.070000 | \n", " 0.076667 | \n", " 0.073333 | \n", " 0.160000 | \n", " 0.160000 | \n", " 0.166667 | \n", "
\n", " \n", " 0 | \n", " LogisticRegression | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", "
\n", " \n", " 7 | \n", " MLPClassifier | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", "
\n", " \n", " 2 | \n", " RandomForestClassifier | \n", " 0.026667 | \n", " 0.000000 | \n", " 0.156667 | \n", " 0.206667 | \n", " 0.266667 | \n", " 0.213333 | \n", "
\n", " \n", " 5 | \n", " XGBClassifier | \n", " 0.106667 | \n", " 0.036667 | \n", " 0.333333 | \n", " 0.193333 | \n", " 0.280000 | \n", " 0.346667 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" modelTT err_train err2_train err2b_train_clean \\\n", "10 AdaBoostClassifier 0.333333 0.333333 0.333333 \n", "3 DecisionTreeClassifier 0.026667 0.000000 0.226667 \n", "4 ExtraTreeClassifier 0.026667 0.000000 0.253333 \n", "6 ExtraTreesClassifier 0.026667 0.000000 0.140000 \n", "8 GaussianNB 0.333333 0.333333 0.333333 \n", "1 GradientBoostingClassifier 0.080000 0.013333 0.176667 \n", "9 KNeighborsClassifier 0.070000 0.076667 0.073333 \n", "0 LogisticRegression 0.333333 0.333333 0.333333 \n", "7 MLPClassifier 0.333333 0.333333 0.333333 \n", "2 RandomForestClassifier 0.026667 0.000000 0.156667 \n", "5 XGBClassifier 0.106667 0.036667 0.333333 \n", "\n", " err_test err2_test err2b_test_clean \n", "10 0.333333 0.333333 0.333333 \n", "3 0.206667 0.273333 0.313333 \n", "4 0.213333 0.253333 0.273333 \n", "6 0.200000 0.213333 0.220000 \n", "8 0.333333 0.333333 0.333333 \n", "1 0.186667 0.246667 0.240000 \n", "9 0.160000 0.160000 0.166667 \n", "0 0.333333 0.333333 0.333333 \n", "7 0.333333 0.333333 0.333333 \n", "2 0.206667 0.266667 0.213333 \n", "5 0.193333 0.280000 0.346667 "]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.model_selection import train_test_split\n", "\n", "def comparaison_train_test(models, X, Y, mbruit=multiplie_bruit, acp=None):\n", "\n", " axes = acp\n", " solver = \"full\" \n", " \n", " ind = numpy.random.permutation(numpy.arange(X.shape[0]))\n", " X = X[ind,:]\n", " Y = Y[ind]\n", " X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1./3)\n", " \n", " res = []\n", " for model in models:\n", " \n", " if isinstance(model, tuple):\n", " clf = model[0](**model[1])\n", " clfb = model[0](**model[1])\n", " model = model[0]\n", " else: \n", " clf = model()\n", " clfb = model()\n", "\n", " Xext_train, Yext_train = multiplie(X_train, Y_train)\n", " Xext_test, Yext_test = multiplie(X_test, Y_test)\n", " if acp:\n", " Xext_train_ = Xext_train\n", " Xext_test_ = 
Xext_test\n", " acp_model = PCA(n_components=axes, svd_solver=solver).fit(Xext_train)\n", " Xext_train = acp_model.transform(Xext_train)\n", " Xext_test = acp_model.transform(Xext_test) \n", " clf.fit(Xext_train, Yext_train.ravel())\n", "\n", " err_train = error(clf, Xext_train, Yext_train)\n", " err_test = error(clf, Xext_test, Yext_test)\n", "\n", " Xextb_train, Yextb_train = mbruit(X_train, Y_train)\n", " Xextb_test, Yextb_test = mbruit(X_test, Y_test)\n", " if acp:\n", " acp_model = PCA(n_components=axes, svd_solver=solver).fit(Xextb_train)\n", " Xextb_train = acp_model.transform(Xextb_train)\n", " Xextb_test = acp_model.transform(Xextb_test) \n", " Xext_train = acp_model.transform(Xext_train_)\n", " Xext_test = acp_model.transform(Xext_test_) \n", " clfb.fit(Xextb_train, Yextb_train.ravel())\n", "\n", " errb_train = error(clfb, Xextb_train, Yextb_train)\n", " errb_train_clean = error(clfb, Xext_train, Yext_train)\n", " errb_test = error(clfb, Xextb_test, Yextb_test)\n", " errb_test_clean = error(clfb, Xext_test, Yext_test)\n", " \n", " res.append(dict(modelTT=model.__name__, err_train=err_train, err2_train=errb_train,\n", " err_test=err_test, err2_test=errb_test, err2b_test_clean=errb_test_clean,\n", " err2b_train_clean=errb_train_clean))\n", " \n", " dfb = pandas.DataFrame(res)\n", " dfb = dfb[[\"modelTT\", \"err_train\", \"err2_train\", \"err2b_train_clean\", \"err_test\", \"err2_test\", \"err2b_test_clean\"]]\n", " dfb = dfb.sort_values(\"modelTT\") \n", " return dfb\n", "\n", "dfb = comparaison_train_test(models, X, Y)\n", "dfb"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Les colonnes *err2b_train_clean* et *err2b_test_clean* sont les erreurs obtenues par des mod\u00e8les appris sur des colonnes bruit\u00e9es et test\u00e9es sur des colonnes non bruit\u00e9es ce qui est le v\u00e9ritable test. On s'aper\u00e7oit que les performances sont tr\u00e8s d\u00e9grad\u00e9es sur la base d'test. 
Une raison est que le bruit choisi ajout\u00e9 n'est pas centr\u00e9. Corrigeons cela."]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
\n", " \n", " \n", " | \n", " modelTT | \n", " err_train | \n", " err2_train | \n", " err2b_train_clean | \n", " err_test | \n", " err2_test | \n", " err2b_test_clean | \n", "
\n", " \n", " \n", " \n", " 10 | \n", " AdaBoostClassifier | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", "
\n", " \n", " 3 | \n", " DecisionTreeClassifier | \n", " 0.033333 | \n", " 0.000000 | \n", " 0.143333 | \n", " 0.193333 | \n", " 0.273333 | \n", " 0.206667 | \n", "
\n", " \n", " 4 | \n", " ExtraTreeClassifier | \n", " 0.033333 | \n", " 0.000000 | \n", " 0.143333 | \n", " 0.226667 | \n", " 0.233333 | \n", " 0.180000 | \n", "
\n", " \n", " 6 | \n", " ExtraTreesClassifier | \n", " 0.033333 | \n", " 0.000000 | \n", " 0.123333 | \n", " 0.200000 | \n", " 0.213333 | \n", " 0.193333 | \n", "
\n", " \n", " 8 | \n", " GaussianNB | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", "
\n", " \n", " 1 | \n", " GradientBoostingClassifier | \n", " 0.083333 | \n", " 0.013333 | \n", " 0.203333 | \n", " 0.193333 | \n", " 0.226667 | \n", " 0.280000 | \n", "
\n", " \n", " 9 | \n", " KNeighborsClassifier | \n", " 0.106667 | \n", " 0.106667 | \n", " 0.100000 | \n", " 0.180000 | \n", " 0.180000 | \n", " 0.193333 | \n", "
\n", " \n", " 0 | \n", " LogisticRegression | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", "
\n", " \n", " 7 | \n", " MLPClassifier | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", "
\n", " \n", " 2 | \n", " RandomForestClassifier | \n", " 0.040000 | \n", " 0.000000 | \n", " 0.176667 | \n", " 0.206667 | \n", " 0.240000 | \n", " 0.253333 | \n", "
\n", " \n", " 5 | \n", " XGBClassifier | \n", " 0.080000 | \n", " 0.063333 | \n", " 0.170000 | \n", " 0.186667 | \n", " 0.220000 | \n", " 0.240000 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" modelTT err_train err2_train err2b_train_clean \\\n", "10 AdaBoostClassifier 0.333333 0.333333 0.333333 \n", "3 DecisionTreeClassifier 0.033333 0.000000 0.143333 \n", "4 ExtraTreeClassifier 0.033333 0.000000 0.143333 \n", "6 ExtraTreesClassifier 0.033333 0.000000 0.123333 \n", "8 GaussianNB 0.333333 0.333333 0.333333 \n", "1 GradientBoostingClassifier 0.083333 0.013333 0.203333 \n", "9 KNeighborsClassifier 0.106667 0.106667 0.100000 \n", "0 LogisticRegression 0.333333 0.333333 0.333333 \n", "7 MLPClassifier 0.333333 0.333333 0.333333 \n", "2 RandomForestClassifier 0.040000 0.000000 0.176667 \n", "5 XGBClassifier 0.080000 0.063333 0.170000 \n", "\n", " err_test err2_test err2b_test_clean \n", "10 0.333333 0.333333 0.333333 \n", "3 0.193333 0.273333 0.206667 \n", "4 0.226667 0.233333 0.180000 \n", "6 0.200000 0.213333 0.193333 \n", "8 0.333333 0.333333 0.333333 \n", "1 0.193333 0.226667 0.280000 \n", "9 0.180000 0.180000 0.193333 \n", "0 0.333333 0.333333 0.333333 \n", "7 0.333333 0.333333 0.333333 \n", "2 0.206667 0.240000 0.253333 \n", "5 0.186667 0.220000 0.240000 "]}, "execution_count": 17, "metadata": {}, "output_type": "execute_result"}], "source": ["def multiplie_bruit_centree(X, Y, classes=None):\n", " if classes is None:\n", " classes = numpy.unique(Y)\n", " XS = []\n", " YS = []\n", " for i in classes:\n", " # X2 = numpy.random.randn((X.shape[0]* 3)).reshape(X.shape[0], 3) * 0.1\n", " X2 = numpy.random.random((X.shape[0], 3)) * 0.2 - 0.1\n", " X2[:,i] += 1\n", " Yb = Y == i\n", " XS.append(numpy.hstack([X, X2]))\n", " Yb = Yb.reshape((len(Yb), 1))\n", " YS.append(Yb)\n", "\n", " Xext = numpy.vstack(XS)\n", " Yext = numpy.vstack(YS)\n", " return Xext, Yext\n", "\n", "dfb = comparaison_train_test(models, X, Y, mbruit=multiplie_bruit_centree, acp=None)\n", "dfb"]}, {"cell_type": "markdown", "metadata": {}, "source": ["C'est mieux mais on en conclut que dans la plupart des cas, la meilleure performance sur la base d'apprentissage avec le 
bruit ajout\u00e9 est due au fait que les mod\u00e8les apprennent par coeur. Sur la base de test, les performances ne sont pas meilleures. Une erreur de 33% signifie que la r\u00e9ponse du classifieur est constante. On multiplie les exemples."]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
\n", " \n", " \n", " | \n", " modelTT | \n", " err_train | \n", " err2_train | \n", " err2b_train_clean | \n", " err_test | \n", " err2_test | \n", " err2b_test_clean | \n", "
\n", " \n", " \n", " \n", " 10 | \n", " AdaBoostClassifier | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", "
\n", " \n", " 3 | \n", " DecisionTreeClassifier | \n", " 0.040000 | \n", " 0.000000 | \n", " 0.120000 | \n", " 0.180000 | \n", " 0.209333 | \n", " 0.240000 | \n", "
\n", " \n", " 4 | \n", " ExtraTreeClassifier | \n", " 0.040000 | \n", " 0.000000 | \n", " 0.073333 | \n", " 0.213333 | \n", " 0.232000 | \n", " 0.220000 | \n", "
\n", " \n", " 6 | \n", " ExtraTreesClassifier | \n", " 0.040000 | \n", " 0.000000 | \n", " 0.066667 | \n", " 0.213333 | \n", " 0.168000 | \n", " 0.160000 | \n", "
\n", " \n", " 8 | \n", " GaussianNB | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", "
\n", " \n", " 1 | \n", " GradientBoostingClassifier | \n", " 0.086667 | \n", " 0.087333 | \n", " 0.166667 | \n", " 0.173333 | \n", " 0.192000 | \n", " 0.186667 | \n", "
\n", " \n", " 9 | \n", " KNeighborsClassifier | \n", " 0.110000 | \n", " 0.094667 | \n", " 0.106667 | \n", " 0.113333 | \n", " 0.158667 | \n", " 0.153333 | \n", "
\n", " \n", " 0 | \n", " LogisticRegression | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", "
\n", " \n", " 7 | \n", " MLPClassifier | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", "
\n", " \n", " 2 | \n", " RandomForestClassifier | \n", " 0.046667 | \n", " 0.000667 | \n", " 0.090000 | \n", " 0.160000 | \n", " 0.188000 | \n", " 0.226667 | \n", "
\n", " \n", " 5 | \n", " XGBClassifier | \n", " 0.123333 | \n", " 0.108667 | \n", " 0.173333 | \n", " 0.153333 | \n", " 0.204000 | \n", " 0.193333 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" modelTT err_train err2_train err2b_train_clean \\\n", "10 AdaBoostClassifier 0.333333 0.333333 0.333333 \n", "3 DecisionTreeClassifier 0.040000 0.000000 0.120000 \n", "4 ExtraTreeClassifier 0.040000 0.000000 0.073333 \n", "6 ExtraTreesClassifier 0.040000 0.000000 0.066667 \n", "8 GaussianNB 0.333333 0.333333 0.333333 \n", "1 GradientBoostingClassifier 0.086667 0.087333 0.166667 \n", "9 KNeighborsClassifier 0.110000 0.094667 0.106667 \n", "0 LogisticRegression 0.333333 0.333333 0.333333 \n", "7 MLPClassifier 0.333333 0.333333 0.333333 \n", "2 RandomForestClassifier 0.046667 0.000667 0.090000 \n", "5 XGBClassifier 0.123333 0.108667 0.173333 \n", "\n", " err_test err2_test err2b_test_clean \n", "10 0.333333 0.333333 0.333333 \n", "3 0.180000 0.209333 0.240000 \n", "4 0.213333 0.232000 0.220000 \n", "6 0.213333 0.168000 0.160000 \n", "8 0.333333 0.333333 0.333333 \n", "1 0.173333 0.192000 0.186667 \n", "9 0.113333 0.158667 0.153333 \n", "0 0.333333 0.333333 0.333333 \n", "7 0.333333 0.333333 0.333333 \n", "2 0.160000 0.188000 0.226667 \n", "5 0.153333 0.204000 0.193333 "]}, "execution_count": 18, "metadata": {}, "output_type": "execute_result"}], "source": ["def multiplie_bruit_centree_duplique(X, Y, classes=None):\n", " if classes is None:\n", " classes = numpy.unique(Y)\n", " XS = []\n", " YS = []\n", " for i in classes:\n", " \n", " for k in range(0,5):\n", " #X2 = numpy.random.randn((X.shape[0]* 3)).reshape(X.shape[0], 3) * 0.3\n", " X2 = numpy.random.random((X.shape[0], 3)) * 0.8 - 0.4\n", " X2[:,i] += 1\n", " Yb = Y == i\n", " XS.append(numpy.hstack([X, X2]))\n", " Yb = Yb.reshape((len(Yb), 1))\n", " YS.append(Yb)\n", " \n", " Xext = numpy.vstack(XS)\n", " Yext = numpy.vstack(YS)\n", " return Xext, Yext\n", "\n", "dfb = comparaison_train_test(models, X, Y, mbruit=multiplie_bruit_centree_duplique, acp=None)\n", "dfb"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Cela fonctionne un peu mieux le fait d'ajouter du hasard ne permet pas 
d'obtenir des gains significatifs \u00e0 part pour le mod\u00e8le [SVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)."]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
\n", " \n", " \n", " | \n", " modelTT | \n", " err_train | \n", " err2_train | \n", " err2b_train_clean | \n", " err_test | \n", " err2_test | \n", " err2b_test_clean | \n", "
\n", " \n", " \n", " \n", " 10 | \n", " AdaBoostClassifier | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", "
\n", " \n", " 3 | \n", " DecisionTreeClassifier | \n", " 0.033333 | \n", " 0.000000 | \n", " 0.143333 | \n", " 0.200000 | \n", " 0.233333 | \n", " 0.193333 | \n", "
\n", " \n", " 4 | \n", " ExtraTreeClassifier | \n", " 0.033333 | \n", " 0.000000 | \n", " 0.246667 | \n", " 0.233333 | \n", " 0.320000 | \n", " 0.300000 | \n", "
\n", " \n", " 6 | \n", " ExtraTreesClassifier | \n", " 0.033333 | \n", " 0.000000 | \n", " 0.143333 | \n", " 0.206667 | \n", " 0.220000 | \n", " 0.180000 | \n", "
\n", " \n", " 8 | \n", " GaussianNB | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", "
\n", " \n", " 1 | \n", " GradientBoostingClassifier | \n", " 0.090000 | \n", " 0.013333 | \n", " 0.133333 | \n", " 0.220000 | \n", " 0.206667 | \n", " 0.186667 | \n", "
\n", " \n", " 9 | \n", " KNeighborsClassifier | \n", " 0.103333 | \n", " 0.110000 | \n", " 0.123333 | \n", " 0.206667 | \n", " 0.180000 | \n", " 0.186667 | \n", "
\n", " \n", " 0 | \n", " LogisticRegression | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", "
\n", " \n", " 7 | \n", " MLPClassifier | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", " 0.333333 | \n", "
\n", " \n", " 2 | \n", " RandomForestClassifier | \n", " 0.040000 | \n", " 0.000000 | \n", " 0.146667 | \n", " 0.180000 | \n", " 0.266667 | \n", " 0.173333 | \n", "
\n", " \n", " 5 | \n", " XGBClassifier | \n", " 0.100000 | \n", " 0.033333 | \n", " 0.210000 | \n", " 0.206667 | \n", " 0.240000 | \n", " 0.246667 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" modelTT err_train err2_train err2b_train_clean \\\n", "10 AdaBoostClassifier 0.333333 0.333333 0.333333 \n", "3 DecisionTreeClassifier 0.033333 0.000000 0.143333 \n", "4 ExtraTreeClassifier 0.033333 0.000000 0.246667 \n", "6 ExtraTreesClassifier 0.033333 0.000000 0.143333 \n", "8 GaussianNB 0.333333 0.333333 0.333333 \n", "1 GradientBoostingClassifier 0.090000 0.013333 0.133333 \n", "9 KNeighborsClassifier 0.103333 0.110000 0.123333 \n", "0 LogisticRegression 0.333333 0.333333 0.333333 \n", "7 MLPClassifier 0.333333 0.333333 0.333333 \n", "2 RandomForestClassifier 0.040000 0.000000 0.146667 \n", "5 XGBClassifier 0.100000 0.033333 0.210000 \n", "\n", " err_test err2_test err2b_test_clean \n", "10 0.333333 0.333333 0.333333 \n", "3 0.200000 0.233333 0.193333 \n", "4 0.233333 0.320000 0.300000 \n", "6 0.206667 0.220000 0.180000 \n", "8 0.333333 0.333333 0.333333 \n", "1 0.220000 0.206667 0.186667 \n", "9 0.206667 0.180000 0.186667 \n", "0 0.333333 0.333333 0.333333 \n", "7 0.333333 0.333333 0.333333 \n", "2 0.180000 0.266667 0.173333 \n", "5 0.206667 0.240000 0.246667 "]}, "execution_count": 19, "metadata": {}, "output_type": "execute_result"}], "source": ["def multiplie_bruit_centree_duplique_rebalance(X, Y, classes=None):\n", " if classes is None:\n", " classes = numpy.unique(Y)\n", " XS = []\n", " YS = []\n", " for i in classes:\n", " \n", " X2 = numpy.random.random((X.shape[0], 3)) * 0.8 - 0.4\n", " X2[:,i] += 1 # * ((i % 2) * 2 - 1)\n", " Yb = Y == i\n", " XS.append(numpy.hstack([X, X2]))\n", " Yb = Yb.reshape((len(Yb), 1))\n", " YS.append(Yb)\n", " \n", " \n", " Xext = numpy.vstack(XS)\n", " Yext = numpy.vstack(YS)\n", " return Xext, Yext\n", "\n", "dfb = comparaison_train_test(models, X, Y, mbruit=multiplie_bruit_centree_duplique_rebalance)\n", "dfb"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Petite explication\n", "\n", "Dans tout le notebook, le score de la r\u00e9gression logistique est nul. 
It simply fails to learn because the chosen problem is not linearly separable. If it were, it would mean that the following problem is too."]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [{"data": {"text/plain": ["(array([[1., 0., 0., 1., 0., 0.],\n", " [1., 0., 0., 0., 1., 0.],\n", " [1., 0., 0., 0., 0., 1.],\n", " [0., 1., 0., 1., 0., 0.],\n", " [0., 1., 0., 0., 1., 0.],\n", " [0., 1., 0., 0., 0., 1.],\n", " [0., 0., 1., 1., 0., 0.],\n", " [0., 0., 1., 0., 1., 0.],\n", " [0., 0., 1., 0., 0., 1.]]), array([[1.],\n", " [0.],\n", " [0.],\n", " [0.],\n", " [1.],\n", " [0.],\n", " [0.],\n", " [0.],\n", " [1.]]))"]}, "execution_count": 20, "metadata": {}, "output_type": "execute_result"}], "source": ["# Each row pairs a one-hot encoding of i // 3 with a one-hot encoding\n", "# of i % 3; the label is 1 when the two match. This equality problem is\n", "# not linearly separable.\n", "M = numpy.zeros((9, 6))\n", "Y = numpy.zeros((9, 1))\n", "for i in range(9):\n", "    M[i, i // 3] = 1\n", "    M[i, i % 3 + 3] = 1\n", "    Y[i] = 1 if i // 3 == i % 3 else 0\n", "M, Y"]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [{"data": {"text/plain": ["LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, max_iter=100, multi_class='ovr',\n", " n_jobs=None, penalty='l2', random_state=None, solver='liblinear',\n", " tol=0.0001, verbose=0, warm_start=False)"]}, "execution_count": 21, "metadata": {}, "output_type": "execute_result"}], "source": ["clf = LogisticRegression(multi_class=\"ovr\", solver=\"liblinear\")\n", "clf.fit(M, Y.ravel())"]}, {"cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([0., 0., 0., 0., 0., 0., 0., 0., 0.])"]}, "execution_count": 22, "metadata": {}, "output_type": "execute_result"}], "source": ["clf.predict(M)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["To be revisited."]}, {"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", 
"language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0"}}, "nbformat": 4, "nbformat_minor": 2}