{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 1A.e - TD not\u00e9, 16 d\u00e9cembre 2016\n", "\n", "R\u00e9gression lin\u00e9aire avec des variables cat\u00e9gorielles."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["<div id=\"my_id_menu_nb\">run previous cell, wait for 2 seconds</div>\n", "<script>\n", "function repeat_indent_string(n){\n", "    var a = \"\" ;\n", "    for ( ; n > 0 ; --n)\n", "        a += \"    \";\n", "    return a;\n", "}\n", "var update_menu_string = function(begin, lfirst, llast, sformat, send, keep_item, begin_format, end_format) {\n", "    var anchors = document.getElementsByClassName(\"section\");\n", "    if (anchors.length == 0) {\n", "        anchors = document.getElementsByClassName(\"text_cell_render rendered_html\");\n", "    }\n", "    var i,t;\n", "    var text_menu = begin;\n", "    var text_memo = \"<pre>\\nlength:\" + anchors.length + \"\\n\";\n", "    var ind = \"\";\n", "    var memo_level = 1;\n", "    var href;\n", "    var tags = [];\n", "    var main_item = 0;\n", "    var format_open = 0;\n", "    for (i = 0; i <= llast; i++)\n", "        tags.push(\"h\" + i);\n", "\n", "    for (i = 0; i < anchors.length; i++) {\n", "        text_memo += \"**\" + anchors[i].id + \"--\\n\";\n", "\n", "        var child = null;\n", "        for(t = 0; t < tags.length; t++) {\n", "            var r = anchors[i].getElementsByTagName(tags[t]);\n", "            if (r.length > 0) {\n", "child = r[0];\n", "break;\n", "            }\n", "        }\n", "        if (child == null) {\n", "            text_memo += \"null\\n\";\n", "            continue;\n", "        }\n", "        if (anchors[i].hasAttribute(\"id\")) {\n", "            // when converted in RST\n", "            href = anchors[i].id;\n", "            text_memo += \"#1-\" + href;\n", "            // passer \u00e0 child suivant (le chercher)\n", "        }\n", "        else if (child.hasAttribute(\"id\")) {\n", "            // in a notebook\n", "            href = child.id;\n", "            text_memo += \"#2-\" + href;\n", "        }\n", "        else {\n", "            text_memo += \"#3-\" + \"*\" + \"\\n\";\n", "            continue;\n", "        }\n", "        var title = child.textContent;\n", "        var level = parseInt(child.tagName.substring(1,2));\n", "\n", "        text_memo += \"--\" + level + \"?\" + lfirst + \"--\" + title + \"\\n\";\n", "\n", "        if ((level < lfirst) || (level > llast)) {\n", "            continue ;\n", "        }\n", "        if (title.endsWith('\u00b6')) {\n", "            title = title.substring(0,title.length-1).replace(\"<\", \"&lt;\")\n", "         .replace(\">\", \"&gt;\").replace(\"&\", \"&amp;\");\n", "        }\n", "        if (title.length == 0) {\n", "            continue;\n", "        }\n", "\n", "        while (level < memo_level) {\n", "            text_menu += end_format + \"</ul>\\n\";\n", "            format_open -= 1;\n", "            memo_level -= 1;\n", "        }\n", "        if (level == lfirst) {\n", "            main_item += 1;\n", "        }\n", "        if (keep_item != -1 && main_item != keep_item + 1) {\n", "            // alert(main_item + \" - \" + level + \" - \" + keep_item);\n", "            continue;\n", "        }\n", "        while (level > memo_level) {\n", "            text_menu += \"<ul>\\n\";\n", "            memo_level += 1;\n", "        }\n", "        text_menu += repeat_indent_string(level-2);\n", "        text_menu += begin_format + sformat.replace(\"__HREF__\", href).replace(\"__TITLE__\", title);\n", "        format_open += 1;\n", "    }\n", "    while (1 < memo_level) {\n", "        text_menu += end_format + \"</ul>\\n\";\n", "        memo_level -= 1;\n", "        format_open -= 1;\n", "    }\n", "    text_menu += send;\n", "    //text_menu += \"\\n\" + text_memo;\n", "\n", "    while (format_open > 0) {\n", "        text_menu += end_format;\n", "        format_open -= 1;\n", "    }\n", "    return text_menu;\n", "};\n", "var update_menu = function() {\n", "    var sbegin = \"\";\n", "    var sformat = '<a href=\"#__HREF__\">__TITLE__</a>';\n", "    var send = \"\";\n", "    var begin_format = '<li>';\n", "    var end_format = '</li>';\n", "    var keep_item = -1;\n", "    var text_menu = update_menu_string(sbegin, 2, 4, sformat, send, keep_item,\n", "       begin_format, end_format);\n", "    var menu = document.getElementById(\"my_id_menu_nb\");\n", "    menu.innerHTML=text_menu;\n", "};\n", "window.setTimeout(update_menu,2000);\n", "            </script>"], "text/plain": ["<IPython.core.display.HTML object>"]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercice 1"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On suppose qu'on dispose d'un ensemble d'observations $(X_i, Y_i)$ avec $X_i, Y_i \\in \\mathbb{R}$. La r\u00e9gression lin\u00e9aire consiste une relation lin\u00e9aire $Y_i = a X_i + b + \\epsilon_i$ qui minimise la variance du bruit. On pose :\n", "\n", "$$E(a, b) = \\sum_i (Y_i - (a X_i + b))^2$$\n", "\n", "On cherche $a, b$ tels que :\n", "\n", "$$a^*, b^* = \\arg \\min E(a, b) = \\arg \\min \\sum_i (Y_i - (a X_i + b))^2$$\n", "\n", "La fonction est d\u00e9rivable et on trouve :\n", "\n", "$$\\frac{\\partial E(a,b)}{\\partial a} = - 2 \\sum_i X_i ( Y_i - (a X_i + b)) \\text{ et } \\frac{\\partial E(a,b)}{\\partial b} = - 2 \\sum_i ( Y_i - (a X_i + b))$$\n", "\n", "Il suffit alors d'annuler les d\u00e9riv\u00e9es. On r\u00e9soud un syst\u00e8me d'\u00e9quations lin\u00e9aires. On note :\n", "\n", "$$\\begin{array}{l} \\mathbb{E} X = \\frac{1}{n}\\sum_{i=1}^n X_i \\text{ et } \\mathbb{E} Y = \\frac{1}{n}\\sum_{i=1}^n Y_i \\\\ \\mathbb{E}(X^2) = \\frac{1}{n}\\sum_{i=1}^n X_i^2 \\text{ et } \\mathbb{E}(XY) = \\frac{1}{n}\\sum_{i=1}^n X_i Y_i \\end{array}$$\n", "\n", "Finalement :\n", "\n", "$$\\begin{array}{l} a^* = \\frac{ \\mathbb{E}(XY) - \\mathbb{E} X \\mathbb{E} Y}{\\mathbb{E}(X^2) - (\\mathbb{E} X)^2} \\text{ et } b^* = \\mathbb{E} Y - a^* \\mathbb{E} X \\end{array}$$"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q1\n", "\n", "On g\u00e9n\u00e8re un nuage de points avec le code suivant :"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/plain": ["[(2.996890181837922, 2.8750295096923186),\n", " (4.264526460045277, 2.324063943726332),\n", " (4.718648422500299, 3.052469543647318),\n", " (2.442689562115705, 3.861870829036456),\n", " (0.13558433730903707, 0.5754835901808546),\n", " (5.59230695209076, 1.6209924216651825),\n", " (7.610357428256408, 3.3202733390571373),\n", " (8.678403330137792, 4.96766236219644),\n", " (9.427259745518597, 6.385862058140737),\n", " (9.273956381823456, 4.938275166261537)]"]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["import random\n", "def generate_xy(n=100, a=0.5, b=1):\n", "    res = []\n", "    for i in range(0, n):\n", "        x = random.uniform(0, 10)\n", "        res.append((x, x*a + b + random.gauss(0,1)))\n", "    return res\n", "\n", "generate_xy(10)"]}, {"cell_type": "markdown", "metadata": {"collapsed": true}, "source": ["### Q2\n", "\n", "Ecrire une fonction qui calcule $\\mathbb{E} X, \\mathbb{E} Y, \\mathbb{E}(XY), \\mathbb{E}(X^2)$. Plusieurs \u00e9tudiants m'ont demand\u00e9 ce qu'\u00e9tait ``obs``. C'est simplement le r\u00e9sultat de la fonction pr\u00e9c\u00e9dente."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/plain": ["(5.523441805914873, 3.850511796328412, 25.88928454527569, 38.98854258182378)"]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["def calcule_exyxyx2(obs):\n", "    sx = 0\n", "    sy = 0\n", "    sxy = 0\n", "    sx2 = 0\n", "    for x, y in obs:\n", "        sx += x\n", "        sy += y\n", "        sxy += x * y\n", "        sx2 += x * x\n", "    n = len(obs)\n", "    return sx/n, sy/n, sxy/n, sx2/n\n", "\n", "obs = generate_xy(10)\n", "calcule_exyxyx2(obs)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q3\n", "\n", " Calculer les grandeurs $a^*$, $b^*$. A priori, on doit retrouver quelque chose d'assez proche des valeurs choisies pour la premi\u00e8re question : $a=0.5$, $b=1$."]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/plain": ["(-0.5446995618974346, 6.859128128176218)"]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["def calcule_ab(obs):\n", "    sx, sy, sxy, sx2 = calcule_exyxyx2(obs)\n", "    a = (sxy - sx * sx)  / (sx2 - sx**2)\n", "    b = sy - a * sx\n", "    return a, b\n", "\n", "calcule_ab(obs)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q4\n", "\n", "Compl\u00e9ter le programme."]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/plain": ["[('vert', 5.132157444058703),\n", " ('vert', 6.088324149707968),\n", " ('rouge', 0.16315983779393228),\n", " ('rouge', 0.9717657424738734),\n", " ('rouge', 2.843197432779423),\n", " ('rouge', 0.7204386278807904),\n", " ('bleu', 21.89226869979884),\n", " ('rouge', -0.16605748011543708),\n", " ('rouge', -0.02903894820027486),\n", " ('rouge', 0.5787816483863786)]"]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["import random\n", "def generate_caty(n=100, a=0.5, b=1, cats=[\"rouge\", \"vert\", \"bleu\"]):\n", "    res = []\n", "    for i in range(0, n):\n", "        x = random.randint(0,2)\n", "        cat = cats[x]\n", "        res.append((cat, 10*x**2*a + b + random.gauss(0,1)))\n", "    return res\n", "\n", "generate_caty(10)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q5\n", "\n", "On voudrait quand m\u00eame faire une r\u00e9gression de la variable $Y$ sur la variable cat\u00e9gorielle. On construit une fonction qui assigne un num\u00e9ro quelconque mais distinct \u00e0 chaque cat\u00e9gorie. La fonction retourne un dictionnaire avec les cat\u00e9gories comme cl\u00e9 et les num\u00e9ros comme valeurs."]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/plain": ["{'bleu': 0, 'rouge': 2, 'vert': 1}"]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["def numero_cat(obs):\n", "    mapping = {}\n", "    for color, y in obs:\n", "        if color not in mapping:\n", "            mapping[color] = len(mapping)\n", "    return mapping\n", "\n", "obs = generate_caty(100)\n", "numero_cat(obs)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q6 \n", "\n", "On construit la matrice $M_{ic}$ tel que : $M_{ic}$ vaut 1 si $c$ est le num\u00e9ro de la cat\u00e9gorie $X_i$, 0 sinon."]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[ 1.,  0.,  0.],\n", "       [ 0.,  1.,  0.],\n", "       [ 1.,  0.,  0.],\n", "       [ 1.,  0.,  0.],\n", "       [ 0.,  0.,  1.]])"]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["import numpy\n", "def construit_M(obs):\n", "    mapping = numero_cat(obs)\n", "    M = numpy.zeros((len(obs), 3))\n", "    for i, (color, y) in enumerate(obs):\n", "        cat = mapping[color]\n", "        M[i, cat] = 1.0\n", "    return M\n", "    \n", "M = construit_M(obs)\n", "M[:5]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q7 \n", "\n", "Il est conseill\u00e9 de convertir la matrice $M$ et les $Y$ au format *numpy*. On ajoute un vecteur constant \u00e0 la matrice $M$. La requ\u00eate *numpy add column* sur un moteur de recherche vous directement \u00e0 ce r\u00e9sultat : [How to add an extra column to an numpy array](http://stackoverflow.com/questions/8486294/how-to-add-an-extra-column-to-an-numpy-array)."]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/plain": ["(array([[ 1.,  0.,  0.,  1.],\n", "        [ 0.,  1.,  0.,  1.],\n", "        [ 1.,  0.,  0.,  1.],\n", "        [ 1.,  0.,  0.,  1.],\n", "        [ 0.,  0.,  1.,  1.]]), array([[ 21.15485572],\n", "        [  6.37882494],\n", "        [ 21.37124634],\n", "        [ 21.77476221],\n", "        [  2.03305199]]))"]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["def convert_numpy(obs):\n", "    M = construit_M(obs)\n", "    Mc = numpy.hstack([M, numpy.ones((M.shape[0], 1))])\n", "    Y = numpy.array([y for c, y in obs])\n", "    return M, Mc, Y.reshape((M.shape[0], 1))\n", "\n", "M, Mc, Y = convert_numpy(obs)\n", "Mc[:5], Y[:5]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q8 \n", "\n", "On r\u00e9soud la r\u00e9gression multidimensionnelle en appliquant la formule $C^* = (M'M)^{-1}M'Y$. La question 7 ne servait pas \u00e0 grand chose except\u00e9 faire d\u00e9couvrir la fonction [hstack](https://docs.scipy.org/doc/numpy/reference/generated/numpy.hstack.html) car le rang de la matrice ``Mc`` est 3 < 4."]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[ 20.92499253],\n", "       [  6.14818418],\n", "       [  1.09988478]])"]}, "execution_count": 10, "metadata": {}, "output_type": "execute_result"}], "source": ["alpha = numpy.linalg.inv(M.T @ M) @ M.T @ Y\n", "alpha"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q9 \n", "\n", "La r\u00e9gression d\u00e9termine les coefficients $\\alpha$ dans la r\u00e9gression $Y_i = \\alpha_{rouge} \\mathbb{1}_{X_i = rouge} + \\alpha_{vert}  \\mathbb{1}_{X_i = vert} + \\alpha_{bleu} \\mathbb{1}_{X_i = bleu} + \\epsilon_i$.\n", "\n", "Construire le vecteur $\\hat{Y_i} = \\alpha_{rouge} \\mathbb{1}_{X_i = rouge} + \\alpha_{vert} \\mathbb{1}_{X_i = vert} + \\alpha_{bleu} \\mathbb{1}_{X_i = bleu}$."]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[ 20.92499253],\n", "       [  6.14818418],\n", "       [ 20.92499253],\n", "       [ 20.92499253],\n", "       [  1.09988478]])"]}, "execution_count": 11, "metadata": {}, "output_type": "execute_result"}], "source": ["Yp = numpy.zeros((M.shape[0], 1))\n", "for i in range(3):\n", "    Yp[ M[:,i] == 1, 0] = alpha[i, 0]\n", "Yp[:5]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q10\n", "\n", "Utiliser le r\u00e9sultat de la question 3 pour calculer les coefficients de la r\u00e9gression $Y_i = a^* \\hat{Y_i} + b^*$."]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"data": {"text/plain": ["(array([ 1.]), array([  1.77635684e-15]))"]}, "execution_count": 12, "metadata": {}, "output_type": "execute_result"}], "source": ["obs = [(x, y) for x, y in zip(Yp, Y)]\n", "calcule_ab( obs )"]}, {"cell_type": "markdown", "metadata": {"collapsed": true}, "source": ["On aboutit au r\u00e9sultat $Y = \\hat{Y} + \\epsilon$. On a associ\u00e9 une valeur \u00e0 chaque cat\u00e9gorie de telle sorte que la r\u00e9gression de $Y$ sur cette valeur soit cette valeur. Autrement dit, c'est la meilleur approximation de $Y$ sur chaque cat\u00e9gorie. A quoi cela correspond-il ? C'est le second \u00e9nonc\u00e9 qui r\u00e9pond \u00e0 cette question."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercice 2"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q1 - Q2 - Q3\n", "\n", "Ce sont les m\u00eames r\u00e9ponses."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q4\n", "\n", "Un moyen tr\u00e8s simple de simuler une [loi multinomiale](https://fr.wikipedia.org/wiki/Loi_multinomiale) est de partir d'une loi uniforme et discr\u00e8te \u00e0 valeur dans entre 1 et 10. On tire un nombre, s'il est inf\u00e9rieur ou \u00e9gal \u00e0 5, ce sera la cat\u00e9gorie 0, 1 si c'est inf\u00e9rieur \u00e0 8, 2 sinon."]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/plain": ["[('vert', 6.0084452843428675),\n", " ('rouge', 2.155449750270483),\n", " ('rouge', 2.1132607428792447),\n", " ('vert', 6.897729973062269),\n", " ('rouge', 0.7637316114791164),\n", " ('vert', 5.566787193134299),\n", " ('vert', 5.848567708215508),\n", " ('bleu', 19.722503065860707),\n", " ('rouge', 0.8043492141543047),\n", " ('bleu', 21.675781652825997)]"]}, "execution_count": 13, "metadata": {}, "output_type": "execute_result"}], "source": ["def generate_caty(n=100, a=0.5, b=1, cats=[\"rouge\", \"vert\", \"bleu\"]):\n", "    res = []\n", "    for i in range(0, n):\n", "        # on veut 50% de rouge, 30% de vert, 20% de bleu\n", "        x = random.randint(1, 10)\n", "        if x <= 5: x = 0\n", "        elif x <= 8: x = 1\n", "        else : x = 2\n", "        cat = cats[x]\n", "        res.append((cat, 10*x**2*a + b + random.gauss(0,1)))\n", "    return res\n", "\n", "obs = generate_caty(10)\n", "obs"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q5 \n", "\n", "On voudrait quand m\u00eame faire une r\u00e9gression de la variable $Y$ sur la variable cat\u00e9gorielle. On commence par les compter. Construire une fonction qui compte le nombre de fois qu'une cat\u00e9gorie est pr\u00e9sente dans les donn\u00e9es (un histogramme)."]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"data": {"text/plain": ["{'bleu': 2, 'rouge': 4, 'vert': 4}"]}, "execution_count": 14, "metadata": {}, "output_type": "execute_result"}], "source": ["def histogram_cat(obs):\n", "    h = dict()\n", "    for color, y in obs:\n", "        h[color] = h.get(color, 0) + 1\n", "    return h\n", "\n", "histogram_cat(obs)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q6\n", "\n", "Construire une fonction qui calcule la moyenne des $Y_i$ pour chaque cat\u00e9gorie : $\\mathbb{E}(Y | rouge)$, $\\mathbb{E}(Y | vert)$, $\\mathbb{E}(Y | bleu)$. La fonction retourne un dictionnaire ``{couleur:moyenne}``. "]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"data": {"text/plain": ["{'bleu': 20.69914235934335,\n", " 'rouge': 1.4591978296957873,\n", " 'vert': 6.080382539688736}"]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["def moyenne_cat(obs):\n", "    h = dict()\n", "    sy = dict()\n", "    for color, y in obs:\n", "        h[color] = h.get(color, 0) + 1\n", "        sy[color] = sy.get(color, 0) + y\n", "    for k, v in h.items():\n", "        sy[k] /= v\n", "    return sy\n", "\n", "moyenne_cat(obs)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["L'\u00e9nonc\u00e9 induisait quelque peu en erreur car la fonction sugg\u00e9r\u00e9e ne permet de calculer ces moyennes. Il suffit de changer."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q7\n", "\n", "Construire le vecteur $Z_i = \\mathbb{E}(Y | rouge)\\mathbb{1}_{X_i = rouge} + \\mathbb{E}(Y | vert) \\mathbb{1}_{X_i = vert} + \\mathbb{E}(Y | bleu) \\mathbb{1}_{X_i = bleu}$."]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/plain": ["[6.080382539688736,\n", " 1.4591978296957873,\n", " 1.4591978296957873,\n", " 6.080382539688736,\n", " 1.4591978296957873]"]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["moys = moyenne_cat(obs)\n", "Z = [moys[c] for c, y in obs]\n", "Z[:5]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q8\n", "\n", "Utiliser le r\u00e9sultat de la question 3 pour calculer les coefficients de la r\u00e9gression $Y_i = a^* Z_i + b^*$."]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"data": {"text/plain": ["(1.0, 1.7763568394002505e-15)"]}, "execution_count": 17, "metadata": {}, "output_type": "execute_result"}], "source": ["obs2 = [(z, y) for (c, y), z in zip(obs, Z)]\n", "calcule_ab( obs2 )"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On aboutit au r\u00e9sultat $Y = \\hat{Y} + \\epsilon$. On a associ\u00e9 une valeur \u00e0 chaque cat\u00e9gorie de telle sorte que la r\u00e9gression de $Y$ sur cette valeur soit cette valeur."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q9 \n", "\n", "Calculer la matrice de variance / covariance pour les variables $(Y_i)$, $(Z_i)$, $(Y_i - Z_i)$, $\\mathbb{1}_{X_i = rouge}$, $\\mathbb{1}_{X_i = vert}$, $\\mathbb{1}_{X_i = bleu}$. "]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[ 6.00844528,  6.08038254, -0.07193726,  0.        ,  1.        ,\n", "         0.        ],\n", "       [ 2.15544975,  1.45919783,  0.69625192,  1.        ,  0.        ,\n", "         0.        ],\n", "       [ 2.11326074,  1.45919783,  0.65406291,  1.        ,  0.        ,\n", "         0.        ],\n", "       [ 6.89772997,  6.08038254,  0.81734743,  0.        ,  1.        ,\n", "         0.        ],\n", "       [ 0.76373161,  1.45919783, -0.69546622,  1.        ,  0.        ,\n", "         0.        ]])"]}, "execution_count": 18, "metadata": {}, "output_type": "execute_result"}], "source": ["bigM = numpy.empty((len(obs), 6))\n", "bigM[:, 0] = [o[1] for o in obs]\n", "bigM[:, 1] = Z\n", "bigM[:, 2] = bigM[:, 0] - bigM[:, 1]\n", "bigM[:, 3] = [ 1 if o[0] == \"rouge\" else 0 for o in obs]\n", "bigM[:, 4] = [ 1 if o[0] == \"vert\" else 0 for o in obs]\n", "bigM[:, 5] = [ 1 if o[0] == \"bleu\" else 0 for o in obs]\n", "bigM[:5]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On utilise la fonction [cov](https://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html)."]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[  5.62221004e+01,   5.56972711e+01,   5.24829301e-01,\n", "         -2.53176124e+00,  -4.77901369e-01,   3.00966261e+00],\n", "       [  5.56972711e+01,   5.56972711e+01,  -1.92890933e-16,\n", "         -2.53176124e+00,  -4.77901369e-01,   3.00966261e+00],\n", "       [  5.24829301e-01,  -1.92890933e-16,   5.24829301e-01,\n", "         -5.54535166e-17,   7.40725030e-17,  -1.24510807e-17],\n", "       [ -2.53176124e+00,  -2.53176124e+00,  -5.54535166e-17,\n", "          2.66666667e-01,  -1.77777778e-01,  -8.88888889e-02],\n", "       [ -4.77901369e-01,  -4.77901369e-01,   7.40725030e-17,\n", "         -1.77777778e-01,   2.66666667e-01,  -8.88888889e-02],\n", "       [  3.00966261e+00,   3.00966261e+00,  -1.24510807e-17,\n", "         -8.88888889e-02,  -8.88888889e-02,   1.77777778e-01]])"]}, "execution_count": 19, "metadata": {}, "output_type": "execute_result"}], "source": ["c = numpy.cov(bigM.T)\n", "c"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On affiche un peu mieux les r\u00e9sultats :"]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [{"data": {"text/html": ["<div>\n", "<style>\n", "    .dataframe thead tr:only-child th {\n", "        text-align: right;\n", "    }\n", "\n", "    .dataframe thead th {\n", "        text-align: left;\n", "    }\n", "\n", "    .dataframe tbody tr th {\n", "        vertical-align: top;\n", "    }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>0</th>\n", "      <th>1</th>\n", "      <th>2</th>\n", "      <th>3</th>\n", "      <th>4</th>\n", "      <th>5</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>56.222</td>\n", "      <td>55.697</td>\n", "      <td>0.525</td>\n", "      <td>-2.532</td>\n", "      <td>-0.478</td>\n", "      <td>3.010</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>55.697</td>\n", "      <td>55.697</td>\n", "      <td>-0.000</td>\n", "      <td>-2.532</td>\n", "      <td>-0.478</td>\n", "      <td>3.010</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>0.525</td>\n", "      <td>-0.000</td>\n", "      <td>0.525</td>\n", "      <td>-0.000</td>\n", "      <td>0.000</td>\n", "      <td>-0.000</td>\n", "    </tr>\n", "    <tr>\n", "      <th>3</th>\n", "      <td>-2.532</td>\n", "      <td>-2.532</td>\n", "      <td>-0.000</td>\n", "      <td>0.267</td>\n", "      <td>-0.178</td>\n", "      <td>-0.089</td>\n", "    </tr>\n", "    <tr>\n", "      <th>4</th>\n", "      <td>-0.478</td>\n", "      <td>-0.478</td>\n", "      <td>0.000</td>\n", "      <td>-0.178</td>\n", "      <td>0.267</td>\n", "      <td>-0.089</td>\n", "    </tr>\n", "    <tr>\n", "      <th>5</th>\n", "      <td>3.010</td>\n", "      <td>3.010</td>\n", "      <td>-0.000</td>\n", "      <td>-0.089</td>\n", "      <td>-0.089</td>\n", "      <td>0.178</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>"], "text/plain": ["        0       1       2       3       4       5\n", "0  56.222  55.697   0.525  -2.532  -0.478   3.010\n", "1  55.697  55.697  -0.000  -2.532  -0.478   3.010\n", "2   0.525  -0.000   0.525  -0.000   0.000  -0.000\n", "3  -2.532  -2.532  -0.000   0.267  -0.178  -0.089\n", "4  -0.478  -0.478   0.000  -0.178   0.267  -0.089\n", "5   3.010   3.010  -0.000  -0.089  -0.089   0.178"]}, "execution_count": 20, "metadata": {}, "output_type": "execute_result"}], "source": ["import pandas\n", "pandas.DataFrame(c).applymap(lambda x: '%1.3f' % x)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q10 \n", "\n", "On permute rouge et vert. Construire le vecteur $W_i = \\mathbb{E}(Y | rouge)\\mathbb{1}_{X_i = vert} + \\mathbb{E}(Y | vert)\\mathbb{1}_{X_i = rouge} + \\mathbb{E}(Y | bleu)\\mathbb{1}_{X_i = bleu}$. Utiliser le r\u00e9sultat de la question 3 pour calculer les coefficients de la r\u00e9gression $Y_i = a^* W_i + b^*$. V\u00e9rifiez que l'erreur est sup\u00e9rieure."]}, {"cell_type": "code", "execution_count": 20, "metadata": {"collapsed": true}, "outputs": [], "source": ["moys = moyenne_cat(obs)\n", "moys[\"rouge\"], moys[\"vert\"] = moys.get(\"vert\", 0), moys.get(\"rouge\", 0)"]}, {"cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [{"data": {"text/plain": ["(0.829591905722086, 1.2193824894893863)"]}, "execution_count": 22, "metadata": {}, "output_type": "execute_result"}], "source": ["W = [moys[c] for c, y in obs]\n", "obs3 = [(w, y) for (c, y), w in zip(obs, W)]\n", "calcule_ab( obs3 )"]}, {"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [{"data": {"text/plain": ["(0.4723463712054069, 16.100975199731273)"]}, "execution_count": 23, "metadata": {}, "output_type": "execute_result"}], "source": ["def calcule_erreur(obs):\n", "    a, b = calcule_ab(obs)\n", "    e = [(a*x + b - y)**2 for x, y in obs]\n", "    return sum(e) / len(obs)\n", "\n", "calcule_erreur(obs2), calcule_erreur(obs3)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["C'est carr\u00e9ment sup\u00e9rieur."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Conclusion\n", "\n", "L'[analyse des correspondances multiples](https://fr.wikipedia.org/wiki/Analyse_des_correspondances_multiples) est une fa\u00e7on d'\u00e9tudier les modalit\u00e9s de variables cat\u00e9gorielles mais cela ne fait pas de la pr\u00e9diction. Le mod\u00e8le [logit - probit](https://fr.wikipedia.org/wiki/Mod%C3%A8le_Probit) pr\u00e9dit une variable binaire \u00e0 partir de variables continue mais dans notre cas, c'est la variable \u00e0 pr\u00e9dire qui est continue. Pour effectuer une pr\u00e9diction, il convertit les cat\u00e9gories en variables num\u00e9riques (voir [Categorical Variables](https://en.wikipedia.org/wiki/Categorical_variable)). Le langage R est plus outill\u00e9 pour cela : [Regression on categorical variables](https://www.r-bloggers.com/regression-on-categorical-variables/). Le module [categorical-encoding](https://github.com/scikit-learn-contrib/categorical-encoding) est disponible en python. Cet examen d\u00e9crit une m\u00e9thode parmi d'autres pour transformer les cat\u00e9gories en variables continues."]}, {"cell_type": "code", "execution_count": 23, "metadata": {"collapsed": true}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1"}}, "nbformat": 4, "nbformat_minor": 2}