{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 1A.e - TD not\u00e9, 21 f\u00e9vrier 2017\n", "\n", "Solution du TD not\u00e9, celui-ci pr\u00e9sente un algorithme pour calculer les coefficients d'une r\u00e9gression quantile et par extension d'une m\u00e9diane dans un espace \u00e0 plusieurs dimensions."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
run previous cell, wait for 2 seconds
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Pr\u00e9cision : dans tout l'\u00e9nonc\u00e9, la transpos\u00e9e d'une matrice est not\u00e9e $X' = X^{T}$. La plupart du temps $X$ et $Y$ d\u00e9signent des vecteurs colonnes. $\\beta$ d\u00e9signe un vecteur ligne, $W$ une matrice diagonale."]}, {"cell_type": "markdown", "metadata": {}, "source": ["# Exercice 1"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q1 \n", "\n", "A l'aide du module [random](https://docs.python.org/3/library/random.html), g\u00e9n\u00e9rer un ensemble de points al\u00e9atoires."]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/plain": ["[1000, 51, 83, 29, 15, 62, 90, 28, 61, 40]"]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["import random\n", "\n", "def ensemble_aleatoire(n):\n", " res = [random.randint(0, 100) for i in range(n)]\n", " res[0] = 1000\n", " return res\n", "\n", "ens = ensemble_aleatoire(10)\n", "ens"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q2\n", "\n", "La m\u00e9diane d'un ensemble de points $\\left\\{X_1, ..., X_n\\right\\}$ est une valeur $X_M$ telle que : \n", "\n", "$$\\sum_i \\mathbb{1}_{X_i < X_m} = \\sum_i \\mathbb{1}_{X_i > X_m}$$\n", "\n", "Autrement dit, il y a autant de valeurs inf\u00e9rieures que sup\u00e9rieures \u00e0 $X_M$. On obtient cette valeur en triant les \u00e9l\u00e9ments par ordre croissant et en prenant celui du milieu.\n", "\n"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/plain": ["61"]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["def mediane(ensemble):\n", " tri = list(sorted(ensemble))\n", " return tri[len(tri)//2]\n", "\n", "mediane(ens)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q3\n", "\n", "Lorsque le nombre de points est pair, la m\u00e9diane peut \u00eatre n'importe quelle valeur dans un intervalle. Modifier votre fonction de fa\u00e7on \u00e0 ce que la fonction pr\u00e9c\u00e9dente retourne le milieu de la fonction."]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/plain": ["56.0"]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["def mediane(ensemble):\n", " tri = list(sorted(ensemble))\n", " if len(tri) % 2 == 0:\n", " m = len(tri)//2\n", " return (tri[m] + tri[m-1]) / 2\n", " else:\n", " return tri[len(tri)//2]\n", "\n", "mediane(ens)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q4\n", "\n", "Pour un ensemble de points $E=\\left\\{X_1, ..., X_n\\right\\}$, on consid\u00e8re la fonction suivante : \n", "\n", "$$f(x) = \\sum_{i=1}^n \\left | x - X_i\\right |$$.\n", "\n", "On suppose que la m\u00e9diane $X_M$ de l'ensemble $E$ n'appartient pas \u00e0 $E$ : $X_M \\notin E$. Que vaut $f'(X_M)$ ?\n", "On acceptera le fait que la m\u00e9diane est le seul point dans ce cas.\n", "\n", "$$f'(X_m) = - \\sum_{i=1}^n \\mathbb{1}_{X_i < X_m} + \\sum_{i=1}^n \\mathbb{1}_{X_i > X_m}$$\n", "\n", "Par d\u00e9finition de la m\u00e9diane, $f'(X_M)=0$. En triant les \u00e9l\u00e9ments, on montre que la $f'(x) = 0 \\Longleftrightarrow x=X_m$."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q5\n", "\n", "On suppose qu'on dispose d'un ensemble d'observations $\\left(X_i, Y_i\\right)$ avec $X_i, Y_i \\in \\mathbb{R}$.\n", "La r\u00e9gression lin\u00e9aire consiste une relation lin\u00e9aire $Y_i = a X_i + b + \\epsilon_i$\n", "qui minimise la variance du bruit. On pose :\n", "\n", "$$E(a, b) = \\sum_i \\left(Y_i - (a X_i + b)\\right)^2$$\n", "\n", "On cherche $a, b$ tels que :\n", "\n", "$$a^*, b^* = \\arg \\min E(a, b) = \\arg \\min \\sum_i \\left(Y_i - (a X_i + b)\\right)^2$$\n", "\n", "La fonction est d\u00e9rivable et on trouve :\n", "\n", "$$\\frac{\\partial E(a,b)}{\\partial a} = - 2 \\sum_i X_i ( Y_i - (a X_i + b)) \\text{ et } \\frac{\\partial E(a,b)}{\\partial b} = - 2 \\sum_i ( Y_i - (a X_i + b))$$\n", "\n", "Il suffit alors d'annuler les d\u00e9riv\u00e9es. On r\u00e9soud un syst\u00e8me d'\u00e9quations lin\u00e9aires. On note :\n", "\n", "$$\\begin{array}{l} \\mathbb{E} X = \\frac{1}{n}\\sum_{i=1}^n X_i \\text{ et } \\mathbb{E} Y = \\frac{1}{n}\\sum_{i=1}^n Y_i \\\\ \\mathbb{E}{X^2} = \\frac{1}{n}\\sum_{i=1}^n X_i^2 \\text{ et } \\mathbb{E} {XY} = \\frac{1}{n}\\sum_{i=1}^n X_i Y_i \\end{array}$$\n", "\n", "Finalement :\n", "\n", "$$\\begin{array}{l} a^* = \\frac{ \\mathbb{E} {XY} - \\mathbb{E} X \\mathbb{E} Y}{\\mathbb{E}{X^2} - (\\mathbb{E} X)^2} \\text{ et } b^* = \\mathbb{E} Y - a^* \\mathbb{E} X \\end{array}$$\n", "\n", "Lorsqu'on a plusieurs dimensions pour $X$, on \u00e9crit le probl\u00e8me d'optimisation, on cherche les coefficients $\\beta^*$ qui minimisent :\n", "\n", "$$E(\\beta)=\\sum_{i=1}^n \\left(y_i - X_i \\beta\\right)^2 = \\left \\Vert Y - X\\beta \\right \\Vert ^2$$\n", "\n", "La solution est : $\\beta^* = (X'X)^{-1}X'Y$.\n", "\n", "Ecrire une fonction qui calcule ce vecteur optimal."]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[ 1.00141843]])"]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["from numpy.linalg import inv\n", "\n", "def regression_lineaire(X, Y):\n", " t = X.T\n", " return inv(t @ X) @ t @ Y\n", "\n", "import numpy\n", "X = numpy.array(ens).reshape((len(ens), 1))\n", "regression_lineaire(X, X+1) # un essai pour v\u00e9rifier que la valeur n'est pas aberrante"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q6\n", "\n", "Ecrire une fonction qui transforme un vecteur en une matrice diagonale."]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[1, 0, 0],\n", " [0, 2, 0],\n", " [0, 0, 3]])"]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["def matrice_diagonale(W):\n", " return numpy.diag(W)\n", "\n", "matrice_diagonale([1, 2, 3])"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q7\n", "\n", "On consid\u00e8re maintenant que chaque observation est pond\u00e9r\u00e9e par un poids $w_i$. On veut maintenant trouver le vecteur $\\beta$ qui minimise :\n", "\n", "$$E(\\beta)=\\sum_{i=1}^n w_i \\left( y_i - X_i \\beta \\right)^2 = \\left \\Vert W^{\\frac{1}{2}}(Y - X\\beta)\\right \\Vert^2$$\n", "\n", "O\u00f9 $W=diag(w_1, ..., w_n)$ est la matrice diagonale. La solution est :\n", "\n", "$$\\beta_* = (X'WX)^{-1}X'WY$$.\n", "\n", "Ecrire une fonction qui calcule la solution de la r\u00e9gression pond\u00e9r\u00e9e. La fonction [ravel](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ravel.html) est utile."]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"text/plain": ["(array([[ 1.]]), array([[ 1.01240451]]))"]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["def regression_lineaire_ponderee(X, Y, W):\n", " if len(W.shape) == 1 or W.shape[0] != W.shape[1]:\n", " # c'est un vecteur\n", " W = matrice_diagonale(W.ravel())\n", " wx = W @ X\n", " xt = X.T\n", " return inv(xt @ wx) @ xt @ W @ Y\n", "\n", "\n", "X = numpy.array(sorted(ens)).reshape((len(ens), 1))\n", "Y = X.copy()\n", "Y[0] = max(X)\n", "W = numpy.ones(len(ens))\n", "W[0] = 0\n", "regression_lineaire_ponderee(X, Y, W), regression_lineaire(X, Y)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q8\n", "\n", "Ecrire une fonction qui calcule les quantit\u00e9s suivantes (fonctions [maximum](https://docs.scipy.org/doc/numpy/reference/generated/numpy.maximum.html), [reciprocal](https://docs.scipy.org/doc/numpy/reference/generated/numpy.reciprocal.html#numpy.reciprocal)).\n", "\n", "$$z_i = \\frac{1}{\\max\\left( \\delta, \\left|y_i - X_i \\beta\\right|\\right)}$$"]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[ 1.01330469e-03],\n", " [ 5.26315789e+00],\n", " [ 4.54545455e+00],\n", " [ 3.22580645e+00],\n", " [ 1.85185185e+00],\n", " [ 1.47058824e+00],\n", " [ 1.14942529e+00],\n", " [ 1.07526882e+00],\n", " [ 1.07526882e+00],\n", " [ 1.00000000e-01]])"]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["def calcule_z(X, beta, Y, W, delta=0.0001):\n", " epsilon = numpy.abs(Y - X @ beta) \n", " return numpy.reciprocal(numpy.maximum(epsilon, numpy.ones(epsilon.shape) * delta))\n", "\n", "calcule_z(X * 1.0, numpy.array([[1.01]]), Y, W)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q9\n", "\n", "On souhaite coder l'algorithme suivant :\n", "\n", "1. $w_i^{(1)} = 1$\n", "2. $\\beta_{(t)} = (X'W^{(t)}X)^{-1}X'W^{(t)}Y$\n", "3. $w_i^{(t+1)} = \\frac{1}{\\max\\left( \\delta, \\left|y_i - X_i \\beta^{(t)}\\right|\\right)}$\n", "4. $t = t+1$\n", "5. Retour \u00e0 l'\u00e9tape 2.\n"]}, {"cell_type": "code", "execution_count": 9, "metadata": {"scrolled": false}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["0 150.13052808 [[ 13.82243581]]\n", "1 104.79608014 [[ 3.21524459]]\n", "2 100.851019446 [[ 2.25815451]]\n", "3 100.36420567 [[ 2.12644545]]\n", "4 100.255554539 [[ 2.09141327]]\n", "5 100.220626093 [[ 2.0777948]]\n", "6 100.219023635 [[ 2.07639404]]\n", "7 100.21901041 [[ 2.07631459]]\n", "8 100.218994922 [[ 2.07622156]]\n", "9 100.218976948 [[ 2.07611358]]\n"]}, {"data": {"text/plain": ["array([[ 2.07611358]])"]}, "execution_count": 10, "metadata": {}, "output_type": "execute_result"}], "source": ["def algorithm(X, Y, delta=0.0001):\n", " W = numpy.ones(X.shape[0])\n", " for i in range(0, 10):\n", " beta = regression_lineaire_ponderee(X, Y, W)\n", " W = calcule_z(X, beta, Y, W, delta=delta)\n", " E = numpy.abs(Y - X @ beta).sum()\n", " print(i, E, beta)\n", " return beta\n", " \n", "X = numpy.random.rand(10, 1) \n", "Y = X*2 + numpy.random.rand()\n", "Y[0] = Y[0] + 100\n", "algorithm(X, Y)"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[ 13.82243581]])"]}, "execution_count": 11, "metadata": {}, "output_type": "execute_result"}], "source": ["regression_lineaire(X, Y)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q10"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"data": {"text/plain": ["34.5"]}, "execution_count": 12, "metadata": {}, "output_type": "execute_result"}], "source": ["ens = ensemble_aleatoire(10)\n", "Y = numpy.empty((len(ens), 1))\n", "Y[:,0] = ens\n", "X = numpy.ones((len(ens), 1))\n", "mediane(ens)"]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([ 131.1])"]}, "execution_count": 13, "metadata": {}, "output_type": "execute_result"}], "source": ["Y.mean(axis=0)"]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[ 131.1]])"]}, "execution_count": 14, "metadata": {}, "output_type": "execute_result"}], "source": ["regression_lineaire(X, Y)"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["0 1737.8 [[ 131.1]]\n", "1 1215.2110733 [[ 55.05276833]]\n", "2 1196.55478823 [[ 48.77739411]]\n", "3 1190.4919578 [[ 45.7459789]]\n", "4 1183.56462833 [[ 42.28231416]]\n", "5 1179.0 [[ 39.7931558]]\n", "6 1179.0 [[ 39.7931558]]\n", "7 1179.0 [[ 39.7931558]]\n", "8 1179.0 [[ 39.7931558]]\n", "9 1179.0 [[ 39.7931558]]\n"]}, {"data": {"text/plain": ["array([[ 39.7931558]])"]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["algorithm(X,Y)"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/plain": ["34.5"]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["mediane(ens)"]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"data": {"text/plain": ["[5, 6, 12, 14, 29, 40, 52, 67, 86, 1000]"]}, "execution_count": 17, "metadata": {}, "output_type": "execute_result"}], "source": ["list(sorted(ens))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["La r\u00e9gression lin\u00e9aire \u00e9gale la moyenne, l'algorithme s'approche de la m\u00e9diane."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Quelques explications et d\u00e9monstrations\n", "\n", "Cet \u00e9nonc\u00e9 est inspir\u00e9 de [Iteratively reweighted least squares](https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares). Cet algorithme permet notamment d'\u00e9tendre la notion de m\u00e9diane \u00e0 des espaces vectoriels de plusieurs dimensions. On peut d\u00e9termine un point $X_M$ qui minimise la quantit\u00e9 :\n", "\n", "$$\\sum_{i=1}^n \\left| X_i - X_M \\right |$$\n", "\n", "Nous reprenons l'algorithme d\u00e9crit ci-dessus :\n", "\n", "1. $w_i^{(1)} = 1$\n", "2. $\\beta_{(t)} = (X'W^{(t)}X)^{-1}X'W^{(t)}Y$\n", "3. $w_i^{(t+1)} = \\frac{1}{\\max\\left( \\delta, \\left|y_i - X_i \\beta^{(t)}\\right|\\right)}$\n", "4. $t = t+1$\n", "5. Retour \u00e0 l'\u00e9tape 2.\n", "\n", "L'erreur quadratique pond\u00e9r\u00e9 est :\n", "\n", "$$E_2(\\beta, W) = \\sum_{i=1}^n w_i \\left\\Vert Y_i - X_i \\beta \\right\\Vert^2$$\n", "\n", "Si $w_i = \\frac{1}{\\left|y_i - X_i \\beta\\right|}$, on remarque que :\n", "\n", "$$E_2(\\beta, W) = \\sum_{i=1}^n \\frac{\\left\\Vert Y_i - X_i \\beta \\right\\Vert^2}{\\left|y_i - X_i \\beta\\right|} = \\sum_{i=1}^n \\left|y_i - X_i \\beta\\right| = E_1(\\beta)$$\n", "\n", "On retombe sur l'erreur en valeur absolue optimis\u00e9e par la r\u00e9gression quantile. Comme l'\u00e9tape 2 consiste \u00e0 trouver les coefficients $\\beta$ qui minimise $E_2(\\beta, W^{(t)})$, par construction, il ressort que :\n", "\n", "$$E_1(\\beta^{(t+1)}) = E_2(\\beta^{(t+1)}, W^{(t)}) \\leqslant E_2(\\beta^{(t)}, W^{(t)}) = E_1(\\beta^{(t)})$$\n", "\n", "La suite $t \\rightarrow E_1(\\beta^{(t)})$ est suite d\u00e9croissant est minor\u00e9e par 0. Elle converge donc vers un minimum. Or la fonction $\\beta \\rightarrow E_1(\\beta)$ est une fonction convexe. Elle n'admet qu'un seul minimum (mais pas n\u00e9cessaire un seul point atteignant ce minimum). L'algorithme converge donc vers la m\u00e9diane. Le param\u00e8tre $\\delta$ est l\u00e0 pour \u00e9viter les erreurs de divisions par z\u00e9ros et les approximations de calcul faites par l'ordinateur."]}, {"cell_type": "markdown", "metadata": {"collapsed": true}, "source": ["## Quelques commentaires sur le code\n", "\n", "Le symbol [@](https://www.python.org/dev/peps/pep-0465/) a \u00e9t\u00e9 introduit par Python 3.5 et est \u00e9quivalent \u00e0 la fonction [numpy.dot](https://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html). Les dimensions des matrices posent souvent quelques probl\u00e8mes."]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [{"data": {"text/plain": ["((3, 2), (3,))"]}, "execution_count": 18, "metadata": {}, "output_type": "execute_result"}], "source": ["import numpy\n", "y = numpy.array([1, 2, 3])\n", "M = numpy.array([[3, 4], [6, 7], [3, 3]])\n", "M.shape, y.shape"]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["shapes (3,2) and (3,) not aligned: 2 (dim 1) != 3 (dim 0)\n"]}], "source": ["try:\n", " M @ y\n", "except Exception as e:\n", " print(e)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Par d\u00e9faut, numpy consid\u00e8re un vecteur de taille ``(3,)`` comme un vecteur ligne ``(3,1)``. Donc l'expression suivante va marcher :"]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([24, 27])"]}, "execution_count": 20, "metadata": {}, "output_type": "execute_result"}], "source": ["y @ M"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Ou :"]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([24, 27])"]}, "execution_count": 21, "metadata": {}, "output_type": "execute_result"}], "source": ["M.T @ y"]}, {"cell_type": "code", "execution_count": 21, "metadata": {"collapsed": true}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1"}}, "nbformat": 4, "nbformat_minor": 2}