{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 1A.1 - D'une structure de donn\u00e9es \u00e0 l'autre\n", "\n", "Ce notebook s'amuse \u00e0 passer d'une structure de donn\u00e9es \u00e0 une autre, d'une liste \u00e0 un dictionnaire, d'une liste de liste \u00e0 un dictionnaire, avec toujours les m\u00eames donn\u00e9es : [list](http://www.xavierdupre.fr/app/teachpyx/helpsphinx/c_lang/types.html#liste), [dict](http://www.xavierdupre.fr/app/teachpyx/helpsphinx/c_lang/types.html#dictionnaire), [tuple](http://www.xavierdupre.fr/app/teachpyx/helpsphinx/c_lang/types.html#tuple)."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
run previous cell, wait for 2 seconds
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## histogramme et dictionnaire"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### liste \u00e0 dictionnaire\n", "\n", "Un histogramme est le moyen le plus simple de calculer la distribution d'une variable, de compter la fr\u00e9quence des \u00e9l\u00e9ments d'une liste."]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/plain": ["{'a': 1, 'b': 2, 'gh': 2, 'er': 1}"]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["ens = [\"a\", \"b\", \"gh\", \"er\", \"b\", \"gh\"]\n", "hist = {}\n", "for e in ens:\n", " hist[e] = hist.get(e, 0) + 1\n", "hist"]}, {"cell_type": "markdown", "metadata": {}, "source": ["La m\u00e9thode [get](https://docs.python.org/3/library/stdtypes.html?highlight=get#dict.get) comme beaucoup de fonctions impl\u00e9mente un besoin fr\u00e9quent. Elle regarde si une cl\u00e9 appartient au dictionnaire, retourne la valeur associ\u00e9e ou une valeur par d\u00e9fault dans le cas contraire. Sans utiliser cette m\u00e9thode, le code pr\u00e9c\u00e9dent devient :"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/plain": ["{'a': 1, 'b': 2, 'gh': 2, 'er': 1}"]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["ens = [\"a\", \"b\", \"gh\", \"er\", \"b\", \"gh\"]\n", "hist = {}\n", "for e in ens:\n", " if e in hist:\n", " hist[e] += 1\n", " else:\n", " hist[e] = 1\n", "hist"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Il existe \u00e9galement la fonction [Counter](https://docs.python.org/fr/3/library/collections.html?highlight=counter#collections.Counter) qui fait cela."]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/plain": ["Counter({'a': 1, 'b': 2, 'gh': 2, 'er': 1})"]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["from collections import Counter\n", "ens = [\"a\", \"b\", \"gh\", \"er\", \"b\", \"gh\"]\n", "hist = Counter(ens)\n", "hist"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### dictionnaire \u00e0 liste\n", "\n", "A priori l'histogramme repr\u00e9sente la m\u00eame information que la liste initiale `ens`. Il doit exister un moyen de recontruire la liste initiale."]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/plain": ["['a', 'b', 'b', 'er', 'gh', 'gh']"]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["hist = {'a': 1, 'b': 2, 'er': 1, 'gh': 2}\n", "ens = []\n", "for k, v in hist.items():\n", " for i in range(v):\n", " ens.append(k)\n", "ens"]}, {"cell_type": "markdown", "metadata": {}, "source": ["La liste initiale est retrouv\u00e9e except\u00e9 l'ordre qui est diff\u00e9rent. Les \u00e9l\u00e9ments identiques sont c\u00f4te \u00e0 c\u00f4te. La m\u00e9thode [items](https://docs.python.org/3/library/stdtypes.html?highlight=get#dict.items) retourne des couples `(cl\u00e9, valeur)` ou plut\u00f4t une vue, c'est-\u00e0-dire une fa\u00e7on de parcourir un ensemble."]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/plain": ["dict_items([('a', 1), ('b', 2), ('er', 1), ('gh', 2)])"]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["hist.items()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Pour v\u00e9rifier que la m\u00e9thode [items](https://docs.python.org/3/library/stdtypes.html?highlight=get#dict.items) ne retourne pas un ensemble mais une fa\u00e7on de parcourir un ensemble, on regarde sa taille avec la fonction [getsizeof](https://docs.python.org/3/library/sys.html?highlight=getsizeof#sys.getsizeof) :"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"text/plain": ["(128, 240, 48)"]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["import sys\n", "vue = hist.items()\n", "sys.getsizeof(ens), sys.getsizeof(hist), sys.getsizeof(vue)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Et pour un dictionnaire plus grand, la taille du dictionnaire."]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/plain": ["(36968, 48)"]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["d = {i:i for i in range(1000)}\n", "sys.getsizeof(d), sys.getsizeof(d.items())"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On peut ne pas utiliser la m\u00e9thode [items](https://docs.python.org/3/library/stdtypes.html?highlight=get#dict.items) :"]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"data": {"text/plain": ["['a', 'b', 'b', 'er', 'gh', 'gh']"]}, "execution_count": 10, "metadata": {}, "output_type": "execute_result"}], "source": ["hist = {'a': 1, 'b': 2, 'er': 1, 'gh': 2}\n", "ens = []\n", "for k in hist:\n", " v = hist[k]\n", " for i in range(v):\n", " ens.append(k)\n", "ens"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### dictionnaire et deux listes\n", "\n", "Cette fois-ci, on met les cl\u00e9s d'un c\u00f4t\u00e9 et les valeurs de l'autre."]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"data": {"text/plain": ["(['a', 'b', 'er', 'gh'], [1, 2, 1, 2])"]}, "execution_count": 11, "metadata": {}, "output_type": "execute_result"}], "source": ["hist = {'a': 1, 'b': 2, 'er': 1, 'gh': 2}\n", "cles = [k for k in hist]\n", "vals = [hist[k] for k in hist]\n", "cles, vals"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On peut \u00e9crire aussi ce programme "]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"data": {"text/plain": ["(['a', 'b', 'er', 'gh'], [1, 2, 1, 2])"]}, "execution_count": 12, "metadata": {}, "output_type": "execute_result"}], "source": ["hist = {'a': 1, 'b': 2, 'er': 1, 'gh': 2}\n", "cles = list(hist.keys())\n", "vals = list(hist.values())\n", "cles, vals"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Toutefois, cette \u00e9criture n'est pas recommand\u00e9e car il est possible que l'expression ``for k in hist`` ou ``list(hist.keys())`` parcourent les cl\u00e9s d'un dictionnaire de deux fa\u00e7ons diff\u00e9rentes si le dictionnaire est modifi\u00e9 entre temps. Mais on ne s'en pas toujours compte car cela d\u00e9pend de l'impl\u00e9mentation des m\u00e9thodes associ\u00e9es \u00e0 la classe [dict](https://docs.python.org/3.5/library/stdtypes.html?highlight=dict#dict) (voir [cpython](https://github.com/python/cpython/tree/master/Python)). C'est pourquoi on pr\u00e9f\u00e8re ne parcourir qu'une seule fois le dictionnaire tout en cr\u00e9ant les deux listes."]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/plain": ["(['a', 'b', 'er', 'gh'], [1, 2, 1, 2])"]}, "execution_count": 13, "metadata": {}, "output_type": "execute_result"}], "source": ["hist = {'a': 1, 'b': 2, 'er': 1, 'gh': 2}\n", "cles = []\n", "vals = []\n", "for k, v in hist.items():\n", " cles.append(k)\n", " vals.append(v)\n", "cles, vals"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### deux listes et dictionnaires\n", "\n", "On effectue l'op\u00e9ration inverse."]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"data": {"text/plain": ["{'a': 1, 'gh': 2, 'er': 1, 'b': 2}"]}, "execution_count": 14, "metadata": {}, "output_type": "execute_result"}], "source": ["cles, vals = ['a', 'gh', 'er', 'b'], [1, 2, 1, 2]\n", "hist = {a:b for a, b in zip(cles, vals)}\n", "hist"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Et si on ne veut pas utiliser la fonction [zip](https://docs.python.org/3/library/functions.html#zip) :"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"data": {"text/plain": ["{'a': 1, 'gh': 2, 'er': 1, 'b': 2}"]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["cles, vals = ['a', 'gh', 'er', 'b'], [1, 2, 1, 2]\n", "hist = {}\n", "for i in range(len(cles)):\n", " hist[cles[i]] = vals[i]\n", "hist"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### zip reverse\n", "\n", "La fonction [zip](https://docs.python.org/3/library/functions.html#zip) permet de parcourir deux listes en parall\u00e8les. Cela permet de raccourcir le code pour cr\u00e9er un dictionnaire \u00e0 partir de cl\u00e9s et de valeurs s\u00e9par\u00e9s. Ca para\u00eet bien plus long que de cr\u00e9er les listes des cl\u00e9s et des valeurs. Et pourtant le code suivant peut \u00eatre consid\u00e9rablement raccourci :"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/plain": ["(['a', 'b', 'er', 'gh'], [1, 2, 1, 2])"]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["hist = {'a': 1, 'b': 2, 'er': 1, 'gh': 2}\n", "cles = []\n", "vals = []\n", "for k, v in hist.items():\n", " cles.append(k)\n", " vals.append(v)\n", "cles, vals"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Cela devient :"]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"data": {"text/plain": ["(('a', 'b', 'er', 'gh'), (1, 2, 1, 2))"]}, "execution_count": 17, "metadata": {}, "output_type": "execute_result"}], "source": ["hist = {'a': 1, 'b': 2, 'er': 1, 'gh': 2}\n", "cles, vals = zip(*hist.items())\n", "cles, vals"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Petite diff\u00e9rence, `cles`, `vals` sont sous forme de [tuple](https://docs.python.org/3.5/library/stdtypes.html?highlight=tuple#tuple) mais cela reste tr\u00e8s \u00e9l\u00e9gant."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## matrices et dictionnaires"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### liste de listes et dictionnaires\n", "\n", "Une liste de listes est la repr\u00e9sentation la plus naturelle. Essayons de la transformer sous forme de dictionnaire. On utilise la fonction [enumerate](https://docs.python.org/3/library/functions.html#enumerate)."]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [{"data": {"text/plain": ["{(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}"]}, "execution_count": 18, "metadata": {}, "output_type": "execute_result"}], "source": ["mat = [[1, 2], \n", " [3, 4]]\n", "dv = {}\n", "for i, row in enumerate(mat):\n", " for j, x in enumerate(row):\n", " dv[i,j] = x\n", "dv"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### dictionnaires et liste de listes\n", "\n", "On effectue l'op\u00e9ration inverse. Nous n'avons pas perdu d'information, nous devrions retrouver la liste de listes originale."]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"data": {"text/plain": ["[[1, 2], [3, 4]]"]}, "execution_count": 19, "metadata": {}, "output_type": "execute_result"}], "source": ["dx = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}\n", "max_i = max(k[0] for k in dx) + 1\n", "max_j = max(k[1] for k in dx) + 1\n", "mat = [[0] * max_j for i in range(max_i)]\n", "for k, v in dv.items():\n", " mat[k[0]][k[1]] = v\n", "mat"]}, {"cell_type": "markdown", "metadata": {}, "source": ["La diff\u00e9rence principale entre un dictionnaire ``d`` et une liste ``l`` est que l'instruction ``d[k]`` ajoute un \u00e9l\u00e9ment d'indice ``k`` (quel que soit ``k``) alors que l'instruction ``l[k]``) suppose que l'\u00e9l\u00e9ment d'indice ``k`` existe dans la liste. C'est pour cela qu'on commence \u00e0 calculer les indices maximaux largeur, longueur."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### matrice sparse\n", "\n", "On utilise cette r\u00e9presentation surtout lorsque pour des matrices sparses : la majorit\u00e9 des coefficients sont nuls. Dans ce cas, le dictionnaire final ne contient que les coefficients non nuls."]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [{"data": {"text/plain": ["{(0, 0): 1, (1, 1): 4}"]}, "execution_count": 20, "metadata": {}, "output_type": "execute_result"}], "source": ["mat = [[1, 0, 0], \n", " [0, 4, 0]]\n", "dv = {}\n", "for i, row in enumerate(mat):\n", " for j, x in enumerate(row):\n", " if x != 0:\n", " dv[i,j] = x\n", "dv"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Si on ne conserve pas les dimensions de la matrice originale, on perd un peu d'information dans un cas pr\u00e9cis : si la matrice se termine par une colonne ou une ligne de z\u00e9ros."]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [{"data": {"text/plain": ["[[1, 0], [0, 4]]"]}, "execution_count": 21, "metadata": {}, "output_type": "execute_result"}], "source": ["dx = {(0, 0): 1, (1, 1): 4}\n", "max_i = max(k[0] for k in dx) + 1\n", "max_j = max(k[1] for k in dx) + 1\n", "mat = [[0] * max_j for i in range(max_i)]\n", "for k, v in dv.items():\n", " mat[k[0]][k[1]] = v\n", "mat"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## matrices et tableaux"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### 2 dimensions logiques, 1 dimension en m\u00e9moire\n", "\n", "On pr\u00e9f\u00e8re repr\u00e9senter une matrice par un seul vecteur m\u00eame si logiquement elle en contient car cela prend moins de place en m\u00e9moire. Dans ce cas, on met les lignes bout \u00e0 bout."]}, {"cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [{"data": {"text/plain": ["[1, 0, 0, 0, 4, 0, 1, 2, 3]"]}, "execution_count": 22, "metadata": {}, "output_type": "execute_result"}], "source": ["mat = [[1, 0, 0], \n", " [0, 4, 0],\n", " [1, 2, 3]]\n", "arr = []\n", "for i, row in enumerate(mat):\n", " for j, x in enumerate(row):\n", " arr.append(x)\n", "arr"]}, {"cell_type": "markdown", "metadata": {}, "source": ["D'un c\u00f4t\u00e9, nous avons 4 listes avec `mat` et une seule avec `arr`. V\u00e9rifions les tailles :"]}, {"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [{"data": {"text/plain": ["(88, 192)"]}, "execution_count": 23, "metadata": {}, "output_type": "execute_result"}], "source": ["import sys\n", "sys.getsizeof(mat), sys.getsizeof(arr)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Etrange ! Mais pour comprendre, il faut lire la documentation de la fonction [getsizeof](https://docs.python.org/3/library/sys.html?highlight=getsizeof#sys.getsizeof) qui ne compte pas la somme des objets r\u00e9f\u00e9renc\u00e9s par celui dont on mesure la taille. Autrement dit, dans le cas d'une liste de listes, la fonction ne mesure que la taille de la premi\u00e8re liste. Pour corriger le tir, on utilise la fonction sugg\u00e9r\u00e9e par la documentation de Python."]}, {"cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [{"data": {"text/plain": ["(488, 328)"]}, "execution_count": 24, "metadata": {}, "output_type": "execute_result"}], "source": ["from ensae_teaching_cs.helpers.size_helper import total_size\n", "total_size(mat), total_size(arr)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On peut aussi utiliser le module [pympler](https://pythonhosted.org/Pympler/) et la fonction [asizeof](https://pythonhosted.org/Pympler/asizeof.html#asizeof)."]}, {"cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [{"data": {"text/plain": ["(504, 344)"]}, "execution_count": 25, "metadata": {}, "output_type": "execute_result"}], "source": ["from pympler.asizeof import asizeof\n", "asizeof(mat), asizeof(arr)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Cela prend \u00e9norm\u00e9ment de place pour 9 *float* (soit 9x8 octets) mais Python stocke beaucoup plus d'informations qu'un langage compil\u00e9 type C++. Cela explique pourquoi le module [numpy](http://www.numpy.org/) fait la m\u00eame chose avec moins d'espace m\u00e9moire car il est cod\u00e9 en C++."]}, {"cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [{"data": {"text/plain": ["(152, 136)"]}, "execution_count": 26, "metadata": {}, "output_type": "execute_result"}], "source": ["from numpy import array\n", "amat = array(mat)\n", "aarr = array(arr)\n", "asizeof(amat), asizeof(aarr)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Et si on augmente le nombre de r\u00e9els pour faire dispara\u00eetre les co\u00fbts fixes :"]}, {"cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [{"data": {"text/plain": ["(32.7984, 8.00096)"]}, "execution_count": 27, "metadata": {}, "output_type": "execute_result"}], "source": ["n = 100000\n", "li = list(float(x) for x in range(n))\n", "ar = array(li)\n", "asizeof(li) / n, asizeof(ar) / n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Python prend 4 fois plus de place que numpy."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### du tableau \u00e0 la liste de listes\n", "\n", "A moins que la matrice soit carr\u00e9e, il faut conserver une des dimensions du tableau original, le nombre de lignes par exemple."]}, {"cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [{"data": {"text/plain": ["[[1, 0, 0], [0, 4, 0], [1, 2, 3]]"]}, "execution_count": 28, "metadata": {}, "output_type": "execute_result"}], "source": ["arr = [1, 0, 0, 0, 4, 0, 1, 2, 3]\n", "nb_lin = 3\n", "nb_col = len(arr) // nb_lin\n", "mat = []\n", "pos = 0\n", "for i in range(nb_lin):\n", " row = []\n", " for j in range(nb_col):\n", " row.append(arr[pos])\n", " pos += 1\n", " mat.append(row)\n", "mat"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On peut aussi faire comme ceci :"]}, {"cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [{"data": {"text/plain": ["[[1, 0, 0], [0, 4, 0], [1, 2, 3]]"]}, "execution_count": 29, "metadata": {}, "output_type": "execute_result"}], "source": ["arr = [1, 0, 0, 0, 4, 0, 1, 2, 3]\n", "nb_lin = 3\n", "nb_col = len(arr) // nb_lin\n", "mat = [[0] * nb_col for i in range(nb_lin)]\n", "for pos, x in enumerate(arr):\n", " i = pos // nb_lin\n", " j = pos % nb_lin\n", " mat[i][j] = x\n", "mat"]}, {"cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0"}}, "nbformat": 4, "nbformat_minor": 2}