{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 2A.i - Serialization\n", "\n", "Loading a dataframe from a text file takes time because the text has to be parsed and converted into numbers. Serialization writes the in-memory content of an object to disk. On the next use, Python only needs to read that block back from disk and restore it in memory with little conversion. Serializing a dataframe therefore makes it much faster to reload."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["<div id=\"my_id_menu_nb\">run previous cell, wait for 2 seconds</div>\n", "<script>\n", "function repeat_indent_string(n){\n", "    var a = \"\" ;\n", "    for ( ; n > 0 ; --n)\n", "        a += \"    \";\n", "    return a;\n", "}\n", "var update_menu_string = function(begin, lfirst, llast, sformat, send, keep_item, begin_format, end_format) {\n", "    var anchors = document.getElementsByClassName(\"section\");\n", "    if (anchors.length == 0) {\n", "        anchors = document.getElementsByClassName(\"text_cell_render rendered_html\");\n", "    }\n", "    var i,t;\n", "    var text_menu = begin;\n", "    var text_memo = \"<pre>\\nlength:\" + anchors.length + \"\\n\";\n", "    var ind = \"\";\n", "    var memo_level = 1;\n", "    var href;\n", "    var tags = [];\n", "    var main_item = 0;\n", "    var format_open = 0;\n", "    for (i = 0; i <= llast; i++)\n", "        tags.push(\"h\" + i);\n", "\n", "    for (i = 0; i < anchors.length; i++) {\n", "        text_memo += \"**\" + anchors[i].id + \"--\\n\";\n", "\n", "        var child = null;\n", "        for(t = 0; t < tags.length; t++) {\n", "            var r = anchors[i].getElementsByTagName(tags[t]);\n", "            if (r.length > 0) {\n", "child = r[0];\n", "break;\n", "            }\n", "        }\n", "        if (child == null) {\n", "            text_memo += \"null\\n\";\n", "            
continue;\n", "        }\n", "        if (anchors[i].hasAttribute(\"id\")) {\n", "            // when converted in RST\n", "            href = anchors[i].id;\n", "            text_memo += \"#1-\" + href;\n", "            // passer \u00e0 child suivant (le chercher)\n", "        }\n", "        else if (child.hasAttribute(\"id\")) {\n", "            // in a notebook\n", "            href = child.id;\n", "            text_memo += \"#2-\" + href;\n", "        }\n", "        else {\n", "            text_memo += \"#3-\" + \"*\" + \"\\n\";\n", "            continue;\n", "        }\n", "        var title = child.textContent;\n", "        var level = parseInt(child.tagName.substring(1,2));\n", "\n", "        text_memo += \"--\" + level + \"?\" + lfirst + \"--\" + title + \"\\n\";\n", "\n", "        if ((level < lfirst) || (level > llast)) {\n", "            continue ;\n", "        }\n", "        if (title.endsWith('\u00b6')) {\n", "            title = title.substring(0,title.length-1).replace(\"<\", \"&lt;\")\n", "         .replace(\">\", \"&gt;\").replace(\"&\", \"&amp;\");\n", "        }\n", "        if (title.length == 0) {\n", "            continue;\n", "        }\n", "\n", "        while (level < memo_level) {\n", "            text_menu += end_format + \"</ul>\\n\";\n", "            format_open -= 1;\n", "            memo_level -= 1;\n", "        }\n", "        if (level == lfirst) {\n", "            main_item += 1;\n", "        }\n", "        if (keep_item != -1 && main_item != keep_item + 1) {\n", "            // alert(main_item + \" - \" + level + \" - \" + keep_item);\n", "            continue;\n", "        }\n", "        while (level > memo_level) {\n", "            text_menu += \"<ul>\\n\";\n", "            memo_level += 1;\n", "        }\n", "        text_menu += repeat_indent_string(level-2);\n", "        text_menu += begin_format + sformat.replace(\"__HREF__\", href).replace(\"__TITLE__\", title);\n", "        format_open += 1;\n", "    }\n", "    while (1 < 
memo_level) {\n", "        text_menu += end_format + \"</ul>\\n\";\n", "        memo_level -= 1;\n", "        format_open -= 1;\n", "    }\n", "    text_menu += send;\n", "    //text_menu += \"\\n\" + text_memo;\n", "\n", "    while (format_open > 0) {\n", "        text_menu += end_format;\n", "        format_open -= 1;\n", "    }\n", "    return text_menu;\n", "};\n", "var update_menu = function() {\n", "    var sbegin = \"\";\n", "    var sformat = '<a href=\"#__HREF__\">__TITLE__</a>';\n", "    var send = \"\";\n", "    var begin_format = '<li>';\n", "    var end_format = '</li>';\n", "    var keep_item = -1;\n", "    var text_menu = update_menu_string(sbegin, 2, 4, sformat, send, keep_item,\n", "       begin_format, end_format);\n", "    var menu = document.getElementById(\"my_id_menu_nb\");\n", "    menu.innerHTML=text_menu;\n", "};\n", "window.setTimeout(update_menu,2000);\n", "            </script>"], "text/plain": ["<IPython.core.display.HTML object>"]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Serialization\n", "\n", "[Serialization](http://fr.wikipedia.org/wiki/S%C3%A9rialisation) means saving an object to a file in a form close to the way it is represented in the computer's memory. Reading the object back is then faster. The difficulty lies in serializing composite objects, such as a list that contains a dictionary that contains a list of other lists. Without going into implementation details, most Python objects are serializable, and so is any object composed of such objects. 
This is done with the [pickle](https://docs.python.org/3.4/library/pickle.html) module."]}, {"cell_type": "code", "execution_count": 2, "metadata": {"collapsed": true}, "outputs": [], "source": ["import pickle\n", "\n", "# Composite object: a list holding a dict, a string, a float and a nested list.\n", "data = [{3: \"4\"}, \"4\", -5.5, [6, None]]\n", "with open(\"objet_serialise.bin\", \"wb\") as f:\n", "    pickle.dump(data, f)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Then we read the data back:"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/plain": ["[{3: '4'}, '4', -5.5, [6, None]]"]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["with open(\"objet_serialise.bin\", \"rb\") as f:\n", "    obj = pickle.load(f)\n", "obj"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## DataFrame\n", "\n", "DataFrames have a dedicated method, [to_pickle](http://pandas-docs.github.io/pandas-docs-travis/io.html?highlight=to_pickle#pickling), whose output is read back with the function [read_pickle](http://pandas-docs.github.io/pandas-docs-travis/io.html?highlight=read_pickle#pickling)."]}, {"cell_type": "code", "execution_count": 4, "metadata": {"collapsed": true}, "outputs": [], "source": ["import pandas\n", "df = pandas.DataFrame([{\"name\": \"xavier\", \"school\": \"ENSAE\"},\n", "                       {\"name\": \"antoine\", \"school\": \"ENSAE\"}])\n", "df.to_pickle(\"df_serialize.bin\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Then we read the file back:"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/html": ["<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>name</th>\n", "      <th>school</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>xavier</td>\n", "      <td>ENSAE</td>\n", "    </tr>\n", "    <tr>\n", "      
<th>1</th>\n", "      <td>antoine</td>\n", "      <td>ENSAE</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>"], "text/plain": ["      name school\n", "0   xavier  ENSAE\n", "1  antoine  ENSAE"]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["df2 = pandas.read_pickle(\"df_serialize.bin\")\n", "df2"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 1: serializing a large dataframe\n", "\n", "We want to compare the time needed to load the same dataframe from a text file and from a file containing the serialized dataframe. First, generate a large dataframe, save it as a text file, then serialize it. Then compare the loading times."]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {"collapsed": true}, "source": ["## Exercise 2: JSON serialization\n", "\n", "The [pickle](https://docs.python.org/3/library/pickle.html) module produces binary files that can only be read by Python, and the format may change from one Python version to another. A text format such as JSON is often preferred. Redo exercise 1 with the [jsonpickle](https://jsonpickle.github.io/) module."]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["**Question**\n", "\n", "What does a message like the following one, found in the [jsonpickle](https://jsonpickle.github.io/) documentation, mean?\n", "\n", "```\n", "Warning jsonpickle can execute arbitrary Python code. 
Do not load jsonpickles from untrusted / unauthenticated sources.\n", "```"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Another option: dill\n", "\n", "The [dill](https://pypi.python.org/pypi/dill) module extends the functionality of [pickle](https://docs.python.org/3.4/library/pickle.html) somewhat. pickle has some trouble loading objects that were serialized with other versions of Python. [dill](https://pypi.python.org/pypi/dill) is still under development, but it should be more robust in that particular case."]}, {"cell_type": "code", "execution_count": 8, "metadata": {"collapsed": true}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1"}}, "nbformat": 4, "nbformat_minor": 2}