{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 2A.i - Mesures de vitesse sur les dataframes\n", "\n", "Lire un [dataframe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) avec un it\u00e9rateur quand on ne conna\u00eet pas sa taille, lire un [array](https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html) avec un it\u00e9rateur."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
run previous cell, wait for 2 seconds
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Cr\u00e9ation d'un dataframe \u00e0 partir d'un it\u00e9rateur\n", "\n", "On cherche \u00e0 cr\u00e9er un dataframe \u00e0 partir d'un ensemble de lignes dont on ne conna\u00eet pas le nombre au moment o\u00f9 on cr\u00e9\u00e9 le dataframe car on les re\u00e7oit sous la forme d'un it\u00e9rateur ou un g\u00e9n\u00e9rateur."]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/plain": ["[(0.48407052966425546,\n", " 0.804134458177794,\n", " 0.6429087806017728,\n", " 0.15971606067266708,\n", " 0.24228260345690167,\n", " 0.23378976056237966,\n", " 0.689702986896856,\n", " 0.6968197701594067,\n", " 0.644936143902689,\n", " 0.17127561953185932),\n", " (0.37058334468191145,\n", " 0.689477712320184,\n", " 0.02131118805602783,\n", " 0.772192897946077,\n", " 0.7048866020231711,\n", " 0.012595878354861978,\n", " 0.23269748071952856,\n", " 0.5331829925096959,\n", " 0.29721238636977587,\n", " 0.884680186273301)]"]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["import random\n", "def enumerate_row(nb=10000, n=10):\n", " for i in range(nb):\n", " # on retourne un tuple, les donn\u00e9es sont\n", " # plus souvent recopi\u00e9es car le type est immuable\n", " yield tuple(random.random() for k in range(n))\n", " # on retourne une liste, ces listes ne sont pas\n", " # recopi\u00e9es en g\u00e9n\u00e9ral, seule la liste qui les tient\n", " # l'est\n", " # yield list(random.random() for k in range(n))\n", " \n", "list(enumerate_row(2))"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
c0c1c2c3c4c5c6c7c8c9
00.4412340.4241490.6729140.8902960.3685750.4775710.7984170.6949600.4393950.910571
10.5331900.6549240.3167320.2186780.5093690.3297700.3846390.6747800.7446740.776442
20.8638790.7109170.3877450.5208680.4249820.8649280.6398180.9537590.8557130.783866
30.7940470.5117360.8571610.6258030.3527310.9938980.1956720.4765820.2939880.796978
40.4838210.1196230.6478370.2153540.1448400.1197180.1592950.1842820.9194500.612818
\n", "
"], "text/plain": [" c0 c1 c2 c3 c4 c5 c6 \\\n", "0 0.441234 0.424149 0.672914 0.890296 0.368575 0.477571 0.798417 \n", "1 0.533190 0.654924 0.316732 0.218678 0.509369 0.329770 0.384639 \n", "2 0.863879 0.710917 0.387745 0.520868 0.424982 0.864928 0.639818 \n", "3 0.794047 0.511736 0.857161 0.625803 0.352731 0.993898 0.195672 \n", "4 0.483821 0.119623 0.647837 0.215354 0.144840 0.119718 0.159295 \n", "\n", " c7 c8 c9 \n", "0 0.694960 0.439395 0.910571 \n", "1 0.674780 0.744674 0.776442 \n", "2 0.953759 0.855713 0.783866 \n", "3 0.476582 0.293988 0.796978 \n", "4 0.184282 0.919450 0.612818 "]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["import pandas\n", "nb, n = 10, 10\n", "df = pandas.DataFrame(enumerate_row(nb=nb, n=n), columns=[\"c%d\" % i for i in range(n)])\n", "df.head()"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": ["nb, n =100000, 10"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On compare plusieurs constructions :"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["100000 10\n", "1 loops, best of 3: 539 ms per loop\n"]}], "source": ["print(nb,n)\n", "%timeit pandas.DataFrame(enumerate_row(nb=nb,n=n), columns=[\"c%d\" % i for i in range(n)])"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["100000 10\n", "1 loops, best of 3: 531 ms per loop\n"]}], "source": ["print(nb,n)\n", "%timeit pandas.DataFrame(list(enumerate_row(nb=nb,n=n)), columns=[\"c%d\" % i for i in range(n)])"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On d\u00e9compose :"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["100000 10\n", "1 loops, best of 3: 437 ms per loop\n"]}], "source": ["def cache():\n", " return list(enumerate_row(nb=nb,n=n))\n", "print(nb,n)\n", "%timeit cache()"]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["100000 10\n", "10 loops, best of 3: 118 ms per loop\n"]}], "source": ["print(nb,n)\n", "l = list(enumerate_row(nb=nb,n=n))\n", "%timeit pandas.DataFrame(l, columns=[\"c%d\" % i for i in range(n)])"]}, {"cell_type": "markdown", "metadata": {}, "source": ["D'apr\u00e8s ces temps, pandas convertit probablement l'it\u00e9rateur en liste. On essaye de cr\u00e9er le dataframe vide, puis avec la m\u00e9thode [from_records](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#alternate-constructors)."]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["The slowest run took 4.15 times longer than the fastest. This could mean that an intermediate result is being cached \n", "1000 loops, best of 3: 1.58 ms per loop\n"]}], "source": ["%timeit pandas.DataFrame(columns=[\"c%d\" % i for i in range(n)], index=range(n))"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["100000 10\n", "1 loops, best of 3: 508 ms per loop\n"]}], "source": ["def create_df3():\n", " return pandas.DataFrame.from_records(enumerate_row(nb=nb,n=n), \n", " columns=[\"c%d\" % i for i in range(n)])\n", "\n", "print(nb,n)\n", "%timeit create_df3()"]}, {"cell_type": "markdown", "metadata": {"collapsed": true}, "source": ["## Cr\u00e9ation d'un array \u00e0 partir d'un it\u00e9rateur\n", "\n", "On cherche \u00e0 cr\u00e9er un dataframe \u00e0 partir d'un ensemble de lignes dont on ne conna\u00eet pas le nombre au moment o\u00f9 on cr\u00e9\u00e9 le dataframe car on les re\u00e7oit sous la forme d'un it\u00e9rateur ou un g\u00e9n\u00e9rateur. La documentation de la fonction [numpy.fromiter](http://docs.scipy.org/doc/numpy/reference/generated/numpy.fromiter.html#numpy.fromiter) est int\u00e9ressante \u00e0 ce sujet."]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["100000 10\n"]}, {"data": {"text/plain": ["array([[ 0.56015574, 0.87562798, 0.98440727, 0.09243148, 0.71725052,\n", " 0.46137005, 0.34867008, 0.16251999, 0.81366002, 0.04497673],\n", " [ 0.61909677, 0.71484276, 0.81124602, 0.35659089, 0.39326758,\n", " 0.65078866, 0.92445948, 0.32325873, 0.3898043 , 0.92096777],\n", " [ 0.88257268, 0.92949871, 0.85010483, 0.13049573, 0.30860387,\n", " 0.60728701, 0.95734593, 0.01977126, 0.06968685, 0.98705452],\n", " [ 0.84260967, 0.96877873, 0.8886382 , 0.63529283, 0.99503151,\n", " 0.95205265, 0.80948905, 0.08377304, 0.68015408, 0.74213396],\n", " [ 0.77371233, 0.39757708, 0.45769021, 0.01445757, 0.4160269 ,\n", " 0.2832024 , 0.68086284, 0.4086078 , 0.99526844, 0.15037316]])"]}, "execution_count": 12, "metadata": {}, "output_type": "execute_result"}], "source": ["def enumerate_row2(nb=10000, n=10):\n", " for i in range(nb):\n", " for k in range(n):\n", " yield random.random()\n", " \n", "import numpy\n", "nb,n = 100000, 10\n", "# on pr\u00e9cise la taille du tableau car cela \u00e9vite \u00e0 numpy d'agrandir le tableau\n", "# au fur et \u00e0 mesure, ceci ne fonctionne pas\n", "print(nb,n)\n", "m = numpy.fromiter(enumerate_row2(nb=nb,n=n), float, nb*n)\n", "m.resize((nb,n))\n", "m [:5,:]"]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["100000 10\n", "1 loops, best of 3: 255 ms per loop\n"]}], "source": ["def create_array():\n", " m = numpy.fromiter(enumerate_row2(nb=nb,n=n), float, nb*n)\n", " m.resize((nb,n)) \n", " return m\n", "print(nb,n)\n", "%timeit create_array()"]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["100000 10\n", "1 loops, best of 3: 430 ms per loop\n"]}], "source": ["def create_array2():\n", " m = list(enumerate_row(nb=nb,n=n))\n", " ml = numpy.array(m, float)\n", " return ml\n", "print(nb,n)\n", "%timeit create_array2()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Et si on ne pr\u00e9cise pas la taille du tableau cr\u00e9\u00e9 avec la fonction ``fromiter`` :"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["100000 10\n", "1 loops, best of 3: 271 ms per loop\n"]}], "source": ["def create_array3():\n", " m = numpy.fromiter(enumerate_row2(nb=nb,n=n), float)\n", " m.resize((nb,n)) \n", " return m\n", "print(nb,n)\n", "%timeit create_array3()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On retrouve des temps similaires que ceux obtenus avec une liste. En conclusion, pour cr\u00e9er un *array*, il vaut mieux :\n", "\n", "* conna\u00eetre la taille finale\n", "* \u00e9viter de cr\u00e9er une liste"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Pour finir, je recommande la lecture de [Enhancing Performance](http://pandas-docs.github.io/pandas-docs-travis/enhancingperf.html?highlight=dataframe) qui \u00e9tudie diff\u00e9rent sc\u00e9nari avec [cython](http://cython.org/), [eval](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.eval.html), [numba](http://numba.pydata.org/)."]}, {"cell_type": "code", "execution_count": 15, "metadata": {"collapsed": true}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1"}}, "nbformat": 4, "nbformat_minor": 2}