{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Mesures de vitesse sur les dataframes\n", "\n", "Le notebook montre comment lire un [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) avec un itérateur quand on ne connaît pas sa taille, ou lire un [array](https://numpy.org/doc/stable/reference/generated/numpy.array.html) avec un itérateur." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Création d'un dataframe à partir d'un itérateur\n", "\n", "On cherche à créer un dataframe à partir d'un ensemble de lignes dont on ne connaît pas le nombre au moment où on créé le dataframe car on les reçoit sous la forme d'un itérateur ou un générateur." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0.4584214264768637,\n", " 0.0957370472492135,\n", " 0.825720254865909,\n", " 0.056222146826998554,\n", " 0.012568801665460705,\n", " 0.20797581971445256,\n", " 0.6508447830614892,\n", " 0.817974554103244,\n", " 0.04182207570159391,\n", " 0.591375261282058),\n", " (0.5818213564160107,\n", " 0.3384435930913253,\n", " 0.5900215149482624,\n", " 0.9556893663618211,\n", " 0.9156247392985197,\n", " 0.20153581804870713,\n", " 0.893987513368823,\n", " 0.11112779556835362,\n", " 0.043959856261986174,\n", " 0.233344273733338)]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import random\n", "\n", "\n", "def enumerate_row(nb=10000, n=10):\n", " for i in range(nb):\n", " # on retourne un tuple, les données sont\n", " # plus souvent recopiées car le type est immuable\n", " yield tuple(random.random() for k in range(n))\n", " # on retourne une liste, ces listes ne sont pas\n", " # recopiées en général, seule la liste qui les tient\n", " # l'est\n", " # yield list(random.random() for k in range(n))\n", "\n", "\n", "list(enumerate_row(2))" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
c0c1c2c3c4c5c6c7c8c9
00.1559690.4311930.9954510.0814670.2578340.4576170.7738570.8434360.8422550.570137
10.8763860.7024470.1305920.0841600.7827950.0654420.6824760.0775650.4449160.025166
20.8548080.8732400.0553190.5187090.4861420.0342370.9791280.9978980.4722200.512437
30.4769520.2500160.9648430.5799300.6932380.1031600.2490000.8509350.6320830.738248
40.7735020.2374460.9747550.5645040.6847630.3611640.1522430.3202420.2185290.411604
\n", "
" ], "text/plain": [ " c0 c1 c2 c3 c4 c5 c6 \\\n", "0 0.155969 0.431193 0.995451 0.081467 0.257834 0.457617 0.773857 \n", "1 0.876386 0.702447 0.130592 0.084160 0.782795 0.065442 0.682476 \n", "2 0.854808 0.873240 0.055319 0.518709 0.486142 0.034237 0.979128 \n", "3 0.476952 0.250016 0.964843 0.579930 0.693238 0.103160 0.249000 \n", "4 0.773502 0.237446 0.974755 0.564504 0.684763 0.361164 0.152243 \n", "\n", " c7 c8 c9 \n", "0 0.843436 0.842255 0.570137 \n", "1 0.077565 0.444916 0.025166 \n", "2 0.997898 0.472220 0.512437 \n", "3 0.850935 0.632083 0.738248 \n", "4 0.320242 0.218529 0.411604 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas\n", "\n", "nb, n = 10, 10\n", "df = pandas.DataFrame(enumerate_row(nb=nb, n=n), columns=[\"c%d\" % i for i in range(n)])\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "nb, n = 100000, 10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On compare plusieurs constructions :" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100000 10\n", "230 ms ± 10.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], "source": [ "print(nb, n)\n", "%timeit pandas.DataFrame(enumerate_row(nb=nb,n=n), columns=[\"c%d\" % i for i in range(n)])" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100000 10\n", "225 ms ± 8.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], "source": [ "print(nb, n)\n", "%timeit pandas.DataFrame(list(enumerate_row(nb=nb,n=n)), columns=[\"c%d\" % i for i in range(n)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On décompose :" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100000 10\n", "145 ms ± 18.6 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)\n" ] } ], "source": [ "def cache():\n", " return list(enumerate_row(nb=nb, n=n))\n", "\n", "\n", "print(nb, n)\n", "%timeit -n 3 cache()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100000 10\n", "87.7 ms ± 2.24 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)\n" ] } ], "source": [ "print(nb, n)\n", "l = list(enumerate_row(nb=nb, n=n))\n", "%timeit -n 3 pandas.DataFrame(l, columns=[\"c%d\" % i for i in range(n)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "D'après ces temps, pandas convertit probablement l'itérateur en liste. On essaye de créer le dataframe vide, puis avec la méthode [from_records](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_records.html)." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.94 ms ± 540 µs per loop (mean ± std. dev. of 7 runs, 3 loops each)\n" ] } ], "source": [ "%timeit -n 3 pandas.DataFrame(columns=[\"c%d\" % i for i in range(n)], index=range(n))" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100000 10\n", "224 ms ± 4.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], "source": [ "def create_df3():\n", " return pandas.DataFrame.from_records(\n", " enumerate_row(nb=nb, n=n), columns=[\"c%d\" % i for i in range(n)]\n", " )\n", "\n", "\n", "print(nb, n)\n", "%timeit create_df3()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Création d'un array à partir d'un itérateur\n", "\n", "On cherche à créer un dataframe à partir d'un ensemble de lignes dont on ne connaît pas le nombre au moment où on créé le dataframe car on les reçoit sous la forme d'un itérateur ou un générateur. La documentation de la fonction [numpy.fromiter](https://numpy.org/doc/stable/reference/generated/numpy.fromiter.html) est intéressante à ce sujet." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100000 10\n" ] }, { "data": { "text/plain": [ "array([[0.18094164, 0.98726051, 0.15154422, 0.02532254, 0.13567288,\n", " 0.52949799, 0.9955031 , 0.56441516, 0.95278832, 0.37068437],\n", " [0.97776124, 0.1088838 , 0.72051064, 0.79808152, 0.25334263,\n", " 0.04203916, 0.8290536 , 0.32045666, 0.48908504, 0.70058525],\n", " [0.03562189, 0.45141838, 0.98266729, 0.36282507, 0.74903618,\n", " 0.36675298, 0.30681627, 0.86053065, 0.36733881, 0.03716365],\n", " [0.8255547 , 0.31025914, 0.61405287, 0.2289358 , 0.87746991,\n", " 0.98780181, 0.99195587, 0.6592586 , 0.90237022, 0.73119145],\n", " [0.79096242, 0.72046597, 0.87479709, 0.75549334, 0.2525281 ,\n", " 0.91680528, 0.97679278, 0.92947194, 0.2344261 , 0.67808894]])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def enumerate_row2(nb=10000, n=10):\n", " for i in range(nb):\n", " for k in range(n):\n", " yield random.random()\n", "\n", "\n", "import numpy\n", "\n", "nb, n = 100000, 10\n", "# on précise la taille du tableau car cela évite à numpy d'agrandir le tableau\n", "# au fur et à mesure, ceci ne fonctionne pas\n", "print(nb, n)\n", "m = numpy.fromiter(enumerate_row2(nb=nb, n=n), float, nb * n)\n", "m.resize((nb, n))\n", "m[:5, :]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100000 10\n", "106 ms ± 7.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], "source": [ "def create_array():\n", " m = numpy.fromiter(enumerate_row2(nb=nb, n=n), float, nb * n)\n", " m.resize((nb, n))\n", " return m\n", "\n", "\n", "print(nb, n)\n", "%timeit create_array()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100000 10\n", "175 ms ± 5.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], "source": [ "def create_array2():\n", " m = list(enumerate_row(nb=nb, n=n))\n", " ml = numpy.array(m, float)\n", " return ml\n", "\n", "\n", "print(nb, n)\n", "%timeit create_array2()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Et si on ne précise pas la taille du tableau créé avec la fonction ``fromiter`` :" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100000 10\n", "110 ms ± 7.48 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)\n" ] } ], "source": [ "def create_array3():\n", " m = numpy.fromiter(enumerate_row2(nb=nb, n=n), float)\n", " m.resize((nb, n))\n", " return m\n", "\n", "\n", "print(nb, n)\n", "%timeit -n 3 create_array3()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On retrouve des temps similaires que ceux obtenus avec une liste. En conclusion, pour créer un *array*, il vaut mieux :\n", "\n", "* connaître la taille finale\n", "* éviter de créer une liste" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pour finir, je recommande la lecture de [Enhancing Performance](https://pandas.pydata.org/docs/user_guide/enhancingperf.html) qui étudie différent scénari avec [cython](https://cython.org/), [eval](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.eval.html), [numba](https://numba.pydata.org/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 1 }