.. _chshpandasrst:

===================================
Uncommon operations with dataframes
===================================

.. only:: html

    **Links:** :download:`notebook `, :downloadlink:`html `, :download:`PDF `, :download:`python `, :downloadlink:`slides `, :githublink:`GitHub|_doc/notebooks/cheat_sheets/chsh_pandas.ipynb|*`

Cheat sheet on uncommon operations with pandas, such as reading a big file.

.. code:: ipython3

    from jyquickhelper import add_notebook_menu
    add_notebook_menu()

.. contents::
    :local:

Pointers to notebooks
---------------------

-  `Rappel de ce que vous savez déjà mais avez peut-être oublié `__
   (a reminder of what you already know but may have forgotten)
-  `Python pour un Data Scientist / Economiste `__
   (Python for a data scientist / economist)
-  `Exercices Pratiques `__
   (practical exercises)

List of strings into binary features
------------------------------------

.. code:: ipython3

    import pandas
    df = pandas.DataFrame([{"target":0, "features":["a", "b", "c"]},
                           {"target":1, "features":["a", "b"]},
                           {"target":2, "features":["c", "b"]}])
    df

.. parsed-literal::

        features  target
    0  [a, b, c]       0
    1     [a, b]       1
    2     [c, b]       2


.. code:: ipython3

    df.features.str.join("*").str.get_dummies("*")

.. parsed-literal::

       a  b  c
    0  1  1  1
    1  1  1  0
    2  0  1  1

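
The ``join`` / ``get_dummies`` combination only works if the separator (``"*"`` here) never appears inside a feature name. A minimal sketch of an alternative that needs no separator, assuming a pandas version that provides ``Series.explode`` (0.25 or later): explode the lists into one row per feature, build indicator columns, then collapse back to one row per original index. The names ``exploded`` and ``binary`` are illustrative only.

.. code:: ipython3

    import pandas

    df = pandas.DataFrame([{"target": 0, "features": ["a", "b", "c"]},
                           {"target": 1, "features": ["a", "b"]},
                           {"target": 2, "features": ["c", "b"]}])

    # One row per (original index, feature): the index repeats for each list item.
    exploded = df["features"].explode()

    # One indicator column per distinct feature, still one row per list item.
    indicators = pandas.get_dummies(exploded, dtype=int)

    # Collapse back to one row per original index; max keeps the values in {0, 1}
    # even if a feature is repeated inside a list.
    binary = indicators.groupby(level=0).max()

``binary`` has the same rows and columns as the result of ``get_dummies("*")``, without reserving any character as a separator.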

Big files
---------

Let’s save some data first.

.. code:: ipython3

    from sklearn.datasets import load_breast_cancer
    import pandas

    data = load_breast_cancer()
    df = pandas.DataFrame(data.data, columns=data.feature_names)
    df.to_csv("cancer.txt", sep="\t", encoding="utf-8", index=False)

first lines: nrows
~~~~~~~~~~~~~~~~~~

``nrows`` limits the number of data rows which are read. Without ``sep``, ``read_csv`` assumes a comma separator and does not split the tab-separated file.

.. code:: ipython3

    df = pandas.read_csv("cancer.txt", nrows=3)
    df

Every line ends up in a single column::

       mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension radius error texture error perimeter error area error smoothness error compactness error concavity error concave points error symmetry error fractal dimension error worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
    0  17.99\t10.38\t122.8\t1001.0\t0.1184\t0.2776\t0...
    1  20.57\t17.77\t132.9\t1326.0\t0.08474\t0.07864\...
    2  19.69\t21.25\t130.0\t1203.0\t0.1096\t0.1599\t0...

Passing ``sep="\t"`` restores the 30 columns.

.. code:: ipython3

    df = pandas.read_csv("cancer.txt", nrows=3, sep="\t")
    df

.. parsed-literal::

       mean radius  mean texture  mean perimeter  mean area  mean smoothness  mean compactness  mean concavity  mean concave points  mean symmetry  mean fractal dimension  ...  worst radius  worst texture  worst perimeter  worst area  worst smoothness  worst compactness  worst concavity  worst concave points  worst symmetry  worst fractal dimension
    0        17.99         10.38           122.8     1001.0          0.11840           0.27760          0.3001              0.14710         0.2419                 0.07871  ...         25.38          17.33            184.6      2019.0            0.1622             0.6656           0.7119                0.2654          0.4601                  0.11890
    1        20.57         17.77           132.9     1326.0          0.08474           0.07864          0.0869              0.07017         0.1812                 0.05667  ...         24.99          23.41            158.8      1956.0            0.1238             0.1866           0.2416                0.1860          0.2750                  0.08902
    2        19.69         21.25           130.0     1203.0          0.10960           0.15990          0.1974              0.12790         0.2069                 0.05999  ...         23.57          25.53            152.5      1709.0            0.1444             0.4245           0.4504                0.2430          0.3613                  0.08758

    3 rows × 30 columns
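
When only a few columns of a big file are needed, the header can be inspected on its own before any data is loaded. The sketch below is an illustration, not part of the original notebook: it assumes a small stand-in file ``demo.txt`` written to a temporary directory with made-up columns ``x``, ``y``, ``z``. It combines ``nrows=0`` (header only) with ``usecols`` (subset of columns).

.. code:: ipython3

    import os
    import tempfile
    import pandas

    # Hypothetical stand-in for a big tab-separated file.
    path = os.path.join(tempfile.mkdtemp(), "demo.txt")
    demo = pandas.DataFrame({"x": range(10), "y": range(10, 20), "z": range(20, 30)})
    demo.to_csv(path, sep="\t", index=False)

    # nrows=0 parses the header line only: an empty dataframe with the column names.
    columns = list(pandas.read_csv(path, sep="\t", nrows=0).columns)

    # Read the first three rows of two columns only.
    head = pandas.read_csv(path, sep="\t", nrows=3, usecols=["x", "z"])

Checking ``columns`` first is also a cheap way to detect a wrong separator: a single, very long column name means the line was not split.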

middle lines: nrows + skiprows
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``skiprows=100`` skips the first 100 lines of the file, header included, so ``header=None`` tells pandas not to interpret the next line as column names. The columns are then numbered from 0 to 29.

.. code:: ipython3

    df = pandas.read_csv("cancer.txt", nrows=3, skiprows=100, sep="\t", header=None)
    df

.. parsed-literal::

            0      1      2      3        4        5        6        7       8        9  ...     20     21      22     23      24      25      26      27      28       29
    0  14.420  19.77  94.48  642.5  0.09752  0.11410  0.09388  0.05839  0.1879  0.06390  ...  16.33  30.86  109.50  826.4  0.1431  0.3026  0.3194  0.1565  0.2718  0.09353
    1  13.610  24.98  88.05  582.7  0.09488  0.08511  0.08625  0.04489  0.1609  0.05871  ...  16.99  35.27  108.60  906.5  0.1265  0.1943  0.3169  0.1184  0.2651  0.07397
    2   6.981  13.43  43.79  143.5  0.11700  0.07568  0.00000  0.00000  0.1930  0.07818  ...   7.93  19.54   50.41  185.2  0.1584  0.1202  0.0000  0.0000  0.2932  0.09382

    3 rows × 30 columns
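
To keep the column names while reading lines from the middle of a file, ``skiprows`` also accepts a list-like of line numbers, which makes it possible to skip data lines but not line 0. A minimal sketch, again on a hypothetical stand-in file ``demo.txt`` rather than on ``cancer.txt``:

.. code:: ipython3

    import os
    import tempfile
    import pandas

    # Hypothetical stand-in for a big tab-separated file.
    path = os.path.join(tempfile.mkdtemp(), "demo.txt")
    pandas.DataFrame({"x": range(10), "y": range(10, 20)}).to_csv(
        path, sep="\t", index=False)

    # Line 0 is the header. Skipping lines 1 to 4 drops the first four data rows
    # and keeps the header, so header=None is not needed.
    middle = pandas.read_csv(path, sep="\t", skiprows=range(1, 5), nrows=3)

``middle`` holds the fifth to seventh data rows with their original column names.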

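big files: iterator
~~~~~~~~~~~~~~~~~~~

With ``chunksize``, ``read_csv`` returns an iterator on dataframes of at most ``chunksize`` rows instead of loading the whole file.

.. code:: ipython3

    for piece, df in enumerate(pandas.read_csv("cancer.txt", iterator=True, sep="\t", chunksize=3)):
        print(piece, df.shape)
        # stop after the fourth chunk
        if piece > 2:
            break

.. parsed-literal::

    0 (3, 30)
    1 (3, 30)
    2 (3, 30)
    3 (3, 30)

sample on big files: iterator + concat
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The file is read by chunks of 30 rows, 3 rows are drawn from each chunk, and the pieces are concatenated. The file holds 569 rows, hence 19 chunks and 19 × 3 = 57 sampled rows.

.. code:: ipython3

    samples = []
    for df in pandas.read_csv("cancer.txt", iterator=True, sep="\t", chunksize=30):
        # draw 3 random rows from the current chunk
        sample = df.sample(3)
        samples.append(sample)
    dfsample = pandas.concat(samples)
    dfsample.shape

.. parsed-literal::

    (57, 30)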
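
The sampling loop above has two fragile points: ``sample(3)`` raises ``ValueError`` if the last chunk holds fewer than 3 rows, and the result changes at every run. The sketch below wraps the loop in a helper which handles both. The function ``sample_by_chunks``, its parameters, and the stand-in file ``demo.txt`` are hypothetical names introduced for this illustration.

.. code:: ipython3

    import os
    import tempfile
    import pandas

    def sample_by_chunks(path, chunksize, n_per_chunk, seed=0, **kwargs):
        "Draws at most n_per_chunk rows from every chunk of a text file."
        samples = []
        reader = pandas.read_csv(path, chunksize=chunksize, **kwargs)
        for i, chunk in enumerate(reader):
            # The last chunk may be shorter than n_per_chunk.
            n = min(n_per_chunk, len(chunk))
            # A different but deterministic seed for every chunk.
            samples.append(chunk.sample(n, random_state=seed + i))
        return pandas.concat(samples)

    # Hypothetical stand-in file: 10 rows read by chunks of 4, 4 and 2 rows.
    path = os.path.join(tempfile.mkdtemp(), "demo.txt")
    pandas.DataFrame({"x": range(10), "y": range(10, 20)}).to_csv(
        path, sep="\t", index=False)

    dfsample = sample_by_chunks(path, chunksize=4, n_per_chunk=3, sep="\t")

Only one chunk and the accumulated samples are held in memory at a time. Note that drawing a fixed number of rows per chunk slightly over-represents the rows of a shorter last chunk.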