{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 2A.data - Pandas and iterators\n", "\n", "[pandas](http://pandas.pydata.org/) tends to use a lot of memory when loading data, roughly three times the size of the file on disk. What can be done when the data does not fit in memory?"]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "code", "execution_count": 2, "metadata": {"collapsed": true}, "outputs": [], "source": ["from sklearn.datasets import load_iris\n", "data = load_iris()"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/html": ["
<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>0</th><th>1</th><th>2</th><th>3</th><th>target</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>5.1</td><td>3.5</td><td>1.4</td><td>0.2</td><td>0</td></tr>\n", "    <tr><th>1</th><td>4.9</td><td>3.0</td><td>1.4</td><td>0.2</td><td>0</td></tr>\n", "    <tr><th>2</th><td>4.7</td><td>3.2</td><td>1.3</td><td>0.2</td><td>0</td></tr>\n", "    <tr><th>3</th><td>4.6</td><td>3.1</td><td>1.5</td><td>0.2</td><td>0</td></tr>\n", "    <tr><th>4</th><td>5.0</td><td>3.6</td><td>1.4</td><td>0.2</td><td>0</td></tr>\n", "  </tbody>\n", "</table>
\n", "
"], "text/plain": ["     0    1    2    3  target\n", "0  5.1  3.5  1.4  0.2       0\n", "1  4.9  3.0  1.4  0.2       0\n", "2  4.7  3.2  1.3  0.2       0\n", "3  4.6  3.1  1.5  0.2       0\n", "4  5.0  3.6  1.4  0.2       0"]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["import pandas\n", "df = pandas.DataFrame(data.data)\n", "df.columns = \"X1 X2 X3 X4\".split()\n", "df[\"target\"] = data.target\n", "df.head()"]}, {"cell_type": "code", "execution_count": 4, "metadata": {"collapsed": true}, "outputs": [], "source": ["df.to_csv(\"iris.txt\", sep=\"\\t\", index=False)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 1: iterating over a large file\n", "\n", "What is the *iterator* parameter of the function [read_csv](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) for? How can it be used to read a large file?"]}, {"cell_type": "code", "execution_count": 5, "metadata": {"collapsed": true}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 2: train/test split\n", "\n", "Use the functions [read_csv](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) and [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to split a large dataset into two sets, train and test."]}, {"cell_type": "code", "execution_count": 6, "metadata": {"collapsed": true}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 3: stratify?\n", "\n", "With respect to the previous function, what can be said about the *stratify* parameter of the function [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)?"]}, {"cell_type": "code", "execution_count": 7, "metadata": {"collapsed": true}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 4: a few ideas for a group by?\n", "\n", "Still on a large file..."]}, {"cell_type": "code", "execution_count": 8, "metadata": {"collapsed": true}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1"}}, "nbformat": 4, "nbformat_minor": 2}