{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 2A.eco - Mise en pratique des s\u00e9ances 1 et 2 - Utilisation de pandas et visualisation - correction\n", "\n", "Correction d'un exercice sur la manipulation des donn\u00e9es."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
run previous cell, wait for 2 seconds
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/plain": ["['Players_WC2014.xlsx', 'velib_t1.txt', 'velib_t2.txt', 'villes.txt']"]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["from pyensae.datasource import download_data\n", "files = download_data(\"td2a_eco_exercices_de_manipulation_de_donnees.zip\",\n", " url=\"https://github.com/sdpython/ensae_teaching_cs/raw/master/_doc/notebooks/td2a_eco/data/\")\n", "files"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercice 1 - manipulation des bases\n", "\n", "Dur\u00e9e : 10 minutes\n", "\n", "1. Importer la base de donn\u00e9es relatives aux joueurs de la Coupe du Monde 2014\n", "2. D\u00e9terminer le nombre de joueurs dans chaque \u00e9quipe et cr\u00e9er un dictionnaire { \u00e9quipe : Nombre de joueurs}\n", "3. D\u00e9terminer quels sont les 3 joueurs qui ont couvert le plus de distance. Y a t il un biais de s\u00e9lection ?\n", "4. Parmis les joueurs qui sont dans le premier d\u00e9cile des joueurs plus rapides, qui a pass\u00e9 le plus clair de son temps \u00e0 courrir sans la balle ?"]}, {"cell_type": "markdown", "metadata": {"collapsed": true}, "source": ["# Import du fichier"]}, {"cell_type": "code", "execution_count": 3, "metadata": {"collapsed": true}, "outputs": [], "source": ["import pandas as pd\n", "data_players = pd.read_excel(\"Players_WC2014.xlsx\", engine='openpyxl')"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Clearance RateColorsDistance Covered In PossessionDistance Covered Not In PossessionDistance CoveredLocationNameTeamTop Speed% Passes CompletedAssistsAttempted ClearancesAttempts In The AreaAttempts Off-TargetAttempts On TargetAttempts On-Target From Inside The AreaAttempts On-Target From Outside The AreaAttempts Outside The Area
0NaNNaNNaNNaNNaNAlgeriaCedric SI MOHAMMEDAlgeriaNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
1NaNNaNNaNNaNNaNAlgeriaLiassine CADAMUROAlgeriaNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2NaNNaNNaNNaNNaNAlgeriaMohamed ZEMMAMOUCHEAlgeriaNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
3NaNNaN2.1km4.2km8.7kmAlgeriaRiyad MAHREZAlgeria23.9km/h0.55NaNNaN1.01.00.00.00.00.0
4-NaN5.0km10.0km20.6kmAlgeriaSaphir TAIDERAlgeria25.7km/h0.74NaN0.0NaNNaNNaNNaNNaNNaN
\n", "
"], "text/plain": [" Clearance Rate Colors Distance Covered In Possession \\\n", "0 NaN NaN NaN \n", "1 NaN NaN NaN \n", "2 NaN NaN NaN \n", "3 NaN NaN 2.1km \n", "4 - NaN 5.0km \n", "\n", " Distance Covered Not In Possession Distance Covered Location \\\n", "0 NaN NaN Algeria \n", "1 NaN NaN Algeria \n", "2 NaN NaN Algeria \n", "3 4.2km 8.7km Algeria \n", "4 10.0km 20.6km Algeria \n", "\n", " Name Team Top Speed % Passes Completed Assists \\\n", "0 Cedric SI MOHAMMED Algeria NaN NaN NaN \n", "1 Liassine CADAMURO Algeria NaN NaN NaN \n", "2 Mohamed ZEMMAMOUCHE Algeria NaN NaN NaN \n", "3 Riyad MAHREZ Algeria 23.9km/h 0.55 NaN \n", "4 Saphir TAIDER Algeria 25.7km/h 0.74 NaN \n", "\n", " Attempted Clearances Attempts In The Area Attempts Off-Target \\\n", "0 NaN NaN NaN \n", "1 NaN NaN NaN \n", "2 NaN NaN NaN \n", "3 NaN 1.0 1.0 \n", "4 0.0 NaN NaN \n", "\n", " Attempts On Target Attempts On-Target From Inside The Area \\\n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 0.0 0.0 \n", "4 NaN NaN \n", "\n", " Attempts On-Target From Outside The Area Attempts Outside The Area \n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 0.0 0.0 \n", "4 NaN NaN "]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["data_players.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Nombre de joueurs par \u00e9quipe"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/plain": ["{'Algeria': 23,\n", " 'Argentina': 23,\n", " 'Australia': 23,\n", " 'Belgium': 23,\n", " 'Bosnia': 23,\n", " 'Brazil': 23,\n", " 'Cameroon': 23,\n", " 'Chile': 23,\n", " 'Colombia': 23,\n", " 'Costa Rica': 23,\n", " 'Croatia': 23,\n", " \"C\u00f4te d'Ivoire\": 23,\n", " 'Ecuador': 22,\n", " 'England': 22,\n", " 'France': 22,\n", " 'Germany': 22,\n", " 'Ghana': 23,\n", " 'Greece': 23,\n", " 'Honduras': 23,\n", " 'Iran': 23,\n", " 'Italy': 23,\n", " 'Japan': 23,\n", " 'Korea Republic': 23,\n", " 'Mexico': 23,\n", " 'Netherlands': 23,\n", " 'Nigeria': 23,\n", " 'Portugal': 23,\n", " 'Russia': 23,\n", " 'Spain': 23,\n", " 'Switzerland': 23,\n", " 'USA': 23,\n", " 'Uruguay': 23}"]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["data_players.groupby(['Team']).size().to_dict()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Les joueurs ayant couvert le plus de distance"]}, {"cell_type": "code", "execution_count": 6, "metadata": {"collapsed": true}, "outputs": [], "source": ["## quels joueurs ont couvert le plus de distance ? \n", "data_players['Distance Covered'] = data_players['Distance Covered'].str.replace('km','')"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Clearance RateColorsDistance Covered In PossessionDistance Covered Not In PossessionDistance CoveredLocationNameTeamTop Speed% Passes CompletedAssistsAttempted ClearancesAttempts In The AreaAttempts Off-TargetAttempts On TargetAttempts On-Target From Inside The AreaAttempts On-Target From Outside The AreaAttempts Outside The Area
3621Top 434.5km26.4km84.0GermanyThomas MUELLERGermany30.5km/h0.693.03.012.06.010.07.03.04.0
3530.5Top 434.3km28.0km82.6GermanyToni KROOSGermany29.8km/h0.854.02.02.05.011.02.09.014.0
421Top 429.5km30.7km81.2ArgentinaJavier MASCHERANOArgentina30.3km/h0.86NaN8.00.00.04.00.04.04.0
\n", "
"], "text/plain": [" Clearance Rate Colors Distance Covered In Possession \\\n", "362 1 Top 4 34.5km \n", "353 0.5 Top 4 34.3km \n", "42 1 Top 4 29.5km \n", "\n", " Distance Covered Not In Possession Distance Covered Location \\\n", "362 26.4km 84.0 Germany \n", "353 28.0km 82.6 Germany \n", "42 30.7km 81.2 Argentina \n", "\n", " Name Team Top Speed % Passes Completed Assists \\\n", "362 Thomas MUELLER Germany 30.5km/h 0.69 3.0 \n", "353 Toni KROOS Germany 29.8km/h 0.85 4.0 \n", "42 Javier MASCHERANO Argentina 30.3km/h 0.86 NaN \n", "\n", " Attempted Clearances Attempts In The Area Attempts Off-Target \\\n", "362 3.0 12.0 6.0 \n", "353 2.0 2.0 5.0 \n", "42 8.0 0.0 0.0 \n", "\n", " Attempts On Target Attempts On-Target From Inside The Area \\\n", "362 10.0 7.0 \n", "353 11.0 2.0 \n", "42 4.0 0.0 \n", "\n", " Attempts On-Target From Outside The Area Attempts Outside The Area \n", "362 3.0 4.0 \n", "353 9.0 14.0 \n", "42 4.0 4.0 "]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["data_players['Distance Covered'] = pd.to_numeric(data_players['Distance Covered'])\n", "data_players.sort_values(['Distance Covered'], ascending = 0).head(n=3)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On voit un clair effet de s\u00e9lection sur cette variable : ce sont les joueurs dont les \u00e9quipes ont \u00e9t\u00e9 le plus loin dans la comp\u00e9tition qui couvert le plus de distance."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Qui a \u00e9t\u00e9 le plus efficace ?\n", "\n", "On a besoin de rendre la variable Top Speed num\u00e9rique, et de cr\u00e9er une nouvelle variable avec le poucentage de possession de balle"]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Clearance RateColorsDistance Covered In PossessionDistance Covered Not In PossessionDistance CoveredLocationNameTeamTop Speed% Passes CompletedAssistsAttempted ClearancesAttempts In The AreaAttempts Off-TargetAttempts On TargetAttempts On-Target From Inside The AreaAttempts On-Target From Outside The AreaAttempts Outside The Area
2220.89Quarter Finals15.4km22.7km56.2Costa RicaJunior DIAZCosta Rica33.80.65NaN9.0NaNNaNNaNNaNNaNNaN
2470.75Group Stage10.5km9.3km28.8C\u00f4te d'IvoireSerge AURIERC\u00f4te d'Ivoire33.50.73NaN4.02.00.03.02.01.01.0
320.5Top 425.1km19.2km63.6ArgentinaGonzalo HIGUAINArgentina33.10.751.04.011.07.06.06.00.02.0
\n", "
"], "text/plain": [" Clearance Rate Colors Distance Covered In Possession \\\n", "222 0.89 Quarter Finals 15.4km \n", "247 0.75 Group Stage 10.5km \n", "32 0.5 Top 4 25.1km \n", "\n", " Distance Covered Not In Possession Distance Covered Location \\\n", "222 22.7km 56.2 Costa Rica \n", "247 9.3km 28.8 C\u00f4te d'Ivoire \n", "32 19.2km 63.6 Argentina \n", "\n", " Name Team Top Speed % Passes Completed Assists \\\n", "222 Junior DIAZ Costa Rica 33.8 0.65 NaN \n", "247 Serge AURIER C\u00f4te d'Ivoire 33.5 0.73 NaN \n", "32 Gonzalo HIGUAIN Argentina 33.1 0.75 1.0 \n", "\n", " Attempted Clearances Attempts In The Area Attempts Off-Target \\\n", "222 9.0 NaN NaN \n", "247 4.0 2.0 0.0 \n", "32 4.0 11.0 7.0 \n", "\n", " Attempts On Target Attempts On-Target From Inside The Area \\\n", "222 NaN NaN \n", "247 3.0 2.0 \n", "32 6.0 6.0 \n", "\n", " Attempts On-Target From Outside The Area Attempts Outside The Area \n", "222 NaN NaN \n", "247 1.0 1.0 \n", "32 0.0 2.0 "]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["## Qui a \u00e9t\u00e9 le plus rapide ?\n", "\n", "data_players['Top Speed'] = data_players['Top Speed'].str.replace('km/h','')\n", "data_players['Top Speed'] = pd.to_numeric(data_players['Top Speed'])\n", "data_players.sort_values(['Top Speed'], ascending = 0).head(n=3)"]}, {"cell_type": "code", "execution_count": 9, "metadata": {"collapsed": true}, "outputs": [], "source": ["## Parmis ceux qui sont dans le d\u00e9cile des plus rapides, qui a pass\u00e9 le plus clair de son temps \u00e0 courrir sans la balle ?\n", "data_players['Distance Covered In Possession'] = data_players['Distance Covered In Possession'].str.replace('km','')\n", "data_players['Distance Covered In Possession'] = pd.to_numeric(data_players['Distance Covered In Possession'])\n", "data_players['Share of Possession'] = data_players['Distance Covered In Possession']/data_players['Distance Covered']"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Clearance RateColorsDistance Covered In PossessionDistance Covered Not In PossessionDistance CoveredLocationNameTeamTop Speed% Passes CompletedAssistsAttempted ClearancesAttempts In The AreaAttempts Off-TargetAttempts On TargetAttempts On-Target From Inside The AreaAttempts On-Target From Outside The AreaAttempts Outside The AreaShare of Possession
3631Group Stage6.44.2km14.6GermanyShkodran MUSTAFIGermany32.90.80NaN1.01.00.01.01.00.00.00.438356
102-NaN4.73.0km10.8BosniaAvdija VRSAJEVICBosnia32.10.750.00.02.01.01.01.00.00.00.435185
472-Group Stage9.06.0km20.8ItalyMario BALOTELLIItaly33.00.650.00.03.03.04.02.02.04.00.432692
351-Top 431.623.6km73.2GermanyMesut OEZILGermany32.00.791.00.08.03.07.06.01.02.00.431694
646NaNGroup Stage13.710.4km31.8SpainJordi ALBASpain31.60.82NaNNaN2.02.01.01.00.01.00.430818
\n", "
"], "text/plain": [" Clearance Rate Colors Distance Covered In Possession \\\n", "363 1 Group Stage 6.4 \n", "102 - NaN 4.7 \n", "472 - Group Stage 9.0 \n", "351 - Top 4 31.6 \n", "646 NaN Group Stage 13.7 \n", "\n", " Distance Covered Not In Possession Distance Covered Location \\\n", "363 4.2km 14.6 Germany \n", "102 3.0km 10.8 Bosnia \n", "472 6.0km 20.8 Italy \n", "351 23.6km 73.2 Germany \n", "646 10.4km 31.8 Spain \n", "\n", " Name Team Top Speed % Passes Completed Assists \\\n", "363 Shkodran MUSTAFI Germany 32.9 0.80 NaN \n", "102 Avdija VRSAJEVIC Bosnia 32.1 0.75 0.0 \n", "472 Mario BALOTELLI Italy 33.0 0.65 0.0 \n", "351 Mesut OEZIL Germany 32.0 0.79 1.0 \n", "646 Jordi ALBA Spain 31.6 0.82 NaN \n", "\n", " Attempted Clearances Attempts In The Area Attempts Off-Target \\\n", "363 1.0 1.0 0.0 \n", "102 0.0 2.0 1.0 \n", "472 0.0 3.0 3.0 \n", "351 0.0 8.0 3.0 \n", "646 NaN 2.0 2.0 \n", "\n", " Attempts On Target Attempts On-Target From Inside The Area \\\n", "363 1.0 1.0 \n", "102 1.0 1.0 \n", "472 4.0 2.0 \n", "351 7.0 6.0 \n", "646 1.0 1.0 \n", "\n", " Attempts On-Target From Outside The Area Attempts Outside The Area \\\n", "363 0.0 0.0 \n", "102 0.0 0.0 \n", "472 2.0 4.0 \n", "351 1.0 2.0 \n", "646 0.0 1.0 \n", "\n", " Share of Possession \n", "363 0.438356 \n", "102 0.435185 \n", "472 0.432692 \n", "351 0.431694 \n", "646 0.430818 "]}, "execution_count": 11, "metadata": {}, "output_type": "execute_result"}], "source": ["data_players[data_players['Top Speed'] > data_players['Top Speed'].\n", " quantile(.90)].sort_values(['Share of Possession'], ascending = 0).head()"]}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1"}}, "nbformat": 4, "nbformat_minor": 2}