{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 2A.ml - Clustering\n", "\n", "This notebook uses Chicago's bike-share data, [Divvy Data](https://www.divvybikes.com/system-data). It is inspired by the [City Bike](http://www.xavierdupre.fr/app/ensae_projects/helpsphinx/challenges/city_bike.html) challenge, created to uncover the habits of the city's residents. The idea is to explore several clustering algorithms, see how to massage the data so that they work, and draw a few lessons from the results."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": ["%matplotlib inline"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## The data\n", "\n", "The data were preprocessed following the notebook [Bike Pattern 2](http://www.xavierdupre.fr/app/ensae_projects/helpsphinx/notebooks/city_bike_solution_cluster_start.html). For each station and weekday, they give the distribution over 10-minute time slots of the number of bikes departing (*startdist*) and arriving (*stopdist*). Clustering is used to uncover the different ways Chicago residents use the bikes, the intuition being that they mostly ride back and forth between home and work. The same idea, applied to Paris, is illustrated in this blog post: [Busy areas in Paris](http://www.xavierdupre.fr/blog/2013-09-26_nojs.html)."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/plain": ["['.\\\\features_bike_chicago.txt']"]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["from pyensae.datasource import download_data\n", "file = download_data(\"features_bike_chicago.zip\")\n", "file"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/html": ["
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>station_id</th><th>station_name</th><th>weekday</th><th>(startdist, 00:00:00)</th><th>(startdist, 00:10:00)</th><th>(startdist, 00:20:00)</th><th>(startdist, 00:30:00)</th><th>(startdist, 00:40:00)</th><th>(startdist, 00:50:00)</th><th>(startdist, 01:00:00)</th><th>...</th><th>(stopdist, 22:20:00)</th><th>(stopdist, 22:30:00)</th><th>(stopdist, 22:40:00)</th><th>(stopdist, 22:50:00)</th><th>(stopdist, 23:00:00)</th><th>(stopdist, 23:10:00)</th><th>(stopdist, 23:20:00)</th><th>(stopdist, 23:30:00)</th><th>(stopdist, 23:40:00)</th><th>(stopdist, 23:50:00)</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>2.0</td><td>Michigan Ave &amp; Balbo Ave</td><td>0.0</td><td>0.003756</td><td>0.000939</td><td>0.002817</td><td>0.000000</td><td>0.000000</td><td>0.003756</td><td>0.000000</td><td>...</td><td>0.004381</td><td>0.002191</td><td>0.004381</td><td>0.002191</td><td>0.004381</td><td>0.004381</td><td>0.005476</td><td>0.002191</td><td>0.000000</td><td>0.005476</td></tr>\n",
"    <tr><th>1</th><td>2.0</td><td>Michigan Ave &amp; Balbo Ave</td><td>1.0</td><td>0.000000</td><td>0.000000</td><td>0.001106</td><td>0.001106</td><td>0.001106</td><td>0.002212</td><td>0.000000</td><td>...</td><td>0.009371</td><td>0.012048</td><td>0.006693</td><td>0.004016</td><td>0.005355</td><td>0.006693</td><td>0.002677</td><td>0.000000</td><td>0.000000</td><td>0.000000</td></tr>\n",
"    <tr><th>2</th><td>2.0</td><td>Michigan Ave &amp; Balbo Ave</td><td>2.0</td><td>0.001357</td><td>0.002714</td><td>0.000000</td><td>0.001357</td><td>0.000000</td><td>0.005427</td><td>0.000000</td><td>...</td><td>0.002907</td><td>0.002907</td><td>0.015988</td><td>0.005814</td><td>0.001453</td><td>0.001453</td><td>0.011628</td><td>0.000000</td><td>0.000000</td><td>0.007267</td></tr>\n",
"    <tr><th>3</th><td>2.0</td><td>Michigan Ave &amp; Balbo Ave</td><td>3.0</td><td>0.000000</td><td>0.004144</td><td>0.000000</td><td>0.000000</td><td>0.002762</td><td>0.004144</td><td>0.000000</td><td>...</td><td>0.009274</td><td>0.003091</td><td>0.003091</td><td>0.007728</td><td>0.001546</td><td>0.003091</td><td>0.009274</td><td>0.001546</td><td>0.007728</td><td>0.001546</td></tr>\n",
"    <tr><th>4</th><td>2.0</td><td>Michigan Ave &amp; Balbo Ave</td><td>4.0</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.002846</td><td>0.000000</td><td>0.000000</td><td>0.000949</td><td>...</td><td>0.008214</td><td>0.001027</td><td>0.006160</td><td>0.004107</td><td>0.015400</td><td>0.006160</td><td>0.002053</td><td>0.006160</td><td>0.007187</td><td>0.000000</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>5 rows \u00d7 291 columns</p>\n",
"</div>
"], "text/plain": [" station_id station_name weekday (startdist, 00:00:00) \\\n", "0 2.0 Michigan Ave & Balbo Ave 0.0 0.003756 \n", "1 2.0 Michigan Ave & Balbo Ave 1.0 0.000000 \n", "2 2.0 Michigan Ave & Balbo Ave 2.0 0.001357 \n", "3 2.0 Michigan Ave & Balbo Ave 3.0 0.000000 \n", "4 2.0 Michigan Ave & Balbo Ave 4.0 0.000000 \n", "\n", " (startdist, 00:10:00) (startdist, 00:20:00) (startdist, 00:30:00) \\\n", "0 0.000939 0.002817 0.000000 \n", "1 0.000000 0.001106 0.001106 \n", "2 0.002714 0.000000 0.001357 \n", "3 0.004144 0.000000 0.000000 \n", "4 0.000000 0.000000 0.002846 \n", "\n", " (startdist, 00:40:00) (startdist, 00:50:00) (startdist, 01:00:00) \\\n", "0 0.000000 0.003756 0.000000 \n", "1 0.001106 0.002212 0.000000 \n", "2 0.000000 0.005427 0.000000 \n", "3 0.002762 0.004144 0.000000 \n", "4 0.000000 0.000000 0.000949 \n", "\n", " ... (stopdist, 22:20:00) (stopdist, 22:30:00) \\\n", "0 ... 0.004381 0.002191 \n", "1 ... 0.009371 0.012048 \n", "2 ... 0.002907 0.002907 \n", "3 ... 0.009274 0.003091 \n", "4 ... 
0.008214 0.001027 \n", "\n", " (stopdist, 22:40:00) (stopdist, 22:50:00) (stopdist, 23:00:00) \\\n", "0 0.004381 0.002191 0.004381 \n", "1 0.006693 0.004016 0.005355 \n", "2 0.015988 0.005814 0.001453 \n", "3 0.003091 0.007728 0.001546 \n", "4 0.006160 0.004107 0.015400 \n", "\n", " (stopdist, 23:10:00) (stopdist, 23:20:00) (stopdist, 23:30:00) \\\n", "0 0.004381 0.005476 0.002191 \n", "1 0.006693 0.002677 0.000000 \n", "2 0.001453 0.011628 0.000000 \n", "3 0.003091 0.009274 0.001546 \n", "4 0.006160 0.002053 0.006160 \n", "\n", " (stopdist, 23:40:00) (stopdist, 23:50:00) \n", "0 0.000000 0.005476 \n", "1 0.000000 0.000000 \n", "2 0.000000 0.007267 \n", "3 0.007728 0.001546 \n", "4 0.007187 0.000000 \n", "\n", "[5 rows x 291 columns]"]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["import pandas\n", "# The file has two header rows, read here as a two-level column index.\n", "features = pandas.read_csv(\"features_bike_chicago.txt\", sep=\"\\t\", encoding=\"utf-8\", low_memory=False, header=[0,1])\n", "# Flatten the columns: plain names for the first three, (kind, time slot) tuples for the rest.\n", "features.columns = [\"station_id\", \"station_name\", \"weekday\"] + list(features.columns[3:])\n", "features.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 1: small clusters\n", "\n", "What should be done with small clusters?"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 2: other types of clustering\n", "\n", "Let's try clustering algorithms that do not require choosing the number of clusters in advance.\n", "\n", "1. Try [DBSCAN](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html). Does it work? If not, why?\n", "2. Once you know why, find a way to fix it."]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0"}}, "nbformat": 4, "nbformat_minor": 2}