{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Bike Pattern 2\n", "\n", "We apply a little machine learning to [Divvy Data](https://www.divvybikes.com/system-data) to look for a better division of Chicago. The goal is to identify usage patterns among bike stations."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": ["%matplotlib inline"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## The data\n", "\n", "[Divvy Data](https://www.divvybikes.com/system-data) publishes a sample of the data. "]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": ["from pyensae.datasource import download_data\n", "file = download_data(\"Divvy_Trips_2016_Q3Q4.zip\", url=\"https://s3.amazonaws.com/divvy-data/tripdata/\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We know the stations."]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": ["import pandas\n", "stations = pandas.read_csv(\"Divvy_Stations_2016_Q3.csv\")\n", "bikes = pandas.concat([pandas.read_csv(\"Divvy_Trips_2016_Q3.csv\"),\n", " pandas.read_csv(\"Divvy_Trips_2016_Q4.csv\")])"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/html": ["
<table>\n",
"<thead><tr><th></th><th>trip_id</th><th>starttime</th><th>stoptime</th><th>bikeid</th><th>tripduration</th><th>from_station_id</th><th>from_station_name</th><th>to_station_id</th><th>to_station_name</th><th>usertype</th><th>gender</th><th>birthyear</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>12150160</td><td>9/30/2016 23:59:58</td><td>10/1/2016 00:04:03</td><td>4959</td><td>245</td><td>69</td><td>Damen Ave &amp; Pierce Ave</td><td>17</td><td>Wood St &amp; Division St</td><td>Subscriber</td><td>Male</td><td>1988.0</td></tr>\n",
"<tr><th>1</th><td>12150159</td><td>9/30/2016 23:59:58</td><td>10/1/2016 00:04:09</td><td>2589</td><td>251</td><td>383</td><td>Ashland Ave &amp; Harrison St</td><td>320</td><td>Loomis St &amp; Lexington St</td><td>Subscriber</td><td>Female</td><td>1990.0</td></tr>\n",
"<tr><th>2</th><td>12150158</td><td>9/30/2016 23:59:51</td><td>10/1/2016 00:24:51</td><td>3656</td><td>1500</td><td>302</td><td>Sheffield Ave &amp; Wrightwood Ave</td><td>334</td><td>Lake Shore Dr &amp; Belmont Ave</td><td>Customer</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>3</th><td>12150157</td><td>9/30/2016 23:59:51</td><td>10/1/2016 00:03:56</td><td>3570</td><td>245</td><td>475</td><td>Washtenaw Ave &amp; Lawrence Ave</td><td>471</td><td>Francisco Ave &amp; Foster Ave</td><td>Subscriber</td><td>Female</td><td>1988.0</td></tr>\n",
"<tr><th>4</th><td>12150156</td><td>9/30/2016 23:59:32</td><td>10/1/2016 00:26:50</td><td>3158</td><td>1638</td><td>302</td><td>Sheffield Ave &amp; Wrightwood Ave</td><td>492</td><td>Leavitt St &amp; Addison St</td><td>Customer</td><td>NaN</td><td>NaN</td></tr>\n",
"</tbody>\n",
"</table>
"], "text/plain": [" trip_id starttime stoptime bikeid tripduration \\\n", "0 12150160 9/30/2016 23:59:58 10/1/2016 00:04:03 4959 245 \n", "1 12150159 9/30/2016 23:59:58 10/1/2016 00:04:09 2589 251 \n", "2 12150158 9/30/2016 23:59:51 10/1/2016 00:24:51 3656 1500 \n", "3 12150157 9/30/2016 23:59:51 10/1/2016 00:03:56 3570 245 \n", "4 12150156 9/30/2016 23:59:32 10/1/2016 00:26:50 3158 1638 \n", "\n", " from_station_id from_station_name to_station_id \\\n", "0 69 Damen Ave & Pierce Ave 17 \n", "1 383 Ashland Ave & Harrison St 320 \n", "2 302 Sheffield Ave & Wrightwood Ave 334 \n", "3 475 Washtenaw Ave & Lawrence Ave 471 \n", "4 302 Sheffield Ave & Wrightwood Ave 492 \n", "\n", " to_station_name usertype gender birthyear \n", "0 Wood St & Division St Subscriber Male 1988.0 \n", "1 Loomis St & Lexington St Subscriber Female 1990.0 \n", "2 Lake Shore Dr & Belmont Ave Customer NaN NaN \n", "3 Francisco Ave & Foster Ave Subscriber Female 1988.0 \n", "4 Leavitt St & Addison St Customer NaN NaN "]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["bikes.head()"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": ["from datetime import datetime, time\n", "df = bikes\n", "df[\"dtstart\"] = pandas.to_datetime(df.starttime, infer_datetime_format=True)\n", "df[\"dtstop\"] = pandas.to_datetime(df.stoptime, infer_datetime_format=True)\n", "\n", "df[\"stopday\"] = df.dtstop.apply(lambda r: datetime(r.year, r.month, r.day))\n", "df[\"stoptime\"] = df.dtstop.apply(lambda r: time(r.hour, r.minute, 0))\n", "df[\"stoptime10\"] = df.dtstop.apply(lambda r: time(r.hour, (r.minute // 10)*10, 0)) # rounded down to the 10-minute slot\n", "\n", "df[\"startday\"] = df.dtstart.apply(lambda r: datetime(r.year, r.month, r.day))\n", "df[\"starttime\"] = df.dtstart.apply(lambda r: time(r.hour, r.minute, 0))\n", "df[\"starttime10\"] = df.dtstart.apply(lambda r: time(r.hour, (r.minute // 10)*10, 0)) # rounded down to the 10-minute slot"]}, {"cell_type": "code", 
"execution_count": 7, "metadata": {}, "outputs": [], "source": ["df['stopweekday'] = df['dtstop'].dt.dayofweek\n", "df['startweekday'] = df['dtstart'].dt.dayofweek"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Normalizing, aggregating and merging per start and stop time"]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/html": ["
<table>\n",
"<thead><tr><th></th><th>to_station_id</th><th>to_station_name</th><th>stopweekday</th><th>stoptime10</th><th>nb_trips</th><th>nb_tripsday</th><th>stopdist</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>2</td><td>Michigan Ave &amp; Balbo Ave</td><td>0</td><td>00:10:00</td><td>2</td><td>913</td><td>0.002191</td></tr>\n",
"<tr><th>1</th><td>2</td><td>Michigan Ave &amp; Balbo Ave</td><td>0</td><td>00:20:00</td><td>2</td><td>913</td><td>0.002191</td></tr>\n",
"<tr><th>2</th><td>2</td><td>Michigan Ave &amp; Balbo Ave</td><td>0</td><td>00:30:00</td><td>2</td><td>913</td><td>0.002191</td></tr>\n",
"<tr><th>3</th><td>2</td><td>Michigan Ave &amp; Balbo Ave</td><td>0</td><td>01:00:00</td><td>3</td><td>913</td><td>0.003286</td></tr>\n",
"<tr><th>4</th><td>2</td><td>Michigan Ave &amp; Balbo Ave</td><td>0</td><td>01:10:00</td><td>2</td><td>913</td><td>0.002191</td></tr>\n",
"</tbody>\n",
"</table>
"], "text/plain": [" to_station_id to_station_name stopweekday stoptime10 nb_trips \\\n", "0 2 Michigan Ave & Balbo Ave 0 00:10:00 2 \n", "1 2 Michigan Ave & Balbo Ave 0 00:20:00 2 \n", "2 2 Michigan Ave & Balbo Ave 0 00:30:00 2 \n", "3 2 Michigan Ave & Balbo Ave 0 01:00:00 3 \n", "4 2 Michigan Ave & Balbo Ave 0 01:10:00 2 \n", "\n", " nb_tripsday stopdist \n", "0 913 0.002191 \n", "1 913 0.002191 \n", "2 913 0.002191 \n", "3 913 0.003286 \n", "4 913 0.002191 "]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["key = [\"to_station_id\", \"to_station_name\", \"stopweekday\", \"stoptime10\"]\n", "keep = key + [\"trip_id\"]\n", "\n", "# arrivals per station, weekday and 10-minute slot\n", "stopaggtime = df[keep].groupby(key, as_index=False).count()\n", "stopaggtime.columns = key + [\"nb_trips\"]\n", "\n", "# arrivals per station and weekday\n", "stopaggday = df[keep[:-2] + [\"trip_id\"]].groupby(key[:-1], as_index=False).count()\n", "stopaggday.columns = key[:-1] + [\"nb_trips\"]\n", "\n", "# share of the weekday's arrivals falling in each slot\n", "stopmerge = stopaggtime.merge(stopaggday, on=key[:-1], suffixes=(\"\", \"day\"))\n", "stopmerge[\"stopdist\"] = stopmerge[\"nb_trips\"] / stopmerge[\"nb_tripsday\"]\n", "stopmerge.head()"]}, {"cell_type": "code", "execution_count": 9, "metadata": {"scrolled": false}, "outputs": [{"data": {"image/png": "\n", "text/plain": ["
"]}, "metadata": {}, "output_type": "display_data"}], "source": ["stopmerge[stopmerge[\"to_station_id\"] == 2] \\\n", " .plot(x=\"stoptime10\", y=\"stopdist\", figsize=(14,4), kind=\"area\");"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"data": {"text/html": ["
<table>\n",
"<thead><tr><th></th><th>from_station_id</th><th>from_station_name</th><th>startweekday</th><th>starttime10</th><th>nb_trips</th><th>nb_tripsday</th><th>startdist</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>2</td><td>Michigan Ave &amp; Balbo Ave</td><td>0</td><td>00:00:00</td><td>4</td><td>1065</td><td>0.003756</td></tr>\n",
"<tr><th>1</th><td>2</td><td>Michigan Ave &amp; Balbo Ave</td><td>0</td><td>00:10:00</td><td>1</td><td>1065</td><td>0.000939</td></tr>\n",
"<tr><th>2</th><td>2</td><td>Michigan Ave &amp; Balbo Ave</td><td>0</td><td>00:20:00</td><td>3</td><td>1065</td><td>0.002817</td></tr>\n",
"<tr><th>3</th><td>2</td><td>Michigan Ave &amp; Balbo Ave</td><td>0</td><td>00:50:00</td><td>4</td><td>1065</td><td>0.003756</td></tr>\n",
"<tr><th>4</th><td>2</td><td>Michigan Ave &amp; Balbo Ave</td><td>0</td><td>01:10:00</td><td>3</td><td>1065</td><td>0.002817</td></tr>\n",
"</tbody>\n",
"</table>
"], "text/plain": [" from_station_id from_station_name startweekday starttime10 \\\n", "0 2 Michigan Ave & Balbo Ave 0 00:00:00 \n", "1 2 Michigan Ave & Balbo Ave 0 00:10:00 \n", "2 2 Michigan Ave & Balbo Ave 0 00:20:00 \n", "3 2 Michigan Ave & Balbo Ave 0 00:50:00 \n", "4 2 Michigan Ave & Balbo Ave 0 01:10:00 \n", "\n", " nb_trips nb_tripsday startdist \n", "0 4 1065 0.003756 \n", "1 1 1065 0.000939 \n", "2 3 1065 0.002817 \n", "3 4 1065 0.003756 \n", "4 3 1065 0.002817 "]}, "execution_count": 11, "metadata": {}, "output_type": "execute_result"}], "source": ["key = [\"from_station_id\", \"from_station_name\", \"startweekday\", \"starttime10\"]\n", "keep = key + [\"trip_id\"]\n", "\n", "# departures per station, weekday and 10-minute slot\n", "startaggtime = df[keep].groupby(key, as_index=False).count()\n", "startaggtime.columns = key + [\"nb_trips\"]\n", "\n", "# departures per station and weekday\n", "startaggday = df[keep[:-2] + [\"trip_id\"]].groupby(key[:-1], as_index=False).count()\n", "startaggday.columns = key[:-1] + [\"nb_trips\"]\n", "\n", "# share of the weekday's departures falling in each slot\n", "startmerge = startaggtime.merge(startaggday, on=key[:-1], suffixes=(\"\", \"day\"))\n", "startmerge[\"startdist\"] = startmerge[\"nb_trips\"] / startmerge[\"nb_tripsday\"]\n", "startmerge.head()"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": ["
"]}, "metadata": {}, "output_type": "display_data"}], "source": ["startmerge[startmerge[\"from_station_id\"] == 2] \\\n", " .plot(x=\"starttime10\", y=\"startdist\", figsize=(14,4), kind=\"area\");"]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/html": ["
<table>\n",
"<thead><tr><th></th><th>to_station_id</th><th>to_station_name</th><th>stopweekday</th><th>stoptime10</th><th>nb_tripsstop</th><th>nb_tripsdaystop</th><th>stopdist</th><th>from_station_id</th><th>from_station_name</th><th>startweekday</th><th>starttime10</th><th>nb_tripsstart</th><th>nb_tripsdaystart</th><th>startdist</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>2.0</td><td>Michigan Ave &amp; Balbo Ave</td><td>0.0</td><td>00:10:00</td><td>2.0</td><td>913.0</td><td>0.002191</td><td>2.0</td><td>Michigan Ave &amp; Balbo Ave</td><td>0.0</td><td>00:10:00</td><td>1.0</td><td>1065.0</td><td>0.000939</td></tr>\n",
"<tr><th>1</th><td>2.0</td><td>Michigan Ave &amp; Balbo Ave</td><td>0.0</td><td>00:20:00</td><td>2.0</td><td>913.0</td><td>0.002191</td><td>2.0</td><td>Michigan Ave &amp; Balbo Ave</td><td>0.0</td><td>00:20:00</td><td>3.0</td><td>1065.0</td><td>0.002817</td></tr>\n",
"<tr><th>2</th><td>2.0</td><td>Michigan Ave &amp; Balbo Ave</td><td>0.0</td><td>00:30:00</td><td>2.0</td><td>913.0</td><td>0.002191</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>3</th><td>2.0</td><td>Michigan Ave &amp; Balbo Ave</td><td>0.0</td><td>01:00:00</td><td>3.0</td><td>913.0</td><td>0.003286</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>4</th><td>2.0</td><td>Michigan Ave &amp; Balbo Ave</td><td>0.0</td><td>01:10:00</td><td>2.0</td><td>913.0</td><td>0.002191</td><td>2.0</td><td>Michigan Ave &amp; Balbo Ave</td><td>0.0</td><td>01:10:00</td><td>3.0</td><td>1065.0</td><td>0.002817</td></tr>\n",
"</tbody>\n",
"</table>
"], "text/plain": [" to_station_id to_station_name stopweekday stoptime10 \\\n", "0 2.0 Michigan Ave & Balbo Ave 0.0 00:10:00 \n", "1 2.0 Michigan Ave & Balbo Ave 0.0 00:20:00 \n", "2 2.0 Michigan Ave & Balbo Ave 0.0 00:30:00 \n", "3 2.0 Michigan Ave & Balbo Ave 0.0 01:00:00 \n", "4 2.0 Michigan Ave & Balbo Ave 0.0 01:10:00 \n", "\n", " nb_tripsstop nb_tripsdaystop stopdist from_station_id \\\n", "0 2.0 913.0 0.002191 2.0 \n", "1 2.0 913.0 0.002191 2.0 \n", "2 2.0 913.0 0.002191 NaN \n", "3 3.0 913.0 0.003286 NaN \n", "4 2.0 913.0 0.002191 2.0 \n", "\n", " from_station_name startweekday starttime10 nb_tripsstart \\\n", "0 Michigan Ave & Balbo Ave 0.0 00:10:00 1.0 \n", "1 Michigan Ave & Balbo Ave 0.0 00:20:00 3.0 \n", "2 NaN NaN NaN NaN \n", "3 NaN NaN NaN NaN \n", "4 Michigan Ave & Balbo Ave 0.0 01:10:00 3.0 \n", "\n", " nb_tripsdaystart startdist \n", "0 1065.0 0.000939 \n", "1 1065.0 0.002817 \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 1065.0 0.002817 "]}, "execution_count": 13, "metadata": {}, "output_type": "execute_result"}], "source": ["everything = stopmerge.merge(startmerge,\n", " left_on=[\"to_station_id\", \"to_station_name\", \"stopweekday\", \"stoptime10\"],\n", " right_on=[\"from_station_id\", \"from_station_name\",\"startweekday\", \"starttime10\"],\n", " suffixes=(\"stop\", \"start\"),\n", " how=\"outer\")\n", "everything.head()"]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"data": {"text/plain": ["(datetime.datetime(2017, 2, 2, 0, 0), datetime.datetime(2017, 2, 2, 0, 0))"]}, "execution_count": 14, "metadata": {}, "output_type": "execute_result"}], "source": ["import numpy\n", "from datetime import datetime, time\n", "\n", "def bestof(x, y):\n", "    # return x unless it is missing (None or NaN), in which case return y\n", "    if isinstance(x, (datetime, time, str)):\n", "        return x\n", "    try:\n", "        if x is None or isinstance(y, (datetime, time, str)) or numpy.isnan(x):\n", "            return y\n", "        else:\n", "            return x\n", "    except Exception:\n", "        # show the offending values before re-raising\n", "        print(type(x), type(y))\n", "        print(x, y)\n", "        raise\n", "\n", 
"bestof(datetime(2017,2,2), numpy.nan), bestof(numpy.nan, datetime(2017,2,2))"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"data": {"text/html": ["
<table>\n",
"<thead><tr><th></th><th>nb_tripsstop</th><th>nb_tripsdaystop</th><th>stopdist</th><th>nb_tripsstart</th><th>nb_tripsdaystart</th><th>startdist</th><th>station_name</th><th>station_id</th><th>time10</th><th>weekday</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>2.0</td><td>913.0</td><td>0.002191</td><td>1.0</td><td>1065.0</td><td>0.000939</td><td>Michigan Ave &amp; Balbo Ave</td><td>2.0</td><td>00:10:00</td><td>0.0</td></tr>\n",
"<tr><th>1</th><td>2.0</td><td>913.0</td><td>0.002191</td><td>3.0</td><td>1065.0</td><td>0.002817</td><td>Michigan Ave &amp; Balbo Ave</td><td>2.0</td><td>00:20:00</td><td>0.0</td></tr>\n",
"<tr><th>2</th><td>2.0</td><td>913.0</td><td>0.002191</td><td>NaN</td><td>NaN</td><td>NaN</td><td>Michigan Ave &amp; Balbo Ave</td><td>2.0</td><td>00:30:00</td><td>0.0</td></tr>\n",
"<tr><th>3</th><td>3.0</td><td>913.0</td><td>0.003286</td><td>NaN</td><td>NaN</td><td>NaN</td><td>Michigan Ave &amp; Balbo Ave</td><td>2.0</td><td>01:00:00</td><td>0.0</td></tr>\n",
"<tr><th>4</th><td>2.0</td><td>913.0</td><td>0.002191</td><td>3.0</td><td>1065.0</td><td>0.002817</td><td>Michigan Ave &amp; Balbo Ave</td><td>2.0</td><td>01:10:00</td><td>0.0</td></tr>\n",
"</tbody>\n",
"</table>
"], "text/plain": [" nb_tripsstop nb_tripsdaystop stopdist nb_tripsstart nb_tripsdaystart \\\n", "0 2.0 913.0 0.002191 1.0 1065.0 \n", "1 2.0 913.0 0.002191 3.0 1065.0 \n", "2 2.0 913.0 0.002191 NaN NaN \n", "3 3.0 913.0 0.003286 NaN NaN \n", "4 2.0 913.0 0.002191 3.0 1065.0 \n", "\n", " startdist station_name station_id time10 weekday \n", "0 0.000939 Michigan Ave & Balbo Ave 2.0 00:10:00 0.0 \n", "1 0.002817 Michigan Ave & Balbo Ave 2.0 00:20:00 0.0 \n", "2 NaN Michigan Ave & Balbo Ave 2.0 00:30:00 0.0 \n", "3 NaN Michigan Ave & Balbo Ave 2.0 01:00:00 0.0 \n", "4 0.002817 Michigan Ave & Balbo Ave 2.0 01:10:00 0.0 "]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["every = everything.copy()\n", "every[\"station_name\"] = every.apply(lambda row: bestof(row[\"to_station_name\"], row[\"from_station_name\"]), axis=1)\n", "every[\"station_id\"] = every.apply(lambda row: bestof(row[\"to_station_id\"], row[\"from_station_id\"]), axis=1)\n", "every[\"time10\"] = every.apply(lambda row: bestof(row[\"stoptime10\"], row[\"starttime10\"]), axis=1)\n", "every[\"weekday\"] = every.apply(lambda row: bestof(row[\"stopweekday\"], row[\"startweekday\"]), axis=1)\n", "every = every.drop([\"stoptime10\", \"starttime10\", \"stopweekday\", \"startweekday\",\n", " \"to_station_id\", \"from_station_id\",\n", " \"to_station_name\", \"from_station_name\"], axis=1)\n", "every.head()"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/plain": ["((357809, 10), (298013, 7), (299700, 7))"]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["every.shape, stopmerge.shape, startmerge.shape"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We need vectors of equal size which means filling NaN values with 0 and adding times when not present."]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"data": {"text/plain": ["Index(['nb_tripsstop', 
'nb_tripsdaystop', 'stopdist', 'nb_tripsstart',\n", " 'nb_tripsdaystart', 'startdist', 'station_name', 'station_id', 'time10',\n", " 'weekday'],\n", " dtype='object')"]}, "execution_count": 17, "metadata": {}, "output_type": "execute_result"}], "source": ["every.columns"]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": ["keys = ['station_name', 'station_id', 'weekday', 'time10']\n", "for c in every.columns:\n", "    if c not in keys:\n", "        # fillna returns a new Series, so the result must be assigned back\n", "        every[c] = every[c].fillna(0)"]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"data": {"text/html": ["
<table>\n",
"<thead><tr><th></th><th>station_name</th><th>station_id</th><th>time10</th><th>weekday</th><th>stopdist</th><th>startdist</th><th>nb_tripsstop</th><th>nb_tripsdaystop</th><th>nb_tripsstart</th><th>nb_tripsdaystart</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>357809</th><td>2112 W Peterson Ave</td><td>456.0</td><td>00:00:00</td><td>0.0</td><td>0.0</td><td>0.000000</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>357810</th><td>2112 W Peterson Ave</td><td>456.0</td><td>00:10:00</td><td>0.0</td><td>0.0</td><td>0.000000</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>357811</th><td>2112 W Peterson Ave</td><td>456.0</td><td>00:20:00</td><td>0.0</td><td>0.0</td><td>0.000000</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>341443</th><td>2112 W Peterson Ave</td><td>456.0</td><td>00:30:00</td><td>0.0</td><td>0.0</td><td>0.021277</td><td>0.0</td><td>0.0</td><td>1.0</td><td>47.0</td></tr>\n",
"<tr><th>357812</th><td>2112 W Peterson Ave</td><td>456.0</td><td>00:40:00</td><td>0.0</td><td>0.0</td><td>0.000000</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"</tbody>\n",
"</table>
"], "text/plain": [" station_name station_id time10 weekday stopdist \\\n", "357809 2112 W Peterson Ave 456.0 00:00:00 0.0 0.0 \n", "357810 2112 W Peterson Ave 456.0 00:10:00 0.0 0.0 \n", "357811 2112 W Peterson Ave 456.0 00:20:00 0.0 0.0 \n", "341443 2112 W Peterson Ave 456.0 00:30:00 0.0 0.0 \n", "357812 2112 W Peterson Ave 456.0 00:40:00 0.0 0.0 \n", "\n", " startdist nb_tripsstop nb_tripsdaystop nb_tripsstart \\\n", "357809 0.000000 0.0 0.0 0.0 \n", "357810 0.000000 0.0 0.0 0.0 \n", "357811 0.000000 0.0 0.0 0.0 \n", "341443 0.021277 0.0 0.0 1.0 \n", "357812 0.000000 0.0 0.0 0.0 \n", "\n", " nb_tripsdaystart \n", "357809 0.0 \n", "357810 0.0 \n", "357811 0.0 \n", "341443 47.0 \n", "357812 0.0 "]}, "execution_count": 19, "metadata": {}, "output_type": "execute_result"}], "source": ["from ensae_projects.datainc.data_bikes import add_missing_time\n", "full = df = add_missing_time(every, delay=10, column=\"time10\", \n", " values=[c for c in every.columns if c not in keys])\n", "full = full[['station_name', 'station_id', 'time10', 'weekday', \n", " 'stopdist', 'startdist',\n", " 'nb_tripsstop', 'nb_tripsdaystop', 'nb_tripsstart',\n", " 'nb_tripsdaystart']].sort_values(keys)\n", "full.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Clustering (stop and start)\n", "\n", "We cluster these distributions to find patterns. This requires vectors of equal size: one value per 10-minute slot, that is 24*6 = 144 values per distribution."]}, {"cell_type": "markdown", "metadata": {}, "source": ["This is much better."]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": ["
"]}, "metadata": {}, "output_type": "display_data"}], "source": ["df = full\n", "import matplotlib.pyplot as plt\n", "fig, ax = plt.subplots(2, 1, figsize=(14,6))\n", "df[df[\"station_id\"] == 2].plot(x=\"time10\", y=\"startdist\", figsize=(14,4), kind=\"area\", ax=ax[0])\n", "df[df[\"station_id\"] == 2].plot(x=\"time10\", y=\"stopdist\", figsize=(14,4), \n", " kind=\"area\", ax=ax[1], color=\"r\");"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Let's build the features."]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [{"data": {"text/html": ["
<table>\n",
"<thead><tr><th></th><th>station_id</th><th>station_name</th><th>weekday</th><th colspan=7>startdist</th><th>...</th><th colspan=10>stopdist</th></tr>\n",
"<tr><th>time10</th><th></th><th></th><th></th><th>00:00:00</th><th>00:10:00</th><th>00:20:00</th><th>00:30:00</th><th>00:40:00</th><th>00:50:00</th><th>01:00:00</th><th>...</th><th>22:20:00</th><th>22:30:00</th><th>22:40:00</th><th>22:50:00</th><th>23:00:00</th><th>23:10:00</th><th>23:20:00</th><th>23:30:00</th><th>23:40:00</th><th>23:50:00</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>2.0</td><td>Michigan Ave &amp; Balbo Ave</td><td>0.0</td><td>0.003756</td><td>0.000939</td><td>0.002817</td><td>0.000000</td><td>0.000000</td><td>0.003756</td><td>0.000000</td><td>...</td><td>0.004381</td><td>0.002191</td><td>0.004381</td><td>0.002191</td><td>0.004381</td><td>0.004381</td><td>0.005476</td><td>0.002191</td><td>0.000000</td><td>0.005476</td></tr>\n",
"<tr><th>1</th><td>2.0</td><td>Michigan Ave &amp; Balbo Ave</td><td>1.0</td><td>0.000000</td><td>0.000000</td><td>0.001106</td><td>0.001106</td><td>0.001106</td><td>0.002212</td><td>0.000000</td><td>...</td><td>0.009371</td><td>0.012048</td><td>0.006693</td><td>0.004016</td><td>0.005355</td><td>0.006693</td><td>0.002677</td><td>0.000000</td><td>0.000000</td><td>0.000000</td></tr>\n",
"<tr><th>2</th><td>2.0</td><td>Michigan Ave &amp; Balbo Ave</td><td>2.0</td><td>0.001357</td><td>0.002714</td><td>0.000000</td><td>0.001357</td><td>0.000000</td><td>0.005427</td><td>0.000000</td><td>...</td><td>0.002907</td><td>0.002907</td><td>0.015988</td><td>0.005814</td><td>0.001453</td><td>0.001453</td><td>0.011628</td><td>0.000000</td><td>0.000000</td><td>0.007267</td></tr>\n",
"<tr><th>3</th><td>2.0</td><td>Michigan Ave &amp; Balbo Ave</td><td>3.0</td><td>0.000000</td><td>0.004144</td><td>0.000000</td><td>0.000000</td><td>0.002762</td><td>0.004144</td><td>0.000000</td><td>...</td><td>0.009274</td><td>0.003091</td><td>0.003091</td><td>0.007728</td><td>0.001546</td><td>0.003091</td><td>0.009274</td><td>0.001546</td><td>0.007728</td><td>0.001546</td></tr>\n",
"<tr><th>4</th><td>2.0</td><td>Michigan Ave &amp; Balbo Ave</td><td>4.0</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.002846</td><td>0.000000</td><td>0.000000</td><td>0.000949</td><td>...</td><td>0.008214</td><td>0.001027</td><td>0.006160</td><td>0.004107</td><td>0.015400</td><td>0.006160</td><td>0.002053</td><td>0.006160</td><td>0.007187</td><td>0.000000</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>5 rows \u00d7 291 columns</p>
"], "text/plain": [" station_id station_name weekday startdist \\\n", "time10 00:00:00 00:10:00 \n", "0 2.0 Michigan Ave & Balbo Ave 0.0 0.003756 0.000939 \n", "1 2.0 Michigan Ave & Balbo Ave 1.0 0.000000 0.000000 \n", "2 2.0 Michigan Ave & Balbo Ave 2.0 0.001357 0.002714 \n", "3 2.0 Michigan Ave & Balbo Ave 3.0 0.000000 0.004144 \n", "4 2.0 Michigan Ave & Balbo Ave 4.0 0.000000 0.000000 \n", "\n", " ... stopdist \\\n", "time10 00:20:00 00:30:00 00:40:00 00:50:00 01:00:00 ... 22:20:00 \n", "0 0.002817 0.000000 0.000000 0.003756 0.000000 ... 0.004381 \n", "1 0.001106 0.001106 0.001106 0.002212 0.000000 ... 0.009371 \n", "2 0.000000 0.001357 0.000000 0.005427 0.000000 ... 0.002907 \n", "3 0.000000 0.000000 0.002762 0.004144 0.000000 ... 0.009274 \n", "4 0.000000 0.002846 0.000000 0.000000 0.000949 ... 0.008214 \n", "\n", " \\\n", "time10 22:30:00 22:40:00 22:50:00 23:00:00 23:10:00 23:20:00 23:30:00 \n", "0 0.002191 0.004381 0.002191 0.004381 0.004381 0.005476 0.002191 \n", "1 0.012048 0.006693 0.004016 0.005355 0.006693 0.002677 0.000000 \n", "2 0.002907 0.015988 0.005814 0.001453 0.001453 0.011628 0.000000 \n", "3 0.003091 0.003091 0.007728 0.001546 0.003091 0.009274 0.001546 \n", "4 0.001027 0.006160 0.004107 0.015400 0.006160 0.002053 0.006160 \n", "\n", " \n", "time10 23:40:00 23:50:00 \n", "0 0.000000 0.005476 \n", "1 0.000000 0.000000 \n", "2 0.000000 0.007267 \n", "3 0.007728 0.001546 \n", "4 0.007187 0.000000 \n", "\n", "[5 rows x 291 columns]"]}, "execution_count": 21, "metadata": {}, "output_type": "execute_result"}], "source": ["features = df.pivot_table(index=[\"station_id\", \"station_name\", \"weekday\"],\n", " columns=\"time10\", values=[\"startdist\", \"stopdist\"]).reset_index()\n", "\n", "features.head()"]}, {"cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [{"data": {"text/plain": ["288"]}, "execution_count": 22, "metadata": {}, "output_type": "execute_result"}], "source": ["names = features.columns[3:]\n", "len(names)"]}, 
{"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [{"data": {"text/plain": ["KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,\n", " n_clusters=8, n_init=10, n_jobs=1, precompute_distances='auto',\n", " random_state=None, tol=0.0001, verbose=0)"]}, "execution_count": 23, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.cluster import KMeans\n", "clus = KMeans(8)\n", "clus.fit(features[names])"]}, {"cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [{"data": {"text/plain": ["{0, 1, 2, 3, 4, 5, 6, 7}"]}, "execution_count": 24, "metadata": {}, "output_type": "execute_result"}], "source": ["pred = clus.predict(features[names])\n", "set(pred)"]}, {"cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": ["features[\"cluster\"] = pred"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Let's see what the clusters mean across days. We need to check whether a cluster is tied to the working days of the week or to the weekend."]}, {"cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["c:\\python370_x64\\lib\\site-packages\\pandas\\core\\generic.py:3111: PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.\n", " obj = obj._drop_axis(labels, axis, level=level, errors=errors)\n"]}, {"data": {"text/html": ["
<table>\n",
"<thead><tr><th></th><th></th><th>station_id</th></tr>\n",
"<tr><th></th><th>time10</th><th></th></tr>\n",
"<tr><th>cluster</th><th>weekday</th><th></th></tr></thead>\n",
"<tbody>\n",
"<tr><th rowspan=3>0</th><th>3.0</th><td>1</td></tr>\n",
"<tr><th>4.0</th><td>1</td></tr>\n",
"<tr><th>6.0</th><td>1</td></tr>\n",
"<tr><th rowspan=7>1</th><th>0.0</th><td>146</td></tr>\n",
"<tr><th>1.0</th><td>110</td></tr>\n",
"<tr><th>2.0</th><td>106</td></tr>\n",
"<tr><th>3.0</th><td>119</td></tr>\n",
"<tr><th>4.0</th><td>143</td></tr>\n",
"<tr><th>5.0</th><td>553</td></tr>\n",
"<tr><th>6.0</th><td>547</td></tr>\n",
"<tr><th rowspan=7>2</th><th>0.0</th><td>137</td></tr>\n",
"<tr><th>1.0</th><td>141</td></tr>\n",
"<tr><th>2.0</th><td>150</td></tr>\n",
"<tr><th>3.0</th><td>147</td></tr>\n",
"<tr><th>4.0</th><td>149</td></tr>\n",
"<tr><th>5.0</th><td>8</td></tr>\n",
"<tr><th>6.0</th><td>12</td></tr>\n",
"<tr><th rowspan=2>3</th><th>0.0</th><td>1</td></tr>\n",
"<tr><th>3.0</th><td>1</td></tr>\n",
"<tr><th rowspan=3>4</th><th>2.0</th><td>1</td></tr>\n",
"<tr><th>3.0</th><td>1</td></tr>\n",
"<tr><th>6.0</th><td>1</td></tr>\n",
"<tr><th>5</th><th>6.0</th><td>1</td></tr>\n",
"<tr><th rowspan=4>6</th><th>0.0</th><td>1</td></tr>\n",
"<tr><th>1.0</th><td>1</td></tr>\n",
"<tr><th>2.0</th><td>1</td></tr>\n",
"<tr><th>6.0</th><td>1</td></tr>\n",
"<tr><th rowspan=7>7</th><th>0.0</th><td>291</td></tr>\n",
"<tr><th>1.0</th><td>326</td></tr>\n",
"<tr><th>2.0</th><td>322</td></tr>\n",
"<tr><th>3.0</th><td>308</td></tr>\n",
"<tr><th>4.0</th><td>287</td></tr>\n",
"<tr><th>5.0</th><td>19</td></tr>\n",
"<tr><th>6.0</th><td>17</td></tr>\n",
"</tbody>\n",
"</table>
"], "text/plain": [" station_id\n", "time10 \n", "cluster weekday \n", "0 3.0 1\n", " 4.0 1\n", " 6.0 1\n", "1 0.0 146\n", " 1.0 110\n", " 2.0 106\n", " 3.0 119\n", " 4.0 143\n", " 5.0 553\n", " 6.0 547\n", "2 0.0 137\n", " 1.0 141\n", " 2.0 150\n", " 3.0 147\n", " 4.0 149\n", " 5.0 8\n", " 6.0 12\n", "3 0.0 1\n", " 3.0 1\n", "4 2.0 1\n", " 3.0 1\n", " 6.0 1\n", "5 6.0 1\n", "6 0.0 1\n", " 1.0 1\n", " 2.0 1\n", " 6.0 1\n", "7 0.0 291\n", " 1.0 326\n", " 2.0 322\n", " 3.0 308\n", " 4.0 287\n", " 5.0 19\n", " 6.0 17"]}, "execution_count": 26, "metadata": {}, "output_type": "execute_result"}], "source": ["features[[\"cluster\", \"weekday\", \"station_id\"]].groupby([\"cluster\", \"weekday\"]).count()"]}, {"cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["c:\\python370_x64\\lib\\site-packages\\pandas\\core\\generic.py:3111: PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.\n", " obj = obj._drop_axis(labels, axis, level=level, errors=errors)\n"]}, {"data": {"image/png": "\n", "text/plain": ["
"]}, "metadata": {}, "output_type": "display_data"}], "source": ["nb = features[[\"cluster\", \"weekday\", \"station_id\"]].groupby([\"cluster\", \"weekday\"]).count()\n", "nb = nb.reset_index()\n", "nb[nb.cluster.isin([0, 3, 5, 6])].pivot(\"weekday\",\"cluster\", \"station_id\").plot(kind=\"bar\");"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Let's draw the clusters."]}, {"cell_type": "code", "execution_count": 27, "metadata": {"scrolled": false}, "outputs": [{"data": {"image/png": "\n", "text/plain": ["
"]}, "metadata": {}, "output_type": "display_data"}], "source": ["centers = clus.cluster_centers_.T\n", "import matplotlib.pyplot as plt\n", "fig, ax = plt.subplots(centers.shape[1], 2, figsize=(10,10))\n", "nbf = centers.shape[0] // 2  # first half: start distribution, second half: stop distribution\n", "x = list(range(0,nbf))\n", "col = 0\n", "dec = 0\n", "colors = [\"red\", \"yellow\", \"gray\", \"green\", \"brown\", \"orange\", \"blue\"]\n", "for i in range(centers.shape[1]):\n", "    # the second half of the clusters goes into the right column\n", "    if 2*i == centers.shape[1]:\n", "        col += 1\n", "        dec += centers.shape[1]\n", "    color = colors[i%len(colors)]\n", "    ax[2*i-dec, col].bar(x, centers[:nbf,i], width=1.0, color=color)\n", "    ax[2*i-dec, col].set_ylabel(\"cluster %d - start\" % i, color=color)\n", "    ax[2*i+1-dec, col].bar(x, centers[nbf:,i], width=1.0, color=color)\n", "    ax[2*i+1-dec, col].set_ylabel(\"cluster %d - stop\" % i, color=color)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Four patterns emerge. The small clusters are a nuisance, but let's show them on a map anyway. The widest distribution is the weekend one."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Graph\n", "\n", "We first need 7 cluster labels for each station, one per day of the week."]}, {"cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["c:\\python370_x64\\lib\\site-packages\\pandas\\core\\generic.py:3111: PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.\n", " obj = obj._drop_axis(labels, axis, level=level, errors=errors)\n"]}, {"data": {"text/html": ["
<table>\n",
"<thead><tr><th>time10</th><th></th><th colspan=7></th></tr>\n",
"<tr><th>weekday</th><th></th><th>0.0</th><th>1.0</th><th>2.0</th><th>3.0</th><th>4.0</th><th>5.0</th><th>6.0</th></tr>\n",
"<tr><th>station_id</th><th>station_name</th><th colspan=7></th></tr></thead>\n",
"<tbody>\n",
"<tr><th>2.0</th><th>Michigan Ave &amp; Balbo Ave</th><td>1.0</td><td>7.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td></tr>\n",
"<tr><th>3.0</th><th>Shedd Aquarium</th><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td></tr>\n",
"<tr><th>4.0</th><th>Burnham Harbor</th><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td></tr>\n",
"<tr><th>5.0</th><th>State St &amp; Harrison St</th><td>7.0</td><td>7.0</td><td>7.0</td><td>7.0</td><td>7.0</td><td>1.0</td><td>1.0</td></tr>\n",
"<tr><th>6.0</th><th>Dusable Harbor</th><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td></tr>\n",
"</tbody>\n",
"</table>
"], "text/plain": ["time10 \n", "weekday 0.0 1.0 2.0 3.0 4.0 5.0 6.0\n", "station_id station_name \n", "2.0 Michigan Ave & Balbo Ave 1.0 7.0 1.0 1.0 1.0 1.0 1.0\n", "3.0 Shedd Aquarium 1.0 1.0 1.0 1.0 1.0 1.0 1.0\n", "4.0 Burnham Harbor 1.0 1.0 1.0 1.0 1.0 1.0 1.0\n", "5.0 State St & Harrison St 7.0 7.0 7.0 7.0 7.0 1.0 1.0\n", "6.0 Dusable Harbor 1.0 1.0 1.0 1.0 1.0 1.0 1.0"]}, "execution_count": 29, "metadata": {}, "output_type": "execute_result"}], "source": ["piv = features.pivot_table(index=[\"station_id\",\"station_name\"], \n", " columns=\"weekday\", values=\"cluster\")\n", "piv.head()"]}, {"cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": ["piv[\"distincts\"] = piv.apply(lambda row: len(set(row[i] for i in range(0,7))), axis=1)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Let's see which stations are classified in more than 4 clusters. NaN means no bikes stopped at this station. These are mostly unused stations."]}, {"cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [{"data": {"text/html": ["
"], "text/plain": ["time10 \\\n", "weekday 0.0 1.0 2.0 3.0 4.0 5.0 6.0 \n", "station_id station_name \n", "391.0 Halsted St & 69th St 3.0 7.0 1.0 7.0 1.0 1.0 2.0 \n", "440.0 Lawndale Ave & 23rd St 7.0 7.0 1.0 7.0 2.0 1.0 0.0 \n", "557.0 Damen Ave & Garfield Blvd NaN 2.0 1.0 NaN 2.0 NaN NaN \n", "558.0 Ashland Ave & Garfield Blvd NaN 1.0 1.0 1.0 1.0 2.0 NaN \n", "561.0 Damen Ave & 61st St 2.0 7.0 2.0 7.0 NaN 1.0 1.0 \n", "562.0 Racine Ave & 61st St NaN NaN NaN NaN 7.0 NaN NaN \n", "564.0 Racine Ave & 65th St 1.0 1.0 NaN NaN 7.0 1.0 7.0 \n", "565.0 Ashland Ave & 66th St 1.0 1.0 NaN 1.0 0.0 NaN 5.0 \n", "567.0 May St & 69th St 6.0 6.0 2.0 2.0 1.0 1.0 NaN \n", "568.0 Normal Ave & 72nd St 1.0 1.0 7.0 NaN 7.0 1.0 4.0 \n", "569.0 Woodlawn Ave & 75th St 1.0 NaN 7.0 1.0 1.0 NaN 1.0 \n", "576.0 Greenwood Ave & 79th St 7.0 1.0 1.0 NaN 2.0 1.0 1.0 \n", "581.0 Commercial Ave & 83rd St 1.0 NaN 7.0 NaN NaN 1.0 1.0 \n", "582.0 Phillips Ave & 82nd St NaN NaN 1.0 1.0 NaN 1.0 7.0 \n", "586.0 MLK Jr Dr & 83rd St 1.0 2.0 6.0 7.0 7.0 7.0 2.0 \n", "587.0 Wabash Ave & 83rd St NaN NaN 1.0 NaN 1.0 2.0 7.0 \n", "588.0 South Chicago Ave & 83rd St NaN 2.0 7.0 3.0 1.0 2.0 7.0 \n", "593.0 Halsted St & 59th St NaN 7.0 4.0 4.0 NaN 1.0 1.0 \n", "\n", "time10 distincts \n", "weekday \n", "station_id station_name \n", "391.0 Halsted St & 69th St 4 \n", "440.0 Lawndale Ave & 23rd St 4 \n", "557.0 Damen Ave & Garfield Blvd 6 \n", "558.0 Ashland Ave & Garfield Blvd 4 \n", "561.0 Damen Ave & 61st St 4 \n", "562.0 Racine Ave & 61st St 7 \n", "564.0 Racine Ave & 65th St 4 \n", "565.0 Ashland Ave & 66th St 5 \n", "567.0 May St & 69th St 4 \n", "568.0 Normal Ave & 72nd St 4 \n", "569.0 Woodlawn Ave & 75th St 4 \n", "576.0 Greenwood Ave & 79th St 4 \n", "581.0 Commercial Ave & 83rd St 5 \n", "582.0 Phillips Ave & 82nd St 5 \n", "586.0 MLK Jr Dr & 83rd St 4 \n", "587.0 Wabash Ave & 83rd St 6 \n", "588.0 South Chicago Ave & 83rd St 5 \n", "593.0 Halsted St & 59th St 5 "]}, "execution_count": 31, "metadata": {}, 
"output_type": "execute_result"}], "source": ["piv[piv.distincts >= 4]"]}, {"cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [{"data": {"text/html": ["
"], "text/plain": [" station_id station_name 0 1 2 3 4 5 6 \\\n", "0 2.0 Michigan Ave & Balbo Ave 1.0 7.0 1.0 1.0 1.0 1.0 1.0 \n", "1 3.0 Shedd Aquarium 1.0 1.0 1.0 1.0 1.0 1.0 1.0 \n", "2 4.0 Burnham Harbor 1.0 1.0 1.0 1.0 1.0 1.0 1.0 \n", "3 5.0 State St & Harrison St 7.0 7.0 7.0 7.0 7.0 1.0 1.0 \n", "4 6.0 Dusable Harbor 1.0 1.0 1.0 1.0 1.0 1.0 1.0 \n", "\n", " distincts \n", "0 2 \n", "1 1 \n", "2 1 \n", "3 2 \n", "4 1 "]}, "execution_count": 32, "metadata": {}, "output_type": "execute_result"}], "source": ["pivn = piv.reset_index()\n", "pivn.columns = [' '.join(str(_).replace(\".0\", \"\") for _ in col).strip() for col in pivn.columns.values]\n", "pivn.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Let's draw a map on a week day."]}, {"cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [{"data": {"text/html": ["
"], "text/plain": [" id name latitude longitude dpcapacity \\\n", "357 2 Michigan Ave & Balbo Ave 41.872638 -87.623979 35 \n", "456 3 Shedd Aquarium 41.867226 -87.615355 31 \n", "53 4 Burnham Harbor 41.856268 -87.613348 23 \n", "497 5 State St & Harrison St 41.874053 -87.627716 23 \n", "188 6 Dusable Harbor 41.885042 -87.612795 31 \n", "\n", " online_date station_id station_name 0 1 2 3 \\\n", "357 5/8/2015 2.0 Michigan Ave & Balbo Ave 1.0 7.0 1.0 1.0 \n", "456 4/24/2015 3.0 Shedd Aquarium 1.0 1.0 1.0 1.0 \n", "53 5/16/2015 4.0 Burnham Harbor 1.0 1.0 1.0 1.0 \n", "497 6/18/2013 5.0 State St & Harrison St 7.0 7.0 7.0 7.0 \n", "188 4/24/2015 6.0 Dusable Harbor 1.0 1.0 1.0 1.0 \n", "\n", " 4 5 6 distincts \n", "357 1.0 1.0 1.0 2 \n", "456 1.0 1.0 1.0 1 \n", "53 1.0 1.0 1.0 1 \n", "497 7.0 1.0 1.0 2 \n", "188 1.0 1.0 1.0 1 "]}, "execution_count": 33, "metadata": {}, "output_type": "execute_result"}], "source": ["data = stations.merge(pivn, left_on=[\"id\", \"name\"],\n", " right_on=[\"station_id\", \"station_name\"], suffixes=('_s', '_c'))\n", "data.sort_values(\"id\").head()"]}, {"cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Cluster 0 is red\n", "Cluster 1 is yellow\n", "Cluster 2 is gray\n", "Cluster 3 is green\n", "Cluster 4 is brown\n", "Cluster 5 is orange\n", "Cluster 6 is blue\n", "Cluster 7 is black\n"]}, {"data": {"text/html": ["
"], "text/plain": [".CustomFoliumMap at 0x2030f4d0c50>"]}, "execution_count": 34, "metadata": {}, "output_type": "execute_result"}], "source": ["from ensae_projects.datainc.data_bikes import folium_html_stations_map\n", "\n", "colors = [\"red\", \"yellow\", \"gray\", \"green\", \"brown\", \"orange\", \"blue\", \"black\"]\n", "for i, c in enumerate(colors):\n", " print(\"Cluster {0} is {1}\".format(i, c))\n", "xy = []\n", "for els in data.apply(lambda row: (row[\"latitude\"], row[\"longitude\"], row[\"1\"], row[\"name\"]), axis=1):\n", " try:\n", " cl = int(els[2])\n", " except:\n", " # NaN\n", " continue\n", " name = \"%s c%d\" % (els[3], cl)\n", " color = colors[cl]\n", " xy.append( ( (els[0], els[1]), (name, color)))\n", "folium_html_stations_map(xy, width=\"80%\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Look at the colors close the parks. We notice than people got to the park after work. Let's see during the week-end."]}, {"cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [{"data": {"text/html": ["
"], "text/plain": [".CustomFoliumMap at 0x2030fca2c50>"]}, "execution_count": 35, "metadata": {}, "output_type": "execute_result"}], "source": ["xy = []\n", "for els in data.apply(lambda row: (row[\"latitude\"], row[\"longitude\"], row[\"5\"], row[\"name\"]), axis=1):\n", " try:\n", " cl = int(els[2])\n", " except:\n", " # NaN\n", " continue\n", " name = \"%s c%d\" % (els[3], cl)\n", " color = colors[cl]\n", " xy.append( ( (els[0], els[1]), (name, color)))\n", "folium_html_stations_map(xy, width=\"80%\")"]}, {"cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0"}}, "nbformat": 4, "nbformat_minor": 2}