.. _citybikesolutionclusterstartrst:

==============
Bike Pattern 2
==============

.. only:: html

    **Links:** :download:`notebook `, :downloadlink:`html `, :download:`PDF `, :download:`python `, :downloadlink:`slides `, :githublink:`GitHub|_doc/notebooks/challenges/city_bike/city_bike_solution_cluster_start.ipynb|*`

We used a little bit of machine learning on `Divvy Data `__ to dig into
a better division of Chicago. We try to identify patterns among bike
stations.

.. code:: ipython3

    from jyquickhelper import add_notebook_menu
    add_notebook_menu()

.. contents::
    :local:

.. code:: ipython3

    %matplotlib inline

The data
--------

`Divvy Data `__ publishes a sample of the data.

.. code:: ipython3

    from pyensae.datasource import download_data
    file = download_data("Divvy_Trips_2016_Q3Q4.zip",
                         url="https://s3.amazonaws.com/divvy-data/tripdata/")

We know the stations.

.. code:: ipython3

    import pandas
    stations = pandas.read_csv("Divvy_Stations_2016_Q3.csv")
    bikes = pandas.concat([pandas.read_csv("Divvy_Trips_2016_Q3.csv"),
                           pandas.read_csv("Divvy_Trips_2016_Q4.csv")])

.. code:: ipython3

    bikes.head()

.. parsed-literal::

    trip_id   starttime           stoptime            bikeid  tripduration  from_station_id  from_station_name               to_station_id  to_station_name              usertype    gender  birthyear
    0  12150160  9/30/2016 23:59:58  10/1/2016 00:04:03  4959  245   69   Damen Ave & Pierce Ave          17   Wood St & Division St        Subscriber  Male    1988.0
    1  12150159  9/30/2016 23:59:58  10/1/2016 00:04:09  2589  251   383  Ashland Ave & Harrison St       320  Loomis St & Lexington St     Subscriber  Female  1990.0
    2  12150158  9/30/2016 23:59:51  10/1/2016 00:24:51  3656  1500  302  Sheffield Ave & Wrightwood Ave  334  Lake Shore Dr & Belmont Ave  Customer    NaN     NaN
    3  12150157  9/30/2016 23:59:51  10/1/2016 00:03:56  3570  245   475  Washtenaw Ave & Lawrence Ave    471  Francisco Ave & Foster Ave   Subscriber  Female  1988.0
    4  12150156  9/30/2016 23:59:32  10/1/2016 00:26:50  3158  1638  302  Sheffield Ave & Wrightwood Ave  492  Leavitt St & Addison St      Customer    NaN     NaN
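The next cell derives day and 10-minute-bucket columns with a row-wise
``apply``. On several hundred thousand trips this is slow; a vectorized
sketch of the same bucketing is shown below (the ``"M/D/YYYY HH:MM:SS"``
format is an assumption read off the ``bikes.head()`` output above):

```python
import pandas as pd

# Two timestamps in the format shown by bikes.head() above.
df = pd.DataFrame({"starttime": ["9/30/2016 23:59:58", "10/1/2016 00:04:03"]})

dt = pd.to_datetime(df["starttime"], format="%m/%d/%Y %H:%M:%S")
df["startday"] = dt.dt.normalize()                # midnight of the same day
df["starttime10"] = dt.dt.floor("10min").dt.time  # 10-minute buckets

print(df["starttime10"].tolist())  # [datetime.time(23, 50), datetime.time(0, 0)]
```

``dt.floor("10min")`` does in one vectorized call what the notebook's
``lambda r: time(r.hour, (r.minute // 10) * 10, 0)`` does row by row.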
.. code:: ipython3

    from datetime import datetime, time
    df = bikes
    df["dtstart"] = pandas.to_datetime(df.starttime, infer_datetime_format=True)
    df["dtstop"] = pandas.to_datetime(df.stoptime, infer_datetime_format=True)
    df["stopday"] = df.dtstop.apply(lambda r: datetime(r.year, r.month, r.day))
    df["stoptime"] = df.dtstop.apply(lambda r: time(r.hour, r.minute, 0))
    df["stoptime10"] = df.dtstop.apply(lambda r: time(r.hour, (r.minute // 10) * 10, 0))  # every 10 minutes
    df["startday"] = df.dtstart.apply(lambda r: datetime(r.year, r.month, r.day))
    df["starttime"] = df.dtstart.apply(lambda r: time(r.hour, r.minute, 0))
    df["starttime10"] = df.dtstart.apply(lambda r: time(r.hour, (r.minute // 10) * 10, 0))  # every 10 minutes

.. code:: ipython3

    df['stopweekday'] = df['dtstop'].dt.dayofweek
    df['startweekday'] = df['dtstart'].dt.dayofweek

Normalizing, aggregating and merging per start and stop time
------------------------------------------------------------

.. code:: ipython3

    key = ["to_station_id", "to_station_name", "stopweekday", "stoptime10"]
    keep = key + ["trip_id"]
    stopaggtime = df[keep].groupby(key, as_index=False).count()
    stopaggtime.columns = key + ["nb_trips"]
    stopaggday = df[keep[:-2] + ["trip_id"]].groupby(key[:-1], as_index=False).count()
    stopaggday.columns = key[:-1] + ["nb_trips"]
    stopmerge = stopaggtime.merge(stopaggday, on=key[:-1], suffixes=("", "day"))
    stopmerge["stopdist"] = stopmerge["nb_trips"] / stopmerge["nb_tripsday"]
    stopmerge.head()

.. parsed-literal::

    to_station_id  to_station_name           stopweekday  stoptime10  nb_trips  nb_tripsday  stopdist
    0  2  Michigan Ave & Balbo Ave  0  00:10:00  2  913  0.002191
    1  2  Michigan Ave & Balbo Ave  0  00:20:00  2  913  0.002191
    2  2  Michigan Ave & Balbo Ave  0  00:30:00  2  913  0.002191
    3  2  Michigan Ave & Balbo Ave  0  01:00:00  3  913  0.003286
    4  2  Michigan Ave & Balbo Ave  0  01:10:00  2  913  0.002191
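The ``stopdist`` column turns raw counts into a distribution over the
day: for a given station and weekday, the values sum to 1. A minimal
sketch of the same groupby/merge pattern on toy data (the station,
buckets and counts below are made up):

```python
import pandas as pd

# Toy trips: station 2 receives 3 trips on weekday 0, in two time buckets.
trips = pd.DataFrame({
    "to_station_id": [2, 2, 2],
    "stopweekday":   [0, 0, 0],
    "stoptime10":    ["00:10:00", "00:10:00", "00:20:00"],
    "trip_id":       [1, 2, 3],
})

key = ["to_station_id", "stopweekday", "stoptime10"]
per_time = (trips.groupby(key, as_index=False)["trip_id"].count()
                 .rename(columns={"trip_id": "nb_trips"}))
per_day = (trips.groupby(key[:-1], as_index=False)["trip_id"].count()
                .rename(columns={"trip_id": "nb_tripsday"}))

merged = per_time.merge(per_day, on=key[:-1])
merged["stopdist"] = merged["nb_trips"] / merged["nb_tripsday"]

# Within each (station, weekday) group the distribution sums to 1.
print(merged[["stoptime10", "stopdist"]])
```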
.. code:: ipython3

    stopmerge[stopmerge["to_station_id"] == 2] \
        .plot(x="stoptime10", y="stopdist", figsize=(14, 4), kind="area");

.. image:: city_bike_solution_cluster_start_12_0.png

.. code:: ipython3

    key = ["from_station_id", "from_station_name", "startweekday", "starttime10"]
    keep = key + ["trip_id"]
    startaggtime = df[keep].groupby(key, as_index=False).count()
    startaggtime.columns = key + ["nb_trips"]
    startaggday = df[keep[:-2] + ["trip_id"]].groupby(key[:-1], as_index=False).count()
    startaggday.columns = key[:-1] + ["nb_trips"]
    startmerge = startaggtime.merge(startaggday, on=key[:-1], suffixes=("", "day"))
    startmerge["startdist"] = startmerge["nb_trips"] / startmerge["nb_tripsday"]
    startmerge.head()

.. parsed-literal::

    from_station_id  from_station_name         startweekday  starttime10  nb_trips  nb_tripsday  startdist
    0  2  Michigan Ave & Balbo Ave  0  00:00:00  4  1065  0.003756
    1  2  Michigan Ave & Balbo Ave  0  00:10:00  1  1065  0.000939
    2  2  Michigan Ave & Balbo Ave  0  00:20:00  3  1065  0.002817
    3  2  Michigan Ave & Balbo Ave  0  00:50:00  4  1065  0.003756
    4  2  Michigan Ave & Balbo Ave  0  01:10:00  3  1065  0.002817
.. code:: ipython3

    startmerge[startmerge["from_station_id"] == 2] \
        .plot(x="starttime10", y="startdist", figsize=(14, 4), kind="area");

.. image:: city_bike_solution_cluster_start_14_0.png

.. code:: ipython3

    everything = stopmerge.merge(startmerge,
                                 left_on=["to_station_id", "to_station_name",
                                          "stopweekday", "stoptime10"],
                                 right_on=["from_station_id", "from_station_name",
                                           "startweekday", "starttime10"],
                                 suffixes=("stop", "start"), how="outer")
    everything.head()

.. parsed-literal::

    to_station_id  to_station_name  stopweekday  stoptime10  nb_tripsstop  nb_tripsdaystop  stopdist  from_station_id  from_station_name  startweekday  starttime10  nb_tripsstart  nb_tripsdaystart  startdist
    0  2.0  Michigan Ave & Balbo Ave  0.0  00:10:00  2.0  913.0  0.002191  2.0  Michigan Ave & Balbo Ave  0.0  00:10:00  1.0  1065.0  0.000939
    1  2.0  Michigan Ave & Balbo Ave  0.0  00:20:00  2.0  913.0  0.002191  2.0  Michigan Ave & Balbo Ave  0.0  00:20:00  3.0  1065.0  0.002817
    2  2.0  Michigan Ave & Balbo Ave  0.0  00:30:00  2.0  913.0  0.002191  NaN  NaN  NaN  NaN  NaN  NaN  NaN
    3  2.0  Michigan Ave & Balbo Ave  0.0  01:00:00  3.0  913.0  0.003286  NaN  NaN  NaN  NaN  NaN  NaN  NaN
    4  2.0  Michigan Ave & Balbo Ave  0.0  01:10:00  2.0  913.0  0.002191  2.0  Michigan Ave & Balbo Ave  0.0  01:10:00  3.0  1065.0  0.002817
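The outer join keeps time buckets seen only as arrivals or only as
departures, which produces the NaN halves visible above. A small sketch
on toy frames (station, buckets and values are made up) showing how
``indicator=True`` tags where each row comes from:

```python
import pandas as pd

# Toy aggregates; "00:30" has only arrivals, "00:50" only departures.
stops = pd.DataFrame({"station": [2, 2], "time10": ["00:10", "00:30"],
                      "stopdist": [0.2, 0.3]})
starts = pd.DataFrame({"station": [2, 2], "time10": ["00:10", "00:50"],
                       "startdist": [0.1, 0.4]})

# indicator=True adds a _merge column telling which side(s) each row
# came from, which makes the NaN pattern of the outer join easy to audit.
both = stops.merge(starts, on=["station", "time10"], how="outer", indicator=True)
print(both["_merge"].tolist())  # ['both', 'left_only', 'right_only']
```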
.. code:: ipython3

    import numpy
    from datetime import datetime

    def bestof(x, y):
        # Returns x when it is filled, y otherwise (NaN comes from the outer join).
        if isinstance(x, (datetime, time, str)):
            return x
        try:
            if x is None or isinstance(y, (datetime, time, str)) or numpy.isnan(x):
                return y
            else:
                return x
        except:
            print(type(x), type(y))
            print(x, y)
            raise

    bestof(datetime(2017, 2, 2), numpy.nan), bestof(numpy.nan, datetime(2017, 2, 2))

.. parsed-literal::

    (datetime.datetime(2017, 2, 2, 0, 0), datetime.datetime(2017, 2, 2, 0, 0))

.. code:: ipython3

    every = everything.copy()
    every["station_name"] = every.apply(lambda row: bestof(row["to_station_name"], row["from_station_name"]), axis=1)
    every["station_id"] = every.apply(lambda row: bestof(row["to_station_id"], row["from_station_id"]), axis=1)
    every["time10"] = every.apply(lambda row: bestof(row["stoptime10"], row["starttime10"]), axis=1)
    every["weekday"] = every.apply(lambda row: bestof(row["stopweekday"], row["startweekday"]), axis=1)
    every = every.drop(["stoptime10", "starttime10", "stopweekday", "startweekday",
                        "to_station_id", "from_station_id",
                        "to_station_name", "from_station_name"], axis=1)
    every.head()

.. parsed-literal::

    nb_tripsstop  nb_tripsdaystop  stopdist  nb_tripsstart  nb_tripsdaystart  startdist  station_name              station_id  time10    weekday
    0  2.0  913.0  0.002191  1.0  1065.0  0.000939  Michigan Ave & Balbo Ave  2.0  00:10:00  0.0
    1  2.0  913.0  0.002191  3.0  1065.0  0.002817  Michigan Ave & Balbo Ave  2.0  00:20:00  0.0
    2  2.0  913.0  0.002191  NaN  NaN     NaN       Michigan Ave & Balbo Ave  2.0  00:30:00  0.0
    3  3.0  913.0  0.003286  NaN  NaN     NaN       Michigan Ave & Balbo Ave  2.0  01:00:00  0.0
    4  2.0  913.0  0.002191  3.0  1065.0  0.002817  Michigan Ave & Balbo Ave  2.0  01:10:00  0.0
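``bestof`` coalesces the two sides of the join row by row. For plain
columns, pandas' ``combine_first`` expresses the same "first non-null"
rule without a Python-level loop; a sketch on made-up values:

```python
import numpy as np
import pandas as pd

to_name = pd.Series(["Michigan Ave & Balbo Ave", np.nan, np.nan])
from_name = pd.Series([np.nan, "Shedd Aquarium", np.nan])

# First non-null value of the two columns, element-wise.
station_name = to_name.combine_first(from_name)
print(station_name.tolist())
```

When both sides are missing, the result stays NaN, just as ``bestof``
returns ``y`` when neither argument is filled.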
.. code:: ipython3

    every.shape, stopmerge.shape, startmerge.shape

.. parsed-literal::

    ((357809, 10), (298013, 7), (299700, 7))

We need vectors of equal size, which means filling NaN values with 0 and
adding the time slots which are not present.

.. code:: ipython3

    every.columns

.. parsed-literal::

    Index(['nb_tripsstop', 'nb_tripsdaystop', 'stopdist', 'nb_tripsstart',
           'nb_tripsdaystart', 'startdist', 'station_name', 'station_id',
           'time10', 'weekday'],
          dtype='object')

.. code:: ipython3

    keys = ['station_name', 'station_id', 'weekday', 'time10']
    for c in every.columns:
        if c not in keys:
            # fillna returns a copy: the result must be assigned back.
            every[c] = every[c].fillna(0)

.. code:: ipython3

    from ensae_projects.datainc.data_bikes import add_missing_time
    full = df = add_missing_time(every, delay=10, column="time10",
                                 values=[c for c in every.columns if c not in keys])
    full = full[['station_name', 'station_id', 'time10', 'weekday',
                 'stopdist', 'startdist',
                 'nb_tripsstop', 'nb_tripsdaystop',
                 'nb_tripsstart', 'nb_tripsdaystart']].sort_values(keys)
    full.head()

.. parsed-literal::

    station_name         station_id  time10    weekday  stopdist  startdist  nb_tripsstop  nb_tripsdaystop  nb_tripsstart  nb_tripsdaystart
    357809  2112 W Peterson Ave  456.0  00:00:00  0.0  0.0  0.000000  0.0  0.0  0.0  0.0
    357810  2112 W Peterson Ave  456.0  00:10:00  0.0  0.0  0.000000  0.0  0.0  0.0  0.0
    357811  2112 W Peterson Ave  456.0  00:20:00  0.0  0.0  0.000000  0.0  0.0  0.0  0.0
    341443  2112 W Peterson Ave  456.0  00:30:00  0.0  0.0  0.021277  0.0  0.0  1.0  47.0
    357812  2112 W Peterson Ave  456.0  00:40:00  0.0  0.0  0.000000  0.0  0.0  0.0  0.0
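``add_missing_time`` comes from *ensae_projects* and its exact behavior
is not shown here, but the idea — complete each station/day series to the
full 10-minute grid, filling absent slots with 0 — can be sketched with a
plain ``reindex`` (toy series, one station and one day):

```python
import pandas as pd
from datetime import time

# Observed buckets for one station/day; the 00:10 slot is absent.
obs = pd.DataFrame({"time10": [time(0, 0), time(0, 20)],
                    "stopdist": [0.4, 0.6]})

# The full 10-minute grid over one day: 24 * 6 = 144 slots.
grid = [time(h, m, 0) for h in range(24) for m in range(0, 60, 10)]
full = obs.set_index("time10").reindex(grid, fill_value=0.0)
full.index.name = "time10"
full = full.reset_index()

print(len(full), full.loc[1, "stopdist"])  # 144 slots, absent ones filled with 0
```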
Clustering (stop and start)
---------------------------

We cluster these distributions to find patterns. Thanks to the previous
step, every station and day now gives a vector of equal size: 24 hours
times 6 ten-minute slots, or 144 values per direction.

.. code:: ipython3

    df = full

    import matplotlib.pyplot as plt
    fig, ax = plt.subplots(2, 1, figsize=(14, 6))
    df[df["station_id"] == 2].plot(x="time10", y="startdist", figsize=(14, 4), kind="area", ax=ax[0])
    df[df["station_id"] == 2].plot(x="time10", y="stopdist", figsize=(14, 4), kind="area", ax=ax[1], color="r");

.. image:: city_bike_solution_cluster_start_25_0.png

Let’s build the features.

.. code:: ipython3

    features = df.pivot_table(index=["station_id", "station_name", "weekday"],
                              columns="time10",
                              values=["startdist", "stopdist"]).reset_index()
    features.head()

.. parsed-literal::

    station_id  station_name  weekday  startdist                                                               ...  stopdist
    time10                             00:00:00  00:10:00  00:20:00  00:30:00  00:40:00  00:50:00  01:00:00    ...  22:20:00  22:30:00  22:40:00  22:50:00  23:00:00  23:10:00  23:20:00  23:30:00  23:40:00  23:50:00
    0  2.0  Michigan Ave & Balbo Ave  0.0  0.003756  0.000939  0.002817  0.000000  0.000000  0.003756  0.000000  ...  0.004381  0.002191  0.004381  0.002191  0.004381  0.004381  0.005476  0.002191  0.000000  0.005476
    1  2.0  Michigan Ave & Balbo Ave  1.0  0.000000  0.000000  0.001106  0.001106  0.001106  0.002212  0.000000  ...  0.009371  0.012048  0.006693  0.004016  0.005355  0.006693  0.002677  0.000000  0.000000  0.000000
    2  2.0  Michigan Ave & Balbo Ave  2.0  0.001357  0.002714  0.000000  0.001357  0.000000  0.005427  0.000000  ...  0.002907  0.002907  0.015988  0.005814  0.001453  0.001453  0.011628  0.000000  0.000000  0.007267
    3  2.0  Michigan Ave & Balbo Ave  3.0  0.000000  0.004144  0.000000  0.000000  0.002762  0.004144  0.000000  ...  0.009274  0.003091  0.003091  0.007728  0.001546  0.003091  0.009274  0.001546  0.007728  0.001546
    4  2.0  Michigan Ave & Balbo Ave  4.0  0.000000  0.000000  0.000000  0.002846  0.000000  0.000000  0.000949  ...  0.008214  0.001027  0.006160  0.004107  0.015400  0.006160  0.002053  0.006160  0.007187  0.000000

    5 rows × 291 columns
.. code:: ipython3

    names = features.columns[3:]
    len(names)

.. parsed-literal::

    288

.. code:: ipython3

    from sklearn.cluster import KMeans
    clus = KMeans(8)
    clus.fit(features[names])

.. parsed-literal::

    KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
        n_clusters=8, n_init=10, n_jobs=1, precompute_distances='auto',
        random_state=None, tol=0.0001, verbose=0)

.. code:: ipython3

    pred = clus.predict(features[names])
    set(pred)

.. parsed-literal::

    {0, 1, 2, 3, 4, 5, 6, 7}

.. code:: ipython3

    features["cluster"] = pred

Let’s see what it means across days. We need to look at whether or not a
cluster is related to working days or to the week-end.

.. code:: ipython3

    features[["cluster", "weekday", "station_id"]].groupby(["cluster", "weekday"]).count()

.. parsed-literal::

    c:\python370_x64\lib\site-packages\pandas\core\generic.py:3111: PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
      obj = obj._drop_axis(labels, axis, level=level, errors=errors)

.. parsed-literal::

                     station_id
    cluster weekday
    0       3.0               1
            4.0               1
            6.0               1
    1       0.0             146
            1.0             110
            2.0             106
            3.0             119
            4.0             143
            5.0             553
            6.0             547
    2       0.0             137
            1.0             141
            2.0             150
            3.0             147
            4.0             149
            5.0               8
            6.0              12
    3       0.0               1
            3.0               1
    4       2.0               1
            3.0               1
            6.0               1
    5       6.0               1
    6       0.0               1
            1.0               1
            2.0               1
            6.0               1
    7       0.0             291
            1.0             326
            2.0             322
            3.0             308
            4.0             287
            5.0              19
            6.0              17
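The cluster numbers above are arbitrary: ``KMeans(8)`` was fit without
``random_state``, so rerunning the notebook may permute the labels and
move the tiny clusters around. A self-contained sketch on synthetic
distributions, with a fixed seed for reproducibility (shapes and data
are made up, only the feature size matches the notebook's 288 columns):

```python
import numpy as np
from sklearn.cluster import KMeans

# 20 fake station/day rows: a start distribution (144 slots) followed by
# a stop distribution (144 slots) = 288 features, as in `features` above.
rng = np.random.default_rng(0)
X = rng.random((20, 288))
X /= X.sum(axis=1, keepdims=True)  # each half-day profile sums to ~1

# random_state pins the k-means++ initialization, so labels are stable
# across runs; the notebook leaves it unset.
clus = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
print(sorted(set(clus.labels_)))  # all 8 labels are used
```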
.. code:: ipython3

    nb = features[["cluster", "weekday", "station_id"]].groupby(["cluster", "weekday"]).count()
    nb = nb.reset_index()
    nb[nb.cluster.isin([0, 3, 5, 6])].pivot("weekday", "cluster", "station_id").plot(kind="bar");

.. parsed-literal::

    c:\python370_x64\lib\site-packages\pandas\core\generic.py:3111: PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
      obj = obj._drop_axis(labels, axis, level=level, errors=errors)

.. image:: city_bike_solution_cluster_start_34_1.png

Let’s draw the clusters.

.. code:: ipython3

    centers = clus.cluster_centers_.T

    import matplotlib.pyplot as plt
    fig, ax = plt.subplots(centers.shape[1], 2, figsize=(10, 10))
    nbf = centers.shape[0] // 2
    x = list(range(0, nbf))
    col = 0
    dec = 0
    colors = ["red", "yellow", "gray", "green", "brown", "orange", "blue"]
    for i in range(centers.shape[1]):
        if 2 * i == centers.shape[1]:
            col += 1
            dec += centers.shape[1]
        color = colors[i % len(colors)]
        ax[2 * i - dec, col].bar(x, centers[:nbf, i], width=1.0, color=color)
        ax[2 * i - dec, col].set_ylabel("cluster %d - start" % i, color=color)
        ax[2 * i + 1 - dec, col].bar(x, centers[nbf:, i], width=1.0, color=color)
        ax[2 * i + 1 - dec, col].set_ylabel("cluster %d - stop" % i, color=color)

.. image:: city_bike_solution_cluster_start_36_0.png

Four patterns emerge. Small clusters are annoying, but let’s show them
on a map anyway. The widest cluster is the week-end one.

Graph
-----

We first need to get 7 clusters for each station, one per day.

.. code:: ipython3

    piv = features.pivot_table(index=["station_id", "station_name"],
                               columns="weekday", values="cluster")
    piv.head()

.. parsed-literal::

    c:\python370_x64\lib\site-packages\pandas\core\generic.py:3111: PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
      obj = obj._drop_axis(labels, axis, level=level, errors=errors)

.. parsed-literal::

    weekday                               0.0  1.0  2.0  3.0  4.0  5.0  6.0
    station_id  station_name
    2.0         Michigan Ave & Balbo Ave  1.0  7.0  1.0  1.0  1.0  1.0  1.0
    3.0         Shedd Aquarium            1.0  1.0  1.0  1.0  1.0  1.0  1.0
    4.0         Burnham Harbor            1.0  1.0  1.0  1.0  1.0  1.0  1.0
    5.0         State St & Harrison St    7.0  7.0  7.0  7.0  7.0  1.0  1.0
    6.0         Dusable Harbor            1.0  1.0  1.0  1.0  1.0  1.0  1.0
.. code:: ipython3

    piv["distincts"] = piv.apply(lambda row: len(set(row[i] for i in range(0, 7))), axis=1)

Let’s see which stations are classified in at least 4 distinct clusters.
NaN means no bike stopped at that station on that weekday; these are
mostly unused stations.

.. code:: ipython3

    piv[piv.distincts >= 4]

.. parsed-literal::

    weekday                                  0.0  1.0  2.0  3.0  4.0  5.0  6.0  distincts
    station_id  station_name
    391.0       Halsted St & 69th St         3.0  7.0  1.0  7.0  1.0  1.0  2.0  4
    440.0       Lawndale Ave & 23rd St       7.0  7.0  1.0  7.0  2.0  1.0  0.0  4
    557.0       Damen Ave & Garfield Blvd    NaN  2.0  1.0  NaN  2.0  NaN  NaN  6
    558.0       Ashland Ave & Garfield Blvd  NaN  1.0  1.0  1.0  1.0  2.0  NaN  4
    561.0       Damen Ave & 61st St          2.0  7.0  2.0  7.0  NaN  1.0  1.0  4
    562.0       Racine Ave & 61st St         NaN  NaN  NaN  NaN  7.0  NaN  NaN  7
    564.0       Racine Ave & 65th St         1.0  1.0  NaN  NaN  7.0  1.0  7.0  4
    565.0       Ashland Ave & 66th St        1.0  1.0  NaN  1.0  0.0  NaN  5.0  5
    567.0       May St & 69th St             6.0  6.0  2.0  2.0  1.0  1.0  NaN  4
    568.0       Normal Ave & 72nd St         1.0  1.0  7.0  NaN  7.0  1.0  4.0  4
    569.0       Woodlawn Ave & 75th St       1.0  NaN  7.0  1.0  1.0  NaN  1.0  4
    576.0       Greenwood Ave & 79th St      7.0  1.0  1.0  NaN  2.0  1.0  1.0  4
    581.0       Commercial Ave & 83rd St     1.0  NaN  7.0  NaN  NaN  1.0  1.0  5
    582.0       Phillips Ave & 82nd St       NaN  NaN  1.0  1.0  NaN  1.0  7.0  5
    586.0       MLK Jr Dr & 83rd St          1.0  2.0  6.0  7.0  7.0  7.0  2.0  4
    587.0       Wabash Ave & 83rd St         NaN  NaN  1.0  NaN  1.0  2.0  7.0  6
    588.0       South Chicago Ave & 83rd St  NaN  2.0  7.0  3.0  1.0  2.0  7.0  5
    593.0       Halsted St & 59th St         NaN  7.0  4.0  4.0  NaN  1.0  1.0  5
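The ``distincts`` column counts NaN in a surprising way: inside
``set(...)``, every NaN compares unequal to every other, so station 562,
observed on a single weekday, still gets ``distincts = 7``. A sketch of
the difference with ``nunique``, which counts NaN at most once:

```python
import numpy as np
import pandas as pd

# One cluster value and six missing weekdays, like station 562 above.
row = pd.Series([np.nan, np.nan, np.nan, np.nan, 7.0, np.nan, np.nan])

distinct_with_set = len(set(row[i] for i in range(7)))  # every NaN is "distinct"
distinct_nunique = row.nunique(dropna=False)            # NaN counted once

print(distinct_with_set, distinct_nunique)  # 7 2
```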
.. code:: ipython3

    pivn = piv.reset_index()
    pivn.columns = [' '.join(str(_).replace(".0", "") for _ in col).strip()
                    for col in pivn.columns.values]
    pivn.head()

.. parsed-literal::

    station_id  station_name              0    1    2    3    4    5    6    distincts
    0  2.0  Michigan Ave & Balbo Ave  1.0  7.0  1.0  1.0  1.0  1.0  1.0  2
    1  3.0  Shedd Aquarium            1.0  1.0  1.0  1.0  1.0  1.0  1.0  1
    2  4.0  Burnham Harbor            1.0  1.0  1.0  1.0  1.0  1.0  1.0  1
    3  5.0  State St & Harrison St    7.0  7.0  7.0  7.0  7.0  1.0  1.0  2
    4  6.0  Dusable Harbor            1.0  1.0  1.0  1.0  1.0  1.0  1.0  1
Let’s draw a map for a week day.

.. code:: ipython3

    data = stations.merge(pivn, left_on=["id", "name"],
                          right_on=["station_id", "station_name"],
                          suffixes=('_s', '_c'))
    data.sort_values("id").head()

.. parsed-literal::

    id  name  latitude  longitude  dpcapacity  online_date  station_id  station_name  0  1  2  3  4  5  6  distincts
    357  2  Michigan Ave & Balbo Ave  41.872638  -87.623979  35  5/8/2015   2.0  Michigan Ave & Balbo Ave  1.0  7.0  1.0  1.0  1.0  1.0  1.0  2
    456  3  Shedd Aquarium            41.867226  -87.615355  31  4/24/2015  3.0  Shedd Aquarium            1.0  1.0  1.0  1.0  1.0  1.0  1.0  1
    53   4  Burnham Harbor            41.856268  -87.613348  23  5/16/2015  4.0  Burnham Harbor            1.0  1.0  1.0  1.0  1.0  1.0  1.0  1
    497  5  State St & Harrison St    41.874053  -87.627716  23  6/18/2013  5.0  State St & Harrison St    7.0  7.0  7.0  7.0  7.0  1.0  1.0  2
    188  6  Dusable Harbor            41.885042  -87.612795  31  4/24/2015  6.0  Dusable Harbor            1.0  1.0  1.0  1.0  1.0  1.0  1.0  1
.. code:: ipython3

    from ensae_projects.datainc.data_bikes import folium_html_stations_map
    colors = ["red", "yellow", "gray", "green", "brown", "orange", "blue", "black"]
    for i, c in enumerate(colors):
        print("Cluster {0} is {1}".format(i, c))
    xy = []
    for els in data.apply(lambda row: (row["latitude"], row["longitude"], row["1"], row["name"]), axis=1):
        try:
            cl = int(els[2])
        except:
            # NaN: the station is not clustered on that day.
            continue
        name = "%s c%d" % (els[3], cl)
        color = colors[cl]
        xy.append(((els[0], els[1]), (name, color)))
    folium_html_stations_map(xy, width="80%")

.. parsed-literal::

    Cluster 0 is red
    Cluster 1 is yellow
    Cluster 2 is gray
    Cluster 3 is green
    Cluster 4 is brown
    Cluster 5 is orange
    Cluster 6 is blue
    Cluster 7 is black

.. raw:: html
Look at the colors close to the parks. We notice that people go to the
parks after work. Let’s see during the week-end.

.. code:: ipython3

    xy = []
    for els in data.apply(lambda row: (row["latitude"], row["longitude"], row["5"], row["name"]), axis=1):
        try:
            cl = int(els[2])
        except:
            # NaN: the station is not clustered on that day.
            continue
        name = "%s c%d" % (els[3], cl)
        color = colors[cl]
        xy.append(((els[0], els[1]), (name, color)))
    folium_html_stations_map(xy, width="80%")

.. raw:: html