Bike Pattern#


We use a little bit of machine learning on Divvy Data to look for a better division of Chicago. To do so, we try to identify usage patterns among bike stations.

from jyquickhelper import add_notebook_menu
add_notebook_menu()

%matplotlib inline

The data#

Divvy Data publishes a sample of the data.

from pyensae.datasource import download_data
file = download_data("Divvy_Trips_2016_Q3Q4.zip", url="https://s3.amazonaws.com/divvy-data/tripdata/")

The stations are described in a separate file, which we load along with the trips of both quarters.

import pandas
stations = pandas.read_csv("Divvy_Stations_2016_Q3.csv")
bikes = pandas.concat([pandas.read_csv("Divvy_Trips_2016_Q3.csv"),
                       pandas.read_csv("Divvy_Trips_2016_Q4.csv")])

from datetime import datetime, time
df = bikes
df["dtstart"] = pandas.to_datetime(df.starttime, infer_datetime_format=True)
df["dtstop"] = pandas.to_datetime(df.stoptime, infer_datetime_format=True)
df["stopday"] = df.dtstop.apply(lambda r: datetime(r.year, r.month, r.day))
df["stoptime"] = df.dtstop.apply(lambda r: time(r.hour, r.minute, 0))
df["stoptime10"] = df.dtstop.apply(lambda r: time(r.hour, (r.minute // 10)*10, 0))  # every 10 minutes

df['stopweekday'] = df['dtstop'].dt.dayofweek
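The 10-minute bucketing above can be checked on a single timestamp. Here is a minimal sketch using only the standard library; the helper name `floor_to_10_minutes` is hypothetical and not part of the notebook.

```python
from datetime import datetime, time


def floor_to_10_minutes(moment):
    """Return the start of the 10-minute slot containing `moment`, as a time."""
    return time(moment.hour, (moment.minute // 10) * 10, 0)


stop = datetime(2016, 9, 30, 17, 47, 12)
slot = floor_to_10_minutes(stop)
print(slot)            # 17:40:00
print(stop.weekday())  # 4, a Friday (Monday is 0, as with pandas dt.dayofweek)
```

Every stop between 17:40:00 and 17:49:59 falls into the same slot, which is what makes the per-slot counts comparable across stations.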

Normalize per week day (stop)#

df.columns

Index(['trip_id', 'starttime', 'stoptime', 'bikeid', 'tripduration',
       'from_station_id', 'from_station_name', 'to_station_id',
       'to_station_name', 'usertype', 'gender', 'birthyear', 'dtstart',
       'dtstop', 'stopday', 'stoptime10', 'stopweekday'],
      dtype='object')

key = ["to_station_id", "to_station_name", "stopweekday", "stoptime10"]
keep = key + ["trip_id"]
aggtime = df[keep].groupby(key, as_index=False).count()
aggtime.columns = key + ["nb_trips"]
aggtime.head()

	to_station_id	to_station_name	stoptime10	nb_trips
0	2	Michigan Ave & Balbo Ave	00:10:00	2
1	2	Michigan Ave & Balbo Ave	00:20:00	2
2	2	Michigan Ave & Balbo Ave	00:30:00	2
3	2	Michigan Ave & Balbo Ave	01:00:00	3
4	2	Michigan Ave & Balbo Ave	01:10:00	2

aggday = df[keep[:-2] + ["trip_id"]].groupby(key[:-1], as_index=False).count()
aggday.columns = key[:-1] + ["nb_trips"]
aggday.sort_values("nb_trips", ascending=False).head()

	to_station_id	to_station_name	stopweekday	nb_trips
222	35	Streeter Dr & Grand Ave	5	15380
223	35	Streeter Dr & Grand Ave	6	14680
217	35	Streeter Dr & Grand Ave	0	9228
221	35	Streeter Dr & Grand Ave	4	7945
1741	268	Lake Shore Dr & North Blvd	5	7508

merge = aggtime.merge(aggday, on=key[:-1], suffixes=("", "day"))
merge.head()

	to_station_id	to_station_name	stoptime10	nb_trips	nb_tripsday
0	2	Michigan Ave & Balbo Ave	00:10:00	2	913
1	2	Michigan Ave & Balbo Ave	00:20:00	2	913
2	2	Michigan Ave & Balbo Ave	00:30:00	2	913
3	2	Michigan Ave & Balbo Ave	01:00:00	3	913
4	2	Michigan Ave & Balbo Ave	01:10:00	2	913

merge["dist"] = merge["nb_trips"] / merge["nb_tripsday"]
merge.head()

	to_station_id	to_station_name	stoptime10	nb_trips	nb_tripsday	dist
0	2	Michigan Ave & Balbo Ave	00:10:00	2	913	0.002191
1	2	Michigan Ave & Balbo Ave	00:20:00	2	913	0.002191
2	2	Michigan Ave & Balbo Ave	00:30:00	2	913	0.002191
3	2	Michigan Ave & Balbo Ave	01:00:00	3	913	0.003286
4	2	Michigan Ave & Balbo Ave	01:10:00	2	913	0.002191
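The normalization divides each 10-minute count by the total of its station and week day, so that every (station, week day) profile sums to 1. The sketch below reproduces the idea on made-up numbers; it uses `groupby(...).transform("sum")` instead of the merge, which is an alternative and not the notebook's own method.

```python
import pandas

# toy counts (invented values), two stations observed on the same week day
toy = pandas.DataFrame({
    "to_station_id": [2, 2, 2, 3, 3],
    "stopweekday": [0, 0, 0, 0, 0],
    "stoptime10": ["08:00", "08:10", "17:30", "08:00", "17:30"],
    "nb_trips": [2, 3, 5, 1, 3],
})

group = ["to_station_id", "stopweekday"]
# daily total repeated on every row of the same (station, week day)
toy["nb_tripsday"] = toy.groupby(group)["nb_trips"].transform("sum")
toy["dist"] = toy["nb_trips"] / toy["nb_tripsday"]

# each profile is now a distribution over the day
totals = toy.groupby(group)["dist"].sum()
print(toy)
print(totals)
```

Working with distributions rather than raw counts lets a busy station and a quiet one end up in the same cluster when their daily shapes are similar.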

merge[merge["to_station_id"] == 2].plot(x="stoptime10", y="dist", figsize=(14,4), kind="area");

../_images/city_bike_solution_cluster_15_0.png

Clustering (stop)#

We cluster these distributions to find some patterns. But we need vectors of equal size, with 24*6 = 144 values each, one per 10-minute slot of the day. The counts below show that some slots are missing.

print(key)
merge.groupby(key[:-1], as_index=False).count().head()

['to_station_id', 'to_station_name', 'stopweekday', 'stoptime10']

	to_station_id	to_station_name	stopweekday	stoptime10	nb_trips	nb_tripsday	dist
0	2	Michigan Ave & Balbo Ave	0	114	114	114	114
1	2	Michigan Ave & Balbo Ave	1	109	109	109	109
2	2	Michigan Ave & Balbo Ave	2	116	116	116	116
3	2	Michigan Ave & Balbo Ave	3	112	112	112	112
4	2	Michigan Ave & Balbo Ave	4	117	117	117	117

from ensae_projects.datainc.data_bikes import add_missing_time
full = df = add_missing_time(merge, delay=10, column="stoptime10", values=["nb_trips", "nb_tripsday", "dist"])
df.groupby(key[:-1], as_index=False).count().head()

	to_station_id	to_station_name	stopweekday	stoptime10	nb_trips	nb_tripsday	dist
0	2	Michigan Ave & Balbo Ave	0	144	144	144	144
1	2	Michigan Ave & Balbo Ave	1	144	144	144	144
2	2	Michigan Ave & Balbo Ave	2	144	144	144	144
3	2	Michigan Ave & Balbo Ave	3	144	144	144	144
4	2	Michigan Ave & Balbo Ave	4	144	144	144	144

This is much better.
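For readers who do not have `ensae_projects` installed, the same completion step can be approximated with plain pandas. This is a sketch on invented values, not the implementation of `add_missing_time`; it assumes pandas 1.2 or later for `how="cross"`.

```python
import pandas

# the 144 ten-minute slots of a day, as datetime.time objects
slots = [t.time() for t in
         pandas.date_range("2016-01-01", periods=24 * 6, freq="10min")]

# toy data (invented values) with most slots missing
toy = pandas.DataFrame({
    "to_station_id": [2, 2, 3],
    "stopweekday": [0, 0, 5],
    "stoptime10": [slots[1], slots[6], slots[50]],
    "dist": [0.4, 0.6, 1.0],
})

# every (station, week day) pair crossed with every slot
keys = toy[["to_station_id", "stopweekday"]].drop_duplicates()
grid = keys.merge(pandas.DataFrame({"stoptime10": slots}), how="cross")

# a left join keeps all slots, missing ones get a zero share
filled = grid.merge(toy, on=["to_station_id", "stopweekday", "stoptime10"],
                    how="left")
filled["dist"] = filled["dist"].fillna(0.0)

sizes = filled.groupby(["to_station_id", "stopweekday"]).size()
print(sizes)
```

Each group now has exactly 144 rows, which is what the pivot into fixed-size feature vectors requires.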

df[df["to_station_id"] == 2].plot(x="stoptime10", y="dist", figsize=(14,4), kind="area");

../_images/city_bike_solution_cluster_20_0.png

Let’s build the features.

features = df.pivot_table(index=["to_station_id", "to_station_name", "stopweekday"],
                          columns="stoptime10", values="dist").reset_index()
features.head()

stoptime10	to_station_id	to_station_name	stopweekday	00:00:00	00:10:00	00:20:00	00:30:00	00:40:00	00:50:00	01:00:00	...	22:20:00	22:30:00	22:40:00	22:50:00	23:00:00	23:10:00	23:20:00	23:30:00	23:40:00	23:50:00
0	2	Michigan Ave & Balbo Ave	0	0.000000	0.002191	0.002191	0.002191	0.000000	0.000000	0.003286	...	0.004381	0.002191	0.004381	0.002191	0.004381	0.004381	0.005476	0.002191	0.000000	0.005476
1	2	Michigan Ave & Balbo Ave	1	0.000000	0.002677	0.000000	0.000000	0.001339	0.002677	0.000000	...	0.009371	0.012048	0.006693	0.004016	0.005355	0.006693	0.002677	0.000000	0.000000	0.000000
2	2	Michigan Ave & Balbo Ave	2	0.002907	0.002907	0.002907	0.004360	0.001453	0.007267	0.002907	...	0.002907	0.002907	0.015988	0.005814	0.001453	0.001453	0.011628	0.000000	0.000000	0.007267
3	2	Michigan Ave & Balbo Ave	3	0.000000	0.000000	0.000000	0.000000	0.007728	0.000000	0.001546	...	0.009274	0.003091	0.003091	0.007728	0.001546	0.003091	0.009274	0.001546	0.007728	0.001546
4	2	Michigan Ave & Balbo Ave	4	0.002053	0.000000	0.000000	0.002053	0.000000	0.002053	0.003080	...	0.008214	0.001027	0.006160	0.004107	0.015400	0.006160	0.002053	0.006160	0.007187	0.000000

5 rows × 147 columns

names = features.columns[3:]
names

Index([00:00:00, 00:10:00, 00:20:00, 00:30:00, 00:40:00, 00:50:00, 01:00:00,
       01:10:00, 01:20:00, 01:30:00,
       ...
       22:20:00, 22:30:00, 22:40:00, 22:50:00, 23:00:00, 23:10:00, 23:20:00,
       23:30:00, 23:40:00, 23:50:00],
      dtype='object', name='stoptime10', length=144)

from sklearn.cluster import KMeans
clus = KMeans(8)
clus.fit(features[names])

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=8, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

pred = clus.predict(features[names])
set(pred)

{0, 1, 2, 3, 4, 5, 6, 7}

features["cluster"] = pred
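The model above was fitted with `random_state=None`, so the numbering of the clusters may change from one run to the next and the cluster ids quoted in the rest of this page belong to that particular run. A minimal sketch on synthetic profiles (invented data) shows how fixing `random_state` makes the labels reproducible.

```python
import numpy
from sklearn.cluster import KMeans

rng = numpy.random.RandomState(0)

# two synthetic daily profiles over 144 slots: morning peak and evening peak
morning = numpy.zeros(144)
morning[48:54] = 1 / 6      # 08:00 to 09:00
evening = numpy.zeros(144)
evening[102:108] = 1 / 6    # 17:00 to 18:00

# 20 noisy copies of each profile
X = numpy.vstack(
    [morning + rng.uniform(0, 0.001, 144) for _ in range(20)]
    + [evening + rng.uniform(0, 0.001, 144) for _ in range(20)])

# same seed, same labels
first = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
second = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print((first == second).all())
```

The value 0 for the seed is arbitrary; any fixed integer gives the same guarantee.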

Let’s see how the clusters are distributed across the days of the week.

features[["cluster", "stopweekday", "to_station_id"]].groupby(["cluster", "stopweekday"]).count()

	stoptime10	to_station_id
cluster	stopweekday
0	0	304
	1	348
	2	337
	3	315
	4	275
	5	36
	6	46
1	0	2
	1	1
	2	1
	6	1
2	5	2
3	0	1
	2	1
	3	1
	4	2
4	0	1
	1	1
	2	1
	3	1
5	0	1
	6	2
6	0	123
	1	133
	2	143
	3	137
	4	147
	5	13
	6	11
7	0	141
	1	93
	2	94
	3	121
	4	154
	5	526
	6	514

Let’s draw the clusters.

centers = clus.cluster_centers_.T
import matplotlib.pyplot as plt
fig, ax = plt.subplots(centers.shape[1], 1, figsize=(10,10))
x = list(range(0,centers.shape[0]))
for i in range(centers.shape[1]):
    ax[i].bar(x, centers[:,i], width=1.0)
    ax[i].set_ylabel("cluster %d" % i)

../_images/city_bike_solution_cluster_30_0.png

Three patterns emerge. The small clusters are still annoying, but let’s show them on a map anyway.

Graph#

We first need to get 7 clusters for each station, one per day of the week.

piv = features.pivot_table(index=["to_station_id","to_station_name"], columns="stopweekday", values="cluster")
piv.head()

	stopweekday	0	1	2	3	4	5	6
to_station_id	to_station_name
2	Michigan Ave & Balbo Ave	7.0	0.0	7.0	7.0	7.0	7.0	7.0
3	Shedd Aquarium	7.0	7.0	7.0	7.0	7.0	7.0	7.0
4	Burnham Harbor	7.0	7.0	0.0	7.0	7.0	7.0	7.0
5	State St & Harrison St	0.0	0.0	0.0	0.0	0.0	7.0	7.0
6	Dusable Harbor	7.0	7.0	0.0	7.0	7.0	7.0	7.0

piv["distincts"] = piv.apply(lambda row: len(set(row[i] for i in range(0,7))), axis=1)

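Let’s see which stations are classified in at least 4 clusters. NaN means no bike stopped at this station on that day; these are mostly little-used stations. Note that each NaN counts as a separate value in distincts, which inflates the count for these stations.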

piv[piv.distincts >= 4]

	stopweekday	0	1	2	3	4	5	6	distincts
to_station_id	to_station_name
384	Halsted St & 51st St	NaN	7.0	7.0	7.0	0.0	0.0	6.0	4
386	Halsted St & 56th St	5.0	7.0	0.0	7.0	7.0	7.0	6.0	4
409	Shields Ave & 43rd St	7.0	NaN	7.0	NaN	6.0	0.0	7.0	5
530	Laramie Ave & Kinzie St	3.0	7.0	7.0	7.0	0.0	6.0	7.0	4
538	Cicero Ave & Flournoy St	7.0	6.0	NaN	6.0	7.0	7.0	5.0	4
543	Laramie Ave & Gladys Ave	0.0	7.0	6.0	6.0	6.0	6.0	NaN	4
548	Morgan St & Pershing Rd	NaN	7.0	7.0	6.0	0.0	7.0	0.0	4
556	Throop St & 52nd St	7.0	0.0	7.0	3.0	0.0	2.0	7.0	4
557	Damen Ave & Garfield Blvd	NaN	0.0	3.0	NaN	NaN	NaN	NaN	7
558	Ashland Ave & Garfield Blvd	NaN	7.0	7.0	0.0	3.0	7.0	NaN	5
561	Damen Ave & 61st St	1.0	6.0	6.0	0.0	NaN	7.0	7.0	5
562	Racine Ave & 61st St	NaN	NaN	NaN	NaN	6.0	NaN	NaN	7
564	Racine Ave & 65th St	7.0	NaN	NaN	NaN	0.0	2.0	6.0	7
565	Ashland Ave & 66th St	7.0	0.0	NaN	7.0	7.0	NaN	0.0	4
567	May St & 69th St	1.0	1.0	6.0	6.0	6.0	NaN	NaN	4
569	Woodlawn Ave & 75th St	NaN	NaN	0.0	7.0	0.0	NaN	NaN	6
576	Greenwood Ave & 79th St	0.0	0.0	0.0	NaN	6.0	NaN	0.0	4
580	Exchange Ave & 79th St	4.0	NaN	0.0	0.0	0.0	7.0	0.0	4
581	Commercial Ave & 83rd St	7.0	NaN	0.0	NaN	NaN	7.0	0.0	5
582	Phillips Ave & 82nd St	NaN	NaN	7.0	7.0	NaN	7.0	0.0	5
584	Ellis Ave & 83rd St	NaN	7.0	NaN	6.0	6.0	7.0	NaN	5
586	MLK Jr Dr & 83rd St	7.0	6.0	1.0	6.0	6.0	0.0	0.0	4
587	Wabash Ave & 83rd St	NaN	NaN	7.0	NaN	7.0	6.0	0.0	6
588	South Chicago Ave & 83rd St	NaN	0.0	0.0	NaN	7.0	7.0	0.0	4
591	Kilbourn Ave & Milwaukee Ave	0.0	6.0	6.0	6.0	6.0	7.0	NaN	4
593	Halsted St & 59th St	NaN	4.0	7.0	7.0	NaN	7.0	5.0	5

Let’s draw a map for a week day (Tuesday, column 1).

data = stations.merge(piv.reset_index(), left_on=["id", "name"],
                      right_on=["to_station_id", "to_station_name"], suffixes=('', '_c'))
data.sort_values("id").head()

	id	name	latitude	longitude	dpcapacity	online_date	to_station_id	to_station_name	0	1	2	3	4	5	6	distincts
357	2	Michigan Ave & Balbo Ave	41.872638	-87.623979	35	5/8/2015	2	Michigan Ave & Balbo Ave	7.0	0.0	7.0	7.0	7.0	7.0	7.0	2
456	3	Shedd Aquarium	41.867226	-87.615355	31	4/24/2015	3	Shedd Aquarium	7.0	7.0	7.0	7.0	7.0	7.0	7.0	1
53	4	Burnham Harbor	41.856268	-87.613348	23	5/16/2015	4	Burnham Harbor	7.0	7.0	0.0	7.0	7.0	7.0	7.0	2
497	5	State St & Harrison St	41.874053	-87.627716	23	6/18/2013	5	State St & Harrison St	0.0	0.0	0.0	0.0	0.0	7.0	7.0	2
188	6	Dusable Harbor	41.885042	-87.612795	31	4/24/2015	6	Dusable Harbor	7.0	7.0	0.0	7.0	7.0	7.0	7.0	2

from ensae_projects.datainc.data_bikes import folium_html_stations_map

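# 8 colors for 8 clusters, otherwise clusters 0 and 7 share the same color through the modulo
colors = ["blue", "red", "yellow", "gray", "green", "black", "brown", "orange"]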
xy = []
for els in data.apply(lambda row: (row["latitude"], row["longitude"], row[1], row["name"]), axis=1):
    try:
        cl = int(els[2])
    except ValueError:
        # NaN
        continue
    name = "%s c%d" % (els[3], cl)
    color = colors[cl % len(colors)]
    xy.append( ( (els[0], els[1]), (name, color)))
folium_html_stations_map(xy, width="80%")

We notice that people go to the park after work. Let’s see what happens during the weekend (Saturday, column 5).

from ensae_projects.datainc.data_bikes import folium_html_stations_map

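# 8 colors for 8 clusters, otherwise clusters 0 and 7 share the same color through the modulo
colors = ["blue", "red", "yellow", "gray", "green", "black", "brown", "orange"]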
xy = []
for els in data.apply(lambda row: (row["latitude"], row["longitude"], row[5], row["name"]), axis=1):
    try:
        cl = int(els[2])
    except ValueError:
        # NaN
        continue
    name = "%s c%d" % (els[3], cl)
    color = colors[cl % len(colors)]
    xy.append( ( (els[0], els[1]), (name, color)))
folium_html_stations_map(xy, width="80%")
