How Does a Bike-Share Navigate Speedy Success

Some quick analysis on Cycle bike-share data from Chicago

Author

Thomas H. Simm

Introduction

Some quick analysis on Cycle bike-share data from Chicago

Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path. By the end of this lesson, you will have a portfolio-ready case study. Download the packet and reference the details of this case study anytime. Then, when you begin your job hunt, your case study will be a tangible way to demonstrate your knowledge and skills to potential employers.

Some imports

import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import requests
# from datetime import datetime, timezone
import os
import geopandas as gpd

#get the current working directory
owd=os.getcwd()

A function to extract web zip files and save to /dat folder

We’ll just look at Q1 for 2020 data

The data for 2020 is split into Q1 and the rest is for each month. So we need to download and convert each to a data frame and then combine them together

def extractStuff(url):
    import requests, zipfile, io
    import os
    owd=os.getcwd()
    r = requests.get(url)
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall(owd+"/dat/")
URL="https://divvy-tripdata.s3.amazonaws.com/"
noma =[['202004-divvy-tripdata'],
['202005-divvy-tripdata'],
['202006-divvy-tripdata'],
['202007-divvy-tripdata'],
['202008-divvy-tripdata'],
['202009-divvy-tripdata'],
['202010-divvy-tripdata'],
['202011-divvy-tripdata'],
['202012-divvy-tripdata']]
    
for nom in noma:
    extractStuff(URL+ nom[0]+".zip")

Load files, put in pandas data frame and have a look

Lets load each csv and combine them

for i,nom in enumerate(noma):
    if i>0:
        df=pd.read_csv(owd+"/dat/"+nom[0]+'.csv')
        dfAll=pd.concat([df,dfAll])    
        print('1 ',nom[0])
    else:
        dfAll=pd.read_csv(owd+"/dat/"+nom[0]+'.csv')
        print('2',nom[0])
2 202004-divvy-tripdata
1  202005-divvy-tripdata
1  202006-divvy-tripdata
1  202007-divvy-tripdata
1  202008-divvy-tripdata
1  202009-divvy-tripdata
1  202010-divvy-tripdata
1  202011-divvy-tripdata
1  202012-divvy-tripdata

How many NaN stations?

print('The percentage start stations NaN = {}'.format(100*np.shape(dfAll[dfAll['start_station_id'].isna()])[0] / np.shape(dfAll)[0]) )#95 282 3 114 796

print('The percentage end stations NaN = {}'.format(100*np.shape(dfAll[dfAll['end_station_id'].isna()])[0] / np.shape(dfAll)[0]) )#95 282 3 114 796

bothNa=dfAll[dfAll['start_station_id'].isna() | dfAll['end_station_id'].isna()] 
print('The percentage start stations NaN = {}'.format(100*np.shape(bothNa)[0] / np.shape(dfAll)[0]) )#95 282 3 114 796
The percentage start stations NaN = 3.059012532441932
The percentage end stations NaN = 3.5745840177013197
The percentage start stations NaN = 4.889597906251324
import copy
dfUse=copy.copy(dfAll[dfAll['start_station_id'].notnull() & dfAll['end_station_id'].notnull()])
dfUse.describe(include='all')
ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_casual
count 2962495 2962495 2962495 2962495 2962495 2962495.0 2962495 2962495.0 2.962495e+06 2.962495e+06 2.962495e+06 2.962495e+06 2962495
unique 2962287 3 2568216 2556286 689 1301.0 690 1304.0 NaN NaN NaN NaN 2
top C4EC57564F928A97 docked_bike 2020-09-07 15:19:26 2020-07-11 15:35:03 Streeter Dr & Grand Ave 35.0 Streeter Dr & Grand Ave 35.0 NaN NaN NaN NaN member
freq 2 2535257 12 13 32629 32192.0 34905 34467.0 NaN NaN NaN NaN 1710201
mean NaN NaN NaN NaN NaN NaN NaN NaN 4.190534e+01 -8.764452e+01 4.190559e+01 -8.764481e+01 NaN
std NaN NaN NaN NaN NaN NaN NaN NaN 4.157203e-02 2.446731e-02 4.169299e-02 2.460637e-02 NaN
min NaN NaN NaN NaN NaN NaN NaN NaN 4.164850e+01 -8.777470e+01 4.164850e+01 -8.777470e+01 NaN
25% NaN NaN NaN NaN NaN NaN NaN NaN 4.188316e+01 -8.765840e+01 4.188338e+01 -8.765862e+01 NaN
50% NaN NaN NaN NaN NaN NaN NaN NaN 4.190096e+01 -8.764117e+01 4.190096e+01 -8.764182e+01 NaN
75% NaN NaN NaN NaN NaN NaN NaN NaN 4.193120e+01 -8.762773e+01 4.193125e+01 -8.762775e+01 NaN
max NaN NaN NaN NaN NaN NaN NaN NaN 4.206490e+01 -8.752823e+01 4.206501e+01 -8.752823e+01 NaN

Now we need to convert the dates from object (i.e. string) to date format

next add a new column as time for hire in hours

dfUse.loc[:,'started_at']=pd.to_datetime(dfUse['started_at'],infer_datetime_format=True)
dfUse.loc[:,'ended_at']=pd.to_datetime(dfUse['ended_at'],infer_datetime_format=True)

delta=dfUse.iloc[:,3]-dfUse.iloc[:,2]
dd=delta.dt.total_seconds()/(60*60)
dfUse.insert(2,"hire_time_h",dd)
dfUse.head()
ride_id rideable_type hire_time_h started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_casual
0 70B6A9A437D4C30D classic_bike 0.176944 2020-12-27 12:44:29 2020-12-27 12:55:06 Aberdeen St & Jackson Blvd 13157 Desplaines St & Kinzie St TA1306000003 41.877726 -87.654787 41.888716 -87.644448 member
39 15F369FDAED4E8E3 electric_bike 0.130556 2020-12-18 13:53:56 2020-12-18 14:01:46 Larrabee St & Armitage Ave TA1309000006 Wells St & Walton St TA1306000011 41.918112 -87.643799 41.900129 -87.634448 member
50 0CFD61DFE00E6043 electric_bike 0.030000 2020-12-28 17:10:25 2020-12-28 17:12:13 Kingsbury St & Kinzie St KA1503000043 Desplaines St & Kinzie St TA1306000003 41.889193 -87.638576 41.889099 -87.642479 member
87 244CB936487039B7 docked_bike 1.013056 2020-12-10 13:36:16 2020-12-10 14:37:03 Clark St & Leland Ave TA1309000014 Clark St & Leland Ave TA1309000014 41.967096 -87.667429 41.967096 -87.667429 casual
88 B7AD5038F79637F9 classic_bike 0.101111 2020-12-20 13:09:04 2020-12-20 13:15:08 Dearborn St & Monroe St TA1305000006 Kingsbury St & Kinzie St KA1503000043 41.881320 -87.629521 41.889177 -87.638506 member

Maybe we want the day of the week?

The day of the week with Monday=0, Sunday=6.

dfUse.insert(3,'day_week',dfUse.loc[:,'started_at'].dt.dayofweek)
dfUse.head()
ride_id rideable_type hire_time_h day_week started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_casual
0 70B6A9A437D4C30D classic_bike 0.176944 6 2020-12-27 12:44:29 2020-12-27 12:55:06 Aberdeen St & Jackson Blvd 13157 Desplaines St & Kinzie St TA1306000003 41.877726 -87.654787 41.888716 -87.644448 member
39 15F369FDAED4E8E3 electric_bike 0.130556 4 2020-12-18 13:53:56 2020-12-18 14:01:46 Larrabee St & Armitage Ave TA1309000006 Wells St & Walton St TA1306000011 41.918112 -87.643799 41.900129 -87.634448 member
50 0CFD61DFE00E6043 electric_bike 0.030000 0 2020-12-28 17:10:25 2020-12-28 17:12:13 Kingsbury St & Kinzie St KA1503000043 Desplaines St & Kinzie St TA1306000003 41.889193 -87.638576 41.889099 -87.642479 member
87 244CB936487039B7 docked_bike 1.013056 3 2020-12-10 13:36:16 2020-12-10 14:37:03 Clark St & Leland Ave TA1309000014 Clark St & Leland Ave TA1309000014 41.967096 -87.667429 41.967096 -87.667429 casual
88 B7AD5038F79637F9 classic_bike 0.101111 6 2020-12-20 13:09:04 2020-12-20 13:15:08 Dearborn St & Monroe St TA1305000006 Kingsbury St & Kinzie St KA1503000043 41.881320 -87.629521 41.889177 -87.638506 member

Lets also get the time on its own

dfUse.insert(4,'time_day',dfUse.loc[:,'started_at'].dt.hour + dfUse.loc[:,'started_at'].dt.minute/60)
dfUse.head()
ride_id rideable_type hire_time_h day_week time_day started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_casual
0 70B6A9A437D4C30D classic_bike 0.176944 6 12.733333 2020-12-27 12:44:29 2020-12-27 12:55:06 Aberdeen St & Jackson Blvd 13157 Desplaines St & Kinzie St TA1306000003 41.877726 -87.654787 41.888716 -87.644448 member
39 15F369FDAED4E8E3 electric_bike 0.130556 4 13.883333 2020-12-18 13:53:56 2020-12-18 14:01:46 Larrabee St & Armitage Ave TA1309000006 Wells St & Walton St TA1306000011 41.918112 -87.643799 41.900129 -87.634448 member
50 0CFD61DFE00E6043 electric_bike 0.030000 0 17.166667 2020-12-28 17:10:25 2020-12-28 17:12:13 Kingsbury St & Kinzie St KA1503000043 Desplaines St & Kinzie St TA1306000003 41.889193 -87.638576 41.889099 -87.642479 member
87 244CB936487039B7 docked_bike 1.013056 3 13.600000 2020-12-10 13:36:16 2020-12-10 14:37:03 Clark St & Leland Ave TA1309000014 Clark St & Leland Ave TA1309000014 41.967096 -87.667429 41.967096 -87.667429 casual
88 B7AD5038F79637F9 classic_bike 0.101111 6 13.150000 2020-12-20 13:09:04 2020-12-20 13:15:08 Dearborn St & Monroe St TA1305000006 Kingsbury St & Kinzie St KA1503000043 41.881320 -87.629521 41.889177 -87.638506 member

And the distance travelled

def distanceLatLong(lat1,lon1,lat2,lon2):
    
    import numpy as np

    def deg2rad(deg):
        return deg * np.pi/180

    R = 6371; # Radius of the earth in km
    dLat = deg2rad(lat2-lat1)  # deg2rad below
    dLon = deg2rad(lon2-lon1)
    a = np.sin(dLat/2) * np.sin(dLat/2) + \
    np.cos(deg2rad(lat1)) * np.cos(deg2rad(lat2)) * \
    np.sin(dLon/2) * np.sin(dLon/2)

    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
    d = R * c ## Distance in km
 
    return d
d=distanceLatLong(dfUse["start_lat"].values,dfUse["start_lng"].values,dfUse["end_lat"].values,dfUse["end_lng"].values)
dfUse.insert(11,'distance',d)
dfUse.describe()
hire_time_h day_week time_day distance start_lat start_lng end_lat end_lng
count 2.962495e+06 2.962495e+06 2.962495e+06 2.962495e+06 2.962495e+06 2.962495e+06 2.962495e+06 2.962495e+06
mean 4.291666e-01 3.262215e+00 1.485441e+01 2.240406e+00 4.190534e+01 -8.764452e+01 4.190559e+01 -8.764481e+01
std 6.640800e+00 1.973614e+00 4.595529e+00 2.029178e+00 4.157203e-02 2.446731e-02 4.169299e-02 2.460637e-02
min -4.841661e+02 0.000000e+00 0.000000e+00 0.000000e+00 4.164850e+01 -8.777470e+01 4.164850e+01 -8.777470e+01
25% 1.377778e-01 2.000000e+00 1.188333e+01 8.598642e-01 4.188316e+01 -8.765840e+01 4.188338e+01 -8.765862e+01
50% 2.525000e-01 3.000000e+00 1.550000e+01 1.713410e+00 4.190096e+01 -8.764117e+01 4.190096e+01 -8.764182e+01
75% 4.577778e-01 5.000000e+00 1.820000e+01 3.098683e+00 4.193120e+01 -8.762773e+01 4.193125e+01 -8.762775e+01
max 9.786672e+02 6.000000e+00 2.398333e+01 4.837080e+01 4.206490e+01 -8.752823e+01 4.206501e+01 -8.752823e+01

Some issues arose above

  • hire_time_h max and min values

Looks like the error is there from the start, so lets delete them

do the same for long times

dfUse=dfUse[dfUse.hire_time_h>=0]

dfUse=dfUse[dfUse.hire_time_h<24]

Lets drop some columns for space

dfUse.drop(columns=["ride_id", "started_at","ended_at","start_station_name","end_station_name"],inplace=True)
#,"start_station_id","end_station_id"]

save

dfUse.to_csv('/data/df_2020.csv')
df=copy.copy(dfUse)
df = pd.read_csv('/data/df_2020.csv')
df
Unnamed: 0 rideable_type hire_time_h day_week time_day start_station_id end_station_id distance start_lat start_lng end_lat end_lng member_casual
0 0 classic_bike 0.176944 6 12.733333 13157 TA1306000003 1.491984 41.877726 -87.654787 41.888716 -87.644448 member
1 39 electric_bike 0.130556 4 13.883333 TA1309000006 TA1306000011 2.144117 41.918112 -87.643799 41.900129 -87.634448 member
2 50 electric_bike 0.030000 0 17.166667 KA1503000043 TA1306000003 0.323238 41.889193 -87.638576 41.889099 -87.642479 member
3 87 docked_bike 1.013056 3 13.600000 TA1309000014 TA1309000014 0.000000 41.967096 -87.667429 41.967096 -87.667429 casual
4 88 classic_bike 0.101111 6 13.150000 TA1305000006 KA1503000043 1.147392 41.881320 -87.629521 41.889177 -87.638506 member
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2949984 84771 docked_bike 0.215278 3 16.166667 140.0 140.0 0.000000 41.899000 -87.629900 41.899000 -87.629900 member
2949985 84772 docked_bike 0.319167 3 17.933333 322.0 351.0 1.056377 41.799600 -87.594700 41.803000 -87.606600 casual
2949986 84773 docked_bike 1.886111 4 19.950000 236.0 182.0 0.604983 41.907600 -87.638600 41.903200 -87.634300 casual
2949987 84774 docked_bike 0.708611 3 17.983333 310.0 310.0 0.000000 41.920100 -87.677900 41.920100 -87.677900 casual
2949988 84775 docked_bike 0.100000 5 1.516667 138.0 138.0 0.000000 41.904600 -87.640600 41.904600 -87.640600 casual

2949989 rows × 13 columns

Put frequency location onto a map

import folium
from folium import plugins
from folium.plugins import HeatMap

lat=df['start_lat'].values
lon=df['start_lng'].values
latlon = [lat, lon]

maps = folium.Map(location=[lat[0],lon[0]],
                    zoom_start = 11)

latlon=np.transpose(latlon)

# Plot it on the map
HeatMap(latlon).add_to(maps)

# Display the map
maps
Make this Notebook Trusted to load map: File -> Trust Notebook