Case Study 1: How Does a Bike-Share Navigate Speed Success?

This case study is the capstone project for the Google Data Analytics Professional Certificate.

Introduction

This case study will be used to showcase real-world tasks and each step of the data analysis process that was taught in the Google Data Analytics Professional Certificate: ask, prepare, process, analyze, share, and act.

The company that I am working for, Cyclistic, is a fictional company, however, the data used is from a real bike-share company from Chicago. For more information on the data used, please see the Prepare section

Scenario

I am a data analyst working at Cyclistic in the marketing analyst team for six months. Lily Moreno, the director of marketing, believes that maximizing the number of annual members known as Cyclistic members will be the key to the company’s future success. Therefore, the marketing analyst team is tasked to understand how casual riders and annual riders use the company’s services differently. We will develop a new marketing strategy to convert casual riders into annual members. However, our recommendations must be backed up by compelling data insights and professional data visualizations before the Cyclistic executives approve of them.

Characters and Teams

Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.
Lily Moreno: The director of marketing. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy.
Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.

About the company

Cyclistic is a bike-share company that was started in 2016. Currently, there are 5,824 bicycles and 692 docking stations across Chicago. Each bicycle is geotracked and can be unlocked and returned to any of the 692 stations around Chicago.

In the past, Cyclistic’s marketing strategy was to bring awareness to the program and to appeal to and target broader consumers. This was done by offering three types of flexible pricing options: single-ride pass, full-day pass or annual memberships. Single-ride pass and full-day pass users are known as Casual Riders. Customers who purchased the annual memberships are known as Cyclistic Members. However, the finance analysts in the company have determined that annual memberships are more profitable than casual rider passes. Moreno, the director of marketing, believes that the key to the company’s future success is to maximize the number of annual memberships. Since casual riders are already aware of Cyclistic and are current users of Cyclistic, she believes there is a high chance of converting casual riders into annual members.

Moreno’s goal is: Design a marketing strategy aimed to convert casual riders into annual members. To achieve this, the marketing analyst team will analyze historical bike trip data to identify trends to:

Better understand the difference between Cyclistic members and casual riders.
Why casual riders would buy a membership.
How digital media could affect marketing tactics.

Ask

Deliverable: A clear statement of the business task.

Problem

How can we convert casual riders to annual members to maximize the number of annual members and make the company more profitable?

Business Task

The director of marketing believes that the number of Cyclistic members will be the key to the company’s future success. Instead of creating new marketing campaigns to increase new customers, the goal is to convert current casual riders into annual members.

The three questions that are used to guide the future marketing programs are:

How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?

In this case study, we are tasked with answering the first question:

How do annual members and casual riders use Cyclistic bikes differently?

Key Stakeholders

Cyclistic executive team: The executive team will be the ones that approve of the new marketing program
Lily Moreno: The director of marketing who is responsible for campaign development and initiatives to promote the bike-sharing program.

Prepare

Deliverable: Description of all data sources used.

Data Source

The data can be downloaded from Divvy Data.

As the data was collected by Divvy, we assume that our data source is first-party with no credibility issues. The data is released to the public every month in a CSV file.

For this case study, we will be analyzing on the trips from the last 12 months, April 2020 to March 2021. As a result, our dataset will be comprised of 12 CSV files with 13 columns for:

Rider ID
The type of bicycle the rider rented
The start and end times of the trip
The start and end station names
The start and end station IDs
The start and end longitude and latitude coordinates
The rider’s member type (casual or member)

Licensing, Privacy, Security and Accesibility

The data is made available by Motivate International Inc. under the Divvy License Agreement.

The data will have no Personal Identifiable Information. As a result, we will not be able to determine if casual riders live in a Cyclistic service area or if they purchased multiple single passes.

Process

Deliverable: Documentation of any cleaning or manipulation of data.

Due to the large dataset, R will be used for cleaning and analysis. All visualizations are also created in RStudio.

Step 1: Set up of environment

Install and load the required packages that are needed for the cleaning and analysis.

install.packages("tidyverse", repos = "http://cran.us.r-project.org")

## 
## The downloaded binary packages are in
##  /var/folders/pq/bynm2rq51nb3l0bbjj4_ns6c0000gn/T//RtmpdAnZDI/downloaded_packages

install.packages("skimr", repos = "http://cran.us.r-project.org")

## 
## The downloaded binary packages are in
##  /var/folders/pq/bynm2rq51nb3l0bbjj4_ns6c0000gn/T//RtmpdAnZDI/downloaded_packages

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate) 
library(readr) 
library(skimr)

Step 2: Import data

In this section, we will import the data into our RStudio and combine the data into one data frame.

Import the dataset for each month.

X202004_divvy_tripdata <- read_csv("~/Documents/Case_study/202004-divvy-tripdata.csv")

## Rows: 84776 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

X202005_divvy_tripdata <- read_csv("~/Documents/Case_study/202005-divvy-tripdata.csv")

## Rows: 200274 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

X202006_divvy_tripdata <- read_csv("~/Documents/Case_study/202006-divvy-tripdata.csv")

## Rows: 343005 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

X202007_divvy_tripdata <- read_csv("~/Documents/Case_study/202007-divvy-tripdata.csv")

## Rows: 551480 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

X202008_divvy_tripdata <- read_csv("~/Documents/Case_study/202008-divvy-tripdata.csv")

## Rows: 622361 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

X202009_divvy_tripdata <- read_csv("~/Documents/Case_study/202009-divvy-tripdata.csv")

## Rows: 532958 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

X202010_divvy_tripdata <- read_csv("~/Documents/Case_study/202010-divvy-tripdata.csv")

## Rows: 388653 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

X202011_divvy_tripdata <- read_csv("~/Documents/Case_study/202011-divvy-tripdata.csv")

## Rows: 259716 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

X202012_divvy_tripdata <- read_csv("~/Documents/Case_study/202012-divvy-tripdata.csv")

## Rows: 131573 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

X202101_divvy_tripdata <- read_csv("~/Documents/Case_study/202101-divvy-tripdata.csv")

## Rows: 96834 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

X202102_divvy_tripdata <- read_csv("~/Documents/Case_study/202102-divvy-tripdata.csv")

## Rows: 49622 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

X202103_divvy_tripdata <- read_csv("~/Documents/Case_study/202103-divvy-tripdata.csv")

## Rows: 228496 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Check the column names in each dataset. By using colnames we can check that each dataset has the same colnames.

colnames(X202004_divvy_tripdata)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(X202005_divvy_tripdata)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(X202006_divvy_tripdata)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(X202007_divvy_tripdata)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(X202008_divvy_tripdata)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(X202009_divvy_tripdata)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(X202010_divvy_tripdata)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(X202011_divvy_tripdata)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(X202012_divvy_tripdata)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(X202101_divvy_tripdata)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(X202102_divvy_tripdata)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(X202103_divvy_tripdata)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

Combine all 12 months of the dataset into a single data frame.

annual_data <- bind_rows(X202004_divvy_tripdata, X202005_divvy_tripdata, X202006_divvy_tripdata, X202007_divvy_tripdata, X202008_divvy_tripdata, X202009_divvy_tripdata, X202010_divvy_tripdata, X202011_divvy_tripdata, X202012_divvy_tripdata, X202101_divvy_tripdata, X202102_divvy_tripdata, X202103_divvy_tripdata)

Before we do a quick check of the data, let’s also create a new column to calculate trip duration in minutes and save it in a new data frame “clean_data”.

clean_data <- mutate(annual_data, trip_duration=as.numeric(difftime(ended_at, started_at, units="min")))

Step 3: Check combined data

In this step, we will do a quick check and see an overview of our newly combined data.

skim_without_charts(clean_data)

Data summary
Name	clean_data
Number of rows	3489748
Number of columns	14
_______________________
Column type frequency:
character	7
numeric	5
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
ride_id	0	1.00	16	16	3489539
rideable_type	0	1.00	11	13	3
start_station_name	122175	0.96	10	53	708
start_station_id	122801	0.96	1	35	1259
end_station_name	143242	0.96	10	53	706
end_station_id	143703	0.96	1	35	1259
member_casual	0	1.00	6	6	2

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
start_lat	0	1	41.90	0.04	41.64	41.88	41.90	41.93	42.08
start_lng	0	1	-87.64	0.03	-87.87	-87.66	-87.64	-87.63	-87.52
end_lat	4738	1	41.90	0.04	41.54	41.88	41.90	41.93	42.16
end_lng	4738	1	-87.65	0.03	-88.07	-87.66	-87.64	-87.63	-87.44
trip_duration	0	1	24.77	390.42	-29049.97	7.88	14.52	26.63	58720.03

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
started_at	0	1	2020-04-01 00:00:30	2021-03-31 23:59:08	2020-08-29 14:50:36	3040228
ended_at	0	1	2020-04-01 00:10:45	2021-04-06 11:00:11	2020-08-29 15:21:13	3027775

Observations

Results show that there are a total of 3489748 rows and 14 columns.

The following columns have missing values:

start_station_name: 122175 missing values
start_station_id: 122801 missing values
end_station_name: 143242 missing values
end_station_id: 143703 missing values
end_lat: 4738 missing values
end_lng: 4738 missing values

Additionally, we can also see that there are other errors in the ending latitude and longitude values. The geographical coordinates should all be very similar as the program only runs in Chicago.

The newly added “trip_duration” column also shows that there are negative trip durations and a very high trip duration of 58720.03 minutes.

Step 4: Cleaning Our Data Frame

Continuing with our observations we can begin to clean our data.

Step 4.1: Removing missing values, incorrect geographical coordinates and negative trip_durations.

clean_data <- clean_data %>%
  drop_na() %>% 
  filter(end_lat > 0 & end_lng < 0) %>% 
  filter(trip_duration > 0)

A quick check with skim_without_charts() shows that all the missing values, the (0,0) coordinates, and trip_duration that were negative have all been removed. However, there are still trips that are less than 1 minute and longer than 1 day. Trips less than 1 minute will be removed as they could be false starts or riders attempting to re-dock to ensure that the bike is secured. Trip durations greater than 24 hours for casual riders will also be removed as casual riders only have single-ride passes or full-day passes, which means they cannot rent a bicycle for greater than 24 hours (or 1440 minutes).

Adding the group_by function in our overview allows us to see the difference in trip_duration between casual riders and members.

skim_without_charts(group_by(clean_data,member_casual))

Data summary
Name	group_by(clean_data, memb…
Number of rows	3283948
Number of columns	14
_______________________
Column type frequency:
character	6
numeric	5
POSIXct	2
________________________
Group variables	member_casual

Variable type: character

skim_variable	member_casual	complete_rate	min	max	n_unique
ride_id	casual	1	16	16	1347908
ride_id	member	1	16	16	1936040
rideable_type	casual	1	11	13	3
rideable_type	member	1	11	13	3
start_station_name	casual	1	10	53	690
start_station_name	member	1	10	53	693
start_station_id	casual	1	1	12	1252
start_station_id	member	1	1	35	1219
end_station_name	casual	1	10	53	703
end_station_name	member	1	10	53	693
end_station_id	casual	1	1	35	1255
end_station_id	member	1	1	35	1217

Variable type: numeric

skim_variable	member_casual	complete_rate	mean	sd	p0	p25	p50	p75	p100
start_lat	casual	1	41.90	0.04	41.65	41.88	41.90	41.93	42.06
start_lat	member	1	41.91	0.04	41.65	41.88	41.90	41.93	42.06
start_lng	casual	1	-87.64	0.03	-87.77	-87.65	-87.64	-87.63	-87.53
start_lng	member	1	-87.65	0.02	-87.77	-87.66	-87.64	-87.63	-87.53
end_lat	casual	1	41.90	0.04	41.65	41.88	41.90	41.93	42.06
end_lat	member	1	41.91	0.04	41.65	41.88	41.90	41.93	42.07
end_lng	casual	1	-87.64	0.03	-87.77	-87.65	-87.64	-87.63	-87.53
end_lng	member	1	-87.65	0.02	-87.77	-87.66	-87.64	-87.63	-87.53
trip_duration	casual	1	45.46	387.82	0.02	11.87	21.63	40.98	55683.88
trip_duration	member	1	15.93	93.54	0.02	6.62	11.60	20.18	58720.03

Variable type: POSIXct

skim_variable	member_casual	complete_rate	min	max	median	n_unique
started_at	casual	1	2020-04-01 00:00:30	2021-03-31 23:53:37	2020-08-16 15:41:22	1249582
started_at	member	1	2020-04-01 00:02:35	2021-03-31 23:59:08	2020-09-01 09:23:05	1801123
ended_at	casual	1	2020-04-01 00:23:03	2021-04-06 11:00:11	2020-08-16 16:23:41	1244547
ended_at	member	1	2020-04-01 00:10:45	2021-04-01 00:35:27	2020-09-01 09:39:26	1789886

Step 4.2: Remove trip durations less than one minute and greater than 1440 minutes for casual riders.

In this step, we will remove the rows where trip_duration is less than one minute and remove rows greater than 1440 minutes for casual riders, while leaving the Cyclistic member’s rows alone.

clean_data_2 <- clean_data %>%
  filter(trip_duration > 1) %>%  
  filter((member_casual == "casual" & trip_duration < 1440) | (member_casual == "member"))

Another quick overview with skim_without_charts(), shows that our data frame is much cleaner now with 3238911 rows left. We do not have any missing data or whitespace in our columns, and the range of geographic coordinates is closer to each other.

After our last step, the trip duration for casual riders only ranges from 1.02 minutes to 1440 minutes (24 hours) and for Cyclistic members ranges from 1.02 minutes to 58720 minutes (978.66 hours).

skim_without_charts(group_by(clean_data_2, member_casual))

Data summary
Name	group_by(clean_data_2, me…
Number of rows	3238911
Number of columns	14
_______________________
Column type frequency:
character	6
numeric	5
POSIXct	2
________________________
Group variables	member_casual

Variable type: character

skim_variable	member_casual	complete_rate	min	max	n_unique
ride_id	casual	1	16	16	1332019
ride_id	member	1	16	16	1906892
rideable_type	casual	1	11	13	3
rideable_type	member	1	11	13	3
start_station_name	casual	1	10	53	690
start_station_name	member	1	10	53	693
start_station_id	casual	1	1	12	1252
start_station_id	member	1	1	35	1218
end_station_name	casual	1	10	53	703
end_station_name	member	1	10	53	693
end_station_id	casual	1	1	35	1255
end_station_id	member	1	1	35	1216

Variable type: numeric

skim_variable	member_casual	complete_rate	mean	sd	p0	p25	p50	p75	p100
start_lat	casual	1	41.90	0.04	41.65	41.88	41.90	41.93	42.06
start_lat	member	1	41.91	0.04	41.65	41.88	41.90	41.93	42.06
start_lng	casual	1	-87.64	0.03	-87.77	-87.65	-87.64	-87.63	-87.53
start_lng	member	1	-87.65	0.02	-87.77	-87.66	-87.64	-87.63	-87.53
end_lat	casual	1	41.90	0.04	41.65	41.88	41.90	41.93	42.06
end_lat	member	1	41.91	0.04	41.65	41.88	41.90	41.93	42.07
end_lng	casual	1	-87.64	0.03	-87.77	-87.65	-87.64	-87.63	-87.53
end_lng	member	1	-87.65	0.02	-87.77	-87.66	-87.64	-87.63	-87.53
trip_duration	casual	1	37.33	63.58	1.02	12.10	21.87	41.18	1439.90
trip_duration	member	1	16.17	94.23	1.02	6.80	11.78	20.38	58720.03

Variable type: POSIXct

skim_variable	member_casual	complete_rate	min	max	median	n_unique
started_at	casual	1	2020-04-01 00:00:30	2021-03-31 23:51:09	2020-08-16 15:13:43	1235815
started_at	member	1	2020-04-01 00:02:35	2021-03-31 23:59:08	2020-09-01 10:43:59	1775956
ended_at	casual	1	2020-04-01 00:23:03	2021-04-01 00:13:22	2020-08-16 15:56:59	1230576
ended_at	member	1	2020-04-01 00:10:45	2021-04-01 00:35:27	2020-09-01 11:00:46	1764799

Step 5: Set up clean data for analysis

We will now create a few more columns in preparation for analysis. A new column for each year, month and day that the ride trip started, as well as the day of the week and whether that day is a weekday or not.

clean_data_3 <- clean_data_2 %>%
  arrange(started_at) %>%
  mutate(start_year = year(started_at))%>% 
  mutate(start_month = month(started_at, label = TRUE, abbr = FALSE)) %>% 
  mutate(start_day= day(started_at)) %>% 
  mutate(start_day_of_week = wday(started_at, label = TRUE, abbr = FALSE)) %>% 
  mutate(is_weekday = if_else(start_day_of_week =="Saturday" |start_day_of_week == "Sunday", FALSE, TRUE))

Analyze

Deliverable: Summary of your analysis.

In the analyze section, we will create plots that will allow us to visualize the difference in how casual riders and Cyclistic members use the program differently.

Step 1: Min, Median, Mean, Max and Total Trip Duration

Firstly, we will take a look at the min, median, mean, and max of trip duration for:

Overall trips
Casual riders trips
Cyclistic member’s trips

#Min, Median, Mean, Max of overall trip duration:
summary(clean_data_3$trip_duration)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1.02     8.30    14.97    24.87    27.12 58720.03

#Min, Median, Mean, Max of casual riders trip duration:
summary((filter(clean_data_3,member_casual=="casual")$trip_duration))

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.017   12.100   21.867   37.327   41.183 1439.900

#Min, Median, Mean, Max of member trip duration:
summary((filter(clean_data_3,member_casual=="member")$trip_duration))

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1.02     6.80    11.78    16.17    20.38 58720.03

As well as the total trip duration for casual riders and Cyclistic members:

#Total trip duration for casual riders
sum(filter(clean_data_3,member_casual == "casual")$trip_duration)

## [1] 49720640

#Total trip duration for members:
sum(filter(clean_data_3,member_casual == "member")$trip_duration)

## [1] 30839103

Observations:

This further shows that all trips are between 1 minute and 1440 minutes (24 hours) for casual riders and greater than 1 minute for Cyclistic members who are not limited by full-day passes.

The second half of step 1 shows that the overall total trip duration for casual members is greater than the total trip duration for Cyclistic members.

Step 2: Plot 1 - Number of Riders vs Type of Membership

The first graph will show how many riders there are for each member type. This visualization is also split into weekend users and weekday users.

weekday_label <- c("Weekend", "Weekday")
names(weekday_label) <- c("FALSE", "TRUE")
clean_data_3 %>%
  group_by(member_casual, is_weekday) %>%
  summarize(number_of_rides = n()) %>%
  ggplot(aes(x = member_casual, y= number_of_rides, fill = member_casual))+
    geom_col()+
    facet_wrap(~is_weekday, labeller = labeller(is_weekday = weekday_label))+
    labs(title = "Number of Riders for Each Membership", x = "Member Type", y = "Number of Rides")+
    scale_y_continuous(labels = scales::comma) +
    scale_x_discrete(labels = c("Casual Rider", "Cyclistic Member"))+
    theme(axis.text.x =  element_text(angle=45, margin=margin(t=20)), legend.position="none")+
    geom_text(aes(label=number_of_rides, vjust=1.5))

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

This first bar graph illustrates that the number of users during the weekend is very similar between casual riders (562,995) and Cyclistic members (545,864). However, during the weekday, it is evident that there are more Cyclistic members (1,361,028) compared to the 769,024 casual riders.

Step 3: Plot 2 - Number of Rides by Day of the Week

Our second plot will show the number of riders for each day of the week for casual riders and Cyclistic members.

rides_day_of_week <- clean_data_3 %>%
  group_by(member_casual, start_day_of_week) 

rides_day_of_week %>%
  summarize(number_of_rides = n()) %>%
  arrange(member_casual, start_day_of_week) %>%
  ggplot(aes(x = start_day_of_week, y = number_of_rides, fill = member_casual)) + 
    geom_col(position="dodge")+
    labs(title="Number of Rides by Day of the Week", x = "Day of the Week", y = "Number of Rides", fill = "Member Type") +
    scale_y_continuous(labels = scales::comma) +  
    scale_fill_discrete(labels = c("Casual Riders", "Cyclistic Members"))+ 
    theme(axis.text.x = element_text(angle = 45))

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

With this bar graph, we can see that there are more riders during the weekdays for Cyclistic members. This is due to the members using the program to commute to and from work.

During the weekend, the number of rides is roughly the same between casual riders and Cyclistic members, with more casual riders on Saturdays and nearly the same number of rides between Cyclistic members and casual riders on Sundays. This shows that the number of people between casual riders and Cyclistic members that use the program for leisure purposes during the weekend is roughly the same.

Step 4: Plot 3 - Number of Rides by Month

This third visualization will show the difference in the number of rides for each month.

monthly_rides <- clean_data_3 
monthly_rides$start_month <- ordered(monthly_rides$start_month, levels = c("April", "May", "June", "July", "August", "September", "October", "November", "December", "January", "February", "March"))

monthly_rides %>%
  group_by(member_casual, start_month) %>%
  summarize(number_of_rides = n()) %>%
  arrange(member_casual, start_month) %>%
  ggplot(aes(x = start_month, y = number_of_rides, fill = member_casual)) + 
    geom_col(position="dodge") + 
    theme(axis.text.x = element_text(angle = 45), legend.position= c(0.7, 0.85)) +  
    scale_y_continuous(labels = scales::comma, breaks=c(0, 50000, 100000, 150000, 200000, 250000, 300000)) + 
    labs(title="Number of Rides by Month", x = "Month", y = "Number of Rides", fill="Member Type")+
    scale_fill_discrete(labels = c("Casual Riders", "Cyclistic Members"))+ 
    scale_x_discrete(labels = c( "April 20", "May 20", "June 20", "July 20", "August 20", "September 20", "October 20", "November 20", "December 20", "January 21", "Febuary 21", "March 21"))

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Since the data we are using is from April 2020 to March 2021, our bar graph is organized as such.

From this graph, we can see that the program is popular for both casual riders and Cyclistic members during the peak summer months of July and August, with a slight decline during the spring and fall months for casual riders. However, there is a large decline in users during the winter months (November to February) for both casual riders and Cyclistic members. The drop in usage for Cyclistic members is not as great as for casual riders during the winter months.

Step 5: Plot 4 - Number of Rides Started Every Hour

Plot 4 will show the number of rides that started every hour for each day of the week.

hourly_rides <- clean_data_3 %>% 
  mutate(start_hour = hour(started_at)) %>%
  group_by(member_casual, start_hour, start_day_of_week)

hourly_rides %>%
  summarize(number_of_rides = n()) %>%
  arrange(member_casual, start_hour) %>%
  ggplot(aes(x=start_hour, y = number_of_rides, fill = member_casual))+ 
    geom_col(position="dodge")+
    facet_wrap(~start_day_of_week)+ 
    labs(title = "Number of Rides Started Every Hour", x = "Time Ride Started (24Hr)", y = "Number of Rides", fill = "Member Type") +
    scale_fill_discrete(labels = c("Casual Riders", "Cyclistic Members"))+
    scale_y_continuous(labels = scales::comma, breaks=c(0, 10000, 20000, 30000, 40000))+ 
    scale_x_continuous(breaks=c(0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22)) + 
    theme(axis.text.x = element_text(angle = 90), legend.position= c(0.7, 0.15))

## `summarise()` has grouped output by 'member_casual', 'start_hour'. You can
## override using the `.groups` argument.

Plot 4 illustrates that the number of rides started at each hour on the weekends is similar between casual riders and Cyclistic members. This shows that the trends between casual riders and Cyclistic members for leisure purposes are similar. However, during the weekdays, peaks between 7 AM (7H) to 9 AM (9H) and 4 PM (16H) to 6 PM (18H) clearly show that many Cyclistic members use the program to commute to and from work. Comparatively, casual riders tend to have a gradually increasing trend of riders as the day progresses with small peaks during commute hours on Tuesdays, Wednesdays, Thursdays and Fridays.

Step 6: Plot 5: Average Duration of Rides by Month

This next visualization shows the average duration of rides for each month.

monthly_rides %>% 
  group_by(member_casual, start_month) %>%
  summarize(average_duration=mean(trip_duration)) %>%
  ggplot(aes(x = start_month, y = average_duration, fill = member_casual))+
  geom_col(position="dodge")+
  labs(title="Average Duration for Each Month", x = "Month", y = "Average Duration (Minutes)", fill = "Member Type") +
  theme(axis.text.x = element_text(angle = 45))+
  scale_fill_discrete(labels = c("Casual Riders", "Cyclistic Members"))+
  scale_x_discrete(labels = c("April 20", "May 20", "June 20", "July 20", "August 20", "September 20", "October 20", "November 20", "December 20", "January 21", "Febuary 21", "March 21"))

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

According to this graph, on average, casual riders will borrow a bicycle for longer periods than Cyclistic members. Casual riders have an average trip duration of roughly 25 minutes to 45 minutes throughout the year. During the winter months, the average rides for casual riders drops to less than 25 minutes, with summer rides ranging from 40 to 45 minutes. While Cyclistic members have an average trip duration ranging from 12 minutes to 22 minutes. This could be due to the costs of single passes. Cyclistic members, who have an unlimited number of rentals throughout the year, could return bikes in between stops and errands, while casual riders will need to pay every time they rent and return. This could lead to casual riders keeping the bicycle and not returning it to a docking station between errands.

Step 7: Plot 6 - Average Duration for Each Day

Similar to the last graph, this bar graph will show the average duration of rides for each day of the week.

clean_data_3 %>%
  group_by(member_casual, start_day_of_week) %>%
  summarize(number_of_rides = n(), average_duration = mean(trip_duration)) %>%
  arrange(number_of_rides, start_day_of_week) %>%
  ggplot(aes(x = start_day_of_week, y = average_duration, fill = member_casual))+
    geom_col(position= "dodge")+
    labs(title="Average Duration for Each Day", x = "Day of the Week", y = "Average Duration (Minutes)", fill = "Member Type")+
    scale_fill_discrete(labels = c("Casual Riders", "Cyclistic Members"))+
    scale_y_continuous(breaks = c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50))+
    theme(axis.text.x = element_text(angle = 45))

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Similar to Plot 5, Plot 6 also shows a greater average duration of rides for casual riders for each day of the week. During the weekend, on average casual riders have leisure rides for greater than 40 minutes and around 33 to 42 minutes during the week. For Cyclistic members, the average duration of rides ranges between 15 to 18 minutes.

Step 8: Plot 7 - Average Duration Every Hour

In plot 7, we can see the average duration of rides that started in each hour for each day of the week.

hourly_rides %>%
  group_by(member_casual, start_hour, start_day_of_week) %>%
  summarize(average_duration = mean(trip_duration)) %>%
  arrange(member_casual, start_hour) %>%
  ggplot(aes(x=start_hour, y = average_duration, fill = member_casual))+
    geom_col(position="dodge") + 
    facet_wrap(~start_day_of_week)+
    labs(title = "Average Duration Every Hour", x = "Time Ride Started (24Hr)", y = "Average Duration (Minutes)", fill = "Member Type") +
    scale_fill_discrete(labels = c("Casual Riders", "Cyclistic Members"))+
    scale_y_continuous(breaks = c(0, 5, 10, 15, 20, 25, 30))+
    scale_x_continuous(breaks=c(0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22)) +
    theme(axis.text.x = element_text(angle = 90), legend.position= c(0.7, 0.15))

## `summarise()` has grouped output by 'member_casual', 'start_hour'. You can
## override using the `.groups` argument.

This series of bar graphs further shows the consistency of uses for Cyclistic members throughout weekdays. The average duration of each ride is between 15 to 20 minutes.

For casual riders, there are fluctuations between the average duration for each day. On the weekdays, there seems to be a shorter average duration during dawn and early mornings with longer average duration rides every day around 10 AM to 2 PM.

Summary of Analysis

In summary, Plot 1, Plot 2 and Plot 4 show that there is a similar number of users between casual riders and Cyclistic members during weekends. Not only are the number of riders similar, but the number of rides that start at each hour is also very similar. This shows that the leisure uses for the program between the casual riders and Cyclistic members are probably similar. However, the program is much more popular for Cyclistic members during the weekdays compared to casual riders. In plot 4, it is evident that a majority of the Cyclistic members are using the program to commute to and from work as the number of riders from 7 AM to 9 AM and 4 PM to 6 PM is greater than at any other time. There are still some casual riders who use the company’s services for commuting purposes. This can be seen from the slight increase in the number of rides for casual riders during commute hours on Tuesdays, Wednesdays, Thursdays and Fridays.

Plot 3 clearly illustrates the effects of weather on the number of riders each month. As expected, there are fewer riders during the winter from November to February. With peak usage during the summer months from July and August.

From Plot 5, Plot 6 and Plot 7, it seems like there is a higher average ride duration for casual riders above 30 minutes. While Cyclistic members typically only use the program for 15 to 20 minutes at a time. As mentioned earlier, this could be due to casual riders being more cost-efficient. Instead of returning their bicycles to docking stations between stops or errands, they will keep the bicycle with them causing the average duration to increase. Conversely, Cyclistic members who do not have to worry about the cost per ride, are free to return the bicycle in between errands, keeping the average duration lower.

Case Study 1: How Does a Bike-Share Navigate Speed Success?

Priyanka Belde

2023-08-23

Introduction

Scenario

Characters and Teams

About the company

Ask

Problem

Business Task

Key Stakeholders

Prepare

Data Source

Licensing, Privacy, Security and Accesibility

Process

Step 1: Set up of environment

Step 2: Import data

Step 3: Check combined data

Observations

Step 4: Cleaning Our Data Frame

Step 4.1: Removing missing values, incorrect geographical coordinates and negative trip_durations.

Step 4.2: Remove trip durations less than one minute and greater than 1440 minutes for casual riders.

Step 5: Set up clean data for analysis

Analyze

Step 1: Min, Median, Mean, Max and Total Trip Duration

Observations:

Step 2: Plot 1 - Number of Riders vs Type of Membership

Step 3: Plot 2 - Number of Rides by Day of the Week

Step 4: Plot 3 - Number of Rides by Month

Step 5: Plot 4 - Number of Rides Started Every Hour

Step 6: Plot 5: Average Duration of Rides by Month

Step 7: Plot 6 - Average Duration for Each Day

Step 8: Plot 7 - Average Duration Every Hour

Summary of Analysis

Case Study 1: How Does a Bike-Share Navigate Speed Success?

Priyanka Belde

2023-08-23

Introduction

Scenario

Characters and Teams

About the company

Ask

Problem

Business Task

Key Stakeholders

Prepare

Data Source

Licensing, Privacy, Security and Accesibility

Process

Step 1: Set up of environment

Step 2: Import data

Step 3: Check combined data

Observations

Step 4: Cleaning Our Data Frame

Step 4.1: Removing missing values, incorrect geographical coordinates and negative trip_durations.

Step 4.2: Remove trip durations less than one minute and greater than 1440 minutes for casual riders.

Step 5: Set up clean data for analysis

Analyze

Step 1: Min, Median, Mean, Max and Total Trip Duration

Observations:

Step 2: Plot 1 - Number of Riders vs Type of Membership

Step 3: Plot 2 - Number of Rides by Day of the Week

Step 4: Plot 3 - Number of Rides by Month

Step 5: Plot 4 - Number of Rides Started Every Hour

Step 6: Plot 5: Average Duration of Rides by Month

Step 7: Plot 6 - Average Duration for Each Day

Step 8: Plot 7 - Average Duration Every Hour

Summary of Analysis

Share and Act

Recommendations: