This case study is the capstone project for the Google Data Analytics Professional Certificate.
This case study will be used to showcase real-world tasks and each step of the data analysis process that was taught in the Google Data Analytics Professional Certificate: ask, prepare, process, analyze, share, and act.
The company that I am working for, Cyclistic, is a fictional company, however, the data used is from a real bike-share company from Chicago. For more information on the data used, please see the Prepare section
I am a data analyst working at Cyclistic in the marketing analyst team for six months. Lily Moreno, the director of marketing, believes that maximizing the number of annual members known as Cyclistic members will be the key to the company’s future success. Therefore, the marketing analyst team is tasked to understand how casual riders and annual riders use the company’s services differently. We will develop a new marketing strategy to convert casual riders into annual members. However, our recommendations must be backed up by compelling data insights and professional data visualizations before the Cyclistic executives approve of them.
Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.
Lily Moreno: The director of marketing. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy.
Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.
Cyclistic is a bike-share company that was started in 2016. Currently, there are 5,824 bicycles and 692 docking stations across Chicago. Each bicycle is geotracked and can be unlocked and returned to any of the 692 stations around Chicago.
In the past, Cyclistic’s marketing strategy was to bring awareness to the program and to appeal to and target broader consumers. This was done by offering three types of flexible pricing options: single-ride pass, full-day pass or annual memberships. Single-ride pass and full-day pass users are known as Casual Riders. Customers who purchased the annual memberships are known as Cyclistic Members. However, the finance analysts in the company have determined that annual memberships are more profitable than casual rider passes. Moreno, the director of marketing, believes that the key to the company’s future success is to maximize the number of annual memberships. Since casual riders are already aware of Cyclistic and are current users of Cyclistic, she believes there is a high chance of converting casual riders into annual members.
Moreno’s goal is: Design a marketing strategy aimed to convert casual riders into annual members. To achieve this, the marketing analyst team will analyze historical bike trip data to identify trends to:
Deliverable: A clear statement of the business task.
How can we convert casual riders to annual members to maximize the number of annual members and make the company more profitable?
The director of marketing believes that the number of Cyclistic members will be the key to the company’s future success. Instead of creating new marketing campaigns to increase new customers, the goal is to convert current casual riders into annual members.
The three questions that are used to guide the future marketing programs are:
In this case study, we are tasked with answering the first question:
How do annual members and casual riders use Cyclistic bikes differently?
Deliverable: Description of all data sources used.
The data can be downloaded from Divvy Data.
As the data was collected by Divvy, we assume that our data source is first-party with no credibility issues. The data is released to the public every month in a CSV file.
For this case study, we will be analyzing on the trips from the last 12 months, April 2020 to March 2021. As a result, our dataset will be comprised of 12 CSV files with 13 columns for:
The data is made available by Motivate International Inc. under the Divvy License Agreement.
The data will have no Personal Identifiable Information. As a result, we will not be able to determine if casual riders live in a Cyclistic service area or if they purchased multiple single passes.
Deliverable: Documentation of any cleaning or manipulation of data.
Due to the large dataset, R will be used for cleaning and analysis. All visualizations are also created in RStudio.
Install and load the required packages that are needed for the cleaning and analysis.
install.packages("tidyverse", repos = "http://cran.us.r-project.org")
##
## The downloaded binary packages are in
## /var/folders/pq/bynm2rq51nb3l0bbjj4_ns6c0000gn/T//RtmpdAnZDI/downloaded_packages
install.packages("skimr", repos = "http://cran.us.r-project.org")
##
## The downloaded binary packages are in
## /var/folders/pq/bynm2rq51nb3l0bbjj4_ns6c0000gn/T//RtmpdAnZDI/downloaded_packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(readr)
library(skimr)
In this section, we will import the data into our RStudio and combine the data into one data frame.
Import the dataset for each month.
X202004_divvy_tripdata <- read_csv("~/Documents/Case_study/202004-divvy-tripdata.csv")
## Rows: 84776 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202005_divvy_tripdata <- read_csv("~/Documents/Case_study/202005-divvy-tripdata.csv")
## Rows: 200274 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202006_divvy_tripdata <- read_csv("~/Documents/Case_study/202006-divvy-tripdata.csv")
## Rows: 343005 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202007_divvy_tripdata <- read_csv("~/Documents/Case_study/202007-divvy-tripdata.csv")
## Rows: 551480 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202008_divvy_tripdata <- read_csv("~/Documents/Case_study/202008-divvy-tripdata.csv")
## Rows: 622361 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202009_divvy_tripdata <- read_csv("~/Documents/Case_study/202009-divvy-tripdata.csv")
## Rows: 532958 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202010_divvy_tripdata <- read_csv("~/Documents/Case_study/202010-divvy-tripdata.csv")
## Rows: 388653 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202011_divvy_tripdata <- read_csv("~/Documents/Case_study/202011-divvy-tripdata.csv")
## Rows: 259716 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202012_divvy_tripdata <- read_csv("~/Documents/Case_study/202012-divvy-tripdata.csv")
## Rows: 131573 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202101_divvy_tripdata <- read_csv("~/Documents/Case_study/202101-divvy-tripdata.csv")
## Rows: 96834 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202102_divvy_tripdata <- read_csv("~/Documents/Case_study/202102-divvy-tripdata.csv")
## Rows: 49622 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202103_divvy_tripdata <- read_csv("~/Documents/Case_study/202103-divvy-tripdata.csv")
## Rows: 228496 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Check the column names in each dataset. By using colnames we can check that each dataset has the same colnames.
colnames(X202004_divvy_tripdata)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(X202005_divvy_tripdata)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(X202006_divvy_tripdata)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(X202007_divvy_tripdata)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(X202008_divvy_tripdata)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(X202009_divvy_tripdata)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(X202010_divvy_tripdata)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(X202011_divvy_tripdata)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(X202012_divvy_tripdata)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(X202101_divvy_tripdata)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(X202102_divvy_tripdata)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(X202103_divvy_tripdata)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
Combine all 12 months of the dataset into a single data frame.
annual_data <- bind_rows(X202004_divvy_tripdata, X202005_divvy_tripdata, X202006_divvy_tripdata, X202007_divvy_tripdata, X202008_divvy_tripdata, X202009_divvy_tripdata, X202010_divvy_tripdata, X202011_divvy_tripdata, X202012_divvy_tripdata, X202101_divvy_tripdata, X202102_divvy_tripdata, X202103_divvy_tripdata)
Before we do a quick check of the data, let’s also create a new column to calculate trip duration in minutes and save it in a new data frame “clean_data”.
clean_data <- mutate(annual_data, trip_duration=as.numeric(difftime(ended_at, started_at, units="min")))
In this step, we will do a quick check and see an overview of our newly combined data.
skim_without_charts(clean_data)
Name | clean_data |
Number of rows | 3489748 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
character | 7 |
numeric | 5 |
POSIXct | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
ride_id | 0 | 1.00 | 16 | 16 | 0 | 3489539 | 0 |
rideable_type | 0 | 1.00 | 11 | 13 | 0 | 3 | 0 |
start_station_name | 122175 | 0.96 | 10 | 53 | 0 | 708 | 0 |
start_station_id | 122801 | 0.96 | 1 | 35 | 0 | 1259 | 0 |
end_station_name | 143242 | 0.96 | 10 | 53 | 0 | 706 | 0 |
end_station_id | 143703 | 0.96 | 1 | 35 | 0 | 1259 | 0 |
member_casual | 0 | 1.00 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|
start_lat | 0 | 1 | 41.90 | 0.04 | 41.64 | 41.88 | 41.90 | 41.93 | 42.08 |
start_lng | 0 | 1 | -87.64 | 0.03 | -87.87 | -87.66 | -87.64 | -87.63 | -87.52 |
end_lat | 4738 | 1 | 41.90 | 0.04 | 41.54 | 41.88 | 41.90 | 41.93 | 42.16 |
end_lng | 4738 | 1 | -87.65 | 0.03 | -88.07 | -87.66 | -87.64 | -87.63 | -87.44 |
trip_duration | 0 | 1 | 24.77 | 390.42 | -29049.97 | 7.88 | 14.52 | 26.63 | 58720.03 |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
started_at | 0 | 1 | 2020-04-01 00:00:30 | 2021-03-31 23:59:08 | 2020-08-29 14:50:36 | 3040228 |
ended_at | 0 | 1 | 2020-04-01 00:10:45 | 2021-04-06 11:00:11 | 2020-08-29 15:21:13 | 3027775 |
Results show that there are a total of 3489748 rows and 14 columns.
The following columns have missing values:
Additionally, we can also see that there are other errors in the ending latitude and longitude values. The geographical coordinates should all be very similar as the program only runs in Chicago.
The newly added “trip_duration” column also shows that there are negative trip durations and a very high trip duration of 58720.03 minutes.
Continuing with our observations we can begin to clean our data.
clean_data <- clean_data %>%
drop_na() %>%
filter(end_lat > 0 & end_lng < 0) %>%
filter(trip_duration > 0)
A quick check with skim_without_charts() shows that all the missing values, the (0,0) coordinates, and trip_duration that were negative have all been removed. However, there are still trips that are less than 1 minute and longer than 1 day. Trips less than 1 minute will be removed as they could be false starts or riders attempting to re-dock to ensure that the bike is secured. Trip durations greater than 24 hours for casual riders will also be removed as casual riders only have single-ride passes or full-day passes, which means they cannot rent a bicycle for greater than 24 hours (or 1440 minutes).
Adding the group_by function in our overview allows us to see the difference in trip_duration between casual riders and members.
skim_without_charts(group_by(clean_data,member_casual))
Name | group_by(clean_data, memb… |
Number of rows | 3283948 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
character | 6 |
numeric | 5 |
POSIXct | 2 |
________________________ | |
Group variables | member_casual |
Variable type: character
skim_variable | member_casual | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|---|
ride_id | casual | 0 | 1 | 16 | 16 | 0 | 1347908 | 0 |
ride_id | member | 0 | 1 | 16 | 16 | 0 | 1936040 | 0 |
rideable_type | casual | 0 | 1 | 11 | 13 | 0 | 3 | 0 |
rideable_type | member | 0 | 1 | 11 | 13 | 0 | 3 | 0 |
start_station_name | casual | 0 | 1 | 10 | 53 | 0 | 690 | 0 |
start_station_name | member | 0 | 1 | 10 | 53 | 0 | 693 | 0 |
start_station_id | casual | 0 | 1 | 1 | 12 | 0 | 1252 | 0 |
start_station_id | member | 0 | 1 | 1 | 35 | 0 | 1219 | 0 |
end_station_name | casual | 0 | 1 | 10 | 53 | 0 | 703 | 0 |
end_station_name | member | 0 | 1 | 10 | 53 | 0 | 693 | 0 |
end_station_id | casual | 0 | 1 | 1 | 35 | 0 | 1255 | 0 |
end_station_id | member | 0 | 1 | 1 | 35 | 0 | 1217 | 0 |
Variable type: numeric
skim_variable | member_casual | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|---|
start_lat | casual | 0 | 1 | 41.90 | 0.04 | 41.65 | 41.88 | 41.90 | 41.93 | 42.06 |
start_lat | member | 0 | 1 | 41.91 | 0.04 | 41.65 | 41.88 | 41.90 | 41.93 | 42.06 |
start_lng | casual | 0 | 1 | -87.64 | 0.03 | -87.77 | -87.65 | -87.64 | -87.63 | -87.53 |
start_lng | member | 0 | 1 | -87.65 | 0.02 | -87.77 | -87.66 | -87.64 | -87.63 | -87.53 |
end_lat | casual | 0 | 1 | 41.90 | 0.04 | 41.65 | 41.88 | 41.90 | 41.93 | 42.06 |
end_lat | member | 0 | 1 | 41.91 | 0.04 | 41.65 | 41.88 | 41.90 | 41.93 | 42.07 |
end_lng | casual | 0 | 1 | -87.64 | 0.03 | -87.77 | -87.65 | -87.64 | -87.63 | -87.53 |
end_lng | member | 0 | 1 | -87.65 | 0.02 | -87.77 | -87.66 | -87.64 | -87.63 | -87.53 |
trip_duration | casual | 0 | 1 | 45.46 | 387.82 | 0.02 | 11.87 | 21.63 | 40.98 | 55683.88 |
trip_duration | member | 0 | 1 | 15.93 | 93.54 | 0.02 | 6.62 | 11.60 | 20.18 | 58720.03 |
Variable type: POSIXct
skim_variable | member_casual | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|---|
started_at | casual | 0 | 1 | 2020-04-01 00:00:30 | 2021-03-31 23:53:37 | 2020-08-16 15:41:22 | 1249582 |
started_at | member | 0 | 1 | 2020-04-01 00:02:35 | 2021-03-31 23:59:08 | 2020-09-01 09:23:05 | 1801123 |
ended_at | casual | 0 | 1 | 2020-04-01 00:23:03 | 2021-04-06 11:00:11 | 2020-08-16 16:23:41 | 1244547 |
ended_at | member | 0 | 1 | 2020-04-01 00:10:45 | 2021-04-01 00:35:27 | 2020-09-01 09:39:26 | 1789886 |
In this step, we will remove the rows where trip_duration is less than one minute and remove rows greater than 1440 minutes for casual riders, while leaving the Cyclistic member’s rows alone.
clean_data_2 <- clean_data %>%
filter(trip_duration > 1) %>%
filter((member_casual == "casual" & trip_duration < 1440) | (member_casual == "member"))
Another quick overview with skim_without_charts(), shows that our data frame is much cleaner now with 3238911 rows left. We do not have any missing data or whitespace in our columns, and the range of geographic coordinates is closer to each other.
After our last step, the trip duration for casual riders only ranges from 1.02 minutes to 1440 minutes (24 hours) and for Cyclistic members ranges from 1.02 minutes to 58720 minutes (978.66 hours).
skim_without_charts(group_by(clean_data_2, member_casual))
Name | group_by(clean_data_2, me… |
Number of rows | 3238911 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
character | 6 |
numeric | 5 |
POSIXct | 2 |
________________________ | |
Group variables | member_casual |
Variable type: character
skim_variable | member_casual | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|---|
ride_id | casual | 0 | 1 | 16 | 16 | 0 | 1332019 | 0 |
ride_id | member | 0 | 1 | 16 | 16 | 0 | 1906892 | 0 |
rideable_type | casual | 0 | 1 | 11 | 13 | 0 | 3 | 0 |
rideable_type | member | 0 | 1 | 11 | 13 | 0 | 3 | 0 |
start_station_name | casual | 0 | 1 | 10 | 53 | 0 | 690 | 0 |
start_station_name | member | 0 | 1 | 10 | 53 | 0 | 693 | 0 |
start_station_id | casual | 0 | 1 | 1 | 12 | 0 | 1252 | 0 |
start_station_id | member | 0 | 1 | 1 | 35 | 0 | 1218 | 0 |
end_station_name | casual | 0 | 1 | 10 | 53 | 0 | 703 | 0 |
end_station_name | member | 0 | 1 | 10 | 53 | 0 | 693 | 0 |
end_station_id | casual | 0 | 1 | 1 | 35 | 0 | 1255 | 0 |
end_station_id | member | 0 | 1 | 1 | 35 | 0 | 1216 | 0 |
Variable type: numeric
skim_variable | member_casual | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|---|
start_lat | casual | 0 | 1 | 41.90 | 0.04 | 41.65 | 41.88 | 41.90 | 41.93 | 42.06 |
start_lat | member | 0 | 1 | 41.91 | 0.04 | 41.65 | 41.88 | 41.90 | 41.93 | 42.06 |
start_lng | casual | 0 | 1 | -87.64 | 0.03 | -87.77 | -87.65 | -87.64 | -87.63 | -87.53 |
start_lng | member | 0 | 1 | -87.65 | 0.02 | -87.77 | -87.66 | -87.64 | -87.63 | -87.53 |
end_lat | casual | 0 | 1 | 41.90 | 0.04 | 41.65 | 41.88 | 41.90 | 41.93 | 42.06 |
end_lat | member | 0 | 1 | 41.91 | 0.04 | 41.65 | 41.88 | 41.90 | 41.93 | 42.07 |
end_lng | casual | 0 | 1 | -87.64 | 0.03 | -87.77 | -87.65 | -87.64 | -87.63 | -87.53 |
end_lng | member | 0 | 1 | -87.65 | 0.02 | -87.77 | -87.66 | -87.64 | -87.63 | -87.53 |
trip_duration | casual | 0 | 1 | 37.33 | 63.58 | 1.02 | 12.10 | 21.87 | 41.18 | 1439.90 |
trip_duration | member | 0 | 1 | 16.17 | 94.23 | 1.02 | 6.80 | 11.78 | 20.38 | 58720.03 |
Variable type: POSIXct
skim_variable | member_casual | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|---|
started_at | casual | 0 | 1 | 2020-04-01 00:00:30 | 2021-03-31 23:51:09 | 2020-08-16 15:13:43 | 1235815 |
started_at | member | 0 | 1 | 2020-04-01 00:02:35 | 2021-03-31 23:59:08 | 2020-09-01 10:43:59 | 1775956 |
ended_at | casual | 0 | 1 | 2020-04-01 00:23:03 | 2021-04-01 00:13:22 | 2020-08-16 15:56:59 | 1230576 |
ended_at | member | 0 | 1 | 2020-04-01 00:10:45 | 2021-04-01 00:35:27 | 2020-09-01 11:00:46 | 1764799 |
We will now create a few more columns in preparation for analysis. A new column for each year, month and day that the ride trip started, as well as the day of the week and whether that day is a weekday or not.
clean_data_3 <- clean_data_2 %>%
arrange(started_at) %>%
mutate(start_year = year(started_at))%>%
mutate(start_month = month(started_at, label = TRUE, abbr = FALSE)) %>%
mutate(start_day= day(started_at)) %>%
mutate(start_day_of_week = wday(started_at, label = TRUE, abbr = FALSE)) %>%
mutate(is_weekday = if_else(start_day_of_week =="Saturday" |start_day_of_week == "Sunday", FALSE, TRUE))
Deliverable: Summary of your analysis.
In the analyze section, we will create plots that will allow us to visualize the difference in how casual riders and Cyclistic members use the program differently.
Firstly, we will take a look at the min, median, mean, and max of trip duration for:
#Min, Median, Mean, Max of overall trip duration:
summary(clean_data_3$trip_duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.02 8.30 14.97 24.87 27.12 58720.03
#Min, Median, Mean, Max of casual riders trip duration:
summary((filter(clean_data_3,member_casual=="casual")$trip_duration))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.017 12.100 21.867 37.327 41.183 1439.900
#Min, Median, Mean, Max of member trip duration:
summary((filter(clean_data_3,member_casual=="member")$trip_duration))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.02 6.80 11.78 16.17 20.38 58720.03
As well as the total trip duration for casual riders and Cyclistic members:
#Total trip duration for casual riders
sum(filter(clean_data_3,member_casual == "casual")$trip_duration)
## [1] 49720640
#Total trip duration for members:
sum(filter(clean_data_3,member_casual == "member")$trip_duration)
## [1] 30839103
This further shows that all trips are between 1 minute and 1440 minutes (24 hours) for casual riders and greater than 1 minute for Cyclistic members who are not limited by full-day passes.
The second half of step 1 shows that the overall total trip duration for casual members is greater than the total trip duration for Cyclistic members.
The first graph will show how many riders there are for each member type. This visualization is also split into weekend users and weekday users.
weekday_label <- c("Weekend", "Weekday")
names(weekday_label) <- c("FALSE", "TRUE")
clean_data_3 %>%
group_by(member_casual, is_weekday) %>%
summarize(number_of_rides = n()) %>%
ggplot(aes(x = member_casual, y= number_of_rides, fill = member_casual))+
geom_col()+
facet_wrap(~is_weekday, labeller = labeller(is_weekday = weekday_label))+
labs(title = "Number of Riders for Each Membership", x = "Member Type", y = "Number of Rides")+
scale_y_continuous(labels = scales::comma) +
scale_x_discrete(labels = c("Casual Rider", "Cyclistic Member"))+
theme(axis.text.x = element_text(angle=45, margin=margin(t=20)), legend.position="none")+
geom_text(aes(label=number_of_rides, vjust=1.5))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
This first bar graph illustrates that the number of users during the weekend is very similar between casual riders (562,995) and Cyclistic members (545,864). However, during the weekday, it is evident that there are more Cyclistic members (1,361,028) compared to the 769,024 casual riders.
Our second plot will show the number of riders for each day of the week for casual riders and Cyclistic members.
rides_day_of_week <- clean_data_3 %>%
group_by(member_casual, start_day_of_week)
rides_day_of_week %>%
summarize(number_of_rides = n()) %>%
arrange(member_casual, start_day_of_week) %>%
ggplot(aes(x = start_day_of_week, y = number_of_rides, fill = member_casual)) +
geom_col(position="dodge")+
labs(title="Number of Rides by Day of the Week", x = "Day of the Week", y = "Number of Rides", fill = "Member Type") +
scale_y_continuous(labels = scales::comma) +
scale_fill_discrete(labels = c("Casual Riders", "Cyclistic Members"))+
theme(axis.text.x = element_text(angle = 45))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
With this bar graph, we can see that there are more riders during the weekdays for Cyclistic members. This is due to the members using the program to commute to and from work.
During the weekend, the number of rides is roughly the same between casual riders and Cyclistic members, with more casual riders on Saturdays and nearly the same number of rides between Cyclistic members and casual riders on Sundays. This shows that the number of people between casual riders and Cyclistic members that use the program for leisure purposes during the weekend is roughly the same.
This third visualization will show the difference in the number of rides for each month.
monthly_rides <- clean_data_3
monthly_rides$start_month <- ordered(monthly_rides$start_month, levels = c("April", "May", "June", "July", "August", "September", "October", "November", "December", "January", "February", "March"))
monthly_rides %>%
group_by(member_casual, start_month) %>%
summarize(number_of_rides = n()) %>%
arrange(member_casual, start_month) %>%
ggplot(aes(x = start_month, y = number_of_rides, fill = member_casual)) +
geom_col(position="dodge") +
theme(axis.text.x = element_text(angle = 45), legend.position= c(0.7, 0.85)) +
scale_y_continuous(labels = scales::comma, breaks=c(0, 50000, 100000, 150000, 200000, 250000, 300000)) +
labs(title="Number of Rides by Month", x = "Month", y = "Number of Rides", fill="Member Type")+
scale_fill_discrete(labels = c("Casual Riders", "Cyclistic Members"))+
scale_x_discrete(labels = c( "April 20", "May 20", "June 20", "July 20", "August 20", "September 20", "October 20", "November 20", "December 20", "January 21", "Febuary 21", "March 21"))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
Since the data we are using is from April 2020 to March 2021, our bar graph is organized as such.
From this graph, we can see that the program is popular for both casual riders and Cyclistic members during the peak summer months of July and August, with a slight decline during the spring and fall months for casual riders. However, there is a large decline in users during the winter months (November to February) for both casual riders and Cyclistic members. The drop in usage for Cyclistic members is not as great as for casual riders during the winter months.
Plot 4 will show the number of rides that started every hour for each day of the week.
hourly_rides <- clean_data_3 %>%
mutate(start_hour = hour(started_at)) %>%
group_by(member_casual, start_hour, start_day_of_week)
hourly_rides %>%
summarize(number_of_rides = n()) %>%
arrange(member_casual, start_hour) %>%
ggplot(aes(x=start_hour, y = number_of_rides, fill = member_casual))+
geom_col(position="dodge")+
facet_wrap(~start_day_of_week)+
labs(title = "Number of Rides Started Every Hour", x = "Time Ride Started (24Hr)", y = "Number of Rides", fill = "Member Type") +
scale_fill_discrete(labels = c("Casual Riders", "Cyclistic Members"))+
scale_y_continuous(labels = scales::comma, breaks=c(0, 10000, 20000, 30000, 40000))+
scale_x_continuous(breaks=c(0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22)) +
theme(axis.text.x = element_text(angle = 90), legend.position= c(0.7, 0.15))
## `summarise()` has grouped output by 'member_casual', 'start_hour'. You can
## override using the `.groups` argument.
Plot 4 illustrates that the number of rides started at each hour on the weekends is similar between casual riders and Cyclistic members. This shows that the trends between casual riders and Cyclistic members for leisure purposes are similar. However, during the weekdays, peaks between 7 AM (7H) to 9 AM (9H) and 4 PM (16H) to 6 PM (18H) clearly show that many Cyclistic members use the program to commute to and from work. Comparatively, casual riders tend to have a gradually increasing trend of riders as the day progresses with small peaks during commute hours on Tuesdays, Wednesdays, Thursdays and Fridays.
This next visualization shows the average duration of rides for each month.
monthly_rides %>%
group_by(member_casual, start_month) %>%
summarize(average_duration=mean(trip_duration)) %>%
ggplot(aes(x = start_month, y = average_duration, fill = member_casual))+
geom_col(position="dodge")+
labs(title="Average Duration for Each Month", x = "Month", y = "Average Duration (Minutes)", fill = "Member Type") +
theme(axis.text.x = element_text(angle = 45))+
scale_fill_discrete(labels = c("Casual Riders", "Cyclistic Members"))+
scale_x_discrete(labels = c("April 20", "May 20", "June 20", "July 20", "August 20", "September 20", "October 20", "November 20", "December 20", "January 21", "Febuary 21", "March 21"))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
According to this graph, on average, casual riders will borrow a bicycle for longer periods than Cyclistic members. Casual riders have an average trip duration of roughly 25 minutes to 45 minutes throughout the year. During the winter months, the average rides for casual riders drops to less than 25 minutes, with summer rides ranging from 40 to 45 minutes. While Cyclistic members have an average trip duration ranging from 12 minutes to 22 minutes. This could be due to the costs of single passes. Cyclistic members, who have an unlimited number of rentals throughout the year, could return bikes in between stops and errands, while casual riders will need to pay every time they rent and return. This could lead to casual riders keeping the bicycle and not returning it to a docking station between errands.
Similar to the last graph, this bar graph will show the average duration of rides for each day of the week.
clean_data_3 %>%
group_by(member_casual, start_day_of_week) %>%
summarize(number_of_rides = n(), average_duration = mean(trip_duration)) %>%
arrange(number_of_rides, start_day_of_week) %>%
ggplot(aes(x = start_day_of_week, y = average_duration, fill = member_casual))+
geom_col(position= "dodge")+
labs(title="Average Duration for Each Day", x = "Day of the Week", y = "Average Duration (Minutes)", fill = "Member Type")+
scale_fill_discrete(labels = c("Casual Riders", "Cyclistic Members"))+
scale_y_continuous(breaks = c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50))+
theme(axis.text.x = element_text(angle = 45))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
Similar to Plot 5, Plot 6 also shows a greater average duration of rides for casual riders for each day of the week. During the weekend, on average casual riders have leisure rides for greater than 40 minutes and around 33 to 42 minutes during the week. For Cyclistic members, the average duration of rides ranges between 15 to 18 minutes.
In plot 7, we can see the average duration of rides that started in each hour for each day of the week.
hourly_rides %>%
group_by(member_casual, start_hour, start_day_of_week) %>%
summarize(average_duration = mean(trip_duration)) %>%
arrange(member_casual, start_hour) %>%
ggplot(aes(x=start_hour, y = average_duration, fill = member_casual))+
geom_col(position="dodge") +
facet_wrap(~start_day_of_week)+
labs(title = "Average Duration Every Hour", x = "Time Ride Started (24Hr)", y = "Average Duration (Minutes)", fill = "Member Type") +
scale_fill_discrete(labels = c("Casual Riders", "Cyclistic Members"))+
scale_y_continuous(breaks = c(0, 5, 10, 15, 20, 25, 30))+
scale_x_continuous(breaks=c(0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22)) +
theme(axis.text.x = element_text(angle = 90), legend.position= c(0.7, 0.15))
## `summarise()` has grouped output by 'member_casual', 'start_hour'. You can
## override using the `.groups` argument.
This series of bar graphs further shows the consistency of uses for Cyclistic members throughout weekdays. The average duration of each ride is between 15 to 20 minutes.
For casual riders, there are fluctuations between the average duration for each day. On the weekdays, there seems to be a shorter average duration during dawn and early mornings with longer average duration rides every day around 10 AM to 2 PM.
In summary, Plot 1, Plot 2 and Plot 4 show that there is a similar number of users between casual riders and Cyclistic members during weekends. Not only are the number of riders similar, but the number of rides that start at each hour is also very similar. This shows that the leisure uses for the program between the casual riders and Cyclistic members are probably similar. However, the program is much more popular for Cyclistic members during the weekdays compared to casual riders. In plot 4, it is evident that a majority of the Cyclistic members are using the program to commute to and from work as the number of riders from 7 AM to 9 AM and 4 PM to 6 PM is greater than at any other time. There are still some casual riders who use the company’s services for commuting purposes. This can be seen from the slight increase in the number of rides for casual riders during commute hours on Tuesdays, Wednesdays, Thursdays and Fridays.
Plot 3 clearly illustrates the effects of weather on the number of riders each month. As expected, there are fewer riders during the winter from November to February. With peak usage during the summer months from July and August.
From Plot 5, Plot 6 and Plot 7, it seems like there is a higher average ride duration for casual riders above 30 minutes. While Cyclistic members typically only use the program for 15 to 20 minutes at a time. As mentioned earlier, this could be due to casual riders being more cost-efficient. Instead of returning their bicycles to docking stations between stops or errands, they will keep the bicycle with them causing the average duration to increase. Conversely, Cyclistic members who do not have to worry about the cost per ride, are free to return the bicycle in between errands, keeping the average duration lower.