This case study will use all the skills I developed in the Google Data Analytics Professional Certificate Course to complete the business tasks as a data analyst working for the fictional bike-sharing company Cyclistic. I will use the data analysis process of Ask, Prepare, Process, Analyze, Share and Act to answer key business questions so that the company can make data-driven decisions.
The Director of Marketing, Lily Moreno, believes the Cyclistic’s future success depends on maximizing the number of annual memberships. Therefore, the team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, the team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve the recommendations, so they must be backed up with compelling data insights and professional data visualizations.
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Moreno believes that maximizing the number of annual members will be key to future growth and she believes there is a very good chance to convert casual riders into members.
Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand the Cyclistic historical bike trip data to identify trends.
Analyze Cyclistic’s August 2022 through July 2023 trip data to better understand how annual members and casual riders differ, then use insights to assist with marketing strategies aimed at converting casual riders into annual members.
Lily Moreno: Director of Marketing. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy.
Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.
Trip data stored by Cyclistic goes back many years, but we will use data available for the last 12 months (August 2022 to July 2023). The data is broken up monthly and stored as 12 .zip files:
The data has been made available by Motivate International Inc. under this license.
The 12 .zip files were downloaded, stored, and extracted into the /Cyclistic/raw_trip_data/ directory.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.3.1
#library(hms)
setwd("D:/R/Cyclistic")
data_202208 <- read_csv("raw_trip_data/202208-divvy-tripdata.csv", show_col_types = FALSE)
data_202209 <- read_csv("raw_trip_data/202209-divvy-publictripdata.csv", show_col_types = FALSE)
data_202210 <- read_csv("raw_trip_data/202210-divvy-tripdata.csv", show_col_types = FALSE)
data_202211 <- read_csv("raw_trip_data/202211-divvy-tripdata.csv", show_col_types = FALSE)
data_202212 <- read_csv("raw_trip_data/202212-divvy-tripdata.csv", show_col_types = FALSE)
data_202301 <- read_csv("raw_trip_data/202301-divvy-tripdata.csv", show_col_types = FALSE)
data_202302 <- read_csv("raw_trip_data/202302-divvy-tripdata.csv", show_col_types = FALSE)
data_202303 <- read_csv("raw_trip_data/202303-divvy-tripdata.csv", show_col_types = FALSE)
data_202304 <- read_csv("raw_trip_data/202304-divvy-tripdata.csv", show_col_types = FALSE)
data_202305 <- read_csv("raw_trip_data/202305-divvy-tripdata.csv", show_col_types = FALSE)
data_202306 <- read_csv("raw_trip_data/202306-divvy-tripdata.csv", show_col_types = FALSE)
data_202307 <- read_csv("raw_trip_data/202307-divvy-tripdata.csv", show_col_types = FALSE)
compare_df_cols(data_202208, data_202209, data_202210, data_202211, data_202212, data_202301, data_202302, data_202303, data_202304, data_202305, data_202306, data_202307)
## column_name data_202208 data_202209 data_202210
## 1 end_lat numeric numeric numeric
## 2 end_lng numeric numeric numeric
## 3 end_station_id character character character
## 4 end_station_name character character character
## 5 ended_at POSIXct, POSIXt POSIXct, POSIXt POSIXct, POSIXt
## 6 member_casual character character character
## 7 ride_id character character character
## 8 rideable_type character character character
## 9 start_lat numeric numeric numeric
## 10 start_lng numeric numeric numeric
## 11 start_station_id character character character
## 12 start_station_name character character character
## 13 started_at POSIXct, POSIXt POSIXct, POSIXt POSIXct, POSIXt
## data_202211 data_202212 data_202301 data_202302
## 1 numeric numeric numeric numeric
## 2 numeric numeric numeric numeric
## 3 character character character character
## 4 character character character character
## 5 POSIXct, POSIXt POSIXct, POSIXt POSIXct, POSIXt POSIXct, POSIXt
## 6 character character character character
## 7 character character character character
## 8 character character character character
## 9 numeric numeric numeric numeric
## 10 numeric numeric numeric numeric
## 11 character character character character
## 12 character character character character
## 13 POSIXct, POSIXt POSIXct, POSIXt POSIXct, POSIXt POSIXct, POSIXt
## data_202303 data_202304 data_202305 data_202306
## 1 numeric numeric numeric numeric
## 2 numeric numeric numeric numeric
## 3 character character character character
## 4 character character character character
## 5 POSIXct, POSIXt POSIXct, POSIXt POSIXct, POSIXt POSIXct, POSIXt
## 6 character character character character
## 7 character character character character
## 8 character character character character
## 9 numeric numeric numeric numeric
## 10 numeric numeric numeric numeric
## 11 character character character character
## 12 character character character character
## 13 POSIXct, POSIXt POSIXct, POSIXt POSIXct, POSIXt POSIXct, POSIXt
## data_202307
## 1 numeric
## 2 numeric
## 3 character
## 4 character
## 5 POSIXct, POSIXt
## 6 character
## 7 character
## 8 character
## 9 numeric
## 10 numeric
## 11 character
## 12 character
## 13 POSIXct, POSIXt
data_trip <- rbind(data_202208, data_202209, data_202210, data_202211, data_202212, data_202301, data_202302, data_202303, data_202304, data_202305, data_202306, data_202307)
#data_trip <- data_202208
rm(data_202208, data_202209, data_202210, data_202211, data_202212, data_202301, data_202302, data_202303, data_202304, data_202305, data_202306, data_202307)
head(data_trip)
## # A tibble: 6 × 13
## ride_id rideable_type started_at ended_at
## <chr> <chr> <dttm> <dttm>
## 1 550CF7EFEAE0C618 electric_bike 2022-08-07 21:34:15 2022-08-07 21:41:46
## 2 DAD198F405F9C5F5 electric_bike 2022-08-08 14:39:21 2022-08-08 14:53:23
## 3 E6F2BC47B65CB7FD electric_bike 2022-08-08 15:29:50 2022-08-08 15:40:34
## 4 F597830181C2E13C electric_bike 2022-08-08 02:43:50 2022-08-08 02:58:53
## 5 0CE689BB4E313E8D electric_bike 2022-08-07 20:24:06 2022-08-07 20:29:58
## 6 BFA7E7CC69860C20 electric_bike 2022-08-08 13:06:08 2022-08-08 13:19:09
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
str(data_trip)
## spc_tbl_ [5,723,606 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5723606] "550CF7EFEAE0C618" "DAD198F405F9C5F5" "E6F2BC47B65CB7FD" "F597830181C2E13C" ...
## $ rideable_type : chr [1:5723606] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:5723606], format: "2022-08-07 21:34:15" "2022-08-08 14:39:21" ...
## $ ended_at : POSIXct[1:5723606], format: "2022-08-07 21:41:46" "2022-08-08 14:53:23" ...
## $ start_station_name: chr [1:5723606] NA NA NA NA ...
## $ start_station_id : chr [1:5723606] NA NA NA NA ...
## $ end_station_name : chr [1:5723606] NA NA NA NA ...
## $ end_station_id : chr [1:5723606] NA NA NA NA ...
## $ start_lat : num [1:5723606] 41.9 41.9 42 41.9 41.9 ...
## $ start_lng : num [1:5723606] -87.7 -87.6 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:5723606] 41.9 41.9 42 42 41.8 ...
## $ end_lng : num [1:5723606] -87.7 -87.6 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:5723606] "casual" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
data_trip <- data_trip %>%
select(-c(start_station_name, start_station_id, end_station_name, end_station_id))
sum(is.na(data_trip))
## [1] 12204
data_trip <- na.omit(data_trip)
#remove comment tag later
#duplicate_rows <- data_trip[duplicated(data_trip), ]
#str(duplicate_rows)
#rm(duplicate_rows)
data_trip$ride_length <- as.numeric(as.character(round(difftime(data_trip$ended_at, data_trip$started_at, units="mins"), 2)))
#data_trip$ride_length <- as_hms(difftime(data_trip$ended_at, data_trip$started_at))
str(data_trip)
## tibble [5,717,504 × 10] (S3: tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5717504] "550CF7EFEAE0C618" "DAD198F405F9C5F5" "E6F2BC47B65CB7FD" "F597830181C2E13C" ...
## $ rideable_type: chr [1:5717504] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:5717504], format: "2022-08-07 21:34:15" "2022-08-08 14:39:21" ...
## $ ended_at : POSIXct[1:5717504], format: "2022-08-07 21:41:46" "2022-08-08 14:53:23" ...
## $ start_lat : num [1:5717504] 41.9 41.9 42 41.9 41.9 ...
## $ start_lng : num [1:5717504] -87.7 -87.6 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:5717504] 41.9 41.9 42 42 41.8 ...
## $ end_lng : num [1:5717504] -87.7 -87.6 -87.7 -87.7 -87.7 ...
## $ member_casual: chr [1:5717504] "casual" "casual" "casual" "casual" ...
## $ ride_length : num [1:5717504] 7.52 14.03 10.73 15.05 5.87 ...
## - attr(*, "na.action")= 'omit' Named int [1:6102] 8293 17483 84582 84608 84788 84820 85004 85043 85351 85564 ...
## ..- attr(*, "names")= chr [1:6102] "8293" "17483" "84582" "84608" ...
negative_rows <- data_trip[data_trip$ride_length<=0, ]
str(negative_rows)
## tibble [715 × 10] (S3: tbl_df/tbl/data.frame)
## $ ride_id : chr [1:715] "162024EA2A5F23AC" "8B5EDA7571220F96" "09B71177D92C0DC5" "968C67D0CE946110" ...
## $ rideable_type: chr [1:715] "electric_bike" "electric_bike" "classic_bike" "electric_bike" ...
## $ started_at : POSIXct[1:715], format: "2022-08-18 18:16:28" "2022-08-05 18:28:39" ...
## $ ended_at : POSIXct[1:715], format: "2022-08-18 18:16:28" "2022-08-05 18:28:39" ...
## $ start_lat : num [1:715] 41.8 42 41.9 42 41.9 ...
## $ start_lng : num [1:715] -87.6 -87.7 -87.6 -87.7 -87.6 ...
## $ end_lat : num [1:715] 41.8 42 41.9 42 41.9 ...
## $ end_lng : num [1:715] -87.6 -87.7 -87.6 -87.7 -87.6 ...
## $ member_casual: chr [1:715] "member" "member" "casual" "casual" ...
## $ ride_length : num [1:715] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "na.action")= 'omit' Named int [1:6102] 8293 17483 84582 84608 84788 84820 85004 85043 85351 85564 ...
## ..- attr(*, "names")= chr [1:6102] "8293" "17483" "84582" "84608" ...
rm(negative_rows)
data_trip <- data_trip[!(data_trip$ride_length<=0), ]
str(data_trip)
## tibble [5,716,789 × 10] (S3: tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5716789] "550CF7EFEAE0C618" "DAD198F405F9C5F5" "E6F2BC47B65CB7FD" "F597830181C2E13C" ...
## $ rideable_type: chr [1:5716789] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:5716789], format: "2022-08-07 21:34:15" "2022-08-08 14:39:21" ...
## $ ended_at : POSIXct[1:5716789], format: "2022-08-07 21:41:46" "2022-08-08 14:53:23" ...
## $ start_lat : num [1:5716789] 41.9 41.9 42 41.9 41.9 ...
## $ start_lng : num [1:5716789] -87.7 -87.6 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:5716789] 41.9 41.9 42 42 41.8 ...
## $ end_lng : num [1:5716789] -87.7 -87.6 -87.7 -87.7 -87.7 ...
## $ member_casual: chr [1:5716789] "casual" "casual" "casual" "casual" ...
## $ ride_length : num [1:5716789] 7.52 14.03 10.73 15.05 5.87 ...
## - attr(*, "na.action")= 'omit' Named int [1:6102] 8293 17483 84582 84608 84788 84820 85004 85043 85351 85564 ...
## ..- attr(*, "names")= chr [1:6102] "8293" "17483" "84582" "84608" ...
summary(time_length(data_trip$ride_length), unit = "mins" )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02 5.45 9.58 15.17 17.05 12136.30
newdata <- data_trip[order(-data_trip$ride_length), ]
head(newdata$ride_length, n = 20)
## [1] 12136.30 11152.27 8243.80 4903.12 4848.35 3818.40 3245.78 2719.93
## [9] 2457.85 2349.35 1767.38 1499.93 1499.93 1499.93 1499.92 1499.92
## [17] 1499.92 1499.92 1499.92 1499.90
It is possible the rider did not return the bike because they needed it for that length of time so we will leave those values in place.
remove(newdata)
data_trip$date <- as.Date(data_trip$started_at)
data_trip$year <- format(as.Date(data_trip$date), "%Y")
data_trip$month <- format(as.Date(data_trip$date), "%m")
data_trip$day <- format(as.Date(data_trip$date), "%d")
data_trip$day_of_week <- format(as.Date(data_trip$date), "%A")
data_trip$hour_started <- hour(data_trip$started_at)
data_trip$coords_start <- paste(data_trip$start_lat, data_trip$start_lng, sep=", ")
data_trip$coords_end <- paste(data_trip$end_lat, data_trip$end_lng, sep=", ")
str(data_trip)
## tibble [5,716,789 × 18] (S3: tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5716789] "550CF7EFEAE0C618" "DAD198F405F9C5F5" "E6F2BC47B65CB7FD" "F597830181C2E13C" ...
## $ rideable_type: chr [1:5716789] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:5716789], format: "2022-08-07 21:34:15" "2022-08-08 14:39:21" ...
## $ ended_at : POSIXct[1:5716789], format: "2022-08-07 21:41:46" "2022-08-08 14:53:23" ...
## $ start_lat : num [1:5716789] 41.9 41.9 42 41.9 41.9 ...
## $ start_lng : num [1:5716789] -87.7 -87.6 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:5716789] 41.9 41.9 42 42 41.8 ...
## $ end_lng : num [1:5716789] -87.7 -87.6 -87.7 -87.7 -87.7 ...
## $ member_casual: chr [1:5716789] "casual" "casual" "casual" "casual" ...
## $ ride_length : num [1:5716789] 7.52 14.03 10.73 15.05 5.87 ...
## $ date : Date[1:5716789], format: "2022-08-07" "2022-08-08" ...
## $ year : chr [1:5716789] "2022" "2022" "2022" "2022" ...
## $ month : chr [1:5716789] "08" "08" "08" "08" ...
## $ day : chr [1:5716789] "07" "08" "08" "08" ...
## $ day_of_week : chr [1:5716789] "Sunday" "Monday" "Monday" "Monday" ...
## $ hour_started : int [1:5716789] 21 14 15 2 20 13 14 20 21 23 ...
## $ coords_start : chr [1:5716789] "41.93, -87.69" "41.89, -87.64" "41.97, -87.69" "41.94, -87.65" ...
## $ coords_end : chr [1:5716789] "41.94, -87.72" "41.92, -87.64" "41.97, -87.66" "41.97, -87.69" ...
## - attr(*, "na.action")= 'omit' Named int [1:6102] 8293 17483 84582 84608 84788 84820 85004 85043 85351 85564 ...
## ..- attr(*, "names")= chr [1:6102] "8293" "17483" "84582" "84608" ...
data_trip %>%
group_by(member_casual) %>%
summarize(total_count = n())
## # A tibble: 2 × 2
## member_casual total_count
## <chr> <int>
## 1 casual 2163955
## 2 member 3552834
data_trip %>%
group_by(rideable_type) %>%
summarize(total_count = n())
## # A tibble: 3 × 2
## rideable_type total_count
## <chr> <int>
## 1 classic_bike 2480891
## 2 docked_bike 126089
## 3 electric_bike 3109809
data_trip %>%
group_by(member_casual, rideable_type) %>%
summarize(total_count = n())
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 5 × 3
## # Groups: member_casual [2]
## member_casual rideable_type total_count
## <chr> <chr> <int>
## 1 casual classic_bike 788656
## 2 casual docked_bike 126089
## 3 casual electric_bike 1249210
## 4 member classic_bike 1692235
## 5 member electric_bike 1860599
data_trip %>%
group_by(month, year) %>%
summarize(total_count = n()) %>%
arrange(year)
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
## # A tibble: 12 × 3
## # Groups: month [12]
## month year total_count
## <chr> <chr> <int>
## 1 08 2022 785012
## 2 09 2022 700555
## 3 10 2022 558145
## 4 11 2022 337447
## 5 12 2022 181664
## 6 01 2023 190166
## 7 02 2023 190320
## 8 03 2023 258476
## 9 04 2023 426119
## 10 05 2023 604046
## 11 06 2023 718656
## 12 07 2023 766183
data_trip$day_of_week <- factor(data_trip$day_of_week,
levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
data_trip %>%
group_by(day_of_week) %>%
summarize(total_count = n())
## # A tibble: 7 × 2
## day_of_week total_count
## <fct> <int>
## 1 Sunday 717947
## 2 Monday 758647
## 3 Tuesday 808072
## 4 Wednesday 827579
## 5 Thursday 858326
## 6 Friday 850829
## 7 Saturday 895389
data_trip$day_of_week <- factor(data_trip$day_of_week,
levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
data_trip %>%
group_by(member_casual, day_of_week) %>%
summarize(total_count = n())
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 3
## # Groups: member_casual [2]
## member_casual day_of_week total_count
## <chr> <fct> <int>
## 1 casual Sunday 331577
## 2 casual Monday 257332
## 3 casual Tuesday 256287
## 4 casual Wednesday 261827
## 5 casual Thursday 288609
## 6 casual Friday 334276
## 7 casual Saturday 434047
## 8 member Sunday 386370
## 9 member Monday 501315
## 10 member Tuesday 551785
## 11 member Wednesday 565752
## 12 member Thursday 569717
## 13 member Friday 516553
## 14 member Saturday 461342
data_trip %>%
group_by(member_casual, hour_started) %>%
summarize(total_count = n()) %>%
as_tibble() %>%
print(n=48)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 48 × 3
## member_casual hour_started total_count
## <chr> <int> <int>
## 1 casual 0 40834
## 2 casual 1 26474
## 3 casual 2 15992
## 4 casual 3 9082
## 5 casual 4 6411
## 6 casual 5 12070
## 7 casual 6 31840
## 8 casual 7 53988
## 9 casual 8 72378
## 10 casual 9 71657
## 11 casual 10 88846
## 12 casual 11 114264
## 13 casual 12 135132
## 14 casual 13 140860
## 15 casual 14 149734
## 16 casual 15 167895
## 17 casual 16 189782
## 18 casual 17 209310
## 19 casual 18 182695
## 20 casual 19 136144
## 21 casual 20 97979
## 22 casual 21 83217
## 23 casual 22 73106
## 24 casual 23 54265
## 25 member 0 36122
## 26 member 1 21990
## 27 member 2 12667
## 28 member 3 7989
## 29 member 4 8751
## 30 member 5 32852
## 31 member 6 101195
## 32 member 7 185982
## 33 member 8 231075
## 34 member 9 157495
## 35 member 10 144358
## 36 member 11 170785
## 37 member 12 194904
## 38 member 13 193966
## 39 member 14 195483
## 40 member 15 238754
## 41 member 16 317839
## 42 member 17 375461
## 43 member 18 299963
## 44 member 19 215343
## 45 member 20 149373
## 46 member 21 116502
## 47 member 22 87312
## 48 member 23 56673
data_trip %>%
group_by(member_casual) %>%
summarize(avg_ride_length = mean(ride_length))
## # A tibble: 2 × 2
## member_casual avg_ride_length
## <chr> <dbl>
## 1 casual 20.3
## 2 member 12.0
data_trip %>%
group_by(rideable_type) %>%
summarize(avg_ride_length = mean(ride_length))
## # A tibble: 3 × 2
## rideable_type avg_ride_length
## <chr> <dbl>
## 1 classic_bike 16.7
## 2 docked_bike 49.4
## 3 electric_bike 12.5
data_trip %>%
group_by(member_casual, rideable_type) %>%
summarize(avg_ride_length = mean(ride_length))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 5 × 3
## # Groups: member_casual [2]
## member_casual rideable_type avg_ride_length
## <chr> <chr> <dbl>
## 1 casual classic_bike 24.6
## 2 casual docked_bike 49.4
## 3 casual electric_bike 14.7
## 4 member classic_bike 13.0
## 5 member electric_bike 11.1
data_trip %>%
group_by(member_casual, day_of_week) %>%
summarize(avg_ride_length = mean(ride_length))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 3
## # Groups: member_casual [2]
## member_casual day_of_week avg_ride_length
## <chr> <fct> <dbl>
## 1 casual Sunday 23.2
## 2 casual Monday 20.1
## 3 casual Tuesday 18.4
## 4 casual Wednesday 17.4
## 5 casual Thursday 18.0
## 6 casual Friday 19.8
## 7 casual Saturday 23.0
## 8 member Sunday 13.3
## 9 member Monday 11.5
## 10 member Tuesday 11.6
## 11 member Wednesday 11.5
## 12 member Thursday 11.6
## 13 member Friday 12.0
## 14 member Saturday 13.4
data_trip %>%
group_by(member_casual, year, month) %>%
summarize(avg_ride_length = mean(ride_length)) %>%
as_tibble() %>%
print(n=24)
## `summarise()` has grouped output by 'member_casual', 'year'. You can override
## using the `.groups` argument.
## # A tibble: 24 × 4
## member_casual year month avg_ride_length
## <chr> <chr> <chr> <dbl>
## 1 casual 2022 08 21.4
## 2 casual 2022 09 20.0
## 3 casual 2022 10 18.4
## 4 casual 2022 11 15.5
## 5 casual 2022 12 13.4
## 6 casual 2023 01 13.6
## 7 casual 2023 02 15.9
## 8 casual 2023 03 15.2
## 9 casual 2023 04 20.4
## 10 casual 2023 05 22.0
## 11 casual 2023 06 21.7
## 12 casual 2023 07 22.7
## 13 member 2022 08 13.1
## 14 member 2022 09 12.6
## 15 member 2022 10 11.5
## 16 member 2022 11 10.9
## 17 member 2022 12 10.4
## 18 member 2023 01 10.1
## 19 member 2023 02 10.5
## 20 member 2023 03 10.2
## 21 member 2023 04 11.5
## 22 member 2023 05 12.6
## 23 member 2023 06 12.9
## 24 member 2023 07 13.2
#define function to calculate mode
find_mode <- function(x) {
u <- unique(x)
tab <- tabulate(match(x, u))
u[tab == max(tab)]
}
#find mode for ride length for day of the week
data_trip %>%
group_by(member_casual, day_of_week) %>%
summarize(mode_ride_length = find_mode(round(ride_length)))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 3
## # Groups: member_casual [2]
## member_casual day_of_week mode_ride_length
## <chr> <fct> <dbl>
## 1 casual Sunday 8
## 2 casual Monday 6
## 3 casual Tuesday 6
## 4 casual Wednesday 6
## 5 casual Thursday 6
## 6 casual Friday 6
## 7 casual Saturday 8
## 8 member Sunday 4
## 9 member Monday 4
## 10 member Tuesday 4
## 11 member Wednesday 4
## 12 member Thursday 4
## 13 member Friday 4
## 14 member Saturday 4
rm(find_mode)
data_trip_start <- data_trip %>%
group_by(coords_start, start_lat, start_lng) %>%
summarize(total_count = n()) %>%
arrange(desc(total_count)) %>%
as_tibble() %>%
print(n=20)
## `summarise()` has grouped output by 'coords_start', 'start_lat'. You can
## override using the `.groups` argument.
## # A tibble: 2,155,237 × 4
## coords_start start_lat start_lng total_count
## <chr> <dbl> <dbl> <int>
## 1 41.892278, -87.612043 41.9 -87.6 45797
## 2 41.880958, -87.616743 41.9 -87.6 27667
## 3 41.911722, -87.626804 41.9 -87.6 24418
## 4 41.89, -87.65 41.9 -87.6 24242
## 5 41.89, -87.63 41.9 -87.6 22900
## 6 41.926277, -87.630834 41.9 -87.6 21799
## 7 41.90096039, -87.62377664 41.9 -87.6 21286
## 8 41.902973, -87.63128 41.9 -87.6 21176
## 9 41.94, -87.65 41.9 -87.6 20579
## 10 41.91, -87.63 41.9 -87.6 19772
## 11 41.78509714636, -87.6010727606 41.8 -87.6 19380
## 12 41.93, -87.64 41.9 -87.6 19373
## 13 41.9, -87.63 41.9 -87.6 19322
## 14 41.791478, -87.599861 41.8 -87.6 19201
## 15 41.912133, -87.634656 41.9 -87.6 18972
## 16 41.88917683258, -87.6385057718 41.9 -87.6 18132
## 17 41.88, -87.63 41.9 -87.6 17972
## 18 41.93, -87.65 41.9 -87.6 17373
## 19 41.867888, -87.623041 41.9 -87.6 17249
## 20 41.95, -87.66 42.0 -87.7 16970
## # ℹ 2,155,217 more rows
data_trip_start <- head(data_trip_start, 20)
data_trip_end <- data_trip %>%
group_by(coords_end, end_lat, end_lng) %>%
summarize(total_count = n()) %>%
arrange(desc(total_count)) %>%
as_tibble() %>%
print(n=20)
## `summarise()` has grouped output by 'coords_end', 'end_lat'. You can override
## using the `.groups` argument.
## # A tibble: 15,314 × 4
## coords_end end_lat end_lng total_count
## <chr> <dbl> <dbl> <int>
## 1 41.892278, -87.612043 41.9 -87.6 67800
## 2 41.911722, -87.626804 41.9 -87.6 38663
## 3 41.880958, -87.616743 41.9 -87.6 37933
## 4 41.912133, -87.634656 41.9 -87.6 35926
## 5 41.90096039, -87.62377664 41.9 -87.6 35088
## 6 41.902973, -87.63128 41.9 -87.6 34852
## 7 41.88917683258, -87.6385057718 41.9 -87.6 33248
## 8 41.926277, -87.630834 41.9 -87.6 31026
## 9 41.903222, -87.634324 41.9 -87.6 29912
## 10 41.88338, -87.64117 41.9 -87.6 29750
## 11 41.9375823160063, -87.6440978050232 41.9 -87.6 29657
## 12 41.8810317, -87.62408432 41.9 -87.6 28222
## 13 41.918306, -87.636282 41.9 -87.6 27780
## 14 41.791478, -87.599861 41.8 -87.6 27034
## 15 41.915689, -87.6346 41.9 -87.6 26056
## 16 41.891466, -87.626761 41.9 -87.6 25839
## 17 41.9402319181086, -87.6529437303543 41.9 -87.7 25679
## 18 41.929546, -87.643118 41.9 -87.6 25482
## 19 41.78509714636, -87.6010727606 41.8 -87.6 25284
## 20 41.893992, -87.629318 41.9 -87.6 24931
## # ℹ 15,294 more rows
data_trip_end <- head(data_trip_end, 20)
Ridership was 38% and 62% for casual and member riders, respectively.
Total trips:
Average ride length:
Based on analysis and the question of how members and casual riders use Cyclistic bikes differently, the following recommendations will help the marketing team convert casual riders into members:
The coordinates for start and end stations were not used this time per guidelines. The data can be analyzed in the future to develop strategic marketing to target hotspots.