Bellabeat is a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but it has the potential to become a larger player in the global smart device market. Analyzing smart device fitness data could help unlock new growth opportunities for the company.
The business task is to analyze Fitbit smart device usage data in order to gain insight into how consumers currently use their smart devices and to discover underlying trends that can be applied to Bellabeat products. This insight could help unlock new growth opportunities and inform the best marketing strategy to meet business growth goals.
We have used a public dataset that explores smart device users’ daily habits.
Name: FitBit Fitness Tracker Data (CC0 Public Domain) Link: https://www.kaggle.com/arashnic/fitbit
"This Kaggle data set contains personal fitness tracker from thirty fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.
These datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. "
Overall, this data may not be appropriate for conducting a full detailed analysis to meet the goal of the business. However, we will move forward with the data to uncover any insights.
Ideally, we would have a larger sample that is more representative of the whole population, at least 6 months of data from a more recent year, and gender information so that we can gain insight into trends specific to women.
This is public data. No data was identified as sensitive, so there is no need to anonymize or de-identify any of it, and no need to encrypt it.
Packages only need to be installed once. Reinstalling on every run is unnecessary and can cause problems if a package is already loaded, so the install calls are commented out in this document.
# install.packages("tidyverse")
# install.packages("skimr")
# install.packages("janitor")
# install.packages("dplyr")
# install.packages("readr")
# install.packages("plotly")
# install.packages("date")
# install.packages("ggplot2")
# install.packages("formattable")
# install.packages("gghighlight")
library(tidyverse)
library(readr)
library(skimr)
library(janitor)
library(dplyr)
library(plotly)
library(date)
library(lubridate)
library(ggplot2)
library(formattable)
library(gghighlight)
dailyActivity <- read_csv("C:/Users/Brandi/Documents/Bellabeat Case Study/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
sleepDay <- read_csv("C:/Users/Brandi/Documents/Bellabeat Case Study/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
weightLogInfo <- read_csv("C:/Users/Brandi/Documents/Bellabeat Case Study/fitbit/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
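The three read_csv() calls above repeat one long absolute path. A minimal sketch of one way to reduce that repetition, assuming all files sit in a single folder and follow the `<name>_merged.csv` pattern seen above. `data_dir` and `fitbit_path()` are hypothetical names introduced here for illustration, not part of the original analysis:

```r
# Hypothetical: folder holding the Fitabase CSV files; adjust to your machine.
data_dir <- file.path("fitbit", "Fitabase Data 4.12.16-5.12.16")

# Hypothetical helper: build the full path for one "<name>_merged.csv" file.
fitbit_path <- function(name) {
  file.path(data_dir, paste0(name, "_merged.csv"))
}

# Path for the sleep file; the result can be passed straight to read_csv().
fitbit_path("sleepDay")
```

With this in place, moving the data only requires changing `data_dir` in one spot.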
glimpse(dailyActivity)
## Rows: 940
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ ActivityDate <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/~
## $ TotalSteps <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019~
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3~
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0~
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveMinutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4~
## $ FairlyActiveMinutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21~
## $ LightlyActiveMinutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, ~
## $ SedentaryMinutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818~
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203~
glimpse(sleepDay)
## Rows: 413
## Columns: 5
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150~
## $ SleepDay <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "~
## $ TotalSleepRecords <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2~
## $ TotalTimeInBed <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3~
glimpse(weightLogInfo)
## Rows: 67
## Columns: 8
## $ Id <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 2873212~
## $ Date <chr> "5/2/2016 11:59:59 PM", "5/3/2016 11:59:59 PM", "4/13/2~
## $ WeightKg <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3, ~
## $ WeightPounds <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159.6~
## $ Fat <dbl> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ BMI <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.25,~
## $ IsManualReport <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, ~
## $ LogId <dbl> 1.462234e+12, 1.462320e+12, 1.460510e+12, 1.461283e+12,~
Summary of Cleaning:
#rename dataframes for easy reference
#cleaned variable names for consistency
updated_daily_activity <- clean_names(dailyActivity)
updated_daily_sleep <- clean_names(sleepDay)
updated_weight_log <- clean_names(weightLogInfo)
#check for missing data (NA/null). results should be 0.
sum(is.na(updated_daily_activity))
## [1] 0
sum(is.na(updated_weight_log)) #has NA data, but no benefit in removing it
## [1] 65
sum(is.na(updated_daily_sleep))
## [1] 0
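#is.null() tests whether the object itself is NULL, not its cells, so these are always 0; is.na() above is the real missing-value check
sum(is.null(updated_daily_activity))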
## [1] 0
sum(is.null(updated_weight_log))
## [1] 0
sum(is.null(updated_daily_sleep))
## [1] 0
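sum(is.na()) returns one total per data frame, which does not show where the gaps are. A small sketch of counting missing values per column instead, run on made-up data (`toy_weight` is a hypothetical stand-in for the weight log). In this dataset, the summary statistics later in the document show all 65 missing values in the fat column:

```r
# Toy data shaped like the weight log (values are illustrative only).
toy_weight <- data.frame(
  weight_kg = c(52.6, 52.6, 133.5),
  fat       = c(22, NA, NA)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy_weight))
na_per_column
```

A per-column view like this makes it easier to decide whether a column should be dropped or simply left alone.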
#check for duplicated data. results should be 0.
sum(duplicated(updated_daily_activity))
## [1] 0
sum(duplicated(updated_weight_log))
## [1] 0
sum(duplicated(updated_daily_sleep)) #duplicates found
## [1] 3
#store and remove duplicates from updated_daily_sleep
sleep_records_dups <- updated_daily_sleep[duplicated(updated_daily_sleep), ]
updated_daily_sleep <- updated_daily_sleep[!duplicated(updated_daily_sleep), ]
#renamed Date variable for consistency and joining
updated_daily_sleep <- rename(updated_daily_sleep,activity_date=sleep_day)
updated_weight_log <- rename(updated_weight_log,activity_date=date)
#convert all Date variables from Character to Date by adding casted_date variable
updated_daily_sleep$casted_date <- as.Date(mdy_hms(updated_daily_sleep$activity_date))
updated_daily_activity$casted_date <- as.Date(mdy(updated_daily_activity$activity_date))
updated_weight_log$casted_date <- as.Date(mdy_hms(updated_weight_log$activity_date))
#arrange by date
updated_daily_sleep <- updated_daily_sleep %>% arrange(casted_date)
updated_daily_activity <- updated_daily_activity %>% arrange(casted_date)
updated_weight_log <- updated_weight_log %>% arrange(casted_date)
#convert id from num to char
updated_daily_activity$id <- as.character(updated_daily_activity$id)
updated_daily_sleep$id <- as.character(updated_daily_sleep$id)
updated_weight_log$id <- as.character(updated_weight_log$id)
#adding Week Day column for categorical data
updated_daily_sleep$day <- weekdays(as.Date(updated_daily_sleep$casted_date))
updated_daily_activity$day <- weekdays(as.Date(updated_daily_activity$casted_date))
updated_weight_log$day <- weekdays(as.Date(updated_weight_log$casted_date))
#adding report type column for categorical data
updated_weight_log$report_type <- ifelse(updated_weight_log$is_manual_report == TRUE, "Manual", "Automated")
#dailyCalories, dailyIntensities, and dailySteps provide no additional value as all data is already merged in dailyActivity, as observed by the set difference analysis
#Comparison of dataframes
dailyCalories <- clean_names(read_csv("C:/Users/Brandi/Documents/Bellabeat Case Study/fitbit/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv"))
dailyIntensities <- clean_names(read_csv("C:/Users/Brandi/Documents/Bellabeat Case Study/fitbit/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv"))
dailySteps <- clean_names(read_csv("C:/Users/Brandi/Documents/Bellabeat Case Study/fitbit/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv"))
calorie_comparison_test <- updated_daily_activity %>%
select(id,activity_date,calories)
intensity_comparison_test <- updated_daily_activity %>%
select(id,activity_date,sedentary_minutes,lightly_active_minutes, fairly_active_minutes,very_active_minutes,sedentary_active_distance,light_active_distance,moderately_active_distance,very_active_distance)
daily_steps_comparison_test <- updated_daily_activity %>%
select(id,activity_date,total_steps)
a1 <- rename(dailyCalories,activity_date=activity_day) %>%
select(id,activity_date,calories)
a2 <- rename(dailyIntensities,activity_date=activity_day) %>%
select(id,activity_date,sedentary_minutes,lightly_active_minutes, fairly_active_minutes,very_active_minutes,sedentary_active_distance,light_active_distance,moderately_active_distance,very_active_distance)
a3 <- rename(dailySteps,activity_date=activity_day, total_steps=step_total) %>%
select(id,activity_date,total_steps)
a1$id <- as.character(a1$id)
a2$id <- as.character(a2$id)
a3$id <- as.character(a3$id)
setdiff(calorie_comparison_test,a1) #0 rows with difference
## # A tibble: 0 x 3
## # ... with 3 variables: id <chr>, activity_date <chr>, calories <dbl>
setdiff(intensity_comparison_test,a2) #0 rows with difference
## # A tibble: 0 x 10
## # ... with 10 variables: id <chr>, activity_date <chr>,
## # sedentary_minutes <dbl>, lightly_active_minutes <dbl>,
## # fairly_active_minutes <dbl>, very_active_minutes <dbl>,
## # sedentary_active_distance <dbl>, light_active_distance <dbl>,
## # moderately_active_distance <dbl>, very_active_distance <dbl>
setdiff(daily_steps_comparison_test,a3) #0 rows with difference
## # A tibble: 0 x 3
## # ... with 3 variables: id <chr>, activity_date <chr>, total_steps <dbl>
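Note that setdiff(x, y) only returns rows of x that are absent from y, so the three checks above run in one direction. A sketch on toy data of checking both directions, assuming dplyr is installed; `merged` and `single` are hypothetical tables, not the real files:

```r
library(dplyr)

# Toy tables (illustrative values): `merged` has one row that `single` lacks.
merged <- data.frame(id = c("1", "2", "3"), calories = c(1985, 1432, 3199))
single <- data.frame(id = c("1", "2"),      calories = c(1985, 1432))

# Rows of `single` that are missing from `merged`: none.
only_in_single <- dplyr::setdiff(single, merged)

# Rows of `merged` that are missing from `single`: the row with id "3".
only_in_merged <- dplyr::setdiff(merged, single)

nrow(only_in_single)
nrow(only_in_merged)
```

Two tables hold the same rows only when both differences are empty.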
Summary of Analysis Process
#quick view of cleaned data
glimpse(updated_daily_activity)
## Rows: 940
## Columns: 17
## $ id <chr> "1503960366", "1624580081", "1644430081", "~
## $ activity_date <chr> "4/12/2016", "4/12/2016", "4/12/2016", "4/1~
## $ total_steps <dbl> 13162, 8163, 10694, 6697, 678, 11875, 4414,~
## $ total_distance <dbl> 8.50, 5.31, 7.77, 4.43, 0.47, 8.34, 2.74, 7~
## $ tracker_distance <dbl> 8.50, 5.31, 7.77, 4.43, 0.47, 8.34, 2.74, 7~
## $ logged_activities_distance <dbl> 0.000000, 0.000000, 0.000000, 0.000000, 0.0~
## $ very_active_distance <dbl> 1.88, 0.00, 0.14, 0.00, 0.00, 3.31, 0.19, 1~
## $ moderately_active_distance <dbl> 0.55, 0.00, 2.30, 0.00, 0.00, 0.77, 0.35, 0~
## $ light_active_distance <dbl> 6.06, 5.31, 5.33, 4.43, 0.47, 4.26, 2.20, 6~
## $ sedentary_active_distance <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0~
## $ very_active_minutes <dbl> 25, 0, 2, 0, 0, 42, 3, 13, 28, 2, 0, 44, 4,~
## $ fairly_active_minutes <dbl> 13, 0, 51, 0, 0, 14, 8, 9, 13, 21, 0, 19, 1~
## $ lightly_active_minutes <dbl> 328, 146, 256, 339, 55, 227, 181, 306, 320,~
## $ sedentary_minutes <dbl> 728, 1294, 1131, 1101, 734, 1157, 706, 1112~
## $ calories <dbl> 1985, 1432, 3199, 2030, 2220, 2390, 1459, 2~
## $ casted_date <date> 2016-04-12, 2016-04-12, 2016-04-12, 2016-0~
## $ day <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday",~
glimpse(updated_daily_sleep)
## Rows: 410
## Columns: 7
## $ id <chr> "1503960366", "1927972279", "2026352035", "397733~
## $ activity_date <chr> "4/12/2016 12:00:00 AM", "4/12/2016 12:00:00 AM",~
## $ total_sleep_records <dbl> 1, 3, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1~
## $ total_minutes_asleep <dbl> 327, 750, 503, 274, 501, 429, 425, 441, 419, 366,~
## $ total_time_in_bed <dbl> 346, 775, 546, 469, 541, 457, 439, 464, 438, 387,~
## $ casted_date <date> 2016-04-12, 2016-04-12, 2016-04-12, 2016-04-12, ~
## $ day <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tues~
glimpse(updated_weight_log)
## Rows: 67
## Columns: 11
## $ id <chr> "6962181067", "8877689391", "1927972279", "6962181067~
## $ activity_date <chr> "4/12/2016 11:59:59 PM", "4/12/2016 6:47:11 AM", "4/1~
## $ weight_kg <dbl> 62.5, 85.8, 133.5, 62.1, 84.9, 61.7, 84.5, 61.5, 62.0~
## $ weight_pounds <dbl> 137.7889, 189.1566, 294.3171, 136.9071, 187.1725, 136~
## $ fat <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 25, NA, NA, N~
## $ bmi <dbl> 24.39, 25.68, 47.54, 24.24, 25.41, 24.10, 25.31, 24.0~
## $ is_manual_report <lgl> TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, T~
## $ log_id <dbl> 1.460506e+12, 1.460444e+12, 1.460510e+12, 1.460592e+1~
## $ casted_date <date> 2016-04-12, 2016-04-12, 2016-04-13, 2016-04-13, 2016~
## $ day <chr> "Tuesday", "Tuesday", "Wednesday", "Wednesday", "Wedn~
## $ report_type <chr> "Manual", "Automated", "Automated", "Manual", "Automa~
#unique participants in each dataframe (activity has 33 ids, more than the thirty users the dataset description mentions)
n_distinct(updated_daily_activity$id)
## [1] 33
n_distinct(updated_daily_sleep$id)
## [1] 24
n_distinct(updated_weight_log$id)
## [1] 8
#Examining Date Ranges. All date ranges are the same
min(updated_daily_activity$casted_date)
## [1] "2016-04-12"
min(updated_daily_sleep$casted_date)
## [1] "2016-04-12"
min(updated_weight_log$casted_date)
## [1] "2016-04-12"
max(updated_daily_activity$casted_date)
## [1] "2016-05-12"
max(updated_daily_sleep$casted_date)
## [1] "2016-05-12"
max(updated_weight_log$casted_date)
## [1] "2016-05-12"
#all show a 30-day difference between first and last date, i.e. 31 calendar days of data
difftime(max(updated_daily_activity$casted_date),min(updated_daily_activity$casted_date), units="days")
## Time difference of 30 days
difftime(max(updated_daily_sleep$casted_date),min(updated_daily_sleep$casted_date), units="days")
## Time difference of 30 days
difftime(max(updated_weight_log$casted_date),min(updated_weight_log$casted_date), units="days")
## Time difference of 30 days
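A difference of 30 days between the first and last date corresponds to 31 calendar days once both endpoints are counted. A short worked check using the minimum and maximum dates reported above (the variable names are introduced here for illustration):

```r
first_day <- as.Date("2016-04-12")
last_day  <- as.Date("2016-05-12")

# Difference between the two dates, as in the difftime() calls above.
day_gap <- as.integer(difftime(last_day, first_day, units = "days"))

# Calendar days covered, counting both the first and the last day.
days_covered <- day_gap + 1

# Same count obtained by listing every date in the range.
length(seq(first_day, last_day, by = "day"))
```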
#merge daily activity + daily sleep, keeping every activity row (all.y = TRUE, a right join)
combined_sleep_activity <- merge(updated_daily_sleep, updated_daily_activity, by = c("id", "casted_date" ), all.y = TRUE)
n_distinct(combined_sleep_activity$id)
## [1] 33
#merge in the weight log, keeping every row of the combined data (all.x = TRUE, a left join)
combined_all_daily <- merge(combined_sleep_activity, updated_weight_log, by = c("id", "casted_date"),all.x = TRUE)
n_distinct(combined_all_daily$id)
## [1] 33
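The all.x and all.y arguments of merge() decide which unmatched rows survive the join. A sketch on toy data (`toy_sleep` and `toy_activity` are hypothetical) contrasting all.y = TRUE, as used above, with a full outer join:

```r
# Toy tables (illustrative values).
toy_sleep    <- data.frame(id = c("A", "B"), minutes_asleep = c(400, 380))
toy_activity <- data.frame(id = c("A", "C"), total_steps = c(9000, 4000))

# all.y = TRUE keeps every row of the second table (right join): ids A and C.
right_join_result <- merge(toy_sleep, toy_activity, by = "id", all.y = TRUE)

# all = TRUE keeps rows from both tables (full outer join): ids A, B and C.
full_join_result <- merge(toy_sleep, toy_activity, by = "id", all = TRUE)

right_join_result$id
full_join_result$id
```

In the right join, id "B" is dropped because it has no activity row, and id "C" gets NA for `minutes_asleep`. This is why the merged tables above still contain exactly the 33 activity participants.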
#ordering days for graph visualizations
updated_daily_sleep$day <- factor(updated_daily_sleep$day,levels =
c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
updated_daily_activity$day <- factor(updated_daily_activity$day,levels =
c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
updated_weight_log$day <- factor(updated_weight_log$day,levels =
c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
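The same level vector is written out three times above. A minimal sketch of defining it once and showing what the ordering changes, using toy input; `weekday_levels` and `toy_days` are hypothetical names introduced here:

```r
# Monday-first ordering, defined once so it can be reused for each data frame.
weekday_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                    "Friday", "Saturday", "Sunday")

# Toy input (illustrative values).
toy_days <- c("Sunday", "Tuesday", "Monday")

ordered_days <- factor(toy_days, levels = weekday_levels)

# Sorting now follows the level order instead of the alphabet.
as.character(sort(ordered_days))
```

One caveat: weekdays() returns day names in the session locale, so these English levels assume an English locale. In another locale the factor() call would produce NA values.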
#Quick summary statistics
updated_daily_activity %>%
select(total_steps,
total_distance,
sedentary_minutes) %>%
summary()
## total_steps total_distance sedentary_minutes
## Min. : 0 Min. : 0.000 Min. : 0.0
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 729.8
## Median : 7406 Median : 5.245 Median :1057.5
## Mean : 7638 Mean : 5.490 Mean : 991.2
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:1229.5
## Max. :36019 Max. :28.030 Max. :1440.0
updated_daily_sleep %>%
select(total_sleep_records,
total_minutes_asleep,
total_time_in_bed) %>%
summary()
## total_sleep_records total_minutes_asleep total_time_in_bed
## Min. :1.00 Min. : 58.0 Min. : 61.0
## 1st Qu.:1.00 1st Qu.:361.0 1st Qu.:403.8
## Median :1.00 Median :432.5 Median :463.0
## Mean :1.12 Mean :419.2 Mean :458.5
## 3rd Qu.:1.00 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :3.00 Max. :796.0 Max. :961.0
updated_weight_log %>%
select(weight_pounds,
bmi,
fat,
is_manual_report) %>%
summary()
## weight_pounds bmi fat is_manual_report
## Min. :116.0 Min. :21.45 Min. :22.00 Mode :logical
## 1st Qu.:135.4 1st Qu.:23.96 1st Qu.:22.75 FALSE:26
## Median :137.8 Median :24.39 Median :23.50 TRUE :41
## Mean :158.8 Mean :25.19 Mean :23.50
## 3rd Qu.:187.5 3rd Qu.:25.56 3rd Qu.:24.25
## Max. :294.3 Max. :47.54 Max. :25.00
## NA's :65
updated_daily_activity %>%
select(total_distance,
very_active_distance,
moderately_active_distance,
light_active_distance,
sedentary_active_distance) %>%
summary()
## total_distance very_active_distance moderately_active_distance
## Min. : 0.000 Min. : 0.000 Min. :0.0000
## 1st Qu.: 2.620 1st Qu.: 0.000 1st Qu.:0.0000
## Median : 5.245 Median : 0.210 Median :0.2400
## Mean : 5.490 Mean : 1.503 Mean :0.5675
## 3rd Qu.: 7.713 3rd Qu.: 2.053 3rd Qu.:0.8000
## Max. :28.030 Max. :21.920 Max. :6.4800
## light_active_distance sedentary_active_distance
## Min. : 0.000 Min. :0.000000
## 1st Qu.: 1.945 1st Qu.:0.000000
## Median : 3.365 Median :0.000000
## Mean : 3.341 Mean :0.001606
## 3rd Qu.: 4.782 3rd Qu.:0.000000
## Max. :10.710 Max. :0.110000
#percent of total records with 0 logged activity distance
((updated_daily_activity %>% filter(logged_activities_distance == 0) %>% count()) / nrow(updated_daily_activity)) %>% percent()
## [1] 96.60%
#percent of total records with 0 steps
((updated_daily_activity %>% filter(total_steps == 0) %>% count()) / nrow(updated_daily_activity)) %>% percent()
## [1] 8.19%
#percent of total records with 0 distance
((updated_daily_activity %>% filter(total_distance == 0) %>% count()) / nrow(updated_daily_activity)) %>% percent()
## [1] 8.30%
#percent of total users who participated in weight logging
(n_distinct(updated_weight_log$id) / n_distinct(updated_daily_activity$id)) %>% percent()
## [1] 24.24%
#percent of total users who participated in sleep logging
(n_distinct(updated_daily_sleep$id) / n_distinct(updated_daily_activity$id)) %>% percent()
## [1] 72.73%
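The participation percentages above can be reproduced by hand from the participant counts reported earlier (33 activity users, 24 sleep users, 8 weight users). A base-R sketch of that arithmetic, offered as an alternative to the formattable-based calls rather than a replacement; `toy_steps` is made-up data:

```r
# Participant counts reported by n_distinct() earlier in this document.
activity_users <- 33
sleep_users    <- 24
weight_users   <- 8

# Share of activity users who also logged weight or sleep, as formatted text.
weight_share <- sprintf("%.2f%%", 100 * weight_users / activity_users)
sleep_share  <- sprintf("%.2f%%", 100 * sleep_users / activity_users)

# For row-level shares, the mean of a logical vector is the proportion of TRUEs.
toy_steps  <- c(0, 13162, 0, 10735)   # illustrative values
zero_share <- mean(toy_steps == 0)
```

Here `weight_share` is "24.24%" and `sleep_share` is "72.73%", matching the output above, and `zero_share` is 0.5 for the toy vector.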
To meet the goal of empowering women with knowledge about their own health and habits, the marketing team can either:
Move forward with recommendations based on the insights gained from the limited data
or
Expand upon the limited data and timeframe by conducting another survey that gathers specific, categorical data from the female target audience, including questions about the specific goals they have for their health and fitness tracking.