Overview of Project

Bellabeat is a high-tech manufacturer of health-focused products for women. It is a successful small company with the potential to become a larger player in the global smart device market. Analyzing smart device fitness data could help unlock new growth opportunities for the company.


Step 1 - Ask

Business Task

The business task is to analyze Fitbit smart device usage data to understand how consumers currently use their smart devices and to identify trends that can be applied to Bellabeat products. These insights could reveal new growth opportunities and inform the marketing strategy needed to meet business growth goals.


Step 2 - Prepare

Description of the Dataset

We have used a public dataset that explores smart device users’ daily habits.

Name: FitBit Fitness Tracker Data (CC0 Public Domain)
Link: https://www.kaggle.com/arashnic/fitbit

"This Kaggle data set contains personal fitness tracker from thirty fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

These datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. "

Limitations

  • The number of users is small, so the sample may not be representative of the population.
  • There are only a few categorical variables available for segmenting the data.
  • The data does not record participants' gender, so we cannot make female-specific observations for the female target audience.
  • Each dataset covers only 4/12/2016 to 5/12/2016, although the dataset description indicates 03/12/2016 to 05/12/2016. We have less data to work with than described, which may weaken the recommendations.
  • The first day of data is Tuesday 4/12/2016 and the last is Thursday 5/12/2016, so Tuesday, Wednesday, and Thursday each occur five times while the other weekdays occur four times. Any weekly analysis could therefore be misleading.
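
The weekday imbalance noted in the last limitation can be checked directly from the date range. The sketch below is base R only. It uses the ISO weekday number from format(..., "%u") (1 = Monday, 7 = Sunday) instead of weekday names, so it does not depend on the locale.

```r
# Every calendar day covered by the data, 4/12/2016 through 5/12/2016
dates <- seq(as.Date("2016-04-12"), as.Date("2016-05-12"), by = "day")

# ISO weekday number: "1" = Monday ... "7" = Sunday (locale-independent)
weekday_counts <- table(format(dates, "%u"))

length(dates)    # number of calendar days in the range
weekday_counts   # occurrences of each weekday
```

The range holds four full weeks plus one extra Tuesday, Wednesday, and Thursday, which is why per-weekday totals are not directly comparable while per-weekday averages are less affected.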

ROCCC Validation of Dataset

  • Reliable - We will assume this data is accurate and reliable.
  • Original - The data is original, as it was gathered directly from users.
  • Comprehensive - The data is not comprehensive, as it lacks details needed for a full analysis that meets the business goal.
  • Current - The data is not current, as it is from 2016.
  • Cited - The data is cited.

Data Summary

Overall, this data may not be appropriate for conducting a full detailed analysis to meet the goal of the business. However, we will move forward with the data to uncover any insights.

Ideally, we would have a sample that is more representative of the whole population, at least 6 months of data from a more recent year, and gender information so that we can gain insight into trends specific to women.

Data Sensitivity

This is public data. No data was identified as sensitive, so no anonymization or de-identification is needed. There is also no need to encrypt the data.

Setup Environment with Packages

Packages only need to be installed once, so the install calls are commented out in this document.

# install.packages("tidyverse")
# install.packages("skimr")
# install.packages("janitor")
# install.packages("dplyr")
# install.packages("readr")
# install.packages("plotly")
# install.packages("date")
# install.packages("lubridate")
# install.packages("ggplot2")
# install.packages("formattable")
# install.packages("gghighlight")
library(tidyverse)
library(readr)
library(skimr)
library(janitor)
library(dplyr)
library(plotly)
library(date)
library(lubridate)
library(ggplot2)
library(formattable)
library(gghighlight)

Import Data

  • The datasets below were judged most useful for this analysis. All other available datasets are out of scope.
    • dailyActivity (daily physical activity)
    • sleepDay (daily sleep activity)
    • weightLogInfo (daily weight log activity)
dailyActivity <- read_csv("C:/Users/Brandi/Documents/Bellabeat Case Study/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
sleepDay <- read_csv("C:/Users/Brandi/Documents/Bellabeat Case Study/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
weightLogInfo <- read_csv("C:/Users/Brandi/Documents/Bellabeat Case Study/fitbit/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
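
Repeating the full path in every read_csv() call makes the script harder to move to another machine. One option, sketched below, is to keep the folder in a single variable and build each file name from it. The names data_dir and fitbit_path() are hypothetical, and the relative folder is only an assumption about where the Kaggle download was unpacked.

```r
# Hypothetical location of the unpacked Kaggle download; adjust as needed
data_dir <- file.path("fitbit", "Fitabase Data 4.12.16-5.12.16")

# Hypothetical helper: build the path to one of the "*_merged.csv" files
fitbit_path <- function(name) {
  file.path(data_dir, paste0(name, "_merged.csv"))
}

fitbit_path("dailyActivity")
# The result can be passed to read_csv(), e.g. read_csv(fitbit_path("sleepDay"))
```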

First Look

glimpse(dailyActivity)
## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ ActivityDate             <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/~
## $ TotalSteps               <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019~
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3~
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0~
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveMinutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4~
## $ FairlyActiveMinutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21~
## $ LightlyActiveMinutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, ~
## $ SedentaryMinutes         <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818~
## $ Calories                 <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203~
glimpse(sleepDay)
## Rows: 413
## Columns: 5
## $ Id                 <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150~
## $ SleepDay           <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "~
## $ TotalSleepRecords  <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2~
## $ TotalTimeInBed     <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3~
glimpse(weightLogInfo)
## Rows: 67
## Columns: 8
## $ Id             <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 2873212~
## $ Date           <chr> "5/2/2016 11:59:59 PM", "5/3/2016 11:59:59 PM", "4/13/2~
## $ WeightKg       <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3, ~
## $ WeightPounds   <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159.6~
## $ Fat            <dbl> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ BMI            <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.25,~
## $ IsManualReport <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, ~
## $ LogId          <dbl> 1.462234e+12, 1.462320e+12, 1.460510e+12, 1.461283e+12,~

Observations from First Look

  • Column names need to be cleaned.
  • The common variables are Id and a date column (ActivityDate, SleepDay, or Date, depending on the file).
  • Date variables are stored as Character strings in M/D/YYYY format; SleepDay and Date also carry a time of day.
  • We will need to give the date columns a consistent name and convert them to the Date type.
  • LoggedActivitiesDistance is mostly 0. The assumption is that most activities are tracked automatically and only some require manual entry, so users rely on automated tracking and rarely log activities manually.
  • The file sleepDay should be renamed to dailySleep, and its SleepDay variable should be renamed to ActivityDate and formatted as MM/DD/YYYY for consistency.
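
The date observations above can be tried on a few sample strings before touching the full data. This is a base R sketch, offered as an alternative to lubridate parsers such as mdy() and mdy_hms(). It relies on as.Date() ignoring any characters that follow the part matched by format. The sample values are copied from the glimpse() output.

```r
# Sample values as they appear in the three files
activity_date <- c("4/12/2016", "4/13/2016")   # dailyActivity: date only
sleep_day     <- "4/12/2016 12:00:00 AM"       # sleepDay: date + time
weight_date   <- "5/2/2016 11:59:59 PM"        # weightLogInfo: date + time

# Only the month/day/year part is parsed; the trailing time text is ignored
parsed_activity <- as.Date(activity_date, format = "%m/%d/%Y")
parsed_sleep    <- as.Date(sleep_day,     format = "%m/%d/%Y")
parsed_weight   <- as.Date(weight_date,   format = "%m/%d/%Y")

class(parsed_sleep)   # "Date", no longer character
```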

Step 3 - Process

Summary of Cleaning:

  1. Created new dataframes with simple names for easy reference
  2. Cleaned variable names (changed from PascalCase to snake_case)
  3. Checked for missing (NA/null) data
  4. Checked for duplicate data
  5. Stored and then removed duplicate data
  6. Renamed variables for consistency and joining
  7. Converted all Date variables from Character to Date
  8. Organized data by sorting by date
  9. Converted ID from Number to Character for display purposes
  10. Added a Day column to categorize data by days
  11. Added a Report Type variable for categorical purposes
  12. Performed a set difference against the standalone daily datasets to confirm that their data is already present in the merged dataset dailyActivity
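
Steps 3 to 5 can be tried on a toy data frame first. The sketch below uses base R only, and toy is made-up data rather than part of the Fitbit files. It also shows that counting NA cells and calling is.null() answer different questions.

```r
# Made-up data: one repeated row (rows 1 and 2) and one missing value
toy <- data.frame(
  id      = c("a", "a", "b"),
  minutes = c(327, 327, NA)
)

sum(is.na(toy))   # counts NA cells inside the data frame
is.null(toy)      # FALSE: only tests whether the object itself is NULL

# Store the repeated rows, then keep the first occurrence of each row
toy_dups    <- toy[duplicated(toy), ]
toy_deduped <- toy[!duplicated(toy), ]
```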
#rename dataframes for easy reference 
#cleaned variable names for consistency
updated_daily_activity <- clean_names(dailyActivity)
updated_daily_sleep <- clean_names(sleepDay)
updated_weight_log <- clean_names(weightLogInfo)
#check for missing data (NA/null). results should be 0.
sum(is.na(updated_daily_activity))
## [1] 0
sum(is.na(updated_weight_log)) #has NA data, but no benefit in removing it
## [1] 65
sum(is.na(updated_daily_sleep))
## [1] 0
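#note: is.null() tests whether the object itself is NULL, not its cells, so these three checks always return 0
sum(is.null(updated_daily_activity))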
## [1] 0
sum(is.null(updated_weight_log))
## [1] 0
sum(is.null(updated_daily_sleep))
## [1] 0
#check for duplicated data. results should be 0.
sum(duplicated(updated_daily_activity))
## [1] 0
sum(duplicated(updated_weight_log))
## [1] 0
sum(duplicated(updated_daily_sleep)) #duplicates found
## [1] 3
#store and remove duplicates from updated_daily_sleep
sleep_records_dups <- updated_daily_sleep[duplicated(updated_daily_sleep), ]
updated_daily_sleep <- updated_daily_sleep[!duplicated(updated_daily_sleep), ]
#renamed Date variable for consistency and joining
updated_daily_sleep <- rename(updated_daily_sleep,activity_date=sleep_day)
updated_weight_log <- rename(updated_weight_log,activity_date=date)
#convert all Date variables from Character to Date by adding casted_date variable
updated_daily_sleep$casted_date <- as.Date(mdy_hms(updated_daily_sleep$activity_date))
updated_daily_activity$casted_date <- as.Date(mdy(updated_daily_activity$activity_date))
updated_weight_log$casted_date <- as.Date(mdy_hms(updated_weight_log$activity_date))

#arrange by date
updated_daily_sleep <- updated_daily_sleep %>% arrange(casted_date)
updated_daily_activity <- updated_daily_activity %>% arrange(casted_date)
updated_weight_log <- updated_weight_log %>% arrange(casted_date)
#convert id from num to char
updated_daily_activity$id <- as.character(updated_daily_activity$id)
updated_daily_sleep$id <- as.character(updated_daily_sleep$id)
updated_weight_log$id <- as.character(updated_weight_log$id)
#adding Week Day column for categorical data
updated_daily_sleep$day <- weekdays(as.Date(updated_daily_sleep$casted_date)) 
updated_daily_activity$day <- weekdays(as.Date(updated_daily_activity$casted_date))
updated_weight_log$day <- weekdays(as.Date(updated_weight_log$casted_date))
#adding report type column for categorical data
updated_weight_log$report_type <- ifelse(updated_weight_log$is_manual_report == TRUE, "Manual", "Automated")
#dailyCalories, dailyIntensities, and dailySteps provide no additional value: their data is already merged into dailyActivity, as the set difference analysis below shows

#Comparison of dataframes
dailyCalories <- clean_names(read_csv("C:/Users/Brandi/Documents/Bellabeat Case Study/fitbit/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv"))
dailyIntensities <- clean_names(read_csv("C:/Users/Brandi/Documents/Bellabeat Case Study/fitbit/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv"))
dailySteps <- clean_names(read_csv("C:/Users/Brandi/Documents/Bellabeat Case Study/fitbit/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv"))

calorie_comparison_test <- updated_daily_activity %>% 
  select(id,activity_date,calories)

intensity_comparison_test <- updated_daily_activity %>% 
  select(id,activity_date,sedentary_minutes,lightly_active_minutes, fairly_active_minutes,very_active_minutes,sedentary_active_distance,light_active_distance,moderately_active_distance,very_active_distance)

daily_steps_comparison_test <- updated_daily_activity %>% 
  select(id,activity_date,total_steps)

a1 <- rename(dailyCalories,activity_date=activity_day) %>% 
  select(id,activity_date,calories) 

a2 <- rename(dailyIntensities,activity_date=activity_day) %>% 
  select(id,activity_date,sedentary_minutes,lightly_active_minutes, fairly_active_minutes,very_active_minutes,sedentary_active_distance,light_active_distance,moderately_active_distance,very_active_distance)

a3 <- rename(dailySteps,activity_date=activity_day, total_steps=step_total) %>% 
  select(id,activity_date,total_steps)

a1$id <- as.character(a1$id)
a2$id <- as.character(a2$id)
a3$id <- as.character(a3$id)

setdiff(calorie_comparison_test,a1) #0 rows with difference
## # A tibble: 0 x 3
## # ... with 3 variables: id <chr>, activity_date <chr>, calories <dbl>
setdiff(intensity_comparison_test,a2) #0 rows with difference
## # A tibble: 0 x 10
## # ... with 10 variables: id <chr>, activity_date <chr>,
## #   sedentary_minutes <dbl>, lightly_active_minutes <dbl>,
## #   fairly_active_minutes <dbl>, very_active_minutes <dbl>,
## #   sedentary_active_distance <dbl>, light_active_distance <dbl>,
## #   moderately_active_distance <dbl>, very_active_distance <dbl>
setdiff(daily_steps_comparison_test,a3) #0 rows with difference
## # A tibble: 0 x 3
## # ... with 3 variables: id <chr>, activity_date <chr>, total_steps <dbl>
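
setdiff(x, y) only reports rows of x that are missing from y, so the checks above show that nothing in dailyActivity is absent from the standalone files, not the reverse. A sketch of checking both directions follows. It uses made-up tibbles, and merged and standalone are hypothetical names.

```r
library(dplyr)

# Made-up stand-ins for a merged table and a standalone table
merged <- tibble(
  id            = c("a", "a", "b"),
  activity_date = c("4/12/2016", "4/13/2016", "4/12/2016"),
  calories      = c(1985, 1797, 1432)
)
standalone <- merged[c(3, 1, 2), ]   # same rows, different order

# Rows present in one table but not the other, in each direction
only_in_merged     <- dplyr::setdiff(merged, standalone)
only_in_standalone <- dplyr::setdiff(standalone, merged)

nrow(only_in_merged)
nrow(only_in_standalone)
```

Both results need to be empty before concluding that the two tables hold the same rows.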

Step 4 - Analyze

Summary of Analysis Process

  1. Counted distinct participants in each dataframe
  2. Examined date ranges in each dataframe
  3. Merged dataframes
  4. Created an order for days of the week for visualizations
  5. Viewed summary statistics of each dataframe
  6. Analyzed percentages of manual vs automated activity logs
  7. Analyzed percentages of participants of each category
#quick view of cleaned data
glimpse(updated_daily_activity)
## Rows: 940
## Columns: 17
## $ id                         <chr> "1503960366", "1624580081", "1644430081", "~
## $ activity_date              <chr> "4/12/2016", "4/12/2016", "4/12/2016", "4/1~
## $ total_steps                <dbl> 13162, 8163, 10694, 6697, 678, 11875, 4414,~
## $ total_distance             <dbl> 8.50, 5.31, 7.77, 4.43, 0.47, 8.34, 2.74, 7~
## $ tracker_distance           <dbl> 8.50, 5.31, 7.77, 4.43, 0.47, 8.34, 2.74, 7~
## $ logged_activities_distance <dbl> 0.000000, 0.000000, 0.000000, 0.000000, 0.0~
## $ very_active_distance       <dbl> 1.88, 0.00, 0.14, 0.00, 0.00, 3.31, 0.19, 1~
## $ moderately_active_distance <dbl> 0.55, 0.00, 2.30, 0.00, 0.00, 0.77, 0.35, 0~
## $ light_active_distance      <dbl> 6.06, 5.31, 5.33, 4.43, 0.47, 4.26, 2.20, 6~
## $ sedentary_active_distance  <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0~
## $ very_active_minutes        <dbl> 25, 0, 2, 0, 0, 42, 3, 13, 28, 2, 0, 44, 4,~
## $ fairly_active_minutes      <dbl> 13, 0, 51, 0, 0, 14, 8, 9, 13, 21, 0, 19, 1~
## $ lightly_active_minutes     <dbl> 328, 146, 256, 339, 55, 227, 181, 306, 320,~
## $ sedentary_minutes          <dbl> 728, 1294, 1131, 1101, 734, 1157, 706, 1112~
## $ calories                   <dbl> 1985, 1432, 3199, 2030, 2220, 2390, 1459, 2~
## $ casted_date                <date> 2016-04-12, 2016-04-12, 2016-04-12, 2016-0~
## $ day                        <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday",~
glimpse(updated_daily_sleep)
## Rows: 410
## Columns: 7
## $ id                   <chr> "1503960366", "1927972279", "2026352035", "397733~
## $ activity_date        <chr> "4/12/2016 12:00:00 AM", "4/12/2016 12:00:00 AM",~
## $ total_sleep_records  <dbl> 1, 3, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1~
## $ total_minutes_asleep <dbl> 327, 750, 503, 274, 501, 429, 425, 441, 419, 366,~
## $ total_time_in_bed    <dbl> 346, 775, 546, 469, 541, 457, 439, 464, 438, 387,~
## $ casted_date          <date> 2016-04-12, 2016-04-12, 2016-04-12, 2016-04-12, ~
## $ day                  <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tues~
glimpse(updated_weight_log)
## Rows: 67
## Columns: 11
## $ id               <chr> "6962181067", "8877689391", "1927972279", "6962181067~
## $ activity_date    <chr> "4/12/2016 11:59:59 PM", "4/12/2016 6:47:11 AM", "4/1~
## $ weight_kg        <dbl> 62.5, 85.8, 133.5, 62.1, 84.9, 61.7, 84.5, 61.5, 62.0~
## $ weight_pounds    <dbl> 137.7889, 189.1566, 294.3171, 136.9071, 187.1725, 136~
## $ fat              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 25, NA, NA, N~
## $ bmi              <dbl> 24.39, 25.68, 47.54, 24.24, 25.41, 24.10, 25.31, 24.0~
## $ is_manual_report <lgl> TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, T~
## $ log_id           <dbl> 1.460506e+12, 1.460444e+12, 1.460510e+12, 1.460592e+1~
## $ casted_date      <date> 2016-04-12, 2016-04-12, 2016-04-13, 2016-04-13, 2016~
## $ day              <chr> "Tuesday", "Tuesday", "Wednesday", "Wednesday", "Wedn~
## $ report_type      <chr> "Manual", "Automated", "Automated", "Manual", "Automa~
#unique participants in each dataframe
n_distinct(updated_daily_activity$id)
## [1] 33
n_distinct(updated_daily_sleep$id)
## [1] 24
n_distinct(updated_weight_log$id)
## [1] 8
#Examining Date Ranges. All date ranges are the same
min(updated_daily_activity$casted_date)
## [1] "2016-04-12"
min(updated_daily_sleep$casted_date)
## [1] "2016-04-12"
min(updated_weight_log$casted_date)
## [1] "2016-04-12"
max(updated_daily_activity$casted_date)
## [1] "2016-05-12"
max(updated_daily_sleep$casted_date)
## [1] "2016-05-12"
max(updated_weight_log$casted_date)
## [1] "2016-05-12"
#the first and last dates are 30 days apart in every dataframe
difftime(max(updated_daily_activity$casted_date),min(updated_daily_activity$casted_date), units="days")
## Time difference of 30 days
difftime(max(updated_daily_sleep$casted_date),min(updated_daily_sleep$casted_date), units="days")
## Time difference of 30 days
difftime(max(updated_weight_log$casted_date),min(updated_weight_log$casted_date), units="days")
## Time difference of 30 days
#merge daily activity + daily sleep (right join: all.y = TRUE keeps every activity row)
combined_sleep_activity <- merge(updated_daily_sleep, updated_daily_activity, by = c("id", "casted_date" ), all.y = TRUE)
n_distinct(combined_sleep_activity$id)
## [1] 33
#merge in the weight log (left join: all.x = TRUE keeps every row of the combined data)
combined_all_daily <- merge(combined_sleep_activity, updated_weight_log, by = c("id", "casted_date"),all.x = TRUE)
n_distinct(combined_all_daily$id)
## [1] 33
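
In merge(), all.x = TRUE keeps every row of the first data frame and all.y = TRUE keeps every row of the second; all = TRUE would be needed for a full outer join. A small sketch with made-up rows (not taken from the Fitbit files):

```r
# Made-up rows: user "b" has activity data but no sleep record
sleep_toy <- data.frame(id = "a",
                        casted_date = as.Date("2016-04-12"),
                        total_minutes_asleep = 327)
activity_toy <- data.frame(id = c("a", "b"),
                           casted_date = as.Date("2016-04-12"),
                           total_steps = c(13162, 8163))

# all.y = TRUE keeps every activity row; missing sleep values become NA
joined <- merge(sleep_toy, activity_toy,
                by = c("id", "casted_date"), all.y = TRUE)
joined
```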
#ordering days for graph visualizations
updated_daily_sleep$day <- factor(updated_daily_sleep$day,levels =
                                    c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
updated_daily_activity$day <- factor(updated_daily_activity$day,levels = 
                                       c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
updated_weight_log$day <- factor(updated_weight_log$day,levels = 
                                   c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
#Quick summary statistics
updated_daily_activity %>%
  select(total_steps,
         total_distance,
         sedentary_minutes) %>%
  summary()
##   total_steps    total_distance   sedentary_minutes
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0   
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8   
##  Median : 7406   Median : 5.245   Median :1057.5   
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2   
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5   
##  Max.   :36019   Max.   :28.030   Max.   :1440.0
updated_daily_sleep %>%  
  select(total_sleep_records,
         total_minutes_asleep,
         total_time_in_bed) %>%
  summary()
##  total_sleep_records total_minutes_asleep total_time_in_bed
##  Min.   :1.00        Min.   : 58.0        Min.   : 61.0    
##  1st Qu.:1.00        1st Qu.:361.0        1st Qu.:403.8    
##  Median :1.00        Median :432.5        Median :463.0    
##  Mean   :1.12        Mean   :419.2        Mean   :458.5    
##  3rd Qu.:1.00        3rd Qu.:490.0        3rd Qu.:526.0    
##  Max.   :3.00        Max.   :796.0        Max.   :961.0
updated_weight_log %>%  
  select(weight_pounds,
         bmi,
         fat,
         is_manual_report) %>%
  summary()
##  weight_pounds        bmi             fat        is_manual_report
##  Min.   :116.0   Min.   :21.45   Min.   :22.00   Mode :logical   
##  1st Qu.:135.4   1st Qu.:23.96   1st Qu.:22.75   FALSE:26        
##  Median :137.8   Median :24.39   Median :23.50   TRUE :41        
##  Mean   :158.8   Mean   :25.19   Mean   :23.50                   
##  3rd Qu.:187.5   3rd Qu.:25.56   3rd Qu.:24.25                   
##  Max.   :294.3   Max.   :47.54   Max.   :25.00                   
##                                  NA's   :65
updated_daily_activity %>% 
  select(total_distance,
         very_active_distance,
         moderately_active_distance,
         light_active_distance,
         sedentary_active_distance) %>%
  summary()
##  total_distance   very_active_distance moderately_active_distance
##  Min.   : 0.000   Min.   : 0.000       Min.   :0.0000            
##  1st Qu.: 2.620   1st Qu.: 0.000       1st Qu.:0.0000            
##  Median : 5.245   Median : 0.210       Median :0.2400            
##  Mean   : 5.490   Mean   : 1.503       Mean   :0.5675            
##  3rd Qu.: 7.713   3rd Qu.: 2.053       3rd Qu.:0.8000            
##  Max.   :28.030   Max.   :21.920       Max.   :6.4800            
##  light_active_distance sedentary_active_distance
##  Min.   : 0.000        Min.   :0.000000         
##  1st Qu.: 1.945        1st Qu.:0.000000         
##  Median : 3.365        Median :0.000000         
##  Mean   : 3.341        Mean   :0.001606         
##  3rd Qu.: 4.782        3rd Qu.:0.000000         
##  Max.   :10.710        Max.   :0.110000
#percent of total records of 0 logged activities
  ((updated_daily_activity %>% filter(logged_activities_distance  == 0) %>% count()) / nrow(updated_daily_activity)) %>% percent()
## [1] 96.60%
#percent of total records with 0 steps
  ((updated_daily_activity %>% filter(total_steps == 0) %>% count()) / nrow(updated_daily_activity)) %>% percent()
## [1] 8.19%
#percent of total records with 0 distance
((updated_daily_activity %>% filter(total_distance == 0) %>% count()) / nrow(updated_daily_activity)) %>% percent()
## [1] 8.30%
#percent of total users who participated in weight logging
(n_distinct(updated_weight_log$id) / n_distinct(updated_daily_activity$id)) %>% percent()
## [1] 24.24%
#percent of total users who participated in sleep logging
(n_distinct(updated_daily_sleep$id) / n_distinct(updated_daily_activity$id)) %>% percent()
## [1] 72.73%
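
Each percentage above is the share of rows (or users) that meet a condition. An equivalent and shorter form takes the mean of a logical vector. The sketch uses a made-up vector and base R's sprintf() for formatting instead of percent().

```r
# Made-up logged distances: three of the four records are zero
logged_distance <- c(0, 0, 1.5, 0)

# TRUE counts as 1 and FALSE as 0, so the mean is the share of zeros
share_zero <- mean(logged_distance == 0)

share_zero_label <- sprintf("%.2f%%", 100 * share_zero)
share_zero_label
```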

Step 5 - Share

Visualizations

Graphs Categorized by Weekday

#aggregating data for visualizations
activity_by_day <- 
  updated_daily_activity %>% 
        group_by(day) %>% 
        summarize(avg_steps = mean(total_steps),
                  sedentary = mean(sedentary_minutes),
                  lightly_active = mean(lightly_active_minutes),
                  fairly_active = mean(fairly_active_minutes),
                  very_active = mean(very_active_minutes))

sleep_by_day <-
  updated_daily_sleep %>% 
  group_by(day) %>% 
  summarize(avg_minutes_asleep = mean(total_minutes_asleep))

intensity_min_by_day <- 
  activity_by_day %>% 
  select(day, sedentary, lightly_active, fairly_active, very_active)

longdata <- pivot_longer(intensity_min_by_day, !day, names_to = "Intensity", values_to = "Minutes")

longdata$Intensity <- factor(longdata$Intensity, levels = c("sedentary", "lightly_active", "fairly_active", "very_active"))
avg_steps_graph <- ggplot(activity_by_day, aes(x = day, y = avg_steps)) +
  geom_bar(stat="identity", fill='steelblue') +
  geom_hline(yintercept=10000, color = "green") +
  labs(title="Average Steps per Day", x= "Day", y="Average Steps")

ggplotly(avg_steps_graph)
avg_sleep_graph <- ggplot(sleep_by_day, aes(x=day, y=avg_minutes_asleep)) + 
  geom_bar(stat="identity", fill='steelblue') +
  geom_hline(yintercept=480, color = "green") +
  labs(title="Average Sleep per Day", x= "Day of the Week", y="Average Minutes")

ggplotly(avg_sleep_graph)
#Total Weight Logs per day
weight_logs_graph <- ggplot(updated_weight_log, aes(x=day)) + 
  geom_bar(fill='steelblue') +
  labs(title="Daily Weight Logs", x= "Day of the Week", y="Total Log Count") 

ggplotly(weight_logs_graph)
intensity_min_graph <- ggplot(longdata, aes(x = day, y = Minutes, fill = Intensity)) +
  geom_bar(position="dodge", stat="identity") +
  labs(title="Activity Intensity per Day", x= "Day", y="Minutes")

ggplotly(intensity_min_graph)

Relationships and Trend Lines

#relationship between total_time_in_bed and total_minutes_asleep
sleep_graph <- ggplot(data=updated_daily_sleep, aes(x=total_time_in_bed, y= total_minutes_asleep)) + 
  geom_point() +
  stat_smooth(method="lm", se=FALSE) +
  labs(title="Time in Bed vs Time Asleep", x="In Bed", y="Asleep")

ggplotly(sleep_graph)
#relationship between total_steps and sedentary_minutes
step_graph <- ggplot(data=updated_daily_activity, aes(x=total_steps, y=sedentary_minutes)) +
  geom_point() +
  stat_smooth(method="lm", se=FALSE) +
  labs(title="Steps vs Sedentary Minutes", x="Total Steps", y="Sedentary Minutes")

ggplotly(step_graph)
#total steps vs total minutes asleep
step_graph2 <- ggplot(combined_sleep_activity, aes(x=total_minutes_asleep,  y=total_steps)) + 
  geom_point() +
  geom_vline(xintercept = 480, color="green") +  #8 hours = 480 minutes
  stat_smooth(method="lm", se=FALSE) +
  labs(title="Steps vs Sleep", x="Minutes Asleep", y="Total Steps" ) +
  annotate("text", x = 600, y = 30000, label = "8 hour recommended", color="green")

ggplotly(step_graph2)
#total minutes asleep vs calories
calorie_graph <- ggplot(combined_sleep_activity, aes(x=total_minutes_asleep,  y=calories)) + 
  geom_point() +
  geom_vline(xintercept = 480 , color="green") +
  stat_smooth(method="lm", se=FALSE) +
  labs(title="Sleep vs Calories", x="Minutes Asleep", y="Calories") +
  annotate("text", x = 675, y = 4000, label = "8 hour recommended", color="green")

ggplotly(calorie_graph)

Participation

#how often participants participate. inconsistent participation
phys_part_graph <- ggplot(updated_daily_activity,aes(x = fct_infreq(id))) + 
    geom_bar(stat = 'count',fill='steelblue') +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  labs(title="Physical Activity Participant Frequency", x="User ID", y="Total Log Count")

ggplotly(phys_part_graph)
sleep_part <- ggplot(updated_daily_sleep,aes(x = fct_infreq(id))) + 
    geom_bar(stat = 'count',fill='steelblue') +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  labs(title="Sleep Participant Frequency", x="User ID", y="Total Log Count")

ggplotly(sleep_part)
weight_part <- ggplot(updated_weight_log,aes(x = fct_infreq(id))) + 
    geom_bar(stat = 'count',fill='steelblue') +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  labs(title="Weight Log Participant Frequency", x="User ID", y="Total Log Count")

ggplotly(weight_part)
record_type <- c("Physical", "Sleep", "Weight")
record_type <- factor(record_type,levels = c("Physical", "Sleep", "Weight"))
                                   
distinct_user_count <-c(n_distinct(updated_daily_activity$id), n_distinct(updated_daily_sleep$id), n_distinct(updated_weight_log$id))

user_comparison <- data.frame(record_type,distinct_user_count)

#pie
fig <- plot_ly()
fig <- fig %>% add_pie(data = user_comparison, labels = record_type, values = distinct_user_count, title = "Participation Activity")
fig

Manual vs Automated Preferences

manual_physical <- sum(updated_daily_activity$logged_activities_distance != 0)
automated_physical <- sum(updated_daily_activity$logged_activities_distance == 0)
pie_labels <- c('Manual', 'Automated')
record_type_count <- c(manual_physical,automated_physical)
pie_data <- data.frame(pie_labels,record_type_count)

fig <- plot_ly()
fig <- fig %>% add_pie(data = pie_data, labels = pie_labels, values = record_type_count, title = "Manual vs Automated Physical Logs")
fig

Time Lapse

#shows the days on which each participant logged weight. logging is sporadic, not consistent
weight_con <- ggplot(updated_weight_log, aes(x=casted_date, y=id, color = report_type)) + 
  geom_point() +
  labs(title="Weight Log Consistency", x="Date of Log", y="User ID")

ggplotly(weight_con)
#weight change per participant. shows no changes in weight
weight_change_graph <- ggplot(updated_weight_log, aes(x=casted_date, y=weight_pounds)) + 
  geom_line(color="blue") +
  facet_wrap(~fct_infreq(id)) +
  labs(title="Weight Change", x="", y="Weight in Pounds") 

ggplotly(weight_change_graph)
#plot against casted_date (a Date) so the x axis is ordered chronologically rather than alphabetically
activity_graph <- ggplot(updated_daily_activity, aes(x=casted_date, y=total_steps, group=id, color=id)) +
  geom_line() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title="User Activity", x="Date", y="Steps Taken")

ggplotly(activity_graph)

Observations

Participation

  • Users were inconsistent in tracking across all activities.
  • Of the 33 participants in total:
    • 33 participated in physical activity logging
      • The majority logged every day, but not all did.
    • 24 participated in sleep activity logging
      • The majority did not log every day.
    • 8 participated in weight logging
      • The majority did not log every day.
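
One way to back up the "logged every day" statements with numbers is to count distinct logging days per user and compare them with the length of the study period. A base R sketch on made-up data (logs and n_days are hypothetical):

```r
# Made-up log: user "a" logged on all 3 days, user "b" on 1 day
logs <- data.frame(
  id          = c("a", "a", "a", "b"),
  casted_date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-14",
                          "2016-04-12"))
)
n_days <- 3   # length of the (made-up) study period in days

# Distinct logging days per user
days_logged <- tapply(logs$casted_date, logs$id,
                      function(d) length(unique(d)))

# Share of users who logged on every day of the period
share_every_day <- mean(days_logged == n_days)
share_every_day
```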

Weekday Analysis

  • Saturday has the highest average step count and Sunday the lowest. There is little variation across the week.
  • Sunday has the highest average sleep time and Thursday the lowest. There is little variation across the week.
  • Monday and Wednesday are the most active days for weight logging, and Friday and Saturday the least active. There is a slight trend toward fewer logs as the week progresses.

Daily Sleep Activity

  • The average sleep time is under the recommended 8 hours on every day of the week.
  • Users tend to spend roughly the same amount of time in bed as they spend asleep, meaning they fall asleep, stay asleep, and do not lie awake in bed for long. A few outliers spend a lot of time awake in bed.
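
The gap between time in bed and time asleep can be summarized as a ratio per record. The sketch below uses the first three sleepDay values shown in the glimpse() output. The names sleep_efficiency and minutes_awake_in_bed are introduced here for illustration and are not columns in the data.

```r
# First three records from the sleepDay glimpse() output
total_minutes_asleep <- c(327, 384, 412)
total_time_in_bed    <- c(346, 407, 442)

# Share of time in bed spent asleep (values close to 1 mean little time awake)
sleep_efficiency <- total_minutes_asleep / total_time_in_bed

# Minutes spent in bed but awake
minutes_awake_in_bed <- total_time_in_bed - total_minutes_asleep
```

Records with a low ratio would correspond to the outliers visible in the scatter plot.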

Daily Physical Activity

  • The average number of steps is under the recommended 10,000 on every day of the week.
  • As total steps increase, sedentary minutes decrease, as one might expect.
  • Users are mostly sedentary each day.
  • Users are consistently “lightly active” rather than moderately or very active.

Correlation between Physical and Sleep Activity

  • There is a trend showing that the more sleep users get, the fewer steps they take.
    • This could mean that sleeping longer leaves less time to be active.
  • There is not a strong correlation between the amount of sleep and the amount of calories burned.
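
The strength of these trends can be quantified with a correlation coefficient rather than judged from the plots alone. The sketch below applies base R's cor() to made-up values. The argument use = "complete.obs" drops rows with missing sleep data, which the merged dataframe contains for days without a sleep record.

```r
# Made-up values; NA marks a day with activity but no sleep record
toy <- data.frame(
  total_minutes_asleep = c(300, 360, 420, 480, NA),
  total_steps          = c(12000, 10000, 8000, 6000, 9000)
)

# Pearson correlation on complete rows only; values near -1 or 1 indicate
# a strong linear relationship, values near 0 a weak one
r <- cor(toy$total_minutes_asleep, toy$total_steps, use = "complete.obs")
r
```

The toy values are perfectly linear on purpose; real data would be expected to give a coefficient much closer to 0.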

Weight Log Analysis

  • 2 of the 8 participants accounted for most of the logs, and the same 2 were the most consistent, logging nearly every day.
  • The remaining 6 participants logged sporadically.
  • There were no drastic changes in weight. The study lasting only about a month may account for this.

Manual vs Automated Logging

  • 4 of the 33 participants manually logged physical activity.
  • This suggests that users prefer automated logging over manual logging for physical activity.
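
The number of users who logged manually can be derived by looking for any non-zero logged_activities_distance per user. A base R sketch on made-up rows:

```r
# Made-up rows: only user "b" has a manually logged distance
activity_toy <- data.frame(
  id = c("a", "a", "b", "b", "c"),
  logged_activities_distance = c(0, 0, 0, 2.1, 0)
)

# Users with at least one non-zero logged distance
manual_users <- unique(
  activity_toy$id[activity_toy$logged_activities_distance != 0]
)

length(manual_users)              # users who logged manually
length(unique(activity_toy$id))   # all users
```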

Step 6 - Act

Final Conclusions

  • Overall, users were not consistent in tracking physical activity, tracking sleep, or logging weight. Potential reasons:
    • Users forgot to wear the device during physical activity or sleep
    • Users did not want to wear the device because of discomfort
    • Users forgot to log activity manually
    • Users did not find the tracking process easy and convenient
  • Users were more consistent with tracking physical activity, which could indicate that the device is not comfortable enough to wear during sleep and that it does not make weight logging easy.
  • 96.6% of daily physical activity records have no manually logged distance, meaning users rely on activity being detected and logged automatically rather than entering it manually.

Recommendations

To meet the goal of empowering women with knowledge about their own health and habits, the recommendations for marketing are to:

Create and Advertise

  • A more comfortable device that allows all-day wear for consistent tracking, which leads to more complete health data
  • Daily challenges that give users an incentive to stay consistent with all activities (physical, sleep, and logging) each day of the week
    • Challenge to take 10,000 steps
    • Challenge to sleep 8 hours
  • An easy and convenient tracking process focused on automated activity tracking, with less need for manual intervention
  • Reminders to log activity

Next Steps

  • Move forward with the recommendations based on the insights gained from the limited data

    or

  • Expand upon the limited data and timeframe by conducting another survey that gathers specific, categorical data from the female target audience and asks about the goals they have for their health and fitness tracking.