Bellabeat is a high-tech manufacturer of health-focused products for women. It is a successful small company with the potential to become a larger player in the global smart device market. Analyzing smart device fitness data could help unlock new growth opportunities for the company.
The business task is to analyze Fitbit smart device usage data to gain insight into how consumers currently use their smart devices and to discover underlying trends that can be applied to Bellabeat products. This insight could help identify new growth opportunities and shape the marketing strategy needed to meet business growth goals.
We have used a public dataset that explores smart device users’ daily habits.
Name: FitBit Fitness Tracker Data (CC0 Public Domain)
Link: https://www.kaggle.com/arashnic/fitbit
"This Kaggle data set contains personal fitness tracker from thirty fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.
These datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. "
Overall, this data may not be sufficient for a full, detailed analysis that meets the business goal. However, we will move forward with it to uncover whatever insights it can offer.
Ideally, we would have a sample that is more representative of the whole population, at least 6 months of data from a more recent year, and gender information so that we can gain insight into trends specific to women.
This is public data. No data was identified as sensitive, so no anonymization or de-identification is needed. There is also no need to encrypt the data.
#import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import janitor
import datetime
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import matplotlib
from matplotlib.pyplot import figure
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (12,8)
pd.options.mode.chained_assignment = None
#load data from csv files into data frames
dailyActivity = pd.read_csv(r'C:/Users/Brandi/Documents/Bellabeat Case Study/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv')
sleepDay = pd.read_csv(r'C:/Users/Brandi/Documents/Bellabeat Case Study/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv')
weightLogInfo = pd.read_csv(r'C:/Users/Brandi/Documents/Bellabeat Case Study/fitbit/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv')
dailyActivity.head()
sleepDay.head()
weightLogInfo.head()
| | Id | ActivityDate | TotalSteps | TotalDistance | TrackerDistance | LoggedActivitiesDistance | VeryActiveDistance | ModeratelyActiveDistance | LightActiveDistance | SedentaryActiveDistance | VeryActiveMinutes | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes | Calories |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1503960366 | 4/12/2016 | 13162 | 8.50 | 8.50 | 0.0 | 1.88 | 0.55 | 6.06 | 0.0 | 25 | 13 | 328 | 728 | 1985 |
| 1 | 1503960366 | 4/13/2016 | 10735 | 6.97 | 6.97 | 0.0 | 1.57 | 0.69 | 4.71 | 0.0 | 21 | 19 | 217 | 776 | 1797 |
| 2 | 1503960366 | 4/14/2016 | 10460 | 6.74 | 6.74 | 0.0 | 2.44 | 0.40 | 3.91 | 0.0 | 30 | 11 | 181 | 1218 | 1776 |
| 3 | 1503960366 | 4/15/2016 | 9762 | 6.28 | 6.28 | 0.0 | 2.14 | 1.26 | 2.83 | 0.0 | 29 | 34 | 209 | 726 | 1745 |
| 4 | 1503960366 | 4/16/2016 | 12669 | 8.16 | 8.16 | 0.0 | 2.71 | 0.41 | 5.04 | 0.0 | 36 | 10 | 221 | 773 | 1863 |
| | Id | SleepDay | TotalSleepRecords | TotalMinutesAsleep | TotalTimeInBed |
|---|---|---|---|---|---|
| 0 | 1503960366 | 4/12/2016 12:00:00 AM | 1 | 327 | 346 |
| 1 | 1503960366 | 4/13/2016 12:00:00 AM | 2 | 384 | 407 |
| 2 | 1503960366 | 4/15/2016 12:00:00 AM | 1 | 412 | 442 |
| 3 | 1503960366 | 4/16/2016 12:00:00 AM | 2 | 340 | 367 |
| 4 | 1503960366 | 4/17/2016 12:00:00 AM | 1 | 700 | 712 |
| | Id | Date | WeightKg | WeightPounds | Fat | BMI | IsManualReport | LogId |
|---|---|---|---|---|---|---|---|---|
| 0 | 1503960366 | 5/2/2016 11:59:59 PM | 52.599998 | 115.963147 | 22.0 | 22.650000 | True | 1462233599000 |
| 1 | 1503960366 | 5/3/2016 11:59:59 PM | 52.599998 | 115.963147 | NaN | 22.650000 | True | 1462319999000 |
| 2 | 1927972279 | 4/13/2016 1:08:52 AM | 133.500000 | 294.317120 | NaN | 47.540001 | False | 1460509732000 |
| 3 | 2873212765 | 4/21/2016 11:59:59 PM | 56.700001 | 125.002104 | NaN | 21.450001 | True | 1461283199000 |
| 4 | 2873212765 | 5/12/2016 11:59:59 PM | 57.299999 | 126.324875 | NaN | 21.690001 | True | 1463097599000 |
Summary of Cleaning:
#checking for missing data (NULL)
for col in dailyActivity.columns:
    pct_missing = round((np.mean(dailyActivity[col].isnull())) * 100)
    print(f'{col} {pct_missing}%')
print("---")
for col in sleepDay.columns:
    pct_missing = round((np.mean(sleepDay[col].isnull())) * 100)
    print(f'{col} {pct_missing}%')
print("---")
for col in weightLogInfo.columns:
    pct_missing = round((np.mean(weightLogInfo[col].isnull())) * 100)
    print(f'{col} {pct_missing}%')
#checking for missing data (NA)
for col in dailyActivity.columns:
    pct_missing = round((np.mean(dailyActivity[col].isna())) * 100)
    print(f'{col} {pct_missing}%')
print("---")
for col in sleepDay.columns:
    pct_missing = round((np.mean(sleepDay[col].isna())) * 100)
    print(f'{col} {pct_missing}%')
print("---")
for col in weightLogInfo.columns:
    pct_missing = round((np.mean(weightLogInfo[col].isna())) * 100)
    print(f'{col} {pct_missing}%')
# weightLogInfo["Fat"] has missing data, but we will retain it.
Id 0% ActivityDate 0% TotalSteps 0% TotalDistance 0% TrackerDistance 0% LoggedActivitiesDistance 0% VeryActiveDistance 0% ModeratelyActiveDistance 0% LightActiveDistance 0% SedentaryActiveDistance 0% VeryActiveMinutes 0% FairlyActiveMinutes 0% LightlyActiveMinutes 0% SedentaryMinutes 0% Calories 0% --- Id 0% ActivityDate 0% TotalSteps 0% TotalDistance 0% TrackerDistance 0% LoggedActivitiesDistance 0% VeryActiveDistance 0% ModeratelyActiveDistance 0% LightActiveDistance 0% SedentaryActiveDistance 0% VeryActiveMinutes 0% FairlyActiveMinutes 0% LightlyActiveMinutes 0% SedentaryMinutes 0% Calories 0% --- Id 0% SleepDay 0% TotalSleepRecords 0% TotalMinutesAsleep 0% TotalTimeInBed 0% --- Id 0% Date 0% WeightKg 0% WeightPounds 0% Fat 97% BMI 0% IsManualReport 0% LogId 0% Id 0% ActivityDate 0% TotalSteps 0% TotalDistance 0% TrackerDistance 0% LoggedActivitiesDistance 0% VeryActiveDistance 0% ModeratelyActiveDistance 0% LightActiveDistance 0% SedentaryActiveDistance 0% VeryActiveMinutes 0% FairlyActiveMinutes 0% LightlyActiveMinutes 0% SedentaryMinutes 0% Calories 0% --- Id 0% SleepDay 0% TotalSleepRecords 0% TotalMinutesAsleep 0% TotalTimeInBed 0% --- Id 0% Date 0% WeightKg 0% WeightPounds 0% Fat 97% BMI 0% IsManualReport 0% LogId 0%
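In pandas, `isnull()` is an alias of `isna()`, so the two passes above report the same figures. A more compact version of the check is sketched below on a small made-up frame. The `missing_pct` helper and the sample values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_pct(df: pd.DataFrame) -> pd.Series:
    """Return the share of missing values per column as a whole percent."""
    return (df.isna().mean() * 100).round().astype(int)


# Hypothetical stand-in for weightLogInfo; the values are illustrative only.
sample = pd.DataFrame(
    {
        "Id": [1, 1, 2, 3],
        "WeightKg": [52.6, 52.6, 133.5, 56.7],
        "Fat": [22.0, np.nan, np.nan, np.nan],
    }
)

pct = missing_pct(sample)
print(pct)
```

The same helper could be applied to each of the three data frames in turn instead of repeating the loop.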
# Checking data types
dailyActivity.dtypes
sleepDay.dtypes
weightLogInfo.dtypes
Id int64
ActivityDate object
TotalSteps int64
TotalDistance float64
TrackerDistance float64
LoggedActivitiesDistance float64
VeryActiveDistance float64
ModeratelyActiveDistance float64
LightActiveDistance float64
SedentaryActiveDistance float64
VeryActiveMinutes int64
FairlyActiveMinutes int64
LightlyActiveMinutes int64
SedentaryMinutes int64
Calories int64
dtype: object
Id int64
SleepDay object
TotalSleepRecords int64
TotalMinutesAsleep int64
TotalTimeInBed int64
dtype: object
Id int64
Date object
WeightKg float64
WeightPounds float64
Fat float64
BMI float64
IsManualReport bool
LogId int64
dtype: object
#cleaning data frames
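#rename the data frames, standardize column names to snake_case, remove empty rows/columns, rename date columns
#keyword arguments keep the clean_names call independent of the positional order in the installed pyjanitor version
updated_daily_activity = (
    dailyActivity
    .clean_names(strip_underscores=None, case_type='snake')
    .remove_empty()
)
updated_daily_sleep = (
    sleepDay
    .clean_names(strip_underscores=None, case_type='snake')
    .remove_empty()
    .rename_column('sleep_day','activity_date')
)
updated_weight_log = (
    weightLogInfo
    .clean_names(strip_underscores=None, case_type='snake')
    .remove_empty()
    .rename_column('date',"activity_date")
)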
#changing activity date string to datetime
updated_daily_activity['activity_date'] = pd.to_datetime(updated_daily_activity['activity_date'])
updated_daily_sleep['activity_date'] = pd.to_datetime(updated_daily_sleep['activity_date'])
updated_weight_log['activity_date'] = pd.to_datetime(updated_weight_log['activity_date'])
#adding weekday column
updated_daily_activity['weekday'] = updated_daily_activity['activity_date'].dt.day_name()
updated_daily_sleep['weekday'] = updated_daily_sleep['activity_date'].dt.day_name()
updated_weight_log['weekday'] = updated_weight_log['activity_date'].dt.day_name()
#change data types of ID from int to string
updated_daily_activity['id'] = updated_daily_activity['id'].astype(str)
updated_daily_sleep['id'] = updated_daily_sleep['id'].astype(str)
updated_weight_log['id'] = updated_weight_log['id'].astype(str)
#check for duplicate data
updated_daily_activity.duplicated().sum()
updated_daily_sleep.duplicated().sum()
updated_weight_log.duplicated().sum()
0
3
0
updated_daily_sleep.loc[updated_daily_sleep.duplicated(), :]
| | id | activity_date | total_sleep_records | total_minutes_asleep | total_time_in_bed | weekday |
|---|---|---|---|---|---|---|
| 161 | 4388161847 | 2016-05-05 | 1 | 471 | 495 | Thursday |
| 223 | 4702921684 | 2016-05-07 | 1 | 520 | 543 | Saturday |
| 380 | 8378563200 | 2016-04-25 | 1 | 388 | 402 | Monday |
updated_daily_sleep = updated_daily_sleep.drop_duplicates()
updated_daily_sleep.duplicated().sum()
0
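The date conversion in the cleaning step relies on pandas inferring the format of each column. As a hedged alternative, the format can be stated explicitly. The sketch below uses illustrative strings in the two layouts seen in the raw files; the variable names are our own.

```python
import pandas as pd

# Illustrative strings in the two layouts seen in the raw CSV previews.
activity_dates = pd.Series(["4/12/2016", "4/13/2016"])
sleep_dates = pd.Series(["4/12/2016 12:00:00 AM", "5/2/2016 11:59:59 PM"])

# An explicit format removes day/month ambiguity and raises on rows that do not match.
parsed_activity = pd.to_datetime(activity_dates, format="%m/%d/%Y")
parsed_sleep = pd.to_datetime(sleep_dates, format="%m/%d/%Y %I:%M:%S %p")

print(parsed_activity.dt.day_name().tolist())
print(parsed_sleep.tolist())
```

Stating the format also tends to be faster on large files, since pandas does not need to guess it.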
# sort by Date
updated_daily_activity = updated_daily_activity.sort_values(by=['activity_date'], inplace=False, ascending=True)
updated_daily_sleep = updated_daily_sleep.sort_values(by=['activity_date'], inplace=False, ascending=True)
updated_weight_log = updated_weight_log.sort_values(by=['activity_date'], inplace=False, ascending=True)
#add report type variable for categorical data
updated_weight_log['report_type'] = np.where(updated_weight_log['is_manual_report'] == True, "Manual", "Automated")
#validating new report type column
np.sum(updated_weight_log["report_type"] == "Manual")
np.sum(updated_weight_log["is_manual_report"] == True)
np.sum(updated_weight_log["report_type"] == "Automated")
np.sum(updated_weight_log["is_manual_report"] == False)
41
41
26
26
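#compare data frames: check that dailyActivity already contains the data in the separate calories, intensities, and steps files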
calorie_comparison_test = updated_daily_activity[["id","activity_date","calories"]]
intensity_comparison_test = updated_daily_activity[["id","activity_date","sedentary_minutes","lightly_active_minutes", "fairly_active_minutes","very_active_minutes","sedentary_active_distance","light_active_distance","moderately_active_distance","very_active_distance"]]
daily_steps_comparison_test = updated_daily_activity[["id","activity_date","total_steps"]]
dailyCalories = pd.read_csv(r'C:/Users/Brandi/Documents/Bellabeat Case Study/fitbit/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv')
dailyIntensities = pd.read_csv(r'C:/Users/Brandi/Documents/Bellabeat Case Study/fitbit/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv')
dailySteps = pd.read_csv(r'C:/Users/Brandi/Documents/Bellabeat Case Study/fitbit/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv')
a1 = (
    dailyCalories
    .clean_names(strip_underscores=None, case_type='snake')
    .remove_empty()
    .rename_column('activity_day','activity_date')
)
a2 = (
    dailyIntensities
    .clean_names(strip_underscores=None, case_type='snake')
    .remove_empty()
    .rename_column('activity_day','activity_date')
)
a3 = (
    dailySteps
    .clean_names(strip_underscores=None, case_type='snake')
    .remove_empty()
    .rename_column('activity_day','activity_date')
    .rename_column('step_total','total_steps')
)
a1['id'] = a1['id'].astype(str)
a2['id'] = a2['id'].astype(str)
a3['id'] = a3['id'].astype(str)
a1['activity_date'] = pd.to_datetime(a1['activity_date'])
a2['activity_date'] = pd.to_datetime(a2['activity_date'])
a3['activity_date'] = pd.to_datetime(a3['activity_date'])
#looking for differences between data sets: rows present in only one frame survive drop_duplicates(keep=False); none were found
pd.concat([calorie_comparison_test,a1]).drop_duplicates(keep=False)
pd.concat([intensity_comparison_test,a2]).drop_duplicates(keep=False)
pd.concat([daily_steps_comparison_test,a3]).drop_duplicates(keep=False)
| id | activity_date | calories |
|---|---|---|
| id | activity_date | sedentary_minutes | lightly_active_minutes | fairly_active_minutes | very_active_minutes | sedentary_active_distance | light_active_distance | moderately_active_distance | very_active_distance |
|---|---|---|---|---|---|---|---|---|---|
| id | activity_date | total_steps |
|---|---|---|
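The comparison above works by stacking two frames and discarding every row that appears twice. Another way to run the same check, sketched here on hypothetical miniature frames, is an outer merge with `indicator=True`, which also reports which side an unmatched row came from. The frame contents and the deliberate mismatch are illustrative assumptions.

```python
import pandas as pd

# Hypothetical miniature versions of two frames that are expected to agree.
left = pd.DataFrame(
    {
        "id": ["a", "a", "b"],
        "activity_date": pd.to_datetime(["2016-04-12", "2016-04-13", "2016-04-12"]),
        "calories": [1985, 1797, 2220],
    }
)
right = left.copy()
right.loc[2, "calories"] = 2221  # introduce one deliberate mismatch

# With no `on` argument the merge uses every shared column as the key,
# and the `_merge` column records where each row was found.
diff = left.merge(right, how="outer", indicator=True)
mismatches = diff[diff["_merge"] != "both"]
print(mismatches)
```

An empty `mismatches` frame would correspond to the empty results shown above.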
Summary of Analysis Process
#quick view of clean data
updated_daily_activity.head()
updated_daily_sleep.head()
updated_weight_log.head()
| | id | activity_date | total_steps | total_distance | tracker_distance | logged_activities_distance | very_active_distance | moderately_active_distance | light_active_distance | sedentary_active_distance | very_active_minutes | fairly_active_minutes | lightly_active_minutes | sedentary_minutes | calories | weekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1503960366 | 2016-04-12 | 13162 | 8.50 | 8.50 | 0.0 | 1.88 | 0.55 | 6.06 | 0.0 | 25 | 13 | 328 | 728 | 1985 | Tuesday |
| 123 | 1927972279 | 2016-04-12 | 678 | 0.47 | 0.47 | 0.0 | 0.00 | 0.00 | 0.47 | 0.0 | 0 | 0 | 55 | 734 | 2220 | Tuesday |
| 154 | 2022484408 | 2016-04-12 | 11875 | 8.34 | 8.34 | 0.0 | 3.31 | 0.77 | 4.26 | 0.0 | 42 | 14 | 227 | 1157 | 2390 | Tuesday |
| 909 | 8877689391 | 2016-04-12 | 23186 | 20.40 | 20.40 | 0.0 | 12.22 | 0.34 | 7.82 | 0.0 | 85 | 7 | 312 | 1036 | 3921 | Tuesday |
| 185 | 2026352035 | 2016-04-12 | 4414 | 2.74 | 2.74 | 0.0 | 0.19 | 0.35 | 2.20 | 0.0 | 3 | 8 | 181 | 706 | 1459 | Tuesday |
| | id | activity_date | total_sleep_records | total_minutes_asleep | total_time_in_bed | weekday |
|---|---|---|---|---|---|---|
| 0 | 1503960366 | 2016-04-12 | 1 | 327 | 346 | Tuesday |
| 109 | 4020332650 | 2016-04-12 | 1 | 501 | 541 | Tuesday |
| 167 | 4445114986 | 2016-04-12 | 2 | 429 | 457 | Tuesday |
| 200 | 4702921684 | 2016-04-12 | 1 | 425 | 439 | Tuesday |
| 228 | 5553957443 | 2016-04-12 | 1 | 441 | 464 | Tuesday |
| | id | activity_date | weight_kg | weight_pounds | fat | bmi | is_manual_report | log_id | weekday | report_type |
|---|---|---|---|---|---|---|---|---|---|---|
| 43 | 8877689391 | 2016-04-12 06:47:11 | 85.800003 | 189.156628 | NaN | 25.680000 | False | 1460443631000 | Tuesday | Automated |
| 13 | 6962181067 | 2016-04-12 23:59:59 | 62.500000 | 137.788914 | NaN | 24.389999 | True | 1460505599000 | Tuesday | Manual |
| 2 | 1927972279 | 2016-04-13 01:08:52 | 133.500000 | 294.317120 | NaN | 47.540001 | False | 1460509732000 | Wednesday | Automated |
| 44 | 8877689391 | 2016-04-13 06:55:00 | 84.900002 | 187.172464 | NaN | 25.410000 | False | 1460530500000 | Wednesday | Automated |
| 14 | 6962181067 | 2016-04-13 23:59:59 | 62.099998 | 136.907061 | NaN | 24.240000 | True | 1460591999000 | Wednesday | Manual |
#unique participants in each dataframe
updated_daily_activity["id"].nunique()
updated_daily_sleep["id"].nunique()
updated_weight_log["id"].nunique()
33
24
8
#Examining Date Ranges. All date ranges are the same
updated_daily_activity["activity_date"].min()
updated_daily_sleep["activity_date"].min()
updated_weight_log["activity_date"].min()
print("---")
updated_daily_activity["activity_date"].max()
updated_daily_sleep["activity_date"].max()
updated_weight_log["activity_date"].max()
print("---")
updated_daily_activity["activity_date"].max() - updated_daily_activity["activity_date"].min()
updated_daily_sleep["activity_date"].max() - updated_daily_sleep["activity_date"].min()
updated_weight_log["activity_date"].max() - updated_weight_log["activity_date"].min()
Timestamp('2016-04-12 00:00:00')
Timestamp('2016-04-12 00:00:00')
Timestamp('2016-04-12 06:47:11')
---
Timestamp('2016-05-12 00:00:00')
Timestamp('2016-05-12 00:00:00')
Timestamp('2016-05-12 23:59:59')
---
Timedelta('30 days 00:00:00')
Timedelta('30 days 00:00:00')
Timedelta('30 days 17:12:48')
#quick summary statistics
updated_daily_activity[["total_steps", "total_distance", "sedentary_minutes"]].describe()
updated_daily_sleep[["total_sleep_records", "total_minutes_asleep", "total_time_in_bed"]].describe()
updated_weight_log[["weight_pounds", "bmi", "fat"]].describe()
| | total_steps | total_distance | sedentary_minutes |
|---|---|---|---|
| count | 940.000000 | 940.000000 | 940.000000 |
| mean | 7637.910638 | 5.489702 | 991.210638 |
| std | 5087.150742 | 3.924606 | 301.267437 |
| min | 0.000000 | 0.000000 | 0.000000 |
| 25% | 3789.750000 | 2.620000 | 729.750000 |
| 50% | 7405.500000 | 5.245000 | 1057.500000 |
| 75% | 10727.000000 | 7.712500 | 1229.500000 |
| max | 36019.000000 | 28.030001 | 1440.000000 |
| | total_sleep_records | total_minutes_asleep | total_time_in_bed |
|---|---|---|---|
| count | 410.000000 | 410.000000 | 410.000000 |
| mean | 1.119512 | 419.173171 | 458.482927 |
| std | 0.346636 | 118.635918 | 127.455140 |
| min | 1.000000 | 58.000000 | 61.000000 |
| 25% | 1.000000 | 361.000000 | 403.750000 |
| 50% | 1.000000 | 432.500000 | 463.000000 |
| 75% | 1.000000 | 490.000000 | 526.000000 |
| max | 3.000000 | 796.000000 | 961.000000 |
| | weight_pounds | bmi | fat |
|---|---|---|---|
| count | 67.000000 | 67.000000 | 2.00000 |
| mean | 158.811801 | 25.185224 | 23.50000 |
| std | 30.695415 | 3.066963 | 2.12132 |
| min | 115.963147 | 21.450001 | 22.00000 |
| 25% | 135.363832 | 23.959999 | 22.75000 |
| 50% | 137.788914 | 24.389999 | 23.50000 |
| 75% | 187.503152 | 25.559999 | 24.25000 |
| max | 294.317120 | 47.540001 | 25.00000 |
#percentages
print("percent with 0 logged activities:","{:.2%}".format(np.sum(updated_daily_activity["logged_activities_distance"] == 0) / updated_daily_activity["logged_activities_distance"].count()))
print("percent with 0 distance:","{:.2%}".format(np.sum(updated_daily_activity["total_distance"] == 0) / updated_daily_activity["total_distance"].count()))
print("percent users participated in weight logging:","{:.2%}".format(updated_weight_log["id"].nunique() / updated_daily_activity["id"].nunique()))
print("percent users participated in sleep logging:","{:.2%}".format(updated_daily_sleep["id"].nunique() / updated_daily_activity["id"].nunique()))
percent with 0 logged activities: 96.60%
percent with 0 distance: 8.30%
percent users participated in weight logging: 24.24%
percent users participated in sleep logging: 72.73%
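#merge daily sleep into daily activity (right join: keeps every activity row, sleep columns are NaN where no sleep was logged)
combined_sleep_activity = pd.merge(updated_daily_sleep, updated_daily_activity, how="right", on=["id","activity_date"])
#merge weight logs onto the combined data (left join on id only, so each activity row repeats once per weight log of that user)
combined_all_daily = pd.merge(combined_sleep_activity, updated_weight_log, how="left", on=["id"])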
combined_sleep_activity.head()
combined_all_daily.head()
| | id | activity_date | total_sleep_records | total_minutes_asleep | total_time_in_bed | weekday_x | total_steps | total_distance | tracker_distance | logged_activities_distance | very_active_distance | moderately_active_distance | light_active_distance | sedentary_active_distance | very_active_minutes | fairly_active_minutes | lightly_active_minutes | sedentary_minutes | calories | weekday_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1503960366 | 2016-04-12 | 1.0 | 327.0 | 346.0 | Tuesday | 13162 | 8.50 | 8.50 | 0.0 | 1.88 | 0.55 | 6.06 | 0.0 | 25 | 13 | 328 | 728 | 1985 | Tuesday |
| 1 | 1927972279 | 2016-04-12 | 3.0 | 750.0 | 775.0 | Tuesday | 678 | 0.47 | 0.47 | 0.0 | 0.00 | 0.00 | 0.47 | 0.0 | 0 | 0 | 55 | 734 | 2220 | Tuesday |
| 2 | 2022484408 | 2016-04-12 | NaN | NaN | NaN | NaN | 11875 | 8.34 | 8.34 | 0.0 | 3.31 | 0.77 | 4.26 | 0.0 | 42 | 14 | 227 | 1157 | 2390 | Tuesday |
| 3 | 8877689391 | 2016-04-12 | NaN | NaN | NaN | NaN | 23186 | 20.40 | 20.40 | 0.0 | 12.22 | 0.34 | 7.82 | 0.0 | 85 | 7 | 312 | 1036 | 3921 | Tuesday |
| 4 | 2026352035 | 2016-04-12 | 1.0 | 503.0 | 546.0 | Tuesday | 4414 | 2.74 | 2.74 | 0.0 | 0.19 | 0.35 | 2.20 | 0.0 | 3 | 8 | 181 | 706 | 1459 | Tuesday |
| | id | activity_date_x | total_sleep_records | total_minutes_asleep | total_time_in_bed | weekday_x | total_steps | total_distance | tracker_distance | logged_activities_distance | ... | weekday_y | activity_date_y | weight_kg | weight_pounds | fat | bmi | is_manual_report | log_id | weekday | report_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1503960366 | 2016-04-12 | 1.0 | 327.0 | 346.0 | Tuesday | 13162 | 8.50 | 8.50 | 0.0 | ... | Tuesday | 2016-05-02 23:59:59 | 52.599998 | 115.963147 | 22.0 | 22.650000 | True | 1.462234e+12 | Monday | Manual |
| 1 | 1503960366 | 2016-04-12 | 1.0 | 327.0 | 346.0 | Tuesday | 13162 | 8.50 | 8.50 | 0.0 | ... | Tuesday | 2016-05-03 23:59:59 | 52.599998 | 115.963147 | NaN | 22.650000 | True | 1.462320e+12 | Tuesday | Manual |
| 2 | 1927972279 | 2016-04-12 | 3.0 | 750.0 | 775.0 | Tuesday | 678 | 0.47 | 0.47 | 0.0 | ... | Tuesday | 2016-04-13 01:08:52 | 133.500000 | 294.317120 | NaN | 47.540001 | False | 1.460510e+12 | Wednesday | Automated |
| 3 | 2022484408 | 2016-04-12 | NaN | NaN | NaN | NaN | 11875 | 8.34 | 8.34 | 0.0 | ... | Tuesday | NaT | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 8877689391 | 2016-04-12 | NaN | NaN | NaN | NaN | 23186 | 20.40 | 20.40 | 0.0 | ... | Tuesday | 2016-04-12 06:47:11 | 85.800003 | 189.156628 | NaN | 25.680000 | False | 1.460444e+12 | Tuesday | Automated |
5 rows × 29 columns
combined_sleep_activity["id"].nunique()
combined_all_daily["id"].nunique()
33
33
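Because the weight logs are joined on `id` alone, every activity row is paired with every weight log of that user; the first two preview rows above repeat the same 2016-04-12 activity record for this reason. The sketch below, on hypothetical toy frames, contrasts that with a join on both user and calendar day. Truncating the weight timestamps with `dt.normalize()` is our assumption about how the days would be matched, not a step from the original notebook.

```python
import pandas as pd

# Hypothetical toy frames mirroring the column names used above.
activity = pd.DataFrame(
    {
        "id": ["a", "a"],
        "activity_date": pd.to_datetime(["2016-05-02", "2016-05-03"]),
        "total_steps": [13162, 10735],
    }
)
weight = pd.DataFrame(
    {
        "id": ["a", "a"],
        "activity_date": pd.to_datetime(["2016-05-02 23:59:59", "2016-05-03 23:59:59"]),
        "weight_kg": [52.6, 52.6],
    }
)

# Joining on id only: 2 activity rows x 2 weight rows = 4 rows.
by_id = activity.merge(weight, how="left", on="id")

# Truncate the weight timestamps to midnight so the dates can match exactly.
weight_daily = weight.assign(activity_date=weight["activity_date"].dt.normalize())
by_id_and_day = activity.merge(weight_daily, how="left", on=["id", "activity_date"])

print(len(by_id), len(by_id_and_day))  # prints: 4 2
```

Which join is appropriate depends on the question: the day-level join keeps one row per activity day, while the id-level join is only useful for per-user attributes.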
average_steps_overall = updated_daily_activity["total_steps"].mean()
avg_steps_day = updated_daily_activity.groupby(["weekday"])["total_steps"].mean().sort_values(ascending = False)
avg_steps_day
weekday
Saturday 8152.975806
Tuesday 8125.006579
Monday 7780.866667
Wednesday 7559.373333
Friday 7448.230159
Thursday 7405.836735
Sunday 6933.231405
Name: total_steps, dtype: float64
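The averages above are sorted by value, which is handy for ranking but scatters the weekdays on a chart. If calendar order is preferred, the series can be reindexed before plotting. This is a minimal sketch with values rounded from the output above; `week_order` is a name we introduce for illustration.

```python
import pandas as pd

# Averages keyed by weekday name, rounded from the output above.
avg_steps = pd.Series(
    {
        "Saturday": 8153,
        "Tuesday": 8125,
        "Monday": 7781,
        "Wednesday": 7559,
        "Friday": 7448,
        "Thursday": 7406,
        "Sunday": 6933,
    },
    name="total_steps",
)

week_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# reindex returns the same values arranged in calendar order.
avg_steps_by_calendar_day = avg_steps.reindex(week_order)
print(avg_steps_by_calendar_day)
```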
avg_steps_day.plot(kind="bar")
plt.title("Average Steps per Day")
plt.xlabel("Day")
plt.ylabel("Average Steps")
plt.axhline(10000, color='green')
[Figure: "Average Steps per Day"; x-axis: Day; y-axis: Average Steps]
#Average Sleep per Day
avg_sleep_day = updated_daily_sleep.groupby(["weekday"])["total_minutes_asleep"].mean().sort_values(ascending = False)
avg_sleep_day
weekday
Sunday 452.745455
Wednesday 434.681818
Monday 419.500000
Saturday 419.070175
Friday 405.421053
Tuesday 404.538462
Thursday 401.296875
Name: total_minutes_asleep, dtype: float64
avg_sleep_day.plot(kind="bar")
plt.title("Average Sleep per Day")
plt.xlabel("Day")
plt.ylabel("Average Minutes")
plt.axhline(480, color='green')
[Figure: "Average Sleep per Day"; x-axis: Day; y-axis: Average Minutes]
#Daily Weight Log
updated_weight_log['weekday'].value_counts().plot(kind='bar')
plt.title("Daily Weight Logs")
plt.xlabel("Day")
plt.ylabel("Total Log Count")
[Figure: "Daily Weight Logs"; x-axis: Day; y-axis: Total Log Count]
#average minutes per level
dataplot = updated_daily_activity[["weekday","sedentary_minutes","lightly_active_minutes", "fairly_active_minutes","very_active_minutes"]]
dataplot = dataplot.groupby('weekday').agg('mean')
dataplot.plot(kind="bar")
plt.title("Activity Intensity per Day")
plt.xlabel("Day")
plt.ylabel("Average Minutes")
[Figure: "Activity Intensity per Day"; x-axis: Day; y-axis: Average Minutes]
#relationship between total_time_in_bed and total_minutes_asleep
sns.regplot(x="total_time_in_bed", y="total_minutes_asleep", data=updated_daily_sleep, scatter_kws={"color":"black"}, line_kws={"color":"blue"})
plt.title("Time in Bed vs Time Asleep")
plt.ylabel("Asleep")
plt.xlabel("In Bed")
plt.axhline(480, color='green')
plt.text(600,800,'8 hour recommended', color='green')
plt.show()
[Figure: "Time in Bed vs Time Asleep"; x-axis: In Bed; y-axis: Asleep; annotation: 8 hour recommended]
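The gap between time in bed and time asleep can also be expressed per record. The sketch below uses the five sleep rows previewed earlier; the column names `minutes_awake_in_bed` and `sleep_efficiency` are our own illustrative additions and do not exist in the dataset.

```python
import pandas as pd

# The five rows shown in the raw sleepDay preview.
sleep = pd.DataFrame(
    {
        "total_minutes_asleep": [327, 384, 412, 340, 700],
        "total_time_in_bed": [346, 407, 442, 367, 712],
    }
)

# Minutes spent in bed but not asleep.
sleep["minutes_awake_in_bed"] = sleep["total_time_in_bed"] - sleep["total_minutes_asleep"]

# Fraction of time in bed that was spent asleep (illustrative metric).
sleep["sleep_efficiency"] = sleep["total_minutes_asleep"] / sleep["total_time_in_bed"]

print(sleep)
```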
#relationship between total_steps and sedentary_minutes
sns.regplot(x="total_steps", y="sedentary_minutes", data=updated_daily_activity, scatter_kws={"color":"black"}, line_kws={"color":"blue"})
plt.title("Steps vs Sedentary Minutes")
plt.xlabel("Total Steps")
plt.ylabel("Sedentary Minutes")
plt.show()
[Figure: "Steps vs Sedentary Minutes"; x-axis: Total Steps; y-axis: Sedentary Minutes]
#total steps vs total minutes asleep
sns.regplot(x="total_minutes_asleep", y="total_steps", data=combined_sleep_activity, scatter_kws={"color":"black"}, line_kws={"color":"blue"})
plt.title("Steps vs Sleep")
plt.xlabel("Minutes Asleep")
plt.ylabel("Total Steps")
plt.axvline(480, color='green')
plt.text(600,20000,'8 hour recommended', color='green')
plt.show()
[Figure: "Steps vs Sleep"; x-axis: Minutes Asleep; y-axis: Total Steps; annotation: 8 hour recommended]
#total minutes asleep vs calories
sns.regplot(x="total_minutes_asleep", y="calories", data=combined_sleep_activity, scatter_kws={"color":"black"}, line_kws={"color":"blue"})
plt.title("Sleep vs Calories")
plt.xlabel("Minutes Asleep")
plt.ylabel("Calories")
plt.axvline(480, color='green')
plt.text(600,4000,'8 hour recommended', color='green')
plt.show()
[Figure: "Sleep vs Calories"; x-axis: Minutes Asleep; y-axis: Calories; annotation: 8 hour recommended]
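The regression plots above show direction visually; a correlation matrix puts a number on each relationship. This is a minimal sketch on a hypothetical five-row frame (the values are illustrative, and the real columns would come from `combined_sleep_activity`).

```python
import pandas as pd

# Hypothetical daily records; values are illustrative only.
df = pd.DataFrame(
    {
        "total_minutes_asleep": [327, 384, 412, 340, 700],
        "total_time_in_bed": [346, 407, 442, 367, 712],
        "total_steps": [13162, 10735, 9762, 12669, 9705],
    }
)

# Pearson correlation for every pair of columns; missing values are
# excluded pair by pair, so NaN sleep rows would not break the result.
corr = df.corr(method="pearson")
print(corr.round(2))
```

A coefficient near 1 or -1 indicates a strong linear relationship, while values near 0 suggest the fitted line carries little information.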
#step count per user
user_total_steps = updated_daily_activity.groupby('id')[["total_steps"]].sum().sort_values('total_steps', ascending = False)
top_10_users = user_total_steps[:10]
user_avg_steps = updated_daily_activity.groupby('id')[["total_steps"]].mean().sort_values('total_steps', ascending = False)
top_10_users_mean = user_avg_steps[:10]
user_avg_steps.plot(kind="bar")
plt.title("Average Steps per User")
plt.xlabel("User")
plt.ylabel("Average Steps")
plt.axhline(10000, color='green')
[Figure: "Average Steps per User"; x-axis: User; y-axis: Average Steps]
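The 10,000-step reference line in the chart invites a simple follow-up: what share of users reach it on average? The sketch below computes that on a hypothetical frame; `STEP_GOAL` and the sample values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical daily records for three users.
daily = pd.DataFrame(
    {
        "id": ["a", "a", "b", "b", "c", "c"],
        "total_steps": [13162, 10735, 678, 1200, 11875, 9000],
    }
)
STEP_GOAL = 10_000  # same threshold as the green reference line

# Average steps per user, then the fraction of users at or above the goal.
user_avg = daily.groupby("id")["total_steps"].mean()
share_meeting_goal = (user_avg >= STEP_GOAL).mean()

print(f"{share_meeting_goal:.1%} of users average at least {STEP_GOAL:,} steps")
```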
#Participation
updated_daily_activity['id'].value_counts().plot(kind='bar')
plt.title("Physical Activity Participant Frequency")
plt.xlabel("User ID")
plt.ylabel("Total Log Count")
[Figure: "Physical Activity Participant Frequency"; x-axis: User ID; y-axis: Total Log Count]
updated_daily_sleep['id'].value_counts().plot(kind='bar')
plt.title("Sleep Participant Frequency")
plt.xlabel("User ID")
plt.ylabel("Total Log Count")
[Figure: "Sleep Participant Frequency"; x-axis: User ID; y-axis: Total Log Count]
updated_weight_log['id'].value_counts().plot(kind='bar')
plt.title("Weight Log Participant Frequency")
plt.xlabel("User ID")
plt.ylabel("Total Log Count")
[Figure: "Weight Log Participant Frequency"; x-axis: User ID; y-axis: Total Log Count]
#all participation activity
record_type = ["Physical", "Sleep", "Weight"]
distinct_user_count = [updated_daily_activity["id"].nunique(),updated_daily_sleep["id"].nunique(),updated_weight_log["id"].nunique()]
fig1, ax1 = plt.subplots()
ax1.pie(distinct_user_count, labels=record_type, autopct='%1.1f%%', startangle=90)
ax1.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
ax1.set_title("All Participation Activity")
plt.show()
[Figure: pie chart "All Participation Activity"; Physical 50.8%, Sleep 36.9%, Weight 12.3%]
#percent of total records of 0 logged activities
manual_physical = np.sum(updated_daily_activity["logged_activities_distance"] != 0)
automated_physical = np.sum(updated_daily_activity["logged_activities_distance"] == 0)
record_type_count = [manual_physical,automated_physical]
labels = 'Manual', 'Automated'
fig1, ax1 = plt.subplots()
ax1.pie(record_type_count, labels=labels, autopct='%1.1f%%')
ax1.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
ax1.set_title("Manual vs Automated Physical Activity Logs")
plt.show()
[Figure: pie chart "Manual vs Automated Physical Activity Logs"; Manual 3.4%, Automated 96.6%]
#Shows participants log days
hue_order = ['Manual', 'Automated']
sns.scatterplot(x="activity_date", y="id", data=updated_weight_log, hue = "report_type", hue_order=hue_order )
plt.title("Weight Log Consistency")
plt.xlabel("Date of Log")
plt.ylabel("User ID")
[Figure: "Weight Log Consistency"; x-axis: Date of Log; y-axis: User ID; color: report type]
g = sns.FacetGrid(updated_weight_log, col ="id", col_wrap=4, col_order = updated_weight_log['id'].value_counts().index)
g.map(sns.lineplot, "activity_date", "weight_pounds")
g.set_xticklabels(rotation=45)
plt.title("Weight Change")
plt.xlabel("Date of Log")
plt.ylabel("Weight in Pounds")
[Figure: "Weight Change", one panel per user; x-axis: Date of Log; y-axis: Weight in Pounds]
#Steps over Time
sns.lineplot(data = updated_daily_activity, x="activity_date", y="total_steps")
plt.title("User Activity")
plt.xlabel("Date")
plt.ylabel("Steps Taken")
plt.xticks(rotation=45)
plt.show()
[Figure: "User Activity"; x-axis: Date; y-axis: Steps Taken]
#Steps over Time, one line per user
sns.lineplot(data = updated_daily_activity, x="activity_date", y="total_steps", hue="id")
plt.title("User Activity")
plt.xlabel("Date")
plt.ylabel("Steps Taken")
plt.xticks(rotation=45)
plt.show()
[Figure: "User Activity", one line per user; x-axis: Date; y-axis: Steps Taken]
To meet the goal of empowering women with knowledge about their own health and habits, we recommend that marketing either:
Move forward with recommendations based on the insights gained from the limited data
or
Expand upon the limited data and timeframe by conducting another survey that gathers specific, categorical data targeting the female audience, including questions about the specific goals respondents have for their health and fitness tracking.