Bellabeat is a high-tech company founded by Urška Sršen and Sando Mur that manufactures health-focused smart products for women. In this case study, I work for Urška Sršen and have been asked to focus on one of Bellabeat's products and analyze smart device data to gain insight into how consumers use their smart devices.
To help Bellabeat unlock new growth opportunities, Urška Sršen has asked me to analyze available consumer data. For this scenario I use the Fitbit Fitness Tracker Data dataset (Public Domain, made available through Mobius). It contains personal fitness tracker data from thirty Fitbit users who consented to the submission of their data, including minute-level output for physical activity, heart rate, and sleep monitoring. It also includes information about daily activity, steps, and heart rate that can be used to explore users' habits.
As this dataset has more than 10 CSV files available, Sršen has encouraged me to use as many of them as I can to aid my research.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import missingno as msno
import seaborn as sns
The datasets are gathered internally by Fitbit, so I assume the data to be unbiased and credible. I have also looked at the datasets in spreadsheet software, and there do not seem to be any problems other than the date columns not being stored in a proper date format. I will convert them to datetime when the need arises.
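A minimal sketch of that conversion, assuming the two timestamp layouts that appear in the previews further down (`4/12/2016` in the daily files and `4/12/2016 12:00:00 AM` in the hourly ones). The two small frames here are illustrative stand-ins, not the real files.

```python
import pandas as pd

# Illustrative stand-ins for a daily and an hourly file
daily = pd.DataFrame({'ActivityDay': ['4/12/2016', '4/13/2016']})
hourly = pd.DataFrame({'ActivityHour': ['4/12/2016 12:00:00 AM', '4/12/2016 1:00:00 PM']})

# Daily files carry a date only
daily['ActivityDay'] = pd.to_datetime(daily['ActivityDay'], format='%m/%d/%Y')
# Hourly files add a 12-hour clock time with an AM/PM marker
hourly['ActivityHour'] = pd.to_datetime(hourly['ActivityHour'], format='%m/%d/%Y %I:%M:%S %p')

print(daily.dtypes)
print(hourly)
```

Passing an explicit `format` avoids any ambiguity between month-first and day-first dates.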
pd.options.display.width = None
pd.options.display.max_columns = None
dailyActivity = pd.read_csv('Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv')
dailyCalories = pd.read_csv('Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv')
minuteSleep = pd.read_csv('Fitabase Data 4.12.16-5.12.16/minuteSleep_merged.csv')
dailySteps = pd.read_csv('Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv')
heartrateSeconds = pd.read_csv('Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv')
hourlyCalories = pd.read_csv('Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv')
hourlyIntensities = pd.read_csv('Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv')
hourlySteps = pd.read_csv('Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv')
sleepDay = pd.read_csv('Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv')
stepsDf = pd.read_csv('Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv')
weightLog = pd.read_csv('Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv')
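The repeated `read_csv` calls could also be written as a loop. This is only a sketch: the helper name `load_csvs` and the dictionary layout are my own assumptions, and the keys follow the file names (for example `heartrate_seconds`) rather than the variable names used in this notebook.

```python
from pathlib import Path

import pandas as pd


def load_csvs(folder, suffix='_merged.csv'):
    """Read every CSV in `folder` into a dict keyed by the file name without `suffix`."""
    frames = {}
    for path in sorted(Path(folder).glob(f'*{suffix}')):
        key = path.name[:-len(suffix)]  # 'dailyActivity_merged.csv' -> 'dailyActivity'
        frames[key] = pd.read_csv(path)
    return frames


# Usage, with the folder name taken from the cell above:
# data = load_csvs('Fitabase Data 4.12.16-5.12.16')
# dailyActivity = data['dailyActivity']
```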
hourlySteps.head()
| | Id | ActivityHour | StepTotal |
---|---|---|---|
0 | 1503960366 | 4/12/2016 12:00:00 AM | 373 |
1 | 1503960366 | 4/12/2016 1:00:00 AM | 160 |
2 | 1503960366 | 4/12/2016 2:00:00 AM | 151 |
3 | 1503960366 | 4/12/2016 3:00:00 AM | 0 |
4 | 1503960366 | 4/12/2016 4:00:00 AM | 0 |
hourlyIntensities.head()
| | Id | ActivityHour | TotalIntensity | AverageIntensity |
---|---|---|---|---|
0 | 1503960366 | 4/12/2016 12:00:00 AM | 20 | 0.333333 |
1 | 1503960366 | 4/12/2016 1:00:00 AM | 8 | 0.133333 |
2 | 1503960366 | 4/12/2016 2:00:00 AM | 7 | 0.116667 |
3 | 1503960366 | 4/12/2016 3:00:00 AM | 0 | 0.000000 |
4 | 1503960366 | 4/12/2016 4:00:00 AM | 0 | 0.000000 |
heartrateSeconds.head()
| | Id | Time | Value |
---|---|---|---|
0 | 2022484408 | 4/12/2016 7:21:00 AM | 97 |
1 | 2022484408 | 4/12/2016 7:21:05 AM | 102 |
2 | 2022484408 | 4/12/2016 7:21:10 AM | 105 |
3 | 2022484408 | 4/12/2016 7:21:20 AM | 103 |
4 | 2022484408 | 4/12/2016 7:21:25 AM | 101 |
dailySteps.head()
| | Id | ActivityDay | StepTotal |
---|---|---|---|
0 | 1503960366 | 4/12/2016 | 13162 |
1 | 1503960366 | 4/13/2016 | 10735 |
2 | 1503960366 | 4/14/2016 | 10460 |
3 | 1503960366 | 4/15/2016 | 9762 |
4 | 1503960366 | 4/16/2016 | 12669 |
dailyCalories.head()
| | Id | ActivityDay | Calories |
---|---|---|---|
0 | 1503960366 | 4/12/2016 | 1985 |
1 | 1503960366 | 4/13/2016 | 1797 |
2 | 1503960366 | 4/14/2016 | 1776 |
3 | 1503960366 | 4/15/2016 | 1745 |
4 | 1503960366 | 4/16/2016 | 1863 |
# sleepDay.shape (413, 5)
sleepDay.head()
| | Id | SleepDay | TotalSleepRecords | TotalMinutesAsleep | TotalTimeInBed |
---|---|---|---|---|---|
0 | 1503960366 | 4/12/2016 12:00:00 AM | 1 | 327 | 346 |
1 | 1503960366 | 4/13/2016 12:00:00 AM | 2 | 384 | 407 |
2 | 1503960366 | 4/15/2016 12:00:00 AM | 1 | 412 | 442 |
3 | 1503960366 | 4/16/2016 12:00:00 AM | 2 | 340 | 367 |
4 | 1503960366 | 4/17/2016 12:00:00 AM | 1 | 700 | 712 |
minuteSleep.head()
| | Id | date | value | logId |
---|---|---|---|---|
0 | 1503960366 | 4/12/2016 2:47:30 AM | 3 | 11380564589 |
1 | 1503960366 | 4/12/2016 2:48:30 AM | 2 | 11380564589 |
2 | 1503960366 | 4/12/2016 2:49:30 AM | 1 | 11380564589 |
3 | 1503960366 | 4/12/2016 2:50:30 AM | 1 | 11380564589 |
4 | 1503960366 | 4/12/2016 2:51:30 AM | 1 | 11380564589 |
dailyActivity.head()
| | Id | ActivityDate | TotalSteps | TotalDistance | TrackerDistance | LoggedActivitiesDistance | VeryActiveDistance | ModeratelyActiveDistance | LightActiveDistance | SedentaryActiveDistance | VeryActiveMinutes | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes | Calories |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1503960366 | 4/12/2016 | 13162 | 8.50 | 8.50 | 0.0 | 1.88 | 0.55 | 6.06 | 0.0 | 25 | 13 | 328 | 728 | 1985 |
1 | 1503960366 | 4/13/2016 | 10735 | 6.97 | 6.97 | 0.0 | 1.57 | 0.69 | 4.71 | 0.0 | 21 | 19 | 217 | 776 | 1797 |
2 | 1503960366 | 4/14/2016 | 10460 | 6.74 | 6.74 | 0.0 | 2.44 | 0.40 | 3.91 | 0.0 | 30 | 11 | 181 | 1218 | 1776 |
3 | 1503960366 | 4/15/2016 | 9762 | 6.28 | 6.28 | 0.0 | 2.14 | 1.26 | 2.83 | 0.0 | 29 | 34 | 209 | 726 | 1745 |
4 | 1503960366 | 4/16/2016 | 12669 | 8.16 | 8.16 | 0.0 | 2.71 | 0.41 | 5.04 | 0.0 | 36 | 10 | 221 | 773 | 1863 |
stepsDf.head()
| | Id | ActivityHour | StepTotal |
---|---|---|---|
0 | 1503960366 | 4/12/2016 12:00:00 AM | 373 |
1 | 1503960366 | 4/12/2016 1:00:00 AM | 160 |
2 | 1503960366 | 4/12/2016 2:00:00 AM | 151 |
3 | 1503960366 | 4/12/2016 3:00:00 AM | 0 |
4 | 1503960366 | 4/12/2016 4:00:00 AM | 0 |
hourlyCalories.head()
| | Id | ActivityHour | Calories |
---|---|---|---|
0 | 1503960366 | 4/12/2016 12:00:00 AM | 81 |
1 | 1503960366 | 4/12/2016 1:00:00 AM | 61 |
2 | 1503960366 | 4/12/2016 2:00:00 AM | 59 |
3 | 1503960366 | 4/12/2016 3:00:00 AM | 47 |
4 | 1503960366 | 4/12/2016 4:00:00 AM | 48 |
weightLog.head()
| | Id | Date | WeightKg | WeightPounds | Fat | BMI | IsManualReport | LogId |
---|---|---|---|---|---|---|---|---|
0 | 1503960366 | 5/2/2016 11:59:59 PM | 52.599998 | 115.963147 | 22.0 | 22.650000 | True | 1462233599000 |
1 | 1503960366 | 5/3/2016 11:59:59 PM | 52.599998 | 115.963147 | NaN | 22.650000 | True | 1462319999000 |
2 | 1927972279 | 4/13/2016 1:08:52 AM | 133.500000 | 294.317120 | NaN | 47.540001 | False | 1460509732000 |
3 | 2873212765 | 4/21/2016 11:59:59 PM | 56.700001 | 125.002104 | NaN | 21.450001 | True | 1461283199000 |
4 | 2873212765 | 5/12/2016 11:59:59 PM | 57.299999 | 126.324875 | NaN | 21.690001 | True | 1463097599000 |
With an understanding of the available datasets, I prepared these questions to guide the analysis phase.
#convert to datetime
dailyActivity['ActivityDate'] = pd.to_datetime(dailyActivity['ActivityDate'])
# add a new column that indicates the day's name
dailyActivity['day_name'] = dailyActivity['ActivityDate'].dt.day_name()
# add a new column that indicates weekday/weekend
dailyActivity['weekend_weekday'] = np.where(dailyActivity['ActivityDate'].dt.dayofweek > 4, 'Weekend', 'Weekday')
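As a quick sanity check of the weekday/weekend rule, here is a small sketch on three dates that also appear in the previews (12, 16 and 17 April 2016). The frame `check` is a stand-in for illustration only.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(pd.Series(['2016-04-12', '2016-04-16', '2016-04-17']))

# dayofweek runs from 0 (Monday) to 6 (Sunday), so values above 4 are Saturday and Sunday
labels = np.where(dates.dt.dayofweek > 4, 'Weekend', 'Weekday')

check = pd.DataFrame({
    'date': dates,
    'day_name': dates.dt.day_name(),
    'weekend_weekday': labels,
})
print(check)
```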
# steps taken summary
print("Total Steps Summary")
steps = dailyActivity['TotalSteps'].describe()
print(steps)
print('\n')
# byMinutes summary
minutes = dailyActivity[['VeryActiveMinutes', 'FairlyActiveMinutes', 'LightlyActiveMinutes', 'SedentaryMinutes']].describe()
print(minutes)
print('\n')
# Calories burned and total distance summary
calories = dailyActivity[['Calories', 'TotalDistance']].describe()
print(calories)
print('\n')
#Sleep Records, weight, and bmi
print(sleepDay[['TotalSleepRecords', 'TotalMinutesAsleep', 'TotalTimeInBed']].describe())
print('\n')
print(weightLog[['BMI', 'WeightKg']].describe())
Total Steps Summary

| | TotalSteps |
---|---|
count | 940.000000 |
mean | 7637.910638 |
std | 5087.150742 |
min | 0.000000 |
25% | 3789.750000 |
50% | 7405.500000 |
75% | 10727.000000 |
max | 36019.000000 |

| | VeryActiveMinutes | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes |
---|---|---|---|---|
count | 940.000000 | 940.000000 | 940.000000 | 940.000000 |
mean | 21.164894 | 13.564894 | 192.812766 | 991.210638 |
std | 32.844803 | 19.987404 | 109.174700 | 301.267437 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 0.000000 | 127.000000 | 729.750000 |
50% | 4.000000 | 6.000000 | 199.000000 | 1057.500000 |
75% | 32.000000 | 19.000000 | 264.000000 | 1229.500000 |
max | 210.000000 | 143.000000 | 518.000000 | 1440.000000 |

| | Calories | TotalDistance |
---|---|---|
count | 940.000000 | 940.000000 |
mean | 2303.609574 | 5.489702 |
std | 718.166862 | 3.924606 |
min | 0.000000 | 0.000000 |
25% | 1828.500000 | 2.620000 |
50% | 2134.000000 | 5.245000 |
75% | 2793.250000 | 7.712500 |
max | 4900.000000 | 28.030001 |

| | TotalSleepRecords | TotalMinutesAsleep | TotalTimeInBed |
---|---|---|---|
count | 413.000000 | 413.000000 | 413.000000 |
mean | 1.118644 | 419.467312 | 458.639225 |
std | 0.345521 | 118.344679 | 127.101607 |
min | 1.000000 | 58.000000 | 61.000000 |
25% | 1.000000 | 361.000000 | 403.000000 |
50% | 1.000000 | 433.000000 | 463.000000 |
75% | 1.000000 | 490.000000 | 526.000000 |
max | 3.000000 | 796.000000 | 961.000000 |

| | BMI | WeightKg |
---|---|---|
count | 67.000000 | 67.000000 |
mean | 25.185224 | 72.035821 |
std | 3.066963 | 13.923206 |
min | 21.450001 | 52.599998 |
25% | 23.959999 | 61.400002 |
50% | 24.389999 | 62.500000 |
75% | 25.559999 | 85.049999 |
max | 47.540001 | 133.500000 |
I recommend that Bellabeat program their smart devices to inform users of the ideal total number of steps to take each day.
byMinute = dailyActivity.pivot_table(
values = ['VeryActiveMinutes', 'FairlyActiveMinutes', 'LightlyActiveMinutes', 'SedentaryMinutes'],
index = 'day_name'
)
byMinute
day_name | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes | VeryActiveMinutes |
---|---|---|---|---|
Friday | 12.111111 | 204.198413 | 1000.309524 | 20.055556 |
Monday | 14.000000 | 192.058333 | 1027.941667 | 23.108333 |
Saturday | 15.201613 | 207.145161 | 964.282258 | 21.919355 |
Sunday | 14.528926 | 173.975207 | 990.256198 | 19.983471 |
Thursday | 11.959184 | 185.421769 | 961.993197 | 19.408163 |
Tuesday | 14.335526 | 197.342105 | 1007.361842 | 22.953947 |
Wednesday | 13.100000 | 189.853333 | 989.480000 | 20.780000 |
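The pivot table lists the days alphabetically, which makes week-long patterns harder to read. One possible way to put the rows in calendar order is `reindex`. The sketch below uses a stand-in frame with the VeryActiveMinutes values copied from the table above; `by_day` and `day_order` are names I made up for the example.

```python
import pandas as pd

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Stand-in for a table indexed by day_name, in the alphabetical order pivot_table produces
by_day = pd.DataFrame(
    {'VeryActiveMinutes': [20.055556, 23.108333, 21.919355, 19.983471,
                           19.408163, 22.953947, 20.780000]},
    index=pd.Index(['Friday', 'Monday', 'Saturday', 'Sunday',
                    'Thursday', 'Tuesday', 'Wednesday'], name='day_name'),
)

# reindex returns the same rows in the order given by day_order
by_day_sorted = by_day.reindex(day_order)
print(by_day_sorted)
```

The same call should work on `byMinute`, `byDistance` and `caloriesGrouped`, since all three are indexed by `day_name`.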
byDistance = dailyActivity.pivot_table(
values = ['VeryActiveDistance', 'ModeratelyActiveDistance', 'LightActiveDistance', 'SedentaryActiveDistance'],
index = 'day_name'
)
byDistance
day_name | LightActiveDistance | ModeratelyActiveDistance | SedentaryActiveDistance | VeryActiveDistance |
---|---|---|---|---|
Friday | 3.489127 | 0.483810 | 0.001825 | 1.312937 |
Monday | 3.363083 | 0.585833 | 0.002583 | 1.537333 |
Saturday | 3.617177 | 0.677339 | 0.001048 | 1.514597 |
Sunday | 2.892314 | 0.618017 | 0.000661 | 1.488926 |
Thursday | 3.283129 | 0.505170 | 0.002313 | 1.390476 |
Tuesday | 3.471053 | 0.593026 | 0.001447 | 1.613289 |
Wednesday | 3.256333 | 0.527067 | 0.001333 | 1.633467 |
byMinute.plot(kind = 'bar', figsize = [17,7])
byDistance.plot(kind = 'bar', figsize = [17,7])
print('The Most Active Based on the Type of Activity (Fairly, Lightly, etc)')
print(byMinute.idxmax())
print('\n')
print('The Least Active Based on the Type of Activity (Fairly, Lightly, etc)')
print(byMinute.idxmin())
The Most Active Based on the Type of Activity (Fairly, Lightly, etc)

| | Day |
---|---|
FairlyActiveMinutes | Saturday |
LightlyActiveMinutes | Saturday |
SedentaryMinutes | Monday |
VeryActiveMinutes | Monday |

The Least Active Based on the Type of Activity (Fairly, Lightly, etc)

| | Day |
---|---|
FairlyActiveMinutes | Thursday |
LightlyActiveMinutes | Sunday |
SedentaryMinutes | Thursday |
VeryActiveMinutes | Thursday |
print('The Most Active Based On Distance')
print(byDistance.idxmax())
print('\n')
print('The Least Active Based On Distance')
print(byDistance.idxmin())
The Most Active Based On Distance

| | Day |
---|---|
LightActiveDistance | Saturday |
ModeratelyActiveDistance | Saturday |
SedentaryActiveDistance | Monday |
VeryActiveDistance | Wednesday |

The Least Active Based On Distance

| | Day |
---|---|
LightActiveDistance | Sunday |
ModeratelyActiveDistance | Friday |
SedentaryActiveDistance | Sunday |
VeryActiveDistance | Friday |
caloriesGrouped = dailyActivity.pivot_table(
values = 'Calories',
index = 'day_name'
)
caloriesGrouped
day_name | Calories |
---|---|
Friday | 2331.785714 |
Monday | 2324.208333 |
Saturday | 2354.967742 |
Sunday | 2263.000000 |
Thursday | 2199.571429 |
Tuesday | 2356.013158 |
Wednesday | 2302.620000 |
fig_dims = (15, 10)
fig, ax = plt.subplots(figsize=fig_dims)
sns.set_style("darkgrid")
reg = sns.regplot(
data = dailyActivity,
x = 'TotalSteps',
y = 'Calories',
ax = ax,
color = 'blue'
)
reg.set_title('Relationship Between Total Steps Taken and Calories Burned')
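The regression line in this plot can also be expressed as numbers. The sketch below fits a straight line with `np.polyfit` on the five preview rows of `dailyActivity` shown earlier. These five rows are only a stand-in, so the resulting slope is an illustration and not a finding for the full table.

```python
import numpy as np
import pandas as pd

# The five preview rows of dailyActivity, not the full 940-row table
preview = pd.DataFrame({
    'TotalSteps': [13162, 10735, 10460, 9762, 12669],
    'Calories': [1985, 1797, 1776, 1745, 1863],
})

# Degree-1 least-squares fit: Calories is approximated by slope * TotalSteps + intercept
slope, intercept = np.polyfit(preview['TotalSteps'], preview['Calories'], 1)
print(f'calories per extra step: {slope:.4f}, intercept: {intercept:.0f}')
```

Running the same fit on the full `dailyActivity` frame would give the slope of the line that `regplot` draws.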
# Ensuring proper formatting and creating a new column in order to assist in the joining process
sleepDay['SleepDay'] = pd.to_datetime(sleepDay['SleepDay'])
sleepDay['date'] = pd.to_datetime(sleepDay['SleepDay'])
dailyActivity['date'] = pd.to_datetime(dailyActivity['ActivityDate'])
# sleepDay[['TotalSleepRecords', 'TotalMinutesAsleep', 'TotalTimeInBed']]
#join sleep data with daily activity
joinedDf = dailyActivity.merge(sleepDay, on = ['Id', 'date'])
joinedDf.head()
| | Id | ActivityDate | TotalSteps | TotalDistance | TrackerDistance | LoggedActivitiesDistance | VeryActiveDistance | ModeratelyActiveDistance | LightActiveDistance | SedentaryActiveDistance | VeryActiveMinutes | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes | Calories | day_name | weekend_weekday | date | SleepDay | TotalSleepRecords | TotalMinutesAsleep | TotalTimeInBed |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1503960366 | 2016-04-12 | 13162 | 8.50 | 8.50 | 0.0 | 1.88 | 0.55 | 6.06 | 0.0 | 25 | 13 | 328 | 728 | 1985 | Tuesday | Weekday | 2016-04-12 | 2016-04-12 | 1 | 327 | 346 |
1 | 1503960366 | 2016-04-13 | 10735 | 6.97 | 6.97 | 0.0 | 1.57 | 0.69 | 4.71 | 0.0 | 21 | 19 | 217 | 776 | 1797 | Wednesday | Weekday | 2016-04-13 | 2016-04-13 | 2 | 384 | 407 |
2 | 1503960366 | 2016-04-15 | 9762 | 6.28 | 6.28 | 0.0 | 2.14 | 1.26 | 2.83 | 0.0 | 29 | 34 | 209 | 726 | 1745 | Friday | Weekday | 2016-04-15 | 2016-04-15 | 1 | 412 | 442 |
3 | 1503960366 | 2016-04-16 | 12669 | 8.16 | 8.16 | 0.0 | 2.71 | 0.41 | 5.04 | 0.0 | 36 | 10 | 221 | 773 | 1863 | Saturday | Weekend | 2016-04-16 | 2016-04-16 | 2 | 340 | 367 |
4 | 1503960366 | 2016-04-17 | 9705 | 6.48 | 6.48 | 0.0 | 3.19 | 0.78 | 2.51 | 0.0 | 38 | 20 | 164 | 539 | 1728 | Sunday | Weekend | 2016-04-17 | 2016-04-17 | 1 | 700 | 712 |
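The merge above keeps only the days that have both an activity record and a sleep record. To see how many activity days have no sleep record, one option is a left merge with `indicator=True`. The frames below are made-up toy data, and `validate='one_to_one'` is an extra check I am assuming is wanted here: it raises an error if either side repeats an (Id, date) pair.

```python
import pandas as pd

# Made-up toy frames that only mimic the shape of dailyActivity and sleepDay
activity = pd.DataFrame({
    'Id': [1, 1, 2],
    'date': pd.to_datetime(['2016-04-12', '2016-04-13', '2016-04-12']),
    'TotalSteps': [13162, 10735, 4000],
})
sleep = pd.DataFrame({
    'Id': [1, 2],
    'date': pd.to_datetime(['2016-04-12', '2016-04-12']),
    'TotalMinutesAsleep': [327, 400],
})

# how='left' keeps activity days without sleep data; indicator adds a '_merge' column
checked = activity.merge(sleep, on=['Id', 'date'], how='left',
                         indicator=True, validate='one_to_one')
unmatched = (checked['_merge'] == 'left_only').sum()
print(checked)
print('activity days without a sleep record:', unmatched)
```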
dailyActivity.groupby('weekend_weekday')['Calories'].describe()
weekend_weekday | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
Weekday | 695.0 | 2301.516547 | 704.675507 | 0.0 | 1841.0 | 2159.0 | 2799.5 | 4900.0 |
Weekend | 245.0 | 2309.546939 | 756.589860 | 0.0 | 1792.0 | 2096.0 | 2739.0 | 4552.0 |
joinedDf.groupby('weekend_weekday')[['TotalSleepRecords', 'TotalMinutesAsleep', 'TotalTimeInBed']].describe()
TotalSleepRecords

weekend_weekday | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
Weekday | 300.0 | 1.093333 | 0.313501 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 |
Weekend | 113.0 | 1.185841 | 0.412931 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 |

TotalMinutesAsleep

weekend_weekday | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
Weekday | 300.0 | 413.106667 | 103.459145 | 59.0 | 362.5 | 426.0 | 472.5 | 796.0 |
Weekend | 113.0 | 436.353982 | 150.162288 | 58.0 | 361.0 | 465.0 | 530.0 | 775.0 |

TotalTimeInBed

weekend_weekday | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
Weekday | 300.0 | 449.903333 | 108.342949 | 65.0 | 406.0 | 458.0 | 506.0 | 961.0 |
Weekend | 113.0 | 481.831858 | 165.356428 | 61.0 | 402.0 | 501.0 | 575.0 | 961.0 |
sleep = sns.relplot(
data = joinedDf,
x = 'TotalMinutesAsleep',
y = 'TotalTimeInBed',
col = 'weekend_weekday',
hue = 'TotalSleepRecords',
height = 7
)
sleep.fig.suptitle('Total Minutes Asleep Vs Total Time in Bed', y = 1.03)
sleep.set_titles("Week: {col_name}")
plt.show()
# keep the minutes as plain floats; the timedelta round-trip via astype('timedelta64[m]') is not needed and may fail on newer pandas versions
sleepDay['TotalMinutesAsleep'] = sleepDay['TotalMinutesAsleep'].astype(float)
#convert to hours
sleepDay['TotalHoursAsleep'] = sleepDay['TotalMinutesAsleep'] / 60
# round to the nearest hundredth
sleepDay['TotalHoursAsleep'] = sleepDay['TotalHoursAsleep'].round(2)
#add sleep status
sleepDay['SleepStatus'] = np.where(sleepDay['TotalHoursAsleep'] < 7, 'Bad Sleepers',
(np.where( sleepDay['TotalHoursAsleep'] > 9, 'Oversleepers', 'Normal Sleepers'))
)
#used nested np.where calls. link for future me's reference: https://stackoverflow.com/questions/39109045/numpy-where-with-multiple-conditions
#np.select article by dataquest: https://www.dataquest.io/blog/tutorial-add-column-pandas-dataframe-based-on-if-else-condition/
sleepDay['day_name'] = sleepDay['SleepDay'].dt.day_name()
sleepDay['weekend_weekday'] = np.where(sleepDay['SleepDay'].dt.dayofweek > 4, 'Weekend', 'Weekday')
sleepDay.head()
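The nested `np.where` above can also be written with `np.select`, which the linked article covers. This is a minimal sketch with the same thresholds (under 7 hours, over 9 hours); the `hours` values are invented to cover each case, including the boundaries.

```python
import numpy as np
import pandas as pd

# Invented values, chosen to hit each branch and both boundaries
hours = pd.Series([5.45, 7.0, 8.5, 9.0, 11.67])

conditions = [hours < 7, hours > 9]
choices = ['Bad Sleepers', 'Oversleepers']

# Rows that match neither condition fall through to the default label
status = np.select(conditions, choices, default='Normal Sleepers')
print(status)
```

With more than two or three categories, a list of conditions tends to be easier to read than nested calls.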
#Overview
fig_dims = (20, 15)
fig, ax = plt.subplots(figsize = fig_dims)
overview = sns.countplot(
data = sleepDay,
x = sleepDay['SleepStatus'],
hue = 'weekend_weekday',
ax = ax
)
overview.set_title('Overview Look of Sleep Statuses', fontsize = 20)
overview.set_xlabel('Sleep Status', fontsize = 20)
overview.set_ylabel('Total', fontsize = 20)
plt.xticks(size = 20)
plt.yticks(size = 20)
plt.show()
#by Day of the week
fig_dims = (20, 15)
fig, ax = plt.subplots(figsize = fig_dims)
overview2 = sns.countplot(
data = sleepDay,
x = sleepDay['SleepStatus'],
hue = 'day_name',
ax = ax
)
overview2.set_title('Overview Look of Sleep Statuses by Day of The Week', fontsize = 20)
overview2.set_xlabel('Sleep Status', fontsize = 20)
overview2.set_ylabel('Total', fontsize = 20)
plt.xticks(size = 20)
plt.yticks(size = 20)
plt.show()
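The counts behind these two charts can be put in a table with `pd.crosstab`. The sketch below uses a small invented frame with the same two columns as `sleepDay`. Passing `normalize='index'` turns each row into shares, which may be the fairer comparison when some days have more sleep records than others.

```python
import pandas as pd

# Invented rows that only mimic the day_name and SleepStatus columns of sleepDay
toy = pd.DataFrame({
    'day_name': ['Sunday', 'Sunday', 'Tuesday', 'Tuesday', 'Tuesday'],
    'SleepStatus': ['Oversleepers', 'Normal Sleepers',
                    'Bad Sleepers', 'Bad Sleepers', 'Normal Sleepers'],
})

# Raw counts per day and status
counts = pd.crosstab(toy['day_name'], toy['SleepStatus'])
# Each row divided by its total, so the shares in a row sum to 1
shares = pd.crosstab(toy['day_name'], toy['SleepStatus'], normalize='index')
print(counts)
print(shares)
```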
# There's 2,483,658 rows!
heartrateSeconds.shape
print(heartrateSeconds.head())
summary = round(heartrateSeconds['Value'].describe())
print('\n')
print(summary)
unique = heartrateSeconds['Id'].nunique() #there are only 14 unique users
print('\nNumber of Users in This Dataset: ' + str(unique))
| | Id | Time | Value |
---|---|---|---|
0 | 2022484408 | 4/12/2016 7:21:00 AM | 97 |
1 | 2022484408 | 4/12/2016 7:21:05 AM | 102 |
2 | 2022484408 | 4/12/2016 7:21:10 AM | 105 |
3 | 2022484408 | 4/12/2016 7:21:20 AM | 103 |
4 | 2022484408 | 4/12/2016 7:21:25 AM | 101 |

| | Value |
---|---|
count | 2483658.0 |
mean | 77.0 |
std | 19.0 |
min | 36.0 |
25% | 63.0 |
50% | 73.0 |
75% | 88.0 |
max | 203.0 |

Number of Users in This Dataset: 14
#Ensure proper time formatting
heartrateSeconds['Time'] = pd.to_datetime(heartrateSeconds['Time'], format = '%m/%d/%Y %I:%M:%S %p')
mean = heartrateSeconds.groupby(heartrateSeconds['Time'].dt.day_name())['Value'].mean()
mean
Time | Value |
---|---|
Friday | 77.520836 |
Monday | 77.454335 |
Saturday | 79.973815 |
Sunday | 75.925004 |
Thursday | 77.035902 |
Tuesday | 77.013723 |
Wednesday | 76.451580 |
mean = heartrateSeconds.groupby(heartrateSeconds['Time'].dt.hour)['Value'].mean()
fig_dims = (23, 13)
fig, ax = plt.subplots(figsize=fig_dims)
# create temporary dataframe
tempDf = pd.DataFrame(mean)
tempDf
t = sns.barplot(
data = tempDf,
x = tempDf.index,
y = 'Value'
)
t.set_xlabel('Time (Hour)', fontsize = 20)
t.set_ylabel('Average HeartRate', fontsize = 20)
t.set_title('Average Heart-Rate Per Minute Based on Time (Hour)', fontsize = 20)
#rejoining a new df as sleepDay was altered
rejoined = dailyActivity.merge(sleepDay, on = ['Id', 'date'])
rejoined.head()
| | Id | ActivityDate | TotalSteps | TotalDistance | TrackerDistance | LoggedActivitiesDistance | VeryActiveDistance | ModeratelyActiveDistance | LightActiveDistance | SedentaryActiveDistance | VeryActiveMinutes | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes | Calories | day_name_x | weekend_weekday_x | date | SleepDay | TotalSleepRecords | TotalMinutesAsleep | TotalTimeInBed | TotalHoursAsleep | SleepStatus | day_name_y | weekend_weekday_y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1503960366 | 2016-04-12 | 13162 | 8.50 | 8.50 | 0.0 | 1.88 | 0.55 | 6.06 | 0.0 | 25 | 13 | 328 | 728 | 1985 | Tuesday | Weekday | 2016-04-12 | 2016-04-12 | 1 | 327.0 | 346 | 5.45 | Bad Sleepers | Tuesday | Weekday |
1 | 1503960366 | 2016-04-13 | 10735 | 6.97 | 6.97 | 0.0 | 1.57 | 0.69 | 4.71 | 0.0 | 21 | 19 | 217 | 776 | 1797 | Wednesday | Weekday | 2016-04-13 | 2016-04-13 | 2 | 384.0 | 407 | 6.40 | Bad Sleepers | Wednesday | Weekday |
2 | 1503960366 | 2016-04-15 | 9762 | 6.28 | 6.28 | 0.0 | 2.14 | 1.26 | 2.83 | 0.0 | 29 | 34 | 209 | 726 | 1745 | Friday | Weekday | 2016-04-15 | 2016-04-15 | 1 | 412.0 | 442 | 6.87 | Bad Sleepers | Friday | Weekday |
3 | 1503960366 | 2016-04-16 | 12669 | 8.16 | 8.16 | 0.0 | 2.71 | 0.41 | 5.04 | 0.0 | 36 | 10 | 221 | 773 | 1863 | Saturday | Weekend | 2016-04-16 | 2016-04-16 | 2 | 340.0 | 367 | 5.67 | Bad Sleepers | Saturday | Weekend |
4 | 1503960366 | 2016-04-17 | 9705 | 6.48 | 6.48 | 0.0 | 3.19 | 0.78 | 2.51 | 0.0 | 38 | 20 | 164 | 539 | 1728 | Sunday | Weekend | 2016-04-17 | 2016-04-17 | 1 | 700.0 | 712 | 11.67 | Oversleepers | Sunday | Weekend |
# Before I analyze with sleep status, let's examine just the total steps taken by weekdays
meanSteps = dailyActivity.groupby('day_name')['TotalSteps'].mean()
fig_dims = (23, 13)
fig, ax = plt.subplots(figsize=fig_dims)
# create temporary dataframe
tempDf = pd.DataFrame(meanSteps)
tempDf = tempDf.reset_index()
#sort by day
cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
tempDf['day_name'] = pd.Categorical(tempDf['day_name'], categories = cats)
tempDf = tempDf.sort_values('day_name')
print(tempDf)
t = sns.barplot(
data = tempDf,
x = 'day_name',
y = 'TotalSteps'
)
t.set_xlabel('Day of the Week', fontsize = 20)
t.set_ylabel('Total Steps', fontsize = 20)
t.set_title('Average Total Steps Taken by the Day of the Week', fontsize = 20)
| | day_name | TotalSteps |
---|---|---|
1 | Monday | 7780.866667 |
5 | Tuesday | 8125.006579 |
6 | Wednesday | 7559.373333 |
4 | Thursday | 7405.836735 |
0 | Friday | 7448.230159 |
2 | Saturday | 8152.975806 |
3 | Sunday | 6933.231405 |
lightlyActive = joinedDf[['LightlyActiveMinutes', 'TotalSleepRecords', 'TotalMinutesAsleep', 'TotalTimeInBed']]
lightlyActive.head()
| | LightlyActiveMinutes | TotalSleepRecords | TotalMinutesAsleep | TotalTimeInBed |
---|---|---|---|---|
0 | 328 | 1 | 327 | 346 |
1 | 217 | 2 | 384 | 407 |
2 | 209 | 1 | 412 | 442 |
3 | 221 | 2 | 340 | 367 |
4 | 164 | 1 | 700 | 712 |
fig_dims = (15, 10)
fig, ax = plt.subplots(figsize=fig_dims)
sns.set_style("darkgrid")
reg = sns.regplot(
data = lightlyActive,
x = 'LightlyActiveMinutes',
y = 'TotalMinutesAsleep',
ax = ax,
color = 'blue'
)
reg.set_title('Relationship between Total Minutes of Sleep and Users Who are Light Active')
sedentary = joinedDf[['SedentaryMinutes', 'TotalSleepRecords', 'TotalMinutesAsleep', 'TotalTimeInBed']]
sedentary.head()
| | SedentaryMinutes | TotalSleepRecords | TotalMinutesAsleep | TotalTimeInBed |
---|---|---|---|---|
0 | 728 | 1 | 327 | 346 |
1 | 776 | 2 | 384 | 407 |
2 | 726 | 1 | 412 | 442 |
3 | 773 | 2 | 340 | 367 |
4 | 539 | 1 | 700 | 712 |
sns.set_style("darkgrid")
fig_dims = (15, 10)
fig, ax = plt.subplots(figsize=fig_dims)
s = sns.regplot(
data = sedentary,
x = 'SedentaryMinutes',
y = 'TotalMinutesAsleep',
ax = ax,
color = 'blue'
)
s.set_title('Relationship between Total Minutes of Sleep and Users Who are Sedentary')
sns.set_style("darkgrid")
fig_dims = (15, 10)
fig, ax = plt.subplots(figsize=fig_dims)
s = sns.regplot(
data = joinedDf,
x = 'FairlyActiveMinutes',
y = 'TotalMinutesAsleep',
ax = ax,
color = 'blue'
)
s.set_title('Relationship between Total Minutes of Sleep and Users Who are Fairly Active', fontsize = 20)
s.set_xlabel('Fairly Active Minutes', fontsize = 20)
s.set_ylabel('Total Minutes Asleep', fontsize = 20)
plt.show()
# Very Active
sns.set_style("darkgrid")
fig_dims = (15, 10)
fig, ax = plt.subplots(figsize=fig_dims)
s = sns.regplot(
data = joinedDf,
x = 'VeryActiveMinutes',
y = 'TotalMinutesAsleep',
ax = ax,
color = 'blue'
)
s.set_title('Relationship between Total Minutes of Sleep and Users Who are Very Active', fontsize = 20)
s.set_xlabel('Very Active Minutes', fontsize = 20)
s.set_ylabel('Total Minutes Asleep', fontsize = 20)
plt.show()
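The four regression plots can be summarised in one step by correlating each activity column with `TotalMinutesAsleep` using `DataFrame.corrwith`. The sketch below runs on the five preview rows of `joinedDf` shown earlier. Five rows from a single user are far too few to conclude anything, so this only shows the mechanics.

```python
import pandas as pd

# The five preview rows of joinedDf, not the full joined table
preview = pd.DataFrame({
    'VeryActiveMinutes': [25, 21, 29, 36, 38],
    'FairlyActiveMinutes': [13, 19, 34, 10, 20],
    'LightlyActiveMinutes': [328, 217, 209, 221, 164],
    'SedentaryMinutes': [728, 776, 726, 773, 539],
    'TotalMinutesAsleep': [327, 384, 412, 340, 700],
})

activity_cols = ['VeryActiveMinutes', 'FairlyActiveMinutes',
                 'LightlyActiveMinutes', 'SedentaryMinutes']

# Pearson correlation of each activity column with minutes asleep
corr = preview[activity_cols].corrwith(preview['TotalMinutesAsleep'])
print(corr.sort_values())
```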
#Ensuring proper formatting
hourlyIntensities['ActivityHour'] = pd.to_datetime(hourlyIntensities['ActivityHour'], format = '%m/%d/%Y %I:%M:%S %p')
#Group by the hours and find the average total intensity, which should therefore give only 24 rows in the end
mean = hourlyIntensities.groupby(hourlyIntensities['ActivityHour'].dt.time)['TotalIntensity'].mean()
sns.set_style("darkgrid")
fig_dims = (23, 13)
fig, ax = plt.subplots(figsize=fig_dims)
#create temporary dataframe
tempDf = pd.DataFrame(mean)
t = sns.barplot(
data = tempDf,
x = tempDf.index,
y = 'TotalIntensity'
)
t.set_xlabel('Time', fontsize = 20)
t.set_ylabel('Average Total Intensity', fontsize = 20)
t.set_title('Average Total Intensity Per Hour', fontsize = 20)
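To read the busiest hours off as numbers instead of from the bar chart, `nlargest` can be applied to the hourly means. The frame below is invented for illustration, so its values say nothing about the real dataset.

```python
import pandas as pd

# Invented readings that only mimic the ActivityHour and TotalIntensity columns
hourly = pd.DataFrame({
    'ActivityHour': pd.to_datetime([
        '2016-04-12 07:00', '2016-04-12 18:00',
        '2016-04-13 07:00', '2016-04-13 18:00', '2016-04-13 03:00',
    ]),
    'TotalIntensity': [10, 30, 14, 26, 0],
})

# Average intensity per hour of the day (0 to 23)
by_hour = hourly.groupby(hourly['ActivityHour'].dt.hour)['TotalIntensity'].mean()
# The two hours with the highest average intensity, highest first
top_hours = by_hour.nlargest(2)
print(top_hours)
```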
meanDays = hourlyIntensities.groupby(hourlyIntensities['ActivityHour'].dt.day_name())['TotalIntensity'].mean()
fig_dims = (23, 13)
fig, ax = plt.subplots(figsize=fig_dims)
# create temporary dataframe
tempDf = pd.DataFrame(meanDays)
tempDf = tempDf.reset_index()
tempDf.rename(columns = {"ActivityHour": "Day"}, inplace = True)
#sort by day
cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
tempDf['Day'] = pd.Categorical(tempDf['Day'], categories = cats)
tempDf = tempDf.sort_values('Day')
t= sns.barplot(
x = 'Day',
y = 'TotalIntensity',
data = tempDf
)
t.set_xlabel('Day', fontsize = 20)
t.set_ylabel('Average Total Intensity', fontsize = 20)
t.set_title('Average Total Intensity Vs Day', fontsize = 20)
After analyzing the data, I found several insights that may be of interest to Bellabeat. As a reminder, Bellabeat is a high-tech company that manufactures health-focused smart products for women. Here are my marketing recommendations for Bellabeat:
I suggest that Bellabeat notify users who have been idle during the day to consider some light exercise. My earlier finding that sedentary time is associated with less sleep suggests that the more idle a user is, the poorer their sleep may be. This can serve as an incentive for users to be more active, as it may help them sleep better at bedtime.
I recommend that Bellabeat remind users to exercise between 5pm and 7pm, as the data suggests this is when most users finish work and start exercising. This is also a good opportunity to encourage users who want to lose weight to exercise in this time frame.
Bellabeat should consider informing users about the importance of taking at least 8,000 steps per day, as this will benefit them in the long run. This is based on research by the CDC, which found that taking 8,000 steps per day was associated with a 51% lower risk of all-cause mortality (death from all causes). I therefore highly recommend that Bellabeat take this into consideration for their next marketing strategy.
As users tend to exercise more on Saturdays, Bellabeat can remind users to exercise on Saturdays, when they are likely to be more motivated. There also seems to be a noticeable drop in activity on Sundays, so I recommend that Bellabeat encourage users to be more active on Sundays. Users also seem to oversleep the most on Sundays, so Bellabeat should consider advising users to be mindful of this, as oversleeping is associated with risk of diabetes, heart disease, stroke, and death (source).
Bellabeat should notify users when their resting heart rate is detected outside the range of 60-100 BPM, as this may signify an underlying health condition that the user is not aware of.
Tuesday has the most steps taken of any weekday (second only to Saturday overall) and the highest average total intensity, which might be the reason why the users in this dataset undersleep the most on Tuesdays. Bellabeat should consider advising users to practice meditation to calm themselves on stressful days.
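As a rough illustration of the heart-rate notification idea, readings outside 60-100 BPM can be flagged with `Series.between`. The five values are the preview rows of `heartrateSeconds` shown earlier. Note that this sketch flags every reading regardless of whether the user is resting or exercising, so a real alert would also need activity context, which is not modelled here.

```python
import pandas as pd

# The five preview readings from heartrateSeconds
bpm = pd.Series([97, 102, 105, 103, 101], name='Value')

# between is inclusive on both ends by default, so 60 and 100 count as inside the range
outside = ~bpm.between(60, 100)
share_outside = outside.mean()
print(outside.tolist())
print('share of readings outside 60-100 BPM:', share_outside)
```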