Fitbit activity and sleep data: a time-series analysis with Generalized Additive Models
This is a time-series analysis of activity and sleep data recorded by a Fitbit user over one year. I use this data to forecast the user's activity and sleep for an additional year using Generalized Additive Models.
- Data cleaning (missing data and outliers)
- Predicting the step count for an additional year
- Sleep analysis
The goal of this notebook is to analyze one year of time-series data recorded by a Fitbit tracker. I will fit Generalized Additive Models to this data and use them to forecast an additional year for the same user.
Packages used:
- pandas, numpy, matplotlib, seaborn
- Prophet (published as fbprophet before version 1.0 and as prophet since)
import pandas as pd
import numpy as np
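try:
    from prophet import Prophet  # package name since version 1.0
except ImportError:
    from fbprophet import Prophet  # package name before version 1.0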
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
activity = pd.read_csv('OneYearFitBitData.csv')
# change commas to dots
activity.iloc[:,1:] = activity.iloc[:,1:].applymap(lambda x: float(str(x).replace(',','.')))
# change column names to English
activity.columns = ['Date', 'BurnedCalories', 'Steps', 'Distance', 'Floors', 'SedentaryMinutes', 'LightMinutes', 'ModerateMinutes', 'IntenseMinutes', 'IntenseActivityCalories']
# import the sleep data
sleep = pd.read_csv('OneYearFitBitDataSleep.csv')
# check the size of the dataframes
activity.shape, sleep.shape
# merge dataframes
data = pd.merge(activity, sleep, how='outer', on='Date')
# parse date into correct format
data['Date'] = pd.to_datetime(data['Date'], format='%d-%m-%Y')
# correct units for Calories and Steps
for c in ['BurnedCalories', 'Steps', 'IntenseActivityCalories']:
    data[c] = data[c]*1000
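As an aside, the comma replacement and the rescaling by 1000 can sometimes be handled when the file is read. The sketch below assumes, without having checked the real export, that commas are decimal marks and dots are thousands separators. It uses a small made-up CSV with only three of the columns, not the actual file.

```python
import io
import pandas as pd

# Made-up sample in the ASSUMED export format (not the real file):
# quoted numbers, ',' as decimal mark, '.' as thousands separator.
sample = io.StringIO(
    'Date,Steps,Distance\n'
    '"08-05-2015","1.234","5,43"\n'
    '"09-05-2015","905","0,7"\n'
)

# decimal and thousands are standard read_csv parameters.
demo = pd.read_csv(sample, decimal=',', thousands='.')
demo['Date'] = pd.to_datetime(demo['Date'], format='%d-%m-%Y')
print(demo)
```

If that assumption holds, a value below one thousand such as 905 has no dot in the file, so multiplying the whole column by 1000 would turn it into 905000. Parsing the separators directly avoids that case.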
Once imported, we should check for any missing data:
data.isnull().sum()
data.iloc[np.where(data['MinutesOfSleep'].isnull())[0],:]
We can see that the sleep information was missing for some dates. The activity information for those days is complete. Therefore, we should not get rid of those rows just now.
data.iloc[np.where(data['Steps']==0)[0],:]
We can also see that the step count for some data points is zero. Looking at the complete rows, nearly no other data was recorded on those days. I assume that the user did not wear the fitness tracker on those days, so we can drop those rows entirely.
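# drop by index label (np.where returns positions, which only match labels on an untouched default index)
data = data.drop(data.index[data['Steps']==0], axis=0)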
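A note on dropping rows by condition: np.where returns positional indices, while DataFrame.drop expects index labels. The two coincide on a fresh default index but diverge as soon as rows have been removed. Here is a minimal sketch on made-up step counts (the toy dataframe is hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up step counts; the default index labels are 0..3.
toy = pd.DataFrame({'Steps': [0, 250000, 8000, 9000]})

# Boolean mask: keeps rows by condition, no positions or labels involved.
toy = toy[toy['Steps'] != 0]

# After the first removal, the position and the label of the large value differ.
positions = np.where(toy['Steps'] >= 100000)[0]
labels = toy.index[toy['Steps'] >= 100000]
print(list(positions), list(labels))

# A label-based drop removes the intended row.
toy = toy.drop(labels, axis=0)
print(toy['Steps'].tolist())
```

Passing the positions to drop at that point would look for a label that no longer exists, so filtering with a mask or with index labels is the safer habit.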
sns.distplot(data['Steps'])
plt.title('Histogram for step count')
Step count is probably the most accurate measure obtained from a pedometer. Looking at the distribution of this variable, however, we can see that there is a chance that we have outliers in the data, as at least one value seems to be much higher than all the rest.
data.sort_values(by='Steps', ascending=False).head()
We found the outlier! The step count for the first day (our data starts on May 8th, 2015) is too high to be a plausible number of steps for a single day. Perhaps the device counts vibration since its production as steps, and that total is uploaded on the first day the user wears the tracker. In any case we can drop that row, since the sleep data is also missing for that day.
# drop by index label; positions from np.where no longer match labels after the earlier drop
data = data.drop(data.index[data['Steps']>=100000], axis=0)
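The threshold of 100000 steps was chosen by eye. As an alternative, and not what this notebook does, such a value can be flagged automatically with a robust z-score based on the median absolute deviation (MAD). This is a sketch on made-up numbers. The helper name is hypothetical, and the cut-off of 3.5 is a common convention rather than something derived from this dataset.

```python
import numpy as np

def robust_z(values):
    """Modified z-score based on the median and the MAD.

    Assumes the MAD is non-zero (at least half of the values differ from the median).
    """
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    return 0.6745 * (values - med) / mad

# Made-up daily step counts with one implausible value.
steps = [8000, 9500, 7000, 12000, 10500, 250000]
flags = np.abs(robust_z(steps)) > 3.5
print(flags)
```

Because the median and the MAD barely move when a single extreme value is present, the outlier does not mask itself the way it can with a mean and standard deviation.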
Now we can look at our preprocessed data. Shape, distribution of the variables, and a look at some rows from the dataframe, are all useful things to observe:
data.shape
fig, ax = plt.subplots(5,3, figsize=(8,10))
for c, a in zip(data.columns[1:], ax.flat):
    df = pd.DataFrame()
    df['ds'] = data['Date']
    df['y'] = data[c]
    df = df.dropna(axis=0, how='any')
    sns.distplot(df['y'], axlabel=False, ax=a)
    a.set_title(c)
plt.suptitle('Histograms of variables from fitbit data', y=1.02, fontsize=14);
plt.tight_layout()
data.head()
In order to use the Prophet package to predict the future using a Generalized Additive Model, we need to create a dataframe with columns ds and y (we need to do this for each variable):
- ds is the date stamp data giving the time component
- y is the variable that we want to predict
In our case we will use the log transform of the step count in order to decrease the effect of outliers on the model.
df = pd.DataFrame()
df['ds'] = data['Date']
df['y'] = data['Steps']
# log-transform of step count
df['y'] = np.log(df['y'])
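To illustrate what the log transform does, here is a small sketch on made-up step counts. It shows how the transform compresses large values and how np.exp maps results back to steps, which is what one would apply to a forecast made on the log scale.

```python
import numpy as np

# Made-up step counts, including one very large value.
steps = np.array([4000.0, 8000.0, 12000.0, 250000.0])

# On the log scale the largest value sits much closer to the rest.
log_steps = np.log(steps)
print(log_steps.round(2))

# np.exp undoes the transform, returning values in steps.
recovered = np.exp(log_steps)
print(recovered.round(0))
```

Note that np.log(0) is negative infinity, which is one more reason the zero-step rows have to be removed before this step.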
Now we need to specify the type of growth model that we want to use:
- Linear: assumes that the variable y grows linearly in time (doesn't apply to our step count scenario, if the person sticks to their normal lifestyle)
- Logistic: assumes that the variable y grows logistically in time and saturates at some point
I will assume that the person for whom we want to predict the step count in the following year will not have any dramatic lifestyle change that makes them walk more. Therefore, I am using logistic 'growth' with the cap set to the median of the data, which in practice means that the step count's growth trend should be close to 'zero growth'.
df['cap'] = df['y'].median()
m = Prophet(growth='logistic', yearly_seasonality=True)
m.fit(df)
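For intuition about the cap, here is a sketch of the basic logistic curve, cap / (1 + exp(-k (t - m))). The parameter values are made up, and Prophet's fitted trend additionally allows the rate to change at changepoints, so this is an illustration of saturation and not a reproduction of the fitted model.

```python
import numpy as np

def logistic_trend(t, cap, k, m):
    """Basic logistic curve: rises at rate k around offset m and saturates at cap."""
    return cap / (1.0 + np.exp(-k * (t - m)))

# Made-up parameters: cap of 9.0 on the log scale, one year of daily time steps.
t = np.arange(0, 366)
trend = logistic_trend(t, cap=9.0, k=0.05, m=0.0)

print(trend[0])    # value at t = m is half the cap
print(trend[-1])   # approaches the cap without exceeding it
```

Whatever the rate, the curve never crosses the cap, which is why setting the cap to the median keeps the trend from growing past the typical level of the data.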
After fitting the model, we need a new dataframe future with the additional rows for which we want to predict y.
future = m.make_future_dataframe(periods=365, freq='D')
future['cap'] = df['y'].median()
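To make the shape of the future dataframe concrete, the following sketch builds an equivalent frame by hand with pandas: the historical dates followed by the extra daily dates, plus the cap column that logistic growth requires. The dates and the cap value are made up, and the horizon is shortened to three days for readability.

```python
import pandas as pd

# Made-up history of five daily dates.
history = pd.DataFrame({'ds': pd.date_range('2015-05-08', periods=5, freq='D')})

# Extra dates start the day after the last observed date.
last = history['ds'].max()
extra = pd.date_range(start=last + pd.Timedelta(days=1), periods=3, freq='D')

# History plus horizon, with a constant cap on every row.
future_demo = pd.concat([history[['ds']], pd.DataFrame({'ds': extra})], ignore_index=True)
future_demo['cap'] = 9.0
print(future_demo)
```

The cap has to be present on every row, historical and future alike, because the logistic trend is evaluated over the whole date range.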
Now we can call predict on the fitted model and obtain relevant statistics for the forecast period. We can also plot the results.
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
m.plot(forecast, ylabel='log(Steps)', xlabel='Date');
plt.title('1-year prediction of step count from 1 year of fitbit data');
We can see that the model mimics the behavior of the step count during the year for which data was available, and that the forecast repeats that pattern. This seems reasonable, as we would not expect the pattern to change much if the person keeps a similar lifestyle.
Additionally, we can plot the components from the Generalized Additive Model and see their effect on the 'y' variable. In this case we have the general trend (remember we capped this at the median of the log step count), the yearly seasonality effect, and the weekly effect.
m.plot_components(forecast);
plt.suptitle('GAM components for prediction of step count', y=1.02, fontsize=14);
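As a reading aid for the component plots, the additive structure can be written out as follows. This is the general form of such a model with a trend and two seasonal terms, where y is log(Steps) here. The symbols are introduced for illustration and do not appear in the notebook: g is the trend, s_year and s_week are the seasonal terms, and each seasonal term is a Fourier series with period P (roughly 365.25 days for the yearly term, 7 days for the weekly term).

```latex
y(t) = g(t) + s_{\text{year}}(t) + s_{\text{week}}(t) + \varepsilon_t,
\qquad
s(t) = \sum_{n=1}^{N} \left( a_n \cos\frac{2\pi n t}{P} + b_n \sin\frac{2\pi n t}{P} \right)
```

Each panel of the component plot shows one of these terms on its own, so the panels add up to the forecast.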
Here we see some interesting patterns:
- The general 'growth' trend is as expected, as we assumed that there would be no growth beyond the median of the existing data.
- The yearly effect shows a trend towards higher activity during the summer months. The variation is considerable, however, probably because our dataset covers one year only.
- The weekly effect shows that Sunday is a day of lower activity for this person whereas Saturday is the day where the activity is the highest. So, grocery shopping on Saturday, Netflix on Sunday? :)
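One simple way to sanity-check a weekly effect like this is to average the raw values per day of the week and compare the ranking with the weekly component. This is a sketch on made-up data (the toy dataframe and its values are hypothetical). With the real data, the same groupby could be applied to the Date and Steps columns.

```python
import pandas as pd

# Two made-up weeks starting on Friday, May 8th, 2015.
dates = pd.date_range('2015-05-08', periods=14, freq='D')
toy = pd.DataFrame({'ds': dates, 'Steps': 8000})

# Make Saturdays busy and Sundays quiet, purely for illustration.
toy.loc[toy['ds'].dt.day_name() == 'Saturday', 'Steps'] = 12000
toy.loc[toy['ds'].dt.day_name() == 'Sunday', 'Steps'] = 5000

# Mean step count per day of the week.
weekday_mean = toy.groupby(toy['ds'].dt.day_name())['Steps'].mean()
print(weekday_mean.sort_values(ascending=False))
```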
A very important part of our lives is sleep. It would be very interesting to look at the sleep habits of the user of the fitness tracker and see if we can get some insights from this data.
df = pd.DataFrame()
df['ds'] = data['Date']
df['y'] = data['MinutesOfSleep']
df = df.dropna(axis=0, how='any')
# drop rows where sleep time is zero, as this would mean that the person did not wear the tracker overnight and the data is missing
df = df.iloc[np.where(df['y']!=0)[0],:]
sns.distplot(df['y'])
df['cap'] = df['y'].median()
m = Prophet(growth='logistic', yearly_seasonality=True)
m.fit(df)
future = m.make_future_dataframe(periods=365, freq='D')
future['cap'] = df['y'].median()
forecast = m.predict(future)
m.plot(forecast);
plt.title('1-year prediction of MinutesOfSleep from 1 year of fitbit data');
The model again predicts sleep behavior for the forecast year that is similar to the observed year. This seems reasonable, as we would not expect the pattern to change much if the person keeps a similar lifestyle.
m.plot_components(forecast);
plt.suptitle('GAM components for prediction of MinutesOfSleep', y=1.02, fontsize=14);
A look at the amount of sleep reveals:
- A saturation trend at the median (we set this assumption)
- The yearly effect shows a trend towards more sleep during the summer months, with more variation during winter
- The weekly effect shows the lowest amount of sleep on Mondays (maybe going to bed late on Sunday and waking up early on Monday is a pattern for this user). The highest amount of sleep occurs on Saturdays (no alarm to wake up to on Saturday morning!). Interestingly, the user seems to get more sleep on Wednesdays than on Mondays or Tuesdays, which could mean that their work schedule is not constant during weekdays.
zeros_allowed = ['Floors', 'SedentaryMinutes', 'LightMinutes', 'ModerateMinutes', 'IntenseMinutes', 'IntenseActivityCalories', 'MinutesOfBeingAwake', 'NumberOfAwakings']
fig, ax = plt.subplots(3,3, figsize=(12,6), sharex=True)
predict_cols = ['Steps', 'Floors', 'BurnedCalories', 'LightMinutes', 'ModerateMinutes', 'IntenseMinutes', 'MinutesOfSleep', 'MinutesOfBeingAwake', 'NumberOfAwakings']
for c, a in zip(predict_cols, ax.flat):
    df = pd.DataFrame()
    df['ds'] = data['Date']
    df['y'] = data[c]
    df = df.dropna(axis=0, how='any')
    if c not in zeros_allowed:
        df = df.iloc[np.where(df['y']!=0)[0],:]
    df['cap'] = df['y'].median()
    m = Prophet(growth='logistic', yearly_seasonality=True)
    m.fit(df)
    future = m.make_future_dataframe(periods=365, freq='D')
    future['cap'] = df['y'].median()
    future.tail()
    forecast = m.predict(future)
    forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
    m.plot(forecast, xlabel='', ax=a);
    a.set_title(c)
    #m.plot_components(forecast);
plt.suptitle('1-year prediction per variable from 1 year of fitbit data', y=1.02, fontsize=14);
plt.tight_layout()
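The per-variable preparation repeated in the loop above (select the column, drop missing values, optionally drop zeros, set the cap to the median) could be collected in a small helper. This is a sketch only: the function name and the allow_zeros parameter are hypothetical, and the demonstration data is made up.

```python
import pandas as pd

def make_prophet_frame(data, column, allow_zeros=False):
    """Build a ds / y / cap frame for one variable, as done in the loop above."""
    frame = pd.DataFrame({'ds': data['Date'], 'y': data[column]}).dropna()
    if not allow_zeros:
        # A zero is treated as missing data for this variable.
        frame = frame[frame['y'] != 0]
    frame = frame.copy()
    frame['cap'] = frame['y'].median()
    return frame

# Made-up data: one zero and one missing value.
demo = pd.DataFrame({
    'Date': pd.date_range('2015-05-08', periods=4, freq='D'),
    'MinutesOfSleep': [420, 0, None, 480],
})
strict = make_prophet_frame(demo, 'MinutesOfSleep')
lenient = make_prophet_frame(demo, 'MinutesOfSleep', allow_zeros=True)
print(strict)
print(lenient)
```

With such a helper, the loop body would reduce to building the frame, fitting the model, and plotting, and the zeros_allowed list would map directly onto the allow_zeros argument.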