Visualizing parallel event series in Python
- Getting the data
- Calculating and plotting a moving average
- Descriptive analysis of the event occurrence
- Final remarks
Do movie releases produce literal earthquakes? We always hear about new movie releases being a "blast", and some sure are. But how do two independent series of events correlate with each other? In this post, I will use Python to visualize two different series of events, plotting them on top of each other to gain insights from time series data.
from datetime import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette('Set2')
sns.set_style("whitegrid")
%matplotlib inline
To make this example more fun, I decided to use two independent series of events for which data is readily available on the internet: earthquakes around the world and movie releases in the US.
We start by downloading the .csv export from the world earthquake website into the 'data' directory and reading the file into a pandas DataFrame.
df = pd.read_csv('data/earthquakes_raw.csv', sep=';')
df.dropna(axis=0, how='all', inplace=True)  # drop rows that are completely empty
df.dropna(axis=1, how='all', inplace=True)  # drop columns that are completely empty
df.head()
We have to unify the date information from the Date and Year columns. Then we can save the cleaned-up earthquake date data to the file 'data/earthquakes.csv'.
df['Date'] = df['Date'] + ' ' + df['Year'].map(str)  # e.g. 'January 12' + ' 2015'
del df['Year']
df['Date'] = df['Date'].apply(lambda x: datetime.strptime(x, '%B %d %Y'))
df['Date'].to_csv('data/earthquakes.csv')
df.head()
The movie release data was retrieved from this website and saved to the 'data' directory. We then read the file into a pandas DataFrame. The resulting table tells us the release date and which movies were released on that date (up to 5 movies).
df = pd.read_csv('data/filmrelease_raw.csv', sep=';', header=None)
df.dropna(axis=0, thresh=2, inplace=True)  # keep only rows with a date and at least one film
df.columns = ['Date', 'Film1', 'Film2', 'Film3', 'Film4', 'Film5']
df.head(), df.tail()
Talk about raw, unclean data! It seems that, at the top of the table, the date information contains the year (2015). However, upon further inspection, we can see that the bottom of the table no longer shows the year. From the website information we find out that the missing year is 2016 from index 716 onwards. So we add this data to the DataFrame and change the date format to a more readable one.
df.loc[lambda x: x.index >= 716, 'Date'] += ', 2016'  # append the missing year
df['Date'] = df['Date'].apply(lambda x: datetime.strptime(x, '%A, %B %d, %Y'))
df.head()
For the purpose of plotting the frequency of an event, we are not interested in which movies were released, but only in how many were released on a particular date. We can count the movies by replacing the names with ones and calculating the sum.
df.iloc[:, 1:] = df.iloc[:, 1:].replace(r'\w', 1.0, regex=True)  # any cell containing a film title becomes 1.0
df.head()
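As a side note, the same counts can be obtained without the regex detour by counting the non-empty cells directly; a minimal sketch:
# count the non-null entries across the five film columns per row
# (works on the raw titles as well as on the ones from the step above)
film_count = df[['Film1', 'Film2', 'Film3', 'Film4', 'Film5']].notna().sum(axis=1)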
We can finally get rid of the unnecessary columns and save the clean data to a file. Now we are ready to start plotting our event series data.
df['film_sum'] = df[['Film1', 'Film2', 'Film3', 'Film4', 'Film5']].sum(axis=1)
df.drop(['Film1', 'Film2', 'Film3', 'Film4', 'Film5'], axis=1, inplace=True)
df.to_csv('data/filmrelease.csv')
df.head()
To plot the event series, we first have to load the clean data that we saved in the previous steps.
df1 = pd.read_csv('data/earthquakes.csv', header=None)
del df1[0]
df1.columns = ['Date']
df1['earthquake'] = np.ones(len(df1))
df1.head()
df2 = pd.read_csv('data/filmrelease.csv', header=0)
del df2['Unnamed: 0']
df2.columns = ['Date', 'movie_release']
df2.head()
As the end result, we want a single DataFrame containing the data for both event series. Moreover, we want a continuous time series, including the days on which neither of the two events took place (no earthquakes and no movie releases). We do this by using the concat and resample functions of pandas.
df = pd.concat([df1, df2], ignore_index=True)
df = df.set_index(pd.DatetimeIndex(df.Date))
df = df.sort_index()
df = df.resample('1d').sum().fillna(0)  # insert the days on which no events occurred
df.head(10)
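To see what the resampling does, here is a minimal sketch with made-up dates: days missing from the index come back as rows with a count of zero.
# toy illustration with hypothetical dates: one event on Jan 1 and one on Jan 4
s = pd.Series(1.0, index=pd.to_datetime(['2015-01-01', '2015-01-04']))
print(s.resample('1d').sum().fillna(0))  # Jan 2 and Jan 3 appear with the value 0.0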
We could simply plot each event occurrence as a data point in a time series. However, this will likely yield a graph that is not very informative. Much easier to grasp is a moving average that tells us the average frequency of the events for a defined period of time in the past. We can create columns for these moving averages, which we can then easily plot.
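Before applying it to the real data, here is a minimal sketch of the trick on a made-up array: a kernel of ones acts as a trailing window sum, and the factor 7/window turns the daily average into a weekly frequency.
# toy example: weekly event frequency from made-up daily counts, 7-day window
x = np.array([0, 1, 0, 0, 2, 0, 0, 1, 0, 0], dtype=float)
kernel = np.ones(7) * 7 / 7  # 7/window; for a 7-day window this is a plain trailing sum
y = np.convolve(x, kernel, mode='full')[:len(x)]
print(y)  # [0. 1. 1. 1. 3. 3. 3. 4. 3. 3.] -- events during the trailing 7 days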
for i in [7 * 4, 7 * 4 * 2]:  # moving-average windows: 4 and 8 weeks, in days
    mvav = i  # moving average period, i.e. number of daily points to average
    kernel = np.ones(mvav) * 7 / mvav  # factor for obtaining the average per week
    dfi = np.convolve(df['earthquake'], kernel, mode='full')
    df['earthquake moving average %sw' % int(i / 7)] = dfi[:-(mvav - 1)]
    dfj = np.convolve(df['movie_release'], kernel, mode='full')
    df['movie_release moving average %sw' % int(i / 7)] = dfj[:-(mvav - 1)]
df.head()
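For the record, pandas has a built-in rolling-window helper that achieves much the same thing without NumPy; a sketch for the 8-week earthquake column (the edge handling at the start of the series differs slightly from the convolution above):
# pandas-native alternative: trailing 8-week mean, scaled to events per week
rolling_8w = df['earthquake'].rolling(window=7 * 8, min_periods=1).mean() * 7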
The shorter the period that we choose for the moving average, the noisier our graphic will get. Let's settle on a moving average that reflects the frequency during the past 8 weeks. And voilà! Now we can see how the frequency of earthquakes and movie releases changed over time.
df.loc[:, ['earthquake moving average 8w', 'movie_release moving average 8w']].\
    plot(cmap='Set2', figsize=(12, 4))
plt.xlim(df.index[0], df.index.max() + pd.Timedelta(days=10))  # leave some room on the right
plt.title('Moving average of events per week')
plt.ylabel('Frequency')
plt.show()
What other insights can we get from this data set? Two very dissimilar series of events, one natural, and one man-made, will surely have very different properties. Let's start with a simple question: on which days of the week do both events typically happen?
# create a column for the day of the week
df['Day'] = df.index.dayofweek
df['Day'] = df.Day.astype('category')
df['Day'] = df.Day.cat.rename_categories(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
# create a column for the event type (days with both events are labeled 'earthquake')
df['Type'] = np.where(df['earthquake'] > 0, 'earthquake',
                      np.where(df['movie_release'] > 0, 'movie_release', np.nan))
df['Type'] = df['Type'].astype('category')
df['Type'] = df['Type'].cat.remove_categories(['nan'])  # np.where cast the NaNs to the string 'nan'
# plot count data per day of the week
plt.figure()
plt.title('Event count per day of the week')
sns.countplot(data=df, x='Day', hue='Type')
sns.despine(left=True)
plt.show()
We can see that nature does not respect our weekends, as earthquakes seem to be evenly distributed across the days of the week.
Movie releases, on the other hand, are most frequent on Friday, followed by Wednesday, with a few on Sunday (it's almost as if movie release days were chosen by someone... /s). It seems that, if you're planning to release a new movie in the US, Friday is the way to go. People are usually happy to start the weekend with a leisurely activity, so that makes sense. It would be interesting to find out why Saturdays and Sundays are almost never used as release days, even though on those days people are also usually free from work. Another intriguing finding is the modest but notable number of releases on Wednesdays. Don't people work on Thursdays?
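To put numbers on that impression, the counts behind the plot can be printed directly; a quick sketch:
# count the movie-release days per weekday (the same data as in the countplot above)
print(df[df['Type'] == 'movie_release']['Day'].value_counts())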
The world is a big place (or is it a small world?) and earthquakes occur all the time, even though we might not always hear about them. Movie releases, on the other hand, I would expect to occur much more frequently. So let's take a look at the data by plotting histograms for both events side by side.
plt.figure()
df['earthquake moving average 8w'].hist(alpha=.9, label='earthquake')
df['movie_release moving average 8w'].hist(alpha=.9, label='movie_release')
plt.title('Histogram of event frequency per week')
plt.legend()
plt.show()
# with seaborn (histplot replaces the deprecated distplot)
sns.histplot(df['earthquake moving average 8w'], kde=True)
sns.histplot(df['movie_release moving average 8w'], kde=True)
Luckily, movie releases are much more frequent per week than earthquakes. On most weeks, there are between two and three movie releases, compared to 0.5 to 1.5 earthquakes.
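Those figures can be checked against the summary statistics of the two moving-average columns; a quick sketch:
# summary statistics for the weekly frequencies of both event series
print(df[['earthquake moving average 8w', 'movie_release moving average 8w']].describe())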
In this post, we gathered information on the occurrence of two kinds of events: earthquakes around the world and movie releases in the US. By plotting their moving averages, we could see when they occurred and gained some interesting insights about how they compare. All thanks to Python!