Do movie releases produce literal earthquakes? We always hear about new movie releases being a “blast”, and some surely are. But do two independent series of events correlate with each other? In this post, I will use Python to visualize two different series of events, plotting them on top of each other to gain insights from time series data.
# Imports
from datetime import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette('Set2')
sns.set_style("whitegrid")
%matplotlib inline
Getting the data
To make this example more fun, I decided to use two independent series of events for which data is readily available on the internet: earthquakes around the world and movie releases in the US.
Clean and prepare earthquake data
We start by downloading the .csv export from the earthquake website to the ‘data’ directory and reading the file into a pandas DataFrame.
df = pd.read_csv('data/earthquakes_raw.csv', sep=';')
df.dropna(axis=0, how='all', inplace=True)  # drop completely empty rows
df.dropna(axis=1, how='all', inplace=True)  # drop completely empty columns
df.head()
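As a quick sanity check of what `dropna(how='all')` does, here is a sketch on a toy frame (the column names and values are made up, not the earthquake data):

```python
import numpy as np
import pandas as pd

# toy frame with one fully-empty column and one row that becomes fully empty
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                    'b': [np.nan, np.nan, np.nan]})
toy = toy.dropna(axis=1, how='all')  # drops column 'b' (all NaN)
toy = toy.dropna(axis=0, how='all')  # drops the now all-NaN middle row
```

Only cells that are NaN across an entire row or column are affected; partially filled rows survive.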
We have to unify the date information from the Date and Year columns. Then we can save the cleaned-up earthquake dates to the file ‘data/earthquakes.csv’.
df['Date'] = df['Date'] + ' ' + df['Year'].map(str)
del df['Year']
df['Date'] = df['Date'].apply(lambda x: datetime.strptime(x, '%B %d %Y'))
df['Date'].to_csv('data/earthquakes.csv', header=False)  # no header, matching the later read_csv(header=None)
df.head()
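To illustrate the `'%B %d %Y'` format string used above, here is a minimal sketch on one sample value (the string is hypothetical, shaped like the joined Date + Year columns):

```python
from datetime import datetime

# '%B %d %Y' = full month name, day of month, four-digit year
sample = 'January 9 2015'  # hypothetical value after joining Date and Year
parsed = datetime.strptime(sample, '%B %d %Y')
```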
Clean and prepare movie release data
The movie release data was retrieved from this website and saved to the ‘data’ directory. We then read the file into a pandas DataFrame. The resulting table tells us the release date and which movies were released on that date (up to 5 movies).
df = pd.read_csv('data/filmrelease_raw.csv', sep=';', header=None)
df.dropna(thresh=2, inplace=True)  # keep only rows with at least two non-empty cells
df.columns = ['Date', 'Film1', 'Film2', 'Film3', 'Film4', 'Film5']
df.head(), df.tail()
( Date Film1 Film2 \
0 Friday, January 9, 2015 Taken 3 NaN
4 Friday, January 16, 2015 Blackhat Paddington
14 Friday, January 23, 2015 Mortdecai Strange Magic
24 Friday, January 30, 2015 Black or White Project Almanac
34 Friday, February 6, 2015 Jupiter Ascending Seventh Son
Film3 Film4 Film5
0 NaN NaN NaN
4 The Wedding Ringer NaN NaN
14 The Boy Next Door NaN NaN
24 The Loft NaN NaN
34 SpongeBob Movie: Sponge Out of Water NaN NaN ,
Date Film1 Film2 \
836 Friday, December 9 🎥 Office Christmas Party NaN
840 Friday, December 16 🎥 Collateral Beauty 🎥 La La Land
850 Wednesday, December 21 🎥 Assassin's Creed 🎥 Passengers
863 Friday, December 23 🎥 Why Him? NaN
867 Sunday, December 25 🎥 Fences NaN
Film3 Film4 Film5
836 NaN NaN NaN
840 🎥 Rogue One: A Star Wars Story NaN NaN
850 🎥 Patriots Day 🎥 Sing NaN
863 NaN NaN NaN
867 NaN NaN NaN )
Talk about raw, unclean data! It seems that, at the top of the table, the date information contains the year (2015). However, upon further inspection we can see that the bottom of the table no longer shows the year. From the website information we find out that from index 716 onwards the missing year is ‘2016’. So we add this information to the DataFrame and convert the dates to a more readable format.
df.loc[lambda x: x.index >= 716, 'Date'] += ', 2016'
df['Date'] = df['Date'].apply(lambda x: datetime.strptime(x, '%A, %B %d, %Y'))
df.head()
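The conditional string append used above is easy to verify on a toy frame (the two date strings below are made-up stand-ins for the real rows):

```python
import pandas as pd

# toy frame mimicking the situation: later rows lack the year
toy = pd.DataFrame({'Date': ['Friday, January 9, 2015', 'Friday, December 9']})
# append the missing year only from a given row index onwards
toy.loc[toy.index >= 1, 'Date'] += ', 2016'
```

Only the rows selected by the boolean condition get the suffix; earlier rows are untouched.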
For the purpose of plotting the frequency of an event, we are not interested in which movies were released, only in how many were released on a particular date. We can count the movies by replacing the names with ones and calculating the sum.
# replace movie names with ones
df.iloc[:,1:] = df.iloc[:,1:].replace(r'\w', 1.0, regex=True)
df.head()
We can finally get rid of the unnecessary columns and save the clean data to a file. Now we are ready to start plotting our event series data.
df['film_sum'] = df[['Film1', 'Film2', 'Film3', 'Film4', 'Film5']].sum(axis=1)
df.drop(['Film1', 'Film2', 'Film3', 'Film4', 'Film5'], axis=1, inplace=True)
df.to_csv('data/filmrelease.csv')
df.head()
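Incidentally, the replace-with-ones step and the sum can be collapsed into a single count of non-null cells. A sketch on a toy frame (the data is invented, but the column layout mirrors the one above):

```python
import pandas as pd

toy = pd.DataFrame({'Date': ['2015-01-09', '2015-01-16'],
                    'Film1': ['Taken 3', 'Blackhat'],
                    'Film2': [None, 'Paddington']})
# count the non-empty film cells per row directly
toy['film_sum'] = toy[['Film1', 'Film2']].notna().sum(axis=1)
```

This avoids the regex replace entirely and gives the same per-row totals.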
Load the data to the plotting variables
To use this script, we have to load the clean data that we saved in the previous steps.
# Load the earthquake data and add a column with ones since there was only one earthquake per row
df1 = pd.read_csv('data/earthquakes.csv', header=None)
del df1[0]
df1.columns = ['Date']
df1['earthquake'] = np.ones(len(df1))
df1.head()
# Load the movie data, the second column already shows us the sum of movie releases
df2 = pd.read_csv('data/filmrelease.csv', header=0)
del df2['Unnamed: 0']
df2.columns = ['Date', 'movie_release']
df2.head()
As the end result, we want a single DataFrame containing the data for both event series. Moreover, we want a continuous time series, including the days on which neither event took place (no earthquakes and no movie releases). We do this with pandas’ concat and resample functions.
# Concatenate both DataFrames into one
df = pd.concat([df1, df2], ignore_index=True)
df = df.set_index(pd.DatetimeIndex(df.Date)).drop(columns='Date')
df = df.sort_index()
df = df.resample('1d').sum().fillna(0)  # fill in days without any events
df.head(10)
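The gap-filling behaviour of `resample` is easy to see on a toy series, where the missing calendar day gets an explicit zero:

```python
import pandas as pd

# one event on Jan 1 and Jan 3; Jan 2 is absent from the index
s = pd.Series([1, 1], index=pd.to_datetime(['2015-01-01', '2015-01-03']))
daily = s.resample('1d').sum().fillna(0)  # Jan 2 appears as an explicit 0
```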
Calculating and plotting a moving average
We could simply plot each event occurrence as a data point in a time series. However, that would likely yield a graph that is not very informative. Much easier to grasp is a moving average that tells us the average frequency of the events over a defined period in the past. We can create columns for these moving averages, which we can then easily plot.
# Calculate moving average
for i in [7*4, 7*4*2]:
    mvav = i  # moving average period, i.e. number of days to average
    dfi = np.convolve(df['earthquake'],
                      np.ones(mvav) * 7 / mvav,  # factor for obtaining a per-week average
                      mode='full')
    df['earthquake moving average %sw' % int(i/7)] = dfi[:-(i-1)]
    dfj = np.convolve(df['movie_release'],
                      np.ones(mvav) * 7 / mvav,  # factor for obtaining a per-week average
                      mode='full')
    df['movie_release moving average %sw' % int(i/7)] = dfj[:-(i-1)]
df.head()
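The convolution trick is worth a sanity check on a toy signal: a constant series of one event per day, averaged over a 7-day window and scaled by 7, should read exactly 7 events per week once the window is full (this sketch uses made-up data, not the merged DataFrame):

```python
import numpy as np

events = np.ones(10)                   # one event per day for 10 days
window = 7
kernel = np.ones(window) * 7 / window  # average over the window, scale to per-week
weekly = np.convolve(events, kernel, mode='full')[:-(window - 1)]
# the first window-1 values are ramp-up, then the average settles at 7.0
```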
The shorter the period we choose for the moving average, the noisier our graph will get. Let’s settle on a moving average that reflects the frequency over the past 8 weeks. And voilà! Now we can see how the frequency of earthquakes and movie releases changed over time.
# Plot relevant columns from dataframe
df.loc[:, ['earthquake moving average 8w', 'movie_release moving average 8w']].\
    plot(cmap='Set2', figsize=(12, 4))
plt.xlim(df.index[0], df.index.max() + pd.Timedelta(days=10))
plt.title('Moving average of events per week')
plt.ylabel('Frequency')
plt.show()
![Moving average of earthquake and movie-release events per week]()
Descriptive analysis of the event occurrence
What other insights can we get from this data set? Two very dissimilar series of events, one natural, and one man-made, will surely have very different properties. Let’s start with a simple question: on which days of the week do both events typically happen?
#%% DAY OF THE WEEK ANALYSIS
# create column for day of the week
df['Day'] = df.index.dayofweek
df['Day'] = df['Day'].astype('category').cat.rename_categories(
    ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
# create column for event type (np.where casts np.nan to the string 'nan' here)
df['Type'] = np.where(df['earthquake'] > 0, 'earthquake',
                      np.where(df['movie_release'] > 0, 'movie_release', np.nan))
df['Type'] = df['Type'].astype('category').cat.remove_categories(['nan'])
# plot count data per day of the week
plt.figure()
plt.title('Event count per day of the week')
sns.countplot(data=df, x='Day', hue='Type')
sns.despine(left=True)
plt.show()
![Event count per day of the week]()
We can see that nature does not respect our weekends: earthquakes seem to be distributed evenly across the days of the week.
Movie releases, on the other hand, are most frequent on Friday, followed by Wednesday, with a few on Sunday (it’s almost as if movie release days were chosen by someone… /s). It seems that if you’re planning to release a new movie in the US, Friday is the way to go. People are usually happy to start the weekend with a leisurely activity, so that makes sense. It would be interesting to find out why Saturdays and Sundays are hardly used as release days, even though people are usually free from work on those days too. Another intriguing finding is the small but notable number of releases on Wednesdays. Don’t people work on Thursdays?
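The counts behind such a bar chart can also be inspected numerically with a cross-tabulation. A sketch on toy data (the column names mirror the Day and Type columns built above, the values are invented):

```python
import pandas as pd

toy = pd.DataFrame({'Day': ['Fri', 'Fri', 'Mon', 'Wed'],
                    'Type': ['movie_release', 'movie_release',
                             'earthquake', 'movie_release']})
# rows = day of week, columns = event type, cells = counts
counts = pd.crosstab(toy['Day'], toy['Type'])
```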
Analysis of frequency per week
The world is a big place (or is it a small world?) and earthquakes occur all the time, even though we might not always find out. On the other hand, I would expect that movie releases occur much more frequently. So let’s take a look at the data by plotting histograms for both events side by side.
# joint histograms
plt.figure()
df['earthquake moving average 8w'].hist(alpha=.9)
df['movie_release moving average 8w'].hist(alpha=.9)
plt.title('Histogram of event frequency per week')
plt.show()
# With seaborn (histplot has replaced the deprecated distplot)
sns.histplot(df['earthquake moving average 8w'], kde=True)
sns.histplot(df['movie_release moving average 8w'], kde=True)
![Histogram of event frequency per week]()

![Seaborn distribution plots of the weekly event frequencies]()
Luckily, movie releases are much more frequent per week than earthquakes. In most weeks, there are between two and three movie releases, compared to 0.5 to 1.5 earthquakes.
Final remarks
In this post, we gathered information on the occurrence of two events: earthquakes around the world, and movie releases in the US. By plotting their moving averages we could see when each occurred and gain some interesting insights into how they compare. All thanks to Python!
