For the last few weeks, Boston and our fellow New Englanders have endured a heat wave. The temperature and humidity skyrocketed, but after several 90℉ days in a row, I finally heard some good news… July 15th is National Ice Cream Day. It couldn’t have come at a better time, and I’m sure we can all agree that there’s nothing better than an ice cream cone on a hot day.
In observance of this special holiday, I decided to devote the inaugural Talking Shop blog post to answering the question: “Does hot weather increase ice cream sales?”
Data & Tools
To dig into this question, I wanted to look at a few data points: monthly average temperatures and monthly sales numbers from 3 different grocery stores. I’ll be using a Jupyter notebook and the Autodaas data platform to perform this analysis. Jupyter is a powerful tool, and Autodaas lets me focus on using Jupyter to analyze my data, not on infrastructure for streaming and querying large datasets.
Jupyter Notebook is a live coding environment that is popular with data scientists. It provides analysts with statistics, machine learning and visualization tools to create analyses and reports that drive decision-making. The images in this post are from the complete Jupyter notebook analysis, which you can find here.
At a high level, our workflow is 1) import data into Autodaas, 2) transform the data in Jupyter via the Autodaas query engine, 3) use Python data libraries to analyze the data.
To get started, we import all our data sources into Autodaas with a few clicks. Our data sources for this tutorial are a temperatures file and a product sales database.
- Monthly Average Temperatures: a flat file of temperatures in CSV format, generated from a weather tracking system.
- Grocery Store Database: a MySQL database with ice cream sales data by brand and flavor for 3 retailers, a table for each retailer.
Next, we connect our Jupyter notebook to Autodaas and transform the data using its optimized query engine. Autodaas’s SQL query engine lets us offload an expensive transformation operation and create a master data set to use throughout the analysis, which reserves the notebook’s computing power for our visualization and statistical algorithms.
The SQL transformation query unions all the retailer product sales tables, joins the temperature data from the file, then aggregates the result into a master dataframe.
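To make the union, join and aggregate steps concrete, here is a minimal sketch that runs the same shape of query against an in-memory SQLite database using Python's standard library. It is not the Autodaas query from the notebook: the table names (retailer_a_sales and so on), the column names and every row are hypothetical, made up purely for illustration.

```python
import sqlite3

# Hypothetical schema and made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One sales table per retailer, plus the monthly temperature table.
for retailer in ("retailer_a", "retailer_b", "retailer_c"):
    cur.execute(
        f"CREATE TABLE {retailer}_sales (month INTEGER, flavor TEXT, units INTEGER)"
    )
cur.execute("CREATE TABLE temperatures (month INTEGER, avg_temp REAL)")

cur.executemany("INSERT INTO retailer_a_sales VALUES (?, ?, ?)",
                [(1, "vanilla", 100), (7, "vanilla", 220)])
cur.executemany("INSERT INTO retailer_b_sales VALUES (?, ?, ?)",
                [(1, "chocolate", 150), (7, "chocolate", 260)])
cur.executemany("INSERT INTO retailer_c_sales VALUES (?, ?, ?)",
                [(1, "mint", 90), (7, "mint", 180)])
cur.executemany("INSERT INTO temperatures VALUES (?, ?)",
                [(1, 29.0), (7, 74.0)])

# Union the retailer tables, join the temperatures, aggregate by month.
query = """
SELECT t.month, t.avg_temp, SUM(s.units) AS units_sold
FROM (
    SELECT month, units FROM retailer_a_sales
    UNION ALL
    SELECT month, units FROM retailer_b_sales
    UNION ALL
    SELECT month, units FROM retailer_c_sales
) AS s
JOIN temperatures AS t ON t.month = s.month
GROUP BY t.month, t.avg_temp
ORDER BY t.month
"""
master_rows = cur.execute(query).fetchall()
print(master_rows)
```

UNION ALL is used rather than UNION so that identical rows from different retailers are not collapsed before the units are summed.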
Then, we use Python’s analysis and visualization libraries to begin exploring our data. We use statistics to get a better understanding of our master data set. After that, we’ll run a correlation algorithm to find whether the average temperature and units sold are related. Finally, we’ll visualize the data attributes to gain insight and help form a conclusion.
Using the Pandas describe function, we generate various statistics on unit sales by month and average temperature:
- mean average temperature of months: 58 degrees
- mean of units sold over all months: 607
- min units sold in a month over all months: 427
- max units sold over all months: 857
- standard deviation of units sold across months: 122
The statistics above help introduce us to the data at a high level.
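As a rough illustration of where numbers like these come from, the sketch below computes the same kind of summary with Python's standard statistics module. The values are made up and are not the post's data set, and the describe helper is a hypothetical stand-in. With pandas, calling describe() on a dataframe reports these figures along with the count and quartiles.

```python
import statistics

# Made-up monthly values for illustration; not the post's data set.
avg_temp = [30, 45, 60, 75]
units_sold = [400, 500, 600, 700]

def describe(values):
    """Return a subset of the statistics that pandas' describe() reports."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
    }

summary = {"avg_temp": describe(avg_temp), "units_sold": describe(units_sold)}
for name, stats in summary.items():
    print(name, stats)
```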
Using the pearsonr correlation function, we test whether average temperatures and units sold are related.
- A correlation coefficient of 0.75 (75%) indicates a strong positive correlation between average temperatures and units sold.
- A p-value of 0.005 indicates that this finding is statistically significant, meaning it is unlikely to be due to chance alone.
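To show what these two numbers measure, here is a small standard-library sketch. It computes Pearson's r directly from its definition and estimates a p-value with a permutation test, which is a swapped-in technique: SciPy's scipy.stats.pearsonr returns the coefficient together with an analytically computed p-value instead. The monthly values, function names and parameters below are assumptions made for illustration and are not the post's data.

```python
import math
import random

def pearson_r(x, y):
    """Pearson's r: co-variation of x and y divided by the product of their spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Share of random shuffles of y whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    return hits / n_permutations

# Made-up monthly values for illustration; not the post's data set.
avg_temp = [29, 32, 39, 48, 58, 68, 74, 72, 65, 54, 45, 35]
units_sold = [430, 450, 500, 560, 640, 760, 850, 820, 700, 600, 520, 460]

r = pearson_r(avg_temp, units_sold)
p = permutation_p_value(avg_temp, units_sold)
print(f"r = {r:.2f}, permutation p-value = {p:.4f}")
```

A small p-value here means that very few random pairings of months and sales produce a correlation as strong as the observed one, which is the sense in which a result is "not by chance".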
Our data analysis shows a strong positive correlation, and the visualizations are consistent with it. Both support the hypothesis that higher temperatures result in more ice cream sales.
Do higher temperatures increase ice cream sales?
Based on our correlation analysis and visualizations, we can conclude it is likely that higher temperatures increase ice cream sales. It looks like we should increase the production and distribution of our ice cream brands to retailers during the higher temperature months.
A word of caution: correlation does not mean causation. However, we did review the data from multiple angles, and the results were consistent. We can increase the size of the data set over time (more years) to strengthen our conclusion.