In the past decade, avocados have soared in popularity. From avocado salad to avocado toast, and even avocado face masks, people seem to be buying more and more avocados and using them in many different ways. This rise in popularity has been linked to the wellness movement as more people focus on their health. If you want to learn more about the cultural history of avocados, check out this article by the BBC: https://www.bbc.co.uk/bbcthree/article/87a56e5c-6d41-4495-9e22-523efb6b4cb0#:~:text=1920s%3A%20A%20PR%20campaign%20starts&text=By%20the%20late%2019th%20century,but%20they%20weren't%20selling.
In this tutorial, we will use data science to analyze the trends in avocado prices and the number of avocados purchased.
Let's look at this data from Timofei Kornev on Kaggle (https://www.kaggle.com/timmate/avocado-prices-2020). This data was collected from the Hass Avocado Board's website (https://hassavocadoboard.com/). It contains weekly scan data of Hass avocados from the years 2015 to 2020.
First, we'll read in the data and convert it to a dataframe.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
# creating dataframe from data
df = pd.read_csv("avocado-updated-2020.csv", sep=',')
df.head()
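Our dataframe contains many columns, but the columns we will be focusing on are date (the week of the observation), average_price, total_volume (the number of avocados sold), type (organic or conventional), and geography.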
Before doing exploratory data analysis (EDA), we should check whether any data is missing. Missing data can cause problems during EDA, so if any exists we need to decide how to deal with it. To learn more about missing data and ways to handle it, read this: https://www.mastersindatascience.org/learning/how-to-deal-with-missing-data/.
# checking if dataframe has any null values
df.isnull().values.any()
The dataframe has no missing data so we can go ahead and start doing EDA.
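If the check had returned True, a per-column count would show where the gaps are. The snippet below is a minimal sketch on a small made-up dataframe (the values are hypothetical and only the column names come from our data). It shows the count along with two common ways of handling gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the avocado data (values are made up).
toy = pd.DataFrame({
    "date": ["2015-01-04", "2015-01-11", "2015-01-18"],
    "average_price": [1.22, np.nan, 1.17],
    "total_volume": [40873.28, 38500.00, np.nan],
})

# Count missing values per column instead of getting a single True/False.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Two common options: drop incomplete rows, or fill gaps with a column statistic.
dropped = toy.dropna()
filled = toy.fillna({"average_price": toy["average_price"].mean()})
```

Which option is appropriate depends on how much data is missing and why, so treat both as starting points rather than a recommendation.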
Our dataframe is ready, so now we will do some data analysis.
First things first: say you're an avid avocado consumer. Where in the U.S. should you live if you don't want to break the bank buying avocados? Using this data, we can find out which locations in the U.S. have the cheapest avocados and which have the most expensive.
# finding the average price of avocados for each geographic location
avg_price_by_location = df.groupby(['geography'])['average_price'].mean().reset_index()
avg_price_by_location = avg_price_by_location.sort_values(by=['average_price'])
avg_price_by_location
The data has an entry for 'Total U.S.'. We can use the average price associated with 'Total U.S.' to see how far the avocado prices for the different locations are from the national average.
# getting the total average price from 'Total U.S.'
average_total = \
avg_price_by_location.loc[avg_price_by_location['geography'] == 'Total U.S.','average_price'].\
iloc[0]
average_total
# dropping 'Total U.S.' from dataset
avg_price_by_location = avg_price_by_location[avg_price_by_location.geography != 'Total U.S.']
# plotting average avocado prices for each Geographic Location
plt.figure(figsize=(25, 15))
sns.barplot(x='geography',y='average_price', data=avg_price_by_location)
plt.title('Average Avocado Prices For Each Geographic Area', fontsize=23)
# making x-axis labels vertical to take less space
plt.xticks(rotation=90, size=18)
plt.yticks(size=18)
plt.xlabel('Geography', fontsize=20)
plt.ylabel('Average Price', fontsize=20)
plt.axhline(average_total)
The graph above shows the average avocado prices for each geographic location. The blue horizontal line represents the average avocado price in the U.S. From the graph, it looks like Houston, Dallas/Ft. Worth, and South Central are the three places with the cheapest avocados, with prices ranging between \$1.08 and \$1.12. On the other hand, New York, Hartford/Springfield, and San Francisco are the three places with the most expensive avocados, with prices ranging between \$1.68 and \$1.77.
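Reading distances from the horizontal line by eye is imprecise, so it can help to compute each location's difference from the national figure directly. Here is a minimal sketch using made-up averages shaped like avg_price_by_location. The numbers and the names toy_avg, diff_from_national, and pct_diff are illustrative assumptions.

```python
import pandas as pd

# Hypothetical averages (made-up numbers), shaped like avg_price_by_location.
toy_avg = pd.DataFrame({
    "geography": ["Houston", "New York", "Total U.S."],
    "average_price": [1.10, 1.70, 1.35],
})

# Pull out the national figure, then keep only the individual locations.
is_national = toy_avg["geography"] == "Total U.S."
national_avg = toy_avg.loc[is_national, "average_price"].iloc[0]
regions = toy_avg[~is_national].copy()

# Difference from the national figure, in dollars and as a percentage.
regions["diff_from_national"] = regions["average_price"] - national_avg
regions["pct_diff"] = 100 * regions["diff_from_national"] / national_avg
print(regions.sort_values("diff_from_national"))
```

Negative values mark locations that are cheaper than the national average.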
The graph above doesn't take into account whether the avocados are organic or conventional. Organic and conventional avocados usually have different prices, so we will find the average avocado price for each geographic location separately for each type.
# finding the average for each avocado type and each geographic location
newdf = df.groupby(['geography','type'])['average_price'].mean().reset_index()
newdf
# making a dataframe with only organic avocado prices
organic_newdf = newdf.loc[newdf['type'] == 'organic']
organic_newdf = organic_newdf.sort_values(by=['average_price'])
# making a dataframe with only conventional avocado prices
conventional_newdf = newdf.loc[newdf['type'] == 'conventional']
conventional_newdf = conventional_newdf.sort_values(by=['average_price'])
print('organic df\n', organic_newdf.head())
print('conventional df\n', conventional_newdf.head())
# getting the average organic avocado price of 'Total U.S.'
organic_avg = \
organic_newdf.loc[organic_newdf['geography'] == 'Total U.S.','average_price'].\
iloc[0]
organic_avg
# getting the average conventional avocado price of 'Total U.S.'
conventional_avg = \
conventional_newdf.loc[conventional_newdf['geography'] == 'Total U.S.','average_price'].\
iloc[0]
conventional_avg
# dropping 'Total U.S.' from both datasets
organic_newdf = organic_newdf[organic_newdf.geography != 'Total U.S.']
conventional_newdf = conventional_newdf[conventional_newdf.geography != 'Total U.S.']
# plotting average organic avocado prices for each Geographic Location
plt.figure(figsize=(25, 15))
sns.barplot(x='geography',y='average_price', data=organic_newdf)
plt.title('Average Organic Avocado Prices For Each Geographic Area', fontsize=23)
# making x-axis labels vertical to take less space
plt.xticks(rotation=90, size=18)
plt.yticks(size=18)
plt.xlabel('Geography', fontsize=20)
plt.ylabel('Average Price', fontsize=20)
plt.axhline(organic_avg)
# plotting average conventional avocado prices for each Geographic location
plt.figure(figsize=(25, 15))
sns.barplot(x='geography',y='average_price', data=conventional_newdf)
plt.title('Average Conventional Avocado Prices For Each Geographic Area', fontsize=23)
# making x-axis labels vertical to take less space
plt.xticks(rotation=90, size=18)
plt.yticks(size=18)
plt.xlabel('Geography', fontsize=20)
plt.ylabel('Average Price', fontsize=20)
plt.axhline(conventional_avg)
We can clearly see from the graphs above that organic avocados tend to be more expensive than conventional avocados. The highest organic avocado price we see is over \$2 while the highest conventional avocado price we see is around \$1.40.
From the graphs above, it looks like New York, Hartford/Springfield, and San Francisco are still the three places with the most expensive avocados, both organic and conventional. For organic avocados, San Francisco and Hartford/Springfield have a noticeably higher average price than the U.S. as a whole, at around \$2.50, while New York's average price is around \$2. For conventional avocados, the average prices in New York, Hartford/Springfield, and San Francisco are about the same, at around \$1.40.
When it comes to the cheapest organic avocados, Dallas/Ft. Worth, Houston, and South Central are still the top three. Their prices seem to be around \$1.30 to \$1.35.
When it comes to the cheapest conventional avocados, Phoenix/Tucson, Houston, and Dallas/Ft. Worth are the top three, with prices that appear to range from \$0.81 to \$0.85. South Central had the third cheapest average price for organic avocados, but for conventional avocados it has the fourth cheapest price at around \$0.90.
One interesting detail is that Phoenix/Tucson is on the more expensive end for organic avocados, with an average organic price of around \$1.70, yet it has the cheapest price for conventional avocados.
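A side-by-side table makes contrasts like this easier to spot than two separate bar charts. The sketch below pivots a small made-up frame shaped like newdf so that each location has one column per avocado type. The numbers and the organic_premium column name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical per-type averages (made-up numbers), shaped like newdf.
toy_newdf = pd.DataFrame({
    "geography": ["Houston", "Houston", "Phoenix/Tucson", "Phoenix/Tucson"],
    "type": ["conventional", "organic", "conventional", "organic"],
    "average_price": [0.83, 1.32, 0.81, 1.70],
})

# One row per location, one column per avocado type.
wide = toy_newdf.pivot(index="geography", columns="type", values="average_price")

# The organic premium: how much more organic costs than conventional.
wide["organic_premium"] = wide["organic"] - wide["conventional"]
print(wide.sort_values("organic_premium", ascending=False))
```

Locations with a large premium are cheap for one type but not the other.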
It seems like avid avocado consumers should consider living in Dallas/Ft. Worth, Houston, or South Central because these locations consistently have cheaper avocados than other locations. They can also consider living in Phoenix/Tucson if they prefer conventional avocados over organic ones.
New York, Hartford/Springfield, and San Francisco have much higher avocado prices than the average, so you may not be able to afford many avocados there.
We now know which places have the most and least expensive avocados.
Let's now find out how avocado prices and the number of avocados bought change over time across all geographic locations in the data. Our data contains weekly retail information, so we can graph the prices and the total number of avocados bought over time. We'll first do this for all avocados, regardless of whether they're conventional or organic. Then we'll do this for organic avocados, and then for conventional avocados.
# changing date column of dataframe to datetime type
df['date']= pd.to_datetime(df['date'])
# getting the average price of all avocados depending on the date
new_df = df.groupby(df.date)['average_price'].mean().reset_index()
new_df
# plotting average avocado prices over time
plt.figure(figsize=(25, 15))
sns.lineplot(x='date',y='average_price', data=new_df)
plt.title('Average Avocado Prices Over Time', fontsize=23)
plt.xticks(size=18)
plt.yticks(size=18)
plt.xlabel('Date', fontsize=20)
plt.ylabel('Average Price', fontsize=20)
# getting the average number of avocados bought in all geographic locations over time
volume_over_time = df.groupby(df.date)['total_volume'].mean().reset_index()
# plotting number of avocados bought over time
plt.figure(figsize=(25, 15))
ax = sns.lineplot(x='date',y='total_volume', data=volume_over_time)
plt.title('Number of Avocados Sold Over Time', fontsize=23)
plt.xticks(size=18)
plt.yticks(size=18)
plt.xlabel('Date', fontsize=20)
plt.ylabel('Number of Avocados Sold', fontsize=20)
#getting rid of scientific notation
plt.ticklabel_format(style='plain', axis='y')
# splitting the dataframe into two dataframes based on type (organic or conventional)
organic_df = df[(df['type'] == 'organic')]
conventional_df = df[(df['type'] == 'conventional')]
# getting average price over time for organic avocados
organic_price_over_time = organic_df.groupby(organic_df.date)['average_price'].mean()\
.reset_index()
# getting average price over time for conventional avocados
conventional_price_over_time = conventional_df.groupby(conventional_df.date)['average_price']\
.mean().reset_index()
# plotting average organic prices over time
plt.figure(figsize=(25, 15))
sns.lineplot(x='date',y='average_price', data=organic_price_over_time)
plt.title('Average Organic Avocado Prices Over Time', fontsize=23)
plt.xticks(size=18)
plt.yticks(size=18)
plt.xlabel('Date', fontsize=20)
plt.ylabel('Average Price', fontsize=20)
# getting the average number of organic avocados bought over time
organic_volume_over_time = organic_df.groupby(organic_df.date)['total_volume'].mean().reset_index()
#plotting number of organic avocados sold over time
plt.figure(figsize=(25, 15))
sns.lineplot(x='date',y='total_volume', data=organic_volume_over_time)
plt.title('Number of Organic Avocados Sold Over Time', fontsize=23)
plt.xticks(size=18)
plt.yticks(size=18)
plt.xlabel('Date', fontsize=20)
plt.ylabel('Number of Avocados Sold', fontsize=20)
#getting rid of scientific notation
plt.ticklabel_format(style='plain', axis='y')
# plotting average conventional avocado price over time
plt.figure(figsize=(25, 15))
sns.lineplot(x='date',y='average_price', data=conventional_price_over_time)
plt.title('Average Conventional Avocado Prices Over Time', fontsize=23)
plt.xticks(size=18)
plt.yticks(size=18)
plt.xlabel('Date', fontsize=20)
plt.ylabel('Average Price', fontsize=20)
# getting the average number of conventional avocados bought over time
conventional_volume_over_time = conventional_df.groupby(conventional_df.date)['total_volume']\
.mean().reset_index()
#plotting number of conventional avocados sold over time
plt.figure(figsize=(25, 15))
sns.lineplot(x='date',y='total_volume', data=conventional_volume_over_time)
plt.title('Number of Conventional Avocados Sold Over Time', fontsize=23)
plt.xticks(size=18)
plt.yticks(size=18)
plt.xlabel('Date', fontsize=20)
plt.ylabel('Number of Avocados Sold', fontsize=20)
#getting rid of scientific notation
plt.ticklabel_format(style='plain', axis='y')
The average avocado prices vary over time. From 2015 to the beginning of 2016, average avocado prices were fairly low, ranging between \$1.10 and \$1.50. Then there was a spike in prices from the middle of 2016 to the end of 2016. There were further spikes from 2017 to 2018 and from 2019 to 2020. In between spikes, prices dip again. The average prices of both conventional and organic avocados seem to follow this trend.
The number of avocados bought appears to increase over time, with organic avocados showing a clearer and steeper increase than conventional avocados. One possible explanation is that more and more people in the U.S. are focusing on living a healthy lifestyle, so more organic avocados are being bought than before because they seem healthier than conventional avocados.
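Weekly lines are noisy, so one way to put a number on the upward trend is to average the weekly values within each year and look at the year-over-year change. This is a minimal sketch on made-up data shaped like volume_over_time. The values and the names toy_volume, yearly, and yearly_growth are illustrative assumptions.

```python
import pandas as pd

# Hypothetical weekly series (made-up numbers), shaped like volume_over_time.
toy_volume = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-04", "2015-06-07", "2016-01-03", "2016-06-05"]),
    "total_volume": [100.0, 120.0, 130.0, 150.0],
})

# Average the weekly values within each calendar year.
yearly = toy_volume.groupby(toy_volume["date"].dt.year)["total_volume"].mean()

# Year-over-year change as a percentage (the first year has no previous year).
yearly_growth = yearly.pct_change() * 100
print(yearly)
print(yearly_growth)
```

When applying this to the real data, keep in mind that a year with only partial coverage is not directly comparable to a full year, because avocado sales are seasonal.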
It looks like more and more avocados are being bought in the U.S., but is there a relation between the prices of avocados and the number of avocados being bought? We can try to predict the number of avocados based on price by using machine learning.
First let's prep the data.
# getting average price and average total volume
avg_price_over_time = df.groupby(df.date)['average_price'].mean().reset_index()
volume_over_time = df.groupby(df.date)['total_volume'].mean().reset_index()
# merging average price and average total volume into one dataframe
merged_df = avg_price_over_time
merged_df = pd.merge(merged_df, volume_over_time,on='date',how='outer')
merged_df
# making a scatterplot of average price vs average total volume
sns.scatterplot(data=merged_df,x='average_price',y='total_volume')
plt.xlabel('Price')
plt.ylabel('Volume')
plt.title('Avocado Prices vs Volume/Number Bought')
From the scatterplot, it looks like the number of avocados bought tends to decrease as the price increases.
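A correlation coefficient puts a number on what the scatterplot suggests. Below is a minimal sketch on made-up price and volume pairs shaped like merged_df (the values are hypothetical), using the Pearson correlation that pandas computes by default.

```python
import pandas as pd

# Hypothetical price/volume pairs (made-up numbers), shaped like merged_df.
toy_merged = pd.DataFrame({
    "average_price": [1.0, 1.2, 1.4, 1.6, 1.8],
    "total_volume": [900.0, 850.0, 700.0, 720.0, 600.0],
})

# Pearson correlation: -1 is a perfect negative linear relationship,
# 0 is no linear relationship, and 1 is a perfect positive one.
pearson_r = toy_merged["average_price"].corr(toy_merged["total_volume"])
print(pearson_r)
```

A value close to -1 would back up the impression that volume falls as price rises, while a value near 0 would suggest the pattern in the plot is weaker than it looks.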
We will use linear regression to see if we can find a predictive relationship between avocado price and number of avocados bought. If you want to learn how to do linear regression using sklearn, use this link: https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
# splitting data into training and testing datasets (fixed seed so the split is reproducible)
X = merged_df[['average_price']]
y = merged_df['total_volume']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# doing linear regression
lg = LinearRegression()
lg.fit(X_train, y_train)
print('score: ', lg.score(X_test, y_test))
The score printed above is the coefficient of determination ($r^2$) on the test set. It is very low, which means that our model isn't very accurate. Let's see what the regression looks like.
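For a regression model, the value returned by score is the coefficient of determination, $r^2$: one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on made-up numbers (not our avocado data) and compares it with sklearn's r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical observed and predicted values (made-up numbers).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([4.0, 4.0, 8.0, 8.0])

# r^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(manual_r2, r2_score(y_true, y_pred))
```

A value of 1 means the predictions match the data exactly, 0 means the model does no better than always predicting the mean, and negative values mean it does worse than that.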
# graphing the linear regression
# predicting from a dataframe with the same column name the model was trained on
prediction = lg.predict(merged_df[['average_price']])
plt.figure(figsize=(10,8))
plt.plot(merged_df.average_price, prediction, label='Linear Regression')
plt.scatter(merged_df.average_price, merged_df.total_volume, color='black')
plt.title('Average Avocado Price vs Total Volume of Avocados Bought')
plt.xlabel('Average Price')
plt.ylabel('Total Volume/Number of Avocados Bought')
# showing the label passed to plt.plot
plt.legend()
# coefficient of determination over the full dataset (training and test rows), a 1 is a perfect prediction
print('coefficient of determination: ', r2_score(merged_df.total_volume, prediction))
The regression line doesn't look like it fits that well either and our $r^2$ value/coefficient of determination also isn't that high, which means that the model doesn't fit our data well.
We can try polynomial regression to see if it fits our data better. Read this to learn more about polynomial regression: https://towardsdatascience.com/polynomial-regression-bbe8b9d97491.
from sklearn.preprocessing import PolynomialFeatures
# polynomial regression using a degree of 2
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(X_train)
lg.fit(x_poly,y_train)
# sorting by price so the fitted curve is drawn from left to right instead of zigzagging
sorted_df = merged_df.sort_values(by='average_price')
# reusing the already fitted transformer (transform, not fit_transform) for prediction
prediction = lg.predict(poly.transform(sorted_df[['average_price']]))
# plotting the regression
plt.figure(figsize=(10,8))
plt.plot(sorted_df.average_price, prediction, label='Polynomial Regression (degree 2)')
plt.scatter(merged_df.average_price, merged_df.total_volume, color='black')
plt.title('Average Avocado Price vs Total Volume of Avocados Bought')
plt.xlabel('Average Price')
plt.ylabel('Total Volume/Number of Avocados Bought')
plt.legend()
# coefficient of determination, a 1 is a perfect prediction
print('coefficient of determination: ', r2_score(sorted_df.total_volume, prediction))
The above seems a bit better and our $r^2$ increased a bit, but it's still not great. Let's try a degree of 3 to see if it fits our data even better.
# polynomial regression using degree = 3
poly = PolynomialFeatures(degree=3)
x_poly = poly.fit_transform(X_train)
lg.fit(x_poly,y_train)
# sorting by price so the fitted curve is drawn from left to right instead of zigzagging
sorted_df = merged_df.sort_values(by='average_price')
# reusing the already fitted transformer (transform, not fit_transform) for prediction
prediction = lg.predict(poly.transform(sorted_df[['average_price']]))
# plotting the regression
plt.figure(figsize=(10,8))
plt.plot(sorted_df.average_price, prediction, label='Polynomial Regression (degree 3)')
plt.scatter(merged_df.average_price, merged_df.total_volume, color='black')
plt.title('Average Avocado Price vs Total Volume of Avocados Bought')
plt.xlabel('Average Price')
plt.ylabel('Total Volume/Number of Avocados Bought')
plt.legend()
# coefficient of determination, a 1 is a perfect prediction
print('coefficient of determination: ', r2_score(sorted_df.total_volume, prediction))
Our $r^2$ improved a bit more. Now let's try a degree of 4.
# polynomial regression with degree = 4
poly = PolynomialFeatures(degree=4)
x_poly = poly.fit_transform(X_train)
lg.fit(x_poly,y_train)
# sorting by price so the fitted curve is drawn from left to right instead of zigzagging
sorted_df = merged_df.sort_values(by='average_price')
# reusing the already fitted transformer (transform, not fit_transform) for prediction
prediction = lg.predict(poly.transform(sorted_df[['average_price']]))
# plotting the regression
plt.figure(figsize=(10,8))
plt.plot(sorted_df.average_price, prediction, label='Polynomial Regression (degree 4)')
plt.scatter(merged_df.average_price, merged_df.total_volume, color='black')
plt.title('Average Avocado Price vs Total Volume of Avocados Bought')
plt.xlabel('Average Price')
plt.ylabel('Total Volume/Number of Avocados Bought')
plt.legend()
# coefficient of determination, a 1 is a perfect prediction
print('coefficient of determination: ', r2_score(sorted_df.total_volume, prediction))
Our $r^2$ improved again but by a very small amount. If we keep increasing the degree, the curve will follow the data it was trained on more and more closely. However, doing that leads to over-fitting: the model will describe this dataset very well but will fail to predict data that it hasn't seen before.
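One common way to choose a degree without over-fitting is cross-validation, which scores each candidate on data held out from training. The following is a minimal sketch on synthetic data (a noisy quadratic, not our avocado data). The pipeline, the range of degrees, and the 5-fold setting are illustrative choices rather than part of the analysis above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (not the avocado data): a noisy quadratic relationship.
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 2.0, size=(120, 1))
y = 5.0 - 3.0 * X[:, 0] + 2.0 * X[:, 0] ** 2 + rng.normal(0.0, 0.1, size=120)

# Mean 5-fold cross-validated r^2 for each candidate degree.
cv_scores = {}
for degree in range(1, 7):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    cv_scores[degree] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()

for degree, score in cv_scores.items():
    print(degree, round(score, 3))
```

If the held-out score stops improving, or starts to fall, as the degree grows, the extra flexibility is fitting noise rather than signal.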
Our above models don't have a very high $r^2$ value so we most likely cannot accurately predict how many avocados are bought based only on price.
After analyzing and visualizing the data, we came up with some insights:
Dallas/Ft. Worth, Houston, and South Central seem to have cheaper avocados than other locations in the U.S., while New York, Hartford/Springfield, and San Francisco have more expensive avocados. Phoenix/Tucson also has low prices for conventional avocados, but its organic avocados are fairly expensive. We can conclude that Dallas/Ft. Worth, Houston, and South Central are the best places to live if you don't want to go bankrupt from buying avocados.
Avocado prices vary over time. Sometimes the prices spike and other times their prices dip low.
The number of avocados being bought over time is increasing. Even though both organic and conventional avocado consumption are increasing, organic avocados seem to be having a more consistent and higher increase in consumption than conventional avocados.
When plotting avocado prices against the number of avocados bought, it looks like fewer avocados are bought when prices are high. However, we cannot accurately predict how many avocados are bought based solely on the price.
Overall, it seems like avocados are not getting any less popular. However, if we want to predict how many avocados will be bought, we need to take into consideration more parameters than just the price of the avocados.
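As a starting point for taking more parameters into account, the sketch below fits a linear regression on three features instead of one: the price, an indicator for organic avocados, and the year. The rows are made up and only the column names come from our data. The choice of features and the is_organic name are assumptions for illustration, not a tested model.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical rows (made-up numbers) using the dataset's column names.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-06", "2019-01-06", "2019-07-07",
                            "2019-07-07", "2020-01-05", "2020-01-05"]),
    "type": ["conventional", "organic"] * 3,
    "average_price": [1.00, 1.50, 1.30, 1.80, 1.05, 1.55],
    "total_volume": [9000.0, 400.0, 8000.0, 350.0, 9500.0, 450.0],
})

# Build a feature matrix: price, an organic indicator, and the year.
features = pd.DataFrame({
    "average_price": toy["average_price"],
    "is_organic": (toy["type"] == "organic").astype(int),
    "year": toy["date"].dt.year,
})

# Fit one linear model on all three features at once.
model = LinearRegression().fit(features, toy["total_volume"])
print(dict(zip(features.columns, model.coef_)))
```

On the real data, a model like this would still need a train/test split and an $r^2$ check before drawing any conclusions from it.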