Analyzing Avocado Prices and Consumption in the U.S.

Nhi Ngo

In the past decade, avocados have soared in popularity. From avocado salad to avocado toast and even avocado face masks, people seem to be buying more and more avocados and using them in many different ways. This rise in popularity has been linked to the wellness movement as more people focus on their health. If you want to learn more about the cultural history of avocados, check out this article by the BBC: https://www.bbc.co.uk/bbcthree/article/87a56e5c-6d41-4495-9e22-523efb6b4cb0#:~:text=1920s%3A%20A%20PR%20campaign%20starts&text=By%20the%20late%2019th%20century,but%20they%20weren't%20selling.

In this tutorial, we will use data science to analyze the trends in avocado prices and the number of avocados purchased.

Getting the Data

Let's look at this data from Timofei Kornev on Kaggle (https://www.kaggle.com/timmate/avocado-prices-2020). This data was collected from the Hass Avocado Board's website (https://hassavocadoboard.com/). It contains weekly scan data of Hass avocados from the years 2015 to 2020.

First, we'll read in the data and convert it to a dataframe.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# creating dataframe from data
df = pd.read_csv("avocado-updated-2020.csv", sep=',')
df.head()

Our dataframe contains many columns, but the columns we will be focusing on are:

date: date of the observation
average_price: the average price of a single avocado
total_volume: total number of avocados sold
type: whether the avocado is organically or conventionally grown
geography: the city or region of the observation
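To make the column list above concrete, here is a minimal sketch of loading only those focus columns. It reads a three-row in-memory sample copied from the head of the dataset instead of the real file, and the names `sample_csv`, `focus_columns`, and `sample_df` are illustrative assumptions, not part of the original notebook.

```python
import io
import pandas as pd

# Three rows copied from the head of the dataset, trimmed to a few columns.
sample_csv = io.StringIO(
    "date,average_price,total_volume,type,year,geography\n"
    "2015-01-04,1.22,40873.28,conventional,2015,Albany\n"
    "2015-01-04,1.79,1373.95,organic,2015,Albany\n"
    "2015-01-04,1.00,435021.49,conventional,2015,Atlanta\n"
)

# Hypothetical helper list: the columns this tutorial focuses on.
focus_columns = ["date", "average_price", "total_volume", "type", "geography"]

# usecols drops everything else; parse_dates converts 'date' while reading.
sample_df = pd.read_csv(sample_csv, usecols=focus_columns, parse_dates=["date"])
print(sample_df.dtypes)
```

Parsing the date while reading is optional; the tutorial converts the column later with `pd.to_datetime`, which works just as well.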

Before doing EDA, we should check whether any data is missing. Missing data can cause problems during EDA, so we need to decide how to deal with it if it exists. To learn more about missing data and ways to handle it, read https://www.mastersindatascience.org/learning/how-to-deal-with-missing-data/.

# checking if dataframe has any null values
df.isnull().values.any()

False

The dataframe has no missing data so we can go ahead and start doing EDA.
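The check above only says whether any value is missing. If it had returned True, a per-column count would be a natural next step. The sketch below uses a toy frame with one deliberately missing price; the values and the names `toy`, `missing_per_column`, and `complete_rows` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing price (illustrative values only).
toy = pd.DataFrame({
    "average_price": [1.22, np.nan, 1.00],
    "total_volume": [40873.28, 1373.95, 435021.49],
})

# Count missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# One simple way to handle them: drop every row that has a missing value.
complete_rows = toy.dropna()
print(len(complete_rows))
```

Dropping rows is only one option; whether it is appropriate depends on how much data is missing and why.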

Exploratory Data Analysis (EDA) and Visualization

Our dataframe is ready, so now we will do some data analysis.

First things first: say you're an avid avocado consumer. Where in the U.S. should you live if you don't want to break the bank buying avocados? Using this data, we can find out which places in the U.S. have the cheapest and the most expensive avocados.

# finding the average price of avocados for each geographic location
avg_price_by_location = df.groupby(['geography'])['average_price'].mean().reset_index()
avg_price_by_location = avg_price_by_location.sort_values(by=['average_price'])
avg_price_by_location

The data has an entry for 'Total U.S.'. We can use the average price associated with 'Total U.S.' to see how far the avocado prices for the different locations are from the national average.

# getting the total average price from 'Total U.S.'
average_total = \
avg_price_by_location.loc[avg_price_by_location['geography'] == 'Total U.S.','average_price'].\
iloc[0]

average_total

1.3299460431654677

# dropping 'Total U.S.' from dataset
avg_price_by_location = avg_price_by_location[avg_price_by_location.geography != 'Total U.S.']

# plotting average avocado prices for each Geographic Location

plt.figure(figsize=(25, 15))
sns.barplot(x='geography',y='average_price', data=avg_price_by_location)
plt.title('Average Avocado Prices For Each Geographic Area', fontsize=23)

# making x-axis labels vertical to take less space
plt.xticks(rotation=90, size=18)

plt.yticks(size=18)
plt.xlabel('Geography', fontsize=20)
plt.ylabel('Average Price', fontsize=20)

plt.axhline(average_total)


The graph above shows the average avocado price for each geographic location. The blue horizontal line represents the average avocado price in the U.S. From the graph, Houston, Dallas/Ft. Worth, and South Central are the three places with the cheapest avocados, with average prices between \$1.08 and \$1.12. On the other hand, New York, Hartford/Springfield, and San Francisco are the three most expensive places, with average prices between \$1.68 and \$1.77.
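The distance from the horizontal line can also be computed directly. The following is a small sketch using four of the averages from the table produced by the groupby and the 'Total U.S.' average; the column names `diff_from_us` and `pct_diff_from_us` are assumptions introduced here.

```python
import pandas as pd

# Averages copied from the per-location table in this tutorial.
prices = pd.DataFrame({
    "geography": ["Houston", "Dallas/Ft. Worth", "New York", "San Francisco"],
    "average_price": [1.081817, 1.088201, 1.678309, 1.771871],
})
national_average = 1.329946  # the 'Total U.S.' average price

# Absolute and relative difference from the national average.
prices["diff_from_us"] = prices["average_price"] - national_average
prices["pct_diff_from_us"] = 100 * prices["diff_from_us"] / national_average
print(prices.round(3))
```

Negative differences mark locations below the national average and positive ones mark locations above it.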

The graph above doesn't take into consideration if the avocados are organic or conventional. Organic and conventional avocados usually have different prices from each other, so we will find the average avocado prices for each geographic location with the type of avocado taken into consideration.

# finding the average for each avocado type and each geographic location
newdf = df.groupby(['geography','type'])['average_price'].mean().reset_index()
newdf

# making a dataframe with only organic avocado prices
organic_newdf = newdf.loc[newdf['type'] == 'organic']
organic_newdf = organic_newdf.sort_values(by=['average_price'])

# making a dataframe with only conventional avocado prices
conventional_newdf = newdf.loc[newdf['type'] == 'conventional']
conventional_newdf = conventional_newdf.sort_values(by=['average_price'])


print('organic df\n', organic_newdf.head())
print('conventional df\n', conventional_newdf.head())

organic df
            geography     type  average_price
23  Dallas/Ft. Worth  organic       1.335647
37           Houston  organic       1.349964
91     South Central  organic       1.361547
27           Detroit  organic       1.410755
79           Roanoke  organic       1.414496
conventional df
                geography          type  average_price
66        Phoenix/Tucson  conventional       0.776115
36               Houston  conventional       0.813669
22      Dallas/Ft. Worth  conventional       0.840755
90         South Central  conventional       0.867950
106  West Tex/New Mexico  conventional       0.878058

# getting the average organic avocado price of 'Total U.S.'
organic_avg =  \
organic_newdf.loc[organic_newdf['geography'] == 'Total U.S.','average_price'].\
iloc[0]

organic_avg

1.56

# getting the average conventional avocado price of 'Total U.S.'
conventional_avg = \
conventional_newdf.loc[conventional_newdf['geography'] == 'Total U.S.','average_price'].\
iloc[0]

conventional_avg

1.0998920863309354

# dropping 'Total U.S.' from both datasets
organic_newdf = organic_newdf[organic_newdf.geography != 'Total U.S.']
conventional_newdf = conventional_newdf[conventional_newdf.geography != 'Total U.S.']

# plotting average organic avocado prices for each Geographic Location

plt.figure(figsize=(25, 15))
sns.barplot(x='geography',y='average_price', data=organic_newdf)
plt.title('Average Organic Avocado Prices For Each Geographic Area', fontsize=23)
# making x-axis labels vertical to take less space
plt.xticks(rotation=90, size=18)
plt.yticks(size=18)
plt.xlabel('Geography', fontsize=20)
plt.ylabel('Average Price', fontsize=20)

plt.axhline(organic_avg)


# plotting average conventional avocado prices for each Geographic location

plt.figure(figsize=(25, 15))
sns.barplot(x='geography',y='average_price', data=conventional_newdf)
plt.title('Average Conventional Avocado Prices For Each Geographic Area', fontsize=23)
# making x-axis labels vertical to take less space
plt.xticks(rotation=90, size=18)
plt.yticks(size=18)
plt.xlabel('Geography', fontsize=20)
plt.ylabel('Average Price', fontsize=20)

plt.axhline(conventional_avg)


We can clearly see from the graphs above that organic avocados tend to be more expensive than conventional avocados. The highest organic avocado price we see is over \$2, while the highest conventional avocado price we see is around \$1.40.

Most Expensive Avocados

From the graphs above, it looks like New York, Hartford/Springfield, and San Francisco are still the three places with the most expensive avocados, both organic and conventional. For organic avocados, San Francisco and Hartford/Springfield have a significantly higher average price than the U.S. as a whole, with prices that look to be around \$2.50, while New York's average price is around \$2. For conventional avocados, the average prices in New York, Hartford/Springfield, and San Francisco are all around \$1.40.

Cheapest Avocados

When it comes to the cheapest organic avocados, Dallas/Ft. Worth, Houston, and South Central are still the top three. Their average prices range from about \$1.34 to \$1.36.

When it comes to the cheapest conventional avocados, Phoenix/Tucson, Houston, and Dallas/Ft. Worth are the top three, with average prices ranging from about \$0.78 to \$0.84. South Central had the third cheapest average price for organic avocados, but for conventional avocados it has the 4th cheapest price at around \$0.87.

Something interesting in the data is that Phoenix/Tucson is on the more expensive end when it comes to organic avocados, with an average organic price of around \$1.70, yet it has the cheapest price when it comes to conventional avocados.
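The gap between organic and conventional prices clearly differs by location. One way to quantify it is to pivot the grouped table so that each type becomes a column and then subtract. This sketch uses six rows from the grouped table in this tutorial; the names `by_type`, `wide`, and `organic_premium` are assumptions for illustration.

```python
import pandas as pd

# Rows copied from the (geography, type) averages computed in this tutorial.
by_type = pd.DataFrame({
    "geography": ["Albany", "Albany", "Atlanta", "Atlanta",
                  "West Tex/New Mexico", "West Tex/New Mexico"],
    "type": ["conventional", "organic"] * 3,
    "average_price": [1.314101, 1.698273, 1.052410, 1.573273, 0.878058, 1.658727],
})

# One row per geography, one column per avocado type.
wide = by_type.pivot(index="geography", columns="type", values="average_price")

# How much more an organic avocado costs than a conventional one.
wide["organic_premium"] = wide["organic"] - wide["conventional"]
print(wide.sort_values("organic_premium", ascending=False))
```

Applied to the full grouped table, the same steps would rank every location by its organic premium, which would likely put places like Phoenix/Tucson near the top.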

Insights

It seems like avid avocado consumers should consider living in Dallas/Ft. Worth, Houston, and South Central because they consistently have cheaper avocado prices compared to other locations. They can also consider living in Phoenix/Tucson if they prefer to buy conventional avocados over organic avocados.

New York, Hartford/Springfield, and San Francisco have much higher avocado prices than the average, so you may not be able to afford many avocados there.

We now know which places have the most and least expensive avocados.

Let's now find out how avocado prices and the number of avocados bought change over time for all geographic locations in the data. Our data contains weekly retail information so we can graph the prices over time and total number of avocados bought over time. We'll first do this for all avocados, disregarding if they're conventional or organic. Then we'll do this for organic avocados, and then for conventional avocados.

# changing date column of dataframe to datetime type
df['date']= pd.to_datetime(df['date'])

# getting the average price of all avocados depending on the date
new_df = df.groupby(df.date)['average_price'].mean().reset_index()
new_df

# plotting average avocado prices over time

plt.figure(figsize=(25, 15))
sns.lineplot(x='date',y='average_price', data=new_df)
plt.title('Average Avocado Prices Over Time', fontsize=23)

plt.xticks(size=18)
plt.yticks(size=18)
plt.xlabel('Date', fontsize=20)
plt.ylabel('Average Price', fontsize=20)


# getting the average number of avocados bought in all geographic locations over time
volume_over_time = df.groupby(df.date)['total_volume'].mean().reset_index()

# plotting number of avocados bought over time
plt.figure(figsize=(25, 15))
ax = sns.lineplot(x='date',y='total_volume', data=volume_over_time)
plt.title('Number of Avocados Sold Over Time', fontsize=23)

plt.xticks(size=18)
plt.yticks(size=18)
plt.xlabel('Date', fontsize=20)
plt.ylabel('Number of Avocados Sold', fontsize=20)

#getting rid of scientific notation
plt.ticklabel_format(style='plain', axis='y')

# splitting the dataframe into two dataframes based on type (organic or conventional)
organic_df = df[(df['type'] == 'organic')]
conventional_df = df[(df['type'] == 'conventional')]

# getting average price over time for organic avocados
organic_price_over_time = organic_df.groupby(organic_df.date)['average_price'].mean()\
.reset_index()

# getting average price over time for conventional avocados
conventional_price_over_time = conventional_df.groupby(conventional_df.date)['average_price']\
.mean().reset_index()

# plotting average organic prices over time

plt.figure(figsize=(25, 15))
sns.lineplot(x='date',y='average_price', data=organic_price_over_time)
plt.title('Average Organic Avocado Prices Over Time', fontsize=23)

plt.xticks(size=18)
plt.yticks(size=18)
plt.xlabel('Date', fontsize=20)
plt.ylabel('Average Price', fontsize=20)


# getting the average number of organic avocados bought over time
organic_volume_over_time = organic_df.groupby(organic_df.date)['total_volume'].mean().reset_index()

#plotting number of organic avocados sold over time
plt.figure(figsize=(25, 15))
sns.lineplot(x='date',y='total_volume', data=organic_volume_over_time)
plt.title('Number of Organic Avocados Sold Over Time', fontsize=23)

plt.xticks(size=18)
plt.yticks(size=18)
plt.xlabel('Date', fontsize=20)
plt.ylabel('Number of Avocados Sold', fontsize=20)

#getting rid of scientific notation
plt.ticklabel_format(style='plain', axis='y')

# plotting average conventional avocado price over time

plt.figure(figsize=(25, 15))
sns.lineplot(x='date',y='average_price', data=conventional_price_over_time)
plt.title('Average Conventional Avocado Prices Over Time', fontsize=23)

plt.xticks(size=18)
plt.yticks(size=18)
plt.xlabel('Date', fontsize=20)
plt.ylabel('Average Price', fontsize=20)


# getting the average number of conventional avocados bought over time
conventional_volume_over_time = conventional_df.groupby(conventional_df.date)['total_volume']\
.mean().reset_index()

#plotting number of conventional avocados sold over time
plt.figure(figsize=(25, 15))
sns.lineplot(x='date',y='total_volume', data=conventional_volume_over_time)
plt.title('Number of Conventional Avocados Sold Over Time', fontsize=23)

plt.xticks(size=18)
plt.yticks(size=18)
plt.xlabel('Date', fontsize=20)
plt.ylabel('Number of Avocados Sold', fontsize=20)

#getting rid of scientific notation
plt.ticklabel_format(style='plain', axis='y')

Insights

The average avocado prices vary over time. From 2015 to the beginning of 2016, average avocado prices were fairly low, ranging from \$1.10 to \$1.50. Then there was a spike in prices from the middle of 2016 to the end of 2016. There was another spike in avocado prices from 2017 to 2018 and from 2019 to 2020. In between spikes, the prices dip low again. The average prices for both conventional and organic avocados seem to follow this trend.
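Weekly prices are noisy, so spikes and dips can be easier to see after smoothing. A rolling mean is one common option. The sketch below applies a 3-week window to the first five weekly averages from this tutorial; the window size and the column name `rolling_mean_3w` are arbitrary choices made for illustration.

```python
import pandas as pd

# First five weekly averages from the price-over-time table.
weekly = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-04", "2015-01-11", "2015-01-18",
                            "2015-01-25", "2015-02-01"]),
    "average_price": [1.301296, 1.370648, 1.391111, 1.397130, 1.247037],
})

# Mean of the current week and the two weeks before it.
# The first two rows are NaN because a full window is not yet available.
weekly["rolling_mean_3w"] = weekly["average_price"].rolling(window=3).mean()
print(weekly)
```

On the full series, a longer window (for example 8 or 12 weeks) would probably separate the multi-month spikes from week-to-week noise more clearly.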

When it comes to the number of avocados bought, it looks like it's increasing over time, with organic avocados seeing a clearer and steeper increase than conventional avocados. This could be because more and more people in the U.S. are focusing on living a healthy lifestyle, so more organic avocados are being bought than before because they seem healthier than conventional avocados.

Predicting Number of Avocados Bought

It looks like more and more avocados are being bought in the U.S., but is there a relation between the prices of avocados and the number of avocados being bought? We can try to predict the number of avocados based on price by using machine learning.

First let's prep the data.

# getting average price and average total volume
avg_price_over_time = df.groupby(df.date)['average_price'].mean().reset_index()
volume_over_time = df.groupby(df.date)['total_volume'].mean().reset_index()

# merging average price and average total volume into one dataframe
merged_df = avg_price_over_time
merged_df = pd.merge(merged_df, volume_over_time,on='date',how='outer')

merged_df

# making a scatterplot of average price vs average total volume
sns.scatterplot(data=merged_df,x='average_price',y='total_volume')
plt.xlabel('Price')
plt.ylabel('Volume')
plt.title('Avocado Prices vs Volume/Number Bought')


From the scatterplot, it looks like the number of avocados bought tends to decrease when the price increases.
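A visual impression like this can be backed by a number, for example the Pearson correlation coefficient between price and volume. The sketch below computes it on the ten rows of the merged table that are displayed in this tutorial (volumes rounded as displayed). Ten rows from the start and end of the series are not representative, so treat the result as a demonstration of the call, not as a finding; the names `sample` and `r` are assumptions.

```python
import pandas as pd

# The ten rows of the merged table shown in this tutorial.
sample = pd.DataFrame({
    "average_price": [1.301296, 1.370648, 1.391111, 1.397130, 1.247037,
                      1.386204, 1.385556, 1.304815, 1.329537, 1.371111],
    "total_volume": [784021.6, 727368.6, 725822.1, 708021.1, 1106048.0,
                     1279173.0, 1326299.0, 1572185.0, 1489704.0, 1318729.0],
})

# Pearson correlation: -1 is a perfect negative linear relation, +1 a perfect positive one.
r = sample["average_price"].corr(sample["total_volume"])
print(round(r, 3))
```

On the full data, the same call would be `merged_df['average_price'].corr(merged_df['total_volume'])`.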

We will use linear regression to see if we can find a predictive relationship between avocado price and number of avocados bought. If you want to learn how to do linear regression using sklearn, use this link: https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

#Splitting data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(merged_df.drop(['total_volume','date'], axis='columns'), \
                                                   merged_df.total_volume, test_size=0.2)

# doing linear regression
lg = LinearRegression()
lg.fit(X_train, y_train)
print('score: ', lg.score(X_test, y_test))

score:  0.28142537037501625

The score printed above is the coefficient of determination ($r^2$) on the held-out test set. It is very low, which means that our model isn't very accurate. Let's see what the regression looks like.
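Because `train_test_split` is called without a fixed `random_state`, the score changes from run to run. Cross-validation gives a steadier estimate by averaging over several splits. The sketch below uses synthetic price and volume values generated only for illustration (they are not from the avocado dataset), and the noise level and slope are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: volume falls with price, plus noise (NOT real avocado data).
rng = np.random.default_rng(0)
price = rng.uniform(1.0, 1.8, size=200).reshape(-1, 1)
volume = 2_000_000 - 800_000 * price.ravel() + rng.normal(0, 150_000, size=200)

# Five-fold cross-validation; each fold reports r^2 on its held-out part.
scores = cross_val_score(LinearRegression(), price, volume, cv=5, scoring="r2")
print(scores.round(3), scores.mean().round(3))
```

With the real data, `price` would be `merged_df[['average_price']]` and `volume` would be `merged_df['total_volume']`.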

# graphing the linear regression

prediction = lg.predict(merged_df.average_price.values.reshape(-1,1))

plt.figure(figsize=(10,8))
plt.plot(merged_df.average_price, prediction, label='Linear Regression')
plt.scatter(merged_df.average_price, merged_df.total_volume, color='black')
plt.title('Average Avocado Price vs Total Volume of Avocados Bought')
plt.xlabel('Average Price')
plt.ylabel('Total Volume/Number of Avocados Bought')


# coefficient of determination, a 1 is a perfect prediction
print('coefficient of determinations: ', r2_score(merged_df.total_volume, prediction))

coefficient of determinations:  0.2266479665513883

The regression line doesn't look like it fits that well either and our $r^2$ value/coefficient of determination also isn't that high, which means that the model doesn't fit our data well.

We can try polynomial regression to see if it fits our data better. Read this to learn more about polynomial regression: https://towardsdatascience.com/polynomial-regression-bbe8b9d97491.

from sklearn.preprocessing import PolynomialFeatures

# polynomial regression using a degree of 2
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(X_train)

lg.fit(x_poly,y_train)


# plotting the regression

# reuse the already-fitted transformer instead of refitting it
prediction = lg.predict(poly.transform(merged_df.average_price.values.reshape(-1,1)))

plt.figure(figsize=(10,8))
plt.plot(merged_df.average_price.values.reshape(-1,1), prediction, label=' Regression')
plt.scatter(merged_df.average_price, merged_df.total_volume, color='black')
plt.title('Average Avocado Price vs Total Volume of Avocados Bought')
plt.xlabel('Average Price')
plt.ylabel('Total Volume/Number of Avocados Bought')


# coefficient of determination, a 1 is a perfect prediction
print('coefficient of determinations: ', r2_score(merged_df.total_volume, prediction))

coefficient of determinations:  0.24957235731778293

The fit above seems a bit better and our $r^2$ increased a bit, but it's still not great. Let's try a degree of 3 to see if the model fits our data even better.

# polynomial regression using degree = 3
poly = PolynomialFeatures(degree=3)
x_poly = poly.fit_transform(X_train)

lg.fit(x_poly,y_train)

# reuse the already-fitted transformer instead of refitting it
prediction = lg.predict(poly.transform(merged_df.average_price.values.reshape(-1,1)))

# plotting the regression
plt.figure(figsize=(10,8))
plt.plot(merged_df.average_price.values.reshape(-1,1), prediction, label=' Regression')
plt.scatter(merged_df.average_price, merged_df.total_volume, color='black')
plt.title('Average Avocado Price vs Total Volume of Avocados Bought')
plt.xlabel('Average Price')
plt.ylabel('Total Volume/Number of Avocados Bought')


# coefficient of determination, a 1 is a perfect prediction
print('coefficient of determinations: ', r2_score(merged_df.total_volume, prediction))

coefficient of determinations:  0.28631737222139697

Our $r^2$ improved a bit more. Now let's try a degree of 4.

# polynomial regression with degree = 4
poly = PolynomialFeatures(degree=4)
x_poly = poly.fit_transform(X_train)

lg.fit(x_poly,y_train)

# reuse the already-fitted transformer instead of refitting it
prediction = lg.predict(poly.transform(merged_df.average_price.values.reshape(-1,1)))

# plotting the regression
plt.figure(figsize=(10,8))
plt.plot(merged_df.average_price.values.reshape(-1,1), prediction, label=' Regression')
plt.scatter(merged_df.average_price, merged_df.total_volume, color='black')
plt.title('Average Avocado Price vs Total Volume of Avocados Bought')
plt.xlabel('Average Price')
plt.ylabel('Total Volume/Number of Avocados Bought')


# coefficient of determination, a 1 is a perfect prediction
print('coefficient of determinations: ', r2_score(merged_df.total_volume, prediction))

coefficient of determinations:  0.2870786886861463

Our $r^2$ improved again, but only by a very small amount. If we keep increasing the degree, the model will fit the data we trained it on more and more closely. However, doing that leads to over-fitting: the model predicts this dataset very well but fails to predict data that it hasn't seen before.
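Over-fitting is easiest to see by comparing $r^2$ on the training data with $r^2$ on held-out data as the degree grows. The sketch below does that on synthetic data generated only for illustration (not the avocado dataset); the degrees tried, the sample size, and the name `results` are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data with a truly linear relation plus noise (NOT real avocado data).
rng = np.random.default_rng(0)
price = rng.uniform(1.0, 1.8, size=60).reshape(-1, 1)
volume = 2_000_000 - 800_000 * price.ravel() + rng.normal(0, 150_000, size=60)

X_train, X_test, y_train, y_test = train_test_split(
    price, volume, test_size=0.2, random_state=0
)

# For each degree, record (train r^2, test r^2).
results = {}
for degree in (1, 2, 4, 6):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_train, y_train)
    results[degree] = (
        r2_score(y_train, model.predict(X_train)),
        r2_score(y_test, model.predict(X_test)),
    )

for degree, (train_r2, test_r2) in results.items():
    print(degree, round(train_r2, 3), round(test_r2, 3))
```

The training score can only go up as the degree increases, because each higher-degree model contains the lower-degree one. The test score is the one to watch: when it stalls or drops while the training score keeps rising, the extra degrees are fitting noise.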

Our above models don't have a very high $r^2$ value so we most likely cannot accurately predict how many avocados are bought based only on price.

Conclusion

After analyzing and visualizing the data, we came up with some insights:

Dallas/Ft. Worth, Houston, and South Central seem to have cheaper avocado prices than other locations in the U.S., while New York, Hartford/Springfield, and San Francisco have more expensive avocado prices. Phoenix/Tucson also has low prices for conventional avocados, but its organic avocados are fairly expensive. We can conclude that Dallas/Ft. Worth, Houston, and South Central are the best places to live if you don't want to go bankrupt from buying avocados.
Avocado prices vary over time. Sometimes the prices spike and other times they dip low.
The number of avocados being bought is increasing over time. Both organic and conventional avocado consumption are increasing, but organic avocados seem to show a more consistent and steeper increase than conventional avocados.
When plotting avocado prices against the number of avocados bought, it looks like fewer avocados are bought when prices are high. However, we cannot accurately predict how many avocados are bought based solely on the price.

Overall, it seems like avocados are not getting any less popular. However, if we want to predict how many avocados will be bought, we need to take into consideration more parameters than just the price of the avocados.
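As one possible direction for adding more parameters, the avocado type could be encoded as a 0/1 feature next to the price. The sketch below fits such a model on the five rows shown in the head of the dataset, which is far too little data to draw conclusions from; it only illustrates the mechanics, and the feature name `is_organic` is an assumption.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The five rows from the head of the dataset, trimmed to three columns.
rows = pd.DataFrame({
    "average_price": [1.22, 1.79, 1.00, 1.76, 1.08],
    "total_volume": [40873.28, 1373.95, 435021.49, 3846.69, 788025.06],
    "type": ["conventional", "organic", "conventional", "organic", "conventional"],
})

# Two features: the price and a 0/1 flag for organic avocados.
features = pd.DataFrame({
    "average_price": rows["average_price"],
    "is_organic": (rows["type"] == "organic").astype(int),
})

model = LinearRegression().fit(features, rows["total_volume"])
print(dict(zip(features.columns, model.coef_.round(1))))
```

Other candidates for extra features include the geography and the time of year, both of which are already in the dataset.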

	date	average_price	total_volume	4046	4225	4770	total_bags	small_bags	large_bags	type	year	geography
0	2015-01-04	1.22	40873.28	2819.50	28287.42	49.90	9716.46	9186.93	529.53	conventional	2015	Albany
1	2015-01-04	1.79	1373.95	57.42	153.88	0.00	1162.65	1162.65	0.00	organic	2015	Albany
2	2015-01-04	1.00	435021.49	364302.39	23821.16	82.15	46815.79	16707.15	30108.64	conventional	2015	Atlanta
3	2015-01-04	1.76	3846.69	1500.15	938.35	0.00	1408.19	1071.35	336.84	organic	2015	Atlanta
4	2015-01-04	1.08	788025.06	53987.31	552906.04	39995.03	141136.68	137146.07	3990.61	conventional	2015	Baltimore/Washington

	geography	average_price
18	Houston	1.081817
11	Dallas/Ft. Worth	1.088201
45	South Central	1.114748
33	Phoenix/Tucson	1.224209
26	Nashville	1.226025
10	Columbus	1.230450
9	Cincinnati/Dayton	1.239191
39	Roanoke	1.243813
27	New Orleans/Mobile	1.247140
38	Richmond/Norfolk	1.258345
13	Detroit	1.262518
53	West Tex/New Mexico	1.266275
19	Indianapolis	1.272320
12	Denver	1.276403
23	Louisville	1.282068
22	Los Angeles	1.304353
1	Atlanta	1.312842
15	Great Lakes	1.318471
50	Tampa	1.321906
51	Total U.S.	1.329946
52	West	1.330090
34	Pittsburgh	1.335054
46	Southeast	1.342644
24	Miami/Ft. Lauderdale	1.355306
35	Plains	1.373633
21	Las Vegas	1.377788
44	South Carolina	1.379748
25	Midsouth	1.386241
31	Orlando	1.389424
16	Harrisburg/Scranton	1.400629
5	Buffalo/Rochester	1.410576
36	Portland	1.414730
20	Jacksonville	1.416906
49	Syracuse	1.430737
3	Boise	1.440072
6	California	1.444784
30	Northern New England	1.454964
41	San Diego	1.455594
14	Grand Rapids	1.456403
48	St. Louis	1.460647
2	Baltimore/Washington	1.481996
0	Albany	1.506187
47	Spokane	1.507590
4	Boston	1.529694
43	Seattle	1.535683
8	Chicago	1.535989
32	Philadelphia	1.543669
29	Northeast	1.549784
7	Charlotte	1.570450
37	Raleigh/Greensboro	1.573759
40	Sacramento	1.596583
28	New York	1.678309
17	Hartford/Springfield	1.770953
42	San Francisco	1.771871

	date	average_price
0	2015-01-04	1.301296
1	2015-01-11	1.370648
2	2015-01-18	1.391111
3	2015-01-25	1.397130
4	2015-02-01	1.247037
...	...	...
273	2020-04-19	1.386204
274	2020-04-26	1.385556
275	2020-05-03	1.304815
276	2020-05-10	1.329537
277	2020-05-17	1.371111

	date	average_price	total_volume
0	2015-01-04	1.301296	7.840216e+05
1	2015-01-11	1.370648	7.273686e+05
2	2015-01-18	1.391111	7.258221e+05
3	2015-01-25	1.397130	7.080211e+05
4	2015-02-01	1.247037	1.106048e+06
...	...	...	...
273	2020-04-19	1.386204	1.279173e+06
274	2020-04-26	1.385556	1.326299e+06
275	2020-05-03	1.304815	1.572185e+06
276	2020-05-10	1.329537	1.489704e+06
277	2020-05-17	1.371111	1.318729e+06

	geography	type	average_price
0	Albany	conventional	1.314101
1	Albany	organic	1.698273
2	Atlanta	conventional	1.052410
3	Atlanta	organic	1.573273
4	Baltimore/Washington	conventional	1.341906
...	...	...	...
103	Total U.S.	organic	1.560000
104	West	conventional	1.030324
105	West	organic	1.629856
106	West Tex/New Mexico	conventional	0.878058
107	West Tex/New Mexico	organic	1.658727