Avocado is one of the most loved fruit in the world. The purpose of this project is to analyze the Avocado sales data using aggregation and visualizations to help understand the customers of the product. The project shall involve cleaning the data to ensure that it is consistent, relevant and complete, and then using the clean data to help answer the following research Questions:
In Which region are Avocado total sale volume high?
In which year were avocado sales high?
Do high average sales prices help avocado sellers make more money?
Which Avocado type high average price?
Which Avocado have high total sales amount?
We first load the libraries
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
library(dplyr)
We now load the data set
data =read.csv("E:/Data Science/R/course materials/avocado.csv", header = TRUE)
Having an overview of the data to understand it
checking the first few rows of the data
head(data)
## X Date AveragePrice Total.Volume X4046 X4225 X4770 Total.Bags
## 1 0 27/12/2015 1.33 64236.62 1036.74 54454.85 48.16 8696.87
## 2 1 20/12/2015 1.35 54876.98 674.28 44638.81 58.33 9505.56
## 3 2 13/12/2015 0.93 118220.22 794.70 109149.67 130.50 8145.35
## 4 3 06/12/2015 1.08 78992.15 1132.00 71976.41 72.58 5811.16
## 5 4 29/11/2015 1.28 51039.60 941.48 43838.39 75.78 6183.95
## 6 5 22/11/2015 1.26 55979.78 1184.27 48067.99 43.61 6683.91
## Small.Bags Large.Bags XLarge.Bags type year region
## 1 8603.62 93.25 0 conventional 2015 Albany
## 2 9408.07 97.49 0 conventional 2015 Albany
## 3 8042.21 103.14 0 conventional 2015 Albany
## 4 5677.40 133.76 0 conventional 2015 Albany
## 5 5986.26 197.69 0 conventional 2015 Albany
## 6 6556.47 127.44 0 conventional 2015 Albany
The first view rows of the data help us understand that there are some variables that are not useful in answering our research questions. They need to be dropped from our data set.
Dropping unnecessary variables
data = subset(data, select = -c(X,X4046, X4225, X4770))
head(data)
## Date AveragePrice Total.Volume Total.Bags Small.Bags Large.Bags
## 1 27/12/2015 1.33 64236.62 8696.87 8603.62 93.25
## 2 20/12/2015 1.35 54876.98 9505.56 9408.07 97.49
## 3 13/12/2015 0.93 118220.22 8145.35 8042.21 103.14
## 4 06/12/2015 1.08 78992.15 5811.16 5677.40 133.76
## 5 29/11/2015 1.28 51039.60 6183.95 5986.26 197.69
## 6 22/11/2015 1.26 55979.78 6683.91 6556.47 127.44
## XLarge.Bags type year region
## 1 0 conventional 2015 Albany
## 2 0 conventional 2015 Albany
## 3 0 conventional 2015 Albany
## 4 0 conventional 2015 Albany
## 5 0 conventional 2015 Albany
## 6 0 conventional 2015 Albany
The unnecessary variables have been dropped from the data set. We now check the datatypes in all variables to ensure consistency.
Checking the datatypes in the variables
str(data)
## 'data.frame': 18249 obs. of 10 variables:
## $ Date : chr "27/12/2015" "20/12/2015" "13/12/2015" "06/12/2015" ...
## $ AveragePrice: num 1.33 1.35 0.93 1.08 1.28 1.26 0.99 0.98 1.02 1.07 ...
## $ Total.Volume: num 64237 54877 118220 78992 51040 ...
## $ Total.Bags : num 8697 9506 8145 5811 6184 ...
## $ Small.Bags : num 8604 9408 8042 5677 5986 ...
## $ Large.Bags : num 93.2 97.5 103.1 133.8 197.7 ...
## $ XLarge.Bags : num 0 0 0 0 0 0 0 0 0 0 ...
## $ type : chr "conventional" "conventional" "conventional" "conventional" ...
## $ year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ region : chr "Albany" "Albany" "Albany" "Albany" ...
The date variable contains a character variable but it should contain a date datatype. The type and region variables’ data should be converted from characters to factor data types. The year variable can be converted from integers to factors too.
Converting the date variable from characters to dates
data1=data %>%
mutate(Date = as.Date(Date, format = "%d/%m/%Y"))
str(data1)
## 'data.frame': 18249 obs. of 10 variables:
## $ Date : Date, format: "2015-12-27" "2015-12-20" ...
## $ AveragePrice: num 1.33 1.35 0.93 1.08 1.28 1.26 0.99 0.98 1.02 1.07 ...
## $ Total.Volume: num 64237 54877 118220 78992 51040 ...
## $ Total.Bags : num 8697 9506 8145 5811 6184 ...
## $ Small.Bags : num 8604 9408 8042 5677 5986 ...
## $ Large.Bags : num 93.2 97.5 103.1 133.8 197.7 ...
## $ XLarge.Bags : num 0 0 0 0 0 0 0 0 0 0 ...
## $ type : chr "conventional" "conventional" "conventional" "conventional" ...
## $ year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ region : chr "Albany" "Albany" "Albany" "Albany" ...
The date variable has been changed to a date so it can be analyzed well. We no change the type, region and year variables to factors.
Converting type, region and year variables to factors
data2 = data1 %>%
mutate(type = (as.factor(type))) %>%
mutate(region = (as.factor(region))) %>%
mutate(year = (as.factor(year)))
str(data2)
## 'data.frame': 18249 obs. of 10 variables:
## $ Date : Date, format: "2015-12-27" "2015-12-20" ...
## $ AveragePrice: num 1.33 1.35 0.93 1.08 1.28 1.26 0.99 0.98 1.02 1.07 ...
## $ Total.Volume: num 64237 54877 118220 78992 51040 ...
## $ Total.Bags : num 8697 9506 8145 5811 6184 ...
## $ Small.Bags : num 8604 9408 8042 5677 5986 ...
## $ Large.Bags : num 93.2 97.5 103.1 133.8 197.7 ...
## $ XLarge.Bags : num 0 0 0 0 0 0 0 0 0 0 ...
## $ type : Factor w/ 2 levels "conventional",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : Factor w/ 4 levels "2015","2016",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ region : Factor w/ 54 levels "Albany","Atlanta",..: 1 1 1 1 1 1 1 1 1 1 ...
Finally to complete the data cleaning and processing, we create the sales Amount variable by multiplying the total sales and the AveragePrice variables.
Creating the SalesAmount variable
data3 = data2 %>%
mutate(SalesAmount = data2$AveragePrice*data2$Total.Volume)
head(data3)
## Date AveragePrice Total.Volume Total.Bags Small.Bags Large.Bags
## 1 2015-12-27 1.33 64236.62 8696.87 8603.62 93.25
## 2 2015-12-20 1.35 54876.98 9505.56 9408.07 97.49
## 3 2015-12-13 0.93 118220.22 8145.35 8042.21 103.14
## 4 2015-12-06 1.08 78992.15 5811.16 5677.40 133.76
## 5 2015-11-29 1.28 51039.60 6183.95 5986.26 197.69
## 6 2015-11-22 1.26 55979.78 6683.91 6556.47 127.44
## XLarge.Bags type year region SalesAmount
## 1 0 conventional 2015 Albany 85434.70
## 2 0 conventional 2015 Albany 74083.92
## 3 0 conventional 2015 Albany 109944.80
## 4 0 conventional 2015 Albany 85311.52
## 5 0 conventional 2015 Albany 65330.69
## 6 0 conventional 2015 Albany 70534.52
The data is now clean, consistent, complete and relevant. We are no going to analyze it though exploratory data analysis techniques and some inferential statistics to answer the research questions identified above.
1) In Which region are Avocado average total sale volume high?
data3 %>%
group_by(region) %>%
drop_na() %>%
summarize(average_total_sale = mean(SalesAmount)) %>%
arrange(-average_total_sale)
## # A tibble: 54 × 2
## region average_total_sale
## <fct> <dbl>
## 1 TotalUS 18898204.
## 2 California 3317793.
## 3 West 3156313.
## 4 Northeast 2840260.
## 5 SouthCentral 2585974.
## 6 Southeast 2081747.
## 7 GreatLakes 2037461.
## 8 Midsouth 1821668.
## 9 LosAngeles 1432626.
## 10 Plains 1065197.
## # … with 44 more rows
Insight: The total sales of the whole country is $ 18 898 204. The highest selling region was California with average sales of $ 3 317 793.
2) In which year were avocado sales high?
ggplot(data3, aes(x = factor(year), y = Total.Volume, colour = year))+
geom_bar(stat = "identity")
Insight: The highest sales for avocado were made in 2017. The avocado sales were increasing from 2015 to 2017. However, there was an abnormal decrease from 2017 to 2018.
3)Do high average sales prices help avocado sellers make more money?
The following scatter plot shows the relationship between sale prices and total amount received from sales.
ggplot(data3, aes(x = AveragePrice, y = SalesAmount))+
geom_point()
Insight: The scatter plot does not reveal a clear relationship between the two variables. we try evaluating the relationship using a linear regression model as shown below.
linear_model = lm(SalesAmount~AveragePrice, data = data3)
summary(linear_model)
##
## Call:
## lm(formula = SalesAmount ~ AveragePrice, data = data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2341618 -1009377 -634526 -123003 52651943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3026563 97781 30.95 <2e-16 ***
## AveragePrice -1492636 66859 -22.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3637000 on 18247 degrees of freedom
## Multiple R-squared: 0.02659, Adjusted R-squared: 0.02654
## F-statistic: 498.4 on 1 and 18247 DF, p-value: < 2.2e-16
Insight: The linear regression model shows that as the average price of Avocados increase, The Total sales reduces since the coefficient of the average price has a negative sign. The p-value of the regression model is 2.2e-16 which is less than the significant level (0.05). Therefore, the relationship between average price and sales amount shown in the regression model is statistically significant. We now create another regression model to investigate if the decrease in total sales as the average price of products increase is caused lower demand as the price increases.
linear_model2 = lm(Total.Volume~AveragePrice, data = data3)
summary(linear_model2)
##
## Call:
## lm(formula = Total.Volume ~ AveragePrice, data = data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2383481 -957612 -576276 -49248 60768957
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3174918 91115 34.84 <2e-16 ***
## AveragePrice -1653136 62301 -26.54 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3389000 on 18247 degrees of freedom
## Multiple R-squared: 0.03715, Adjusted R-squared: 0.0371
## F-statistic: 704.1 on 1 and 18247 DF, p-value: < 2.2e-16
Insight: The linear model shows that there is a statistically significant negative linear relationship between average price and total volume purchased. Therefore, we can conclude that the decrease in Total sales as the average price increases is caused by decreased demand due to an increase in unit price. #4) Which Avocado type high average price?
The following bar chart shows average price by year
ggplot(data3, aes(x = factor(type), y = AveragePrice, colour = type))+
geom_bar(stat = "identity")
Insight: Organic avocados have a high average price than the conventional avocados.
#5) Which Avocado type have high total sales amount?
The following chart shows sales amount by avocado type.
ggplot(data3, aes(x = factor(type), y = SalesAmount, colour = type))+
geom_bar(stat = "identity")
Insight: The chart shows that the conventional avocado type has high sales amount than the organic type.
The data analysis project shows that although the organic avocado types have a high average price, they have low demand. As a result, business people dealing with organic avocado type are expected to record low sales than those dealing with conventional avocado.