Executive summary

Avocado is one of the most loved fruit in the world. The purpose of this project is to analyze the Avocado sales data using aggregation and visualizations to help understand the customers of the product. The project shall involve cleaning the data to ensure that it is consistent, relevant and complete, and then using the clean data to help answer the following research Questions:

  1. In Which region are Avocado total sale volume high?

  2. In which year were avocado sales high?

  3. Do high average sales prices help avocado sellers make more money?

  4. Which Avocado type high average price?

  5. Which Avocado have high total sales amount?

Loading the neccessary libraries and the data

We first load the libraries

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggplot2)
library(dplyr)

We now load the data set

data =read.csv("E:/Data Science/R/course materials/avocado.csv", header = TRUE)

DATA CLEANING

Having an overview of the data to understand it

checking the first few rows of the data

head(data)
##   X       Date AveragePrice Total.Volume   X4046     X4225  X4770 Total.Bags
## 1 0 27/12/2015         1.33     64236.62 1036.74  54454.85  48.16    8696.87
## 2 1 20/12/2015         1.35     54876.98  674.28  44638.81  58.33    9505.56
## 3 2 13/12/2015         0.93    118220.22  794.70 109149.67 130.50    8145.35
## 4 3 06/12/2015         1.08     78992.15 1132.00  71976.41  72.58    5811.16
## 5 4 29/11/2015         1.28     51039.60  941.48  43838.39  75.78    6183.95
## 6 5 22/11/2015         1.26     55979.78 1184.27  48067.99  43.61    6683.91
##   Small.Bags Large.Bags XLarge.Bags         type year region
## 1    8603.62      93.25           0 conventional 2015 Albany
## 2    9408.07      97.49           0 conventional 2015 Albany
## 3    8042.21     103.14           0 conventional 2015 Albany
## 4    5677.40     133.76           0 conventional 2015 Albany
## 5    5986.26     197.69           0 conventional 2015 Albany
## 6    6556.47     127.44           0 conventional 2015 Albany

The first view rows of the data help us understand that there are some variables that are not useful in answering our research questions. They need to be dropped from our data set.

Dropping unnecessary variables

data = subset(data, select = -c(X,X4046, X4225, X4770))
head(data)
##         Date AveragePrice Total.Volume Total.Bags Small.Bags Large.Bags
## 1 27/12/2015         1.33     64236.62    8696.87    8603.62      93.25
## 2 20/12/2015         1.35     54876.98    9505.56    9408.07      97.49
## 3 13/12/2015         0.93    118220.22    8145.35    8042.21     103.14
## 4 06/12/2015         1.08     78992.15    5811.16    5677.40     133.76
## 5 29/11/2015         1.28     51039.60    6183.95    5986.26     197.69
## 6 22/11/2015         1.26     55979.78    6683.91    6556.47     127.44
##   XLarge.Bags         type year region
## 1           0 conventional 2015 Albany
## 2           0 conventional 2015 Albany
## 3           0 conventional 2015 Albany
## 4           0 conventional 2015 Albany
## 5           0 conventional 2015 Albany
## 6           0 conventional 2015 Albany

The unnecessary variables have been dropped from the data set. We now check the datatypes in all variables to ensure consistency.

Checking the datatypes in the variables

str(data)
## 'data.frame':    18249 obs. of  10 variables:
##  $ Date        : chr  "27/12/2015" "20/12/2015" "13/12/2015" "06/12/2015" ...
##  $ AveragePrice: num  1.33 1.35 0.93 1.08 1.28 1.26 0.99 0.98 1.02 1.07 ...
##  $ Total.Volume: num  64237 54877 118220 78992 51040 ...
##  $ Total.Bags  : num  8697 9506 8145 5811 6184 ...
##  $ Small.Bags  : num  8604 9408 8042 5677 5986 ...
##  $ Large.Bags  : num  93.2 97.5 103.1 133.8 197.7 ...
##  $ XLarge.Bags : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ type        : chr  "conventional" "conventional" "conventional" "conventional" ...
##  $ year        : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ region      : chr  "Albany" "Albany" "Albany" "Albany" ...

The date variable contains a character variable but it should contain a date datatype. The type and region variables’ data should be converted from characters to factor data types. The year variable can be converted from integers to factors too.

Converting the date variable from characters to dates

data1=data %>% 
  mutate(Date = as.Date(Date, format = "%d/%m/%Y"))
str(data1)
## 'data.frame':    18249 obs. of  10 variables:
##  $ Date        : Date, format: "2015-12-27" "2015-12-20" ...
##  $ AveragePrice: num  1.33 1.35 0.93 1.08 1.28 1.26 0.99 0.98 1.02 1.07 ...
##  $ Total.Volume: num  64237 54877 118220 78992 51040 ...
##  $ Total.Bags  : num  8697 9506 8145 5811 6184 ...
##  $ Small.Bags  : num  8604 9408 8042 5677 5986 ...
##  $ Large.Bags  : num  93.2 97.5 103.1 133.8 197.7 ...
##  $ XLarge.Bags : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ type        : chr  "conventional" "conventional" "conventional" "conventional" ...
##  $ year        : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ region      : chr  "Albany" "Albany" "Albany" "Albany" ...

The date variable has been changed to a date so it can be analyzed well. We no change the type, region and year variables to factors.

Converting type, region and year variables to factors

data2 = data1 %>%
  mutate(type = (as.factor(type))) %>% 
  mutate(region = (as.factor(region))) %>% 
  mutate(year = (as.factor(year)))
str(data2)
## 'data.frame':    18249 obs. of  10 variables:
##  $ Date        : Date, format: "2015-12-27" "2015-12-20" ...
##  $ AveragePrice: num  1.33 1.35 0.93 1.08 1.28 1.26 0.99 0.98 1.02 1.07 ...
##  $ Total.Volume: num  64237 54877 118220 78992 51040 ...
##  $ Total.Bags  : num  8697 9506 8145 5811 6184 ...
##  $ Small.Bags  : num  8604 9408 8042 5677 5986 ...
##  $ Large.Bags  : num  93.2 97.5 103.1 133.8 197.7 ...
##  $ XLarge.Bags : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ type        : Factor w/ 2 levels "conventional",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year        : Factor w/ 4 levels "2015","2016",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ region      : Factor w/ 54 levels "Albany","Atlanta",..: 1 1 1 1 1 1 1 1 1 1 ...

Finally to complete the data cleaning and processing, we create the sales Amount variable by multiplying the total sales and the AveragePrice variables.

Creating the SalesAmount variable

data3 = data2 %>% 
  mutate(SalesAmount = data2$AveragePrice*data2$Total.Volume)
head(data3)
##         Date AveragePrice Total.Volume Total.Bags Small.Bags Large.Bags
## 1 2015-12-27         1.33     64236.62    8696.87    8603.62      93.25
## 2 2015-12-20         1.35     54876.98    9505.56    9408.07      97.49
## 3 2015-12-13         0.93    118220.22    8145.35    8042.21     103.14
## 4 2015-12-06         1.08     78992.15    5811.16    5677.40     133.76
## 5 2015-11-29         1.28     51039.60    6183.95    5986.26     197.69
## 6 2015-11-22         1.26     55979.78    6683.91    6556.47     127.44
##   XLarge.Bags         type year region SalesAmount
## 1           0 conventional 2015 Albany    85434.70
## 2           0 conventional 2015 Albany    74083.92
## 3           0 conventional 2015 Albany   109944.80
## 4           0 conventional 2015 Albany    85311.52
## 5           0 conventional 2015 Albany    65330.69
## 6           0 conventional 2015 Albany    70534.52

Analyzing the data to answer research questions.

The data is now clean, consistent, complete and relevant. We are no going to analyze it though exploratory data analysis techniques and some inferential statistics to answer the research questions identified above.

1) In Which region are Avocado average total sale volume high?

data3 %>% 
  group_by(region) %>% 
  drop_na() %>% 
  summarize(average_total_sale = mean(SalesAmount)) %>% 
  arrange(-average_total_sale)
## # A tibble: 54 × 2
##    region       average_total_sale
##    <fct>                     <dbl>
##  1 TotalUS               18898204.
##  2 California             3317793.
##  3 West                   3156313.
##  4 Northeast              2840260.
##  5 SouthCentral           2585974.
##  6 Southeast              2081747.
##  7 GreatLakes             2037461.
##  8 Midsouth               1821668.
##  9 LosAngeles             1432626.
## 10 Plains                 1065197.
## # … with 44 more rows

Insight: The total sales of the whole country is $ 18 898 204. The highest selling region was California with average sales of $ 3 317 793.

2) In which year were avocado sales high?

ggplot(data3, aes(x = factor(year), y = Total.Volume, colour = year))+
  geom_bar(stat = "identity")

Insight: The highest sales for avocado were made in 2017. The avocado sales were increasing from 2015 to 2017. However, there was an abnormal decrease from 2017 to 2018.

3)Do high average sales prices help avocado sellers make more money?

The following scatter plot shows the relationship between sale prices and total amount received from sales.

ggplot(data3, aes(x = AveragePrice, y = SalesAmount))+
  geom_point()

Insight: The scatter plot does not reveal a clear relationship between the two variables. we try evaluating the relationship using a linear regression model as shown below.

linear_model = lm(SalesAmount~AveragePrice, data = data3)
summary(linear_model)
## 
## Call:
## lm(formula = SalesAmount ~ AveragePrice, data = data3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2341618 -1009377  -634526  -123003 52651943 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3026563      97781   30.95   <2e-16 ***
## AveragePrice -1492636      66859  -22.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3637000 on 18247 degrees of freedom
## Multiple R-squared:  0.02659,    Adjusted R-squared:  0.02654 
## F-statistic: 498.4 on 1 and 18247 DF,  p-value: < 2.2e-16

Insight: The linear regression model shows that as the average price of Avocados increase, The Total sales reduces since the coefficient of the average price has a negative sign. The p-value of the regression model is 2.2e-16 which is less than the significant level (0.05). Therefore, the relationship between average price and sales amount shown in the regression model is statistically significant. We now create another regression model to investigate if the decrease in total sales as the average price of products increase is caused lower demand as the price increases.

linear_model2 = lm(Total.Volume~AveragePrice, data = data3)
summary(linear_model2)
## 
## Call:
## lm(formula = Total.Volume ~ AveragePrice, data = data3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2383481  -957612  -576276   -49248 60768957 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3174918      91115   34.84   <2e-16 ***
## AveragePrice -1653136      62301  -26.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3389000 on 18247 degrees of freedom
## Multiple R-squared:  0.03715,    Adjusted R-squared:  0.0371 
## F-statistic: 704.1 on 1 and 18247 DF,  p-value: < 2.2e-16

Insight: The linear model shows that there is a statistically significant negative linear relationship between average price and total volume purchased. Therefore, we can conclude that the decrease in Total sales as the average price increases is caused by decreased demand due to an increase in unit price. #4) Which Avocado type high average price?

The following bar chart shows average price by year

ggplot(data3, aes(x = factor(type), y = AveragePrice, colour = type))+
  geom_bar(stat = "identity")

Insight: Organic avocados have a high average price than the conventional avocados.

#5) Which Avocado type have high total sales amount?

The following chart shows sales amount by avocado type.

ggplot(data3, aes(x = factor(type), y = SalesAmount, colour = type))+
  geom_bar(stat = "identity")

Insight: The chart shows that the conventional avocado type has high sales amount than the organic type.

Recommendations

The data analysis project shows that although the organic avocado types have a high average price, they have low demand. As a result, business people dealing with organic avocado type are expected to record low sales than those dealing with conventional avocado.