Forecasting Sales Revenue¶

Machine Learning: Random Forest Regressor, Gradient Boosting Regressor¶

+ RandomizedSearchCV + HuggingFace + joblib + Docker + Flask + Streamlit¶

Keys: #EDA #HuggingFace #RandomForestRegressor #GradientBoostingRegressor #Flask #Streamlit #JobLib #jsonify #Docker

Josue Quiros Batista¶


  • Overview of the Dataset
  • Exploratory Data Analysis (EDA)
  • Univariate Analysis
  • Bivariate Analysis
  • Data Pre-processing
  • Feature Engineering
  • Context proxies
  • Data Preparation for modeling
  • Model Evaluation Criterion
  • Models: Random Forest Regressor and Gradient Boosting Regressor
  • Model Performance Improvement - Hyperparameter Tuning
  • Model Performance Comparison, Final Model Selection, and Serialization
  • Actionable Insights and Business Recommendations derived from this model

Business Context¶

A sales forecast is a prediction of future sales revenue based on historical data, industry trends, and the status of the current sales pipeline. Businesses use the sales forecast to estimate weekly, monthly, quarterly, and annual sales totals. A company needs to make an accurate sales forecast as it adds value across an organization and helps the different verticals to chalk out their future course of action.

Forecasting helps an organization plan its sales operations by region and provides valuable insights to the supply chain team regarding the procurement of goods and materials. An accurate sales forecast process has many benefits which include improved decision-making about the future and reduction of sales pipeline and forecast risks. Moreover, it helps to reduce the time spent in planning territory coverage and establish benchmarks that can be used to assess trends in the future.

Objective¶

SuperKart is a retail chain operating supermarkets and food marts across various tier cities, offering a wide range of products. To optimize its inventory management and make informed decisions around regional sales strategies, SuperKart wants to accurately forecast the sales revenue of its outlets for the upcoming quarter.

To operationalize these insights at scale, the company has partnered with a data science firm—not just to build a predictive model based on historical sales data, but to develop and deploy a robust forecasting solution that can be integrated into SuperKart’s decision-making systems and used across its network of stores.

Actionable Insights and Business Recommendations derived from this model¶

Insights from the analysis conducted¶

Interpretation of metrics

  • RMSE (Root Mean Squared Error): On average, predictions deviate by 621 USD from the actual sales.

  • MAE (Mean Absolute Error): On average, the absolute difference between predicted and actual sales is 489.71 USD.

  • R-squared: About 66.5% of the variance in sales is explained by the model.

  • MAPE (Mean Absolute Percentage Error): On average, the predictions are off by 16.7% of the actual sales value.

  • Evaluation of a sample prediction

    • Product_Id: FD1961
    • Weight: 9.45
    • Sugar content: Low Sugar
    • Allocated area: 0.047
    • Type: Snack foods
    • MRP: 95.95
    • Sold in a Food Mart
    • Product_Store_Sales_Total = 1684.82 USD

Based on the trained model:

  • The actual sales of product FD1961 in OUT002 were 1,684.82 USD

  • The forecasted sales from the selected model were 1,736 USD.

  • Absolute Error: |1736 - 1684.82| = 51.18 USD

  • Percentage Error: (51.18 / 1684.82) × 100 ≈ 3.04%

In this case, the error was much lower than average, indicating excellent performance.

  • With a MAPE of 16.7%, the model generally predicts sales within ±17% of actual values. An R-squared of 66.5% shows it captures many of the trends in the data, though improvements are possible through more detailed analysis, evaluation of alternative models, or further hyperparameter tuning. Nevertheless, this level of accuracy is generally sufficient for inventory, purchasing, and marketing decisions for most products.
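The metrics discussed above can be reproduced with scikit-learn. The arrays below are illustrative stand-ins for the held-out actuals and model forecasts, not the project's real data:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical held-out actuals and forecasts (values are illustrative only)
y_true = np.array([1684.82, 2842.40, 4830.02, 4130.16])
y_pred = np.array([1736.00, 2900.00, 4700.00, 4200.00])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))        # penalizes large errors more
mae = mean_absolute_error(y_true, y_pred)                 # mean absolute deviation
r2 = r2_score(y_true, y_pred)                             # share of variance explained
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # mean percentage error

print(f"RMSE={rmse:.2f}  MAE={mae:.2f}  R2={r2:.3f}  MAPE={mape:.2f}%")
```

With the real test set, these four values correspond to the 621 USD, 489.71 USD, 66.5%, and 16.7% figures reported above.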

Business recommendations¶

  • Fruits and vegetables represent the highest-selling product category, accounting for 14.3% of all products sold and generating the largest revenue at 4,300,833.27 USD. However, due to their seasonal nature, some produce items may not always be consistently available.

    It would be beneficial to strengthen the acquisition and sales efforts for the second and third most significant product groups: Snack Foods, which generated 3,988,996.95 USD, and Dairy, with 2,811,918.04 USD. This strategy would reduce reliance on perishable products with unpredictable availability. Additionally, "Snack Foods" should not exclusively imply unhealthy options; rather, healthy alternatives can be offered to health-conscious customers, who represent the identified target demographic for the stores.

  • The least sold product category is "Seafood," with only 76 products. It is advisable to assess whether to continue offering these products. Due to their nature, they require special in-store placement and specific maintenance, such as refrigeration, which may incur costs that do not justify the return on investment. Furthermore, these products have a shorter shelf life.

  • Over 70% of recorded sales come from stores located in cities where the standard of living is middle class (Tier 2). Areas with a higher standard of living (Tier 1) and those with lower purchasing power (Tier 3) are represented in smaller, yet very similar, proportions at 15.4% and 13.1%, respectively.

    Marketing efforts and product selections should match the local purchasing power in each zone. This strategy aims to boost sales of products genuinely appealing to the local demographic. It is recommended that Key Performance Indicators (KPIs) specific to each Store Type be developed to effectively measure these efforts.

    This approach could also benefit from "context proxies," which use a derived variable to indirectly represent an associated factor. For instance, calculating the average sales per store could indirectly reflect each location's capacity to generate profit, encompassing both the store itself and its surroundings (e.g., the socio-economic status of the surrounding area).

  • An interesting finding is that despite the majority of products sold being low in sugar, all sugar content categories generate very similar revenues. In other words, the fact that over 50% of products are low in sugar does not translate to higher sales compared to products with a regular sugar level or sugar-free products, at least when interpreting the median values.

    This could mean that products with a regular sugar level might sell well, perhaps even better than low-sugar options. This is because customers actively seek these products, even if they are not the predominant items offered in stores.

  • Finally, it would be beneficial to begin recording data on the seasonal sales peaks for each product category. The aim is to forecast sales more accurately based on this information. Additionally, a marketing strategy that includes discounts and promotions for these products at specific times of the year could be developed.
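The "context proxy" technique mentioned in the recommendations (average sales per store as an indirect indicator of a location's earning capacity) can be sketched with a pandas groupby-transform. The toy frame and the `Store_Avg_Sales` column name are illustrative, not part of the original notebook:

```python
import pandas as pd

# Toy frame mimicking the SuperKart schema (values are illustrative)
df = pd.DataFrame({
    "Store_Id": ["OUT001", "OUT001", "OUT002", "OUT002"],
    "Product_Store_Sales_Total": [4130.16, 4132.18, 2279.36, 1684.82],
})

# Context proxy: each store's mean sales, broadcast back to every row,
# indirectly captures the location's capacity to generate revenue.
df["Store_Avg_Sales"] = (
    df.groupby("Store_Id")["Product_Store_Sales_Total"].transform("mean")
)
print(df)
```

`transform` (rather than `agg`) keeps the result aligned row-by-row with the original frame, so the proxy can be used directly as a model feature.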

Used libraries¶

  • NumPy
  • pandas
  • scikit-learn
  • Matplotlib
  • Seaborn
  • joblib
  • Hugging Face
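Model serialization with joblib, listed above and used for the final model in this project, can be sketched as follows. The estimator settings, toy data, and file name are illustrative, not the project's actual configuration:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data standing in for the prepared SuperKart features
rng = np.random.default_rng(42)
X, y = rng.random((50, 3)), rng.random(50)
model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

# Dump the fitted estimator to disk, then load it back unchanged
path = os.path.join(tempfile.mkdtemp(), "superkart_model.joblib")  # illustrative name
joblib.dump(model, path)
restored = joblib.load(path)
print(restored.predict(X[:2]))
```

The restored estimator predicts identically to the original, which is what allows the trained model to be shipped inside a Docker image and served via Flask.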
In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
_data = r"/content/SuperKart.csv"
#_data = r"G:\Mi unidad\Documentos\Data Science\"
_data_original = pd.read_csv(_data)
_data_copy = _data_original.copy()

Data Overview¶

In [ ]:
_data_copy.head()
Out[ ]:
Product_Id Product_Weight Product_Sugar_Content Product_Allocated_Area Product_Type Product_MRP Store_Id Store_Establishment_Year Store_Size Store_Location_City_Type Store_Type Product_Store_Sales_Total
0 FD6114 12.66 Low Sugar 0.027 Frozen Foods 117.08 OUT004 2009 Medium Tier 2 Supermarket Type2 2842.40
1 FD7839 16.54 Low Sugar 0.144 Dairy 171.43 OUT003 1999 Medium Tier 1 Departmental Store 4830.02
2 FD5075 14.28 Regular 0.031 Canned 162.08 OUT001 1987 High Tier 2 Supermarket Type1 4130.16
3 FD8233 12.10 Low Sugar 0.112 Baking Goods 186.31 OUT001 1987 High Tier 2 Supermarket Type1 4132.18
4 NC1180 9.57 No Sugar 0.010 Health and Hygiene 123.67 OUT002 1998 Small Tier 3 Food Mart 2279.36

Shape¶

In [ ]:
print(_data_copy.shape)
print("There are", _data_copy.shape[0], 'rows and', _data_copy.shape[1], "columns.")
(8763, 12)
There are 8763 rows and 12 columns.

Data types¶

In [ ]:
_data_copy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8763 entries, 0 to 8762
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Product_Id                 8763 non-null   object 
 1   Product_Weight             8763 non-null   float64
 2   Product_Sugar_Content      8763 non-null   object 
 3   Product_Allocated_Area     8763 non-null   float64
 4   Product_Type               8763 non-null   object 
 5   Product_MRP                8763 non-null   float64
 6   Store_Id                   8763 non-null   object 
 7   Store_Establishment_Year   8763 non-null   int64  
 8   Store_Size                 8763 non-null   object 
 9   Store_Location_City_Type   8763 non-null   object 
 10  Store_Type                 8763 non-null   object 
 11  Product_Store_Sales_Total  8763 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 821.7+ KB

Statistical summary¶

In [ ]:
_data_copy.describe(include='all')
Out[ ]:
Product_Id Product_Weight Product_Sugar_Content Product_Allocated_Area Product_Type Product_MRP Store_Id Store_Establishment_Year Store_Size Store_Location_City_Type Store_Type Product_Store_Sales_Total
count 8763 8763.000000 8763 8763.000000 8763 8763.000000 8763 8763.000000 8763 8763 8763 8763.000000
unique 8763 NaN 4 NaN 16 NaN 4 NaN 3 3 4 NaN
top FD306 NaN Low Sugar NaN Fruits and Vegetables NaN OUT004 NaN Medium Tier 2 Supermarket Type2 NaN
freq 1 NaN 4885 NaN 1249 NaN 4676 NaN 6025 6262 4676 NaN
mean NaN 12.653792 NaN 0.068786 NaN 147.032539 NaN 2002.032751 NaN NaN NaN 3464.003640
std NaN 2.217320 NaN 0.048204 NaN 30.694110 NaN 8.388381 NaN NaN NaN 1065.630494
min NaN 4.000000 NaN 0.004000 NaN 31.000000 NaN 1987.000000 NaN NaN NaN 33.000000
25% NaN 11.150000 NaN 0.031000 NaN 126.160000 NaN 1998.000000 NaN NaN NaN 2761.715000
50% NaN 12.660000 NaN 0.056000 NaN 146.740000 NaN 2009.000000 NaN NaN NaN 3452.340000
75% NaN 14.180000 NaN 0.096000 NaN 167.585000 NaN 2009.000000 NaN NaN NaN 4145.165000
max NaN 22.000000 NaN 0.298000 NaN 266.000000 NaN 2009.000000 NaN NaN NaN 8000.000000

Observations¶

  • The heaviest item weighs 22 lb, while the lightest item weighs 4 lb. The median weight is 12.66 lb.
  • The majority of products exhibit a low sugar content, specifically 4,885 out of 8,763 items. There are four distinct categories used to classify sugar content.
  • The highest ratio of the allocated display area of each product to the total display area of all the products in a store is 0.29. The lowest is 0.004.
  • There are 16 product categories in total. "Fruits and vegetables" is the most frequent category, with a count of 1,249.
  • The most expensive product, based on its maximum retail price, is 266 USD. Additionally, 75% of all products do not exceed 167.5 USD. The least expensive product is priced at 31 USD.
  • The store with the highest number of registered products is identified by ID OUT004, with 4,676 products.
  • The oldest store was established in 1987, while the most recently established store dates back to 2009.
  • Most records come from medium-sized stores. There are three size categories in total.
  • Most records correspond to stores situated in Tier 2 cities, accounting for 6,262 out of 8,763 rows.
  • "Supermarket Type2" is the most frequent store type, representing 4,676 records.
  • The highest total revenue generated by the sale of a particular product in a particular store is 8,000 USD. The median revenue is 3,452.34 USD. The product with the lowest sales at a specific location generated only 33 USD in revenue.

Let's check the count of each unique category in each of the categorical variables

Convert object variables to categorical

In [ ]:
_data_copy["Product_Sugar_Content"] = _data_copy["Product_Sugar_Content"].astype("category")
_data_copy["Product_Type"] = _data_copy["Product_Type"].astype("category")
_data_copy["Store_Size"] = _data_copy["Store_Size"].astype("category")
_data_copy["Store_Location_City_Type"] = _data_copy["Store_Location_City_Type"].astype("category")
_data_copy["Store_Type"] = _data_copy["Store_Type"].astype("category")

Observations:¶

Converting the object columns to the categorical dtype reduces memory usage, speeds up operations, and improves interoperability with modeling libraries.
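A minimal sketch of the memory effect, using a toy column with the same labels as Product_Sugar_Content:

```python
import pandas as pd

# A repetitive object column, like Product_Sugar_Content (toy data)
s_obj = pd.Series(["Low Sugar", "Regular", "No Sugar"] * 10_000)
s_cat = s_obj.astype("category")

# The categorical dtype stores each label once plus compact integer codes
# per row, so memory shrinks sharply for low-cardinality columns.
print(s_obj.memory_usage(deep=True), s_cat.memory_usage(deep=True))
```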

Check the count of instances on each category

In [ ]:
_cat = _data_copy.select_dtypes(include=['category']).columns.tolist()

for column in _cat:
    print(_data_copy[column].value_counts())
    print("-" * 25)
Product_Sugar_Content
Low Sugar    4885
Regular      2251
No Sugar     1519
reg           108
Name: count, dtype: int64
-------------------------
Product_Type
Fruits and Vegetables    1249
Snack Foods              1149
Frozen Foods              811
Dairy                     796
Household                 740
Baking Goods              716
Canned                    677
Health and Hygiene        628
Meat                      618
Soft Drinks               519
Breads                    200
Hard Drinks               186
Others                    151
Starchy Foods             141
Breakfast                 106
Seafood                    76
Name: count, dtype: int64
-------------------------
Store_Size
Medium    6025
High      1586
Small     1152
Name: count, dtype: int64
-------------------------
Store_Location_City_Type
Tier 2    6262
Tier 1    1349
Tier 3    1152
Name: count, dtype: int64
-------------------------
Store_Type
Supermarket Type2     4676
Supermarket Type1     1586
Departmental Store    1349
Food Mart             1152
Name: count, dtype: int64
-------------------------

Observations¶

  • Among the sugar content categories, 4,885 products have a low sugar level, followed by products with a regular sugar level (2,251 items); 1,519 products are entirely sugar-free. It is important to note that the "reg" category should be merged into the "Regular" classification.
  • The "Fruits and Vegetables" category holds the majority with 1,249 products, followed by "Snack Foods" with 1,149 and "Frozen Foods" with 811. The least common category is "Seafood," with only 76 products.
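The "reg" relabeling noted above could be done as follows. This is a sketch on a toy column, not the notebook's actual pre-processing cell:

```python
import pandas as pd

# Toy column reproducing the inconsistent "reg" label
df = pd.DataFrame({"Product_Sugar_Content":
                   ["Low Sugar", "Regular", "reg", "No Sugar", "reg"]})

# Fold the stray "reg" label into "Regular", then cast to category.
# (Replacing on the string dtype first avoids the duplicate-category
# error that cat.rename_categories would raise when merging labels.)
df["Product_Sugar_Content"] = (
    df["Product_Sugar_Content"]
      .replace({"reg": "Regular"})
      .astype("category")
)
print(df["Product_Sugar_Content"].value_counts())
```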
In [ ]:
for _variable in _cat:
  print(_data_copy[_variable].unique())
['Low Sugar', 'Regular', 'No Sugar', 'reg']
Categories (4, object): ['Low Sugar', 'No Sugar', 'Regular', 'reg']
['Frozen Foods', 'Dairy', 'Canned', 'Baking Goods', 'Health and Hygiene', ..., 'Soft Drinks', 'Breakfast', 'Others', 'Starchy Foods', 'Seafood']
Length: 16
Categories (16, object): ['Baking Goods', 'Breads', 'Breakfast', 'Canned', ..., 'Seafood', 'Snack Foods',
                          'Soft Drinks', 'Starchy Foods']
['Medium', 'High', 'Small']
Categories (3, object): ['High', 'Medium', 'Small']
['Tier 2', 'Tier 1', 'Tier 3']
Categories (3, object): ['Tier 1', 'Tier 2', 'Tier 3']
['Supermarket Type2', 'Departmental Store', 'Supermarket Type1', 'Food Mart']
Categories (4, object): ['Departmental Store', 'Food Mart', 'Supermarket Type1', 'Supermarket Type2']

Numerical features

In [ ]:
_num_col = _data_copy.select_dtypes(include=np.number).columns.tolist()
for _variable in _num_col:
  print(_variable, ": ", len(_data_copy[_variable].unique()))
Product_Weight :  1113
Product_Allocated_Area :  228
Product_MRP :  6100
Store_Establishment_Year :  4
Product_Store_Sales_Total :  8668

Observations¶

  • 1,113 distinct weights have been observed.
  • There are 6,100 different prices.

Duplicate value check

In [ ]:
print("There are",_data_copy.duplicated().sum(), "duplicated rows")
There are 0 duplicated rows

Missing value treatment (if needed)

In [ ]:
_null_values = _data_copy.isnull().sum()
print(_null_values[_null_values > 0])
Series([], dtype: int64)

Observations¶

There are no null or missing values. Imputation is not necessary.

Exploratory Data Analysis (EDA)¶

Univariate Analysis¶

In [ ]:
plt.figure(figsize=(22, 12))
_columns = ['Product_Weight','Product_Sugar_Content','Product_MRP',
            'Store_Id','Store_Size','Store_Location_City_Type','Store_Type','Product_Store_Sales_Total']
for _cont, _variable in enumerate(_columns):
    plt.subplot(4, 4, _cont+1)
    sns.histplot(data=_data_copy, x=_variable)
    plt.xticks(rotation=30)
    plt.tight_layout();

Observations¶

  • The weight of each product, the maximum retail price, and the total revenue generated by the sale of a particular product in a particular store all exhibit approximately normal distributions.
  • This dataset represents products sold in only four stores, which are identified by their respective "ID" feature.
  • Additional findings follow.

1. Product_Weight - weight of each product

In [ ]:
sns.histplot(data=_data_copy, x='Product_Weight');

Observations¶

  • This variable exhibits a normal distribution, readily identifiable by its symmetrical bell shape. Data points are concentrated around a central value, with their frequency decreasing as they deviate from this center. The highest concentration of product weights is observed between 12 lb and 13 lb, representing the most frequent weight range.

2. Product_Sugar_Content - sugar content of each product

In [ ]:
labeled_barplot(_data_copy, "Product_Sugar_Content")
In [ ]:
_percentage_types = _data_copy['Product_Sugar_Content'].value_counts(normalize=True, sort=False)*100
_percentage_types.round(1)
Out[ ]:
proportion
Product_Sugar_Content
Low Sugar 55.7
No Sugar 17.3
Regular 25.7
reg 1.2

Observations¶

  • 4,885 products exhibit a low sugar level, constituting 55.7% of all products. Following this, products with a regular sugar level account for 25.7%, totaling 2,251 items. Additionally, 1,519 products are entirely sugar-free, representing 17.3%.
  • As previously indicated, the "reg" category within the sugar level feature should be reclassified under the "regular" category.
  • This trend suggests a strategic focus among stores to offer products that prioritize customer well-being.

3. Product_Allocated_Area

Ratio of the allocated display area of each product to the total display area of all the products in a store.

In [ ]:
histogram_boxplot(_data_copy, "Product_Allocated_Area")

Observations¶

  • This feature exhibits a right-skewed distribution, indicating that the highest concentration of products occupies a small allocated display area, ranging between 0 and 0.06. As the allocated area increases, the number of products decreases.
  • A considerable number of products thus occupy a very small proportion of the total display area.
  • Outlier values do not appear to be concerning; they are interpreted as naturally occurring due to the aforementioned skewness.

4. Product_Type

In [ ]:
labeled_barplot(_data_copy, "Product_Type")
In [ ]:
_percentage_types = _data_copy['Product_Type'].value_counts(normalize=True, sort=False)*100
_percentage_types.round(1)
Out[ ]:
proportion
Product_Type
Baking Goods 8.2
Breads 2.3
Breakfast 1.2
Canned 7.7
Dairy 9.1
Frozen Foods 9.3
Fruits and Vegetables 14.3
Hard Drinks 2.1
Health and Hygiene 7.2
Household 8.4
Meat 7.1
Others 1.7
Seafood 0.9
Snack Foods 13.1
Soft Drinks 5.9
Starchy Foods 1.6

Observations¶

  • As previously observed, fruits and vegetables represent the dominant product category, accounting for 14.3% of all products sold (1,249 items). This reinforces the store chain's commitment to offering products that promote customer well-being. However, snacks constitute the second-highest selling category, representing 13.1% of products, and it is currently unknown whether these are healthy options.
  • Other categories such as "Breads," "Breakfast," "Hard Drinks," "Seafood," and "Starchy Foods" each represent less than 5% of the total products sold.

5. Product_MRP

Maximum retail price of each product

In [ ]:
histogram_boxplot(_data_copy, "Product_MRP")

Observations¶

  • The variable representing the maximum retail price of each product also exhibits a normal distribution. This is identifiable by its symmetrical bell shape, with data points clustered around a central value and decreasing in frequency as they move away from the center. The strongest concentration of prices is observed around 145 USD - 150 USD, which represents the most frequent price range.

6. Store_Id

In [ ]:
labeled_barplot(_data_copy, "Store_Id")
In [ ]:
_percentage_types = _data_copy['Store_Id'].value_counts(normalize=True, sort=False)*100
_percentage_types.round(1)
Out[ ]:
proportion
Store_Id
OUT004 53.4
OUT003 15.4
OUT001 18.1
OUT002 13.1

Observations¶

  • As previously mentioned, this dataset records products sold in only four retail outlets, identified by IDs OUT001, OUT002, OUT003, and OUT004.
  • OUT004 has the highest number of registered products, totaling 4,676 items, which represents 53.4% of the dataset. Conversely, OUT002 has the fewest registered products, accounting for 13.1% of the products (1,152 items).

7. Store_Establishment_Year

Year in which the store was established

In [ ]:
labeled_barplot(_data_copy, "Store_Establishment_Year")
In [ ]:
_percentage_types = _data_copy['Store_Establishment_Year'].value_counts(normalize=True, sort=False)*100
_percentage_types.round(1)
Out[ ]:
proportion
Store_Establishment_Year
2009 53.4
1999 15.4
1987 18.1
1998 13.1

Observations¶

  • More than 50% of the products in the dataset are sold in the store established in 2009, accounting for 4,676 products to be precise.

8. Store_Size

Size of the store depending on sq. feet

In [ ]:
labeled_barplot(_data_copy, "Store_Size")
In [ ]:
_percentage_types = _data_copy['Store_Size'].value_counts(normalize=True, sort=False)*100
_percentage_types.round(1)
Out[ ]:
proportion
Store_Size
High 18.1
Medium 68.8
Small 13.1

Observations¶

  • Nearly 70% of the records come from medium-sized stores, while 18.1% come from large ("High") stores.

9. Store_Location_City_Type

Type of city in which the store is located

In [ ]:
labeled_barplot(_data_copy, "Store_Location_City_Type")
In [ ]:
_percentage_types = _data_copy['Store_Location_City_Type'].value_counts(normalize=True, sort=False)*100
_percentage_types.round(1)
Out[ ]:
proportion
Store_Location_City_Type
Tier 1 15.4
Tier 2 71.5
Tier 3 13.1

Observations¶

  • Over 70% of the records come from stores located in Tier 2 cities, where the standard of living is middle class. Stores in areas with a higher standard of living (Tier 1) and those with lower purchasing power (Tier 3) account for smaller, yet very similar, proportions: 15.4% and 13.1%, respectively.

10. Store_Type

In [ ]:
labeled_barplot(_data_copy, "Store_Type")
In [ ]:
_percentage_types = _data_copy['Store_Type'].value_counts(normalize=True, sort=False)*100
_percentage_types.round(1)
Out[ ]:
proportion
Store_Type
Departmental Store 15.4
Food Mart 13.1
Supermarket Type1 18.1
Supermarket Type2 53.4

Observations¶

  • More than half of the records correspond to "Supermarket Type2" stores. The other store types, "Supermarket Type1," "Departmental Store," and "Food Mart," are represented in smaller proportions: 18.1%, 15.4%, and 13.1%, respectively.

Bivariate Analysis¶

Correlation

In [ ]:
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
plt.figure(figsize=(6,4))
sns.heatmap(_data_copy[_num_col].corr(),annot=True,cmap='Spectral',vmin=-1,vmax=1)
plt.show()

Observations¶

  • A positive correlation is observed between the maximum retail price and the weight of the product. This suggests that a heavier product tends to have a higher price.
  • Product weight also correlates with total sales, which is logical given the established relationship between product price and weight.
  • Similarly, and as anticipated, product price correlates positively with total sales.

1. Product weight versus total product sales

In [ ]:
plt.figure(figsize=(8, 6))
sns.kdeplot(x='Product_Weight', y='Product_Store_Sales_Total', data=_data_copy, fill=True, cmap='viridis', cbar=True)
plt.title('Density plot - Product weight & Total product sales')
plt.xlabel('Product_Weight')
plt.ylabel('Product_Store_Sales_Total')
plt.grid(True)
plt.show()

Observations¶

  • As noted previously, the graph also displays a positive correlation. This density plot not only indicates a relationship between the variables but also highlights where this relationship is strongest. Lighter colors at the center of the plot signify a higher density of data points, illustrating where the relationship between product weight and sales volume is most concentrated. For instance, a considerable concentration of products is observed within an approximate weight range of 12 lb to 14 lb, generating sales between 3,500 USD and 4,000 USD. These represent the most frequent characteristics within the dataset. This observation reinforces the statistical summary, which indicated that 50% of products weigh 12.66 lb and 75% do not exceed 14.18 lb. Similarly, the summary showed that the median product generates up to 3,452.34 USD, and 75% do not exceed 4,145.16 USD.
  • Conversely, as the colors darken, the data density decreases, meaning there are fewer products with the weight and sales characteristics represented by each region of the graph.

2. Sugar content versus total product sales

In [ ]:
sns.boxplot(x="Product_Sugar_Content", y="Product_Store_Sales_Total", data=_data_copy);

Observations¶

  • An interesting finding is that despite the majority of products sold being low in sugar, all sugar content categories generate very similar revenues. In other words, the fact that over 50% of products are low in sugar does not translate to higher sales compared to products with a regular sugar level or sugar-free products, at least when interpreting the median values.

3. Allocated area versus total product sales

In [ ]:
plt.figure(figsize=(8, 6))
sns.kdeplot(x='Product_Allocated_Area', y='Product_Store_Sales_Total', data=_data_copy, fill=True, cbar=True)
plt.title('Density plot - Allocated area & Total product sales')
plt.xlabel('Product_Allocated_Area')
plt.ylabel('Product_Store_Sales_Total')
plt.grid(True)
plt.show()

Observations¶

  • There is no clear linear correlation observed between the product's allocated display area and its sales. The data points are dispersed without a distinct trend, indicating that a larger allocated area does not guarantee higher sales for a specific product, nor does a smaller area imply lower sales.
  • Additionally, the majority of points are concentrated on the left side of the graph, specifically between 0.00 and approximately 0.18. This observation is consistent with the results from the histogram in the preceding section's univariate analysis.

4. Product type versus total product sales

In [ ]:
# Group by product type and sum up the total sales
_revenue_x_type = _data_copy.groupby(["Product_Type"], as_index=False)["Product_Store_Sales_Total"].sum()

colors = [plt.cm.Spectral(i / float(len(_revenue_x_type['Product_Type']))) for i in range(len(_revenue_x_type['Product_Type']))]

plt.figure(figsize=[12, 8])
squarify.plot(sizes=_revenue_x_type['Product_Store_Sales_Total'],
              label=_revenue_x_type['Product_Type'],
              color=colors,
              alpha=.8,
              value=_revenue_x_type['Product_Store_Sales_Total'])
plt.title('Total revenue per product type')
plt.axis('off')
plt.show()
In [ ]:
_revenue_x_type.sort_values(by='Product_Store_Sales_Total', ascending=False)
Out[ ]:
Product_Type Product_Store_Sales_Total
6 Fruits and Vegetables 4300833.27
13 Snack Foods 3988996.95
4 Dairy 2811918.04
5 Frozen Foods 2809980.83
9 Household 2564740.17
0 Baking Goods 2452986.00
3 Canned 2300082.71
8 Health and Hygiene 2163707.21
10 Meat 2129211.94
14 Soft Drinks 1797044.72
1 Breads 714942.24
7 Hard Drinks 625814.62
11 Others 541496.30
15 Starchy Foods 518774.45
2 Breakfast 362130.41
12 Seafood 272404.04

Observations¶

  • The product type generating the highest sales is Fruits and Vegetables, with a total of 4,300,833.27 USD. This is followed by Snack Foods, which generated 3,988,996.95 USD, and Dairy, with 2,811,918.04 USD.

5. Maximum retail price (MRP) of each product versus total product sales

In [ ]:
plt.figure(figsize=(8, 6))
sns.kdeplot(x='Product_MRP', y='Product_Store_Sales_Total', data=_data_copy, fill=True, cmap='viridis', cbar=True)
plt.title('Density plot - Product MRP & Total product sales')
plt.xlabel('Product_MRP')
plt.ylabel('Product_Store_Sales_Total')
plt.grid(True)
plt.show()
In [ ]:
sns.scatterplot(data=_data_copy, x='Product_MRP', y='Product_Store_Sales_Total')
plt.show()

Observations¶

  • The relationship between these variables exhibits a positive correlation. A significant concentration of products is observed within an approximate price range of 140 USD - 160 USD, generating sales between 3,000 USD and 4,000 USD. These figures represent the most frequent characteristics within the dataset.
  • This observation reinforces the statistical summary, which indicated that 50% of products are priced at 146.74 USD, and 75% do not exceed 167.58 USD. Similarly, the summary showed that the median product generates up to 3,452.34 USD, with 75% not exceeding 4,145.16 USD.
  • Conversely, darker colors indicate a decrease in data density, signifying fewer products with the maximum price and sales characteristics represented in those areas of the graph.

6. Store ID versus total product sales

In [ ]:
# Group by Store_Id and sum up the total sales
_revenue_x_id = _data_copy.groupby(["Store_Id"], as_index=False)["Product_Store_Sales_Total"].sum()

colors = [plt.cm.Spectral(i / float(len(_revenue_x_id['Store_Id']))) for i in range(len(_revenue_x_id['Store_Id']))]

plt.figure(figsize=[8, 4])
squarify.plot(sizes=_revenue_x_id['Product_Store_Sales_Total'],
              label=_revenue_x_id['Store_Id'],
              color=colors,
              alpha=.8,
              value=_revenue_x_id['Product_Store_Sales_Total'])
plt.title('Total revenue per store ID')
plt.axis('off')
plt.show()
In [ ]:
_revenue_x_id.sort_values(by='Product_Store_Sales_Total', ascending=False)
Out[ ]:
Store_Id Product_Store_Sales_Total
3 OUT004 15427583.43
2 OUT003 6673457.57
0 OUT001 6223113.18
1 OUT002 2030909.72

Observations¶

  • The store with the highest total sales is OUT004, achieving 15,427,583.43 USD.
  • It is confirmed that the store with the lowest total sales is OUT002, with only 2,030,909.72 USD.

7. Store Establishment Year versus total product sales

In [ ]:
# Group by Store_Establishment_Year and sum up the total sales
_revenue_x_year = _data_copy.groupby(["Store_Establishment_Year"], as_index=False)["Product_Store_Sales_Total"].sum()

colors = [plt.cm.Spectral(i / float(len(_revenue_x_year['Store_Establishment_Year']))) for i in range(len(_revenue_x_year['Store_Establishment_Year']))]

plt.figure(figsize=[8, 4])
squarify.plot(sizes=_revenue_x_year['Product_Store_Sales_Total'],
              label=_revenue_x_year['Store_Establishment_Year'],
              color=colors,
              alpha=.8,
              value=_revenue_x_year['Product_Store_Sales_Total'])
plt.title('Total revenue per store year of establishment')
plt.axis('off')
plt.show()
[Figure: Treemap of total revenue per store establishment year]
In [ ]:
_revenue_x_year.sort_values(by='Product_Store_Sales_Total', ascending=False)
Out[ ]:
Store_Establishment_Year Product_Store_Sales_Total
3 2009 15427583.43
2 1999 6673457.57
0 1987 6223113.18
1 1998 2030909.72

Observations¶

  • The store established in 2009 generates the highest total sales, amounting to 15,427,583.43 USD.

  • The store inaugurated in 1998, while not the oldest, has the lowest total sales per product, with a total of 2,030,909.72 USD.

8. Store size versus total product sales

In [ ]:
sns.boxplot(x="Store_Size", y="Product_Store_Sales_Total", data=_data_copy);
[Figure: Box plot of Product_Store_Sales_Total by Store_Size]
In [ ]:
summary_stats = _data_copy.groupby('Store_Size')['Product_Store_Sales_Total'].agg(
    mean='mean',
    Q1=lambda x: x.quantile(0.25),
    median='median',
    Q3=lambda x: x.quantile(0.75),
    min='min',
    max='max'
).reset_index()

print(summary_stats)
  Store_Size         mean         Q1    median         Q3      min      max
0       High  3923.778802  3285.5100  4139.645  4639.4000  2300.56  4997.63
1     Medium  3668.222573  3060.3800  3511.100  3969.8100  1561.06  8000.00
2      Small  1762.942465  1495.4725  1889.495  2133.6225    33.00  2299.63

Observations¶

  • As anticipated, store size significantly influences product sales. Larger stores tend to generate higher median sales than medium-sized and small-sized stores.
  • Small stores consistently exhibit the lowest sales performance, likely due to factors such as a more limited inventory and reduced physical space.

9. Store location versus total product sales

In [ ]:
# Group by Store_Location_City_Type and sum up the total sales
_revenue_x_location = _data_copy.groupby(["Store_Location_City_Type"], as_index=False)["Product_Store_Sales_Total"].sum()

colors = [plt.cm.Spectral(i / float(len(_revenue_x_location['Store_Location_City_Type']))) for i in range(len(_revenue_x_location['Store_Location_City_Type']))]

plt.figure(figsize=[8, 4])
squarify.plot(sizes=_revenue_x_location['Product_Store_Sales_Total'],
              label=_revenue_x_location['Store_Location_City_Type'],
              color=colors,
              alpha=.8,
              value=_revenue_x_location['Product_Store_Sales_Total'])
plt.title('Total revenue per store location')
plt.axis('off')
plt.show()
[Figure: Treemap of total revenue per store location]
In [ ]:
_revenue_x_location.sort_values(by='Product_Store_Sales_Total', ascending=False)
Out[ ]:
Store_Location_City_Type Product_Store_Sales_Total
1 Tier 2 21650696.61
0 Tier 1 6673457.57
2 Tier 3 2030909.72

Observations¶

  • Tier 2 represents the location with the highest total sales per product, significantly surpassing both Tier 1 and Tier 3 locations.

10. Store type versus total product sales

In [ ]:
# Group by Store_Type and sum up the total sales
_revenue_store_type = _data_copy.groupby(["Store_Type"], as_index=False)["Product_Store_Sales_Total"].sum()

colors = [plt.cm.Spectral(i / float(len(_revenue_store_type['Store_Type']))) for i in range(len(_revenue_store_type['Store_Type']))]

plt.figure(figsize=[8, 4])
squarify.plot(sizes=_revenue_store_type['Product_Store_Sales_Total'],
              label=_revenue_store_type['Store_Type'],
              color=colors,
              alpha=.8,
              value=_revenue_store_type['Product_Store_Sales_Total'])
plt.title('Total revenue per store type')
plt.axis('off')
plt.show()
[Figure: Treemap of total revenue per store type]
In [ ]:
_revenue_store_type.sort_values(by='Product_Store_Sales_Total', ascending=False)
Out[ ]:
Store_Type Product_Store_Sales_Total
3 Supermarket Type2 15427583.43
0 Departmental Store 6673457.57
2 Supermarket Type1 6223113.18
1 Food Mart 2030909.72

Observations¶

  • As previously observed, the "Supermarket Type2" store type accounts for the highest product sales, specifically totaling 15,427,583.43 USD.

11. Store ID versus Store establishment year

In [ ]:
plt.figure(figsize=(6, 3))
sns.pointplot(x='Store_Id', y='Store_Establishment_Year', data=_data_copy, join=False, ci=None, marker='o', linestyles='')
plt.xlabel('Store ID')
plt.ylabel('Establishment Year')
plt.title('Establishment Year - Store')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y')
plt.tight_layout()
plt.show()
[Figure: Point plot of establishment year by store ID]

Observations¶

  • The oldest store is OUT001, established in 1987. The newest store, inaugurated in 2009, is OUT004.
In [ ]:
revenue_by_store_product_type = _data_copy.groupby(['Store_Id', 'Product_Type'])['Product_Store_Sales_Total'].sum().reset_index()
stores_group1 = ['OUT001','OUT002']
stores_group2 = ['OUT003','OUT004']

df_group1 = revenue_by_store_product_type[revenue_by_store_product_type['Store_Id'].isin(stores_group1)]
df_group2 = revenue_by_store_product_type[revenue_by_store_product_type['Store_Id'].isin(stores_group2)]

# First plot
plt.figure(figsize=(16, 8))
ax1 = sns.barplot(
    data=df_group1, x='Store_Id', y='Product_Store_Sales_Total',hue='Product_Type',palette='tab20'
)
plt.title('Total product sales per store (OUT001 and OUT002)')
plt.xlabel('Store ID')
plt.ylabel('Product_Store_Sales_Total')
plt.xticks(rotation=45, ha='right')
plt.legend(title='Product type', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()

for p in ax1.patches:
    height = p.get_height()
    if height > 50:
        ax1.annotate(f'{height:.0f}',
                     (p.get_x() + p.get_width() / 2., height),
                     ha='center', va='center',
                     xytext=(0, 5), textcoords='offset points',
                     fontsize=7, color='black')
plt.show()

# Second plot
plt.figure(figsize=(16, 8))
ax2 = sns.barplot(
    data=df_group2, x='Store_Id', y='Product_Store_Sales_Total',hue='Product_Type', palette='tab20'
)
plt.title('Total product sales per store (OUT003 and OUT004)')
plt.xlabel('Store ID')
plt.ylabel('Product_Store_Sales_Total')
plt.xticks(rotation=45, ha='right')
plt.legend(title='Product type', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()

for p in ax2.patches:
    height = p.get_height()
    if height > 50:
        ax2.annotate(f'{height:.0f}',
                     (p.get_x() + p.get_width() / 2., height),
                     ha='center', va='center',
                     xytext=(0, 5), textcoords='offset points',
                     fontsize=7, color='black')
plt.show()
[Figure: Total product sales per store (OUT001 and OUT002), by product type]
[Figure: Total product sales per store (OUT003 and OUT004), by product type]

Observations¶

  • At Store ID OUT001, the highest-selling product categories are "Fruits and Vegetables" and "Snack Foods", generating total sales of 792,993 USD and 806,142 USD respectively. The lowest-selling category is "Breakfast," with total sales of 38,161 USD. The second-lowest selling category is "Seafood," with total sales of 52,937 USD.

  • Similarly, for Store ID OUT002, "Fruits and Vegetables" and "Snack Foods" are the predominant categories. In this store, the least sold category is "Seafood," with sales totaling 17,663 USD.

  • For both Store ID OUT003 and OUT004, "Fruits and Vegetables" and "Snack Foods" also consistently emerge as the top-selling categories. Notably, Store OUT004 records the highest sales within these categories, with 2,311,900 USD from "Fruits and Vegetables" and 2,009,027 USD from "Snack Foods." In both OUT003 and OUT004, "Seafood" consistently represents the lowest-selling category.

Data Preprocessing¶

Feature engineering¶

Outlier check

In [ ]:
Q1 = _data_copy.select_dtypes(include=["float64", "int64"]).quantile(0.25)
Q3 = _data_copy.select_dtypes(include=["float64", "int64"]).quantile(0.75)
# Inter Quantile Range
IQR = Q3 - Q1

lower = (Q1 - 1.5 * IQR) #Values outside these bounds are outliers
upper = (Q3 + 1.5 * IQR) #Values outside these bounds are outliers

((_data_copy.select_dtypes(include=["float64", "int64"]) < lower)|(_data_copy.select_dtypes(include=["float64", "int64"]) > upper)).sum() / len(_data_copy) * 100
Out[ ]:
0
Product_Weight 0.616227
Product_Allocated_Area 1.186808
Product_MRP 0.650462
Store_Establishment_Year 0.000000
Product_Store_Sales_Total 1.357982

Observations¶

  • The outliers are not significant, and therefore, no action will be taken.

Variable: Product_Sugar_Content. Unify "reg" with "Regular"

In [ ]:
_data_copy['Product_Sugar_Content'] = _data_copy['Product_Sugar_Content'].replace("reg", "Regular")
In [ ]:
_data_copy['Product_Sugar_Content'].unique()
Out[ ]:
['Low Sugar', 'Regular', 'No Sugar']
Categories (3, object): ['Low Sugar', 'No Sugar', 'Regular']

Combine variables to capture nonlinear or proportional relationships

In [ ]:
_data_copy['_price_per_Gram'] = _data_copy['Product_MRP'] / _data_copy['Product_Weight']
_data_copy['_weight_per_area'] = _data_copy['Product_Weight'] / _data_copy['Product_Allocated_Area']

Observations¶

  • The objective of creating new variables through their combination is to enable the model to learn non-linear relationships, thereby facilitating the identification of implicit patterns.

Context proxies

In [ ]:
_avg_sales = _data_copy.groupby("Store_Id")["Product_Store_Sales_Total"].mean().reset_index()
_avg_sales.rename(columns={"Product_Store_Sales_Total": "_avg_sales_x_store"}, inplace=True)
_data_copy = _data_copy.merge(_avg_sales, on="Store_Id", how="left")

Observations¶

  • The objective is to utilize a derived variable to indirectly represent its associated variable. For example, calculating the average sales per store could indirectly reflect each location's capacity to generate profit, encompassing both the store itself and its surrounding area (e.g., the socio-economic status of the neighborhood).

Drop unnecessary column

In [ ]:
_data_copy.drop(['Product_Id','Store_Id','Store_Establishment_Year'], axis=1, inplace=True)
In [ ]:
_data_copy.head()
Out[ ]:
Product_Weight Product_Sugar_Content Product_Allocated_Area Product_Type Product_MRP Store_Size Store_Location_City_Type Store_Type Product_Store_Sales_Total _price_per_Gram _weight_per_area _avg_sales_x_store
0 12.66 Low Sugar 0.027 Frozen Foods 117.08 Medium Tier 2 Supermarket Type2 2842.40 9.248025 468.888889 3299.312111
1 16.54 Low Sugar 0.144 Dairy 171.43 Medium Tier 1 Departmental Store 4830.02 10.364571 114.861111 4946.966323
2 14.28 Regular 0.031 Canned 162.08 High Tier 2 Supermarket Type1 4130.16 11.350140 460.645161 3923.778802
3 12.10 Low Sugar 0.112 Baking Goods 186.31 High Tier 2 Supermarket Type1 4132.18 15.397521 108.035714 3923.778802
4 9.57 No Sugar 0.010 Health and Hygiene 123.67 Small Tier 3 Food Mart 2279.36 12.922675 957.000000 1762.942465

Preprocessing pipeline for encoding categorical features

In [ ]:
# Categorical features to one-hot encode
_cat = ['Product_Sugar_Content', 'Product_Type', 'Store_Size',
        'Store_Location_City_Type', 'Store_Type']

_categorical_transformer = OneHotEncoder(handle_unknown='ignore')

_pipeline_preprocessor = ColumnTransformer(
    transformers=[
        ('cat', _categorical_transformer, _cat)
    ]
)

Additional observations on feature engineering¶

  • Imputation is not necessary, as there are no missing values in the dataset.
  • The models to be implemented, Random Forest and Gradient Boosting, are tree-based and do not require feature scaling.
  • Since "Product_MRP" and "Product_Store_Sales_Total" display approximately normal (symmetrical) distributions, additional transformations to mitigate skew or extreme outliers are unnecessary.
  • Overall, the dataset is clean and well organized, so extensive feature engineering is not required.

Data splitting

In [ ]:
X = _data_copy.drop(["Product_Store_Sales_Total"], axis=1)
y = _data_copy["Product_Store_Sales_Total"]
In [ ]:
# Splitting data into training, validation and test set:
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1
)
print(X_train.shape, X_val.shape, X_test.shape)
(5257, 11) (1753, 11) (1753, 11)
In [ ]:
print("Shape of training: ", X_train.shape)
print("Shape of validation: ", X_val.shape)
print("Shape of test: ", X_test.shape)
Shape of training:  (5257, 11)
Shape of validation:  (1753, 11)
Shape of test:  (1753, 11)

Model Building¶

Metric¶

  • Problem type and objective: this is a regression problem, aimed at forecasting the Product_Store_Sales_Total variable. The goal is to quantify the difference between the values generated by the model and the actual sales figures.

  • Key metric: RMSE (Root Mean Squared Error). RMSE is chosen because it penalizes larger deviations more heavily, mitigating errors that could significantly impact the prediction of high-revenue products. In other words, not all sales contribute equally to business value.

  • Since outliers in the data are minimal, they are unlikely to distort the RMSE.

  • RMSE indicates the average magnitude of the difference between the predictions and the actual values.

  • Complementary metric: R-Squared (coefficient of determination), used to measure the percentage of sales variation explained by the implemented models.
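For reference, these metrics can be computed directly with scikit-learn. A minimal sketch on toy arrays (the values below are illustrative, not the project data):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Toy actual vs. predicted sales figures (illustrative values only)
y_true = np.array([3000.0, 4000.0, 3500.0, 4200.0])
y_pred = np.array([3100.0, 3900.0, 3600.0, 4000.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large deviations
mae = mean_absolute_error(y_true, y_pred)           # average absolute error
r2 = r2_score(y_true, y_pred)                       # share of variance explained
mape = np.mean(np.abs((y_true - y_pred) / y_true))  # average relative error

print(rmse, mae, r2, mape)
```

In the notebook these computations are wrapped in the `model_performance_regression` helper used throughout the model-building cells.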

Models: Random Forest Regressor and Gradient Boosting Regressor¶

Tasks¶

  • Fit: training and validation

    • Pipeline: make_pipeline (preprocessing pipeline for encoding categorical features, model), then fit.
  • Predict: pipeline, X_train, y_train

  • Compute metrics: R2 score, adjusted R2, RMSE, MAE, MAPE.

In [ ]:
_models.append(("Random Forest Regressor - Before tuning", RandomForestRegressor(random_state=1)))
_models.append(("Gradient Boost Regressor - Before tuning", GradientBoostingRegressor(random_state=1)))

Performance - Training

In [ ]:
_perf_train
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE Model
0 599.302247 471.734564 0.683112 0.682447 0.166443 Random Forest Regressor - Before tuning
1 602.784108 474.329441 0.679419 0.678747 0.167860 Gradient Boost Regressor - Before tuning

Performance - Validation

In [ ]:
_perf_val
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE Model
0 589.020035 459.391759 0.690892 0.688939 0.188043 Random Forest Regressor - Before tuning
1 576.429627 451.981812 0.703966 0.702095 0.186927 Gradient Boost Regressor - Before tuning

Observations¶

RMSE (Root Mean Squared Error): This metric quantifies the average magnitude by which predictions deviate from actual values.

  • While the difference between the two models isn't substantial, the Gradient Boost model demonstrates a lower RMSE of 576.43 on the validation set, compared to Random Forest's 589.02. Consequently, Gradient Boost exhibits superior predictive accuracy on unseen data.
  • Training and validation errors are similar in magnitude, so there is no concern regarding overfitting.

R-Squared: This metric represents the proportion of variability accounted for by the model.

  • The R-Squared value on the validation set is also marginally superior for the Gradient Boost model (0.70 versus 0.69); still, both models do an acceptable job of explaining the data's variability.

Model Performance Improvement - Hyperparameter Tuning¶

Random forest

Hyperparameters to consider¶

  • n_estimators: [50, 100, 150, 200],

  • max_depth: [10, 15, 20, None],

  • min_samples_leaf: [10, 15, 25, 30],

  • max_features: ['sqrt', 'log2'],

  • max_samples: [0.3, 0.5, 0.7, 0.9],

  • criterion: ['squared_error']

  • RandomizedSearchCV:

    • estimator=_pipeline
    • Hyperparameters
    • jobs = -1
    • iterations = 60
    • scoring = neg_root_mean_squared_error
    • cv = 5
    • random state = 1
In [ ]:
_tuned_RF = RandomForestRegressor(random_state=1)
_pipeline_RF = make_pipeline(_pipeline_preprocessor, _tuned_RF)

# _rf_randomized_cv is the RandomizedSearchCV fitted over _pipeline_RF
print("Best params:", _rf_randomized_cv.best_params_)
print("Best CV RMSE:", -_rf_randomized_cv.best_score_)  # sign flipped back to positive RMSE
Best params: {'randomforestregressor__n_estimators': 100, 'randomforestregressor__min_samples_leaf': 30, 'randomforestregressor__max_samples': 0.9, 'randomforestregressor__max_features': 'sqrt', 'randomforestregressor__max_depth': 15, 'randomforestregressor__criterion': 'squared_error'}
Best CV RMSE: 609.5711902921707
In [ ]:
best_pipeline_rf = _rf_randomized_cv.best_estimator_
best_pipeline_rf
Out[ ]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['Product_Sugar_Content',
                                                   'Product_Type', 'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type'])])),
                ('randomforestregressor',
                 RandomForestRegressor(max_depth=15, max_features='sqrt',
                                       max_samples=0.9, min_samples_leaf=30,
                                       random_state=1))])
In [ ]:
_predictions_train_rf = model_performance_regression(best_pipeline_rf, X_train, y_train)
_predictions_train_rf["Model"] = 'Random Forest Regressor - After tuning'
_perf_train = pd.concat([_perf_train,_predictions_train_rf], ignore_index=True)

_predictions_train_rf
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE Model
0 605.067619 476.487114 0.676986 0.676308 0.169244 Random Forest Regressor - After tuning
In [ ]:
_predictions_val_rf = model_performance_regression(best_pipeline_rf, X_val, y_val)
_predictions_val_rf["Model"] = 'Random Forest Regressor - After tuning'
_perf_val = pd.concat([_perf_val,_predictions_val_rf], ignore_index=True)

_predictions_val_rf
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE Model
0 573.516807 450.611094 0.70695 0.705098 0.184908 Random Forest Regressor - After tuning

Gradient boosting

Hyperparameters to consider¶

  • n_estimators: [100, 150, 200, 250],

  • learning_rate: [0.01, 0.05, 0.1],

  • subsample: [0.5, 0.6, 0.7, 0.8, 1],

  • max_features: ['sqrt', 'log2', None],

  • max_depth: [3, 5, 8, 10,],

  • min_samples_split: [6, 8, 10, 12]

  • min_samples_leaf:[4, 8,10]

  • RandomizedSearchCV:

    • estimator=_pipeline
    • Hyperparameters
    • jobs = -1
    • iterations = 60
    • scoring = neg_root_mean_squared_error
    • cv = 5
    • random state = 1
In [ ]:
_tuned_GB = GradientBoostingRegressor(random_state=1)
_pipeline = make_pipeline(_pipeline_preprocessor, _tuned_GB)

# Best Hyperparameters scores (positive RMSE)
print("Best params:", _GB_randomized_cv.best_params_)
print("Best CV RMSE:", -_GB_randomized_cv.best_score_)
Best params: {'gradientboostingregressor__subsample': 0.7, 'gradientboostingregressor__n_estimators': 100, 'gradientboostingregressor__min_samples_split': 6, 'gradientboostingregressor__min_samples_leaf': 8, 'gradientboostingregressor__max_features': 'sqrt', 'gradientboostingregressor__max_depth': 3, 'gradientboostingregressor__learning_rate': 0.05}
Best CV RMSE: 610.1829332102039
In [ ]:
best_pipeline_gb = _GB_randomized_cv.best_estimator_
best_pipeline_gb
Out[ ]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['Product_Sugar_Content',
                                                   'Product_Type', 'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type'])])),
                ('gradientboostingregressor',
                 GradientBoostingRegressor(learning_rate=0.05,
                                           max_features='sqrt',
                                           min_samples_leaf=8,
                                           min_samples_split=6, random_state=1,
                                           subsample=0.7))])
In [ ]:
_predictions_train_gb = model_performance_regression(best_pipeline_gb, X_train, y_train)
_predictions_train_gb["Model"] = 'Gradient boost - After tuning'
_perf_train = pd.concat([_perf_train,_predictions_train_gb], ignore_index=True)

_predictions_train_gb
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE Model
0 604.465527 476.781998 0.677628 0.676952 0.169115 Gradient boost - After tuning
In [ ]:
_predictions_val_gb = model_performance_regression(best_pipeline_gb, X_val, y_val)
_predictions_val_gb["Model"] = 'Gradient boost - After tuning'
_perf_val = pd.concat([_perf_val,_predictions_val_gb], ignore_index=True)

_predictions_val_gb
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE Model
0 573.8879 451.414201 0.706571 0.704717 0.186055 Gradient boost - After tuning

Observations¶

After hyperparameter tuning, an improvement in the RMSE was observed on the validation set. Specifically, the Random Forest model's RMSE decreased from 589.02 to 573.52, while the Gradient Boost model's RMSE improved from 576.43 to 573.89.

Although the difference is marginal, the enhancement resulting from the tuning process is evident.

Model Performance Comparison, Final Model Selection, and Serialization¶

Performance comparison - Training¶

In [ ]:
_perf_train
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE Model
0 599.302247 471.734564 0.683112 0.682447 0.166443 Random Forest Regressor - Before tuning
1 602.784108 474.329441 0.679419 0.678747 0.167860 Gradient Boost Regressor - Before tuning
2 605.067619 476.487114 0.676986 0.676308 0.169244 Random Forest Regressor - After tuning
3 604.465527 476.781998 0.677628 0.676952 0.169115 Gradient boost - After tuning

Observations¶

Training set

  • For the Random Forest model, a slight deterioration in the RMSE was observed, increasing from 599.30 to 605.07. A similar trend was noted for the R-Squared value.

  • The Gradient Boost model also exhibited a deterioration in its RMSE, rising from 602.78 to 604.46.

  • However, in both instances, the difference is not substantial. The values obtained from the validation set will be reviewed, as they are of greater interest for assessing whether an improvement in the models' generalization capabilities for predicting unseen data has occurred.

Performance comparison - Validation¶

In [ ]:
_perf_val
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE Model
0 589.020035 459.391759 0.690892 0.688939 0.188043 Random Forest Regressor - Before tuning
1 576.429627 451.981812 0.703966 0.702095 0.186927 Gradient Boost Regressor - Before tuning
2 573.516807 450.611094 0.706950 0.705098 0.184908 Random Forest Regressor - After tuning
3 573.887900 451.414201 0.706571 0.704717 0.186055 Gradient boost - After tuning

Observations¶

Validation set

  • For the Random Forest model, the RMSE improved from 589.02 to 573.52. Similarly, the R-Squared also showed improvement, increasing from 0.69 to 0.71.

  • In the case of the Gradient Boost model, an improved performance in the RMSE was also observed, with values decreasing from 576.43 to 573.88.

  • While the results on the training set did not show improvement, the validation set results did improve. Therefore, it can be interpreted that the models with hyperparameter tuning are indeed better at predicting unseen cases, which is the scenario of greater interest.

Final model selection¶

  • Both models yielded very similar results. However, the Random Forest model was selected after applying hyperparameter tuning due to its superior RMSE value of 573.52 on the validation set, and its better R-Squared metric of 0.71.

  • Additionally, Random Forest models offer greater interpretability compared to Gradient Boost models, and their training time is shorter.

Performance check on the test set¶

In [ ]:
_predictions_test = model_performance_regression(best_pipeline_rf, X_test, y_test)
_predictions_test
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 621.011706 489.714707 0.66544 0.663326 0.166553

Observations¶

  • RMSE and MAE: given that the statistical summary indicates the third quartile (Q3) of sales is 4,145.16 USD, the RMSE (621.01) and MAE (489.71) represent roughly 15% and 12% of that value, respectively. This suggests the model demonstrates relatively good performance.

  • R-Squared: The model explains 66.5% of the variability in the product's total sales across stores, which can be considered acceptable.

  • MAPE: the model's average error is 16.66% of the actual value, which is also an acceptable figure.

  • Conclusion: the selected model can predict product sales per store with an average error of about 16.7%, explaining 66.5% of sales variability. Nevertheless, this capability could be enhanced through more detailed analysis, evaluating alternative models, or further fine-tuning hyperparameters.

Serialization¶

Tasks¶

  • Define the file path to serialize the trained model along with the data preprocessing steps
  • Save the trained model pipeline using joblib
    • joblib.dump(best model pipeline, saved model path)
  • Load the saved model pipeline from the file
    • joblib.load
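The dump/load round trip can be sketched as follows, using a small stand-in estimator (`LinearRegression` and the demo data are illustrative; the notebook serializes `best_pipeline_rf` to its own path):

```python
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression  # stand-in for the tuned pipeline

# Fit a tiny stand-in model: y = 2x + 1
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 2 * X_demo.ravel() + 1
model = LinearRegression().fit(X_demo, y_demo)

# Define the file path and save the fitted estimator
_model_path = "SuperKart_Model_Deployment.joblib"
joblib.dump(model, _model_path)

# Load it back; the loaded object predicts exactly like the in-memory one
saved_model = joblib.load(_model_path)
print(saved_model.predict([[4.0]]))
```

Because the notebook serializes the entire pipeline (preprocessing plus model), the loaded object can accept raw feature rows directly, which is what the Flask backend relies on.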

Saved model check

In [ ]:
saved_model
Out[ ]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['Product_Sugar_Content',
                                                   'Product_Type', 'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type'])])),
                ('randomforestregressor',
                 RandomForestRegressor(max_depth=15, max_features='sqrt',
                                       max_samples=0.9, min_samples_leaf=30,
                                       random_state=1))])

Testing predictions on the test set using the serialized/saved model

In [ ]:
saved_model.predict(X_test)
Out[ ]:
array([3272.21098797, 4893.73359389, 4885.75254721, ..., 4881.93593986,
       3286.629245  , 3307.20963861])

Deployment - Backend¶

Flask Web Framework¶

Tasks¶

  • Store backend server deployment files
  • %%writefile backend_files/app.py
  • Initialize Flask app with a name
  • Load the trained prediction model
  • joblib.load("SuperKart_Model_Deployment.joblib")
  • Define a route for the home page @_superkart_api.get('/')
  • Define an endpoint @_superkart_api.post('/v1/forecast')
  • Get JSON data from the request
  • Extract features from the input data
In [ ]:
@_superkart_api.post('/v1/forecast')
def forecast():
    # Get the JSON payload from the request
    _data = request.get_json()

    # Extract features from the input data
    _input = {
        'Product_Id': _data['Product_Id'],
        'Product_Weight': _data['Product_Weight'],
        'Product_Sugar_Content': _data['Product_Sugar_Content'],
        'Product_Allocated_Area': _data['Product_Allocated_Area'],
        'Product_Type': _data['Product_Type'],
        'Product_MRP': _data['Product_MRP'],
        'Store_Id': _data['Store_Id'],
        'Store_Establishment_Year': _data['Store_Establishment_Year'],
        'Store_Size': _data['Store_Size'],
        'Store_Location_City_Type': _data['Store_Location_City_Type'],
        'Store_Type': _data['Store_Type']
    }

    # Convert the extracted data into a single-row DataFrame
    input_data = pd.DataFrame([_input])

    # Make a prediction using the trained model
    prediction = saved_model.predict(input_data).tolist()[0]

    # Return the prediction as a JSON response
    return jsonify({'Sales forecast': prediction})

# Run the Flask app in debug mode
if __name__ == '__main__':
    _superkart_api.run(debug=True)
Overwriting backend_files/app.py

Dependencies File¶

In [ ]:
%%writefile backend_files/requirements.txt
numpy==2.0.2
pandas==2.2.2
scikit-learn==1.6.1
joblib==1.4.2
xgboost==2.1.4
requests==2.32.3
Werkzeug==2.2.2
flask==2.2.2
gunicorn==20.1.0
uvicorn[standard]
streamlit==1.43.2
Overwriting backend_files/requirements.txt

Dockerfile¶

In [ ]:
%%writefile backend_files/Dockerfile
# Use a minimal base image with Python 3.9 installed
FROM python:3.9-slim

# Set the working directory inside the container to /app
WORKDIR /app

# Copy all files from the current directory on the host to the container's /app directory
COPY . .

# Install Python dependencies listed in requirements.txt
RUN pip3 install --no-cache-dir --upgrade -r requirements.txt

# Define the command to start the application using Gunicorn with 4 worker processes
# - `-w 4`: Uses 4 worker processes for handling requests
# - `-b 0.0.0.0:7860`: Binds the server to port 7860 on all network interfaces
# - `app:_superkart_api`: Runs the Flask instance named `_superkart_api` defined in app.py
CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:7860", "app:_superkart_api"]
Overwriting backend_files/Dockerfile

Setting up a Hugging Face Docker Space for the Backend¶

  • The space was created via the Hugging Face website: https://huggingface.co/
  • Repository id = JosueQB12/SuperKart_JQB
  • Link of the Hugging Face space = https://huggingface.co/spaces/JosueQB12/superkart
  • Base link for specifying endpoints = https://josueqb12-superkart.hf.space/

Uploading Files to Hugging Face Space (Docker Space)¶

Tasks¶

  • Create access token
  • Get repo id (Hugging Face space ID)
  • Login to Hugging Face platform with the access token
  • Initialize the API
  • Upload the backend app files stored in the folder called backend_files
    • api.upload_folder(folder_path (Local folder path), repo_id, repo_type="space")
In [ ]:
 
Uploading...:   0%|          | 0.00/413k [00:00<?, ?B/s]
Out[ ]:
CommitInfo(commit_url='https://huggingface.co/spaces/JosueQB12/superkart/commit/39fcb5d1ee278f67be5d97c535fa40ef17df0594', commit_message='Upload folder using huggingface_hub', commit_description='', oid='39fcb5d1ee278f67be5d97c535fa40ef17df0594', pr_url=None, repo_url=RepoUrl('https://huggingface.co/spaces/JosueQB12/superkart', endpoint='https://huggingface.co', repo_type='space', repo_id='JosueQB12/superkart'), pr_revision=None, pr_num=None)

Share the link of the Hugging Face space - Backend¶

  • https://huggingface.co/spaces/JosueQB12/superkart
  • Base link for specifying endpoints = https://josueqb12-superkart.hf.space/

Deployment - Frontend¶

Streamlit for Interactive UI¶

Tasks¶

  • Create a folder for storing the files needed for frontend UI deployment.
  • %%writefile frontend_files/app.py
  • os.environ["STREAMLIT_HOME"] = "/tmp/.streamlit"
  • os.environ["XDG_CONFIG_HOME"] = "/tmp"
  • os.makedirs("/tmp/.streamlit", exist_ok=True)
In [ ]:
# Input fields
Product_Id = st.text_input("Product Id.", max_chars=10, help="Enter the product Id.")
Product_Weight = st.number_input("Product Weight", min_value=0.0, help="Enter the product weight (lbs).")
Product_Sugar_Content = st.selectbox("Product sugar content", ["Low Sugar", "Regular", "No Sugar"], help="Enter the product sugar content.")
Product_Allocated_Area = st.number_input("Product allocated area", min_value=0.0, help="Enter the ratio of the allocated display area of the product.")
Product_Type = st.selectbox("Product type", ["Baking Goods","Breads","Breakfast","Canned","Dairy","Frozen Foods","Fruits and Vegetables","Hard Drinks",
                                             "Health and Hygiene","Household","Meat","Others","Seafood","Snack Foods","Soft Drinks","Starchy Foods"],
                            help="Enter the product sugar content.")
Product_MRP = st.number_input("Product MRP", min_value=0.0, help="Enter the maximum retail price of the product.")
Store_Id = st.text_input("Store Id.", max_chars=10, help="Enter the store Id.")
Store_Establishment_Year = st.number_input("Store establishment year", min_value=1987, max_value=2025, help="Enter the year in which the store was established.")
Store_Size = st.selectbox("Store size", ["High", "Medium", "Small"], help="Enter the size of the store.")
Store_Location_City_Type = st.selectbox("Store location city type", ["Tier 1", "Tier 2", "Tier 3"], help="Enter the type of city in which the store is located.")
Store_Type = st.selectbox("Store type", ["Departmental Store", "Food Mart", "Supermarket Type1", "Supermarket Type2"],
                          help="Enter the type of store depending on the products that are being sold.")

product_data = {
    "Product_Id": Product_Id,
    "Product_Weight": Product_Weight,
    "Product_Sugar_Content": Product_Sugar_Content,
    "Product_Allocated_Area": Product_Allocated_Area,
    "Product_Type": Product_Type,
    "Product_MRP": Product_MRP,
    "Store_Id": Store_Id,
    "Store_Establishment_Year": Store_Establishment_Year,
    "Store_Size": Store_Size,
    "Store_Location_City_Type": Store_Location_City_Type,
    "Store_Type": Store_Type
}

if st.button("Predict", type='primary'):
    response = requests.post("https://josueqb12-superkart.hf.space/v1/forecast", json=product_data)
    if response.status_code == 200:
        result = response.json()
        _sales_forecast = result["Sales forescast"]
        st.write(f"Based on the information provided, the total predicted product store sales are: {_sales_forecast:.2f}.")
    else:
        st.error("Error in API request")
Overwriting frontend_files/app.py

Dependencies File¶

In [ ]:
%%writefile frontend_files/requirements.txt
pandas==2.2.2
requests==2.32.3
streamlit==1.45.0
Overwriting frontend_files/requirements.txt

Dockerfile¶

In [ ]:
%%writefile frontend_files/Dockerfile
# Use a minimal base image with Python 3.9 installed
FROM python:3.9-slim

# Set the working directory inside the container to /app
WORKDIR /app

# Copy all files from the current directory on the host to the container's /app directory
COPY . .

# Install Python dependencies listed in requirements.txt
RUN pip3 install -r requirements.txt

# Define the command to run the Streamlit app on port 8501 and make it accessible externally
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0", "--server.enableXsrfProtection=false"]

# NOTE: Disable XSRF protection for easier external access in order to make batch predictions
Overwriting frontend_files/Dockerfile

Uploading Files to Hugging Face Space (Streamlit Space)¶

Tasks¶

  • Set access token: Hugging Face token created from access keys in write mode
  • Set repo id: Hugging Face space id
  • Login to Hugging Face platform with the access token
  • Initialize the API
  • Upload the Streamlit app files stored in the folder called frontend_files
    • api.upload_folder( folder_path: Local folder path, repo_id, repo_type)
In [ ]:
 
Out[ ]:
CommitInfo(commit_url='https://huggingface.co/spaces/JosueQB12/superkart_frontend/commit/d6d162f0d45f545aac1cf2fc4790fe7e31497718', commit_message='Upload folder using huggingface_hub', commit_description='', oid='d6d162f0d45f545aac1cf2fc4790fe7e31497718', pr_url=None, repo_url=RepoUrl('https://huggingface.co/spaces/JosueQB12/superkart_frontend', endpoint='https://huggingface.co', repo_type='space', repo_id='JosueQB12/superkart_frontend'), pr_revision=None, pr_num=None)

Share the link of the Hugging Face space - Frontend¶

  • https://huggingface.co/spaces/JosueQB12/superkart_frontend

Testing the Deployed API¶

In [ ]:
_url = "https://josueqb12-superkart.hf.space/v1/forecast"
In [ ]:
payload = {
    "Product_Id": "FD1961",
    "Product_Weight": 9.45,
    "Product_Sugar_Content": "Low Sugar",
    "Product_Allocated_Area": 0.047,
    "Product_Type": "Snack Foods",
    "Product_MRP": 95.95,
    "Store_Id": "OUT002",
    "Store_Establishment_Year": 1998,
    "Store_Size": "Small",
    "Store_Location_City_Type": "Tier 3",
    "Store_Type": "Food Mart"
}
In [ ]:
response = requests.post(_url, json=payload)
response
Out[ ]:
<Response [200]>
In [ ]:
print(response.json())
{'Sales forescast': 1736.006160411349}

Actionable Insights and Business Recommendations¶

Insights from the analysis conducted¶

Interpretation of metrics

  • RMSE (Root Mean Squared Error): On average, predictions deviate by 621 USD from the actual sales.

  • MAE (Mean Absolute Error): On average, the absolute difference between predicted and actual sales is 489.71 USD.

  • R-squared: About 66.5% of the variance in sales is explained by the model.

  • MAPE (Mean Absolute Percentage Error): On average, the predictions are off by 16.7% of the actual sales value.

  • Evaluation

    • Product_Id: FD1961
    • Weight: 9.45
    • Sugar content: Low Sugar
    • Allocated area: 0.047
    • Type: Snack foods
    • MRP: 95.95
    • Sold in a Food Mart
    • Product_Store_Sales_Total = 1684.82 USD

Based on the trained model:

  • The actual sales of product FD1961 in OUT002 were 1,684.82 USD

  • The forecasted sales from the selected model were 1,736 USD.

  • Absolute Error: |1736 - 1684.82| = 51.18 USD

  • Percentage Error: (51.18 / 1684.82) × 100 ≈ 3.04%

In this case, the error was much lower than average, indicating an excellent performance.
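These error figures can be reproduced with scikit-learn's metric functions; the arrays below are illustrative, not the project's actual test set:

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_absolute_percentage_error,
                             mean_squared_error, r2_score)

# Illustrative actual vs. forecasted sales in USD (not the real test set)
y_true = np.array([1684.82, 3200.00, 4100.50, 2750.25])
y_pred = np.array([1736.00, 3010.00, 4420.00, 2600.00])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))           # root mean squared error
mae = mean_absolute_error(y_true, y_pred)                    # mean absolute error
mape = mean_absolute_percentage_error(y_true, y_pred) * 100  # as a percentage
r2 = r2_score(y_true, y_pred)                                # share of variance explained

# Single-prediction check for product FD1961 in store OUT002
abs_err = abs(1736.00 - 1684.82)   # 51.18 USD
pct_err = abs_err / 1684.82 * 100  # ≈ 3.04 %
```

RMSE penalizes large misses more heavily than MAE (it is always at least as large), which is why the two are reported together above.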

  • With a MAPE of 16.7%, our model generally predicts sales within ±17% of actual values. The R-squared of 66.5% shows it captures many of the trends in the data, though improvements are possible through more detailed analysis, evaluating alternative models, or further hyperparameter tuning. Nevertheless, this level of accuracy is generally sufficient for inventory, purchasing, and marketing decisions for most products.

Business recommendations¶

  • Fruits and vegetables represent the highest-selling product category, accounting for 14.3% of all products sold and generating the largest revenue at 4,300,833.27 USD. However, due to their seasonal nature, some produce items may not always be consistently available.

    It would be beneficial to strengthen the acquisition and sales efforts for the second and third most significant product groups: Snack Foods, which generated 3,988,996.95 USD, and Dairy, with 2,811,918.04 USD. This strategy would reduce reliance on perishable products with unpredictable availability. Additionally, "Snack Foods" should not exclusively imply unhealthy options; rather, healthy alternatives can be offered to health-conscious customers, who represent the identified target demographic for the stores.

  • The least sold product category is "Seafood," with only 76 units. It is advisable to assess whether to continue offering these products. By their nature, they require special in-store placement and specific maintenance, such as refrigeration, which may incur costs that do not justify the return on investment. Furthermore, these products have a shorter shelf life.

  • Over 70% of stores are located in cities where the standard of living is middle class (Tier 2). Both areas with a higher standard of living (Tier 1) and those with lower purchasing power (Tier 3) are represented in smaller, yet very similar, proportions at 15.4% and 13.1%, respectively.

    Marketing efforts and product selections should match the local purchasing power in each zone. This strategy aims to boost sales of products genuinely appealing to the local demographic. It is recommended that Key Performance Indicators (KPIs) specific to each Store Type be developed to effectively measure these efforts.

    This approach could also benefit from "context proxies": derived variables that indirectly represent a quantity that is hard to measure directly. For instance, the average sales per store could indirectly reflect each location's capacity to generate profit, encompassing both the store itself and its surrounding area (e.g., the area's socio-economic status).

  • An interesting finding is that despite the majority of products sold being low in sugar, all sugar content categories generate very similar revenues. In other words, the fact that over 50% of products are low in sugar does not translate to higher sales compared to products with a regular sugar level or sugar-free products, at least when interpreting the median values.

    This could mean that products with a regular sugar level might sell well, perhaps even better than low-sugar options. This is because customers actively seek these products, even if they are not the predominant items offered in stores.

  • Finally, it would be beneficial to begin recording data on the seasonal sales peaks for each product category. The aim is to forecast sales more accurately based on this information. Additionally, a marketing strategy that includes discounts and promotions for these products at specific times of the year could be developed.
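The "context proxy" idea mentioned above (average sales per store as a stand-in for each location's earning capacity) can be sketched with pandas; the rows below are toy data, not the SuperKart dataset:

```python
import pandas as pd

# Toy product-store sales rows; the real project derives these from the SuperKart data
sales = pd.DataFrame({
    "Store_Id": ["OUT001", "OUT001", "OUT002", "OUT002", "OUT003"],
    "Product_Store_Sales_Total": [4200.0, 4800.0, 1700.0, 1900.0, 3100.0],
})

# Average sales per store: an indirect proxy for each location's earning capacity
store_avg = (sales.groupby("Store_Id")["Product_Store_Sales_Total"]
             .mean()
             .rename("Store_Avg_Sales"))

# Attach the proxy back to each row as a candidate modeling feature
sales = sales.merge(store_avg.reset_index(), on="Store_Id")
```

When using such a proxy for forecasting, it should be computed on training data only and merged into validation/test rows afterwards, to avoid leaking the target into the features.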


Josue Quiros Batista¶