Forecasting Sales Revenue¶
Machine Learning: Random Forest Regressor, Gradient Boosting Regressor¶
+ RandomizedSearchCV + HuggingFace + joblib + Docker + Flask + Streamlit¶
Keys: #EDA #HuggingFace #RandomForestRegressor #GradientBoostingRegressor #Flask #Streamlit #JobLib #jsonify #Docker
Josue Quiros Batista¶
- Overview of the Dataset
- Exploratory Data Analysis (EDA)
- Univariate Analysis
- Bivariate Analysis
- Data Pre-processing
- Feature Engineering
- Context proxies
- Data Preparation for modeling
- Model Evaluation Criterion
- Models: Random Forest Regressor and Gradient Boosting Regressor
- Model Performance Improvement - Hyperparameter Tuning
- Model Performance Comparison, Final Model Selection, and Serialization
- Actionable Insights and Business Recommendations derived from this model
Business Context¶
A sales forecast is a prediction of future sales revenue based on historical data, industry trends, and the status of the current sales pipeline. Businesses use the sales forecast to estimate weekly, monthly, quarterly, and annual sales totals. A company needs to make an accurate sales forecast as it adds value across an organization and helps the different verticals to chalk out their future course of action.
Forecasting helps an organization plan its sales operations by region and provides valuable insights to the supply chain team regarding the procurement of goods and materials. An accurate sales forecast process has many benefits which include improved decision-making about the future and reduction of sales pipeline and forecast risks. Moreover, it helps to reduce the time spent in planning territory coverage and establish benchmarks that can be used to assess trends in the future.
Objective¶
SuperKart is a retail chain operating supermarkets and food marts across various tier cities, offering a wide range of products. To optimize its inventory management and make informed decisions around regional sales strategies, SuperKart wants to accurately forecast the sales revenue of its outlets for the upcoming quarter.
To operationalize these insights at scale, the company has partnered with a data science firm—not just to build a predictive model based on historical sales data, but to develop and deploy a robust forecasting solution that can be integrated into SuperKart’s decision-making systems and used across its network of stores.
Actionable Insights and Business Recommendations derived from this model¶
Insights from the analysis conducted¶
Interpretation of metrics
- RMSE (Root Mean Squared Error): on average, predictions deviate by 621 USD from the actual sales.
- MAE (Mean Absolute Error): on average, the absolute difference between predicted and actual sales is 489.71 USD.
- R-squared: about 66.5% of the variance in sales is explained by the model.
- MAPE (Mean Absolute Percentage Error): on average, the predictions are off by 16.7% of the actual sales value.
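The four metrics quoted above can be computed with scikit-learn; a minimal sketch, where `y_true` and `y_pred` are placeholders for the test-set actuals and the model's predictions (the notebook's own variable names may differ):

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_absolute_percentage_error,
                             mean_squared_error, r2_score)

def report_metrics(y_true, y_pred):
    """Compute and print RMSE, MAE, R-squared, and MAPE for a regression model."""
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    mape = mean_absolute_percentage_error(y_true, y_pred) * 100
    print(f"RMSE: {rmse:.2f} USD | MAE: {mae:.2f} USD | R2: {r2:.3f} | MAPE: {mape:.1f}%")
    return rmse, mae, r2, mape
```

It would be called as, for example, `report_metrics(y_test, model.predict(X_test))` at evaluation time.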
Evaluation
- Product_Id: FD1961
- Weight: 9.45
- Sugar content: Low Sugar
- Allocated area: 0.047
- Type: Snack foods
- MRP: 95.95
- Sold in a Food Mart
- Product_Store_Sales_Total = 1684.82 USD
Based on the trained model:
The actual sales of product FD1961 in OUT002 were 1,684.82 USD
The forecasted sales from the selected model were 1,736 USD.
Absolute Error: |1736 - 1684.82| = 51.18 USD
Percentage Error: (51.18 / 1684.82) × 100 ≈ 3.04%
In this case, the error was much lower than average, indicating an excellent performance.
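The error figures for this example can be verified directly:

```python
actual, forecast = 1684.82, 1736.0

# Absolute and percentage error for product FD1961 in OUT002.
abs_error = abs(forecast - actual)    # 51.18 USD
pct_error = abs_error / actual * 100  # ~3.04 %
print(f"Absolute error: {abs_error:.2f} USD | Percentage error: {pct_error:.2f}%")
```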
- With a MAPE of 16.7%, our model generally predicts sales within ±17% of actual values. The R-squared of 66.5% shows it successfully captures many of the trends in the data, though improvements are possible through more detailed analysis, evaluating alternative models, or further fine-tuning hyperparameters. Nevertheless, this level of accuracy is generally sufficient for inventory, purchasing, and marketing decisions for most products.
Business recommendations¶
Fruits and vegetables represent the highest-selling product category, accounting for 14.3% of all products sold and generating the largest revenue at 4,300,833.27 USD. However, due to their seasonal nature, some produce items may not always be consistently available.
It would be beneficial to strengthen the acquisition and sales efforts for the second and third most significant product groups: Snack Foods, which generated 3,988,996.95 USD, and Dairy, with 2,811,918.04 USD. This strategy would reduce reliance on perishable products with unpredictable availability. Additionally, "Snack Foods" should not exclusively imply unhealthy options; rather, healthy alternatives can be offered to health-conscious customers, who represent the identified target demographic for the stores.
The least sold product category is "Seafood," with only 76 units. It is advisable to assess whether to continue offering these products. Due to their nature, they require special in-store placement and specific maintenance, such as refrigeration, which may incur costs that do not justify the return on investment. Furthermore, these products are subject to a shorter expiration period.
Over 70% of stores are located in cities where the standard of living is middle class (Tier 2). Both areas with a higher standard of living (Tier 1) and those with lower purchasing power (Tier 3) are represented in smaller, yet very similar, proportions at 15.4% and 13.1%, respectively.
Marketing efforts and product selections should match the local purchasing power in each zone. This strategy aims to boost sales of products genuinely appealing to the local demographic. It is recommended that Key Performance Indicators (KPIs) specific to each Store Type be developed to effectively measure these efforts.
This approach could also benefit from utilizing "Context Proxies": derived variables that indirectly represent a quantity that is not directly observed. For instance, calculating the average sales per store could indirectly reflect each location's capacity to generate profit, encompassing both the store itself and its surroundings (e.g., the socio-economic status of the surrounding area).
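A context proxy of this kind can be built with a groupby-transform; a minimal sketch on synthetic rows (the column names follow the dataset, the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Store_Id": ["OUT001", "OUT001", "OUT002"],
    "Product_Store_Sales_Total": [4000.0, 3000.0, 2000.0],
})

# Average sales per store, broadcast back to every row: a proxy for each
# location's capacity to generate revenue.
df["Store_Avg_Sales"] = df.groupby("Store_Id")["Product_Store_Sales_Total"].transform("mean")
print(df)
```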
An interesting finding is that despite the majority of products sold being low in sugar, all sugar content categories generate very similar revenues. In other words, the fact that over 50% of products are low in sugar does not translate to higher sales compared to products with a regular sugar level or sugar-free products, at least when interpreting the median values.
This could mean that products with a regular sugar level might sell well, perhaps even better than low-sugar options. This is because customers actively seek these products, even if they are not the predominant items offered in stores.
Finally, it would be beneficial to begin recording data on the seasonal sales peaks for each product category. The aim is to forecast sales more accurately based on this information. Additionally, a marketing strategy that includes discounts and promotions for these products at specific times of the year could be developed.
Libraries used¶
- Numpy
- Pandas
- Scikit-learn
- Matplotlib
- Seaborn
- joblib
- Hugging Face
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import squarify

_data = r"/content/SuperKart.csv"
#_data = r"G:\Mi unidad\Documentos\Data Science\"
_data_original = pd.read_csv(_data)
_data_copy = _data_original.copy()
Data Overview¶
_data_copy.head()
| Product_Id | Product_Weight | Product_Sugar_Content | Product_Allocated_Area | Product_Type | Product_MRP | Store_Id | Store_Establishment_Year | Store_Size | Store_Location_City_Type | Store_Type | Product_Store_Sales_Total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FD6114 | 12.66 | Low Sugar | 0.027 | Frozen Foods | 117.08 | OUT004 | 2009 | Medium | Tier 2 | Supermarket Type2 | 2842.40 |
| 1 | FD7839 | 16.54 | Low Sugar | 0.144 | Dairy | 171.43 | OUT003 | 1999 | Medium | Tier 1 | Departmental Store | 4830.02 |
| 2 | FD5075 | 14.28 | Regular | 0.031 | Canned | 162.08 | OUT001 | 1987 | High | Tier 2 | Supermarket Type1 | 4130.16 |
| 3 | FD8233 | 12.10 | Low Sugar | 0.112 | Baking Goods | 186.31 | OUT001 | 1987 | High | Tier 2 | Supermarket Type1 | 4132.18 |
| 4 | NC1180 | 9.57 | No Sugar | 0.010 | Health and Hygiene | 123.67 | OUT002 | 1998 | Small | Tier 3 | Food Mart | 2279.36 |
Shape¶
print(_data_copy.shape)
print("There are", _data_copy.shape[0], 'rows and', _data_copy.shape[1], "columns.")
(8763, 12)
There are 8763 rows and 12 columns.
Data types¶
_data_copy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8763 entries, 0 to 8762
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   Product_Id                 8763 non-null   object
 1   Product_Weight             8763 non-null   float64
 2   Product_Sugar_Content      8763 non-null   object
 3   Product_Allocated_Area     8763 non-null   float64
 4   Product_Type               8763 non-null   object
 5   Product_MRP                8763 non-null   float64
 6   Store_Id                   8763 non-null   object
 7   Store_Establishment_Year   8763 non-null   int64
 8   Store_Size                 8763 non-null   object
 9   Store_Location_City_Type   8763 non-null   object
 10  Store_Type                 8763 non-null   object
 11  Product_Store_Sales_Total  8763 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 821.7+ KB
Statistical summary¶
_data_copy.describe(include='all')
| Product_Id | Product_Weight | Product_Sugar_Content | Product_Allocated_Area | Product_Type | Product_MRP | Store_Id | Store_Establishment_Year | Store_Size | Store_Location_City_Type | Store_Type | Product_Store_Sales_Total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8763 | 8763.000000 | 8763 | 8763.000000 | 8763 | 8763.000000 | 8763 | 8763.000000 | 8763 | 8763 | 8763 | 8763.000000 |
| unique | 8763 | NaN | 4 | NaN | 16 | NaN | 4 | NaN | 3 | 3 | 4 | NaN |
| top | FD306 | NaN | Low Sugar | NaN | Fruits and Vegetables | NaN | OUT004 | NaN | Medium | Tier 2 | Supermarket Type2 | NaN |
| freq | 1 | NaN | 4885 | NaN | 1249 | NaN | 4676 | NaN | 6025 | 6262 | 4676 | NaN |
| mean | NaN | 12.653792 | NaN | 0.068786 | NaN | 147.032539 | NaN | 2002.032751 | NaN | NaN | NaN | 3464.003640 |
| std | NaN | 2.217320 | NaN | 0.048204 | NaN | 30.694110 | NaN | 8.388381 | NaN | NaN | NaN | 1065.630494 |
| min | NaN | 4.000000 | NaN | 0.004000 | NaN | 31.000000 | NaN | 1987.000000 | NaN | NaN | NaN | 33.000000 |
| 25% | NaN | 11.150000 | NaN | 0.031000 | NaN | 126.160000 | NaN | 1998.000000 | NaN | NaN | NaN | 2761.715000 |
| 50% | NaN | 12.660000 | NaN | 0.056000 | NaN | 146.740000 | NaN | 2009.000000 | NaN | NaN | NaN | 3452.340000 |
| 75% | NaN | 14.180000 | NaN | 0.096000 | NaN | 167.585000 | NaN | 2009.000000 | NaN | NaN | NaN | 4145.165000 |
| max | NaN | 22.000000 | NaN | 0.298000 | NaN | 266.000000 | NaN | 2009.000000 | NaN | NaN | NaN | 8000.000000 |
Observations¶
- The heaviest item weighs 22 lb, while the lightest item weighs 4 lb. The median weight is 12.66 lb.
- The majority of products exhibit a low sugar content, specifically 4,885 out of 8,763 items. There are four distinct categories used to classify sugar content.
- The highest ratio of the allocated display area of each product to the total display area of all the products in a store is 0.29. The lowest is 0.004.
- There are 16 product categories in total. "Fruits and vegetables" is the most frequent category, with a count of 1,249.
- The most expensive product, based on its maximum retail price, is 266 USD. Additionally, 75% of all products do not exceed 167.5 USD. The least expensive product is priced at 31 USD.
- The store with the highest number of registered products is identified by ID OUT004, with 4,676 products.
- The oldest store was established in 1987, while the most recently established store dates back to 2009.
- The majority of stores are categorized as medium-sized. There are three size categories in total.
- Most stores are situated in Tier 2, accounting for 6,262 out of 8,763 stores.
- "Supermarket Type2" is the most frequent store type, representing 4,676 stores.
- The highest total revenue generated by the sale of a particular product in a particular store is 8,000 USD. The median revenue is 3,452.34 USD. The product with the lowest sales at a specific location generated only 33 USD in revenue.
Let's check the count of each unique category in each of the categorical variables
Convert object variables to categorical
_data_copy["Product_Sugar_Content"] = _data_copy["Product_Sugar_Content"].astype("category")
_data_copy["Product_Type"] = _data_copy["Product_Type"].astype("category")
_data_copy["Store_Size"] = _data_copy["Store_Size"].astype("category")
_data_copy["Store_Location_City_Type"] = _data_copy["Store_Location_City_Type"].astype("category")
_data_copy["Store_Type"] = _data_copy["Store_Type"].astype("category")
Observations:¶
Converting the object columns to the category dtype reduces memory usage, speeds up operations, and improves compatibility with modeling libraries.
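The memory saving from the category conversion can be checked with `memory_usage`; a small illustration on synthetic, repetitive labels like those in `Product_Sugar_Content`:

```python
import pandas as pd

# Repetitive string labels stored as plain object dtype.
s = pd.Series(["Low Sugar", "Regular", "No Sugar"] * 2000)
before = s.memory_usage(deep=True)

# The category dtype stores each label once plus small integer codes per row.
after = s.astype("category").memory_usage(deep=True)
print(f"object: {before} bytes -> category: {after} bytes")
```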
Check the count of instances on each category
_cat = _data_copy.select_dtypes(include=['category']).columns.tolist()
for column in _cat:
print(_data_copy[column].value_counts())
print("-" * 25)
Product_Sugar_Content
Low Sugar    4885
Regular      2251
No Sugar     1519
reg           108
Name: count, dtype: int64
-------------------------
Product_Type
Fruits and Vegetables    1249
Snack Foods              1149
Frozen Foods              811
Dairy                     796
Household                 740
Baking Goods              716
Canned                    677
Health and Hygiene        628
Meat                      618
Soft Drinks               519
Breads                    200
Hard Drinks               186
Others                    151
Starchy Foods             141
Breakfast                 106
Seafood                    76
Name: count, dtype: int64
-------------------------
Store_Size
Medium    6025
High      1586
Small     1152
Name: count, dtype: int64
-------------------------
Store_Location_City_Type
Tier 2    6262
Tier 1    1349
Tier 3    1152
Name: count, dtype: int64
-------------------------
Store_Type
Supermarket Type2     4676
Supermarket Type1     1586
Departmental Store    1349
Food Mart             1152
Name: count, dtype: int64
-------------------------
Observations¶
- Among the sugar content categories, 4,885 products are observed to have a low sugar level. Followed by products with a regular sugar level, totaling 2,251 items. 1,519 products are entirely sugar free. It is important to note that the "reg" category should be incorporated into the "regular" sugar content classification.
- The "Fruits and Vegetables" category holds the majority with 1,249 products. The next most frequent categories are "Snack Foods" with 1,149 products and "Frozen Foods" with 811. The least common category is "Seafood," containing only 76 products.
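The reclassification of "reg" noted above can be done by replacing the label before converting to the category dtype; a minimal sketch on a small sample:

```python
import pandas as pd

s = pd.Series(["Low Sugar", "reg", "Regular", "No Sugar"])

# Fold the stray "reg" label into "Regular", then convert to category dtype.
s = s.replace("reg", "Regular").astype("category")
print(s.value_counts())
```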
for _variable in _cat:
print(_data_copy[_variable].unique())
['Low Sugar', 'Regular', 'No Sugar', 'reg']
Categories (4, object): ['Low Sugar', 'No Sugar', 'Regular', 'reg']
['Frozen Foods', 'Dairy', 'Canned', 'Baking Goods', 'Health and Hygiene', ..., 'Soft Drinks', 'Breakfast', 'Others', 'Starchy Foods', 'Seafood']
Length: 16
Categories (16, object): ['Baking Goods', 'Breads', 'Breakfast', 'Canned', ..., 'Seafood', 'Snack Foods',
'Soft Drinks', 'Starchy Foods']
['Medium', 'High', 'Small']
Categories (3, object): ['High', 'Medium', 'Small']
['Tier 2', 'Tier 1', 'Tier 3']
Categories (3, object): ['Tier 1', 'Tier 2', 'Tier 3']
['Supermarket Type2', 'Departmental Store', 'Supermarket Type1', 'Food Mart']
Categories (4, object): ['Departmental Store', 'Food Mart', 'Supermarket Type1', 'Supermarket Type2']
Numerical features
_num_col = _data_copy.select_dtypes(include=np.number).columns.tolist()
for _variable in _num_col:
print(_variable, ": ", len(_data_copy[_variable].unique()))
Product_Weight :  1113
Product_Allocated_Area :  228
Product_MRP :  6100
Store_Establishment_Year :  4
Product_Store_Sales_Total :  8668
Observations¶
- 1,113 distinct weights have been observed.
- There are 6,100 different prices.
Duplicate value check
print("There are",_data_copy.duplicated().sum(), "duplicated rows")
There are 0 duplicated rows
Missing value treatment (if needed)
_null_values = _data_copy.isnull().sum()
print(_null_values[_null_values > 0])
Series([], dtype: int64)
Observations¶
There are no null or missing values. Imputation is not necessary.
Exploratory Data Analysis (EDA)¶
Univariate Analysis¶
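The univariate plots below rely on two helper functions, `labeled_barplot` and `histogram_boxplot`, which are presumably defined earlier in the full notebook. A minimal sketch of what `labeled_barplot` could look like (an assumption, not the notebook's actual implementation):

```python
import matplotlib.pyplot as plt
import seaborn as sns

def labeled_barplot(data, feature):
    """Bar plot of category counts with the count annotated above each bar."""
    counts = data[feature].value_counts()
    ax = sns.barplot(x=counts.index.astype(str), y=counts.values)
    for i, v in enumerate(counts.values):
        ax.text(i, v, str(v), ha="center", va="bottom")
    plt.xticks(rotation=30)
    plt.show()
```

`histogram_boxplot` would similarly stack a histogram and a box plot of the same numeric column in one figure.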
plt.figure(figsize=(22, 12))
_columns = ['Product_Weight','Product_Sugar_Content','Product_MRP',
'Store_Id','Store_Size','Store_Location_City_Type','Store_Type','Product_Store_Sales_Total']
for _cont, _variable in enumerate(_columns):
plt.subplot(4, 4, _cont+1)
sns.histplot(data=_data_copy, x=_variable)
plt.xticks(rotation=30)
plt.tight_layout();
Observations¶
- The weight of each product, the maximum retail price of each product and the total revenue generated by the sale of a particular product in a particular store exhibit a normal distribution.
- This dataset represents products sold in only four stores, which are identified by their respective "ID" feature.
- Additional findings follow.
1. Product_Weight - weight of each product
sns.histplot(data=_data_copy, x='Product_Weight');
Observations¶
- This variable exhibits a normal distribution, readily identifiable by its symmetrical bell shape. Data points are concentrated around a central value, with their frequency decreasing as they deviate from this center. The highest concentration of product weights is observed between 12 lb and 13 lb, representing the most frequent weight range.
2. Product_Sugar_Content - sugar content of each product
labeled_barplot(_data_copy, "Product_Sugar_Content")
_percentage_types = _data_copy['Product_Sugar_Content'].value_counts(normalize=True, sort=False)*100
_percentage_types.round(1)
| proportion | |
|---|---|
| Product_Sugar_Content | |
| Low Sugar | 55.7 |
| No Sugar | 17.3 |
| Regular | 25.7 |
| reg | 1.2 |
Observations¶
- 4,885 products exhibit a low sugar level, constituting 55.7% of all products. Following this, products with a regular sugar level account for 25.7%, totaling 2,251 items. Additionally, 1,519 products are entirely sugar-free, representing 17.3%.
- As previously indicated, the "reg" category within the sugar level feature should be reclassified under the "regular" category.
- This trend suggests a strategic focus among stores to offer products that prioritize customer well-being.
3. Product_Allocated_Area
Ratio of the allocated display area of each product to the total display area of all the products in a store.
histogram_boxplot(_data_copy, "Product_Allocated_Area")
Observations¶
- This feature exhibits a right-skewed distribution, indicating that the highest concentration of products occupies a small allocated display area, ranging between 0 and 0.06. As the allocated area increases, the number of products decreases.
- A considerable number of products thus occupy a very small proportion of the total display area.
- Outlier values do not appear to be concerning; they are interpreted as naturally occurring due to the aforementioned skewness.
4. Product_Type
labeled_barplot(_data_copy, "Product_Type")
_percentage_types = _data_copy['Product_Type'].value_counts(normalize=True, sort=False)*100
_percentage_types.round(1)
| proportion | |
|---|---|
| Product_Type | |
| Baking Goods | 8.2 |
| Breads | 2.3 |
| Breakfast | 1.2 |
| Canned | 7.7 |
| Dairy | 9.1 |
| Frozen Foods | 9.3 |
| Fruits and Vegetables | 14.3 |
| Hard Drinks | 2.1 |
| Health and Hygiene | 7.2 |
| Household | 8.4 |
| Meat | 7.1 |
| Others | 1.7 |
| Seafood | 0.9 |
| Snack Foods | 13.1 |
| Soft Drinks | 5.9 |
| Starchy Foods | 1.6 |
Observations¶
- As previously observed, fruits and vegetables represent the dominant product category, accounting for 14.3% of all products sold (1,249 items). This reinforces the store chain's commitment to offering products that promote customer well-being. However, snacks constitute the second-highest selling category, representing 13.1% of products, and it is currently unknown whether these are healthy options.
- Other categories such as "Breads," "Breakfast," "Hard Drinks," "Seafood," and "Starchy Foods" each represent less than 5% of the total products sold.
5. Product_MRP
Maximum retail price of each product
histogram_boxplot(_data_copy, "Product_MRP")
Observations¶
- The variable representing the maximum retail price of each product also exhibits a normal distribution. This is identifiable by its symmetrical bell shape, with data points clustered around a central value and decreasing in frequency as they move away from the center. The strongest concentration of prices is observed around 145 USD - 150 USD, which represents the most frequent price range.
6. Store_Id
labeled_barplot(_data_copy, "Store_Id")
_percentage_types = _data_copy['Store_Id'].value_counts(normalize=True, sort=False)*100
_percentage_types.round(1)
| proportion | |
|---|---|
| Store_Id | |
| OUT004 | 53.4 |
| OUT003 | 15.4 |
| OUT001 | 18.1 |
| OUT002 | 13.1 |
Observations¶
- As previously mentioned, this dataset records products sold in only four retail outlets, identified by IDs OUT001, OUT002, OUT003, and OUT004.
- OUT004 has the highest number of registered products, totaling 4,676 items, which represents 53.4% of the dataset. Conversely, OUT002 has the fewest registered products, accounting for 13.1% of the products (1,152 items).
7. Store_Establishment_Year
Year in which the store was established
labeled_barplot(_data_copy, "Store_Establishment_Year")
_percentage_types = _data_copy['Store_Establishment_Year'].value_counts(normalize=True, sort=False)*100
_percentage_types.round(1)
| proportion | |
|---|---|
| Store_Establishment_Year | |
| 2009 | 53.4 |
| 1999 | 15.4 |
| 1987 | 18.1 |
| 1998 | 13.1 |
Observations¶
- More than 50% of the products in the dataset are sold in the store established in 2009, accounting for 4,676 products to be precise.
8. Store_Size
Size of the store depending on sq. feet
labeled_barplot(_data_copy, "Store_Size")
_percentage_types = _data_copy['Store_Size'].value_counts(normalize=True, sort=False)*100
_percentage_types.round(1)
| proportion | |
|---|---|
| Store_Size | |
| High | 18.1 |
| Medium | 68.8 |
| Small | 13.1 |
Observations¶
- Nearly 70% of the stores are medium-sized, while 18.1% are classified as large stores.
9. Store_Location_City_Type
Type of city in which the store is located
labeled_barplot(_data_copy, "Store_Location_City_Type")
_percentage_types = _data_copy['Store_Location_City_Type'].value_counts(normalize=True, sort=False)*100
_percentage_types.round(1)
| proportion | |
|---|---|
| Store_Location_City_Type | |
| Tier 1 | 15.4 |
| Tier 2 | 71.5 |
| Tier 3 | 13.1 |
Observations¶
- Over 70% of the stores are located in cities where the standard of living is middle class (Tier 2). Both areas with a higher standard of living (Tier 1) and those with lower purchasing power (Tier 3) are represented in smaller, yet very similar, proportions at 15.4% and 13.1%, respectively.
10. Store_Type
labeled_barplot(_data_copy, "Store_Type")
_percentage_types = _data_copy['Store_Type'].value_counts(normalize=True, sort=False)*100
_percentage_types.round(1)
| proportion | |
|---|---|
| Store_Type | |
| Departmental Store | 15.4 |
| Food Mart | 13.1 |
| Supermarket Type1 | 18.1 |
| Supermarket Type2 | 53.4 |
Observations¶
- More than half of the stores are classified as "Supermarket Type2." The other categories, "Supermarket Type1," "Departmental Store," and "Food Mart," are represented in smaller proportions: 18.1%, 15.4%, and 13.1%, respectively.
Bivariate Analysis¶
Correlation
import warnings
warnings.filterwarnings('ignore')
plt.figure(figsize=(6,4))
sns.heatmap(_data_copy[_num_col].corr(),annot=True,cmap='Spectral',vmin=-1,vmax=1)
plt.show()
Observations¶
- A positive correlation is observed between the maximum retail price and the weight of the product. This suggests that a heavier product tends to have a higher price.
- Product weight also correlates with total sales, which is logical given the established relationship between product price and weight.
- Similarly, and as anticipated, the product price exhibits a correlational effect on total sales.
1. Product weight versus total product sales
plt.figure(figsize=(8, 6))
sns.kdeplot(x='Product_Weight', y='Product_Store_Sales_Total', data=_data_copy, fill=True, cmap='viridis', cbar=True)
plt.title('Density plot - Product weight & Total product sales')
plt.xlabel('Product_Weight')
plt.ylabel('Product_Store_Sales_Total')
plt.grid(True)
plt.show()
Observations¶
- As noted previously, the graph also displays a positive correlation. This density plot not only indicates a relationship between the variables but also highlights where this relationship is strongest. Lighter colors at the center of the plot signify a higher density of data points, illustrating where the relationship between product weight and sales volume is most concentrated. For instance, a considerable concentration of products is observed within an approximate weight range of 12 lb to 14 lb, generating sales between 3,500 USD and 4,000 USD. These represent the most frequent characteristics within the dataset. This observation reinforces the statistical summary, which indicated that 50% of products weigh 12.66 lb and 75% do not exceed 14.18 lb. Similarly, the summary showed that the median product generates up to 3,452.34 USD, and 75% do not exceed 4,145.16 USD.
- Conversely, as the colors darken, the data density decreases, meaning there are fewer products with the weight and sales characteristics represented by each quadrant of the graph's grid.
2. Sugar content versus total product sales
sns.boxplot(x="Product_Sugar_Content", y="Product_Store_Sales_Total", data=_data_copy);
Observations¶
- An interesting finding is that despite the majority of products sold being low in sugar, all sugar content categories generate very similar revenues. In other words, the fact that over 50% of products are low in sugar does not translate to higher sales compared to products with a regular sugar level or sugar-free products, at least when interpreting the median values.
3. Allocated area versus total product sales
plt.figure(figsize=(8, 6))
sns.kdeplot(x='Product_Allocated_Area', y='Product_Store_Sales_Total', data=_data_copy, fill=True, cbar=True)
plt.title('Density plot - Allocated area & Total product sales')
plt.xlabel('Product_Allocated_Area')
plt.ylabel('Product_Store_Sales_Total')
plt.grid(True)
plt.show()
Observations¶
- There is no clear linear correlation observed between the product's allocated display area and its sales. The data points are dispersed without a distinct trend, indicating that a larger allocated area does not guarantee higher sales for a specific product, nor does a smaller area imply lower sales.
- Additionally, the majority of points are concentrated on the left side of the graph, specifically between 0.00 and approximately 0.18. This observation is consistent with the results from the histogram in the preceding section's univariable analysis.
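The visual impression of "no linear relationship" can be quantified with the Pearson coefficient. A minimal sketch on synthetic stand-in data (independent draws whose ranges mimic the dataset's summary statistics), since in the notebook one would simply call `.corr()` on the two real columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-ins: allocated area drawn independently of sales.
demo = pd.DataFrame({
    "Product_Allocated_Area": rng.uniform(0.004, 0.298, 500),
    "Product_Store_Sales_Total": rng.normal(3464, 1065, 500),
})
r = demo["Product_Allocated_Area"].corr(demo["Product_Store_Sales_Total"])
print(f"Pearson r = {r:.3f}")  # a value near 0 supports the "no linear trend" reading
```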
4. Product type versus total product sales
# Group by product type and sum up the total sales
_revenue_x_type = _data_copy.groupby(["Product_Type"], as_index=False)["Product_Store_Sales_Total"].sum()
colors = [plt.cm.Spectral(i / float(len(_revenue_x_type['Product_Type']))) for i in range(len(_revenue_x_type['Product_Type']))]
plt.figure(figsize=[12, 8])
squarify.plot(sizes=_revenue_x_type['Product_Store_Sales_Total'],
label=_revenue_x_type['Product_Type'],
color=colors,
alpha=.8,
value=_revenue_x_type['Product_Store_Sales_Total'])
plt.title('Total revenue per product type')
plt.axis('off')
plt.show()
_revenue_x_type.sort_values(by='Product_Store_Sales_Total', ascending=False)
| Product_Type | Product_Store_Sales_Total | |
|---|---|---|
| 6 | Fruits and Vegetables | 4300833.27 |
| 13 | Snack Foods | 3988996.95 |
| 4 | Dairy | 2811918.04 |
| 5 | Frozen Foods | 2809980.83 |
| 9 | Household | 2564740.17 |
| 0 | Baking Goods | 2452986.00 |
| 3 | Canned | 2300082.71 |
| 8 | Health and Hygiene | 2163707.21 |
| 10 | Meat | 2129211.94 |
| 14 | Soft Drinks | 1797044.72 |
| 1 | Breads | 714942.24 |
| 7 | Hard Drinks | 625814.62 |
| 11 | Others | 541496.30 |
| 15 | Starchy Foods | 518774.45 |
| 2 | Breakfast | 362130.41 |
| 12 | Seafood | 272404.04 |
Observations¶
- The product type generating the highest sales is Fruits and Vegetables, with a total of 4,300,833.27 USD. This is followed by Snack Foods, which generated 3,988,996.95 USD, and Dairy, with 2,811,918.04 USD.
5. Maximum retail price (MRP) of each product versus total product sales
plt.figure(figsize=(8, 6))
sns.kdeplot(x='Product_MRP', y='Product_Store_Sales_Total', data=_data_copy, fill=True, cmap='viridis', cbar=True)
plt.title('Density plot - Product MRP & Total product sales')
plt.xlabel('Product_MRP')
plt.ylabel('Product_Store_Sales_Total')
plt.grid(True)
plt.show()
sns.scatterplot(data=_data_copy, x='Product_MRP', y='Product_Store_Sales_Total')
plt.show()
Observations¶
- The relationship between these variables exhibits a positive correlation. A significant concentration of products is observed within an approximate price range of 140 USD - 160 USD, generating sales between 3,000 USD and 4,000 USD. These figures represent the most frequent characteristics within the dataset.
- This observation reinforces the statistical summary, which indicated that 50% of products are priced at 146.74 USD, and 75% do not exceed 167.58 USD. Similarly, the summary showed that the median product generates up to 3,452.34 USD, with 75% not exceeding 4,145.16 USD.
- Conversely, darker colors indicate a decrease in data density, signifying fewer products with the maximum price and sales characteristics represented in those areas of the graph.
6. Store ID versus total product sales
# Group by Store_Id and sum up the total sales
_revenue_x_id = _data_copy.groupby(["Store_Id"], as_index=False)["Product_Store_Sales_Total"].sum()
colors = [plt.cm.Spectral(i / float(len(_revenue_x_id['Store_Id']))) for i in range(len(_revenue_x_id['Store_Id']))]
plt.figure(figsize=[8, 4])
squarify.plot(sizes=_revenue_x_id['Product_Store_Sales_Total'],
label=_revenue_x_id['Store_Id'],
color=colors,
alpha=.8,
value=_revenue_x_id['Product_Store_Sales_Total'])
plt.title('Total revenue per store ID')
plt.axis('off')
plt.show()
_revenue_x_id.sort_values(by='Product_Store_Sales_Total', ascending=False)
| Store_Id | Product_Store_Sales_Total | |
|---|---|---|
| 3 | OUT004 | 15427583.43 |
| 2 | OUT003 | 6673457.57 |
| 0 | OUT001 | 6223113.18 |
| 1 | OUT002 | 2030909.72 |
Observations¶
- The store with the highest total sales is OUT004, achieving 15,427,583.43 USD.
- It is confirmed that the store with the lowest total sales is OUT002, with only 2,030,909.72 USD.
7. Store Establishment Year versus total product sales
# Group by Store_Establishment_Year and sum up the total sales
_revenue_x_year = _data_copy.groupby(["Store_Establishment_Year"], as_index=False)["Product_Store_Sales_Total"].sum()
colors = [plt.cm.Spectral(i / float(len(_revenue_x_year['Store_Establishment_Year']))) for i in range(len(_revenue_x_year['Store_Establishment_Year']))]
plt.figure(figsize=[8, 4])
squarify.plot(sizes=_revenue_x_year['Product_Store_Sales_Total'],
label=_revenue_x_year['Store_Establishment_Year'],
color=colors,
alpha=.8,
value=_revenue_x_year['Product_Store_Sales_Total'])
plt.title('Total revenue per store year of establishment')
plt.axis('off')
plt.show()
_revenue_x_year.sort_values(by='Product_Store_Sales_Total', ascending=False)
| Store_Establishment_Year | Product_Store_Sales_Total | |
|---|---|---|
| 3 | 2009 | 15427583.43 |
| 2 | 1999 | 6673457.57 |
| 0 | 1987 | 6223113.18 |
| 1 | 1998 | 2030909.72 |
Observations¶
- The store established in 2009 generates the highest total sales, amounting to 15,427,583.43 USD.
- The store inaugurated in 1998, while not the oldest, has the lowest total sales, with 2,030,909.72 USD.
8. Store size versus total product sales
sns.boxplot(x="Store_Size", y="Product_Store_Sales_Total", data=_data_copy);
summary_stats = _data_copy.groupby('Store_Size')['Product_Store_Sales_Total'].agg(
mean='mean',
Q1=lambda x: x.quantile(0.25),
median='median',
Q3=lambda x: x.quantile(0.75),
min='min',
max='max'
).reset_index()
print(summary_stats)
  Store_Size         mean        Q1    median        Q3      min      max
0       High  3923.778802  3285.5100  4139.645  4639.4000  2300.56  4997.63
1     Medium  3668.222573  3060.3800  3511.100  3969.8100  1561.06  8000.00
2      Small  1762.942465  1495.4725  1889.495  2133.6225    33.00  2299.63
Observations¶
- As anticipated, store size significantly influences product sales. Larger stores tend to generate higher median sales than medium-sized and small-sized stores.
- Small stores consistently exhibit the lowest sales performance, likely due to factors such as a more limited inventory and reduced physical space.
9. Store location versus total product sales
# Group by Store_Location_City_Type and sum up the total sales
_revenue_x_location = _data_copy.groupby(["Store_Location_City_Type"], as_index=False)["Product_Store_Sales_Total"].sum()
colors = [plt.cm.Spectral(i / float(len(_revenue_x_location['Store_Location_City_Type']))) for i in range(len(_revenue_x_location['Store_Location_City_Type']))]
plt.figure(figsize=[8, 4])
squarify.plot(sizes=_revenue_x_location['Product_Store_Sales_Total'],
label=_revenue_x_location['Store_Location_City_Type'],
color=colors,
alpha=.8,
value=_revenue_x_location['Product_Store_Sales_Total'])
plt.title('Total revenue per store location')
plt.axis('off')
plt.show()
_revenue_x_location.sort_values(by='Product_Store_Sales_Total', ascending=False)
| Store_Location_City_Type | Product_Store_Sales_Total | |
|---|---|---|
| 1 | Tier 2 | 21650696.61 |
| 0 | Tier 1 | 6673457.57 |
| 2 | Tier 3 | 2030909.72 |
Observations¶
- Tier 2 represents the location with the highest total sales per product, significantly surpassing both Tier 1 and Tier 3 locations.
10. Store type versus total product sales
# Group by Store_Type and sum up the total sales
_revenue_store_type = _data_copy.groupby(["Store_Type"], as_index=False)["Product_Store_Sales_Total"].sum()
colors = [plt.cm.Spectral(i / float(len(_revenue_store_type['Store_Type']))) for i in range(len(_revenue_store_type['Store_Type']))]
plt.figure(figsize=[8, 4])
squarify.plot(sizes=_revenue_store_type['Product_Store_Sales_Total'],
label=_revenue_store_type['Store_Type'],
color=colors,
alpha=.8,
value=_revenue_store_type['Product_Store_Sales_Total'])
plt.title('Total revenue per store type')
plt.axis('off')
plt.show()
_revenue_store_type.sort_values(by='Product_Store_Sales_Total', ascending=False)
| Store_Type | Product_Store_Sales_Total | |
|---|---|---|
| 3 | Supermarket Type2 | 15427583.43 |
| 0 | Departmental Store | 6673457.57 |
| 2 | Supermarket Type1 | 6223113.18 |
| 1 | Food Mart | 2030909.72 |
Observations¶
- As previously observed, the "Supermarket Type2" store type accounts for the highest product sales, specifically totaling 15,427,583.43 USD.
11. Store ID versus Store establishment year
plt.figure(figsize=(6, 3))
sns.pointplot(x='Store_Id', y='Store_Establishment_Year', data=_data_copy, linestyle='none', errorbar=None, marker='o')
plt.xlabel('Store ID')
plt.ylabel('Establishment Year')
plt.title('Establishment Year - Store')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y')
plt.tight_layout()
plt.show()
Observations¶
- The oldest store is OUT001, established in 1987. The newest store, inaugurated in 2009, is OUT004.
revenue_by_store_product_type = _data_copy.groupby(['Store_Id', 'Product_Type'])['Product_Store_Sales_Total'].sum().reset_index()
stores_group1 = ['OUT001','OUT002']
stores_group2 = ['OUT003','OUT004']
df_group1 = revenue_by_store_product_type[revenue_by_store_product_type['Store_Id'].isin(stores_group1)]
df_group2 = revenue_by_store_product_type[revenue_by_store_product_type['Store_Id'].isin(stores_group2)]
# First plot
plt.figure(figsize=(16, 8))
ax1 = sns.barplot(
data=df_group1, x='Store_Id', y='Product_Store_Sales_Total',hue='Product_Type',palette='tab20'
)
plt.title('Total product sales per store (OUT001 and OUT002)')
plt.xlabel('Store ID')
plt.ylabel('Product_Store_Sales_Total')
plt.xticks(rotation=45, ha='right')
plt.legend(title='Product type', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
for p in ax1.patches:
height = p.get_height()
if height > 50:
ax1.annotate(f'{height:.0f}',
(p.get_x() + p.get_width() / 2., height),
ha='center', va='center',
xytext=(0, 5), textcoords='offset points',
fontsize=7, color='black')
plt.show()
# Second plot
plt.figure(figsize=(16, 8))
ax2 = sns.barplot(
data=df_group2, x='Store_Id', y='Product_Store_Sales_Total',hue='Product_Type', palette='tab20'
)
plt.title('Total product sales per store (OUT003 and OUT004)')
plt.xlabel('Store ID')
plt.ylabel('Product_Store_Sales_Total')
plt.xticks(rotation=45, ha='right')
plt.legend(title='Product type', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
for p in ax2.patches:
height = p.get_height()
if height > 50:
ax2.annotate(f'{height:.0f}',
(p.get_x() + p.get_width() / 2., height),
ha='center', va='center',
xytext=(0, 5), textcoords='offset points',
fontsize=7, color='black')
plt.show()
Observations¶
At Store ID OUT001, the highest-selling product categories are "Fruits and Vegetables" and "Snack Foods", generating total sales of 792,993 USD and 806,142 USD respectively. The lowest-selling category is "Breakfast," with total sales of 38,161 USD. The second-lowest selling category is "Seafood," with total sales of 52,937 USD.
Similarly, for Store ID OUT002, "Fruits and Vegetables" and "Snack Foods" are the predominant categories. In this store, the least sold category is "Seafood," with sales totaling 17,663 USD.
For both Store ID OUT003 and OUT004, "Fruits and Vegetables" and "Snack Foods" also consistently emerge as the top-selling categories. Notably, Store OUT004 records the highest sales within these categories, with 2,311,900 USD from "Fruits and Vegetables" and 2,009,027 USD from "Snack Foods." In both OUT003 and OUT004, "Seafood" consistently represents the lowest-selling category.
Data Preprocessing¶
Feature engineering¶
Outlier check
Q1 = _data_copy.select_dtypes(include=["float64", "int64"]).quantile(0.25)
Q3 = _data_copy.select_dtypes(include=["float64", "int64"]).quantile(0.75)
# Inter Quantile Range
IQR = Q3 - Q1
lower = (Q1 - 1.5 * IQR) #Values outside these bounds are outliers
upper = (Q3 + 1.5 * IQR) #Values outside these bounds are outliers
((_data_copy.select_dtypes(include=["float64", "int64"]) < lower)|(_data_copy.select_dtypes(include=["float64", "int64"]) > upper)).sum() / len(_data_copy) * 100
| 0 | |
|---|---|
| Product_Weight | 0.616227 |
| Product_Allocated_Area | 1.186808 |
| Product_MRP | 0.650462 |
| Store_Establishment_Year | 0.000000 |
| Product_Store_Sales_Total | 1.357982 |
Observations¶
- The outliers are not significant, and therefore, no action will be taken.
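No action is taken here, but if the outliers ever needed treatment, a common alternative to dropping rows is to cap values at the same Tukey fences used above. A minimal sketch (hypothetical helper, not part of the notebook):

```python
import pandas as pd

def iqr_cap(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# The extreme value 100 is pulled back to the upper fence (Q3 + 1.5*IQR = 7.0)
print(iqr_cap(pd.Series([1, 2, 3, 4, 100])).max())  # 7.0
```

Capping preserves the row (and its other features) while limiting the leverage of the extreme value, which matters more for linear models than for the tree ensembles used here.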
Variable: Product_Sugar_Content. Unify "reg" with "Regular"
_data_copy['Product_Sugar_Content'] = _data_copy['Product_Sugar_Content'].replace("reg", "Regular")
_data_copy['Product_Sugar_Content'].unique()
['Low Sugar', 'Regular', 'No Sugar'] Categories (3, object): ['Low Sugar', 'No Sugar', 'Regular']
Combine variables to capture nonlinear or proportional relationships
_data_copy['_price_per_Gram'] = _data_copy['Product_MRP'] / _data_copy['Product_Weight']
_data_copy['_weight_per_area'] = _data_copy['Product_Weight'] / _data_copy['Product_Allocated_Area']
Observations¶
- The objective of creating new variables through their combination is to enable the model to learn non-linear relationships, thereby facilitating the identification of implicit patterns.
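One caveat with ratio features: a zero denominator silently produces `inf`, which most estimators reject. The dataset may contain no zero allocated areas, but a defensive variant could replace zeros with NaN before dividing (the zero below is contrived to illustrate the guard):

```python
import numpy as np
import pandas as pd

# Toy rows; the zero area is contrived to show the guard in action
df = pd.DataFrame({
    "Product_MRP": [117.08, 171.43],
    "Product_Weight": [12.66, 16.54],
    "Product_Allocated_Area": [0.027, 0.0],
})
df["_price_per_Gram"] = df["Product_MRP"] / df["Product_Weight"]
# Replace a zero denominator with NaN so the ratio does not become inf
df["_weight_per_area"] = df["Product_Weight"] / df["Product_Allocated_Area"].replace(0, np.nan)
print(df["_weight_per_area"].isna().tolist())  # [False, True]
```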
Context proxies
_avg_sales = _data_copy.groupby("Store_Id")["Product_Store_Sales_Total"].mean().reset_index()
_avg_sales.rename(columns={"Product_Store_Sales_Total": "_avg_sales_x_store"}, inplace=True)
_data_copy = _data_copy.merge(_avg_sales, on="Store_Id", how="left")
Observations¶
- The objective is to use a derived variable to indirectly represent an associated, unobserved factor. For example, the average sales per store can indirectly reflect each location's capacity to generate revenue, capturing both the store itself and its surroundings (e.g., the socio-economic status of the surrounding area).
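One caution with this proxy: it is derived from the target variable, so computing it on the full dataset before splitting lets information from validation/test rows leak into training. A leakage-safe variant fits the store averages on the training rows only and then merges them into held-out rows (toy data below, not the notebook's frames):

```python
import pandas as pd

# Toy training and validation frames
train = pd.DataFrame({"Store_Id": ["OUT001", "OUT001", "OUT002"],
                      "Product_Store_Sales_Total": [100.0, 300.0, 50.0]})
val = pd.DataFrame({"Store_Id": ["OUT001", "OUT002"]})

# Fit the proxy on training rows only, then apply it to held-out rows
proxy = (train.groupby("Store_Id")["Product_Store_Sales_Total"]
              .mean().rename("_avg_sales_x_store"))
val = val.merge(proxy, on="Store_Id", how="left")
print(val["_avg_sales_x_store"].tolist())  # [200.0, 50.0]
```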
Drop unnecessary column
_data_copy.drop(['Product_Id','Store_Id','Store_Establishment_Year'], axis=1, inplace=True)
_data_copy.head()
| Product_Weight | Product_Sugar_Content | Product_Allocated_Area | Product_Type | Product_MRP | Store_Size | Store_Location_City_Type | Store_Type | Product_Store_Sales_Total | _price_per_Gram | _weight_per_area | _avg_sales_x_store | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12.66 | Low Sugar | 0.027 | Frozen Foods | 117.08 | Medium | Tier 2 | Supermarket Type2 | 2842.40 | 9.248025 | 468.888889 | 3299.312111 |
| 1 | 16.54 | Low Sugar | 0.144 | Dairy | 171.43 | Medium | Tier 1 | Departmental Store | 4830.02 | 10.364571 | 114.861111 | 4946.966323 |
| 2 | 14.28 | Regular | 0.031 | Canned | 162.08 | High | Tier 2 | Supermarket Type1 | 4130.16 | 11.350140 | 460.645161 | 3923.778802 |
| 3 | 12.10 | Low Sugar | 0.112 | Baking Goods | 186.31 | High | Tier 2 | Supermarket Type1 | 4132.18 | 15.397521 | 108.035714 | 3923.778802 |
| 4 | 9.57 | No Sugar | 0.010 | Health and Hygiene | 123.67 | Small | Tier 3 | Food Mart | 2279.36 | 12.922675 | 957.000000 | 1762.942465 |
Preprocessing pipeline for encoding categorical features
_categorical_transformer = OneHotEncoder(handle_unknown='ignore')
_pipeline_preprocessor = ColumnTransformer(
transformers=[
('cat', _categorical_transformer, _cat)
]
)
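Note that `ColumnTransformer` defaults to `remainder='drop'`, which discards every column not listed in a transformer. If the numeric features (weights, prices, the engineered ratios) are meant to reach the model, `remainder='passthrough'` keeps them. A minimal sketch with an assumed one-column `_cat` list:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

_cat = ["Store_Size"]  # assumed categorical column list for this sketch
pre = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), _cat)],
    remainder="passthrough",  # keep the numeric columns instead of dropping them
)
X = pd.DataFrame({"Store_Size": ["Small", "High"], "Product_MRP": [10.0, 20.0]})
out = pre.fit_transform(X)
print(out.shape)  # (2, 3): two one-hot columns plus the passthrough numeric
```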
Additional observations on feature engineering¶
- Imputation is not necessary as there are no missing values in the dataset.
- The models to be implemented, Random Forest and Gradient Boosting, do not require feature scaling.
- With "Product_MRP" and "Product_Store_Sales_Total" displaying normal (symmetrical) distributions, additional transformations to mitigate bias or extreme outliers are unnecessary.
- Overall, the dataset is clean and well-organized, thus extensive feature engineering tasks are not required.
Data splitting
X = _data_copy.drop(["Product_Store_Sales_Total"], axis=1)
y = _data_copy["Product_Store_Sales_Total"]
# Splitting data into training, validation and test set:
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=1
)
# Split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=1
)
print(X_train.shape, X_val.shape, X_test.shape)
(5257, 11) (1753, 11) (1753, 11)
print("Shape of training: ", X_train.shape)
print("Shape of validation: ", X_val.shape)
print("Shape of test: ", X_test.shape)
Shape of training: (5257, 11) Shape of validation: (1753, 11) Shape of test: (1753, 11)
Model Building¶
Metric¶
Problem Type and Objective
The problem type is regression, with the objective of forecasting the Product_Store_Sales_Total variable. The aim is to quantify the difference between the values generated by the model and the actual sales figures.
Key Metric: RMSE: the primary metric of focus will be RMSE (Root Mean Squared Error). This metric is chosen to penalize errors with larger deviations, thereby mitigating errors that could significantly impact the prediction of high-profit products. In other words, not all sales contribute equally to business value.
Because the outliers detected earlier are minimal (under 1.4% per variable), they are unlikely to distort the RMSE here, even though RMSE is in general sensitive to large errors.
RMSE will indicate the average magnitude of difference between the predictions and the actual values.
Complementary Metric: R-Squared. The coefficient of determination will be used to identify the percentage of sales variation explained by the implemented models.
Models: Random Forest Regressor and Gradient Boosting Regressor¶
Tasks¶
Fit: training and validation
- Pipeline: make_pipeline (Preprocessing pipeline for encoding categorical features, model), then fit.
Predict: pipeline, X_train, y_train
Compute metrics: r2 score, adjr2, RMSE, MAE, MAPE.
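The metric helper called throughout (`model_performance_regression`) is not shown in this section; a plausible sketch consistent with the reported columns, under the assumption that it returns a one-row DataFrame, might look like:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (mean_absolute_error, mean_absolute_percentage_error,
                             mean_squared_error, r2_score)

def model_performance_regression(model, X, y):
    """Return a one-row DataFrame with RMSE, MAE, R2, adjusted R2 and MAPE."""
    pred = model.predict(X)
    n, k = X.shape
    r2 = r2_score(y, pred)
    return pd.DataFrame({
        "RMSE": [np.sqrt(mean_squared_error(y, pred))],
        "MAE": [mean_absolute_error(y, pred)],
        "R-squared": [r2],
        "Adj. R-squared": [1 - (1 - r2) * (n - 1) / (n - k - 1)],
        "MAPE": [mean_absolute_percentage_error(y, pred)],
    })

# Tiny smoke test: a stand-in model that predicts its single input feature exactly
class _Identity:
    def predict(self, X):
        return X.iloc[:, 0].to_numpy()

perf = model_performance_regression(_Identity(),
                                    pd.DataFrame({"f": [1.0, 2.0, 4.0]}),
                                    pd.Series([1.0, 2.0, 4.0]))
print(perf["RMSE"].iloc[0])  # 0.0 for a perfect fit
```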
_models.append(("Random Forest Regressor - Before tuning", RandomForestRegressor(random_state=1)))
_models.append(("Gradient Boost Regressor - Before tuning", GradientBoostingRegressor(random_state=1)))
Performance - Training
_perf_train
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | Model | |
|---|---|---|---|---|---|---|
| 0 | 599.302247 | 471.734564 | 0.683112 | 0.682447 | 0.166443 | Random Forest Regressor - Before tuning |
| 1 | 602.784108 | 474.329441 | 0.679419 | 0.678747 | 0.167860 | Gradient Boost Regressor - Before tuning |
Performance - Validation
_perf_val
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | Model | |
|---|---|---|---|---|---|---|
| 0 | 589.020035 | 459.391759 | 0.690892 | 0.688939 | 0.188043 | Random Forest Regressor - Before tuning |
| 1 | 576.429627 | 451.981812 | 0.703966 | 0.702095 | 0.186927 | Gradient Boost Regressor - Before tuning |
Observations¶
RMSE (Root Mean Squared Error): This metric quantifies the average magnitude by which predictions deviate from actual values.
- While the difference between the two models isn't substantial, the Gradient Boost model demonstrates a lower RMSE of 576.43 on the validation set, compared to Random Forest's 589.02. Consequently, Gradient Boost exhibits superior predictive accuracy on unseen data.
- There's no concern regarding overfitting.
R-Squared: This metric represents the proportion of variability accounted for by the model.
- The R-Squared value on the validation set is also marginally superior for the Gradient Boost model (0.70 versus 0.69); still, both models do an acceptable job of explaining the data's variability.
Model Performance Improvement - Hyperparameter Tuning¶
Random forest
Hyperparameters to consider¶
n_estimators: [50, 100, 150, 200],
max_depth: [10, 15, 20, None],
min_samples_leaf: [10, 15, 25, 30],
max_features: ['sqrt', 'log2'],
max_samples: [0.3, 0.5, 0.7, 0.9],
criterion: ['squared_error']
RandomizedSearchCV:
- estimator=_pipeline
- Hyperparameters
- jobs = -1
- iterations = 60
- scoring = neg_root_mean_squared_error
- cv = 5
- random state = 1
_tuned_RF = RandomForestRegressor(random_state=1)
_pipeline_RF = make_pipeline(_pipeline_preprocessor, _tuned_RF)
print("Best params:", _rf_randomized_cv.best_params_)
print("Best CV RMSE:", -_rf_randomized_cv.best_score_)
Best params: {'randomforestregressor__n_estimators': 100, 'randomforestregressor__min_samples_leaf': 30, 'randomforestregressor__max_samples': 0.9, 'randomforestregressor__max_features': 'sqrt', 'randomforestregressor__max_depth': 15, 'randomforestregressor__criterion': 'squared_error'}
Best CV RMSE: 609.5711902921707
best_pipeline_rf = _rf_randomized_cv.best_estimator_
best_pipeline_rf
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['Product_Sugar_Content',
                                                   'Product_Type', 'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type'])])),
                ('randomforestregressor',
                 RandomForestRegressor(max_depth=15, max_features='sqrt',
                                       max_samples=0.9, min_samples_leaf=30,
                                       random_state=1))])
_predictions_train_rf = model_performance_regression(best_pipeline_rf, X_train, y_train)
_predictions_train_rf["Model"] = 'Random Forest Regressor - After tuning'
_perf_train = pd.concat([_perf_train,_predictions_train_rf], ignore_index=True)
_predictions_train_rf
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | Model | |
|---|---|---|---|---|---|---|
| 0 | 605.067619 | 476.487114 | 0.676986 | 0.676308 | 0.169244 | Random Forest Regressor - After tuning |
_predictions_val_rf = model_performance_regression(best_pipeline_rf, X_val, y_val)
_predictions_val_rf["Model"] = 'Random Forest Regressor - After tuning'
_perf_val = pd.concat([_perf_val,_predictions_val_rf], ignore_index=True)
_predictions_val_rf
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | Model | |
|---|---|---|---|---|---|---|
| 0 | 573.516807 | 450.611094 | 0.70695 | 0.705098 | 0.184908 | Random Forest Regressor - After tuning |
Gradient boosting
Hyperparameters to consider¶
n_estimators: [100, 150, 200, 250],
learning_rate: [0.01, 0.05, 0.1],
subsample: [0.5, 0.6, 0.7, 0.8, 1],
max_features: ['sqrt', 'log2', None],
max_depth: [3, 5, 8, 10,],
min_samples_split: [6, 8, 10, 12]
min_samples_leaf:[4, 8,10]
RandomizedSearchCV:
- estimator=_pipeline
- Hyperparameters
- jobs = -1
- iterations = 60
- scoring = neg_root_mean_squared_error
- cv = 5
- random state = 1
_tuned_GB = GradientBoostingRegressor(random_state=1)
_pipeline = make_pipeline(_pipeline_preprocessor, _tuned_GB)
# Best Hyperparameters scores (positive RMSE)
print("Best params:", _GB_randomized_cv.best_params_)
print("Best CV RMSE:", -_GB_randomized_cv.best_score_)
Best params: {'gradientboostingregressor__subsample': 0.7, 'gradientboostingregressor__n_estimators': 100, 'gradientboostingregressor__min_samples_split': 6, 'gradientboostingregressor__min_samples_leaf': 8, 'gradientboostingregressor__max_features': 'sqrt', 'gradientboostingregressor__max_depth': 3, 'gradientboostingregressor__learning_rate': 0.05}
Best CV RMSE: 610.1829332102039
best_pipeline_gb = _GB_randomized_cv.best_estimator_
best_pipeline_gb
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['Product_Sugar_Content',
                                                   'Product_Type', 'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type'])])),
                ('gradientboostingregressor',
                 GradientBoostingRegressor(learning_rate=0.05,
                                           max_features='sqrt',
                                           min_samples_leaf=8,
                                           min_samples_split=6, random_state=1,
                                           subsample=0.7))])
_predictions_train_gb = model_performance_regression(best_pipeline_gb, X_train, y_train)
_predictions_train_gb["Model"] = 'Gradient boost - After tuning'
_perf_train = pd.concat([_perf_train,_predictions_train_gb], ignore_index=True)
_predictions_train_gb
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | Model | |
|---|---|---|---|---|---|---|
| 0 | 604.465527 | 476.781998 | 0.677628 | 0.676952 | 0.169115 | Gradient boost - After tuning |
_predictions_val_gb = model_performance_regression(best_pipeline_gb, X_val, y_val)
_predictions_val_gb["Model"] = 'Gradient boost - After tuning'
_perf_val = pd.concat([_perf_val,_predictions_val_gb], ignore_index=True)
_predictions_val_gb
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | Model | |
|---|---|---|---|---|---|---|
| 0 | 573.8879 | 451.414201 | 0.706571 | 0.704717 | 0.186055 | Gradient boost - After tuning |
Observations¶
After hyperparameter tuning, an improvement in the RMSE was observed on the validation set. Specifically, the Random Forest model's RMSE decreased from 589.02 to 573.52, while the Gradient Boost model's RMSE improved from 576.43 to 573.89.
Although the difference is marginal, the enhancement resulting from the tuning process is evident.
Model Performance Comparison, Final Model Selection, and Serialization¶
Performance comparison - Training¶
_perf_train
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | Model | |
|---|---|---|---|---|---|---|
| 0 | 599.302247 | 471.734564 | 0.683112 | 0.682447 | 0.166443 | Random Forest Regressor - Before tuning |
| 1 | 602.784108 | 474.329441 | 0.679419 | 0.678747 | 0.167860 | Gradient Boost Regressor - Before tuning |
| 2 | 605.067619 | 476.487114 | 0.676986 | 0.676308 | 0.169244 | Random Forest Regressor - After tuning |
| 3 | 604.465527 | 476.781998 | 0.677628 | 0.676952 | 0.169115 | Gradient boost - After tuning |
Observations¶
Training set
For the Random Forest model, a slight deterioration in the RMSE was observed, increasing from 599.30 to 605.07. A similar trend was noted for the R-Squared value.
The Gradient Boost model also exhibited a deterioration in its RMSE, rising from 602.78 to 604.46.
However, in both instances, the difference is not substantial. The values obtained from the validation set will be reviewed, as they are of greater interest for assessing whether an improvement in the models' generalization capabilities for predicting unseen data has occurred.
Performance comparison - Validation¶
_perf_val
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | Model | |
|---|---|---|---|---|---|---|
| 0 | 589.020035 | 459.391759 | 0.690892 | 0.688939 | 0.188043 | Random Forest Regressor - Before tuning |
| 1 | 576.429627 | 451.981812 | 0.703966 | 0.702095 | 0.186927 | Gradient Boost Regressor - Before tuning |
| 2 | 573.516807 | 450.611094 | 0.706950 | 0.705098 | 0.184908 | Random Forest Regressor - After tuning |
| 3 | 573.887900 | 451.414201 | 0.706571 | 0.704717 | 0.186055 | Gradient boost - After tuning |
Observations¶
Validation set
For the Random Forest model, the RMSE improved from 589.02 to 573.52. Similarly, the R-Squared also showed improvement, increasing from 0.69 to 0.71.
In the case of the Gradient Boost model, an improved performance in the RMSE was also observed, with values decreasing from 576.43 to 573.88.
While the results on the training set did not show improvement, the validation set results did improve. Therefore, it can be interpreted that the models with hyperparameter tuning are indeed better at predicting unseen cases, which is the scenario of greater interest.
Final model selection¶
Both models yielded very similar results. However, the Random Forest model was selected after applying hyperparameter tuning due to its superior RMSE value of 573.52 on the validation set, and its better R-Squared metric of 0.71.
Additionally, Random Forest models offer greater interpretability compared to Gradient Boost models, and their training time is shorter.
Performance check on the test set¶
_predictions_test = model_performance_regression(best_pipeline_rf, X_test, y_test)
_predictions_test
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 621.011706 | 489.714707 | 0.66544 | 0.663326 | 0.166553 |
Observations¶
RMSE and MAE: Given that the statistical summary indicates the third quartile (Q3) is USD 4,145.16, the RMSE and MAE errors represent a small fraction of the variable's value. This suggests the model demonstrates relatively good performance.
R-Squared: The model explains 66.5% of the variability in the product's total sales across stores, which can be considered acceptable.
MAPE: The model's average error is 16.65% of the actual value, which is also an acceptable figure.
Conclusion: the selected model possesses the capability to predict product sales per store with an average error of 16.6%, explaining 66.5% of sales variability. Nevertheless, this capability could be enhanced through more detailed analysis and by evaluating alternative models or further fine-tuning hyperparameters.
Serialization¶
Tasks¶
- Define the file path to serialize the trained model along with the data preprocessing steps
- Save the trained model pipeline using joblib
- joblib.dump(best model pipeline, saved model path)
- Load the saved model pipeline from the file
- joblib.load
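The steps above amount to a dump/load round trip with joblib. A self-contained sketch, using a trivial stand-in estimator and a temporary path rather than the actual deployment file:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression  # stand-in for the tuned pipeline

# Fit a trivial estimator: y = 2x
model = LinearRegression().fit(np.array([[0.0], [1.0]]), np.array([0.0, 2.0]))

# Serialize to disk, then restore; the notebook's real file is
# "SuperKart_Model_Deployment.joblib" in the backend folder
saved_model_path = os.path.join(tempfile.gettempdir(), "SuperKart_Model_Deployment.joblib")
joblib.dump(model, saved_model_path)
saved_model = joblib.load(saved_model_path)
print(saved_model.predict(np.array([[2.0]]))[0])
```

Because the whole pipeline (preprocessor plus model) is serialized, the loaded object accepts raw feature frames directly, exactly as `best_pipeline_rf` does.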
Saved model check
saved_model
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['Product_Sugar_Content',
                                                   'Product_Type', 'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type'])])),
                ('randomforestregressor',
                 RandomForestRegressor(max_depth=15, max_features='sqrt',
                                       max_samples=0.9, min_samples_leaf=30,
                                       random_state=1))])
Testing predictions on the test set using the serialized/saved model
saved_model.predict(X_test)
array([3272.21098797, 4893.73359389, 4885.75254721, ..., 4881.93593986,
3286.629245 , 3307.20963861])
Deployment - Backend¶
Flask Web Framework¶
Tasks¶
- Store backend server deployment files
- %%writefile backend_files/app.py
- Initialize Flask app with a name
- Load the trained prediction model
- joblib.load("SuperKart_Model_Deployment.joblib")
- Define a route for the home page @_superkart_api.get('/')
- Define an endpoint @_superkart_api.post('/v1/forecast')
- Get JSON data from the request
- Extract features from the input data
_input = {
'Product_Id': _data['Product_Id'],
'Product_Weight': _data['Product_Weight'],
'Product_Sugar_Content': _data['Product_Sugar_Content'],
'Product_Allocated_Area': _data['Product_Allocated_Area'],
'Product_Type': _data['Product_Type'],
'Product_MRP': _data['Product_MRP'],
'Store_Id': _data['Store_Id'],
'Store_Establishment_Year': _data['Store_Establishment_Year'],
'Store_Size': _data['Store_Size'],
'Store_Location_City_Type': _data['Store_Location_City_Type'],
'Store_Type': _data['Store_Type']
}
# Convert the extracted data into a DataFrame
input_data = pd.DataFrame([_input])
# Make a prediction using the trained model
prediction = saved_model.predict(input_data).tolist()[0]
# Return the prediction as a JSON response
return jsonify({'Sales forescast': prediction})
# Run the Flask app in debug mode
if __name__ == '__main__':
_superkart_api.run(debug=True)
Overwriting backend_files/app.py
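A quick smoke test of the `/v1/forecast` endpoint can build a payload with the eleven fields the route extracts (the field values below are hypothetical) and POST it; the request lines are commented out since they require the backend to be running:

```python
# Hypothetical field values drawn from categories seen in the dataset
payload = {
    "Product_Id": "FDX01",
    "Product_Weight": 12.66,
    "Product_Sugar_Content": "Low Sugar",
    "Product_Allocated_Area": 0.027,
    "Product_Type": "Frozen Foods",
    "Product_MRP": 117.08,
    "Store_Id": "OUT004",
    "Store_Establishment_Year": 2009,
    "Store_Size": "Medium",
    "Store_Location_City_Type": "Tier 2",
    "Store_Type": "Supermarket Type2",
}

# Requires the Flask server to be up; uncomment to exercise the endpoint
# import requests
# r = requests.post("http://127.0.0.1:5000/v1/forecast", json=payload)
# print(r.json())
print(len(payload))  # the endpoint extracts all 11 fields
```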
Dependencies File¶
%%writefile backend_files/requirements.txt
numpy==2.0.2
pandas==2.2.2
scikit-learn==1.6.1
joblib==1.4.2
xgboost==2.1.4
requests==2.32.3
Werkzeug==2.2.2
flask==2.2.2
gunicorn==20.1.0
uvicorn[standard]
streamlit==1.43.2
Overwriting backend_files/requirements.txt
Dockerfile¶
%%writefile backend_files/Dockerfile
# Use a minimal base image with Python 3.9 installed
FROM python:3.9-slim
# Set the working directory inside the container to /app
WORKDIR /app
# Copy all files from the current directory on the host to the container's /app directory
COPY . .
# Install Python dependencies listed in requirements.txt
RUN pip3 install --no-cache-dir --upgrade -r requirements.txt
# Define the command to start the application using Gunicorn with 4 worker processes
# - `-w 4`: Uses 4 worker processes for handling requests
# - `-b 0.0.0.0:7860`: Binds the server to port 7860 on all network interfaces
# - `app:app`: Runs the Flask app (assuming `app.py` contains the Flask instance named `app`)
CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:7860", "app:_superkart_api"]
Overwriting backend_files/Dockerfile
Setting up a Hugging Face Docker Space for the Backend¶
- The space was created via the Hugging Face website: https://huggingface.co/
- Repository id = JosueQB12/SuperKart_JQB
- Link of the Hugging Face space = https://huggingface.co/spaces/JosueQB12/superkart
- Base link for specifying endpoints = https://josueqb12-superkart.hf.space/
Uploading Files to Hugging Face Space (Docker Space)¶
Tasks¶
- Create access token
- Get repo id (Hugging Face space ID)
- Login to Hugging Face platform with the access token
- Initialize the API
- Upload Streamlit app files stored in the folder called deployment_files
- api.upload_folder(folder_path (Local folder path), repo_id, repo_type="space")
CommitInfo(commit_url='https://huggingface.co/spaces/JosueQB12/superkart/commit/39fcb5d1ee278f67be5d97c535fa40ef17df0594', commit_message='Upload folder using huggingface_hub', commit_description='', oid='39fcb5d1ee278f67be5d97c535fa40ef17df0594', pr_url=None, repo_url=RepoUrl('https://huggingface.co/spaces/JosueQB12/superkart', endpoint='https://huggingface.co', repo_type='space', repo_id='JosueQB12/superkart'), pr_revision=None, pr_num=None)
Share the link of the Hugging Face space - Backend¶
- https://huggingface.co/spaces/JosueQB12/superkart
- Base link for specifying endpoints = https://josueqb12-superkart.hf.space/
Deployment - Frontend¶
Streamlit for Interactive UI¶
Tasks¶
- Create a folder for storing the files needed for frontend UI deployment.
- %%writefile frontend_files/app.py
- os.environ["STREAMLIT_HOME"] = "/tmp/.streamlit"
- os.environ["XDG_CONFIG_HOME"] = "/tmp"
- os.makedirs("/tmp/.streamlit", exist_ok=True)
# Input fields
Product_Id = st.text_input("Product Id.", max_chars=10, help="Enter the product Id.")
Product_Weight = st.number_input("Product Weight", min_value=0.0, help="Enter the product weight (lbs).")
Product_Sugar_Content = st.selectbox("Product sugar content", ["Low Sugar", "Regular", "No Sugar"], help="Enter the product sugar content.")
Product_Allocated_Area = st.number_input("Product allocated area", min_value=0.0, help="Enter the ratio of the allocated display area of the product.")
Product_Type = st.selectbox("Product type", ["Baking Goods","Breads","Breakfast","Canned","Dairy","Frozen Foods","Fruits and Vegetables","Hard Drinks",
"Health and Hygiene","Household","Meat","Others","Seafood","Snack Foods","Soft Drinks","Starchy Foods"],
help="Select the product type.")
Product_MRP = st.number_input("Product MRP", min_value=0.0, help="Enter the maximum retail price of the product.")
Store_Id = st.text_input("Store Id.", max_chars=10, help="Enter the store Id.")
Store_Establishment_Year = st.number_input("Store establishment year", min_value=1987, max_value=2025, help="Enter the year in which the store was established.")
Store_Size = st.selectbox("Store size", ["High", "Medium", "Small"], help="Enter the size of the store.")
Store_Location_City_Type = st.selectbox("Store location city type", ["Tier 1", "Tier 2", "Tier 3"], help="Select the type of city in which the store is located.")
Store_Type = st.selectbox("Store type", ["Departmental Store", "Food Mart", "Supermarket Type1", "Supermarket Type2"],
help="Select the type of store depending on the products being sold.")
product_data = {
"Product_Id": Product_Id,
"Product_Weight": Product_Weight,
"Product_Sugar_Content": Product_Sugar_Content,
"Product_Allocated_Area": Product_Allocated_Area,
"Product_Type": Product_Type,
"Product_MRP": Product_MRP,
"Store_Id": Store_Id,
"Store_Establishment_Year": Store_Establishment_Year,
"Store_Size": Store_Size,
"Store_Location_City_Type": Store_Location_City_Type,
"Store_Type": Store_Type
}
if st.button("Predict", type='primary'):
    response = requests.post("https://josueqb12-superkart.hf.space/v1/forecast", json=product_data)
    if response.status_code == 200:
        result = response.json()
        _sales_forecast = result["Sales forescast"]  # key spelling matches the backend API's response
        st.write(f"Based on the information provided, the total predicted product store sales are: {_sales_forecast:.2f}.")
    else:
        st.error("Error in API request")
Overwriting frontend_files/app.py
Dependencies File¶
%%writefile frontend_files/requirements.txt
pandas==2.2.2
requests==2.32.3
streamlit==1.45.0
Overwriting frontend_files/requirements.txt
Dockerfile¶
%%writefile frontend_files/Dockerfile
# Use a minimal base image with Python 3.9 installed
FROM python:3.9-slim
# Set the working directory inside the container to /app
WORKDIR /app
# Copy all files from the current directory on the host to the container's /app directory
COPY . .
# Install Python dependencies listed in requirements.txt
RUN pip3 install -r requirements.txt
# Define the command to run the Streamlit app on port 8501 and make it accessible externally
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0", "--server.enableXsrfProtection=false"]
# NOTE: Disable XSRF protection for easier external access in order to make batch predictions
Overwriting frontend_files/Dockerfile
Uploading Files to Hugging Face Space (Streamlit Space)¶
Tasks¶
- Set access token: a Hugging Face access token created in write mode
- Set repo id: the Hugging Face space id
- Log in to the Hugging Face platform with the access token
- Initialize the API
- Upload the Streamlit app files stored in the folder called frontend_files
- api.upload_folder(folder_path=&lt;local folder path&gt;, repo_id=&lt;space id&gt;, repo_type="space")
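The upload steps above can be sketched with the `huggingface_hub` library; the token and space id below are placeholders, and the folder name is assumed to be the `frontend_files` folder created earlier:

```python
def upload_frontend(token: str, repo_id: str, folder: str = "frontend_files") -> None:
    """Authenticate and push the local Streamlit files to a Hugging Face Space."""
    from huggingface_hub import HfApi, login  # requires: pip install huggingface_hub

    login(token=token)        # log in with a write-mode access token
    api = HfApi()             # initialize the API client
    api.upload_folder(        # upload the whole folder to the space repo
        folder_path=folder,
        repo_id=repo_id,
        repo_type="space",
    )

# Example call (placeholder token; do not commit real tokens to source control):
# upload_frontend("hf_xxx", "JosueQB12/superkart_frontend")
```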
CommitInfo(commit_url='https://huggingface.co/spaces/JosueQB12/superkart_frontend/commit/d6d162f0d45f545aac1cf2fc4790fe7e31497718', commit_message='Upload folder using huggingface_hub', commit_description='', oid='d6d162f0d45f545aac1cf2fc4790fe7e31497718', pr_url=None, repo_url=RepoUrl('https://huggingface.co/spaces/JosueQB12/superkart_frontend', endpoint='https://huggingface.co', repo_type='space', repo_id='JosueQB12/superkart_frontend'), pr_revision=None, pr_num=None)
Share the link of the Hugging Face space - Frontend¶
Test
_url = "https://josueqb12-superkart.hf.space/v1/forecast"
payload = {
"Product_Id": "FD1961",
"Product_Weight": 9.45,
"Product_Sugar_Content": "Low Sugar",
"Product_Allocated_Area": 0.047,
"Product_Type": "Snack Foods",
"Product_MRP": 95.95,
"Store_Id": "OUT002",
"Store_Establishment_Year": 1998,
"Store_Size": "Small",
"Store_Location_City_Type": "Tier 3",
"Store_Type": "Food Mart"
}
response = requests.post(_url, json=payload)
response
<Response [200]>
print(response.json())
{'Sales forescast': 1736.006160411349}
Actionable Insights and Business Recommendations¶
Insights from the analysis conducted¶
Interpretation of metrics
- RMSE (Root Mean Squared Error): On average, predictions deviate by 621 USD from the actual sales.
- MAE (Mean Absolute Error): On average, the absolute difference between predicted and actual sales is 489.71 USD.
- R-squared: About 66.5% of the variance in sales is explained by the model.
- MAPE (Mean Absolute Percentage Error): On average, the predictions are off by 16.7% of the actual sales value.
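As an illustration of how these four metrics are computed, here is a minimal pure-Python sketch applied to a toy example (not the project's actual test set):

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, R-squared, and MAPE for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)             # root mean squared error
    mae = sum(abs(e) for e in errors) / n                         # mean absolute error
    mean_t = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)                          # residual sum of squares
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)               # total sum of squares
    r2 = 1 - ss_res / ss_tot                                      # coefficient of determination
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n * 100  # in percent
    return {"RMSE": rmse, "MAE": mae, "R2": r2, "MAPE": mape}

# Toy example: three actual values vs. three predictions
metrics = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 310.0])
```

In practice, `sklearn.metrics` provides equivalent functions (`mean_squared_error`, `mean_absolute_error`, `r2_score`, `mean_absolute_percentage_error`).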
Evaluation
- Product_Id: FD1961
- Weight: 9.45
- Sugar content: Low Sugar
- Allocated area: 0.047
- Type: Snack foods
- MRP: 95.95
- Sold in a Food Mart
- Product_Store_Sales_Total = 1684.82 USD
Based on the trained model:
The actual sales of product FD1961 in OUT002 were 1,684.82 USD
The forecasted sales from the selected model were 1,736 USD.
Absolute Error: |1736 - 1684.82| = 51.18 USD
Percentage Error: (51.18 / 1684.82) × 100 ≈ 3.04%
In this case, the error was much lower than average, indicating an excellent performance.
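The error figures in this worked example can be verified directly:

```python
actual = 1684.82    # actual sales of product FD1961 in OUT002 (USD)
forecast = 1736.00  # forecast from the selected model (USD)

abs_error = abs(forecast - actual)    # absolute error in USD
pct_error = abs_error / actual * 100  # percentage error relative to actual sales

print(f"Absolute error: {abs_error:.2f} USD")
print(f"Percentage error: {pct_error:.2f}%")
```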
- With a MAPE of 16.7%, the model generally predicts sales within ±17% of actual values. The R-squared of 66.5% shows it captures many of the trends in the data, though improvements are possible through more detailed analysis, evaluating alternative models, or further hyperparameter tuning. Nevertheless, this level of accuracy is generally sufficient for inventory, purchasing, and marketing decisions for most products.
Business recommendations¶
Fruits and vegetables represent the highest-selling product category, accounting for 14.3% of all products sold and generating the largest revenue at 4,300,833.27 USD. However, due to their seasonal nature, some produce items may not always be consistently available.
It would be beneficial to strengthen the acquisition and sales efforts for the second and third most significant product groups: Snack Foods, which generated 3,988,996.95 USD, and Dairy, with 2,811,918.04 USD. This strategy would reduce reliance on perishable products with unpredictable availability. Additionally, "Snack Foods" should not exclusively imply unhealthy options; rather, healthy alternatives can be offered to health-conscious customers, who represent the identified target demographic for the stores.
The least sold product category is "Seafood," with only 76 units. It is advisable to assess whether to continue offering these products. Due to their nature, they require special in-store placement and specific maintenance, such as refrigeration, which may incur costs that do not justify the return on investment. Furthermore, these products are subject to a shorter expiration period.
Over 70% of stores are located in cities where the standard of living is middle class (Tier 2). Both areas with a higher standard of living (Tier 1) and those with lower purchasing power (Tier 3) are represented in smaller, yet very similar, proportions at 15.4% and 13.1%, respectively.
Marketing efforts and product selections should match the local purchasing power in each zone. This strategy aims to boost sales of products genuinely appealing to the local demographic. It is recommended that Key Performance Indicators (KPIs) specific to each Store Type be developed to effectively measure these efforts.
This approach could also benefit from utilizing "Context Proxies," which use a derived variable to indirectly represent an associated variable. For instance, calculating the average sales per store could indirectly reflect each location's capacity to generate profit, capturing characteristics of both the store itself and its surrounding area (e.g., its socio-economic status).
An interesting finding is that despite the majority of products sold being low in sugar, all sugar content categories generate very similar revenues. In other words, the fact that over 50% of products are low in sugar does not translate to higher sales compared to products with a regular sugar level or sugar-free products, at least when interpreting the median values.
This could mean that products with a regular sugar level might sell well, perhaps even better than low-sugar options. This is because customers actively seek these products, even if they are not the predominant items offered in stores.
Finally, it would be beneficial to begin recording data on the seasonal sales peaks for each product category. The aim is to forecast sales more accurately based on this information. Additionally, a marketing strategy that includes discounts and promotions for these products at specific times of the year could be developed.