Study design and forecasting framework
The study was conducted at the Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation (KLEF), Greenfields, Guntur, Andhra Pradesh, India, during 2025. It applies a reproducible machine-learning workflow for early forecasting of cocoon silk prices, in which all predictor variables are available before the forecast time. The target variable is the modal price of cocoon silk (P_t), the most frequently observed market price on a given day. To prevent target leakage, contemporaneous price variables such as minimum and maximum prices were excluded from the forecasting model. Instead, the prediction task was defined using lagged historical prices and relevant environmental indicators, so that the model forecasts prices using only past and pre-available information.
Input variables and feature construction
The forecasting function is defined as:
\hat{P}_t = f(P_{t-1}, P_{t-7}, T_t, D_t, S_t, M_t) ...(1)
Explanation
This equation explains the early forecasting task, where the future modal price is estimated using only historically available price information and contemporaneous environmental and management factors, ensuring a leakage-free prediction setting.
Where,
P_{t-1} and P_{t-7} = Lagged modal prices capturing short-term and weekly temporal dependencies.
T_t = Ambient temperature.
D_t = Disease incidence.
S_t = Sanitation conditions.
M_t = Mulberry feeding frequency.
Lagged features were built strictly following the chronological order of observations in the dataset, without introducing any synthetic or external temporal references. Records with insufficient historical data for lag generation were excluded from the analysis.
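The lag construction described above can be sketched as follows. This is a minimal illustration, not the study's code: the record keys `date` and `modal_price`, the helper name `build_lagged_rows` and the toy series are all hypothetical, and lags are assumed to be counted in observations (rows) in the dataset's own chronological order.

```python
from datetime import date, timedelta

def build_lagged_rows(records, lags=(1, 7)):
    """Attach lagged modal prices to each record.

    `records` is a list of dicts with the hypothetical keys "date" and
    "modal_price". Rows without enough history for the longest lag are dropped.
    """
    ordered = sorted(records, key=lambda r: r["date"])  # chronological order
    max_lag = max(lags)
    rows = []
    for i in range(max_lag, len(ordered)):
        row = dict(ordered[i])
        for k in lags:
            # Only observations strictly before position i are used.
            row[f"price_lag_{k}"] = ordered[i - k]["modal_price"]
        rows.append(row)
    return rows

# Toy series (illustrative only): ten consecutive days, prices 100, 101, ..., 109.
start = date(2025, 1, 1)
toy = [{"date": start + timedelta(days=d), "modal_price": 100.0 + d} for d in range(10)]
lagged = build_lagged_rows(toy)
print(len(lagged))                                         # 3
print(lagged[0]["price_lag_1"], lagged[0]["price_lag_7"])  # 106.0 100.0
```

The first seven toy records are discarded because they lack a seven-step history, which mirrors the exclusion rule stated above.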
Data pre-processing
All records were reviewed for missing, inconsistent, or out-of-range values. Missing numerical values were imputed using feature-wise medians where appropriate, and irrecoverable records were removed. Categorical variables were encoded numerically, and continuous features were standardized using z-score normalization, defined as:
Z_i = \frac{X_i - \mu_i}{\sigma_i} ...(2)
Explanation
Here, each numerical feature is standardized by subtracting its mean and dividing by its standard deviation, which improves numerical stability during model training and supports reproducibility across datasets.
Where,
\mu_i and \sigma_i = The mean and standard deviation of feature X_i, respectively.
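A minimal sketch of z-score standardization is given below. Two choices in it are assumptions rather than details reported here: the mean and standard deviation are estimated on the training portion only and then reused for the test portion, and the population standard deviation is used. The function names and temperature values are hypothetical.

```python
from statistics import fmean, pstdev

def fit_scaler(values):
    """Return (mean, std) of one feature; std falls back to 1.0 for a constant feature."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    return mu, (sigma if sigma > 0 else 1.0)

def transform(values, mu, sigma):
    """Apply z = (x - mu) / sigma to every value."""
    return [(v - mu) / sigma for v in values]

# Illustrative temperatures only.
train_temp = [24.0, 26.0, 28.0, 30.0]
test_temp = [27.0, 32.0]

mu, sigma = fit_scaler(train_temp)            # statistics come from training data
z_train = transform(train_temp, mu, sigma)
z_test = transform(test_temp, mu, sigma)      # test data reuse the training statistics
print(z_train)
print(z_test)
```

Reusing the training statistics for the test portion keeps information from the held-out data out of the pre-processing step.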
Model architecture
A Random Forest Regressor (RFR) was employed due to its robustness to noise, resistance to overfitting and ability to capture non-linear relationships in agricultural market data. The overall workflow of the proposed early forecasting framework for cocoon silk prices is illustrated in Fig 1. The model consists of an ensemble of M decision trees trained on bootstrapped subsets of the data. The final prediction for an input feature vector x_t is obtained by averaging the outputs of the individual trees:
\hat{P}_t = \frac{1}{M} \sum_{m=1}^{M} h_m(x_t) ...(3)
Explanation
The forecast is obtained by averaging predictions from multiple independently trained decision trees, reducing variance and improving generalization under volatile market conditions.
Where,
h_m(·) = The prediction from the m-th tree.
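The averaging step can be illustrated with a deliberately small bagged ensemble. This sketch is not the model used in the study: it substitutes depth-1 regression trees (stumps) on a single feature for full decision trees, omits the per-split feature subsampling that a random forest normally applies, and uses hypothetical function names and toy data.

```python
import random
from statistics import fmean

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree on one feature by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the maximum so the right side is non-empty
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        lm, rm = fmean(left), fmean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    if best is None:  # every x in this bootstrap sample is identical
        mean_y = fmean(ys)
        return lambda x: mean_y
    _, threshold, lm, rm = best
    return lambda x: lm if x <= threshold else rm

def fit_bagged_ensemble(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def predict(trees, x):
    """Ensemble prediction: the mean of the individual tree outputs h_m(x)."""
    return fmean([tree(x) for tree in trees])

# Toy data (illustrative only): the target is the lagged price plus one.
xs = [float(v) for v in range(100, 120)]
ys = [x + 1.0 for x in xs]
trees = fit_bagged_ensemble(xs, ys)
pred = predict(trees, 110.0)
print(round(pred, 2))
```

Each stump sees a different bootstrap sample, so the individual outputs differ; taking their mean is what smooths those differences.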
Feature importance analysis
Feature importance was calculated using the mean decrease in impurity, which quantifies the average reduction in prediction variance when a feature is used for node splitting across the ensemble.
FI_j = \sum_{n \in N_j} \Delta I_n ...(4)
Explanation
This formulation measures the contribution of each predictor by aggregating the reduction in prediction variance across all tree nodes where the feature is used for splitting. Here, \Delta I_n denotes the impurity reduction at node n and N_j represents the set of all nodes where feature j is used. This analysis enables identification of the dominant predictors influencing price dynamics.
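The aggregation in the importance formula can be sketched as below. The split records and feature names are hypothetical, and the optional normalization to shares that sum to one is an assumption added for readability; it is not part of the formula above. Library implementations typically also weight each node by the fraction of samples reaching it and average across trees.

```python
from collections import defaultdict

def mean_decrease_in_impurity(split_nodes, normalize=True):
    """Sum impurity reductions per feature: FI_j = sum of delta_I_n over nodes n in N_j."""
    totals = defaultdict(float)
    for feature, delta_impurity in split_nodes:
        totals[feature] += delta_impurity
    if normalize:
        grand_total = sum(totals.values())
        if grand_total > 0:
            # Optional step (assumption): express importances as shares of the total.
            return {f: v / grand_total for f, v in totals.items()}
    return dict(totals)

# Hypothetical (feature, impurity reduction) records pooled over all trees.
nodes = [
    ("price_lag_1", 6.0),
    ("temperature", 1.0),
    ("price_lag_1", 2.0),
    ("price_lag_7", 1.0),
]
raw = mean_decrease_in_impurity(nodes, normalize=False)
share = mean_decrease_in_impurity(nodes)
print(raw["price_lag_1"])   # 8.0
print(share)
```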
Model evaluation
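The dataset was split into training (80%) and testing (20%) subsets. Model performance was evaluated using Root Mean Square Error (RMSE) and the coefficient of determination (R^2), defined as:
RMSE = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (P_t - \hat{P}_t)^2} ...(5)
R^2 = 1 - \frac{\sum_{t=1}^{n} (P_t - \hat{P}_t)^2}{\sum_{t=1}^{n} (P_t - \bar{P})^2} ...(6)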
Explanation
RMSE and R^2 were used to evaluate forecasting performance. RMSE quantifies the average magnitude of prediction error in price units, and R^2 indicates the proportion of price variation explained by the model relative to a mean-based baseline. Here, P_t is the observed price, \hat{P}_t is the predicted price, \bar{P} is the mean observed price and n is the number of test observations.
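Both metrics are straightforward to compute directly. The sketch below uses hypothetical function names and made-up observed and predicted prices purely to show the arithmetic; the numbers are not results from this study.

```python
from math import sqrt
from statistics import fmean

def rmse(observed, predicted):
    """Root mean square error, in the same units as the prices."""
    return sqrt(fmean([(y - p) ** 2 for y, p in zip(observed, predicted)]))

def r_squared(observed, predicted):
    """1 - SS_res / SS_tot, i.e. improvement over always predicting the mean."""
    mean_obs = fmean(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot

# Illustrative values only.
observed = [100.0, 110.0, 120.0, 130.0]
predicted = [102.0, 108.0, 123.0, 127.0]

print(rmse(observed, predicted))        # sqrt(6.5), about 2.55
print(r_squared(observed, predicted))   # 1 - 26/500, about 0.948

# Predicting the mean for every observation gives R^2 = 0 by construction.
baseline = [fmean(observed)] * len(observed)
print(r_squared(observed, baseline))    # 0.0
```

The baseline check makes the interpretation of R^2 concrete: a value of zero means the model does no better than the mean observed price.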
Data collection
Market-level cocoon silk price data with related environmental and management indicators were collected from multiple states and districts to ensure broad spatial and temporal coverage.
Data preprocessing
The dataset was cleaned by handling missing values, correcting inconsistencies and converting date fields into a standardized chronological format suitable for time-series analysis.
Feature engineering
Lagged modal prices were generated strictly from historical observations and relevant environmental and management variables were incorporated to capture short-term market dynamics without data leakage.
Model implementation (Random forest regressor)
A Random Forest regression model was employed to learn non-linear relationships across lagged price information, environmental factors and cocoon silk price behavior.
Model evaluation
Model performance was assessed using error-based metrics and explained variance to quantify predictive uncertainty and practical usefulness under real-world forecasting conditions.
The following exploratory data analysis provides contextual understanding of market and environmental conditions; however, only lagged modal prices and environmental variables were used in the forecasting model.
Modal price trend over time
Fig 2 presents the temporal evolution of cocoon silk modal prices during the study period. The series is characterized by substantial short-term variability and occasional price spikes. No stable or smooth long-term trend is observed, indicating a highly volatile pricing environment. Such behavior is typical of agricultural commodity markets, where prices are influenced by multiple interacting factors and sudden market disturbances. The pronounced variability in the time series highlights the inherent difficulty of price forecasting and motivates the use of robust, leakage-free modeling approaches based on historical information.
Temperature distribution
Fig 3 shows a histogram of the temperature values in the dataset. A Kernel Density Estimate (KDE) curve is overlaid on the histogram to smooth the distribution and aid visual comparison. This visualization helps identify the general temperature pattern across regions and any departures from it. Because temperature influences farming activities and crop prices, the chart provides context for assessing how climatic conditions affect agriculture.
Disease percentage distribution
Fig 4 illustrates the distribution of disease percentage in the dataset using a Kernel Density Estimate (KDE). The KDE smooths irregularities in the data, making it easier to assess the density and frequency of disease occurrence. Understanding this distribution helps reveal the prevalence of plant diseases across regions and the severity of the problem. A concentration of high values would indicate widespread disease, which may reduce crop yields and in turn cause price spikes. Awareness of these patterns helps farmers and agricultural researchers devise ways to prevent or control disease outbreaks.
Sanitation impact on prices
As shown in Fig 5, modal prices differ by sanitation level. Improved sanitation and hygiene in the market are likely to raise product prices, plausibly because of better quality and a lower risk of contamination. Where the graph shows considerable price differences across sanitation levels, it strengthens the case for raising sanitation standards in agricultural markets to improve consumer confidence and, in turn, profitability.
Mulberry frequency vs. prices
Fig 6 shows how modal prices vary with mulberry feeding frequency. The comparison examines whether more intensive feeding regimes are associated with distinct price patterns. Such patterns would suggest that mulberry management influences market behaviour, with some categories concentrated at very low or very high prices. These insights can help farmers and other stakeholders in the agricultural sector assess the economic feasibility of investing in mulberry production.
Correlation heatmap of forecasting variables
Fig 7 shows the correlation matrix of the target variable and the selected forecasting inputs. The modal price exhibits a moderate positive correlation with its one-period lagged value, indicating short-term temporal dependence in market prices. The correlation with the seven-day lagged price is weaker, suggesting diminishing influence over longer lags. Environmental variables, including temperature and disease incidence, display negligible linear correlation with modal prices. This indicates that their effects on price dynamics are likely indirect and not captured through simple linear relationships. Overall, the correlation structure supports including lagged price variables as primary predictors while justifying the use of non-linear modeling techniques to capture complex interactions.
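The lagged correlations discussed above are Pearson coefficients between the price series and shifted copies of itself. The sketch below shows that computation under stated assumptions: the function names are hypothetical, lags are counted in observations, and the toy series is invented for illustration and is not the study's data.

```python
from math import sqrt
from statistics import fmean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def lag_correlation(series, lag):
    """Correlation between P_t and P_{t-lag} over the overlapping observations."""
    return pearson(series[lag:], series[:-lag])

# Toy price series (illustrative only).
prices = [100.0, 104.0, 101.0, 107.0, 103.0, 109.0,
          105.0, 111.0, 108.0, 114.0, 110.0, 116.0]

r_lag1 = lag_correlation(prices, 1)
r_lag7 = lag_correlation(prices, 7)
print(round(r_lag1, 3), round(r_lag7, 3))
```

A near-zero Pearson coefficient only rules out a linear association, which is why weakly correlated environmental variables can still be informative to a non-linear model.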