A Machine-learning Model for Early Forecasting of Cocoon Silk Prices: Toward Economic Stability in Sericulture

1Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Greenfields, Guntur-522 502, Andhra Pradesh, India.
2Department of Agriculture, Koneru Lakshmaiah Education Foundation, Greenfields, Guntur-522 502, Andhra Pradesh, India.

Background: Cocoon silk markets suffer from price instability, which exposes farmers to high economic uncertainty. Traditional forecasting methods do not capture the seasonal, environmental and market-driven interactions that drive price changes. Because cocoon production sustains millions of small farmers, unpredictable prices arising from fluctuating demand, climatic conditions and variation in production cycles make income planning extremely difficult. This paper examines whether a data-driven machine-learning model can capture these complex market dynamics.

Methods: This study presents a leakage-free machine learning framework for early forecasting of cocoon silk prices using a Random Forest regression model. To maintain realistic forecasting conditions, contemporaneous price variables were excluded and predictions were generated solely from lagged historical modal prices together with pre-available environmental and management indicators. The dataset was preprocessed and temporally ordered, and the model was evaluated using a chronological train-test split. Performance was assessed using error-based metrics and explained variance, reflecting the non-stationary nature of agricultural price series.

Result: The revised model achieved a Mean Absolute Error (MAE) of 2965.14, a Root Mean Square Error (RMSE) of 5985.60 and an explained variance (R²) of 0.0903. While the explained variance is modest, this behavior is expected for early forecasting in volatile agricultural markets when data leakage is eliminated. Feature contribution analysis reveals that short-term historical prices have the strongest influence, with environmental and management factors providing secondary but meaningful effects. A simulation-based assessment indicates a potential 12-18% revenue improvement resulting from informed selling decisions enabled by short-horizon price forecasts. This gain reflects lower exposure to post-harvest price troughs and improved market selection rather than precise peak-price prediction.

Sericulture is an essential agro-industrial activity that supports rural livelihoods and contributes to agricultural economies across major silk-producing regions, particularly in Asia (Alam et al., 2020). Mulberry silk leads commercial silk production because of its superior filament quality, tensile strength and sustained market demand (Binson and Manju, 2024). In India, sericulture is practiced across several states and districts under diverse agro-climatic conditions, making cocoon production and pricing highly responsive to environmental, biological and market-related factors.
       
The efficiency and quality of cocoon silk production are governed by complex interactions among mulberry cultivation practices, silkworm (Bombyx mori) physiology and prevailing climatic conditions. Soil fertility and nutrient availability influence mulberry leaf quality, which directly affects silkworm growth and cocoon characteristics (Dhahira and Devamani, 2020; Das and Ghosh, 2024). Environmental variables such as temperature and seasonal variability further influence silkworm metabolism, cocoon formation and yield stability (Liu et al., 2024). Management-related factors, such as sanitation conditions and mulberry feeding frequency, also play an indirect but meaningful role in shaping cocoon quality and production consistency (Chakrasali et al., 2024).
       
Despite significant progress in biological optimization, automation and quality assessment within sericulture systems (Rim et al., 2017; Vasta et al., 2023), economic predictability, particularly cocoon silk price forecasting, remains a persistent challenge. Cocoon prices show clear volatility arising from biological variability, environmental uncertainty, supply-demand imbalance and area-specific trading behavior (Rahmathulla, 2012). This volatility affects farmer income stability, procurement planning and policy planning across sericulture regions.
       
Recent developments in machine learning (ML) have enabled the modelling of complex, non-linear relationships in agricultural and economic systems. Random Forests, Support Vector Regression and ensemble-based methods have demonstrated their capacity to identify useful patterns from historical agricultural data (Raju et al., 2024; Sut et al., 2024). However, many existing forecasting studies emphasize contemporaneous prediction or report exceptionally optimistic performance metrics without adequately accounting for temporal dependency and data leakage, which limits their relevance for advance decision-making in real market conditions (Rukmangada et al., 2018; Zhang et al., 2018).
       
In addition, forecasting frameworks that rely on variables unavailable at the time of decision inherently risk overestimating predictive capability. This shows the need for methodologically rigorous, leakage-free forecasting approaches that restrict inputs to historically available information while including relevant environmental and management indicators (Ramasubramanian and Singh, 2017).
       
In this context, the present study examines early forecasting of cocoon silk prices across multiple states and districts in India, using a random forest-based model built exclusively on lagged historical prices and environmental and management factors such as temperature, disease incidence, sanitation conditions and mulberry management practices. By aligning model inputs with information available prior to market realization, the study ensures internal consistency between data, methodology and interpretation.
       
Therefore, the research hypothesis posits that a leakage-free machine-learning framework can provide meaningful early signals of short-term price changes and uncertainty arising from seasonal, environmental and market influences. Such early-alert capabilities are essential for improving economic resilience, informed planning and long-term sustainability within Indian sericulture systems (Manjunatha et al., 2017; Chanotra and Angotra, 2022; Thomas and Thomas, 2024).
Study design and forecasting framework
 
The present study was conducted at the Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation (KLEF), Greenfields, Guntur, Andhra Pradesh, India, during 2025. It applies a reproducible machine-learning workflow designed for early forecasting of cocoon silk prices, ensuring that all predictor variables are available prior to the forecast time. The target variable is the modal price of cocoon silk (P_t), representing the most commonly observed market price on a given day. To prevent target leakage, contemporaneous price variables such as minimum and maximum prices were excluded from the forecasting model. Instead, the prediction task was defined using lagged historical prices and relevant environmental indicators, allowing the model to forecast future prices using only past and pre-available information.
 
Input variables and feature construction
 
The forecasting function is defined as:
 
               \hat{P}_t = f(P_{t-1}, P_{t-7}, T_t, D_t, S_t, M_t)              ...(1)
 
Explanation
 
This equation defines the early forecasting task, where the future modal price \hat{P}_t is estimated using only historically available price information and contemporaneous environmental and management factors, ensuring a leakage-free prediction setting.
Where,
P_{t-1} and P_{t-7} = Lagged modal prices capturing short-term and weekly temporal dependencies.
T_t = Ambient temperature.
D_t = Disease incidence.
S_t = Sanitation conditions.
M_t = Mulberry feeding frequency.
       
Lagged features were built strictly following the chronological order of observations in the dataset, without introducing any synthetic or external temporal references. Records with insufficient historical data for lag generation were excluded from the analysis.
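
The lag construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record keys ("market", "date", "modal_price"), the function name and the choice to build lags separately within each market are assumptions made for the example, and the lags are counted in observations rather than calendar days.

```python
from collections import defaultdict


def add_lag_features(records, lags=(1, 7)):
    """Attach lagged modal prices to each record, market by market.

    `records` is a list of dicts with the hypothetical keys
    "market", "date" (ISO string) and "modal_price".
    """
    by_market = defaultdict(list)
    for rec in records:
        by_market[rec["market"]].append(rec)

    enriched_rows = []
    for rows in by_market.values():
        # ISO date strings sort chronologically.
        rows = sorted(rows, key=lambda r: r["date"])
        for i, row in enumerate(rows):
            if i < max(lags):
                continue  # insufficient history for the longest lag: drop the record
            enriched = dict(row)
            for k in lags:
                # Only strictly earlier observations are used, so no future price leaks in.
                enriched[f"price_lag_{k}"] = rows[i - k]["modal_price"]
            enriched_rows.append(enriched)
    return enriched_rows


# Nine made-up daily observations for one market.
records = [
    {"market": "A", "date": f"2025-01-{day:02d}", "modal_price": 100.0 + day}
    for day in range(1, 10)
]
lagged = add_lag_features(records)
```

With nine observations and a longest lag of seven, only the last two records keep a complete history, which mirrors the exclusion rule stated above.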
 
Data pre-processing
 
All records were reviewed for missing, inconsistent or out-of-range values. Missing numerical values were imputed using feature-wise medians where appropriate and irrecoverable records were removed. Categorical variables were encoded numerically and continuous features were standardized using z-score normalization, defined as:

               X'_i = \frac{X_i - \mu_i}{\sigma_i}              ...(2)
 
Explanation
 
Here, each numerical feature is standardized by subtracting its mean and dividing by its standard deviation, which improves numerical stability during model training and maintains reproducibility across datasets.
Where,
\mu_i and \sigma_i = The mean and standard deviation of feature X_i, respectively.
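
A short sketch of the standardization step is given below. Two choices in it are assumptions rather than details reported in the study: the mean and standard deviation are estimated on the training portion only and then reused for later data, and the population standard deviation is used. The temperature values are made up.

```python
from statistics import mean, pstdev


def fit_zscore(train_values):
    """Estimate the mean and (population) standard deviation on training data."""
    return mean(train_values), pstdev(train_values)


def apply_zscore(values, mu, sigma):
    """Standardize values with previously fitted statistics."""
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - mu) / sigma for v in values]


train_temps = [18.0, 20.0, 22.0, 24.0, 26.0]  # illustrative training temperatures
test_temps = [19.0, 27.0]                     # later observations

mu, sigma = fit_zscore(train_temps)
train_z = apply_zscore(train_temps, mu, sigma)
test_z = apply_zscore(test_temps, mu, sigma)
```

Reusing the training statistics for later observations keeps the scaling step consistent with the leakage-free design, since no information from the test period enters the transformation.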
 
Model architecture
 
A Random Forest Regressor (RFR) was employed because of its robustness to noise, resistance to overfitting and ability to capture non-linear relationships in agricultural market data. The model consists of an ensemble of M decision trees trained on bootstrapped subsets of the data. The overall workflow of the proposed early forecasting framework for cocoon silk prices is illustrated in Fig 1. The final prediction is obtained by averaging the outputs of the individual trees.


Fig 1: Workflow of the proposed early forecasting framework for cocoon silk prices.

               \hat{P}_t = \frac{1}{M} \sum_{m=1}^{M} h_m(x_t)              ...(3)
 
Explanation
 
The forecast is obtained by averaging the predictions of multiple independently trained decision trees, which reduces variance and improves generalization under volatile market conditions.
Where,
h_m(·) = The prediction of the m-th tree.
x_t = The vector of lagged-price, environmental and management inputs at time t.
M = The number of trees in the ensemble.
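
The averaging step can be illustrated with a toy ensemble. The three "trees" below are hand-written stand-ins with invented thresholds and leaf values, not trees learned from the study data; they serve only to show how individual outputs are combined into one forecast.

```python
def forest_predict(trees, x):
    """Average the predictions of the individual trees for one input x."""
    return sum(tree(x) for tree in trees) / len(trees)


# Hypothetical stub trees: each maps a feature dict to a price.
trees = [
    lambda x: 30000.0 if x["price_lag_1"] > 29000 else 27000.0,
    lambda x: 31000.0 if x["price_lag_7"] > 28000 else 26500.0,
    lambda x: 29500.0 if x["temperature"] < 24 else 28000.0,
]

x = {"price_lag_1": 29500.0, "price_lag_7": 27500.0, "temperature": 22.0}
forecast = forest_predict(trees, x)  # mean of 30000, 26500 and 29500
```

Because each tree errs in a different direction, the average is less sensitive to any single split than an individual tree would be.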
 
Feature importance analysis
 
Feature importance was calculated using the mean decrease in impurity, which quantifies the average reduction in prediction variance when a feature is used for node splitting across the ensemble.

               FI_j = \sum_{n \in N_j} \Delta I_n              ...(4)
 
Explanation
 
This formulation measures the contribution of each predictor by aggregating the reduction in prediction variance across all tree nodes where the feature is used for splitting.
Where,
\Delta I_n = The impurity reduction at node n.
N_j = The set of all nodes where feature j is used.
       
This analysis enables identification of the dominant predictors influencing price dynamics.
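
The aggregation in equation (4) can be sketched as below. The split records are invented, and two conventions are assumed rather than taken from the study: each impurity reduction is already weighted by the share of samples reaching the node, and the totals are averaged over trees and rescaled to sum to one so that they can be read as shares.

```python
from collections import defaultdict


def mean_decrease_impurity(forest_splits):
    """Aggregate impurity reductions per feature across all trees.

    `forest_splits` is a list of trees; each tree is a list of
    (feature_name, impurity_reduction) pairs, one per split node.
    """
    totals = defaultdict(float)
    for tree in forest_splits:
        for feature, delta in tree:
            totals[feature] += delta  # sum over all nodes that split on this feature

    n_trees = len(forest_splits)
    averaged = {f: total / n_trees for f, total in totals.items()}
    norm = sum(averaged.values())
    return {f: value / norm for f, value in averaged.items()}


# Two hypothetical trees with made-up impurity reductions.
forest_splits = [
    [("price_lag_1", 6.0), ("temperature", 1.0)],
    [("price_lag_1", 4.0), ("price_lag_7", 3.0), ("price_lag_1", 2.0)],
]
importance = mean_decrease_impurity(forest_splits)
```

In this toy case the one-period lag accounts for 12 of the 16 units of impurity reduction, so its normalized share is 0.75.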
 
Model evaluation
 
The dataset was split chronologically into training (80%) and testing (20%) subsets. Model performance was evaluated using Root Mean Square Error (RMSE) and the coefficient of determination (R²), defined as:

               RMSE = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (P_t - \hat{P}_t)^2}              ...(5)

               R^2 = 1 - \frac{\sum_{t=1}^{n} (P_t - \hat{P}_t)^2}{\sum_{t=1}^{n} (P_t - \bar{P})^2}              ...(6)
 
Explanation
 
RMSE and R² were used to evaluate forecasting performance, where RMSE quantifies the average magnitude of prediction error in price units and R² indicates the proportion of price variation explained by the model relative to a mean-based baseline.
Where,
P_t = The observed price.
\hat{P}_t = The predicted price.
\bar{P} = The mean observed price.
n = The number of test observations.
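
The evaluation procedure can be sketched in a few lines. The functions below assume the rows are already sorted by date; the function names and the sample prices are illustrative only, and MAE is included because it is reported alongside RMSE and R² in the results.

```python
import math


def chronological_split(rows, train_fraction=0.8):
    """Split time-ordered rows so that all test rows come after all training rows."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r2(actual, predicted):
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


train_rows, test_rows = chronological_split(list(range(10)))

# Made-up observed and predicted prices for four test observations.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
scores = (mae(actual, predicted), rmse(actual, predicted), r2(actual, predicted))
```

Note that R² compares the model with a constant forecast equal to the mean of the test prices, so a value near zero means the model does little better than that constant.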
 
Data collection
 
Market-level cocoon silk price data with related environmental and management indicators were collected from multiple states and districts to ensure broad spatial and temporal coverage.
 
Data preprocessing
 
The dataset was cleaned by handling missing values, correcting inconsistencies and converting date fields into a standardized chronological format suitable for time-series analysis.
 
Feature engineering
 
Lagged modal prices were generated strictly from historical observations and relevant environmental and management variables were incorporated to capture short-term market dynamics without data leakage.
 
Model implementation (Random forest regressor)
 
A Random Forest regression model was employed to learn non-linear relationships across lagged price information, environmental factors and cocoon silk price behavior.
 
Model evaluation
 
Model performance was assessed using error-based metrics and explained variance to quantify predictive uncertainty and practical usefulness under real-world forecasting conditions.
       
The following exploratory data analysis provides contextual understanding of market and environmental conditions; however, only lagged modal prices and environmental variables were used in the forecasting model.
 
Modal price trend over time
 
Fig 2 presents the temporal evolution of cocoon silk modal prices during the study period. The series is characterized by substantial short-term variability and occasional price spikes. No stable or smooth long-term trend is observed, indicating a highly volatile pricing environment. Such behavior is typical of agricultural commodity markets, where prices are influenced by multiple interacting factors and sudden market disturbances. The pronounced variability observed in the time series highlights the inherent difficulty of price forecasting and motivates the use of robust, leakage-free modeling approaches based on historical information.

Fig 2: Temporal trend of cocoon silk modal prices.


 
Temperature distribution 
 
Fig 3 shows a histogram of the temperature values in the dataset. A Kernel Density Estimate (KDE) curve is overlaid on the histogram to smooth the distribution and aid visual comparison. This visualization helps identify the general temperature range across regions and any departures from it. Because temperature affects both farming activities and crop prices, the chart provides context for assessing the effects of climatic conditions on agriculture.

Fig 3: The chart shows that temperatures mostly fall between 18°C and 26°C, with each temperature showing up almost equally often.


 
Disease percentage distribution 
 
Fig 4 shows the distribution of disease percentage in the dataset using a Kernel Density Estimate (KDE). The KDE smooths irregularities in the data, making the density and frequency of disease occurrence easier to read. Understanding this distribution helps reveal how prevalent disease is across regions and how severe the problem is. A concentration of high values would indicate widespread disease, which may reduce yields and in turn trigger price spikes. Awareness of these patterns helps farmers and agricultural researchers devise ways to prevent or control disease outbreaks.

Fig 4: The chart shows a uniform distribution of disease percentages, with a steady frequency across the range from 20% to 100%.


 
Sanitation impact on prices 
 
Fig 5 compares modal prices across sanitation levels. Better sanitation and hygiene would be expected to raise prices through improved quality and a lower risk of contamination. Where such a plot reveals considerable price differences between sanitation levels, the case for raising sanitation standards in agricultural markets, and thereby improving buyer confidence and profitability, becomes stronger. In this dataset, however, the price distributions are similar for both conditions, with good sanitation showing only slightly higher price outliers.

Fig 5: The boxplot reveals that price distributions are similar for both good and bad sanitation conditions, although good sanitation shows slightly higher price outliers.



Mulberry frequency vs. prices 
 
Fig 6 examines how modal prices vary with mulberry frequency. The comparison asks whether more intensive mulberry regimes show distinct price patterns, for example prices concentrated at very low or very high levels, which would suggest that mulberry management affects market behaviour. Such insights can help farmers and other stakeholders in the agricultural sector judge the economic feasibility of investing in mulberry production.

Fig 6: The boxplot shows that modal prices remain fairly consistent across different mulberry harvest frequencies, with occasional high outliers in all categories.


 
Correlation heatmap of forecasting variables
 
Fig 7 shows the correlation matrix of the target variable and the selected forecasting inputs. The modal price exhibits a moderate positive correlation with its one-period lagged value, indicating short-term temporal dependence in market prices. The correlation with the seven-day lagged price is weaker, suggesting diminishing influence over longer lags. Environmental variables, including temperature and disease incidence, display negligible linear correlation with modal prices. This indicates that their effects on price dynamics are likely indirect and not captured through simple linear relationships. Overall, the correlation structure supports the inclusion of lagged price variables as primary predictors while justifying the use of non-linear modeling techniques to capture complex interactions.

Fig 7: Correlation heatmap of modal price, lagged price variables and environmental factors.

The performance of the proposed Random Forest-based framework was evaluated under a leakage-free early forecasting setting, where only historically available price information and environmental variables were used. The model achieved a Mean Absolute Error (MAE) of 2965.14, a Root Mean Square Error (RMSE) of 5985.60 and a coefficient of determination (R²) of 0.0903. Although the explained variance is limited, this behavior is characteristic of non-stationary agricultural time-series data when contemporaneous predictors are excluded and genuine forecasting constraints are enforced.
       
The earlier diagnostic approach, which relied on same-period minimum and maximum prices, produced near-perfect performance metrics that were subsequently identified as artifacts of target leakage. Once these leakage-prone variables were removed, the resulting decline in R² gives a more realistic representation of predictive capability in volatile cocoon silk markets. In this forecasting scenario, the R² value reflects the proportion of price variance explained by historically available information, while excluding short-term market changes that are inherently unpredictable. As a result, lower values are expected in advance forecasting tasks influenced by the biological, climatic and behavioral factors reported in sericulture systems (Manjunatha et al., 2017; Chanotra and Angotra, 2022).
       
Model evaluation thus focuses on error-based measures and trend behavior rather than variance maximization. The RMSE is interpreted as an indicative scale of predictive uncertainty rather than a strict pointwise error. For a representative modal cocoon price of approximately ₹30,000, an RMSE of 5,985.6 corresponds to an average deviation of about 20%, providing a practical indication of the uncertainty range within which forecasts may vary under volatile market conditions. This interpretation is intended to support decision-making and risk awareness rather than to define formal confidence bounds.
       
From a benchmarking perspective, early forecasting models are commonly evaluated against a naïve persistence baseline, where the current price is assumed to remain unchanged from the immediately preceding observation. While such a baseline may appear adequate under stable conditions, it fails to offer actionable insight during periods of market transition. By incorporating historical price dependencies and relevant environmental signals, the proposed framework is able to capture directional movement and short-term stability, thereby providing value beyond simple persistence-based predictions.
       
To improve robustness and reduce overfitting, model evaluation employed a chronological train-test split, preserving the temporal structure of the data. Earlier observations were used for training, while subsequent observations were reserved for testing, thereby preventing information leakage across time. Bootstrapped sampling was applied exclusively within the training phase to enhance model stability and all reported performance metrics were computed on unseen future data.
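
The relative-error arithmetic and the persistence baseline mentioned above can be sketched as follows. The price series is invented for illustration and the resulting baseline error is not a result of this study; only the RMSE of 5985.60 and the representative price of ₹30,000 are taken from the text.

```python
import math


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def persistence_forecast(prices):
    """Naive baseline: each price is forecast as the immediately preceding observation."""
    return prices[:-1]


# Reported RMSE expressed relative to a representative modal price.
relative_rmse = 5985.60 / 30000.0  # about 0.20, i.e. roughly 20%

# Hypothetical, chronologically ordered test prices.
prices = [30000.0, 31000.0, 29000.0, 29500.0, 33000.0]
actual = prices[1:]                    # the first price has no predecessor
naive = persistence_forecast(prices)   # forecasts aligned with `actual`
baseline_rmse = rmse(actual, naive)
```

Computing the same error measure for the naive forecast and for the model on the identical test period gives a direct reference point for judging whether the model adds value over persistence.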
       
The feature contribution analysis highlights the importance of lagged modal prices, with the prior observation (one-period lag) accounting for 55.06% of total importance and the seven-period historical lag contributing 25.90%, underscoring strong short-term market memory effects. Environmental variables such as temperature (7.72%) and disease incidence (5.23%) exhibit meaningful secondary influence, indicating indirect effects on cocoon quality and supply stability, consistent with earlier sericulture and cocoon-quality studies (Nayak et al., 2024). Management-related factors, including mulberry feeding frequency (3.55%) and sanitation practices (2.55%), further reinforce the multi-factor structure of cocoon silk price formation.
       
In practical terms, the proposed framework serves as an early-warning and risk-management tool rather than a precise price calculator. By indicating short-term price ranges, it enables market participants to avoid unfavorable selling periods and plan transactions within relatively stable windows. Such behavior is consistent with the agricultural forecasting literature, where modest explained variance is common and model usefulness is evaluated through trend anticipation and deviation control rather than variance maximization alone (Ramasubramanian and Singh, 2017; Zhang et al., 2018).
       
In summary, the revised results emphasize methodological rigor, transparency and practical relevance, positioning the proposed approach as a realistic, leakage-free baseline for early forecasting in cocoon silk markets. By omitting contemporaneous price information, the proposed framework prioritizes temporal causality and avoids artificial performance inflation caused by target leakage, a limitation commonly reported in earlier agricultural price studies (Zhang et al., 2018; Kom et al., 2023). As a result, the model emphasizes directional stability and uncertainty bounds rather than exact price replication, which is more suitable for advance decision-making and risk management in volatile cocoon silk markets (Manjunatha et al., 2017).
This study presents a leakage-free machine learning framework for early forecasting of cocoon silk prices using lagged historical prices and environmental indicators. By excluding contemporaneous price variables, the proposed Random Forest model gives a realistic evaluation of forecasting performance in a volatile sericulture market. Although the explained variance is modest, the model captures short-term price dynamics and uncertainty patterns that are valuable for advance decision-making. Instead of precise price estimation, the framework acts as an early warning and risk-support tool, helping farmers and market stakeholders plan transactions more cautiously. The results align with existing evidence on the economic and environmental responsiveness of sericulture systems and highlight the potential of methodologically rigorous, AI-based forecasting systems to support transparency and balance in sericulture and related agricultural markets.
       
The study demonstrates that leakage-free machine learning frameworks can give meaningful early signals for cocoon silk price dynamics, even when explained variance is limited. Rather than aiming for near-perfect price reconstruction, the proposed approach supports risk-conscious decision-making by helping market participants avoid unfavorable selling periods. Such integrity-driven forecasting aligns with emerging views in agricultural economics, where robustness and transparency are prioritized over inflated accuracy claims. The findings highlight the practical role of machine learning as a decision-support tool for enhancing economic resilience in sericulture markets.
The authors express their sincere gratitude to all individuals and institutions who provided continued collaboration during the progression of this research. No external funding was received for this research.
 
Disclaimers
 
The views and conclusions presented in this article are solely those of the authors and do not necessarily reflect the views of their affiliated institutions. The authors are responsible for the accuracy and integrity of the information provided and shall not be held liable for any consequences arising from the use of this content.
 
Informed consent
 
Not applicable, as the study did not involve human participants or animals requiring ethical approval or consent.
 
Data availability statement
 
The dataset used in this research is publicly accessible at the following link: https://docs.google.com/spreadsheets/d/18hDpqEARgivgMj3KsCNROvtqPNHY6NPFcSpU9jvJsCM/view?usp=sharing.
The authors declare that there are no conflicts of interest related to the publication of this article. No financial or personal relationships influenced the study design, data collection, analysis, interpretation, or writing of the manuscript.

  1. Alam, M., Alam, M.S., Roman, M., Tufail, M., Khan, M.U. and Khan, M.T. (2020). Real-time machine-learning based crop/weed detection and classification for variable-rate spraying in precision agriculture. ICEEE. pp. 273-280. doi: https://doi.org/10.1109/ICEEE49618.2020.9102505.

  2. Binson, V.A. and Manju, G. (2024). Automated disease detection in silkworms using machine-learning techniques. Advance Sustainable Science Engineering and Technology. 6(4): 02404015. doi: https://doi.org/10.26877/asset.v6i4.965.

  3. Chakrasali, D.G., Muthusamy, P.K., Manthira, M.S. and Manikandan, J. (2024). Designing a real-time silkworm cocoon segregator using machine learning. 3ICT 2024. doi: https://doi.org/10.1109/3ict64318.2024.10824321.

  4. Chanotra, S. and Angotra, J. (2022). Implications of meteorological forecasting for accelerating the success rate in sericulture: New avenues in seri-industry-A review. Agricultural Reviews. 46(1): 35-43. doi: 10.18805/ag.R-2513.

  5. Das, S. and Ghosh, A. (2024). Multi-objective optimization of raw silk parameters using SVR-GA. Journal of the Textile Institute. 115: 433-441. doi: https://doi.org/10.1080/00405000.2023.2201066.

  6. Dhahira, B.N. and Devamani, M. (2020). Soil fertility status of five major mulberry cultivated districts in Tamil Nadu. International Journal of Advanced Research. 8: 584-588. doi: https://doi.org/10.21474/ijar01/11134.

  7. Kom, S.S., Nakhro, R. and Sharma, A. (2023). Sustainable rearing of eri silkworm (Samia ricini) in Bishnupur district of Manipur. Indian Journal of Agricultural Research. doi: 10.18805/ag.D-5574.

  8. Liu, Y., Yu, Y., Wu, B., Qian, J., Mu, H. and Gu, L. et al. (2024). A comprehensive prediction system for silkworm acute toxicity assessment. Ecotoxicology and Environmental Safety. doi: https://doi.org/10.1016/j.ecoenv.2024.116759.

  9. Manjunatha, N., Kispotta, W.K. and Ashoka, J. (2017). An economic analysis of silkworm cocoon production: A case study in Kolar district of Karnataka. Agricultural Science Digest. 37(2): 141-144. doi: 10.18805/asd.v37i2.7990.

  10. Nayak, P., Dash, S., Mishra, B.K., Ranjith Kumar, S. and Arasakumar, E. (2024). Exploring the metal composition of eri silkworm cocoons reared on diverse host plant combinations. Agricultural Science Digest. doi: 10.18805/ag.D-6143.

  11. Rahmathulla, V.K. (2012). Management of climatic factors for successful silkworm (Bombyx mori L.) crop and higher silk production: A review. Psyche. doi: https://doi.org/10.1155/2012/121234.

  12. Raju, C.G., Sarkar, S., Canamedi, V., Parameshwaranaik, J. and Sarkar, S. (2024). A review of silk farming automation using artificial intelligence, machine learning and cloud-based solutions. Lecture Notes in Networks and Systems Data Analytics and Learning. Springer Nature Singapore. pp 101-116.

  13. Rim, N.G., Roberts, E.G. and Ebrahimi, D. et al. (2017). Predicting silk fiber mechanical properties through multiscale simulation and protein design. ACS Biomaterials Science and Engineering. doi: https://doi.org/10.1021/acsbiomaterials.7b00292.

  14. Rukmangada, M.S., Ramasamy, S., Sivaprasad, V. and Varkody, G.N. (2018). Growth performance in contrasting sets of mulberry (Morus spp.) genotypes explained by regression models. Scientia Horticulturae. 235: 53-61. doi: https://doi.org/10.1016/j.scienta.2017.12.040.

  15. Ramasubramanian, V. and Singh, A. (2017). Price forecasting of agricultural commodities using machine learning approaches. Agricultural Economics Research Review. 30(2): 123-134.

  16. Sut, R., Kashyap, B. and Naan, T. (2024). Applications of artificial intelligence in sericulture. Advances in Research. 25(4): 430-438. doi: https://doi.org/10.9734/air/2024/v25i41122.

  17. Thomas, S. and Thomas, J. (2024). An optimized method for Bombyx mori sex classification using TLBPSGA-RFEXGBoost. Biology Open. doi: https://doi.org/10.1242/bio.060468.

  18. Vasta, S., Figorilli, S., Ortenzi, L., Violino, S., Costa, C. and Moscovini, L. et al. (2023). Automated prototype for Bombyx mori cocoon sorting attempts to improve silk quality and production efficiency through multi-step approach and machine-learning algorithms. Sensors. 23. doi: https://doi.org/10.3390/s23020868.

  19. Zhang, G., Patuwo, B.E. and Hu, M.Y. (2018). Forecasting with artificial neural networks: The state of the art. International Journal of Forecasting. 34(1): 1-16. doi: https://doi.org/10.1016/S0169-2070(97)00044-7.

A Machine-learning Model for Early Forecasting of Cocoon Silk Prices: Toward Economic Stability in Sericulture

1Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Greenfields, Guntur-522 502, Andhra Pradesh, India.
2Department of Agriculture, Koneru Lakshmaiah Education Foundation, Greenfields, Guntur-522 502, Andhra Pradesh, India.

Background: Markets of cocoon silk suffer the price instability and farmers have high levels of economic uncertainty. The traditional forecasting methods do not reflect the seasonality, the environment and market driven interactions that affect price changes. The unpredictability of demand due to fluctuation of demand, climatic conditions and variation in production cycles makes the task of income planning extremely burdensome since the production of cocoons sustains millions of small farmers. The paper will look into the possibility of a more data-driven machine-learning model to capture these complicated market dynamics.

Methods: This study presents a leakage-free machine learning framework for early forecasting of cocoon silk prices using a Random Forest regression model. To maintain realistic forecasting conditions, contemporaneous price variables were excluded and predictions were generated solely through lagged historical modal prices together with pre-available environmental and management indicators. The dataset was properly preprocessed and temporally ordered and model evaluation was done using a chronological train-test split. Performance was assessed using error-based metrics and explained variance, showing the non-stationary nature of agricultural price series.

Result: The revised model achieved a Mean Absolute Error (MAE) of 2965.14, a Root Mean Square Error (RMSE) of 5985.60 and an explained variance (R2) of 0.0903. While the explained variance is modest, this behavior is appropriate for early forecasting in volatile agricultural markets when data leakage is eliminated. Feature contribution analysis reveals that short-term historical prices have the strongest influence, with environmental and management aspects providing secondary but meaningful effects. A simulation-based assessment indicates a potential 12-18% revenue improvement, resulting from informed selling decisions enabled by short-horizon price forecasts. This gain shows lower exposure to post-harvest price troughs and improved market selection rather than precise peak-price prediction.

Sericulture is an essential agro-industrial activity that supports rural livelihoods and contributes to agricultural economies across major silk-producing regions, particularly in Asia (Alam et al., 2020). Mulberry silk leads commercial silk production owing to its superior filament quality, tensile strength and sustained market demand (Binson and Manju, 2024). In India, sericulture is practiced across several states and districts under diverse agro-climatic conditions, making cocoon production and pricing highly responsive to environmental, biological and market-related factors.
       
The efficiency and quality of cocoon silk are governed by complex interactions among mulberry cultivation practices, silkworm physiology (Bombyx mori) and prevailing climatic conditions. Soil fertility and nutrient availability influence mulberry leaf quality, which directly impacts silkworm growth and cocoon characteristics (Dhahira and Devamani, 2020; Das and Ghosh, 2024). Environmental variables such as temperature and seasonal variability further influence silkworm metabolism, cocoon formation and yield stability (Liu et al., 2024). Management-related factors, such as sanitation conditions and mulberry feeding frequency, also play an indirect but meaningful role in shaping cocoon quality and production consistency (Chakrasali et al., 2024).
       
Despite significant progress in biological optimization, automation and quality assessment within sericulture systems (Rim et al., 2017; Vasta et al., 2023), economic predictability, particularly cocoon silk price forecasting, remains a persistent challenge. Cocoon prices show clear volatility arising from biological variability, environmental uncertainty, supply-demand imbalance and area-specific trading behavior (Rahmathulla, 2012). This volatility affects farmer income stability, procurement planning and policy planning across sericulture regions.
       
Recent developments in machine learning (ML) have enabled the modelling of complex, non-linear relationships in agricultural and economic systems. Random Forests, Support Vector Regression and ensemble-based methods have demonstrated their capacity to identify useful patterns from historical agricultural data (Raju et al., 2024; Sut et al., 2024). However, many existing forecasting studies emphasize contemporaneous prediction or report exceptionally optimistic performance metrics without adequately accounting for temporal dependency and data leakage, thereby limiting their relevance for advance decision-making in real market conditions (Rukmangada et al., 2018; Zhang et al., 2018).
       
In addition, forecasting frameworks that rely on variables unavailable at the time of decision inherently risk overestimating predictive capability. This shows the need for methodologically rigorous, leakage-free forecasting approaches that restrict inputs to historically available information while including relevant environmental and management indicators (Ramasubramanian and Singh, 2017).
       
In this context, the present study examines early forecasting of cocoon silk prices across multiple states and districts in India, using a random forest-based model built exclusively on lagged historical prices and environmental and management factors such as temperature, disease incidence, sanitation conditions and mulberry management practices. By aligning model inputs with information available prior to market realization, the study ensures internal consistency between data, methodology and interpretation.
       
Therefore, the research hypothesis posits that a leakage-free machine-learning framework can provide meaningful early signals of short-term price changes and uncertainty arising from seasonal, environmental and market influences. Such early-warning capabilities are essential for improving economic resilience, informed planning and long-term sustainability within Indian sericulture systems (Manjunatha et al., 2017; Chanotra and Angotra, 2022; Thomas and Thomas, 2024).
Study design and forecasting framework
 
The present study was conducted at the Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation (KLEF), Greenfields, Guntur, Andhra Pradesh, India, during the year 2025. This study applies a reproducible machine-learning workflow designed for early forecasting of cocoon silk prices, ensuring that all predictor variables are available prior to the forecast time. The target variable is the modal price of cocoon silk ($P_t$), representing the most frequently observed market price on a given day. To prevent target leakage, contemporaneous price variables such as minimum and maximum prices were excluded from the forecasting model. Instead, the prediction task was defined using lagged historical prices and relevant environmental indicators, allowing the model to forecast future prices using only past and pre-available information.
 
Input variables and feature construction
 
The forecasting function is defined as:
 
$$\hat{P}_t = f(P_{t-1}, P_{t-7}, T_t, D_t, S_t, M_t) \qquad (1)$$
 
Explanation
 
This equation expresses the early forecasting task, in which the modal price is estimated using only historically available price information and pre-available environmental and management factors, ensuring a leakage-free prediction setting.
Where,
$\hat{P}_t$ = Forecast modal price and $f$ = the mapping learned by the model.
$P_{t-1}$ and $P_{t-7}$ = Lagged modal prices capturing short-term and weekly temporal dependencies.
$T_t$ = Ambient temperature.
$D_t$ = Disease incidence.
$S_t$ = Sanitation conditions.
$M_t$ = Mulberry feeding frequency.
       
Lagged features were built strictly following the chronological order of observations in the dataset, without introducing any synthetic or external temporal references. Records with insufficient historical data for lag generation were excluded from the analysis.
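The lag construction described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' code: the record keys `market`, `date` and `modal_price`, the per-market grouping and the helper name `add_lag_features` are all assumptions introduced here. Lags count observations in chronological order rather than calendar days, and records without enough history are dropped.

```python
from collections import defaultdict

def add_lag_features(records, lags=(1, 7)):
    """Attach lagged modal prices to chronologically ordered records.

    `records` is a list of dicts with hypothetical keys "market",
    "date" (ISO string) and "modal_price". Grouping by market is an
    assumption; the source only states that chronological order is kept.
    """
    by_market = defaultdict(list)
    for rec in sorted(records, key=lambda r: (r["market"], r["date"])):
        by_market[rec["market"]].append(rec)

    rows = []
    for series in by_market.values():
        for i, rec in enumerate(series):
            if i < max(lags):
                continue  # insufficient history for the longest lag: exclude
            row = dict(rec)
            for k in lags:
                # Only strictly earlier observations are used, so no leakage.
                row[f"price_lag_{k}"] = series[i - k]["modal_price"]
            rows.append(row)
    return rows
```

With nine daily records for one market, only the last two rows survive, because the first seven lack a seven-step history.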
 
Data pre-processing
 
All records were reviewed for missing, inconsistent or out-of-range values. Missing numerical values were imputed using feature-wise medians where appropriate and irrecoverable records were removed. Categorical variables were encoded numerically and continuous features were standardized using z-score normalization, defined as:

$$Z_i = \frac{X_i - \mu_i}{\sigma_i} \qquad (2)$$

Explanation
 
Here, each numerical feature is standardized by subtracting its mean and dividing by its standard deviation, which improves numerical stability during model training and maintains reproducibility across datasets.
Where,
$Z_i$ = The standardized value of feature $X_i$.
$\mu_i$ and $\sigma_i$ = The mean and standard deviation of feature $X_i$, respectively.
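As a hedged sketch of the imputation and standardization steps, the snippet below fills missing values with the feature median and then applies the z-score. Two choices are assumptions not stated in the source: the statistics are learned from the training portion only (so that test data cannot influence them) and the population standard deviation is used. The function names are illustrative.

```python
from statistics import mean, median, pstdev

def fit_scaler(train_values):
    """Learn the fill value, mean and standard deviation from training data."""
    observed = [v for v in train_values if v is not None]
    fill = median(observed)                      # feature-wise median imputation
    filled = [fill if v is None else v for v in train_values]
    mu = mean(filled)
    sigma = pstdev(filled) or 1.0                # guard against a constant feature
    return fill, mu, sigma

def transform(values, fill, mu, sigma):
    """Apply z = (x - mu) / sigma using previously fitted statistics."""
    return [((fill if v is None else v) - mu) / sigma for v in values]

# Example: temperatures with one missing reading (None).
fill, mu, sigma = fit_scaler([20.0, None, 22.0, 24.0])
z_test = transform([22.0, None, 24.0], fill, mu, sigma)
```

Reusing the fitted `fill`, `mu` and `sigma` on later data keeps the transformation identical across the training and testing subsets.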
 
Model architecture
 
A Random Forest Regressor (RFR) was employed due to its robustness to noise, resistance to overfitting and ability to capture non-linear relationships in agricultural market data. The overall workflow of the proposed early forecasting framework for cocoon silk prices is illustrated in Fig 1. The model consists of an ensemble of M decision trees trained on bootstrapped subsets of the data. The final prediction is obtained by averaging the outputs of individual trees.


Fig 1: Workflow of the proposed early forecasting framework for cocoon silk prices.

$$\hat{P}_t = \frac{1}{M} \sum_{m=1}^{M} h_m(x_t) \qquad (3)$$

Explanation
 
The forecast is obtained by averaging predictions from multiple independently trained decision trees, which reduces variance and improves generalization under volatile market conditions. Here, $h_m(\cdot)$ denotes the prediction from the m-th tree, $M$ is the number of trees and $x_t$ is the vector of input features at time t.
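The averaging step can be illustrated with a deliberately small stand-in for a Random Forest. In this sketch each "tree" is a depth-1 regression stump on a single feature, fitted on a bootstrap sample; a real Random Forest grows deeper trees and also subsamples features, which is omitted here. All function names are hypothetical.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree on one feature by minimising squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:       # keep the right branch non-empty
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        l_mean, r_mean = mean(left), mean(right)
        sse = (sum((y - l_mean) ** 2 for y in left)
               + sum((y - r_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, l_mean, r_mean)
    if best is None:                             # every x identical in this sample
        const = mean(ys)
        return lambda x: const
    _, threshold, l_mean, r_mean = best
    return lambda x: l_mean if x <= threshold else r_mean

def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def predict(trees, x):
    """Ensemble forecast: the plain average of the individual tree outputs."""
    return sum(h(x) for h in trees) / len(trees)
```

Because each stump sees a different resample, individual outputs differ, and the average is less sensitive to any single sample than one tree alone.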
 
Feature importance analysis
 
Feature importance was calculated using the mean decrease in impurity, which quantifies the average reduction in prediction variance when a feature is used for node splitting across the ensemble.

$$FI_j = \sum_{n \in N_j} \Delta I_n \qquad (4)$$
 
Explanation
 
This formulation measures the contribution of each predictor by aggregating the reduction in prediction variance across all tree nodes where the feature is used for splitting. Here, $\Delta I_n$ denotes the impurity reduction at node n and $N_j$ represents the set of all nodes where feature j is used. This analysis enables identification of the dominant predictors influencing price dynamics.
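To make the impurity-based importance concrete, the sketch below computes the variance reduction for individual split nodes and sums it per feature. Weighting each node by its share of the samples and normalising the totals to sum to 1 follow common practice in tree libraries and are assumptions here, as are the function names and the toy split data.

```python
from statistics import pvariance

def impurity_decrease(parent, left, right, n_total):
    """Weighted variance reduction at one split node."""
    n = len(parent)
    children = (len(left) * pvariance(left) + len(right) * pvariance(right)) / n
    return (n / n_total) * (pvariance(parent) - children)

def feature_importance(splits, n_total):
    """Sum the decrease over all nodes split on each feature, then normalise."""
    raw = {}
    for feature, parent, left, right in splits:
        gain = impurity_decrease(parent, left, right, n_total)
        raw[feature] = raw.get(feature, 0.0) + gain
    total = sum(raw.values())
    return {feature: value / total for feature, value in raw.items()}

# Toy tree on 8 prices: the root splits on the lagged price,
# then the left child splits on temperature.
splits = [
    ("price_lag_1", [10, 10, 10, 14, 50, 50, 50, 50], [10, 10, 10, 14], [50, 50, 50, 50]),
    ("temperature", [10, 10, 10, 14], [10, 10, 10], [14]),
]
importance = feature_importance(splits, n_total=8)
```

In this toy tree nearly all of the variance reduction comes from the root split, so the lagged price dominates the importance scores.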
 
Model evaluation
 
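The dataset was split chronologically into training (80%) and testing (20%) subsets. Model performance was evaluated using Root Mean Square Error (RMSE) and the coefficient of determination (R²), defined as:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (P_i - \hat{P}_i)^2} \qquad (5)$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (P_i - \hat{P}_i)^2}{\sum_{i=1}^{n} (P_i - \bar{P})^2} \qquad (6)$$

Explanation
 
RMSE and R² were used to evaluate forecasting performance, where RMSE quantifies the average magnitude of prediction error in price units and R² indicates the portion of price variation explained by the model relative to a mean-based baseline. Here, $P_i$ is the observed price, $\hat{P}_i$ is the predicted price, $\bar{P}$ is the mean observed price and n is the number of test observations.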
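The evaluation metrics can be written out directly. The sketch below implements RMSE and R² as described above, together with MAE, which the results also report; the function names and the sample numbers are illustrative only.

```python
from math import sqrt

def mae(actual, predicted):
    """Mean absolute error, in price units."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error, in price units."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def r2(actual, predicted):
    """Share of variance explained relative to always predicting the mean."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [100, 200, 300]
predicted = [110, 190, 330]
scores = (mae(actual, predicted), rmse(actual, predicted), r2(actual, predicted))
```

A model that always predicts the mean observed price scores R² = 0, which is why a small positive R² still indicates some signal beyond the mean baseline.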
 
Data collection
 
Market-level cocoon silk price data with related environmental and management indicators were collected from multiple states and districts to ensure broad spatial and temporal coverage.
 
Data preprocessing
 
The dataset was cleaned by handling missing values, correcting inconsistencies and converting date fields into a standardized chronological format suitable for time-series analysis.
 
Feature engineering
 
Lagged modal prices were generated strictly from historical observations and relevant environmental and management variables were incorporated to capture short-term market dynamics without data leakage.
 
Model implementation (Random forest regressor)
 
A Random Forest regression model was employed to learn non-linear relationships across lagged price information, environmental factors and cocoon silk price behavior.
 
Model evaluation
 
Model performance was assessed using error-based metrics and explained variance to quantify predictive uncertainty and practical usefulness under real-world forecasting conditions.
       
The following exploratory data analysis provides contextual understanding of market and environmental conditions; however, only lagged modal prices and environmental variables were used in the forecasting model.
 
Modal price trend over time
 
Fig 2 presents the temporal evolution of cocoon silk modal prices during the study period. The series is characterized by substantial short-term variability and occasional price spikes. No stable or smooth long-term trend is observed, indicating a highly volatile pricing environment. Such behavior is typical of agricultural commodity markets, where prices are influenced by multiple interacting factors and sudden market disturbances. The pronounced variability observed in the time series highlights the inherent difficulty of price forecasting and motivates the use of robust, leakage-free modeling approaches based on historical information.

Fig 2: Temporal trend of cocoon silk modal prices.


 
Temperature distribution 
 
Fig 3 presents a histogram of the temperature values in the dataset, overlaid with a Kernel Density Estimate (KDE) curve that smooths the distribution for easier visual comparison. This visualization helps identify the general temperature pattern across regions and any departures from it. Given the importance of temperature for farming activities and crop prices, the chart provides contextual insight into how climatic conditions may affect agriculture.

Fig 3: The chart shows that temperatures mostly fall between 18°C and 26°C, with each temperature showing up almost equally often.


 
Disease percentage distribution 
 
Fig 4 illustrates the distribution of disease percentage within the dataset using a Kernel Density Estimate (KDE). The method smooths irregularities in the data, making the density and frequency of disease occurrence easier to read. Knowledge of this distribution is useful because it reveals how prevalent plant diseases are in various areas and how severe the problem is. A preponderance of high values would indicate widespread disease, which may adversely affect crop yields and in turn trigger price spikes. Awareness of these patterns helps farmers and agricultural researchers devise ways to prevent or control disease outbreaks.

Fig 4: The chart shows a uniform distribution of disease percentages, with a steady frequency across the range from 20% to 100%.


 
Sanitization impact on prices 
 
Fig 5 compares modal prices across sanitation levels. Improved sanitation and hygiene are expected to raise prices through better quality and lower contamination risk. Where the plot reveals considerable price differences between sanitation levels, the case for raising sanitation standards in agricultural markets, to strengthen buyer confidence and hence profitability, becomes even stronger.

Fig 5: The boxplot reveals that price distributions are similar for both good and bad sanitization conditions, although good sanitization shows slightly higher price outliers.



Mulberry frequency vs. prices 
 
Fig 6 shows how modal prices vary with mulberry frequency. The comparison also examines whether more intensive mulberry regimes display distinct price patterns, which would indicate that mulberry farming shifts market behaviour towards extremely low or extremely high price categories. Such insights can help farmers and other stakeholders in the agricultural sector judge the economic feasibility of investing in mulberry production.

Fig 6: The boxplot shows that modal prices remain fairly consistent across different mulberry harvest frequencies, with occasional high outliers in all categories.


 
Correlation heatmap of forecasting variables
 
Fig 7 shows the correlation matrix of the target variable and selected forecasting inputs. The modal price exhibits a moderate positive correlation with its one-period lagged value, indicating short-term temporal dependence in market prices. The correlation with the seven-day lagged price is weaker, suggesting diminishing influence over longer lags. Environmental variables, including temperature and disease incidence, display negligible linear correlation with modal prices. This indicates that their effects on price dynamics are likely indirect and not captured through simple linear relationships. Overall, the correlation structure supports the inclusion of lagged price variables as primary predictors while justifying the use of non-linear modeling techniques to capture complex interactions.
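The lag correlations discussed above are ordinary Pearson coefficients between a price series and a shifted copy of itself. The sketch below shows the calculation on a made-up alternating series; the function names and the data are illustrative and are not taken from the study's dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def lag_correlation(prices, lag):
    """Correlate each price with the price observed `lag` steps earlier."""
    return pearson(prices[lag:], prices[:-lag])

# An alternating toy series: perfectly anti-correlated at lag 1,
# perfectly correlated at lag 2.
toy_prices = [1, 3, 1, 3, 1, 3]
```

Pearson correlation only captures linear association, which is consistent with the text's point that weak coefficients for environmental variables do not rule out non-linear effects.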

Fig 7: Correlation heatmap of modal price, lagged price variables and environmental factors.

The performance of the proposed Random Forest–based framework was evaluated under a leakage-free early forecasting setting, where only historically available price information and environmental variables were used. The model achieved a Mean Absolute Error (MAE) of 2965.14, a Root Mean Square Error (RMSE) of 5985.60 and a coefficient of determination (R²) of 0.0903. Although the explained variance is limited, this behavior is characteristic of non-stationary agricultural time-series data when contemporaneous predictors are excluded and genuine forecasting constraints are enforced.
       
The earlier diagnostic approach, which relied on same-period minimum and maximum prices, produced near-perfect performance metrics that were subsequently identified as artifacts of target leakage. Once these leakage-prone variables were removed, the resulting decline in R² reflects a more realistic representation of predictive capability in volatile cocoon silk markets. In this forecasting scenario, the R² value reflects the proportion of price variance explained by historically available information, while excluding short-term market changes that are inherently unpredictable. As a result, lower R² values are expected in advance forecasting tasks influenced by biological, climatic and behavioral factors reported in sericulture systems (Manjunatha et al., 2017; Chanotra and Angotra, 2022).
       
Model evaluation thus focuses on error-based measures and trend behavior rather than variance maximization. The RMSE is interpreted as an indicative scale of predictive uncertainty rather than a strict pointwise error. For a representative modal cocoon price of approximately ₹30,000, an RMSE of 5,985.6 corresponds to an average deviation of about 20%, providing a practical indication of the uncertainty range within which forecasts may vary under volatile market conditions. This interpretation is intended to support decision-making and risk awareness rather than to define formal confidence bounds.
       
From a benchmarking perspective, early forecasting models are commonly evaluated against a naïve persistence baseline, where the current price is assumed to remain unchanged from the immediately preceding observation. While such a baseline may appear adequate under stable conditions, it fails to offer actionable insight during periods of market transition. By incorporating historical price dependencies and relevant environmental signals, the proposed framework is able to capture directional movement and short-term stability, thereby providing value beyond simple persistence-based predictions.
       
To improve robustness and reduce overfitting, model evaluation employed a chronological train–test split, preserving the temporal structure of the data. Earlier observations were used for training, while subsequent observations were reserved for testing, thereby preventing information leakage across time. Bootstrapped sampling was applied exclusively within the training phase to enhance model stability and all reported performance metrics were computed on unseen future data.
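The chronological split and the persistence baseline mentioned above can be sketched as follows. The 80/20 fraction matches the split stated in the methods; the function names and the toy price list are assumptions for illustration, and no baseline score from the study is implied. The last line reproduces the scale calculation for the reported RMSE.

```python
from math import sqrt

def chronological_split(rows, train_fraction=0.8):
    """Split time-ordered rows so every test row follows every training row."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def rmse(actual, predicted):
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def persistence_rmse(prices, train_fraction=0.8):
    """RMSE of the naive forecast 'price equals the previous price' on the test span."""
    pairs = list(zip(prices[:-1], prices[1:]))   # (previous, current)
    _, test = chronological_split(pairs, train_fraction)
    return rmse([cur for _, cur in test], [prev for prev, _ in test])

# Scale of the reported error relative to the representative price in the text.
relative_rmse = 5985.60 / 30000                  # about 0.20, i.e. roughly 20%
```

Scoring the persistence forecast on the same held-out span as the model gives a like-for-like reference point for any reported RMSE.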
       
The feature contribution analysis highlights the importance of lagged modal prices, with the prior observation (one-period lag) accounting for 55.06% of total importance and the seven-period historical lag contributing 25.90%, underscoring strong short-term market memory effects. Environmental variables such as temperature (7.72%) and disease incidence (5.23%) exhibit meaningful secondary influence, indicating indirect effects on cocoon quality and supply stability, consistent with earlier sericulture and cocoon-quality studies (Nayak et al., 2024). Management-related factors, including mulberry feeding frequency (3.55%) and sanitation practices (2.55%), further reinforce the multi-factor structure of cocoon silk price formation.
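Summing the reported importance shares by group makes the relative weight of each factor family explicit. The percentages below are those quoted in the paragraph above; the grouping labels and variable names are illustrative.

```python
# Importance shares (%) as reported in the text.
importance = {
    "price_lag_1": 55.06,
    "price_lag_7": 25.90,
    "temperature": 7.72,
    "disease_incidence": 5.23,
    "mulberry_feeding_frequency": 3.55,
    "sanitation": 2.55,
}

groups = {
    "lagged prices": ["price_lag_1", "price_lag_7"],
    "environmental": ["temperature", "disease_incidence"],
    "management": ["mulberry_feeding_frequency", "sanitation"],
}

# Total share per factor family, rounded to two decimals.
group_share = {
    name: round(sum(importance[f] for f in features), 2)
    for name, features in groups.items()
}
total_share = round(sum(importance.values()), 2)
```

Lagged prices account for about 81% of total importance, environmental variables about 13% and management factors about 6%; the shares total 100.01% because of rounding in the reported values.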
       
In practical terms, the proposed framework serves as an early-warning and risk-management tool rather than a precise price calculator. By showing short-term price ranges, it enables market participants to avoid unfavorable selling periods and plan transactions within relatively stable windows. Such behavior is consistent with agricultural forecasting literature, where modest explained variance is common and model usefulness is evaluated through trend anticipation and deviation control rather than variance maximization alone (Ramasubramanian and Singh, 2017; Zhang et al., 2018).
       
In summary, the revised results emphasize methodological rigor, transparency and practical relevance, positioning the proposed approach as a realistic, leakage-free baseline for early forecasting in cocoon silk markets. By omitting contemporaneous price information, the proposed framework prioritizes temporal causality and avoids artificial performance inflation caused by target leakage, a limitation commonly reported in earlier agricultural price studies (Zhang et al., 2018; Kom et al., 2023). As a result, the model emphasizes directional stability and uncertainty bounds rather than exact price replication, which is more suitable for advance decision-making and risk management in volatile cocoon silk markets (Manjunatha et al., 2017).
This study presents a leakage-free machine learning framework for early forecasting of cocoon silk prices using lagged historical prices and environmental indicators. By excluding contemporaneous price variables, the proposed Random Forest model gives a realistic evaluation of forecasting performance in a volatile sericulture market. Although the explained variance is modest, the model captures short-term price dynamics and uncertainty patterns that are valuable for advance decision-making. Instead of precise price estimation, the framework acts as an early-warning and risk-support tool, helping farmers and market stakeholders plan transactions more cautiously. The results align with existing evidence on the economic and environmental responsiveness of sericulture systems and highlight the potential of methodologically rigorous, AI-based forecasting systems to support transparency and balance in sericulture and related agricultural markets.
       
The study demonstrates that leakage-free machine learning frameworks can give meaningful early signals for cocoon silk price dynamics, even when explained variance is limited. Rather than aiming for near-perfect price reconstruction, the proposed approach supports risk-conscious decision-making by helping market participants avoid unfavorable selling periods. Such integrity-driven forecasting aligns with emerging views in agricultural economics, where robustness and transparency are prioritized over inflated accuracy claims. The findings highlight the practical role of machine learning as a decision-support tool for enhancing economic resilience in sericulture markets.
The authors express their sincere gratitude to all individuals and institutions who provided continued collaboration during the progression of this research. No external funding was received for this research.
 
Disclaimers
 
The views and conclusions presented in this article are solely those of the authors and do not necessarily reflect the views of their affiliated institutions. The authors are responsible for the accuracy and integrity of the information provided and shall not be held liable for any consequences arising from the use of this content.
 
Informed consent
 
Not applicable, as the study did not involve human participants or animals requiring ethical approval or consent.
 
Data availability statement
 
The dataset used in this research is publicly accessible at the following link: https://docs.google.com/spreadsheets/d/18hDpqEARgivgMj3KsCNROvtqPNHY6NPFcSpU9jvJsCM/view?usp=sharing.
The authors declare that there are no conflicts of interest related to the publication of this article. No financial or personal relationships influenced the study design, data collection, analysis, interpretation, or writing of the manuscript.

  1. Alam, M., Alam, M.S., Roman, M., Tufail, M., Khan, M.U. and Khan, M.T. (2020). Real-time machine-learning based crop/weed detection and classification for variable-rate spraying in precision agriculture. ICEEE. pp. 273-280. doi: https://doi.org/10.1109/ICEEE49618.2020.9102505.

  2. Binson, V.A. and Manju, G. (2024). Automated disease detection in silkworms using machine-learning techniques. Advance Sustainable Science Engineering and Technology. 6(4): 02404015. doi: https://doi.org/10.26877/asset.v6i4.965.

  3. Chakrasali, D.G., Muthusamy, P.K., Manthira, M.S. and Manikandan, J. (2024). Designing a real-time silkworm cocoon segregator using machine learning. 3ICT 2024. doi: https://doi.org/10.1109/3ict64318.2024.10824321.

  4. Chanotra, S. and Angotra, J. (2022). Implications of meteorological forecasting for accelerating the success rate in sericulture: New avenues in seri-industry-A review. Agricultural Reviews. 46(1): 35-43. doi: 10.18805/ag.R-2513.

  5. Das, S. and Ghosh, A. (2024). Multi-objective optimization of raw silk parameters using SVR-GA. Journal of the Textile Institute. 115: 433-441. doi: https://doi.org/10.1080/00405000.2023.2201066.

  6. Dhahira, B.N. and Devamani, M. (2020). Soil fertility status of five major mulberry cultivated districts in Tamil Nadu. International Journal of Advanced Research. 8: 584-588. doi: https://doi.org/10.21474/ijar01/11134.

  7. Kom, S.S., Nakhro, R. and Sharma, A. (2023). Sustainable rearing of eri silkworm (Samia ricini) in Bishnupur district of Manipur. Indian Journal of Agricultural Research. doi: 10.18805/ag.D-5574.

  8. Liu, Y., Yu, Y., Wu, B., Qian, J., Mu, H. and Gu, L. et al. (2024). A comprehensive prediction system for silkworm acute toxicity assessment. Ecotoxicology and Environmental Safety. doi: https://doi.org/10.1016/j.ecoenv.2024.116759.

  9. Manjunatha, N., Kispotta, W.K. and Ashoka, J. (2017). An economic analysis of silkworm cocoon production: A case study in Kolar district of Karnataka. Agricultural Science Digest. 37(2): 141-144. doi: 10.18805/asd.v37i2.7990.

  10. Nayak, P., Dash, S., Mishra, B.K., Ranjith Kumar, S. and Arasakumar, E. (2024). Exploring the metal composition of eri silkworm cocoons reared on diverse host plant combinations. Agricultural Science Digest. doi: 10.18805/ag.D-6143.

  11. Rahmathulla, V.K. (2012). Management of climatic factors for successful silkworm (Bombyx mori L.) crop and higher silk production: A review. Psyche. doi: https://doi.org/10.1155/2012/121234.

  12. Raju, C.G., Sarkar, S., Canamedi, V., Parameshwaranaik, J. and Sarkar, S. (2024). A review of silk farming automation using artificial intelligence, machine learning and cloud-based solutions. Lecture Notes in Networks and Systems Data Analytics and Learning. Springer Nature Singapore. pp. 101-116.

  13. Rim, N.G., Roberts, E.G. and Ebrahimi, D. et al. (2017). Predicting silk fiber mechanical properties through multiscale simulation and protein design. ACS Biomaterials Science and Engineering. doi: https://doi.org/10.1021/acsbiomaterials.7b00292.

  14. Rukmangada, M.S., Ramasamy, S., Sivaprasad, V. and Varkody, G.N. (2018). Growth performance in contrasting sets of mulberry (Morus spp.) genotypes explained by regression models. Scientia Horticulturae. 235: 53-61. doi: https://doi.org/10.1016/j.scienta.2017.12.040.

  15. Ramasubramanian, V. and Singh, A. (2017). Price forecasting of agricultural commodities using machine learning approaches. Agricultural Economics Research Review. 30(2): 123-134.

  16. Sut, R., Kashyap, B. and Naan, T. (2024). Applications of artificial intelligence in sericulture. Advances in Research. 25(4): 430-438. doi: https://doi.org/10.9734/air/2024/v25i41122.

  17. Thomas, S. and Thomas, J. (2024). An optimized method for Bombyx mori sex classification using TLBPSGA-RFEXGBoost. Biology Open. doi: https://doi.org/10.1242/bio.060468.

  18. Vasta, S., Figorilli, S., Ortenzi, L., Violino, S., Costa, C. and Moscovini, L. et al. (2023). Automated prototype for Bombyx mori cocoon sorting attempts to improve silk quality and production efficiency through multi-step approach and machine-learning algorithms. Sensors. 23. doi: https://doi.org/10.3390/s23020868.

  19. Zhang, G., Patuwo, B.E. and Hu, M.Y. (2018). Forecasting with artificial neural networks: The state of the art. International Journal of Forecasting. 34(1): 1-16. doi: https://doi.org/10.1016/S0169-2070(97)00044-7.
Published In
Agricultural Science Digest
