Legume Research

Legume Research, volume 47 issue 1 (January 2024): 38-44

Establishment of Detection Model of Soybean Quality Traits by Near Infrared Spectroscopy

Weiran Gao1, Ronghan Ma1, Aohua Jiang1, Jiaqi Liu1, Pingting Tan1, Fang Liu1, Jian Zhang1,*
1College of Agronomy and Biotechnology, Southwest University, Chongqing 400715, China.
  • Submitted: 20-07-2023

  • Accepted: 25-09-2023

  • First Online: 04-01-2024

  • doi: 10.18805/LRF-760

Cite article: Gao Weiran, Ma Ronghan, Jiang Aohua, Liu Jiaqi, Tan Pingting, Liu Fang, Zhang Jian (2024). Establishment of Detection Model of Soybean Quality Traits by Near Infrared Spectroscopy. Legume Research. 47(1): 38-44. doi: 10.18805/LRF-760.

Background: Rapid prediction of quality traits with near infrared (NIR) spectroscopy has recently become popular because it is convenient and simple to operate. To make good use of this technology, however, precise and suitable calibration equations are essential for dependable results. This study focuses on building such equations and on how spectral pre-treatments affect them.

Methods: Near infrared (NIR) spectroscopy was used to simultaneously predict five quality traits of soybean: oil, protein, oleic acid, linoleic acid and stearic acid content. NIR spectral data were collected from a total of 112 samples of soybean materials in Chongqing. Samples were scanned from 1000 nm to 2500 nm using a monochromator instrument (SupNIR-2700). Calibration equations were developed from the NIR data using partial least squares (PLS) regression with internal cross validation. We also examined how different pre-treatments affect the calibration equations for each quality trait, measuring the effect with three indicators: R, SECV and RPD.

Result: We identified the most suitable combination of pre-treatments for the calibration equation of each soybean quality trait. The present study lays a foundation for the rapid detection of quality traits in soybean.

Improvement of quality traits has always been one of the most important targets in crop breeding. In soybean [Glycine max (L.) Merr.], oil content (Clemente and Cahoon, 2009), protein content (Medic and Atkinson, 2014) and oil composition are especially important. Soybean oil is mainly composed of five fatty acids: palmitic, stearic, oleic, linoleic and linolenic acids (Kinney and Knowlton, 1998). Oleic acid is a monounsaturated fatty acid that provides oxidative stability for increased shelf life, heat stability in cooking and health benefits (Zambelli, 2020). Thus, improvement of these quality traits is desirable.

At present, classical chemical methods are still needed to measure the quality traits of crops, such as the Kjeldahl method for protein content and the Soxhlet extraction method for oil content (Jung and Rickert, 2003). Although these methods have the advantages of low cost, simple operation and high accuracy, they are time-consuming and labor-intensive and cause environmental pollution. By contrast, near infrared spectroscopy takes little time and has become increasingly popular and widely used since it was developed (Cortes and Blasco, 2019). Nowadays, near-infrared spectroscopy is commonly applied to analyze and measure various quality traits of crops. It can be operated fully automatically, with relatively little human error during the entire measurement process, and it offers high precision and good reproducibility. It is widely used not only in crop quality prediction but also in biological, medical and other fields (McClure, 2003; Nicolaï and Beullens, 2007). The technology does require a large amount of preliminary work, such as sample collection, calibration and equation establishment. Once the model is established, unknown samples can be measured using the model and their collected spectra. Data on the required chemical components of a sample can be obtained within 1 minute without damaging or polluting the test sample.

Near infrared (NIR) light, in the wavelength range of 780-2526 nm, carries information on the relative proportions of C-H, N-H and O-H bonds, which are the primary structural components of organic molecules (Nikolić and Jović, 2007). By analyzing the correlation between sample composition, as determined by defined reference chemical methods, and the absorption of light at different NIR wavelengths, measured in a given environment on a given instrument, a calibration equation can be built once and then reused.

However, the NIR method is not perfect. First, the consistency of every scan cannot be guaranteed, because many factors affect the spectrum (Qiao and Mu, 2021). Under different environmental conditions, the near-infrared spectrum of soybeans will differ slightly. Therefore, to ensure accuracy, it is necessary to build a specific equation suited to the environment.

The representativeness of the samples determines the effectiveness of modeling. To obtain representative samples, 112 soybean materials with significant differences in quality traits were selected for this experiment in the laboratory; 10 samples were randomly selected as the validation set for the calibration models of protein and oil content. The samples were planted in the summer of 2022 in Chongqing and sown in single rows, with a row length of 1 m, a row width of 0.5 m and a plant spacing of 0.2 m. All samples received general field management.


An RT-01A 50G Western Medicine Crusher was used for grinding the materials. A YG-2 Soxhlet extractor, a K1100 fully automatic Kjeldahl nitrogen analyzer and an Agilent Technologies gas chromatography (GC) system were used for the determination of quality traits. A SupNIR-2700 system (including detector and software) was used for the scanning and analysis of spectra.

Near infrared scanning

Samples were scanned with the SupNIR-2700 system in the range of 1000-1799 nm. Each sample was scanned three times. A reference scan was taken once every 30 sample scans. Samples were temperature-equilibrated at 26°C in the instrument before scanning. Spectral data were collected using the SupNIR-2700 software.

Chemometrics and data analysis

Partial least squares (PLS) regression is a typical linear algorithm that combines features of principal component analysis and canonical correlation analysis (Geladi and Kowalski, 1986).
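As a rough illustration of PLS regression with internal cross validation, the following is a minimal pure-Python sketch of single-response PLS (PLS1, with NIPALS-style deflation) evaluated by leave-one-out cross validation. It is not the SupNIR-2700 software's implementation. The function names, the leave-one-out scheme, the number of components and the toy data are all assumptions made for illustration.

```python
from math import sqrt

def pls1_fit(X, y, n_components):
    """Fit a PLS1 model by successive deflation (illustrative sketch)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered spectra
    f = [v - y_mean for v in y]                                # centered reference values
    weights, loadings, coefs = [], [], []
    for _ in range(n_components):
        # Weight vector: direction in X most covariant with the current residual f.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = sqrt(sum(v * v for v in w))
        if norm < 1e-12:  # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        if tt < 1e-12:
            break
        p_load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate X and y by the part explained by this component.
        E = [[E[i][j] - t[i] * p_load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        weights.append(w)
        loadings.append(p_load)
        coefs.append(q)
    return x_mean, y_mean, weights, loadings, coefs

def pls1_predict(model, x):
    """Predict one sample by repeating the same deflation steps."""
    x_mean, y_mean, weights, loadings, coefs = model
    e = [x[j] - x_mean[j] for j in range(len(x))]
    y_hat = y_mean
    for w, p_load, q in zip(weights, loadings, coefs):
        t = sum(e[j] * w[j] for j in range(len(e)))
        y_hat += q * t
        e = [e[j] - t * p_load[j] for j in range(len(e))]
    return y_hat

def loo_secv(X, y, n_components):
    """Root mean square error of leave-one-out predictions."""
    residuals = []
    for k in range(len(X)):
        model = pls1_fit(X[:k] + X[k + 1:], y[:k] + y[k + 1:], n_components)
        residuals.append(y[k] - pls1_predict(model, X[k]))
    return sqrt(sum(r * r for r in residuals) / len(residuals))

# Hypothetical toy data: two "wavelengths" and a noise-free linear trait.
X_toy = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0], [6.0, 4.0]]
y_toy = [1.0 + 2.0 * a + 3.0 * b for a, b in X_toy]
print(loo_secv(X_toy, y_toy, n_components=2))
```

With real spectra the number of components would itself be chosen by cross validation, since too many components fit noise and too few miss signal.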

To assess the quality of the calibration equation under different pre-treatments (wavelength range, derivative, normalization and standardization), three indicators were collected: the standard error in cross validation (SECV), the coefficient of determination in calibration (R) and the residual predictive deviation (RPD), which respectively represent the accuracy, sensitivity and stability of the equation. Considering these three parameters together, with priority given to SECV (Cozzolino and Kwiatkowski, 2004), we can pick out the most suitable pre-treatments for the calibration equation of each quality trait. MSC (multiplicative scatter correction) is a pre-treatment intended to reduce scatter effects by correcting every single spectrum through a univariate linear regression against the average spectrum. SNV (standard normal variate transform) is a pre-treatment that corrects each single spectrum based on its own variance. Both are common methods for reducing noise. To measure how well the calibration model could predict the traits, we used the residual predictive deviation (RPD). The RPD is defined as the standard deviation (S.D.) of the population's reference values divided by the standard error in cross validation for the NIRS calibrations. If the error for estimating a constituent (SECV) is large compared to the spread of that trait in all samples (S.D.), a relatively small RPD is calculated, demonstrating that the NIR calibration model is not robust. In contrast, relatively high RPD values indicate models with greater power to predict the chemical composition. Generally, an RPD greater than three can be considered dependable for prediction purposes.
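The three indicators can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: SECV is taken as the root mean square of the cross-validation residuals (some software divides by n - 1 instead), R is computed as the Pearson correlation between reference and predicted values, and the input values are hypothetical, not data from this study.

```python
from math import sqrt
from statistics import mean, stdev

def calibration_metrics(reference, predicted):
    """Return (R, SECV, RPD) for reference values and cross-validated predictions."""
    n = len(reference)
    # SECV: root mean square of the cross-validation residuals (assumed divisor n).
    secv = sqrt(sum((ref - pred) ** 2 for ref, pred in zip(reference, predicted)) / n)
    # RPD: spread of the reference values relative to the prediction error.
    rpd = stdev(reference) / secv
    # R: Pearson correlation between reference and predicted values (assumption).
    ref_mean, pred_mean = mean(reference), mean(predicted)
    cov = sum((ref - ref_mean) * (pred - pred_mean)
              for ref, pred in zip(reference, predicted))
    ss_ref = sum((ref - ref_mean) ** 2 for ref in reference)
    ss_pred = sum((pred - pred_mean) ** 2 for pred in predicted)
    r_value = cov / sqrt(ss_ref * ss_pred)
    return r_value, secv, rpd

# Hypothetical reference values and cross-validated predictions.
toy_reference = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
r_value, secv, rpd = calibration_metrics(toy_reference, toy_predicted)
print(f"R={r_value:.3f}  SECV={secv:.3f}  RPD={rpd:.2f}  dependable={rpd > 3}")
```

Because RPD scales the error by the spread of the trait, two traits with the same SECV can have very different RPD values if one varies much more across the sample set.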
Testing results of quality traits for samples by traditional methods
The quality traits of the samples used for building the equations were tested by traditional methods: the Kjeldahl method for protein content, the Soxhlet extraction method for oil content and gas chromatography (GC) for each fatty acid (Keller, 1961). The results are shown in Table 1.

Table 1: Descriptive statistics of the samples used for the development of calibrations for the prediction of five traits.

The collection of near infrared spectroscopy
Each sample was scanned three times; the original spectra are shown in Fig 1. Of the 112 samples, 102 were used for calibration (training) and 10 for validation. Each repeat scan was paired with the value measured by the traditional method to reduce error. As a result, we have 306 calibration data pairs and 30 validation data pairs.
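The counts above follow from splitting at the sample level and then expanding each sample into its repeat scans. A minimal sketch of that bookkeeping is given below. The function name, the integer sample IDs and the random seed are hypothetical, and each (sample, repeat) key stands in for a spectrum that would be paired with that sample's reference value.

```python
import random

N_SAMPLES = 112        # soybean materials in the study
N_VALIDATION = 10      # samples held out for validation
SCANS_PER_SAMPLE = 3   # repeat scans per sample

def split_and_pair(sample_ids, n_validation, scans_per_sample, seed=0):
    """Split samples into calibration/validation, then list one entry per repeat scan."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    validation_ids = set(rng.sample(sample_ids, n_validation))
    calibration, validation = [], []
    for sample_id in sample_ids:
        target = validation if sample_id in validation_ids else calibration
        for repeat in range(1, scans_per_sample + 1):
            target.append((sample_id, repeat))
    return calibration, validation

calibration_pairs, validation_pairs = split_and_pair(
    list(range(1, N_SAMPLES + 1)), N_VALIDATION, SCANS_PER_SAMPLE
)
print(len(calibration_pairs), len(validation_pairs))  # 306 30
```

Splitting by sample rather than by scan keeps all repeats of a sample on the same side, so the validation set never contains a scan of a material the model was calibrated on.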

Fig 1: The original scanned infrared spectrum of soybean seeds.

The building of calibration equation and the effect of different pre-treatment
The original spectrum of a sample contains information on the quality traits, but it also carries a lot of distracting information caused by scattering and other factors. Pre-treatments are therefore needed to adjust the spectrum, enhancing the relevant signal and suppressing the irrelevant signal. Not every treatment has a positive effect on every quality trait, so different traits may need different pre-treatments: a pre-treatment that enhances the information on oil content may weaken the signal of protein content. In short, pre-treatments are mathematical methods that adjust the spectrum to extract as much information from it as possible. The results are shown in Table 2, and the different effects of the pre-treatments are shown in Fig 1-5. Within the same spectral range, R differs between quality traits, so different ranges of the spectrum evidently carry different information. Table 2 also shows that the pre-treatments strongly affect the equations; the pre-treatments that yield the best parameters are probably also the best combination for cleaning the signal. All combinations of pre-treatments shown in Table 2 include Savitzky-Golay smoothing, which is not listed in the table. In the table, four variables are varied to find the best pre-treatment combination for each quality trait.
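Two of the simpler pre-treatments can be sketched as follows. This is an illustrative sketch only: the first derivative is approximated by a plain finite difference, which is an assumption, since the text does not state how the derivative was computed, and Savitzky-Golay smoothing is not implemented here. The function names and toy spectra are hypothetical.

```python
def mean_center(spectra):
    """Subtract the mean spectrum (per-wavelength mean) from every spectrum."""
    n = len(spectra)
    column_means = [sum(column) / n for column in zip(*spectra)]
    return [[value - m for value, m in zip(row, column_means)] for row in spectra]

def first_difference(spectrum, step=1.0):
    """Approximate the first derivative by differences of adjacent points.

    `step` is the wavelength spacing; the result is one point shorter.
    """
    return [(b - a) / step for a, b in zip(spectrum, spectrum[1:])]

# Hypothetical absorbance values at four wavelengths for three scans.
toy_spectra = [
    [0.10, 0.20, 0.40, 0.30],
    [0.15, 0.25, 0.45, 0.35],
    [0.05, 0.15, 0.35, 0.25],
]
centered = mean_center(toy_spectra)
derivative = first_difference(toy_spectra[0], step=2.0)
print(centered[0])
print(derivative)
```

A derivative removes a constant baseline offset from each spectrum, while mean centering removes what all spectra have in common, which is one reason the two can help different traits to different degrees.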

Table 2: The effect of different combination of pre-treatments.

Fig 2: The range of R of the original spectrum on linoleic acid content.

Fig 3: The range of R of the original spectrum on oleic acid content.

Fig 4: The effect on R of the most suitable pre-treatments on linoleic acid content.

Fig 5: The effect on R of the most suitable pre-treatments on oleic acid content.

Considering mainly SECV, we find that the best combination of pre-treatments for oil content is first derivative, MSC and mean centering in the range of 1000-1799 nm. For protein content, it is SNV and mean centering in the range of 1140-1760 nm. For oleic acid content, it is mean centering alone in the range of 1000-1799 nm. For linoleic acid content, it is first derivative and mean centering in the range of 1000-1799 nm. For stearic acid content, the original spectrum in the range of 1140-1760 nm gives the best parameters. These calibration equations and the preliminary discussion may benefit both soybean breeding and chemometrics. In addition, Fig 2-5 offer an interesting clue: the R curves for oleic acid content and linoleic acid content are completely opposite. Even after pre-treatment, R over the same range still behaves this way. We therefore have reason to conclude that there is a competitive relationship between linoleic acid content and oleic acid content. The difference between the curves might be caused by the structural difference between unsaturated and saturated bonds.

In addition, most of the best pre-treatment combinations contain SNV rather than MSC. SNV corrects each single spectrum using only the variance within that spectrum, whereas MSC corrects each spectrum through a univariate linear regression that considers the relation between all the spectra and the average spectrum. So we can say that if the noise is predictable and ordered, MSC has an advantage; if not, SNV is better.

Given the parameters in this study, we may say that the noise in the scanning process might not be regular.
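The contrast between the two corrections can be made concrete with a minimal pure-Python sketch. It follows the descriptions given in the text (SNV uses only the statistics of the single spectrum; MSC regresses each spectrum on the average spectrum) and is not the instrument software's implementation. The function names and toy spectra are hypothetical.

```python
from statistics import mean, stdev

def snv(spectrum):
    """Standard normal variate: center and scale one spectrum by its own statistics."""
    m, s = mean(spectrum), stdev(spectrum)
    return [(value - m) / s for value in spectrum]

def msc(spectra):
    """Multiplicative scatter correction against the average spectrum.

    Each spectrum is fitted as intercept + slope * average_spectrum by
    univariate least squares, then corrected to (spectrum - intercept) / slope.
    """
    n = len(spectra)
    reference = [sum(column) / n for column in zip(*spectra)]
    ref_mean = mean(reference)
    sxx = sum((r - ref_mean) ** 2 for r in reference)
    corrected = []
    for spectrum in spectra:
        spec_mean = mean(spectrum)
        slope = sum((r - ref_mean) * (v - spec_mean)
                    for r, v in zip(reference, spectrum)) / sxx
        intercept = spec_mean - slope * ref_mean
        corrected.append([(v - intercept) / slope for v in spectrum])
    return corrected

# Hypothetical spectra: one base shape with different offsets and scalings,
# mimicking a purely regular (additive and multiplicative) scatter effect.
base = [1.0, 2.0, 4.0, 3.0]
toy_scatter = [base, [2.0 * v + 1.0 for v in base], [0.5 * v - 0.2 for v in base]]

msc_corrected = msc(toy_scatter)
snv_corrected = [snv(s) for s in toy_scatter]
print(msc_corrected[1])
print(snv_corrected[1])
```

In this toy case the scatter is perfectly regular, so both methods collapse the three spectra onto a common shape. MSC depends on the average of the whole set, so irregular spectra in the set can distort the correction of every other spectrum, whereas SNV treats each spectrum independently.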
Author contributions

Conceptualization, Jian Zhang; Methodology, Ronghan Ma and Weiran Gao; Software, Weiran Gao; Validation, Aohua Jiang; Formal analysis, Jiaqi Liu; Resources, Jian Zhang; Data curation, Pingting Tan and Fang Liu; Writing-original draft preparation, Weiran Gao; Writing-review and editing, Weiran Gao and Jian Zhang; Visualization, Jian Zhang and Weiran Gao; Supervision, Jian Zhang; Project administration, Jian Zhang; Funding acquisition, Jian Zhang. All authors have read and agreed to the published version of the manuscript.


Funding

This study was supported by the Chongqing Technology Innovation and Application Development Special Key Project (cstc2021jscx-gksbX0011); Collection, Utilization and Innovation of Germplasm Resources by Research Institutes and Enterprises of Chongqing (cqnyncw-kqlhtxm); and the Southwest University experimental technology research project (syj2023005).

Institutional review board statement

Not applicable.

Data availability statement

The data presented in this study are available from the first author upon request.

Conflict of interest

The authors declare no conflict of interest.

References

  1. Clemente, T.E., Cahoon, E.B. (2009). Soybean oil: Genetic approaches for modification of functionality and total content. Plant Physiology. 151(3): 1030-1040. DOI: 10.2307/40537933.

  2. Cortes, V., Blasco, J., Aleixos, N., Cubero, S., Talens, P. (2019). Monitoring strategies for quality control of agricultural products using visible and near-infrared spectroscopy: A review. Trends Food Sci. Technol. 85: 138-148. DOI: 10.1016/j.tifs.2019.01.015.

  3. Cozzolino, D., Kwiatkowski, M., Parker, M., Cynkar, W., Dambergs, R., Gishen, M., Herderich, M. (2004). Prediction of phenolic compounds in red wine fermentations by visible and near infrared spectroscopy. Analytica Chimica Acta. 513(1): 73-80. DOI: 10.1016/j.aca.2003.08.066.

  4. Geladi, P., Kowalski, B.R. (1986). Partial least-squares regression: A tutorial. Analytica Chimica Acta. 185: 1-17.

  5. Jung, S., Rickert, D.A., Deak, N.A., Aldin, E.D., Recknor, J., Johnson, L.A., Murphy, P.A. (2003). Comparison of Kjeldahl and Dumas methods for determining protein contents of soybean products. Journal of the American Oil Chemists’ Society. 80: 1169-1173. DOI: 10.1007/s11746-003-0837-3.

  6. Keller, R.A. (1961). Gas chromatography. Scientific American. 205(4): 58-67.

  7. Kinney, A.J., Knowlton, S. (1998). Designer Oils: The High Oleic Acid Soybean. In: Genetic Modification in the Food Industry. [Roller, S., Harlander, S. (eds)]. Blackie Academic, London. pp 193-213.

  8. McClure, W.F. (2003). 204 years of near infrared technology: 1800-2003. Journal of Near Infrared Spectroscopy. 11(6): 487-518. DOI: 10.1255/jnirs.399.

  9. Medic, J., Atkinson, C., Hurburgh, C.R.J. (2014). Current knowledge in soybean composition. Journal of the American Oil Chemists’ Society. 91(3): 363-384.

  10. Nicolaï, B.M., Beullens, K., Bobelyn, E., Peirs, A., Saeys, W., Theron, K.I., Lammertyn, J. (2007). Nondestructive measurement of fruit and vegetable quality by means of NIR spectroscopy: A review. Postharvest Biology and Technology. 46(2): 99-118. DOI: 10.1016/j.postharvbio.2007.06.02.

  11. Nikolić, A., Jović, B., Csanady, S., Petrović, S. (2007). N-H...O Hydrogen bonding: FT IR, NIR and 1H NMR study of N-Methylpropionamide-Cyclic ether systems. J. Mol. Struct. 834-836: 249-252. DOI: 10.1016/j.molstruc.2006.11.003.

  12. Qiao, L., Mu, Y., Lu, B., Tang, X. (2021). Calibration maintenance application of near-infrared spectrometric model in food analysis. Food Rev. Int. 1-17. DOI: 10.1080/87559129.2021.1935999.

  13. Zambelli, A. (2020). Current status of high oleic seed oils in food processing. Journal of the American Oil Chemists’ Society. 98: 129-137. DOI: 10.1002/aocs.12450.
