Pattern recognition in tracing the origins of wolfberries
- Abstract
Wolfberry is known for its medicinal and health benefits. Some merchants misrepresent the quality of wolfberries, which underscores the urgency of reliable identification methods. Consumers' growing concern over food origins likewise calls for rapid and accurate identification to guide the distribution of wolfberry resources and establish quality standards. Consequently, machine learning approaches are gaining popularity due to their precision and non-destructive nature in identifying various wolfberry varieties.
This study focused on the use of near-infrared spectroscopy (NIR) combined with pattern recognition methods, which verified its great potential in distinguishing wolfberries from different origins. The study found that the main differences in the spectrum of wolfberry were significantly reflected in the peaks and troughs in specific wavelength ranges such as 1222–1279 nm, 1431–1461 nm, and 1529–1574 nm, but these subtle differences were difficult to directly identify with the naked eye. By introducing machine learning algorithms related to pattern recognition, these spectral differences can be effectively identified and distinguished. The overall research process can be summarized into three key steps: spectral preprocessing, feature extraction, and classifier construction, each of which can adopt different methods to address different technical challenges and subtle problems. This study summarizes the efficient solutions for origin traceability of wolfberry.
Keywords: wolfberry, NIR, machine learning algorithms, pattern recognition
- Introduction
Wolfberry is the dried, mature fruit of Lycium barbarum L. in the family Solanaceae. It is a well-known medicinal material and health care product with anti-tumor, anti-aging, lipid-lowering, antioxidant and other effects. Studies have shown that the contents of polyphenols, sugars, vitamins, amino acids and other chemical components differ greatly between wolfberry fruits of different varieties or origins. These differences affect taste, flavor and quality to varying degrees, so market competitiveness differs as well. Some merchants try to sell second-rate goods at top-quality prices, which harms the market. Consumers are also increasingly concerned about the origins of the foods they eat. Rapid and accurate identification of different varieties of wolfberry is therefore in high demand, both to target the delivery of wolfberry resources and to play an important role in setting quality standards. At present, the methods used to identify different varieties of wolfberry mainly include sensory judgment and chemical analysis. Visual and sensory judgment usually has a large error, and chemical analysis requires the sample to be crushed and extracted, which destroys it. Algorithm-based methods are therefore becoming popular because they are accurate and non-destructive.
De Géa Neves et al. (2022) used near-infrared spectroscopy (NIR) and chemometric tools to classify potential adulterants in plant-based protein powders. To this end, an OC-PLS (one-class partial least squares) model was used for authentication, and a PLS2-DA (partial least squares discriminant analysis) model was used to classify the adulterants. Variable Importance in Projection (VIP) scores were used to identify the most relevant variables in PLS2-DA and the spectral ranges responsible for each class. The results show that combining the one-class (OC-PLS) method with the multi-class (PLS-DA) method and near-infrared spectroscopy is a promising approach for studying plant protein powders.
Shen Tingting et al. (2016) investigated the feasibility of using near-infrared spectroscopy and chemometrics to analyze samples of Chinese wolfberry from four different geographical regions. LS-SVM (least squares support vector machine) was used to build the discriminant model for the geographical origin of wolfberry. Compared with the artificial neural network (ANN) and K-nearest neighbor (KNN) methods, the LS-SVM algorithm showed better generalization of the recognition results. Since the total flavonoid content (TFC) of wolfberry is highly correlated with its quality, a TFC prediction model was constructed using synergy interval partial least squares (Si-PLS). This work shows that NIR spectroscopy in combination with LS-SVM and Si-PLS has great potential as a fast and effective technique to assess the quality of retail goji berries.
Li Yahui, Zou Xiaobo et al. (2016) used near-infrared (NIR) spectroscopy and chemometrics to determine quickly and efficiently the geographic origin and characteristic categories of five varieties of black wolfberry. During model development, synergy interval partial least squares (Si-PLS), linear discriminant analysis (LDA), K-nearest neighbor (KNN), backpropagation artificial neural network (BP-ANN) and least squares support vector machine (LS-SVM) were systematically evaluated and compared. Compared with the other models, LS-SVM achieved a recognition rate of more than 98.18%, and its recognition results showed excellent generalization. A prediction model of total anthocyanin content was established using Si-PLS. The model was optimized by leave-one-out cross-validation, and its performance was evaluated by the root mean square error of prediction (RMSEP) and the correlation coefficient (R(t)) in the prediction set. The overall results show that the combination of spectroscopy and Si-PLS regression tools has the potential to distinguish the varieties of black wolfberry successfully.
The studies mentioned above used similar methods to identify the origins or species of wolfberry. This essay reviews current methods that can efficiently distinguish between wolfberries of different origins.
- A professional device: NIR
Near-infrared spectroscopy (NIR) has been successfully applied to the identification of foods of different varieties and origins because it is fast, non-destructive and accurate. The interest in NIR spectroscopy lies in its advantages over alternative instrumental techniques: it can record spectra for solid and liquid samples with no pretreatment, implement continuous methodologies, provide spectra quickly and predict physical and chemical parameters from a single spectrum. These attributes make it especially attractive for straightforward, speedy characterization of samples (Blanco & Villarroya, 2002).
Fig. 1 Portable NIR scanner
The picture shows that the device is quite portable, so it can be applied flexibly and makes quick inspection convenient.
Here are the spectra of wolfberry of different origins:
Fig. 2 wolfberry spectra
The valleys and peaks are at 1222–1279 nm, 1431–1461 nm and 1529–1574 nm. These are where the major differences lie.
However, it is practically impossible to distinguish between the spectra by visual inspection. Once the NIR scanner has delivered the spectral data to a computer, algorithms take over to process the data.
- Pattern recognition
Pattern recognition is a data analysis method that uses machine learning algorithms to automatically recognize patterns and regularities in data. This data can be anything from text and images to sounds or other definable qualities. Pattern recognition systems can recognize familiar patterns quickly and accurately. They can also recognize and classify unfamiliar objects, recognize shapes and objects from different angles, and identify patterns and objects even if they are partially obscured. These characteristics show that pattern recognition is well suited to wolfberry classification.
Here are the basic steps of pattern recognition:
Existing spectral preprocessing methods can be divided into four categories according to their effect: baseline correction, scatter correction, smoothing and scaling.
Common baseline correction algorithms include the derivative method, iterative polynomial fitting, piecewise fitting, moving-window smoothing, wavelet transform, and penalized least squares.
- Derivative Spectroscopy
Calculating first-, second-, or higher-order derivatives highlights the details of spectral curves, which helps separate overlapping peaks and eliminate the effects of baseline drift.
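A minimal sketch of the idea, assuming NumPy and a synthetic spectrum (the band position, width and offset are made-up values, not measured wolfberry data). It uses simple finite differences via `np.gradient`; the helper name `spectral_derivative` is hypothetical. The point illustrated is that a constant baseline shift disappears after the first derivative.

```python
import numpy as np

def spectral_derivative(spectrum, wavelengths, order=1):
    """Return the order-th derivative of a spectrum by repeated finite differences."""
    result = np.asarray(spectrum, dtype=float)
    for _ in range(order):
        result = np.gradient(result, wavelengths)
    return result

# Synthetic example: one absorption band, with and without a constant baseline offset.
wavelengths = np.linspace(1000.0, 1700.0, 351)
band = np.exp(-((wavelengths - 1450.0) / 40.0) ** 2)
shifted = band + 0.3  # same band, shifted baseline

d_band = spectral_derivative(band, wavelengths)
d_shifted = spectral_derivative(shifted, wavelengths)

# The two derivative spectra coincide, so the offset has been removed.
offset_removed = np.allclose(d_band, d_shifted)
print(offset_removed)
```

A second derivative (`order=2`) would additionally remove a linearly sloping baseline, at the cost of amplifying noise, which is why derivatives are often combined with smoothing.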
- Standard normal variate (SNV)
SNV subtracts the mean μ of each spectrum from that spectrum and then divides the result by the standard deviation σ of the spectrum (scaling), which essentially standardizes each spectrum individually.
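As an illustration, SNV can be sketched in a few lines of NumPy. The function name `snv` and the toy absorbance values are assumptions made for this example. The second row is the first row multiplied by 2 and shifted by 0.8, so both rows should become identical after correction.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) individually."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

# Toy data: two "spectra" that differ only by a gain and an offset.
raw = np.array([[0.2, 0.4, 0.6, 0.8],
                [1.2, 1.6, 2.0, 2.4]])
corrected = snv(raw)
print(corrected)
```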
- Multiplicative scatter correction (MSC)
MSC can effectively eliminate spectral differences caused by different scattering levels, thereby enhancing the correlation between the spectra and the data of interest. The method regresses each spectrum against a reference ("ideal") spectrum, usually the mean spectrum, and uses the fitted offset and slope to correct the baseline translation and shift of the spectral data. It is particularly suitable for near-infrared (NIR) data processing of solid and powder samples.
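The correction can be sketched as follows, assuming NumPy, the mean spectrum as the reference, and synthetic spectra that differ only by an additive offset and a multiplicative gain. The function name `msc` and all numeric values are illustrative assumptions.

```python
import numpy as np

def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference spectrum.

    Each spectrum is modelled as intercept + slope * reference and is then
    corrected to (spectrum - intercept) / slope.
    """
    spectra = np.asarray(spectra, dtype=float)
    if reference is None:
        reference = spectra.mean(axis=0)  # mean spectrum as the "ideal" spectrum
    corrected = np.empty_like(spectra)
    for i, spectrum in enumerate(spectra):
        slope, intercept = np.polyfit(reference, spectrum, deg=1)
        corrected[i] = (spectrum - intercept) / slope
    return corrected, reference

# Synthetic spectra: one common band distorted by different offsets and gains.
wavelengths = np.linspace(1000.0, 1700.0, 200)
base = np.exp(-((wavelengths - 1450.0) / 50.0) ** 2)
offsets = np.array([[0.00], [0.10], [-0.05]])
gains = np.array([[1.0], [1.3], [0.8]])
spectra = offsets + gains * base

corrected, reference = msc(spectra)
# With purely additive/multiplicative distortion, every corrected spectrum
# collapses onto the reference spectrum.
print(np.allclose(corrected, reference))
```

When new samples are processed later, the reference computed from the calibration set would normally be reused by passing it as `reference`.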
- Savitzky-Golay filter (SG)
The Savitzky-Golay filter is a digital filter that smooths a set of data points by fitting a low-degree polynomial to the points in a moving window by least squares. It reduces noise while largely preserving the trend and width of the signal. The filtering is realized through convolution, and the window width can be chosen to meet different smoothing needs. It is especially useful for sequential data such as time series or spectra.
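A short sketch using `scipy.signal.savgol_filter` on a synthetic noisy band. The window length of 11 points and polynomial order of 2 are arbitrary choices for illustration, not values recommended by the studies reviewed here.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)

# Synthetic spectrum: one smooth band plus random noise.
wavelengths = np.linspace(1000.0, 1700.0, 351)
clean = np.exp(-((wavelengths - 1450.0) / 60.0) ** 2)
noisy = clean + rng.normal(scale=0.02, size=clean.shape)

# Fit a quadratic in each 11-point window and keep the fitted centre value.
smoothed = savgol_filter(noisy, window_length=11, polyorder=2)

# Compare the error against the known clean signal before and after smoothing.
rmse_noisy = np.sqrt(np.mean((noisy - clean) ** 2))
rmse_smoothed = np.sqrt(np.mean((smoothed - clean) ** 2))
print(rmse_noisy, rmse_smoothed)
```

A wider window smooths more strongly but starts to flatten narrow peaks, so in practice the window is usually tuned against validation results.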
- Min-Max scaling
Assume Xmin and Xmax are the minimum and maximum values of attribute A. Min-max scaling maps a raw value x of A to a value x' in the interval [0, 1] using x' = (x - Xmin) / (Xmax - Xmin).
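For illustration, min-max scaling subtracts the minimum of the attribute and divides by its range. The sketch below assumes NumPy; the function name `min_max_scale` and the absorbance values are made up for the example.

```python
import numpy as np

def min_max_scale(values):
    """Map the values of one attribute linearly onto the interval [0, 1]."""
    values = np.asarray(values, dtype=float)
    v_min, v_max = values.min(), values.max()
    if v_max == v_min:
        raise ValueError("min-max scaling is undefined for a constant attribute")
    return (values - v_min) / (v_max - v_min)

absorbance = np.array([0.12, 0.30, 0.48, 0.21])
scaled = min_max_scale(absorbance)
print(scaled)
```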
- Principal component analysis (PCA) (Duda et al., 2004)
PCA is one of the most widely used dimensionality reduction algorithms. Its main idea is to map the n-dimensional features onto k new orthogonal features (k < n), known as principal components, which are reconstructed from the original n-dimensional features along the directions of greatest variance. PCA works well when the main differences in the data are captured by the variance. However, because it is an unsupervised algorithm that uses no class information when calculating the principal components, the retained components are not guaranteed to be the most discriminative ones for a classification problem.
Linear discriminant analysis (LDA) is a supervised dimensionality reduction technique. It seeks a projection direction (a straight line in the two-class case) such that, when the samples in the dataset are projected onto it, the projections of samples from the same class are as close together as possible and the projections of samples from different classes are as far apart as possible, thereby achieving dimensionality reduction or classification of high-dimensional data. LDA generally works well when the classes differ mainly in their means. In addition, LDA has a large number of derivative algorithms.
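A compact way to see the mapping from n to k dimensions is PCA via the singular value decomposition. The sketch below assumes NumPy and uses random numbers as a stand-in for a matrix of 30 spectra by 100 wavelengths; the function name `pca` is hypothetical.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its first n_components principal components."""
    X = np.asarray(X, dtype=float)
    X_centred = X - X.mean(axis=0)
    # Rows of Vt are orthonormal directions ordered by decreasing variance.
    _, singular_values, Vt = np.linalg.svd(X_centred, full_matrices=False)
    components = Vt[:n_components]
    scores = X_centred @ components.T
    explained_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
    return scores, components, explained_ratio[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 100))  # random stand-in for 30 spectra x 100 wavelengths
scores, components, ratio = pca(X, n_components=3)
print(scores.shape, ratio)
```

Note that no class labels enter the computation, which is the unsupervised property discussed above.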
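For the two-class case, the projection direction can be sketched with Fisher's classical solution, w proportional to Sw^-1 (m1 - m0), where Sw is the within-class scatter matrix and m0, m1 are the class means. The code assumes NumPy and synthetic two-dimensional data; the function name `fisher_direction` is hypothetical.

```python
import numpy as np

def fisher_direction(X, y):
    """Unit vector maximizing between-class over within-class scatter (two classes)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter matrix: sum of the scatter of each class around its mean.
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(3.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

w = fisher_direction(X, y)
z = X @ w  # one-dimensional projection of every sample
print(z[y == 0].mean(), z[y == 1].mean())
```

With spectra, the number of wavelengths often exceeds the number of samples, so Sw becomes singular; this is the small sample size problem that motivates variants of LDA.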
- Constrained linear discriminant analysis (CLDA) (Du & Ren, 2003)
CLDA was proposed for hyperspectral image detection and classification, together with a real-time implementation. The basic idea of CLDA is to design an optimal transformation matrix that maximizes the ratio of inter-class distance to intra-class distance while constraining the class centers to lie along different directions after the transformation, so that different classes can be better separated.
- Direct linear discriminant analysis (DLDA) (Wu et al., 2004)
DLDA accepts high-dimensional data (such as raw images) as input and optimizes Fisher's criterion directly, without a separate feature extraction or dimensionality reduction step; that is, the PCA transformation that other techniques require to reduce the dimensionality is not needed. It can also overcome the small sample size problem.
However, the DLDA algorithm retains only the range space of the inter-class scatter matrix and discards its null space, which leads to the loss of some useful information and reduces the classification ability of the whole model. Paliwal and Sharma overcame this shortcoming of DLDA while retaining its significant advantages, which resulted in improved direct LDA (IDLDA) (Paliwal & Sharma, 2010).
Here are the simplified implementation steps of IDLDA:
Table 1 Simplified implementation steps of the improved DLDA algorithm
- Classifiers
- Naive Bayes
Naive Bayes applies Bayes' theorem, under the assumption that the features are conditionally independent given the class, to calculate the probability that a data point belongs to a given category.
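One common variant for continuous features such as spectral intensities is Gaussian naive Bayes, which models each feature within each class as a normal distribution. The sketch below assumes NumPy and synthetic data; the class name `GaussianNaiveBayes` is made up for the example and is not taken from the studies reviewed here.

```python
import numpy as np

class GaussianNaiveBayes:
    """Naive Bayes with one Gaussian per feature and per class."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # A tiny constant keeps the variances strictly positive.
        self.vars_ = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes_])
        self.log_priors_ = np.log(np.array([np.mean(y == c) for c in self.classes_]))
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        diff = X[:, None, :] - self.means_[None, :, :]
        # Summing per-feature log-likelihoods is the independence assumption.
        log_likelihood = -0.5 * np.sum(
            np.log(2.0 * np.pi * self.vars_)[None, :, :] + diff ** 2 / self.vars_[None, :, :],
            axis=2,
        )
        return self.classes_[np.argmax(log_likelihood + self.log_priors_, axis=1)]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(5.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = GaussianNaiveBayes().fit(X, y)
accuracy = np.mean(model.predict(X) == y)
print(accuracy)
```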
- Support vector machines(SVM)
The SVM algorithm seeks a separating hyperplane. Given labeled training samples, SVM training finds the hyperplane that maximizes the perpendicular distance (the margin) between the hyperplane and the closest training samples of the two classes.
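A small sketch with scikit-learn's `SVC` on a hand-made, linearly separable data set. The very large `C` is an assumption used here to approximate a hard-margin SVM; the data points are illustrative only. For a linear SVM the margin width equals 2 / ||w||, where w is the normal vector of the hyperplane.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)  # large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]                      # normal vector of the separating hyperplane
margin_width = 2.0 / np.linalg.norm(w)
predictions = clf.predict(X)

print("support vectors:", clf.support_vectors_)
print("margin width:", margin_width)
```

Only the support vectors, the samples closest to the hyperplane, determine its position; the remaining samples could be removed without changing the result.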
- K-Nearest Neighbor Algorithm(KNN)
The K-Nearest Neighbor algorithm, also known as KNN, is a nonparametric, supervised learning method that uses proximity to classify or predict the grouping of an individual data point. It can be used for both classification and regression problems. KNN is considered particularly suitable for multi-modal problems (where objects have multiple class labels), where it can perform better than SVM.
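The proximity-based vote can be sketched directly in NumPy. The function name `knn_predict`, the Euclidean distance and the toy points labelled "A" and "B" are assumptions for illustration.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Classify each query point by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    y_train = np.asarray(y_train)
    predictions = []
    for query in X_query:
        distances = np.linalg.norm(X_train - query, axis=1)  # Euclidean distance
        nearest = np.argsort(distances)[:k]
        votes = Counter(y_train[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)

X_train = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
           [4.0, 4.0], [4.2, 3.9], [3.8, 4.1]]
y_train = ["A", "A", "A", "B", "B", "B"]

predicted = knn_predict(X_train, y_train, [[1.1, 1.0], [4.1, 4.0]], k=3)
print(predicted)
```

Because KNN stores the training set and defers all computation to prediction time, there is no explicit training step; the main choices are k and the distance measure.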
- Decision Tree
A decision tree is a supervised learning algorithm that is well suited to classification problems. It works like a flowchart: starting from the "trunk", each node splits the data points into two groups according to a feature threshold, passing through "branches" down to "leaves", so that the samples within each group become progressively more alike. With decision trees, categories can be created within categories with limited human supervision.
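The flowchart-like structure can be made visible with scikit-learn's `DecisionTreeClassifier` and `export_text`. The two features `band_1` and `band_2`, the class labels and all values are hypothetical stand-ins for spectral features, chosen so that axis-aligned splits separate the three groups.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: three groups that can be separated by thresholds on two features.
X = np.array([[0.10, 0.80], [0.15, 0.75], [0.20, 0.85],
              [0.60, 0.30], [0.65, 0.35], [0.70, 0.25],
              [0.60, 0.80], [0.65, 0.85], [0.70, 0.75]])
y = np.array(["A", "A", "A", "B", "B", "B", "C", "C", "C"])

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# Print the learned splits as nested if/else rules.
rules = export_text(tree, feature_names=["band_1", "band_2"])
train_accuracy = tree.score(X, y)
print(rules)
```

Limiting `max_depth` is one simple way to keep the tree from memorizing noise in the training spectra.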
- BP Neural Network
BP (backpropagation) networks can learn and store a large number of input-output pattern mappings without the mathematical equations that describe these mappings being specified in advance. Their learning rule uses the steepest descent method to continuously adjust the weights and thresholds of the network through backpropagation, so that the sum of squared errors of the network is minimized.
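The learning rule can be sketched with a tiny network written in NumPy: one hidden layer of four sigmoid units, a sigmoid output, and plain gradient descent. The layer size, learning rate, epoch count and synthetic data are arbitrary assumptions. The squared error is averaged over the samples here, which only rescales the learning rate compared with the summed error.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Synthetic two-class data: two well-separated clusters.
X = np.vstack([rng.normal(-2.0, 0.5, size=(20, 2)),
               rng.normal(2.0, 0.5, size=(20, 2))])
y = np.array([0.0] * 20 + [1.0] * 20).reshape(-1, 1)

# Weights and thresholds (biases) of a 2-4-1 network.
W1 = rng.normal(scale=0.5, size=(2, 4))
b1 = np.zeros((1, 4))
W2 = rng.normal(scale=0.5, size=(4, 1))
b2 = np.zeros((1, 1))

learning_rate = 0.5
losses = []
for epoch in range(2000):
    # Forward pass.
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    error = output - y
    losses.append(0.5 * np.mean(error ** 2))

    # Backward pass: propagate the error from the output layer to the hidden layer.
    delta_out = error * output * (1.0 - output) / len(X)
    delta_hidden = (delta_out @ W2.T) * hidden * (1.0 - hidden)

    # Steepest-descent update of weights and thresholds.
    W2 -= learning_rate * hidden.T @ delta_out
    b2 -= learning_rate * delta_out.sum(axis=0, keepdims=True)
    W1 -= learning_rate * X.T @ delta_hidden
    b1 -= learning_rate * delta_hidden.sum(axis=0, keepdims=True)

accuracy = np.mean((output > 0.5) == (y > 0.5))
print(losses[0], losses[-1], accuracy)
```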
- The general technical route
Summarizing the existing methods gives an approximate technical route for tracing the origin of wolfberry. The wolfberry samples are divided into a training set, a validation set and a test set through stratified sampling. To ensure the accuracy of the spectral information, targeted preprocessing is carried out in four stages: baseline correction, scatter correction, smoothing and scaling. Specifically, the first derivative is calculated (higher derivatives can be calculated as well, depending on the severity of the baseline drift) to sharpen the spectral details and greatly reduce the impact of baseline drift. Since wolfberry is a solid sample, MSC is the most suitable scatter correction. A filter is then applied to smooth the data, the remaining high-noise part of the spectrum is removed, and the remaining spectral bands are taken for analysis. Finally, standard deviation normalization is used to standardize the resulting spectral data matrix, and feature extraction models and classifiers are established and compared.
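The route above can be sketched end to end on synthetic spectra. Everything specific in this sketch is an assumption made for illustration: the three "origins" are simulated with a band at a different wavelength each, the split is 60/20/20, PCA with five components stands in for the feature extraction step, a linear SVM stands in for the classifier, and the removal of high-noise bands is omitted. It relies on NumPy, SciPy and scikit-learn.

```python
import numpy as np
from scipy.signal import savgol_filter
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
wavelengths = np.linspace(1000.0, 1700.0, 200)
centres = [1250.0, 1450.0, 1550.0]  # hypothetical origin-specific bands

def make_spectra(centre, n):
    """Simulate n spectra with random gain (scatter), offset (baseline) and noise."""
    base = 0.5 + 0.3 * np.exp(-((wavelengths - centre) / 20.0) ** 2)
    gain = rng.uniform(0.8, 1.2, size=(n, 1))
    offset = rng.uniform(-0.1, 0.1, size=(n, 1))
    noise = rng.normal(scale=0.005, size=(n, wavelengths.size))
    return offset + gain * base + noise

X = np.vstack([make_spectra(c, 30) for c in centres])
y = np.repeat([0, 1, 2], 30)

# Stratified split into training (60%), validation (20%) and test (20%) sets.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

def first_derivative(S):
    return np.gradient(S, wavelengths, axis=1)  # baseline correction

def msc(S, reference):
    out = np.empty_like(S)
    for i, s in enumerate(S):
        slope, intercept = np.polyfit(reference, s, deg=1)
        out[i] = (s - intercept) / slope        # scatter correction
    return out

# The MSC reference is computed from the training set only.
reference = first_derivative(X_train).mean(axis=0)

def preprocess(S):
    corrected = msc(first_derivative(S), reference)
    return savgol_filter(corrected, window_length=7, polyorder=2, axis=1)  # smoothing

# Scaling, feature extraction and classification.
model = make_pipeline(StandardScaler(), PCA(n_components=5), SVC(kernel="linear"))
model.fit(preprocess(X_train), y_train)

val_accuracy = model.score(preprocess(X_val), y_val)
test_accuracy = model.score(preprocess(X_test), y_test)
print(val_accuracy, test_accuracy)
```

In a real study the validation set would be used to compare alternative preprocessing, feature extraction and classifier choices, and the test set would be evaluated only once at the end.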
- Conclusion
In this research, it was verified that pattern recognition based on NIR has high potential to distinguish between wolfberries of different origins. The major differences between wolfberry spectra lie in the valleys and peaks at 1222–1279 nm, 1431–1461 nm and 1529–1574 nm, but it is hard to tell the differences by simple observation. By applying machine learning algorithms related to pattern recognition, these differences can be easily recognized. The overall technical route of the research can be summarized in three steps: spectral preprocessing, feature extraction and classifier construction. Each step includes various methods that can deal with different subtle problems.
- References
De Géa Neves, M., Poppi, R. J., & Breitkreitz, M. C. (2022). Authentication of plant-based protein powders and classification of adulterants as whey, soy protein, and wheat using FT-NIR in tandem with OC-PLS and PLS-DA models. Food Control, 132, 108489.
Tingting, S., Xiaobo, Z., Jiyong, S., Zhihua, L., Xiaowei, H., Yiwei, X., & Wu, C. (2015). Determination geographical origin and flavonoids content of Goji Berry using near-infrared spectroscopy and Chemometrics. Food Analytical Methods, 9(1), 68–79.
Yahui, L., Xiaobo, Z., Tingting, S., Jiyong, S., Jiewen, Z., & Holmes, M. (2016). Determination of geographical origin and anthocyanin content of Black Goji Berry (Lycium Ruthenicum Murr.) using near-infrared spectroscopy and Chemometrics. Food Analytical Methods, 10(4), 1034–1044.
Arm Ltd. (n.d.). What is pattern recognition. https://www.arm.com/glossary/pattern-recognition
Wikimedia Foundation. (2024a, January 8). Near-infrared spectroscopy. Wikipedia. https://en.wikipedia.org/wiki/Near-infrared_spectroscopy
Blanco, M., & Villarroya, I. (2002). NIR spectroscopy: A rapid-response analytical tool. TrAC Trends in Analytical Chemistry, 21(4), 240–250.
Cevikalp, H., & Wilkes, M. (2004). Face recognition by using discriminative common vectors. Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004.
Duda, R. O., Hart, P. E., & Stork, D. G. (2004). Pattern classification. Wiley.
Du, Q., & Ren, H. (2003). Real-time constrained linear discriminant analysis to target detection and classification in hyperspectral imagery. Pattern Recognition, 36(1), 1–12.
Wu, X.-J., Kittler, J., Yang, J.-Y., Messer, K., & Wang, S. (2004). A new direct LDA (D-LDA) algorithm for feature extraction in face recognition. Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004.
Paliwal, K. K., & Sharma, A. (2010). Improved direct LDA and its application to DNA microarray gene expression data. Pattern Recognition Letters, 31(16), 2489–2492.