This protocol describes a method for predicting the o-cresol concentration during the production of polyphenylene ether using near-infrared spectroscopy and partial least squares regression. To describe the process clearly and completely, the prediction of the o-cresol concentration during polyphenylene ether production is used as a worked example to illustrate the steps.
Unlike macroscopic process variables, near-infrared spectroscopy provides process information at the molecular level and can significantly improve the prediction of components in industrial processes. The ability to record spectra of solid and liquid samples without any pretreatment is advantageous, and the method is widely used. However, high-dimensional near-infrared spectral data suffer from information redundancy and multicollinearity. We therefore propose using the partial least squares regression method, which has traditionally been used to reduce data dimensionality and eliminate the collinearity between the original features. We implement the method to predict the o-cresol concentration during the production of polyphenylene ether. The proposed approach offers the following advantages over component regression prediction methods: 1) partial least squares regression solves the multicollinearity problem of the independent variables and effectively avoids the overfitting that occurs in a regression analysis when the independent variables are highly correlated; 2) the use of near-infrared spectra results in high accuracy because it is a non-destructive and non-polluting way to obtain information at microscopic and molecular scales.
Near-infrared (NIR) spectroscopy (NIRS) has gained wide acceptance as a fast, efficient, non-destructive, and non-polluting modern analytical technology; over the past several years the method has been used for product quality detection and analysis and for chemical component measurement in industrial processes. Its most distinctive feature is the ability to record spectra of solid and liquid samples without any pre-processing, which makes NIRS especially suitable for the direct and rapid detection and analysis of natural and synthetic products1,2. Unlike traditional sensors, which measure process variables (e.g., temperature, pressure, liquid level) at a macroscopic scale and inevitably suffer from external noise and background interference, NIRS detects the structural information of the chemical composition at microscopic and molecular scales. Thus, essential information can be measured more accurately and effectively than with other methods3,4.
Polyphenylene ether, one of the engineering plastics, is widely used owing to its heat resistance, flame retardancy, insulation, electrical properties, dimensional stability, impact resistance, creep resistance, mechanical strength, and other properties5. More importantly, it is non-toxic and harmless compared to other engineering plastics. At present, 2,6-xylenol is one of the basic raw materials for the synthesis of polyphenylene ether, and it is usually prepared by the catalyzed alkylation of phenol with methanol6. This preparation method yields two main products, o-cresol and 2,6-xylenol. After a series of separation and extraction steps, 2,6-xylenol is used to produce polyphenylene ether. However, trace amounts of o-cresol remain in the 2,6-xylenol. O-cresol does not participate in the synthesis of polyphenylene ether and remains in the polyphenylene ether product, which lowers the product quality or even renders the product substandard. At present, most companies still analyze the composition of complex organic mixtures, such as liquid-phase polyphenylene ether products containing impurities (e.g., o-cresol), by physical or chemical separation analysis such as chromatography7,8. Chromatography separates a mixture by exploiting small differences in how its components dissolve, adsorb, desorb, or otherwise interact with the stationary phase and the mobile phase. As the two phases move relative to each other, these interactions occur repeatedly and the components are separated. Depending on the object, a complex separation takes from a few minutes to a few tens of minutes to complete, so the measurement efficiency is low.
In the modern fine chemical materials industry, product quality measurement and the advanced control technology built on it are the key route to further improving product quality. In polyphenylene ether production, real-time measurement of the o-cresol content in the polyphenylene ether product is therefore of great significance. Chromatographic analysis clearly cannot meet the requirements of advanced control technology for real-time measurement and signal feedback. We therefore propose the partial least squares regression (PLSR) method to establish a linear model between the NIRS data and the o-cresol concentration, which enables online measurement of the o-cresol content in the liquid polyphenylene ether product at the outlet.
Pre-processing of the NIRS data plays the most important role prior to multivariate statistical modeling. Because the wavelengths of NIR radiation and the particle sizes of biological samples are comparable, NIRS is known for unexpected scatter effects that influence the recorded sample spectra. With appropriate pre-processing methods, these effects can largely be eliminated9. The most commonly used pre-processing techniques in NIRS are categorized as scatter correction and spectral derivative methods. The first group includes multiplicative scatter correction, detrending, standard normal variate transformation, and normalization. The spectral derivative methods include the use of the first and second derivatives.
Prior to developing a quantitative regression model, it is important to remove the unsystematic scatter variations from the NIRS data because they have a significant influence on the accuracy, complexity, and parsimony of the predictive model. The selection of a suitable pre-processing method should always depend on the subsequent modeling step. Here, if the NIR spectral dataset does not follow the Lambert-Beer law, additional factors tend to be needed to compensate for the non-ideal behavior when predicting the components. Such unnecessary factors increase the model complexity and, most likely, reduce its robustness. Thus, the application of spectral derivatives and a conventional normalization to the spectral data is an essential part of the method.
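The scatter-correction and derivative steps named above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the authors' implementation: the function names `snv` and `first_derivative` and the absorbance values are hypothetical, the derivative is a plain central difference that assumes evenly spaced wavenumbers, and a smoothing derivative filter would usually be preferred on noisy spectra.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard normal variate: centre one spectrum and scale it by its own std."""
    m = mean(spectrum)
    s = stdev(spectrum)
    if s == 0:
        raise ValueError("a constant spectrum cannot be SNV-scaled")
    return [(a - m) / s for a in spectrum]


def first_derivative(spectrum):
    """Central-difference first derivative, assuming evenly spaced wavenumbers."""
    return [(spectrum[i + 1] - spectrum[i - 1]) / 2.0
            for i in range(1, len(spectrum) - 1)]


# Hypothetical absorbance values, for illustration only.
raw = [0.10, 0.12, 0.18, 0.30, 0.26, 0.15]
offset = [a + 0.5 for a in raw]   # same spectrum with an additive baseline shift
scaled = [2.0 * a for a in raw]   # same spectrum with a multiplicative scatter factor

d_raw, d_offset = first_derivative(raw), first_derivative(offset)
z_raw, z_scaled = snv(raw), snv(scaled)
```

In this toy case the derivative removes the additive baseline shift (`d_raw` and `d_offset` agree) and SNV removes the multiplicative factor (`z_raw` and `z_scaled` agree), which is why the two families of methods are often combined.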
After spectral pre-processing, NIRS data with a high signal-to-noise ratio and low background interference are obtained. Modern NIRS analysis allows the rapid acquisition of large amounts of absorbance data over an appropriate spectral range. The chemical composition of the sample is then predicted by extracting the relevant variables from the information contained in the spectral curve. Generally, NIRS is combined with multivariate analysis techniques for qualitative or quantitative analyses10. Multivariate linear regression (MLR) analysis is commonly used for developing and mining the mathematical relationship between the data and the components in industrial processes and has been widely used in NIRS analysis.
However, there are two fundamental problems when implementing an MLR for preprocessed NIRS data. One problem is the variable redundancy. The high dimensionality of the NIRS data often renders the prediction of a dependent variable unreliable because variables are included that have no correlation with the components. These redundant variables reduce the information efficiency of the spectral data and affect the accuracy of the model. In order to eliminate the variable redundancy, it is essential to develop and maximize the correlation between the NIRS data and the predicted components.
The other problem is multicollinearity in the NIRS data. One of the important assumptions of multiple linear regression models is that there is no linear relationship between any of the explanatory variables of the regression model. If such a linear relationship exists, the model exhibits multicollinearity and the assumption is violated. In multiple linear regressions, such as an ordinary least squares regression (OLSR), multiple correlations between the variables affect the parameter estimation, increase the model error, and affect the stability of the model. To eliminate the multicollinearity in the NIR spectral data, we use variable selection methods that maximize the inherent variability of the samples.
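The effect of multicollinearity on OLSR can be read off the standard least squares estimator. The notation here is generic and not taken from the source: $\mathbf{X}$ is the $n \times p$ matrix of pre-processed spectra, $\mathbf{y}$ the vector of reference concentrations, and $\sigma^{2}$ the noise variance under the usual assumption of uncorrelated errors with constant variance.

```latex
\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}
  = \left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{y},
\qquad
\operatorname{Var}\!\left(\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}\right)
  = \sigma^{2}\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}
```

When columns of $\mathbf{X}$ are nearly linearly dependent, as neighboring wavenumbers in a spectrum typically are, $\mathbf{X}^{\top}\mathbf{X}$ is close to singular and its inverse has very large entries, so the coefficient estimates become unstable. With exact linear dependence, or with more wavenumbers than samples, the inverse does not exist at all.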
Here, we propose using PLSR, a generalization of multiple linear regression that has been widely used in the field of NIRS11,12. PLSR integrates the basic functions of MLR, canonical correlation analysis (CCA), and principal component analysis (PCA), and combines predictive analysis with a model-free analysis of the intrinsic structure of the data. PLSR can be divided into two parts. The first part selects the components of the characteristic variables and the predicted components by partial least squares analysis (PLS). When extracting the principal components, PLS makes the covariance between the principal components and the predicted components as large as possible, which maximizes the inherent variability of the principal components. Next, the OLSR model of the o-cresol concentration is established on the selected principal components. PLSR is suitable for the analysis of noisy data with numerous independent variables that are strongly collinear and highly correlated, and for the simultaneous modeling of several response variables. In addition, PLSR extracts the effective information of the sample spectra, overcomes the problem of multicollinearity, and has the advantages of strong stability and high prediction accuracy13,14.
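The two parts described above (component extraction by covariance maximization, then least squares regression on the extracted scores) can be sketched for a single response variable. This is a minimal, dependency-free sketch of the textbook NIPALS-style PLS1 algorithm, not the authors' implementation; the function names `pls1_fit` and `pls1_predict` and the toy data are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model by successive deflation."""
    n = len(X)
    x_mean = [sum(col) / n for col in zip(*X)]
    y_mean = sum(y) / n
    E = [[a - m for a, m in zip(row, x_mean)] for row in X]  # centred spectra
    f = [v - y_mean for v in y]                              # centred response
    components = []
    for _ in range(n_components):
        # Weight vector: direction in spectral space with maximal covariance with y.
        w = [dot(col, f) for col in zip(*E)]
        norm = dot(w, w) ** 0.5
        if norm < 1e-12:
            break  # no covariance left to explain
        w = [a / norm for a in w]
        t = [dot(row, w) for row in E]             # scores of this component
        tt = dot(t, t)
        p = [dot(col, t) / tt for col in zip(*E)]  # spectral loadings
        q = dot(f, t) / tt                         # least squares fit of y on the scores
        # Deflate: remove what this component explains from spectra and response.
        E = [[a - ti * pj for a, pj in zip(row, p)] for row, ti in zip(E, t)]
        f = [fi - q * ti for fi, ti in zip(f, t)]
        components.append((w, p, q))
    return x_mean, y_mean, components


def pls1_predict(model, x):
    """Predict the response for one new spectrum by replaying the deflation."""
    x_mean, y_mean, components = model
    e = [a - m for a, m in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, q in components:
        t = dot(e, w)
        y_hat += q * t
        e = [a - t * pj for a, pj in zip(e, p)]
    return y_hat


# Hypothetical toy data: column 3 = column 1 + column 2 (exact collinearity),
# and y = 2*x1 - x2 + 5, so ordinary least squares has no unique solution.
X = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [2.0, 1.0, 3.0],
     [1.0, 3.0, 4.0], [3.0, 2.0, 5.0]]
y = [7.0, 4.0, 8.0, 4.0, 9.0]
model = pls1_fit(X, y, n_components=2)
prediction = pls1_predict(model, [1.0, 2.0, 3.0])
```

Because the scores of successive components are mutually orthogonal, each coefficient `q` can be estimated independently, which is how the method sidesteps the collinearity of the original variables. In this noise-free toy case two components reproduce the linear relation even though the three columns are exactly collinear.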
The following protocol describes the process of using the PLSR model to measure the o-cresol concentration from NIR spectral data. The reliability and accuracy of the model are evaluated quantitatively using the determination coefficient (R²), the prediction correlation coefficient (Rp), and the mean square prediction error of cross-validation (MSPECV). Moreover, to show the advantages of the PLSR intuitively, the evaluation indicators are visualized in several plots for a qualitative analysis. Finally, the evaluation indicators of an experiment are presented in table format to illustrate quantitatively the reliability and precision of the PLSR model.
1. NIR spectrum data acquisition with Fourier transform (FT)-NIR process spectrometer
2. NIR spectroscopy data pre-processing
3. Establishment of PLSR model
The predicted value of the o-cresol impurity in polyphenylene ether products is obtained by PLSR-based near-infrared spectroscopy. Figure 2 and Figure 3 show the reliability of the method in the feature selection stage through the curves of the determination coefficient and the percent variance explained, respectively, as they increase with the number of principal components.
Note that, when selecting the principal components, the complexity of the model should be kept to a minimum: once the two indicators no longer increase significantly with the number of principal components, the subsequent principal components, which carry less information, can be discarded. In this paper, according to the two curves, a value of 0.9917 is taken as the threshold beyond which subsequent principal components are discarded.
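The threshold rule described above can be sketched as a small helper. The function name `select_n_components` and the list of cumulative indicator values are hypothetical and are not the values plotted in Figure 2 or Figure 3; only the threshold of 0.9917 comes from the text.

```python
def select_n_components(cumulative_scores, threshold):
    """Return the smallest number of components whose cumulative score
    reaches the threshold, or all components if it is never reached."""
    for k, score in enumerate(cumulative_scores, start=1):
        if score >= threshold:
            return k
    return len(cumulative_scores)


# Hypothetical cumulative indicator values for 1..6 components (illustration only).
scores = [0.80, 0.93, 0.985, 0.9920, 0.9925, 0.9927]
n_selected = select_n_components(scores, threshold=0.9917)
```

With these illustrative values the fourth component is the first to reach the threshold, and the later components add too little to justify the extra model complexity.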
The accuracy and stability of the method for predicting the purity of o-cresol products can be checked visually in Figure 4 and Figure 5. Based on the training set and test set generated from the o-cresol samples obtained from the industrial field and their corresponding near-infrared spectroscopy data, we compared the residuals and the mean square error between the predicted and reference values of the purity of the polyphenylene ether products.
The residual here refers to the difference between the o-cresol content reference value and the PLSR model estimate (fitted value). The information provided by the residuals helps us examine whether the assumptions of the established PLSR model are reasonable and whether the collected near-infrared spectral data are reliable. The results show that the PLSR method effectively reduces the residuals to an acceptable range. Moreover, the residuals fluctuate within a small range, from -0.2 to 0.2, while the calibration range is from 42.1063 mg per 1 g of polyphenylene ether product to 51.6763 mg per 1 g of polyphenylene ether product. From the residual plot we conclude qualitatively that the PLSR measurement of the o-cresol content based on the NIR spectral data has high accuracy.
The cross-validation mean square error is a measure of the degree of difference between the reference and the predicted o-cresol content. It helps us evaluate the degree of change in the predictive data. The smaller the value of the MSE, the better the accuracy of the predictive model describing the o-cresol content. Figure 5 indicates that the MSPECV for the o-cresol concentration measurement based on the PLSR decreases as the number of principal components increases and reaches an acceptable minimum at 10 principal components. Moreover, the error decreases significantly and the descent is relatively stable. This indicates that the PLSR gives highly stable measurements of the o-cresol concentration using NIRS.
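The cross-validated error discussed above can be computed with a short k-fold loop. This is a sketch under stated assumptions rather than the authors' procedure: the names `k_fold_indices` and `mspecv` are hypothetical, the folds are contiguous blocks (samples would normally be shuffled first), and a trivial mean predictor stands in for the PLSR model so that the block stays self-contained.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n_samples // k + (1 if i < n_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def mspecv(X, y, fit, predict, k=10):
    """Mean squared prediction error over k held-out folds."""
    n = len(y)
    squared_errors = []
    for test_idx in k_fold_indices(n, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        # Refit on the training folds only, then predict the held-out samples.
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            squared_errors.append((y[i] - predict(model, X[i])) ** 2)
    return sum(squared_errors) / n


# Stand-in model (an assumption for illustration): always predict the training mean.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)


def predict_mean(model, x):
    return model


X_demo = [[0.0], [1.0], [2.0], [3.0]]
y_demo = [1.0, 2.0, 3.0, 4.0]
error = mspecv(X_demo, y_demo, fit_mean, predict_mean, k=4)
```

Repeating the call for models with 1, 2, 3, ... components and plotting the resulting errors gives a curve of the kind shown in Figure 5, from which the number of components at the minimum can be read.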
The model evaluation indicators for a 10-fold cross-validation are shown in Table 1. The R² of 0.98332 is high for the PLSR, indicating that the PLSR-based model reflects well the linear relationship between the NIR spectral data and the o-cresol concentration (i.e., the model has strong explanatory power). The Pearson correlation coefficient is a statistic that reflects the degree of linear correlation between two variables; the larger its absolute value, the stronger the correlation. It helps to quantify the linear correlation between the predicted o-cresol content and the chemical reference value and thus to confirm the reliability of the model. The mean relative prediction error (MRPE) of 0.01106 is very low for the PLSR and the prediction correlation coefficient of 0.99161 is large; therefore, the PLSR model shows good prediction stability and accuracy.
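The three indicators reported above can be computed as follows. The source does not spell out its formulas, so this sketch assumes the conventional definitions: R² as one minus the ratio of residual to total sum of squares, the Pearson correlation between reference and predicted values, and the mean relative error as the average of |reference - predicted| / |reference|. The function names and the sample values are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Determination coefficient: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


def pearson_r(u, v):
    """Pearson correlation coefficient between two equally long sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)


def mean_relative_error(y_true, y_pred):
    """Average of |reference - predicted| / |reference| (assumed definition)."""
    return sum(abs(a - b) / abs(a) for a, b in zip(y_true, y_pred)) / len(y_true)


# Hypothetical reference and predicted contents, for illustration only.
reference = [42.0, 45.0, 48.0, 51.0]
predicted = [42.1, 44.9, 48.2, 50.8]
r2 = r_squared(reference, predicted)
rp = pearson_r(reference, predicted)
mre = mean_relative_error(reference, predicted)
```

R² penalizes any systematic offset between predictions and references, whereas the Pearson coefficient does not, which is one reason the two values in Table 1 can differ.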
Figure 1. Original NIR spectrum taken with FT-NIR spectrometer. The figure shows the non-preprocessed spectral dataset collected over a period of time by the FT-NIR spectrometer.
Figure 2. Determination coefficient for the o-cresol concentration determined by PLSR. The figure shows the trend of the determination coefficient for the o-cresol concentration with increasing number of components.
Figure 3. Percent variance explained in the NIR spectral data for the PLSR. The figure shows the percent variance explained in the NIR spectral data for the principal components under the PLSR.
Figure 4. Residual of the PLSR for the test set. The figure shows the residuals of the 200 test set samples for the PLSR.
Figure 5. MSPECV of the PLSR. The precision of the PLSR model was evaluated with a 10-fold cross-validation; the figure shows the resulting MSPECV of the model.
Model | Interpretability (R²) | Prediction accuracy (Rp) | Prediction accuracy (MRPECV) |
PLSR | 0.98332 | 0.9916 | 0.01106 |
Table 1. Evaluation indices of the model. Table 1 shows the model evaluation indicators for a 10-fold cross-validation.
This protocol describes the process of applying PLSR with NIRS to measure the concentration of o-cresol remaining in the liquid polyphenylene ether product.
The two critical steps in this process are the pre-processing of the original NIR spectral data and the variable selection for the high-dimensional NIR spectral data.
Generally, non-systematic background interference leads to non-systematic scattering deviation or baseline drift of the NIR spectrum. A proper NIR spectral pre-processing method (derivative, normalization, multiplicative scatter correction, etc.), as the first critical step, eliminates background interference and increases the signal-to-noise ratio of the data.
The multiple linear correlations in the high-dimensional NIR spectral data introduce unnecessary computation and reduce the operating efficiency of the regression model. Moreover, there are redundant variables that have weak or even no correlation with the o-cresol content in liquid polyphenylene ether products. Variable selection based on PLS, as the second critical step, solves both problems, which reduces the complexity of the model and enhances its robustness compared with the direct use of OLSR for modeling.
In addition, the technology has the following limitations. Firstly, it requires that there be no significant nonlinear correlation between the variables of the NIR spectral data. The PLS algorithm is based on maximizing the variance of the NIR spectral variables and the Pearson correlation between the selected principal components and the o-cresol content; its operation on the data is therefore essentially a linear combination of high-dimensional variables, which does not capture possible nonlinear relationships in the data15. Secondly, because the model is built offline, it cannot update itself. Our work has also not yet covered the effects of interactions between o-cresol and solvent molecules or other impurity molecules on the NIR spectra. Such interactions may have an unexpected effect on the spectra, but we have not yet found an appropriate method to quantify this effect. Thirdly, changes in the environment, working conditions, and other factors may lead to concept drift of the model.
However, compared with existing methods such as physical or chemical separation analysis for measuring the content of chemical substances in the chemical materials industry, NIRS detects composition and structural information from a microscopic, molecular perspective. In general, a qualitative transition at the macroscopic level is caused by quantitative change at the microscopic level. Therefore, this technology offers higher accuracy, timeliness, and sensitivity than other methods. Moreover, combining it with data analysis technology makes effective use of industrial big data, which is conducive to industrial automation and even intelligent construction.
In the future, this technology can be flexibly applied to the online measurement of the content of various liquid- or solid-phase chemicals in other fine process industries where measurement with general sensors or in the laboratory is difficult or cannot meet the industrial requirements16. Moreover, because it provides real-time feedback of the analyzed variables, this technology can be widely combined with advanced control technology and thus meet higher industrial requirements17,18,19.
The authors have nothing to disclose.
This work was supported by the National Natural Science Foundation of China (Grant Nos. 61722306 and 61473137) and National First-class Discipline Program of Light Industry Technology and Engineering (LITE2018-025).
MPA II Multi Purpose FT-NIR Analyzer | Bruker | 1 | |
Fiber Optic Probes (Liquid phase) | Bruker | 1 | |
Liquid chromatography analyzer | / | 1 | |
Laboratory Equipment and Supplies (e.g., test tube, etc.) | / | ||
MATLAB | MathWorks | 1 | |
OPUS | Bruker | 1 | |
Principal computer | DELL | 1 | |
The Unscrambler | CAMO | 1 |