Here, we present a protocol for collection of confocal Raman spectra from human subjects in clinical studies combined with chemometric approaches for spectral outlier removal and the subsequent extraction of key features.
Development of this in vivo confocal Raman spectroscopic method enables the direct, depth-resolved measurement of water, proteins, and lipids in human subjects. This information is important for studying skin-related diseases and for characterizing skin care product performance. This protocol illustrates a method for confocal Raman spectra collection and the subsequent analysis of the spectral dataset using chemometrics. The goal of this method is to establish a standard protocol for data collection and provide general guidance for data analysis. Preprocessing (e.g., removal of outlier spectra) is a critical step when processing large datasets from clinical studies. As an example, we provide guidance on using prior knowledge of a dataset to identify the types of outliers and develop specific strategies to remove them. A principal component analysis is performed, and the loading spectra are compared with spectra from reference materials to select the number of components used in the final multivariate curve resolution (MCR) analysis. This approach successfully extracts meaningful information from a large spectral dataset.
In clinical studies, in vivo confocal Raman spectroscopy has shown a unique ability to determine stratum corneum thickness and water content1,2,3,4 and to track the penetration of active materials topically applied to the skin5,6. As a noninvasive approach, confocal Raman spectroscopy detects molecular signals based on vibrational modes, so labeling is not needed7. Because the technique is confocal, it provides chemical information with depth resolution. This depth-dependent information is useful for studying the effects of skin care products4,8, aging9,10, and seasonal changes3, as well as skin barrier function diseases such as atopic dermatitis11,12. The high frequency region of the Raman spectrum (2,500–4,000 cm⁻¹) is rich in information: water produces distinct peaks between 3,250 and 3,550 cm⁻¹. However, the Raman peaks of proteins and lipids, which are centered between approximately 2,800 and 3,000 cm⁻¹, overlap each other because the signals are mainly produced by methylene (-CH₂-) and methyl (-CH₃) groups13. This overlap presents a technical challenge when obtaining the relative amounts of individual molecular species. Peak fitting14,15 and selective peak position12,16 approaches have been used to resolve this challenge. However, it is difficult for these single peak-based methods to extract pure component information because multiple Raman peaks from the same component change simultaneously17. In our recent publication18, an MCR approach was proposed to elucidate the pure component information. Using this approach, three components (water, proteins, and lipids) were extracted from a large in vivo confocal Raman spectroscopic dataset.
The execution of large clinical studies can be demanding on the individuals collecting in vivo spectroscopic data. In some cases, spectral acquisition can require operating equipment for many hours a day, and the study can extend for weeks or months. Under these conditions, spectroscopic data may be generated by equipment operators who lack the technical expertise to identify, exclude, and correct for all sources of spectroscopic artifacts. The resulting dataset may contain a small fraction of spectroscopic outliers that need to be identified and excluded prior to analysis. This paper illustrates in detail a chemometric analysis process to "clean up" a clinical Raman dataset before analyzing the data with MCR. To successfully remove the outliers, the types of outliers and their potential causes need to be identified. Then, a specific approach can be developed to remove the targeted outliers. This requires prior knowledge of the dataset, including a detailed understanding of the data generation process and the study design. In this dataset, the majority of outliers are low signal-to-noise spectra and originate primarily from 1) spectra collected above the skin surface (6,208 out of 30,862), and 2) a strong contribution to the spectrum from fluorescent room light (67 out of 30,862). Spectra collected above the skin surface produce a weak Raman response because, as the laser focal point approaches the skin surface, it lies mostly in the instrument window below the skin. Spectra with a strong contribution from fluorescent room light are generated by either instrument operator error or subject movement, which produces a condition where the confocal Raman collection window is not fully covered by the subject’s body site.
Although these types of spectral artifacts could be identified and remediated by a spectroscopic expert at the time of data acquisition, the trained instrument operators in this study were instructed to collect all data unless a catastrophic failure was observed. The task of identifying and excluding outliers is therefore incorporated into the data analysis protocol, and the protocol presented here was developed for that purpose. To address the low signal-to-noise spectra, the location of the skin surface needs to be determined first so that spectra collected above the skin surface can be removed. The location of the skin surface is defined as the depth where the Raman laser focal point is half in the skin and half out of the skin, as illustrated in Supplemental Figure 1. After removing the low signal-to-noise spectra, a principal component analysis (PCA) is implemented to extract the factor dominated by fluorescent room light peaks. The remaining outliers are removed based on their score values for this factor.
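The score-based removal of room light spectra can be sketched in a few lines. The following Python/NumPy example is a minimal illustration on synthetic data, not the PLS_Toolbox workflow used in this study. The function names, the selection of the factor by its correlation with a separately measured room light spectrum, and the robust z-score cutoff (`z_cut`) are all assumptions made for the example.

```python
import numpy as np

def pca_scores_loadings(spectra, n_components):
    """Mean-center the spectra (rows) and return PCA scores and loadings via SVD."""
    centered = spectra - spectra.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    return scores, vt[:n_components]

def flag_room_light(spectra, room_light_ref, n_components=3, z_cut=5.0):
    """Return a boolean outlier mask and the index of the room-light-like factor.

    Hypothetical helper: the factor is picked by its correlation with a
    reference room light spectrum, and a spectrum is flagged when its score
    on that factor lies far from the median (robust z-score).
    """
    scores, loadings = pca_scores_loadings(spectra, n_components)
    corr = [abs(np.corrcoef(loading, room_light_ref)[0, 1]) for loading in loadings]
    factor = int(np.argmax(corr))
    t = scores[:, factor]
    med = np.median(t)
    mad = np.median(np.abs(t - med))        # median absolute deviation
    z = np.abs(t - med) / (1.4826 * mad)    # assumed outlier metric, not from the paper
    return z > z_cut, factor

# --- synthetic demonstration (all numbers are made up) ---
rng = np.random.default_rng(0)
channels = np.arange(400)

def band(center, width):
    return np.exp(-0.5 * ((channels - center) / width) ** 2)

# Three sharp lines stand in for fluorescent room light peaks
room_light_ref = band(340, 1.5) + band(360, 1.5) + band(380, 1.5)

# Two broad bands with varying amplitudes stand in for skin spectra
amplitudes = rng.uniform(0.5, 1.5, size=(200, 2))
spectra = amplitudes[:, :1] * band(100, 40) + amplitudes[:, 1:] * band(200, 40)
spectra += rng.normal(0.0, 0.01, size=spectra.shape)

contaminated = [3, 50, 77, 120, 199]
spectra[contaminated] += 3.0 * room_light_ref

mask, factor = flag_room_light(spectra, room_light_ref)
flagged = [int(i) for i in np.flatnonzero(mask)]
cleaned = spectra[~mask]
print(flagged, cleaned.shape)
```

In practice the cutoff would need to be chosen by inspecting the score distribution of the actual dataset; the value used here is only a placeholder.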
This protocol provides detailed information on how the six components used in the MCR process are determined. This is done through PCA followed by a comparison of loading shapes between models generated with different numbers of components. The experimental process for data collection from reference materials as well as from the human subjects is also explained in detail.
This study was approved by the institutional review committee of Beijing Children’s Hospital in compliance with the ethical guidelines of the 1975 Declaration of Helsinki. It was conducted according to ICH guidelines for Good Clinical Practice. The study took place from May to July 2015.
1. Collection of in vivo confocal Raman spectra from human subjects with atopic dermatitis
2. Collection of confocal Raman spectra from reference materials
3. Removal of the outlier spectra through chemometrics analysis
4. Selection of the number of the components in MCR decomposition analysis
In this clinical study, in vivo confocal Raman spectra were collected from 28 subjects aged 4–18 years. A total of 30,862 Raman spectra were collected with the data collection protocol described above. This large spectral dataset contains approximately 20% spectral outliers, as shown in Figure 4A. The low signal-to-noise outlier spectra were removed after determining the skin surface, and PCA was then used to identify the spectra with room light features. The third factor in this PCA model was identified as being dominated by room light peaks. This was confirmed by comparing the loading spectrum of factor 3 with a spectrum of fluorescent room light collected separately at the study site using the same confocal Raman instrument (see Supplemental Figure 3). Figure 4B indicates that most of the outlier spectra were removed by this process.
PCA was performed on the preprocessed confocal Raman dataset, and the eigenvalues are plotted against the number of factors in Figure 5. According to prior studies12,19, the model should include at least three components: water, protein, and lipid. A significant decrease in eigenvalue was observed at factor 9, as shown in Figure 5. This observation suggested investigating MCR models with three to eight components. The MCR loadings with spectroscopic features most consistent with protein, water, and lipid are shown in Figure 6.
Figure 1. Illustration of the lesion and non-lesion mark on forearm. (A) A 3 cm x 4 cm marked area on a lesion site. (B) A 3 cm x 4 cm marked area on a non-lesion site.
Figure 2. Illustration of the confocal Raman data collection. (A) Confocal Raman instrument. (B) Spectra collection on the forearm of a human subject. (C) A screenshot of determining the reference position for data collection.
Figure 3. Determining the skin surface. (A) Integration of the protein area under each Raman spectrum. (B) Setting the skin surface based on the maximum and minimum points.
Figure 4. Raman dataset spectra. (A) Confocal Raman spectra before removal of the outlier spectra. (B) Confocal Raman spectra after removal of the outlier spectra.
Figure 5. Determining the number of components from PCA. (A) Eigenvalue on a logarithmic scale plotted as a function of the number of components used in the PCA model. (B) Difference in eigenvalues between 'n' and 'n + 1' components.
Figure 6. Comparison of the loading shape with the corresponding reference materials' spectra with three to eight components in MCR model. (A) Protein, (B) water, and (C) lipid factors' shape with three to eight components in the MCR model compared with BSA, water, and lipid reference materials' spectra, respectively.
Figure 7. The additional three loadings from the six component MCR model not used in the final model. These three MCR components are dominated by fluorescence and baseline artifacts.
Supplemental Figure 1. Illustration of the determination of the skin surface where the center of the laser focus touches the skin.
Supplemental Figure 2. Illustration of the selection of three components in the PLS_Toolbox software PCA analysis.
Supplemental Figure 3. Identification of the loading factor dominated by room light superimposed on a reference spectrum of room light.
Supplemental Figure 4. Comparison of loadings from the MCR model before and after removal of cosmic rays. (A), (B), and (C) are the factors representing water, protein, and lipid, respectively. (D), (E), and (F) are the additional loading factors not used in the final MCR model.
Supplemental Figure 5. Raman spectra of typical lipid materials in stratum corneum. (A) Cholesterol 3-sulfate sodium and cholesterol. (B) Oleic, palmitic, palmitoleic, and stearic acid. (C) Squalene. (D) N-behenoyl-D-erythro-sphingosine, N-Lignoceroyl-D-erythro-sphinganine, and D-Erythro-Dihydrosphingosine.
During data collection, as described in sections 2 and 3 of the protocol, each depth profile was collected in an area with contact between the instrument window and the skin, found by locating the darker areas in the microscopic images (highlighted by the red circles in Figure 2C). Once these areas were located, it was critical to start the depth profile above the skin surface to accurately determine the location of the skin surface in the data analysis procedure. The location of the skin surface was subsequently used to determine the relative depth of each spectrum in the corresponding depth profile. As mentioned in section 1 of the protocol, starting the depth profile 10 μm above the skin surface produces five data points outside the skin. This allows the locations of the maximum and minimum signal intensity on either side of the skin surface to be determined. It is also important to avoid measuring locations that contain pen marks or more highly pigmented areas such as freckles, because these areas produce a high fluorescence background signal. The selection of the exposure time is a balance between spectral quality and measurement duration. A longer exposure time improves the signal-to-noise ratio but significantly increases the overall measurement time, and many subjects, especially children, find it challenging to remain motionless for extended periods. Increasing the laser power also increases the signal-to-noise ratio. However, too much power can damage the skin through absorption of the energy. The maximum permissible exposures, 17 mW laser power as defined by the Chinese national standard (GB 7247.1-2012) and the limits of the international laser safety standard (IEC 60825-1:2007; <20 mW for 671 nm and <30 mW for 785 nm), must not be exceeded.
Other safety precautions include ensuring that each subject wears eye protection prior to data acquisition, confirming that body sites have an individual typology angle (ITA) value higher than 40, and avoiding areas with high skin pigmentation.
To determine the location of the skin surface, the area under the protein Raman peak (2,910–2,965 cm⁻¹) was integrated to obtain the depth profile of the protein signal. The Raman spectra were first baseline-corrected using the automated weighted least squares method from PLS_Toolbox prior to the integration of the peaks. The 26 data points from one depth profile were interpolated to 260 points using the linspace method for the instrument offset value (X-axis value in Figure 3A) and the spline method for the corresponding intensity value. The resulting data were fitted with a 20th order polynomial using the polyfit and polyval functions in MATLAB, and the maximum and minimum points of the fitted curve were determined. The mean intensity value was calculated by dividing the sum of the maximum and minimum values by 2. The skin surface was defined as the location where the intensity value from the fitted depth profile was closest to the mean intensity. The exact location of the skin surface does not need to coincide with an experimental data point. This method can only measure a limited depth of the skin due to the absorption and scattering of the beam21. Collecting spectroscopic data deeper than ~50 μm below the skin surface may require significant changes to the experimental parameters.
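The skin surface procedure described above can be sketched as follows. This is a hedged Python/NumPy approximation rather than the MATLAB code used in the study: `band_area` and `skin_surface` are hypothetical names, the baseline correction step is omitted, linear interpolation stands in for the spline step, and NumPy's `Polynomial.fit` replaces polyfit/polyval. The depth profile is synthetic.

```python
import numpy as np

def band_area(wavenumber, spectrum, lo=2910.0, hi=2965.0):
    """Trapezoidal area under a baseline-corrected spectrum between lo and hi (cm^-1)."""
    sel = (wavenumber >= lo) & (wavenumber <= hi)
    x, y = wavenumber[sel], spectrum[sel]
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def skin_surface(offset, protein_area, n_interp=260, degree=20):
    """Estimate the skin surface offset from one protein depth profile."""
    fine_x = np.linspace(offset[0], offset[-1], n_interp)
    # Linear interpolation is a simple stand-in for the spline step (assumption).
    fine_y = np.interp(fine_x, offset, protein_area)
    # Polynomial.fit rescales x internally, which helps a 20th order fit stay stable.
    smooth = np.polynomial.Polynomial.fit(fine_x, fine_y, degree)(fine_x)
    # Midpoint between the maximum and minimum of the fitted curve
    midpoint = 0.5 * (smooth.max() + smooth.min())
    # Surface = position where the fitted profile is closest to the midpoint
    return float(fine_x[np.argmin(np.abs(smooth - midpoint))])

# --- synthetic depth profile (made-up numbers): 26 points in 2 um steps ---
offset = np.arange(0.0, 52.0, 2.0)                       # instrument offset, um
profile = 1.0 / (1.0 + np.exp(-(offset - 10.0) / 2.0))   # signal rises near 10 um

surface = skin_surface(offset, profile)
relative_depth = offset - surface        # negative values lie above the skin
above_skin = relative_depth < 0          # candidates for low signal-to-noise removal
print(round(surface, 2), int(above_skin.sum()))
```

Real depth profiles may decline again at greater depths, in which case the midpoint can be crossed more than once; restricting the search to the rising edge would then be a sensible refinement.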
As described in section 3 of the protocol, after removal of outlier spectra with low signal-to-noise and a high contribution from room lights, a small fraction of spectra containing cosmic rays remained in the dataset. A comparison of the loading spectra generated before and after cosmic ray removal (Supplemental Figure 4) indicates that the impact of this small number of spectra was negligible. The three factors representing water, protein, and lipid were identical, and the additional three loadings associated with noise and spectral artifacts were also very similar. This might be attributed to the low occurrence of cosmic rays in the spectra (~0.25%) and to the fact that their locations in the spectra are random.
The selection of the number of components used in the MCR analysis is critical, because the interpretation of each loading's shape in terms of the molecular species responsible for it significantly impacts both how the corresponding score values are used and the overall method performance. As described in section 4 of the protocol, PCA was performed first to investigate how the eigenvalues evolve as the number of components increases. This investigation was used to identify the number of components that should be used in the subsequent MCR analysis. Plotting the eigenvalues on a logarithmic scale, as shown in Figure 5A, can make this identification easier than examining the raw eigenvalues. Each eigenvalue represents the variance that one component captures: the larger the eigenvalue, the more variance in the spectra this component can model. Eigenvalues of similar size should be selected or eliminated together22. Following this guideline, two, five, and eight components were considered for the MCR analysis because components three, four, and five produce eigenvalues of similar size. A similar trend was also observed for components six, seven, and eight. Figure 5B plots the difference in eigenvalues between 'n' and 'n + 1' components, showing local maxima after the second, fifth, and eighth components. Prior knowledge about the molecular composition of skin, combined with the study design, supports a minimum of three components to model the high frequency Raman spectra. Therefore, multiple MCR models containing three to eight components were investigated, and the loadings were compared to spectra from reference materials to identify the key components required for the final model.
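The eigenvalue-gap reasoning above can be illustrated with a short Python sketch. The eigenvalues below are invented to mimic the described grouping and are not values from this study, and taking the difference on a log10 scale is an assumption, since the text does not state whether Figure 5B uses raw or logarithmic differences.

```python
import math

def log_eigenvalue_gaps(eigenvalues):
    """Drop in log10(eigenvalue) between component n and n + 1, for n = 1, 2, ..."""
    logs = [math.log10(v) for v in eigenvalues]
    return [logs[i] - logs[i + 1] for i in range(len(logs) - 1)]

def candidate_component_counts(eigenvalues):
    """Component counts n whose gap to n + 1 is an interior local maximum."""
    gaps = log_eigenvalue_gaps(eigenvalues)
    return [i + 1 for i in range(1, len(gaps) - 1)
            if gaps[i] > gaps[i - 1] and gaps[i] > gaps[i + 1]]

# Made-up eigenvalues: components 3-5 and 6-8 form groups of similar size
eigenvalues = [1000, 400, 20, 15, 12, 1.0, 0.8, 0.7, 0.05, 0.04, 0.035]
print(candidate_component_counts(eigenvalues))  # [2, 5, 8]
```

The candidates returned this way are only a starting point; as described above, the final choice rests on comparing the loadings with reference spectra.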
Comparing the loadings with Raman spectra from reference materials readily allows two of the final MCR components to be assigned to protein and water, because they dominate the MCR loadings for all models tested and match the corresponding reference spectra (BSA and DI water). However, in MCR models that contain three and four components, the lipid loading was a weaker match to the lipid reference spectrum. In addition, residual protein peaks (2,840–3,000 cm⁻¹) were observed in the MCR water loadings for all models tested with fewer than six components. Based on these observations, a six component MCR model was used in the final MCR analysis. Three of the six components were assigned to water, protein, and lipid by matching their loading spectra to the corresponding reference spectra. The interpretation and assignment of the lipid factor is based on comparison of the loading to Raman spectra of three representative ceramide materials: N-behenoyl-D-erythro-sphingosine, N-Lignoceroyl-D-erythro-sphinganine, and D-Erythro-Dihydrosphingosine. The Raman spectra of other lipid materials in the stratum corneum were also examined. These materials include fatty acids (oleic, palmitic, palmitoleic, and stearic acid), cholesterol (cholesterol 3-sulfate sodium and cholesterol), and squalene, as shown in Supplemental Figure 5. The lipid factor used in the final MCR model was a strong match to the ceramide spectra and consistent with other materials that contain long chain hydrocarbons. The other three MCR components were dominated by fluorescence and baseline artifacts, and their corresponding score values were not used in any calculations. These three components are shown in Figure 7.
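The loading-to-reference comparison described here was based on spectral shape. One possible way to quantify such a comparison, shown purely as an assumption and not as the procedure used in this study, is the Pearson correlation between a loading and each reference spectrum. The toy spectra, band positions, and function names below are illustrative only.

```python
import math

def gaussian_band(axis, center, width):
    """Toy Gaussian band sampled on the given wavenumber axis."""
    return [math.exp(-0.5 * ((x - center) / width) ** 2) for x in axis]

def pearson(a, b):
    """Pearson correlation between two equally sampled spectra."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def best_reference(loading, references):
    """Return the best-matching reference name and all similarity values."""
    similarity = {name: pearson(loading, ref) for name, ref in references.items()}
    return max(similarity, key=similarity.get), similarity

# Toy spectra on a 2,500-4,000 cm^-1 axis; band positions are illustrative only
axis = list(range(2500, 4001, 5))
references = {
    "water": gaussian_band(axis, 3400, 120),
    "protein": gaussian_band(axis, 2935, 25),
    "lipid": [a + b for a, b in zip(gaussian_band(axis, 2850, 12),
                                    gaussian_band(axis, 2880, 12))],
}

# A hypothetical loading: mostly protein-like with a small water-like residue
loading = [0.9 * p + 0.1 * w
           for p, w in zip(references["protein"], references["water"])]

name, similarity = best_reference(loading, references)
print(name)  # protein
```

A single similarity number cannot replace visual inspection, for example when checking for residual protein peaks in a water loading, but it can help rank candidate models with different numbers of components.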
The overall analysis approach presented in this manuscript produces a final method with improved specificity and accuracy for measuring the key components in skin compared to single peak or peak-fitting approaches. This methodology demonstrates that critical components can be extracted from a clinical dataset that contains a relatively small fraction of bad spectra. Future efforts are focused on automating this methodology in a software package to improve its efficiency and reduce the amount of technical expertise required for the analysis. A similar methodology is being developed for Raman spectra collected in the fingerprint region (400–1,800 cm⁻¹) using a 785 nm laser source rather than the 671 nm laser incorporated into the same instrument.
The authors have nothing to disclose.
The authors gratefully acknowledge the financial support from the corporate function analytical and personal cleansing care department. We want to express our gratitude to analytical associate directors Ms. Jasmine Wang and Dr. Robb Gardner for their guidance and support and Ms. Li Yang for her help on data collection.
Bovine Serum Albumin | Sigma-Aldrich | | |
Cholesterol | Sigma-Aldrich | | |
Cholesterol 3-sulfate sodium | Sigma-Aldrich | | |
D-Erythro-Dihydrosphingosine | Sigma-Aldrich | | |
DI water | Purified with Millipore (18.2 MΩ) | | |
Gen2-SCA skin analyzer | River Diagnostics, Rotterdam, The Netherlands | Gen2 | |
MATLAB 2018b | MathWorks | 2018b | |
N-behenoyl-D-erythro-sphingosine | Avanti Polar Lipids, Inc. | | |
N-Lignoceroyl-D-erythro-sphinganine (ceramide) | Avanti Polar Lipids, Inc. | | |
Oleic Acid | Sigma-Aldrich | | |
Palmitic Acid | Sigma-Aldrich | | |
Palmitoleic Acid | Sigma-Aldrich | | |
PLS_Toolbox version 8.2 | Eigenvector Research Inc. | 8.2 | |
RiverICon | River Diagnostics, Rotterdam, The Netherlands | version 3.2 | |
Squalene | Sigma-Aldrich | | |
Stearic Acid | Sigma-Aldrich | | |