We demonstrate the utility of remotely sensed data and the newly developed Software for Assisted Habitat Modeling (SAHM) in predicting invasive species occurrence on the landscape. An ensemble of predictive models produced highly accurate maps of tamarisk (Tamarix spp.) invasion in Southeastern Colorado, USA when assessed with subsequent field validations.
Early detection of invasive plant species is vital for the management of natural resources and protection of ecosystem processes. The use of satellite remote sensing for mapping the distribution of invasive plants is becoming more common, however conventional imaging software and classification methods have been shown to be unreliable. In this study, we test and evaluate the use of five species distribution model techniques fit with satellite remote sensing data to map invasive tamarisk (Tamarix spp.) along the Arkansas River in Southeastern Colorado. The models tested included boosted regression trees (BRT), Random Forest (RF), multivariate adaptive regression splines (MARS), generalized linear model (GLM), and Maxent. These analyses were conducted using a newly developed software package called the Software for Assisted Habitat Modeling (SAHM). All models were trained with 499 presence points, 10,000 pseudo-absence points, and predictor variables acquired from the Landsat 5 Thematic Mapper (TM) sensor over an eight-month period to distinguish tamarisk from native riparian vegetation using detection of phenological differences. From the Landsat scenes, we used individual bands and calculated Normalized Difference Vegetation Index (NDVI), Soil-Adjusted Vegetation Index (SAVI), and tasseled capped transformations. All five models identified current tamarisk distribution on the landscape successfully based on threshold independent and threshold dependent evaluation metrics with independent location data. To account for model specific differences, we produced an ensemble of all five models with map output highlighting areas of agreement and areas of uncertainty. Our results demonstrate the usefulness of species distribution models in analyzing remotely sensed data and the utility of ensemble mapping, and showcase the capability of SAHM in pre-processing and executing multiple complex models.
Riparian and wetland ecosystems throughout the southwestern United States are being threatened by the invasion of tamarisk (Tamarix spp.), a non-native woody shrub introduced from Eurasia in the 1800s1. Tamarisk has many physiological mechanisms that allow the genus to exploit water resources, out-compete native species, and alter ecosystem processes1-2. Mapping tamarisk distributions for assessing environmental impacts and formulating effective control strategies are high priorities for resource managers. Although ground surveys remain regularly used, they are impractical for extremely large areas due to the associated costs of labor, time, and logistics.
Satellite remote sensing has played an important, but limited, role in the detection and mapping of tamarisk infestations. Conventional classification analyses and remote sensing software have had marginal success3-5. Several recent studies have explored non-traditional approaches to detect invasive plants using remote sensing data1,6. Tamarisk, like many invasive plants, exhibits phenological variation throughout the growing season that differs from native riparian species' phenology. In some areas, for example, tamarisk leaf-out is before some native riparian plants, and tamarisk retains its foliage longer than other native species. By using spectral bands and spectral indices derived from a time-series of satellite data throughout the growing season, we can distinguish tamarisk from native plants based on these phenological differences1,6. Building on the work of Evangelista et al. 20091, in this study we incorporated individual bands 1-7 from a time-series of Landsat 5 Thematic Mapper (TM) satellite imagery and derived normalized difference vegetation index (NDVI), soil-adjusted vegetation index (SAVI), and tasseled cap transformations from these bands. Normalized difference vegetation index (NDVI) is one of the most commonly used spectral indices for estimating vegetation biomass, canopy cover, and leaf area indices8-9, and is a non-linear transformation of the ratio between the visible (red) and near-infrared bands10. Soil-adjusted vegetation index (SAVI) is a modified NDVI used to minimize the effects of soil background on vegetation indices11. Tasseled cap transformations are weighted composites of the six Landsat bands into three orthogonal bands that measure soil brightness (tasseled cap, band 1), vegetation greenness (tasseled cap, band 2), and soil/vegetation wetness (tasseled cap, band 3) and are often used to distinguish vegetation composition, age class, and structure12-14. We used the coefficients reported in Crist (1985)15 for all tasseled cap transformations.
In this study, we test five species distribution models with a time-series of spectral bands and vegetation indices derived from Landsat 5 TM to map tamarisk along the lower Arkansas River in southeastern Colorado, USA. The Arkansas River, spanning 2,364 km (1,469 mi), is the second largest tributary in the Missouri-Mississippi system. Its watershed covers 435,123 km2 (168,002 mi2) with headwaters in the Colorado Rocky Mountains. From its origin at 2,965 m, the Arkansas drops considerably in elevation, leveling out near Pueblo, CO, and meandering through agricultural lands and short-grass prairie. The river is subject to seasonal flooding and is relied on for municipal and agricultural water use in Rocky Ford, La Junta, and Lamar, before continuing into Kansas, Oklahoma, and Arkansas where it flows into the Mississippi River. Tamarisk was first observed on the Arkansas River by R. Niedrach in 1913 near the present-day town of Lamar16. Today, it has been estimated that tamarisk covers more than 100 km2 between Pueblo and the Kansas state line, with an additional 60 km2 along the tributaries of the Arkansas River17. The study area includes irrigation ditches, wetlands, agricultural land, and the confluences of several tributaries; all with varying degrees of tamarisk infestation. Ranching and agriculture are the primary land-uses adjacent to the riparian corridors consisting largely of alfalfa, hay, corn, and winter wheat.
Species distribution models rely on geo-referenced occurrences (i.e., latitude, longitude) to identify relationships between a species' occurrence and its environment18. The environmental data can include multiple remote sensing and other spatial layers. The five species distribution models we tested include boosted regression trees (BRT)19, random forests (RF)20, multivariate adaptive regression splines (MARS)21, a generalized linear model (GLM)22, and Maxent23. These five model algorithms are among the most commonly employed for species distribution modeling, and a number of studies have demonstrated their efficacy24-25. We used the Software for Assisted Habitat Modeling (SAHM) v. 2.0 modules to execute the five models, which are contained in VisTrails v.2.2.226 visualization and processing software. There are several advantages to using SAHM for comparative modeling. In addition to the formalization and tractable recording of modeling processes, SAHM allows users to work with multiple species distribution model algorithms that, individually, have disparate interfaces, software and file formatting27. SAHM produces consistent threshold-independent and threshold-dependent evaluation metrics to evaluate model performance. One of these is Area Under the Receiver Operating Characteristic Curve (AUC), a threshold independent metric that evaluates ability of a model to discriminate presence from background28. An AUC value of 0.5 or less indicates model predictions are not better or worse than random; values between 0.5 and 0.70 indicate poor performance; and values increasing from 0.70 to 1.0 indicate progressively higher performance. Another metric is percent correctly classified (PCC), a threshold dependent metric that weighs sensitivity and specificity based on a user-defined threshold metric; sensitivity measures the percentage of observed presences classified as suitable and specificity measures the percentage of background locations classified as unsuitable. Yet another metric is True Skill Statistic (TSS = sensitivity + specificity – 1), which places more weight on model sensitivity than specificity, with values ranging between -1 and 1 where values > 0 indicate better model performance than chance29.
To map tamarisk using model output, we constructed binary classifications using the threshold that equalizes sensitivity and specificity to define the presence or absence of tamarisk. These individual model derived maps were then summed to create an ensemble map30. Ensemble maps combine the predictions of individual species distribution models to produce a classified map that ranks the collective agreement of the models tested. For example, an ensemble cell value of one indicates that only one model classified that cell as suitable habitat, whereas a value of five indicates that all five models classified the cell as suitable habitat. One advantage to this approach is that ensemble maps yield a lower mean error than any individual model. It also allows users to visually compare the performance of each model tested. Our overall goal was to provide a detailed description of these methods that can be tailored to model the current distribution of species on the landscape.
1. Field Data Collection
2. Predictor Variables
Figure 1. Remote Sensing Indices Derivation Tool GUI.
3. Software for Assisted Habitat Modeling (SAHM) (Figure 2)
Figure 2. The entire SAHM workflow includes input data, preprocessing, preliminary model analysis and decision, correlative models, and output routines.
Figure 3. Covariate correlation and selection SAHM interface.
Figure 4. VisTrails spreadsheets can be used to evaluate model output. This is the AUC model comparison for the training data; from left to right the models are BRT, GLM, MARS, RF, and Maxent, respectively.
Statistical evaluations of BRT, RF, MARS, GLM, and Maxent based on the independent test dataset indicated all five models performed relatively well in detecting tamarisk; there was little difference between threshold independent and threshold dependent evaluation metrics among models. AUC values were > 0.88, percent correctly classified values were > 77%, sensitivities and specificities were > 0.77, and TSS were > 0.54 (Table 1). An ensemble of binary model outputs revealed much model agreement in areas along the Arkansas River (Figure 5). The MESS (multivariate environmental similarity surface) map outputs for each model indicated the available environment of the study area was well sampled (Figure 6), further increasing our confidence in the ensemble approach.
Model | AUC | PCC | Sensitivity | Specificity | TSS |
BRT | 0.91 | 85 | 0.85 | 0.85 | 0.70 |
RF | 0.92 | 85 | 0.85 | 0.85 | 0.70 |
MARS | 0.90 | 82 | 0.82 | 0.82 | 0.64 |
GLM | 0.88 | 77 | 0.77 | 0.77 | 0.54 |
Maxent | 0.92 | 84 | 0.83 | 0.84 | 0.67 |
Table 1. Threshold independent (AUC) and threshold dependent (PCC, Sensitivity, Specificity, and TSS) evaluation metrics for BRT, RF, MARS, GLM, and Maxent models fit to an independent test dataset of tamarisk presence and absence.
Figure 5. Ensemble results combining BRT, GLM, MARS, RF, and Maxent binary output maps in ArcGIS. Areas are colored by number of models in agreement, from 0 (no color) to 5 (red). Note the colored area in the northwest corner of the prediction; this line is an artifact of Landsat imagery; therefore model results should be taken with caution in this region. Please click here to view a larger version of this figure.
Figure 6. Multivariate environmental similarity surface (MESS) output. Please click here to view a larger version of this figure.
Of the nine predictors used, June 30 2006 Brightness was the most important variable for all five models (Table 2). This was the only variable retained by GLM based on stepwise Akaike information criterion (AIC; this is the default for GLM model selection in SAHM), however it is important to note this model also included a squared term of this variable. RF and Maxent retain all variables by default.
Predictor | BRT | RF | MARS | GLM | Maxent |
July 30 2006 Brightness | 41.60 | 34.11 | 76.78 | 100 | 67.27 |
August 31 2006 Band 4 | 6.35 | 5.87 | 5.16 | 0 | 2.82 |
June 09 2005 SAVI | 13.67 | 14.09 | 9.14 | 0 | 9.75 |
April 22 2005 Brightness | 6.29 | 6.30 | 0 | 0 | 0.43 |
October 28 2004 NDVI | 5.66 | 8.25 | 0 | 0 | 2.94 |
Table 2. Relative importance of predictors in each model.
Our results demonstrate fitting BRT, RF, MARS, GLM, and Maxent with presence points for tamarisk and a time-series of remotely sensed Landsat satellite imagery data can distinguish tamarisk on the landscape and is an effective alternative to traditional single-scene classification methods. It is clear from our results that June is a particularly important time for detecting tamarisk within our study area; this agrees with Evangelista et al. 20091 which indicated June Wetness was the most important predictor for tamarisk occurrence in this area based on a Maxent model fit with a time series of Landsat imagery.
The other spectral indices and bands that were included in the BRT, RF, MARS, and Maxent models may further distinguish tamarisk from soil substrate, other deciduous trees including cottonwood (Populus spp.) and willow (Salix spp.), or irrigated agriculture that is common in the lower Arkansas River basin. Other GIS layers, such as topography, soil types, or climate data could also be considered as covariates and included in these models, but we recommend keeping these to a minimum if the goal is to detect current species distribution on the landscape rather than predict potential occurrence or suitable habitat.
The models tested for our research provided strong analytical capacity and multiple options for evaluating results. Having all of these correlative models within a single framework, such as SAHM, allows the formalization and tractable recording of the modeling process. Pre- and post-processing of response and predictor variables are standardized in SAHM, allowing better and efficient model comparisons, while workflows record each step of the analyses facilitating modification, iteration and replication.
Ensemble mapping aims to combine the strengths of several correlative models, while minimizing the weakness of any one model30. We believe this was the case in our study; however, we caution that models that underperform (i.e., under-predict or over-predict) can weaken the overall results. The limited use of ensemble mapping in the literature has had favorable results, but most of these approaches have attempted to "predict" species occurrence rather than "detect." Furthermore, ensemble mapping allows for a visual assessment of uncertainty among the different modeling methods, identifying levels of model agreement. Most often it is the choice of modeling method (e.g., GLM versus BRT) that has greatest quantifiable impact on model results rather than other decisions in the modeling process such as location data uncertainty31. Although we believe that our best tamarisk map is where all five models are in agreement, further testing and using various methods of ensemble mapping is recommended (e.g., weighted by AUC)32, and best validated through independent field observations. In summary, these methods can easily be tailored to model the distribution of other species using environmental variables derived for a given study region in SAHM.
The authors have nothing to disclose.
The authors would like to thank the U.S. Geological Survey, Natural Resource Ecology Laboratory at Colorado State University, Colorado State Forest Service and the Tamarisk Coalition for logistical support, data, use of facilities and expertise. Additionally, we thank Shelly Simmons, Lane Carter, John Moore, and Chandra Reed for their contributions to this work. Thomas J. Stohlgren was partially supported by the Bioenergy Alliance Network of the Rockies (BANR), USDA UV-B Monitoring and Research Program and USDA CSREES/NRI 2008-35615-04666. Any use of trade, product, or firm names is for descriptive purposes only and does not imply endorsement by the U.S. Government.
Earth Explorer | USGS | http://earthexplorer.usgs.gov | Open Access: Yes |
Remote Sensing Indices Derivation Tool | github | https://github.com/rander38/Remote-Sensing-Indices-Derivation-Tool | Open Access: Yes |
Software for Assisted Habitat Modeling | USGS | https://my.usgs.gov/catalog/RAM/SAHM | Open Access: Yes |
ArcGIS v.10.3 | Esri | https://www.arcgis.com/features/ | Open Access: No |