We provide a detailed protocol for conducting underwater structure-from-motion photogrammetry surveys to generate 3D models and orthomosaics.
Structure-from-motion (SfM) photogrammetry is a technique used to generate three-dimensional (3D) reconstructions from a sequence of two-dimensional (2D) images. SfM methods are becoming increasingly popular as a non-invasive way to monitor many systems, including anthropogenic and natural landscapes, geologic structures, and both terrestrial and aquatic ecosystems. Here, a detailed protocol is provided for collecting SfM imagery to generate 3D models of benthic habitats. Additionally, the cost, time efficiency, and output quality of a Digital Single Lens Reflex (DSLR) camera are compared with those of a less expensive action camera. A tradeoff between computational time and resolution was observed: the DSLR camera produced models with more than twice the resolution, but these took approximately 1.4 times longer to process than the action camera models. This primer aims to provide a thorough description of the steps necessary to collect SfM data in benthic habitats, both for those who are unfamiliar with the technique and for those already using similar methods.
Ecosystem processes are naturally dynamic and can be difficult to quantify. The past decade has seen a surge in new technologies for capturing ecosystems and their dynamics across a range of scales, from 3D laser scanning of individual ecosystem features to satellite remote sensing of large areas1,2,3. In benthic habitats, structure is intimately connected with ecosystem function8, making tools that allow geometry and community structure to be monitored simultaneously especially valuable for understanding ecological dynamics. However, many modern approaches cannot be used in aquatic systems due to the physical properties of water (e.g., refraction, distortion, turbidity). Techniques such as LiDAR (Light Detection and Ranging) and some aerial survey methods may be appropriate at large spatial scales, but they cannot achieve the resolution needed to assess fine-scale changes in benthic habitats. Structure-from-Motion (SfM) photogrammetry methods have recently been adapted to produce large-scale, high-resolution orthomosaics and 3D surface models of underwater habitats4,5,6,7.
SfM photogrammetry is a relatively low-cost, simple, non-invasive, and repeatable method that allows for the generation of large-scale, high-resolution records of the benthic environment in aquatic ecosystems9. SfM uses a sequence of 2D images to generate 3D model reconstructions. The models generated from SfM can be used to collect data on the structural complexity (e.g., rugosity, dimensionality)4,5,10,11,12 and community structure (e.g., species composition, population demography)13,14,15 of benthic ecosystems. Furthermore, as this method is relatively inexpensive, quick, and repeatable, it can be used by both scientists and non-scientists to gather valuable, objective information on these ecosystems. Therefore, this method is a viable technique for use in citizen science projects where standardization of sampling effort, minimization of bias, engagement of participants, and ease of training are vital to the quality of data and overall success16,17.
This article provides a detailed protocol for conducting underwater SfM surveys. In addition, the use of a DSLR camera is compared with that of a more cost-effective 'action camera', and the relative advantages and disadvantages of each are outlined. The overall objective is to familiarize scientists and non-scientists with benthic SfM survey methods as rapidly as possible by providing a simple, commonly used protocol, thereby promoting wider use of this method. For examples of studies that have applied variations of this method to study underwater ecological communities, see Burns et al. (2015)4, Storlazzi et al. (2016)18, Ventura et al. (2016 and 2018)19,20, Edwards et al. (2017)14, George et al. (2018)21, Anelli et al. (2019)22, and Torres-Pulliza et al. (2020)10.
The method described here requires a two-person snorkel or SCUBA team. After the survey site is selected, a spool of line (Figure 1A) is placed at the center of the site, and calibration tiles (Figure 1B) are distributed ~2 m from the center. One person (the swimmer) swims with the camera and captures images of the site, while the second person (the assistant) tends the spool in the center of the plot (Figure 1C). First, the swimmer connects the camera to the spool via the line and then begins to take continuous pictures of the benthos while swimming face-down and forward to unwind the line off the spool. The swimmer should maintain a vertical distance of ~1 m above the substrate at all times, adjusting their position to match that of the topography as they swim. Importantly, the line connecting the spool and camera should remain taut at all times to create even spacing in the spiral as the swimmer surveys the plot. The assistant maintains the spool in a stable, upright position and ensures that the spool does not rotate, and that the line does not become tangled.
Once the line has been completely unwound, the swimmer stops, turns, and swims in the opposite direction to recoil the line around the spool. As the swimmer switches direction, the assistant turns the spool exactly 180° so that the return path does not exactly overlap the outgoing path. Once the swimmer is as close to the center as possible, the camera is detached from the line, and the assistant takes the spool and line and swims away from the central portion of the site. The swimmer then finishes imaging the center of the plot by moving the camera in a small spiral over the center. While there are several ways to image an area effectively, the spool-and-line method described here is robust even in non-ideal environmental conditions, where choppy surface waters, swell, or low visibility might otherwise impede data collection. In these scenarios, this method keeps snorkelers/divers attached and ensures high overlap of images by keeping the swimmer on a controlled path.
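The geometry of this swim path follows largely from the spool itself: with a taut line and a spool that does not rotate, each full circuit releases about one spool circumference of line, so successive passes are spaced roughly π times the spool diameter apart. The Python sketch below approximates the path as the involute of a circle. The 8-inch spool diameter comes from the Table of Materials; the 6 m plot radius is an assumption chosen only because it corresponds to an area of about 113 m², and the function and variable names are illustrative.

```python
import math

INCH_TO_M = 0.0254


def spiral_summary(spool_diameter_m, plot_radius_m):
    """Approximate the outgoing swim path as the involute of the spool circle."""
    a = spool_diameter_m / 2.0                        # spool radius
    spacing = math.pi * spool_diameter_m              # line released per circuit
    line_out = math.sqrt(plot_radius_m**2 - a**2)     # unwound line at the plot edge
    theta = line_out / a                              # total unwinding angle (rad)
    return {
        "pass_spacing_m": spacing,
        "circuits": theta / (2.0 * math.pi),
        "path_length_m": a * theta**2 / 2.0,          # involute arc length
        "plot_area_m2": math.pi * plot_radius_m**2,
    }


# Assumed inputs: 8-inch spool (Table of Materials), 6 m plot radius (assumption).
summary = spiral_summary(8 * INCH_TO_M, 6.0)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the passes are about 0.64 m apart, the swimmer completes roughly nine to ten circuits, and the outgoing leg is about 177 m long, with a similar distance on the return leg. Real paths will deviate from this idealization with current, surge, and topography.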
1. Materials
2. Detailed methods
3. Clean up the site.
In this example, Reef Site 2_7 located on Patch Reef 13 in Kāneʻohe Bay, Oʻahu, Hawaiʻi, was imaged, and 3,125 JPEG photos from the DSLR and 3,125 JPEG frame captures from the action camera video (Table 1) were used as input to create the orthomosaics and 3D models. The general workflow consisted of 5 stages: 1) alignment of photos to generate the sparse point cloud, 2) scaling the sparse point cloud and optimizing cameras, 3) building the dense point cloud (depth maps were also generated during this stage), 4) building the digital elevation model (DEM) and orthomosaic, and 5) generating the 3D model and texture. Note that stages 4 and 5 do not necessarily need to be done in that order, but they must be performed after processing the dense point cloud and depth maps. Georeferencing the models should occur prior to generating the orthomosaic and DEM. The settings used for these stages and processing details are outlined in Table 2 and Table of Materials, respectively.
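The ordering constraints among these stages can be written down as a small dependency graph. The following Python sketch is only an illustration of that ordering logic; the stage names are shorthand invented here, and nothing in it calls the photogrammetry software. It encodes that the DEM and the 3D model both depend on the dense point cloud and depth maps, and that the orthomosaic is built from the DEM (as listed in Table 2).

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each stage maps to the set of stages that must be finished first.
# Stage names are illustrative shorthand, not software commands.
DEPENDENCIES = {
    "align_photos": set(),
    "scale_and_optimize": {"align_photos"},
    "dense_cloud": {"scale_and_optimize"},
    "dem": {"dense_cloud"},
    "orthomosaic": {"dem"},
    "mesh_and_texture": {"dense_cloud"},
}


def is_valid_order(order):
    """Return True if every stage appears after all of its prerequisites."""
    done = set()
    for stage in order:
        if not DEPENDENCIES[stage] <= done:
            return False
        done.add(stage)
    return done == set(DEPENDENCIES)


# One admissible order, derived automatically from the graph.
example_order = list(TopologicalSorter(DEPENDENCIES).static_order())
print(example_order)

# Building the 3D model before the DEM and orthomosaic is also admissible.
print(is_valid_order(["align_photos", "scale_and_optimize", "dense_cloud",
                      "mesh_and_texture", "dem", "orthomosaic"]))
```

Any order that the check accepts respects the constraint that the DEM, orthomosaic, and 3D model are produced only after the dense point cloud.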
For more detailed methods on generating 3D models and orthomosaics, see the Supplementary Material and Suka et al.23. Processing time was shorter for the action camera-derived model for dense point cloud generation, mesh model rendering, and textured model rendering, whereas sparse point cloud generation took about the same time for both cameras (Table 1). This led to a substantially shorter overall processing time for the action camera model (6 h 39 min) than for the DSLR model (9 h 14 min). The exact time for model processing will vary with computational power and specific hardware configurations.
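The totals quoted above can be cross-checked against the per-stage times in Table 1. The short Python sketch below simply sums the stage times (transcribed from Table 1 and converted to minutes) and computes the ratio of the two totals; the dictionary layout and function names are illustrative.

```python
# Per-stage processing times in minutes, transcribed from Table 1.
STAGE_MINUTES = {
    "sparse cloud generation": {"dslr": 83, "action": 87},
    "dense cloud generation": {"dslr": 240, "action": 191},
    "mesh model rendering": {"dslr": 212, "action": 109},
    "texture rendering": {"dslr": 19, "action": 12},
}


def total_minutes(camera):
    """Sum the stage times for one camera ('dslr' or 'action')."""
    return sum(times[camera] for times in STAGE_MINUTES.values())


def as_h_min(minutes):
    """Format a duration in minutes as 'H h M min'."""
    hours, mins = divmod(minutes, 60)
    return f"{hours} h {mins} min"


dslr_total = total_minutes("dslr")
action_total = total_minutes("action")
ratio = dslr_total / action_total

print(as_h_min(dslr_total), as_h_min(action_total), round(ratio, 2))
```

The sums reproduce the reported totals, and the ratio of about 1.4 matches the figure given in the abstract. Most of the difference comes from the dense cloud and mesh stages.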
The model generated using images from the DSLR camera contained 2,848,358 sparse cloud points and 787,450,347 dense cloud points, while the model generated from the action camera images contained only 2,630,543 sparse cloud points and 225,835,648 dense cloud points. As a result, the DSLR model had more than twice the resolution of the action camera model, with orthomosaic resolutions of 0.208 and 0.442 mm/pixel for the DSLR- and action camera-derived models, respectively (Table 1). Despite the better resolution of the DSLR model relative to the action camera model, both methods were able to produce high-quality models with little difference in visual representation when the ~113 m² reef area was represented as a 20 cm² digital elevation model (Figure 2 top panels) or 2D orthomosaic projection (Figure 2 middle panels).
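To make the resolution comparison concrete, the Python sketch below works directly from the Table 1 values. It computes the ratio of orthomosaic pixel sizes and a rough dense-cloud point density per unit planar area. The density figure is an approximation made here for illustration: it assumes that all dense cloud points fall within the ~113 m² plot and ignores surface relief.

```python
# Values transcribed from Table 1.
ORTHO_MM_PER_PX = {"dslr": 0.208, "action": 0.442}
DENSE_POINTS = {"dslr": 787_450_347, "action": 225_835_648}
REEF_AREA_M2 = 113.0  # approximate planar area of the plot


def linear_resolution_ratio():
    """How many times finer the DSLR orthomosaic pixel is (per side)."""
    return ORTHO_MM_PER_PX["action"] / ORTHO_MM_PER_PX["dslr"]


def points_per_cm2(camera):
    """Rough dense-cloud density, assuming points are spread over the planar area."""
    return DENSE_POINTS[camera] / (REEF_AREA_M2 * 1e4)


print(round(linear_resolution_ratio(), 2))
print(round(points_per_cm2("dslr")), round(points_per_cm2("action")))
```

The pixel-size ratio is a little over 2, consistent with the statement that the DSLR model has more than twice the resolution. Because pixel count scales with the square of linear resolution, the DSLR orthomosaic holds roughly 4.5 times as many pixels over the same area.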
Figure 1: Structure-from-Motion photogrammetry. (A) Example of a spool rig for controlling swimmer distance with an attached handle and pole for precise positioning and handling. (B) Calibration tiles. (C) A schematic of the swim path with relative positions of the swimmer (green) and the assistant (orange).
Figure 2: Visual comparison of digital elevation models and orthomosaics. Digital elevation models (top) and orthomosaics (middle) constructed from DSLR (left) and action camera (right) images. The bottom panel is a zoom of the areas in the white boxes in the orthomosaics. The heatmap scales in the top panel represent distance from the surface of the water in meters (m).
Item | Canon EOS Rebel SL3 (DSLR) | GoPro Hero 7 (action camera) |
Cost | | |
Camera | ~$600.00 | ~$220.00 |
Underwater housing | ~$1,700.00 | NA |
Total cost | ~$2,300.00 | ~$220.00 |
Photos | | |
Photo file format | JPEG | JPEG |
Photo resolution | 24 megapixels | 12 megapixels (from 4K video) |
Aligned photos / total photos | 3,125 / 3,125 | 3,125 / 3,125 |
Photogrammetry metrics | | |
Sparse cloud points | 2,848,358 | 2,630,543 |
Dense cloud points | 787,450,347 | 225,835,648 |
Faces (3D model) | 11,919,451 | 3,834,651 |
Digital elevation model (DEM) resolution | 0.831 mm/pixel | 1.77 mm/pixel |
Orthomosaic resolution | 0.208 mm/pixel | 0.442 mm/pixel |
Processing times | | |
Sparse cloud generation | 1 h 23 min | 1 h 27 min |
Dense cloud generation | 4 h | 3 h 11 min |
Mesh model rendering | 3 h 32 min | 1 h 49 min |
Texture rendering | 19 min | 12 min |
Total computer processing time | 9 h 14 min | 6 h 39 min |
Table 1: Detailed information about setup cost, photos used to construct the models, photogrammetry metrics, and processing time. Processing was done using the same settings for both models. Note that processing time does not include time for various steps such as photo editing, extracting images from video, re-aligning photos, and editing and scaling the models.
Item | Canon EOS Rebel SL3 (DSLR) | GoPro Hero 7 (action camera) |
Images | | |
Average file size | ~8.3 MB | ~4.7 MB |
Photo acquisition | Continuous mode | Extracted from 4K video |
Color correction | Manual | Manual |
Lens correction | No | Yes |
Photogrammetry process settings | | |
Sparse cloud generation | | |
Accuracy: | High | High |
Key Point: | 40,000 | 40,000 |
Tie Point: | 4,000 | 4,000 |
Generic Preselection: | Yes | Yes |
Dense cloud generation | Medium Quality | Medium Quality |
3D mesh model generation | | |
Source data: | Depth Maps | Depth Maps |
Quality: | Medium | Medium |
Face count: | Low | Low |
Interpolation: | Enabled | Enabled |
Calculate vertex colors: | Yes | Yes |
3D texture generation | | |
Texture type: | Diffuse Map | Diffuse Map |
Source data: | Images | Images |
Mapping mode: | Generic | Generic |
Blending mode: | Mosaic | Mosaic |
Texture size/count: | 4096 / 1 | 4096 / 1 |
Digital elevation model (DEM) | From dense cloud | From dense cloud |
Orthomosaic | From DEM | From DEM |
Table 2: Detailed information on collected images and photogrammetric processing. Processing was done using the same settings for both models.
Supplementary Material.
This study demonstrates that both the DSLR camera and the action camera produce orthomosaics with better than 0.5 mm/pixel resolution in less than 10 h of processing time on a standard desktop computer. Aside from cost, the major tradeoff between the two is finer resolution (DSLR) versus faster processing (action camera). However, the reported processing times include only the computational processing. Although the computational time is shorter for the action camera, an additional 10-20 min is needed to extract images from the videos, a step that is not required with the DSLR. An alternative is to use the action camera in continuous shooting mode to avoid image extraction. Continuous shooting mode was not used in this example because the action camera can only shoot at 2 fps in this mode, which requires a significantly slower swim rate to collect enough images to build a complete model. There is therefore a tradeoff between spending longer in the field with continuous shooting mode and spending longer at the computer extracting images with video mode.
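The field-time side of this tradeoff can be estimated with simple arithmetic. The Python sketch below computes how long it would take to collect a given number of stills at a fixed frame rate, and how many video frames to skip between extracted stills. The 3,125-image count and the 2 fps continuous-mode rate come from this study; the 30 fps video frame rate is an assumption used only for illustration, as the recording frame rate is not stated here.

```python
def capture_seconds(n_images, frames_per_second):
    """Time needed to shoot n_images stills at a constant frame rate."""
    return n_images / frames_per_second


def extraction_stride(video_fps, target_fps):
    """Keep every k-th video frame to approximate target_fps stills."""
    return max(1, round(video_fps / target_fps))


N_IMAGES = 3125            # images used per model in this study
CONTINUOUS_FPS = 2         # action camera continuous shooting rate
ASSUMED_VIDEO_FPS = 30     # assumption: not reported in the text

minutes_at_2fps = capture_seconds(N_IMAGES, CONTINUOUS_FPS) / 60
stride = extraction_stride(ASSUMED_VIDEO_FPS, CONTINUOUS_FPS)

print(f"{minutes_at_2fps:.1f} min of continuous shooting")
print(f"keep every {stride}th frame to match 2 fps from video")
```

At 2 fps, collecting 3,125 images takes about 26 min of uninterrupted shooting. Extracting frames from video allows a higher effective frame rate, and therefore a faster swim, at the cost of the extraction step.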
Advantages of the action camera include affordability and ease of transport and operation underwater. The main advantage of the DSLR is that it produces higher-resolution images; hence, DSLR cameras are recommended over action cameras when the former are not cost-prohibitive. The kinds of questions a study seeks to address will also be important in determining the method used. For instance, an action camera might be preferable in environments that are relatively homogeneous (e.g., seagrass beds, dead coral/rubble habitats), or where only broad community metrics (such as abundance and diversity) are being assessed over large spatial scales. However, a DSLR camera might be deployed in cases where tracking fine-scale changes in individual organisms or substrates is of interest.
As this is a field method, the model outputs will depend on various environmental factors such as lighting, water clarity, surface conditions, amount of surge, and movement of fish or non-stationary benthic structures (e.g., seagrass). Although there are no absolute thresholds for when it is appropriate to use this method, slightly overcast days with high water clarity, calm surface conditions, and little surge typically produce the best models. Moreover, these methods require a minimum water depth. They do not work well where there is less than 0.5 m of water because of the low overlap between photos and the fewer distinguishing features per photo. However, this does highlight another advantage of the action camera: it is smaller and thus easier to use at shallower depths. Furthermore, a smaller-diameter spool and a higher frame rate (or wider-angle lens) can improve image overlap in very shallow conditions9.
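The effect of shallow water on overlap can be illustrated with a simplified pinhole-camera calculation. In the Python sketch below, the image footprint on the substrate scales with camera height, while the spacing between adjacent passes is set by the spool circumference. The 90° field of view is an assumed in-water value, not a measured property of either camera, and the 4-inch spool is a hypothetical smaller alternative to the 8-inch spool listed in the Table of Materials.

```python
import math

INCH_TO_M = 0.0254


def footprint_width_m(height_m, fov_deg):
    """Width of substrate covered by one image from a downward-facing camera."""
    return 2.0 * height_m * math.tan(math.radians(fov_deg) / 2.0)


def side_overlap(height_m, fov_deg, spool_diameter_m):
    """Fraction of an image shared with the adjacent pass of the spiral."""
    spacing = math.pi * spool_diameter_m          # line released per circuit
    width = footprint_width_m(height_m, fov_deg)
    return max(0.0, 1.0 - spacing / width)


ASSUMED_FOV_DEG = 90.0  # assumption for illustration only

for height in (1.0, 0.5):
    for inches in (8, 4):
        overlap = side_overlap(height, ASSUMED_FOV_DEG, inches * INCH_TO_M)
        print(f"height {height} m, {inches}-inch spool: {overlap:.0%} side overlap")
```

Under these assumptions, halving the camera height from 1 m to 0.5 m roughly halves the side overlap for the 8-inch spool, and halving the spool diameter restores it. This is consistent with the recommendation to use a smaller spool or a wider-angle lens in very shallow water.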
Many other data types can be integrated with this approach. For example, orthomosaics have been used to show the spatial density of molecular data (e.g., genes and metabolites) on corals24 and humans25 using the open source software 'ili'26. The same platform could also be used to map the spatial densities of animals, microorganisms, viruses, and/or chemicals in the environment. Other examples have used SfM for annotating benthic species spatially onto orthomosaics using geographic information system software10. Furthermore, the 3D models generated by SfM can be used to estimate habitat characteristics such as rugosity and fractal dimension. Indeed, the methods outlined here were recently used to derive a new geometric theory for habitat surfaces10. Finally, orthomosaics are being used as input surfaces for spatially explicit computational models, allowing for dynamical simulations to be overlaid on the model's 3D surface. Being able to easily generate large images and 3D representations of benthic habitats has allowed marine scientists to address hitherto unimagined questions3.
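As an illustration of how a habitat metric might be derived from a model, the Python sketch below estimates surface rugosity from a gridded DEM. It uses one common definition, the ratio of 3D surface area to planar area, which is an assumption made here; the studies cited above may use other formulations. The DEM is represented as a plain nested list of heights on a regular grid, and all names are illustrative.

```python
import math


def triangle_area(p, q, r):
    """Area of a 3D triangle from the cross product of two edge vectors."""
    ux, uy, uz = (q[k] - p[k] for k in range(3))
    vx, vy, vz = (r[k] - p[k] for k in range(3))
    cx = uy * vz - uz * vy
    cy = uz * vx - ux * vz
    cz = ux * vy - uy * vx
    return 0.5 * math.sqrt(cx * cx + cy * cy + cz * cz)


def rugosity(dem, cell_size):
    """Ratio of triangulated surface area to planar area for a gridded DEM."""
    rows, cols = len(dem), len(dem[0])
    surface = 0.0
    for i in range(rows - 1):
        for j in range(cols - 1):
            x0, x1 = j * cell_size, (j + 1) * cell_size
            y0, y1 = i * cell_size, (i + 1) * cell_size
            a = (x0, y0, dem[i][j])
            b = (x1, y0, dem[i][j + 1])
            c = (x0, y1, dem[i + 1][j])
            d = (x1, y1, dem[i + 1][j + 1])
            # Split each grid cell into two triangles.
            surface += triangle_area(a, b, c) + triangle_area(b, d, c)
    planar = (rows - 1) * (cols - 1) * cell_size ** 2
    return surface / planar


flat = [[0.0] * 4 for _ in range(4)]
ramp = [[float(j) for j in range(4)] for _ in range(4)]  # 45-degree slope
print(rugosity(flat, 1.0), rugosity(ramp, 1.0))
```

A flat surface gives a rugosity of 1, and a uniform 45° slope gives the square root of 2. Values computed this way depend on the DEM cell size, so comparisons between sites or cameras should use the same grid resolution.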
Overall, this article presents a detailed protocol for conducting underwater SfM photogrammetry with either DSLR cameras or more cost-effective action cameras. These methods can be used by scientists for a broad range of purposes, from extracting data about benthic ecosystems to developing 3D input surfaces for in silico simulations. However, these protocols can also be used by non-scientists as part of citizen science efforts to gather valuable information on patterns of biodiversity, habitat complexity, community structure, and other ecological metrics.
The authors have nothing to disclose.
We thank the Paul G. Allen Family Foundation for funding this research and are grateful to Ruth Gates for the inspiration to use technology to help conserve reefs. We also thank NOAA and other collaborators for thoughtful discussion concerning these methods. Lastly, we thank Catie Foley and Patrick Nichols for providing the drone and underwater video of these methods.
We acknowledge the National Fish and Wildlife Foundation as a funding partner in this work.
Name | Company | Catalog Number | Comments |
Action camera (GoPro Hero7 Black) | GoPro | | Could be any waterproof action camera |
Adobe Lightroom | Adobe | | Color correction |
Calibration tiles (flat PVC board cut to size for Agisoft targets; attach a dive weight underneath if expecting waves) | | | Any negatively buoyant object of known size and color. We recommend using the scale marker templates available from Agisoft Metashape software (v.1.6.0). |
DSLR camera (Canon EOS Rebel SL3) | Canon | 3453C002AA | Could be any DSLR camera in an underwater housing |
Line (plastic clothesline filament) | | | Any negatively buoyant line that is strong enough to withstand field use |
Micro SDXC memory card (for GoPro) | | | |
Oceanic Veo 2.0 | Oceanic | | Digital depth gauge |
SDXC memory card (for DSLR) | | | Any SDXC memory card should work, so long as there is enough space to hold all the pictures necessary to build the model |
Spool (2-inch-long section of 8-inch-diameter PVC pipe attached to a 3-foot section of 1-inch PVC pipe to form the stem) | | | Any negatively buoyant, round object of the desired diameter |
Underwater camera housing for DSLR (Ikelite 200DLM/C Underwater TTL Housing) | Ikelite | 6970.09 | Should be the specific underwater housing for the DSLR make and model |
Windows 10 desktop computer with an Intel i9-9900K 8-core CPU, two Nvidia GeForce RTX 2070 SUPER GPUs, and 128 GB of RAM | | | Processing |