When we grasp an object, multiple regions of the fingers and hand typically make contact with the object’s surface. Reconstructing such contact regions is challenging. Here, we present a method for approximately estimating the contact regions by combining marker-based motion capture with existing deep learning-based hand mesh reconstruction.
To grasp an object successfully, we must select appropriate contact regions for our hands on the surface of the object. However, identifying such regions is challenging. This paper describes a workflow to estimate the contact regions from marker-based tracking data. Participants grasp real objects while we track the 3D position of both the objects and the hand, including the fingers' joints. We first determine the joint Euler angles from a selection of tracked markers positioned on the back of the hand. Then, we use state-of-the-art hand mesh reconstruction algorithms to generate a mesh model of the participant's hand in its current pose and 3D position.
Using objects that were either 3D printed or 3D scanned, and are thus available as both real objects and mesh data, allows the hand and object meshes to be co-registered. In turn, this allows the estimation of approximate contact regions by calculating the intersections between the hand mesh and the co-registered 3D object mesh. The method may be used to estimate where and how humans grasp objects under a variety of conditions. Therefore, the method could be of interest to researchers studying visual and haptic perception, motor control, human-computer interaction in virtual and augmented reality, and robotics.
The capacity to grasp and manipulate objects is a key ability that allows humans to reshape the environment to their wants and needs. However, controlling multi-jointed hands effectively is a challenging task that requires a sophisticated control system. This motor control system is guided by several forms of sensory input, amongst which vision is paramount. Through vision, individuals can identify objects in the environment, estimate their position and physical properties, and then reach for, grasp, and manipulate those objects with ease. Understanding the complex system that links the input at the retinae with the motor commands that control the hands is a key challenge of sensorimotor neuroscience. To model, predict, and understand how this system works, we must first be able to study it in detail. This requires high-fidelity measurements of both visual inputs and hand motor outputs.
Past motion-tracking technology has imposed a number of limitations on the study of human grasping. For example, systems requiring cables attached to the participants' hands1,2 tend to restrict the range of finger motions, potentially altering the grasping movements or the measurements themselves. Despite such limitations, previous research has been able to identify several factors that influence visually guided grasping. These factors include object shape3,4,5,6, surface roughness7,8,9, and the orientation of an object relative to the hand4,8,10. However, to overcome previous technological limitations, the majority of this prior research has employed simple stimuli and highly constrained tasks, thus predominantly focusing on individual factors3,4,6,7,10, two-digit precision grips3,4,6,9,11,12,13,14,15,16,17,18, single objects19, or very simple 2D shapes20,21. How previous findings generalize beyond such reduced and artificial lab conditions is unknown. Additionally, the measurement of hand-object contact is often reduced to the estimation of digit contact points22. This simplification may be appropriate for describing a small subset of grasps in which only the fingertips are in contact with an object. However, in the majority of real-world grasps, extensive regions of the fingers and palm come into contact with an object. Further, a recent study23 has demonstrated, using a haptic glove, that objects can be recognized by how their surface impinges on the hand. This highlights the importance of studying the extended contact regions between the hands and the grasped objects, not just the contact points between the objects and the fingertips22.
Recent advances in motion capture and 3D hand modeling have allowed us to move past previous limitations and to study grasping in its full complexity. Passive marker-based motion tracking is now available with millimeter-sized markers that can be attached to the back of the participant's hands to track joint movements24. Further, automatic marker identification algorithms for passive marker systems are now sufficiently robust to almost eliminate the need for the extensive manual postprocessing of marker data25,26,27. Markerless solutions are also reaching impressive levels of performance at tracking animal body parts in videos28. These motion-tracking methods, thus, finally allow reliable and non-invasive measurements of complex multi-digit hand movements24. Such measurements can inform us about joint kinematics and enable us to estimate the contact points between the hand and an object. Additionally, in recent years, the computer vision community has been tackling the problem of constructing models of the human hands that can replicate the soft tissue deformations during object grasping and even during self-contact between hand parts29,30,31,32. Such 3D mesh reconstructions can be derived from different types of data, such as video footage33,34, skeletal joints (derived from marker-based35 or markerless tracking36), and depth images37. The first key advance in this domain was provided by Romero et al.38, who derived a parametric hand model (MANO) from over 1,000 hand scans from 31 subjects in various poses. The model contains parameters for both the pose and shape of the hand, facilitating regression from different sources of data to a full hand reconstruction. The more recent DeepHandMesh29 solution builds on this approach by constructing a parametrized model through deep learning and by adding penetration avoidance, which more accurately replicates physical interactions between hand parts. By combining such hand mesh reconstructions with 3D tracked object meshes, it is, thus, now possible to estimate contact regions not just on the surface of objects32 but also on the surface of the hand.
Here, we propose a workflow that brings together the high-fidelity 3D tracking of objects and hand joints with novel hand mesh reconstruction algorithms. The method yields detailed maps of hand-object contact surfaces. These measurements will assist sensorimotor neuroscientists in extending our theoretical understanding of human visually guided grasping. Further, the method could be useful to researchers in adjacent fields. For example, human factors researchers may use this method to construct better human-machine interface systems in virtual and augmented reality18. High-fidelity measurements of human grasping behaviors can also assist roboticists in designing human-inspired robotic grasping systems based on the principles of interactive perception39,40,41,42,43. We, thus, hope that this method will help advance grasping research across neuroscience and engineering fields from sparse descriptions of highly constrained tasks to fuller characterizations of naturalistic grasping behaviors with complex objects and real-world tasks. The overall approach is outlined in Figure 1.
Figure 1: Key steps in the proposed method. (A) Motion capture cameras image a workbench from multiple angles. (B) A stimulus object is 3D printed from a triangulated mesh model. (C) Four spherical reflective markers are glued to the surface of the real object. A semi-automated procedure identifies four corresponding points on the surface of the mesh object. This correspondence allows us to roto-translate the mesh model to the 3D tracked position of the real object. (D) Reflective markers are attached to different landmarks on the back of a participant's hand using double-sided tape. (E) The motion capture system acquires the trajectories in 3D space of the tracked object and hand markers during a single trial. (F) A participant-specific hand skeleton is constructed using 3D computer graphics software. Skeletal joint poses are then estimated for each frame of each trial in an experiment through inverse kinematics. (G) Joint poses are input to a modified version of DeepHandMesh29, which outputs an estimated 3D hand mesh in the current 3D pose and position. (H) Finally, we use mesh intersection to compute the hand-object contact regions.
Prior to beginning an experiment, the participants must provide informed consent in accordance with the institutional guidelines and the Declaration of Helsinki. All the protocols described here have been approved by the local ethics committee of Justus Liebig University Giessen (LEK-FB06).
1. Installation of all the necessary software
2. Preparing the motion capture system
3. Calibrating the cameras
4. Creating a stimulus object
5. Preparing the stimulus object
6. Co-registering real and mesh model versions of the stimulus object
7. Setting up markers on the hands
Figure 2: Marker placement on a participant's hand. Abbreviation: RH = right hand.
8. Acquiring a single trial
9. Labeling the markers
10. Creating a personalized skeleton definition for the participant
11. Reconstructing the skeletal joint poses
12. Generating hand mesh reconstructions
13. Generating hand-object contact region estimates
The first requirement for the proposed method is a system to accurately track the 3D position of objects and hands. The specific setup is shown in Figure 1A and uses hardware and software produced by the motion capture company Qualisys. We position a workbench within a tracking volume (100 cm x 100 cm x 100 cm), which is imaged from multiple angles by eight tracking cameras and six video cameras arranged on a cubical frame surrounding the workspace. The tracking cameras record the 3D positions of reflective markers within the tracking volume at 180 frames/s and with sub-millimeter spatial resolution. We employ 4 mm reflective markers, which are attached to the objects and hands using skin-friendly double-sided adhesive tape. The 3D marker positions are processed by the motion capture software. The discussion section also reviews alternative motion capture systems that could be employed with the proposed method.
To obtain accurate 3D reconstructions of real objects being grasped and manipulated, we propose two options. The first, which is the one adopted here, is to start from a virtual 3D object model in the form of a polygon mesh. Such 3D models can be constructed using appropriate software (e.g., Blender 3D44) and then 3D printed (Figure 1B). The second option is to take an existing, real 3D object and use 3D scanning technology to construct a mesh model replica of the object. Whichever strategy is chosen, the end goal is to obtain both a real 3D object and the corresponding virtual 3D object mesh model. Of note, the approach described here works only with rigid (i.e., non-deformable) objects.
Once the 3D surface of an object is available as a mesh model, its position must be tracked and co-registered (Figure 1C). To do so, four non-planar reflective markers are attached to the surface of the real object, and the object is placed within the tracking volume. The 3D positions of the object markers are then briefly captured. This capture is used to establish the correspondence between the four markers and four vertices of the object mesh model. This is done using a simple ad hoc routine written with Blender's Python API. Within Blender's viewport, the program presents the virtual object together with the marker positions, which are represented as a single mesh object comprising one sphere per marker. The user then rotates and translates the object and/or the markers until the virtual spheres match the positions of the markers on the real object. The program records the rotations and translations that were applied and combines them into a single roto-translation, which is finally applied to the original object mesh, yielding an object mesh that is co-registered with the rigid body definition in QTM.
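To illustrate the final step of such a routine, the following minimal Blender Python sketch bakes the accumulated viewport transform of the aligned object into its mesh data, so that the vertex coordinates are expressed directly in the tracking coordinate frame. The object name and output file are hypothetical placeholders; the actual routine in the project repository is more elaborate.

```python
import bpy
from mathutils import Matrix

# Hypothetical object name; in practice, use the name of the aligned stimulus
# mesh in the current .blend scene.
obj = bpy.data.objects["stimulus_mesh"]

# After the manual alignment in the viewport, bake the accumulated
# rotation/translation into the mesh data, so that the vertex coordinates
# are stored directly in the tracking (QTM) coordinate frame.
obj.data.transform(obj.matrix_world)
obj.matrix_world = Matrix.Identity(4)

# Save the scene containing the co-registered mesh for later processing steps.
bpy.ops.wm.save_as_mainfile(filepath="object_coregistered.blend")
```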
Having established this correspondence, whenever the real object is moved within the tracking volume, the virtual object can be placed in the new position by computing the roto-translation between the tracked markers and the four corresponding mesh vertices (a least-squares rigid fit; see the sketch following this paragraph). To record the dynamics of the grasp itself, a total of 24 spherical reflective markers are attached to different landmarks of the hand using double-sided tape (Figure 1D and Figure 2).
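For illustration, the per-frame roto-translation can be obtained as a least-squares rigid fit between the four corresponding mesh vertices and the four tracked marker positions (the Kabsch algorithm). The following NumPy sketch shows this computation under the assumption that markers and vertices are supplied in corresponding order; it is an illustration, not the project's actual code.

```python
import numpy as np

def rigid_transform(markers, vertices):
    """Least-squares rigid fit (Kabsch algorithm) mapping the four mesh
    vertices onto the four tracked marker positions.
    Both inputs are (4, 3) arrays, in the same units and corresponding order."""
    c_m = markers.mean(axis=0)
    c_v = vertices.mean(axis=0)
    H = (vertices - c_v).T @ (markers - c_m)   # 3 x 3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                   # guard against reflections
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = c_m - R @ c_v
    return R, t

# Applying the fit to every vertex of the co-registered object mesh places the
# virtual object at the real object's current tracked position:
# posed_vertices = (R @ mesh_vertices.T).T + t
```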
At the beginning of a trial (Figure 1E), a participant places their hand flat on the workbench with the palm facing downward and closes their eyes. The experimenter places a target object on the workbench in front of the participant. Next, an auditory cue signals to the participant to open their eyes and execute the grasp. In our demonstrations, the task is to reach and grasp the target object, lift it vertically by approximately 10 cm, set it down, and return the hand to its starting position. A script written in Python 3.7 controls the experiment. On each trial, the script selects and communicates the current condition settings to the experimenter (e.g., object identity and positioning). The script also controls the trial timing, including auditory cues and the start and stop of the motion capture recordings.
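For illustration, a minimal sketch of such a trial loop is given below. The mocap_control module and its start_recording/stop_recording helpers are hypothetical stand-ins for calls to the Qualisys SDK for Python, and the condition list, timing, and beep cue are placeholders rather than the settings used in the actual experiment script.

```python
import random
import time
import winsound  # Windows-only beep; the workstation in this setup runs Windows 10

# Hypothetical helpers wrapping the Qualisys SDK for Python (start/stop/save of
# a QTM capture); the actual calls depend on the SDK version and the QTM setup.
from mocap_control import start_recording, stop_recording

# Example condition list: (object identity, placement) pairs; placeholders only.
CONDITIONS = [("croissant", "near"), ("croissant", "far"),
              ("cat", "near"), ("cat", "far")]

def run_trial(trial_idx, obj_name, placement):
    # Tell the experimenter how to set up this trial, then wait for confirmation.
    input(f"Trial {trial_idx}: place the {obj_name} at the {placement} "
          "position, then press Enter.")
    start_recording(f"trial_{trial_idx:03d}")   # begin the motion capture recording
    winsound.Beep(1000, 200)                    # auditory go cue for the participant
    time.sleep(5.0)                             # reach, grasp, lift, replace, return
    stop_recording()                            # end and save the recording

trials = CONDITIONS * 5                         # e.g., 5 repetitions per condition
random.shuffle(trials)
for i, (obj_name, placement) in enumerate(trials, start=1):
    run_trial(i, obj_name, placement)
```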
Limbs are not only characterized by their position in 3D space but also by their pose. Thus, to obtain a complete 3D reconstruction of a human hand executing a real grasp, we require not only the positions of each joint in 3D space but also the relative pose (translation and rotation) of each joint with respect to its parent joint (Figure 1F). Skeletal joint positions and orientations can be inferred from marker positions using inverse kinematics. To do so, here we employ the skeleton solver provided by the QTM software. For the solver to work, we must first provide a skeleton definition that links the position and orientation of each joint to multiple marker positions. A skeleton definition is, thus, constructed, and the skeleton rig is linked to the marker data using the QTM Connect plugin for Maya. We create personalized skeleton definitions for each participant to maximize the accuracy of the skeleton fits to the marker data. For each participant, we manually fit a hand skeleton to a single frame of motion capture data. Having obtained a participant-specific skeleton definition, we then run the skeleton solver to estimate the skeletal joint poses for each frame of each trial in the experiment.
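For illustration, once the solver has produced global joint transforms, the pose of a joint relative to its parent can be extracted as Euler angles, for example with SciPy. The function below is an illustrative sketch that assumes 4 x 4 homogeneous transforms as input; it is not part of the QTM/Maya pipeline itself.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def local_joint_pose(parent_world, child_world, order="XYZ"):
    """Pose of a joint relative to its parent joint.

    parent_world, child_world: 4x4 homogeneous transforms of the two joints in
    the global tracking frame (e.g., as exported from the skeleton solver).
    Returns the intrinsic Euler angles (degrees, in the given order) and the
    local translation of the child joint with respect to its parent.
    """
    local = np.linalg.inv(parent_world) @ child_world
    euler = Rotation.from_matrix(local[:3, :3]).as_euler(order, degrees=True)
    return euler, local[:3, 3]
```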
For each frame of each trial in an experiment, we generate a hand mesh that reconstructs the current hand pose using the open-source, pretrained hand mesh generation tool DeepHandMesh29 (Figure 1G). DeepHandMesh is a deep encoder-decoder network that generates personalized hand meshes from images. First, the encoder estimates the pose of a hand within an image (i.e., the joint Euler angles). Then, the estimated hand pose and a personalized ID vector are input to the decoder, which estimates a set of three additive correctives to a rigged template mesh. Finally, the template mesh is deformed according to the estimated hand pose and correctives using linear blend skinning. The first corrective is an ID-dependent skeleton corrective, through which the skeletal rig is adjusted to incorporate the person-specific joint positions. The other two correctives are mesh correctives, through which the mesh vertices are adjusted to better represent the hand surface of the participant. One of these is an ID-dependent mesh corrective that accounts for the surface structure of an individual participant's hand; the other is a pose-dependent vertex corrective that accounts for hand surface deformations due to the current hand pose.
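As an illustration of the linear blend skinning step (not DeepHandMesh's actual implementation), a minimal NumPy sketch is shown below; it assumes that the per-joint transforms have already been multiplied by the inverse bind-pose transforms of the template rig.

```python
import numpy as np

def linear_blend_skinning(vertices, weights, joint_transforms):
    """Minimal linear blend skinning (LBS) step.

    vertices:         (V, 3) template vertices, after the additive correctives.
    weights:          (V, J) skinning weights; each row sums to 1.
    joint_transforms: (J, 4, 4) posed joint transforms, already multiplied by
                      the inverse bind-pose transforms of the template rig.
    Returns the (V, 3) posed vertices.
    """
    homogeneous = np.hstack([vertices, np.ones((len(vertices), 1))])  # (V, 4)
    # Blend the per-joint transforms for each vertex, then apply them.
    blended = np.einsum("vj,jab->vab", weights, joint_transforms)     # (V, 4, 4)
    posed = np.einsum("vab,vb->va", blended, homogeneous)             # (V, 4)
    return posed[:, :3]
```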
DeepHandMesh is trained using weak supervision with 2D joint key points and scene depth maps. Here, we use only the pretrained DeepHandMesh decoder to generate hand mesh reconstructions, modified in the following ways (Figure 3). First, as the network was not trained on our specific participants, the generic ID-dependent mesh corrective provided with the pretrained model is employed (Figure 3A). Further, the ID-dependent skeleton corrective is derived using the QTM skeleton solver, as described above (Figure 3B). Proportional scaling of the hand with the skeleton length is assumed, and the mesh thickness is uniformly scaled by a factor derived from the relative scaling of the skeleton so that the mesh better approximates the participant's hand size (Figure 3C). This modified mesh is input to the decoder, together with the current hand pose (derived from the marker data) and the 3D position and orientation of the wrist. The decoder then computes the current pose-dependent corrective, applies all the correctives and roto-translations, and outputs a 3D hand mesh reconstruction of the current hand pose in the same coordinate frame as the 3D tracked object mesh (Figure 3D).
Figure 3: Modifications to the pretrained DeepHandMesh decoder. (A) Fixed, generic ID-dependent mesh corrective. (B) ID-dependent skeleton corrective derived through inverse kinematics in step 10. (C) The size of the hand mesh is scaled by the same factor as the skeletal joints. (D) Final 3D hand mesh reconstruction of the current hand pose.
Having reconstructed 3D mesh models for both a participant's hand and a grasped object, we can estimate hand-object contact regions by computing the intersection between the hand and object meshes (Figure 1H). The underlying assumption is that the real hand deforms upon contact with the object surface, so the hand skeleton can come closer to the surface than would be possible if the hand were rigid; this allows portions of the (non-deformed) reconstructed hand mesh to pass through the object mesh. As a result, the contact areas can be approximated as the regions of overlap between the two meshes.
Specifically, to compute these regions of overlap, we define object mesh vertices that are contained within the 3D volume of the hand mesh as being in contact with the hand. These vertices are identified using a standard raytracing approach45. For each vertex of the object mesh, a ray is cast from that vertex to an arbitrary 3D point outside the hand mesh. We then assess the number of intersections that occur between the cast ray and the triangles composing the hand's surface. If the number of intersections is odd, the object vertex is contained inside the hand mesh. If the number of intersections is even, then the object vertex is outside the hand mesh. The contact regions on the surface of the object can, thus, be approximated as the set of triangle faces whose vertices are all contained within the hand mesh. We can apply the same rationale to the hand mesh vertices contained in the 3D volume of the object mesh to estimate the contact regions on the surface of the hand. Notably, more advanced approaches to Boolean mesh operations could also be used31.
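For illustration, the inside/outside test and the selection of contact faces can be implemented with an open-source mesh library such as trimesh. The sketch below makes that assumption, and the file names are placeholders; the project code may implement the ray-parity test differently.

```python
import numpy as np
import trimesh

# Hand and object meshes in the same (tracking) coordinate frame for one frame.
hand = trimesh.load("hand_frame_0420.ply")   # hypothetical file names
obj = trimesh.load("croissant_posed.ply")

# Object vertices that fall inside the closed hand mesh (ray-parity test).
inside = hand.contains(obj.vertices)         # boolean array, one entry per vertex

# Contact faces: triangles whose three vertices are all inside the hand mesh.
contact_faces = np.all(inside[obj.faces], axis=1)
contact_area = obj.area_faces[contact_faces].sum()

# Swapping the roles of the two meshes gives the contact regions on the hand.
print(f"Estimated contact area on the object: {contact_area:.1f} mm^2")
```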
Video 1 shows the hand, the tracked points, and the co-registered hand and object meshes moving side-by-side during a single grasp of a 3D-printed cat figurine. Figure 4A shows a single frame at the time of hand-object contact from a grasp of a 3D-printed croissant, together with the hand and object mesh reconstructions (Figure 4B) and the estimated contact regions on the surface of the croissant (Figure 4C).
Figure 4: Estimated hand-object contact regions. (A) Tracked hand and object viewed from one of the tracking cameras during a grasp. (B) Reconstructed hand mesh and tracked object mesh rendered from the same viewpoint as the tracking camera. (C) Contact regions on the surface of the object seen from multiple viewpoints.
Video 1: Mesh reconstructions of the hand and object. GIF animation of the hand, the tracked markers, and the hand and object mesh reconstructions during a single grasp, viewed from the same camera viewpoint.
We propose a method that enables the estimation of contact regions for hand-object interactions during multi-digit grasps. As full tracking of the entire surface of a hand is currently intractable, we propose using a reconstruction of a hand mesh whose pose is determined by sparse key points on the hand. To track these sparse key points, our solution employs a research-grade motion capture system based on passive marker tracking. Of course, other motion capture systems could also be employed with the proposed method, provided that they yield sufficiently accurate 3D position data. We advise against active marker motion capture systems (such as the popular but discontinued Optotrak Certus), since these require attaching cables and/or electronic devices to the participants' hands, which may restrict movements or at least yield less typical grasps, as participants are made more consciously aware of the pose of their hands. Motion-tracking gloves using inertial measurement units may be a possibility, even though these systems are known to suffer from drift, may also restrict hand movements, and do not allow the surface of the hand to come into full and direct contact with the object surfaces. Commercial markerless hand-tracking solutions (e.g., the Leap Motion46,47,48) may also be a possibility, although it may not be possible to track object positions with these systems alone. The most promising alternative to a research-grade motion capture system is given by open-source, markerless tracking solutions (e.g., Mathis et al.28). If used with multiple co-registered cameras49, such systems could potentially track hand joint positions and object positions in 3D without the need for markers, gloves, or cables. However, these solutions, like the present marker-based system, may suffer from data loss due to occlusions.
Limitations and future directions
As the hand reconstructions obtained through this method will not be fully accurate, there are some limitations to the types of experiments for which the method should be used. Deviations in hand mesh reconstructions from ground truth will manifest themselves in deviations in the estimated hand/object contact regions. Thus, applying this method to derive absolute measures would require assessing the fidelity of the contact region estimates. However, even approximate estimates can still be useful in within-participant experimental designs because the potential biases of the method are likely to affect different experimental conditions within a participant in a similar way. Therefore, statistical analyses and inferences should be performed only on measures such as the differences in contact area between conditions, where the direction of an effect will correlate with the respective ground truth. In future research, we plan to further validate our approach, for example by comparing contact region estimates to thermal fingerprints on objects covered in thermochromic paint.
Most processing steps from the data collection to the final contact region estimation are fully automated and, thus, offer important contributions toward a standardized procedure for hand-object contact region estimation. However, an initial fit of the individualized skeleton to the 3D positions of the tracked markers must still be performed manually to obtain a skeleton definition for each participant. As the number of participants in an experiment increases, so does the number of manual fits; this is currently the most time-consuming step in the procedure, and it requires some familiarity with manual rigging in Autodesk Maya. In the future, we aim to remove human influence on this step by adding an automatic skeleton calibration procedure.
The workflow described here relies on the Qualisys hardware and software (e.g., the QTM skeleton solver). This currently limits the accessibility of our method to laboratories that have a similar setup. In principle, however, the method can be applied to any source of motion capture data. To expand the accessibility, in ongoing work, we are exploring alternatives that should generalize our workflow and make it less reliant on specific hardware and software licenses.
Another important limitation of the method is that, in its current form, it can only be applied to rigid (non-deformable) objects. In the future, this limitation could be overcome using methods for recording the surface shape of the grasped object as it deforms. Additionally, due to its approximate nature, the method is not currently well suited to very small or thin objects.
In conclusion, by integrating state-of-the-art motion tracking with high-fidelity hand surface modeling, we provide a method to estimate hand-object contact regions during grasping and manipulation. In future research, we plan to deploy this method to investigate and model visually guided grasping behavior in humans16. We further plan to integrate these tools with eye tracking46,50,51,52 and virtual/augmented reality systems53,54,55 to investigate visually guided hand and eye movement motor control in real and virtual naturalistic environments18,46,56,57. For these reasons, the proposed method could be of interest to researchers studying haptic perception58, motor control, and human-computer interaction in virtual and augmented reality. Finally, accurate measurements of human grasping abilities could inform the design of robust robotic systems based on the principles of interactive perception39,40,41,42,43 and may have translational applications for upper limb prosthetics.
The authors have nothing to disclose.
This research was funded by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation: project No. 222641018-SFB/TRR 135 TP C1 and IRTG-1901 "The Brain in Action") and by the Research Cluster "The Adaptive Mind" funded by the Excellence Program of the Hessian Ministry of Higher Education, Science, Research, and Art. The authors thank the Qualisys support team, including Mathias Bankay and Jeffrey Thingvold, for assistance in developing our methods. The authors also thank Michaela Jeschke for posing as the hand model. All data and analysis scripts to reproduce the method and the results presented in the manuscript are available on Zenodo (doi: 10.5281/zenodo.7458911).
Name | Company | Version / Link | Comments
Anaconda Python distribution | Anaconda, Inc. | Anaconda 5.3.1 or later; https://repo.anaconda.com/archive/ | Scripts and functions were generated in Python version 3.7.
Autodesk Maya | Autodesk, Inc. | Maya 2022; https://www.autodesk.com/products/maya/overview | 3D computer graphics application.
Blender | Blender Foundation | Blender 2.92; https://download.blender.org/release/ | 3D computer graphics application.
Computer workstation | N/A | N/A | OS: Windows 10 or higher.
DeepHandMesh | Meta Platforms, Inc. (Meta Open Source) | https://github.com/facebookresearch/DeepHandMesh | Pretrained hand mesh generation tool.
Miqus M5 | Qualisys AB | https://www.qualisys.com/cameras/miqus/ | Passive marker motion tracking camera (8 units).
Miqus video camera | Qualisys AB | https://www.qualisys.com/cameras/miqus-video/ | Color video camera, synchronized with the Miqus M5 tracking cameras (6 units).
Project repository | N/A | Data and code repository | Data and analysis scripts to reproduce the method and the results presented in the manuscript; available on Zenodo (doi: 10.5281/zenodo.7458911).
Python 3 | Python Software Foundation | Python version 3.7 | Python 3 and associated built-in libraries.
QTM Connect for Maya | Qualisys AB | https://github.com/qualisys/QTM-Connect-For-Maya | Streams skeleton, rigid body, and marker data from QTM to Maya.
Qualisys Track Manager (QTM) | Qualisys AB | Qualisys Track Manager 2021.2; https://www.qualisys.com/software/qualisys-track-manager/ | Motion capture software.
Qualisys SDK for Python | Qualisys AB | https://github.com/qualisys/qualisys_python_sdk | Implements communication between QTM and Python.