The goal of the work presented in this article is to develop technology for automated recognition of food and beverage items from images taken by mobile devices. The technology comprises two different approaches – the first performs food image recognition, while the second performs food image segmentation.
Due to the issues and costs associated with manual dietary assessment approaches, automated solutions are required to ease and speed up the work and increase its quality. Today, automated solutions are able to record a person’s dietary intake in a much simpler way, such as by taking an image with a smartphone camera. In this article, we will focus on such image-based approaches to dietary assessment. For the food image recognition problem, deep neural networks have achieved the state of the art in recent years, and we present our work in this field. In particular, we first describe the method for food and beverage image recognition using a deep neural network architecture, called NutriNet. This method, like most research done in the early days of deep learning-based food image recognition, is limited to one output per image, and therefore unsuitable for images with multiple food or beverage items. That is why approaches that perform food image segmentation are considerably more robust, as they are able to identify any number of food or beverage items in the image. We therefore also present two methods for food image segmentation – one is based on fully convolutional networks (FCNs), and the other on deep residual networks (ResNet).
Dietary assessment is a crucial step in determining actionable areas of an individual's diet. However, performing dietary assessment using traditional manual approaches is associated with considerable costs. These approaches are also prone to errors, as they often rely on self-reporting by the individual. Automated dietary assessment addresses these issues by providing a simpler way to quantify and qualify food intake. Such an approach can also alleviate some of the errors present in manual approaches, such as missed meals, inability to accurately assess food volume, etc. Therefore, there are clear benefits to automating dietary assessment by developing solutions that identify different foods and beverages and quantify food intake1. These solutions can also be used to estimate the nutritional values of food and beverage items (henceforth 'food items'). Consequently, automated dietary assessment is useful for multiple applications – from strictly medical uses, such as allowing dietitians to more easily and accurately track and analyze their patients' diets, to use within well-being apps targeted at the general population.
Automatically recognizing food items from images is a challenging computer vision problem. This is because foods are typically deformable objects, and because a large amount of a food item's visual information can be lost during its preparation. Additionally, different foods can appear very similar to each other, and the same food can appear substantially different in different images2. Furthermore, the recognition accuracy depends on many more factors, such as image quality, whether the food item is obstructed by another item, the distance from which the image was taken, etc. Recognizing beverage items presents its own set of challenges, the main one being the limited amount of visual information available in an image. This information may include the beverage color, beverage container color and structure, and, under optimal image conditions, the beverage density2.
To successfully recognize food items from images, it is necessary to learn features of each food and beverage class. This was traditionally done using manually-defined feature extractors3,4,5,6 that perform recognition based on specific item features like color, texture, size, etc., or a combination of these features. Examples of these feature extractors include multiple kernel learning4, pairwise local features5 and the bag-of-features model6. Due to the complexity of food images, these approaches mostly achieved a low classification accuracy – between 10% and 40%3,4,5. The reason for this is that manually-defined features are not robust enough: because a food item can vary significantly in appearance, it is not feasible to encompass all of these variations by hand. Higher classification accuracy can be achieved with manually-defined feature extractors when either the number of food classes is reduced5 or different image features are combined6, indicating that more complex solutions to this problem are needed.
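For illustration, a typical manually-defined feature of this kind is a color histogram. The following minimal sketch, written with OpenCV (listed in the materials table), shows only the general idea of such a hand-crafted descriptor; it is not one of the cited methods, and the function name and parameters are illustrative.

```python
import cv2
import numpy as np

def color_histogram_feature(image_path, bins=8):
    """Illustrative hand-crafted feature: a normalized HSV color histogram.
    Descriptors like this were typically passed to a classical classifier
    (e.g., an SVM) in pre-deep-learning food recognition pipelines."""
    image = cv2.imread(image_path)
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
    # 3D histogram over the hue, saturation and value channels
    hist = cv2.calcHist([hsv], [0, 1, 2], None, [bins, bins, bins],
                        [0, 180, 0, 256, 0, 256])
    hist = cv2.normalize(hist, hist).flatten()  # fixed-length feature vector
    return hist
```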
This is why deep learning proved to be so effective for the food image recognition problem. Deep learning, or deep neural networks, was inspired by biological brains, and allows computational models composed of multiple processing layers to automatically learn features through training on a set of input images7,8. Because of this, deep learning has substantially improved the state of the art in a variety of research fields7, with computer vision, and subsequently food image recognition, being one of them2.
In particular, deep convolutional neural networks (DCNNs) are the most popular for food image recognition – these networks are inspired by the visual system of animals, where individual neurons try to gain an understanding of the visual input by reacting to overlapping regions in the visual field9. A convolutional neural network takes the input image and performs a series of operations in each of the network layers, the most common of which are convolutional, fully-connected and pooling layers. Convolutional layers contain learnable filters that respond to certain features in the input data, whereas fully-connected layers combine output data from other layers to gain higher-level knowledge from it. The goal of pooling layers is to down-sample the input data2. Two approaches to using deep learning models have proved popular: taking an existing deep neural network definition10,11, referred to as a deep learning architecture in this article, or defining a new deep learning architecture12,13, and training either one on a food image dataset. There are strengths and weaknesses to both approaches – when using an existing deep learning architecture, an architecture that performed well for other problems can be chosen and fine-tuned for the desired problem, saving time and ensuring that a validated architecture has been chosen. Defining a new deep learning architecture, on the other hand, is more time-intensive, but allows the development of architectures that are tailored to the specifics of a problem and thus, in theory, perform better for that problem.
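To make these layer types concrete, the following minimal PyTorch sketch strings together convolutional, pooling and fully-connected layers for a hypothetical 10-class problem; it is a toy example, not one of the architectures discussed in this article.

```python
import torch
import torch.nn as nn

class TinyFoodCNN(nn.Module):
    """Toy network illustrating the three common layer types described above."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer: learnable filters
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling layer: down-samples the input
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # fully-connected layer: combines the extracted features into class scores
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)

    def forward(self, x):  # x: a batch of 3 x 224 x 224 images
        x = self.features(x)
        return self.classifier(x.flatten(1))
```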
In this article, we present both approaches. For the food image recognition problem, we developed a novel DCNN architecture called NutriNet2, which is a modification of the well-known AlexNet architecture14. There are two main differences compared to AlexNet: NutriNet accepts 512×512-pixel images as input (as opposed to 256×256-pixel images for AlexNet), and NutriNet has an additional convolutional layer at the beginning of the neural network. These two changes were introduced in order to extract as much information from the recognition dataset images as possible. Higher-resolution images contain more information, and the additional convolutional layer allows further knowledge to be extracted from them. Compared to AlexNet's roughly 60 million parameters, NutriNet contains fewer: approximately 33 million. This is because of the difference in dimensionality at the first fully-connected layer caused by the additional convolutional layer2. Figure 1 contains a diagram of the NutriNet architecture. The food images used to train the NutriNet model were gathered from the Internet – the procedure is described in the protocol text.
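The parameter counts quoted above can be verified for any network with a short utility such as the one below (a PyTorch sketch using torchvision's standard AlexNet; it is included only to show how such counts are obtained and does not reproduce NutriNet itself).

```python
import torch.nn as nn
from torchvision.models import alexnet

def count_parameters(model: nn.Module) -> int:
    """Total number of trainable parameters in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Standard AlexNet has roughly 61 million parameters; a NutriNet implementation
# would be checked the same way and should come out at approximately 33 million.
print(count_parameters(alexnet()))
```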
For the food image segmentation problem, we used two different existing architectures: fully convolutional networks (FCNs)15 and deep residual networks (ResNet)16, both of which represented the state of the art for image segmentation when we used them to develop their respective food image segmentation solutions. There are multiple FCN variants that were introduced by Long et al.: FCN-32s, FCN-16s and FCN-8s15. FCN-32s outputs a pixel map based on the predictions of the FCN's final layer, whereas the FCN-16s variant combines these predictions with those of an earlier layer. FCN-8s considers yet another layer's predictions and is therefore able to make predictions at the finest grain, which is why it is suitable for food image segmentation. The FCN-8s that we used was pre-trained on the PASCAL Visual Object Classes (PASCAL VOC) dataset17 and trained and tested on images of food replicas (henceforth 'fake food')18, due to their visual resemblance to real food and due to a lack of real-food images annotated on the pixel level. Fake food is used in various behavioral studies, where images are taken of every dish of every study participant. Because the food content of each image is known, the resulting image dataset is useful for training deep learning models. Dataset processing steps are described in the protocol text.
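The way FCN-16s and FCN-8s combine predictions from several layers can be sketched as follows; this is a simplified PyTorch illustration in which bilinear upsampling stands in for the learned deconvolution layers of the original FCNs, and the tensor names are ours, not the original layer names.

```python
import torch.nn.functional as F

def fuse_fcn_scores(score_final, score_pool4, score_pool3):
    """Fuse coarse final-layer class scores with scores from two earlier layers.
    All tensors have shape (batch, num_classes, H, W) at their own resolutions."""
    # Upsample the coarsest scores and add the earlier-layer scores (FCN-16s idea)
    up = F.interpolate(score_final, size=score_pool4.shape[2:],
                       mode='bilinear', align_corners=False)
    fused_16s = up + score_pool4
    # Repeat with an even earlier layer to obtain finer-grained FCN-8s scores
    up = F.interpolate(fused_16s, size=score_pool3.shape[2:],
                       mode='bilinear', align_corners=False)
    return up + score_pool3
```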
The ResNet-based solution was developed in the scope of the Food Recognition Challenge (FRC)19. It uses the Hybrid Task Cascade (HTC)20 method with a ResNet-10116 backbone. This is a state-of-the-art approach to the image segmentation problem that can use different feature extractors, or backbones. We considered other backbone networks as well, particularly other ResNet variants such as ResNet-5016, but ResNet-101 was the most suitable due to its depth and its ability to represent input images with sufficient complexity. The dataset used for training the HTC ResNet-101 model was the FRC dataset with added augmented images. These augmentations are presented in the protocol text.
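As an illustration of how a ResNet-101 backbone is swapped in, the fragment below is written in the style of an MMDetection configuration file (MMDetection is listed in the materials table). The base config path and key names vary between MMDetection versions and are placeholders, not the exact configuration used in this work.

```python
# Illustrative MMDetection-style config: HTC with a ResNet-101 backbone.
_base_ = './htc_r50_fpn_1x_coco.py'  # placeholder path to an HTC base config

model = dict(
    backbone=dict(
        depth=101,  # replace the default ResNet-50 backbone with ResNet-101
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet101')))
```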
This article is intended as a resource for machine learning experts looking for information about which deep learning architectures and data augmentation steps perform well for the problems of food image recognition and segmentation, as well as for nutrition researchers looking to use our approach to automate food image recognition for use in dietary assessment. In the paragraphs below, deep learning solutions and datasets from the food image recognition field are presented. In the protocol text, we detail how each of the three approaches was used to train deep neural network models that can be used for automated dietary assessment. Additionally, each protocol section contains a description of how the food image datasets used for training and testing were acquired and processed.
DCNNs generally achieved substantially better results than other methods for food image recognition and segmentation, which is why the vast majority of recent research in the field is based on these networks. Kawano et al. used DCNNs to complement manual approaches21 and achieved a classification accuracy of 72.26% on the UEC-FOOD100 dataset22. Christodoulidis et al. used them exclusively to achieve a higher accuracy of 84.90% on a self-acquired dataset23. Tanno et al. developed DeepFoodCam – a smartphone app for food image recognition that uses DCNNs24. Liu et al. presented a system that performs an Internet of Things-based dietary assessment using DCNNs25. Martinel et al. introduced a DCNN-based approach that exploits the specifics of food images26 and reported an accuracy of 90.27% on the Food-101 dataset27. Zhou et al. authored a review of deep learning solutions in the food domain28.
Recently, Zhao et al. proposed a network specifically for food image recognition in mobile applications29. This approach uses a smaller 'student' network that learns from a larger 'teacher' network. With it, they managed to achieve an accuracy of 84% on the UEC-FOOD256 dataset30 and an accuracy of 91.2% on the Food-101 dataset27. Hafiz et al. used DCNNs to develop a beverage-only image recognition solution and reported a very high accuracy of 98.51%31. Shimoda et al. described a novel method for detecting plate regions in food images without the use of pixel-wise annotation32. Ciocca et al. introduced a new dataset containing food items from 20 different food classes in 11 different states (solid, sliced, creamy paste, etc.) and presented their approach for training recognition models that are able to recognize the food state, in addition to the food class33. Knez et al. evaluated food image recognition solutions for mobile devices34. Finally, Furtado et al. conducted a study on how the human visual system compares to the performance of DCNNs and found that human recognition still outperforms DCNNs, with an accuracy of 80% versus 74.5%35. The authors noted that with a small number of food classes, the DCNNs perform well, but on a dataset with hundreds of classes, human recognition accuracy is higher35, highlighting the complexity of the problem.
Despite its state-of-the-art results, deep learning has a major drawback – it requires a large input dataset to train the model on. In the case of food image recognition, a large food image dataset is required, and this dataset needs to encompass as many different real-world scenarios as possible. In practice this means that for each individual food or beverage item, a large collection of images is required, and as many different items as possible need to be present in the dataset. If there are not enough images for a specific item in the dataset, that item is unlikely to be recognized successfully. On the other hand, if only a small number of items is covered by the dataset, the solution will be limited in scope, and only able to recognize a handful of different foods and beverages.
Multiple datasets were made available in the past. The Pittsburgh Fast-Food Image Dataset (PFID)3 was introduced to encourage more research in the field of food image recognition. The University of Electro-Communications Food 100 (UEC-FOOD100)22 and University of Electro-Communications Food 256 (UEC-FOOD256)30 datasets contain Japanese dishes, expanded with some international dishes in the case of the UEC-FOOD256 dataset. The Food-101 dataset contains popular dishes acquired from a website27. The Food-5036 and Video Retrieval Group Food 172 (VireoFood-172)37 datasets are Chinese-based collections of food images. The University of Milano-Bicocca 2016 (UNIMIB2016) dataset is composed of images of food trays from an Italian canteen38. Recipe1M is a large-scale dataset of cooking recipes and food images39. The Food-475 dataset40 collects four previously published food image datasets27,30,36,37 into one. The Beijing Technology and Business University Food 60 (BTBUFood-60) is a dataset of images meant for food detection41. Recently, the ISIA Food-500 dataset42 of miscellaneous food images was made available. In comparison to other publicly available food image datasets, it contains a large number of images, divided into 500 food classes, and is meant to advance the development of multimedia food recognition solutions42.
1. Food image recognition with NutriNet
2. Food image segmentation with FCNs
3. Food image segmentation with HTC ResNet
NutriNet was tested against three popular deep learning architectures of the time: AlexNet14, GoogLeNet51 and ResNet16. Multiple training parameters were also tested for all architectures to define the optimal values2. Among these is the choice of solver type, which determines how the loss function is minimized. This function is the primary quality measure for training neural networks, as it is better suited for optimization during training than classification accuracy. We tested three solvers: Stochastic Gradient Descent (SGD)52, Nesterov's Accelerated Gradient (NAG)53 and the Adaptive Gradient algorithm (AdaGrad)54. The second parameter is batch size, which defines the number of images that are processed at the same time. The depth of the deep learning architecture determined the value of this parameter, as deeper architectures require more GPU memory – as a consequence, the memory was completely filled with images for all architectures, regardless of depth. The third parameter is learning rate, which defines the speed with which the neural network parameters are changed during training. This parameter was set in conjunction with the batch size, as the number of concurrently processed images dictates the convergence rate. AlexNet models were trained using a batch size of 256 images and a base learning rate of 0.02; NutriNet used a batch size of 128 images and a rate of 0.01; GoogLeNet 64 images and a rate of 0.005; and ResNet 16 images and a rate of 0.00125. Three other parameters were fixed for all architectures: learning rate policy (step-down), step size (30%) and gamma (0.1). These parameters jointly describe how the learning rate changes in every epoch. The idea behind this approach is that the learning rate is gradually lowered to fine-tune the model as it approaches the optimal loss value. Finally, the number of training epochs was fixed to 150 for all deep learning architectures2.
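To make the fixed learning rate policy concrete, the sketch below shows one common interpretation of the step-down ('step') policy, in which the rate is multiplied by gamma after every step-size fraction of the training epochs; the exact behavior of the training toolchain may differ slightly.

```python
def step_down_learning_rate(base_lr, epoch, total_epochs=150,
                            step_fraction=0.30, gamma=0.1):
    """Learning rate at a given epoch under a step-down policy (assumed interpretation)."""
    step_size = int(total_epochs * step_fraction)  # 30% of 150 epochs = 45 epochs
    return base_lr * (gamma ** (epoch // step_size))

# Example with NutriNet's base learning rate of 0.01:
for epoch in (0, 45, 90, 135):
    print(epoch, step_down_learning_rate(0.01, epoch))  # 0.01, 0.001, 0.0001, 1e-05
```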
Across all the tested parameters, the best result NutriNet achieved was a classification accuracy of 86.72% on the recognition dataset, which was around 2% higher than the best result for AlexNet and slightly higher than GoogLeNet's best result. The best-performing architecture overall was ResNet (by around 1%); however, the training time for ResNet is substantially higher compared to NutriNet (by a factor of approximately five), which is important if models are continuously re-trained to improve accuracy and expand the number of recognizable food items. NutriNet, AlexNet and GoogLeNet achieved their best results using the AdaGrad solver, whereas ResNet's best model used the NAG solver. NutriNet was also tested on the publicly available UNIMIB2016 food image dataset38. This dataset contains 3,616 images of 73 different food items. NutriNet achieved a recognition accuracy of 86.39% on this dataset, slightly outperforming the baseline recognition result of the authors of the dataset, which was 85.80%. Additionally, NutriNet was tested on a small dataset of 200 real-world images of 115 different food and beverage items, where it achieved a top-5 accuracy of 55%.
To train the FCN-8s fake-food image segmentation model, we used Adam55 as the solver type, as we found that it performed optimally for this task. The base learning rate was set very low, to 0.0001. The reason for this low value is that only one image could be processed at a time, which is a consequence of the pixel-level classification process: the GPU memory requirements of this approach are significantly greater than those of image-level classification. The learning rate therefore had to be set low so that the parameters would not change too quickly and converge to suboptimal values. The number of training epochs was set to 100, while the learning rate policy, step size and gamma were set to step-down, 34% and 0.1, respectively, as these parameters produced the most accurate models.
Accuracy measurements of the FCN-8s model were performed using the pixel accuracy measure15, which is analogous to the classification accuracy of traditional deep learning networks, the main difference being that the accuracy is computed on the pixel level instead of on the image level:

$$PA = \frac{\sum_i n_{ii}}{\sum_i t_i}$$

where PA is the pixel accuracy measure, n_ij is the number of pixels from class i predicted to belong to class j and t_i = Σ_j n_ij is the total number of pixels from class i in the ground-truth labels1. In other words, the pixel accuracy measure is computed by dividing correctly predicted pixels by the total number of pixels. The final accuracy of the trained FCN-8s model was 92.18%. Figure 2 shows three example images from the fake-food image dataset (one from each of the training, validation and testing subsets), along with the corresponding ground-truth and model prediction labels.
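For reference, the pixel accuracy measure can be computed from a per-class confusion matrix with a few lines of Python; this is a simplified sketch, not the evaluation code used in this work.

```python
import numpy as np

def pixel_accuracy(confusion):
    """Pixel accuracy from a confusion matrix where confusion[i, j] is the number
    of pixels of ground-truth class i predicted as class j (n_ij in the text):
    correctly predicted pixels divided by the total number of pixels."""
    confusion = np.asarray(confusion, dtype=float)
    return np.trace(confusion) / confusion.sum()

# Toy two-class example: (90 + 95) correct pixels out of 200 -> 0.925
print(pixel_accuracy([[90, 10],
                      [5, 95]]))
```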
The parameters to train the HTC20 ResNet-101 model for food image segmentation were set as follows: the solver type used was SGD because it outperformed other solver types. The base learning rate was set to 0.00125 and the batch size to 2 images. The number of training epochs was set to 40 per training phase, and multiple training phases were performed – first on the original FRC dataset without augmented images, then on the 8x-augmented and 4x-augmented FRC dataset multiple times in an alternating fashion, each time taking the best-performing model and fine-tuning it in the next training phase. More details on the training phases can be found in section 3.3 of the protocol text. Finally, the step-down learning policy was used, with fixed epochs for when the learning rate decreased (epochs 28 and 35 for the first training phase). An important thing to note is that while this sequence of training phases produced the best results in our testing in the scope of the FRC, using another dataset might require a different sequence to produce optimal results.
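The schedule values above could be written out roughly as follows, again as an MMDetection-style sketch; the momentum, weight decay and step-down gamma are not specified in this article and appear here only as commonly used placeholder values.

```python
# Illustrative MMDetection-style schedule fragments for the first training phase.
optimizer = dict(type='SGD', lr=0.00125,
                 momentum=0.9, weight_decay=0.0001)    # momentum/weight decay: assumed values
lr_config = dict(policy='step', step=[28, 35])         # learning rate drops at epochs 28 and 35
runner = dict(type='EpochBasedRunner', max_epochs=40)  # 40 epochs per training phase
data = dict(samples_per_gpu=2)                         # batch size of 2 images

# Subsequent phases start from the best checkpoint of the previous phase and
# alternate between the 8x- and 4x-augmented FRC datasets, as described above.
```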
This ResNet-based solution for food image segmentation was evaluated using the following precision measure19:

$$P = \frac{TP}{TP + FP}$$
where P is precision, TP is the number of true positive predictions by the food image segmentation model, FP is the number of false positive predictions and IoU is Intersection over Union, which is computed with this equation:

$$IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}}$$
where Area of Overlap represents the number of predictions by the model that overlap with the ground truth, and Area of Union represents the total number of predictions by the model together with the ground truth, both on a pixel level and for each individual food class. Recall is used as a secondary measure and is calculated in a similar way, using the following formula19:

$$R = \frac{TP}{TP + FN}$$
where R is recall and FN is the number of false negative predictions by the food image segmentation model. The precision and recall measures are then averaged across all classes in the ground truth. Using these measures, our model achieved an average precision of 59.2% and an average recall of 82.1%, which ranked second in the second round of the Food Recognition Challenge19. This result was 4.2% behind the first place and 5.3% ahead of the third place in terms of the average precision measure. Table 1 contains the results for the top-4 participants in the competition.
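For concreteness, the quantities above can be computed along the following lines; this is a simplified per-class sketch, and the official Food Recognition Challenge evaluation code19 remains the authoritative definition.

```python
import numpy as np

def iou(pred_mask, gt_mask):
    """Intersection over Union for two boolean pixel masks of one food class."""
    pred_mask, gt_mask = np.asarray(pred_mask, bool), np.asarray(gt_mask, bool)
    union = np.logical_or(pred_mask, gt_mask).sum()
    return np.logical_and(pred_mask, gt_mask).sum() / union if union else 0.0

def precision(tp, fp):
    """P = TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """R = TP / (TP + FN)."""
    return tp / (tp + fn)
```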
Figure 1: Diagram of the NutriNet deep neural network architecture. This figure has been published in Mezgec et al.2.
Figure 2: Images from the fake-food image dataset. Original images (left), manually-labelled ground-truth labels (middle) and predictions from the FCN-8s model (right). This figure has been published in Mezgec et al.1.
Team Name | Placement | Average Precision | Average Recall |
rssfete | 1 | 63.4% | 88.6% |
simon_mezgec | 2 | 59.2% | 82.1% |
arimboux | 3 | 53.9% | 73.5% |
latentvec | 4 | 48.7% | 71.1% |
Table 1: Top-4 results from the second round of the Food Recognition Challenge. Average precision is taken as the primary performance measure and average recall as a secondary measure. Results are taken from the official competition leaderboard19.
Supplemental Files.
In recent years, deep neural networks have been validated multiple times as a suitable solution for recognizing food images10,11,12,21,23,25,26,29,31,33. Our work presented in this article serves to further prove this1,2. The single-output food image recognition approach is straightforward and can be used for simple applications where images with only one food or beverage item are expected2.
The food image segmentation approach seems particularly suitable for recognizing food images in general, without any restriction on the number of food items1. Because it works by classifying each individual pixel of the image, it is able to not only recognize any number of food items in the image, but also specify where a food item is located, as well as how large it is. The latter can then be used to perform food weight estimation, particularly if used with either a reference object or a fixed-distance camera.
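As a purely hypothetical illustration of the size estimation mentioned above: if a reference object of known physical size is visible and segmented in the same image, the pixel-to-area scale can be transferred to the food region, assuming both lie in approximately the same plane at the same distance from the camera. The function and values below are illustrative only.

```python
def estimated_food_area_cm2(food_pixels, reference_pixels, reference_area_cm2):
    """Estimate the real-world area of a segmented food region from a segmented
    reference object of known area (hypothetical sketch; assumes a roughly
    planar scene and ignores perspective distortion)."""
    cm2_per_pixel = reference_area_cm2 / reference_pixels
    return food_pixels * cm2_per_pixel

# Example: a credit card (~46 cm^2) covering 12,000 pixels and a food region of 55,000 pixels
print(estimated_food_area_cm2(55_000, 12_000, 46.0))  # approximately 211 cm^2
```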
There has been some work done regarding the availability of food image datasets3,22,27,30,36,37,38,39,40,41,42, and we hope more will be done in the future, particularly when it comes to aggregating food image datasets from different regions across the world, which would enable more robust solutions to be developed. Currently, the accuracy of automatic food image recognition solutions has not yet reached human-level accuracy35, and this is likely in large part due to the usage of food image datasets of insufficient size and quality.
In the future, our goal will be to further evaluate the developed procedures on real-world images. In general, datasets in this field often contain images taken in controlled environments or images that were manually optimized for recognition. This is why it is important to gather a large and diverse real-world food image dataset to encompass all the different food and beverage items that individuals might want to recognize. The first step towards this was provided by the Food Recognition Challenge, which included a dataset of real-world food images19, but further work needs to be done to validate this approach on food images from all around the world and in cooperation with dietitians.
The authors have nothing to disclose.
The authors would like to thank Tamara Bucher from the University of Newcastle, Australia, for providing the fake-food image dataset. This work was supported by the European Union's Horizon 2020 research and innovation programs (grant numbers 863059 – FNS-Cloud, 769661 – SAAM); and the Slovenian Research Agency (grant number P2-0098). The European Union and Slovenian Research Agency had no role in the design, analysis or writing of this article.
HARDWARE | |||
NVIDIA GPU | NVIDIA | N/A | An NVIDIA GPU is needed as some of the software frameworks below will not work otherwise. https://www.nvidia.com |
SOFTWARE | |||
Caffe | Berkeley AI Research | N/A | Caffe is a deep learning framework. https://caffe.berkeleyvision.org |
CLoDSA | Jónathan Heras | N/A | CLoDSA is a Python image augmentation library. https://github.com/joheras/CLoDSA |
Google API Client | Google | N/A | Google API Client is a Python client library for Google's discovery based APIs. https://github.com/googleapis/google-api-python-client |
JavaScript Segment Annotator | Kota Yamaguchi | N/A | JavaScript Segment Annotator is a JavaScript image annotation tool. https://github.com/kyamagu/js-segment-annotator |
MMDetection | Multimedia Laboratory, CUHK | N/A | MMDetection is an object detection toolbox based on PyTorch. https://github.com/open-mmlab/mmdetection |
NVIDIA DIGITS | NVIDIA | N/A | NVIDIA DIGITS is a wrapper for Caffe that provides a graphical web interface. https://developer.nvidia.com/digits |
OpenCV | Intel | N/A | OpenCV is a library for computer vision. https://opencv.org |
Python | Python Software Foundation | N/A | Python is a programming language. https://www.python.org |
PyTorch | Facebook AI Research | N/A | PyTorch is a machine learning framework. https://pytorch.org |
Ubuntu OS | Canonical | N/A | Ubuntu 14.04 is the OS used by the authors and offers compatibility with all of the software frameworks and tools above. https://ubuntu.com |