The present protocol describes an efficient multi-organ segmentation method called Swin-PSAxialNet, which achieves higher accuracy than previous segmentation methods. The key steps of this procedure include dataset collection, environment configuration, data preprocessing, model training and comparison, and ablation experiments.
Abdominal multi-organ segmentation is one of the most important topics in the field of medical image analysis, and it plays an important role in supporting clinical workflows such as disease diagnosis and treatment planning. In this study, an efficient multi-organ segmentation method called Swin-PSAxialNet, based on the nnU-Net architecture, is proposed. It was designed specifically for the precise segmentation of 11 abdominal organs in CT images. The proposed network makes the following improvements over nnU-Net. Firstly, Space-to-depth (SPD) modules and parameter-shared axial attention (PSAA) feature extraction blocks were introduced, enhancing the capability of 3D image feature extraction. Secondly, a multi-scale image fusion approach was employed to capture detailed information and spatial features, improving the extraction of subtle features and edge features. Lastly, a parameter-sharing method was introduced to reduce the model's computational cost and training time. The proposed network achieves an average Dice coefficient of 0.93342 for the segmentation task involving 11 organs. Experimental results indicate the notable superiority of Swin-PSAxialNet over previous mainstream segmentation methods. The method achieves excellent accuracy with low computational cost in segmenting major abdominal organs.
Contemporary clinical intervention, including the diagnosis of diseases, the formulation of treatment plans, and the tracking of treatment outcomes, relies on the accurate segmentation of medical images1. However, the complex structural relationships among abdominal organs2 make it a challenging task to achieve accurate segmentation of multiple abdominal organs3. Over the past few decades, the flourishing developments in medical imaging and computer vision have presented both new opportunities and challenges in the field of abdominal multi-organ segmentation. Advanced Magnetic Resonance Imaging (MRI)4 and Computed Tomography (CT)5 technologies enable the acquisition of high-resolution abdominal images. The precise segmentation of multiple organs from CT images holds significant clinical value for the assessment and treatment of vital organs such as the liver, kidneys, spleen, and pancreas6,7,8,9,10. However, manual annotation of these anatomical structures, especially those requiring intervention from radiologists or radiation oncologists, is both time-consuming and susceptible to subjective influences11. Therefore, there is an urgent need to develop automated and accurate methods for abdominal multi-organ segmentation.
Previous research on image segmentation predominantly relied on Convolutional Neural Networks (CNNs), which improve segmentation efficiency by stacking layers and introducing residual structures such as ResNet12. In 2020, the Google research team introduced the Vision Transformer (ViT) model13, marking a pioneering instance of incorporating the Transformer architecture into the traditional visual domain for a range of visual tasks14. Whereas convolutional operations can only capture local feature information, the attention mechanism in Transformers enables the comprehensive consideration of global feature information.
Considering the superiority of Transformer-based architectures over traditional convolutional networks15, numerous research teams have extensively explored how to combine the strengths of Transformers and convolutional networks16,17,18,19. Chen et al. introduced TransUNet for medical image segmentation tasks16, which leverages Transformers to extract global features from images. However, owing to the high cost of network training and the absence of a hierarchical feature extraction scheme, the advantages of the Transformer were not fully realized.
To address these issues, many researchers have started experimenting with Transformers as the backbone of segmentation networks. Liu et al.17 introduced the Swin Transformer, which employs a hierarchical construction method for layered feature extraction. They proposed the concept of Window Multi-Head Self-Attention (W-MSA), significantly reducing the computational cost, particularly for the larger shallow-level feature maps. While this approach reduced computational requirements, it also isolated information transmission between different windows. To address this problem, the authors further introduced Shifted Window Multi-Head Self-Attention (SW-MSA), enabling information propagation among adjacent windows. Building upon this methodology, Cao et al. formulated Swin-UNet18, replacing the 2D convolutions in U-Net with Swin modules and incorporating W-MSA and SW-MSA into the encoding and decoding processes, achieving commendable segmentation outcomes.
Conversely, Zhou et al. highlighted that the advantages of convolutional operations cannot be ignored when processing high-resolution images19. Their proposed nnFormer employs a self-attention computation method based on local three-dimensional image blocks, constituting a Transformer model characterized by a cross-shaped structure. The use of attention based on local three-dimensional blocks significantly reduced the training load on the network.
Given the problems with the above studies, an efficient hybrid hierarchical structure for 3D medical image segmentation, termed Swin-PSAxialNet, is proposed. This method incorporates a Space-to-depth (SPD)20 downsampling block capable of extracting global information21. Additionally, it adds a parameter-shared axial attention (PSAA) module, which reduces the number of learnable attention parameters from quadratic to linear complexity, benefiting both the accuracy of network training and the complexity of the trained model22.
Swin-PSAxialNet network
The overall architecture of the network adopts the U-shaped structure of nnU-Net23, consisting of encoder and decoder structures. These structures engage in local feature extraction and the concatenation of features from large and small-scale images, as illustrated in Figure 1.
Figure 1: Swin-PSAxialNet schematic diagram of network architecture. Please click here to view a larger version of this figure.
In the encoder structure, the traditional Conv block is combined with the SPD block20 to form a downsampling unit. The first layer of the encoder incorporates Patch Embedding, a module that partitions the 3D data into non-overlapping 3D patches of size (P1, P2, P3); the number of resulting patches defines the sequence length of the 3D patch tokens. Following the embedding layer, the next step involves a non-overlapping convolutional downsampling unit comprising both a convolutional block and an SPD block. In this setup, the convolutional block has a stride of 1, and the SPD block is employed for image scaling, leading to a fourfold reduction in resolution and a twofold increase in channels.
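To make the Patch Embedding step concrete, the sketch below partitions a 3D volume into non-overlapping patch tokens using a strided 3D convolution. It is a minimal PyTorch-style illustration written for readability (the study's implementation is based on PaddlePaddle), and the patch size and embedding dimension shown are illustrative assumptions rather than the authors' exact settings.

```python
import torch
import torch.nn as nn

class PatchEmbedding3D(nn.Module):
    """Partition a 3D volume into non-overlapping (P1, P2, P3) patches.

    A Conv3d whose kernel size equals its stride maps each patch to a single
    embedding vector, so the output spatial grid is (H/P1, W/P2, D/P3) and the
    total number of patches defines the sequence length of the patch tokens.
    """
    def __init__(self, in_channels=1, embed_dim=48, patch_size=(2, 2, 2)):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):            # x: (B, C, H, W, D)
        return self.proj(x)          # (B, embed_dim, H/P1, W/P2, D/P3)

# Illustrative usage on a single-channel 64^3 CT sub-volume.
volume = torch.randn(1, 1, 64, 64, 64)
print(PatchEmbedding3D()(volume).shape)   # -> torch.Size([1, 48, 32, 32, 32])
```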
In the decoder structure, each decoder stage after the Bottleneck Feature layer consists of a combination of an upsampling block and a PSAA block. The resolution of the feature map is doubled, and the channel count is halved, between each pair of decoder stages. To restore spatial information and enhance feature representation, feature fusion between large- and small-scale images is performed between the upsampling blocks. Ultimately, the upsampling results are fed into the Head layer to restore the original image size, with an output size of (H × W × D × C, C = 3).
SPD block architecture
In traditional methods, the downsampling section employs a single convolution with a stride of 2. This performs convolutional pooling at local positions in the image, limiting the receptive field and confining the model to extracting features from small image patches. The proposed method instead utilizes the SPD block, which finely divides the original image in three dimensions. The original 3D image is evenly partitioned along the x, y, and z axes, resulting in four sub-volumes (Figure 2). Subsequently, the four sub-volumes are concatenated through a "cat" operation, and the resulting image undergoes a 1 × 1 × 1 convolution to obtain the downsampled image20.
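One plausible reading of this SPD block is sketched below: the feature map is sub-sampled at stride 2 at the four possible in-plane offsets, the four interleaved sub-volumes are concatenated along the channel axis, and a 1 × 1 × 1 convolution fuses them into the desired number of output channels. This PyTorch-style sketch reflects our own interpretation (splitting only the in-plane axes so that four sub-volumes yield a fourfold resolution reduction and, after the fusion convolution, a twofold channel increase); the authors' exact partitioning may differ.

```python
import torch
import torch.nn as nn

class SPDBlock3D(nn.Module):
    """Space-to-depth style downsampling (illustrative sketch).

    The in-plane axes are sub-sampled with stride 2 at the four possible
    offsets; the four sub-volumes are concatenated along the channel axis,
    and a 1x1x1 convolution mixes them into the output channels.
    """
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.fuse = nn.Conv3d(4 * in_channels, out_channels, kernel_size=1)

    def forward(self, x):                       # x: (B, C, H, W, D)
        subvolumes = [x[:, :, i::2, j::2, :]    # four interleaved sub-volumes
                      for i in range(2) for j in range(2)]
        x = torch.cat(subvolumes, dim=1)        # (B, 4C, H/2, W/2, D)
        return self.fuse(x)                     # (B, out_channels, H/2, W/2, D)

# Illustrative usage: a 32-channel feature map, channels doubled to 64.
feat = torch.randn(1, 32, 64, 64, 64)
print(SPDBlock3D(32, 64)(feat).shape)           # -> torch.Size([1, 64, 32, 32, 64])
```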
Figure 2: SPD block diagram. Please click here to view a larger version of this figure.
PSAA block architecture
In contrast to traditional CNN networks, the proposed PSAA block is more effective in focusing on global information and more efficient in network learning and training, enabling the capture of richer image and spatial features. The PSAA block includes axial attention learning based on parameter sharing across three dimensions: height, width, and depth. In comparison to the conventional attention mechanism that performs attention learning for each pixel in the image, this method independently conducts attention learning along each of the three dimensions, reducing the complexity of self-attention from quadratic to linear. Moreover, a learnable keys-queries parameter-sharing mechanism is employed, enabling the network to perform attention operations in parallel across the three dimensions, resulting in faster and more effective feature representation.
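A minimal sketch of the parameter-shared axial attention idea is shown below. It reuses one set of query/key/value projections for attention along each of the height, width, and depth axes, so each pass attends over a single axis rather than over all voxels jointly. This is a single-head PyTorch-style illustration under our own assumptions (the exact parameter-sharing scheme, head count, and positional encodings of the authors' PSAA block may differ), not the module used in the study.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PSAABlock(nn.Module):
    """Parameter-shared axial attention (single-head, illustrative sketch).

    The same query/key/value projections are reused for attention along the
    H, W, and D axes; each pass attends over one axis only, so the attention
    matrix is linear in each axis length rather than quadratic in H*W*D.
    """
    def __init__(self, channels):
        super().__init__()
        self.to_q = nn.Linear(channels, channels)   # shared across all 3 axes
        self.to_k = nn.Linear(channels, channels)
        self.to_v = nn.Linear(channels, channels)
        self.scale = channels ** -0.5

    def _axial_attention(self, x):
        # x: (..., L, C) - attend over the length-L axis.
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v

    def forward(self, x):                            # x: (B, C, H, W, D)
        x = x.permute(0, 2, 3, 4, 1)                 # (B, H, W, D, C)
        # Height axis: move H next to C, attend, move back.
        x = x + self._axial_attention(x.permute(0, 2, 3, 1, 4)).permute(0, 3, 1, 2, 4)
        # Width axis.
        x = x + self._axial_attention(x.permute(0, 1, 3, 2, 4)).permute(0, 1, 3, 2, 4)
        # Depth axis: already (..., D, C).
        x = x + self._axial_attention(x)
        return x.permute(0, 4, 1, 2, 3)              # back to (B, C, H, W, D)

# Illustrative usage on a small feature map.
feat = torch.randn(1, 16, 8, 8, 8)
print(PSAABlock(16)(feat).shape)                     # -> torch.Size([1, 16, 8, 8, 8])
```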
The present protocol was approved by the Ethics Committee of Nantong University. It involves the intelligent assessment and research of acquired non-invasive or minimally invasive multimodal data, including human medical images, limb movements, and vascular imaging, utilizing artificial intelligence technology. Figure 3 depicts the overall flowchart of multi-organ segmentation. All the necessary weblinks are provided in the Table of Materials.
Figure 3: Overall flowchart of the multi-organ segmentation. Please click here to view a larger version of this figure.
1. Dataset collection
2. Environment configuration
3. Data preprocessing
4. Model training and comparison
NOTE: nnU-Net23, a widely used baseline in the field of image segmentation, serves as the baseline model in this study. The specific model comparison process is as follows.
5. Ablation experiment
This protocol employs two metrics to evaluate the model: the Dice Similarity Coefficient (DSC) and the 95% Hausdorff Distance (HD95). DSC measures the overlap between the voxel-wise segmentation predictions and the ground truth, while HD95 measures the boundary distance between the predicted segmentation and the ground truth after excluding the largest 5% of distances as outliers. The definition of DSC26 is as follows:
$$\mathrm{DSC}(T, P) = \frac{2\left|T \cap P\right|}{\left|T\right| + \left|P\right|} \tag{1}$$
where T and P represent the sets of ground-truth and predicted voxels, respectively. The definition of 95% HD26 is as follows:
$$\mathrm{HD95}(T, P) = \max\left(d_{T,P},\ d_{P,T}\right) \tag{2}$$
where $d_{T,P}$ is the 95th percentile of the distances from the ground-truth voxels to the predicted voxels, and $d_{P,T}$ is the 95th percentile of the distances from the predicted voxels to the ground-truth voxels.
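Both metrics can be computed directly on binary voxel masks. The sketch below is a minimal NumPy/SciPy illustration under our own assumptions (voxel-based rather than surface-based distances, and non-empty masks); it is not the exact evaluation code used in the study.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def dice_coefficient(pred, gt):
    """Eq. (1): DSC = 2|T ∩ P| / (|T| + |P|) for binary voxel masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    return 2.0 * np.logical_and(pred, gt).sum() / denom if denom > 0 else 1.0

def hd95(pred, gt, spacing=(1.0, 1.0, 1.0)):
    """Eq. (2): maximum of the two directed 95th-percentile distances."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    # Distance from every voxel to the nearest foreground voxel of each mask.
    dist_to_gt = distance_transform_edt(~gt, sampling=spacing)
    dist_to_pred = distance_transform_edt(~pred, sampling=spacing)
    d_p_t = np.percentile(dist_to_gt[pred], 95)    # predicted -> ground truth
    d_t_p = np.percentile(dist_to_pred[gt], 95)    # ground truth -> predicted
    return max(d_p_t, d_t_p)

# Illustrative usage on dummy 3D masks.
gt = np.zeros((32, 32, 32), dtype=bool); gt[8:24, 8:24, 8:24] = True
pred = np.zeros_like(gt); pred[10:26, 8:24, 8:24] = True
print(dice_coefficient(pred, gt), hd95(pred, gt))
```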
Swin-PSAxialNet is compared with various mainstream multi-organ segmentation algorithms in Table 1. The results demonstrate improved segmentation performance for abdominal organs with the proposed algorithm. The average Dice score of the proposed method surpasses that of most compared methods by 2-3 percentage points and exceeds nnFormer by about 1 percentage point. Swin-PSAxialNet also exhibits high accuracy in boundary segmentation, with an average HD95 of 5.63, the lowest among all compared algorithms.
Methods | Lkn | Spl | Liv | Pan | Aro | Bla | Sto | Gbd | Pos | Eso | Rkn | Avg DSC | Avg HD95
nnU-Net | 96.3 | 96.8 | 97.21 | 85.2 | 95.22 | 86.57 | 91.47 | 84.7 | 91.79 | 82.5 | 95.44 | 91.2 | 7.13
Swin-Unet | 96.03 | 96.68 | 97.22 | 85.74 | 95.3 | 87.82 | 91.77 | 84.64 | 91.67 | 83.26 | 96.14 | 91.48 | 9.68
TransUNet | 96.42 | 96.74 | 97.05 | 85.09 | 94.86 | 85.2 | 91.81 | 83.33 | 91.52 | 82.22 | 96.12 | 90.94 | 13.77
UNETR | 95.94 | 96.48 | 96.85 | 84.14 | 94.62 | 85.01 | 90.18 | 81.16 | 90.79 | 80.34 | 95.15 | 90.06 | 12.92
nnFormer | 96.59 | 97.22 | 97.33 | 86.79 | 95.31 | 89.74 | 92.82 | 84.88 | 92.43 | 84.24 | 96.32 | 92.15 | 6.59
Swin-PSAxialNet | 96.85 | 97.67 | 97.64 | 88.39 | 95.97 | 92.03 | 94.4 | 87.78 | 93.15 | 86.24 | 96.59 | 93.34 | 5.63
Table 1: Comparison of mainstream segmentation algorithms. The abbreviations stand for the following organs: Lkn for left kidney, Spl for spleen, Liv for liver, Pan for pancreas, Aro for aorta, Bla for bladder, Sto for stomach, Gbd for gall bladder, Pos for postcava, Eso for esophagus, and Rkn for right kidney. Avg DSC denotes the average Dice coefficient and Avg HD95 the average 95% Hausdorff Distance.
Moreover, the introduction of a parameter-sharing mechanism and improvements to the transformer structure provide significant advantages in both model complexity and inference speed (Table 2, Figure 4, and Figure 5).
Models | Params (M) | FLOPs (G) | Inference Time (s) |
nnU-Net | 17.96 | 398.54 | 9.07 |
UNETR | 89.58 | 39.67 | 12.51 |
nnFormer | 134.78 | 152.43 | 11.24 |
Swin-PSAxialNet | 37.64 | 43.15 | 10.31 |
Table 2: Comparison of network model indicators (parameter count, FLOPs, and inference time).
Figure 4: Dice variation curve of model training network. (A) The current network. (B) nnFormer. (C) nnU-Net. (D) UNETR. Please click here to view a larger version of this figure.
Figure 5: Model training results comparison. The figure highlights the results of the current network, the results of nnFormer, the results of nnU-Net, and the results of UNETR. Please click here to view a larger version of this figure.
This study conducted comprehensive ablation experiments on the algorithm network, using the PSAA block and SPD block as variables, and comparing single-fold training with five-fold training. The Dice coefficients from the experimental results are presented in Table 3. These results demonstrate that utilizing the SPD block for downsampling and the PSAA block for upsampling effectively enhances segmentation performance compared to the unmodified nnU-Net (Figure 6).
Training scheme | PSAA Block for the entire network | SPD Block for the entire network | PSAA Block only for upsampling | PSAA Block for upsampling + SPD Block
Single-fold training | 92.17 | 90.51 | 92.24 | 92.39 |
Five-fold training | 92.82 | 91.26 | 92.79 | 93.34 |
Table 3: Results of the ablation experiment of Swin-PSAxialNet. In the ablation experiment, different placements of the PSAA block and the SPD block were compared, and the resulting DSC values are shown in the table.
Figure 6: The segmentation results. Please click here to view a larger version of this figure.
The segmentation of abdominal organs is a complicated task. Compared to other internal structures of the human body, such as the brain or heart, segmenting abdominal organs is more challenging because of the low contrast and large shape variations in CT images27,28. Swin-PSAxialNet is proposed here to solve this difficult problem.
In the data collection step, this study downloaded 200 images from the AMOS2022 official website24. In the environment configuration step, the experimental setup included the PaddlePaddle environment along with its corresponding CUDA version. During the data preprocessing step, random cropping and random flipping were applied to enhance data robustness. In the model training and comparison steps, based on the baseline model nnU-Net, the experiment added the SPD block for downsampling and the PSAA feature extraction block for upsampling, achieving excellent segmentation results. Swin-PSAxialNet was compared with various mainstream segmentation models and demonstrated promising results on the public dataset AMOS2022, achieving higher segmentation accuracy than the other segmentation methods mentioned in this study16,17,18,19. In the ablation experiment section, the usage of the PSAA block and SPD block was varied to find the optimal segmentation network. The results of the ablation experiment indicated that restricting the PSAA block to the upsampling path (combined with the SPD block for downsampling) yields higher accuracy than applying the PSAA block to the entire network.
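For reference, the random cropping and random flipping augmentations mentioned above can be sketched as follows in NumPy. The patch size, flip axes, and flip probability are illustrative assumptions and do not necessarily match the settings used in the study.

```python
import numpy as np

def random_crop(image, label, patch_size=(96, 96, 96)):
    """Crop the same random 3D patch from the image and its label map."""
    starts = [np.random.randint(0, s - p + 1)
              for s, p in zip(image.shape, patch_size)]
    slices = tuple(slice(st, st + p) for st, p in zip(starts, patch_size))
    return image[slices], label[slices]

def random_flip(image, label, prob=0.5):
    """Flip image and label together along each spatial axis with probability prob."""
    for axis in range(3):
        if np.random.rand() < prob:
            image = np.flip(image, axis=axis)
            label = np.flip(label, axis=axis)
    return image.copy(), label.copy()

# Illustrative usage on a dummy CT volume and its segmentation mask.
ct = np.random.randn(128, 128, 128).astype(np.float32)
mask = np.random.randint(0, 12, size=(128, 128, 128))
patch_img, patch_lbl = random_flip(*random_crop(ct, mask))
print(patch_img.shape, patch_lbl.shape)   # (96, 96, 96) (96, 96, 96)
```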
The innovations of this study are as follows: Firstly, the SPD module is incorporated into the network to constitute the downsampling unit alongside the Conv block, thereby enhancing the network's feature extraction capabilities. Secondly, the introduction of the PSAA feature extraction block effectively reduces network complexity while improving training accuracy. Lastly, the introduction of a parameter-sharing mechanism significantly reduces the complexity of the training model.
This study has limitations in the following aspects. It has only been applied to the AMOS2022 dataset, and the results have not been validated on other datasets. Additionally, although nnU-Net, as an automated segmentation framework, has achieved high performance in abdominal organ segmentation tasks after a series of optimizations, some parameters still involve a certain degree of subjectivity, such as the initial learning rate, the number of iterations, and the network architecture. Future research will explore methods for the automatic determination of these parameters29.
Swin-PSAxialNet has achieved excellent segmentation results in abdominal organ segmentation tasks, demonstrating strong generalization capabilities and stability. In future research, additional datasets will be employed to investigate the algorithm's generalization capabilities further and make adjustments to the model. Moreover, the algorithm's structure can undergo further optimization to enhance the overall accuracy and robustness of the network. The aim is to integrate the network with multimodal capabilities to transform it into a medical multimodal model.
The authors have nothing to disclose.
This study was supported by the '333' Engineering Project of Jiangsu Province ([2022]21-003), the Wuxi Health Commission General Program (M202205), and the Wuxi Science and Technology Development Fund (Y20212002-1), whose contributions have been invaluable to the success of this work. The authors thank all the research assistants and study participants for their support.
Name | Company | Catalog Number / Link | Comments |
AMOS2022 dataset | None | None | Datasets for network training and testing. The weblink is: https://pan.baidu.com/s/1x2ZW5FiZtVap0er55Wk4VQ?pwd=xhpb |
ASUS mainframe | ASUS | https://www.asusparts.eu/en/asus-13020-01910200 | |
CUDA version 11.7 | NVIDIA | https://developer.nvidia.com/cuda-11-7-0-download-archive | |
NVIDIA GeForce RTX 3090 | NVIDIA | https://www.nvidia.com/en-in/geforce/graphics-cards/30-series/rtx-3090-3090ti/ | |
Paddlepaddle environment | Baidu | None | Environmental preparation for network training. The weblink is: https://www.paddlepaddle.org.cn/ |
PaddleSeg | Baidu | None | The baseline we use: https://github.com/PaddlePaddle/PaddleSeg |