End-to-End Semi-Supervised Opportunistic Osteoporosis Screening Using Computed Tomography
Article information
Abstract
Background
Osteoporosis is the most common metabolic bone disease and can cause fragility fractures. Despite this, screening utilization rates for osteoporosis remain low among populations at risk. Automated bone mineral density (BMD) estimation using computed tomography (CT) can help bridge this gap and serve as an alternative screening method to dual-energy X-ray absorptiometry (DXA).
Methods
The feasibility of an opportunistic and population agnostic screening method for osteoporosis using abdominal CT scans without bone densitometry phantom-based calibration was investigated in this retrospective study. A total of 268 abdominal CT-DXA pairs and 99 abdominal CT studies without DXA scores were obtained from an oncology specialty clinic in the Republic of Korea. The center axial CT slices from the L1, L2, L3, and L4 lumbar vertebrae were annotated with the CT slice level and spine segmentation labels for each subject. Deep learning models were trained to localize the center axial slice from the CT scan of the torso, segment the vertebral bone, and estimate BMD for the top four lumbar vertebrae.
Results
Automated vertebra-level DXA measurements showed a mean absolute error (MAE) of 0.079, Pearson’s r of 0.852 (P<0.001), and R2 of 0.714. Subject-level predictions on the held-out test set had a MAE of 0.066, Pearson’s r of 0.907 (P<0.001), and R2 of 0.781.
Conclusion
CT scans collected during routine examinations without bone densitometry calibration can be used to generate DXA BMD predictions.
INTRODUCTION
Osteoporosis is a common condition that causes bone fragility. With increasing life expectancy and a graying population across the world, more individuals are projected to be at risk for this condition [1,2]. Osteoporotic fractures are associated with a significantly increased risk of long-term disability [3] and mortality [4,5]. In the United States, approximately 21% of women and 32% of men aged 65 or older die within 1 year of a hip fracture [6]. Although the condition is both preventable and treatable with early detection as well as its associated socioeconomic burden, osteoporosis remains underdiagnosed [7].
The gold standard modality for estimating bone mineral density (BMD) and diagnosing osteoporosis is dual-energy X-ray absorptiometry (DXA) [8-10]. However, timely diagnosis of this condition is difficult as osteoporosis is asymptomatic [10], and individuals at risk of fracture often have other comorbidities that require more immediate care [11]. To address this disconnection between disease prevalence and diagnosis, previous studies proposed manual and machine learning-based opportunistic osteoporosis screening methods using computed tomography (CT) as an alternative to DXA [12-14]. These methods do not require additional radiation, cost, or, if completely automated, input from health professionals. However, as these systems do not generate end-to-end predictions [15,16] or population agnostic BMD using DXA (BMDDXA) estimates [17-20], there are limitations in applying these results to real-world clinical practice.
This study proposes an automated, end-to-end, deep learning-based method for opportunistic osteoporosis screening of the top four lumbar vertebrae using abdominal CT scans. In contrast to the results of previous studies on CT-based osteoporosis screening, our system generates BMDDXA estimates that are population agnostic and do not rely on calibration using bone densitometry phantoms.
METHODS
Data
This retrospective study was approved by the Institutional Review Board of the National Cancer Center in the Republic of Korea (NCC2020-0165), and all identifying information was anonymized before analysis. The data used in this study included 367 (198 females and 169 males) contrast-enhanced abdominal CT scans of patients who underwent routine cancer screening between 2010 and 2019 at National Cancer Center in the Republic of Korea. Of these subjects, 268 had associated lumbar DXA values collected within 180 days of their CT scan, irrespective of the imaging order. Patients with spinal implants or DXA z-scores greater than 3.3 or less than –3.3 were excluded from this study. We did not include cases with a suspected fracture on CT scout films or a suspected fracture on DXA with a T-score difference of more than 1 between adjacent vertebrae.
Table 1 contains additional demographic information on the DXA subject population. Subjects were grouped into normal, osteopenia, and osteoporosis groups using their L1–L4 DXA T-score values and the World Health Organization (WHO) diagnostic criteria. Body mass index (BMI) status was based on the WHO BMI cutoffs for Asian populations, where a value below or equal to 18.5 is underweight, between 18.5 and 23 is normal, between 23 and 27.5 is overweight, and greater than 27.5 is obese [21].
The CTs and Convolution Kernels used were: (1) Discovery CT 750HD with Convolution Kernel: Standard (GE Healthcare, Waukesha, WI, USA) (n=238); (2) Brilliance with Convolution Kernel: B (Philips Medical Systems, Best, the Netherlands) (n=109); (3) SOMATOM Definition Edge with Convolution Kernel: B30f (Siemens Healthcare, Forchheim, Germany) (n=20). All CTs were scanned 80 seconds after injection of Omnipaqu (Iohexol, GE Healthcare) 300 contrast medium (Venous phase). Peak tube voltage was 120 kVp (n=347) or 100 kVp (n=20) The tube voltage used for all abdominal CT images was 120 kvP and the tube current (milliampere) was adjusted automatically by the thickness of the patient’s body for radiation dose optimization.
Lumbar DXA scans were performed using QDR 4,500 W (n=165), Horizon W (n=98), and Discovery W (n=5) scanners manufactured by Hologic (Marlborough, MA, USA). For each subject, slice number annotations and vertebral segmentation mask labels were assigned to the central axial CT slice of the L1, L2, L3, and L4 vertebrae.
Experimental setup
Fig. 1 illustrates the CT-based opportunistic osteoporosis screening method. The system generated end-to-end BMD using convolutional neural network (BMDCNN) estimates from abdominal CT scans via three successive steps: localization, segmentation, and estimation. The localization subtask takes the abdominal CT scan of a subject as the input and outputs the center axial CT slice locations for the L1, L2, L3, and L4 vertebrae as integer values. For each of the four predicted center slices, the system takes one slice above and two slices below the detected center, segmenting the vertebral bone into four slices during the segmentation step. The segmented slices corresponding to each lumbar vertebra were cropped, concatenated counterclockwise, and used as the input for the estimation step to generate the BMDCNN predictions.
Lumbar spine localization
The localization subtask uses deep learning-based regressors to detect the center axial slice locations of the L1, L2, L3, and L4 spines. Our localization method is a two-step process that utilizes both frontal and sagittal maximum intensity projections (MIPs) and two regression models. In contrast to the previous work on CT-based lumbar spine localization by Belharbi et al. [22] and Kanavati et al. [23], our method outputs vertebral-level predictions for the top four lumbar vertebrae rather than just the center L3 slice location. Fig. 2 shows the overall flow of the proposed method.
The two stages of the localization method are trained separately and combined during the inference phase. In the first stage, a neural network outputs the top axial slice location of the L1 vertebra using frontal MIP, and the second model predicts the center slice locations of the L1, L2, L3, and L4 vertebrae during the second stage. The frontal MIP used in the first stage was center cropped along the x-axis and zero padded along the y-axis to generate a 160×256 pixel image. In the second stage, the sagittal MIP was cropped from the 120th pixel to the 280th pixel along the x-axis and 80 slices from the top L1 slice along the y-axis to generate a 160×80-pixel region of interest. During training, the sagittal MIP was cropped, starting from a random slice between 5 and 15 above the top L1 location ground-truth, and the MIP was cropped from 10 slices above the top L1 slice prediction in the first step of the localization subtask. Both stages use VGG19-based regressors [24], with a channel attention block applied before each pooling layer. Additional implementation details are outlined in Supplemental Fig. S1.
Vertebral segmentation
Spine BMDDXA measurements are known to be affected by body composition [25]; therefore, our system segments the vertebra for each axial CT slice using a U-Net-based model [26], such that the subsequent estimation subtask will only use bone images as input. However, there is a trade-off between improving model performance with more labeled samples and the resource demands associated with manual data annotation. To address this disconnection, we take a semi-supervised approach for this subtask, where we only use one segmentation mask for each vertebra but use all available axial slices during training via the Mumford-Shah (MS) loss [27]. As shown in Fig. 3, a semi-supervised learning approach allows us to use all the CT slices in a vertebra, leading to an approximately ten-fold increase in training data as compared to a strongly supervised approach using only the center vertebral slices. The loss was calculated using a weighted sum of cross-entropy (CE) and MS losses, where the CE loss was only applied to samples with ground-truth segmentation masks, whereas the MS loss was applied to all training samples. Additional details on the model architecture, loss function, and training procedure are provided in Supplemental Fig. S2.
Bone mineral density estimation
In this study, BMD estimation was considered as an image regression task, where the system generated continuous BMDCNN predictions for the L1, L2, L3, and L4 vertebrae independently using a DenseNet169-based regressor [28]. Fig. 4 illustrates the image generation process for the BMD estimation subtask. For training, all axial CT slices from the entire range of a given vertebra were segmented, whereas the images used for testing were based on the output predictions from the detection and segmentation subtasks. For a given vertebra, four consecutive axial CT slices were selected to generate one input sample. During training and validation, a sliding window was applied to the slices, such that for a vertebra with n ordered slices, n−3 samples containing four adjacent CT slices were generated. In contrast to using only the slices near the center, this approach uses data from the entire vertebra during training while ensuring that slices closer to the start and end of the vertebra, which contain a higher proportion of denser cortical bone, are sampled less frequently. During testing, the center slice prediction for a given vertebra was used from the detection step, and one slice above and two slices below the predicted slice were selected. This choice, in contrast to two slices above and one slice below the detected center, was arbitrary because the differences between the two methods were negligible during model selection. Additional training and architecture details for the estimation subtask are available in Supplemental Methods.
Statistical analysis
The models in this study were evaluated using Dice scores, regression metrics (mean absolute error [MAE], root mean squared error, and mean absolute percentage error), Pearson correlation coefficients, and coefficients of determination. Dice scores were calculated using PyTorch (https://pytorch.org) [29]. Regression metrics and coefficients of determination were evaluated using Python and the scikit-learn [30] package. Pearson r coefficients and their respective P values were acquired using Python and the SciPy [31] package.
RESULTS
Lumbar spine localization
Table 2 presents the results of the proposed localization method at the vertebral and subject levels. Localization performance was evaluated in CT slices with an absolute error. The performance with the channel attention block was listed in Supplemental Table S1.
Vertebral segmentation
For each CT slice, segmentation maps were generated using the largest connected component of the predicted output. Predictions were evaluated using Dice scores. For a segmentation map A and ground-truth label B, the Dice score is defined as:
where |A∩B| is the area of the overlapping region between A and B and |A|+|B| is the sum of the areas A and B. Since the proposed screening method is performed end-to-end, segmentation performance is contingent on the results of the localization subtask (Supplemental Fig. S3). However, the vertebral segmentation masks were evaluated and the Dice scores on the manually labeled center vertebral slices were computed. Table 3 reports the results of our semi-supervised segmentation method at the subject and vertebral levels compared with a standard, strongly supervised U-Net. The detailed results were presented in Supplemental Table S2.
Bone mineral density estimation
Table 4 lists the end-to-end performance of the estimation subtask. The 98 subjects in the test set (52 females and 46 males) were represented by 392 image samples generated from 1,568 CT slices, where each image represented one lumbar vertebra. Vertebral-level results were obtained by evaluating the end-toend BMDCNN estimates against the BMDDXA ground truths, while the subject-level results were generated by taking the arithmetic mean of the L1, L2, L3, and L4 BMDCNN predictions and comparing them to the total lumbar BMDDXA values. Fig. 5 illustrates these results using regression and Bland-Altman [32] plots.
DISCUSSION
This study proposes a deep learning-based system for end-to-end opportunistic osteoporosis screening using abdominal CT scans. Our system detects the center L1, L2, L3, and L4 axial CT slices with a MAE of 1.07±2.31 slices, segments the vertebra with a Dice score of 0.968, and generates vertebra-level BMDDXA estimates with a MAE of 0.079, Pearson r of 0.852 (P<0.001), and R2 of 0.714 on the held-out test set. For subject-level predictions, the system output predictions had an MAE of 0.066, Pearson r of 0.907 (P<0.001), and R2 of 0.781.
Previous studies have investigated deep learning-based methods for CT-based BMD estimation. Some of these studies relied on extracting features on manually selected CT slices rather than entire studies. Tang et al. [16] proposed a method of screening for osteoporosis using a multiclass classification network to discriminate between normal bone mass, low bone mass, and osteoporosis from segmented axial CT slices, while Yasaka et al. [15] used manually cropped mid-vertebral axial CT slices with circular region-of-interest markers to train a convolutional neural network (CNN) to output continuous BMDDXA predictions. Other more recent studies took end-to-end approaches to BMD estimation. Pickhardt et al. [33] and Fang et al. [18] presented end-to-end automatic methods to predict quantitative CT (QCT) values, while Liu et al. [19] correlated the mean Hounsfield unit values in regions of interest from low-dose chest CT scans with BMDDXA values for the thoracic and first two lumbar vertebrae. Krishnaraj et al. [20] used a cascade of two segmentation networks and machine learning-based regression to predict the DXA t-scores for the top four lumbar vertebrae from extracted voxel volumes. With the exception of Liu et al. [34] and Pickhardt et al. [33], these prior studies relied on strongly supervised methods (i.e., each training sample has a corresponding ground-truth label) for slice detection and region-of-interest extraction tasks. To reduce the cost associated with manual annotation, our work adopts a semi-supervised approach to segmentation, where only the ground-truth masks for the center axial CT slice of each vertebra are provided. We note that while the localization and segmentation approaches used by Liu et al. [34] and Pickhardt et al. [33] require no masks, the respective methods require anatomical landmarks (i.e., clavicles) that are not present in our dataset of abdominal CT scans [34] or the use of the lowest ribs [35], which can be difficult owing to lumbar or absent ribs.
Our system generates end-to-end predictions of BMDDXA values from abdominal CT scans rather than DXA t-scores [20], normal/osteopenia/osteoporosis classes [16], or QCT BMD values [17,18]. DXA t-scores and DXA-based osteoporosis classifications are population-dependent, and thresholds for osteopenia and osteoporosis do not always generalize across different groups. Although QCT has certain benefits over DXA, including capturing volumetric rather than areal BMD, the screening modality has lower utilization than DXA and often requires the use of a BMD calibration phantom at the time of the scan. This limits the number of available QCT samples relative to that of DXA, leaving fewer CT-BMD pairs for future training and validation.
This study has a few limitations of its own. All DXA and CT studies were sourced via routine cancer screening at a single oncology specialty center in the Republic of Korea, limiting the scope of our analysis as BMD varies between national and ethnic groups [2,36]. To generate vertebral-level annotations frontal MIP images were compared to lumbar DXA images. This was because of the difficulty introduced by the transitional vertebrae, lumbar ribs, and anatomical markers not being present in abdominal CTs, which would enable annotation with CT images alone. In particular, lumbosacral transitional vertebrae (LSTV)— common anatomical variants of the lumbar spine caused by the lumbarization of the S1 vertebra (leading to the appearance of six lumbar vertebrae) and sacralization of the L5 vertebra (leading to the appearance of four lumbar vertebrae)—make vertebral-level annotation difficult even for trained clinicians [37]. In practice, LSTVs cause errors in imaging [38] and wrong-site surgery [39]. To account for these variations, DXA images were used as ground-truth labels to reflect real-world clinical decision-making. Additionally, we were unable to collect information on family history, menopausal status, physical activity, or other risk factors.
This study describes an end-to-end deep learning-based system to measure BMD using abdominal CT. Furthermore, this study demonstrated that CT scans collected during routine cancer screening without bone densitometry phantoms could generate BMDDXA predictions without additional tests.
In this study, we have shown that machine learning can be used to predict BMD at each level of the lumbar spine from routine CT images. Next, we would like to develop a machine learning model that predicts the T-score and z-score of L1–L4 from CT images and develop software to diagnose osteoporosis. We also want to develop artificial intelligence software that uses clinical data from electronic health records and CT images to predict the long-term risk of fractures.
Supplementary Material
Notes
CONFLICTS OF INTEREST
No potential conflict of interest relevant to this article was reported.
AUTHOR CONTRIBUTIONS
Conception or design: Y.H., J.C.Y. Acquisition, analysis, or interpretation of data: J.O., B.K., G.O., Y.H., J.C.Y. Drafting the work or revising: J.O., B.K., G.O. Final approval of the manuscript: J.O., B.K., G.O., Y.H., J.C.Y.
Acknowledgements
This work was funded by Korea Advanced Institute of Science and Technology (KAIST) R&D Program (KI Meta-Convergence Program) 2020 (N10200012) through KAIST and by a grant from the National Cancer Center (2010080 and 2310840) in the Republic of Korea. Boah Kim is now at National Institutes of Health, Bethesda, MD, USA.