Applications of Machine Learning in Bone and Mineral Research
Abstract
In this unprecedented era of the overwhelming volume of medical data, machine learning can be a promising tool that may shed light on an individualized approach and a better understanding of the disease in the field of osteoporosis research, similar to that in other research fields. This review aimed to provide an overview of the latest studies using machine learning to address issues, mainly focusing on osteoporosis and fractures. Machine learning models for diagnosing and classifying osteoporosis and detecting fractures from images have shown promising performance. Fracture risk prediction is another promising field of research, and studies are being conducted using various data sources. However, these approaches may be biased due to the nature of the techniques or the quality of the data. Therefore, more studies based on the proposed guidelines are needed to improve the technical feasibility and generalizability of artificial intelligence algorithms.
INTRODUCTION
In this aging society, osteoporosis and its clinical outcome, fragility fracture, have become a growing social issue in both medical and economic aspects. In South Korea, the total health care costs of osteoporotic fractures increased by approximately 30% from 2008 to 2011, and this trend has been continuously rising in the United States and Korea [1,2]. Therefore, preventing fractures is a core purpose in the diagnosis and management of osteoporosis. The diagnosis of osteoporosis is based on assessing bone mineral density (BMD) using dual-energy X-ray absorptiometry (DXA). In addition to BMD, the Fracture Risk Assessment Tool (FRAX), incorporating additional clinical risk factors, is a well-validated and widely used tool for fracture prediction [3]. However, there is an unmet need for tools with easier accessibility and better performance in classifying patients with osteoporosis and predicting the risk of fractures [4].
Machine learning (ML) methodologies are being rapidly adopted in various medical fields [5], including bone and mineral research, where they are applied to the diagnosis of osteoporosis and the detection and prediction of fractures using both clinical and imaging data. Accordingly, the number of published studies using ML approaches in bone and mineral research has grown explosively, as depicted in Fig. 1. These studies have become possible because of the combination of rapidly accumulating medical data [6] and advances in accessible computing power [7]. In particular, studies of classification tasks, such as screening for osteoporosis or detecting fractures, have been increasing. This might be due to the relatively easier access to cross-sectional datasets than to survival datasets and to the wider availability of models for classification tasks than for survival tasks. However, as large standardized datasets have become more widely available [8] and the methodologies continue to evolve, ML is expected to help address the currently unmet needs of bone and mineral research.
In this review, studies that applied ML methods to bone and mineral research were reviewed from a medical perspective, focusing on osteoporosis screening, fracture detection, and risk prediction. The literature search was performed in PubMed and included studies published from January 2016 to March 2021. Furthermore, future perspectives for researchers and clinicians in the bone field are summarized.
A SHORT GUIDE FOR INTERPRETATION
Confusion matrix
Fig. 2 shows a cross table of the relationship between the results of the artificial intelligence (AI) algorithm and the reference standard. In the AI literature, this cross table is usually called a ‘confusion matrix.’ Sensitivity, also called ‘recall,’ refers to the fraction of cases that the AI classifies as having the disease among the reference cases with the disease. On the other hand, specificity refers to the fraction of cases that the AI classifies as not having the disease among the reference cases without the disease. Sensitivity and specificity are the most basic indicators of the accuracy of AI algorithms. As more intuitive measures, the probability that the disease actually exists (or does not exist) when the AI gives a positive (or negative) result is called the positive predictive value (PPV) (or negative predictive value [NPV]). PPV is also called ‘precision’ in the AI literature. However, because prevalence acts as the pretest probability for an individual patient, PPV and NPV vary significantly with the prevalence of the disease even for the same AI algorithm. Therefore, physicians using AI algorithms should interpret the results presented by AI in light of the expected pretest probability of the population.
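The definitions above can be written out as a short Python sketch. The class name `ConfusionMatrix`, the helper `ppv_at_prevalence`, and the example counts are illustrative assumptions and do not come from any study cited in this review; the prevalence adjustment is simply Bayes' rule applied to a fixed sensitivity and specificity.

```python
from dataclasses import dataclass


@dataclass
class ConfusionMatrix:
    """2x2 cross table of the AI result versus the reference standard."""
    tp: int  # AI positive, reference positive
    fp: int  # AI positive, reference negative
    fn: int  # AI negative, reference positive
    tn: int  # AI negative, reference negative

    def sensitivity(self) -> float:  # also called recall
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def ppv(self) -> float:  # also called precision
        return self.tp / (self.tp + self.fp)

    def npv(self) -> float:
        return self.tn / (self.tn + self.fn)


def ppv_at_prevalence(sensitivity: float, specificity: float, prevalence: float) -> float:
    """PPV expected in a population with the given prevalence (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical balanced test set: 100 cases with disease, 100 without.
example = ConfusionMatrix(tp=90, fp=10, fn=10, tn=90)

# Same algorithm (sensitivity 0.90, specificity 0.90) at two prevalences.
ppv_balanced = ppv_at_prevalence(0.90, 0.90, 0.50)
ppv_screening = ppv_at_prevalence(0.90, 0.90, 0.03)
```

With these assumed numbers, the PPV is 0.90 at a prevalence of 50% but falls to roughly 0.22 at a prevalence of 3%, although sensitivity and specificity are unchanged.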
Receiver operating characteristic curve and precision-recall curve
A widely used way to demonstrate the performance of AI algorithms is the area under the receiver operating characteristic curve (AUROC), or, in short, the area under the curve (AUC). The receiver operating characteristic (ROC) curve is a graph drawn with 1-specificity on the x-axis and sensitivity on the y-axis. The AUROC has a maximum value of 1, and the closer it is to 1, the better the model discriminates. However, even if the AUROC is high, the model can be used in practice only with an appropriate threshold. Therefore, the chosen threshold and the sensitivity and specificity at that threshold should be presented along with AUROC values.
The precision-recall curve (PRC) is another way to show the performance of a model; it is drawn with recall (sensitivity) on the x-axis and precision (PPV) on the y-axis. Unlike the ROC curve, because its y-axis is PPV, the PRC reflects prevalence. Therefore, the shape and AUC of the PRC change with the prevalence of the disease, which makes the PRC more suitable for imbalanced datasets with a low prevalence.
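As a minimal sketch of how an ROC curve and its area are obtained, the following Python code sweeps the decision threshold over the predicted scores and integrates the curve with the trapezoidal rule. The function names and the four-case toy dataset are assumptions made for illustration. Picking the threshold that maximizes sensitivity + specificity - 1 (Youden's index) is one common convention, not a method prescribed by the studies reviewed here.

```python
def roc_points(labels, scores):
    """Return (1 - specificity, sensitivity, threshold) triples.

    A case is called positive when its score is >= threshold; thresholds are
    swept from the highest score to the lowest.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0, float("inf"))]  # nothing is called positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos, thr))
    return points


def auroc(labels, scores):
    """Area under the ROC curve by the trapezoidal rule."""
    pts = roc_points(labels, scores)
    area = 0.0
    for (x0, y0, _), (x1, y1, _) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def youden_threshold(labels, scores):
    """Operating point maximizing sensitivity + specificity - 1."""
    candidates = roc_points(labels, scores)[1:]  # skip the infinite threshold
    fpr, tpr, thr = max(candidates, key=lambda p: p[1] - p[0])
    return {"threshold": thr, "sensitivity": tpr, "specificity": 1.0 - fpr}


# Toy data: 1 = fracture present on the reference standard, 0 = absent.
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]

area = auroc(labels, scores)
operating_point = youden_threshold(labels, scores)
```

For this toy dataset the area is 0.75, and the selected operating point is reported together with its sensitivity and specificity, which is the information a reader needs in addition to the AUROC itself.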
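The dependence of the PRC on prevalence can be illustrated with a small Python sketch. Here the area under the PRC is approximated by average precision (a step-wise sum), and the AUROC is computed as the fraction of positive-negative pairs that are ranked correctly. The scores are invented for illustration; lowering the prevalence is simulated by repeating the same negative cases ten times, which is an assumption of this example rather than a procedure from the cited studies.

```python
def auroc(labels, scores):
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve."""
    n_pos = sum(labels)
    area, prev_recall = 0.0, 0.0
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        recall = tp / n_pos
        precision = tp / (tp + fp)
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


pos_scores = [0.9, 0.7, 0.4]
neg_scores = [0.6, 0.3, 0.1]

# Balanced dataset: prevalence 50%.
bal_labels = [1] * 3 + [0] * 3
bal_scores = pos_scores + neg_scores

# Same score distributions, but ten times as many negatives: prevalence about 9%.
low_labels = [1] * 3 + [0] * 30
low_scores = pos_scores + neg_scores * 10

auroc_bal, auroc_low = auroc(bal_labels, bal_scores), auroc(low_labels, low_scores)
ap_bal, ap_low = average_precision(bal_labels, bal_scores), average_precision(low_labels, low_scores)
```

In this example the AUROC is identical in both datasets, whereas the average precision drops from about 0.92 to about 0.74 when the prevalence is lowered.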
Internal and external validation
Among the processes of training, tuning, and testing in the development of AI algorithms, testing is the process of checking the performance of the developed algorithms. Mathematically complex AI models, such as deep learning models, are highly dependent on the data themselves. Therefore, it is crucial to evaluate the performance using independent datasets not used for training and tuning, usually datasets from other institutions, which is called external validation. On the other hand, evaluating performance with datasets used for training or tuning is called internal validation. However, internal validation is likely to overestimate the performance due to overfitting.
Split-sample validation sets aside a randomly selected subgroup of the dataset, usually about 10% of the total, only for testing. Although the testing process uses data not used for training and tuning, it is also regarded as internal validation. This is mainly because of selection bias, which is often inevitable in collecting a large amount of data. Selection bias leads to various discrepancies between real-world data and the data collected for a specific AI algorithm. Naturally, a split taken from the collected dataset inherits this limitation.
Therefore, it is recommended to externally evaluate the performance in an independent dataset that can reflect the actual clinical situation. Specifically, ideal external validation datasets should be prospectively collected in an accurately defined clinical setting, with as little bias as possible, from institutions other than the one that collected the training dataset.
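The distinction between split-sample and external validation can be sketched in a few lines of Python. The record layout (an `id` and a `site` field), the site names, and the function `split_sample` are hypothetical and serve only to show that a random 10% hold-out is drawn from the same source as the training data.

```python
import random


def split_sample(records, test_fraction=0.1, seed=0):
    """Randomly hold out a fraction of one collected dataset for testing."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train/tune, test)


# Hypothetical development dataset collected at a single institution.
internal = [{"id": i, "site": "hospital_A"} for i in range(100)]

# Hypothetical independent dataset from another institution; it is never
# used for training or tuning and is evaluated only once, at the end.
external = [{"id": 1000 + i, "site": "hospital_B"} for i in range(50)]

train, internal_test = split_sample(internal, test_fraction=0.1, seed=42)

# The held-out 10% shares the collection process (and any selection bias)
# of the training data, so it still counts as internal validation.
internal_test_sites = {r["site"] for r in internal_test}
external_sites = {r["site"] for r in external}
```

The held-out cases are unseen by the model, yet they all come from the same site as the training cases; only the second dataset probes performance under a different collection process.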
APPLICATIONS IN DIAGNOSIS
Screening osteoporosis
In the era of AI, many researchers have focused their attention on developing practical screening tools for osteoporosis using this methodology. Easier-to-use and accurate diagnostic tools may improve the prognosis of individuals at high risk of fractures by earlier intervention and aid the effective use of public health resources for individuals at low risk. Most studies have focused on predicting BMD or categorizing patients with osteoporosis using opportunistic imaging modalities such as computed tomography (CT) [9–13] and X-rays [14–18] or various clinical parameters (Table 1) [19–22].
In general, CT has been used in studies predicting BMD [9–13]. A recent study by Fang et al. [9] using quantitative CT images from 1,499 patients reported that CT images could predict BMD using a convolutional neural network (CNN), such as DenseNet-121, with an excellent correlation of r>0.98. This result has clinical significance in generalizability because CT images were obtained using scanners from different vendors. The results from other types of CT, such as spinal or chest CT, have also shown excellent correlation to BMD values using CNN [11–13]. For the classification of patients with or without osteoporosis, studies using CT demonstrated outstanding performances, with an accuracy of 0.82 to 0.91 and an AUROC of 0.90 to 0.97 [11–13,23]. However, some studies had a critical limitation—BMD estimated from CT was used and not BMD estimated from DXA, which is the gold standard [9–12].
Studies using X-rays or dental radiography have usually focused on classification tasks. Most studies that used a CNN, especially DenseNet and ResNet, have shown excellent performance, with an AUROC of 0.81 to 0.94 and an accuracy of 0.85 to 0.92 [14–18]; some studies have even reported an AUROC of 1.00 [24–26]. In addition, studies have attempted to use clinical parameters instead of images for categorizing osteoporosis, showing excellent performance, with correlation coefficients of 0.778 to 0.978 for BMD and an AUROC of 0.74 to 1.00 [19–22]. The performance of the models using clinical parameters varies widely depending on the type and quality of the data. Some studies also reported precision and recall [12,16] but did not report the PRC, which might be more appropriate for imbalanced datasets, as mentioned above.
On the other hand, a complex model with numerous parameters, such as a CNN, inevitably risks overfitting due to the bias-variance tradeoff [27]. Overfitting occurs when a model learns the details of the training set so well that its performance on data other than the training set is negatively affected. The most intuitive way to solve the problem in a ‘deep-learning’ way is to secure enough data to train models. However, securing sufficient data is not always possible considering the prevalence of osteoporosis or fractures [28]. Therefore, some studies have attempted to control overfitting by feature selection [19], data augmentation [11], and transfer learning [9], while other studies have mentioned the limitation of bias in selecting patients, models, or the testing dataset.
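Overfitting can be made concrete with a deliberately simple stand-in for a complex model. The sketch below uses a one-nearest-neighbour classifier, which memorizes every training case, on synthetic data with 20% label noise. The data-generating rule, the noise level, and the sample size are assumptions chosen for illustration and are unrelated to the CNNs or datasets in the cited studies.

```python
import random


def make_data(xs, noise, rng):
    """True rule: label 1 when x > 0.5; each label is flipped with probability `noise`."""
    data = []
    for x in xs:
        y = 1 if x > 0.5 else 0
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data


def predict_1nn(train, x):
    """Return the label of the closest training case (pure memorization)."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


def accuracy(train, data):
    return sum(predict_1nn(train, x) == y for x, y in data) / len(data)


rng = random.Random(0)
n = 200
# Training and test cases lie on interleaved grids, so no case is shared.
train = make_data([i / n for i in range(n)], noise=0.2, rng=rng)
test = make_data([(i + 0.5) / n for i in range(n)], noise=0.2, rng=rng)

train_acc = accuracy(train, train)  # every case is its own nearest neighbour
test_acc = accuracy(train, test)    # memorized noise does not transfer
```

Because each training case is its own nearest neighbour, the training accuracy is exactly 1.0 by construction, while the accuracy on the unseen cases is limited by the memorized noise. The gap between the two numbers is what an evaluation on independent data is meant to reveal.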
Taken together, increasing attempts have been made to diagnose osteoporosis using various data sources and ML methods, and performance has improved over time, especially when using images with CNN methods. Although studies reporting AUROCs of almost 1.00 can have a risk of overfitting and need external validation of the model [24–26], the practical use of opportunistically taken images in screening osteoporosis may be realized in the future.
Screening fractures
Many studies have reported the application of ML in fracture detection [29–43], and some of them have become the basis of commercially available programs—such as OsteoDetect (Imagen Technologies, New York, NY, USA; 2018, the U.S. Food and Drug Administration [FDA]-approved), Aidoc BriefCase-CSF triage (Aidoc Medical Ltd., Tel Aviv, Israel; 2019, FDA-approved), HealthVCF (Zebra Medical Vision Ltd., Shefayim, Israel; 2020, FDA-approved), FractureDetect (Imagen Technologies, 2020, FDA-approved) [44], and DEEP-SPINE-CF-01 (Deepnoid Inc., Seoul, Korea; 2019, Korean FDA-approved).
Several earlier studies used X-ray images to detect fractures, and studies using CT images to detect fractures have been increasing recently. As the basis for the OsteoDetect program, Lindsey et al. [30] used wrist radiographs to detect wrist fractures using a CNN and reported AUROCs of 0.96 and 0.97 in two internal test datasets. They also showed that, when aided by the program, the misinterpretation rate of the average clinician was significantly reduced by 47.0% [30]. Another study that used X-rays to detect wrist fractures using a CNN showed excellent performance in an external test, with an AUROC of 0.95, a specificity of 0.90, and a sensitivity of 0.88, which surpassed the performance of previous computational methods [31]. A similar study by Chung et al. [32] used shoulder radiographs to detect humerus fractures using a CNN model. In the study, the model demonstrated superior performance to general orthopedic surgeons in distinguishing fractures [32]. For detecting vertebral and femoral neck fractures, many studies have reported AUROCs as high as 0.91 to 0.99 using spine and hip X-rays with CNN methods, consistent with other studies [35–43]. Another interesting study conducted by Badgeley et al. [45] reported that imaging features from hip X-rays could be used to discriminate fractures using a CNN (AUROC of 0.78) and that patient data with hospital process variables, such as scanner model, scanner manufacturer, and order date, showed better performance for fracture detection (AUROC of 0.91) than images. In a subgroup analysis of selected radiographs matched with patient data and hospital process variables, the image-based model could not detect hip fractures [45]. This result implied that the model detected fractures indirectly through the associated clinical variables rather than directly utilizing the image features of the fracture.
Also, partly because of class imbalance, the PRC, which depends on disease prevalence, was significantly higher in the case-control cohorts (hip fracture prevalence of 50%) than in the original population (prevalence of 3%) [45].
In terms of studies using CT images, Tomita et al. [29] detected osteoporotic vertebral fractures from 1,432 pelvic CT scans in 2018. They combined two methods: a CNN-based model for feature extraction and a ResNet long short-term memory model for aggregating the extracted features. Along with other studies using random forests or support vector machines [46,47], the study demonstrated an acceptable accuracy of 0.89. Although studies on hip fractures and any osteoporotic fractures are fewer than those on vertebral fractures, they have also shown the potential of ML models as diagnostic tools for these fractures, using diverse methods such as deep CNNs and ElasticNet [48–50]. However, studies using CT images are usually based on a small number of cases; hence, there is a need for larger studies with external validation.
In particular, in imbalanced tasks such as detecting fractures, data augmentation was attempted in some studies to control the overfitting problem [31,37,39,41]. Some studies have used sampling methods to handle class imbalance [51,52]. In a recent study, images with data augmentation techniques of generative adversarial networks and digitally reconstructed radiographs from CT showed better performances than those without augmentation (AUROC of 0.92 vs. 0.80, accuracy 86.0%, sensitivity 0.79, specificity 0.90, PPV 0.80, NPV 0.90) [37]. Another recent study reported that the accuracy of fracture detection increased with larger training dataset sizes and mildly improved with augmentation [35]. Consequently, larger studies with optimal augmentation techniques are needed for real-world application of automatic ML-driven detection systems, which may reduce the time and burden of radiologists.
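One of the simplest sampling methods for class imbalance is random oversampling, in which minority-class cases are drawn again with replacement until the classes are balanced. The sketch below is a generic illustration; the function name, the file names, and the 5% fracture rate are assumptions and do not reproduce the methods of the cited studies.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class cases until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Hypothetical training set: 5 radiographs with a fracture, 95 without.
images = [f"radiograph_{i:03d}.png" for i in range(100)]
labels = [1] * 5 + [0] * 95

balanced_images, balanced_labels = random_oversample(images, labels, seed=1)
class_counts = Counter(balanced_labels)
```

Resampling of this kind is normally applied to the training split only. The test set should keep the original prevalence so that reported measures such as PPV remain interpretable.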
APPLICATIONS IN RISK PREDICTION
As in other fields of medical research, accurate prediction of musculoskeletal outcomes enables an individualized approach for initiating and monitoring treatments. A few studies have evaluated the risks of fractures, falls, or bone loss in patients with osteoporosis. In terms of predicting fracture, most studies used a database to build prediction models. In men, Su et al. [53] reported that the classification of a high-risk group for hip fractures using a classic ML method of classification and regression trees showed a discrimination power similar to that of FRAX ≥3%. Total hip BMD was the most robust discriminator, followed by age and femoral neck BMD [53]. In postmenopausal women, fracture classification using the CatBoost method, a recently developed ML method, outperformed the FRAX score for fracture prediction (AUROC of 0.69 vs. 0.66) [54]. The top predicting factors were total hip, lumbar spine, and femur neck BMD, followed by subjective arthralgia score, serum creatinine level, and homocysteine level [54]. The latter factors were ranked higher than conventional predictors, such as age [54]. The results implied that ML could be used to build prediction models and identify novel risk factors. Based on claims data of more than 280,000 individuals, Engels et al. [55] developed a hip fracture prediction model with an AUROC of 0.65 to 0.70 using a super-learner algorithm that considered both regression and ML algorithms, such as support vector machines and RUSBoost. Interestingly, an image-based fracture prediction model was recently attempted by Muehlematter et al. [56]. They showed that bone texture analysis from CT scans combined with ML methods may identify patients at high risk of vertebral fractures with high accuracy.
Moreover, considering the sequential characteristics of electronic health records, Almog et al. [57] developed a short-term incident fracture prediction model based on natural language processing methods. These findings indicate the possibility of using the unique medical history data of the patients over time to predict the risk of fractures. In contrast, studies using unsupervised learning to identify fractures were also conducted [58,59]. Kruse et al. [58] found nine different fracture risk clusters based on BMD, clinical risk factors, and medications using simple unsupervised hierarchical agglomerative clustering analysis. Clusters based on BMD could discriminate between patients with poor and good future treatment compliance with antiresorptives.
With regard to predicting outcomes other than fracture, a few studies have attempted to predict bone loss and falls [60–62]. The rate of bone loss over 10 years could be predicted better with an artificial neural network than with multiple regression analysis using conventional parameters, such as age, body mass index, menopause, fat and lean body mass, and BMD values [60]. Falls were also accurately predicted using XGBoost, reporting the following top predictors: cognitive disorders, abnormalities of gait and balance, and Parkinson’s disease [61]. The most common problem encountered in these learning tasks is class imbalance because of the low incidence of positive events. Model calibration has been attempted in some studies by adjusting the predicted and observed probabilities to attenuate class imbalance [61,63]. Although further validation studies are needed, efforts are being made to identify patients at risk and provide individualized treatment.
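Calibration is usually checked by comparing predicted probabilities with observed event rates. The following sketch builds a simple reliability table by grouping predictions into equal-width bins. The function name, the number of bins, and the six example predictions are assumptions for illustration and are not the calibration procedures used in the cited studies.

```python
def calibration_table(probs, outcomes, n_bins=5):
    """Return (mean predicted probability, observed event rate, n) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    table = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            table.append((mean_pred, observed, len(members)))
    return table


# Hypothetical predicted fall probabilities and observed falls (1 = fall occurred).
probs = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9]
outcomes = [0, 0, 0, 1, 1, 1]

table = calibration_table(probs, outcomes)
```

In a well-calibrated model the mean predicted probability and the observed event rate are close in every bin. In this toy example the lowest bin predicts 0.10 on average but observes 0.25, which means risk is underestimated in that range.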
FUTURE DIRECTIONS
Overall, many studies have consistently shown that ML models can detect fractures better than clinicians [32,39,41], expanding the limits of human performance. Recently, the FDA and the Korean FDA approved some fracture detection algorithms to support clinicians, bringing AI-guided tools within reach. However, AI models that substantially outperform conventional models have not yet been proposed for the task of predicting fractures. One of the main reasons could be that the conventional models are well designed and already perform excellently in fracture risk prediction, which leaves little room for improvement. Also, especially for image-based AI models, although CNNs showed excellent performance in discriminating existing fractures, an image of the bone may not contain enough information to predict future fractures. Therefore, more AI models that combine images of bone and muscle with clinical information are needed in the near future. In designing such models, it should be considered whether the input images can provide high-quality information for predicting fractures, as the quality and amount of information differ significantly depending on the image type.
In addition to the above applications, AI can be used to predict treatment responses. For example, treatment response can be accurately predicted based on anthropometric, biochemical, and imaging features of patients with acromegaly using a gradient boosting decision tree method [64]. Further, in the Action to Control Cardiovascular Risk in Diabetes (ACCORD) trial, among patients with diabetes, a subgroup of patients with survival benefit from intensive treatment was newly identified in post hoc analysis using the gradient forest method [65]. These results provide insights into the utilization of ML methods to predict treatment responses, leading to an individualized approach in designing treatment regimens and targets.
Moreover, AI methods can be effectively used in translational research, especially for evaluating large data, such as genetic, epigenetic, proteomic, and other molecular profiling data. In the field of cancer immunology, researchers have used ML to predict the treatment response to immunotherapy with a rich dataset of gene expression of tumor and immune cells and their clinical characteristics [66,67]. A recent study tried to identify plasma protein patterns for various health outcomes using ML techniques [68]. The authors found novel predictive proteins and built models using ML techniques. However, the findings of these studies require further validation in more extensive and different populations.
However, despite the enthusiasm about the use of AI in medicine, the lack of sufficient and appropriate validation of the algorithms has been a concern; this has been called ‘digital exceptionalism’ [69,70]. A recent meta-analysis that evaluated AI algorithms for the diagnostic analysis of medical images reported that only 6% of the studies performed external validation. None of these studies combined a diagnostic cohort design with prospective data collection for external validation [71]. To improve the technical feasibility and generalizability of current AI studies, methodologic guides are available for various study designs, including the Standard Protocol Items: Recommendations for Interventional Trials–Artificial Intelligence (SPIRIT-AI) and Consolidated Standards of Reporting Trials–Artificial Intelligence (CONSORT-AI) guidelines for intervention studies [72], the Standards for Reporting of Diagnostic Accuracy Studies–Artificial Intelligence (STARD-AI) guidelines for diagnostic accuracy [73], and others [74,75]. In the near future, only studies with appropriate validation are likely to be accepted and utilized in clinical practice. Also, beyond the performance of AI models, other principles, such as data privacy and safety, need proper attention before the models are implemented in clinical practice.
CONCLUSIONS
In this era of the overwhelming volume of medical data, AI is a promising tool that may shed light on an individualized approach and a better understanding of the disease in the field of bone and mineral research. The present review aimed to provide an overview of the latest studies using ML to address the issues in the field, focusing on osteoporosis and fragility fractures. ML models for diagnosing and classifying osteoporosis and detecting fractures from images have shown promising performance and have improved over time. Fracture risk prediction is another promising field of research, and studies are being conducted using various data sources.
On the verge of this methodological turning point, endocrinologists, as domain experts, will continue to play a key role in identifying unmet clinical needs to initiate research and in deriving clinical meaning from the vast outcomes of the analyses to aid patients with musculoskeletal diseases. We believe that the data presented in this review may help clinicians and researchers understand the progress of ML to date and its strengths and limitations.
Acknowledgements
This study was funded by the Korea Research Foundation (Project number 2021R1A2C2003410).
Notes
CONFLICTS OF INTEREST
No potential conflict of interest relevant to this article was reported.