date 2023-04-20

Ethics oversight

The study was approved by the Institutional Review Board (IRB) of the Mount Sinai School of Medicine, in accordance with Mount Sinai’s Federal Wide Assurances to the Department of Health and Human Services (ID# STUDY-14-00584-CR001). Written informed consent was obtained from all patients enrolled in this research registry. A Data and Safety Monitoring Board (DSMB) from the Mount Sinai IRB had oversight of the study.

Study population

We collected chest CT scans and clinical information from 458 patients enrolled in the Mount Sinai Medical Center Research Registry for Interstitial Lung Disease (MSMC-ILD) between September 2014 and April 2021. The registry, established in 2014, was open to all adult (age > 18 years) patients who were receiving or seeking medical care for interstitial lung disease at Mount Sinai Medical Center and at St Luke’s and Beth Israel Medical Centers. Patients with lung fibrosis or other interstitial lung disease were enrolled in the MSMC-ILD, and the extent of their disease was assessed. The diagnosis of an ILD subtype followed the ATS2018 guidelines. All registry patients had a consensus diagnosis from radiology, pathology, and pulmonology. In this study, occupational or other environmental exposure was included as a clinical feature. The patient cohort at MSMC may differ from other patient cohorts; for example, patients at MSMC might be influenced by World Trade Center exposure. Nine patients were excluded due to low image quality, leaving 449 patients with both clinical information and CT images for our ILD diagnosis study. Patient age ranged from 22 to 91 years (median 63, IQR 56-71), with 226 males and 223 females. A total of 132 patients (29.4%) were diagnosed with usual interstitial pneumonia (UIP), 37 patients (8.2%) with chronic hypersensitivity pneumonitis (CHP), 142 patients (31.6%) with nonspecific interstitial pneumonia (NSIP), 42 patients (9.4%) with sarcoidosis, and 96 patients (21.4%) with various other ILDs. Of these 449 patients, 234 were selected for the 3-year survival analysis (see Supplementary Fig. 1 for inclusion and exclusion criteria). Sex information was used in the diagnosis of ILD subtypes as well as in the prediction of 3-year survival. Study participants did not receive compensation.

Clinical information

Clinical information was retrospectively collected by medical students, radiology residents, and thoracic radiology fellows through chart review of electronic medical records. The following data were collected within 6 months of the study date of each patient’s CT scan: age, sex, history of current or former smoking, history of rheumatic disease, home oxygen requirement, history of occupational or other exposures (including pets), World Trade Center exposure, pulmonary function test (PFT) values (FEV1/FVC ratio, FEV1 value, DLCO percentage), presence of pulmonary hypertension based on echocardiography or right heart catheterization, and history/results of lung biopsy. Clinical information was collected from pulmonology visit notes in the Electronic Medical Record and from PFT flowcharts. If data were not available within the 6-month time frame, the data entry for that variable was left blank. Incomplete clinical variables were later filled with values from the nearest visit.

We also recorded the medications being used to treat the ILD at or about the time of the CT. Eight types of medication were used: azathioprine (immunosuppressant), bosentan (cardiovascular), cyclophosphamide (antineoplastic), mycophenolate (immunosuppressant), nintedanib (unclassified), pirfenidone (unclassified), prednisone (hormone), and rituximab (unclassified).

Data split

The dataset was split by patient ID and hospital. For ILD subtype classification, 128 (28.5%) patients whose initial CT scan and clinical information were collected at the Mount Sinai Hospital were used as the external test set. The remaining 321 (71.5%) patients, whose initial data were collected at outside hospitals, were used for model development, with 258 (57.5%) patients in the training set and 63 (14.0%) patients in the validation set. For the analysis of the 3-year survival rate, a subset of 234 patients meeting the criteria in Supplementary Fig. 1 was used. These 234 patients were split into a training dataset containing 123 patients (6 dead), a validation dataset containing 38 patients (5 dead), and a test dataset containing 73 patients (11 dead).
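A patient-level, site-based split of this kind can be sketched as follows. This is only an illustration: the record fields (`patient_id`, `site`), the site label "MSH", the random shuffle, and the validation fraction are hypothetical, since the text does not state how training and validation patients were separated.

```python
import random

def split_by_site(records, test_site, val_fraction=0.2, seed=0):
    """Assign each patient to exactly one partition.

    `records` holds one entry per patient (the initial scan). Patients scanned
    at `test_site` form the external test set; all others are shuffled and
    divided into training and validation sets (an assumed procedure).
    """
    test_ids = sorted({r["patient_id"] for r in records if r["site"] == test_site})
    dev_ids = sorted({r["patient_id"] for r in records} - set(test_ids))
    rng = random.Random(seed)
    rng.shuffle(dev_ids)
    n_val = round(len(dev_ids) * val_fraction)
    return {"train": dev_ids[n_val:], "val": dev_ids[:n_val], "test": test_ids}

# Toy cohort: patients 0-2 scanned at the test site, 3-9 elsewhere.
records = [{"patient_id": i, "site": "MSH" if i < 3 else "outside"} for i in range(10)]
splits = split_by_site(records, test_site="MSH")
```

Working on sets of patient IDs, rather than on individual scans, is what keeps follow-up scans of one patient from leaking across partitions.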

Human reader studies

The predictions of the joint CNN AI model were compared to those of seven human readers on the test set. Six board-certified, fellowship-trained physicians (five radiologists and one pulmonologist) and one thoracic radiology fellow were provided with the same initial CT scan and associated clinical information that were used to develop the AI models. A senior thoracic radiologist (A.J.) with 10 years of postgraduate experience, two junior thoracic radiologists (M.C. and A.B.) with 5 years of postgraduate experience, a thoracic radiology fellow (A.S.), two senior radiologists (M.H. and J.M.) with 10 years of experience in non-thoracic radiology specialties (musculoskeletal radiology and pediatric radiology, respectively), and a senior pulmonologist (S.D.) with 10 years of experience each reviewed the 128 studies and associated clinical information from the test set. Their predictions were compared to the predictions of the joint deep learning model and to the consensus diagnosis.

AI models

The consensus diagnosis of UIP, CHP, NSIP, sarcoidosis, or other ILD was used as the gold standard to develop the AI models for ILD subcategory classification. We created five models using image data and clinical information. First, a CNN model (model 1) using pre-trained weights from RadImageNet19 and a ViT model (model 2), both based on CT images, were developed. Second, machine learning models (model 3), including MLP, SVM, and XGBoost, were built on the clinical information. Finally, a joint CNN model (model 4) and a joint ViT model (model 5) were developed, which integrated both the imaging and clinical data.

ILD subtype classification model training

We used the same optimization strategy for all classification AI models, employing the Adam optimizer with a learning rate of 0.001 and weight decay of 0.0001; the exception was model 4, which used a learning rate of 0.0001. Each model was trained for 40 epochs. We used categorical cross-entropy as the objective function.
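For reference, the categorical cross-entropy objective for the five-class problem can be written out in a few lines. This is a plain-Python illustration of the loss for a single sample, not the training code; the logits and the class ordering are made-up values.

```python
import math

def softmax(logits):
    """Convert raw scores to class probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Loss for one sample: minus the log-probability of the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))

CLASSES = ["UIP", "CHP", "NSIP", "sarcoidosis", "other"]  # assumed ordering
logits = [2.0, 0.0, 0.0, 0.0, 0.0]   # hypothetical model output
target = [1, 0, 0, 0, 0]             # consensus diagnosis: UIP
probs = softmax(logits)
loss = categorical_cross_entropy(target, probs)
```

The loss goes to zero as the probability assigned to the consensus class approaches 1, and equals ln 5 for a uniform prediction over the five classes.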

To predict patients’ 3-year survival rate, longitudinal radiological and clinical data from each time point (initial visit, year 1, year 2, year 3) were used to create time series models with LSTM14 and Transformer13, respectively.

Three-year survival rate model training

We used the same optimization strategy for all longitudinal AI models, employing the Adam optimizer with a learning rate of 0.0001 and weight decay of 0.0001. Both the LSTM and the Transformer were developed under two different parameter settings. For each parameter setting, we repeated the simulation 30 times. Each simulation was trained for 100 epochs with a batch size of 64. We used categorical cross-entropy as the objective function. For each parameter setting, the two models with the best performance on the validation dataset were selected for the ensemble model. The predicted probabilities of the resulting four models were averaged for each patient, and the performance of the ensemble was evaluated on the test dataset. The details of the parameter settings are reported in later sections.
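The selection-and-averaging step can be sketched as below. The dictionary layout, the `val_auc` field, and all numbers are hypothetical; the text says only that the best-performing runs on the validation dataset were chosen, without naming the metric.

```python
def select_top(runs, k=2):
    """Return the k runs with the highest validation score."""
    return sorted(runs, key=lambda r: r["val_auc"], reverse=True)[:k]

def ensemble_probability(selected):
    """Average the predicted probability of each patient across models."""
    n_patients = len(selected[0]["test_probs"])
    return [
        sum(run["test_probs"][i] for run in selected) / len(selected)
        for i in range(n_patients)
    ]

# Three toy runs per parameter setting (the study used 30), two test patients.
setting_a = [
    {"val_auc": 0.70, "test_probs": [0.2, 0.9]},
    {"val_auc": 0.80, "test_probs": [0.4, 0.7]},
    {"val_auc": 0.75, "test_probs": [0.0, 0.5]},
]
setting_b = [
    {"val_auc": 0.90, "test_probs": [0.2, 0.6]},
    {"val_auc": 0.60, "test_probs": [1.0, 1.0]},
    {"val_auc": 0.85, "test_probs": [0.6, 0.2]},
]
chosen = select_top(setting_a) + select_top(setting_b)   # four models in total
ensemble = ensemble_probability(chosen)
```

Selection uses validation scores only, so the test set plays no part in choosing which runs enter the ensemble.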

Data preprocessing

Clinical information and CT data collection

The following clinical data were collected: patients’ sex, age, lung function test results (FEV1, DLCO, FVC, FEV1/FVC), smoking history, occupational exposure, rheumatic disease, hypertension, lung biopsy, and the use of home oxygen. CT imaging data were collected from the study DICOM header. For missing data, we added an "unknown" class to each categorical variable. The LabelEncoder function in the scikit-learn package was used to encode these categorical data as numerical variables. The StandardScaler function in the scikit-learn package was used to standardize each feature to zero mean and unit variance.
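The effect of these two steps can be illustrated without the library. The functions below are plain-Python stand-ins that mimic what LabelEncoder (integer codes over sorted class names) and StandardScaler (population standard deviation) compute; the variable values are invented.

```python
import math

def label_encode(values, unknown="unknown"):
    """Encode categories as integers, treating missing entries as a class."""
    filled = [unknown if v is None else v for v in values]
    classes = sorted(set(filled))            # sorted, as LabelEncoder orders classes
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in filled], classes

def standardize(values):
    """Rescale a numeric feature to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # population variance
    std = math.sqrt(var) or 1.0              # constant features are left unscaled
    return [(v - mean) / std for v in values]

smoking = ["former", None, "never", "current"]   # None marks a missing entry
codes, classes = label_encode(smoking)
ages = [50.0, 60.0, 70.0]
z_ages = standardize(ages)
```

Here `classes` is `["current", "former", "never", "unknown"]`, so the missing entry receives its own code rather than being imputed.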

Image preprocessing

First, all CT scans were resampled to isotropic voxels. Next, we generated lung regions for each image in each study by applying a threshold of -400 HU to each CT slice, which converts the CT image into a binary image consisting of two densities: air and not air. The “not air” periphery of the binary image was removed, and the two largest “air” regions were kept. The resulting binary mask was then applied to the raw input CT image to isolate the lung regions. After lung segmentation, a standard lung window (width = 1500 HU, level = -400 HU) was used to normalize pixel intensities to between 0 and 255 for each segmented lung CT slice. GE Centricity Universal Viewer 6.0 was used to review the CT studies. Preprocessed images were used to develop CT-based models in TensorFlow (2.4.0).
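The thresholding and windowing arithmetic can be sketched as follows. A window of width 1500 HU centered at -400 HU spans -1150 HU to 350 HU; values outside that range are clipped before rescaling. The sketch operates on nested lists for clarity, treats values strictly below the threshold as air (an assumption), and omits the border removal and largest-region selection.

```python
def air_mask(slice_hu, threshold=-400):
    """Binary mask: True where a pixel is below the threshold (air-like)."""
    return [[hu < threshold for hu in row] for row in slice_hu]

def apply_lung_window(slice_hu, width=1500, level=-400):
    """Clip to the window and rescale linearly to integers in 0..255."""
    lo = level - width / 2     # -1150 HU for the lung window
    hi = level + width / 2     #   350 HU for the lung window
    return [
        [round(255 * (min(max(hu, lo), hi) - lo) / (hi - lo)) for hu in row]
        for row in slice_hu
    ]

# Toy 2 x 3 slice in Hounsfield units.
slice_hu = [[-1200, -1150, -400], [350, 1000, -900]]
mask = air_mask(slice_hu)
windowed = apply_lung_window(slice_hu)
```

Everything at or below -1150 HU maps to 0 and everything at or above 350 HU maps to 255, so dense structures such as bone saturate while lung parenchyma keeps most of the dynamic range.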

CT-based convolutional neural network model (model 1)

We designed a CT-based convolutional neural network (CNN) model to diagnose ILD from CT images. This model was built via transfer learning using pre-trained weights from a RadImageNet convolutional neural network, Inception-ResNet-V2 (IRV2)19,20,21. We froze the base layers of the pre-trained model and trained only the top 10 layers, which incorporate high-level features. The last convolutional layer was followed by an average pooling layer and the final dense classifier layers. Using a RadImageNet pre-trained model provided a better starting point than an ImageNet model, as the RadImageNet database contains CT lung images and therefore shares higher similarity with the target data.

CT-based vision Transformer model (model 2)

We trained a CT-based vision Transformer (ViT) model. The ViT architecture was developed to transfer the success of the self-attention mechanism in NLP tasks to imaging applications9. Our ViT model first split the input image into 10 patches and fed each embedded patch into a self-attention-based deep neural network. The encoded embedding layers were then followed by two fully connected layers with 2048 and 1024 nodes and the final prediction layer.

Machine learning model (model 3)

To classify ILD subtypes based on clinical information, we applied MLP, SVM, and XGBoost classifiers to build machine learning models. We evaluated the performance of these three classifiers on the validation dataset (Supplementary Fig. 5). We tuned each model’s hyperparameters on the training and validation datasets and evaluated the best model on the test dataset. For the SVM classifier, we assessed the ‘C’ parameter and the kernel type. For the XGBoost classifier, the learning rate and the number of iterations were tuned. For the MLP, we assessed the number of hidden nodes in each layer, the learning rate, the activation function, and the solver for weight optimization. After hyperparameter optimization, the two-layer MLP model with 64 and 32 nodes was selected because it achieved the highest AUC score on the validation dataset.

Joint CNN and MLP model (model 4)

A joint model combining CT images and clinical information was developed. The Inception-ResNet-V2 architecture with pre-trained weights derived from the RadImageNet database19 was used to learn features from imaging data. Given that the pre-trained weights included CT imaging patterns relevant to our targeted CT images, we froze the base layers, which store fundamental information from CT features, and trained only the top 10 layers, which incorporate high-level features. The last convolutional layer was followed by an average pooling layer and three fully connected layers with 1024, 512, and 32 nodes. Each CT image was thus represented as a vector of 32 features. Sixteen clinical variables were learned by the MLP model, which had two fully connected layers with 64 and 32 nodes, respectively. The output of the last MLP layer was concatenated with the vector containing CT features. The joint vector was then fed into a fully connected layer with 512 nodes before the output layer.
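The dimensions of the fusion step can be traced with a small forward pass. This is a shape illustration only: the weights are random, the ReLU activation is an assumption, and the 32 CT features are stand-in numbers rather than the output of the imaging branch.

```python
import random

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for a fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(x, layer, relu=True):
    """Fully connected layer; ReLU is an assumed activation."""
    weights, bias = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [max(0.0, v) for v in out] if relu else out

rng = random.Random(0)
ct_features = [rng.random() for _ in range(32)]   # stand-in for the imaging branch
clinical = [rng.random() for _ in range(16)]      # 16 clinical variables

# Clinical branch: two fully connected layers with 64 and 32 nodes.
clin_hidden = dense(clinical, init_layer(16, 64, rng))
clin_features = dense(clin_hidden, init_layer(64, 32, rng))

# Fusion: concatenate 32 CT features with 32 clinical features.
joint = ct_features + clin_features
hidden = dense(joint, init_layer(64, 512, rng))
logits = dense(hidden, init_layer(512, 5, rng), relu=False)   # five ILD classes
```

Both branches end in 32-dimensional vectors, so the joint vector has 64 entries before the 512-node layer.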

Joint ViT and MLP model (model 5)

A joint ViT and MLP model was also developed to study the combined information of CT images and clinical data. Because the location of lung regions might vary in CT images from different centers, we chose to split the input image into 32 patches. The patches were then processed by the Transformer encoder, which contained four independent self-attention heads that repeat the computation in parallel. The image features extracted by the Transformer encoder were passed through three fully connected layers with 1024, 512, and 32 nodes. Each CT image was thus represented as a vector of 32 features. As in model 4, a total of 16 clinical variables were learned by the MLP model, which had two fully connected layers with 64 and 32 nodes, respectively. The output of the last MLP layer was concatenated with the vector containing CT features. The joint vector was then fed into a fully connected layer with 512 nodes before the output layer.

Radiomics

Radiomics was used to extract texture features of normal lung regions from CT images22. We first converted our segmented lung CT images into binary images, which served as masks indicating the region of interest for Radiomics. Then, we applied the PyRadiomics tool to the CT images and their masks to obtain texture features based on volumetric data. The features extracted by PyRadiomics contain information about the size, shape, spatial relationship, and image intensity of medical images23. A total of 116 radiomics features were obtained for further model development to predict 3-year survival.

CNN extractor and Uniform Manifold Approximation and Projection (UMAP)

We used a pre-trained RIN-generic IRV2 CNN model developed on the RadImageNet database as the extractor to obtain high-level CT features. The last convolutional layer, conv_7b, which outputs 1536 feature maps of size 6 × 6, was used to encode each CT image. Each CT image was thus represented as a vector of 55,296 features (6 × 6 × 1536). We then averaged these feature vectors across the CT slices of each study. After CNN feature extraction, we used UMAP24 to reduce the dimension of the features while preserving the global structure, reducing the 55,296 features to 32.
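The feature count and the slice-averaging step can be checked with a short sketch. The feature maps here are constant toy arrays, and the element-wise mean across slices is an assumed reading of the averaging step; the UMAP reduction to 32 dimensions is not shown.

```python
def flatten_feature_map(fmap):
    """Flatten an H x W x C nested list into a single feature vector."""
    return [v for row in fmap for cell in row for v in cell]

def average_slices(slice_vectors):
    """Element-wise mean of per-slice feature vectors for one study."""
    n = len(slice_vectors)
    return [sum(column) / n for column in zip(*slice_vectors)]

H, W, C = 6, 6, 1536          # conv_7b output: 6 x 6 spatial grid, 1536 maps
n_features = H * W * C        # 55,296 features per slice

def constant_map(value):
    """Toy feature map filled with a single value."""
    return [[[value] * C for _ in range(W)] for _ in range(H)]

slice_vectors = [flatten_feature_map(constant_map(v)) for v in (1.0, 3.0)]
study_vector = average_slices(slice_vectors)
```

Averaging gives every study a vector of the same length regardless of how many slices it contains, which is what allows a single dimensionality reduction to be fitted across studies.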

Time Series data preprocessing

The time-series data for each patient visit included clinical information, medication information, and imaging features extracted by Radiomics and the CNN. The MinMaxScaler function was used to normalize all features. The maximum number of visits in our dataset was seven, so patients who had fewer than seven visits were given data values of zero for the “missing” visits, as a signal for our model to skip those time points during processing. Scikit-learn (0.24.1) was used to preprocess the data and develop the models.
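A minimal sketch of the scaling and padding follows. The functions are plain-Python stand-ins (the scaling mimics what MinMaxScaler computes for one feature column), the feature values are invented, and scaling before padding is an assumed order.

```python
MAX_VISITS = 7   # largest number of visits for any patient in the dataset

def min_max_scale(column):
    """Rescale one feature column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    span = (hi - lo) or 1.0          # constant columns map to 0
    return [(v - lo) / span for v in column]

def pad_visits(visits, n_features, max_visits=MAX_VISITS):
    """Append all-zero rows so every patient has `max_visits` time steps."""
    missing = max_visits - len(visits)
    return visits + [[0.0] * n_features for _ in range(missing)]

scaled = min_max_scale([2.0, 4.0, 3.0])   # one feature across three visits

# A patient with two visits and two already-scaled features per visit.
patient = [[0.0, 0.5], [1.0, 0.25]]
padded = pad_visits(patient, n_features=2)
```

Scaling first keeps the padding zeros from influencing the per-feature minimum and maximum.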

Transformer time series model

We developed a Transformer time series model to study the temporal information in the time series data of patients’ clinical information and CT image features. The model was built by stacking 16 Transformer encoders to evaluate the data at each time point. The time-series data were passed through the Transformer encoders, followed by an average pooling layer and a fully connected layer with 128 nodes. We tuned the hyperparameters of the Transformer model on the training and validation datasets and evaluated the best model on the test dataset. We assessed the number of heads in the Transformer encoder.

LSTM time series model

LSTM is an improved form of the recurrent neural network, designed to solve the problem of vanishing gradients over long sequences14. The LSTM time series model was developed to predict living status based on patients’ clinical information and CT image features over time. The time-series input was first passed through two LSTM layers, which process the input sequence at successive time steps and output a sequence of hidden state vectors in the forward and reverse directions. The features extracted by the LSTM layers were then passed through three fully connected layers and one final classifier layer. We tuned the hyperparameters of the LSTM model on the training and validation datasets and evaluated the best model on the test dataset. We assessed the number of LSTM layers.

Statistical analysis

Comparisons of AUROCs were performed by bootstrap in the pROC package (version 1.18.0)25 in R. A total of 2000 bootstrap replicates were simulated to calculate 95% CIs and p values. The 95% CIs of sensitivity and specificity for AI models and human readers were calculated by the exact Clopper-Pearson method26. McNemar’s test27 was used to compare sensitivity and specificity. The generalized score statistic test28 was used to calculate p values for negative predictive values and positive predictive values. Two-sided p values were assessed for all statistical analyses, and a p value < 0.05 was considered statistically significant. We performed logistic regression to evaluate the correlations between clinical variables and each ILD subcategory. The Hosmer–Lemeshow test29 confirmed the goodness of fit of the logistic regression. McNemar’s and the generalized score statistic tests were performed in the DTComPair30 package (version 1.0.3) in R 4.1.3.
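To illustrate the exact Clopper-Pearson interval used for sensitivity and specificity, the sketch below inverts the binomial tail probabilities by bisection. It is a standard-library Python illustration, not the R code used in the study, and the counts in the example are hypothetical.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def _bisect(f, target, iters=100):
    """Solve f(p) = target on [0, 1] for a function f increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for k successes in n."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _bisect(lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _bisect(lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper

# Hypothetical example: 30 of 37 cases of one subtype correctly identified.
sens_lower, sens_upper = clopper_pearson(30, 37)
```

Because the interval is built from exact binomial tail probabilities, it remains valid for the small per-class counts in a 128-patient test set, at the cost of being somewhat conservative.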

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.