Date: 2023-03-18

Study design, population and ethics

This was a retrospective, multicenter cohort study conducted from January 2005 to December 2019. It included all patients with leptospirosis who were consecutively admitted to three tertiary referral hospitals in Fortaleza, State of Ceará, Brazil.

Patients with a confirmed diagnosis of leptospirosis were included. The diagnostic criterion was a positive serological result, defined as a microagglutination test (MAT) titer greater than 1:800 or a positive ELISA for immunoglobulin M (IgM) antibodies against leptospirosis. Patients with insufficient data for diagnosis and those with concurrent acute infections (such as hepatitis A, HIV, dengue fever, and typhoid fever) were excluded.

The study was conducted in accordance with the Declaration of Helsinki and National Health Council resolution 466/2012, which regulates ethics in human research in Brazil. The local Institutional Review Boards (IRBs) of the three participating hospitals (Hospital São José de Doenças Infecciosas, Hospital Universitário Walter Cantídio, and Hospital Geral Fortaleza) approved this study (no. 65452016.2.3001.5044). Because the study was observational and retrospective and used anonymized data, the IRBs waived the requirement for informed consent.

Evaluated parameters

Data were collected from medical records, and patients were followed from admission until death or discharge, whichever came first. Demographic and hospitalization characteristics such as age, sex, time from symptom onset to admission, and length of stay were recorded. Clinical data included signs and symptoms on admission, vital signs on admission (systolic and diastolic blood pressure, heart rate, respiratory rate), occurrence of acute kidney injury (AKI), and need for dialysis during hospitalization. Laboratory data collected within 24 hours of admission included serum urea, creatinine, sodium, potassium, direct bilirubin, indirect bilirubin, aspartate aminotransferase (AST), alanine aminotransferase (ALT), lactate dehydrogenase (LDH), creatine phosphokinase (CK), hemoglobin, hematocrit, white blood cell (WBC) count, platelet count, and arterial blood gas analysis.

AKI was defined according to the Kidney Disease: Improving Global Outcomes (KDIGO) criteria [17]. Tachypnea was defined as a respiratory rate of 22 breaths per minute or more. Oliguria was defined as a urine output of less than 400 mL/day after 24 hours of effective hydration. Hypotension was defined as a mean arterial pressure (MAP) < 60 mmHg, and treatment with vasoactive drugs was initiated if MAP remained < 60 mmHg despite administration of intravenous fluids. Symptoms of pulmonary involvement were defined by the occurrence of cough, crackles, or hemoptysis. Apathetic symptoms were defined by the presence of changes in sensorium such as disorientation, lethargy, and agitation.

Outcome

The primary outcome was in-hospital mortality.

Statistical analysis

Exploratory data analysis

All variables of interest were compared between patients who survived and those who died during hospitalization.

Prediction model – preprocessing step

We removed variables with more than 30% missing values (14% of predictors) and imputed the remaining missing values (Supplemental Information, S2 Table 2). The k-nearest neighbor (KNN) algorithm was used for imputation. Using all predictors, we computed Gower's distance and identified the five nearest neighbors of each case with a missing value. Once the nearest neighbors are determined, the mode is used to impute nominal variables and the mean is used for numerical data.
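The imputation step described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the study's R pipeline: the function names `gower_distance` and `knn_impute`, the dictionary-based record layout, and the use of `None` for a missing value are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean

def gower_distance(a, b, ranges):
    """Gower distance between two records, skipping fields missing in either."""
    total, used = 0.0, 0
    for key, rng in ranges.items():
        va, vb = a.get(key), b.get(key)
        if va is None or vb is None:
            continue
        if rng is None:
            # Nominal field: 0 if the categories match, 1 otherwise.
            total += 0.0 if va == vb else 1.0
        else:
            # Numeric field: absolute difference scaled by the observed range.
            total += abs(va - vb) / rng if rng > 0 else 0.0
        used += 1
    return total / used if used else float("inf")

def knn_impute(rows, target, k=5):
    """Fill missing `target` values from the k nearest complete records."""
    keys = [key for key in rows[0] if key != target]
    ranges = {}
    for key in keys:
        vals = [r[key] for r in rows if r[key] is not None]
        numeric = bool(vals) and all(isinstance(v, (int, float)) for v in vals)
        ranges[key] = (max(vals) - min(vals)) if numeric else None
    donors = [r for r in rows if r[target] is not None]
    target_is_numeric = all(isinstance(r[target], (int, float)) for r in donors)
    out = []
    for r in rows:
        r = dict(r)  # work on a copy so the input is left untouched
        if r[target] is None:
            nearest = sorted(donors, key=lambda d: gower_distance(r, d, ranges))[:k]
            vals = [d[target] for d in nearest]
            # Mean for numeric targets, mode for nominal targets.
            r[target] = mean(vals) if target_is_numeric else Counter(vals).most_common(1)[0][0]
        out.append(r)
    return out
```

Gower's distance is convenient here because it handles numeric and nominal predictors in one measure, which matches the mixed clinical and laboratory variables described in this section.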

Continuous variables were standardized by subtracting the mean from each value (centering) and dividing by the standard deviation (scaling). Continuous variables were also transformed using the Box–Cox transformation. Variables with zero or near-zero variance were removed from the model. In the feature engineering for the Lasso regression, a natural spline with four degrees of freedom was used for age to account for nonlinearity.
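A minimal sketch of the two numeric transformations follows. It assumes that the Box–Cox parameter `lam` is supplied by the caller (in practice it is usually estimated from the training data) and that Box–Cox is applied before centering and scaling, because it requires strictly positive input. The function names are hypothetical.

```python
import math
from statistics import mean, stdev

def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

def fit_scaler(train_values):
    """Learn the center (mean) and scale (standard deviation) on training data only."""
    return mean(train_values), stdev(train_values)

def standardize(values, center, scale):
    """Subtract the center from each value and divide by the scale."""
    return [(v - center) / scale for v in values]
```

Learning the center and scale on the training set and reusing them on the test set keeps information from the test set out of the preprocessing.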

To adjust for class imbalance, we used the Synthetic Minority Over-sampling Technique (SMOTE) to create synthetic minority-class cases in the training set. The SMOTE algorithm uses the nearest neighbors of the minority-class cases to generate new examples for that class, which balances the target classes. All preprocessing steps were performed on the training set.
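The core idea of SMOTE, interpolating between a minority-class case and one of its nearest minority-class neighbors, can be sketched as follows. The neighbor count `k=5`, the squared Euclidean distance, and the function name `smote` are assumptions for illustration; the text does not state the settings that were used.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of numeric minority-class tuples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base case, excluding itself.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # random position on the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic
```

Because every synthetic case lies on a segment between two observed minority-class cases, the new examples stay inside the region already occupied by that class.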

Feature selection

We used the Boruta algorithm to select the most important predictors. Boruta is a feature selection method that classifies features as important or unimportant using the feature importance scores provided by random forests. The importance of an attribute is obtained as the loss of classification accuracy caused by randomly permuting the attribute's values between objects. This loss is calculated separately for every tree in the forest that uses the attribute for classification, and the mean and standard deviation of the accuracy loss are then computed [18]. The method performs a top-down search for relevant features by comparing the importance of the original attributes with the importance achievable at random, estimated from permuted copies of the attributes, and progressively eliminating irrelevant features [19]. Features deemed unimportant by the Boruta algorithm were removed (Supporting Information, S2 Table 2). Feature selection was applied to the training set.
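The permutation-based importance measure described above can be illustrated with a short sketch. This is not the Boruta algorithm itself: a plain `predict` function stands in for the random forest, the per-tree calculation is replaced by repeated permutations of one column, and all names are hypothetical.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows for which the prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, col, n_repeats=20, seed=0):
    """Mean and standard deviation of the accuracy loss after permuting one column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    losses = []
    for _ in range(n_repeats):
        shuffled = [row[col] for row in X]
        rng.shuffle(shuffled)
        # Rebuild each row (a list) with the permuted value in position `col`.
        X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
        losses.append(baseline - accuracy(predict, X_perm, y))
    mean_loss = sum(losses) / n_repeats
    sd_loss = (sum((l - mean_loss) ** 2 for l in losses) / (n_repeats - 1)) ** 0.5
    return mean_loss, sd_loss
```

A feature the classifier never uses shows a loss of exactly zero, whereas permuting a feature the classifier relies on lowers accuracy.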

Model training

The data were split into a derivation (training) dataset and a validation (testing) dataset. The split was random and stratified by the outcome, yielding a training set (80%) and a test set (20%). In the training set (derivation cohort), bootstrap resampling was used to select the model hyperparameters and reduce bias.
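A stratified random split of this kind can be sketched as below. The function name `stratified_split` and the seed are illustrative assumptions; the function returns row indices so that it can be applied to any data layout.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_frac=0.8, seed=0):
    """Return (train_indices, test_indices), preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(train_frac * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)
```

Splitting within each outcome class keeps the proportion of deaths similar in the training and test sets, which matters when the outcome is uncommon.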

A gradient boosted decision tree model (XGBoost) and a Lasso regression were fitted to generate candidate models. The optimal hyperparameters were selected by bootstrap resampling on the training set, aiming to maximize the area under the receiver operating characteristic (ROC) curve.

Accuracy evaluation

The accuracy of the model developed in the derivation cohort was tested with data from the validation cohort. The area under the ROC curve (AUC-ROC) was used to assess the discriminative ability of the model in the training and test sets. The 95% confidence intervals (95% CI) of the AUC-ROC were estimated by bootstrap resampling (2000 samples) to reduce overfitting bias. Balanced accuracy, sensitivity, and specificity were also evaluated. In addition, we used the metric-maximization method with Youden's J index and 1000 bootstrap resamples to estimate the optimal cut point on the ROC curve.
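The three quantities named above can be sketched in plain Python. The sketch assumes binary labels coded 0/1, a percentile bootstrap interval, and a rule that classifies a case as positive when its score is at or above the cut point; the function names are hypothetical, and the bootstrap of the cut point itself is not shown.

```python
import random

def auc_roc(scores, labels):
    """AUC as the probability that a positive case outranks a negative one."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if 0 < sum(y) < n:  # a resample needs both classes to define an AUC
            stats.append(auc_roc([scores[i] for i in idx], y))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def youden_cutpoint(scores, labels):
    """Cut point maximizing J = sensitivity + specificity - 1."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_cut, best_j = None, -1.0
    for cut in sorted(set(scores)):
        tp = sum(s >= cut and l == 1 for s, l in zip(scores, labels))
        tn = sum(s < cut and l == 0 for s, l in zip(scores, labels))
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j
```

Balanced accuracy at a given cut point is the mean of sensitivity and specificity, so it equals (J + 1) / 2 at the same threshold.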

Score development and model visualization

We created a new score, named LeptoScore, from the model with the highest AUC-ROC in the validation cohort combined with the highest balanced accuracy. A quick score (QuickLepto) was then created using the predictors with the highest coefficient importance values in the Lasso regression. To develop QuickLepto, numeric predictors were discretized using cut-offs derived from classification and regression trees (CART).
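A CART-derived cut-off for a single numeric predictor amounts to the first split of a classification tree. The sketch below finds that split by minimizing the weighted Gini impurity; it assumes binary labels coded 0/1 and a single split per predictor, which may differ from the tree settings actually used. The function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_cutoff(values, labels):
    """Threshold on one numeric predictor that minimizes weighted Gini impurity."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [l for _, l in pairs]
    best_cut, best_impurity = None, float("inf")
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold can separate identical values
        left, right = ys[:i], ys[i:]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if impurity < best_impurity:
            # Place the threshold midway between the two neighbouring values.
            best_cut, best_impurity = (xs[i - 1] + xs[i]) / 2.0, impurity
    return best_cut
```

Once a threshold is found, the predictor can be recoded as a binary item (above or below the cut-off) for use in a bedside score.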

Accuracy metrics for previously published models

The final models (LeptoScore and QuickLepto) were compared with SPIRO and Quick SOFA. SPIRO predicts severe disease in patients with leptospirosis (pulmonary hemorrhage, admission to the intensive care unit (ICU), need for renal replacement therapy (RRT), intubation, or need for vasoactive drugs) and is based on three variables: oliguria (urine output ≤ 500 mL/24 h), auscultatory abnormalities on respiratory examination, and hypotension (systolic blood pressure ≤ 100 mmHg) [15]. The Quick SOFA is a widely used 3-point score to identify patients with suspected infection outside the ICU who are at high risk of in-hospital mortality. Its predictors are altered mental status (Glasgow Coma Scale score < 15), respiratory rate ≥ 22 breaths/min, and systolic blood pressure ≤ 100 mmHg [16]. Given the relevance of these scores, we applied both (SPIRO and Quick SOFA) to the dataset and compared their predictive values with those of the new LeptoScore and QuickLepto models.
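The two comparator scores can be written as simple functions of the thresholds given above. The Quick SOFA is described as a 3-point score, so one point per criterion follows from the text; assigning one point per criterion for SPIRO is an assumption of this sketch, as are the function and parameter names.

```python
def quick_sofa(gcs, respiratory_rate, systolic_bp):
    """Quick SOFA: one point per criterion, range 0 to 3."""
    return int(gcs < 15) + int(respiratory_rate >= 22) + int(systolic_bp <= 100)

def spiro(urine_output_ml_24h, abnormal_auscultation, systolic_bp):
    """SPIRO sketch: one point per criterion (assumed weighting), range 0 to 3."""
    return (
        int(urine_output_ml_24h <= 500)
        + int(bool(abnormal_auscultation))
        + int(systolic_bp <= 100)
    )
```

For example, `quick_sofa(gcs=14, respiratory_rate=24, systolic_bp=90)` counts all three criteria as present.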

Lasso regression was performed in R version 4.0.2 (R Foundation) using the tidymodels and glmnet packages.