3 datasets found
  1. Candidate predictors

    • plos.figshare.com
    xls
    Updated May 6, 2025
    Cite
    Kexin Qu; Monique Gainey; Samika S. Kanekar; Sabiha Nasrim; Eric J. Nelson; Stephanie C. Garbern; Mahmuda Monjory; Nur H. Alam; Adam C. Levine; Christopher H. Schmid (2025). Candidate predictors [Dataset]. http://doi.org/10.1371/journal.pdig.0000820.t002
    Explore at:
    xls (available download format)
    Dataset updated
    May 6, 2025
    Dataset provided by
    PLOS Digital Health
    Authors
    Kexin Qu; Monique Gainey; Samika S. Kanekar; Sabiha Nasrim; Eric J. Nelson; Stephanie C. Garbern; Mahmuda Monjory; Nur H. Alam; Adam C. Levine; Christopher H. Schmid
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Many comparisons of statistical regression and machine learning algorithms for building clinical predictive models use inadequate methods to build the regression models and lack proper independent test sets on which to externally validate the models. Proper comparisons for models of ordinal categorical outcomes do not exist. We set out to compare model discrimination for four regression and machine learning methods in a case study predicting the ordinal outcome of severe, some, or no dehydration among patients with acute diarrhea presenting to a large medical center in Bangladesh, using data from the NIRUDAK study derivation and validation cohorts. Proportional odds logistic regression (POLR), penalized ordinal regression (RIDGE), classification trees (CART), and random forest (RF) models were built to predict dehydration severity and compared using three ordinal discrimination indices: the ordinal c-index (ORC), generalized c-index (GC), and average dichotomous c-index (ADC). Performance was evaluated on models developed on the training data, on the same models applied to an external test set, and through internal validation with three bootstrap algorithms to correct for overoptimism. RF had superior discrimination on the original training data set, but its performance was more similar to that of the other three methods after internal validation using the bootstrap. Performance for all models was lower on the prospective test dataset, with particularly large reductions for RF and RIDGE. POLR had the best performance on the test dataset and was also the most efficient, with the smallest final model size. Clinical prediction models for ordinal outcomes, just like those for binary and continuous outcomes, should be prospectively validated on external test sets where possible, because internal validation may give an overly optimistic picture of model performance.
    Regression methods can perform as well as more automated machine learning methods if constructed with attention to potential nonlinear associations. Because regression models are often more interpretable clinically, their use should be encouraged.
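    The abstract above compares models with ordinal discrimination indices. As a rough illustration of the idea behind such an index, the sketch below computes a simple generalized c-index: the proportion of concordant pairs among all pairs whose true ordinal outcomes differ. This is an assumption-laden teaching sketch, not the NIRUDAK study code; the outcome coding (0 = none, 1 = some, 2 = severe dehydration) and variable names are illustrative.

```python
from itertools import combinations

def generalized_c_index(y, scores):
    """Proportion of concordant pairs among all pairs whose true ordinal
    outcomes differ; ties in predicted score count as half-concordant."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(y)), 2):
        if y[i] == y[j]:
            continue  # pairs with the same outcome level are not comparable
        comparable += 1
        # The pair is concordant when the case with the higher outcome
        # level also receives the higher predicted score.
        hi, lo = (i, j) if y[i] > y[j] else (j, i)
        if scores[hi] > scores[lo]:
            concordant += 1.0
        elif scores[hi] == scores[lo]:
            concordant += 0.5
    return concordant / comparable if comparable else float("nan")

# Toy data: 0 = no, 1 = some, 2 = severe dehydration.
y = [0, 0, 1, 1, 2, 2]
scores = [0.1, 0.3, 0.2, 0.6, 0.8, 0.9]
print(generalized_c_index(y, scores))  # 11/12: 11 of 12 comparable pairs concordant
```

    A value of 1.0 means the model ranks every comparable pair correctly; 0.5 is chance-level discrimination.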

  2. Baseline sociodemographic and clinical data

    • plos.figshare.com
    xls
    Updated May 6, 2025
    Cite
    Kexin Qu; Monique Gainey; Samika S. Kanekar; Sabiha Nasrim; Eric J. Nelson; Stephanie C. Garbern; Mahmuda Monjory; Nur H. Alam; Adam C. Levine; Christopher H. Schmid (2025). Baseline sociodemographic and clinical data [Dataset]. http://doi.org/10.1371/journal.pdig.0000820.t003
    Explore at:
    xls (available download format)
    Dataset updated
    May 6, 2025
    Dataset provided by
    PLOS Digital Health
    Authors
    Kexin Qu; Monique Gainey; Samika S. Kanekar; Sabiha Nasrim; Eric J. Nelson; Stephanie C. Garbern; Mahmuda Monjory; Nur H. Alam; Adam C. Levine; Christopher H. Schmid
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Same article abstract as dataset 1 (Candidate predictors) above.

  3. The largest diamond dataset currently on Kaggle

    • kaggle.com
    Updated Mar 17, 2023
    Cite
    hrokr (2023). The largest diamond dataset currently on Kaggle [Dataset]. https://www.kaggle.com/hrokrin/the-largest-diamond-dataset-currely-on-kaggle/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 17, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    hrokr
    License

    CDLA Permissive 1.0: https://cdla.io/permissive-1-0/

    Description

    Here is a brief rundown of the columns as well as links to some background information to get you talking like an expert in no time.

    • cut refers to one of the 10 or so most common diamond cuts. This dataset has an additional one called the 'Cushion Modified'. (See Diamond Shapes for background.)
    • color Clear diamonds are graded D-Z. Higher letters are more yellowish but are often better values, since color is hard to judge once the stone is set in a ring.
    • clarity refers to the inclusions (i.e., internal flaws) in the diamond, as seen through a jeweler's loupe or microscope. Fewer and smaller are better.
    • carat_weight refers to the mass of the diamond. It is loosely connected with the dimensions of a diamond, but cut and cut_quality tend to play an equally large if not larger role.
    • cut_quality refers to the GIA Cut Grading System, which was developed in 2005 and is the de facto standard.
    • lab is the grading lab. The big three are GIA, IGI, and HRD. Each diamond gets a lab certificate.
    • polish and symmetry are what you would expect.
    • eye-clean refers to whether blemishes or inclusions are visible to the naked eye. There are 10 grades.
    • culet_size is the size of the circle you'd see if you looked straight down. None is ideal, because a culet affects the amount of light that gets reflected.
    • culet_condition indicates whether the culet has any chipping, which is why some diamonds don't come to a point but end in a very small flat spot.
    • fancy_color_ columns have to do with colored diamonds. Formerly extremely rare, they are now common, popular, and almost always lab grown.
    • fluor columns refer to the effect of long-wave UV light. According to the GIA, 25-35% of diamonds fluoresce; for ~10% of those the effect is noticeable to an expert.
    • depth_percent and table_percent are the relative measurements of the flat part of the top and the depth. These vary somewhat by cut.
    • meas_length, meas_width, and meas_depth are the absolute measurements of the stone.
    • girdle min/max describe the girdle, where a stone's ID is engraved; the girdle is also where the stone meets the setting and plays a role in reflection. There are 9 values, ranging from extremely thin to extremely thick.
    • fancy columns refer to colored diamonds. They can be natural, like the extremely rare blue diamonds, or lab grown. The columns record the colors, secondary colors, and their intensity.
    • total_sales_price is priced in dollars.
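    As a quick-start sketch for exploring the columns above, the snippet below computes price per carat and compares it across clarity grades with pandas. The column names follow the rundown above; the tiny inline frame and its values are made up purely for illustration — in practice you would load the actual Kaggle CSV (file name assumed).

```python
import pandas as pd

# Tiny illustrative frame using column names from the rundown above.
# With the real dataset you would instead do something like:
#   df = pd.read_csv("diamonds.csv")   # file name is an assumption
df = pd.DataFrame({
    "clarity": ["VS1", "SI2", "VS1", "IF"],
    "carat_weight": [0.5, 1.0, 0.7, 0.5],
    "total_sales_price": [1500, 2000, 2800, 3000],
})

# Price per carat is a common first sanity check for diamond data.
df["price_per_carat"] = df["total_sales_price"] / df["carat_weight"]

# Compare the median price per carat across clarity grades.
print(df.groupby("clarity")["price_per_carat"].median().sort_values())
```

    The same groupby pattern extends naturally to cut, color, or lab once the full dataset is loaded.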
