3 datasets found

Candidate predictors
plos.figshare.com
xls
Updated May 6, 2025
Cite
Kexin Qu; Monique Gainey; Samika S. Kanekar; Sabiha Nasrim; Eric J. Nelson; Stephanie C. Garbern; Mahmuda Monjory; Nur H. Alam; Adam C. Levine; Christopher H. Schmid (2025). Candidate predictors [Dataset]. http://doi.org/10.1371/journal.pdig.0000820.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pdig.0000820.t002
Dataset updated
May 6, 2025
Dataset provided by
PLOS Digital Health
Authors
Kexin Qu; Monique Gainey; Samika S. Kanekar; Sabiha Nasrim; Eric J. Nelson; Stephanie C. Garbern; Mahmuda Monjory; Nur H. Alam; Adam C. Levine; Christopher H. Schmid
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Many comparisons of statistical regression and machine learning algorithms to build clinical predictive models use inadequate methods to build regression models and do not have proper independent test sets on which to externally validate the models. Proper comparisons for models of ordinal categorical outcomes do not exist. We set out to compare model discrimination for four regression and machine learning methods in a case study predicting the ordinal outcome of severe, some, or no dehydration among patients with acute diarrhea presenting to a large medical center in Bangladesh using data from the NIRUDAK study derivation and validation cohorts. Proportional Odds Logistic Regression (POLR), penalized ordinal regression (RIDGE), classification trees (CART), and random forest (RF) models were built to predict dehydration severity and compared using three ordinal discrimination indices: ordinal c-index (ORC), generalized c-index (GC), and average dichotomous c-index (ADC). Performance was evaluated on models developed on the training data, on the same models applied to an external test set and through internal validation with three bootstrap algorithms to correct for overoptimism. RF had superior discrimination on the original training data set, but its performance was more similar to the other three methods after internal validation using the bootstrap. Performance for all models was lower on the prospective test dataset, with particularly large reduction for RF and RIDGE. POLR had the best performance in the test dataset and was also most efficient, with the smallest final model size. Clinical prediction models for ordinal outcomes, just like those for binary and continuous outcomes, need to be prospectively validated on external test sets if possible because internal validation may give a too optimistic picture of model performance. 
Regression methods can perform as well as more automated machine learning methods if constructed with attention to potential nonlinear associations. Because regression models are often more interpretable clinically, their use should be encouraged.
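The description names the average dichotomous c-index (ADC) without saying how it is computed. The sketch below shows one plausible reading, assumed here rather than taken from the source: split the ordinal outcome at each cut point, compute an ordinary c-index for "above the cut" versus "at or below the cut", and take the unweighted mean. The function names, the toy outcomes, and the toy scores are all hypothetical.

```python
from itertools import product


def c_index(scores_high, scores_low):
    """Share of (high, low) pairs where the higher-outcome case has the
    larger score; tied scores count as half a concordant pair."""
    pairs = list(product(scores_high, scores_low))
    if not pairs:
        raise ValueError("both groups must be non-empty")
    wins = sum(1.0 if h > l else 0.5 if h == l else 0.0 for h, l in pairs)
    return wins / len(pairs)


def average_dichotomous_c(outcomes, scores, levels):
    """Unweighted mean of the c-indices over every dichotomization of an
    ordinal outcome (assumed definition; weighting is not given in the text)."""
    ranks = [levels.index(y) for y in outcomes]
    per_cut = []
    for cut in range(1, len(levels)):
        high = [s for r, s in zip(ranks, scores) if r >= cut]
        low = [s for r, s in zip(ranks, scores) if r < cut]
        per_cut.append(c_index(high, low))
    return sum(per_cut) / len(per_cut)


# Ordered from least to most severe, matching the outcome in the description.
LEVELS = ["no", "some", "severe"]

# Hypothetical toy data: observed dehydration category and a model risk score.
outcomes = ["no", "no", "some", "some", "severe", "severe"]
scores = [0.1, 0.4, 0.3, 0.6, 0.5, 0.9]

adc = average_dichotomous_c(outcomes, scores, LEVELS)
print(f"ADC on toy data: {adc:.3f}")
```

A value of 1.0 would mean the score orders every pair correctly at every cut point, and 0.5 corresponds to chance-level ordering.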
Baseline sociodemographic and clinical data
plos.figshare.com
xls
Updated May 6, 2025
Cite
Kexin Qu; Monique Gainey; Samika S. Kanekar; Sabiha Nasrim; Eric J. Nelson; Stephanie C. Garbern; Mahmuda Monjory; Nur H. Alam; Adam C. Levine; Christopher H. Schmid (2025). Baseline sociodemographic and clinical data [Dataset]. http://doi.org/10.1371/journal.pdig.0000820.t003
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pdig.0000820.t003
Dataset updated
May 6, 2025
Dataset provided by
PLOS Digital Health
Authors
Kexin Qu; Monique Gainey; Samika S. Kanekar; Sabiha Nasrim; Eric J. Nelson; Stephanie C. Garbern; Mahmuda Monjory; Nur H. Alam; Adam C. Levine; Christopher H. Schmid
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Many comparisons of statistical regression and machine learning algorithms to build clinical predictive models use inadequate methods to build regression models and do not have proper independent test sets on which to externally validate the models. Proper comparisons for models of ordinal categorical outcomes do not exist. We set out to compare model discrimination for four regression and machine learning methods in a case study predicting the ordinal outcome of severe, some, or no dehydration among patients with acute diarrhea presenting to a large medical center in Bangladesh using data from the NIRUDAK study derivation and validation cohorts. Proportional Odds Logistic Regression (POLR), penalized ordinal regression (RIDGE), classification trees (CART), and random forest (RF) models were built to predict dehydration severity and compared using three ordinal discrimination indices: ordinal c-index (ORC), generalized c-index (GC), and average dichotomous c-index (ADC). Performance was evaluated on models developed on the training data, on the same models applied to an external test set and through internal validation with three bootstrap algorithms to correct for overoptimism. RF had superior discrimination on the original training data set, but its performance was more similar to the other three methods after internal validation using the bootstrap. Performance for all models was lower on the prospective test dataset, with particularly large reduction for RF and RIDGE. POLR had the best performance in the test dataset and was also most efficient, with the smallest final model size. Clinical prediction models for ordinal outcomes, just like those for binary and continuous outcomes, need to be prospectively validated on external test sets if possible because internal validation may give a too optimistic picture of model performance. 
Regression methods can perform as well as more automated machine learning methods if constructed with attention to potential nonlinear associations. Because regression models are often more interpretable clinically, their use should be encouraged.
The largest diamond dataset currently on Kaggle
kaggle.com
Updated Mar 17, 2023
Cite
hrokr (2023). The largest diamond dataset currently on Kaggle [Dataset]. https://www.kaggle.com/hrokrin/the-largest-diamond-dataset-currely-on-kaggle/discussion
Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Mar 17, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
hrokr
License
https://cdla.io/permissive-1-0/
Description
Here is a brief rundown of the columns as well as links to some background information to get you talking like an expert in no time.

cut refers to one of the 10 or so most common diamond cuts. This dataset has an additional one called the 'Cushion Modified'.

color refers to the D-Z grading of clear diamonds. Grades later in the alphabet are more yellowish but are often better values, since color is hard to determine once the stone is in a ring.

clarity refers to the inclusions (i.e., internal flaws) in the diamonds seen through a jeweler's loupe or microscope. Fewer and smaller are better.

carat_weight refers to the mass of the diamond. It's loosely connected with the dimensions of a diamond, but cut and cut_quality tend to play an equally large if not larger role.

cut_quality refers to the GIA Cut Grading System, which was developed in 2005 and is the de facto standard.

lab is the grading lab. The big three are GIA, IGI and HRD. Each diamond gets a lab certificate.

polish and symmetry are what you would expect.

eye-clean refers to the blemishes or inclusions that can be seen with the naked eye. There are 10 grades.

culet_size is the size of the circle you'd see if you looked straight down. None is ideal because it affects the amount of light that gets reflected.

culet_condition indicates if the culet has any chipping, which is why some diamonds don't close to a point but rather a very small flat spot.

fancy_color_ columns have to do with colored diamonds. Formerly extremely rare, these are now common, popular, and almost always lab grown.

fluor columns refer to fluorescence, the effect of long-wave UV light. According to GIA, 25-35% of diamonds have it; for ~10% of those it's noticeable to an expert.

depth_percent and table_percent are the relative measurements of the depth and of the flat part of the top, respectively. This varies somewhat by cut.

meas_length, meas_width, meas_depth are the absolute measurements of the stone.

girdle min/max refer to the girdle, which is where the ID of a stone is engraved. It is also where the stone meets the setting, and it plays a role in reflection. There are 9 values ranging from extremely thin to extremely thick.

fancy columns refer to colored diamonds. They can be natural like the extremely rare blue diamonds, or lab grown. The columns refer to the colors, secondary colors and their intensity.

total_sales_price is priced in dollars.
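As a rough illustration of how the columns described above might be used, the sketch below reads a few rows with the standard csv module, converts the D-Z color grade into an ordinal rank, and derives a price per carat. The column names come from the description; the sample rows, the helper names, and the assumption that the file is a CSV with these exact headers are hypothetical.

```python
import csv
import io

# Hypothetical sample rows; the values are illustrative, not taken from the dataset.
SAMPLE = """cut,color,carat_weight,total_sales_price
Round,D,0.50,2000
Cushion Modified,J,1.00,3000
"""


def color_rank(grade):
    """Map a D-Z color grade to 0..22 (D = 0, the least yellowish).
    Returns None for anything else, e.g. rows for fancy colored stones."""
    if len(grade) == 1 and "D" <= grade <= "Z":
        return ord(grade) - ord("D")
    return None


def load_rows(text):
    """Parse CSV text into dicts with an ordinal color and a price per carat."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        carat = float(rec["carat_weight"])
        price = float(rec["total_sales_price"])  # described as priced in dollars
        rows.append({
            "cut": rec["cut"],
            "color_rank": color_rank(rec["color"]),
            "price_per_carat": price / carat,
        })
    return rows


rows = load_rows(SAMPLE)
for row in rows:
    print(row)
```

Treating color as an ordinal rank rather than a free-text label keeps the D-Z ordering available to a model; the same idea could be applied to clarity, cut_quality, or the girdle columns once their grade orderings are known.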