This dataset was created by Robbie Manolache.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets often incorporate various functional patterns related to different aspects or regimes, which are typically not equally present throughout the dataset. We propose a novel partitioning algorithm that utilizes competition between models to detect and separate these functional patterns. This competition is induced by multiple models iteratively submitting their predictions for the dataset, with the best prediction for each data point being rewarded with training on that data point. This reward mechanism amplifies each model's strengths and encourages specialization in different patterns. The specializations can then be translated into a partitioning scheme. We validate our concept with datasets with clearly distinct functional patterns, such as mechanical stress and strain data in a porous structure. Our partitioning algorithm produces valuable insights into the datasets' structure, which can serve various further applications. As a demonstration of one exemplary usage, we set up modular models consisting of multiple expert models, each learning a single partition, and compare their performance on more than twenty popular regression problems with single models learning all partitions simultaneously. Our results show significant improvements, with up to 56% loss reduction, confirming our algorithm's utility.
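As an illustrative sketch (not the authors' implementation), the competition loop described above can be reproduced on a toy dataset with two linear regimes; the choice of two linear models, absolute-error scoring, and a least-squares refit on each model's won points are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with two functional regimes: y = 2x on the left, y = -x + 6 on the right.
x = np.linspace(0, 4, 200)
y = np.where(x < 2, 2 * x, -x + 6)

# Two competing linear models, randomly initialised as (slope, intercept).
models = [rng.normal(size=2) for _ in range(2)]

for _ in range(20):
    # Each model submits predictions for every data point.
    preds = np.stack([m[0] * x + m[1] for m in models])
    # The best prediction per point wins that point ...
    winner = np.argmin(np.abs(preds - y), axis=0)
    # ... and the winning model is rewarded with training on the points it won.
    for i in range(len(models)):
        mask = winner == i
        if mask.sum() >= 2:
            models[i] = np.polyfit(x[mask], y[mask], 1)

# The final winner assignment translates into a partitioning scheme.
final_preds = np.stack([m[0] * x + m[1] for m in models])
partition = np.argmin(np.abs(final_preds - y), axis=0)
```

In the typical outcome each model specializes in one regime, so `partition` recovers the two functional patterns.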
License: MIT License, https://opensource.org/licenses/MIT
Context
The Caltech-256 dataset is a foundational benchmark for object recognition, containing 30,607 images across 257 categories (256 object categories + 1 clutter category).
The original dataset is typically provided as a collection of directories, one for each category. This version streamlines the machine learning workflow by providing:
A clean, pre-defined 80/20 train-test split.
Manifest files (train.csv, test.csv) that map image paths directly to their labels, allowing for easy use with data generators in frameworks like PyTorch and TensorFlow.
A flat directory structure (train/, test/) for simplified file access.
File Content
The dataset is organized into a single top-level folder and two CSV files:
train.csv: A CSV file containing two columns: image_path and label. This file lists all images designated for the training set.
test.csv: A CSV file with the same structure as train.csv, listing all images designated for the testing set.
Caltech-256_Train_Test/: The primary data folder.
train/: This directory contains 80% of the images from all 257 categories, intended for model training.
test/: This directory contains the remaining 20% of the images from all categories, reserved for model evaluation.
Data Split
The dataset has been partitioned into a standard 80% training and 20% testing split. The split is stratified, meaning that each of the 257 object categories is represented in roughly an 80/20 proportion across the two sets.
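A quick way to verify the stratified 80/20 proportions is to compare per-category counts from the two manifest files; `split_proportions` is a hypothetical helper name, and only the image_path/label columns documented above are assumed:

```python
import pandas as pd

def split_proportions(train_csv: str, test_csv: str) -> pd.DataFrame:
    """Per-category train/test counts for the manifest files.

    Returns a frame with one row per label and a test_fraction column,
    which should sit near 0.20 for every category if the split is stratified.
    """
    train = pd.read_csv(train_csv)  # columns: image_path, label
    test = pd.read_csv(test_csv)
    counts = pd.concat(
        {"train": train["label"].value_counts(),
         "test": test["label"].value_counts()},
        axis=1,
    ).fillna(0)
    counts["test_fraction"] = counts["test"] / (counts["train"] + counts["test"])
    return counts
```

The same frame can also drive a sanity check that every category appears in both splits.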
Acknowledgements & Original Source
This dataset is a derivative work created for convenience. The original data and images belong to the authors of the Caltech-256 dataset.
Original Dataset Link: https://www.kaggle.com/datasets/jessicali9530/caltech256/data
Citation: Griffin, G., Holub, A. D., & Perona, P. (2007). Caltech-256 Object Category Dataset. California Institute of Technology.
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
Solute descriptors have been widely used to model chemical transfer processes through poly-parameter linear free energy relationships (pp-LFERs); however, there are still substantial difficulties in obtaining these descriptors accurately and quickly for new organic chemicals. In this research, models (PaDEL-DNN) that require only SMILES of chemicals were built to satisfactorily estimate pp-LFER descriptors using deep neural networks (DNN) and the PaDEL chemical representation. The PaDEL-DNN-estimated pp-LFER descriptors demonstrated good performance in modeling storage-lipid/water partitioning coefficient (log Kstorage‑lipid/water), bioconcentration factor (BCF), aqueous solubility (ESOL), and hydration free energy (freesolve). Then, assuming that the accuracy in the estimated values of widely available properties, e.g., logP (octanol–water partition coefficient), can calibrate estimates for less available but related properties, we proposed logP as a surrogate metric for evaluating the overall accuracy of the estimated pp-LFER descriptors. When using the pp-LFER descriptors to model log Kstorage‑lipid/water, BCF, ESOL, and freesolve, we achieved around 0.1 log unit lower errors for chemicals whose estimated pp-LFER descriptors were deemed “accurate” by the surrogate metric. The interpretation of the PaDEL-DNN models revealed that, for a given test chemical, having several (around 5) “similar” chemicals in the training data set was crucial for accurate estimation while the remaining less similar training chemicals provided reasonable baseline estimates. Lastly, pp-LFER descriptors for over 2800 persistent, bioaccumulative, and toxic chemicals were reasonably estimated by combining PaDEL-DNN with the surrogate metric. Overall, the PaDEL-DNN/surrogate metric and newly estimated descriptors will greatly benefit chemical transfer modeling.
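The surrogate-metric idea above can be sketched as a simple filter: chemicals whose predicted logP is close to the known logP are flagged as having "accurate" estimated pp-LFER descriptors. The 0.5 log-unit threshold and the function name are illustrative assumptions, not values from the paper:

```python
import numpy as np

def accuracy_flags(logp_pred, logp_true, threshold=0.5):
    """Flag chemicals whose estimated descriptors are deemed 'accurate'
    because their predicted logP lies within `threshold` log units of
    the known logP (threshold is a hypothetical choice)."""
    logp_pred = np.asarray(logp_pred, dtype=float)
    logp_true = np.asarray(logp_true, dtype=float)
    return np.abs(logp_pred - logp_true) <= threshold
```

Downstream property models would then report errors separately for flagged and unflagged chemicals, mirroring the roughly 0.1 log unit improvement described above.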
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The raw data come from Ba Nguyen et al. (2022), who hosted their data here. This dataset was used in an independent study by Rijal et al. (2025), who preprocessed the data using these notebook scripts. They did not release their processed data, so we reproduced their processing pipeline and have uploaded the resulting data ourselves as part of this data resource.
This release accompanies this publication: https://doi.org/10.57844/arcadia-bmb9-fzxd
DDI_Ben
The DDI_Ben dataset is divided into five parts:
Random_drugbank: DDI data for the training, validation, and test sets under scenarios S1 and S2, generated by randomly partitioning the DrugBank dataset into training, validation, and test subsets.
Random_twosides: DDI data for the training, validation, and test sets under scenarios S1 and S2, generated by randomly partitioning the TWOSIDES dataset into training, validation, and test subsets.
… See the full description on the dataset page: https://huggingface.co/datasets/juejueziok/DDI_Ben.
License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0
Overview: This dataset has 17 classes. The data are divided into three partitions: train, val, and test.
Dataset Characteristics: Image. Feature Type: Categorical. Associated Tasks: Classification, Other.
Class Labels:
0: Beet Armyworm
1: Black Hairy
2: Cutworm
3: Field Cricket
4: Jute Aphid
5: Jute Hairy
6: Jute Red Mite
7: Jute Semilooper
8: Jute Stem Girdler
9: Jute Stem Weevil
10: Leaf Beetle
11: Mealybug
12: Pod Borer
13: Scopula Emissaria
14: Termite
15: Termite odontotermes (Rambur)
16: Yellow Mite
Has Missing Values?: No
Ablation studies of length-scaling cosine distance, the dynamic training data partition strategy, and the GNN-based encoder on SCOPe v2.07 and ind_PDB.
License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0
For convenience during training, the file train includes:
Training set.
Validation set.
In-domain test set.
Data partitioning rules are defined in dataset.py.
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
Malaria is the leading cause of death in the African region. Data mining can help extract valuable knowledge from available data in the healthcare sector, making it possible to train models to predict patient health faster than in clinical trials. Implementations of various machine learning algorithms, such as K-Nearest Neighbors, Bayes Theorem, Logistic Regression, Support Vector Machines, and Multinomial Naïve Bayes (MNB), have been applied to malaria datasets in public hospitals, but there are still limitations in modeling with the multinomial Naive Bayes algorithm. This study applies the MNB model to explore the relationship between 15 relevant attributes of public hospital data. The goal is to examine how the dependency between attributes affects the performance of the classifier. MNB creates a transparent and reliable graphical representation of the attributes with the ability to predict new situations. The MNB model achieved 97% accuracy; it was, however, outperformed by the GNB classifier and the RF classifier, each of which achieved 100% accuracy.
Methods
Prior to data collection, the researcher was guided by ethical training certification on data collection and the rights to confidentiality and privacy, as overseen by the Institutional Review Board (IRB). Data were collected from the manual archives of hospitals purposively selected using a stratified sampling technique, transformed into electronic form, and stored in a MySQL database called malaria. Each patient file was extracted and reviewed for signs and symptoms of malaria, then checked for a laboratory-confirmed diagnosis. The data were divided into two tables: data1, containing the data for phase 1 of the classification, and data2, containing the data for phase 2 of the classification.
Data Source Collection
The malaria incidence dataset was obtained from public hospitals and covers 2017 to 2021. These are the data used for modeling and analysis, taking into account the geographical location and socio-economic factors available for patients inhabiting those areas. Naive Bayes (Multinomial) is the model used to analyze the collected data for malaria disease prediction and grading.
Data Preprocessing:
Data preprocessing was done to remove noise and outliers.
Transformation:
The data were transformed from analog to electronic records.
Data Partitioning
The collected data were divided into two portions: one portion was extracted as a training set, while the other was used for testing. One training portion was taken from a table stored in the database and called training set 1, while the other was taken from a second table in the database and called training set 2.
The dataset was split into two parts: 70% for training and the remaining 30% for testing. Using the MNB classification algorithm implemented in Python, models were trained on the training sample, tested on the remaining 30%, and the results were compared with those of the other machine learning models using standard metrics.
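A minimal scikit-learn sketch of the 70/30 split and MNB training described above; the function name, the random seed, and the use of stratified splitting are illustrative assumptions, and X would hold the 15 attribute values per patient with y the malaria labels:

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

def train_mnb(X, y, seed=42):
    """Split 70/30, fit a Multinomial Naive Bayes classifier on the
    training portion, and report accuracy on the held-out 30%."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=seed, stratify=y
    )
    model = MultinomialNB().fit(X_train, y_train)
    return model, accuracy_score(y_test, model.predict(X_test))
```

The same split could be reused to compare MNB against the GNB and RF classifiers mentioned above.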
Classification and prediction:
Based on the nature of the variables in the dataset, this study uses Naïve Bayes (Multinomial) classification in two phases: classification phase 1 and classification phase 2. The operation of the framework is as follows:
i. Data collection and preprocessing are performed.
ii. Preprocessed data are stored in training set 1 and training set 2, which are used during classification.
iii. The test dataset is stored in the database as the test dataset.
iv. Part of the test dataset is classified using classifier 1 and the remaining part using classifier 2, as follows:
Classifier phase 1: classifies records into positive or negative classes. If the patient has malaria, the patient is classified as positive (P); if the patient does not have malaria, the patient is classified as negative (N).
Classifier phase 2: classifies only the records labeled positive by classifier 1, further assigning them to the complicated or uncomplicated class label. The classifier also captures data on environmental factors, genetics, gender and age, and cultural and socio-economic variables. The system is designed so that the core parameters, as determining factors, must be supplied with values.
Background and purpose: External drainage represents a well-established treatment option for acute intracerebral hemorrhage. The current standard of practice includes post-operative computed tomography imaging, which is evaluated subjectively. The implementation of an objective, automated evaluation of postoperative studies may enhance diagnostic accuracy and facilitate the scaling of research projects. The objective is to develop and validate a fully automated pipeline for intracerebral hemorrhage and drain detection, quantification of intracerebral hemorrhage coverage, and detection of malpositioned drains.
Materials and methods: In this retrospective study, we selected patients (n = 68) suffering from supratentorial intracerebral hemorrhage treated by minimally invasive surgery from the years 2010 to 2018. These were divided into training (n = 21), validation (n = 3), and testing (n = 44) datasets. Mean age (SD) was 70 (±13.56) years; 32 were female. Intracerebral hemorrhage and drains were automatically segmented using a previously published artificial-intelligence-based approach. From this, we calculated coverage profiles of the correctly detected drains to quantify the drains' coverage by the intracerebral hemorrhage and classify malpositioning. We used accuracy measures to assess detection and classification results and the intraclass correlation coefficient to assess the quantification of drain coverage by the intracerebral hemorrhage.
Results: In the test dataset, the pipeline showed a drain detection accuracy of 0.97 (95% CI: 0.92 to 0.99), an agreement between predicted and ground-truth coverage profiles of 0.86 (95% CI: 0.85 to 0.87), and a drain position classification accuracy of 0.88 (95% CI: 0.77 to 0.95), resulting in an area under the receiver operating characteristic curve of 0.92 (95% CI: 0.85 to 0.99).
Conclusion: We developed and statistically validated an automated pipeline for evaluating computed tomography scans after minimally invasive surgery for intracerebral hemorrhage. The algorithm reliably detects drains, quantifies drain coverage by the hemorrhage, and uses machine learning to detect malpositioned drains. This pipeline has the potential to impact the daily clinical workload, as well as to facilitate the scaling of data collection for future research into intracerebral hemorrhage and other diseases.
License: custom license, https://dataverse.csuc.cat/api/datasets/:persistentId/versions/1.2/customlicense?persistentId=doi:10.34810/data314
This package contains a partition of the Iula Spanish LSP Treebank into train and test sets for Machine Learning experiments, so that the same partitions can be used by different researchers and their results can be directly compared. In this package we also deliver the Tibidabo Treebank (Marimon 2010), which contains a set of sentences extracted from the Ancora corpus annotated in the same way as the Iula Treebank. The Tibidabo Treebank is a very good test set for models trained with the Iula Spanish LSP Treebank, since the sentences that form it come from a very different domain than those of the Iula Spanish LSP Treebank.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This archive includes the scripts and related input data to produce the results for the paper entitled "Structural constraints in current stomatal conductance models preclude accurate estimation of evapotranspiration and its partitions". The files/folders are described below:
1. Input_Data: This folder contains all the required input data including FluxNet data, soil properties, quality controlled training-validation data, and metadata & other supporting information of the sites.
2. Model_EMP: This folder contains all the scripts for the empirical model of stomatal conductance. (Note: scripts are written in MATLAB.) Nothing needs to be changed except the MATLAB executable path in the two files "run_all_tasks_to_optimize_params.sh" and "prediction.sh". Read the "ReadMe.txt" file in the folder "Model_EMP" for more instructions on running the model.
3. Model_ML: This folder contains all the scripts for the pure machine learning model of stomatal conductance. It contains four sub-folders: 1. Model_Config_1 (model with configuration 1); 2. Model_Config_2_TEA (model with configuration 2 and TEA-based T estimates); 3. Model_Config_2_uWUE (model with configuration 2 and uWUE-based T estimates); 4. Model_Config_2_Yu22 (model with configuration 2 and Yu22-based T estimates). Further instructions are given in each Jupyter notebook. Briefly, in the folder "Model_Config_1", the notebook "train_ML_config_1.ipynb" trains the model parameters and the notebook "Predictions_ML_config_1" is used to make predictions. Similar instructions apply to the other subfolders. (Note: scripts are written in Python.) All the scripts are fully functional as long as all the required modules are installed.
4. Model_PH_exp: This folder contains all the scripts for the plant hydraulics model with explicit representation. All the scripts are self-explanatory and further instructions are provided in the scripts as needed. (Note: scripts are written in Python.) All the scripts are fully functional as long as all the required modules are installed.
5. Model_PN_imp: This folder contains all the scripts for the plant hydraulics model with implicit representation. The instructions given for "Model_ML" apply here as well. (Note: scripts are written in Python.) All the scripts are fully functional as long as all the required modules are installed.
Versions: TensorFlow 2.11.0, MATLAB R2022a, Python 3.10.9
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Data set partitioning into training, validation and test data sets, considering the quantity of cells and the quantity of images.
This dataset was created by Teddy_55.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Results for feature selection, model selection, and validation, using the two selection criteria and the four data partitioning schemes. The outlier, 2OZA, was omitted from these runs. The number of features for the respective models is shown (#), alongside their leave-one-out cross-validation correlations and RMSE. The RMSE and correlation of the values used for selecting these models are also shown, as are those obtained when the model is applied to the validation set, along with the significance of the correlation.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The Swahili dataset was developed specifically for the language modeling task. It contains 28,000 unique words, with 6.84M, 970k, and 2M words in the train, valid, and test partitions respectively, which represents the ratio 80:10:10. The entire dataset is lowercased and has no punctuation marks, and start- and end-of-sentence markers have been incorporated to facilitate easy tokenization during language modeling. The train partition is the largest in order to support unsupervised learning of word representations, while the hyper-parameters are adjusted based on performance on the valid partition before evaluating the language model on the test partition.
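The preprocessing described above (lowercasing, punctuation removal, sentence markers) might be sketched as follows; the `<s>`/`</s>` marker strings and the regex are assumptions, since the dataset card does not specify them:

```python
import re

def preprocess(sentence: str) -> list[str]:
    """Lowercase, strip punctuation, tokenize on whitespace, and wrap
    the sentence in (hypothetical) start/end-of-sentence markers."""
    cleaned = re.sub(r"[^\w\s]", "", sentence.lower())
    return ["<s>"] + cleaned.split() + ["</s>"]
```

Applied per sentence, this yields token streams that a language model can consume directly.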
Background: To address the limitations of commonly used cross-validation methods, the linear regression (LR) method was proposed to estimate the population accuracy of predictions, based on the implicit assumption that the fitted model is correct. This method also provides two statistics to determine the adequacy of the fitted model. The validity and behavior of the LR method have been provided and studied for linear predictions but not for nonlinear predictions. The objectives of this study were to: 1) provide a mathematical proof of the validity of the LR method when predictions are based on conditional means, regardless of whether the predictions are linear or non-linear; 2) investigate the ability of the LR method to detect whether the fitted model is adequate or inadequate; and 3) provide guidelines on how to appropriately partition the data into training and validation sets such that the LR method can identify an inadequate model.
Results: We present a mathematical proof of the validity of the LR method to estimate population accuracy and to determine whether the fitted model is adequate or inadequate when the predictor is the conditional mean, which may be a non-linear function of the phenotype. Using three partitioning scenarios of simulated data, we show that one of the LR statistics can detect an inadequate model only when the data are partitioned such that the values of relevant predictor variables differ between the training and validation sets. In contrast, we observed that the other LR statistic was able to detect an inadequate model in all three scenarios.
Conclusion: The LR method has been proposed to address some limitations of the traditional approach of cross-validation in genetic evaluation. In this paper, we showed that the LR method is valid when the model is adequate and the conditional mean is the predictor, even when it is a non-linear function of the phenotype. We found that one of the two LR statistics is superior because it was able to detect an inadequate model in all three partitioning scenarios that were studied (i.e., between animals, by age within animals, and between animals and by age).
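A hedged sketch of the kind of quantities the LR method compares: predictions obtained from the partial (training) data versus predictions from the whole data. The exact estimators vary in the literature, so treat these formulas as a common textbook form rather than this paper's definitions:

```python
import numpy as np

def lr_statistics(u_partial, u_whole):
    """Compare predictions from partial (training) data with predictions
    from the whole data. Returns (bias, dispersion slope, accuracy ratio):
    bias is expected to be 0 and the slope 1 for an adequate model."""
    u_p = np.asarray(u_partial, dtype=float)
    u_w = np.asarray(u_whole, dtype=float)
    cov = np.cov(u_w, u_p)[0, 1]
    bias = np.mean(u_w - u_p)              # expected 0
    slope = cov / np.var(u_p, ddof=1)      # expected 1
    ratio = cov / np.var(u_w, ddof=1)      # ratio-of-accuracies proxy
    return bias, slope, ratio
```

Deviations of the bias from 0 or the slope from 1 are the kind of signals the statistics above use to flag an inadequate model.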
Training data of the model, detokenized in the exact order seen by the model. The training data are partitioned into 8 chunks (chunk-0 through chunk-7) based on the GPU rank that generated them. Each chunk contains detokenized text files in JSON Lines format (.jsonl).
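Reading the chunks back in rank order might look like the following sketch; the per-record "text" field and the exact file layout inside each chunk directory are assumptions about the schema:

```python
import json
from pathlib import Path

def iter_training_text(root):
    """Yield detokenized training text in rank order, assuming a
    chunk-0 ... chunk-7 layout of .jsonl files under `root`."""
    for rank in range(8):
        chunk = Path(root) / f"chunk-{rank}"
        if not chunk.is_dir():
            continue  # tolerate missing ranks
        for path in sorted(chunk.glob("*.jsonl")):
            with open(path) as f:
                for line in f:
                    yield json.loads(line)["text"]
```

Because the generator streams line by line, it handles arbitrarily large chunks without loading them into memory.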
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This dataset is a collection of the datasets and model checkpoints used in iPXRDnet.
Model checkpoint files:
hmof-130T_Hydrogen: H2 adsorption prediction model trained on the hMOF-130T database
hmof-130T_CarbonDioxide: CO2 adsorption prediction model trained on the hMOF-130T database
hmof-130T_Nitrogen: N2 adsorption prediction model trained on the hMOF-130T database
hmof-130T_Methane: CH4 adsorption prediction model trained on the hMOF-130T database
hmof-300T: adsorption prediction model trained on the hMOF-300T database
Gas_Se: separation selectivity prediction model obtained by training
Gas_SD: self-diffusion coefficient prediction model obtained by training
MOD: bulk modulus and shear modulus prediction model obtained by training
exAPMOF-1bar-ALM+PXRD: experimental adsorption at 1 bar model of anion-pillared MOFs, trained with PXRD and material ligands
exAPMOF-1bar-ALM: experimental adsorption at 1 bar model of anion-pillared MOFs, trained with material ligands only
exAPMOF-1bar-PXRD: experimental adsorption at 1 bar model of anion-pillared MOFs, trained with PXRD only
exAPMOF-ISO: experimental adsorption isotherm model of anion-pillared MOFs obtained by training
exAPMOF-1bar-NOacvPXRD: experimental adsorption at 1 bar model of anion-pillared MOFs, trained with pre-activation PXRD data only
exAPMOF-1bar-acvPXRD: experimental adsorption at 1 bar model of anion-pillared MOFs, trained with post-activation PXRD data only
Dataset files:
hmof-xrd+str+ad: PXRD, gas adsorption, and structural feature data of the hMOF-300T database
hMOF-130T_ad_list_mof: gas adsorption data of the hMOF-130T database
hMOF-130T_GAS_DICT: gas descriptor data of the hMOF-130T database
hMOF-130T_STR_DICT: structural feature data of the hMOF-130T database
hMOF-130T_PXRD_DICT: PXRD data of the hMOF-130T database
MOD_data: bulk modulus and shear modulus data of Moghadam's MOFs
MOD_PXRD_dict: PXRD data of Moghadam's MOFs
GAS_SD-data: self-diffusion coefficient data in the CoREMOF database
SE-CO2,N2_data: separation selectivity, PXRD, and structural feature data of the CO2/N2 selectivity database
Sa_sp: dataset partitioning results of the CO2/N2 selectivity database
gas_dict: gas descriptor data used in the self-diffusion coefficient database
PXRD_DICT: post-activation PXRD data of MOFs in the anion-pillared MOFs' experimental database
xrd_noacv: pre-activation PXRD data of MOFs in the anion-pillared MOFs' experimental database
Smiles_ads: SMILES data of gases in the anion-pillared MOFs' experimental database
all_exAPMOF-1bar: anion-pillared MOFs' experimental adsorption data at 298 K and 1 bar
all_exAPMOF-1bar-NOacv: experimental adsorption data for anion-pillared MOFs with pre-activation PXRD at 298 K and 1 bar
exAPMOF_DICT: SMILES data of MOF ligands and descriptors of metal centers in the anion-pillared MOFs' experimental database
all_exAPMOF-iso: key library of MOF and gas combinations in the anion-pillared MOFs' experimental isotherm database
exAPMOF_ISOdata: anion-pillared MOFs' experimental adsorption isotherm data at 298 K