100+ datasets found

d
UCI Machine Learning Repository
dknet.org
rrid.site
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
UCI Machine Learning Repository [Dataset]. http://identifiers.org/RRID:SCR_026571
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_026571
Description
Collection of databases, domain theories, and data generators that are used by machine learning community for empirical analysis of machine learning algorithms. Datasets approved to be in the repository will be assigned Digital Object Identifier (DOI) if they do not already possess one. Datasets will be licensed under a Creative Commons Attribution 4.0 International license (CC BY 4.0) which allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given
Cancer Multiple Dataset UCI MLR
kaggle.com
zip
Updated Aug 5, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Medi Hunter - 4004 (2025). Cancer Multiple Dataset UCI MLR [Dataset]. https://www.kaggle.com/datasets/shuvokumarbasakbd/cancer-multiple-dataset-uci-mlr/suggestions
Explore at:
zip(74213598 bytes)Available download formats
Dataset updated
Aug 5, 2025
Authors
Medi Hunter - 4004
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Source More Info : https://archive.ics.uci.edu/datasets

The **UCI Machine Learning Repository **is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.

The datasets collected in this project represent a diverse and comprehensive set of cancer-related data sourced from the UCI Machine Learning Repository. They cover a wide spectrum of cancer types and research perspectives, including breast cancer datasets such as the original, diagnostic, prognostic, and Coimbra variants, which focus on tumor features, recurrence, and biochemical markers. Cervical cancer is represented through datasets focusing on behavioral risks and general risk factors. The lung cancer dataset provides categorical diagnostic attributes, while the primary tumor dataset offers insights into tumor locations based on metastasis data. Additionally, specialized datasets like differentiated thyroid cancer recurrence, glioma grading with clinical and mutation features, and gene expression RNA-Seq data expand the scope into genetic and molecular-level cancer analysis. Together, these datasets support a wide range of machine learning applications including classification, prediction, survival analysis, and feature correlation across various types of cancer.

RRA_Think Differently, Create history’s next line.

Hello Data Hunters! Hope you're doing well. https://www.kaggle.com/shuvokumarbasak4004 (More Dataset) https://www.kaggle.com/shuvokumarbasak2030
G
GIS Resource Compilation Map Package - Applications of Machine Learning...
gdr.openei.org
data.openei.org
+3more
Updated Jun 1, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Stephen Brown; Michael Fehler; Mark Coolbaugh; Sven Treitel; James Faulds; Bridget Ayling; Cary Lindsey; Rachel Micander; Eli Mlawsky; Connor Smith; John Queen; Chen Gu; John Akerley; Jacob DeAngelo; Jonathan Glen; Drew Siler; Erick Burns; Ian Warren; Stephen Brown; Michael Fehler; Mark Coolbaugh; Sven Treitel; James Faulds; Bridget Ayling; Cary Lindsey; Rachel Micander; Eli Mlawsky; Connor Smith; John Queen; Chen Gu; John Akerley; Jacob DeAngelo; Jonathan Glen; Drew Siler; Erick Burns; Ian Warren (2021). GIS Resource Compilation Map Package - Applications of Machine Learning Techniques to Geothermal Play Fairway Analysis in the Great Basin Region, Nevada [Dataset]. http://doi.org/10.15121/1897037
Explore at:
Unique identifier
https://doi.org/10.15121/1897037
Dataset updated
Jun 1, 2021
Dataset provided by
Geothermal Data Repository
Nevada Bureau of Mines and Geology
USDOE Office of Energy Efficiency and Renewable Energy (EERE), Renewable Power Office. Geothermal Technologies Program (EE-4G)
Authors
Stephen Brown; Michael Fehler; Mark Coolbaugh; Sven Treitel; James Faulds; Bridget Ayling; Cary Lindsey; Rachel Micander; Eli Mlawsky; Connor Smith; John Queen; Chen Gu; John Akerley; Jacob DeAngelo; Jonathan Glen; Drew Siler; Erick Burns; Ian Warren; Stephen Brown; Michael Fehler; Mark Coolbaugh; Sven Treitel; James Faulds; Bridget Ayling; Cary Lindsey; Rachel Micander; Eli Mlawsky; Connor Smith; John Queen; Chen Gu; John Akerley; Jacob DeAngelo; Jonathan Glen; Drew Siler; Erick Burns; Ian Warren
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Nevada, Great Basin
Description
This submission contains an ESRI map package (.mpk) with an embedded geodatabase for GIS resources used or derived in the Nevada Machine Learning project, meant to accompany the final report. The package includes layer descriptions, layer grouping, and symbology. Layer groups include: new/revised datasets (paleo-geothermal features, geochemistry, geophysics, heat flow, slip and dilation, potential structures, geothermal power plants, positive and negative test sites), machine learning model input grids, machine learning models (Artificial Neural Network (ANN), Extreme Learning Machine (ELM), Bayesian Neural Network (BNN), Principal Component Analysis (PCA/PCAk), Non-negative Matrix Factorization (NMF/NMFk) - supervised and unsupervised), original NV Play Fairway data and models, and NV cultural/reference data.

See layer descriptions for additional metadata. Smaller GIS resource packages (by category) can be found in the related datasets section of this submission. A submission linking the full codebase for generating machine learning output models is available through the "Related Datasets" link on this page, and contains results beyond the top picks present in this compilation.
Z
Data from: MLFMF: Data Sets for Machine Learning for Mathematical...
data.niaid.nih.gov
Updated Oct 26, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bauer, Andrej; Petković, Matej; Todorovski, Ljupčo (2023). MLFMF: Data Sets for Machine Learning for Mathematical Formalization [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10041074
Explore at:
Dataset updated
Oct 26, 2023
Dataset provided by
University of Ljubljana
Institute of Mathematics, Physics, and Mechanics
Authors
Bauer, Andrej; Petković, Matej; Todorovski, Ljupčo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
MLFMF MLFMF (Machine Learning for Mathematical Formalization) is a collection of data sets for benchmarking recommendation systems used to support formalization of mathematics with proof assistants. These systems help humans identify which previous entries (theorems, constructions, datatypes, and postulates) are relevant in proving a new theorem or carrying out a new construction. The MLFMF data sets provide solid benchmarking support for further investigation of the numerous machine learning approaches to formalized mathematics. With more than 250,000 entries in total, this is currently the largest collection of formalized mathematical knowledge in machine learnable format. In addition to benchmarking the recommendation systems, the data sets can also be used for benchmarking node classification and link prediction algorithms. The four data sets Each data set is derived from a library of formalized mathematics written in proof assistants Agda or Lean. The collection includes

the largest Lean 4 library Mathlib, the three largest Agda libraries:

the standard library the library of univalent mathematics Agda-unimath, and the TypeTopology library. Each data set represents the corresponding library in two ways: as a heterogeneous network, and as a list of syntax trees of all the entries in the library. The network contains the (modular) structure of the library and the references between entries, while the syntax trees give complete and easily parsed information about each entry. The Lean library data set was obtained by converting .olean files into s-expressions (see the lean2sexp tool). The Agda data sets were obtained with an s-expression extension of the official Agda repository (use either master-sexp or release-2.6.3-sexp branch). For more details, see our arXiv copy of the paper. Directory structure First, the mlfmf.zip archive needs to be unzipped. It contains a separate directory for every library (for example, the standard library of Agda can be found in the stdlib directory) and some auxiliary files. Every library directory contains

the network file from which the heterogeneous network can be loaded, a zip of the entries directory that contains (many) files with abstract syntax trees. Each of those files describes a single entry of the library. In addition to the auxiliary files which are used for loading the data (and described below), the zipped sources of lean2sexp and Agda s-expression extension are present. Loading the data In addition to the data files, there is also a simple python script main.py for loading the data. To run it, you will have to install the packages listed in the file requirements.txt: tqdm and networkx. The easiest way to do so is calling pip install -r requirements.txt. When running main.py for the first time, the script will unzip the entry files into the directory named entries. After that, the script loads the syntax trees of the entries (see the Entry class) and the network (as networkx.MultiDiGraph object). Note. The entry files have extension .dag (directed acyclic graph), since Lean uses node sharing, which breaks the tree structure (a shared node has more than one parent node). More information For more information about the data collection process, detailed data (and data format) description, and baseline experiments that were already performed with these data, see our arXiv copy of the paper. For the code that was used to perform the experiments and data format description, visit our github repository https://github.com/ul-fmf/mlfmf-data. Funding Since not all the funders are available in the Zenodo's database, we list them here:

This material is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-21-1-0024. The authors also acknowledge the financial support of the Slovenian Research Agency via the research core funding No. P2-0103 and No. P1-0294.
Table_1_Neuroimaging data repositories and AI-driven healthcare—Global...
frontiersin.figshare.com
docx
Updated Feb 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Christine Lock; Nicole Si Min Tan; Ian James Long; Nicole C. Keong (2024). Table_1_Neuroimaging data repositories and AI-driven healthcare—Global aspirations vs. ethical considerations in machine learning models of neurological disease.DOCX [Dataset]. http://doi.org/10.3389/frai.2023.1286266.s001
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.3389/frai.2023.1286266.s001
Dataset updated
Feb 19, 2024
Dataset provided by
Frontiers Mediahttp://www.frontiersin.org/
Authors
Christine Lock; Nicole Si Min Tan; Ian James Long; Nicole C. Keong
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Neuroimaging data repositories are data-rich resources comprising brain imaging with clinical and biomarker data. The potential for such repositories to transform healthcare is tremendous, especially in their capacity to support machine learning (ML) and artificial intelligence (AI) tools. Current discussions about the generalizability of such tools in healthcare provoke concerns of risk of bias—ML models underperform in women and ethnic and racial minorities. The use of ML may exacerbate existing healthcare disparities or cause post-deployment harms. Do neuroimaging data repositories and their capacity to support ML/AI-driven clinical discoveries, have both the potential to accelerate innovative medicine and harden the gaps of social inequities in neuroscience-related healthcare? In this paper, we examined the ethical concerns of ML-driven modeling of global community neuroscience needs arising from the use of data amassed within neuroimaging data repositories. We explored this in two parts; firstly, in a theoretical experiment, we argued for a South East Asian-based repository to redress global imbalances. Within this context, we then considered the ethical framework toward the inclusion vs. exclusion of the migrant worker population, a group subject to healthcare inequities. Secondly, we created a model simulating the impact of global variations in the presentation of anosmia risks in COVID-19 toward altering brain structural findings; we then performed a mini AI ethics experiment. In this experiment, we interrogated an actual pilot dataset (n = 17; 8 non-anosmic (47%) vs. 9 anosmic (53%) using an ML clustering model. To create the COVID-19 simulation model, we bootstrapped to resample and amplify the dataset. This resulted in three hypothetical datasets: (i) matched (n = 68; 47% anosmic), (ii) predominant non-anosmic (n = 66; 73% disproportionate), and (iii) predominant anosmic (n = 66; 76% disproportionate). We found that the differing proportions of the same cohorts represented in each hypothetical dataset altered not only the relative importance of key features distinguishing between them but even the presence or absence of such features. The main objective of our mini experiment was to understand if ML/AI methodologies could be utilized toward modelling disproportionate datasets, in a manner we term “AI ethics.” Further work is required to expand the approach proposed here into a reproducible strategy.
UCI and OpenML Data Sets for Ordinal Quantification
zenodo.org
data.niaid.nih.gov
+1more
zip
Updated Jul 25, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mirko Bunse; Mirko Bunse; Alejandro Moreo; Alejandro Moreo; Fabrizio Sebastiani; Fabrizio Sebastiani; Martin Senz; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.8177302
Dataset updated
Jul 25, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Mirko Bunse; Mirko Bunse; Alejandro Moreo; Alejandro Moreo; Fabrizio Sebastiani; Fabrizio Sebastiani; Martin Senz; Martin Senz
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.

Usage

You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

Data Extraction: In your terminal, you can call either

make

(recommended), or

julia --project="." --eval "using Pkg; Pkg.instantiate()" julia --project="." extract-oq.jl

Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.

Further Reading

Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
heart-disease-data
kaggle.com
zip
Updated Aug 5, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nagaveda Reddy (2020). heart-disease-data [Dataset]. https://www.kaggle.com/nagavedareddy/heartdiseasedata
Explore at:
zip(3494 bytes)Available download formats
Dataset updated
Aug 5, 2020
Authors
Nagaveda Reddy
Description
Dataset

This dataset was created by Nagaveda Reddy

Contents
MIProblems: A repository of multiple instance learning datasets
figshare.com
zip
Updated Jun 21, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Veronika Cheplygina (2018). MIProblems: A repository of multiple instance learning datasets [Dataset]. http://doi.org/10.6084/m9.figshare.6633983.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.6633983.v1
Dataset updated
Jun 21, 2018
Dataset provided by
Figsharehttp://figshare.com/
Authors
Veronika Cheplygina
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains the multiple instance learning datasets previously stored at miproblems.org. As I am now longer maintaining the website, I moved the datasets to Figshare. A detailed description of the files is found in readme.pdf

If you use these datasets, please cite this Figshare resource rather than linking to miproblems.org, which will be offline soon.
r
Penn machine learning benchmark repository
rrid.site
Updated Oct 19, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). Penn machine learning benchmark repository [Dataset]. http://identifiers.org/RRID:SCR_017138
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_017138
Dataset updated
Oct 19, 2025
Description
Python wrapper for Penn Machine Learning Benchmark data repository. Large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms. Part of PyPI https://pypi.org/
n
Data from: Assessing predictive performance of supervised machine learning...
data.niaid.nih.gov
datasetcatalog.nlm.nih.gov
+1more
zip
Updated May 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Evans Omondi (2023). Assessing predictive performance of supervised machine learning algorithms for a diamond pricing model [Dataset]. http://doi.org/10.5061/dryad.wh70rxwrh
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.wh70rxwrh
Dataset updated
May 23, 2023
Dataset provided by
Strathmore University
Authors
Evans Omondi
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
The diamond is 58 times harder than any other mineral in the world, and its elegance as a jewel has long been appreciated. Forecasting diamond prices is challenging due to nonlinearity in important features such as carat, cut, clarity, table, and depth. Against this backdrop, the study conducted a comparative analysis of the performance of multiple supervised machine learning models (regressors and classifiers) in predicting diamond prices. Eight supervised machine learning algorithms were evaluated in this work including Multiple Linear Regression, Linear Discriminant Analysis, eXtreme Gradient Boosting, Random Forest, k-Nearest Neighbors, Support Vector Machines, Boosted Regression and Classification Trees, and Multi-Layer Perceptron. The analysis is based on data preprocessing, exploratory data analysis (EDA), training the aforementioned models, assessing their accuracy, and interpreting their results. Based on the performance metrics values and analysis, it was discovered that eXtreme Gradient Boosting was the most optimal algorithm in both classification and regression, with a R2 score of 97.45% and an Accuracy value of 74.28%. As a result, eXtreme Gradient Boosting was recommended as the optimal regressor and classifier for forecasting the price of a diamond specimen. Methods Kaggle, a data repository with thousands of datasets, was used in the investigation. It is an online community for machine learning practitioners and data scientists, as well as a robust, well-researched, and sufficient resource for analyzing various data sources. On Kaggle, users can search for and publish various datasets. In a web-based data-science environment, they can study datasets and construct models.

Breast Cancer Wisconsin (Prognostic) Data Set

kaggle.com

zip

Updated Mar 31, 2017

Facebook

Twitter

Click to copy link

Link copied

Cite

Sarah VCH (2017). Breast Cancer Wisconsin (Prognostic) Data Set [Dataset]. https://www.kaggle.com/sarahvch/breast-cancer-wisconsin-prognostic-data-set

Explore at:

zip(49800 bytes)Available download formats

Dataset updated

Mar 31, 2017

Authors

Sarah VCH

License

http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/

Description

Context

Data From: UCI Machine Learning Repository http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wpbc.names

Content

"Each record represents follow-up data for one breast cancer case. These are consecutive patients seen by Dr. Wolberg since 1984, and include only those cases exhibiting invasive breast cancer and no evidence of distant metastases at the time of diagnosis.

The first 30 features are computed from a digitized image of a
fine needle aspirate (FNA) of a breast mass. They describe
characteristics of the cell nuclei present in the image.
A few of the images can be found at
http://www.cs.wisc.edu/~street/images/

The separation described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree. Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.

The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].

The Recurrence Surface Approximation (RSA) method is a linear
programming model which predicts Time To Recur using both
recurrent and nonrecurrent cases. See references (i) and (ii)
above for details of the RSA method. 

This database is also available through the UW CS ftp server:

ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WPBC/

1) ID number 2) Outcome (R = recur, N = nonrecur) 3) Time (recurrence time if field 2 = R, disease-free time if field 2 = N) 4-33) Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry 
j) fractal dimension ("coastline approximation" - 1)"

Acknowledgements

Creators:

Dr. William H. Wolberg, General Surgery Dept., University of
Wisconsin, Clinical Sciences Center, Madison, WI 53792
wolberg@eagle.surgery.wisc.edu

W. Nick Street, Computer Sciences Dept., University of
Wisconsin, 1210 West Dayton St., Madison, WI 53706
street@cs.wisc.edu 608-262-6619

Olvi L. Mangasarian, Computer Sciences Dept., University of
Wisconsin, 1210 West Dayton St., Madison, WI 53706
olvi@cs.wisc.edu

Inspiration

I'm really interested in trying out various machine learning algorithms on some real life science data.

m
Crops Recommendation AI-GeoInfo Framework Datasets
data.mendeley.com
Updated Apr 8, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hisham AbouGrad (2025). Crops Recommendation AI-GeoInfo Framework Datasets [Dataset]. http://doi.org/10.17632/9pzzmhv7gp.1
Explore at:
Unique identifier
https://doi.org/10.17632/9pzzmhv7gp.1
Dataset updated
Apr 8, 2025
Authors
Hisham AbouGrad
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Research datasets about Crops Recommendation AI-GeoInfo Framework Datasets and the use of Supervised Machine Learning Algorithms and GeoAPI Module
i
ManifoldEM: ESPER Data and Code Repository
ieee-dataport.org
Updated May 8, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Evan Seitz (2022). ManifoldEM: ESPER Data and Code Repository [Dataset]. https://ieee-dataport.org/documents/manifoldem-esper-data-and-code-repository
Explore at:
Dataset updated
May 8, 2022
Authors
Evan Seitz
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
IEEE TCI). As inputs into ESPER
t
50 Classiﬁcation Data Sets - Dataset - LDM
service.tib.eu
Updated Jan 3, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). 50 Classiﬁcation Data Sets - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/50-classi-cation-data-sets
Explore at:
Dataset updated
Jan 3, 2025
Description
The dataset used in the paper is a collection of 50 classiﬁcation data sets downloaded from the UCI Machine Learning Repository.
Multiple Machine Learning Datasets
kaggle.com
zip
Updated Nov 12, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Eric Amoh Adjei (2024). Multiple Machine Learning Datasets [Dataset]. https://www.kaggle.com/datasets/ericamohadjei/trending-public-datasets
Explore at:
zip(15544969 bytes)Available download formats
Dataset updated
Nov 12, 2024
Authors
Eric Amoh Adjei
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Trending Public Datasets Overview

These Datasets contain a diverse collection of datasets intended for machine learning research and practice. Each dataset is curated to support different types of machine learning challenges, including classification, regression, and clustering. Below is a detailed list of the datasets available in this repository, along with descriptions and links to their sources.

Available Datasets

Iris Dataset

Description: This classic dataset includes measurements for 150 iris flowers from three different species. It includes four features: sepal length, sepal width, petal length, and petal width. Source: Iris Dataset Source Files: iris.csv

DHFR Dataset

Description: Contains data for 325 molecules with biological activity against the DHFR enzyme, relevant in anti-malarial drug research. It includes 228 molecular descriptors as features. Source: DHFR Dataset Source Files: dhfr.csv

Heart Disease Dataset (Cleveland)

Description: Comprises diagnostic measurements from 303 patients tested for heart disease at the Cleveland Clinic. It features 13 clinical attributes. Source: UCI Machine Learning Repository Files: heart-disease-cleveland.csv

HCV Data

Description: Detailed datasets related to Hepatitis C Virus (HCV) progression, with features for classification and regression tasks. Files: HCV_NS5B_Curated.csv, hcv_classification.csv, hcv_regression.arff

NBA Seasons Stats

Description: Player statistics from the NBA 2020 and 2021 seasons for detailed sports analytics. Files: NBA_2020.csv, NBA_2021.csv

Boston Housing Dataset

Description: Data concerning housing values in the suburbs of Boston, suitable for regression analysis. Files: BostonHousing.csv, BostonHousing_train.csv, BostonHousing_test.csv

Acetylcholinesterase Inhibitor Bioactivity

Description: Chemical bioactivity data against acetylcholinesterase, a target relevant to Alzheimer's research. It includes raw and processed formats with chemical fingerprints. Files: acetylcholinesterase_01_bioactivity_data_raw.csv to acetylcholinesterase_07_bioactivity_data_2class_pIC50_pubchem_fp.csv

California Housing Dataset

Description: Data aimed at predicting median house prices in California districts. Files: california_housing_train.csv, california_housing_test.csv

Virtual Reality Experiences Data

Description: Data from user experiences with various virtual reality setups to study user engagement and satisfaction. Files: Virtual Reality Experiences-data.csv

Fast-Food Chains in USA

Description: Overview of various fast-food chains operating in the USA, their locations, and popularity. Files: Fast-Food Chains in USA.csv

Contributing We welcome contributions to this dataset repository. If you have a dataset that you believe would be beneficial for the machine learning community, please see our contribution guidelines in CONTRIBUTING.md.

License This dataset is available under the MIT License.
D
Clinical Trial Data Repository Market Report | Global Forecast From 2025 To...
dataintelo.com
csv, pdf, pptx
Updated Sep 23, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dataintelo (2024). Clinical Trial Data Repository Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-clinical-trial-data-repository-market
Explore at:
pdf, csv, pptxAvailable download formats
Dataset updated
Sep 23, 2024
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Clinical Trial Data Repository Market Outlook

The global clinical trial data repository market size was estimated to be approximately $1.8 billion in 2023 and is projected to grow at a compound annual growth rate (CAGR) of 9.5% to reach around $4.1 billion by 2032. The primary growth factors include the increasing volume and complexity of clinical trials, rising need for efficient data management systems, and stringent regulatory requirements for data accuracy and integrity. The advent of advanced technologies such as artificial intelligence and big data analytics further drives market expansion by enhancing data processing capabilities and providing actionable insights.

The growth of the clinical trial data repository market is significantly influenced by the increasing number of clinical trials being conducted globally. With the rise in chronic diseases, the need for innovative treatments and therapies has surged, leading to an upsurge in clinical trials. This increase in clinical trials necessitates robust data management systems to handle vast amounts of data generated, thereby propelling the demand for clinical trial data repositories. Moreover, the complexity of modern clinical trials, which often involve multiple sites and diverse patient populations, further amplifies the need for sophisticated data management solutions.

Another critical driver for the market is the stringent regulatory landscape governing clinical trial data. Regulatory bodies such as the FDA, EMA, and other local authorities mandate rigorous data management standards to ensure data integrity, accuracy, and accessibility. These regulations necessitate the adoption of advanced data repository systems that can comply with regulatory requirements, thereby fueling market growth. Additionally, regulatory frameworks are becoming increasingly stringent, prompting pharmaceutical and biotechnology companies to invest in state-of-the-art data management systems to avoid compliance issues and potential financial penalties.

Technological advancements play a pivotal role in the market's growth. The integration of artificial intelligence, machine learning, and big data analytics into data repository systems enhances data processing and analysis capabilities. These technologies enable real-time data monitoring, predictive analytics, and improved decision-making, thereby improving the efficiency of clinical trials. Furthermore, the shift towards cloud-based solutions offers scalability, flexibility, and cost-effectiveness, making advanced data management systems accessible to even small and medium-sized enterprises.

Regionally, North America dominates the clinical trial data repository market owing to its robust healthcare infrastructure, high R&D investments, and presence of major pharmaceutical and biotechnology companies. Europe follows closely due to stringent regulatory standards and a strong focus on clinical research. The Asia Pacific region is expected to witness the highest growth rate during the forecast period due to increasing clinical trial activities, growing healthcare expenditure, and the rising adoption of advanced technologies. Latin America and the Middle East & Africa are also likely to experience growth, albeit at a slower pace, driven by improving healthcare systems and increasing focus on clinical research.

Component Analysis

The clinical trial data repository market is segmented by components into software and services. The software segment is anticipated to hold a significant share of the market due to the essential role software plays in data management. Advanced software solutions offer capabilities such as data storage, management, retrieval, and analysis, which are critical for effective clinical trial management. The integration of AI and machine learning algorithms into these software systems further enhances their efficiency by enabling predictive analytics and real-time monitoring, thus driving the software segment's growth.

Software solutions in clinical trial data repositories also offer interoperability, enabling seamless integration with other clinical trial management systems (CTMS) and electronic data capture (EDC) systems. This interoperability is crucial for ensuring data consistency and accuracy across different platforms, thereby enhancing overall data management. Additionally, the increasing adoption of cloud-based software solutions provides scalability, cost-effectiveness, and remote acce
A Repository of Big Data Sets for Computer Vision and Machine Learning...
zenodo.org
agdatacommons.nal.usda.gov
Updated Jan 10, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hugo F. M. Milan; Michael T. Gorczyca; Kristen M. Perano; Gustavo A. B. Moura; Patric A. Castro; Bharath Hariharan; Alex S. C. Maia; Kifle G. Gebremedhin; Hugo F. M. Milan; Michael T. Gorczyca; Kristen M. Perano; Gustavo A. B. Moura; Patric A. Castro; Bharath Hariharan; Alex S. C. Maia; Kifle G. Gebremedhin (2024). A Repository of Big Data Sets for Computer Vision and Machine Learning Applications in Precision Livestock Farming [Dataset]. http://doi.org/10.5281/zenodo.10463221
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.10463221
Dataset updated
Jan 10, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Hugo F. M. Milan; Michael T. Gorczyca; Kristen M. Perano; Gustavo A. B. Moura; Patric A. Castro; Bharath Hariharan; Alex S. C. Maia; Kifle G. Gebremedhin; Hugo F. M. Milan; Michael T. Gorczyca; Kristen M. Perano; Gustavo A. B. Moura; Patric A. Castro; Bharath Hariharan; Alex S. C. Maia; Kifle G. Gebremedhin
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 5, 2024
Description
A Precision Livestock Farming repository of Big Data sets for computer vision and machine learning applications (PLFBD).
n
A machine learning based prediction model for life expectancy
data.niaid.nih.gov
datasetcatalog.nlm.nih.gov
+1more
zip
Updated Nov 14, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Evans Omondi; Brian Lipesa; Elphas Okango; Bernard Omolo (2022). A machine learning based prediction model for life expectancy [Dataset]. http://doi.org/10.5061/dryad.z612jm6fv
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.z612jm6fv
Dataset updated
Nov 14, 2022
Dataset provided by
University of South Carolina Upstate
Strathmore University
Authors
Evans Omondi; Brian Lipesa; Elphas Okango; Bernard Omolo
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
The social and financial systems of many nations throughout the world are significantly impacted by life expectancy (LE) models. Numerous studies have pointed out the crucial effects that life expectancy projections will have on societal issues and the administration of the global healthcare system. The computation of life expectancy has primarily entailed building an ordinary life table. However, the life table is limited by its long duration, the assumption of homogeneity of cohorts and censoring. As a result, a robust and more accurate approach is inevitable. In this study, a supervised machine learning model for estimating life expectancy rates is developed. The model takes into consideration health, socioeconomic, and behavioral characteristics by using the eXtreme Gradient Boosting (XGBoost) algorithm to data from 193 UN member states. The effectiveness of the model's prediction is compared to that of the Random Forest (RF) and Artificial Neural Network (ANN) regressors utilized in earlier research. XGBoost attains an MAE and an RMSE of 1.554 and 2.402, respectively outperforming the RF and ANN models that achieved MAE and RMSE values of 7.938 and 11.304, and 3.86 and 5.002, respectively. The overall results of this study support XGBoost as a reliable and efficient model for estimating life expectancy. Methods Secondary data were used from which a sample of 2832 observations of 21 variables was sourced from the World Health Organization (WHO) and the United Nations (UN) databases. The data was on 193 UN member states from the year 2000–2015, with the LE health-related factors drawn from the Global Health Observatory data repository.
m
ANN+Jackknife dataset
data.mendeley.com
Updated Aug 12, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vinicius Rofatto (2021). ANN+Jackknife dataset [Dataset]. http://doi.org/10.17632/z3399p5h9f.2
Explore at:
Unique identifier
https://doi.org/10.17632/z3399p5h9f.2
Dataset updated
Aug 12, 2021
Authors
Vinicius Rofatto
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
dataset = first column: Number ID; second column: North UTM (m); third column: Este UTM (m); fourth column: pH; fifth column: Ca; sixth column: Mg; seventh column: P; and eighth column K.

jack_40.m: Example of the JACK-1T script for the case where the sample was randomly and uniformly reduced to 40%.

redSample.m: Script that randomly and uniformly reduces the sample size to a desired percentage.
Theory aware Machine Learning (TaML)
catalog.data.gov
nist.gov
Updated Jan 7, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institute of Standards and Technology (2023). Theory aware Machine Learning (TaML) [Dataset]. https://catalog.data.gov/dataset/theory-aware-machine-learning-taml
Explore at:
Dataset updated
Jan 7, 2023
Dataset provided by
National Institute of Standards and Technologyhttp://www.nist.gov/
Description
A code repository and accompanying data for incorporating imperfect theory into machine learning for improved prediction and explainability. Specifically, it focuses on the case study of the dimensions of a polymer chain in different solvent qualities. Jupyter Notebooks for quickly testing concepts and reproducing figures, as well as source code that computes the mean squared error as a function of dataset size for various machine learning models are included.For additional details on the data, please refer to the README.md associated with the data. For additional details on the code, please refer to the README.md provided with the code repository (GitHub Repo for Theory aware Machine Learning). For additional details on the methodology, see Debra J. Audus, Austin McDannald, and Brian DeCost, "Leveraging Theory for Enhanced Machine Learning" ACS Macro Letters 2022 11 (9), 1117-1122 DOI: 10.1021/acsmacrolett.2c00369.

Facebook

Twitter

Click to copy link

Link copied

Cite

UCI Machine Learning Repository [Dataset]. http://identifiers.org/RRID:SCR_026571

UCI Machine Learning Repository

RRID:SCR_026571, r3d100010960, UCI Machine Learning Repository (RRID:SCR_026571), UC Irvine Machine Learning Repository

Explore at:

Unique identifier

https://identifiers.org/RRID:SCR_026571

Description

Collection of databases, domain theories, and data generators that are used by machine learning community for empirical analysis of machine learning algorithms. Datasets approved to be in the repository will be assigned Digital Object Identifier (DOI) if they do not already possess one. Datasets will be licensed under a Creative Commons Attribution 4.0 International license (CC BY 4.0) which allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given

Clear search

Close search

Google apps

Main menu

UCI Machine Learning Repository

Cancer Multiple Dataset UCI MLR

GIS Resource Compilation Map Package - Applications of Machine Learning...

Data from: MLFMF: Data Sets for Machine Learning for Mathematical...

Table_1_Neuroimaging data repositories and AI-driven healthcare—Global...

UCI and OpenML Data Sets for Ordinal Quantification

heart-disease-data

Dataset

Contents

MIProblems: A repository of multiple instance learning datasets

Penn machine learning benchmark repository

Data from: Assessing predictive performance of supervised machine learning...

Breast Cancer Wisconsin (Prognostic) Data Set

Context

Content

Acknowledgements

Inspiration

Crops Recommendation AI-GeoInfo Framework Datasets

ManifoldEM: ESPER Data and Code Repository

50 Classiﬁcation Data Sets - Dataset - LDM

Multiple Machine Learning Datasets

Clinical Trial Data Repository Market Report | Global Forecast From 2025 To...

Clinical Trial Data Repository Market Outlook

Component Analysis

A Repository of Big Data Sets for Computer Vision and Machine Learning...

A machine learning based prediction model for life expectancy

ANN+Jackknife dataset

Theory aware Machine Learning (TaML)

UCI Machine Learning Repository

RRID:SCR_026571, r3d100010960, UCI Machine Learning Repository (RRID:SCR_026571), UC Irvine Machine Learning Repository