100+ datasets found
  1. d

    UCI Machine Learning Repository

    • dknet.org
    • rrid.site
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    UCI Machine Learning Repository [Dataset]. http://identifiers.org/RRID:SCR_026571
    Explore at:
    Description

    Collection of databases, domain theories, and data generators that are used by machine learning community for empirical analysis of machine learning algorithms. Datasets approved to be in the repository will be assigned Digital Object Identifier (DOI) if they do not already possess one. Datasets will be licensed under a Creative Commons Attribution 4.0 International license (CC BY 4.0) which allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given

  2. Cancer Multiple Dataset UCI MLR

    • kaggle.com
    zip
    Updated Aug 5, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Medi Hunter - 4004 (2025). Cancer Multiple Dataset UCI MLR [Dataset]. https://www.kaggle.com/datasets/shuvokumarbasakbd/cancer-multiple-dataset-uci-mlr/suggestions
    Explore at:
    zip(74213598 bytes)Available download formats
    Dataset updated
    Aug 5, 2025
    Authors
    Medi Hunter - 4004
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Source More Info : https://archive.ics.uci.edu/datasets

    The **UCI Machine Learning Repository **is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.

    The datasets collected in this project represent a diverse and comprehensive set of cancer-related data sourced from the UCI Machine Learning Repository. They cover a wide spectrum of cancer types and research perspectives, including breast cancer datasets such as the original, diagnostic, prognostic, and Coimbra variants, which focus on tumor features, recurrence, and biochemical markers. Cervical cancer is represented through datasets focusing on behavioral risks and general risk factors. The lung cancer dataset provides categorical diagnostic attributes, while the primary tumor dataset offers insights into tumor locations based on metastasis data. Additionally, specialized datasets like differentiated thyroid cancer recurrence, glioma grading with clinical and mutation features, and gene expression RNA-Seq data expand the scope into genetic and molecular-level cancer analysis. Together, these datasets support a wide range of machine learning applications including classification, prediction, survival analysis, and feature correlation across various types of cancer.

    RRA_Think Differently, Create history’s next line.

    Hello Data Hunters! Hope you're doing well. https://www.kaggle.com/shuvokumarbasak4004 (More Dataset) https://www.kaggle.com/shuvokumarbasak2030

  3. G

    GIS Resource Compilation Map Package - Applications of Machine Learning...

    • gdr.openei.org
    • data.openei.org
    • +3more
    Updated Jun 1, 2021
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Stephen Brown; Michael Fehler; Mark Coolbaugh; Sven Treitel; James Faulds; Bridget Ayling; Cary Lindsey; Rachel Micander; Eli Mlawsky; Connor Smith; John Queen; Chen Gu; John Akerley; Jacob DeAngelo; Jonathan Glen; Drew Siler; Erick Burns; Ian Warren; Stephen Brown; Michael Fehler; Mark Coolbaugh; Sven Treitel; James Faulds; Bridget Ayling; Cary Lindsey; Rachel Micander; Eli Mlawsky; Connor Smith; John Queen; Chen Gu; John Akerley; Jacob DeAngelo; Jonathan Glen; Drew Siler; Erick Burns; Ian Warren (2021). GIS Resource Compilation Map Package - Applications of Machine Learning Techniques to Geothermal Play Fairway Analysis in the Great Basin Region, Nevada [Dataset]. http://doi.org/10.15121/1897037
    Explore at:
    Dataset updated
    Jun 1, 2021
    Dataset provided by
    Geothermal Data Repository
    Nevada Bureau of Mines and Geology
    USDOE Office of Energy Efficiency and Renewable Energy (EERE), Renewable Power Office. Geothermal Technologies Program (EE-4G)
    Authors
    Stephen Brown; Michael Fehler; Mark Coolbaugh; Sven Treitel; James Faulds; Bridget Ayling; Cary Lindsey; Rachel Micander; Eli Mlawsky; Connor Smith; John Queen; Chen Gu; John Akerley; Jacob DeAngelo; Jonathan Glen; Drew Siler; Erick Burns; Ian Warren; Stephen Brown; Michael Fehler; Mark Coolbaugh; Sven Treitel; James Faulds; Bridget Ayling; Cary Lindsey; Rachel Micander; Eli Mlawsky; Connor Smith; John Queen; Chen Gu; John Akerley; Jacob DeAngelo; Jonathan Glen; Drew Siler; Erick Burns; Ian Warren
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Nevada, Great Basin
    Description

    This submission contains an ESRI map package (.mpk) with an embedded geodatabase for GIS resources used or derived in the Nevada Machine Learning project, meant to accompany the final report. The package includes layer descriptions, layer grouping, and symbology. Layer groups include: new/revised datasets (paleo-geothermal features, geochemistry, geophysics, heat flow, slip and dilation, potential structures, geothermal power plants, positive and negative test sites), machine learning model input grids, machine learning models (Artificial Neural Network (ANN), Extreme Learning Machine (ELM), Bayesian Neural Network (BNN), Principal Component Analysis (PCA/PCAk), Non-negative Matrix Factorization (NMF/NMFk) - supervised and unsupervised), original NV Play Fairway data and models, and NV cultural/reference data.

    See layer descriptions for additional metadata. Smaller GIS resource packages (by category) can be found in the related datasets section of this submission. A submission linking the full codebase for generating machine learning output models is available through the "Related Datasets" link on this page, and contains results beyond the top picks present in this compilation.

  4. Z

    Data from: MLFMF: Data Sets for Machine Learning for Mathematical...

    • data.niaid.nih.gov
    Updated Oct 26, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bauer, Andrej; Petković, Matej; Todorovski, Ljupčo (2023). MLFMF: Data Sets for Machine Learning for Mathematical Formalization [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10041074
    Explore at:
    Dataset updated
    Oct 26, 2023
    Dataset provided by
    University of Ljubljana
    Institute of Mathematics, Physics, and Mechanics
    Authors
    Bauer, Andrej; Petković, Matej; Todorovski, Ljupčo
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MLFMF MLFMF (Machine Learning for Mathematical Formalization) is a collection of data sets for benchmarking recommendation systems used to support formalization of mathematics with proof assistants. These systems help humans identify which previous entries (theorems, constructions, datatypes, and postulates) are relevant in proving a new theorem or carrying out a new construction. The MLFMF data sets provide solid benchmarking support for further investigation of the numerous machine learning approaches to formalized mathematics. With more than 250,000 entries in total, this is currently the largest collection of formalized mathematical knowledge in machine learnable format. In addition to benchmarking the recommendation systems, the data sets can also be used for benchmarking node classification and link prediction algorithms. The four data sets Each data set is derived from a library of formalized mathematics written in proof assistants Agda or Lean. The collection includes

    the largest Lean 4 library Mathlib, the three largest Agda libraries:

    the standard library the library of univalent mathematics Agda-unimath, and the TypeTopology library. Each data set represents the corresponding library in two ways: as a heterogeneous network, and as a list of syntax trees of all the entries in the library. The network contains the (modular) structure of the library and the references between entries, while the syntax trees give complete and easily parsed information about each entry. The Lean library data set was obtained by converting .olean files into s-expressions (see the lean2sexp tool). The Agda data sets were obtained with an s-expression extension of the official Agda repository (use either master-sexp or release-2.6.3-sexp branch). For more details, see our arXiv copy of the paper. Directory structure First, the mlfmf.zip archive needs to be unzipped. It contains a separate directory for every library (for example, the standard library of Agda can be found in the stdlib directory) and some auxiliary files. Every library directory contains

    the network file from which the heterogeneous network can be loaded, a zip of the entries directory that contains (many) files with abstract syntax trees. Each of those files describes a single entry of the library. In addition to the auxiliary files which are used for loading the data (and described below), the zipped sources of lean2sexp and Agda s-expression extension are present. Loading the data In addition to the data files, there is also a simple python script main.py for loading the data. To run it, you will have to install the packages listed in the file requirements.txt: tqdm and networkx. The easiest way to do so is calling pip install -r requirements.txt. When running main.py for the first time, the script will unzip the entry files into the directory named entries. After that, the script loads the syntax trees of the entries (see the Entry class) and the network (as networkx.MultiDiGraph object). Note. The entry files have extension .dag (directed acyclic graph), since Lean uses node sharing, which breaks the tree structure (a shared node has more than one parent node). More information For more information about the data collection process, detailed data (and data format) description, and baseline experiments that were already performed with these data, see our arXiv copy of the paper. For the code that was used to perform the experiments and data format description, visit our github repository https://github.com/ul-fmf/mlfmf-data. Funding Since not all the funders are available in the Zenodo's database, we list them here:

    This material is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-21-1-0024. The authors also acknowledge the financial support of the Slovenian Research Agency via the research core funding No. P2-0103 and No. P1-0294.

  5. Table_1_Neuroimaging data repositories and AI-driven healthcare—Global...

    • frontiersin.figshare.com
    docx
    Updated Feb 19, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Christine Lock; Nicole Si Min Tan; Ian James Long; Nicole C. Keong (2024). Table_1_Neuroimaging data repositories and AI-driven healthcare—Global aspirations vs. ethical considerations in machine learning models of neurological disease.DOCX [Dataset]. http://doi.org/10.3389/frai.2023.1286266.s001
    Explore at:
    docxAvailable download formats
    Dataset updated
    Feb 19, 2024
    Dataset provided by
    Frontiers Mediahttp://www.frontiersin.org/
    Authors
    Christine Lock; Nicole Si Min Tan; Ian James Long; Nicole C. Keong
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Neuroimaging data repositories are data-rich resources comprising brain imaging with clinical and biomarker data. The potential for such repositories to transform healthcare is tremendous, especially in their capacity to support machine learning (ML) and artificial intelligence (AI) tools. Current discussions about the generalizability of such tools in healthcare provoke concerns of risk of bias—ML models underperform in women and ethnic and racial minorities. The use of ML may exacerbate existing healthcare disparities or cause post-deployment harms. Do neuroimaging data repositories and their capacity to support ML/AI-driven clinical discoveries, have both the potential to accelerate innovative medicine and harden the gaps of social inequities in neuroscience-related healthcare? In this paper, we examined the ethical concerns of ML-driven modeling of global community neuroscience needs arising from the use of data amassed within neuroimaging data repositories. We explored this in two parts; firstly, in a theoretical experiment, we argued for a South East Asian-based repository to redress global imbalances. Within this context, we then considered the ethical framework toward the inclusion vs. exclusion of the migrant worker population, a group subject to healthcare inequities. Secondly, we created a model simulating the impact of global variations in the presentation of anosmia risks in COVID-19 toward altering brain structural findings; we then performed a mini AI ethics experiment. In this experiment, we interrogated an actual pilot dataset (n = 17; 8 non-anosmic (47%) vs. 9 anosmic (53%) using an ML clustering model. To create the COVID-19 simulation model, we bootstrapped to resample and amplify the dataset. This resulted in three hypothetical datasets: (i) matched (n = 68; 47% anosmic), (ii) predominant non-anosmic (n = 66; 73% disproportionate), and (iii) predominant anosmic (n = 66; 76% disproportionate). We found that the differing proportions of the same cohorts represented in each hypothetical dataset altered not only the relative importance of key features distinguishing between them but even the presence or absence of such features. The main objective of our mini experiment was to understand if ML/AI methodologies could be utilized toward modelling disproportionate datasets, in a manner we term “AI ethics.” Further work is required to expand the approach proposed here into a reproducible strategy.

  6. UCI and OpenML Data Sets for Ordinal Quantification

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Jul 25, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mirko Bunse; Mirko Bunse; Alejandro Moreo; Alejandro Moreo; Fabrizio Sebastiani; Fabrizio Sebastiani; Martin Senz; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Mirko Bunse; Mirko Bunse; Alejandro Moreo; Alejandro Moreo; Fabrizio Sebastiani; Fabrizio Sebastiani; Martin Senz; Martin Senz
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

    With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

    We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

    Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.

    Usage

    You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

    Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

    Data Extraction: In your terminal, you can call either

    make

    (recommended), or

    julia --project="." --eval "using Pkg; Pkg.instantiate()"
    julia --project="." extract-oq.jl

    Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.

    Further Reading

    Implementation of our experiments: https://github.com/mirkobunse/regularized-oq

  7. heart-disease-data

    • kaggle.com
    zip
    Updated Aug 5, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nagaveda Reddy (2020). heart-disease-data [Dataset]. https://www.kaggle.com/nagavedareddy/heartdiseasedata
    Explore at:
    zip(3494 bytes)Available download formats
    Dataset updated
    Aug 5, 2020
    Authors
    Nagaveda Reddy
    Description

    Dataset

    This dataset was created by Nagaveda Reddy

    Contents

  8. MIProblems: A repository of multiple instance learning datasets

    • figshare.com
    zip
    Updated Jun 21, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Veronika Cheplygina (2018). MIProblems: A repository of multiple instance learning datasets [Dataset]. http://doi.org/10.6084/m9.figshare.6633983.v1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 21, 2018
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Veronika Cheplygina
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the multiple instance learning datasets previously stored at miproblems.org. As I am now longer maintaining the website, I moved the datasets to Figshare. A detailed description of the files is found in readme.pdf

    If you use these datasets, please cite this Figshare resource rather than linking to miproblems.org, which will be offline soon.

  9. r

    Penn machine learning benchmark repository

    • rrid.site
    Updated Oct 19, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2025). Penn machine learning benchmark repository [Dataset]. http://identifiers.org/RRID:SCR_017138
    Explore at:
    Dataset updated
    Oct 19, 2025
    Description

    Python wrapper for Penn Machine Learning Benchmark data repository. Large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms. Part of PyPI https://pypi.org/

  10. n

    Data from: Assessing predictive performance of supervised machine learning...

    • data.niaid.nih.gov
    • datasetcatalog.nlm.nih.gov
    • +1more
    zip
    Updated May 23, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Evans Omondi (2023). Assessing predictive performance of supervised machine learning algorithms for a diamond pricing model [Dataset]. http://doi.org/10.5061/dryad.wh70rxwrh
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 23, 2023
    Dataset provided by
    Strathmore University
    Authors
    Evans Omondi
    License

    https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

    Description

    The diamond is 58 times harder than any other mineral in the world, and its elegance as a jewel has long been appreciated. Forecasting diamond prices is challenging due to nonlinearity in important features such as carat, cut, clarity, table, and depth. Against this backdrop, the study conducted a comparative analysis of the performance of multiple supervised machine learning models (regressors and classifiers) in predicting diamond prices. Eight supervised machine learning algorithms were evaluated in this work including Multiple Linear Regression, Linear Discriminant Analysis, eXtreme Gradient Boosting, Random Forest, k-Nearest Neighbors, Support Vector Machines, Boosted Regression and Classification Trees, and Multi-Layer Perceptron. The analysis is based on data preprocessing, exploratory data analysis (EDA), training the aforementioned models, assessing their accuracy, and interpreting their results. Based on the performance metrics values and analysis, it was discovered that eXtreme Gradient Boosting was the most optimal algorithm in both classification and regression, with a R2 score of 97.45% and an Accuracy value of 74.28%. As a result, eXtreme Gradient Boosting was recommended as the optimal regressor and classifier for forecasting the price of a diamond specimen. Methods Kaggle, a data repository with thousands of datasets, was used in the investigation. It is an online community for machine learning practitioners and data scientists, as well as a robust, well-researched, and sufficient resource for analyzing various data sources. On Kaggle, users can search for and publish various datasets. In a web-based data-science environment, they can study datasets and construct models.

  11. Breast Cancer Wisconsin (Prognostic) Data Set

    • kaggle.com
    zip
    Updated Mar 31, 2017
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sarah VCH (2017). Breast Cancer Wisconsin (Prognostic) Data Set [Dataset]. https://www.kaggle.com/sarahvch/breast-cancer-wisconsin-prognostic-data-set
    Explore at:
    zip(49800 bytes)Available download formats
    Dataset updated
    Mar 31, 2017
    Authors
    Sarah VCH
    License

    http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    Data From: UCI Machine Learning Repository http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wpbc.names

    Content

    "Each record represents follow-up data for one breast cancer case. These are consecutive patients seen by Dr. Wolberg since 1984, and include only those cases exhibiting invasive breast cancer and no evidence of distant metastases at the time of diagnosis.

    The first 30 features are computed from a digitized image of a
    fine needle aspirate (FNA) of a breast mass. They describe
    characteristics of the cell nuclei present in the image.
    A few of the images can be found at
    http://www.cs.wisc.edu/~street/images/
    
    The separation described above was obtained using
    Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
    Construction Via Linear Programming." Proceedings of the 4th
    Midwest Artificial Intelligence and Cognitive Science Society,
    pp. 97-101, 1992], a classification method which uses linear
    programming to construct a decision tree. Relevant features
    were selected using an exhaustive search in the space of 1-4
    features and 1-3 separating planes.
    
    The actual linear program used to obtain the separating plane
    in the 3-dimensional space is that described in:
    [K. P. Bennett and O. L. Mangasarian: "Robust Linear
    Programming Discrimination of Two Linearly Inseparable Sets",
    Optimization Methods and Software 1, 1992, 23-34].
    
    The Recurrence Surface Approximation (RSA) method is a linear
    programming model which predicts Time To Recur using both
    recurrent and nonrecurrent cases. See references (i) and (ii)
    above for details of the RSA method. 
    
    This database is also available through the UW CS ftp server:
    
    ftp ftp.cs.wisc.edu
    cd math-prog/cpo-dataset/machine-learn/WPBC/
    

    1) ID number 2) Outcome (R = recur, N = nonrecur) 3) Time (recurrence time if field 2 = R, disease-free time if field 2 = N) 4-33) Ten real-valued features are computed for each cell nucleus:

    a) radius (mean of distances from center to points on the perimeter)
    b) texture (standard deviation of gray-scale values)
    c) perimeter
    d) area
    e) smoothness (local variation in radius lengths)
    f) compactness (perimeter^2 / area - 1.0)
    g) concavity (severity of concave portions of the contour)
    h) concave points (number of concave portions of the contour)
    i) symmetry 
    j) fractal dimension ("coastline approximation" - 1)"
    

    Acknowledgements

    Creators:

    Dr. William H. Wolberg, General Surgery Dept., University of
    Wisconsin, Clinical Sciences Center, Madison, WI 53792
    wolberg@eagle.surgery.wisc.edu
    
    W. Nick Street, Computer Sciences Dept., University of
    Wisconsin, 1210 West Dayton St., Madison, WI 53706
    street@cs.wisc.edu 608-262-6619
    
    Olvi L. Mangasarian, Computer Sciences Dept., University of
    Wisconsin, 1210 West Dayton St., Madison, WI 53706
    olvi@cs.wisc.edu 
    

    Inspiration

    I'm really interested in trying out various machine learning algorithms on some real life science data.

  12. m

    Crops Recommendation AI-GeoInfo Framework Datasets

    • data.mendeley.com
    Updated Apr 8, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hisham AbouGrad (2025). Crops Recommendation AI-GeoInfo Framework Datasets [Dataset]. http://doi.org/10.17632/9pzzmhv7gp.1
    Explore at:
    Dataset updated
    Apr 8, 2025
    Authors
    Hisham AbouGrad
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Research datasets about Crops Recommendation AI-GeoInfo Framework Datasets and the use of Supervised Machine Learning Algorithms and GeoAPI Module

  13. i

    ManifoldEM: ESPER Data and Code Repository

    • ieee-dataport.org
    Updated May 8, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Evan Seitz (2022). ManifoldEM: ESPER Data and Code Repository [Dataset]. https://ieee-dataport.org/documents/manifoldem-esper-data-and-code-repository
    Explore at:
    Dataset updated
    May 8, 2022
    Authors
    Evan Seitz
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    IEEE TCI). As inputs into ESPER

  14. t

    50 Classification Data Sets - Dataset - LDM

    • service.tib.eu
    Updated Jan 3, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2025). 50 Classification Data Sets - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/50-classi-cation-data-sets
    Explore at:
    Dataset updated
    Jan 3, 2025
    Description

    The dataset used in the paper is a collection of 50 classification data sets downloaded from the UCI Machine Learning Repository.

  15. Multiple Machine Learning Datasets

    • kaggle.com
    zip
    Updated Nov 12, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Eric Amoh Adjei (2024). Multiple Machine Learning Datasets [Dataset]. https://www.kaggle.com/datasets/ericamohadjei/trending-public-datasets
    Explore at:
    zip(15544969 bytes)Available download formats
    Dataset updated
    Nov 12, 2024
    Authors
    Eric Amoh Adjei
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Trending Public Datasets Overview

    These Datasets contain a diverse collection of datasets intended for machine learning research and practice. Each dataset is curated to support different types of machine learning challenges, including classification, regression, and clustering. Below is a detailed list of the datasets available in this repository, along with descriptions and links to their sources.

    Available Datasets

    Iris Dataset

    Description: This classic dataset includes measurements for 150 iris flowers from three different species. It includes four features: sepal length, sepal width, petal length, and petal width. Source: Iris Dataset Source Files: iris.csv

    DHFR Dataset

    Description: Contains data for 325 molecules with biological activity against the DHFR enzyme, relevant in anti-malarial drug research. It includes 228 molecular descriptors as features. Source: DHFR Dataset Source Files: dhfr.csv

    Heart Disease Dataset (Cleveland)

    Description: Comprises diagnostic measurements from 303 patients tested for heart disease at the Cleveland Clinic. It features 13 clinical attributes. Source: UCI Machine Learning Repository Files: heart-disease-cleveland.csv

    HCV Data

    Description: Detailed datasets related to Hepatitis C Virus (HCV) progression, with features for classification and regression tasks. Files: HCV_NS5B_Curated.csv, hcv_classification.csv, hcv_regression.arff

    NBA Seasons Stats

    Description: Player statistics from the NBA 2020 and 2021 seasons for detailed sports analytics. Files: NBA_2020.csv, NBA_2021.csv

    Boston Housing Dataset

    Description: Data concerning housing values in the suburbs of Boston, suitable for regression analysis. Files: BostonHousing.csv, BostonHousing_train.csv, BostonHousing_test.csv

    Acetylcholinesterase Inhibitor Bioactivity

    Description: Chemical bioactivity data against acetylcholinesterase, a target relevant to Alzheimer's research. It includes raw and processed formats with chemical fingerprints. Files: acetylcholinesterase_01_bioactivity_data_raw.csv to acetylcholinesterase_07_bioactivity_data_2class_pIC50_pubchem_fp.csv

    California Housing Dataset

    Description: Data aimed at predicting median house prices in California districts. Files: california_housing_train.csv, california_housing_test.csv

    Virtual Reality Experiences Data

    Description: Data from user experiences with various virtual reality setups to study user engagement and satisfaction. Files: Virtual Reality Experiences-data.csv

    Fast-Food Chains in USA

    Description: Overview of various fast-food chains operating in the USA, their locations, and popularity. Files: Fast-Food Chains in USA.csv

    Contributing We welcome contributions to this dataset repository. If you have a dataset that you believe would be beneficial for the machine learning community, please see our contribution guidelines in CONTRIBUTING.md.

    License This dataset is available under the MIT License.

  16. D

    Clinical Trial Data Repository Market Report | Global Forecast From 2025 To...

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 23, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dataintelo (2024). Clinical Trial Data Repository Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-clinical-trial-data-repository-market
    Explore at:
    pdf, csv, pptxAvailable download formats
    Dataset updated
    Sep 23, 2024
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Clinical Trial Data Repository Market Outlook




    The global clinical trial data repository market size was estimated to be approximately $1.8 billion in 2023 and is projected to grow at a compound annual growth rate (CAGR) of 9.5% to reach around $4.1 billion by 2032. The primary growth factors include the increasing volume and complexity of clinical trials, rising need for efficient data management systems, and stringent regulatory requirements for data accuracy and integrity. The advent of advanced technologies such as artificial intelligence and big data analytics further drives market expansion by enhancing data processing capabilities and providing actionable insights.




    The growth of the clinical trial data repository market is significantly influenced by the increasing number of clinical trials being conducted globally. With the rise in chronic diseases, the need for innovative treatments and therapies has surged, leading to an upsurge in clinical trials. This increase in clinical trials necessitates robust data management systems to handle vast amounts of data generated, thereby propelling the demand for clinical trial data repositories. Moreover, the complexity of modern clinical trials, which often involve multiple sites and diverse patient populations, further amplifies the need for sophisticated data management solutions.




    Another critical driver for the market is the stringent regulatory landscape governing clinical trial data. Regulatory bodies such as the FDA, EMA, and other local authorities mandate rigorous data management standards to ensure data integrity, accuracy, and accessibility. These regulations necessitate the adoption of advanced data repository systems that can comply with regulatory requirements, thereby fueling market growth. Additionally, regulatory frameworks are becoming increasingly stringent, prompting pharmaceutical and biotechnology companies to invest in state-of-the-art data management systems to avoid compliance issues and potential financial penalties.




    Technological advancements play a pivotal role in the market's growth. The integration of artificial intelligence, machine learning, and big data analytics into data repository systems enhances data processing and analysis capabilities. These technologies enable real-time data monitoring, predictive analytics, and improved decision-making, thereby improving the efficiency of clinical trials. Furthermore, the shift towards cloud-based solutions offers scalability, flexibility, and cost-effectiveness, making advanced data management systems accessible to even small and medium-sized enterprises.




    Regionally, North America dominates the clinical trial data repository market owing to its robust healthcare infrastructure, high R&D investments, and presence of major pharmaceutical and biotechnology companies. Europe follows closely due to stringent regulatory standards and a strong focus on clinical research. The Asia Pacific region is expected to witness the highest growth rate during the forecast period due to increasing clinical trial activities, growing healthcare expenditure, and the rising adoption of advanced technologies. Latin America and the Middle East & Africa are also likely to experience growth, albeit at a slower pace, driven by improving healthcare systems and increasing focus on clinical research.



    Component Analysis




    The clinical trial data repository market is segmented by components into software and services. The software segment is anticipated to hold a significant share of the market due to the essential role software plays in data management. Advanced software solutions offer capabilities such as data storage, management, retrieval, and analysis, which are critical for effective clinical trial management. The integration of AI and machine learning algorithms into these software systems further enhances their efficiency by enabling predictive analytics and real-time monitoring, thus driving the software segment's growth.




    Software solutions in clinical trial data repositories also offer interoperability, enabling seamless integration with other clinical trial management systems (CTMS) and electronic data capture (EDC) systems. This interoperability is crucial for ensuring data consistency and accuracy across different platforms, thereby enhancing overall data management. Additionally, the increasing adoption of cloud-based software solutions provides scalability, cost-effectiveness, and remote acce

  17. A Repository of Big Data Sets for Computer Vision and Machine Learning...

    • zenodo.org
    • agdatacommons.nal.usda.gov
    Updated Jan 10, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hugo F. M. Milan; Michael T. Gorczyca; Kristen M. Perano; Gustavo A. B. Moura; Patric A. Castro; Bharath Hariharan; Alex S. C. Maia; Kifle G. Gebremedhin; Hugo F. M. Milan; Michael T. Gorczyca; Kristen M. Perano; Gustavo A. B. Moura; Patric A. Castro; Bharath Hariharan; Alex S. C. Maia; Kifle G. Gebremedhin (2024). A Repository of Big Data Sets for Computer Vision and Machine Learning Applications in Precision Livestock Farming [Dataset]. http://doi.org/10.5281/zenodo.10463221
    Explore at:
    Dataset updated
    Jan 10, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Hugo F. M. Milan; Michael T. Gorczyca; Kristen M. Perano; Gustavo A. B. Moura; Patric A. Castro; Bharath Hariharan; Alex S. C. Maia; Kifle G. Gebremedhin; Hugo F. M. Milan; Michael T. Gorczyca; Kristen M. Perano; Gustavo A. B. Moura; Patric A. Castro; Bharath Hariharan; Alex S. C. Maia; Kifle G. Gebremedhin
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 5, 2024
    Description

    A Precision Livestock Farming repository of Big Data sets for computer vision and machine learning applications (PLFBD).

  18. n

    A machine learning based prediction model for life expectancy

    • data.niaid.nih.gov
    • datasetcatalog.nlm.nih.gov
    • +1more
    zip
    Updated Nov 14, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Evans Omondi; Brian Lipesa; Elphas Okango; Bernard Omolo (2022). A machine learning based prediction model for life expectancy [Dataset]. http://doi.org/10.5061/dryad.z612jm6fv
    Explore at:
    zipAvailable download formats
    Dataset updated
    Nov 14, 2022
    Dataset provided by
    University of South Carolina Upstate
    Strathmore University
    Authors
    Evans Omondi; Brian Lipesa; Elphas Okango; Bernard Omolo
    License

    https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

    Description

    The social and financial systems of many nations throughout the world are significantly impacted by life expectancy (LE) models. Numerous studies have pointed out the crucial effects that life expectancy projections will have on societal issues and the administration of the global healthcare system. The computation of life expectancy has primarily entailed building an ordinary life table. However, the life table is limited by its long duration, the assumption of homogeneity of cohorts and censoring. As a result, a robust and more accurate approach is inevitable. In this study, a supervised machine learning model for estimating life expectancy rates is developed. The model takes into consideration health, socioeconomic, and behavioral characteristics by using the eXtreme Gradient Boosting (XGBoost) algorithm to data from 193 UN member states. The effectiveness of the model's prediction is compared to that of the Random Forest (RF) and Artificial Neural Network (ANN) regressors utilized in earlier research. XGBoost attains an MAE and an RMSE of 1.554 and 2.402, respectively outperforming the RF and ANN models that achieved MAE and RMSE values of 7.938 and 11.304, and 3.86 and 5.002, respectively. The overall results of this study support XGBoost as a reliable and efficient model for estimating life expectancy. Methods Secondary data were used from which a sample of 2832 observations of 21 variables was sourced from the World Health Organization (WHO) and the United Nations (UN) databases. The data was on 193 UN member states from the year 2000–2015, with the LE health-related factors drawn from the Global Health Observatory data repository.

  19. m

    ANN+Jackknife dataset

    • data.mendeley.com
    Updated Aug 12, 2021
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Vinicius Rofatto (2021). ANN+Jackknife dataset [Dataset]. http://doi.org/10.17632/z3399p5h9f.2
    Explore at:
    Dataset updated
    Aug 12, 2021
    Authors
    Vinicius Rofatto
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    dataset = first column: Number ID; second column: North UTM (m); third column: Este UTM (m); fourth column: pH; fifth column: Ca; sixth column: Mg; seventh column: P; and eighth column K.

    jack_40.m: Example of the JACK-1T script for the case where the sample was randomly and uniformly reduced to 40%.

    redSample.m: Script that randomly and uniformly reduces the sample size to a desired percentage.

  20. Theory aware Machine Learning (TaML)

    • catalog.data.gov
    • nist.gov
    Updated Jan 7, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    National Institute of Standards and Technology (2023). Theory aware Machine Learning (TaML) [Dataset]. https://catalog.data.gov/dataset/theory-aware-machine-learning-taml
    Explore at:
    Dataset updated
    Jan 7, 2023
    Dataset provided by
    National Institute of Standards and Technologyhttp://www.nist.gov/
    Description

    A code repository and accompanying data for incorporating imperfect theory into machine learning for improved prediction and explainability. Specifically, it focuses on the case study of the dimensions of a polymer chain in different solvent qualities. Jupyter Notebooks for quickly testing concepts and reproducing figures, as well as source code that computes the mean squared error as a function of dataset size for various machine learning models are included.For additional details on the data, please refer to the README.md associated with the data. For additional details on the code, please refer to the README.md provided with the code repository (GitHub Repo for Theory aware Machine Learning). For additional details on the methodology, see Debra J. Audus, Austin McDannald, and Brian DeCost, "Leveraging Theory for Enhanced Machine Learning" ACS Macro Letters 2022 11 (9), 1117-1122 DOI: 10.1021/acsmacrolett.2c00369.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
UCI Machine Learning Repository [Dataset]. http://identifiers.org/RRID:SCR_026571

UCI Machine Learning Repository

RRID:SCR_026571, r3d100010960, UCI Machine Learning Repository (RRID:SCR_026571), UC Irvine Machine Learning Repository

Explore at:
Description

Collection of databases, domain theories, and data generators that are used by machine learning community for empirical analysis of machine learning algorithms. Datasets approved to be in the repository will be assigned Digital Object Identifier (DOI) if they do not already possess one. Datasets will be licensed under a Creative Commons Attribution 4.0 International license (CC BY 4.0) which allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given

Search
Clear search
Close search
Google apps
Main menu