29 datasets found
  1. Data from: Online Retail Data Set

    • kaggle.com
    zip
    Updated Sep 11, 2021
    Cite
    Sahar Pourahmad (2021). Online Retail Data Set [Dataset]. https://www.kaggle.com/datasets/saharpourahmad/online-retail-data-set/code
    Explore at:
    Available download formats: zip (22875837 bytes)
    Dataset updated
    Sep 11, 2021
    Authors
    Sahar Pourahmad
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Sahar Pourahmad

    Released under CC0: Public Domain

    Contents

  2. UCI datasets

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Apr 4, 2023
    Cite
    Mathias Drton; Stephan Haug; David Reifferscheidt; Oleksandr Zadorozhnyi (2023). UCI datasets [Dataset]. http://doi.org/10.5281/zenodo.7681792
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 4, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mathias Drton; Stephan Haug; David Reifferscheidt; Oleksandr Zadorozhnyi
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Collection of two datasets from the UCI website that could be used for structure learning tasks. Includes datasets regarding

    • Air Quality
    • US census 1990

    Size: Two datasets of sizes 9471*17 and 2458285*68, respectively

    Number of features: 15-68

    Ground truth: No

    Type of Graph: No ground truth

    More information about the datasets is contained in the dataset_description.html files.

  3. uci.edu Traffic Analytics Data

    • analytics.explodingtopics.com
    Updated Sep 1, 2025
    + more versions
    Cite
    (2025). uci.edu Traffic Analytics Data [Dataset]. https://analytics.explodingtopics.com/website/uci.edu
    Explore at:
    Dataset updated
    Sep 1, 2025
    Variables measured
    Global Rank, Monthly Visits, Authority Score, US Country Rank, Education Category Rank
    Description

    Traffic analytics, rankings, and competitive metrics for uci.edu as of September 2025

  4. Air Quality Data Set

    • kaggle.com
    zip
    Updated Sep 11, 2021
    Cite
    Sahar Pourahmad (2021). Air Quality Data Set [Dataset]. https://www.kaggle.com/saharpourahmad/air-quality-data-set-from-uci-website
    Explore at:
    Available download formats: zip (253800 bytes)
    Dataset updated
    Sep 11, 2021
    Authors
    Sahar Pourahmad
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset contains the responses of a gas multisensor device deployed in the field in an Italian city. Hourly response averages are recorded along with gas concentration references from a certified analyzer. The goal here is to build a model that predicts the relative humidity (RH) from the given air quality factors. This is a typical regression problem that can be approached in many ways.

    Data Set Information:

    The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. The device was located in the field in a significantly polluted area, at road level, within an Italian city. Data were recorded from March 2004 to February 2005 (one year), representing the longest freely available recordings of field-deployed air quality chemical sensor device responses. Ground-truth hourly averaged concentrations for CO, Non-Methane Hydrocarbons, Benzene, Total Nitrogen Oxides (NOx), and Nitrogen Dioxide (NO2) were provided by a co-located certified reference analyzer. Evidence of cross-sensitivities as well as both concept and sensor drift is present, as described in De Vito et al., Sens. and Act. B, Vol. 129, 2, 2008 (citation required), eventually affecting the sensors' concentration estimation capabilities. Missing values are tagged with the value -200. This dataset can be used exclusively for research purposes. Commercial purposes are fully excluded.

    Content

    0 Date (DD/MM/YYYY)
    1 Time (HH.MM.SS)
    2 True hourly averaged concentration CO in mg/m^3 (reference analyzer)
    3 PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
    4 True hourly averaged overall Non-Methane Hydrocarbons concentration in microg/m^3 (reference analyzer)
    5 True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)
    6 PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
    7 True hourly averaged NOx concentration in ppb (reference analyzer)
    8 PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
    9 True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
    10 PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
    11 PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
    12 Temperature in °C
    13 Relative Humidity (%)
    14 AH Absolute Humidity
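    Because missing values are tagged with -200, they need to be converted before any averaging or model fitting. The following is a minimal sketch of that step using only the Python standard library. The semicolon delimiter, the comma decimal separator, the column names, and the sample rows are assumptions made for illustration; check them against the actual file before relying on this.

```python
import csv
import io

MISSING_SENTINEL = -200.0  # the description says missing values are tagged with -200


def read_air_quality(text, delimiter=";", decimal=","):
    """Parse CSV text; numeric cells become floats and -200 becomes None.

    The delimiter and decimal separator defaults are assumptions about the
    file layout, not something stated in the dataset description.
    """
    rows = []
    for record in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        cleaned = {}
        for name, raw in record.items():
            if raw is None:  # short row: treat the absent cell as missing
                cleaned[name] = None
                continue
            try:
                value = float(raw.replace(decimal, "."))
            except ValueError:
                cleaned[name] = raw  # non-numeric cell such as Date or Time
                continue
            cleaned[name] = None if value == MISSING_SENTINEL else value
        rows.append(cleaned)
    return rows


# Made-up sample rows in the assumed layout (not real measurements).
SAMPLE = (
    "Date;Time;CO(GT);RH\n"
    "10/03/2004;18.00.00;2,6;48,9\n"
    "10/03/2004;19.00.00;-200;47,7\n"
)
rows = read_air_quality(SAMPLE)
```

    Keeping missing readings as None, rather than leaving -200 in place, prevents the sentinel from silently dragging down means or regression targets such as RH.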

    Acknowledgements

    Source: Saverio De Vito (saverio.devito '@' enea.it), ENEA - National Agency for New Technologies, Energy and Sustainable Economic Development https://archive.ics.uci.edu/ml/datasets/Air+Quality

  5. Fertility Data Set

    • kaggle.com
    zip
    Updated Sep 11, 2021
    + more versions
    Cite
    Sahar Pourahmad (2021). Fertility Data Set [Dataset]. https://www.kaggle.com/saharpourahmad/fertility-data-set
    Explore at:
    Available download formats: zip (786 bytes)
    Dataset updated
    Sep 11, 2021
    Authors
    Sahar Pourahmad
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    100 volunteers provided a semen sample analyzed according to the WHO 2010 criteria. Sperm concentration is related to socio-demographic data, environmental factors, health status, and life habits.

    Content

    Attribute Information:

    Season in which the analysis was performed. 1) winter, 2) spring, 3) Summer, 4) fall. (-1, -0.33, 0.33, 1)

    Age at the time of analysis. 18-36 (0, 1)

    Childhood diseases (i.e., chicken pox, measles, mumps, polio) 1) yes, 2) no. (0, 1)

    Accident or serious trauma 1) yes, 2) no. (0, 1)

    Surgical intervention 1) yes, 2) no. (0, 1)

    High fevers in the last year 1) less than three months ago, 2) more than three months ago, 3) no. (-1, 0, 1)

    Frequency of alcohol consumption 1) several times a day, 2) every day, 3) several times a week, 4) once a week, 5) hardly ever or never (0, 1)

    Smoking habit 1) never, 2) occasional 3) daily. (-1, 0, 1)

    Number of hours spent sitting per day 1-16 (0, 1)

    Output: Diagnosis normal (N), altered (O)
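    The attributes above are stored as scaled codes rather than raw values. As a hedged sketch, the snippet below maps the season codes back to labels in the order listed and rescales age under the assumption that (0, 1) is a linear min-max scaling of the 18-36 range. That linear scaling is an assumption; the description does not state the exact transformation.

```python
# Season codes in the order given in the attribute list above.
SEASON_CODES = {-1.0: "winter", -0.33: "spring", 0.33: "summer", 1.0: "fall"}

# Age range from the attribute list; linear scaling to (0, 1) is assumed.
AGE_MIN, AGE_MAX = 18, 36


def decode_season(code):
    """Return the season label whose code is closest to the stored value."""
    nearest = min(SEASON_CODES, key=lambda known: abs(known - code))
    return SEASON_CODES[nearest]


def decode_age(scaled):
    """Invert an assumed min-max scaling of age onto the interval (0, 1)."""
    return AGE_MIN + scaled * (AGE_MAX - AGE_MIN)
```

    Matching on the nearest code rather than exact equality tolerates rounding differences such as -0.33 versus -0.333.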

    Acknowledgements

    Source:

    David Gil, dgil '@' dtic.ua.es, Lucentia Research Group, Department of Computer Technology, University of Alicante

    Jose Luis Girela, girela '@' ua.es, Department of Biotechnology, University of Alicante

    Relevant Papers:

    David Gil, Jose Luis Girela, Joaquin De Juan, M. Jose Gomez-Torres, and Magnus Johnsson. Predicting seminal quality with artificial intelligence methods. Expert Systems with Applications, 39(16):12564 – 12573, 2012

    Citation Request:

    David Gil, Jose Luis Girela, Joaquin De Juan, M. Jose Gomez-Torres, and Magnus Johnsson. Predicting seminal quality with artificial intelligence methods. Expert Systems with Applications, 39(16):12564 – 12573, 2012

  6. uci.org Traffic Analytics Data

    • analytics.explodingtopics.com
    Updated Sep 1, 2025
    + more versions
    Cite
    (2025). uci.org Traffic Analytics Data [Dataset]. https://analytics.explodingtopics.com/website/uci.org
    Explore at:
    Dataset updated
    Sep 1, 2025
    Variables measured
    Global Rank, Monthly Visits, Authority Score, US Country Rank, Sports Category Rank
    Description

    Traffic analytics, rankings, and competitive metrics for uci.org as of September 2025

  7. UCI Communities and Crime Unnormalized Data Set

    • kaggle.com
    Updated Feb 21, 2018
    Cite
    Kavitha (2018). UCI Communities and Crime Unnormalized Data Set [Dataset]. https://www.kaggle.com/kkanda/communities%20and%20crime%20unnormalized%20data%20set/code
    Explore at:
    Dataset updated
    Feb 21, 2018
    Authors
    Kavitha
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    Introduction: The dataset used for this experiment is real and authentic. It was acquired from the UCI Machine Learning Repository website [13]. The title of the dataset is 'Crime and Communities'. It combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR [13]. This dataset contains a total of 147 attributes and 2216 instances.

    The per capita crimes variables were calculated using population values included in the 1995 FBI data (which differ from the 1990 Census values).

    Content

    The variables included in the dataset describe the community, such as the percent of the population considered urban and the median family income, and law enforcement, such as the per capita number of police officers and the percent of officers assigned to drug units. The crime attributes (N=18) that could be predicted are the 8 crimes considered 'Index Crimes' by the FBI (Murders, Rape, Robbery, ....), per capita (actually per 100,000 population) versions of each, and Per Capita Violent Crimes and Per Capita Nonviolent Crimes.

    predictive variables: 125
    non-predictive variables: 4
    potential goal/response variables: 18
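    The "per capita" crime attributes are described as rates per 100,000 population. A minimal sketch of that arithmetic follows; the function name and the example figures are hypothetical and are not taken from the dataset.

```python
def rate_per_100k(count, population):
    """Convert a raw crime count into a rate per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply before dividing so whole-number results stay exact.
    return count * 100_000 / population


# Hypothetical community: 50 incidents among 25,000 residents.
example_rate = rate_per_100k(50, 25_000)
```

    Note that the description says these rates were computed with the population values from the 1995 FBI data, which differ from the 1990 Census values, so the same count can yield a different rate depending on which population column is used.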

    Acknowledgements

    http://archive.ics.uci.edu/ml/datasets/Communities%20and%20Crime%20Unnormalized

    U. S. Department of Commerce, Bureau of the Census, Census Of Population And Housing 1990 United States: Summary Tape File 1a & 3a (Computer Files),

    U.S. Department Of Commerce, Bureau Of The Census Producer, Washington, DC and Inter-university Consortium for Political and Social Research Ann Arbor, Michigan. (1992)

    U.S. Department of Justice, Bureau of Justice Statistics, Law Enforcement Management And Administrative Statistics (Computer File) U.S. Department Of Commerce, Bureau Of The Census Producer, Washington, DC and Inter-university Consortium for Political and Social Research Ann Arbor, Michigan. (1992)

    U.S. Department of Justice, Federal Bureau of Investigation, Crime in the United States (Computer File) (1995)

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

    Data available in the dataset may not act as a complete source of information for identifying factors that contribute to more violent and non-violent crimes as many relevant factors may still be missing.

    However, I would like to try to answer the following questions.

    1. Analyze if number of vacant and occupied houses and the period of time the houses were vacant had contributed to any significant change in violent and non-violent crime rates in communities

    2. How has unemployment changed the crime rate (violent and non-violent) in the communities?

    3. Were people from a particular age group more vulnerable to crime?

    4. Does ethnicity play a role in crime rate?

    5. Has education played a role in bringing down the crime rate?

  8. phishing.arff

    • figshare.com
    txt
    Updated Jul 10, 2024
    Cite
    Ambroise Odonnat (2024). phishing.arff [Dataset]. http://doi.org/10.6084/m9.figshare.26232710.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jul 10, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ambroise Odonnat
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This folder contains the data from the Phishing Websites dataset provided in [1]. All features are categorical and were preprocessed into integer values. The data can be downloaded from https://archive.ics.uci.edu/dataset/327/phishing+websites. There are 11055 samples with 30 features. Websites belong to 2 domains: websites that use an IP address instead of the domain name in the URL, and websites that use the domain name in the URL. For reference, please see: [1] R. Mohammad, F. Thabtah, L. Mccluskey. An assessment of features related to phishing websites using an automated technique. In International Conference for Internet Technology and Secured Transactions, 2012.
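    Since the file is distributed in ARFF format with integer-coded categorical features, a small reader is enough to get started. The sketch below handles only a dense ARFF file whose data rows are comma-separated integers; it is not a full ARFF parser. The attribute names and rows in the sample are made up for illustration, and which column encodes the IP-address domain is an assumption to verify against the real file.

```python
def parse_arff(text):
    """Read attribute names and integer data rows from dense ARFF text."""
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        if in_data:
            rows.append([int(value) for value in line.split(",")])
        elif line.lower().startswith("@attribute"):
            attributes.append(line.split()[1])  # second token is the name
        elif line.lower().startswith("@data"):
            in_data = True
    return attributes, rows


def split_by_feature(attributes, rows, name):
    """Group rows by the value of one named feature (e.g. to form domains)."""
    index = attributes.index(name)
    groups = {}
    for row in rows:
        groups.setdefault(row[index], []).append(row)
    return groups


# Made-up sample; attribute names are hypothetical placeholders.
SAMPLE = """@relation phishing_sample
@attribute uses_ip_address {-1,1}
@attribute url_length {1,0,-1}
@attribute result {-1,1}
@data
-1,1,-1
1,0,1
1,-1,-1
"""
attributes, rows = parse_arff(SAMPLE)
domains = split_by_feature(attributes, rows, "uses_ip_address")
```

    Grouping on a single feature mirrors the two-domain split described above, with one group per value of the chosen column.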

  9. UCI E3SM1.0 model output prepared for CMIP6 PAMIP pdSST-pdSICSIT

    • wdc-climate.de
    Updated 2021
    + more versions
    Cite
    University of California Irvine (UCI) (2021). UCI E3SM1.0 model output prepared for CMIP6 PAMIP pdSST-pdSICSIT [Dataset]. http://doi.org/10.22033/ESGF/CMIP6.16550
    Explore at:
    Dataset updated
    2021
    Dataset provided by
    Earth System Grid
    World Data Center for Climate (WDCC) at DKRZ
    Authors
    University of California Irvine (UCI)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Coupled Model Intercomparison Project Phase 6 (CMIP6) datasets. These data include all datasets published for 'CMIP6.PAMIP.UCI.E3SM-1-0.pdSST-pdSICSIT' with the full Data Reference Syntax following the template 'mip_era.activity_id.institution_id.source_id.experiment_id.member_id.table_id.variable_id.grid_label.version'.
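    The Data Reference Syntax template lists the dot-separated components of a dataset identifier in order. As an illustrative sketch, the function below pairs each component of an identifier with its template field; it assumes that component values never contain a dot. The function name is hypothetical.

```python
DRS_TEMPLATE = (
    "mip_era.activity_id.institution_id.source_id.experiment_id."
    "member_id.table_id.variable_id.grid_label.version"
)


def parse_drs(dataset_id, template=DRS_TEMPLATE):
    """Map the leading components of a DRS identifier to template fields."""
    fields = template.split(".")
    parts = dataset_id.split(".")
    if len(parts) > len(fields):
        raise ValueError("identifier has more components than the template")
    # zip stops at the shorter sequence, so a partial identifier is allowed.
    return dict(zip(fields, parts))


parsed = parse_drs("CMIP6.PAMIP.UCI.E3SM-1-0.pdSST-pdSICSIT")
```

    The identifier quoted in the description stops at the experiment level, so the resulting dictionary holds only the first five fields; member, table, variable, grid, and version are left out rather than guessed.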

    The E3SM 1.0 (Energy Exascale Earth System Model) climate model, released in 2018, includes the following components: aerosol: MAM4 with resuspension, marine organics, and secondary organics (same grid as atmos), atmos: EAM (v1.0, cubed sphere spectral-element grid; 5400 elements with p=3; 1 deg average grid spacing; 90 x 90 x 6 longitude/latitude/cubeface; 72 levels; top level 0.1 hPa), atmosChem: Troposphere specified oxidants for aerosols. Stratosphere linearized interactive ozone (LINOZ v2) (same grid as atmos), land: ELM (v1.0, cubed sphere spectral-element grid; 5400 elements with p=3; 1 deg average grid spacing; 90 x 90 x 6 longitude/latitude/cubeface; satellite phenology mode), MOSART (v1.0, 0.5 degree latitude/longitude grid), ocean: MPAS-Ocean (v6.0, oEC60to30 unstructured SVTs mesh with 235160 cells and 714274 edges, variable resolution 60 km to 30 km; 60 levels; top grid cell 0-10 m), seaIce: MPAS-Seaice (v6.0, same grid as ocean). The model was run by the Department of Earth System Science, University of California Irvine, Irvine, CA 92697, USA (UCI) in native nominal resolutions: aerosol: 100 km, atmos: 100 km, atmosChem: 100 km, land: 100 km, ocean: 50 km, seaIce: 50 km.

    Project: These data have been generated as part of the internationally-coordinated Coupled Model Intercomparison Project Phase 6 (CMIP6; see also GMD Special Issue: http://www.geosci-model-dev.net/special_issue590.html). The simulation data provides a basis for climate research designed to answer fundamental science questions and serves as resource for authors of the Sixth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC-AR6).

    CMIP6 is a project coordinated by the Working Group on Coupled Modelling (WGCM) as part of the World Climate Research Programme (WCRP). Phase 6 builds on previous phases executed under the leadership of the Program for Climate Model Diagnosis and Intercomparison (PCMDI) and relies on the Earth System Grid Federation (ESGF) and the Centre for Environmental Data Analysis (CEDA) along with numerous related activities for implementation. The original data is hosted and partially replicated on a federated collection of data nodes, and most of the data relied on by the IPCC is being archived for long-term preservation at the IPCC Data Distribution Centre (IPCC DDC) hosted by the German Climate Computing Center (DKRZ).

    The project includes simulations from about 120 global climate models and around 45 institutions and organizations worldwide. - Project website: https://pcmdi.llnl.gov/CMIP6.

  10. 🚴‍♀️ UCI Race Results

    • kaggle.com
    zip
    Updated Oct 16, 2024
    Cite
    mexwell (2024). 🚴‍♀️ UCI Race Results [Dataset]. https://www.kaggle.com/datasets/mexwell/uci-race-results/code
    Explore at:
    Available download formats: zip (61082159 bytes)
    Dataset updated
    Oct 16, 2024
    Authors
    mexwell
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data is gathered from publicly available race results of the UCI website. It includes race results in all cycling disciplines (road, track, mountain bike, bmx, cyclo cross), classifications, data on athletes (such as name, age, country of affiliation etc.). The data is stored in a csv file.

    Original Data

    Citation

    Korf, Jesse (2023), “UCI race results 2010-2020”, Mendeley Data, V2, doi: 10.17632/mkzht8948x.2

    Acknowledgement

    Photo by Mikkel Bech on Unsplash

  11. Replication Data for: The Gender Readings Gap in Political Science Graduate...

    • dataverse.harvard.edu
    Updated Oct 15, 2018
    Cite
    Heidi Hardt; Amy Erica Smith; Hannah June Kim; Philippe Meister (2018). Replication Data for: The Gender Readings Gap in Political Science Graduate Training [Dataset]. http://doi.org/10.7910/DVN/UNWIHE
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 15, 2018
    Dataset provided by
    Harvard Dataverse
    Authors
    Heidi Hardt; Amy Erica Smith; Hannah June Kim; Philippe Meister
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This is the replication Stata code and log file for the Journal of Politics research note, "The Gender Readings Gap in Political Science Graduate Training," by Heidi Hardt, Amy Erica Smith, Hannah June Kim and Philippe Meister. For our searchable database, see our website here: http://gradtraining.socsci.uci.edu/

  12. UCI CESM1-WACCM-SC model output prepared for CMIP6 PAMIP

    • wdc-climate.de
    Updated 2020
    + more versions
    Cite
    Peings, Yannick (2020). UCI CESM1-WACCM-SC model output prepared for CMIP6 PAMIP [Dataset]. http://doi.org/10.22033/ESGF/CMIP6.12281
    Explore at:
    Dataset updated
    2020
    Dataset provided by
    Earth System Grid
    World Data Center for Climate (WDCC) at DKRZ
    Authors
    Peings, Yannick
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Coupled Model Intercomparison Project Phase 6 (CMIP6) datasets. These data include all datasets published for 'CMIP6.PAMIP.NCAR.CESM1-WACCM-SC' with the full Data Reference Syntax following the template 'mip_era.activity_id.institution_id.source_id.experiment_id.member_id.table_id.variable_id.grid_label.version'.

    The Community Earth System Model 1, with the Whole Atmosphere Community Climate Model and Specified Chemistry climate model, released in 2011, includes the following components: aerosol: MOZART-specified (same grid as atmos), atmos: WACCM4 (1.9x2.5 finite volume grid; 144 x 96 longitude/latitude; 66 levels; top level 5.9e-06 mb), atmosChem: MOZART-specified (same grid as atmos), land: CLM4.0, ocean: POP2 (320 x 384 longitude/latitude; 60 levels; top grid cell 0-10 m), ocnBgchem: BEC (same grid as ocean), seaIce: CICE4 (same grid as ocean). The model was run by the Department of Earth System Science, University of California Irvine, Irvine, CA 92697, USA (UCI) in native nominal resolutions: aerosol: 250 km, atmos: 250 km, atmosChem: 250 km, land: 250 km, ocean: 100 km, ocnBgchem: 100 km, seaIce: 100 km.

    Project: These data have been generated as part of the internationally-coordinated Coupled Model Intercomparison Project Phase 6 (CMIP6; see also GMD Special Issue: http://www.geosci-model-dev.net/special_issue590.html). The simulation data provides a basis for climate research designed to answer fundamental science questions and serves as resource for authors of the Sixth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC-AR6).

    CMIP6 is a project coordinated by the Working Group on Coupled Modelling (WGCM) as part of the World Climate Research Programme (WCRP). Phase 6 builds on previous phases executed under the leadership of the Program for Climate Model Diagnosis and Intercomparison (PCMDI) and relies on the Earth System Grid Federation (ESGF) and the Centre for Environmental Data Analysis (CEDA) along with numerous related activities for implementation. The original data is hosted and partially replicated on a federated collection of data nodes, and most of the data relied on by the IPCC is being archived for long-term preservation at the IPCC Data Distribution Centre (IPCC DDC) hosted by the German Climate Computing Center (DKRZ).

    The project includes simulations from about 120 global climate models and around 45 institutions and organizations worldwide. - Project website: https://pcmdi.llnl.gov/CMIP6.

  13. Breast Cancer Prognostics

    • kaggle.com
    zip
    Updated Dec 4, 2022
    Cite
    The Devastator (2022). Breast Cancer Prognostics [Dataset]. https://www.kaggle.com/datasets/thedevastator/improve-breast-cancer-prognostics-using-machine
    Explore at:
    Available download formats: zip (78356 bytes)
    Dataset updated
    Dec 4, 2022
    Authors
    The Devastator
    Description

    Breast Cancer Prognostics

    Study the Wisconsin Dataset

    By UCI [source]

    About this dataset

    The Breast Cancer Wisconsin (Prognostic) dataset brings together data collected from hundreds of breast cancer cases, making it valuable for predictive prognosis. It includes 30 features such as radius, texture, area, compactness and concavity that were computed from a digitized fine needle aspirate (FNA) of the mass and describe the cell nuclei present in each case. It also includes outcomes (recurrence or nonrecurrence) and time-to-recurrence information for those cases that relapse.

    This groundbreaking dataset was created by Dr. William H. Wolberg at the University of Wisconsin Clinical Sciences Center, together with W. Nick Street and Olvi L. Mangasarian at the university's Computer Sciences Dept., who are credited with creating decision tree construction systems that use linear programming models to predict disease recurrence within a short time frame.

    The data is freely available through the UW CS ftp server or on Kaggle's website, giving researchers access to information on breast cancer prognosis and diagnosis derived from images of FNA tests conducted on masses in diagnosed patients, along with a set of features and outcomes for both recurrent and nonrecurrent cases. Papers such as 'An inductive learning approach to prognostic prediction' by W. N. Street et al. have used this database extensively, showing how artificial neural networks can be applied to predictive tasks with noteworthy success. With these tested ideas available, anyone can use this dataset to understand how breast cancer outcomes are predicted and how a predictive model can improve patient care processes.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset is designed to improve the prognostics of breast cancer using machine learning algorithms. The data consists of a time series of patient symptoms and various medical parameters, such as tumor size and malignancy, that can be used by programmatic algorithms to predict diagnosis and prognosis outcomes. Here are some steps on how to use this dataset:

    • Pre-process and clean the data: Since the dataset contains incomplete or missing values across various parameters, it is important to clean and pre-process the data before attempting any machine learning algorithm (MLA). This includes sorting out what type of values need imputation, standardizing features for better performance, encoding categorical variables for MLAs, and normalizing numerical values for accuracy.

    • Choose an appropriate MLA: Depending on your goal with this data set (for example, reliable classification results or weighted predictions based on factors), there are a variety of MLAs from which you may select. Examples include logistic regression classifiers, least squares support vector machines (SVM), neural networks, nonsmooth optimization algorithms like A-Optimality, or global optimization methods such as extracting M-of-N rule sets from trained neural nets. It would be wise to read up on each algorithm to determine which one best meets your needs before starting experimentation with the dataset itself.

    • Train the model using your selected MLA: Once you have identified the MLA that best fits your desired outcome, or if you decide to experiment with multiple approaches, it's time to turn back to the data and run experiments, examining the outcomes of models trained on it with cross-validation methods such as k-fold splitting. Then test these trained models against validation datasets taken from specified subsets of the original data set held by Kaggle, to measure performance across the parameter combinations relevant to predicting breast cancer diagnostic and/or prognostic outcomes. Any trends revealed during these experiments will help inform future model selection during training, especially where a particular MLA is not tailored to handle the purpose, generally falling outside scope designing said model so guaranteeing ac...
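    The k-fold splitting mentioned in the last step can be sketched with the standard library alone. This is a minimal illustration of the index bookkeeping only, not the method used by the dataset authors; the function name, the fixed seed, and the sample size in the usage line are assumptions.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[start::k] for start in range(k)]
    for held_out, validation in enumerate(folds):
        training = [
            index
            for number, fold in enumerate(folds)
            if number != held_out
            for index in fold
        ]
        yield training, validation


# Hypothetical usage: 10 samples split into 5 folds.
splits = list(k_fold_indices(10, 5))
```

    Each sample lands in exactly one validation fold, so averaging a metric over the k splits uses every case once for validation and k-1 times for training.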

  14. Wine Quality Data Set (Red & White Wine)

    • kaggle.com
    zip
    Updated Nov 3, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ruthgn (2021). Wine Quality Data Set (Red & White Wine) [Dataset]. https://www.kaggle.com/datasets/ruthgn/wine-quality-data-set-red-white-wine
    Explore at:
    Available download formats: zip (100361 bytes)
    Dataset updated
    Nov 3, 2021
    Authors
    ruthgn
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Set Information

    This data set contains records related to red and white variants of the Portuguese Vinho Verde wine. It contains information from 1599 red wine samples and 4898 white wine samples. Input variables in the data set consist of the type of wine (either red or white wine) and metrics from objective tests (e.g. acidity levels, pH values, ABV, etc.), while the target/output variable is a numerical score based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Due to privacy and logistic issues, there is no data about grape types, wine brand, and wine selling price.

    This data set is a combined version of the two separate files (distinct red and white wine data sets) originally shared in the UCI Machine Learning Repository.

    The following are some existing data sets on Kaggle from the same source (with notable differences from this data set):

    • Red Wine Quality (contains red wine data only)
    • Wine Quality (combination of red and white wine data but with some values randomly removed)
    • Wine Quality (red and white wine data not combined)
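    Combining the two original files amounts to tagging each row with its wine type and concatenating. The sketch below shows that idea on in-memory rows; the function name and the sample values are hypothetical, and it is not necessarily how this data set was actually assembled.

```python
def combine_wine_tables(red_rows, white_rows):
    """Concatenate two lists of row dicts, adding a 'type' column to each."""
    combined = []
    for wine_type, rows in (("red", red_rows), ("white", white_rows)):
        for row in rows:
            combined.append({"type": wine_type, **row})
    return combined


# Made-up rows for illustration (not real measurements).
red_sample = [{"pH": 3.51, "quality": 5}]
white_sample = [{"pH": 3.00, "quality": 6}, {"pH": 3.30, "quality": 6}]
combined_sample = combine_wine_tables(red_sample, white_sample)
```

    Applied to the full files, the same step would yield 1599 + 4898 = 6497 rows.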

    Contents

    Input variables:

    1 - type of wine: type of wine (categorical: 'red', 'white')

    (continuous variables based on physicochemical tests)

    2 - fixed acidity: The acids that naturally occur in the grapes used to ferment the wine and carry over into the wine. They mostly consist of tartaric, malic, citric or succinic acid that mostly originate from the grapes used to ferment the wine. They also do not evaporate easily. (g / dm^3)

    3 - volatile acidity: Acids that evaporate at low temperatures—mainly acetic acid which can lead to an unpleasant, vinegar-like taste at very high levels. (g / dm^3)

    4 - citric acid: Citric acid is used as an acid supplement which boosts the acidity of the wine. It's typically found in small quantities and can add 'freshness' and flavor to wines. (g / dm^3)

    5 - residual sugar: The amount of sugar remaining after fermentation stops. It's rare to find wines with less than 1 gram/liter. Wines with a residual sugar level greater than 45 grams/liter are considered sweet. On the other end of the spectrum, a wine that does not taste sweet is considered dry. (g / dm^3)

    6 - chlorides: The amount of chloride salts (sodium chloride) present in the wine. (g / dm^3)

    7 - free sulfur dioxide: The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. All else constant, the higher the free sulfur dioxide content, the stronger the preservative effect. (mg / dm^3)

    8 - total sulfur dioxide: The amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine. (mg / dm^3)

    9 - density: The density of the wine, which depends on its percent alcohol and sugar content; it is typically close to that of water. (g / cm^3)

    10 - pH: A measure of the acidity of wine; most wines are between 3 and 4 on the pH scale. The lower the pH, the more acidic the wine is; the higher the pH, the less acidic the wine. (The pH scale is a logarithmic scale that measures the concentration of free hydrogen ions in the wine. Each point of the pH scale is a factor of 10, so a wine with a pH of 3 is 10 times more acidic than a wine with a pH of 4.)

    11 - sulphates: Amount of potassium sulphate as a wine additive which can contribute to sulfur dioxide gas (SO2) levels; it acts as an antimicrobial and antioxidant agent. (g / dm^3)

    12 - alcohol: How much alcohol is contained in a given volume of wine (ABV). Wine generally contains between 5% and 15% alcohol. (% by volume)

    Output variable:

    13 - quality: score between 0 (very bad) and 10 (very excellent) by wine experts
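
    The logarithmic relationship described under variable 10 (pH) can be checked with a short calculation. The following is a minimal Python sketch; the function name relative_acidity is a hypothetical helper for illustration and is not part of the data set.

```python
def relative_acidity(ph_a: float, ph_b: float) -> float:
    """Return how many times more acidic a wine at ph_a is than one at ph_b.

    pH is the negative base-10 logarithm of the hydrogen ion concentration,
    so the concentration ratio between the two wines is 10 ** (ph_b - ph_a).
    """
    return 10 ** (ph_b - ph_a)


if __name__ == "__main__":
    # Compare a wine at pH 3 with a wine at pH 4 (one full pH point apart).
    print(relative_acidity(3.0, 4.0))
```

    A difference of one pH point gives a ratio of 10, matching the description above; fractional differences scale accordingly.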

    Acknowledgements

    Source: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

    Data credit goes to UCI. Visit their website to access the original data set directly: https://archive.ics.uci.edu/ml/datasets/wine+quality

    Context

    So much about wine making remains elusive—taste is very subjective, making it extremely challenging to predict exactly how consumers will react to a certain bottle of wine. There is no doubt that winemakers, connoisseurs, and scientists have greatly contributed their expertise to ...

  15. The Employment Effects of the Minimum Wage: A Selection Ratio Approach to...

    • resodate.org
    Updated Oct 6, 2025
    Cite
    David Slichter (2025). The Employment Effects of the Minimum Wage: A Selection Ratio Approach to Measuring Treatment Effects (replication data) [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9qb3VybmFsZGF0YS56YncuZXUvZGF0YXNldC9zZWwtcmF0aW8=
    Explore at:
    Dataset updated
    Oct 6, 2025
    Dataset provided by
    Journal of Applied Econometrics
    ZBW
    ZBW Journal Data Archive
    Authors
    David Slichter
    Description

    Replication files for David Slichter, "The Employment Effects of the Minimum Wage: A Selection Ratio Approach to Measuring Treatment Effects," Journal of Applied Econometrics, forthcoming.

    Firstly, I’ve provided a .do file called sr.do which contains general code for implementing the selection ratio approach, with detailed instructions written as comments in the code.

    For the minimum wage application, the main data file is mw_final.dta. A .csv version is also provided. Each observation is a county in a time period. I have added self-explanatory variable labels for most variables. A few variables warrant a clearer explanation:

    adj1-adj14: List of FIPS codes of all counties which are adjacent to the county in question. Each variable holds one adjacent county, and counties with fewer than 14 neighbors will have missing values for some of these variables.

    change, logchange: Minimum wage this quarter - minimum wage last quarter, measured either in dollars or in logs.

    time, t1-t108: The variable "time" converts years and quarters into a univariate time period, with time=1 in 1990Q1 and time=108 in 2016Q4. t1-t108 are indicators for each of these time periods.

    lnemp_1418, lnearnbeg_1418, lnsep_1418, lnhira_1418, lnchurn_1418: Logs of employment, earnings, separations, hires, and churn, respectively, for 14-18 year olds.

    gt1-gt6: Dummies for inclusion in each of the six comparisons used for the main (i.e., not spillover-robust) analysis. All treated counties which neighbor a control county take value 1 for each of these variables; all other treated counties take value 0. Among control counties, gt1=1 if the county neighbors a treated county and 0 otherwise, gt2=1 if the county has gt1=0 but neighbors a gt1=1 county, gt3=1 if the county has gt1=gt2=0 but neighbors a gt2=1 county, etc.

    h2-h6: Dummies for inclusion in each of the first spillover-robust (i.e., excluding border counties only) comparisons. Among control counties, h2-h6 are equal to gt2-gt6. Among treated counties, h2-h6 are equal to 1 if the treated county has gt1=0 but borders a gt1=1 county, and 0 otherwise.

    k3-k6: Dummies for inclusion in each of the second spillover-robust (i.e., excluding two layers) comparisons. Among control counties, these variables are equal to gt3-gt6. Among treated counties, all observations take value 1 except those with gt1=1 or h2=1.
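
    The definitions of "time" and of gt1-gt6 among control counties can be illustrated with a small Python sketch. It is not the author's code (the replication files are Stata .do files). It assumes a single time period with a fixed set of treated counties, and the names time_index, control_layers, and the county codes "A" to "D" are hypothetical.

```python
from collections import deque


def time_index(year: int, quarter: int) -> int:
    """Map (year, quarter) to the 'time' variable: 1990Q1 -> 1, 2016Q4 -> 108."""
    return (year - 1990) * 4 + quarter


def control_layers(adjacency, treated, max_layer=6):
    """Assign each control county the k for which gtk = 1.

    adjacency: dict mapping a county code to an iterable of neighboring codes.
    treated:   set of treated county codes (assumed fixed for one period).
    Returns {control county: layer}; control counties more than max_layer
    steps away from every treated county are omitted.
    """
    layer = {}
    # Multi-source breadth-first search: every treated county starts at distance 0.
    queue = deque((county, 0) for county in treated)
    while queue:
        county, dist = queue.popleft()
        if dist == max_layer:
            continue  # do not expand beyond the last layer
        for neighbor in adjacency.get(county, ()):
            if neighbor in treated or neighbor in layer:
                continue  # treated counties and already-labeled counties are skipped
            layer[neighbor] = dist + 1
            queue.append((neighbor, dist + 1))
    return layer


if __name__ == "__main__":
    # Hypothetical chain of counties A - B - C - D, with only A treated.
    adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
    print(time_index(2016, 4))
    print(control_layers(adjacency, {"A"}))
```

    In this toy chain, B neighbors the treated county (gt1), C neighbors B (gt2), and D neighbors C (gt3), which is the layering described in the gt1-gt6 entry.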

    The data sources are as follows. The minimum wage law series is taken from David Neumark's website (https://www.economics.uci.edu/~dneumark/datasets.html). The economic variables are taken from the QWI, which I accessed via the Ithaca Virtual RDC. County adjacency files were downloaded from the Census Bureau (https://www.census.gov/geo/reference/county-adjacency.html).

    The file main.do then runs the analyses. The resulting output file containing results is results.dta.

    For the incumbency application, the main data file is incumb_final.dta. A .csv version is also provided. This file is drawn from Caughey and Sekhon's (2011) data; see their description of most variables here: https://doi.org/10.7910/DVN/8EYYA2

    The key added variables are _IDistancea1-_IDistancea50, which are dummies for inclusion in the 50 comparisons used in the paper. Treated observations (i.e., Democratic wins) with margin of victory below 5 points have each of these variables equal to 1. Control observations have these variables equal to 1 if they fall within the margin of victory range, e.g., _IDistancea9=1 for control observations with Republican margin of victory between 8 and 9 points. Note that these variables are redefined by the code for the analyses of treatment effects away from the discontinuity. Lastly, there is a variable called RepWin which is the treatment variable when treatment is defined as a Republican winning.

    The file sr_incumb.do then performs the analysis.

    Please contact me with any questions at slichter@binghamton.edu.

  16. Website Phishing Dataset

    • kaggle.com
    zip
    Updated May 4, 2019
    Cite
    Ahmad Noor (2019). Website Phishing Dataset [Dataset]. https://www.kaggle.com/datasets/ahmednour/website-phishing-data-set/discussion
    Explore at:
    Available download formats: zip (5211 bytes)
    Dataset updated
    May 4, 2019
    Authors
    Ahmad Noor
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    ISSR CS602 Machine Learning - Project

    Website Phishing Data Set Download: Data Folder, Data Set Description

    Abstract:

    Data Set Characteristics: Multivariate
    Number of Instances: 1353
    Attribute Characteristics: Integer
    Number of Attributes: 10
    Associated Tasks: Classification
    Number of Web Hits: 54880

    Source: Dataset url

    Neda Abdelhamid Auckland Institute of Studies nedah '@' ais.ac.nz

    Data Set Information:

    The phishing problem is considered a vital issue in the “.COM” industry, especially in e-banking and e-commerce, given the number of online transactions involving payments. We have identified different features related to legitimate and phishy websites and collected 1353 different websites from different sources. Phishing websites were collected from the Phishtank data archive (www.phishtank.com), which is a free community site where users can submit, verify, track and share phishing data. The legitimate websites were collected from Yahoo and starting point directories using a web script developed in PHP. The PHP script was plugged with a browser and we collected 548 legitimate websites out of 1353 websites. There are 702 phishing URLs and 103 suspicious URLs.

    When a website is considered SUSPICIOUS that means it can be either phishy or legitimate, meaning the website held some legit and phishy features.

    Attribute Information:

    URL Anchor
    Request URL
    SFH
    URL Length
    Having ’@’
    Prefix/Suffix
    IP
    Sub Domain
    Web traffic
    Domain age
    Class

    The collected features hold the categorical values “Legitimate”, “Suspicious” and “Phishy”; these values have been replaced with the numerical values 1, 0 and -1 respectively. Details of each feature are given in the research paper mentioned below.
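
    The label encoding described above can be written as a simple lookup. This is a minimal Python sketch; the names LABEL_TO_CODE, CLASS_COUNTS and encode are hypothetical, and the class counts are the ones quoted in the Data Set Information paragraph.

```python
# Mapping from the categorical values to the numeric codes stated above.
LABEL_TO_CODE = {"Legitimate": 1, "Suspicious": 0, "Phishy": -1}

# Class counts as quoted in the description; together they should give 1353.
CLASS_COUNTS = {"Legitimate": 548, "Phishy": 702, "Suspicious": 103}


def encode(values):
    """Replace each categorical label with its numeric code."""
    return [LABEL_TO_CODE[value] for value in values]


if __name__ == "__main__":
    print(encode(["Legitimate", "Suspicious", "Phishy"]))
    print(sum(CLASS_COUNTS.values()))
```

    The three class counts add up to the 1353 instances listed in the abstract.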

  17. Natural Lands Trust - Public Preserves 2015

    • search.dataone.org
    • hydroshare.org
    Updated Dec 5, 2021
    Cite
    Natural Lands Trust (2021). Natural Lands Trust - Public Preserves 2015 [Dataset]. https://search.dataone.org/view/sha256%3A9b801aa05a3606986e342ee22f261dda474407e3b0a15c5078395d41e37a1b9b
    Explore at:
    Dataset updated
    Dec 5, 2021
    Dataset provided by
    Hydroshare
    Authors
    Natural Lands Trust
    Description

    These are the NLT preserves that are currently open to the public (2015). Trails, parking areas, and other information can be found on the NLT website: https://natlands.org/category/preserves-to-visit/list-of-preserves/

    This data is hosted at, and may be downloaded or accessed from PASDA, the Pennsylvania Spatial Data Access Geospatial Data Clearinghouse http://www.pasda.psu.edu/uci/DataSummary.aspx?dataset=601

  18. Online Retail II

    • kaggle.com
    zip
    Updated Sep 13, 2024
    + more versions
    Cite
    Doğukan Vatansever (2024). Online Retail II [Dataset]. https://www.kaggle.com/datasets/dvaser/online-retail-ii
    Explore at:
    Available download formats: zip (45407886 bytes)
    Dataset updated
    Sep 13, 2024
    Authors
    Doğukan Vatansever
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    • This Online Retail II data set contains all the transactions occurring for a UK-based and registered, non-store online retailer between 01/12/2009 and 09/12/2011.
    • The company mainly sells unique all-occasion gift-ware. Many customers of the company are wholesalers.
    • Online Retail II Official Website: Links
  19. YouTube Spam Collection Data Set

    • kaggle.com
    zip
    Updated Mar 18, 2019
    + more versions
    Cite
    Lakshmipathi N (2019). YouTube Spam Collection Data Set [Dataset]. https://www.kaggle.com/lakshmi25npathi/images
    Explore at:
    Available download formats: zip (325284 bytes)
    Dataset updated
    Mar 18, 2019
    Authors
    Lakshmipathi N
    Area covered
    YouTube
    Description

    Abstract: It is a public set of comments collected for spam research. It has five datasets composed of 1,956 real messages extracted from five videos that were among the 10 most viewed during the collection period.

    Data Set Information: The table below lists the datasets, the YouTube video ID, the number of samples in each class and the total number of samples per dataset.

    Dataset     YouTube ID     # Spam   # Ham   Total
    Psy         9bZkp7q19f0    175      175     350
    KatyPerry   CevxZvSJLk8    175      175     350
    LMFAO       KQ6zr6kCPj8    236      202     438
    Eminem      uelHwf8o7_U    245      203     448
    Shakira     pRpeEdMmmQ0    174      196     370

    Note: the chronological order of the comments was kept.

    The collection is composed of one CSV file per dataset, where each line has the following attributes: COMMENT_ID,AUTHOR,DATE,CONTENT,TAG
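
    A file with the attribute layout listed above could be read with Python's standard csv module, as in this minimal sketch. It assumes each file starts with a header row naming the five attributes and that TAG is 1 for spam and 0 for ham; neither is stated in this description. The two sample rows are made up for illustration, and count_tags is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# Made-up sample in the COMMENT_ID,AUTHOR,DATE,CONTENT,TAG layout (not real data).
SAMPLE = (
    "COMMENT_ID,AUTHOR,DATE,CONTENT,TAG\n"
    'c1,alice,2014-01-01T00:00:00,"Great song, love it",0\n'
    "c2,bob,2014-01-02T00:00:00,Check out my channel,1\n"
)


def count_tags(csv_text: str) -> Counter:
    """Count how many comments carry each TAG value (values are kept as strings)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["TAG"] for row in reader)


if __name__ == "__main__":
    print(count_tags(SAMPLE))
```

    csv.DictReader handles quoted fields, so a comment containing commas stays in the CONTENT column. For a real file, open it with newline="" and pass the file object to csv.DictReader instead of io.StringIO.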

    For further details, please visit http://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection#

  20. Thyroid Disease Data Set

    • kaggle.com
    zip
    Updated Jul 13, 2025
    Cite
    Yasir Hussein Shakir (2025). Thyroid Disease Data Set [Dataset]. https://www.kaggle.com/yasserhessein/thyroid-disease-data-set
    Explore at:
    Available download formats: zip (96193 bytes)
    Dataset updated
    Jul 13, 2025
    Authors
    Yasir Hussein Shakir
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description


    Source:

    Ross Quinlan

    Data Set Information:

    From Garavan Institute documentation, as given by Ross Quinlan:

    6 databases from the Garavan Institute in Sydney, Australia. Approximately the following for each database:
    • 2800 training (data) instances and 972 test instances
    • Plenty of missing data
    • 29 or so attributes, either Boolean or continuously-valued

    2 additional databases, also from Ross Quinlan, are also here:
    • Hypothyroid.data and sick-euthyroid.data
    • Quinlan believes that these databases have been corrupted
    • Their format is highly similar to the other databases

    1 more database of 9172 instances that cover 20 classes, and a related domain theory

    Another thyroid database from Stefan Aeberhard:
    • 3 classes, 215 instances, 5 attributes
    • No missing values

    A Thyroid database suited for training ANNs:
    • 3 classes
    • 3772 training instances, 3428 testing instances
    • Includes cost data (donated by Peter Turney)

    Attribute Information:

    N/A

    Link:

    https://archive.ics.uci.edu/ml/datasets/Thyroid+Disease

    R. Quinlan. "Thyroid Disease," UCI Machine Learning Repository, 1986. [Online]. Available: https://doi.org/10.24432/C5D010.

Cite
Sahar Pourahmad (2021). Online Retail Data Set [Dataset]. https://www.kaggle.com/datasets/saharpourahmad/online-retail-data-set/code

Data from: Online Retail Data Set

From the UCI website (https://archive.ics.uci.edu/ml/datasets/Online+Retail#)

Explore at:
Available download formats: zip (22875837 bytes)
Dataset updated
Sep 11, 2021
Authors
Sahar Pourahmad
License

https://creativecommons.org/publicdomain/zero/1.0/

Description

Dataset

This dataset was created by Sahar Pourahmad

Released under CC0: Public Domain

Contents
