29 datasets found

Data from: Online Retail Data Set
kaggle.com
zip
Updated Sep 11, 2021
Cite
Sahar Pourahmad (2021). Online Retail Data Set [Dataset]. https://www.kaggle.com/datasets/saharpourahmad/online-retail-data-set/code
Explore at:
Available download formats: zip (22875837 bytes)
Dataset updated
Sep 11, 2021
Authors
Sahar Pourahmad
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset

This dataset was created by Sahar Pourahmad

Released under CC0: Public Domain

Contents
UCI datasets
zenodo.org
data.niaid.nih.gov
zip
Updated Apr 4, 2023
Cite
Mathias Drton; Stephan Haug; David Reifferscheidt; Oleksandr Zadorozhnyi (2023). UCI datasets [Dataset]. http://doi.org/10.5281/zenodo.7681792
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.7681792
Dataset updated
Apr 4, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Mathias Drton; Stephan Haug; David Reifferscheidt; Oleksandr Zadorozhnyi
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Collection of two datasets from the UCI website that could be used for structure learning tasks. Includes datasets regarding

Air Quality

US census 1990

Size: Two datasets of sizes 9471 x 17 and 2458285 x 68, respectively

Number of features: 15-68

Ground truth: No

Type of Graph: No ground truth

More information about the datasets is contained in the dataset_description.html files.
uci.edu Traffic Analytics Data
analytics.explodingtopics.com
Updated Sep 1, 2025
+ more versions
Cite
(2025). uci.edu Traffic Analytics Data [Dataset]. https://analytics.explodingtopics.com/website/uci.edu
Explore at:
Dataset updated
Sep 1, 2025
Variables measured
Global Rank, Monthly Visits, Authority Score, US Country Rank, Education Category Rank
Description
Traffic analytics, rankings, and competitive metrics for uci.edu as of September 2025
Air Quality Data Set
kaggle.com
zip
Updated Sep 11, 2021
Cite
Sahar Pourahmad (2021). Air Quality Data Set [Dataset]. https://www.kaggle.com/saharpourahmad/air-quality-data-set-from-uci-website
Explore at:
Available download formats: zip (253800 bytes)
Dataset updated
Sep 11, 2021
Authors
Sahar Pourahmad
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

This dataset contains the responses of a gas multisensor device deployed in the field in an Italian city. Hourly response averages are recorded along with gas concentration references from a certified analyzer. The goal here is to build a model that predicts the relative humidity (RH) from the air quality factors provided. This is a typical regression problem that can be approached in many ways.

Data Set Information:

The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. The device was located in the field in a significantly polluted area, at road level, within an Italian city. Data were recorded from March 2004 to February 2005 (one year), representing the longest freely available recordings of on-field deployed air quality chemical sensor device responses. Ground truth hourly averaged concentrations for CO, Non-Methane Hydrocarbons (NMHC), Benzene, Total Nitrogen Oxides (NOx) and Nitrogen Dioxide (NO2) were provided by a co-located reference certified analyzer. Evidence of cross-sensitivities as well as both concept and sensor drifts is present, as described in De Vito et al., Sens. And Act. B, Vol. 129, 2, 2008 (citation required), eventually affecting the sensors' concentration estimation capabilities. Missing values are tagged with the value -200. This dataset can be used exclusively for research purposes. Commercial purposes are fully excluded.

Content

0 Date (DD/MM/YYYY)

1 Time (HH.MM.SS)

2 True hourly averaged concentration CO in mg/m^3 (reference analyzer)

3 PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)

4 True hourly averaged overall Non-Methane Hydrocarbons concentration in microg/m^3 (reference analyzer)

5 True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)

6 PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)

7 True hourly averaged NOx concentration in ppb (reference analyzer)

8 PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)

9 True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)

10 PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)

11 PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)

12 Temperature in °C

13 Relative Humidity (%)

14 AH Absolute Humidity
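
As a minimal sketch of how the column layout and the -200 missing-value tag described above might be handled, the following uses only the Python standard library. The function names and the sample readings are hypothetical and chosen for illustration; they are not part of the dataset or of any published loader.

```python
import math
from datetime import datetime

MISSING = -200.0  # sentinel that the description says tags missing values


def parse_timestamp(date_str, time_str):
    """Combine a DD/MM/YYYY date and an HH.MM.SS time into one datetime."""
    return datetime.strptime(f"{date_str} {time_str}", "%d/%m/%Y %H.%M.%S")


def mask_missing(values, sentinel=MISSING):
    """Replace the sentinel with NaN so it cannot skew later statistics."""
    return [math.nan if v == sentinel else v for v in values]


def mean_ignoring_nan(values):
    """Average of the non-missing readings, or NaN if none remain."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else math.nan


# Made-up readings for illustration only (not taken from the dataset).
co_readings = mask_missing([2.0, -200.0, 4.0])
stamp = parse_timestamp("10/03/2004", "18.00.00")
```

Masking the sentinel before computing any average matters here because a literal -200 would otherwise be treated as a real concentration.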

Acknowledgements

Source: Saverio De Vito (saverio.devito '@' enea.it), ENEA - National Agency for New Technologies, Energy and Sustainable Economic Development https://archive.ics.uci.edu/ml/datasets/Air+Quality
Fertility Data Set
kaggle.com
zip
Updated Sep 11, 2021
+ more versions
Cite
Sahar Pourahmad (2021). Fertility Data Set [Dataset]. https://www.kaggle.com/saharpourahmad/fertility-data-set
Explore at:
Available download formats: zip (786 bytes)
Dataset updated
Sep 11, 2021
Authors
Sahar Pourahmad
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

100 volunteers provided a semen sample analyzed according to the WHO 2010 criteria. Sperm concentrations are related to socio-demographic data, environmental factors, health status, and life habits.

Content

Attribute Information:

Season in which the analysis was performed. 1) winter, 2) spring, 3) Summer, 4) fall. (-1, -0.33, 0.33, 1)

Age at the time of analysis. 18-36 (0, 1)

Childhood diseases (i.e., chicken pox, measles, mumps, polio) 1) yes, 2) no. (0, 1)

Accident or serious trauma 1) yes, 2) no. (0, 1)

Surgical intervention 1) yes, 2) no. (0, 1)

High fevers in the last year 1) less than three months ago, 2) more than three months ago, 3) no. (-1, 0, 1)

Frequency of alcohol consumption 1) several times a day, 2) every day, 3) several times a week, 4) once a week, 5) hardly ever or never (0, 1)

Smoking habit 1) never, 2) occasional 3) daily. (-1, 0, 1)

Number of hours spent sitting per day 1-16 (0, 1)

Output: Diagnosis normal (N), altered (O)
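
The attribute list above gives each category alongside its numeric code. A minimal sketch of decoding two of them follows. It assumes the codes in parentheses map to the categories in the order listed, and that age was min-max scaled from 18-36 onto (0, 1); neither assumption is confirmed by the description, and the function names are hypothetical.

```python
# Assumed mapping: codes listed in the same order as the seasons.
SEASON_CODES = {-1.0: "winter", -0.33: "spring", 0.33: "summer", 1.0: "fall"}


def decode_season(code):
    """Return the season name for a stored code such as -0.33."""
    return SEASON_CODES[round(code, 2)]


def decode_age(scaled, low=18, high=36):
    """Undo an assumed min-max scaling of age onto the range 0 to 1."""
    return low + scaled * (high - low)
```

For instance, under these assumptions `decode_age(0.5)` gives 27.0, the midpoint of the stated age range.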

Acknowledgements

Source:

David Gil, dgil '@' dtic.ua.es, Lucentia Research Group, Department of Computer Technology, University of Alicante

Jose Luis Girela, girela '@' ua.es, Department of Biotechnology, University of Alicante

Relevant Papers:

David Gil, Jose Luis Girela, Joaquin De Juan, M. Jose Gomez-Torres, and Magnus Johnsson. Predicting seminal quality with artificial intelligence methods. Expert Systems with Applications, 39(16):12564-12573, 2012

Citation Request:

David Gil, Jose Luis Girela, Joaquin De Juan, M. Jose Gomez-Torres, and Magnus Johnsson. Predicting seminal quality with artificial intelligence methods. Expert Systems with Applications, 39(16):12564-12573, 2012
uci.org Traffic Analytics Data
analytics.explodingtopics.com
Updated Sep 1, 2025
+ more versions
Cite
(2025). uci.org Traffic Analytics Data [Dataset]. https://analytics.explodingtopics.com/website/uci.org
Explore at:
Dataset updated
Sep 1, 2025
Variables measured
Global Rank, Monthly Visits, Authority Score, US Country Rank, Sports Category Rank
Description
Traffic analytics, rankings, and competitive metrics for uci.org as of September 2025
UCI Communities and Crime Unnormalized Data Set
kaggle.com
Updated Feb 21, 2018
Cite
Kavitha (2018). UCI Communities and Crime Unnormalized Data Set [Dataset]. https://www.kaggle.com/kkanda/communities%20and%20crime%20unnormalized%20data%20set/code
Explore at:
Dataset updated
Feb 21, 2018
Authors
Kavitha
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Context

Introduction: The dataset used for this experiment is real and authentic. It was acquired from the UCI Machine Learning Repository website [13]. The title of the dataset is ‘Crime and Communities’. It was prepared using socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR [13]. The dataset contains a total of 147 attributes and 2216 instances.

The per capita crimes variables were calculated using population values included in the 1995 FBI data (which differ from the 1990 Census values).

Content

The variables included in the dataset describe the community, such as the percent of the population considered urban and the median family income, and law enforcement, such as the per capita number of police officers and the percent of officers assigned to drug units. The crime attributes (N=18) that could be predicted are the 8 crimes considered 'Index Crimes' by the FBI (Murders, Rape, Robbery, ....), per capita (actually per 100,000 population) versions of each, and Per Capita Violent Crimes and Per Capita Nonviolent Crimes.

predictive variables: 125

non-predictive variables: 4

potential goal/response variables: 18
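
The description notes that the "per capita" crime variables are actually expressed per 100,000 population, and that the three variable groups make up the 147 attributes. A minimal sketch of both calculations is given here; the function name and the example numbers are hypothetical and are not taken from the dataset.

```python
def per_100k(count, population):
    """Convert a raw crime count into a rate per 100,000 population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count * 100_000 / population


# Variable groups as listed in the description.
PREDICTIVE, NON_PREDICTIVE, GOALS = 125, 4, 18
TOTAL_ATTRIBUTES = PREDICTIVE + NON_PREDICTIVE + GOALS

# Made-up community: 50 incidents among 25,000 residents.
example_rate = per_100k(50, 25_000)
```

Since the description says the rates use the 1995 FBI population values rather than the 1990 Census values, the `population` argument would need to come from the FBI figures to reproduce the published numbers.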

Acknowledgements

http://archive.ics.uci.edu/ml/datasets/Communities%20and%20Crime%20Unnormalized

U. S. Department of Commerce, Bureau of the Census, Census Of Population And Housing 1990 United States: Summary Tape File 1a & 3a (Computer Files),

U.S. Department Of Commerce, Bureau Of The Census Producer, Washington, DC and Inter-university Consortium for Political and Social Research Ann Arbor, Michigan. (1992)

U.S. Department of Justice, Bureau of Justice Statistics, Law Enforcement Management And Administrative Statistics (Computer File) U.S. Department Of Commerce, Bureau Of The Census Producer, Washington, DC and Inter-university Consortium for Political and Social Research Ann Arbor, Michigan. (1992)

U.S. Department of Justice, Federal Bureau of Investigation, Crime in the United States (Computer File) (1995)

Inspiration

Your data will be in front of the world's largest data science community. What questions do you want to see answered?

Data available in the dataset may not act as a complete source of information for identifying factors that contribute to more violent and non-violent crimes as many relevant factors may still be missing.

However, I would like to try to answer the following questions.

Analyze if number of vacant and occupied houses and the period of time the houses were vacant had contributed to any significant change in violent and non-violent crime rates in communities

How has unemployment changed crime rate(violent and non-violent) in the communities?

Were people from a particular age group more vulnerable to crime?

Does ethnicity play a role in crime rate?

Has education played a role in bringing down the crime rate?
phishing.arff
figshare.com
txt
Updated Jul 10, 2024
Cite
Ambroise Odonnat (2024). phishing.arff [Dataset]. http://doi.org/10.6084/m9.figshare.26232710.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.26232710.v1
Dataset updated
Jul 10, 2024
Dataset provided by
Figshare (http://figshare.com/)
Authors
Ambroise Odonnat
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This folder contains the data from the Phishing Website dataset provided in [1]. All the features are categorical and were preprocessed into integer values. The data can be downloaded from https://archive.ics.uci.edu/dataset/327/phishing+websites. There are 11055 samples with 30 features. Websites belong to 2 domains: websites that use an IP address instead of the domain name in the URL, and websites that use the domain name in the URL. For reference, please refer to: [1] R. Mohammad, F. Thabtah, L. Mccluskey. An assessment of features related to phishing websites using an automated technique. In International Conference for Internet Technology and Secured Transactions, 2012.
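
Since the file is distributed in ARFF form with integer-coded categorical features, a minimal reader can be sketched with the standard library alone. This is an illustration under stated assumptions, not the author's loader: it handles only simple dense ARFF files, and the attribute names `uses_ip` and `label`, the toy rows, and the convention that -1 marks an IP-address URL are all hypothetical.

```python
def read_arff(lines):
    """Return (attribute_names, rows) from the lines of a simple dense ARFF file."""
    names, rows, in_data = [], [], False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append([int(v) for v in line.split(",")])
        elif lower.startswith("@attribute"):
            names.append(line.split()[1])
        elif lower.startswith("@data"):
            in_data = True
    return names, rows


# Toy file for illustration only; names and values are hypothetical.
SAMPLE = """\
% toy file, not real data
@relation phishing_toy
@attribute uses_ip {-1,1}
@attribute label {-1,1}
@data
-1,-1
1,1
1,-1
""".splitlines()

names, rows = read_arff(SAMPLE)
# Split into the two domains, assuming -1 in the first column means "IP address used".
ip_domain = [r for r in rows if r[0] == -1]
name_domain = [r for r in rows if r[0] != -1]
```

On the real file one would expect 30 feature columns and 11055 rows, as stated in the description, though that has not been checked here.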
UCI E3SM1.0 model output prepared for CMIP6 PAMIP pdSST-pdSICSIT
wdc-climate.de
Updated 2021
+ more versions
Cite
University of California Irvine (UCI) (2021). UCI E3SM1.0 model output prepared for CMIP6 PAMIP pdSST-pdSICSIT [Dataset]. http://doi.org/10.22033/ESGF/CMIP6.16550
Explore at:
Unique identifier
https://doi.org/10.22033/ESGF/CMIP6.16550
Dataset updated
2021
Dataset provided by
Earth System Grid
World Data Center for Climate (WDCC) at DKRZ
Authors
University of California Irvine (UCI)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Coupled Model Intercomparison Project Phase 6 (CMIP6) datasets. These data include all datasets published for 'CMIP6.PAMIP.UCI.E3SM-1-0.pdSST-pdSICSIT' with the full Data Reference Syntax following the template 'mip_era.activity_id.institution_id.source_id.experiment_id.member_id.table_id.variable_id.grid_label.version'.
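
The Data Reference Syntax template above can be read as an ordered list of facet names, so a dataset identifier can be split into named parts. The sketch below assumes that no facet value itself contains a dot; the function name `parse_drs` is hypothetical.

```python
DRS_TEMPLATE = (
    "mip_era.activity_id.institution_id.source_id.experiment_id."
    "member_id.table_id.variable_id.grid_label.version"
)


def parse_drs(dataset_id, template=DRS_TEMPLATE):
    """Map the dot-separated parts of a DRS identifier onto the template's facet names."""
    fields = template.split(".")
    parts = dataset_id.split(".")
    if len(parts) > len(fields):
        raise ValueError("identifier has more parts than the template")
    # A shorter identifier simply leaves the trailing facets unset.
    return dict(zip(fields, parts))


facets = parse_drs("CMIP6.PAMIP.UCI.E3SM-1-0.pdSST-pdSICSIT")
```

The identifier quoted in the description fills only the first five facets, which is why `member_id` and the later facets are absent from the result.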

The E3SM 1.0 (Energy Exascale Earth System Model) climate model, released in 2018, includes the following components: aerosol: MAM4 with resuspension, marine organics, and secondary organics (same grid as atmos), atmos: EAM (v1.0, cubed sphere spectral-element grid; 5400 elements with p=3; 1 deg average grid spacing; 90 x 90 x 6 longitude/latitude/cubeface; 72 levels; top level 0.1 hPa), atmosChem: Troposphere specified oxidants for aerosols. Stratosphere linearized interactive ozone (LINOZ v2) (same grid as atmos), land: ELM (v1.0, cubed sphere spectral-element grid; 5400 elements with p=3; 1 deg average grid spacing; 90 x 90 x 6 longitude/latitude/cubeface; satellite phenology mode), MOSART (v1.0, 0.5 degree latitude/longitude grid), ocean: MPAS-Ocean (v6.0, oEC60to30 unstructured SVTs mesh with 235160 cells and 714274 edges, variable resolution 60 km to 30 km; 60 levels; top grid cell 0-10 m), seaIce: MPAS-Seaice (v6.0, same grid as ocean). The model was run by the Department of Earth System Science, University of California Irvine, Irvine, CA 92697, USA (UCI) in native nominal resolutions: aerosol: 100 km, atmos: 100 km, atmosChem: 100 km, land: 100 km, ocean: 50 km, seaIce: 50 km.

Project: These data have been generated as part of the internationally-coordinated Coupled Model Intercomparison Project Phase 6 (CMIP6; see also GMD Special Issue: http://www.geosci-model-dev.net/special_issue590.html). The simulation data provides a basis for climate research designed to answer fundamental science questions and serves as resource for authors of the Sixth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC-AR6).

CMIP6 is a project coordinated by the Working Group on Coupled Modelling (WGCM) as part of the World Climate Research Programme (WCRP). Phase 6 builds on previous phases executed under the leadership of the Program for Climate Model Diagnosis and Intercomparison (PCMDI) and relies on the Earth System Grid Federation (ESGF) and the Centre for Environmental Data Analysis (CEDA) along with numerous related activities for implementation. The original data is hosted and partially replicated on a federated collection of data nodes, and most of the data relied on by the IPCC is being archived for long-term preservation at the IPCC Data Distribution Centre (IPCC DDC) hosted by the German Climate Computing Center (DKRZ).

The project includes simulations from about 120 global climate models and around 45 institutions and organizations worldwide. - Project website: https://pcmdi.llnl.gov/CMIP6.
🚴‍♀️ UCI Race Results
kaggle.com
zip
Updated Oct 16, 2024
Cite
mexwell (2024). 🚴‍♀️ UCI Race Results [Dataset]. https://www.kaggle.com/datasets/mexwell/uci-race-results/code
Explore at:
Available download formats: zip (61082159 bytes)
Dataset updated
Oct 16, 2024
Authors
mexwell
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The data is gathered from publicly available race results of the UCI website. It includes race results in all cycling disciplines (road, track, mountain bike, bmx, cyclo cross), classifications, data on athletes (such as name, age, country of affiliation etc.). The data is stored in a csv file.

Original Data

Citation

Korf, Jesse (2023), “UCI race results 2010-2020”, Mendeley Data, V2, doi: 10.17632/mkzht8948x.2

Acknowledgement

Photo by Mikkel Bech on Unsplash
Replication Data for: The Gender Readings Gap in Political Science Graduate...
dataverse.harvard.edu
Updated Oct 15, 2018
Cite
Heidi Hardt; Amy Erica Smith; Hannah June Kim; Philippe Meister (2018). Replication Data for: The Gender Readings Gap in Political Science Graduate Training [Dataset]. http://doi.org/10.7910/DVN/UNWIHE
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.7910/DVN/UNWIHE
Dataset updated
Oct 15, 2018
Dataset provided by
Harvard Dataverse
Authors
Heidi Hardt; Amy Erica Smith; Hannah June Kim; Philippe Meister
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This is the replication Stata code and log file for the Journal of Politics research note, "The Gender Readings Gap in Political Science Graduate Training," by Heidi Hardt, Amy Erica Smith, Hannah June Kim and Philippe Meister. For our searchable database, see our website here: http://gradtraining.socsci.uci.edu/
UCI CESM1-WACCM-SC model output prepared for CMIP6 PAMIP
wdc-climate.de
Updated 2020
+ more versions
Cite
Peings, Yannick (2020). UCI CESM1-WACCM-SC model output prepared for CMIP6 PAMIP [Dataset]. http://doi.org/10.22033/ESGF/CMIP6.12281
Explore at:
Unique identifier
https://doi.org/10.22033/ESGF/CMIP6.12281
Dataset updated
2020
Dataset provided by
Earth System Grid
World Data Center for Climate (WDCC) at DKRZ
Authors
Peings, Yannick
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Coupled Model Intercomparison Project Phase 6 (CMIP6) datasets. These data include all datasets published for 'CMIP6.PAMIP.NCAR.CESM1-WACCM-SC' with the full Data Reference Syntax following the template 'mip_era.activity_id.institution_id.source_id.experiment_id.member_id.table_id.variable_id.grid_label.version'.

The Community Earth System Model 1, with the Whole Atmosphere Community Climate Model and Specified Chemistry climate model, released in 2011, includes the following components: aerosol: MOZART-specified (same grid as atmos), atmos: WACCM4 (1.9x2.5 finite volume grid; 144 x 96 longitude/latitude; 66 levels; top level 5.9e-06 mb), atmosChem: MOZART-specified (same grid as atmos), land: CLM4.0, ocean: POP2 (320 x 384 longitude/latitude; 60 levels; top grid cell 0-10 m), ocnBgchem: BEC (same grid as ocean), seaIce: CICE4 (same grid as ocean). The model was run by the Department of Earth System Science, University of California Irvine, Irvine, CA 92697, USA (UCI) in native nominal resolutions: aerosol: 250 km, atmos: 250 km, atmosChem: 250 km, land: 250 km, ocean: 100 km, ocnBgchem: 100 km, seaIce: 100 km.

Project: These data have been generated as part of the internationally-coordinated Coupled Model Intercomparison Project Phase 6 (CMIP6; see also GMD Special Issue: http://www.geosci-model-dev.net/special_issue590.html). The simulation data provides a basis for climate research designed to answer fundamental science questions and serves as resource for authors of the Sixth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC-AR6).

CMIP6 is a project coordinated by the Working Group on Coupled Modelling (WGCM) as part of the World Climate Research Programme (WCRP). Phase 6 builds on previous phases executed under the leadership of the Program for Climate Model Diagnosis and Intercomparison (PCMDI) and relies on the Earth System Grid Federation (ESGF) and the Centre for Environmental Data Analysis (CEDA) along with numerous related activities for implementation. The original data is hosted and partially replicated on a federated collection of data nodes, and most of the data relied on by the IPCC is being archived for long-term preservation at the IPCC Data Distribution Centre (IPCC DDC) hosted by the German Climate Computing Center (DKRZ).

The project includes simulations from about 120 global climate models and around 45 institutions and organizations worldwide. - Project website: https://pcmdi.llnl.gov/CMIP6.
Breast Cancer Prognostics
kaggle.com
zip
Updated Dec 4, 2022
Cite
The Devastator (2022). Breast Cancer Prognostics [Dataset]. https://www.kaggle.com/datasets/thedevastator/improve-breast-cancer-prognostics-using-machine
Explore at:
Available download formats: zip (78356 bytes)
Dataset updated
Dec 4, 2022
Authors
The Devastator
Description
Breast Cancer Prognostics

Study the Wisconsin Dataset

By UCI [source]

About this dataset

The Breast Cancer Wisconsin (Prognostic) dataset brings together data collected from hundreds of breast cancer cases, making it valuable for predictive prognosis. It includes 30 features such as radius, texture, area, compactness and concavity that were generated from a digitized fine needle aspirate (FNA) of the mass to characterize the cell nuclei present in each case. It also includes outcomes (recurrence or nonrecurrence) and time-to-recurrence information for those cases that relapse.

This dataset was created by Dr. William H. Wolberg at the University of Wisconsin Clinical Sciences Center, together with W. Nick Street and Olvi L. Mangasarian at the university's Computer Sciences Dept. They are credited with creating decision tree construction systems that use linear programming models to predict disease recurrence.

The data is freely available through the UW CS ftp server or on Kaggle's website, giving researchers access to information on breast cancer prognosis and diagnosis derived from images of FNA tests conducted on masses in diagnosed patients, together with a set of features paired with outcomes for both recurrent and nonrecurrent cases. Papers such as 'An inductive learning approach to prognostic prediction' by W. N. Street et al. have used this database extensively, showing how artificial neural networks can be applied to predictive tasks. With these tested ideas as a starting point, anyone can use this dataset to study how breast cancer outcomes are predicted and how a predictive model can improve patient care.

How to use the dataset

This dataset is designed to improve the prognostics of breast cancer using machine learning algorithms. The data consists of a time series of patient symptoms and various medical parameters, such as tumor size and malignancy, that can be used by programmatic algorithms to predict diagnosis and prognosis outcomes. Here are some steps on how to use this dataset:

Pre-process and clean the data: Since the dataset contains incomplete or missing values across various parameters, it is important to clean and pre-process the data before applying any machine learning algorithm (MLA). This includes deciding which values need imputation, standardizing features for better performance, encoding categorical variables for MLAs, and normalizing numerical values for accuracy.

Choose an appropriate MLA: Depending on your exact goal with this data set (for example, reliable classification results or weighted predictions based on factors), there are a variety of MLAs from which you may select. Examples include logistic regression classifiers, least squares support vector machines (SVM), neural networks, nonsmooth optimization algorithms like A-Optimality, or global optimization methods such as extracting M-of-N rule sets from trained neural nets. It would be wise to read up on each algorithm to determine which one best meets your needs before starting experimentation with the dataset itself.

Train the model using your selected MLA: Once you have identified an MLA that best fits your desired outcome, or if you decide to experiment with multiple approaches, it's time to turn back to the data itself, run experiments, and examine the outcomes of models trained on it using cross validation methods such as k-fold splitting. Then test these trained models against validation datasets taken from specified subsets of the original larger data set held by Kaggle, in order to measure performance under the various parameter combinations relevant to predicting breast cancer diagnostic and/or prognostic outcomes. Any trends revealed during these experiments will help inform future model selection during training, especially where a particular MLA is not tailored to the purpose, generally falling outside scope designing said model so guaranteeing ac...
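
The k-fold splitting mentioned in the training step can be sketched as follows. This is a generic illustration using only the standard library, not code from the dataset's author; the function name is hypothetical, and the folds are contiguous, so in practice the sample order would usually be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
```

Each sample appears in exactly one test fold, so averaging a model's score over the folds uses every row for evaluation once.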
Wine Quality Data Set (Red & White Wine)
kaggle.com
zip
Updated Nov 3, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ruthgn (2021). Wine Quality Data Set (Red & White Wine) [Dataset]. https://www.kaggle.com/datasets/ruthgn/wine-quality-data-set-red-white-wine
Explore at:
Available download formats: zip (100361 bytes)
Dataset updated
Nov 3, 2021
Authors
ruthgn
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data Set Information

This data set contains records related to red and white variants of the Portuguese Vinho Verde wine. It contains information from 1599 red wine samples and 4898 white wine samples. Input variables in the data set consist of the type of wine (either red or white wine) and metrics from objective tests (e.g. acidity levels, pH values, ABV, etc.), while the target/output variable is a numerical score based on sensory data: the median of at least 3 evaluations made by wine experts. Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Due to privacy and logistic issues, there is no data about grape types, wine brand, and wine selling price.

This data set is a combined version of the two separate files (distinct red and white wine data sets) originally shared in the UCI Machine Learning Repository.

The following are some existing data sets on Kaggle from the same source (with notable differences from this data set):

- Red Wine Quality (contains red wine data only)

- Wine Quality (combination of red and white wine data but with some values randomly removed)

- Wine Quality (red and white wine data not combined)

Contents

Input variables:

1 - type of wine: type of wine (categorical: 'red', 'white')

(continuous variables based on physicochemical tests)

2 - fixed acidity: The acids that naturally occur in the grapes used to ferment the wine and carry over into the wine. They mostly consist of tartaric, malic, citric or succinic acid, and they do not evaporate easily. (g / dm^3)

3 - volatile acidity: Acids that evaporate at low temperatures—mainly acetic acid which can lead to an unpleasant, vinegar-like taste at very high levels. (g / dm^3)

4 - citric acid: Citric acid is used as an acid supplement which boosts the acidity of the wine. It's typically found in small quantities and can add 'freshness' and flavor to wines. (g / dm^3)

5 - residual sugar: The amount of sugar remaining after fermentation stops. It's rare to find wines with less than 1 gram/liter. Wines with a residual sugar level greater than 45 grams/liter are considered sweet. On the other end of the spectrum, a wine that does not taste sweet is considered dry. (g / dm^3)

6 - chlorides: The amount of chloride salts (sodium chloride) present in the wine. (g / dm^3)

7 - free sulfur dioxide: The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. All else constant, the higher the free sulfur dioxide content, the stronger the preservative effect. (mg / dm^3)

8 - total sulfur dioxide: The amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine. (mg / dm^3)

9 - density: The density of wine juice depending on the percent alcohol and sugar content; it's typically similar but higher than that of water (wine is 'thicker'). (g / cm^3)

10 - pH: A measure of the acidity of wine; most wines are between 3-4 on the pH scale. The lower the pH, the more acidic the wine is; the higher the pH, the less acidic the wine. (The pH scale technically is a logarithmic scale that measures the concentration of free hydrogen ions floating around in your wine. Each point of the pH scale is a factor of 10. This means a wine with a pH of 3 is 10 times more acidic than a wine with a pH of 4)

11 - sulphates: Amount of potassium sulphate as a wine additive which can contribute to sulfur dioxide gas (SO2) levels; it acts as an antimicrobial and antioxidant agent. (g / dm^3)

12 - alcohol: How much alcohol is contained in a given volume of wine (ABV). Wine generally contains between 5% and 15% alcohol. (% by volume)

Output variable:

13 - quality: score between 0 (very bad) and 10 (very excellent) by wine experts
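
The pH note in item 10 says each point on the scale is a factor of 10. A short worked sketch of that arithmetic is given here, along with the combined sample count implied by the red and white totals stated earlier. The function name is hypothetical.

```python
def acidity_ratio(ph_a, ph_b):
    """How many times more acidic a wine at ph_a is than a wine at ph_b."""
    # pH is the negative base-10 logarithm of hydrogen ion concentration,
    # so a difference of d pH points corresponds to a factor of 10**d.
    return 10 ** (ph_b - ph_a)


RED_SAMPLES = 1599
WHITE_SAMPLES = 4898
TOTAL_SAMPLES = RED_SAMPLES + WHITE_SAMPLES  # rows expected in the combined file

ratio = acidity_ratio(3, 4)
```

Here `ratio` works out to 10, matching the statement that a wine at pH 3 is 10 times more acidic than one at pH 4.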

Acknowledgements

Source: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

Data credit goes to UCI. Visit their website to access the original data set directly: https://archive.ics.uci.edu/ml/datasets/wine+quality

Context

So much about wine making remains elusive—taste is very subjective, making it extremely challenging to predict exactly how consumers will react to a certain bottle of wine. There is no doubt that winemakers, connoisseurs, and scientists have greatly contributed their expertise to ...
The Employment Effects of the Minimum Wage: A Selection Ratio Approach to...
resodate.org
Updated Oct 6, 2025
Cite
David Slichter (2025). The Employment Effects of the Minimum Wage: A Selection Ratio Approach to Measuring Treatment Effects (replication data) [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9qb3VybmFsZGF0YS56YncuZXUvZGF0YXNldC9zZWwtcmF0aW8=
Explore at:
Dataset updated
Oct 6, 2025
Dataset provided by
Journal of Applied Econometrics
ZBW
ZBW Journal Data Archive
Authors
David Slichter
Description
Replication files for David Slichter, "The Employment Effects of the Minimum Wage: A Selection Ratio Approach to Measuring Treatment Effects," Journal of Applied Econometrics, forthcoming.

Firstly, I’ve provided a .do file called sr.do which contains general code for implementing the selection ratio approach, with detailed instructions written as comments in the code.

For the minimum wage application, the main data file is mw_final.dta. A .csv version is also provided. Observations are a county in a time period. I have added self-explanatory variable labels for most variables. A few variables warrant a clearer explanation:

adj1-adj14: List of FIPS codes of all counties which are adjacent to the county in question. Each variable holds one adjacent county, and counties with fewer than 14 neighbors will have missing values for some of these variables.

change, logchange: Minimum wage this quarter - minimum wage last quarter, measured either in dollars or in logs.

time, t1-t108: The variable "time" converts years and quarters into a univariate time period, with time=1 in 1990Q1 and time=108 in 2016Q4. t1-t108 are indicators for each of these time periods.

lnemp_1418, lnearnbeg_1418, lnsep_1418, lnhira_1418, lnchurn_1418: Logs of employment, earnings, separations, hires, and churn, respectively, for 14-18 year olds.

gt1-gt6: Dummies for inclusion in each of the six comparisons used for the main (i.e., not spillover-robust) analysis. All treated counties which neighbor a control county take value 1 for each of these variables; all other treated counties take value 0. Among control counties, gt1=1 if the county neighbors a treated county and 0 otherwise, gt2=1 if the county has gt1=0 but neighbors a gt1=1 county, gt3=1 if the county has gt1=gt2=0 but neighbors a gt2=1 county, etc.

h2-h6: Dummies for inclusion in each of the first spillover-robust (i.e., excluding border counties only) comparisons. Among control counties, h2-h6 are equal to gt2-gt6. Among treated counties, h2-h6 are equal to 1 if the treated county has gt1=0 but borders a gt1=1 county, and 0 otherwise.

k3-k6: Dummies for inclusion in each of the second spillover-robust (i.e., excluding two layers) comparisons. Among control counties, these variables are equal to gt3-gt6. Among treated counties, all observations take value 1 except those with gt1=1 or h2=1.
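The definitions of "time" and of the gt1-gt6 layers for control counties can be illustrated with a short Python sketch. It is not the author's code (the replication files are Stata .do files), and it makes simplifying assumptions: a single time period, an in-memory adjacency dictionary instead of the adj1-adj14 columns, and hypothetical function names `time_index` and `control_layers`.

```python
from collections import deque


def time_index(year: int, quarter: int) -> int:
    """Map (year, quarter) to the period index, with 1990Q1 -> 1."""
    if not 1 <= quarter <= 4:
        raise ValueError("quarter must be between 1 and 4")
    return (year - 1990) * 4 + quarter


def control_layers(adjacency, treated, max_layer=6):
    """Return {county: k} for control counties, where k is the layer
    number: 1 if the county neighbors a treated county, 2 if it is
    not in layer 1 but neighbors a layer-1 county, and so on.

    adjacency: dict mapping each county to an iterable of neighbors
               (hypothetical stand-in for adj1-adj14).
    treated:   set of treated counties in the period considered.
    """
    layers = {}
    seen = set(treated)
    queue = deque((county, 0) for county in treated)
    while queue:
        county, dist = queue.popleft()
        if dist == max_layer:
            continue  # do not label counties beyond the last layer
        for neighbor in adjacency.get(county, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                layers[neighbor] = dist + 1
                queue.append((neighbor, dist + 1))
    return layers


# Toy example: a chain of counties A - B - C - D, with A treated.
toy_adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
toy_layers = control_layers(toy_adjacency, {"A"})
print(time_index(1990, 1), time_index(2016, 4))  # 1 108
print(toy_layers)  # {'B': 1, 'C': 2, 'D': 3}
```

The breadth-first search starts from all treated counties at once, so each control county receives the smallest layer number that applies, matching the "gt1=0 but neighbors a gt1=1 county" style of definition.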

The data sources are as follows. The minimum wage law series is taken from David Neumark's website (https://www.economics.uci.edu/~dneumark/datasets.html). The economic variables are taken from the QWI, which I accessed via the Ithaca Virtual RDC. County adjacency files were downloaded from the Census Bureau (https://www.census.gov/geo/reference/county-adjacency.html).

The file main.do then runs the analyses. The resulting output file containing results is results.dta.

For the incumbency application, the main data file is incumb_final.dta. A .csv version is also provided. This file is drawn from Caughey and Sekhon's (2011) data; see their description of most variables here: https://doi.org/10.7910/DVN/8EYYA2

The key added variables are _IDistancea1-_IDistancea50, which are dummies for inclusion in the 50 comparisons used in the paper. Treated observations (i.e., Democratic wins) with margin of victory below 5 points have each of these variables equal to 1. Control observations have these variables equal to 1 if they fall within the margin of victory range, e.g., _IDistancea9=1 for control observations with Republican margin of victory between 8 and 9 points. Note that these variables are redefined by the code for the analyses of treatment effects away from the discontinuity. Lastly, there is a variable called RepWin which is the treatment variable when treatment is defined as a Republican winning.
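As a rough illustration of how a control observation maps to one of the 50 comparison dummies, the sketch below bins a Republican margin of victory into an index k so that margins between k-1 and k points map to _IDistancea{k}. The function name `distance_bin` is hypothetical, and the treatment of exact boundaries (here, a margin of exactly 8 falls in bin 8) is an assumption; the description does not say which side is inclusive.

```python
import math


def distance_bin(rep_margin: float, n_bins: int = 50):
    """Return the bin index k (1..n_bins) such that the margin lies in
    (k-1, k], or None if the margin is outside the covered range.
    The half-open convention is an assumption, not from the source."""
    if rep_margin <= 0 or rep_margin > n_bins:
        return None
    return math.ceil(rep_margin)


# A Republican margin between 8 and 9 points falls in bin 9.
print(distance_bin(8.5))  # 9
```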

The file sr_incumb.do then performs the analysis.

Please contact me with any questions at slichter@binghamton.edu.
Website Phishing Dataset
kaggle.com
zip
Updated May 4, 2019
Cite
Ahmad Noor (2019). Website Phishing Dataset [Dataset]. https://www.kaggle.com/datasets/ahmednour/website-phishing-data-set/discussion
Explore at:
zip(5211 bytes)Available download formats
Dataset updated
May 4, 2019
Authors
Ahmad Noor
License
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.htmlhttp://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Description
ISSR CS602 Machine Learning - Project

Website Phishing Data Set Download: Data Folder, Data Set Description

Abstract:

Data Set Characteristics: Multivariate
Number of Instances: 1353
Attribute Characteristics: Integer
Number of Attributes: 10
Associated Tasks: Classification
Number of Web Hits: 54880

Source: Dataset url

Neda Abdelhamid Auckland Institute of Studies nedah '@' ais.ac.nz

Data Set Information:

The phishing problem is considered a vital issue in the ".COM" industry, especially e-banking and e-commerce, given the number of online transactions involving payments. We have identified different features related to legitimate and phishy websites and collected 1353 different websites from different sources. Phishing websites were collected from the Phishtank data archive (www.phishtank.com), which is a free community site where users can submit, verify, track and share phishing data. The legitimate websites were collected from Yahoo and starting point directories using a web script developed in PHP. The PHP script was plugged with a browser and we collected 548 legitimate websites out of 1353 websites. There are 702 phishing URLs and 103 suspicious URLs.

When a website is considered SUSPICIOUS that means it can be either phishy or legitimate, meaning the website held some legit and phishy features.

Attribute Information:

URL Anchor
Request URL
SFH
URL Length
Having '@'
Prefix/Suffix
IP
Sub Domain
Web traffic
Domain age
Class

The collected features hold the categorical values "Legitimate", "Suspicious" and "Phishy"; these values have been replaced with the numerical values 1, 0 and -1 respectively. Details of each feature are given in the research paper mentioned below.
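The label encoding and class counts given in this description can be sketched in Python as follows. The names `LABEL_TO_CODE` and `encode_labels` are illustrative only; the counts are the ones quoted in the description.

```python
# Mapping stated in the description: Legitimate -> 1, Suspicious -> 0,
# Phishy -> -1.
LABEL_TO_CODE = {"Legitimate": 1, "Suspicious": 0, "Phishy": -1}


def encode_labels(labels):
    """Replace categorical labels with their numeric codes."""
    return [LABEL_TO_CODE[label] for label in labels]


# Class counts quoted in the description.
class_counts = {"Legitimate": 548, "Phishy": 702, "Suspicious": 103}

print(encode_labels(["Legitimate", "Phishy", "Suspicious"]))  # [1, -1, 0]
print(sum(class_counts.values()))  # 1353
```

The three class counts add up to the 1353 instances reported for the data set.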
Natural Lands Trust - Public Preserves 2015
search.dataone.org
hydroshare.org
Updated Dec 5, 2021
Cite
Natural Lands Trust (2021). Natural Lands Trust - Public Preserves 2015 [Dataset]. https://search.dataone.org/view/sha256%3A9b801aa05a3606986e342ee22f261dda474407e3b0a15c5078395d41e37a1b9b
Explore at:
Dataset updated
Dec 5, 2021
Dataset provided by
Hydroshare
Authors
Natural Lands Trust
Description
These are the NLT preserves that are currently open to the public (2015). Trails, parking areas, and other information can be found on the NLT website: https://natlands.org/category/preserves-to-visit/list-of-preserves/

This data is hosted at, and may be downloaded or accessed from PASDA, the Pennsylvania Spatial Data Access Geospatial Data Clearinghouse http://www.pasda.psu.edu/uci/DataSummary.aspx?dataset=601
Online Retail II
kaggle.com
zip
Updated Sep 13, 2024
Cite
Doğukan Vatansever (2024). Online Retail II [Dataset]. https://www.kaggle.com/datasets/dvaser/online-retail-ii
Explore at:
zip(45407886 bytes)Available download formats
Dataset updated
Sep 13, 2024
Authors
Doğukan Vatansever
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This Online Retail II data set contains all the transactions occurring for a UK-based and registered, non-store online retailer between 01/12/2009 and 09/12/2011.

The company mainly sells unique all-occasion gift-ware. Many customers of the company are wholesalers.

Online Retail II Official Website: Links
YouTube Spam Collection Data Set
kaggle.com
zip
Updated Mar 18, 2019
Cite
Lakshmipathi N (2019). YouTube Spam Collection Data Set [Dataset]. https://www.kaggle.com/lakshmi25npathi/images
Explore at:
zip(325284 bytes)Available download formats
Dataset updated
Mar 18, 2019
Authors
Lakshmipathi N
Area covered
YouTube
Description
Abstract: It is a public set of comments collected for spam research. It has five datasets composed of 1,956 real messages extracted from five videos that were among the 10 most viewed during the collection period.

Data Set Information: The table below lists the datasets, the YouTube video ID, the number of samples in each class and the total number of samples per dataset.

Dataset     YouTube ID    # Spam   # Ham   Total
Psy         9bZkp7q19f0   175      175     350
KatyPerry   CevxZvSJLk8   175      175     350
LMFAO       KQ6zr6kCPj8   236      202     438
Eminem      uelHwf8o7_U   245      203     448
Shakira     pRpeEdMmmQ0   174      196     370

Note: the chronological order of the comments was kept.

The collection is composed by one CSV file per dataset, where each line has the following attributes: COMMENT_ID,AUTHOR,DATE,CONTENT,TAG
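A file with the attributes listed above could be read with Python's standard `csv` module, roughly as sketched here. The two sample rows are invented for illustration, and the reading of TAG as 1 for spam and 0 for ham is an assumption; check the actual files before relying on it.

```python
import csv
import io
from collections import Counter

# Invented sample in the stated column layout (not real data).
SAMPLE = """COMMENT_ID,AUTHOR,DATE,CONTENT,TAG
c1,alice,2013-11-07T06:20:48,"Check out my channel, please!",1
c2,bob,2013-11-07T12:37:15,Great song,0
"""


def count_tags(handle):
    """Count rows per TAG value in a CSV file-like object."""
    reader = csv.DictReader(handle)
    return Counter(row["TAG"] for row in reader)


tag_counts = count_tags(io.StringIO(SAMPLE))
print(tag_counts["1"], tag_counts["0"])  # 1 1
```

For a real file, the same function can be called on `open(path, newline="", encoding="utf-8")`; `csv.DictReader` handles quoted commas inside CONTENT.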

For further details, please visit this website: http://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection#
Thyroid Disease Data Set
kaggle.com
zip
Updated Jul 13, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yasir Hussein Shakir (2025). Thyroid Disease Data Set [Dataset]. https://www.kaggle.com/yasserhessein/thyroid-disease-data-set
Explore at:
zip(96193 bytes)Available download formats
Dataset updated
Jul 13, 2025
Authors
Yasir Hussein Shakir
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description

Source:

Ross Quinlan

Data Set Information:

From Garavan Institute documentation, as given by Ross Quinlan:

6 databases from the Garavan Institute in Sydney, Australia. Approximately the following for each database: 2800 training (data) instances and 972 test instances; plenty of missing data; 29 or so attributes, either Boolean or continuously-valued.

2 additional databases, also from Ross Quinlan, are also here: Hypothyroid.data and sick-euthyroid.data. Quinlan believes that these databases have been corrupted. Their format is highly similar to the other databases.

1 more database of 9172 instances that cover 20 classes, and a related domain theory.

Another thyroid database from Stefan Aeberhard: 3 classes, 215 instances, 5 attributes. No missing values.

A Thyroid database suited for training ANNs: 3 classes; 3772 training instances, 3428 testing instances. Includes cost data (donated by Peter Turney).

Attribute Information:

N/A

Link:

https://archive.ics.uci.edu/ml/datasets/Thyroid+Disease

R. Quinlan. "Thyroid Disease," UCI Machine Learning Repository, 1986. [Online]. Available: https://doi.org/10.24432/C5D010.



Data from: Online Retail Data Set

From the UCI website (https://archive.ics.uci.edu/ml/datasets/Online+Retail#)
