https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Sahar Pourahmad
Released under CC0: Public Domain
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Collection of two datasets from the UCI website that could be used for structure learning tasks. Includes datasets regarding
Size: Two datasets of sizes 9471*17 and 2458285*68, respectively
Number of features: 15-68
Ground truth: No
Type of Graph: No ground truth
More information about the datasets is contained in the dataset_description.html files.
Traffic analytics, rankings, and competitive metrics for uci.edu as of September 2025
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains the responses of a gas multisensor device deployed in the field in an Italian city. Hourly response averages are recorded along with gas concentration references from a certified analyzer. The goal here is to build a model that predicts the relative humidity (RH) from the air quality factors provided. This is a typical regression problem that can be approached in many ways.
Data Set Information:
The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. The device was located in the field in a significantly polluted area, at road level, within an Italian city. Data were recorded from March 2004 to February 2005 (one year), representing the longest freely available recordings of field-deployed air quality chemical sensor device responses. Ground-truth hourly averaged concentrations for CO, Non-Methane Hydrocarbons (NMHC), Benzene, Total Nitrogen Oxides (NOx) and Nitrogen Dioxide (NO2) were provided by a co-located reference certified analyzer. Evidence of cross-sensitivities as well as both concept and sensor drifts is present, as described in De Vito et al., Sens. and Act. B, Vol. 129, 2, 2008 (citation required), eventually affecting the sensors' concentration estimation capabilities. Missing values are tagged with the value -200. This dataset can be used exclusively for research purposes. Commercial purposes are fully excluded.
0 Date (DD/MM/YYYY)
1 Time (HH.MM.SS)
2 True hourly averaged concentration CO in mg/m^3 (reference analyzer)
3 PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
4 True hourly averaged overall Non-Methane Hydrocarbons concentration in microg/m^3 (reference analyzer)
5 True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)
6 PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
7 True hourly averaged NOx concentration in ppb (reference analyzer)
8 PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
9 True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
10 PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
11 PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
12 Temperature in °C
13 Relative Humidity (%)
14 AH Absolute Humidity
Source: Saverio De Vito (saverio.devito '@' enea.it), ENEA - National Agency for New Technologies, Energy and Sustainable Economic Development https://archive.ics.uci.edu/ml/datasets/Air+Quality
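Because missing values are tagged with -200 and the date and time arrive as separate DD/MM/YYYY and HH.MM.SS fields, a cleaning step is usually needed before any regression on RH. The following is a minimal sketch using only the Python standard library; the helper names and the three sample rows are hypothetical and are not taken from the actual file.

```python
from datetime import datetime

# The description says missing values are tagged with -200.
MISSING_SENTINEL = -200


def parse_timestamp(date_text, time_text):
    """Combine a DD/MM/YYYY date and an HH.MM.SS time into one datetime."""
    return datetime.strptime(f"{date_text} {time_text}", "%d/%m/%Y %H.%M.%S")


def clean_value(raw):
    """Return the reading as a float, or None when it carries the sentinel."""
    value = float(raw)
    return None if value == MISSING_SENTINEL else value


def mean_ignoring_missing(values):
    """Average the readings that are present; None if every reading is missing."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


# Hypothetical rows (date, time, CO reading), for illustration only.
rows = [
    ("10/03/2004", "18.00.00", "2.6"),
    ("10/03/2004", "19.00.00", "-200"),
    ("10/03/2004", "20.00.00", "2.2"),
]
timestamps = [parse_timestamp(d, t) for d, t, _ in rows]
co_values = [clean_value(v) for _, _, v in rows]
co_mean = mean_ignoring_missing(co_values)
print(timestamps[0].isoformat(), co_values, co_mean)
```

Mapping the sentinel to None rather than leaving -200 in place keeps it from being averaged in as if it were a real concentration.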
https://creativecommons.org/publicdomain/zero/1.0/
100 volunteers provided a semen sample, analyzed according to the WHO 2010 criteria. Sperm concentration is related to socio-demographic data, environmental factors, health status, and life habits.
Attribute Information:
Season in which the analysis was performed. 1) winter, 2) spring, 3) Summer, 4) fall. (-1, -0.33, 0.33, 1)
Age at the time of analysis. 18-36 (0, 1)
Childhood diseases (i.e., chicken pox, measles, mumps, polio) 1) yes, 2) no. (0, 1)
Accident or serious trauma 1) yes, 2) no. (0, 1)
Surgical intervention 1) yes, 2) no. (0, 1)
High fevers in the last year 1) less than three months ago, 2) more than three months ago, 3) no. (-1, 0, 1)
Frequency of alcohol consumption 1) several times a day, 2) every day, 3) several times a week, 4) once a week, 5) hardly ever or never (0, 1)
Smoking habit 1) never, 2) occasional 3) daily. (-1, 0, 1)
Number of hours spent sitting per day, 1-16 (0, 1)
Output: Diagnosis normal (N), altered (O)
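The values in parentheses after each attribute suggest that categories were mapped to evenly spaced codes and that numeric ranges were rescaled to the interval (0, 1). The sketch below illustrates that reading for the season and age attributes. The min-max formula and the function name are assumptions; the listing does not state how the scaling was actually done.

```python
# Season codes in the order listed in the attribute description.
SEASON_CODES = {"winter": -1.0, "spring": -0.33, "summer": 0.33, "fall": 1.0}


def min_max_scale(value, low, high):
    """Rescale value from [low, high] to [0, 1] (assumed min-max scaling)."""
    if not low <= value <= high:
        raise ValueError(f"{value} is outside the range {low}-{high}")
    return (value - low) / (high - low)


# Age is described as 18-36 mapped onto (0, 1).
print(SEASON_CODES["spring"], min_max_scale(27, 18, 36))
```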
Source:
David Gil, dgil '@' dtic.ua.es, Lucentia Research Group, Department of Computer Technology, University of Alicante
Jose Luis Girela, girela '@' ua.es, Department of Biotechnology, University of Alicante
Relevant Papers:
David Gil, Jose Luis Girela, Joaquin De Juan, M. Jose Gomez-Torres, and Magnus Johnsson. Predicting seminal quality with artificial intelligence methods. Expert Systems with Applications, 39(16):12564 – 12573, 2012
Citation Request:
David Gil, Jose Luis Girela, Joaquin De Juan, M. Jose Gomez-Torres, and Magnus Johnsson. Predicting seminal quality with artificial intelligence methods. Expert Systems with Applications, 39(16):12564 – 12573, 2012
Traffic analytics, rankings, and competitive metrics for uci.org as of September 2025
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Introduction: The dataset used for this experiment is real and authentic. It was acquired from the UCI Machine Learning Repository website [13]. The title of the dataset is 'Crime and Communities'. It combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR [13]. The dataset contains 147 attributes and 2216 instances.
The per capita crimes variables were calculated using population values included in the 1995 FBI data (which differ from the 1990 Census values).
The variables included in the dataset describe the community, such as the percent of the population considered urban and the median family income, and law enforcement, such as the per capita number of police officers and the percent of officers assigned to drug units. The crime attributes (N=18) that could be predicted are the 8 crimes considered 'Index Crimes' by the FBI (Murders, Rape, Robbery, ....), per capita (actually per 100,000 population) versions of each, and Per Capita Violent Crimes and Per Capita Nonviolent Crimes.
predictive variables: 125
non-predictive variables: 4
potential goal/response variables: 18
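The "per capita" crime variables are stated to be rates per 100,000 population, computed from the 1995 FBI population values. A minimal sketch of that arithmetic follows; the function name and the example figures are hypothetical. It also checks that the three variable groups add up to the 147 attributes quoted above.

```python
def per_100k(count, population):
    """Convert a raw crime count into a rate per 100,000 population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count * 100_000 / population


# Hypothetical community: 12 recorded murders, population 150,000.
print(per_100k(12, 150_000))

# Variable groups listed in the description.
VARIABLE_GROUPS = {"predictive": 125, "non-predictive": 4, "goal/response": 18}
print(sum(VARIABLE_GROUPS.values()))
```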
http://archive.ics.uci.edu/ml/datasets/Communities%20and%20Crime%20Unnormalized
U. S. Department of Commerce, Bureau of the Census, Census Of Population And Housing 1990 United States: Summary Tape File 1a & 3a (Computer Files),
U.S. Department Of Commerce, Bureau Of The Census Producer, Washington, DC and Inter-university Consortium for Political and Social Research Ann Arbor, Michigan. (1992)
U.S. Department of Justice, Bureau of Justice Statistics, Law Enforcement Management And Administrative Statistics (Computer File) U.S. Department Of Commerce, Bureau Of The Census Producer, Washington, DC and Inter-university Consortium for Political and Social Research Ann Arbor, Michigan. (1992)
U.S. Department of Justice, Federal Bureau of Investigation, Crime in the United States (Computer File) (1995)
Data available in the dataset may not act as a complete source of information for identifying factors that contribute to more violent and non-violent crimes as many relevant factors may still be missing.
However, I would like to try to answer the following questions.
Did the number of vacant and occupied houses, and the length of time houses were vacant, contribute to any significant change in violent and non-violent crime rates in communities?
How has unemployment changed the crime rate (violent and non-violent) in the communities?
Were people from a particular age group more vulnerable to crime?
Does ethnicity play a role in crime rate?
Has education played a role in bringing down the crime rate?
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This folder contains the data from the Phishing Websites dataset provided in [1]. All the features are categorical and were preprocessed into integer values. The data can be downloaded from https://archive.ics.uci.edu/dataset/327/phishing+websites. There are 11055 samples with 30 features. Websites belong to 2 domains: websites that use an IP address instead of the domain name in the URL, and websites that use the domain name in the URL. For reference, please refer to: [1] R. Mohammad, F. Thabtah, L. Mccluskey. An assessment of features related to phishing websites using an automated technique. In International Conference for Internet Technology and Secured Transactions, 2012.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Coupled Model Intercomparison Project Phase 6 (CMIP6) datasets. These data include all datasets published for 'CMIP6.PAMIP.UCI.E3SM-1-0.pdSST-pdSICSIT' with the full Data Reference Syntax following the template 'mip_era.activity_id.institution_id.source_id.experiment_id.member_id.table_id.variable_id.grid_label.version'.
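The dataset identifier quoted above fills only the leading fields of the Data Reference Syntax template. As a rough illustration, the sketch below pairs each dot-separated component with its template field. The function name is hypothetical, and it assumes that no component contains a dot.

```python
DRS_TEMPLATE = (
    "mip_era.activity_id.institution_id.source_id.experiment_id."
    "member_id.table_id.variable_id.grid_label.version"
)


def parse_drs(dataset_id, template=DRS_TEMPLATE):
    """Map the components of a (possibly partial) DRS identifier to field names."""
    fields = template.split(".")
    parts = dataset_id.split(".")
    if len(parts) > len(fields):
        raise ValueError("identifier has more components than the template")
    # zip stops at the shorter sequence, so trailing fields are simply absent.
    return dict(zip(fields, parts))


facets = parse_drs("CMIP6.PAMIP.UCI.E3SM-1-0.pdSST-pdSICSIT")
for name, value in facets.items():
    print(f"{name}: {value}")
```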
The E3SM 1.0 (Energy Exascale Earth System Model) climate model, released in 2018, includes the following components: aerosol: MAM4 with resuspension, marine organics, and secondary organics (same grid as atmos), atmos: EAM (v1.0, cubed sphere spectral-element grid; 5400 elements with p=3; 1 deg average grid spacing; 90 x 90 x 6 longitude/latitude/cubeface; 72 levels; top level 0.1 hPa), atmosChem: Troposphere specified oxidants for aerosols. Stratosphere linearized interactive ozone (LINOZ v2) (same grid as atmos), land: ELM (v1.0, cubed sphere spectral-element grid; 5400 elements with p=3; 1 deg average grid spacing; 90 x 90 x 6 longitude/latitude/cubeface; satellite phenology mode), MOSART (v1.0, 0.5 degree latitude/longitude grid), ocean: MPAS-Ocean (v6.0, oEC60to30 unstructured SVTs mesh with 235160 cells and 714274 edges, variable resolution 60 km to 30 km; 60 levels; top grid cell 0-10 m), seaIce: MPAS-Seaice (v6.0, same grid as ocean). The model was run by the Department of Earth System Science, University of California Irvine, Irvine, CA 92697, USA (UCI) in native nominal resolutions: aerosol: 100 km, atmos: 100 km, atmosChem: 100 km, land: 100 km, ocean: 50 km, seaIce: 50 km.
Project: These data have been generated as part of the internationally-coordinated Coupled Model Intercomparison Project Phase 6 (CMIP6; see also GMD Special Issue: http://www.geosci-model-dev.net/special_issue590.html). The simulation data provides a basis for climate research designed to answer fundamental science questions and serves as resource for authors of the Sixth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC-AR6).
CMIP6 is a project coordinated by the Working Group on Coupled Modelling (WGCM) as part of the World Climate Research Programme (WCRP). Phase 6 builds on previous phases executed under the leadership of the Program for Climate Model Diagnosis and Intercomparison (PCMDI) and relies on the Earth System Grid Federation (ESGF) and the Centre for Environmental Data Analysis (CEDA) along with numerous related activities for implementation. The original data is hosted and partially replicated on a federated collection of data nodes, and most of the data relied on by the IPCC is being archived for long-term preservation at the IPCC Data Distribution Centre (IPCC DDC) hosted by the German Climate Computing Center (DKRZ).
The project includes simulations from about 120 global climate models and around 45 institutions and organizations worldwide. - Project website: https://pcmdi.llnl.gov/CMIP6.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data is gathered from publicly available race results on the UCI website. It includes race results in all cycling disciplines (road, track, mountain bike, BMX, cyclo-cross), classifications, and data on athletes (such as name, age, and country of affiliation). The data is stored in a csv file.
Korf, Jesse (2023), “UCI race results 2010-2020”, Mendeley Data, V2, doi: 10.17632/mkzht8948x.2
Photo by Mikkel Bech on Unsplash
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This is the replication Stata code and log file for the Journal of Politics research note, "The Gender Readings Gap in Political Science Graduate Training," by Heidi Hardt, Amy Erica Smith, Hannah June Kim and Philippe Meister. For our searchable database, see our website here: http://gradtraining.socsci.uci.edu/
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Coupled Model Intercomparison Project Phase 6 (CMIP6) datasets. These data include all datasets published for 'CMIP6.PAMIP.NCAR.CESM1-WACCM-SC' with the full Data Reference Syntax following the template 'mip_era.activity_id.institution_id.source_id.experiment_id.member_id.table_id.variable_id.grid_label.version'.
The Community Earth System Model 1, with the Whole Atmosphere Community Climate Model and Specified Chemistry climate model, released in 2011, includes the following components: aerosol: MOZART-specified (same grid as atmos), atmos: WACCM4 (1.9x2.5 finite volume grid; 144 x 96 longitude/latitude; 66 levels; top level 5.9e-06 mb), atmosChem: MOZART-specified (same grid as atmos), land: CLM4.0, ocean: POP2 (320 x 384 longitude/latitude; 60 levels; top grid cell 0-10 m), ocnBgchem: BEC (same grid as ocean), seaIce: CICE4 (same grid as ocean). The model was run by the Department of Earth System Science, University of California Irvine, Irvine, CA 92697, USA (UCI) in native nominal resolutions: aerosol: 250 km, atmos: 250 km, atmosChem: 250 km, land: 250 km, ocean: 100 km, ocnBgchem: 100 km, seaIce: 100 km.
By UCI [source]
The Breast Cancer Wisconsin (Prognostic) dataset brings together data collected from hundreds of breast cancer cases, making it valuable for predictive prognosis. It includes 30 features, such as radius, texture, area, compactness and concavity, computed from a digitized image of a fine needle aspirate (FNA) of the breast mass; they describe characteristics of the cell nuclei present in each case. It also includes outcomes (recurrence or nonrecurrence) and time-to-recurrence information for the cases that relapse.
This dataset was created by Dr. William H. Wolberg at the University of Wisconsin Clinical Sciences Center, together with W. Nick Street and Olvi L. Mangasarian at the university's Computer Sciences Department, all credited with creating decision tree construction systems that use linear programming models to predict disease recurrence.
The data is freely available through the UW CS ftp server or on Kaggle's website, giving researchers access to breast cancer prognosis data derived from images of FNA tests conducted on masses in diagnosed patients, with a set of features and outcomes for both recurrent and nonrecurrent cases. Papers such as 'An inductive learning approach to prognostic prediction' by W. N. Street et al. have used this database extensively, showing how artificial neural networks can be applied to predictive tasks. These tested ideas help readers understand how decisions are made when predicting breast cancer outcomes, and how a predictive model built on this dataset can improve patient care processes.
This dataset is designed to improve the prognosis of breast cancer using machine learning algorithms. The data consists of a time series of patient symptoms and various medical parameters, such as tumor size and malignancy, that can be used by algorithms to predict diagnosis and prognosis outcomes. Here are some steps on how to use this dataset:
Pre-process and clean the data: Since the dataset contains incomplete or missing values across various parameters, it is important to clean and pre-process the data before applying any machine learning algorithm (MLA). This includes deciding which values need imputation, standardizing features for better performance, encoding categorical variables, and normalizing numerical values.
Choose an appropriate MLA: Depending on your goal with this dataset (for example, reliable classification results or weighted predictions based on factors), there are a variety of MLAs to select from; examples include logistic regression classifiers, least squares support vector machines (SVM), neural networks, nonsmooth optimization algorithms like A-Optimality, or global optimization methods such as extracting M-of-N rule sets from trained neural nets. It would be wise to read up on each algorithm to determine which one best meets your needs before experimenting with the dataset itself.
Train the model using your selected MLA: Once you have identified the MLA that best fits your desired outcome, or decided to experiment with multiple approaches, turn back to the data and run experiments, training models and examining outcomes through cross-validation methods such as k-fold splitting. Then test the trained models against validation sets taken from subsets of the original dataset hosted on Kaggle to measure performance across the parameter combinations relevant to predicting breast cancer diagnostic and/or prognostic outcomes. Any trends revealed during these experiments will help inform future model selection when implementing a predictive solution that fits specific user requirements, especially where a particular MLA is not tailored to the purpose, so guaranteeing ac...
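Two of the steps above, imputing missing values and splitting the data into k folds for cross-validation, can be sketched without any external library. This is only an illustration under stated assumptions: missing entries are represented as None, the folds are contiguous and unshuffled, and both function names are hypothetical rather than part of any particular toolkit.

```python
def mean_impute(column):
    """Replace None entries with the mean of the values that are present."""
    present = [v for v in column if v is not None]
    if not present:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in column]


def k_fold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs over range(n_samples)."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train_idx, test_idx))
        start += size
    return folds


print(mean_impute([1.0, None, 3.0]))
for train_idx, test_idx in k_fold_indices(10, 3):
    print(len(train_idx), test_idx)
```

In practice the row order would normally be shuffled before splitting, and the imputation mean would be computed on the training fold only so that no information leaks from the test fold.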
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains records related to red and white variants of the Portuguese Vinho Verde wine. It contains information from 1599 red wine samples and 4898 white wine samples. Input variables in the data set consist of the type of wine (either red or white wine) and metrics from objective tests (e.g. acidity levels, pH values, ABV, etc.), while the target/output variable is a numerical score based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Due to privacy and logistic issues, there is no data about grape types, wine brand, and wine selling price.
This data set is a combined version of the two separate files (distinct red and white wine data sets) originally shared in the UCI Machine Learning Repository.
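Combining the two source files amounts to tagging each row with its wine type and concatenating, while each quality score is described as the median of at least 3 expert grades. The sketch below illustrates both ideas with made-up rows; the function names and sample values are hypothetical and do not reproduce the uploader's actual procedure.

```python
from statistics import median


def combine_wines(red_rows, white_rows):
    """Concatenate both tables, adding a 'type' field to every row."""
    combined = []
    for wine_type, rows in (("red", red_rows), ("white", white_rows)):
        for row in rows:
            combined.append({"type": wine_type, **row})
    return combined


def quality_score(expert_grades):
    """Median of the expert grades; the description requires at least 3."""
    if len(expert_grades) < 3:
        raise ValueError("at least 3 expert grades are expected")
    return median(expert_grades)


# Made-up rows, for illustration only.
red = [{"pH": 3.51, "alcohol": 9.4}]
white = [{"pH": 3.00, "alcohol": 8.8}, {"pH": 3.30, "alcohol": 9.5}]
wines = combine_wines(red, white)
print(len(wines), wines[0]["type"], quality_score([5, 6, 7]))
```

Applied to the full files, the combined table would hold 1599 + 4898 = 6497 rows.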
The following are some existing data sets on Kaggle from the same source (with notable differences from this data set):
- Red Wine Quality (contains red wine data only)
- Wine Quality (combination of red and white wine data but with some values randomly removed)
- Wine Quality (red and white wine data not combined)
Input variables:
1 - type of wine: type of wine (categorical: 'red', 'white')
(continuous variables based on physicochemical tests)
2 - fixed acidity: The acids that naturally occur in the grapes used to ferment the wine and carry over into the wine. They mostly consist of tartaric, malic, citric or succinic acid, and they do not evaporate easily. (g / dm^3)
3 - volatile acidity: Acids that evaporate at low temperatures—mainly acetic acid which can lead to an unpleasant, vinegar-like taste at very high levels. (g / dm^3)
4 - citric acid: Citric acid is used as an acid supplement which boosts the acidity of the wine. It's typically found in small quantities and can add 'freshness' and flavor to wines. (g / dm^3)
5 - residual sugar: The amount of sugar remaining after fermentation stops. It's rare to find wines with less than 1 gram/liter. Wines with a residual sugar level greater than 45 grams/liter are considered sweet. On the other end of the spectrum, a wine that does not taste sweet is considered dry. (g / dm^3)
6 - chlorides: The amount of chloride salts (sodium chloride) present in the wine. (g / dm^3)
7 - free sulfur dioxide: The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. All else constant, the higher the free sulfur dioxide content, the stronger the preservative effect. (mg / dm^3)
8 - total sulfur dioxide: The amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine. (mg / dm^3)
9 - density: The density of the wine, which depends on the percent alcohol and sugar content; it's typically similar to but higher than that of water (wine is 'thicker'). (g / cm^3)
10 - pH: A measure of the acidity of wine; most wines are between 3-4 on the pH scale. The lower the pH, the more acidic the wine is; the higher the pH, the less acidic the wine. (The pH scale technically is a logarithmic scale that measures the concentration of free hydrogen ions floating around in your wine. Each point of the pH scale is a factor of 10. This means a wine with a pH of 3 is 10 times more acidic than a wine with a pH of 4)
11 - sulphates: Amount of potassium sulphate as a wine additive which can contribute to sulfur dioxide gas (SO2) levels; it acts as an antimicrobial and antioxidant agent. (g / dm^3)
12 - alcohol: How much alcohol is contained in a given volume of wine (ABV). Wine generally contains between 5–15% alcohol. (% by volume)
Output variable:
13 - quality: score between 0 (very bad) and 10 (very excellent) by wine experts
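The note on pH in the variable list says each point on the scale is a factor of 10. Since pH is the negative base-10 logarithm of the hydrogen ion concentration, the ratio between two wines can be worked out directly, as in this small sketch (the function names are hypothetical):

```python
def hydrogen_ion_concentration(ph):
    """Hydrogen ion concentration in mol/L implied by a pH value."""
    return 10 ** (-ph)


def acidity_ratio(ph_a, ph_b):
    """How many times higher the hydrogen ion concentration is at ph_a than at ph_b."""
    return 10 ** (ph_b - ph_a)


# A wine at pH 3 compared with a wine at pH 4.
print(acidity_ratio(3, 4))
```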
Source: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
Data credit goes to UCI. Visit their website to access the original data set directly: https://archive.ics.uci.edu/ml/datasets/wine+quality
So much about wine making remains elusive—taste is very subjective, making it extremely challenging to predict exactly how consumers will react to a certain bottle of wine. There is no doubt that winemakers, connoisseurs, and scientists have greatly contributed their expertise to ...
Replication files for David Slichter, "The Employment Effects of the Minimum Wage: A Selection Ratio Approach to Measuring Treatment Effects," Journal of Applied Econometrics, forthcoming
Firstly, I’ve provided a .do file called sr.do which contains general code for implementing the selection ratio approach, with detailed instructions written as comments in the code.
For the minimum wage application, the main data file is mw_final.dta. A .csv version is also provided. Observations are a county in a time period. I have added self-explanatory variable labels for most variables. A few variables warrant a clearer explanation:
adj1-adj14: List of FIPS codes of all counties which are adjacent to the county in question. Each variable holds one adjacent county, and counties with fewer than 14 neighbors will have missing values for some of these variables.
change, logchange: Minimum wage this quarter - minimum wage last quarter, measured either in dollars or in logs.
time, t1-t108: The variable "time" converts years and quarters into a univariate time period, with time=1 in 1990Q1 and time=108 in 2016Q4. t1-t108 are indicators for each of these time periods.
lnemp_1418, lnearnbeg_1418, lnsep_1418, lnhira_1418, lnchurn_1418: Logs of employment, earnings, separations, hires, and churn, respectively, for 14-18 year olds.
gt1-gt6: Dummies for inclusion in each of the six comparisons used for the main (i.e., not spillover-robust) analysis. All treated counties which neighbor a control county take value 1 for each of these variables; all other treated counties take value 0. Among control counties, gt1=1 if the county neighbors a treated county and 0 otherwise, gt2=1 if the county has gt1=0 but neighbors a gt1=1 county, gt3=1 if county has gt1=gt2=0 but neighbors a gt2=1 county, etc.
h2-h6: Dummies for inclusion in each of the first spillover-robust (i.e., excluding border counties only) comparisons. Among control counties, h2-h6 are equal to gt2-gt6. Among treated counties, h2-h6 are equal to 1 if the treated county has gt1=0 but borders a gt1=1 county, and 0 otherwise.
k3-k6: Dummies for inclusion in each of the second spillover-robust (i.e., excluding two layers) comparisons. Among control counties, these variables are equal to gt3-gt6. Among treated counties, all observations take value 1 except those with gt1=1 or h2=1.
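Two of the definitions above lend themselves to a short illustration: the "time" index (time=1 in 1990Q1, time=108 in 2016Q4) and the layering of control counties behind gt1-gt6, where a control county belongs to layer k when it is k adjacency steps from the nearest treated county. The sketch below is not the author's Stata code. It assumes a single cross-section with a fixed set of treated counties and a symmetric adjacency map, and the function names and county labels are hypothetical.

```python
from collections import deque


def time_index(year, quarter):
    """Map a year and quarter to the period number, with 1990Q1 -> 1."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3 or 4")
    return (year - 1990) * 4 + quarter


def control_layers(adjacency, treated, max_layer=6):
    """Return {county: k} for control counties k steps from a treated county."""
    distance = {county: 0 for county in treated}
    queue = deque(treated)
    while queue:
        county = queue.popleft()
        if distance[county] == max_layer:
            continue  # do not expand beyond the last comparison layer
        for neighbor in adjacency.get(county, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[county] + 1
                queue.append(neighbor)
    return {county: d for county, d in distance.items() if d > 0}


# Hypothetical chain of counties A - B - C - D, with A treated.
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
layers = control_layers(adjacency, treated={"A"})
print(time_index(1990, 1), time_index(2016, 4), layers)
```

A breadth-first search from all treated counties at once gives each control county its distance to the nearest treated county, which matches the recursive description of gt1, gt2, gt3 and so on.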
The data sources are as follows. The minimum wage law series is taken from David Neumark's website (https://www.economics.uci.edu/~dneumark/datasets.html). The economic variables are taken from the QWI, which I accessed via the Ithaca Virtual RDC. County adjacency files were downloaded from the Census Bureau (https://www.census.gov/geo/reference/county-adjacency.html).
The file main.do then runs the analyses. The resulting output file containing results is results.dta.
For the incumbency application, the main data file is incumb_final.dta. A .csv version is also provided. This file is drawn from Caughey and Sekhon's (2011) data; see their description of most variables here: https://doi.org/10.7910/DVN/8EYYA2
The key added variables are _IDistancea1-_IDistancea50, which are dummies for inclusion in the 50 comparisons used in the paper. Treated observations (i.e., Democratic wins) with margin of victory below 5 points have each of these variables equal to 1. Control observations have these variables equal to 1 if they fall within the margin of victory range, e.g., _IDistancea9=1 for control observations with Republican margin of victory between 8 and 9 points. Note that these variables are redefined by the code for the analyses of treatment effects away from the discontinuity. Lastly, there is a variable called RepWin which is the treatment variable when treatment is defined as a Republican winning.
The file sr_incumb.do then performs the analysis.
Please contact me with any questions at slichter@binghamton.edu.
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
ISSR CS602 Machine Learning - Project
Website Phishing Data Set
| Property | Value |
|---|---|
| Data Set Characteristics | Multivariate |
| Number of Instances | 1353 |
| Attribute Characteristics | Integer |
| Number of Attributes | 10 |
| Associated Tasks | Classification |
| Number of Web Hits | 54880 |
Source: Dataset url
Neda Abdelhamid Auckland Institute of Studies nedah '@' ais.ac.nz
Data Set Information:
The phishing problem is considered a vital issue in the ".COM" industry, especially e-banking and e-commerce, given the number of online transactions involving payments. We have identified different features related to legitimate and phishy websites and collected 1353 different websites from different sources. Phishing websites were collected from the Phishtank data archive (www.phishtank.com), which is a free community site where users can submit, verify, track and share phishing data. The legitimate websites were collected from Yahoo and starting point directories using a web script developed in PHP. The PHP script was plugged into a browser, and we collected 548 legitimate websites out of the 1353 websites. There are 702 phishing URLs and 103 suspicious URLs.
When a website is considered SUSPICIOUS that means it can be either phishy or legitimate, meaning the website held some legit and phishy features.
Attribute Information:
URL Anchor
Request URL
SFH
URL Length
Having '@'
Prefix/Suffix
IP
Sub Domain
Web traffic
Domain age
Class
The collected features hold the categorical values "Legitimate", "Suspicious" and "Phishy"; these values have been replaced with the numerical values 1, 0 and -1 respectively. Details of each feature are given in the research paper mentioned below.
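The label replacement described above is a plain lookup. The sketch below applies it to a few example labels and checks that the class counts quoted in the data set information add up to the 1353 collected websites. The variable and function names are hypothetical.

```python
# Numerical codes as stated in the description.
LABEL_CODES = {"Legitimate": 1, "Suspicious": 0, "Phishy": -1}

# Class counts quoted in the data set information.
CLASS_COUNTS = {"Legitimate": 548, "Phishy": 702, "Suspicious": 103}


def encode_labels(labels):
    """Replace each categorical label with its numerical code."""
    return [LABEL_CODES[label] for label in labels]


encoded = encode_labels(["Phishy", "Legitimate", "Suspicious"])
print(encoded, sum(CLASS_COUNTS.values()))
```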
These are the NLT preserves that are currently open to the public (2015). Trails, parking areas, and other information can be found on the NLT website: https://natlands.org/category/preserves-to-visit/list-of-preserves/
This data is hosted at, and may be downloaded or accessed from PASDA, the Pennsylvania Spatial Data Access Geospatial Data Clearinghouse http://www.pasda.psu.edu/uci/DataSummary.aspx?dataset=601
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: A public set of comments collected for spam research. It has five datasets composed of 1,956 real messages extracted from five videos that were among the 10 most viewed during the collection period.
Data Set Information: The table below lists the datasets, the YouTube video ID, the number of samples in each class and the total number of samples per dataset.
| Dataset | YouTube ID | # Spam | # Ham | Total |
|---|---|---|---|---|
| Psy | 9bZkp7q19f0 | 175 | 175 | 350 |
| KatyPerry | CevxZvSJLk8 | 175 | 175 | 350 |
| LMFAO | KQ6zr6kCPj8 | 236 | 202 | 438 |
| Eminem | uelHwf8o7_U | 245 | 203 | 448 |
| Shakira | pRpeEdMmmQ0 | 174 | 196 | 370 |
Note: the chronological order of the comments was kept.
The collection is composed of one CSV file per dataset, where each line has the following attributes: COMMENT_ID,AUTHOR,DATE,CONTENT,TAG
For further details, please visit this website: http://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection#
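Given the per-line attributes COMMENT_ID, AUTHOR, DATE, CONTENT and TAG, the class balance of one file can be tallied with the standard csv module. The sketch below reads from an in-memory string. The sample rows, the presence of a header row, and the convention that TAG is 1 for spam and 0 for ham are assumptions made for illustration, not facts taken from the listing.

```python
import csv
import io
from collections import Counter

FIELDS = ["COMMENT_ID", "AUTHOR", "DATE", "CONTENT", "TAG"]


def count_tags(csv_text):
    """Count how many comments carry each TAG value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != FIELDS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    return Counter(row["TAG"] for row in reader)


# Made-up comments; quoted fields may contain commas.
sample = (
    "COMMENT_ID,AUTHOR,DATE,CONTENT,TAG\n"
    'c1,alice,2013-11-07T06:20:48,"Great song, love it",0\n'
    'c2,bob,2013-11-07T12:37:15,"Check out my channel, please",1\n'
    "c3,carol,2013-11-08T17:34:21,Nice video,0\n"
)
tag_counts = count_tags(sample)
print(dict(tag_counts))
```

Using csv.DictReader instead of splitting on commas matters here because comment text routinely contains commas and quotes.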
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Ross Quinlan
From Garavan Institute. Documentation: as given by Ross Quinlan. 6 databases from the Garavan Institute in Sydney, Australia. Approximately the following for each database:
2800 training (data) instances and 972 test instances. Plenty of missing data. 29 or so attributes, either Boolean or continuously-valued.
Hypothyroid.data and sick-euthyroid.data: Quinlan believes that these databases have been corrupted. Their format is highly similar to the other databases.
3 classes, 215 instances, 5 attributes. No missing values.
3 classes. 3772 training instances, 3428 testing instances. Includes cost data (donated by Peter Turney).
N/A
https://archive.ics.uci.edu/ml/datasets/Thyroid+Disease
R. Quinlan. "Thyroid Disease," UCI Machine Learning Repository, 1986. [Online]. Available: https://doi.org/10.24432/C5D010.