29 datasets found

Summary of variables of the data set included in the analysis.
plos.figshare.com
xls
Updated Jun 8, 2023
Cite
Owen Bodger; Aidan Byrne; Philip A. Evans; Sarah Rees; Gwen Jones; Claire Cowell; Mike B. Gravenor; Rhys Williams (2023). Summary of variables of the data set included in the analysis. [Dataset]. http://doi.org/10.1371/journal.pone.0027161.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0027161.t001
Dataset updated
Jun 8, 2023
Dataset provided by
PLOS ONE
Authors
Owen Bodger; Aidan Byrne; Philip A. Evans; Sarah Rees; Gwen Jones; Claire Cowell; Mike B. Gravenor; Rhys Williams
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Footnote: (f) denotes a categorical variable, (c) a continuous covariate and (n) a nominal variable.
A Dataset of Water Quality and Related Variables in U.S. Reservoirs
catalog.data.gov
s.cnmilf.com
Updated Jun 13, 2025
Cite
U.S. EPA Office of Research and Development (ORD) (2025). A Dataset of Water Quality and Related Variables in U.S. Reservoirs [Dataset]. https://catalog.data.gov/dataset/a-dataset-of-water-quality-and-related-variables-in-u-s-reservoirs
Dataset updated
Jun 13, 2025
Dataset provided by
United States Environmental Protection Agency (http://www.epa.gov/)
Area covered
United States
Description
This dataset presents a rich collection of physicochemical parameters from 147 reservoirs distributed across the conterminous U.S. One hundred and eight of the reservoirs were selected using a statistical survey design and can provide unbiased inferences to the condition of all U.S. reservoirs. These data could be of interest to local water management specialists or those assessing the ecological condition of reservoirs at the national scale. These data have been reviewed in accordance with U.S. Environmental Protection Agency policy and approved for publication. This dataset is not publicly accessible because: It is too large. It can be accessed through the following means: https://portal-s.edirepository.org/nis/mapbrowse?scope=edi&identifier=2033&revision=1. Format: This dataset presents water quality and related variables for 147 reservoirs distributed across the U.S. Water quality parameters were measured during the summers of 2016, 2018, and 2020 – 2023. Measurements include nutrient concentration, algae abundance, dissolved oxygen concentration, and water temperature, among many others. Dataset includes links to other national and global scale data sets that provide additional variables.
Scaled Dataset.xlsx
figshare.com
xlsx
Updated Dec 23, 2016
Cite
Arash Mohsenijam (2016). Scaled Dataset.xlsx [Dataset]. http://doi.org/10.6084/m9.figshare.4491101.v1
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.4491101.v1
Dataset updated
Dec 23, 2016
Dataset provided by
figshare
Authors
Arash Mohsenijam
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The partner company’s historical data can be used to develop a data-driven prediction model, with project division details as its inputs and project division labor-hours as the desired output. The BIM models contain 42 design features and 1559 records, each record denoting a division of fabrication. The BIM design features are listed in Table 1. Labor-hours spent on each division were extracted from job costing databases and serve as the output parameter in the regression model. Although the variables in Table 1 are all considered related, some are inter-correlated and can be explained by others. For instance, material length and weight are highly correlated; knowing one, the other can be deduced. A variable selection technique is therefore instrumental in removing these inter-correlations analytically. Note that the dataset was linearly scaled before analysis so that sensitive information about the partner company is not revealed, while the patterns and relationships inherent in the data are preserved.
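The claim that linear scaling hides raw values without distorting relationships can be checked on a toy example. The sketch below assumes min-max scaling (the description only says "linearly scaled") and uses made-up numbers; the variable names are illustrative, not columns from the dataset.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def min_max(values):
    """Linearly rescale values to [0, 1] (one possible linear scaling)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical raw values, invented for illustration only.
length = [12.0, 30.0, 18.5, 44.0, 27.0]
labor_hours = [5.0, 11.0, 7.5, 17.0, 9.0]

r_raw = pearson(length, labor_hours)
r_scaled = pearson(min_max(length), min_max(labor_hours))
```

Because each column is transformed by a positive linear map, `r_raw` and `r_scaled` agree up to floating-point error, while the original magnitudes are no longer visible.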
Synthetic Data for an Imaginary Country, Sample, 2023 - World
microdata.worldbank.org
Updated Jul 7, 2023
+ more versions
Cite
Development Data Group, Data Analytics Unit (2023). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://microdata.worldbank.org/index.php/catalog/5906
Dataset updated
Jul 7, 2023
Dataset authored and provided by
Development Data Group, Data Analytics Unit
Time period covered
2023
Area covered
World
Description
Abstract

The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, assets ownership). The data only includes ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.

The full-population dataset (with about 10 million individuals) is also distributed as open data.

Geographic coverage

The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

Analysis unit

Household, Individual

Universe

The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

Kind of data

ssd

Sampling procedure

The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
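The two-stage procedure described above can be sketched as follows. This is a minimal illustration, not the R script distributed with the dataset: the record keys (`stratum`, `ea`, `hh_id`), the rounding rule for the proportional allocation, and the toy sampling frame are all assumptions.

```python
import random
from collections import defaultdict

def draw_sample(households, n_households=8000, per_ea=25, seed=0):
    """Two-stage sample: allocate enumeration areas (EAs) to strata in
    proportion to stratum size, then draw `per_ea` households in each EA.

    `households` is a list of dicts with hypothetical keys
    'stratum', 'ea', and 'hh_id'.
    """
    rng = random.Random(seed)
    n_eas = n_households // per_ea  # e.g. 8000 // 25 = 320 EAs overall

    # Group households by stratum, then by EA within each stratum.
    by_stratum = defaultdict(lambda: defaultdict(list))
    for hh in households:
        by_stratum[hh["stratum"]][hh["ea"]].append(hh)

    total = len(households)
    sample = []
    for stratum, eas in by_stratum.items():
        size = sum(len(members) for members in eas.values())
        # Stage 1: number of EAs proportional to stratum size (rounded).
        k = min(len(eas), round(n_eas * size / total))
        for ea in rng.sample(sorted(eas), k):
            # Stage 2: simple random sample of households within the EA.
            members = eas[ea]
            sample.extend(rng.sample(members, min(per_ea, len(members))))
    return sample

# Toy frame: 30 urban and 10 rural EAs with 50 households each.
frame = [
    {"stratum": s, "ea": f"{s}-{e}", "hh_id": f"{s}-{e}-{h}"}
    for s, n_ea in (("urban", 30), ("rural", 10))
    for e in range(n_ea)
    for h in range(50)
]
sample = draw_sample(frame, n_households=200, per_ea=25)
```

With simple rounding, the allocated number of EAs can differ slightly from the target when stratum shares are not round numbers; a production script would need an explicit rule for distributing the remainder.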

Mode of data collection

other

Research instrument

The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.

Cleaning operations

The synthetic data generation process included a set of "validators" (consistency checks, based on which synthetic observations were assessed and rejected/replaced when needed). Also, some post-processing was applied to the data to result in the distributed data files.

Response rate

This is a synthetic dataset; the "response rate" is 100%.
Data from: A Sensitivity Analysis of Methodological Variables Associated...
catalog.data.gov
data.nist.gov
Updated Dec 15, 2023
+ more versions
Cite
National Institute of Standards and Technology (2023). A Sensitivity Analysis of Methodological Variables Associated with Microbiome Measurements [Dataset]. https://catalog.data.gov/dataset/a-sensitivity-analysis-of-methodological-variables-associated-with-microbiome-measurements-83f38
Dataset updated
Dec 15, 2023
Dataset provided by
National Institute of Standards and Technology (http://www.nist.gov/)
Description
This repository provides the raw data, analysis code, and results generated during a systematic evaluation of the impact of selected experimental protocol choices on the metagenomic sequencing analysis of microbiome samples. Briefly, a full factorial experimental design was implemented varying biological sample (n=5), operator (n=2), lot (n=2), extraction kit (n=2), 16S variable region (n=2), and reference database (n=3), and the main effects were calculated and compared between parameters (bias effects) and samples (real biological differences). A full description of the effort is provided in the associated publication.
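As a rough illustration of what a full factorial design implies, the sketch below enumerates every combination of the six factors. Only the level counts come from the description; the level labels are placeholders.

```python
from itertools import product

# Level names are placeholders; only the counts come from the description.
factors = {
    "sample": [f"S{i}" for i in range(1, 6)],      # n=5
    "operator": ["op1", "op2"],                    # n=2
    "lot": ["lot1", "lot2"],                       # n=2
    "extraction_kit": ["kitA", "kitB"],            # n=2
    "variable_region": ["region1", "region2"],     # n=2
    "reference_database": ["db1", "db2", "db3"],   # n=3
}

# One dict per experimental condition, covering every combination once.
runs = [dict(zip(factors, combo)) for combo in product(*factors.values())]
n_runs = len(runs)  # 5 * 2 * 2 * 2 * 2 * 3 = 240
```

Each factor level appears equally often across the 240 conditions, which is what allows main effects to be compared between parameters and samples.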
College Basketball March Madness Dataset 2012-24
kaggle.com
Updated Mar 21, 2024
Cite
atoziye (2024). College Basketball March Madness Dataset 2012-24 [Dataset]. https://www.kaggle.com/datasets/atoziye/ncaa-d1-college-basketball-march-madness-2012-24
Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
Dataset updated
Mar 21, 2024
Dataset provided by
Kaggle
Authors
atoziye
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This is a dataset of team statistics for every March Madness team from the 2011-12 season to the 2023-24 season.

Data was scraped and aggregated from two sources: sports-reference and kenpom.

Basic stats via sports-reference are straightforward, but refer to Ken Pomeroy's website for more details on the advanced stats. Alternatively, you can check out this blog post for a description of some of these stats and the results of a model we trained to predict success in this year's tournament.

Statistics are inclusive of March Madness tournament games in addition to the regular season, which is why we removed total stats (e.g., rebounds, turnovers), as these are biased towards teams who made it further in the tournament.

Rows representing teams from 2012-23 include a binary variable indicating whether or not that team made it to the Sweet Sixteen.
Global Dataset of Cyber Incidents V.1.2
data.niaid.nih.gov
zenodo.org
Updated May 3, 2024
Cite
European Repository of Cyber Incidents (EuRepoC) (2024). Global Dataset of Cyber Incidents V.1.2 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7848940
Dataset updated
May 3, 2024
Dataset authored and provided by
European Repository of Cyber Incidents (EuRepoC)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset contains data on 2889 cyber incidents between 01.01.2000 and 02.05.2024 using 60 variables, including the start date, names and categories of receivers along with names and categories of initiators. The database was compiled as part of the European Repository of Cyber Incidents (EuRepoC) project.

EuRepoC gathers, codes, and analyses publicly available information from over 200 sources and 600 Twitter accounts daily to report on dynamic trends in the global, and particularly the European, cyber threat environment. For more information on the scope and data collection methodology, see: https://eurepoc.eu/methodology

Codebook available here

Information about each file:

Global Database (csv or xlsx): This file includes all variables coded for each incident, organised such that one row corresponds to one incident, our main unit of investigation. Where multiple codes are present for a single variable for a single incident, these are separated with semi-colons within the same cell.

Receiver Dataset (csv): In this file, the data of affected entities and individuals (receivers) is restructured to facilitate analysis. Each cell contains only a single code, with the data "unpacked" across multiple rows. Thus, a single incident can span several rows, identifiable through the unique identifier assigned to each incident (incident_id).

Attribution Dataset (csv): This file follows a similar approach to the receiver dataset. The attribution data is "unpacked" over several rows, allowing each cell to contain only one code. Here too, a single incident may occupy several rows, with the unique identifier enabling easy tracking of each incident (incident_id). In addition, some attributions may have multiple possible codes for one variable; these are also "unpacked" over several rows, with the attribution_id making it possible to track each attribution.

eurepoc_global_database_1.2 (json): This file contains the whole database in JSON format.
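The relationship between the one-row-per-incident layout and the "unpacked" layouts can be sketched as below. The column name `receiver_category` and the example values are hypothetical; only `incident_id` and the semi-colon convention come from the description.

```python
def unpack(rows, column, sep=";"):
    """Yield one row per code found in `column`, copying the other fields."""
    for row in rows:
        for code in row[column].split(sep):
            code = code.strip()
            if code:  # skip empty fragments such as a trailing separator
                yield {**row, column: code}

# Hypothetical incident records in the one-row-per-incident layout.
global_rows = [
    {"incident_id": "1", "receiver_category": "State institutions; Media"},
    {"incident_id": "2", "receiver_category": "Critical infrastructure"},
]

# Long layout: incident 1 now spans two rows, linked by incident_id.
receiver_rows = list(unpack(global_rows, "receiver_category"))
```

Grouping `receiver_rows` by `incident_id` recovers the original incidents, which is the role the unique identifier plays in the receiver and attribution files.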
bnlearn datasets
zenodo.org
data.niaid.nih.gov
zip
Updated Jan 29, 2025
Cite
Zenodo (2025). bnlearn datasets [Dataset]. http://doi.org/10.5281/zenodo.7676616
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.7676616
Dataset updated
Jan 29, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
This collection consists of 5 structure learning datasets from the Bayesian Network Repository (Scutari, 2010).

Task: The dataset collection can be used to study causal discovery algorithms.

Summary:

Size of collection: 5 datasets with 3 - 56 columns of various sizes

Task: Causal Discovery

Data Type: Discrete

Dataset Scope: Collection

Ground Truth: Known / Estimated

Temporal Structure: No

License: TBD

Missing Values: No

Missingness Statement: There are no missing values.

Collection:

The alarm dataset contains the following 37 variables:

CVP (central venous pressure): a three-level factor with levels LOW, NORMAL and HIGH.

PCWP (pulmonary capillary wedge pressure): a three-level factor with levels LOW, NORMAL and HIGH.

HIST (history): a two-level factor with levels TRUE and FALSE.

TPR (total peripheral resistance): a three-level factor with levels LOW, NORMAL and HIGH.

... (33 more variables, see the corresponding .html file)

The binary synthetic asia dataset:

D (dyspnoea), a two-level factor with levels yes and no.

T (tuberculosis), a two-level factor with levels yes and no.

L (lung cancer), a two-level factor with levels yes and no.

B (bronchitis), a two-level factor with levels yes and no.

A (visit to Asia), a two-level factor with levels yes and no.

S (smoking), a two-level factor with levels yes and no.

X (chest X-ray), a two-level factor with levels yes and no.

E (tuberculosis versus lung cancer/bronchitis), a two-level factor with levels yes and no.

The binary coronary dataset:

Smoking (smoking): a two-level factor with levels no and yes.

M. Work (strenuous mental work): a two-level factor with levels no and yes.

P. Work (strenuous physical work): a two-level factor with levels no and yes.

Pressure (systolic blood pressure): a two-level factor with levels <140 and >140.

Proteins (ratio of beta and alpha lipoproteins): a two-level factor with levels <3 and >3.

Family (family anamnesis of coronary heart disease): a two-level factor with levels neg and pos.

The hailfinder dataset contains the following 56 variables:

N07muVerMo (10.7mu vertical motion): a four-level factor with levels StrongUp, WeakUp, Neutral and Down.

SubjVertMo (subjective judgment of vertical motion): a four-level factor with levels StrongUp, WeakUp, Neutral and Down.

QGVertMotion (quasigeostrophic vertical motion): a four-level factor with levels StrongUp, WeakUp, Neutral and Down.

CombVerMo (combined vertical motion): a four-level factor with levels StrongUp, WeakUp, Neutral and Down.

AreaMesoALS (area of meso-alpha): a four-level factor with levels StrongUp, WeakUp, Neutral and Down.

SatContMoist (satellite contribution to moisture): a four-level factor with levels VeryWet, Wet, Neutral and Dry.

... (49 more variables are in the corresponding .html file)

The lizards dataset contains the following 3 variables:

Species (the species of the lizard): a two-level factor with levels Sagrei and Distichus.

Height (perch height): a two-level factor with levels high (greater than 4.75 feet) and low (less than or equal to 4.75 feet).

Diameter (perch diameter): a two-level factor with levels narrow (greater than 4 inches) and wide (less than or equal to 4 inches).
Data for comparison of climate envelope models developed using...
catalog.data.gov
data.usgs.gov
+2 more
Updated Jul 6, 2024
+ more versions
Cite
U.S. Geological Survey (2024). Data for comparison of climate envelope models developed using expert-selected variables versus statistical selection [Dataset]. https://catalog.data.gov/dataset/data-for-comparison-of-climate-envelope-models-developed-using-expert-selected-variables-v
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Description
The data we used for this study include species occurrence data (n=15 species), climate data and predictions, an expert opinion questionnaire, and species masks that represented the model domain for each species. For this data release, we include the results of the expert opinion questionnaire and the species model domains (or masks). We developed the expert opinion questionnaire to gather expert opinion on the importance of climate variables in determining a species' geographic range. The species masks, or model domains, were defined separately for each species using a variation of the “target-group” approach (Phillips et al. 2009), where the domain was determined using convex polygons including occurrence data for at least three phylogenetically related and similar species (Watling et al. 2012). The species occurrence data, climate data, and climate predictions are freely available online and are therefore not included in this data release. The species occurrence data were obtained primarily from the online database Global Biodiversity Information Facility (GBIF; http://www.gbif.org/), and from scientific literature (Watling et al. 2011). Climate data were obtained from the WorldClim database (Hijmans et al. 2005) and climate predictions were obtained from the Center for Ocean-Atmosphere Prediction Studies (COAPS) at Florida State University (https://floridaclimateinstitute.org/resources/data-sets/regional-downscaling). See metadata for references.
NYSERDA Low- to Moderate-Income New York State Census Population Analysis...
catalog.data.gov
datasets.ai
+3 more
Updated Jun 28, 2025
+ more versions
Cite
data.ny.gov (2025). NYSERDA Low- to Moderate-Income New York State Census Population Analysis Dataset: Average for 2013-2015 [Dataset]. https://catalog.data.gov/dataset/nyserda-low-to-moderate-income-new-york-state-census-population-analysis-dataset-aver-2013
Dataset updated
Jun 28, 2025
Dataset provided by
data.ny.gov
Area covered
New York
Description
How does your organization use this dataset? What other NYSERDA or energy-related datasets would you like to see on Open NY? Let us know by emailing OpenNY@nyserda.ny.gov. The Low- to Moderate-Income (LMI) New York State (NYS) Census Population Analysis dataset is resultant from the LMI market database designed by APPRISE as part of the NYSERDA LMI Market Characterization Study (https://www.nyserda.ny.gov/lmi-tool). All data are derived from the U.S. Census Bureau’s American Community Survey (ACS) 1-year Public Use Microdata Sample (PUMS) files for 2013, 2014, and 2015. Each row in the LMI dataset is an individual record for a household that responded to the survey and each column is a variable of interest for analyzing the low- to moderate-income population. The LMI dataset includes: county/county group, households with elderly, households with children, economic development region, income groups, percent of poverty level, low- to moderate-income groups, household type, non-elderly disabled indicator, race/ethnicity, linguistic isolation, housing unit type, owner-renter status, main heating fuel type, home energy payment method, housing vintage, LMI study region, LMI population segment, mortgage indicator, time in home, head of household education level, head of household age, and household weight. The LMI NYS Census Population Analysis dataset is intended for users who want to explore the underlying data that supports the LMI Analysis Tool. The majority of those interested in LMI statistics and generating custom charts should use the interactive LMI Analysis Tool at https://www.nyserda.ny.gov/lmi-tool. This underlying LMI dataset is intended for users with experience working with survey data files and producing weighted survey estimates using statistical software packages (such as SAS, SPSS, or Stata).
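For readers without SAS, SPSS, or Stata, the idea of a weighted survey estimate can be sketched in a few lines. The field names (`heating_fuel`, `household_weight`) and the records are hypothetical stand-ins for the dataset's columns, and the sketch covers only a point estimate, not standard errors.

```python
def weighted_share(records, predicate, weight_key="household_weight"):
    """Weighted proportion of households for which `predicate` is true."""
    total = sum(r[weight_key] for r in records)
    matching = sum(r[weight_key] for r in records if predicate(r))
    return matching / total

# Hypothetical household records; each weight is the number of
# households in the population that the responding household represents.
records = [
    {"heating_fuel": "gas", "household_weight": 120.0},
    {"heating_fuel": "electric", "household_weight": 80.0},
    {"heating_fuel": "gas", "household_weight": 200.0},
]

share_gas = weighted_share(records, lambda r: r["heating_fuel"] == "gas")
```

Here two of three respondents use gas, but the weighted share is 320 / 400 = 0.8, which shows why unweighted row counts should not be read as population estimates.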
Data from: A clustering based forecasting algorithm for multivariable fuzzy...
data.mendeley.com
Updated Oct 31, 2016
+ more versions
Cite
Salar Askari Lasaki (2016). A clustering based forecasting algorithm for multivariable fuzzy time series using linear combinations of independent variables [Dataset]. http://doi.org/10.17632/35fw8pb6s9.1
Unique identifier
https://doi.org/10.17632/35fw8pb6s9.1
Dataset updated
Oct 31, 2016
Authors
Salar Askari Lasaki
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Dear Researcher,

Thank you for using this code and these datasets. Below I explain how the CFTS code related to my paper "A clustering based forecasting algorithm for multivariable fuzzy time series using linear combinations of independent variables", published in Applied Soft Computing, works. All datasets mentioned in the paper are included with the CFTS code. If you have any questions, feel free to contact me at bas_salaraskari@yahoo.com or s_askari@aut.ac.ir.

Regards,

S. Askari

Guidelines for CFTS algorithm:
1. Open the file CFTS Code using MATLAB.
2. Enter or paste the name of the dataset you wish to simulate in line 5 after "load". This loads the dataset into the workspace.
3. Lines 6 and 7: "r" is the number of independent variables and "N" is the number of data vectors used for training.
4. Line 9: "C" is the number of clusters. You can use the optimal number of clusters given in Table 6 of the paper or your own preferred value.
5. If line 28 is commented out, the covariance norm (Mahalanobis distance) is used; if it is uncommented, the identity norm (Euclidean distance) is used.
6. Press Ctrl+Enter to run the code.
7. For your own dataset, arrange the data like the datasets described in the MS Word file "Read Me".
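The difference between the two norms in step 5 can be illustrated with a small sketch. This is not the CFTS MATLAB code: it is a stand-alone Python example restricted to two dimensions, and the points and covariance matrix are invented for illustration.

```python
import math

def euclidean(x, center):
    """Distance under the identity norm."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, center)))

def mahalanobis_2d(x, center, cov):
    """Distance under the covariance norm, for 2-D points.

    `cov` is a 2x2 covariance matrix given as nested lists.
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # closed-form 2x2 inverse
    dx = [x[0] - center[0], x[1] - center[1]]
    quad = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return math.sqrt(quad)

point, center = (2.0, 0.0), (0.0, 0.0)
stretched = [[4.0, 0.0], [0.0, 1.0]]  # variance 4 along the first axis
identity = [[1.0, 0.0], [0.0, 1.0]]

d_euclid = euclidean(point, center)
d_mahal = mahalanobis_2d(point, center, stretched)
d_same = mahalanobis_2d(point, center, identity)
```

With the identity matrix the two distances coincide; with a stretched covariance, the Mahalanobis distance discounts displacement along the high-variance axis, so clusters can take elongated rather than spherical shapes.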
Background data for: Latent-variable modeling of ordinal outcomes in...
dataone.org
dataverse.no
Updated Sep 25, 2024
Cite
Krug, Manfred; Vetter, Fabian; Sönning, Lukas (2024). Background data for: Latent-variable modeling of ordinal outcomes in language data analysis [Dataset]. http://doi.org/10.18710/WI9TEH
Unique identifier
https://doi.org/10.18710/WI9TEH
Dataset updated
Sep 25, 2024
Dataset provided by
DataverseNO
Authors
Krug, Manfred; Vetter, Fabian; Sönning, Lukas
Time period covered
Jan 1, 2008 - Dec 31, 2018
Description
This dataset contains tabular files with information about the usage preferences of speakers of Maltese English with regard to 63 pairs of lexical expressions. These pairs (e.g. truck-lorry or realization-realisation) are known to differ in usage between BrE and AmE (cf. Algeo 2006). The data were elicited with a questionnaire that asks informants to indicate whether they always use one of the two variants, prefer one over the other, have no preference, or do not use either expression (see Krug and Sell 2013 for methodological details). Usage preferences were therefore measured on a symmetric 5-point ordinal scale. Data were collected between 2008 and 2018, as part of a larger research project on lexical and grammatical variation in settings where English is spoken as a native, second, or foreign language. The current dataset, which we use for our methodological study on ordinal data modeling strategies, consists of a subset of 500 speakers that is roughly balanced on year of birth.

Abstract (related publication): In empirical work, ordinal variables are typically analyzed using means based on numeric scores assigned to categories. While this strategy has met with justified criticism in the methodological literature, it also generates simple and informative data summaries, a standard often not met by statistically more adequate procedures. Motivated by a survey of how ordered variables are dealt with in language research, we draw attention to an un(der)used latent-variable approach to ordinal data modeling, which constitutes an alternative perspective on the most widely used form of ordered regression, the cumulative model. Since the latent-variable approach does not feature in any of the studies in our survey, we believe it is worthwhile to promote its benefits. To this end, we draw on questionnaire-based preference ratings by speakers of Maltese English, who indicated on a 5-point scale which of two synonymous expressions (e.g. package-parcel) they (tend to) use. We demonstrate that a latent-variable formulation of the cumulative model affords nuanced and interpretable data summaries that can be visualized effectively, while at the same time avoiding limitations inherent in mean response models (e.g. distortions induced by floor and ceiling effects). The online supplementary materials include a tutorial for its implementation in R.
Treatment Episode Data Set -- Admissions (TEDS-A), 2004
icpsr.umich.edu
healthdata.gov
+4 more
ascii, delimited, r +3
Updated Sep 10, 2014
+ more versions
Cite
United States Department of Health and Human Services. Substance Abuse and Mental Health Services Administration. Office of Applied Studies (2014). Treatment Episode Data Set -- Admissions (TEDS-A), 2004 [Dataset]. http://doi.org/10.3886/ICPSR04431.v11
Available download formats: sas, ascii, delimited, spss, stata, r
Unique identifier
https://doi.org/10.3886/ICPSR04431.v11
Dataset updated
Sep 10, 2014
Dataset provided by
Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
Authors
United States Department of Health and Human Services. Substance Abuse and Mental Health Services Administration. Office of Applied Studies
License
https://www.icpsr.umich.edu/web/ICPSR/studies/4431/terms
Time period covered
2004
Area covered
United States
Description
The Treatment Episode Data Set -- Admissions (TEDS-A) is a national census data system of annual admissions to substance abuse treatment facilities. TEDS-A provides annual data on the number and characteristics of persons admitted to public and private substance abuse treatment programs that receive public funding. The unit of analysis is a treatment admission. TEDS consists of data reported to state substance abuse agencies by the treatment programs, which in turn report it to SAMHSA. A sister data system, called the Treatment Episode Data Set -- Discharges (TEDS-D), collects data on discharges from substance abuse treatment facilities. The first year of TEDS-A data is 1992, while the first year of TEDS-D is 2006. TEDS variables that are required to be reported are called the "Minimum Data Set (MDS)", while those that are optional are called the "Supplemental Data Set (SuDS)". Variables in the MDS include: information on service setting, number of prior treatments, primary source of referral, gender, race, ethnicity, education, employment status, substance(s) abused, route of administration, frequency of use, age at first use, and whether methadone was prescribed in treatment. Supplemental variables include: diagnosis codes, presence of psychiatric problems, living arrangements, source of income, health insurance, expected source of payment, pregnancy and veteran status, marital status, detailed not in labor force codes, detailed criminal justice referral codes, days waiting to enter treatment, and the number of arrests in the 30 days prior to admissions (starting in 2008). 
Substances abused include alcohol, cocaine and crack, marijuana and hashish, heroin, nonprescription methadone, other opiates and synthetics, PCP, other hallucinogens, methamphetamine, other amphetamines, other stimulants, benzodiazepines, other non-benzodiazepine tranquilizers, barbiturates, other non-barbiturate sedatives or hypnotics, inhalants, over-the-counter medications, and other substances. Created variables include total number of substances reported, intravenous drug use (IDU), and flags for any mention of specific substances.
Predictor variables for the Taiwan credit data.
plos.figshare.com
xls
Updated Aug 12, 2024
+ more versions
Cite
Rivalani Hlongwane; Kutlwano Ramabao; Wilson Mongwe (2024). Predictor variables for the Taiwan credit data. [Dataset]. http://doi.org/10.1371/journal.pone.0308718.t003
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0308718.t003
Dataset updated
Aug 12, 2024
Dataset provided by
PLOS ONE
Authors
Rivalani Hlongwane; Kutlwano Ramabao; Wilson Mongwe
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Taiwan
Description
Credit scorecards are essential tools for banks to assess the creditworthiness of loan applicants. While advanced machine learning models like XGBoost and random forest often outperform traditional logistic regression in predictive accuracy, their lack of interpretability hinders their adoption in practice. This study bridges the gap between research and practice by developing a novel framework for constructing interpretable credit scorecards using Shapley values. We apply this framework to two credit datasets, discretizing numerical variables and utilizing one-hot encoding to facilitate model development. Shapley values are then employed to derive credit scores for each predictor variable group in XGBoost, random forest, LightGBM, and CatBoost models. Our results demonstrate that this approach yields credit scorecards with interpretability comparable to logistic regression while maintaining superior predictive accuracy. This framework offers a practical and effective solution for credit practitioners seeking to leverage the power of advanced models without sacrificing transparency and regulatory compliance.
Descriptive statistics of sexual violence victim-survivors in the Crime...
plos.figshare.com
xls
Updated Jan 14, 2025
Cite
Estela Capelas Barbosa; Niels Blom; Annie Bunce (2025). Descriptive statistics of sexual violence victim-survivors in the Crime Survey for England and Wales (CSEW) and Rape Crisis England & Wales (RCEW) datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0301155.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0301155.t001
Dataset updated
Jan 14, 2025
Dataset provided by
PLOS ONE
Authors
Estela Capelas Barbosa; Niels Blom; Annie Bunce
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Descriptive statistics of sexual violence victim-survivors in the Crime Survey for England and Wales (CSEW) and Rape Crisis England & Wales (RCEW) datasets.
Benchmark datasets to study fairness in synthetic data generation
zenodo.org
csv, json
Updated Aug 28, 2024
Joao Fonseca (2024). Benchmark datasets to study fairness in synthetic data generation [Dataset]. http://doi.org/10.5281/zenodo.13385610
Available download formats: csv, json
Unique identifier
https://doi.org/10.5281/zenodo.13385610
Dataset updated
Aug 28, 2024
Dataset provided by
Zenodo http://zenodo.org/
Authors
Joao Fonseca
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The traveltime dataset is based on the Folktables project covering US census data. The target is a binary variable encoding whether or not the individual needs to travel more than 20 minutes for work; here, a shorter travel time is the desirable outcome. We use a subset of the 2018 data from the states of California, Florida, Maine, New York, Utah, and Wyoming. Although the Folktables data contain no missing values as such, some values are recorded as NaN due to the Bureau's data collection methodology. We remove the "esp" column, which encodes the employment status of parents and has 99.55% missing values. We encode the missing values in povpip, the income-to-poverty ratio (0.85% missing), as -1, following the methodology of Ding et al. See https://arxiv.org/pdf/2108.04884 for metadata.
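
The cleaning steps described for the traveltime dataset can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' actual pipeline: the column names esp and povpip and the 20-minute threshold come from the description, while the sample rows, the travel_minutes field, and the helper names clean_row and travel_target are hypothetical.

```python
import math


def clean_row(row):
    """Drop the mostly-missing 'esp' column and encode a missing 'povpip' as -1."""
    cleaned = {key: value for key, value in row.items() if key != "esp"}
    povpip = cleaned.get("povpip")
    if povpip is None or (isinstance(povpip, float) and math.isnan(povpip)):
        cleaned["povpip"] = -1
    return cleaned


def travel_target(minutes, threshold=20):
    """Binary target: 1 if the commute is longer than the threshold, else 0."""
    return int(minutes > threshold)


# Hypothetical sample rows; 'travel_minutes' is an assumed field name.
rows = [
    {"esp": float("nan"), "povpip": 250.0, "travel_minutes": 35},
    {"esp": float("nan"), "povpip": float("nan"), "travel_minutes": 10},
]
cleaned_rows = [clean_row(r) for r in rows]
targets = [travel_target(r["travel_minutes"]) for r in rows]
```

Encoding the missing ratio as -1 keeps the row in the dataset while leaving the missing case distinguishable from real values, on the assumption that real ratios are never negative.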

The cardio (a) dataset contains patient data recorded during a medical examination, including 3 binary features supplied by the patient. The target class denotes the presence of cardiovascular disease. This dataset represents predictive tasks that allocate access to priority medical care for patients, and has been used for fairness evaluations in the domain.

The credit dataset contains historical financial data of borrowers, including past non-serious delinquencies. Here, a serious delinquency is considered to be 90 days past due, and this is the target variable.

The German Credit dataset (https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data) contains financial and personal information regarding loan-seeking applicants.
Annual Population Survey Household Dataset, January - December, 2019
datacatalogue.cessda.eu
Updated May 16, 2025
Office for National Statistics (2025). Annual Population Survey Household Dataset, January - December, 2019 [Dataset]. http://doi.org/10.5255/UKDA-SN-8665-1
Unique identifier
https://doi.org/10.5255/UKDA-SN-8665-1
Dataset updated
May 16, 2025
Dataset provided by
Social Survey Division
Authors
Office for National Statistics
Time period covered
Jan 1, 2019 - Dec 31, 2019
Area covered
United Kingdom
Variables measured
Families/households, National
Measurement technique
Face-to-face interview, Telephone interview, Data compiled from households completing the main APS and LFS.
Description
Abstract copyright UK Data Service and data collection copyright owner.
The Annual Population Survey (APS) household datasets are produced annually and are available from 2004 (Special Licence) and 2006 (End User Licence). They allow production of family and household labour market statistics at local areas and for small sub-groups of the population across the UK. The household data comprise key variables from the Labour Force Survey (LFS) and the APS 'person' datasets. The APS household datasets include all the variables on the LFS and APS person datasets, except for the income variables. They also include key family and household-level derived variables. These variables allow for an analysis of the combined economic activity status of the family or household. In addition, they also include more detailed geographical, industry, occupation, health and age variables.
For further detailed information about methodology, users should consult the Labour Force Survey User Guide, included with the APS documentation. For variable and value labelling and coding frames that are not included either in the data or in the current APS documentation, users are advised to consult the latest versions of the LFS User Guides, which are available from the ONS Labour Force Survey - User Guidance webpages.

Occupation data for 2021 and 2022
The ONS has identified an issue with the collection of some occupational data in 2021 and 2022 data files in a number of their surveys. While they estimate any impacts will be small overall, this will affect the accuracy of the breakdowns of some detailed (four-digit Standard Occupational Classification (SOC)) occupations, and data derived from them. None of ONS' headline statistics, other than those directly sourced from occupational data, are affected and you can continue to rely on their accuracy. Further information can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022

End User Licence and Secure Access APS data
Users should note that there are two versions of each APS dataset. One is available under the standard End User Licence (EUL) agreement, and the other is a Secure Access version. The EUL version includes Government Office Region geography, banded age, 3-digit SOC and industry sector for main, second and last job. The Secure Access version contains more detailed variables relating to:
age: single year of age, year and month of birth, age completed full-time education and age obtained highest qualification, age of oldest dependent child and age of youngest dependent child
family unit and household: including a number of variables concerning the number of dependent children in the family according to their ages, relationship to head of household and relationship to head of family
nationality and country of origin
geography: including county, unitary/local authority, place of work, Nomenclature of Territorial Units for Statistics 2 (NUTS2) and NUTS3 regions, and whether lives and works in same local authority district
health: including main health problem, and current and past health problems
education and apprenticeship: including numbers and subjects of various qualifications and variables concerning apprenticeships
industry: including industry, industry class and industry group for main, second and last job, and industry made redundant from
occupation: including 4-digit Standard Occupational Classification (SOC) for main, second and last job and job made redundant from
system variables: including week number when interview took place and number of households at address
The Secure Access data have more restrictive access conditions than those made available under the standard EUL. Prospective users will need to gain ONS Accredited Researcher status, complete an extra application form and demonstrate to the data owners exactly why they need access to the additional variables. Users are strongly advised to first obtain the standard EUL version of the data to see if they are sufficient for their research requirements.
Main Topics:
Topics covered include: household composition and relationships, housing tenure, nationality, ethnicity and residential history, employment and training (including government schemes), workplace and location, job hunting, educational background and qualifications.
Data from: MobileWell400+: A Large-Scale Multivariate Longitudinal Mobile...
produccioncientifica.ucm.es
zenodo.org
Updated 2024
Banos, Oresti; Damas, Miguel; Goicoechea, Carmen; Perakakis, Pandelis; Pomares, Hector; Rodriguez-Leon, Ciro; Sanabria, Daniel; Villalonga, Claudia (2024). MobileWell400+: A Large-Scale Multivariate Longitudinal Mobile Dataset for Investigating Individual and Collective Well-Being [Dataset]. https://produccioncientifica.ucm.es/documentos/668fc499b9e7c03b01be2372
Dataset updated
2024
Authors
Banos, Oresti; Damas, Miguel; Goicoechea, Carmen; Perakakis, Pandelis; Pomares, Hector; Rodriguez-Leon, Ciro; Sanabria, Daniel; Villalonga, Claudia
Description
This study engaged 409 participants over a period spanning from July 10 to August 8, 2023, ensuring representation across various demographic factors: 221 females, 186 males, 2 non-binary, year of birth between 1951 and 2005, with varied annual incomes and from 15 Spanish regions. The MobileWell400+ dataset, openly accessible, encompasses a wide array of data collected via the participants' mobile phone, including demographic, emotional, social, behavioral, and well-being data. Methodologically, the project presents a promising avenue for uncovering new social, behavioral, and emotional indicators, supplementing existing literature. Notably, artificial intelligence is considered to be instrumental in analysing these data, discerning patterns, and forecasting trends, thereby advancing our comprehension of individual and population well-being. Ethical standards were upheld, with participants providing informed consent.

The following is a non-exhaustive list of collected data:

Data continuously collected through the participants' smartphone sensors: physical activity (resting, walking, driving, cycling, etc.), name of detected WiFi networks, connectivity type (WiFi, mobile, none), ambient light, ambient noise, and status of the device screen (on, off, locked, unlocked).

Data corresponding to an initial survey prompted via the smartphone, with information related to demographic data, effects and COVID vaccination, average hours of physical activity, and answers to a series of questions to measure mental health, many of them taken from internationally recognised psychological and well-being scales (PANAS, PHQ, GAD, BRS and AAQ), social isolation (TILS) and economic inequality perception.

Data corresponding to daily surveys prompted via the smartphone, where variables related to mood (valence, activation, energy and emotional events) and social interaction (quantity and quality) are measured.

Data corresponding to weekly surveys prompted via the smartphone, which ask about overall health, hours of physical activity per week, loneliness, and well-being.

Data corresponding to a final survey prompted via the smartphone, consisting of questions similar to those asked in the initial survey, namely psychological and well-being items (PANAS, PHQ, GAD, BRS and AAQ), social isolation (TILS) and economic inequality perception questions.

For a more detailed description of the study please refer to MobileWell400+StudyDescription.pdf.

For a more detailed description of the collected data, variables and data files please refer to MobileWell400+FilesDescription.pdf.
KNMI’23 climate scenario data for official data portal with extra variables
dataplatform.knmi.nl
gimi9.com
Updated May 6, 2025
knmi.nl (2025). KNMI’23 climate scenario data for official data portal with extra variables [Dataset]. https://dataplatform.knmi.nl/dataset/knmi23-user-friendly-racmo-3-0
Dataset updated
May 6, 2025
Dataset provided by
Royal Netherlands Meteorological Institute http://www.knmi.nl/
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The KNMI'23 climate scenarios are based on 240 years (8 ensembles of 30 years each) of RACMO (Regional Atmospheric Climate Model) v2.3 data for every horizon/scenario. The data in this set is a user-friendly version of the RACMO output that was used to calculate the scenario tables. 'User-friendly' means that the data is mapped to a regular lat/lon grid, and that the time coordinate corresponds to the nominal period it is used for. Version 1.0 (https://dataplatform.knmi.nl/dataset/knmi23-user-friendly-racmo-1-0) of the dataset only includes values within the Dutch borders. Version 2.0 also includes values outside the borders, together with a mask for the points outside the Netherlands. This user-friendly dataset is also provided to the public via the data portal. The current version (3.0) is the same as version 2.0 but has four extra variables: psl (sea level pressure), sfcwindmax (daily maximum wind speed), uas (daily mean zonal wind speed), vas (daily mean meridional wind speed). These four variables are not bias-corrected because the necessary observational data are not available. Users should be aware of this difference.
2023 Census totals by topic for dwellings by statistical area 1
datafinder.stats.govt.nz
csv, dwg, geodatabase +6
Stats NZ, 2023 Census totals by topic for dwellings by statistical area 1 [Dataset]. https://datafinder.stats.govt.nz/layer/120759-2023-census-totals-by-topic-for-dwellings-by-statistical-area-1/
Available download formats: csv, mapinfo mif, dwg, shapefile, kml, geopackage / sqlite, mapinfo tab, geodatabase, pdf
Dataset provided by
Statistics New Zealand http://www.stats.govt.nz/
Authors
Stats NZ
License
https://datafinder.stats.govt.nz/license/attribution-4-0-international/
Description
Dataset contains counts and measures for dwellings from the 2013, 2018, and 2023 Censuses. Data is available by statistical area 1.

The variables included in this dataset are for occupied private dwellings (unless otherwise stated). All data is for level 1 of the classification (unless otherwise stated):

Access to basic amenities (total responses)

Dwelling dampness

Dwelling mould

Dwelling occupancy status for all dwellings for levels 1 and 2

Dwelling type for occupied dwellings for levels 1 and 2

Fuel types used to heat dwellings (total responses)

Main types of heating used (total responses)

Number of bedrooms

Average number of bedrooms

Number of rooms

Average number of rooms.

Download lookup file from Stats NZ ArcGIS Online or embedded attachment in Stats NZ geographic data service. Download data table (excluding the geometry column for CSV files) using the instructions in the Koordinates help guide.

Footnotes

Geographical boundaries

Statistical standard for geographic areas 2023 (updated December 2023) has information about geographic boundaries as of 1 January 2023. Address data from 2013 and 2018 Censuses was updated to be consistent with the 2023 areas. Due to the changes in area boundaries and coding methodologies, 2013 and 2018 counts published in 2023 may be slightly different to those published in 2013 or 2018.

Caution using time series

Time series data should be interpreted with care due to changes in census methodology and differences in response rates between censuses. The 2023 and 2018 Censuses used a combined census methodology (using census responses and administrative data), while the 2013 Census used a full-field enumeration methodology (with no use of administrative data).

About the 2023 Census dataset

For information on the 2023 dataset see Using a combined census model for the 2023 Census. We combined data from the census forms with administrative data to create the 2023 Census dataset, which meets Stats NZ's quality criteria for population structure information. Where we were confident that people who hadn't completed a census form should be counted, we added real data about those real people to the dataset (this is known as admin enumeration). We also used data from the 2018 and 2013 Censuses, administrative data sources, and statistical imputation methods to fill in some missing characteristics of people and dwellings.

Data quality

The quality of data in the 2023 Census is assessed using the quality rating scale and the quality assurance framework to determine whether data is fit for purpose and suitable for release. Data quality assurance in the 2023 Census has more information.

Concept descriptions and quality ratings

Data quality ratings for 2023 Census variables has additional details about variables found within totals by topic, for example, definitions and data quality.

Using data for good

Stats NZ expects that, when working with census data, it is done so with a positive purpose, as outlined in the Māori Data Governance Model (Data Iwi Leaders Group, 2023). This model states that "data should support transformative outcomes and should uplift and strengthen our relationships with each other and with our environments. The avoidance of harm is the minimum expectation for data use. Māori data should also contribute to iwi and hapū tino rangatiratanga".

Confidentiality

The 2023 Census confidentiality rules have been applied to 2013, 2018, and 2023 data. These rules protect the confidentiality of individuals, families, households, dwellings, and undertakings in 2023 Census data. Counts are calculated using fixed random rounding to base 3 (FRR3) and suppression of ‘sensitive’ counts less than six, where tables report multiple geographic variables and/or small populations. Individual figures may not always sum to stated totals. Applying confidentiality rules to 2023 Census data and summary of changes since 2018 and 2013 Censuses has more information about 2023 Census confidentiality rules.
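
As a rough illustration of why rounded figures may not sum to stated totals, the sketch below shows one way random rounding to base 3 could work. The details are assumptions rather than Stats NZ's implementation: the 2/3 versus 1/3 probabilities follow the commonly described random-rounding scheme, and the hash-based draw is only a stand-in for the "fixed" property (the same cell always rounds the same way). The function name fixed_random_round and the cell_key argument are hypothetical.

```python
import hashlib

BASE = 3


def fixed_random_round(count, cell_key):
    """Round a count to a multiple of 3, reproducibly for a given cell key."""
    remainder = count % BASE
    if remainder == 0:
        return count  # multiples of 3 are left unchanged
    lower = count - remainder
    upper = lower + BASE
    # The nearest multiple is 'lower' when the remainder is 1, 'upper' when it is 2.
    nearest, other = (lower, upper) if remainder == 1 else (upper, lower)
    # Assumed mechanism: derive a repeatable draw in [0, 1) from the cell key.
    digest = hashlib.sha256(f"{cell_key}:{count}".encode()).digest()
    draw = int.from_bytes(digest[:8], "big") / 2**64
    # Assumed probabilities: nearest multiple with 2/3, the other with 1/3.
    return nearest if draw < 2 / 3 else other


# Hypothetical cells: each is rounded independently of the total.
cells = {"area_a": 7, "area_b": 8, "area_c": 10}
rounded = {key: fixed_random_round(value, key) for key, value in cells.items()}
rounded_total = fixed_random_round(sum(cells.values()), "total")
```

Because each cell and the total are rounded independently, sum(rounded.values()) can differ from rounded_total by a few units.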

Measures

Measures like averages, medians, and other quantiles are calculated from unrounded counts, with input noise added to or subtracted from each contributing value during measures calculations. Averages and medians based on less than six units (e.g. individuals, dwellings, households, families, or extended families) are suppressed. This suppression threshold changes for other quantiles. Where the cells have been suppressed, a placeholder value has been used.

Percentages

To calculate percentages, divide the figure for the category of interest by the figure for 'Total stated' where this applies.

Symbol

-999 Confidential
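
The percentage rule and the -999 symbol can be combined in a small helper, sketched below. The division by 'Total stated' follows the description; treating -999 as "no value available" when consuming the data is an assumption, and the names percentage and CONFIDENTIAL are hypothetical.

```python
CONFIDENTIAL = -999  # symbol listed in the dataset notes


def percentage(category_count, total_stated):
    """Return the category's share of 'Total stated' in percent, or None if unavailable."""
    if CONFIDENTIAL in (category_count, total_stated) or total_stated == 0:
        return None  # confidential or empty cells yield no percentage
    return 100.0 * category_count / total_stated


# Hypothetical figures for one area.
share = percentage(30, 120)
hidden = percentage(CONFIDENTIAL, 120)
```

Returning None rather than a number keeps confidential cells from being mistaken for real percentages in later calculations.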

Inconsistencies in definitions

Please note that there may be differences in definitions between census classifications and those used for other data collections.
