38 datasets found

f
Summary of variables of the data set included in the analysis.
plos.figshare.com
xls
Updated Jun 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Owen Bodger; Aidan Byrne; Philip A. Evans; Sarah Rees; Gwen Jones; Claire Cowell; Mike B. Gravenor; Rhys Williams (2023). Summary of variables of the data set included in the analysis. [Dataset]. http://doi.org/10.1371/journal.pone.0027161.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0027161.t001
Dataset updated
Jun 8, 2023
Dataset provided by
PLOS ONE
Authors
Owen Bodger; Aidan Byrne; Philip A. Evans; Sarah Rees; Gwen Jones; Claire Cowell; Mike B. Gravenor; Rhys Williams
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Footnote: (f) denotes a categorical variable, (c) a continuous covariate and (n) a nominal variable.
A Dataset of Water Quality and Related Variables in U.S. Reservoirs
catalog.data.gov
s.cnmilf.com
+1more
Updated Jun 13, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2025). A Dataset of Water Quality and Related Variables in U.S. Reservoirs [Dataset]. https://catalog.data.gov/dataset/a-dataset-of-water-quality-and-related-variables-in-u-s-reservoirs
Explore at:
Dataset updated
Jun 13, 2025
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Area covered
United States
Description
This dataset presents a rich collection of physicochemical parameters from 147 reservoirs distributed across the conterminous U.S. One hundred and eight of the reservoirs were selected using a statistical survey design and can provide unbiased inferences to the condition of all U.S. reservoirs. These data could be of interest to local water management specialists or those assessing the ecological condition of reservoirs at the national scale. These data have been reviewed in accordance with U.S. Environmental Protection Agency policy and approved for publication. This dataset is not publicly accessible because: It is too large. It can be accessed through the following means: https://portal-s.edirepository.org/nis/mapbrowse?scope=edi&identifier=2033&revision=1. Format: This dataset presents water quality and related variables for 147 reservoirs distributed across the U.S. Water quality parameters were measured during the summers of 2016, 2018, and 2020 – 2023. Measurements include nutrient concentration, algae abundance, dissolved oxygen concentration, and water temperature, among many others. Dataset includes links to other national and global scale data sets that provide additional variables.
f
Scaled Dataset.xlsx
figshare.com
xlsx
Updated Dec 23, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Arash Mohsenijam (2016). Scaled Dataset.xlsx [Dataset]. http://doi.org/10.6084/m9.figshare.4491101.v1
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.4491101.v1
Dataset updated
Dec 23, 2016
Dataset provided by
figshare
Authors
Arash Mohsenijam
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The partner company’s historical data could be utilized in developing a data-driven prediction model with project division details as its inputs and project division labor-hours as the desired output. The BIM models contain 42 design features and 1559 records, each record denoting a division of fabrication. The BIM design features are listed in Table 1. Labor-hours spent on each division were extracted from job costing databases serving as the output parameter in the regression model. Although the variables in Table 1 are all considered related, there are certain inter-correlations between them and some variables can be explained by others. For instance, material length and weight are highly correlated; by knowing one, the other can be deduced. Therefore, a variable selection technique is instrumental in removing these inter-correlations in an analytical manner. It is noteworthy that the dataset was linearly scaled prior to performing analyses in order not to reveal sensitive information of the partner company without distorting patterns and relationships inherent in the data.
Performance vs. Predicted Performance
kaggle.com
Updated Dec 21, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Calathea21 (2022). Performance vs. Predicted Performance [Dataset]. http://doi.org/10.34740/kaggle/dsv/4752670
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/4752670
Dataset updated
Dec 21, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Calathea21
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
This dataset contains information about high school students and their actual and predicted performance on an exam. Most of the information, including some general information about high school students and their grade for an exam, was based on an already existing dataset, while the predicted exam performance was based on a human experiment. In this experiment, participants were shown short descriptions of the students (based on the information in the original data) and had to rank and grade according to their expected performance. Prior to this task some participants were exposed to some "Stereotype Activation", suggesting that boys perform less well in school than girls.

Description of *original_data.csv*

Based on this dataset (which is also available on kaggle), we extracted a number of student profiles that participants had to make grade predictions for. For more information about this dataset we refer to the corresponding kaggle page: https://www.kaggle.com/datasets/uciml/student-alcohol-consumption

Note that we performed some preprocessing on the original data:

The original data consisted of two parts: the information about students following a Maths course and the information about students following a Portuguese course. Since in both datasets the same type of information was recorded, we merged both datasets and added a column "subject", to show which course each student belongs to

We excluded all data where G3 = 0 (i.e. the grade for the last exam = 0)

From original_data.csv we randomly sampled 856 students that participants in our study had to make grade predictions for.

Description of *CompleteDataAndBiases.csv*

index - this column corresponds to the indeces in the file "original_data.csv". Through these indices, it is possible to add columns from the original data to the dataset with the grade prediction

ParticipantID - the ID of the participant who made the performance predictions for the corresponding student. Predictions needed to be made for 856 students, and each participant made 8 predictions total. Thus there are 107 different participant IDs

name - to make the prediction task more engaging for participants, each of the 8 student profiles, that participants had to grade & rank was randomly matched to one of four boy/girl's names (depending on the sex of the student)

sex - the sex of each student, either female (F) or male (M). For benchmarking fair ML algorithms, this can be used as the sensitive attribute. We assume that in the fair version of the decision variable ("Pass"), no sex discrimination occurs. The biased versions of the variable ("Predicted Pass") are mostly discriminatory towards male students.

studytime - this variable is taken from the original dataset and denotes how long a student studied for their exam. In the original data this variable consisted of four levels (less than 2 hours vs. 2-5 hours vs. 5-10 hours vs. more than 10 hours). We binned the latter two levels together and encoded this column numerically from 1-3.

freetime - Originally, this variable ranged from 1 (very low) to 5 (very high). We binned this variable into three categories, where level 1 and 2 are binned, as well as level 4 and 5.

romantic - Binary variable, denoting whether the student is in a romantic relationship or not.

Walc - This variable shows how much alcohol each student consumes in the weekend. Originally it ranged from 1 to 5 (5 corresponding to the highest alcohol consumption), but we binned the last two levels together.

goout - This variable shows how often a student goes out in a week. Originally it ranged from 1 to 5 (5 corresponding to going out very often), but we binned the last two levels together.

Parents_edu - This variable was not present in the original dataset. Instead, the original dataset consisted of two variables "mum_edu" and "dad_edu". We obtained "Parents_edu" by taking the higher one of both. The variable consist of 4 levels, whereas 4 = highest level of education.

absences - This variable shows the number of absences per student. Originally it ranged from 0 - 93, but because large number of absences were infrequent we binned all absences of >=7 into one level.

reason - The reason for why a student chose to go to the school in question. The levels are close to home, school's reputation, school's curricular and other

G3 - The actual grade each student received for the final exam of the course, ranging from 0-20.

Pass - A binary variable showing whether G3 is a passing grade (i.e. >=10) or not.

Predicted Grade - The grade the student was predicted to receive in our experiment

Predicted Rank - In our ex...
w
Synthetic Data for an Imaginary Country, Sample, 2023 - World
microdata.worldbank.org
Updated Jul 7, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Development Data Group, Data Analytics Unit (2023). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://microdata.worldbank.org/index.php/catalog/5906
Explore at:
Dataset updated
Jul 7, 2023
Dataset authored and provided by
Development Data Group, Data Analytics Unit
Time period covered
2023
Area covered
World, World
Description
Abstract

The dataset is a relational dataset of 8,000 households households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other one with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, assets ownership). The data only includes ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.

The full-population dataset (with about 10 million individuals) is also distributed as open data.

Geographic coverage

The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

Analysis unit

Household, Individual

Universe

The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

Kind of data

ssd

Sampling procedure

The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.

Mode of data collection

other

Research instrument

The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.

Cleaning operations

The synthetic data generation process included a set of "validators" (consistency checks, based on which synthetic observation were assessed and rejected/replaced when needed). Also, some post-processing was applied to the data to result in the distributed data files.

Response rate

This is a synthetic dataset; the "response rate" is 100%.
f
Data from: S1 Dataset -
plos.figshare.com
xlsx
Updated Jul 18, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Navid Behzadi Koochani; Raúl Muñoz Romo; Ignacio Hernández Palencia; Sergio López Bernal; Carmen Martin Curto; José Cabezas Rodríguez; Almudena Castaño Reguillo (2024). S1 Dataset - [Dataset]. http://doi.org/10.1371/journal.pone.0305699.s002
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0305699.s002
Dataset updated
Jul 18, 2024
Dataset provided by
PLOS ONE
Authors
Navid Behzadi Koochani; Raúl Muñoz Romo; Ignacio Hernández Palencia; Sergio López Bernal; Carmen Martin Curto; José Cabezas Rodríguez; Almudena Castaño Reguillo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
IntroductionThere is a need to develop harmonized procedures and a Minimum Data Set (MDS) for cross-border Multi Casualty Incidents (MCI) in medical emergency scenarios to ensure appropriate management of such incidents, regardless of place, language and internal processes of the institutions involved. That information should be capable of real-time communication to the command-and-control chain. It is crucial that the models adopted are interoperable between countries so that the rights of patients to cross-border healthcare are fully respected.ObjectiveTo optimize management of cross-border Multi Casualty Incidents through a Minimum Data Set collected and communicated in real time to the chain of command and control for each incident. To determine the degree of agreement among experts.MethodWe used the modified Delphi method supplemented with the Utstein technique to reach consensus among experts. In the first phase, the minimum requirements of the project, the profile of the experts who were to participate, the basic requirements of each variable chosen and the way of collecting the data were defined by providing bibliography on the subject. In the second phase, the preliminary variables were grouped into 6 clusters, the objectives, the characteristics of the variables and the logistics of the work were approved. Several meetings were held to reach a consensus to choose the MDS variables using a Modified Delphi technique. Each expert had to score each variable from 1 to 10. Non-voting variables were eliminated, and the round of voting ended. In the third phase, the Utstein Style was applied to discuss each group of variables and choose the ones with the highest consensus. After several rounds of discussion, it was agreed to eliminate the variables with a score of less than 5 points. In phase four, the researchers submitted the variables to the external experts for final assessment and validation before their use in the simulations. Data were analysed with SPSS Statistics (IBM, version 2) software.ResultsSix data entities with 31 sub-entities were defined, generating 127 items representing the final MDS regarded as essential for incident management. The level of consensus for the choice of items was very high and was highest for the category ‘Incident’ with an overall kappa of 0.7401 (95% CI 0.1265–0.5812, p 0.000), a good level of consensus in the Landis and Koch model. The items with the greatest degree of consensus at ten were those relating to location, type of incident, date, time and identification of the incident. All items met the criteria set, such as digital collection and real-time transmission to the chain of command and control.ConclusionsThis study documents the development of a MDS through consensus with a high degree of agreement among a group of experts of different nationalities working in different fields. All items in the MDS were digitally collected and forwarded in real time to the chain of command and control. This tool has demonstrated its validity in four large cross-border simulations involving more than eight countries and their emergency services.
Data from: A Sensitivity Analysis of Methodological Variables Associated...
catalog.data.gov
data.nist.gov
Updated Dec 15, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institute of Standards and Technology (2023). A Sensitivity Analysis of Methodological Variables Associated with Microbiome Measurements [Dataset]. https://catalog.data.gov/dataset/a-sensitivity-analysis-of-methodological-variables-associated-with-microbiome-measurements-83f38
Explore at:
Dataset updated
Dec 15, 2023
Dataset provided by
National Institute of Standards and Technologyhttp://www.nist.gov/
Description
This repository provides the raw data, analysis code, and results generated during a systematic evaluation of the impact of selected experimental protocol choices on the metagenomic sequencing analysis of microbiome samples. Briefly, a full factorial experimental design was implemented varying biological sample (n=5), operator (n=2), lot (n=2), extraction kit (n=2), 16S variable region (n=2), and reference database (n=3), and the main effects were calculated and compared between parameters (bias effects) and samples (real biological differences). A full description of the effort is provided in the associated publication.
o
University SET data, with faculty and courses characteristics
openicpsr.org
Updated Sep 12, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Under blind review in refereed journal (2021). University SET data, with faculty and courses characteristics [Dataset]. http://doi.org/10.3886/E149801V1
Explore at:
Unique identifier
https://doi.org/10.3886/E149801V1
Dataset updated
Sep 12, 2021
Authors
Under blind review in refereed journal
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This paper explores a unique dataset of all the SET ratings provided by students of one university in Poland at the end of the winter semester of the 2020/2021 academic year. The SET questionnaire used by this university is presented in Appendix 1. The dataset is unique for several reasons. It covers all SET surveys filled by students in all fields and levels of study offered by the university. In the period analysed, the university was entirely in the online regime amid the Covid-19 pandemic. While the expected learning outcomes formally have not been changed, the online mode of study could have affected the grading policy and could have implications for some of the studied SET biases. This Covid-19 effect is captured by econometric models and discussed in the paper. The average SET scores were matched with the characteristics of the teacher for degree, seniority, gender, and SET scores in the past six semesters; the course characteristics for time of day, day of the week, course type, course breadth, class duration, and class size; the attributes of the SET survey responses as the percentage of students providing SET feedback; and the grades of the course for the mean, standard deviation, and percentage failed. Data on course grades are also available for the previous six semesters. This rich dataset allows many of the biases reported in the literature to be tested for and new hypotheses to be formulated, as presented in the introduction section. The unit of observation or the single row in the data set is identified by three parameters: teacher unique id (j), course unique id (k) and the question number in the SET questionnaire (n ϵ {1, 2, 3, 4, 5, 6, 7, 8, 9} ). It means that for each pair (j,k), we have nine rows, one for each SET survey question, or sometimes less when students did not answer one of the SET questions at all. For example, the dependent variable SET_score_avg(j,k,n) for the triplet (j=Calculus, k=John Smith, n=2) is calculated as the average of all Likert-scale answers to question nr 2 in the SET survey distributed to all students that took the Calculus course taught by John Smith. The data set has 8,015 such observations or rows. The full list of variables or columns in the data set included in the analysis is presented in the attached filesection. Their description refers to the triplet (teacher id = j, course id = k, question number = n). When the last value of the triplet (n) is dropped, it means that the variable takes the same values for all n ϵ {1, 2, 3, 4, 5, 6, 7, 8, 9}.Two attachments:- word file with variables description- Rdata file with the data set (for R language).Appendix 1. Appendix 1. The SET questionnaire was used for this paper. Evaluation survey of the teaching staff of [university name] Please, complete the following evaluation form, which aims to assess the lecturer’s performance. Only one answer should be indicated for each question. The answers are coded in the following way: 5- I strongly agree; 4- I agree; 3- Neutral; 2- I don’t agree; 1- I strongly don’t agree. Questions 1 2 3 4 5 I learnt a lot during the course. ○ ○ ○ ○ ○ I think that the knowledge acquired during the course is very useful. ○ ○ ○ ○ ○ The professor used activities to make the class more engaging. ○ ○ ○ ○ ○ If it was possible, I would enroll for the course conducted by this lecturer again. ○ ○ ○ ○ ○ The classes started on time. ○ ○ ○ ○ ○ The lecturer always used time efficiently. ○ ○ ○ ○ ○ The lecturer delivered the class content in an understandable and efficient way. ○ ○ ○ ○ ○ The lecturer was available when we had doubts. ○ ○ ○ ○ ○ The lecturer treated all students equally regardless of their race, background and ethnicity. ○ ○
Z
Global Dataset of Cyber Incidents V.1.2
data.niaid.nih.gov
zenodo.org
Updated May 3, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
European Repository of Cyber Incidents (EuRepoC) (2024). Global Dataset of Cyber Incidents V.1.2 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7848940
Explore at:
Dataset updated
May 3, 2024
Dataset authored and provided by
European Repository of Cyber Incidents (EuRepoC)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset contains data on 2889 cyber incidents between 01.01.2000 and 02.05.2024 using 60 variables, including the start date, names and categories of receivers along with names and categories of initiators. The database was compiled as part of the European Repository of Cyber Incidents (EuRepoC) project.

EuRepoC gathers, codes, and analyses publicly available information from over 200 sources and 600 Twitter accounts daily to report on dynamic trends in the global, and particularly the European, cyber threat environment.For more information on the scope and data collection methodology see: https://eurepoc.eu/methodologyCodebook available hereInformation about each file:

Global Database (csv or xlsx):This file includes all variables coded for each incident, organised such that one row corresponds to one incident - our main unit of investigation. Where multiple codes are present for a single variable for a single incident, these are separated with semi-colons within the same cell.

Receiver Dataset (csv):In this file, the data of affected entities and individuals (receivers) is restructured to facilitate analysis. Each cell contains only a single code, with the data "unpacked" across multiple rows. Thus, a single incident can span several rows, identifiable through the unique identifier assigned to each incident (incident_id).

Attribution Dataset (csv):This file follows a similar approach to the receiver dataset. The attribution data is "unpacked" over several rows, allowing each cell to contain only one code. Here too, a single incident may occupy several rows, with the unique identifier enabling easy tracking of each incident (incident_id). In addition, some attributions may also have multiple possible codes for one variable, these are also "unpacked" over several rows, with the attribution_id enabling to track each attribution.eurepoc_global_database_1.2 (json):This file contains the whole database in JSON format.
d
Data for comparison of climate envelope models developed using...
catalog.data.gov
data.usgs.gov
+2more
Updated Jul 6, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Data for comparison of climate envelope models developed using expert-selected variables versus statistical selection [Dataset]. https://catalog.data.gov/dataset/data-for-comparison-of-climate-envelope-models-developed-using-expert-selected-variables-v
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Description
The data we used for this study include species occurrence data (n=15 species), climate data and predictions, an expert opinion questionnaire, and species masks that represented the model domain for each species. For this data release, we include the results of the expert opinion questionnaire and the species model domains (or masks). We developed an expert opinion questionnaire to gather information regarding expert opinion regarding the importance of climate variables in determining a species geographic range. The species masks, or model domains, were defined separately for each species using a variation of the “target-group” approach (Phillips et al. 2009), where the domain was determine using convex polygons including occurrence data for at least three phylogenetically related and similar species (Watling et al. 2012). The species occurrence data, climate data, and climate predictions are freely available online, and therefore not included in this data release. The species occurrence data were obtained primarily from the online database Global Biodiversity Information Facility (GBIF; http://www.gbif.org/), and from scientific literature (Watling et al. 2011). Climate data were obtained from the WorldClim database (Hijmans et al. 2005) and climate predictions were obtained from the Center for Ocean-Atmosphere Prediction Studies (COAPS) at Florida State University (https://floridaclimateinstitute.org/resources/data-sets/regional-downscaling). See metadata for references.
f
Predictor variables for the Taiwan credit data.
plos.figshare.com
xls
Updated Aug 12, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rivalani Hlongwane; Kutlwano Ramabao; Wilson Mongwe (2024). Predictor variables for the Taiwan credit data. [Dataset]. http://doi.org/10.1371/journal.pone.0308718.t003
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0308718.t003
Dataset updated
Aug 12, 2024
Dataset provided by
PLOS ONE
Authors
Rivalani Hlongwane; Kutlwano Ramabao; Wilson Mongwe
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Taiwan
Description
Credit scorecards are essential tools for banks to assess the creditworthiness of loan applicants. While advanced machine learning models like XGBoost and random forest often outperform traditional logistic regression in predictive accuracy, their lack of interpretability hinders their adoption in practice. This study bridges the gap between research and practice by developing a novel framework for constructing interpretable credit scorecards using Shapley values. We apply this framework to two credit datasets, discretizing numerical variables and utilizing one-hot encoding to facilitate model development. Shapley values are then employed to derive credit scores for each predictor variable group in XGBoost, random forest, LightGBM, and CatBoost models. Our results demonstrate that this approach yields credit scorecards with interpretability comparable to logistic regression while maintaining superior predictive accuracy. This framework offers a practical and effective solution for credit practitioners seeking to leverage the power of advanced models without sacrificing transparency and regulatory compliance.
d
Background data for: Latent-variable modeling of ordinal outcomes in...
dataone.org
dataverse.no
Updated Sep 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Krug, Manfred; Vetter, Fabian; Sönning, Lukas (2024). Background data for: Latent-variable modeling of ordinal outcomes in language data analysis [Dataset]. http://doi.org/10.18710/WI9TEH
Explore at:
Unique identifier
https://doi.org/10.18710/WI9TEH
Dataset updated
Sep 25, 2024
Dataset provided by
DataverseNO
Authors
Krug, Manfred; Vetter, Fabian; Sönning, Lukas
Time period covered
Jan 1, 2008 - Dec 31, 2018
Description
This dataset contains tabular files with information about the usage preferences of speakers of Maltese English with regard to 63 pairs of lexical expressions. These pairs (e.g. truck-lorry or realization-realisation) are known to differ in usage between BrE and AmE (cf. Algeo 2006). The data were elicited with a questionnaire that asks informants to indicate whether they always use one of the two variants, prefer one over the other, have no preference, or do not use either expression (see Krug and Sell 2013 for methodological details). Usage preferences were therefore measured on a symmetric 5-point ordinal scale. Data were collected between 2008 to 2018, as part of a larger research project on lexical and grammatical variation in settings where English is spoken as a native, second, or foreign language. The current dataset, which we use for our methodological study on ordinal data modeling strategies, consists of a subset of 500 speakers that is roughly balanced on year of birth. Abstract: Related publication In empirical work, ordinal variables are typically analyzed using means based on numeric scores assigned to categories. While this strategy has met with justified criticism in the methodological literature, it also generates simple and informative data summaries, a standard often not met by statistically more adequate procedures. Motivated by a survey of how ordered variables are dealt with in language research, we draw attention to an un(der)used latent-variable approach to ordinal data modeling, which constitutes an alternative perspective on the most widely used form of ordered regression, the cumulative model. Since the latent-variable approach does not feature in any of the studies in our survey, we believe it is worthwhile to promote its benefits. To this end, we draw on questionnaire-based preference ratings by speakers of Maltese English, who indicated on a 5-point scale which of two synonymous expressions (e.g. package-parcel) they (tend to) use. We demonstrate that a latent-variable formulation of the cumulative model affords nuanced and interpretable data summaries that can be visualized effectively, while at the same time avoiding limitations inherent in mean response models (e.g. distortions induced by floor and ceiling effects). The online supplementary materials include a tutorial for its implementation in R.
S
2023 Census totals by topic for dwellings by statistical area 1
datafinder.stats.govt.nz
csv, dwg, geodatabase +6
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Stats NZ, 2023 Census totals by topic for dwellings by statistical area 1 [Dataset]. https://datafinder.stats.govt.nz/layer/120759-2023-census-totals-by-topic-for-dwellings-by-statistical-area-1/
Explore at:
csv, mapinfo mif, dwg, shapefile, kml, geopackage / sqlite, mapinfo tab, geodatabase, pdfAvailable download formats
Dataset provided by
Statistics New Zealandhttp://www.stats.govt.nz/
Authors
Stats NZ
License
https://datafinder.stats.govt.nz/license/attribution-4-0-international/https://datafinder.stats.govt.nz/license/attribution-4-0-international/
Area covered
Description
Dataset contains counts and measures for dwellings from the 2013, 2018, and 2023 Censuses. Data is available by statistical area 1.

The variables included in this dataset are for occupied private dwellings (unless otherwise stated). All data is for level 1 of the classification (unless otherwise stated):

Access to basic amenities (total responses)

Dwelling dampness

Dwelling mould

Dwelling occupancy status for all dwellings for levels 1 and 2

Dwelling type for occupied dwellings for levels 1 and 2

Fuel types used to heat dwellings (total responses)

Main types of heating used (total responses)

Number of bedrooms

Average number of bedrooms

Number of rooms

Average number of rooms.

Download lookup file from Stats NZ ArcGIS Online or embedded attachment in Stats NZ geographic data service. Download data table (excluding the geometry column for CSV files) using the instructions in the Koordinates help guide.

Footnotes

Geographical boundaries

Statistical standard for geographic areas 2023 (updated December 2023) has information about geographic boundaries as of 1 January 2023. Address data from 2013 and 2018 Censuses was updated to be consistent with the 2023 areas. Due to the changes in area boundaries and coding methodologies, 2013 and 2018 counts published in 2023 may be slightly different to those published in 2013 or 2018.

Caution using time series

Time series data should be interpreted with care due to changes in census methodology and differences in response rates between censuses. The 2023 and 2018 Censuses used a combined census methodology (using census responses and administrative data), while the 2013 Census used a full-field enumeration methodology (with no use of administrative data).

About the 2023 Census dataset

For information on the 2023 dataset see Using a combined census model for the 2023 Census. We combined data from the census forms with administrative data to create the 2023 Census dataset, which meets Stats NZ's quality criteria for population structure information. We added real data about real people to the dataset where we were confident the people who hadn’t completed a census form (which is known as admin enumeration) will be counted. We also used data from the 2018 and 2013 Censuses, administrative data sources, and statistical imputation methods to fill in some missing characteristics of people and dwellings.

Data quality

The quality of data in the 2023 Census is assessed using the quality rating scale and the quality assurance framework to determine whether data is fit for purpose and suitable for release. Data quality assurance in the 2023 Census has more information.

Concept descriptions and quality ratings

Data quality ratings for 2023 Census variables has additional details about variables found within totals by topic, for example, definitions and data quality.

Using data for good

Stats NZ expects that, when working with census data, it is done so with a positive purpose, as outlined in the Māori Data Governance Model (Data Iwi Leaders Group, 2023). This model states that "data should support transformative outcomes and should uplift and strengthen our relationships with each other and with our environments. The avoidance of harm is the minimum expectation for data use. Māori data should also contribute to iwi and hapū tino rangatiratanga”.

Confidentiality

The 2023 Census confidentiality rules have been applied to 2013, 2018, and 2023 data. These rules protect the confidentiality of individuals, families, households, dwellings, and undertakings in 2023 Census data. Counts are calculated using fixed random rounding to base 3 (FRR3) and suppression of ‘sensitive’ counts less than six, where tables report multiple geographic variables and/or small populations. Individual figures may not always sum to stated totals. Applying confidentiality rules to 2023 Census data and summary of changes since 2018 and 2013 Censuses has more information about 2023 Census confidentiality rules.

Measures

Measures like averages, medians, and other quantiles are calculated from unrounded counts, with input noise added to or subtracted from each contributing value during measures calculations. Averages and medians based on less than six units (e.g. individuals, dwellings, households, families, or extended families) are suppressed. This suppression threshold changes for other quantiles. Where the cells have been suppressed, a placeholder value has been used.

Percentages

To calculate percentages, divide the figure for the category of interest by the figure for 'Total stated' where this applies.

Symbol

-999 Confidential

Inconsistencies in definitions

Please note that there may be differences in definitions between census classifications and those used for other data collections.
d
ECMWF: surface variables and fluxes, single layer, 1-hr avg
catalog.data.gov
cloud.csiss.gmu.edu
+2more
Updated Nov 12, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Atmospheric Radiation Measurement Data Center (2020). ECMWF: surface variables and fluxes, single layer, 1-hr avg [Dataset]. https://catalog.data.gov/dataset/ecmwf-surface-variables-and-fluxes-single-layer-1-hr-avg
Explore at:
Dataset updated
Nov 12, 2020
Dataset provided by
Atmospheric Radiation Measurement Data Center
Description
No description found
Benchmark datasets to study fairness in synthetic data generation
zenodo.org
csv, json
Updated Aug 28, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Joao Fonseca; Joao Fonseca (2024). Benchmark datasets to study fairness in synthetic data generation [Dataset]. http://doi.org/10.5281/zenodo.13385610
Explore at:
csv, jsonAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.13385610
Dataset updated
Aug 28, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Joao Fonseca; Joao Fonseca
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The traveltime dataset is based on the Folktables project covering US census data. The target is a binary variable encoding whether or not the individual needs to travel more than 20 minutes for work; here, having a shorter travel time is the desirable outcome. We use a subset of data from the states of California, Florida, Maine, New York, Utah, and Wyoming states in 2018. Although the folktables dataset does not have any missing values, there are some values recorded as NaN due to the Bureau's data collection methodology. We remove the "esp" column, which encodes the employment status of parents, and has 99.55% missing values. We encode the missing values in the povpip, income to poverty ratio (0.85%), to -1 in accordance to the methodology in Ding et al.. See https://arxiv.org/pdf/2108.04884 for metadata.

The cardio (a) dataset contains patient data recorded during medical examination, including 3 binary features supplied by the patient. The target class denotes the presence of cardiovascular disease. This dataset represents predictive tasks that allocate access to priority medical care for patients, and has been used for fairness evaluations in the domain.

The credit dataset contains historical financial data of borrowers, including past non-serious delinquencies. Here, a serious delinquency is considered to be 90 days past due, and this is the target variable.

The German Credit dataset (https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data) contains financial and personal information regarding loan-seeking applicants.
c
Annual Population Survey Household Dataset, January - December, 2019
datacatalogue.cessda.eu
Updated May 16, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Office for National Statistics (2025). Annual Population Survey Household Dataset, January - December, 2019 [Dataset]. http://doi.org/10.5255/UKDA-SN-8665-1
Explore at:
Unique identifier
https://doi.org/10.5255/UKDA-SN-8665-1
Dataset updated
May 16, 2025
Dataset provided by
Social Survey Division
Authors
Office for National Statistics
Time period covered
Jan 1, 2019 - Dec 31, 2019
Area covered
United Kingdom
Variables measured
Families/households, National
Measurement technique
Face-to-face interview, Telephone interview, Data compiled from households completing the main APS and LFS.
Description
Abstract copyright UK Data Service and data collection copyright owner.
The Annual Population Survey (APS) household datasets are produced annually and are available from 2004 (Special Licence) and 2006 (End User Licence). They allow production of family and household labour market statistics at local areas and for small sub-groups of the population across the UK. The household data comprise key variables from the Labour Force Survey (LFS) and the APS 'person' datasets. The APS household datasets include all the variables on the LFS and APS person datasets, except for the income variables. They also include key family and household-level derived variables. These variables allow for an analysis of the combined economic activity status of the family or household. In addition, they also include more detailed geographical, industry, occupation, health and age variables.
For further detailed information about methodology, users should consult the Labour Force Survey User Guide, included with the APS documentation. For variable and value labelling and coding frames that are not included either in the data or in the current APS documentation, users are advised to consult the latest versions of the LFS User Guides, which are available from the ONS Labour Force Survey - User Guidance webpages.

Occupation data for 2021 and 2022
The ONS has identified an issue with the collection of some occupational data in 2021 and 2022 data files in a number of their surveys. While they estimate any impacts will be small overall, this will affect the accuracy of the breakdowns of some detailed (four-digit Standard Occupational Classification (SOC)) occupations, and data derived from them. None of ONS' headline statistics, other than those directly sourced from occupational data, are affected and you can continue to rely on their accuracy. Further information can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022

End User Licence and Secure Access APS data
Users should note that there are two versions of each APS dataset. One is available under the standard End User Licence (EUL) agreement, and the other is a Secure Access version. The EUL version includes Government Office Region geography, banded age, 3-digit SOC and industry sector for main, second and last job. The Secure Access version contains more detailed variables relating to:
age: single year of age, year and month of birth, age completed full-time education and age obtained highest qualification, age of oldest dependent child and age of youngest dependent child
family unit and household: including a number of variables concerning the number of dependent children in the family according to their ages, relationship to head of household and relationship to head of family
nationality and country of origin
geography: including county, unitary/local authority, place of work, Nomenclature of Territorial Units for Statistics 2 (NUTS2) and NUTS3 regions, and whether lives and works in same local authority district
health: including main health problem, and current and past health problems
education and apprenticeship: including numbers and subjects of various qualifications and variables concerning apprenticeships
industry: including industry, industry class and industry group for main, second and last job, and industry made redundant from
occupation: including 4-digit Standard Occupational Classification (SOC) for main, second and last job and job made redundant from
system variables: including week number when interview took place and number of households at address
The Secure Access data have more restrictive access conditions than those made available under the standard EUL. Prospective users will need to gain ONS Accredited Researcher status, complete an extra application form and demonstrate to the data owners exactly why they need access to the additional variables. Users are strongly advised to first obtain the standard EUL version of the data to see if they are sufficient for their research requirements.
Main Topics:
Topics covered include: household composition and relationships, housing tenure, nationality, ethnicity and residential history, employment and training (including government schemes), workplace and location, job hunting, educational background and qualifications.
4
Data and code for the master thesis: Evaluation of the handling of a...
data.4tu.nl
zip
Updated Jun 16, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Floris van Willigen (2023). Data and code for the master thesis: Evaluation of the handling of a variable dynamics tilting tricycle [Dataset]. http://doi.org/10.4121/7571da71-6844-411b-9a3e-aefb4ef2c19c.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.4121/7571da71-6844-411b-9a3e-aefb4ef2c19c.v1
Dataset updated
Jun 16, 2023
Dataset provided by
4TU.ResearchData
Authors
Floris van Willigen
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The data and Matlab code can be found that support the findings of the master's thesis "Evaluation of the handling of a variable dynamics tilting tricycle".

The objective of the experimental study is to find the configuration of the tilt mechanism of the Dressel tilting tricycle with the optimal handling performance. The matlab code is used to calculate the handling performance from raw data, obtained by gyroscopes. A slalom manoeuvre and a low-speed line following manoeuvre have been performed and the code supplies the processing methods of the data. The results of the repeated trials can be found in the datasets. Also the velocity of the different vehicles can be found in the datasets. Another matlab file is present that was used to optimize the dimensions of the tilt mechanism for a larger tilt limit with a simplified model of the tricycle.
f
Descriptive statistics of sexual violence victim-survivors in the Crime...
plos.figshare.com
xls
Updated Jan 14, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Estela Capelas Barbosa; Niels Blom; Annie Bunce (2025). Descriptive statistics of sexual violence victim-survivors in the Crime Survey for England and Wales (CSEW) and Rape Crisis England & Wales (RCEW) datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0301155.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0301155.t001
Dataset updated
Jan 14, 2025
Dataset provided by
PLOS ONE
Authors
Estela Capelas Barbosa; Niels Blom; Annie Bunce
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Descriptive statistics of sexual violence victim-survivors in the Crime Survey for England and Wales (CSEW) and Rape Crisis England & Wales (RCEW) datasets.
m
Data from: A clustering based forecasting algorithm for multivariable fuzzy...
data.mendeley.com
Updated Oct 31, 2016
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Salar Askari Lasaki (2016). A clustering based forecasting algorithm for multivariable fuzzy time series using linear combinations of independent variables [Dataset]. http://doi.org/10.17632/35fw8pb6s9.1
Explore at:
Unique identifier
https://doi.org/10.17632/35fw8pb6s9.1
Dataset updated
Oct 31, 2016
Authors
Salar Askari Lasaki
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Dear Researcher,

Thank you for using this code and datasets. I explain how CFTS code related to my paper "A clustering based forecasting algorithm for multivariable fuzzy time series using linear combinations of independent variables" published in Applied Soft Computing works. All datasets mentioned in the paper accompanied with CFTS code are included. If there is any question feel free to contact me at: bas_salaraskari@yahoo.com s_askari@aut.ac.ir

Regards,

S. Askari

Guidelines for CFTS algorithm: 1. Open the file CFTS Code using MATLAB. 2. Enter or paste name of the dataset you wish to simulate in line 5 after "load". It loads the dataset in the workplace. 3. Lines 6 and 7: "r" is number of independent variables and "N" is number of data vectors used for training. 4. Line 9: "C" is number of clusters. You can use the optimal number of clusters given in Table 6 of paper or your own preferred value. 5. If line 28 is "comment", covariance norm (Mahalanobis distance) is use and if it is "uncomment", identity norm (Euclidean distance) is used. 6. Please press Ctrl Enter to run the code. 7. For your own dataset, please arrange the data as the datasets described in MS Word file "Read Me".
Treatment Episode Data Set -- Admissions (TEDS-A), 2004
icpsr.umich.edu
healthdata.gov
+4more
ascii, delimited, r +3
Updated Sep 10, 2014
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
United States Department of Health and Human Services. Substance Abuse and Mental Health Services Administration. Office of Applied Studies (2014). Treatment Episode Data Set -- Admissions (TEDS-A), 2004 [Dataset]. http://doi.org/10.3886/ICPSR04431.v11
Explore at:
sas, ascii, delimited, spss, stata, rAvailable download formats
Unique identifier
https://doi.org/10.3886/ICPSR04431.v11
Dataset updated
Sep 10, 2014
Dataset provided by
Inter-university Consortium for Political and Social Researchhttps://www.icpsr.umich.edu/web/pages/
Authors
United States Department of Health and Human Services. Substance Abuse and Mental Health Services Administration. Office of Applied Studies
License
https://www.icpsr.umich.edu/web/ICPSR/studies/4431/termshttps://www.icpsr.umich.edu/web/ICPSR/studies/4431/terms
Time period covered
2004
Area covered
United States
Description
The Treatment Episode Data Set -- Admissions (TEDS-A) is a national census data system of annual admissions to substance abuse treatment facilities. TEDS-A provides annual data on the number and characteristics of persons admitted to public and private substance abuse treatment programs that receive public funding. The unit of analysis is a treatment admission. TEDS consists of data reported to state substance abuse agencies by the treatment programs, which in turn report it to SAMHSA. A sister data system, called the Treatment Episode Data Set -- Discharges (TEDS-D), collects data on discharges from substance abuse treatment facilities. The first year of TEDS-A data is 1992, while the first year of TEDS-D is 2006. TEDS variables that are required to be reported are called the "Minimum Data Set (MDS)", while those that are optional are called the "Supplemental Data Set (SuDS)". Variables in the MDS include: information on service setting, number of prior treatments, primary source of referral, gender, race, ethnicity, education, employment status, substance(s) abused, route of administration, frequency of use, age at first use, and whether methadone was prescribed in treatment. Supplemental variables include: diagnosis codes, presence of psychiatric problems, living arrangements, source of income, health insurance, expected source of payment, pregnancy and veteran status, marital status, detailed not in labor force codes, detailed criminal justice referral codes, days waiting to enter treatment, and the number of arrests in the 30 days prior to admissions (starting in 2008). Substances abused include alcohol, cocaine and crack, marijuana and hashish, heroin, nonprescription methadone, other opiates and synthetics, PCP, other hallucinogens, methamphetamine, other amphetamines, other stimulants, benzodiazepines, other non-benzodiazepine tranquilizers, barbiturates, other non-barbiturate sedatives or hypnotics, inhalants, over-the-counter medications, and other substances. Created variables include total number of substances reported, intravenous drug use (IDU), and flags for any mention of specific substances.

Facebook

Twitter

Click to copy link

Link copied

Cite

Owen Bodger; Aidan Byrne; Philip A. Evans; Sarah Rees; Gwen Jones; Claire Cowell; Mike B. Gravenor; Rhys Williams (2023). Summary of variables of the data set included in the analysis. [Dataset]. http://doi.org/10.1371/journal.pone.0027161.t001

Summary of variables of the data set included in the analysis.

Explore at:

xlsAvailable download formats

Unique identifier

https://doi.org/10.1371/journal.pone.0027161.t001

Dataset updated

Jun 8, 2023

Dataset provided by

PLOS ONE

Authors

Owen Bodger; Aidan Byrne; Philip A. Evans; Sarah Rees; Gwen Jones; Claire Cowell; Mike B. Gravenor; Rhys Williams

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Footnote: (f) denotes a categorical variable, (c) a continuous covariate and (n) a nominal variable.

Clear search

Close search

Google apps

Main menu

Summary of variables of the data set included in the analysis.

A Dataset of Water Quality and Related Variables in U.S. Reservoirs

Scaled Dataset.xlsx

Performance vs. Predicted Performance

**Description of *original_data.csv***

**Description of *CompleteDataAndBiases.csv***

Synthetic Data for an Imaginary Country, Sample, 2023 - World

Abstract

Geographic coverage

Analysis unit

Universe

Kind of data

Sampling procedure

Mode of data collection

Research instrument

Cleaning operations

Response rate

Data from: S1 Dataset -

Data from: A Sensitivity Analysis of Methodological Variables Associated...

University SET data, with faculty and courses characteristics

Global Dataset of Cyber Incidents V.1.2

Data for comparison of climate envelope models developed using...

Predictor variables for the Taiwan credit data.

Background data for: Latent-variable modeling of ordinal outcomes in...

2023 Census totals by topic for dwellings by statistical area 1

ECMWF: surface variables and fluxes, single layer, 1-hr avg

Benchmark datasets to study fairness in synthetic data generation

Annual Population Survey Household Dataset, January - December, 2019

Data and code for the master thesis: Evaluation of the handling of a...

Descriptive statistics of sexual violence victim-survivors in the Crime...

Data from: A clustering based forecasting algorithm for multivariable fuzzy...

Treatment Episode Data Set -- Admissions (TEDS-A), 2004

Summary of variables of the data set included in the analysis.