34 datasets found
  1. Data from: Missing data estimation in morphometrics: how much is too much?

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Dec 5, 2013
    Cite
    Julien Clavel; Gildas Merceron; Gilles Escarguel (2013). Missing data estimation in morphometrics: how much is too much? [Dataset]. http://doi.org/10.5061/dryad.f0b50
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 5, 2013
    Dataset provided by
    Centre National de la Recherche Scientifique
    Authors
    Julien Clavel; Gildas Merceron; Gilles Escarguel
    License

    CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)

    Description

    Fossil-based estimates of diversity and evolutionary dynamics mainly rely on the study of morphological variation. Unfortunately, organism remains are often altered by post-mortem taphonomic processes such as weathering or distortion. Such a loss of information often prevents quantitative multivariate description and statistically controlled comparisons of extinct species based on morphometric data. A common way to deal with missing data involves imputation methods that directly fill the missing cases with model estimates. Over the last few years, several empirically determined thresholds for the maximum acceptable proportion of missing values have been proposed in the literature, whereas other studies have shown that this limit actually depends on several properties of the study dataset and of the selected imputation method, and is in no way generalizable. We evaluate the relative performances of seven multiple imputation techniques through a simulation-based analysis under three distinct patterns of missing data distribution. Overall, Fully Conditional Specification and Expectation-Maximization algorithms provide the best compromises between imputation accuracy and coverage probability. Multiple imputation (MI) techniques appear remarkably robust to the violation of basic assumptions such as the occurrence of taxonomically or anatomically biased patterns of missing data distribution, making differences in simulation results between the three patterns of missing data distribution much smaller than differences between the individual MI techniques. Based on these results, rather than proposing a new (set of) threshold value(s), we develop an approach combining the use of multiple imputations with Procrustean superimposition of principal component analysis results, in order to directly visualize the effect of individual missing data imputation on an ordinated space. We provide an R function for users to implement the proposed procedure.
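
    A rough Python illustration of the same idea (an assumption on my part; the authors provide an R function, not this code): draw several imputations, ordinate each completed dataset with PCA, and Procrustes-align the ordinations so the scatter of each specimen across imputations can be inspected. `X`, the number of imputations, and the use of scikit-learn's IterativeImputer are all illustrative choices.

    ```python
    # Hedged sketch (not the authors' R function): visualize how missing-value imputation
    # perturbs an ordination by combining multiple imputation with Procrustes-aligned PCA.
    # Assumes a NumPy array X of morphometric measurements with np.nan for missing entries.
    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.decomposition import PCA
    from scipy.spatial import procrustes

    def imputation_ordinations(X, n_imputations=20, n_components=2, seed=0):
        scores = []
        for m in range(n_imputations):
            imputer = IterativeImputer(sample_posterior=True, random_state=seed + m)
            scores.append(PCA(n_components=n_components).fit_transform(imputer.fit_transform(X)))
        # Superimpose every ordination onto the first one so that arbitrary rotation,
        # reflection, and scaling differences between imputations are removed.
        reference = scores[0]
        return [procrustes(reference, s)[1] for s in scores]
    ```

    Plotting each specimen's aligned coordinates across the returned ordinations gives a direct picture of how sensitive its position is to the imputation of its missing values.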

  2. Data Driven Estimation of Imputation Error—A Strategy for Imputation with a...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    pdf
    Updated Jun 1, 2023
    Cite
    Nikolaj Bak; Lars K. Hansen (2023). Data Driven Estimation of Imputation Error—A Strategy for Imputation with a Reject Option [Dataset]. http://doi.org/10.1371/journal.pone.0164464
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Nikolaj Bak; Lars K. Hansen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Missing data is a common problem in many research fields and is a challenge that always needs careful consideration. One approach is to impute the missing values, i.e., replace missing values with estimates. When imputation is applied, it is typically applied to all records with missing values indiscriminately. We note that the effects of imputation can be strongly dependent on what is missing. To help make decisions about which records should be imputed, we propose to use a machine learning approach to estimate the imputation error for each case with missing data. The method is intended as a practical aid for users applying imputation once the informed choice to impute the missing data has been made. To do this, all patterns of missing values are simulated in all complete cases, enabling calculation of the “true error” in each of these new cases. The error is then estimated for each case with missing values by weighting the “true errors” by similarity. The method can also be used to test the performance of different imputation methods. A universal numerical threshold of acceptable error cannot be set, since this will differ according to the data, research question, and analysis method. The effect of the threshold can be estimated using the complete cases. The user can set an a priori relevant threshold for what is acceptable or use cross-validation with the final analysis to choose the threshold. The choice can be presented along with argumentation for it rather than holding to conventions that might not be warranted in the specific dataset.
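
    A minimal sketch of the general strategy described above (my own illustration, not the authors' implementation): for one incomplete record, its missingness pattern is imposed on each complete case, the resulting "true" imputation errors are computed, and they are averaged with weights based on similarity to the target on the observed features. The imputer, the distance, and the kernel bandwidth are all placeholders.

    ```python
    # Hedged sketch of data-driven imputation-error estimation with a reject option.
    # X is a NumPy array with np.nan for missing values; target_idx indexes one
    # incomplete row whose expected imputation error we want to estimate.
    import numpy as np
    from sklearn.impute import KNNImputer  # any imputer of interest could be swapped in

    def estimated_imputation_error(X, target_idx, imputer=None, bandwidth=1.0):
        imputer = imputer if imputer is not None else KNNImputer(n_neighbors=5)
        miss = np.isnan(X[target_idx])
        obs = ~miss
        complete = X[~np.isnan(X).any(axis=1)]          # fully observed cases
        errors, weights = [], []
        for i in range(len(complete)):
            corrupted = complete.copy()
            corrupted[i, miss] = np.nan                  # simulate the target's pattern on case i
            filled = imputer.fit_transform(corrupted)
            true_error = np.sqrt(np.mean((filled[i, miss] - complete[i, miss]) ** 2))
            # weight case i by its similarity to the target on the features both observe
            dist = np.linalg.norm(complete[i, obs] - X[target_idx, obs])
            errors.append(true_error)
            weights.append(np.exp(-dist / bandwidth))
        return np.average(errors, weights=weights)       # compare against a chosen reject threshold
    ```

    The returned estimate can then be compared with a threshold chosen a priori or via cross-validation, as the description suggests, to decide whether to impute or reject the record.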

  3. Data from: A multiple imputation method using population information

    • tandf.figshare.com
    pdf
    Updated Apr 30, 2025
    Cite
    Tadayoshi Fushiki (2025). A multiple imputation method using population information [Dataset]. http://doi.org/10.6084/m9.figshare.28900017.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Apr 30, 2025
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Tadayoshi Fushiki
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Multiple imputation (MI) is effectively used to deal with missing data when the missing mechanism is missing at random. However, MI may not be effective when the missing mechanism is not missing at random (NMAR). In such cases, additional information is required to obtain an appropriate imputation. Pham et al. (2019) proposed the calibrated-δ adjustment method, which is a multiple imputation method using population information. It provides appropriate imputation in two NMAR settings. However, the calibrated-δ adjustment method has two problems. First, it can be used only when one variable has missing values. Second, the theoretical properties of the variance estimator have not been provided. This article proposes a multiple imputation method using population information that can be applied when several variables have missing values. The proposed method is proven to include the calibrated-δ adjustment method. It is shown that the proposed method provides a consistent estimator for the parameter of the imputation model in an NMAR situation. The asymptotic variance of the estimator obtained by the proposed method and its estimator are also given.

  4. Data from: Evaluating Supplemental Samples in Longitudinal Research:...

    • tandf.figshare.com
    txt
    Updated Feb 9, 2024
    Cite
    Laura K. Taylor; Xin Tong; Scott E. Maxwell (2024). Evaluating Supplemental Samples in Longitudinal Research: Replacement and Refreshment Approaches [Dataset]. http://doi.org/10.6084/m9.figshare.12162072.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 9, 2024
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Laura K. Taylor; Xin Tong; Scott E. Maxwell
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Despite the wide application of longitudinal studies, they are often plagued by missing data and attrition. The majority of methodological approaches focus on participant retention or modern missing data analysis procedures. This paper, however, takes a new approach by examining how researchers may supplement the sample with additional participants. First, refreshment samples use the same selection criteria as the initial study. Second, replacement samples identify auxiliary variables that may help explain patterns of missingness and select new participants based on those characteristics. A simulation study compares these two strategies for a linear growth model with five measurement occasions. Overall, the results suggest that refreshment samples lead to less relative bias, greater relative efficiency, and more acceptable coverage rates than replacement samples or not supplementing the missing participants in any way. Refreshment samples also have high statistical power. The comparative strengths of the refreshment approach are further illustrated through a real data example. These findings have implications for assessing change over time when researching at-risk samples with high levels of permanent attrition.

  5. Data from: A Comparison of FIML- versus Multiple-imputation-based methods to...

    • tandf.figshare.com
    docx
    Updated Feb 26, 2024
    Cite
    Yu Liu; Suppanut Sriutaisuk (2024). A Comparison of FIML- versus Multiple-imputation-based methods to test measurement invariance with incomplete ordinal variables [Dataset]. http://doi.org/10.6084/m9.figshare.14062423.v1
    Explore at:
    Available download formats: docx
    Dataset updated
    Feb 26, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Yu Liu; Suppanut Sriutaisuk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    To ensure meaningful comparison of test scores across groups or time, measurement invariance (i.e., invariance of the general factor structure and the values of the measurement parameters) across groups or time must be examined. However, many empirical examinations of measurement invariance of psychological/educational questionnaires need to address two issues: Using the appropriate model for ordinal variables (e.g., Likert scale items), and handling missing data. In two Monte Carlo simulations, this study examined the performance of one full-information-maximum-likelihood-based method and five multiple-imputation-based methods to obtain tests of measurement invariance across groups for ordinal variables that have missing data. Our results indicate that the full-information-maximum-likelihood-based method and one of the multiple-imputation-based methods generally have better performance than the other examined methods, though they also have their own limitations.

  6. Dataset for: Avoiding pitfalls when combining multiple imputation and...

    • wiley.figshare.com
    docx
    Updated Jun 2, 2023
    Cite
    Emily Granger; Jamie Sergeant; Mark Lunt (2023). Dataset for: Avoiding pitfalls when combining multiple imputation and propensity scores [Dataset]. http://doi.org/10.6084/m9.figshare.9253178.v1
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Wiley (https://www.wiley.com/)
    Authors
    Emily Granger; Jamie Sergeant; Mark Lunt
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Overcoming bias due to confounding and missing data is challenging when analysing observational data. Propensity scores are commonly used to account for the first problem and multiple imputation for the latter. Unfortunately, it is not known how best to proceed when both techniques are required. We investigate whether two different approaches to combining propensity scores and multiple imputation (Across and Within) lead to differences in the accuracy or precision of exposure effect estimates. Both approaches start by imputing missing values multiple times. Propensity scores are then estimated for each resulting dataset. Using the Across approach, the mean propensity score across imputations for each subject is used in a single subsequent analysis. Alternatively, the Within approach uses propensity scores individually to obtain exposure effect estimates in each imputation, which are combined to produce an overall estimate. These approaches were compared in a series of Monte Carlo simulations and applied to data from the British Society for Rheumatology Biologics Register. Results indicated that the Within approach produced unbiased estimates with appropriate confidence intervals, whereas the Across approach produced biased results and unrealistic confidence intervals. Researchers are encouraged to implement the Within approach when conducting propensity score analyses with incomplete data.
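
    As a concrete, hypothetical illustration of the two pooling strategies compared above, the sketch below imputes the covariates several times, fits a propensity model in each completed dataset, and then estimates the exposure effect either per imputation (Within) or from the averaged propensity score (Across). The variable names (`X`, `t`, `y`), the logistic/linear models, and covariate adjustment on the propensity score are my assumptions, not details taken from the paper.

    ```python
    # Hedged sketch of the "Within" vs. "Across" ways of combining multiple imputation
    # with propensity scores; X = covariates with np.nan, t = binary exposure, y = outcome.
    import numpy as np
    import statsmodels.api as sm
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    def within_and_across(X, t, y, m=20, seed=0):
        completed = [IterativeImputer(sample_posterior=True, random_state=seed + i).fit_transform(X)
                     for i in range(m)]
        # One propensity score model per imputed dataset.
        ps = [sm.Logit(t, sm.add_constant(d)).fit(disp=0).predict() for d in completed]

        # Within: estimate the exposure effect separately in each imputation, using that
        # imputation's own propensity score, then pool (here only the point estimates;
        # Rubin's rules would also pool the variances).
        within = [sm.OLS(y, sm.add_constant(np.column_stack([t, p]))).fit().params[1] for p in ps]
        within_estimate = np.mean(within)

        # Across: average the propensity scores over imputations and run a single analysis
        # (included only for comparison; the paper reports this approach is biased).
        p_bar = np.mean(ps, axis=0)
        across_estimate = sm.OLS(y, sm.add_constant(np.column_stack([t, p_bar]))).fit().params[1]
        return within_estimate, across_estimate
    ```

    The key design difference is only where the averaging happens: over propensity scores before a single analysis (Across) or over per-imputation effect estimates (Within), which is the approach the authors recommend.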

  7. Experimental Dataset on the Impact of Unfair Behavior by AI and Humans on...

    • scidb.cn
    Updated Apr 30, 2025
    Cite
    Yang Luo (2025). Experimental Dataset on the Impact of Unfair Behavior by AI and Humans on Trust: Evidence from Six Experimental Studies [Dataset]. http://doi.org/10.57760/sciencedb.psych.00565
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 30, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Yang Luo
    Description

    This dataset originates from a series of experimental studies titled “Tough on People, Tolerant to AI? Differential Effects of Human vs. AI Unfairness on Trust”. The project investigates how individuals respond to unfair behavior (distributive, procedural, and interactional unfairness) enacted by artificial intelligence versus human agents, and how such behavior affects cognitive and affective trust.

    1. Experiment 1a: The Impact of AI vs. Human Distributive Unfairness on Trust
    Overview: This dataset comes from an experimental study aimed at examining how individuals respond in terms of cognitive and affective trust when distributive unfairness is enacted by either an artificial intelligence (AI) agent or a human decision-maker. Experiment 1a specifically focuses on the main effect of the “type of decision-maker” on trust.
    Data Generation and Processing: The data were collected through Credamo, an online survey platform. Initially, 98 responses were gathered from students at a university in China. Additional student participants were recruited via Credamo to supplement the sample. Attention check items were embedded in the questionnaire, and participants who failed were automatically excluded in real time. Data collection continued until 202 valid responses were obtained. SPSS software was used for data cleaning and analysis.
    Data Structure and Format: The data file is named “Experiment1a.sav” and is in SPSS format. It contains 28 columns and 202 rows, where each row corresponds to one participant. Columns represent measured variables, including: grouping and randomization variables, one manipulation check item, four items measuring distributive fairness perception, six items on cognitive trust, five items on affective trust, three items for honesty checks, and four demographic variables (gender, age, education, and grade level). The final three columns contain computed means for distributive fairness, cognitive trust, and affective trust.
    Additional Information: No missing data are present. All variable names are labeled in English abbreviations to facilitate further analysis. The dataset can be directly opened in SPSS or exported to other formats.

    2. Experiment 1b: The Mediating Role of Perceived Ability and Benevolence (Distributive Unfairness)
    Overview: This dataset originates from an experimental study designed to replicate the findings of Experiment 1a and further examine the potential mediating role of perceived ability and perceived benevolence.
    Data Generation and Processing: Participants were recruited via the Credamo online platform. Attention check items were embedded in the survey to ensure data quality. Data were collected using a rolling recruitment method, with invalid responses removed in real time. A total of 228 valid responses were obtained.
    Data Structure and Format: The dataset is stored in a file named Experiment1b.sav in SPSS format and can be directly opened in SPSS software. It consists of 228 rows and 40 columns. Each row represents one participant’s data record, and each column corresponds to a different measured variable. Specifically, the dataset includes: random assignment and grouping variables; one manipulation check item; four items measuring perceived distributive fairness; six items on perceived ability; five items on perceived benevolence; six items on cognitive trust; five items on affective trust; three items for attention check; and three demographic variables (gender, age, and education). The last five columns contain the computed mean scores for perceived distributive fairness, ability, benevolence, cognitive trust, and affective trust.
    Additional Notes: There are no missing values in the dataset. All variables are labeled using standardized English abbreviations to facilitate reuse and secondary analysis. The file can be analyzed directly in SPSS or exported to other formats as needed.

    3. Experiment 2a: Differential Effects of AI vs. Human Procedural Unfairness on Trust
    Overview: This dataset originates from an experimental study aimed at examining whether individuals respond differently in terms of cognitive and affective trust when procedural unfairness is enacted by artificial intelligence versus human decision-makers. Experiment 2a focuses on the main effect of the decision agent on trust outcomes.
    Data Generation and Processing: Participants were recruited via the Credamo online survey platform from two universities located in different regions of China. A total of 227 responses were collected. After excluding those who failed the attention check items, 204 valid responses were retained for analysis. Data were processed and analyzed using SPSS software.
    Data Structure and Format: The dataset is stored in a file named Experiment2a.sav in SPSS format and can be directly opened in SPSS software. It contains 204 rows and 30 columns. Each row represents one participant’s response record, while each column corresponds to a specific variable. Variables include: random assignment and grouping; one manipulation check item; seven items measuring perceived procedural fairness; six items on cognitive trust; five items on affective trust; three attention check items; and three demographic variables (gender, age, and education). The final three columns contain computed average scores for procedural fairness, cognitive trust, and affective trust.
    Additional Notes: The dataset contains no missing values. All variables are labeled using standardized English abbreviations to facilitate reuse and secondary analysis. The file can be directly analyzed in SPSS or exported to other formats as needed.

    4. Experiment 2b: Mediating Role of Perceived Ability and Benevolence (Procedural Unfairness)
    Overview: This dataset comes from an experimental study designed to replicate the findings of Experiment 2a and to further examine the potential mediating roles of perceived ability and perceived benevolence in shaping trust responses under procedural unfairness.
    Data Generation and Processing: Participants were working adults recruited through the Credamo online platform. A rolling data collection strategy was used, where responses failing attention checks were excluded in real time. The final dataset includes 235 valid responses. All data were processed and analyzed using SPSS software.
    Data Structure and Format: The dataset is stored in a file named Experiment2b.sav, which is in SPSS format and can be directly opened using SPSS software. It contains 235 rows and 43 columns. Each row corresponds to a single participant, and each column represents a specific measured variable. These include: random assignment and group labels; one manipulation check item; seven items measuring procedural fairness; six items for perceived ability; five items for perceived benevolence; six items for cognitive trust; five items for affective trust; three attention check items; and three demographic variables (gender, age, education). The final five columns contain the computed average scores for procedural fairness, perceived ability, perceived benevolence, cognitive trust, and affective trust.
    Additional Notes: There are no missing values in the dataset. All variables are labeled using standardized English abbreviations to support future reuse and secondary analysis. The dataset can be directly analyzed in SPSS and easily converted into other formats if needed.

    5. Experiment 3a: Effects of AI vs. Human Interactional Unfairness on Trust
    Overview: This dataset comes from an experimental study that investigates how interactional unfairness, when enacted by either artificial intelligence or human decision-makers, influences individuals’ cognitive and affective trust. Experiment 3a focuses on the main effect of the “decision-maker type” under interactional unfairness conditions.
    Data Generation and Processing: Participants were college students recruited from two universities in different regions of China through the Credamo survey platform. After excluding responses that failed attention checks, a total of 203 valid cases were retained from an initial pool of 223 responses. All data were processed and analyzed using SPSS software.
    Data Structure and Format: The dataset is stored in the file named Experiment3a.sav, in SPSS format and compatible with SPSS software. It contains 203 rows and 27 columns. Each row represents a single participant, while each column corresponds to a specific measured variable. These include: random assignment and condition labels; one manipulation check item; four items measuring interactional fairness perception; six items for cognitive trust; five items for affective trust; three attention check items; and three demographic variables (gender, age, education). The final three columns contain computed average scores for interactional fairness, cognitive trust, and affective trust.
    Additional Notes: There are no missing values in the dataset. All variable names are provided using standardized English abbreviations to facilitate secondary analysis. The data can be directly analyzed using SPSS and exported to other formats as needed.

    6. Experiment 3b: The Mediating Role of Perceived Ability and Benevolence (Interactional Unfairness)
    Overview: This dataset comes from an experimental study designed to replicate the findings of Experiment 3a and further examine the potential mediating roles of perceived ability and perceived benevolence under conditions of interactional unfairness.
    Data Generation and Processing: Participants were working adults recruited via the Credamo platform. Attention check questions were embedded in the survey, and responses that failed these checks were excluded in real time. Data collection proceeded in a rolling manner until a total of 227 valid responses were obtained. All data were processed and analyzed using SPSS software.
    Data Structure and Format: The dataset is stored in the file named Experiment3b.sav, in SPSS format and compatible with SPSS software. It includes 227 rows and

  8. Cafe Sales - Dirty Data for Cleaning Training

    • kaggle.com
    zip
    Updated Jan 17, 2025
    Cite
    Ahmed Mohamed (2025). Cafe Sales - Dirty Data for Cleaning Training [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/cafe-sales-dirty-data-for-cleaning-training
    Explore at:
    Available download formats: zip (113,510 bytes)
    Dataset updated
    Jan 17, 2025
    Authors
    Ahmed Mohamed
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dirty Cafe Sales Dataset

    Overview

    The Dirty Cafe Sales dataset contains 10,000 rows of synthetic data representing sales transactions in a cafe. This dataset is intentionally "dirty," with missing values, inconsistent data, and errors introduced to provide a realistic scenario for data cleaning and exploratory data analysis (EDA). It can be used to practice cleaning techniques, data wrangling, and feature engineering.

    File Information

    • File Name: dirty_cafe_sales.csv
    • Number of Rows: 10,000
    • Number of Columns: 8

    Column Descriptions

    • Transaction ID: A unique identifier for each transaction. Always present and unique. Example: TXN_1234567
    • Item: The name of the item purchased. May contain missing or invalid values (e.g., "ERROR"). Examples: Coffee, Sandwich
    • Quantity: The quantity of the item purchased. May contain missing or invalid values. Examples: 1, 3, UNKNOWN
    • Price Per Unit: The price of a single unit of the item. May contain missing or invalid values. Examples: 2.00, 4.00
    • Total Spent: The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. Examples: 8.00, 12.00
    • Payment Method: The method of payment used. May contain missing or invalid values (e.g., None, "UNKNOWN"). Examples: Cash, Credit Card
    • Location: The location where the transaction occurred. May contain missing or invalid values. Examples: In-store, Takeaway
    • Transaction Date: The date of the transaction. May contain missing or incorrect values. Example: 2023-01-01

    Data Characteristics

    1. Missing Values:

      • Some columns (e.g., Item, Payment Method, Location) may contain missing values represented as None or empty cells.
    2. Invalid Values:

      • Some rows contain invalid entries like "ERROR" or "UNKNOWN" to simulate real-world data issues.
    3. Price Consistency:

      • Prices for menu items are consistent but may have missing or incorrect values introduced.

    Menu Items

    The dataset includes the following menu items with their respective price ranges:

    • Coffee: $2
    • Tea: $1.50
    • Sandwich: $4
    • Salad: $5
    • Cake: $3
    • Cookie: $1
    • Smoothie: $4
    • Juice: $3

    Use Cases

    This dataset is suitable for:

    • Practicing data cleaning techniques such as handling missing values, removing duplicates, and correcting invalid entries.
    • Exploring EDA techniques like visualizations and summary statistics.
    • Performing feature engineering for machine learning workflows.

    Cleaning Steps Suggestions

    To clean this dataset, consider the following steps (a minimal pandas sketch follows this list):

    1. Handle Missing Values:

      • Fill missing numeric values with the median or mean.
      • Replace missing categorical values with the mode or "Unknown."
    2. Handle Invalid Values:

      • Replace invalid entries like "ERROR" and "UNKNOWN" with NaN or appropriate values.
    3. Date Consistency:

      • Ensure all dates are in a consistent format.
      • Fill missing dates with plausible values based on nearby records.
    4. Feature Engineering:

      • Create new columns, such as Day of the Week or Transaction Month, for further analysis.
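
    A minimal cleaning sketch along those lines, assuming pandas and the column names listed above; the specific fill choices (median, "Unknown") are just the suggestions from the list, not the only reasonable ones.

    ```python
    # Hedged sketch of the suggested cleaning steps for dirty_cafe_sales.csv.
    import pandas as pd

    df = pd.read_csv("dirty_cafe_sales.csv")

    # Steps 1-2: treat placeholder strings as missing, then fill simple defaults.
    df = df.replace(["ERROR", "UNKNOWN", ""], pd.NA)
    for col in ["Quantity", "Price Per Unit", "Total Spent"]:
        df[col] = pd.to_numeric(df[col], errors="coerce")
        df[col] = df[col].fillna(df[col].median())
    for col in ["Item", "Payment Method", "Location"]:
        df[col] = df[col].fillna("Unknown")

    # Step 3: normalize dates (unparseable values become NaT and could later be filled from nearby rows).
    df["Transaction Date"] = pd.to_datetime(df["Transaction Date"], errors="coerce")

    # Step 4: feature engineering.
    df["Day of Week"] = df["Transaction Date"].dt.day_name()
    df["Transaction Month"] = df["Transaction Date"].dt.to_period("M")
    ```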

    License

    This dataset is released under the CC BY-SA 4.0 License. You are free to use, share, and adapt it, provided you give appropriate credit.

    Feedback

    If you have any questions or feedback, feel free to reach out through the dataset's discussion board on Kaggle.

  9. Replication Data for: Democratization and Gini index: Panel data analysis...

    • dataverse.harvard.edu
    • search.datacite.org
    Updated Apr 13, 2019
    Cite
    LEIZHEN ZANG; Xiong Feng (2019). Replication Data for: Democratization and Gini index: Panel data analysis based on random forest method [Dataset]. http://doi.org/10.7910/DVN/W2CXVU
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 13, 2019
    Dataset provided by
    Harvard Dataverse
    Authors
    LEIZHEN ZANG; Xiong Feng
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The mechanism linking democratic development and the wealth gap has long been a focus of political and economic research, yet with no consistent conclusion. The reasons often are (1) the difficulty of generalizing results obtained from single-country time-series studies or from multinational cross-sectional analyses, and (2) deviations in research results caused by missing values or variable selection in panel data analysis. Two factors contribute to the latter: the accuracy of estimation suffers from the presence of missing values in variables, and the subjective discretion that must be exercised to select suitable proxies among many candidates is likely to cause variable selection bias. To address these problems, this study pioneers the use of a machine learning method on this topic, imputing missing values efficiently with a random forest model, and analyzes cross-country data from 151 countries covering the period 1993–2017. Because this paper measures the importance of different variables to the dependent variable, more appropriate and important variables can be selected to construct a complete regression model. Results from different models come to a consensus that the promotion of democracy can significantly narrow the gap between the rich and the poor, with a marginally decreasing effect with respect to wealth. In addition, the study finds that this mechanism exists only in non-colonial nations or presidential states. Finally, this paper discusses the potential theoretical and policy implications of the results.
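
    The description mentions random-forest-based imputation of missing values; a hedged Python analogue (the authors' actual implementation is not specified here) is to wrap a random forest inside an iterative imputer, in the spirit of missForest. The `panel` DataFrame and parameter choices are illustrative.

    ```python
    # Hedged sketch of random-forest imputation for a country-year panel of numeric
    # indicators (missForest-style); `panel` is a pandas DataFrame with NaNs.
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    def rf_impute(panel: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
        imputer = IterativeImputer(
            estimator=RandomForestRegressor(n_estimators=100, random_state=seed),
            max_iter=10,
            random_state=seed,
        )
        filled = imputer.fit_transform(panel)
        return pd.DataFrame(filled, index=panel.index, columns=panel.columns)

    # Variable importance for the outcome (e.g., the Gini index) can then be read from a
    # random forest fit on the completed panel via its feature_importances_ attribute.
    ```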

  10. Labour Force Survey Household Datasets, 2002-2024: Secure Access

    • datacatalogue.ukdataservice.ac.uk
    Updated Aug 28, 2025
    Cite
    Office for National Statistics, Social Survey Division (2025). Labour Force Survey Household Datasets, 2002-2024: Secure Access [Dataset]. http://doi.org/10.5255/UKDA-SN-7674-17
    Explore at:
    Dataset updated
    Aug 28, 2025
    Dataset provided by
    UK Data Service (https://ukdataservice.ac.uk/)
    Authors
    Office for National Statistics, Social Survey Division
    Area covered
    United Kingdom
    Description

    Background

    The Labour Force Survey (LFS) is a unique source of information using international definitions of employment and unemployment and economic inactivity, together with a wide range of related topics such as occupation, training, hours of work and personal characteristics of household members aged 16 years and over. It is used to inform social, economic and employment policy. The LFS was first conducted biennially from 1973-1983. Between 1984 and 1991 the survey was carried out annually and consisted of a quarterly survey conducted throughout the year and a 'boost' survey in the spring quarter (data were then collected seasonally). From 1992 quarterly data were made available, with a quarterly sample size approximately equivalent to that of the previous annual data. The survey then became known as the Quarterly Labour Force Survey (QLFS). From December 1994, data gathering for Northern Ireland moved to a full quarterly cycle to match the rest of the country, so the QLFS then covered the whole of the UK (though some additional annual Northern Ireland LFS datasets are also held at the UK Data Archive). Further information on the background to the QLFS may be found in the documentation.

    New reweighting policy
    Following the new reweighting policy, the ONS reviewed the latest population estimates made available during 2019 and decided not to carry out a 2019 LFS and APS reweighting exercise. Therefore, the next reweighting exercise will take place in 2020. It will incorporate the 2019 Sub-National Population Projection data (published in May 2020) and the 2019 Mid-Year Estimates (published in June 2020). It is expected that reweighted Labour Market aggregates and microdata will be published towards the end of 2020/early 2021.

    Secure Access QLFS household data
    Up to 2015, the LFS household datasets were produced twice a year (April-June and October-December) from the corresponding quarter's individual-level data. From January 2015 onwards, they are now produced each quarter alongside the main QLFS. The household datasets include all the usual variables found in the individual-level datasets, with the exception of those relating to income, and are intended to facilitate the analysis of the economic activity patterns of whole households. It is recommended that the existing individual-level LFS datasets continue to be used for any analysis at individual level, and that the LFS household datasets be used for analysis involving household or family-level data. For some quarters, users should note that all missing values in the data are set to one '-10' category instead of the separate '-8' and '-9' categories. For that period, the ONS introduced a new imputation process for the LFS household datasets and it was necessary to code the missing values into one new combined category ('-10'), to avoid over-complication. From the 2013 household datasets, the standard -8 and -9 missing categories have been reinstated.

    Secure Access household datasets for the QLFS are available from 2002 onwards, and include additional, detailed variables not included in the standard 'End User Licence' (EUL) versions. Extra variables that typically can be found in the Secure Access versions but not in the EUL versions relate to: geography; date of birth, including day; education and training; household and family characteristics; employment; unemployment and job hunting; accidents at work and work-related health problems; nationality, national identity and country of birth; occurrence of learning difficulty or disability; and benefits.

    Prospective users of a Secure Access version of the QLFS will need to fulfil additional requirements, commencing with the completion of an extra application form to demonstrate to the data owners exactly why they need access to the extra, more detailed variables, in order to obtain permission to use that version. Secure Access users must also complete face-to-face training and agree to Secure Access' User Agreement (see 'Access' section below). Therefore, users are encouraged to download and inspect the EUL version of the data prior to ordering the Secure Access version.

    LFS Documentation
    The documentation available from the Archive to accompany LFS datasets largely consists of each volume of the User Guide including the appropriate questionnaires for the years concerned. However, LFS volumes are updated periodically by ONS, so users are advised to check the ONS LFS User Guidance pages before commencing analysis.

    The study documentation presented in the Documentation section includes the most recent documentation for the LFS only, due to available space. Documentation for previous years is provided alongside the data for access and is also available upon request.

    Review of imputation methods for LFS Household data - changes to missing values
    A review of the imputation methods used in LFS Household and Family analysis resulted in a change from the January-March 2015 quarter onwards. It was no longer considered appropriate to impute any personal characteristic variables (e.g. religion, ethnicity, country of birth, nationality, national identity, etc.) using the LFS donor imputation method. This method is primarily focused on ensuring that the 'economic status' of all individuals within a household is known, allowing analysis of the combined economic status of households. This means that from 2015 larger amounts of missing values ('-8'/'-9') will be present in the data for these personal characteristic variables than before. Therefore, if users need to carry out any time series analysis of households/families that also includes personal characteristic variables covering this time period, it is advised to filter out 'ioutcome=3' cases from all periods to remove this inconsistent treatment of non-responders.
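
    A small, hedged pandas sketch of that recoding and filtering, assuming the household file has already been read into a DataFrame with the ONS missing-value codes (-8, -9, -10) present as plain numbers; `ioutcome` is the variable named above.

    ```python
    # Hedged sketch: treat the ONS missing-value codes as missing and drop non-responding
    # household members before time-series analysis of personal characteristics.
    import pandas as pd

    def prepare_lfs_household(df: pd.DataFrame) -> pd.DataFrame:
        df = df.replace({-8: pd.NA, -9: pd.NA, -10: pd.NA})   # missing-value categories
        return df[df["ioutcome"] != 3]                         # filter out non-responders, as advised
    ```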

    Variables DISEA and LNGLST
    Dataset A08 (Labour market status of disabled people) which ONS suspended due to an apparent discontinuity between April to June 2017 and July to September 2017 is now available. As a result of this apparent discontinuity and the inconclusive investigations at this stage, comparisons should be made with caution between April to June 2017 and subsequent time periods. However users should note that the estimates are not seasonally adjusted, so some of the change between quarters could be due to seasonality. Further recommendations on historical comparisons of the estimates will be given in November 2018 when ONS are due to publish estimates for July to September 2018.

    An article explaining the quality assurance investigations that have been conducted so far is available on the ONS Methodology webpage. For any queries about Dataset A08 please email Labour.Market@ons.gov.uk.

    Latest Edition Information
    For the seventeenth edition (August 2025), one quarterly data file covering the time period July-September, 2024 has been added to the study.

  11. Housing Benefit recoveries and fraud data April 2012 to March 2013

    • gov.uk
    Updated Sep 11, 2013
    Cite
    Department for Work and Pensions (2013). Housing Benefit recoveries and fraud data April 2012 to March 2013 [Dataset]. https://www.gov.uk/government/statistics/housing-benefit-recoveries-and-fraud-data-april-2012-to-march-2013
    Explore at:
    Dataset updated
    Sep 11, 2013
    Dataset provided by
    GOV.UK (http://gov.uk/)
    Authors
    Department for Work and Pensions
    Description

    We have produced this publication in accordance with the arrangements approved by the UK Statistics Authority. It contains aggregate-level data received on a quarterly basis from each local authority.

    Historically, the figures in these releases have not allowed for imputation of any missing values and headline numbers were based on the summation of the data presented in the tables. As some authorities do not return a form and some can’t answer all the questions, the presence of missing values can affect the interpretation of trends.

    Therefore to help users understand the possible impact, this year’s first release includes imputed totals for the whole country across the full time series.

    Coverage: Great Britain

    Geographic breakdown: local authority and county

    Frequency: bi-annual

    Next release date: 12 March 2014

    Your feedback

    We continue to welcome any feedback that users may have on our Housing Benefit recoveries and fraud national statistics. In particular we would be interested in learning more about:

    • any additional needs you may have
    • how you use these statistics
    • the types of decision these statistics inform

    Please complete our user questionnaire to tell us what you think.

  12. A Dataset on the Impact of Leader-Member Exchange (LMX) and Capacity-Based...

    • scidb.cn
    Updated Apr 26, 2025
    Cite
    Yang Luo (2025). A Dataset on the Impact of Leader-Member Exchange (LMX) and Capacity-Based LMX Differentiation (CLMXD) on Perceived Overqualification and Cyberloafing in the Workplace [Dataset]. http://doi.org/10.57760/sciencedb.24279
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 26, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Yang Luo
    Description

    This dataset originates from a multi-method research project titled "Is High-Quality LMX Always a Blessing? Exploring the Impact of LMX and Capacity-Based LMX Differentiation on Followers’ Perceived Overqualification and Cyberloafing Engagement." The project investigates how the interaction between leader-member exchange (LMX) quality and capacity-based LMX differentiation (CLMXD) shapes employees’ perceptions of overqualification and their engagement in cyberloafing behaviors. It further explores how (in)congruence between LMX and CLMXD influences employee outcomes through perceived overqualification.

    1. Study 1: Multi-Wave Field Survey
    Data generation procedures: Data for Study 1 were collected through an online questionnaire survey using a snowball sampling approach. Participants were corporate employees from various industries located in Beijing, China. Recruitment was initiated via the researcher's personal network, followed by distribution through a WeChat group.
    Temporal and geographical scope: The survey was conducted over two waves, separated by a two-week interval, in Beijing, China, during 2024.
    Data processing methods: Data were matched across the two time points using the last four digits of participants’ telephone numbers. Participants failing attention checks were excluded. Cookies were used to prevent multiple submissions from the same device.
    Data structure: The dataset consists of 271 valid cases. Each row represents one participant. Variables include Leader-Member Exchange (LMX), Capacity-Based LMX Differentiation (CLMXD), Perceived Overqualification (POQ), Cyberloafing behaviors, and demographic variables such as gender, age, education, position, and tenure. All scales used a 7-point Likert format.
    Measurement and units: Responses were recorded on scales from 1 (strongly disagree) to 7 (strongly agree).
    Missing data: There were no missing data, as all questions were mandatory.
    Error handling: Participants with inconsistent or careless responses (e.g., failing attention checks) were excluded to ensure data quality.
    Data file details: The file is provided in .sav format (SPSS file), containing 271 rows and 43 columns.

    2. Study 2: Experimental Vignette Study
    Data generation procedures: Study 2 employed an experimental vignette methodology (EVM). Participants were randomly assigned to one of four conditions in a 2 (high vs. low LMX) × 2 (high vs. low CLMXD) between-subjects design. They were asked to vividly imagine themselves as protagonists in the scenarios and subsequently respond to a series of measures.
    Temporal and geographical scope: The data were collected in 2024 from MBA students enrolled in a Chinese university.
    Data processing methods: Responses were screened for data quality. Surveys with missing answers, multiple selections on single-choice questions, or indications of non-serious answering were excluded.
    Data structure: The dataset consists of 164 valid cases. Each row corresponds to one participant. Variables include the LMX manipulation check, the CLMXD manipulation check, and Perceived Overqualification (POQ). Demographic variables such as gender, age, education, and position are also included.
    Measurement and units: All variables were measured using 7-point Likert-type scales, from 1 (strongly disagree) to 7 (strongly agree).
    Missing data: All missing or invalid entries were excluded during the cleaning process; thus, the final dataset contains no missing data.
    Error handling: Data quality was ensured through attention to missing responses and response patterns. Only complete and reliable responses were retained.
    Data file details: The file is provided in .sav format (SPSS file), containing 164 rows and 34 columns.
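
    The studies above are distributed as SPSS .sav files; one way to read them outside SPSS (a hedged sketch, with a placeholder file name) is via pyreadstat, which also exposes the variable labels mentioned in the descriptions.

    ```python
    # Hedged sketch for loading an SPSS .sav file such as the ones described above;
    # "study1.sav" is a placeholder name, not the actual file name in this deposit.
    import pyreadstat

    df, meta = pyreadstat.read_sav("study1.sav")
    print(df.shape)                # should match the documented rows x columns
    print(meta.column_names[:5])   # variable names (English abbreviations per the description)
    print(meta.column_labels[:5])  # variable labels stored in the SPSS file
    ```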

  13. A survey result of experiment of non proportional reasoning

    • scidb.cn
    Updated Nov 28, 2024
    Cite
    Lior Cohen (2024). A survey result of experiment of non proportional reasoning [Dataset]. http://doi.org/10.57760/sciencedb.17541
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 28, 2024
    Dataset provided by
    Science Data Bank
    Authors
    Lior Cohen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Generation Procedures
    The data for this study were generated through a survey experiment designed to assess how participants evaluate stock risk based on nominal price changes. Participants were randomly assigned to one of four questionnaires, each presenting a hypothetical scenario involving a stock whose price had declined by 25%. The stocks were priced at $2,000, $200, $20, and $2. This setup allowed the researchers to observe how the nominal price influenced participants’ perceptions of risk and their likelihood to sell the stock. The survey included questions about participants’ risk assessments and their decisions regarding selling shares after being informed of the price decline. The data collection involved online survey tools, ensuring a diverse participant pool.

    Temporal and Geographical Scope
    The survey was conducted in September and December 2022. The geographical scope primarily encompasses participants from various regions in the USA.

    Tabular Data Description
    The survey results were compiled into tabular data with one entry per participant; the total number of data entries was 632. Each row represents an individual participant’s response, identified by Participant ID, while the columns hold the variables tested:

    • Respondent ID: the respondent’s randomly assigned ID number.
    • Gender: Male=0, Female=1.
    • Region: the location of the respondent: Middle Atlantic, West North Central, New England, South Atlantic, Mountain, West South Central, Pacific, East North Central, East South Central.
    • Age: age groups are divided into four, numbered from 1 to 4, respectively: 18-29, 30-44, 45-60, and 60 and older.
    • Stock Price (Q_2, Q_20, Q_200, Q_2000): participants were randomly assigned a questionnaire and asked about a stock priced at $2, $20, $200, or $2,000. The variable takes four values, from 1 to 4: (1) $2; (2) $20; (3) $200; (4) $2,000.
    • Income: reported annual income, divided into income groups from 1 to 8.
    • Risk appetite (self-assessment of risk-taking behavior): participants were asked how much of a risk taker they consider themselves to be on a scale of 1 (risk averse) to 10 (risk lover).
    • TraderD: participants were asked if they trade stocks or bonds and, if so, how often; the possible answers were, from 1 to 4, respectively: trade regularly (daily or weekly[1]), trade a few times a month, rarely (a few times a year at most), or have never traded before.
    • Follow news: people were asked if they follow stock-related news and stock or bond price changes, and if so, how often. The possible responses were divided into five: I never follow news related to stocks or stock indices=1; Occasionally, a few times a year=2; Sometimes, a few times a month=3; Often, a few times a week=4; Always, daily=5.
    • Risk: participants were asked how risky they think the stock is on a scale of 1 (not risky) to 7 (high risk).
    • Percent Sold: participants were asked how many shares they would sell due to the stock decline. Based on the answer, I calculated the percentage of stocks they sold out of their portfolio.
    • Sold: based on the same question as the previous variable, I constructed another variable with the value 1 for those who sold at least 10% of their portfolio and 0 otherwise. In some cases, respondents answered that they would purchase more stocks; in those few cases, I assumed their response was zero.

    Missing Data
    A few respondents skipped questions or provided incomplete answers. Where present, missing data were excluded from the analysis.

    Description of Each Data File
    The primary data file generated from this study is structured as an Excel file containing all participant responses. Content: the file includes columns for participant demographics, stock prices presented, risk assessments, selling decisions, and number of shares sold. Format: Excel. Size: 64 kB.

    [1] In some questionnaires I divided this answer into daily and weekly to allow more variability; however, given the number of “daily” answers, I combined the daily and weekly responses when constructing this variable.
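
    A tiny, hedged pandas illustration of the Sold recoding described in the variable list; the column names follow the list, and the 0-100 percentage scale is an assumption.

    ```python
    # Hedged sketch of deriving the binary `Sold` indicator from `Percent Sold`.
    import pandas as pd

    def add_sold_indicator(df: pd.DataFrame) -> pd.DataFrame:
        pct = df["Percent Sold"].clip(lower=0)   # "would buy more" responses treated as zero
        df["Sold"] = (pct >= 10).astype(int)     # 1 if at least 10% of the portfolio would be sold
        return df
    ```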

  14. Sicily and Calabria Extortion Database

    • search.gesis.org
    • da-ra.de
    Updated Nov 18, 2015
    Cite
    GLODERS Global Dynamics of Extortion Racket System FP7 Project (2015). Sicily and Calabria Extortion Database [Dataset]. https://search.gesis.org/research_data/SDN-10.7802-1116
    Explore at:
    Dataset updated
    Nov 18, 2015
    Dataset provided by
    GESIS search
    GESIS, Köln
    Authors
    GLODERS Global Dynamics of Extortion Racket System FP7 Project
    License

    GESIS data usage terms: https://www.gesis.org/en/institute/data-usage-terms

    Area covered
    Sicily, Calabria
    Description

    The Sicily and Calabria Extortion Database was extracted from police and court documents by the Palermo team of the GLODERS — Global Dynamics of Extortion Racket Systems — project which has received funding from the European Union Seventh Framework Programme (FP7/2007–2013) under grant agreement no. 315874 (http://www.gloders.eu, “Global dynamics of extortion racket systems”, https://cordis.europa.eu/project/id/315874).

    The data are provided as an SPSS file with variable names, variable labels, value labels where appropriate, missing value definitions where appropriate. Variable and value labels are given in English translation, string texts are quoted from the Italian originals as we thought that a translation could bias the information and that users of the data for secondary analysis will usually be able to read Italian.

    The rows of the SPSS file describe one extortion case each. The columns start with some technical information (unique case number, reference to the original source, region, and case number within the region, Sicily or Calabria). These are followed by information about when the cases happened, the pseudonym of the extorter, his role in the organisation, and the name and territory of the mafia family or mandamento he belongs to. Information about the victims, their affiliations and the type of enterprise they represent follows; the type of enterprise is coded according to the official Italian coding scheme (AtEco, which can be downloaded from http://www.istat.it/it/archivio/17888). The next group of variables describes the place where the extortion happened. The value labels for the numerical pseudonyms of extorters and victims (both persons and firms) are not contained in this file, hence the pseudonyms can only be used to analyse how often the same person or firm was involved in extortion.

    After this more or less technical information about the extortion cases, the cases are described materially. Most variables come in two forms: the original textual description of what happened and how it happened, and a recoded variable that lends itself better to quantitative analyses. The features described in these variables encompass

    • whether the extortion was only attempted (and unsuccessful from the point of view of the extorter) or completed, i.e. the victim actually paid,
    • whether the request was for a periodic or a one-off payment or both and what the amount was (the amounts of periodic and one-off amounts are not always comparable as some were only defined in terms of percentages of victim income or in terms of obligations the victim accepted to employ a relative of the extorter etc.),
    • whether there was an intimidation and whether it was directed to a person or to property,
    • whether the extortion request was brought forward by direct personal contact or by some indirect communication,
    • whether there was some negotiation between extorter and victim, and if so, what it was like, and whether a mediator interfered,
    • how the victim reacted: acquiescent, conniving or refusing,
    • how the law enforcement agencies got to know about the case (own observation, denunciation, etc.),
    • whether the extorter was caught, brought into investigative custody or finally sentenced (these variables contain a high percentage of missing data, partly because some cases are still under prosecution or before court, and partly as a consequence of incomplete documents).

  15. Inverse Model Results for Filchner-Ronne Catchment

    • zenodo.org
    bin, nc
    Updated Nov 30, 2023
    Cite
    Michael Wolovick; Michael Wolovick; Angelika Humbert; Angelika Humbert; Thomas Kleiner; Thomas Kleiner; Martin Rückamp; Martin Rückamp (2023). Inverse Model Results for Filchner-Ronne Catchment [Dataset]. http://doi.org/10.5281/zenodo.7798650
    Explore at:
    Available download formats: bin, nc
    Dataset updated
    Nov 30, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Michael Wolovick; Michael Wolovick; Angelika Humbert; Angelika Humbert; Thomas Kleiner; Thomas Kleiner; Martin Rückamp; Martin Rückamp
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This page contains the results of the inversions for basal drag and drag coefficient in the Filchner-Ronne catchment presented in Wolovick et al. (2023), along with the code used to perform the inversions and L-curves, analyze the results, and produce the figures presented in that paper.

    This all looks very complicated. There's so many files here. The description is so long. I just want to know the basal drag!

    If you don't want to get into the weeds of inverse modeling and L-curve analysis, or if you are uninterested in wading through our collection of model structures and scripts, then you should use the file BestCombinedDragEstimate.nc. That file contains our best weighted mean estimate of the ice sheet basal drag in our domain, along with the weighted standard deviation of the scatter of the different models about the mean. As discussed in the paper, this combined estimate is constructed from the weighted mean of 24 individual inversions, representing 8 separate L-curve experiments on our highest-resolution mesh, with three regularization values per L-curve (best estimate regularization, along with minimum and maximum acceptable regularization levels). Each inversion is weighted according to the inverse of its total variance ratio, which is a quality metric incorporating both observational misfit and inverted structure. For ease of use, these results have been interpolated from the unstructured model mesh onto a 250 m regular grid. If you only want to know the basal drag in the Filchner-Ronne region, that is the only file you should use.
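
    For readers who just want that file, here is a hedged sketch of opening it in Python with xarray; the variable names inside the netCDF are not documented here, so inspect the printout rather than trusting the commented guess.

    ```python
    # Hedged sketch: inspect the gridded combined drag estimate and its standard deviation.
    import xarray as xr

    ds = xr.open_dataset("BestCombinedDragEstimate.nc")
    print(ds)                          # lists coordinates (250 m grid) and data variables
    # drag = ds["basal_drag"]          # hypothetical variable name; check the printout first
    ```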

    For users who want to go further, we will now explain the remaining files in this release. First we give a brief summary of all of the scripts included here and their functions, and then we will give an explanation of the matfiles that contain the actual inversion and L-curve results. Note that the scripts presented here are the matlab scripts used to organize and set up model runs for ISSM. The Ice-Sheet and Sea-level System Model (ISSM) is a highly versatile parallelized finite-element ice sheet model run in C but controlled using Matlab or Python front-ends. We do not include the underlying code for ISSM here; users who are interested in installing ISSM should go to the ISSM home page. We merely include the Matlab scripts we used to organize our ISSM front-end, set up model structures, and then analyze and visualize results.

    Main Matlab scripts:

    These are the main functional scripts used to set up and run the model.

    • ISSMInversion_v3.m. This is the primary script we used to set up the inversions and perform L-curve analysis. It requires a model mesh as input along with some gridded data. It also produces an L-curve figure (figure 3) after performing the L-curve analysis. This script can be run in two modes: "setupandsend", which prepares model structures and sends them to the cluster to be solved, and "loadandanalyze", which loads the solutions from the cluster, saves them to matfiles, and performs analysis and visualization. In addition to L-curve analysis and the L-curve figure, this script can also produce a variety of additional figures of model output that we did not show in the paper.
    • MakeISSMMesh_v4.m. This is the script we used to make our model meshes. It requires a domain boundary as input along with some gridded data.
    • ModelBoundaryPicker_v1.m. This script opens a crude graphical interface for picking the domain outline.
    • OrganizeInversionsForRelease_v2.m. This script assembles L-curve and inverse model results and organizes them into the data release you see here. Note that it doesn't compute the combined drag estimate itself (that is done by CombinedDragFigure_v1.m), but it does interpolate the combined drag estimate from the model mesh to the grid, and it produces the output netcdf file.

    Note that the gridded data files needed by some of the above scripts are not included in our release here. Users interested in using these scripts for their own projects will need to provide their own gridded inputs, for instance from BedMachine or Measures.

    Figure-making scripts:

    These scripts produced almost all of the figures we presented in the paper, and also computed the statistics we presented in the tables in the paper.

    • CombinedDragFigure_v1.m. This script computes the combined drag estimate on the highest-resolution mesh, and makes a figure displaying it (Figure 12 in the paper).
    • InversionComparisonFigure_HOSSA_v1.m. This makes figure 11 in the paper and also computes the statistics shown in table 3.
    • InversionComparisonFigure_m_v1.m. This makes figures 9 and 10, and also computes the statistics shown in table 2.
    • InversionComparisonFigure_N_v1.m. This makes figure 8, and also computes the statistics shown in table 1.
    • InversionComparisonFigure_v1a.m. This makes figure 4 in the paper.
    • InversionResConvergenceFigure_v2.m. This makes figure 6.
    • InversionResMisfitFigure_v1.m. This makes figure 7.
    • InversionSettingFigure_v1.m. This makes figure 1.
    • InversionSpectrumFigure_v1.m. This performs spectral analysis and makes figure 5.
    • InversionThermalSettingFigure_v1.m. This makes figure A1.
    • MeshSizeFigure_v1.m. This makes figure A2.
    • NComparisonFigure_v1.m. This makes figure 2.

    Other utility Matlab functions:

    These miscellaneous functions do various tasks. Many of them are called as subroutines of the scripts above. Additionally, many of them are generally useful in contexts beyond the inverse modeling presented here.

    • FlattenModelStructure.m. ISSM has the unfortunate convention of saving every variable in 3D meshes on every single 3D mesh node, which is quite wasteful for variables that are actually 2D (i.e., most of the model variables). This function flattens all unnecessarily 3D information, but unlike the built-in ISSM function flatten.m, this script preserves the 3D geometry of the mesh, along with variables that actually are 3D (such as englacial temperature). This function can also be run in reverse to expand variables back to full 3D before calling solve().
    • intuitive_lowpass.m. This function low-pass filters a 1D dataset using a Gaussian filter. It has several options for handling boundary conditions at the endpoints.
    • LaplacianInterpolation.m. This function fills in missing data values for gridded data products by solving Poisson's equation (Laplacian = 0); a minimal gridded sketch of this idea appears after this list.
    • LaplacianInterpolation_mesh.m. This function does the same thing but on an unstructured mesh.
    • loadnetcdf.m. This function loads variables from netcdf files into the Matlab workspace using a similar syntax as load() for matfiles.
    • MultiWavelengthInterpolator.m. This function interpolates gridded data onto an unstructured mesh using a multi-grid approach. The grid is smoothed at multiple wavelengths and each mesh element interpolates from the wavelength that is appropriate for its size. This functionality is useful for preventing aliasing in coarse-resolution areas when interpolating onto a mesh with variable mesh size. It also produces results that are approximately (but not precisely) conservative.
    • ThreeByThree.m. This function iteratively performs a 3x3 smoothing on gridded data.
    • unpack.m. This function takes a structure and "unpacks" it by making every field into a variable in the workspace.
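    To make the gap-filling idea behind LaplacianInterpolation.m concrete, here is a minimal Python sketch (ours, not the release's Matlab code) that fills masked cells of a regular grid by relaxing toward Laplacian = 0 while holding the known cells fixed:

```python
import numpy as np

def laplacian_fill(grid: np.ndarray, missing: np.ndarray,
                   n_iter: int = 5000, tol: float = 1e-6) -> np.ndarray:
    """Fill cells where `missing` is True by Jacobi relaxation of Laplacian = 0."""
    filled = np.array(grid, dtype=float)
    filled[missing] = np.nanmean(grid[~missing])          # crude initial guess
    for _ in range(n_iter):
        padded = np.pad(filled, 1, mode="edge")           # replicate border values
        neighbours = 0.25 * (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                             padded[1:-1, :-2] + padded[1:-1, 2:])
        update = np.where(missing, neighbours, filled)    # known cells stay fixed
        if np.max(np.abs(update - filled)) < tol:
            return update
        filled = update
    return filled

# Example: a linear ramp with a rectangular hole is recovered almost exactly.
y, x = np.mgrid[0:50, 0:50]
field = x + 0.5 * y
hole = np.zeros_like(field, dtype=bool)
hole[20:30, 15:35] = True
print(np.max(np.abs(laplacian_fill(field, hole)[hole] - field[hole])))
```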

    Matfiles with L-curve data and model structures.

    The results of our L-curve analyses and our actual inversion results are stored in matfiles. We performed 21 experiments shown in the paper; for each one we performed an independent L-curve analysis using 25 individual inversions, for a total of 525 inversions. However, for this data release we simplify matters by only presenting 3 inversions per experiment, corresponding to the best regularization value (LambdaBest) and the maximum and minimum acceptable regularization values (LambdaMax and LambdaMin). In addition, for each experiment we also provide an LCurveFile that summarizes the L-curve analysis but does not contain any actual model results. In total, we present 84 matfiles in this data release.

    Naming convention:

    All matfiles presented here have the following naming convention:

    Mesh#_eqn_m#_Ntype_LambdaType.mat

    • Mesh#: this represents the mesh on which the inversions were performed, ranging from Mesh1 (highest resolution) to Mesh10 (lowest resolution).
    • eqn: this represents the type of equations solved in the inversion. Values are "SSA" or "HO".
    • m#: exponent in the sliding law. Values are m1, m3, and m5.
    • Ntype: effective pressure source in the sliding law. Values are "noN" (ie, Weertman sliding), "Nop", "Nopc", and "Ncuas".
    • LambdaType: values of this string are "LCurveFile" (for the file summarizing the whole L-curve experiment), "LambdaMin", "LambdaBest", and "LambdaMax".
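    The naming convention above can also be parsed programmatically. The short Python sketch below (not part of the release) extracts the experiment metadata from a filename; the example filename at the bottom is hypothetical, built from the convention.

```python
import re

# Regular expression assembled from the naming convention listed above.
NAME_RE = re.compile(
    r"^Mesh(?P<mesh>\d+)_(?P<eqn>SSA|HO)_m(?P<m>\d+)_"
    r"(?P<ntype>noN|Nop|Nopc|Ncuas)_"
    r"(?P<lambdatype>LCurveFile|LambdaMin|LambdaBest|LambdaMax)\.mat$"
)

def parse_matfile_name(filename: str) -> dict:
    """Return the experiment metadata encoded in a matfile name."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not follow the release naming convention")
    return match.groupdict()

print(parse_matfile_name("Mesh1_HO_m3_Ncuas_LambdaBest.mat"))
# {'mesh': '1', 'eqn': 'HO', 'm': '3', 'ntype': 'Ncuas', 'lambdatype': 'LambdaBest'}
```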

    Variables in the model files:

    Every file ending with "LambdaMin", "LambdaBest", or "LambdaMax" is a model file containing the same set of variables. Those variables are:

    • md. This is a model structure variable usable by any ISSM installation. Note that if you do not have ISSM installed on your machine, Matlab will not recognize class "model" and you will not be able to load this variable. The results of the inversion are stored in md.results.StressbalanceSolution. Other important things for the inversion, such as cost functions, cost function coefficients, and

  16. Data from: Common barriers, but temporal dissonance: genomic tests suggest...

    • data.niaid.nih.gov
    • dataone.org
    • +1more
    zip
    Updated Jan 15, 2020
    Cite
    Andrea Thomaz; L. Lacey Knowles (2020). Common barriers, but temporal dissonance: genomic tests suggest ecological and paleo-landscape sieves structure a coastal riverine fish community [Dataset]. http://doi.org/10.5061/dryad.zkh18936g
    Explore at:
    zip (available download formats)
    Dataset updated
    Jan 15, 2020
    Dataset provided by
    University of Michigan
    Authors
    Andrea Thomaz; L. Lacey Knowles
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Assessments of spatial and temporal congruence across taxa from genetic data provide insights into the extent to which similar processes structure communities. However, for coastal regions affected continuously by cyclical sea-level changes over the Pleistocene, a congruent interspecific response will depend not only on co-distributions but also on similar dispersal histories among taxa. Here, we use SNPs to test for concordant genetic structure among four co-distributed taxa of freshwater fishes (Teleostei: Characidae) along the Brazilian Atlantic coastal drainages. Based on population relationships and hierarchical genetic structure analyses, we find that all taxa share the same geographic structure, suggesting the fish utilized common passages in the past to move between river basins. In contrast to this strong spatial concordance, model-based estimates of divergence times indicate that, despite common routes for dispersal, these passages were traversed by each of the taxa at different times, resulting in varying degrees of genetic differentiation across barriers, with most divergences dating to the Upper Pleistocene, even when accounting for divergence with gene flow. Interestingly, when this temporal dissonance is viewed through the lens of the species-specific ecologies, it suggests that an ecological sieve influenced whether species dispersed readily, with an ecological generalist showing the highest propensity for historical dispersal among the isolated rivers of the Brazilian coast (i.e., the most recent divergence times and frequent gene flow estimated across barriers). We discuss what our findings, and in particular the temporal dissonance despite common geographic passages, suggest about past dispersal structuring coastal communities as a function of ecological and paleo-landscape sieves.

    Methods

    Six double-digest Restriction-site Associated DNA (ddRAD) libraries were constructed: three libraries containing 118 individuals of Mimagoniates microlepis for this study, two libraries containing 136 individuals of Hyphessobrycon boulengeri, and one library with 87 individuals of Bryconamericus. In addition, two libraries with 182 individuals of Hollandichthys were re-analyzed for this study (Thomaz et al., 2017). For all the libraries prepared specifically for this study, we followed the protocol of Peterson, Weber, Kay, Fisher, & Hoekstra (2012); the two previously sequenced libraries of Hollandichthys followed the Parchman et al. (2012) protocol (see Thomaz et al., 2017 for preparation details). Genomic data were demultiplexed and processed separately for each taxon with the STACKS version 1.41 pipeline (Catchen, Hohenlohe, Bassham, Amores, & Cresko, 2013). Because of the various requirements of the different analyses used to characterize the geographic structuring of genomic variation, three datasets were generated per taxon, varying in the amount of missing data and the number of individuals. One dataset comprised one random SNP per locus with a maximum of 50% missing data, and was used for estimates of population trees using SVDquartets in PAUP. Another dataset included loci with a maximum of 25% missing data after filtering (note that for M. microlepis we allowed 35% missing data) and was used, with a random single SNP per locus, in the STRUCTURE analysis. Separate datasets were used in the FASTSIMCOAL2 analyses; they were generated, when possible, from the 20 individuals with the smallest amount of missing data in each of the populations separated by a given geographic barrier for each taxon (40 individuals in total), using a single variable SNP per RADtag with less than 10% missing data. For all these datasets, individuals with considerably fewer SNPs than other individuals of the same population were excluded. All filtering steps were performed using the toolset PLINK v1.90 (Purcell et al., 2007); see Thomaz and Knowles (2020) for details on the methodology.
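    To make the dataset-construction logic concrete, the sketch below filters a genotype matrix by per-locus missingness, mirroring the 50%, 25%, and 10% thresholds described above. It is purely illustrative Python; the study itself performed these steps with STACKS and PLINK v1.90, not with this code.

```python
import numpy as np
import pandas as pd

def filter_loci_by_missingness(genotypes: pd.DataFrame, max_missing: float) -> pd.DataFrame:
    """Keep loci (columns) whose fraction of missing calls is at or below max_missing.

    genotypes: individuals x loci, with NaN marking missing genotype calls.
    """
    missing_fraction = genotypes.isna().mean(axis=0)
    return genotypes.loc[:, missing_fraction <= max_missing]

# Toy example with 4 individuals and 3 loci (0/1/2 allele counts, NaN = missing call).
toy = pd.DataFrame({
    "locus1": [0, 1, 2, np.nan],
    "locus2": [np.nan, np.nan, 1, 0],
    "locus3": [2, 2, 1, 1],
}, index=["ind1", "ind2", "ind3", "ind4"])

print(filter_loci_by_missingness(toy, max_missing=0.25).columns.tolist())
# ['locus1', 'locus3']  -- locus2 is dropped because half of its calls are missing
```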

  17. Data from: Impact of delayed response on Wearable Cognitive Assistance

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 2, 2021
    Cite
    Olguín Muñoz, Manuel; Klatzky, Roberta; Wang, Junjue; Padmanabhan, Pillai; Satyanarayanan, Mahadev; Gross, James (2021). Impact of delayed response on Wearable Cognitive Assistance [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4489265
    Explore at:
    Dataset updated
    Feb 2, 2021
    Dataset provided by
    School of Computer Science, Carnegie Mellon University
    Department of Psychology, Carnegie Mellon University
    Intel Labs Pittsburgh
    School of Electrical Engineering & Computer Science, KTH Royal Institute of Technology
    Authors
    Olguín Muñoz, Manuel; Klatzky, Roberta; Wang, Junjue; Padmanabhan, Pillai; Satyanarayanan, Mahadev; Gross, James
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the data associated with our research project titled Impact of delayed response on Wearable Cognitive Assistance. A preprint of the associated paper can be found at https://arxiv.org/abs/2011.02555.

    GENERAL INFORMATION

    1. Title of Dataset: Impact of delayed response on Wearable Cognitive Assistance

    2. Author Information

    First Author Contact Information
      Name: Manuel Olguín Muñoz
      Institution: School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology
      Address: Malvinas väg 10, Stockholm 11428, Sweden
      Email: molguin@kth.se
      Phone Number: +46 73 652 7628

    Author Contact Information
      Name: Roberta L. Klatzky
      Institution: Department of Psychology, Carnegie Mellon University
      Address: 5000 Forbes Ave, Pittsburgh, PA 15213
      Email: klatzky@cmu.edu
      Phone Number: +1 412 268 8026

    Author Contact Information
      Name: Mahadev Satyanarayanan
      Institution: School of Computer Science, Carnegie Mellon University
      Address: 5000 Forbes Ave, Pittsburgh, PA 15213
      Email: satya@cs.cmu.edu
      Phone Number: +1 412 268 3743

    Author Contact Information
      Name: James R. Gross
      Institution: School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology
      Address: Malvinas väg 10, Stockholm 11428, Sweden
      Email: jamesgr@kth.se
      Phone Number: +46 8 790 8819

    DATA & FILE OVERVIEW

    Directory of Files:

    A. Filename: accelerometer_data.csv
      Short description: Time-series accelerometer data. Each row corresponds to a sample.

    B. Filename: block_aggregate.csv
      Short description: Contains the block- and slice-level aggregates for each of the metrics and statistics present in this dataset. Each row corresponds to either a full block or a slice of a block, see below for details.
    
    
    C. Filename: block_metadata.csv
      Short description: Contains the metadata for each block in the task for each participant. Each row corresponds to a block.
    
    D. Filename: bvp_data.csv
      Short description: Time-series blood-volume-pulse data. Each row corresponds to a sample.
    
    
    E. Filename: eeg_data.csv
      Short description: Time-series electroencephalogram data, represented as power per band. Each row corresponds to a sample; power was calculated in 0.5 second intervals.
    
    
    F. Filename: frame_metadata.csv
      Short description: Contains the metadata for each video frame processed by the cognitive assistant. Each row corresponds to a processed frame.
    
    
    G. Filename: gsr_data.csv
      Short description: Time-series galvanic skin response data. Each row corresponds to a sample.
    
    
    H. Filename: task_step_metadata.csv
      Short description: Contains the metadata for each step in the task for each participant. Each row corresponds to a step in the task.
    
    
    I. Filename: temperature_data.csv
      Short description: Time-series thermometer data. Each row corresponds to a sample.
    

    Additional Notes on File Relationships, Context, or Content:

    • The data contained in these CSVs was obtained from 40 participants in a study performed with approval from the Carnegie Mellon University Institutional Review Board. In this study, participants were asked to interact with a Cognitive Assistant while wearing an array of physiological sensors. The data contained in this dataset corresponds to the actual collected data, after some preliminary preprocessing to convert sensor readings into meaningful values.

    • Participants have been anonymized using random integer identifiers.

    • block_aggregate.csv can be replicated by cross-referencing the start and end timestamps of each block in block_metadata.csv with the timestamps of each desired metric; a minimal pandas sketch of this cross-referencing is given after these notes.

    • The actual video frames mentioned in frame_metadata.csv are not included in the dataset since their contents were not relevant to the research.
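    As noted above, the block-level aggregates can be recomputed by slicing the time-series files with the block timestamps. The pandas sketch below (ours, not the authors' code) recomputes the movement score from accelerometer_data.csv using block_metadata.csv; it assumes the timestamp columns parse as datetimes.

```python
import numpy as np
import pandas as pd

# Recompute one block-level aggregate (movement_score) by cross-referencing the
# block start/end timestamps with the accelerometer time series.
blocks = pd.read_csv("block_metadata.csv", parse_dates=["start", "end"])
acc = pd.read_csv("accelerometer_data.csv", parse_dates=["timestamp"])

rows = []
for _, blk in blocks.iterrows():
    in_block = acc[(acc["participant"] == blk["participant"]) &
                   (acc["timestamp"] >= blk["start"]) &
                   (acc["timestamp"] < blk["end"])]
    duration_s = (blk["end"] - blk["start"]).total_seconds()
    magnitude = np.sqrt(in_block["x"]**2 + in_block["y"]**2 + in_block["z"]**2)
    rows.append({
        "participant": blk["participant"],
        "block_seq": blk["seq"],
        "movement_score": magnitude.sum() / duration_s,  # sum of |a| divided by duration
    })

print(pd.DataFrame(rows).head())
```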

    File Naming Convention: N/A

    DATA DESCRIPTION FOR: accelerometer_data.csv

    1. Number of variables: 7

    2. Number of cases/rows: 1844688

    3. Missing data codes: N/A

    4. Variable list:

      A. Name: timestamp Description: Timestamp of the sample.

      B. Name: x Description: Acceleration reading from the x-axis of the accelerometer in g-forces [g].

      C. Name: y Description: Acceleration reading from the y-axis of the accelerometer in g-forces [g].

      D. Name: z Description: Acceleration reading from the z-axis of the accelerometer in g-forces [g].

      E. Name: ts Description: Time difference with respect to first sample.

      F. Name: participant Description: Denotes the numeric ID representing each individual participant.

      G. Name: delay Description: Delay that was being applied on the task when this reading was obtained in time delta format.

    DATA DESCRIPTION FOR: block_aggregate.csv

    1. Number of variables: 16

    2. Number of cases/rows: 2520

    3. Missing data codes:

      • Except for the 'slice' column, empty cells mean that the data is not applicable or was removed from the dataset due to noise or instrument failure.
      • For the 'slice' column, a missing value indicates that the row corresponds to the whole block as opposed to a slice of it.
    4. Variable List:

      A. Name: participant Description: Denotes the numeric ID representing each individual participant.

      B. Name: block_seq Description: Denotes the position of the block in the task. Ranges from 1 to 21.

      C. Name: slice Description: Index of the 4-step slice of the block over which the data was aggregated. Ranges from 0 to 2, however higher values are only applicable for blocks of appropriate length (i.e. blocks of length 4 only have a 0-slice, length 8 have 0 and 1, and length 12 have slices from 0 to 2). A missing value indicates that this row instead contains aggregate values for the whole block.

      D. Name: block_length Description: Length of the block. Valid values are 4, 8 and 12.

      E. Name: block_delay Description: Delay applied to the block, in seconds.

      F. Name: start Description: Timestamp marking the start of the block or slice.

      G. Name: end Description: Timestamp marking the end of the block or slice.

      H. Name: duration Description: Duration of the block or slice, in seconds.

      I. Name: exec_time_per_step_mean Description: Mean execution time for each step in the block or slice.

      J. Name: bpm_mean Description: Mean heart rate, in beats-per-minute, for the block or slice.

      K. Name: bpm_std Description: Standard deviation of the heart rate, in beats-per-minute, for the block or slice.

      L. Name: gsr_per_second Description: Galvanic skin response in microsiemens, summed and then normalized by block or slice duration.

      M. Name: movement_score Description: Movement score for the block or slice. The movement score is calculated as the sum of the magnitude of all the acceleration vectors in the block or slice, divided by duration in seconds.

      N. Name: eeg_alpha_log_mean Description: Log of the average EEG power for the alpha band, for the block or slice.

      O. Name: eeg_beta_log_mean Description: Log of the average EEG power for the beta band, for the block or slice.

      P. Name: eeg_total_log_mean Description: Log of the average EEG power for the complete EEG signal, for the block or slice.

    DATA DESCRIPTION FOR: block_metadata.csv

    1. Number of variables: 8

    2. Number of cases/rows: 880

    3. Missing data codes: N/A

    4. Variable list:

      A. Name: participant Description: Denotes the numeric ID representing each individual participant.

      B. Name: seq Description: Index of the block in the task, ranging from 0 to 21. Note that block 0 is not to be included in aggregate calculations.

      C. Name: length Description: Length of the block in number of steps.

      D. Name: delay Description: Delay applied to the block.

      E. Name: start Description: Timestamp marking the start of the block.

      F. Name: end Description: Timestamp marking the end of the block.

      G. Name: duration Description: Duration of the block as a timedelta.

      H. Name: exec_time Description: Execution time of the block as a timedelta.

    DATA DESCRIPTION FOR: bvp_data.csv

    1. Number of variables: 8

    2. Number of cases/rows: 3683504

    3. Missing data codes: Columns bpm and ibi only contain values for rows corresponding to a sample taken at a heartbeat.

    4. Variable list:

      A. Name: ts Description: Time difference with respect to first sample.

      B. Name: timestamp Description: Timestamp of the sample.

      C. Name: bvp Description: Blood-volume-pulse reading, in millivolts.

      D. Name: onset Description: Boolean indicating if this sample corresponds to the onset of a pulse.

      E. Name: bpm Description: Instantaneous beat-per-minute value.

      F. Name: ibi Description: Instantaneous inter-beat-interval value.

      G. Name: delay Description: Delay that was being applied on the task when this reading was obtained in time delta format.

      H. Name: participant Description: Denotes the numeric ID representing each individual participant.

  18. A Missing Data Approach to Correct for Direct and Indirect Range...

    • plos.figshare.com
    zip
    Updated May 30, 2023
    Cite
    Andreas Pfaffel; Marlene Kollmayer; Barbara Schober; Christiane Spiel (2023). A Missing Data Approach to Correct for Direct and Indirect Range Restrictions with a Dichotomous Criterion: A Simulation Study [Dataset]. http://doi.org/10.1371/journal.pone.0152330
    Explore at:
    zip (available download formats)
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Andreas Pfaffel; Marlene Kollmayer; Barbara Schober; Christiane Spiel
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A recurring methodological problem in the evaluation of the predictive validity of selection methods is that the values of the criterion variable are available for selected applicants only. This so-called range restriction problem causes biased population estimates. Correction methods for direct and indirect range restriction scenarios have been widely studied for continuous criterion variables but not for dichotomous ones. The few existing approaches are inapplicable because they do not consider the unknown base rate of success. Hence, there is a lack of scientific research on suitable correction methods and the systematic analysis of their accuracies in the case of a naturally or artificially dichotomous criterion. We aim to overcome this deficiency by viewing the range restriction problem as a missing data mechanism. We used multiple imputation by chained equations to generate complete criterion data before estimating the predictive validity and the base rate of success. Monte Carlo simulations were conducted to investigate the accuracy of the proposed correction as a function of selection ratio, predictive validity, and base rate of success in an experimental design. In addition, we compared our proposed missing data approach with Thorndike’s well-known correction formulas that have only been used in the case of continuous criterion variables so far. The results show that the missing data approach is more accurate in estimating the predictive validity than Thorndike’s correction formulas. The accuracy of our proposed correction increases as the selection ratio and the correlation between predictor and criterion increase. Furthermore, the missing data approach provides a valid estimate of the unknown base rate of success. On the basis of our findings, we argue for the use of multiple imputation by chained equations in the evaluation of the predictive validity of selection methods when the criterion is dichotomous.
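    To illustrate the missing-data view of range restriction described above, here is a small Python sketch. It uses scikit-learn's IterativeImputer as a generic stand-in for multiple imputation by chained equations, with a simulated direct-selection scenario; it is not the authors' implementation or simulation design.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Simulate a selection scenario with a dichotomous criterion.
n = 5000
predictor = rng.normal(size=n)                        # selection test score
latent = 0.5 * predictor + rng.normal(size=n)         # latent performance
criterion = (latent > 0).astype(float)                # dichotomous success criterion

selected = predictor > np.quantile(predictor, 0.7)    # direct selection, 30% selection ratio
observed = np.where(selected, criterion, np.nan)      # criterion missing for rejected applicants

# Multiple imputations of the missing criterion values, then pooled estimates.
data = np.column_stack([predictor, observed])
estimates = []
for seed in range(20):
    imputed = IterativeImputer(random_state=seed, sample_posterior=True).fit_transform(data)
    crit_hat = np.clip(np.round(imputed[:, 1]), 0, 1)  # map back to a 0/1 criterion
    estimates.append((np.corrcoef(predictor, crit_hat)[0, 1], crit_hat.mean()))

validity, base_rate = np.mean(estimates, axis=0)
restricted = np.corrcoef(predictor[selected], criterion[selected])[0, 1]
print(f"restricted r = {restricted:.2f}, corrected r = {validity:.2f}, "
      f"estimated base rate = {base_rate:.2f}")
```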

  19. Retail Store Sales: Dirty for Data Cleaning

    • kaggle.com
    zip
    Updated Jan 18, 2025
    Cite
    Ahmed Mohamed (2025). Retail Store Sales: Dirty for Data Cleaning [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/retail-store-sales-dirty-for-data-cleaning
    Explore at:
    zip (226740 bytes), available download formats
    Dataset updated
    Jan 18, 2025
    Authors
    Ahmed Mohamed
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dirty Retail Store Sales Dataset

    Overview

    The Dirty Retail Store Sales dataset contains 12,575 rows of synthetic data representing sales transactions from a retail store. The dataset includes eight product categories with 25 items per category, each having static prices. It is designed to simulate real-world sales data, including intentional "dirtiness" such as missing or inconsistent values. This dataset is suitable for practicing data cleaning, exploratory data analysis (EDA), and feature engineering.

    File Information

    • File Name: retail_store_sales.csv
    • Number of Rows: 12,575
    • Number of Columns: 11

    Columns Description

    Column Name | Description | Example Values
    Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567
    Customer ID | A unique identifier for each customer. 25 unique customers. | CUST_01
    Category | The category of the purchased item. | Food, Furniture
    Item | The name of the purchased item. May contain missing values or None. | Item_1_FOOD, None
    Price Per Unit | The static price of a single unit of the item. May contain missing or None values. | 4.00, None
    Quantity | The quantity of the item purchased. May contain missing or None values. | 1, None
    Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, None
    Payment Method | The method of payment used. May contain missing or invalid values. | Cash, Credit Card
    Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Online
    Transaction Date | The date of the transaction. Always present and valid. | 2023-01-15
    Discount Applied | Indicates if a discount was applied to the transaction. May contain missing values. | True, False, None
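    Using the column names above, a minimal pandas cleaning pass might look like the sketch below; the repair rules are illustrative choices for practice, not rules shipped with the dataset.

```python
import pandas as pd

df = pd.read_csv("retail_store_sales.csv")

print(df.isna().sum())  # where is the dataset "dirty"?

# Total Spent is defined as Quantity * Price Per Unit, so any one of the three
# can be recovered whenever the other two are present.
fix_total = df["Total Spent"].isna() & df["Quantity"].notna() & df["Price Per Unit"].notna()
df.loc[fix_total, "Total Spent"] = df.loc[fix_total, "Quantity"] * df.loc[fix_total, "Price Per Unit"]

fix_price = df["Price Per Unit"].isna() & df["Quantity"].notna() & df["Total Spent"].notna()
df.loc[fix_price, "Price Per Unit"] = df.loc[fix_price, "Total Spent"] / df.loc[fix_price, "Quantity"]

# Flag rows where the recorded total still disagrees with Quantity * Price Per Unit.
inconsistent = (df["Total Spent"] - df["Quantity"] * df["Price Per Unit"]).abs() > 1e-9
print(f"{inconsistent.sum()} inconsistent rows remain")
```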

    Categories and Items

    The dataset includes the following categories, each containing 25 items with corresponding codes, names, and static prices:

    Electric Household Essentials

    Item Code | Item Name | Price
    Item_1_EHE | Blender | 5.0
    Item_2_EHE | Microwave | 6.5
    Item_3_EHE | Toaster | 8.0
    Item_4_EHE | Vacuum Cleaner | 9.5
    Item_5_EHE | Air Purifier | 11.0
    Item_6_EHE | Electric Kettle | 12.5
    Item_7_EHE | Rice Cooker | 14.0
    Item_8_EHE | Iron | 15.5
    Item_9_EHE | Ceiling Fan | 17.0
    Item_10_EHE | Table Fan | 18.5
    Item_11_EHE | Hair Dryer | 20.0
    Item_12_EHE | Heater | 21.5
    Item_13_EHE | Humidifier | 23.0
    Item_14_EHE | Dehumidifier | 24.5
    Item_15_EHE | Coffee Maker | 26.0
    Item_16_EHE | Portable AC | 27.5
    Item_17_EHE | Electric Stove | 29.0
    Item_18_EHE | Pressure Cooker | 30.5
    Item_19_EHE | Induction Cooktop | 32.0
    Item_20_EHE | Water Dispenser | 33.5
    Item_21_EHE | Hand Blender | 35.0
    Item_22_EHE | Mixer Grinder | 36.5
    Item_23_EHE | Sandwich Maker | 38.0
    Item_24_EHE | Air Fryer | 39.5
    Item_25_EHE | Juicer | 41.0

    Furniture

    Item Code | Item Name | Price
    Item_1_FUR | Office Chair | 5.0
    Item_2_FUR | Sofa | 6.5
    Item_3_FUR | Coffee Table | 8.0
    Item_4_FUR | Dining Table | 9.5
    Item_5_FUR | Bookshelf | 11.0
    Item_6_FUR | Bed F...
  20. Bank Loan Case Study Dataset

    • kaggle.com
    zip
    Updated May 4, 2023
    + more versions
    Cite
    Shreshth Vashisht (2023). Bank Loan Case Study Dataset [Dataset]. https://www.kaggle.com/datasets/shreshthvashisht/bank-loan-case-study-dataset/discussion
    Explore at:
    zip (117814223 bytes), available download formats
    Dataset updated
    May 4, 2023
    Authors
    Shreshth Vashisht
    Description

    This case study aims to give you an idea of applying EDA in a real business scenario. In this case study, apart from applying the techniques that you have learnt in the EDA module, you will also develop a basic understanding of risk analytics in banking and financial services and understand how data is used to minimize the risk of losing money while lending to customers.

    Business Understanding: Loan-providing companies find it hard to give loans to people with insufficient or non-existent credit history. Because of that, some consumers use this to their advantage by becoming defaulters. Suppose you work for a consumer finance company which specialises in lending various types of loans to urban customers. You have to use EDA to analyse the patterns present in the data. This will ensure that applicants capable of repaying the loan are not rejected.

    When the company receives a loan application, the company has to decide for loan approval based on the applicant’s profile. Two types of risks are associated with the bank’s decision:

    • If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company.
    • If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company.

    The data given below contains the information about the loan application at the time of applying for the loan. It contains two types of scenarios:

    • The client with payment difficulties: he/she had a late payment of more than X days on at least one of the first Y instalments of the loan in our sample.
    • All other cases: all other cases, when the payment is paid on time.

    When a client applies for a loan, there are four types of decisions that could be taken by the client/company:

    • Approved: The company has approved the loan application.
    • Cancelled: The client cancelled the application sometime during approval, either because the client changed her/his mind about the loan or, in some cases, because a higher-risk client received worse pricing which he did not want.
    • Refused: The company had rejected the loan (because the client does not meet their requirements, etc.).
    • Unused Offer: The loan has been cancelled by the client but at different stages of the process.

    In this case study, you will use EDA to understand how consumer attributes and loan attributes influence the tendency of default.

    Business Objectives: It aims to identify patterns which indicate if a client has difficulty paying their installments which may be used for taking actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc. This will ensure that the consumers capable of repaying the loan are not rejected. Identification of such applicants using EDA is the aim of this case study.

    In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default. The company can utilize this knowledge for its portfolio and risk assessment.

    To develop your understanding of the domain, you are advised to independently research a little about risk analytics (understanding the types of variables and their significance should be enough).

    Data Understanding: Download the dataset using the link given under the dataset section on the right.

    • application_data.csv contains all the information about the client at the time of application, including whether the client has payment difficulties.
    • previous_application.csv contains information about the client’s previous loan data, including whether the previous application had been Approved, Cancelled, Refused or Unused offer.
    • columns_descrption.csv is the data dictionary, which describes the meaning of the variables.

    You are required to provide a detailed report for the data above, answering the questions that follow:

    • Present the overall approach of the analysis. Mention the problem statement and the analysis approach briefly.
    • Identify the missing data and use an appropriate method to deal with it (remove columns or replace values with an appropriate value). Hint: in EDA it is not necessary to replace missing values, but if you have to replace them, clearly mention the approach you would take.
    • Identify if there are outliers in the dataset, and mention why you think each is an outlier. Again, remember that for this exercise it is not necessary to remove any data points.
    • Identify if there is data imbalance in the data, and find the ratio of data imbalance. Hint: since there are a lot of columns, you can run your analysis in loops for the appropriate columns and find the insights.
    • Explain the results of univariate, segmented univariate, bivariate analysis, etc. in business terms.
    • Find the top 10 c...
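    A starting point for the missing-data and imbalance checks listed above is sketched below in pandas. The target column name "TARGET" and the 50% drop threshold are assumptions made for illustration and should be checked against columns_descrption.csv.

```python
import pandas as pd

app = pd.read_csv("application_data.csv")

# 1. Missing data: percentage of missing values per column, worst first.
missing_pct = app.isna().mean().sort_values(ascending=False) * 100
print(missing_pct.head(10))

# One common (illustrative) rule: drop columns that are mostly empty, handle the rest later.
app = app.drop(columns=missing_pct[missing_pct > 50].index)

# 2. Data imbalance: ratio of clients without payment difficulties to those with them.
counts = app["TARGET"].value_counts()   # assumed coding: 0 = no difficulty, 1 = difficulty
print(counts)
print("imbalance ratio:", counts.max() / counts.min())
```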
