34 datasets found
  1. Data from: Missing data estimation in morphometrics: how much is too much?

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Dec 5, 2013
    Cite
    Julien Clavel; Gildas Merceron; Gilles Escarguel (2013). Missing data estimation in morphometrics: how much is too much? [Dataset]. http://doi.org/10.5061/dryad.f0b50
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 5, 2013
    Dataset provided by
    Centre National de la Recherche Scientifique
    Authors
    Julien Clavel; Gildas Merceron; Gilles Escarguel
    License

    CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)

    Description

    Fossil-based estimates of diversity and evolutionary dynamics mainly rely on the study of morphological variation. Unfortunately, organism remains are often altered by post-mortem taphonomic processes such as weathering or distortion. Such a loss of information often prevents quantitative multivariate description and statistically controlled comparisons of extinct species based on morphometric data. A common way to deal with missing data involves imputation methods that directly fill the missing cases with model estimates. Over the last few years, several empirically determined thresholds for the maximum acceptable proportion of missing values have been proposed in the literature, whereas other studies have shown that this limit actually depends on several properties of the study dataset and of the selected imputation method, and is in no way generalizable. We evaluate the relative performances of seven multiple imputation techniques through a simulation-based analysis under three distinct patterns of missing data distribution. Overall, Fully Conditional Specification and Expectation-Maximization algorithms provide the best compromises between imputation accuracy and coverage probability. Multiple imputation (MI) techniques appear remarkably robust to the violation of basic assumptions such as the occurrence of taxonomically or anatomically biased patterns of missing data distribution, making differences in simulation results between the three patterns of missing data distribution much smaller than differences between the individual MI techniques. Based on these results, rather than proposing a new (set of) threshold value(s), we develop an approach combining the use of multiple imputations with Procrustean superimposition of principal component analysis results, in order to directly visualize the effect of individual missing data imputation on an ordinated space. We provide an R function for users to implement the proposed procedure.
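
    A rough Python illustration of the same idea (an assumption on my part; the authors provide an R function, not this code): draw several imputations, ordinate each completed dataset with PCA, and Procrustes-align the ordinations so the scatter of each specimen across imputations can be inspected. `X`, the number of imputations, and the use of scikit-learn's IterativeImputer are all illustrative choices.

    ```python
    # Hedged sketch (not the authors' R function): visualize how missing-value imputation
    # perturbs an ordination by combining multiple imputation with Procrustes-aligned PCA.
    # Assumes a NumPy array X of morphometric measurements with np.nan for missing entries.
    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.decomposition import PCA
    from scipy.spatial import procrustes

    def imputation_ordinations(X, n_imputations=20, n_components=2, seed=0):
        scores = []
        for m in range(n_imputations):
            imputer = IterativeImputer(sample_posterior=True, random_state=seed + m)
            scores.append(PCA(n_components=n_components).fit_transform(imputer.fit_transform(X)))
        # Superimpose every ordination onto the first one so that arbitrary rotation,
        # reflection, and scaling differences between imputations are removed.
        reference = scores[0]
        return [procrustes(reference, s)[1] for s in scores]
    ```

    Plotting each specimen's aligned coordinates across the returned ordinations gives a direct picture of how sensitive its position is to the imputation of its missing values.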

  2. Data Driven Estimation of Imputation Error—A Strategy for Imputation with a...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    pdf
    Updated Jun 1, 2023
    Cite
    Nikolaj Bak; Lars K. Hansen (2023). Data Driven Estimation of Imputation Error—A Strategy for Imputation with a Reject Option [Dataset]. http://doi.org/10.1371/journal.pone.0164464
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Nikolaj Bak; Lars K. Hansen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Missing data is a common problem in many research fields and is a challenge that always needs careful consideration. One approach is to impute the missing values, i.e., replace missing values with estimates. When imputation is applied, it is typically applied to all records with missing values indiscriminately. We note that the effects of imputation can be strongly dependent on what is missing. To help make decisions about which records should be imputed, we propose to use a machine learning approach to estimate the imputation error for each case with missing data. The method is intended as a practical aid for users applying imputation once the informed choice to impute the missing data has been made. To do this, all patterns of missing values are simulated in all complete cases, enabling calculation of the “true error” in each of these new cases. The error is then estimated for each case with missing values by weighting the “true errors” by similarity. The method can also be used to test the performance of different imputation methods. A universal numerical threshold of acceptable error cannot be set, since this will differ according to the data, research question, and analysis method. The effect of the threshold can be estimated using the complete cases. The user can set an a priori relevant threshold for what is acceptable or use cross-validation with the final analysis to choose the threshold. The choice can be presented along with argumentation for it rather than holding to conventions that might not be warranted in the specific dataset.
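
    A minimal sketch of the general strategy described above (my own illustration, not the authors' implementation): for one incomplete record, its missingness pattern is imposed on each complete case, the resulting "true" imputation errors are computed, and they are averaged with weights based on similarity to the target on the observed features. The imputer, the distance, and the kernel bandwidth are all placeholders.

    ```python
    # Hedged sketch of data-driven imputation-error estimation with a reject option.
    # X is a NumPy array with np.nan for missing values; target_idx indexes one
    # incomplete row whose expected imputation error we want to estimate.
    import numpy as np
    from sklearn.impute import KNNImputer  # any imputer of interest could be swapped in

    def estimated_imputation_error(X, target_idx, imputer=None, bandwidth=1.0):
        imputer = imputer if imputer is not None else KNNImputer(n_neighbors=5)
        miss = np.isnan(X[target_idx])
        obs = ~miss
        complete = X[~np.isnan(X).any(axis=1)]          # fully observed cases
        errors, weights = [], []
        for i in range(len(complete)):
            corrupted = complete.copy()
            corrupted[i, miss] = np.nan                  # simulate the target's pattern on case i
            filled = imputer.fit_transform(corrupted)
            true_error = np.sqrt(np.mean((filled[i, miss] - complete[i, miss]) ** 2))
            # weight case i by its similarity to the target on the features both observe
            dist = np.linalg.norm(complete[i, obs] - X[target_idx, obs])
            errors.append(true_error)
            weights.append(np.exp(-dist / bandwidth))
        return np.average(errors, weights=weights)       # compare against a chosen reject threshold
    ```

    The returned estimate can then be compared with a threshold chosen a priori or via cross-validation, as the description suggests, to decide whether to impute or reject the record.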

  3. Data from: A multiple imputation method using population information

    • tandf.figshare.com
    pdf
    Updated Apr 30, 2025
    Cite
    Tadayoshi Fushiki (2025). A multiple imputation method using population information [Dataset]. http://doi.org/10.6084/m9.figshare.28900017.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Apr 30, 2025
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Tadayoshi Fushiki
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Multiple imputation (MI) is effectively used to deal with missing data when the missing mechanism is missing at random. However, MI may not be effective when the missing mechanism is not missing at random (NMAR). In such cases, additional information is required to obtain an appropriate imputation. Pham et al. (2019) proposed the calibrated-δ adjustment method, which is a multiple imputation method using population information. It provides appropriate imputation in two NMAR settings. However, the calibrated-δ adjustment method has two problems. First, it can be used only when one variable has missing values. Second, the theoretical properties of the variance estimator have not been provided. This article proposes a multiple imputation method using population information that can be applied when several variables have missing values. The proposed method is proven to include the calibrated-δ adjustment method. It is shown that the proposed method provides a consistent estimator for the parameter of the imputation model in an NMAR situation. The asymptotic variance of the estimator obtained by the proposed method and its estimator are also given.

  4. Data from: Evaluating Supplemental Samples in Longitudinal Research:...

    • tandf.figshare.com
    txt
    Updated Feb 9, 2024
    Cite
    Laura K. Taylor; Xin Tong; Scott E. Maxwell (2024). Evaluating Supplemental Samples in Longitudinal Research: Replacement and Refreshment Approaches [Dataset]. http://doi.org/10.6084/m9.figshare.12162072.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 9, 2024
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Laura K. Taylor; Xin Tong; Scott E. Maxwell
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Despite the wide application of longitudinal studies, they are often plagued by missing data and attrition. The majority of methodological approaches focus on participant retention or modern missing data analysis procedures. This paper, however, takes a new approach by examining how researchers may supplement the sample with additional participants. First, refreshment samples use the same selection criteria as the initial study. Second, replacement samples identify auxiliary variables that may help explain patterns of missingness and select new participants based on those characteristics. A simulation study compares these two strategies for a linear growth model with five measurement occasions. Overall, the results suggest that refreshment samples lead to less relative bias, greater relative efficiency, and more acceptable coverage rates than replacement samples or not supplementing the missing participants in any way. Refreshment samples also have high statistical power. The comparative strengths of the refreshment approach are further illustrated through a real data example. These findings have implications for assessing change over time when researching at-risk samples with high levels of permanent attrition.

  5. Data from: A Comparison of FIML- versus Multiple-imputation-based methods to...

    • tandf.figshare.com
    docx
    Updated Feb 26, 2024
    Cite
    Yu Liu; Suppanut Sriutaisuk (2024). A Comparison of FIML- versus Multiple-imputation-based methods to test measurement invariance with incomplete ordinal variables [Dataset]. http://doi.org/10.6084/m9.figshare.14062423.v1
    Explore at:
    Available download formats: docx
    Dataset updated
    Feb 26, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Yu Liu; Suppanut Sriutaisuk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    To ensure meaningful comparison of test scores across groups or time, measurement invariance (i.e., invariance of the general factor structure and the values of the measurement parameters) across groups or time must be examined. However, many empirical examinations of measurement invariance of psychological/educational questionnaires need to address two issues: Using the appropriate model for ordinal variables (e.g., Likert scale items), and handling missing data. In two Monte Carlo simulations, this study examined the performance of one full-information-maximum-likelihood-based method and five multiple-imputation-based methods to obtain tests of measurement invariance across groups for ordinal variables that have missing data. Our results indicate that the full-information-maximum-likelihood-based method and one of the multiple-imputation-based methods generally have better performance than the other examined methods, though they also have their own limitations.

  6. Dataset for: Avoiding pitfalls when combining multiple imputation and...

    • wiley.figshare.com
    docx
    Updated Jun 2, 2023
    Cite
    Emily Granger; Jamie Sergeant; Mark Lunt (2023). Dataset for: Avoiding pitfalls when combining multiple imputation and propensity scores [Dataset]. http://doi.org/10.6084/m9.figshare.9253178.v1
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Wiley (https://www.wiley.com/)
    Authors
    Emily Granger; Jamie Sergeant; Mark Lunt
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Overcoming bias due to confounding and missing data is challenging when analysing observational data. Propensity scores are commonly used to account for the first problem and multiple imputation for the latter. Unfortunately, it is not known how best to proceed when both techniques are required. We investigate whether two different approaches to combining propensity scores and multiple imputation (Across and Within) lead to differences in the accuracy or precision of exposure effect estimates. Both approaches start by imputing missing values multiple times. Propensity scores are then estimated for each resulting dataset. Using the Across approach, the mean propensity score across imputations for each subject is used in a single subsequent analysis. Alternatively, the Within approach uses propensity scores individually to obtain exposure effect estimates in each imputation, which are combined to produce an overall estimate. These approaches were compared in a series of Monte Carlo simulations and applied to data from the British Society for Rheumatology Biologics Register. Results indicated that the Within approach produced unbiased estimates with appropriate confidence intervals, whereas the Across approach produced biased results and unrealistic confidence intervals. Researchers are encouraged to implement the Within approach when conducting propensity score analyses with incomplete data.
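
    As a concrete, hypothetical illustration of the two pooling strategies compared above, the sketch below imputes the covariates several times, fits a propensity model in each completed dataset, and then estimates the exposure effect either per imputation (Within) or from the averaged propensity score (Across). The variable names (`X`, `t`, `y`), the logistic/linear models, and covariate adjustment on the propensity score are my assumptions, not details taken from the paper.

    ```python
    # Hedged sketch of the "Within" vs. "Across" ways of combining multiple imputation
    # with propensity scores; X = covariates with np.nan, t = binary exposure, y = outcome.
    import numpy as np
    import statsmodels.api as sm
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    def within_and_across(X, t, y, m=20, seed=0):
        completed = [IterativeImputer(sample_posterior=True, random_state=seed + i).fit_transform(X)
                     for i in range(m)]
        # One propensity score model per imputed dataset.
        ps = [sm.Logit(t, sm.add_constant(d)).fit(disp=0).predict() for d in completed]

        # Within: estimate the exposure effect separately in each imputation, using that
        # imputation's own propensity score, then pool (here only the point estimates;
        # Rubin's rules would also pool the variances).
        within = [sm.OLS(y, sm.add_constant(np.column_stack([t, p]))).fit().params[1] for p in ps]
        within_estimate = np.mean(within)

        # Across: average the propensity scores over imputations and run a single analysis
        # (included only for comparison; the paper reports this approach is biased).
        p_bar = np.mean(ps, axis=0)
        across_estimate = sm.OLS(y, sm.add_constant(np.column_stack([t, p_bar]))).fit().params[1]
        return within_estimate, across_estimate
    ```

    The key design difference is only where the averaging happens: over propensity scores before a single analysis (Across) or over per-imputation effect estimates (Within), which is the approach the authors recommend.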

  7. Experimental Dataset on the Impact of Unfair Behavior by AI and Humans on...

    • scidb.cn
    Updated Apr 30, 2025
    Cite
    Yang Luo (2025). Experimental Dataset on the Impact of Unfair Behavior by AI and Humans on Trust: Evidence from Six Experimental Studies [Dataset]. http://doi.org/10.57760/sciencedb.psych.00565
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 30, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Yang Luo
    Description

    This dataset originates from a series of experimental studies titled “Tough on People, Tolerant to AI? Differential Effects of Human vs. AI Unfairness on Trust”. The project investigates how individuals respond to unfair behavior (distributive, procedural, and interactional unfairness) enacted by artificial intelligence versus human agents, and how such behavior affects cognitive and affective trust.

    1. Experiment 1a: The Impact of AI vs. Human Distributive Unfairness on Trust
    Overview: This dataset comes from an experimental study aimed at examining how individuals respond in terms of cognitive and affective trust when distributive unfairness is enacted by either an artificial intelligence (AI) agent or a human decision-maker. Experiment 1a specifically focuses on the main effect of the “type of decision-maker” on trust.
    Data Generation and Processing: The data were collected through Credamo, an online survey platform. Initially, 98 responses were gathered from students at a university in China. Additional student participants were recruited via Credamo to supplement the sample. Attention check items were embedded in the questionnaire, and participants who failed were automatically excluded in real time. Data collection continued until 202 valid responses were obtained. SPSS software was used for data cleaning and analysis.
    Data Structure and Format: The data file is named “Experiment1a.sav” and is in SPSS format. It contains 28 columns and 202 rows, where each row corresponds to one participant. Columns represent measured variables, including: grouping and randomization variables, one manipulation check item, four items measuring distributive fairness perception, six items on cognitive trust, five items on affective trust, three items for honesty checks, and four demographic variables (gender, age, education, and grade level). The final three columns contain computed means for distributive fairness, cognitive trust, and affective trust.
    Additional Information: No missing data are present. All variable names are labeled in English abbreviations to facilitate further analysis. The dataset can be directly opened in SPSS or exported to other formats.

    2. Experiment 1b: The Mediating Role of Perceived Ability and Benevolence (Distributive Unfairness)
    Overview: This dataset originates from an experimental study designed to replicate the findings of Experiment 1a and further examine the potential mediating role of perceived ability and perceived benevolence.
    Data Generation and Processing: Participants were recruited via the Credamo online platform. Attention check items were embedded in the survey to ensure data quality. Data were collected using a rolling recruitment method, with invalid responses removed in real time. A total of 228 valid responses were obtained.
    Data Structure and Format: The dataset is stored in a file named Experiment1b.sav in SPSS format and can be directly opened in SPSS software. It consists of 228 rows and 40 columns. Each row represents one participant’s data record, and each column corresponds to a different measured variable. Specifically, the dataset includes: random assignment and grouping variables; one manipulation check item; four items measuring perceived distributive fairness; six items on perceived ability; five items on perceived benevolence; six items on cognitive trust; five items on affective trust; three items for attention check; and three demographic variables (gender, age, and education). The last five columns contain the computed mean scores for perceived distributive fairness, ability, benevolence, cognitive trust, and affective trust.
    Additional Notes: There are no missing values in the dataset. All variables are labeled using standardized English abbreviations to facilitate reuse and secondary analysis. The file can be analyzed directly in SPSS or exported to other formats as needed.

    3. Experiment 2a: Differential Effects of AI vs. Human Procedural Unfairness on Trust
    Overview: This dataset originates from an experimental study aimed at examining whether individuals respond differently in terms of cognitive and affective trust when procedural unfairness is enacted by artificial intelligence versus human decision-makers. Experiment 2a focuses on the main effect of the decision agent on trust outcomes.
    Data Generation and Processing: Participants were recruited via the Credamo online survey platform from two universities located in different regions of China. A total of 227 responses were collected. After excluding those who failed the attention check items, 204 valid responses were retained for analysis. Data were processed and analyzed using SPSS software.
    Data Structure and Format: The dataset is stored in a file named Experiment2a.sav in SPSS format and can be directly opened in SPSS software. It contains 204 rows and 30 columns. Each row represents one participant’s response record, while each column corresponds to a specific variable. Variables include: random assignment and grouping; one manipulation check item; seven items measuring perceived procedural fairness; six items on cognitive trust; five items on affective trust; three attention check items; and three demographic variables (gender, age, and education). The final three columns contain computed average scores for procedural fairness, cognitive trust, and affective trust.
    Additional Notes: The dataset contains no missing values. All variables are labeled using standardized English abbreviations to facilitate reuse and secondary analysis. The file can be directly analyzed in SPSS or exported to other formats as needed.

    4. Experiment 2b: Mediating Role of Perceived Ability and Benevolence (Procedural Unfairness)
    Overview: This dataset comes from an experimental study designed to replicate the findings of Experiment 2a and to further examine the potential mediating roles of perceived ability and perceived benevolence in shaping trust responses under procedural unfairness.
    Data Generation and Processing: Participants were working adults recruited through the Credamo online platform. A rolling data collection strategy was used, where responses failing attention checks were excluded in real time. The final dataset includes 235 valid responses. All data were processed and analyzed using SPSS software.
    Data Structure and Format: The dataset is stored in a file named Experiment2b.sav, which is in SPSS format and can be directly opened using SPSS software. It contains 235 rows and 43 columns. Each row corresponds to a single participant, and each column represents a specific measured variable. These include: random assignment and group labels; one manipulation check item; seven items measuring procedural fairness; six items for perceived ability; five items for perceived benevolence; six items for cognitive trust; five items for affective trust; three attention check items; and three demographic variables (gender, age, education). The final five columns contain the computed average scores for procedural fairness, perceived ability, perceived benevolence, cognitive trust, and affective trust.
    Additional Notes: There are no missing values in the dataset. All variables are labeled using standardized English abbreviations to support future reuse and secondary analysis. The dataset can be directly analyzed in SPSS and easily converted into other formats if needed.

    5. Experiment 3a: Effects of AI vs. Human Interactional Unfairness on Trust
    Overview: This dataset comes from an experimental study that investigates how interactional unfairness, when enacted by either artificial intelligence or human decision-makers, influences individuals’ cognitive and affective trust. Experiment 3a focuses on the main effect of the “decision-maker type” under interactional unfairness conditions.
    Data Generation and Processing: Participants were college students recruited from two universities in different regions of China through the Credamo survey platform. After excluding responses that failed attention checks, a total of 203 valid cases were retained from an initial pool of 223 responses. All data were processed and analyzed using SPSS software.
    Data Structure and Format: The dataset is stored in the file named Experiment3a.sav, in SPSS format and compatible with SPSS software. It contains 203 rows and 27 columns. Each row represents a single participant, while each column corresponds to a specific measured variable. These include: random assignment and condition labels; one manipulation check item; four items measuring interactional fairness perception; six items for cognitive trust; five items for affective trust; three attention check items; and three demographic variables (gender, age, education). The final three columns contain computed average scores for interactional fairness, cognitive trust, and affective trust.
    Additional Notes: There are no missing values in the dataset. All variable names are provided using standardized English abbreviations to facilitate secondary analysis. The data can be directly analyzed using SPSS and exported to other formats as needed.

    6. Experiment 3b: The Mediating Role of Perceived Ability and Benevolence (Interactional Unfairness)
    Overview: This dataset comes from an experimental study designed to replicate the findings of Experiment 3a and further examine the potential mediating roles of perceived ability and perceived benevolence under conditions of interactional unfairness.
    Data Generation and Processing: Participants were working adults recruited via the Credamo platform. Attention check questions were embedded in the survey, and responses that failed these checks were excluded in real time. Data collection proceeded in a rolling manner until a total of 227 valid responses were obtained. All data were processed and analyzed using SPSS software.
    Data Structure and Format: The dataset is stored in the file named Experiment3b.sav, in SPSS format and compatible with SPSS software. It includes 227 rows and

  8. Cafe Sales - Dirty Data for Cleaning Training

    • kaggle.com
    zip
    Updated Jan 17, 2025
    Cite
    Ahmed Mohamed (2025). Cafe Sales - Dirty Data for Cleaning Training [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/cafe-sales-dirty-data-for-cleaning-training
    Explore at:
    Available download formats: zip (113,510 bytes)
    Dataset updated
    Jan 17, 2025
    Authors
    Ahmed Mohamed
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dirty Cafe Sales Dataset

    Overview

    The Dirty Cafe Sales dataset contains 10,000 rows of synthetic data representing sales transactions in a cafe. This dataset is intentionally "dirty," with missing values, inconsistent data, and errors introduced to provide a realistic scenario for data cleaning and exploratory data analysis (EDA). It can be used to practice cleaning techniques, data wrangling, and feature engineering.

    File Information

    • File Name: dirty_cafe_sales.csv
    • Number of Rows: 10,000
    • Number of Columns: 8

    Column Descriptions

    • Transaction ID: A unique identifier for each transaction. Always present and unique. Example: TXN_1234567
    • Item: The name of the item purchased. May contain missing or invalid values (e.g., "ERROR"). Examples: Coffee, Sandwich
    • Quantity: The quantity of the item purchased. May contain missing or invalid values. Examples: 1, 3, UNKNOWN
    • Price Per Unit: The price of a single unit of the item. May contain missing or invalid values. Examples: 2.00, 4.00
    • Total Spent: The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. Examples: 8.00, 12.00
    • Payment Method: The method of payment used. May contain missing or invalid values (e.g., None, "UNKNOWN"). Examples: Cash, Credit Card
    • Location: The location where the transaction occurred. May contain missing or invalid values. Examples: In-store, Takeaway
    • Transaction Date: The date of the transaction. May contain missing or incorrect values. Example: 2023-01-01

    Data Characteristics

    1. Missing Values:

      • Some columns (e.g., Item, Payment Method, Location) may contain missing values represented as None or empty cells.
    2. Invalid Values:

      • Some rows contain invalid entries like "ERROR" or "UNKNOWN" to simulate real-world data issues.
    3. Price Consistency:

      • Prices for menu items are consistent but may have missing or incorrect values introduced.

    Menu Items

    The dataset includes the following menu items with their respective price ranges:

    • Coffee: $2
    • Tea: $1.50
    • Sandwich: $4
    • Salad: $5
    • Cake: $3
    • Cookie: $1
    • Smoothie: $4
    • Juice: $3

    Use Cases

    This dataset is suitable for:

    • Practicing data cleaning techniques such as handling missing values, removing duplicates, and correcting invalid entries.
    • Exploring EDA techniques like visualizations and summary statistics.
    • Performing feature engineering for machine learning workflows.

    Cleaning Steps Suggestions

    To clean this dataset, consider the following steps (a minimal pandas sketch follows this list):

    1. Handle Missing Values:

      • Fill missing numeric values with the median or mean.
      • Replace missing categorical values with the mode or "Unknown."
    2. Handle Invalid Values:

      • Replace invalid entries like "ERROR" and "UNKNOWN" with NaN or appropriate values.
    3. Date Consistency:

      • Ensure all dates are in a consistent format.
      • Fill missing dates with plausible values based on nearby records.
    4. Feature Engineering:

      • Create new columns, such as Day of the Week or Transaction Month, for further analysis.
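
    A minimal cleaning sketch along those lines, assuming pandas and the column names listed above; the specific fill choices (median, "Unknown") are just the suggestions from the list, not the only reasonable ones.

    ```python
    # Hedged sketch of the suggested cleaning steps for dirty_cafe_sales.csv.
    import pandas as pd

    df = pd.read_csv("dirty_cafe_sales.csv")

    # Steps 1-2: treat placeholder strings as missing, then fill simple defaults.
    df = df.replace(["ERROR", "UNKNOWN", ""], pd.NA)
    for col in ["Quantity", "Price Per Unit", "Total Spent"]:
        df[col] = pd.to_numeric(df[col], errors="coerce")
        df[col] = df[col].fillna(df[col].median())
    for col in ["Item", "Payment Method", "Location"]:
        df[col] = df[col].fillna("Unknown")

    # Step 3: normalize dates (unparseable values become NaT and could later be filled from nearby rows).
    df["Transaction Date"] = pd.to_datetime(df["Transaction Date"], errors="coerce")

    # Step 4: feature engineering.
    df["Day of Week"] = df["Transaction Date"].dt.day_name()
    df["Transaction Month"] = df["Transaction Date"].dt.to_period("M")
    ```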

    License

    This dataset is released under the CC BY-SA 4.0 License. You are free to use, share, and adapt it, provided you give appropriate credit.

    Feedback

    If you have any questions or feedback, feel free to reach out through the dataset's discussion board on Kaggle.

  9. Replication Data for: Democratization and Gini index: Panel data analysis...

    • dataverse.harvard.edu
    • search.datacite.org
    Updated Apr 13, 2019
    Cite
    LEIZHEN ZANG; Xiong Feng (2019). Replication Data for: Democratization and Gini index: Panel data analysis based on random forest method [Dataset]. http://doi.org/10.7910/DVN/W2CXVU
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 13, 2019
    Dataset provided by
    Harvard Dataverse
    Authors
    LEIZHEN ZANG; Xiong Feng
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The mechanism linking democratic development and the wealth gap has long been a focus of political and economic research, yet with no consistent conclusion. The reasons often are (1) the difficulty of generalizing results obtained from single-country time-series studies or from multinational cross-sectional analyses, and (2) deviations in research results caused by missing values or variable selection in panel data analysis. Two factors contribute to the latter: the accuracy of estimation suffers from the presence of missing values in variables, and the subjective discretion that must be exercised to select suitable proxies among many candidates is likely to cause variable selection bias. To address these problems, this study pioneers the use of a machine learning method on this topic, imputing missing values efficiently with a random forest model, and analyzes cross-country data from 151 countries covering the period 1993–2017. Because this paper measures the importance of different variables to the dependent variable, more appropriate and important variables can be selected to construct a complete regression model. Results from different models come to a consensus that the promotion of democracy can significantly narrow the gap between the rich and the poor, with a marginally decreasing effect with respect to wealth. In addition, the study finds that this mechanism exists only in non-colonial nations or presidential states. Finally, this paper discusses the potential theoretical and policy implications of the results.
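
    The description mentions random-forest-based imputation of missing values; a hedged Python analogue (the authors' actual implementation is not specified here) is to wrap a random forest inside an iterative imputer, in the spirit of missForest. The `panel` DataFrame and parameter choices are illustrative.

    ```python
    # Hedged sketch of random-forest imputation for a country-year panel of numeric
    # indicators (missForest-style); `panel` is a pandas DataFrame with NaNs.
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    def rf_impute(panel: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
        imputer = IterativeImputer(
            estimator=RandomForestRegressor(n_estimators=100, random_state=seed),
            max_iter=10,
            random_state=seed,
        )
        filled = imputer.fit_transform(panel)
        return pd.DataFrame(filled, index=panel.index, columns=panel.columns)

    # Variable importance for the outcome (e.g., the Gini index) can then be read from a
    # random forest fit on the completed panel via its feature_importances_ attribute.
    ```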

  10. Labour Force Survey Household Datasets, 2002-2024: Secure Access

    • datacatalogue.ukdataservice.ac.uk
    Updated Aug 28, 2025
    Cite
    Office for National Statistics, Social Survey Division (2025). Labour Force Survey Household Datasets, 2002-2024: Secure Access [Dataset]. http://doi.org/10.5255/UKDA-SN-7674-17
    Explore at:
    Dataset updated
    Aug 28, 2025
    Dataset provided by
    UK Data Service (https://ukdataservice.ac.uk/)
    Authors
    Office for National Statistics, Social Survey Division
    Area covered
    United Kingdom
    Description

    Background

    The Labour Force Survey (LFS) is a unique source of information using international definitions of employment and unemployment and economic inactivity, together with a wide range of related topics such as occupation, training, hours of work and personal characteristics of household members aged 16 years and over. It is used to inform social, economic and employment policy. The LFS was first conducted biennially from 1973-1983. Between 1984 and 1991 the survey was carried out annually and consisted of a quarterly survey conducted throughout the year and a 'boost' survey in the spring quarter (data were then collected seasonally). From 1992 quarterly data were made available, with a quarterly sample size approximately equivalent to that of the previous annual data. The survey then became known as the Quarterly Labour Force Survey (QLFS). From December 1994, data gathering for Northern Ireland moved to a full quarterly cycle to match the rest of the country, so the QLFS then covered the whole of the UK (though some additional annual Northern Ireland LFS datasets are also held at the UK Data Archive). Further information on the background to the QLFS may be found in the documentation.

    New reweighting policy
    Following the new reweighting policy, the ONS reviewed the latest population estimates made available during 2019 and decided not to carry out a 2019 LFS and APS reweighting exercise. Therefore, the next reweighting exercise will take place in 2020. It will incorporate the 2019 Sub-National Population Projection data (published in May 2020) and the 2019 Mid-Year Estimates (published in June 2020). It is expected that reweighted Labour Market aggregates and microdata will be published towards the end of 2020/early 2021.

    Secure Access QLFS household data
    Up to 2015, the LFS household datasets were produced twice a year (April-June and October-December) from the corresponding quarter's individual-level data. From January 2015 onwards, they are now produced each quarter alongside the main QLFS. The household datasets include all the usual variables found in the individual-level datasets, with the exception of those relating to income, and are intended to facilitate the analysis of the economic activity patterns of whole households. It is recommended that the existing individual-level LFS datasets continue to be used for any analysis at individual level, and that the LFS household datasets be used for analysis involving household or family-level data. For some quarters, users should note that all missing values in the data are set to one '-10' category instead of the separate '-8' and '-9' categories. For that period, the ONS introduced a new imputation process for the LFS household datasets and it was necessary to code the missing values into one new combined category ('-10'), to avoid over-complication. From the 2013 household datasets, the standard -8 and -9 missing categories have been reinstated.

    Secure Access household datasets for the QLFS are available from 2002 onwards, and include additional, detailed variables not included in the standard 'End User Licence' (EUL) versions. Extra variables that typically can be found in the Secure Access versions but not in the EUL versions relate to: geography; date of birth, including day; education and training; household and family characteristics; employment; unemployment and job hunting; accidents at work and work-related health problems; nationality, national identity and country of birth; occurrence of learning difficulty or disability; and benefits.

    Prospective users of a Secure Access version of the QLFS will need to fulfil additional requirements, commencing with the completion of an extra application form to demonstrate to the data owners exactly why they need access to the extra, more detailed variables, in order to obtain permission to use that version. Secure Access users must also complete face-to-face training and agree to Secure Access' User Agreement (see 'Access' section below). Therefore, users are encouraged to download and inspect the EUL version of the data prior to ordering the Secure Access version.

    LFS Documentation
    The documentation available from the Archive to accompany LFS datasets largely consists of each volume of the User Guide including the appropriate questionnaires for the years concerned. However, LFS volumes are updated periodically by ONS, so users are advised to check the ONS LFS User Guidance pages before commencing analysis.

    The study documentation presented in the Documentation section includes the most recent documentation for the LFS only, due to available space. Documentation for previous years is provided alongside the data for access and is also available upon request.

    Review of imputation methods for LFS Household data - changes to missing values
    A review of the imputation methods used in LFS Household and Family analysis resulted in a change from the January-March 2015 quarter onwards. It was no longer considered appropriate to impute any personal characteristic variables (e.g. religion, ethnicity, country of birth, nationality, national identity, etc.) using the LFS donor imputation method. This method is primarily focused on ensuring that the 'economic status' of all individuals within a household is known, allowing analysis of the combined economic status of households. This means that from 2015 larger amounts of missing values ('-8'/'-9') will be present in the data for these personal characteristic variables than before. Therefore, if users need to carry out any time series analysis of households/families that also includes personal characteristic variables covering this time period, it is advised to filter out 'ioutcome=3' cases from all periods to remove this inconsistent treatment of non-responders.
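
    A small, hedged pandas sketch of that recoding and filtering, assuming the household file has already been read into a DataFrame with the ONS missing-value codes (-8, -9, -10) present as plain numbers; `ioutcome` is the variable named above.

    ```python
    # Hedged sketch: treat the ONS missing-value codes as missing and drop non-responding
    # household members before time-series analysis of personal characteristics.
    import pandas as pd

    def prepare_lfs_household(df: pd.DataFrame) -> pd.DataFrame:
        df = df.replace({-8: pd.NA, -9: pd.NA, -10: pd.NA})   # missing-value categories
        return df[df["ioutcome"] != 3]                         # filter out non-responders, as advised
    ```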

    Variables DISEA and LNGLST
    Dataset A08 (Labour market status of disabled people) which ONS suspended due to an apparent discontinuity between April to June 2017 and July to September 2017 is now available. As a result of this apparent discontinuity and the inconclusive investigations at this stage, comparisons should be made with caution between April to June 2017 and subsequent time periods. However users should note that the estimates are not seasonally adjusted, so some of the change between quarters could be due to seasonality. Further recommendations on historical comparisons of the estimates will be given in November 2018 when ONS are due to publish estimates for July to September 2018.

    An article explaining the quality assurance investigations that have been conducted so far is available on the ONS Methodology webpage. For any queries about Dataset A08 please email Labour.Market@ons.gov.uk.

    Latest Edition Information
    For the seventeenth edition (August 2025), one quarterly data file covering the time period July-September, 2024 has been added to the study.

  11. Housing Benefit recoveries and fraud data April 2012 to March 2013

    • gov.uk
    Updated Sep 11, 2013
    Cite
    Department for Work and Pensions (2013). Housing Benefit recoveries and fraud data April 2012 to March 2013 [Dataset]. https://www.gov.uk/government/statistics/housing-benefit-recoveries-and-fraud-data-april-2012-to-march-2013
    Explore at:
    Dataset updated
    Sep 11, 2013
    Dataset provided by
    GOV.UK (http://gov.uk/)
    Authors
    Department for Work and Pensions
    Description

    We have produced this publication in accordance with the arrangements approved by the UK Statistics Authority. It contains aggregate-level data received on a quarterly basis from each local authority.

    Historically, the figures in these releases have not allowed for imputation of any missing values and headline numbers were based on the summation of the data presented in the tables. As some authorities do not return a form and some can’t answer all the questions, the presence of missing values can affect the interpretation of trends.

    Therefore to help users understand the possible impact, this year’s first release includes imputed totals for the whole country across the full time series.

    Coverage: Great Britain

    Geographic breakdown: local authority and county

    Frequency: bi-annual

    Next release date: 12 March 2014

    Your feedback

    We continue to welcome any feedback that users may have on our Housing Benefit recoveries and fraud national statistics. In particular we would be interested in learning more about:

    • any additional needs you may have
    • how you use these statistics
    • the types of decision these statistics inform

    Please complete our user questionnaire to tell us what you think.

  12. A Dataset on the Impact of Leader-Member Exchange (LMX) and Capacity-Based...

    • scidb.cn
    Updated Apr 26, 2025
    Cite
    Yang Luo (2025). A Dataset on the Impact of Leader-Member Exchange (LMX) and Capacity-Based LMX Differentiation (CLMXD) on Perceived Overqualification and Cyberloafing in the Workplace [Dataset]. http://doi.org/10.57760/sciencedb.24279
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 26, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Yang Luo
    Description

    This dataset originates from a multi-method research project titled "Is High-Quality LMX Always a Blessing? Exploring the Impact of LMX and Capacity-Based LMX Differentiation on Followers’ Perceived Overqualification and Cyberloafing Engagement." The project investigates how the interaction between leader-member exchange (LMX) quality and capacity-based LMX differentiation (CLMXD) shapes employees’ perceptions of overqualification and their engagement in cyberloafing behaviors. It further explores how (in)congruence between LMX and CLMXD influences employee outcomes through perceived overqualification.

    1. Study 1: Multi-Wave Field Survey
    Data generation procedures: Data for Study 1 were collected through an online questionnaire survey using a snowball sampling approach. Participants were corporate employees from various industries located in Beijing, China. Recruitment was initiated via the researcher's personal network, followed by distribution through a WeChat group.
    Temporal and geographical scope: The survey was conducted over two waves, separated by a two-week interval, in Beijing, China, during 2024.
    Data processing methods: Data were matched across the two time points using the last four digits of participants’ telephone numbers. Participants failing attention checks were excluded. Cookies were used to prevent multiple submissions from the same device.
    Data structure: The dataset consists of 271 valid cases. Each row represents one participant. Variables include Leader-Member Exchange (LMX), Capacity-Based LMX Differentiation (CLMXD), Perceived Overqualification (POQ), Cyberloafing behaviors, and demographic variables such as gender, age, education, position, and tenure. All scales used a 7-point Likert format.
    Measurement and units: Responses were recorded on scales from 1 (strongly disagree) to 7 (strongly agree).
    Missing data: There were no missing data, as all questions were mandatory.
    Error handling: Participants with inconsistent or careless responses (e.g., failing attention checks) were excluded to ensure data quality.
    Data file details: The file is provided in .sav format (SPSS file), containing 271 rows and 43 columns.

    2. Study 2: Experimental Vignette Study
    Data generation procedures: Study 2 employed an experimental vignette methodology (EVM). Participants were randomly assigned to one of four conditions in a 2 (high vs. low LMX) × 2 (high vs. low CLMXD) between-subjects design. They were asked to vividly imagine themselves as protagonists in the scenarios and subsequently respond to a series of measures.
    Temporal and geographical scope: The data were collected in 2024 from MBA students enrolled in a Chinese university.
    Data processing methods: Responses were screened for data quality. Surveys with missing answers, multiple selections on single-choice questions, or indications of non-serious answering were excluded.
    Data structure: The dataset consists of 164 valid cases. Each row corresponds to one participant. Variables include the LMX manipulation check, the CLMXD manipulation check, and Perceived Overqualification (POQ). Demographic variables such as gender, age, education, and position are also included.
    Measurement and units: All variables were measured using 7-point Likert-type scales, from 1 (strongly disagree) to 7 (strongly agree).
    Missing data: All missing or invalid entries were excluded during the cleaning process; thus, the final dataset contains no missing data.
    Error handling: Data quality was ensured through attention to missing responses and response patterns. Only complete and reliable responses were retained.
    Data file details: The file is provided in .sav format (SPSS file), containing 164 rows and 34 columns.
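
    The studies above are distributed as SPSS .sav files; one way to read them outside SPSS (a hedged sketch, with a placeholder file name) is via pyreadstat, which also exposes the variable labels mentioned in the descriptions.

    ```python
    # Hedged sketch for loading an SPSS .sav file such as the ones described above;
    # "study1.sav" is a placeholder name, not the actual file name in this deposit.
    import pyreadstat

    df, meta = pyreadstat.read_sav("study1.sav")
    print(df.shape)                # should match the documented rows x columns
    print(meta.column_names[:5])   # variable names (English abbreviations per the description)
    print(meta.column_labels[:5])  # variable labels stored in the SPSS file
    ```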

  13. A survey result of experiment of non proportional reasoning

    • scidb.cn
    Updated Nov 28, 2024
    Cite
    Lior Cohen (2024). A survey result of experiment of non proportional reasoning [Dataset]. http://doi.org/10.57760/sciencedb.17541
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 28, 2024
    Dataset provided by
    Science Data Bank
    Authors
    Lior Cohen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Generation Procedures
    The data for this study were generated through a survey experiment designed to assess how participants evaluate stock risk based on nominal price changes. Participants were randomly assigned to one of four questionnaires, each presenting a hypothetical scenario involving a stock whose price had declined by 25%. The stocks were priced at $2,000, $200, $20, and $2. This setup allowed the researchers to observe how the nominal price influenced participants’ perceptions of risk and their likelihood to sell the stock. The survey included questions about participants’ risk assessments and their decisions regarding selling shares after being informed of the price decline. The data collection involved online survey tools, ensuring a diverse participant pool.

    Temporal and Geographical Scope
    The survey was conducted in September and December 2022. The geographical scope primarily encompasses participants from various regions in the USA.

    Tabular Data Description
    The survey results were compiled into tabular data with one entry per participant; the total number of data entries was 632. Each row represents an individual participant’s response, identified by Participant ID, while the columns hold the variables tested:

    • Respondent ID: the respondent’s randomly assigned ID number.
    • Gender: Male=0, Female=1.
    • Region: the location of the respondent: Middle Atlantic, West North Central, New England, South Atlantic, Mountain, West South Central, Pacific, East North Central, East South Central.
    • Age: age groups are divided into four, numbered from 1 to 4, respectively: 18-29, 30-44, 45-60, and 60 and older.
    • Stock Price (Q_2, Q_20, Q_200, Q_2000): participants were randomly assigned a questionnaire and asked about a stock priced at $2, $20, $200, or $2,000. The variable takes four values, from 1 to 4: (1) $2; (2) $20; (3) $200; (4) $2,000.
    • Income: reported annual income, divided into income groups from 1 to 8.
    • Risk appetite (self-assessment of risk-taking behavior): participants were asked how much of a risk taker they consider themselves to be on a scale of 1 (risk averse) to 10 (risk lover).
    • TraderD: participants were asked if they trade stocks or bonds and, if so, how often; the possible answers were, from 1 to 4, respectively: trade regularly (daily or weekly[1]), trade a few times a month, rarely (a few times a year at most), or have never traded before.
    • Follow news: people were asked if they follow stock-related news and stock or bond price changes, and if so, how often. The possible responses were divided into five: I never follow news related to stocks or stock indices=1; Occasionally, a few times a year=2; Sometimes, a few times a month=3; Often, a few times a week=4; Always, daily=5.
    • Risk: participants were asked how risky they think the stock is on a scale of 1 (not risky) to 7 (high risk).
    • Percent Sold: participants were asked how many shares they would sell due to the stock decline. Based on the answer, I calculated the percentage of stocks they sold out of their portfolio.
    • Sold: based on the same question as the previous variable, I constructed another variable with the value 1 for those who sold at least 10% of their portfolio and 0 otherwise. In some cases, respondents answered that they would purchase more stocks; in those few cases, I assumed their response was zero.

    Missing Data
    A few respondents skipped questions or provided incomplete answers. Where present, missing data were excluded from the analysis.

    Description of Each Data File
    The primary data file generated from this study is structured as an Excel file containing all participant responses. Content: the file includes columns for participant demographics, stock prices presented, risk assessments, selling decisions, and number of shares sold. Format: Excel. Size: 64 kB.

    [1] In some questionnaires I divided this answer into daily and weekly to allow more variability; however, given the number of “daily” answers, I combined the daily and weekly responses when constructing this variable.
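
    A tiny, hedged pandas illustration of the Sold recoding described in the variable list; the column names follow the list, and the 0-100 percentage scale is an assumption.

    ```python
    # Hedged sketch of deriving the binary `Sold` indicator from `Percent Sold`.
    import pandas as pd

    def add_sold_indicator(df: pd.DataFrame) -> pd.DataFrame:
        pct = df["Percent Sold"].clip(lower=0)   # "would buy more" responses treated as zero
        df["Sold"] = (pct >= 10).astype(int)     # 1 if at least 10% of the portfolio would be sold
        return df
    ```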

  14. Sicily and Calabria Extortion Database

    • search.gesis.org
    • da-ra.de
    Updated Nov 18, 2015
    Cite
    GLODERS Global Dynamics of Extortion Racket System FP7 Project (2015). Sicily and Calabria Extortion Database [Dataset]. https://search.gesis.org/research_data/SDN-10.7802-1116
    Explore at:
    Dataset updated
    Nov 18, 2015
    Dataset provided by
    GESIS search
    GESIS, Köln
    Authors
    GLODERS Global Dynamics of Extortion Racket System FP7 Project
    License

    GESIS data usage terms: https://www.gesis.org/en/institute/data-usage-terms

    Area covered
    Sicily, Calabria
    Description

    The Sicily and Calabria Extortion Database was extracted from police and court documents by the Palermo team of the GLODERS — Global Dynamics of Extortion Racket Systems — project which has received funding from the European Union Seventh Framework Programme (FP7/2007–2013) under grant agreement no. 315874 (http://www.gloders.eu, “Global dynamics of extortion racket systems”, https://cordis.europa.eu/project/id/315874).

    The data are provided as an SPSS file with variable names, variable labels, value labels where appropriate, missing value definitions where appropriate. Variable and value labels are given in English translation, string texts are quoted from the Italian originals as we thought that a translation could bias the information and that users of the data for secondary analysis will usually be able to read Italian.

    The rows of the SPSS file describe one extortion case each. The columns start with some technical information (unique case number, reference to the original source, region, and case number within the region, Sicily or Calabria). These are followed by information about when the cases happened, the pseudonym of the extorter, his role in the organisation, and the name and territory of the mafia family or mandamento he belongs to. Information about the victims, their affiliations and the type of enterprise they represent follows; the type of enterprise is coded according to the official Italian coding scheme (AtEco, which can be downloaded from http://www.istat.it/it/archivio/17888). The next group of variables describes the place where the extortion happened. The value labels for the numerical pseudonyms of extorters and victims (both persons and firms) are not contained in this file, hence the pseudonyms can only be used to analyse how often the same person or firm was involved in extortion.

    After this more or less technical information about the extortion cases, the cases are described materially. Most variables come in two forms: the original textual description of what happened and how it happened, and a recoded variable that lends itself better to quantitative analyses. The features described in these variables encompass

    • whether the extortion was only attempted (and unsuccessful from the point of view of the extorter) or completed, i.e. the victim actually paid,
    • whether the request was for a periodic or a one-off payment or both and what the amount was (the amounts of periodic and one-off amounts are not always comparable as some were only defined in terms of percentages of victim income or in terms of obligations the victim accepted to employ a relative of the extorter etc.),
    • whether there was an intimidation and whether it was directed to a person or to property,
    • whether the extortion request was brought forward by direct personal contact or by some indirect communication,
    • whether there was some negotiation between extorter and victim, and if so, what it was like, and whether a mediator interfered,
    • how the victim reacted: acquiescent, conniving or refusing,
    • how the law enforcement agencies got to know about the case (own observation, denunciation, etc.),
    • whether the extorter was caught, brought into investigative custody or finally sentenced (these variables contain a high percentage of missing data, partly because some cases are still under prosecution or before court, and partly as a consequence of incomplete documents).

  15. Inverse Model Results for Filchner-Ronne Catchment

    • zenodo.org
    bin, nc
    Updated Nov 30, 2023
    Cite
    Michael Wolovick; Michael Wolovick; Angelika Humbert; Angelika Humbert; Thomas Kleiner; Thomas Kleiner; Martin Rückamp; Martin Rückamp (2023). Inverse Model Results for Filchner-Ronne Catchment [Dataset]. http://doi.org/10.5281/zenodo.7798650
    Explore at:
    Available download formats: bin, nc
    Dataset updated
    Nov 30, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Michael Wolovick; Michael Wolovick; Angelika Humbert; Angelika Humbert; Thomas Kleiner; Thomas Kleiner; Martin Rückamp; Martin Rückamp
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This page contains the results of the inversions for basal drag and drag coefficient in the Filchner-Ronne catchment presented in Wolovick et al. (2023), along with the code used to perform the inversions and L-curves, analyze the results, and produce the figures presented in that paper.

    This all looks very complicated. There's so many files here. The description is so long. I just want to know the basal drag!

    If you don't want to get into the weeds of inverse modeling and L-curve analysis, or if you are uninterested in wading through our collection of model structures and scripts, then you should use the file BestCombinedDragEstimate.nc. That file contains our best weighted mean estimate of the ice sheet basal drag in our domain, along with the weighted standard deviation of the scatter of the different models about the mean. As discussed in the paper, this combined estimate is constructed from the weighted mean of 24 individual inversions, representing 8 separate L-curve experiments on our highest-resolution mesh, with three regularization values per L-curve (best estimate regularization, along with minimum and maximum acceptable regularization levels). Each inversion is weighted according to the inverse of its total variance ratio, which is a quality metric incorporating both observational misfit and inverted structure. For ease of use, these results have been interpolated from the unstructured model mesh onto a 250 m regular grid. If you only want to know the basal drag in the Filchner-Ronne region, that is the only file you should use.
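
    For readers who just want that file, here is a hedged sketch of opening it in Python with xarray; the variable names inside the netCDF are not documented here, so inspect the printout rather than trusting the commented guess.

    ```python
    # Hedged sketch: inspect the gridded combined drag estimate and its standard deviation.
    import xarray as xr

    ds = xr.open_dataset("BestCombinedDragEstimate.nc")
    print(ds)                          # lists coordinates (250 m grid) and data variables
    # drag = ds["basal_drag"]          # hypothetical variable name; check the printout first
    ```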

    For users who want to go further, we will now explain the remaining files in this release. First we give a brief summary of all of the scripts included here and their functions, and then we will give an explanation of the matfiles that contain the actual inversion and L-curve results. Note that the scripts presented here are the matlab scripts used to organize and set up model runs for ISSM. The Ice-Sheet and Sea-level System Model (ISSM) is a highly versatile parallelized finite-element ice sheet model run in C but controlled using Matlab or Python front-ends. We do not include the underlying code for ISSM here; users who are interested in installing ISSM should go to the ISSM home page. We merely include the Matlab scripts we used to organize our ISSM front-end, set up model structures, and then analyze and visualize results.

    Main Matlab scripts:

    These are the main functional scripts used to set up and run the model.

    • ISSMInversion_v3.m. This is the primary script we used to set up the inversions and perform L-curve analysis. It requires a model mesh as input along with some gridded data. It also produces an L-curve figure (figure 3) after performing the L-curve analysis. This script can be run in two modes: "setupandsend", which prepares model structures and sends them to the cluster to be solved, and "loadandanalyze", which loads the solutions from the cluster, saves them to matfiles, and performs analysis and visualization. In addition to L-curve analysis and the L-curve figure, this script can also produce a variety of additional figures of model output that we did not show in the paper.
    • MakeISSMMesh_v4.m. This is the script we used to make our model meshes. It requires a domain boundary as input along with some gridded data.
    • ModelBoundaryPicker_v1.m. This script opens a crude graphical interface for picking the domain outline.
    • OrganizeInversionsForRelease_v2.m. This script assembles L-curve and inverse model results and organizes them into the data release you see here. Note that it doesn't compute the combined drag estimate itself (that is done by CombinedDragFigure_v1.m), but it does interpolate the combined drag estimate from the model mesh to the grid, and it produces the output netcdf file.

    Note that the gridded data files needed by some of the above scripts are not included in our release here. Users interested in using these scripts for their own projects will need to provide their own gridded inputs, for instance from BedMachine or Measures.

    Figure-making scripts:

    These scripts produced almost all of the figures we presented in the paper, and also computed the statistics we presented in the tables in the paper.

    • CombinedDragFigure_v1.m. This script computes the combined drag estimate on the highest-resolution mesh, and makes a figure displaying it (Figure 12 in the paper).
    • InversionComparisonFigure_HOSSA_v1.m. This makes figure 11 in the paper and also computes the statistics shown in table 3.
    • InversionComparisonFigure_m_v1.m. This makes figures 9 and 10, and also computes the statistics shown in table 2.
    • InversionComparisonFigure_N_v1.m. This makes figure 8, and also computes the statistics shown in table 1.
    • InversionComparisonFigure_v1a.m. This makes figure 4 in the paper.
    • InversionResConvergenceFigure_v2.m. This makes figure 6.
    • InversionResMisfitFigure_v1.m. This makes figure 7.
    • InversionSettingFigure_v1.m. This makes figure 1.
    • InversionSpectrumFigure_v1.m. This performs spectral analysis and makes figure 5.
    • InversionThermalSettingFigure_v1.m. This makes figure A1.
    • MeshSizeFigure_v1.m. This makes figure A2.
    • NComparisonFigure_v1.m. This makes figure 2.

    Other utility Matlab functions:

    These miscellaneous functions do various tasks. Many of them are called as subroutines of the scripts above. Additionally, many of them are generally useful in contexts beyond the inverse modeling presented here.

    • FlattenModelStructure.m. ISSM has the unfortunate convention of saving every variable in 3D meshes on every single 3D mesh node, which is quite wasteful for variables that are actually 2D (i.e., most of the model variables). This function flattens all unnecessarily 3D information, but unlike the built-in ISSM function flatten.m, this script preserves the 3D geometry of the mesh, along with variables that actually are 3D (such as englacial temperature). This function can also be run in reverse to expand variables back to full 3D before calling solve().
    • intuitive_lowpass.m. This function low-pass filters a 1D dataset using a Gaussian filter. It has several options for handling boundary conditions at the endpoints.
    • LaplacianInterpolation.m. This function fills in missing data values for gridded data products by solving Poisson's equation (Laplacian = 0); a minimal gridded sketch of this idea appears after this list.
    • LaplacianInterpolation_mesh.m. This function does the same thing but on an unstructured mesh.
    • loadnetcdf.m. This function loads variables from netcdf files into the Matlab workspace using a similar syntax as load() for matfiles.
    • MultiWavelengthInterpolator.m. This function interpolates gridded data onto an unstructured mesh using a multi-grid approach. The grid is smoothed at multiple wavelengths and each mesh element interpolates from the wavelength that is appropriate for its size. This functionality is useful for preventing aliasing in coarse-resolution areas when interpolating onto a mesh with variable mesh size. It also produces results that are approximately (but not precisely) conservative.
    • ThreeByThree.m. This function iteratively performs a 3x3 smoothing on gridded data.
    • unpack.m. This function takes a structure and "unpacks" it by making every field into a variable in the workspace.
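    To make the gap-filling idea behind LaplacianInterpolation.m concrete, here is a minimal Python sketch (ours, not the release's Matlab code) that fills masked cells of a regular grid by relaxing toward Laplacian = 0 while holding the known cells fixed:

```python
import numpy as np

def laplacian_fill(grid: np.ndarray, missing: np.ndarray,
                   n_iter: int = 5000, tol: float = 1e-6) -> np.ndarray:
    """Fill cells where `missing` is True by Jacobi relaxation of Laplacian = 0."""
    filled = np.array(grid, dtype=float)
    filled[missing] = np.nanmean(grid[~missing])          # crude initial guess
    for _ in range(n_iter):
        padded = np.pad(filled, 1, mode="edge")           # replicate border values
        neighbours = 0.25 * (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                             padded[1:-1, :-2] + padded[1:-1, 2:])
        update = np.where(missing, neighbours, filled)    # known cells stay fixed
        if np.max(np.abs(update - filled)) < tol:
            return update
        filled = update
    return filled

# Example: a linear ramp with a rectangular hole is recovered almost exactly.
y, x = np.mgrid[0:50, 0:50]
field = x + 0.5 * y
hole = np.zeros_like(field, dtype=bool)
hole[20:30, 15:35] = True
print(np.max(np.abs(laplacian_fill(field, hole)[hole] - field[hole])))
```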

    Matfiles with L-curve data and model structures.

    The results of our L-curve analyses and our actual inversion results are stored in matfiles. We performed 21 experiments shown in the paper; for each one we performed an independent L-curve analysis using 25 individual inversions, for a total of 525 inversions. However, for this data release we simplify matters by only presenting 3 inversions per experiment, corresponding to the best regularization value (LambdaBest) and the maximum and minimum acceptable regularization values (LambdaMax and LambdaMin). In addition, for each experiment we also provide an LCurveFile that summarizes the L-curve analysis but does not contain any actual model results. In total, we present 84 matfiles in this data release.

    Naming convention:

    All matfiles presented here have the following naming convention:

    Mesh#_eqn_m#_Ntype_LambdaType.mat

    • Mesh#: this represents the mesh on which the inversions were performed, ranging from Mesh1 (highest resolution) to Mesh10 (lowest resolution).
    • eqn: this represents the type of equations solved in the inversion. Values are "SSA" or "HO".
    • m#: exponent in the sliding law. Values are m1, m3, and m5.
    • Ntype: effective pressure source in the sliding law. Values are "noN" (ie, Weertman sliding), "Nop", "Nopc", and "Ncuas".
    • LambdaType: values of this string are "LCurveFile" (for the file summarizing the whole L-curve experiment), "LambdaMin", "LambdaBest", and "LambdaMax".
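    The naming convention above can also be parsed programmatically. The short Python sketch below (not part of the release) extracts the experiment metadata from a filename; the example filename at the bottom is hypothetical, built from the convention.

```python
import re

# Regular expression assembled from the naming convention listed above.
NAME_RE = re.compile(
    r"^Mesh(?P<mesh>\d+)_(?P<eqn>SSA|HO)_m(?P<m>\d+)_"
    r"(?P<ntype>noN|Nop|Nopc|Ncuas)_"
    r"(?P<lambdatype>LCurveFile|LambdaMin|LambdaBest|LambdaMax)\.mat$"
)

def parse_matfile_name(filename: str) -> dict:
    """Return the experiment metadata encoded in a matfile name."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not follow the release naming convention")
    return match.groupdict()

print(parse_matfile_name("Mesh1_HO_m3_Ncuas_LambdaBest.mat"))
# {'mesh': '1', 'eqn': 'HO', 'm': '3', 'ntype': 'Ncuas', 'lambdatype': 'LambdaBest'}
```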

    Variables in the model files:

    Every file ending with "LambdaMin", "LambdaBest", or "LambdaMax" is a model file containing the same set of variables. Those variables are:

    • md. This is a model structure variable usable by any ISSM installation. Note that if you do not have ISSM installed on your machine, Matlab will not recognize class "model" and you will not be able to load this variable. The results of the inversion are stored in md.results.StressbalanceSolution. Other important things for the inversion, such as cost functions, cost function coefficients, and

  16. Data from: Common barriers, but temporal dissonance: genomic tests suggest...

    • data.niaid.nih.gov
    • dataone.org
    • +1more
    zip
    Updated Jan 15, 2020
    Cite
    Andrea Thomaz; L. Lacey Knowles (2020). Common barriers, but temporal dissonance: genomic tests suggest ecological and paleo-landscape sieves structure a coastal riverine fish community [Dataset]. http://doi.org/10.5061/dryad.zkh18936g
    Explore at:
    zip (available download formats)
    Dataset updated
    Jan 15, 2020
    Dataset provided by
    University of Michigan
    Authors
    Andrea Thomaz; L. Lacey Knowles
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Assessments of spatial and temporal congruence across taxa from genetic data provide insights into the extent to which similar processes structure communities. However, for coastal regions affected continuously by cyclical sea-level changes over the Pleistocene, a congruent interspecific response will depend not only on co-distributions but also on similar dispersal histories among taxa. Here, we use SNPs to test for concordant genetic structure among four co-distributed taxa of freshwater fishes (Teleostei: Characidae) along the Brazilian Atlantic coastal drainages. Based on population relationships and hierarchical genetic structure analyses, we find that all taxa share the same geographic structure, suggesting the fish utilized common passages in the past to move between river basins. In contrast to this strong spatial concordance, model-based estimates of divergence times indicate that, despite common routes for dispersal, these passages were traversed by each of the taxa at different times, resulting in varying degrees of genetic differentiation across barriers, with most divergences dating to the Upper Pleistocene, even when accounting for divergence with gene flow. Interestingly, when this temporal dissonance is viewed through the lens of the species-specific ecologies, it suggests that an ecological sieve influenced whether species dispersed readily, with an ecological generalist showing the highest propensity for historical dispersal among the isolated rivers of the Brazilian coast (i.e., the most recent divergence times and frequent gene flow estimated across barriers). We discuss what our findings, and in particular the temporal dissonance despite common geographic passages, suggest about past dispersal structuring coastal communities as a function of ecological and paleo-landscape sieves.

    Methods

    Six double-digest Restriction-site Associated DNA (ddRAD) libraries were constructed: three libraries containing 118 individuals of Mimagoniates microlepis for this study, two libraries containing 136 individuals of Hyphessobrycon boulengeri, and one library with 87 individuals of Bryconamericus. In addition, two libraries with 182 individuals of Hollandichthys were re-analyzed for this study (Thomaz et al., 2017). For all the libraries prepared specifically for this study, we followed the protocol of Peterson, Weber, Kay, Fisher, & Hoekstra (2012); the two previously sequenced libraries of Hollandichthys followed the Parchman et al. (2012) protocol (see Thomaz et al., 2017 for preparation details). Genomic data were demultiplexed and processed separately for each taxon with the STACKS version 1.41 pipeline (Catchen, Hohenlohe, Bassham, Amores, & Cresko, 2013). Because of the various requirements of the different analyses used to characterize the geographic structuring of genomic variation, three datasets were generated per taxon, varying in the amount of missing data and the number of individuals. One dataset comprised one random SNP per locus with a maximum of 50% missing data, and was used for estimates of population trees using SVDquartets in PAUP. Another dataset included loci with a maximum of 25% missing data after filtering (note that for M. microlepis we allowed 35% missing data) and was used, with a random single SNP per locus, in the STRUCTURE analysis. Separate datasets were used in the FASTSIMCOAL2 analyses; they were generated, when possible, from the 20 individuals with the smallest amount of missing data in each of the populations separated by a given geographic barrier for each taxon (40 individuals in total), using a single variable SNP per RADtag with less than 10% missing data. For all these datasets, individuals with considerably fewer SNPs than other individuals of the same population were excluded. All filtering steps were performed using the toolset PLINK v1.90 (Purcell et al., 2007); see Thomaz and Knowles (2020) for details on the methodology.
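    To make the dataset-construction logic concrete, the sketch below filters a genotype matrix by per-locus missingness, mirroring the 50%, 25%, and 10% thresholds described above. It is purely illustrative Python; the study itself performed these steps with STACKS and PLINK v1.90, not with this code.

```python
import numpy as np
import pandas as pd

def filter_loci_by_missingness(genotypes: pd.DataFrame, max_missing: float) -> pd.DataFrame:
    """Keep loci (columns) whose fraction of missing calls is at or below max_missing.

    genotypes: individuals x loci, with NaN marking missing genotype calls.
    """
    missing_fraction = genotypes.isna().mean(axis=0)
    return genotypes.loc[:, missing_fraction <= max_missing]

# Toy example with 4 individuals and 3 loci (0/1/2 allele counts, NaN = missing call).
toy = pd.DataFrame({
    "locus1": [0, 1, 2, np.nan],
    "locus2": [np.nan, np.nan, 1, 0],
    "locus3": [2, 2, 1, 1],
}, index=["ind1", "ind2", "ind3", "ind4"])

print(filter_loci_by_missingness(toy, max_missing=0.25).columns.tolist())
# ['locus1', 'locus3']  -- locus2 is dropped because half of its calls are missing
```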

  17. Data from: Impact of delayed response on Wearable Cognitive Assistance

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 2, 2021
    Cite
    Olguín Muñoz, Manuel; Klatzky, Roberta; Wang, Junjue; Padmanabhan, Pillai; Satyanarayanan, Mahadev; Gross, James (2021). Impact of delayed response on Wearable Cognitive Assistance [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4489265
    Explore at:
    Dataset updated
    Feb 2, 2021
    Dataset provided by
    School of Computer Science, Carnegie Mellon University
    Department of Psychology, Carnegie Mellon University
    Intel Labs Pittsburgh
    School of Electrical Engineering & Computer Science, KTH Royal Institute of Technology
    Authors
    Olguín Muñoz, Manuel; Klatzky, Roberta; Wang, Junjue; Padmanabhan, Pillai; Satyanarayanan, Mahadev; Gross, James
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the data associated with our research project titled Impact of delayed response on Wearable Cognitive Assistance. A preprint of the associated paper can be found at https://arxiv.org/abs/2011.02555.

    GENERAL INFORMATION

    1. Title of Dataset: Impact of delayed response on Wearable Cognitive Assistance

    2. Author Information

    First Author Contact Information
      Name: Manuel Olguín Muñoz
      Institution: School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology
      Address: Malvinas väg 10, Stockholm 11428, Sweden
      Email: molguin@kth.se
      Phone Number: +46 73 652 7628

    Author Contact Information
      Name: Roberta L. Klatzky
      Institution: Department of Psychology, Carnegie Mellon University
      Address: 5000 Forbes Ave, Pittsburgh, PA 15213
      Email: klatzky@cmu.edu
      Phone Number: +1 412 268 8026

    Author Contact Information
      Name: Mahadev Satyanarayanan
      Institution: School of Computer Science, Carnegie Mellon University
      Address: 5000 Forbes Ave, Pittsburgh, PA 15213
      Email: satya@cs.cmu.edu
      Phone Number: +1 412 268 3743

    Author Contact Information
      Name: James R. Gross
      Institution: School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology
      Address: Malvinas väg 10, Stockholm 11428, Sweden
      Email: jamesgr@kth.se
      Phone Number: +46 8 790 8819

    DATA & FILE OVERVIEW

    Directory of Files:

    A. Filename: accelerometer_data.csv
      Short description: Time-series accelerometer data. Each row corresponds to a sample.

    B. Filename: block_aggregate.csv
      Short description: Contains the block- and slice-level aggregates for each of the metrics and statistics present in this dataset. Each row corresponds to either a full block or a slice of a block, see below for details.
    
    
    C. Filename: block_metadata.csv
      Short description: Contains the metadata for each block in the task for each participant. Each row corresponds to a block.
    
    D. Filename: bvp_data.csv
      Short description: Time-series blood-volume-pulse data. Each row corresponds to a sample.
    
    
    E. Filename: eeg_data.csv
      Short description: Time-series electroencephalogram data, represented as power per band. Each row corresponds to a sample; power was calculated in 0.5 second intervals.
    
    
    F. Filename: frame_metadata.csv
      Short description: Contains the metadata for each video frame processed by the cognitive assistant. Each row corresponds to a processed frame.
    
    
    G. Filename: gsr_data.csv
      Short description: Time-series galvanic skin response data. Each row corresponds to a sample.
    
    
    H. Filename: task_step_metadata.csv
      Short description: Contains the metadata for each step in the task for each participant. Each row corresponds to a step in the task.
    
    
    I. Filename: temperature_data.csv
      Short description: Time-series thermometer data. Each row corresponds to a sample.
    

    Additional Notes on File Relationships, Context, or Content:

    • The data contained in these CSVs was obtained from 40 participants in a study performed with approval from the Carnegie Mellon University Institutional Review Board. In this study, participants were asked to interact with a Cognitive Assistant while wearing an array of physiological sensors. The data contained in this dataset corresponds to the actual collected data, after some preliminary preprocessing to convert sensor readings into meaningful values.

    • Participants have been anonymized using random integer identifiers.

    • block_aggregate.csv can be replicated by cross-referencing the start and end timestamps of each block in block_metadata.csv with the timestamps of each desired metric; a minimal pandas sketch of this cross-referencing is given after these notes.

    • The actual video frames mentioned in frame_metadata.csv are not included in the dataset since their contents were not relevant to the research.
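    As noted above, the block-level aggregates can be recomputed by slicing the time-series files with the block timestamps. The pandas sketch below (ours, not the authors' code) recomputes the movement score from accelerometer_data.csv using block_metadata.csv; it assumes the timestamp columns parse as datetimes.

```python
import numpy as np
import pandas as pd

# Recompute one block-level aggregate (movement_score) by cross-referencing the
# block start/end timestamps with the accelerometer time series.
blocks = pd.read_csv("block_metadata.csv", parse_dates=["start", "end"])
acc = pd.read_csv("accelerometer_data.csv", parse_dates=["timestamp"])

rows = []
for _, blk in blocks.iterrows():
    in_block = acc[(acc["participant"] == blk["participant"]) &
                   (acc["timestamp"] >= blk["start"]) &
                   (acc["timestamp"] < blk["end"])]
    duration_s = (blk["end"] - blk["start"]).total_seconds()
    magnitude = np.sqrt(in_block["x"]**2 + in_block["y"]**2 + in_block["z"]**2)
    rows.append({
        "participant": blk["participant"],
        "block_seq": blk["seq"],
        "movement_score": magnitude.sum() / duration_s,  # sum of |a| divided by duration
    })

print(pd.DataFrame(rows).head())
```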

    File Naming Convention: N/A

    DATA DESCRIPTION FOR: accelerometer_data.csv

    1. Number of variables: 7

    2. Number of cases/rows: 1844688

    3. Missing data codes: N/A

    4. Variable list:

      A. Name: timestamp Description: Timestamp of the sample.

      B. Name: x Description: Acceleration reading from the x-axis of the accelerometer in g-forces [g].

      C. Name: y Description: Acceleration reading from the y-axis of the accelerometer in g-forces [g].

      D. Name: z Description: Acceleration reading from the z-axis of the accelerometer in g-forces [g].

      E. Name: ts Description: Time difference with respect to first sample.

      F. Name: participant Description: Denotes the numeric ID representing each individual participant.

      G. Name: delay Description: Delay that was being applied on the task when this reading was obtained in time delta format.

    DATA DESCRIPTION FOR: block_aggregate.csv

    1. Number of variables: 16

    2. Number of cases/rows: 2520

    3. Missing data codes:

      • Except for the 'slice' column, empty cells mean that the data is not applicable or was removed from the dataset due to noise or instrument failure.
      • For the 'slice' column, a missing value indicates that the row corresponds to the whole block as opposed to a slice of it.
    4. Variable List:

      A. Name: participant Description: Denotes the numeric ID representing each individual participant.

      B. Name: block_seq Description: Denotes the position of the block in the task. Ranges from 1 to 21.

      C. Name: slice Description: Index of the 4-step slice of the block over which the data was aggregated. Ranges from 0 to 2, however higher values are only applicable for blocks of appropriate length (i.e. blocks of length 4 only have a 0-slice, length 8 have 0 and 1, and length 12 have slices from 0 to 2). A missing value indicates that this row instead contains aggregate values for the whole block.

      D. Name: block_length Description: Length of the block. Valid values are 4, 8 and 12.

      E. Name: block_delay Description: Delay applied to the block, in seconds.

      F. Name: start Description: Timestamp marking the start of the block or slice.

      G. Name: end Description: Timestamp marking the end of the block or slice.

      H. Name: duration Description: Duration of the block or slice, in seconds.

      I. Name: exec_time_per_step_mean Description: Mean execution time for each step in the block or slice.

      J. Name: bpm_mean Description: Mean heart rate, in beats-per-minute, for the block or slice.

      K. Name: bpm_std Description: Standard deviation of the heart rate, in beats-per-minute, for the block or slice.

      L. Name: gsr_per_second Description: Galvanic skin response in microsiemens, summed and then normalized by block or slice duration.

      M. Name: movement_score Description: Movement score for the block or slice. The movement score is calculated as the sum of the magnitude of all the acceleration vectors in the block or slice, divided by duration in seconds.

      N. Name: eeg_alpha_log_mean Description: Log of the average EEG power for the alpha band, for the block or slice.

      O. Name: eeg_beta_log_mean Description: Log of the average EEG power for the beta band, for the block or slice.

      P. Name: eeg_total_log_mean Description: Log of the average EEG power for the complete EEG signal, for the block or slice.

    DATA DESCRIPTION FOR: block_metadata.csv

    1. Number of variables: 8

    2. Number of cases/rows: 880

    3. Missing data codes: N/A

    4. Variable list:

      A. Name: participant Description: Denotes the numeric ID representing each individual participant.

      B. Name: seq Description: Index of the block in the task, ranging from 0 to 21. Note that block 0 is not to be included in aggregate calculations.

      C. Name: length Description: Length of the block in number of steps.

      D. Name: delay Description: Delay applied to the block.

      E. Name: start Description: Timestamp marking the start of the block.

      F. Name: end Description: Timestamp marking the end of the block.

      G. Name: duration Description: Duration of the block as a timedelta.

      H. Name: exec_time Description: Execution time of the block as a timedelta.

    DATA DESCRIPTION FOR: bvp_data.csv

    1. Number of variables: 8

    2. Number of cases/rows: 3683504

    3. Missing data codes: Columns bpm and ibi only contain values for rows corresponding to a sample taken at a heartbeat.

    4. Variable list:

      A. Name: ts Description: Time difference with respect to first sample.

      B. Name: timestamp Description: Timestamp of the sample.

      C. Name: bvp Description: Blood-volume-pulse reading, in millivolts.

      D. Name: onset Description: Boolean indicating if this sample corresponds to the onset of a pulse.

      E. Name: bpm Description: Instantaneous beat-per-minute value.

      F. Name: ibi Description: Instantaneous inter-beat-interval value.

      G. Name: delay Description: Delay that was being applied on the task when this reading was obtained in time delta format.

      H. Name: participant Description: Denotes the numeric ID representing each individual participant.

  18. A Missing Data Approach to Correct for Direct and Indirect Range...

    • plos.figshare.com
    zip
    Updated May 30, 2023
    Cite
    Andreas Pfaffel; Marlene Kollmayer; Barbara Schober; Christiane Spiel (2023). A Missing Data Approach to Correct for Direct and Indirect Range Restrictions with a Dichotomous Criterion: A Simulation Study [Dataset]. http://doi.org/10.1371/journal.pone.0152330
    Explore at:
    zip (available download formats)
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Andreas Pfaffel; Marlene Kollmayer; Barbara Schober; Christiane Spiel
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A recurring methodological problem in the evaluation of the predictive validity of selection methods is that the values of the criterion variable are available for selected applicants only. This so-called range restriction problem causes biased population estimates. Correction methods for direct and indirect range restriction scenarios have been widely studied for continuous criterion variables but not for dichotomous ones. The few existing approaches are inapplicable because they do not consider the unknown base rate of success. Hence, there is a lack of scientific research on suitable correction methods and the systematic analysis of their accuracies in the case of a naturally or artificially dichotomous criterion. We aim to overcome this deficiency by viewing the range restriction problem as a missing data mechanism. We used multiple imputation by chained equations to generate complete criterion data before estimating the predictive validity and the base rate of success. Monte Carlo simulations were conducted to investigate the accuracy of the proposed correction as a function of selection ratio, predictive validity, and base rate of success in an experimental design. In addition, we compared our proposed missing data approach with Thorndike’s well-known correction formulas that have only been used in the case of continuous criterion variables so far. The results show that the missing data approach is more accurate in estimating the predictive validity than Thorndike’s correction formulas. The accuracy of our proposed correction increases as the selection ratio and the correlation between predictor and criterion increase. Furthermore, the missing data approach provides a valid estimate of the unknown base rate of success. On the basis of our findings, we argue for the use of multiple imputation by chained equations in the evaluation of the predictive validity of selection methods when the criterion is dichotomous.
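    To illustrate the missing-data view of range restriction described above, here is a small Python sketch. It uses scikit-learn's IterativeImputer as a generic stand-in for multiple imputation by chained equations, with a simulated direct-selection scenario; it is not the authors' implementation or simulation design.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Simulate a selection scenario with a dichotomous criterion.
n = 5000
predictor = rng.normal(size=n)                        # selection test score
latent = 0.5 * predictor + rng.normal(size=n)         # latent performance
criterion = (latent > 0).astype(float)                # dichotomous success criterion

selected = predictor > np.quantile(predictor, 0.7)    # direct selection, 30% selection ratio
observed = np.where(selected, criterion, np.nan)      # criterion missing for rejected applicants

# Multiple imputations of the missing criterion values, then pooled estimates.
data = np.column_stack([predictor, observed])
estimates = []
for seed in range(20):
    imputed = IterativeImputer(random_state=seed, sample_posterior=True).fit_transform(data)
    crit_hat = np.clip(np.round(imputed[:, 1]), 0, 1)  # map back to a 0/1 criterion
    estimates.append((np.corrcoef(predictor, crit_hat)[0, 1], crit_hat.mean()))

validity, base_rate = np.mean(estimates, axis=0)
restricted = np.corrcoef(predictor[selected], criterion[selected])[0, 1]
print(f"restricted r = {restricted:.2f}, corrected r = {validity:.2f}, "
      f"estimated base rate = {base_rate:.2f}")
```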

  19. Retail Store Sales: Dirty for Data Cleaning

    • kaggle.com
    zip
    Updated Jan 18, 2025
    Cite
    Ahmed Mohamed (2025). Retail Store Sales: Dirty for Data Cleaning [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/retail-store-sales-dirty-for-data-cleaning
    Explore at:
    zip (226740 bytes), available download formats
    Dataset updated
    Jan 18, 2025
    Authors
    Ahmed Mohamed
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dirty Retail Store Sales Dataset

    Overview

    The Dirty Retail Store Sales dataset contains 12,575 rows of synthetic data representing sales transactions from a retail store. The dataset includes eight product categories with 25 items per category, each having static prices. It is designed to simulate real-world sales data, including intentional "dirtiness" such as missing or inconsistent values. This dataset is suitable for practicing data cleaning, exploratory data analysis (EDA), and feature engineering.

    File Information

    • File Name: retail_store_sales.csv
    • Number of Rows: 12,575
    • Number of Columns: 11

    Columns Description

    Column Name | Description | Example Values
    Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567
    Customer ID | A unique identifier for each customer. 25 unique customers. | CUST_01
    Category | The category of the purchased item. | Food, Furniture
    Item | The name of the purchased item. May contain missing values or None. | Item_1_FOOD, None
    Price Per Unit | The static price of a single unit of the item. May contain missing or None values. | 4.00, None
    Quantity | The quantity of the item purchased. May contain missing or None values. | 1, None
    Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, None
    Payment Method | The method of payment used. May contain missing or invalid values. | Cash, Credit Card
    Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Online
    Transaction Date | The date of the transaction. Always present and valid. | 2023-01-15
    Discount Applied | Indicates if a discount was applied to the transaction. May contain missing values. | True, False, None
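    Using the column names above, a minimal pandas cleaning pass might look like the sketch below; the repair rules are illustrative choices for practice, not rules shipped with the dataset.

```python
import pandas as pd

df = pd.read_csv("retail_store_sales.csv")

print(df.isna().sum())  # where is the dataset "dirty"?

# Total Spent is defined as Quantity * Price Per Unit, so any one of the three
# can be recovered whenever the other two are present.
fix_total = df["Total Spent"].isna() & df["Quantity"].notna() & df["Price Per Unit"].notna()
df.loc[fix_total, "Total Spent"] = df.loc[fix_total, "Quantity"] * df.loc[fix_total, "Price Per Unit"]

fix_price = df["Price Per Unit"].isna() & df["Quantity"].notna() & df["Total Spent"].notna()
df.loc[fix_price, "Price Per Unit"] = df.loc[fix_price, "Total Spent"] / df.loc[fix_price, "Quantity"]

# Flag rows where the recorded total still disagrees with Quantity * Price Per Unit.
inconsistent = (df["Total Spent"] - df["Quantity"] * df["Price Per Unit"]).abs() > 1e-9
print(f"{inconsistent.sum()} inconsistent rows remain")
```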

    Categories and Items

    The dataset includes the following categories, each containing 25 items with corresponding codes, names, and static prices:

    Electric Household Essentials

    Item Code | Item Name | Price
    Item_1_EHE | Blender | 5.0
    Item_2_EHE | Microwave | 6.5
    Item_3_EHE | Toaster | 8.0
    Item_4_EHE | Vacuum Cleaner | 9.5
    Item_5_EHE | Air Purifier | 11.0
    Item_6_EHE | Electric Kettle | 12.5
    Item_7_EHE | Rice Cooker | 14.0
    Item_8_EHE | Iron | 15.5
    Item_9_EHE | Ceiling Fan | 17.0
    Item_10_EHE | Table Fan | 18.5
    Item_11_EHE | Hair Dryer | 20.0
    Item_12_EHE | Heater | 21.5
    Item_13_EHE | Humidifier | 23.0
    Item_14_EHE | Dehumidifier | 24.5
    Item_15_EHE | Coffee Maker | 26.0
    Item_16_EHE | Portable AC | 27.5
    Item_17_EHE | Electric Stove | 29.0
    Item_18_EHE | Pressure Cooker | 30.5
    Item_19_EHE | Induction Cooktop | 32.0
    Item_20_EHE | Water Dispenser | 33.5
    Item_21_EHE | Hand Blender | 35.0
    Item_22_EHE | Mixer Grinder | 36.5
    Item_23_EHE | Sandwich Maker | 38.0
    Item_24_EHE | Air Fryer | 39.5
    Item_25_EHE | Juicer | 41.0

    Furniture

    Item Code | Item Name | Price
    Item_1_FUR | Office Chair | 5.0
    Item_2_FUR | Sofa | 6.5
    Item_3_FUR | Coffee Table | 8.0
    Item_4_FUR | Dining Table | 9.5
    Item_5_FUR | Bookshelf | 11.0
    Item_6_FUR | Bed F...
  20. Bank Loan Case Study Dataset

    • kaggle.com
    zip
    Updated May 4, 2023
    + more versions
    Cite
    Shreshth Vashisht (2023). Bank Loan Case Study Dataset [Dataset]. https://www.kaggle.com/datasets/shreshthvashisht/bank-loan-case-study-dataset/discussion
    Explore at:
    zip (117814223 bytes), available download formats
    Dataset updated
    May 4, 2023
    Authors
    Shreshth Vashisht
    Description

    This case study aims to give you an idea of applying EDA in a real business scenario. In this case study, apart from applying the techniques that you have learnt in the EDA module, you will also develop a basic understanding of risk analytics in banking and financial services and understand how data is used to minimize the risk of losing money while lending to customers.

    Business Understanding: Loan-providing companies find it hard to give loans to people with insufficient or non-existent credit history. Because of that, some consumers use this to their advantage by becoming defaulters. Suppose you work for a consumer finance company which specialises in lending various types of loans to urban customers. You have to use EDA to analyse the patterns present in the data. This will ensure that applicants capable of repaying the loan are not rejected.

    When the company receives a loan application, the company has to decide for loan approval based on the applicant’s profile. Two types of risks are associated with the bank’s decision:

    • If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company.
    • If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company.

    The data given below contains the information about the loan application at the time of applying for the loan. It contains two types of scenarios:

    • The client with payment difficulties: he/she had a late payment of more than X days on at least one of the first Y instalments of the loan in our sample.
    • All other cases: all other cases, when the payment is paid on time.

    When a client applies for a loan, there are four types of decisions that could be taken by the client/company:

    • Approved: The company has approved the loan application.
    • Cancelled: The client cancelled the application sometime during approval, either because the client changed her/his mind about the loan or, in some cases, because a higher-risk client received worse pricing which he did not want.
    • Refused: The company had rejected the loan (because the client does not meet their requirements, etc.).
    • Unused Offer: The loan has been cancelled by the client but at different stages of the process.

    In this case study, you will use EDA to understand how consumer attributes and loan attributes influence the tendency of default.

    Business Objectives: It aims to identify patterns which indicate if a client has difficulty paying their installments which may be used for taking actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc. This will ensure that the consumers capable of repaying the loan are not rejected. Identification of such applicants using EDA is the aim of this case study.

    In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default. The company can utilize this knowledge for its portfolio and risk assessment.

    To develop your understanding of the domain, you are advised to independently research a little about risk analytics (understanding the types of variables and their significance should be enough).

    Data Understanding: Download the dataset using the link given under the dataset section on the right.

    • application_data.csv contains all the information about the client at the time of application, including whether the client has payment difficulties.
    • previous_application.csv contains information about the client’s previous loan data, including whether the previous application had been Approved, Cancelled, Refused or Unused offer.
    • columns_descrption.csv is the data dictionary, which describes the meaning of the variables.

    You are required to provide a detailed report for the data above, answering the questions that follow:

    • Present the overall approach of the analysis. Mention the problem statement and the analysis approach briefly.
    • Identify the missing data and use an appropriate method to deal with it (remove columns or replace values with an appropriate value). Hint: in EDA it is not necessary to replace missing values, but if you have to replace them, clearly mention the approach you would take.
    • Identify if there are outliers in the dataset, and mention why you think each is an outlier. Again, remember that for this exercise it is not necessary to remove any data points.
    • Identify if there is data imbalance in the data, and find the ratio of data imbalance. Hint: since there are a lot of columns, you can run your analysis in loops for the appropriate columns and find the insights.
    • Explain the results of univariate, segmented univariate, bivariate analysis, etc. in business terms.
    • Find the top 10 c...
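    A starting point for the missing-data and imbalance checks listed above is sketched below in pandas. The target column name "TARGET" and the 50% drop threshold are assumptions made for illustration and should be checked against columns_descrption.csv.

```python
import pandas as pd

app = pd.read_csv("application_data.csv")

# 1. Missing data: percentage of missing values per column, worst first.
missing_pct = app.isna().mean().sort_values(ascending=False) * 100
print(missing_pct.head(10))

# One common (illustrative) rule: drop columns that are mostly empty, handle the rest later.
app = app.drop(columns=missing_pct[missing_pct > 50].index)

# 2. Data imbalance: ratio of clients without payment difficulties to those with them.
counts = app["TARGET"].value_counts()   # assumed coding: 0 = no difficulty, 1 = difficulty
print(counts)
print("imbalance ratio:", counts.max() / counts.min())
```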
